# Text Mining from "The Great Gatsby"

## Introduction
I decided to perform the text mining task on the novel "The Great Gatsby" by F. Scott Fitzgerald. I found a complete .txt file of the book on the site "Project Gutenberg".

## Top entities
At each step, entities are counted and sorted in descending order of frequency to select the most common ones. The full results appear after each code cell; in summary, these are the most common entities by category:
- people: Gatsby, Tom, Daisy, Jordan, Wilson ...
- places: New York, West Egg, East Egg, Oxford, ...
- organizations/other: the rest
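
The counting and sorting described above can be sketched with `collections.Counter`. This is a minimal illustration on made-up toy data, not the notebook's actual cells; the `mentions` list and the `top_entities` helper are hypothetical names introduced only for this example.

```python
from collections import Counter

# Hypothetical toy input: (entity text, label) pairs as an NER step might emit them.
mentions = [
    ("Gatsby", "PERSON"), ("Tom", "PERSON"), ("Gatsby", "PERSON"),
    ("New York", "GPE"), ("Gatsby", "PERSON"), ("Tom", "PERSON"),
]

def top_entities(pairs, n=10):
    """Return the n most frequent (entity, label) pairs with their counts."""
    return Counter(pairs).most_common(n)

top = top_entities(mentions, n=2)
for (name, label), count in top:
    print("{} --> {} ({})".format(count, name, label))
```

`most_common` already returns the items in descending order of count, so no separate `sorted` call is needed.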

## Results comparison
I selected a couple of sentences for a precise comparison; more extensive results are available by running the notebook.
The nltk-based classification recognized only these 5 entities in the selected sentences, so I'll compare the approaches on them:
- East Egg (it's a place)
- Tom Buchanans (it's a character of the novel)
- Daisy (another character)
- Tom (same character as before but just the name)
- Chicago (a place)

### nltk-based classification
- East Egg (LOCATION)
- Tom Buchanans (ORGANIZATION)
- Daisy (PERSON)
- Tom (PERSON)
- Chicago (GPE)

### wikipedia-based classification
- East Egg = a Thing (I expected it to be recognized, since it is a very specific name that exists only in the book)
- Tom Buchanans = a Thing (also very specific, I expected it to be recognized)
- Daisy = A day (wrong)
- Tom = a Thing (I expected this one to be easy, since it is a very common first name)
- Chicago = shih-KAH-goh (correct page, but the extracted summary is wrong: it returned the pronunciation)

### language model classification
- East Egg = location / miscellaneous ("East" is tagged as a location, but the full name "East Egg" is not recognized as a single entity)
- Tom Buchanans = person (correct)
- Daisy = person (correct)
- Tom = person (correct)
- Chicago = location (correct)
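
To make the comparison above less subjective, the labels can be scored against the expected categories. The sketch below is only an illustration: the `expected` categories come from the list at the top of this section, while the `NORMALIZE` mapping (GPE counted as a location) and the strict treatment of the mixed "location / miscellaneous" answer as a miss are my own assumptions. The wikipedia-based output is left out because it is free text rather than a label.

```python
# Expected category for each entity, taken from the list at the top of this section.
expected = {
    "East Egg": "location",
    "Tom Buchanans": "person",
    "Daisy": "person",
    "Tom": "person",
    "Chicago": "location",
}

# Labels reported above by two of the approaches.
nltk_labels = {
    "East Egg": "LOCATION", "Tom Buchanans": "ORGANIZATION",
    "Daisy": "PERSON", "Tom": "PERSON", "Chicago": "GPE",
}
lm_labels = {
    "East Egg": "location / miscellaneous", "Tom Buchanans": "person",
    "Daisy": "person", "Tom": "person", "Chicago": "location",
}

# Assumption: a geo-political entity (GPE) counts as a location.
NORMALIZE = {"gpe": "location"}

def normalize(label):
    key = label.strip().lower()
    return NORMALIZE.get(key, key)

def score(labels):
    """Return (number of matching labels, number of entities)."""
    hits = sum(1 for name, want in expected.items()
               if normalize(labels[name]) == want)
    return hits, len(expected)

print("nltk:", score(nltk_labels))
print("language model:", score(lm_labels))
```

Under these assumptions each approach matches 4 of the 5 expected categories, but they miss different entities: nltk mislabels "Tom Buchanans", while the language model splits "East Egg".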


### To sum up:
- nltk-based classification: it worked well. In the top-20 list all the characters are classified as people and almost all of the places as locations; in the comparison sentences only "Tom Buchanans" was mislabeled (as an organization).
- wikipedia-based classification using nltk entities as the input: the results are definitely not satisfying. My hypothesis is that when a search term matches multiple Wikipedia pages (for example, "Gatsby" returns 3 pages), the library lookup fails, the code catches the error, and the entity is classified as 'a Thing'.
- wikipedia-based classification using custom patterns as the input: basically the same as the other wikipedia-based approach.
- language model classification: I chose BERT (the `dslim/bert-base-NER` model) and it performed really well, but it is more computationally expensive than the other approaches.
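
If the hypothesis about ambiguous titles is right, one possible mitigation is to retry with the candidate titles instead of giving up immediately. The sketch below shows the idea without any network access: `AmbiguousTitle`, `fake_fetch` and `FAKE_PAGES` are hypothetical stand-ins I made up for the example. In the real notebook the role of `AmbiguousTitle` would be played by the `wikipedia` package's `DisambiguationError`, which exposes the candidate titles as `.options`.

```python
class AmbiguousTitle(Exception):
    """Stand-in for an 'ambiguous title' error that carries candidate titles."""
    def __init__(self, title, options):
        super().__init__(title)
        self.options = options

def summary_with_fallback(name, fetch_summary, default="a Thing"):
    """Look up name; on an ambiguous title, try each candidate in order."""
    try:
        return fetch_summary(name)
    except AmbiguousTitle as err:
        for option in err.options:
            try:
                return fetch_summary(option)
            except Exception:
                continue  # this candidate failed too, try the next one
    except Exception:
        pass  # missing page or any other failure
    return default

# Toy lookup table used instead of a real Wikipedia call.
FAKE_PAGES = {"Jay Gatsby": "the titular fictional character"}

def fake_fetch(title):
    if title == "Gatsby":
        raise AmbiguousTitle(title, ["Gatsby (film)", "Jay Gatsby"])
    if title not in FAKE_PAGES:
        raise KeyError(title)
    return FAKE_PAGES[title]

print(summary_with_fallback("Gatsby", fake_fetch))
print(summary_with_fallback("Wolfsheim", fake_fetch))
```

Taking the first candidate that resolves is a crude choice: for a name like "Tom" it could easily pick an unrelated page, so the result would still need to be checked.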


## Implementation

In [1]:
import pandas as pd
import nltk
from collections import Counter
import string
import wikipedia
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

### 1 ) Data Pre-Processing

In [2]:
text = ""
with open('the_great_gatsby.txt', 'r') as file:
    lines = file.readlines()
for line in lines:
    # Delete all of the "Chapter ..."
    if 'Chapter' in line:
        continue
    else:
        text = text + line.replace('"', "").replace('``','').replace('--','')

# I selected a couple of sentences for the comparison task (lines 141-146 of the txt file)
text_for_comparation = "Across the courtesy bay the white palaces of fashionable East Egg glittered along the water, and the history of the summer really begins on the evening I drove over there to have dinner with Tom Buchanans. Daisy was my second cousin once removed and I'd known Tom in college. And just after the war I spent two days with them in Chicago"
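
The cell above drops every line that contains the word 'Chapter', which would also drop a narrative line that happens to use that word. A stricter filter could match only heading-like lines. This is a sketch under an assumption about the file layout (a heading is a line made only of "Chapter" plus an Arabic or Roman numeral); `HEADING`, `strip_headings` and the `sample` lines are hypothetical and not taken from the actual txt file.

```python
import re

# Assumption: a heading is a line consisting only of "Chapter" followed by
# an Arabic number or a Roman numeral, optionally ending with a period.
HEADING = re.compile(r"^\s*chapter\s+([0-9]+|[ivxlcdm]+)\s*\.?\s*$", re.IGNORECASE)

def strip_headings(lines):
    """Keep every line that is not a chapter heading."""
    return [line for line in lines if not HEADING.match(line)]

# Made-up sample lines for illustration.
sample = [
    "Chapter 1\n",
    "In my younger and more vulnerable years\n",
    "It was the last Chapter of his life.\n",
    "CHAPTER IX\n",
]
kept = strip_headings(sample)
print(kept)
```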


### 2 ) POS Tagging

In [44]:
# Performing POS on the original text
sentences = nltk.sent_tokenize(text)
tokens = [nltk.word_tokenize(sent) for sent in sentences]
tagged = [nltk.pos_tag(sent) for sent in tokens]
print(tagged[:40])
print("\n"+"-"*100+"\n")

# Counting the tags and printing the most common ones
pos = [item for sublist in tagged for item in sublist]
count = Counter(pos)
sort_pos = sorted(count.items(), key=lambda count:count[1], reverse=True)
print('POS Top 10')
print(sort_pos[:10])
print("\n"+"-"*100+"\n")

# Filtering the original text and doing POS
flattened_tokens = [item for sublist in tokens for item in sublist]
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(nltk.corpus.stopwords.words('english'))
filtered_tokens = [token for token in flattened_tokens
                       if token not in string.punctuation
                       and token not in stop_words
                       and "'" not in token]
tagged = nltk.pos_tag(filtered_tokens)

# Counting again
count = Counter(tagged)
sort_tagged = sorted(count.items(), key=lambda count:count[1], reverse=True)
print('POS Top 10 filtered')
print(sort_tagged[:10])


[[('In', 'IN'), ('my', 'PRP$'), ('younger', 'JJR'), ('and', 'CC'), ('more', 'RBR'), ('vulnerable', 'JJ'), ('years', 'NNS'), ('my', 'PRP$'), ('father', 'NN'), ('gave', 'VBD'), ('me', 'PRP'), ('some', 'DT'), ('advice', 'NN'), ('that', 'IN'), ('I', 'PRP'), ("'ve", 'VBP'), ('been', 'VBN'), ('turning', 'VBG'), ('over', 'IN'), ('in', 'IN'), ('my', 'PRP$'), ('mind', 'NN'), ('ever', 'RB'), ('since', 'IN'), ('.', '.')], [('Whenever', 'WRB'), ('you', 'PRP'), ('feel', 'VBP'), ('like', 'IN'), ('criticizing', 'VBG'), ('any', 'DT'), ('one', 'CD'), (',', ','), ('he', 'PRP'), ('told', 'VBD'), ('me', 'PRP'), (',', ','), ('just', 'RB'), ('remember', 'VB'), ('that', 'IN'), ('all', 'PDT'), ('the', 'DT'), ('people', 'NNS'), ('in', 'IN'), ('this', 'DT'), ('world', 'NN'), ('have', 'VBP'), ("n't", 'RB'), ('had', 'VBD'), ('the', 'DT'), ('advantages', 'NNS'), ('that', 'IN'), ('you', 'PRP'), ("'ve", 'VBP'), ('had', 'VBN'), ('.', '.')], [('He', 'PRP'), ('did', 'VBD'), ("n't", 'RB'), ('say', 'VB'), ('any', 'DT'), 

### 3 ) NER with entity classification (using nltk.ne_chunk)

In [13]:
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
ne_chunked = nltk.ne_chunk(tagged)

ner = {}
for entity in ne_chunked:
    if isinstance(entity, nltk.tree.Tree):
        t = " ".join([word for word, tag in entity.leaves()])
        ent = entity.label()
        if t not in ner:
            ner[t] = [ent, 0]
        ner[t][1] += 1
    else:
        continue

sort_ner = sorted(ner.items(), key=lambda entity: entity[1][1], reverse=True)
print('NER Top 20')
print("count --> entity\n"+"-"*80)
for s in sort_ner[:20]:
    print("{} --> {}".format(s[1][1], s[0]+" ("+s[1][0]+")"))


print("\n"+"-"*80)    
print("\nSentence for comparation:")
tokens = nltk.word_tokenize(text_for_comparation)
tagged = nltk.pos_tag(tokens)
ne_chunked_bis = nltk.ne_chunk(tagged)
for entity in ne_chunked_bis:
    if isinstance(entity, nltk.tree.Tree):
        t = " ".join([word for word, tag in entity.leaves()])
        ent = entity.label()
        print(t+" ("+ent+")")
    else:
        continue
    

NER Top 20
count --> entity
--------------------------------------------------------------------------------
212 --> Gatsby (PERSON)
163 --> Tom (PERSON)
162 --> Daisy (PERSON)
67 --> Wilson (PERSON)
57 --> Jordan (PERSON)
30 --> New York (GPE)
25 --> Miss Baker (PERSON)
22 --> Mr. Gatsby (PERSON)
22 --> Michaelis (PERSON)
21 --> Nick (PERSON)
20 --> West Egg (LOCATION)
19 --> Chicago (GPE)
17 --> Tom Buchanan (PERSON)
16 --> Myrtle (PERSON)
16 --> Mr. Wolfsheim (PERSON)
14 --> Oxford (GPE)
13 --> Catherine (PERSON)
11 --> Jordan Baker (PERSON)
10 --> East (LOCATION)
10 --> Wolfsheim (PERSON)

--------------------------------------------------------------------------------

Sentence for comparation:
East Egg (LOCATION)
Tom Buchanans (ORGANIZATION)
Daisy (PERSON)
Tom (PERSON)
Chicago (GPE)


### 4 ) NER with custom patterns

In [46]:
sentences = nltk.sent_tokenize(text)
tokens = [nltk.word_tokenize(sent) for sent in sentences]
tagged = [nltk.pos_tag(sent) for sent in tokens]
tagged_entities = {}
entity = []

adjectives = ['JJ']
nouns = ['NNP', 'NNPS'] # ['NN', 'NNS', 'NNP', 'NNPS']

for sentence in tagged:
    for word in sentence:
        word_txt = word[0]
        word_tag = word[1]
        if not entity: 
            if word_tag in adjectives: # match adjectives
                entity.append(word)
                continue # keep building the entity

            elif word_tag in nouns:  # match proper nouns
                entity.append(word)
                entity_str = " ".join([word[0] for word in entity])
                if entity_str not in tagged_entities:
                    tagged_entities[entity_str] = [entity, 0]
                tagged_entities[entity_str][1] += 1
                # entity complete, look for the next one

        else:  # if the entity isn't empty look at the last item 
            if entity[-1][1] in adjectives: # match adjectives 
                if word_tag == 'JJ': 
                    entity.append(word)
                    continue # keep building the entity

                elif word_tag in nouns:  # match proper nouns
                    entity.append(word)
                    entity_str = " ".join([word[0] for word in entity])
                    if entity_str not in tagged_entities:
                        tagged_entities[entity_str] = [entity, 0]
                    tagged_entities[entity_str][1] += 1
                    # entity complete, look for the next one

        entity = []

sorted_custom = sorted(tagged_entities.items(), key=lambda entity: entity[1][1], reverse=True)
for e in sorted_custom[:20]:
    print("{} --> {}".format(e[1][1], e[0]+" ("+e[1][0][0][1]+")"))


248 --> Gatsby (NNP)
184 --> Tom (NNP)
167 --> Daisy (NNP)
83 --> Mr. (NNP)
75 --> Wilson (NNP)
64 --> Jordan (NNP)
44 --> New (NNP)
39 --> Baker (NNP)
34 --> Miss (NNP)
31 --> Mrs. (NNP)
30 --> West (NNP)
30 --> York (NNP)
30 --> Wolfsheim (NNP)
29 --> Egg (NNP)
22 --> Buchanan (NNP)
22 --> Myrtle (NNP)
21 --> Michaelis (NNP)
20 --> Nick (NNP)
19 --> God (NNP)
18 --> Chicago (NNP)
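
The output above splits multi-word names ("New" and "York", "West" and "Egg", "Mr." and "Gatsby") because the pattern closes an entity after a single proper noun. One possible variation is to group consecutive proper nouns into one entity. The sketch below works on hand-tagged tokens so that it runs without nltk; `merge_proper_nouns` and the sample sentence are hypothetical and are not part of the notebook.

```python
def merge_proper_nouns(tagged_sentence, noun_tags=("NNP", "NNPS")):
    """Group runs of consecutive proper-noun tokens into single entity strings."""
    entities, current = [], []
    for word, tag in tagged_sentence:
        if tag in noun_tags:
            current.append(word)          # keep extending the current run
        elif current:
            entities.append(" ".join(current))
            current = []                  # run ended, start looking for the next one
    if current:                           # flush a run that ends the sentence
        entities.append(" ".join(current))
    return entities

# Tags written by hand for illustration, not produced by nltk.pos_tag.
sentence = [
    ("Mr.", "NNP"), ("Gatsby", "NNP"), ("drove", "VBD"), ("to", "TO"),
    ("New", "NNP"), ("York", "NNP"), (".", "."),
]
print(merge_proper_nouns(sentence))
```

The trade-off is that two distinct names that happen to be adjacent would be merged into one entity.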


### 5 ) Custom entity classification

In [15]:

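def wikipedia_text(name):
    # Any lookup failure (ambiguous title, missing page, network error)
    # falls back to the generic label
    try:
        page = wikipedia.page(name)
    except Exception:
        return "a Thing"

    tagged_tokens = nltk.pos_tag(nltk.word_tokenize(page.summary))

    # Noun phrase: optional determiner, any number of adjectives, one singular noun
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tagged_tokens)

    # Return the first noun phrase found in the summary
    for entity in result:
        if isinstance(entity, nltk.tree.Tree):
            return " ".join([word for word, tag in entity.leaves()])

    # No noun phrase in the summary: use the same fallback instead of returning None
    return "a Thing"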

#### 5.1 ) wikipedia classification with step n°3 entities

In [48]:
# with the most common words from the whole text
for e in sort_ner[:20]:
    print(e[0], ' - ', wikipedia_text(e[0]))


Gatsby  -  a Thing
Tom  -  a Thing
Daisy  -  A day
Wilson  -  a Thing
Jordan  -  الأردن
New York  -  a Thing
Miss Baker  -  a squirrel monkey
Mr. Gatsby  -  the titular fictional character
Michaelis  -  a Thing
Nick  -  a Thing
West Egg  -  a Thing
Chicago  -  shih-KAH-goh
Tom Buchanan  -  novel
Myrtle  -  a Thing
Mr. Wolfsheim  -  novel
Oxford  -  a city
Catherine  -  derivation
Jordan Baker  -  a Thing
East  -  the compass
Wolfsheim  -  a Thing


In [28]:
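# just the couple of sentences for comparison (every token is looked up, not only the entities)
for e in ne_chunked_bis:
    if isinstance(e, nltk.tree.Tree):
        # named-entity chunk: join all of its words
        x = " ".join(word for word, tag in e.leaves())
    else:
        # plain (word, tag) token
        x = e[0]
    print(x, ' - ', wikipedia_text(x))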

Across  -  a Thing
the  -  a grammatical article
courtesy  -  the word
bay  -  A day
the  -  a grammatical article
white  -  a Thing
palaces  -  A palace
of  -  scale
fashionable  -  Fashion
East Egg  -  a Thing
glittered  -  a Thing
along  -  a Thing
the  -  a grammatical article
water  -  a Thing
,  -  The comma
and  -  a Thing
the  -  a grammatical article
history  -  historía
of  -  scale
the  -  a grammatical article
summer  -  Summer
really  -  a Thing
begins  -  بگعان
on  -  a Thing
the  -  a grammatical article
evening  -  day
I  -  a Thing
drove  -  a Thing
over  -  a Thing
there  -  a third-person pronoun
to  -  a Thing
have  -  a Thing
dinner  -  a Thing
with  -  a Thing
Tom Buchanans  -  a Thing
.  -  The full stop
Daisy  -  A day
was  -  a Thing
my  -  m
second  -  symbol
cousin  -  the lineal
once  -  unit
removed  -  a Thing
and  -  a Thing
I  -  a Thing
'd  -  uppercase
known  -  a form
Tom  -  a Thing
in  -  a Thing
college  -  A college
.  -  The full stop
And  -  a T

#### 5.2 ) wikipedia classification with step n°4 entities

In [49]:
# whole text
for e in sorted_custom[:20]:
    print(e[0], ' - ', wikipedia_text(e[0]))

Gatsby  -  a Thing
Tom  -  a Thing
Daisy  -  A day
Mr.  -  a Thing
Wilson  -  a Thing
Jordan  -  الأردن
New  -  a Thing
Baker  -  a Thing
Miss  -  an intrinsic property
Mrs.  -  contracted form
West  -  a Thing
York  -  a cathedral city
Wolfsheim  -  a Thing
Egg  -  An egg
Buchanan  -  a Thing
Myrtle  -  a Thing
Michaelis  -  a Thing
Nick  -  a Thing
God  -  a chemical element
Chicago  -  shih-KAH-goh


### 6 ) Language model classification

In [50]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

count = Counter(filtered_tokens)
sort_filtered_tokens = sorted(count.items(), key=lambda count:count[1], reverse=True)
ner_input = [t[0] for t in sort_filtered_tokens]
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp(ner_input)
ner_results = [n for n in ner_results if n]

In [51]:
# I select just a subset for a clear output (of course it could be done with every entity)
res = ner_results[:200]

# Group by entity type (labels look like B-PER, I-LOC, ...)
people = [n[0]['word'] for n in res if "PER" in n[0]['entity']]
organizations = [n[0]['word'] for n in res if "ORG" in n[0]['entity']]
locations = [n[0]['word'] for n in res if "LOC" in n[0]['entity']]
misc = [n[0]['word'] for n in res if "MISC" in n[0]['entity']]
# Exact match here: "O" is also a substring of B-ORG, I-ORG, B-LOC and I-LOC
others = [n[0]['word'] for n in res if n[0]['entity'] == "O"]

# Print results
print("Category: PERSON")
print(people)

print("Category: ORGANIZATION")
print(organizations)

print("Category: LOCATION")
print(locations)

print("Category: MISCELLANEOUS")
print(misc)

print("Category: O")
print(others)

Category: PERSON
['G', 'Tom', 'Daisy', 'Wilson', 'Baker', 'Wolf', 'Buchanan', 'Nick', 'Myrtle', 'Michael', 'God', 'M', 'Catherine', 'Long', 'Cody', 'George', 'Jay', 'Carr', 'Sloane', 'Dan', 'K', 'Meyer', 'J', 'James', 'T', 'Luc', 'Tell', 'Ba', 'Jimmy', 'B', 'Roosevelt', 'Hill', 'R', 'O', 'Walter', 'Buchanan', 'C', 'Barbara', 'Will', 'Ella', 'Kaye', 'M', 'Simon', 'Carlo', 'To', 'Edgar', 'P', 'Cecil', 'Earl', 'Mu', 'Fe', 'Henry', 'A', 'Kat', 'Taylor', 'Con', 'Fe', 'Ewing', 'G', '##G', 'Mid', 'Morgan', 'Mae', 'De', 'York', 'John', 'D', 'Peter', 'Peter', 'Kaiser', 'Wilhelm', 'Dai']
Category: ORGANIZATION
['Jordan', 'York', 'Oxford', 'G', 'Louisville', 'THE', 'Yale', 'Avenue', 'Park', 'Chester', 'Metro', 'College', 'Rise', 'Star', 'bureau', 'Point', 'Monte', 'Bel', 'Legion', 'Port', 'Detroit', 'T', 'Southampton', 'No', 'Dukes', 'B', 'Dodge', 'News', 'Columbus', 'H', 'Ville', 'Colonial', 'Society', 'D', 'Goddard', 'C', 'Post', 'J', 'Springs', 'Palm', 'Queens', 'GE', 'Ford', 'Rockefeller', 'E
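
The lists above contain fragments such as 'G' and '##G'. These look like WordPiece sub-tokens: the BERT tokenizer splits unfamiliar words into pieces and marks continuation pieces with a leading `##`, and the code keeps only the first piece of each input (`n[0]`). A possible post-processing step is to glue the pieces back together. The sketch below uses hypothetical pipeline-style dictionaries so that it runs without downloading the model; `merge_wordpieces` and the `pieces` data are made up for illustration.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces back onto the preceding token."""
    merged = []
    for tok in tokens:
        word = tok["word"]
        if word.startswith("##") and merged:
            merged[-1]["word"] += word[2:]   # append the piece without the marker
        else:
            # start a new word and keep the label of its first piece
            merged.append({"word": word, "entity": tok["entity"]})
    return merged

# Hypothetical output for one word split into three pieces, plus a whole word.
pieces = [
    {"word": "G", "entity": "B-PER"},
    {"word": "##ats", "entity": "I-PER"},
    {"word": "##by", "entity": "I-PER"},
    {"word": "Chicago", "entity": "B-LOC"},
]
print(merge_wordpieces(pieces))
```

The `transformers` pipeline also accepts an `aggregation_strategy` argument that is intended for this kind of grouping; I did not use it in this notebook.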

## Comments
I didn't run into particular issues during this task, except that I was expecting better results, especially from the wikipedia-based classifier.
A possible extension would be to classify various books by genre based on their entities, focusing especially on the "other" category (for example, if the verbs "fight", "die" and "suffer" are common, the book could be a tragedy). Another would be to use a more advanced NER classifier that can distinguish more than just 4 classes.
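
The genre idea could start from something as simple as counting cue words. The sketch below is only a toy: the tragedy verbs are the ones mentioned above, while the "romance" cue list, the `GENRE_CUES` table and the `guess_genre` helper are hypothetical and were invented for the example. A real version would at least need lemmatization so that "died" and "dies" count as "die".

```python
from collections import Counter

# Hypothetical cue lists: only the tragedy verbs come from the text above.
GENRE_CUES = {
    "tragedy": {"fight", "die", "suffer"},
    "romance": {"love", "kiss", "marry"},
}

def guess_genre(tokens):
    """Return the genre whose cue words occur most often, or None if none occur."""
    counts = Counter(t.lower() for t in tokens)
    scores = {genre: sum(counts[w] for w in cues)
              for genre, cues in GENRE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_genre(["They", "fight", "and", "die", "for", "love"]))
```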