# Assignment 7
## Roll Number `31311`
### Text Analytics
- Extract Sample document and apply following document preprocessing methods: 
    - Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
- Create representation of document by calculating: 
    - Term Frequency  
    - Inverse Document Frequency.

In [2]:
import spacy
import pandas as pd
from nltk.stem import PorterStemmer
nlp = spacy.load('en_core_web_sm')

In [3]:
df = pd.read_csv('./IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
doc = []
for x in df['review']:
    doc.append(nlp(x))
df['doc'] = doc
df.head()

Unnamed: 0,review,sentiment,doc
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione..."
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil..."
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,..."
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o..."


## Tokenisation
We will take our `doc` and print all the tokens from it.

In [5]:
tokens = [token.text for token in doc]
df['tokens'] = tokens
df.head()

Unnamed: 0,review,sentiment,doc,tokens
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione...",One of the other reviewers has mentioned that ...
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil...",A wonderful little production. The filming tec...
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,...",I thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li...",Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o...","Petter Mattei's ""Love in the Time of Money"" is..."


### Part Of Speech Tagging
Tagging each token with the appropriate part of speech i.e., `noun`, `verb`, `pronoun` etc.

In [6]:
poses = []

for doc_nlp in df['doc']:
    poses.append(tuple(token.pos_ for token in doc_nlp))

df['poses'] = poses
df.head()

Unnamed: 0,review,sentiment,doc,tokens,poses
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione...",One of the other reviewers has mentioned that ...,"(NUM, ADP, DET, ADJ, NOUN, AUX, VERB, SCONJ, A..."
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil...",A wonderful little production. The filming tec...,"(DET, ADJ, ADJ, NOUN, PUNCT, DET, NOUN, NOUN, ..."
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,...",I thought this was a wonderful way to spend ti...,"(PRON, VERB, PRON, AUX, DET, ADJ, NOUN, PART, ..."
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li...",Basically there's a family where a little boy ...,"(ADV, PRON, VERB, DET, NOUN, SCONJ, DET, ADJ, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o...","Petter Mattei's ""Love in the Time of Money"" is...","(PROPN, PROPN, PART, PUNCT, NOUN, ADP, DET, PR..."


### Stop words removal
spaCy provides a default list of stop words for various languages, including English, French, German, Spanish, and more.

In [7]:
filtered_tokens = []
for doc_nlp in df['doc']:
    filtered_tokens.append(tuple(token.text for token in doc_nlp if not token.is_stop))
df['filtered_tokens'] = filtered_tokens
df.head()

Unnamed: 0,review,sentiment,doc,tokens,poses,filtered_tokens
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione...",One of the other reviewers has mentioned that ...,"(NUM, ADP, DET, ADJ, NOUN, AUX, VERB, SCONJ, A...","(reviewers, mentioned, watching, 1, Oz, episod..."
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil...",A wonderful little production. The filming tec...,"(DET, ADJ, ADJ, NOUN, PUNCT, DET, NOUN, NOUN, ...","(wonderful, little, production, ., filming, te..."
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,...",I thought this was a wonderful way to spend ti...,"(PRON, VERB, PRON, AUX, DET, ADJ, NOUN, PART, ...","(thought, wonderful, way, spend, time, hot, su..."
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li...",Basically there's a family where a little boy ...,"(ADV, PRON, VERB, DET, NOUN, SCONJ, DET, ADJ, ...","(Basically, family, little, boy, (, Jake, ), t..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o...","Petter Mattei's ""Love in the Time of Money"" is...","(PROPN, PROPN, PART, PUNCT, NOUN, ADP, DET, PR...","(Petter, Mattei, "", Love, Time, Money, "", visu..."


### Lemmatization
We convert the inflicted word forms of our `doc` and transform them to their root form.

In [8]:
lemmas = []

for doc_nlp in df['doc']:
    lemmas.append(tuple(token.lemma_ for token in doc_nlp))

df['lemmas'] = lemmas
df.head()

Unnamed: 0,review,sentiment,doc,tokens,poses,filtered_tokens,lemmas
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione...",One of the other reviewers has mentioned that ...,"(NUM, ADP, DET, ADJ, NOUN, AUX, VERB, SCONJ, A...","(reviewers, mentioned, watching, 1, Oz, episod...","(one, of, the, other, reviewer, have, mention,..."
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil...",A wonderful little production. The filming tec...,"(DET, ADJ, ADJ, NOUN, PUNCT, DET, NOUN, NOUN, ...","(wonderful, little, production, ., filming, te...","(a, wonderful, little, production, ., the, fil..."
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,...",I thought this was a wonderful way to spend ti...,"(PRON, VERB, PRON, AUX, DET, ADJ, NOUN, PART, ...","(thought, wonderful, way, spend, time, hot, su...","(I, think, this, be, a, wonderful, way, to, sp..."
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li...",Basically there's a family where a little boy ...,"(ADV, PRON, VERB, DET, NOUN, SCONJ, DET, ADJ, ...","(Basically, family, little, boy, (, Jake, ), t...","(basically, there, be, a, family, where, a, li..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o...","Petter Mattei's ""Love in the Time of Money"" is...","(PROPN, PROPN, PART, PUNCT, NOUN, ADP, DET, PR...","(Petter, Mattei, "", Love, Time, Money, "", visu...","(Petter, Mattei, 's, "", love, in, the, Time, o..."


Just applying this concept to distill the customer reviews into positive and negative

In [32]:
text_adj = []

for doc_nlp in df['doc']:
    text_adj.append(tuple(token.lemma_ for token in doc_nlp if not token.is_stop and not token.is_punct and token.pos_ == 'ADJ'))

for x in text_adj:
    print(x)

('right', 'faint', 'hearted', 'timid', 'classic', 'experimental', 'high', 'irish', 'shady', 'main', 'pretty', 'mainstream', 'nasty', 'surreal', 'ready', 'high', 'graphic', 'crooked', 'mannered', 'middle', 'comfortable', 'uncomfortable', 'dark')
('wonderful', 'little', 'old', 'entire', 'seamless', 'diary', 'worth', 'masterful', 'great', 'little', 'traditional', 'solid', 'flat')
('wonderful', 'hot', 'light', 'hearted', 'simplistic', 'witty', 'likable', 'serial', 'sexy', 'average', 'spirited', 'young', 'witty', 'interesting', 'great')
('little', 'slow', 'watchable', 'real', 'similar', 'meaningless')
('stunning', 'vivid', 'human', 'different', 'present', 'different', 'previous', 'sophisticated', 'luxurious', 'different', 'big', 'good', 'human', 'sincere', 'good', 'talented', 'alive', 'good')
('favorite', 'noble', 'preachy', 'boring', 'old', 'sympathetic', 'slow', 'believable', 'startling')
('kid', 'black', 'white', 'new', 'ole', 'nice', 'plus')
('amazing', 'fresh', 'innovative', 'brilliant

### Stemming 
This removes common prefixes from the start of each token which further reduces processing size, although Lemmatization is a much better method of accomplishing the same goal

In [10]:
stemmer = PorterStemmer()

# stemmed = [stemmer.stem(token.text) for token in doc]
stemmed = []
line = 0
for doc_nlp in df['doc']:
    stemmed.append(tuple(stemmer.stem(token.text) for token in doc))
    if line == 500 or line==800:
        print(f'Completed line {line}')
    line += 1

df['stemmed'] = stemmed
df.head()

Unnamed: 0,review,sentiment,doc,tokens,poses,filtered_tokens,lemmas,stemmed
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione...",One of the other reviewers has mentioned that ...,"(NUM, ADP, DET, ADJ, NOUN, AUX, VERB, SCONJ, A...","(reviewers, mentioned, watching, 1, Oz, episod...","(one, of, the, other, reviewer, have, mention,...",(one of the other reviewers has mentioned that...
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil...",A wonderful little production. The filming tec...,"(DET, ADJ, ADJ, NOUN, PUNCT, DET, NOUN, NOUN, ...","(wonderful, little, production, ., filming, te...","(a, wonderful, little, production, ., the, fil...",(one of the other reviewers has mentioned that...
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,...",I thought this was a wonderful way to spend ti...,"(PRON, VERB, PRON, AUX, DET, ADJ, NOUN, PART, ...","(thought, wonderful, way, spend, time, hot, su...","(I, think, this, be, a, wonderful, way, to, sp...",(one of the other reviewers has mentioned that...
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li...",Basically there's a family where a little boy ...,"(ADV, PRON, VERB, DET, NOUN, SCONJ, DET, ADJ, ...","(Basically, family, little, boy, (, Jake, ), t...","(basically, there, be, a, family, where, a, li...",(one of the other reviewers has mentioned that...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o...","Petter Mattei's ""Love in the Time of Money"" is...","(PROPN, PROPN, PART, PUNCT, NOUN, ADP, DET, PR...","(Petter, Mattei, "", Love, Time, Money, "", visu...","(Petter, Mattei, 's, "", love, in, the, Time, o...",(one of the other reviewers has mentioned that...


### Named Entity recognition
Identifying and classifying named entities (people, organisations & locations) in a text

In [11]:
ents = []
for doc_nlp in df['doc']:
    ents.append(tuple(token.label_ for token in doc_nlp.ents))
df['ents'] = ents
df.head()

Unnamed: 0,review,sentiment,doc,tokens,poses,filtered_tokens,lemmas,stemmed,ents
0,One of the other reviewers has mentioned that ...,positive,"(One, of, the, other, reviewers, has, mentione...",One of the other reviewers has mentioned that ...,"(NUM, ADP, DET, ADJ, NOUN, AUX, VERB, SCONJ, A...","(reviewers, mentioned, watching, 1, Oz, episod...","(one, of, the, other, reviewer, have, mention,...",(one of the other reviewers has mentioned that...,"(CARDINAL, CARDINAL, ORDINAL, ORG, ORG, ORG, G..."
1,A wonderful little production. The filming tec...,positive,"(A, wonderful, little, production, ., The, fil...",A wonderful little production. The filming tec...,"(DET, ADJ, ADJ, NOUN, PUNCT, DET, NOUN, NOUN, ...","(wonderful, little, production, ., filming, te...","(a, wonderful, little, production, ., the, fil...",(one of the other reviewers has mentioned that...,"(ORG, PERSON, PERSON, ORG, ORG, ORG)"
2,I thought this was a wonderful way to spend ti...,positive,"(I, thought, this, was, a, wonderful, way, to,...",I thought this was a wonderful way to spend ti...,"(PRON, VERB, PRON, AUX, DET, ADJ, NOUN, PART, ...","(thought, wonderful, way, spend, time, hot, su...","(I, think, this, be, a, wonderful, way, to, sp...",(one of the other reviewers has mentioned that...,"(DATE, PERSON, PERSON, CARDINAL, ORG, DATE, OR..."
3,Basically there's a family where a little boy ...,negative,"(Basically, there, 's, a, family, where, a, li...",Basically there's a family where a little boy ...,"(ADV, PRON, VERB, DET, NOUN, SCONJ, DET, ADJ, ...","(Basically, family, little, boy, (, Jake, ), t...","(basically, there, be, a, family, where, a, li...",(one of the other reviewers has mentioned that...,"(NORP, NORP, PERSON, ORDINAL, NORP, CARDINAL, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"(Petter, Mattei, 's, "", Love, in, the, Time, o...","Petter Mattei's ""Love in the Time of Money"" is...","(PROPN, PROPN, PART, PUNCT, NOUN, ADP, DET, PR...","(Petter, Mattei, "", Love, Time, Money, "", visu...","(Petter, Mattei, 's, "", love, in, the, Time, o...",(one of the other reviewers has mentioned that...,"(WORK_OF_ART, PERSON, PERSON, GPE, CARDINAL, D..."


In [12]:
from spacy import displacy


In [13]:
doc = nlp("Noam Chomsky was born in the US in 1928")

In [14]:
for token in doc.ents:
    print(token.text, token.label_)

Noam Chomsky PERSON
US GPE
1928 DATE


In [15]:
displacy.render(doc, style='ent', jupyter=True)

### Dependency Parser 
Examining Dependencies between words to analyze its grammatical structure

In [16]:
displacy.render(doc, style='dep', jupyter=True)

In [17]:
for token in doc:
    print(token.dep_, token.text)

compound Noam
nsubjpass Chomsky
auxpass was
ROOT born
prep in
det the
pobj US
prep in
pobj 1928


In [18]:
# df.to_csv("./ImDB-dataset-processed.csv", index=False)