### Questions

- How do we visualize text data?
    - We will build visualizations
    - This picks up from the end of Corpus Statistics; only one visualization was built at the end of the W2V lab.

### Objectives
YWBAT (You Will Be Able To)
- apply NLP techniques to cluster data
    - stopwords, lemmatization, stemming, phrase analysis, bag of words, bigrams, trigrams, n-grams
- apply ML to cluster data

### Outline
- Take questions 
- Load in dataset
- Get familiar with dataset using EDA
- Clean text in dataset
- Phrase analysis on dataset
- Bag of words on dataset

In [73]:
import re
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix


from textblob import TextBlob # sentiment analysis, quick and dirty

from vaderSentiment import vaderSentiment # sentiment analysis, a little more robust than TextBlob, still not great

# CountVectorizer - bag of words (BOW)
# TfidfVectorizer - Term Frequency-Inverse Document Frequency (TF-IDF)

# What does TF-IDF mean? What does it do?
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


from sklearn.cluster import KMeans
from nltk.corpus import stopwords


import matplotlib.pyplot as plt
import seaborn as sns

### What does TF-IDF mean? What does it do?

- It weighs how often a word appears in a single document (term frequency, tf) against how many documents in the corpus contain it (inverse document frequency, idf)
- What would have a high TF-IDF score?
    - Rare words, unique words
    - Words specific to astrophysics, measured against all academic disciplines, would have a high TF-IDF
    - An author's particular vocabulary

- What has a low TF-IDF score?
    - Stopwords -> they are frequent within single documents and across all documents
             - the tf can be high, but the idf is close to 0
             - low score overall
             - appear at the bottom of a TF-IDF chart, because the idf is usually 0 for them
             
    - Words that are used once in every document....
        - if a word is found in every document then idf = log(N / N) = log(1) = 0, where N is the number of documents
        - this means the TF-IDF score = 0
    - Common words
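
The bullets above can be checked by hand on a toy corpus. This is a minimal sketch that assumes the plain textbook definition (tf as a share of the document's tokens, idf = log(N / df)); the corpus and the helper names `tf`, `idf` and `tfidf` are made up for illustration. scikit-learn's `TfidfVectorizer` smooths the idf by default, so its weights for ubiquitous words come out small rather than exactly 0.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    "the film was a great film".split(),
    "the plot was thin".split(),
    "the quasar scene was odd".split(),
]

def tf(word, doc):
    """Share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(N / df); assumes `word` occurs in at least one document."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tfidf("the", docs[0], docs))     # in every document, so idf = log(1) = 0
print(tfidf("film", docs[0], docs))    # frequent here, absent elsewhere
print(tfidf("quasar", docs[2], docs))  # rare word, appears once
```

"the" scores 0 because it occurs in all three documents, while "film" and "quasar" each occur in only one document and score above 0.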

In [7]:
sw = stopwords.words('english')
sw[:5]

['i', 'me', 'my', 'myself', 'we']

Download data [here](https://www.kaggle.com/nltkdata/movie-review#movie_review.csv)

In [8]:
df = pd.read_csv("data/movie_review.csv")
df.head()

Unnamed: 0,fold_id,cv_tag,html_id,sent_id,text,tag
0,0,cv000,29590,0,films adapted from comic books have had plenty...,pos
1,0,cv000,29590,1,"for starters , it was created by alan moore ( ...",pos
2,0,cv000,29590,2,to say moore and campbell thoroughly researche...,pos
3,0,cv000,29590,3,"the book ( or "" graphic novel , "" if you will ...",pos
4,0,cv000,29590,4,"in other words , don't dismiss this film becau...",pos


### Let's concatenate the sentences of each review together by html_id

In [104]:
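# Each row of df is one sentence; rebuild full reviews by joining the sentences that share an html_id
dictionary_list = []
for html in df['html_id'].unique():
    rows = df.loc[df['html_id'] == html]  # all sentences of this review
    new_row = {}
    new_row['text'] = " ".join(rows['text'])
    new_row['html_id'] = html
    new_row['tags'] = rows['tag'].values[0]  # use the first sentence's tag as the review's label
    
    dictionary_list.append(new_row)

new_df = pd.DataFrame(dictionary_list)
new_df.head()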

Unnamed: 0,text,html_id,tags
0,films adapted from comic books have had plenty...,29590,pos
1,every now and then a movie comes along from a ...,18431,pos
2,you've got mail works alot better than it dese...,15918,pos
3,""" jaws "" is a rare film that grabs your attent...",11664,pos
4,moviemaking is a lot like being the general ma...,11636,pos
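
The same concatenation can probably be written without an explicit loop. The sketch below uses `groupby` with named aggregation on a made-up three-row frame (`toy` and `grouped` are illustrative names, not part of the notebook); `sort=False` keeps the reviews in order of first appearance, as the loop does.

```python
import pandas as pd

# Hypothetical miniature of the sentence-level data.
toy = pd.DataFrame({
    "html_id": [29590, 29590, 18431],
    "text": ["films adapted from comic books", "for starters", "every now and then"],
    "tag": ["pos", "pos", "neg"],
})

grouped = (
    toy.groupby("html_id", sort=False)  # sort=False: keep first-appearance order
       .agg(text=("text", " ".join),    # join all sentences of a review
            tags=("tag", "first"))      # take the first sentence's tag
       .reset_index()
)
print(grouped)
```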


In [105]:
new_df.shape

(2000, 3)

In [106]:
# let's look at a single review
print(new_df.text[0])

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . in other words , don't dismiss this film because of its source . if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . getting the hughes brothers to direct this seems almost as ludicr

### let's clean our text!!!!!

In [107]:
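def clean_text(text):
    """Strip punctuation and collapse repeated spaces; apostrophes are kept."""
    symbols = "()\":-.,?*#$@!;[]|+=<>/"
    for symbol in symbols:
        if symbol in ['-', '/']:
            # replace joiners with a space so "12-part" becomes "12 part"
            text = text.replace(symbol, " ")
        else:
            text = text.replace(symbol, "")
            
    text = re.sub(' +', ' ', text)  # collapse the runs of spaces left behind
    text = text.strip()
    return text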

In [108]:
new_df['cleaned_text'] = [clean_text(text) for text in new_df.text]
new_df.head()

Unnamed: 0,text,html_id,tags,cleaned_text
0,films adapted from comic books have had plenty...,29590,pos,films adapted from comic books have had plenty...
1,every now and then a movie comes along from a ...,18431,pos,every now and then a movie comes along from a ...
2,you've got mail works alot better than it dese...,15918,pos,you've got mail works alot better than it dese...
3,""" jaws "" is a rare film that grabs your attent...",11664,pos,jaws is a rare film that grabs your attention ...
4,moviemaking is a lot like being the general ma...,11636,pos,moviemaking is a lot like being the general ma...


In [109]:
# tags holds one label string per review, so compare it directly;
# list(tag).count('pos') would split 'pos' into characters and always return 0
new_df['pos_total'] = [int(tag == 'pos') for tag in new_df.tags]
new_df.head()

Unnamed: 0,text,html_id,tags,cleaned_text,pos_total
0,films adapted from comic books have had plenty...,29590,pos,films adapted from comic books have had plenty...,1
1,every now and then a movie comes along from a ...,18431,pos,every now and then a movie comes along from a ...,1
2,you've got mail works alot better than it dese...,15918,pos,you've got mail works alot better than it dese...,1
3,""" jaws "" is a rare film that grabs your attent...",11664,pos,jaws is a rare film that grabs your attention ...,1
4,moviemaking is a lot like being the general ma...,11636,pos,moviemaking is a lot like being the general ma...,1


In [41]:
# Bag of Words
# What is a bag of words? Each document is represented by how many times each word occurs in it
# this dog caught the ball and the ball was tasty to the dog -> this: 1, dog: 2, caught: 1, etc

# Vectorize documents by their word counts, where the vocabulary comes from the corpus
BOW = CountVectorizer()
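
To make the dog-and-ball example concrete, here is a small sketch using only the standard library. It is not how `CountVectorizer` is implemented (among other things, its default token pattern drops one-character tokens); it only shows the counting idea and how a fixed vocabulary turns counts into a vector.

```python
from collections import Counter

sentence = "this dog caught the ball and the ball was tasty to the dog"
counts = Counter(sentence.split())

# A fixed, sorted vocabulary gives every word a stable position in the vector.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```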

### Pros and Cons of BOW
Cons
- Contextual Information is lost
    - Lose word order
- There are no....wait for it....prose (badum tsss)


Pros
- Quantifiable -> transforms sentences into vectors, which is awesome
- Clusters on content

In order to group these documents together, what distance metric should we use?
- The similarity measure we care about is *COSINE SIMILARITY*
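
A quick sketch of why cosine similarity suits count vectors: it compares direction and ignores length, so a long review and a short review with the same word proportions look identical. The three vectors below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_review = [1, 2, 0, 1]
long_review = [3, 6, 0, 3]    # same word proportions, three times as long
other_review = [0, 0, 4, 0]   # shares no words with the first two

print(cosine_similarity(short_review, long_review))   # close to 1: same direction
print(cosine_similarity(short_review, other_review))  # 0: no words in common
```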

In [76]:
BOW.fit(new_df.text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [77]:
vecs = BOW.fit_transform(new_df.text)

In [78]:
BOW.vocabulary_  # {word: index in vector}

{'films': 13196,
 'adapted': 1014,
 'from': 14073,
 'comic': 7039,
 'books': 4366,
 'have': 16028,
 'had': 15656,
 'plenty': 26439,
 'of': 24386,
 'success': 34108,
 'whether': 38707,
 'they': 35351,
 're': 28303,
 'about': 750,
 'superheroes': 34291,
 'batman': 3308,
 'superman': 34299,
 'spawn': 32906,
 'or': 24635,
 'geared': 14473,
 'toward': 35949,
 'kids': 19446,
 'casper': 5688,
 'the': 35280,
 'arthouse': 2359,
 'crowd': 8360,
 'ghost': 14650,
 'world': 39165,
 'but': 5187,
 'there': 35324,
 'never': 23719,
 'really': 28357,
 'been': 3486,
 'book': 4357,
 'like': 20492,
 'hell': 16247,
 'before': 3503,
 'for': 13695,
 'starters': 33420,
 'it': 18630,
 'was': 38405,
 'created': 8187,
 'by': 5224,
 'alan': 1408,
 'moore': 22910,
 'and': 1810,
 'eddie': 11120,
 'campbell': 5395,
 'who': 38781,
 'brought': 4875,
 'medium': 22042,
 'to': 35714,
 'whole': 38791,
 'new': 23726,
 'level': 20347,
 'in': 17608,
 'mid': 22336,
 '80s': 403,
 'with': 39013,
 '12': 37,
 'part': 25408,
 'seri

In [79]:
vecs.shape # 2000 rows by 39659 unique words

(2000, 39659)

In [80]:
random_vec = vecs[np.random.randint(low=0, high=vecs.shape[0])]  # high is exclusive, so every row can be picked

In [81]:
np.where(random_vec.toarray()>0)

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0]),
 array([   26,   438,   750,   903,   981,  1139,  1251,  1275,  1541,
         1755,  1760,  1810,  1932,  2012,  2013,  2235,  2396,  2433,
         2566,  2840,  2934,
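
The column indices above are easier to read once they are mapped back to words. A minimal sketch on a two-document toy corpus, assuming scikit-learn is installed; `toy_corpus`, `bow` and `index_to_word` are illustrative names, and the mapping is simply the inverse of `vocabulary_`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "the dog caught the ball",
    "the ball was tasty",
]
bow = CountVectorizer()
toy_vecs = bow.fit_transform(toy_corpus)

# Invert {word: column index} so column indices can be read as words.
index_to_word = {index: word for word, index in bow.vocabulary_.items()}

counts_row = toy_vecs[0].toarray().ravel()   # dense counts for the first document
column_indices = np.flatnonzero(counts_row)  # columns with a non-zero count
words_in_row = {index_to_word[i]: int(counts_row[i]) for i in column_indices}
print(words_in_row)
```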

In [64]:
# Let's do some phrase analysis
# We are using gensim - a very powerful NLP library
from gensim.models.phrases import Phraser, Phrases

### What is phrase analysis?
- Finds words that commonly occur in sequence
    - Ex: New York
- Phrase analysis will transform: New York -> New_York
- Disambiguates (in the case above) the ordinary word "new" from the "New" that is part of "New York"
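
The idea can be sketched in plain Python. This is a simplified re-implementation for illustration, not gensim's code: it scores each bigram as (count(a, b) - min_count) * vocab_size / (count(a) * count(b)), which follows the shape of gensim's default scorer, but the bookkeeping (for example what counts toward the vocabulary size) is an assumption here. The function names and the toy sentences are hypothetical.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Return {(word_a, word_b): score} for bigrams scoring above `threshold`."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

def merge_phrases(sentence, phrases):
    """Join detected bigrams with '_' in a single left-to-right pass."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

sentences = [
    "i moved to new york last year".split(),
    "new york is a big city".split(),
    "she bought a new car in new york".split(),
]
phrases = score_bigrams(sentences, min_count=1, threshold=1.0)
print(merge_phrases(sentences[2], phrases))
```

In the last sentence only the "new" that precedes "york" is merged; the "new" in "new car" stays a separate token.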

In [90]:
documents = new_df.cleaned_text

sentence_stream = [doc.split(" ") for doc in documents] # one token list per review, so each review is treated as a single "sentence"
bigram = Phrases(sentence_stream, min_count=5, threshold=3)


In [91]:
bigram.vocab

defaultdict(int,
            {b'films': 1527,
             b'adapted': 46,
             b'films_adapted': 1,
             b'from': 4999,
             b'adapted_from': 21,
             b'comic': 389,
             b'from_comic': 2,
             b'books': 78,
             b'comic_books': 18,
             b'have': 4900,
             b'books_have': 3,
             b'had': 1546,
             b'have_had': 45,
             b'plenty': 132,
             b'had_plenty': 2,
             b'of': 34123,
             b'plenty_of': 102,
             b'success': 216,
             b'of_success': 8,
             b'whether': 217,
             b'success_whether': 1,
             b"they're": 414,
             b"whether_they're": 2,
             b'about': 3522,
             b"they're_about": 1,
             b'superheroes': 12,
             b'about_superheroes': 1,
             b'batman': 191,
             b'superheroes_batman': 2,
             b'superman': 23,
             b'batman_superman': 2,
             b

In [94]:
bigram[new_df.cleaned_text[0].split(" ")]

['films',
 'adapted_from',
 'comic_books',
 'have',
 'had',
 'plenty_of',
 'success',
 'whether',
 "they're",
 'about',
 'superheroes',
 'batman',
 'superman',
 'spawn',
 'or',
 'geared',
 'toward',
 'kids',
 'casper',
 'or',
 'the',
 'arthouse',
 'crowd',
 'ghost',
 'world',
 "but_there's",
 'never_really',
 'been',
 'a',
 'comic_book',
 'like',
 'from_hell',
 'before',
 'for_starters',
 'it_was',
 'created_by',
 'alan',
 'moore',
 'and',
 'eddie',
 'campbell',
 'who',
 'brought',
 'the',
 'medium',
 'to',
 'a_whole',
 'new',
 'level',
 'in',
 'the',
 'mid',
 "'80s",
 'with',
 'a',
 '12',
 'part',
 'series',
 'called',
 'the',
 'watchmen',
 'to_say',
 'moore',
 'and',
 'campbell',
 'thoroughly',
 'researched',
 'the',
 'subject',
 'of',
 'jack',
 'the',
 'ripper',
 'would_be',
 'like',
 'saying',
 'michael_jackson',
 'is',
 'starting_to',
 'look',
 'a_little',
 'odd',
 'the',
 'book',
 'or',
 'graphic_novel',
 'if_you',
 'will',
 'is',
 'over',
 '500',
 'pages',
 'long',
 'and',
 'inc

In [96]:
new_df["phrased_cleaned_text"] = [' '.join(bigram[text.split(' ')]) for text in new_df.cleaned_text]
new_df.head()

Unnamed: 0,text,html_id,tags,cleaned_text,pos_total,neg_total,phrased_cleaned_text
0,films adapted from comic books have had plenty...,29590,"[pos, pos, pos, pos, pos, pos, pos, pos, pos, ...",films adapted from comic books have had plenty...,25,0,films adapted_from comic_books have had plenty...
1,every now and then a movie comes along from a ...,18431,"[pos, pos, pos, pos, pos, pos, pos, pos, pos, ...",every now and then a movie comes along from a ...,39,0,every_now and_then a movie comes_along from a ...
2,you've got mail works alot better than it dese...,15918,"[pos, pos, pos, pos, pos, pos, pos, pos, pos, ...",you've got mail works alot better than it dese...,19,0,you've_got mail works alot better_than it dese...
3,""" jaws "" is a rare film that grabs your attent...",11664,"[pos, pos, pos, pos, pos, pos, pos, pos, pos, ...",jaws is a rare film that grabs your attention ...,42,0,jaws is a_rare film that grabs your_attention ...
4,moviemaking is a lot like being the general ma...,11636,"[pos, pos, pos, pos, pos, pos, pos, pos, pos, ...",moviemaking is a lot like being the general ma...,25,0,moviemaking is a_lot like being the general ma...


In [97]:
new_df.cleaned_text[0]

"films adapted from comic books have had plenty of success whether they're about superheroes batman superman spawn or geared toward kids casper or the arthouse crowd ghost world but there's never really been a comic book like from hell before for starters it was created by alan moore and eddie campbell who brought the medium to a whole new level in the mid '80s with a 12 part series called the watchmen to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd the book or graphic novel if you will is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes in other words don't dismiss this film because of its source if you can get past the whole comic book thing you might find another stumbling block in from hell's directors albert and allen hughes getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in well anything but riddle me 

In [98]:
new_df.phrased_cleaned_text[0]

"films adapted_from comic_books have had plenty_of success whether they're about superheroes batman superman spawn or geared toward kids casper or the arthouse crowd ghost world but_there's never_really been a comic_book like from_hell before for_starters it_was created_by alan moore and eddie campbell who brought the medium to a_whole new level in the mid '80s with a 12 part series called the watchmen to_say moore and campbell thoroughly researched the subject of jack the ripper would_be like saying michael_jackson is starting_to look a_little odd the book or graphic_novel if_you will is over 500 pages long and includes nearly 30 more that consist_of nothing_but footnotes in other_words don't dismiss this_film because of its_source if_you can_get past the_whole comic_book thing_you might find another stumbling block in from hell's directors albert and allen hughes getting the hughes brothers to_direct this seems almost as ludicrous as casting carrot top in well anything_but riddle me 

In [99]:
# fit_transform learns the vocabulary and vectorizes in one step, so a separate fit call is not needed
vecs = BOW.fit_transform(new_df.phrased_cleaned_text)
BOW.vocabulary_  # {word: index in vector}

{'films': 15643,
 'adapted_from': 1311,
 'comic_books': 8604,
 'have': 19002,
 'had': 18518,
 'plenty_of': 31725,
 'success': 40591,
 'whether': 46613,
 'they': 42338,
 're': 33815,
 'about': 977,
 'superheroes': 40802,
 'batman': 4216,
 'superman': 40810,
 'spawn': 39238,
 'or': 29693,
 'geared': 17140,
 'toward': 43398,
 'kids': 23406,
 'casper': 7071,
 'the': 42028,
 'arthouse': 3136,
 'crowd': 10086,
 'ghost': 17354,
 'world': 47295,
 'but_there': 6452,
 'never_really': 28470,
 'been': 4476,
 'comic_book': 8603,
 'like': 24601,
 'from_hell': 16694,
 'before': 4516,
 'for_starters': 16263,
 'it_was': 22346,
 'created_by': 9890,
 'alan': 1758,
 'moore': 27420,
 'and': 2449,
 'eddie': 13191,
 'campbell': 6714,
 'who': 46711,
 'brought': 6105,
 'medium': 26430,
 'to': 42877,
 'a_whole': 886,
 'new': 28484,
 'level': 24447,
 'in': 21110,
 'mid': 26774,
 '80s': 482,
 'with': 47089,
 '12': 56,
 'part': 30575,
 'series': 37245,
 'called': 6640,
 'watchmen': 46228,
 'to_say': 43055,
 'thoro

In [100]:
vecs.shape

(2000, 47947)

### Things we learned
- Phrase analysis using gensim
    - Using Phrases to modify text data
- How TF-IDF scores behave for common words versus rare words
- Cosine similarity is a good fit for count vectors, because direction matters more than magnitude
- ALWAYS build a cleaner

### Various aspects of NLP
- classifying text
    - chatbots
- unsupervised NLP (clustering reviews, context, profiles, conversations, etc)
- sentiment analysis
- phrase analysis
- sarcasm analysis....wth????
- language translation