### Questions
- downloaded text and ran models....wth? 
- stemming and lemmatizing 

### Objectives
YWBAT
- apply nlp techniques to cluster data
    - stopwords, lemmatization, stemming, phrase analysis, bag of words, bigrams, trigrams, n-grams
- apply ML to cluster data

### Outline
- Take questions 
- Load in dataset
- Get familiar with dataset using EDA
- clean text in dataset 
- phrase analysis on dataset
- bag of words on dataset

In [54]:
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix


from textblob import TextBlob
from vaderSentiment import vaderSentiment
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords


import matplotlib.pyplot as plt
import seaborn as sns

In [55]:
sw = list(set(stopwords.words('english')))
sw[:5]

['d', 'doesn', 'that', "shan't", 'we']

Download data [here](https://www.kaggle.com/nltkdata/movie-review#movie_review.csv)

In [18]:
df = pd.read_csv("data/movie_review.csv")
df.head()

Unnamed: 0,fold_id,cv_tag,html_id,sent_id,text,tag
0,0,cv000,29590,0,films adapted from comic books have had plenty...,pos
1,0,cv000,29590,1,"for starters , it was created by alan moore ( ...",pos
2,0,cv000,29590,2,to say moore and campbell thoroughly researche...,pos
3,0,cv000,29590,3,"the book ( or "" graphic novel , "" if you will ...",pos
4,0,cv000,29590,4,"in other words , don't dismiss this film becau...",pos


In [32]:
def clean_text(text):
    symbols_to_remove = ",(){}:;."
    for symbol in symbols_to_remove:
        text = text.replace(symbol, "")
    text = text.strip()
    text = text.replace("  ", " ")
    return text

In [33]:
clean_text(df.text[0])

"films adapted from comic books have had plenty of success whether they're about superheroes batman superman spawn  or geared toward kids casper or the arthouse crowd ghost world  but there's never really been a comic book like from hell before"

In [46]:
# Bag of Words
# What is a bag of words? Every instance of a word put into a numerical digit
# this dog caught the ball and the ball was tasty to the dog -> this: 1, dog: 2, caught: 1, etc

# Vectorizing sentences by their word count, where the words come from the corpus
BOW = CountVectorizer()

In [47]:
BOW.fit(df.text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [49]:
vecs = BOW.fit_transform(df.text)

In [52]:
BOW.vocabulary_  # {word: index in vector}

{'films': 13196,
 'adapted': 1014,
 'from': 14073,
 'comic': 7039,
 'books': 4366,
 'have': 16028,
 'had': 15656,
 'plenty': 26439,
 'of': 24386,
 'success': 34108,
 'whether': 38707,
 'they': 35351,
 're': 28303,
 'about': 750,
 'superheroes': 34291,
 'batman': 3308,
 'superman': 34299,
 'spawn': 32906,
 'or': 24635,
 'geared': 14473,
 'toward': 35949,
 'kids': 19446,
 'casper': 5688,
 'the': 35280,
 'arthouse': 2359,
 'crowd': 8360,
 'ghost': 14650,
 'world': 39165,
 'but': 5187,
 'there': 35324,
 'never': 23719,
 'really': 28357,
 'been': 3486,
 'book': 4357,
 'like': 20492,
 'hell': 16247,
 'before': 3503,
 'for': 13695,
 'starters': 33420,
 'it': 18630,
 'was': 38405,
 'created': 8187,
 'by': 5224,
 'alan': 1408,
 'moore': 22910,
 'and': 1810,
 'eddie': 11120,
 'campbell': 5395,
 'who': 38781,
 'brought': 4875,
 'medium': 22042,
 'to': 35714,
 'whole': 38791,
 'new': 23726,
 'level': 20347,
 'in': 17608,
 'mid': 22336,
 '80s': 403,
 'with': 39013,
 '12': 37,
 'part': 25408,
 'seri

In [70]:
vecs.shape # 64720 rows by 39659 unique words

(64720, 39659)

In [73]:
random_vec = vecs[np.random.randint(low=0, high=vecs.shape[1]-1)]

In [74]:
random_vec

<1x39659 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

In [89]:
def euclid_dist(p1, p2):
    try:
        return np.sqrt(np.sum(p1**2 + p2**2))
    except:
        return np.sqrt(np.sum(p1.toarray()**2 + p2.toarray()**2))

In [84]:
# How can I find reviews similar to this one?
# Compare 2 vectors...that's the goal?


# Euclidean Distance
euclid_dist(vecs[0].toarray(), vecs[1].toarray())

9.055385138137417

In [85]:
df.text[0], df.text[1]

("films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .",
 "for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .")

In [90]:
euclid_dist(vecs[0], vecs[2])

8.660254037844387

In [91]:
distances = []
for index, vec in enumerate(vecs):
    dist = euclid_dist(vec, random_vec)
    distances.append((index, dist))

In [92]:
distances[:5]

[(0, 9.327379053088816),
 (1, 8.774964387392123),
 (2, 8.366600265340756),
 (3, 8.06225774829855),
 (4, 7.211102550927978)]

In [94]:
distances = sorted(distances, key=lambda x: x[1], reverse=False)
distances[:5]

[(559, 6.4031242374328485),
 (616, 6.4031242374328485),
 (990, 6.4031242374328485),
 (1613, 6.4031242374328485),
 (1937, 6.4031242374328485)]

In [95]:
for tup in distances[:5]:
    print(df.text[tup[0]])

,
?
: )
i . e .
,


In [102]:
df.text[559]

','