
# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 1 Assignment 1*

Your goal in this assignment: find the attributes of the best and worst coffee shops in the dataset. The text is fairly raw: dates at the start of each review, extra words in the `star_rating` column, etc. You'll probably want to clean that up for a better analysis.

Analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Lemmatization
- Custom stopword removal
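
As a rough illustration of custom stopword removal, here is a minimal sketch in plain Python. The stop list and the token list are made up for this example (they are not from the lecture or the dataset); in practice you would likely start from spaCy's built-in stop words and extend them with domain words.

```python
from collections import Counter

# Hypothetical custom stop list: generic English words plus domain words
# that show up in almost every coffee shop review.
CUSTOM_STOP_WORDS = {
    "the", "and", "a", "i", "to", "of", "is", "was", "it", "in",
    "coffee", "checkin",
}

def remove_stop_words(tokens, stop_words=CUSTOM_STOP_WORDS):
    """Drop stop words and empty/whitespace-only tokens."""
    return [t for t in tokens if t.strip() and t.lower() not in stop_words]

# Toy token list, made up for illustration.
tokens = ["1", "checkin", "love", "love", "the", "atmosphere", " ", "coffee"]
filtered = remove_stop_words(tokens)
top_words = Counter(filtered).most_common(3)
```

Filtering before counting keeps words such as "the" and "and" from crowding out the tokens that actually describe a shop.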

Keep in mind the attributes of good tokens. Once you have a solid baseline, layer the star rating into your visualization(s). Key part of this assignment - produce a write-up of the attributes of the best and worst coffee shops. Based on your analysis, what makes the best the best and the worst the worst? Use graphs and numbers from your analysis to support your conclusions. There should be plenty of markdown cells! :coffee:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import spacy
from spacy.tokenizer import Tokenizer
from nltk.stem import PorterStemmer

In [53]:
# function given in lecture for creating a dataframe from a df column of counters
def count(docs):

        word_counts = Counter()
        appears_in = Counter()
        
        total_docs = len(docs)

        for doc in docs:
            word_counts.update(doc)
            appears_in.update(set(doc))

        temp = zip(word_counts.keys(), word_counts.values())
        
        wc = pd.DataFrame(temp, columns = ['word', 'count'])

        wc['rank'] = wc['count'].rank(method='first', ascending=False)
        total = wc['count'].sum()

        wc['pct_total'] = wc['count'].apply(lambda x: x / total)
        
        wc = wc.sort_values(by='rank')
        wc['cul_pct_total'] = wc['pct_total'].cumsum()

        t2 = zip(appears_in.keys(), appears_in.values())
        ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
        wc = ac.merge(wc, on='word')

        wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)
        
        return wc.sort_values(by='rank')

In [71]:
# setting up natural language processor with tokenizer
nlp = spacy.load('en_core_web_lg')
tokenizer = Tokenizer(nlp.vocab)

In [55]:
url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/master/module1-text-data/data/yelp_coffeeshop_review_data.csv"

shops = pd.read_csv(url)
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [73]:
# copies dataframe to new df
df = shops.copy()

# renames columns for ease of use later
df.columns=['shop', 'text', 'rating']

# keeps only the number, e.g. ' 5.0 star rating' -> '5.0' (still a string)
# note: the slice skips the leading space in the raw column
df['rating'] = df['rating'].str[1:4]

# converts number string to float, then float to int
df['rating'] = df['rating'].astype(float)
df['rating'] = df['rating'].astype(int)

# creates a date column; x[0] is empty because the text starts with a space,
# so the date is the second element
df['date'] = df['text'].str.split(' ').apply(lambda x: x[1])

# converts date column to datetime dtype
df['date'] = df['date'].apply(pd.to_datetime)

# removes punctuation with regex given in lecture
df['text'] = df['text'].apply(lambda x: re.sub(r'[^a-zA-Z ^0-9]', '', x))

# changes text to be lowercase
df['text'] = df['text'].apply(lambda x: x.lower())

# creates a new column with tokenized data using lecture function
df['tokens'] = df['text'].apply(tokenizer)

# drops the first two tokens (leading whitespace and the date)
df['tokens'] = df['tokens'].apply(lambda x: x[2:])

# converts tokens to list 
df['tokens'] = df['tokens'].apply(lambda x: [token.text for token in x])

# creates a counter for the tokens
df['counter'] = df['tokens'].apply(Counter)
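
The slicing above depends on the exact position of the leading space and the date. A more defensive variant could use regular expressions instead. The sketch below is only an illustration in plain Python; `parse_rating` and `split_date` are hypothetical helper names, and the `M/D/YYYY` date format is assumed from the sample rows shown earlier.

```python
import re
from datetime import datetime

def parse_rating(raw):
    """Pull the first number out of a string such as ' 5.0 star rating'."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no rating found in {raw!r}")
    return int(float(match.group()))

def split_date(raw_text):
    """Split a review into (date, body), assuming it starts with M/D/YYYY."""
    match = re.match(r"\s*(\d{1,2}/\d{1,2}/\d{4})\s*(.*)", raw_text, flags=re.DOTALL)
    if match is None:
        # No leading date found: return the text unchanged apart from trimming.
        return None, raw_text.strip()
    date = datetime.strptime(match.group(1), "%m/%d/%Y").date()
    return date, match.group(2)

rating = parse_rating(" 4.0 star rating")
date, body = split_date(" 11/25/2016 Very cool decor! Good drinks")
```

Splitting the date off before stripping punctuation also keeps tokens such as `11252016` out of the cleaned text. With pandas, helpers like these could be applied to a column through `Series.apply`.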

In [74]:
df.head()

Unnamed: 0,shop,text,rating,date,tokens,counter
0,The Factory - Cafe With a Soul,11252016 1 checkin love love loved the atmosp...,5,2016-11-25,"[1, checkin, love, love, loved, the, atmospher...","{'1': 1, 'checkin': 1, 'love': 2, 'loved': 1, ..."
1,The Factory - Cafe With a Soul,1222016 listed in date night austin ambiance ...,4,2016-12-02,"[listed, in, date, night, austin, ambiance, in...","{'listed': 1, 'in': 2, 'date': 1, 'night': 1, ..."
2,The Factory - Cafe With a Soul,11302016 1 checkin listed in brunch spots i l...,4,2016-11-30,"[1, checkin, listed, in, brunch, spots, i, lov...","{'1': 1, 'checkin': 1, 'listed': 1, 'in': 3, '..."
3,The Factory - Cafe With a Soul,11252016 very cool decor good drinks nice sea...,2,2016-11-25,"[very, cool, decor, good, drinks, nice, seatin...","{'very': 1, 'cool': 1, 'decor': 1, 'good': 1, ..."
4,The Factory - Cafe With a Soul,1232016 1 checkin they are located within the...,4,2016-12-03,"[1, checkin, they, are, located, within, the, ...","{'1': 1, 'checkin': 1, 'they': 1, 'are': 1, 'l..."


## How do we want to analyze these coffee shop tokens? 

- Overall Word / Token Count
- View Counts by Rating 
- *Hint:* a 'bad' coffee shop has a rating between 1 and 3, based on the distribution of ratings. A 'good' coffee shop has a 4 or 5.

In [75]:
# sums the rows into one counter
wordCount = df['counter'].sum()

In [77]:
# creates dataframes for good and bad ratings
dfGood = df[df['rating'] > 3]
dfBad = df[df['rating'] < 4]

# makes sure the sizes of the new dataframes add up to the size of the original dataframe
assert len(df) == len(dfGood) + len(dfBad)

In [78]:
# creates counters for good/bad reviews
wordCountGood = dfGood['counter'].sum()
wordCountBad = dfBad['counter'].sum()

In [91]:
# looks at top 10 words from all reviews using lecture function

wordsDF = count(df['counter'])
wordsDF.head(10)

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
49,the,6847,34809,1.0,0.043867,0.043867,0.899028
55,and,6864,26650,2.0,0.033585,0.077452,0.901261
18,a,6246,22755,3.0,0.028676,0.106128,0.820116
67,i,5528,20237,4.0,0.025503,0.131631,0.72584
53,,4903,18412,5.0,0.023203,0.154834,0.643776
40,to,5653,17164,6.0,0.02163,0.176465,0.742253
35,of,5100,12600,7.0,0.015879,0.192344,0.669643
94,is,4999,11999,8.0,0.015121,0.207465,0.656381
71,coffee,4877,10353,9.0,0.013047,0.220512,0.640362
8,was,3765,9707,10.0,0.012233,0.232745,0.494354


In [92]:
# looks at top 10 words used in good reviews

wordsGoodDF = count(dfGood['counter'])
wordsGoodDF.head(10)

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
49,the,5479,26616,1.0,0.04358,0.04358,0.892345
55,and,5545,21311,2.0,0.034894,0.078474,0.903094
18,a,4983,17706,3.0,0.028991,0.107466,0.811564
67,i,4344,14952,4.0,0.024482,0.131948,0.707492
53,,3880,14439,5.0,0.023642,0.15559,0.631922
40,to,4452,12763,6.0,0.020898,0.176488,0.725081
35,of,4066,9932,7.0,0.016262,0.19275,0.662215
94,is,4028,9644,8.0,0.015791,0.208541,0.656026
71,coffee,3933,8234,9.0,0.013482,0.222023,0.640554
76,in,3660,7517,10.0,0.012308,0.234331,0.596091


In [93]:
# looks at top 10 words used in bad reviews

wordsBadDF = count(dfBad['counter'])
wordsBadDF.head(10)

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
48,the,1368,8193,1.0,0.044825,0.044825,0.926829
54,and,1319,5339,2.0,0.02921,0.074035,0.893631
183,i,1184,5285,3.0,0.028915,0.10295,0.802168
19,a,1263,5049,4.0,0.027624,0.130574,0.855691
220,to,1201,4401,5.0,0.024078,0.154652,0.813686
51,,1023,3973,6.0,0.021737,0.176389,0.693089
336,was,980,2933,7.0,0.016047,0.192436,0.663957
38,of,1034,2668,8.0,0.014597,0.207033,0.700542
6,it,977,2481,9.0,0.013574,0.220606,0.661924
30,is,971,2355,10.0,0.012884,0.233491,0.657859


## Can we visualize the words with the greatest difference in counts between 'good' & 'bad'?

A couple of notes:
- Use relative frequencies instead of absolute counts, because the two groups have different numbers of reviews
- Only look at the top 5-10 words with the greatest differences
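
To make the first note concrete, here is a minimal sketch in plain Python that compares the share of documents containing each word across two groups. The function names and the toy reviews are made up for illustration; they are not part of the lecture code.

```python
from collections import Counter

def doc_frequency(token_lists):
    """Fraction of documents in which each token appears at least once."""
    appears_in = Counter()
    for tokens in token_lists:
        appears_in.update(set(tokens))
    n_docs = len(token_lists)
    return {word: n / n_docs for word, n in appears_in.items()}

def frequency_gap(good_docs, bad_docs, top_n=5):
    """Words ranked by (share of good docs) minus (share of bad docs)."""
    good, bad = doc_frequency(good_docs), doc_frequency(bad_docs)
    shared = good.keys() & bad.keys()  # keep only words seen in both groups
    gaps = {w: good[w] - bad[w] for w in shared}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy data: four "good" reviews and two "bad" ones.
good_docs = [["great", "coffee"], ["great", "staff"],
             ["great", "not"], ["coffee", "friendly"]]
bad_docs = [["not", "great", "coffee"], ["not", "rude"]]
ranked = frequency_gap(good_docs, bad_docs)
```

Because each share is divided by its own group size, the smaller group is not penalized for having fewer reviews. Keeping only shared words mirrors an inner join, so words that occur in just one group are dropped.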


In [110]:
# creates a dataframe comparing word usage in good vs. bad reviews

badDF = wordsBadDF[['word', 'appears_in_pct']]
badDF.columns = ['word', 'percentage_in_bad']

goodDF = wordsGoodDF[['word', 'appears_in_pct']]
goodDF.columns = ['word', 'percentage_in_good']

wordPercents = pd.merge(goodDF, badDF, on='word')

In [111]:
# creates a difference column
wordPercents['difference'] = wordPercents['percentage_in_good'] - wordPercents['percentage_in_bad']

In [114]:
# shows top 5 words used more in good reviews than bad reviews
wordPercents.sort_values('difference', ascending=False).head()

Unnamed: 0,word,percentage_in_good,percentage_in_bad,difference
25,great,0.405863,0.233062,0.172801
68,delicious,0.191857,0.056233,0.135624
58,friendly,0.22785,0.120596,0.107254
76,best,0.165961,0.058943,0.107018
51,love,0.215309,0.120596,0.094713


In [115]:
# shows top 5 words used more in bad reviews than good reviews
wordPercents.sort_values('difference').head()

Unnamed: 0,word,percentage_in_good,percentage_in_bad,difference
29,not,0.320684,0.539973,-0.219289
18,but,0.459935,0.672087,-0.212152
11,was,0.453583,0.663957,-0.210374
49,just,0.221173,0.378049,-0.156876
19,that,0.402117,0.545393,-0.143276
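
One possible way to visualize these gaps is a diverging horizontal bar chart. The sketch below hard-codes the `difference` values from the two tables above so that it runs on its own; in the notebook you would pull them from `wordPercents` instead. The colors and labels are arbitrary choices.

```python
import matplotlib.pyplot as plt

# Differences copied from the two tables above
# (share of good reviews minus share of bad reviews).
differences = {
    "great": 0.172801, "delicious": 0.135624, "friendly": 0.107254,
    "best": 0.107018, "love": 0.094713,
    "that": -0.143276, "just": -0.156876, "was": -0.210374,
    "but": -0.212152, "not": -0.219289,
}

# Sort from most "bad-leaning" to most "good-leaning".
words = sorted(differences, key=differences.get)
values = [differences[w] for w in words]
colors = ["tab:red" if v < 0 else "tab:green" for v in values]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(words, values, color=colors)
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Share of good reviews minus share of bad reviews")
ax.set_title("Words with the largest gap between good and bad reviews")
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` afterwards. Note that the bad-leaning side is dominated by function words ("not", "but", "was"), which suggests negation and hedging matter more than any single topic word.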


## Stretch Goals

* Analyze another corpus of documents - such as Indeed.com job listings ;).
* Play with the spaCy API to
 - Extract named entities
 - Extract 'noun chunks'
 - Attempt document classification with just spaCy
 - *Note:* This [course](https://course.spacy.io/) will be helpful for these stretch goals.
* Try to build a Plotly Dash app with your text data

