<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 1 Assignment 1*

Your goal in this assignment: find the attributes of the best & worst coffee shops in the dataset. The text is fairly raw: dates in the review, extra words in the `star_rating` column, etc. You'll probably want to clean that stuff up for a better analysis. 

Analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Lemmatization
- Custom stopword removal
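
As a rough illustration of custom stopword removal, here is a minimal sketch in plain Python. The word lists are hypothetical placeholders; a real run would more likely start from spaCy's `nlp.Defaults.stop_words` and add domain words that appear in nearly every review. Lemmatization is not shown because it needs a loaded spaCy model.

```python
# Hypothetical base stop words, for illustration only.
BASE_STOP_WORDS = {"the", "a", "and", "i", "it", "was", "is", "to", "of"}

# Domain words that appear in almost every review and carry little signal
# (an assumption for this sketch, not a list taken from the dataset).
CUSTOM_STOP_WORDS = BASE_STOP_WORDS | {"coffee", "place", "austin"}


def remove_stop_words(tokens, stop_words=CUSTOM_STOP_WORDS):
    """Return the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["the", "coffee", "was", "great", "and", "the", "staff", "friendly"]
print(remove_stop_words(tokens))
```

Keeping the custom words in a separate set makes it easy to re-run the frequency counts with and without them and see how the top tokens change.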

Keep in mind the attributes of good tokens. Once you have a solid baseline, layer the star rating into your visualization(s). The key part of this assignment is to produce a write-up of the attributes of the best and worst coffee shops. Based on your analysis, what makes the best the best and the worst the worst? Use graphs and numbers from your analysis to support your conclusions. There should be plenty of markdown cells! :coffee:

In [27]:
%pwd

'C:\\Users\\Mike\\LambdaSchool\\DS-Unit-4-Sprint-1-NLP\\module1-text-data'

In [28]:
import pandas as pd

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/master/module1-text-data/data/yelp_coffeeshop_review_data.csv"

shops = pd.read_csv(url)
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [29]:
import pandas as pd
import spacy
import re
from spacy.tokenizer import Tokenizer
import matplotlib.pyplot as plt

In [30]:
## spacy 
nlp = spacy.load("en_core_web_lg")

## tokenizer
tokenizer = Tokenizer(nlp.vocab)

In [31]:
# STEP 1: GET YOUR TOKENS

In [54]:
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,tokens
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0,"[11/25/2016, 1, check-in, love, love, loved, a..."
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0,"[12/2/2016, listed, date, night:, austin,, amb..."
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0,"[11/30/2016, 1, check-in, listed, brunch, spot..."
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0,"[11/25/2016, cool, decor!, good, drinks, nice,..."
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0,"[12/3/2016, 1, check-in, located, northcross, ..."


In [53]:
def tokenize(text):
    ''' cleanup and tokenizing function '''
    # remove whitespace
    text = text.strip()
    # remove timestamps
    text = re.sub('^[0-9]+[/][0-9]+[/][0-9]+ ', '', text)
    # remove check-ins
    text = re.sub('^[0-9]+ check-in[s]* ', '', text)
    # NOTE: "Listed in ..." prefixes are not removed yet, so those words stay in the tokens
    # remove "see photos" endings
    text = re.sub(' See all photos from (.*)$', '', text)
    # leave only letters and numbers
    text = re.sub('[^a-zA-Z 0-9]', '', text)
    # set everything to lowercase and split into tokens
    text = text.lower().split()
    
    return text
    
sample = ' 11/10/2016 3 check-ins This place has been shown on my social media for days so i finally visited! See all photos from Sarah L. for The Factory - Cafe With a Soul '

print(tokenize(sample))

['this', 'place', 'has', 'been', 'shown', 'on', 'my', 'social', 'media', 'for', 'days', 'so', 'i', 'finally', 'visited']


In [56]:
shops['tokens'] = shops['full_review_text'].apply(tokenize)
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,tokens
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0,"[love, love, loved, the, atmosphere, every, co..."
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0,"[listed, in, date, night, austin, ambiance, in..."
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0,"[listed, in, brunch, spots, i, loved, the, ecl..."
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0,"[very, cool, decor, good, drinks, nice, seatin..."
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0,"[they, are, located, within, the, northcross, ..."


In [34]:
rating = shops['star_rating']

In [36]:
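# change star_rating to only be the float of the rating
# (a vectorized slice replaces the row-by-row loop, which relied on chained assignment)
shops['star_rating'] = shops['star_rating'].str[1:4]
rating = shops['star_rating']
rating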

0       5.0
1       4.0
2       4.0
3       2.0
4       4.0
       ... 
7611    4.0
7612    5.0
7613    4.0
7614    3.0
7615    4.0
Name: star_rating, Length: 7616, dtype: object

In [37]:
rating = rating.astype(float)
rating.dtype

dtype('float64')

In [39]:
# set a cutoff for star_rating (good shops will be 4-5* and bad shops will be 1-3*)
cutoff = 3.5

# create 2 DFs: for good and bad ratings based on CUTOFF
good_shops = shops[rating > cutoff]
bad_shops = shops[rating < cutoff]

# show number of good shops vs. bad shops
print('Number of good shops: ', good_shops.shape[0])
print('Number of bad shops: ', bad_shops.shape[0])

Number of good shops:  6140
Number of bad shops:  1476


## How do we want to analyze these coffee shop tokens? 

- Overall Word / Token Count
- View Counts by Rating 
- *Hint:* a 'bad' coffee shop has a rating between 1 & 3, based on the distribution of ratings. A 'good' coffee shop is a 4 or 5.
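
The hint above can be sketched without pandas. This is only an illustration: the helper names `parse_rating` and `label` are hypothetical, and the raw strings are assumed to look like `' 5.0 star rating '`, as in the `star_rating` column.

```python
import re


def parse_rating(raw):
    """Pull the first number out of a string like ' 5.0 star rating '."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def label(rating, cutoff=3.5):
    """Label 4-5 star ratings 'good' and 1-3 star ratings 'bad'."""
    return "good" if rating > cutoff else "bad"


raw_ratings = [" 5.0 star rating ", " 2.0 star rating ",
               " 4.0 star rating ", " 3.0 star rating "]
labels = [label(parse_rating(r)) for r in raw_ratings]
print(labels)
```

A cutoff of 3.5 sits between the two groups, so whole-star ratings can never land exactly on the boundary.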

In [None]:
# STEP 2: SORT DF into 2 DFs: good_shops & bad_shops

In [41]:
from collections import Counter

In [42]:
def count(docs):
    '''
    returns word info DF from docs
    '''
    word_counts = Counter()
    appears_in = Counter()

    total_docs = len(docs)
    
    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    temp = zip(word_counts.keys(), word_counts.values())

    wc = pd.DataFrame(temp, columns = ['word', 'count'])

    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    wc['pct_total'] = wc['count'].apply(lambda x: x / total)

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')
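
To make the columns returned by `count` easier to read, here is a small sketch on a made-up three-document corpus. It uses only `collections.Counter` and mirrors the `count`, `appears_in`, `pct_total` and `appears_in_pct` calculations; the toy documents are an assumption for illustration.

```python
from collections import Counter

# Toy corpus: each document is already a list of tokens.
docs = [
    ["great", "coffee", "great", "staff"],
    ["coffee", "was", "cold"],
    ["great", "place"],
]

word_counts = Counter()  # total occurrences across the corpus
appears_in = Counter()   # number of documents that contain the word

for doc in docs:
    word_counts.update(doc)
    appears_in.update(set(doc))  # set() counts each word once per document

total = sum(word_counts.values())
for word in ("great", "coffee"):
    pct_total = round(word_counts[word] / total, 3)
    appears_in_pct = round(appears_in[word] / len(docs), 3)
    print(word, word_counts[word], appears_in[word], pct_total, appears_in_pct)
```

Here "great" occurs three times but in only two documents, which is the difference between `count` and `appears_in`.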

In [52]:
# run count fnct. on both DFs
good_wc = count(good_shops['tokens'])
bad_wc = count(bad_shops['tokens'])

# check top 20 words for good and bad ratings
print('Top 20 words to see in a review: ', '\n',  good_wc['word'].head(20))
print('Worst 20 words to see in a review: ', '\n',  bad_wc['word'].head(20))

Top 20 words to see in a review:  
 27         coffee
130         place
125         great
255          it's
265          good
243          like
15              1
40           love
33       check-in
328           i'm
612        little
291          i've
236          nice
455          best
162      friendly
67         austin
68     definitely
375          food
477          time
43         pretty
Name: word, dtype: object
Worst 20 words to see in a review:  
 22        coffee
144        place
34          like
220         it's
14          good
221        don't
143          i'm
58             1
57          food
180     check-in
235         time
77        pretty
126        great
504       people
23       service
1048        i've
32           got
357      ordered
474       didn't
147         come
Name: word, dtype: object


In [50]:
good_wc.columns

Index(['word', 'appears_in', 'count', 'rank', 'pct_total', 'cul_pct_total',
       'appears_in_pct'],
      dtype='object')

## Can we visualize the words with the greatest difference in counts between 'good' & 'bad'?

A couple of notes:
- Use relative frequencies instead of absolute counts, because the two groups have different numbers of reviews
- Only look at the top 5-10 words with the greatest differences
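
One possible way to follow these notes, sketched in plain Python on made-up token lists: compute each word's share of all tokens in each group, subtract, and keep the words with the largest absolute difference. The function names `relative_freq` and `top_differences` are hypothetical helpers, not part of the notebook.

```python
from collections import Counter


def relative_freq(docs):
    """Map each word to its share of all tokens in docs."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def top_differences(good_docs, bad_docs, n=10):
    """Return the n words whose relative frequency differs most (good minus bad)."""
    good = relative_freq(good_docs)
    bad = relative_freq(bad_docs)
    diffs = {w: good.get(w, 0.0) - bad.get(w, 0.0) for w in set(good) | set(bad)}
    # Sort by absolute difference, breaking ties alphabetically.
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


# Toy data for illustration only.
good_docs = [["great", "coffee"], ["friendly", "staff", "great"]]
bad_docs = [["rude", "staff"], ["cold", "coffee", "rude"]]

for word, diff in top_differences(good_docs, bad_docs, n=3):
    print(f"{word:10s} {diff:+.2f}")
```

A positive difference means the word is relatively more common in good reviews, a negative one in bad reviews. The resulting pairs could be passed to a horizontal bar chart.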


## Stretch Goals

* Analyze another corpus of documents - such as Indeed.com job listings ;).
* Play with the spaCy API to:
 - Extract Named Entities
 - Extract 'noun chunks'
 - Attempt Document Classification with just spaCy
 - *Note:* This [course](https://course.spacy.io/) will be of interest in helping you with these stretch goals. 
* Try to build a plotly dash app with your text data 

