<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 1 Assignment 1*

Your goal in this assignment: find the attributes of the best and worst coffee shops in the dataset. The text is fairly raw: dates in the review, extra words in the `star_rating` column, etc. You'll probably want to clean that up for a better analysis.

Analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Lemmatization
- Custom stopword removal

Keep in mind the attributes of good tokens. Once you have a solid baseline, layer the star rating into your visualization(s). A key part of this assignment: produce a write-up of the attributes of the best and worst coffee shops. Based on your analysis, what makes the best the best and the worst the worst? Use graphs and numbers from your analysis to support your conclusions. There should be plenty of markdown cells! :coffee:

In [1]:
# from IPython.display import YouTubeVideo

# YouTubeVideo('Jml7NVYm8cs')

In [2]:
from collections import Counter
import pandas as pd
import spacy

shops = pd.read_csv('./data/yelp_coffeeshop_review_data.csv')
print(shops.shape)
shops.head()

(7616, 3)


Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [3]:
# Start here 
shops.isnull().sum()

coffee_shop_name    0
full_review_text    0
star_rating         0
dtype: int64

In [4]:
# number of distinct coffee shops
shops['coffee_shop_name'].nunique()

79

In [5]:
# check possible ratings
shops['star_rating'].value_counts()

 5.0 star rating     3780
 4.0 star rating     2360
 3.0 star rating      738
 2.0 star rating      460
 1.0 star rating      278
Name: star_rating, dtype: int64

In [6]:
# convert ratings to integer values
shops['rating'] = shops['star_rating'].str.strip().str[0].astype('int')
shops.drop(columns=['star_rating'], inplace=True)
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4


In [7]:
shops['rating'].value_counts(dropna=False)

5    3780
4    2360
3     738
2     460
1     278
Name: rating, dtype: int64
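
The conversion above takes the first character of the stripped string, which works here because every rating is a single digit followed by `.0 star rating`. As a slightly more defensive alternative, the sketch below parses the number with a regular expression and returns `None` for text that does not match. The helper name `parse_rating` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches a leading number such as "5.0" or "4" in strings like " 5.0 star rating ".
RATING_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*star rating", re.IGNORECASE)


def parse_rating(raw):
    """Return the star rating as an int, or None if the text does not match."""
    match = RATING_PATTERN.match(raw)
    if match is None:
        return None
    return int(float(match.group(1)))


# Made-up sample strings in the same shape as the star_rating column.
examples = [" 5.0 star rating ", " 1.0 star rating ", "no rating"]
parsed = [parse_rating(text) for text in examples]
print(parsed)
```

In the notebook this could be applied with `shops['star_rating'].apply(parse_rating)` before dropping the original column.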

In [8]:
# extract date from review text
shops['date'] = shops['full_review_text'].str.extract(r'(\d{1,2}/\d{1,2}/\d{4})')
shops['date'] = pd.to_datetime(shops['date'])

# remove date from review text and strip whitespace
shops['full_review_text'] = shops['full_review_text'].str.replace(r'(\d{1,2}/\d{1,2}/\d{4})', '', n=1, regex=True)
shops['full_review_text'] = shops['full_review_text'].str.strip()

# strip whitespace from shop names, for good measure
shops['coffee_shop_name'] = shops['coffee_shop_name'].str.strip()

shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,rating,date
0,The Factory - Cafe With a Soul,1 check-in Love love loved the atmosphere! Eve...,5,2016-11-25
1,The Factory - Cafe With a Soul,"Listed in Date Night: Austin, Ambiance in Aust...",4,2016-12-02
2,The Factory - Cafe With a Soul,1 check-in Listed in Brunch Spots I loved the ...,4,2016-11-30
3,The Factory - Cafe With a Soul,Very cool decor! Good drinks Nice seating How...,2,2016-11-25
4,The Factory - Cafe With a Soul,1 check-in They are located within the Northcr...,4,2016-12-03


In [9]:
# make sure no null values were created
shops.isnull().sum()

coffee_shop_name    0
full_review_text    0
rating              0
date                0
dtype: int64
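
With the date removed, many reviews still begin with a check-in counter such as `1 check-in`, as the `head()` output shows. The notebook handles this later by adding `check` and `1` to the stop words. Another option is to strip the prefix directly, as in the sketch below. The helper name `strip_checkin` is hypothetical, and the plural form `check-ins` is an assumption about how the data looks for larger counts.

```python
import re

# Assumed pattern: an optional leading "<N> check-in" or "<N> check-ins" marker.
CHECKIN_PATTERN = re.compile(r"^\s*\d+\s+check-ins?\s*", re.IGNORECASE)


def strip_checkin(text):
    """Remove a leading check-in counter, leaving the rest of the review intact."""
    return CHECKIN_PATTERN.sub("", text, count=1).strip()


# Sample strings shortened from the head() output above.
samples = [
    "1 check-in Love love loved the atmosphere!",
    "Very cool decor! Good drinks",
]
cleaned = [strip_checkin(s) for s in samples]
print(cleaned)
```

This could be applied with `shops['full_review_text'].apply(strip_checkin)`. The `Listed in ...` prefix is harder to remove this way because nothing marks where the list name ends.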

In [10]:
nlp = spacy.load("en_core_web_lg")
nlp.Defaults.stop_words |= {' ', 'coffee', 'place', 'check', '1', 'Austin', 'shop', 'order', 'spot', '$', '2'}

In [11]:
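# Borrow a function from the lecture notebook
def get_lemmas(text):
    """Return the lemmas of `text`, skipping stop words, punctuation, and pronouns."""
    doc = nlp(text)

    lemmas = []
    for token in doc:
        # Keep a lemma only if it is not a (custom) stop word, not punctuation,
        # and not tagged as a pronoun.
        if (token.lemma_ not in nlp.Defaults.stop_words
                and not token.is_punct
                and token.pos_ != 'PRON'):
            lemmas.append(token.lemma_)

    return lemmas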

In [12]:
# create column of extracted lemmas
shops['lemmas'] = shops['full_review_text'].apply(get_lemmas)

In [13]:
# Borrow another function from lecture notebook
def count(docs):
    """Build a word-frequency table from an iterable of token lists."""
    word_counts = Counter()   # total occurrences of each word
    appears_in = Counter()    # number of documents containing each word

    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    wc = pd.DataFrame(list(word_counts.items()), columns=['word', 'count'])

    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    wc['pct_total'] = wc['count'] / total

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    ac = pd.DataFrame(list(appears_in.items()), columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    wc['appears_in_pct'] = wc['appears_in'] / total_docs

    return wc.sort_values(by='rank')
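
To make the difference between `count` and `appears_in` concrete, here is a minimal sketch that repeats the two `Counter` updates from the function above on two made-up token lists, without pandas. The token lists and the dictionary names `pct_total` and `appears_in_pct` are illustrative only, chosen to mirror the column names.

```python
from collections import Counter

# Two tiny, made-up token lists standing in for rows of shops['lemmas'].
docs = [
    ["good", "latte", "good"],
    ["good", "parking"],
]

word_counts = Counter()   # total occurrences across all documents
appears_in = Counter()    # number of documents containing the token

for doc in docs:
    word_counts.update(doc)
    appears_in.update(set(doc))   # set() so each document counts once

total_tokens = sum(word_counts.values())
pct_total = {w: c / total_tokens for w, c in word_counts.items()}
appears_in_pct = {w: n / len(docs) for w, n in appears_in.items()}

print(word_counts["good"], appears_in["good"])
```

Here `good` occurs three times but appears in only two documents, so its `count` and `appears_in` values differ.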

In [14]:
# count the words for all reviews
wc_all = count(shops['lemmas'])
wc_all.head()

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
18,-PRON-,5127,13301,1.0,0.039255,0.039255,0.673188
159,good,3591,5395,2.0,0.015922,0.055177,0.471507
113,great,2843,3924,3.0,0.011581,0.066758,0.373293
141,like,2273,3379,4.0,0.009972,0.07673,0.298451
407,come,1932,2637,5.0,0.007783,0.084513,0.253676
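
The top row is `-PRON-`, the placeholder lemma that spaCy 2.x assigns to pronouns, so the `token.pos_ != 'PRON'` check does not appear to catch every case. The stop-word test is also case-sensitive. One possible follow-up, sketched below, is to filter the lemma lists after the fact. The helper `clean_lemmas`, the constant names, and the lowercase stop-word set are hypothetical and not part of the lecture code.

```python
# Hypothetical helper; the names below are not part of the notebook.
PRONOUN_PLACEHOLDER = "-PRON-"
CUSTOM_STOPS = {"coffee", "place", "check", "austin", "shop", "order", "spot"}


def clean_lemmas(lemmas, stop_words=CUSTOM_STOPS):
    """Drop the pronoun placeholder, whitespace-only tokens, and stop words.

    Stop words are compared in lowercase, so "Austin" matches "austin".
    """
    return [
        lemma
        for lemma in lemmas
        if lemma != PRONOUN_PLACEHOLDER
        and lemma.strip()
        and lemma.lower() not in stop_words
    ]


# Made-up lemma list standing in for one row of shops['lemmas'].
sample = ["-PRON-", "love", "Austin", "coffee", " ", "latte"]
cleaned_lemmas = clean_lemmas(sample)
print(cleaned_lemmas)
```

This could be applied with `shops['lemmas'].apply(clean_lemmas)` before calling `count` again.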


In [15]:
print(len(wc_all))
wc_all['word'].iloc[:50]

16960


18          -PRON-
159           good
113          great
141           like
407           come
467           time
16           drink
7             love
46          Austin
73             try
627           work
362           food
41            nice
177       friendly
14           latte
589         little
410         people
700      delicious
23          pretty
473            tea
134        service
103            lot
452          staff
105           want
88          flavor
338           look
86          friend
442          taste
10            find
263           know
38      definitely
146          table
471            day
135        seating
199          think
118        parking
519        barista
90           sweet
571          small
201          thing
24      atmosphere
2177     breakfast
143           feel
204            cup
214          super
213            sit
120          enjoy
200          study
528           milk
602       espresso
Name: word, dtype: object
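
One way to layer in the star rating is to compare how often a token appears in high-rated versus low-rated reviews. The sketch below splits reviews at an assumed threshold of 4 stars and ranks tokens by the difference in document frequency between the two groups. The function names, the threshold, and the sample rows are illustrative assumptions, and the rows are made up rather than taken from the dataset.

```python
from collections import Counter


def doc_frequency(docs):
    """Return the share of documents that contain each token."""
    if not docs:
        return {}
    appears_in = Counter()
    for doc in docs:
        appears_in.update(set(doc))
    return {word: n / len(docs) for word, n in appears_in.items()}


def compare_groups(rows, threshold=4):
    """Split (rating, lemmas) rows at `threshold` and compare document frequencies.

    Returns (word, high_share - low_share) pairs, largest difference first.
    """
    high = [lemmas for rating, lemmas in rows if rating >= threshold]
    low = [lemmas for rating, lemmas in rows if rating < threshold]
    high_df, low_df = doc_frequency(high), doc_frequency(low)
    words = set(high_df) | set(low_df)
    diffs = {w: high_df.get(w, 0.0) - low_df.get(w, 0.0) for w in words}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Made-up rows standing in for zip(shops['rating'], shops['lemmas']).
rows = [
    (5, ["friendly", "latte"]),
    (4, ["friendly", "parking"]),
    (2, ["rude", "parking"]),
    (1, ["rude", "slow"]),
]
ranked = compare_groups(rows)
print(ranked[0], ranked[-1])
```

Tokens at the top of `ranked` lean toward good reviews and tokens at the bottom lean toward bad ones. Tokens near zero, such as `parking` in this toy data, appear about equally often in both groups.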

## Stretch Goals

* Analyze another corpus of documents - such as Indeed.com job listings ;).
* Play with the spaCy API to
  - Extract named entities
  - Extract 'noun chunks'
  - Attempt document classification with just spaCy
  - *Note:* This [course](https://course.spacy.io/) will be of interest in helping you with these stretch goals.
* Try to build a Plotly Dash app with your text data

