<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 1 Assignment 1*

Your goal in this assignment: find the attributes of the best & worst coffee shops in the dataset. The text is fairly raw: dates in the review, extra words in the `star_rating` column, etc. You'll probably want to clean that stuff up for a better analysis. 
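
As one way to handle the dates mentioned above, here is a minimal sketch that strips a leading `MM/DD/YYYY` date from a review string with a regular expression. It assumes every review starts with a date in that form, as the sample rows from `shops.head()` do. `strip_leading_date` and `LEADING_DATE` are hypothetical names, not part of the assignment.

```python
import re

# Assumption: reviews begin with optional whitespace followed by a date like 11/25/2016.
LEADING_DATE = re.compile(r"^\s*\d{1,2}/\d{1,2}/\d{4}\s*")

def strip_leading_date(review):
    """Return the review text without its leading MM/DD/YYYY date (hypothetical helper)."""
    return LEADING_DATE.sub("", review, count=1)

print(strip_leading_date(" 11/25/2016 1 check-in Love love loved the atmosphere!"))
# 1 check-in Love love loved the atmosphere!
```

On the dataframe, this could be applied with `shops['full_review_text'].apply(strip_leading_date)`. Reviews without a leading date pass through unchanged.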

Analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Lemmatization
- Custom stopword removal

Keep in mind the attributes of good tokens. Once you have a solid baseline, layer the star rating into your visualization(s). The key part of this assignment is to produce a write-up of the attributes of the best and worst coffee shops: based on your analysis, what makes the best the best and the worst the worst? Use graphs and numbers from your analysis to support your conclusions. There should be plenty of markdown cells! :coffee:

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('Jml7NVYm8cs')

In [None]:
%pwd

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/master/module1-text-data/data/yelp_coffeeshop_review_data.csv"

shops = pd.read_csv(url)
shops.head()

In [None]:
# Start here 

## How do we want to analyze these coffee shop tokens? 

- Overall Word / Token Count
- View Counts by Rating 
- *Hint:* a 'bad' coffee shop has a rating between 1 & 3 based on the distribution of ratings. A 'good' coffee shop is a 4 or 5. 
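
The hint above can be sketched in plain Python. This assumes `star_rating` values always look like `' 5.0 star rating '`, with the number first; `parse_star_rating` and `label_shop_review` are hypothetical helper names.

```python
def parse_star_rating(raw):
    """Turn a string like ' 5.0 star rating ' into the float 5.0 (hypothetical helper)."""
    # Assumption: the number is always the first whitespace-separated piece.
    return float(raw.split()[0])

def label_shop_review(stars):
    """Label a review 'good' (4 or 5 stars) or 'bad' (1 to 3 stars), following the hint."""
    return "good" if stars >= 4 else "bad"

for raw in [" 5.0 star rating ", " 2.0 star rating "]:
    stars = parse_star_rating(raw)
    print(raw.strip(), "->", stars, label_shop_review(stars))
```

Both helpers work on single values, so they could be used with `shops['star_rating'].apply(...)` to build numeric and good/bad columns.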

In [None]:
#Import statements

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_lg")

In [None]:
# Check dataframe
shops.head()

In [None]:
# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

In [None]:
#Tokenizer pipe function
def token_pipe(text_column):
    tokens = []
    
    for doc in tokenizer.pipe(text_column, batch_size=1000):
        doc_tokens = [token.text for token in doc]
        tokens.append(doc_tokens)
        
    return tokens

In [None]:
# Use function
shops_tokens = token_pipe(shops['full_review_text'])

In [None]:
# Create new column on shops df.
shops['tokens'] = shops_tokens

shops.head()

In [None]:
# Object from Base Python
from collections import Counter

# The object `Counter` takes an iterable, but you can instantiate an empty one and update it. 
word_counts = Counter()

# Using function from lecture, I'll make my own down the line
def count(docs):
    """Build a word-frequency DataFrame from an iterable of token lists."""
    word_counts = Counter()  # total occurrences of each word
    appears_in = Counter()   # number of documents containing each word

    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    temp = zip(word_counts.keys(), word_counts.values())

    wc = pd.DataFrame(temp, columns=['word', 'count'])

    # Rank words by count (1 = most frequent) and compute each word's share of all tokens
    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    wc['pct_total'] = wc['count'].apply(lambda x: x / total)

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    # Share of documents in which each word appears at least once
    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')

In [None]:
wc = count(shops['tokens'])

In [None]:
wc.head()

In [76]:
# We've still got plenty of stop words and repeated words in different forms, let's do some lemmatizing
def get_lemmas(text):
    
    # Create empty lemma list
    lemmas = []
    
    # Create doc file containing text to be lemmatized
    doc = nlp(text)
    
    # Skip stop words, punctuation, pronouns, numbers and symbols; append the lemma of every other token
    for token in doc:
        if not token.is_stop and not token.is_punct and token.pos_ not in ('PRON', 'NUM', 'SYM'):
            lemmas.append(token.lemma_)
    
    return lemmas

In [77]:
shops['lemmas'] = shops['full_review_text'].apply(get_lemmas)

In [78]:
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,tokens,lemmas
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating,"[ , 11/25/2016, 1, check-in, Love, love, loved...","[ , check, love, love, love, atmosphere, corne..."
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating,"[ , 12/2/2016, Listed, in, Date, Night:, Austi...","[ , list, Date, Night, Austin, Ambiance, Austi..."
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating,"[ , 11/30/2016, 1, check-in, Listed, in, Brunc...","[ , check, list, Brunch, Spots, love, eclectic..."
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating,"[ , 11/25/2016, Very, cool, decor!, Good, drin...","[ , cool, decor, good, drink, nice, seating, ..."
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating,"[ , 12/3/2016, 1, check-in, They, are, located...","[ , check, locate, Northcross, mall, shopping,..."


In [81]:
shops['lemmas'].head() 

0    [check, love, love, love, atmosphere, corner, ...
1    [list, Date, Night, Austin, Ambiance, Austin, ...
2    [check, list, Brunch, Spots, love, eclectic, h...
3    [cool, decor, good, drink, nice, seating,  ,  ...
4    [check, locate, Northcross, mall, shopping, ce...
Name: lemmas, dtype: object

In [98]:
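# We've got some whitespace, let's get rid of that
# A loop of text.remove(' ') only drops one ' ' per list per run (it took about 20 runs),
# so filter out every whitespace-only lemma in a single pass instead
shops['lemmas'] = shops['lemmas'].apply(
    lambda lemmas: [lemma for lemma in lemmas if lemma.strip()]
)

# I'd like to remove white space beforehand next time.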

In [99]:
# Much better
shops['lemmas'].head()

0    [check, love, love, love, atmosphere, corner, ...
1    [list, Date, Night, Austin, Ambiance, Austin, ...
2    [check, list, Brunch, Spots, love, eclectic, h...
3    [cool, decor, good, drink, nice, seating, over...
4    [check, locate, Northcross, mall, shopping, ce...
Name: lemmas, dtype: object

In [100]:
# Let's get some stats
wc = count(shops['lemmas'])

In [101]:
wc.head()

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
28,coffee,4826,10100,1.0,0.028488,0.028488,0.633666
105,place,3876,6021,2.0,0.016983,0.04547,0.508929
164,good,3588,5391,3.0,0.015206,0.060676,0.471113
93,great,2843,3924,4.0,0.011068,0.071744,0.373293
4,check,3175,3468,5.0,0.009782,0.081525,0.416886
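
The top lemmas above (`coffee`, `place`, `check`) each appear in roughly 40-60% of reviews, so they probably say little about any one shop and are natural candidates for a custom stop word list. The following is a minimal sketch in plain Python. The `CUSTOM_STOP_WORDS` set and the `remove_custom_stop_words` helper are illustrative assumptions, not a list prescribed by the assignment.

```python
# Assumption: an illustrative set of domain-specific words to drop; tune it for your own analysis.
CUSTOM_STOP_WORDS = {"coffee", "place", "check"}

def remove_custom_stop_words(lemmas, stop_words=CUSTOM_STOP_WORDS):
    """Return the lemmas that are not in stop_words, compared case-insensitively."""
    return [lemma for lemma in lemmas if lemma.lower() not in stop_words]

sample = ["check", "love", "Coffee", "atmosphere", "place"]
print(remove_custom_stop_words(sample))  # ['love', 'atmosphere']
```

Applying this with `shops['lemmas'].apply(remove_custom_stop_words)` and rerunning `count` would show which words rise to the top once the generic ones are gone.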


## Can you visualize the words with the greatest difference in counts between 'good' & 'bad'?

A couple of notes: 
- Use relative frequencies instead of absolute counts, because the 'good' and 'bad' groups contain different numbers of reviews
- Only look at the top 5-10 words with the greatest differences
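
One possible reading of these notes, sketched with toy data: compute each token's share of all tokens within the good reviews and within the bad reviews, subtract, and keep the tokens with the largest absolute difference. The function names `relative_frequencies` and `top_differences` are hypothetical, and the two small document lists stand in for the real lemma lists.

```python
from collections import Counter

def relative_frequencies(docs):
    """Map each token to its share of all tokens in docs (a list of token lists)."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

def top_differences(good_docs, bad_docs, n=10):
    """Return the n tokens whose relative frequency differs most between the groups.

    Each item is (token, good_share - bad_share); positive means more typical of good reviews.
    """
    good = relative_frequencies(good_docs)
    bad = relative_frequencies(bad_docs)
    diffs = {t: good.get(t, 0.0) - bad.get(t, 0.0) for t in set(good) | set(bad)}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:n]

# Toy data standing in for the lemma lists of good and bad reviews.
good_docs = [["friendly", "latte", "great"], ["great", "atmosphere"]]
bad_docs = [["rude", "latte"], ["rude", "slow", "wait"]]
for token, diff in top_differences(good_docs, bad_docs, n=3):
    print(f"{token}: {diff:+.2f}")
```

With the real data, the two inputs would be `shops['lemmas']` split by whatever good/bad label you derive from `star_rating`. The signed differences lend themselves to a horizontal bar chart.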


In [None]:
# Not exactly sure what this is asking for, I'd like to revisit this after asking my TL.

## Stretch Goals

* Analyze another corpus of documents - such as Indeed.com job listings ;).
* Play with the spaCy API to
 - Extract Named Entities
 - Extract 'noun chunks'
 - Attempt Document Classification with just spaCy
 - *Note:* This [course](https://course.spacy.io/) will be of interest in helping you with these stretch goals. 
* Try to build a plotly dash app with your text data 

