# Capstone Requirements

Your presentation should include:
* A problem statement
* A description of the data and your assumptions about it
* Discussion of the features included in your model (with visualizations) and any derived features that did not turn out to be effective predictors (your feature graveyard)
* An explanation of your cost metric and why it was chosen
* The performance of the different models you scored
* Your best-performing model's performance relative to a baseline
* The feature importances that help explain how your model predicts (if applicable)
* **Encouraged:** Details of the error analysis you performed on your model, isolating its strengths and shortcomings
* Recommendations or next steps

In [8]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

%matplotlib inline

## What Is Natural Language Processing (NLP)?

- Using computers to process (analyze, understand, generate) natural human languages.

### Higher-Level NLP Task Areas

Some higher-level tasks include:

- **Question answering:** Determine the intent of the question, match query with knowledge base, evaluate hypotheses.
- **Information retrieval:** Find relevant results and similar results.
- **Machine translation:** One language to another.
- **Predictive text input:** Faster or easier typing.

### Higher-Level Tasks Are Hard Because

- **Ambiguity**:
    - Hospitals Are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Idioms:** "throw in the towel"
- **Newly coined words & non-standard words:** "retweet"
- **Tricky entity names:** "Where is A Bug's Life playing?"

### Our Use Case: Text Classification
**The task of predicting which category or topic a text sample is from.**

In [9]:
# Read yelp.csv into a DataFrame.
path = r'./data/yelp.csv'
yelp = pd.read_csv(path)
yelp.head(1)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |


In [10]:
#sample 'document'
yelp.text.values[0]

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

In text classification, we vectorize the text into a set of numeric features. 

- For a given document, the numeric value of each feature could be the number of times the corresponding word appears in the document.
    - Any single document uses only a small fraction of the vocabulary, so most features (tokens) will have a value of zero, resulting in a sparse matrix of features.

- This technique for vectorizing text is referred to as a bag-of-words model. 
    - It is called bag of words because the document's structure is lost — as if the words are all jumbled up in a bag.
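
To make the bag-of-words idea concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the `vect_demo` name are illustrative assumptions, not part of the Yelp data). Each row is a document, each column is a token, and each cell is a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not from the Yelp data).
docs = ["the food was great", "the service was slow", "great great food"]

vect_demo = CountVectorizer()
dtm = vect_demo.fit_transform(docs)  # sparse document-term matrix

# Recover the column order from the learned vocabulary (token -> column index).
vocab = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
print(vocab)
print(dtm.toarray())  # dense view is fine for a toy corpus only
```

Word order is gone: the third document is represented only by the fact that "great" occurs twice and "food" once.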
    
Now we can apply a machine learning classifier.

![DTM](images/DTM.png)

## Preprocessing 
**Before we vectorize (or 'tokenize') our text, we may use a variety of preprocessing tools.**

### In groups, think about which preprocessing steps make sense before we create a matrix like the example above
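
As one possible starting point for the discussion, the sketch below applies a few common steps (lowercasing, stripping punctuation and digits, removing stop words) using only the standard library. The `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical and chosen for illustration. Whether each step helps depends on the task.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "it", "was", "on", "my", "me", "for", "here"}

def preprocess(text):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and digits become spaces
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

sample = "My wife took me here on my birthday for breakfast and it was excellent."
tokens = preprocess(sample)
print(tokens)
```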



In [11]:
#sample 'document'
yelp.text.values[0]

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

In [18]:
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Use CountVectorizer to create document-term matrices from X_train and X_test. Note parameters.
vect = CountVectorizer(lowercase=True,min_df=2,ngram_range=(1,2),
                       stop_words='english')

# fit: Learn a vocabulary dictionary of all tokens in the raw documents.
# transform: Transform documents to document-term matrix.

# for train we fit and transform
X_train_dtm = vect.fit_transform(X_train)


#### What about the test set? Should we fit, transform, or do both? Why?

In [20]:
X_test_dtm = vect.transform(X_test)
X_train_dtm.shape

(3064, 21816)
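
To see why the test set is only transformed, here is a minimal sketch on a hypothetical two-document training set (all names and texts are made up). The vocabulary is learned from the training text alone, so the test matrix has the same columns and any unseen word is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good food", "bad service"]  # hypothetical training text
test_docs = ["good pizza"]                 # "pizza" never appears in training

cv = CountVectorizer()
train_dtm = cv.fit_transform(train_docs)  # learns the vocabulary: bad, food, good, service
test_dtm = cv.transform(test_docs)        # reuses that vocabulary; "pizza" is ignored

print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())
```

Fitting on the test set as well would produce a different set of columns, and the classifier trained on the training matrix could no longer be applied to it.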

In [21]:
# We can access the tokens in our vocabulary using the get_feature_names_out method of the CountVectorizer object
# (the older get_feature_names method was removed in scikit-learn 1.2)
print(vect.get_feature_names_out()[-50:].tolist())

['young girl', 'young kids', 'young man', 'young woman', 'younger', 'youngest', 'youthful', 'youtube', 'yr', 'yr old', 'yrs', 'yuck', 'yucky', 'yukon', 'yukon gold', 'yum', 'yum definitely', 'yum dinner', 'yum think', 'yum yum', 'yumm', 'yumminess', 'yummm', 'yummmm', 'yummy', 'yummy bread', 'yummy chicken', 'yummy dessert', 'yummy drinks', 'yummy food', 'yummy little', 'yummy meal', 'yummy nice', 'yummy pizza', 'yummy ve', 'yummy yummy', 'yung', 'yup', 'zach', 'zen', 'zero', 'zero stars', 'zinburger', 'zinc', 'zip', 'zoe', 'zoo', 'zucchini', 'zuchinni', 'zuzu']


In [17]:
# Our transformed training data is a big pivot table stored in a memory-efficient sparse format. We could convert it
# to a dense array or DataFrame, but that would use far more memory.
type(X_train_dtm)

scipy.sparse.csr.csr_matrix
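
To illustrate what "memory efficient" means here, the sketch below builds a small sparse matrix from hypothetical counts and measures its density, the fraction of cells that are non-zero. The same `nnz` and `shape` attributes are available on a document-term matrix such as `X_train_dtm`.

```python
from scipy import sparse

# Hypothetical 3 x 5 count matrix with only four non-zero entries.
dense = [[0, 2, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [1, 0, 0, 0, 3]]
m = sparse.csr_matrix(dense)

# Only the non-zero entries (plus their positions) are stored.
density = m.nnz / (m.shape[0] * m.shape[1])
print(m.nnz, density)
```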

In [19]:
# Use Logistic Regression to predict the star rating.

# instantiate
logr = LogisticRegression()
# fit
logr.fit(X_train_dtm, y_train)
# predict and store predictions
y_pred_class = logr.predict(X_test_dtm)
# null model: the accuracy we would get by always predicting the most frequent class
print('null_model: '+str(y_test.value_counts(normalize=True).values[0]))
# Calculate accuracy.
print('model accuracy: '+str((accuracy_score(y_test, y_pred_class))))

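
The null-model baseline can also be computed with scikit-learn's `DummyClassifier`. This is a minimal sketch on hypothetical labels (`y_demo` and `X_demo` are made up). It should agree with the majority-class proportion.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: eight 5-star reviews and two 1-star reviews.
y_demo = np.array([5, 5, 5, 5, 5, 5, 5, 5, 1, 1])
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```

A model is only interesting to the extent that its accuracy beats this number.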

In [None]:
# it's a logistic regression, so we can return coefficients that indicate words that predict a positive review
features = np.array(vect.get_feature_names_out())
logr_coefs = pd.DataFrame({'coef':logr.coef_[0]},index=features)
logr_coefs = logr_coefs.sort_values('coef',ascending=False)
logr_coefs.head(10)

In [None]:
# as well as words that predict negative reviews
logr_coefs.tail(10)

## Term Frequency–Inverse Document Frequency (TF–IDF)

Term frequency–inverse document frequency (TF–IDF) weights a word by how often it appears in a document, discounted by how many documents in the corpus contain it.

It is more useful than simple term frequency (the count vectorizer above) for identifying the "important" words in each document: those with high frequency in that document and low frequency in other documents. It does not necessarily produce a more accurate model, however.
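
The weighting can be worked through by hand. The sketch below uses only the standard library and assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which scikit-learn documents as the `TfidfVectorizer` defaults. The pre-tokenized toy documents and the `tfidf` helper are hypothetical.

```python
import math

# Hypothetical pre-tokenized documents.
docs = [["call", "you", "tonight"],
        ["call", "me", "cab"],
        ["please", "call", "me", "please"]]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency: in how many documents does each token appear?
doc_freq = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed inverse document frequency (assumed to match scikit-learn's default).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf(doc):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    raw = {t: doc.count(t) * idf[t] for t in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(docs[2])
for token, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(token, round(weight, 3))
```

In the third document, "please" receives the largest weight because it is frequent there and absent elsewhere, while "call" receives the smallest because it appears in every document.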

### Example 

In [None]:
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# Term frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf

In [None]:
# Document frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, -1), columns=vect.get_feature_names_out())

In [None]:
# Term frequency–inverse document frequency (simple version)
tf/df

In [None]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())

## Pipelines

In [None]:
# Pipeline of transforms with a final estimator. Sequentially apply a list of transforms and a final estimator. 
# Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. 
# The final estimator only needs to implement fit.

from sklearn.pipeline import make_pipeline
model = make_pipeline(TfidfVectorizer(lowercase=True,
                                      max_df=1.0, 
                                      min_df=2,
                                      ngram_range=(1,2),
                                      stop_words='english'),
                      LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
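
A fitted pipeline accepts raw text directly and keeps its steps accessible by name. Here is a minimal sketch on hypothetical toy reviews (the texts, labels, and the `pipe` name are made up). `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and star labels.
texts = ["great food", "great service", "terrible food", "terrible service"]
stars = [5, 5, 1, 1]

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, stars)  # vectorizer is fit on the training text only

step_names = list(pipe.named_steps)
print(step_names)

# Raw strings go in; the pipeline vectorizes them with the learned vocabulary.
preds = pipe.predict(["great pizza", "terrible pizza"])
print(preds)
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold during cross-validation, which avoids leaking vocabulary from held-out data.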

# Sentiment

In [None]:
tweets = pd.read_csv("./data/Tweets.csv",encoding = "ISO-8859-1")
tweets.head()

In [None]:
tweets.airline_sentiment.value_counts()

In [None]:
# negative sentiment example
tweets.loc[3,'text']

### Intuitively, what process might we use to compute sentiment?

In [None]:
# Example process: https://github.com/cjhutto/vaderSentiment
# Review the VADER lexicon and how it scores text.
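
One intuitive process is a lexicon lookup: assign each word a valence and add the valences up. The sketch below is a toy version with a hypothetical four-word lexicon and a crude negation rule. It is not VADER's actual algorithm, which uses a much larger lexicon and additional rules.

```python
# Hypothetical mini-lexicon mapping words to valence scores (values are made up).
LEXICON = {"great": 2.0, "love": 3.0, "late": -1.5, "terrible": -3.0}
NEGATIONS = {"not", "never", "no"}

def lexicon_score(text):
    """Sum word valences, flipping the sign of a word that directly follows a negation."""
    score = 0.0
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        valence = LEXICON.get(tok, 0.0)  # unknown words contribute nothing
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        score += valence
    return score

positive = lexicon_score("I love this airline")
negated = lexicon_score("the flight was not great")
print(positive, negated)
```

Even this toy shows where lexicon methods can fail: sarcasm, words missing from the lexicon, and negations that are not adjacent to the word they modify.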

In [None]:
# install vaderSentiment first (e.g. pip install vaderSentiment)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [None]:
# get polarity score for example - output is a dictionary
print(tweets['text'][0])
example = sia.polarity_scores(tweets['text'][0])
example

In [None]:
# get polarity scores for each observation and add each score to the DataFrame
def unload_dict(row):
    # with axis=1, apply passes one row (a Series) at a time
    polarity = sia.polarity_scores(row['text'])
    row['compound'] = polarity['compound']
    row['neg'] = polarity['neg']
    row['neu'] = polarity['neu']
    row['pos'] = polarity['pos']
    return row

tweets = tweets.apply(unload_dict, axis=1)
tweets.sample(3)
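
Row-wise `apply` can be slow on large frames. An alternative pattern is to score the text column once and expand the resulting dictionaries into columns. The sketch below uses a hypothetical stand-in scorer, `fake_scores`, so that it runs without vaderSentiment installed; its numbers are meaningless and only the shape of the output matters. In the notebook, `sia.polarity_scores` would take its place.

```python
import pandas as pd

def fake_scores(text):
    """Stand-in for sia.polarity_scores (hypothetical): returns one dict per text."""
    n = len(text.split())
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": float(n)}

demo = pd.DataFrame({"text": ["on time today", "lost my bag again sadly"]})

# One dict per row becomes one row of score columns, aligned on the original index.
scores = pd.DataFrame(demo["text"].map(fake_scores).tolist(), index=demo.index)
demo = demo.join(scores)
print(demo.columns.tolist())
```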

In [None]:
# this is a very high compound score - note that this is a fallible process
tweets['text'][2728]