### Natural Language Processing II: More Preprocessing and Vectorization

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
# if you get an error with the above code, run this & follow below directions:
# import nltk
# nltk.download()

# Or this, if you're having issues
# import nltk
# nltk.download_shell()

Run `nltk.download()`. A new window will open outside your Jupyter notebook (it may be hidden behind other windows). In that window, select `all`, then click Download. When the download finishes, restart your Jupyter notebook and run the first three cells again.

Run:

```
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

- If this returns `cat`, then fantastic! You’re done. 
- If not, head to http://www.nltk.org/install.html, follow the instructions for your operating system, then try running the first three cells again.

#### Our Data

This dataset was built using our Pitchfork web scraper. It contains album reviews for new releases, along with basic information about each release.

In [1]:
#load data
url = 'https://raw.githubusercontent.com/jfkoehler/NYU-Bootcamp/master/notebooks/module_2/2.10_nlp_II/data/reviews.csv'


In [None]:
#look at info


In [None]:
# extract first review


In [None]:
# examine it


In [None]:
# plt.hist(reviews['scores'], edgecolor = 'black', color = 'brown', alpha = 0.4, bins = 20);
# plt.xlabel('Review Score')
# plt.grid()
# plt.title('Distribution of Pitchfork Review Scores', loc = 'left');

In [None]:
# average score?


In [None]:
# create good/bad version


In [None]:
# check the balance of classes


#### Basic Pipeline

- `CountVectorizer`
- `LogisticRegression`

In [None]:
# set up the pipeline


In [None]:
# define X and y


In [None]:
# train/test split


In [None]:
# fit the pipeline


In [None]:
# score on train


In [None]:
# score on test


In [None]:
# confusion matrix


In [None]:
# coefficients and vocabulary extraction


In [None]:
# dataframe of coefs and tokens


In [None]:
# ten tokens for positive


In [None]:
# ten for negative


#### More Preprocessing Ideas

In [None]:
# use word_tokenize on first review


In [None]:
# assign as a variable


#### Lemmatization

> *Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form* -- [Wikipedia](https://en.wikipedia.org/wiki/Lemmatisation)

In [None]:
# instantiate


In [None]:
# dogs


In [None]:
# churches


In [None]:
# computing 


In [None]:
# list comprehension to lemmatize all


In [None]:
# side by side comparison 


#### Stemming

> *In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root* -- [Wikipedia](https://en.wikipedia.org/wiki/Stemming)
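
As a quick illustration of the definition above, the sketch below stems a handful of related words with the `PorterStemmer` imported at the top of the notebook. The word list is an assumption chosen for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word list: plurals plus several forms built on "compute"
words = ["dogs", "churches", "computing", "computer", "computed"]

# Map each word to its stem so the results can be compared side by side
stems = {word: stemmer.stem(word) for word in words}
```

The three forms of "compute" should collapse to a shared stem that is not itself a dictionary word, which is the behavior the quote describes.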

In [None]:
# Instantiate PorterStemmer.


In [None]:
# dogs


In [None]:
# churches


In [None]:
# computing


#### TF-IDF

> *In information retrieval, tf–idf (also TF\*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.* -- [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

**Term Frequency**

$$ \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} $$

**Inverse Document Frequency**

$$ \mathrm{idf}(t, D) =  \log \frac{N}{|\{d \in D: t \in d\}|}$$
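
To make the two formulas concrete, here is a minimal pure-Python sketch that evaluates them on a made-up, pre-tokenized corpus. The documents and the helper names `tf` and `idf` are assumptions for illustration. Note that `TfidfVectorizer` by default uses a smoothed variant of idf and normalizes each row, so its numbers will not match these exactly.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized
docs = [
    ["the", "band", "plays", "the", "hits"],
    ["the", "album", "drags"],
    ["band", "and", "album", "shine"],
]

def tf(term, doc):
    """Count of term in doc divided by the total number of tokens in doc."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing term).

    Raises ZeroDivisionError if the term appears in no document.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

tf_the = tf("the", docs[0])    # 2 occurrences out of 5 tokens
idf_the = idf("the", docs)     # "the" appears in 2 of 3 documents
idf_hits = idf("hits", docs)   # "hits" appears in 1 of 3 documents

# tf-idf weight of each term within the first document
tfidf_the = tf_the * idf_the
tfidf_hits = tf("hits", docs[0]) * idf_hits
```

Although "the" occurs twice as often as "hits" in the first document, its lower idf offsets that advantage, which is the adjustment for common words that the quote describes.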

In [None]:
# instantiate


In [None]:
# fit


In [None]:
# dtm to array


In [None]:
# tokens from tfidf


In [None]:
# DataFrame


#### Using Custom Tokenizers with `sklearn` Vectorizers

Both `CountVectorizer` and `TfidfVectorizer` have a `tokenizer` argument that accepts a callable for custom tokenization strategies that include stemming or lemmatization.

In [None]:
def lemmatizer_fn(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
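
The same pattern works for stemming. The sketch below is one possible variant, assuming a hypothetical helper `stemmer_fn` and two made-up sentences; it uses `RegexpTokenizer` and `PorterStemmer`, neither of which needs a downloaded corpus.

```python
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer

def stemmer_fn(text):
    # Hypothetical helper: split on runs of word characters, then stem each token
    tokenizer = RegexpTokenizer(r"\w+")
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokenizer.tokenize(text)]

# token_pattern=None avoids the warning that the default pattern goes unused
stem_vect = CountVectorizer(tokenizer=stemmer_fn, token_pattern=None)

# Made-up sentences; the vectorizer lowercases text before calling the tokenizer
dtm = stem_vect.fit_transform(["The dogs barked", "A dog barks"])
vocab = sorted(stem_vect.vocabulary_)
```

Because both sentences reduce to the same stems for "dog" and "bark", they share columns in the document-term matrix that plain token counts would have kept separate.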

In [None]:
# raw string so that \w and \s reach the regex engine without an invalid-escape warning
X = reviews['review'][:100].str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation

In [None]:
# instantiate


In [None]:
# fit it


In [None]:
# examine words


**PROBLEM**

Treat 4- and 5-star reviews as good and everything else as bad. Build a `Pipeline` that uses a `TfidfVectorizer` and a `LogisticRegression` classifier to classify good and bad reviews.

In [None]:
url = 'https://raw.githubusercontent.com/jfkoehler/NYU-Bootcamp/master/notebooks/module_2/2.10_nlp_II/data/yelp.csv'

#### Sentiment Analysis

> *VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.*

In [None]:
#instantiate


In [None]:
# If you get an error for the above code, try running the following:
# import nltk
# nltk.download('vader_lexicon')

In [None]:
# Calculate sentiment of first review with .polarity_scores


In [None]:
# is it awesome?


In [None]:
# AWESOME


In [None]:
# AWESOME!!!


In [None]:
# :(
