In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.decomposition import NMF
import pandas as pd
import numpy as np
from collections import Counter

## Representing text data
This notebook is an introduction to using NLP tools to create informative representations of text data.  It accompanies [this slide deck]().  Refer to both for a complete understanding of this walkthrough.

### Table of Contents
* [Data processing](#data)
* [Word Counts](#word)
* [TF-IDF](#tfidf)
* [Topic models](#topic)



### Data processing <a class="anchor" id="data"></a>

Up first is some preprocessing.  You'll either need to download the [imdb review data](https://ai.stanford.edu/~amaas/data/sentiment/) and save it to this directory OR download the [processed data](https://drive.google.com/file/d/1oN_fO91IBkDHD_u6WXiUCvhhyNexQDJq/view?usp=sharing).

In [None]:
# # processing the original data into DataFrame
# # here for reference, don't need to run this if you're using reviews.pkl.gz
# from pathlib import Path
# source_path = Path('./aclImdb/')
# #neg_files = source_path.glob('./*/neg/*.txt')
# #pos_files = source_path.glob('./*/pos/*.txt')
# all_files = []
# for f in source_path.glob('./*/*/*.txt'):
#     filename = f.as_posix()
#     if 'unsup' not in filename:
#         # split up into useful components
#         _, split, sent, idx = filename.split('/')
#         idx = int(idx.split('_')[0])
#         all_files.append([idx, split, sent, f.read_text()])
# review_df = pd.DataFrame(all_files)
# review_df.columns = ['idx', 'split', 'label', 'text']
# # some minor html cruft is in here
# review_df['text'] = review_df['text'].str.replace('<br /><br />', '')
# # to_pickle returns None, so don't assign its result back to review_df
# review_df.to_pickle('reviews.pkl.gz')

In [None]:
# can skip here if you already have reviews.pkl.gz
review_df = pd.read_pickle('reviews.pkl.gz')
# making this boolean for ease of use
review_df['label'] = review_df['label'] == 'pos'
review_df.head()

### Word counts <a class="anchor" id="word"></a>
One of the simplest things to do with a list of tokens is to count them. Word counts surface a lot of insight right away and are an important step in exploratory analysis of text data.

In [None]:
# take a positive and negative review for examples
neg_review = review_df.loc[~review_df.label].iloc[0]['text']
pos_review = review_df[review_df.label].iloc[0]['text']
print('Negative\n', neg_review, '\n')
print('Positive\n', pos_review)

In [None]:
# base python Counter - split on whitespace, then count the tokens
counter_dtm = Counter(neg_review.split())
print(counter_dtm)

#### What do we notice already?

* Capitalization matters!
* Basic "stopwords" (the, it) dominate frequencies
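
A minimal sketch of how these two observations change the counts. The sentence and the tiny stopword set below are made up for illustration and are not part of the review data:

```python
from collections import Counter

# hypothetical example text and stopword set, for illustration only
text = "The movie was bad and the acting was Bad"
stopwords = {"the", "was", "and"}

# raw whitespace split: 'The'/'the' and 'bad'/'Bad' are counted separately
raw_counts = Counter(text.split())

# lower-case first, then drop stopwords
clean_tokens = [tok for tok in text.lower().split() if tok not in stopwords]
clean_counts = Counter(clean_tokens)

print(raw_counts.most_common(3))
print(clean_counts.most_common(3))
```

After lower-casing, the two spellings of "bad" collapse into one entry, and removing stopwords leaves only the content words.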



We'll be using scikit-learn's CountVectorizer, which makes some decisions for us by default.

In [None]:
# scikit-learn's countvectorizer
count = CountVectorizer()
neg_vec = count.fit_transform([neg_review])
neg_vec

`CountVectorizer` outputs a sparse matrix by default.  This is our "document-term matrix".  Since we have one document, the first dimension is 1.  The second dimension, 76, is the number of unique tokens in the vectorizer's vocabulary.

We can create something comparable to the result from Counter:

In [None]:
sk_dtm = dict(zip(count.get_feature_names_out(), 
                  neg_vec.toarray().flatten()))
print(sk_dtm)

Some things have changed! Let's take a look:

In [None]:
# compare two vocabularies
set(counter_dtm.keys()).difference(set(sk_dtm.keys()))

#### What changed?
* Automatically lower-cased all tokens
* Removed single-character tokens
* Punctuation doesn't count!
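
These three changes follow from `CountVectorizer`'s defaults: `lowercase=True` and a `token_pattern` that only keeps runs of two or more word characters. The sketch below approximates that behavior with the standard library; the helper name and the example sentence are hypothetical, and the real analyzer has more options than this covers:

```python
import re

# the default token_pattern documented for CountVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def approx_default_tokens(text):
    """Lower-case, then keep runs of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

# hypothetical sentence for illustration
tokens = approx_default_tokens("I saw a GREAT film, didn't I?")
print(tokens)
```

Single-character tokens such as "I" and "a" disappear, punctuation is never matched, and the apostrophe splits "didn't" so that only "didn" survives.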

Let's set up a training and a test set with a 70/30 split.  Ideally you'd also have a validation set and use that for model selection, but I'm just keeping this simple for experimentation.

In [None]:
# we're going to use this train/test split throughout
# we'll also use this seed for consistency
# NOTE: Usually you'll want a separate validation set when choosing models/feature sets!
seed = 37
np.random.seed(seed)
pct_train = 0.7
# pass the seed explicitly so the split doesn't depend on global random state
X_train, X_test, y_train, y_test = train_test_split(
    review_df['text'],
    review_df['label'], train_size=pct_train, random_state=seed)
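
For reference, one way to carve out a separate validation set is to call `train_test_split` twice. This is only a sketch on made-up stand-in data (the `texts` and `labels` lists are hypothetical); the rest of this notebook keeps the simpler two-way split:

```python
from sklearn.model_selection import train_test_split

# hypothetical stand-in data: 100 "documents" with alternating labels
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 == 0 for i in range(100)]

# first split off 70% for training
X_tr, X_rest, y_tr, y_rest = train_test_split(
    texts, labels, train_size=0.7, random_state=37)
# then divide the remaining 30% evenly into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=37)

print(len(X_tr), len(X_val), len(X_te))
```

Model and feature choices would then be compared on the validation portion, with the test portion held back for a final check.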

#### Sentiment analysis with Count Vectors
Here we use the representation from our count vectorizer to try to predict sentiment.

* Representation: Count vectors
* Application: Predict sentiment

Here I use scikit-learn's [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).  This allows us to construct a complete pipeline that can be used to fit both vocabulary and model and then applied to the test set.


In [None]:
# setting up the vectorizer
# limit the number of features - otherwise will have all the words in the corpus
# removing stopwords (common words like 'the', 'and', etc.)
count = CountVectorizer(stop_words='english', min_df=0.05)

# pipeline contains the preprocessor and the model
count_pipeline = Pipeline(
    steps=[("preprocessor", count),
          ('model', LogisticRegression(max_iter=500, solver='liblinear'))]
)

In [None]:
# setting the seed to keep the performance consistent
np.random.seed(seed)
count_pipeline.fit(X_train, y_train)
# print out the size of the vocabulary
print('Vocab size:', len(count_pipeline.named_steps['preprocessor'].vocabulary_))
# built-in round works whether score returns a Python float or a numpy float
print('Accuracy on test set:', round(count_pipeline.score(X_test, y_test), 2))

In [None]:
print(
    classification_report(y_pred=count_pipeline.predict(X_test),
                          y_true=y_test))

This is pretty good for a first pass.  81% of the time, we're predicting the correct sentiment.  But are there ways to do better?

#### Experiments to improve our model
Experiment with some of the parameters of [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to create different representations that may yield better results.

* By including stopwords, we actually see minor improvements in performance
* Expanding the number of features increases performance, but leads to overfitting
* Experiments with n-grams can yield improvements

#### What is driving performance?
Are there particular words that are predictive in the featureset?

In [None]:
# visualize highest magnitude coefficients in model
vocab = count_pipeline.named_steps['preprocessor'].get_feature_names_out()
coefs = count_pipeline.named_steps['model'].coef_.flatten()
# argsort is ascending, so the largest magnitudes are at the end
mag_ordered = np.argsort(np.abs(coefs))
top = 10
top_mag_ordered = mag_ordered[-top:]
print(f'Top {top} coefficients by magnitude')
for idx in top_mag_ordered:
    print(vocab[idx], coefs[idx])


These coefficients are weights the model has learned based on the training data.  We may want to experiment with other methods for "weighting" the different features.  In this case we do it as part of the preprocessing.

### TF-IDF <a class="anchor" id="tfidf"></a>
One thing we notice with count vectors is that all words are being counted the same.  We might want to use a weighting scheme to ensure that words that are more informative about the content are flagged as more important.  One weighting scheme is Term Frequency - Inverse Document Frequency (TF-IDF).

Take as an example a few simplistic movie reviews.  We can already tell which words are most relevant to the specific content of each review (i.e. "good", "bad", "great").

In [None]:
docs = ['The movie was good',
        'The movie was bad',
        'The movie was great']

cv = CountVectorizer()
vecs = cv.fit_transform(docs).toarray()
# we'll use pandas DF for easier display
pd.DataFrame(vecs, columns=cv.get_feature_names_out())

You'll notice that `vecs` contains the term frequencies.  If we use sklearn's `TfidfVectorizer`, it will calculate those term counts and then multiply them by the Inverse Document Frequency (IDF).

In [None]:
tfidf = TfidfVectorizer()
# we'll use pandas DF for easier display
tfidf_vecs = tfidf.fit_transform(docs).toarray()
tfidf_df = pd.DataFrame(tfidf_vecs, columns=tfidf.get_feature_names_out())
tfidf_df

#### What do you notice here?
What has happened due to the IDF weighting?
* The words that make each review "unique" are up-weighted
* Words that are in common are downweighted

Note: Hand-calculation with the textbook formula will give you different results.  By default, sklearn smooths the IDF (`smooth_idf=True`) and L2-normalizes each document vector (`norm='l2'`).
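
To make those adjustments concrete, here is a hand calculation for the three toy reviews, assuming sklearn's default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper and variable names are hypothetical, and tokenization is simplified to lower-casing and splitting on whitespace:

```python
import math

# the same three toy reviews, tokenized by lower-casing and splitting
toy_docs = ["The movie was good", "The movie was bad", "The movie was great"]
tokenized = [d.lower().split() for d in toy_docs]
toy_vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# document frequency: number of documents containing each term
doc_freq = {term: sum(term in doc for doc in tokenized) for term in toy_vocab}
# smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in toy_vocab}

def tfidf_row(doc):
    # raw term count times IDF, then scale the row to unit L2 length
    weights = [doc.count(term) * idf[term] for term in toy_vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return {term: w / norm for term, w in zip(toy_vocab, weights) if w > 0}

first_row = tfidf_row(tokenized[0])
print({term: round(w, 4) for term, w in first_row.items()})
```

Words shared by all three reviews get an IDF of exactly 1, while a word that appears in only one review gets `ln(2) + 1`, so "good" ends up with a larger weight than "the", "movie", or "was" in the first row.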

In [None]:
# set up our tfidf vectorizer the same way as the count vectorizer
tfidf = TfidfVectorizer(stop_words='english', min_df=0.05)

tfidf_pipeline = Pipeline(
    steps=[("preprocessor", tfidf),
          ('model', LogisticRegression(max_iter=500, solver='liblinear'))]
)

In [None]:
np.random.seed(seed)
tfidf_pipeline.fit(X_train, y_train)
# built-in round works whether score returns a Python float or a numpy float
print('Accuracy on test set:', round(tfidf_pipeline.score(X_test, y_test), 2))
print(
    classification_report(y_pred=tfidf_pipeline.predict(X_test),
                          y_true=y_test))

The performance is pretty comparable here.  It's useful to consider why that might be.

#### Why might IDF-weighted representations not improve our performance?
* What is the goal of IDF weighting?
* What is the goal of our sentiment analysis prediction?
* What might be a better application of a representation like this one?

In [None]:
# cosine similarity between positive and negative sentiment reviews
from sklearn.metrics.pairwise import cosine_similarity

# we'll use the tfidf vectorizer for this
tfidf_vecs = tfidf.transform(X_test).toarray()
cv_vecs = count.transform(X_test).toarray()
# take the mean of similarity between positive and negative reviews
print('Similarity between positive and negative review vectors')
print('TFIDF:', cosine_similarity(tfidf_vecs[y_test], tfidf_vecs[~y_test]).mean(1).mean())
print('Count:', cosine_similarity(cv_vecs[y_test], cv_vecs[~y_test]).mean(1).mean())

In both examples above, we used many features.  It may be valuable to create a more "generalized" view of the words used in reviews.  This would give us some interpretability about the subjects covered and may yield a useful representation for other use cases.

### Topic Models <a class="anchor" id="topic"></a>


In [None]:
def display_components(model, word_features, top_display=5):
    # utility for displaying representative words per component for topic models
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        top_words_idx = topic.argsort()[::-1][:top_display]
        top_words = [word_features[i] for i in top_words_idx]
        print(" ".join(top_words))

In [None]:
# choose the number of components (topics)
n_components = 10
# adding a few tweaks, just based on experimentation
nmf = NMF(n_components=n_components,
          init='nndsvda',
          max_iter=500)
# NMF typically uses tfidf, not word counts
# fit tfidf vectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_vecs = tfidf.fit_transform(review_df['text'])
nmf_vecs = nmf.fit_transform(tfidf_vecs)

NMF provides a components matrix which corresponds to the loading of each word on each topic.  Higher values mean the word is more relevant to that topic.  With the `display_components` function defined above, we can display some of the "representative" words from each topic.

In [None]:
display_components(nmf, tfidf.get_feature_names_out())

#### What do we observe about these topics?
* Topic 7 - horror movies
* Topic 6 - TV series reviews
* Sample from these topics and assess your labels


In [None]:
review_df.iloc[np.argsort(nmf_vecs[:, 7])[-2:]]['text'].values

Let's see how this does with sentiment analysis.

In [None]:
n_components = 10
tfidf = TfidfVectorizer(stop_words='english', min_df=0.05)
nmf = NMF(n_components=n_components,
          init='nndsvda',
          max_iter=500)
clf_nmf_pipeline = Pipeline(
    steps=[("preprocessor", tfidf),
           ("topic", nmf),
           ("model", LogisticRegression(max_iter=500, solver='liblinear'))]
)



In [None]:
np.random.seed(seed)
clf_nmf_pipeline.fit(X_train, y_train)
print(f'accuracy: {clf_nmf_pipeline.score(X_test, y_test)}')
print(
    classification_report(y_pred=clf_nmf_pipeline.predict(X_test),
                          y_true=y_test))

Again, slightly worse, but here we're looking at 10 features versus 255 in the other representations.  With some tuning, we might be able to get comparable performance with a much smaller feature vector.

One thing this pipeline may be useful for is seeing performance across different "topics".

#### Examine performance across topics
* Identify the "top" topic for a particular review (`np.argmax`)
* Use either the tfidf or count pipeline to predict sentiment
* Examine the performance distribution and interpret

In [None]:
# performance of tfidf pipeline by nmf component
nmf_component = np.argmax(nmf.transform(tfidf.transform(X_test)), axis=1)
tfidf_pred = tfidf_pipeline.predict(X_test)
y_true = y_test.to_numpy()
for i in range(n_components):
    in_component = nmf_component == i
    print(f'Component {i}')
    # a component that is never the top topic has no reviews, so its accuracy is nan
    print(f'Accuracy: {np.mean(tfidf_pred[in_component] == y_true[in_component])}')