# Week 2: From tokens to vectors
This notebook accompanies the week 2 lecture.

In [1]:
# doing this to avoid some warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re


required = {'spacy', 'scikit-learn', 'numpy', 'pandas'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
import pickle

from spacy.lang.en import English
en = English()

def simple_tokenizer(doc, model=en):
    # a simple tokenizer for a single document:
    # lowercases tokens, keeps alphabetic ones and drops URLs
    parsed = model(doc)
    return [t.lower_ for t in parsed if t.is_alpha and not t.like_url]

## Word counts revisited
Let's remind ourselves how sklearn's CountVectorizer worked (from last week).

In [3]:
# scikit-learn's countvectorizer
# use our custom tokenizer
cv = CountVectorizer(tokenizer=simple_tokenizer)

In [4]:
# data
text_data = ["I'm taking a course at Harvard.",
            "I'm learning about Natural Language Processing.",
            "We are studying tokenization, vectorization and modelling.",
            "Check out the course on Github: https://github.com/bpben/nlp_lessons"]
# outputs sparse array, want to use a normal numpy array
v = cv.fit_transform(text_data).toarray()
# get_feature_names_out gets the vocabulary of the vectorizer in order
# (get_feature_names was removed in scikit-learn 1.2)
dict(zip(cv.get_feature_names_out(), v.sum(axis=0)))

{'a': 1,
 'about': 1,
 'and': 1,
 'are': 1,
 'at': 1,
 'check': 1,
 'course': 2,
 'github': 1,
 'harvard': 1,
 'i': 2,
 'language': 1,
 'learning': 1,
 'modelling': 1,
 'natural': 1,
 'on': 1,
 'out': 1,
 'processing': 1,
 'studying': 1,
 'taking': 1,
 'the': 1,
 'tokenization': 1,
 'vectorization': 1,
 'we': 1}

Works as expected.  Why don't we try this on Assignment 1's dataset?

In [5]:
# you will need to change this to wherever the file is stored
data_location = '../data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
# for simplicity, let's split these into separate sets
# (look up by key rather than relying on the dict's ordering)
neg, pos = all_text['neg'], all_text['pos']

[('neg', 1233), ('pos', 1266)]


In [None]:
# running this on negative reviews
cv = CountVectorizer(tokenizer=simple_tokenizer)
neg_vectors = cv.fit_transform(neg).toarray()
# get_feature_names_out gets the vocabulary of the vectorizer in order
word_count = dict(zip(cv.get_feature_names_out(), neg_vectors.sum(axis=0)))
# get the top 10 words
sorted(word_count.items(), key=lambda x: x[1], reverse=True)[:10]

In [None]:
# now do it for positive reviews
cv = CountVectorizer(tokenizer=simple_tokenizer)
pos_vectors = cv.fit_transform(pos).toarray()
# get_feature_names_out gets the vocabulary of the vectorizer in order
word_count = dict(zip(cv.get_feature_names_out(), pos_vectors.sum(axis=0)))
# get the top 10 words
sorted(word_count.items(), key=lambda x: x[1], reverse=True)[:10]

These words aren't particularly informative about the content.  Sklearn's CountVectorizer has some additional options that may lead to somewhat more informative frequent terms.
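To make the `min_df` and `max_df` options used in the next cell more concrete, here is a minimal pure-Python sketch of document-frequency filtering. It assumes, as we understand sklearn's behaviour, that float thresholds are proportions of documents and that terms exactly at a threshold are kept. The function name and the toy documents are made up for illustration.

```python
from collections import Counter

def document_frequency_filter(tokenized_docs, min_df=0.0, max_df=1.0):
    """Keep terms whose share of documents lies between min_df and max_df (inclusive)."""
    n_docs = len(tokenized_docs)
    # count each term once per document, however often it repeats inside it
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(t for t, n in df.items() if min_df <= n / n_docs <= max_df)

# hypothetical, already-tokenized documents
toy_docs = [["the", "plot", "was", "thin"],
            ["the", "acting", "was", "great"],
            ["the", "plot", "twist", "was", "great"],
            ["the", "ending", "was", "rushed"]]

kept = document_frequency_filter(toy_docs, min_df=0.5, max_df=0.9)
print(kept)
```

Terms that appear in every document ("the", "was") exceed `max_df`, and terms that appear in only one document fall below `min_df`, so only the mid-frequency terms survive.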

In [None]:
for corpus in [neg, pos]:
    cv = CountVectorizer(tokenizer=simple_tokenizer, min_df=0.01, max_df=0.9,
                        stop_words='english')
    vectors = cv.fit_transform(corpus).toarray()
    # get_feature_names_out gets the vocabulary of the vectorizer in order
    word_count = dict(zip(cv.get_feature_names_out(), vectors.sum(axis=0)))
    # get the top 10 words
    print(sorted(word_count.items(), key=lambda x: x[1], reverse=True)[:10])

This is better, but it seems like we'd have to tweak these thresholds a lot and carefully choose our stop words.  Is there a more standard way to extract the most informative words from documents?

## Term Frequency-Inverse Document Frequency (TF-IDF)
See the slides for more information on this.  In this section we'll show how TF-IDF is essentially just a weighting of the count vectors.  We'll then use sklearn's built-in TfidfVectorizer on our sentiment corpora.

In [None]:
docs = ['The movie was good',
        'The movie was bad',
        'The movie was great']

cv = CountVectorizer(tokenizer=simple_tokenizer)
vecs = cv.fit_transform(docs).toarray()
# we'll use pandas DF for easier display
pd.DataFrame(vecs, columns=cv.get_feature_names_out())

You'll notice that `vecs` contains the term frequencies.  If we use sklearn's `TfidfVectorizer`, it will calculate those term counts and then multiply them by the Inverse Document Frequency (IDF).

The formula sklearn uses is a bit different from the textbook:

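$$\text{idf}(t) = \ln\left(\frac{N+1}{df(t)+1}\right) + 1$$

where $N$ is the number of documents and $df(t)$ is the number of documents that contain term $t$.  By default, sklearn then scales each document's vector to unit length (L2 norm) to account for documents of different lengths (see slides).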
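As a rough check on the formula above, the sketch below computes the smoothed IDF and a unit-length TF-IDF vector by hand using only the standard library. Splitting on whitespace is a stand-in for `simple_tokenizer`, and the helper name `tfidf_vector` is made up for this example.

```python
import math
from collections import Counter

docs = ["The movie was good", "The movie was bad", "The movie was great"]
# whitespace tokenization is an assumption made to keep the example small
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))
# smoothed IDF, following the formula above
idf = {term: math.log((n_docs + 1) / (count + 1)) + 1 for term, count in df.items()}

def tfidf_vector(doc):
    """TF-IDF weights for one tokenized document, scaled to unit L2 length."""
    tf = Counter(doc)
    raw = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

first = tfidf_vector(tokenized[0])
print({term: round(w, 3) for term, w in first.items()})
```

Words that occur in every document get an IDF of exactly 1, while "good" (one document out of three) gets $\ln(2) + 1$, so it ends up with the largest weight in the first document.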

In [None]:
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer)
# we'll use pandas DF for easier display
tfidf_vecs = tfidf.fit_transform(docs).toarray()
tfidf_df = pd.DataFrame(tfidf_vecs, columns=tfidf.get_feature_names_out())
tfidf_df

You can see that the discriminative words (i.e. bad, good, great) have higher weight than the non-discriminative words.  

We see this at the document level, but is there a way we could get some kind of aggregate measure of discriminative words?

### Exercise: Find the top 3 discriminative words
Use the dataset above to try and identify the words that, across the corpus, are particularly representative of content.

Hint: Think about what a weight of zero versus weight of non-zero means.

In [None]:
def top_tfidf_words(tfidf_df):
    return(tfidf_df[tfidf_df>0].mean(axis=0))

In [None]:
top_tfidf_words(tfidf_df)

Now let's run that on our movie reviews dataset.

In [None]:
for corpus in [neg, pos]:
    # adding in a minimum document frequency, so words need to occur at least somewhat often
    tfidf = TfidfVectorizer(tokenizer=simple_tokenizer, min_df=0.02)
    vectors = tfidf.fit_transform(corpus).toarray()
    tfidf_df = pd.DataFrame(vectors, columns=tfidf.get_feature_names_out())
    # get representative words
    tfidf_word_count = top_tfidf_words(tfidf_df)
    # get the top 10 words
    print(tfidf_word_count.sort_values().iloc[-10:])

These are somewhat useful aggregate measures.  But most of the information in TF-IDF is document-specific.

## Cosine Similarity
See the slides for detail on this.  Sklearn has an implementation that's useful here.

In [None]:
docs = ['The movie was good',
        'The movie was bad',
        'The movie was great']

cv = CountVectorizer(tokenizer=simple_tokenizer)
vecs = cv.fit_transform(docs).toarray()
# cosine similarity without a second argument (y) compares all docs to one another
cosine_similarity(vecs)

The diagonal holds each document's similarity to itself.  The off-diagonal entries are the similarities between doc x and doc y.  Each of these docs has the same words except for good, bad and great, so the similarity between every pair is the same.
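To see where those numbers come from, here is a small sketch that computes cosine similarity by hand on the three count vectors. The vocabulary order and the helper name `cosine` are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# count vectors over the assumed vocabulary [bad, good, great, movie, the, was]
good_doc = [0, 1, 0, 1, 1, 1]
bad_doc = [1, 0, 0, 1, 1, 1]
great_doc = [0, 0, 1, 1, 1, 1]

print(cosine(good_doc, good_doc),
      cosine(good_doc, bad_doc),
      cosine(good_doc, great_doc))
```

Each pair shares three of its four words, so the dot product is 3 and each vector has length 2, giving 3 / (2 × 2) = 0.75 for every off-diagonal pair.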

### Exercise: Cosine similarity two ways
- Using the movie reviews dataset, combine all reviews, but keep an indicator of which are positive and which are negative.
- Get count vectors and tf-idf vectors.
- Select a review and find the most similar and least similar reviews under both methods.
- Get the average similarity between positive and negative reviews.

Tips: 
- You don't need to convert any of this to DataFrames to do this work.  It'll be faster if you don't!
- You can use `np.fill_diagonal` to fill the diagonal entries for the similarity matrix with values

Think about: Why do we fit the vectors on all reviews, rather than separately?

In [6]:
# combine datasets
all_reviews = neg+pos
# index of the first positive review
first_pos = len(neg)
cv = CountVectorizer(tokenizer=simple_tokenizer)
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer)
# get vectors
count_vecs = cv.fit_transform(all_reviews)
tfidf_vecs = tfidf.fit_transform(all_reviews)
# get similarities
count_sims = cosine_similarity(count_vecs)
tfidf_sims = cosine_similarity(tfidf_vecs)

In [None]:
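# sample one, find most similar - count vectors
random_idx = np.random.randint(len(all_reviews))
# copy the row and mask the review's similarity to itself (always 1),
# otherwise argmax would just return the sampled review
sims = count_sims[random_idx].copy()
sims[random_idx] = -1
print(all_reviews[random_idx]+'\n'+
      all_reviews[np.argmax(sims)])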

In [None]:
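# sample one, find most similar - tfidf
random_idx = np.random.randint(len(all_reviews))
# copy the row and mask the review's similarity to itself (always 1),
# otherwise argmax would just return the sampled review
sims = tfidf_sims[random_idx].copy()
sims[random_idx] = -1
print(all_reviews[random_idx]+'\n'+
      all_reviews[np.argmax(sims)])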

In [None]:
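# compare average similarity within and across positive and negative reviews
# note: the within-class blocks include each review's similarity to itself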
for s_matrix in [count_sims, tfidf_sims]:
    print('neg-to-neg:', s_matrix[:first_pos, :first_pos].mean(axis=1).mean(),
          'neg-to-pos:', s_matrix[:first_pos, first_pos:].mean(axis=1).mean(),
          'pos-to-pos:', s_matrix[first_pos:, first_pos:].mean(axis=1).mean())


The conclusion here is that tf-idf seems to work better at actually retrieving similar reviews.  However, the average similarities show that a negative review's vocabulary isn't that much different from a positive review's.

This conclusion might change if we can distill some of the information from the counts into more meaningful dimensions.  That brings us to:

## Topic models: Non-negative Matrix Factorization and Latent Dirichlet Allocation

In [7]:
def display_components(model, word_features, top_display=5):
    # utility for displaying representative words per component for topic models
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        top_words_idx = topic.argsort()[::-1][:top_display]
        top_words = [word_features[i] for i in top_words_idx]
        print(" ".join(top_words))

In [8]:
# in this case, excluding standard english stop words
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer, stop_words='english')
tfidf_vecs = tfidf.fit_transform(all_reviews)
cv = CountVectorizer(tokenizer=simple_tokenizer, stop_words='english')
count_vecs = cv.fit_transform(all_reviews)

In [9]:
# choose the number of components (topics)
n_components = 10
# basic configuration
nmf = NMF(n_components=n_components)
# NMF accepts any non-negative matrix; we use tfidf weights here rather than raw counts
# same syntax as vectorizer
nmf_vecs = nmf.fit_transform(tfidf_vecs)
# LDA uses word counts
lda = LatentDirichletAllocation(n_components=n_components)
lda_vecs = lda.fit_transform(count_vecs)

In [12]:
nmf_vecs[0]

array([0.        , 0.02823817, 0.06582423, 0.00120355, 0.        ,
       0.01070525, 0.03576207, 0.01071412, 0.01198975, 0.01922563])

In [11]:
lda_vecs.sum(axis=1)

array([1., 1., 1., ..., 1., 1., 1.])

Both NMF and LDA provide a components matrix that holds the loading of each word on each topic.  Higher values mean the word is more relevant to that topic.

In [None]:
print(nmf.components_)

Both methods quantify how much is lost by using the topic model in place of the actual data (in the matrix formulation, $UV$ rather than $X$), but they do so differently.  For NMF, it's the reconstruction error, which directly measures the difference between the matrix decomposition and the actual data.  LDA is fit by maximizing the [ELBO](https://en.wikipedia.org/wiki/Evidence_lower_bound), which is too complicated to explain here; sklearn's `bound_` attribute reports the perplexity derived from that bound on the training data.  For both `reconstruction_err_` and `bound_`, higher values mean worse performance.  They can't be compared to one another, though.
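As an illustration of reconstruction error, the sketch below measures the Frobenius norm of $X - UV$ on a tiny made-up matrix. This is our understanding of what `reconstruction_err_` reports under NMF's default settings; the matrices and helper names here are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def frobenius_error(x, u, v):
    """Square root of the summed squared differences between X and UV."""
    approx = matmul(u, v)
    return math.sqrt(sum((x[i][j] - approx[i][j]) ** 2
                         for i in range(len(x)) for j in range(len(x[0]))))

# hypothetical 3-document x 4-term matrix and a 2-topic factorization
X = [[1, 1, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 1, 1]]
U = [[1, 0], [2, 0], [0, 1]]        # document-topic weights
V = [[1, 1, 0, 0], [0, 0, 1, 1]]    # topic-term weights (the role played by components_)
V_worse = [[1, 0, 0, 0], [0, 0, 1, 1]]  # drops one term from the first topic

print(frobenius_error(X, U, V), frobenius_error(X, U, V_worse))
```

The first factorization reproduces `X` exactly, so its error is zero. Removing a word from the first topic leaves part of `X` unexplained and the error rises, which is the sense in which higher values mean a worse fit.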

In [None]:
print(nmf.reconstruction_err_, lda.bound_)

In [None]:
display_components(nmf, tfidf.get_feature_names_out())

In [None]:
display_components(lda, cv.get_feature_names_out())

NMF seems to have come up with some reasonable topics, but LDA doesn't seem to work particularly well here.  It may make sense to try some additional token processing and see how that affects what we get out of the topic modelling process.

### Exercise: Tokenization decisions and topic models
Using the tokenizer from week 1 or your own tokenizer, explore how your tokenization decisions upstream might affect your results downstream.

In [None]:
# initialize model
nlp = spacy.load('en_core_web_sm')

def tokenize_full(docs, model=nlp, 
                  entities=False, 
                  stop_words=False, 
                  lowercase=True, 
                  alpha_only=True, 
                  lemma=True):
    """Full tokenizer with flags for processing steps
    entities: If False, replaces entities with their entity type
    stop_words: If False, removes stop words
    lowercase: If True, lowercases all tokens
    alpha_only: If True, removes all non-alpha tokens
    lemma: If True, lemmatizes words
    """
    tokenized_docs = []
    for d in docs:
        parsed = model(d)
        # token collector
        tokens = []
        # entity collector
        ent = ''
        for t in parsed:
            # only need this if we're replacing entities
            if not entities:
                # inside an entity: remember its type and move on
                if t.ent_iob_ != 'O' and not t.like_url:
                    ent = t.ent_type_
                    continue
                # an entity just ended: emit its type, then handle this token as usual
                if ent != '':
                    tokens.append(ent)
                    ent = ''
                # replace URLs
                if t.like_url:
                    tokens.append('URL')
                    continue
            # only include stop words if stop_words==True
            if t.is_stop and not stop_words:
                continue
            # only include non-alpha if alpha_only==False
            if alpha_only and not t.is_alpha:
                continue
            word = t.lemma_ if lemma else t.text
            if lowercase:
                # str.lower() returns a new string, so keep the result
                word = word.lower()
            tokens.append(word)
        # emit an entity that runs to the end of the document
        if ent != '':
            tokens.append(ent)
        tokenized_docs.append(tokens)
    return tokenized_docs

In [None]:
tokenized = tokenize_full(all_reviews, entities=True)

In [None]:
# if passing a list of tokens to a vectorizer, you can use the following syntax
tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False)
tfidf_vecs = tfidf.fit_transform(tokenized)
cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
count_vecs = cv.fit_transform(tokenized)

In [None]:
n_components = 10
nmf = NMF(n_components=n_components)
nmf_vecs = nmf.fit_transform(tfidf_vecs)
lda = LatentDirichletAllocation(n_components=n_components)
lda_vecs = lda.fit_transform(count_vecs)

In [None]:
display_components(nmf, tfidf.get_feature_names_out())

In [None]:
display_components(lda, cv.get_feature_names_out())

## Supervised learning: Using text features for prediction
Week 1 and all of the above focused on creating features from text.  The tokenization decisions are mainly deterministic: they output what you tell them to output.  The topic models step more into the "learning" aspect of analysis, where you ask the algorithm to find a decomposition that fits the data.  The output of interest in this case is a reconstruction of the data.

But what if you're not interested in reconstructing the data, but rather predicting a specific outcome? In that case, you'll need some amount of data with that outcome specified.  From there, you can ask the machine to learn the relationship between text features and the outcome.  This is, roughly, the idea behind supervised learning.

In this section, we'll introduce how to use text features in predicting an outcome.  Assignment 2 will focus on how to convert the work you did on Assignment 1 into a supervised learning problem.

In [None]:
reviews = ['I love these hot dogs',
          'I hate these hot dogs',
          'These hot dogs are really good',
          'These hot dogs are really bad']
is_positive = [1, 0, 1, 0]
cv = CountVectorizer(tokenizer=simple_tokenizer)
count_vecs = cv.fit_transform(reviews).toarray()

In [None]:
# fit/predict on full dataset
svc = LinearSVC()
svc.fit(count_vecs, is_positive)
svc.predict(count_vecs)

In [None]:
# try predicting on new observations
new_obs = ["I love these!",
           "I don't love these hot dogs"]
new_count = cv.transform(new_obs)
svc.predict(new_count)

So the model with its current features is 50% accurate on our new observations.  That's not great.  But this is a very small vocabulary and a small dataset.  Why don't we try with our movie reviews?

Remember: This dataset is already labelled.  A review is either positive or negative.

In [None]:
# create binary indicator for positive review
is_positive = np.array([0]*len(neg)+[1]*len(pos))
# sample random 70% for fitting model (training)
# 30% will be simulating "new observations" (testing)
pct_sample = 0.7
train_bool = np.random.random(len(all_reviews))<pct_sample
reviews_train = np.array(all_reviews)[train_bool]
reviews_test = np.array(all_reviews)[~train_bool]
is_positive_train = is_positive[train_bool]
is_positive_test = is_positive[~train_bool]
print(reviews_train.shape, reviews_test.shape)
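The random boolean mask above puts roughly, not exactly, 70% of reviews in the training set, and the split changes on every run. If you want an exact and repeatable split, one option is to shuffle indices with a fixed seed. The sketch below is one way to do that with the standard library; the function name and its defaults are made up for illustration.

```python
import random

def train_test_split_indices(n_items, train_fraction=0.7, seed=0):
    """Shuffle indices with a fixed seed and cut them into train and test sets."""
    rng = random.Random(seed)          # local generator, so the split is repeatable
    indices = list(range(n_items))
    rng.shuffle(indices)
    cutoff = int(round(train_fraction * n_items))
    return sorted(indices[:cutoff]), sorted(indices[cutoff:])

train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))
```

The returned index lists can be used to select rows from both the reviews and the labels, so the two stay aligned.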

In [None]:
# fit count vectorizer
cv = CountVectorizer(tokenizer=simple_tokenizer)
train_vecs = cv.fit_transform(reviews_train).toarray()
test_vecs = cv.transform(reviews_test).toarray()

In [None]:
# fit/predict on training dataset
svc = LinearSVC()
svc.fit(train_vecs, is_positive_train)
train_preds = svc.predict(train_vecs)
test_preds = svc.predict(test_vecs)

In [None]:
# scoring accuracy
print('Train accuracy:', accuracy_score(is_positive_train, train_preds))
print('Test accuracy:', accuracy_score(is_positive_test, test_preds))

Train accuracy is usually much higher than test accuracy.  That makes sense: The vocabulary and the model are fit to the training data.  But the bigger concern is how the model performs on data we haven't seen yet.  Test accuracy gives us a measure of that.  83% is not bad...but it could be better.
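For reference, accuracy is just the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and predictions, which are not results from the model above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy(y_true, y_pred))
```

Four of the six hypothetical predictions match, so the accuracy here is 4/6.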

Assignment 2 is all about trying to boost this accuracy by trying out some of the vectorization methods we've gone through here.