# Week 5: Scoping an NLP project
This notebook accompanies the week 5 lecture.  The focus here is to do some surface-level explorations of the wide variety of datasets and models that are publicly available.

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
import pickle

required = {'spacy', 'transformers'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import json
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC
import torch
import transformers
# this will set the device on which to train
device = torch.device("cpu")
# if using collab, set your runtime to use GPU and use the line below
#device = torch.device("cuda:0")

I0718 17:09:34.731106 140735234007040 file_utils.py:39] PyTorch version 1.5.1 available.


### Exploring new dataset and models
There are a huge number of NLP datasets available for all kinds of tasks.  Check out some of these sources:

- [Google dataset search](https://datasetsearch.research.google.com/)
- [Repository of datasets on github](https://github.com/awesomedata/awesome-public-datasets#naturallanguage)
- [NLP dataset categorized by task](https://datasets.quantumstat.com/)

We'll be exploring some datasets here.  You can either download the full dataset or use the subset I'm providing.  The subset is just a random subsample for efficiency's sake.

Datasets:
- [Movie spoilers \(Kaggle\)](https://www.kaggle.com/rmisra/imdb-spoiler-dataset)

- [Wikidata](https://www.wikidata.org/wiki/Wikidata:Database_download)

Models/Libraries:
- [OpenAI GPT-2 (transformers library)](https://openai.com/blog/better-language-models/)
- [BERT (spacy-transformers)]
- [Question Answering with (allennlp)]


### Movie spoilers dataset
In this section we'll be doing some exploration of this dataset, motivated by an analytic question.

In [None]:
# this code is for subsetting, you don't need to run this
# read in full data
movies = pd.read_json(
    open('../data/imdb-spoiler-dataset/IMDB_movie_details.json'),
    lines=True)
reviews = pd.read_json(
    open('../data/imdb-spoiler-dataset/IMDB_reviews.json'),
    lines=True)
# subset
random_state = 42
samp_movies = movies.sample(frac=.1, random_state=random_state)
samp_reviews = reviews[reviews.movie_id.isin(samp_movies.movie_id)]

# output for analysis
samp_movies.to_pickle('../data/spoilers_movies.pkl.gz')
samp_reviews.to_pickle('../data/spoilers_reviews.pkl.gz')

In [4]:
# read in data
movies = pd.read_pickle('../data/spoilers_movies.pkl.gz')
reviews = pd.read_pickle('../data/spoilers_reviews.pkl.gz')

In [5]:
# some general EDA
print(movies.info())
print(reviews.info())
# descriptives of review text
print(reviews.review_text.apply(len).describe())
# what's our class distribution
print(reviews.is_spoiler.value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 157 entries, 1120 to 915
Data columns (total 7 columns):
movie_id         157 non-null object
plot_summary     157 non-null object
duration         157 non-null object
genre            157 non-null object
rating           157 non-null float64
release_date     157 non-null object
plot_synopsis    157 non-null object
dtypes: float64(1), object(6)
memory usage: 9.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57490 entries, 13974 to 570577
Data columns (total 7 columns):
review_date       57490 non-null object
movie_id          57490 non-null object
user_id           57490 non-null object
is_spoiler        57490 non-null bool
review_text       57490 non-null object
rating            57490 non-null int64
review_summary    57490 non-null object
dtypes: bool(1), int64(1), object(5)
memory usage: 3.1+ MB
None
count    57490.000000
mean      1417.070986
std       1096.628524
min         41.000000
25%        710.000000
50%       10

## Goal: Can we create method for identifying whether a review is a spoiler?

#### Exercise: Scoping the project
Let's refer to the [ML project requirements](https://www.jeremyjordan.me/ml-requirements/) article and see if we can evaluate this dataset for its potential to answer that question.

Some questions we might want to ask:
- Who is the user of this product?
- What data might we need?
- What data/features do we have to work with?
- Is the a simple heuristic here that might be preferable over a model?
- How can we iterate here and find improvement?
- What is the benefit from getting more elaborate with our design? What is the cost?



#### Exercise: Minimal approach
Since we already have a dataset tailored to this problem, let's think of some minimal approaches we can apply without even getting into models

In [None]:
# What about a coin-flip? 50/50 chance a review is a spoiler
actual = reviews.is_spoiler
pred = np.random.random(size=len(actual))<=0.5
accuracy_score(actual, pred)

In [None]:
# what if we adjusted our coin flip by genre?
# are certain genres more likely to have spoilers?
# genre is a list of variable length, so we need to expand that
genres = np.unique(np.concatenate(movies.genre.values))
genre_dummies = np.zeros(shape=(len(movies), len(genres)))
for i, r in enumerate(movies.genre):
    for ii, g in enumerate(genres):
        if g in r:
            genre_dummies[i, ii] = 1
df_genre_dummies = pd.DataFrame(genre_dummies,
                               columns=genres,
                               index=movies.movie_id)
reviews_genre = df_genre_dummies.merge(reviews[['movie_id', 'is_spoiler']],
                      left_index=True, right_on='movie_id')
# now we have the count of spoilers and non spoilers, by genre
genre_spoiler = reviews_genre.groupby('is_spoiler').sum().T
genre_spoiler

In [None]:
# so what if we just set our prediction to the average spoiler prob of genres?
genre_spoiler_prob = genre_spoiler[True]/genre_spoiler.sum(axis=1)
# just getting this at the movie level
movie_spoiler_prob = movies.genre.apply(lambda x: genre_spoiler_prob.loc[x].mean())
movie_spoiler_prob.index = movies.movie_id
actual = reviews.is_spoiler
pred = np.random.random(size=len(actual))
pred = pred <= reviews.movie_id.apply(lambda x: movie_spoiler_prob.loc[x])
accuracy_score(actual, pred)
# what's the problem with this prediction?

#### Exercise: Model-based approach
Think of a simple model approach you could use here.  I suggest nothing more complicated than SVC/Logistic Regression.

In [6]:
# train/val/test construction
# what are the considerations for constructing?
np.random.seed(seed=42)
# Will use 70/15/15 train/validation/test
pct_train = 0.7
pct_val = 0.15
pct_test = 1-pct_train-pct_val
draws = np.random.random(len(reviews))
train_bool = draws<=pct_train
val_bool = (draws>pct_train)&(draws<=pct_train+pct_val)
test_bool = (draws>pct_train+pct_val)&(draws<=pct_train+pct_val+pct_test)
train = reviews[train_bool]
val = reviews[val_bool]
test = reviews[test_bool]
print(train.shape, val.shape, test.shape)

(40355, 7) (8540, 7) (8595, 7)


In [7]:
# try the above method, but considering fitting on train
train_spoiler_prob = reviews_genre[train_bool].groupby('is_spoiler').sum().T
train_spoiler_prob = train_spoiler_prob[True]/train_spoiler_prob.sum(axis=1)
movie_spoiler_prob = movies.genre.apply(lambda x: train_spoiler_prob.loc[x].mean())
movie_spoiler_prob.index = movies.movie_id
for d in [train, val]:
    actual = d.is_spoiler
    pred = np.random.random(size=len(actual))
    pred = pred <= d.movie_id.apply(lambda x: movie_spoiler_prob.loc[x])
    print(accuracy_score(actual, pred))

NameError: name 'reviews_genre' is not defined

In [8]:
# now we're getting into models
# how does a simple word count+SVM perform?
tfidf = TfidfVectorizer(lowercase=False, min_df=0.01)
svc = LinearSVC()
tfidf_train = tfidf.fit_transform(train.review_text)
print(tfidf_train.shape)
svc.fit(tfidf_train, train.is_spoiler)
for d in [train, val]:
    actual = d.is_spoiler
    pred = svc.predict(tfidf.transform(d.review_text))
    print(accuracy_score(actual, pred))

(40355, 2030)
0.7961343080163549
0.7817330210772834


In [12]:
# what about topic models?
n_components = 10
nmf = NMF(n_components=n_components)
nmf_train = nmf.fit_transform(tfidf_train)
svc.fit(nmf_train, train.is_spoiler)
for d in [train, val]:
    actual = d.is_spoiler
    pred = svc.predict(
        nmf.transform(
            tfidf.transform(d.review_text)))
    print(accuracy_score(actual, pred))


0.7629042250030975
0.7605386416861827


In [43]:
# output simple review-based models for deployment 
from sklearn.pipeline import Pipeline
tfidf_prod = TfidfVectorizer(lowercase=False, min_df=0.01)
tfidf_vecs = tfidf_prod.fit_transform(reviews.review_text)
nmf_prod = NMF(n_components=n_components)
nmf_vecs = nmf_prod.fit_transform(tfidf_vecs)
svc_prod = svc.fit(nmf_vecs, reviews.is_spoiler)
pipe = Pipeline(steps=[('tfidf', tfidf_prod), ('nmf', nmf_prod), ('svc', svc_prod)])
print('test pipe:', pipe.predict([reviews.review_text.iloc[0]]))
pickle.dump(pipe, open('model_pipe.pkl', 'wb'))

test pipe: [False]


In [10]:
# what is our information inventory?
# what text info do we have from movies?
print(movies.filter(regex='plot').applymap(lambda x: len(x)).describe())
# what if we include movie information?
# using summary, which is a bit more concise
tfidf_movies = pd.DataFrame(tfidf.transform(movies.plot_summary).toarray(),
                            index=movies.movie_id)
# shape to fit base dataset
movie_features = train.merge(
    tfidf_movies, left_on='movie_id', right_index=True)[tfidf_movies.columns]

       plot_summary  plot_synopsis
count    157.000000     157.000000
mean     611.044586    8813.070064
std      246.397332    8580.499966
min       95.000000       0.000000
25%      421.000000    3674.000000
50%      584.000000    6235.000000
75%      730.000000   11521.000000
max     1067.000000   51736.000000


In [11]:
# concatenate the two arrays
train_feats = np.concatenate([tfidf_train.toarray(),
                              movie_features.values], axis=1)
svc.fit(train_feats, train.is_spoiler)
for d in [train, val]:
    actual = d.is_spoiler
    movie_feats = d.merge(
        tfidf_movies, left_on='movie_id', right_index=True)[tfidf_movies.columns].values
    tfidf_feats = tfidf.transform(d.review_text).toarray()
    pred = svc.predict(
        np.concatenate([tfidf_feats, 
                        movie_feats], axis=1))
    print(accuracy_score(actual, pred))

0.8264899021186966
0.7840749414519906


#### Exploring pre-trained models
Now that we have a simple, performant baseline, here's where we might want to bring in some pre-trained models.  Or, maybe adapt these pre-trained models via a process called fine-tuning. 

Let's start by exploring some available models and their outputs.

In [None]:
# what we're used to: BERT
from transformers import BertTokenizer, BertModel 
MODEL_NAME = 'bert-base-uncased'
# Load pre-trained model
model = BertModel.from_pretrained(MODEL_NAME)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

In [None]:
# what probably makes more sense to use in this setting
from transformers import DistilBertModel, DistilBertTokenizer
MODEL_NAME = 'distilbert-base-uncased'
# Load pre-trained model
distil_model = DistilBertModel.from_pretrained(MODEL_NAME)
# Load pre-trained model tokenizer (vocabulary)
distil_tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

In [None]:
%%time
# one dimension of difference: speed
samp_reviews = reviews['review_text'].iloc[:5]
tokens = tokenizer.batch_encode_plus(samp_reviews,
  pad_to_max_length=True, return_tensors="pt", 
  max_length=512) # BERT expects sequences of 512 tokens
outputs = model(**tokens)

In [None]:
%%time
samp_reviews = reviews['review_text'].iloc[:5]
tokens = distil_tokenizer.batch_encode_plus(samp_reviews,
  pad_to_max_length=True, return_tensors="pt", 
  max_length=512) # BERT expects sequences of 512 tokens
distil_outputs = distil_model(**tokens)

We can see here the difference between some of the models out there.  Some will take many hours to run our data through (if we can even do so without crashing the machine!).  BERT is not really designed for CPU work.  It's also not really designed to give document-level representations because it's trained on sentences.

Luckily, on Collaboratory, we get access to a GPU, which speeds things up a bit.

In [None]:
# pass model to GPU
distil_model.to(device)

In [None]:
# here we're doing small batches to the model on GPU, we'll load the product of this process later
# The model itself takes up a LOT of memory, so we're passing very small batches
# note here: You may run out of RAM if you try and run this along with all the above.
st = 0
batch_size = 5
batches = list(range(batch_size, len(reviews), batch_size))+[len(reviews)]
doc_rep_collector = []
for b in batches:
    tokens = distil_tokenizer.batch_encode_plus(
        reviews['review_text'][st:b],
        pad_to_max_length=True, 
        return_tensors="pt",
        max_length=512)
    st = b
    tokens.to(device)
    outputs = distil_model(**tokens)
    # taking the representation of the 'CLS' token (doc-level embedding)
    o = outputs[0][:,0].cpu().detach().numpy()
    doc_rep_collector.append(o)

# stack into array
doc_rep_collector = np.concatenate(doc_rep_collector)
# to minimize size, can store as 16-bit float
doc_rep_collector = doc_rep_collector.astype('float16')
# additionally, will store as gzip (pandas can handle this)
import gzip
pickle.dump(doc_rep_collector, gzip.open('review_bert_vectors.pkl.gz', 'wb'))

In [None]:
# bringing this portion down here: often running out of RAM on standard runtimes
reviews = pd.read_pickle('../data/spoilers_reviews.pkl.gz')
np.random.seed(seed=42)
# Will use 70/15/15 train/validation/test
pct_train = 0.7
pct_val = 0.15
pct_test = 1-pct_train-pct_val
draws = np.random.random(len(reviews))
train_bool = draws<=pct_train
val_bool = (draws>pct_train)&(draws<=pct_train+pct_val)
test_bool = (draws>pct_train+pct_val)&(draws<=pct_train+pct_val+pct_test)
train = reviews[train_bool]
val = reviews[val_bool]
test = reviews[test_bool]
print(train.shape, val.shape, test.shape)

In [None]:
# using BERT representations for prediction
bert_vectors = pd.read_pickle('../data/review_bert_vectors.pkl.gz')
print(bert_vectors.shape)

In [None]:
svc = LinearSVC()
svc.fit(bert_vectors[train_bool], train.is_spoiler)
for b in [train_bool, val_bool]:
    actual = reviews[b].is_spoiler
    pred = svc.predict(bert_vectors[b])
    print(accuracy_score(actual, pred))

This is bonus, but SpaCy also has an implementation of transformers.  It gives you similar objects to those we're used to working with out of SpaCy.  However, it has some compatibility issues with the `transformers` library (specifically, it requires transformers 2.0).

In [None]:
# this is what we might refer to as "tech debt"
# this is tricky to set up on collab, your results may vary...
# for spacy-transformers, you'll need a different version of transformers
!pip uninstall -y transformers
!pip install spacy
!pip install spacy-transformers
!python -m spacy download en_trf_distilbertbaseuncased_lg
# to use with GPU, you'll need to run this on collab
!pip uninstall cupy
!pip install cupy
# restart the kernel 

In [None]:
import spacy
gpu = spacy.prefer_gpu()
nlp = spacy.load("en_trf_distilbertbaseuncased_lg")

In [None]:
%%time
samp_reviews = reviews['review_text'].sample(frac=0.001)
parsed = [nlp(doc) for doc in samp_reviews]

In the snippet below, I run the dataset iteratively through DistilBERT, retrieving the representation of the [CLS] token.  This takes a while to run, so we'll just be loading the product of that.

In [None]:
st = 0
batch_size = 5
batches = list(range(batch_size, len(reviews), batch_size))+[len(reviews)]
doc_rep_collector = []
for b in batches:
    cls_rep = [nlp(x)._.trf_last_hidden_state[0] for x in reviews['review_text'][st:b]]
    st = b
    doc_rep_collector.append(cls_rep)

# stack into array
doc_rep_collector = np.concatenate(doc_rep_collector)
pickle.dump(doc_rep_collector, open('review_bert_vectors.pkl', 'wb'))