In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import spacy
from collections import Counter


In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

## Bagging to BERT: A tour of applied NLP
### Table of Contents
* [Data processing](#data)
* [Word Counts](#word)
* [TF-IDF](#tfidf)
* [Topic models](#topic)
* [Word vectors](#vectors)
* [LSTM](#lstm)
* [BERT](#bert)



### Data processing <a class="anchor" id="data"></a>

Up first is some preprocessing.  You'll either need to download the [IMDb review data](https://ai.stanford.edu/~amaas/data/sentiment/) and save it to this directory OR download the [processed data](https://drive.google.com/file/d/1oN_fO91IBkDHD_u6WXiUCvhhyNexQDJq/view?usp=sharing).

In [3]:
# # processing the original data into DataFrame
# # here for reference, don't need to run this if you're using reviews.pkl.gz
# source_path = Path('./aclImdb/')
# #neg_files = source_path.glob('./*/neg/*.txt')
# #pos_files = source_path.glob('./*/pos/*.txt')
# all_files = []
# for f in source_path.glob('./*/*/*.txt'):
#     filename = f.as_posix()
#     if 'unsup' not in filename:
#         # split up into useful components
#         _, split, sent, idx = filename.split('/')
#         idx = int(idx.split('_')[0])
#         all_files.append([idx, split, sent, f.read_text()])
# review_df = pd.DataFrame(all_files)
# review_df.columns = ['idx', 'split', 'label', 'text']
# # some minor html cruft is in here
# review_df['text'] = review_df['text'].str.replace('<br /><br />', '')
# # to_pickle returns None, so don't assign its result back to review_df
# review_df.to_pickle('reviews.pkl.gz')

In [4]:
# can skip here if you already have reviews.pkl.gz
review_df = pd.read_pickle('reviews.pkl.gz')

### Word counts <a class="anchor" id="word"></a>
A very basic way to use a sanitized list of tokens is to do a word count. This unlocks a lot of insights right away and is an important step in exploratory data analysis of text.

In [5]:
# take the first negative and first positive review as examples
neg_review = review_df.loc[review_df.label=='neg'].iloc[0]['text']
pos_review = review_df.loc[review_df.label=='pos'].iloc[0]['text']
print('Negative\n', neg_review, '\n')
print('Positive\n', pos_review)

Negative
 Alan Rickman & Emma Thompson give good performances with southern/New Orleans accents in this detective flick. It's worth seeing for their scenes- and Rickman's scene with Hal Holbrook. These three actors mannage to entertain us no matter what the movie, it seems. The plot for the movie shows potential, but one gets the impression in watching the film that it was not pulled off as well as it could have been. The fact that it is cluttered by a rather uninteresting subplot and mostly uninteresting kidnappers really muddles things. The movie is worth a view- if for nothing more than entertaining performances by Rickman, Thompson, and Holbrook. 

Positive
 Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with

In [6]:
# base python word count - split on whitespace, use a Counter object
print(Counter(neg_review.split()))

Counter({'the': 4, 'it': 4, 'for': 3, 'and': 3, 'The': 3, 'performances': 2, 'with': 2, 'in': 2, 'worth': 2, 'Holbrook.': 2, 'movie': 2, 'that': 2, 'as': 2, 'is': 2, 'by': 2, 'a': 2, 'uninteresting': 2, 'Alan': 1, 'Rickman': 1, '&': 1, 'Emma': 1, 'Thompson': 1, 'give': 1, 'good': 1, 'southern/New': 1, 'Orleans': 1, 'accents': 1, 'this': 1, 'detective': 1, 'flick.': 1, "It's": 1, 'seeing': 1, 'their': 1, 'scenes-': 1, "Rickman's": 1, 'scene': 1, 'Hal': 1, 'These': 1, 'three': 1, 'actors': 1, 'mannage': 1, 'to': 1, 'entertain': 1, 'us': 1, 'no': 1, 'matter': 1, 'what': 1, 'movie,': 1, 'seems.': 1, 'plot': 1, 'shows': 1, 'potential,': 1, 'but': 1, 'one': 1, 'gets': 1, 'impression': 1, 'watching': 1, 'film': 1, 'was': 1, 'not': 1, 'pulled': 1, 'off': 1, 'well': 1, 'could': 1, 'have': 1, 'been.': 1, 'fact': 1, 'cluttered': 1, 'rather': 1, 'subplot': 1, 'mostly': 1, 'kidnappers': 1, 'really': 1, 'muddles': 1, 'things.': 1, 'view-': 1, 'if': 1, 'nothing': 1, 'more': 1, 'than': 1, 'entertainin

We can already see some things that need to be considered: case sensitivity means "The" and "the" are counted separately, punctuation sticks to words ("Holbrook."), and words like "the" and "it" dominate the counts.

Luckily, scikit-learn's `CountVectorizer` handles simple preprocessing like this.

In [7]:
# scikit-learn's countvectorizer
count = CountVectorizer()
neg_vec = count.fit_transform([neg_review])
neg_vec

<1x76 sparse matrix of type '<class 'numpy.int64'>'
	with 76 stored elements in Compressed Sparse Row format>

`CountVectorizer` outputs a sparse matrix by default.  We can convert that to a normal numpy array and stitch it together with the vocabulary from the `fit()` call.

In [8]:
print(
    dict(zip(count.get_feature_names_out(), 
             neg_vec.toarray().flatten())))

{'accents': 1, 'actors': 1, 'alan': 1, 'and': 3, 'as': 2, 'been': 1, 'but': 1, 'by': 2, 'cluttered': 1, 'could': 1, 'detective': 1, 'emma': 1, 'entertain': 1, 'entertaining': 1, 'fact': 1, 'film': 1, 'flick': 1, 'for': 3, 'gets': 1, 'give': 1, 'good': 1, 'hal': 1, 'have': 1, 'holbrook': 2, 'if': 1, 'impression': 1, 'in': 2, 'is': 2, 'it': 5, 'kidnappers': 1, 'mannage': 1, 'matter': 1, 'more': 1, 'mostly': 1, 'movie': 3, 'muddles': 1, 'new': 1, 'no': 1, 'not': 1, 'nothing': 1, 'off': 1, 'one': 1, 'orleans': 1, 'performances': 2, 'plot': 1, 'potential': 1, 'pulled': 1, 'rather': 1, 'really': 1, 'rickman': 3, 'scene': 1, 'scenes': 1, 'seeing': 1, 'seems': 1, 'shows': 1, 'southern': 1, 'subplot': 1, 'than': 1, 'that': 2, 'the': 7, 'their': 1, 'these': 1, 'things': 1, 'this': 1, 'thompson': 2, 'three': 1, 'to': 1, 'uninteresting': 2, 'us': 1, 'view': 1, 'was': 1, 'watching': 1, 'well': 1, 'what': 1, 'with': 2, 'worth': 2}


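We can see that the defaults have already done some cleaning for us: the text is lowercased, punctuation is stripped, and single-character tokens are dropped.  Common words like "the" and "it" still dominate, though.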
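To make that cleaning concrete, here is a minimal standard-library sketch of what the default settings roughly amount to: lowercase the text, then keep runs of two or more word characters (scikit-learn documents the default `token_pattern` as `(?u)\b\w\w+\b`). The `simple_count` helper and the example sentence are hypothetical, for illustration only; this is not how scikit-learn is implemented internally.

```python
import re
from collections import Counter

# the default CountVectorizer token pattern: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_count(text):
    # hypothetical helper: lowercase, tokenize with the pattern, count
    return Counter(TOKEN_PATTERN.findall(text.lower()))

example = "The plot shows potential, but the movie, it seems, is a mess."
counts = simple_count(example)
print(counts)
```

Note that "The" and "the" collapse into one entry, trailing commas disappear, and the one-letter word "a" is dropped entirely.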

#### Deterministic Approach with word counts

Let's try a deterministic approach, using word counts and a list of "positive" vs "negative" words.

In [9]:
pos_words = ["good", "great", "like", "loved"]
neg_words = ["bad", "awful", "dislike", "hated"]

# we're going to use this train/test split throughout
# we'll also use this seed for consistency
# NOTE: Usually you'll want to do a separate validation set when choosing models/featuresets!
seed = 37
np.random.seed(seed)
pct_train = 0.7
X_train, X_test, y_train, y_test = train_test_split(
    review_df['text'],
    review_df['label'], train_size=pct_train)

cv = CountVectorizer(stop_words='english')
train_vecs = cv.fit_transform(X_train)
feats = cv.get_feature_names_out()
pos_idxs = np.where(np.isin(feats, pos_words))[0]
neg_idxs = np.where(np.isin(feats, neg_words))[0]
train_det_score = train_vecs[:, pos_idxs].sum(1) - train_vecs[:, neg_idxs].sum(1)
# easier for group-level score
train_det_score = pd.Series(np.array(train_det_score).ravel(), 
                            index=X_train.index)

In [10]:
# our threshold - the average score for negative reviews; at or below it = negative
neg_thresh = train_det_score.groupby(y_train).mean()['neg']
test_vecs = cv.transform(X_test)
test_det_score = test_vecs[:, pos_idxs].sum(1) - test_vecs[:, neg_idxs].sum(1)
# flatten the (n, 1) matrix to a 1-d array before thresholding
det_pred = np.asarray(test_det_score).ravel() > neg_thresh

In [11]:
print(
    classification_report(y_pred=det_pred,
                          y_true=y_test=='pos'))

              precision    recall  f1-score   support

       False       0.61      0.44      0.51      7522
        True       0.56      0.71      0.63      7478

    accuracy                           0.58     15000
   macro avg       0.58      0.58      0.57     15000
weighted avg       0.58      0.58      0.57     15000



#### Count Vector + Logistic Regression
Here we try a count vector with Logistic Regression.  This removes the need to choose an arbitrary set of terms and an arbitrary threshold, as above.

Here I use scikit-learn's [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) functionality.  I won't try to explain that here; the docs do a much better job than I can.


In [12]:
count = CountVectorizer(stop_words='english')

count_pipeline = Pipeline(
    steps=[("preprocessor", count),
          ('model', LogisticRegression(max_iter=500, solver='liblinear'))]
)

In [13]:
np.random.seed(seed)
count_pipeline.fit(X_train, y_train)
count_pipeline.score(X_test, y_test)

0.8813333333333333

In [14]:
print(
    classification_report(y_pred=count_pipeline.predict(X_test),
                          y_true=y_test))

              precision    recall  f1-score   support

         neg       0.89      0.88      0.88      7522
         pos       0.88      0.89      0.88      7478

    accuracy                           0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



This is actually really good! About 88% of the time we're predicting the right class with this model.  But can we do...better?

### TF-IDF <a class="anchor" id="tfidf"></a>
One thing we notice with count vectors is that all words are being counted the same.  We might want to use a weighting scheme to ensure that words that are more informative about the content are flagged as more important.  One weighting scheme is Term Frequency - Inverse Document Frequency (TF-IDF).

Take a few simplistic movie reviews as an example.  We can already tell which words are most relevant to the specific content of each review (i.e. "good", "bad", "great").

In [15]:
docs = ['The movie was good',
        'The movie was bad',
        'The movie was great']

cv = CountVectorizer()
vecs = cv.fit_transform(docs).toarray()
# we'll use pandas DF for easier display
pd.DataFrame(vecs, columns=cv.get_feature_names_out())

   bad  good  great  movie  the  was
0    0     1      0      1    1    1
1    1     0      0      1    1    1
2    0     0      1      1    1    1


You'll notice that `vecs` contains the term frequencies.  If we use sklearn's `TfidfVectorizer`, it will calculate those term counts, multiply them by the Inverse Document Frequency (IDF), and (by default) scale each row to unit length.
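As a sanity check on what those weights mean, here is a standard-library sketch that computes TF-IDF by hand for the three toy reviews. It assumes the defaults described in scikit-learn's documentation: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper name `tfidf_row` is made up for illustration.

```python
import math

docs = ['The movie was good',
        'The movie was bad',
        'The movie was great']
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# document frequency: number of documents containing each term
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# smoothed IDF, following the formula in scikit-learn's documentation
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    # raw term count times IDF, then scale the row to unit L2 norm
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

first = dict(zip(vocab, tfidf_row(tokenized[0])))
print({w: round(v, 6) for w, v in first.items()})
```

Words that appear in every review ("the", "movie", "was") get the minimum IDF of 1, while "good" appears in only one review and so gets a larger weight before normalization.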

In [16]:
tfidf = TfidfVectorizer()
# we'll use pandas DF for easier display
tfidf_vecs = tfidf.fit_transform(docs).toarray()
tfidf_df = pd.DataFrame(tfidf_vecs, columns=tfidf.get_feature_names_out())
tfidf_df

       bad     good    great     movie       the       was
0  0.00000  0.69903  0.00000  0.412859  0.412859  0.412859
1  0.69903  0.00000  0.00000  0.412859  0.412859  0.412859
2  0.00000  0.00000  0.69903  0.412859  0.412859  0.412859


You can see that the discriminative words have higher weight than the non-discriminative words.  

It's worth noting that, in terms of "separability", plain 0 vs. 1 counts of "good" and "bad" might actually work better here.  But these are highly curated examples.  In real reviews, good and bad descriptive terms are often mixed together, and you want to capture the words that best describe the "aboutness" of the review.  (Think: "This movie was not bad, it was good!")

Now let's fit our regression as above with TF-IDF vectors.

In [17]:
# same setup as the count pipeline, swapping in TF-IDF weights
tfidf = TfidfVectorizer(stop_words='english')

tfidf_pipeline = Pipeline(
    steps=[("preprocessor", tfidf),
          ('model', LogisticRegression(max_iter=500, solver='liblinear'))]
)

In [18]:
np.random.seed(seed)
tfidf_pipeline.fit(X_train, y_train)
print(f'accuracy: {tfidf_pipeline.score(X_test, y_test)}')
print(
    classification_report(y_pred=tfidf_pipeline.predict(X_test),
                          y_true=y_test))

0.8912666666666667

In [20]:
# looking at the coefficients on the LR for each model
word_feats = tfidf_pipeline['preprocessor'].get_feature_names_out()
# get the largest by magnitude, stitch together to compare
top = 10
top_tfidf = np.argsort(np.abs(tfidf_pipeline['model'].coef_.flatten()))[-top:]
top_count = np.argsort(np.abs(count_pipeline['model'].coef_.flatten()))[-top:]
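# one row per word, with its coefficient under each model
coef_df = pd.DataFrame([
    word_feats,
    tfidf_pipeline['model'].coef_.flatten(),
    count_pipeline['model'].coef_.flatten()],
    index=['word', 'tfidf', 'count']).T
# convert coefficients to ranks so the two models are comparable (rank 1 = most negative)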
coef_df['tfidf'] = coef_df['tfidf'].rank()
coef_df['count'] = coef_df['count'].rank()
coef_df.loc[np.unique(np.concatenate([top_tfidf, top_count]))]

                 word    tfidf    count
6408            awful      3.0      4.0
6697              bad      2.0    316.0
10324          boring      5.0     14.0
22357   disappointing     16.0      5.0
22360  disappointment     14.0      2.0
27335       excellent  89233.0  89227.0
28106           fails     17.0      9.0
34078           great  89234.0  89112.0
50143        mediocre     30.0      6.0
51096          mildly     96.0     10.0


In [21]:
# examples where there's disagreement
tfidf_pred = tfidf_pipeline.predict_proba(X_test)[:, 1]
count_pred = count_pipeline.predict_proba(X_test)[:, 1]

In [22]:
# most interesting are where there's the largest disagreement
top_disagree_idx = np.argsort(np.abs(tfidf_pred - count_pred))[-10:]

In [23]:
# assemble in df; use positional values so rows line up with the prediction arrays
compare_df = pd.DataFrame({'tfidf_pred': tfidf_pred,
                           'count_pred': count_pred,
                           'label': y_test.values,
                           'text': X_test.values})
# truncate long reviews for display
compare_df['text'] = compare_df['text'].str[:2000]

In [24]:
compare_df['tfidf_right'] = ((compare_df['tfidf_pred']>=0.5)&(compare_df['label']=='pos'))|\
    ((compare_df['tfidf_pred']<0.5)&(compare_df['label']=='neg'))

In [27]:
# simple way to look at some of these differences
#compare_df[compare_df.tfidf_right].reindex(top_disagree_idx).values

It's difficult to see piecemeal, but it does appear that certain words we associate with negative reviews (e.g. "bad") have a stronger influence on prediction in the TF-IDF model.

### Topic models <a class="anchor" id="topic"></a>


In [33]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [28]:
def display_components(model, word_features, top_display=5):
    # utility for displaying representative words per component for topic models
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        top_words_idx = topic.argsort()[::-1][:top_display]
        top_words = [word_features[i] for i in top_words_idx]
        print(" ".join(top_words))

In [46]:
# fit tfidf/cv models
tfidf = TfidfVectorizer(stop_words='english')
tfidf_vecs = tfidf.fit_transform(review_df['text'])
#cv = CountVectorizer(stop_words='english', min_df=0.05)
#count_vecs = cv.fit_transform(review_df['text'])

In [47]:
# choose the number of components (topics)
n_components = 10
# adding a few tweaks, just based on experimentation
nmf = NMF(n_components=n_components,
          init='nndsvda',
         max_iter=500)
# NMF needs non-negative input; here we feed it the TF-IDF vectors
# same syntax as vectorizer
nmf_vecs = nmf.fit_transform(tfidf_vecs)

Both NMF and LDA provide a components matrix, which holds the loading of each word on each topic.  Higher values mean the word is more relevant to that topic.

In [48]:
print(nmf.components_)

[[5.08613596e-03 4.10971444e-02 1.41693238e-04 ... 2.29262679e-04
  6.28212910e-05 6.28212910e-05]
 [7.22501881e-04 0.00000000e+00 0.00000000e+00 ... 1.39870498e-05
  0.00000000e+00 0.00000000e+00]
 [2.64404827e-03 6.88220840e-03 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 ...
 [3.40469091e-03 1.42849764e-02 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [4.42327945e-03 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  1.29440514e-04 1.29440514e-04]
 [6.79228602e-03 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


The two methods quantify, in different ways, how much is lost by using the topic model in place of the actual data (in the matrix formulation, $UV$ rather than $X$).  For NMF, it's reconstruction error, which directly measures the difference between the matrix decomposition and the actual data; lower is better.  LDA uses the [ELBO](https://en.wikipedia.org/wiki/Evidence_lower_bound), which is too complicated to explain here; it is a lower bound on the log-likelihood, so higher is better.  The two measures can't be compared to one another, though.
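For intuition about reconstruction error, here is a small NumPy sketch with hand-picked factors (the matrices `U`, `V`, and `X` are made up for illustration). It computes the Frobenius norm of $X - UV$, which is how scikit-learn documents `reconstruction_err_` under NMF's default Frobenius loss.

```python
import numpy as np

# hypothetical factors: 3 documents x 2 topics, and 2 topics x 3 words
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

# "observed" data: the exact product with one entry nudged by 0.5
X = U @ V
X[0, 0] += 0.5

# Frobenius norm of the residual: sqrt of the sum of squared differences
recon_err = np.linalg.norm(X - U @ V, ord="fro")
print(recon_err)
```

Only one entry differs from the reconstruction, so the error is just the size of that one discrepancy. On real data every entry contributes, which is why the value grows with the size of the matrix and is only meaningful when comparing models fit to the same data.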

In [49]:
print(nmf.reconstruction_err_)

219.17632964934285


In [50]:
display_components(nmf, tfidf.get_feature_names_out())

Topic 0:
life man story character young
Topic 1:
movie movies watch book saw
Topic 2:
bad acting good terrible worst
Topic 3:
film films director plot making
Topic 4:
just like don really people
Topic 5:
great good story really love
Topic 6:
series episode tv episodes season
Topic 7:
horror gore budget effects scary
Topic 8:
funny comedy jokes laugh humor
Topic 9:
seen ve movies worst time


Let's try using the topic loadings as features in a classifier.

In [51]:
# TF-IDF vectors -> NMF topic loadings -> logistic regression
tfidf = TfidfVectorizer(stop_words='english')
nmf = NMF(n_components=n_components,
          init='nndsvda',
          max_iter=500)

nmf_pipeline = Pipeline(
    steps=[("preprocessor", tfidf),
           ("topic", nmf),
          ('model', LogisticRegression(max_iter=500, solver='liblinear'))]
)

In [52]:
np.random.seed(seed)
nmf_pipeline.fit(X_train, y_train)
print(f'accuracy: {nmf_pipeline.score(X_test, y_test)}')
print(
    classification_report(y_pred=nmf_pipeline.predict(X_test),
                          y_true=y_test))

accuracy: 0.7588666666666667
              precision    recall  f1-score   support

         neg       0.80      0.70      0.74      7522
         pos       0.73      0.82      0.77      7478

    accuracy                           0.76     15000
   macro avg       0.76      0.76      0.76     15000
weighted avg       0.76      0.76      0.76     15000



### Word vectors <a class="anchor" id="vectors"></a>
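Our next approach is to use pretrained word vectors, which encode each word's typical context in a dense representation.  We'll be bringing spaCy into the mix here, particularly its "medium" English web model (`en_core_web_md`), which ships with 300-dimensional pretrained word vectors (GloVe embeddings in spaCy 2.x; newer model releases may use a different vector source).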

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import spacy

In [None]:
# only need to run this once
#!python -m spacy download en_core_web_md

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:

class GloveVectorizer(BaseEstimator, TransformerMixin):
    # custom document transformer for use in a scikit-learn pipeline:
    # represents each document as the count-weighted average of its word vectors
    def __init__(self, vectorizer):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        # learn the vocabulary, then look up one 300-d vector per vocabulary term
        self.vectorizer.fit(X)
        vocab = self.vectorizer.vocabulary_
        self.vocab_glove = np.zeros(shape=(len(vocab), 300))
        for token, idx in vocab.items():
            self.vocab_glove[idx] = nlp(token).vector
        return self

    def transform(self, X, y=None):
        X_transformed = self.vectorizer.transform(X).toarray()
        # tokens per document; clamp to 1 so empty documents don't divide by zero
        sum_words = np.maximum(X_transformed.sum(1), 1).reshape(-1, 1)
        glove_vecs = X_transformed.dot(self.vocab_glove) / sum_words
        return glove_vecs
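The `transform` step above boils down to a count-weighted average of word vectors. Here is a toy NumPy sketch of that arithmetic, assuming a made-up three-word vocabulary with 2-dimensional vectors (the real vectors here are 300-dimensional). Clamping the token count to 1 is an added guard for documents with no in-vocabulary words, which would otherwise divide by zero.

```python
import numpy as np

# hypothetical 2-d embeddings for a vocabulary of ['bad', 'good', 'movie']
embeddings = np.array([[-1.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0]])

# document-term counts: doc 0 = "good good movie", doc 1 = "bad movie",
# doc 2 contains no in-vocabulary words
counts = np.array([[0, 2, 1],
                   [1, 0, 1],
                   [0, 0, 0]])

# total tokens per document; clamp to 1 to avoid dividing by zero
n_tokens = np.maximum(counts.sum(axis=1, keepdims=True), 1)
# each row is the average of the vectors of the words in that document
doc_vecs = counts @ embeddings / n_tokens
print(doc_vecs)
```

Because the result is an average, word order is lost: "not bad, it was good" and "not good, it was bad" would map to the same vector.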

In [None]:
# raw counts (binary=False), keeping only terms that appear in at least 1% of reviews
count = CountVectorizer(stop_words='english', min_df=0.01, binary=False)
glove = GloveVectorizer(count)

glove_pipeline = Pipeline(
    steps=[("preprocessor", glove),
          ('model', LogisticRegression(max_iter=500, solver='liblinear'))]
)


In [None]:
np.random.seed(seed)
glove_pipeline.fit(X_train, y_train)
glove_pipeline.score(X_test, y_test)

In [None]:
print(
    classification_report(y_pred=glove_pipeline.predict(X_test),
                          y_true=y_test))

### LSTM <a class="anchor" id="lstm"></a>

In [75]:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
# this will set the device on which to train
device = torch.device("cpu")


In [154]:
class SentimentNet(nn.Module):
    # sentiment classifier: stacked LSTM layers + dropout + fully-connected layer with sigmoid activation
    # adapted from https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/
    def __init__(self,
                 weight_matrix=None,
                 vocab_size=None, 
                 output_size=1,  
                 hidden_dim=512,
                 embedding_dim=400, 
                 n_layers=2, 
                 dropout_prob=0.5):
        super(SentimentNet, self).__init__()
        # size of the output: a single probability per review
        self.output_size = output_size
        # number of stacked LSTM layers (default 2)
        self.n_layers = n_layers
        # dimensions of our hidden state, what is passed from one time point to the next
        self.hidden_dim = hidden_dim
        # initialize the representation to pass to the LSTM
        self.embedding, embedding_dim = self.init_embedding(
            vocab_size, 
            embedding_dim, 
            weight_matrix)
        # LSTM layer, where the magic happens
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=dropout_prob, batch_first=True)
        # dropout, similar to regularization
        self.dropout = nn.Dropout(dropout_prob)
        # fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
        # sigmoid activation
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, hidden):
        # forward pass of the network
        batch_size = x.size(0)
        # transform input
        embeds = self.embedding(x)
        # run input embedding + hidden state through model
        lstm_out, hidden = self.lstm(embeds, hidden)
        # reshape
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        # dropout certain pct of connections
        out = self.dropout(lstm_out)
        # fully connected layer
        out = self.fc(out)
        # activation function
        out = self.sigmoid(out)
        # reshape to (batch, seq_len) and keep only the final time step's prediction
        out = out.view(batch_size, -1)
        out = out[:,-1]
        # return the output and the hidden state
        return out, hidden
    
    def init_embedding(self, vocab_size, embedding_dim, weight_matrix):
        # initializes the embedding
        if weight_matrix is None:
            if vocab_size is None:
                raise ValueError('If no weight matrix, need a vocab size')
            # if embedding is a size, initialize trainable
            return(nn.Embedding(vocab_size, embedding_dim),
                   embedding_dim)
        else:
            # otherwise use matrix as pretrained
            weights = torch.FloatTensor(weight_matrix)
            return(nn.Embedding.from_pretrained(weights),
                  weights.shape[1])
    
    def init_hidden(self, batch_size):
        # initializes the hidden state
        hidden = (torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device),
                  torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device))
        return hidden

In [143]:
from sklearn.model_selection import train_test_split
from spacy.lang.en import English
en = English()

In [161]:
# take a sample for quickness (seeded so the sample and split are reproducible)
sample_review = review_df.sample(frac=0.1, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    sample_review['text'], sample_review['label']=='pos', random_state=seed)

In [162]:
def simple_tokenizer(doc, model=en):
    # a simple tokenizer for individual documents:
    # lowercased alphabetic tokens, with stop words removed
    parsed = model(doc)
    return [t.lower_ for t in parsed if t.is_alpha and not t.is_stop]

def doc_to_index(docs, vocab, tokenizer=simple_tokenizer):
    # transform docs into sequences of vocabulary indices
    docs_idxs = []
    for d in docs:
        # unknown token = 1
        docs_idxs.append([vocab.get(w, 1) for w in tokenizer(d)])
    return docs_idxs

def pad_sequence(seqs, seq_len=200):
    # left-pad with zeros (and truncate to the first seq_len tokens)
    # so that all sequences have the same length
    features = np.zeros((len(seqs), seq_len), dtype=int)
    for i, seq in enumerate(seqs):
        seq = seq[:seq_len]
        if len(seq) != 0:
            features[i, -len(seq):] = seq
    return features

In [163]:
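# fit a vectorizer only to get a vocabulary, then reserve
# index 0 for padding and index 1 for unknown tokens
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer,
                       token_pattern=None,
                       min_df=0.01)
tfidf.fit(X_train)
vocab = tfidf.vocabulary_
# shift every index by 2 to make room for the two reserved entries
vocab = {word: idx + 2 for word, idx in vocab.items()}
vocab['_UNK'] = 1
vocab['_PAD'] = 0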
parsed_train = doc_to_index(X_train, vocab)
padded_train = pad_sequence(parsed_train)
parsed_test = doc_to_index(X_test, vocab)
padded_test = pad_sequence(parsed_test)
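To see what the indexing and padding steps produce, here is a plain-Python sketch on a toy vocabulary. The names `toy_vocab`, `to_indices`, and `pad_left` are hypothetical stand-ins for illustration; they mirror the conventions used in this section (0 = padding, 1 = unknown, left-padding, truncation to the first `seq_len` tokens).

```python
# reserved indices, mirroring the convention used in this section
PAD, UNK = 0, 1
toy_vocab = {"_PAD": PAD, "_UNK": UNK, "great": 2, "movie": 3}

def to_indices(tokens, vocab):
    # tokens missing from the vocabulary fall back to the UNK index
    return [vocab.get(t, UNK) for t in tokens]

def pad_left(seq, seq_len):
    # truncate to the first seq_len items, then left-pad with PAD
    seq = seq[:seq_len]
    return [PAD] * (seq_len - len(seq)) + seq

ids = to_indices(["great", "movie", "overall"], toy_vocab)
short = pad_left(ids, 5)
long = pad_left(ids, 2)
print(ids)    # "overall" is not in the vocabulary, so it maps to UNK
print(short)  # padding goes on the left
print(long)   # sequences longer than seq_len are cut off
```

Padding on the left keeps the actual words at the end of the sequence, next to the final time step that the network reads its prediction from.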


In [164]:
# construct datasets for loading by PyTorch
train_data = TensorDataset(torch.from_numpy(padded_train), 
                           torch.from_numpy(y_train.values))
test_data = TensorDataset(torch.from_numpy(padded_test), 
                          torch.from_numpy(y_test.values))

batch_size = 100

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size,
                         drop_last=True) # this is to keep the size consistent
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size,
                        drop_last=True)

In [165]:
vocab_size = len(vocab)
model_params = {'weight_matrix': None,
               'output_size': 1,
               'hidden_dim': 512,
               'n_layers': 2,
               'embedding_dim': 400,
               'dropout_prob': 0.2,
               'vocab_size': vocab_size}

model = SentimentNet(**model_params)
model.to(device)

SentimentNet(
  (embedding): Embedding(1526, 400)
  (lstm): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [None]:
lr=0.005
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# kept at 1 so training finishes quickly on CPU
# increasing this will make the training take a while
epochs = 1
counter = 0
print_every = 5
clip = 5
valid_loss_min = np.inf

model.train()
for i in range(epochs):
    h = model.init_hidden(batch_size)
    for inputs, labels in train_loader:
        counter += 1
        h = tuple([e.data for e in h])
        inputs, labels = inputs.to(device), labels.to(device)
        model.zero_grad()
        output, h = model(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        if counter%print_every == 0:
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            # NOTE: no separate validation set here, so the test set stands in for it
            # no gradients are needed while evaluating
            with torch.no_grad():
                for inp, lab in test_loader:
                    val_h = tuple([each.data for each in val_h])
                    inp, lab = inp.to(device), lab.to(device)
                    out, val_h = model(inp, val_h)
                    val_loss = criterion(out.squeeze(), lab.float())
                    val_losses.append(val_loss.item())
                
            model.train()
            print("Epoch: {}/{}...".format(i+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            if np.mean(val_losses) <= valid_loss_min:
                torch.save(model.state_dict(), './state_dict.pt')
                print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
                    valid_loss_min,np.mean(val_losses)))
                valid_loss_min = np.mean(val_losses)
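The loss being minimized above, `nn.BCELoss`, is the mean binary cross-entropy over the batch. Here is a standard-library sketch of that formula, using made-up probabilities and labels (the `bce` helper is hypothetical and skips the numerical safeguards PyTorch applies):

```python
import math

def bce(probs, labels):
    # mean of -[y*log(p) + (1-y)*log(1-p)] over the batch
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# predicted probability of "positive" and the true labels (1 = positive)
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
loss = bce(probs, labels)
print(round(loss, 4))

# a confident wrong prediction is penalized more than a hesitant one
confident_wrong = bce([0.9], [0])
hesitant_wrong = bce([0.6], [0])
print(confident_wrong > hesitant_wrong)
```

This is why the network's final layer is a sigmoid: the loss expects each output to be a probability strictly between 0 and 1.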

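In [None]:
# evaluate the LSTM on the held-out batches, thresholding the sigmoid output at 0.5
# (test_loader uses drop_last=True, so a final partial batch is skipped)
model.eval()
lstm_preds, lstm_labels = [], []
with torch.no_grad():
    for inp, lab in test_loader:
        out, _ = model(inp.to(device), model.init_hidden(inp.size(0)))
        lstm_preds.extend((out >= 0.5).tolist())
        lstm_labels.extend(lab.tolist())

In [None]:
print(
    classification_report(y_pred=lstm_preds,
                          y_true=lstm_labels))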

### BERT <a class="anchor" id="bert"></a>
From [HF tutorials](https://huggingface.co/blog/sentiment-analysis-python).  The default pipeline uses this [DistilBERT model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

In [None]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)