## A light introduction to natural language processing

In lab 3, we're going to briefly cover some methods for analyzing data that comes in the form of text. This will help with the practical next week.

## Sentiment Classification in Movie Reviews

We'll use the dataset from Stanford here: http://ai.stanford.edu/~amaas/data/sentiment/

1) First, click on the link and download the dataset (it's too big to put on GitHub).

2) Make sure you move the directory "aclImdb" into the same folder as this notebook.

Unfortunately, the data comes as one file per review, which is kind of annoying. I used glob for this: glob("directory/*") just lists the paths of everything in that directory.
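
As a quick illustration of what glob returns, here is a minimal sketch that builds a throwaway directory and lists it. The file names `0_9.txt` and `1_7.txt` are made up for the example. The order in which glob returns paths is not guaranteed, so the sketch sorts them.

```python
import os
import tempfile
from glob import glob

# build a temporary directory holding two hypothetical review files
tmp_dir = tempfile.mkdtemp()
for name in ["0_9.txt", "1_7.txt"]:
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("placeholder review text")

# "*" matches every entry in the directory; sort because glob's order is arbitrary
filenames = sorted(glob(os.path.join(tmp_dir, "*")))
basenames = [os.path.basename(p) for p in filenames]
print(basenames)
```

Each element of `filenames` is a path that can be passed straight to `open`.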

In [1]:
import pandas as pd
import numpy as np
#glob lets us quickly list all the filenames; it's part of the Python standard library, so there's nothing to install
from glob import glob

In [2]:
pos_filenames = glob('aclImdb/train/pos/*')
neg_filenames = glob('aclImdb/train/neg/*')

You can now check that pos_filenames has all the filenames for positive reviews and neg_filenames has all the filenames for negative reviews. The following code is pretty hacky, but it does the job of combining all the text into one dataframe. We'll open the files one by one and append each file's text to a list. We'll also keep track of the sentiment in a second list.

In [3]:
#loop through the list of files and append the contents to a list
contents = []
sentiments = []

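#loop through the positive sentiment files and save all the contents
#note: the files are opened in binary mode, so str() keeps the b'...' wrapper around each review
for fname in pos_filenames:
    with open(fname,'rb') as f:
        #take the first line of the file (this assumes each review sits on a single line)
        contents.append(str(f.readlines()[0]))
        sentiments.append(1)

#do the same for the negative sentiment files
for fname in neg_filenames:
    with open(fname,'rb') as f:
        contents.append(str(f.readlines()[0]))
        sentiments.append(0)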

Print the length of the list we just made (the total number of movie reviews)

In [4]:
len(contents)

25000

Print the first movie review 

In [5]:
print(contents[0])

b'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
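
The `b'...'` wrapper in this output appears because the file was opened in binary mode and `str()` was applied to a `bytes` object. Here is a minimal sketch of one way to get the plain string instead, assuming the reviews are UTF-8 encoded (an assumption, not something checked here) and using a made-up temporary file:

```python
import os
import tempfile

# write a tiny stand-in review to a temporary file (hypothetical content)
original = "Bromwell High is a cartoon comedy."
path = os.path.join(tempfile.mkdtemp(), "review.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(original)

# binary mode + str(): the result carries the b'...' wrapper
with open(path, "rb") as f:
    as_bytes_str = str(f.read())

# text mode with an explicit encoding (assumed UTF-8): the plain string comes back
with open(path, encoding="utf-8") as f:
    as_text = f.read()

print(as_bytes_str)
print(as_text)
```

The rest of this notebook keeps the binary-mode strings, so its outputs still show the `b'` prefix.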


To get back to familiar territory, we'll turn this into a dataframe

In [6]:
#we can turn this into a pd Dataframe
df = pd.DataFrame()
df['txt'] = contents
df['sentiment'] = sentiments

In [7]:
df.head()

| | txt | sentiment |
|---|---|---|
| 0 | b'Bromwell High is a cartoon comedy. It ran at... | 1 |
| 1 | b'Homelessness (or Houselessness as George Car... | 1 |
| 2 | b'Brilliant over-acting by Lesley Ann Warren. ... | 1 |
| 3 | b'This is easily the most underrated film inn ... | 1 |
| 4 | b'This is not the typical Mel Brooks film. It ... | 1 |


Okay, cool. But we still don't really know how to deal with this. Computers aren't inherently able to understand text, so we'll need to get the "txt" column into a numerical form that we know how to work with in order to make predictions.

### Using sklearn

sklearn isn't really the best library for working with text data, so we'll keep this section relatively short. For most purposes you'll want to use NLTK or spaCy. But since you're familiar with sklearn, we'll start here.

The main thing that we'll be using from sklearn is CountVectorizer. This will take a corpus of text and turn each document into a "count vector." This count vector is essentially a histogram over the entire vocabulary (all words in the training set). As an example, consider the (fake) sentence "dog cat cat cat bear". Our vocab size is 3, so, ordering the vocabulary as (dog, cat, bear), the sentence is represented by the three-dimensional vector:

$$[1,3,1] $$

This is also called the bag of words representation. 
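
To make the counting concrete, here is a minimal sketch that builds the same vector by hand. It orders the vocabulary by first appearance purely for illustration; CountVectorizer orders its columns alphabetically instead.

```python
from collections import Counter

sentence = "dog cat cat cat bear"
tokens = sentence.split()

# vocabulary in order of first appearance: ['dog', 'cat', 'bear']
vocab = list(dict.fromkeys(tokens))

# count each token, then read the counts off in vocabulary order
counts = Counter(tokens)
count_vector = [counts[word] for word in vocab]
print(vocab, count_vector)
```

Note that the vector would be identical for "cat dog cat bear cat": word order is discarded.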

**Exercise 1:** what are the pros and cons of using this? Can you think of an alternative way of representing text at the sentence level? 

sklearn has a built-in count vectorizer with three main methods:
- fit: build the vocabulary from some iterable containing strings
- transform: use the existing vocabulary to transform the input into an N x V sparse matrix (N documents, V vocabulary words)
- fit_transform: fit on this data, and also transform it (same as calling fit then transform)


In [8]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

#by default countvectorizer will return a sparse array which is a special datatype for 
#arrays that are mostly 0s (to save space), but .toarray() will convert this back to a
#regular numpy array
cv.fit_transform(["cat dog dog dog cow"]).toarray()

array([[1, 1, 3]], dtype=int64)

As another example, consider the sentence "dog dog dog snail snail". Since "snail" is not in the original vocab that we fit the CountVectorizer with, it won't be part of the vector.

In [9]:
cv.transform(["dog dog dog snail snail"]).toarray()

array([[0, 0, 3]])
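
To see which column belongs to which word, one option is to inspect the fitted `vocabulary_` attribute. This is a small sketch using a separate vectorizer (named `cv_demo` here so it doesn't clash with `cv` above):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv_demo = CountVectorizer()
cv_demo.fit(["cat dog dog dog cow"])

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
word_to_column = {word: int(idx) for word, idx in cv_demo.vocabulary_.items()}
print(word_to_column)

# "snail" was never seen during fit, so it has no column and transform ignores it
row = cv_demo.transform(["dog dog dog snail snail"]).toarray()[0].tolist()
print(row)
```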

**Exercise 2**: Use CountVectorizer on the Stanford dataset. This will take a few seconds. You'll want to use max_features to limit the number of words that you consider, since rare words won't help you much.  

Limit the number of features to 10,000

Save the result as new_data

In [10]:
#solution:

new_data = 

### Unigrams vs Bigrams vs N-grams

As we explored above, using this "Bag of Words" representation throws away a lot of information in the sentence. One way of trying to preserve local information is to use bigrams. This just expands our vocabulary to include pairs of consecutive words. So in our example, "cat dog dog dog cow", we would have the following vocabulary and counts:
- cat x1 
- dog x3
- cow x1
- cat dog x1
- dog dog x2 
- dog cow x1

N-grams are the extension of this to word sequences of length N. For CountVectorizer, we specify the range of lengths to include (here, every length from 1 up to N) with:

ngram_range = (1, N) 

In [11]:
cv = CountVectorizer(ngram_range = (1,2))
cv.fit_transform(["cat dog dog dog cow"]).toarray()

array([[1, 1, 1, 3, 1, 2]], dtype=int64)
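
The same counts can be reproduced by hand. This is a minimal sketch; `ngrams` is a hypothetical helper written for this example, not an sklearn function.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "cat dog dog dog cow".split()

# ngram_range = (1, 2) corresponds to unigrams plus bigrams
counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))

# sort the features alphabetically to line up with CountVectorizer's column order
features = sorted(counts)
vector = [counts[f] for f in features]
print(features)
print(vector)
```

The sorted feature list explains the column order of the array above: "cat", "cat dog", "cow", "dog", "dog cow", "dog dog".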

### Building a simple baseline model

For text classification, simple models trained on a large amount of data perform quite well. A pretty standard baseline is the Naive Bayes Model. We'll go through some of the math here. If you're not interested in it, you can skip it.

Suppose that $y_i$ is your class label (in this case $y_i$ is either 0 or 1 for negative and positive). $i$ just indexes what datapoint you're looking at. We'll say that $i$ ranges from $1$ to $N$ (in other words you have $N$ sentences in your training dataset).

Also we have $\mathbf{x}_i$ which is the sentence corresponding to the label $y_i$. We use bold to denote the fact that $\mathbf{x}_i$ is a vector where each element is a word.

Naive Bayes models the joint probability, $p(y_i, \mathbf{x}_i)$. We do this by parameterizing the prior probability of having a certain class, and then parameterizing the probability of generating a certain sentence given that class. Using the product rule of probability, we can write this as:

$$p(y_i, \mathbf{x}_i) = p(\mathbf{x}_i | y_i) p(y_i) = p(x_{i1},\dots,x_{iT_i} | y_i) p(y_i) $$

where $x_{it}$ is the word at position $t$, and $T_i$ is the length of sentence $i$. Now we apply a huge assumption (which seems like it is just wrong, but works decently in practice). That is, we assume that $x_{i1},\dots,x_{iT_i}$ are conditionally independent given the class, $y_i$. This lets us factor the probability as:

$$p(y_i, \mathbf{x}_i) = p(y_i)\prod_{t=1}^{T_i} p(x_{it}| y_i)  $$ 

We parameterize $p(x_{it}|y_i)$ as a Multinoulli random variable, i.e.:

$$p(x_{it} = dog | y_i = 0) =  \pi_{0,dog} $$ 

Where $\pi_{0,dog}$ is the probability that "dog" is generated given that we're in class 0 (negative). So we need 2*Vocab_size parameters for this, since we need a probability for every word in every class. With the 10,000-word vocabulary from Exercise 2, that's 20,000 parameters. We also parameterize the prior probabilities as a Bernoulli random variable, which adds only one extra parameter:

$$p(y_i = 0) = \theta_0$$
$$p(y_i = 1) = 1 - \theta_0$$

Given this model, it's pretty straightforward to get a maximum likelihood estimate for all the parameters, i.e. the $\theta$s and $\pi$s. If you're not familiar with maximum likelihood, it just means that we choose $\theta$ and $\pi$s to be the values that make the observed data have the highest likelihood. This turns the learning procedure into a simple optimization problem. 

If we go through all the math to solve this, it turns out that the optimal $\theta_0$ is the proportion of sentences that are in class $0$ (the negative class), and the optimal $\pi_{0,dog}$ is just the proportion of words in class $0$ that are the word "dog". Likewise, $\pi_{1,dog}$ is just the proportion of words in class $1$ that are the word "dog". 
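
Written out, those estimates look as follows. The count symbols $N_c$ and $n_{c,w}$ are notation introduced here for illustration: $N_c$ is the number of training sentences with label $c$, and $n_{c,w}$ is the number of times word $w$ occurs across all sentences with label $c$.

$$\hat{\theta}_0 = \frac{N_0}{N}, \qquad \hat{\pi}_{c,w} = \frac{n_{c,w}}{\sum_{w'} n_{c,w'}}$$

As a made-up example, if the class $0$ sentences contain 1,000 words in total and 5 of them are "dog", then $\hat{\pi}_{0,dog} = 5/1000 = 0.005$.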

So "training" the model just means computing these proportions from the training data. But given a sentence, how do we predict whether it's positive or negative?

By Bayes' rule, the posterior probability of each class is proportional to the likelihood times the prior:

$$p(y_i = 0 | \mathbf{x}_i ) \propto p(\mathbf{x}_i|y_i = 0) p(y_i = 0) $$
$$p(y_i = 1 | \mathbf{x}_i ) \propto p(\mathbf{x}_i|y_i = 1) p(y_i = 1) $$

To predict we just take the higher of these values. 
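
Here is a minimal sketch of that whole procedure on a toy dataset, in case it helps before trying the exercise. The training sentences are made up, and two details are assumptions that go beyond the derivation above: add-one smoothing (so unseen word/class pairs don't get probability zero) and working with log-probabilities (so products of many small numbers don't underflow).

```python
import math
from collections import Counter

# toy training set (made up for illustration): 1 = positive, 0 = negative
train = [("good great good", 1), ("great fun", 1),
         ("bad awful bad", 0), ("awful boring", 0)]

vocab = sorted({w for text, _ in train for w in text.split()})
classes = [0, 1]

# prior: proportion of sentences in each class
n_docs = Counter(label for _, label in train)
log_prior = {c: math.log(n_docs[c] / len(train)) for c in classes}

# likelihood: proportion of words in each class that are a given word,
# with add-one smoothing (an assumption, not part of the derivation above)
word_counts = {c: Counter() for c in classes}
for text, label in train:
    word_counts[label].update(text.split())

log_pi = {}
for c in classes:
    total = sum(word_counts[c].values()) + len(vocab)
    log_pi[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}

def predict(text):
    """Return the class with the larger log p(y) + sum_t log p(x_t | y)."""
    scores = {}
    for c in classes:
        # words outside the vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_pi[c][w] for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

print(predict("good fun"), predict("boring bad"))
```

Taking logs doesn't change which class wins, because the logarithm is monotonic.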

**Optional (hard) exercise:** implement Naive Bayes 

In sklearn, this is easy

In [12]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

We can fit the model with the normal sklearn syntax (this needs new_data from Exercise 2 to be defined)

In [13]:
nb.fit(new_data, df['sentiment']) 


This code splits the data into train and test sets for evaluating the model. We'll make a random permutation of indices and then use that to randomly shuffle our data. We'll take a 70:30 split.

In [None]:
#np.random.permutation(N) makes a random permutation of the indices 0...N-1
perm = np.random.permutation(len(df))

#split this permutation by a 70:30 split
trn = perm[:int(0.7*len(perm))]
tst = perm[int(0.7*len(perm)):]

#slice the processed data into train and test sets; alternatively we could have done this with 
#sklearn functions
x_train = new_data[trn]
x_tst = new_data[tst]
y_train = df.sentiment[trn]
y_tst = df.sentiment[tst]
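
For reference, the sklearn alternative mentioned in the comment above might look like the following sketch. It uses small made-up arrays (and `_demo` variable names) in place of new_data and df.sentiment, and `random_state=0` is an arbitrary choice to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data: 10 rows of features and 10 labels (made up for illustration)
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# test_size=0.3 gives a 70:30 split; rows and labels are shuffled together
x_train_demo, x_tst_demo, y_train_demo, y_tst_demo = train_test_split(
    features, labels, test_size=0.3, random_state=0
)
print(x_train_demo.shape, x_tst_demo.shape)
```

`train_test_split` also accepts sparse matrices, so it should work on the output of CountVectorizer as well.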

Checking the shape of our splits: the test set has 7,500 data points and the training set has 17,500. We have 10,000 features since there are 10,000 words in our vocab.

In [None]:
y_tst.shape

In [None]:
y_train.shape

In [None]:
x_train.shape

In [None]:
x_tst.shape

Fitting the Naive Bayes model on our training x and y values

In [None]:
nb = MultinomialNB()
nb.fit(x_train, y_train) 

Score will give us our classification accuracy on test

In [None]:
nb.score(x_tst,y_tst)

That's pretty good for such a simple model. Obviously, we'll do a bit better on the data that we trained on.

In [None]:
nb.score(x_train,y_train)

**Exercise 3:** Improve the score somehow

## (Optional) building a more powerful model using pytorch

### Using torchtext (preprocessing)

We're going to save our tabular dataset from above as a csv and then load it with torchtext, because I couldn't think of a better way to do this off the top of my head. Torchtext is a library for loading/dealing with datasets for pytorch. And pytorch is a library for implementing neural networks (similar to tensorflow). Note that the Field, TabularDataset and BucketIterator classes used below come from older torchtext releases and are not available in recent ones.

In [None]:
import torchtext
from torchtext.vocab import Vectors, GloVe
import torchtext.datasets as datasets


df.iloc[trn,:].to_csv('saved_dataset_train.csv',index = False,header = False)
df.iloc[tst,:].to_csv('saved_dataset_test.csv',index = False,header = False)

We'll start by initializing two torchtext Field objects, which will hold label and text vocabularies. We'll load in the two datasets using these fields. 

In [None]:
TEXT = torchtext.data.Field()
LABEL = torchtext.data.Field(sequential = False)
pos_train = torchtext.data.TabularDataset(path='saved_dataset_train.csv', format='csv',fields=[('txt', TEXT),
 ('sentiment', LABEL)])

pos_test = torchtext.data.TabularDataset(path='saved_dataset_test.csv', format='csv',fields=[('txt', TEXT),
 ('sentiment', LABEL)])

Building the vocabulary using the training dataset

In [146]:
TEXT.build_vocab(pos_train)
LABEL.build_vocab(pos_train)
print('len(TEXT.vocab)', len(TEXT.vocab))
print('len(LABEL.vocab)', len(LABEL.vocab))

len(TEXT.vocab) 235807
len(LABEL.vocab) 3


In [150]:
train_iter, test_iter = torchtext.data.BucketIterator.splits(
    (pos_train,pos_test), batch_size=10, device=-1,sort_key=lambda x: len(x.txt),repeat = False)

In [159]:
batch = next(iter(train_iter))
batch

<torchtext.data.batch.Batch at 0x15b3c9ef0>

We'll also make use of pretrained fastText word embeddings (from Facebook AI Research), trained on Simple English Wikipedia.

In [234]:
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'
TEXT.vocab.load_vectors(vectors=Vectors('wiki.simple.vec', url=url))

In [235]:
print("Word embeddings size ", TEXT.vocab.vectors.size())
print("Word embedding of 'follows', first 10 dim ", TEXT.vocab.vectors[TEXT.vocab.stoi['follows']][:10])

Word embeddings size  torch.Size([235807, 300])
Word embedding of 'follows', first 10 dim  
 0.3925
-0.4770
 0.1754
-0.0845
 0.1396
 0.3722
-0.0878
-0.2398
 0.0367
 0.2800
[torch.FloatTensor of size 10]



Okay cool. Now that all the preprocessing stuff is done, we can focus on actually building a model. We're going to build a convolutional neural network in pytorch. This involves building a CNN class that inherits from nn.Module. We'll implement the model from this paper by Yoon Kim: http://aclweb.org/anthology/D/D14/D14-1181.pdf 
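
Before the full model, here is a minimal sketch of the shape bookkeeping for one convolution followed by max-pooling over the sentence. The sizes are hypothetical and chosen only to keep the example small, the embeddings are random rather than pretrained, and the code is written against a recent PyTorch rather than the older version used in this notebook.

```python
import torch
import torch.nn as nn

# hypothetical sizes chosen only for the illustration
seq_len, batch_size, vocab_size, embedding_dim, hidden_size = 12, 4, 50, 8, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
conv = nn.Conv1d(embedding_dim, hidden_size, kernel_size=3)
pool = nn.AdaptiveMaxPool1d(1)

# a batch of word indices laid out as (seq_len, batch_size)
word_ids = torch.randint(0, vocab_size, (seq_len, batch_size))

embeds = embedding(word_ids)          # (seq_len, batch_size, embedding_dim)
embeds = embeds.permute(1, 2, 0)      # (batch_size, embedding_dim, seq_len), the layout Conv1d expects
features = conv(embeds)               # (batch_size, hidden_size, seq_len - 3 + 1)
pooled = pool(features).squeeze(2)    # (batch_size, hidden_size): max over positions

print(embeds.shape, features.shape, pooled.shape)
```

Because the max is taken over all positions, sentences of different lengths end up as vectors of the same size, which is what lets a single linear layer sit on top.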

In [244]:
VOCAB_SIZE = len(TEXT.vocab)

In [261]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable



class customConvNet(nn.Module):
    
    def __init__(self,input_embeddings, embedding_dim = 300, hidden_size = 100, vocab_size = 235807):
        super(customConvNet,self).__init__()  
        #embedding layer initialized with the pretrained word vectors
        embedding = nn.Embedding(vocab_size,embedding_dim)
        embedding.weight = nn.Parameter(input_embeddings)
        self.embedding = embedding
        #three parallel convolutions over windows of 3, 4 and 5 consecutive words
        self.conv3 = nn.Conv1d(embedding_dim,hidden_size,kernel_size = 3,stride = 1)
        self.conv4 = nn.Conv1d(embedding_dim,hidden_size,kernel_size = 4,stride = 1)
        self.conv5 = nn.Conv1d(embedding_dim,hidden_size,kernel_size = 5,stride = 1)
        #max over all positions in the sentence, leaving one value per filter
        self.maxpool = nn.AdaptiveMaxPool1d(1)
        self.dropout = nn.Dropout(0.5)
        
        #3 outputs to match len(LABEL.vocab), which is 3 above
        self.linear = nn.Linear(3*hidden_size,3)
        
    def forward(self, input_):
        #apply embedding layer, then reorder (seq_len, batch, embedding_dim) -> (batch, embedding_dim, seq_len)
        embeds = self.embedding(input_).permute(1,2,0).contiguous()
        #apply convolution layers
        out1 = self.conv3(embeds)
        out2 = self.conv4(embeds)
        out3 = self.conv5(embeds)
        
        #apply max pooling layers
        out1 = self.maxpool(out1).squeeze(2)
        out2 = self.maxpool(out2).squeeze(2)
        out3 = self.maxpool(out3).squeeze(2)
        #concatenate the outputs; ending up with a batch_size x 3*hidden_size tensor
        out = torch.cat((out1,out2,out3),dim = 1)
        out = self.dropout(out)
        return self.linear(out)

Let's initialize an instance of the neural network, and make sure it produces output when we feed in a batch of training data.

In [262]:
cn = customConvNet(TEXT.vocab.vectors)

In [263]:
batch = next(iter(train_iter))

In [264]:
cn(batch.txt)

Variable containing:
-0.2616 -0.0998 -0.0296
-0.2551 -0.4493 -0.1118
-0.1072 -0.2760 -0.0689
-0.1006 -0.1326  0.0534
-0.1832 -0.1641  0.0457
-0.0593 -0.0350 -0.1146
-0.0945 -0.3983 -0.0943
 0.1247  0.1473 -0.1453
-0.1965 -0.1806 -0.0194
 0.1082 -0.2329 -0.0160
[torch.FloatTensor of size 10x3]

Okay, that looks reasonable. The next thing we have to do is write a training loop that will optimize the convolutional neural network's parameters using stochastic gradient descent.

In [275]:
from tqdm import tqdm_notebook

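def model_train(model,train_iter,num_epochs):
    #cross-entropy loss on the raw scores, optimized with plain SGD
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=.01)
    
    for epoch in tqdm_notebook(range(num_epochs),desc = 'Epoch'):
        #running sum of batch losses and number of batches seen this epoch
        total_loss = 0 
        count = 0
        #put the model in training mode so that dropout is active
        model.train()
        
        for batch in tqdm_notebook(train_iter, desc = 'batch'):
            #clear the gradients left over from the previous batch
            optimizer.zero_grad()
            txt = batch.txt
            lbl = batch.sentiment
            
            #forward pass, then backpropagate and take one SGD step
            loss = criterion(model(txt),lbl)
            total_loss += loss.data
            count += 1
            loss.backward()
            optimizer.step()
            
            #report the running average loss every 50 batches
            if count % 50 == 1:
                print("Average NLL: ", (total_loss/count)) 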

In [276]:
model_train(cn,train_iter,1)

Average NLL:  
 0.7661
[torch.FloatTensor of size 1]







Average NLL:  
 0.7285
[torch.FloatTensor of size 1]

Average NLL:  
 0.7208
[torch.FloatTensor of size 1]

Average NLL:  
 0.7158
[torch.FloatTensor of size 1]

Average NLL:  
 0.7130
[torch.FloatTensor of size 1]

Average NLL:  
 0.7111
[torch.FloatTensor of size 1]

Average NLL:  
 0.7071
[torch.FloatTensor of size 1]

Average NLL:  
 0.7059
[torch.FloatTensor of size 1]

Average NLL:  
 0.7038
[torch.FloatTensor of size 1]

Average NLL:  
 0.7023
[torch.FloatTensor of size 1]

Average NLL:  
 0.7003
[torch.FloatTensor of size 1]

Average NLL:  
 0.6985
[torch.FloatTensor of size 1]

Average NLL:  
 0.6972
[torch.FloatTensor of size 1]

Average NLL:  
 0.6970
[torch.FloatTensor of size 1]

Average NLL:  
 0.6960
[torch.FloatTensor of size 1]

Average NLL:  
 0.6954
[torch.FloatTensor of size 1]

Average NLL:  
 0.6943
[torch.FloatTensor of size 1]

Average NLL:  
 0.6931
[torch.FloatTensor of size 1]

Average NLL:  
 0.6911
[torch.FloatTensor of size 1]

Average NLL:  
 0.6903
[torc

KeyboardInterrupt: 
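
For context when reading these numbers: a classifier that assigns probability 0.5 to the correct class on every example has an average negative log-likelihood of ln 2. A quick sanity-check sketch (not part of the original training code):

```python
import math

# negative log-likelihood of a prediction that puts probability 0.5 on the true class
chance_nll = -math.log(0.5)
print(round(chance_nll, 4))
```

Losses around 0.69 are therefore roughly what a model splitting its probability evenly between the two real classes would score, which may suggest the network had not learned much yet when the run was interrupted.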