# Language Modeling & Sentiment Analysis of IMDB movie reviews
In this small notebook, we will look at what it takes to reach state-of-the-art results and introduce a different way of thinking about model improvements.

When you read in the media about new **State of the Art** models for different tasks, the gaps may not seem big: **94%** to **96%**. Looking at the advancement as a **2%** increase is not the right approach.

A better approach: once any model exceeds **50%** at a given task, measure the **error rate** instead.

So if the last model had **94%** accuracy, this means it had **6% error**. 

So if your new model achieves **96%** accuracy, you have brought the error down to **4%**. That is a **33%** relative reduction in error (2 of the 6 percentage points are gone), which is very worthy of news.
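
The error-rate comparison above can be sketched in a few lines. The helper name `relative_error_reduction` is hypothetical and is not part of fastai.

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction by which the error rate shrinks when accuracy moves from old_acc to new_acc."""
    old_err = 1.0 - old_acc
    new_err = 1.0 - new_acc
    return (old_err - new_err) / old_err

# 94% -> 96% accuracy means the error rate goes from 6% to 4%
reduction = relative_error_reduction(0.94, 0.96)
print(f"{reduction:.1%}")
```

For 94% to 96% this comes out at roughly one third, so the new model makes about a third fewer mistakes than the old one.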

In [1]:
from fastai import *
from fastai.text import *

In [2]:
bs = 128
bs = bs/2

In [3]:
bs = bs/2
bs

32.0

In [4]:
# getting path to our data
path = untar_data(URLs.IMDB)

In [5]:
path.ls()

[WindowsPath('C:/Users/dmber/.fastai/data/imdb/imdb.vocab'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/lm_databunch'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/models'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/README'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/test'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/tmp_clas'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/tmp_lm'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/train'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb/unsup')]

## Language Model Dataset
A new concept here is that we also include the ```unsup``` directory, which holds a dataset for **unsupervised learning**.

This dataset contains only reviews, without any labels.

We are interested in it because of the power of **semi-supervised learning**, which combines *supervised learning* with *unsupervised learning*.

This fits perfectly into our NLP pipeline, because the first model, the **language model**, only learns to predict the next word of a sentence. The targets come from the text itself, so no labels are needed and the unlabeled reviews can be used for training.
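
To see why next-word prediction needs no labels, here is a minimal sketch of how training pairs fall out of raw text. The function `next_word_pairs` and the whitespace tokenizer are assumptions made for illustration; fastai's own tokenization and batching are considerably richer.

```python
def next_word_pairs(text):
    """Turn raw text into (context, next_word) pairs; the text is its own label."""
    tokens = text.lower().split()  # naive whitespace tokenizer, for illustration only
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs("This movie was surprisingly good")
for context, target in pairs:
    print(context, "->", target)
```

Every review in ```unsup``` can be sliced up this way, which is why the unlabeled data is still useful for the language model.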

In [5]:
# Making sure our GPU is set up
import torch

print(torch.cuda.device_count())
device = torch.cuda.current_device()

# printing device name
print(torch.cuda.get_device_name(device))

# setting the device
torch.cuda.set_device(device)

1
GeForce RTX 2070 with Max-Q Design


In [6]:
torch.cuda.empty_cache()

In [7]:
# Creating our data_lm using block api
data_lm = (TextList.from_folder(path)
                   .filter_by_folder(include=['train', 'test', 'unsup'])
                   .split_by_rand_pct(0.1, seed=42)
                   .label_for_lm()
                   .databunch(bs=int(bs), num_workers=1))  # bs is a float after the divisions above

In [8]:
# Saving our language model databunch for later use
data_lm.save('lm_databunch')

In [5]:
# # If we want to load this just run this cell
data_lm = load_data(path, 'lm_databunch', bs=int(bs))

In [6]:
# Creating our language model - AWD_LSTM
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=1.).to_fp16()

In [10]:
lr = 1e-2
lr *= bs/48

In [7]:
lr = 0.026666666666666665
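
The `lr *= bs/48` step above scales the learning rate linearly with the batch size. A minimal sketch of that rule follows; the helper `scale_lr` and its `reference_bs` parameter are hypothetical names, and 48 is simply the reference batch size used in the cell above.

```python
def scale_lr(base_lr, bs, reference_bs=48):
    """Scale a learning rate linearly with batch size, relative to a reference batch size."""
    return base_lr * bs / reference_bs

# Compare the resulting learning rate for a few batch sizes
for batch_size in (32, 64, 128):
    print(batch_size, scale_lr(1e-2, batch_size))
```

With a base rate of `1e-2`, a batch size of 128 gives approximately the hard-coded value used in this notebook.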

In [8]:
learn_lm.fit_one_cycle(1, lr, moms=(0.8, 0.7))

epoch,train_loss,valid_loss,accuracy,time
0,4.606462,4.173866,0.280831,17:42


In [None]:
# # Training the model with unfrozen layers
# learn_lm.unfreeze()
# learn_lm.fit_one_cycle(10, lr/10, moms=(0.8, 0.7))

In [13]:
# Saving the model
learn_lm.save('fine_tuned_10')
learn_lm.save_encoder('fine_tuned_enc_10')

# Classifier Model
Now we will create a classifier dataset and a model, then load our saved language model encoder into the classifier model.

In [15]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
                     .split_by_folder(valid='test')
                     .label_from_folder(classes=['neg', 'pos'])
                     .databunch(bs=int(bs), num_workers=1))

In [19]:
# saving the new dataset for classification
data_clas.save('imdb_textlist_class')

In [7]:
# # for loading the dataset
data_clas = load_data(path, 'imdb_textlist_class', bs=int(bs), num_workers=1)

In [8]:
# Creating our classifier model
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5).to_fp16()
learn_c.load_encoder('fine_tuned_enc_10') # replacing encoder
learn_c.freeze() # freezing our model

## ```drop_mult```
When using the ```AWD_LSTM``` architecture you may have noticed the ```drop_mult``` argument. It is a multiplier applied to the dropout of each layer of the ```AWD_LSTM```.

fastai provides default dropout values, and whatever we pass in this argument is multiplied by those defaults.

So if you want the full defaults, just leave ```drop_mult=1.```.
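
The multiplication described above can be sketched as follows. The dropout values in `default_dropouts` are illustrative assumptions and `apply_drop_mult` is a hypothetical helper, so check the configuration of your installed fastai version for the real defaults.

```python
# Illustrative dropout defaults (assumed values, not read from fastai)
default_dropouts = {
    "output_p": 0.1,
    "hidden_p": 0.15,
    "input_p": 0.25,
    "embed_p": 0.02,
    "weight_p": 0.2,
}

def apply_drop_mult(config, drop_mult):
    """Return a copy of config with every dropout probability (keys ending in '_p') scaled."""
    return {k: v * drop_mult if k.endswith("_p") else v for k, v in config.items()}

scaled = apply_drop_mult(default_dropouts, 0.5)
print(scaled)
```

With ```drop_mult=0.5``` every dropout probability is halved, and with ```drop_mult=1.``` the defaults are used unchanged.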

Another thing to note: when training the ```language model```, it has been found that a model trained with high dropout, even though it has lower accuracy at predicting the next word, actually performs better once its ```encoder``` is transferred into the classifier network.

## More regularization
What this means is that you want a lot of **regularization** in the language model, because it makes the encoder you hand over to the classifier more robust.

In [9]:
lr = 2e-2
lr *= bs/48  # same batch-size scaling as for the language model

In [11]:
lr = 0.026

In [25]:
learn_c.fit_one_cycle(1, lr, moms=(0.8, 0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.449722,0.342098,0.85456,01:49


In [26]:
learn_c.save('1')

In [28]:
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8, 0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.301872,0.220684,0.91224,02:09
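
The `slice(lr/(2.6**4), lr)` argument above gives the earliest layers the smallest learning rate and the last layers the largest. The sketch below assumes the rates are spread geometrically across the layer groups and that the classifier has five groups; `even_mults` here is a standalone reimplementation for illustration, not an import from fastai.

```python
def even_mults(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the previous one."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

lr = 0.026
n_groups = 5  # assumption: the classifier is split into five layer groups
group_lrs = even_mults(lr / (2.6 ** 4), lr, n_groups)
for i, group_lr in enumerate(group_lrs):
    print(f"group {i}: {group_lr:.6f}")
```

Under these assumptions each group trains 2.6 times faster than the one before it, so the pretrained early layers change the least.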


In [30]:
learn_c.save('2')

In [9]:
learn_c.load('2')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (25000 items)
x: TextList
xxbos xxmaj story of a man who has unnatural feelings for a pig . xxmaj starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane , violent mob by the crazy xxunk of it 's singers . xxmaj unfortunately it stays absurd the xxup whole time with no general narrative eventually making it just too off putting . xxmaj even those from the era should be turned off . xxmaj the cryptic dialogue would make xxmaj shakespeare seem easy to a third grader . xxmaj on a technical level it 's better than you might think with some good cinematography by future great xxmaj vilmos xxmaj zsigmond . xxmaj future stars xxmaj sally xxmaj kirkland and xxmaj frederic xxmaj forrest can be seen briefly .,xxbos xxmaj airport ' 77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman xxmaj philip xxmaj steven

In [12]:
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(1, slice(lr/2/(2.6**4),lr/2), moms=(0.8, 0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.241614,0.182672,0.9302,02:48


In [13]:
learn_c.save('3')

In [12]:
learn_c.load('3')

In [13]:
learn_c.unfreeze()
learn_c.fit_one_cycle(2, slice(lr/10/(2.6**4), lr/10), moms=(0.8, 0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.2075,0.176794,0.93236,05:17
1,0.189108,0.173498,0.93452,05:59


In [14]:
learn_c.save('clas')