# IMDB

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""
from fastai.text import *
from fastai.version import __version__
print('fastai', __version__)

Rough steps:
- train a language model
    1. training data for lang model
    1. create a lang model
    1. train a lang model
    1. finetune a lang model
    1. export encoder
    1. predict next words
- train a classifier model
    1. training data for classifier model
    1. create a classifier model
    1. load encoder from language model
    1. train the model
    1. finetune layer by layer
    1. unfreeze all and train
    1. predict pos vs neg

## The data set

The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) was curated by Andrew Maas et al. and contains a total of 100,000 reviews from IMDB. 25,000 of them are labelled as positive or negative for training, and another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled data (but we will find a use for them nonetheless).

In [None]:
# path = untar_data(URLs.IMDB)
path = Path('/home/paperspace/.fastai/data/imdb')
path.ls()

Dataset directory structure (training set), where each review is a separate text file:

- `pos`: positive reviews
- `neg`: negative reviews
- `unsup`: unlabeled reviews
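
The folder layout described above can be checked by counting the text files in each folder. The sketch below is only an illustration: `count_reviews` is a hypothetical helper, and it runs against a tiny temporary stand-in tree (assumed to mirror `train/pos`, `train/neg`, `train/unsup`) rather than the real download.

```python
import tempfile
from pathlib import Path


def count_reviews(train_dir):
    """Return {folder name: number of .txt files} for each subfolder of train_dir."""
    return {
        sub.name: sum(1 for _ in sub.glob("*.txt"))
        for sub in sorted(train_dir.iterdir())
        if sub.is_dir()
    }


# Build a tiny stand-in tree so the sketch runs without the real dataset.
# The folder names follow the description above; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for label, n_files in [("pos", 3), ("neg", 2), ("unsup", 4)]:
        folder = train_dir / label
        folder.mkdir(parents=True)
        for i in range(n_files):
            (folder / f"{i}_0.txt").write_text("placeholder review text")
    demo_counts = count_reviews(train_dir)

print(demo_counts)
```

On the real dataset, the same helper could be pointed at `path/'train'`, assuming that is where the three folders live on your machine.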

### Language model data

[text.data docs](https://docs.fast.ai/text.data.html)

In [None]:
# Set batch size to 16 to avoid running out of GPU RAM later
bs = 16

In [None]:
# Save
# data_lm.save('data_lm.pkl')

In [None]:
# Load language model data
data_lm = load_data(path, fname='data_lm.pkl', bs=bs)

### Language model

In [None]:
lr = 1e-2
# Momentum will be explained in a later lesson.
moms = (0.8, 0.7)

#### Fit 1 cycle
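
To give a feel for what the `lr` and `moms` values control, here is a rough, library-free sketch of a one-cycle schedule: the learning rate warms up to `lr` and then decays, while momentum moves from `moms[0]` down to `moms[1]` and back. The defaults below (`pct_start=0.3`, `div_factor=25`, cosine annealing) are assumptions about how fastai v1 configures this, and `one_cycle` is a hypothetical helper, not the library's code.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation: returns start at pct=0 and end at pct=1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle(step, total_steps, lr_max, moms,
              pct_start=0.3, div_factor=25.0, final_div=25.0 * 1e4):
    """Return (lr, momentum) at `step` under the assumed one-cycle defaults."""
    warmup = round(total_steps * pct_start)
    if step < warmup:
        # Phase 1: learning rate rises while momentum falls.
        pct = step / warmup
        return (cos_anneal(lr_max / div_factor, lr_max, pct),
                cos_anneal(moms[0], moms[1], pct))
    # Phase 2: learning rate decays while momentum recovers.
    pct = (step - warmup) / (total_steps - warmup)
    return (cos_anneal(lr_max, lr_max / final_div, pct),
            cos_anneal(moms[1], moms[0], pct))


start_lr, start_mom = one_cycle(0, 100, 1e-2, (0.8, 0.7))
peak_lr, peak_mom = one_cycle(30, 100, 1e-2, (0.8, 0.7))
end_lr, end_mom = one_cycle(100, 100, 1e-2, (0.8, 0.7))
print(start_lr, peak_lr, end_lr)
print(start_mom, peak_mom, end_mom)
```

The point of the sketch is the shape: the highest learning rate coincides with the lowest momentum, and both return towards their starting regime at the end of the cycle.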

In [None]:
# Don't execute during practice

In [None]:
# Save
# learn_lm.save('lm-stage1-fp32')

In [None]:
# Load
learn_lm.load('lm-stage1-fp32')

#### Finetune language model with our dataset

In [None]:
lr = 1e-3
moms = (0.8, 0.7)

#### Fit 1 cycle

In [None]:
# Don't execute during practice. Takes too long.

In [None]:
# Save
# learn_lm.save('lm-stage2-finetuned-fp32')

In [None]:
# Load
learn_lm.load('lm-stage2-finetuned-fp32')

#### Export encoder: `lm-finetuned-encoder-fp32`

In [None]:
# Don't execute during practice

#### Next word prediction
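
The `temperature` argument used in this section controls how adventurous the sampling is. A common way to implement it is to divide the model's scores by the temperature before the softmax: values below 1 sharpen the distribution towards the most likely words, values above 1 flatten it. The sketch below is a standalone illustration of that idea with a made-up vocabulary and scores; it is not the fastai implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(vocab, logits, temperature, rng):
    """Draw one word according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical vocabulary and scores, purely for illustration.
vocab = ["great", "good", "boring", "purple"]
logits = [2.0, 1.5, 0.5, -1.0]

probs_plain = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.75)
word = sample_next(vocab, logits, 0.75, random.Random(0))

print([round(p, 3) for p in probs_plain])
print([round(p, 3) for p in probs_sharp])
print(word)
```

With `temperature=0.75` the top-scoring word gets a larger share of the probability mass than at `temperature=1.0`, so generated text tends to be more conservative.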

In [None]:
# Load the finetuned language model weights (saved earlier with learn_lm.save)
# learn_lm.load('lm-stage2-finetuned-fp32')

# Prediction args
TEXT = "I liked this movie because"
N_WORDS = 100
N_SENTENCES = 1

# Make predictions
print("\n".join(learn_lm.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

---

## Classifier: training data

This time, we'll create a databunch **with** labels.

In [None]:
# Don't execute during practice

In [None]:
# Save
# data.save('data_clas.pkl')

In [None]:
# Load
data = load_data(path, fname='data_clas.pkl', bs=16)

## Classifier: create a model

Please restart the kernel.

In [None]:
# Restart the kernel and run these:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""
from fastai.text import *

path = Path('/home/paperspace/.fastai/data/imdb')
data = load_data(path, fname='data_clas.pkl', bs=16)

In [None]:
data.batch_size

In [None]:
# Create a classifier learner

In [None]:
# load language model encoder

In [None]:
torch.cuda.memory_allocated()

#### Train the classifier

In [None]:
lr = 2e-2
moms = (0.8, 0.7)

In [None]:
# Fit 1 cycle
# Don't execute during practice


In [None]:
# Save
# learn_clas.save('clas-step-1')

In [None]:
# Load
learn_clas.load('clas-step-1')

In [None]:
torch.cuda.memory_allocated()

### Finetune classifier model: unfreeze layer by layer
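
The idea in this part is gradual unfreezing: start with only the last layer group trainable, then open up one more group at a time. The sketch below mimics the slicing semantics this relies on (everything before the index is frozen, everything from the index onwards is trainable). It assumes the model is split into five layer groups, which is an assumption about the AWD-LSTM classifier, and `trainable_mask` is a hypothetical helper.

```python
def trainable_mask(n_groups, freeze_to):
    """Groups before index `freeze_to` are frozen; the rest are trainable."""
    groups = list(range(n_groups))
    trainable = set(groups[freeze_to:])
    return [g in trainable for g in groups]


N_GROUPS = 5  # assumption: number of layer groups in the classifier

stages = {
    "last group only": trainable_mask(N_GROUPS, -1),
    "last two groups": trainable_mask(N_GROUPS, -2),
    "last three groups": trainable_mask(N_GROUPS, -3),
    "everything": trainable_mask(N_GROUPS, 0),
}

for name, mask in stages.items():
    print(f"{name:>18}: {mask}")
```

Each stage trains a little more of the network, so the pretrained lower layers are disturbed as late and as gently as possible.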

#### Freeze all except the last two layer groups
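
The `slice(5e-3/(2.6**4), 5e-3)` used in this step sets discriminative learning rates: the last layer group gets the largest rate and earlier groups get progressively smaller ones. As a rough sketch, assuming five layer groups and geometric spacing between the two ends of the slice (which is where the exponent 4 would come from), the per-group rates work out as follows. `even_mults` here is a plain-Python stand-in written for this example.

```python
def even_mults(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the previous."""
    ratio = (stop / start) ** (1 / (n - 1))
    return [start * ratio ** i for i in range(n)]


N_GROUPS = 5  # assumption: number of layer groups in the classifier
lr = slice(5e-3 / (2.6 ** 4), 5e-3)

group_lrs = even_mults(lr.start, lr.stop, N_GROUPS)
for i, group_lr in enumerate(group_lrs):
    print(f"group {i}: lr = {group_lr:.2e}")
```

Under these assumptions, each group's learning rate is 2.6 times that of the group before it, ending at `5e-3` for the last group.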

In [None]:
lr = slice(5e-3/(2.6**4),5e-3) # We'll learn more about this later

In [None]:
# Don't execute during practice
# Fit 1 cycle


In [None]:
# Save
# learn_clas.save('clas-step-2')

In [None]:
# Load
learn_clas.load('clas-step-2')

In [None]:
torch.cuda.memory_allocated()

#### Freeze all except the last three layer groups

In [None]:
lr = slice(5e-3/(2.6**4),5e-3) # We'll learn more about this later

In [None]:
# Don't execute during practice


In [None]:
# Save
# learn_clas.save('clas-stage-3')

In [None]:
# Load
learn_clas.load('clas-stage-3')

In [None]:
torch.cuda.memory_allocated()

#### Unfreeze all layers

At this point, it's a good idea to restart the kernel and load the model from file to avoid running out of GPU memory.

In [None]:
# Restart and run these:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""
from fastai.text import *

path = Path('/home/paperspace/.fastai/data/imdb')
data = load_data(path, fname='data_clas.pkl', bs=16)
learn_clas = text_classifier_learner(data, arch=AWD_LSTM, drop_mult=0.5)
learn_clas.load('clas-stage-3')

In [None]:
lr = slice(1e-3/(2.6**4),1e-3)
moms = (0.8, 0.7)

In [None]:
# Fit two cycles
# Don't execute during practice

In [None]:
# Save
# learn_clas.save('clas-stage-4')

In [None]:
# Load
learn_clas.load('clas-stage-4')

In [None]:
torch.cuda.memory_allocated()

### Make predictions