In [19]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Plan B: AWD_LSTM Transfer Learning with FastAI

In [20]:
from pathlib import Path

from fastai import *
from fastai.basic_data import *
from fastai.text import *
from torch import *
import pandas as pd

## Preparing the data

Use pandas to conveniently load the three `TSV` data files as `DataFrame`s.

In [21]:
path =  Path('data')

In [22]:
training_df = pd.read_table(path / 'training_set.ss')
test_df = pd.read_table(path / 'test_set.ss')
validation_df = pd.read_table(path / 'validation_set.ss')

`review_content` is the only feature column, and `rating` is the target column.

In [5]:
validation_df

Unnamed: 0,user_id,product_id,rating,review_content
0,ur0116181/,\tt0430357,6,i 'm a huge fan of michael mann -lrb- i even l...
1,ur0116181/,\tt0141941,10,this film is unavoidably being compared to dir...
2,ur0550732/,\tt0829459,9,we start all of our reviews with the following...
3,ur0550732/,\tt0964517,10,"my wife and i see over 100 movies a year , hav..."
4,ur0550732/,\tt0266543,10,we see over 75 movies a year and have not rate...
5,ur0550732/,\tt0183649,8,we absolutely think that colin is simply a ter...
6,ur0365713/,\tt0165854,6,watching this film i feel like apologising to ...
7,ur0365713/,\tt0140352,8,russell crowe is fast gaining a great reputati...
8,ur0365713/,\tt0245674,7,i was expecting the usual teen slasher type ho...
9,ur0365713/,\tt0118636,6,apt pupil had the potential to be a great movi...


## Tokenization and Numericalization

In fastai, `TextDataBunch` tokenizes the texts and cleans parts of them, such as leftover HTML code.

Once the tokens have been extracted, the `DataBunch` also converts them to integers by building a vocabulary of all the words used. By default, it keeps only the tokens that appear at least twice, up to a maximum vocabulary size of 60,000, and replaces the ones that don't make the cut with the unknown token `UNK` (written `xxunk` in the processed text).
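To make the numericalization step concrete, here is a minimal standard-library sketch of the idea. It is not fastai's implementation: the helper names `build_vocab` and `numericalize` are hypothetical, and fastai's real vocabulary also reserves several other special tokens that are left out here.

```python
from collections import Counter

UNK = "xxunk"  # unknown-token string, assumed to match fastai's convention

def build_vocab(tokenized_texts, min_freq=2, max_vocab=60000):
    """Return (itos, stoi): index-to-token list and token-to-index dict."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent tokens first; keep at most max_vocab tokens seen >= min_freq times.
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    itos = [UNK] + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to integer ids, falling back to the id of the unknown token."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

# Toy corpus (not taken from the dataset): "great" and "dull" appear only once.
texts = [["a", "great", "movie"], ["a", "dull", "movie"]]
itos, stoi = build_vocab(texts)
ids = numericalize(["a", "great", "movie"], stoi)
print(itos, ids)
```

In this toy corpus only `a` and `movie` reach the frequency threshold, so `great` is mapped to the unknown-token id.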

In [9]:
# Language model data
data_lm = TextLMDataBunch.from_df(path, training_df, valid_df=validation_df, test_df=test_df, text_cols=['review_content'], label_cols=['rating'])
# Classifier model data
data_clas = TextClasDataBunch.from_df(path, train_df=training_df, valid_df=validation_df, test_df=test_df, vocab=data_lm.train_ds.vocab, text_cols=['review_content'], label_cols=['rating'], bs=32)

In [10]:
data_lm.save('data_lm_export.pkl')
data_clas.save('data_clas_export.pkl')

In [23]:
data_lm = load_data(path, 'data_lm_export.pkl')
data_clas = load_data(path, 'data_clas_export.pkl', bs=16)

## Language model

A language model is trained to guess the next word given all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word, so the hidden state contains information about the sentence up to that point.

We are going to use that 'knowledge' of the English language to build our classifier.

We use a pretrained `AWD_LSTM` model as a starting point. `AWD-LSTM` has been a dominant architecture in state-of-the-art word-level language modeling, and many top research papers on word-level models incorporate it.

### LSTM

A normal LSTM cell's mathematical formulation looks like this:

$$i_t = \sigma(W^i x_t + U^i h_{t-1})$$
$$f_t = \sigma(W^f x_t + U^f h_{t-1})$$
$$o_t = \sigma(W^o x_t + U^o h_{t-1})$$
$$c'_t = \tanh(W^c x_t + U^c h_{t-1})$$
$$c_t = i_t \odot c'_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

where $W^i$, $W^f$, $W^o$, $W^c$, $U^i$, $U^f$, $U^o$, $U^c$ are weight matrices, $x_t$ is the input vector at timestep $t$, $h_t$ is the current exposed hidden state, $c_t$ is the memory cell state, $c'_t$ is the candidate cell state, and $\odot$ is element-wise multiplication.
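As an illustration, one LSTM time step can be sketched in plain Python. This is a dependency-free toy, not how PyTorch or fastai implement it: matrices are lists of rows, bias terms are omitted as in the formulas, and the cell update uses the standard form in which the forget gate multiplies the previous cell state $c_{t-1}$. The names `affine` and `lstm_step` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h):
    """Compute W x + U h for matrices stored as lists of rows."""
    return [
        sum(w * xj for w, xj in zip(W_row, x)) + sum(u * hj for u, hj in zip(U_row, h))
        for W_row, U_row in zip(W, U)
    ]

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM time step; W and U are dicts keyed by 'i', 'f', 'o', 'c'."""
    i_t = [sigmoid(z) for z in affine(W["i"], x_t, U["i"], h_prev)]       # input gate
    f_t = [sigmoid(z) for z in affine(W["f"], x_t, U["f"], h_prev)]       # forget gate
    o_t = [sigmoid(z) for z in affine(W["o"], x_t, U["o"], h_prev)]       # output gate
    c_cand = [math.tanh(z) for z in affine(W["c"], x_t, U["c"], h_prev)]  # candidate cell state
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, c_cand, f_t, c_prev)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy check with all-zero weights: input size 2, hidden size 1.
W = {k: [[0.0, 0.0]] for k in "ifoc"}
U = {k: [[0.0]] for k in "ifoc"}
h, c = lstm_step([1.0, -1.0], [0.0], [2.0], W, U)
print(h, c)
```

With all-zero weights every gate equals 0.5 and the candidate is 0, so the step simply halves the previous cell state.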

### AWD_LSTM: ASGD Weight-dropped LSTM 

The recurrent connections of an RNN are prone to overfitting. Dropout has been massively successful at preventing overfitting in feed-forward and convolutional neural networks, but applying it in the same way to an RNN's hidden state is ineffective because it disrupts the RNN's ability to retain long-term dependencies.

To counter this problem, AWD_LSTM uses DropConnect. We know that in Dropout, a randomly selected subset of activations is set to zero within each layer. In DropConnect, instead of activations, a randomly selected subset of weights within the network is set to zero. Each unit thus receives input from a random subset of units in the previous layer.

![drop_connect](./images/drop_connect.jpg)

DropConnect is applied to the hidden-to-hidden weight matrices ($U^i$, $U^f$, $U^o$, $U^c$) instead of the hidden or memory states. Since this dropout operation is performed once, before the forward and backward pass, the impact on training speed is minimal and any standard optimized black-box RNN implementation can be used. Dropping weights in the hidden-to-hidden matrices helps prevent overfitting on the recurrent connections of the LSTM.
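The weight-dropping idea can be sketched as follows. This is a toy in plain Python, not the fastai or PyTorch implementation. The function name `weight_drop` is hypothetical, and rescaling the surviving weights by `1 / (1 - p)` (the usual inverted-dropout convention) is an assumption of this sketch.

```python
import random

def weight_drop(U, p, rng):
    """Return a copy of matrix U with each weight zeroed with probability p.

    Surviving weights are scaled by 1 / (1 - p) (an assumption of this sketch)
    so that the expected value of every weight stays the same.
    """
    keep_scale = 1.0 / (1.0 - p)
    return [[0.0 if rng.random() < p else w * keep_scale for w in row] for row in U]

rng = random.Random(0)
# Toy 4x4 hidden-to-hidden matrix with non-zero entries.
U_h = [[0.1 * (r + c + 1) for c in range(4)] for r in range(4)]

# Sample the mask once, before the forward pass...
U_dropped = weight_drop(U_h, p=0.5, rng=rng)
# ...then reuse U_dropped unchanged for every time step of the sequence.
```

Because the mask is drawn once per pass rather than once per time step, the same subset of recurrent connections is missing across the whole sequence, and the original matrix `U_h` is left untouched for the next pass.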

[Details](https://yashuseth.blog/2018/09/12/awd-lstm-explanation-understanding-language-model/)

### Train

In [25]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

In [15]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.753421,3.642199,0.386658,14:48


### Fine-tuning

The learning rates used here, `1e-2` and `1e-3`, are rules of thumb rather than magic numbers; they work most of the time.

In [None]:
# Unfreeze all the model weights and train for 1 more epoch
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)

In [17]:
learn.save_encoder('ft_enc') # 191.318M

In [19]:
# Have fun testing
learn.predict('find', n_words=10)

'find although a true classic film for me for sense of'

## Text Classifier

Now that we have a language model, we need to build a classifier that takes in text and predicts its rating.
For this task, fastai provides `text_classifier_learner`. Again, we start with the pretrained `AWD_LSTM` network.

In [26]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
# Load the language model we just trained as encoder
learn.load_encoder('ft_enc')

In [21]:
# See how the dataset looks
data_clas.show_batch()

text,target
xxbos the day after tomorrow is one of those movies that comes along occasionally that operates on such a massive scale that it 's sure to be a success even if for no other reason than because it takes place in such a massive setting . < xxrep 5 s > i think that a large part of the appeal of disaster movies comes from the possibility of seeing somewhere,6
"xxbos this is yet another summer blockbuster gone hollywood awry by being a bloated , self - indulging `` more is better '' behemoth . < xxrep 5 s > the series is based on my favorite disney amusement park ride of the same name . < xxrep 5 s > by this point i was surprised that the end product has been this tolerable , considering the sword of",6
"xxbos `` talladega nights : the ballad of ricky bobby '' is about a man named ricky bobby -lrb- will ferrell -rrb- who as a boy always dreamed of being a race car driver and living life in the fast lane . < xxrep 5 s > so one day , ricky bobby got the chance of a lifetime as he was able to race in a nascar tournament .",5
"xxbos what is with hollywood and the concept of love in the mitts of great disasters ? < xxrep 5 s > there can be no greater example of this obsession than the highest grossing film of all time , titanic . < xxrep 5 s > though the oscar darling can be blamed for the most recent trend , it is not the first to start this . <",6
"xxbos director walter stuck faithfully to ernesto guevara and alberto granado 's book recounting their 1952 adventure traveling through thousands of miles of south america , much of their adventure taking them through magnificent wilderness and provincial small towns . < xxrep 5 s > guevara -lrb- gael garcia bernal -rrb- was about a semester away from receiving his medical degree . < xxrep 5 s > his jovial companion",7


In [22]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,1.79187,1.714865,0.316141,08:02


In [29]:
# Save the intermediate model so training can be resumed from this stage
learn.save('stage-1-model.pkl')

In [30]:
learn.load('stage-1-model.pkl')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (16839 items)
x: TextList
xxbos sometimes popular opinion really sucks about a film elevating a rotten movie to cult status and if you do n't like it then you obviously do n't get it . < xxrep 5 s > this movie is one of them . < xxrep 5 s > the much discussed camera moves did not make me sick , but the whiny idiotic characters did . < xxrep 5 s > the suspense was minimal . < xxrep 5 s > i do n't care if it was really low budget . < xxrep 5 s > i ' m not going to grade a film on degree of difficulty . < xxrep 5 s > i ' m sure that `` titanic '' was a very tough experience for the production crew . < xxrep 5 s > i do n't care ! < xxrep 5 s > it was a great movie because it had a great story and great characters and was really well put together . < xxrep 5 s > this movie did n't have any of those things . < xxrep 5 s > it telegraphed the last shot in the first 15 minutes . < xxrep 5 s > this is one of those rare films that made me angry

### Fine-tuning

In [31]:
learn.freeze_to(-2)
# Use discriminative learning rates: the slice gives earlier layer groups smaller rates and later groups (closer to the output) larger ones.
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,1.612746,1.55434,0.380611,09:15
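To see what a call such as `slice(5e-3/2., 5e-3)` means in practice, the sketch below spreads a range of learning rates over a number of layer groups. As far as I understand, fastai v1 spaces the rates geometrically between the two ends of the slice, but treat that as an assumption. The helper name `spread_learning_rates` and the group count of 5 are also illustrative assumptions, not values read from the learner.

```python
def spread_learning_rates(lo, hi, n_groups):
    """Geometrically space n_groups learning rates from lo (first group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1.0 / (n_groups - 1))  # constant factor between neighbouring groups
    return [lo * ratio ** k for k in range(n_groups)]

# n_groups=5 is an assumed value for illustration only.
lrs = spread_learning_rates(5e-3 / 2, 5e-3, n_groups=5)
for group, lr in enumerate(lrs):
    print("layer group {}: lr = {:.6f}".format(group, lr))
```

The first group (closest to the input) gets the smallest rate and the last group (the classifier head) gets the largest, so the pretrained lower layers change more slowly than the freshly initialised head.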


In [33]:
learn.save('stage-2-model.pkl')

In [34]:
learn.load('stage-2-model.pkl')


In [52]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,accuracy,time
0,1.518762,1.490982,0.403534,18:04


In [53]:
# Save our final model
learn.save('final-model.pkl') # 487.053M

In [27]:
learn.load('final-model.pkl')


In [10]:
learn.predict("This was a okay movie!")

  warn('Tensor is int32: upgrading to int64; for better performance use int64 input')


(Category 5,
 tensor(4),
 tensor([0.1114, 0.0561, 0.1459, 0.0720, 0.2170, 0.1737, 0.1432, 0.0546, 0.0199,
         0.0063]))

## Predict Helper Function

In [11]:
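def predict(text):
    """Return the predicted rating (1-10) for a review text."""
    # learn.predict returns (category, class index, probabilities);
    # the class index is 0-based, so add 1 to get the rating.
    return learn.predict(text)[1].item() + 1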
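The `+ 1` in the helper can be checked against the `learn.predict` output shown earlier, without loading the model. The sketch below copies the ten probabilities from that output and assumes the classes are the ratings 1 to 10 in ascending order, which holds only if every rating appears in the training labels. `index_to_rating` is a hypothetical name.

```python
# Probabilities copied from the learn.predict("This was a okay movie!") output above.
probs = [0.1114, 0.0561, 0.1459, 0.0720, 0.2170, 0.1737, 0.1432, 0.0546, 0.0199, 0.0063]

def index_to_rating(class_index, min_rating=1):
    """Convert a 0-based class index into a rating, assuming classes are sorted ratings."""
    return class_index + min_rating

# The predicted class is the index of the largest probability.
best_index = max(range(len(probs)), key=probs.__getitem__)
rating = index_to_rating(best_index)
print(best_index, rating)
```

The largest probability (0.2170) sits at index 4, which corresponds to `tensor(4)` and `Category 5` in the output.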

In [12]:
predict("This is a find so good movie amazing best top of the list!")

  warn('Tensor is int32: upgrading to int64; for better performance use int64 input')


10

## Calculate RMSE on Validation Set

In [33]:
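import math
import warnings
# Silence all warnings, including the int32 -> int64 tensor warning printed on each prediction
warnings.filterwarnings("ignore")

# Accumulate the sum of squared errors over the validation set
squared_error_sum = 0
vali_len = len(validation_df.index)
for i in range(vali_len):
    line = validation_df.iloc[i]
    text = line['review_content']
    rating = line['rating']
    pred = predict(text)
    squared_error_sum += (rating - pred) ** 2
    
    # Report progress every 10 reviews
    if i % 10 == 0:
        print("{} finished".format(i))

# RMSE is the square root of the mean squared error
rmse_loss = math.sqrt(squared_error_sum / vali_len)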

0 finished
10 finished
20 finished
...
770 finished

**RMSE:**

In [34]:
rmse_loss

1.5641335777652257
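For reference, the same metric can be written as a small standalone function and checked on toy numbers. This is only a sketch of the formula $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$; the ratings below are made up and are not taken from the dataset.

```python
import math

def rmse(targets, predictions):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings: errors of 1, 2 and 0 give sqrt((1 + 4 + 0) / 3).
example = rmse([6, 10, 9], [5, 8, 9])
print(example)
```

Because the errors are squared before averaging, a prediction that is off by two points costs four times as much as one that is off by a single point.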

## Make Predictions on Test Set 

In [None]:
# Predict a rating for every review in the test set
ratings = [predict(line) for line in test_df['review_content']]

In [61]:
output_data = {'review_content': test_df['review_content'], 'rating': ratings}

In [62]:
# Save result
output_df = pd.DataFrame(output_data)
output_df.to_csv('senti_output.ss', sep='\t')

## Conclusion

**Sizes:**

Language Model Size:      191.32M

Classifier Model Size:    487.05M

Total Size:               678.37M

**Loss:**

RMSE:          1.5641335777652257

**Test Set Output:**

File:             senti_output.ss

**Comment:**

This looks much better than "Plan A". Both the language model and the classifier are built on AWD_LSTM, which makes the two models big yet powerful. At almost 20 times the size of "Plan A", this approach brings the loss down to 68% of the "Plan A" loss.

Judging by the test set predictions, I think the outcome is pretty impressive. Most predicted ratings are not absurd at all; they are reasonable, and it is hard to tell whether they are labels or predictions.