**Copyright (c) by Wei Tan.** **Date:** 19 Aug 2019

**Environment:**
* Python 3.6
* fastai 1.0
* CUDA 10.0 (GPU)
* PyTorch 1.1.0


## Introduction

Sentiment analysis, also known as opinion mining, is the process of using machine learning to analyse a person's opinion, emotional tone and attitude from written text (Liu, 2012). Research in the field, such as identifying metaphor and expression, started in the early 1990s, and the term "sentiment analysis" was first introduced in 2003 (Liu, 2012).
  
Applications of sentiment analysis include monitoring events on social media. For example, the Obama administration used sentiment analysis on social media platforms such as Facebook and Twitter to measure public opinion ahead of his presidential election, which he subsequently won. Big brands such as Coca-Cola use sentiment analysis to gauge the public reaction to their products or the public perception of the company. With sentiment analysis, companies can improve customer service and product quality, or develop and adjust their marketing strategy.
  
The code is structured as follows:
* Loading the dataset
* Preprocessing the dataset
* Fine-tuning a language model
* Develop the classifier model
* Retuning the saved model
* Predict the label

In [1]:
# Libraries used
%reload_ext autoreload
%autoreload 2

import html
import json

import pandas as pd
import numpy as np

from fastai import * 
from fastai.text import * 
from fastai.core import *

import warnings
from tqdm import tqdm

## Loading the dataset

In [2]:
# create a path for saving dataset and model
DATA_PATH=Path('saved')
SAVE_PATH=Path('saved')

In [3]:
# read the training and testing dataset
df_tr = pd.read_csv(DATA_PATH/'trainData.csv')
df_te = pd.read_csv(DATA_PATH/'testDataTrans.csv')

## Preprocessing the dataset

In [4]:
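# rename the dataframe columns to the positional layout used below (column 0 = label, column 1 = text)
# df_trLM and df_te are for the language model; the text is repeated in column 0 only as a placeholder label
# df_trCL is for the classifier model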
df_trLM = pd.DataFrame({0:df_tr["text"],1:df_tr["text"]})
df_trCL = pd.DataFrame({0:df_tr["label"],1:df_tr["text"]})
df_te = pd.DataFrame({0:df_te["text"],1:df_te["text"]})

In [6]:
# concatenate df_trLM and df_te into one training dataset for the language model
dfoutAll = pd.concat([df_trLM,df_te],axis=0)

In [8]:
# keep columns 0 and 1 and save them to a headerless CSV file for later language model use
trainDataTL = dfoutAll[[0,1]]
trainDataTL.to_csv(SAVE_PATH/"trainDataTL.csv", header=None, index = False)

In [9]:
# create the DataBunch object that the Learner uses to train a model
# this does all the necessary preprocessing behind the scenes
# Tokenization creates a separate unit (a "token") for each part of the text.
# Most tokens are whole words, but a suffix such as the 's in it's
# gets its own token, and every bit of punctuation tends to
# get its own token too (a comma, a period, etc.).
data_lm = TextLMDataBunch.from_csv(SAVE_PATH, 'trainDataTL.csv')
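
The tokenization described in the comments above can be illustrated with a toy tokenizer. This is a simplified sketch, not fastai's implementation: `toy_tokenize` is a hypothetical helper that only imitates the word, suffix and punctuation splitting plus the `xxmaj` (capitalized word) and `xxup` (all-caps word) markers seen in the tokenized batches of this notebook.

```python
import re

def toy_tokenize(text):
    """Toy approximation of the tokenizer output (assumption, not fastai code)."""
    # words, apostrophe suffixes such as 's, numbers, and single punctuation marks
    raw = re.findall(r"[A-Za-z]+|'[a-z]+|\d+|[^\w\s]", text)
    tokens = []
    for tok in raw:
        if tok.isupper() and len(tok) > 1:
            # all-caps word: marker + lowercased word
            tokens += ["xxup", tok.lower()]
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            # capitalized word: marker + lowercased word
            tokens += ["xxmaj", tok.lower()]
        else:
            tokens.append(tok.lower())
    return tokens

print(toy_tokenize("It's LOUD, I think."))
```

The markers keep capitalization information while letting `It` and `it` share one vocabulary entry.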

In [10]:
# save the DataBunch as a pickle file for training the language model
data_lm.save('data_lm_export.pkl')

In [5]:
# load the saved pickle file for training the language model
data_lm = load_data(SAVE_PATH,'data_lm_export.pkl')

In [11]:
# show a batch of tokenized text
data_lm.show_batch()

idx,text
0,"another employee in the middle of our order . \r \n \r \n i had high hopes for xxmaj pollo xxmaj campero , as every time i drove past it , i could see cars wrapping around the drive through . i realize now it is because they have inefficient work . xxmaj it took about 15 - 20 minutes to get it our food , which is surprising"
1,"... xxmaj vegas is definitely still on my mind ... \r \n xxmaj we had the fantastic xxmaj xxunk xxmaj suite on the 27th xxmaj floor and it was spic xxrep 4 y . xxmaj enough to easily sleep 6 + people , although we were only 3 deep . \r \n xxmaj shower , separate bath tub and private toilet area . \r \n xxmaj large"
2,"xxmaj allegheny xxmaj tavern last night . xxmaj we had been looking forward to a night out together for a while . \r \n \r \n i think the food was good , but it 's hard to say , because all i could focus on was the xxup loud -- and i mean xxup loud -- woman seated behind my boyfriend and me . \r \n \r \n"
3,ever in my life xxrep 4 . xxbos xxmaj they worse dunking donuts ever xxrep 5 ! \r \n xxmaj first they do n't know how to make coffee ! ! ! \r \n xxmaj they have so many employees and it 's look like they do n't want to give service . \r \n xxmaj the place very messy and dirty . xxbos xxmaj as you
4,"do not see it lasting much longer . xxmaj there are other places in xxmaj calgary to get sub - par broth and far bigger portion of meat and veggies . i think i will stick to xxunk for now . xxbos xxmaj it was a very disappointing dining experience . xxmaj there was a long line for this restaurant ( and decent reviews on xxmaj yelp ) , which"


In [None]:
# show the first ten entries of the vocabulary
# itos ("int to string") lists every unique token: special tokens first, then tokens in order of frequency
data_lm.vocab.itos[:10]
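
A minimal sketch of how such a frequency-ordered vocabulary could be built. `build_itos` is a hypothetical helper, the three special tokens are only a subset chosen for illustration, and the `max_vocab` and `min_freq` values are assumptions rather than settings taken from this notebook.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]  # assumed subset of the special tokens

def build_itos(token_lists, max_vocab=60000, min_freq=2):
    """Special tokens first, then the remaining tokens by descending frequency."""
    freq = Counter(tok for toks in token_lists for tok in toks)
    words = [tok for tok, count in freq.most_common(max_vocab)
             if count >= min_freq and tok not in SPECIALS]
    return SPECIALS + words

docs = [
    ["xxbos", "the", "food", "was", "good"],
    ["xxbos", "the", "service", "was", "slow"],
    ["xxbos", "the", "food"],
]
itos = build_itos(docs)
print(itos[:10])
```

Tokens seen fewer than `min_freq` times are left out of the list, so they are later treated as unknown (`xxunk`).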

## Fine-tuning a language model

In [7]:
# create a learner object that loads the pretrained WikiText-103 (wt103) model, ready for fine-tuning
# use the AWD-LSTM architecture
# drop_mult=0.3 scales down the dropout (less regularization) to avoid underfitting
learn = language_model_learner(data_lm, AWD_LSTM, pretrained_fnames = ["lstm_wt103","itos_wt103"],drop_mult=0.3)
# fine-tune the language model on the review text, starting from the wt103 weights
# its encoder is reused later as the input encoder of the classifier model
# train for 5 epochs with a maximum learning rate of 0.01
learn.fit_one_cycle(5, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.119113,3.982092,0.272577,50:08
1,4.135674,4.005588,0.270786,50:13
2,4.100678,3.962234,0.2744,50:09
3,4.047487,3.912771,0.278795,50:09
4,4.027775,3.896547,0.280431,50:08
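
The validation loss is easier to interpret as perplexity, the exponential of the cross-entropy loss. This small sketch assumes the reported loss is the average cross-entropy per token measured in nats; `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity = exp(average cross-entropy per token, in nats)."""
    return math.exp(cross_entropy_loss)

# validation loss after the fifth epoch in the table above
print(round(perplexity(3.896547), 1))
```

A lower perplexity means the model is, on average, less uncertain about the next token.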


In [8]:
# unfreeze all the layers of the model and fine-tune it
learn.unfreeze()
# train for 2 epochs with a maximum learning rate of 0.001 and momentum cycling between 0.8 and 0.7
# fastai found that decreasing the momentum a little helps when training RNNs
learn.fit_one_cycle(2, 1e-3,moms=(0.8, 0.7))

epoch,train_loss,valid_loss,accuracy,time
0,3.552196,3.460151,0.331855,55:44
1,3.449042,3.379631,0.341974,55:53
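
A rough sketch of the kind of schedule `fit_one_cycle` follows: the learning rate warms up to its maximum and then anneals, while the momentum moves in the opposite direction between the two `moms` values. The 30% warm-up, the starting rate of `lr_max/25`, the final rate and the cosine shape are assumptions made for illustration; `cos_anneal` and `one_cycle` are hypothetical helper names.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, lr_max=1e-3, moms=(0.8, 0.7),
              pct_start=0.3, div_factor=25.0):
    """Return (learning_rate, momentum) for a given training step."""
    warm = max(1, int(total_steps * pct_start))
    if step < warm:
        # warm-up: learning rate rises, momentum falls
        pct = step / warm
        return (cos_anneal(lr_max / div_factor, lr_max, pct),
                cos_anneal(moms[0], moms[1], pct))
    # annealing: learning rate falls, momentum rises back
    pct = (step - warm) / max(1, total_steps - warm)
    return (cos_anneal(lr_max, lr_max / (div_factor * 1e4), pct),
            cos_anneal(moms[1], moms[0], pct))

for step in (0, 30, 100):
    print(step, one_cycle(step, 100))
```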


In [None]:
# check the language model by letting it generate the next 50 words from a short prompt
learn.predict("This is a review about", n_words=50)

In [10]:
# save the encoder for classifier model use
learn.save_encoder('ft_enc_wk')

## Develop the classifier model
* train the classifier head with the other layers frozen
* unfreeze the last two layers
* train it a little bit more
* unfreeze the next layer
* train it a little bit more
* unfreeze the whole thing
* train it a little bit more
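
The gradual unfreezing steps listed above can be sketched with plain list slicing. This assumes the classifier is split into five layer groups; the group names and the `freeze_to` helper below are hypothetical stand-ins that only show which groups would be trainable at each stage.

```python
def freeze_to(groups, n):
    """Return (frozen, trainable) layer groups when freezing up to index n."""
    return groups[:n], groups[n:]

# hypothetical names for five layer groups, from input to output
groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head"]

# -1: head only, -2: last two groups, -3: last three groups, 0: everything
for n in (-1, -2, -3, 0):
    frozen, trainable = freeze_to(groups, n)
    print(n, trainable)
```

Training the later groups first lets the new classifier head settle before the pretrained encoder weights are changed.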

In [17]:
# save the training dataset for building the classifier model
df_trCL.to_csv(SAVE_PATH/"trainDataCLS.csv", header=None, index = False)

In [7]:
# build the classifier DataBunch from the dataset saved above
# pass the vocabulary (mapping from ids to words) created for the language model so the token ids stay the same
# hold out 10% of the data for validation and use batch size 32; lower it if you run out of memory
data_clas = TextClasDataBunch.from_csv(SAVE_PATH, 'trainDataCLS.csv',valid_pct=0.1, vocab=data_lm.train_ds.vocab, bs=32)
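
Passing the language model vocabulary matters because the encoder's embedding rows are indexed by token id. A minimal sketch of that mapping, with a hypothetical `numericalize` helper and a made-up five-entry vocabulary:

```python
def numericalize(tokens, itos):
    """Map tokens to ids with a fixed vocabulary; unseen tokens become xxunk."""
    stoi = {tok: i for i, tok in enumerate(itos)}
    unk = stoi["xxunk"]
    return [stoi.get(tok, unk) for tok in tokens]

# made-up vocabulary standing in for the language model's itos list
lm_itos = ["xxunk", "xxpad", "xxbos", "the", "food"]

print(numericalize(["xxbos", "the", "pizza"], lm_itos))
```

If the classifier built its own vocabulary instead, the same id could point to a different word than the one the fine-tuned encoder learned for it.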

In [8]:
# save the DataBunch as a pickle file for training the classifier model
data_clas.save('data_clas_export.pkl')

In [9]:
# load the saved pickle file for training the classifier model
data_clas = load_data(SAVE_PATH,'data_clas_export.pkl', bs=32)

In [19]:
# create a learner object that builds a classifier on the data_clas object
# load the fine-tuned encoder saved from the language model
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)
learn.load_encoder('ft_enc_wk')

In [20]:
# show a batch from data_clas, which contains the text and its target label
data_clas.show_batch()

text,target
xxbos xxmaj we xxmaj have xxmaj been xxmaj regulars at xxmaj several xxmaj outbacks xxmaj usually xxmaj by xxmaj where xxmaj we xxmaj live . xxmaj for xxmaj years xxmaj went xxup xxunk xxmaj scottsdale xxmaj location xxmaj because it xxmaj was xxmaj near xxmaj the xxmaj harkins xxmaj there . xxmaj never xxmaj had a xxmaj problem . xxmaj then xxmaj we xxmaj moved & xxmaj we xxmaj were,2
xxbos xxup beware ! \r \n \r \n i believe it was back in 2007 when my best friend and i decided to get gym memberships at this 24 xxmaj hour xxmaj fitness . xxmaj she signed up for the month - to - month option where they debit your monthly fee from your account . i was not comfortable with this and after severe pressure from one of,1
"xxbos xxmaj in an age where teenagers are getting $ xxunk an hour to babysit children ( yes , that 's live xxup human beings ) , i find it astonishing that an xxup az pet sitting service would have the audacity to charge $ 32 for a 1 / 2 hour cat visit ! xxmaj if you do the math , that 's $ 64 / hour folks ,",2
"xxbos i want to start off by saying , this was the most horrible experience ever we had renting with this particular xxmaj alamo located at the mccarran xxmaj alamo xxmaj rent a xxmaj car xxmaj center . xxmaj my xxmaj father who rented it through xxunk . xxmaj com was visiting from xxmaj hawaii . \r \n \r \n xxmaj we get to the counter and i really",1
xxbos i am giving this xxup scam absolutely no stars ! ! ! \r \n \r \n xxmaj the marketing team for this so called resort preys upon unsuspecting tourists as they innocently travel through xxmaj sin xxmaj city . xxmaj this group woos you with the promise of a free gift ( dinner show even a 2 day cruise ! ) xxmaj and all you have to do,1


In [21]:
# train for 5 epochs with a maximum learning rate of 0.01 (only the classifier head is trainable at this point)
learn.fit_one_cycle(5, 1e-2, moms = (0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.917727,0.835745,0.635626,18:28
1,0.922176,0.860451,0.634272,19:34
2,0.888794,0.814937,0.643722,16:59
3,0.865585,0.804189,0.65031,17:00
4,0.886688,0.8034,0.650125,16:53


In [None]:
# save the model after 5 epochs
learn.save('first')
learn.load('first')

In [22]:
# unfreeze the last two layer groups, keep the other layers frozen, and fine-tune
# train for 1 epoch with discriminative learning rates: slice(1e-2/(2.6**4), 1e-2) gives earlier layers smaller rates than later ones
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2), moms = (0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.746111,0.703349,0.700134,20:10
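
A sketch of how a `slice(lo, hi)` learning rate can be spread across layer groups. It assumes five layer groups and a geometric spacing between the two ends, which would explain the `2.6**4` divisor: each group then trains 2.6 times faster than the one before it. `even_mults` here is a stand-alone helper written for illustration.

```python
def even_mults(start, stop, n):
    """n values from start to stop, each a constant multiple of the previous one."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

# assumed: five layer groups, earliest group first
lrs = even_mults(1e-2 / 2.6 ** 4, 1e-2, 5)
for group, lr in enumerate(lrs):
    print(group, f"{lr:.6f}")
```

Earlier layers hold more general language features, so they get the smaller rates and change less.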


In [None]:
# save the model after tuning the last 2 layers
learn.save('second')
learn.load('second')

In [23]:
# then unfreeze the third-from-last layer group as well and tune a bit more
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms = (0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.696538,0.669784,0.710801,39:19


In [None]:
# save the model after tuning the layers
learn.save('third')
learn.load('third')

In [24]:
# unfreeze all layers and fine-tune the whole model
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3), moms = (0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.669128,0.663296,0.713787,54:49
1,0.64003,0.659572,0.716142,50:18


In [25]:
# save the model weight
learn.save("yelpModel716142")

In [28]:
# train for one more epoch with the same settings
learn.fit_one_cycle(1, slice(1e-3/(2.6**4), 1e-3), moms = (0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.629368,0.6584,0.71705,48:51


In [29]:
# save the model after tuning the layers
learn.save('yelpModel717050')

## Retuning the saved model if needed

In [15]:
# load the saved model for prediction
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.1)
learn.load_encoder('ft_enc_wk')
learn.load("2222222")

RNNLearner(data=TextClasDataBunch;

Train: LabelList (584712 items)
x: TextList
xxbos xxmaj this is probably the worst business i have ever dealt with . xxmaj none of their people know xxmaj english and they rude as hell . i tried to downgrade to just wifi and they would n't let me do it . i provided my last four of my social and card number and they still would n't let me in my account because i did n't know their fake ass pin they claim that i made . xxmaj this is a tactic they are using just to keep me at the package i 'm at . xxmaj cox stinks,xxbos xxmaj selection walking in was xxunk . xxmaj the cookies are okay but nothing to return back for .,xxbos xxmaj beyond my expectations ! xxmaj stunning combinations of flowers , colors . xxmaj and the fragrance ! xxmaj so happy to make my daughter happy on her wedding day .,xxbos i stand corrected . xxmaj there is a xxmaj top xxmaj shop inside xxmaj nordstrom in xxmaj union xxmaj square . xxmaj dangerously awesome ... 
 
  xxrep 26 _ 
 
 

In [None]:
# run the learning rate finder and plot the loss against the learning rate
learn.lr_find()
learn.recorder.plot()
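
The idea behind a learning rate finder can be sketched as follows: train briefly while the learning rate grows exponentially, record the loss at each step, and look for the rate where the loss falls fastest. The sweep bounds, the number of iterations and the selection rule are assumptions for illustration; `lr_sweep` and `steepest_lr` are hypothetical helpers and the losses are made up.

```python
def lr_sweep(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

def steepest_lr(lrs, losses):
    """Learning rate at which the loss drops the most before the next step."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    best = min(range(len(drops)), key=drops.__getitem__)
    return lrs[best]

# made-up losses: slow start, sharp drop, then divergence
example_lrs = [1e-4, 1e-3, 1e-2, 1e-1]
example_losses = [5.0, 4.9, 4.0, 6.0]
print(steepest_lr(example_lrs, example_losses))
```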

In [None]:
# plot the learning rate and momentum schedules
learn.lr_find()
learn.recorder.plot_lr(show_moms=True)

In [None]:
# plot the losses
learn.lr_find()
learn.recorder.plot_losses()

In [None]:
# unfreeze the last two layer groups and fine-tune them
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms = (0.8,0.7))

## Predict the label

In [13]:
# ignore warning messages
warnings.filterwarnings('ignore')

# define a prediction function
def pred_labels(learnFit):
    # read the test reviews
    testDataDf = pd.read_csv(SAVE_PATH/"testDataTrans.csv")
    teDataDfArray = testDataDf["text"].values
    labels = []
    # predict one review at a time; pred[0] is the predicted category
    for x in tqdm(teDataDfArray):
        pred = learnFit.predict(x)
        labels.append(pred[0])
    # convert each predicted category to an integer label
    predLabels = [int(str(x)) for x in labels]
    # pair each label with its test_id and save the result as a CSV file
    predLabelsDF = pd.DataFrame({"test_id":testDataDf["test_id"].values, "label":predLabels})
    predLabelsDF.to_csv("predLabels.csv", index=False)
    return predLabelsDF
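
A small sketch of how a predicted category relates to the class probabilities: the label is the class with the highest probability. The class names and the probabilities are made up, and `label_from_probs` is a hypothetical helper, not part of the notebook.

```python
def label_from_probs(probs, classes):
    """Return the class whose probability is highest."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best]

classes = ["1", "2", "3", "4", "5"]        # assumed star-rating labels
probs = [0.05, 0.10, 0.15, 0.60, 0.10]     # made-up model output for one review

print(int(label_from_probs(probs, classes)))
```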


In [16]:
# predict the labels with the loaded model
pred_labels(learn)

100%|██████████| 50000/50000 [1:15:40<00:00, 12.36it/s]


Unnamed: 0,test_id,label
0,test_1,2
1,test_2,4
2,test_3,1
3,test_4,5
4,test_5,4
5,test_6,4
6,test_7,3
7,test_8,4
8,test_9,2
9,test_10,1
