# Sequence classification model for IMDB Sentiment Analysis
(c) Deniz Yuret, 2019
* Objectives: Learn the structure of the IMDB dataset and train a simple RNN model.
* Prerequisites: [RNN models](60.rnn.ipynb)

In [1]:
# Set display width, load packages, import symbols
ENV["COLUMNS"] = 72
using Pkg; for p in ("Knet","IterTools"); haskey(Pkg.installed(),p) || Pkg.add(p); end
using Statistics: mean
using IterTools: ncycle
using Knet: Knet, AutoGrad, RNN, param, dropout, minibatch, nll, accuracy, progress!, adam, save, load, gc

In [2]:
# Set constants for the model and training
EPOCHS=3          # Number of training epochs
BATCHSIZE=64      # Number of instances in a minibatch
EMBEDSIZE=125     # Word embedding size
NUMHIDDEN=100     # Hidden layer size
MAXLEN=150        # Maximum sequence length: shorter sequences are padded, longer ones truncated
VOCABSIZE=30000   # Maximum vocabulary size: keep the most frequent words, map the rest to the UNK token
NUMCLASS=2        # Number of output classes
DROPOUT=0.5       # Dropout rate
LR=0.001          # Learning rate
BETA_1=0.9        # Adam optimization parameter
BETA_2=0.999      # Adam optimization parameter
EPS=1e-08         # Adam optimization parameter

1.0e-8

## Load and view data

In [3]:
include(Knet.dir("data","imdb.jl"))   # defines imdb loader

imdb

In [4]:
@doc imdb

```
imdb()
```

Load the IMDB Movie reviews sentiment classification dataset from https://keras.io/datasets and return (xtrn,ytrn,xtst,ytst,dict) tuple.

# Keyword Arguments:

  * url=https://s3.amazonaws.com/text-datasets: where to download the data (imdb.npz) from.
  * dir=Pkg.dir("Knet/data"): where to cache the data.
  * maxval=nothing: max number of token values to include. Words are ranked by how often they occur (in the training set) and only the most frequent words are kept. nothing means keep all, equivalent to maxval = vocabSize + pad + stoken.
  * maxlen=nothing: truncate sequences after this length. nothing means do not truncate.
  * seed=0: random seed for sample shuffling. Use system seed if 0.
  * pad=true: whether to pad short sequences (padding is done at the beginning of sequences). pad_token = maxval.
  * stoken=true: whether to add a start token to the beginning of each sequence. start_token = maxval - pad.
  * oov=true: whether to replace words >= oov_token with oov_token (the alternative is to skip them). oov_token = maxval - pad - stoken.
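
As a worked instance of the token formulas in the docstring, the following plain-Julia sketch computes the three special token ids for the settings used in this notebook (`maxval=30000` with the defaults `pad=true` and `stoken=true`). The variable names are illustrative and are not part of the `imdb` loader.

```julia
# Special token ids implied by the formulas in the docstring.
# Names are illustrative; only the arithmetic comes from the docstring.
maxval = 30000               # same value as VOCABSIZE in this notebook
pad, stoken = true, true     # docstring defaults

pad_token   = maxval                  # 30000
start_token = maxval - pad            # 29999 (true counts as 1)
oov_token   = maxval - pad - stoken   # 29998

println((oov_token, start_token, pad_token))
```

These are the three ids that a later cell labels `<unk>`, `<s>` and `<pad>`.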


In [5]:
@time (xtrn,ytrn,xtst,ytst,imdbdict)=imdb(maxlen=MAXLEN,maxval=VOCABSIZE);

┌ Info: Loading IMDB...
└ @ Main /kuacc/users/dyuret/.julia/dev/Knet/data/imdb.jl:57


  8.322828 seconds (30.48 M allocations: 1.560 GiB, 9.76% gc time)


In [6]:
println.(summary.((xtrn,ytrn,xtst,ytst,imdbdict)));

25000-element Array{Array{Int32,1},1}
25000-element Array{Int8,1}
25000-element Array{Array{Int32,1},1}
25000-element Array{Int8,1}
Dict{String,Int32} with 88584 entries


In [7]:
# Words are encoded with integers
rand(xtrn)'

1×150 LinearAlgebra.Adjoint{Int32,Array{Int32,1}}:
 30000  30000  30000  30000  30000  …  2  4933  21  15  6057  1  144

In [8]:
# Each word sequence is padded or truncated to length 150
length.(xtrn)'

1×25000 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 150  150  150  150  150  150  150  …  150  150  150  150  150  150
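
The padding and truncation step can be sketched in plain Julia as follows. The docstring states that padding is added at the beginning of a sequence; which end is cut when truncating is an assumption here (this sketch keeps the last `maxlen` tokens). `padtruncate` is a hypothetical helper, not a Knet function.

```julia
# Hypothetical helper: bring a token sequence to exactly `maxlen` entries.
# Assumption: long sequences keep their LAST maxlen tokens.
function padtruncate(seq::Vector{Int}, maxlen::Int, padtoken::Int)
    if length(seq) >= maxlen
        return seq[end-maxlen+1:end]                        # truncate
    else
        return vcat(fill(padtoken, maxlen - length(seq)), seq)  # pad at the front
    end
end

short = padtruncate([1, 2, 3], 5, 9)            # [9, 9, 1, 2, 3]
long  = padtruncate([1, 2, 3, 4, 5, 6], 4, 9)   # [3, 4, 5, 6]
println(short, " ", long)
```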

In [9]:
# Define a function that can print the actual words:
imdbvocab = Array{String}(undef,length(imdbdict))
for (k,v) in imdbdict; imdbvocab[v]=k; end
imdbvocab[VOCABSIZE-2:VOCABSIZE] = ["<unk>","<s>","<pad>"]
function reviewstring(x,y=0)
    x = x[x.!=VOCABSIZE] # remove pads
    """$(("Sample","Negative","Positive")[y+1]) review:\n$(join(imdbvocab[x]," "))"""
end

reviewstring (generic function with 2 methods)
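
The same idea on a toy vocabulary may make the steps easier to follow: invert a word-to-id dictionary into an id-to-word array, drop the pad tokens, and join the remaining words. All names and values below are made up for illustration.

```julia
# Toy version of the dictionary inversion used by reviewstring.
toydict = Dict("the" => 1, "film" => 2, "great" => 3)   # word => id

toyvocab = Array{String}(undef, length(toydict))        # id => word
for (word, id) in toydict
    toyvocab[id] = word
end
push!(toyvocab, "<pad>")      # give the pad token the next free id
toypad = length(toyvocab)     # 4

seq = [toypad, toypad, 1, 2, 3]       # a front-padded toy review
words = toyvocab[seq[seq .!= toypad]] # remove pads, then look up words
println(join(words, " "))             # prints "the film great"
```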

In [10]:
# Hit Ctrl-Enter to see random reviews:
r = rand(1:length(xtrn))
println(reviewstring(xtrn[r],ytrn[r]))

Positive review:
immortal question of why they do this everyday what rushes through their minds what pushes them to go further and the bonds that are formed while out there on the wild blue <unk> i felt like after watching this film that i not only knew more about big wave surfing but also about the emotional side to the sport this was an element not as developed in the other films and pushed riding giants to a whole new personal level br br overall this film was brilliant never have i witnessed so much passion devotion and love wrapped in a structurally sound film from beginning to end i was impressed i would be very happy if this film won the oscar this year for best documentary and to see a new rebirth in the surfing world and open more doors for films of this nature br br grade out of


In [11]:
# Here are the labels: 1=negative, 2=positive
ytrn'

1×25000 LinearAlgebra.Adjoint{Int8,Array{Int8,1}}:
 2  2  1  1  2  1  1  2  2  1  1  …  1  1  1  2  2  2  2  2  1  1  1

## Define the model

In [12]:
struct SequenceClassifier; input; rnn; output; pdrop; end

In [13]:
SequenceClassifier(input::Int, embed::Int, hidden::Int, output::Int; pdrop=0) =
    SequenceClassifier(param(embed,input), RNN(embed,hidden,rnnType=:gru), param(output,hidden), pdrop)

SequenceClassifier

In [14]:
function (sc::SequenceClassifier)(input)
    embed = sc.input[:, permutedims(hcat(input...))]
    embed = dropout(embed,sc.pdrop)
    hidden = sc.rnn(embed)
    hidden = dropout(hidden,sc.pdrop)
    return sc.output * hidden[:,:,end]
end

(sc::SequenceClassifier)(input,output) = nll(sc(input),output)
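
The indexing in the forward pass is compact, so here is a shape-only sketch using ordinary arrays (no Knet, no RNN). It assumes a toy embedding matrix `W` of size embed × vocab and a batch of two sequences of length five; all sizes are illustrative.

```julia
# Shape walk-through of the embedding lookup with plain arrays.
W = reshape(collect(1.0:12.0), 3, 4)        # embed=3, vocab=4
batch = [[1, 2, 3, 4, 1], [4, 3, 2, 1, 1]]  # B=2 sequences, T=5 tokens each

idx = permutedims(hcat(batch...))   # hcat gives T×B, permutedims gives B×T
E = W[:, idx]                       # embed × B × T

println(size(idx))   # (2, 5)
println(size(E))     # (3, 2, 5)

# Selecting the final time step leaves one column per sequence,
# which is the shape that hidden[:,:,end] takes in the model.
laststep = E[:, :, end]
println(size(laststep))   # (3, 2)
```

Each slice `E[:, b, t]` is the embedding column of token `t` in sequence `b`.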

## Experiment

In [15]:
dtrn = minibatch(xtrn,ytrn,BATCHSIZE;shuffle=true)
dtst = minibatch(xtst,ytst,BATCHSIZE)
length.((dtrn,dtst))

(390, 390)
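
The batch count can be checked by hand. The sketch below assumes that an incomplete final batch is dropped, which is consistent with the 390 batches reported for 25000 instances; the variable names are illustrative.

```julia
# Arithmetic behind the batch counts, assuming partial batches are dropped.
ninstances, batchsize, epochs = 25000, 64, 3

nbatches = div(ninstances, batchsize)          # 390 full batches
leftover = ninstances - nbatches * batchsize   # 40 instances unused per pass
nupdates = nbatches * epochs                   # 1170 training iterations

println((nbatches, leftover, nupdates))
```

The value 1170 matches the iteration count shown by the progress bar during training.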

In [16]:
# Train a new model or load a saved one, depending on the user's answer
function trainresults(file,maker; o...)
    if (print("Train from scratch? "); startswith(readline(),"y"))  # empty input no longer throws
        model = maker()
        progress!(adam(model,ncycle(dtrn,EPOCHS);lr=LR,beta1=BETA_1,beta2=BETA_2,eps=EPS))
        Knet.save(file,"model",model)
        Knet.gc() # To save gpu memory
    else
        isfile(file) || download("http://people.csail.mit.edu/deniz/models/tutorial/$file",file)
        model = Knet.load(file,"model")
    end
    return model
end

trainresults (generic function with 1 method)
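
To show what the `LR`, `BETA_1`, `BETA_2` and `EPS` constants control, here is a minimal sketch of one textbook Adam update on plain arrays. It is not Knet's implementation, and details such as where `eps` is added may differ there; `adam_step!` is a hypothetical name.

```julia
# One textbook Adam step on plain arrays (illustrative, not Knet's code).
# w: parameters, g: gradient, m/v: running first/second moments, t: step number.
function adam_step!(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    @. m = beta1 * m + (1 - beta1) * g      # update first moment
    @. v = beta2 * v + (1 - beta2) * g^2    # update second moment
    mhat = m ./ (1 - beta1^t)               # bias correction
    vhat = v ./ (1 - beta2^t)
    @. w -= lr * mhat / (sqrt(vhat) + eps)  # parameter update
    return w
end

w = [1.0, -2.0]
g = [0.5, -3.0]
m, v = zero(w), zero(w)
adam_step!(w, g, m, v, 1)
println(w)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's magnitude.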

In [17]:
maker() = SequenceClassifier(VOCABSIZE,EMBEDSIZE,NUMHIDDEN,NUMCLASS,pdrop=DROPOUT)
# Uncomment to evaluate the untrained model:
# model = maker()
# nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)
# (0.69312066f0, 0.69312423f0, 0.5135817307692307, 0.5096153846153846)

maker (generic function with 1 method)
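
The untrained loss of about 0.6931 noted in the comments is close to log(2), the value a two-class model produces when it assigns equal scores to both classes. The sketch below computes a negative log-likelihood from raw scores in plain Julia, assuming labels in `1:C` and averaging over instances; `toy_nll` is a hypothetical stand-in, not Knet's `nll`.

```julia
# Average negative log-likelihood from unnormalized scores (illustrative).
# scores: C × B matrix, labels: B integers in 1:C.
function toy_nll(scores::AbstractMatrix, labels::AbstractVector{<:Integer})
    total = 0.0
    for j in axes(scores, 2)
        col = scores[:, j]
        mx = maximum(col)                          # subtract max for stability
        logz = mx + log(sum(exp.(col .- mx)))      # log of the softmax normalizer
        total += logz - col[labels[j]]             # -log p(correct class)
    end
    return total / length(labels)
end

uniform = toy_nll(zeros(2, 3), [1, 2, 1])   # equal scores for both classes
println(uniform ≈ log(2))                   # prints true
```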

In [18]:
# 2.51e-01  100.00%┣████████████████████┫ 1170/1170 [00:16/00:16, 75.46i/s]
model = trainresults("imdbmodel113.jld2",maker);

Train from scratch? stdin> y
1.07e-01  100.00%┣████████████████████┫ 1170/1170 [00:15/00:15, 75.55i/s]


In [19]:
# Uncomment to evaluate the trained model (values below are from a previous run):
# nll(model,dtrn), nll(model,dtst), accuracy(model,dtrn), accuracy(model,dtst)
# (0.059155148f0, 0.3877507f0, 0.9846153846153847, 0.8583733974358975)

## Playground

In [20]:
predictstring(x)="\nPrediction: " * ("Negative","Positive")[argmax(Array(vec(model([x]))))]
UNK = VOCABSIZE-2
str2ids(s::String)=[(i=get(imdbdict,w,UNK); i>=UNK ? UNK : i) for w in split(lowercase(s))]

str2ids (generic function with 1 method)
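
A toy run of the same two steps, with a made-up dictionary and made-up scores, may clarify the logic: words that are missing from the dictionary, or whose id is at or above the unknown-token id, all map to the unknown token, and the predicted label is the class with the highest score.

```julia
# Toy versions of str2ids and the argmax label lookup (all values made up).
toydict = Dict("this" => 1, "movie" => 2, "rare" => 7)   # word => id
toyunk = 5                                               # toy unknown-token id

toy_str2ids(s::String) =
    [(i = get(toydict, w, toyunk); i >= toyunk ? toyunk : i)
     for w in split(lowercase(s))]

ids = toy_str2ids("This movie rocks rare")
println(ids)     # [1, 2, 5, 5]: "rocks" is missing, "rare" is too infrequent

scores = [0.2, 1.5]                                  # made-up class scores
label = ("Negative", "Positive")[argmax(scores)]
println(label)   # Positive
```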

In [21]:
# Here we can see predictions for random reviews from the test set; hit Ctrl-Enter to sample:
r = rand(1:length(xtst))
println(reviewstring(xtst[r],ytst[r]))
println(predictstring(xtst[r]))

Positive review:
<s> if you get the slight enjoyment out of pink floyd's music you will love this movie the score is completely pink floyd and of course the drug element plays a major part in this movie giving you the doubts about life within the weakest moments this movie also touches the heart with the story about love and the people around you there is also a huge connection with the world around you with the environment of a personal island this thing tell me i need ten lines to sum up a movie but i am done that is all you get that is why this movie is a 6 1 which is a major upset to any movie with a score like this take a look at requiem for a dream and the fountain equally good scores for our generation but <unk>

Prediction: Positive


In [22]:
# Here the user can enter their own reviews and classify them:
println(predictstring(str2ids(readline(stdin))))

stdin> i do not recommend this movie

Prediction: Negative
