# Text analysis demo

This session will put into practice the methods we coverered last week. 

We'll run through:
 * tf-idf
 * topic modelling
 * word embeddings
 * ltsms
 
We'll use a module we've started developing to make it quicker and less repetitive to code and use these techniques, imaginatively called 'text_analysis'.

The code (along with this notebook) lives at: https://github.com/dfleetwood/text_analysis . Please send me your github username and I'll add you to the repository. The code is very new - so there are probably lots of bugs at the moment, but we'll be doing a fair bit of work on it over the coming weeks...

<b>Make sure you switch to a 'GPU' runtime. Click 'Runtime' -> 'Change runtime type' -> Change 'Hardware accelerator' to GPU

In [0]:
#A few housekeeping things to make sure everything runs ok

!wget https://raw.githubusercontent.com/dfleetwood/text_analysis_public/master/text_task.py
!wget https://raw.githubusercontent.com/dfleetwood/text_analysis_public/master/imdb_small.csv
    
import text_task
imdb_task.load_embedding_model ('glove-twitter-25')
del imdb_task

--2019-02-08 13:15:54--  https://raw.githubusercontent.com/dfleetwood/text_analysis_public/master/text_task.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17632 (17K) [text/plain]
Saving to: ‘text_task.py’


2019-02-08 13:15:55 (1.42 MB/s) - ‘text_task.py’ saved [17632/17632]

--2019-02-08 13:15:58--  https://raw.githubusercontent.com/dfleetwood/text_analysis_public/master/imdb_small.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9380178 (8.9M) [text/plain]
Saving to: ‘imdb_small.csv’


2019-02-08 13:15:59 (63.0 MB/s) - ‘imdb_small

In [0]:
#Import libraries
import pandas as pd
import numpy as np
import text_task

## Data processing

Our task for this session will be to estimate the sentiment (positive/negative) of movie reviews from https://www.imdb.com/

To make things quicker today, I've made a much smaller version of the data - for the full dataset go to: https://www.kaggle.com/utathya/imdb-review-dataset

IMDB asks for both a textual description and a score for how much someone liked or disliked a movie. This makes it really useful for testing sentiment analysis - we can link text to quantitative sentiment scores (in this dataset the scores are reduced to 'positive' and 'negative'.


In [0]:
#Read in the data
imdb = pd.read_csv ("imdb_small.csv", encoding = "latin")

The file has five columns:
 * ID: The id of the review
 * Type: Whether the review is from the training or test set (more on this later)
 * Review: The actual review text
 * Label: Whether the quantitative scoring was positive or negative - this is what we'll be trying to predict
 * File: The original file name the review was drawn from (this is just an administrative column)

In [0]:
imdb.columns

Index(['Unnamed: 0', 'type', 'review', 'label', 'file'], dtype='object')

In [0]:
imdb

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,43228,train,This has to be the funniest stand up comedy I ...,pos,3906_10.txt
1,48074,train,This is apparently the second remake of this f...,pos,8268_7.txt
2,38204,train,In order to stop her homosexual friend Albert ...,pos,10634_10.txt
3,25573,train,Forest of the Damned starts out as five young ...,neg,10516_4.txt
4,34858,train,This little show is obviously some stupid litt...,neg,7623_4.txt
5,46871,train,Sitting on the front porch of his Burbank home...,pos,7185_10.txt
6,47078,train,A suspenseful thriller that bears some resembl...,pos,7371_7.txt
7,37631,train,"I really like Miikes movies about Yakuza, this...",pos,10118_7.txt
8,26896,train,This 2003 made for TV movie was shown on a wom...,neg,11707_1.txt
9,32027,train,"The animation was good, the imagery was good, ...",neg,5075_3.txt


In total, we've got 7,000 reviews. The full dataset has 100,000

In [0]:
imdb.shape

(7000, 5)

Whenever you're developing a model it's a very good idea to hold back some data that you won't use for training, and will just be used to give you an indication of how well your model perform on data it's never seen before - https://scikit-learn.org/stable/modules/cross_validation.html

Our data is already split into training examples, that we'll use to train the model, and testing examples, that we'll use to see how well it's doing.

We have 5,000 training examples and 2,000 test examples

In [0]:
#Split training and test data
imdb_train = imdb [imdb.type == "train"]
imdb_test = imdb [imdb.type == "test"]

In [0]:
imdb_train.shape, imdb_test.shape

((5000, 5), (2000, 5))

## Creating 'text task' class instance and loading data into it

The text_analysis module is organised around 'TextTasks'. This is a class that holds all the data and routines for a text analysis job in one place. Once the data is loaded in you can save and load the class instance to pick up where you left off.

In [0]:
#Initialise tast task object
imdb_task = text_task.TextTask()

You can load as many different text datasets into task as you like - here we load the training and test data in separately to make sure they don't get mixed up. 

Give each dataset you load in a tag/label so you know what data you're working with later.

In [0]:
#Load training data into the text task instance - we label it 'train'
imdb_task.add_text(text = imdb_train.review.values, text_tag="train")
#And load the test data - we label it 'test'
imdb_task.add_text(text = imdb_test.review.values, text_tag="test")

The text is saved in a dictionary called 'texts' within the class

In [0]:
imdb_task.texts ['train']

array(["This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny.",
       'This is apparently the second remake of this film, having been filmed before in 1911 and 1918. And, in so many ways it reminds me of the later film, A YANK AT OXFORD. Both films concern a conceited blow-hard who arrives at one of the top schools in the world and both, ultimately, show the blow-hard slowly learning about teamwork and decency. In this film, William Haines is "Tom Brown" and his main rival, "Bob" is played by Frances X. Bushman. And, in a supporting role is Jack Pickford--always remembered as the brother of Mary. Of these three, Pickford comes off the best, as the sympathetic loser who becomes Tom\'s pal--he actually has a few decent scenes as well as a dramatic

## Cleaning/preprocessing text

The first real job with working with text is to clean/preprocess it.

We do this with the 'clean_text' method (or write your own routine). Specify the dataset you want to clean using the tag you gave it, and how you want to clean it. At the moment, the method will overwrite your data, make sure you save the original version somewhere else!

In [0]:
#Before
imdb_task.texts ['train'][0]

"This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny."

In [0]:
imdb_task.clean_text (text_tag="train", processes=['urls', 'punctuation', 'numeric', 'lower'])

In [0]:
#After
imdb_task.texts ['train'][0]

'this has to be the funniest stand up comedy i have ever seen eddie izzard is a genius he picks in brits americans and everyone in between his style is completely natural and completely hilarious i doubt that anyone could sit through this and not laugh their a off watch enjoy it s funny'

In [0]:
#Clean test data
imdb_task.clean_text (text_tag="test", processes=['urls', 'punctuation', 'numeric', 'lower'])

Save the task instance - with a lot of text the cleaning can take a while, so it's a good idea to save so you don't need to do it again and can just pick up where you left off

In [0]:
imdb_task.save ("imdb_text.pkl")

In [0]:
#Reload the instance - uncomment this if you're picking up from here
#mdb_task = text_task.load ("imdb_text.pkl")

## Tf-idf

Tf-idf (text frequency-inverse document frequency) is a good place to start with most tasks

Calculating the tf-idf scores for each review is easy with the class instance - just call 'fit_tf_idf' with the tag of the dataset you want to train the model on

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

In [0]:
count_vectorizer = CountVectorizer(max_features=10000, stop_words='english')

In [0]:
cvec_train = count_vectorizer.fit_transform(imdb_task.texts ['train'])
cvec_test = count_vectorizer.fit_transform(imdb_task.texts ['test'])

In [0]:
cvec_feature_names = count_vectorizer.get_feature_names()

In [0]:
#lda_model = LatentDirichletAllocation(n_components=5, max_iter=10, learning_method = "batch", random_state=2344)
topic_model = NMF(n_components=20,  random_state=2344)

In [0]:
topic_model.fit(cvec_train)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=2344, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [0]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [0]:
no_top_words = 5

In [0]:
display_topics(topic_model, cvec_feature_names, no_top_words)

Topic 0:
boots elisabeth days extended poses
Topic 1:
mstk mtv watts selina porno
Topic 2:
files filmed die schtick manufactured
Topic 3:
lisa mtv lovely disliked loneliness
Topic 4:
changed chaos porno schreiber elisabeth
Topic 5:
granted proceedings acolytes threads carpenters
Topic 6:
kissing disliked laser discusses devine
Topic 7:
strange strained ii activities trucks
Topic 8:
baba acolytes wreaking mansfield happen
Topic 9:
pg disliked weaknesses threads scares
Topic 10:
titanic elisabeth yokozuna louis selina
Topic 11:
reasonable disliked threads devine watts
Topic 12:
idealistic filmed ii gravic nickel
Topic 13:
servant england turmoil network engineers
Topic 14:
beginnings active die selina combination
Topic 15:
margaret liner fades farik yuko
Topic 16:
vice value laser killjoy ranch
Topic 17:
lunk liner passions chaos weaknesses
Topic 18:
judicial damn ifc schreiber laughless
Topic 19:
grutter discussed mussolini luckily woody


In [0]:
train_vec = topic_model.transform (cvec_train)
test_vec = topic_model.transform (cvec_test)

In [0]:
import sklearn.linear_model
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)
train_labs = np.array (imdb.label.values [imdb.type == "train"] == "pos")*1
test_labs = np.array (imdb.label.values [imdb.type == "test"] == "pos")*1

In [0]:
log_reg.fit (train_vec, train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=46543,
          solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)

In [0]:
log_reg.score (train_vec, train_labs)

0.7014

In [0]:
log_reg.score (test_vec, test_labs)

0.4995

#### Train tf-idf model

In [0]:
#Train/fit tf-idf model
imdb_task.fit_tf_idf(text_tag='train')

We can access the fitted scores via 'get_tf_idf' with the dataset tag

By default, the tf-idf scores will be as long as our vocabulary is large, which can be massive for a big dataset. I've clipped this to be only the most common 30,000 words.


So we get a matrix that is 5,000 (the number of training reviews) by 30,000 (our clipped vocabulary)

In [0]:
imdb_task.get_tf_idf('train').shape

(5000, 30000)

Most of the columns are 0.0 (the word in our vocabulary wasn't mentioned in the particular review) - tf-idf is not very efficient...

In [0]:
list (imdb_task.get_tf_idf('train')[0])

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0

#### Apply model to test set

While we could use the above method on our test set that's likely to give use a biased result - the test set is supposed to represent data we've never seen before.

So instead of calling 'fit_tf_idf' call 'transform_tf_idf' - this <b>applies</b> the model we trained earlier rather than training it. We specify the tag of data it was trained on and the tag of the data we want to apply it to

In [0]:
imdb_task.transform_tf_idf(fit_model_tag = 'train', to_text_tag='test')

For the test data we get a matrix that's 2,000 (the number of test reviews) by 30,000 (the clipped vocabulary size of the <b> training data </b>)

In [0]:
imdb_task.get_tf_idf('test').shape

(2000, 30000)

#### Predict sentiment using the tf-idf scores

We can use the tf-idf scores just like any other data.

Here we feed them into a logistic regression model to try to predict the sentiment of the reviews

In [0]:
import sklearn.linear_model

In [0]:
#Create regression model class instance
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)

In [0]:
#Format the sentiment labels from the imdb data so we can use them in the regression model (turn 'pos' to '1' and 'neg' to '0')
train_labs = np.array (imdb.label.values [imdb.type == "train"] == "pos")*1
test_labs = np.array (imdb.label.values [imdb.type == "test"] == "pos")*1

Train the regression model on the training data. Because our data are 30,000 columns wide this takes a long time to fit even with 5,000 reviews, it takes a very long time and isn't really practical with a big dataset (though dimensionality reduction could help here: https://scikit-learn.org/stable/modules/unsupervised_reduction.html

In [0]:
#Train the regression model on the training data. Because our data are 30,000 columns wide this takes a long time to fit (even with )
log_reg.fit (imdb_task.get_tf_idf(text_tag='train'), train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=46543,
          solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)

We get 84% accuracy on training data 

In [0]:
log_reg.score(imdb_task.get_tf_idf(text_tag='train'), train_labs)

0.843

And 80% on the test data

In [0]:
log_reg.score(imdb_task.get_tf_idf(text_tag='test'), test_labs)

0.8055

## Topic modelling

Topic modelling refers to a set techniques to automatically identify key topics in a corpus of text.

In [0]:
#To train the topic model we need to specify the number of topics we want to find
#We choose 20 in this case, but it's a good idea to play with this 
#and see what results come out
imdb_task.train_topic_model (text_tag = "train", ntopics = 20)

In [0]:
#To try to understand what the topics relate to, we show the 5 words (or more)
#that are most stronly associated with each topic.

#These can be a little hard to understand at first and usually need a bit of 
#digging to interpret!
imdb_task.display_topics(nwords = 5)

In [0]:
#We can also get the probability of each topic being referred to in 
#each review. We do this for the train and test sets
train_topic_preds = imdb_task.get_topic_preds ('train')
test_topic_preds = imdb_task.get_topic_preds ('test')

In [0]:
#We get a matrix with each review (5000) by each topic (20).
#This is a much more manageable set of data that we can try to classify
#sentiment with
train_topic_preds.shape

In [0]:
#Load the logistic regression model and format the labels as before
import sklearn.linear_model
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)
train_labs = np.array (imdb.label.values [imdb.type == "train"] == "pos")*1
test_labs = np.array (imdb.label.values [imdb.type == "test"] == "pos")*1

In [0]:
#Fit the model to the training data
log_reg.fit (train_topic_preds, train_labs)

Our accuracy has gone down quite a bit... we could try other "dimensionality reduction" techniques https://scikit-learn.org/stable/modules/unsupervised_reduction.html

In [0]:
log_reg.score (train_topic_preds, train_labs)

In [0]:
log_reg.score (test_topic_preds, test_labs)

## Word embeddings

Word embeddings often do a much better job of capturing the semantic relationships between words that bag-of-words methods with tf-idf, and are easier to work with.

If the text we're using isn't too obscure, you can usually get better performance using 'pre-trained' embeddings rather than training your own. These are models that Google, Facebook and other companies with lots of computing power available have trained on enormous datasets.

In [0]:
#There are lots of pretrained models available. You can list them like this. This shows the type of model (e.g. glove)
#the data it was trained on (e.g. wikipedia or twitter) and the size of the embedding (the number of columns). 

#Glove/Word2vec are often better at capturing semantic relationships while fasttext is often better at semantics.
#Larger embedding sizes usually give better performance but are slower to work with.

imdb_task.get_available_embeddings()

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

The models can take a long time to download (the fasttext one is 1GB), so we'll use the smallest embedding size. I'd strongly recommend changing this for a bigger model for a real application.

Applying the model to text is as simple as calling 'embed_text' with the dataset tag and the model you want to use. This will take quite a while to load the model, but you only need to do this once every session and it's really quick to apply after this.

In [0]:
imdb_task.embed_text(text_tags='train', pretrained_model_tag='glove-twitter-25')

Loading model
Embedding 5000 texts
0
1000
2000
3000
4000


In [0]:
imdb_task.embed_text(text_tags='test', pretrained_model_tag='glove-twitter-25')

Embedding 2000 texts
0
1000


#### Semantics of word embeddings

In [0]:
imdb_task.emb_models ['glove-twitter-25'].most_similar ("kanye", topn = 15)

  if np.issubdtype(vec.dtype, np.int):


[('jay-z', 0.9496527314186096),
 ('drake', 0.9476633667945862),
 ('trey', 0.9405695796012878),
 ('wayne', 0.9334985017776489),
 ('ross', 0.9329334497451782),
 ('beyonce', 0.9278717041015625),
 ('stevie', 0.924564778804779),
 ('usher', 0.9236401915550232),
 ('ciara', 0.9228289723396301),
 ('nicki', 0.9154650568962097),
 ('pharrell', 0.9058845639228821),
 ('mariah', 0.9040012955665588),
 ('minaj', 0.903660774230957),
 ('chainz', 0.9019912481307983),
 ('tyga', 0.9009972214698792)]

In [0]:
imdb_task.emb_models ['glove-twitter-25'].most_similar ("queen", topn = 15)

  if np.issubdtype(vec.dtype, np.int):


[('princess', 0.9393543004989624),
 ('lady', 0.933632493019104),
 ('prince', 0.9268781542778015),
 ('king', 0.9202420711517334),
 ('aka', 0.8976844549179077),
 ('hero', 0.8970822095870972),
 ('beautiful', 0.8900729417800903),
 ('angel', 0.8794295787811279),
 ('love', 0.8791000843048096),
 ('lana', 0.8773646354675293),
 ('song', 0.8772189617156982),
 ('hunter', 0.874837338924408),
 ('baby', 0.8734958171844482),
 ('singer', 0.871101975440979),
 ('star', 0.8697962164878845)]

In [0]:
emb_model = imdb_task.emb_models ['glove-twitter-25']
emb_model.similar_by_vector (emb_model["london"] - emb_model ["uk"] + emb_model ["france"], topn = 5)

  if np.issubdtype(vec.dtype, np.int):


[('france', 0.9207379221916199),
 ('paris', 0.9196597337722778),
 ('lyon', 0.8785386085510254),
 ('grand', 0.8772081732749939),
 ('marseille', 0.8520588874816895)]

In [0]:
print ("Similarity nurse/she: " + str (emb_model.similarity ("nurse", "she")) + " Similarity nurse/he: " + str (emb_model.similarity ("nurse", "he")))
print ("Similarity doctor/she: " + str (emb_model.similarity ("doctor", "she")) + " Similarity nurse/he: " + str (emb_model.similarity ("doctor", "he")))

Similarity nurse/she: 0.49813786 Similarity nurse/he: 0.3893748
Similarity doctor/she: 0.70306164 Similarity nurse/he: 0.7907317


  if np.issubdtype(vec.dtype, np.int):


#### Search text

One useful application of word embeddings is searching for text matching a particular term or phrase

In [0]:
lookup_word = 'horror'

In [0]:
#Specify the lookup word or phrase, the model you want to use (make sure to apply the embeddings first) and the dataset.
#We get back the similarity to the lookup word for each word in the specified dataset, the similarity for each document,
#and the order of the index of the dataset, with the most similar document first and the least similar last.

word_similarities, document_similarities, ordering = imdb_task.embedding_lookup(text_tag='train', 
                                                           pretrained_model_tag='glove-twitter-25', 
                                                           lookup_text=lookup_word)

0
1000
2000
3000
4000


We can use the word simiarities to highlight the words the model is paying attention to when searching.

The advantage of using word embeddings to do this search over a simple keyword search is that the embeddings pay attention to similar words, not just exact matches: e.g. evil, killer, ghost

In [0]:
imdb_task.print_coloured_text (np.array (imdb_task.colour_attention (text_tag='train', attention_weights = word_similarities)) [ordering [0:20]])

0
1000
2000
3000
4000


#### Predict sentiment using the word embeddings

To use the word embeddings to predict embeddings we'll need to turn them into sentence/document embeddings, so they look more like a traditional dataset.

The 'word_to_sentence_embedding' method does this in quite a simple way - delete the embeddings of all the 'stopwords' (common uninformative words like 'a', 'it', etc.) and average the embeddings of the words that are left

In [0]:
imdb_task.word_to_sentence_embedding(text_tag='train', pretrained_model_tag='glove-twitter-25')
imdb_task.word_to_sentence_embedding(text_tag='test', pretrained_model_tag='glove-twitter-25')

Embedding 5000 sentences
0
1000
2000
3000
4000
Embedding 2000 sentences
0
1000


This gets us a traditional looking dataset that's 5000 (the number of training reviews) by 25 (the length of our embedding)

In [0]:
imdb_task.sent_embeds['train']['glove-twitter-25'].shape

(5000, 25)

Fit the logistic regression model as we did before

In [0]:
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)

In [0]:
log_reg.fit (imdb_task.sent_embeds['train']['glove-twitter-25'], train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=46543,
          solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)

This doesn't do very well: achieving around 74% accuracy on the training and test sets. This is worse than tf-idf for this particular dataset.

The problem is almost certainly the length of the text and the simple way we turned word embeddings into sentence embeddings - the reviews are quite long, so there are a lot of words left even after we remove stopwords, so we're probably still including a lot of words that aren't particuarly useful when we average the embeddings.

This technique usually does much better on short texts

In [0]:
log_reg.score(imdb_task.sent_embeds['train']['glove-twitter-25'], train_labs)

0.7444

In [0]:
log_reg.score(imdb_task.sent_embeds['test']['glove-twitter-25'], test_labs)

0.7475

In [0]:
#imdb_task.save ("imdb_task.pkl")

In [0]:
#imdb_task = text_task.load ("imdb_task.pkl")

## Training an LSTM

We can try using a more advanced model: Long Short-Term Memory networks (LTSMs). These are a type of neural network that is able to capture the context in which words are used (i.e. separating Apple the company from apply the fruit) and are able to learn to ignore particlar words.

We use an LSTM model that has been 'pretrained' on Wikipedia, so it doesn't need to learn everything from scratch. We 'finetune' this model on our data before trying to use it to predict sentiment. This 'finetuning' involves training the model to predict the next word in the reviews based on the previous words. This way it can learn about any words that it hasn't seen before and better learn the structure of language for the data we're using.

The pretrained model comes from fastai (https://docs.fast.ai/text.html), which is also a great resource for learning about neural networks.

In [0]:
#Turn the sentiment labels we'll be predicting into 0's and 1's
train_labs = np.array (imdb.label.values [imdb.type == "train"] == "pos")*1
test_labs = np.array (imdb.label.values [imdb.type == "test"] == "pos")*1

In [0]:
#The model needs to be initialised by passing it the tags for the data we'll 
#use for pretraining and training the classifier (these are the same in this 
#case) and the sentiment labels
imdb_task.init_ltsm_classifier (pretrain_traindf_tag = "train", 
                                pretrain_testdf_tag = "test", 
                                classif_traindf_tag = "train", 
                                classif_testdf_tag = "test", 
                                classif_trn_labs = train_labs, 
                                classif_tst_labs = test_labs)

In [0]:
#Pretrain the model. The accuracy here is based on predicting the next word
#not predicting sentiment!
imdb_task.pretrain_ltsm()

epoch,train_loss,valid_loss,accuracy
1,5.217520,4.825068,0.207069


epoch,train_loss,valid_loss,accuracy
1,4.841149,4.685379,0.219051


Train the finetuned LSTM to predict sentiment.

The model achieves ~88% accuracy, which is better than tf-idf (~80%). But it took a lot longer to train (around 30 mins in this case).

Neural networks like LSTMs usually do better with more data. With the full dataset of 100,000 this model gets to 95.4% accuracy 

In [0]:
imdb_task.train_ltsm_classifier(nepoch = 10)

epoch,train_loss,valid_loss,accuracy
1,0.592078,0.443124,0.809000


epoch,train_loss,valid_loss,accuracy
1,0.495087,0.378202,0.840500
2,0.498038,0.346592,0.855500
3,0.449840,0.416550,0.852000
4,0.454005,0.372477,0.841000
5,0.402910,0.294019,0.877500
6,0.366087,0.301664,0.876500
7,0.396185,0.291826,0.876000
8,0.351480,0.301403,0.871500
9,0.353211,0.316715,0.875000


KeyboardInterrupt: ignored