# Text analysis demo

This session will put into practice the methods we coverered last week. 

We'll run through:
 * tf-idf
 * topic modelling
 * word embeddings
 * ltsms
 
We'll use a module we've started developing to make it quicker and less repetitive to code and use these techniques, imaginatively called 'text_analysis'.

The code (along with this notebook) lives at: https://github.com/dfleetwood/text_analysis . Please send me your github username and I'll add you to the repository. The code is very new - so there are probably lots of bugs at the moment, but we'll be doing a fair bit of work on it over the coming weeks...

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import text_task


from importlib import reload
reload (text_task)



<module 'text_task' from 'C:\\Users\\dfleetwood\\Documents\\text\\text_analysis\\text_task.py'>

## Data processing

Our task for this session will be to estimate the sentiment (positive/negative) of movie reviews from https://www.imdb.com/

To make things quicker today, I've made a much smaller version of the data - for the full dataset go to: https://www.kaggle.com/utathya/imdb-review-dataset

IMDB asks for both a textual description and a score for how much someone liked or disliked a movie. This makes it really useful for testing sentiment analysis - we can link text to quantitative sentiment scores (in this dataset the scores are reduced to 'positive' and 'negative'.


In [2]:
#Read in the data
imdb = pd.read_csv ("imdb_small.csv", encoding = "latin")

The file has five columns:
 * ID: The id of the review
 * Type: Whether the review is from the training or test set (more on this later)
 * Review: The actual review text
 * Label: Whether the quantitative scoring was positive or negative - this is what we'll be trying to predict
 * File: The original file name the review was drawn from (this is just an administrative column)

In [3]:
imdb.columns

Index(['ID', 'type', 'review', 'label', 'file'], dtype='object')

In [4]:
imdb

Unnamed: 0,ID,type,review,label,file
0,43228,train,This has to be the funniest stand up comedy I ...,pos,3906_10.txt
1,48074,train,This is apparently the second remake of this f...,pos,8268_7.txt
2,38204,train,In order to stop her homosexual friend Albert ...,pos,10634_10.txt
3,25573,train,Forest of the Damned starts out as five young ...,neg,10516_4.txt
4,34858,train,This little show is obviously some stupid litt...,neg,7623_4.txt
5,46871,train,Sitting on the front porch of his Burbank home...,pos,7185_10.txt
6,47078,train,A suspenseful thriller that bears some resembl...,pos,7371_7.txt
7,37631,train,"I really like Miikes movies about Yakuza, this...",pos,10118_7.txt
8,26896,train,This 2003 made for TV movie was shown on a wom...,neg,11707_1.txt
9,32027,train,"The animation was good, the imagery was good, ...",neg,5075_3.txt


In total, we've got 7,000 reviews. The full dataset has 100,000

In [5]:
imdb.shape

(7000, 5)

Whenever you're developing a model it's a very good idea to hold back some data that you won't use for training, and will just be used to give you an indication of how well your model perform on data it's never seen before - https://scikit-learn.org/stable/modules/cross_validation.html

Our data is already split into training examples, that we'll use to train the model, and testing examples, that we'll use to see how well it's doing.

We have 5,000 training examples and 2,000 test examples

In [6]:
#Split training and test data
imdb_train = imdb [imdb.type == "train"]
imdb_test = imdb [imdb.type == "test"]

In [7]:
imdb_train.shape, imdb_test.shape

((5000, 5), (2000, 5))

## Creating 'text task' class instance and loading data into it

The text_analysis module is organised around 'TextTasks'. This is a class that holds all the data and routines for a text analysis job in one place. Once the data is loaded in you can save and load the class instance to pick up where you left off.

In [8]:
#Initialise tast task object
imdb_task = text_task.TextTask()

You can load as many different text datasets into task as you like - here we load the training and test data in separately to make sure they don't get mixed up. 

Give each dataset you load in a tag/label so you know what data you're working with later.

In [9]:
#Load training data into the text task instance - we label it 'train'
imdb_task.add_text(text = imdb_train.review.values, text_tag="train")
#And load the test data - we label it 'test'
imdb_task.add_text(text = imdb_test.review.values, text_tag="test")

The text is saved in a dictionary called 'texts' within the class

In [10]:
imdb_task.texts ['train']

array(["This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny.",
       'This is apparently the second remake of this film, having been filmed before in 1911 and 1918. And, in so many ways it reminds me of the later film, A YANK AT OXFORD. Both films concern a conceited blow-hard who arrives at one of the top schools in the world and both, ultimately, show the blow-hard slowly learning about teamwork and decency. In this film, William Haines is "Tom Brown" and his main rival, "Bob" is played by Frances X. Bushman. And, in a supporting role is Jack Pickford--always remembered as the brother of Mary. Of these three, Pickford comes off the best, as the sympathetic loser who becomes Tom\'s pal--he actually has a few decent scenes as well as a dramatic

## Cleaning/preprocessing text

The first real job with working with text is to clean/preprocess it.

We do this with the 'clean_text' method (or write your own routine). Specify the dataset you want to clean using the tag you gave it, and how you want to clean it. At the moment, the method will overwrite your data, make sure you save the original version somewhere else!

In [11]:
#Before
imdb_task.texts ['train'][0]

"This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny."

In [12]:
imdb_task.clean_text (text_tag="train", processes=['urls', 'punctuation', 'numeric', 'lower'])

In [13]:
#After
imdb_task.texts ['train'][0]

'this has to be the funniest stand up comedy i have ever seen eddie izzard is a genius he picks in brits americans and everyone in between his style is completely natural and completely hilarious i doubt that anyone could sit through this and not laugh their a off watch enjoy it s funny'

In [14]:
#Clean test data
imdb_task.clean_text (text_tag="test", processes=['urls', 'punctuation', 'numeric', 'lower'])

Save the task instance - with a lot of text the cleaning can take a while, so it's a good idea to save so you don't need to do it again and can just pick up where you left off

In [15]:
imdb_task.save ("imdb_text.pkl")

In [16]:
#Reload the instance - uncomment this if you're picking up from here
#mdb_task = text_task.load ("imdb_text.pkl")

## Tf-idf

tf-idf (text frequency-inverse document frequency) is a good place to start with most tasks

# ADD TF-IDF SLIDE IMAGE HERE

Calculating the tf-idf scores for each review is easy with the class instance - just call 'fit_tf_idf' with the tag of the dataset you want to train the model on

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

In [43]:
count_vectorizer = CountVectorizer(max_features=10000, stop_words='english')

In [97]:
cvec_train = count_vectorizer.fit_transform(imdb_task.texts ['train'])
cvec_test = count_vectorizer.fit_transform(imdb_task.texts ['test'])

In [98]:
cvec_feature_names = count_vectorizer.get_feature_names()

In [120]:
#lda_model = LatentDirichletAllocation(n_components=5, max_iter=10, learning_method = "batch", random_state=2344)
topic_model = NMF(n_components=20,  random_state=2344)

In [121]:
topic_model.fit(cvec_train)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=2344, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [122]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [123]:
no_top_words = 5

In [124]:
display_topics(topic_model, cvec_feature_names, no_top_words)

Topic 0:
boyle 007 emphasize das facing
Topic 1:
naff nail wealthy portrayal sensing
Topic 2:
films finding dies scott married
Topic 3:
listed nail lowest distributors loner
Topic 4:
championship changes portrayal scorpion emphasize
Topic 5:
govinda profoundly actioner timothy carries
Topic 6:
keira distributors lacked displays development
Topic 7:
studios studied huppert actual ullmann
Topic 8:
bacon actioner writings marlon hamill
Topic 9:
pierce distributors website timothy school
Topic 10:
touch emphasize yokozuna lover sensing
Topic 11:
reduce distributors timothy development wealthy
Topic 12:
humanity finding huppert graffiti notably
Topic 13:
seymour entries underground entitled nope
Topic 14:
bent acts dies sensing college
Topic 15:
mart lines fame faulkner yuck
Topic 16:
vile veidt lacked kansas raveena
Topic 17:
macarthur lines pedestrian changes website
Topic 18:
jeanne cybertracker hungry scorpion lang
Topic 19:
grip displayed nature lurking worrying


In [125]:
train_vec = topic_model.transform (cvec_train)
test_vec = topic_model.transform (cvec_test)

In [158]:
import sklearn.linear_model
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)
train_labs = np.array (imdb.label.values [imdb.type == "train"] == "pos")*1
test_labs = np.array (imdb.label.values [imdb.type == "test"] == "pos")*1

In [159]:
log_reg.fit (train_vec, train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=46543, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False)

In [160]:
log_reg.score (train_vec, train_labs)

0.7028

In [161]:
log_reg.score (test_vec, test_labs)

0.4995

#### Train tf-idf model

In [17]:
#Train/fit tf-idf model
imdb_task.fit_tf_idf(text_tag='train')

We can access the fitted scores via 'get_tf_idf' with the dataset tag

By default, the tf-idf scores will be as long as our vocabulary is large, which can be massive for a big dataset. I've clipped this to be only the most common 30,000 words.


So we get a matrix that is 5,000 (the number of training reviews) by 30,000 (our clipped vocabulary)

In [18]:
imdb_task.get_tf_idf('train').shape

(5000, 30000)

Most of the columns are 0.0 (the word in our vocabulary wasn't mentioned in the particular review) - tf-idf is not very efficient...

In [19]:
list (imdb_task.get_tf_idf('train')[0])

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0

#### Apply model to test set

While we could use the above method on our test set that's likely to give use a biased result - the test set is supposed to represent data we've never seen before.

So instead of calling 'fit_tf_idf' call 'transform_tf_idf' - this <b>applies</b> the model we trained earlier rather than training it. We specify the tag of data it was trained on and the tag of the data we want to apply it to

In [20]:
imdb_task.transform_tf_idf(fit_model_tag = 'train', to_text_tag='test')

For the test data we get a matrix that's 2,000 (the number of test reviews) by 30,000 (the clipped vocabulary size of the <b> training data </b>)

In [21]:
imdb_task.get_tf_idf('test').shape

(2000, 30000)

#### Predict sentiment using the tf-idf scores

We can use the tf-idf scores just like any other data.

Here we feed them into a logistic regression model to try to predict the sentiment of the reviews

In [22]:
import sklearn.linear_model

In [23]:
#Create regression model class instance
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)

In [15]:
#Format the sentiment labels from the imdb data so we can use them in the regression model (turn 'pos' to '1' and 'neg' to '0')
train_labs = np.array (imdb.label.values [imdb.type == "train"] == "pos")*1
test_labs = np.array (imdb.label.values [imdb.type == "test"] == "pos")*1

Train the regression model on the training data. Because our data are 30,000 columns wide this takes a long time to fit even with 5,000 reviews, it takes a very long time and isn't really practical with a big dataset (though dimensionality reduction could help here: https://scikit-learn.org/stable/modules/unsupervised_reduction.html

In [25]:
#Train the regression model on the training data. Because our data are 30,000 columns wide this takes a long time to fit (even with )
log_reg.fit (imdb_task.get_tf_idf(text_tag='train'), train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=46543, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False)

We get 84% accuracy on training data 

In [26]:
log_reg.score(imdb_task.get_tf_idf(text_tag='train'), train_labs)

0.8434

And 80% on the test data

In [27]:
log_reg.score(imdb_task.get_tf_idf(text_tag='test'), test_labs)

0.8065

## Word embeddings

Word embeddings often do a much better job of capturing the semantic relationships between words that bag-of-words methods with tf-idf, and are easier to work with.

If the text we're using isn't too obscure, you can usually get better performance using 'pre-trained' embeddings rather than training your own. These are models that Google, Facebook and other companies with lots of computing power available have trained on enormous datasets.

In [28]:
#There are lots of pretrained models available. You can list them like this. This shows the type of model (e.g. glove)
#the data it was trained on (e.g. wikipedia or twitter) and the size of the embedding (the number of columns). 

#Glove/Word2vec are often better at capturing semantic relationships while fasttext is often better at semantics.
#Larger embedding sizes usually give better performance but are slower to work with.
imdb_task.get_available_embeddings()

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

The models can take a long time to download (the fasttext one is 1GB), so we'll use the smallest embedding size. I'd strongly recommend changing this for a bigger model for a real application.

Applying the model to text is as simple as calling 'embed_text' with the dataset tag and the model you want to use. This will take quite a while to load the model, but you only need to do this once every session and it's really quick to apply after this.

In [110]:
imdb_task.embed_text(text_tags='train', pretrained_model_tag='glove-twitter-25')

Embedding 5000 texts
0
1000
2000
3000
4000


In [121]:
imdb_task.embed_text(text_tags='test', pretrained_model_tag='glove-twitter-25')

Embedding 2000 texts
0
1000


#### Semantics of word embeddings

In [55]:
imdb_task.emb_models ['glove-twitter-25'].most_similar ("kanye", topn = 15)

[('jay-z', 0.9496528506278992),
 ('drake', 0.9476633071899414),
 ('trey', 0.9405696392059326),
 ('wayne', 0.9334985017776489),
 ('ross', 0.9329334497451782),
 ('beyonce', 0.9278717637062073),
 ('stevie', 0.9245648384094238),
 ('usher', 0.923640251159668),
 ('ciara', 0.9228290319442749),
 ('nicki', 0.9154649972915649),
 ('pharrell', 0.9058846235275269),
 ('mariah', 0.9040012955665588),
 ('minaj', 0.9036608338356018),
 ('chainz', 0.9019912481307983),
 ('tyga', 0.9009972214698792)]

In [58]:
imdb_task.emb_models ['glove-twitter-25'].most_similar ("queen", topn = 15)

[('princess', 0.9393543004989624),
 ('lady', 0.933632493019104),
 ('prince', 0.9268780946731567),
 ('king', 0.9202421307563782),
 ('aka', 0.8976844549179077),
 ('hero', 0.8970820903778076),
 ('beautiful', 0.8900729417800903),
 ('angel', 0.8794296979904175),
 ('love', 0.8791000247001648),
 ('lana', 0.8773646354675293),
 ('song', 0.8772189617156982),
 ('hunter', 0.8748372793197632),
 ('baby', 0.8734957575798035),
 ('singer', 0.8711019158363342),
 ('star', 0.8697961568832397)]

In [89]:
emb_model = imdb_task.emb_models ['glove-twitter-25']
emb_model.similar_by_vector (emb_model["london"] - emb_model ["uk"] + emb_model ["france"], topn = 5)

[('france', 0.9207380414009094),
 ('paris', 0.9196597933769226),
 ('lyon', 0.8785385489463806),
 ('grand', 0.8772082328796387),
 ('marseille', 0.8520588874816895)]

In [106]:
print (emb_model.similarity ("nurse", "she"), emb_model.similarity ("nurse", "he"))
print (emb_model.similarity ("doctor", "she"), emb_model.similarity ("doctor", "he"))

0.49813783167652015 0.3893748164540006
0.7030616423232617 0.7907317529481618


#### Search text

One useful application of word embeddings is searching for text matching a particular term or phrase

In [72]:
lookup_word = 'horror'

In [73]:
#Specify the lookup word or phrase, the model you want to use (make sure to apply the embeddings first) and the dataset.
#We get back the similarity to the lookup word for each word in the specified dataset, the similarity for each document,
#and the order of the index of the dataset, with the most similar document first and the least similar last.

word_similarities, document_similarities, ordering = imdb_task.embedding_lookup(text_tag='train', 
                                                           pretrained_model_tag='glove-twitter-25', 
                                                           lookup_text=lookup_word)

0
1000
2000
3000
4000


We can use the word simiarities to highlight the words the model is paying attention to when searching.

The advantage of using word embeddings to do this search over a simple keyword search is that the embeddings pay attention to similar words, not just exact matches: e.g. zombie, vampire, halloween, thriller 

In [74]:
imdb_task.print_coloured_text (np.array (imdb_task.colour_attention (text_tag='train', attention_weights = word_similarities)) [ordering [0:10]])

0
1000
2000
3000
4000


#### Predict sentiment using the word embeddings

In [122]:
imdb_task.word_to_sentence_embedding(text_tag='train', pretrained_model_tag='glove-twitter-25')
imdb_task.word_to_sentence_embedding(text_tag='test', pretrained_model_tag='glove-twitter-25')

Embedding 5000 sentences
0
1000
2000
3000
4000
Embedding 2000 sentences
0
1000


In [123]:
imdb_task.sent_embeds['train']['glove-twitter-25'].shape

(5000, 25)

In [124]:
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=46543)

In [125]:
log_reg.fit (imdb_task.sent_embeds['train']['glove-twitter-25'], train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=46543, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False)

In [126]:
log_reg.score(imdb_task.sent_embeds['train']['glove-twitter-25'], train_labs)

0.7262

In [127]:
log_reg.score(imdb_task.sent_embeds['test']['glove-twitter-25'], test_labs)

0.7285

In [158]:
imdb_task.save ("imdb_task.pkl")

In [11]:
#imdb_task = text_task.load ("imdb_task.pkl")

## LTSM

In [12]:
from fastai.text import * 

In [13]:
train_df = pd.DataFrame (imdb_task.texts['train'])
train_df.columns = ["text"]

val_df = pd.DataFrame (imdb_task.texts['test'])
val_df.columns = ["text"]

In [16]:
train_df['label'] = train_labs
val_df['label'] = test_labs

In [17]:
# Language model data
data_lm = TextLMDataBunch.from_df(path = "./", train_df = train_df, valid_df = val_df, text_cols = "text", bs = 64)

In [144]:
# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "./", train_df = train_df, valid_df = val_df, text_cols = "text", label_cols="label", vocab=data_lm.train_ds.vocab, bs = 32)

In [18]:
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.5)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-1


KeyboardInterrupt: 

In [None]:
learn.fit_one_cycle(1, 1e-2)

In [152]:
learn = language_model_learner(data_lm, pretrained_model="https://s3.amazonaws.com/fast-ai-modelzoo/wt103-1.tgz", drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-1.tgz


ReadError: not a gzip file

In [None]:
learn.freeze()
learn.fit_one_cycle(1, 1e-3)

In [86]:
imdb_task.embeds['train']['glove-twitter-25']

(25,)

In [65]:
len (agg_sims)

5000

In [63]:
word_sims

array([array([0.61939216, 0.42088   , 0.5062772 , 0.6405264 , 0.61984324,
       0.5082001 , 0.5903946 , 0.54026645, 0.6164763 , 0.5372426 ,
       0.6240126 , 0.6416653 , 0.5556282 , 0.32596833, 0.0342401 ,
       0.576899  , 0.3858249 , 0.5833219 , 0.41261664, 0.3594917 ,
       0.47139272, 0.37920654, 0.31073195, 0.65938735, 0.6038233 ,
       0.47139272, 0.58252776, 0.57074076, 0.6296195 , 0.576899  ,
       0.4744357 , 0.5962877 , 0.65938735, 0.4744357 , 0.5181165 ,
       0.5372426 , 0.5840609 , 0.5997743 , 0.57060707, 0.556755  ,
       0.49304223, 0.5651883 , 0.61939216, 0.65938735, 0.59958464,
       0.6302462 , 0.6371176 , 0.3858249 , 0.4610607 , 0.51337916,
       0.77273124, 0.60864025, 0.28311792, 0.570439  ], dtype=float32),
       array([ 0.61939204,  0.57689902,  0.45473815,  0.61984316,  0.59676421,
        0.44409797,  0.61955441,  0.61939204,  0.5203397 ,  0.68666828,
        0.5090131 ,  0.35521208,  0.56355127,  0.47139274,  0.65938729,
        0.65938729,  0.47139

In [724]:
#imdb_task.train_embed_model('train', 'w2v_model', model_type='w2vec')

Training model


In [492]:
#imdb_task.get_tf_idf('train').shape

In [493]:
#imdb_task.transform_tf_idf(fit_model_tag= 'train', to_text_tag= 'test')

In [494]:
#imdb_task.get_tf_idf(text_tag='test').shape

In [229]:
import sklearn.linear_model


In [45]:
log_reg = sklearn.linear_model.LogisticRegression (C=0.1, solver='newton-cg', random_state=40)

In [46]:
train_labs = np.array (imdb.label.values == "pos")*1
train_labs = train_labs [train_idx]

In [47]:
test_labs = np.array (imdb.label.values == "pos")*1
test_labs = test_labs [test_idx]

In [48]:
log_reg.fit (imdb_task.get_tf_idf(text_tag='train'), train_labs)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=40, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [49]:
log_reg.score(imdb_task.get_tf_idf(text_tag='train'), train_labs)

0.8432

In [50]:
log_reg.score(imdb_task.get_tf_idf(text_tag='test'), test_labs)

0.8065

In [64]:
imdb_task.emb_models = models_back

In [65]:
imdb_task.embed_text('train', 'glove-twitter-25')

Embedding 5000 texts
0
1000
2000
3000
4000


In [26]:
#models_back = imdb_task.emb_models

In [66]:
word_sims, agg_sims, ordering = imdb_task.embedding_lookup('train', 'glove-twitter-25', 'romantic', 
                                                           lookup_agg_method = np.mean, 
                                                           text_agg_method = np.max)

0
1000
2000
3000
4000


In [67]:
coloured_texts = imdb_task.colour_attention ('train', word_sims)

0
1000
2000
3000
4000


In [60]:
imdb_task.print_coloured_text (np.array (coloured_texts) [ordering [0:10]])

In [76]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
stopwords_set = set (stopwords)

In [80]:
np.array (imdb_task.texts_tok ['train'][0]) in np.array (stopwords_set)

False

In [71]:
len (imdb_task.embeds['train']['glove-twitter-25'][0])

54

In [81]:
np.setdiff1d(imdb_task.texts_tok ['train'][0],stopwords)

array(['americans', 'anyone', 'brits', 'comedy', 'completely', 'could',
       'doubt', 'eddie', 'enjoy', 'ever', 'everyone', 'funniest', 'funny',
       'genius', 'hilarious', 'izzard', 'laugh', 'natural', 'picks',
       'seen', 'sit', 'stand', 'style', 'watch'], dtype='<U10')

In [89]:
#imdb_task.texts_tok ['train'][0]

In [101]:
np.average (np.array (imdb_task.embeds['train']['glove-twitter-25'][1]) [np.where (~np.isin (imdb_task.texts_tok ['train'][1],stopwords))[0]], axis = 0)

(25,)

In [94]:
np.where (~np.isin (imdb_task.texts_tok ['train'][0],stopwords))[0]

array([ 5,  6,  8, 11, 12, 13, 14, 17, 19, 21, 22, 24, 28, 30, 31, 33, 34,
       36, 38, 39, 40, 45, 49, 50, 53], dtype=int64)

In [106]:
text_embeds = imdb_task.embeds['train']['glove-twitter-25']
text_tok = imdb_task.texts_tok ['train']

sent_embs = []
for i in range (len (text_tok)):
    if (i % 1000) == 0:
        print (i)
    assert (len (text_embeds[i]) == len (text_tok[i]))
    sent_embs.append (np.average (np.array (text_embeds[i]) [np.where (~np.isin (text_tok[i],stopwords))[0]], axis = 0))
    
sent_embs = np.array (sent_embs)

0
1000
2000
3000
4000


In [107]:
sent_embs.shape

(5000, 25)

In [49]:
from IPython.display import HTML as html_print

In [50]:
html_print (" <br> ".join (np.array (coloured_texts) [ordering [0:10]]))

In [135]:
embeds = imdb_task.embeds ['train']['glove-twitter-25']

In [299]:
for i in range (len (embeds)):
    a = cosine_similarity(embeds [i], lookup_text_emb)

In [544]:
pretrained_model_tag = 'glove-twitter-25'

In [567]:
lookup_text = 'violent action'

In [568]:
text_tag = 'train'

In [558]:
imdb_task.embeds.keys()

dict_keys(['train'])

In [563]:
lookup_agg_method = np.mean

In [576]:
text_agg_method = np.max

In [592]:
# imdb_task.load_embedding_model(pretrained_model_tag)
# 
# emb_model = imdb_task.emb_models [pretrained_model_tag]
# 
# lookup_text_tok = imdb_task._tokenize([lookup_text])
# #print (lookup_text_tok)
# emb_length = emb_model ['a'].shape [0]
# lookup_text_emb = imdb_task._embed_line(lookup_text_tok, emb_model, emb_length)
# lookup_text_emb = np.array (lookup_text_emb)
# lookup_text_emb = lookup_text_emb.squeeze (0)
# 
# embs = imdb_task.embeds [text_tag][pretrained_model_tag]
# 
# emb_sims = []
# emb_sims_max = []
# for i in range(0, len(embs)):
#     if (i % 1000) == 0:
#         print(i)
# 
#     lin_cos_sim = cosine_similarity(embs[i], lookup_text_emb)
#     lin_cos_sim = lookup_agg_method(lin_cos_sim, axis=1)
#     emb_sims.append(lin_cos_sim)
#     max_sims = text_agg_method (lin_cos_sim)
#     emb_sims_max.append(max_sims)
# 
# emb_sims = np.array (emb_sims)
# emb_sims_max = np.array (emb_sims_max)
# 
# ordering = np.argsort(-emb_sims_max)

In [99]:
imdb_task.emb_models['glove-twitter-25']['romance']

array([ 0.010751, -0.72868 ,  0.02601 ,  0.69312 ,  0.97722 , -0.22981 ,
        1.5951  ,  0.14071 ,  0.27606 , -1.4623  ,  0.03556 , -0.46209 ,
       -2.2854  , -0.40581 , -0.27038 ,  0.51349 , -0.46135 ,  0.38363 ,
        0.54332 , -0.1618  , -0.64526 , -0.14408 , -0.9408  , -0.32212 ,
        0.64805 ], dtype=float32)

In [117]:
word_sims, max_sims, ordering = imdb_task.embedding_lookup('train', 'glove-twitter-25', 'romantic', 
                                                           lookup_agg_method = np.mean, 
                                                           text_agg_method = np.max)

0
1000
2000
3000
4000


In [101]:
ordering

array([3041, 4262, 4036, ..., 2449,  302, 2236], dtype=int64)

In [118]:
attention_weights = word_sims

In [119]:
subset = None

In [120]:
text_tag = "train"

In [121]:
from sklearn.utils.extmath import softmax

text_tok = imdb_task.texts_tok [text_tag]

if subset is not None:
    text_tok = text_tok [subset]
    attention_weights = attention_weights [subset]

def cstr(s, color='black'):
    return "<text style=background-color:{}>{}</text>".format(color, s)


def norm (x):
    return ((x-min(x))/(max(x)-min(x)))
    

red = np.array([255, 0, 0])
green = np.array([0, 255, 0])
blue = np.array([0, 0, 255])
red = np.round(red).astype("int64")
blue = np.round(blue).astype("int64")
green = np.round(green).astype("int64")

col = green

coloured_sents = []
for ref in range (len (text_tok)):
    col_sent = []
    attn_weights = attention_weights [ref]
    attn_weights = norm(attn_weights)**5
    #attn_weights = softmax(np.array (attn_weights))

    #attn_weights = attn_weights [0,:]
    
    if (ref % 1000) == 0:
        print(ref)

    for i in range(0, len(text_tok [ref])):

        assert (len (attn_weights) == len (text_tok[ref]))

        alpha = np.array(attn_weights[i]) * 255

        alpha = np.round(alpha).astype("int64")

        col_word = cstr(text_tok[ref][i],
                        '#{:02x}{:02x}{:02x}{:02x}'.format(col[0], col[1], col[2], alpha))

#         if (i % 10) == 0:
#             col_sent.append(" <br> " + col_word)
#         else:
        col_sent.append(col_word)

    coloured_sents.append(" ".join(col_sent))

0
1000
2000
3000
4000


In [41]:
#alpha

In [33]:
#attn_weights

In [71]:
#coloured_sents [0]

In [36]:
#coloured_sents = imdb_task.colour_attention ('train', word_sims)

In [122]:
from IPython.display import HTML as html_print

In [123]:
html_print (" <br> ".join (np.array (coloured_sents) [ordering [0:10]]))

In [602]:
np.array (imdb_task.texts['train'])[ordering]

array(['this was excellent touching action packed and perfect for kurt russel i loved this movie it deserves more than or so stars this movie is the story of an obsolete soldier who learns there is more to life than soldiering and people who learn that there is a time for fighting a need to defend i cried laughed and mostly sat in awe of this story good writing job for an action flick and the plot was appropriate and fairly solid the ending wasn t twisty but it was still excellent if you like escape from new york or rooting for the underdog this movie is for you not an undue amount of gore or violence it was not difficult to watch in that respect something for everyone',
       'after gorging myself on a variety of seemingly immature movies purchased on ex rental dvds i figured that the time was right for a little serious drama and who better to provide it than sam mendes for a number of reasons american beauty doesn t appeal to me as much as this film which is easily the darkest thing

In [595]:
#np.mean (word_sims[0])

In [596]:
#max_sims_a = [np.max (x) for x in word_sims]

In [597]:
#ordering = np.argsort (-np.array (max_sims_a))

In [461]:
#np.array (imdb_task.texts ['train']) [ordering]

In [462]:
embs = imdb_task.embeds['train']['glove-twitter-25']

In [463]:
emb_model = imdb_task.emb_models ['glove-twitter-25']

In [464]:
lookup_text = "action"

In [465]:
lookup_text_tok = lookup_text.split (" ")

In [466]:
emb_length = emb_model ['a'].shape [0]

In [467]:
lookup_text_emb = imdb_task._embed_line (lookup_text_tok, emb_model, emb_length)

In [468]:
from sklearn.metrics.pairwise import cosine_similarity

In [469]:
lin_cos_sim = cosine_similarity (embs[0], lookup_text_emb)

In [209]:
np.mean (lin_cos_sim, axis = 1)

array([0.7064953 , 0.589067  , 0.6213429 , 0.59356546, 0.7458868 ,
       0.589079  , 0.69953656, 0.6129971 , 0.7839712 , 0.5932517 ,
       0.66459924, 0.7034769 , 0.63008404, 0.43208623, 0.20198816,
       0.7342957 , 0.5363021 , 0.67736506, 0.5144093 , 0.518046  ,
       0.7123766 , 0.584567  , 0.49265343, 0.70720315, 0.56344473,
       0.7123766 , 0.7157514 , 0.63224745, 0.73522794, 0.7342957 ,
       0.5645032 , 0.47523054, 0.70720315, 0.5645032 , 0.61516124,
       0.5932517 , 0.6221181 , 0.69119936, 0.6563445 , 0.6570064 ,
       0.523802  , 0.66994524, 0.7064953 , 0.70720315, 0.6591221 ,
       0.55855596, 0.70603883, 0.5363021 , 0.5598166 , 0.72054225,
       0.60620964, 0.67439103, 0.49265248, 0.6046612 ], dtype=float32)

In [193]:
np.max (lin_cos_sim)

0.8427038

In [218]:
lookup_agg_method = np.max
text_agg_method = np.max 

emb_sims = []
emb_sims_max = []
for i in range (0, len (embs)):
    if (i % 1000) == 0:
        print (i)
    lin_cos_sim = cosine_similarity (embs[i], lookup_text_emb)
    lin_cos_sim = text_agg_method (lin_cos_sim, axis = 1)
    emb_sims.append (lin_cos_sim)
    emb_sims_max.append (agg_method (lin_cos_sim))
    
    

0
1000
2000
3000
4000


In [219]:
emb_sims_max [1]

1.0000000000000002

In [220]:
emb_sims_max [0:100]

[0.8427038,
 1.0000000000000002,
 0.87328947,
 0.8541685497959718,
 1.0000000000000002,
 0.9999999,
 0.8500843365069591,
 0.8769028008141451,
 0.8841082724194276,
 0.85008436,
 0.9999999,
 0.85008436,
 1.0000000000000002,
 0.99999994,
 1.0000000000000002,
 0.85008436,
 1.0000000000000002,
 0.8607826,
 0.8500843365069591,
 0.86093867,
 0.8624499122122617,
 0.8968295,
 0.8620199623511139,
 0.8500843365069591,
 0.99999994,
 0.8500843365069591,
 1.0000000000000002,
 1.0000000000000002,
 0.8427036851650482,
 0.85008436,
 1.0000000000000002,
 0.8518469169916202,
 0.8500843365069591,
 0.8841082724194276,
 0.8500843365069591,
 0.8841082724194276,
 0.8427036851650482,
 0.85008436,
 0.8500843365069591,
 0.8681443,
 0.8841082,
 0.9999999,
 0.85008436,
 0.85008436,
 0.8624499122122617,
 0.99999994,
 1.0000000000000002,
 0.82852024,
 0.8620199623511139,
 0.8500843365069591,
 0.85008436,
 0.8427038,
 0.9999999,
 0.8500843365069591,
 0.875910588479086,
 0.8968294526476451,
 1.0000000000000002,
 0.917

In [169]:
np.max (lin_cos_sim)

0.84270376

In [51]:
import gensim

In [52]:
import gensim.downloader as api

In [3]:
a = api.info()


In [None]:
list (a["models"].keys())

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

In [14]:
emb_model = api.load ("glove-twitter-25")

In [19]:
emb_model ['a'].shape[0]

25

In [510]:
gensim.models.word2vec.Word2VecKeyedVectors()

<module 'gensim.models' from 'C:\\Users\\dfleetwood\\PycharmProjects\\text_analysis\\venv\\lib\\site-packages\\gensim\\models\\__init__.py'>