# Beyond counting words: Working with word embeddings

Workshop by Damian Trilling

This notebook illustrates how we can use embeddings in Machine Learning tasks.

As always, we first import the necessary modules. We also load our data.

In [57]:
#!pip install embeddingvectorizer    # you need to install this module

In [58]:
# Supervised text classification
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import shuffle
from sklearn import metrics
import joblib
import eli5
from nltk.sentiment import vader

from embeddingvectorizer import EmbeddingCountVectorizer, EmbeddingTfidfVectorizer
import embeddingvectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier


# general
import numpy as np
import re
# word embedding stuff
import gensim
import gensim.downloader as api
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
from gensim.corpora import Dictionary
from gensim.models import WordEmbeddingSimilarityIndex

# data
from courseutils import get_review_data

# let's get more verbose output
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
# get data
reviews_train, reviews_test, y_train, y_test = get_review_data()

reviews_train, y_train = shuffle(reviews_train, y_train, random_state=42)
reviews_test, y_test = shuffle(reviews_test, y_test, random_state=42)

# get word embedding model

# pretrained:
# wv = api.load('word2vec-google-news-300')
# wv = api.load("glove-wiki-gigaword-300")

# or our own:
wv = gensim.models.Word2Vec.load("mymodel").wv

Using cached file reviewdata.pickle.bz2


2021-04-08 14:56:15,329 : INFO : loading Word2Vec object from mymodel
2021-04-08 14:56:15,508 : INFO : loading wv recursively from mymodel.wv.* with mmap=None
2021-04-08 14:56:15,509 : INFO : loading vectors from mymodel.wv.vectors.npy with mmap=None
2021-04-08 14:56:15,578 : INFO : setting ignored attribute vectors_norm to None
2021-04-08 14:56:15,579 : INFO : loading vocabulary recursively from mymodel.vocabulary.* with mmap=None
2021-04-08 14:56:15,580 : INFO : loading trainables recursively from mymodel.trainables.* with mmap=None
2021-04-08 14:56:15,581 : INFO : loading syn1neg from mymodel.trainables.syn1neg.npy with mmap=None
2021-04-08 14:56:15,601 : INFO : setting ignored attribute cum_table to None
2021-04-08 14:56:15,602 : INFO : loaded mymodel


In [3]:
# explore data here

# Task 1: Document similarities

In [4]:
termsim_index = WordEmbeddingSimilarityIndex(wv)
documents = [e.lower().split() for e in reviews_train[:100]]

id2word = Dictionary(documents)
bow_corpus = [id2word.doc2bow(document) for document in documents]
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, id2word)  # construct similarity matrix
docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)

2021-04-08 14:56:19,509 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-04-08 14:56:19,551 : INFO : built Dictionary(6268 unique tokens: ['"a', '"play', '"the', '-', 'a']...) from 100 documents (total 22165 corpus positions)
2021-04-08 14:56:19,580 : INFO : constructing a sparse term similarity matrix using <gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x7f563d125a90>
2021-04-08 14:56:19,582 : INFO : iterating over columns in dictionary order
2021-04-08 14:56:19,588 : INFO : PROGRESS: at 0.02% columns (1 / 6268, 0.015954% density, 0.015954% projected density)
2021-04-08 14:56:19,596 : INFO : precomputing L2-norms of word weight vectors
2021-04-08 14:56:25,796 : INFO : PROGRESS: at 15.97% columns (1001 / 6268, 0.120628% density, 0.671393% projected density)
2021-04-08 14:56:29,934 : INFO : PROGRESS: at 31.92% columns (2001 / 6268, 0.174843% density, 0.513663% projected density)
2021-04-08 14:56:33,768 : INFO : PROGRESS: at 47.88% columns (3001 / 
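
The index built above ranks documents by *soft* cosine similarity: unlike plain cosine similarity, two documents can be similar even if they share no words, as long as their words are close in the embedding space. The following is a minimal sketch of the textbook formula on a hypothetical three-word vocabulary with a made-up term similarity matrix `S`. It is not gensim's implementation, which works on sparse matrices and may differ in details.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine similarity between two bag-of-words vectors a and b,
    given a term-term similarity matrix S (S[i, j] = similarity of word i and word j)."""
    numerator = a @ S @ b
    denominator = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return float(numerator / denominator) if denominator > 0 else 0.0

# Hypothetical vocabulary and hand-made similarities (illustration only):
# "film" and "movie" are near-synonyms, "boring" is unrelated to both.
vocab = ["film", "movie", "boring"]
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

doc_film = np.array([1.0, 0.0, 0.0])   # a document containing only "film"
doc_movie = np.array([0.0, 1.0, 0.0])  # a document containing only "movie"

# With the identity matrix, soft cosine reduces to ordinary cosine similarity.
plain = soft_cosine(doc_film, doc_movie, np.eye(3))
soft = soft_cosine(doc_film, doc_movie, S)
print(f"plain cosine: {plain:.2f}, soft cosine: {soft:.2f}")
```

With ordinary cosine similarity the two toy documents are unrelated because they share no token. The soft version picks up that "film" and "movie" are close.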

In [5]:
query = '''Pulp Fiction may be the single best film ever made, and quite appropriately it is by one of the most 
creative directors of all time, Quentin Tarantino. This movie is amazing from the beginning definition of pulp to
the end credits and boasts one of the best casts ever assembled with the likes of Bruce Willis, Samuel L. Jackson, 
John Travolta, Uma Thurman, Harvey Keitel, Tim Roth and Christopher Walken. The dialog is surprisingly humorous for
this type of film, and I think that's what has made it so successful. Wrongfully denied the many Oscars it was 
nominated for, Pulp Fiction is by far the best film of the 90s and no Tarantino film has surpassed the quality of
this movie (although Kill Bill came close). As far as I'm concerned this is the top film of all-time and definitely 
deserves a watch if you haven't seen it.
'''.lower().split()
sims = docsim_index[id2word.doc2bow(query)]                                                                

In [18]:
# or let's use the first, second, or any other document itself as the query

docindex = 2

sims = docsim_index[id2word.doc2bow(documents[docindex])]      

In [19]:
# check whether everything's ok
" ".join(documents[docindex]), reviews_train[docindex]

('after watching this movie i was honestly disappointed - not because of the actors, story or directing - i was disappointed by this film advertisements.<br /><br />the trailers were suggesting that the battalion "have chosen the third way out" other than surrender or die (polish infos were even misguiding that they had the choice between being killed by own artillery or german guns, they even translated the title wrong as "misplaced battalion"). this have tickled the right spot and i bought the movie.<br /><br />the disappointment started when i realized that the third way is to just sit down and count dead bodies followed by sitting down and counting dead bodies... then i began to think "hey, this story can\'t be that simple... i bet this clever officer will find some cunning way to save what left of his troops". well, he didn\'t, they were just sitting and waiting for something to happen. and so was i.<br /><br />the story was based on real events of world war i, so the writers coul

In [20]:
for index, similarity in sims:
    print(f"This review has a similarity of {similarity} with our query:")
    print(reviews_train[index][:1000])
    print("\n*************************************************************\n")

This review has a similarity of 1.0 with our query:
After watching this movie I was honestly disappointed - not because of the actors, story or directing - I was disappointed by this film advertisements.<br /><br />The trailers were suggesting that the battalion "have chosen the third way out" other than surrender or die (Polish infos were even misguiding that they had the choice between being killed by own artillery or German guns, they even translated the title wrong as "misplaced battalion"). This have tickled the right spot and I bought the movie.<br /><br />The disappointment started when I realized that the third way is to just sit down and count dead bodies followed by sitting down and counting dead bodies... Then I began to think "hey, this story can't be that simple... I bet this clever officer will find some cunning way to save what left of his troops". Well, he didn't, they were just sitting and waiting for something to happen. And so was I.<br /><br />The story was based on

# Task 2: Supervised Machine Learning

## A classical model

In [7]:
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(reviews_train)
X_test = vectorizer.transform(reviews_test)

logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.85      0.87      0.86     12500
         pos       0.87      0.85      0.86     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000



### Let's discuss

- What happened here under the hood?
- How many features do we have?
- What does X_train look like?

**write your conclusions here**
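
One way to approach these questions is to run the same vectorizer on a few toy sentences and inspect the result. The sketch below uses three made-up reviews (not the workshop data). The same inspection steps should also work on the real `X_train`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up mini reviews (illustration only)
toy_reviews = [
    "a great movie with a great cast",
    "a boring movie",
    "great acting, boring plot",
]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_reviews)

# The result is a sparse document-term matrix: one row per document,
# one column per unique token in the training texts.
print(type(X_toy))
print(X_toy.shape)  # (3, 7)

# Column order follows the fitted vocabulary (token -> column index).
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab, key=vocab.get))

# For a small matrix we can afford to look at the dense version.
print(X_toy.toarray())
```

Note that the default tokenizer drops one-character tokens such as "a", so the number of features is not simply the number of distinct whitespace-separated strings.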

Let's rewrite this as a pipeline (for easier use) and use a TF-IDF vectorizer instead. For a classical bag-of-words approach, this is probably about as good as it gets.

In [61]:
traditionalpipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('logreg',LogisticRegression(solver='liblinear'))])

traditionalpipe.fit(reviews_train, y_train)
y_pred = traditionalpipe.predict(reviews_test)

print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.88      0.88      0.88     12500
         pos       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



**It's not the topic of today, but once we have such a pipeline, we can use a so-called gridsearch to find the optimal settings. For more info, see https://github.com/damian0604/bdaca/blob/master/12ec/week10/lecture10.pdf**
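
To give a rough idea of what such a grid search looks like, here is a minimal sketch on a tiny made-up dataset. The texts, labels, and the particular parameter values are assumptions chosen for illustration, not settings recommended for the review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up dataset (illustration only)
toy_texts = ["great film", "wonderful movie", "loved it",
             "brilliant acting", "superb great cast", "wonderful wonderful story",
             "terrible film", "awful movie", "hated it",
             "boring acting", "awful terrible cast", "boring boring story"]
toy_labels = ["pos"] * 6 + ["neg"] * 6

toy_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("logreg", LogisticRegression(solver="liblinear"))])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.1, 1.0, 10.0],
}

# Try all 2 x 3 combinations with 3-fold cross-validation.
search = GridSearchCV(toy_pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On real data, the same pattern applies with `traditionalpipe`, `reviews_train`, and `y_train`, at a correspondingly higher computational cost.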

## Let's use embeddings as input instead

In [62]:
# MAKE SURE THAT YOU KNOW WHICH MODEL YOU ARE WORKING ON - can use either self-trained or pre-trained model

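# we need to convert `wv` to a slightly different format: a dict that maps each word to its vector
# (note: in gensim >= 4.0, `wv.index2word` was renamed to `wv.index_to_key`)
w2vmodel = dict(zip(wv.index2word, wv.vectors))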

In [63]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingTfidfVectorizer(w2vmodel, operator='mean')),
                    ('svm', 
                     SGDClassifier(loss='hinge', penalty='l2', tol=1e-4, alpha=1e-6, max_iter=1000, random_state=42))])

# Fit the pipeline (embedding vectorizer + SVM) on the training data and predict
mypipe.fit(reviews_train, y_train)
y_pred = mypipe.predict(reviews_test)

print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.90      0.73      0.80     12500
         pos       0.77      0.92      0.84     12500

    accuracy                           0.82     25000
   macro avg       0.83      0.82      0.82     25000
weighted avg       0.83      0.82      0.82     25000



In [64]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingTfidfVectorizer(w2vmodel, operator='mean')),
                    ('logreg', LogisticRegression(solver='liblinear'))])

# Fit the pipeline (embedding vectorizer + logistic regression) on the training data and predict
mypipe.fit(reviews_train, y_train)
y_pred = mypipe.predict(reviews_test)

print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.83      0.83      0.83     12500
         pos       0.83      0.83      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000



In [65]:
mypipe = Pipeline([('vectorizer', embeddingvectorizer.EmbeddingCountVectorizer(w2vmodel, operator='mean')),
                    ('logreg', LogisticRegression(solver='liblinear'))])

# Fit the pipeline (embedding vectorizer + logistic regression) on the training data and predict
mypipe.fit(reviews_train, y_train)
y_pred = mypipe.predict(reviews_test)

print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.83      0.83      0.83     12500
         pos       0.83      0.83      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000



### Let's discuss

- What happened here under the hood?
- How many features do we have?
- What does X_train look like?

**write your conclusions here**
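
As a starting point for the discussion, the sketch below shows the general idea behind averaging word vectors into one document vector. It uses a hypothetical three-dimensional toy model, and the helper `mean_embedding` is made up for illustration. The `embeddingvectorizer` package may differ in details such as count or TF-IDF weighting and the handling of unknown words.

```python
import numpy as np

# Hypothetical toy "embedding model": word -> 3-dimensional vector
toy_w2v = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.1]),
    "boring": np.array([-0.8, 0.2, 0.1]),
}

def mean_embedding(text, w2v, dim):
    """Average the vectors of all known words in a text.
    Words missing from the model are skipped; a text without
    any known word is mapped to the zero vector."""
    tokens = text.lower().split()
    vectors = [w2v[token] for token in tokens if token in w2v]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

toy_docs = ["great movie", "boring movie", "unknown words only"]
X_emb = np.vstack([mean_embedding(doc, toy_w2v, dim=3) for doc in toy_docs])

# One row per document, one column per embedding dimension:
# the number of features no longer depends on the vocabulary size.
print(X_emb.shape)  # (3, 3)
print(X_emb)
```

Compared to the classical document-term matrix, the result is dense and has only as many columns as the embedding model has dimensions.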

## There's a reason why the classical approach worked so well that the embedding approach couldn't add anything.

- Can you see what it is?

In [73]:
reviews_train_short = reviews_train[:100]
reviews_test_short = reviews_test[:100] 
y_train_short = y_train[:100] 
y_test_short = y_test[:100] 

In [74]:
# traditional, short dataset
traditionalpipe.fit(reviews_train_short, y_train_short )
y_pred_short = traditionalpipe.predict(reviews_test_short)

print(metrics.classification_report(y_test_short, y_pred_short))

              precision    recall  f1-score   support

         neg       0.85      0.47      0.60        47
         pos       0.66      0.92      0.77        53

    accuracy                           0.71       100
   macro avg       0.75      0.70      0.69       100
weighted avg       0.75      0.71      0.69       100



In [75]:
# with embeddings, short dataset
mypipe.fit(reviews_train_short, y_train_short )
y_pred_short = mypipe.predict(reviews_test_short)

print(metrics.classification_report(y_test_short, y_pred_short))

              precision    recall  f1-score   support

         neg       0.75      0.70      0.73        47
         pos       0.75      0.79      0.77        53

    accuracy                           0.75       100
   macro avg       0.75      0.75      0.75       100
weighted avg       0.75      0.75      0.75       100



**Write your conclusions here. Why do you think the relative advantage reverses with the smaller dataset?**