In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.dpi"] = 300
np.set_printoptions(precision=3, suppress=True)
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import scale, StandardScaler

In [3]:
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/")
# load_files returns a Bunch holding the review texts (as bytes) and their labels
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[1]:\n{}".format(text_train[1].decode()))
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

type of text_train: <class 'list'>
length of text_train: 25000
text_train[1]:
Words can't describe how bad this movie is. I can't explain it by writing only. You have too see it for yourself to get at grip of how horrible a movie really can be. Not that I recommend you to do that. There are so many clichés, mistakes (and all other negative things you can imagine) here that will just make you cry. To start with the technical first, there are a LOT of mistakes regarding the airplane. I won't list them here, but just mention the coloring of the plane. They didn't even manage to show an airliner in the colors of a fictional airline, but instead used a 747 painted in the original Boeing livery. Very bad. The plot is stupid and has been done many times before, only much, much better. There are so many ridiculous moments here that i lost count of it really early. Also, I was on the bad guys' side all the time in the movie, because the good guys were so stupid. "Executive Decision" should with

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

text_train_sub, text_val, y_train_sub, y_val = train_test_split(
    text_train, y_train, stratify=y_train, random_state=0)
vect = CountVectorizer(min_df=2)
X_train = vect.fit_transform(text_train_sub)
X_val = vect.transform(text_val)
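
To make the effect of `min_df=2` concrete, here is a minimal sketch on three made-up sentences (the `toy_*` names and texts are illustrative, not taken from the IMDb data). Terms that occur in fewer than two documents are dropped from the vocabulary, and the default tokenizer also discards one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only to illustrate min_df
toy_docs = [
    "a great movie",
    "a terrible movie",
    "great acting and a great plot",
]

# Keep only terms that appear in at least two documents
toy_vect = CountVectorizer(min_df=2)
toy_X = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(toy_X.toarray())
```

Only "great" and "movie" survive here; each row of the matrix counts how often those two terms occur in the corresponding document.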

In [4]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=.1).fit(X_train, y_train_sub)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [5]:
lr.score(X_val, y_val)

0.88
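
The solver warning above suggests raising `max_iter`. A minimal sketch on synthetic count-like data (the `X_toy` and `y_toy` arrays are invented for illustration; the real bag-of-words matrix may need a very different number of iterations) shows how to pass the parameter and inspect `n_iter_` to check that the solver stopped before reaching the limit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical count-like features standing in for a bag-of-words matrix
rng = np.random.RandomState(0)
X_toy = rng.poisson(1.0, size=(200, 20)).astype(float)
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

# Give the solver a larger iteration budget than the default
max_iter = 1000
clf = LogisticRegression(C=0.1, max_iter=max_iter).fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_, clf.score(X_toy, y_toy))
```

If `n_iter_` equals `max_iter`, the fit was cut off and the warning would still apply.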

# spaCy
python -m spacy download en_core_web_lg

In [6]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [7]:
doc = nlp("What is my purpose?")

In [8]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

What what NOUN WP attr Xxxx True False
is be VERB VBZ ROOT xx True False
my -PRON- ADJ PRP$ poss xx True False
purpose purpose NOUN NN nsubj xxxx True False
? ? PUNCT . punct ? False False


In [9]:
for token in doc:
    print(token.text, token.has_vector, token.vector.shape)

What True (300,)
is True (300,)
my True (300,)
purpose True (300,)
? True (300,)


In [10]:
nlp.vocab.vectors.shape

(684831, 300)

In [11]:
token = nlp('movies')

# Word2Vec with spacy

In [12]:
queries = [w for w in nlp.vocab if w.is_lower and w.prob >= -15]

def most_similar(word, count=10):
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [w.orth_ for w in by_similarity[:count]]

In [13]:
most_similar(nlp("movie"))

['movie',
 'movies',
 'film',
 'films',
 'flick',
 'starring',
 'soundtrack',
 'trailer',
 'cinema',
 'remake']

In [14]:
most_similar(nlp("good"), count=15)

['good',
 'great',
 'better',
 'very',
 'nice',
 'really',
 'excellent',
 'decent',
 'well',
 'but',
 'much',
 'too',
 'bad',
 'enough',
 'kind']

In [15]:
most_similar(nlp("cute dog"))

['cute',
 'dog',
 'adorable',
 'puppy',
 'cat',
 'kitty',
 'dogs',
 'kitten',
 'bunny',
 'pet']

In [16]:
doc = nlp("cute dog")

In [17]:
doc.vector.shape

(300,)

In [18]:
doc.vector == (nlp("cute").vector + nlp("dog").vector) / 2

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,
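
The element-wise comparison above prints one boolean per dimension. When only a yes/no answer is needed, `np.allclose` collapses the check into a single value and tolerates floating-point rounding. The sketch below uses two made-up 3-dimensional vectors in place of the 300-dimensional spaCy vectors.

```python
import numpy as np

# Hypothetical token vectors standing in for nlp("cute").vector and nlp("dog").vector
cute = np.array([0.2, 0.4, -0.1], dtype=np.float32)
dog = np.array([0.6, 0.0, 0.3], dtype=np.float32)

# A document vector built as the mean of its token vectors
doc_vector = np.mean([cute, dog], axis=0)

# Single boolean instead of one entry per dimension
same = np.allclose(doc_vector, (cute + dog) / 2)
print(same)
```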

In [21]:
print(text_train_sub[0].decode())

Maybe it's just because I have an intense fear of hospitals and medical stuff, but this one got under my skin (pardon the pun). This piece is brave, not afraid to go over the top and as satisfying as they come in terms of revenge movies. Not only did I find myself feeling lots of hatred for the screwer and lots of sympathy towards the "screwee", I felt myself cringe and feel pangs of disgust at certain junctures which is really a rare and delightful thing for a somewhat jaded horror viewer like myself. Some parts are very reminiscant of "Hellraiser", but come off as tribute rather than imitation. It's a heavy handed piece that does not offer the viewer much to consider, but I enjoy being assaulted by a film once and awhile. This piece brings it and doesn't appologize. I liked this one a lot. Do NOT watch whilst eating pudding.


In [22]:
nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

In [23]:
docs_train = [nlp(d.decode()).vector for d in text_train_sub]

In [24]:
X_train = np.vstack(docs_train)

In [25]:
X_train.shape

(18750, 300)

In [26]:
docs_val = [nlp(d.decode()).vector for d in text_val]
X_val = np.vstack(docs_val)

In [37]:

lr_w2v = LogisticRegression().fit(X_train, y_train_sub)
lr_w2v.score(X_train, y_train_sub)

0.85824

In [38]:
lr_w2v.score(X_val, y_val)

0.84768

# Semantic Arithmetic

In [None]:
nlp.vocab

In [69]:
queries = [w for w in nlp.vocab if w.is_lower and w.prob >= -15]

def most_similar(word, count=10):
    token = nlp(word)
    by_similarity = sorted(queries, key=lambda w: w.similarity(token), reverse=True)
    return [w.orth_ for w in by_similarity[:count]]

In [101]:
from sklearn.metrics.pairwise import cosine_similarity

queries = [w for w in nlp.vocab if w.is_lower and w.prob >= -15]

def cos_sim(a, b):
    # cosine_similarity returns a (1, 1) array; index it to get a plain scalar sort key
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

def most_similar_vec(vec, count=10):
    by_similarity = sorted(queries, key=lambda w: cos_sim(w.vector, vec), reverse=True)
    return [w.orth_ for w in by_similarity[:count]]

vec = nlp('woman').vector + nlp('king').vector - nlp("man").vector
most_similar_vec(vec)

['king',
 'queen',
 'prince',
 'kings',
 'princess',
 'royal',
 'throne',
 'queens',
 'monarch',
 'kingdom']

In [72]:
most_similar("woman")

['woman',
 'lady',
 'girl',
 'man',
 'women',
 'mother',
 'female',
 'she',
 'wife',
 'pregnant']

In [93]:
most_similar_vec(nlp("woman").vector)

['woman',
 'lady',
 'girl',
 'man',
 'women',
 'mother',
 'female',
 'she',
 'wife',
 'pregnant']

In [94]:
vec = nlp('woman').vector - nlp("man").vector + nlp('king').vector 

In [95]:
most_similar_vec(vec)

['king',
 'queen',
 'prince',
 'kings',
 'princess',
 'royal',
 'throne',
 'queens',
 'monarch',
 'kingdom']
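
In the result above, 'king' itself ranks first, most likely because the combined vector stays close to one of its inputs and nothing removes the query words from the candidates. A common remedy is to exclude the input words before ranking. The sketch below shows the idea on hand-made 3-dimensional vectors; `toy_vectors` and the `analogy` helper are hypothetical and not part of spaCy.

```python
import numpy as np

# Hypothetical toy embeddings; the axes loosely mean (royalty, male, female)
toy_vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "prince": np.array([0.9, 1.0, 0.0]),
    "princess": np.array([0.9, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
}


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors, count=1):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:count]


top = analogy(["woman", "king"], ["man"], toy_vectors, count=2)
print(top)
```

The same filtering could be applied to `most_similar_vec` by dropping the query words from `queries` before sorting.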

In [96]:
vec = nlp('woman').vector - nlp("man").vector + nlp('he').vector 
most_similar_vec(vec)

['she',
 'woman',
 'he',
 'her',
 'herself',
 'mother',
 'wife',
 'who',
 'told',
 'when']

In [110]:
vec = nlp('paris').vector - nlp('berlin').vector + nlp('germany').vector 

most_similar_vec(vec)

['france',
 'paris',
 'europe',
 'germany',
 'italy',
 'spain',
 'japan',
 'european',
 'poland',
 'usa']

# Doc2Vec with gensim
Also see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

In [113]:
import gensim
def read_corpus(text, tokens_only=False):
    for i, line in enumerate(text):
        if tokens_only:
            yield gensim.utils.simple_preprocess(line)
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])



In [112]:
train_corpus = list(read_corpus(text_train_sub))
test_corpus = list(read_corpus(text_val, tokens_only=True))


In [51]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2)
model.build_vocab(train_corpus)

In [57]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=55)

In [58]:
import pickle
with open("doc2vec_50.pickle", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

In [56]:
import pickle
#with open("doc2vec.pickle", "rb") as f:
#    model = pickle.load(f)

AttributeError: Can't get attribute 'DocvecsArray' on <module 'gensim.models.doc2vec' from '/home/andy/anaconda3/lib/python3.6/site-packages/gensim/models/doc2vec.py'>

In [72]:
model.wv.most_similar("movie")

[('film', 0.9482318758964539),
 ('flick', 0.822228193283081),
 ('series', 0.715380072593689),
 ('programme', 0.7032747268676758),
 ('sequel', 0.6939107179641724),
 ('story', 0.6771408319473267),
 ('show', 0.6559576392173767),
 ('documentary', 0.6537493467330933),
 ('picture', 0.6427854299545288),
 ('thriller', 0.6300673484802246)]

In [59]:
# Infer one vector per training document from its token list
vectors = [model.infer_vector(doc.words) for doc in train_corpus]

In [60]:
X_train = np.vstack(vectors)

In [61]:
X_train.shape

(18750, 50)

In [62]:
# test_corpus holds plain token lists (tokens_only=True), so no .words attribute is needed
test_vectors = [model.infer_vector(words) for words in test_corpus]

In [75]:
text_val[1]

b'I enjoyed this film. It was funny, cute, silly, and entertaining. Had a fine cast and really got hammered by some critics for reasons that I truly don\'t understand. No, it wasn\'t "The Grapes of Wrath" or "Casablanca" or even "Moonstruck", but it was an enjoyable film.  Julia was excellent playing the psychotic \'man behind the man\'. The story is a little silly to be sure, but it this isn\'t high drama, folks. I happened to see a review of the film, probably the only good one it got and then ran into it one night when looking for a movie. I never heard it was supposed to stink until after I saw it, and I\'m glad I saw it. Eventually bought the VHS tape on the bargain pile, and I watch it a couple times a year.'

In [63]:
X_val = np.vstack(test_vectors)

In [81]:
from sklearn.metrics.pairwise import cosine_similarity
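# cosine_similarity returns similarities (larger = closer), so despite its
# name `dist` is largest for the training document nearest to text_val[1]
dist = cosine_similarity(X_train, X_val[1:2])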

In [82]:
np.argmax(dist)

8026

In [83]:
text_train_sub[np.argmax(dist)]
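
The nearest-neighbour lookup above relies on cosine similarity, which compares directions and ignores vector length. A minimal NumPy sketch on made-up 2-dimensional vectors (the `toy_train` and `toy_query` arrays and the helper name are illustrative) reproduces the same argmax logic:

```python
import numpy as np


def cosine_sim_to_query(X, query):
    """Cosine similarity between every row of X and a single query vector."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    return X_norm @ q_norm


# Hypothetical document vectors: three "training" rows and one query
toy_train = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = np.array([0.1, 1.0])

sims = cosine_sim_to_query(toy_train, toy_query)

# The largest similarity marks the nearest training row
nearest = int(np.argmax(sims))
print(nearest, sims)
```

Row 1 wins although row 2 has the larger norm, since only the angle to the query matters.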



In [64]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=100).fit(X_train, y_train_sub)

In [65]:
lr.score(X_train, y_train_sub)

0.82826666666666671

In [66]:
lr.score(X_val, y_val)

0.81567999999999996

In [67]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, max_depth=8).fit(X_train, y_train_sub)
rf.score(X_train, y_train_sub)

0.86885333333333337

# Hugging Face transformers

In [1]:
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

In [7]:
text_val[0]

b'Did anyone stop to realise what sort of movie they were producing here ? Now let`s a former marine officer becomes assinged to a group of kids at a cadet school so this should be a family comedy right ? Wrong . This is just a gross comedy aimed at teenagers with many bad taste moments .It might have been watchable in an extremely dumb way at this point but I found Damon Wayans voice to be irritating beyond belief . Does he speak like that in real life ? If he does then he has my sympathy but he won`t be getting any of my money from watching his movies'

In [None]:
res = [nlp(t.decode()) for t in text_val]
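
Each call to the sentiment pipeline returns a list with one dict per input, holding a `label` and a `score`, so `res` is a list of one-element lists. A rough sketch of turning such results into 0/1 predictions and an accuracy follows. `sample_res` and `sample_y` are hypothetical stand-ins (no model is run here), and mapping POSITIVE to 1 assumes the default model's label names and that `load_files` encoded the `pos` folder as 1.

```python
import numpy as np

# Hypothetical pipeline outputs, one single-element list per document
sample_res = [
    [{"label": "NEGATIVE", "score": 0.99}],
    [{"label": "POSITIVE", "score": 0.97}],
    [{"label": "POSITIVE", "score": 0.61}],
]
# Hypothetical ground-truth labels (0 = negative, 1 = positive)
sample_y = np.array([0, 1, 0])


def to_labels(results):
    """Map each document's top prediction to 1 (POSITIVE) or 0 (anything else)."""
    return np.array([int(r[0]["label"] == "POSITIVE") for r in results])


pred = to_labels(sample_res)
accuracy = float(np.mean(pred == sample_y))
print(pred, accuracy)
```

Applied to the real `res` and `y_val`, the same two lines would give a validation accuracy comparable to the `lr.score(X_val, y_val)` values earlier in the notebook.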