# Film Script Analyzer

## Gensim (word2vec)

Word vectors have gained much attention for their capacity to translate words into vectors in an unsupervised way, based on the contexts in which the words appear. The vectors can then be used to find similarities between words: words that lie close together in the vector space tend to be similar. Words can also be added or subtracted to arrive at new words.

A classic example is King - man + woman = Queen.

Another use, in the context of film scripts, may be to find similar characters in different movies, or to trace threads in stories as numeric successions.

Several methods exist to create word vectors, notably CBOW and skip-gram, and many implementations are available. Gensim's word2vec is used here; recent versions are substantially faster and deliver outstanding results.
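To make the difference between the two methods concrete, the following is a minimal sketch of how training examples could be drawn from a tokenized line. The function names `skipgram_pairs` and `cbow_examples` and the `window` parameter are illustrative assumptions, not part of Gensim; real implementations add subsampling, dynamic windows and negative sampling on top of this.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center word."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window=2):
    """Yield (context_words, center) examples: CBOW predicts the center word from its context."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center


tokens = "the dark knight returns".split()
pairs = list(skipgram_pairs(tokens, window=1))
examples = list(cbow_examples(tokens, window=1))
print(pairs)
print(examples)
```

Skip-gram produces one example per (word, neighbour) pair, while CBOW produces one example per word with all its neighbours grouped together.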

Word vectors can also be used in word prediction and phrase generation by taking into account not just which words came before, but what kind of words they were.

We will train a model on all the available scripts, starting by loading the index of the scripts in the dataset. The intent is to use skip-gram; note that Gensim's `Word2Vec` trains CBOW by default and only uses skip-gram when `sg=1` is passed.

In [1]:
import pandas as pd 
import string

df = pd.read_csv('data/dfreps0.csv',index_col=0)

Build the model with Gensim, now training on one line at a time instead of on the whole text at once.

The training texts, all the scripts in our dataset, are delivered sequentially.

In [2]:
from gensim.models import Word2Vec, Phrases
from nltk.corpus   import stopwords
import logging
import os
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

class trainer(object):
    """Stream tokenized lines from every script file listed in the index."""
    def __init__(self,dirname,index):
        self.dirname = dirname
        self.index   = index
        self.cache   = stopwords.words('english')                   # only used by the optional stopword filter
        self.table   = str.maketrans('','',string.punctuation)      # Python 3 punctuation-removal table

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            if fname in self.index:
                # 'with' closes each script file once it has been read
                with open(os.path.join(self.dirname, fname)) as script:
                    for line in script:
                        words = line.translate(self.table).split()  # Remove punctuation
                        # Optional: remove stopwords
                        #words = [w for w in words if w.lower() not in self.cache]
                        yield words
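The corpus is wrapped in a class with `__iter__` rather than a plain generator because Word2Vec scans the corpus more than once: one pass to build the vocabulary and then one pass per training iteration. The sketch below illustrates the difference on a throwaway file; `LineStream` and `line_generator` are hypothetical names used only for this demonstration.

```python
import os
import tempfile


class LineStream(object):
    """Hypothetical helper: re-opens the file on every pass, so it can be iterated many times."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                yield line.split()


def line_generator(path):
    """A plain generator: it is exhausted after a single pass."""
    with open(path) as handle:
        for line in handle:
            yield line.split()


# Write a tiny throwaway corpus to demonstrate the difference.
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write("I am Batman\nWhy so serious\n")
tmp.close()

stream = LineStream(tmp.name)
first_pass = list(stream)
second_pass = list(stream)      # same content again

gen = line_generator(tmp.name)
gen_first = list(gen)
gen_second = list(gen)          # empty: the generator is already consumed

os.remove(tmp.name)
print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```

Passing a one-shot generator to the model would leave nothing to read after the vocabulary pass, which is why the restartable form is the safer choice.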



In [3]:
sentences = trainer('scripts/',df.index)
model     = Word2Vec(sentences,size=200,min_count=5,iter=10,workers=2)
#model.train(more_sentences)

Save the model so it can be loaded later and trained with more sentences if needed. Here we will only be querying it, so unneeded model memory can be trimmed once training is done.

In [4]:
model.save('data/gensim_punct_5_100')

#model = Word2Vec.load('data/gensim_punct_5_10')

Experiment with the uses of the model. First check similarity between words.

In [5]:
model.similarity('Batman','Superman')

0.64670523650250278
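The similarity score is the cosine of the angle between the two word vectors. As a minimal sketch, it can be computed by hand as below; the three-dimensional vectors are made-up toy values, not taken from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values for illustration only)
batman = [0.9, 0.8, 0.1]
superman = [0.8, 0.9, 0.3]
joker = [-0.7, 0.2, 0.9]

print(cosine(batman, superman))
print(cosine(batman, joker))
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.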

Then, check which word doesn't match.

In [6]:
model.doesnt_match("Batman Robin Superman Aquaman".split())

'Aquaman'
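One way to find the odd word out, and as far as I know the approach Gensim takes, is to average the length-normalized vectors of all the words and return the word least similar to that average. The sketch below follows that idea; `odd_one_out` and the toy vectors are assumptions made for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean of all unit vectors."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)])
    return min(normed, key=lambda word: sum(a * b for a, b in zip(normed[word], mean)))


# Toy vectors (assumed values for illustration only)
toy = {
    'Batman':   [0.90, 0.80, 0.10],
    'Robin':    [0.80, 0.70, 0.20],
    'Superman': [0.85, 0.90, 0.15],
    'Aquaman':  [0.10, 0.30, 0.95],
}
print(odd_one_out(toy))
```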

Vector addition and subtraction: find which word is closest to a sum and/or difference of words.

Generating word vectors from movie scripts yields intriguing results.

Batman - man = X

The word 'man' has several semantic interpretations: it can be a sign
of age or seniority (Nightwing, Robin), of gender (Leia), of moral compass
(Monster) or of humanity altogether (Cyborg, Hal, Yoda). All of these resemble
characteristics of the Dark Knight, but lack what we subtracted from 'Batman'.

In [7]:
model.most_similar(positive=['Batman'],negative=['man'],topn=10)

[('Robin', 0.4148382544517517),
 ('KnoxJohnston', 0.4054247736930847),
 ('Nightwing', 0.4025588035583496),
 ('Cyborg', 0.3848974108695984),
 ('Starfire', 0.3841334581375122),
 ('yoda', 0.3667193353176117),
 ('Superman', 0.33921581506729126),
 ('Spoony', 0.3350698947906494),
 ('Batmans', 0.3302323818206787),
 ('WHOOSHING', 0.32674193382263184)]
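A query with `positive` and `negative` words can be thought of as building a single query vector (positive words added, negative words subtracted) and ranking every other word by cosine similarity to it. The sketch below is a simplified stand-in for that idea, not Gensim's code; the `most_similar` function and the toy vectors (dimensions loosely meaning royalty, male, female) are assumptions.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)
    scored = [(word, sum(a * b for a, b in zip(unit(vec), query)))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (assumed values for illustration only)
toy = {
    'king':   [0.9, 0.9, 0.1],
    'queen':  [0.9, 0.1, 0.9],
    'man':    [0.1, 0.9, 0.1],
    'woman':  [0.1, 0.1, 0.9],
    'prince': [0.8, 0.9, 0.2],
    'apple':  [0.1, 0.2, 0.2],
}
print(most_similar(toy, positive=['king', 'woman'], negative=['man'], topn=2))
```

Excluding the query words matters: without it, the nearest neighbour of the query vector is often one of the input words themselves.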

And just accessing the vector for a word.

In [8]:
model['King'][:5]

array([ 2.5261097 ,  0.46229416,  0.7212823 ,  0.45801175, -0.46808067], dtype=float32)

Test the accuracy of the model against Google's analogy test set, questions-words.txt.

In [None]:
#acc = model.accuracy('trunk/questions-words.txt')
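For orientation, the analogy file is plain text: lines starting with `:` name a section, and every other line holds four words `a b c d`, read as "a is to b as c is to d". The sketch below parses lines in that style and scores a predictor; `parse_analogies`, `analogy_accuracy` and the lookup-table predictor are hypothetical stand-ins, and the evaluation built into Gensim applies extra rules (such as vocabulary restrictions) that are not reproduced here.

```python
def parse_analogies(lines):
    """Yield (section, a, b, c, expected) from lines in the questions-words style."""
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            section = line[1:].strip()
        else:
            a, b, c, expected = line.split()
            yield section, a, b, c, expected


def analogy_accuracy(questions, predict):
    """Fraction of questions where predict(a, b, c) returns the expected fourth word."""
    total = correct = 0
    for _, a, b, c, expected in questions:
        total += 1
        if predict(a, b, c) == expected:
            correct += 1
    return correct / float(total) if total else 0.0


sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    ": family",
    "boy girl brother sister",
]

# Stand-in predictor: a lookup table instead of a trained model (one answer is deliberately wrong).
lookup = {('Athens', 'Greece', 'Baghdad'): 'Iraq', ('boy', 'girl', 'brother'): 'aunt'}
predict = lambda a, b, c: lookup.get((a, b, c))

questions = list(parse_analogies(sample))
print(analogy_accuracy(questions, predict))
```

With a trained model, `predict` would typically wrap a query of the form positive=[b, c], negative=[a] and return the top-ranked word.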

Create bigrams, phrases of two words like New York and San Francisco.

In [None]:
bigram_transformer = Phrases(sentences)
model              = Word2Vec(bigram_transformer[sentences],size=100,min_count=10,workers=2)
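Bigram detection of this kind is usually based on co-occurrence counts: a pair is promoted to a phrase when the two words appear together more often than their individual frequencies would suggest. The sketch below uses the scoring formula from the original word2vec phrase paper, score = (count(a b) - min_count) * vocabulary_size / (count(a) * count(b)), which to my knowledge is also the default scorer in Gensim's `Phrases`. The function `find_bigrams`, the toy corpus and the low `min_count` and `threshold` values are assumptions chosen so that the tiny example produces a result.

```python
from collections import Counter


def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Return {(a, b): score} for adjacent word pairs whose score exceeds the threshold."""
    unigrams = Counter(word for sentence in sentences for word in sentence)
    bigrams = Counter(pair for sentence in sentences for pair in zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / float(unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores


# Toy corpus (assumed for illustration only)
corpus = [
    "welcome to new york".split(),
    "new york never sleeps".split(),
    "i love new york".split(),
    "a new day in gotham".split(),
    "to gotham and back".split(),
]

found = find_bigrams(corpus, min_count=1, threshold=1.0)
print(found)
```

On a real corpus the thresholds need to be much stricter than in this toy setting, otherwise many accidental pairs would be merged.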

If done with training, trim unused model memory.

In [None]:
model.init_sims(replace=True)

It is important to find the best way to generate the vectors within the script space. An easy option would be to download a larger training corpus, but the interest here is in the words within the script corpus.

Next section: [Pending]()