# <center>Movie genre prediction using an unsupervised approach</center>

We will try to predict the genre of a plot synopsis using unsupervised methods. This part starts with the LDA (Latent Dirichlet Allocation) model.

In [1]:
#Import libraries

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
from nltk.corpus import stopwords

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

# Warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
#Load the dataframe
df = pd.read_csv('orig_movies.csv')

In [3]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,imdb_id,title,original_title,year,date_published,genre,inf_genre,duration,country,language,director,actors,description,plot_synopsis,avg_vote,votes
0,0,tt0035423,Kate & Leopold,Kate & Leopold,2001,2002-04-05,"Comedy, Fantasy, Romance",comedy,118,USA,"English, French",James Mangold,"Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...,"On 28 April 1876, Leopold, His Grace the 3rd D...",6.4,75298
1,1,tt0073537,Double Exposure,Double Exposure,1982,1983-11-09,"Comedy, Crime, Drama",comedy,100,USA,English,William Byron Hillman,"Michael Callan, Joanna Pettet, James Stacy, Pa...",A photographer for a men's magazine is disturb...,"The Putnams, Roger ('Ian Buchanan') and Maria ...",4.9,535
2,2,tt0076709,Si wang ta,Si wang ta,1981,1981-03-21,"Action, Crime, Mystery",action,96,Hong Kong,Cantonese,"See-Yuen Ng, Sammo Kam-Bo Hung","Bruce Lee, Tae-jeong Kim, Jung-Lee Hwang, Roy ...",After Billy Lo is killed while seeking the mur...,"After a recent amount of challenges, Billy Lo ...",5.2,2670
3,3,tt0078349,Sekai meisaku dôwa: Hakuchô no mizûmi,Sekai meisaku dôwa: Hakuchô no mizûmi,1981,1981-03-14,"Animation, Adventure, Family",comedy,75,Japan,Japanese,Kimio Yabuki,"Keiko Takeshita, Tarô Shigaki, Asao Koike, Yôk...",A prince falls in love with a princess cursed ...,Below is a synopsis based on the 1895 libretto...,7.8,667
4,4,tt0078749,Alien 2 - Sulla Terra,Alien 2 - Sulla Terra,1980,1980-04-11,"Adventure, Horror, Sci-Fi",horror,92,Italy,"English, Italian","Ciro Ippolito, Biagio Proietti","Belinda Mayne, Mark Bodin, Roberto Barrese, Be...",A spaceship lands back on Earth after a failed...,The commercial spacecraft Nostromo is on a ret...,3.7,1104
5,5,tt0078806,The Attic,The Attic,1980,1980-10-01,"Drama, Thriller, Horror",drama,101,USA,English,"George Edwards, Gary Graver","Carrie Snodgress, Ray Milland, Ruth Cox, Rosem...",A librarian devotes her life to caring for her...,Emma (Moss) has a strong aversion towards her ...,5.7,583
6,6,tt0078880,Bloodrage,Bloodrage,1980,1981-07-12,"Crime, Horror, Thriller",crime,78,USA,English,Joseph Zito,"James Johnson, Judith-Marie Bergan, Jerry McGe...",A sexually frustrated young man kills hookers.,"A young man named Richard visits Beverly, a lo...",4.9,238
7,7,tt0078935,Cannibal Holocaust,Cannibal Holocaust,1980,1980-02-07,"Adventure, Horror",horror,95,Italy,"English, Spanish, Italian",Ruggero Deodato,"Robert Kerman, Francesca Ciardi, Perry Pirkane...",During a rescue mission into the Amazon rainfo...,"In New York City, a TV news reporter recounts ...",5.9,47342
8,8,tt0079285,Saturn 3,Saturn 3,1980,1980-06-27,"Adventure, Horror, Sci-Fi",horror,96,UK,English,"Stanley Donen, John Barry","Farrah Fawcett, Kirk Douglas, Harvey Keitel",Two lovers stationed at a remote base in the a...,The film opens at a space station in the vicin...,5.2,7754
9,9,tt0079891,Shao Lin si,Shao Lin si,1982,1982,"Action, Drama",action,95,"China, Hong Kong",Mandarin,Hsin-Yen Chang,"Jet Li, Hai Yu, Chenghui Yu, Lan Ding, Jianqia...","A young man, hounded by a psychopathic general...",The film opens with the chief Shaolin Monks re...,7.0,3726


In [4]:
#Convert the synopses to a list
data = df.plot_synopsis.values.tolist()

#Collapse line breaks and runs of whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

#Remove apostrophes
data = [re.sub("'", "", sent) for sent in data]

#Show an example
print(data[:1])

['On 28 April 1876, Leopold, His Grace the 3rd Duke of Albany (Hugh Jackman), is a stifled dreamer. He has created a design for a primitive elevator, and has built a small model of this device. His strict uncle Millard (Paxton Whitehead) has no patience for what he characterises as a sign of Leopolds disrespect for the Monarchy, chastising him, and telling him he must marry a rich American, as the Mountbatten family finances are depleted. In response to his uncles accusations of his blemishing the family name, Leopold counters that the new nobility is to be found in those who pursue initiatives, hence his interest in the sciences and inventions. One day, the Duke finds Stuart Besser (Liev Schreiber), an amateur physicist (and great-great-grandson of Leopold) in his study perusing his schematic diagrams and taking photographs of them. He had seen him earlier at Roeblings speech about the Brooklyn Bridge, after he was laughing at the word "erection." Leopold follows Stuart and tries to s
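
To see what the two substitutions do, here is a minimal sketch on a made-up sentence. The sample string is an assumption for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample text with a line break, repeated spaces and an apostrophe
sample = "Leopold's uncle\n  has   no patience."

# Collapse any run of whitespace (spaces, tabs, newlines) into one space
collapsed = re.sub(r"\s+", " ", sample)

# Drop apostrophes, so "Leopold's" becomes "Leopolds"
cleaned = re.sub(r"'", "", collapsed)

print(cleaned)
```

Dropping the apostrophe explains why the cleaned synopsis above contains forms such as "Leopolds" and "uncles".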

In [5]:
#Tokenize and clean the text with gensim's simple_preprocess()
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])

[['on', 'april', 'leopold', 'his', 'grace', 'the', 'rd', 'duke', 'of', 'albany', 'hugh', 'jackman', 'is', 'stifled', 'dreamer', 'he', 'has', 'created', 'design', 'for', 'primitive', 'elevator', 'and', 'has', 'built', 'small', 'model', 'of', 'this', 'device', 'his', 'strict', 'uncle', 'millard', 'paxton', 'whitehead', 'has', 'no', 'patience', 'for', 'what', 'he', 'characterises', 'as', 'sign', 'of', 'leopolds', 'disrespect', 'for', 'the', 'monarchy', 'chastising', 'him', 'and', 'telling', 'him', 'he', 'must', 'marry', 'rich', 'american', 'as', 'the', 'mountbatten', 'family', 'finances', 'are', 'depleted', 'in', 'response', 'to', 'his', 'uncles', 'accusations', 'of', 'his', 'blemishing', 'the', 'family', 'name', 'leopold', 'counters', 'that', 'the', 'new', 'nobility', 'is', 'to', 'be', 'found', 'in', 'those', 'who', 'pursue', 'initiatives', 'hence', 'his', 'interest', 'in', 'the', 'sciences', 'and', 'inventions', 'one', 'day', 'the', 'duke', 'finds', 'stuart', 'besser', 'liev', 'schreibe

In [6]:
#Load the English stopwords (this rebinds the imported name 'stopwords' to a plain list)
stopwords = nltk.corpus.stopwords.words('english')

In [7]:
#Load first names to use as additional stopwords
with open("names-first.txt", "r") as f:
    sw_firstnames = [line.strip().lower() for line in f]

In [8]:
#Add the first names to the stopword list
stopwords.extend(sw_firstnames)

In [9]:
#Function to remove stopwords
from gensim.utils import simple_preprocess
def remove_stopwords(texts):
    ''' Removes the stopwords from each tokenized document '''
    stop_set = set(stopwords)  # set membership is much faster than scanning the list
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_set] for doc in texts]

In [10]:
#Apply the stopword removal function
data_words_nonstop = remove_stopwords(data_words)

In [11]:
#Lemmatization function
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'ADV']): #, 'VERB']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# 'en' is the spaCy 2.x shortcut; spaCy 3 needs the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en', disable=['parser', 'ner'])

#Lemmatize, keeping only NOUN, ADJ, ADV  #VERB
data_lemmatized = lemmatization(data_words_nonstop, allowed_postags=['NOUN', 'ADJ', 'ADV']) #, 'VERB'])

print(data_lemmatized[:2])

['design primitive elevator small model device strict uncle characterise sign leopold monarchy family depleted response uncle accusation family name new nobility pursue initiative hence interest science day duke physicist great great grandson study schematic diagram photograph early roebling word erection try unfinished bridge temporal century travel first portal temporal universe inside apartment open week later dog elevator shaft eventually scientific discovery stuart book unintentional time disruption elevator century patent device cynical ambitious stuart ex girlfriend apartment career field woman state librarian demand dog walk overwhelmed roebling still apartment befriend brother actor actor steadfast character romantically tour commercial kate client product disgusting argue integrity bristling countering connection reality realising time together nearly evening contemplation suddenly apartment mental hospital back time afterwards notice photo leopold ball realise back night pro

In [54]:
#Build the document-term matrix
vectorizer = CountVectorizer(analyzer='word',       
                             min_df=15,                        # Keep words that appear in at least 15 documents
                             stop_words='english',             # Remove English stopwords
                             lowercase=True,                   # Convert words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # Tokens of 3 or more alphanumeric characters
                             # max_features=50000,             
                            )

data_vectorized = vectorizer.fit_transform(data_lemmatized)
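
The effect of `token_pattern` and `min_df` can be sketched in plain Python. This is a simplified illustration on a made-up corpus, not the vectorizer's actual implementation: it ignores the `stop_words` option and uses a threshold of 2 documents instead of 15.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["war of the soldier", "a soldier at war", "an ox in war"]

# Same pattern as the vectorizer: alphanumeric runs of 3 or more characters
token_re = re.compile(r"[a-zA-Z0-9]{3,}")
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Keep tokens present in at least `min_df` documents (2 here, 15 in the notebook)
min_df = 2
vocabulary = sorted(tok for tok, n in doc_freq.items() if n >= min_df)
print(vocabulary)
```

Two-letter words such as "of" or "ox" never reach the vocabulary because the pattern requires at least three characters, and "the" is dropped here only because it appears in a single document.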

In [55]:
#Check sparsity

# Materialize the sparse data
data_dense = data_vectorized.todense()

# "Sparsicity" here is the percentage of non-zero cells (i.e. the matrix density)
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsicity:  1.7735373887883519 %
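
The same percentage can be computed from the count of stored non-zero cells, without building the dense matrix. The sketch below uses a toy dictionary-of-keys matrix as a stand-in; with a SciPy sparse matrix the equivalent inputs would presumably be its `nnz` and `shape` attributes.

```python
# Toy sparse matrix stored as {(row, col): count}; only non-zero cells are kept
n_rows, n_cols = 4, 5
nonzero_cells = {(0, 1): 2, (1, 4): 1, (3, 0): 3}

# Share of non-zero cells, as a percentage
density_pct = len(nonzero_cells) / (n_rows * n_cols) * 100
print(density_pct)
```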


In [56]:
#Build the LDA model
lda_model = LatentDirichletAllocation(n_components=20,           # Number of topics
                                      max_iter=50,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # Documents in each learning iteration
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=50,
                          mean_change_tol=0.001, n_components=20, n_jobs=-1,
                          perp_tol=0.1, random_state=100, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)


In [57]:
lda_output = lda_model.fit_transform(data_vectorized)


In [58]:
#Model performance: perplexity and log-likelihood
# Log-likelihood: higher is better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: lower is better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

#Model parameters
pprint(lda_model.get_params())

Log Likelihood:  -13195446.079707857
Perplexity:  2058.0715888394884
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 50,
 'mean_change_tol': 0.001,
 'n_components': 20,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
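
The relation between the two metrics, Perplexity = exp(-log-likelihood per word), can be checked with a few lines of arithmetic. This is a sketch: the helper `perplexity` is hypothetical, and the implied word count assumes that both printed figures come from the same likelihood bound.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood per token)."""
    return math.exp(-log_likelihood / n_tokens)

# Toy check: 10 tokens, each with probability 1/8, give a perplexity of about 8
toy_ll = 10 * math.log(1 / 8)
print(perplexity(toy_ll, 10))

# Word count implied by the two figures printed above (assumption: same bound)
implied_tokens = 13195446.079707857 / math.log(2058.0715888394884)
print(round(implied_tokens))
```

Because perplexity is normalized per word, it is easier to compare across corpora of different sizes than the raw log-likelihood.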


In [59]:
#Grid search over the LDA model
#Define the search parameters
search_params = {'n_components': [10, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

#Initialize the model
lda = LatentDirichletAllocation()
model = GridSearchCV(lda, param_grid=search_params)

#Run the grid search (took about 50 min)
model.fit(data_vectorized)

GridSearchCV(cv=None, error_score=nan,
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='batch',
                                                 learning_offset=10.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=10,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1,
                                                 random_state=None,
                                                 topic_word_prior=None,
                                                 tota

In [60]:
#Inspect the best model and its parameters
#Best model
best_lda_model = model.best_estimator_

#Parameters
print("Parameters: ", model.best_params_)

#Log-likelihood score (mean over the cross-validation folds)
print("Log Likelihood Score: ", model.best_score_)

#Perplexity
print("Perplexity: ", best_lda_model.perplexity(data_vectorized))

Parameters:  {'learning_decay': 0.7, 'n_components': 10}
Log Likelihood Score:  -2759648.0385598135
Perplexity:  2045.1242367519476
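
To get a sense of why the search is slow, the grid can be enumerated by hand. The sketch below only counts candidates; the number of folds is an assumption (the usual default when `cv=None`), not something the notebook states.

```python
from itertools import product

search_params = {'n_components': [10, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Every combination of the listed values is one candidate model
names = list(search_params)
candidates = [dict(zip(names, values)) for values in product(*search_params.values())]

n_folds = 5  # assumption: default number of folds when cv=None
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits plus one final refit")
```

The best score reported by the search is averaged over held-out folds, so it is not directly comparable with the log-likelihood computed earlier on the full matrix.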


In [61]:
#Get the most relevant topic for each document

#Build the document-topic matrix
lda_output = best_lda_model.transform(data_vectorized)

#Column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

#Index names
docnames = ["Doc" + str(i) for i in range(len(data))]

#Build a dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

#Get the dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

#Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

#Apply the styling
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
Doc0,0.0,0.0,0.0,0.0,0.72,0.0,0.09,0.12,0.0,0.07,4
Doc1,0.0,0.0,0.41,0.0,0.52,0.0,0.0,0.05,0.0,0.0,4
Doc2,0.08,0.05,0.31,0.0,0.0,0.0,0.0,0.0,0.5,0.06,8
Doc3,0.0,0.12,0.0,0.0,0.26,0.03,0.0,0.0,0.59,0.0,8
Doc4,0.43,0.0,0.0,0.0,0.0,0.0,0.37,0.15,0.04,0.0,0
Doc5,0.0,0.19,0.37,0.0,0.22,0.0,0.0,0.0,0.21,0.0,2
Doc6,0.0,0.0,0.7,0.29,0.0,0.0,0.0,0.0,0.0,0.0,2
Doc7,0.19,0.0,0.0,0.08,0.0,0.0,0.03,0.0,0.08,0.61,9
Doc8,0.29,0.15,0.0,0.13,0.0,0.0,0.25,0.19,0.0,0.0,0
Doc9,0.0,0.0,0.0,0.0,0.0,0.22,0.0,0.0,0.0,0.77,9
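
The dominant topic is simply the position of the largest weight in each row. A minimal pure-Python sketch using the first three rows of the table above:

```python
# First three rows of the table above (topic weights per document)
doc_topic = {
    "Doc0": [0.0, 0.0, 0.0, 0.0, 0.72, 0.0, 0.09, 0.12, 0.0, 0.07],
    "Doc1": [0.0, 0.0, 0.41, 0.0, 0.52, 0.0, 0.0, 0.05, 0.0, 0.0],
    "Doc2": [0.08, 0.05, 0.31, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.06],
}

# The dominant topic is the index of the largest weight in each row
dominant = {doc: max(range(len(w)), key=w.__getitem__) for doc, w in doc_topic.items()}
print(dominant)
```

Note that a row like Doc1 (0.41 versus 0.52) has a close runner-up, so a single dominant label hides part of the mixture.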


In [62]:
#Inspect how the documents are distributed across topics
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

Unnamed: 0,Topic Num,Num Documents
0,4,2168
1,2,1258
2,3,1014
3,8,738
4,1,720
5,0,603
6,5,552
7,7,426
8,9,419
9,6,401
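
The counts are easier to read as shares of the corpus. A short sketch using the numbers from the table above:

```python
# Documents per dominant topic, copied from the table above
topic_counts = {4: 2168, 2: 1258, 3: 1014, 8: 738, 1: 720,
                0: 603, 5: 552, 7: 426, 9: 419, 6: 401}

total = sum(topic_counts.values())

# Share of the corpus covered by each topic, as a percentage
shares = {topic: round(100 * n / total, 1) for topic, n in topic_counts.items()}
print(total)
print(shares)
```

Topic 4 alone is the dominant topic for roughly a quarter of the documents, so the distribution is far from uniform.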


In [63]:
#Visualize with pyLDAvis
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer, mds='tsne')
pyLDAvis.save_html(panel, 'lda.html') #Save the chart as HTML
panel

In [65]:
#Get the top 20 keywords per topic

def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())  # scikit-learn >= 1.0 provides get_feature_names_out()
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=20)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = [i for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14,Word 15,Word 16,Word 17,Word 18,Word 19
0,soldier,order,team,man,plane,mission,agent,bomb,attack,war,military,terrorist,force,officer,pilot,crew,weapon,time,group,government
1,school,girl,mother,friend,student,home,boy,room,time,day,night,later,sex,parent,teacher,year,class,child,old,film
2,police,murder,money,drug,car,death,officer,killer,prison,later,case,gun,man,crime,wife,dead,scene,time,shoot,evidence
3,car,room,door,away,woman,head,phone,house,gun,night,home,dead,hand,body,time,window,way,tell,day,talk
4,family,home,life,day,time,mother,year,later,friend,woman,child,wife,old,daughter,night,father,relationship,work,story,husband
5,game,team,time,day,new,friend,big,band,good,kid,player,money,fight,year,later,car,home,night,old,race
6,ship,time,water,crew,world,boat,earth,human,life,animal,planet,white,year,snow,day,way,later,away,new,old
7,car,room,bond,train,truck,time,building,group,agent,escape,man,way,team,gun,hand,guard,away,head,people,suddenly
8,body,death,power,ghost,human,attack,vampire,zombie,dead,child,creature,away,blood,demon,night,evil,time,year,way,monster
9,film,town,man,group,village,people,local,member,death,woman,brother,horse,black,time,story,later,day,new,war,family
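
The keyword selection boils down to sorting each topic's word weights in descending order and keeping the first `n_words`. A pure-Python sketch with made-up weights (the helper `top_keywords` and the numbers are hypothetical):

```python
def top_keywords(weights, vocabulary, n_words):
    """Return the n_words vocabulary entries with the largest weights."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    return [vocabulary[i] for i in order[:n_words]]

# Hypothetical topic-word weights for a five-word vocabulary
vocabulary = ["car", "ghost", "police", "school", "war"]
weights = [0.5, 7.0, 1.2, 0.1, 3.4]

print(top_keywords(weights, vocabulary, 3))
```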


### Predicting topics for a new text

In [2]:
#Order of transformations
#sent_to_words() –> lemmatization() –> vectorizer.transform() –> best_lda_model.transform()
#Requires sent_to_words(), lemmatization(), vectorizer, best_lda_model and df_topic_keywords in the current session

nlp = spacy.load('en', disable=['parser', 'ner'])

def predict_topic(text, nlp=nlp):
    # 0: Remove unnecessary characters
    text1 = [re.sub(r'\s+', ' ', sent) for sent in text] #Collapse line breaks and whitespace
    text1 = [re.sub("'", "", sent) for sent in text1]    #Remove apostrophes
    
    # 1: Clean the text with simple_preprocess
    mytext_2 = list(sent_to_words(text1))
    
    # 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'ADV']) #, 'VERB'])

    # 3: Vectorize
    mytext_4 = vectorizer.transform(mytext_3)

    # 4: LDA Transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores

# Predict the topic
mytext = ['Some text about a good war movie']
topic, prob_scores = predict_topic(text = mytext)
print(topic)

NameError: name 'sent_to_words' is not defined

In [67]:
# Predict the topic
apocalipsisnow = ['The story opens in Saigon South Vietname late in 1969. U.S. Army Captain and special operations veteran Benjamin L. Willard (Martin Sheen) has returned to Saigon on another combat tour during the Vietnam War casually admitting that he is unable to rejoin society in the USA and that his marriage has broken up. He drinks heavily chain-smokes and hallucinates alone in his room becoming very upset and injuring himself when he breaks a large mirror. One day two military policemen arrive at Williards Saigon apartment and after cleaning him up escort him to an officers trailer where military intelligence officers Lt. General R. Corman (G. D. Spradlin) and Colonel Lucas (Harrison Ford) approach him with a top-secret assignment to follow the Nung River into the remote jungle find rogue Special Forces Colonel Walter E. Kurtz (Marlon Brando) and terminate his command with extreme prejudice. Kurtz apparently went insane and now commands his own Montagnard troops inside neutral Cambodia. They play a recording of Kurtz voice captured by Army intelligence where Kurtz rambles about the destruction of the war and a snail crawling on the edge of a straight razor. Willard is flown to Cam Ram Bay and joins a Navy PBR commanded by Chief (Albert Hall) and crewmen Lance (Sam Bottoms) Chef (Frederic Forrest) and Mr. Clean (Laurence Fishburne). Williard narrates that the crew are mostly young soldiers; Clean is only 17 and from the South Bronx Lance is a famous surfer from California and Chef is a chef from New Orleans. The Chief is an experienced sailor who mentions that hed previously brought another special operations soldier into the jungles of Vietnam on a similar mission and heard that the man committed suicide. As they travel down the coast to the mouth of the Nung River Willards voiceover reveals that hearing Kurtz voice triggered a fascination with Kurtz himself.']

topic, prob_scores = predict_topic(text = apocalipsisnow)
print(topic)

['soldier', 'order', 'team', 'man', 'plane', 'mission', 'agent', 'bomb', 'attack', 'war', 'military', 'terrorist', 'force', 'officer', 'pilot', 'crew', 'weapon', 'time', 'group', 'government']


In [None]:
#Serialize for later use
import pickle
import dill
with open('sent_to_words.dill', 'wb') as f:
    dill.dump(sent_to_words, f)
with open('lemmatization.dill', 'wb') as f:
    dill.dump(lemmatization, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('lda_model.pkl', 'wb') as f:
    pickle.dump(best_lda_model, f)
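
Reading the artifacts back follows the mirror-image pattern. The sketch below shows a full round trip with a stand-in dictionary and a temporary file, both of which are assumptions for illustration; in practice the object would be the vectorizer or the LDA model.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the vectorizer or the LDA model
artifact = {"n_components": 10, "learning_decay": 0.7}

path = os.path.join(tempfile.mkdtemp(), "artifact.pkl")

# Write the object to disk, closing the file when done
with open(path, "wb") as f:
    pickle.dump(artifact, f)

# Read it back in a later session
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifact)
```

Pickled models are generally best loaded with the same library versions that produced them.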

### Embeddings to categorize a topic

In [69]:
#Tokenize the lemmatized data
from nltk import word_tokenize, sent_tokenize
from tqdm import tqdm #Progress bar library

embeddings = []
for sent in tqdm(data_lemmatized):
    tokens = []
    for token in word_tokenize(sent):
        if token.isalpha():
            tokens.append(token)
    embeddings.append(tokens)

100%|█████████████████████████████████████| 8299/8299 [00:08<00:00, 954.51it/s]


In [70]:
print("the corpus has", len(embeddings), "documents and", sum(len(x) for x in embeddings), "words")

the corpus has 8299 documents and 2061972 words


In [71]:
from gensim.models.word2vec import Word2Vec
# "window" is the context window size: window=10 uses up to 10 words to the left and 10 to the right
# "size" (set to n_dim) is the dimension, i.e. the length, of the word2vec vectors; gensim 4 renamed it "vector_size"
# "workers" is the number of worker threads; Cython must be installed to benefit from it
# "sample" is the threshold above which high-frequency words are randomly downsampled
# "min_count": word2vec ignores words with fewer occurrences than "min_count"
# "negative": number of noise words drawn per positive example in negative sampling
# "sg": 1 runs the skip-gram model, 0 runs CBOW
# for more detail see: https://radimrehurek.com/gensim/models/word2vec.html
n_dim = 20
w2v_model = Word2Vec(embeddings, workers=4, size=n_dim, min_count=10, window=10, sample=1e-3, negative=10, sg=0)

In [72]:
#Serialize for later use
pickle.dump(w2v_model, open('w2v_model.pkl', 'wb'))

In [73]:
genero = 'romance'
palabra = 'love'

# Query the word vectors through .wv (calling most_similar on the model itself is deprecated)
pprint(w2v_model.wv.most_similar(positive=[genero], negative=[], topn=10))
print('\n', genero, '-', palabra, 'similarity:', w2v_model.wv.n_similarity([genero], [palabra]))

[('romantic', 0.9299993515014648),
 ('passion', 0.9209772348403931),
 ('romantically', 0.9097239375114441),
 ('lifestyle', 0.9036920070648193),
 ('attraction', 0.8988896608352661),
 ('poet', 0.8947316408157349),
 ('friendship', 0.8907514810562134),
 ('jealousy', 0.885991632938385),
 ('occasion', 0.8844090104103088),
 ('affection', 0.8778270483016968)]

 romance - love similarity: 0.41884464
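
The scores above are cosine similarities between word vectors. A minimal sketch of that measure, using made-up 3-dimensional vectors rather than the trained embeddings:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors (the notebook's vectors have n_dim = 20)
romance = [1.0, 2.0, 0.0]
romantic = [2.0, 4.0, 0.1]
zombie = [-1.0, 0.5, 3.0]

print(cosine(romance, romantic))
print(cosine(romance, zombie))
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.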


### Naming the topics using embedding similarity

The assignment method averages the similarities between a given genre word and each keyword of a topic. If the method works, every topic should end up with one genre. However, the topics were not derived from the embeddings and the average is a purely arithmetic operation, so this option is presented only as an experiment.
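
The averaging step can be sketched as follows. The similarity values, genre names and topic labels in this example are made up for illustration; the real values come from the word2vec model.

```python
from statistics import mean

# Hypothetical similarities between each genre word and a topic's keywords
similarities = {
    "action": {"topic_0": [0.58, 0.50, 0.46], "topic_8": [-0.26, 0.27, -0.22]},
    "horror": {"topic_0": [0.10, -0.05, 0.02], "topic_8": [0.61, 0.40, 0.55]},
}

# Mean similarity for every (genre, topic) pair
mean_sim = {genre: {topic: mean(vals) for topic, vals in per_topic.items()}
            for genre, per_topic in similarities.items()}

# Each topic takes the genre with the highest mean similarity
topic_names = ["topic_0", "topic_8"]
assignment = {t: max(mean_sim, key=lambda g: mean_sim[g][t]) for t in topic_names}
print(assignment)
```

A plain mean gives generic keywords such as "time" or "day" the same weight as distinctive ones, which is one reason the result should be treated with caution.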

In [74]:
#Build a dataframe of genres
df_generos = pd.DataFrame({'genero':['action', 'comedy', 'crime', 'drama', 'fantasy', 'fiction', \
                                     'historical', 'horror', 'romance', 'thriller'],
                           '0':'', '1':'', '2':'', '3':'', '4':'', '5':'','6':'', '7':'', '8':'', '9':''})
df_generos

Unnamed: 0,genero,0,1,2,3,4,5,6,7,8,9
0,action,,,,,,,,,,
1,comedy,,,,,,,,,,
2,crime,,,,,,,,,,
3,drama,,,,,,,,,,
4,fantasy,,,,,,,,,,
5,fiction,,,,,,,,,,
6,historical,,,,,,,,,,
7,horror,,,,,,,,,,
8,romance,,,,,,,,,,
9,thriller,,,,,,,,,,


In [75]:
#Compute the mean similarity between each topic's keywords and the genre to assign
for i in df_generos.index:
    vec_similarity=[]
    col = 0
    print('Genero: ', df_generos.iloc[i].genero)
    for j in df_topic_keywords.index:
        for k in df_topic_keywords.columns:
            similarity = w2v_model.wv.n_similarity([df_generos.iloc[i].genero], [df_topic_keywords.iloc[j][k]])
            vec_similarity.append(similarity)
            print('\tTopic', j, 'Col', k, '-', '{0: <16}'.format(df_topic_keywords.iloc[j][k]), similarity)
        df_generos.at[i, str(col)] = np.mean(vec_similarity).astype(str)  # set_value() was removed in pandas 1.0
        print('Promedio genero "', df_generos.iloc[i].genero, '" para Topic', str(col), ': ', np.mean(vec_similarity))
        col = col + 1
        vec_similarity=[]
    col = 0  

Genero:  action
	Topic 0 Col Word 0 - soldier          0.11772944
	Topic 0 Col Word 1 - order            0.37400395
	Topic 0 Col Word 2 - team             0.3412092
	Topic 0 Col Word 3 - man              -0.02333355
	Topic 0 Col Word 4 - plane            0.0395606
	Topic 0 Col Word 5 - mission          0.5003827
	Topic 0 Col Word 6 - agent            0.17470099
	Topic 0 Col Word 7 - bomb             0.2357462
	Topic 0 Col Word 8 - attack           0.11245807
	Topic 0 Col Word 9 - war              0.57603025
	Topic 0 Col Word 10 - military         0.46365365
	Topic 0 Col Word 11 - terrorist        0.49254084
	Topic 0 Col Word 12 - force            0.40519148
	Topic 0 Col Word 13 - officer          0.393775
	Topic 0 Col Word 14 - pilot            0.09792673
	Topic 0 Col Word 15 - crew             0.20941599
	Topic 0 Col Word 16 - weapon           0.3102798
	Topic 0 Col Word 17 - time             0.12098871
	Topic 0 Col Word 18 - group            -0.06771648
	Topic 0 Col Word 19 - governm

	Topic 7 Col Word 13 - gun              0.11654178
	Topic 7 Col Word 14 - hand             -0.064565025
	Topic 7 Col Word 15 - guard            0.06866483
	Topic 7 Col Word 16 - away             -0.3141349
	Topic 7 Col Word 17 - head             -0.33045977
	Topic 7 Col Word 18 - people           0.21693674
	Topic 7 Col Word 19 - suddenly         -0.29734448
Promedio genero " action " para Topic 7 :  -0.073676296
	Topic 8 Col Word 0 - body             -0.25821584
	Topic 8 Col Word 1 - death            0.26531795
	Topic 8 Col Word 2 - power            0.24870366
	Topic 8 Col Word 3 - ghost            -0.21984385
	Topic 8 Col Word 4 - human            -0.043139488
	Topic 8 Col Word 5 - attack           0.11245807
	Topic 8 Col Word 6 - vampire          -0.24011788
	Topic 8 Col Word 7 - zombie           -0.19061038
	Topic 8 Col Word 8 - dead             -0.23107487
	Topic 8 Col Word 9 - child            -0.011831684
	Topic 8 Col Word 10 - creature         -0.3631346
	Topic 8 Col Word 11 - 

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



 father           0.09545562
	Topic 4 Col Word 16 - relationship     -0.006571511
	Topic 4 Col Word 17 - work             -0.05278493
	Topic 4 Col Word 18 - story            0.45541164
	Topic 4 Col Word 19 - husband          0.10085488
Promedio genero " crime " para Topic 4 :  0.018120844
	Topic 5 Col Word 0 - game             0.01242137
	Topic 5 Col Word 1 - team             0.07325468
	Topic 5 Col Word 2 - time             -0.14917089
	Topic 5 Col Word 3 - day              -0.11524039
	Topic 5 Col Word 4 - new              -0.09952014
	Topic 5 Col Word 5 - friend           -0.072466955
	Topic 5 Col Word 6 - big              0.012375397
	Topic 5 Col Word 7 - band             -0.3191011
	Topic 5 Col Word 8 - good             -0.018461352
	Topic 5 Col Word 9 - kid              -0.23985513
	Topic 5 Col Word 10 - player           -0.0011678247
	Topic 5 Col Word 11 - money            0.3055812
	Topic 5 Col Word 12 - fight            -0.014860757
	Topic 5 Col Word 13 - year             0.13

 dead             -0.24260467
	Topic 2 Col Word 16 - scene            0.14700463
	Topic 2 Col Word 17 - time             0.22165117
	Topic 2 Col Word 18 - shoot            -0.40081042
	Topic 2 Col Word 19 - evidence         0.042911686
Promedio genero " fiction " para Topic 2 :  -0.073730305
	Topic 3 Col Word 0 - car              -0.2758642
	Topic 3 Col Word 1 - room             0.026580095
	Topic 3 Col Word 2 - door             -0.16309996
	Topic 3 Col Word 3 - away             -0.43836027
	Topic 3 Col Word 4 - woman            0.06710666
	Topic 3 Col Word 5 - head             -0.28046817
	Topic 3 Col Word 6 - phone            -0.0920852
	Topic 3 Col Word 7 - house            -0.14287195
	Topic 3 Col Word 8 - gun              -0.32579058
	Topic 3 Col Word 9 - night            -0.0040396685
	Topic 3 Col Word 10 - home             -0.1229228
	Topic 3 Col Word 11 - dead             -0.24260467
	Topic 3 Col Word 12 - hand             -0.14008144
	Topic 3 Col Word 13 - body             -0.

	Topic 3 Col Word 5 - head             -0.5781571
	Topic 3 Col Word 6 - phone            -0.21413581
	Topic 3 Col Word 7 - house            0.0029309173
	Topic 3 Col Word 8 - gun              -0.42212406
	Topic 3 Col Word 9 - night            0.32760113
	Topic 3 Col Word 10 - home             0.28067783
	Topic 3 Col Word 11 - dead             -0.42948943
	Topic 3 Col Word 12 - hand             -0.34109494
	Topic 3 Col Word 13 - body             -0.38084367
	Topic 3 Col Word 14 - time             0.28015205
	Topic 3 Col Word 15 - window           -0.29270586
	Topic 3 Col Word 16 - way              -0.1520051
	Topic 3 Col Word 17 - tell             -0.06984567
	Topic 3 Col Word 18 - day              0.25557452
	Topic 3 Col Word 19 - talk             0.22913685
Promedio genero " romance " para Topic 3 :  -0.105640575
	Topic 4 Col Word 0 - family           0.4001787
	Topic 4 Col Word 1 - home             0.28067783
	Topic 4 Col Word 2 - life             0.53703713
	Topic 4 Col Word 3 - day

	Topic 7 Col Word 13 - gun              -0.063702494
	Topic 7 Col Word 14 - hand             0.037392627
	Topic 7 Col Word 15 - guard            -0.12603255
	Topic 7 Col Word 16 - away             -0.19471595
	Topic 7 Col Word 17 - head             -0.1064167
	Topic 7 Col Word 18 - people           0.20771672
	Topic 7 Col Word 19 - suddenly         -0.0073101996
Promedio genero " thriller " para Topic 7 :  0.010527924
	Topic 8 Col Word 0 - body             0.0074180802
	Topic 8 Col Word 1 - death            0.17503384
	Topic 8 Col Word 2 - power            0.19485398
	Topic 8 Col Word 3 - ghost            0.15861222
	Topic 8 Col Word 4 - human            0.18248571
	Topic 8 Col Word 5 - attack           -0.043233335
	Topic 8 Col Word 6 - vampire          0.12656285
	Topic 8

In [76]:
df_generos

In [77]:
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14,Word 15,Word 16,Word 17,Word 18,Word 19
0,soldier,order,team,man,plane,mission,agent,bomb,attack,war,military,terrorist,force,officer,pilot,crew,weapon,time,group,government
1,school,girl,mother,friend,student,home,boy,room,time,day,night,later,sex,parent,teacher,year,class,child,old,film
2,police,murder,money,drug,car,death,officer,killer,prison,later,case,gun,man,crime,wife,dead,scene,time,shoot,evidence
3,car,room,door,away,woman,head,phone,house,gun,night,home,dead,hand,body,time,window,way,tell,day,talk
4,family,home,life,day,time,mother,year,later,friend,woman,child,wife,old,daughter,night,father,relationship,work,story,husband
5,game,team,time,day,new,friend,big,band,good,kid,player,money,fight,year,later,car,home,night,old,race
6,ship,time,water,crew,world,boat,earth,human,life,animal,planet,white,year,snow,day,way,later,away,new,old
7,car,room,bond,train,truck,time,building,group,agent,escape,man,way,team,gun,hand,guard,away,head,people,suddenly
8,body,death,power,ghost,human,attack,vampire,zombie,dead,child,creature,away,blood,demon,night,evil,time,year,way,monster
9,film,town,man,group,village,people,local,member,death,woman,brother,horse,black,time,story,later,day,new,war,family


In [78]:
# Following the method described above, several genres end up mapped to the same
# topic (topic 4 twice and topic 9 three times in the output below), so this
# automatic result is discarded and the genres are assigned manually.
for i in range(0,10):
    # Find the column that holds the largest value in row i (columns i+1 onward)
    found = df_generos.isin([df_generos.iloc[i,i+1:].max()]).any()
    column_name = found[found].index.values[0]  
    print('Genero:', '{0: <10}'.format(df_generos.genero[i]), \
          '- Topico:', df_generos.columns[df_generos.columns.get_loc(column_name)])

Genero: action     - Topico: 0
Genero: comedy     - Topico: 1
Genero: crime      - Topico: 2
Genero: drama      - Topico: 4
Genero: fantasy    - Topico: 4
Genero: fiction    - Topico: 6
Genero: historical - Topico: 9
Genero: horror     - Topico: 8
Genero: romance    - Topico: 9
Genero: thriller   - Topico: 9
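
The collisions above arise because each genre picks its best topic independently. As a minimal sketch of one alternative to manual assignment, the function below pairs genres and topics greedily, strongest score first, so that no topic is used twice. The function name `assign_one_to_one`, the dict-of-dicts input layout, and the numbers are all hypothetical and only for illustration; they are not values from this notebook.

```python
def assign_one_to_one(scores):
    """Greedily pair each genre with a distinct topic, highest score first.

    `scores` maps genre -> {topic: score} (a hypothetical layout).
    """
    # Flatten to (score, genre, topic) triples and visit the strongest first.
    triples = sorted(
        ((s, g, t) for g, row in scores.items() for t, s in row.items()),
        key=lambda triple: triple[0],
        reverse=True,
    )
    assignment, used_topics = {}, set()
    for score, genre, topic in triples:
        # Skip pairs whose genre or topic has already been taken.
        if genre not in assignment and topic not in used_topics:
            assignment[genre] = topic
            used_topics.add(topic)
    return assignment


# Illustrative numbers only: both genres score highest on topic 9.
toy_scores = {
    "romance": {4: 0.40, 9: 0.50},
    "thriller": {4: 0.10, 9: 0.45},
}
result = assign_one_to_one(toy_scores)
print(result)
```

A greedy pass is simple but not guaranteed to maximize the total score; if that matters, `scipy.optimize.linear_sum_assignment` solves the same one-to-one problem optimally.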


In [80]:
# Assign a topic to each genre manually
genre_to_topic = {
    'action':     '9',
    'comedy':     '3',
    'crime':      '4',
    'drama':      '5',
    'fantasy':    '6',
    'fiction':    '7',
    'historical': '0',
    'horror':     '2',
    'romance':    '8',
    'thriller':   '1',
}
df_generos['topic'] = df_generos['genero'].map(genre_to_topic)

df_generos = df_generos[['genero', 'topic']]
df_generos

Unnamed: 0,genero,topic
0,action,9
1,comedy,3
2,crime,4
3,drama,5
4,fantasy,6
5,fiction,7
6,historical,0
7,horror,2
8,romance,8
9,thriller,1


In [81]:
df_generos = df_generos.set_index('topic').sort_values('topic').reset_index()
df_generos

Unnamed: 0,topic,genero
0,0,historical
1,1,thriller
2,2,horror
3,3,comedy
4,4,crime
5,5,drama
6,6,fantasy
7,7,fiction
8,8,romance
9,9,action


In [82]:
# Add the genre to the per-topic keywords DataFrame
df_topic_keywords = df_topic_keywords.merge(df_generos[['genero']], left_index=True, right_index=True)
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,...,Word 11,Word 12,Word 13,Word 14,Word 15,Word 16,Word 17,Word 18,Word 19,genero
0,soldier,order,team,man,plane,mission,agent,bomb,attack,war,...,terrorist,force,officer,pilot,crew,weapon,time,group,government,historical
1,school,girl,mother,friend,student,home,boy,room,time,day,...,later,sex,parent,teacher,year,class,child,old,film,thriller
2,police,murder,money,drug,car,death,officer,killer,prison,later,...,gun,man,crime,wife,dead,scene,time,shoot,evidence,horror
3,car,room,door,away,woman,head,phone,house,gun,night,...,dead,hand,body,time,window,way,tell,day,talk,comedy
4,family,home,life,day,time,mother,year,later,friend,woman,...,wife,old,daughter,night,father,relationship,work,story,husband,crime
5,game,team,time,day,new,friend,big,band,good,kid,...,money,fight,year,later,car,home,night,old,race,drama
6,ship,time,water,crew,world,boat,earth,human,life,animal,...,white,year,snow,day,way,later,away,new,old,fantasy
7,car,room,bond,train,truck,time,building,group,agent,escape,...,way,team,gun,hand,guard,away,head,people,suddenly,fiction
8,body,death,power,ghost,human,attack,vampire,zombie,dead,child,...,away,blood,demon,night,evil,time,year,way,monster,romance
9,film,town,man,group,village,people,local,member,death,woman,...,horse,black,time,story,later,day,new,war,family,action


In [3]:
# Predict the topic
twelve_angry = ['In a New York City courthouse an eighteen-year-old boy from a slum is on trial for allegedly stabbing his father to death. Final closing arguments having been presented a visibly bored judge instructs the jury to decide whether the boy is guilty of murder. If there is any reasonable doubt of his guilt they are to return a verdict of not guilty. The judge further informs them that a guilty verdict will be accompanied by a mandatory death sentence.The jury retires to a private room where the jurors spend a short while getting acquainted before they begin deliberating. It is immediately apparent that the jurors have already decided that the boy is guilty and that they plan to return their verdict without taking time for discussion with the sole exception of Juror 8  who is the only not guilty vote in a preliminary tally. He explains that there is too much at stake for him to go along with the verdict without at least talking about it first. His vote annoys the other jurors especially Juror 7  who has tickets to a baseball game that evening; and Juror 10  who believes that people from slum backgrounds are liars wild and dangerous.The rest of the films focus is the jurys difficulty in reaching a unanimous verdict. While several of the jurors harbor personal prejudices Juror 8 maintains that the evidence presented in the case is circumstantial and that the boy deserves a fair deliberation. He calls into question the accuracy and reliability of the only two witnesses to the murder the rarity of the murder weapon (a common switchblade of which he has an identical copy) and the overall questionable circumstances. He further argues that he cannot in good conscience vote guilty when he feels there is reasonable doubt of the boys guilt.Having argued several points and gotten no favorable response from the others Juror 8 reluctantly agrees that he has only succeeded in hanging the jury. Instead he requests another vote this time by secret ballot. 
He proposes that he will abstain from voting and if the other 11 jurors are still unanimous in a guilty vote then he will acquiesce to their decision. The secret ballot is held and a new not guilty vote appears. This earns intense criticism from Juror 3  who blatantly accuses Juror 5  who had grown up in a slum of switching out of sympathy toward slum children. However Juror 9  reveals that he himself changed his vote feeling that Juror 8s points deserve further discussion.Juror 8 presents a convincing argument that one of the witnesses an elderly man who claimed to have heard the boy yell Im going to kill you shortly before the murder took place could not have heard the voices as clearly as he had testified due to an elevated train passing by at the time; as well as stating that Im going to kill you is often said by people who do not literally mean it. Juror 5 changes his vote to not guilty. Soon afterward Juror 11  questions whether it is reasonable to suppose the defendant would have fled the scene having cleaned the knife of fingerprints but leaving it behind and then come back three hours later to retrieve it (having been left in his fathers chest). Juror 11 then changes his vote.Juror 8 then mentions the mans second claim: upon hearing the fathers body hit the floor he had run to the door of his apartment and seen the defendant running out of the building from his front door in 15 seconds. Jurors 5 6 and 8 question whether this is true as the witness in question had had a stroke limiting his ability to walk. Upon the end of an experiment the jury finds that the witness would not have made it to the door in enough time to actually see the killer running out. Juror 8 concludes that judging from what he claims to have heard earlier the witness must have merely assumed it was the defendant running. Juror 3 growing more irritated throughout the process explodes in a rant: Hes got to burn! Hes slipping through our fingers! 
Juror 8 takes him to task calling him a self-appointed public avenger and a sadist saying he wants the defendant to die because of personal desire rather than the facts. Juror 3 shouts Ill kill him! and starts lunging at Juror 8 but is restrained by Jurors 5 and 7. Juror 8 calmly retorts You dont really mean youll kill me do you? proving his previous point.Jurors 2  and 6  also decide to vote not guilty tying the vote at 6-6. Soon after a rainstorm hits the city apparently postponing the baseball game for which Juror 7 has tickets thus allowing him to relax and pay attention with that schedule pressure relieved.Juror 4  continues to state that he does not believe the boys alibi which was being at the movies with a few friends at the time of the murder because the boy could not remember what movie he had seen when questioned by police shortly after the murder. Juror 8 explains that being under emotional stress can make you forget certain things and tests how well Juror 4 can remember the events of previous days. Juror 4 remembers with some difficulty the events of the previous five days and Juror 8 points out that he had not been under emotional stress at that time thus there was no reason to think the boy should be able to remember the particulars of the movie that he claimed to have seen.Juror 2 calls into question the prosecutions claim that the accused who was 57 tall was able to inflict the downward stab wound found on his father who was 62. Jurors 3 and 8 conduct an experiment to see if its possible for a shorter person to stab downward into a taller person. The experiment proves the possibility but Juror 5 then explains that he had grown up amidst knife fights in his neighborhood and shows through demonstrating the correct use of a switchblade that no one so much shorter than his opponent would have held a switchblade in such a way as to stab downward as the grip would have been too awkward and the act of changing hands too time-consuming. 
Rather someone that much shorter than his opponent would stab underhanded at an upwards angle. This revelation augments the certainty of several of the jurors in their belief that the defendant is not guilty.Increasingly impatient Juror 7 changes his vote just so that the deliberation may end which earns him the ire of Jurors 3 and 11 both on opposite sides of the discussion. Juror 11 an immigrant who has repeatedly displayed strong patriotic pride presses Juror 7 hard about using his vote frivolously and eventually Juror 7 admits that he now truly believes the defendant is not guilty.The next jurors to change their votes are Jurors 12  and the Jury Foreman  making the vote 9-3 and leaving only three dissenters: Jurors 3 4 and 10. Outraged at how the proceedings have gone Juror 10 goes into a rage on why people from the slums cannot be trusted of how they are little better than animals who gleefully kill each other off for fun. His speech offends Juror 5 who turns his back to him and one by one the rest of the jurors start turning away from him. Confused and disturbed by this reaction to his diatribe Juror 10 continues in a steadily fading voice and manner slowing to a stop with Listen to me. Listen... Juror 4 the only man still facing him tersely responds I have. Now sit down and dont open your mouth again. As Juror 10 moves to sit in a corner by himself Juror 8 speaks quietly about the evils of prejudice and the other jurors slowly resume their seats.When those remaining in favor of a guilty vote are pressed as to why they still maintain that there is no reasonable doubt Juror 4 states his belief that despite all the other evidence that has been called into question the fact remains that the woman who saw the murder from her bedroom window across the street (through the passing train) still stands as solid evidence. 
After he points this out Juror 12 changes his vote back to guilty making the vote 8-4.Then Juror 9 after seeing Juror 4 rub his nose (which is being irritated by his eye glasses) realizes that like Juror 4 the woman who allegedly saw the murder had impressions in the sides of her nose which she rubbed indicating that she wore glasses but did not wear them to court out of vanity. Juror 8 cannily asks Juror 4 if he wears his eyeglasses to sleep and Juror 4 admits that he does not wear them nobody does. Juror 8 explains that there was thus no logical reason to expect that the witness happened to be wearing her glasses while trying to sleep and he points out that on her own evidence the attack happened so swiftly that she would not have had time to put them on. After he points this out Jurors 12 10 and 4 all change their vote to not guilty.At this point the only remaining juror with a guilty vote is Juror 3. Juror 3 gives a long and increasingly tortured string of arguments ending with Rotten kids you work your life out! This builds on a more emotionally ambivalent earlier revelation that his relationship with his own son is deeply strained and his anger over this fact is the main reason that he wants the defendant to be guilty. Juror 3 finally loses his temper and tears up a photo of himself and his son then suddenly breaks down crying and changes his vote to not guilty making the vote unanimous.As the jurors leave the room Juror 8 helps the distraught Juror 3 with his coat in a show of compassion. The film ends when the friendly Jurors 8  and 9  exchange names and all of the jurors descend the courthouse steps to return to their individual lives... never to see each other again.']

topic, prob_scores = predict_topic(text = twelve_angry)
print(topic)

NameError: name 'sent_to_words' is not defined
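
The `NameError` means `sent_to_words`, which `predict_topic` calls, is not defined in the current kernel session. The cell counter restarting at `In [3]` suggests the kernel was restarted and the helper cells were not re-run. The notebook's own definition is not shown in this section, so the following is only a hypothetical stand-in, assuming the helper turns each text into a list of lowercase word tokens. A common choice for this step is `gensim.utils.simple_preprocess`; the sketch uses only the standard library and mirrors that function's default token-length limits.

```python
import re
import unicodedata


def sent_to_words(sentences, min_len=2, max_len=15):
    """Yield one list of lowercase word tokens per input text.

    Hypothetical stand-in; the notebook's original helper may differ.
    """
    for sentence in sentences:
        # Strip accents so that "género" and "genero" yield the same token.
        text = unicodedata.normalize("NFKD", str(sentence))
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # Keep alphabetic runs only; digits and punctuation act as separators.
        tokens = re.findall(r"[a-z]+", text.lower())
        yield [tok for tok in tokens if min_len <= len(tok) <= max_len]


sample = ["Juror 8 is the only not-guilty vote."]
tokens = list(sent_to_words(sample))
print(tokens)
```

Re-running the cell that defines the helper (and any other preprocessing functions `predict_topic` depends on) before calling `predict_topic` should clear the error.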