# <center>Predicting Film Genre Using Unsupervised Methods</center>

We will try to predict the genre of a plotline using unsupervised methods. This part starts with the LDA (Latent Dirichlet Allocation) model.

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
from nltk.corpus import stopwords

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

# Warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the DataFrame
df = pd.read_csv('orig_movies.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,imdb_id,title,original_title,year,date_published,genre,duration,country,language,director,actors,description,avg_vote,votes,plot_synopsis
0,0,tt0035423,Kate & Leopold,Kate & Leopold,2001,2002-04-05,"Comedy, Fantasy, Romance",118,USA,"English, French",James Mangold,"Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...,6.4,75298,"On 28 April 1876, Leopold, His Grace the 3rd D..."
1,1,tt0073537,Double Exposure,Double Exposure,1982,1983-11-09,"Comedy, Crime, Drama",100,USA,English,William Byron Hillman,"Michael Callan, Joanna Pettet, James Stacy, Pa...",A photographer for a men's magazine is disturb...,4.9,535,"The Putnams, Roger ('Ian Buchanan') and Maria ..."
2,2,tt0076709,Si wang ta,Si wang ta,1981,1981-03-21,"Action, Crime, Mystery",96,Hong Kong,Cantonese,"See-Yuen Ng, Sammo Kam-Bo Hung","Bruce Lee, Tae-jeong Kim, Jung-Lee Hwang, Roy ...",After Billy Lo is killed while seeking the mur...,5.2,2670,"After a recent amount of challenges, Billy Lo ..."
3,3,tt0078349,Sekai meisaku dôwa: Hakuchô no mizûmi,Sekai meisaku dôwa: Hakuchô no mizûmi,1981,1981-03-14,"Animation, Adventure, Family",75,Japan,Japanese,Kimio Yabuki,"Keiko Takeshita, Tarô Shigaki, Asao Koike, Yôk...",A prince falls in love with a princess cursed ...,7.8,667,Below is a synopsis based on the 1895 libretto...
4,4,tt0078749,Alien 2 - Sulla Terra,Alien 2 - Sulla Terra,1980,1980-04-11,"Adventure, Horror, Sci-Fi",92,Italy,"English, Italian","Ciro Ippolito, Biagio Proietti","Belinda Mayne, Mark Bodin, Roberto Barrese, Be...",A spaceship lands back on Earth after a failed...,3.7,1104,The commercial spacecraft Nostromo is on a ret...


In [4]:
# Convert the synopses to a list
data = df.plot_synopsis.values.tolist()

# Collapse newlines and repeated whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove apostrophes
data = [re.sub(r"'", "", sent) for sent in data]

# Show an example
print(data[:1])

['On 28 April 1876, Leopold, His Grace the 3rd Duke of Albany (Hugh Jackman), is a stifled dreamer. He has created a design for a primitive elevator, and has built a small model of this device. His strict uncle Millard (Paxton Whitehead) has no patience for what he characterises as a sign of Leopolds disrespect for the Monarchy, chastising him, and telling him he must marry a rich American, as the Mountbatten family finances are depleted. In response to his uncles accusations of his blemishing the family name, Leopold counters that the new nobility is to be found in those who pursue initiatives, hence his interest in the sciences and inventions. One day, the Duke finds Stuart Besser (Liev Schreiber), an amateur physicist (and great-great-grandson of Leopold) in his study perusing his schematic diagrams and taking photographs of them. He had seen him earlier at Roeblings speech about the Brooklyn Bridge, after he was laughing at the word "erection." Leopold follows Stuart and tries to s
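
The two substitutions in this cell can be checked on a short string. The following is a minimal sketch, assuming a hypothetical helper name `clean_text` that is not part of the notebook; it only reproduces the two regular expressions.

```python
import re

def clean_text(text):
    """Collapse whitespace runs into one space, then drop apostrophes."""
    text = re.sub(r"\s+", " ", text)
    return re.sub(r"'", "", text)

# Toy input with a possessive, a double space and a newline
sample = "Leopold's  uncle\nhas no patience"
print(clean_text(sample))
```

Because apostrophes are deleted rather than replaced, possessives are fused with the word ("Leopold's" becomes "Leopolds"), which is why tokens such as `leopolds` and `uncles` show up after tokenization.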

In [5]:
# Tokenize and clean the text with gensim's simple_preprocess()
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and drops punctuation; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])

[['on', 'april', 'leopold', 'his', 'grace', 'the', 'rd', 'duke', 'of', 'albany', 'hugh', 'jackman', 'is', 'stifled', 'dreamer', 'he', 'has', 'created', 'design', 'for', 'primitive', 'elevator', 'and', 'has', 'built', 'small', 'model', 'of', 'this', 'device', 'his', 'strict', 'uncle', 'millard', 'paxton', 'whitehead', 'has', 'no', 'patience', 'for', 'what', 'he', 'characterises', 'as', 'sign', 'of', 'leopolds', 'disrespect', 'for', 'the', 'monarchy', 'chastising', 'him', 'and', 'telling', 'him', 'he', 'must', 'marry', 'rich', 'american', 'as', 'the', 'mountbatten', 'family', 'finances', 'are', 'depleted', 'in', 'response', 'to', 'his', 'uncles', 'accusations', 'of', 'his', 'blemishing', 'the', 'family', 'name', 'leopold', 'counters', 'that', 'the', 'new', 'nobility', 'is', 'to', 'be', 'found', 'in', 'those', 'who', 'pursue', 'initiatives', 'hence', 'his', 'interest', 'in', 'the', 'sciences', 'and', 'inventions', 'one', 'day', 'the', 'duke', 'finds', 'stuart', 'besser', 'liev', 'schreibe

In [6]:
# Load the English stopword list (this rebinds the name `stopwords` imported from nltk.corpus)
stopwords = nltk.corpus.stopwords.words('english')

In [7]:
# Load first names so they can be treated as stopwords
with open("names-first.txt", "r") as f:
    sw_firstnames = [line.strip('\n').lower() for line in f]

In [8]:
# Add the names to the stopword list
stopwords.extend(sw_firstnames)

In [9]:
# Function to remove stopwords
from gensim.utils import simple_preprocess
def remove_stopwords(texts):
    ''' Removes stopwords from each tokenized document '''
    stop_set = set(stopwords)  # set membership is much faster than scanning the list
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_set] for doc in texts]

In [10]:
# Apply the stopword-removal function
data_words_nonstop = remove_stopwords(data_words)
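
The filtering step can be illustrated on a toy document. This is a sketch only: the helper name `remove_stopwords_from_tokens` and the three-word stopword list are made up for the example, and it skips the `simple_preprocess` call used in the notebook.

```python
def remove_stopwords_from_tokens(docs, stop_words):
    """Drop every token that appears in stop_words from each tokenized document."""
    stop_set = set(stop_words)  # constant-time membership checks
    return [[word for word in doc if word not in stop_set] for doc in docs]

# Toy data: function words plus a first name act as stopwords
docs = [["the", "duke", "leopold", "builds", "an", "elevator"]]
stop_words = ["the", "an", "leopold"]
print(remove_stopwords_from_tokens(docs, stop_words))
```

Adding first names to the stopword list keeps character names from dominating the topics, since a name says little about genre.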

In [11]:
# Lemmatization function
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'ADV']): #, 'VERB']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Initialize the spaCy 'en' model, disabling the parser and NER components for efficiency
# Run in terminal: python3 -m spacy download en
# (the 'en' shortcut is spaCy 2.x; in spaCy 3+ the model is loaded as 'en_core_web_sm')
nlp = spacy.load('en', disable=['parser', 'ner'])

# Lemmatize, keeping only NOUN, ADJ and ADV (VERB is left out)
data_lemmatized = lemmatization(data_words_nonstop, allowed_postags=['NOUN', 'ADJ', 'ADV']) #, 'VERB'])

print(data_lemmatized[:2])

['design primitive elevator small model device strict uncle characterise sign leopold monarchy family depleted response uncle accusation family name new nobility pursue initiative hence interest science day duke physicist great great grandson study schematic diagram photograph early roebling word erection try unfinished bridge temporal century travel first portal temporal universe inside apartment open week later dog elevator shaft eventually scientific discovery stuart book unintentional time disruption elevator century patent device cynical ambitious stuart ex girlfriend apartment career field woman state librarian demand dog walk overwhelmed roebling still apartment befriend brother actor actor steadfast character romantically tour commercial kate client product disgusting argue integrity bristling countering connection reality realising time together nearly evening contemplation suddenly apartment mental hospital back time afterwards notice photo leopold ball realise back night pro

In [12]:
# Build the document-word matrix
vectorizer = CountVectorizer(analyzer='word',       
                             min_df=10,                        # Keep words that appear in at least 10 documents
                             stop_words='english',             # Remove stopwords
                             lowercase=True,                   # Convert words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # Tokens of at least 3 characters
                             # max_features=50000,             
                            )

data_vectorized = vectorizer.fit_transform(data_lemmatized)

In [13]:
# Check sparsity

# Materialize the sparse data
data_dense = data_vectorized.todense()

# The value reported is the percentage of non-zero cells
print("Sparsity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsity:  1.4379054618092122 %


In [14]:
# Build the LDA model
lda_model = LatentDirichletAllocation(n_components=20,           # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # Number of docs in each learning iteration
                                      evaluate_every = -1,       # Compute perplexity every n iterations; -1 disables it
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=-1,
                          perp_tol=0.1, random_state=100, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)


In [15]:
# Model performance: log-likelihood and perplexity
# Log-likelihood: higher is better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: lower is better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

# Model parameters
pprint(lda_model.get_params())

Log Likelihood:  -13655493.759704186
Perplexity:  2307.5162596075597
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 20,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
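
The relation quoted in the cell, perplexity = exp(-1 * log-likelihood per word), links the two printed numbers. The sketch below assumes that the reported score is the total (approximate) log-likelihood and that "per word" means dividing by the total token count of the document-word matrix; the function names are hypothetical.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity from a total log-likelihood and a token count."""
    return math.exp(-log_likelihood / n_tokens)

def implied_token_count(log_likelihood, perplexity_value):
    """Invert the relation: the token count implied by a score/perplexity pair."""
    return -log_likelihood / math.log(perplexity_value)

# Values printed by the notebook for the 20-topic model
ll = -13655493.759704186
ppl = 2307.5162596075597

n_tokens = implied_token_count(ll, ppl)
print(round(n_tokens))           # token count implied by the two outputs
print(perplexity(ll, n_tokens))  # recovers the reported perplexity
```

Since the log-likelihood grows with corpus size, perplexity is the easier of the two to compare across datasets of different lengths.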


In [16]:
# Grid search over the LDA model
# Define the search parameters
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Initialize the model
lda = LatentDirichletAllocation()
model = GridSearchCV(lda, param_grid=search_params)

# Run the grid search (took about 50 minutes)
model.fit(data_vectorized)

GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='batch',
                                                 learning_offset=10.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=10,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1,
                                                 random_state=None,
                                                 topic_word_prior=None,
                                   

In [17]:
# Inspect the best model and its parameters
# Best model
best_lda_model = model.best_estimator_

# Parameters
print("Parameters: ", model.best_params_)

# Log-likelihood score (mean over the cross-validation folds)
print("Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Perplexity: ", best_lda_model.perplexity(data_vectorized))

Parameters:  {'learning_decay': 0.7, 'n_components': 10}
Log Likelihood Score:  -4684292.774862733
Perplexity:  2256.7284995747204
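
To see why the search is slow, it helps to count the model fits it triggers. The sketch below enumerates the grid with the standard library only. The number of folds is an assumption: `cv='warn'` in the printed estimator suggests a scikit-learn version whose default was 3-fold cross-validation.

```python
from itertools import product

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Every combination of the two parameter lists is one candidate model
names = sorted(search_params)
candidates = [dict(zip(names, values))
              for values in product(*(search_params[name] for name in names))]

n_folds = 3  # assumption: default implied by cv='warn'
total_fits = len(candidates) * n_folds + 1  # +1 for the final refit on all the data
print(len(candidates), total_fits)
```

Under the same assumption, each fold is scored on roughly a third of the documents, which would explain why the cross-validated log-likelihood is much smaller in magnitude than the score computed on the full matrix. The two numbers are therefore not directly comparable.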


In [18]:
# Get the most relevant topic for each document

# Create the document-topic matrix
lda_output = best_lda_model.transform(data_vectorized)

#Column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

#Index names
docnames = ["Doc" + str(i) for i in range(len(data))]

# Build a DataFrame
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get the dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply styling
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
Doc0,0.19,0.0,0.15,0.14,0.0,0.14,0.06,0.0,0.31,0.0,8
Doc1,0.0,0.0,0.04,0.43,0.52,0.0,0.0,0.0,0.0,0.0,4
Doc2,0.0,0.3,0.0,0.13,0.0,0.0,0.0,0.16,0.0,0.41,9
Doc3,0.55,0.35,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0
Doc4,0.0,0.13,0.87,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
Doc5,0.28,0.11,0.0,0.2,0.4,0.0,0.0,0.0,0.0,0.0,4
Doc6,0.0,0.0,0.0,0.0,0.98,0.0,0.0,0.0,0.0,0.0,4
Doc7,0.0,0.48,0.14,0.0,0.0,0.2,0.0,0.0,0.18,0.0,1
Doc8,0.0,0.0,0.66,0.0,0.03,0.0,0.26,0.0,0.06,0.0,2
Doc9,0.13,0.13,0.0,0.0,0.0,0.33,0.0,0.41,0.0,0.0,7
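
The `dominant_topic` column is simply the position of the largest weight in each row. A small sketch in plain Python, using three rows copied from the table (the helper name `dominant_topic` is reused here as a function for illustration):

```python
def dominant_topic(row):
    """Index of the largest topic weight (first index wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])

# Rows taken from the document-topic table
doc_topic = {
    "Doc0": [0.19, 0.0, 0.15, 0.14, 0.0, 0.14, 0.06, 0.0, 0.31, 0.0],
    "Doc1": [0.0, 0.0, 0.04, 0.43, 0.52, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Doc4": [0.0, 0.13, 0.87, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

for name, row in doc_topic.items():
    top = dominant_topic(row)
    print(name, "-> topic", top, "with weight", row[top])
```

The weight of the winning topic is worth keeping next to the label: Doc4 is assigned to its topic with weight 0.87, while Doc0 is spread over five topics and its dominant one reaches only 0.31.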


In [19]:
# Review how documents are distributed across topics
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

Unnamed: 0,Topic Num,Num Documents
0,3,1926
1,1,1053
2,6,1011
3,9,976
4,5,855
5,2,744
6,4,694
7,0,637
8,8,290
9,7,113
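
Raw counts are easier to read as shares of the corpus. A minimal sketch using the counts from the table (plain Python, no pandas):

```python
# Dominant-topic counts copied from the table: {topic: number of documents}
counts = {3: 1926, 1: 1053, 6: 1011, 9: 976, 5: 855,
          2: 744, 4: 694, 0: 637, 8: 290, 7: 113}

total = sum(counts.values())
shares = {topic: n / total for topic, n in counts.items()}

print("documents:", total)
for topic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Topic {topic}: {share:.1%}")
```

The counts add up to the full corpus, and the distribution is uneven: Topic 3 is the dominant topic far more often than any other, while Topic 7 covers only a small group of documents.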


In [20]:
# Visualize with pyLDAvis
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer, mds='tsne')
pyLDAvis.save_html(panel, 'lda.html') # Save the chart as HTML
panel

In [21]:
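# Get the top 15 keywords per topic

def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    # get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
    keywords = np.array(vectorizer.get_feature_names())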
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,time,woman,wedding,friend,family,day,life,home,new,night,later,wife,girl,child,daughter
Topic 1,group,body,human,dead,creature,attack,death,away,water,town,man,child,way,head,time
Topic 2,ship,team,crew,order,plane,time,bond,soldier,man,attack,helicopter,earth,mission,human,machine
Topic 3,mother,friend,home,school,family,day,year,later,life,father,time,child,old,girl,parent
Topic 4,police,car,murder,killer,gun,case,body,officer,dead,money,cop,death,scene,woman,detective
Topic 5,game,team,year,time,day,life,family,later,war,new,people,wife,film,man,death
Topic 6,room,car,door,away,girl,home,woman,time,night,house,head,phone,hand,day,window
Topic 7,prison,power,guard,snake,fight,mutant,woody,prisoner,toy,martial,japanese,barne,time,master,tournament
Topic 8,film,movie,vampire,band,time,new,scene,music,stage,audience,character,year,studio,life,night
Topic 9,police,man,money,car,agent,gun,drug,escape,shoot,officer,truck,later,plan,group,order
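
Some topics share many keywords, which makes them hard to map to distinct genres. One rough way to quantify this is the Jaccard overlap between keyword sets. The sketch below compares Topic 4 and Topic 9 from the table; the choice of Jaccard similarity is an illustration, not something the notebook computes.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Keyword lists copied from the table
topic_4 = ["police", "car", "murder", "killer", "gun", "case", "body", "officer",
           "dead", "money", "cop", "death", "scene", "woman", "detective"]
topic_9 = ["police", "man", "money", "car", "agent", "gun", "drug", "escape",
           "shoot", "officer", "truck", "later", "plan", "group", "order"]

shared = sorted(set(topic_4) & set(topic_9))
print(shared)
print(jaccard(topic_4, topic_9))
```

Both topics revolve around crime vocabulary, so a plot assigned to either of them may well belong to the same genre.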


### Predicting topics for new text

In [22]:
# Order of transformations
# sent_to_words() -> lemmatization() -> vectorizer.transform() -> best_lda_model.transform()

nlp = spacy.load('en', disable=['parser', 'ner'])

def predict_topic(text, nlp=nlp):
    # 0: Remove unnecessary characters
    text1 = [re.sub(r'\s+', ' ', sent) for sent in text]  # Collapse newlines and repeated whitespace
    text1 = [re.sub(r"'", "", sent) for sent in text1]    # Remove apostrophes
    
    # 1: Clean the text with simple_preprocess
    mytext_2 = list(sent_to_words(text1))
    
    # 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'ADV']) #, 'VERB'])

    # 3: Vectorize
    mytext_4 = vectorizer.transform(mytext_3)

    # 4: LDA transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores

# Predict topic
mytext = ["Some text about christianity and bible"]
topic, prob_scores = predict_topic(text = mytext)
print(topic)

['room', 'car', 'door', 'away', 'girl', 'home', 'woman', 'time', 'night', 'house', 'head', 'phone', 'hand', 'day', 'window']


In [46]:
# Predict topic
apocalipsisnow = ['The story opens in Saigon South Vietname late in 1969. U.S. Army Captain and special operations veteran Benjamin L. Willard (Martin Sheen) has returned to Saigon on another combat tour during the Vietnam War casually admitting that he is unable to rejoin society in the USA and that his marriage has broken up. He drinks heavily chain-smokes and hallucinates alone in his room becoming very upset and injuring himself when he breaks a large mirror. One day two military policemen arrive at Williards Saigon apartment and after cleaning him up escort him to an officers trailer where military intelligence officers Lt. General R. Corman (G. D. Spradlin) and Colonel Lucas (Harrison Ford) approach him with a top-secret assignment to follow the Nung River into the remote jungle find rogue Special Forces Colonel Walter E. Kurtz (Marlon Brando) and terminate his command with extreme prejudice. Kurtz apparently went insane and now commands his own Montagnard troops inside neutral Cambodia. They play a recording of Kurtz voice captured by Army intelligence where Kurtz rambles about the destruction of the war and a snail crawling on the edge of a straight razor. Willard is flown to Cam Ram Bay and joins a Navy PBR commanded by Chief (Albert Hall) and crewmen Lance (Sam Bottoms) Chef (Frederic Forrest) and Mr. Clean (Laurence Fishburne). Williard narrates that the crew are mostly young soldiers; Clean is only 17 and from the South Bronx Lance is a famous surfer from California and Chef is a chef from New Orleans. The Chief is an experienced sailor who mentions that hed previously brought another special operations soldier into the jungles of Vietnam on a similar mission and heard that the man committed suicide. As they travel down the coast to the mouth of the Nung River Willards voiceover reveals that hearing Kurtz voice triggered a fascination with Kurtz himself.']

topic, prob_scores = predict_topic(text = apocalipsisnow)
print(topic)

['ship', 'team', 'crew', 'order', 'plane', 'time', 'bond', 'soldier', 'man', 'attack', 'helicopter', 'earth', 'mission', 'human', 'machine']


In [24]:
# Serialize for future use
import pickle
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('lda_model.pkl', 'wb') as f:
    pickle.dump(best_lda_model, f)
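
The pickled files can later be read back with `pickle.load`. The sketch below shows the round trip on a toy dictionary standing in for a fitted model, written to a temporary directory so it does not touch the real `.pkl` files; the file name `artifact.pkl` is made up.

```python
import os
import pickle
import tempfile

# Toy stand-in for a fitted object such as the vectorizer or the LDA model
artifact = {"n_components": 10, "learning_decay": 0.7}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")
    # Write, then read back; the context managers close the file handles
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Pickles of library objects are generally only safe to load with the same library versions that wrote them, so it is worth recording the scikit-learn and gensim versions alongside the files.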

### Embeddings to categorize a topic

In [27]:
# Tokenize the lemmatized data
from nltk import word_tokenize, sent_tokenize
from tqdm import tqdm # Progress bar library

embeddings = []
for sent in tqdm(data_lemmatized):
    tokens = []
    for token in word_tokenize(sent):
        if token.isalpha():
            tokens.append(token)
    embeddings.append(tokens)

100%|█████████████████████████████████████| 8299/8299 [00:11<00:00, 749.10it/s]


In [29]:
print("the corpus has", len(embeddings), "documents and", sum(len(x) for x in embeddings), "words")

the corpus has 8299 documents and 2061972 words


In [38]:
from gensim.models.word2vec import Word2Vec
# "window" is the context window size: window=10 uses up to 10 words to the left and 10 to the right
# "n_dim" is the dimension (i.e. the length) of the word2vec vectors, passed as "size" (renamed "vector_size" in gensim 4)
# "workers" is the number of cores used in parallel (taking advantage of this requires Cython)
# "sample": threshold above which high-frequency words are randomly downsampled
# "min_count": word2vec drops words with fewer occurrences than "min_count"
# "sg": sg=1 runs the skip-gram model, sg=0 runs CBOW
# for more detail see: https://radimrehurek.com/gensim/models/word2vec.html
n_dim = 20
w2v_model = Word2Vec(embeddings, workers=4, size=n_dim, min_count=10, window=10, sample=1e-3, negative=10, sg=0)
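
The `window` parameter decides which neighbours count as context for each word. The sketch below is a simplified illustration in plain Python, with a hypothetical helper `context_words`; it is not gensim's implementation, it only shows which tokens fall inside a fixed window.

```python
def context_words(tokens, index, window):
    """Tokens within `window` positions to the left and right of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

# First lemmas of the first synopsis
tokens = ["design", "primitive", "elevator", "small", "model", "device"]

print(context_words(tokens, 2, window=1))  # neighbours of "elevator"
print(context_words(tokens, 0, window=2))  # at the start there is no left context
```

With `window=10` and synopses of a few hundred lemmas, each word is trained against a fairly wide stretch of the plot, so the resulting vectors tend to capture topical rather than syntactic similarity.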

In [43]:
# Serialize for future use
pickle.dump(w2v_model, open('w2v_model.pkl', 'wb'))

In [75]:
genero = 'romance'
palabra = 'love'

# Query the keyed vectors directly; calling most_similar on the model itself is deprecated
pprint(w2v_model.wv.most_similar(positive=[genero], negative=[], topn=10))
print('\n', genero,'-',palabra,'similarity:',w2v_model.wv.n_similarity([genero], [palabra]))

[('friendship', 0.9167739152908325),
 ('marrie', 0.9102799296379089),
 ('passion', 0.9080193638801575),
 ('lifestyle', 0.9061200618743896),
 ('romantic', 0.9050272703170776),
 ('poet', 0.9046342372894287),
 ('romantically', 0.9033544063568115),
 ('jealousy', 0.89674973487854),
 ('advice', 0.8896514177322388),
 ('mutual', 0.8895968198776245)]

 romance - love similarity: 0.3551698
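
The similarity printed here is a cosine similarity; for lists of words, `n_similarity` compares the mean vector of each list. The sketch below reproduces that computation in plain Python. The three-dimensional vectors are made up for illustration and are not the trained embeddings.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only
toy_vectors = {"romance": [1.0, 0.0, 0.0], "love": [1.0, 1.0, 0.0]}

genre_vec = mean_vector([toy_vectors["romance"]])
word_vec = mean_vector([toy_vectors["love"]])
print(cosine(genre_vec, word_vec))
```

In the trained model, "love" scores only about 0.36 against "romance" while words such as "friendship" and "passion" score above 0.9, so a genre label's nearest neighbours are not always the words one would expect.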
