# Intelligent Data Analysis Project (INF-390)
# TMDb and neural networks

### Team members
* Sebastián Aedo
* Diego Córdova

#### November 2017

## Case study

The dataset under study contains 19 features, plus an identifier, for close to 5,000 movies, including genre, rating, popularity, overview, and others. Most previous studies of this dataset try to predict a movie's success from its attributes. However, predictions that involve so many factors can require fairly complex techniques, so we aim for a smaller goal that is still a prediction task.

## Proposal

It seems intuitive that a movie's synopsis, or **overview**, is directly related to its **genre**: the main purpose of an overview is to quickly convey the story and the genre it belongs to. Our goal is therefore to apply a technique that predicts a movie's genre from its overview.

To make this prediction we use a neural network model that takes the overview text as input and returns the corresponding genre.

## Objectives

* Implement a neural network model.
* Learn the parameter-tuning process.
* Learn


In [3]:
# Libraries used

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random as r
import json
r.seed(1234)


## Data extraction

First, we load the data ([Source](https://www.kaggle.com/tmdb/tmdb-movie-metadata/data)) using pandas.

In [23]:
df = pd.read_csv("tmdb_5000_movies.csv")

In [24]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null 

Here we only need the **genre** and the **overview** to train the neural network, so only these two columns are extracted.

In [26]:
print(df['genres'][0])
print(df['overview'][0])

[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.
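The `genres` column stores each movie's genres as a JSON string rather than as a Python list, so it has to be decoded before use. A minimal sketch, using the string printed above; the helper name `genre_names` is ours and is not part of the dataset or of pandas:

```python
import json

def genre_names(raw):
    """Decode a TMDb `genres` JSON string into a list of genre names."""
    return [entry["name"] for entry in json.loads(raw)]

# The value of df['genres'][0] shown above.
raw = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
       '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')
print(genre_names(raw))
```

An empty JSON list (`"[]"`) decodes to an empty Python list, which is the case the data-preparation step has to skip.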


## Preparing the training data

As a human reader, one can see an inherent relationship between the genre labels and the movie's description. Our text classifier, however, can assign only one label to each text. We assume that every label is representative of the movie, so the *categorizing* label is chosen at random among the movie's genres.

In [27]:
def generate_data():
    final_list = []
    for i, row in df.iterrows():
        temp_dict = dict()
        genres = json.loads(row['genres'])
        
        # skip movies with no genre, or with a missing (NaN) or too-short overview
        overview = row['overview']
        if len(genres) == 0 or not isinstance(overview, str) or len(overview) < 3:
            continue
            
        selected_genre = r.choice(genres)
        temp_dict['class'] = selected_genre['name']
        temp_dict['sentence'] = row['overview']
        final_list.append(temp_dict)
    return final_list

In [28]:
training_data = generate_data()

In [29]:
r.choice(training_data)

{'class': 'Animation',
 'sentence': 'Inventor Flint Lockwood creates a machine that makes clouds rain food, enabling the down-and-out citizens of Chewandswallow to feed themselves. But when the falling food reaches gargantuan proportions, Flint must scramble to avert disaster. Can he regain control of the machine and put an end to the wild weather before the town is destroyed?'}
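Because the label is drawn at random from each movie's genres, it can be useful to check how the resulting classes are distributed before training. The sketch below uses a tiny hand-written list in the same `{'class', 'sentence'}` layout as `training_data`; the rows are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for `training_data` (same dict layout, invented rows).
sample_data = [
    {"class": "Drama", "sentence": "A family struggles after a loss."},
    {"class": "Comedy", "sentence": "Two friends plan a disastrous wedding."},
    {"class": "Drama", "sentence": "A lawyer defends an innocent man."},
    {"class": "Action", "sentence": "A soldier must stop a bomb."},
]

# Count how many examples carry each label, most frequent first.
counts = Counter(item["class"] for item in sample_data)
for genre, n in counts.most_common():
    print(f"{genre:10s} {n:3d} {n / len(sample_data):.0%}")
```

Running the same loop over the real `training_data` would show whether a few genres dominate the labels.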

## Neural network.

Once `training_data` is ready, we build the neural network.

### Word stemming.

To obtain better results, each word is reduced to its root through stemming. This way, if a word is strongly associated with a certain genre, so are all the words of the same family.
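To illustrate the idea only, the sketch below uses a deliberately crude suffix-stripping function. It is not the Lancaster algorithm used in this notebook; the function `toy_stem` and its suffix list are assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ed", "er", "s")):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in suffixes:
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Words of one family collapse to the same token.
family = ["kill", "kills", "killed", "killing", "killer"]
print({w: toy_stem(w) for w in family})
```

A real stemmer applies a much larger rule set, but the effect on the vocabulary is the same in spirit: fewer distinct tokens, each seen more often.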

### Bag of words.

We use the bag-of-words method, which essentially pools all the words of a text, ignoring their original order and repetitions. Although word order and semantics can matter in cases like this, analyzing them is considerably more complex. Simple as it is, this method can give good results here, since we believe genres are tied to the keywords typically used to describe each one.
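As a small, self-contained illustration of this encoding, the sketch below builds a vocabulary from two toy sentences and turns each one into a binary vector. The helper names `build_vocabulary` and `to_bag` are ours, and the sentences are invented.

```python
def build_vocabulary(token_lists):
    """Sorted list of the distinct tokens seen in any document."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def to_bag(tokens, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the document."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

docs = [
    "the alien attacks the city".split(),
    "the city sleeps".split(),
]
vocab = build_vocabulary(docs)
print(vocab)
print([to_bag(d, vocab) for d in docs])
```

Note that "the" appears twice in the first sentence but still contributes a single 1: repetitions and word order are discarded.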


In [30]:
# use natural language toolkit
import nltk
from nltk.stem.lancaster import LancasterStemmer
import os
import json
import datetime
stemmer = LancasterStemmer()

In [31]:
words = []
classes = []
documents = []
ignore_words = ['?', '.', ',', "'", '"']
# loop through each sentence in our training data
for pattern in training_data:
    # tokenize each word in the sentence
    w = nltk.word_tokenize(pattern['sentence'])
    # add to our words list
    words.extend(w)
    # add to documents in our corpus
    documents.append((w, pattern['class']))
    # add to our classes list
    if pattern['class'] not in classes:
        classes.append(pattern['class'])

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = list(set(words))

# remove duplicates
classes = list(set(classes))

print (len(documents), "documents")
print (len(classes), "classes", classes)

4771 documents
19 classes ['Foreign', 'Family', 'Drama', 'Mystery', 'Fantasy', 'Crime', 'Action', 'Horror', 'Western', 'History', 'Documentary', 'Romance', 'Music', 'War', 'Animation', 'Science Fiction', 'Comedy', 'Thriller', 'Adventure']


In [32]:
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(classes)

# training set, bag of words for each sentence
for doc in documents:
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # stem each word; a set makes the membership test below fast
    pattern_words = {stemmer.stem(word.lower()) for word in pattern_words}
    # bag of words: 1 if the vocabulary word occurs in the pattern, else 0
    bag = [1 if w in pattern_words else 0 for w in words]

    training.append(bag)
    # output is a '0' for each tag and '1' for current tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    output.append(output_row)

# sample training/output
i = 0
w = documents[i][0]
print ([stemmer.stem(word.lower()) for word in w])
print (training[i])
print (output[i])

['in', 'the', '22nd', 'century', ',', 'a', 'parapleg', 'marin', 'is', 'dispatch', 'to', 'the', 'moon', 'pandor', 'on', 'a', 'un', 'miss', ',', 'but', 'becom', 'torn', 'between', 'follow', 'ord', 'and', 'protect', 'an', 'aly', 'civil', '.']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]

In [12]:
X = np.array(training)
y = np.array(output)

In [13]:
# 80/20 split, keeping the original row order (no shuffling)
split = int(len(X) * 0.8)
train_X, test_X = X[:split], X[split:]
train_Y, test_Y = y[:split], y[split:]

In [14]:
print(train_X.shape)
print(test_X.shape)

(3816, 15007)
(955, 15007)
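The split above takes the first 80% of the rows in file order. If the CSV happens to be sorted by some attribute, the last 20% may not be representative; we have not verified whether that is the case. A possible alternative, sketched here with toy arrays, is to shuffle the rows with a fixed seed before splitting. The function name `shuffled_split` is ours.

```python
import numpy as np

def shuffled_split(X, y, train_fraction=0.8, seed=1234):
    """Shuffle rows with a fixed seed, then split into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(len(X) * train_fraction)
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy arrays standing in for the bag-of-words matrix and the labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
tr_X, tr_y, te_X, te_y = shuffled_split(X_demo, y_demo)
print(tr_X.shape, te_X.shape)
```

Indexing `X` and `y` with the same permutation keeps every feature row paired with its own label.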


In [15]:
print(classes)

['Foreign', 'Family', 'Drama', 'Mystery', 'Fantasy', 'Crime', 'Action', 'Western', 'History', 'Horror', 'Documentary', 'Romance', 'Music', 'War', 'Animation', 'Science Fiction', 'Comedy', 'Thriller', 'Adventure']


In [34]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(1024, input_dim=train_X.shape[1], kernel_initializer='uniform', activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(1024, kernel_initializer='uniform', activation="relu"))
model.add(Dropout(0.1))
#model.add(Dense(512, kernel_initializer='uniform', activation="relu"))
model.add(Dense(train_Y.shape[1]))
model.add(Activation("softmax"))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
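
The softmax output and the `categorical_crossentropy` loss used above can be illustrated in plain NumPy. This is a sketch of the underlying formulas, not of Keras internals, and the scores are invented for the example.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return float(-np.sum(one_hot * np.log(probs + eps)))

logits = np.array([2.0, 1.0, 0.1])   # invented scores for three classes
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])   # the true class is the first one
print(probs, categorical_crossentropy(target, probs))
```

The loss is small when the network puts most of the probability on the true class and grows as that probability approaches zero.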

## Training the model.

In [35]:
model.fit(train_X, train_Y, epochs=5, batch_size=256, verbose=1, shuffle=True)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f24e1775748>

## Testing the model.
We now evaluate the model on the held-out test cases.

In [36]:
score=model.evaluate(test_X, test_Y, verbose=1)
print("\nLoss: %.3f \t Accuracy: %.3f" % (score[0], score[1]))


Loss: 2.760 	 Accuracy: 0.285


The model reaches an accuracy of 28.5%, just under 30%. This is a start, but it leaves considerable room for improvement.
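To put these figures in context, it helps to compare them with what uninformed guessing would achieve. A small sketch, assuming 19 equally likely classes; the real class distribution is not uniform, so a majority-class baseline could be higher.

```python
import math

n_classes = 19
chance_accuracy = 1 / n_classes       # uniform random guessing
uniform_loss = math.log(n_classes)    # cross-entropy of a uniform prediction

print(f"chance accuracy: {chance_accuracy:.3f}")
print(f"uniform-prediction loss: {uniform_loss:.3f}")
```

Against these reference points (about 0.053 and 2.944), the reported accuracy of 0.285 is well above chance, while the loss of 2.760 is only slightly below that of a uniform prediction.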

In [37]:
def clean_up_sentence(sentence):
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words

def bow(sentence, words, show_details=False):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    bag = [0]*len(words)  
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s: 
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
                    
    return(np.array(bag))

In [38]:
def predict(t, threshold=0.08):
    bag = bow(t, words)
    # one row whose width is the vocabulary size (avoids hard-coding 15007)
    bag = np.array(bag).reshape(1, len(words))
    l = sorted(zip(classes, model.predict(bag)[0]), key=lambda x: x[1], reverse=True)
    return [i for i in l if i[1] > threshold]
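
The ranking and filtering step inside `predict` can be tried without a trained model. The sketch below applies the same logic to invented probabilities; the function name `top_classes` and the sample values are assumptions made for this example.

```python
def top_classes(class_names, probabilities, threshold=0.08):
    """Return (class, probability) pairs above the threshold, best first."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [pair for pair in ranked if pair[1] > threshold]

# Invented probabilities for four classes, for illustration only.
names = ["Action", "Comedy", "Drama", "Horror"]
probs = [0.05, 0.60, 0.25, 0.10]
print(top_classes(names, probs))
```

Classes at or below the threshold are dropped, so a confident prediction returns a single genre and an uncertain one returns several.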

## Tests and results.



In [39]:
test = ["A family goes out for vacations with their kids to enjoy the life", 
        "A bat have to save the world against superman",
       "An Alien arrived to my home and tried to kill me",
       "A haunted house is in front of my window",
       "A police man is killing someone"]

for t in test:
    print(t, predict(t))
    print()

A family goes out for vacations with their kids to enjoy the life [('Comedy', 0.89271015)]

A bat have to save the world against superman [('Fantasy', 0.15458329), ('Science Fiction', 0.1256789), ('Comedy', 0.091306552), ('Family', 0.08567708)]

An Alien arrived to my home and tried to kill me [('Comedy', 0.1661043), ('Science Fiction', 0.14578521), ('Fantasy', 0.10033672)]

A haunted house is in front of my window [('Drama', 0.12379406), ('Mystery', 0.11889305), ('Comedy', 0.096957155), ('History', 0.094010495), ('Thriller', 0.091078192)]

A police man is killing someone [('Thriller', 0.35837236), ('Crime', 0.14874661), ('Action', 0.090221427)]



## Conclusions

We implemented a neural network using the material covered in class, together with some text-handling techniques such as stemming, and obtained concrete results from applying it. The predictions on our hand-written examples look reasonable. However, the accuracy on the test set is only about 30% (0.285). One likely cause is the random selection of a single genre per movie: a movie usually has several genres, so predicting any genre other than the one that was sampled counts as an error even when it is correct for that movie. Overfitting probably contributes as well, that is, the network was over-trained and has begun to fail on new cases.

The remaining challenges for the next deliverable are improving the accuracy, trying other methods for selecting the genres, tuning the parameters properly, and correcting the overfitting.
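
One possible direction for selecting genres differently is to keep all of a movie's genres as targets instead of sampling one, and to count a prediction as correct when it matches any of them. The sketch below only shows the encoding and the scoring rule; the helper names are ours and the class list is shortened for readability. Training on such targets would also require changing the output layer and loss (for example sigmoid outputs with binary cross-entropy), which is not shown here.

```python
def multi_hot(genres, class_names):
    """1 for every genre the movie belongs to, 0 elsewhere."""
    wanted = set(genres)
    return [1 if name in wanted else 0 for name in class_names]

def any_genre_hit(predicted, genres):
    """True if the predicted class is one of the movie's listed genres."""
    return predicted in set(genres)

class_names = ["Action", "Adventure", "Comedy", "Fantasy", "Science Fiction"]
avatar_genres = ["Action", "Adventure", "Fantasy", "Science Fiction"]

print(multi_hot(avatar_genres, class_names))
print(any_genre_hit("Fantasy", avatar_genres))
```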
