# Problem statement
A well-known company that produces movies for streaming distribution needs a mechanism to determine whether producing a given movie will generate enough revenue to guarantee an excellent operating margin. The solution must estimate how good a movie is likely to be based on its characteristics.

## Data
The movie information is split across 7 CSV files obtained from Kaggle. The data is semi-structured, since several of the CSVs contain columns in JSON-like format.

URL: https://www.kaggle.com/rounakbanik/the-movies-dataset

## Business objectives
- Produce movies that guarantee high revenue.
- Deliver content that helps retain existing customers and attract new ones.

# 1. Data exploration

In [1]:
import pandas as pd

encoding = 'iso-8859-1'    
delimiter = ','

creditsFile = '../the-movies-dataset/credits.csv'
keywordsFile = '../the-movies-dataset/keywords.csv'
linksFile = '../the-movies-dataset/links.csv'
linkssmallFile = '../the-movies-dataset/links_small.csv'
moviesFile = '../the-movies-dataset/movies_metadata.csv'
ratingFile = '../the-movies-dataset/ratings.csv'
ratingsmallFile = '../the-movies-dataset/ratings_small.csv'

credits = pd.read_csv(creditsFile, delimiter = delimiter, encoding = encoding)
keywords = pd.read_csv(keywordsFile, delimiter = delimiter, encoding = encoding)
links = pd.read_csv(linksFile, delimiter = delimiter, encoding = encoding)
links_small = pd.read_csv(linkssmallFile, delimiter = delimiter, encoding = encoding)
movies = pd.read_csv(moviesFile, delimiter = delimiter, encoding = encoding)
rating = pd.read_csv(ratingFile, delimiter = delimiter, encoding = encoding)
rating_small = pd.read_csv(ratingsmallFile, delimiter = delimiter, encoding = encoding)


## Credits

In [2]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


## Keywords

In [3]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


## Links and Links_small
The links.csv and links_small.csv files map each movie to its IMDb and TMDB identifiers. This information is useful for enriching the original source. However, these files are not used in the model development for now.

In [4]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
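As a rough illustration of how these identifiers could be used for enrichment, the sketch below maps `movieId` to `tmdbId` and then joins against the `id` column of the metadata file. It assumes that `movieId` in the ratings files is the same identifier as `movieId` in links.csv (the shared column name suggests so, but this notebook does not verify it). The frame names `links_sample`, `ratings_sample` and `metadata_sample` are hypothetical; the values are copied from the tables displayed in this notebook.

```python
import pandas as pd

# Sample rows copied from the tables displayed in this notebook.
links_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, 15602.0],
})
ratings_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "rating": [3.888157, 3.236953, 3.17555],
    "Total_Rating": [66008, 26060, 15497],
})
metadata_sample = pd.DataFrame({
    "id": [862, 8844, 15602],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
})

# Drop links without a TMDB id, then cast the float column to int.
id_map = links_sample.dropna(subset=["tmdbId"]).astype({"tmdbId": int})

# Chain: ratings (movieId) -> links (movieId, tmdbId) -> metadata (id).
ratings_by_tmdb = ratings_sample.merge(id_map[["movieId", "tmdbId"]], on="movieId")
enriched = metadata_sample.merge(
    ratings_by_tmdb, left_on="id", right_on="tmdbId", how="left"
)
print(enriched[["title", "rating", "Total_Rating"]])
```

If the assumption holds, this kind of mapping would let rating aggregates be attached to metadata rows through `tmdbId` rather than through a direct match on `movieId`.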


## Rating and Rating_small
The ratings.csv and ratings_small.csv files contain the movie ratings, the users who gave them, and the timestamp at which each rating was given.

In [6]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [7]:
rating_small.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Preparing the ratings file
For this project we use the larger ratings table, computing the mean rating per movie and the number of users who contributed to it. This information will be incorporated into the movie metadata file to complement it.

First, the data was checked for out-of-range values; all ratings fall within the 0 to 5 scale.
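The range check can also be expressed numerically, which complements a visual inspection. The following is a minimal sketch on a small hypothetical frame (`ratings_sample`, with values taken from the rating tables shown earlier); the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# A handful of rating values copied from the tables above.
ratings_sample = pd.DataFrame({"rating": [1.0, 4.5, 5.0, 5.0, 5.0, 2.5, 3.0]})

# True for each rating inside the closed interval [0, 5].
# Note: missing values (NaN) are reported as out of range by between().
in_range = ratings_sample["rating"].between(0, 5)
all_in_range = bool(in_range.all())
out_of_range_count = int((~in_range).sum())

print(ratings_sample["rating"].describe())
print("all in range:", all_in_range, "| out of range:", out_of_range_count)
```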

In [8]:
import seaborn as sb

sb.boxplot(y=rating['rating'])  # vertical boxplot of all rating values

<matplotlib.axes._subplots.AxesSubplot at 0x1c801844400>

Next, we build the table with the mean rating and the number of users who rated each movie.

In [9]:
rating_mean = rating.groupby(['movieId'])[['rating']].mean()
rating_total = rating.groupby(['movieId'])[['rating']].count().rename(columns={'rating':'Total_Rating'})

rating = pd.concat([rating_mean, rating_total], axis=1, join='inner')
rating.head()

movieId,rating,Total_Rating
1,3.888157,66008
2,3.236953,26060
3,3.17555,15497
4,2.875713,2981
5,3.079565,15258
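The same per-movie summary can be obtained in a single pass with named aggregation, which avoids the separate `concat` step. This is only a sketch on a hypothetical frame (`ratings_sample`); the first two rows come from the ratings table shown earlier and the remaining rows are invented for illustration.

```python
import pandas as pd

# First two rows are from the ratings table above; the rest are made up.
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [110, 147, 110, 147, 110],
    "rating": [1.0, 4.5, 3.0, 3.5, 5.0],
})

# One groupby producing both the mean rating and the number of ratings.
rating_summary = ratings_sample.groupby("movieId").agg(
    rating=("rating", "mean"),
    Total_Rating=("rating", "count"),
)
print(rating_summary)
```

The output columns are named to match the `rating` and `Total_Rating` columns used in this notebook.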


In [10]:
rating_mean.shape

(45115, 1)

In [11]:
rating_total.shape

(45115, 1)

In [12]:
rating.shape

(45115, 2)

## Movies_metadata
The movies_metadata file contains general information about the movies. However, some fields are stored in JSON-like format.

In [13]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


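### Flattening the JSON fields
For each JSON-formatted attribute, only the relevant information was kept (the name of the collection, and the name of the first listed genre, production company and production country), yielding a dataset without JSON in the attributes where it matters.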

In [14]:
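import json

def getData(j, l):
    # Parse a dict-like cell and return the value stored under key `l`.
    # Swapping quotes and replacing None makes the cell parseable as JSON;
    # cells that still fail to parse (or are NaN) yield ''.
    try:
        return json.loads(j.replace('\'', '\"').replace('None', '0'))[l]
    except Exception:
        return ''

def getData2(j, l):
    # Same as getData, but for list cells: read key `l` from the first element.
    try:
        return json.loads(j.replace('\'', '\"').replace('None', '0'))[0][l]
    except Exception:
        return ''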

movies['belongs_to_collection'] = movies['belongs_to_collection'].apply(lambda x: getData(x, "name"))
movies['genres'] = movies['genres'].apply(lambda x: getData2(x, "name"))
movies['production_companies'] = movies['production_companies'].apply(lambda x: getData2(x, "name"))
movies['production_countries'] = movies['production_countries'].apply(lambda x: getData2(x, "name"))

movies.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,Toy Story Collection,30000000,Animation,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,Adventure,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,Grumpy Old Men Collection,0,Romance,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,Comedy,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,Father of the Bride Collection,0,Comedy,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
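The cells displayed above use single quotes, so they look like Python literals rather than strict JSON. Under that assumption, `ast.literal_eval` is a possible alternative to the quote replacement, since it also copes with names that contain an apostrophe. The helper `first_name` below is hypothetical and only a sketch; the last example string is invented to show the apostrophe case.

```python
import ast

def first_name(cell, default=""):
    """Return the 'name' of a dict cell, or of the first dict in a list cell."""
    if not isinstance(cell, str):
        # NaN and other non-string cells have nothing to parse.
        return default
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Empty or malformed cells end up here.
        return default
    if isinstance(parsed, dict):
        return parsed.get("name", default)
    if isinstance(parsed, list) and parsed and isinstance(parsed[0], dict):
        return parsed[0].get("name", default)
    return default

genres_cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
collection_cell = "{'id': 10194, 'name': 'Toy Story Collection'}"
# Hypothetical value with an apostrophe inside the name.
company_cell = "[{'id': 1, 'name': \"Director's Cut Films\"}]"

print(first_name(genres_cell))
print(first_name(collection_cell))
print(first_name(company_cell))
print(repr(first_name(float("nan"))))
```

Like the functions in the cell above, this keeps only the first element of list-valued fields; keeping every genre or company would need a different representation.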


### Adding the rating fields

In [20]:
movies.rename(index=str, columns={"id": "movieId"}, inplace = True)
movies.set_index("movieId").head()

movieId,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
862,False,Toy Story Collection,30000000,Animation,http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
8844,False,,65000000,Adventure,,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
15602,False,Grumpy Old Men Collection,0,Romance,,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
31357,False,,16000000,Comedy,,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.85949,...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
11862,False,Father of the Bride Collection,0,Comedy,,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.38752,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


#### Cleaning invalid ids
To make the join possible, the rows whose id cannot be cast to an integer are removed.

In [16]:
movies.drop(movies[movies['movieId'] == '1997-08-20'].index, inplace = True) 
movies.drop(movies[movies['movieId'] == '2012-09-29'].index, inplace = True)
movies.drop(movies[movies['movieId'] == '2014-01-01'].index, inplace = True)
movies['movieId'] = movies['movieId'].astype(int)
movies.shape

(45463, 24)
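Instead of listing each malformed id by hand, the same cleanup could be sketched generically with `pd.to_numeric(..., errors="coerce")`, which turns non-numeric ids into NaN so they can be filtered out. The frame `movies_sample` and its `title` values for the bad rows are hypothetical; the ids are taken from the cell above.

```python
import pandas as pd

# Ids as strings, including two of the date-like values dropped above.
movies_sample = pd.DataFrame({
    "movieId": ["862", "8844", "1997-08-20", "15602", "2012-09-29"],
    "title": ["Toy Story", "Jumanji", None, "Grumpier Old Men", None],
})

# Non-numeric ids become NaN instead of raising an error.
numeric_id = pd.to_numeric(movies_sample["movieId"], errors="coerce")

# Keep the rejected rows for inspection, and cast the surviving ids to int.
bad_rows = movies_sample[numeric_id.isna()]
clean = movies_sample[numeric_id.notna()].copy()
clean["movieId"] = numeric_id[numeric_id.notna()].astype(int)

print(bad_rows)
print(clean)
```

Inspecting `bad_rows` before discarding them helps confirm that only malformed records are being removed.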

#### Verifying the join between datasets
Ratings do not exist for every record in the movies dataset: only about 16.6% of the records have a match. For this reason the ratings information will not be used; the vote_average and vote_count fields will be used instead.

In [17]:
movies_rating = movies.merge(rating,left_on = "movieId", right_on="movieId")
movies_rating.shape

(7569, 26)
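The share of matched records can be computed directly rather than read off the shapes. The sketch below uses a left join with `indicator=True` on hypothetical frames (`movies_sample`, `rating_sample`); the movie ids come from the tables above, while the rating values are invented.

```python
import pandas as pd

movies_sample = pd.DataFrame({"movieId": [862, 8844, 15602, 31357, 11862]})
# Hypothetical rating aggregates; only id 862 also appears in movies_sample.
rating_sample = pd.DataFrame({
    "movieId": [1, 2, 862],
    "rating": [3.9, 3.2, 3.5],
})

# A left join keeps every movie; _merge records whether a rating row matched.
checked = movies_sample.merge(
    rating_sample, on="movieId", how="left", indicator=True
)
coverage = (checked["_merge"] == "both").mean()
print(f"{coverage:.2%} of movies have a rating row")
```

A left join is used here so that the denominator stays the full movies table, which an inner join would hide.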

## Final dataset

In [18]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,movieId,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,Toy Story Collection,30000000,Animation,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,Adventure,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,Grumpy Old Men Collection,0,Romance,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,Comedy,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,Father of the Bride Collection,0,Comedy,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [19]:
list(movies.keys())

['adult',
 'belongs_to_collection',
 'budget',
 'genres',
 'homepage',
 'movieId',
 'imdb_id',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'video',
 'vote_average',
 'vote_count']