# <h1 align=center> **INDIVIDUAL PROJECT No. 1** </h1>
# <h1 align=center>**`Machine Learning Operations (MLOps) Engineer`**</h1>

This project solves a business problem with ML: its goal is to build a recommendation system for a streaming platform (series and movies). It starts with Data Engineering (ETL) work, because the data arrive nested and untransformed when ingested from the two source files: movies_dataset.csv and credits.csv.

## **DATA ENGINEER (ETL)**

Import the required libraries.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import ast

>**`Dataset 'credits.csv'`**

Load the 'credits.csv' dataset into a dataframe using the pandas library.

In [2]:
data_credits = pd.read_csv('credits.csv', encoding='UTF-8')
data_credits

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


The dataset contains 45476 records and 3 columns, of which only 'crew' and 'id' are needed. The 'crew' column holds a list of dictionaries, and the only field of interest in it is the director's name. Before processing each column, we check its data type and whether it has null values:

In [3]:
data_credits = data_credits[['crew', 'id']]
data_credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   crew    45476 non-null  object
 1   id      45476 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 710.7+ KB
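
The same check can be done programmatically rather than by reading the `info()` report. The following is a minimal sketch on a made-up toy dataframe (`toy` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical toy frame standing in for data_credits[['crew', 'id']]
toy = pd.DataFrame({
    "crew": ["[{'job': 'Director', 'name': 'A'}]", "[]", None],
    "id": [1, 2, 3],
})

dtypes = toy.dtypes              # data type of each column
null_counts = toy.isna().sum()   # number of null values per column

print(dtypes)
print(null_counts)
```

Having the null counts as a Series makes it easy to compare them against the total number of rows in later steps.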


Both columns have 45476 non-null values. As for data types, 'crew' is of type 'object', which here means a sequence of characters (string), and 'id' is an integer. The 'id' column must not contain duplicates, since it is the unique identifier of each movie, so duplicates are discarded:

In [5]:
# Keep the first occurrence of each id; .copy() makes the result an independent dataframe
data_credits = data_credits[~data_credits['id'].duplicated()].copy()
data_credits

Unnamed: 0,crew,id
0,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...
45471,"[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506
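
To make the effect of the duplicate filter concrete, here is a small sketch on a toy dataframe (the `toy` frame and its 'crew' values are made up for illustration). It shows that `duplicated()` flags every repetition after the first occurrence, and that `drop_duplicates(subset=...)` gives the same result:

```python
import pandas as pd

# Hypothetical toy data: id 862 appears twice
toy = pd.DataFrame({"id": [862, 8844, 862], "crew": ["a", "b", "c"]})

# duplicated() marks every repeat of an id after its first occurrence
mask = toy["id"].duplicated()
deduped = toy[~mask]

# drop_duplicates(subset=...) is an equivalent one-liner (keeps the first occurrence by default)
same = toy.drop_duplicates(subset="id")

print(deduped)
```

Either form keeps the original index labels, so the index has gaps wherever a duplicate was removed.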


After this step no duplicate ids remain. As the `info()` output above shows, neither column contained null values.

To process the 'crew' column, the 'literal_eval' function from the 'ast' library is used; it converts each record from a string into a Python list of dictionaries, which is convenient to work with. Then, to extract the director's name, the pandas 'explode()' method turns each element of the list into its own row, repeating the original index values:

In [6]:
data_credits['crew'] = data_credits['crew'].apply(ast.literal_eval)
data_credits_job =  data_credits.explode('crew')['crew'].apply(pd.Series)[['job', 'name']]
data_credits_job

Unnamed: 0,job,name
0,Director,John Lasseter
0,Screenplay,Joss Whedon
0,Screenplay,Andrew Stanton
0,Screenplay,Joel Cohen
0,Screenplay,Alec Sokolow
...,...,...
45473,Original Music Composer,Richard McHugh
45473,Director of Photography,João Fernandes
45474,Director,Yakov Protazanov
45474,Producer,Joseph N. Ermolieff
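
The parse-then-explode step can be followed on a two-row toy example. This is a sketch under simplifying assumptions: the `toy` frame keeps only the 'job' and 'name' keys of each crew entry, with names taken from the output above:

```python
import ast
import pandas as pd

# Toy stand-in for data_credits; each 'crew' value is a string, as in the CSV
toy = pd.DataFrame({
    "id": [862, 227506],
    "crew": [
        "[{'job': 'Director', 'name': 'John Lasseter'}, "
        "{'job': 'Screenplay', 'name': 'Joss Whedon'}]",
        "[{'job': 'Director', 'name': 'Yakov Protazanov'}, "
        "{'job': 'Producer', 'name': 'Joseph N. Ermolieff'}]",
    ],
})

toy["crew"] = toy["crew"].apply(ast.literal_eval)  # str -> list of dicts
exploded = toy.explode("crew")                     # one row per crew member, index repeated
jobs = exploded["crew"].apply(pd.Series)[["job", "name"]]  # dict keys -> columns

print(jobs)
```

Note that the index of `jobs` is `[0, 0, 1, 1]`: each crew member keeps the label of the movie row it came from.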


The 'id' column of the original dataframe is attached to this new dataframe so that each crew member, and hence each director, can be matched to a movie:

In [7]:
data_credits_job['id'] = data_credits['id']
data_credits_job

Unnamed: 0,job,name,id
0,Director,John Lasseter,862
0,Screenplay,Joss Whedon,862
0,Screenplay,Andrew Stanton,862
0,Screenplay,Joel Cohen,862
0,Screenplay,Alec Sokolow,862
...,...,...,...
45473,Original Music Composer,Richard McHugh,67758
45473,Director of Photography,João Fernandes,67758
45474,Director,Yakov Protazanov,227506
45474,Producer,Joseph N. Ermolieff,227506
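
This assignment works because pandas aligns the assigned Series on index labels rather than on position, and the exploded dataframe repeats the label of its source row. A minimal sketch with hand-built toy objects (`ids` and `jobs` are illustrative, not the real data):

```python
import pandas as pd

# One id per original row
ids = pd.Series([862, 227506], index=[0, 1], name="id")

# Exploded frame: the labels 0 and 1 are repeated, as produced by explode()
jobs = pd.DataFrame(
    {
        "job": ["Director", "Screenplay", "Director"],
        "name": ["John Lasseter", "Joss Whedon", "Yakov Protazanov"],
    },
    index=[0, 0, 1],
)

jobs["id"] = ids  # aligned on index labels, so both rows labelled 0 receive 862

print(jobs)
```

If the exploded frame had been re-indexed before this step, the alignment would no longer match crew members to their movies.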


The 'job' column includes roles other than the movie's main director, so a filter (mask) is applied to keep only directors. Since a movie can have more than one director, they are grouped into a list per movie. Finally, the index of the new directors dataframe is reset and the 'name' column is renamed to 'director' to make it easier to understand.

In [8]:
data_credits_director = data_credits_job[data_credits_job['job'] == 'Director']
data_credits_director = data_credits_director.groupby('id')['name'].apply(list).to_frame().reset_index()
data_credits_director = data_credits_director.rename(columns={'name':'director'})
data_credits_director

Unnamed: 0,id,director
0,2,[Aki Kaurismäki]
1,3,[Aki Kaurismäki]
2,5,"[Allison Anders, Alexandre Rockwell, Robert Ro..."
3,6,[Stephen Hopkins]
4,11,[George Lucas]
...,...,...
44540,465044,"[Molly Smith, Maurice Smith]"
44541,467731,[Sidney Lumet]
44542,468343,[Jack Witikka]
44543,468707,[Hannaleena Hauru]
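
The filter-and-group step can be checked on a toy example. This is a sketch with a small hand-built `jobs` frame (an assumption for illustration) in which one movie has two directors:

```python
import pandas as pd

# Toy crew table: movie 5 has two directors, movie 862 has one
jobs = pd.DataFrame({
    "id": [862, 862, 5, 5],
    "job": ["Director", "Screenplay", "Director", "Director"],
    "name": ["John Lasseter", "Joss Whedon", "Allison Anders", "Alexandre Rockwell"],
})

directors = (
    jobs[jobs["job"] == "Director"]        # keep only rows whose job is Director
    .groupby("id")["name"].apply(list)     # collect the names of each movie into a list
    .reset_index()                         # turn the 'id' group key back into a column
    .rename(columns={"name": "director"})
)

print(directors)
```

The result is sorted by 'id' because `groupby` sorts group keys by default, which matches the ordering seen in the output above.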


This dataframe is exported in .csv format as a backup.

In [208]:
data_credits_director.to_csv('data_credits_director.csv')

>**`Dataset 'movies_dataset.csv'`**

Load the 'movies_dataset.csv' dataset into a dataframe using the pandas library.

In [42]:
data_movies = pd.read_csv('movies_dataset.csv', encoding='UTF-8')
pd.options.display.max_columns=0
data_movies


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


The dataset contains 45466 records and 24 columns.

Before processing each column, we check its data type and whether it has null values:

In [43]:
data_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

As with the 'credits.csv' dataset, the 'id' column must not contain duplicates, since it is the unique identifier of each movie, so duplicates are discarded and the index is reset.

In [44]:
data_movies = data_movies[~data_movies['id'].duplicated()].reset_index(drop=True)
data_movies

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45431,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45432,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45433,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45434,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


After the previous step, 45436 records remain in the dataframe.

Next, the columns 'belongs_to_collection', 'genres', 'production_companies', 'production_countries' and 'spoken_languages' are unnested. The first is a dictionary and the rest are lists of dictionaries.

`[belongs_to_collection]`

To process this column, the 'literal_eval' function from the 'ast' library is applied to convert each record (except NaN values) from a string into a dictionary, so that its content, a dictionary with 4 fields, can be unnested into a dataframe:

In [45]:
df_belongs_to_collection = data_movies['belongs_to_collection'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
df_belongs_to_collection = df_belongs_to_collection.apply(pd.Series).rename(columns = {'id':'id_belongs',
                                                                                       'name':'name_belongs',
                                                                                       'poster_path':'poster_path_belongs',
                                                                                       'backdrop_path':'backdrop_path_belongs'}).iloc[:,0:4]
df_belongs_to_collection

Unnamed: 0,id_belongs,name_belongs,poster_path_belongs,backdrop_path_belongs
0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg
1,,,,
2,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg
3,,,,
4,96871.0,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg
...,...,...,...,...
45431,,,,
45432,,,,
45433,,,,
45434,,,,
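
As an illustration of how a column of dictionary strings with gaps expands into columns, here is a minimal sketch on toy data. It uses a different construction from the cell above (building the dataframe directly from the dictionaries, with NaN rows replaced by empty dictionaries), and the two-key records are a simplification of the real four-key ones:

```python
import ast
import numpy as np
import pandas as pd

# Toy column: one collection record as a string, one missing value
raw = pd.Series([
    "{'id': 10194, 'name': 'Toy Story Collection'}",
    np.nan,
])

# Parse only the strings; NaN values pass through untouched
parsed = raw.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Build the frame from the dicts; a missing record becomes {} and yields an all-NaN row
expanded = pd.DataFrame(
    [d if isinstance(d, dict) else {} for d in parsed],
    index=parsed.index,
).add_suffix("_belongs")

print(expanded)
```

The 'id_belongs' column comes out as float because it has to hold NaN for the movies without a collection, which is consistent with the `10194.0` values shown above.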


The data type and null count of each column are as follows:

In [46]:
df_belongs_to_collection.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45436 entries, 0 to 45435
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id_belongs             4488 non-null   float64
 1   name_belongs           4488 non-null   object 
 2   poster_path_belongs    3945 non-null   object 
 3   backdrop_path_belongs  3261 non-null   object 
dtypes: float64(1), object(3)
memory usage: 1.4+ MB


A large share of the values in these columns is null (more than 90%).
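
The percentage can be verified from the counts in the `info()` output above. A short sketch of the arithmetic, using only those reported numbers:

```python
# Counts copied from the info() output: 45436 rows in total
total = 45436
non_null = {
    "id_belongs": 4488,
    "name_belongs": 4488,
    "poster_path_belongs": 3945,
    "backdrop_path_belongs": 3261,
}

# Null share per column, in percent
null_pct = {col: round(100 * (1 - n / total), 2) for col, n in non_null.items()}

print(null_pct)
```

On the actual dataframe, `df_belongs_to_collection.isna().mean() * 100` should give the same figures without typing the counts by hand.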

`[genres]`

To unnest this column, the 'replace()' method is used to replace a set of unwanted symbols and text fragments with an empty string. Then the 'split()' method splits the string into a list, where each word (separated by a ',') becomes an element. Finally, empty lists are converted to null records.

Note: with these methods the run time is lower than with the 'literal_eval' function.

In [47]:
df_genres = data_movies['genres'].str.replace(r"[0-9 ':[\]{}]", "", regex=True).str.replace('id,name', '')
df_genres = df_genres.str.split(',')
df_genres = df_genres.apply(lambda x: np.nan if x == [''] else x)
df_genres


0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45431                 [Drama, Family]
45432                         [Drama]
45433       [Action, Drama, Thriller]
45434                             NaN
45435                             NaN
Name: genres, Length: 45436, dtype: object
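
To see what the regular expression does to a single record, the sketch below applies the same two replacements to a toy Series and compares the result with a parsing-based alternative. The multi-word genre entry is an illustrative input chosen to show one side effect: because the character class includes the space, spaces inside a genre name are removed as well.

```python
import ast
import pandas as pd

# Toy records in the same format as the 'genres' column
raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 878, 'name': 'Science Fiction'}]",
    "[]",
])

# Same pattern as the cell above: strips digits, spaces, quotes, colons, brackets and braces
cleaned = raw.str.replace(r"[0-9 ':[\]{}]", "", regex=True)
cleaned = cleaned.str.replace("id,name", "", regex=False)
by_regex = cleaned.str.split(",")

# Parsing-based alternative: reads the 'name' key of each dictionary, so names stay intact
by_parse = raw.apply(lambda s: [d["name"] for d in ast.literal_eval(s)])

print(by_regex.tolist())
print(by_parse.tolist())
```

The choice is a trade-off: the string approach is the faster one according to the note above, while the parsing approach preserves multi-word names exactly.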

The data type and null count are as follows:

In [48]:
df_genres.info()

<class 'pandas.core.series.Series'>
RangeIndex: 45436 entries, 0 to 45435
Series name: genres
Non-Null Count  Dtype 
--------------  ----- 
42994 non-null  object
dtypes: object(1)
memory usage: 355.1+ KB


There are only 2442 null records.

`[production_companies]`

This column is unnested with the same procedure used for the 'genres' column. The 'for' loop removes the whitespace left in the list elements.

In [181]:
df_production_companies = data_movies['production_companies'].str.replace(r"[0-9':*[\]{}]", "", regex=True)
df_production_companies = df_production_companies.str.replace('name', '').str.replace(', id', '')
df_production_companies = df_production_companies.str.split(',')

for ipc in range(len(df_production_companies)):
    if type(df_production_companies[ipc]) == list:
        for npc in range(len(df_production_companies[ipc])):
            df_production_companies[ipc][npc] = df_production_companies[ipc][npc].strip()

df_production_companies = df_production_companies.apply(lambda x: np.nan if x == [''] else x)


df_production_companies

0                                [Pixar Animation Studios]
1        [TriStar Pictures, Teitler Film, Interscope Co...
2                           [Warner Bros., Lancaster Gate]
3                 [Twentieth Century Fox Film Corporation]
4             [Sandollar Productions, Touchstone Pictures]
                               ...                        
45431                                                  NaN
45432                                        [Sine Olivia]
45433                            [American World Pictures]
45434                                          [Yermoliev]
45435                                                  NaN
Name: production_companies, Length: 45436, dtype: object
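
The whitespace clean-up can also be expressed without explicit index loops. The following sketch, on a hand-built toy Series (an assumption for illustration), strips each element with a list comprehension inside `apply` and then turns the `['']` placeholder into NaN, as the cell above does:

```python
import numpy as np
import pandas as pd

# Toy Series after the split step: padded names, an empty record, and a missing value
companies = pd.Series([[" Warner Bros.", " Lancaster Gate "], [""], np.nan])

# Strip every element of each list; non-list values (NaN) pass through unchanged
stripped = companies.apply(
    lambda x: [s.strip() for s in x] if isinstance(x, list) else x
)

# A record that held no companies is left as [''] by split(); mark it as null
result = stripped.apply(lambda x: np.nan if x == [""] else x)

print(result)
```

This form avoids writing into the Series element by element, which tends to be slow on tens of thousands of rows.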

The data type and null count are as follows:

In [182]:
df_production_companies.info()

<class 'pandas.core.series.Series'>
RangeIndex: 45436 entries, 0 to 45435
Series name: production_companies
Non-Null Count  Dtype 
--------------  ----- 
33564 non-null  object
dtypes: object(1)
memory usage: 355.1+ KB


There are 11872 null records.

`[production_countries]`

This column is unnested with the same procedure used for the previous columns. The first 'for' loop removes the whitespace left in the list elements, and the second 'for' loop keeps only the elements at even positions, which correspond to the ISO codes of the countries.

In [187]:
df_production_countries = data_movies['production_countries'].str.replace(r"[0-9'.:[\]{}]", "", regex=True)
df_production_countries = df_production_countries.str.replace('name', '').str.replace('iso__', '')
df_production_countries = df_production_countries.str.split(',')

for ipco in range(len(df_production_countries)):
    if type(df_production_countries[ipco]) == list:
        for npco in range(len(df_production_countries[ipco])):
            df_production_countries[ipco][npco] = df_production_countries[ipco][npco].strip()
            
for ipco1 in range(len(df_production_countries)):
    lista_df_production_countries = []
    if type(df_production_countries[ipco1]) == list:
        for npco1 in range(len(df_production_countries[ipco1])):
            if npco1%2 == 0:
                lista_df_production_countries.append(df_production_countries[ipco1][npco1])
        df_production_countries[ipco1] = lista_df_production_countries[::]

df_production_countries = df_production_countries.apply(lambda x: np.nan if x == [''] else x)

df_production_countries

0        [US]
1        [US]
2        [US]
3        [US]
4        [US]
         ... 
45431    [IR]
45432    [PH]
45433    [US]
45434    [RU]
45435    [GB]
Name: production_countries, Length: 45436, dtype: object
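
The even-position selection is easier to follow on one record. In this sketch the toy input has two countries; after cleaning and splitting, the tokens alternate between ISO code and country name, so a slice with step 2 keeps only the codes. The input string is an illustrative assumption in the same format as the column:

```python
import pandas as pd

# Toy record with two countries
raw = pd.Series([
    "[{'iso_3166_1': 'US', 'name': 'United States of America'}, "
    "{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]"
])

# Same cleaning steps as the cell above
cleaned = raw.str.replace(r"[0-9'.:[\]{}]", "", regex=True)
cleaned = cleaned.str.replace("name", "", regex=False).str.replace("iso__", "", regex=False)

# Split and strip: tokens alternate code, name, code, name, ...
tokens = cleaned.str.split(",").apply(lambda xs: [s.strip() for s in xs])

# Positions 0, 2, 4, ... hold the ISO codes
codes = tokens.apply(lambda xs: xs[::2])

print(tokens[0])
print(codes[0])
```

The slice relies on every country contributing exactly two tokens, so a country name that itself contains a comma would shift the positions.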

The data type and null count are as follows:

In [189]:
df_production_countries.info()

<class 'pandas.core.series.Series'>
RangeIndex: 45436 entries, 0 to 45435
Series name: production_countries
Non-Null Count  Dtype 
--------------  ----- 
39151 non-null  object
dtypes: object(1)
memory usage: 355.1+ KB


There are 6285 null records.

`[spoken_languages]`

This column follows the same procedure used for the previous column ('production_countries').

In [193]:
df_spoken_languages = data_movies['spoken_languages'].str.replace(r"[0-9':[\]{}]", "", regex=True)
df_spoken_languages = df_spoken_languages.str.replace('name', '').str.replace('iso__', '')
df_spoken_languages = df_spoken_languages.str.split(',')

for isl in range(len(df_spoken_languages)):
    if isinstance(df_spoken_languages[isl], list):
        for nsl in range(len(df_spoken_languages[isl])):
            df_spoken_languages[isl][nsl] = df_spoken_languages[isl][nsl].strip()

for isl1 in range(len(df_spoken_languages)):
    lista_df_spoken_languages = []
    if type(df_spoken_languages[isl1]) == list:
        for nsl1 in range(len(df_spoken_languages[isl1])):
            if nsl1%2 == 0:
                lista_df_spoken_languages.append(df_spoken_languages[isl1][nsl1])
        df_spoken_languages[isl1] = lista_df_spoken_languages[::]

df_spoken_languages = df_spoken_languages.apply(lambda x: np.nan if x == [''] else x)
        
df_spoken_languages

0            [en]
1        [en, fr]
2            [en]
3            [en]
4            [en]
           ...   
45431        [fa]
45432        [tl]
45433        [en]
45434         NaN
45435        [en]
Name: spoken_languages, Length: 45436, dtype: object

The data type and null count are as follows:

In [195]:
df_spoken_languages.info()

<class 'pandas.core.series.Series'>
RangeIndex: 45436 entries, 0 to 45435
Series name: spoken_languages
Non-Null Count  Dtype 
--------------  ----- 
41603 non-null  object
dtypes: object(1)
memory usage: 355.1+ KB


There are 3833 null records.

`Joining the dataframes`

Before joining the new dataframes generated above to the main dataframe 'data_movies', the columns that went through the ETL are removed from it.

In [197]:
data_movies = data_movies.loc[:,['adult',
                                 'budget', 
                                 'homepage',
                                 'id', 
                                 'imdb_id', 
                                 'original_language', 
                                 'original_title', 
                                 'overview', 
                                 'popularity', 
                                 'poster_path', 
                                 'release_date', 
                                 'revenue', 
                                 'runtime',
                                 'status', 
                                 'tagline', 
                                 'title', 
                                 'video', 
                                 'vote_average', 
                                 'vote_count']]
data_movies

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,373554033.0,81.0,Released,,Toy Story,False,7.7,5415.0
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,1995-12-22,81452156.0,127.0,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,1995-02-10,76578911.0,106.0,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45431,False,0,http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,0.0,90.0,Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45432,False,0,,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,2011-11-17,0.0,360.0,Released,,Century of Birthing,False,9.0,3.0
45433,False,0,,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,2003-08-01,0.0,90.0,Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45434,False,0,,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,1917-10-21,0.0,87.0,Released,,Satan Triumphant,False,0.0,0.0


The following code joins (concatenates) the new dataframes:

In [198]:
data_movies = pd.concat([data_movies,
                         df_belongs_to_collection,
                         df_genres,
                         df_production_companies,
                         df_production_countries,
                         df_spoken_languages], axis=1)
data_movies.head(2)

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,id_belongs,name_belongs,poster_path_belongs,backdrop_path_belongs,genres,production_companies,production_countries,spoken_languages
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,373554033.0,81.0,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[Animation, Comedy, Family]",[Pixar Animation Studios],[US],[en]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,,"[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...",[US],"[en, fr]"
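
Column-wise concatenation matches rows by index label, which is why all the pieces need to share the same index (here, the one obtained after resetting it). A minimal sketch with toy objects (all names and values are illustrative) shows both the aligned case and what happens when the labels differ:

```python
import pandas as pd

left = pd.DataFrame({"title": ["Toy Story", "Jumanji"]}, index=[0, 1])

# Same labels as 'left'
genres = pd.Series([["Animation"], ["Adventure"]], index=[0, 1], name="genres")

# Labels shifted by one, as could happen if one piece kept an old index
shifted = pd.Series([["Animation"], ["Adventure"]], index=[1, 2], name="genres")

aligned = pd.concat([left, genres], axis=1)      # rows line up one to one
misaligned = pd.concat([left, shifted], axis=1)  # union of labels: extra row and NaN gaps

print(aligned)
print(misaligned)
```

A quick check that the row count is unchanged after the concatenation is an inexpensive way to catch this kind of mismatch.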


`Columns 'revenue' and 'budget'`

Null values in these columns are replaced with 0.

In [199]:
data_movies['revenue'] = data_movies['revenue'].fillna(0)
data_movies['budget'] = data_movies['budget'].fillna(0)
data_movies.head(2)

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,id_belongs,name_belongs,poster_path_belongs,backdrop_path_belongs,genres,production_companies,production_countries,spoken_languages
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,373554033.0,81.0,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[Animation, Comedy, Family]",[Pixar Animation Studios],[US],[en]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,,"[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...",[US],"[en, fr]"
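
As an alternative sketch, `DataFrame.fillna` also accepts a dict, which fills both columns in one call and leaves every other column untouched. The `sample` frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with gaps in several columns.
sample = pd.DataFrame({
    "revenue": [373554033.0, np.nan],
    "budget": [30000000.0, np.nan],
    "tagline": [None, "Roll the dice"],
})

# The dict limits the fill to the listed columns; 'tagline' keeps its null.
filled = sample.fillna({"revenue": 0, "budget": 0})
print(filled)
```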


`Column 'release_date'`

Rows with a null value in this column are dropped.

In [200]:
data_movies = data_movies.dropna(subset=['release_date'])
data_movies.reset_index(inplace=True, drop=True)
data_movies.head(2)

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,id_belongs,name_belongs,poster_path_belongs,backdrop_path_belongs,genres,production_companies,production_countries,spoken_languages
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,373554033.0,81.0,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[Animation, Comedy, Family]",[Pixar Animation Studios],[US],[en]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,,"[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...",[US],"[en, fr]"


`Column 'release_date' and creation of the column 'year_release_date'`

The column 'release_date' is converted to a date in 'YYYY-mm-dd' format; values that are not exactly 10 characters long are set to None and their rows are dropped. A new column, 'year_release_date', holds the release year extracted from 'release_date'.

In [201]:
data_movies['release_date'] = data_movies['release_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').date() if len(x) == 10 else None)
data_movies = data_movies.dropna(subset=['release_date'])
data_movies.reset_index(inplace=True, drop=True)
data_movies['year_release_date'] = data_movies['release_date'].apply(lambda x: x.year)
data_movies.head(2)

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,id_belongs,name_belongs,poster_path_belongs,backdrop_path_belongs,genres,production_companies,production_countries,spoken_languages,year_release_date
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,373554033.0,81.0,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[Animation, Comedy, Family]",[Pixar Animation Studios],[US],[en],1995
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,,"[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...",[US],"[en, fr]",1995
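
The same cleaning step can be sketched in vectorized form with `pd.to_datetime` and `errors="coerce"`. This is not the notebook's approach: it yields `datetime64` values rather than `datetime.date` objects, and the `raw` series below is made-up sample data.

```python
import pandas as pd

# Made-up values: two valid dates, one malformed entry, one missing entry.
raw = pd.Series(["1995-10-30", "1995-12-15", "12", None])

# Unparseable or missing entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Keep only valid dates, then derive the year with the .dt accessor.
valid = parsed.dropna().reset_index(drop=True)
years = valid.dt.year
print(years.tolist())
```

Coercion also rejects strings that happen to be 10 characters long but are not real dates, which the length check alone would pass on to `strptime`.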


`Creation of the column 'return'`

The column 'return' is the return on investment, computed as 'revenue / budget'. When the data needed for the calculation is not available (the result is NaN or infinite, for example when 'budget' is 0), the column takes the value 0.

In [202]:
data_movies['return'] = data_movies['revenue']/data_movies['budget'].astype('float')
data_movies['return'] = data_movies['return'].apply(lambda x: 0 if np.isnan(x) or np.isinf(x) else x)
data_movies.head(2)

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count,id_belongs,name_belongs,poster_path_belongs,backdrop_path_belongs,genres,production_companies,production_countries,spoken_languages,year_release_date,return
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,373554033.0,81.0,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[Animation, Comedy, Family]",[Pixar Animation Studios],[US],[en],1995,12.451801
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,,"[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...",[US],"[en, fr]",1995,4.043035
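
A minimal sketch of the same rule without a row-wise `apply`, assuming both columns are already numeric. The three sample rows are illustrative and cover the normal case, 0/0 and a division by zero:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "revenue": [373554033.0, 0.0, 76578911.0],
    "budget": [30000000.0, 0.0, 0.0],
})

# Float division in pandas yields NaN for 0/0 and inf for x/0.
ratio = sample["revenue"] / sample["budget"]

# Map both cases to 0, mirroring the rule described above.
sample["return"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(sample)
```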


`Drop the columns: 'video', 'imdb_id', 'adult', 'original_title', 'poster_path' and 'homepage'`

The following code removes the columns that will not be used: 'video', 'imdb_id', 'adult', 'original_title', 'poster_path' and 'homepage'. Because it works by selecting the columns to keep, 'id_belongs', 'poster_path_belongs' and 'backdrop_path_belongs' are left out as well. The remaining columns are also reordered to make the dataset easier to read.

In [203]:
data_movies = data_movies.loc[:,['id',
                                 'title',
                                 'name_belongs',
                                 'overview',
                                 'genres',
                                 'original_language',
                                 'spoken_languages',
                                 'popularity',
                                 'release_date',
                                 'year_release_date',
                                 'production_companies',
                                 'production_countries',
                                 'runtime',
                                 'status',
                                 'tagline',
                                 'vote_average',
                                 'vote_count',
                                 'budget',
                                 'revenue', 
                                 'return']]
data_movies.head(2)

Unnamed: 0,id,title,name_belongs,overview,genres,original_language,spoken_languages,popularity,release_date,year_release_date,production_companies,production_countries,runtime,status,tagline,vote_average,vote_count,budget,revenue,return
0,862,Toy Story,Toy Story Collection,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]",en,[en],21.946943,1995-10-30,1995,[Pixar Animation Studios],[US],81.0,Released,,7.7,5415.0,30000000,373554033.0,12.451801
1,8844,Jumanji,,When siblings Judy and Peter discover an encha...,"[Adventure, Fantasy, Family]",en,"[en, fr]",17.015539,1995-12-15,1995,"[TriStar Pictures, Teitler Film, Interscope Co...",[US],104.0,Released,Roll the dice and unleash the excitement!,6.9,2413.0,65000000,262797249.0,4.043035


`Information about the 'data_movies' dataset`

In [204]:
data_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45346 entries, 0 to 45345
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    45346 non-null  object 
 1   title                 45346 non-null  object 
 2   name_belongs          4485 non-null   object 
 3   overview              44405 non-null  object 
 4   genres                42962 non-null  object 
 5   original_language     45335 non-null  object 
 6   spoken_languages      41580 non-null  object 
 7   popularity            45346 non-null  object 
 8   release_date          45346 non-null  object 
 9   year_release_date     45346 non-null  int64  
 10  production_companies  33556 non-null  object 
 11  production_countries  39138 non-null  object 
 12  runtime               45100 non-null  float64
 13  status                45266 non-null  object 
 14  tagline               20387 non-null  object 
 15  vote_average       

The dataset has 20 columns, distributed by data type as follows: 5 'float64', 1 'int64' and 14 'object'. Note that the column 'id' must be converted to an integer type and the columns 'popularity' and 'budget' to a floating-point type.

In [205]:
data_movies = data_movies.astype({'id':'int', 'popularity':'float', 'budget':'float'})

After the conversion, the dataset looks as follows:

In [206]:
data_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45346 entries, 0 to 45345
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    45346 non-null  int32  
 1   title                 45346 non-null  object 
 2   name_belongs          4485 non-null   object 
 3   overview              44405 non-null  object 
 4   genres                42962 non-null  object 
 5   original_language     45335 non-null  object 
 6   spoken_languages      41580 non-null  object 
 7   popularity            45346 non-null  float64
 8   release_date          45346 non-null  object 
 9   year_release_date     45346 non-null  int64  
 10  production_companies  33556 non-null  object 
 11  production_countries  39138 non-null  object 
 12  runtime               45100 non-null  float64
 13  status                45266 non-null  object 
 14  tagline               20387 non-null  object 
 15  vote_average       
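
The `int32` reported for 'id' is probably a platform-dependent result of `astype('int')`; naming the width explicitly avoids that. The sketch below also shows `pd.to_numeric` with `errors="coerce"` as a more forgiving alternative to `astype` when a column may contain non-numeric text. The sample values, including the `"not available"` entry, are invented for illustration:

```python
import pandas as pd

sample = pd.DataFrame({
    "id": ["862", "8844"],
    "popularity": ["21.946943", "17.015539"],
    "budget": ["30000000", "not available"],
})

# An explicit width gives the same dtype on every platform.
sample["id"] = sample["id"].astype("int64")

# astype('float') would raise on "not available"; coercion turns it into NaN.
for col in ["popularity", "budget"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```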

This dataset is exported to '.csv' format:

In [207]:
data_movies.to_csv('data_movies.csv')
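
Two details of a CSV round trip are worth keeping in mind: by default `to_csv` writes the index as an extra unnamed first column, and list-valued columns such as 'genres' are stored as their string representation. The sketch below, which uses an in-memory buffer and a one-row made-up frame instead of the real file, shows one way to handle both:

```python
import ast
import io

import pandas as pd

sample = pd.DataFrame({"id": [862], "genres": [["Animation", "Comedy"]]})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False: no extra index column
buffer.seek(0)


def parse_list(text):
    """Rebuild a list from its string form; empty cells become None."""
    return ast.literal_eval(text) if text else None


# The converter is applied to the raw cell text of the 'genres' column.
restored = pd.read_csv(buffer, converters={"genres": parse_list})
print(restored.loc[0, "genres"])
```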

>**`Creation of the dataset to use`**

The dataframes 'data_movies' and 'data_credits_director' are joined on the column 'id'. The resulting dataframe is the one used for information extraction and for building the Machine Learning recommendation model.

In [209]:
movies_credits = pd.merge(data_movies, data_credits_director, how='left' , on='id')
movies_credits

Unnamed: 0,id,title,name_belongs,overview,genres,original_language,spoken_languages,popularity,release_date,year_release_date,production_companies,production_countries,runtime,status,tagline,vote_average,vote_count,budget,revenue,return,director
0,862,Toy Story,Toy Story Collection,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]",en,[en],21.946943,1995-10-30,1995,[Pixar Animation Studios],[US],81.0,Released,,7.7,5415.0,30000000.0,373554033.0,12.451801,[John Lasseter]
1,8844,Jumanji,,When siblings Judy and Peter discover an encha...,"[Adventure, Fantasy, Family]",en,"[en, fr]",17.015539,1995-12-15,1995,"[TriStar Pictures, Teitler Film, Interscope Co...",[US],104.0,Released,Roll the dice and unleash the excitement!,6.9,2413.0,65000000.0,262797249.0,4.043035,[Joe Johnston]
2,15602,Grumpier Old Men,Grumpy Old Men Collection,A family wedding reignites the ancient feud be...,"[Romance, Comedy]",en,[en],11.712900,1995-12-22,1995,"[Warner Bros., Lancaster Gate]",[US],101.0,Released,Still Yelling. Still Fighting. Still Ready for...,6.5,92.0,0.0,0.0,0.000000,[Howard Deutch]
3,31357,Waiting to Exhale,,"Cheated on, mistreated and stepped on, the wom...","[Comedy, Drama, Romance]",en,[en],3.859495,1995-12-22,1995,[Twentieth Century Fox Film Corporation],[US],127.0,Released,Friends are the people who let you be yourself...,6.1,34.0,16000000.0,81452156.0,5.090760,[Forest Whitaker]
4,11862,Father of the Bride Part II,Father of the Bride Collection,Just when George Banks has recovered from his ...,[Comedy],en,[en],8.387519,1995-02-10,1995,"[Sandollar Productions, Touchstone Pictures]",[US],106.0,Released,Just When His World Is Back To Normal... He's ...,5.7,173.0,0.0,76578911.0,0.000000,[Charles Shyer]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45341,30840,Robin Hood,,"Yet another version of the classic epic, with ...","[Drama, Action, Romance]",en,[en],5.683753,1991-05-13,1991,"[Westdeutscher Rundfunk (WDR), Working Title F...","[CA, DE, GB, US]",104.0,Released,,5.7,26.0,0.0,0.0,0.000000,[John Irvin]
45342,111109,Century of Birthing,,An artist struggles to finish his work while a...,[Drama],tl,[tl],0.178241,2011-11-17,2011,[Sine Olivia],[PH],360.0,Released,,9.0,3.0,0.0,0.0,0.000000,[Lav Diaz]
45343,67758,Betrayal,,"When one of her hits goes wrong, a professiona...","[Action, Drama, Thriller]",en,[en],0.903007,2003-08-01,2003,[American World Pictures],[US],90.0,Released,A deadly game of wits.,3.8,6.0,0.0,0.0,0.000000,[Mark L. Lester]
45344,227506,Satan Triumphant,,"In a small town live two brothers, one a minis...",,en,,0.003503,1917-10-21,1917,[Yermoliev],[RU],87.0,Released,,0.0,0.0,0.0,0.0,0.000000,[Yakov Protazanov]
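
A left join keeps every movie, but it can silently duplicate rows if an 'id' appears more than once on the right side. As a hedged sketch, the `validate` and `indicator` parameters of `pd.merge` can make both situations visible. The `movies` and `directors` frames are small made-up examples:

```python
import pandas as pd

movies = pd.DataFrame({
    "id": [862, 8844, 15602],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
})
directors = pd.DataFrame({
    "id": [862, 8844],
    "director": [["John Lasseter"], ["Joe Johnston"]],
})

# validate raises MergeError if 'id' is not unique in the right frame;
# indicator adds a '_merge' column telling where each row came from.
merged = pd.merge(movies, directors, how="left", on="id",
                  validate="many_to_one", indicator=True)

# Movies without a director row are flagged as 'left_only'.
unmatched = (merged["_merge"] == "left_only").sum()
print(merged)
print("movies without director:", unmatched)
```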


The columns are reordered for readability.

In [210]:
movies_credits = movies_credits.reindex(columns=['id',
                                                       'title',
                                                       'director',
                                                       'name_belongs',
                                                       'overview',
                                                       'genres',
                                                       'original_language',
                                                       'spoken_languages',
                                                       'popularity',
                                                       'release_date',
                                                       'year_release_date',
                                                       'production_companies',
                                                       'production_countries',
                                                       'runtime',
                                                       'status',
                                                       'tagline',
                                                       'vote_average',
                                                       'vote_count',
                                                       'budget',
                                                       'revenue', 
                                                       'return'])
movies_credits

Unnamed: 0,id,title,director,name_belongs,overview,genres,original_language,spoken_languages,popularity,release_date,year_release_date,production_companies,production_countries,runtime,status,tagline,vote_average,vote_count,budget,revenue,return
0,862,Toy Story,[John Lasseter],Toy Story Collection,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]",en,[en],21.946943,1995-10-30,1995,[Pixar Animation Studios],[US],81.0,Released,,7.7,5415.0,30000000.0,373554033.0,12.451801
1,8844,Jumanji,[Joe Johnston],,When siblings Judy and Peter discover an encha...,"[Adventure, Fantasy, Family]",en,"[en, fr]",17.015539,1995-12-15,1995,"[TriStar Pictures, Teitler Film, Interscope Co...",[US],104.0,Released,Roll the dice and unleash the excitement!,6.9,2413.0,65000000.0,262797249.0,4.043035
2,15602,Grumpier Old Men,[Howard Deutch],Grumpy Old Men Collection,A family wedding reignites the ancient feud be...,"[Romance, Comedy]",en,[en],11.712900,1995-12-22,1995,"[Warner Bros., Lancaster Gate]",[US],101.0,Released,Still Yelling. Still Fighting. Still Ready for...,6.5,92.0,0.0,0.0,0.000000
3,31357,Waiting to Exhale,[Forest Whitaker],,"Cheated on, mistreated and stepped on, the wom...","[Comedy, Drama, Romance]",en,[en],3.859495,1995-12-22,1995,[Twentieth Century Fox Film Corporation],[US],127.0,Released,Friends are the people who let you be yourself...,6.1,34.0,16000000.0,81452156.0,5.090760
4,11862,Father of the Bride Part II,[Charles Shyer],Father of the Bride Collection,Just when George Banks has recovered from his ...,[Comedy],en,[en],8.387519,1995-02-10,1995,"[Sandollar Productions, Touchstone Pictures]",[US],106.0,Released,Just When His World Is Back To Normal... He's ...,5.7,173.0,0.0,76578911.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45341,30840,Robin Hood,[John Irvin],,"Yet another version of the classic epic, with ...","[Drama, Action, Romance]",en,[en],5.683753,1991-05-13,1991,"[Westdeutscher Rundfunk (WDR), Working Title F...","[CA, DE, GB, US]",104.0,Released,,5.7,26.0,0.0,0.0,0.000000
45342,111109,Century of Birthing,[Lav Diaz],,An artist struggles to finish his work while a...,[Drama],tl,[tl],0.178241,2011-11-17,2011,[Sine Olivia],[PH],360.0,Released,,9.0,3.0,0.0,0.0,0.000000
45343,67758,Betrayal,[Mark L. Lester],,"When one of her hits goes wrong, a professiona...","[Action, Drama, Thriller]",en,[en],0.903007,2003-08-01,2003,[American World Pictures],[US],90.0,Released,A deadly game of wits.,3.8,6.0,0.0,0.0,0.000000
45344,227506,Satan Triumphant,[Yakov Protazanov],,"In a small town live two brothers, one a minis...",,en,,0.003503,1917-10-21,1917,[Yermoliev],[RU],87.0,Released,,0.0,0.0,0.0,0.0,0.000000
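
One difference between the two reordering styles used in this notebook: `reindex(columns=...)` accepts unknown labels and fills them with NaN, whereas label-based selection with `.loc` raises an error. The sketch below uses a deliberately misspelled column name (`titel`) on a one-row made-up frame to illustrate:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [862],
    "title": ["Toy Story"],
    "director": [["John Lasseter"]],
})

# reindex does not complain about the typo; it creates an all-NaN column.
typo = df.reindex(columns=["id", "titel", "director"])

# .loc raises KeyError for the missing label, surfacing the typo immediately.
try:
    df.loc[:, ["id", "titel", "director"]]
    raised = False
except KeyError:
    raised = True

print(typo)
print("loc raised KeyError:", raised)
```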


The dataset is exported to '.csv' format.

In [37]:
movies_credits.to_csv('movies_credits.csv')