## Implementation lab - Data cleaning and exploration

The data we are going to use is available at the following link: https://www.imdb.com/interfaces/

Whenever we get a dataset, the first thing to do is understand where it came from and what it represents. So when documentation exists, the most important step is to read it first.

### IMDB - Internet Movie Database
<img src="imdb-banner-1.jpeg">

The IMDb database consists of 7 tables:
* <b>title.akas.tsv.gz</b> - Contains the following information for titles:
    - titleId (string) - a tconst, an alphanumeric unique identifier of the title
    - ordering (integer) – a number to uniquely identify rows for a given titleId
    - title (string) – the localized title
    - region (string) - the region for this version of the title
    - language (string) - the language of the title
    - types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
    - attributes (array) - Additional terms to describe this alternative title, not enumerated
    - isOriginalTitle (boolean) – 0: not original title; 1: original title
* <b>title.basics.tsv.gz</b> - Contains the following information for titles:
    - tconst (string) - alphanumeric unique identifier of the title
    - titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
    - primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
    - originalTitle (string) - original title, in the original language
    - isAdult (boolean) - 0: non-adult title; 1: adult title
    - startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
    - endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
    - runtimeMinutes – primary runtime of the title, in minutes
    - genres (string array) – includes up to three genres associated with the title
* <b> title.crew.tsv.gz</b> – Contains the director and writer information for all the titles in IMDb. Fields include:
    - tconst (string) - alphanumeric unique identifier of the title
    - directors (array of nconsts) - director(s) of the given title
    - writers (array of nconsts) – writer(s) of the given title
* <b> title.episode.tsv.gz</b> – Contains the tv episode information. Fields include:
    - tconst (string) - alphanumeric identifier of episode
    - parentTconst (string) - alphanumeric identifier of the parent TV Series
    - seasonNumber (integer) – season number the episode belongs to 
    - episodeNumber (integer) – episode number of the tconst in the TV series
* <b> title.principals.tsv.gz </b> – Contains the principal cast/crew for titles
    - tconst (string) - alphanumeric unique identifier of the title
    - ordering (integer) – a number to uniquely identify rows for a given titleId
    - nconst (string) - alphanumeric unique identifier of the name/person
    - category (string) - the category of job that person was in
    - job (string) - the specific job title if applicable, else '\N'
    - characters (string) - the name of the character played if applicable, else '\N'
* <b>title.ratings.tsv.gz</b> – Contains the IMDb rating and votes information for titles
    - tconst (string) - alphanumeric unique identifier of the title
    - averageRating – weighted average of all the individual user ratings
    - numVotes - number of votes the title has received
* <b>name.basics.tsv.gz</b> – Contains the following information for names:
    - nconst (string) - alphanumeric unique identifier of the name/person
    - primaryName (string)– name by which the person is most often credited
    - birthYear – in YYYY format
    - deathYear – in YYYY format if applicable, else '\N'
    - primaryProfession (array of strings)– the top-3 professions of the person
    - knownForTitles (array of tconsts) – titles the person is known for
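
The field descriptions above use `\N` for values that do not apply. A minimal sketch of how that marker could be mapped to NaN at load time, using a tiny in-memory sample instead of the real files (the sample rows and titles are made up for illustration, and `na_values` / `quoting` are one possible choice, not necessarily how this lab loads the data):

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as title.basics
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\tExample A\t1894\t\\N\t1\n"
    "tt0000002\ttvSeries\tExample B\t\\N\t\\N\t\\N\n"
)

# na_values makes read_csv treat the literal string \N as missing;
# QUOTE_NONE keeps stray quote characters in titles from being parsed as quoting
df = pd.read_csv(
    io.StringIO(sample), sep="\t", na_values="\\N", quoting=csv.QUOTE_NONE
)

print(df.dtypes)
print(df.isna().sum())
```

With this approach the year and runtime columns arrive as numeric types, so no later conversion step is needed for them.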

In [None]:
# Import the libraries we are going to use:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # for plotting

import warnings
# Silence warnings for the rest of the notebook
warnings.filterwarnings('ignore')

# Set pandas options to avoid truncating columns/rows
pd.set_option('display.max_columns', None)  # avoid truncating columns
#pd.set_option('display.max_rows', 20)  # avoid truncating rows

In [None]:
# Let's see which datasets we have
os.listdir('/Users/adambrosio/Documents/GitHub/DMA_LABO_Austral_2021_rosario/Data/')

In [None]:
path='/Users/adambrosio/Documents/GitHub/DMA_LABO_Austral_2021_rosario/Data/'

In [None]:
# We start by exploring title basics, which looks like the most interesting file to begin with
title_basics = pd.read_csv(path+'title.basics.tsv',sep="\t")

In [None]:
# Let's see what is in the file (first rows only; head(-5) would return everything except the last 5 rows)

title_basics.head(5)

In [None]:
# Let's check for nulls (boolean mask, one value per cell)

title_basics.isna()

In [None]:
len(title_basics)

In [None]:
# Count the nulls in each column

title_basics.isna().sum()
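
Absolute null counts are easier to judge next to the share of rows they represent. A small sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical frame with a few missing titles and genres
toy = pd.DataFrame({
    "primaryTitle": ["A", None, "C", "D"],
    "genres": ["Drama", None, None, "Comedy"],
})

null_counts = toy.isna().sum()   # number of nulls per column
null_share = toy.isna().mean()   # fraction of nulls per column

print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```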

In [None]:
# Let's look at the rows with a null title

title_basics[title_basics.primaryTitle.isna()]

In [None]:
# Let's look at the rows with null genres

title_basics[title_basics.genres.isna()]

In [None]:
# Data types of the columns

title_basics.dtypes

In [None]:
# Check which Python types appear inside a column (prints each time the type changes from one row to the next)

columna='startYear'

lista=[]
lista.append(type(title_basics[columna][0]))

for ele in title_basics[columna]:
    a=type(ele)
    if a==lista[-1]:
        continue
    else:
        lista.append(a)
        print(lista)
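
The same question can be answered with vectorised pandas calls instead of an explicit loop. A short sketch on a made-up Series (the values are illustrative; a real column may show a different mix of types):

```python
import pandas as pd

# Hypothetical column mixing integers and strings, as can happen when a file is read in chunks
s = pd.Series([1894, 1895, "1896", "\\N", 1900], dtype="object")

# Map every element to the name of its Python type and count the occurrences
type_counts = s.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

Unlike the loop, this reports how many values of each type there are, not only where the type changes.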

In [None]:
# A direct cast raises an error if any value is not numeric (for example the '\N' placeholder)
title_basics.startYear.astype('float')

In [None]:
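# By default pd.to_numeric also raises on unparseable values; its error message reports the position of the offending value
pd.to_numeric(title_basics['startYear'])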

In [None]:
title_basics.iloc[65776]

In [None]:
# This converts to a number and forces a null value wherever the conversion fails

pd.to_numeric(title_basics['startYear'],errors='coerce').iloc[65776]

In [None]:
# Apply the same conversion to the whole column and store the result

title_basics['startYear'] = pd.to_numeric(title_basics['startYear'],errors='coerce')
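
A minimal sketch of what `errors='coerce'` does, on a made-up Series. The conversion to the nullable `Int64` dtype is an optional extra that this lab does not use; it is shown only as one way to keep years as integers:

```python
import pandas as pd

# Hypothetical year column as read from the file: strings, with \N for missing
years = pd.Series(["1894", "2001", "\\N"])

# errors='coerce' turns anything unparseable into NaN instead of raising
as_float = pd.to_numeric(years, errors="coerce")

# Optional: the nullable Int64 dtype keeps integers while still allowing missing values
as_int = as_float.astype("Int64")

print(as_float.tolist())
print(as_int.tolist())
```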

In [None]:
# Convert the remaining numeric columns

title_basics['endYear'] = pd.to_numeric(title_basics['endYear'],errors='coerce')
title_basics['runtimeMinutes'] = pd.to_numeric(title_basics['runtimeMinutes'],errors='coerce')

In [None]:
# Convert the text columns

title_basics['tconst']=title_basics['tconst'].astype('string')
title_basics['titleType']=title_basics['titleType'].astype('string')
title_basics['primaryTitle']=title_basics['primaryTitle'].astype('string')
title_basics['originalTitle']=title_basics['originalTitle'].astype('string')
title_basics['genres']=title_basics['genres'].astype('string')

In [None]:
# Let's see the resulting type of each column
title_basics.info()

In [None]:
# Let's look at a histogram of how the titles are distributed by start year

title_basics.startYear.hist()

In [None]:
# Same histogram of start years, now with an explicit number of bins
title_basics.startYear.hist(bins=20)

In [None]:
# What is the minimum?

title_basics.startYear.min()

In [None]:
# What is the maximum?

title_basics.startYear.max()

In [None]:
# Let's see which title has the maximum start year
title_basics.loc[title_basics['startYear'] == 2028.0]

In [None]:
title_basics.iloc[5886678]

In [None]:
print('https://www.imdb.com/title/tt5174640/')

In [None]:
# Let's see which title has the minimum start year
title_basics.loc[title_basics['startYear'] == 1874]

In [None]:
print('https://www.imdb.com/title/tt3155794/')

In [None]:
# Now let's build a histogram with explicitly specified bins

title_basics.startYear.hist(bins=[1900,1910,1920,1930,1940,1950,1960,1970,1980,1990,2000,2010,2020,2030,2040],
                            alpha=0.5)

In [None]:
# Let's build year ranges
list(range(1900, 2020))

In [None]:
fechas=list(range(1900, 2023))

In [None]:
# Now let's use those ranges as bins
title_basics.startYear.hist(bins=fechas,alpha=0.8)

In [None]:
# Which types of entities do we have?
title_basics.titleType.unique()

In [None]:
title_basics.groupby('titleType').startYear.hist(stacked=True)

In [None]:
title_basics.pivot(columns='titleType').startYear.plot(kind = 'hist', stacked=True,bins=fechas)

In [None]:
title_basics.pivot(columns='titleType')
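
Pivoting the full table on `titleType` creates one wide block of columns per type, which can be memory-heavy on a table this size. A sketch of an alternative that computes stacked counts per decade with `pd.crosstab` (the sample rows and the `decade` name are made up for illustration):

```python
import pandas as pd

# Hypothetical sample with the two columns of interest
toy = pd.DataFrame({
    "titleType": ["movie", "movie", "short", "tvEpisode", "short", "movie"],
    "startYear": [1995.0, 2003.0, 1998.0, 2011.0, None, 2007.0],
})

# Drop rows without a year, then floor each year to its decade
known = toy.dropna(subset=["startYear"])
decade = (known["startYear"] // 10 * 10).astype(int).rename("decade")

# Rows are decades, columns are title types, cells are counts
counts = pd.crosstab(decade, known["titleType"])
print(counts)

# counts.plot(kind="bar", stacked=True) would draw the stacked bars
```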

In [None]:
# Let's see what the episodes look like
title_basics[title_basics['titleType']=='tvEpisode'].sample(20)

In [None]:
print('https://www.imdb.com/title/tt1026159/')

In [None]:
fechas=list(range(1900, 2030))
title_basics[title_basics['titleType']=='movie'].startYear.plot(kind = 'hist', stacked=True,bins=fechas)
plt.xlim(1900,2030)
plt.ylim(0,20000)

In [None]:
title_basics.head(4)

In [None]:
movies=title_basics[title_basics['titleType']=='movie']
movies.head()

In [None]:
pelis_por_ano=pd.pivot_table(movies,values='primaryTitle',aggfunc='count',index=["startYear"])

In [None]:
pelis_por_ano.plot()

In [None]:
pelis_por_ano.to_csv('peliano.csv')

In [None]:
!ls

In [None]:
title_ratings = pd.read_csv(path+"title.ratings.tsv",sep="\t")

In [None]:
title_ratings.head(4)

In [None]:
title_ratings.info()

In [None]:
title_ratings['tconst']=title_ratings['tconst'].astype('string')

In [None]:
title_ratings.info()

In [None]:
title_ratings.head(10)

In [None]:
# Left join: keep every movie, attaching its rating when one exists
result = pd.merge(movies, title_ratings, how='left', on='tconst')
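
When joining on an identifier it helps to check that the key is unique on both sides and to see how many rows found a match. A sketch using the `validate` and `indicator` parameters of `pd.merge` on made-up frames (the `_toy` names and values are hypothetical):

```python
import pandas as pd

movies_toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                           "primaryTitle": ["A", "B", "C"]})
ratings_toy = pd.DataFrame({"tconst": ["tt1", "tt3"],
                            "averageRating": [7.1, 5.4],
                            "numVotes": [120, 15]})

# how='left' keeps every movie; validate raises if tconst is duplicated on either side;
# indicator adds a _merge column telling which rows found a match
merged = pd.merge(movies_toy, ratings_toy, how="left", on="tconst",
                  validate="one_to_one", indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are movies without a rating, which is where the nulls in `averageRating` and `numVotes` come from.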

In [None]:
result

In [None]:
print('https://www.imdb.com/title/tt9916706/')

In [None]:
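# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
movies_chica=result[['tconst','primaryTitle','startYear','runtimeMinutes','averageRating','numVotes']].copy()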

In [None]:
movies_chica

In [None]:
movies_chica.isna().sum()

In [None]:
len(movies_chica)

In [None]:
plt.scatter(x=movies_chica.runtimeMinutes,y=movies_chica.averageRating)

In [None]:
movies_chica.info()

In [None]:
movies_chica['duracion']=pd.cut(movies_chica.runtimeMinutes, bins=list(range(0,600,60)))
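
A small sketch of how `pd.cut` assigns values to these bins, on made-up runtimes. It shows two details that affect the counts: intervals are closed on the right by default, and values outside the bin edges become NaN:

```python
import pandas as pd

# Hypothetical runtimes in minutes
runtimes = pd.Series([0, 45, 60, 61, 125, 700])
bins = list(range(0, 600, 60))  # edges 0, 60, ..., 540

# Default intervals are (0, 60], (60, 120], ...
# 0 and 700 fall outside every interval and become NaN
buckets = pd.cut(runtimes, bins=bins)
print(buckets)
```

Passing `include_lowest=True` would make the first interval include 0.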

In [None]:
movies_chica.head()

In [None]:
pd.pivot_table(movies_chica,values='averageRating',aggfunc='mean',index=["duracion"]).plot(figsize=(15, 7))
#plt.ylim(0,10)

In [None]:
pd.pivot_table(movies_chica,values='averageRating',aggfunc=['mean','count'],index=["duracion"])

In [None]:
movies_chica.averageRating.isna().sum()

In [None]:
movies_sin_nulos=movies_chica.dropna(axis=0)

In [None]:
len(movies_sin_nulos)

In [None]:
pd.pivot_table(movies_sin_nulos,values='averageRating',aggfunc='mean',index=["startYear"]).plot()

In [None]:
pd.pivot_table(movies_sin_nulos,values='averageRating',aggfunc=['mean','count'],index=["startYear"])

In [None]:
movies_sin_nulos.sort_values(by=['averageRating'],ascending=False)

In [None]:
movies_sin_nulos[movies_sin_nulos['numVotes']>100000].sort_values(by=['averageRating'],ascending=False)

In [None]:
print('https://www.imdb.com/title/tt0112178/')

In [None]:
movies_sin_nulos.insert(2,'url','https://www.imdb.com/title/'+movies_sin_nulos.tconst+'/')

In [None]:
movies_sin_nulos

In [None]:
movies.genres.unique()

In [None]:
genres_list=[]
for ele in list(movies.genres.unique()):
    genres_list=genres_list+str(ele).split(sep=",")
genres_list = list(dict.fromkeys(genres_list))
genres_list

In [None]:
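# Replace missing values with the '\N' placeholder; reassigning (instead of inplace=True) avoids modifying a slice of title_basics
movies = movies.fillna('\\N')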

In [None]:
for ele in genres_list:
    movies[ele]=np.nan
movies.head()

In [None]:
genres_list.remove('\\N')

In [None]:
genres_list

In [None]:
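for ele in genres_list:
    # Match whole comma-separated tokens so that e.g. 'Music' does not also flag 'Musical'
    movies.loc[movies['genres'].str.contains('(?:^|,)' + ele + '(?:,|$)'), ele] = 1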

In [None]:
movies.head(10)

In [None]:
movies.isna().sum()

In [None]:
movies.drop(columns='\\N',inplace=True)

In [None]:
# Store the result so the genre flags are 0/1 instead of NaN/1
movies = movies.fillna(0)
movies.head()
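
The genre flags built step by step above can also be produced in one call. A sketch using `Series.str.get_dummies` on a made-up Series (the values are illustrative; treating `\N` as "no genres" is an assumption of this sketch):

```python
import pandas as pd

# Hypothetical genres column in the same comma-separated format
genres = pd.Series(["Drama,Music", "Musical", "\\N", "Comedy,Drama"])

# Turn the \N placeholder into an empty string, then split on commas and one-hot encode
dummies = genres.replace("\\N", "").str.get_dummies(sep=",")
print(dummies)
```

Because the split is on whole tokens, a title tagged only `Musical` gets 0 in the `Music` column.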

In [None]:
title_basics.head()

In [None]:
title_series=title_basics[title_basics.titleType=='tvSeries']
# Build the mask from title_series itself; na=False treats missing titles as non-matches
title_series[title_series.primaryTitle.str.contains("Star Trek", na=False)]

In [None]:
title_episode=pd.read_csv(path+"title.episode.tsv",sep="\t")
title_episode.head()

In [None]:
title_episode[title_episode.parentTconst=='tt0092455']

In [None]:
# Episodes of one series, each with its rating when available
ratings=pd.merge(title_episode[title_episode.parentTconst=='tt0112178'],
                 title_ratings, how='left', on='tconst')

In [None]:
ratings.sort_values(by=['averageRating']).tail(10)

In [None]:
ratings.boxplot(column='averageRating',by='seasonNumber',figsize=(15, 7))
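
If `seasonNumber` was read as text (which can happen when the column contains the `\N` placeholder), seasons sort alphabetically, so season 10 would come before season 2. A sketch of converting it before grouping, on made-up episode data:

```python
import pandas as pd

# Hypothetical episodes with the season stored as text
episodes = pd.DataFrame({
    "seasonNumber": ["1", "2", "10", "\\N", "2"],
    "averageRating": [7.0, 8.0, 9.0, 6.5, 8.4],
})

# Nullable integers sort numerically and still allow a missing season
episodes["seasonNumber"] = pd.to_numeric(
    episodes["seasonNumber"], errors="coerce"
).astype("Int64")

# Episodes with a missing season are left out of the groups by default
per_season = episodes.groupby("seasonNumber")["averageRating"].agg(["count", "median"])
print(per_season)
```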