![IronHack Logo](https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/upload_d5c5793015fec3be28a63c4fa3dd4d55.png)

# Final Project

## Objectives:

### · Predict a movie's revenue from several parameters, such as budget, genres, and cast, among others.

### · Visualize the relationships between various movie features.

## Development


### I import the libraries and functions needed throughout the project.

In [355]:
import pandas as pd
import numpy as np
import json
from datetime import datetime
from dotenv import load_dotenv
import os
import requests
from bs4 import BeautifulSoup
from statsmodels.multivariate.pca import PCA
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn.preprocessing import StandardScaler
import seaborn as sns



### I load the data and take a first look at the dataframe (hereafter, "df").

In [47]:
df = pd.read_csv('tmdb_5000_movies.csv')
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


### I check how many records and columns the df has.

In [48]:
df.shape

(4803, 20)

### Cleaning begins. I review the distribution of values in the status column.

In [49]:
df.status.value_counts()

Released           4795
Rumored               5
Post Production       3
Name: status, dtype: int64

### Since only 8 records are not released movies, I keep only the released ones.

In [50]:
df = df[df.status == 'Released'].copy()  # copy so later in-place edits do not act on a slice

### I drop columns that will contribute little or nothing when training the model later on, such as:
#### - homepage: text values unique to each record, which clearly add nothing.
#### - original_title: I chose to keep title, since it gives me titles normalized in English and avoids symbols from other languages; I keep it only as an aid for debugging later code, because being unique to each record it will not help the model either.
#### - status: because every record now has the same value.
#### - tagline, overview: descriptions of the movie that will be of no use to me; besides, the keywords column already holds a more machine-friendly version.
#### - spoken_languages: I decided to keep only the original language, since many languages may be spoken in a single movie.

In [51]:
df.drop(['homepage','original_title', 'status','tagline','spoken_languages','overview'], axis=1, inplace=True)

### Purely for aesthetics, I reorder the columns: first I build a list, then I assign it to the df.

In [52]:
cols = df.columns.tolist()
cols

['budget',
 'genres',
 'id',
 'keywords',
 'original_language',
 'popularity',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'title',
 'vote_average',
 'vote_count']

In [53]:
cols2 = ['id','title','release_date','runtime','original_language','genres','production_companies',
        'production_countries','keywords','budget','vote_average','vote_count','popularity','revenue']

In [54]:
df = df[cols2]
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_companies,production_countries,keywords,budget,vote_average,vote_count,popularity,revenue
0,19995,Avatar,2009-12-10,162.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",237000000,7.2,11800,150.437577,2787965087
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",300000000,6.9,4500,139.082615,961000000
2,206647,Spectre,2015-10-26,148.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",245000000,6.3,4466,107.376788,880674609
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",250000000,7.6,9106,112.31295,1084939099
4,49529,John Carter,2012-03-07,132.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",260000000,6.1,2124,43.926995,284139100


### I review the distribution of values in the movie language column. The vast majority are in English, but I will keep the 4 most frequent languages and use the convert_lang function to turn all the others into "other".

In [55]:
df.original_language.value_counts()[:4]

en    4498
fr      70
es      32
de      27
Name: original_language, dtype: int64

In [56]:
df.original_language.value_counts()

en    4498
fr      70
es      32
de      27
zh      27
hi      19
ja      16
it      14
cn      12
ru      11
ko      11
pt       9
da       7
sv       5
fa       4
nl       4
he       3
th       3
ro       2
ta       2
ar       2
id       2
cs       1
tr       1
te       1
ps       1
hu       1
sl       1
ky       1
xx       1
af       1
is       1
vi       1
no       1
nb       1
pl       1
el       1
Name: original_language, dtype: int64

In [57]:
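# Compute the four most frequent languages once, before any value is replaced.
top_languages = df.original_language.value_counts().index[:4]

def convert_lang(elem):
    # Map every language outside the top four to 'other'.
    if elem not in top_languages:
        elem = 'other'
    return elem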

In [58]:
df.original_language = df.original_language.apply(convert_lang)
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_companies,production_countries,keywords,budget,vote_average,vote_count,popularity,revenue
0,19995,Avatar,2009-12-10,162.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",237000000,7.2,11800,150.437577,2787965087
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",300000000,6.9,4500,139.082615,961000000
2,206647,Spectre,2015-10-26,148.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",245000000,6.3,4466,107.376788,880674609
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",250000000,7.6,9106,112.31295,1084939099
4,49529,John Carter,2012-03-07,132.0,en,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",260000000,6.1,2124,43.926995,284139100


In [59]:
df.original_language.value_counts()

en       4498
other     168
fr         70
es         32
de         27
Name: original_language, dtype: int64
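
The same grouping can also be done without a per-row function. What follows is a minimal sketch on made-up data, not the notebook's dataframe: it computes the most frequent codes once and uses `Series.where` with `Series.isin` to replace everything else with 'other'. The toy series and the choice of keeping two codes instead of four are assumptions made only for illustration.

```python
import pandas as pd

# Toy data standing in for df.original_language (values are made up).
langs = pd.Series(['en', 'en', 'en', 'fr', 'fr', 'es', 'de', 'zh', 'ja'])

# The most frequent codes, computed once. Two are kept here; the notebook keeps four.
top = langs.value_counts().index[:2]

# Keep values found in `top`; replace every other value with 'other'.
grouped = langs.where(langs.isin(top), 'other')
print(grouped.value_counts())
```

Because the top codes are computed before any replacement, the result does not depend on the order in which rows are processed.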

### I check how many records have 0 in the vote_count and popularity columns and, since there are few, I drop those records.

In [60]:
df[df.vote_count==0].vote_count.value_counts()

0    61
Name: vote_count, dtype: int64

In [61]:
df[df.popularity==0].popularity.value_counts()

0.0    1
Name: popularity, dtype: int64

In [62]:
df = df[df.vote_count!=0]

In [63]:
df = df[df.popularity!=0]

In [64]:
df.shape

(4734, 14)

### I drop one specific record that is not a movie but a miniseries, and that also lacks several of the discriminating values.

In [65]:
df = df[df.id!=112430]

In [66]:
df.shape

(4733, 14)

### Having found this record, which is not a movie and could hurt the model, I decide to drop the records whose runtime is 80 minutes or less, unless they have values for both budget and revenue, since to recover the missing values of these two fields I will have to do web scraping on a site that focuses mainly on movies.

In [67]:
df = df[(df.runtime>80) | ((df.budget!=0) & (df.revenue!=0))]

In [68]:
df.shape

(4645, 14)
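
To make the filter above easier to read, here is a small sketch of the same boolean logic with named masks. The four rows are invented for illustration; only the column names and the condition come from the notebook.

```python
import pandas as pd

# Invented rows covering each case of the filter.
movies = pd.DataFrame({
    'title':   ['feature', 'short_complete', 'short_incomplete', 'exactly_80'],
    'runtime': [120.0, 45.0, 45.0, 80.0],
    'budget':  [0, 1000000, 0, 0],
    'revenue': [0, 5000000, 0, 0],
})

long_enough = movies.runtime > 80                       # strictly longer than 80 minutes
has_financials = (movies.budget != 0) & (movies.revenue != 0)

# A row survives if it is long enough OR already has both financial values.
kept = movies[long_enough | has_financials]
print(kept.title.tolist())
```

Note that the comparison is strict, so a movie of exactly 80 minutes without financial data is dropped, and a missing runtime (NaN) also fails the `> 80` test.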

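### The genres column holds JSON strings. I collect the unique genre names and then convert each record into a list of names.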

In [71]:
def getNames(series):
    lis = []
    for x in series:
        x = json.loads(x)
        x = [z['name'] for z in x]
        for y in x:
            if y not in lis:
                lis.append(y)
    return lis

all_genres = getNames(df.genres)
all_genres

['Action',
 'Adventure',
 'Fantasy',
 'Science Fiction',
 'Crime',
 'Drama',
 'Thriller',
 'Animation',
 'Family',
 'Western',
 'Comedy',
 'Romance',
 'Horror',
 'Mystery',
 'History',
 'War',
 'Music',
 'Documentary',
 'Foreign',
 'TV Movie']

In [72]:
def convert_genres(elem):
    elem = json.loads(elem)
    elem = [z['name'] for z in elem]
    return elem

# def convert_genres2(elem):
#     elem = json.loads(elem)
#     res = ''
#     for z in elem:
#         res += z['name']
#     return res

# def convert_genres3(elem):
#     elem = json.loads(elem)
#     elem = [z['name'] for z in elem]
#     return set(elem)

In [73]:
df.genres = df.genres.apply(convert_genres)
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_companies,production_countries,keywords,budget,vote_average,vote_count,popularity,revenue
0,19995,Avatar,2009-12-10,162.0,en,"[Action, Adventure, Fantasy, Science Fiction]","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",237000000,7.2,11800,150.437577,2787965087
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[Adventure, Fantasy, Action]","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",300000000,6.9,4500,139.082615,961000000
2,206647,Spectre,2015-10-26,148.0,en,"[Action, Adventure, Crime]","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",245000000,6.3,4466,107.376788,880674609
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[Action, Crime, Drama, Thriller]","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",250000000,7.6,9106,112.31295,1084939099
4,49529,John Carter,2012-03-07,132.0,en,"[Action, Adventure, Science Fiction]","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",260000000,6.1,2124,43.926995,284139100


In [74]:
genres_dummies = df['genres'].str.join(sep='*').str.get_dummies(sep='*')
genres_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
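
As a compact illustration of the two steps above (JSON string to list of names, then list to indicator columns), here is a sketch on three hand-written records. The sample strings follow the shape of the genres column but are not taken from the dataset.

```python
import json
import pandas as pd

# Hand-written records in the same shape as the genres column.
raw = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[]',
])

# Step 1: parse each JSON string and keep only the 'name' of every entry.
names = raw.apply(lambda s: [entry['name'] for entry in json.loads(s)])

# Step 2: join each list with a separator, then split it back into 0/1 columns.
dummies = names.str.join('*').str.get_dummies(sep='*')
print(dummies)
```

The separator only has to be a character that never occurs inside a name; a record with an empty list simply gets a row of zeros.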


In [75]:
all_comps = []
for x in df.production_companies:
    if len(x)>2:
        x = json.loads(x)
        y = x[0]['name']
    
        if y not in all_comps:
                all_comps.append(y)
len(all_comps)

1257

In [76]:
df.drop('production_companies', axis=1, inplace=True)

In [77]:
all_countries = getNames(df.production_countries)
display(len(all_countries), all_countries[:10])

83

['United States of America',
 'United Kingdom',
 'Jamaica',
 'Bahamas',
 'Dominica',
 'Czech Republic',
 'Poland',
 'Slovenia',
 'New Zealand',
 'Germany']

In [78]:
df.production_countries = df.production_countries.apply(convert_genres)
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,keywords,budget,vote_average,vote_count,popularity,revenue
0,19995,Avatar,2009-12-10,162.0,en,"[Action, Adventure, Fantasy, Science Fiction]","[United States of America, United Kingdom]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",237000000,7.2,11800,150.437577,2787965087
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[Adventure, Fantasy, Action]",[United States of America],"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",300000000,6.9,4500,139.082615,961000000
2,206647,Spectre,2015-10-26,148.0,en,"[Action, Adventure, Crime]","[United Kingdom, United States of America]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",245000000,6.3,4466,107.376788,880674609
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[Action, Crime, Drama, Thriller]",[United States of America],"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",250000000,7.6,9106,112.31295,1084939099
4,49529,John Carter,2012-03-07,132.0,en,"[Action, Adventure, Science Fiction]",[United States of America],"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",260000000,6.1,2124,43.926995,284139100


In [79]:
countries_dummies = df['production_countries'].str.join(sep='*').str.get_dummies(sep='*')
countries_dummies.head()

Unnamed: 0,Afghanistan,Angola,Argentina,Aruba,Australia,Austria,Bahamas,Belgium,Bhutan,Bolivia,...,Sweden,Switzerland,Taiwan,Thailand,Tunisia,Turkey,Ukraine,United Arab Emirates,United Kingdom,United States of America
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [80]:
all_keys = getNames(df.keywords)
display(len(all_keys), all_keys[:10])

9765

['culture clash',
 'future',
 'space war',
 'space colony',
 'society',
 'space travel',
 'futuristic',
 'romance',
 'space',
 'alien']

In [81]:
df.drop('keywords', axis=1, inplace=True)

In [82]:
df['release_date'] = pd.to_datetime(df['release_date'])
df.dtypes

id                               int64
title                           object
release_date            datetime64[ns]
runtime                        float64
original_language               object
genres                          object
production_countries            object
budget                           int64
vote_average                   float64
vote_count                       int64
popularity                     float64
revenue                          int64
dtype: object

In [83]:
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,budget,vote_average,vote_count,popularity,revenue
0,19995,Avatar,2009-12-10,162.0,en,"[Action, Adventure, Fantasy, Science Fiction]","[United States of America, United Kingdom]",237000000,7.2,11800,150.437577,2787965087
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[Adventure, Fantasy, Action]",[United States of America],300000000,6.9,4500,139.082615,961000000
2,206647,Spectre,2015-10-26,148.0,en,"[Action, Adventure, Crime]","[United Kingdom, United States of America]",245000000,6.3,4466,107.376788,880674609
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[Action, Crime, Drama, Thriller]",[United States of America],250000000,7.6,9106,112.31295,1084939099
4,49529,John Carter,2012-03-07,132.0,en,"[Action, Adventure, Science Fiction]",[United States of America],260000000,6.1,2124,43.926995,284139100


In [84]:
df.shape

(4645, 12)

In [85]:
df[df.revenue==0].revenue.value_counts()

0    1279
Name: revenue, dtype: int64

In [86]:
df[df.revenue==0]

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,budget,vote_average,vote_count,popularity,revenue
83,79698,The Lovers,2015-02-13,109.0,en,"[Action, Adventure, Science Fiction, Romance]","[Australia, Belgium, India]",27000000,4.8,34,2.418535,0
135,7978,The Wolfman,2010-02-11,102.0,en,"[Drama, Horror, Thriller]",[United States of America],150000000,5.5,549,21.214571,0
265,10588,The Cat in the Hat,2003-11-21,82.0,en,"[Comedy, Fantasy, Family]",[United States of America],0,4.9,366,18.251129,0
309,10214,Son of the Mask,2005-02-18,94.0,en,"[Fantasy, Comedy, Family, Adventure]","[Germany, United States of America]",84000000,3.6,338,17.815595,0
376,10357,Volcano,1997-04-25,104.0,en,"[Science Fiction, Action, Drama, Thriller]",[United States of America],90000000,5.2,376,19.836124,0
406,51052,Arthur Christmas,2011-02-22,97.0,en,"[Drama, Animation, Family, Comedy]",[United States of America],0,6.7,333,19.836830,0
433,146216,RED 2,2013-07-18,116.0,en,"[Action, Comedy, Crime, Thriller]","[Canada, France, United States of America]",84000000,6.4,1526,44.343330,0
453,12096,The Pink Panther,2006-01-18,93.0,en,"[Action, Comedy, Crime, Mystery, Family]",[United States of America],0,5.6,550,25.450534,0
463,161795,Déjà Vu,1998-04-22,117.0,en,"[Romance, Drama]",[United States of America],0,8.0,1,0.605645,0
467,9822,Mighty Joe Young,1998-12-25,114.0,en,"[Action, Adventure, Family, Fantasy]",[United States of America],90000000,5.9,208,6.643778,0


In [87]:
df[df.budget==0].budget.value_counts()

0    915
Name: budget, dtype: int64

In [88]:
load_dotenv(dotenv_path='.env')
SECRET_KEY = os.getenv("TOKEN")

In [90]:
id1 = str(19995)
response = requests.get('https://api.themoviedb.org/3/movie/'+id1+'/external_ids?api_key='+SECRET_KEY)
results = response.json()
display('imdb_id' in results.keys())

True

In [91]:
def imdb_id(row):
    if row['budget']==0 or row['revenue']==0:
        id1 = str(row['id'])
        response = requests.get('https://api.themoviedb.org/3/movie/'+id1+'/external_ids?api_key='+SECRET_KEY)
        results = response.json()
        if 'imdb_id' in results.keys():
            return results["imdb_id"]
        else: return 0
    else: return 0

df['imdb_id'] = df.apply(imdb_id, axis=1)
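
Since this step sends one request per row, it may help to separate the lookup logic from the network call so it can be tested offline and cached. The sketch below is only an illustration under stated assumptions: `external_ids_url`, `lookup_imdb_id`, `fetch_json` and `fake_fetch` are hypothetical names that are not part of the notebook, and the fake response reuses an id pair that appears in the output further down.

```python
from urllib.parse import urlencode

TMDB_BASE = 'https://api.themoviedb.org/3/movie'


def external_ids_url(movie_id, api_key):
    """Build the external_ids URL, letting urlencode escape the key."""
    query = urlencode({'api_key': api_key})
    return f'{TMDB_BASE}/{movie_id}/external_ids?{query}'


def lookup_imdb_id(movie_id, api_key, fetch_json, cache):
    """Return the IMDb id for a TMDB id, or 0 when it cannot be found.

    fetch_json is a hypothetical callable (url -> dict); in the notebook it
    could be: lambda url: requests.get(url, timeout=10).json()
    """
    if movie_id not in cache:
        try:
            payload = fetch_json(external_ids_url(movie_id, api_key))
        except Exception:
            payload = {}  # network or JSON error: treat as "not found"
        cache[movie_id] = payload.get('imdb_id') or 0
    return cache[movie_id]


# Offline demonstration with a fake fetcher (no request is sent).
calls = []

def fake_fetch(url):
    calls.append(url)
    return {'id': 79698, 'imdb_id': 'tt1321869'}

cache = {}
first = lookup_imdb_id(79698, 'MY_KEY', fake_fetch, cache)
second = lookup_imdb_id(79698, 'MY_KEY', fake_fetch, cache)
print(first, second, len(calls))
```

The second lookup is served from the cache, so re-running a cell after an interruption would not repeat requests that already succeeded.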

In [92]:
df[df.revenue==0].head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,budget,vote_average,vote_count,popularity,revenue,imdb_id
83,79698,The Lovers,2015-02-13,109.0,en,"[Action, Adventure, Science Fiction, Romance]","[Australia, Belgium, India]",27000000,4.8,34,2.418535,0,tt1321869
135,7978,The Wolfman,2010-02-11,102.0,en,"[Drama, Horror, Thriller]",[United States of America],150000000,5.5,549,21.214571,0,tt0780653
265,10588,The Cat in the Hat,2003-11-21,82.0,en,"[Comedy, Fantasy, Family]",[United States of America],0,4.9,366,18.251129,0,tt0312528
309,10214,Son of the Mask,2005-02-18,94.0,en,"[Fantasy, Comedy, Family, Adventure]","[Germany, United States of America]",84000000,3.6,338,17.815595,0,tt0362165
376,10357,Volcano,1997-04-25,104.0,en,"[Science Fiction, Action, Drama, Thriller]",[United States of America],90000000,5.2,376,19.836124,0,tt0120461


In [120]:
def get_dirs(row):
    # Query the TMDB credits endpoint for this movie.
    id1 = str(row['id'])
    response = requests.get('https://api.themoviedb.org/3/movie/'+id1+'/credits?api_key='+SECRET_KEY)
    results = response.json()
    if 'crew' in results.keys():
        for x in results['crew']:
            if x['department'] == 'Directing':
                # Keep only the first member of the Directing department,
                # wrapped in a list so it can be turned into dummies later.
                return [x['name']]
        return 0
    else: return 0

df['directors'] = df.apply(get_dirs, axis=1)


In [122]:
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,budget,vote_average,vote_count,popularity,revenue,imdb_id,directors
0,19995,Avatar,2009-12-10,162.0,en,"[Action, Adventure, Fantasy, Science Fiction]","[United States of America, United Kingdom]",237000000,7.2,11800,150.437577,2787965087,0,[James Cameron]
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[Adventure, Fantasy, Action]",[United States of America],300000000,6.9,4500,139.082615,961000000,0,[Gore Verbinski]
2,206647,Spectre,2015-10-26,148.0,en,"[Action, Adventure, Crime]","[United Kingdom, United States of America]",245000000,6.3,4466,107.376788,880674609,0,[Sam Mendes]
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[Action, Crime, Drama, Thriller]",[United States of America],250000000,7.6,9106,112.31295,1084939099,0,[Christopher Nolan]
4,49529,John Carter,2012-03-07,132.0,en,"[Action, Adventure, Science Fiction]",[United States of America],260000000,6.1,2124,43.926995,284139100,0,[Andrew Stanton]


In [127]:
def get_cast(row):
    
    id1 = str(row['id'])
    response = requests.get('https://api.themoviedb.org/3/movie/'+id1+'/credits?api_key='+SECRET_KEY)
    results = response.json()
    if 'cast' in results.keys():
        cast = []
        for i in range(3):
            if len(results['cast'])>i:
                cast.append(results['cast'][i]['name'])
            else: cast.append(0)
        return cast
    else: return 0

df['cast'] = df.apply(get_cast, axis=1)
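
Both get_dirs and get_cast query the same credits endpoint, so each movie is requested twice. One possible alternative is to fetch the response once and extract both fields from it. The sketch below shows only the extraction part on a hand-written payload. `parse_credits` is a hypothetical helper, and I am assuming that crew entries carry a `job` field alongside the `department` field used above.

```python
def parse_credits(payload, top_n=3):
    """Extract director names and the first top_n cast names from one payload."""
    directors = [
        member['name']
        for member in payload.get('crew', [])
        if member.get('job') == 'Director'  # assumed field, see the note above
    ]
    cast = [member['name'] for member in payload.get('cast', [])[:top_n]]
    return directors, cast


# Hand-written payload shaped like a credits response (not real API output).
sample = {
    'cast': [
        {'name': 'Sam Worthington'},
        {'name': 'Zoe Saldana'},
        {'name': 'Stephen Lang'},
        {'name': 'Fourth Actor'},
    ],
    'crew': [
        {'name': 'James Cameron', 'department': 'Directing', 'job': 'Director'},
        {'name': 'Jane Doe', 'department': 'Directing', 'job': 'Script Supervisor'},
    ],
}
directors, cast = parse_credits(sample)
print(directors, cast)
```

Unlike the notebook functions, this sketch returns empty lists rather than 0 when nothing is found, which keeps every cell of the column the same type.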

In [130]:
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,budget,vote_average,vote_count,popularity,revenue,imdb_id,directors,cast
0,19995,Avatar,2009-12-10,162.0,en,"[Action, Adventure, Fantasy, Science Fiction]","[United States of America, United Kingdom]",237000000,7.2,11800,150.437577,2787965087,0,[James Cameron],"[Sam Worthington, Zoe Saldana, Stephen Lang]"
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[Adventure, Fantasy, Action]",[United States of America],300000000,6.9,4500,139.082615,961000000,0,[Gore Verbinski],"[Johnny Depp, Orlando Bloom, Keira Knightley]"
2,206647,Spectre,2015-10-26,148.0,en,"[Action, Adventure, Crime]","[United Kingdom, United States of America]",245000000,6.3,4466,107.376788,880674609,0,[Sam Mendes],"[Daniel Craig, Christoph Waltz, Léa Seydoux]"
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[Action, Crime, Drama, Thriller]",[United States of America],250000000,7.6,9106,112.31295,1084939099,0,[Christopher Nolan],"[Christian Bale, Michael Caine, Gary Oldman]"
4,49529,John Carter,2012-03-07,132.0,en,"[Action, Adventure, Science Fiction]",[United States of America],260000000,6.1,2124,43.926995,284139100,0,[Andrew Stanton],"[Taylor Kitsch, Lynn Collins, Samantha Morton]"


In [None]:
# df8 = df with genre and countries merged
# df7 = df with all the dummies
# df6_4 = df after all the merges
# df6 = df before merging with the dummies
# df5 = df after get_revenue
# df4s = sample of df4
# df4 = df before get_revenue
# df3 = df before get_budget
# df2 = sample of df    df2 = df[1000:1465].copy()

In [190]:
df = df3.copy()

In [191]:
df3[df3['budget']==0].budget.value_counts()

0    915
Name: budget, dtype: int64

In [196]:
df[df['budget']==0].budget.value_counts()

0    340
Name: budget, dtype: int64

In [172]:
df2 = df[1000:1465].copy()

In [None]:
df3 = df.copy()

In [193]:
def get_budget(row):
    
    id1 = row['imdb_id']
    if id1 == 0: return row['budget']
    if row['budget'] == 0:
        
        url = 'https://www.imdb.com/title/'+id1
        html = requests.get(url).content
        soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

        budget = ''
        for element in soup.find_all('div', class_="txt-block"):
            if element.text.strip('\n').startswith('Budget:$'):
                budget = int(element.text.strip('\n').lstrip('Budget:$').rstrip('\n            (estimated)').replace(',', ''))
        if budget == '':
            return 0
        else:
            return budget
    else: return row['budget']

df['budget'] = df.apply(get_budget, axis=1)
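
One detail worth noting about the parsing above: `str.lstrip` and `str.rstrip` remove any characters from the given set, not a literal prefix or suffix. It works here because digits are not in either set, but a regular expression states the intent more directly. The following is a minimal sketch, assuming the scraped text looks like the strings matched in get_budget; `parse_dollar_amount` is a hypothetical helper and the sample strings are written by hand, not fetched from the site.

```python
import re


def parse_dollar_amount(text, label):
    """Return the integer dollar amount that follows `label`, or None if absent."""
    match = re.search(re.escape(label) + r'\s*\$\s*([\d,]+)', text)
    if match is None:
        return None
    # Drop the thousands separators before converting to int.
    return int(match.group(1).replace(',', ''))


# Hand-written samples imitating the text blocks handled in the notebook.
sample = 'Budget:$237,000,000\n            (estimated)'
budget = parse_dollar_amount(sample, 'Budget:')
other_currency = parse_dollar_amount('Budget:EUR10,000,000', 'Budget:')
print(budget, other_currency)
```

Returning None for amounts that are not in dollars makes those cases explicit, so they can be handled separately instead of being silently read as a dollar figure.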

In [229]:
df4.loc[3285]

id                                          153397
title                                     Restless
release_date                   2012-12-07 00:00:00
runtime                                        180
original_language                               en
genres                  [TV Movie, Romance, Drama]
production_countries              [United Kingdom]
budget                                           0
vote_average                                   4.9
vote_count                                       8
popularity                                0.812776
revenue                                          0
imdb_id                                          0
directors                                        0
cast                                             0
Name: 3285, dtype: object

In [None]:
df4 = df.copy()

In [None]:
df4s = df4[1000:1465].copy()

In [230]:
def get_revenue(row):
    
    id1 = row['imdb_id']
    
    if row['revenue'] > 0: 
        return row['revenue']
    
    elif id1 == 0: return row['revenue']

    else:
        
        url = 'https://www.imdb.com/title/'+id1
        html = requests.get(url).content
        soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

        glob = ''
        usa = ''
        
        for element in soup.find_all('div', class_="txt-block"):
            
            if element.text.strip('\n').startswith('Gross USA: $'):
                usa = int(element.text.strip('\n').lstrip('Gross USA: $').split(' ')[0].replace(',', ''))
            
            if element.text.strip('\n').startswith('Cumulative Worldwide Gross: $'):
                glob = int(element.text.strip('\n').lstrip('Cumulative Worldwide Gross: $').split(' ')[0].replace(',', ''))
                return glob
                
        if usa != '' and usa > 0:
            return usa
        else:
            return 0
        

df['revenue'] = df.apply(get_revenue, axis=1)

In [253]:
df5 = df.copy()

In [254]:
df.loc[(df['budget'] == 0) | (df['revenue'] == 0)].shape


(596, 15)

In [264]:
df = df[(df.budget!=0) & (df.revenue!=0)]

In [283]:
df = df[(df.budget>1000) | (df.revenue>1000)]

In [284]:
df.shape

(4039, 15)

In [285]:
df.head()

Unnamed: 0,id,title,release_date,runtime,original_language,genres,production_countries,budget,vote_average,vote_count,popularity,revenue,imdb_id,directors,cast
0,19995,Avatar,2009-12-10,162.0,en,"[Action, Adventure, Fantasy, Science Fiction]","[United States of America, United Kingdom]",237000000,7.2,11800,150.437577,2787965087,0,[James Cameron],"[Sam Worthington, Zoe Saldana, Stephen Lang]"
1,285,Pirates of the Caribbean: At World's End,2007-05-19,169.0,en,"[Adventure, Fantasy, Action]",[United States of America],300000000,6.9,4500,139.082615,961000000,0,[Gore Verbinski],"[Johnny Depp, Orlando Bloom, Keira Knightley]"
2,206647,Spectre,2015-10-26,148.0,en,"[Action, Adventure, Crime]","[United Kingdom, United States of America]",245000000,6.3,4466,107.376788,880674609,0,[Sam Mendes],"[Daniel Craig, Christoph Waltz, Léa Seydoux]"
3,49026,The Dark Knight Rises,2012-07-16,165.0,en,"[Action, Crime, Drama, Thriller]",[United States of America],250000000,7.6,9106,112.31295,1084939099,0,[Christopher Nolan],"[Christian Bale, Michael Caine, Gary Oldman]"
4,49529,John Carter,2012-03-07,132.0,en,"[Action, Adventure, Science Fiction]",[United States of America],260000000,6.1,2124,43.926995,284139100,0,[Andrew Stanton],"[Taylor Kitsch, Lynn Collins, Samantha Morton]"


In [294]:
df6 = df.copy()

In [290]:
directors_dummies = df['directors'].str.join(sep='*').str.get_dummies(sep='*')
directors_dummies.head()

Unnamed: 0,Aaron Schneider,Aaron Seltzer,Abel Ferrara,Adam Carolla,Adam Goldberg,Adam Green,Adam Marcus,Adam McKay,Adam Rapp,Adam Rifkin,...,Xavier Gens,Yash Chopra,Youssef Delara,Zach Braff,Zach Cregger,Zack Snyder,Zak Penn,Zal Batmanglij,Zhang Yimou,Émile Gaudreault
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [291]:
cast_dummies = df['cast'].str.join(sep='*').str.get_dummies(sep='*')
cast_dummies.head()

Unnamed: 0,"""Weird Al"" Yankovic",50 Cent,A.J. Cook,Aaliyah,Aamir Khan,Aaran Thomas,Aaron Abrams,Aaron Eckhart,Aaron Kwok,Aaron Paul,...,Zoe Saldana,Zoe Sloane,Zooey Deschanel,Zoé Félix,Zoë Bell,Zoë Kravitz,Zubaida Sahar,Zuleikha Robinson,Óscar Jaenada,Đỗ Thị Hải Yến
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [287]:
countries_dummies = df['production_countries'].str.join(sep='*').str.get_dummies(sep='*')
countries_dummies.head()

Unnamed: 0,Afghanistan,Angola,Argentina,Aruba,Australia,Austria,Bahamas,Belgium,Bhutan,Bosnia and Herzegovina,...,Spain,Sweden,Switzerland,Taiwan,Thailand,Tunisia,Ukraine,United Arab Emirates,United Kingdom,United States of America
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [289]:
genres_dummies = df['genres'].str.join(sep='*').str.get_dummies(sep='*')
genres_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [296]:
df6_1 = pd.merge(df6, genres_dummies, left_index=True, right_index=True)

In [298]:
df6_2 = pd.merge(df6_1, countries_dummies, left_index=True, right_index=True)

In [300]:
df6_3 = pd.merge(df6_2, directors_dummies, left_index=True, right_index=True)

In [301]:
df6_4 = pd.merge(df6_3, cast_dummies, left_index=True, right_index=True)

In [302]:
df6_4.shape

(4039, 6432)
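
Chained index merges work, but `pd.merge` renames any column label present in both frames with `_x`/`_y` suffixes, which could happen if the same person appears both as a director and as a cast member. The following is a sketch of one possible alternative, using hypothetical toy frames: prefix each indicator block and concatenate once along the columns.

```python
import pandas as pd

# Hypothetical toy data standing in for df6 and the dummy frames.
base = pd.DataFrame({"id": [1, 2], "budget": [100, 200]})
directors = pd.DataFrame({"Person A": [1, 0]})
cast = pd.DataFrame({"Person A": [1, 0], "Person B": [0, 1]})

# Index merge: overlapping labels are renamed with _x/_y suffixes.
merged = pd.merge(directors, cast, left_index=True, right_index=True)
print(list(merged.columns))

# Prefixing each block keeps the role explicit and the labels unique.
wide = pd.concat(
    [base, directors.add_prefix("dir_"), cast.add_prefix("cast_")],
    axis=1,
)
print(wide.shape)
```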

In [305]:
df = df6_4.copy()

In [306]:
df6.dtypes

id                               int64
title                           object
release_date            datetime64[ns]
runtime                        float64
original_language               object
genres                          object
production_countries            object
budget                           int64
vote_average                   float64
vote_count                       int64
popularity                     float64
revenue                          int64
imdb_id                         object
directors                       object
cast                            object
dtype: object

In [307]:
df.drop(['title','original_language', 'genres','production_countries','imdb_id','directors','cast'], axis=1, inplace=True)

In [316]:
df.dtypes.head(10)

id                       int64
release_date    datetime64[ns]
runtime                float64
budget                   int64
vote_average           float64
vote_count               int64
popularity             float64
revenue                  int64
Action                   int64
Adventure                int64
dtype: object

In [315]:
df7 = df.copy()

In [317]:
df8 = df6_2.copy()

In [318]:
df = df8.copy()

In [319]:
df.drop(['title','original_language', 'genres','production_countries','imdb_id','directors','cast'], axis=1, inplace=True)

In [320]:
df8 = df.copy()

In [321]:
df8.head()

Unnamed: 0,id,release_date,runtime,budget,vote_average,vote_count,popularity,revenue,Action,Adventure,...,Spain,Sweden,Switzerland,Taiwan,Thailand,Tunisia,Ukraine,United Arab Emirates,United Kingdom,United States of America
0,19995,2009-12-10,162.0,237000000,7.2,11800,150.437577,2787965087,1,1,...,0,0,0,0,0,0,0,0,1,1
1,285,2007-05-19,169.0,300000000,6.9,4500,139.082615,961000000,1,1,...,0,0,0,0,0,0,0,0,0,1
2,206647,2015-10-26,148.0,245000000,6.3,4466,107.376788,880674609,1,1,...,0,0,0,0,0,0,0,0,1,1
3,49026,2012-07-16,165.0,250000000,7.6,9106,112.31295,1084939099,1,0,...,0,0,0,0,0,0,0,0,0,1
4,49529,2012-03-07,132.0,260000000,6.1,2124,43.926995,284139100,1,1,...,0,0,0,0,0,0,0,0,0,1


In [334]:
df9 = df8.drop('release_date', axis=1)

In [328]:
df8a = np.array(df8)
df8a.shape

(4039, 100)

In [2]:
df9a = np.array(df9)
df9a.shape

NameError: name 'np' is not defined

In [337]:
# pc = PCA(df9a, ncomp=20)
# pc.factors.shape

(4039, 20)

In [1]:
# X = df9.values
# X_std = StandardScaler().fit_transform(X)


NameError: name 'df9' is not defined

In [None]:
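# from sklearn.decomposition import PCA as SkPCA  # the statsmodels PCA imported above takes ncomp and has no fit_transform
# pca = SkPCA(n_components=9)
# x_9d = pca.fit_transform(X_std)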
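
The commented-out cells above standardize the feature matrix and then project it onto a few principal components. Here is a minimal sketch of that pipeline with scikit-learn, assuming a random toy matrix in place of `df9.values`; `n_components=3` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy matrix with columns on very different scales (stand-in for df9.values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6)) * np.array([1, 10, 100, 1, 1, 1000])

# Standardize first so large-scale columns do not dominate the variance.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_std)

print(X_3d.shape)
print(pca.explained_variance_ratio_.sum())
```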

In [215]:
# url = 'https://www.imdb.com/title/tt0396592/'
# html = requests.get(url).content
# soup = BeautifulSoup(html, 'html.parser')
# soup

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///title/tt0396592?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>El gran Alberto (2004) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.im

In [145]:
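# budget = ''
# for element in soup.find_all('div', class_="txt-block"):
#     text = element.text.strip()
#     if text.startswith('Budget:'):
#         # lstrip/rstrip remove character sets, not prefixes, so cut the "(estimated)" note and label explicitly
#         budget = int(text.split('(')[0].replace('Budget:', '').replace('$', '').replace(',', '').strip())

# budget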

109000000

In [216]:
# budget = ''
# for element in soup.find_all('div', class_="txt-block"):
#     if element.text.strip('\n').startswith('Cumulative Worldwide Gross: $'):
#         budget = int(element.text.strip('\n').lstrip('Cumulative Worldwide Gross: $').rstrip('\n        ').replace(',', ''))
        
# budget

ValueError: invalid literal for int() with base 10: '48551322 31 December 2005'

In [219]:
# budget = ''
# for element in soup.find_all('div', class_="txt-block"):
#     if element.text.strip('\n').startswith('Cumulative Worldwide Gross: $'):
#         budget = int(element.text.strip('\n').lstrip('Cumulative Worldwide Gross: $').split(' ')[0].replace(',', ''))
        
# budget

48551322

In [211]:
# budget = ''
# for element in soup.find_all('div', class_="txt-block"):
#     if element.text.strip('\n').startswith('Gross USA: $'):
#         budget = int(element.text.strip('\n').lstrip('Gross USA: $').rstrip('\n        ').replace(',', ''))
        
# budget

ValueError: invalid literal for int() with base 10: '93607673 28 October 2001'

In [217]:

# text = [element.text.strip('\n') for element in soup.find_all('div', class_="txt-block")]
# clean_names = []
# text
# for element in text:
#     clean_names.append(element.replace("\n\n                          ", "").replace("\n                            \n                              ", " ").replace("\n                            \n ", ""))

# clean_names

['Taglines:\nHey! Hey! Hey! Christmas Day        ',
 'Motion Picture Rating\n                    (MPAA)\n                \nRated PG for momentary language\n| \n See all certifications\xa0»\n            ',
 'Parents Guide:\n\n View content advisory\xa0»\n        ',
 'Country:\nUSA',
 'Language:\nEnglish',
 'Release Date: 5 August 2005 (Spain)\n    \nSee more\xa0»\n    ',
 'Also Known As: El gran Alberto\n      \nSee more\xa0»\n      ',
 'Filming Locations:\nSanta Clarita, California, USA\n\nSee more\xa0»\n      ',
 'Budget:$26,000,000\n            (estimated)',
 'Opening Weekend USA: $10,021,510,\n26 December 2004, Wide Release',
 'Gross USA: $48,116,322, 7 April 2005',
 'Cumulative Worldwide Gross: $48,551,322, 31 December 2005',
 'Production Co:\n Twentieth Century Fox, Davis Entertainment, The Culver Studios \nSee more\xa0»\n      ',
 '  Show more on\n  IMDbPro\xa0»\n  ',
 'Runtime:\n93 min',
 'Sound Mix:\nDolby Digital\n| DTS',
 'Color:\nColor',
 'Aspect Ratio: 2.39 : 1\n    ',
 'Tr
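
The `ValueError`s in the earlier cells come from the trailing date that stays in the string after the label is stripped. Below is a sketch of a more tolerant parser, assuming text blocks shaped like the ones listed above; the helper name `parse_amount` is hypothetical, and it only handles `$`-denominated values.

```python
import re

def parse_amount(block, label):
    """Return the dollar amount that follows `label` in `block`, or None."""
    # Match the label, an optional gap, a '$', then a run of digits and commas.
    pattern = re.escape(label) + r"\s*\$\s*([\d,]+)"
    match = re.search(pattern, block)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None

# Sample blocks copied from the scraped output above.
blocks = [
    "Budget:$26,000,000\n            (estimated)",
    "Gross USA: $48,116,322, 7 April 2005",
    "Cumulative Worldwide Gross: $48,551,322, 31 December 2005",
]

budget = parse_amount(blocks[0], "Budget:")
gross_usa = parse_amount(blocks[1], "Gross USA:")
worldwide = parse_amount(blocks[2], "Cumulative Worldwide Gross:")
print(budget, gross_usa, worldwide)
```

Because the pattern stops at the first character that is neither a digit nor a comma, anything after the amount (dates, "(estimated)") is ignored.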

In [None]:
# def get_cast(row):
#     # Return the first three cast names for the row's TMDB id, padding with 0 when fewer are credited.
#     id1 = str(row['id'])
#     response = requests.get('https://api.themoviedb.org/3/movie/'+id1+'/credits?api_key='+SECRET_KEY)
#     results = response.json()
#     if 'cast' in results.keys():
#         cast = []
#         for i in range(3):
#             if len(results['cast'])>i:
#                 cast.append(results['cast'][i]['name'])
#             else: cast.append(0)
#         return cast
#     else: return 0

# df.head().apply(get_cast, axis=1)

In [None]:



# load_dotenv(dotenv_path='.env')
# SECRET_KEY = os.getenv("TOKEN")

In [126]:
# id1 = str(19995)
# response = requests.get('https://api.themoviedb.org/3/movie/'+id1+'/external_ids?api_key='+SECRET_KEY)
# results = response.json()
# results["imdb_id"]

'tt0499549'
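
The request URL above is built by string concatenation. A small sketch of building the same URL with the standard library, so the key is URL-encoded; the helper `external_ids_url` and the placeholder key are hypothetical, while the endpoint path is the one used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/movie"

def external_ids_url(movie_id, api_key):
    """Build the external_ids URL for one TMDB movie id."""
    query = urlencode({"api_key": api_key})
    return f"{BASE_URL}/{movie_id}/external_ids?{query}"

# "DUMMY_KEY" is a placeholder, not a real credential.
url = external_ids_url(19995, "DUMMY_KEY")
print(url)
```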

In [123]:

# response = requests.get('https://api.themoviedb.org/3/movie/11516/credits?api_key='+SECRET_KEY)
# results = response.json()
# results.keys()

dict_keys(['id', 'cast', 'crew'])

In [124]:
# results['cast']

[{'cast_id': 7,
  'character': 'Narrator (voice, English version)',
  'credit_id': '52fe44509251416c7503056b',
  'gender': 0,
  'id': 69681,
  'name': 'Philippe Labro',
  'order': 0,
  'profile_path': None},
 {'cast_id': 8,
  'character': 'Narrator (voice)',
  'credit_id': '52fe44509251416c7503056f',
  'gender': 2,
  'id': 20030,
  'name': 'Jacques Perrin',
  'order': 1,
  'profile_path': '/6qegj3UL0YKMARVakEDSFTXbMRk.jpg'}]

In [115]:
# direct = results['cast'][i]['name']
# direct = [(results['cast'][i]['name']) for i in range(3)]
# direct

IndexError: list index out of range
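
The `IndexError` arises because this movie credits only two cast members, so `range(3)` runs past the end of the list. Slicing avoids that, since a slice never raises on a short list. A sketch, assuming a payload shaped like the `results['cast']` output shown earlier; the helper `top_cast` is hypothetical.

```python
def top_cast(results, n=3):
    """Return up to `n` cast names from a credits payload, in list order."""
    cast = results.get("cast", [])
    return [member["name"] for member in cast[:n]]

# Trimmed copy of the two-member cast list printed above.
results = {
    "cast": [
        {"name": "Philippe Labro", "order": 0},
        {"name": "Jacques Perrin", "order": 1},
    ],
    "crew": [],
}
print(top_cast(results))
```

Unlike the padded version, this returns a shorter list when fewer than `n` names exist, so downstream code should not assume a fixed length.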

In [104]:
# direct = results['crew']
# lista = []
# for x in direct:
#     if x['department'] == 'Directing':
#         lista.append(x['name'])
#         break
# direct

[{'credit_id': '52fe43b29251416c7501aa63',
  'department': 'Writing',
  'gender': 2,
  'id': 455,
  'job': 'Screenplay',
  'name': 'Paul Haggis',
  'profile_path': '/v2MadHJVrBAv0tEgs9H9PJ9y50z.jpg'},
 {'credit_id': '52fe43b29251416c7501aa69',
  'department': 'Writing',
  'gender': 0,
  'id': 10782,
  'job': 'Screenplay',
  'name': 'Neal Purvis',
  'profile_path': '/eossmqcHU9EeV28foEiYLOpxjST.jpg'},
 {'credit_id': '52fe43b29251416c7501aa6f',
  'department': 'Writing',
  'gender': 2,
  'id': 10783,
  'job': 'Screenplay',
  'name': 'Robert Wade',
  'profile_path': '/txwb1hbBMiNuvJA8LvcvsjiIML8.jpg'},
 {'credit_id': '52fe43b29251416c7501aa75',
  'department': 'Production',
  'gender': 1,
  'id': 10666,
  'job': 'Producer',
  'name': 'Barbara Broccoli',
  'profile_path': None},
 {'credit_id': '52fe43b29251416c7501aa7b',
  'department': 'Production',
  'gender': 2,
  'id': 10876,
  'job': 'Executive Producer',
  'name': 'Callum McDougall',
  'profile_path': None},
 {'credit_id': '52fe43b29