<a href="https://colab.research.google.com/github/X4Zero/SISTEMAS_DE_RECOMENDACION/blob/master/SistemaRecomendacion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RECOMMENDER SYSTEMS
First week at Hackspace; the first project I will build is a recommender system.  
[Tutorial link](https://www.datacamp.com/community/tutorials/recommender-systems-python)

According to the tutorial, there are three types of recommender systems:

*   Simple recommenders: for movies, these recommend the titles with the best review scores, since those are the most likely to appeal to the average viewer.
*   Content-based recommenders: these use the metadata of items a user has liked before, on the assumption that the user will like similar items.
*   Collaborative filtering engines: these predict a user's preference or rating for an item from the preferences or ratings of other users.




## DATASET
Contains metadata for about 45,000 movies from the Full MovieLens Dataset.

The dataset contains:


*   movies_metadata.csv: This file contains information on ~45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.
*   keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
*   credits.csv: Consists of Cast and Crew Information for all the movies. Available in the form of a stringified JSON Object.
*   links.csv: This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
*   links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
*   ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.


[Link to the full dataset](https://grouplens.org/datasets/movielens/latest/)

[Link to the dataset used for this work](https://www.kaggle.com/rounakbanik/the-movies-dataset/data)

## SIMPLE RECOMMENDER

In [1]:
# Import Pandas
import pandas as pd
import matplotlib.pyplot as plt

# Load Movies Metadata
ruta_base = '/content/drive/My Drive/Colab Notebooks/HACKSPACE/SEMANA1/The_Movies_Dataset/'
metadata = pd.read_csv(ruta_base+'movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [None]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [None]:
registros, columnas = metadata.shape
print("{} rows".format(registros))
print("{} columns".format(columnas))

45466 rows
24 columns


Since you are trying to build a clone of IMDB's Top 250, let's use its weighted rating formula as a metric/score. Mathematically, it is represented as follows:

\begin{equation}
\text{Weighted Rating } (WR) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)
\end{equation}

In the above equation,

v is the number of votes for the movie;

m is the minimum votes required to be listed in the chart;

R is the average rating of the movie;

C is the mean vote across the whole report.

You already have the values to v (vote_count) and R (vote_average) for each movie in the dataset. It is also possible to directly calculate C from this data.
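
As a quick sanity check of the formula, the arithmetic can be worked out for a single movie. The sketch below uses the vote count and average that this notebook lists for The Shawshank Redemption (v = 8358, R = 8.5), together with the m and C values the notebook computes (m = 160, C ≈ 5.618). The helper name `weighted_rating_value` is only illustrative and does not come from the tutorial.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating for a single movie."""
    # The two weights v/(v+m) and m/(v+m) always sum to 1
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values copied from this notebook's own outputs
wr = weighted_rating_value(v=8358, R=8.5, m=160.0, C=5.618207215133889)
print(round(wr, 4))
```

With more than 8,000 votes against a threshold of 160, the movie's own average dominates, so the result stays close to 8.5 (about 8.4459).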

In [None]:
# Calculate the mean of the vote_average column
# The average rating of a movie in this dataset is about 5.62 on a scale of 0 to 10
C = metadata['vote_average'].mean()
print(C)

5.618207215133889


In [None]:
# Calculate the minimum number of votes required to be in the chart, m
# i.e. the number of votes received by the movie at the 90th percentile
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


In [None]:
# summary statistics for the vote counts
metadata['vote_count'].describe()

count    45460.000000
mean       109.897338
std        491.310374
min          0.000000
25%          3.000000
50%         10.000000
75%         34.000000
max      14075.000000
Name: vote_count, dtype: float64

In [None]:
# Filter out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(4555, 24)

In [None]:
metadata.shape

(45466, 24)

In [None]:
# Compute the score for each of the qualified movies
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [None]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


We can see several matches between the top positions of IMDB's Top 250 and the top 20 movies produced by the simple recommender.
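
The whole simple-recommender pipeline (mean rating, vote cut-off, filtering, scoring, sorting) can also be sketched on a tiny made-up table. The data and the names `toy`, `C_toy` and `m_toy` are assumptions for illustration only. The score is computed with vectorized column arithmetic rather than `apply`, which is a stylistic alternative and not the tutorial's approach.

```python
import pandas as pd

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_count': [1000, 10, 500, 2],
    'vote_average': [8.0, 9.5, 7.0, 10.0],
})

C_toy = toy['vote_average'].mean()
# Looser cut-off than the notebook's 0.90 because the toy table is tiny
m_toy = toy['vote_count'].quantile(0.50)

# Keep only movies with enough votes, then score them column-wise
qualified = toy.loc[toy['vote_count'] >= m_toy].copy()
v = qualified['vote_count']
R = qualified['vote_average']
qualified['score'] = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy

chart = qualified.sort_values('score', ascending=False)
print(chart[['title', 'vote_count', 'vote_average', 'score']])
```

Movies B and D have the highest raw averages but too few votes to qualify, which is exactly the effect the vote threshold is meant to have.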

## CONTENT-BASED RECOMMENDER

In [None]:
#Print plot overviews of the first 5 movies.
#(the overview is the description of each movie's plot)
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [None]:
metadata

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [None]:
# subset of movies (the first 25,000 rows)
metadata_subset = metadata[0:25000].copy()

In [None]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata_subset['overview'] = metadata_subset['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata_subset['overview'])

#Output the shape of tfidf_matrix
print(tfidf_matrix.shape)

print("{} movies, {} words".format(tfidf_matrix.shape[0],tfidf_matrix.shape[1]))

(25000, 53130)
25000 movies, 53130 words


In [None]:
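#Array mapping from feature integer indices to feature name.
#Note: scikit-learn 1.2 removed get_feature_names(); on newer versions call get_feature_names_out() instead.
tfidf.get_feature_names()[5000:5010]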

['bertsolari',
 'berwick',
 'beryl',
 'berylune',
 'berzan',
 'berzano',
 'besa',
 'besco',
 'beseeches',
 'beseiged']

In [None]:
tfidf_matrix.shape

(25000, 53130)

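We compute the similarity score using cosine similarity. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` is already the cosine similarity, and it is faster to compute.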
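
The equivalence between the dot product and cosine similarity on TF-IDF rows can be checked on a few made-up overviews. This is a minimal sketch: the sentences and the variable names are assumptions for illustration, and it relies on `TfidfVectorizer` keeping its default `norm='l2'`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented overviews, not taken from the dataset
toy_overviews = [
    "A cowboy doll feels threatened by a new spaceman toy",
    "Two siblings discover an enchanted board game",
    "A spaceman toy and a cowboy doll become friends",
]

# Rows are L2-normalized by default, so each row has unit length
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

same = np.allclose(dot_products, cosines)
print(same)
```

If the vectorizer were created with `norm=None`, the two matrices would no longer agree and `cosine_similarity` would be the right call.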

In [None]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim.shape

(25000, 25000)

In [None]:
cosine_sim[1]

array([0.01586868, 1.        , 0.04878944, ..., 0.01340316, 0.        ,
       0.        ])

In [None]:
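#Construct a reverse map from movie titles to row indices
indices = pd.Series(metadata_subset.index, index=metadata_subset['title'])
#Keep only the first row for each repeated title so that indices[title] is always a single integer
indices = indices[~indices.index.duplicated(keep='first')]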

In [None]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64
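
One caveat about title-to-index maps like this one: `Series.drop_duplicates()` compares the values (here, the row numbers), not the index labels, so repeated titles survive it. When two movies share a title, a lookup by that title then returns several rows instead of one integer. The toy titles below are made up to show the difference. Filtering with `Index.duplicated` is one possible remedy, not the tutorial's code.

```python
import pandas as pd

# Two different movies that happen to share the title 'Heat'
toy_titles = pd.Series(['Heat', 'Sabrina', 'Heat'], name='title')
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

# The values 0, 1, 2 are all distinct, so nothing is dropped
after_drop = toy_indices.drop_duplicates()
print(len(after_drop))

# Deduplicate on the index labels instead, keeping the first occurrence
unique_indices = toy_indices[~toy_indices.index.duplicated(keep='first')]
print(unique_indices['Heat'])
```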

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata_subset['title'].iloc[movie_indices]

In [None]:
metadata_subset[metadata_subset['title'].str.contains('The Dark Knight',na=False)]['title']

12481                            The Dark Knight
18252                      The Dark Knight Rises
19792    Batman: The Dark Knight Returns, Part 1
20232    Batman: The Dark Knight Returns, Part 2
Name: title, dtype: object

This system does a decent job of finding movies with similar plot descriptions, but the quality of the recommendations is limited. "The Dark Knight" returns only Batman movies, whereas people who liked that movie are probably more inclined to enjoy other movies by the same director, Christopher Nolan.

In [None]:
get_recommendations('The Dark Knight')

18252                                The Dark Knight Rises
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
21194    Batman Unmasked: The Psychology of the Dark Kn...
20232              Batman: The Dark Knight Returns, Part 2
150                                         Batman Forever
19792              Batman: The Dark Knight Returns, Part 1
585                                                 Batman
18035                                     Batman: Year One
9230                    Batman Beyond: Return of the Joker
Name: title, dtype: object

In [None]:
get_recommendations('The Godfather')

1178      The Godfather: Part II
1914     The Godfather: Part III
23126                 Blood Ties
11297           Household Saints
10821                   Election
17729          Short Sharp Shock
8653                Violent City
13177               I Am the Law
6977             Queen of Hearts
6711                    Mobsters
Name: title, dtype: object

In [None]:
get_recommendations('Toy Story')

15348                    Toy Story 3
2997                     Toy Story 2
10301         The 40 Year Old Virgin
24523                      Small Fry
23843    Andy Hardy's Blonde Trouble
8327                       The Champ
1071           Rebel Without a Cause
11399         For Your Consideration
1932                       Condorman
21359       Andy Hardy's Double Life
Name: title, dtype: object

## RECOMMENDER BASED ON CREDITS, GENRES AND KEYWORDS

In [None]:
# Load keywords and credits
credits = pd.read_csv(ruta_base + 'credits.csv')
keywords = pd.read_csv(ruta_base + 'keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

In [None]:
# Print the first two movies of your newly merged metadata
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [None]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

In [None]:
# Import Numpy
import numpy as np

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [None]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [None]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"
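
The parsing steps above can be traced on a single hand-made record. The crew and cast strings below are invented (the names are placeholders, not dataset values). The helpers mirror `get_director` and `get_list` from this notebook, with `get_list` written slightly more compactly.

```python
from ast import literal_eval
import numpy as np

# Hypothetical stringified records in the same shape as credits.csv
raw_crew = "[{'job': 'Producer', 'name': 'Pat Example'}, {'job': 'Director', 'name': 'Jane Doe'}]"
raw_cast = "[{'name': 'Actor One'}, {'name': 'Actor Two'}, {'name': 'Actor Three'}, {'name': 'Actor Four'}]"

def get_director(crew):
    # Return the first crew member whose job is 'Director'
    for member in crew:
        if member['job'] == 'Director':
            return member['name']
    return np.nan

def get_list(items):
    # Keep at most the first three names; fall back to [] for malformed data
    if isinstance(items, list):
        return [item['name'] for item in items][:3]
    return []

crew = literal_eval(raw_crew)   # string -> list of dicts
cast = literal_eval(raw_cast)

director = get_director(crew)
top_cast = get_list(cast)
print(director, top_cast)
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a CSV file.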


In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

In [None]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[tomhanks, timallen, donrickles]",johnlasseter,"[jealousy, toy, boy]","[animation, comedy, family]"
1,Jumanji,"[robinwilliams, jonathanhyde, kirstendunst]",joejohnston,"[boardgame, disappearance, basedonchildren'sbook]","[adventure, fantasy, family]"
2,Grumpier Old Men,"[waltermatthau, jacklemmon, ann-margret]",howarddeutch,"[fishing, bestfriend, duringcreditsstinger]","[romance, comedy]"
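
Why strip the spaces? `CountVectorizer` splits on whitespace, so without this step two different people who share a first name would look partly similar. The sketch below uses two example names chosen for illustration and compares their cosine similarity before and after cleaning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

raw = ["Tom Hanks", "Tom Cruise"]
# Same cleaning as clean_data: lower-case and remove spaces
cleaned = [name.lower().replace(" ", "") for name in raw]

# Before cleaning the token 'tom' is shared, so the two names overlap
raw_sim = cosine_similarity(CountVectorizer().fit_transform(raw))[0, 1]

# After cleaning each person becomes one unique token
clean_sim = cosine_similarity(CountVectorizer().fit_transform(cleaned))[0, 1]

print(raw_sim, clean_sim)
```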


In [None]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [None]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [None]:
metadata[['soup']].head(2)

Unnamed: 0,soup
0,jealousy toy boy tomhanks timallen donrickles ...
1,boardgame disappearance basedonchildren'sbook ...


In [None]:
# use only part of the movies because of RAM limits
metadata_subset = metadata[:25000].copy()

In [None]:
metadata_subset['soup']

0        jealousy toy boy tomhanks timallen donrickles ...
1        boardgame disappearance basedonchildren'sbook ...
2        fishing bestfriend duringcreditsstinger walter...
3        basedonnovel interracialrelationship singlemot...
4        baby midlifecrisis confidence stevemartin dian...
                               ...                        
24995    musical englishchannel swimmer estherwilliams ...
24996    hell basedonsong,poemorrhyme salvatorepapa art...
24997     tonyabatemarco scottadsit mattbesser seanmere...
24998    prison bankrobbery fingerprints sidneytoler ma...
24999                     davidspade keithtruesdell comedy
Name: soup, Length: 25000, dtype: object

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
# count_matrix = count.fit_transform(metadata['soup'])
count_matrix = count.fit_transform(metadata_subset['soup'])

In [None]:
count_matrix.shape

(25000, 44442)

In [None]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [None]:
# Reset index of your main DataFrame and construct reverse mapping as before
# metadata = metadata.reset_index()
# indices = pd.Series(metadata.index, index=metadata['title'])
metadata_subset = metadata_subset.reset_index()
indices = pd.Series(metadata_subset.index, index=metadata_subset['title'])

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations_sub(title, cosine_sim=cosine_sim2):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata_subset['title'].iloc[movie_indices]

In [None]:
get_recommendations_sub('The Dark Knight Rises', cosine_sim2)

12589      The Dark Knight
10210        Batman Begins
9311                Shiner
9874       Amongst Friends
7772              Mitchell
516      Romeo Is Bleeding
11463         The Prestige
24090            Quicksand
10853       Helter Skelter
18940            Last Exit
Name: title, dtype: object

In [None]:
get_recommendations_sub('The Godfather', cosine_sim2)

In [None]:
get_recommendations_sub('Toy Story', cosine_sim2)

In [None]:
get_recommendations_sub('Superman', cosine_sim2)

In [None]:
metadata_subset[metadata_subset['title'].str.contains('Superman',na=False)]['title']

## Preparing the content-based recommender for a Flask web app

In [2]:
# Import Pandas
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load Movies Metadata
ruta_base = '/content/drive/My Drive/Colab Notebooks/HACKSPACE/SEMANA1/The_Movies_Dataset/'
metadata_prod = pd.read_csv(ruta_base+'movies_metadata.csv', low_memory=False)

# Load keywords and credits
credits = pd.read_csv(ruta_base + 'credits.csv')
keywords = pd.read_csv(ruta_base + 'keywords.csv')

# Remove rows with bad IDs.
metadata_prod = metadata_prod.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata_prod['id'] = metadata_prod['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata_prod = metadata_prod.merge(credits, on='id')
metadata_prod = metadata_prod.merge(keywords, on='id')

In [37]:
######################
# REQUIRED FUNCTIONS #
######################

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan


def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

def get_list_value(x,col_value):
    if isinstance(x, list):
        # Unlike get_list, keep every element (no truncation to the first three)
        return [i[col_value] for i in x]

    #Return empty list in case of missing/malformed data
    return []


# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
  

def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])


# Function that takes in movie title as input and outputs most similar movies
def obtener_recomendaciones(title, cosine_sim,indices,df):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [4]:
####################
# DATA PREPARATION #
####################

# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata_prod[feature] = metadata_prod[feature].apply(literal_eval)


# Define new director, cast, genres and keywords features that are in a suitable form.
metadata_prod['director'] = metadata_prod['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata_prod[feature] = metadata_prod[feature].apply(get_list)


# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata_prod[feature] = metadata_prod[feature].apply(clean_data)


# Create a new soup feature
metadata_prod['soup'] = metadata_prod.apply(create_soup, axis=1)

Because of memory constraints, we will not work with all 45,000 movies, only with a subset of the data.
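
To see where the memory goes, it helps to estimate the size of a dense similarity matrix: n × n values at 8 bytes each (float64). The sketch below does that arithmetic and also shows one possible workaround, which is an assumption rather than what this notebook does: computing a single row of similarities on demand instead of storing the full matrix. The helper names and toy documents are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def dense_similarity_bytes(n_items, bytes_per_value=8):
    # A dense n x n float64 matrix stores n * n values of 8 bytes each
    return n_items * n_items * bytes_per_value

for n in (25000, 45466):
    print(n, round(dense_similarity_bytes(n) / 1024**3, 2), "GiB")

def top_k_similar(row_idx, matrix, k=10):
    # Similarities of one item against all items: a 1 x n result, not n x n
    scores = linear_kernel(matrix[row_idx], matrix).ravel()
    scores[row_idx] = -1.0  # exclude the item itself
    return np.argsort(-scores)[:k].tolist()

# Tiny invented corpus to exercise the function
docs = [
    "batman fights joker gotham",
    "batman returns gotham penguin",
    "toys come alive andy",
    "mafia family new york",
]
toy_matrix = TfidfVectorizer().fit_transform(docs)
nearest = top_k_similar(0, toy_matrix, k=1)
print(nearest)
```

At 25,000 movies the dense matrix already needs about 5 GB, and the full set of 45,466 rows would need more than three times that, which is consistent with the memory problems mentioned above.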

In [149]:
# Total number of movies
print('Number of movies: {}'.format(metadata_prod.shape[0]))

metadata_prod['release_date'] = pd.to_datetime(metadata_prod['release_date'])
# Keep only movies released after 1994-01-01
metadata_prod = metadata_prod[metadata_prod['release_date']>'1994-01-01']
numero_peliculas = metadata_prod.shape[0]
print('Number of movies released after 1994-01-01: {}'.format(numero_peliculas))
metadata_prod = metadata_prod.reset_index()
# They still take up 

Number of movies: 28656
Number of movies released after 1994-01-01: 28656


In [150]:
metadata_prod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28656 entries, 0 to 28655
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   index                  28656 non-null  int64         
 1   adult                  28656 non-null  object        
 2   belongs_to_collection  2760 non-null   object        
 3   budget                 28656 non-null  object        
 4   genres                 28656 non-null  object        
 5   homepage               7642 non-null   object        
 6   id                     28656 non-null  int64         
 7   imdb_id                28645 non-null  object        
 8   original_language      28650 non-null  object        
 9   original_title         28656 non-null  object        
 10  overview               28010 non-null  object        
 11  popularity             28656 non-null  object        
 12  poster_path            28417 non-null  object        
 13  p

### Exploring the languages

In [151]:
metadata_prod['spoken_languages']

0                 [{'iso_639_1': 'en', 'name': 'English'}]
1        [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
2                 [{'iso_639_1': 'en', 'name': 'English'}]
3                 [{'iso_639_1': 'en', 'name': 'English'}]
4                 [{'iso_639_1': 'en', 'name': 'English'}]
                               ...                        
28651             [{'iso_639_1': 'en', 'name': 'English'}]
28652             [{'iso_639_1': 'en', 'name': 'English'}]
28653                    [{'iso_639_1': 'tl', 'name': ''}]
28654             [{'iso_639_1': 'en', 'name': 'English'}]
28655             [{'iso_639_1': 'en', 'name': 'English'}]
Name: spoken_languages, Length: 28656, dtype: object
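Each entry prints like a list of dictionaries, but the column actually holds strings. A hedged sketch of how such a value could be parsed and reduced to its ISO codes follows. `extract_codes` is a hypothetical helper, not the notebook's `get_list_value`, and the sample string is made up to match the format shown above.

```python
from ast import literal_eval


def extract_codes(raw, key="iso_639_1"):
    """Parse a stringified list of dicts and return the values stored under `key`."""
    items = literal_eval(raw) if isinstance(raw, str) else raw
    if not isinstance(items, list):
        # Missing values (for example NaN) yield an empty list
        return []
    return [item[key] for item in items if isinstance(item, dict) and key in item]


sample = "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]"
print(extract_codes(sample))
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a CSV file.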

In [243]:
# print('Movie data - languages:\n{}'.format(metadata_prod['spoken_languages'].to_string()))

In [153]:
metadata_prod.shape

(28656, 31)

In [186]:
metadata_prod.iloc[28541]

index                                                                46447
adult                                                                False
belongs_to_collection                                                  NaN
budget                                                                   0
genres                                                    [romance, drama]
homepage                                                               NaN
id                                                                  182981
imdb_id                                                          tt2352044
original_language                                                       nl
original_title                                                   Nude Area
overview                 Naomi, a fifteen year-old Dutch girl from Sout...
popularity                                                        0.420688
poster_path                               /lcMorF1nIvnO4Zq3BG6YRc8MKkO.jpg
production_companies     

In [177]:
ruta_lenguajes = '/'.join(ruta_base.split('/')[:-2])+'/languages.csv'
df_lenguajes = pd.read_csv(ruta_lenguajes)
df_lenguajes.head(5)

Unnamed: 0,ISO language name,Native name (endonym),639-1
0,Abkhazian,"аҧсуа бызшәа, аҧсшәа",ab
1,Afar,Afaraf,aa
2,Afrikaans,Afrikaans,af
3,Akan,Akan,ak
4,Albanian,Shqip,sq


In [178]:
print('There are still {} movies in the dataset'.format(metadata_prod.shape[0]))


There are still 28656 movies in the dataset


In [179]:
# Get the ISO 639-1 codes of each movie's spoken languages
liso = metadata_prod['lang'].apply(lambda x: get_list_value(x,'iso_639_1')).values.tolist()
df_liso = pd.DataFrame(liso)

# Collect the ISO codes from every column so that each one is kept only once
lista_lenguajes = []

for col in df_liso.columns:
    lista_lenguajes += list(df_liso[col].unique())

set_lenguajes = set(lista_lenguajes)
# Shorter rows are padded with None; drop that placeholder if it is present
set_lenguajes.discard(None)

codigos_lenguajes = list(set_lenguajes)

cantidad_codigos_lenguajes = len(codigos_lenguajes)
print("There are {} ISO language codes".format(cantidad_codigos_lenguajes))

There are 129 ISO language codes
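As an alternative to padding a DataFrame and scanning it column by column, the distinct codes can be gathered straight from the per-movie lists. This is a sketch on made-up data, assuming each movie already has a list of codes.

```python
from itertools import chain

# Made-up per-movie code lists, shaped like the `liso` list above
liso_demo = [["en"], ["en", "fr"], [], ["tl"], ["fr", "de", "en"]]

# Flatten the nested lists and keep each code once
codigos_demo = sorted(set(chain.from_iterable(liso_demo)))
print(codigos_demo)
print("There are {} ISO language codes".format(len(codigos_demo)))
```

Because no padding is involved, there is no `None` placeholder to remove afterwards.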


In [180]:
codigos_lenguajes

['dz',
 'af',
 'bn',
 'ro',
 'nl',
 'ce',
 'bo',
 'id',
 'iu',
 'eu',
 'pl',
 'et',
 'fi',
 'sw',
 'el',
 'ms',
 'ko',
 'bm',
 'fy',
 'xx',
 'jv',
 'tr',
 'ku',
 'da',
 'bi',
 'sr',
 'he',
 'ki',
 'km',
 'ky',
 'tl',
 'de',
 'tn',
 'se',
 'sq',
 'st',
 'tt',
 'az',
 'ny',
 'no',
 'mt',
 'hy',
 'my',
 'cs',
 'gl',
 'ru',
 'lo',
 'fr',
 'mi',
 'sg',
 'uz',
 'sv',
 'ht',
 'kn',
 'wo',
 'sm',
 'hi',
 'lb',
 'oc',
 'ur',
 'as',
 'cr',
 'be',
 'en',
 'xh',
 'ha',
 'ar',
 'zu',
 'ab',
 'mn',
 'sl',
 'is',
 'gn',
 'lv',
 'yi',
 'lt',
 'es',
 'nv',
 'si',
 'to',
 'am',
 'vi',
 'ga',
 'gu',
 'ln',
 'ta',
 'pa',
 'hr',
 'sc',
 'cn',
 'sa',
 'ca',
 'ml',
 'th',
 'mk',
 'rw',
 'tk',
 'bs',
 'zh',
 'tg',
 'fa',
 'la',
 'qu',
 'kw',
 'uk',
 'gd',
 'cy',
 'sh',
 'fo',
 'ja',
 'kk',
 'nb',
 'ay',
 'eo',
 'sk',
 'pt',
 'bg',
 'co',
 'ka',
 'ne',
 'so',
 'ig',
 'te',
 'hu',
 'ug',
 'ps',
 'mr',
 'br',
 'it']

In [181]:
df_lenguajes = df_lenguajes.set_index('639-1')
df_lenguajes.columns = ['ISO language name','Native name']

In [187]:
diccionario_lenguajes = df_lenguajes.to_dict('index')
# 'xx' has to be added by hand
diccionario_lenguajes['xx'] = {'ISO language name':'No Language','Native name':''}
# 'cn' is not an ISO 639-1 code; it is mapped to Cantonese here
diccionario_lenguajes['cn'] = {'ISO language name':'Standard Cantonese','Native name':'广州话 / 廣州話'}
# 'sh' is the deprecated ISO 639-1 code for Serbo-Croatian
diccionario_lenguajes['sh'] = {'ISO language name':'Serbo-Croatian','Native name':''}
diccionario_lenguajes

{'aa': {'ISO language name': 'Afar', 'Native name': 'Afaraf'},
 'ab': {'ISO language name': 'Abkhazian',
  'Native name': 'аҧсуа бызшәа, аҧсшәа'},
 'ae': {'ISO language name': 'Avestan', 'Native name': 'avesta'},
 'af': {'ISO language name': 'Afrikaans', 'Native name': 'Afrikaans'},
 'ak': {'ISO language name': 'Akan', 'Native name': 'Akan'},
 'am': {'ISO language name': 'Amharic', 'Native name': 'አማርኛ'},
 'an': {'ISO language name': 'Aragonese', 'Native name': 'aragonés'},
 'ar': {'ISO language name': 'Arabic', 'Native name': 'العربية'},
 'as': {'ISO language name': 'Assamese', 'Native name': 'অসমীয়া'},
 'av': {'ISO language name': 'Avaric',
  'Native name': 'авар мацӀ, магӀарул мацӀ'},
 'ay': {'ISO language name': 'Aymara', 'Native name': 'aymar aru'},
 'az': {'ISO language name': 'Azerbaijani', 'Native name': 'azərbaycan dili'},
 'ba': {'ISO language name': 'Bashkir', 'Native name': 'башҡорт теле'},
 'be': {'ISO language name': 'Belarusian', 'Native name': 'беларуская мова'},
 'bg'
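Since some codes had to be added by hand, other unexpected codes could still be missing from the dictionary. A defensive lookup is sketched below. `nombre_lenguaje` is a hypothetical helper, and the three-entry dictionary only mimics the structure of `diccionario_lenguajes`.

```python
# Small stand-in with the same structure as diccionario_lenguajes
diccionario_demo = {
    "en": {"ISO language name": "English", "Native name": "English"},
    "te": {"ISO language name": "Telugu", "Native name": "తెలుగు"},
    "xx": {"ISO language name": "No Language", "Native name": ""},
}


def nombre_lenguaje(codigo, diccionario, default="Unknown"):
    """Return the ISO language name for `codigo`, or `default` if the code is absent."""
    return diccionario.get(codigo, {}).get("ISO language name", default)


for codigo in ["en", "te", "xx", "zz"]:
    print(codigo, "->", nombre_lenguaje(codigo, diccionario_demo))
```

With this lookup an unknown code produces a placeholder name instead of a `KeyError`.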

In [188]:
df_liso

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,en,,,,,,,,,,,,,,,,,,
1,en,fr,,,,,,,,,,,,,,,,,
2,en,,,,,,,,,,,,,,,,,,
3,en,,,,,,,,,,,,,,,,,,
4,en,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28651,en,,,,,,,,,,,,,,,,,,
28652,en,,,,,,,,,,,,,,,,,,
28653,tl,,,,,,,,,,,,,,,,,,
28654,en,,,,,,,,,,,,,,,,,,


In [189]:
df_liso['posicion'] = df_liso.index.values

In [190]:
df_liso_pro = df_liso.melt(id_vars='posicion').drop(columns='variable')
df_liso_pro

Unnamed: 0,posicion,value
0,0,en
1,1,en
2,2,en
3,3,en
4,4,en
...,...,...
544459,28651,
544460,28652,
544461,28653,
544462,28654,
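What `melt` does here is easier to see on a tiny frame. The sketch below uses made-up codes; the column names mirror the ones used above.

```python
import pandas as pd

# One row per movie, one column per language slot, None where a slot is unused
df_demo = pd.DataFrame([["en", None], ["en", "fr"], ["tl", None]])
df_demo["posicion"] = df_demo.index.values

# Wide to long: every (movie, slot) pair becomes its own row
largo = df_demo.melt(id_vars="posicion").drop(columns="variable")
print(largo)
```

Three movies with two slots each give six rows, and the unused slots stay in the result as missing values.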


In [191]:
posicion_v_lenguaje = pd.crosstab(df_liso_pro['posicion'],df_liso_pro['value'])
posicion_v_lenguaje

value,ab,af,am,ar,as,ay,az,be,bg,bi,bm,bn,bo,br,bs,ca,ce,cn,co,cr,cs,cy,da,de,dz,el,en,eo,es,et,eu,fa,fi,fo,fr,fy,ga,gd,gl,gn,...,qu,ro,ru,rw,sa,sc,se,sg,sh,si,sk,sl,sm,so,sq,sr,st,sv,sw,ta,te,tg,th,tk,tl,tn,to,tr,tt,ug,uk,ur,uz,vi,wo,xh,xx,yi,zh,zu
posicion
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28651,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28652,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28653,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28654,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
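The resulting table is a movie-by-language indicator matrix. A toy version with made-up data:

```python
import pandas as pd

largo = pd.DataFrame({
    "posicion": [0, 1, 1, 2, 2],
    "value": ["en", "en", "fr", "tl", None],
})

# Rows: movie position; columns: language code; cells: number of occurrences.
# Rows whose value is missing are ignored by crosstab.
matriz = pd.crosstab(largo["posicion"], largo["value"])
print(matriz)
```

One consequence is that a movie with no language at all would not appear as a row in the table.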


In [195]:
# List the languages in which the dataset's movies are spoken
for codigo in codigos_lenguajes:
  print('codigo: {}, ISO language name: {}'.format(codigo,diccionario_lenguajes[codigo]['ISO language name']))

codigo: dz, ISO language name: Dzongkha
codigo: af, ISO language name: Afrikaans
codigo: bn, ISO language name: Bengali
codigo: ro, ISO language name: Romanian, Moldavian, Moldovan
codigo: nl, ISO language name: Dutch, Flemish
codigo: ce, ISO language name: Chechen
codigo: bo, ISO language name: Tibetan
codigo: id, ISO language name: Indonesian
codigo: iu, ISO language name: Inuktitut
codigo: eu, ISO language name: Basque
codigo: pl, ISO language name: Polish
codigo: et, ISO language name: Estonian
codigo: fi, ISO language name: Finnish
codigo: sw, ISO language name: Swahili
codigo: el, ISO language name: Greek, Modern (1453–)
codigo: ms, ISO language name: Malay
codigo: ko, ISO language name: Korean
codigo: bm, ISO language name: Bambara
codigo: fy, ISO language name: Western Frisian
codigo: xx, ISO language name: No Language
codigo: jv, ISO language name: Javanese
codigo: tr, ISO language name: Turkish
codigo: ku, ISO language name: Kurdish
codigo: da, ISO language name: Danish
codig

In [None]:
['br','te','so']

In [202]:
# Test with the Telugu language
posicion_v_lenguaje['te'].sum()

73
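The same column sum works for every language at once, which gives a quick ranking. A sketch on a made-up indicator matrix:

```python
import pandas as pd

matriz = pd.DataFrame(
    {"en": [1, 1, 0, 1], "fr": [0, 1, 0, 0], "te": [0, 0, 1, 1]},
    index=pd.Index([0, 1, 2, 3], name="posicion"),
)

# Number of movies per language, most frequent first
peliculas_por_lenguaje = matriz.sum().sort_values(ascending=False)
print(peliculas_por_lenguaje)
```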

In [214]:
# Movies in the Telugu language
posicion_v_lenguaje[posicion_v_lenguaje['te']==1]

value,ab,af,am,ar,as,ay,az,be,bg,bi,bm,bn,bo,br,bs,ca,ce,cn,co,cr,cs,cy,da,de,dz,el,en,eo,es,et,eu,fa,fi,fo,fr,fy,ga,gd,gl,gn,...,qu,ro,ru,rw,sa,sc,se,sg,sh,si,sk,sl,sm,so,sq,sr,st,sv,sw,ta,te,tg,th,tk,tl,tn,to,tr,tt,ug,uk,ur,uz,vi,wo,xh,xx,yi,zh,zu
posicion
7132,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11134,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11320,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11526,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12260,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27411,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
27448,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
27913,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [237]:
# Among the Telugu movies, keep those whose only language is Telugu
x = posicion_v_lenguaje[posicion_v_lenguaje['te']==1]
indices = x[x.sum(axis=1) == 1].index

In [240]:
lista_indices = list(indices)
lista_indices

[11526,
 12260,
 16138,
 17510,
 19347,
 19512,
 21551,
 21552,
 21737,
 21738,
 21983,
 21984,
 21985,
 21986,
 22089,
 22090,
 22645,
 22646,
 22648,
 22649,
 22650,
 22653,
 22654,
 22655,
 22656,
 22657,
 22658,
 22659,
 22662,
 22663,
 22928,
 22981,
 22982,
 22983,
 22984,
 22985,
 22986,
 22987,
 22988,
 24673,
 24720,
 27383,
 27448,
 27913,
 28202]

In [242]:
metadata_prod.loc[lista_indices, 'title']

11526                     Magadheera
12260            Once Upon a Warrior
16138                1 - Nenokkadine
17510                          Aarya
19347                          Vedam
19512                          Manam
21551                           King
21552                           King
21737                         Adhurs
21738                         Adhurs
21983                      100% Love
21984                      100% Love
21985                         Leader
21986                         Leader
22089               Nannaku Prematho
22090               Nannaku Prematho
22645                    Srimanthudu
22646                        Dookudu
22648                         Pokiri
22649                        Khaleja
22650                         Athadu
22653                          Jalsa
22654            Atharintiki Daaredi
22655                   Gabbar Singh
22656                    Race Gurram
22657    Cameraman Ganga Tho Rambabu
22658                         Panjaa
2

In [231]:
posicion_v_lenguaje[posicion_v_lenguaje['te']==1].sum(1)

posicion
7132     3
11134    5
11320    2
11526    1
12260    1
        ..
27411    2
27448    1
27913    1
28202    1
28520    2
Length: 73, dtype: int64

In [218]:
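# Telugu movies that have Telugu as their only language.
# Caution: `.index` of a boolean Series holds every label, True or False, so this
# selects all 73 Telugu rows; indexing with the mask itself keeps only the 45 matches.
posicion_v_lenguaje.loc[(posicion_v_lenguaje[posicion_v_lenguaje['te']==1].sum(1) == 1).index]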

value,ab,af,am,ar,as,ay,az,be,bg,bi,bm,bn,bo,br,bs,ca,ce,cn,co,cr,cs,cy,da,de,dz,el,en,eo,es,et,eu,fa,fi,fo,fr,fy,ga,gd,gl,gn,...,qu,ro,ru,rw,sa,sc,se,sg,sh,si,sk,sl,sm,so,sq,sr,st,sv,sw,ta,te,tg,th,tk,tl,tn,to,tr,tt,ug,uk,ur,uz,vi,wo,xh,xx,yi,zh,zu
posicion
7132,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11134,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11320,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11526,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
12260,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27411,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
27448,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
27913,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
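The table above still contains movies with several languages (row 7132 has three). The difference between taking `.index` of a boolean Series and indexing with the Series itself is sketched below on made-up data.

```python
import pandas as pd

matriz = pd.DataFrame(
    {"en": [1, 0, 0], "ta": [1, 0, 0], "te": [1, 1, 1]},
    index=pd.Index([7, 11, 12], name="posicion"),
)

telugu = matriz[matriz["te"] == 1]
solo_telugu = telugu.sum(axis=1) == 1   # boolean Series: False, True, True

# .index ignores the True/False values and returns all labels
print(list(solo_telugu.index))

# Indexing with the mask keeps only the rows where it is True
print(list(telugu[solo_telugu].index))
```

The second form is the one that isolates single-language movies.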


In [206]:

(posicion_v_lenguaje[posicion_v_lenguaje['te']==1].sum(1) == 1).sum()

45

In [112]:
metadata_prod['lang']

0                 [{'iso_639_1': 'en', 'name': 'English'}]
1        [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
2                 [{'iso_639_1': 'en', 'name': 'English'}]
3                 [{'iso_639_1': 'en', 'name': 'English'}]
4                 [{'iso_639_1': 'en', 'name': 'English'}]
                               ...                        
46620             [{'iso_639_1': 'en', 'name': 'English'}]
46621             [{'iso_639_1': 'en', 'name': 'English'}]
46624                    [{'iso_639_1': 'tl', 'name': ''}]
46625             [{'iso_639_1': 'en', 'name': 'English'}]
46627             [{'iso_639_1': 'en', 'name': 'English'}]
Name: lang, Length: 28656, dtype: object

In [102]:
metadata_prod.shape

(28656, 30)

In [101]:
len(liso)

28656

In [27]:
type(metadata_prod['spoken_languages'][0])

str

In [None]:
# Import CountVectorizer and cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reset the index so that row positions match the rows of the similarity matrix
metadata_subset = metadata_subset.reset_index()

# Create the count matrix from the soup feature
count = CountVectorizer(stop_words='english')
# count_matrix = count.fit_transform(metadata['soup'])
count_matrix = count.fit_transform(metadata_subset['soup'])

# Compute the Cosine Similarity matrix based on the count_matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

# Reverse map: movie title -> row position in metadata_subset
indices = pd.Series(metadata_subset.index, index=metadata_subset['title'])
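To make the two steps concrete, here is a from-scratch sketch of counting tokens and comparing the count vectors with cosine similarity. It is an illustration on made-up soups, not scikit-learn's implementation: tokenization is a plain whitespace split and no stop words are removed. `cosine` and `recommend` are hypothetical helpers.

```python
from collections import Counter
from math import sqrt

# Made-up "soup" strings: cleaned keywords, cast, director and genres
soups = {
    "Movie A": "space alien janedoe horror scifi",
    "Movie B": "space robot janedoe scifi",
    "Movie C": "wedding comedy romance",
}

# Bag-of-words counts per movie
counts = {title: Counter(soup.split()) for title, soup in soups.items()}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(title: str, top_n: int = 2):
    """Rank the other movies by similarity to `title`, most similar first."""
    scores = [(other, cosine(counts[title], counts[other])) for other in counts if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(recommend("Movie A"))
```

Movies that share tokens score above zero, while movies with no token in common score exactly zero, which is the behaviour the recommender relies on.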

In [None]:
metadata_subset['release_date'].describe()

In [None]:
metadata['release_date'] = pd.to_datetime(metadata['release_date'])

In [None]:
metadata.loc[metadata['release_date'] > '1992-01-01', 'title']