# Recommendation Model
This notebook builds a movie recommendation model from the provided dataset. Given the title of a movie in the dataset, the model returns a list of 5 similar movies, chosen according to their features.
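To make the goal concrete, here is a minimal sketch of what the final lookup could look like. It assumes cosine similarity over a numeric feature matrix; the similarity metric is an assumption, and the `recommend` function, the toy titles, and the feature values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy feature matrix: one row per movie, all columns already numeric.
# Titles ("A".."G") and values are made up for illustration only.
features = pd.DataFrame(
    {
        "Animation": [1, 1, 0, 0, 1, 0, 1],
        "Action": [0, 0, 1, 1, 0, 1, 0],
        "runtime": [0.38, 0.40, 0.60, 0.62, 0.35, 0.58, 0.41],
    },
    index=["A", "B", "C", "D", "E", "F", "G"],
)


def recommend(title, features, n=5):
    """Return the n titles whose feature vectors have the highest cosine similarity to `title`."""
    if title not in features.index:
        raise KeyError(f"{title!r} is not in the dataset")
    matrix = features.to_numpy(dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms[:, None]  # rows scaled to unit length
    query = unit[features.index.get_loc(title)]
    scores = pd.Series(unit @ query, index=features.index)
    # Exclude the query movie itself, then take the n best scores
    return scores.drop(title).nlargest(n).index.tolist()


recommendations = recommend("A", features)
```

On this toy data the other animated titles rank first, because they share the genre indicator with the query movie.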

In [5]:
import pandas as pd
import nltk
import string
import warnings

nltk.download('punkt_tab')  # Tokenizer models
nltk.download('stopwords')  # Stopword lists (the English list is used below)
warnings.filterwarnings('ignore')

from sklearn.preprocessing import MinMaxScaler
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/alejocampos/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alejocampos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data preprocessing

Loading the dataset

In [6]:
df_premodel = pd.read_parquet('https://github.com/alejocampos1/Henry_PI1_Alejandro-Campos/raw/main/Datasets/Datasets_Limpios/Parquet/movies_tomodel.parquet')
df_genres = pd.read_parquet('https://github.com/alejocampos1/Henry_PI1_Alejandro-Campos/raw/main/Datasets/Datasets_Limpios/Parquet/genres.parquet')
df_prodcompanies = pd.read_parquet("https://github.com/alejocampos1/Henry_PI1_Alejandro-Campos/raw/main/Datasets/Datasets_Limpios/Parquet/prodcompanies.parquet")

In [7]:
df_premodel.head(2)

Unnamed: 0,id,title,genres,overview,runtime,production_companies,original_language,spoken_languages,production_countries,popularity,release_year,weighted_rating
0,862,Toy Story,"[16, 35, 10751]","Led by Woody, Andy's toys live happily in his ...",81.0,[3],en,[en],[US],21.946943,1995,7.68
1,8844,Jumanji,"[12, 14, 10751]",When siblings Judy and Peter discover an encha...,104.0,"[559, 2550, 10201]",en,"[en, fr]",[US],17.015539,1995,6.87


## Checking for missing values

In [8]:
# Select the rows that contain at least one null value
df_nulos = df_premodel[df_premodel.isnull().any(axis=1)]
df_nulos.head()

Unnamed: 0,id,title,genres,overview,runtime,production_companies,original_language,spoken_languages,production_countries,popularity,release_year,weighted_rating
19574,283101,Shadowing the Third Man,[99],Documentary about the production of The Third ...,95.0,"[694, 3324, 3391, 10915, 15505, 72703, 72704]",,"[de, en]","[AT, FR, JP, GB, US]",0.017007,2004,5.63
21602,103902,Unfinished Sky,"[10749, 18]",An Outback farmer takes in an Afghani woman wh...,94.0,"[10229, 12022]",,[en],[AU],0.359818,2007,5.73
22832,359195,13 Fighting Men,"[10752, 37]",A group of Union Army soldiers is charged with...,69.0,[4141],,[en],[US],0.070647,1960,5.63
37407,257095,Prince Bayaya,[16],The first fairy tale transformed into a full-l...,87.0,[2502],,[cs],[CZ],0.036841,1950,5.62
41047,332742,Song of Lahore,[99],"Until the late 1970s, the Pakistani city of La...",82.0,[],,"[ur, en, pa]",[No Data],0.373688,2015,5.66
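The cell above lists the rows that contain a null. A per-column count is a useful complement, since it shows which columns are affected before any rows are dropped. The following is a minimal sketch on a made-up DataFrame, not on the notebook's data:

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two columns (values are illustrative only)
df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "original_language": ["en", None, None],
        "runtime": [81.0, np.nan, 104.0],
    }
)

# Number of nulls in each column
null_counts = df.isnull().sum()

# Rows with at least one null, using the same filter as the cell above
rows_with_nulls = df[df.isnull().any(axis=1)]
```

If the count is non-zero only for `original_language`, dropping rows on that single column, as the next cell does, removes every row with a gap.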


In [9]:
df_premodel.dropna(subset='original_language', inplace=True)

## One-hot encoding for the 'genres' column

First, the genre IDs in the column are replaced with the corresponding genre names.

In [10]:
# Build a dictionary mapping genre IDs to genre names
generos_dict = dict(zip(df_genres['id'], df_genres['name']))

# Add a column to df_premodel holding genre names instead of IDs
df_premodel['genres_names'] = df_premodel['genres'].apply(
    lambda x: [generos_dict[item_id] for item_id in x]
)

# Drop the original 'genres' column, which is now redundant
df_premodel.drop(columns=['genres'], inplace=True)

df_premodel.head(2)

Unnamed: 0,id,title,overview,runtime,production_companies,original_language,spoken_languages,production_countries,popularity,release_year,weighted_rating,genres_names
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,[3],en,[en],[US],21.946943,1995,7.68,"[Animation, Comedy, Family]"
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,104.0,"[559, 2550, 10201]",en,"[en, fr]",[US],17.015539,1995,6.87,"[Adventure, Fantasy, Family]"
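The direct lookup `generos_dict[item_id]` raises a `KeyError` if a movie references a genre ID that is missing from the genres table. One defensive variant, sketched here on toy data, uses `dict.get` with a fallback label. The `"Unknown"` label and the toy IDs are hypothetical.

```python
import pandas as pd

# Toy tables; ID 999 deliberately has no entry in the genres table
genres = pd.DataFrame({"id": [16, 35], "name": ["Animation", "Comedy"]})
movies = pd.DataFrame({"id": [1, 2], "genres": [[16, 35], [16, 999]]})

genre_names = dict(zip(genres["id"], genres["name"]))

# dict.get returns the fallback instead of raising on an unmapped ID
movies["genres_names"] = movies["genres"].apply(
    lambda ids: [genre_names.get(i, "Unknown") for i in ids]
)
```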


Building the one-hot encoding

In [11]:
# Expand the genre lists into one row per (movie, genre) pair
df_generos_encoded = df_premodel[['id', 'genres_names']].explode('genres_names')

# Use get_dummies to create the one-hot columns
df_generos_encoded = pd.get_dummies(df_generos_encoded, columns=['genres_names'], prefix='', prefix_sep='')

# Collapse back to one row per movie by summing the indicators
df_generos_encoded = df_generos_encoded.groupby('id').sum().reset_index()

# Merge df_premodel with df_generos_encoded on the 'id' column
df_premodel = pd.merge(df_premodel, df_generos_encoded, on='id', how='left')

# Drop the 'genres_names' column
df_premodel.drop(columns=['genres_names'], inplace=True)

# Check the result
df_premodel.head(2)


Unnamed: 0,id,title,overview,runtime,production_companies,original_language,spoken_languages,production_countries,popularity,release_year,...,Horror,Music,Mystery,No Data,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,[3],en,[en],[US],21.946943,1995,...,0,0,0,0,0,0,0,0,0,0
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,104.0,"[559, 2550, 10201]",en,"[en, fr]",[US],17.015539,1995,...,0,0,0,0,0,0,0,0,0,0
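The explode, `get_dummies`, and `groupby` sequence is easier to follow on a small input. The sketch below applies the same three steps to two made-up movies; the data is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"id": [1, 2], "genres_names": [["Animation", "Comedy"], ["Comedy"]]}
)

# Step 1: one row per (movie, genre) pair
long_form = toy.explode("genres_names")

# Step 2: one indicator column per genre
dummies = pd.get_dummies(long_form, columns=["genres_names"], prefix="", prefix_sep="")

# Step 3: sum the indicators so each movie is a single row again
encoded = dummies.groupby("id").sum().reset_index()
encoded = encoded.astype({"Animation": int, "Comedy": int})
```

Movie 1 ends up with a 1 in both genre columns, while movie 2 has a 1 only under `Comedy`.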


## One-hot encoding for the 'production_companies' column

For production companies, only the 100 with the highest total popularity are kept as individual categories; all others are grouped under 'Other Prod Company'.

In [12]:
# Select the top 100 production companies by total popularity
df_top100_prod = (
    df_premodel[['production_companies', 'popularity']]
    .explode('production_companies')  # One row per (movie, company) pair
    .groupby('production_companies', as_index=False)['popularity'].sum()  # Sum popularity per company ID
    .merge(df_prodcompanies[['id', 'name']], left_on='production_companies', right_on='id')  # Attach company names
    [['id', 'name', 'popularity']]  # Keep the final columns, including company IDs
    .sort_values(by='popularity', ascending=False)  # Sort by popularity, descending
    .reset_index(drop=True)  # Reset the index
    .iloc[:100]  # Keep only the first 100 rows
)

In [13]:
df_top100_prod.head(1)

Unnamed: 0,id,name,popularity
0,6194,Warner Bros.,7717.437823


In [14]:
# Build a set with the IDs of the top 100 production companies
top100_prod_ids = set(df_top100_prod['id'])

# Build a dictionary mapping production company IDs to names
prodcomp_dict = dict(zip(df_prodcompanies['id'], df_prodcompanies['name']))

# Add a column to df_premodel holding company names instead of IDs
df_premodel['production_companies_new'] = df_premodel['production_companies'].apply(
    lambda x: ["Other Prod Company" if item_id not in top100_prod_ids else prodcomp_dict.get(item_id) for item_id in x]
)

# Check the result
df_premodel['production_companies_new']

0                                [Pixar Animation Studios]
1        [TriStar Pictures, Other Prod Company, Other P...
2                       [Warner Bros., Other Prod Company]
3                 [Twentieth Century Fox Film Corporation]
4                [Other Prod Company, Touchstone Pictures]
                               ...                        
43224             [Other Prod Company, Other Prod Company]
43225                                 [Other Prod Company]
43226    [Other Prod Company, Working Title Films, Othe...
43227                                 [Other Prod Company]
43228                                                   []
Name: production_companies_new, Length: 43229, dtype: object
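The top-N bucketing can be traced by hand on a small example. The sketch below uses N = 2 instead of 100, with made-up company IDs, names, and popularity values. Note that a movie's popularity is counted once for each company attached to it.

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "production_companies": [[1, 2], [1], [3], [2, 3]],
        "popularity": [10.0, 5.0, 1.0, 2.0],
    }
)
names = {1: "Studio One", 2: "Studio Two", 3: "Studio Three"}  # made-up names

# Total popularity per company: 1 -> 15.0, 2 -> 12.0, 3 -> 3.0
totals = (
    movies.explode("production_companies")
    .groupby("production_companies")["popularity"]
    .sum()
)

# Keep the N most popular companies (N = 2 here; the notebook uses 100)
top_ids = set(totals.nlargest(2).index)

# Companies outside the top N collapse into a single shared label
movies["companies_named"] = movies["production_companies"].apply(
    lambda ids: [names[i] if i in top_ids else "Other Prod Company" for i in ids]
)
```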

In [15]:
# Expand the production company lists into one row per (movie, company) pair
df_prodcompany_encoded = df_premodel[['id', 'production_companies_new']].explode('production_companies_new')

# Use get_dummies to create the one-hot columns
df_prodcompany_encoded = pd.get_dummies(df_prodcompany_encoded, columns=['production_companies_new'], prefix='', prefix_sep='')

# Collapse back to one row per movie by summing the indicators
df_prodcompany_encoded = df_prodcompany_encoded.groupby('id').sum().reset_index()

# Merge df_premodel with df_prodcompany_encoded on the 'id' column
df_premodel = pd.merge(df_premodel, df_prodcompany_encoded, on='id', how='left')

# Drop the 'production_companies' and 'production_companies_new' columns
df_premodel.drop(columns=['production_companies', 'production_companies_new'], inplace=True)

# Cap 'Other Prod Company' at 1: a movie with several non-top-100 companies would otherwise get a count above 1
df_premodel['Other Prod Company'] = df_premodel['Other Prod Company'].apply(lambda x: 1 if x > 1 else x)

# Check the result
df_premodel.head(1)

Unnamed: 0,id,title,overview,runtime,original_language,spoken_languages,production_countries,popularity,release_year,weighted_rating,...,Vertigo Entertainment,Village Roadshow Pictures,Walt Disney Animation Studios,Walt Disney Pictures,Walt Disney Productions,Wanda Pictures,Warner Bros.,Warner Bros. Animation,Wild Bunch,Working Title Films
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,en,[en],[US],21.946943,1995,7.68,...,0,0,0,0,0,0,0,0,0,0
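The cap on `Other Prod Company` can also be written without a Python-level lambda. As a small sketch with made-up counts, `Series.clip` produces the same result in a vectorized way:

```python
import pandas as pd

# Made-up counts of non-top companies per movie
counts = pd.Series([0, 1, 3, 2], name="Other Prod Company")

# Values above 1 become 1; values at or below 1 are unchanged
binary = counts.clip(upper=1)
```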


## One-hot encoding for the 'original_language' column
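No code for this step appears in this part of the notebook. Because `original_language` holds a single value per movie rather than a list, the encoding would not need the explode and regroup steps used for genres. The following is only a sketch on toy data of one way it could be done; the `lang` prefix and the choice to drop the source column are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "original_language": ["en", "fr", "en"]})

# One indicator column per language; the "lang" prefix is a hypothetical choice
lang_dummies = pd.get_dummies(toy["original_language"], prefix="lang").astype(int)

# Replace the text column with its indicator columns
toy = pd.concat([toy.drop(columns=["original_language"]), lang_dummies], axis=1)
```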

## Column normalization

The numeric columns are normalized in preparation for modeling.

In [17]:
scaler = MinMaxScaler()

# Select the numeric columns to normalize
numerical_cols = ['runtime', 'popularity', 'weighted_rating']

# Apply min-max scaling
df_premodel[numerical_cols] = scaler.fit_transform(df_premodel[numerical_cols])

df_premodel[numerical_cols].head()

Unnamed: 0,runtime,popularity,weighted_rating
0,0.38756,0.040087,0.810985
1,0.497608,0.031079,0.680129
2,0.483254,0.021394,0.570275
3,0.607656,0.007049,0.510501
4,0.507177,0.01532,0.487884
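With its default range of 0 to 1, `MinMaxScaler` maps each value `x` in a column to `(x - min) / (max - min)`. The sketch below applies that formula by hand to three made-up runtimes, as a way to check the arithmetic:

```python
import pandas as pd

runtime = pd.Series([60.0, 90.0, 120.0])  # made-up values, in minutes

# Min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (runtime - runtime.min()) / (runtime.max() - runtime.min())
```

Because the formula depends on the column's extremes, a single very large value compresses everything else toward zero, which may be what the small `popularity` values in the output above reflect.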


## Tokenization of the 'overview' column

In [18]:
# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to tokenize and clean the text
def procesar_texto(text):
    # Lowercase the text and split it into tokens
    tokens = nltk.word_tokenize(text.lower())

    # Remove stopwords and punctuation tokens
    tokens_limpios = [word for word in tokens if word not in stop_words and word not in string.punctuation]

    return ' '.join(tokens_limpios)  # Return the processed text as a single string

# Apply the function to the 'overview' column of the DataFrame
df_premodel['overview_clean'] = df_premodel['overview'].apply(lambda x: procesar_texto(str(x)))

In [19]:
df_premodel[['overview', 'overview_clean']].head()

Unnamed: 0,overview,overview_clean
0,"Led by Woody, Andy's toys live happily in his ...",led woody andy 's toys live happily room andy ...
1,When siblings Judy and Peter discover an encha...,siblings judy peter discover enchanted board g...
2,A family wedding reignites the ancient feud be...,family wedding reignites ancient feud next-doo...
3,"Cheated on, mistreated and stepped on, the wom...",cheated mistreated stepped women holding breat...
4,Just when George Banks has recovered from his ...,george banks recovered daughter 's wedding rec...
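The cleaned text above still contains fragments such as `'s`, because that token is neither a stopword nor a punctuation character. One possible refinement is to keep only alphabetic tokens longer than one character. The sketch below swaps in a regular-expression tokenizer and a tiny hand-written stopword set in place of NLTK so that it runs without downloads; both are stand-ins, and the `clean_text` name is hypothetical.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "a", "an", "in", "his", "and", "by", "of"}


def clean_text(text):
    """Lowercase, keep runs of ASCII letters, and drop stopwords and one-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS and len(t) > 1)


cleaned = clean_text("Led by Woody, Andy's toys live happily in his room.")
```

A trade-off of this approach is that hyphenated words such as `next-door` are split into two tokens, and digits are discarded.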


In [20]:
# Create a TfidfVectorizer limited to the 500 most frequent terms in the corpus
tfidf = TfidfVectorizer(max_features=500, stop_words='english')

# Fit the vectorizer on 'overview_clean' and transform the data
tfidf_matrix = tfidf.fit_transform(df_premodel['overview_clean'])

# Convert the TF-IDF matrix to a DataFrame
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

# Concatenate the TF-IDF DataFrame with the original DataFrame
df_premodel = pd.concat([df_premodel, df_tfidf], axis=1)

In [21]:
df_premodel.head(2)

Unnamed: 0,id,title,overview,runtime,original_language,spoken_languages,production_countries,popularity,release_year,weighted_rating,...,works,world,writer,written,wrong,year,years,york,young,younger
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",0.38756,en,[en],[US],0.040087,1995,0.810985,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,0.497608,en,"[en, fr]",[US],0.031079,1995,0.680129,...,0.0,0.155387,0.0,0.0,0.0,0.0,0.168692,0.0,0.0,0.0
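`pd.concat(..., axis=1)` aligns rows by index label, not by position. In this notebook the earlier `pd.merge` calls appear to leave `df_premodel` with a default 0..n-1 index, so the TF-IDF columns line up. If rows were dropped afterwards without resetting the index, the concatenation would silently introduce NaN rows. The sketch below shows the effect on toy data, and one way to guard against it by reusing the base index.

```python
import pandas as pd

# Index with gaps, as it would look after dropna without a reset
base = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 5])
tfidf_values = [[0.1], [0.2], [0.3]]  # made-up scores, one row per movie

# Default RangeIndex (0, 1, 2) on the right side: labels only partly match
misaligned = pd.concat(
    [base, pd.DataFrame(tfidf_values, columns=["world"])], axis=1
)

# Reusing the base index keeps the rows paired by position
aligned = pd.concat(
    [base, pd.DataFrame(tfidf_values, columns=["world"], index=base.index)],
    axis=1,
)
```

Passing `index=df_premodel.index` when building `df_tfidf` would be the equivalent safeguard in the cell above.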
