# Recommendation Model
A movie recommendation model will be built from the provided dataset. Given the title of a movie that belongs to the dataset, the model returns a list of 5 similar movies, selected according to their features.

In [40]:
import pandas as pd
import warnings
import re

warnings.filterwarnings('ignore')

from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import STOPWORDS

## Data preprocessing

Loading the dataset

In [41]:
df_premodel = pd.read_parquet('https://github.com/alejocampos1/Henry_PI1_Alejandro-Campos/raw/main/Datasets/Datasets_Limpios/Parquet/movies_tomodel.parquet')
df_genres = pd.read_parquet('https://github.com/alejocampos1/Henry_PI1_Alejandro-Campos/raw/main/Datasets/Datasets_Limpios/Parquet/genres.parquet')
df_prodcompanies = pd.read_parquet("https://github.com/alejocampos1/Henry_PI1_Alejandro-Campos/raw/main/Datasets/Datasets_Limpios/Parquet/prodcompanies.parquet")

In [42]:
df_premodel.head(2)

Unnamed: 0,id,title,genres,overview,runtime,production_companies,spoken_languages,production_countries,popularity,release_year,weighted_rating
0,862,Toy Story,"[16, 35, 10751]","Led by Woody, Andy's toys live happily in his ...",81.0,[3],[English],[US],21.946943,1995,7.68
1,8844,Jumanji,"[12, 14, 10751]",When siblings Judy and Peter discover an encha...,104.0,"[559, 2550, 10201]","[English, Other_Languages]",[US],17.015539,1995,6.87


## Checking for missing values

In [43]:
# Select the rows that contain at least one null value
df_nulos = df_premodel[df_premodel.isnull().any(axis=1)]
df_nulos.head()

Unnamed: 0,id,title,genres,overview,runtime,production_companies,spoken_languages,production_countries,popularity,release_year,weighted_rating
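
The empty output above (header only) indicates that no row contains a null value. As a complementary check, nulls can also be counted per column. The following is a minimal sketch on a hypothetical toy frame (`df_demo` and its values are illustrative assumptions, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_premodel
df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [90.0, np.nan, 120.0],
    "overview": ["x", None, "z"],
})

# Number of missing values in each column
null_counts = df_demo.isnull().sum()

# Rows with at least one missing value (same filter as the cell above)
rows_with_nulls = df_demo[df_demo.isnull().any(axis=1)]

print(null_counts)
print(len(rows_with_nulls))
```

The per-column count shows where the gaps are, while the row filter shows which records would be affected if they were dropped.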


## One-hot encoding of the 'genres' column

First, each genre 'id' in the column is replaced by the genre's actual name

In [44]:
# Build a dictionary that maps genre IDs to their names
generos_dict = dict(zip(df_genres['id'], df_genres['name']))

# Create a new column in df_premodel holding the genre names instead of the IDs
df_premodel['genres_names'] = df_premodel['genres'].apply(
    lambda x: [generos_dict[item_id] for item_id in x]
)

# Drop the original 'genres' column, which is now redundant
df_premodel.drop(columns=['genres'], inplace=True)

df_premodel.head(2)

Unnamed: 0,id,title,overview,runtime,production_companies,spoken_languages,production_countries,popularity,release_year,weighted_rating,genres_names
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,[3],[English],[US],21.946943,1995,7.68,"[Animation, Comedy, Family]"
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,104.0,"[559, 2550, 10201]","[English, Other_Languages]",[US],17.015539,1995,6.87,"[Adventure, Fantasy, Family]"


Creating the one-hot encoding

In [45]:
# Expand the genre lists into one row per genre
df_generos_encoded = df_premodel[['id', 'genres_names']].explode('genres_names')

# Use get_dummies to create the one-hot columns
df_generos_encoded = pd.get_dummies(df_generos_encoded, columns=['genres_names'], prefix='', prefix_sep='')

# Collapse back to one row per movie by summing the indicators
df_generos_encoded = df_generos_encoded.groupby('id').sum().reset_index()

# Merge df_premodel with df_generos_encoded on the 'id' column
df_premodel = pd.merge(df_premodel, df_generos_encoded, on='id', how='left')

# Drop the 'genres_names' column
df_premodel.drop(columns=['genres_names'], inplace=True)

# Check the result
df_premodel.head(2)


Unnamed: 0,id,title,overview,runtime,production_companies,spoken_languages,production_countries,popularity,release_year,weighted_rating,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,[3],[English],[US],21.946943,1995,7.68,...,0,0,0,0,0,0,0,0,0,0
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,104.0,"[559, 2550, 10201]","[English, Other_Languages]",[US],17.015539,1995,6.87,...,0,0,0,0,0,0,0,0,0,0
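
The explode / `get_dummies` / `groupby` sequence used above is easier to follow on a small input. This is a minimal sketch with a hypothetical two-movie frame (`df_demo` is an illustrative assumption); `dtype=int` is passed only to make the indicator values explicit integers:

```python
import pandas as pd

# Hypothetical toy frame: each movie carries a list of genre names
df_demo = pd.DataFrame({
    "id": [1, 2],
    "genres_names": [["Animation", "Comedy"], ["Comedy"]],
})

# One row per (movie, genre) pair
long_form = df_demo.explode("genres_names")

# One indicator column per genre, still one row per pair
dummies = pd.get_dummies(
    long_form, columns=["genres_names"], prefix="", prefix_sep="", dtype=int
)

# Sum the indicators so each movie ends up on a single row
encoded = dummies.groupby("id").sum().reset_index()
print(encoded)
```

Movie 1 ends up with a 1 in both `Animation` and `Comedy`, and movie 2 only in `Comedy`, which is the multi-label one-hot layout the notebook merges back into `df_premodel`.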


## One-hot encoding of the 'production_companies' column

For the production companies, only the 50 most popular are kept as individual categories; all others are grouped under the category 'Other Prod Company'

In [46]:
# Select the top 50 production companies by total popularity
df_top50_prod = (
    df_premodel[['production_companies', 'popularity']]
    .explode('production_companies')  # Expand the list column into one row per company
    .groupby('production_companies', as_index=False)['popularity'].sum()  # Group by company id and sum the popularity
    .merge(df_prodcompanies[['id', 'name']], left_on='production_companies', right_on='id')  # Attach the company names
    [['id', 'name', 'popularity']]  # Keep the final columns, including the company IDs
    .sort_values(by='popularity', ascending=False)  # Sort by popularity, descending
    .reset_index(drop=True)  # Reset the index
    .iloc[:50]  # Keep only the first 50 rows
)

In [47]:
df_top50_prod.head(1)

Unnamed: 0,id,name,popularity
0,6194,Warner Bros.,7661.61343


In [48]:
# Build a set with the IDs of the top 50 production companies
top50_prod_ids = set(df_top50_prod['id'])

# Build a dictionary that maps production company IDs to their names
prodcomp_dict = dict(zip(df_prodcompanies['id'], df_prodcompanies['name']))

# Create a new column in df_premodel holding the company names instead of the IDs
df_premodel['production_companies_new'] = df_premodel['production_companies'].apply(
    lambda x: ["Other Prod Company" if item_id not in top50_prod_ids else prodcomp_dict.get(item_id) for item_id in x]
)

# Check the result
df_premodel['production_companies_new']

0                                [Pixar Animation Studios]
1        [TriStar Pictures, Other Prod Company, Other P...
2                       [Warner Bros., Other Prod Company]
3                 [Twentieth Century Fox Film Corporation]
4                [Other Prod Company, Touchstone Pictures]
                               ...                        
34422                                 [Universal Pictures]
34423             [Other Prod Company, Other Prod Company]
34424                                 [Other Prod Company]
34425    [Other Prod Company, Working Title Films, Othe...
34426                                 [Other Prod Company]
Name: production_companies_new, Length: 34427, dtype: object

In [49]:
# Expand the production company lists into one row per company
df_prodcompany_encoded = df_premodel[['id', 'production_companies_new']].explode('production_companies_new')

# Use get_dummies to create the one-hot columns
df_prodcompany_encoded = pd.get_dummies(df_prodcompany_encoded, columns=['production_companies_new'], prefix='', prefix_sep='')

# Collapse back to one row per movie by summing the indicators
df_prodcompany_encoded = df_prodcompany_encoded.groupby('id').sum().reset_index()

# Merge df_premodel with df_prodcompany_encoded on the 'id' column
df_premodel = pd.merge(df_premodel, df_prodcompany_encoded, on='id', how='left')

# Drop the 'production_companies' and 'production_companies_new' columns
df_premodel.drop(columns=['production_companies', 'production_companies_new' ], inplace=True)

# Cap 'Other Prod Company' at 1: a movie with several non-top-50 companies would otherwise sum to more than 1
df_premodel['Other Prod Company'] = df_premodel['Other Prod Company'].apply(lambda x: 1 if x > 1 else x)

# Check the result
df_premodel.head(1)

Unnamed: 0,id,title,overview,runtime,spoken_languages,production_countries,popularity,release_year,weighted_rating,Action,...,Touchstone Pictures,TriStar Pictures,Twentieth Century Fox Film Corporation,United Artists,Universal Pictures,Village Roadshow Pictures,Walt Disney Pictures,Warner Bros.,Wild Bunch,Working Title Films
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,[English],[US],21.946943,1995,7.68,0,...,0,0,0,0,0,0,0,0,0,0
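
The top-N grouping and the final cap on 'Other Prod Company' can be seen end to end on a small input. This is a minimal sketch under stated assumptions: the frame, the company names in `names`, and the cutoff `TOP_N = 1` are hypothetical (the notebook uses 50), and `DataFrame.clip(upper=1)` is used here as an alternative to the per-column lambda:

```python
import pandas as pd

# Hypothetical toy data: three movies, three companies
df_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "production_companies": [[10], [10, 20, 30], [20, 30]],
    "popularity": [5.0, 3.0, 1.0],
})
names = {10: "Studio A", 20: "Studio B", 30: "Studio C"}  # hypothetical names
TOP_N = 1  # hypothetical cutoff

# Total popularity per company, highest first
totals = (
    df_demo[["production_companies", "popularity"]]
    .explode("production_companies")
    .groupby("production_companies")["popularity"]
    .sum()
    .sort_values(ascending=False)
)
top_ids = set(totals.index[:TOP_N])

# Replace IDs with names, grouping everything outside the top N
df_demo["company"] = df_demo["production_companies"].apply(
    lambda ids: [names[c] if c in top_ids else "Other Prod Company" for c in ids]
)

# One-hot encode; clip caps repeated 'Other Prod Company' hits at 1
long_form = df_demo[["id", "company"]].explode("company")
encoded = (
    pd.get_dummies(long_form, columns=["company"], prefix="", prefix_sep="", dtype=int)
    .groupby("id")
    .sum()
    .clip(upper=1)
    .reset_index()
)
print(encoded)
```

Movie 3 lists two companies outside the top N, so its raw sum for 'Other Prod Company' would be 2; the cap brings it back to a binary indicator.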


## One-hot encoding of the 'spoken_languages' column

In [50]:
# Expand the spoken language lists into one row per language
df_spk_language_encoded = df_premodel[['id', 'spoken_languages']].explode('spoken_languages')

# Use get_dummies to create the one-hot columns
df_spk_language_encoded = pd.get_dummies(df_spk_language_encoded, columns=['spoken_languages'], prefix='SL_', prefix_sep='')

# Collapse back to one row per movie by summing the indicators
df_spk_language_encoded = df_spk_language_encoded.groupby('id').sum().reset_index()

# Merge df_premodel with df_spk_language_encoded on the 'id' column
df_premodel = pd.merge(df_premodel, df_spk_language_encoded, on='id', how='left')

# Drop the 'spoken_languages' column
df_premodel.drop(columns=['spoken_languages'], inplace=True)


In [51]:
df_premodel.head(2)

Unnamed: 0,id,title,overview,runtime,production_countries,popularity,release_year,weighted_rating,Action,Adventure,...,Twentieth Century Fox Film Corporation,United Artists,Universal Pictures,Village Roadshow Pictures,Walt Disney Pictures,Warner Bros.,Wild Bunch,Working Title Films,SL_English,SL_Other_Languages
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,[US],21.946943,1995,7.68,0,0,...,0,0,0,0,0,0,0,0,1,0
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,104.0,[US],17.015539,1995,6.87,0,1,...,0,0,0,0,0,0,0,0,1,1


## One-hot encoding of the 'production_countries' column

In [52]:
# Expand the production country lists into one row per country
df_prod_country_encoded = df_premodel[['id', 'production_countries']].explode('production_countries')

# Use get_dummies to create the one-hot columns
df_prod_country_encoded = pd.get_dummies(df_prod_country_encoded, columns=['production_countries'], prefix='PC_', prefix_sep='')

# Collapse back to one row per movie by summing the indicators
df_prod_country_encoded = df_prod_country_encoded.groupby('id').sum().reset_index()

# Merge df_premodel with df_prod_country_encoded on the 'id' column
df_premodel = pd.merge(df_premodel, df_prod_country_encoded, on='id', how='left')

# Drop the 'production_countries' column
df_premodel.drop(columns=['production_countries'], inplace=True)

In [53]:
df_premodel.head(1)

Unnamed: 0,id,title,overview,runtime,popularity,release_year,weighted_rating,Action,Adventure,Animation,...,Universal Pictures,Village Roadshow Pictures,Walt Disney Pictures,Warner Bros.,Wild Bunch,Working Title Films,SL_English,SL_Other_Languages,PC_Other_Countries,PC_US
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",81.0,21.946943,1995,7.68,0,0,1,...,0,0,0,0,0,0,1,0,0,1


## Column normalization

The numeric columns are scaled to the [0, 1] range in preparation for modeling

In [54]:
scaler = MinMaxScaler()

# Select the numeric columns to normalize
numerical_cols = ['runtime', 'popularity','weighted_rating', 'release_year']

# Apply min-max scaling
df_premodel[numerical_cols] = scaler.fit_transform(df_premodel[numerical_cols])

df_premodel[numerical_cols].head()

Unnamed: 0,runtime,popularity,weighted_rating,release_year
0,0.386473,0.040087,0.797794,0.747126
1,0.497585,0.031079,0.648897,0.747126
2,0.483092,0.021394,0.542279,0.747126
3,0.608696,0.007049,0.498162,0.747126
4,0.507246,0.01532,0.452206,0.747126
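
Min-max scaling maps each column to [0, 1] using x' = (x - min) / (max - min), computed per column. The following is a minimal sketch that checks `MinMaxScaler` against the formula on hypothetical runtime values (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical runtimes in minutes, one feature column
values = np.array([[60.0], [90.0], [120.0]])

# Scaling with scikit-learn
scaled = MinMaxScaler().fit_transform(values)

# The same result computed by hand from the formula
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the minimum and maximum are taken over the whole column, a single extreme value compresses all the others toward 0, which is consistent with the small `popularity` values in the output above.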


## Tokenization of the 'overview' column

In [55]:
# Define the custom stopwords (the wordcloud defaults plus domain-specific terms)
stopwords = set(STOPWORDS)
stopwords.update([
    "the", "a", "an", "and", "or", "but", "so", "of", "in", "on", "with", "by", "for", "from", "at", "to", "it", "its", "this", "that", "is", "are", "was", "were", 
    "be", "been", "being",
    "have", "has", "had", "i", "you", "he", "she", "we", "they", "his", "her", "their", "there", "them", "then", "when", "where", "what", "who", "which", "why", 
    "how", "just", "about",
    "into", "like", "out", "over", "under", "before", "after", "as", "all", "one", "two", "three", "more", "most", "many", "few", 
    "movie", "film", "films", "story", "plot", "character", "characters", "director", "actors", "role", "scene", "scenes", "based", "villain", "sequel", "prequel", 
    "series", "set", "takes",
    "place", "stars", "produced", "produces", "production", "released", "release", "year", "years", 
    "month", "months", "day", "days", "part", "parts",
    "young", "old", "man", "woman", "boy", "girl", "family", "father", "mother", "son", "daughter", "brother", "sister", "find", 
    "discovers", "fight", "against", "away", "new", "return", "facing", "must", "will", 
    "can", "tries", "ends", "begins", "helps",
    "starts", "goes", "comes", "leads", "takes", "finds", "discovers", "faces", "tries", "leaves", "meets", "begins", "ends", "find", "fights", "wants", "needs", 
    "helps", "works", 
    "together", "make", "makes", "tells", "asks"
])

# Define the function that tokenizes and cleans the text
def procesar_texto_regex(text):
    # Lowercase the text and replace punctuation and other non-word characters with spaces
    text = re.sub(r'\W+', ' ', text.lower())
    
    # Tokenize by splitting the text into words
    tokens = text.split()
    
    # Remove the custom stopwords
    tokens_limpios = [word for word in tokens if word not in stopwords]
    
    return ' '.join(tokens_limpios)

# Apply the function to the 'overview' field of the DataFrame
df_premodel['overview_clean'] = df_premodel['overview'].apply(lambda x: procesar_texto_regex(str(x)))

In [56]:
df_premodel[['overview', 'overview_clean']].head()

Unnamed: 0,overview,overview_clean
0,"Led by Woody, Andy's toys live happily in his ...",led woody andy s toys live happily room andy s...
1,When siblings Judy and Peter discover an encha...,siblings judy peter discover enchanted board g...
2,A family wedding reignites the ancient feud be...,wedding reignites ancient feud next door neigh...
3,"Cheated on, mistreated and stepped on, the wom...",cheated mistreated stepped women holding breat...
4,Just when George Banks has recovered from his ...,george banks recovered s wedding receives news...
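
The stray `s` tokens in `overview_clean` come from the regex step: `\W+` replaces the apostrophe in a possessive with a space, so "andy's" becomes "andy" and "s". This is a minimal sketch of the same cleaning logic, using a deliberately small, hypothetical stopword set (`stopwords_demo`) and a made-up input sentence:

```python
import re

# Hypothetical, much smaller stopword set than the one used in the notebook
stopwords_demo = {"the", "a", "and", "his", "in", "by"}

def clean_text(text, stopwords):
    # Lowercase, then turn every run of non-word characters into one space
    text = re.sub(r"\W+", " ", text.lower())
    # Split on whitespace and drop the stopwords
    return " ".join(word for word in text.split() if word not in stopwords)

result = clean_text("Led by Woody, Andy's toys live happily in his room.", stopwords_demo)
print(result)  # led woody andy s toys live happily room
```

If those single-letter leftovers are unwanted, one option is to add "s" to the stopword set; note that `TfidfVectorizer`'s default token pattern already ignores one-character tokens.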


In [57]:
# Create a TfidfVectorizer that keeps the 25 most frequent terms across the corpus
tfidf = TfidfVectorizer(max_features=25, stop_words='english')

# Fit the vectorizer on the 'overview_clean' field and transform the data
tfidf_matrix = tfidf.fit_transform(df_premodel['overview_clean'])

# Convert the TF-IDF matrix to a DataFrame
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

# Concatenate the TF-IDF DataFrame with the original DataFrame
df_premodel = pd.concat([df_premodel, df_tfidf], axis=1)

# Drop the 'overview' and 'overview_clean' columns
df_premodel.drop(columns=['overview', 'overview_clean'], inplace=True)

In [58]:
df_premodel.head(2)

Unnamed: 0,id,title,runtime,popularity,release_year,weighted_rating,Action,Adventure,Animation,Comedy,...,people,school,small,soon,time,town,war,way,wife,world
0,862,Toy Story,0.386473,0.040087,0.747126,0.797794,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8844,Jumanji,0.497585,0.031079,0.747126,0.648897,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Prepare a reference dataframe ('id' and 'title') for looking up results

In [59]:
df_premodelo = df_premodel[['id', 'title']]

In [60]:
df_premodelo.to_parquet('../Datasets/pre_modelo.parquet')


## Model implementation

Prepare the feature matrix

In [61]:
df_modelo = df_premodel.drop(columns=['id', 'title'])

In [62]:
df_modelo.head()

Unnamed: 0,runtime,popularity,release_year,weighted_rating,Action,Adventure,Animation,Comedy,Crime,Documentary,...,people,school,small,soon,time,town,war,way,wife,world
0,0.386473,0.040087,0.747126,0.797794,0,0,1,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.497585,0.031079,0.747126,0.648897,0,1,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.483092,0.021394,0.747126,0.542279,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.608696,0.007049,0.747126,0.498162,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.711783,0.0,0.0
4,0.507246,0.01532,0.747126,0.452206,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.701414,0.0


In [63]:
print(list(df_modelo.columns))

['runtime', 'popularity', 'release_year', 'weighted_rating', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western', 'Amblin Entertainment', 'Atlas Entertainment', 'BBC Films', 'Blumhouse Productions', 'Canal+', 'Castle Rock Entertainment', 'Columbia Pictures', 'Columbia Pictures Corporation', 'DC Comics', 'DC Entertainment', 'Dimension Films', 'DreamWorks Animation', 'DreamWorks SKG', 'Dune Entertainment', 'Film4', 'Fox 2000 Pictures', 'Fox Searchlight Pictures', 'Gaumont', 'Illumination Entertainment', 'Imagine Entertainment', 'Legendary Pictures', 'Lionsgate', 'Marvel Studios', 'Metro-Goldwyn-Mayer (MGM)', 'Millennium Films', 'Miramax Films', 'New Line Cinema', 'New Regency Pictures', 'Orion Pictures', 'Other Prod Company', 'Paramount Pictures', 'Pixar Animation Studios', 'RKO Radio Pictures', 'Regency Enterprises', '

In [64]:
df_modelo.to_parquet('../Datasets/matriz_features.parquet')
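
The notebook stops once the feature matrix is saved and does not show the similarity step. As an illustration of how the two saved tables could be combined to return similar titles, here is a minimal sketch that assumes cosine similarity as the metric. The function `recommend`, the toy frames, and the choice of metric are all hypothetical and are not taken from the source:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for pre_modelo.parquet and matriz_features.parquet
titles_demo = pd.DataFrame({"id": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
features_demo = pd.DataFrame({
    "f1": [1.0, 1.0, 0.0, 0.5],
    "f2": [0.0, 0.0, 1.0, 0.5],
    "f3": [0.5, 0.4, 0.5, 0.5],
})

def recommend(title, titles, features, top_n=5):
    """Return the top_n titles most similar to `title` (assumed metric: cosine)."""
    positions = np.flatnonzero(titles["title"].to_numpy() == title)
    if len(positions) == 0:
        raise ValueError(f"Title not found: {title!r}")
    idx = int(positions[0])
    # Similarity between the query row and every row of the matrix
    scores = cosine_similarity(features.iloc[[idx]], features).ravel()
    # Sort by descending similarity and skip the query movie itself
    ranked = [int(i) for i in np.argsort(-scores) if i != idx][:top_n]
    return titles["title"].iloc[ranked].tolist()

print(recommend("A", titles_demo, features_demo, top_n=2))
```

This relies on both tables sharing the same row order, which holds in the notebook because `df_premodelo` and `df_modelo` are derived from the same `df_premodel`.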