### Import Libraries

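We import the libraries used throughout the project: pandas and NumPy for data manipulation, `ast` for parsing stringified lists, NLTK and scikit-learn for text processing, vectorization, and similarity metrics, PyTorch and Hugging Face Transformers for the BERT model, and Matplotlib for plotting.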

In [81]:
import ast
import os
import numpy as np
import pandas as pd
import pickle
import torch
import nltk
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt


### Reading the Datasets

In [82]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv') 

In [83]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [84]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


### Merging the `movies` and `credits` DataFrames

In this step, we merge the **`movies`** and **`credits`** DataFrames on their common column **`title`**. This adds the **`movie_id`**, **`cast`**, and **`crew`** columns to each movie record, giving us a single DataFrame with every attribute needed for the later steps. Note that **`title`** is not guaranteed to be unique, so films that share a title can produce extra rows; the **`id`** and **`movie_id`** columns are a safer key when exact one-to-one matching matters.

In [85]:
movies = movies.merge(credits, on = 'title')
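
Because a title is not guaranteed to be unique, merging on it can multiply rows. The following is a minimal sketch on made-up data (the `movies_demo` and `credits_demo` frames are hypothetical, not part of the dataset) comparing a merge on `title` with a merge on the ID columns:

```python
import pandas as pd

# Toy data: two different films share the title "Batman".
movies_demo = pd.DataFrame({"id": [1, 2, 3],
                            "title": ["Batman", "Batman", "Avatar"]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 3],
                             "title": ["Batman", "Batman", "Avatar"],
                             "cast": ["cast_a", "cast_b", "cast_c"]})

# Merging on title pairs every "Batman" row with every other "Batman" row.
by_title = movies_demo.merge(credits_demo, on="title")

# Merging on the ID columns keeps exactly one row per film.
by_id = movies_demo.merge(credits_demo.drop(columns="title"),
                          left_on="id", right_on="movie_id")

print(len(by_title))  # 5
print(len(by_id))     # 3
```

Whether the extra rows matter depends on the use case; for a title-based lookup they mainly show up as repeated titles in the results.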

### Cleaning and Preparing the `movies` DataFrame for Creating Embeddings

In this step, we clean and prepare the **`movies`** DataFrame so that it is in the right format for creating embeddings. This includes dropping columns we do not need, handling missing values, converting the stringified list columns into Python lists, and normalizing the text. The quality of this preparation directly affects how meaningful the embeddings for our movie recommendation model will be.

In [86]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [87]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

### Removing Unnecessary Columns from the `movies` DataFrame

In this step, we will remove unnecessary columns from the **`movies`** DataFrame and keep only the relevant columns needed for our analysis. This helps streamline the dataset, reducing clutter and focusing on the data that is essential for building our movie recommendation model.

In [88]:
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [89]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


#### Checking for Missing Values in the `movies` DataFrame

We check for any missing values in the **`movies`** DataFrame and display the count of null values for each column. This allows us to identify potential issues in the dataset that may need to be addressed before proceeding with analysis.

#### Dropping Rows with Missing Values

We drop any rows in the **`movies`** DataFrame that contain missing values. Only the **`overview`** column has nulls (3 rows), so this costs very little data while ensuring that every remaining movie has a description to work with.

#### Checking for Duplicate Rows in the `movies` DataFrame

We check for any duplicate rows in the **`movies`** DataFrame. If any duplicates are found, we will drop them to ensure that our dataset remains unique and to prevent any biases in our analysis or recommendations.

In [90]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [91]:
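# Reassign rather than use inplace=True: `movies` was built by selecting columns,
# and in-place changes on such a selection can trigger a SettingWithCopyWarning
movies = movies.dropna()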

In [92]:
movies.shape

(4806, 7)

In [93]:
movies.duplicated().sum()

0

In [94]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### `convert_genres` and `convert_keywords` Functions

The **`convert_genres`** and **`convert_keywords`** functions both take a string representation of a list of dictionaries as input. They parse the string with `ast.literal_eval` and extract the **`name`** value (the genre or keyword) from each dictionary.

Each function returns a plain list of names, for example `['Action', 'Adventure']`. These lists describe what a movie is about and later become part of the text used to compare movies with each other.
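
As a minimal sketch of the parsing step, the snippet below applies the same idea to a single hand-written value. The `sample` string and the helper name `extract_names` are illustrative assumptions, not part of the notebook:

```python
import ast

# A shortened value in the same format as the `genres` column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def extract_names(text):
    """Parse a stringified list of dicts and return each dict's 'name' value."""
    return [item["name"] for item in ast.literal_eval(text)]

print(extract_names(sample))  # ['Action', 'Adventure']
print(extract_names("[]"))    # []
```

Since the genre and keyword columns share this format, one helper like this could serve both.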

In [95]:
def convert_genres(text):
    l = []
    for i in ast.literal_eval(text):
        l.append(i['name'])

    return l

In [96]:
movies['genres'] = movies['genres'].apply(convert_genres)

In [97]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [98]:
def convert_keywords(text):
    l = []
    for i in ast.literal_eval(text):
        l.append(i['name'])

    return l

In [99]:
movies['keywords'] = movies['keywords'].apply(convert_keywords)


In [100]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


### `convert_cast` Function

The **`convert_cast`** function takes a string representation of a list of dictionaries as input and extracts the **`name`** value (here, the actor's name) from each dictionary.

It returns a list with at most three names: the first three cast members listed for the movie.

Limiting the cast to a few names keeps each movie's description compact and stops long cast lists from dominating the text used for recommendations.
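
The "first three names" behaviour can be sketched with list slicing. The helper name `top_names`, its `limit` parameter, and the shortened sample string are assumptions made for illustration:

```python
import ast

def top_names(text, limit=3):
    """Return the 'name' of the first `limit` entries in a stringified list of dicts."""
    return [item["name"] for item in ast.literal_eval(text)[:limit]]

cast_sample = (
    '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, '
    '{"name": "Sigourney Weaver"}, {"name": "Stephen Lang"}]'
)

print(top_names(cast_sample))
# ['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']
```

Slicing also handles movies with fewer than three cast entries, since it simply returns whatever is available.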

In [101]:
def convert_cast(text):
    # Keep only the names of the first three cast members listed for the movie
    return [i['name'] for i in ast.literal_eval(text)[:3]]

In [102]:
movies['cast'] = movies['cast'].apply(convert_cast)

In [103]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### `fetch_director` Function

The **`fetch_director`** function takes a string representation of a list of dictionaries as input and checks each dictionary for the job title **'Director'**.

It returns a list containing the name of the first crew member found with that job title, or an empty list if none is found. Returning a list (rather than a bare string) keeps the **`crew`** column in the same shape as the other list-valued columns.
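
The "first match or nothing" logic can also be written with `next` and a generator. This is a sketch under assumptions: the helper name `first_director` and the sample string are hypothetical, and `.get("job")` is used so that an entry without a `job` key is skipped rather than raising an error:

```python
import ast

def first_director(text):
    """Return [name] for the first crew entry whose job is 'Director', else []."""
    crew = ast.literal_eval(text)
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [name] if name is not None else []

crew_sample = (
    '[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
    '{"job": "Director", "name": "James Cameron"}]'
)

print(first_director(crew_sample))  # ['James Cameron']
print(first_director("[]"))         # []
```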

In [104]:
def fetch_director(text):
    l = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            l.append(i['name'])
            break

    return l


In [105]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [106]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


### Splitting the Overview into Words

In this step, we transform the **'overview'** column in the `movies` DataFrame by splitting each string into a list of words on whitespace.

After this step, each overview is a list of words instead of a single string, which puts it in the same shape as the **genres**, **keywords**, **cast**, and **crew** columns so that all of them can be concatenated later.

Note that splitting on whitespace leaves punctuation attached to the neighbouring word (for example `'century,'`), so this is a simple word split rather than full tokenization.
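
A small sketch of what the whitespace split produces, using a shortened overview string chosen for illustration:

```python
overview = "In the 22nd century, a paraplegic Marine is dispatched"

# str.split() with no arguments splits on any run of whitespace.
tokens = overview.split()
print(tokens)
# ['In', 'the', '22nd', 'century,', 'a', 'paraplegic', 'Marine', 'is', 'dispatched']

# Once the overview is a list, it can be joined to other lists with `+`.
combined = tokens + ["Action", "Adventure"]
print(len(combined))  # 11
```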

In [107]:
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [108]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


In [109]:
movies.iloc[0]['overview']

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']

### Removing Spaces in the **Cast**, **Crew**, and **Genres** Columns

Before building the recommendation model, we remove the spaces inside each entry of the **cast**, **crew**, and **genres** columns so that a multi-word name becomes a single token. Without this step, `Sam Worthington` and `Sam Mendes` would both contribute the token `Sam` once the text is vectorized, making unrelated movies look more similar than they are. The **keywords** column is left unchanged.
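
The effect of removing spaces can be sketched by comparing the tokens two names share before and after the change. The helper `name_tokens` and the two name lists are illustrative assumptions:

```python
def name_tokens(names, strip_spaces):
    """Return the set of lowercase tokens produced by a list of names."""
    out = set()
    for name in names:
        if strip_spaces:
            name = name.replace(" ", "")
        out.update(name.lower().split())
    return out

movie_a = ["Sam Worthington"]
movie_b = ["Sam Mendes"]

# With spaces kept, both movies contain the token "sam".
shared_before = name_tokens(movie_a, False) & name_tokens(movie_b, False)
# With spaces removed, each full name is its own token and nothing overlaps.
shared_after = name_tokens(movie_a, True) & name_tokens(movie_b, True)

print(shared_before)  # {'sam'}
print(shared_after)   # set()
```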

In [110]:
def remove_space(word):
    l = []
    for i in word:
        l.append(i.replace(' ', ''))
    return l

In [111]:
movies['cast'] = movies['cast'].apply(remove_space)
movies['crew'] = movies['crew'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)

In [112]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[culture clash, future, space war, space colon...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]


### Creating the `tags` Column

We create a new column called **`tags`** that concatenates the lists in the **overview**, **genres**, **keywords**, **cast**, and **crew** columns. The result is a single list per movie that holds everything we know about its content, which makes movies easy to compare with one another.

In [113]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
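
Adding list-valued columns works because pandas applies `+` row by row, and `+` on two Python lists concatenates them. A minimal sketch on a one-row toy frame (the `demo` DataFrame and its values are made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "overview": [["a", "paraplegic", "Marine"]],
    "genres": [["Action", "ScienceFiction"]],
    "crew": [["JamesCameron"]],
})

# Row-wise list concatenation across the three columns.
demo["tags"] = demo["overview"] + demo["genres"] + demo["crew"]
print(demo["tags"].iloc[0])
# ['a', 'paraplegic', 'Marine', 'Action', 'ScienceFiction', 'JamesCameron']

# Joining the list with spaces and lowercasing gives one text string per movie.
tag_text = " ".join(demo["tags"].iloc[0]).lower()
print(tag_text)  # a paraplegic marine action sciencefiction jamescameron
```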

In [114]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[culture clash, future, space war, space colon...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."


### Creating the `new_df` DataFrame

This line creates a new DataFrame called **`new_df`** by selecting only the **`movie_id`**, **`title`**, and **`tags`** columns from the original **`movies`** DataFrame. This new DataFrame will serve as the foundation for further analysis and modeling, focusing on the essential information needed for generating accurate movie recommendations.

In [115]:
# .copy() makes new_df independent of movies, so later assignments do not warn
new_df = movies[['movie_id', 'title', 'tags']].copy()

In [116]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [117]:
new_df.loc[:, 'tags'] = new_df['tags'].apply(lambda x: ' '.join(x))


In [118]:
new_df.head(2)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."


In [119]:
new_df.iloc[0]['tags']

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relations mind and soul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [120]:
new_df.loc[:, 'tags'] = new_df['tags'].apply(lambda x: x.lower().replace(',', ''))


In [121]:
new_df.head(2)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,in the 22nd century a paraplegic marine is dis...
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believed to be dead has ...


In [122]:
# Save the DataFrame as a CSV file
new_df.to_csv('new_df.csv', index=False)

In [123]:
# Load the data
new_df = pd.read_csv('new_df.csv')

### Model 1 - Content-Based Recommendation Model

This model recommends movies whose attributes (here, the **`tags`** text) are similar to those of a given movie. We vectorize the tags with TF-IDF, which weights each term by how often it appears in a movie's tags and how rare it is across all movies, and then use cosine similarity to score how close each pair of movies is.
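
To make the two techniques concrete, here is a small pure-Python sketch on three made-up tag strings. It uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, followed by L2 normalization, which is the default weighting documented for scikit-learn's `TfidfVectorizer`. It skips stop-word removal and scikit-learn's tokenizer, so treat it as an illustration of the idea rather than a replacement for the library:

```python
import math
from collections import Counter

docs = [
    "space marine alien planet",
    "alien planet colony",
    "pirate ship ocean",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

def tfidf_vector(tokens):
    """Term count times smoothed IDF, scaled to unit length."""
    counts = Counter(tokens)
    vec = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(u, v):
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(a * b for a, b in zip(u, v))

vectors = [tfidf_vector(doc) for doc in tokenized]
print(round(cosine(vectors[0], vectors[1]), 3))  # between 0 and 1: shared terms
print(cosine(vectors[0], vectors[2]))            # 0.0: no shared terms
```

The first two documents share `alien` and `planet`, so their similarity is positive; the third shares nothing with the first, so its similarity is exactly zero.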

In [124]:
# Create the TF-IDF vectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(new_df['tags'])

In [125]:
# Compute the cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [133]:
# Function to get recommendations
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = new_df[new_df['title'] == title].index[0]

    # Get the similarity scores between this movie and all movies
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Take the 5 most similar movies (skipping the first entry, which is the movie itself)
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]

    # Return the recommended titles, one quoted title per line
    recommendations = new_df['title'].iloc[movie_indices].tolist()
    return "\n".join(f"'{t}'" for t in recommendations)

# Test the function
print(get_recommendations('Avatar'))

'Aliens'
'Alien³'
'Mission to Mars'
'Moonraker'
'Alien'
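
The function above skips the first sorted entry on the assumption that the queried movie always ranks first. If another row has an identical score, that assumption may not hold. One possible variant, sketched here with a hypothetical helper `top_n_indices`, excludes the queried index explicitly:

```python
import numpy as np

def top_n_indices(sim_row, idx, n=5):
    """Return the indices of the n highest scores in sim_row, excluding idx."""
    # A stable sort makes the ordering of tied scores deterministic.
    order = np.argsort(sim_row, kind="stable")[::-1]
    return [int(i) for i in order if i != idx][:n]

# Row 0 of a toy similarity matrix; index 3 ties with the movie itself.
sim_row = np.array([1.0, 0.2, 0.9, 1.0])
print(top_n_indices(sim_row, idx=0, n=2))  # [3, 2]
```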


### Model 2 - BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google. It uses a transformer encoder to build the representation of each word from the words both before and after it, so the same word can be represented differently depending on its context. BERT is pre-trained on a large text corpus and can be fine-tuned for tasks such as sentiment analysis, question answering, and named entity recognition. Here we use the pre-trained model as is, without fine-tuning, to turn each movie's tags into an embedding.

In [127]:
# Load the pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 
model = BertModel.from_pretrained('bert-base-uncased')

### Function to Obtain the Embedding

This function takes a text input and generates an embedding using the pre-trained **BERT** model.

The embedding is a numerical representation of the input text that captures its semantic meaning in a high-dimensional space. The **BERT** model transforms the input text into a format that allows for capturing contextual information, which is crucial for tasks such as recommendation systems or natural language understanding.

In [128]:

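# Function to get the embedding
def get_embedding(text):
    # Tokenize the text (inputs longer than the model's maximum length are truncated)
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # Run the model without tracking gradients, since we only need inference
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the final hidden state of the first token ([CLS]) as the embedding
    return outputs.last_hidden_state[0][0].detach().numpy()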
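
The indexing `last_hidden_state[0][0]` selects the vector of the first token of the first (and only) sequence in the batch. The sketch below uses a small NumPy array as a stand-in for the model output to show that selection next to mean pooling, a common alternative that the notebook does not use. The array values, the `attention_mask`, and the tiny hidden size are assumptions for illustration (`bert-base-uncased` produces 768-dimensional vectors):

```python
import numpy as np

# Stand-in for outputs.last_hidden_state with shape (batch=1, seq_len=4, hidden=3).
last_hidden_state = np.arange(12, dtype=float).reshape(1, 4, 3)
# 1 marks a real token, 0 marks padding; here the last position is padding.
attention_mask = np.array([[1, 1, 1, 0]])

# What get_embedding returns: the first token's vector.
cls_embedding = last_hidden_state[0][0]

# Alternative: average the vectors of the real (non-padding) tokens.
mask = attention_mask[0][:, None]
mean_embedding = (last_hidden_state[0] * mask).sum(axis=0) / mask.sum()

print(cls_embedding)   # [0. 1. 2.]
print(mean_embedding)  # [3. 4. 5.]
```

Which pooling works better for recommendations is an empirical question; the sketch only shows how the two differ mechanically.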

In [129]:
# Compute the embeddings for the movies
new_df['embedding'] = new_df['tags'].apply(get_embedding)

In [130]:
# Convert the embeddings to lists of floats before saving the DataFrame
new_df['embedding'] = new_df['embedding'].apply(lambda x: x.flatten().tolist())  # Converts a NumPy array to a list

# Save the DataFrame with the embeddings to a new CSV file
new_df.to_csv('new_df_with_embeddings.csv', index=False)
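
One thing to keep in mind when list-valued columns are saved to CSV: they are written as text, so they come back as strings when the file is read again. A minimal sketch of the round trip, using an in-memory buffer and a made-up one-row frame instead of the real file:

```python
import ast
import io

import pandas as pd

demo = pd.DataFrame({"title": ["Avatar"], "embedding": [[0.1, 0.2, 0.3]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded["embedding"].iloc[0]))  # <class 'str'>

# Parse the text back into a list of floats.
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval)
print(loaded["embedding"].iloc[0])  # [0.1, 0.2, 0.3]
```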

### Calculating the Similarity Matrix

In this step, we will calculate the similarity matrix using the **cosine similarity** metric. The embeddings for each movie are first converted into a NumPy array. This allows us to compute the similarity between each pair of movie embeddings effectively.

In [131]:
# Compute the similarity matrix
embeddings_array = np.array(new_df['embedding'].tolist())
similarity_matrix = cosine_similarity(embeddings_array)  # The embeddings must be a 2-D NumPy array
print(similarity_matrix)

[[1.         0.88716994 0.87438841 ... 0.87168471 0.81162183 0.87071089]
 [0.88716994 1.         0.83085308 ... 0.88108399 0.84150101 0.85931019]
 [0.87438841 0.83085308 1.         ... 0.83448914 0.79614896 0.83726173]
 ...
 [0.87168471 0.88108399 0.83448914 ... 1.         0.84458369 0.86265072]
 [0.81162183 0.84150101 0.79614896 ... 0.84458369 1.         0.8560308 ]
 [0.87071089 0.85931019 0.83726173 ... 0.86265072 0.8560308  1.        ]]
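
For intuition, cosine similarity between all pairs of rows can be computed by scaling each row to unit length and taking a matrix product. The sketch below does this in plain NumPy on three toy 2-D vectors; the function name `cosine_similarity_matrix` and the `emb` values are illustrative, and it assumes no row is all zeros:

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    return unit @ unit.T

emb = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
sim = cosine_similarity_matrix(emb)
print(np.round(sim, 4))
```

The diagonal is 1 because every vector is identical to itself, and the matrix is symmetric, which matches the structure of the output printed above.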


### Creating a DataFrame from the Similarity Matrix

In this step, we create a **DataFrame** to store the similarity scores between movies. The DataFrame will have movie titles as both the index and the columns, allowing us to easily access similarity scores for any given movie.

In [134]:
# Create a DataFrame from the similarity matrix
similarity_df = pd.DataFrame(similarity_matrix, index=new_df['title'], columns=new_df['title'])

def get_similar_movies(movie_title, n=5):
    # Check whether the movie is in the DataFrame
    if movie_title not in similarity_df.index:
        return f"The movie '{movie_title}' was not found in the dataset."

    # Get the column of similarity scores for the selected movie
    similar_scores = similarity_df[movie_title]

    # Sort the movies by similarity (highest first)
    similar_movies = similar_scores.sort_values(ascending=False)

    # Return the n most similar movies (skipping the first entry, which is the movie itself)
    return similar_movies[1:n+1]

# Example usage
similar_movies = get_similar_movies('Avatar', n=5)
print(similar_movies)

title
The Fifth Element    0.943883
The Terminator       0.937978
The Time Machine     0.935023
Titan A.E.           0.933537
The Thing            0.931523
Name: Avatar, dtype: float64
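
As a variation on the lookup above, the queried movie can be dropped by label instead of by position, which does not depend on it sorting first. This sketch runs on a toy 3x3 matrix; the titles, scores, and the helper name `similar_by_label` are made up, and it assumes titles are unique in the index:

```python
import numpy as np
import pandas as pd

titles = ["A", "B", "C"]
scores = np.array([[1.0, 0.9, 0.2],
                   [0.9, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])
toy_similarity = pd.DataFrame(scores, index=titles, columns=titles)

def similar_by_label(title, n=2):
    """Return the n highest-scoring titles for `title`, excluding itself."""
    column = toy_similarity[title].drop(labels=title)
    return column.sort_values(ascending=False).head(n)

print(similar_by_label("A"))
```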
