# Movies Recommendation's System

### Development of a basic recommendation system using Python and Pandas. 

### In this case the recommendation is based on the movies that the user has watched. We will use the cosine simmilarity to find the most similar users and we will recommend the user the most watched movies among these similar users.

In [1]:
import numpy as np
import pandas as pd 
from tqdm import tqdm
import math

#### We need to create a function to get the name of the movie based on its movie Id. We will use this function at the end of the notebook:

In [2]:
def get_title(num):
    title = links.loc[links["movieId"] == num, "Pelicula"]    
    return title

#### Let's import the data:

In [4]:
path = r"C:\Users\usuario\Desktop\Nebulova\Curso\Machine Learning\Ejercicios\Movies"
ratings = pd.read_csv(path + "/ratings_small.csv")
movies_metadata = pd.read_csv(path + "/movies_metadata.csv", low_memory=False)
links = pd.read_csv(path + "/links_small.csv")
#There are some missing values. We will fill them with a -1 value
links["tmdbId"] = links["tmdbId"].fillna(-1).astype(int)

#### In the dataset there are some rows where the id it's a date instead of a number, which is converting the column type to object. We will remove these rows and transform the column to int:

In [5]:
rowsDrop = list()
for i in range(len(movies_metadata["id"])):
    if not movies_metadata["id"][i].isdigit():
        rowsDrop.append(i)
movies_metadata.drop(index=rowsDrop, inplace=True)
movies_metadata["id"] = movies_metadata["id"].astype(int)

#### We need to create now an array with the movies' names:

In [6]:
peliculas = []
for elem in links["tmdbId"]:
    if elem == -1:
        pass
    elif movies_metadata.loc[movies_metadata.id == elem ,'id'].any():
        movie = movies_metadata.loc[movies_metadata.id == elem ,'original_title'].values
        peliculas.append(movie[0])
    else:
        pass

#### This array should be added to the links dataframe to have all the information of the movies in the same table:

In [7]:
peliculas = pd.DataFrame(peliculas, columns = ["Pelicula"])
links = pd.concat([links, peliculas], axis=1)
#We drop those movies with no title
links.dropna(subset = ["Pelicula"], inplace=True)

#### We can add now the titles to the rating's dataframe:

In [8]:
peliculas_ratings = []
for peli in ratings["movieId"]:
    if links.loc[links.movieId == peli ,'movieId'].any():
        movie = links.loc[links.movieId == peli ,'Pelicula'].values
        peliculas_ratings.append(movie[0])
    else:
        peliculas_ratings.append(0)

In [9]:
peliculas_ratings = pd.DataFrame(peliculas_ratings, columns = ["Titulo"])
ratings = pd.concat([ratings, peliculas_ratings], axis=1)
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,Titulo
0,1,31,2.5,1260759144,Dangerous Minds
1,1,1029,3.0,1260759179,Alice in Wonderland
2,1,1061,3.0,1260759182,Shall We Dance
3,1,1129,2.0,1260759185,Manon des Sources
4,1,1172,4.0,1260759205,La double vie de Véronique


#### Last step before begining with the algorithm is to create a matrix where we can see which movies have been seen by each user:

In [10]:
mapMovieId = pd.DataFrame(ratings["movieId"].unique(), columns=["movieId"]).sort_values(by="movieId").reset_index(drop=True)
mapUserId = pd.DataFrame(ratings["userId"].unique(), columns=["userId"]).sort_values(by="userId").reset_index(drop=True)
#Creation of a zero-matrix with shape: users x movies
matrix = np.zeros((mapUserId.shape[0], mapMovieId.shape[0]))

#If the user has seen the movie, we will replace the 0 with a 1:
for index, movies in tqdm(ratings.groupby(by="userId")["movieId"].apply(list).iteritems(), total=matrix.shape[0]):
    user_row = mapUserId.loc[mapUserId["userId"] == index, "userId"].index[0]
    for movie in movies:
        movie_col = mapMovieId.loc[mapMovieId["movieId"] == movie, "movieId"].index[0]
        matrix[user_row, movie_col] = 1

100%|████████████████████████████████████████████████████████████████████████████████| 671/671 [00:49<00:00, 13.50it/s]


In [11]:
matrix = pd.DataFrame(matrix, columns = mapMovieId.iloc[:,0])

### Cosine Similarity 

#### Let's suppose that the user has seen and liked the following movies:

In [12]:
gustos = ["Singin' in the rain", "Breakfast at Tiffany's", "Casablanca", "The Wizard of Oz",
          "Gone with the Wind", "Citizen Kane", "Giant", "East of Eden"]

#### We have to find the corresponding movies IDs:

In [13]:
gustos_movieId = []
for opcion in gustos:
    if links.loc[links["Pelicula"] == opcion, "movieId"].values.any():
        opc = links.loc[links["Pelicula"] == opcion, "movieId"].values[0]
        gustos_movieId.append(opc)
    else:
        pass

#### Next step is to create a dataframe so that we can calculate the cosine similarity with each user on the dataframe:

In [14]:
new_user = np.zeros(matrix.shape[1])
new_user = pd.DataFrame(new_user, index = mapMovieId.iloc[:,0])
#Creation of the new user's array:
for columna in new_user.index:
    if columna in gustos_movieId:
        new_user.loc[columna, 0] = 1
new_user = new_user.iloc[:,0]

new_user = pd.DataFrame(new_user)
new_user = new_user.T

#### It's time now to apply the algorithm:

In [15]:
from scipy.spatial import distance
cosine = []
for fila in range(matrix.shape[0]):
    sim = 1 - distance.cosine(matrix.loc[fila,:].values.reshape(1,-1), new_user)
    cosine.append(sim)

#### We should add this information to the dataframe to be able to manage all the information:

In [16]:
cosine = pd.DataFrame(cosine)
final = pd.concat([matrix, cosine], axis = 1)
final.rename(columns = {0:'cosine_sim'}, inplace=True)

#### If we order now the dataframe based on the cosine similarity obtained, we can get the most similar ones:

In [17]:
final = final.sort_values(by=['cosine_sim'], ascending=False)
#We will work from now on with the 30 most similars:
cercanos = final.iloc[:30,:]

#### We should add up each columns to see which movies are the most seen among these users:

In [18]:
suma_columna = []
for col in range(cercanos.shape[1]-1):
    suma = cercanos.iloc[:,col].sum()
    suma_columna.append(suma)

#### Let's add this row to the dataset:

In [19]:
suma_columna = pd.Series(suma_columna, index = mapMovieId.iloc[:,0], name = "suma")
suma_columna = pd.DataFrame(suma_columna).T

cercanos_final = []
cercanos_final = pd.concat([cercanos, suma_columna])

#### Ordering the dataset based on this last row added, we can get the movies we should recommend to the new user:

In [20]:
resultados_ordenados = cercanos_final.sort_values(by="suma", ascending=False, axis=1)

#### The 3 movie's recommendations for this user are:

In [22]:
cont = 0
buenos = 0
while buenos < 3:
    movie = get_title(resultados_ordenados.columns[cont]).iloc[0]
    if movie in gustos:
        cont += 1
    else:    
        print(get_title(resultados_ordenados.columns[cont]))
        buenos += 1
        cont += 1

733    Sabrina
Name: Pelicula, dtype: object
232    Star Wars
Name: Pelicula, dtype: object
729    Charade
Name: Pelicula, dtype: object
