# 📚 BIG DATA II
## 💻 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas
## 🏫 Universidad Nacional Autónoma de México

<hr>

### 🎬 Netflix Case
### 🍿 Content Recommendation System
#### 🚻 Collaborative Filtering Method

<br>

#### Author:
#### Iván Alejandro Ramos Herrera
#### 💜 [@arhcoder](https://github.com/arhcoder)



# [01] 📓 Dataset Selection

## Netflix Prize data
### Source: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data/
### Dataset information:
> Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

> TRAINING DATASET FILE DESCRIPTION
>
> The file "training_set.tar" is a tar of a directory containing 17770 files, one
> per movie. The first line of each file contains the movie id followed by a
> colon. Each subsequent line in the file corresponds to a rating from a customer
> and its date in the following format:
>
> CustomerID,Rating,Date
>
> - MovieIDs range from 1 to 17770 sequentially.
> - CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
> - Ratings are on a five star (integral) scale from 1 to 5.
> - Dates have the format YYYY-MM-DD.
>
> MOVIES FILE DESCRIPTION
>
> Movie information in "movie_titles.txt" is in the following format:
>
> MovieID,YearOfRelease,Title
>
> - MovieID do not correspond to actual Netflix movie ids or IMDB movie ids.
> - YearOfRelease can range from 1890 to 2005 and may correspond to the release of
>   corresponding DVD, not necessarily its theaterical release.
> - Title is the Netflix movie title and may not correspond to
>   titles used on other sites. Titles are in English.
>
> QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION
>
> The qualifying dataset for the Netflix Prize is contained in the text file
> "qualifying.txt". It consists of lines indicating a movie id, followed by a
> colon, and then customer ids and rating dates, one per line for that movie id.
> The movie and customer ids are contained in the training set. Of course the
> ratings are withheld. There are no empty lines in the file.
…

> To calculate the RMSE of your predictions against those
ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
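
The training format described above (a `MovieID:` header line followed by `CustomerID,Rating,Date` lines) can be read with a small streaming parser. The following is a minimal sketch, not the loader used in this notebook; `RatingRecord`, `parse_ratings`, and the sample lines are hypothetical names and made-up values.

```python
from datetime import date
from typing import Iterable, Iterator, NamedTuple


class RatingRecord(NamedTuple):
    movie_id: int
    customer_id: int
    rating: int
    rated_on: date


def parse_ratings(lines: Iterable[str]) -> Iterator[RatingRecord]:
    """Yield one record per rating line, tagged with the current movie id."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A header line such as "1:" starts a new movie block.
            movie_id = int(line[:-1])
            continue
        if movie_id is None:
            raise ValueError("rating line found before any movie header")
        customer_id, rating, day = line.split(",")
        yield RatingRecord(movie_id, int(customer_id), int(rating),
                           date.fromisoformat(day))


# Hypothetical sample in the documented layout (values are made up).
sample = ["1:", "1001,3,2005-09-06", "1002,5,2005-05-13", "2:", "1001,4,2005-10-19"]
records = list(parse_ratings(sample))
print(records[0])
```

Because it is a generator, the parser holds only one line at a time in memory, which may help when the full table does not fit in RAM.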


# [02] 📖 Dataset

## Loading

In [1]:
import warnings
warnings.filterwarnings("ignore")

### Reading the Dataset

In [2]:
# Read the dataset:
import pandas as pd

# Only the first two columns are kept, so each "MovieID:" header line
# is read with the movie ID in User_Id and NaN in Rating.
frames = []
for i in range(1, 5):
  df = pd.read_csv(f"/content/drive/MyDrive/Datasets/Netflix/combined_data_{i}.txt",
                   header = None, names = ["User_Id", "Rating"], usecols = [0, 1])
  df["Rating"] = df["Rating"].astype(float)
  frames.append(df)

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it:
ratings = pd.concat(frames, ignore_index=True)
del frames

# Inspect the dataset:
print(ratings.head(6))
print(ratings.shape)

   User_Id  Rating
0       1:     NaN
1  1488844     3.0
2   822109     5.0
3   885013     4.0
4    30878     4.0
5   823519     3.0
(100498277, 2)


### Movie titles

In [16]:
# Movie titles:
import csv

names = []
def combine_name(row):
    # Titles that contain commas are split by csv.reader; rejoin the pieces:
    return " ".join(map(str, row[2:]))

with open("/content/drive/MyDrive/Datasets/Netflix/movie_titles.csv", encoding="ISO-8859-1") as file:
  reader = csv.reader(file)
  for row in reader:
    # Rebuild the Name column:
    name = combine_name(row)
    names.append([row[0], row[1], name])

# Titles DataFrame (Movie_Id is kept as a string):
titles = pd.DataFrame(names, columns=["Movie_Id", "Year", "Name"])
titles.set_index("Movie_Id", inplace=True)
print(titles.head(6))
print(titles.shape)

          Year                          Name
Movie_Id                                    
1         2003               Dinosaur Planet
2         2004    Isle of Man TT 2004 Review
3         1997                     Character
4         1994  Paula Abdul's Get Up & Dance
5         2004      The Rise and Fall of ECW
6         1997                          Sick
(17770, 2)
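
The cell above rejoins the pieces of a title with spaces, so any comma inside a title is lost. An alternative is to split each line only on its first two commas. This is a sketch that assumes titles are not quoted in the file, as the `MovieID,YearOfRelease,Title` description suggests; `parse_title_line` and the sample line are hypothetical.

```python
from typing import Tuple


def parse_title_line(line: str) -> Tuple[str, str, str]:
    """Split "MovieID,YearOfRelease,Title" on the first two commas only."""
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    return movie_id, year, title


# Hypothetical line whose title contains a comma (not taken from the dataset).
row = parse_title_line("42,1999,Example Movie, The: Part 1\n")
print(row)
```

The year is left as a string on purpose, since the sketch makes no assumption about how missing years are encoded.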


In [16]:
# Free RAM (names that were never defined are skipped):
for var_name in ["df", "reader", "names", "name", "row"]:
  globals().pop(var_name, None)

## Exploration

### Rating distribution

In [5]:
# Rating distribution:
print("Rating counts:")
print(ratings.groupby("Rating")["Rating"].agg(["count"]))

Rating counts:
           count
Rating          
1.0      4617990
2.0     10132080
3.0     28811247
4.0     33750958
5.0     23168232
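
The raw counts are easier to compare as shares of the total. The sketch below only reuses the five counts printed above; it does not touch the `ratings` DataFrame.

```python
# Counts copied from the output above.
counts = {1: 4617990, 2: 10132080, 3: 28811247, 4: 33750958, 5: 23168232}

total = sum(counts.values())
mean_rating = sum(star * n for star, n in counts.items()) / total

for star, n in counts.items():
    print(f"{star} stars: {n / total:6.2%}")
print(f"Total ratings: {total}")
print(f"Mean rating: {mean_rating:.4f}")
```

The five counts add up to 100,480,507, so the distribution covers every rating row and none of the header rows.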


### Data counts

In [6]:
# Data counts:
# Each movie contributes one header row with NaN in Rating.
movies_count = ratings["Rating"].isnull().sum()
users_count = ratings["User_Id"].nunique() - movies_count
rating_count = ratings["User_Id"].count() - movies_count
print(f"Movies: {movies_count}\nUsers: {users_count}\nRatings: {rating_count}")

Movies: 17770
Users: 480189
Ratings: 100480507
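
These three counts show how sparse the user-by-movie matrix is, which is why the missing cells have to be estimated rather than observed. A short worked calculation, using only the numbers printed above:

```python
n_movies = 17770
n_users = 480189
n_ratings = 100480507

# Number of cells if every user had rated every movie.
cells = n_movies * n_users
density = n_ratings / cells

print(f"Cells in the full matrix: {cells}")
print(f"Filled cells: {density:.2%}")
print(f"Average ratings per user: {n_ratings / n_users:.1f}")
```

Roughly 1.2% of the cells hold a rating; the rest are the gaps a recommender has to fill.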


### Missing data

In [7]:
# Missing data:
ratings.isnull().sum()

User_Id        0
Rating     17770
dtype: int64

## Preparation

### Handling missing movie IDs

In [8]:
import numpy as np

# Locate the header rows (NaN in Rating); each one marks the start of a movie:
df_nan = pd.DataFrame(pd.isnull(ratings.Rating))
df_nan = df_nan[df_nan["Rating"] == True]
df_nan = df_nan.reset_index()

# Build one movie ID per rating row: the rows between two consecutive
# headers all belong to the same movie:
movie_np = []
movie_id = 1
for i,j in zip(df_nan["index"][1:], df_nan["index"][:-1]):
    temp = np.full((1, i-j-1), movie_id)
    movie_np = np.append(movie_np, temp)
    movie_id += 1
last_record = np.full((1, len(ratings) - df_nan.iloc[-1, 0] - 1), movie_id)
movie_np = np.append(movie_np, last_record)

# NumPy vector with the new IDs:
print(f"Movie Numpy: {movie_np}")
print(f"Length: {len(movie_np)}")

Movie Numpy: [1.000e+00 1.000e+00 1.000e+00 ... 1.777e+04 1.777e+04 1.777e+04]
Length: 100480507
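
The loop above expands the gaps between header rows into runs of movie IDs. The same logic on a toy table may make it easier to follow; this is a plain-Python sketch, `expand_movie_ids` is a hypothetical helper, and it assumes movie IDs are sequential starting at 1, as the dataset description states.

```python
def expand_movie_ids(header_rows, total_rows):
    """Return one movie id per rating row, given the positions of the header rows."""
    movie_ids = []
    # Pair each header position with the next one (or with the end of the table).
    boundaries = list(header_rows) + [total_rows]
    pairs = zip(boundaries[:-1], boundaries[1:])
    for movie_id, (start, end) in enumerate(pairs, start=1):
        # Rows strictly between two headers are ratings for `movie_id`.
        movie_ids.extend([movie_id] * (end - start - 1))
    return movie_ids


# Toy table: headers at rows 0, 3 and 5 of an 8-row table.
ids = expand_movie_ids([0, 3, 5], total_rows=8)
print(ids)  # → [1, 1, 2, 3, 3]
```

The result has one entry per non-header row (8 - 3 = 5 here). The notebook's numbers follow the same rule: 100498277 rows minus 17770 headers gives the 100480507 IDs reported above.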


In [9]:
# Free RAM (names that were never defined are skipped):
for var_name in ["movie_id", "last_record", "df_nan"]:
  globals().pop(var_name, None)

In [10]:
# Save movie_np to a file to avoid exhausting the RAM:
np.save("movie_np.npy", movie_np)

### Reassembling the full dataset

In [5]:
# Drop the rows with NaN in Rating (the movie header rows):
import numpy as np
ratings = ratings.dropna(subset=["Rating"])

# Load movie_np from disk to avoid exhausting the RAM:
movie_np = np.load("movie_np.npy")

# Attach the movie IDs and convert the IDs to int:
ratings["Movie_Id"] = movie_np.astype(int)
ratings["User_Id"] = ratings["User_Id"].astype(int)

### Saving the Dataset

In [7]:
# Save "ratings" as the final table:
print(ratings.head(6))
ratings.to_csv("/content/drive/MyDrive/Datasets/Netflix/ratings.csv", index=False)

   User_Id  Rating  Movie_Id
1  1488844     3.0         1
2   822109     5.0         1
3   885013     4.0         1
4    30878     4.0         1
5   823519     3.0         1
6   893988     3.0         1


# [03] 🧺 Filling In with Predictions

Non-negative Matrix Factorization (NMF) is applied to the matrix of users, movies, and ratings. The empty cells of that matrix are filled in with estimates of the rating each user would give to each movie they have not seen, and those estimates are used to "predict" which movies each user will want to watch.
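
To make the idea concrete, the sketch below factors a tiny rating table into two non-negative matrices, `P` (users) and `Q` (movies), using multiplicative updates on the observed cells only. It is an illustration under stated assumptions, not Surprise's `NMF` implementation (which uses its own optimization procedure); `nmf_fit`, `predict`, `rmse`, and the toy data are hypothetical.

```python
import math
import random


def nmf_fit(ratings, n_users, n_items, k=2, n_iter=200, seed=0, eps=1e-9):
    """Factor the observed ratings into non-negative P (users) and Q (items)."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]

    # Index the observed ratings by user and by item.
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for (u, i), r in ratings.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    def update(rows, cols, seen_by_row):
        # Multiplicative update: each factor is scaled by a non-negative
        # ratio, so the factors can never become negative.
        for a, seen in enumerate(seen_by_row):
            if not seen:
                continue
            preds = [sum(x * y for x, y in zip(rows[a], cols[b])) for b, _ in seen]
            rows[a] = [
                rows[a][f]
                * sum(cols[b][f] * r for b, r in seen)
                / (sum(cols[b][f] * p for (b, _), p in zip(seen, preds)) + eps)
                for f in range(k)
            ]

    for _ in range(n_iter):
        update(P, Q, by_user)
        update(Q, P, by_item)
    return P, Q


def predict(P, Q, u, i):
    """Estimated rating: dot product of the user and item factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    errors = [(r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items()]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical toy data: (user, movie) -> rating; missing pairs are unseen movies.
toy = {(0, 0): 5, (0, 1): 4, (0, 2): 1, (1, 0): 4, (1, 1): 5, (1, 3): 1,
       (2, 2): 5, (2, 3): 4, (2, 0): 1, (3, 2): 4, (3, 3): 5}

P, Q = nmf_fit(toy, n_users=4, n_items=4)
print(f"Training RMSE: {rmse(toy, P, Q):.3f}")
print(f"Estimated rating of user 3 for movie 0: {predict(P, Q, 3, 0):.2f}")
```

The last line fills one empty cell of the toy matrix, which is the same operation the notebook performs with `nmf.predict(...)`.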

In [None]:
!pip install surprise

## Non-negative Factorization

In [11]:
from surprise import NMF, Dataset, Reader
from surprise.model_selection import cross_validate

# Load only the first 100,000 rows (Reader defaults to a 1-5 rating scale):
reader = Reader()
data = Dataset.load_from_df(ratings[["User_Id", "Movie_Id", "Rating"]][:100000], reader)

# Non-negative Matrix Factorization:
nmf = NMF()

# 3-fold cross-validation with the "RMSE" and "MAE" metrics:
results = cross_validate(nmf, data, measures=["RMSE", "MAE"], cv=3, verbose=False)

# Average the results over the folds:
rmse = results["test_rmse"].mean()
mae = results["test_mae"].mean()
print(f"Mean RMSE: {rmse}")
print(f"Mean MAE: {mae}")

Mean RMSE: 1.1803985125831258
Mean MAE: 0.9740562898435142
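
For reference, the two metrics reported above can be computed by hand. The following sketch uses hypothetical ratings and estimates; the function names are chosen for illustration and are not Surprise's API.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: large misses weigh more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true ratings and model estimates.
actual = [3, 5, 4, 4]
predicted = [3.5, 4.0, 4.0, 5.0]
print(rmse(actual, predicted))  # → 0.75
print(mae(actual, predicted))   # → 0.625
```

Read this way, a mean MAE close to 0.97 on a 1 to 5 scale suggests that the estimates are off by about one star on average for this 100,000-row sample.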


# [04] 🔮 Recommending Movies

A random user is selected to show how the recommendations would be produced.


### Picking a random user

In [1]:
# Random user:
import numpy as np
import pandas as pd

ratings = pd.read_csv("/content/drive/MyDrive/Datasets/Netflix/ratings.csv")
random_user = np.random.choice(ratings["User_Id"][:100000].unique())
print(random_user)

510521
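
The user printed above changes on every run. If a repeatable choice is preferred, the random generator can be seeded. A minimal sketch with the standard library; `pick_user` is a hypothetical helper and the ids are the ones shown in the head of the dataset earlier.

```python
import random


def pick_user(user_ids, seed):
    """Pick one user id reproducibly: the same ids and seed give the same user."""
    # Sorting removes any dependence on the order in which ids were read.
    candidates = sorted(set(user_ids))
    return random.Random(seed).choice(candidates)


# In the notebook the ids would come from ratings["User_Id"][:100000].
sample_ids = [1488844, 822109, 885013, 30878, 823519, 822109]
print(pick_user(sample_ids, seed=42))
```

With NumPy, a seeded generator such as `np.random.default_rng(seed)` should serve the same purpose.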


### The user's favorite movies

In [26]:
# Show the movies the user rated 5:
ids = ratings[(ratings["User_Id"] == random_user) & (ratings["Rating"] == 5)]
ids = ids["Movie_Id"].tolist()

# "titles" is indexed by Movie_Id stored as strings:
names = [titles.loc[str(movie_id), "Name"] for movie_id in ids]
favorites = pd.DataFrame({"Movie_Id": ids, "Name": names})

print(f"\nUser: {random_user}\n[n] favorite movies:")
print(favorites)


User: 510521
[n] favorite movies:
    Movie_Id                                              Name
0         28                                   Lilo and Stitch
1        270                        Sex and the City: Season 4
2       1174                                       The Sandlot
3       2040          Star Trek: The Next Generation: Season 5
4       2452     Lord of the Rings: The Fellowship of the Ring
5       3079                    The Lion King: Special Edition
6       3523          Star Trek: The Next Generation: Season 6
7       3962                         Finding Nemo (Widescreen)
8       4306                                   The Sixth Sense
9       5317                                 Miss Congeniality
10      5326          Star Trek: The Next Generation: Season 4
11      5385                 Jamie Foxx: I Might Need Security
12      5513                          Mr. Bean: The Whole Bean
13      5614                                      Best in Show
14      6205 

### Recommendations found for the user

In [27]:
user_already_views = ratings[ratings["User_Id"] == random_user]

In [29]:
user_already_views

           User_Id  Rating  Movie_Id
43469       510521     3.0        23
67146       510521     5.0        28
136573      510521     4.0        30
847045      510521     4.0       197
943012      510521     4.0       209
...            ...     ...       ...
98587294    510521     5.0     17397
99556645    510521     2.0     17560
99949973    510521     4.0     17627
100161908   510521     4.0     17692


In [19]:
import pandas as pd
from surprise import NMF, Dataset, Reader
from surprise.model_selection import cross_validate

# New DataFrame for the random user (a copy, so columns can be added later):
ratings = pd.read_csv("/content/drive/MyDrive/Datasets/Netflix/ratings.csv")
user = ratings[ratings["User_Id"] == random_user].copy()

# Load the data (only the first 100,000 rows, as in the cross-validation):
reader = Reader()
data = Dataset.load_from_df(ratings[["User_Id", "Movie_Id", "Rating"]][:100000], reader)

In [20]:
user

           User_Id  Rating  Movie_Id
43469       510521     3.0        23
67146       510521     5.0        28
136573      510521     4.0        30
847045      510521     4.0       197
943012      510521     4.0       209
...            ...     ...       ...
98587294    510521     5.0     17397
99556645    510521     2.0     17560
99949973    510521     4.0     17627
100161908   510521     4.0     17692


In [31]:
try:
  del ratings
except:
  pass

In [32]:
# Train the NMF model on all the loaded rows:
nmf = NMF()
trainset = data.build_full_trainset()
nmf.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x7c50f1242140>

In [41]:
ratings = pd.read_csv("/content/drive/MyDrive/Datasets/Netflix/ratings.csv")

In [44]:
# not_viewed_movies = data[data["User_Id"] != random_user]

**NOTE: A PROPER PREDICTION REQUIRES TAKING THE WHOLE DATASET, REMOVING THE MOVIES THE USER HAS ALREADY SEEN, AND PREDICTING RATINGS FOR THE REMAINING ONES BASED ON THOSE THE USER DID SEE. HOWEVER, LOADING "not_viewed_movies" OVERFLOWS THE RAM. THE PREDICTION THAT FOLLOWS IS ONLY AN EXAMPLE OF HOW IT WOULD BE IMPLEMENTED IF THE FULL CATALOG OF UNSEEN MOVIES WERE USED.**

In [38]:
# Predict the estimated scores for the random user's movies.
# Only the user's own rows are scored; applying the function to all of
# "ratings" gives the same values after index alignment, at a far higher cost:
user["Estimate_Score"] = user["Movie_Id"].apply(lambda x: nmf.predict(random_user, x).est)

# Sort by estimated rating, in descending order:
user = user.sort_values("Estimate_Score", ascending=False)

print("Dataset with the estimates for the random user:")
print(user)

Dataset with the estimates for the random user:
          User_Id  Rating  Movie_Id  Estimate_Score
67146      510521     5.0        28        4.979762
136573     510521     4.0        30        3.830453
4317421    510521     3.0       831        3.532140
3884598    510521     4.0       752        3.532140
5582267    510521     2.0      1121        3.532140
...           ...     ...       ...             ...
94189938   510521     5.0     16711        3.532140
95007063   510521     4.0     16879        3.532140
95464819   510521     3.0     16930        3.532140
96136709   510521     5.0     17046        3.532140
43469      510521     3.0        23        3.124421

[212 rows x 4 columns]
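
Most rows above share the estimate 3.532140. A likely explanation is that the model was fit on the first 100,000 rows only, so most of these movies are unknown to it; Surprise's documented default when a prediction is impossible is to return the global mean of the training ratings. The sketch below imitates that fallback with hypothetical factors; it is not Surprise's code.

```python
def predict_with_fallback(factors_user, factors_item, global_mean, user_id, movie_id):
    """Return the dot product of the factors, or the global mean for unknown ids."""
    if user_id not in factors_user or movie_id not in factors_item:
        return global_mean
    return sum(p * q for p, q in zip(factors_user[user_id], factors_item[movie_id]))


# Hypothetical factors for one user and two movies.
factors_user = {510521: [1.2, 0.8]}
factors_item = {28: [2.0, 3.0], 30: [1.5, 2.5]}
# The value repeated in the output above, assumed here to be the global mean.
global_mean = 3.53214

for movie_id in (28, 30, 831):
    estimate = predict_with_fallback(factors_user, factors_item, global_mean, 510521, movie_id)
    print(movie_id, estimate)
```

Under this reading, only the rows with a different estimate (movies 28, 30, and 23) were actually scored by the factorization.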


In [39]:
# Recommendations for the user, already sorted by estimated score
# (all rows are kept here; .head(10) would keep only the top 10):
recomendations = user[["Movie_Id", "Estimate_Score"]]

ids = recomendations["Movie_Id"].tolist()

# "titles" is indexed by Movie_Id stored as strings:
names = [titles.loc[str(movie_id), "Name"] for movie_id in ids]
recomendated_movies = pd.DataFrame({"Movie_Id": ids, "Name": names})

In [40]:
print(f"\nUser: {random_user}:\n[10] Recommendations:")
print(recomendated_movies)


User: 510521:
[10] Recommendations:
     Movie_Id                                               Name
0          28                                    Lilo and Stitch
1          30                             Something's Gotta Give
2         831                                          Mannequin
3         752           Star Trek: The Next Generation: Season 7
4        1121                      MVP 2:  Most Vertical Primate
..        ...                                                ...
207     16711                 Sex and the City: Season 6: Part 1
208     16879                                            Titanic
209     16930                                         The Tuxedo
210     17046   Thomas & Friends: Thomas & His Friends Get Along
211        23  Clifford: Clifford Saves the Day! / Clifford's...

[212 rows x 2 columns]
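
The list above ranks movies the user has already rated. As the earlier note points out, a real recommendation would rank the unseen movies instead. One way to do that without building a large filtered DataFrame is to iterate over the movie IDs and keep only the best estimates. This is a sketch under assumptions: `top_n_unseen` and `fake_estimate` are hypothetical, and in the notebook the estimator could be `lambda m: nmf.predict(random_user, m).est`.

```python
import heapq


def top_n_unseen(all_movie_ids, seen_movie_ids, estimate, n=10):
    """Rank only the movies the user has not rated and keep the n best estimates."""
    seen = set(seen_movie_ids)
    candidates = (m for m in all_movie_ids if m not in seen)
    scored = ((estimate(m), m) for m in candidates)
    # nlargest keeps at most n items in memory while scanning the catalog.
    return [(m, score) for score, m in heapq.nlargest(n, scored)]


def fake_estimate(movie_id):
    """Stand-in for a trained model (hypothetical scoring rule)."""
    return 1 + (movie_id * 7) % 5


catalog = range(1, 21)
seen = [2, 3, 5, 8]
top = top_n_unseen(catalog, seen, fake_estimate, n=3)
print(top)
```

With 17770 movies in the catalog, this scans one ID at a time, so memory use stays small regardless of how many ratings the dataset holds.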


# [05] 💜 @arhcoder
## Author:
### Iván Alejandro Ramos Herrera
### 💜 [@arhcoder](https://github.com/arhcoder)