# Movie recommendation system using the KNN algorithm with content-based filtering on movie titles

---------


> Developed by: Bryam David Vega Moreno<br>
> Role: Backend developer and Data Scientist <br>
> Email: vegabryam40@gmail.com

### Libraries used

In [1]:
# data handling
import pandas as pd
import numpy as np

# building the ML model
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

# saving the ML model
import joblib

### Reading the data

This example uses the MovieLens 100k dataset, which is well suited to testing an ML model. We work with two of its files:

* **u.data**: the ratings that users gave to movies over time. This file matters because it tells us how popular each movie was with viewers.
* **u.item**: the information about the movies.

The goal of this section is to read both files so that we can work with them and start building the recommendation system.
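
As a rough sketch of how the raw `u.data` and `u.item` files could be read directly: `u.data` is tab-separated and `u.item` is pipe-separated, neither has a header row, and `u.item` is usually read with `latin-1` encoding. The helper names `load_ratings` and `load_movies` are hypothetical, and the in-memory sample rows are abridged stand-ins for the real files (the URL fields are placeholders).

```python
import io
import pandas as pd

# Abridged stand-ins for the real files; values are illustrative only.
U_DATA = b"196\t242\t3\t881250949\n63\t242\t3\t875747190\n1\t1\t5\t874965758\n"
U_ITEM = (
    "1|Toy Story (1995)|01-Jan-1995||http://example.invalid/1|0|0|0\n"
    "242|Kolya (1996)|24-Jan-1997||http://example.invalid/242|0|0|0\n"
).encode("latin-1")


def load_ratings(source):
    """Read a u.data-style file: tab-separated, no header row."""
    return pd.read_csv(
        source, sep="\t", header=None,
        names=["user_id", "movie_id", "rating", "timestamp"],
    )


def load_movies(source):
    """Read a u.item-style file: pipe-separated, keep only id and title."""
    return pd.read_csv(
        source, sep="|", header=None, usecols=[0, 1],
        names=["movie_id", "movie_title"], encoding="latin-1",
    )


# A file path such as "u.data" could be passed instead of the buffers.
ratings_demo = load_ratings(io.BytesIO(U_DATA))
movies_demo = load_movies(io.BytesIO(U_ITEM))
print(ratings_demo.head())
print(movies_demo.head())
```

The column names chosen here (`user_id`, `movie_id`, `movie_title`) match the ones the merge and pivot steps rely on.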

In [125]:
# read the user ratings and call them ratings
ratings = pd.read_csv('./data/new_data/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [126]:
movies = pd.read_csv("./data/new_data/movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Preprocessing

After reading the data, the next important step is to merge the two datasets, so that a single dataset holds both the movie title and the ratings users have given over time.

The merge works like a database join: the two datasets are combined using `movie_id` as the join key.

After that, the goal is to obtain each movie's average rating, taking into account the ratings that different users gave to that same movie.
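
To make the join concrete, here is a minimal sketch on made-up data (the frames `ratings_demo` and `movies_demo` are hypothetical). It shows that an inner merge keeps only ratings whose `movie_id` exists in both tables, and that the result carries the title on every rating row.

```python
import pandas as pd

# Hypothetical toy data; movie_id 99 has no matching title.
ratings_demo = pd.DataFrame({
    "user_id": [1, 2, 3, 1],
    "movie_id": [10, 10, 20, 99],
    "rating": [4, 2, 5, 3],
})
movies_demo = pd.DataFrame({
    "movie_id": [10, 20],
    "movie_title": ["Movie A", "Movie B"],
})

# Inner join on the shared key: the rating for movie_id 99 is dropped.
merged_demo = pd.merge(ratings_demo, movies_demo, how="inner", on="movie_id")

# Average rating per title across all users who rated it.
mean_per_movie = merged_demo.groupby("movie_title")["rating"].mean()
print(merged_demo)
print(mean_per_movie)
```

If unmatched ratings should be kept instead, `how="left"` would preserve them with a missing title.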

In [113]:
dataset = pd.merge(ratings, movies, how='inner', on='movie_id')
dataset.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,movie_title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


As we can see, movie ids repeat many times in this dataset, each time with a different rating: every row comes from a different user, so a movie appears once for each user who rated it. For example:

In [115]:
dataset[(dataset['movie_title'] == 'Willy Wonka and the Chocolate Factory (1971)')]

Unnamed: 0,user_id,movie_id,rating,timestamp,movie_title
17437,168,151,5,884288058,Willy Wonka and the Chocolate Factory (1971)
17438,222,151,3,878182109,Willy Wonka and the Chocolate Factory (1971)
17439,285,151,5,890595636,Willy Wonka and the Chocolate Factory (1971)
17440,32,151,3,883717850,Willy Wonka and the Chocolate Factory (1971)
17441,117,151,4,880126373,Willy Wonka and the Chocolate Factory (1971)
...,...,...,...,...,...
17758,346,151,4,874949244,Willy Wonka and the Chocolate Factory (1971)
17759,238,151,2,883576398,Willy Wonka and the Chocolate Factory (1971)
17760,741,151,3,891458539,Willy Wonka and the Chocolate Factory (1971)
17761,864,151,5,888889466,Willy Wonka and the Chocolate Factory (1971)


In [116]:
clean_dataset[(clean_dataset['movie_id'] == 151)]

Unnamed: 0,user_id,movie_id,rating
150,1,151,4.0
441,5,151,3.0
624,6,151,3.0
858,7,151,4.0
2341,14,151,5.0
...,...,...,...
98334,933,151,4.0
98501,934,151,3.0
98880,938,151,4.0
99040,940,151,3.0


So what we want is the mean of the ratings, since our recommendation focuses on the movie rather than on the user. The next cell computes it; note that it groups by `user_id` and `movie_id`, so it averages the ratings within each (user, movie) pair:
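
The choice of grouping keys changes what "mean rating" refers to. The sketch below, on a hypothetical `dataset_demo`, contrasts a per-(user, title) mean, which only collapses repeated ratings by the same user, with a per-title mean across all users.

```python
import pandas as pd

# Hypothetical data: two movie ids share the title "Movie A",
# so user 1 ends up with two ratings for the same title.
dataset_demo = pd.DataFrame({
    "user_id":     [1, 1, 2, 2],
    "movie_id":    [10, 11, 10, 20],
    "movie_title": ["Movie A", "Movie A", "Movie A", "Movie B"],
    "rating":      [4, 2, 5, 3],
})

# One row per (user, title): keeps the user dimension needed for a pivot.
per_user_title = dataset_demo.groupby(
    by=["user_id", "movie_title"], as_index=False
).agg({"rating": "mean"})

# One row per title: a single popularity score, user dimension is lost.
per_title = dataset_demo.groupby(
    by="movie_title", as_index=False
).agg({"rating": "mean"})

print(per_user_title)
print(per_title)
```

Only the first form can feed a movie-by-user matrix, because the second no longer records which user gave which rating.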

In [107]:
clean_dataset = dataset.groupby(by=['user_id','movie_id'], as_index=False).agg({"rating":"mean"})
clean_dataset.head()

Unnamed: 0,user_id,movie_id,rating
0,1,1,5.0
1,1,2,3.0
2,1,3,4.0
3,1,4,3.0
4,1,5,3.0


In [121]:
clean_dataset

Unnamed: 0,user_id,movie_id,rating
0,1,1,5.0
1,1,2,3.0
2,1,3,4.0
3,1,4,3.0
4,1,5,3.0
...,...,...,...
99387,943,1067,2.0
99388,943,1074,4.0
99389,943,1188,3.0
99390,943,1228,3.0


### Recommendation process using the KNN algorithm, taking the movie title and the number of movies to recommend as input:

The first step is to build a pivoted matrix: movie titles become the rows, user ids become the columns, and each cell holds the rating that user gave to that movie.

In [87]:
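# clean_dataset has no movie_title column, so we pivot the merged dataset;
# pivot_table averages ratings when a (title, user) pair appears more than once
movie_to_user_df = dataset.pivot_table(
    index='movie_title',
    columns='user_id',
    values='rating',
    aggfunc='mean').fillna(0)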

movie_to_user_df.head()

movie_title,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1-900 (1994),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101 Dalmatians (1996),2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),5.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187 (1997),0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


One drawback of this matrix is that it is full of zeros, which stand for missing ratings. This is known as the sparsity problem: with so few real ratings per row, the similarities a recommender computes become less reliable, and a dense array wastes memory on the empty cells. To store the data efficiently we convert it to a sparse matrix, which keeps only the non-zero entries:
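
A small sketch of what the conversion does, on a made-up 3x4 array (`dense_demo` is hypothetical). It counts the stored entries, computes the fraction of empty cells, and checks that converting back gives the same values, so no information is lost.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical movie-by-user ratings; 0.0 means "not rated".
dense_demo = np.array([
    [5.0, 0.0, 0.0, 4.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])

sparse_demo = csr_matrix(dense_demo)

# Only non-zero cells are stored in the CSR structure.
stored = sparse_demo.nnz
total_cells = dense_demo.shape[0] * dense_demo.shape[1]
sparsity = 1.0 - stored / total_cells

print("stored entries:", stored)
print("fraction of empty cells:", round(sparsity, 3))

# The round trip back to a dense array reproduces the original values.
round_trip_ok = np.array_equal(sparse_demo.toarray(), dense_demo)
print("round trip ok:", round_trip_ok)
```

By the same arithmetic, the notebook's matrix stores 99693 of its 1664 x 943 = 1,569,152 cells, so only about 6.4% of the cells hold a rating.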

In [9]:
movie_to_user_sparse_df = csr_matrix(movie_to_user_df.values)
movie_to_user_sparse_df

<1664x943 sparse matrix of type '<class 'numpy.float64'>'
	with 99693 stored elements in Compressed Sparse Row format>

With the sparse matrix built, we create a list of all the movies in the dataset and then turn it into a dictionary that maps each title to its row index, which we will use later to make the recommendations.

In [10]:
movies_list = list(movie_to_user_df.index)
movies_list[:10]

["'Til There Was You (1997)",
 '1-900 (1994)',
 '101 Dalmatians (1996)',
 '12 Angry Men (1957)',
 '187 (1997)',
 '2 Days in the Valley (1996)',
 '20,000 Leagues Under the Sea (1954)',
 '2001: A Space Odyssey (1968)',
 '3 Ninjas: High Noon At Mega Mountain (1998)',
 '39 Steps, The (1935)']

In [11]:
movie_dict = {movie : index for index, movie in enumerate(movies_list)}
print(movie_dict)

{"'Til There Was You (1997)": 0, '1-900 (1994)': 1, '101 Dalmatians (1996)': 2, '12 Angry Men (1957)': 3, '187 (1997)': 4, '2 Days in the Valley (1996)': 5, '20,000 Leagues Under the Sea (1954)': 6, '2001: A Space Odyssey (1968)': 7, '3 Ninjas: High Noon At Mega Mountain (1998)': 8, '39 Steps, The (1935)': 9, '8 1/2 (1963)': 10, '8 Heads in a Duffel Bag (1997)': 11, '8 Seconds (1994)': 12, 'A Chef in Love (1996)': 13, 'Above the Rim (1994)': 14, 'Absolute Power (1997)': 15, 'Abyss, The (1989)': 16, 'Ace Ventura: Pet Detective (1994)': 17, 'Ace Ventura: When Nature Calls (1995)': 18, 'Across the Sea of Time (1995)': 19, 'Addams Family Values (1993)': 20, 'Addicted to Love (1997)': 21, 'Addiction, The (1995)': 22, 'Adventures of Pinocchio, The (1996)': 23, 'Adventures of Priscilla, Queen of the Desert, The (1994)': 24, 'Adventures of Robin Hood, The (1938)': 25, 'Affair to Remember, An (1957)': 26, 'African Queen, The (1951)': 27, 'Afterglow (1997)': 28, 'Age of Innocence, The (1993)': 2

Next, we **create our KNN model** using the cosine similarity metric and the brute-force algorithm, so that recommendations reflect the dataset we currently have. With the model created, we fit it so that it learns from that dataset.
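
To illustrate what the cosine metric measures, the sketch below compares a hand-written cosine distance with the distances `NearestNeighbors` returns on a made-up 3x3 matrix. The array `items_demo` and the helper `cosine_distance` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-by-user ratings.
items_demo = np.array([
    [5.0, 4.0, 0.0],   # item 0
    [4.0, 5.0, 0.0],   # item 1: rated by the same users as item 0
    [0.0, 0.0, 3.0],   # item 2: rated by a different user only
])


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two rating vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


model_demo = NearestNeighbors(metric="cosine", algorithm="brute")
model_demo.fit(items_demo)

# Query with item 0 itself: the nearest neighbour is the query row.
distances_demo, indices_demo = model_demo.kneighbors(
    items_demo[[0]], n_neighbors=3
)
print(indices_demo[0])
print(distances_demo[0])
print(cosine_distance(items_demo[0], items_demo[1]))
```

Because the queried item comes back as its own closest neighbour at distance 0, a recommender has to ask for one extra neighbour and drop the query from the result.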

In [12]:
knn_movie_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_movie_model.fit(movie_to_user_sparse_df)

NearestNeighbors(algorithm='brute', metric='cosine')

With the KNN model fitted, we can recommend movies for a particular title, using the title as input as stated at the start. To do this we write a function that takes the movie title and the number of similar movies we want to get for it.

In [13]:
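def get_similar_movies(movie, n = 10):
    # row of the pivot matrix that holds this movie's ratings
    index = movie_dict[movie]
    knn_input = np.asarray([movie_to_user_df.values[index]])
    # we cannot ask for more neighbours than there are other movies
    n = min(len(movies_list)-1,n)
    # request n+1 neighbours because the closest one is the movie itself
    distances, indices = knn_movie_model.kneighbors(knn_input, n_neighbors=n+1)

    # the message stays in Spanish; it reads "Top n movies most similar to <movie> are:"
    print("Top",n,"peliculas que tienen mucha similitud con la pelicula-",movie, "son: ")
    print(" ")
    # skip position 0, which is the queried movie
    for i in range(1,len(distances[0])):
        print(movies_list[indices[0][i]])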
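
For use outside a notebook, a variant that returns its results instead of printing them may be easier to work with. The following is a sketch under assumptions, not the notebook's method: `build_index` and `similar_movies` are hypothetical helpers, the data is made up, and the query row is removed by index rather than by position in case of ties.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def build_index(movie_user_df):
    """Fit a cosine KNN model on a movie-by-user DataFrame."""
    model = NearestNeighbors(metric="cosine", algorithm="brute")
    model.fit(csr_matrix(movie_user_df.values))
    return model, list(movie_user_df.index)


def similar_movies(movie, model, titles, matrix, n=10):
    """Return up to n (title, cosine distance) pairs, closest first."""
    if movie not in titles:
        raise KeyError(f"unknown movie title: {movie!r}")
    index = titles.index(movie)
    n = min(len(titles) - 1, n)
    distances, indices = model.kneighbors(
        matrix[index].reshape(1, -1), n_neighbors=n + 1
    )
    # Drop the queried movie by index, then keep at most n results.
    pairs = [
        (titles[i], float(d))
        for d, i in zip(distances[0], indices[0])
        if i != index
    ]
    return pairs[:n]


# Hypothetical ratings: rows are movies, columns are user ids.
demo_df = pd.DataFrame(
    [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [5, 5, 0, 1]],
    index=["A", "B", "C", "D"], columns=[1, 2, 3, 4], dtype=float,
)
demo_model, demo_titles = build_index(demo_df)
top_two = similar_movies("A", demo_model, demo_titles, demo_df.values, n=2)
print(top_two)
```

Returning a list also makes the function straightforward to serialize as JSON in a web service.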

Finally, we test the recommendation model.

In [14]:
movie_name = 'Toy Story (1995)'

get_similar_movies(movie_name,15)

Top 15 peliculas que tienen mucha similitud con la pelicula- Toy Story (1995) son: 
 
Star Wars (1977)
Return of the Jedi (1983)
Independence Day (ID4) (1996)
Rock, The (1996)
Mission: Impossible (1996)
Willy Wonka and the Chocolate Factory (1971)
Star Trek: First Contact (1996)
Fargo (1996)
Jerry Maguire (1996)
Raiders of the Lost Ark (1981)
Twelve Monkeys (1995)
Twister (1996)
Empire Strikes Back, The (1980)
Apollo 13 (1995)
Men in Black (1997)


<div class="alert alert-success">
    <b>With this, the recommendation system works correctly!</b> <br>
    <br>- The model takes as input the movie title and the number of movies to recommend.
    <br>- The model returns movies related to the movie that was given as input.
    <br>- The next section shows how to save the model in <b>.pkl</b> format for use in a web service.
</div>

### Saving the KNN model in .pkl format

One of the most important steps after building an ML model is exporting it, because these models often stay in Jupyter notebooks and never leave the testing stage. What matters is deploying the model in a system to see how it performs.

The steps to export the model in .pkl format follow.

#### Joblib for exporting and loading models

Joblib is a library that makes it easier to build pipelines in Python; it can also export and import models so that they can be tested elsewhere.

In [15]:
import joblib

#### Exporting the model

Exporting an ML model is straightforward with this tool: we call the **dump** function, passing the ML model and the path where the .pkl file will be written.

In [16]:
name_model = 'knn_model'
path_save = './model/{}.pkl'.format(name_model)
joblib.dump(knn_movie_model, path_save)

['./model/knn_model.pkl']
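
One practical detail: `joblib.dump` writes to the path it is given but does not create missing folders, so the target directory has to exist first. The sketch below shows a save and load round trip in a temporary directory; the tiny model and the names `model_demo`, `path_demo`, and `restored_demo` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical model fitted on made-up data.
train_demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
model_demo = NearestNeighbors(metric="cosine", algorithm="brute").fit(train_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "model")
    # Create the folder before dumping; dump would fail if it were missing.
    os.makedirs(model_dir, exist_ok=True)
    path_demo = os.path.join(model_dir, "knn_model.pkl")

    # dump returns the list of files it wrote.
    saved_files = joblib.dump(model_demo, path_demo)
    restored_demo = joblib.load(path_demo)

    # The restored model should return the same neighbours as the original.
    query = np.array([[1.0, 0.05]])
    original_idx = model_demo.kneighbors(query, n_neighbors=2)[1]
    restored_idx = restored_demo.kneighbors(query, n_neighbors=2)[1]
    same_neighbours = np.array_equal(original_idx, restored_idx)

print(saved_files)
print(same_neighbours)
```

Comparing the neighbours before and after loading is a cheap sanity check that the exported file is usable.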

#### Importing a model

Now suppose we need to serve the model from a web service, for example in Django. Joblib can also import the model so that it can be used as a service. The next cell shows how to import it.

In [17]:
name_model_load = 'knn_model'
path_load = './model/{}.pkl'.format(name_model_load)
model_load = joblib.load(path_load,mmap_mode='r')

With the ML model imported, the following example simulates a function that gets recommendations from the imported model. It gives the following result.

In [23]:
def get_similar_movies_with_model_load(movie, n = 10):
    # same lookup as before; the two prints show the row index and the rating vector
    index = movie_dict[movie]
    print(index)
    knn_input = np.asarray([movie_to_user_df.values[index]])
    print(knn_input)
    n = min(len(movies_list)-1,n)
    # the only change: neighbours come from the model loaded from disk
    distances, indices = model_load.kneighbors(knn_input, n_neighbors=n+1)

    print("Top",n,"peliculas que tienen mucha similitud con la pelicula-",movie, "son: ")
    print(" ")
    # position 0 is the queried movie itself, so start at 1
    for i in range(1,len(distances[0])):
        print(movies_list[indices[0][i]])

In [24]:
movie_name = 'Willy Wonka and the Chocolate Factory (1971)'

get_similar_movies_with_model_load(movie_name,15)

1633
[[4. 0. 0. 0. 3. 3. 4. 0. 0. 0. 0. 0. 0. 5. 0. 5. 4. 3. 0. 3. 0. 0. 3. 5.
  4. 3. 0. 0. 0. 0. 0. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. 4. 4. 4. 2. 4. 0. 0.
  5. 0. 0. 5. 4. 2. 0. 4. 3. 3. 5. 5. 0. 5. 0. 3. 0. 0. 4. 0. 5. 3. 1. 0.
  0. 0. 5. 0. 0. 0. 0. 0. 2. 2. 3. 4. 0. 0. 0. 0. 5. 2. 0. 0. 1. 5. 4. 0.
  0. 0. 0. 0. 3. 0. 0. 0. 0. 0. 0. 0. 5. 0. 0. 0. 0. 0. 0. 0. 4. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 3. 0. 0. 0. 3. 0. 0. 0. 0. 0. 4. 0. 0. 2. 0. 0. 0.
  0. 0. 0. 4. 0. 4. 5. 4. 0. 0. 0. 0. 0. 0. 0. 4. 0. 3. 0. 0. 0. 0. 0. 5.
  0. 0. 0. 0. 0. 3. 0. 4. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 3. 5. 0. 0. 0.
  0. 0. 0. 0. 0. 4. 0. 3. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 5. 5. 5. 3.
  0. 0. 0. 0. 1. 3. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 3. 0. 2. 0. 2. 0. 0.
  0. 0. 3. 5. 3. 5. 4. 0. 0. 4. 5. 0. 0. 2. 0. 5. 4. 0. 0. 0. 0. 0. 0. 0.
  2. 0. 0. 3. 5. 0. 0. 0. 0. 0. 0. 5. 3. 0. 4. 0. 0. 0. 4. 0. 5. 5. 0. 0.
  2. 2. 5. 5. 4. 5. 4. 2. 3. 3. 4. 0. 2. 0. 5. 0. 4. 0. 0. 4. 0. 0. 0. 2.
  1. 4. 0. 0. 0. 0. 0. 0. 0. 0. 4

<div class="alert alert-success">
    <b>With this, the ML model can be used in a web service!</b> <br>
    <br>- We learned how to create an ML model using the KNN algorithm.
    <br>- We learned about the sparsity problem and how to handle it.
    <br>- We learned how to export and import the ML model for use in a web service.
</div>
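
The loaded model on its own cannot answer a query by title: the lookup functions above also need the title list and the rating matrix. One possible approach, shown here only as a sketch, is to store all three in a single joblib file. The `bundle` dictionary layout, the `recommend` helper, and the file name are assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical titles and movie-by-user ratings.
titles_demo = ["Movie A", "Movie B", "Movie C"]
matrix_demo = csr_matrix(np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
]))

# Everything a service needs to answer a query, in one object.
bundle = {
    "model": NearestNeighbors(metric="cosine", algorithm="brute").fit(matrix_demo),
    "titles": titles_demo,
    "matrix": matrix_demo,
}


def recommend(loaded, movie, n=1):
    """Look up a title in the loaded bundle and return n similar titles."""
    index = loaded["titles"].index(movie)
    n = min(len(loaded["titles"]) - 1, n)
    _, indices = loaded["model"].kneighbors(
        loaded["matrix"][index], n_neighbors=n + 1
    )
    return [loaded["titles"][i] for i in indices[0] if i != index][:n]


with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "recommender_bundle.pkl")
    joblib.dump(bundle, bundle_path)
    loaded_bundle = joblib.load(bundle_path)
    result_demo = recommend(loaded_bundle, "Movie A", n=1)

print(result_demo)
```

Keeping the matrix in a separate file, as the remaining cells do with a CSV, is an alternative when other tools also need to read it.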

In [53]:
movie_to_user_df.to_csv("./data/movies_to_user.csv",index=True)

In [54]:
df = pd.read_csv('./data/movies_to_user.csv')

In [45]:
# list(df) would return the column names; the titles are in the movie_title column
movies_list = list(df['movie_title'])
movie_dict = {movie: index for index, movie in enumerate(movies_list)}

In [57]:
df = df.set_index('movie_title')

movie_title,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1-900 (1994),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101 Dalmatians (1996),2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),5.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187 (1997),0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Young Guns II (1990),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
"Young Poisoner's Handbook, The (1995)",0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zeus and Roxanne (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
unknown,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
