# Recommender Systems - Netflix Prize Challenge

## Collaborative Filtering with Cosine Similarity

In this notebook we implement a recommender system using collaborative filtering based on cosine similarity. This section is based on this [highly recommended article](https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3).

We start by loading the data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import pandas as pd

import gc #garbage collector
import warnings
warnings.filterwarnings('ignore')


In [2]:
%%time
df1 = pd.read_csv('../netflix-prize-data/combined_data_1_with_movie_id.csv', dtype={'Rating': np.int8, 'Movie_id': np.int16})


Wall time: 9.49 s


In [3]:
print(df1.shape)
df1.head()

(24053764, 3)


Unnamed: 0,User,Rating,Movie_id
0,1488844,3,1
1,822109,5,1
2,885013,4,1
3,30878,4,1
4,823519,3,1


And the titles.

In [4]:
df_title = pd.read_csv('../netflix-prize-data/movie_titles.csv', encoding = "ISO-8859-1",index_col = 0, header = None, usecols = [0,2], names = ['Movie_Id', 'Name'])
df_title.head()

Unnamed: 0_level_0,Name
Movie_Id,Unnamed: 1_level_1
1,Dinosaur Planet
2,Isle of Man TT 2004 Review
3,Character
4,Paula Abdul's Get Up & Dance
5,The Rise and Fall of ECW


We are going to discard some users in order to shrink the dataset. Done properly, this should not noticeably affect the recommender system. We will do it in an *ad hoc* way, but note that there are principled criteria (not covered here) for doing it.

We discard users who rate too little (five ratings or fewer) or too much (more than 100 ratings).

In [5]:
mask_usuarios_descartables = np.logical_or(df1.User.value_counts() <= 5, df1.User.value_counts() > 100)
usuarios_descartables = mask_usuarios_descartables[mask_usuarios_descartables].index.values
print(len(usuarios_descartables))

150301


In [6]:
mascara_descartables = df1.User.isin(usuarios_descartables)
print(mascara_descartables.sum())

13395316


In [7]:
print(df1.shape)
df1 = df1[~mascara_descartables]
print(df1.shape)

(24053764, 3)
(10658448, 3)
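
The user filter above can also be written by counting the ratings once and mapping the counts back onto each row. This is a minimal sketch on a toy DataFrame: the column names mirror `df1`, while `toy`, `min_ratings` and `max_ratings` are illustrative names and values that do not appear in the notebook.

```python
import pandas as pd

# Toy ratings table with the same columns as df1 (values are made up).
toy = pd.DataFrame({
    "User":     [1, 1, 1, 2, 3, 3, 3, 3, 3],
    "Rating":   [3, 5, 4, 2, 1, 4, 5, 3, 2],
    "Movie_id": [1, 2, 3, 1, 1, 2, 3, 4, 5],
})

# Hypothetical thresholds: keep users with more than min_ratings
# and at most max_ratings ratings.
min_ratings, max_ratings = 1, 4

# For every row, the number of ratings given by that row's user.
ratings_per_user = toy.groupby("User")["Rating"].transform("size")

keep = (ratings_per_user > min_ratings) & (ratings_per_user <= max_ratings)
toy_filtered = toy[keep]
print(toy_filtered.User.unique())
```

In the toy data only user 1 (three ratings) survives; user 2 rates too little and user 3 too much for the chosen thresholds.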


We also discard movies with few ratings (3,000 or fewer). Again, this is not mandatory; we do it to shrink the utility matrix even further.

In [8]:
mask_items_descartables = df1.Movie_id.value_counts() <= 3000
# mask_items_descartables
items_descartables = mask_items_descartables[mask_items_descartables].index.values
# items_descartables
print(len(items_descartables))

3997


In [9]:
mascara_descartables = df1.Movie_id.isin(items_descartables)
print(mascara_descartables.sum())

1606917


In [10]:
print(df1.shape)
df1 = df1[~mascara_descartables]
print(df1.shape)

(10658448, 3)
(9051531, 3)


Note that `df1` is **not** a utility matrix; it is simply a list of entries, one per rating. This is convenient because it takes up **much less** memory. To continue, however, we need to turn it into a utility matrix.

We have to be careful, though. We have on the order of ~1,000 items and ~500,000 users, which gives a matrix with 500,000,000 entries. Most of these entries are 0, since we only have approximately 10,000,000 ratings (these values change depending on how we filter the dataset), that is, roughly one in every 50 possible entries. If we stored this information in an ordinary dense DataFrame, we would almost certainly fill up our RAM. Fortunately, there are sparse data types that store it efficiently.

You can read more [here](https://pandas.pydata.org/docs/user_guide/sparse.html).
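
To get a feel for the saving, the sketch below compares the memory footprint of a dense NumPy array with a SciPy CSR matrix holding the same values. The sizes and the density are toy assumptions, much smaller than the real dataset, and the ratings are random.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_users, n_items = 2000, 500   # toy sizes, far smaller than the real data
density = 0.02                 # roughly one rating per 50 cells

# Dense matrix: every cell is stored, including the zeros.
dense = np.zeros((n_users, n_items), dtype=np.float64)
n_ratings = int(n_users * n_items * density)
rows = rng.integers(0, n_users, size=n_ratings)
cols = rng.integers(0, n_items, size=n_ratings)
dense[rows, cols] = rng.integers(1, 6, size=n_ratings)

# CSR matrix: only the non-zero values and their positions are stored.
sparse_matrix = sparse.csr_matrix(dense)
sparse_bytes = (sparse_matrix.data.nbytes
                + sparse_matrix.indices.nbytes
                + sparse_matrix.indptr.nbytes)

print(f"dense:  {dense.nbytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```

The dense array costs 8 bytes per cell regardless of content, while the sparse version grows only with the number of ratings.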

In [11]:
usuarios = df1.User.unique()
items = df1.Movie_id.unique()

n_usuarios = len(usuarios)
n_items = len(items)
print(n_usuarios, n_items)

320181 502


We create a sparse matrix of zeros with `n_usuarios` rows and `n_items` columns.

In [24]:
matriz_utilidad = pd.DataFrame(np.zeros((n_usuarios, n_items)), index =  usuarios, columns = items).astype(pd.SparseDtype("float"))
matriz_utilidad.head()

Unnamed: 0,8,17,28,30,33,58,83,97,108,111,...,4402,4418,4420,4432,4472,4479,4488,4490,4492,4496
824097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2630686,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
644003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
229220,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2580478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


So far the matrix does not contain the ratings, only zeros, so we have to fill it. Since it is a sparse matrix, this has to be done in a particular way. You can read more [here](https://stackoverflow.com/questions/49032856/assign-values-to-sparsearray-in-pandas).

Let's fill the matrix with the ratings:

In [25]:
# Iterate over each column (one column per movie)
for column in matriz_utilidad:
    # Save the original sparse dtype to restore it afterwards
    spdtypes = matriz_utilidad.dtypes[column]

    # Work on a dense copy of the column (avoids chained assignment)
    columna_densa = matriz_utilidad[column].sparse.to_dense()

    # Fill the column with the ratings given to this movie
    mask_calificaciones_columna = df1.Movie_id == column
    calificaciones_columna = df1[mask_calificaciones_columna]
    columna_densa.loc[calificaciones_columna.User.values] = calificaciones_columna.Rating.values

    # Back to the original sparse format
    matriz_utilidad[column] = columna_densa.astype(spdtypes)

In [26]:
matriz_utilidad.head()

Unnamed: 0,8,17,28,30,33,58,83,97,108,111,...,4402,4418,4420,4432,4472,4479,4488,4490,4492,4496
824097,2.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0
2630686,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
644003,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
229220,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2580478,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
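
As an alternative to filling the matrix column by column, a sparse utility matrix can be built in one step from (user, movie, rating) triples. The sketch below is not the approach used in this notebook; it uses SciPy directly, and the toy arrays `users`, `movies` and `ratings` are assumptions standing in for the columns of `df1`.

```python
import numpy as np
from scipy import sparse

# Toy (user, movie, rating) triples standing in for df1's columns.
users   = np.array([10, 10, 20, 30, 30])
movies  = np.array([ 8, 17,  8, 17, 28])
ratings = np.array([ 2,  5,  4,  3,  1], dtype=np.float64)

# Map the raw ids to consecutive row/column positions.
user_ids, row_pos = np.unique(users, return_inverse=True)
movie_ids, col_pos = np.unique(movies, return_inverse=True)

# Build the sparse utility matrix; unrated cells stay as implicit zeros.
utility = sparse.csr_matrix(
    (ratings, (row_pos, col_pos)),
    shape=(len(user_ids), len(movie_ids)),
)
print(utility.toarray())
```

Note that `np.unique` sorts the ids, so rows and columns come out in ascending id order rather than in order of first appearance as in the cells above; `user_ids` and `movie_ids` keep the mapping back to the original labels.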


We name each column after the corresponding movie.

In [27]:
matriz_utilidad.columns = df_title.loc[matriz_utilidad.columns]['Name'].values

In [28]:
matriz_utilidad.head()

Unnamed: 0,What the #$*! Do We Know!?,7 Seconds,Lilo and Stitch,Something's Gotta Give,Aqua Teen Hunger Force: Vol. 1,Dragonheart,Silkwood,Mostly Martha,Spartan,Duplex (Widescreen),...,A Nightmare on Elm Street,Titan A.E.,Against the Ropes,The Italian Job,Love Actually,Ordinary People,Wonder Boys,Ned Kelly,Club Dread,Farewell My Concubine
824097,2.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0
2630686,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
644003,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
229220,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2580478,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We now have a utility matrix. Next we have to implement the method.

**Normalization**

To implement the collaborative filter, we normalize the ratings given by each user. Paraphrasing the article:

>*This is the idea of normalizing the user vectors so that a user with many ratings contributes less to any individual rating. This means that a like from a user who has only liked 10 items is more valuable to us than a like from someone who likes everything they come across.*

So, given the ratings a user gave, for example 5, 3, 4, 1, 2, etc., we compute the magnitude of that vector, which for the example is $\sqrt{5^2 + 3^2 + 4^2 + 1^2 + 2^2 + \dots}$. Then we divide each rating by that value.

This step is the most sensitive one in terms of RAM, so we keep using sparse matrices, in this case SciPy's. Make sure you understand the cell well.

In [29]:
from scipy import sparse
data_sparse = sparse.csr_matrix(matriz_utilidad)

magnitude = np.sqrt(data_sparse.power(2).sum(axis=1))
matriz_utilidad = matriz_utilidad.divide(magnitude, axis='index')
matriz_utilidad

Unnamed: 0,What the #$*! Do We Know!?,7 Seconds,Lilo and Stitch,Something's Gotta Give,Aqua Teen Hunger Force: Vol. 1,Dragonheart,Silkwood,Mostly Martha,Spartan,Duplex (Widescreen),...,A Nightmare on Elm Street,Titan A.E.,Against the Ropes,The Italian Job,Love Actually,Ordinary People,Wonder Boys,Ned Kelly,Club Dread,Farewell My Concubine
824097,0.069048,0.0,0.0,0.172619,0.0,0.0,0.138095,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.138095,0.0,0.0,0.034524,0.0,0.0
2630686,0.201784,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
644003,0.133235,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.222058,0.000000,0.0,0.0,0.000000,0.0,0.0
229220,0.239904,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.239904,0.000000,0.0,0.0,0.000000,0.0,0.0
2580478,0.529813,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
708048,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,1.000000,0.0,0.0,0.000000,0.0,0.0
175653,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,1.0,0.0,0.000000,0.0,0.0
1602517,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,1.0,0.0
2478092,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,1.0
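
The normalization can be checked by hand on a single user. The sketch below uses the example ratings from the text (5, 3, 4, 1, 2) and assumes, for illustration only, that the user rated nothing else.

```python
import numpy as np

# Ratings of a single hypothetical user; zeros mark unrated movies.
user_ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.0, 0.0, 0.0])

# Magnitude of the vector: sqrt(25 + 9 + 16 + 1 + 4) = sqrt(55)
magnitude = np.sqrt(np.sum(user_ratings ** 2))
normalized = user_ratings / magnitude

print(round(magnitude, 3))
print(np.round(normalized, 3))
```

After this step every user vector has length 1, which is why users with a single rating show a value of exactly 1.0 in the table above.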


We now have the normalized utility matrix. Next we build the similarity matrix: using cosine similarity, we compute the similarity between every pair of items, which in this case are movies.

Remember that two items are similar if they were rated in the same way by a group of users with matching tastes.

**Read the documentation for `cosine_similarity` in Scikit-Learn and make sure you understand the cell below.**

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    
    data_sparse = sparse.csr_matrix(data_items)  # This only converts the DataFrame into a SciPy sparse matrix.
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index= data_items.columns, columns= data_items.columns)
    return sim

# Build the similarity matrix
data_matrix = calculate_similarity(matriz_utilidad)
data_matrix

Unnamed: 0,What the #$*! Do We Know!?,7 Seconds,Lilo and Stitch,Something's Gotta Give,Aqua Teen Hunger Force: Vol. 1,Dragonheart,Silkwood,Mostly Martha,Spartan,Duplex (Widescreen),...,A Nightmare on Elm Street,Titan A.E.,Against the Ropes,The Italian Job,Love Actually,Ordinary People,Wonder Boys,Ned Kelly,Club Dread,Farewell My Concubine
What the #$*! Do We Know!?,1.000000,0.011905,0.015454,0.054690,0.016012,0.012753,0.022939,0.036692,0.027460,0.017864,...,0.017267,0.006643,0.015534,0.055275,0.063326,0.030412,0.033336,0.024705,0.012083,0.022010
7 Seconds,0.011905,1.000000,0.013997,0.040711,0.003074,0.011263,0.002889,0.002340,0.056138,0.027385,...,0.011350,0.007316,0.042302,0.058439,0.022166,0.001899,0.001996,0.028486,0.017042,0.001223
Lilo and Stitch,0.015454,0.013997,1.000000,0.097116,0.015679,0.046203,0.024125,0.020717,0.020254,0.048847,...,0.039026,0.066965,0.027682,0.100865,0.075114,0.025148,0.031075,0.016239,0.015367,0.014545
Something's Gotta Give,0.054690,0.040711,0.097116,1.000000,0.013321,0.035943,0.065419,0.072694,0.108205,0.166989,...,0.048037,0.025697,0.116965,0.291946,0.336159,0.072180,0.078586,0.071012,0.045897,0.032856
Aqua Teen Hunger Force: Vol. 1,0.016012,0.003074,0.015679,0.013321,1.000000,0.009890,0.001705,0.002561,0.006915,0.010271,...,0.033177,0.010851,0.001245,0.034127,0.019776,0.006121,0.016666,0.003570,0.025059,0.003930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ordinary People,0.030412,0.001899,0.025148,0.072180,0.006121,0.007013,0.155419,0.040992,0.011486,0.011628,...,0.032110,0.007670,0.009228,0.050351,0.064923,1.000000,0.099439,0.010808,0.002257,0.043569
Wonder Boys,0.033336,0.001996,0.031075,0.078586,0.016666,0.006976,0.062852,0.049669,0.027100,0.024506,...,0.034292,0.019061,0.015175,0.073341,0.094150,0.099439,1.000000,0.015368,0.011939,0.038016
Ned Kelly,0.024705,0.028486,0.016239,0.071012,0.003570,0.010441,0.008071,0.023551,0.100836,0.054554,...,0.010902,0.007812,0.085792,0.063806,0.071381,0.010808,0.015368,1.000000,0.024072,0.008339
Club Dread,0.012083,0.017042,0.015367,0.045897,0.025059,0.011150,0.002182,0.003458,0.073032,0.060801,...,0.023658,0.008795,0.046724,0.053121,0.043658,0.002257,0.011939,0.024072,1.000000,0.001931


In [31]:
data_matrix.info()

<class 'pandas.core.frame.DataFrame'>
Index: 502 entries, What the #$*! Do We Know!? to Farewell My Concubine
Columns: 502 entries, What the #$*! Do We Know!? to Farewell My Concubine
dtypes: float64(502)
memory usage: 1.9+ MB
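
To see what the item-item cosine similarity computes, here is a small NumPy-only sketch on a made-up utility matrix. The matrix `R` and its values are assumptions for illustration; for non-zero columns the result should agree with what `cosine_similarity(R.T)` returns.

```python
import numpy as np

# Toy utility matrix: 4 users (rows) x 3 movies (columns), values are made up.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 0.0, 5.0],
])

# Each movie is represented by its column of ratings.
item_vectors = R.T
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)

# Cosine similarity: dot product of two vectors divided by the product of their lengths.
similarity = (item_vectors @ item_vectors.T) / (norms * norms.T)
print(np.round(similarity, 3))
```

Movies 0 and 1 are rated alike by the same users and end up with a high similarity, while movies 1 and 2 share no raters and get a similarity of 0.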


We can pick an item from the matrix and look at its most similar items with `nlargest`.

In [41]:
print(data_matrix.loc["Being John Malkovich"].nlargest(25))

Being John Malkovich                                      1.000000
American Beauty                                           0.446688
Reservoir Dogs                                            0.327424
Eternal Sunshine of the Spotless Mind                     0.311059
Bowling for Columbine                                     0.308085
High Fidelity                                             0.301722
This Is Spinal Tap                                        0.299533
The Silence of the Lambs                                  0.271295
Clerks                                                    0.265710
Garden State                                              0.252333
Sideways                                                  0.236919
The Godfather                                             0.236214
Midnight Cowboy                                           0.224254
Lord of the Rings: The Fellowship of the Ring             0.221807
Dogma                                                     0.22

Try several items. Pick well-known and lesser-known movies: some by famous directors, some *auteur* films, some franchises. How well do you think it works?

**Food for thought:** What happens when we look at the neighbours of very popular movies? Why might that be? Can you think of a way to fix it?

Now we build the "neighbourhood" of each item: a DataFrame with its `n_neighbours` most similar items.


In [42]:
n_neighbours = 20
data_neighbours = pd.DataFrame(index=data_matrix.columns, columns=range(n_neighbours))
for column in data_matrix.columns:
    data_neighbours.loc[column,:] = data_matrix.loc[:,column].sort_values(ascending=False)[:n_neighbours].index
data_neighbours

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
What the #$*! Do We Know!?,What the #$*! Do We Know!?,Sideways,Eternal Sunshine of the Spotless Mind,The Motorcycle Diaries,Finding Neverland,Super Size Me,Garden State,Ray,The Life Aquatic with Steve Zissou,Napoleon Dynamite,Bowling for Columbine,Being John Malkovich,Whale Rider,American Beauty,Frida,Kill Bill: Vol. 2,Outfoxed: Rupert Murdoch's War on Journalism,Rabbit-Proof Fence,Control Room,House of Sand and Fog
7 Seconds,7 Seconds,Hostage,King's Ransom,Coach Carter,Blade: Trinity,The Pacifier,Beauty Shop,Taxi,The Final Cut,Taking Lives,In Good Company,Batman Begins,S.W.A.T.,Man on Fire,The Amityville Horror,The Sandlot 2,The Bourne Supremacy,Unleashed,Saw,Ray
Lilo and Stitch,Lilo and Stitch,The Emperor's New Groove,Finding Nemo (Widescreen),Brother Bear (Theatrical Widescreen Version),Stuart Little 2,Aladdin: Platinum Edition,The Lion King: Special Edition,Shrek 2,Hercules,Sleeping Beauty: Special Edition,Spy Kids 2: The Island of Lost Dreams,Charlotte's Web,Pirates of the Caribbean: The Curse of the Bla...,Spy Kids,Lord of the Rings: The Fellowship of the Ring,Freaky Friday,The Wizard of Oz: Collector's Edition,Pocahontas,Jimmy Neutron: Boy Genius,Dinosaur
Something's Gotta Give,Something's Gotta Give,Love Actually,50 First Dates,Along Came Polly,The Last Samurai,Pirates of the Caribbean: The Curse of the Bla...,Secondhand Lions,The Italian Job,Man on Fire,Bruce Almighty,The Bourne Supremacy,Shrek 2,Road to Perdition,House of Sand and Fog,About Schmidt,Bend It Like Beckham,Ray,Mean Girls,Finding Nemo (Widescreen),Freaky Friday
Aqua Teen Hunger Force: Vol. 1,Aqua Teen Hunger Force: Vol. 1,Futurama: Vol. 1,Chappelle's Show: Season 1,Family Guy: Freakin' Sweet Collection,The Simpsons: Season 3,The Simpsons: Season 1,Cowboy Bebop Remix,The Simpsons: Treehouse of Horror,Half Baked,Clerks,Fear and Loathing in Las Vegas,Reservoir Dogs,Ghost in the Shell,Army of Darkness,Dogma,BASEketball,Napoleon Dynamite,Kill Bill: Vol. 2,The Life Aquatic with Steve Zissou,Being John Malkovich
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ordinary People,Ordinary People,On Golden Pond,Moonstruck,Silkwood,Network,Midnight Cowboy,The Sting,American Beauty,A Beautiful Mind,Children of a Lesser God,The Deer Hunter,A Streetcar Named Desire,The Full Monty,The Way We Were,The Silence of the Lambs,Fried Green Tomatoes,Nine to Five,When Harry Met Sally,An Officer and a Gentleman,The Prince of Tides
Wonder Boys,Wonder Boys,High Fidelity,American Beauty,Being John Malkovich,The Ice Storm,About a Boy,About Schmidt,Igby Goes Down,This Is Spinal Tap,Clerks,Bowling for Columbine,Bottle Rocket,A Beautiful Mind,Reservoir Dogs,The Full Monty,Eternal Sunshine of the Spotless Mind,The Silence of the Lambs,Heathers,Moonstruck,Bend It Like Beckham
Ned Kelly,Ned Kelly,The Alamo,Taking Lives,Spartan,The Missing,Man on Fire,House of Sand and Fog,The Ladykillers,Against the Ropes,The Door in the Floor,The Last Samurai,Beyond Borders,The Bourne Supremacy,Secondhand Lions,The Magdalene Sisters,Kill Bill: Vol. 2,Along Came Polly,Love Actually,Something's Gotta Give,The Passion of the Christ
Club Dread,Club Dread,The Rundown,Harold and Kumar Go to White Castle,Dawn of the Dead,Kill Bill: Vol. 2,Along Came Polly,Spartan,American Wedding,National Lampoon's Van Wilder,The Last Samurai,Taking Lives,Mean Girls,50 First Dates,The Ladykillers,Freddy vs. Jason,Elf,Man on Fire,Duplex (Widescreen),Identity,BASEketball
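
The neighbourhood table can also be built without an explicit loop by sorting every row of the similarity matrix at once. This is a sketch on a toy similarity matrix; the titles "A" to "D" and the similarity values are made up.

```python
import numpy as np
import pandas as pd

titles = ["A", "B", "C", "D"]   # hypothetical movie titles
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.7],
     [0.3, 0.4, 0.7, 1.0]],
    index=titles, columns=titles,
)

n_neighbours = 2
# argsort on the negated values orders each row from most to least similar.
order = np.argsort(-sim.values, axis=1)[:, :n_neighbours]
neighbours = pd.DataFrame(sim.columns.values[order], index=sim.index)
print(neighbours)
```

As in the table above, the first neighbour of each movie is the movie itself, because its similarity with itself is 1.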


Now that we have a similarity matrix, it is time to make recommendations.

**Recommendations**

Let's pick a user at random and look at the movies they watched and the ratings they gave.

In [43]:
# np.random.seed(15)
user = np.random.choice(list(matriz_utilidad.index))
# user = 1329595  # Uncomment this user and look at the results.
# user = 1208602  # Uncomment this user and look at the results.
print(user)
df_user = df1[df1.User == user].copy()  # copy, so we do not assign to a slice of df1
df_user['Title'] = df_title.loc[df_user.Movie_id, 'Name'].values
df_user

1060431


Unnamed: 0,User,Rating,Movie_id,Title
865692,1060431,3,197,Taking Lives
2058404,1060431,5,378,Party Monster
2438900,1060431,5,457,Kill Bill: Vol. 2
3793114,1060431,4,722,The Wire: Season 1
6283914,1060431,4,1220,Man on Fire
7015132,1060431,5,1406,Hook
7161670,1060431,1,1428,The Recruit
9110399,1060431,5,1799,Die Hard 2: Die Harder
9505483,1060431,5,1865,Eternal Sunshine of the Spotless Mind
9803129,1060431,2,1905,Pirates of the Caribbean: The Curse of the Bla...


We can use every movie the user watched, or keep only those rated above a certain threshold. **Do you think the recommendations improve?**

In [66]:
known_user_likes = df_user['Title'].values
#known_user_likes = df_user[df_user.Rating > 3]['Title'].values # you can solve the exercise here.
known_user_likes

array(['Taking Lives', 'Party Monster', 'Kill Bill: Vol. 2',
       'The Wire: Season 1', 'Man on Fire', 'Hook', 'The Recruit',
       'Die Hard 2: Die Harder', 'Eternal Sunshine of the Spotless Mind',
       'Pirates of the Caribbean: The Curse of the Black Pearl',
       'Liar Liar', 'Identity', 'BASEketball', 'Super Size Me',
       'The Bourne Supremacy',
       'Lord of the Rings: The Fellowship of the Ring', 'Exit Wounds',
       'Napoleon Dynamite', 'Evil Dead 2: Dead by Dawn', 'The Godfather',
       'The Village', 'The Last Samurai', 'Hackers', 'About Schmidt',
       'Flatliners', 'Bruce Almighty', 'Planet of the Apes',
       'The Passion of the Christ', 'Bowling for Columbine',
       'Road to Perdition', 'Army of Darkness', 'The Italian Job'],
      dtype=object)

Once we know what the user liked, we take from the neighbourhood the movies similar to each one.

In [47]:
most_similar_to_likes = data_neighbours.loc[known_user_likes]
most_similar_to_likes

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Taking Lives,Taking Lives,Man on Fire,50 First Dates,The Bourne Supremacy,Along Came Polly,Identity,The Last Samurai,The Italian Job,The Village,Kill Bill: Vol. 2,The Recruit,Shrek 2,Mean Girls,The Passion of the Christ,Bruce Almighty,Pirates of the Caribbean: The Curse of the Bla...,Hostage,S.W.A.T.,Something's Gotta Give,Saw
Party Monster,Party Monster,Thirteen,Spun,Elephant,Igby Goes Down,The Dangerous Lives of Altar Boys,Super Size Me,Mean Girls,Garden State,Bowling for Columbine,Queer as Folk: Season 1,Eternal Sunshine of the Spotless Mind,American Beauty,Napoleon Dynamite,24 Hour Party People,Kill Bill: Vol. 2,Happiness,Bad Education,The Opposite of Sex,28 Days Later
Kill Bill: Vol. 2,Kill Bill: Vol. 2,Pirates of the Caribbean: The Curse of the Bla...,The Last Samurai,The Bourne Supremacy,Reservoir Dogs,Lord of the Rings: The Fellowship of the Ring,Eternal Sunshine of the Spotless Mind,Man on Fire,The Italian Job,Shrek 2,Napoleon Dynamite,American Beauty,X2: X-Men United,50 First Dates,The Silence of the Lambs,Garden State,Mean Girls,Super Size Me,Sideways,Bruce Almighty
The Wire: Season 1,The Wire: Season 1,Curb Your Enthusiasm: Season 3,Six Feet Under: Season 4,Chappelle's Show: Season 1,Sideways,Kill Bill: Vol. 2,Reservoir Dogs,Super Size Me,Eternal Sunshine of the Spotless Mind,The Godfather,Control Room,American Beauty,The Bourne Supremacy,Nip/Tuck: Season 2,Man on Fire,Garden State,Being John Malkovich,Bowling for Columbine,Ray,Napoleon Dynamite
Man on Fire,Man on Fire,The Bourne Supremacy,The Last Samurai,Taking Lives,The Italian Job,50 First Dates,Pirates of the Caribbean: The Curse of the Bla...,Shrek 2,Ray,The Passion of the Christ,Kill Bill: Vol. 2,Something's Gotta Give,Along Came Polly,S.W.A.T.,Mean Girls,Bruce Almighty,Secondhand Lions,The Recruit,The Village,Elf
Hook,Hook,Men in Black II,Speed,Bruce Almighty,Pirates of the Caribbean: The Curse of the Bla...,Liar Liar,Sleepless in Seattle,The Mummy,Patch Adams,Forever Young,Rush Hour 2,What Women Want,Lord of the Rings: The Fellowship of the Ring,Lethal Weapon,The Matrix: Reloaded,The Family Man,Finding Nemo (Widescreen),National Lampoon's Vacation,Ghost,Hollow Man
The Recruit,The Recruit,S.W.A.T.,The Italian Job,Don't Say a Word,The Sum of All Fears,Collateral Damage,What Women Want,Ghost,Men in Black II,Man on Fire,Pirates of the Caribbean: The Curse of the Bla...,The Last Samurai,Bruce Almighty,The Bourne Supremacy,Lethal Weapon,Identity,Bad Boys II,Road to Perdition,The Sixth Sense,The Wedding Planner
Die Hard 2: Die Harder,Die Hard 2: Die Harder,Lethal Weapon,Speed,Beverly Hills Cop,Lethal Weapon 3,The Mummy,Rush Hour 2,Bad Boys,Beverly Hills Cop III,Braveheart,National Lampoon's Vacation,The Sum of All Fears,Liar Liar,Lord of the Rings: The Fellowship of the Ring,Pirates of the Caribbean: The Curse of the Bla...,The Italian Job,Ghost,The Silence of the Lambs,The Matrix: Reloaded,The Sixth Sense
Eternal Sunshine of the Spotless Mind,Eternal Sunshine of the Spotless Mind,Garden State,Sideways,Being John Malkovich,Napoleon Dynamite,Super Size Me,American Beauty,Kill Bill: Vol. 2,Finding Neverland,The Life Aquatic with Steve Zissou,Bowling for Columbine,Pirates of the Caribbean: The Curse of the Bla...,Love Actually,The Bourne Supremacy,Ray,Reservoir Dogs,Mean Girls,50 First Dates,Lord of the Rings: The Fellowship of the Ring,Bruce Almighty
Pirates of the Caribbean: The Curse of the Black Pearl,Pirates of the Caribbean: The Curse of the Bla...,Lord of the Rings: The Fellowship of the Ring,Finding Nemo (Widescreen),Shrek 2,The Italian Job,The Last Samurai,Bruce Almighty,50 First Dates,X2: X-Men United,The Bourne Supremacy,Braveheart,Man on Fire,Something's Gotta Give,The Matrix: Reloaded,Kill Bill: Vol. 2,The Sixth Sense,Secondhand Lions,Road to Perdition,Bend It Like Beckham,The Silence of the Lambs


So far we have a DataFrame with all the movies similar to the ones the user watched. Note that there may be repeated movies, since a given movie can be similar to more than one that the user watched. So we build a list without duplicates.

In [48]:
similar_list = most_similar_to_likes.values.tolist()
similar_list = list(set([item for sublist in similar_list for item in sublist]))
similar_list

['Dogma',
 'Fried Green Tomatoes',
 'Man on Fire',
 'Training Day',
 'Hook',
 'Happiness',
 'Bad Boys II',
 'Midnight Cowboy',
 'About a Boy',
 'Spun',
 'Under Siege',
 'Nine to Five',
 'Taking Lives',
 "Something's Gotta Give",
 'The Village',
 'Thirteen',
 'Super Size Me',
 'Jaws',
 'Lord of the Rings: The Fellowship of the Ring',
 'Shrek 2',
 'Beaches',
 'Control Room',
 'Elf',
 'Patch Adams',
 'Identity',
 'Ghost Ship',
 'Clerks',
 'Army of Darkness',
 'Fear and Loathing in Las Vegas',
 'While You Were Sleeping',
 'Saw',
 'Major League',
 'Love Actually',
 'Frida',
 '28 Days Later',
 'Ace Ventura: When Nature Calls',
 'The Silence of the Lambs',
 "Bram Stoker's Dracula",
 'Evil Dead 2: Dead by Dawn',
 'American Beauty',
 'Sleepless in Seattle',
 'Elephant',
 'Rush Hour 2',
 'Roger & Me',
 "Wayne's World",
 'Signs',
 'The Opposite of Sex',
 'Men in Black II',
 'The Pianist',
 'Bad Education',
 'The Sum of All Fears',
 'Queer as Folk: Season 1',
 'The Hand that Rocks the Cradle',
 "C

Using this list, we take from `data_matrix` the similar items along with their similarity values and store them in `neighbourhood`.

In [49]:
neighbourhood = data_matrix[similar_list].loc[similar_list]
neighbourhood

Unnamed: 0,Dogma,Fried Green Tomatoes,Man on Fire,Training Day,Hook,Happiness,Bad Boys II,Midnight Cowboy,About a Boy,Spun,...,Lethal Weapon,The Sixth Sense,Flatliners,The Life Aquatic with Steve Zissou,Poltergeist,Party Monster,50 First Dates,Dawn of the Dead,Finding Neverland,Finding Nemo (Widescreen)
Dogma,1.000000,0.103721,0.089208,0.132802,0.113240,0.082860,0.080175,0.074947,0.119644,0.063941,...,0.100892,0.171207,0.100497,0.127774,0.100796,0.039483,0.118072,0.054696,0.099077,0.149713
Fried Green Tomatoes,0.103721,1.000000,0.127215,0.079897,0.153634,0.036828,0.077366,0.138469,0.114228,0.020952,...,0.148353,0.221206,0.162489,0.079108,0.101193,0.025176,0.169192,0.025830,0.156463,0.180247
Man on Fire,0.089208,0.127215,1.000000,0.169628,0.125584,0.026196,0.234632,0.070023,0.131510,0.047261,...,0.179684,0.228993,0.111907,0.120357,0.079671,0.030414,0.325218,0.158194,0.221139,0.233073
Training Day,0.132802,0.079897,0.169628,1.000000,0.079804,0.034582,0.149293,0.068575,0.153084,0.045727,...,0.126355,0.183011,0.086207,0.066686,0.079522,0.019232,0.132477,0.060619,0.083496,0.162679
Hook,0.113240,0.153634,0.125584,0.079804,1.000000,0.019578,0.141789,0.066434,0.079504,0.019511,...,0.191910,0.177289,0.152771,0.072882,0.090879,0.019476,0.158155,0.041574,0.121085,0.184074
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Party Monster,0.039483,0.025176,0.030414,0.019232,0.019476,0.051021,0.022219,0.027348,0.033079,0.088143,...,0.015184,0.032745,0.018673,0.037131,0.021742,1.000000,0.034633,0.034737,0.034338,0.037861
50 First Dates,0.118072,0.169192,0.325218,0.132477,0.158155,0.027338,0.221331,0.056299,0.167150,0.046515,...,0.173361,0.240697,0.110767,0.131502,0.077926,0.034633,1.000000,0.105578,0.233737,0.278173
Dawn of the Dead,0.054696,0.025830,0.158194,0.060619,0.041574,0.024834,0.091406,0.025108,0.038710,0.046688,...,0.054129,0.087714,0.050999,0.067670,0.072492,0.034737,0.105578,1.000000,0.064206,0.086522
Finding Neverland,0.099077,0.156463,0.221139,0.083496,0.121085,0.044863,0.088844,0.090999,0.151996,0.041571,...,0.115037,0.190063,0.073362,0.210344,0.062437,0.034338,0.233737,0.064206,1.000000,0.213909


This is perhaps the most abstract part to follow. We already know which items are similar and we have similarity measures between them, but we need to condense this information into a single ranking. To do so, we take from the utility matrix the (normalized) ratings of the movies the user watched, together with the entries for the movies the user did not watch, which are zero.

In [50]:
user_vector = matriz_utilidad.loc[user].loc[similar_list]
user_vector

Dogma                        0.000000
Fried Green Tomatoes         0.000000
Man on Fire                  0.181444
Training Day                 0.000000
Hook                         0.226805
                               ...   
Party Monster                0.226805
50 First Dates               0.000000
Dawn of the Dead             0.000000
Finding Neverland            0.000000
Finding Nemo (Widescreen)    0.000000
Name: 1060431, Length: 152, dtype: Sparse[float64, nan]

This cell computes the score. Don't worry if you don't follow it; if you know some linear algebra, it is the product of the "neighbourhood" matrix with the user's rating vector, with each result divided by the sum of the similarities in that row.

In [51]:
# Calculate the score.
score = neighbourhood.dot(user_vector).div(neighbourhood.sum(axis=1))
score

Dogma                        0.036066
Fried Green Tomatoes         0.032381
Man on Fire                  0.047099
Training Day                 0.035977
Hook                         0.044820
                               ...   
Party Monster                0.071146
50 First Dates               0.039551
Dawn of the Dead             0.041885
Finding Neverland            0.040308
Finding Nemo (Widescreen)    0.038015
Length: 152, dtype: float64
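
The score computed above can be reproduced on a tiny example to see what each factor does. The three titles, the similarity values and the user's ratings below are all made up for illustration.

```python
import pandas as pd

titles = ["A", "B", "C"]   # hypothetical movie titles
neighbourhood = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=titles, columns=titles,
)
# Normalized ratings of one user: rated A, has not seen B or C.
user_vector = pd.Series([0.6, 0.0, 0.0], index=titles)

# Similarity-weighted sum of the user's ratings, for every candidate movie...
weighted_sum = neighbourhood.dot(user_vector)
# ...divided by the total similarity of that movie, giving a weighted average.
score = weighted_sum.div(neighbourhood.sum(axis=1))
print(score)
```

Movie B, which is similar to the rated movie A, gets a positive score, while movie C, which has no similarity with A, scores 0.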

We remove the movies the user has already watched from the list.

In [52]:
score = score.drop(known_user_likes)
score

Dogma                        0.036066
Fried Green Tomatoes         0.032381
Training Day                 0.035977
Happiness                    0.033608
Bad Boys II                  0.035297
                               ...   
Poltergeist                  0.034111
50 First Dates               0.039551
Dawn of the Dead             0.041885
Finding Neverland            0.040308
Finding Nemo (Widescreen)    0.038015
Length: 120, dtype: float64

And we print the movies the user liked, followed by the recommendations in order.

In [53]:
print(known_user_likes)
print(score.nlargest(20))

['Taking Lives' 'Party Monster' 'Kill Bill: Vol. 2' 'The Wire: Season 1'
 'Man on Fire' 'Hook' 'The Recruit' 'Die Hard 2: Die Harder'
 'Eternal Sunshine of the Spotless Mind'
 'Pirates of the Caribbean: The Curse of the Black Pearl' 'Liar Liar'
 'Identity' 'BASEketball' 'Super Size Me' 'The Bourne Supremacy'
 'Lord of the Rings: The Fellowship of the Ring' 'Exit Wounds'
 'Napoleon Dynamite' 'Evil Dead 2: Dead by Dawn' 'The Godfather'
 'The Village' 'The Last Samurai' 'Hackers' 'About Schmidt' 'Flatliners'
 'Bruce Almighty' 'Planet of the Apes' 'The Passion of the Christ'
 'Bowling for Columbine' 'Road to Perdition' 'Army of Darkness'
 'The Italian Job']
Ray                                    0.042226
Elf                                    0.042007
Dawn of the Dead                       0.041885
Garden State                           0.041746
House of Sand and Fog                  0.041364
Sideways                               0.041251
Shrek 2                                0.041012
Ha

Although this notebook is a demonstration, we recommend that you go through it carefully, try to understand it as well as you can, and play around with it.

**Food for thought:**

1. What do you think the hyperparameters are?
2. How would you evaluate these results? How would you do a train/test split?