# Recommender Systems - Netflix Prize Challenge

## Collaborative Filtering with Cosine Similarity

The goal is to find similar items by comparing their rating vectors with cosine similarity.

Most of the work goes into building the utility matrix.

In this notebook we implement a recommender system using collaborative filtering based on cosine similarity. This section follows this [highly recommended article](https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3).

We start by loading the data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import pandas as pd

import gc #garbage collector

In [2]:
# The dtype keys must match the column names exactly (the column is 'Movie_id')
df1 = pd.read_csv('combined_data_1_with_movie_id.csv', dtype={'Rating': np.int8, 'Movie_id': np.int16})
print(df1.shape)
df1.head()

(24053764, 3)


Unnamed: 0,User,Rating,Movie_id
0,1488844,3,1
1,822109,5,1
2,885013,4,1
3,30878,4,1
4,823519,3,1


And the titles.

In [3]:
df_title = pd.read_csv('movie_titles.csv', encoding = "ISO-8859-1",index_col = 0, header = None, usecols = [0,2], names = ['Movie_Id', 'Name'])
df_title.head()

Movie_Id,Name
1,Dinosaur Planet
2,Isle of Man TT 2004 Review
3,Character
4,Paula Abdul's Get Up & Dance
5,The Rise and Fall of ECW


We are going to discard some users in order to shrink the dataset. Done well, this should not noticeably affect the recommender system. We do it in an *ad hoc* way here, but note that there are principled criteria (which we do not cover) for doing it.

We discard users who rate very little (five ratings or fewer) or a lot (more than 100 ratings).

In [4]:
mask_usuarios_descartables = np.logical_or(df1.User.value_counts() <= 5, df1.User.value_counts() > 100)
usuarios_descartables = mask_usuarios_descartables[mask_usuarios_descartables].index.values
print(len(usuarios_descartables))

150301


In [5]:
mascara_descartables = df1.User.isin(usuarios_descartables)
print(mascara_descartables.sum())

13395316


In [6]:
print(df1.shape)
df1 = df1[~mascara_descartables]
print(df1.shape)

(24053764, 3)
(10658448, 3)
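The three cells above can be folded into one helper that counts ratings per user only once. The following is a minimal sketch on a made-up toy table; the function name `filter_users_by_activity` and its parameters are hypothetical and not part of the original notebook. With the thresholds used above, the equivalent call would keep users with 6 to 100 ratings.

```python
import pandas as pd

# Toy ratings table with the same column names as df1 (values are made up).
toy = pd.DataFrame({
    "User": [1, 1, 1, 2, 3, 3, 3, 3],
    "Rating": [5, 3, 4, 2, 1, 5, 4, 3],
    "Movie_id": [10, 11, 12, 10, 10, 11, 12, 13],
})

def filter_users_by_activity(df, min_ratings, max_ratings):
    """Keep rows whose user has between min_ratings and max_ratings ratings (inclusive)."""
    counts = df["User"].value_counts()  # ratings per user, computed once
    keep = counts[counts.between(min_ratings, max_ratings)].index
    return df[df["User"].isin(keep)]

# User 1 has 3 ratings, user 2 has 1 and user 3 has 4.
filtered = filter_users_by_activity(toy, min_ratings=2, max_ratings=3)
print(filtered["User"].unique())
```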


We also discard movies that have few ratings (3,000 or fewer after the user filter). Again, this is not mandatory; we do it to shrink the utility matrix even further.

In [7]:
mask_items_descartables = df1.Movie_id.value_counts() <= 3000
# mask_items_descartables
items_descartables = mask_items_descartables[mask_items_descartables].index.values
# items_descartables
print(len(items_descartables))

3997


In [8]:
mascara_descartables = df1.Movie_id.isin(items_descartables)
print(mascara_descartables.sum())

1606917


In [9]:
print(df1.shape)
df1 = df1[~mascara_descartables]
print(df1.shape)

(10658448, 3)
(9051531, 3)


Note that `df1` is **not** a utility matrix; it is simply a list of entries, one per rating. This is convenient because it takes **much less** memory. To continue, however, we need to turn it into a utility matrix.

We have to be careful, though. With on the order of ~1,000 items and ~500 thousand users, the matrix has 500,000,000 entries. Most of these entries are 0, since we only have approximately 10,000,000 ratings (these values change depending on how we filter the dataset). That is roughly one in every 50 possible entries. If we stored this in an ordinary dense dataset, we would almost certainly fill up our RAM. Fortunately, there are sparse data types that store this efficiently.

You can read more [here](https://pandas.pydata.org/docs/user_guide/sparse.html).
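To make the memory argument concrete, here is a back-of-the-envelope estimate using the rough figures quoted above. It assumes `float64` values and a CSR-style layout with 32-bit indices; these storage assumptions are illustrative and not taken from the notebook, and real usage depends on the library and dtypes.

```python
# Rough figures quoted in the text (not the exact sizes after filtering).
n_users, n_items, n_ratings = 500_000, 1_000, 10_000_000

# Fraction of cells that actually hold a rating.
density = n_ratings / (n_users * n_items)

# Dense storage: one float64 (8 bytes) per cell, rated or not.
dense_bytes = n_users * n_items * 8

# CSR-style storage (assumed layout): per stored rating one float64 value
# and one int32 column index, plus one int32 row pointer per row (+1).
csr_bytes = n_ratings * (8 + 4) + (n_users + 1) * 4

print(f"density: {density:.2%}")
print(f"dense: {dense_bytes / 1e9:.1f} GB, sparse: {csr_bytes / 1e9:.2f} GB")
```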

In [10]:
usuarios = df1.User.unique()
items = df1.Movie_id.unique()

n_usuarios = len(usuarios)
n_items = len(items)
print(n_usuarios, n_items)

320181 502


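We create a sparse matrix of zeros with `n_usuarios` rows and `n_items` columns. Note that `np.zeros` allocates the full dense array first, and only then is it converted to a sparse dtype.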

In [11]:
matriz_utilidad = pd.DataFrame(np.zeros((n_usuarios, n_items)), index =  usuarios, columns = items).astype(pd.SparseDtype("float"))
matriz_utilidad.head()

Unnamed: 0,8,17,28,30,33,58,83,97,108,111,...,4402,4418,4420,4432,4472,4479,4488,4490,4492,4496
824097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2630686,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
644003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
229220,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2580478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Rows are user IDs and columns are movie IDs.

So far the matrix holds only zeros, not the ratings, so we need to fill it. Because it is a sparse matrix, we have to do this in a particular way. You can read more [here](https://stackoverflow.com/questions/49032856/assign-values-to-sparsearray-in-pandas).

Let's fill the matrix with the ratings:

In [12]:
# Iterate over each column (one column per movie)
for column in matriz_utilidad:
    # Save the original sparse dtype for reuse later
    spdtypes = matriz_utilidad.dtypes[column]

    # Work on a dense copy of this column (avoids chained assignment on the DataFrame)
    columna_densa = matriz_utilidad[column].sparse.to_dense()

    # Select the ratings that belong to this movie
    mask_calificaciones_columna = df1.Movie_id == column
    calificaciones_columna = df1[mask_calificaciones_columna]

    # Write each rating into the row of the user who gave it
    columna_densa.loc[calificaciones_columna.User.values] = calificaciones_columna.Rating.values

    # Store the column back in the original sparse format
    matriz_utilidad[column] = columna_densa.astype(spdtypes)

In [13]:
matriz_utilidad.head()

Unnamed: 0,8,17,28,30,33,58,83,97,108,111,...,4402,4418,4420,4432,4472,4479,4488,4490,4492,4496
824097,2.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0
2630686,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
644003,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
229220,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2580478,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
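As an alternative to filling the matrix column by column, a SciPy sparse matrix can be built directly from the (user, movie, rating) triplets. The sketch below uses a made-up toy table and is not the approach taken in this notebook; `pd.factorize` maps each ID to a consecutive row or column position.

```python
import pandas as pd
from scipy import sparse

# Toy ratings table with the same column names as df1 (values are made up).
toy = pd.DataFrame({
    "User": [7, 7, 9, 4],
    "Rating": [5, 3, 4, 2],
    "Movie_id": [100, 200, 100, 300],
})

# factorize returns integer positions plus the distinct ids in order of appearance.
row_pos, user_ids = pd.factorize(toy["User"])
col_pos, movie_ids = pd.factorize(toy["Movie_id"])

# Build the utility matrix from (value, (row, column)) triplets.
utility = sparse.csr_matrix(
    (toy["Rating"].to_numpy(dtype=float), (row_pos, col_pos)),
    shape=(len(user_ids), len(movie_ids)),
)

print(list(user_ids), list(movie_ids))
print(utility.toarray())
```

`user_ids` and `movie_ids` keep the mapping from matrix positions back to the original IDs, which is needed later to label rows and columns.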


We label each column with the title of the corresponding movie.

In [14]:
matriz_utilidad.columns = df_title.loc[matriz_utilidad.columns]['Name'].values

In [15]:
matriz_utilidad.head()

Unnamed: 0,What the #$*! Do We Know!?,7 Seconds,Lilo and Stitch,Something's Gotta Give,Aqua Teen Hunger Force: Vol. 1,Dragonheart,Silkwood,Mostly Martha,Spartan,Duplex (Widescreen),...,A Nightmare on Elm Street,Titan A.E.,Against the Ropes,The Italian Job,Love Actually,Ordinary People,Wonder Boys,Ned Kelly,Club Dread,Farewell My Concubine
824097,2.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0
2630686,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
644003,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
229220,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2580478,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We now have a utility matrix. Next we implement the method.

**Normalization**

To implement the collaborative filter, we normalize the ratings given by each user. In the words of the article,

>*This is the idea of normalizing the user vectors so that a user with many ratings contributes less to any individual rating. This means that a like from a user who has only liked 10 items is more valuable to us than a like from someone who likes everything they come across.*

So, given the ratings a user gave, for example 5, 3, 4, 1, 2, etc., we compute the magnitude of that vector, which for the example is $\sqrt{5^2 + 3^2 + 4^2 + 1^2 + 2^2 + ...}$. Then we divide each rating by that value (the result of the square root).

A user who gave more ratings is divided by a larger number, so each of their ratings carries less weight than those of a user who gave few ratings.
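As a small worked example of this normalization, the sketch below compares two made-up users who both gave a 5 to the same movie. The helper `l2_normalize` is hypothetical and only illustrates the arithmetic.

```python
import numpy as np

ratings_few = np.array([5.0, 3.0])                   # user with two ratings
ratings_many = np.array([5.0, 3.0, 4.0, 1.0, 2.0])   # user with five ratings

def l2_normalize(v):
    """Divide a rating vector by its Euclidean magnitude."""
    return v / np.sqrt(np.sum(v ** 2))

few_normalized = l2_normalize(ratings_few)
many_normalized = l2_normalize(ratings_many)

# The same raw rating of 5 weighs less for the user who rated more movies.
print(few_normalized[0])    # 5 / sqrt(34)
print(many_normalized[0])   # 5 / sqrt(55)
```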

This step is the most sensitive one in terms of RAM, so we keep using sparse matrices, in this case from SciPy. Make sure you understand the cell well.

__Note:__ This is where the normalized utility matrix is built.
* It is sparse because most of its values are 0, which is consistent with the fact that each user rates only a small fraction of the movies.

In [16]:
from scipy import sparse
data_sparse = sparse.csr_matrix(matriz_utilidad)

magnitude = np.sqrt(data_sparse.power(2).sum(axis=1))
matriz_utilidad = matriz_utilidad.divide(magnitude, axis='index')
matriz_utilidad

Unnamed: 0,What the #$*! Do We Know!?,7 Seconds,Lilo and Stitch,Something's Gotta Give,Aqua Teen Hunger Force: Vol. 1,Dragonheart,Silkwood,Mostly Martha,Spartan,Duplex (Widescreen),...,A Nightmare on Elm Street,Titan A.E.,Against the Ropes,The Italian Job,Love Actually,Ordinary People,Wonder Boys,Ned Kelly,Club Dread,Farewell My Concubine
824097,0.069048,0.0,0.0,0.172619,0.0,0.0,0.138095,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.138095,0.0,0.0,0.034524,0.0,0.0
2630686,0.201784,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
644003,0.133235,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.222058,0.000000,0.0,0.0,0.000000,0.0,0.0
229220,0.239904,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.239904,0.000000,0.0,0.0,0.000000,0.0,0.0
2580478,0.529813,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
708048,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,1.000000,0.0,0.0,0.000000,0.0,0.0
175653,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,1.0,0.0,0.000000,0.0,0.0
1602517,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,1.0,0.0
2478092,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,1.0


We now have the normalized utility matrix. Next we build the similarity matrix: using cosine similarity, we compute the similarity between every pair of items, in this case movies.

Remember that two items are similar if the same users tended to rate them in a similar way.
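To see what item-item cosine similarity means on a small scale, the sketch below computes it by hand with NumPy on a made-up 3-user by 3-item matrix: each item column is scaled to unit length, and the dot products between columns are the similarities.

```python
import numpy as np

# Rows are users, columns are items A, B and C (made-up ratings).
R = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Scale each item column to unit length.
column_norms = np.linalg.norm(R, axis=0)
R_unit = R / column_norms

# Entry (i, j) is the cosine of the angle between item columns i and j.
item_sim = R_unit.T @ R_unit
print(np.round(item_sim, 4))
```

Items A and B were rated by the same users in a similar way, so their similarity is close to 1; item C shares no users with them, so its similarity to both is 0.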

**Read the documentation for `cosine_similarity` in Scikit-Learn and make sure you understand the cell below**

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    
    data_sparse = sparse.csr_matrix(data_items) # This just converts the dataframe into a SciPy sparse matrix.
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index= data_items.columns, columns= data_items.columns)
    return sim

# Build the similarity matrix
data_matrix = calculate_similarity(matriz_utilidad)
data_matrix

Unnamed: 0,What the #$*! Do We Know!?,7 Seconds,Lilo and Stitch,Something's Gotta Give,Aqua Teen Hunger Force: Vol. 1,Dragonheart,Silkwood,Mostly Martha,Spartan,Duplex (Widescreen),...,A Nightmare on Elm Street,Titan A.E.,Against the Ropes,The Italian Job,Love Actually,Ordinary People,Wonder Boys,Ned Kelly,Club Dread,Farewell My Concubine
What the #$*! Do We Know!?,1.000000,0.011905,0.015454,0.054690,0.016012,0.012753,0.022939,0.036692,0.027460,0.017864,...,0.017267,0.006643,0.015534,0.055275,0.063326,0.030412,0.033336,0.024705,0.012083,0.022010
7 Seconds,0.011905,1.000000,0.013997,0.040711,0.003074,0.011263,0.002889,0.002340,0.056138,0.027385,...,0.011350,0.007316,0.042302,0.058439,0.022166,0.001899,0.001996,0.028486,0.017042,0.001223
Lilo and Stitch,0.015454,0.013997,1.000000,0.097116,0.015679,0.046203,0.024125,0.020717,0.020254,0.048847,...,0.039026,0.066965,0.027682,0.100865,0.075114,0.025148,0.031075,0.016239,0.015367,0.014545
Something's Gotta Give,0.054690,0.040711,0.097116,1.000000,0.013321,0.035943,0.065419,0.072694,0.108205,0.166989,...,0.048037,0.025697,0.116965,0.291946,0.336159,0.072180,0.078586,0.071012,0.045897,0.032856
Aqua Teen Hunger Force: Vol. 1,0.016012,0.003074,0.015679,0.013321,1.000000,0.009890,0.001705,0.002561,0.006915,0.010271,...,0.033177,0.010851,0.001245,0.034127,0.019776,0.006121,0.016666,0.003570,0.025059,0.003930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ordinary People,0.030412,0.001899,0.025148,0.072180,0.006121,0.007013,0.155419,0.040992,0.011486,0.011628,...,0.032110,0.007670,0.009228,0.050351,0.064923,1.000000,0.099439,0.010808,0.002257,0.043569
Wonder Boys,0.033336,0.001996,0.031075,0.078586,0.016666,0.006976,0.062852,0.049669,0.027100,0.024506,...,0.034292,0.019061,0.015175,0.073341,0.094150,0.099439,1.000000,0.015368,0.011939,0.038016
Ned Kelly,0.024705,0.028486,0.016239,0.071012,0.003570,0.010441,0.008071,0.023551,0.100836,0.054554,...,0.010902,0.007812,0.085792,0.063806,0.071381,0.010808,0.015368,1.000000,0.024072,0.008339
Club Dread,0.012083,0.017042,0.015367,0.045897,0.025059,0.011150,0.002182,0.003458,0.073032,0.060801,...,0.023658,0.008795,0.046724,0.053121,0.043658,0.002257,0.011939,0.024072,1.000000,0.001931


In [18]:
data_matrix.info()

<class 'pandas.core.frame.DataFrame'>
Index: 502 entries, What the #$*! Do We Know!? to Farewell My Concubine
Columns: 502 entries, What the #$*! Do We Know!? to Farewell My Concubine
dtypes: float64(502)
memory usage: 1.9+ MB


We can pick an item from the matrix and look at the most similar items with `nlargest`.

In [19]:
print(data_matrix.loc["Men in Black II"].nlargest(25))

Men in Black II                                           1.000000
What Women Want                                           0.377085
Collateral Damage                                         0.350282
S.W.A.T.                                                  0.347777
Lethal Weapon 3                                           0.321729
Lethal Weapon                                             0.316907
Ghost                                                     0.309382
Pirates of the Caribbean: The Curse of the Black Pearl    0.276857
The Wedding Planner                                       0.270643
The Recruit                                               0.269769
The Sum of All Fears                                      0.260525
Hook                                                      0.258874
Bruce Almighty                                            0.257587
Braveheart                                                0.257529
The Matrix: Reloaded                                      0.25

Try several items. Pick well-known and lesser-known movies: some by famous directors, some *auteur* films and some franchises. How well do you think it works?

**Food for thought:** What happens when we look at the items similar to very popular movies? Why might that be? Can you think of a way to fix it?

Now we build the "neighbourhood" of each item: a dataframe with its `n_neighbours` most similar items. Since every item has similarity 1 with itself, the first neighbour of each item is the item itself.


In [20]:
n_neighbours = 20
data_neighbours = pd.DataFrame(index=data_matrix.columns, columns=range(n_neighbours))
for column in data_matrix.columns:
    data_neighbours.loc[column,:] = data_matrix.loc[:,column].sort_values(ascending=False)[:n_neighbours].index
data_neighbours

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
What the #$*! Do We Know!?,What the #$*! Do We Know!?,Sideways,Eternal Sunshine of the Spotless Mind,The Motorcycle Diaries,Finding Neverland,Super Size Me,Garden State,Ray,The Life Aquatic with Steve Zissou,Napoleon Dynamite,Bowling for Columbine,Being John Malkovich,Whale Rider,American Beauty,Frida,Kill Bill: Vol. 2,Outfoxed: Rupert Murdoch's War on Journalism,Rabbit-Proof Fence,Control Room,House of Sand and Fog
7 Seconds,7 Seconds,Hostage,King's Ransom,Coach Carter,Blade: Trinity,The Pacifier,Beauty Shop,Taxi,The Final Cut,Taking Lives,In Good Company,Batman Begins,S.W.A.T.,Man on Fire,The Amityville Horror,The Sandlot 2,The Bourne Supremacy,Unleashed,Saw,Ray
Lilo and Stitch,Lilo and Stitch,The Emperor's New Groove,Finding Nemo (Widescreen),Brother Bear (Theatrical Widescreen Version),Stuart Little 2,Aladdin: Platinum Edition,The Lion King: Special Edition,Shrek 2,Hercules,Sleeping Beauty: Special Edition,Spy Kids 2: The Island of Lost Dreams,Charlotte's Web,Pirates of the Caribbean: The Curse of the Bla...,Spy Kids,Lord of the Rings: The Fellowship of the Ring,Freaky Friday,The Wizard of Oz: Collector's Edition,Pocahontas,Jimmy Neutron: Boy Genius,Dinosaur
Something's Gotta Give,Something's Gotta Give,Love Actually,50 First Dates,Along Came Polly,The Last Samurai,Pirates of the Caribbean: The Curse of the Bla...,Secondhand Lions,The Italian Job,Man on Fire,Bruce Almighty,The Bourne Supremacy,Shrek 2,Road to Perdition,House of Sand and Fog,About Schmidt,Bend It Like Beckham,Ray,Mean Girls,Finding Nemo (Widescreen),Freaky Friday
Aqua Teen Hunger Force: Vol. 1,Aqua Teen Hunger Force: Vol. 1,Futurama: Vol. 1,Chappelle's Show: Season 1,Family Guy: Freakin' Sweet Collection,The Simpsons: Season 3,The Simpsons: Season 1,Cowboy Bebop Remix,The Simpsons: Treehouse of Horror,Half Baked,Clerks,Fear and Loathing in Las Vegas,Reservoir Dogs,Ghost in the Shell,Army of Darkness,Dogma,BASEketball,Napoleon Dynamite,Kill Bill: Vol. 2,The Life Aquatic with Steve Zissou,Being John Malkovich
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ordinary People,Ordinary People,On Golden Pond,Moonstruck,Silkwood,Network,Midnight Cowboy,The Sting,American Beauty,A Beautiful Mind,Children of a Lesser God,The Deer Hunter,A Streetcar Named Desire,The Full Monty,The Way We Were,The Silence of the Lambs,Fried Green Tomatoes,Nine to Five,When Harry Met Sally,An Officer and a Gentleman,The Prince of Tides
Wonder Boys,Wonder Boys,High Fidelity,American Beauty,Being John Malkovich,The Ice Storm,About a Boy,About Schmidt,Igby Goes Down,This Is Spinal Tap,Clerks,Bowling for Columbine,Bottle Rocket,A Beautiful Mind,Reservoir Dogs,The Full Monty,Eternal Sunshine of the Spotless Mind,The Silence of the Lambs,Heathers,Moonstruck,Bend It Like Beckham
Ned Kelly,Ned Kelly,The Alamo,Taking Lives,Spartan,The Missing,Man on Fire,House of Sand and Fog,The Ladykillers,Against the Ropes,The Door in the Floor,The Last Samurai,Beyond Borders,The Bourne Supremacy,Secondhand Lions,The Magdalene Sisters,Kill Bill: Vol. 2,Along Came Polly,Love Actually,Something's Gotta Give,The Passion of the Christ
Club Dread,Club Dread,The Rundown,Harold and Kumar Go to White Castle,Dawn of the Dead,Kill Bill: Vol. 2,Along Came Polly,Spartan,American Wedding,National Lampoon's Van Wilder,The Last Samurai,Taking Lives,Mean Girls,50 First Dates,The Ladykillers,Freddy vs. Jason,Elf,Man on Fire,Duplex (Widescreen),Identity,BASEketball
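The loop above sorts one column at a time. As a hedged alternative, the same top-k lookup can be done in a single step with `np.argsort`. The sketch below runs on a made-up 4-item similarity matrix and sorts rows rather than columns, which gives the same result because the similarity matrix is symmetric.

```python
import numpy as np
import pandas as pd

names = ["A", "B", "C", "D"]
# Made-up symmetric similarity matrix with ones on the diagonal.
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.4],
     [0.3, 0.5, 0.4, 1.0]],
    index=names, columns=names,
)

k = 2
# Sorting the negated values ascending gives positions in descending similarity order.
order = np.argsort(-sim.to_numpy(), axis=1)[:, :k]

# Map positions back to item names; one row of neighbours per item.
neighbours = pd.DataFrame(sim.columns.to_numpy()[order], index=sim.index, columns=range(k))
print(neighbours)
```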


Now that we have a similarity matrix, it is time to make recommendations.

**Recommendations**

Let's pick a user at random and see which movies they watched and how they rated them.

In [21]:
# np.random.seed(15)
user = np.random.choice(list(matriz_utilidad.index))
# user = 1329595  # Uncomment to use this user and see their results.
# user = 1208602  # Uncomment to use this user and see their results.
print(user)
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_user = df1[df1.User == user].copy()
df_user['Title'] = df_title.loc[df_user.Movie_id, 'Name'].values
df_user

108635


Unnamed: 0,User,Rating,Movie_id,Title
659653,108635,4,175,Reservoir Dogs
7409254,108635,4,1470,Bend It Like Beckham
12197727,108635,4,2371,To Catch a Thief
14220254,108635,3,2743,The Pianist


We can use every movie the user watched, or keep only those rated above a certain threshold. **Do you think this improves the recommendations?**

In [22]:
known_user_likes = df_user['Title'].values
# known_user_likes = df_user[COMPLETAR]['Title'].values # replace COMPLETAR with your filter to solve the exercise here.
known_user_likes

array(['Reservoir Dogs', 'Bend It Like Beckham', 'To Catch a Thief',
       'The Pianist'], dtype=object)

Once we know what the user liked, we take from the neighbourhood the items similar to each movie.

In [23]:
most_similar_to_likes = data_neighbours.loc[known_user_likes]
most_similar_to_likes

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Reservoir Dogs,Reservoir Dogs,American Beauty,Being John Malkovich,Clerks,The Silence of the Lambs,Kill Bill: Vol. 2,Lock,The Godfather,Lord of the Rings: The Fellowship of the Ring,Dogma,This Is Spinal Tap,Eternal Sunshine of the Spotless Mind,High Fidelity,Pirates of the Caribbean: The Curse of the Bla...,Braveheart,Bowling for Columbine,The Professional,National Lampoon's Vacation,The Sixth Sense,Napoleon Dynamite
Bend It Like Beckham,Bend It Like Beckham,Whale Rider,Love Actually,Pirates of the Caribbean: The Curse of the Bla...,About a Boy,Something's Gotta Give,Lord of the Rings: The Fellowship of the Ring,The Italian Job,About Schmidt,Finding Nemo (Widescreen),Bowling for Columbine,Mean Girls,Garden State,Bruce Almighty,Eternal Sunshine of the Spotless Mind,Sideways,American Beauty,Shrek 2,The Last Samurai,Super Size Me
To Catch a Thief,To Catch a Thief,Dial M for Murder,North by Northwest,Notorious,The Maltese Falcon,Bringing Up Baby,The Third Man,Singin' in the Rain,A Streetcar Named Desire,The King and I,The Caine Mutiny,The Sting,The Hustler,The Odd Couple,When Harry Met Sally,Pirates of the Caribbean: The Curse of the Bla...,Goldfinger,The Godfather,Pride and Prejudice,The Wizard of Oz: Collector's Edition
The Pianist,The Pianist,About Schmidt,Frida,A Beautiful Mind,Road to Perdition,Whale Rider,Bend It Like Beckham,House of Sand and Fog,Bowling for Columbine,American Beauty,Pirates of the Caribbean: The Curse of the Bla...,Something's Gotta Give,Rabbit-Proof Fence,Sideways,The Last Samurai,About a Boy,Lord of the Rings: The Fellowship of the Ring,Love Actually,Ray,Finding Neverland


So far we have a dataframe with all the items similar to the movies the user watched. Note that there may be repeated movies, since one movie can be similar to more than one of the watched ones. So we build a list without duplicates.

In [24]:
similar_list = most_similar_to_likes.values.tolist()
similar_list = list(set([item for sublist in similar_list for item in sublist]))
similar_list

['Braveheart',
 'Pirates of the Caribbean: The Curse of the Black Pearl',
 'This Is Spinal Tap',
 'The Silence of the Lambs',
 'North by Northwest',
 'The Pianist',
 'Dogma',
 'The Sting',
 'Notorious',
 'Finding Nemo (Widescreen)',
 'The Caine Mutiny',
 'Ray',
 'The Odd Couple',
 'House of Sand and Fog',
 'The Third Man',
 'Being John Malkovich',
 'The Italian Job',
 'To Catch a Thief',
 'Reservoir Dogs',
 'When Harry Met Sally',
 'Whale Rider',
 'American Beauty',
 'Sideways',
 "National Lampoon's Vacation",
 'Clerks',
 'The Professional',
 'Goldfinger',
 'Love Actually',
 'About a Boy',
 'Kill Bill: Vol. 2',
 'The King and I',
 'Bend It Like Beckham',
 'Bruce Almighty',
 "The Wizard of Oz: Collector's Edition",
 'Road to Perdition',
 'The Maltese Falcon',
 'Frida',
 'High Fidelity',
 'The Hustler',
 'Dial M for Murder',
 'Lord of the Rings: The Fellowship of the Ring',
 "Something's Gotta Give",
 'A Beautiful Mind',
 'Finding Neverland',
 "Singin' in the Rain",
 'About Schmidt',
 'L

Using this list, we take from `data_matrix` the similar items along with their similarity values and store them in `neighbourhood`.

In [25]:
neighbourhood = data_matrix[similar_list].loc[similar_list]
neighbourhood

Unnamed: 0,Braveheart,Pirates of the Caribbean: The Curse of the Black Pearl,This Is Spinal Tap,The Silence of the Lambs,North by Northwest,The Pianist,Dogma,The Sting,Notorious,Finding Nemo (Widescreen),...,Bringing Up Baby,Eternal Sunshine of the Spotless Mind,Bowling for Columbine,Super Size Me,The Godfather,The Sixth Sense,Rabbit-Proof Fence,Garden State,Mean Girls,Pride and Prejudice
Braveheart,1.000000,0.332159,0.127797,0.344029,0.111898,0.141989,0.155798,0.127027,0.050311,0.274819,...,0.036587,0.171265,0.143452,0.147079,0.222142,0.356219,0.079373,0.160310,0.144972,0.069055
Pirates of the Caribbean: The Curse of the Black Pearl,0.332159,1.000000,0.152310,0.287491,0.134939,0.186036,0.180110,0.121266,0.067162,0.414720,...,0.055325,0.255314,0.214280,0.227903,0.175967,0.309385,0.123464,0.237382,0.272684,0.096643
This Is Spinal Tap,0.127797,0.152310,1.000000,0.188255,0.152243,0.096020,0.165910,0.116537,0.076098,0.124245,...,0.048371,0.197394,0.203739,0.151690,0.176009,0.134086,0.085596,0.174842,0.096571,0.053080
The Silence of the Lambs,0.344029,0.287491,0.188255,1.000000,0.155905,0.163872,0.198049,0.171210,0.079724,0.265195,...,0.048820,0.214661,0.202810,0.180596,0.315911,0.422923,0.096321,0.188397,0.157397,0.063440
North by Northwest,0.111898,0.134939,0.152243,0.155905,1.000000,0.096765,0.083993,0.155955,0.217039,0.110208,...,0.128139,0.117457,0.125630,0.093958,0.167114,0.123475,0.079083,0.105223,0.064956,0.084280
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
The Sixth Sense,0.356219,0.309385,0.134086,0.422923,0.123475,0.145716,0.171207,0.124307,0.059536,0.300394,...,0.040497,0.184390,0.163548,0.163748,0.212491,1.000000,0.089194,0.173949,0.174497,0.069605
Rabbit-Proof Fence,0.079373,0.123464,0.085596,0.096321,0.079083,0.179672,0.060775,0.070782,0.051927,0.106010,...,0.027368,0.144722,0.179679,0.145766,0.075051,0.089194,1.000000,0.137025,0.084159,0.063054
Garden State,0.160310,0.237382,0.174842,0.188397,0.105223,0.149062,0.156087,0.067181,0.047823,0.187511,...,0.043502,0.355358,0.216372,0.276520,0.146419,0.173949,0.137025,1.000000,0.239108,0.074348
Mean Girls,0.144972,0.272684,0.096571,0.157397,0.064956,0.109113,0.115937,0.049611,0.029808,0.225019,...,0.029134,0.237132,0.148120,0.232556,0.095982,0.174497,0.084159,0.239108,1.000000,0.063437


This is perhaps the most abstract part to follow. We already know which items are alike and we have similarity values between them, but we need to condense this information into a single ranking. To do that, we take from the utility matrix the user's ratings for the movies they watched (which are normalized) and for the movies they did not watch. Since they did not watch those, their value is zero.

In [26]:
user_vector = matriz_utilidad.loc[user].loc[similar_list]
user_vector

Braveheart                                                0.0
Pirates of the Caribbean: The Curse of the Black Pearl    0.0
This Is Spinal Tap                                        0.0
The Silence of the Lambs                                  0.0
North by Northwest                                        0.0
                                                         ... 
The Sixth Sense                                           0.0
Rabbit-Proof Fence                                        0.0
Garden State                                              0.0
Mean Girls                                                0.0
Pride and Prejudice                                       0.0
Name: 108635, Length: 61, dtype: Sparse[float64, nan]

This cell computes the score. Don't worry if you don't follow it, but if you know some linear algebra, it is the product of the "neighbourhood" matrix with the user's rating vector, with each result divided by the sum of that item's similarities.

In [27]:
# Calculate the score.
score = neighbourhood.dot(user_vector).div(neighbourhood.sum(axis=1))
score

Braveheart                                                0.027018
Pirates of the Caribbean: The Curse of the Black Pearl    0.028842
This Is Spinal Tap                                        0.031108
The Silence of the Lambs                                  0.030002
North by Northwest                                        0.036863
                                                            ...   
The Sixth Sense                                           0.026513
Rabbit-Proof Fence                                        0.035279
Garden State                                              0.029402
Mean Girls                                                0.027323
Pride and Prejudice                                       0.030348
Length: 61, dtype: float64
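To unpack the score, for each candidate item $i$ it is $\sum_j s_{ij} r_j / \sum_j s_{ij}$, where $s_{ij}$ is the similarity between items $i$ and $j$ and $r_j$ is the user's normalized rating of item $j$. The following sketch reproduces the same expression on a made-up 3-item neighbourhood, where the user rated only item A.

```python
import pandas as pd

items = ["A", "B", "C"]
# Made-up similarity values between the three items.
neighbourhood = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]],
    index=items, columns=items,
)
# Normalized ratings of one user: only A was rated.
user_vector = pd.Series([0.6, 0.0, 0.0], index=items)

# Similarity-weighted sum of the user's ratings, divided by each item's total similarity.
score = neighbourhood.dot(user_vector).div(neighbourhood.sum(axis=1))
print(score)
```

Item B gets a positive score because it is similar to the rated item A, while item C gets 0 because it has no similarity to anything the user rated.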

We remove from the list the movies the user has already watched.

In [28]:
score = score.drop(known_user_likes)
score

Braveheart                                                0.027018
Pirates of the Caribbean: The Curse of the Black Pearl    0.028842
This Is Spinal Tap                                        0.031108
The Silence of the Lambs                                  0.030002
North by Northwest                                        0.036863
Dogma                                                     0.029739
The Sting                                                 0.029244
Notorious                                                 0.038059
Finding Nemo (Widescreen)                                 0.027662
The Caine Mutiny                                          0.028858
Ray                                                       0.027313
The Odd Couple                                            0.026905
House of Sand and Fog                                     0.031039
The Third Man                                             0.032063
Being John Malkovich                                      0.03

And we print the movies the user liked and the recommendations in order.

In [29]:
print(known_user_likes)
print(score.nlargest(20))

['Reservoir Dogs' 'Bend It Like Beckham' 'To Catch a Thief' 'The Pianist']
Dial M for Murder        0.042486
Notorious                0.038059
North by Northwest       0.036863
Whale Rider              0.036360
Frida                    0.036194
Rabbit-Proof Fence       0.035279
Lock                     0.033942
Bringing Up Baby         0.033794
About a Boy              0.033140
Bowling for Columbine    0.033053
Clerks                   0.032691
About Schmidt            0.032653
The Maltese Falcon       0.032340
The Third Man            0.032063
Being John Malkovich     0.031887
The Godfather            0.031686
American Beauty          0.031504
The Professional         0.031312
Love Actually            0.031141
This Is Spinal Tap       0.031108
dtype: float64


Although this notebook is a demonstration, we recommend that you go through it carefully, try to understand it as well as you can, and play with it a bit.

**Food for thought:**

1. What do you think the hyperparameters are?
1. How would you evaluate these results? How would you do a train/test split?