# Desafio de Tripulaciones 
03/22 Grupo 3 


### ALGORITMO UTILIZADO:
#### Singular Value Decomposition

$$ \underset{(n, d)}A \approx \underset{(n, n)}U * \underset{(n, d)}\Sigma * \underset{(d, d)} V^T  $$

Cualquier matriz de tamaño (n, d) se puede descomponer en producto de tres factores

* En *U* de tamaño (n, n) es una matriz ortogonal que contiene los vectores singulares izquierdos de *A*.
* En $\Sigma$ que es una matriz diagonal (n,d), cuyos valores son los valores singulares de la matriz *A* ordenados en valor decreciente
* En *V* que es una matriz transpuesta (d,d), cuyos valores son los vectores singulares derechos de *A*.

*Ortogonal significa que multiplicando la transpuesta por si misma, se obtiene la matriz identidad*

Con esto lo que se consigue es que podemos ir elminando vectores de las matrices con la información que no es fundamental, (limpiar los datos) y quedarnos con aquella información más determinante.

## Aplicación práctica

Lo que se hace con los motores de recomendación, es para una actividad que tu no has realizado, teniendo en cuenta tus características y las de otros usuarios. Mediante SVD nos quedamos con los usuarios que son parecidos a ti, y vemos las actividades que no has visto

###  Cargamos librerías

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds
from pandas.core.frame import DataFrame
from pandas.io.parsers import read_csv
from surprise import SVDpp
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from collections import defaultdict

### Cargamos los datos

** Entrenamos el algoritmo con un set de películas de forma temporal, cuando los resultados de la encuesta estén listos, entrenaremos el modelo con esos datos

In [2]:
df_cohousing = pd.read_csv("./data/cohousing_TUESDAY.csv")
df_cohousing

Unnamed: 0,Timestamp,¿Qué edad tienes?,Lista de actividades: [YOGA],Lista de actividades: [NATACIÓN ],Lista de actividades: [BAILE ],Lista de actividades: [GOLF ],Lista de actividades: [GIMNASIO ],Lista de actividades: [TIRO CON ARCO ],Lista de actividades: [ZUMBA ],Lista de actividades: [TENIS],...,Lista de actividades: [MANUALIDADES],Lista de actividades: [IDIOMAS],Lista de actividades: [COCINA],Lista de actividades: [COCTELERÍA],Lista de actividades: [CERVECERÍA ARTESANAL],Lista de actividades: [CATAS DE COMIDA Y BEBIDA],Lista de actividades: [BINGO],Lista de actividades: [PARCHIS],Lista de actividades: [AJEDREZ],Lista de actividades: [TEATRO]
0,2022/03/11 7:27:28 pm CET,Más de 55 años,,,,,,,,,...,,,,,,,,,,
1,2022/03/11 7:32:28 pm CET,Más de 55 años,1 No me gusta,3,1 No me gusta,3,1 No me gusta,3,1 No me gusta,3,...,3,2,3,3,3,3,1 No me gusta,1 No me gusta,2,3
2,2022/03/11 7:50:52 pm CET,Más de 55 años,1 No me gusta,3,3,,2,,3,3,...,2,2,2,,,1 No me gusta,1 No me gusta,1 No me gusta,,1 No me gusta
3,2022/03/11 7:55:37 pm CET,Más de 55 años,3,5 Me encanta,4,1 No me gusta,1 No me gusta,,,4,...,,2,1 No me gusta,1 No me gusta,1 No me gusta,1 No me gusta,1 No me gusta,3,,
4,2022/03/11 8:35:38 pm CET,Más de 55 años,1 No me gusta,2,1 No me gusta,1 No me gusta,4,1 No me gusta,1 No me gusta,5 Me encanta,...,5 Me encanta,3,3,1 No me gusta,1 No me gusta,1 No me gusta,1 No me gusta,3,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,2022/03/14 8:57:00 pm CET,Más de 55 años,,,3,,,,,,...,,,3,,,,,,,
340,2022/03/14 9:16:36 pm CET,Más de 55 años,,1 No me gusta,1 No me gusta,,,,1 No me gusta,,...,,,,,,,,,,1 No me gusta
341,2022/03/14 9:18:24 pm CET,Más de 55 años,1 No me gusta,1 No me gusta,1 No me gusta,,1 No me gusta,,,,...,,1 No me gusta,,,,,,,,
342,2022/03/15 12:37:03 am CET,Más de 55 años,1 No me gusta,4,1 No me gusta,,3,,,3,...,3,5 Me encanta,2,,,,1 No me gusta,2,,


In [3]:
def preprocess_form(dataframe):
    """
    This functions will preprocess the original csv from the Google form and get it ready for the model.

    Parameters
    -----------

    dataframe (object): the DataFrame containing three columns; userID, itemID and rating.

    return
    ------

    A new dataframe ready for the model.

    """
    #we clean the df from only nan answers and we erase the first two columns as they are not answers for the algorithm
    #dataframe = dataframe[dataframe.columns[2:]]
    dataframe.drop(['Timestamp', '¿Qué edad tienes?'], axis=1, inplace=True)
    dataframe.dropna(how='all', inplace= True)
    
    #we need to change the answers from the form from str to int
    dicc_formulario = {
        '1 No me gusta': 1,
        '2': 2,
        '3': 3,
        '4': 4,
        '5 Me encanta': 5
    }

    for i in dataframe.columns:
        dataframe[i] = dataframe[i].map(dicc_formulario)

    #we transorm the dataframe to get the triplet column format we need to feed our algorithm
    df_new = pd.DataFrame([(1,1,1)])
    for i in dataframe.index:
        for j in dataframe.columns:
            if dataframe[j][i] == 'NaN':
                pass
            else:
                df_new = df_new.append([(i,j, dataframe[j][i])])

    df_new.dropna(how= 'any', inplace=True)
    df_new = df_new[1:]

    #we cahange the columns names
    df_new.columns = ['userId', 'itemId', 'rating']

    #we create a dictionary to map the name of activities and change it for their activity code
    dict_activities = {
    'Lista de actividades: [YOGA]': 1, 
    'Lista de actividades: [NATACIÓN ]': 2,
    'Lista de actividades: [BAILE ]': 3, 
    'Lista de actividades: [GOLF ]': 4,
    'Lista de actividades: [GIMNASIO ]': 5,
    'Lista de actividades: [TIRO CON ARCO ]': 6,
    'Lista de actividades: [ZUMBA ]': 7, 
    'Lista de actividades: [TENIS]': 8,
    'Lista de actividades: [CLUB DE LECTURA]': 9,
    'Lista de actividades: [CLUB DE ESCRITURA]': 10,
    'Lista de actividades: [PINTURA]': 11, 
    'Lista de actividades: [MÚSICA ]': 12,
    'Lista de actividades: [MACRAMÉ]': 13,
    'Lista de actividades: [INFORMÁTICA]': 14,
    'Lista de actividades: [JARDINERÍA]': 15,
    'Lista de actividades: [MANUALIDADES]': 16,
    'Lista de actividades: [IDIOMAS]': 17, 
    'Lista de actividades: [COCINA]': 18,
    'Lista de actividades: [COCTELERÍA]': 19,
    'Lista de actividades: [CERVECERÍA ARTESANAL]': 20,
    'Lista de actividades: [CATAS DE COMIDA Y BEBIDA]': 21,
    'Lista de actividades: [BINGO]': 22, 
    'Lista de actividades: [PARCHIS]': 23,
    'Lista de actividades: [AJEDREZ]': 24, 
    'Lista de actividades: [TEATRO]': 25
    }
    df_new['itemId'] = df_new['itemId'].map(dict_activities)
    
    return df_new

In [4]:
df_algorithm = preprocess_form(df_cohousing)


In [5]:
df_algorithm

Unnamed: 0,userId,itemId,rating
0,1,1,1.0
0,1,2,3.0
0,1,3,1.0
0,1,4,3.0
0,1,5,1.0
...,...,...,...
0,343,19,5.0
0,343,20,4.0
0,343,22,1.0
0,343,23,1.0


### Preprocessing

In [6]:
reader = Reader()
data = Dataset.load_from_df(df_algorithm, reader)

train, test = train_test_split(data, test_size=0.25)

### Training and testing

In [24]:
svd = SVDpp(n_epochs= 100, lr_all= 0.01, reg_all= 0.2)
SVD_model_for_pickle = svd.fit(train)
preds = svd.test(test)

#### Hacemos un Cross Validation para analizar las metricas

In [8]:
from surprise.model_selection import cross_validate
# Run 5-fold cross-validation and print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9452  0.9465  0.9752  0.9488  0.9262  0.9484  0.0156  
MAE (testset)     0.7515  0.7357  0.7558  0.7578  0.7186  0.7439  0.0148  
Fit time          6.82    6.62    6.58    6.88    6.53    6.69    0.14    
Test time         0.03    0.03    0.03    0.03    0.03    0.03    0.00    


{'test_rmse': array([0.94524618, 0.94648292, 0.97519585, 0.94876465, 0.92624384]),
 'test_mae': array([0.75146823, 0.73567474, 0.75581058, 0.75783496, 0.71858988]),
 'fit_time': (6.815050840377808,
  6.621856212615967,
  6.582151889801025,
  6.87920618057251,
  6.532171010971069),
 'test_time': (0.03427886962890625,
  0.03241395950317383,
  0.032640933990478516,
  0.03234577178955078,
  0.03286886215209961)}

### Evaluation

In [25]:
accuracy.mae(preds)
accuracy.rmse(preds)

MAE:  0.7534
RMSE: 0.9658


0.9657983485446319

### Prueba con KNN algoritmo

In [10]:
from surprise import KNNBasic

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
knn = KNNBasic(n_epochs= 50, lr_all= 0.01, reg_all= 0.2)
predictions = knn.fit(trainset).test(test)

accuracy.mae(predictions)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
MAE:  0.5490
RMSE: 0.7112


0.7112497452464863

In [11]:
# Run 5-fold cross-validation and print results
cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9852  0.9726  1.0074  0.9921  0.9864  0.9888  0.0113  
MAE (testset)     0.7727  0.7638  0.7825  0.7907  0.7949  0.7809  0.0114  
Fit time          0.03    0.02    0.02    0.02    0.02    0.02    0.00    
Test time         0.34    0.22    0.20    0.20    0.22    0.24    0.06    


{'test_rmse': array([0.98524708, 0.97261274, 1.00744333, 0.99214499, 0.98635901]),
 'test_mae': array([0.77269907, 0.76381901, 0.78254021, 0.79067473, 0.79492122]),
 'fit_time': (0.025620222091674805,
  0.02054309844970703,
  0.016434907913208008,
  0.016020774841308594,
  0.017827987670898438),
 'test_time': (0.3449070453643799,
  0.21704697608947754,
  0.19662213325500488,
  0.2019801139831543,
  0.21691226959228516)}

### Train all data

In [26]:
trainfull = data.build_full_trainset()

svd = SVDpp(n_epochs= 100, lr_all= 0.01, reg_all= 0.2)
SVD_model_for_pickle = svd.fit(trainfull)

SVD_model_for_pickle.predict(uid=1, iid=1)

Prediction(uid=1, iid=1, r_ui=None, est=2.0071712112703013, details={'was_impossible': False})

#### GridSearch SVD

In [13]:
from surprise.model_selection import GridSearchCV
from surprise import SVD

param_grid = {'n_epochs': [50, 100], 'lr_all': [0.01, 0.012],
              'reg_all': [0.1, 0.2]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.967589143754929
{'n_epochs': 100, 'lr_all': 0.01, 'reg_all': 0.2}


#### GridSearch KNN

In [29]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [2, 5, 10], 'lr_all': [0.003, 0.005, 0.007],
              'reg_all': [0.05, 0.1]}
gs_knn = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3)

gs_knn.fit(data)

# best RMSE score
print(gs_knn.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_knn.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

### FUNCIONES

In [14]:
def defiant_recommender(userId, dataframe, algorithm, n_recommendations, column_iid= None, column_uid= None):
    """
    This functions will use a trained algorithm to find the n top list of recommended items for a given userID.

    Parameters
    -----------

    userId (int): the user ID of the person that we want recommendations for.

    dataframe (object): the DataFrame containing three columns; userID, itemID and rating.

    algorithm (object): the trained algorith used to recommend items.

    n_rcommendations (int): the number of items recommended.

    column_iid (string): name of the column containing the item ID.

    column_uid (string): name of the column containing the user ID.


    return
    ------

    List of ID of items that an specific user will like.

    """
    item_ids = dataframe[column_iid].to_list()
    items_finished = dataframe[dataframe[column_uid] == userId][column_iid]

    items_no_finished = []
    for item in item_ids:
        if item not in items_finished:
            items_no_finished.append(item)

    preds = []
    for item in items_no_finished:
        preds.append(algorithm.predict(uid=userId, iid=item))

    recommendations_rating = {pred[1]:pred[3] for pred in preds}

    order_dict = {k: v for k, v in sorted(recommendations_rating.items(), key=lambda item: item[1])}

    top_predictions = list(order_dict.keys())[:n_recommendations]
    
    return top_predictions

In [15]:
def check_recommended_item_name(list):
    """
    This functions will show the names of the n top rated items for a given userID.

    Parameters
    -----------

    list (object): the list of n recommended itemId.

    return
    ------

    A list with the n names of the itemId recommended to the given userId.

    """
    dict_items = {
            1: 'YOGA', 
            2: 'NATACION',
            3: 'BAILE', 
            4: 'GOLF',
            5: 'GIMNASIO',
            6: 'TIRO CON ARCO',
            7: 'ZUMBA', 
            8: 'TENIS',
            9: 'CLUB DE LECTURA',
            10: 'CLUB DE ESCRITURA',
            11: 'PINTURA', 
            12: 'MUSICA',
            13: 'MACRAME',
            14: 'INFORMATICA',
            15: 'JARDINERIA',
            16: 'MANUALIDADES',
            17: 'IDIOMAS', 
            18: 'COCINA',
            19: 'COCTELERIA',
            20: 'CERVECERIA ARTESANAL',
            21: 'CATAS DE COMIDA Y BEBIDA',
            22: 'BINGO', 
            23: 'PARCHIS',
            24: 'AJEDREZ', 
            25: 'TEATRO'
        }

    return [dict_items[i] for i in list]

In [16]:
def check_activities_user(userId, dataframe, n, column_rating= None, column_uid= None):
    """
    This functions will show the n top rated items for a given userID.

    Parameters
    -----------

    userId (int): the user ID of the person that we want recommendations for.

    dataframe (object): the DataFrame containing three columns; userID, itemID and rating.

    n (int): number of top rated items to show.

    column_rating (string): name of the column containing the item rating.

    column_uid (string): name of the column containing the user ID.


    return
    ------

    A dataframe with the n top rated items by that given user.

    """
    dataframe = dataframe[dataframe[column_uid] ==userId].sort_values(column_rating, ascending=False)[:n]
    
    #we create a dictionary to map the name of activities and change it for their activity code
    dict_activities = {
        1: 'YOGA', 
        2: 'NATACION',
        3: 'BAILE', 
        4: 'GOLF',
        5: 'GIMNASIO',
        6: 'TIRO CON ARCO',
        7: 'ZUMBA', 
        8: 'TENIS',
        9: 'CLUB DE LECTURA',
        10: 'CLUB DE ESCRITURA',
        11: 'PINTURA', 
        12: 'MUSICA',
        13: 'MACRAME',
        14: 'INFORMATICA',
        15: 'JARDINERIA',
        16: 'MANUALIDADES',
        17: 'IDIOMAS', 
        18: 'COCINA',
        19: 'COCTELERIA',
        20: 'CERVECERIA ARTESANAL',
        21: 'CATAS DE COMIDA Y BEBIDA',
        22: 'BINGO', 
        23: 'PARCHIS',
        24: 'AJEDREZ', 
        25: 'TEATRO'
    }

    dataframe['itemName'] = dataframe['itemId'].map(dict_activities)

    return dataframe

## Funcion para comprobar la logica de la recomendacion

### SVD


In [17]:
user = 100

In [30]:
activities_recommended = defiant_recommender(user, df_algorithm, SVD_model_for_pickle, 5, 'itemId', 'userId')
print("ID of the recommended activities:", activities_recommended)
print("\nNAME of the recommended activities:",check_recommended_item_name(activities_recommended))
print(f"\nTop RATED activities from this user:\n", check_activities_user(user, df_algorithm, 5, 'rating', 'userId'))

ID of the recommended activities: [22, 6, 4, 13, 19]

NAME of the recommended activities: ['BINGO', 'TIRO CON ARCO', 'GOLF', 'MACRAME', 'COCTELERIA']

Top RATED activities from this user:
    userId  itemId  rating      itemName
0     100       3     3.0         BAILE
0     100       8     3.0         TENIS
0     100      15     3.0    JARDINERIA
0     100      16     3.0  MANUALIDADES
0     100      17     3.0       IDIOMAS


#### KNN

In [31]:
activities_recommended = defiant_recommender(user, df_algorithm, knn, 5, 'itemId', 'userId')
print("ID of the recommended activities:", activities_recommended)
print("\nNAME of the recommended activities:",check_recommended_item_name(activities_recommended))
print(f"\nTop RATED activities from this user:\n", check_activities_user(user, df_algorithm, 5, 'rating', 'userId'))

ID of the recommended activities: [22, 4, 6, 13, 7]

NAME of the recommended activities: ['BINGO', 'GOLF', 'TIRO CON ARCO', 'MACRAME', 'ZUMBA']

Top RATED activities from this user:
    userId  itemId  rating      itemName
0     100       3     3.0         BAILE
0     100       8     3.0         TENIS
0     100      15     3.0    JARDINERIA
0     100      16     3.0  MANUALIDADES
0     100      17     3.0       IDIOMAS


In [20]:
# Guardar el modelo
import pickle
pickle_file = open('model_prueba.model', 'wb')
pickle.dump(SVD_model_for_pickle, pickle_file)
pickle_file.close()

In [21]:
# Para volver a leer el modelo
file = open('model_prueba.model', 'rb')
model_test = pickle.load(file)

In [22]:
actividad_test = model_test.predict(iid= 22, uid= 100)

In [23]:
actividad_test.est

2.033678168025659