# Simplified Pipeline

The following cells provide a simplified template of the steps used in Part 1 of the BLU12 Learning Notebook. These steps are not the only way to get a recommender system (RS) up and running, and we encourage you to tweak them as you see fit.

## Understanding the data

- Is the dataset that you selected appropriate for building an RS?
- Do you have data about the items, or only about the users' preferences?
- Do you have a test dataset or do you have to create it?
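A minimal sketch of how these questions might be explored for a (user, item, rating) table. The toy `ratings` frame and its column names are assumptions made for illustration; substitute the frame you actually loaded.

```python
import pandas as pd

# Toy interaction table; the column names are hypothetical.
ratings = pd.DataFrame({
    "user": [1, 1, 2, 3, 3, 3],
    "item": [10, 20, 10, 20, 30, 40],
    "rating": [5, 3, 4, 2, 5, 1],
})

n_users = ratings["user"].nunique()
n_items = ratings["item"].nunique()

# Density: share of the (user, item) grid that has an observed rating.
density = len(ratings) / (n_users * n_items)

# Missing values and duplicated (user, item) pairs both need attention
# before building a ratings matrix.
n_missing = int(ratings.isna().sum().sum())
n_duplicates = int(ratings.duplicated(subset=["user", "item"]).sum())

print(f"{n_users} users, {n_items} items, density {density:.2f}")
print(f"{n_missing} missing values, {n_duplicates} duplicated pairs")
```

A very low density is normal for recommendation data and is the reason the ratings are stored in a sparse matrix later on.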

In [12]:
# YOUR CODE HERE

# Import the necessary dependencies

# Operating System
import os

# Numpy, Pandas and Scipy
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, save_npz, load_npz

# Scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split


## Load the Data

In [65]:
def read_users_history() -> pd.DataFrame:
    """Imports the listening history for each user.
    
    Returns:
        data (pd.DataFrame): DataFrame with the song and respective rating for each user. 
                             The rows are tuples of (user, song_id, rating).
                             
    """
    path = os.path.join('data', 'train_play_counts.txt')
    data = pd.read_csv(path, names=['user', 'song_id', 'rating'], sep='\t')
    return data

data = read_users_history()
data.head()  # TODO: edit to create a train and a test set, the test set being part of the data

12294


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [66]:
def read_test_users() -> pd.DataFrame:
    """Imports the list of users for which we need to predict.
    
    Returns:
        users_to_pred (pd.DataFrame): DataFrame with the users for which we will recommend songs. 
    """

    path = os.path.join('data', 'test_users.csv')
    users_to_pred_ = pd.read_csv(path, names=['users to recommend songs'])
    
    return users_to_pred_


users_to_pred = read_test_users()
users_to_pred.head()


## Process and clean data
- Check if data needs to be processed and cleaned.
- Process and clean data if necessary.
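One possible cleaning pass for an interaction table, sketched on made-up data. The `raw` frame, its column names, and the choice to average repeated ratings are all assumptions; the right choices depend on your dataset.

```python
import pandas as pd

# Toy interaction table with typical problems; column names are hypothetical.
raw = pd.DataFrame({
    "user": [1, 1, 2, 2, None],
    "item": [10, 10, 20, 30, 40],
    "rating": ["5", "3", "4", "unknown", "2"],
})

# Rows without a user or an item cannot be placed in a ratings matrix.
clean = raw.dropna(subset=["user", "item"]).copy()

# Coerce ratings to numbers; values that cannot be parsed become NaN and are dropped.
clean["rating"] = pd.to_numeric(clean["rating"], errors="coerce")
clean = clean.dropna(subset=["rating"])

# Keep one row per (user, item) pair, here by averaging repeated ratings.
clean = clean.groupby(["user", "item"], as_index=False)["rating"].mean()
clean["user"] = clean["user"].astype(int)

print(clean)
```

Making the rating column numeric before building a sparse matrix matters because SciPy sparse matrices do not support `object` dtype values.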

In [67]:
# YOUR CODE HERE
data['genre'] = data['genre'].str.replace(' ', '').str.replace(',', '').str.lower().fillna('n/a')
data.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,dramaromanceschoolsupernatural,Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,actionadventuredramafantasymagicmilitaryshounen,TV,64,9.26,793665
2,28977,Gintama°,actioncomedyhistoricalparodysamuraisci-fishounen,TV,51,9.25,114262
3,9253,Steins;Gate,sci-fithriller,TV,24,9.17,673572
4,9969,Gintama&#039;,actioncomedyhistoricalparodysamuraisci-fishounen,TV,51,9.16,151266


In [70]:
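# NOTE: the right-hand side reads from `genre`, so `name` is overwritten with the cleaned genre string.
# The trailing .dropna() has no effect here because the assignment realigns the result on the index.
data['name'] = data['genre'].str.replace(' ', '').str.replace(',', '').str.lower().dropna()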
data.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,dramaromanceschoolsupernatural,dramaromanceschoolsupernatural,Movie,1,9.37,200630
1,5114,actionadventuredramafantasymagicmilitaryshounen,actionadventuredramafantasymagicmilitaryshounen,TV,64,9.26,793665
2,28977,actioncomedyhistoricalparodysamuraisci-fishounen,actioncomedyhistoricalparodysamuraisci-fishounen,TV,51,9.25,114262
3,9253,sci-fithriller,sci-fithriller,TV,24,9.17,673572
4,9969,actioncomedyhistoricalparodysamuraisci-fishounen,actioncomedyhistoricalparodysamuraisci-fishounen,TV,51,9.16,151266


## Identify and separate the Users
- Which users are present in the training data?
- Make sure that you identify which test users are present in the training data and which are not.
- Can you use personalized methodologies for all users?
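A short sketch of how the test users could be separated into those with training history and those without. The user IDs below are invented for illustration; in practice they would come from the training data and the test users file.

```python
import numpy as np
import pandas as pd

# Hypothetical user IDs.
train_users = np.array([1, 5, 6, 7, 8])
test_users = pd.Series([5, 8, 42, 99], name="user")

# Test users with history in the training data can receive personalized recommendations.
in_train = test_users.isin(train_users)
known_users = test_users[in_train].tolist()

# The rest are cold-start users and need a fallback, such as non-personalized recommendations.
cold_start_users = test_users[~in_train].tolist()

print("personalized:", known_users)
print("fallback:", cold_start_users)
```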

In [72]:
# YOUR CODE HERE
train_rating, val_rating = train_test_split(rating, test_size=0.2, random_state=8)

# Store the indexes of each observation to identify which records to replace with zero
train_index = train_rating.index
val_index = val_rating.index

# Make copies of the data so that both sets keep the same users and items
train_rating_clean = rating.copy()
val_rating_clean = rating.copy()

# Zero out the validation observations in the training copy
train_rating_clean.loc[val_index, ["rating"]] = 0

# Zero out the training observations in the validation copy
val_rating_clean.loc[train_index, ["rating"]] = 0


In [75]:
# Create the R matrices
R_train = make_ratings(train_rating_clean)
R_val = make_ratings(val_rating_clean)

# Remove the explicit zeros from the sparse matrices
R_train.eliminate_zeros()
R_val.eliminate_zeros()

TypeError: no supported conversion for types: (dtype('O'),)

## Create the Ratings Matrix

In [73]:
# YOUR CODE HERE
def make_ratings(data: pd.DataFrame, shape: tuple = None) -> csr_matrix:
    """Creates the ratings matrix of listening history with optional shape.
    
    Creates the ratings matrix from the listening history imported using the read_users_history() method.
    
    Args:
        data (pd.DataFrame):  Listening history for the users.
        shape (tuple): The overall (n_users, n_items) shape desired for the matrix. 
                       If None, define the shape with the (n_users, n_items) from data argument.
        
    Returns:
        ratings (csr_matrix): Ratings matrix with shape (n_users, n_items).
    
    """
    # Map each user and item ID to a row and column position, respectively
    users, user_pos = np.unique(data.iloc[:, 0].values, return_inverse=True)
    items, item_pos = np.unique(data.iloc[:, 1].values, return_inverse=True)
    values = data.iloc[:, 2].fillna(0).values
    
    # R matrix dimensions (n_users, n_items), inferred from the data unless given
    if shape is None:
        shape = (len(users), len(items))

    R_ = csr_matrix((values, (user_pos, item_pos)), shape=shape)
    return R_
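
To see how `np.unique(..., return_inverse=True)` turns arbitrary IDs into row and column positions, here is a small self-contained sketch with made-up IDs. It mirrors the construction used in `make_ratings` but does not call it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up (user, item, rating) triples.
user_ids = np.array([7, 7, 3, 9])
item_ids = np.array([100, 300, 100, 200])
values = np.array([5.0, 1.0, 4.0, 2.0])

# `users` holds the sorted unique IDs; `user_pos` gives each triple's row index.
users, user_pos = np.unique(user_ids, return_inverse=True)
items, item_pos = np.unique(item_ids, return_inverse=True)

R = csr_matrix((values, (user_pos, item_pos)), shape=(len(users), len(items)))

print(users)        # row i of R belongs to users[i]
print(items)        # column j of R belongs to items[j]
print(R.toarray())
```

Because the positions depend on which IDs are present, matrices built from different subsets of the data only line up if they share the same users and items. That is why the train and validation copies above keep every row and only zero out ratings.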


## Non-Personalized Recommendations
- Create non-personalized recommendations as a baseline.
- Apply the recommendations to the test users.
- Store results in the required format for submission.
- Submit baseline recommendations.

In [52]:
# Let's store a DataFrame with the unique user IDs that we have in the original data.
def get_unique_users(data: pd.DataFrame) -> pd.DataFrame:
    """Get unique users in training data.
    
    Args:
        data (pd.DataFrame):  listening history for the users.
        
    Returns:
        unique_users (pd.DataFrame): DataFrame of one column with unique users in training data.
    
    """
    return pd.DataFrame(np.unique(data.iloc[:, 0].values), columns=["users to recommend songs"])


unique_users_training_data = get_unique_users(anime)
unique_users_training_data.head()

Unnamed: 0,users to recommend songs
0,1
1,5
2,6
3,7
4,8


In [53]:
# YOUR CODE HERE
def get_most_rated(ratings: csr_matrix, n: int) -> np.ndarray:
    """Returns the n most rated items in a ratings matrix.
    
    Args:
        ratings (csr_matrix): A sparse ratings matrix
        n (int): The number of top-n items we should retrieve.
        
    Returns:
        most_rated (np.ndarray): Array with the column indices of the n most rated items, most rated first.
    
    """
    def is_rating(R_: csr_matrix) -> csr_matrix:
        """Returns a sparse matrix of booleans 
        
        Args:
            R_ (csr_matrix): A sparse ratings matrix
            
        Returns:
            is_rating (csr_matrix): A sparse matrix of booleans.
        """
        return R_ > 0
    
    def count_ratings(R_: csr_matrix) -> np.ndarray:
        """Returns an array with the count of ratings
        
        The attribute ".A1" of a numpy matrix returns self as a flattened ndarray.
        
        Args:
            R_ (csr_matrix): A sparse matrix of booleans
        
        Returns:
            count_ratings (np.ndarray): Count of ratings by item
        """
        return R_.sum(axis=0).A1
    
    ratings_ = is_rating(ratings)
    ratings_ = count_ratings(ratings_)
    return np.negative(ratings_).argsort()[:n]


non_pers_most_rated = get_most_rated(R_train, 100)
non_pers_most_rated

array([ 1389,  6606,  7439,  1427,  4630,   201,    10,  3965,    99,
        2651,  3558,   175,  5214,  1978,  6288,  5968,  5673,  6481,
        1816,  3150,   768,   764,  4722,  8134,  3531,  8627,    19,
        6008,   402,  5881,  4792,   732,  3948,   141,  4268,     0,
        1839,  3867,  4879,  1087,   330,  8277,  1531,  6369,  6054,
         181,  6567,  7312,  7005,  3043,   244,  7927,   199,  7161,
        5514,  8532,  2725, 10188,   202,  2043,   844,   243,  8606,
        7101,   802,  6336,  1649,  4701,  2266,   488,    39,  6599,
        7115,  2703,  7028,  7170,   329,  4556,  5534,  8687,  1671,
        5407,  2055,  1713,    50,  6612,  4424,  9631,  3752,  7614,
           2,   401,    98,   200,  7817,  4239,  6607,  3799,  6703,
        9708])

In [54]:
def convert_non_pers_recommendations_to_df(non_pers_recs: np.ndarray, users_to_pred: pd.DataFrame) -> pd.DataFrame:
    """
    Converts the non-personalized most rated items to a DataFrame with the users and the recommendations.
    We basically repeat the non_pers_recs array once for each user that needs recommendations.
    
    Args:
        non_pers_recs (np.ndarray): Array of indices for the best non-personalized items to recommend.
        users_to_pred (pd.DataFrame): DataFrame containing the users which need recommendations.
        
    Returns:
        non_pers_df (pd.DataFrame): DataFrame indexed by user, with one column per recommended item (n_users, top_n_items).
    
    """
    # Broadcasting an (n_users, 1) column of zeros against the (top_n,) array repeats it for every user
    non_pers_df = pd.DataFrame(np.zeros((len(users_to_pred), 1), dtype=non_pers_recs.dtype) + non_pers_recs)
    non_pers_df = pd.concat([users_to_pred, non_pers_df], axis=1)
    non_pers_df = non_pers_df.set_index("users to recommend songs")
    
    return non_pers_df


non_pers_most_rated_df = convert_non_pers_recommendations_to_df(non_pers_most_rated, unique_users_training_data)
non_pers_most_rated_df.head()

users to recommend songs,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
1,1389,6606,7439,1427,4630,201,10,3965,99,2651,...,2,401,98,200,7817,4239,6607,3799,6703,9708
5,1389,6606,7439,1427,4630,201,10,3965,99,2651,...,2,401,98,200,7817,4239,6607,3799,6703,9708
6,1389,6606,7439,1427,4630,201,10,3965,99,2651,...,2,401,98,200,7817,4239,6607,3799,6703,9708
7,1389,6606,7439,1427,4630,201,10,3965,99,2651,...,2,401,98,200,7817,4239,6607,3799,6703,9708
8,1389,6606,7439,1427,4630,201,10,3965,99,2651,...,2,401,98,200,7817,4239,6607,3799,6703,9708


In [55]:
def create_dict_preds(preds_df: pd.DataFrame) -> dict:
    """Convert the predictions DataFrame (index:users -> columns: items) to a dictionary of key (user->list of items).
    
    Args: 
        preds_df (pd.DataFrame): DataFrame containing the users and the ordered predictions.
        
    Returns:
        preds_dict (dict): Dict of (user_id: list of items) used for evaluating the performance.
    
    """
    return {preds_df.index[i]: preds_df.values[i].tolist() for i in range(len(preds_df))}


non_pers_dict = create_dict_preds(non_pers_most_rated_df)
# Dicts cannot be sliced directly, so we convert the items to a list to preview the first entry.
dict(list(non_pers_dict.items())[0:1])

{1: [1389,
  6606,
  7439,
  1427,
  4630,
  201,
  10,
  3965,
  99,
  2651,
  3558,
  175,
  5214,
  1978,
  6288,
  5968,
  5673,
  6481,
  1816,
  3150,
  768,
  764,
  4722,
  8134,
  3531,
  8627,
  19,
  6008,
  402,
  5881,
  4792,
  732,
  3948,
  141,
  4268,
  0,
  1839,
  3867,
  4879,
  1087,
  330,
  8277,
  1531,
  6369,
  6054,
  181,
  6567,
  7312,
  7005,
  3043,
  244,
  7927,
  199,
  7161,
  5514,
  8532,
  2725,
  10188,
  202,
  2043,
  844,
  243,
  8606,
  7101,
  802,
  6336,
  1649,
  4701,
  2266,
  488,
  39,
  6599,
  7115,
  2703,
  7028,
  7170,
  329,
  4556,
  5534,
  8687,
  1671,
  5407,
  2055,
  1713,
  50,
  6612,
  4424,
  9631,
  3752,
  7614,
  2,
  401,
  98,
  200,
  7817,
  4239,
  6607,
  3799,
  6703,
  9708]}

In [56]:
def get_y_true(R_val_: csr_matrix, users_to_pred: pd.DataFrame, n=100):
    """Get the ground truth (best recommendations) of the users in the validation set.
    
    Args:
        R_val_ (csr_matrix): Validation set ratings matrix.
        users_to_pred (pd.DataFrame): DataFrame containing the users which need recommendations.
        n (int): Number of top-n items.
        
    Returns:
        y_true_df (pd.DataFrame): DataFrame which returns the y_true items.
        
    """
    top_from_R_val = pd.DataFrame(np.negative(R_val_).toarray().argsort()[:, :n])
    y_true_df = pd.concat([users_to_pred, top_from_R_val], axis=1)
    y_true_df = y_true_df.set_index("users to recommend songs")
    return y_true_df


# R_val is the validation ratings matrix created above
y_true_df = get_y_true(R_val, unique_users_training_data, n=100)
y_true_df.head()

## Evaluate results
- Calculate the evaluation metric on the validation users.
- Compare it later with the personalized recommendations.
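
The evaluation metric is not named in this template, so the sketch below assumes mean average precision at k (mAP@k), a common choice for ranked recommendations. The function names are hypothetical; the inputs are dictionaries of `user_id: list of items`, the same form that `create_dict_preds` returns.

```python
from typing import Dict, List


def average_precision_at_k(actual: List[int], predicted: List[int], k: int = 10) -> float:
    """Average precision at k for one user; `actual` holds the relevant items."""
    if not actual:
        return 0.0
    relevant = set(actual)
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the rank where it first appears.
        if item in relevant and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(relevant), k)


def mean_average_precision_at_k(y_true: Dict[int, List[int]],
                                y_pred: Dict[int, List[int]],
                                k: int = 10) -> float:
    """Mean of the per-user average precision over the users in `y_true`."""
    scores = [average_precision_at_k(items, y_pred.get(user, []), k)
              for user, items in y_true.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: user 1 gets both relevant items, user 2 gets none.
y_true = {1: [10, 20], 2: [30]}
y_pred = {1: [10, 99, 20], 2: [98, 97, 96]}
score = mean_average_precision_at_k(y_true, y_pred, k=3)
print(round(score, 4))
```

Whichever metric you use, compute it the same way for every approach so the baseline and the personalized results stay comparable.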

In [None]:
# YOUR CODE HERE

## Personalized Recommendations: Collaborative Filtering
- Compute the user similarities matrix.
- Predict ratings.
- Select the best recommendations.
- Submit recommendations.
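
The first three steps can be sketched on a toy matrix as follows. This is one simple form of user-based collaborative filtering (a similarity-weighted average of other users' ratings), shown under the assumption that zeros mean "not rated"; it is not necessarily the exact formulation used in the Learning Notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings matrix (3 users x 4 items); zeros mean "not rated".
R = csr_matrix(np.array([
    [5, 4, 0, 0],
    [5, 0, 3, 0],
    [0, 0, 4, 5],
], dtype=float))
R_dense = R.toarray()

# 1. User-user similarities, ignoring each user's similarity with themselves.
S = cosine_similarity(R)
np.fill_diagonal(S, 0)

# 2. Predicted rating = similarity-weighted average of the other users' ratings.
weights = np.abs(S).sum(axis=1, keepdims=True)
weights[weights == 0] = 1  # avoid dividing by zero for users with no neighbours
preds = (S @ R_dense) / weights

# 3. Mask the items each user already rated, then keep the top-n item indices.
preds[R_dense > 0] = -np.inf
top_n = np.argsort(-preds, axis=1)[:, :2]

print(top_n)
```

The dense similarity matrix has shape (n_users, n_users), so on a large dataset you may need to process users in batches.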

In [None]:
# YOUR CODE HERE

## Evaluate results (Again)
- Calculate the evaluation metric on the validation users.

In [None]:
# YOUR CODE HERE

## Content-based Recommendations

- Compute the item similarities matrix.
- Predict ratings.
- Select the best recommendations.
- Submit recommendations.
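
A minimal content-based sketch, assuming item features are available as a multi-hot genre matrix. The features, ratings, and genre labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy item features: one row per item, one column per genre (hypothetical data).
item_features = np.array([
    [1, 1, 0],   # item 0: action, comedy
    [1, 0, 0],   # item 1: action
    [0, 0, 1],   # item 2: drama
    [0, 1, 1],   # item 3: comedy, drama
], dtype=float)

# Toy ratings matrix (2 users x 4 items); zeros mean "not rated".
R = np.array([
    [5, 0, 0, 0],
    [0, 0, 4, 0],
], dtype=float)

# 1. Item-item similarities computed from the item features.
S_items = cosine_similarity(item_features)
np.fill_diagonal(S_items, 0)

# 2. Score each item by its similarity to the items the user rated, weighted by the rating.
scores = R @ S_items

# 3. Mask already-rated items and pick the best remaining item per user.
scores[R > 0] = -np.inf
best_item = scores.argmax(axis=1)

print(best_item)
```

Unlike collaborative filtering, this approach only needs the target user's own ratings plus the item features, so it can still recommend items that nobody has rated yet.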

In [None]:
# YOUR CODE HERE

## Evaluate results (Yet again)
- Calculate the evaluation metric on the validation users.

In [None]:
# YOUR CODE HERE

## Potential improvements

At this point you can try to improve your prediction using several approaches:
- Aggregation of ratings from different sources. 
- Mixing Collaborative Filtering and Content-based Recommendations.
- Matrix Factorization.
- Could you use classification or regression models to predict users' preferences? 🤔
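
As an illustration of the matrix factorization idea, here is a sketch using a truncated SVD from SciPy on a toy matrix. It treats unrated cells as zeros, which is a simplifying assumption; dedicated factorization methods fit only the observed ratings.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy ratings matrix (4 users x 4 items); zeros mean "not rated".
R = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 0],
], dtype=float))

# Truncated SVD with k latent factors (k must be smaller than both matrix dimensions).
k = 2
U, sigma, Vt = svds(R, k=k)

# Low-rank reconstruction: every (user, item) cell now holds a predicted score.
R_hat = U @ np.diag(sigma) @ Vt

# Recommend the best unseen item per user.
R_hat[R.toarray() > 0] = -np.inf
best_item = R_hat.argmax(axis=1)

# The last user only rated item 2; the factorization suggests item 3,
# which the similar user above also rated.
print(best_item[3])
```

The number of factors `k` is a hyperparameter; it is worth tuning against the validation metric.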

In [None]:
# YOUR CODE HERE