# Simplified Pipeline

The following cells provide a simplified template of the steps used in Part 1 of the BLU12 Learning Notebook. These steps are not the only way to get a recommender system (RS) up and running, and we encourage you to tweak them as you see fit.

## Imports

In [4]:
! pip install xlrd

Collecting xlrd
  Using cached xlrd-1.2.0-py2.py3-none-any.whl (103 kB)
Installing collected packages: xlrd
Successfully installed xlrd-1.2.0


In [152]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy.sparse import csr_matrix

from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

import ml_metrics as metrics

## Understanding the data

- Is the dataset that you selected appropriate for building an RS?
- Do you have data regarding the items or only about the users' preferences?
- Do you have a test dataset or do you have to create it?

### My solution

I used Dataset 1 from http://eigentaste.berkeley.edu/dataset/.

## Load the Data

In [2]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three files the same way
df_raw = pd.concat([
    pd.read_excel('data/jokes/jester-data-1.xls', header=None),
    pd.read_excel('data/jokes/jester-data-2.xls', header=None),
    pd.read_excel('data/jokes/jester-data-3.xls', header=None),
])

print(f"The shape is {df_raw.shape}")
df_raw.head(5)

The shape is (73421, 101)


,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


## Process and clean data
- Check if data needs to be processed and cleaned.
- Process and clean data if necessary.

In [155]:
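df_clean = (
    df_raw
    .reset_index(drop=True)   # the three files repeat row labels; renumber the users
    .drop(columns=[0])        # column 0 is the number of jokes each user rated
    .replace(99, np.nan)      # 99 marks "not rated"
    .replace(0, np.nan)       # ratings of exactly 0 are also treated as missing
    .add(10)                  # shift ratings from [-10, 10] to [0, 20]
    .stack(dropna=True)       # long format, one row per (user, item)
    .to_frame()
)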
df_clean.index.names = ['user', 'item']
df_clean.columns = ['rating']

df_clean.head()

user,item,rating
0,1,2.18
0,2,18.79
0,3,0.34
0,4,1.84
0,5,2.48
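To make the cleaning steps easier to follow, here is a minimal sketch on a toy frame that imitates the raw layout (first column = number of jokes rated, `99` = not rated). The toy values and the names `raw` and `clean` are illustrative assumptions, not part of the dataset. It uses `.stack().dropna()`, which should give the same long format across pandas versions whether or not `stack` drops missing values by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw layout (illustrative values only):
# column 0 = number of jokes rated, remaining columns = ratings, 99 = not rated.
raw = pd.DataFrame([
    [2, -7.82, 99.0, 4.17],
    [3, 4.08, -0.29, 6.36],
])

clean = (
    raw.drop(columns=[0])        # drop the per-user rating count
       .replace(99, np.nan)      # 99 marks "not rated"
       .add(10)                  # shift [-10, 10] to [0, 20]
       .stack()                  # wide -> long, one row per (user, item)
       .dropna()                 # keep only observed ratings
       .to_frame("rating")
)
clean.index.names = ["user", "item"]
print(clean)
```

The unrated cell (user 0, item 2) disappears from the long frame, so only observed ratings remain.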


## Identify and separate the Users
- Which users are present in the training data?
- Make sure that you identify which test users are present in the training data and which are not.
- Can you use personalized methodologies for all users?
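
The questions above come down to a set comparison between the users seen in training and the users you must recommend for. A minimal sketch, assuming both groups are available as plain iterables of user ids (the function name `split_known_and_new_users` is hypothetical):

```python
def split_known_and_new_users(train_users, test_users):
    """Separate test users into those seen in training and those that are new."""
    train_set = set(train_users)
    known = [u for u in test_users if u in train_set]
    new = [u for u in test_users if u not in train_set]
    return known, new


# Usage with toy ids: users 2 and 3 have training history, user 4 does not.
known_users, new_users = split_known_and_new_users([1, 2, 3], [2, 3, 4])
print(known_users, new_users)
```

Users in `new_users` have no rating history, so they can only receive non-personalized recommendations.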

In [132]:
users = df_clean.index.get_level_values('user').unique()
items = df_clean.index.get_level_values('item').unique()

In [177]:
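def split_data(df, test_size=0.3):
    '''Split the ratings into train and test frames with the same index as df.

    Held-out ratings are set to 0 (the value used for "missing") instead of being dropped.
    '''

    # randomly assign each rating to either train or test
    data_train, data_test = train_test_split(df, test_size=test_size, random_state=0)

    # make copies
    df_train = df.copy()
    df_test = df.copy()
    
    # zero out the test ratings in the train copy, and the train ratings in the test copy
    df_train.loc[df_train.index.isin(data_test.index.to_list()), 'rating'] = 0
    df_test.loc[df_test.index.isin(data_train.index.to_list()), 'rating'] = 0
    
    return (df_train, df_test)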

In [178]:
df_train, df_test = split_data(df_clean)

## Create the Ratings Matrix

In [179]:
def data_to_R(df):
    R = csr_matrix(df.unstack(fill_value=0).values)
    return R

In [180]:
R = data_to_R(df_clean)
R_train = data_to_R(df_train)
R_test = data_to_R(df_test)
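
Because the matrix is built with `fill_value=0`, a stored 0 means "no rating". A shifted rating that happens to equal 0 would therefore also look missing. A quick way to see how sparse the matrix is, sketched here with a hypothetical helper `matrix_density` on toy data:

```python
import numpy as np
from scipy.sparse import csr_matrix


def matrix_density(R):
    """Fraction of (user, item) cells that hold a non-zero rating."""
    R = csr_matrix(R, copy=True)   # copy so the caller's matrix is untouched
    R.eliminate_zeros()            # make sure explicit zeros are not counted
    n_users, n_items = R.shape
    return R.nnz / (n_users * n_items)


# Toy example: 2 observed ratings out of 6 cells.
toy_R = csr_matrix(np.array([[5.0, 0.0, 0.0],
                             [0.0, 3.0, 0.0]]))
toy_density = matrix_density(toy_R)
print(toy_density)
```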

## Non-Personalized Recommendations
- Create non-personalized recommendations as a baseline.
- Apply the recommendations to the test users.
- Store results in the required format for submission.
- Submit baseline recommendations.

In [181]:
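# number of non-zero (observed) ratings per item
rating_count = (R_train!=0).sum(axis=0)
# R_train stores missing ratings as 0, not NaN, so this mean is taken over all users
avg_rating = np.nanmean(R_train.todense(), axis=0)
# item labels ordered from highest to lowest average rating
best_items = items[(avg_rating*-1).argsort().tolist()[0]].tolist()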
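
Since missing ratings are stored as 0 rather than NaN, a plain column mean also averages over users who never rated the item, which pulls rarely rated items down. One possible alternative, sketched under the assumption that 0 always means "missing", divides the rating sum by the number of observed ratings (the helper name `mean_observed_rating` is hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix


def mean_observed_rating(R):
    """Per-item mean over non-zero entries only; items with no ratings get 0."""
    R = csr_matrix(R, dtype=float)
    counts = np.asarray((R != 0).sum(axis=0)).ravel()
    sums = np.asarray(R.sum(axis=0)).ravel()
    return np.divide(sums, counts,
                     out=np.zeros_like(sums, dtype=float),
                     where=counts > 0)


# Toy example: item 0 rated by two users, item 1 by one, item 2 by nobody.
toy_R = csr_matrix(np.array([[4.0, 0.0, 0.0],
                             [2.0, 6.0, 0.0],
                             [0.0, 0.0, 0.0]]))
means = mean_observed_rating(toy_R)
ranking = np.argsort(-means).tolist()   # column positions, best first
print(means, ranking)
```

The ranking holds column positions, so it still needs to be mapped to item labels before being compared with label-based lists.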

In [182]:
recs_non_person = {user: best_items for user in users}

## Evaluate results
- Calculate the evaluation metric on the validation users.
- Compare it later with the personalized recommendations.

In [183]:
# per user: all 100 column positions (0-99), ordered by held-out rating, highest first
test_recs = {users[i]: l.tolist() for i, l in enumerate((R_test*-1).toarray().argsort()[:, :100])}

In [184]:
actual = []
predicted = []
for user in users:
    actual.append(test_recs[user])
    predicted.append(recs_non_person[user])

In [185]:
metrics.mapk(actual, predicted, k=100)

0.9888408197409772
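
A score this close to 1 deserves a second look. Taking the top 100 of 100 columns returns every item for every user, including items the user never rated, and the `actual` lists hold column positions (0 to 99) while `best_items` holds item labels (1 to 100). A hedged sketch of a stricter evaluation follows: it treats only the held-out, non-zero ratings as relevant and works in column positions throughout. The helper names are hypothetical, and `apk`/`mapk` are a small re-implementation of the usual average-precision-at-k definition, not the `ml_metrics` source.

```python
import numpy as np
from scipy.sparse import csr_matrix


def apk(actual, predicted, k=10):
    """Average precision at k for a single user."""
    predicted = predicted[:k]
    if not actual:
        return 0.0
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k)


def mapk(actual, predicted, k=10):
    """Mean of apk over all users."""
    return float(np.mean([apk(a, p, k) for a, p in zip(actual, predicted)]))


def held_out_items(R_test):
    """Column positions of the non-zero (held-out) ratings, one list per user."""
    R = csr_matrix(R_test, copy=True)
    R.eliminate_zeros()
    return [R.indices[R.indptr[u]:R.indptr[u + 1]].tolist()
            for u in range(R.shape[0])]


# Toy example: 2 users, 3 items, the same ranked list for everyone.
toy_test = np.array([[0.0, 5.0, 0.0],
                     [3.0, 0.0, 4.0]])
toy_actual = held_out_items(toy_test)
toy_predicted = [[1, 0, 2]] * 2          # column positions, best first
toy_score = mapk(toy_actual, toy_predicted, k=3)
print(toy_actual, toy_score)
```

Treating every held-out rating as relevant is itself a simplification. Keeping only ratings above some threshold would be a further design choice.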

## Personalized Recommendations: Collaborative Filtering
- Compute the user similarities matrix.
- Predict ratings.
- Select the best recommendations.
- Submit recommendations.

In [193]:
user_similarity = cosine_similarity(R_train.todense(), dense_output=False)

MemoryError: Unable to allocate 40.2 GiB for an array with shape (73421, 73421) and data type float64
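
The error follows from the size of a dense user-user matrix: 73421 × 73421 × 8 bytes ≈ 40.2 GiB. With only 100 items, an item-item similarity matrix needs 100 × 100 × 8 bytes = 80 kB, so item-based collaborative filtering is one way around the problem. The sketch below swaps in that technique. It is not the notebook's user-based approach. It computes cosine similarity by hand with sparse products, and the function names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def item_cosine_similarity(R):
    """Dense (n_items, n_items) cosine similarity between the columns of R."""
    R = csr_matrix(R, dtype=float)
    gram = (R.T @ R).toarray()               # item-item dot products
    norms = np.sqrt(np.diag(gram))           # column norms
    denom = np.outer(norms, norms)
    return np.divide(gram, denom, out=np.zeros_like(gram), where=denom > 0)


def predict_ratings(R, S):
    """Similarity-weighted average of each user's observed ratings."""
    R = csr_matrix(R, dtype=float)
    rated = (R != 0).astype(float)           # 1 where a rating exists
    weighted = np.asarray(R @ S)
    weights = np.asarray(rated @ np.abs(S))
    return np.divide(weighted, weights,
                     out=np.zeros_like(weighted), where=weights > 0)


# Toy example: user 0 rated only item 0, user 1 rated both items.
toy_R = np.array([[2.0, 0.0],
                  [2.0, 2.0]])
S = item_cosine_similarity(toy_R)
pred = predict_ratings(toy_R, S)
print(S)
print(pred)
```

To turn predictions into recommendations, items the user has already rated would typically be masked out before taking the top entries of each row.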

## Evaluate results (Again)
- Calculate the evaluation metric on the validation users.

In [None]:
# YOUR CODE HERE

## Content-based Recommendations

- Compute the item similarities matrix.
- Predict ratings.
- Select the best recommendations.
- Submit recommendations.
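
The ratings files contain no item features, so a content-based approach needs them from elsewhere, for example a bag-of-words representation of each item's text. The sketch below assumes such a feature matrix `F` (items × features) already exists. `F`, the toy values, and the function names are all hypothetical. It builds each user's profile as the rating-weighted sum of item features and scores items by cosine similarity to that profile.

```python
import numpy as np


def unit_rows(X):
    """Scale each row to unit length; all-zero rows stay zero."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)


def content_based_scores(R, F):
    """Cosine similarity between user profiles (R @ F) and item features F."""
    R = np.asarray(R, dtype=float)
    F = np.asarray(F, dtype=float)
    profiles = R @ F                      # (n_users, n_features)
    return unit_rows(profiles) @ unit_rows(F).T


# Toy example: the user liked item 0; item 1 shares its feature, item 2 does not.
toy_R = np.array([[5.0, 0.0, 0.0]])
toy_F = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0]])
scores = content_based_scores(toy_R, toy_F)
print(scores)
```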

In [None]:
# YOUR CODE HERE

## Evaluate results (Yet again)
- Calculate the evaluation metric on the validation users.

In [None]:
# YOUR CODE HERE

## Potential improvements

At this point you can try to improve your prediction using several approaches:
- Aggregation of ratings from different sources. 
- Mixing Collaborative Filtering and Content-based Recommendations.
- Matrix Factorization.
- Could you use classification or regression models to predict users' preferences? 🤔
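
As a starting point for the matrix factorization idea, here is a minimal sketch using a truncated SVD from `scipy.sparse.linalg.svds`. It treats the stored zeros as actual values, which is a simplification. Mean-centering or a factorization fitted only on observed ratings would likely behave better. The function name `low_rank_predictions` and the choice of `k` are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def low_rank_predictions(R, k=2):
    """Rank-k approximation of R; entries can be read as predicted ratings."""
    R = csr_matrix(R, dtype=float)
    U, s, Vt = svds(R, k=k)        # k must be smaller than min(R.shape)
    return (U * s) @ Vt


# Toy example: a rank-1 matrix is recovered exactly with k=1.
toy_R = np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0])
approx = low_rank_predictions(toy_R, k=1)
print(np.round(approx, 3))
```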

In [None]:
# YOUR CODE HERE