# Simplified Pipeline

The following cells provide a simplified template of the steps used in Part 1 of the BLU12 Learning Notebook. These steps are not the only way to get a recommender system (RS) up and running, and we encourage you to tweak them as you see fit.

## Understanding the data

- Is the dataset that you selected appropriate for building an RS?
- Do you have data about the items, or only about the users' preferences?
- Do you have a test dataset or do you have to create it?
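
One quick way to start answering these questions is to measure how densely the users and items are connected. The following is a minimal sketch on toy data, not the BLU12 dataset; the column names `user`, `item` and `rating` are illustrative assumptions.

```python
import pandas as pd

# Toy ratings table; the column names are illustrative assumptions.
toy_ratings = pd.DataFrame({
    "user": [1, 1, 2, 3, 3, 3],
    "item": ["a", "b", "a", "a", "b", "c"],
    "rating": [7, 10, 9, 3, 10, 8],
})

n_users = toy_ratings["user"].nunique()
n_items = toy_ratings["item"].nunique()

# Fraction of the user-item grid that actually holds a rating.
density = len(toy_ratings) / (n_users * n_items)

# Ratings per user: users with very few ratings are hard to personalize for.
ratings_per_user = toy_ratings.groupby("user")["rating"].count()

print(f"{n_users} users, {n_items} items, density = {density:.2%}")
print(ratings_per_user)
```

A very low density or many users with a single rating suggests that a non-personalized fallback will be needed for part of the audience.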

In [235]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os

from evaluation import evaluate_solution
# LightFM
from lightfm import LightFM
from lightfm.data import Dataset as lfmDataset 
from sklearn.feature_extraction.text import TfidfVectorizer

from scipy.sparse import csr_matrix, save_npz
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.preprocessing import StandardScaler


## Load the Data

In [2]:
ratings = pd.read_csv("data/BookRatings.csv")
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,99,316748641,7
1,99,446677450,10
2,99,553347594,9
3,99,451166892,3
4,99,671621009,10


In [3]:
items_info = pd.read_csv("data/BooksMetaInfo.csv")
items_info.head()


Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,authors,description,pageCount,categories
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,"['Mark P. O. Morford', 'Robert J. Lenardon']",Provides an introduction to classical myths pl...,808.0,['Social Science']
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,['Richard Bruce Wright'],"In a small town in Canada, Clara Callan reluct...",414.0,['Actresses']
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,"[""Carlo D'Este""]","Here, for the first time in paperback, is an o...",555.0,['1940-1949']
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,['Gina Bari Kolata'],"Describes the great flu epidemic of 1918, an o...",330.0,['Medical']
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,['E. J. W. Barber'],A look at the incredibly well-preserved ancien...,240.0,['Design']


In [4]:
users_info = pd.read_csv("data/BooksUsers.csv")
users_info.head()

Unnamed: 0,User-ID,Location,Age
0,2,"stockton, california, usa",18.0
1,8,"timmins, ontario, canada",
2,9,"germantown, tennessee, usa",
3,10,"albacete, wisconsin, spain",26.0
4,12,"fort bragg, california, usa",


## Process and clean data
- Check if data needs to be processed and cleaned.
- Process and clean data if necessary.

In [5]:
def check_for_nans(df):
    # Count the missing values in each column.
    return df.isnull().sum()

In [6]:
check_for_nans(ratings)

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

In [7]:
check_for_nans(items_info)

ISBN                       0
Book-Title                 0
Book-Author                0
Year-Of-Publication        0
Publisher                  0
Image-URL-S                0
Image-URL-M                0
Image-URL-L                2
authors                    0
description            14815
pageCount              15197
categories             16914
dtype: int64

In [8]:
check_for_nans(users_info)

User-ID         0
Location        0
Age         24605
dtype: int64

In [9]:
# How many ratings do we have in total?
# Tip: the ":," format specifier inside the braces adds a thousands separator.
print(f"We have {len(ratings):,} ratings in total.")

We have 109,209 ratings in total.


In [10]:
# How many items were rated?
print(f" We have {ratings['ISBN'].unique().size:,} items rated.")

 We have 47,768 items rated.


In [11]:
# How many users rated at least one book?
print(f" We have {ratings['User-ID'].unique().size:,} users that rated at least one book.")

 We have 5,719 users that rated at least one book.


In [12]:
# Plotting the rating distribution.
ratings["Book-Rating"].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7f0741a36280>

In [13]:
min(ratings["Book-Rating"])

1

## Identify and separate the Users
- Which users are present in the training data?
- Make sure that you identify which test users are present in the training data and which are not.
- Can you use personalized methodologies for all users?
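
The split between known and unknown users can be sketched with plain membership checks. This is a minimal illustration on made-up user IDs; in the notebook the IDs would come from the `User-ID` columns of the training and test data.

```python
# Toy user IDs; in the notebook these would come from the "User-ID" columns.
train_users = {99, 2033, 2110, 2276}
test_users = [2033, 2276, 5000, 6000]

# Users seen during training can receive personalized recommendations.
known_users = [u for u in test_users if u in train_users]

# Users never seen in training (cold start) need a non-personalized fallback.
cold_start_users = [u for u in test_users if u not in train_users]

print(f"known: {known_users}")
print(f"cold start: {cold_start_users}")
```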

In [5]:
# Create validation set
data_train, data_val = train_test_split(ratings, test_size=0.25, random_state=42)

### Training Set


In [6]:
# How many ratings do we have in total?
print(f"We have {len(data_train):,} ratings in total.")

We have 81,906 ratings in total.


In [7]:
# How many items were rated?
print(f" We have {data_train['ISBN'].unique().size:,} items rated.")

 We have 39,192 items rated.


In [8]:
# How many users rated at least one item?
print(f" We have {data_train['User-ID'].unique().size:,} users that rated at least one item.")

 We have 5,709 users that rated at least one item.


### Validation Set

In [9]:
# How many ratings do we have in total?
print(f"We have {len(data_val):,} ratings in total.")

We have 27,303 ratings in total.


In [11]:
# How many items were rated?
print(f" We have {data_val['ISBN'].unique().size:,} items rated.")

 We have 17,204 items rated.


In [12]:
# How many users rated at least one item?
print(f" We have {data_val['User-ID'].unique().size:,} users that rated at least one item.")

 We have 5,173 users that rated at least one item.


In [85]:
# Select positive reviews from users with more than 10 positive ratings.
def select_frequent_reviewers(df: pd.DataFrame, min_nr_reviews: int = 10, min_rating: int = 6):
    """
    Select the positive reviews (rating >= min_rating) of users that have
    more than min_nr_reviews positive reviews.
    """
    
    # Select only positive reviews
    df_positive = df.copy().loc[df["Book-Rating"] >= min_rating]

    # Select users with more than min_nr_reviews positive reviews
    user_review_count = df_positive.groupby(by=["User-ID"])["ISBN"].count()
    test_users_list = list(user_review_count[user_review_count > min_nr_reviews].index)

    # Select ratings from users specified above
    df_restrict = df_positive.copy().loc[df_positive["User-ID"].isin(test_users_list)]
    
    return df_restrict

data_val_final = select_frequent_reviewers(data_val)
data_val_final.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
77379,189334,891457461,10
48418,112001,440132789,10
39923,97324,679774386,8
31698,76499,321043707,10
82275,203240,590404989,10


In [115]:
#Create the validation recommendations
# nr of recommendations per user
k_top = 10

def top_items_per_user(df: pd.DataFrame, user_col: str, rating_col:str, item_col:str, k_top: int = 10):
    df_ = df.copy()
    df_ = df_.set_index(item_col)
    df_users_kbest = df_.groupby(by=[user_col])[rating_col].nlargest(k_top).reset_index()
    df_users_kbest['rank'] = df_users_kbest.groupby(by=[user_col])[rating_col].rank(method="first")
    #df_users_kbest['rank'] = df_users_kbest['rank'].astype(int) - 1
    df_recommendations = df_users_kbest.pivot(index=user_col, columns="rank", values=item_col)
    df_recommendations = df_recommendations.reset_index(drop=False)
    df_recommendations.columns = np.arange(len(df_recommendations.columns))
    return df_recommendations

val_recommendations = top_items_per_user(data_val_final, "User-ID", "Book-Rating", "ISBN", k_top=k_top)
val_recommendations.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1903,60928069,0582037670,517084732,1564146731,316441791,151104212,0345409469,394743644,380810336,0812550706
1,2033,894716174,0439139597,836221362,886773784,965921506,836204387,1556432151,886778603,765345048,0345340426
2,2110,590956159,042516540X,373765649,394707745,373638167,807842613,0345317580,679805265,590109960,0345375580
3,2276,441317502,0812575962,897890272,399146725,451163710,684803976,0451452941,671524313,440145465,0756400147
4,2766,767902521,1556614373,1562471171,937295477,679830006,937295221,039922405X,688025072,918956730,071483839X


In [116]:
users_val = data_val_final["User-ID"].unique().tolist()
print(f"We are validating recommendations with {len(users_val)} users.")

We are validating recommendations with 404 users.


In [118]:
def save_recommendations(df: pd.DataFrame, file_name: str):
    """
    Save recommendation dataframe as .csv.
    """
    
    file_path = os.path.join("data", f"{file_name}.csv")
    df.to_csv(file_path, index=False, header=False)
    print(f"Recommendations were saved on file {file_name}.csv.")
    
save_recommendations(val_recommendations, "validation_recommendations")

Recommendations were saved on file validation_recommendations.csv.


## Create the Ratings Matrix

In [26]:
def make_ratings(data):
    return csr_matrix(data.pivot(index='User-ID', 
                                         columns='ISBN', 
                                         values='Book-Rating')
                                  # Good practice when setting the index.
                                  .sort_index()
                                  # Sparse matrices don't assume NaN value as zeros.
                                  .fillna(0)) 


R = make_ratings(data_train)

In [27]:
save_npz('ratings_matrix.npz', R)
f"We have {R.shape[0]} users and {R.shape[1]} items."

'We have 5709 users and 39192 items.'

## Non-Personalized Recommendations
- Create non-personalized recommendations as a baseline.
- Apply the recommendations to the test users.
- Store results in the required format for submission.
- Submit baseline recommendations.

In [119]:
def non_pers_reco_order(data: pd.DataFrame,
                        item_col: str,
                        rating_col:str,
                        k_top: int = 10,
                        aggregation: list = ["mean", "count"]):
    """
    Create an ordered list of non-personalized recommendations, from best rated to worst rated.
    """
    non_pers_ratings = data.groupby(by=[item_col])[[rating_col]].agg(aggregation)
    non_pers_ratings.columns = non_pers_ratings.columns.get_level_values(1)
    
    #The resulting column names might be different than the specified with the aggregation parameter.
    try:
        non_pers_ratings = non_pers_ratings.sort_values(by=aggregation, ascending=False).head(k_top)
    except KeyError as e:
        print(e)
        print("Check if aggregation argument results in valid column names.")
        print(f"aggregation = {aggregation}\nrating columns = {non_pers_ratings.columns}")
        raise e
        
    non_pers_reco_list = non_pers_ratings.index.to_list()
    return non_pers_reco_list


non_pers_recommendations = non_pers_reco_order(data_train, "ISBN", "Book-Rating", k_top=k_top)
print(non_pers_recommendations)

['0064405028', '0373790651', '0375727191', '0380720132', '039483609X', '0671729454', '1561483176', '1565548353', '0002251760', '0044409494']
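
Because the items are sorted by mean rating first and by count second, an item with a single rating of 10 ranks above a widely read item with a mean of 9. The sketch below shows this effect on toy data, together with one possible mitigation. The `min_count` threshold is an assumption for illustration and is not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "ISBN": ["A", "B", "B", "B", "C", "C"],
    "Book-Rating": [10, 9, 9, 9, 10, 6],
})

stats = toy.groupby("ISBN")["Book-Rating"].agg(["mean", "count"])

# Same ordering as in the cell above: mean first, count as tie-breaker.
by_mean = stats.sort_values(by=["mean", "count"], ascending=False)

# Hypothetical tweak: only keep items with at least `min_count` ratings.
min_count = 2
filtered = stats[stats["count"] >= min_count].sort_values(
    by=["mean", "count"], ascending=False
)

print(by_mean.index.to_list())
print(filtered.index.to_list())
```

With the filter, item `A` (one rating) drops out and the more frequently rated item `B` moves to the top.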


In [120]:
def non_pers_reco_output(user_id_list:list, non_pers_reco_list:list):
    """
    Creates a non-personalized recommendation dataframe for specified users.
    """
    nr_test_users = len(user_id_list)
    user_id_df = pd.DataFrame(user_id_list, columns = ["user_id"], dtype = int)
    non_pers_reco_repeated =  pd.DataFrame(pd.DataFrame(non_pers_reco_list).T.values.repeat(nr_test_users, axis=0))
    non_pers_reco_output = pd.concat([user_id_df, non_pers_reco_repeated], axis=1)
    
    # Reset columns numbering. Useful later.
    #non_pers_reco_output.columns = np.arange(len(non_pers_reco_output.columns))
    
    return non_pers_reco_output

In [121]:
non_pers_reco_solution_val = non_pers_reco_output(users_val, non_pers_recommendations)
save_recommendations(non_pers_reco_solution_val, "non_personalized_recommendations_VAL")

Recommendations were saved on file non_personalized_recommendations_VAL.csv.


## Evaluate results
- Calculate the evaluation metric on the validation users.
- Compare it later with the personalized recommendations.

In [123]:
## Second argument is the recommendation file to compare
evaluate_solution('non_personalized_recommendations_VAL', 'validation_recommendations')

0.0

## Personalized Recommendations: Collaborative Filtering
- Compute the user similarities matrix.
- Predict ratings.
- Select the best recommendations.
- Submit recommendations.
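
The cells in this section rely on LightFM, but the first three steps listed above can also be sketched directly with NumPy. This is a minimal user-based illustration on a toy matrix, assuming cosine similarity and a similarity-weighted average as the prediction rule; it is not the method used in the rest of the notebook.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are items, 0 means "not rated".
R_toy = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 3.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
])

# 1. User-user cosine similarities.
norms = np.linalg.norm(R_toy, axis=1, keepdims=True)
R_normalized = R_toy / np.where(norms == 0, 1.0, norms)
S = R_normalized @ R_normalized.T
np.fill_diagonal(S, 0.0)  # a user should not count as their own neighbor

# 2. Predicted scores: similarity-weighted average of the other users' ratings.
weights = np.abs(S).sum(axis=1, keepdims=True)
preds = (S @ R_toy) / np.where(weights == 0, 1.0, weights)

# 3. Exclude items the user already rated, then pick the best remaining ones.
preds[R_toy > 0] = -np.inf
k = 2
top_k = np.argsort(-preds, axis=1)[:, :k]

print(top_k)
```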

In [40]:
# LightFM lets us create the rating matrix (aka interaction matrix) and use it to generate recommendations for our users.
# We start by using the LightFM Dataset class to create the user and item mappings that define the vector space of the rating matrix.

# Notice the alias lfmDataset instead of the standard Dataset, used to distinguish the LightFM Dataset from another Dataset that we use later.
lfmdataset = lfmDataset()
lfmdataset.fit(data_train['User-ID'], data_train["ISBN"])


In [46]:
(interactions, weights) = lfmdataset.build_interactions((row for row in data_train.values))

print(repr(interactions))

<5709x39192 sparse matrix of type '<class 'numpy.int32'>'
	with 81906 stored elements in COOrdinate format>


In [48]:
lfmodel = LightFM(loss='warp')
lfmodel.fit(interactions)

<lightfm.lightfm.LightFM at 0x7f0746ac7c10>

In [334]:
def lightFM_recommendations(dataset,
                            model,
                            user_id_ext_list,
                            non_pers_reco_list,
                            k_top: int = 50,
                            item_features = None):   
    """
    Create output dataframe with recommendations based on dataset, model and list of users.
    
    This function predicts recommendations for users specified in user_id_ext_list that are present in the lightFM dataset.
    New users are recommended the items in the non-personalized list non_pers_reco_list.
    
    Parameters:
    -----------
    dataset: lightFM dataset
    
    model: lightFM trained model
    
    user_id_ext_list: list of user external IDs to predict
    
    non_pers_reco_list: list of non-personalized recommendations ordered from best to worst rated
    
    k_top: number of recommendations to create per user
    
    item_features: lightFM item features
    
    Returns:
    --------
    final_reco_df: dataframe with users' recommendations
    The first column has the users' ID and the remaining columns have the recommendations
    """
    
    assert len(user_id_ext_list) > 0, "User ID list length must be larger than 0."
    
    # Dataset mappings
    user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
    
    # reverse mapping
    item_id_map_reverse = {v: k for k, v in item_id_map.items()}
    user_id_map_reverse = {v: k for k, v in user_id_map.items()}
    
    
    # item internal ids
    item_id_int_list = list(item_id_map.values())
    
    # Split old users (user_id_int_list) from new users (user_id_ext_excluded)
    # Old users are defined in the ratings vectorial space.
    # New users are not defined in the ratings vectorial space.
    # New users receive non-personalized recommendations.
    user_id_int_list = []
    user_id_ext_excluded = []
    
    for user_id_ext in user_id_ext_list:
        try:
            user_id_int_list.append(user_id_map[user_id_ext])
        except KeyError:
            user_id_ext_excluded.append(user_id_ext)
    
    # Dataframe to store model recommendations
    model_reco_df = pd.DataFrame()
    
    # Model recommendations
    for user_id in user_id_int_list:
        scores = model.predict(user_id, item_id_int_list, item_features)
        top_items_ids = np.argsort(-scores)
        top_items_ids = [item_id_map_reverse[ids] for ids in top_items_ids]
         
        # Individual row. Two steps are necessary so that the first column is named "user_id"
        user_id_df = pd.DataFrame([user_id_map_reverse[user_id]], columns=["user_id"], dtype = int)
        top_items_ids = pd.DataFrame([top_items_ids[:k_top]])
        user_reco_df = pd.concat([user_id_df, top_items_ids], axis=1)
        
        # Concatenating rows
        model_reco_df = pd.concat([model_reco_df, user_reco_df])
        

        
        
    # Non-personalized recommendations
    non_pers_reco_df = non_pers_reco_output(user_id_ext_excluded, non_pers_reco_list)
    
    # Concatenating all recommendations
    if model_reco_df.shape[0] == 0:
        final_reco_df = non_pers_reco_df
    elif non_pers_reco_df.shape[0] == 0:
        final_reco_df = model_reco_df
    else:
        final_reco_df = pd.concat([model_reco_df, non_pers_reco_df])
    
    return final_reco_df

In [55]:
collab_reco_val = lightFM_recommendations(lfmdataset, lfmodel, users_val, non_pers_recommendations, k_top=k_top)


Unnamed: 0,user_id,0,1,2,3,4,5,6,7,8,9
0,189334,0316666343,059035342X,0312195516,0142001740,0971880107,0452282152,043935806X,0439139597,0804106304,0345370775
0,112001,0316666343,0312195516,059035342X,0971880107,0316769487,0439139597,0452282152,0345370775,0590353403,0671027360
0,97324,0316666343,0312195516,059035342X,0971880107,0142001740,0345370775,0452282152,0316769487,0804106304,0440211727
0,76499,0316666343,059035342X,0142001740,0312195516,0671027360,0971880107,0452282152,0439139597,043935806X,0345370775
0,203240,0316666343,0312195516,059035342X,0971880107,0316769487,0345370775,0439139597,0142001740,0452282152,043935806X
...,...,...,...,...,...,...,...,...,...,...,...
0,123094,0316666343,059035342X,0312195516,0142001740,0971880107,0316769487,0452282152,0439139597,0345370775,0671027360
0,172061,0316666343,059035342X,0142001740,0312195516,0452282152,0345370775,0971880107,0671027360,0316769487,0439139597
0,160819,0316666343,0312195516,059035342X,0971880107,0142001740,0439139597,0316769487,0452282152,0671027360,043935806X
0,140036,0316666343,0142001740,0312195516,0971880107,059035342X,0345370775,0671027360,0452282152,044023722X,0316769487


In [58]:
collab_reco_val
save_recommendations(collab_reco_val, "collaborative_recommendations_VAL")

Recommendations were saved on file collaborative_recommendations_VAL.csv.


## Evaluate results (Again)
- Calculate the evaluation metric on the validation users.

In [59]:
evaluate_solution('collaborative_recommendations_VAL', 'validation_recommendations')

0.20106474933207605

## Content-based Recommendations

- Compute the item similarities matrix.
- Predict ratings.
- Select the best recommendations.
- Submit recommendations.
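
The steps above can be sketched on toy data as follows. The sketch assumes that row `j` of the item profiles describes the same item as column `j` of the ratings matrix; that alignment is worth checking explicitly on real data. All names and values are illustrative.

```python
import numpy as np

# Toy item profiles: one row per item, one column per content feature.
item_profiles_toy = np.array([
    [1.0, 0.0],   # item 0: only feature A
    [0.9, 0.1],   # item 1: mostly feature A
    [0.0, 1.0],   # item 2: only feature B
])

# Toy ratings matrix: one row per user, one column per item (0 = not rated).
# Column j must refer to the same item as row j of item_profiles_toy.
R_toy = np.array([
    [5.0, 0.0, 0.0],   # user 0 rated item 0
    [0.0, 0.0, 4.0],   # user 1 rated item 2
])

# User profiles: rating-weighted sum of the profiles of the rated items.
user_profiles_toy = R_toy @ item_profiles_toy


def cosine(a, b):
    """Cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


scores = cosine(user_profiles_toy, item_profiles_toy)
scores[R_toy > 0] = 0.0  # exclude items the user already rated

best_item = scores.argmax(axis=1)
print(best_item)
```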

## Items based with NLP

In [22]:
items_info

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,authors,description,pageCount,categories
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,"['Mark P. O. Morford', 'Robert J. Lenardon']",Provides an introduction to classical myths pl...,808.0,Social Science
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,['Richard Bruce Wright'],"In a small town in Canada, Clara Callan reluct...",414.0,Actresses
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,"[""Carlo D'Este""]","Here, for the first time in paperback, is an o...",555.0,1940-1949
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,['Gina Bari Kolata'],"Describes the great flu epidemic of 1918, an o...",330.0,Medical
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,['E. J. W. Barber'],A look at the incredibly well-preserved ancien...,240.0,Design
...,...,...,...,...,...,...,...,...,...,...,...,...
112336,1582380805,Tropical Rainforests: 230 Species in Full Colo...,"Allen M., Ph.D. Young",2001,Golden Guides from St. Martin's Press,http://images.amazon.com/images/P/1582380805.0...,http://images.amazon.com/images/P/1582380805.0...,http://images.amazon.com/images/P/1582380805.0...,['Allen M. Young'],A richly illustrated guide to the tropical rai...,160.0,Nature
112337,1845170423,Cocktail Classics,David Biggs,2004,Connaught,http://images.amazon.com/images/P/1845170423.0...,http://images.amazon.com/images/P/1845170423.0...,http://images.amazon.com/images/P/1845170423.0...,['David Biggs'],,,
112338,0449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...,"['Robin Wright', 'Doyle McManus']",From two of America's most accomplished journa...,260.0,Political Science
112339,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,['Paula Danziger'],"On her own for the first time, fourteen-year-o...",150.0,Adolescence


In [223]:
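# Take an explicit copy so that later column assignments do not act on a slice of items_info.
items_info_train = items_info[items_info["ISBN"].isin(data_train["ISBN"].unique())].copy()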
items_info_train.shape

(39192, 12)

In [43]:
# Treat the brackets as literal characters, not as regular expressions.
items_info_train['categories'] = items_info_train['categories'].str.replace("[", "", regex=False).str.replace("]", "", regex=False)


In [44]:
items_info_train['categories'] = items_info_train['categories'].str.replace("'", "", regex=False)


In [45]:
all_content = items_info_train['description'] + items_info_train['categories']
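
Note that adding two string columns propagates missing values: if either `description` or `categories` is missing, the sum is missing too, and the two texts are joined without a separator. The sketch below shows a NaN-safe alternative on toy data. The space separator and the `fillna("")` choice are assumptions, and applying this to the notebook would change the downstream numbers.

```python
import pandas as pd

toy_items = pd.DataFrame({
    "description": ["A guide to myths", None, "Flu epidemic of 1918"],
    "categories": ["Social Science", "Fiction", None],
})

# Plain concatenation: any missing part makes the whole result missing.
naive = toy_items["description"] + toy_items["categories"]

# NaN-safe concatenation with a space between the two fields.
safe = (
    toy_items["description"].fillna("")
    + " "
    + toy_items["categories"].fillna("")
).str.strip()

print(naive.isna().sum())
print(safe.to_list())
```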

In [20]:
vectorizer = TfidfVectorizer()

In [52]:
item_profiles = vectorizer.fit_transform(all_content.fillna(""))
item_profiles

<39192x81721 sparse matrix of type '<class 'numpy.float64'>'
	with 1420989 stored elements in Compressed Sparse Row format>

In [97]:
item_profiles[0:10].toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [53]:
def make_user_profiles(R, item_profiles):
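    # Sparse matrix product: each user profile is the rating-weighted sum
    # of the profiles of the items that the user rated.
    return R.dot(item_profiles)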


user_profiles = make_user_profiles(R, item_profiles)
user_profiles

<5709x81721 sparse matrix of type '<class 'numpy.float64'>'
	with 2087785 stored elements in Compressed Sparse Row format>

In [56]:
def make_predictions(R, item_profiles, user_profiles):
    
    preds = cosine_similarity(user_profiles, item_profiles)
    
    # Exclude previously rated items.
    preds[R.nonzero()] = 0
    
    return csr_matrix(preds)


content_preds = make_predictions(R, item_profiles, user_profiles)
content_preds

<5709x39192 sparse matrix of type '<class 'numpy.float64'>'
	with 198631166 stored elements in Compressed Sparse Row format>

In [57]:
content_preds_array = content_preds.toarray()


array([[0.11634188, 0.07454263, 0.07293708, ..., 0.        , 0.        ,
        0.06451757],
       [0.04969373, 0.02348953, 0.02441673, ..., 0.        , 0.        ,
        0.02703826],
       [0.17243004, 0.08695993, 0.05789121, ..., 0.        , 0.        ,
        0.10520963],
       ...,
       [0.10776761, 0.06311419, 0.0608313 , ..., 0.        , 0.        ,
        0.04600699],
       [0.13229242, 0.08987273, 0.07189845, ..., 0.        , 0.        ,
        0.08770269],
       [0.07268861, 0.06176574, 0.03850712, ..., 0.        , 0.        ,
        0.04488307]])

In [86]:
val_users = data_val_final["User-ID"].unique()
train_users = data_train["User-ID"].unique()
val_index_in_train = [np.where(train_users == user)[0] for user in val_users]

In [160]:
all_recomms = []
for val_user in val_index_in_train:
    all_recomms.append(content_preds_array[val_user].argsort()[0][:10])

In [166]:
all_recomms_ISBN = []
for recom in all_recomms:
    all_recomms_ISBN.append([data_train.reset_index().iloc[index].ISBN for index in recom])
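
Two details are easy to get wrong in this step: `argsort` sorts in ascending order, and the column positions of the prediction matrix correspond to the columns of the ratings matrix rather than to the rows of `data_train`. The sketch below selects the highest-scoring items on toy data, assuming the columns follow the sorted unique item IDs (as a pandas `pivot` is expected to produce). The names `item_ids` and `scores` are illustrative.

```python
import numpy as np

# Column j of `scores` belongs to item_ids[j]; here the IDs are sorted,
# which is assumed to match the column order of the ratings matrix.
item_ids = np.array(sorted(["0451166892", "0316748641", "0553347594", "0446677450"]))

# Predicted scores for two users (rows) over four items (columns).
scores = np.array([
    [0.10, 0.80, 0.05, 0.40],
    [0.70, 0.00, 0.30, 0.20],
])

k = 2
# Negate the scores so that argsort returns the highest-scoring items first.
top_k_idx = np.argsort(-scores, axis=1)[:, :k]

# Translate column positions back to item IDs.
top_k_items = item_ids[top_k_idx]
print(top_k_items)
```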

In [172]:
def content_reco_output(user_id_list:list, reco_list:list):
    """
    Creates a content-based recommendation dataframe for the specified users.
    """
    nr_test_users = len(user_id_list)
    user_id_df = pd.DataFrame(user_id_list, columns = ["user_id"], dtype = int)
    reco_df =  pd.DataFrame(reco_list)
    reco_output = pd.concat([user_id_df, reco_df], axis=1)
    
    # Reset columns numbering. Useful later.
    #non_pers_reco_output.columns = np.arange(len(non_pers_reco_output.columns))
    
    return reco_output

In [176]:
content_recommendations_NLP_VAL = content_reco_output(val_users, all_recomms_ISBN)

In [177]:
save_recommendations(content_recommendations_NLP_VAL, "content_recommendations_NLP_VAL")

Recommendations were saved on file content_recommendations_NLP_VAL.csv.


## Evaluate results (Yet again)
- Calculate the evaluation metric on the validation users.

In [179]:
evaluate_solution('content_recommendations_NLP_VAL', 'validation_recommendations')

0.049275760909424274

## Items based without NLP, with lightFM


In [186]:
items_info_train.describe()

Unnamed: 0,pageCount
count,35585.0
mean,298.393115
std,174.886681
min,1.0
25%,192.0
50%,280.0
75%,375.0
max,3591.0


In [224]:
items_info_train["Year-Of-Publication"].unique()

array([2001, 1991, 1999, 1994, 2004, 1997, 2000, 1996, 2003, 1998, 1988,
       2002, 1993, 1979, 1995, 1992, 1986, 1978, 1983, 1987, 1990, 1961,
       0, 1989, 1982, 1985, 1975, 1965, 1941, 1970, 1962, 1971, 1972,
       1984, 1977, 1980, 1960, 1974, 1976, 1981, 1973, 1956, 1959, 1942,
       1963, 1964, 1969, 1950, 1967, 1958, 1954, 1940, 1955, 1968, 1966,
       1946, 1936, 1953, 1957, 1947, 1945, 1943, 1951, 1939, 1926, 1938,
       1932, 1952, 2005, 1949, 1923, 1927, 1930, 1920, 2020, 1911, 1902,
       1937, 2038, '1996', '2003', '2002', '1997', '1998', '1993', '1994',
       '1999', '1991', '1987', '2000', '1973', '2004', '1986', '2001',
       '1990', '0', '1995', '1988', '1978', '1992', '1976', '1975',
       '1982', '1984', '1977', '1972', '1985', '1979', '1989', '1974',
       '1980', '1971', '1981', '1983', '1964', '1955', '1970', '1920',
       '1936', '1953', '1946', '1959', '1969', '1902', '1957', '1951',
       '1939', '1935', '1806', '1967', '1954', '1961', '1968', '1

In [226]:
items_info_train = items_info_train[items_info_train["Year-Of-Publication"]!= 'DK Publishing Inc']

In [227]:
items_info_train["Year-Of-Publication"] = items_info_train["Year-Of-Publication"].astype(int)

In [228]:
items_info_train.describe()

Unnamed: 0,Year-Of-Publication,pageCount
count,39191.0,35584.0
mean,1973.416881,298.400152
std,206.794309,174.8841
min,0.0,1.0
25%,1991.0,192.0
50%,1997.0,280.0
75%,2001.0,375.0
max,2038.0,3591.0


In [231]:
# remove outliers
items_info_train = items_info_train.loc[items_info_train["Year-Of-Publication"] > 0]
items_info_train = items_info_train.loc[items_info_train["Year-Of-Publication"] < 2022]


In [233]:
items_info_train.describe()

Unnamed: 0,Year-Of-Publication,pageCount
count,38765.0,35275.0
mean,1995.050767,298.367597
std,8.284545,174.634684
min,1378.0,4.0
25%,1992.0,192.0
50%,1997.0,280.0
75%,2001.0,374.0
max,2021.0,3591.0


In [324]:
# Standardize the numeric features (zero mean, unit variance).
scaler = StandardScaler()
items_info_rescale = items_info_train.copy()
items_info_rescale = items_info_rescale[["ISBN", "Year-Of-Publication", "pageCount"]]

items_info_rescale[["Year-Of-Publication", "pageCount"]] = scaler.fit_transform(items_info_rescale[["Year-Of-Publication", "pageCount"]])
items_info_rescale.head()

Unnamed: 0,ISBN,Year-Of-Publication,pageCount
1,2005018,0.718121,0.662148
2,60973129,-0.488961,1.469559
3,374157065,0.476705,0.181137
5,399135782,-0.488961,0.667875
12,1881320189,-0.126836,-0.614821
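
Standard scaling subtracts the column mean and divides by the column standard deviation, which is why the values above are centered around zero. The sketch below reproduces the idea with NumPy on toy page counts, computing the statistics over the non-missing values only and leaving missing entries missing. The values are illustrative and are not taken from the dataset statistics.

```python
import numpy as np

page_count = np.array([808.0, 414.0, np.nan, 330.0, 240.0])

# Mean and standard deviation computed over the non-missing values only.
mean = np.nanmean(page_count)
std = np.nanstd(page_count)  # population standard deviation (ddof=0)

# Missing entries stay missing after scaling.
scaled = (page_count - mean) / std
print(scaled)
```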


In [314]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
items_info_encoding = enc.fit_transform(items_info_train["categories"].fillna("").values.reshape(-1, 1))

In [315]:
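# get_feature_names() was removed in recent scikit-learn versions; get_feature_names_out() replaces it.
items_info_encoding_df = pd.DataFrame(items_info_encoding.toarray(), columns=enc.get_feature_names_out())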

In [296]:
# some examples of categories that only appear on one item
pd.Series(items_info_encoding_df.sum()).sort_values()[:20]

x0_Easter                                                                                       1.0
x0_Gulls                                                                                        1.0
x0_Gulliver, Lemuel (Fictitious character)                                                      1.0
x0_Guinea pigs                                                                                  1.0
x0_Großbritannien - Sozialordnung - Auflösung - Zukunft - Belletristische Darstellung         1.0
x0_Großbritannien - Mittelstand - Ehepaar - Einbruchdiebstahl - Belletristische Darstellung    1.0
x0_Group counseling                                                                             1.0
x0_Greece                                                                                       1.0
x0_Gray, P. J. (Fictitious character)                                                           1.0
x0_Gratitude                                                                                    1.0


In [316]:
items_info_encoding_df

Unnamed: 0,x0_,"x0_""Aesops fables""","x0_""April Fools Day""","x0_""Artists spouses""","x0_""Authors spouses""","x0_""Bugs life (Motion picture)""","x0_""Childrens audiobooks""","x0_""Childrens costumes""","x0_""Childrens fantasy fiction""","x0_""Childrens literature""",...,x0_World history,x0_XML (Document markup language),x0_Yoga,x0_Young Adult Fiction,x0_Young Adult Nonfiction,x0_Young adult fiction,x0_Young women,"x0_Ypres, 3rd Battle of, 1917",x0_avstrijska književnost - mladinska književnost - roman,x0_poems
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38760,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38761,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38762,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38763,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [318]:
# filter out categories associated with a single item
items_info_encoding_df = items_info_encoding_df[items_info_encoding_df.columns[items_info_encoding_df.sum()>1]]
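
The filter above keeps a one-hot column only when more than one item carries that category. A minimal sketch on a toy frame (the column names and values below are made up for illustration) shows the same boolean column mask written with `.loc`:

```python
import pandas as pd

# Toy one-hot encoding: rows are items, columns are categories (illustrative values).
encoded = pd.DataFrame({
    "x0_Fiction": [1.0, 1.0, 0.0],
    "x0_Gulls": [0.0, 0.0, 1.0],
    "x0_Poetry": [0.0, 1.0, 1.0],
})

# Number of items per category; a sum of 1 means the category is tied to a single item.
item_counts = encoded.sum()

# Keep only the categories shared by at least two items.
kept = encoded.loc[:, item_counts > 1]
print(list(kept.columns))
```

A category attached to a single item cannot link that item to any other, which is presumably why it is dropped here.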

In [319]:
items_info_encoding_df = pd.concat([items_info_train["ISBN"].reset_index(drop=True), items_info_encoding_df], axis=1)
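
The `reset_index(drop=True)` call matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses toy data (hypothetical ISBN strings and one hypothetical category) to show what can happen when the two indexes differ:

```python
import pandas as pd

# ISBN column that kept its original row labels after an earlier split (toy values).
isbn = pd.Series(["a", "b", "c"], index=[10, 11, 12], name="ISBN")

# Encoder output rebuilt as a DataFrame, so it carries a fresh 0..n-1 index.
encoded = pd.DataFrame({"x0_Fiction": [1.0, 0.0, 1.0]})

# Without resetting, the indexes do not overlap and concat builds their union.
misaligned = pd.concat([isbn, encoded], axis=1)

# With a reset index, rows are paired by position as intended.
aligned = pd.concat([isbn.reset_index(drop=True), encoded], axis=1)

print(misaligned.shape, aligned.shape)
```

This pairing is only correct if the encoded rows are in the same order as `items_info_train`, which the notebook appears to assume.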

In [325]:
items_features_df = items_info_rescale.merge(items_info_encoding_df, on="ISBN")
items_features_df

Unnamed: 0,ISBN,Year-Of-Publication,pageCount,x0_,"x0_""Aesops fables""","x0_""Authors spouses""","x0_""Childrens literature""","x0_""Childrens poetry""","x0_""Childrens poetry, American.""","x0_""Childrens poetry, English""",...,x0_Vocabulary,x0_Washington (D.C.),x0_West (U.S.),x0_Whales,x0_Women,"x0_World War, 1939-1945",x0_World history,x0_Young Adult Fiction,x0_Young Adult Nonfiction,x0_Young adult fiction
0,0002005018,0.718121,0.662148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0060973129,-0.488961,1.469559,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0374157065,0.476705,0.181137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0399135782,-0.488961,0.667875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1881320189,-0.126836,-0.614821,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38760,3423200944,0.235288,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38761,3453065123,-0.247545,-0.523200,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38762,3548740146,0.718121,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38763,1845170423,1.080246,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
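
The output above still has missing `pageCount` values (for example rows 38760, 38762 and 38763). One possible way to handle them, assuming `pageCount` was standardized upstream so that 0.0 corresponds to the column mean, is to fill with 0.0 and keep an indicator column. The `pageCount_missing` column is a hypothetical addition, not part of the original pipeline:

```python
import numpy as np
import pandas as pd

# Toy slice shaped like items_features_df (values are illustrative).
features = pd.DataFrame(
    {
        "Year-Of-Publication": [0.24, -0.25, 0.72],
        "pageCount": [np.nan, -0.52, np.nan],
    },
    index=pd.Index(["3423200944", "3453065123", "3548740146"], name="ISBN"),
)

missing_before = int(features["pageCount"].isna().sum())

# Hypothetical indicator so the model can still tell imputed rows apart.
features["pageCount_missing"] = features["pageCount"].isna().astype(float)

# On a standardized column, filling with 0.0 amounts to mean imputation.
features["pageCount"] = features["pageCount"].fillna(0.0)

print(missing_before, int(features["pageCount"].isna().sum()))
```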


In [326]:
items_features_df.set_index("ISBN", drop=True, inplace=True)
items_features_df.columns = [str(i) for i in range(len(items_features_df.columns))]
items_features_df.head()

ISBN,0,1,2,3,4,5,6,7,8,9,...,688,689,690,691,692,693,694,695,696,697
0002005018,0.718121,0.662148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060973129,-0.488961,1.469559,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0374157065,0.476705,0.181137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0399135782,-0.488961,0.667875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1881320189,-0.126836,-0.614821,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
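
Renaming the columns to positional strings discards the readable feature names. If you want to inspect the features later, a lookup can be saved before the rename. This is a sketch on toy data, and `feature_name_by_id` is a hypothetical helper name:

```python
import pandas as pd

# Toy stand-in for items_features_df after set_index("ISBN") (illustrative values).
features = pd.DataFrame(
    {
        "Year-Of-Publication": [0.72, -0.49],
        "pageCount": [0.66, 1.47],
        "x0_Fiction": [1.0, 0.0],
    },
    index=pd.Index(["0002005018", "0060973129"], name="ISBN"),
)

new_names = [str(i) for i in range(len(features.columns))]

# Build the lookup from the old names before they are overwritten.
feature_name_by_id = dict(zip(new_names, features.columns))
features.columns = new_names

print(feature_name_by_id["2"])
```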


In [329]:
# itertuples returns a single-use iterator of (ISBN, feature_0, feature_1, ...) tuples
item_generator = items_features_df.itertuples(index=True, name=None)

In [330]:
content_dataset = lfmDataset()
content_dataset.fit(data_train['User-ID'], data_train["ISBN"], item_features=item_generator)

In [331]:
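# Caution: item_generator is a single-use iterator that content_dataset.fit has already consumed,
# so this call receives no rows and only LightFM's default per-item identity features are built.
item_features = content_dataset.build_item_features(item_generator)
(interactions, weights) = content_dataset.build_interactions((row for row in data_train.values))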
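
Two details of the cells above are worth checking when you adapt them. First, the iterator returned by `itertuples` can be traversed only once. Second, the LightFM documentation describes the input of `build_item_features` as `(item id, [feature names])` or `(item id, {feature name: weight})`, with the feature names registered beforehand through `fit(..., item_features=...)`. The sketch below shows both points with pandas only. The helper `item_feature_tuples` and the toy values are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for items_features_df (illustrative values).
features = pd.DataFrame(
    {"0": [0.7, -0.5], "1": [0.0, 1.0]},
    index=pd.Index(["0002005018", "0060973129"], name="ISBN"),
)

# 1) A single-use iterator: the second pass yields nothing.
rows = features.itertuples(index=True, name=None)
first_pass = list(rows)
second_pass = list(rows)

# 2) Hypothetical helper producing (item id, {feature name: weight}) pairs.
#    Zero and missing weights are skipped to keep the dictionaries sparse.
def item_feature_tuples(df):
    for isbn, row in df.iterrows():
        weights = {
            name: float(w) for name, w in row.items() if pd.notna(w) and w != 0
        }
        yield isbn, weights

feature_names = list(features.columns)          # candidates for fit(item_features=...)
tuples = list(item_feature_tuples(features))    # candidates for build_item_features(...)

print(len(first_pass), len(second_pass))
print(tuples)
```

With this shape, every ISBN passed to `build_item_features` also has to be known to the dataset, so items absent from `data_train` would need to be filtered out or registered first.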

In [332]:
content_model = LightFM(loss='warp')
content_model.fit(interactions, item_features=item_features)

<lightfm.lightfm.LightFM at 0x7f54198e4be0>

In [335]:
content_reco_val = lightFM_recommendations(content_dataset,
                                           content_model,
                                           users_val,
                                           non_pers_recommendations,
                                           k_top=k_top,
                                           item_features = item_features)
save_recommendations(content_reco_val, "content_recommendations_VAL")

Recommendations were saved on file content_recommendations_VAL.csv.


In [336]:
evaluate_solution('content_recommendations_VAL', 'validation_recommendations')

0.21095264288333596

## Potential improvements

At this point you can try to improve your prediction using several approaches:
- Aggregation of ratings from different sources. 
- Mixing Collaborative Filtering and Content-based Recommendations.
- Matrix Factorization.
- Could you use a classification or regression model to predict users' preferences? 🤔
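
As one possible starting point for mixing collaborative filtering and content-based recommendations, the sketch below blends two score vectors for the same user after min-max scaling. The function `blend_scores`, the weight `alpha` and the toy scores are assumptions for illustration; other mixing schemes (rank averaging, switching, stacking) are equally valid:

```python
import numpy as np

def blend_scores(cf_scores, content_scores, alpha=0.5):
    """Weighted average of two score vectors after rescaling each to [0, 1]."""
    def min_max(values):
        values = np.asarray(values, dtype=float)
        span = values.max() - values.min()
        if span == 0:
            return np.zeros_like(values)
        return (values - values.min()) / span

    return alpha * min_max(cf_scores) + (1 - alpha) * min_max(content_scores)

# Toy scores for four candidate items for one user (illustrative values).
cf = [0.2, 1.5, 0.9, -0.3]
content = [10.0, 2.0, 8.0, 4.0]

blended = blend_scores(cf, content, alpha=0.5)

# Indices of the two best items under the blended score.
top_k = np.argsort(-blended)[:2]
print(top_k)
```

Rescaling first keeps one model from dominating just because its raw scores live on a larger scale. `alpha` can then be tuned on the validation users.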

In [34]:
# YOUR CODE HERE