<a href="https://colab.research.google.com/github/YuliiaChorna1/DataScience-10-Reccomender-systems/blob/main/10_2_2_extra_lightfm_for_recommend_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Recommendation in Python: LightFM

In [39]:
!pip install lightfm



In [40]:
!pip  install scikit-optimize



In [41]:
!pip install pandas-profiling



In [42]:
!pip install ydata-profiling



In [43]:
# import dependent libraries
import os
import random
import numpy as np
import pandas as pd
import ydata_profiling as pandas_profiling
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from scipy.sparse import csr_matrix
from IPython.display import display_html
import seaborn as sns
import warnings
%matplotlib inline

from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm import LightFM
from skopt import forest_minimize

In [44]:
def display_side_by_side(*args):
    html_str = ""
    for df in args:
        html_str += df.to_html()
    display_html(html_str.replace(
        "table", "table style='display:inline'"), raw=True)

# update the working directory to the root of the project
os.chdir("..")
warnings.filterwarnings("ignore")

## Goodreads Data

The datasets were collected in late 2017 from goodreads.com, where we only scraped users' public shelves, i.e. everyone can see it on web without login. User IDs and review IDs are anonymized.
We collected these datasets for academic use only. Please do not redistribute them or use for commercial pupposes.
There are 3 groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves) and (3) users' detailed book reviews. These datasets can be merged together by matching book/user/review ids. For the purposes of this tutorial, we'll be using only the former two.

You can downlod the dataset using in this article from here:
1. Books Metadata: https://drive.google.com/uc?id=1H6xUV48D5sa2uSF_BusW-IBJ7PCQZTS1
2. User-Book Interactions: https://drive.google.com/uc?id=17G5_MeSWuhYnD4fGJMvKRSOIBqCCimxJ

###Load Raw Data

In [45]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [46]:
path = "/content/drive/MyDrive/Recommender_systems"

In [47]:
%%time
books_metadata = pd.read_json(path + "/goodreads_books_poetry.json", lines=True)
interactions = pd.read_json(path + "/goodreads_interactions_poetry.json", lines=True)

CPU times: user 58.5 s, sys: 12.7 s, total: 1min 11s
Wall time: 1min 33s


### Data Inspection & Preparation: Books Metadata

Let's start by inspecting the books' metadata information. To develop a reliable and robust ML model, it is essential to get a thorough understanding of the available data.

As the first step, let's take a look at all the available fields, and sample data

In [48]:
books_metadata.columns.values

array(['isbn', 'text_reviews_count', 'series', 'country_code',
       'language_code', 'popular_shelves', 'asin', 'is_ebook',
       'average_rating', 'kindle_asin', 'similar_books', 'description',
       'format', 'link', 'authors', 'publisher', 'num_pages',
       'publication_day', 'isbn13', 'publication_month',
       'edition_information', 'publication_year', 'url', 'image_url',
       'book_id', 'ratings_count', 'work_id', 'title',
       'title_without_series'], dtype=object)

In [49]:
books_metadata.sample(2)

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,popular_shelves,asin,is_ebook,average_rating,kindle_asin,...,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series
25445,571108903.0,1,[],US,,"[{'count': '12', 'name': 'to-read'}, {'count':...",,False,4.07,,...,,,,https://www.goodreads.com/book/show/10991360-s...,https://s.gr-assets.com/assets/nophoto/book/11...,10991360,0,2477224,Season Songs,Season Songs
4286,,1,[],US,eng,"[{'count': '1322', 'name': 'to-read'}, {'count...",,False,3.92,,...,6.0,,2016.0,https://www.goodreads.com/book/show/32866970-t...,https://images.gr-assets.com/books/1478204489m...,32866970,1,45899520,The White Cat and the Monk: A Retelling of the...,The White Cat and the Monk: A Retelling of the...


In [50]:
books_metadata.shape

(36514, 29)

While all the available information is vital to extract contextual information to be able to train a better recomandation system, for this example, we'll only focus on the selected fields that require minimal manipulation.

In [51]:
# Limit the books metadata to selected fields
books_metadata_selected = books_metadata[["book_id", "average_rating", "is_ebook", "num_pages",
                                          "publication_year", "ratings_count", "language_code"]]
books_metadata_selected.sample(5)

Unnamed: 0,book_id,average_rating,is_ebook,num_pages,publication_year,ratings_count,language_code
6564,22733645,3.28,False,335,2014,368,spa
4476,310883,4.09,False,1200,2005,273,en-US
14242,52860,4.04,False,960,2005,469,en-US
1817,25620725,4.22,False,80,2015,18,
19716,928548,3.59,False,55,1995,29,


Now that we have the data with selected fields, next we'll run it through pandas profiler to perform preliminary exploratory data analysis to help us better understand the available data

In [52]:
# replace blank cells with NaN
books_metadata_selected.replace("", np.nan, inplace=True)

# not takingbook_id into the profiler report
profile = pandas_profiling.ProfileReport(books_metadata_selected[["average_rating", "is_ebook", "num_pages",
                                          "publication_year", "ratings_count"]])
profile.to_file(path + "/results/profiler_books_metadata_1.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

In [53]:
profile



Considering the results from the profiler, we'll perform following transformations to the dataset:
- Replace the missing value of categorical values with another value to create a new category
- Convert bin values for numeric variables into discrete intervals

In [54]:
# using pandas cut method to convert fields into discrete intervals
books_metadata_selected["num_pages"].replace(np.nan, -1, inplace=True)
books_metadata_selected["num_pages"] = pd.to_numeric(books_metadata_selected["num_pages"])
books_metadata_selected["num_pages"] = pd.cut(books_metadata_selected["num_pages"], bins=25)

# rounding ratings to neares .5 score
books_metadata_selected["average_rating"] = books_metadata_selected["average_rating"].apply(lambda x: round(x*2)/2)

# using pandas qcut method to convert fields into quantile-based discrete intervals
books_metadata_selected["ratings_count"] = pd.qcut(books_metadata_selected["ratings_count"], 25)

# replacing missing values to year 2100
books_metadata_selected["publication_year"].replace(np.nan, 2100, inplace=True)

# replacing missing values to "unknown"
books_metadata_selected["language_code"]. replace(np.nan, "unknown", inplace=True)

# convert is_ebook column into 1/0 where true=1 and false=0
books_metadata_selected["is_ebook"] = books_metadata_selected.is_ebook.map(
    lambda x: 1.0*(x == "true"))

In [55]:
profile = pandas_profiling.ProfileReport(books_metadata_selected[["average_rating", "is_ebook", "num_pages",
                                                                  "publication_year", "ratings_count"]])
profile.to_file(path + "/results/profiler_books_metadata_2.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

In [56]:
books_metadata_selected.sample(5)

Unnamed: 0,book_id,average_rating,is_ebook,num_pages,publication_year,ratings_count,language_code
6081,32334098,4.0,0.0,"(-11.961, 437.44]",2017,"(614.0, 1029527.0]",eng
20565,26273646,4.5,0.0,"(-11.961, 437.44]",2015,"(-0.001, 2.0]",unknown
8909,6478161,4.0,0.0,"(-11.961, 437.44]",2010,"(40.0, 49.0]",unknown
2950,29873311,4.5,0.0,"(-11.961, 437.44]",2016,"(179.0, 285.0]",eng
23006,1246895,4.5,0.0,"(-11.961, 437.44]",2006,"(14.0, 16.0]",unknown


## Data inspection & Preparation: Interactions data

As the first step, let's take a look at th eavailable fields, and sample data

In [57]:
interactions.columns.values

array(['user_id', 'book_id', 'review_id', 'is_read', 'rating',
       'review_text_incomplete', 'date_added', 'date_updated', 'read_at',
       'started_at'], dtype=object)

In [58]:
interactions.sample(5)

Unnamed: 0,user_id,book_id,review_id,is_read,rating,review_text_incomplete,date_added,date_updated,read_at,started_at
2366126,62fe6344c02c0612ea58c89bea5ea913,9516801,0d05d1c161d8762245cde1a9e0accb92,False,0,,Fri Jul 20 22:50:37 -0700 2012,Sat Jul 28 02:30:30 -0700 2012,,
411459,9dbf404a99f619a917b841a958d58b91,203220,55f193693f4b5a39e4fdc153dbd035e7,True,5,,Wed Feb 17 03:28:48 -0800 2010,Wed Feb 17 03:28:48 -0800 2010,,
1243596,9f1c604c406add8b56f2eedb6a570a93,25332002,e1aa82114f300b7de5035efac7112022,False,0,,Wed Aug 24 05:44:13 -0700 2016,Wed Aug 24 05:44:13 -0700 2016,,
1626652,8c8a2ff47978a902d671d80e0712ef9e,152236,75c57c9d33825047e29be8a8f4c8b38b,True,5,,Sat Mar 13 17:18:32 -0800 2010,Sat Mar 13 17:18:32 -0800 2010,,
2283736,60bae9ddadeb5802135a7f31d020d99f,10674301,6de80aa8f3e7b1c0417ceb1f6dc2dbf5,False,0,,Sat Nov 24 09:11:30 -0800 2012,Sun Nov 25 12:28:41 -0800 2012,,


In [59]:
interactions.shape

(2734350, 10)

While all the available information is vital to extract contextual information to be able to train a better recommendation system, for this example, we'll only focus on the selected fields that require minimal manipulation.

In [60]:
# Limit the books metadata to selected fields
interactions_selected = interactions[["user_id", "book_id", "is_read", "rating"]]

# mapping boolean to string
booleanDictionary = {True: "true", False: "false"}
interactions_selected["is_read"] = interactions_selected["is_read"].replace(booleanDictionary)

interactions_selected.sample(5)

Unnamed: 0,user_id,book_id,is_read,rating
2441390,e41a4923e0847a61bde534ca1b3bebb4,1715,True,4
2182854,e97548edd6ae1fcb571fcc08d5898c2a,23453114,False,0
516179,ccf7b1811fb2a39097db2a5389424d95,23627418,False,0
650080,592f2bc260df866d999213c124362681,6798263,True,4
72644,d1a88941530459034691ed7c3dd08988,293068,True,0


In [61]:
profile = pandas_profiling.ProfileReport(interactions_selected[["is_read", "rating"]])
profile.to_file(path + "/results/profiler_interactions.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

Considering results from the profiler, we'll perform following transformations to the dataset:
- Convert is_read column to 1/0

In [62]:
# convert is_read column into 1/0 where true=1 and false=0
interactions_selected["is_read"] = interactions_selected.is_read.map(
    lambda x: 1.0*(x == "true"))

In [63]:
interactions_selected.sample(10)

Unnamed: 0,user_id,book_id,is_read,rating
2381198,8d363fb6a811f6c5805c090c44bfdb55,854644,0.0,0
70366,e72e18a248ed2acc72b73e071d63dd20,12812801,0.0,0
1861947,8184c9cf3bfcb1b37fe20e6e173b3967,99436,1.0,1
176433,a7051663eaf6bfd33d6a70bedecdd6f7,303502,1.0,0
953747,6826c192d6eb11e559e45bd3f5ae8610,16166468,0.0,0
2535598,e8c6d9785e8d8181add42a560eab9a71,30118,1.0,3
925731,964cbba204268fa310db9084ec692267,120726,0.0,0
1008286,859fb6b8671bc1e9bd5c404d555fb76e,8244358,1.0,0
2372664,d889db4c9eb100693d812382e6b708db,17317727,0.0,0
1014933,ab719f895f5c6e0888bd3c8cd6562bd1,34594982,0.0,0


Since we have two fields denoting interaction between a user and a book, `is_read` and `rating` - let's see how many data points we have where the user hasn't read the book but have given the ratings.

In [64]:
interactions_selected.groupby(["rating", "is_read"]).size().reset_index().pivot(columns="rating", index="is_read", values=0)

rating,0,1,2,3,4,5
is_read,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,1420740.0,,,,,
1.0,84551.0,20497.0,64084.0,237942.0,405565.0,500971.0


From the above results, we can conclusively infer that users with ratings >= 1 have all read the book. Therefore, we'll use the `ratings` as the final score, drop interactions where `is_read` is false, and limit interactions from random 500 users to limit the data size for further analysis

In [65]:
interactions_selected = interactions_selected.loc[interactions_selected["is_read"]==1, ["user_id", "book_id", "rating"]]

interactions_selected = interactions_selected[interactions_selected["user_id"].isin(random.sample(list(interactions_selected["user_id"].unique()), k=5000))]

interactions_selected.sample(10)

Unnamed: 0,user_id,book_id,rating
357152,d929217b809e65a194fa37c84f5f016d,30119,5
2339308,bd9a2a203269a98a2317b17b2afa0c14,99944,2
1571362,b21b54825fb2d297e9eb6fcde51956f8,112200,5
153804,dd33bfd1040abf67943cb826cd2f2a77,50470,5
138658,3f47aca35d51c49cbc86a06a3739ab0f,3049,3
1335166,f17edb9fa114ded202b3fdab094df4d3,1829903,3
2640700,32c238b8e7e36016ac810ec57b997b6f,144611,5
2337521,0cc53394ed77243ccc84335561b93265,20413,3
152134,df4ab047550ec45afa251087a834cea1,95819,4
53945,02d9b0a1d38482cc16ed187fb7f868b1,1381,3


In [66]:
interactions_selected.shape

(22916, 3)

## Data preprocessing

Now, let's transform the available data into CSR aparse matrix that can be used for matrix operations. We will start by the process by creating books_metadata matrix which is np.float64 csr_matrix of shape([n_books, n_features]) - Each row contains taht books weights over features. However, before we create a sparse matrix, we'll first create an item dictionar for future references

In [67]:
item_dict = {}
df = books_metadata[["book_id", "title"]].sort_values("book_id").reset_index()

for i in range(df.shape[0]):
    item_dict[(df.loc[i, "book_id"])] = df.loc[i, "title"]

In [68]:
# dummify categorical features
books_metadata_selected_transformed = pd.get_dummies(books_metadata_selected, columns = ["average_rating", "is_ebook", "num_pages",
                                                                                         "publication_year", "ratings_count",
                                                                                         "language_code"])

books_metadata_selected_transformed = books_metadata_selected_transformed.sort_values("book_id").reset_index().drop("index", axis=1)
books_metadata_selected_transformed.head(5)

Unnamed: 0,book_id,average_rating_0.0,average_rating_1.0,average_rating_1.5,average_rating_2.0,average_rating_2.5,average_rating_3.0,average_rating_3.5,average_rating_4.0,average_rating_4.5,...,language_code_tel,language_code_tgl,language_code_tha,language_code_tlh,language_code_tur,language_code_ukr,language_code_unknown,language_code_urd,language_code_vie,language_code_zho
0,234,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,True,False,False,False
1,236,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,True,False,False,False
2,241,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
3,244,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
4,254,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,True,False,False,False


In [69]:
# convert to csr matrix
books_metadata_csr = csr_matrix(books_metadata_selected_transformed.drop("book_id", axis=1).values)
books_metadata_csr

<36514x357 sparse matrix of type '<class 'numpy.bool_'>'
	with 219084 stored elements in Compressed Sparse Row format>

Next we'll create interactions matrix which is np.float64 csr_matrix of shape([n_users, n_books]). We'll also create a user dictionary for future use cases

In [70]:
user_book_interaction = pd.pivot_table(interactions_selected, index="user_id", columns="book_id", values="rating")

# fill missing values with 0
user_book_interaction = user_book_interaction.fillna(0)

user_book_interaction.head(10)

book_id,234,241,244,254,285,286,290,291,292,676,...,35960350,36122873,36126998,36153320,36262212,36262245,36270857,36276527,36282926,36350410
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0017507d4413a03fbfa5848972658206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0019e891665331a2d57eceda5f73cc43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
001c711547901be937ce4fb25380a433,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0021909de18ab01ec27517ea8dc0aa93,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0027e0ab09e095e4e8dabf0d3c0fe48d,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
002a98d66696f786d49898d9e6fc5cd9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00321c623290ac06fe5c63e02b3cbdbd,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0044b5fc499e772b3fd41f105271824e,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00598b7666b3b0e9c17d882af05b1c8e,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
005bcc76d9816d6f98c8eec0eeb9d816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [71]:
user_id = list(user_book_interaction.index)
user_dict = {}
counter = 0

for i in user_id:
    user_dict[i] = counter
    counter += 1

In [72]:
# convert to csr matrix
user_book_interaction_csr = csr_matrix(user_book_interaction.values)
user_book_interaction_csr

<5000x6625 sparse matrix of type '<class 'numpy.float64'>'
	with 21493 stored elements in Compressed Sparse Row format>

# Model training

Ideally, we would build, train and eavluate several models for our recommender system to determine which model holds the most promise for further optimization (hyper-parameter tuning).
However, for this tutorial, we'll train the base model, with randomly selected input parameters for demonstrations

In [73]:
model = LightFM(loss="warp",
                random_state=2016,
                learning_rate=0.90,
                no_components=150,
                user_alpha=0.000005)

model = model.fit(user_book_interaction_csr,
                  epochs=100,
                  num_threads=16, verbose=False)

## Top recommendations

In [74]:
def sample_recommendation_user(model, interactions, user_id, user_dict,
                         item_dict, threshold=0, nrec_items=5, show=True):

    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x, np.arange(n_items), item_features=books_metadata_csr))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))

    known_items = list(pd.Series(interactions.loc[user_id, :] \
                                 [interactions.loc[user_id, :] > threshold].index).sort_values(ascending=False))

    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show == True:
        print("User: " + str(user_id))
        print("Known Likes:")
        counter = 1
        for i in known_items:
            print(str(counter) + "- " + i)
            counter += 1

        print("\n Recommended Items:")
        counter = 1
        for i in scores:
            print(str(counter) + "- " + i)
            counter += 1

In [76]:
sample_recommendation_user(model, user_book_interaction, "002a98d66696f786d49898d9e6fc5cd9", user_dict, item_dict)

User: 002a98d66696f786d49898d9e6fc5cd9
Known Likes:

 Recommended Items:
1- Старшая Эдда
2- Μικρό βιβλίο για μεγάλα όνειρα
3- ارغنون
4- Faithful and Virtuous Night
5- Alussa oli nainen


In [77]:
sample_recommendation_user(model, user_book_interaction, "0019e891665331a2d57eceda5f73cc43", user_dict, item_dict)


User: 0019e891665331a2d57eceda5f73cc43
Known Likes:
1- The Melancholy Death of Oyster Boy and Other Stories
2- Making Cocoa for Kingsley Amis
3- Tell Me the Truth about Love
4- Beastly Tales from Here and There
5- Revolting Rhymes
6- The Complete Nonsense of Edward Lear

 Recommended Items:
1- Book of Blues
2- Portable Kisses
3- Notebook of a Return to My Native Land / Cahier D'Un Retour Au Pays Natal
4- Les Cent Plus Beaux Poemes de La Langue Franc
5- النبي
