## Main idea: 


Building a content-based recommender system to combine the results of collaborative filtering to create a hybrid recommender system

### Data cleaning and preproccesing

In [69]:
import gzip
import json
import pandas as pd
from tqdm import tqdm
import os
from datetime import datetime

### Adding content meta data

Download dump of Amazon Books Meta-data from http://deepyeti.ucsd.edu/jianmo/amazon/index.html - 
around 3M unique item

#### Parsing gzip file 

The file size is over 1.2G, so let's create an iterator to convert the file by parts

In [3]:
books_df = pd.read_csv('data/books_cleaned.csv', low_memory=False)

In [19]:
filename = '/Users/elv1ento/Downloads/meta_Books.json.gz'
batch_size = 500000
dir_name = 'dump_butches'

if not os.path.exists(dir_name):
    os.makedirs(dir_name)
    
    
def parse(filename):
    f = gzip.open(filename, 'r')
    
    # select only target fields
    keys = ['asin', 'category', 'description']
    
    for l in f:
        entity = json.loads(l)
        yield {key:entity.get(key) for key in keys}
        

# create iterator object
res = parse(filename)


# loop for extracting and saving data
count = 1
while res:
    
    part = []
    
    for _ in range(batch_size):
        try:
            part.append(next(res))
        except StopIteration:
            res = False
            break
    
    if part:
        count +=1
        df = pd.DataFrame(part)
        
        # select only the data that is in our books database
        df = df[df.asin.isin(books_df.ISBN)]
        
        if not df.empty:
            
            # convert list to sting
            df = df.assign(
                category = df.category.apply(lambda x: ', '.join(x)),
                description = df.description.apply(lambda x: ', '.join(x))
            )
            
            # save batch
            df.to_csv(f'{dir_name}/book_meta_{count}.csv', index=False)
            print(f'[{datetime.now().strftime("%H:%m:%S")}] '
                  f'Loaded new batch with {df.shape[0]} rows')

[22:06:31] Loaded new batch with 54899 rows
[22:06:49] Loaded new batch with 29045 rows
[22:06:05] Loaded new batch with 6366 rows
[22:06:20] Loaded new batch with 7744 rows
[22:06:36] Loaded new batch with 4615 rows
[22:06:48] Loaded new batch with 1180 rows


### Cleaning and preprocessing text data

Using fast regular expressions to clean up data and Spacy to tokenize and lemmatize text

In [9]:
import re
import html
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English


punctuations = string.punctuation
nlp = spacy.load("en_core_web_sm")
stop_words = spacy.lang.en.stop_words.STOP_WORDS


def text_cleaner(sentence):
    
    # unescape
    sentence = html.unescape(sentence)

    # remove HTML tags
    regex = re.compile('<.*?>')
    sentence = regex.sub('', sentence)
    
    # remove punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    sentence = regex.sub('', sentence)

    # tokenize and lemmatize
    sentence = ' '.join([word.lemma_ for word in nlp(sentence)])

    # remove stop words
    regex = re.compile(' | '.join(stop_words))
    sentence = regex.sub(' ', sentence)
    
    regex = re.compile(r'\s+')
    sentence = regex.sub(' ', sentence)

    
    return sentence.split(' ')

### Parallelizing Text Data Processing Using Dask

Dask is a free and open-source library for parallel computing in Python.

Dask's schedulers scale to thousand-node clusters and its algorithms have been tested on some of the largest supercomputers in the world.

Dask ships with schedulers designed for use on personal machines. Many people use Dask today to scale computations on their laptop, using multiple cores for computation and their disk for excess storage.

In [12]:
import dask
from dask.distributed import Client
import dask.dataframe as dd

In [13]:
client = dask.distributed.Client()

In [14]:
client

0,1
Connection method: Cluster object,Cluster type: LocalCluster
Dashboard: http://127.0.0.1:8787/status,

0,1
Status: running,Using processes: True
Dashboard: http://127.0.0.1:8787/status,Workers: 4
Total threads:  8,Total memory:  16.00 GiB

0,1
Comm: tcp://127.0.0.1:51074,Workers: 4
Dashboard: http://127.0.0.1:8787/status,Total threads:  8
Started:  Just now,Total memory:  16.00 GiB

0,1
Comm: tcp://127.0.0.1:51081,Total threads: 2
Dashboard: http://127.0.0.1:51085/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:51078,
Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-t3vvghi9,Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-t3vvghi9

0,1
Comm: tcp://127.0.0.1:51083,Total threads: 2
Dashboard: http://127.0.0.1:51086/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:51079,
Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-0mrk751z,Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-0mrk751z

0,1
Comm: tcp://127.0.0.1:51080,Total threads: 2
Dashboard: http://127.0.0.1:51087/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:51076,
Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-s7s0jblq,Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-s7s0jblq

0,1
Comm: tcp://127.0.0.1:51082,Total threads: 2
Dashboard: http://127.0.0.1:51084/status,Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:51077,
Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-qm86oxdv,Local directory: /Users/elv1ento/PycharmProjects/DSrecommender/app/dask-worker-space/worker-qm86oxdv


In [20]:
filename = f'{dir_name}/*.csv'
df = dd.read_csv(filename)

In [21]:
df.compute()

Unnamed: 0,asin,category,description
0,0000913154,"Books, Engineering & Transportation, Engineering",
1,0001047868,"Books, Literature & Fiction, Classics",Grade 6 Up-Kidnapped by Robert Louis Stevenson...
2,0001056107,"Books, Literature & Fiction, Classics",
3,0001053744,"Books, Literature & Fiction, Classics",While many readers are familiar with SIR GAWAI...
4,0001821504,"Books, Literature & Fiction",Paddington is the famous bear from darkest Per...
...,...,...,...
1175,9999344968,"Books, Politics &amp; Social Sciences, Politic...",
1176,B000023VWY,"Books, Literature & Fiction, United States","<B>Penzler Pick, March 2000</B>: How does one ..."
1177,B0000AA9JU,"Books, Literature & Fiction, Action & Adventure",
1178,B000023VWY,"Books, Literature & Fiction, United States","<B>Penzler Pick, March 2000</B>: How does one ..."


From the 2.9M database matched 103K books. For better results can download a larger metadata dump, for example using wikipedia https://www.mediawiki.org/wiki/Download.

In [23]:
df = pd.read_csv('data/normalized_dask.csv').drop(['item_id'], axis=1)
df

Unnamed: 0,ISBN,Fulltext
0,0000913154,engineer transportation engineering way thing ...
1,0001047868,literature fiction classic grade 6 upkidnappe ...
2,0001053744,literature fiction classic many reader familia...
3,0001056107,literature fiction classic farmer gile ham oth...
4,0001061127,excellent approach teach basic chess great bo...
...,...,...
103911,9999364497,new use rental textbook the flap this 274 page...
103912,B000023VWY,literature fiction united state penzler pick m...
103913,B000023VWY,literature fiction united state penzler pick m...
103914,B0000AA9JU,literature fiction action adventure aftermath ...


In [24]:
# use map_partitions for performance improvement

df = df.set_index('asin').fillna(' ')

# add text columns: Author, Title
df = df.join(books_df.set_index('ISBN'), how='inner').iloc[:, :-5]

# remove the word "Book", which appears in each row of the category
df['category'] = df.map_partitions(lambda df: df.category.apply(
    lambda x: x.replace(' & ', ', ').replace('Books, ', '')))

# connect all text fields for the subsequent receipt of embeddings of each book
df = df.apply(
    lambda x: ' '.join(x.values).lower(), axis=1, meta=(None, 'object')
).to_frame(name='Fulltext')

# tokenize and lemmatize each row for better vectorization
df['Fulltext'] = df.map_partitions(lambda df: df.Fulltext.apply(text_cleaner))

df = df.compute(sheduler='processes')

df = df.reset_index().rename(columns={'asin': 'ISBN'})
df

Unnamed: 0,ISBN,Fulltext
0,0000913154,engineer transportation engineering way thing ...
1,0001047868,literature fiction classic grade 6 upkidnappe ...
2,0001053744,literature fiction classic many reader familia...
3,0001056107,literature fiction classic farmer gile ham oth...
4,0001061127,excellent approach teach basic chess great bo...
...,...,...
103911,9999364497,new use rental textbook the flap this 274 page...
103912,B000023VWY,literature fiction united state penzler pick m...
103913,B000023VWY,literature fiction united state penzler pick m...
103914,B0000AA9JU,literature fiction action adventure aftermath ...


In [None]:
# save result
df.to_csv('data/normalized_dask.csv')

### Transform text into Word embeddings

In detail, TF IDF is composed of two parts: TF which is the term frequency of a word, i.e. the count of the word occurring in a document and IDF, which is the inverse document frequency, i.e. the weight component that gives higher weight to words occuring in only a few documents.

In [30]:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

In [27]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['Fulltext'].values)

In [28]:
tfidf_matrix

<103916x328229 sparse matrix of type '<class 'numpy.float64'>'
	with 8225077 stored elements in Compressed Sparse Row format>

#### Simple logic to get similar vectors using cosine similarities

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space

The distances between the row vectors of X and the row vectors of Y can be evaluated using pairwise_distances. If Y is omitted the pairwise distances of the row vectors of X are calculated. Similarly, pairwise.pairwise_kernels can be used to calculate the kernel between X and Y using different kernel functions.

In [33]:
from sklearn.metrics.pairwise import linear_kernel

In [37]:
cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()
related_docs_indices = cosine_similarities.argsort()[:-5:-1]

In [38]:
cosine_similarities

array([1.        , 0.00439163, 0.00540603, ..., 0.07314789, 0.        ,
       0.        ])

In [39]:
related_docs_indices

array([    0, 57392, 73663, 27221])

In [None]:
# save matrix
pickle.dump(tfidf_matrix, open('tfidf_matrix.pkl', 'wb'))

#### Improving vectorization

To improve the results of vectorization, we can use other tools to obtain embeddings. For example, Bert, who actively uses Google or Word2Vec from Gensim. Also, in order to use the categories of books we have, we can use CountVectorizer for more accurate classification. 

### Solving the problem of user search query context using Okapi BM25

In information retrieval, Okapi BM25 (BM is an abbreviation of best matching) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework. BM25 represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval.
https://en.wikipedia.org/wiki/Okapi_BM25

In [40]:
from rank_bm25 import BM25Okapi

In [42]:
bm25 = BM25Okapi(df['Fulltext'].apply(lambda x: x.split(' ')))

In [57]:
def get_relevance_items(query):
    tokenized_query = text_cleaner(query)
    doc_scores = bm25.get_scores(tokenized_query)
    score_dict = dict(zip(df.index, doc_scores))
    return sorted(score_dict, key=score_dict.get, reverse = True)  

In [58]:
query = "lord of the ring tolkien"

doc_ranking = get_relevance_items(query)
books_df.set_index('ISBN').loc[df.loc[doc_ranking[:10]].ISBN.tolist()].iloc[:,:-3]

Unnamed: 0_level_0,Book-Title,Book-Author,Year-Of-Publication,Publisher
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
618195580,The Fellowship of the Ring Photo Guide (The Lo...,Alison Sage,2001,Houghton Mifflin
141315741,"The Magical Worlds of the \Lord of the Rings\""""",David Colbert,2002,Puffin Books
395498635,The Return of the Shadow: The History of The L...,Christopher Tolkien,1989,Houghton Mifflin
618257365,The Two Towers Movie Photo Guide (The Lord of ...,David Brawn,2002,Houghton Mifflin
618258000,The Making of the Movie Trilogy (The Lord of t...,Brian Sibley,2002,Houghton Mifflin
801030145,The J.R.R. Tolkien Handbook: A Concise Guide t...,Colin Duriez,1992,Baker Books
1892975904,The People's Guide to J.R.R. Tolkien,Erica Challis,2003,Cold Spring Press
898452236,The Lord of the Rings : The Two Towers and the...,J.R.R. Tolkien,1988,HarperAudio
553456539,The Lord of the Rings (BBC Dramatization),J.R.R. TOLKIEN,1999,Random House Audio
1843532751,The Rough Guide to the Lord of the Rings (Roug...,Paul Simpson,2003,Rough Guides Limited


In [60]:
query = "a book about magic and adventure in middleearth"

doc_ranking = get_relevance_items(query)
books_df.set_index('ISBN').loc[df.loc[doc_ranking[:10]].ISBN.tolist()].iloc[:,:-3]

Unnamed: 0_level_0,Book-Title,Book-Author,Year-Of-Publication,Publisher
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
345400437,The Shaping of Middle-Earth (The History of Mi...,J. R. R. Tolkien,1995,Del Rey Books
618009361,Farmer Giles of Ham : The Rise and Wonderful A...,J. R. R. Tolkien,1999,Houghton Mifflin
395498635,The Return of the Shadow: The History of The L...,Christopher Tolkien,1989,Houghton Mifflin
261102753,The Road to Middle-Earth,T. A. Shippey,1992,HarperCollins
61055328,Realms of Tolkien: Images of Middle-earth,Ted Nasmith,1996,Eos
395286654,The Atlas of Middle-Earth,Karen Wynn Fonstad,1981,Houghton Mifflin Company
395291305,The Languages of Tolkien's Middle-Earth,Ruth S. Noel,1980,Houghton Mifflin
1564147029,The Essential J.R.R. Tolkien Sourcebook: A Fan...,George Beahm,2003,New Page Books
1558062165,Creatures of Middle-earth (#2012),R. Sochard Pitt,1995,Iron Crown Enterprises
345466454,"Histories of Middle Earth, Volumes 1-5",J.R.R. TOLKIEN,2003,Del Rey


In [61]:
query = "I want to read a book about a hobby related to handicrafts"

doc_ranking = get_relevance_items(query)
books_df.set_index('ISBN').loc[df.loc[doc_ranking[:10]].ISBN.tolist()].iloc[:,:-3]

Unnamed: 0_level_0,Book-Title,Book-Author,Year-Of-Publication,Publisher
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
965112608,Awakening Brilliance: How to Inspire Children ...,Pamela Sims,1996,Bayhampton Publications
688116639,Victoria: A Woman's Christmas: Returning to th...,Arlene Hamilton Stewart,1995,Sterling Pub Co Inc
1850292469,Tricia Guild's new soft furnishings,Tricia Guild,1990,Conran Octopus
1591131510,Hunting for Mr. Good Bargain,Marlene M. Moore,2002,Booklocker.com
967983320,"Dimensional Flowers, Leaves &amp; Vines",Barbara Grainger,2000,Barbara L Grainger
595206425,My Brother's Keeper,Lorrieann Russell,2001,Writers Club Press
553483765,Before They Rode Horses (Saddle Club Super Edi...,Bonnie Bryant,1997,Skylark Books
966429702,200 Beats Per Minute,Eddie Beverage,1998,Sure Shot Pub
911214607,The human side of human beings: The theory of ...,Harvey Jackins,1978,Rational Island Publishers
486235793,World War I Uniforms Coloring Book,Copeland,1980,Dover Pubns


### Find similarities using K-means 

Kmeans algorithm is an iterative algorithm that partitions the data set using the clustering technique into k clusters with a centroid for each cluster and finds the distance between the data points and centroids using distance metrics such as euclidean distance, hamming distance, etc., and iteratively does the clustering process.

In [62]:
from sklearn.neighbors import NearestNeighbors

In [63]:
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=3, n_jobs=-1)
knn.fit(tfidf_matrix)

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=3)

In [66]:
distances, indices = knn.kneighbors(tfidf_matrix[0:1], n_neighbors=10)

In [68]:
pd.DataFrame(zip(indices.ravel(),distances.ravel()), columns=['item_id', 'distance'])

Unnamed: 0,item_id,distance
0,0,0.0
1,57392,0.562699
2,73663,0.605869
3,27221,0.624871
4,56622,0.646004
5,77865,0.652654
6,30189,0.658845
7,78132,0.67697
8,99766,0.678921
9,95366,0.682965


Now, using the model, we can get the nearest neighbors of the target vectors

## Conclusions

In this notebook, we went through the process of building a simple content-based recommendation system step by step. Having created a function that transforms the distance between vectors into a relevance score, we can add this value to the collaborative recommendation system.

### How to improve for big data

Using similar techniques at every step of preparation, I would build a search engine based on ElasticSearch. It also uses TFIDF to index documents. Also, the main advantage of this engine is the creation of a full-fledged relevance function. In this function, weights can be used to combine search criteria such as category, author, description and reviews. Elastic is highly scalable and demonstrates a high speed of obtaining results.