## Import and Functions

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

## Read CSV

In [2]:
meta_data = pd.read_csv('Data/out_content.zip')
meta_data.head()

Unnamed: 0,article_id,product_group_name_Accessories,product_group_name_Bags,product_group_name_Cosmetic,product_group_name_Fun,product_group_name_Furniture,product_group_name_Garment Full body,product_group_name_Garment Lower body,product_group_name_Garment Upper body,product_group_name_Garment and Shoe care,...,garment_group_name_Shorts,garment_group_name_Skirts,garment_group_name_Socks and Tights,garment_group_name_Special Offers,garment_group_name_Swimwear,garment_group_name_Trousers,garment_group_name_Trousers Denim,"garment_group_name_Under-, Nightwear",garment_group_name_Unknown,garment_group_name_Woven/Jersey/Knitted mix Baby
0,108775015,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,108775044,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,108775051,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,110065001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,110065002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [3]:
meta_data['article_id'][1]

108775044

In [4]:
articles_df = pd.read_csv('Data/h-and-m-personalized-fashion-recommendations/articles.csv')
articles_df.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


## Recommendation Function 

> The function takes an `article ID` and `n`, the number of recommendations requested by the customer, and returns the top article recommendations. As expected from a content-based system, the returned recommendations follow the article description closely. The article ID input is used to look up the matching row in the dataframe by its unique article ID ('article_id'). Due to the size of this dataset, a full cosine_similarity matrix could not be created, as it quickly runs out of virtual memory. To work around this, I used the indexed article row (y) to compute similarity scores between that row and every other row, producing an array of similarities for that particular entry only. This allowed me to create a function that computes scores solely for the input article rather than for the whole dataset. The results are sorted by descending score and used to index into the meta data dataframe to return information on the top recommendations.
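
> The one-row workaround described above can be sketched on toy data. This is a minimal illustration, not the notebook's exact code: `toy_features` and its column names are made up, and the query row is removed by label rather than by position, which is one way to avoid relying on the query sorting first when several rows tie.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy one-hot feature table standing in for the real feature dataframe (hypothetical data)
toy_features = pd.DataFrame(
    [[1, 0, 1],
     [1, 0, 1],
     [0, 1, 0],
     [1, 1, 0]],
    columns=['group_a', 'group_b', 'group_c'],
)

# Query row kept two-dimensional, shape (1, n_features), as cosine_similarity expects
query = toy_features.loc[[0]].to_numpy()

# Output has shape (n_rows, 1): one score per row, never an n_rows x n_rows matrix
scores = cosine_similarity(toy_features.to_numpy(), query).ravel()

# Rank rows by descending score, then remove the query row by its label
ranked = pd.Series(scores, index=toy_features.index).sort_values(ascending=False)
top = ranked.drop(index=0).head(2)
```

> Memory use grows linearly with the number of rows here, whereas the full pairwise matrix grows quadratically.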

### Building

In [5]:
# Prompt for an article ID and look up the row label of that article, which is then used to index the dataframes
articleid = input('Article ID: ')
article = articles_df.index[articles_df['article_id'] == int(articleid)]
article

Article ID: 108775044


Int64Index([1], dtype='int64')

In [6]:
# Pulling out an individual row by its article ID ('article_id') using the 'article' variable set above
y = np.array(meta_data.loc[article])
# Need to reshape so it can be passed into cosine_sim function
y = y.reshape(1, -1)
y

array([[108775044,         0,         0,         0,         0,         0,
                0,         0,         1,         0,         0,         0,
                0,         0,         0,         0,         0,         0,
                0,         0,         0,         0,         1,         0,
                0,         0,         0,         0,         0,         0,
                1,         0,         0,         0,         0,         0,
                0,         0,         0,         0,         0,         0,
                0,         0,         0,         0]])

In [7]:
# Utilize cosine_similarity from sklearn to return similarity scores based on cosine distance
cos_sim = cosine_similarity(meta_data, y)
# Create a dataframe of similarity scores that shares the row index of meta_data
cos_sim = pd.DataFrame(data=cos_sim, index=meta_data.index)
cos_sim.head()

Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
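
> Every score printed above is 1.0. One likely reason is that `y` still contains the raw `article_id` value (it is the first element of the array shown earlier), and a number of that size dominates the 0/1 feature columns in the cosine calculation. The sketch below uses two made-up rows to show the effect; dropping the column first, for example with `meta_data.drop(columns=['article_id'])`, is one possible remedy and an assumption on my part, not something tested in this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two rows with entirely different one-hot features,
# each prefixed by a large numeric ID like those in 'article_id'
with_id = np.array([
    [108775044, 1, 0, 0],
    [110065001, 0, 0, 1],
], dtype=float)

# With the ID kept as a feature, the score is ~1.0 whatever the one-hot columns say
score_with_id = cosine_similarity(with_id[[0]], with_id[[1]])[0, 0]

# Without the first column, only the one-hot features remain; they are orthogonal here
score_without_id = cosine_similarity(with_id[[0], 1:], with_id[[1], 1:])[0, 0]
```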


In [8]:
# Input used to ask how many recommendations the user would like returned
n_recs = int(input('How many recommendations? '))
# The cos_sim scores then need to be sorted in descending order
cos_sim.sort_values(by = 0, ascending = False, inplace=True)
# Skip the first entry (assumed to be the input article itself) and take the index values for the requested number of recommendations
results = cos_sim.index.values[1:n_recs+1]
results

How many recommendations? 10


array([70358, 70368, 70367, 70366, 70365, 70364, 70363, 70362, 70361,
       70360])

> I also wanted to try a version that uses a `K-Nearest Neighbors` model instead of cosine similarity. The KNN model can be fit to the meta_data dataframe. The `.kneighbors` method can then be used to return the n nearest neighbors of the article entry y (similar to the cosine score). Once again, the returned results can be used to index into the meta data in order to return information on the recommended articles.
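
> A minimal sketch of the nearest-neighbors lookup on toy data follows. The array `toy` and the name `knn_toy` are illustrative only. Note that `NearestNeighbors` defaults to Euclidean (Minkowski, p=2) distance; passing `metric='cosine'` with `algorithm='brute'`, as done here, is an optional choice that makes the ranking comparable to the cosine-similarity version.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy one-hot feature matrix (hypothetical data)
toy = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
], dtype=float)

# Cosine distance requires the brute-force search backend
knn_toy = NearestNeighbors(metric='cosine', algorithm='brute')
knn_toy.fit(toy)

# Ask for one extra neighbor because the query row is its own nearest neighbor
query_pos = 3
distances, positions = knn_toy.kneighbors(toy[[query_pos]], n_neighbors=3)

# Remove the query by position value instead of assuming it comes first
neighbours = [int(p) for p in positions.ravel() if p != query_pos][:2]
```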

In [56]:
# Instantiate and fit the model using the meta_data dataframe
knn = NearestNeighbors(n_neighbors=5)
knn.fit(meta_data)
# Return results using the .kneighbors method of the knn model
index2 = knn.kneighbors(X=y, n_neighbors=n_recs+1, return_distance=False).flatten()
results2 = articles_df.iloc[index2].index.values[1:]
results2



array([ 2,  0,  3,  4,  5,  6,  7,  8,  9, 10])

In [57]:
# Using the returned results, index the original articles dataframe to return the information for each article
results_df = articles_df.loc[results2]
# Reset index for better print out
results_df.reset_index(inplace=True)
# Capitalizing column names for a more appealing final display
# results_df.rename(columns={'title':'Title', 'author':'Author',
#                                'genre':'Genre', 'print_length':'# Pages',
#                                'word_wise':'Word Wise', 'lending':'Lending', 'asin':'ASIN'}, inplace=True)
# Changing certain columns to display integer instead of float for more appealing final display
# results_df[['# Pages', 'Word Wise', 'Lending']] = results_df[['# Pages', 'Word Wise', 'Lending']].astype(int)
results_df

Unnamed: 0,index,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
3,4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,5,110065011,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,12,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
5,6,111565001,111565,20 den 1p Stockings,304,Underwear Tights,Socks & Tights,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Semi shiny nylon stockings with a wide, reinfo..."
6,7,111565003,111565,20 den 1p Stockings,302,Socks,Socks & Tights,1010016,Solid,13,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Semi shiny nylon stockings with a wide, reinfo..."
7,8,111586001,111586,Shape Up 30 den 1p Tights,273,Leggings/Tights,Garment Lower body,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Tights with built-in support to lift the botto...
8,9,111593001,111593,Support 40 den 1p Tights,304,Underwear Tights,Socks & Tights,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Semi shiny tights that shape the tummy, thighs..."
9,10,111609001,111609,200 den 1p Tights,304,Underwear Tights,Socks & Tights,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Opaque matt tights. 200 denier.


In [58]:
print(f'The returned article index results for Cosine Similarity: {results}')
print(f'The returned article index results for K-Nearest Neighbors: {results2}')
print(results == results2)

The returned article index results for Cosine Similarity: [70358 70368 70367 70366 70365 70364 70363 70362 70361 70360]
The returned article index results for K-Nearest Neighbors: [ 2  0  3  4  5  6  7  8  9 10]
[False False False False False False False False False False]
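
> The element-wise `==` above only checks whether the two methods agree position by position. A set-based overlap is a less strict comparison, since two methods can return the same articles in a different order. The sketch below copies the two printed index arrays so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

# Index arrays copied from the printed output above
cosine_results = np.array([70358, 70368, 70367, 70366, 70365,
                           70364, 70363, 70362, 70361, 70360])
knn_results = np.array([2, 0, 3, 4, 5, 6, 7, 8, 9, 10])

# Articles recommended by both methods, regardless of rank
shared = np.intersect1d(cosine_results, knn_results)

# Fraction of the cosine recommendations that KNN also returned
overlap_fraction = len(shared) / len(cosine_results)
```

> For this query the two lists share no articles at all, so the disagreement is not just a matter of ordering.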


## Cosine Similarity

In [None]:
# Compiling the above code into a working function that takes an article ID as input and returns n recommendations
def book_review_recommend():
    
    # Look up the row label of the requested article
    articleid = input('Article ID: ')
    article = articles_df.index[articles_df['article_id'] == int(articleid)]
    n_recs = int(input('How many recommendations? '))
    
    # Score every row against the requested article only
    y = np.array(meta_data.loc[article]).reshape(1, -1)
    cos_sim = cosine_similarity(meta_data, y)
    cos_sim = pd.DataFrame(data=cos_sim, index=meta_data.index)
    cos_sim.sort_values(by = 0, ascending = False, inplace=True)
    # Skip the first entry, assumed to be the input article itself
    results = cos_sim.index.values[1:n_recs+1]
    # Return the article details for the recommended rows
    results_df = articles_df.loc[results].reset_index()
    return results_df
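
> For testing, a non-interactive variant that takes its inputs as arguments can be easier to work with than one that calls `input()`. The sketch below is one possible shape for it. The name `recommend_articles`, its parameters, and the toy dataframes are all hypothetical, and it assumes the feature table and the catalogue share the same row index. Unlike the cells above, it drops the ID column before scoring.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_articles(features, catalogue, article_id, n_recs, id_col='article_id'):
    """Return the n_recs catalogue rows most similar to article_id (hypothetical helper)."""
    matches = catalogue.index[catalogue[id_col] == article_id]
    if len(matches) == 0:
        raise KeyError(f'{article_id} not found in {id_col}')
    query_label = matches[0]

    # Keep the identifier out of the feature matrix
    X = features.drop(columns=[id_col], errors='ignore')
    query = X.loc[[query_label]].to_numpy()
    scores = cosine_similarity(X.to_numpy(), query).ravel()

    # Remove the query by label, then rank the rest with a stable sort
    ranked = pd.Series(scores, index=X.index).drop(index=query_label)
    top = ranked.sort_values(ascending=False, kind='stable').head(n_recs)
    return catalogue.loc[top.index].assign(score=top.to_numpy())

# Toy data sharing one row index (hypothetical values)
toy_catalogue = pd.DataFrame({
    'article_id': [11, 12, 13, 14],
    'prod_name': ['Strap top', 'Strap top', 'Tights', 'Vest'],
})
toy_features = pd.DataFrame({
    'article_id': [11, 12, 13, 14],
    'upper_body': [1, 1, 0, 1],
    'socks': [0, 0, 1, 0],
    'jersey': [1, 1, 0, 0],
})

recs = recommend_articles(toy_features, toy_catalogue, article_id=11, n_recs=2)
```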

## K-Nearest Neighbors

In [None]:
def book_review_recommend_knn():
    
    # Look up the row label of the requested article
    articleid = input('Article ID: ')
    article = articles_df.index[articles_df['article_id'] == int(articleid)]
    n_recs = int(input('How many recommendations? '))
    
    # Query the `knn` model fitted on meta_data above
    X = np.array(meta_data.loc[article]).reshape(1, -1)
    results2 = knn.kneighbors(X, n_neighbors=n_recs+1, return_distance=False).flatten()
    # Map positions back to row labels and skip the first entry, assumed to be the input article
    results2 = meta_data.iloc[results2].index.values[1:]
    # Return the article details for the recommended rows
    results_df = articles_df.loc[results2].reset_index()
    return results_df

In [None]:
book_review_recommend_knn()

## Evaluation

> To better display the returned recommendation results, I have adjusted the display setting within pandas to show the maximum column width. This displays the full title without truncating it. I want to highlight below, using two eBook examples, how this recommendation system is able to distinguish between content within genres. Both of the input eBooks are within the Literature and Fiction genre but contain vastly different content: one is a thriller and the other a romance. Recommendations for these books follow the same convention. I am incredibly pleased with how this content-based system is working, given its ability to distinguish across and within genres. As stated previously, I wanted to make sure the system was not just returning books in the same series and thus excluded author. From the first example below one can see that this is working as intended. <br> <br>
For my next steps, I would like to explore an expansion of this system that only uses review text data (excludes genres) but can be restricted to specific genres as an input. I would like to see how this holds up against the system that includes genre as features. I also believe a deeper look into genre classification is warranted, given the inclusion and prominence of Literature & Fiction as a prevalent genre. This genre contains a broad category of books that could be further classified into several different genres already existing within the dataset, and it represents a flaw in the labeling of data via the Amazon Kindle store. It would be interesting to attempt clustering of some kind on the dataset, using unsupervised learning to determine genre, but this was outside the scope of the project.

In [59]:
pd.set_option('display.max_colwidth', None)

In [None]:
book_review_recommend()

In [None]:
book_review_recommend()

In [None]:
book_review_recommend()

## Conclusion

> In conclusion, the content-based recommendation system works by taking in review text and vectorizing it into individual word features that can then be used to describe each book as a large multi-dimensional vector. A comparison of these vectors (using cosine similarity) is then used to return the closest eBooks as recommendations for the user. It takes a book title as the input. Currently this system is limited to eBooks within my dataset, but it can easily be expanded to many other books as long as review text exists for them. Sites like Goodreads contain a wealth of reviews and could be mined for further book entries.
<br><br>A unique benefit of this content-based approach, as compared to the collaborative approach seen here, is its ability to be used by anyone looking to find similar books. It does not require any prior review or rating from any specific user. These approaches can be used in tandem to provide robust and varied eBook recommendations for Kindle users and would help to promote the sale of further eBooks. Implementation within the Kindle system can be achieved with ease, and this would also allow for continued data collection to help further refine the system. The more reviews and text each book has, the better the term vector will be able to distinguish between books and make accurate recommendations.
<br><br>As expected, this content-based system recommends books along conventional genre lines, which can be seen as a benefit and also a slight limitation. An investigation into the exclusion of genre might be warranted, as well as an in-depth look at re-classifying genres within the dataset using unsupervised clustering. Further eBook review data is also needed to expand the range of possible eBook recommendations. New books are continually being published, and their review data can be easily entered through the cleaning pipeline that utilizes texthero. This information can then be merged into the existing dataset, thus expanding the potential recommendation pool.