## Import and Functions

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

## Read CSV

In [2]:
meta_data = pd.read_csv('Data/out_content.zip')
meta_data.head()

Unnamed: 0,article_id,product_group_name_Accessories,product_group_name_Bags,product_group_name_Cosmetic,product_group_name_Fun,product_group_name_Furniture,product_group_name_Garment Full body,product_group_name_Garment Lower body,product_group_name_Garment Upper body,product_group_name_Garment and Shoe care,...,garment_group_name_Shorts,garment_group_name_Skirts,garment_group_name_Socks and Tights,garment_group_name_Special Offers,garment_group_name_Swimwear,garment_group_name_Trousers,garment_group_name_Trousers Denim,"garment_group_name_Under-, Nightwear",garment_group_name_Unknown,garment_group_name_Woven/Jersey/Knitted mix Baby
0,108775015,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,108775044,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,108775051,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,110065001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,110065002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [3]:
meta_data['article_id'][1]

108775044

In [4]:
articles_df = pd.read_csv('Data/articles.csv.zip')
articles_df.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


## Recommendation Function 

> The created function takes in a `article ID` and `n` number of recommendations, provided by the customer, and returns top article recommendations. As expected from a content-based system returned recommendations follow closely along the article description line. The article ID input is used to index the dataframe based on unique article ID ('article_id'). Due to the size of this dataset a full cosine_similarity matrix could not be created as it quickly runs out of virtual memory. In order to work around this I used the indexed book row (y) to compute similarity scores between all other rows one at a time, producing an array that denotes similarities for that particular entry only. This allowed me to create the function that computes scores solely for the input article it is given rather then the whole dataset. Returned results are sorted based on descending scores and used to index into the meta data dataframe to return information pertaining to the top recommendations.

### Building

In [8]:
articles_df.shape

(105542, 25)

In [7]:
# Input for article ID that returns the 'article_id' index number for the article to be used to call dataframe
articleid = input('Article ID: ')
article = articles_df.index[articles_df['article_id'] == int(articleid)]
article

Article ID: 893059004


Int64Index([100937], dtype='int64')

In [6]:
# Pulling out an individual row indexed by book article ID ('article_id'), can use the 'article' variable set above
y = np.array(meta_data.loc[article])
# Need to reshape so it can be passed into cosine_sim function
y = y.reshape(1, -1)
y

array([[893059004,         0,         0,         0,         0,         0,
                0,         0,         1,         0,         0,         0,
                0,         0,         0,         0,         0,         0,
                0,         0,         0,         0,         1,         0,
                0,         0,         0,         0,         0,         0,
                0,         0,         1,         0,         0,         0,
                0,         0,         0,         0,         0,         0,
                0,         0,         0,         0]])

In [8]:
# Utilize cosine_similarity from sklearn to return similarity scores based on cosine distance
cos_sim = cosine_similarity(meta_data, y)
# Create a dataframe with similairty scores with article ID ('article_id') as index
cos_sim = pd.DataFrame(data=cos_sim, index=meta_data.index)
cos_sim.head()

Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


In [9]:
# Input used to ask how many recommendations the user would like returned
n_recs = int(input('How many recommendations? '))
# The cos_sim scores then need to be sorted in descending order
cos_sim.sort_values(by = 0, ascending = False, inplace=True)
# In order to not return the original article input first need to obtain the index values for the requested # of recommendations
results = cos_sim.index.values[1:n_recs+1]
results

How many recommendations? 10


array([70358, 70368, 70367, 70366, 70365, 70364, 70363, 70362, 70361,
       70360])

> I also wanted to try a version that utilizes `K-Nearest Neighbors model` instead of Cosine Similarity. The KNN model can be fit to the meta_data merged dataframe. The method `.kneighbors` can then be utilized to return n-number of nearest neighbors for the article entry y (similar to cosine score). Once again this returned results output can be used to index into the meta data in order to return information on article recommendations.

In [10]:
# Instaniate and fit the model using merged dataframe
knn = NearestNeighbors(n_neighbors=5)
knn.fit(meta_data)
# Return results using .kneighbors attribute of knn model
index2 = knn.kneighbors(X=y, n_neighbors=n_recs+1, return_distance=False).flatten()
results2 = articles_df.iloc[index2].index.values[1:]
results2



array([ 2,  0,  3,  4,  5,  6,  7,  8,  9, 10])

In [11]:
# Using returned results variable, index the original meta data frame to return appropriate information for each article
results_df = articles_df.loc[results2]
# Reset index for better print out
results_df.reset_index(inplace=True)
# Captilizing column names for a more appealing final display
results_df.rename(columns={'prod_name':'Product Name', 'author':'Author',
                               'product_type_name':'Product Type Name', 'product_group_name':'Product Group Name',
                               'index_group_name':'Index Group Name', 'garment_group_name ':'Garment Group Name'}, inplace=True)
results_df

Unnamed: 0,index,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
3,4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,5,110065011,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,12,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
5,6,111565001,111565,20 den 1p Stockings,304,Underwear Tights,Socks & Tights,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Semi shiny nylon stockings with a wide, reinfo..."
6,7,111565003,111565,20 den 1p Stockings,302,Socks,Socks & Tights,1010016,Solid,13,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Semi shiny nylon stockings with a wide, reinfo..."
7,8,111586001,111586,Shape Up 30 den 1p Tights,273,Leggings/Tights,Garment Lower body,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Tights with built-in support to lift the botto...
8,9,111593001,111593,Support 40 den 1p Tights,304,Underwear Tights,Socks & Tights,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Semi shiny tights that shape the tummy, thighs..."
9,10,111609001,111609,200 den 1p Tights,304,Underwear Tights,Socks & Tights,1010016,Solid,9,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Opaque matt tights. 200 denier.


In [12]:
print(f'The returned article index results for Cosine Similarity: {results}')
print(f'The returned book index results for K-Nearest Neighbors: {results2}')
print(results == results2)

The returned article index results for Cosine Similarity: [70358 70368 70367 70366 70365 70364 70363 70362 70361 70360]
The returned book index results for K-Nearest Neighbors: [ 2  0  3  4  5  6  7  8  9 10]
[False False False False False False False False False False]


## Cosine Similarity Function

In [13]:
meta_data

Unnamed: 0,article_id,product_group_name_Accessories,product_group_name_Bags,product_group_name_Cosmetic,product_group_name_Fun,product_group_name_Furniture,product_group_name_Garment Full body,product_group_name_Garment Lower body,product_group_name_Garment Upper body,product_group_name_Garment and Shoe care,...,garment_group_name_Shorts,garment_group_name_Skirts,garment_group_name_Socks and Tights,garment_group_name_Special Offers,garment_group_name_Swimwear,garment_group_name_Trousers,garment_group_name_Trousers Denim,"garment_group_name_Under-, Nightwear",garment_group_name_Unknown,garment_group_name_Woven/Jersey/Knitted mix Baby
0,108775015,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,108775044,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,108775051,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,110065001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,110065002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105537,953450001,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
105538,953763001,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
105539,956217002,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
105540,957375001,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# Compiling the above code into a working function that takes a article ID as input and returns n-recommendations
def article_recommend():
    
    title = input('Article ID: ')
    article = articles_df.index[articles_df['article_id'] == int(articleid)]
    n_recs = int(input('How many recommendations? '))
    
    y = np.array(meta_data.loc[article]).reshape(1, -1)
    cos_sim = cosine_similarity(meta_data, y)
    cos_sim = pd.DataFrame(data=cos_sim, index=meta_data.index)
    cos_sim.sort_values(by = 0, ascending = False, inplace=True)
    results = cos_sim.index.values[1:n_recs+1]
    results_df = articles_df.loc[results]
    results_df.reset_index(inplace=True)
    results_df.rename(columns={'prod_name':'Product Name', 'author':'Author',
                               'product_type_name':'Product Type Name', 'product_group_name':'Product Group Name',
                               'index_group_name':'Index Group Name', 'garment_group_name ':'Garment Group Name'}, inplace=True)
    return results_df

In [35]:
articles_df.index[articles_df['article_id'] == 953450001]

Int64Index([105537], dtype='int64')

In [32]:
articles_df.tail()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
105537,953450001,953450,5pk regular Placement1,302,Socks,Socks & Tights,1010014,Placement print,9,Black,...,Socks Bin,F,Menswear,3,Menswear,26,Men Underwear,1021,Socks and Tights,Socks in a fine-knit cotton blend with a small motif at the top and elasticated tops.
105538,953763001,953763,SPORT Malaga tank,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey,A,Ladieswear,1,Ladieswear,2,H&M+,1005,Jersey Fancy,Loose-fitting sports vest top in ribbed fast-drying functional fabric made from recycled polyester with a racer back and rounded hem.
105539,956217002,956217,Cartwheel dress,265,Dress,Garment Full body,1010016,Solid,9,Black,...,Jersey,A,Ladieswear,1,Ladieswear,18,Womens Trend,1005,Jersey Fancy,"Short, A-line dress in jersey with a round neckline and V-shaped opening at the front with narrow ties. Long, voluminous raglan sleeves and wide cuffs with covered buttons."
105540,957375001,957375,CLAIRE HAIR CLAW,72,Hair clip,Accessories,1010016,Solid,9,Black,...,Small Accessories,D,Divided,2,Divided,52,Divided Accessories,1019,Accessories,Large plastic hair claw.
105541,959461001,959461,Lounge dress,265,Dress,Garment Full body,1010016,Solid,11,Off White,...,Jersey,A,Ladieswear,1,Ladieswear,18,Womens Trend,1005,Jersey Fancy,"Calf-length dress in ribbed jersey made from a cotton blend. Low-cut V-neck at the back, dropped shoulders and long, wide sleeves that taper to the cuffs. Unlined."


In [33]:
articles_df.index[articles_df['article_id'] == 953450001]

Int64Index([105537], dtype='int64')

## Evaluation

In [22]:
pd.set_option('display.max_colwidth', None)

In [23]:
article_recommend()

Article ID: 110065001
How many recommendations? 3


Unnamed: 0,index,article_id,product_code,Product Name,product_type_no,Product Type Name,Product Group Name,graphical_appearance_no,graphical_appearance_name,colour_group_code,...,department_name,index_code,index_name,index_group_no,Index Group Name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,70358,760158001,760158,DIV Rachel denim,272,Trousers,Garment Lower body,1010023,Denim,10,...,Divided+,D,Divided,2,Divided,50,Divided Projects,1001,Unknown,"5-pocket, ankle-length jeans in washed, stretch denim in a relaxed fit with a high waist, zip fly and button and straight legs with cut-off, raw-edge hems."
1,70368,760214002,760214,Semide tie dress,265,Dress,Garment Full body,1010016,Solid,9,...,Dress,A,Ladieswear,1,Ladieswear,6,Womens Casual,1013,Dresses Ladies,"Short dress in a viscose weave with a wide neckline, long sleeves with narrow elastication at the cuffs and a detachable tie belt at the waist. Unlined."
2,70367,760208001,760208,Class Cleo bracelet,68,Bracelet,Accessories,1010016,Solid,5,...,Jewellery Extended,C,Ladies Accessories,1,Ladieswear,66,Womens Small accessories,1019,Accessories,Two-strand bracelet in metal chains of different designs with a coin-shaped pendant. Adjustable length 20-27 cm.


In [24]:
article_recommend()

Article ID: 953763001
How many recommendations? 7


Unnamed: 0,index,article_id,product_code,Product Name,product_type_no,Product Type Name,Product Group Name,graphical_appearance_no,graphical_appearance_name,colour_group_code,...,department_name,index_code,index_name,index_group_no,Index Group Name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,70358,760158001,760158,DIV Rachel denim,272,Trousers,Garment Lower body,1010023,Denim,10,...,Divided+,D,Divided,2,Divided,50,Divided Projects,1001,Unknown,"5-pocket, ankle-length jeans in washed, stretch denim in a relaxed fit with a high waist, zip fly and button and straight legs with cut-off, raw-edge hems."
1,70368,760214002,760214,Semide tie dress,265,Dress,Garment Full body,1010016,Solid,9,...,Dress,A,Ladieswear,1,Ladieswear,6,Womens Casual,1013,Dresses Ladies,"Short dress in a viscose weave with a wide neckline, long sleeves with narrow elastication at the cuffs and a detachable tie belt at the waist. Unlined."
2,70367,760208001,760208,Class Cleo bracelet,68,Bracelet,Accessories,1010016,Solid,5,...,Jewellery Extended,C,Ladies Accessories,1,Ladieswear,66,Womens Small accessories,1019,Accessories,Two-strand bracelet in metal chains of different designs with a coin-shaped pendant. Adjustable length 20-27 cm.
3,70366,760195006,760195,FLORA turtle neck,255,T-shirt,Garment Upper body,1010016,Solid,51,...,Young Girl Jersey Basic,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1002,Jersey Basic,"Top in soft, ribbed jersey made from a cotton blend with a turtle neck and long sleeves. The cotton content of the top is organic."
4,70365,760195005,760195,FLORA turtle neck,255,T-shirt,Garment Upper body,1010016,Solid,93,...,Young Girl Jersey Basic,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1002,Jersey Basic,"Top in soft, ribbed jersey made from a cotton blend with a turtle neck and long sleeves. The cotton content of the top is organic."
5,70364,760195004,760195,FLORA turtle neck,255,T-shirt,Garment Upper body,1010016,Solid,9,...,Young Girl Jersey Basic,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1002,Jersey Basic,"Top in soft, ribbed jersey made from a cotton blend with a turtle neck and long sleeves. The cotton content of the top is organic."
6,70363,760195003,760195,FLORA turtle neck,255,T-shirt,Garment Upper body,1010016,Solid,51,...,Young Girl Jersey Basic,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1002,Jersey Basic,"Top in soft, ribbed jersey made from a cotton blend with a turtle neck and long sleeves. The cotton content of the top is organic."


## Conclusion

> In conclusion the content-based recommendation system works by taking in meta data of description into features that can then be used to describe each article as a large multi-dimensional vector. A comparison of these vectors (using cosine similarity) is then used to return the top closest articles as recommendations for the user. It takes in a article ID as the input. Currently this system is limited to articles within my dataset but it can easily be expanded to many other articles as long as meta data exists for them. New articles added every season from the new collection can be added to this and recommendations can be made on them too.
<br><br>A unique benefit of this content-based approach as compared to the collaborative approach seen here is its ability to be used by anyone looking to find similar articles. It does not require any prior purchase from any customer. These approaches can be used in tandem to provide robust and varied article recommendations for H&M customer  and would help to promote the sale of further articles. Implementation within the H&M e-commerce website can be achieved with ease and this would also allow for continued data collection to help further refine the system.
<br><br>As expected this content-based system recommends articles along conventional description/group lines, which can be seen as a benefit and also a slight limitation. An investigation into the exclusion of specific group details might be warranted as well as an depth look at re-classifying groups within the dataset using unsupervised clustering.