<img src='http://imgur.com/1ZcRyrc.png' style='float: left; margin: 20px; height: 55px'>

# Capstone Project: H & M Recommender System
## Part 1
---
## Contents
---

### Part 1
1. Introduction
2. Problem Statement
3. Data Import and Cleaning

### [Part 2](part2_hm.ipynb)
4. [Exploratory Data Analysis](#4.-EDA)

### [Part 3](part3_hm.ipynb)
5. [User-Item Matrix](#5.-User-item-Matrix)
6. [Challenges](#6.-Challenges)
7. [Conclusion](#7.-Conclusion)


## 1. Introduction 

### 1.1 Background

H&M Group is a family of brands and businesses with 53 online markets and approximately 4,850 stores. Its online store offers shoppers an extensive selection of products to browse.

Some customers are comfortable searching for what they want, while others would rather not search at all. Product recommendations are therefore key to enhancing the customer experience.

The data gathered from 2018 to 2021 to study customer transactions ranges from simple attributes, such as garment type and customer age, to text from product descriptions and images of garments.

## 2. Problem Statement
H&M customers have too many products to browse online. They might not quickly find what interests them or what they are looking for and, ultimately, might not make a purchase. Product recommendations help customers make the right choices.

In this competition, H&M Group invites participants to develop product recommendations based on data from previous transactions, as well as on customer and product metadata.

Reference: https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/overview

## 3. Data Import and Cleaning

### 3.1 Import libraries and files

In [1]:
#import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from skimage import io
import random
from sklearn.preprocessing import OneHotEncoder


In [2]:
# reading files
df = pd.read_csv('../data/articles.csv')
df_customers = pd.read_csv('../data/customers.csv')
df_transaction_train = pd.read_csv('../data/transactions_train.csv')
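
One thing worth checking when loading these files is how identifier columns are parsed. If the raw `articles.csv` stores `article_id` with a leading zero (an assumption to verify against the actual file), the default integer parsing drops it. The sketch below uses a small in-memory CSV with illustrative rows, not the real data, to show the difference that `dtype` makes.

```python
import io
import pandas as pd

# Toy stand-in for articles.csv; the rows are illustrative, not the real file
csv_text = "article_id,prod_name\n0108775015,Strap top\n0108775044,Example top\n"

# Default parsing infers an integer column, so the leading zero is lost
as_int = pd.read_csv(io.StringIO(csv_text))

# Passing dtype keeps the identifier exactly as written in the file
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(as_int["article_id"].iloc[0])
print(as_str["article_id"].iloc[0])
```

Whichever representation is used, it needs to be the same in the articles and transactions tables so that the later merge on `article_id` matches.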

In [3]:
# checking for shape of articles data
df.shape

(105542, 25)

In [4]:
#checking for shape of customers data
df_customers.shape

(1371980, 7)

In [5]:
#checking for shape of transactions data
df_transaction_train.shape

(31788324, 5)

In [7]:
# checking for headings of articles data
df.head(1)

| | article_id | product_code | prod_name | product_type_no | product_type_name | product_group_name | graphical_appearance_no | graphical_appearance_name | colour_group_code | colour_group_name | ... | department_name | index_code | index_name | index_group_no | index_group_name | section_no | section_name | garment_group_no | garment_group_name | detail_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108775015 | 108775 | Strap top | 253 | Vest top | Garment Upper body | 1010016 | Solid | 9 | Black | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |


In [8]:
# checking for headings of customers data
df_customers.head(1)

| | customer_id | FN | Active | club_member_status | fashion_news_frequency | age | postal_code |
|---|---|---|---|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | | | ACTIVE | NONE | 49.0 | 52043ee2162cf5aa7ee79974281641c6f11a68d276429a... |


In [9]:
# checking for headings of transactions data
df_transaction_train.head(1)

| | t_dat | customer_id | article_id | price | sales_channel_id |
|---|---|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713001 | 0.050831 | 2 |


In [10]:
# limiting customers to 10k, sampled without replacement so every sampled ID is distinct
customers = np.random.choice(df_customers['customer_id'].unique(), 10000, replace=False)
df_customers = df_customers[df_customers['customer_id'].isin(customers)]
df_transaction_train = df_transaction_train[df_transaction_train['customer_id'].isin(customers)]
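
Note that `np.random.choice` samples with replacement unless `replace=False` is passed, and that an unseeded sample changes on every run. A minimal sketch of a repeatable sample of distinct customers is shown below. The toy table, its IDs, and the seed value are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy customer table; the IDs are invented for illustration
toy_customers = pd.DataFrame({"customer_id": [f"c{i:03d}" for i in range(100)]})

# Fixed seed (arbitrary value) so the same customers are drawn on every run
rng = np.random.default_rng(42)

# replace=False guarantees that no customer is drawn twice
sample_ids = rng.choice(toy_customers["customer_id"].to_numpy(), size=10, replace=False)

toy_sample = toy_customers[toy_customers["customer_id"].isin(sample_ids)]
print(len(toy_sample))
```

With a fixed seed, downstream counts such as the number of unique customers and articles stay the same between notebook runs.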

In [11]:
#merging data
merged_df = df_transaction_train.merge(df, on='article_id', validate='many_to_one')
merged_df = merged_df.merge(df_customers,on='customer_id',validate='many_to_one')
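
The `validate='many_to_one'` argument makes pandas check that the join key is unique in the right-hand table, and the default inner join keeps only rows whose key appears in both tables. The sketch below illustrates both behaviours on toy frames with made-up values.

```python
import pandas as pd

# Toy transactions and articles; all values are illustrative
tx = pd.DataFrame({"customer_id": ["a", "a", "b"], "article_id": [1, 2, 1]})
articles = pd.DataFrame({"article_id": [1, 2], "prod_name": ["Top", "Jeans"]})

# Each transaction picks up the attributes of its single matching article
merged = tx.merge(articles, on="article_id", validate="many_to_one")
print(merged.shape)

# A duplicated key on the right-hand side violates many_to_one and raises
dup_articles = pd.concat([articles, articles.iloc[[0]]], ignore_index=True)
try:
    tx.merge(dup_articles, on="article_id", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```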

In [None]:
# Description of the 3 datasets

### Articles data description
article_id : A unique identifier of every article
<br>product_code, prod_name : A unique identifier of every product and its name
<br>product_type_no, product_type_name : The product type that a product_code belongs to and its name
<br>graphical_appearance_no, graphical_appearance_name : The graphical appearance group and its name
<br>colour_group_code, colour_group_name : The color group and its name
<br>perceived_colour_value_id, perceived_colour_value_name, perceived_colour_master_id, perceived_colour_master_name : Additional color information
<br>department_no, department_name : A unique identifier of every department and its name
<br>index_code, index_name : A unique identifier of every index and its name
<br>index_group_no, index_group_name : A group of indices and its name
<br>section_no, section_name : A unique identifier of every section and its name
<br>garment_group_no, garment_group_name : A unique identifier of every garment group and its name
<br>detail_desc : A detailed text description of the article


### Customers data description
<br>customer_id : A unique identifier of every customer
<br>FN : 1 or missing (the meaning of this field is not documented)
<br>Active : 1 or missing
<br>club_member_status : Status in club
<br>fashion_news_frequency : How often H&M may send news to customer
<br>age : The current age
<br>postal_code : Postal code of customer

### Transactions data description

<br>t_dat : Purchase date
<br>customer_id : A unique identifier of every customer (in customers table)
<br>article_id : A unique identifier of every article (in articles table)
<br>price : Price of purchase
<br>sales_channel_id : 1 or 2
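
Because `t_dat` is a purchase date, it is usually convenient to hold it as a datetime rather than as text, since `pd.read_csv` leaves it as a string unless told otherwise. A minimal sketch on toy rows follows. The second row and its price are invented for illustration.

```python
import pandas as pd

# Toy transactions; the second row is invented for illustration
toy_tx = pd.DataFrame({
    "t_dat": ["2018-09-20", "2018-09-21"],
    "price": [0.050831, 0.030492],
})

# Convert the text dates to datetime64 so they can be sorted and subtracted
toy_tx["t_dat"] = pd.to_datetime(toy_tx["t_dat"], format="%Y-%m-%d")

span_days = (toy_tx["t_dat"].max() - toy_tx["t_dat"].min()).days
print(span_days)
```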

### 3.2 Cleaning

In [6]:
# checking for null values
merged_df.isnull().sum()

t_dat                                0
customer_id                          0
article_id                           0
price                                0
sales_channel_id                     0
product_code                         0
prod_name                            0
product_type_no                      0
product_type_name                    0
product_group_name                   0
graphical_appearance_no              0
graphical_appearance_name            0
colour_group_code                    0
colour_group_name                    0
perceived_colour_value_id            0
perceived_colour_value_name          0
perceived_colour_master_id           0
perceived_colour_master_name         0
department_no                        0
department_name                      0
index_code                           0
index_name                           0
index_group_no                       0
index_group_name                     0
section_no                           0
section_name             

In [12]:
# drop rows with null values
merged_df.dropna(subset=['age','detail_desc','club_member_status','fashion_news_frequency'], inplace=True)
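
With `subset`, `dropna` removes a row only when one of the listed columns is missing, so columns such as `FN` and `Active` can still contain nulls afterwards. The sketch below shows this on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy merged table; values are illustrative
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age": [49.0, np.nan, 31.0],
    "FN": [np.nan, np.nan, 1.0],
})

before = len(toy)

# Only a missing age triggers a drop; a missing FN is ignored
cleaned = toy.dropna(subset=["age"])

print(before - len(cleaned))        # number of rows dropped
print(cleaned["FN"].isna().sum())   # nulls that remain outside the subset
```

Comparing the row count before and after the drop is a quick way to see how much of the sample the cleaning step removes.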

In [21]:
# saving file merged_df
merged_df.to_csv('merged_df.csv', index = False)

In [13]:
# selecting features of articles for building user_item matrix
feature_subset = ['product_group_name',
                  'graphical_appearance_name', 'colour_group_name',
                  'perceived_colour_value_name',
                  'perceived_colour_master_name',
                  'department_name', 'index_name',
                  'index_group_name', 'section_name',
                  'garment_group_name']

# keep identifiers and descriptive columns alongside the categorical features
df1 = merged_df[['t_dat', 'customer_id', 'article_id', 'prod_name', 'product_type_name']
                + feature_subset + ['detail_desc']]

In [19]:
#Choose features to build feature space and dummify categorical features
features = feature_subset
df1 = df1[['customer_id', 'article_id'] + features]
dummies_df = pd.get_dummies(df1, columns=features)
dummies_df.head()

Unnamed: 0,customer_id,article_id,product_group_name_Accessories,product_group_name_Bags,product_group_name_Cosmetic,product_group_name_Garment Full body,product_group_name_Garment Lower body,product_group_name_Garment Upper body,product_group_name_Garment and Shoe care,product_group_name_Items,...,garment_group_name_Shorts,garment_group_name_Skirts,garment_group_name_Socks and Tights,garment_group_name_Special Offers,garment_group_name_Swimwear,garment_group_name_Trousers,garment_group_name_Trousers Denim,"garment_group_name_Under-, Nightwear",garment_group_name_Unknown,garment_group_name_Woven/Jersey/Knitted mix Baby
0,0247a4cfbe56ac5e1c0b5a91b0ab0aad812f6fca13d4fd...,618840005,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0247a4cfbe56ac5e1c0b5a91b0ab0aad812f6fca13d4fd...,625034001,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0247a4cfbe56ac5e1c0b5a91b0ab0aad812f6fca13d4fd...,659124001,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0247a4cfbe56ac5e1c0b5a91b0ab0aad812f6fca13d4fd...,659124001,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0247a4cfbe56ac5e1c0b5a91b0ab0aad812f6fca13d4fd...,629826001,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
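
`pd.get_dummies` replaces each listed categorical column with one indicator column per distinct value, named `<column>_<value>`. The sketch below shows this on a toy frame with illustrative values. Recent pandas versions return boolean indicators by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Toy transactions with a single categorical feature; values are illustrative
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [1, 2, 1],
    "colour_group_name": ["Black", "White", "Black"],
})

# One indicator column per colour; identifier columns are left untouched
toy_dummies = pd.get_dummies(toy, columns=["colour_group_name"], dtype=int)

print(list(toy_dummies.columns))
print(toy_dummies["colour_group_name_Black"].tolist())
```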


In [15]:
# checking for number of unique customer ids
df1.customer_id.nunique()

9642

In [16]:
#checking for number of unique articles
merged_df['article_id'].nunique()

46050


In [11]:
# checking for columns in file
dummies_df.columns

Index(['customer_id', 'article_id', 'product_group_name_Accessories',
       'product_group_name_Bags', 'product_group_name_Cosmetic',
       'product_group_name_Furniture', 'product_group_name_Garment Full body',
       'product_group_name_Garment Lower body',
       'product_group_name_Garment Upper body',
       'product_group_name_Garment and Shoe care',
       ...
       'garment_group_name_Shorts', 'garment_group_name_Skirts',
       'garment_group_name_Socks and Tights',
       'garment_group_name_Special Offers', 'garment_group_name_Swimwear',
       'garment_group_name_Trousers', 'garment_group_name_Trousers Denim',
       'garment_group_name_Under-, Nightwear', 'garment_group_name_Unknown',
       'garment_group_name_Woven/Jersey/Knitted mix Baby'],
      dtype='object', length=456)

In [22]:
#saving file dummies_df
dummies_df.to_csv('dummies_df.csv', index = False)
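
The saved one-hot table has one row per transaction. One possible way to condense it into one row per customer is to sum the indicator columns per `customer_id`. This is an assumption about a later step and not necessarily what Part 3 does. The toy frame below uses illustrative values.

```python
import pandas as pd

# Toy one-hot table in the same shape as dummies_df; values are illustrative
toy_dummies = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [1, 2, 1],
    "colour_group_name_Black": [1, 0, 1],
    "colour_group_name_White": [0, 1, 0],
})

# Drop the article identifier, then count how often each feature value
# appears in every customer's purchases
profile = toy_dummies.drop(columns="article_id").groupby("customer_id").sum()

print(profile)
```

Each row of `profile` then counts how many of a customer's purchases carried each feature value.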