# H&M Personalized Fashion Recommendations

First we import all the required external libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

## Exploratory Data Analysis (EDA)

We start by analysing all the data available for this project, beginning with reading the provided datasets.

In [2]:
articles = pd.read_csv('articles.csv')
customers = pd.read_csv('customers.csv')
transaction_train = pd.read_csv('transactions_train.csv')
sample_submission = pd.read_csv('sample_submission.csv')

Of those, three datasets are analysed and used for the recommendation system:
- Articles: clothing products information
- Customers: details about single customers
- Transactions: relate the customers with the articles bought

### Articles

To study the content of the dataset, a good approach is to display the values of the first articles in the dataset.

In [3]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


We can see that most of the values are categorical (non-numerical), which means that we will need to analyse them and find a way to represent them numerically.

Apart from categorical data, there is another thing to be mindful of: missing data.

Missing data has to be handled first, so that any non-numerical values we fill in can later be encoded together with the rest of the categorical data.
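As a minimal sketch of how the categorical columns could be identified, the snippet below splits a toy table by dtype. The `toy_articles` frame is a hypothetical stand-in with a few illustrative rows, not the real dataset.

```python
import pandas as pd

# Toy stand-in for a few columns of the articles table (illustrative rows only).
toy_articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 110065001],
    "prod_name": ["Strap top", "Strap top", "OP T-shirt (Idro)"],
    "colour_group_code": [9, 10, 9],
    "colour_group_name": ["Black", "White", "Black"],
})

# Non-numeric columns are the candidates that would need encoding.
categorical_cols = toy_articles.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy_articles.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

On the real table, the same two calls on `articles` would list which columns need attention.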

#### Feature Selection

In [4]:
# TODO: Reason all the feature selection

Now we can drop columns that are not relevant for the purpose of this recommendation system.

Since the three datasets can be merged (transactions relate customers to articles), we need the article_id. It will also be useful later to link the recommendations with product images.

We will not be needing graphical_appearance_no, graphical_appearance_name, perceived_colour_value_id, garment_group_no, garment_group_name and detail_desc.

product_code, product_type_no and product_type_name seem to describe the same product grouping, so we keep only product_code. The same happens with colour_group_code and colour_group_name (we keep colour_group_code), with department_no and department_name (we keep department_no), and with section_no and section_name (we keep section_no). Finally, from index_code, index_name and index_group_no we keep index_group_no.
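One way to check that a code column and a name column really carry the same information is to test whether they map one-to-one. The sketch below does this on a toy frame. The helper `is_one_to_one` and the `toy` data are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy rows modelled on the articles sample shown above (illustrative only).
toy = pd.DataFrame({
    "colour_group_code": [9, 10, 11, 9],
    "colour_group_name": ["Black", "White", "Off White", "Black"],
    "section_no": [16, 16, 16, 61],
})

def is_one_to_one(df, col_a, col_b):
    """True if every value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

# Code and name describe the same thing, so one of them is redundant.
print(is_one_to_one(toy, "colour_group_code", "colour_group_name"))
# Colour and section are unrelated, so neither can replace the other.
print(is_one_to_one(toy, "colour_group_code", "section_no"))
```

If such a check failed for a pair on the real data, both columns would carry distinct information and dropping one would lose some of it.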

In [5]:
cleaned_articles = articles[['article_id','product_code','colour_group_code','department_no','index_group_no','section_no']]
cleaned_articles.head()

Unnamed: 0,article_id,product_code,colour_group_code,department_no,index_group_no,section_no
0,108775015,108775,9,1676,1,16
1,108775044,108775,10,1676,1,16
2,108775051,108775,11,1676,1,16
3,110065001,110065,9,1339,1,61
4,110065002,110065,10,1339,1,61


#### Missing Data

After selecting the features, we continue with the missing data, so that any values we fill in can later be encoded together with the rest of the categorical data.

We can see the number of missing ("null") values for each column:

In [6]:
cleaned_articles.isnull().sum()

article_id           0
product_code         0
colour_group_code    0
department_no        0
index_group_no       0
section_no           0
dtype: int64

After the feature selection we don't have any missing data.

#### Categorical Data

With our current features there is no categorical data to analyse, since every remaining column holds numerical codes.

### Customers

Let's take the same approach as with the articles dataset, displaying and analysing the first entries.

In [7]:
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


As with the articles dataset, most of the values are categorical (non-numerical), which means that we will need to analyse them and find a way to represent them numerically.

#### Feature Selection

In [8]:
# TODO: Reason all the feature selection

Just as we did before, let's start by dropping the columns that are not relevant for the purpose of this recommendation system.

We keep only the customer_id, which is needed to merge the data, and the customer's age.

In [9]:
cleaned_customers = customers[['customer_id','age']]
cleaned_customers.head()

Unnamed: 0,customer_id,age
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,49.0
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,25.0
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,24.0
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,54.0
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,52.0


#### Missing Data

Values marked as "NaN" are missing data.

We can count the missing values (both "null" and "NaN") for each column:

In [10]:
cleaned_customers.isnull().sum()

customer_id        0
age            15861
dtype: int64

This dataset has many more missing values than the articles dataset, so dropping the affected entries is probably not worthwhile.

However, age is the only feature for which we need to fill in missing data.

Looking at the number of entries:

In [11]:
cleaned_customers.shape[0]

1371980

The 15861 customers without an age are about 1.2% of the 1371980 entries. We prefer to keep those customers, so dropping the entries is not an option.
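The share of missing values can also be computed directly. The sketch below shows the pattern on a hypothetical `toy_customers` frame, then repeats the arithmetic with the two counts reported above.

```python
import numpy as np
import pandas as pd

# Toy customers table with one missing age (illustrative only).
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "age": [49.0, np.nan, 24.0, 54.0],
})

# isnull().mean() is the fraction of missing values per column.
missing_pct = toy_customers.isnull().mean() * 100
print(missing_pct)

# Same arithmetic with the counts from the cells above.
age_missing_pct = 15861 / 1371980 * 100
print(f"{age_missing_pct:.2f}% of customers have no age")
```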

Another way to solve the "missing data" issue is to fill the values. 

We can check the different values that the age feature takes across the dataset and find a good substitute for the missing ones.

In [12]:
cleaned_customers['age'].unique()

array([49., 25., 24., 54., 52., nan, 20., 32., 29., 31., 56., 75., 41.,
       27., 30., 48., 35., 22., 40., 38., 45., 68., 55., 19., 60., 44.,
       21., 26., 28., 53., 33., 17., 23., 51., 18., 34., 57., 47., 70.,
       50., 63., 58., 43., 67., 72., 42., 39., 79., 71., 59., 36., 62.,
       37., 46., 73., 64., 74., 61., 85., 69., 76., 66., 65., 82., 16.,
       90., 80., 78., 81., 84., 77., 97., 89., 83., 98., 88., 86., 87.,
       93., 91., 99., 96., 94., 92., 95.])

As we can see, the age range is quite large:

In [13]:
cleaned_customers['age'].min()

16.0

In [14]:
cleaned_customers['age'].max()

99.0

Since the ages of the customers vary from 16 to 99, it is a good idea to look at other metrics of the age feature to get a better idea of how to fill in the missing values.

In [15]:
cleaned_customers['age'].mean()

36.386964565794

In [16]:
cleaned_customers['age'].mode()

0    21.0
Name: age, dtype: float64

In [17]:
cleaned_customers['age'].median()

32.0

Comparing the three metrics, the mean (36.39) and the median (32) are fairly close, while the mode (21) is much lower. We could fill the missing values with the mode, but given the gap between that age and the mean, substituting the mode might distort the results. Thus, we use the mean rounded to the nearest integer (36) to fill the missing data.
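The comparison and the fill can be illustrated on a small toy column. This is only a sketch: `toy_customers` is hypothetical, and assigning the filled column back on an explicit copy is one possible way to write the fill, not necessarily the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy ages with a young mode and one older customer (illustrative only).
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d", "e"],
    "age": [21.0, 21.0, np.nan, 32.0, 70.0],
})

age = toy_customers["age"]
print("mean:", age.mean(), "median:", age.median(), "mode:", age.mode().iloc[0])

# Work on an explicit copy and assign the filled column back.
filled = toy_customers[["customer_id", "age"]].copy()
filled["age"] = filled["age"].fillna(round(age.mean()))
print(filled)
```

The original frame is left untouched, which makes it easy to compare the column before and after the fill.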

In [18]:
cleaned_customers = cleaned_customers.copy()  # work on an independent copy, not a slice of customers
cleaned_customers['age'] = cleaned_customers['age'].fillna(round(cleaned_customers['age'].mean()))


In [30]:
cleaned_customers.isnull().sum()

customer_id    0
age            0
dtype: int64

Now that we have finished filling all the missing data for this dataset, we have to analyse the categorical data.

#### Categorical Data

The only non-numerical column left is customer_id, which is an identifier used for merging and not a feature, so there is no categorical data to analyse.

### Transactions

Let's check the values of the first entries.

In [20]:
transaction_train.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


The main use of this dataset is to link articles and customers through their IDs. The prices could also be used to check whether price affects which items are chosen, and therefore what would be better to recommend given a customer's profile and preferences.

#### Feature Selection

The transaction date (t_dat) matters little for recommending products, unless an article happens to be in high demand at the moment, which we consider unlikely.

In summary we are only keeping the customer_id and the article_id to set a link between the three datasets.

In [21]:
cleaned_transaction_train = transaction_train[['customer_id','article_id']]
cleaned_transaction_train.head()

Unnamed: 0,customer_id,article_id
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023
2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004
3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003
4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004


#### Missing Data

In [22]:
cleaned_transaction_train.isnull().sum()

customer_id    0
article_id     0
dtype: int64

There are no missing values in this dataset after the feature selection. So we don't need to process the data to fill them up.

#### Categorical Data

There is no categorical data to analyse.

### Data Merge

Up to this point we have been processing three separate datasets. However, we need a single dataset for the model to run on and produce our recommended products.

As mentioned before, the transaction_train dataset relates the other two datasets, so we can use it to merge the data.

For readability, we rename the columns to indicate whether the data refers to the customer (c_ prefix) or the article (a_ prefix).

We start by merging the customers with the transactions on the customer_id feature.

In [23]:
customers_n_transactions = pd.merge(cleaned_transaction_train, cleaned_customers, on='customer_id', how='outer')
customers_n_transactions = customers_n_transactions.rename(columns={'age': 'c_age'})
customers_n_transactions.head()

Unnamed: 0,customer_id,article_id,c_age
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,24.0
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023.0,24.0
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,24.0
3,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,578020002.0,24.0
4,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,723529001.0,24.0


It is possible for some of the data to be missing after the merge if there are no direct relations. So let's first check if we have any missing data.
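A minimal sketch of where such gaps come from, using two hypothetical toy frames: with `how='outer'`, a customer who never appears in the transactions is still kept, with an empty `article_id`. The `indicator=True` option of `pd.merge` adds a `_merge` column that labels each row's origin.

```python
import pandas as pd

# Toy frames (illustrative only): customer "c" has no transactions.
toy_transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [663713001, 541518023, 505221004],
})
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age": [24.0, 30.0, 52.0],
})

merged = pd.merge(toy_transactions, toy_customers,
                  on="customer_id", how="outer", indicator=True)

# "right_only" rows are customers without any transaction.
print(merged["_merge"].value_counts())
# article_id becomes float64 because it now has to hold NaN.
print(merged.dtypes)
```

The dtype change also explains why `article_id` is displayed with a trailing `.0` after the merge.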

In [33]:
customers_n_transactions.isnull().sum()

customer_id       0
article_id     9699
c_age             0
dtype: int64

As we can see, some entries have no article_id: because the merge is an outer join, customers with no transactions are kept with an empty article. If these entries are a small share of the data, we can drop them completely.

In [36]:
( customers_n_transactions['article_id'].isnull().sum() * 100 ) / customers_n_transactions.shape[0]

0.030501896297137718

Since it is such a small percentage (0.03%), we can drop the entries with missing data.

In [41]:
customers_n_transactions = customers_n_transactions.dropna(subset=['article_id'])
customers_n_transactions.isnull().sum()

customer_id    0
article_id     0
c_age          0
dtype: int64

Now we can merge the articles to the resulting dataset.

In [42]:
complete_dataset = pd.merge(customers_n_transactions, cleaned_articles, on='article_id', how='outer')
complete_dataset = complete_dataset.rename(columns={'product_code': 'a_product_code', 
                                 'colour_group_code': 'a_colour_group_code', 
                                 'department_no': 'a_department_no', 
                                 'index_group_no': 'a_index_group_no',
                                 'section_no': 'a_section_no'})
complete_dataset.head()

Unnamed: 0,customer_id,article_id,c_age,a_product_code,a_colour_group_code,a_department_no,a_index_group_no,a_section_no
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,24.0,663713,9,1338,1,61
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001.0,24.0,663713,9,1338,1,61
2,1385e5f3a2d3dbd21237f91faf81254a6d96de31b07b0b...,663713001.0,25.0,663713,9,1338,1,61
3,1385e5f3a2d3dbd21237f91faf81254a6d96de31b07b0b...,663713001.0,25.0,663713,9,1338,1,61
4,3681748607f3287d2c3a65e00bb5fb153de30e9becf158...,663713001.0,30.0,663713,9,1338,1,61


We now need to check again for missing values.

In [43]:
complete_dataset.isnull().sum()

customer_id            995
article_id               0
c_age                  995
a_product_code           0
a_colour_group_code      0
a_department_no          0
a_index_group_no         0
a_section_no             0
dtype: int64

The missing values for customer_id and c_age come from the same source dataset, so they are most likely in the same 995 entries, that is, articles that do not appear in any transaction. Since 995 is an even smaller share of the set than the 9699 entries dropped earlier (0.03%), we can drop these entries as well.

In [45]:
complete_dataset = complete_dataset.dropna(subset=['customer_id', 'c_age'])
complete_dataset.isnull().sum()

customer_id            0
article_id             0
c_age                  0
a_product_code         0
a_colour_group_code    0
a_department_no        0
a_index_group_no       0
a_section_no           0
dtype: int64
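As a side note, an outer merge followed by dropping the unmatched rows keeps the same rows as an inner merge, provided the inputs had no missing values to begin with. The sketch below checks this on hypothetical toy frames. It is an alternative formulation, not the approach used above.

```python
import pandas as pd

# Toy frames (illustrative only): article 3 was never bought.
toy_customers_n_transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [1, 2, 1],
    "c_age": [24.0, 24.0, 30.0],
})
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "section_no": [61, 16, 16],
})

# Approach used in this notebook: outer merge, then drop unmatched rows.
outer_then_drop = pd.merge(
    toy_customers_n_transactions, toy_articles, on="article_id", how="outer"
).dropna(subset=["customer_id", "c_age"])

# Alternative: keep only matching rows from the start.
inner = pd.merge(toy_customers_n_transactions, toy_articles,
                 on="article_id", how="inner")

def normalise(df):
    """Sort rows so the two results can be compared regardless of row order."""
    return df.sort_values(["customer_id", "article_id"]).reset_index(drop=True)

same_rows = normalise(outer_then_drop).values.tolist() == normalise(inner).values.tolist()
print(same_rows)
```

The outer merge remains useful when the number of unmatched rows is itself of interest, as it is in the checks above.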

## Tests

In [63]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

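# Features: every column except customer_id. Note that X still contains article_id,
# which is also the target y, so the model sees the answer among its inputs.
X = complete_dataset.drop(['customer_id'], axis=1)
y = complete_dataset.article_id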

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [64]:
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [66]:
print(knn_model.predict(X_test))

[5.99718002e+08 6.89009001e+08 8.26492003e+08 ... 7.46509005e+08
 8.60949001e+08 4.96111020e+08]
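In the cell above, X keeps the article_id column, which is also the target, and the predictions are printed as floats because article_id became float64 during the outer merge. A possible variation, sketched here on a hypothetical `toy_complete` frame, removes the target from the features and casts it back to an integer. This is an assumption about how the split could be set up, not the configuration that produced the output above.

```python
import pandas as pd

# Toy stand-in for complete_dataset (illustrative rows and a subset of columns).
toy_complete = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c"],
    "article_id": [663713001.0, 541518023.0, 663713001.0, 505221004.0],
    "c_age": [24.0, 24.0, 25.0, 30.0],
    "a_section_no": [61, 16, 61, 16],
})

target_col = "article_id"

# Target: article IDs as integers, so predictions read as IDs and not floats.
y = toy_complete[target_col].astype("int64")

# Features: drop the identifier and the target itself.
X = toy_complete.drop(columns=["customer_id", target_col])

print(X.columns.tolist())
print(y.tolist())
```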
