# H&M Personalized Fashion Recommendations

First we import all the required external libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

## Exploratory Data Analysis (EDA)

We start by analysing all the data available for this project, which means first reading all the provided datasets.

In [2]:
articles = pd.read_csv('articles.csv')
customers = pd.read_csv('customers.csv')
transaction_train = pd.read_csv('transactions_train.csv')
sample_submission = pd.read_csv('sample_submission.csv')

Of those, we have three datasets to analyse and use for the recommendation system:
- Articles: clothing products information
- Customers: details about single customers
- Transactions: relate the customers with the articles bought
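By default, `pd.read_csv` infers a type for every column, so identifier columns such as `article_id` are read as integers. The following is a minimal sketch, assuming we would rather keep identifiers as text (for example, to preserve any leading zeros). The in-memory CSV is a made-up stand-in for `articles.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for articles.csv
csv_text = "article_id,prod_name\n0108775015,Strap top\n0108775044,Strap top\n"

# Default inference: article_id becomes an integer and loses the leading zero
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: article_id is kept exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(inferred["article_id"].tolist())
print(as_text["article_id"].tolist())
```

Whichever representation is chosen, it should be the same in every dataset that shares the identifier, otherwise later joins on that column will not match.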

### Articles

To study the content of the dataset, a good approach is to display the values of the first articles in the dataset.

In [3]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


We can see that most of the values are categorical (non-numerical), which means that we will need to analyse the data and find a way to match the values.
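One way to make this observation concrete is to split the columns by how they are stored and count the distinct values in the text columns. This is only a sketch on a toy frame with made-up rows that mimic a few columns of the articles dataset. Note that some numeric columns (such as `product_type_no`) appear to be codes paired with a name column, so "stored as a number" does not necessarily mean "numerical feature".

```python
import pandas as pd

# Toy frame with made-up rows that mimic a few columns of the articles dataset
toy_articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 110065001],
    "product_type_no": [253, 253, 306],
    "product_type_name": ["Vest top", "Vest top", "Bra"],
    "colour_group_name": ["Black", "White", "Black"],
})

# Columns stored as numbers versus everything else (text columns)
numeric_cols = toy_articles.select_dtypes(include="number").columns.tolist()
text_cols = toy_articles.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per text column
distinct_counts = toy_articles[text_cols].nunique()

print(numeric_cols)
print(text_cols)
print(distinct_counts)
```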

Apart from categorical data, there is another thing to be mindful of: missing data.

We start with the missing data so that, when we later process the categorical values, the filled-in values are processed along with them.


#### Missing Data

We can see the number of missing ("null") values for each column:

In [4]:
articles.isnull().sum()

article_id                        0
product_code                      0
prod_name                         0
product_type_no                   0
product_type_name                 0
product_group_name                0
graphical_appearance_no           0
graphical_appearance_name         0
colour_group_code                 0
colour_group_name                 0
perceived_colour_value_id         0
perceived_colour_value_name       0
perceived_colour_master_id        0
perceived_colour_master_name      0
department_no                     0
department_name                   0
index_code                        0
index_name                        0
index_group_no                    0
index_group_name                  0
section_no                        0
section_name                      0
garment_group_no                  0
garment_group_name                0
detail_desc                     416
dtype: int64

Only one column, detail_desc, has missing values: 416 of them.

To put that in context, the dataset has the following number of entries:

In [5]:
articles.shape[0]

105542

Missing values therefore affect only 416 of the 105,542 entries, all in the same column. As a percentage of the whole dataset:

In [6]:
(416 / 105542) * 100

0.3941558810710428
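The same percentage can be obtained without typing the counts by hand. As a small sketch on a toy frame with made-up values: `isnull()` returns booleans, so the mean of each column is the fraction of missing rows.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; one of the four descriptions is missing
toy = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "detail_desc": ["Jersey top", np.nan, "Bra", "Socks"],
})

# isnull() gives booleans, so the column mean is the fraction of missing rows
missing_pct = toy.isnull().mean() * 100

print(missing_pct)
```

Applied to `articles`, the same expression would report the percentage for every column at once.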

Only about 0.4% of the articles dataset is affected by missing data, so we can drop the entries that have an empty value in the detail_desc column.

In [7]:
articles = articles.dropna()

We no longer have missing values in the dataset.

In [8]:
articles.isnull().sum()

article_id                      0
product_code                    0
prod_name                       0
product_type_no                 0
product_type_name               0
product_group_name              0
graphical_appearance_no         0
graphical_appearance_name       0
colour_group_code               0
colour_group_name               0
perceived_colour_value_id       0
perceived_colour_value_name     0
perceived_colour_master_id      0
perceived_colour_master_name    0
department_no                   0
department_name                 0
index_code                      0
index_name                      0
index_group_no                  0
index_group_name                0
section_no                      0
section_name                    0
garment_group_no                0
garment_group_name              0
detail_desc                     0
dtype: int64
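Called without arguments, `dropna()` removes a row if any of its columns is missing. Here that gives the intended result, because detail_desc is the only column with missing values. A slightly more explicit variant, sketched below on a toy frame with made-up rows, names the column through the `subset` parameter, so the intent stays clear even if other columns gain missing values later.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; the second description is missing
toy = pd.DataFrame({
    "article_id": [1, 2, 3],
    "prod_name": ["Strap top", "Strap top (1)", "OP T-shirt"],
    "detail_desc": ["Jersey top", np.nan, "T-shirt bra"],
})

# Only rows with a missing detail_desc are dropped; other columns are ignored
cleaned = toy.dropna(subset=["detail_desc"])

print(len(toy), len(cleaned))
```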

#### Categorical Data

In [9]:
# TODO: Analyse and process categorical data for "Articles"

### Customers

Let's take the same approach as with the articles dataset and start by displaying and analysing the first entries.

In [11]:
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


As with the articles dataset, most of the values are categorical (non-numerical), which means that we will need to analyse the data and find a way to match the values.

#### Missing Data

We can see several values marked as "NaN" which means that those values are actually missing data.

We can see the number of missing values (both "null" and "NaN") for each column:

In [12]:
customers.isnull().sum()

customer_id                    0
FN                        895050
Active                    907576
club_member_status          6062
fashion_news_frequency     16009
age                        15861
postal_code                    0
dtype: int64

This dataset has many more missing values, so it probably won't be worth dropping the entries that contain them.

Looking at the number of entries:

In [13]:
customers.shape[0]

1371980

Taking only the "Active" feature into account, 907,576 of the 1,371,980 entries (about 66%) have missing data, so dropping those entries is not an option.

Another way to solve the "missing data" issue is to fill in the values.

We can check the different values that each feature takes and find a good substitute for the missing ones.

In [14]:
customers['FN'].unique()

array([nan,  1.])

We have two possible values for "FN": "nan" and "1.". To handle the missing values ("nan"), we can replace them with 0.

In [45]:
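# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
customers['FN'] = customers['FN'].fillna(0)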

In [46]:
customers['FN'].isnull().sum()

0

Now we don't have missing values in the "FN" feature. 

Let's jump to the next feature, "Active".

In [17]:
customers['Active'].unique()

array([nan,  1.])

As with "FN", we have two possible values for "Active": "nan" and "1.". To handle the missing values ("nan"), we can replace them with 0.

In [101]:
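# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
customers['Active'] = customers['Active'].fillna(0)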

In [102]:
customers['Active'].isnull().sum()

0
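Since "FN" and "Active" are handled in exactly the same way, both can be filled in one loop. The sketch below uses a toy frame with made-up values. The cast to integer is an assumption on our part (the flags only take the values 0 and 1), not something the steps above require.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values for the two flag columns
toy_customers = pd.DataFrame({
    "FN": [np.nan, 1.0, np.nan],
    "Active": [np.nan, 1.0, 1.0],
})

flag_cols = ["FN", "Active"]
for col in flag_cols:
    # Missing flags become 0, then the column is stored as integers
    toy_customers[col] = toy_customers[col].fillna(0).astype(int)

print(toy_customers)
```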

Jumping to the next feature with missing data, we have "club_member_status".

In [19]:
customers['club_member_status'].unique()

array(['ACTIVE', nan, 'PRE-CREATE', 'LEFT CLUB'], dtype=object)

We have 4 possible values for this feature and, given their names, we can interpret them as:

- ACTIVE: active club member
- nan: missing value
- PRE-CREATE: in the process of becoming a member
- LEFT CLUB: unsubscribed from the club membership

Given those statuses, there are two possible assumptions: either the missing values are customers who are not and have never been members of the club, so we can fill those values with "NO MEMBER", or they are inactive members (members of the club with little to no participation or presence), so we can call them "INACTIVE". After discussing it, we agreed that the first option made more sense in the context of the whole dataset.

In [43]:
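# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
customers['club_member_status'] = customers['club_member_status'].fillna('NO MEMBER')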

In [44]:
customers['club_member_status'].isnull().sum()

0
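Before settling on a label for the missing statuses, it can help to see how many customers fall into each category, including the missing ones. A minimal sketch on a toy series with made-up values: `value_counts(dropna=False)` keeps the missing values as their own row.

```python
import numpy as np
import pandas as pd

# Toy series with made-up values using the four statuses seen above
status = pd.Series(
    ["ACTIVE", np.nan, "PRE-CREATE", "ACTIVE", "LEFT CLUB", np.nan],
    name="club_member_status",
)

# dropna=False keeps missing values as their own row in the counts
counts_before = status.value_counts(dropna=False)

# After filling, the missing rows show up under the new label
filled = status.fillna("NO MEMBER")
counts_after = filled.value_counts()

print(counts_before)
print(counts_after)
```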

Moving on to the next feature with missing data, we find "fashion_news_frequency". Besides filling the missing values, we also normalise the different spellings of "none" that it contains.

In [103]:
# Normalise the different spellings of "none" ('NONE', 'None') to a single format
none_spellings = customers['fashion_news_frequency'].isin(['NONE', 'None'])
customers.loc[none_spellings, 'fashion_news_frequency'] = 'none'
# Fill missing values with an explicit "No news subscription" label
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].fillna('No news subscription')

In [104]:
customers['fashion_news_frequency'].isnull().sum()

0
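The normalisation of the "none" spellings can also be written as a single mapping. This is a sketch on a toy series with made-up values, using `Series.replace` with a dictionary. Values that are not in the dictionary are left unchanged.

```python
import numpy as np
import pandas as pd

# Toy series with made-up values, including two spellings of "none"
news = pd.Series(["NONE", "Regularly", "None", np.nan, "Monthly"])

# Map every spelling of "none" to one label; other values are left unchanged
normalised = news.replace({"NONE": "none", "None": "none"})

# Missing values get their own explicit label
normalised = normalised.fillna("No news subscription")

print(normalised.tolist())
```

Whether missing values should get their own label, as here, or be merged into "none" is a modelling decision. The sketch simply follows the label used in the cell above.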

Lastly, the remaining missing data is from the "age" feature.

In [27]:
customers['age'].unique()

array([49., 25., 24., 54., 52., nan, 20., 32., 29., 31., 56., 75., 41.,
       27., 30., 48., 35., 22., 40., 38., 45., 68., 55., 19., 60., 44.,
       21., 26., 28., 53., 33., 17., 23., 51., 18., 34., 57., 47., 70.,
       50., 63., 58., 43., 67., 72., 42., 39., 79., 71., 59., 36., 62.,
       37., 46., 73., 64., 74., 61., 85., 69., 76., 66., 65., 82., 16.,
       90., 80., 78., 81., 84., 77., 97., 89., 83., 98., 88., 86., 87.,
       93., 91., 99., 96., 94., 92., 95.])

As we can see, the age range is quite large:

In [28]:
customers['age'].min()

16.0

In [29]:
customers['age'].max()

99.0

Since the customers' ages vary from 16 to 99, it is a good idea to look at other statistics of the age feature to decide how best to fill in the missing values.

In [30]:
customers['age'].mean()

36.386964565794

In [31]:
customers['age'].mode()

0    21.0
dtype: float64

In [32]:
customers['age'].median()

32.0

Comparing the mean, mode and median, the mean (36.39) and the median (32) are fairly close, while the mode (21) is much lower. We could fill the missing values with the mode, but given how far it is from the mean, doing so might skew the results. Thus, we use the rounded mean (36) to fill the missing data.

In [33]:
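# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
customers['age'] = customers['age'].fillna(round(customers['age'].mean()))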

In [34]:
customers['age'].unique()

array([49., 25., 24., 54., 52., 36., 20., 32., 29., 31., 56., 75., 41.,
       27., 30., 48., 35., 22., 40., 38., 45., 68., 55., 19., 60., 44.,
       21., 26., 28., 53., 33., 17., 23., 51., 18., 34., 57., 47., 70.,
       50., 63., 58., 43., 67., 72., 42., 39., 79., 71., 59., 62., 37.,
       46., 73., 64., 74., 61., 85., 69., 76., 66., 65., 82., 16., 90.,
       80., 78., 81., 84., 77., 97., 89., 83., 98., 88., 86., 87., 93.,
       91., 99., 96., 94., 92., 95.])
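The comparison between mean, median and mode can be reproduced on a small example. The sketch below uses a toy series with made-up ages, not the real data. It fills with the rounded mean, as done above. The median is mentioned only as a possible alternative when a few high ages pull the mean upwards.

```python
import numpy as np
import pandas as pd

# Toy series with made-up ages; the last two are missing
ages = pd.Series([21.0, 21.0, 24.0, 32.0, 54.0, 70.0, np.nan, np.nan])

# Summary statistics ignore missing values by default
mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]

# Fill with the rounded mean; ages.fillna(median_age) would be the alternative
filled = ages.fillna(round(mean_age))

print(mean_age, median_age, mode_age)
print(filled.tolist())
```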

In [47]:
customers['age'].isnull().sum()

0

Now that we have finished filling all the missing data for this dataset, we have to analyse the categorical data.

#### Categorical Data

In [None]:
# TODO: Analyse and process categorical data for "Customers"

### Transactions

Let's check the values of the first entries.

In [40]:
transaction_train.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


The main use of this dataset is to link articles and customers through their IDs. We can also use the prices to check whether price influences which items are chosen and, therefore, what would be better to recommend according to each customer's profile and preferences.
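Linking the three datasets through their IDs can be sketched with `DataFrame.merge`. The frames below are toy stand-ins with made-up rows (the customer IDs, product types and ages are hypothetical). Only the column names follow the real datasets.

```python
import pandas as pd

# Toy stand-ins with made-up rows; only the column names follow the real data
toy_transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [663713001, 541518023, 505221004],
    "price": [0.050831, 0.030492, 0.015237],
})
toy_articles = pd.DataFrame({
    "article_id": [663713001, 541518023, 505221004],
    "product_type_name": ["Bra", "Vest top", "Trousers"],
})
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "age": [24.0, 32.0],
})

# Left joins keep every transaction, even if a lookup row were missing
enriched = (
    toy_transactions
    .merge(toy_articles, on="article_id", how="left")
    .merge(toy_customers, on="customer_id", how="left")
)

print(enriched)
```

Each row of the result is one purchase together with the article and customer attributes, which is the shape needed to study, for example, how price relates to what different customers buy.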

#### Missing Data

In [42]:
transaction_train.isnull().sum()

t_dat               0
customer_id         0
article_id          0
price               0
sales_channel_id    0
dtype: int64

There are no missing values in this dataset, so there is nothing to fill in.

#### Categorical Data

In [None]:
# TODO: Analyse and process categorical data for "Transactions"