my first notebook

thank you DANIIL KARPOV https://www.kaggle.com/code/vanguarde/h-m-eda-first-look for giving me all the inspiration

## **<span style="color:#008000;font-size:490%"><center>EDA</center></span><span style="color:#008000;font-size:200%"><center>Exploratory Data Analysis. H&M</center></span>**

# Introduction

For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict which articles each customer will purchase in the 7-day period immediately after the training data ends. Customers who did not make any purchase during that time are excluded from the scoring.


The dataset contains 4 csv files and one folder with several subfolders, each with a different number of images.

In this Exploratory Data Analysis notebook we will look at the data: analyze the content of each csv file, check for missing data, understand the data distributions, and see how the data in the various files relate to each other.

We will also explore the image data, understand how images are indexed in the csv files, if there are articles in the dataset without images. We will also explore image additional information, like image width and height.

We also investigate a very simple baseline model and create an initial submission.

For the baseline model I might use weekly bestsellers.
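
A minimal sketch of what a weekly-bestsellers baseline could look like, run on made-up toy data. The function name `weekly_bestsellers` and the default `top_n=12` are assumptions for illustration; the column names follow `transactions_train.csv`.

```python
import pandas as pd

def weekly_bestsellers(transactions: pd.DataFrame, top_n: int = 12) -> list:
    """Return the top_n most purchased article_ids in the last 7 days of data."""
    t_dat = pd.to_datetime(transactions["t_dat"])
    # Everything strictly after this cutoff counts as "the last week".
    cutoff = t_dat.max() - pd.Timedelta(days=7)
    last_week = transactions[t_dat > cutoff]
    return last_week["article_id"].value_counts().head(top_n).index.tolist()

# Toy data standing in for transactions_train.csv (values are made up).
toy = pd.DataFrame({
    "t_dat": ["2020-09-01", "2020-09-20", "2020-09-21",
              "2020-09-21", "2020-09-22", "2020-09-22"],
    "customer_id": ["a", "a", "b", "c", "c", "d"],
    "article_id": [111, 222, 222, 333, 222, 333],
})
best = weekly_bestsellers(toy, top_n=2)
print(best)
```

The same list would then be predicted for every customer, which makes it a useful floor to compare any personalized model against.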

![image.png](attachment:263a5724-ec87-4151-86a2-a85966720b76.png)

# Analysis preparation

We will include here the required packages for reading, parsing, filtering, processing, visualizing the data, both tabular and image.

![image.png](attachment:58798c57-524a-45c9-b7cc-6f4f6925a0dd.png)


In [1]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
# from wordcloud import WordCloud, STOPWORDS
from datetime import datetime
from PIL import Image
from plotnine import *

In [2]:
articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
transactions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

### First, let's go through the articles that are sold at H&M

In [3]:
len(articles.columns)

So we have a total of 25 columns. What are they, and what are their characteristics?

In [4]:
articles.columns

article_id : A unique identifier of every article.

product_code, prod_name : A unique identifier of every product and its name (the two are not the same).

product_type_no, product_type_name : The product type (a group of product codes) and its name.

product_group_name : The name of the product group (a group of product types).

graphical_appearance_no, graphical_appearance_name : The group of graphics and its name.

colour_group_code, colour_group_name : The group of color and its name.

perceived_colour_value_id, perceived_colour_value_name, perceived_colour_master_id, perceived_colour_master_name : The added color info.

department_no, department_name : A unique identifier of every department and its name.

index_code, index_name : A unique identifier of every index and its name.

index_group_no, index_group_name : A group of indices and its name.

section_no, section_name : A unique identifier of every section and its name.

garment_group_no, garment_group_name : A unique identifier of every garment group and its name.

detail_desc : Details.

I notice that the data here is categorical: each feature is a category code (index) or its name.

So let's go through this data in a question-and-answer style!

First, let's clean up the data a little bit. I want to get rid of collinearity among the columns.

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
def customer_hex_id_to_int(series):
    return series.str[-16:].apply(hex_id_to_int)

def hex_id_to_int(hex_str):
    # Parse the last 16 hex characters of the id as an integer.
    return int(hex_str[-16:], 16)

def article_id_str_to_int(series):
    return series.astype('int32')

def article_id_int_to_str(series):
    return '0' + series.astype('str')

class Categorize(BaseEstimator, TransformerMixin):
    def __init__(self, min_examples=0):
        self.min_examples = min_examples
        self.categories = []
        
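    def fit(self, X, y=None):
        # Reset on every fit so that refitting does not append duplicate category lists.
        self.categories = []
        for i in range(X.shape[1]):
            vc = X.iloc[:, i].value_counts()
            # Keep only values seen more than min_examples times; other values get code -1.
            self.categories.append(vc[vc > self.min_examples].index.tolist())
        return self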

    def transform(self, X):
        data = {X.columns[i]: pd.Categorical(X.iloc[:, i], categories=self.categories[i]).codes for i in range(X.shape[1])}
        return pd.DataFrame(data=data)

In [6]:
articles.article_id = article_id_str_to_int(articles.article_id)

In [7]:
encoded_articles = articles.copy()

In [8]:
for col in encoded_articles.columns:
    if encoded_articles[col].dtype == 'object':
        encoded_articles[col] = Categorize().fit_transform(articles[[col]])[col]

for col in encoded_articles.columns:
    if encoded_articles[col].dtype == 'int64':
        encoded_articles[col] = encoded_articles[col].astype('int32')

In [9]:
sns.heatmap(encoded_articles.corr());

In [10]:
encoded_articles = encoded_articles.drop(columns = ['product_code','prod_name','product_type_name','product_group_name','graphical_appearance_name','colour_group_name','perceived_colour_value_name','perceived_colour_master_name','department_name','index_name','index_group_name','section_name','garment_group_name','detail_desc'])

In [11]:
sns.heatmap(encoded_articles.corr());

First I want to take out the columns that have high collinearity.

I picked the highly collinear variables by eye and removed them. That's all I will do for now.
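
As a complement to checking the heatmap by eye, the highly correlated column pairs could also be listed programmatically. This is only a sketch on toy data: the helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions, not something used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (made up): "b" is an exact multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data one would call it on `encoded_articles` and then decide which column of each pair to drop.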

In [12]:
import seaborn as sns
import matplotlib.pyplot as plt

Lets peek into what are the things in articles.

In [13]:
articles.columns

In [14]:
# Count how many articles (variations) share each product_code, most frequent first.
product_count = articles['product_code'].value_counts().rename_axis('product_code').reset_index(name='count')
product_count

In [15]:
name_code = articles[['prod_name','product_code']].drop_duplicates(['product_code'])

In [16]:
product_distribution = product_count.merge(name_code, left_on='product_code',right_on='product_code')
product_distribution.describe()

The majority of products have only 1-2 variations. We may need to remove outliers, such as the product with 75 variations.
We could set a threshold and take out products that have more than 10 variations.

Or it may not be necessary.
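
If we did decide to filter, a threshold rule could look roughly like the sketch below. It runs on made-up toy data, and `max_variations = 3` is a hypothetical value chosen for the toy example (the text above suggests 10 for the real data).

```python
import pandas as pd

# Toy stand-in for articles.csv (values are made up).
toy_articles = pd.DataFrame({
    "article_id":   [1, 2, 3, 4, 5, 6, 7],
    "product_code": [10, 10, 10, 10, 20, 30, 30],
})

# Number of distinct articles (variations) per product_code.
variation_count = toy_articles.groupby("product_code")["article_id"].nunique()

max_variations = 3  # hypothetical threshold for this toy example
outlier_codes = variation_count[variation_count > max_variations].index

# Drop every article that belongs to a product with too many variations.
filtered = toy_articles[~toy_articles["product_code"].isin(outlier_codes)]
print(variation_count.to_dict())
print(len(filtered))
```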

In [17]:
plt.hist(product_distribution['count'])
plt.show() 

In [18]:
plt.boxplot(product_distribution['count'])

In [19]:
name_code.head(2)



<font size="5">Looking at the product codes, we can see that the top one percent of products have many different variations. The product with the most variations is actually kids' clothing!
Whereas 90% of the products have only 1-2 variations on average. </font>

<font size="5">From here we can note that the same product usually comes in 1-2 variations. </font>

In [20]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(product_count.index[:2000],product_count['count'][:2000])
plt.ylim([0, 100])
plt.show()

In [21]:
product_name = articles[['product_code','prod_name']]

In [22]:
product_count.merge(product_name, how = "left", left_on="product_code",right_on="product_code").sort_values(by='count')

<font size="5">I notice that the product with the most variations is kids' clothing!
</font>

<font size="5">
        prod_name can be really creative, and can definitely contribute later. For now, I may skip this variable. 
</font>


<font size="5">
Now I can ask a better question: what product types do we have here?
</font>

In [23]:
product_type = articles[['product_code', 'product_type_no','product_type_name']]

In [24]:
product_type_name = articles[['product_type_no','product_type_name']]
product_type_name = product_type_name.drop_duplicates()

In [25]:
product_type_name

<font size="5">
We have 132 unique product types across all our articles. Let's see how they are distributed!
</font>

In [26]:
product_type = pd.DataFrame(product_type.groupby(by=["product_type_no"])['product_type_name'].count().sort_values(ascending=False))
product_type = product_type.reset_index()
product_type.columns = ['product_type_no', 'product_type_count']

In [27]:
product_type

In [28]:
type_no_to_name = dict(zip(product_type_name['product_type_no'],product_type_name['product_type_name']))

In [29]:
product_type['product_type_name'] = product_type['product_type_no'].apply(lambda x : type_no_to_name[x]) 

In [30]:
product_type.describe()

<font size="5">
There are many product types, and the number of articles per type varies a lot.
</font>

In [31]:
product_type

In [32]:
product_type['product_type_count'].quantile(0.8)

In [33]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(product_type['product_type_name'][:10],product_type['product_type_count'][:10])
plt.xticks(rotation='vertical')
plt.show()

In [34]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(product_type['product_type_name'][10:20],product_type['product_type_count'][10:20])
plt.xticks(rotation='vertical')
plt.show()

<font size="5">
Seems like a handful of product types account for most of the articles.
</font>

In [35]:
plt.pie(product_type['product_type_count'], labels = product_type['product_type_name'])
plt.show()

In [36]:
sum(product_type.head(5).product_type_count) / sum(product_type.product_type_count)

In [37]:
# calculate the percentage of top 5 in all the articles

product_type[product_type['product_type_name'].isin(['Top','T-shirt','Sweater','Dress','Trousers'])]['product_type_count'].sum() / product_type['product_type_count'].sum()


<font size="5">
The distribution is highly skewed: a few types such as Trousers, Dress, Sweater, T-shirt, Top and Blouse take up more than 40% of the articles at H&M.
</font>


<font size="4">
I was exploring what types of articles we have, and I found that the 5 most common types, 'Top', 'T-shirt', 'Sweater', 'Dress' and 'Trousers', make up 40% of the articles sold at H&M. That's crazy! I was expecting a more even distribution among all the product types.
</font>
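
The "top N types make up X% of articles" number can also be read off a cumulative share. Below is a small sketch on made-up toy data; the names `toy_types`, `share` and `cumulative_share` are only for illustration.

```python
import pandas as pd

# Toy stand-in for articles['product_type_name'] (values are made up).
toy_types = pd.Series(
    ["Trousers"] * 4 + ["Dress"] * 3 + ["Sweater"] * 2 + ["Hat"],
    name="product_type_name",
)

# Share of articles per type, sorted from most to least common.
share = toy_types.value_counts(normalize=True)

# Running total: entry k is the share covered by the k+1 most common types.
cumulative_share = share.cumsum()
top_2_share = cumulative_share.iloc[1]
print(cumulative_share)
```

On the real data, `cumulative_share.iloc[4]` would give the share of the top 5 types without hard-coding their names.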

In [38]:
product_group = pd.DataFrame(articles[['article_id','product_group_name']].groupby('product_group_name')['article_id'].count().sort_values(ascending=False))
product_group = product_group.reset_index()
product_group.columns = ['product_group_name','count']
product_group

In [39]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(product_group['product_group_name'],product_group['count'])
plt.xticks(rotation='vertical')
plt.show()

In [40]:
plt.pie(product_group['count'], labels = product_group['product_group_name'])
plt.show()


So far we have looked at the data one dimension at a time. Now let's add a dimension and explore more!

In [41]:
articles.groupby(['product_group_name','index_name'])['article_id'].count().sort_values()

In [42]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=articles, y='product_group_name', color='orange', hue='index_group_name', multiple="stack")
# The y-axis shows product_group_name, so label the axes accordingly.
ax.set_xlabel('count by product group')
ax.set_ylabel('product group')
plt.show()

### I definitely notice that *ladies' and children's* articles dominate almost all the product groups.
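
To put numbers on this impression, a row-normalized cross-tabulation is one option. The sketch below uses made-up toy rows; only the column names are taken from `articles.csv`.

```python
import pandas as pd

# Toy stand-in for articles[['product_group_name', 'index_group_name']] (made up).
toy = pd.DataFrame({
    "product_group_name": ["Shoes", "Shoes", "Shoes", "Shoes", "Socks", "Socks"],
    "index_group_name": ["Ladieswear", "Ladieswear", "Baby/Children",
                         "Menswear", "Ladieswear", "Menswear"],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the share of every index group within a product group.
share = pd.crosstab(toy["product_group_name"], toy["index_group_name"], normalize="index")
print(share)
```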

In [43]:
articles.groupby(['index_group_name','index_name']).count()['article_id']

In [44]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=articles, y='index_name', color='orange', hue='index_group_name', multiple="stack")
ax.set_xlabel('count by index_name')
ax.set_ylabel('index_name')
plt.show()

In [45]:
pd.options.display.max_rows = None

articles.groupby(['product_group_name', 'product_type_name','index_group_name']).count()['article_id']
#.get("Accessories").sort_values()

Now we have some understanding of the articles: most of them are for women and children.

Next we can take a look at the customers and check whether there are any interesting findings!

In [46]:
# missing data percentage
customers.isna().sum() / customers.shape[0]

In [47]:
# Age is missing for some customers and is hard to infer from this table alone;
# we could later try to guess it from what they buy.

age_na_cus = customers.age.isna()

# Keep only the customers whose age is known.

customers_age = customers[~age_na_cus]

Let's check out the distribution of age.

In [48]:
customers_age = customers_age.sort_values(by='age',ascending=False)

In [None]:
customers_age

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers_age, x='age', bins=20, color='orange')
ax.set_xlabel('Distribution of the customers age')
plt.show()

There is not much in the customer data besides age, news subscription, and club member status.

Now I want to find out how much the same customer spends per month between 2019-09-22 and 2020-09-22.
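
One possible way to get spend per customer per month is a group-by on customer and calendar month. This is a sketch on made-up toy data, not the computation used later in this notebook; the `month` column is created here only for the example.

```python
import pandas as pd

# Toy stand-in for the transactions table (values are made up).
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2019-10-01", "2019-10-15", "2019-11-02", "2019-10-03"]),
    "customer_id": ["a", "a", "a", "b"],
    "price": [0.02, 0.03, 0.01, 0.05],
})

# One row per customer, one column per calendar month, values are summed prices.
monthly = (
    toy.assign(month=toy["t_dat"].dt.to_period("M"))
       .groupby(["customer_id", "month"])["price"]
       .sum()
       .unstack(fill_value=0.0)
)
print(monthly)
```

Using a year-month period instead of just the month number keeps September 2019 and September 2020 apart.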

In [None]:
transactions = transactions.sort_values(['customer_id'])

In [None]:
transactions.t_dat

In [None]:
transactions.describe()

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=transactions, x='price', bins=100, color='orange')
ax.set_xlabel('Distribution of the customer spending')
plt.xlim([0,0.2])
plt.show()

In [None]:
transactions.groupby('customer_id')

In [None]:
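# Take an explicit copy so that the column assignments below do not trigger SettingWithCopyWarning.
last_year_tran = transactions[transactions['t_dat'] >= '2019-09-22'].copy()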

In [None]:
last_year_tran.t_dat = pd.to_datetime(last_year_tran.t_dat)

In [58]:
last_year_tran['month'] = last_year_tran.t_dat.dt.month

# import cudf
# train = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
# train['customer_id'] = train['customer_id'].str[-16:].str.hex_to_int().astype('int64')
# train['article_id'] = train.article_id.astype('int32')
# train.t_dat = cudf.to_datetime(train.t_dat)
# train = train[['t_dat','customer_id','article_id']]
# train.to_parquet('train.pqt',index=False)
# print( train.shape )
# train.head()

How many unique customers made a purchase in the last year of data (2019-09-22 to 2020-09-22)?

In [59]:
last_year_tran['customer_id'].nunique()

**<span style="color:#008000;"> Let's see the general summary about the transaction</span>**

So I want to see the total number of items each customer bought in the last year of transactions.

In [60]:
cus_bought_count = pd.DataFrame(last_year_tran.groupby('customer_id')['article_id'].count().sort_values(ascending=False))

In [None]:
last_year_tran

In [None]:
cus_bought_count = cus_bought_count.reset_index()
cus_bought_count.columns = ['customer_id','article_count']

In [None]:
cus_bought_count.describe()

I noticed a huge discrepancy between the top 10% of customers and the remaining 90%.

I may want to consider the top 10% to be loyal customers, since they purchase about 3 or more items every month on average (36 or more over the year).

The remaining 90% make occasional purchases.
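
Instead of hard-coding the cut-off, it could be derived from the 90th percentile of the per-customer counts. A small sketch on made-up toy counts follows; `toy_counts`, `threshold` and `loyal` are names used only here.

```python
import pandas as pd

# Toy stand-in for cus_bought_count (values are made up).
toy_counts = pd.DataFrame({
    "customer_id": list("abcdefghij"),
    "article_count": [1, 2, 2, 3, 3, 4, 5, 6, 8, 50],
})

# 90th percentile of items bought; customers above it form the top ~10%.
threshold = toy_counts["article_count"].quantile(0.9)
loyal = toy_counts[toy_counts["article_count"] > threshold]
print(threshold)
print(loyal)
```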

In [None]:
cus_bought_count[cus_bought_count['article_count'] >= 36]['article_count'].describe()

In [None]:
cus_bought_count[cus_bought_count['article_count'] < 36]['article_count'].describe()

Now I have a business-related question: what percentage of last year's revenue are the top 10% of customers responsible for?


In [None]:
cus_spent = pd.DataFrame(last_year_tran.groupby('customer_id')['price'].sum().sort_values(ascending=False)).reset_index()

In [None]:
cus_spent.head(2)

In [None]:
top_10 = cus_bought_count[cus_bought_count['article_count'] >= 36]

In [None]:
top_10_count_spent = top_10.merge(cus_spent, how='left', left_on='customer_id',right_on='customer_id')

In [None]:
top_10_count_spent.sample(3)

In [None]:
top_10_count_spent.price.sum() / last_year_tran.price.sum()

After some digging, I found that the top 10% of customers are responsible for 45% of sales in the last year of data (2019-09-22 to 2020-09-22).

If I were the CEO, I would make sure those 10% of customers get the best possible experience, to maintain the relationship with them.
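
The revenue-share calculation can be condensed into a single group-by. The sketch below runs on made-up toy data and, like the analysis above, ranks customers by number of items bought rather than by amount spent; `per_customer`, `n_top` and `revenue_share` are illustrative names.

```python
import pandas as pd

# Toy stand-in for last_year_tran: customer "a" buys 5 items, nine others buy 1 each.
toy = pd.DataFrame({
    "customer_id": ["a"] * 5 + ["b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "price": [0.1] * 5 + [0.05] * 9,
})

# Items bought and total spend per customer.
per_customer = toy.groupby("customer_id")["price"].agg(article_count="count", spent="sum")

# Take the top 10% of customers by number of items bought (at least one customer).
n_top = max(1, int(len(per_customer) * 0.10))
top = per_customer.nlargest(n_top, "article_count")

revenue_share = top["spent"].sum() / per_customer["spent"].sum()
print(round(revenue_share, 3))
```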

In [None]:
plt.boxplot(cus_bought_count['article_count'])

Now I want to know what items a customer bought throughout the entire year. Can we guess their spending behavior, and potentially their age and gender, from it?

In [None]:
#I will start with top 10% customers. Aka loyal customers.
top_10.head(1)

I am going to explore what our \#1 customer bought over the last year, as this customer bought 1020 items in that period.

In [None]:
number_1 = last_year_tran[last_year_tran['customer_id'] == top_10.head(1)['customer_id'].values[0]]

In [None]:
articles.head(3)


In [None]:
stuff_number1_bought = number_1.merge(articles, how = 'left', left_on = 'article_id', right_on = 'article_id')

In [None]:
articles['product_group_name'].head(2)

In [None]:
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.histplot(data=stuff_number1_bought, y='product_group_name', color='orange', hue='index_name', multiple="stack")
# ax.set_xlabel('count by garment group')
# ax.set_ylabel('garment group')
plt.show()

From my understanding, Divided is also mainly female clothing. 
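
To summarize a single customer's purchases as numbers rather than a plot, the share of items per `index_name` could be computed after the merge. This is a sketch on made-up toy tables; only the column names follow the real files.

```python
import pandas as pd

# Toy stand-ins for one customer's transactions and for articles.csv (made up).
toy_purchases = pd.DataFrame({"article_id": [1, 1, 2, 3, 3, 3]})
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "index_name": ["Ladieswear", "Menswear", "Divided"],
})

# Attach article metadata to every purchase, then compute the share per index.
bought = toy_purchases.merge(toy_articles, on="article_id", how="left")
index_share = bought["index_name"].value_counts(normalize=True)
print(index_share)
```

A customer whose purchases fall mostly under one index gives at least a rough hint about who they are shopping for.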

In [None]:
customers.head(2)

In [None]:
stuff_number1_bought.head(3)

In [None]:
customers[customers['customer_id'] == top_10.head(1)['customer_id'].values[0]]