<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_H%26M_EDA_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Install Kaggle modules and download the dataset

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d odins0n/hm256x256
!unzip -q "/content/hm256x256.zip"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Dataset URL: https://www.kaggle.com/datasets/odins0n/hm256x256
License(s): other
Downloading hm256x256.zip to /content
100% 2.13G/2.13G [00:27<00:00, 83.3MB/s]


## Let's Explore the H&M Fashion World Together!

Get ready to explore the amazing H&M Personalized Fashion Recommendations dataset! This is a fun challenge from Kaggle. Our main goal? To build a cool system that suggests clothes people will love, based on what they've bought before. But first, we need to get to know our data!

**Why explore this dataset? It's like going on a treasure hunt!**

Exploring our data is super important. It helps us:

*   **See what we have:** Look at all the different pieces of information about customers, clothes, and sales.
*   **Spot any messy bits:** Find things that might be missing or don't look quite right.
*   **Find cool patterns:** Discover interesting trends in what people buy and when.
*   **Get ideas for our model:** Think of clever ways to use the data to make great recommendations.
*   **Choose the best tools:** Figure out which methods will work best for our recommendation system.

**How will we explore? It's easy and fun!**

We'll use simple steps to explore:

*   We'll load everything up so we can see it clearly.
*   We'll look at summary statistics and plots to understand things like customer ages and how much the clothes cost.
*   We'll see how different pieces of information fit together.
*   We'll check for any missing puzzle pieces.

**What's the big goal of exploring? To become data detectives!**

By exploring, we want to:

*   Really understand the H&M fashion world in our data.
*   Learn all about the customers, the clothes, and the sales.
*   Find exciting clues that will help us build an awesome recommendation system.
*   Get everything ready for the next steps, like preparing the data and building our model.

Let's have fun exploring and build something amazing!

# First Things First
Let's start by loading all the necessary libraries and the datasets we'll be working with: `articles.csv`, `customers.csv`, and `transactions_train.csv`. This will prepare our environment and make the data available for exploration.

In [2]:
#Load all libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
import plotly.graph_objects as go
import datetime as dt
import matplotlib.image as mpimg

#Load all the datasets
articles = pd.read_csv("/content/articles.csv")
customers = pd.read_csv("/content/customers.csv")
transactions = pd.read_csv("/content/transactions_train.csv")
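
The transactions file is large (about 31.8 million rows, as the shape check below shows), so it can help to tell pandas which dtypes to use while reading. Here is a minimal sketch of that idea, not the loading code this notebook uses. The helper name `load_transactions` and the tiny in-memory sample are made up for illustration, and the narrow integer types assume that every `article_id` fits in 32 bits and every `sales_channel_id` fits in 8 bits, which is worth verifying on the real file.

```python
import io

import pandas as pd


def load_transactions(source):
    """Read a transactions CSV with explicit dtypes (hypothetical helper)."""
    return pd.read_csv(
        source,
        dtype={
            "article_id": "int32",       # assumption: ids stay below 2**31
            "price": "float32",          # half the memory of float64
            "sales_channel_id": "int8",  # assumption: only a few small codes
        },
        parse_dates=["t_dat"],           # parse the date column up front
    )


# Tiny in-memory stand-in for /content/transactions_train.csv
sample_csv = io.StringIO(
    "t_dat,customer_id,article_id,price,sales_channel_id\n"
    "2018-09-20,customer_A,663713001,0.050831,2\n"
    "2018-09-20,customer_B,505221004,0.015237,2\n"
)
sample = load_transactions(sample_csv)
print(sample.dtypes)
```

On the real data the call would look like `load_transactions("/content/transactions_train.csv")`. Note that `float32` slightly rounds the price values, so stick with the default if exact prices matter.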

## 📏 Checking the Size of Our Dataframes

Let's find out the shapes of all three dataframes: `articles`, `customers`, and `transactions`. This will tell us how many rows and columns each dataset has, giving us an idea of their size.

In [3]:
# Row and column counts for all three dataframes
frames = [articles, customers, transactions]
shape = pd.DataFrame({"Total Rows": [df.shape[0] for df in frames],
                      "Total Columns": [df.shape[1] for df in frames]},
                     index=['articles', 'customers', 'transactions'])
shape

| | Total Rows | Total Columns |
|---|---|---|
| articles | 105542 | 25 |
| customers | 1371980 | 7 |
| transactions | 31788324 | 5 |
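
Since we also want to spot missing values, the size check can be extended a little. The sketch below is one possible way to do it, assuming a hypothetical helper called `summarize_frames` and two toy dataframes that stand in for the real ones.

```python
import pandas as pd


def summarize_frames(frames):
    """Return rows, columns and missing-cell counts per dataframe (hypothetical helper)."""
    rows = {
        name: {
            "Total Rows": df.shape[0],
            "Total Columns": df.shape[1],
            "Missing Cells": int(df.isna().sum().sum()),
        }
        for name, df in frames.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy stand-ins for the real dataframes
toy_customers = pd.DataFrame({"customer_id": ["a", "b"], "age": [25, None]})
toy_articles = pd.DataFrame({"article_id": [1, 2, 3], "prod_name": ["x", "y", "z"]})

summary = summarize_frames({"customers": toy_customers, "articles": toy_articles})
print(summary)
```

With the real data, the call would be `summarize_frames({"articles": articles, "customers": customers, "transactions": transactions})`.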


## 🕵️‍♀️ Customer Purchase Analysis

Let's investigate how many customers in our dataset have made at least one transaction compared to those who haven't. This gives us insights into customer activity. 🛍️📊

In [4]:
# Distinct customers with at least one transaction, and distinct customers overall
n = transactions['customer_id'].nunique()
m = customers['customer_id'].nunique()
length = n / m            # share of customers with at least one transaction
npur = 100 - length * 100 # percentage of customers with no transaction
print("Total number of customers:", m)
print("Customers who made at least one transaction:", n)
print("% of customers who made at least one transaction:", length * 100)
print("Customers who did not make a purchase:", m - n)
print("% of customers who did not make a purchase:", npur)
print(f"Not all customers made a purchase: about {npur:.1f}% have no purchase history.")

Total number of customers: 1371980
Customers who made at least one transaction: 1362281
% of customers who made at least one transaction: 99.29306549658158
Customers who did not make a purchase: 9699
% of customers who did not make a purchase: 0.7069345034184238
Not all customers made a purchase: about 0.7% have no purchase history.
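
The counts tell us how many customers never bought anything, but not who they are. One way to pull out those rows is a boolean mask built with `Series.isin`. This is only a sketch on toy data: the frames `customers_demo` and `transactions_demo` are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the real customers and transactions dataframes
customers_demo = pd.DataFrame({"customer_id": ["a", "b", "c", "d"]})
transactions_demo = pd.DataFrame(
    {"customer_id": ["a", "a", "c"], "article_id": [1, 2, 3]}
)

# True for every customer that appears at least once in the transactions
has_purchase = customers_demo["customer_id"].isin(transactions_demo["customer_id"])

# Customers with no purchase history, and their share of all customers
non_buyers = customers_demo.loc[~has_purchase]
share_non_buyers = (~has_purchase).mean() * 100

print(non_buyers)
print(f"{share_non_buyers:.1f}% of customers have no purchase history")
```

Keeping the mask around is handy later on, because customers without history usually need a fallback such as recommending popular items.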


## 👀 Peeking at the Data

Let's take a look at the first few rows of the `transactions` and `articles` dataframes to understand their structure and see what kind of information they contain.

In [5]:
transactions.head()

| | t_dat | customer_id | article_id | price | sales_channel_id |
|---|---|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713001 | 0.050831 | 2 |
| 1 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 541518023 | 0.030492 | 2 |
| 2 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 505221004 | 0.015237 | 2 |
| 3 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687003 | 0.016932 | 2 |
| 4 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687004 | 0.016932 | 2 |
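
Each row is one purchased article, so a customer who buys several items on the same day shows up several times. A quick way to see this is to group by date and customer. The sketch below rebuilds the five rows shown above by hand. Because the real customer IDs are truncated in the output, the placeholders `customer_A` and `customer_B` are used instead.

```python
import pandas as pd

# The five rows from transactions.head(), with placeholder customer ids
sample = pd.DataFrame(
    {
        "t_dat": ["2018-09-20"] * 5,
        "customer_id": ["customer_A", "customer_A",
                        "customer_B", "customer_B", "customer_B"],
        "article_id": [663713001, 541518023, 505221004, 685687003, 685687004],
        "price": [0.050831, 0.030492, 0.015237, 0.016932, 0.016932],
        "sales_channel_id": [2] * 5,
    }
)
sample["t_dat"] = pd.to_datetime(sample["t_dat"])

# One row per customer and day: number of items and total price
baskets = (
    sample.groupby(["t_dat", "customer_id"])
    .agg(items=("article_id", "size"), total_price=("price", "sum"))
    .reset_index()
)
print(baskets)
```

The same `groupby` runs unchanged on the full `transactions` dataframe once `t_dat` has been converted with `pd.to_datetime`.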


In [6]:
articles.head()

| | article_id | product_code | prod_name | product_type_no | product_type_name | product_group_name | graphical_appearance_no | graphical_appearance_name | colour_group_code | colour_group_name | ... | department_name | index_code | index_name | index_group_no | index_group_name | section_no | section_name | garment_group_no | garment_group_name | detail_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108775015 | 108775 | Strap top | 253 | Vest top | Garment Upper body | 1010016 | Solid | 9 | Black | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |
| 1 | 108775044 | 108775 | Strap top | 253 | Vest top | Garment Upper body | 1010016 | Solid | 10 | White | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |
| 2 | 108775051 | 108775 | Strap top (1) | 253 | Vest top | Garment Upper body | 1010017 | Stripe | 11 | Off White | ... | Jersey Basic | A | Ladieswear | 1 | Ladieswear | 16 | Womens Everyday Basics | 1002 | Jersey Basic | Jersey top with narrow shoulder straps. |
| 3 | 110065001 | 110065 | OP T-shirt (Idro) | 306 | Bra | Underwear | 1010016 | Solid | 9 | Black | ... | Clean Lingerie | B | Lingeries/Tights | 1 | Ladieswear | 61 | Womens Lingerie | 1017 | Under-, Nightwear | Microfibre T-shirt bra with underwired, moulde... |
| 4 | 110065002 | 110065 | OP T-shirt (Idro) | 306 | Bra | Underwear | 1010016 | Solid | 10 | White | ... | Clean Lingerie | B | Lingeries/Tights | 1 | Ladieswear | 61 | Womens Lingerie | 1017 | Under-, Nightwear | Microfibre T-shirt bra with underwired, moulde... |
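
In the rows above, several articles share one `product_code` and differ only in colour or pattern, and each `article_id` looks like the `product_code` followed by three extra digits. The sketch below checks both observations on a hand-copied subset of those five rows. Whether the pattern holds for the whole `articles` dataframe is an assumption that still needs to be verified.

```python
import pandas as pd

# Subset of the columns from articles.head(), copied by hand
articles_demo = pd.DataFrame(
    {
        "article_id": [108775015, 108775044, 108775051, 110065001, 110065002],
        "product_code": [108775, 108775, 108775, 110065, 110065],
        "colour_group_name": ["Black", "White", "Off White", "Black", "White"],
    }
)

# Number of article variants and the colours offered per product
variants = articles_demo.groupby("product_code").agg(
    n_variants=("article_id", "nunique"),
    colours=("colour_group_name", ", ".join),
)
print(variants)

# Does dropping the last three digits of article_id give the product_code?
prefix_matches = (articles_demo["article_id"] // 1000 == articles_demo["product_code"]).all()
print("article_id starts with product_code:", prefix_matches)
```

If the pattern holds on the full data, `product_code` is a natural key for treating colour variants of the same product as one item.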
