# Task 02: Customer Segmentation Using Unsupervised Learning

### Dataset Description
The dataset 'customer_shopping_data.csv' contains transaction-level purchase data for mall customers. Each row represents a single purchase made by a customer and includes demographics and purchase details such as price, quantity, product category, and purchase date.

The dataset does not provide a 'spending score' column, so I have derived each customer's spending habits from the actual purchase records. I did this because spending computed from real purchases reflects customer behavior more accurately than a predefined score.

I have chosen this larger dataset instead of the traditional mall.csv to find more diverse customer patterns and to improve clustering reliability.

## EDA

In [1]:
import pandas as pd # importing libraries

In [2]:
df = pd.read_csv('customer_shopping_data.csv') # loading the dataset

In [3]:
df.head() # inspecting first five rows

,invoice_no,customer_id,gender,age,category,quantity,price,payment_method,invoice_date,shopping_mall
0,I138884,C241288,Female,28,Clothing,5,1500.4,Credit Card,5/8/2022,Kanyon
1,I317333,C111565,Male,21,Shoes,3,1800.51,Debit Card,12/12/2021,Forum Istanbul
2,I127801,C266599,Male,20,Clothing,1,300.08,Cash,9/11/2021,Metrocity
3,I173702,C988172,Female,66,Shoes,5,3000.85,Credit Card,16/05/2021,Metropol AVM
4,I337046,C189076,Female,53,Books,4,60.6,Cash,24/10/2021,Kanyon


In [4]:
df.shape # checking the shape

(99457, 10)

In [5]:
df.info() # checking info 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99457 entries, 0 to 99456
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   invoice_no      99457 non-null  object 
 1   customer_id     99457 non-null  object 
 2   gender          99457 non-null  object 
 3   age             99457 non-null  int64  
 4   category        99457 non-null  object 
 5   quantity        99457 non-null  int64  
 6   price           99457 non-null  float64
 7   payment_method  99457 non-null  object 
 8   invoice_date    99457 non-null  object 
 9   shopping_mall   99457 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 7.6+ MB


In [6]:
df.describe() # checking statistical summary of numeric columns

,age,quantity,price
count,99457.0,99457.0,99457.0
mean,43.427089,3.003429,689.256321
std,14.990054,1.413025,941.184567
min,18.0,1.0,5.23
25%,30.0,2.0,45.45
50%,43.0,3.0,203.3
75%,56.0,4.0,1200.32
max,69.0,5.0,5250.0


In [7]:
df.isnull().sum().sum() # sum of null values

np.int64(0)

In [8]:
df['gender'].value_counts() # checking count of gender

gender
Female    59482
Male      39975
Name: count, dtype: int64

In [9]:
df['category'].value_counts() # checking count of each category 

category
Clothing           34487
Cosmetics          15097
Food & Beverage    14776
Toys               10087
Shoes              10034
Souvenir            4999
Technology          4996
Books               4981
Name: count, dtype: int64

In [10]:
df.duplicated().sum() # checking for duplicate values 

np.int64(0)

## Feature Engineering

In this step, I transform the raw transaction data into customer-level spending behavior features.

Since spending habits are not directly available, I have calculated meaningful metrics such as total spending, average spending, purchase frequency, and total quantity purchased for each customer.

These features will later be used for customer segmentation using K-Means clustering.
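
To make the aggregation concrete, here is a minimal sketch on a made-up three-row table. The customer IDs and values are illustrative and are not taken from the dataset, and the sketch assumes, as the notebook does, that `price` is a per-unit price. It uses the same pandas named-aggregation pattern as the cells that follow.

```python
import pandas as pd

# Hypothetical toy transactions: customer "C1" buys twice, "C2" buys once
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2"],
    "invoice_no": ["I1", "I2", "I3"],
    "quantity": [2, 1, 3],
    "price": [10.0, 20.0, 5.0],  # assumed to be a per-unit price
})

# Amount per transaction, computed as price * quantity
toy["Total_Amount"] = toy["price"] * toy["quantity"]

# Collapse the transactions into one row per customer
toy_features = toy.groupby("customer_id").agg(
    Total_Spending=("Total_Amount", "sum"),
    Purchase_Frequency=("invoice_no", "count"),
    Total_Quantity=("quantity", "sum"),
    Avg_Spending=("Total_Amount", "mean"),
).reset_index()

print(toy_features)
```

In this toy table, "C1" ends up with a total spending of 40.0 across 2 purchases (average 20.0), while "C2" has a single purchase worth 15.0.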

In [11]:
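# Total amount per transaction, assuming `price` is a per-unit price.
# NOTE: in the rows shown by df.head(), price scales with quantity within a category
# (Clothing: 300.08 for quantity 1, 1500.4 for quantity 5), so it may already be a line
# total; verify this before relying on the product below.
df["Total_Amount"] = df["price"] * df["quantity"]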

In [12]:
# Group the dataset by customer_id to calculate customer-level features
customer_features = df.groupby("customer_id").agg(
    
    # Calculate total spending by summing total amount spent by each customer
    Total_Spending=("Total_Amount", "sum"),
    
    # Calculate purchase frequency by counting number of invoices per customer
    Purchase_Frequency=("invoice_no", "count"),
    
    # Calculate total quantity purchased by each customer
    Total_Quantity=("quantity", "sum"),
    
    # Calculate average spending per transaction for each customer
    Avg_Spending=("Total_Amount", "mean"),
    
    # Take the average age of the customer (age is constant per customer)
    Age=("age", "mean")
)
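
The comment above assumes that age is constant per customer. A minimal sketch of how that assumption could be checked is shown below; the helper name `age_is_constant` and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def age_is_constant(transactions: pd.DataFrame) -> bool:
    """Return True if every customer_id maps to exactly one distinct age."""
    ages_per_customer = transactions.groupby("customer_id")["age"].nunique()
    return bool((ages_per_customer == 1).all())

# Hypothetical toy data: "C2" appears with two different ages
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2"],
    "age": [30, 30, 41, 42],
})

print(age_is_constant(toy))  # False
```

On the notebook's data the same check would be `age_is_constant(df)`. If it returns True, taking the mean of `age` is equivalent to taking the first value.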

In [13]:
# Reset index to convert customer_id from index to a column
customer_features = customer_features.reset_index()

In [14]:
customer_features.head()

,customer_id,Total_Spending,Purchase_Frequency,Total_Quantity,Avg_Spending,Age
0,C100004,7502.0,1,5,7502.0,61.0
1,C100005,2400.68,1,2,2400.68,34.0
2,C100006,322.56,1,3,322.56,44.0
3,C100012,130.75,1,5,130.75,25.0
4,C100019,35.84,1,1,35.84,21.0


In [15]:
# Save the customer features to a CSV file
customer_features.to_csv("customer_spending_features.csv", index=False)
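
Because K-Means is distance-based, features on very different scales (for example total spending versus purchase frequency) are usually rescaled before clustering. The sketch below shows one common option, z-score standardization, using plain pandas. The helper `standardize` and the toy values are hypothetical, and this is not necessarily the scaling the later clustering step uses.

```python
import pandas as pd

def standardize(features, columns):
    """Return a copy with the given columns rescaled to mean 0 and std 1."""
    scaled = features.copy()
    for col in columns:
        std = scaled[col].std(ddof=0)  # population standard deviation
        if std > 0:
            scaled[col] = (scaled[col] - scaled[col].mean()) / std
        else:
            # A constant column carries no information; set it to 0
            scaled[col] = 0.0
    return scaled

# Hypothetical toy customer-level features
toy = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "Total_Spending": [100.0, 200.0, 300.0],
    "Purchase_Frequency": [1, 1, 1],
})

scaled = standardize(toy, ["Total_Spending", "Purchase_Frequency"])
print(scaled)
```

The constant-column branch matters whenever a feature takes the same value for every customer, since dividing by a zero standard deviation would otherwise produce NaN values.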