## Introduction to Customer Segmentation
    
    Customer segmentation is a marketing strategy that involves dividing customers into distinct groups based on shared characteristics. This approach helps businesses tailor their marketing efforts, product offerings, and communication strategies to better meet the needs of each segment.
    
### Objectives
    Identify high-value customers.
    Understand customer behavior.
    Develop targeted strategies.
    Measure performance within each segment.

    
    In this project, we will use machine learning techniques to analyze a dataset containing customer and transactional information. Our goal is to derive actionable insights and develop targeted strategies for each identified customer segment.

### Import Packages

In [211]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

%matplotlib inline

In [212]:
df = pd.read_csv('new_retail_data.csv')

### <span style="color: red;">DESCRIPTIONS OF FEATURES </span>

#### Customer Details
    City → City where the customer lives
    State → State where the customer lives
    Country → Country of the customer
#### Demographic Information
    Age → Customer’s age
    Gender → Customer’s gender (Male/Female)
    Income → Customer’s income level (Low/Medium/High)
    Customer_Segment → Predefined segment (Regular/New/Premium)
#### Purchase & Spending Information
    Year → Year of the transaction
    Month → Month of the transaction
    Total_Purchases → Total number of purchases made
    Amount → Amount spent in a single transaction
    Total_Amount → Total amount spent by the customer
#### Product Information
    Product_Category → Type of product purchased (e.g., Electronics, Clothing)
    Product_Brand → Brand of the purchased product
    Product_Type → Specific type of product within a category
    products → Specific product purchased (e.g., Sundress)
#### Customer Feedback & Ratings
    Feedback → Customer's feedback (Bad/Average/Good/Excellent)
    Ratings → Numeric rating given by the customer
#### Shipping & Payment Details
    Shipping_Method → How the product was shipped (Standard, Express, Same-Day)
    Payment_Method → How the customer paid (e.g., Credit Card, PayPal)
#### Order Status
    Order_Status → Status of the order (Delivered, Shipped, Processing, Pending)

### REMOVING UNNECESSARY FEATURES

In [213]:
# Drop identifiers and contact details that carry no segmentation signal.
df = df.drop(columns=['Transaction_ID','Customer_ID','Name','Email','Phone','Address','Zipcode','Date','Time'])

In [214]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302010 entries, 0 to 302009
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   City              301762 non-null  object 
 1   State             301729 non-null  object 
 2   Country           301739 non-null  object 
 3   Age               301837 non-null  float64
 4   Gender            301693 non-null  object 
 5   Income            301720 non-null  object 
 6   Customer_Segment  301795 non-null  object 
 7   Year              301660 non-null  float64
 8   Month             301737 non-null  object 
 9   Total_Purchases   301649 non-null  float64
 10  Amount            301653 non-null  float64
 11  Total_Amount      301660 non-null  float64
 12  Product_Category  301727 non-null  object 
 13  Product_Brand     301729 non-null  object 
 14  Product_Type      302010 non-null  object 
 15  Feedback          301826 non-null  object 
 16  Shipping_Method   30

In [215]:
df.sample(5)

Unnamed: 0,City,State,Country,Age,Gender,Income,Customer_Segment,Year,Month,Total_Purchases,...,Total_Amount,Product_Category,Product_Brand,Product_Type,Feedback,Shipping_Method,Payment_Method,Order_Status,Ratings,products
205032,San Francisco,Illinois,USA,24.0,Male,High,Regular,2023.0,October,6.0,...,1148.339875,Clothing,Zara,Dress,Average,Express,Debit Card,Delivered,2.0,Sundress
282914,Kelowna,Ontario,Canada,32.0,Female,Medium,Regular,2024.0,April,6.0,...,216.295604,Electronics,Samsung,Smartphone,Average,Same-Day,Debit Card,Delivered,2.0,iPhone
189787,Barrie,Ontario,Canada,18.0,Male,High,Premium,2023.0,April,4.0,...,583.842116,Grocery,Pepsi,Soft Drink,Bad,Standard,Debit Card,Shipped,1.0,Energy drink
201238,Saskatoon,Ontario,Canada,26.0,Female,Low,New,2023.0,May,4.0,...,625.231853,Electronics,Samsung,Smartphone,Bad,Same-Day,PayPal,Processing,1.0,Samsung Galaxy
261059,Kelowna,Ontario,Canada,70.0,Male,Medium,New,2023.0,May,1.0,...,51.811652,Books,HarperCollins,Thriller,Good,Same-Day,Cash,Delivered,3.0,Techno-thriller


### HANDLING MISSING VALUES

In [216]:
df[df.isnull().any(axis=1)]

Unnamed: 0,City,State,Country,Age,Gender,Income,Customer_Segment,Year,Month,Total_Purchases,...,Total_Amount,Product_Category,Product_Brand,Product_Type,Feedback,Shipping_Method,Payment_Method,Order_Status,Ratings,products
99,Portsmouth,England,UK,62.0,Female,Medium,Regular,,November,4.0,...,418.025349,Books,Penguin Books,Fiction,Excellent,Express,Cash,Processing,5.0,Literary fiction
109,Portsmouth,England,UK,65.0,Male,Low,Regular,2023.0,June,4.0,...,1912.436648,Grocery,Coca-Cola,Juice,,Same-Day,Debit Card,Delivered,,Pineapple juice
123,Portsmouth,England,UK,39.0,Male,Medium,Regular,2023.0,March,10.0,...,3002.684416,Clothing,Adidas,Jacket,,Standard,PayPal,Pending,,Varsity jacket
142,Portsmouth,England,UK,37.0,Male,High,Regular,2023.0,September,1.0,...,253.054157,Clothing,Zara,Dress,Average,,PayPal,Pending,2.0,Sheath dress
174,Portsmouth,England,UK,50.0,Male,High,Regular,2023.0,July,,...,1251.102005,Grocery,Pepsi,Water,Average,Standard,Credit Card,Pending,2.0,Sparkling water
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301515,Sacramento,New Hampshire,USA,36.0,Female,Medium,Regular,2023.0,September,6.0,...,1177.654223,Grocery,Nestle,Snacks,,Express,Cash,Delivered,,Trail mix
301567,Vancouver,Ontario,Canada,37.0,Male,Low,New,2024.0,February,1.0,...,259.916898,Books,Random House,Literature,Bad,,Cash,Delivered,1.0,Classic literature
301738,Darwin,New South Wales,Australia,35.0,Male,,Premium,2023.0,November,2.0,...,353.690926,Books,HarperCollins,Non-Fiction,Bad,Standard,Cash,Pending,1.0,Memoir
301875,New Orleans,Pennsylvania,USA,44.0,Male,High,Premium,2024.0,January,6.0,...,420.831710,Home Decor,Home Depot,Decorations,Bad,Same-Day,Cash,Delivered,1.0,Candles
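
Before imputing, it can help to quantify how much of each column is missing. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the retail data (values are made up).
toy = pd.DataFrame({
    "City": ["Portsmouth", "Darwin", None, "Kelowna"],
    "Age": [62.0, np.nan, 35.0, 44.0],
    "Ratings": [5.0, np.nan, np.nan, 1.0],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Show the columns with the most gaps first.
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real data would show whether any column is missing enough values to make mode or median imputation questionable.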


In [217]:
categorical_feature = [feature for feature in df.columns if df[feature].dtype == 'O']
for col in categorical_feature:
    df[col] = df[col].fillna(df[col].mode()[0])

In [218]:
numerical_feature = [feature for feature in df.columns if df[feature].dtype != 'O']
for col in numerical_feature:
    df[col] = df[col].fillna(df[col].median())

In [219]:
df.isnull().sum()

City                0
State               0
Country             0
Age                 0
Gender              0
Income              0
Customer_Segment    0
Year                0
Month               0
Total_Purchases     0
Amount              0
Total_Amount        0
Product_Category    0
Product_Brand       0
Product_Type        0
Feedback            0
Shipping_Method     0
Payment_Method      0
Order_Status        0
Ratings             0
products            0
dtype: int64
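
The two imputation loops above can be folded into a single helper. This is a sketch under stated assumptions: the name `impute_simple` is hypothetical, the toy values are made up, and the dtype test uses `pd.api.types.is_numeric_dtype` instead of comparing against `'O'`.

```python
import numpy as np
import pandas as pd

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() returns a Series; [0] picks the most frequent value.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Made-up toy data with one gap per column.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "Age": [20.0, 30.0, np.nan, 70.0],
})
filled = impute_simple(toy)
print(filled)
```

Working on a copy keeps the original frame intact, which makes it easy to compare distributions before and after imputation.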

### HANDLING DUPLICATE VALUES

In [220]:
df[df.duplicated()]

Unnamed: 0,City,State,Country,Age,Gender,Income,Customer_Segment,Year,Month,Total_Purchases,...,Total_Amount,Product_Category,Product_Brand,Product_Type,Feedback,Shipping_Method,Payment_Method,Order_Status,Ratings,products
299759,Kitchener,Ontario,Canada,54.0,Female,Low,Regular,2023.0,December,7.0,...,940.619277,Clothing,Adidas,T-shirt,Bad,Express,Credit Card,Processing,1.0,Off-the-shoulder tee
301094,Wollongong,New South Wales,Australia,54.0,Male,Low,New,2023.0,December,6.0,...,2201.568075,Grocery,Pepsi,Soft Drink,Excellent,Express,Cash,Delivered,4.0,Fruit punch
301362,Leicester,England,UK,63.0,Male,Low,Regular,2023.0,May,8.0,...,1535.255087,Clothing,Adidas,Jacket,Average,Same-Day,Cash,Pending,2.0,Varsity jacket
301486,Bremen,Berlin,Germany,59.0,Male,Low,New,2023.0,November,9.0,...,2450.946762,Grocery,Pepsi,Soft Drink,Bad,Standard,Cash,Pending,1.0,Iced tea


In [221]:
df = df.drop_duplicates(keep='first').reset_index(drop=True)

In [222]:
df.duplicated().sum()

0
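
As a small illustration of how `duplicated` and `drop_duplicates` interact, the sketch below uses a made-up toy frame. By default `duplicated()` marks only the later copies of a row, while `keep=False` marks every copy.

```python
import pandas as pd

# Made-up toy data: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "City": ["Bremen", "Leicester", "Bremen", "Bremen"],
    "Total_Purchases": [9, 8, 9, 9],
})

# Later copies only (the first occurrence is not flagged).
later_copies = toy[toy.duplicated()]

# Every row that has at least one identical twin.
all_copies = toy[toy.duplicated(keep=False)]

# Keep the first occurrence and renumber the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(len(later_copies), len(all_copies), len(deduped))
```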

### CONVERT DATA TYPES

In [223]:
df[['Age','Year']] = df[['Age','Year']].astype(int) 
df['Ratings'] = df['Ratings'].astype(str)
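
Casting a float column with `astype(str)` produces labels such as `'4.0'`, and the column then no longer counts as numeric in later steps. One possible alternative, sketched here on made-up values and assuming ratings are whole numbers from 1 to 5, is an ordered categorical, which keeps clean labels and preserves their order.

```python
import pandas as pd

# Made-up ratings stored as floats, as in the raw data.
ratings = pd.Series([2.0, 5.0, 1.0, 4.0], name="Ratings")

# Plain string cast: the decimal point survives in the labels.
as_text = ratings.astype(str)

# Ordered categorical (assumed scale 1..5): labels stay integers and keep their order.
as_ordered = pd.Series(
    pd.Categorical(ratings.astype(int), categories=[1, 2, 3, 4, 5], ordered=True),
    name="Ratings",
)

print(as_text.tolist())
print(as_ordered.max())
```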

In [224]:
df.nunique()

City                   130
State                   54
Country                  5
Age                     53
Gender                   2
Income                   3
Customer_Segment         3
Year                     2
Month                   12
Total_Purchases         10
Amount              299297
Total_Amount        299306
Product_Category         5
Product_Brand           18
Product_Type            33
Feedback                 4
Shipping_Method          3
Payment_Method           4
Order_Status             4
Ratings                  5
products               318
dtype: int64

In [225]:
df['Product_Type'] = df['Product_Type'].replace('Mitsubishi 1.5 Ton 3 Star Split AC','Mitsubishi')

In [226]:
df['Product_Type'].unique()

array(['Shorts', 'Tablet', "Children's", 'Tools', 'Chocolate',
       'Television', 'Shirt', 'Decorations', 'Non-Fiction', 'Water',
       'Snacks', 'T-shirt', 'Literature', 'Juice', 'Furniture', 'Coffee',
       'Bathroom', 'Kitchen', 'Smartphone', 'Shoes', 'Thriller',
       'Soft Drink', 'Laptop', 'Dress', 'Headphones', 'Lighting',
       'Bedding', 'Jacket', 'Fiction', 'Jeans', 'Fridge', 'Mitsubishi',
       'BlueStar AC'], dtype=object)

### STATISTICAL INFO

In [227]:
df.describe()

Unnamed: 0,Age,Year,Total_Purchases,Amount,Total_Amount
count,302006.0,302006.0,302006.0,302006.0,302006.0
mean,35.47904,2023.164924,5.359271,255.164574,1367.267241
std,15.01774,0.371112,2.866892,141.306763,1128.403235
min,18.0,2023.0,1.0,10.000219,10.00375
25%,22.0,2023.0,3.0,133.024389,439.178076
50%,32.0,2023.0,5.0,255.470969,1041.117547
75%,46.0,2023.0,8.0,377.509122,2027.928818
max,70.0,2024.0,10.0,499.997911,4999.625796


### <span style ="color:green">INSIGHT</span>

<span style = "color : blue">
    - No obvious outliers: the minimum and maximum of every numerical feature fall within a plausible range.
</span>
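
A rule-based check can confirm or qualify a visual reading of `df.describe()`. The sketch below applies the common 1.5 × IQR fence to every numeric column; the helper name `iqr_outlier_counts` and the toy values are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Made-up toy data: the last Amount value is deliberately extreme.
toy = pd.DataFrame({
    "Age": [18, 22, 32, 46, 70],
    "Amount": [10.0, 133.0, 255.0, 377.0, 5000.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Applied to the quartiles reported by `df.describe()` above, the upper fence for Total_Amount works out to about 4411 (2027.93 + 1.5 * 1588.75), which is below the reported maximum of about 4999.63, so this rule would flag some of the largest totals. A flagged value is not necessarily an error; large totals may simply be legitimate high spenders.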

## DISTRIBUTIONS OF NUMERICAL FEATURES

In [230]:
# Ratings was cast to str above, so re-select the columns that are still numeric.
numeric_cols = df.select_dtypes(include='number').columns
plt.figure(figsize=(16,5))
for i, col in enumerate(numeric_cols):
    plt.subplot(1,len(numeric_cols),i+1)
    sns.kdeplot(x=df[col],fill=True)
    plt.yticks([])
    plt.ylabel('')
plt.show()

### <span style ="color:green">INSIGHT</span>

<span style = "color : black">
    <ul>
      <li><span style ="color:red">Age Distribution</span>: Most customers are between 20 and 30 years old.</li>
      <li><span style ="color:red">Purchase Frequency</span>: The majority of customers have made a total of 5 purchases.</li>
      <li><span style ="color:red">Spending Pattern</span>: Most customers have spent between 1,000 and 1,500 in total.</li>
    </ul>
</span>
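
Claims read off a density plot can be backed by numbers. A minimal sketch, using made-up ages and assumed bin edges, computes the share of customers per age band with `pd.cut`:

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([19, 22, 24, 27, 31, 38, 46, 55, 63, 70], name="Age")

# Assumed bands; right=False makes each bin [lower, upper), so 71 closes the last band.
bins = [18, 30, 40, 50, 60, 71]
labels = ["18-29", "30-39", "40-49", "50-59", "60-70"]
age_band = pd.cut(ages, bins=bins, labels=labels, right=False)

# Share of customers in each band, in band order.
shares = age_band.value_counts(normalize=True).sort_index()
print(shares)
```

Running the same steps on `df['Age']` would give the exact share of customers in each band, which is a sturdier basis for the insight than the shape of the curve alone.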

## SOME CATEGORICAL FEATURES

In [None]:
plt.rcParams['figure.figsize'] = (30, 12)

categ = ['Gender','Customer_Segment','Payment_Method','Month','Country']
for i, col in enumerate(categ):
    plt.subplot(1,len(categ),i+1)
    size = df[col].value_counts()
    # autopct takes a printf-style format; '%1.1f%%' renders values such as 48.3%
    plt.pie(size,labels=size.index.tolist(),autopct='%1.1f%%')
    plt.title(col,fontsize=20)
    plt.axis('off')

plt.tight_layout()
plt.show()


### <span style ="color:green">INSIGHT</span>

<span style = "color : black">
    <ul>
      <li><span style ="color:gray">The majority of customers are <span style = "color:red">male</span>.</span></li>
      <li><span style ="color:gray">Most customers belong to the <span style = "color:red">Regular</span> segment.</span></li>
      <li><span style ="color:gray"><span style = "color:red">Credit cards</span> are the most commonly used payment method.</span></li>
      <li><span style ="color:gray"><span style = "color:red">April, January, August, and July</span> see the highest number of purchases.</span></li>
      <li><span style ="color:gray">The highest number of product sales occur in the <span style = "color:red">US</span> and <span style = "color:red">UK</span>, suggesting these countries are key markets for the business.</span></li>
    </ul>
</span>
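
The pie charts can be complemented with a small table that lists the most frequent value and its share for each categorical feature. This is a sketch on made-up data; the helper name `top_category` is hypothetical.

```python
import pandas as pd

def top_category(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """For each column, report the most frequent value and its share of rows."""
    rows = []
    for col in columns:
        shares = frame[col].value_counts(normalize=True)  # sorted, largest first
        rows.append({
            "feature": col,
            "top_value": shares.index[0],
            "share": round(float(shares.iloc[0]), 3),
        })
    return pd.DataFrame(rows).set_index("feature")

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "Payment_Method": ["Credit Card", "Cash", "Credit Card", "PayPal"],
})
result = top_category(toy, ["Gender", "Payment_Method"])
print(result)
```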


### UNDERSTANDING CUSTOMER'S EXPERIENCE

In [None]:
f,ax = plt.subplots(1,2,figsize=(20,10))
sns.countplot(x='Feedback',data=df,palette='bright',ax=ax[0],saturation=0.95)
for containers in ax[0].containers:
    ax[0].bar_label(containers,size=20)
# Draw the pie on the second axes explicitly rather than relying on the current axes.
rating_counts = df['Ratings'].value_counts()
ax[1].pie(x=rating_counts,labels=rating_counts.index,explode=[0.1,0,0,0,0],autopct='%1.1f%%',shadow=True)

plt.show()

### <span style ="color:green">INSIGHT</span>

<span style = "color : black">
    <ul>
      <li><span style ="color:red">Customers' Feedback</span>: Most customers provide excellent or good feedback.</li>
      <li><span style ="color:red">Customers' Ratings</span>: The majority of customers have given a rating of 4.</li>
    </ul>
</span>
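
In the sample rows shown earlier, Feedback and Ratings move together (Bad with 1.0, Average with 2.0, Good with 3.0, Excellent with 4.0 or 5.0). A cross-tabulation is one way to check whether that pattern holds across the whole dataset. The sketch below uses made-up rows that follow the same pattern.

```python
import pandas as pd

# Made-up toy data mirroring the pattern seen in the sample rows.
toy = pd.DataFrame({
    "Feedback": ["Excellent", "Excellent", "Good", "Average", "Bad", "Bad"],
    "Ratings": [5, 4, 3, 2, 1, 1],
})

# Rows are feedback labels, columns are ratings, cells are row counts.
table = pd.crosstab(toy["Feedback"], toy["Ratings"])
print(table)
```

If the real table is similarly sparse, the two features carry largely the same information, and keeping both may add little to a segmentation model.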


In [None]:
plt.rcParams['figure.figsize'] = (15,9)
plt.style.use('fivethirtyeight')
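# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`.
sns.countplot(x='Product_Category',data=df,palette='Blues')
plt.title('Category-wise purchase counts')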
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()

## TOP 10 BRANDS WITH HIGHEST SALES

In [None]:
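# Name the counts column explicitly so the plotting code below works across pandas versions.
Brand = df['Product_Brand'].value_counts().rename('count').to_frame()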

In [None]:
Brand[:10]

In [None]:
plt.rcParams['figure.figsize'] = (15,9)
sns.barplot(data=Brand[:10],x=Brand.index[:10],y='count',palette='hls')
plt.title('Top 10 Best-Selling Brands')
plt.xticks(rotation = 45)
plt.xlabel('')
plt.show()
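
`value_counts` ranks brands by the number of transactions, which is not necessarily the same as ranking by revenue. A minimal sketch on made-up data compares both views using a grouped aggregation over `Total_Amount`:

```python
import pandas as pd

# Made-up toy data: Pepsi sells often but cheaply, Zara rarely but at a high value.
toy = pd.DataFrame({
    "Product_Brand": ["Pepsi", "Pepsi", "Pepsi", "Zara", "Samsung", "Samsung"],
    "Total_Amount": [100.0, 150.0, 50.0, 900.0, 400.0, 300.0],
})

# One row per brand with both the transaction count and the summed spend.
by_brand = toy.groupby("Product_Brand")["Total_Amount"].agg(
    transactions="count", revenue="sum"
)

top_by_count = by_brand.sort_values("transactions", ascending=False).head(2)
top_by_revenue = by_brand.sort_values("revenue", ascending=False).head(2)

print(top_by_count)
print(top_by_revenue)
```

In this toy example the two rankings disagree, which is why it may be worth checking both before labeling a brand as best-selling.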

In [None]:
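f, ax = plt.subplots(1,2,figsize=(20,10))

month_order = ['January','February','March','April','May','June','July','August','September','October','November','December']
# Assumption: df_grouped is never defined in this notebook; total spend per month is used here.
df_grouped = df.groupby('Month',as_index=False)['Total_Amount'].sum()
df_grouped['Month'] = pd.Categorical(df_grouped['Month'],categories=month_order,ordered = True)
df_grouped = df_grouped.sort_values('Month')
sns.lineplot(data=df_grouped,x='Month',y='Total_Amount',marker='o',ax=ax[0])
ax[0].set_title('MONTHLY ANALYSIS',fontsize=15)
ax[0].tick_params(axis='x',rotation=45)

# Income is categorical (Low/Medium/High) and a violin plot needs one numeric axis,
# so it is mapped to ordinal codes here (assumed order: Low < Medium < High).
income_level = df['Income'].map({'Low':0,'Medium':1,'High':2})
sns.violinplot(x=df['Feedback'],y=income_level,palette='viridis',ax=ax[1])
ax[1].set_yticks([0,1,2])
ax[1].set_yticklabels(['Low','Medium','High'])
ax[1].set_title('VIOLIN PLOT B/W FEEDBACK AND INCOME',fontsize=15)
plt.show()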

## FEATURE SELECTION

In [None]:
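# numerical_feature still lists Ratings, which is now stored as text, so cast back to float first.
corr = df[numerical_feature].astype(float).corr()
corr

In [None]:
sns.heatmap(data=corr,annot=True,fmt=".2f",linewidths=0.5,linecolor="black",cbar=True)
plt.show()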

### <span style ="color:green">INSIGHT</span>

<span style = "color : black">
    <ul>
      <pre>I have decided to remove the <span style ="color:red">Amount</span> feature and keep <span style ="color:red">Total_Amount</span> for analysis. Since Total_Amount represents the overall spending of customers, it provides a more complete picture of their purchasing behavior. This will help in making better segmentation decisions.</pre>
    </ul>
</span>
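
A decision to drop one of two related features can be supported by listing the feature pairs whose absolute correlation exceeds a threshold. The sketch below is illustrative only: the helper name `correlated_pairs`, the 0.8 threshold, and the toy values (where Total_Amount is exactly twice Amount) are assumptions, and the source does not show how strongly the real columns correlate.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up toy data: Total_Amount is a multiple of Amount, Age is only weakly related.
toy = pd.DataFrame({
    "Age": [30, 10, 50, 20, 40],
    "Amount": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Total_Amount": [2.0, 4.0, 6.0, 8.0, 10.0],
})
pairs = correlated_pairs(toy)
print(pairs)
```

When a pair shows up in this list, keeping the feature that better reflects overall customer behavior is one reasonable tie-breaker.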

In [None]:
# Remove the Amount feature
df = df.drop(columns=['Amount'])