# **Customer Shopping Behavior**

## 1. Introduction
This project analyzes customer shopping behavior to understand how demographic and product-related factors influence spending patterns, product preferences, and subscription behavior.


## 2. Import Libraries

In [72]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

## 3. Load Dataset

In [33]:
df = pd.read_csv('../Data/customer_shopping_behavior.csv')

df.head()

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


## 4. Data Exploration

In [10]:
df.shape

(3900, 18)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   customer id             3900 non-null   int64  
 1   age                     3900 non-null   int64  
 2   gender                  3900 non-null   object 
 3   item purchased          3900 non-null   object 
 4   category                3900 non-null   object 
 5   purchase amount (usd)   3900 non-null   int64  
 6   location                3900 non-null   object 
 7   size                    3900 non-null   object 
 8   color                   3900 non-null   object 
 9   season                  3900 non-null   object 
 10  review rating           3863 non-null   float64
 11  subscription status     3900 non-null   object 
 12  shipping type           3900 non-null   object 
 13  discount applied        3900 non-null   object 
 14  promo code used         3900 non-null   

In [12]:
df.describe()

Unnamed: 0,Customer ID,Age,Purchase Amount (USD),Review Rating,Previous Purchases
count,3900.0,3900.0,3900.0,3863.0,3900.0
mean,1950.5,44.068462,59.764359,3.750065,25.351538
std,1125.977353,15.207589,23.685392,0.716983,14.447125
min,1.0,18.0,20.0,2.5,1.0
25%,975.75,31.0,39.0,3.1,13.0
50%,1950.5,44.0,60.0,3.8,25.0
75%,2925.25,57.0,81.0,4.4,38.0
max,3900.0,70.0,100.0,5.0,50.0


## 5. Data Cleaning

In [37]:
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ','_')

In [38]:
df = df.rename(columns= {'purchase_amount_(usd)':'purchase_amount'})
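
The column cleanup above can be bundled into one reusable step. The following is a minimal sketch on a toy frame: the helper name `normalize_columns` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same header style as the raw CSV (values are illustrative).
raw = pd.DataFrame({"Customer ID": [1, 2], "Purchase Amount (USD)": [53, 64]})

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    out = frame.copy()
    # Plain string replacement (regex=False) so the space is treated literally.
    out.columns = out.columns.str.lower().str.replace(" ", "_", regex=False)
    # Drop the unit suffix from the purchase amount column.
    return out.rename(columns={"purchase_amount_(usd)": "purchase_amount"})

cleaned = normalize_columns(raw)
print(list(cleaned.columns))
```

Returning a copy leaves the raw frame untouched, which makes it easier to re-run the cell without side effects.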

In [39]:
df.isnull().sum()

customer_id                0
age                        0
gender                     0
item_purchased             0
category                   0
purchase_amount            0
location                   0
size                       0
color                      0
season                     0
review_rating             37
subscription_status        0
shipping_type              0
discount_applied           0
promo_code_used            0
previous_purchases         0
payment_method             0
frequency_of_purchases     0
dtype: int64

In [42]:
med = df['review_rating'].median()
# Assign the result back to the column. Calling fillna(inplace=True) on a column
# selection acts on a copy and will stop working in pandas 3.0.
df['review_rating'] = df['review_rating'].fillna(med)
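
As a quick check of what median imputation does, here is a minimal sketch on a toy Series. The ratings are hypothetical; only the column name `review_rating` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing entries.
ratings = pd.Series([3.1, np.nan, 4.4, 2.7, np.nan, 3.8], name="review_rating")

med = ratings.median()        # NaN values are skipped by default
filled = ratings.fillna(med)  # returns a new Series; the original is unchanged

print(med)
print(filled.isnull().sum())
```

The observed values stay as they are, and only the missing entries take the median of the observed ones.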


In [43]:
df.isnull().sum()

customer_id               0
age                       0
gender                    0
item_purchased            0
category                  0
purchase_amount           0
location                  0
size                      0
color                     0
season                    0
review_rating             0
subscription_status       0
shipping_type             0
discount_applied          0
promo_code_used           0
previous_purchases        0
payment_method            0
frequency_of_purchases    0
dtype: int64

In [44]:
# Age group: bin ages into four labeled, right-inclusive intervals:
# (0, 20], (20, 30], (30, 50], (50, 80]

bins = [0, 20, 30, 50, 80]
labels = ['Children', 'Young', 'Mid-senior', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True)
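
Because `right=True` makes each bin include its upper edge, ages that sit exactly on a boundary (20, 30, 50) fall into the lower group. A small sketch with hypothetical ages, reusing the same bins and labels, shows the assignment:

```python
import pandas as pd

bins = [0, 20, 30, 50, 80]
labels = ['Children', 'Young', 'Mid-senior', 'Senior']

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([18, 20, 21, 30, 31, 50, 51, 70])
groups = pd.cut(ages, bins=bins, labels=labels, right=True)

print(pd.DataFrame({'age': ages, 'age_group': groups}))
```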

In [45]:
df.head()

Unnamed: 0,customer_id,age,gender,item_purchased,category,purchase_amount,location,size,color,season,review_rating,subscription_status,shipping_type,discount_applied,promo_code_used,previous_purchases,payment_method,frequency_of_purchases,age_group
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly,Senior
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly,Children
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly,Mid-senior
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly,Young
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually,Mid-senior


In [None]:
df = df.drop(['discount_applied','promo_code_used'],axis = 1)

In [89]:
df['age'].skew()

-0.0063797217209905395

In [91]:
df['purchase_amount'].skew()

0.012701757626433795
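
Skewness values this close to zero indicate roughly symmetric distributions. For reference, `Series.skew()` returns the bias-corrected sample skewness, which matches `scipy.stats.skew` with `bias=False`. The sketch below uses made-up numbers to illustrate the sign convention.

```python
import pandas as pd
from scipy.stats import skew

# Hypothetical right-skewed sample: one large value pulls the tail to the right.
values = pd.Series([1, 2, 2, 3, 10])
pandas_skew = values.skew()
scipy_skew = skew(values.to_numpy(), bias=False)

# A perfectly symmetric sample has zero skewness.
symmetric_skew = pd.Series([1, 2, 3, 4, 5]).skew()

print(pandas_skew, scipy_skew, symmetric_skew)
```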


## 6. Behavior by Segments

### Gender with average purchase amount

In [57]:
df.groupby('gender')['purchase_amount'].mean().sort_values(ascending=False)

gender
Female    60.249199
Male      59.536199
Name: purchase_amount, dtype: float64

### Age group with average purchase amount

In [58]:
# observed=False keeps the current behavior for categorical keys and silences the FutureWarning.
df.groupby('age_group', observed=False)['purchase_amount'].mean().sort_values(ascending=False)


age_group
Young         60.753053
Senior        59.945799
Mid-senior    59.201356
Children      58.981132
Name: purchase_amount, dtype: float64

### Category with average purchase amount

In [61]:
df.groupby('category')['purchase_amount'].mean().sort_values(ascending=False)

category
Footwear       60.255426
Clothing       60.025331
Accessories    59.838710
Outerwear      57.172840
Name: purchase_amount, dtype: float64

### Category with average review rating

In [59]:
df.groupby('category')['review_rating'].mean().sort_values(ascending=False)

category
Footwear       3.793823
Accessories    3.770242
Outerwear      3.745988
Clothing       3.722395
Name: review_rating, dtype: float64

### Season with average purchase amount

In [64]:
df.groupby('season')['purchase_amount'].mean().sort_values(ascending=False)

season
Fall      61.556923
Winter    60.357364
Spring    58.737738
Summer    58.405236
Name: purchase_amount, dtype: float64
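
The segment averages above differ by only a few dollars, so group sizes and spread help put them in context. A minimal sketch, using a hypothetical toy frame rather than the real dataset, reports count, mean, and standard deviation in one pass:

```python
import pandas as pd

# Hypothetical purchases; only the column names mirror the dataset.
toy = pd.DataFrame({
    "season": ["Fall", "Fall", "Winter", "Winter", "Winter"],
    "purchase_amount": [60, 64, 50, 70, 60],
})

summary = (
    toy.groupby("season")["purchase_amount"]
       .agg(["count", "mean", "std"])        # one row per season
       .sort_values("mean", ascending=False)
)
print(summary)
```

The same pattern should work for `gender`, `age_group`, or `category` by changing the grouping key.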

## 7. Relationship between Variables


####  H0: gender and subscription_status are independent (no significant association).
####  H1: gender and subscription_status are associated (significant association).


In [88]:
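# Shapiro-Wilk normality test; the result gets its own name so the `stats` module is not overwritten.
shapiro_stat, p = stats.shapiro(df['purchase_amount'])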
p

1.815579163868232e-34
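
A p-value this small rejects normality even though the skewness of `purchase_amount` is close to zero: symmetry alone does not make a distribution normal. The sketch below illustrates this on simulated data. It assumes, purely for illustration, integer amounts drawn uniformly between 20 and 100; that is not a claim about how the real column is distributed.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Simulated stand-in for purchase amounts (assumption: uniform integers in 20..100).
rng = np.random.default_rng(0)
simulated = pd.Series(rng.integers(20, 101, size=3900))

sim_skew = simulated.skew()                       # near zero for a symmetric sample
w_stat, p_value = shapiro(simulated.to_numpy())   # Shapiro-Wilk statistic and p-value

print(sim_skew, w_stat, p_value)
```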

In [67]:
cont_table = pd.crosstab(df['gender'],df['subscription_status'])
cont_table

subscription_status    No   Yes
gender                         
Female               1248     0
Male                 1599  1053


In [73]:

# Name the test statistic `chi2` so the `stats` module is not overwritten.
chi2, p, dof, expected = stats.chi2_contingency(cont_table)
p

3.3268630006040623e-149

In [74]:
if p < 0.05:
    print('There is a significant association between gender and subscription status. (Reject H0)')
else:
    print('There is no significant association between gender and subscription status. (Fail to reject H0)')

There is a significant association between gender and subscription status. (Reject H0)
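
To see where this result comes from, it helps to compare the observed counts with the counts expected under independence, which `chi2_contingency` also returns. The sketch below rebuilds the gender table from the counts printed above instead of reading the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the crosstab output above.
cont_table = pd.DataFrame(
    {"No": [1248, 1599], "Yes": [0, 1053]},
    index=pd.Index(["Female", "Male"], name="gender"),
)

chi2, p, dof, expected = chi2_contingency(cont_table)

# Wrap the expected counts in a labeled frame for side-by-side comparison.
expected_df = pd.DataFrame(expected, index=cont_table.index, columns=cont_table.columns)
print(expected_df.round(2))
print(dof, p < 0.05)
```

The observed table has no female subscribers, while independence would predict several hundred, which is why the test statistic is so large.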


####  H0: subscription status and payment method are independent (no significant association).
####  H1: subscription status and payment method are associated (significant association).


In [79]:
cont_table2 = pd.crosstab(df['payment_method'],df['subscription_status'])
cont_table2

subscription_status   No  Yes
payment_method               
Bank Transfer        455  157
Cash                 497  173
Credit Card          492  179
Debit Card           446  190
PayPal               497  180
Venmo                460  174


In [86]:
chi2, p, dof, expected = stats.chi2_contingency(cont_table2)
p

0.569927650448781

In [87]:
if p < 0.05:
    print('There is a significant association between subscription status and payment method. (Reject H0)')
else:
    print('There is no significant association between subscription status and payment method. (Fail to reject H0)')

There is no significant association between subscription status and payment method. (Fail to reject H0)
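
Row-wise shares make this result easier to read: if payment method and subscription status are unrelated, the subscriber share should be similar in every row. The sketch below rebuilds the table from the counts printed above and normalizes each row.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the crosstab output above.
cont_table2 = pd.DataFrame(
    {"No": [455, 497, 492, 446, 497, 460], "Yes": [157, 173, 179, 190, 180, 174]},
    index=pd.Index(
        ["Bank Transfer", "Cash", "Credit Card", "Debit Card", "PayPal", "Venmo"],
        name="payment_method",
    ),
)

# Divide each row by its total to get the subscriber share per payment method.
share = cont_table2.div(cont_table2.sum(axis=1), axis=0)
print(share["Yes"].round(3))

chi2, p, dof, expected = chi2_contingency(cont_table2)
print(dof, p > 0.05)
```

On the real data, `pd.crosstab(..., normalize='index')` should give the same row shares directly.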
