Objective:
The goal of this project is to develop a predictive machine learning model that forecasts the likelihood of individuals making a purchase based on their age, gender, and annual income. This model will assist businesses in optimizing marketing strategies, targeting the right customer segments, and improving overall sales conversion rates.

Loading in data

In [16]:
import pandas as pd
import numpy as np

file = 'customer_purchase_data.csv'
cust_data = pd.read_csv(file)

Reading the data

In [19]:
cust_data.head()

   Age  Gender   AnnualIncome  NumberOfPurchases  ProductCategory  TimeSpentOnWebsite  LoyaltyProgram  DiscountsAvailed  PurchaseStatus
0   40       1   66120.267939                  8                0           30.568601               0                 5               1
1   20       1   23579.773583                  4                2           38.240097               0                 5               0
2   27       1  127821.306432                 11                2           31.633212               1                 0               1
3   24       1  137798.623120                 19                3           46.167059               0                 4               1
4   31       1   99300.964220                 19                1           19.823592               0                 0               1


Understand the data types

In [22]:
cust_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 1500 non-null   int64  
 1   Gender              1500 non-null   int64  
 2   AnnualIncome        1500 non-null   float64
 3   NumberOfPurchases   1500 non-null   int64  
 4   ProductCategory     1500 non-null   int64  
 5   TimeSpentOnWebsite  1500 non-null   float64
 6   LoyaltyProgram      1500 non-null   int64  
 7   DiscountsAvailed    1500 non-null   int64  
 8   PurchaseStatus      1500 non-null   int64  
dtypes: float64(2), int64(7)
memory usage: 105.6 KB


In [54]:
cust_data.shape

(1500, 9)

Check for missing data

In [35]:
missing_count = cust_data.isnull().sum()
missing_percent = 100 * cust_data.isnull().mean()

#Combining into a df
missing = pd.concat([missing_count, missing_percent], axis=1)

#Renaming cols
missing.columns = ['count', '%']

#Sorting by count highest first
missing = missing.sort_values(by='count', ascending=False)

print(missing)

                    count    %
Age                     0  0.0
Gender                  0  0.0
AnnualIncome            0  0.0
NumberOfPurchases       0  0.0
ProductCategory         0  0.0
TimeSpentOnWebsite      0  0.0
LoyaltyProgram          0  0.0
DiscountsAvailed        0  0.0
PurchaseStatus          0  0.0


There are no missing values in this dataset.

In [38]:
cust_data.describe()

             Age    Gender   AnnualIncome  NumberOfPurchases  ProductCategory  TimeSpentOnWebsite  LoyaltyProgram  DiscountsAvailed  PurchaseStatus
count     1500.0    1500.0         1500.0             1500.0           1500.0              1500.0          1500.0            1500.0          1500.0
mean   44.298667  0.504667   84249.164338              10.42         2.012667            30.46904        0.326667          2.555333           0.432
std    15.537259  0.500145   37629.493078           5.887391         1.428005           16.984392        0.469151          1.705152         0.49552
min         18.0       0.0   20001.512518                0.0              0.0            1.037023             0.0               0.0             0.0
25%         31.0       0.0   53028.979155                5.0              1.0             16.1567             0.0               1.0             0.0
50%         45.0       1.0   83699.581476               11.0              2.0           30.939516             0.0               3.0             0.0
75%         57.0       1.0  117167.772858               15.0              3.0           44.369863             1.0               4.0             1.0
max         70.0       1.0  149785.176481               20.0              4.0           59.991105             1.0               5.0             1.0


There are no obvious outliers: the minimum and maximum of each column look plausible. Every column also has the same count (1500), which matches the missing-value check.
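
The eyeball check for outliers can be backed up with a simple rule. The sketch below uses the common 1.5 x IQR fences; the helper name iqr_outlier_counts and the small made-up frame demo are assumptions for illustration, not part of the original notebook, and the 900000 income is deliberately extreme so the rule has something to flag.

```python
import pandas as pd


def iqr_outlier_counts(df, cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outside = (df[col] < lower) | (df[col] > upper)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name='outliers')


# Made-up stand-in data (NOT the real customer file)
demo = pd.DataFrame({
    'Age': [18, 31, 45, 57, 70],
    'AnnualIncome': [20000.0, 53000.0, 84000.0, 117000.0, 900000.0],
})

outlier_counts = iqr_outlier_counts(demo, ['Age', 'AnnualIncome'])
print(outlier_counts)
```

On the real data the same call would be iqr_outlier_counts(cust_data, ['Age', 'AnnualIncome', 'TimeSpentOnWebsite']). As a quick sanity check against the describe() output: Age has quartiles 31 and 57, so the fences are -8 and 96, and the observed range of 18 to 70 sits inside them.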

In [58]:
cust_data.duplicated().sum()

112

There are 112 duplicated rows. I am unsure whether to drop or keep them, so for now I will keep them.
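
Before deciding, it can help to look at the duplicated rows side by side. The following is a minimal sketch on a made-up frame named demo (an assumption, standing in for cust_data): keep=False marks every member of a duplicate group rather than only the repeats, and drop_duplicates shows how many rows would remain if they were removed.

```python
import pandas as pd

# Made-up stand-in data (NOT the real customer file); rows 0 and 2 are identical
demo = pd.DataFrame({
    'Age': [40, 20, 40, 27],
    'Gender': [1, 1, 1, 0],
    'PurchaseStatus': [1, 0, 1, 1],
})

# keep=False flags all copies, so identical rows can be inspected together
dup_rows = demo[demo.duplicated(keep=False)].sort_values(list(demo.columns))

# Removing duplicates keeps the first occurrence of each row by default
deduped = demo.drop_duplicates().reset_index(drop=True)

print(dup_rows)
print(len(demo), '->', len(deduped))
```

Since the dataset has no customer ID column, identical rows could be either repeated records or different customers who happen to share the same values, which is one reason to hold off on dropping them.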

Now I am going to focus on the columns and their dtypes

In [43]:
print(cust_data['Age'].head(20))

0     40
1     20
2     27
3     24
4     31
5     66
6     39
7     64
8     43
9     20
10    66
11    70
12    54
13    64
14    19
15    70
16    51
17    18
18    57
19    20
Name: Age, dtype: int64


I do not think we need to make any changes to the 'Age' column.

In [46]:
print(cust_data['Gender'].head(20))

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     0
9     1
10    1
11    1
12    0
13    1
14    1
15    1
16    1
17    1
18    0
19    0
Name: Gender, dtype: int64


Male = 0, Female = 1. No changes are needed here either, although the dtype could be changed to 'category'.
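
If the category dtype turns out to be useful, the conversion could look like the sketch below. It assumes the Male = 0, Female = 1 mapping noted above and uses a made-up frame named demo; the GenderLabel column and gender_labels dict are hypothetical names introduced only for this example.

```python
import pandas as pd

# Made-up stand-in data (NOT the real customer file)
demo = pd.DataFrame({'Gender': [1, 1, 0, 1, 0]})

# Mapping taken from the note above: Male = 0, Female = 1
gender_labels = {0: 'Male', 1: 'Female'}

# Keep the original 0/1 column and add a labelled categorical copy
demo['GenderLabel'] = demo['Gender'].map(gender_labels).astype('category')

print(demo['GenderLabel'].dtype)
print(demo['GenderLabel'].value_counts())
```

The labelled column is mostly a readability aid for tables and plots. The original 0/1 integer column is already in a form that most models accept, so keeping it alongside the categorical copy loses nothing.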

In [49]:
print(cust_data['AnnualIncome'].head(20))

0      66120.267939
1      23579.773583
2     127821.306432
3     137798.623120
4      99300.964220
5      37758.117475
6     126883.385286
7      39707.359724
8     102797.301269
9      63854.921080
10     66199.993929
11     83556.718133
12    114467.228969
13     31880.893223
14    107485.660911
15     67049.598809
16    129174.208866
17    128374.495052
18     71740.688084
19    121499.006189
Name: AnnualIncome, dtype: float64


This also looks good. The only adjustment I can imagine is rounding these values to fewer decimal places, but that is likely not necessary.
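
If rounding is wanted later, a minimal sketch could look like this. It uses the first three AnnualIncome values shown above in a made-up frame named demo, and writes the result to a new column (AnnualIncomeRounded, a hypothetical name) so the original precision is not lost.

```python
import pandas as pd

# First three AnnualIncome values from the output above, in a stand-in frame
demo = pd.DataFrame({'AnnualIncome': [66120.267939, 23579.773583, 127821.306432]})

# Round to two decimal places (cents) in a separate column
demo['AnnualIncomeRounded'] = demo['AnnualIncome'].round(2)

print(demo)
```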