Objective: Perform data aggregation and dimensionality reduction on a marketing dataset

In [2]:
# import required libraries
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# loading the data
data = pd.read_csv('Datasets/MMDS_c02_Data/data_aggregation_reduction.csv')

Task 1
Aggregate the "data_aggregation_reduction.csv" data by 'Region' and calculate the average 'Monthly Spend ($)' and the total 'Purchase Frequency' per region.


In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer ID         1000 non-null   int64  
 1   Age Group           1000 non-null   object 
 2   Region              1000 non-null   object 
 3   Monthly Spend ($)   1000 non-null   float64
 4   Product Category    1000 non-null   object 
 5   Purchase Frequency  1000 non-null   int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 47.0+ KB


In [29]:
# Data aggregation
# Aggregating by 'Region'.
# Equivalent, more verbose form from the book, using pd.NamedAgg:
# region_aggregated_data = data.groupby('Region').agg(
#     Average_Monthly_Spend=pd.NamedAgg(column='Monthly Spend ($)', aggfunc='mean'),
#     Total_Purchase_Frequency=pd.NamedAgg(column='Purchase Frequency', aggfunc='sum')
# ).reset_index()

# Shorthand: each keyword is new_column=(source_column, aggregation_function)
region_aggregated_data = data.groupby('Region').agg(
    Avg_Monthly_Spend=('Monthly Spend ($)', 'mean'),
    Total_Purchase_Frequency=('Purchase Frequency', 'sum')
).reset_index()

In [28]:
region_aggregated_data

Unnamed: 0,Region,Avg_Monthly_Spend,Total_Purchase_Frequency
0,East,273.567194,2423
1,North,264.172629,2645
2,South,290.110025,2712
3,West,265.984319,2464
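
As a sanity check on the named-aggregation pattern used above, here is a minimal sketch on a four-row frame. The `toy` values are invented for illustration and are not taken from the marketing dataset; only the column names mirror it.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the marketing dataset)
toy = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Monthly Spend ($)': [100.0, 200.0, 50.0, 150.0],
    'Purchase Frequency': [1, 3, 2, 4],
})

# Same pattern as the notebook cell: new_column=(source_column, function)
toy_agg = toy.groupby('Region').agg(
    Avg_Monthly_Spend=('Monthly Spend ($)', 'mean'),
    Total_Purchase_Frequency=('Purchase Frequency', 'sum'),
).reset_index()

print(toy_agg)
```

With values this small, the per-region mean and sum can be checked by hand, which makes it easier to trust the same call on the full 1000-row data.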


Task 2 
Perform a principal component analysis (PCA) to reduce the dimensions of the data while retaining key information.

In [33]:
# Data Reduction using Principal Component Analysis (PCA)
# Preparing Data for PCA:
pca_data = data[['Monthly Spend ($)', 'Purchase Frequency']]

Standardize the data:
$x_{\text{standardized}} = \frac{x - \mu}{\sigma}$, where $x$ is the original value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.

PCA is sensitive to the scale of the data, so we standardize each feature to a mean of 0 and a standard deviation of 1.


In [37]:
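# Standardize the data (pandas .std() defaults to the sample standard deviation, ddof=1)
pca_data_standardized = (pca_data - pca_data.mean()) / pca_data.std()

# Alternative code: StandardScaler uses the population standard deviation (ddof=0),
# which is why its output differs slightly from the pandas version
from sklearn.preprocessing import StandardScaler
pca_data_standardized_a = StandardScaler().fit_transform(pca_data)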


print(pca_data_standardized)
print(pca_data_standardized_a)

     Monthly Spend ($)  Purchase Frequency
0             0.280105            1.071056
1            -0.646991           -1.161861
2             1.492986            1.629285
3             0.070849            0.512827
4            -0.359618            0.140674
..                 ...                 ...
995           0.561980            1.443209
996           0.091150            1.071056
997           0.902244           -0.603632
998          -1.619106           -1.534014
999          -1.048083           -1.347937

[1000 rows x 2 columns]
[[ 0.28024558  1.07159168]
 [-0.64731493 -1.16244239]
 [ 1.49373284  1.63010019]
 ...
 [ 0.90269579 -0.60393388]
 [-1.61991584 -1.53478141]
 [-1.04860786 -1.3486119 ]]
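
The two printouts differ slightly (for example 0.280105 versus 0.28024558 in the first row). A likely explanation is the divisor: pandas' `.std()` defaults to `ddof=1`, while `StandardScaler` scales by the population standard deviation (`ddof=0`), so the results should differ by a constant factor of sqrt(n / (n - 1)). A small sketch on invented values, using only pandas and NumPy to mimic both conventions:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only (not from the marketing dataset)
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

z_sample = (x - x.mean()) / x.std()            # pandas default: ddof=1
z_population = (x - x.mean()) / x.std(ddof=0)  # population convention: ddof=0

# Ratio between the two scalings; it depends only on n
scale = x.std() / x.std(ddof=0)
print(scale, np.sqrt(n / (n - 1)))
```

For n = 1000 this factor is about 1.0005, which is consistent with the gap between the two printed arrays.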


In [40]:
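# n_components=2 on a two-feature input keeps both dimensions (a rotation, no reduction yet)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(pca_data_standardized)
# Separate estimator so the second fit does not overwrite the first one's fitted attributes
pca_a = PCA(n_components=2)
principal_components_a = pca_a.fit_transform(pca_data_standardized_a)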

print(principal_components)
print(principal_components_a)

[[ 0.55928634  0.95541522]
 [-0.36406795 -1.27905167]
 [ 0.09637806  2.2077788 ]
 ...
 [-1.06481524  0.21115092]
 [ 0.06016903 -2.22959218]
 [-0.2120288  -1.69424259]]
[[ 0.55956619  0.95589329]
 [-0.36425012 -1.27969168]
 [ 0.09642628  2.20888352]
 ...
 [-1.06534805  0.21125657]
 [ 0.06019914 -2.23070782]
 [-0.21213489 -1.69509035]]
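
To judge how much information each component retains, a fitted `PCA` object exposes `explained_variance_ratio_`. The sketch below uses synthetic correlated data generated on the spot (the `spend` and `frequency` arrays are assumptions, not the marketing data), and then keeps a single component to show what an actual 2-to-1 reduction would look like.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated two-feature data (an assumption for illustration)
rng = np.random.default_rng(0)
spend = rng.normal(250, 50, size=200)
frequency = 0.02 * spend + rng.normal(0, 0.5, size=200)
X = StandardScaler().fit_transform(np.column_stack([spend, frequency]))

# Share of total variance captured by each component, largest first
pca_full = PCA(n_components=2).fit(X)
ratios = pca_full.explained_variance_ratio_

# Keeping one component is the reduction step: 2 features -> 1
X_reduced = PCA(n_components=1).fit_transform(X)
print(ratios, X_reduced.shape)
```

If the first ratio is close to 1, dropping the second component loses little variance; if the two ratios are similar, as can happen with weakly correlated features, a single component would discard a substantial share of the information.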


In [43]:
# Creating DataFrame with Principal Components:
principal_df = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])
principal_df

Unnamed: 0,Principal Component 1,Principal Component 2
0,0.559286,0.955415
1,-0.364068,-1.279052
2,0.096378,2.207779
3,0.312525,0.412721
4,0.353760,-0.154817
...,...,...
995,0.623123,1.417882
996,0.692898,0.821804
997,-1.064815,0.211151
998,0.060169,-2.229592


In [44]:
principal_df_a = pd.DataFrame(data=principal_components_a, columns=['PC 1', 'PC 2'])
principal_df_a

Unnamed: 0,PC 1,PC 2
0,0.559566,0.955893
1,-0.364250,-1.279692
2,0.096426,2.208884
3,0.312682,0.412927
4,0.353937,-0.154895
...,...,...
995,0.623435,1.418592
996,0.693245,0.822215
997,-1.065348,0.211257
998,0.060199,-2.230708
