In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Links:
    - Kaggle Dataset: https://www.kaggle.com/datasets/vetrirah/customer
    - github: https://github.com/andycrossman11/Customer_Segmentation_Project

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Introduction
    - A marketing client has customer related data and has hypothesized that there are 4 types of customers.
    - The data provided has a 'Segmentation' assignment of 'A', 'B', 'C', or 'D' for each customer.
    - The client has asked to provide a clustering model that validates this segmentation hypothesis.
    
    - Therefore, as a validation task, this work calls for an unsupervised learning algorithm.
    - If I used the client's segments as ground-truth labels in a supervised learning algorithm, the objective function would be optimized to make the hypothesis look correct. That bias would invalidate the test. To validate the assignment, I instead need to find a vector space in which the same segments emerge without the labels being used in training.

In [None]:
train_data = pd.read_csv("/kaggle/input/customer/Train.csv")

print(train_data.head(5))

## Perform initial EDA:
    - Look at each feature (datatype, distribution, histogram, etc.).
    - Look for missing values

In [None]:
print(train_data.info())

print(train_data.describe(include="all"))


In [None]:
for column in train_data.select_dtypes(include=['object']).columns:
    train_data[column].value_counts().plot(kind='bar')
    plt.title(f'Histogram for {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

## Categorical Variable Analysis:
    - Gender is skewed towards Male
    - Ever_Married is skewed towards Yes
    - Graduated is skewed towards Yes
    - Profession is skewed: Artist and Healthcare make up roughly half of all data
    - Spending Score is skewed: over half of all data is 'Low'
    - Segmentation is roughly uniform across the 4 customer types defined by the marketing agency

In [None]:
for column in train_data.select_dtypes(include=['int64', 'float64']).columns:
    if column != "ID":
        train_data[column].hist(bins=30)
        plt.title(f'Histogram for {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.show()

## Numeric Variable Analysis:
    - All 3 look roughly gamma-distributed: most of the data sits left of center, with a long right tail
    - From the histogram, it looks like Family_Size could be treated as categorical. This would be a mistake, as a categorical representation would not generalize to unseen data: family size could theoretically be any positive integer, including values absent from this dataset.

In [None]:
missing_percentages = train_data.isnull().sum() * 100 / len(train_data)
print(missing_percentages)

## Handling Missing Values
    - Ever_Married is missing ~ 1.74% of values. The skew is fairly small, and certainly not large enough to just convert missing values to 'yes'. To handle the missing values I will replace nan with 'Missing' thus turning the binary categorical var to a 3-way categorical var.
    - Graduated will be treated in the same manner since the skew is not that high
    - Since Profession has so many types already, it makes sense to add a 'missing' type to handle the missing data rather than assigning to 'artist' type
    - Var_1 is so heavily skewed in its histogram that intuitively it makes sense to impute with the most frequent type, 'Cat_6'.
    - Work_Experience looks to follow a gamma distribution so I will impute with the median as this is more robust to the outliers in the right tail
    - Family_Size also looks to follow a gamma distribution so I will impute with the median for the same reasons as above
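
The median-versus-mean reasoning above can be checked on a small right-skewed sample. This is a minimal sketch on hypothetical values (not taken from the dataset); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one missing value and one large outlier
work_experience = pd.Series([0, 1, 1, 2, 2, 3, 14, np.nan])

mean_value = work_experience.mean()      # NaN is skipped; pulled up by the outlier
median_value = work_experience.median()  # NaN is skipped; unaffected by the outlier

# Impute the missing entry with the median
imputed = work_experience.fillna(median_value)

print("mean:", mean_value)
print("median:", median_value)
print(imputed.tolist())
```

The single value in the right tail drags the mean above the median, so a mean-imputed row would sit further from the bulk of the data than a median-imputed one.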

In [None]:
train_data.fillna({'Ever_Married':'Missing'}, inplace=True)

train_data.fillna({'Graduated': 'Missing'}, inplace=True)

train_data.fillna({'Profession': 'Missing'}, inplace=True)

train_data.fillna({'Var_1': 'Cat_6'}, inplace=True)

train_data.fillna({'Work_Experience': train_data['Work_Experience'].median()}, inplace=True)

train_data.fillna({'Family_Size': train_data['Family_Size'].median()}, inplace=True)

new_missing_percentages = train_data.isnull().sum() * 100 / len(train_data)
print(new_missing_percentages)

## Scale all numeric features to the [0, 1] interval with min-max scaling. When an algorithm relies on a distance metric, numeric features must share a scale to be treated equally. Otherwise, the features with larger ranges will dominate the features with smaller ranges

## Also drop ID column since this will be of no use in prediction
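
To illustrate why scale matters for distance-based algorithms, here is a small sketch on hypothetical rows (an age-like column and a family-size-like column; the numbers are made up). It compares Euclidean distances before and after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy rows: [age in years, family size]
X = np.array([
    [20.0, 1.0],
    [60.0, 1.0],   # differs from row 0 only in age
    [20.0, 9.0],   # differs from row 0 only in family size
])

def distances_from_first(data):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw_dist = distances_from_first(X)
scaled_dist = distances_from_first(MinMaxScaler().fit_transform(X))

print("raw distances:   ", raw_dist)
print("scaled distances:", scaled_dist)
```

On the raw data the age difference dominates the distance. After scaling, both rows span the full range of their respective feature and end up equally far from row 0.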

In [None]:
from sklearn.preprocessing import MinMaxScaler

train_data.drop(columns=["ID"], inplace=True)

numeric_columns = train_data.select_dtypes(include=['int64', 'float64']).columns

mm_scaler = MinMaxScaler()
train_data[numeric_columns] = mm_scaler.fit_transform(train_data[numeric_columns])

print(train_data[numeric_columns].describe(include="all"))

## Now convert Categorical Variables to One-hot Encodings

## First drop Segmentation Column! It is simply there for validation and not to be used for training since this is an unsupervised learning task!

In [None]:
segmentation = train_data["Segmentation"]

train_data.drop(columns=["Segmentation"], inplace=True)

categorical_columns = train_data.select_dtypes(include=['object']).columns

train_data = pd.get_dummies(train_data, columns=categorical_columns)

print(train_data.head(5))

## Correlation Matrix on Transformed Data

In [None]:
correlation = train_data.corr()

plt.figure(figsize=(20, 20))
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

## The one-hot encoded columns are highly correlated, especially when the category has few possible values. For example, the correlation between Graduated_Yes and Graduated_No is -.98. Intuitively this makes sense because knowing the value of one one-hot column largely determines the other. The correlation is not perfect because the missing values became a third 'Missing' category, so the variable is no longer strictly binary
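
The effect of a small third category on the correlation between two one-hot columns can be reproduced on a hypothetical column. The counts below are invented for illustration and are not the dataset's actual proportions.

```python
import pandas as pd

# Hypothetical column: 49 'Yes', 49 'No', 2 'Missing'
graduated = pd.Series(["Yes"] * 49 + ["No"] * 49 + ["Missing"] * 2, name="Graduated")

# Strictly binary case: drop the 'Missing' rows before encoding
binary_only = pd.get_dummies(graduated[graduated != "Missing"], dtype=float)
# Three-way case: keep 'Missing' as its own category
with_missing = pd.get_dummies(graduated, dtype=float)

corr_binary = binary_only["Yes"].corr(binary_only["No"])
corr_three_way = with_missing["Yes"].corr(with_missing["No"])

print("binary:", corr_binary)
print("with 'Missing' category:", corr_three_way)
```

With only two categories each column is exactly one minus the other, so the correlation is -1. Adding a few rows where both columns are 0 weakens it slightly.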

## First try projecting down to 3 dimensions for visual purposes using:
    - PCA
    - t-SNE
    - Matrix factorization (truncated SVD)

In [None]:
from sklearn.decomposition import PCA,TruncatedSVD
from sklearn.manifold import TSNE


unique_clusters = segmentation.unique()
colors = sns.color_palette('viridis', len(unique_clusters))

In [None]:
pca = PCA(n_components=3, random_state=11)
pca_fit = pca.fit_transform(train_data.values)

pca_df = pd.DataFrame({"pca_one": pca_fit[:,0], "pca_two": pca_fit[:,1], "pca_three": pca_fit[:,2], "Segmentation": segmentation})

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(111, projection='3d')


for segment, color in zip(unique_clusters, colors):
    subset = pca_df[pca_df['Segmentation'] == segment]
    ax.scatter(subset['pca_one'], subset['pca_two'], subset['pca_three'], label=segment, color=color)
    
ax.set_xlabel('pca_one')
ax.set_ylabel('pca_two')
ax.set_zlabel('pca_three')
ax.set_title('3D Scatter Plot of PCA Projection')

ax.legend(title='Segment')
plt.show()

In [None]:
tsne = TSNE(n_components=3, random_state=11)
tsne_fit = tsne.fit_transform(train_data)

tsne_df = pd.DataFrame({"tsne_one": tsne_fit[:,0], "tsne_two": tsne_fit[:,1], "tsne_three": tsne_fit[:,2], "Segmentation": segmentation})

fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')

for segment, color in zip(unique_clusters, colors):
    subset = tsne_df[tsne_df['Segmentation'] == segment]
    ax.scatter(subset['tsne_one'], subset['tsne_two'], subset['tsne_three'], label=segment, color=color)
    
ax.set_xlabel('tsne_one')
ax.set_ylabel('tsne_two')
ax.set_zlabel('tsne_3')
ax.set_title('3D Scatter Plot of T-Sne Projection')

ax.legend(title='Segment')
plt.show()

In [None]:
svd = TruncatedSVD(n_components=3, random_state=11)
svd_fit = svd.fit_transform(train_data)

svd_df = pd.DataFrame({"svd_one": svd_fit[:,0], "svd_two": svd_fit[:,1], "svd_three": svd_fit[:,2], "Segmentation": segmentation})

fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')

for segment, color in zip(unique_clusters, colors):
    subset = svd_df[svd_df['Segmentation'] == segment]
    ax.scatter(subset['svd_one'], subset['svd_two'], subset['svd_three'], label=segment, color=color)
    
ax.set_xlabel('svd_one')
ax.set_ylabel('svd_two')
ax.set_zlabel('svd_3')
ax.set_title('3D Scatter Plot of SVD Projection')

ax.legend(title='Segment')
plt.show()

In [None]:
from scipy.optimize import linear_sum_assignment

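def create_label_mapping(y_true, y_pred):
    """Map each predicted cluster id to a true label, one-to-one, maximizing total overlap."""
    true_labels = np.unique(y_true)
    pred_labels = np.unique(y_pred)

    # Negative overlap counts, because linear_sum_assignment minimizes total cost
    cost_matrix = np.zeros((len(true_labels), len(pred_labels)))
    for i, true_label in enumerate(true_labels):
        for j, pred_label in enumerate(pred_labels):
            cost_matrix[i, j] = -np.sum((y_true == true_label) & (y_pred == pred_label))

    row_ind, col_ind = linear_sum_assignment(cost_matrix)

    # Index the label arrays explicitly rather than assuming cluster ids are 0..k-1
    label_mapping = {pred_labels[j]: true_labels[i] for i, j in zip(row_ind, col_ind)}
    return label_mapping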
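
Cluster ids from k-means are arbitrary integers, so they have to be matched to the segment letters before an accuracy can be computed. The sketch below shows that matching on hypothetical toy labels, using the `maximize=True` option of `linear_sum_assignment` rather than negated counts. The names `overlap`, `mapping`, and `mapped` are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical toy labels: cluster ids carry no meaning on their own
y_true = np.array(["A", "A", "B", "B", "C", "C"])
y_pred = np.array([2, 2, 0, 0, 1, 0])

true_labels = np.unique(y_true)
pred_labels = np.unique(y_pred)

# overlap[i, j] = number of points with true label i and cluster id j
overlap = np.array([
    [np.sum((y_true == t) & (y_pred == p)) for p in pred_labels]
    for t in true_labels
])

# One-to-one assignment with the largest total overlap
row_ind, col_ind = linear_sum_assignment(overlap, maximize=True)
mapping = {pred_labels[c]: true_labels[r] for r, c in zip(row_ind, col_ind)}

# Translate cluster ids to segment letters and score the result
mapped = np.array([mapping[p] for p in y_pred])
accuracy = np.mean(mapped == y_true)

print(mapping)
print("accuracy:", accuracy)
```

Because the assignment is one-to-one, the resulting accuracy is the best achievable over all relabelings of the clusters, which makes it a fair upper bound for comparing clusterings against the client's segments.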

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix

pca_df.drop(columns=["Segmentation"], inplace=True)
tsne_df.drop(columns=["Segmentation"], inplace=True)
svd_df.drop(columns=["Segmentation"], inplace=True)

dataframes = {"Original Transformed Data": train_data, "PCA to 3-D": pca_df, "T-Sne to 3-D": tsne_df, "SVD to 3-D": svd_df}

In [None]:
for key, df in dataframes.items():
    kmeans = KMeans(n_clusters=4, random_state=11, n_init=10)
    kmeans.fit(df)
    
    label_map = create_label_mapping(segmentation, kmeans.labels_)
    int_to_label = pd.Series(kmeans.labels_).map(label_map)

    accuracy = accuracy_score(segmentation, int_to_label)
    
    print(f"{key} Results:\nInertia: {kmeans.inertia_}\nDomain Specific Assignment Accuracy: {accuracy}\n\n")
    
    cm = confusion_matrix(segmentation, int_to_label)

    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.title(f'{key} Confusion Matrix')
    plt.show()
    
    print("\n\n\n\n")
    

## Interpreting Results:
    - For each kmeans model, I returned the inertia, assignment accuracy, and confusion matrix.
    
    - The inertia is the sum of squared distances from each point to its cluster center, which is the objective that the k-means algorithm minimizes. The best (lowest) inertia comes from the SVD model. 
    
    - Note that the kmeans model trained on the high-dimensional dataset has a huge inertia relative to the other models. This is because its data space has many more dimensions, so the comparison is not a fair one. I trained the model to illustrate this point and to compare assignment accuracies.
    
    - All models have awful accuracy scores of < 40%. This is because k-means is an unsupervised algorithm! The domain-specific clustering information was not used in training the kmeans model, which only minimizes inertia, so its clusters have no reason to align with the client's segments. Therefore, none of the fitted kmeans models provide a source of validation for the client's hypothesis.
    
    - Just because none of the clusterings in these different vector spaces confirms the hypothesis does not mean the hypothesis is wrong. It does mean I cannot validate the client's claim at this moment.
    
    - This is why it is so important to specify what the real objective is going into an ML problem. The client provided "customer types" that they take to be ground truth. 
    
    - If the problem is to verify that there exists a vector space in which these assignments emerge naturally, then dimensionality reduction and clustering are the algorithms to use.
    
    - If the problem is to accept the client's "customer type" clustering and to provide predictive capabilities for future customers, then a supervised algorithm would be the best course of action.
    
    - This notebook is intended to showcase data cleaning and the implications of such cleaning operations, the reason for training a kmeans model, the limitations of such a model, and the importance of a clear objective in machine learning.
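
To make the inertia discussion concrete, the sketch below recomputes inertia by hand on hypothetical toy data and then shows how appending uninformative columns inflates it. The blobs, the noise columns, and all variable names are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)

# Hypothetical toy data: two tight blobs in 2-D
blobs = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=1.0, scale=0.1, size=(20, 2)),
])

kmeans_low = KMeans(n_clusters=2, random_state=11, n_init=10).fit(blobs)

# Inertia by hand: squared distance from each point to its assigned center
assigned_centers = kmeans_low.cluster_centers_[kmeans_low.labels_]
manual_inertia = np.sum((blobs - assigned_centers) ** 2)

# Append 20 uninformative columns (uniform noise) to the same points
noise = rng.uniform(0.0, 1.0, size=(40, 20))
blobs_high = np.hstack([blobs, noise])
kmeans_high = KMeans(n_clusters=2, random_state=11, n_init=10).fit(blobs_high)

print("manual inertia:               ", manual_inertia)
print("sklearn inertia:              ", kmeans_low.inertia_)
print("inertia with 20 extra columns:", kmeans_high.inertia_)
```

Every added dimension contributes its own squared distances to the sum, so inertia values are only comparable between models fitted in spaces of the same dimensionality and scale.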