## Enhancing Customer Engagement: A Segmentation Approach for Credit Card Users with Model Deployment on Web

Clustering: The primary objective of clustering is to group similar data points together based on certain features or characteristics. Clusters are formed to maximize the intra-cluster similarity and minimize the inter-cluster similarity.
Segmentation: Segmentation is more business-oriented and aims to identify groups of customers or market segments that share similar behaviors, needs, or characteristics. It is often used in marketing and customer analysis to tailor strategies for different segments.

### Importing Necessary Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # high-level interface for creating interactive plots with minimal code
import plotly.graph_objects as go  # lower-level interface offering finer control over plot appearance and behavior

In [None]:
import warnings
warnings.filterwarnings('ignore')

### Importing Dataset

In [None]:
df = pd.read_csv("customer_data_credit_card.csv")

### Exploratory Data Analysis

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum()

In [None]:
(df['PAYMENTS'] == 0).sum()

In [None]:
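mode_credit_limit = df['CREDIT_LIMIT'].mode()[0]
# Assign the result back; inplace=True on a selected column is unreliable under pandas copy-on-write
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(mode_credit_limit)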

In [None]:
df.loc[df['PAYMENTS'] == 0, 'MINIMUM_PAYMENTS'] = df.loc[df['PAYMENTS'] == 0, 'MINIMUM_PAYMENTS'].fillna(0)

In [None]:
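# Remaining gaps in MINIMUM_PAYMENTS are filled with the smallest observed PAYMENTS value
min_payments = df['PAYMENTS'].min()
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(min_payments)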

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.nunique()

In [None]:
df = df.drop(['CUST_ID'], axis = 1)

In [None]:
def identify_numeric_type(df, column_name):
    unique_values_count = len(df[column_name].unique())
    if unique_values_count < 10:
        return 'Discrete'
    else:
        return 'Continuous'

In [None]:
discrete_numeric_columns = []
continuous_numeric_columns = []

In [None]:
for column in df.columns:
    if df[column].dtype == 'float64' or df[column].dtype == 'int64':
        column_type = identify_numeric_type(df, column)
        if column_type == 'Discrete':
            discrete_numeric_columns.append(column)
        elif column_type == 'Continuous':
            continuous_numeric_columns.append(column)

In [None]:
print('Discrete Numeric Columns:', discrete_numeric_columns)
print('\n')
print('Continuous Numeric Columns:', continuous_numeric_columns)

In [None]:
for i in discrete_numeric_columns:
    print(i)
    print(df[i].unique())

In [None]:
for i in discrete_numeric_columns:
    print(i)
    print(df[i].value_counts())

### Data Visualization

In [None]:
for i in discrete_numeric_columns:
    print('Countplot for:', i)
    plt.figure(figsize=(15,6))
    sns.countplot(x = df[i], data = df, palette = 'hls')
    plt.show()

In [None]:
for i in discrete_numeric_columns:
    print('Pie plot for:', i)
    plt.figure(figsize=(20, 10))
    df[i].value_counts().plot(kind='pie', autopct='%1.1f%%')
    plt.title('Distribution of ' + i)
    plt.ylabel('')
    plt.show()

In [None]:
for i in discrete_numeric_columns:
    fig = go.Figure(data=[go.Bar(x=df[i].value_counts().index, 
                                 y=df[i].value_counts())])
    fig.update_layout(
        title=i,
        xaxis_title=i,
        yaxis_title="Count")
    fig.show()

In [None]:
for i in discrete_numeric_columns:
    print('Pie plot for:', i)
    fig = px.pie(df, names=i)
    fig.show()
    print('\n')

In [None]:
for i in continuous_numeric_columns:
    plt.figure(figsize=(15,6))
    sns.histplot(df[i], kde = True, bins = 20)  # palette has no effect without a hue variable
    plt.xticks(rotation = 0)
    plt.show()

In [None]:
for i in continuous_numeric_columns:
    plt.figure(figsize=(15,6))
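    # sns.distplot is deprecated in recent seaborn releases; histplot with stat='density' is the replacement
    sns.histplot(df[i], kde = True, bins = 20, stat = 'density')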
    plt.xticks(rotation = 0)
    plt.show()

In statistics and data visualization, KDE (Kernel Density Estimation) and PDF (Probability Density Function) are related concepts used to estimate and visualize the probability density of a continuous random variable.

Kernel Density Estimation (KDE): Kernel Density Estimation is a non-parametric way to estimate the probability density function (PDF) of a random variable. It provides a smoother representation of the underlying probability distribution compared to traditional histograms. In KDE, a kernel (a smooth, symmetric, and usually bell-shaped function) is placed at each data point, and the overall density is obtained by summing these kernels. The bandwidth parameter controls the smoothness of the resulting density estimation.

Probability Density Function (PDF): The Probability Density Function (PDF) is a function that describes the relative likelihood of a continuous random variable taking values near a particular point; the probability of any single exact value is zero. The probability of the variable falling within a specific range is given by the integral of the PDF over that range. The PDF should satisfy two conditions: it should be non-negative for all values, and the total area under the curve (over all possible values) should equal 1.

In summary, KDE is a method to estimate the PDF of a continuous random variable by smoothing the distribution using kernels, and the PDF is a mathematical function that describes the probability distribution of a continuous random variable.
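The description above can be sketched directly in NumPy. This is a minimal illustration, assuming a Gaussian kernel and a hand-picked bandwidth of 0.4; the function name `gaussian_kde_1d` and the synthetic samples are hypothetical and are not part of the notebook's pipeline. It checks numerically that the estimated density is non-negative and that the area under it is close to 1.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian kernel centered on every sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Mean over samples, rescaled by the bandwidth so the result integrates to 1
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
toy_samples = rng.normal(loc=0.0, scale=1.0, size=200)  # synthetic data, not the credit card dataset
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(toy_samples, grid, bandwidth=0.4)

# Riemann-sum approximation of the area under the estimated density
area = density.sum() * (grid[1] - grid[0])
print(f"Approximate area under the KDE curve: {area:.3f}")
```

Increasing `bandwidth` gives a smoother curve, while decreasing it makes the estimate follow individual samples more closely.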

In [None]:
for i in continuous_numeric_columns:
    plt.figure(figsize=(5,5))
    sns.boxplot(x=df[i], palette='hls')  # pass the column as x explicitly; newer seaborn treats the first positional argument as data
    plt.xticks(rotation = 0)
    plt.title(i)
    plt.show()

In [None]:
for i in continuous_numeric_columns:
    plt.figure(figsize=(5,5))
    sns.violinplot(x=df[i], palette='hls')  # pass the column as x explicitly; newer seaborn treats the first positional argument as data
    plt.xticks(rotation = 0)
    plt.title(i)
    plt.show()

In [None]:
for i in continuous_numeric_columns:
    fig = go.Figure(data=[go.Histogram(x=df[i])])
    fig.update_layout(
        title=i,
        xaxis_title=i,
        yaxis_title="Count")
    fig.show()

In [None]:
for i in continuous_numeric_columns:
    fig = go.Figure(data=[go.Box(x=df[i])])
    fig.update_layout(
        title=i,
        xaxis_title=i,
        yaxis_title="Value")
    fig.show()

In [None]:
for i in continuous_numeric_columns:
    fig = go.Figure(data=[go.Violin(x=df[i])])
    fig.update_layout(
        title=i,
        xaxis_title=i,
        yaxis_title="Value")
    fig.show()

In [None]:
for i in discrete_numeric_columns:
    for j in continuous_numeric_columns:
        plt.figure(figsize=(15,6))
        # errorbar=None replaces the deprecated ci=None (requires seaborn >= 0.12)
        sns.barplot(x = df[i], y = df[j], data = df, errorbar = None, palette = 'hls')
        plt.show()

In [None]:
for i in discrete_numeric_columns:
    for j in continuous_numeric_columns:
        plt.figure(figsize=(15,6))
        sns.boxplot(x = df[i], y = df[j], data = df, palette = 'hls')
        plt.show()

In [None]:
for i in discrete_numeric_columns:
    for j in continuous_numeric_columns:
        plt.figure(figsize=(15,6))
        sns.violinplot(x = df[i], y = df[j], data = df, palette = 'hls')
        plt.show()

In [None]:
# for i in discrete_numeric_columns:
#     for j in continuous_numeric_columns:
#         fig = go.Figure()
#         fig.add_trace(go.Bar(x=df[i], y=df[j], name=f'{i} vs {j}'))
#         fig.update_layout(title=f'{i} vs {j}', xaxis_title=i, yaxis_title=j)
#         fig.show()

In [None]:
# for i in discrete_numeric_columns:
#     for j in continuous_numeric_columns:
#         fig = go.Figure()
#         fig.add_trace(go.Box(x=df[i], y=df[j], name=f'{i} vs {j}'))
#         fig.update_layout(title=f'{i} vs {j}', xaxis_title=i, yaxis_title=j)
#         fig.show()

In [None]:
for i in discrete_numeric_columns:
    for j in continuous_numeric_columns:
        fig = go.Figure()
        fig.add_trace(go.Violin(x=df[i], y=df[j], name=f'{i} vs {j}'))
        fig.update_layout(title=f'{i} vs {j}', xaxis_title=i, yaxis_title=j)
        fig.show()

In [None]:
for i in continuous_numeric_columns:
    for j in continuous_numeric_columns:
        if i != j:
            plt.figure(figsize=(15,6))
            sns.lineplot(x = df[j], y = df[i], data = df, palette = 'hls')
            plt.show()

In [None]:
for i in continuous_numeric_columns:
    for j in continuous_numeric_columns:
        if i != j:
            plt.figure(figsize=(15,6))
            sns.scatterplot(x = df[j], y = df[i], data = df, palette = 'hls')
            plt.show()

In [None]:
for i in continuous_numeric_columns:
    for j in continuous_numeric_columns:
        if i != j:
            fig = go.Figure()
            fig.add_trace(go.Scatter(x=df[j], y=df[i], mode='markers', 
                                     marker=dict(color='blue', opacity=0.6),
                                     name=f'{i} vs {j}'))
            fig.update_layout(title=f'{i} vs {j}', xaxis_title=j, yaxis_title=i)
            fig.show()

In [None]:
correlation_matrix = df.corr()

In [None]:
correlation_matrix

In [None]:
plt.figure(figsize=(15, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Credit Utilization Analysis
df['CREDIT_UTILIZATION'] = df['PURCHASES'] / df['CREDIT_LIMIT']
plt.figure(figsize=(10, 6))
sns.scatterplot(x='CREDIT_UTILIZATION', y='MINIMUM_PAYMENTS', data=df)
plt.title('Credit Utilization vs. Minimum Payments')
plt.xlabel('Credit Utilization')
plt.ylabel('Minimum Payments')
plt.show()

In [None]:
# Behavioral Patterns Analysis
behavioral_features = ['PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES']
df[behavioral_features].plot(kind='hist', bins=30, alpha=0.7, figsize=(12, 6))
plt.title('Purchase Patterns')
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()

In [None]:
fig = px.histogram(df, 
                   x=behavioral_features, 
                   nbins=30, 
                   opacity=0.7, 
                   labels={'value': 'Value', 'count': 'Count'},
                   title='Purchase Patterns')

fig.update_layout(bargap=0.2)
fig.show()

In [None]:
df1 = df.copy()

### Removing Outliers with an IQR-Style Rule (10th to 90th Percentile Range)

In [None]:
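Q1 = df1.quantile(0.10)   # lower cut: 10th percentile (wider than the usual 25th)
Q3 = df1.quantile(0.90)   # upper cut: 90th percentile (wider than the usual 75th)
IQR = Q3 - Q1             # inter-percentile range used in place of the classic IQR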

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df2= df1[((df1 >= lower_bound) & (df1 <= upper_bound)).all(axis=1)]

In [None]:
df2.shape

In [None]:
from sklearn.decomposition import PCA

In [None]:
X = df2.values

In [None]:
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

In [None]:
pca = PCA()
pca.fit(X_standardized)

In [None]:
explained_variance_ratio = pca.explained_variance_ratio_

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(1, len(explained_variance_ratio) + 1),
                         y=np.cumsum(explained_variance_ratio),
                         mode='lines+markers',
                         name='Cumulative Explained Variance'))
fig.add_trace(go.Bar(x=np.arange(1, len(explained_variance_ratio) + 1),
                     y=explained_variance_ratio,
                     name='Explained Variance Ratio'))

fig.update_layout(title='Explained Variance Ratio by Principal Component',
                  xaxis_title='Principal Component',
                  yaxis_title='Explained Variance Ratio',
                  showlegend=True)

fig.show()

"Explained Variance" and "Cumulative Explained Variance" are essential concepts, especially when dealing with dimensionality reduction techniques such as Principal Component Analysis (PCA). Let's break down these terms:

### Explained Variance (EV) or Explained Variance Ratio (EVR):

- **Explained Variance (EV):** In the context of PCA, the explained variance is the amount of variance captured by each principal component. It tells us how much information (variance) is attributed to each of the principal components.

- **Explained Variance Ratio (EVR):** It is the proportion of the dataset's variance that lies along the axis of each principal component. For each principal component, the explained variance ratio is the ratio of the variance captured by that component to the total variance.

In mathematical terms, for a particular principal component \(i\), the explained variance ratio \(EV_i\) is calculated as:

\[ EV_i = \frac{\text{Variance along PC}_i}{\text{Total Variance}} \]

The cumulative explained variance at \(i\) principal components is the sum of explained variance ratios up to the \(i\)th component. It represents the total amount of variance captured by considering the first \(i\) principal components.

### Cumulative Explained Variance:

- **Cumulative Explained Variance:** This is the accumulated amount of variance explained by the first \(i\) principal components. It helps us understand how much of the total variance in the dataset is explained by including more principal components.

In mathematical terms, for \(i\) principal components, the cumulative explained variance \(CEV_i\) is calculated as:

\[ CEV_i = \sum_{k=1}^{i} EV_k \]

In a practical sense, it's common to plot the cumulative explained variance against the number of principal components. This plot helps in deciding how many principal components to retain for dimensionality reduction while capturing a significant portion of the dataset's variance.

The cumulative explained variance plot often shows an "elbow" point, which can help determine the optimal number of principal components to retain based on the diminishing returns of additional components in explaining the variance.

Understanding these metrics is crucial for making informed decisions about the number of principal components to keep in dimensionality reduction techniques like PCA.
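The two formulas above can be reproduced by hand. The following is a minimal sketch on synthetic data, assuming a 0.95 retention threshold chosen only for illustration; the names `toy`, `evr`, `cev`, and `n_keep` are hypothetical. It relies on the fact that the variances along the principal components are the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with three columns of very different spread (illustrative only)
toy = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])

# Eigenvalues of the covariance matrix are the variances along the principal components
eigenvalues = np.linalg.eigvalsh(np.cov(toy, rowvar=False))[::-1]  # sorted descending

evr = eigenvalues / eigenvalues.sum()  # EV_i: share of total variance per component
cev = np.cumsum(evr)                   # CEV_i: running total of the shares

# Smallest number of components whose cumulative share reaches the chosen threshold
threshold = 0.95
n_keep = int(np.searchsorted(cev, threshold) + 1)

print("Explained variance ratio:", np.round(evr, 3))
print("Cumulative explained variance:", np.round(cev, 3))
print("Components needed to reach", threshold, ":", n_keep)
```

For the fitted `pca` object in this notebook, the same cumulative curve is `np.cumsum(pca.explained_variance_ratio_)`, which is what the plot above draws.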

In [None]:
feature_names = df2.columns

In [None]:
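# explained_variance_ratio is indexed by principal component, not by original feature,
# so features are ranked by their absolute loadings weighted by each component's explained variance
feature_importance = np.abs(pca.components_).T @ explained_variance_ratio
sorted_feature_names = [x for _, x in sorted(zip(feature_importance, feature_names), reverse=True)]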

In [None]:
for i, feature in enumerate(sorted_feature_names, start=1):
    print(f"{i}. {feature}")

In [None]:
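# explained_variance_ratio_ is indexed by principal component, not by original feature,
# so each feature is scored by its absolute loadings weighted by the components' explained variance
feature_contributions = np.abs(pca.components_).T @ pca.explained_variance_ratio_
fig_features = go.Figure(go.Bar(
    x=feature_names,
    y=feature_contributions,
    text=feature_names,
    hoverinfo='x+y',
))
fig_features.update_layout(
    title='Feature Contributions to Explained Variance',
    xaxis_title='Features',
    yaxis_title='Variance-Weighted Absolute Loading',
)
fig_features.show()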

In [None]:
correlation_threshold = 0.5

correlation_matrix = df2.corr().abs()

high_correlation_features = []
for feature in df2.columns:
    correlated_features = correlation_matrix[feature][correlation_matrix[feature] > correlation_threshold].index.tolist()
    correlated_features.remove(feature)  # Remove self-correlation
    if correlated_features:
        high_correlation_features.append((feature, correlated_features))

for feature, correlated_features in high_correlation_features:
    print(f"Feature '{feature}' is highly correlated with: {', '.join(correlated_features)}")

The high correlations observed among the features are summarized below:

1. **BALANCE** is strongly correlated with:
   - **CASH_ADVANCE**: Reflecting a link between cash advances and account balance.
   - **MINIMUM_PAYMENTS**: Suggesting a relationship between minimum payments and account balance.

2. **PURCHASES** is highly correlated with:
   - **ONEOFF_PURCHASES**, **INSTALLMENTS_PURCHASES**: Indicating different types of purchases are correlated with total purchases.
   - **PURCHASES_FREQUENCY**, **ONEOFF_PURCHASES_FREQUENCY**: Suggesting the frequency of purchases correlates with total purchases.
   - **PURCHASES_TRX**: Showing a correlation with the number of purchase transactions.

3. **ONEOFF_PURCHASES** is highly correlated with:
   - **PURCHASES**, **ONEOFF_PURCHASES_FREQUENCY**: Indicating a link with total purchases and the frequency of one-off purchases.
   - **PURCHASES_TRX**: Showing a correlation with the number of purchase transactions.
   - **CREDIT_UTILIZATION**: Suggesting a relationship with credit card utilization for one-off purchases.

4. **INSTALLMENTS_PURCHASES** is highly correlated with:
   - **PURCHASES_FREQUENCY**, **PURCHASES_INSTALLMENTS_FREQUENCY**: Suggesting a link with purchase frequency and installment purchases.
   - **PURCHASES_TRX**: Showing a correlation with the number of purchase transactions.

5. **CASH_ADVANCE** is highly correlated with:
   - **BALANCE**: Reflecting a correlation with account balance.
   - **CASH_ADVANCE_FREQUENCY**, **CASH_ADVANCE_TRX**: Indicating a relationship with the frequency and number of cash advances.

6. **PURCHASES_FREQUENCY** is highly correlated with:
   - **PURCHASES**, **INSTALLMENTS_PURCHASES**, **PURCHASES_INSTALLMENTS_FREQUENCY**: Showing a link with different aspects of purchase frequency.
   - **PURCHASES_TRX**: Correlated with the number of purchase transactions.

7. **ONEOFF_PURCHASES_FREQUENCY** is highly correlated with:
   - **PURCHASES**, **ONEOFF_PURCHASES**, **PURCHASES_TRX**: Indicating a correlation with total purchases, one-off purchase amounts, and the number of purchase transactions.

8. **PURCHASES_INSTALLMENTS_FREQUENCY** is highly correlated with:
   - **INSTALLMENTS_PURCHASES**, **PURCHASES_FREQUENCY**: Suggesting a link with purchase frequency and installment purchases.
   - **PURCHASES_TRX**: Showing a correlation with the number of purchase transactions.

9. **CASH_ADVANCE_FREQUENCY** is highly correlated with:
   - **CASH_ADVANCE**, **CASH_ADVANCE_TRX**: Indicating a relationship with the frequency and number of cash advances.

10. **CASH_ADVANCE_TRX** is highly correlated with:
    - **CASH_ADVANCE**, **CASH_ADVANCE_FREQUENCY**: Showing a correlation with cash advances and their frequency.

11. **PURCHASES_TRX** is highly correlated with:
    - **PURCHASES**, **ONEOFF_PURCHASES**, **INSTALLMENTS_PURCHASES**, **PURCHASES_FREQUENCY**, **ONEOFF_PURCHASES_FREQUENCY**, **PURCHASES_INSTALLMENTS_FREQUENCY**: Indicating a strong relationship with various transactional aspects.

12. **MINIMUM_PAYMENTS** is highly correlated with:
    - **BALANCE**: Showing a relationship between account balance and minimum payments.

13. **CREDIT_UTILIZATION** is highly correlated with:
    - **PURCHASES**, **ONEOFF_PURCHASES**, **INSTALLMENTS_PURCHASES**, **PURCHASES_FREQUENCY**, **PURCHASES_TRX**: Indicating a correlation with various aspects of credit card usage.

These correlations provide valuable insights into the relationships among different features, which can be used for further analysis and segmentation in the credit card domain.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
df2.columns

In [None]:
# Earlier candidate feature set, kept for reference; it was overwritten by the list below
# selected_features = [
#     'BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES',
#     'CASH_ADVANCE', 'CREDIT_LIMIT', 'PAYMENTS', 'PRC_FULL_PAYMENT',
#     'TENURE', 'CREDIT_UTILIZATION'
# ]

# Feature set actually used for clustering
selected_features = ['BALANCE', 'CASH_ADVANCE', 'MINIMUM_PAYMENTS', 
                     'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES',
                     'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 
                     'PURCHASES_TRX']

In [None]:
selected_data = df2[selected_features]

In [None]:
selected_data

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(selected_data)

### Data Modeling

In [None]:
from sklearn.cluster import KMeans

In [None]:
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)

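Inertia: Inertia, also known as within-cluster sum of squares, is the sum of squared distances between each point and the centroid of its assigned cluster. It quantifies the compactness of the clusters. For a fixed number of clusters, lower inertia means tighter clusters. Because inertia always decreases as more clusters are added, values for different k are compared by looking for the "elbow" where the improvement levels off, not by picking the lowest value.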

KMeans Clustering: KMeans is a popular clustering algorithm that partitions a dataset into 'k' distinct, non-overlapping subgroups or clusters. It aims to minimize the within-cluster variance (inertia) by iteratively reassigning data points to clusters and updating the cluster centroids.
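To make the two definitions above concrete, here is a minimal NumPy sketch of the iterative reassign-and-update loop, together with the inertia it minimizes. It is a simplified illustration on synthetic data, assuming random initialization from the data points; scikit-learn's `KMeans` uses a more careful initialization and is what the notebook actually runs. The names `lloyd_kmeans` and `toy` are hypothetical.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations: assign to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute assignments against the final centroids before measuring inertia
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    inertia = sq_dist[np.arange(len(points)), labels].sum()
    return labels, centroids, inertia

rng = np.random.default_rng(1)
# Two synthetic blobs (illustrative only, not the credit card data)
toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(4, 0.5, size=(50, 2))])

_, _, inertia_k1 = lloyd_kmeans(toy, k=1)
_, _, inertia_k2 = lloyd_kmeans(toy, k=2)
print(f"Inertia with k=1: {inertia_k1:.1f}, with k=2: {inertia_k2:.1f}")
```

With a single cluster the inertia is simply the total sum of squared deviations from the overall mean, which is why the elbow curve starts at its highest value for k = 1.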

In [None]:
# Plot the Elbow Method
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal Cluster Selection')
plt.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1, 11)), y=inertia, mode='lines+markers'))
fig.update_layout(title='Elbow Method for Optimal Cluster Selection',
                  xaxis=dict(title='Number of Clusters'),
                  yaxis=dict(title='Inertia'))
fig.show()

In [None]:
num_clusters = 2

In [None]:
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
clusters = kmeans.fit_predict(scaled_data)

In [None]:
new_dfa = pd.DataFrame(data = scaled_data, columns = selected_features)
new_dfa['label_kmeans'] = clusters
new_dfa

In [None]:
new_dfa['label_kmeans'].unique()

In [None]:
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print('Centroids:', centroids)
print('Labels:', labels)

In [None]:
fig = px.scatter(data_frame=new_dfa, x='BALANCE', y='PURCHASES', color='label_kmeans',
                 title='Customer Segmentation based on Balance and Purchases',
                 labels={'BALANCE': 'Balance', 'PURCHASES': 'Purchases'},
                 color_continuous_scale='Viridis')

fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=new_dfa['BALANCE'], y=new_dfa['PURCHASES'], mode='markers', 
                         marker=dict(color=new_dfa['label_kmeans'], colorscale='Viridis', opacity=0.6),
                         text='Cluster ' + new_dfa['label_kmeans'].astype(str),
                         hoverinfo='text'))
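# Look up centroid coordinates by feature name; PURCHASES is not the second column of selected_features
fig.add_trace(go.Scatter(x=centroids[:, selected_features.index('BALANCE')], y=centroids[:, selected_features.index('PURCHASES')], 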
                         mode='markers', marker=dict(color='red', size=10, symbol='cross'),
                         name='Cluster Centroids'))
fig.update_layout(title='Customer Segmentation based on Balance and Purchases with Cluster Centroids',
                  xaxis=dict(title='Balance'),
                  yaxis=dict(title='Purchases'))
fig.show()

Hierarchical clustering is a widely used method in data analysis and machine learning for grouping similar data points together. In its agglomerative form, described here, it is a bottom-up approach: data points are initially treated as individual clusters and are successively merged based on their similarity. This merging process continues until all data points belong to a single, large cluster or until a predefined number of clusters is reached.

Here's a step-by-step explanation of how hierarchical clustering works:

1. **Initialization**: Start by considering each data point as an individual cluster. So, if you have N data points, you initially have N clusters.

2. **Calculate Pairwise Distances**: Compute the pairwise distances or dissimilarities between all pairs of data points. The choice of distance metric (e.g., Euclidean distance, Manhattan distance, etc.) depends on the nature of the data and the problem.

3. **Merge Closest Clusters**: Identify the two closest clusters based on the computed distances and merge them into a single cluster. In the agglomerative (bottom-up) approach, this continues until only one cluster is left or the desired number of clusters is reached. The divisive variant works top-down instead, starting from one cluster and splitting it recursively.

4. **Linkage Criterion**: Choose a linkage criterion to determine how the distance between clusters is calculated during the merging process. Common linkage methods include:
   - **Single Linkage**: Merge clusters based on the minimum distance between any two data points in the clusters.
   - **Complete Linkage**: Merge clusters based on the maximum distance between any two data points in the clusters.
   - **Average Linkage**: Merge clusters based on the average distance between all pairs of data points in the clusters.
   - **Ward's Linkage**: Minimize the increase in the total within-cluster variance when merging clusters.

5. **Dendrogram**: As the clusters are merged, a dendrogram is typically created. A dendrogram is a tree-like structure that visually represents the merging process and allows you to see how the clusters are formed at different levels.

6. **Choosing the Number of Clusters**: You can cut the dendrogram at a certain height or level to determine the number of clusters you want. This allows you to control the granularity of the clustering.

Hierarchical clustering has several advantages, including its ability to reveal the hierarchical structure of the data and its flexibility in choosing the number of clusters. However, it can be computationally expensive for large datasets.

In summary, hierarchical clustering is a versatile technique used for grouping data points into clusters based on their similarity or dissimilarity, and it provides valuable insights into the structure of the data.
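Step 6 above (cutting the dendrogram) can be sketched with SciPy's `fcluster`, which turns a linkage tree into flat cluster labels. This is a minimal example on synthetic data; the cut height of 3.0 is an assumption that suits these toy blobs only and would need to be read off the actual dendrogram for the credit card data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (illustrative only)
toy = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                 rng.normal(5, 0.3, size=(20, 2))])

# Build the merge tree; linkage uses Euclidean distance on raw observations by default
merge_tree = sch.linkage(toy, method='complete')

# Option 1: request a fixed number of flat clusters
labels_by_count = sch.fcluster(merge_tree, t=2, criterion='maxclust')

# Option 2: cut the tree at a chosen height (hypothetical threshold for this toy data)
labels_by_height = sch.fcluster(merge_tree, t=3.0, criterion='distance')

print("Clusters from maxclust:", len(set(labels_by_count)))
print("Clusters from height cut:", len(set(labels_by_height)))
```

Cutting lower in the tree yields more, smaller clusters; cutting higher yields fewer, larger ones.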

In [None]:
import scipy.cluster.hierarchy as sch

In [None]:
distance_matrix = sch.distance.pdist(scaled_data)

In [None]:
linkage = sch.linkage(distance_matrix, method='complete')

In [None]:
plt.figure(figsize=(15, 8))
plt.title('Dendrogram')
dendrogram = sch.dendrogram(linkage)
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm widely used in machine learning and data analysis. Unlike k-means or hierarchical clustering, DBSCAN does not require the number of clusters to be specified in advance and can discover arbitrarily shaped clusters. It's particularly effective in scenarios where the clusters have irregular shapes, although a single ε value can struggle when clusters differ widely in density.

Here's a step-by-step explanation of how DBSCAN works:

1. **Initialization**: Start with an arbitrary data point that has not been visited.

2. **Density-Based Neighborhood Search**: Find all points within distance ε (epsilon) of this point. These points form its ε-neighborhood.

3. **Core Point Check**: If the ε-neighborhood contains at least MinPts points (counting the point itself), the point is a core point and starts a new cluster. Otherwise it is provisionally marked as noise.

4. **Expand the Cluster**: Add every point in the neighborhood to the cluster, and repeat the neighborhood search for each newly added core point, so the cluster grows through all density-connected points.

5. **Form Additional Clusters**: Move on to the next unvisited point and repeat. A point provisionally marked as noise that later falls inside a core point's ε-neighborhood becomes a border point of that cluster; points that are never reached remain noise.

DBSCAN classifies each point in the dataset into one of the following:

- **Core Points**: These are the points that have at least MinPts within their ε-neighborhood. They start a new cluster or expand an existing one.
  
- **Border Points**: These have fewer than MinPts within the ε-neighborhood but are in the neighborhood of a core point. They belong to the cluster of that core point.

- **Noise or Outliers**: These are points that are neither core points nor in the ε-neighborhood of a core point. They do not belong to any cluster.

Key features and advantages of DBSCAN:

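- **Flexibility in Cluster Shape**: DBSCAN can find arbitrarily shaped clusters and is largely insensitive to the order of the data; only border points reachable from more than one cluster can change assignment with the processing order.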

- **Robust to Noise**: Noise in the form of outliers does not affect the clustering process significantly.

- **Automatic Cluster Number**: It does not require specifying the number of clusters beforehand, unlike k-means.

- **Efficient**: It's relatively efficient and can handle large datasets.

However, setting the distance parameter ε and the minimum number of points MinPts appropriately can be a challenge, and the results can be sensitive to these parameters. Overall, DBSCAN is a powerful tool for clustering data based on density and is widely used in various applications such as anomaly detection, spatial analysis, and more.
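The core, border, and noise definitions above can be checked on a handful of points. The sketch below is a simplified NumPy illustration of the point classification only, not of the full cluster expansion, and it assumes the point itself counts toward MinPts. The function name `classify_points` and the toy coordinates are hypothetical.

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' for the given eps and min_pts."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_neighborhood = dist <= eps  # each point counts itself
    is_core = in_neighborhood.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-neighborhood of at least one core point
    near_core = (in_neighborhood & is_core[None, :]).any(axis=1)
    return np.where(is_core, 'core', np.where(near_core, 'border', 'noise'))

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
                [0.45, 0.0],                                     # edge of the group
                [5.0, 5.0]])                                     # isolated point
kinds = classify_points(toy, eps=0.4, min_pts=4)
print(kinds.tolist())
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

In scikit-learn, the fitted `DBSCAN` object exposes the indices of core points as `core_sample_indices_`, and noise points receive the label `-1`.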

In [None]:
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(scaled_data)

The cell above runs DBSCAN clustering using scikit-learn's `DBSCAN` class with an epsilon (eps) value of 0.5 and a minimum number of samples (min_samples) of 5. Points that do not belong to any cluster are labeled `-1`.

In [None]:
dbscan_labels

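In [None]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered slice
df2 = df2.assign(dbscan=dbscan_labels)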

Mean Shift is a clustering algorithm used in unsupervised machine learning to identify clusters or groupings in a dataset based on its underlying probability density function. Unlike K-means, Mean Shift doesn't require the number of clusters to be predefined.

Here's a step-by-step explanation of the Mean Shift algorithm:

1. **Initialization**: Initialize a set of data points as the starting points for the algorithm.

2. **Kernel Estimation**: For each data point in the dataset, estimate its probability density function using a kernel function (e.g., Gaussian kernel). The kernel function assigns weights to the neighboring points, with closer points receiving higher weights.

3. **Mean Shift**: For each data point, calculate the mean shift vector, which represents the direction and magnitude to shift the data point for maximizing the local density of points. This is done by computing the weighted mean of the data points based on the kernel.

4. **Update Data Points**: Shift each data point in the direction of the mean shift vector.

5. **Convergence**: Repeat steps 2-4 until the data points converge to stable positions. Convergence occurs when the mean shift vector becomes very small.

6. **Cluster Assignment**: Group data points that converge to the same stable position into a cluster.

Mean Shift tends to find a variable number of clusters based on the density of the data. It's particularly useful for applications where the number of clusters is not known a priori, and it can identify arbitrarily shaped clusters.

The key parameters in Mean Shift are the bandwidth or radius of the kernel (which influences the size of clusters) and the choice of the kernel function. Adjusting the bandwidth can have a significant impact on the resulting clusters.
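The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, assuming a Gaussian kernel, a hand-picked bandwidth of 1.0, and a merge radius of half the bandwidth for grouping converged points; scikit-learn's `MeanShift` uses a flat kernel and estimates the bandwidth itself when none is given. The function names and toy data are hypothetical.

```python
import numpy as np

def mean_shift_sketch(points, bandwidth, n_iter=100, tol=1e-5):
    """Move a copy of every point to the kernel-weighted mean of the data until it stops moving."""
    modes = points.copy()
    for _ in range(n_iter):
        sq_dist = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
        weights = np.exp(-sq_dist / (2 * bandwidth ** 2))  # closer points weigh more
        shifted = weights @ points / weights.sum(axis=1, keepdims=True)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    return modes

def group_modes(modes, merge_radius):
    """Give the same label to points whose converged positions lie within merge_radius."""
    labels = np.full(len(modes), -1)
    centers = []
    for idx, mode in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(mode - center) < merge_radius:
                labels[idx] = c
                break
        else:
            # No existing center is close enough, so this position starts a new cluster
            centers.append(mode)
            labels[idx] = len(centers) - 1
    return labels, np.array(centers)

rng = np.random.default_rng(0)
# Two synthetic blobs (illustrative only)
toy = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                 rng.normal(5, 0.3, size=(20, 2))])

modes = mean_shift_sketch(toy, bandwidth=1.0)
toy_labels, toy_centers = group_modes(modes, merge_radius=0.5)
print("Clusters found:", len(toy_centers))
```

The number of clusters is never passed in; it emerges from how many distinct positions the points converge to, which in turn depends on the bandwidth.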

In [None]:
from sklearn.cluster import MeanShift
meanshift = MeanShift()
meanshift_labels = meanshift.fit_predict(scaled_data)

In [None]:
meanshift_labels

Gaussian Mixture Model (GMM) is a probabilistic model used for clustering and density estimation. It represents a mixture of several Gaussian distributions, each characterized by its mean and covariance matrix.

Here's a step-by-step explanation of how Gaussian Mixture Model works:

1. **Initialization**: Start by initializing the parameters of the model. This includes selecting the number of Gaussian components (clusters) and their initial mean, covariance, and weights.

2. **Expectation-Maximization (EM) Algorithm**:
   - **Expectation (E-step)**: Calculate the probability that each data point belongs to each Gaussian component. This step computes the posterior probability using Bayes' rule.
   - **Maximization (M-step)**: Update the parameters (mean, covariance, and weight) of each Gaussian component to maximize the likelihood of the observed data based on the probabilities calculated in the E-step.

3. **Convergence Check**: Check for convergence by evaluating the change in log-likelihood or the change in parameters. If the change is below a certain threshold, the algorithm has converged.

4. **Repeat Steps 2 and 3**: Continue iterating between the E-step and M-step until the algorithm converges.

5. **Cluster Assignment**: After convergence, each data point is assigned to the Gaussian component with the highest probability.

Gaussian Mixture Model is very flexible and can model a wide range of data distributions. It's capable of fitting complex data patterns and can identify clusters with different shapes and sizes. Additionally, GMM provides a probabilistic framework, meaning it assigns probabilities to each point's membership in each cluster rather than a hard assignment.

The number of Gaussian components (clusters) in GMM needs to be specified a priori or determined using model selection techniques like the Bayesian Information Criterion (BIC) or cross-validation.
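
A possible way to apply BIC-based selection is sketched below. The data is synthetic (`X_demo` is an assumption), and the candidate range of 1 to 6 components is arbitrary; lower BIC is better.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for scaled_data (assumption: three separated blobs).
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-6, 0], [0, 6], [6, 0]],
    cluster_std=0.8,
    random_state=1,
)

candidate_components = range(1, 7)
bic_scores = []
for k in candidate_components:
    model = GaussianMixture(n_components=k, n_init=3, random_state=1).fit(X_demo)
    bic_scores.append(model.bic(X_demo))

# Pick the candidate with the lowest BIC.
best_k = candidate_components[int(np.argmin(bic_scores))]
print("BIC by number of components:", dict(zip(candidate_components, np.round(bic_scores, 1))))
print("selected number of components:", best_k)
```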

In [None]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=42)  # specify the number of components (clusters); fixed seed for reproducible labels
gmm_labels = gmm.fit_predict(scaled_data)

In [None]:
gmm_labels

Agglomerative Clustering is a hierarchical clustering technique that builds a dendrogram (tree-like diagram) to illustrate the arrangement of the clusters. It's a "bottom-up" clustering method, where each data point starts as its own cluster and pairs of clusters are merged based on a specific criterion until a single cluster containing all data points is formed.

Here's a step-by-step explanation of how Agglomerative Clustering works:

1. **Initialization**: Start by considering each data point as an individual cluster. The number of initial clusters is equal to the number of data points.

2. **Distance Calculation**: Compute the pairwise distance (e.g., Euclidean distance) between all clusters.

3. **Merge Clusters**: Identify the two clusters that are closest to each other based on the chosen distance metric. Merge these clusters into a single cluster.

4. **Update Distance Matrix**: Recalculate the distances between the new cluster and all other remaining clusters.

5. **Repeat Steps 3 and 4**: Repeat the process of merging the closest clusters and updating the distance matrix until only a single cluster remains.

6. **Dendrogram Construction**: Construct a dendrogram to illustrate the clustering process, showing the merging of clusters at different distances.

7. **Choosing the Number of Clusters**: Determine the number of clusters by selecting a distance threshold or cutting the dendrogram at a certain height, which corresponds to the desired number of clusters.
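
The merge history and the dendrogram cut described above can be reproduced with SciPy's hierarchy module. This is a sketch on synthetic points (`X_demo` and the cut height of 20.0 are assumptions chosen for this toy data); `no_plot=True` returns the dendrogram layout without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Three tight groups of 20 points each (assumption for illustration).
centers = np.array([[0, 0], [10, 0], [0, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Steps 1-5: every row of Z records one merge
# (cluster a, cluster b, merge distance, size of the new cluster).
Z = linkage(X_demo, method="ward")

# Step 6: compute the dendrogram layout (set no_plot=False to draw it).
tree = dendrogram(Z, no_plot=True)

# Step 7: cut the tree, either by cluster count or by height.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")
labels_by_height = fcluster(Z, t=20.0, criterion="distance")

print("merges recorded:", Z.shape[0])
print("clusters (by count):", len(np.unique(labels_by_count)))
print("clusters (by height):", len(np.unique(labels_by_height)))
```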

Agglomerative Clustering offers flexibility in determining the number of clusters after observing the dendrogram. It also allows for different linkage criteria, such as Ward's linkage, complete linkage, average linkage, and single linkage, which affect how the distance between clusters is calculated during the merging process.
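
To see how the linkage criterion changes the result, the cluster sizes produced by each option can be compared. The sketch below uses synthetic blobs (`X_demo` is an assumption) and also shows `distance_threshold` as an alternative to fixing `n_clusters`; the threshold of 15.0 is arbitrary and suits only this toy data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for scaled_data (assumption: three separated blobs).
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 0], [0, 8]],
    cluster_std=0.7,
    random_state=3,
)

# Cluster sizes under each linkage criterion.
linkage_sizes = {}
for linkage_name in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage_name)
    labels = model.fit_predict(X_demo)
    linkage_sizes[linkage_name] = np.bincount(labels).tolist()
    print(f"{linkage_name:>8}: {linkage_sizes[linkage_name]}")

# Let a merge-distance threshold decide the number of clusters instead.
threshold_model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=15.0, linkage="ward"
).fit(X_demo)
n_found = threshold_model.n_clusters_
print("clusters found with distance_threshold:", n_found)
```

On well-separated data the criteria tend to agree; on noisy data single linkage in particular can chain points together into one very large cluster.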

One drawback is its cost: the naive algorithm takes \(O(n^3)\) time and needs \(O(n^2)\) memory for the distance matrix, which makes it impractical for large datasets. Priority-queue implementations reduce the time to \(O(n^2 \log n)\), and specialized algorithms for single and complete linkage reach \(O(n^2)\).

In [None]:
from sklearn.cluster import AgglomerativeClustering
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg_clustering.fit_predict(scaled_data)

In [None]:
agg_labels

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm that extends the concepts of DBSCAN (Density-Based Spatial Clustering of Applications with Noise). It aims to overcome some limitations of DBSCAN, particularly in handling clusters of varying densities and identifying noise in the data.

Here's a breakdown of how OPTICS works:

1. **Core Distance**: For each point, compute its core distance, which is the distance to its \( \text{minPts} \)th nearest neighbor (where \( \text{minPts} \) is a parameter specified by the user). If fewer than \( \text{minPts} \) points lie within the maximum radius \( \epsilon \), the core distance is undefined and the point is not a core point.

2. **Reachability Distance**: The reachability distance of point A with respect to a core point B is the maximum of the core distance of B and the distance between A and B. Unlike DBSCAN, which applies a single fixed \( \epsilon \), this value adapts to the local density around B.

3. **Ordering Points**: Starting from an arbitrary unprocessed point, repeatedly visit the unprocessed point with the smallest reachability distance from the points processed so far (a priority queue keeps track of the candidates). In the resulting ordering, points from the same dense region end up next to each other.

4. **Building Reachability Plot**: Plot the reachability distance of each point in the order produced in step 3. Dense regions appear as valleys, and the peaks between them mark the gaps that separate clusters.

5. **Clustering**: Clusters correspond to the valleys of the reachability plot. Because valleys can have different depths, clusters of different densities can be identified from the same ordering.

6. **Extracting Clusters**: Extract flat clusters either by thresholding the plot (consecutive points whose reachability distance stays below a chosen threshold form one cluster, which gives a DBSCAN-like result) or by detecting the steep slopes at the edges of each valley (the \( \xi \) method). Points that fall in no valley are treated as noise.
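
The two distance definitions that OPTICS builds on can be written directly in NumPy. This is a small sketch on random points; `points`, `min_pts`, and the helper `reachability` are illustrative names, and counting a point as its own first neighbor is an assumption about the convention used.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))  # toy data (assumption)
min_pts = 5

# Pairwise Euclidean distances, shape (n_points, n_points).
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Core distance: distance to the min_pts-th nearest neighbor.
# Each sorted row starts with the point itself at distance 0, so the
# point counts as its own first neighbor here (assumed convention).
core_dist = np.sort(dists, axis=1)[:, min_pts - 1]


def reachability(p, o):
    """Reachability distance of point p with respect to point o."""
    return max(core_dist[o], dists[o, p])


print("core distance of point 0:", round(float(core_dist[0]), 3))
print("reachability of point 1 w.r.t. point 0:", round(float(reachability(1, 0)), 3))
```

Note that the reachability distance is not symmetric: `reachability(1, 0)` and `reachability(0, 1)` differ whenever the two core distances differ and exceed the direct distance.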

OPTICS provides a more flexible approach to clustering by allowing the detection of clusters with different densities and handling noise effectively. It also offers the advantage of not requiring the user to specify the number of clusters in advance, making it suitable for a variety of datasets. However, it may be computationally expensive for large datasets, particularly due to the need to compute the reachability distances for all points.
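
In scikit-learn, the ordering and reachability values are available on the fitted estimator, and a DBSCAN-like flat clustering can be extracted afterwards for any threshold without refitting. The sketch below uses two synthetic blobs of different density (`X_demo` and the threshold `eps=2.0` are assumptions for this toy data).

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Two blobs with different densities (assumption for illustration).
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10]],
    cluster_std=[0.3, 1.0],
    random_state=0,
)

optics_demo = OPTICS(min_samples=10).fit(X_demo)

# Reachability distances in processing order: the data behind the
# reachability plot. The first processed point has infinite reachability.
reachability_ordered = optics_demo.reachability_[optics_demo.ordering_]

# Threshold the reachability values to get a flat clustering.
labels_eps = cluster_optics_dbscan(
    reachability=optics_demo.reachability_,
    core_distances=optics_demo.core_distances_,
    ordering=optics_demo.ordering_,
    eps=2.0,
)
n_clusters_eps = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
print("clusters at eps=2.0:", n_clusters_eps)
print("noise points:", int(np.sum(labels_eps == -1)))
```

Plotting `reachability_ordered` with `plt.plot` gives the reachability plot, where each valley corresponds to one dense region.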

In [None]:
from sklearn.cluster import OPTICS
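# eps only takes effect with cluster_method='dbscan'; the default 'xi' method ignores it
optics = OPTICS(min_samples=5, cluster_method='dbscan', eps=0.5)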
optics_labels = optics.fit_predict(scaled_data)

In [None]:
optics_labels

### Cluster Profiling

Cluster profiling means analyzing and summarizing the characteristics of the clusters identified in a dataset. It is the step that turns raw cluster labels into an understanding of what distinguishes each group.

Here are the main steps involved in cluster profiling:

1. **Cluster Description**: Summarize the main features and properties of each cluster. This includes statistical measures like mean, median, standard deviation, etc., for numerical features, and mode for categorical features.

2. **Visualization**: Create visual representations of the clusters to better understand their characteristics. Common visualizations include scatter plots for 2D or 3D feature spaces, box plots, parallel coordinate plots, and cluster center visualizations.

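3. **Feature Importance**: Determine which features contribute most to separating the clusters. Most clustering algorithms, K-means included, do not report feature importance scores directly, so this is usually done by comparing each cluster's feature means against the overall mean, by training a classifier to predict the cluster labels and inspecting its feature importances, or by applying domain knowledge.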

4. **Comparison between Clusters**: Compare the characteristics of different clusters to identify patterns or anomalies. This helps in understanding the distinctive traits of each cluster.

5. **Segment Description**: Provide a clear and interpretable description of each segment or cluster. This often involves summarizing the common traits, behaviors, or characteristics of the data points within a cluster.

6. **Business Insights**: Translate the technical understanding of the clusters into actionable business insights. Understanding the characteristics of each cluster helps in tailoring strategies, marketing campaigns, or product offerings to specific customer segments.
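
Steps 1, 3, and 4 can be approximated with a few pandas operations. The sketch below runs on a hypothetical toy frame: the column names `BALANCE` and `PURCHASES`, the values, and the `cluster` column are assumptions for illustration, not results from the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical toy data with an already assigned cluster label.
toy = pd.DataFrame({
    "BALANCE": np.concatenate([rng.normal(1000, 100, 50), rng.normal(5000, 300, 50)]),
    "PURCHASES": np.concatenate([rng.normal(200, 50, 50), rng.normal(1500, 200, 50)]),
    "cluster": [0] * 50 + [1] * 50,
})

# Step 1: per-cluster summary statistics and cluster sizes.
profile = toy.groupby("cluster").agg(["mean", "median", "std"])
cluster_sizes = toy["cluster"].value_counts().sort_index()

# Steps 3-4: how far each cluster's mean sits from the overall mean,
# expressed as a fraction of the overall mean.
feature_means = toy.groupby("cluster").mean()
overall_mean = toy.drop(columns="cluster").mean()
relative_diff = (feature_means - overall_mean) / overall_mean

print(profile.round(1))
print(relative_diff.round(2))
```

Features with the largest absolute values in `relative_diff` are the ones that most distinguish a cluster from the customer base as a whole.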

Cluster profiling is crucial in various domains such as customer segmentation, marketing, healthcare, finance, and more. It helps in making informed decisions based on the knowledge extracted from the clusters, ultimately driving better business outcomes.

In [None]:
cluster_centers_original = scaler.inverse_transform(kmeans.cluster_centers_)

In [None]:
cluster_centers_df = pd.DataFrame(cluster_centers_original, columns=selected_features)

In [None]:
cluster_centers_df

In [None]:
plt.figure(figsize=(12, 6))
for i in range(num_clusters):
    plt.bar(selected_features, cluster_centers_df.iloc[i], alpha=0.5, label=f'Cluster {i}')
plt.xlabel('Features')
plt.ylabel('Mean Value')
plt.title('Mean Values of Features by Cluster')
plt.xticks(rotation = 90)
plt.legend()
plt.show()

In [None]:
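# Reshape to long form so axis titles and the legend are labelled explicitly
data = (cluster_centers_df
        .rename_axis('Cluster')
        .reset_index()
        .melt(id_vars='Cluster', var_name='Feature', value_name='Mean Value'))
data['Cluster'] = data['Cluster'].astype(str)  # treat cluster ids as categories
fig = px.bar(data,
             x='Feature',
             y='Mean Value',
             color='Cluster',
             barmode='group',
             title='Mean Values of Features by Cluster',
             category_orders={'Feature': list(selected_features)},
             width=800, height=400)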
fig.update_xaxes(tickangle=45)
fig.show()

In [None]:
for i in range(num_clusters):
    print(f"\nCluster {i} Characteristics:")
    for feature, centroid_value in zip(selected_features, cluster_centers_df.iloc[i]):
        print(f"- {feature}: {centroid_value:.2f}")

In [None]:
segment_descriptions = [
    "Segment 0: These customers have a moderate balance, make regular purchases, and use credit moderately.",
    "Segment 1: These customers have a high balance, make frequent purchases, and utilize credit extensively.",
]

In [None]:
print("\nSegment Descriptions:")
for description in segment_descriptions:
    print(description)

In [None]:
segment_recommendations = {
    0: "Recommendation for Segment 0: Encourage customers to make more regular purchases to maximize their credit benefits. Offer personalized product suggestions based on their purchase history.",
    1: "Recommendation for Segment 1: Leverage targeted marketing campaigns to promote high-end products or credit limit upgrades, considering their high credit utilization and purchase frequency.",
}

In [None]:
print("Tailored Recommendations:")
for segment, recommendation in segment_recommendations.items():
    print(f"\nSegment {segment} Recommendations:")
    print(recommendation)

In [None]:
import pickle

kmeans_model_file = "kmeans_model.pkl"

with open(kmeans_model_file, "wb") as file:
    pickle.dump(kmeans, file)

print("K-means model saved successfully.")
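
For deployment, new customer data has to be scaled the same way as the training data before prediction, so it may be convenient to persist the scaler together with the model. The sketch below is self-contained and uses stand-in objects: `demo_scaler`, `demo_kmeans`, the bundle file name, and the use of `StandardScaler` are assumptions, not the notebook's fitted `scaler` and `kmeans`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-ins for the notebook's fitted scaler and K-means model (assumption).
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(8, 1, (50, 3))])
demo_scaler = StandardScaler().fit(X_raw)
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    demo_scaler.transform(X_raw)
)

# Save both objects in one file (hypothetical file name, temp directory).
bundle_path = os.path.join(tempfile.mkdtemp(), "segmentation_bundle.pkl")
with open(bundle_path, "wb") as f:
    pickle.dump({"scaler": demo_scaler, "model": demo_kmeans}, f)

# Later, e.g. inside the web app: load, scale, and predict.
with open(bundle_path, "rb") as f:
    bundle = pickle.load(f)

new_customers = np.array([[0.1, -0.2, 0.3], [8.2, 7.9, 8.1]])
segments = bundle["model"].predict(bundle["scaler"].transform(new_customers))
print("predicted segments:", segments)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.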

In [None]:
import streamlit as st

In [None]:
streamlit_version = st.__version__
print("Streamlit version:", streamlit_version)

pandas_version = pd.__version__
print("Pandas version:", pandas_version)

In [None]:
import sklearn
sklearn_version = sklearn.__version__
print("scikit-learn version:", sklearn_version)
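
The printed versions can also be collected into pinned requirement lines for the deployment environment. This is a sketch using the standard library's `importlib.metadata`; the helper name `pinned_requirements` and the package list are illustrative assumptions.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for installed packages.

    Packages that are not installed are reported as comment lines
    instead of raising an error.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name} is not installed in this environment")
    return lines


requirement_lines = pinned_requirements(["streamlit", "pandas", "scikit-learn", "numpy"])
print("\n".join(requirement_lines))
```

Writing these lines to a `requirements.txt` file keeps the deployed app on the same library versions that were used to train and pickle the model.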