# Bank-Additional DataSet


## Selecting the Business Problem

### Identify Business Goals
The primary objective of the Portuguese banking institution is to increase the subscription rate of term deposits through their direct marketing campaigns. By enhancing the effectiveness of these campaigns, the bank aims to:

- Boost overall revenue.
- Expand its customer base.
- Strengthen market position.
- Improve customer relationships and satisfaction.

### Align with Data Availability
The **Bank Marketing Dataset** provides comprehensive data that aligns with this business goal:

- **Customer Demographics**: `age`, `job`, `marital`, `education`.
- **Financial Information**: `default` (credit in default), `housing` (housing loan), `loan` (personal loan).
- **Marketing Interaction Details**: `contact` (communication type), `month` (last contact month), `day_of_week` (last contact day of the week), `duration` (last contact duration).
- **Previous Campaign Outcomes**: `campaign` (number of contacts during this campaign), `pdays` (days since the client was last contacted in a previous campaign), `previous` (number of contacts before this campaign), `poutcome` (outcome of the previous campaign).
- **Social and Economic Context**: `emp.var.rate`, `cons.price.idx`, `cons.conf.idx`, `euribor3m`, `nr.employed`.

This rich dataset enables the bank to analyze customer behaviors and preferences, which is essential for tailoring marketing strategies effectively.

### Specific Problem Statement
How can we segment our customers based on their demographic characteristics, financial status, and previous interactions to tailor our marketing campaigns effectively and increase the subscription rate of term deposits?

---

### Importance of Segmentation
Customer segmentation is crucial for businesses due to the following reasons:

1. **Personalized Marketing**: Enables the creation of targeted marketing messages that resonate with specific groups, increasing engagement and conversion rates.
2. **Improved Customer Service**: Allows businesses to address the unique needs and concerns of different customer segments, enhancing satisfaction and loyalty.
3. **Optimized Resource Allocation**: Helps in allocating marketing budgets and resources more efficiently by focusing on high-potential segments.
4. **Product Development**: Insights from segmentation can guide the development of new products or services tailored to the needs of specific groups.
5. **Competitive Advantage**: Understanding customer segments better than competitors can lead to more effective strategies and increased market share.

---

### Specific Benefits for the Bank Marketing Dataset
Applying customer segmentation to the Bank Marketing Dataset offers several benefits:

1. **Enhanced Campaign Effectiveness**: By identifying distinct customer segments, the bank can tailor its communication strategies to address the specific needs and preferences of each group, leading to higher subscription rates for term deposits.
2. **Increased Conversion Rates**: Targeting customers who are more likely to subscribe based on their profile and past behavior can improve conversion rates and reduce the cost per acquisition.
3. **Customer Retention**: Understanding different customer segments helps in developing retention strategies for valuable customers, such as offering personalized financial advice or exclusive products.
4. **Resource Optimization**: Focusing efforts on segments with the highest potential return allows for better use of marketing resources and maximizes ROI.
5. **Strategic Planning**: Insights from segmentation can inform broader business strategies, such as identifying new market opportunities or adjusting product offerings to meet the needs of underserved segments.

By leveraging clustering techniques on this dataset, the bank can gain valuable insights into customer behaviors and preferences, ultimately leading to more successful marketing campaigns and business growth.














   **INPUT VARIABLES:**

   **Bank client data:**

   - <ins>age</ins> (numeric)
   - <ins>job</ins> : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
   - <ins>marital</ins> : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
   - <ins>education</ins> (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
   - <ins>default</ins>: has credit in default? (categorical: "no","yes","unknown")
   - <ins>housing</ins>: has housing loan? (categorical: "no","yes","unknown")
   - <ins>loan</ins>: has personal loan? (categorical: "no","yes","unknown")

   **Related with the last contact of the current campaign:**

   - <ins>contact</ins>: contact communication type (categorical: "cellular","telephone") 
   - <ins>month</ins>: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
   - <ins>day_of_week</ins>: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
   - <ins>duration</ins>: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

   **Other attributes:**

   - <ins>campaign</ins>: number of contacts performed during this campaign and for this client (numeric, includes last contact)
   - <ins>pdays</ins>: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
   - <ins>previous</ins>: number of contacts performed before this campaign and for this client (numeric)
   - <ins>poutcome</ins>: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")

   **Social and economic context attributes:**

   - <ins>emp.var.rate</ins>: employment variation rate - quarterly indicator (numeric)
   - <ins>cons.price.idx</ins>: consumer price index - monthly indicator (numeric)     
   - <ins>cons.conf.idx</ins>: consumer confidence index - monthly indicator (numeric)     
   - <ins>euribor3m</ins>: euribor 3 month rate - daily indicator (numeric)
   - <ins>nr.employed</ins>: number of employees - quarterly indicator (numeric)

  **OUTPUT VARIABLE (DESIRED TARGET):**

   - <ins>y</ins> - has the client subscribed a term deposit? (binary: "yes","no")
   

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the dataset
ds = pd.read_csv('bank-additional-full.csv', sep=';')
display(ds)

### Exploratory Data Analysis (EDA)

#### Summary statistics

In [None]:

# Check the structure of the dataset
print("\nDataset Information:")
ds.info()


# Get summary statistics for numerical features
print("\nSummary Statistics for Numerical Features:")
display(ds.describe())

# Get summary statistics for categorical features
print("\nSummary Statistics for Categorical Features:")
display(ds.describe(include='object'))



#### Feature Distributions

In [None]:


# Feature Distributions
# List of numerical and categorical features
numerical_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 
                      'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 
                        'contact', 'month', 'day_of_week', 'poutcome']

# Histograms for numerical features with multiple plots per row
plots_per_row = 3  # Number of plots per row
num_features = len(numerical_features)

for i in range(0, num_features, plots_per_row):
    features = numerical_features[i:i+plots_per_row]
    num_plots = len(features)
    fig, axes = plt.subplots(1, num_plots, figsize=(6 * num_plots, 4))
    
    if num_plots == 1:
        axes = [axes]  # Ensure axes is iterable when there's only one plot
    
    for ax, feature in zip(axes, features):
        sns.histplot(ds[feature], bins=30, kde=True, ax=ax)
        ax.set_title(f'Distribution of {feature}')
        ax.set_xlabel(feature)
        ax.set_ylabel('Frequency')
    plt.tight_layout()
    plt.show()

# Bar plots for categorical features with multiple plots per row
num_features = len(categorical_features)

for i in range(0, num_features, plots_per_row):
    features = categorical_features[i:i+plots_per_row]
    num_plots = len(features)
    fig, axes = plt.subplots(1, num_plots, figsize=(6 * num_plots, 4))
    
    if num_plots == 1:
        axes = [axes]  # Ensure axes is iterable when there's only one plot
    
    for ax, feature in zip(axes, features):
        sns.countplot(data=ds, x=feature, order=ds[feature].value_counts().index, ax=ax)
        ax.set_title(f'Distribution of {feature}')
        ax.set_xlabel(feature)
        ax.set_ylabel('Count')
        ax.tick_params(axis='x', rotation=45)
    plt.tight_layout()
    plt.show()



### Patterns and Anomalies

#### Outliers
- **`duration`**: The long right tail suggests potential outliers with exceptionally long calls.
- **`pdays`**: The peak at 999 is a placeholder for customers not previously contacted rather than a real number of days, so it would skew any analysis that treats it as numeric.
- **`campaign`**: A few clients with a very high number of contacts may be outliers.

#### Missing Values
- Categorical features such as `job`, `marital`, `education`, `default`, `housing`, and `loan` contain `"unknown"` values. These represent missing data that needs imputation or separate treatment.

#### Patterns
- Economic indicators such as `emp.var.rate` and `cons.price.idx` take only a few distinct values (they are quarterly or monthly indicators), so their histograms show separate peaks that likely correspond to different economic periods.
- `age` is concentrated between 30 and 50, and `duration` mostly falls below 500 seconds, indicating that the campaign mainly reached working-age clients and that most calls were short.
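
The visual impression of outliers can be backed by a count. The sketch below assumes the common 1.5 × IQR rule, which the notebook itself does not prescribe, and uses a small made-up frame in place of `ds`; `iqr_outlier_count` is a hypothetical helper, not part of any library.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# Made-up values standing in for the real 'duration' and 'campaign' columns
toy = pd.DataFrame({
    "duration": [90, 120, 150, 180, 200, 220, 250, 4000],
    "campaign": [1, 1, 2, 2, 3, 3, 4, 40],
})

outlier_counts = {col: iqr_outlier_count(toy[col]) for col in toy.columns}
print(outlier_counts)
```

On the real data, the same helper could be applied to each entry of `numerical_features`. The `pdays` column would need its 999 placeholder handled first, since the rule would otherwise flag either the placeholder or every real value.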


In [None]:
print("Missing Values in Each Column:")
print("--------------------------------")

# Initialize a list to store missing value information
missing_values = []

# Check for missing values in numerical features
for feature in numerical_features:
    missing_count = ds[feature].isnull().sum()
    total_count = ds.shape[0]
    missing_percentage = (missing_count / total_count) * 100
    missing_values.append({'Feature': feature,
                           'Missing Count': missing_count,
                           'Missing Percentage': missing_percentage})

# Check for missing values in categorical features (considering 'unknown' as missing)
for feature in categorical_features:
    missing_count = (ds[feature] == 'unknown').sum()
    total_count = ds.shape[0]
    missing_percentage = (missing_count / total_count) * 100
    missing_values.append({'Feature': feature,
                           'Missing Count': missing_count,
                           'Missing Percentage': missing_percentage})

# Create a DataFrame from the list of dictionaries
missing_value_df = pd.DataFrame(missing_values)

# Print the missing value information
print(missing_value_df)


## Data Preprocessing
### Handling Missing Values

For the features with a low percentage of `"unknown"` values (`job`, `marital`, `education`, `housing`, `loan`), the unknown values are replaced with the mode of the known values. `default` is left out of the imputation, so its `"unknown"` value remains a separate category.

### Excluding the `contact` Feature from Clustering Analysis

The `contact` feature is dropped before clustering. It records the bank's communication channel (cellular or telephone) rather than a characteristic of the client, so keeping it could produce segments that reflect how the bank reached people instead of who they are. Removing this operational variable focuses the analysis on intrinsic attributes and makes the resulting segments easier to act on.


In [None]:
# Drop the 'contact' feature from the dataset
ds = ds.drop(columns=['contact'])

# List of features to impute
features_to_impute = ['job', 'marital', 'education', 'housing', 'loan']

for feature in features_to_impute:
    mode_value = ds[ds[feature] != 'unknown'][feature].mode()[0]
    ds.loc[ds[feature] == 'unknown', feature] = mode_value
    print(f'Imputed "unknown" in {feature} with mode: {mode_value}')


### Handling Outliers


#### 1. `pdays` feature
- Create a binary feature (`previously_contacted`) indicating whether the customer was previously contacted.
- Replace the 999 sentinel with a placeholder for customers not previously contacted. The code uses -1 (stored in `pdays_cleaned`); NaN is an alternative when downstream steps can handle missing values.

#### 2. `duration` feature
- Winsorize the upper tail by capping values at the 99th percentile, which reduces the impact of extreme calls while retaining all data points.

#### 3. `campaign` feature
- Cap values at the 95th percentile to limit the influence of clients contacted an unusually large number of times.
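
The three steps can be tried on a toy frame before running them on the full dataset. This is a minimal sketch with made-up values. It uses `Series.clip` for the capping and the NaN variant for `pdays`, whereas the notebook's own cell uses `np.where` and -1; scikit-learn's PCA and K-Means do not accept NaN, which is one reason to prefer a numeric placeholder.

```python
import numpy as np
import pandas as pd

# Made-up rows; 999 is the dataset's "never contacted" sentinel for pdays
toy = pd.DataFrame({
    "pdays": [999, 3, 999, 6, 999],
    "duration": [60, 120, 180, 240, 5000],
    "campaign": [1, 2, 3, 4, 50],
})

# 1. pdays: keep the information as a flag, then mask the sentinel as missing
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)
toy["pdays_cleaned"] = toy["pdays"].replace(999, np.nan)

# 2. duration: cap the upper tail at the 99th percentile
duration_cap = toy["duration"].quantile(0.99)
toy["duration_capped"] = toy["duration"].clip(upper=duration_cap)

# 3. campaign: cap the upper tail at the 95th percentile
campaign_cap = toy["campaign"].quantile(0.95)
toy["campaign_capped"] = toy["campaign"].clip(upper=campaign_cap)

print(toy)
```

Capping keeps every row and only pulls the extreme values down to the chosen percentile, so the row count is unchanged.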


In [None]:
# 1. Handling outliers in 'duration'
duration_99 = ds['duration'].quantile(0.99)
print(f"99th percentile of 'duration': {duration_99} seconds")

ds['duration_capped'] = np.where(ds['duration'] > duration_99, duration_99, ds['duration'])

# Visualize 'duration_capped'
plt.figure(figsize=(6, 4))
sns.histplot(ds['duration_capped'], bins=30, kde=True)
plt.title('Distribution of Duration after Capping Outliers')
plt.xlabel('Duration (seconds)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# 2. Handling 'pdays'

ds['previously_contacted'] = np.where(ds['pdays'] != 999, 1, 0)
ds['pdays_cleaned'] = ds['pdays'].replace(999, -1)



# Visualize 'pdays_cleaned'
plt.figure(figsize=(6, 4))
sns.histplot(ds['pdays_cleaned'], bins=30, kde=True)
plt.title('Distribution of pdays after Cleaning')
plt.xlabel('pdays')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# 3. Handling outliers in 'campaign'
campaign_95 = ds['campaign'].quantile(0.95)
print(f"95th percentile of 'campaign': {campaign_95}")

ds['campaign_capped'] = np.where(ds['campaign'] > campaign_95, campaign_95, ds['campaign'])

# Visualize 'campaign_capped'
plt.figure(figsize=(6, 4))
sns.histplot(ds['campaign_capped'], bins=30, kde=True)
plt.title('Distribution of Campaign after Capping Outliers')
plt.xlabel('Number of Contacts')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Drop the original columns now that the cleaned/capped versions exist
ds = ds.drop(columns=['pdays', 'duration', 'campaign'])


### Encoding Categorical Features
In this step, categorical features are transformed into numeric format using LabelEncoder, so that each category is replaced with an integer code and the dataset can be passed to PCA and K-Means. LabelEncoder assigns the codes in alphabetical order of the category names, so the integers imply an ordering and spacing that nominal features such as `job` do not actually have; distance-based methods will treat those codes as magnitudes.
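
To see what the encoder produces, the sketch below label-encodes a few made-up `job` values and, for comparison, one-hot encodes them with `pd.get_dummies`. One-hot encoding is shown only as a common alternative for nominal features in distance-based methods; it is not what this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few made-up values of the 'job' feature
jobs = pd.Series(["technician", "admin.", "student", "admin."], name="job")

# Label encoding: integer codes follow the alphabetical order of the classes
encoder = LabelEncoder()
codes = encoder.fit_transform(jobs)
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)

# One-hot encoding: one 0/1 column per category, with no implied ordering
one_hot = pd.get_dummies(jobs, prefix="job", dtype=int)
print(one_hot)
```

With label codes, `technician` sits "further" from `admin.` than `student` does purely because of spelling; with one-hot columns, all pairs of distinct categories are equally far apart.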


In [None]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# List of categorical features to encode (without 'contact', which was dropped)
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                        'month', 'day_of_week', 'poutcome']


# Encode each categorical feature using LabelEncoder
for feature in categorical_features:
    ds[feature] = label_encoder.fit_transform(ds[feature])
    print(f"Encoded '{feature}' with classes: {label_encoder.classes_}")

display(ds)


### Creating and Encoding Age Groups
This code defines age bins from the dataset's minimum and maximum age, assigns each age to a group (`young`: below 30, `adult`: 30 to 59, `senior`: 60 and above), and encodes the `age_group` feature using LabelEncoder. Deriving the outer bin edges from the data ensures all ages are included. Because LabelEncoder sorts class names alphabetically, the resulting codes are `adult` = 0, `senior` = 1, `young` = 2, which does not follow the age order.
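
The binning behaviour is easiest to verify at the bin edges. The sketch below uses made-up ages and, as an assumption that differs from the notebook's LabelEncoder step, reads the integer codes directly from the ordered categorical returned by `pd.cut`, which keeps them in age order.

```python
import pandas as pd

# Made-up ages chosen to sit on the bin edges
ages = pd.Series([17, 29, 30, 59, 60, 98], name="age")

age_bins = [ages.min(), 30, 60, ages.max() + 1]
age_labels = ["young", "adult", "senior"]

# right=False makes every bin left-closed: [17, 30), [30, 60), [60, 99)
age_group = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

# Codes follow the order of age_labels: young=0, adult=1, senior=2
age_group_codes = age_group.cat.codes

print(pd.DataFrame({"age": ages, "age_group": age_group, "code": age_group_codes}))
```

Adding 1 to the maximum age matters because the last bin is open on the right; without it, the oldest client would fall outside every bin and become NaN.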


In [None]:
# Define age bins and labels, adjusting to include all ages
min_age = ds['age'].min()
max_age = ds['age'].max()

age_bins = [min_age, 30, 60, max_age + 1]
age_labels = ['young', 'adult', 'senior']

# Create the 'age_group' feature
ds['age_group'] = pd.cut(ds['age'], bins=age_bins, labels=age_labels, right=False)

# Check for missing values in 'age_group'
missing_age_groups = ds['age_group'].isnull().sum()
print(f"Missing age groups: {missing_age_groups}")

# If there are missing values, decide how to handle them
# For this example, we'll assume there are no missing values after adjusting the bins

# Encode 'age_group' using LabelEncoder
ds['age_group'] = ds['age_group'].astype(str)

label_encoder = LabelEncoder()
ds['age_group_encoded'] = label_encoder.fit_transform(ds['age_group'])

# Display the mapping
age_group_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Age Group Encoding Mapping:")
print(age_group_mapping)

# Verify the distribution
print("Age Group Distribution:")
print(ds['age_group'].value_counts())



### Numerical features normalization
Standardization rescales each numerical feature to zero mean and unit variance, so that features measured on larger scales do not dominate the analysis. In an unsupervised learning context, this is crucial for distance-based algorithms such as K-Means, where unscaled features could bias the result. Only the columns in `numerical_features` are scaled in this step; the label-encoded categorical columns keep their integer codes.
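
Per column, the transformation is `z = (x - mean) / std`. The sketch below checks on a made-up array that `StandardScaler` matches this formula when the standard deviation is the population one (`ddof=0`, NumPy's default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
X_toy = np.array([[20.0, 100.0],
                  [40.0, 300.0],
                  [60.0, 800.0]])

# Manual z-score: subtract the column mean, divide by the population std
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# StandardScaler applies the same transformation column by column
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```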


In [None]:

numerical_features = ['age', 'duration_capped', 'campaign_capped', 'pdays_cleaned', 'previous', 'emp.var.rate', 
                      'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
# Check for missing values in the dataset
print("Missing values in the dataset before scaling:")
print(ds.isnull().sum())

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler and transform the numerical features in one step
ds[numerical_features] = scaler.fit_transform(ds[numerical_features])

# Drop the string 'age_group' column; 'age_group_encoded' is kept
ds = ds.drop(columns=['age_group'])


print(ds.dtypes)
display(ds)


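### Model Building and Evaluation

Before clustering, the features are standardized and projected onto two principal components, with points colored by the target `y`, to check visually whether subscribers and non-subscribers separate.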


In [None]:
# 1. Encode the target variable 'y' into numerical format
ds['num_y'] = ds['y'].map({'yes': 1, 'no': 0})

# 2. Drop the original 'y' column
ds = ds.drop(columns=['y'])

# 3. Separate features and labels
X = ds.drop(columns=['num_y'])
y = ds['num_y']

# 4. Reset index to ensure alignment
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

# 5. Separate the data based on the target variable
Si_sample = X.loc[y == 1].reset_index(drop=True)
No_sample = X.loc[y == 0].reset_index(drop=True)

# 6. Combine Si and No samples for consistent scaling
combined_data = pd.concat([Si_sample, No_sample], ignore_index=True)

# Encode any remaining non-numeric columns
# (the categorical features were already label-encoded above,
#  so this list is normally empty and the loop is only a safeguard)
non_numeric_columns = combined_data.select_dtypes(include=['object']).columns.tolist()
print(f"Non-numeric columns: {non_numeric_columns}")

for col in non_numeric_columns:
    label_encoder = LabelEncoder()
    combined_data[col] = label_encoder.fit_transform(combined_data[col])

# 7. Initialize the scaler
scaler = StandardScaler()

# 8. Fit and transform the combined data
scaled_data = scaler.fit_transform(combined_data)

# 9. Apply PCA to the combined scaled data
n_components = 2  # Number of principal components
pca = PCA(n_components=n_components)
pca_transformed = pca.fit_transform(scaled_data)

# 10. Get the number of samples in each group
num_Si_samples = len(Si_sample)
num_No_samples = len(No_sample)

# 11. Separate the transformed data back into Si and No samples
Si_transformed = pca_transformed[:num_Si_samples]
No_transformed = pca_transformed[num_Si_samples:]

# 12. Plot the first two principal components for Si and No samples
plt.figure(figsize=(10, 7))
plt.scatter(No_transformed[:, 0], No_transformed[:, 1], color='red', label='NO', alpha=0.6)
plt.scatter(Si_transformed[:, 0], Si_transformed[:, 1], color='green', label='YES', alpha=0.6)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA: First Two Principal Components")
plt.legend()
plt.show()

# 13. Print explained variance ratio for PC1 and PC2
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance Ratio for PC1 and PC2: {explained_variance}")



### **Interpretation of PCA Results**
The PCA plot shows the data reduced to two dimensions (PC1 and PC2), where PC1 captures the most variance and PC2 captures the second most. 
The partial overlap between the `YES` (green) and `NO` (red) groups indicates that the features used for PCA do not fully separate the two classes, suggesting shared characteristics or noise in the data. 
Some distinct clusters are visible, highlighting subsets of the data with unique feature combinations. Overall, while PCA reveals variability and some patterns, additional components or refined features may be needed for better group differentiation.



Next, we compute the cumulative explained variance over all principal components to see how much variance is retained as components are added.
Plotting it helps identify the number of components needed to retain most of the dataset's information: the point of diminishing returns is where adding another component contributes little additional variance.
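
Reading the number of components off the plot can be complemented by a simple rule. The sketch below assumes a 90% variance target, which is an illustrative choice and not one made in this notebook, and uses synthetic data in place of `X`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for X: 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

pca_full = PCA().fit(X_toy)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
threshold = 0.90  # assumed target, adjust as needed
n_keep = int(np.searchsorted(cumulative_variance, threshold) + 1)

print(np.round(cumulative_variance, 3))
print(n_keep)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it selects the number of components by explained variance itself.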


In [None]:
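# Calculate explained variance for all components
# Note: X holds standardized numerical features together with unscaled
# label-encoded categorical codes; unlike `scaled_data` above, it is not
# re-standardized before this PCA.
pca_full = PCA()
pca_full.fit(X)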
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Explained Variance by PCA Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()


In [None]:
# Apply PCA with the chosen number of components
n_components = 6  # or 8, based on your preference
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X)
# Get the PCA components (loadings)
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i}' for i in range(1, n_components + 1)],
    index=X.columns
)

# Display loadings
print("PCA Loadings:")
display(loadings)

# Combine the reduced data with the target variable
df_pca = pd.DataFrame(X_reduced, columns=[f'PC{i}' for i in range(1, n_components + 1)])
df_pca['Target'] = y.values

# Plot pairwise relationships
sns.pairplot(df_pca, hue='Target', vars=[f'PC{i}' for i in range(1, 4)])
plt.show()


### **PCA Results**

The PCA analysis reduces the dataset into components, with PC1, PC2, and PC3 capturing the most variance. The pairwise scatterplots show significant overlap between the two classes (`0` and `1`), indicating limited class separability in this space.  

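PCA loadings show that `job`, `month`, and `education` dominate the first three components. This PCA is fitted on `X`, where the numerical features are standardized but the label-encoded categorical codes are not, so these columns have the largest raw variance and their dominance likely reflects scale rather than importance. Standardizing all columns before PCA, additional feature engineering, or non-linear dimensionality reduction (e.g., t-SNE) may improve class separation.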
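
PCA loadings are sensitive to feature scale, which is worth keeping in mind when reading them. The sketch below uses synthetic, independent columns (the names `job_code`, `age_z`, and `euribor_z` are hypothetical stand-ins) and compares the first component before and after standardizing every column; `top_feature_on_pc1` is a helper written for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# One integer-coded column (0..10) next to two already standardized columns
toy = pd.DataFrame({
    "job_code": rng.integers(0, 11, size=1000).astype(float),
    "age_z": rng.normal(size=1000),
    "euribor_z": rng.normal(size=1000),
})


def top_feature_on_pc1(frame: pd.DataFrame):
    """Return the feature with the largest absolute PC1 loading and PC1's variance share."""
    pca = PCA(n_components=1).fit(frame)
    loadings = pd.Series(np.abs(pca.components_[0]), index=frame.columns)
    return loadings.idxmax(), float(pca.explained_variance_ratio_[0])


raw_top, raw_share = top_feature_on_pc1(toy)

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
scaled_top, scaled_share = top_feature_on_pc1(scaled)

print(raw_top, round(raw_share, 2))
print(scaled_top, round(scaled_share, 2))
```

On the raw frame, the integer-coded column takes over the first component simply because its variance is about ten times larger; after scaling, no single column carries most of the variance.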


## K-Means Clustering

In [None]:
# Import necessary libraries for clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 'X_reduced' is the PCA-reduced data (6 components); 'y' is the encoded target

# 1. Determine the optimal number of clusters using the Elbow Method
wcss = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_reduced)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.figure(figsize=(8, 5))
plt.plot(K, wcss, 'bo-', markersize=8)
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.grid(True)
plt.show()


### **Elbow Method for Optimal k**

This graph uses the Elbow Method to choose the number of clusters (k) for K-Means clustering. The x-axis shows the number of clusters, and the y-axis shows the Within-Cluster Sum of Squares (WCSS), the sum of squared distances between each point and the centroid of its cluster.

The "elbow" point occurs around **k=4 or k=5**, where the reduction in WCSS starts to slow down significantly. This suggests that 4 or 5 clusters may be optimal, as adding more clusters does not provide substantial improvement in reducing variance.
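
The quantity on the y-axis is what scikit-learn exposes as `inertia_`. The sketch below recomputes it by hand on two synthetic blobs to make the definition concrete; `n_init=10` is set explicitly here as an assumption, since the notebook leaves it at the library default.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs of 50 points each
X_toy = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_toy)

# WCSS: squared distance from each point to its assigned centroid, summed over all points
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = float(((X_toy - assigned_centers) ** 2).sum())

# With a single cluster, every point is measured against the global mean
wcss_one_cluster = float(((X_toy - X_toy.mean(axis=0)) ** 2).sum())

print(np.isclose(wcss_manual, kmeans.inertia_))
print(wcss_one_cluster > wcss_manual)
```

WCSS can only decrease as k grows, which is why the elbow (the point where the decrease flattens) is used rather than the minimum.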


I could not perform silhouette analysis on the whole dataset due to its size.
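
One possible workaround is the `sample_size` parameter of `silhouette_score`, which scores a random subset of points instead of the full pairwise distance matrix. The sketch below runs on synthetic blobs; the sample size of 500 is an arbitrary assumption, and this was not run on the bank dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three synthetic, well-separated blobs of 1000 points each
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X_toy = np.vstack([rng.normal(c, 0.7, size=(1000, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    # sample_size limits the computation to a random subset of the points
    scores[k] = silhouette_score(X_toy, labels, sample_size=500, random_state=42)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print(best_k)
```

Because the score is computed on a sample, it is an estimate; repeating it with a few different `random_state` values gives a sense of its stability.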

In [None]:
# Fit K-Means with the number of clusters suggested by the elbow plot
optimal_k = 5
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
cluster_labels = kmeans.fit_predict(X_reduced)

# Collect the principal components, cluster labels, and target in one DataFrame
df_clusters = pd.DataFrame(X_reduced, columns=[f'PC{i}' for i in range(1, n_components + 1)])
df_clusters['Cluster'] = cluster_labels
df_clusters['Target'] = y.values

# 2D view: clusters on the first two principal components
plt.figure(figsize=(10, 7))
sns.scatterplot(
    x='PC1',
    y='PC2',
    hue='Cluster',
    data=df_clusters,
    palette='viridis',
    alpha=0.6
)
plt.title(f'K-Means Clusters Visualization (k={optimal_k})')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.show()

# 3D view: clusters on the first three principal components
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(
    df_clusters['PC1'],
    df_clusters['PC2'],
    df_clusters['PC3'],
    c=df_clusters['Cluster'],
    cmap='viridis',
    alpha=0.6
)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title(f'3D K-Means Clusters (k={optimal_k})')
legend = ax.legend(*scatter.legend_elements(), title='Cluster')
ax.add_artist(legend)
plt.show()
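
To connect the clusters back to the business question, each segment can be profiled by its size and subscription rate. The sketch below uses a made-up frame with the same `Cluster` and `Target` columns as `df_clusters`; the numbers are illustrative only and are not results from the dataset.

```python
import pandas as pd

# Made-up cluster assignments and subscription outcomes (1 = subscribed)
toy_clusters = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 2, 2, 2, 2, 2],
    "Target":  [0, 0, 1, 1, 1, 0, 0, 0, 0, 1],
})

# Size and share of subscribers per cluster, highest rate first
profile = (
    toy_clusters.groupby("Cluster")["Target"]
    .agg(size="count", subscription_rate="mean")
    .sort_values("subscription_rate", ascending=False)
)

print(profile)
```

Applied to `df_clusters`, the same aggregation would indicate which segments respond best to the campaign and how large they are.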
