# Predicting Customer Term Deposits

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Summary

TBD once coding is finished

## Introduction 

According to a poll from Investor's Edge, the direct investing division of CIBC, 79% of Canadians acknowledge that it is important to know how to invest their money (Gonzales 2024). However, only 48% of Canadians invest their money annually (Gonzales 2024). Of those hesitant to invest, 57% cited a fear of losing money as the reason (Gonzales 2024). It’s important for banks to understand Canadians’ investment habits so that they can offer services that support Canadians’ investment decisions. With this in mind, many banks offer a lower-risk investment option called a bank term deposit.

A bank term deposit is a type of secure investment that allows individuals to deposit a lump sum of money for a fixed period of time (the term). The term can range from short to long. The money is “locked in” for the duration of the term at an agreed-upon interest rate. At the end of the term, the customer receives the initial deposit along with the accumulated interest. It’s important to note that term deposits are very low risk and the initial investment is protected (IslandSavings, n.d.).

In this analysis, we aim to determine whether a machine learning model can predict if a customer will agree to open a term deposit, using data from a Portuguese banking institution. Answering this question will be valuable for the banking institution, as it will allow them to focus their calling campaign on customers who are more likely to agree to a term deposit. This targeted approach will save time and resources, because the model's predictions will reduce the number of customers who need to be contacted in future campaigns. This analysis will also help the bank understand their customers' investment preferences, enabling them to build stronger relationships with their client base and offer investment options that align with customer needs.

# Methods

## Data

The dataset used in this project comes from a direct marketing campaign conducted via phone calls by a Portuguese banking institution. The dataset was created by S. Moro, P. Rita, and P. Cortez and collected between May 2008 and November 2010 (Moro, Rita, and Cortez, 2014). Our team sourced the data from the UCI Machine Learning Repository; it can be accessed through the DOI listed in the References. Each row in the dataset represents a bank client, and the 17 columns capture aspects of the client's characteristics as well as whether the client opened a term deposit. Some features are specific to the individual (e.g. age, job, marital status, and education level), while others pertain to their relationship with the bank, such as past interactions through previous campaigns or the number of days since the last contact.

## Step 1: Data Loading

Objective: Provide context about the dataset and load it for analysis.

In [None]:
# Path to the CSV file
file_path = "../data/bank-full.csv"

# Load the data
data = pd.read_csv(file_path, sep=';')

# Preview the first few rows
data.head()

## Step 2: Data Cleaning and Missing Value Handling

Objective: Standardize missing values, clean data, and remove irrelevant columns.

In [None]:
# Replace "unknown" with NaN
data.replace('unknown', np.nan, inplace=True)

# Check for missing values
print(data.isnull().sum())

# Visualize the relationship between 'contact' and 'y' to evaluate its importance
sns.countplot(x='contact', hue='y', data=data)
plt.title('Contact Method vs Subscription')
plt.xlabel('Contact Method')
plt.ylabel('Count')
plt.show()

# Impute missing values in 'job' and 'education' with mode
data['job'] = data['job'].fillna(data['job'].mode()[0])
data['education'] = data['education'].fillna(data['education'].mode()[0])

# Decision about 'contact':
# Based on the visualization, replace missing values in 'contact' with 'Unknown Contact'
data['contact'] = data['contact'].fillna('Unknown Contact')

# Drop the 'poutcome' column due to excessive missing values
data.drop(columns=['poutcome'], inplace=True)

# Drop the 'duration' column to avoid data leakage (it is only known after a call ends)
data.drop(columns=['duration'], inplace=True)

# Verify the cleaned dataset
print(data.isnull().sum())
print(data.info())


To prepare the dataset for analysis, we addressed missing values and removed two columns. Missing values in job and education were imputed with the mode, because their proportions were small and mode imputation keeps the columns categorical. For contact, we visualized its relationship with the target variable (y) and found that the contact method is associated with subscription rates. Based on this, missing values in contact were replaced with the explicit category "Unknown Contact" so that the column retains its predictive value. The poutcome column was dropped because most of its values (82%) are missing. The duration column was removed to prevent data leakage: it is strongly associated with the target variable, but it is only known once a call has ended, so it would not be available when deciding whom to call. These cleaning decisions produce a consistent dataset while preserving key patterns for predictive modeling.
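The choice between imputing and dropping a column depends on how much of it is missing. The following is a minimal sketch of that check on a made-up toy frame (not the bank data); the `DROP_THRESHOLD` cut-off of 0.5 is a hypothetical value chosen for illustration, not a rule used in this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "services", "admin."],
    "education": ["secondary", "tertiary", "unknown", "primary", "secondary"],
    "poutcome": ["unknown", "unknown", "unknown", "success", "unknown"],
})

# Treat the string "unknown" as missing, then compute the share missing per column.
toy = toy.replace("unknown", np.nan)
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: columns above it are candidates for dropping rather than imputing.
DROP_THRESHOLD = 0.5
to_drop = missing_share[missing_share > DROP_THRESHOLD].index.tolist()
print(to_drop)
```

Running the same `isna().mean()` call on the real `data` frame after the "unknown" replacement would give the per-column proportions that the paragraph above refers to.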

## Step 3: Exploratory Data Analysis (EDA)

Objective: Gain insights into the dataset using summary statistics and visualizations.

### 3.1 Summary Statistics

In [None]:
# Generate summary statistics for numerical columns
print(data.describe())

The summary statistics give an overview of the central tendency, variability, and range of the numerical columns in the dataset. The age column has a mean of 41 years and a standard deviation of 10.6, with values ranging from 18 to 95. The balance column varies widely, from -8019 to 102127 with a standard deviation over 3000, which suggests the presence of outliers. The day column, representing the day of the month of the last contact, is spread fairly evenly across the month. The campaign column has a median of 2 and a maximum of 63, showing that most clients were contacted only a few times. Many entries in the pdays column are -1, likely indicating no prior contact, while previous has a low mean (0.58) but a maximum of 275, pointing to rare but extreme cases. These statistics describe the dataset’s distribution and help identify features that may need further preprocessing.
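Two of these observations can be checked numerically: how common the -1 value in pdays is, and how many balance values count as outliers. The sketch below uses made-up values and Tukey's 1.5 × IQR rule, which is one common convention for flagging outliers and is an assumption here, not a method used elsewhere in this analysis.

```python
import pandas as pd

# Made-up values for illustration only; the real columns live in `data`.
toy = pd.DataFrame({
    "balance": [-200, 0, 150, 300, 450, 600, 900, 102127],
    "pdays": [-1, -1, -1, 90, -1, 180, -1, -1],
})

# Share of rows carrying the -1 value in pdays.
share_never_contacted = (toy["pdays"] == -1).mean()

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["balance"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["balance"] < q1 - 1.5 * iqr) | (toy["balance"] > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())

print(f"pdays == -1: {share_never_contacted:.0%}, balance outliers: {n_outliers}")
```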

### 3.2 Distribution of the Target Variable

In [None]:
# Visualize the distribution of the target variable
sns.countplot(x='y', data=data)
plt.title('Distribution of Target Variable (Term Deposit Subscription)')
plt.xlabel('Subscription Status')
plt.ylabel('Count')
plt.show()
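A count plot shows raw counts; the class proportions are often easier to reason about, because the majority-class share is the accuracy that a model always predicting that class would reach. A minimal sketch on made-up labels (the real labels are in `data['y']`):

```python
import pandas as pd

# Made-up labels for illustration; not the actual class balance of the dataset.
y = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

# Proportion of each class instead of raw counts.
class_share = y.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```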


### 3.3 Relationships Between Features and Target

In [None]:
# Set a consistent figure size
plot_size = (8, 5)

# Boxplot for balance vs subscription status
plt.figure(figsize=plot_size)
sns.boxplot(x='y', y='balance', data=data)
plt.title('Balance Distribution by Subscription Status')
plt.xlabel('Subscription Status')
plt.ylabel('Balance')
plt.show()

# Job vs subscription
plt.figure(figsize=plot_size)
sns.countplot(x='job', hue='y', data=data)
plt.title('Job Type vs Subscription')
plt.xlabel('Job Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Subscription Status', loc='upper right')
plt.show()
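Raw counts per job type are dominated by how many clients each group contains, so a complementary view is the subscription rate within each job. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows; the resulting rates are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "job": ["student", "student", "blue-collar", "blue-collar",
            "blue-collar", "blue-collar", "retired", "retired"],
    "y": ["yes", "no", "no", "no", "no", "yes", "yes", "yes"],
})

# Row-normalized crosstab: each row sums to 1, so the "yes" column is the
# subscription rate within that job type.
rate_by_job = (
    pd.crosstab(toy["job"], toy["y"], normalize="index")["yes"]
    .sort_values(ascending=False)
)
print(rate_by_job)
```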

## Step 4: Correlation Analysis

Objective: Explore relationships between numerical features.

In [None]:
# Select only numerical columns (any numeric dtype, not just float64/int64)
numerical_data = data.select_dtypes(include='number')

# Calculate and visualize the correlation matrix
correlation_matrix = numerical_data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

The correlation heatmap shows that most numerical features have weak or negligible pairwise correlations, which indicates low multicollinearity. The one notable exception is a moderate correlation (0.45) between pdays and previous, suggesting some redundancy between them; other features such as age, balance, and campaign show minimal linear relationships. Because the Pearson coefficient captures only linear association, this suggests that the numerical features are largely linearly unrelated and can be used for modeling without significant concerns about collinearity.
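With more features, reading pairs off a heatmap becomes error-prone, so it can help to list the pairs whose absolute correlation exceeds a cut-off. The helper below, `correlated_pairs`, is a hypothetical function written for illustration, and the 0.4 threshold and toy values are assumptions rather than part of this analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.4):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Comparisons against NaN are False, so masked-out cells never pass the filter.
    return pairs[pairs.abs() >= threshold].sort_values(key=np.abs, ascending=False)

# Made-up values for illustration only.
toy = pd.DataFrame({
    "pdays": [-1, -1, 90, 180, -1, 270],
    "previous": [0, 0, 2, 3, 0, 5],
    "age": [30, 52, 41, 38, 60, 27],
})
result = correlated_pairs(toy)
print(result)
```

Calling `correlated_pairs(numerical_data)` on the real frame should surface the pdays and previous pair discussed above, provided the threshold is at or below the reported 0.45.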

## Step 5: Analysis

## References

Moro, S., Rita, P., and Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

Gonzales, F. (2024, May 16). “More than half of Canadians don’t invest annually, CIBC poll finds.” Wealth Professional. https://www.wealthprofessional.ca/investments/wealth-technology/more-than-half-of-canadians-dont-invest-annually-cibc-poll-finds/385897

IslandSavings. (n.d.). “A Complete Guide to Term Deposits.” IslandSavings. https://www.islandsavings.ca/simple-advice/wealth/term-deposits-guide#p1 