---
# **Bank Churn EDA**: A Comparative Study

<div >
  <img src="https://res.cloudinary.com/dhditogyd/image/upload/v1738068132/Data%20Science%20Media/Bank%20Churn%20Prediction/l5s9r0ttmsczsblljswh.png" width="100%" height=" 100%;"/>
</div>

---





Author: [Muhammad Faizan](https://www.linkedin.com/in/mrfaizanyousaf/)

<div >
  <img src="https://res.cloudinary.com/dhditogyd/image/upload/v1735402856/Passport_photo_jsvsip.png" width="20%" height=" 20%;"/>
</div>

## Muhammad Faizan

🎓 **3rd Year BS Computer Science** student at the **University of Agriculture, Faisalabad**  
💻 Enthusiast in **Machine Learning, Data Engineering, and Data Analytics**


## 🌐 Connect with Me

[Kaggle](https://www.kaggle.com/faizanyousafonly/) | [LinkedIn](https://www.linkedin.com/in/mrfaizanyousaf/) | [GitHub](https://github.com/faizan-yousaf/)  




## 💬 Contact Me
- **Email:** faizanyousaf815@gmail.com
- **WhatsApp:** [+92 306 537 5389](https://wa.me/923065375389)


🔗 **Let’s Collaborate:**  
I'm always open to queries, collaborations, and discussions. Let's build something amazing together!


## Meta-Data (About Dataset)

### Context:
This is a multivariate dataset, meaning that several separate statistical variables are recorded for each patient. It is composed of `14 attributes`: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the predicted attribute `num`. The full database includes `76 attributes`, but all published studies use a subset of 14 of them. The Cleveland database is the only one used by ML researchers to date. The dataset supports two main tasks: predicting, from a patient's attributes, whether that person has heart disease, and exploring the data to find insights that help in understanding the problem.

### Content:

#### Column Descriptions:

* `id`: unique id for each patient
* `age`: age of the patient in years
* `origin`: place of study
* `sex`: Male/Female
* `cp`: chest pain type [typical angina, atypical angina, non-anginal, asymptomatic]
* `trestbps`: resting blood pressure (in mm Hg on admission to the hospital)
* `chol`: serum cholesterol in mg/dl
* `fbs`: whether fasting blood sugar > 120 mg/dl
* `restecg`: resting electrocardiographic results [normal, stt abnormality, lv hypertrophy]
* `thalach`: maximum heart rate achieved
* `exang`: exercise-induced angina (True/False)
* `oldpeak`: ST depression induced by exercise relative to rest
* `slope`: the slope of the peak exercise ST segment
* `ca`: number of major vessels (0-3) colored by fluoroscopy
* `thal`: a blood disorder called *thalassemia* [normal; fixed defect; reversible defect]
* `num`: the predicted attribute
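
Since the column list above doubles as a data dictionary, one quick sanity check is to compare it against the columns actually present in a loaded file. The following is a minimal sketch rather than part of the original notebook: `compare_columns` and the toy frame are hypothetical, and the documented names are simply typed in from the list above.

```python
import pandas as pd

# Column names copied by hand from the description list above.
DOCUMENTED_COLUMNS = [
    "id", "age", "origin", "sex", "cp", "trestbps", "chol", "fbs",
    "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def compare_columns(frame, documented):
    """Return (missing, extra): documented names absent from the frame,
    and frame columns that the documentation does not mention."""
    actual = list(frame.columns)
    missing = [name for name in documented if name not in actual]
    extra = [name for name in actual if name not in documented]
    return missing, extra


# Tiny illustrative frame with made-up values (NOT the real dataset).
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "age": [63, 41],
        "sex": ["Male", "Female"],
        "unexpected": [0, 1],
    }
)

missing, extra = compare_columns(toy, DOCUMENTED_COLUMNS)
print("Documented but absent:", missing)
print("Present but undocumented:", extra)
```

Running the same helper on the real DataFrame right after loading would show immediately whether this description matches the file being analysed.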


### Acknowledgements
### Creators:

* Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
* University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
* University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
* V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

### Relevant Papers:
* Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304--310. [Web Link](http://rexa.info/paper/b884ce2f4aff7ed95ce7bfa7adabaef46b88c60c)
* David W. Aha & Dennis Kibler. "Instance-based prediction of heart-disease presence with the Cleveland database." [Web Link](rexa.info/paper/0519d1408b992b21964af4bfe97675987c0caefc)
* Gennari, J.H., Langley, P, & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11--61. [Web Link](http://rexa.info/paper/faecfadbd4a49f6705e0d3904d6770171b05041f)

### Citation Request:
The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. 

**They would be**:

* Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
* University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
* University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
* V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## Aims and Objectives:

We will fill this after doing the EDA and Data Preprocessing.

# **Import Libraries:**

Let's start the project by importing all the libraries that we will use in this project.

In [None]:
# import libraries:

# 1. to handle the data:
import numpy as np
import pandas as pd

# 2. to visualize the data:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 3. to preprocess the data:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# 4. to build the model:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV

# 5. for classification task:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

# 6. Metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, r2_score, f1_score, classification_report, root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

# 7. to ignore the warnings:
import warnings
warnings.filterwarnings("ignore")

print("Libraries have been loaded successfully")

# 8. Display all rows and columns:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [1]:
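# Setting a style for the plots
# seaborn is imported here as well so the cell also runs on its own,
# e.g. when it is executed before the import cell above
import seaborn as sns

sns.set_theme(style="whitegrid")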

# 1. 📊 Loading and Peeking at the Data

In [None]:
# 🚀 Step 1: Loading the Data
print("Loading the dataset... 🕵️‍♂️")
data = pd.read_csv("train.csv")
print("Dataset loaded successfully!")

In [None]:
# Displaying the first few rows of the dataset
print("\nLet's take a peek at the dataset: 👀")
data.head()

In [None]:
# Basic information about the dataset
print("\n📝 Quick summary of our dataset:")
# The DataFrame was loaded as `data`; info() prints its report itself and returns None
data.info()

# 2. 🔍 Understanding the Data Structure

In [None]:
print("\nUnderstanding the dataset structure and dimensions: 🧱")
print(f"Rows: {data.shape[0]}, Columns: {data.shape[1]}")
print("\nColumn Information:")
data.info()
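
To complement `data.info()`, it can help to collect the same structural facts (dtype, non-null count, missing values, number of unique values) into a single table that can be sorted and filtered. The sketch below is illustrative only: `summarize_structure` is a hypothetical helper, and the toy frame with its column names is invented so the example runs on its own.

```python
import numpy as np
import pandas as pd


def summarize_structure(frame):
    """Build one row per column with dtype, null counts and cardinality."""
    summary = pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "non_null": frame.notna().sum(),
            "missing": frame.isna().sum(),
            "missing_pct": (frame.isna().mean() * 100).round(2),
            "n_unique": frame.nunique(),  # NaN is not counted as a value
        }
    )
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)


# Made-up data for illustration (hypothetical column names).
toy = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "balance": [0.0, np.nan, 1500.5, 320.0],
        "country": ["FR", "DE", None, "FR"],
    }
)

structure = summarize_structure(toy)
print(structure)
```

Calling `summarize_structure(data)` on the loaded dataset would give the equivalent overview for the real columns.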


In [None]:
print("\n🧮 Let's crunch some numbers!")
print(data.describe())  # the DataFrame was loaded as `data`, not `df`
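
By default `describe()` only summarizes numeric columns, so categorical columns are silently left out of the table. A minimal sketch of summarizing both kinds separately follows; the toy frame and its column names are assumptions made for illustration, not columns confirmed to exist in `train.csv`.

```python
import pandas as pd

# Made-up data for illustration (hypothetical column names).
toy = pd.DataFrame(
    {
        "age": [25, 40, 31, 58],
        "balance": [0.0, 1200.0, 560.5, 98000.0],
        "country": ["FR", "DE", "FR", "ES"],
    }
)

# Numeric columns: count, mean, std, min, quartiles, max.
# Transposing puts one column per row, which is easier to read when wide.
numeric_summary = toy.describe().T

# Non-numeric columns: count, unique, top (most frequent value), freq.
categorical_summary = toy.select_dtypes(exclude="number").describe().T

print(numeric_summary)
print(categorical_summary)
```

The same two calls can be applied to the loaded dataset; comparing `mean` against the `50%` column in the numeric table is a quick way to spot skewed features.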

