**Richard Alberto - 0706022210024**

# **Week 7**


---


## Random Forest and Gradient Boosting Classifier
Today's lab exercise uses the Bank Marketing Dataset.
- Download the **Bank Marketing Dataset** from Elearn
- Upload the dataset to your own GitHub repository
- Import the dataset using the URL from your GitHub repository

**METADATA** <br>
Age <br>
Job : type of job <br>
Marital : marital status <br>
Education <br>
Default: has credit in default? <br>
Housing: has housing loan? <br>
Loan: has personal loan? <br>
Contact: contact communication type <br>
Month: last contact month of year <br>
Day: last contact day of the month <br>
Duration: last contact duration, in seconds. Important
note: this attribute highly affects the output target (e.g., if
duration=0 then y='no'). <br>
Campaign: number of contacts performed during this campaign and for
this client (includes last contact) <br>
Pdays: number of days that passed by after the client was last
contacted from a previous campaign (the metadata lists 999 for clients
not previously contacted; the rows loaded in this notebook use -1) <br>
Previous: number of contacts performed before this campaign and for
this client <br>
Poutcome: outcome of the previous marketing campaign <br>
y: has the client subscribed to a term deposit?

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Load Dataset

In [2]:
url = "https://raw.githubusercontent.com/blaqqqqq/bank-dataset/refs/heads/main/Bank.csv"
data = pd.read_csv(url, delimiter=";")
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


## Exploratory Data Analysis
Explore the data to identify patterns within the dataset. <br>
Hint: <br>
* There are no missing values, but some columns contain the value 'unknown'. Decide whether to drop those rows or keep them.
* The 'admin.' and 'management' values in the job column represent similar roles, so you can combine them under a single category.
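
Before deciding whether to drop 'unknown' rows, it can help to see how many there are per column and how much data would be left. The following is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not the bank dataset); the same two expressions should apply to `data`.

```python
import pandas as pd

# Toy frame standing in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "job": ["management", "unknown", "services", "admin."],
    "contact": ["cellular", "unknown", "cellular", "telephone"],
    "poutcome": ["unknown", "unknown", "failure", "success"],
    "age": [30, 33, 35, 59],
})

# Count the literal string 'unknown' in each column.
unknown_counts = (toy == "unknown").sum()

# Number of rows that would survive dropping every row containing an 'unknown'.
rows_kept = int((~(toy == "unknown").any(axis=1)).sum())

print(unknown_counts)
print(f"rows kept after dropping unknowns: {rows_kept} of {len(toy)}")
```

If a single column accounts for most of the 'unknown' values, dropping that column (or keeping 'unknown' as its own category) may preserve far more rows than dropping every affected row.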




In [3]:
data.isnull().sum()

Unnamed: 0,0
age,0
job,0
marital,0
education,0
default,0
balance,0
housing,0
loan,0
contact,0
day,0


In [4]:
data.replace("unknown", np.nan, inplace=True)
data.dropna(inplace=True)

In [5]:
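# Note: the dataset spells this category 'admin.' (trailing period), so 'admin' matches nothing
# and the counts stay separate; replace('admin.', 'management') would merge them.
data['job'] = data['job'].replace(['admin'], 'management')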
data['job'].value_counts()

job,count
management,177
blue-collar,143
technician,137
admin.,102
services,58
retired,44
self-employed,26
entrepreneur,21
unemployed,20
student,19
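
The counts above still list 'admin.' separately from 'management', which suggests the replacement did not match. A minimal sketch of why, using a made-up toy column (the `jobs` Series is an assumption for illustration): `Series.replace` with a plain string compares whole values, so the label has to be spelled exactly as it appears in the data.

```python
import pandas as pd

# Toy job column; the values are made up and only illustrate the label spelling.
jobs = pd.Series(["admin.", "management", "admin.", "services", "management"], name="job")

# 'admin' matches nothing, because replace() compares whole values, not substrings.
unchanged = jobs.replace("admin", "management")

# The exact label 'admin.' merges the two roles.
merged = jobs.replace("admin.", "management")

print(unchanged.value_counts())
print(merged.value_counts())
```

Checking `data['job'].unique()` before writing the replacement is a cheap way to catch spelling differences like the trailing period.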


## Machine Learning (Split Data)

In [6]:
X = data.drop("y", axis=1)
y = data["y"]

# Converting categorical columns to dummy variables
X = pd.get_dummies(X, drop_first=True)

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
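
Two optional refinements to the split are sketched below on a made-up toy frame (all values are assumptions for illustration). The first drops the `Unnamed: 0` column, which appears to be an exported row index rather than a client attribute. The second passes `stratify` so that the yes/no ratio is the same in the training and test sets. This is one possible variant, not the split used for the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same kind of columns: an exported index column,
# one numeric feature, one categorical feature, and an imbalanced target.
toy = pd.DataFrame({
    "Unnamed: 0": range(20),
    "age": [25, 31, 47, 52, 38, 29, 44, 60, 33, 41] * 2,
    "housing": ["yes", "no"] * 10,
    "y": ["no"] * 15 + ["yes"] * 5,
})

# Drop the target and the exported row index before encoding.
X_toy = pd.get_dummies(toy.drop(columns=["y", "Unnamed: 0"]), drop_first=True)
y_toy = toy["y"]

# stratify=y_toy keeps the yes/no ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With an imbalanced target, an unstratified split can leave the test set with noticeably more or fewer 'yes' rows than the training set, which makes the per-class metrics harder to compare between runs.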

## Random Forest


---

a. Defining the model <br>
b. Predict the test set results <br>
c. Check accuracy score <br>
d. Confusion matrix <br>
e. Classification report <br>
f. Results and conclusion <br>

In [9]:
#Defining the model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

In [10]:
#Predict the test set results
y_pred_rf = rf_model.predict(X_test)


In [11]:
#Check accuracy score
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_rf


0.7973856209150327

In [12]:
#Confusion matrix
confusion_rf = confusion_matrix(y_test, y_pred_rf)
confusion_rf

array([[110,   5],
       [ 26,  12]])

In [13]:
#Classification report
report_rf = classification_report(y_test, y_pred_rf)
print(report_rf)

              precision    recall  f1-score   support

          no       0.81      0.96      0.88       115
         yes       0.71      0.32      0.44        38

    accuracy                           0.80       153
   macro avg       0.76      0.64      0.66       153
weighted avg       0.78      0.80      0.77       153
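
The figures in the classification report can be recomputed by hand from the confusion matrix, which makes it clearer what each one measures. The sketch below uses the Random Forest matrix reported above and assumes scikit-learn's convention of rows as true labels and columns as predictions, in sorted label order ('no', then 'yes').

```python
# Confusion matrix reported above; rows are true labels, columns are predictions,
# both in the order ['no', 'yes'].
cm = [[110, 5],
      [26, 12]]

tn, fp = cm[0]  # true 'no':  predicted 'no', predicted 'yes'
fn, tp = cm[1]  # true 'yes': predicted 'no', predicted 'yes'
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_yes = tp / (tp + fp)  # of the predicted 'yes', how many were right
recall_yes = tp / (tp + fn)     # of the actual 'yes', how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(f"accuracy       = {accuracy:.4f}")
print(f"precision(yes) = {precision_yes:.2f}")
print(f"recall(yes)    = {recall_yes:.2f}")
print(f"f1(yes)        = {f1_yes:.2f}")
```

The rounded values match the 'yes' row of the report (0.71, 0.32, 0.44) and the accuracy score of 0.797.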



In [None]:
#Results and conclusion
#The Random Forest reaches 0.80 accuracy on the 153 test rows, but most of that comes from the majority class: recall for 'no' is 0.96.
#For 'yes', precision is 0.71 and recall is only 0.32, so the model finds 12 of the 38 subscribers and misses 26.

## Gradient Boosting Classifier


---

a. Defining the model <br>
b. Predict the test set results <br>
c. Check accuracy score <br>
d. Confusion matrix <br>
e. Classification report <br>
f. Results and conclusion <br>

In [14]:
#Defining the model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

In [15]:
#Predict the test set results
y_pred_gb = gb_model.predict(X_test)

In [16]:
#Check accuracy score
accuracy_gb = accuracy_score(y_test, y_pred_gb)
accuracy_gb

0.8300653594771242

In [17]:
#Confusion matrix
confusion_gb = confusion_matrix(y_test, y_pred_gb)
confusion_gb

array([[110,   5],
       [ 21,  17]])

In [18]:
#Classification report
report_gb = classification_report(y_test, y_pred_gb)
print(report_gb)

              precision    recall  f1-score   support

          no       0.84      0.96      0.89       115
         yes       0.77      0.45      0.57        38

    accuracy                           0.83       153
   macro avg       0.81      0.70      0.73       153
weighted avg       0.82      0.83      0.81       153



In [19]:
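#Results and conclusion
#Gradient Boosting reaches 0.83 accuracy, up from 0.80 for the Random Forest, with the same 110 correct 'no' predictions.
#The gain comes from the 'yes' class: recall rises from 0.32 to 0.45 (17 of 38 subscribers found) and precision from 0.71 to 0.77.
#Gradient Boosting is often more accurate, but it builds its trees sequentially and may take longer to train than Random Forest.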

## Summary of Results

In [20]:
print("Random Forest Accuracy:", accuracy_rf)
print("Gradient Boosting Accuracy:", accuracy_gb)

Random Forest Accuracy: 0.7973856209150327
Gradient Boosting Accuracy: 0.8300653594771242
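
Because the test set has 115 'no' rows and only 38 'yes' rows, accuracy on its own can look better than it is. The sketch below compares both models with a majority-class baseline (always predicting 'no') and with balanced accuracy, taken here as the mean of the two per-class recalls. It uses only the confusion matrices reported in this notebook.

```python
# Confusion matrices reported in this notebook (rows: true, columns: predicted,
# label order ['no', 'yes']).
matrices = {
    "Random Forest": [[110, 5], [26, 12]],
    "Gradient Boosting": [[110, 5], [21, 17]],
}

# Accuracy of always predicting the majority class 'no' on this test set.
n_no, n_yes = 115, 38
baseline = n_no / (n_no + n_yes)
print(f"majority-class baseline accuracy: {baseline:.3f}")

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall_no = tn / (tn + fp)
    recall_yes = tp / (tp + fn)
    balanced = (recall_no + recall_yes) / 2  # mean of per-class recall
    summary[name] = (accuracy, recall_yes, balanced)
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"recall(yes)={recall_yes:.3f}, balanced accuracy={balanced:.3f}")
```

Both models beat the baseline of about 0.75, but the margin is modest; the balanced accuracy values (about 0.64 and 0.70, matching the macro-average recall in the reports) show the difference between the two models more clearly than raw accuracy does.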
