# Machine Learning Classification using Gradient Boost Tree

Classification is a supervised machine learning task in which the output is categorical (true/false, yes/no, or multiple categories such as low/medium/high).


A term deposit is a banking product in which money is lent to an institution for a fixed period and cannot be withdrawn until the end of the term. In this project, we try to predict whether a bank client will subscribe to a term deposit. The banking dataset was downloaded from https://data.world/data-society/bank-marketing-data/, where it is listed among the public datasets. The dataset consists of 21 columns: 20 features and 1 label. All columns are as follows:

#### --> Bank Client Attributes
1 - Age (numeric) 

2 - Job (categorical): type of job (admin, entrepreneur, etc.)

3 - Marital (categorical): marital status (married, unknown...)

4 - Education (categorical): (basic.4y, basic.6y, high school...)

5 - Default (categorical): has credit in default? (yes, no)

6 - Housing (categorical): has housing loan? (yes, no)

7 - Loan (categorical): has personal loan? (yes, no)

#### --> Attributes Related to the Last Contact of the Current Campaign
8 - Contact (categorical): contact communication type (cellular, telephone)

9 - Month (categorical): contact month of year (jan, feb, march...)

10 - Day_of_Week (categorical): last contact day of week (mon, tue, wed...)

11 - Duration (numeric): last contact duration in seconds

#### --> Other Attributes
12 - Campaign (numeric): number of contacts performed during this campaign for this client

13 - pDays (numeric): number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted)

14 - Previous (numeric): number of contacts performed before this campaign for this client

15 - pOutcome (categorical): outcome of the previous marketing campaign (failure, nonexistent...)

#### --> Social and Economic Context Attributes
16 - emp.var.rate (numeric): employment variation rate - quarterly indicator

17 - cons.price.idx (numeric): consumer price index - monthly indicator

18 - cons.conf.idx (numeric): consumer confidence index - monthly indicator

19 - euribor3m (numeric): euribor 3 month rate - daily indicator

20 - nr.employed (numeric): number of employees - quarterly indicator

#### --> Output Variable (related with subscription)
21 - label (binary): has the client subscribed to a term deposit? (yes, no)

## 1. Configuration

In [1]:
from pyspark.sql import SparkSession

# Name the session "spark": later cells call spark.read and spark.createDataFrame
spark = SparkSession.builder\
.master("local[4]")\
.appName("ML-Classification")\
.config("spark.executor.memory","2g")\
.config("spark.driver.memory","3g")\
.getOrCreate()

sc = spark.sparkContext

## 2. Load Dataset

In [2]:
bank_df = spark.read.format("csv")\
.option("header","True")\
.option("inferSchema","True")\
.option("sep",",")\
.load("bank_additional.csv")

In [3]:
bank_df.toPandas().head(10)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,subscription
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no
5,32,services,single,university.degree,no,no,no,cellular,sep,thu,...,3,999,2,failure,-1.1,94.199,-37.5,0.884,4963.6,no
6,32,admin.,single,university.degree,no,yes,no,cellular,sep,mon,...,4,999,0,nonexistent,-1.1,94.199,-37.5,0.879,4963.6,no
7,41,entrepreneur,married,university.degree,unknown,yes,no,cellular,nov,mon,...,2,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no
8,31,services,divorced,professional.course,no,no,no,cellular,nov,tue,...,1,999,1,failure,-0.1,93.2,-42.0,4.153,5195.8,no
9,35,blue-collar,married,basic.9y,unknown,no,no,telephone,may,thu,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no


## 2. Data Understanding 

In [4]:
# Import only the functions used in this notebook
from pyspark.sql.functions import col, regexp_replace, when

### 3.1. Total Data Count

In [5]:
print("Total data count: ", bank_df.count())

Total data count:  4119


### 3.2. Description of dataset

In [6]:
bank_df.describe().toPandas().head()

Unnamed: 0,summary,age,job,marital,education,default,housing,loan,contact,month,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,subscription
0,count,4119.0,4119,4119,4119,4119,4119,4119,4119,4119,...,4119.0,4119.0,4119.0,4119,4119.0,4119.0,4119.0,4119.0,4119.0,4119
1,mean,40.11361981063365,,,,,,,,,...,2.537266326778344,960.4221898519058,0.1903374605486768,,0.0849720806020858,93.57970429716252,-40.49910172371938,3.621355668851656,5166.481694586143,
2,stddev,10.313361547199827,,,,,,,,,...,2.568159237578134,191.92278580077647,0.541788323429031,,1.5631144559116772,0.579348804988967,4.594577506837539,1.7335912227013537,73.66790355721237,
3,min,18.0,admin.,divorced,basic.4y,no,no,no,cellular,apr,...,1.0,0.0,0.0,failure,-3.4,92.201,-50.8,0.635,4963.6,no
4,max,88.0,unknown,unknown,unknown,yes,yes,yes,telephone,sep,...,35.0,999.0,6.0,success,1.4,94.767,-26.9,5.045,5228.1,yes


### 3.3. Checking Null Values

In [7]:
# Report, for each column, whether it contains any null values
for idx, column in enumerate(bank_df.columns, start=1):
    if bank_df.filter(col(column).isNull()).count() > 0:
        print(idx, ".", column, "--> \033[1;31;1mhas null values\033[0m")
    else:
        print(idx, ".", column, "--> \033[1;32;1mis clean\033[0m")

1 . age --> is clean
2 . job --> is clean
3 . marital --> is clean
4 . education --> is clean
5 . default --> is clean
6 . housing --> is clean
7 . loan --> is clean
8 . contact --> is clean
9 . month --> is clean
10 . day_of_week --> is clean
11 . duration --> is clean
12 . campaign --> is clean
13 . pdays --> is clean
14 . previous --> is clean
15 . poutcome --> is clean
16 . emp_var_rate --> is clean
17 . cons_price_idx --> is clean
18 . cons_conf_idx --> is clean
19 . euribor3m --> is clean
20 . nr_employed --> is clean
21 . subscription --> is clean


In [8]:
bank_df.groupBy("subscription").count().toPandas().head()

Unnamed: 0,subscription,count
0,no,3668
1,yes,451
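
The counts above show that the label is strongly imbalanced. As a rough illustration in plain Python (using only the two counts printed above; the variable names are illustrative), the share of each class and the accuracy of a trivial model that always predicts the majority class can be computed as follows:

```python
# Class counts copied from the groupBy("subscription") output above
counts = {"no": 3668, "yes": 451}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most frequent class gives a simple
# lower bound against which accuracy on imbalanced data can be judged
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)
print({label: round(share, 4) for label, share in shares.items()})
print(majority_label, round(baseline_accuracy, 4))
```

Roughly one client in nine subscribed, so an accuracy figure is best read against this baseline rather than against 50%.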


### 3.4. Checking of the Unknown Category in the Dataset

In [9]:
# Report, for each column, whether it contains the "unknown" category
for idx, column in enumerate(bank_df.columns, start=1):
    if bank_df.filter(col(column).contains("unknown")).count() > 0:
        print(idx, ".", column, "--> \033[1;31;1mcontains unknown\033[0m")
    else:
        print(idx, ".", column, "--> \033[1;32;1mis clean\033[0m")

1 . age --> is clean
2 . job --> contains unknown
3 . marital --> contains unknown
4 . education --> contains unknown
5 . default --> contains unknown
6 . housing --> contains unknown
7 . loan --> contains unknown
8 . contact --> is clean
9 . month --> is clean
10 . day_of_week --> is clean
11 . duration --> is clean
12 . campaign --> is clean
13 . pdays --> is clean
14 . previous --> is clean
15 . poutcome --> is clean
16 . emp_var_rate --> is clean
17 . cons_price_idx --> is clean
18 . cons_conf_idx --> is clean
19 . euribor3m --> is clean
20 . nr_employed --> is clean
21 . subscription --> is clean


#### 3.4.1. job feature

In [10]:
bank_df.groupBy("job").count().sort(col("count")).toPandas().head(15)

Unnamed: 0,job,count
0,unknown,39
1,student,82
2,housemaid,110
3,unemployed,111
4,entrepreneur,148
5,self-employed,159
6,retired,166
7,management,324
8,services,393
9,technician,691


#### 3.4.2. marital feature

In [11]:
bank_df.groupBy("marital").count().sort(col("count")).toPandas().head(20)

Unnamed: 0,marital,count
0,unknown,11
1,divorced,446
2,single,1153
3,married,2509


#### 3.4.3. education feature

In [12]:
bank_df.groupBy("education").count().sort(col("count")).toPandas().head(20)

Unnamed: 0,education,count
0,illiterate,1
1,unknown,167
2,basic.6y,228
3,basic.4y,429
4,professional.course,535
5,basic.9y,574
6,high.school,921
7,university.degree,1264


#### 3.4.4. default feature

In [13]:
bank_df.groupBy("default").count().sort(col("count")).toPandas().head(20)

Unnamed: 0,default,count
0,yes,1
1,unknown,803
2,no,3315


#### 3.4.5. housing feature

In [14]:
bank_df.groupBy("housing").count().sort(col("count")).toPandas().head(20)

Unnamed: 0,housing,count
0,unknown,105
1,no,1839
2,yes,2175


#### 3.4.6. loan feature

In [15]:
bank_df.groupBy("loan").count().sort(col("count")).toPandas().head(20)

Unnamed: 0,loan,count
0,unknown,105
1,yes,665
2,no,3349


## 3. Data Cleaning

### 3.1. Remove unknown categories from the features (job, marital, education, housing, loan)

##### 3.1.1. unknown categories in the job feature

In [16]:
bank_df_clean = bank_df.filter(bank_df.job != "unknown")
bank_df_clean.count()

4080

##### 3.1.2. unknown categories in the marital feature

In [17]:
bank_df_clean = bank_df_clean.filter(col("marital") != "unknown")
bank_df_clean.count()

4069

##### 3.1.3. unknown categories in the education feature

In [18]:
bank_df_clean = bank_df_clean.filter(col("education") != "unknown")
bank_df_clean.count()

3915

##### 3.1.4. unknown categories in the housing feature

In [19]:
bank_df_clean = bank_df_clean.filter(col("housing") != "unknown")
bank_df_clean.count()

3811

##### 3.1.5. unknown categories in the loan feature

In [20]:
bank_df_clean = bank_df_clean.filter(col("loan") != "unknown")
bank_df_clean.count()

3811

### 3.2. Removing rare categories

###### 3.2.1. Remove the yes category, which is a rare class in the default feature

In [21]:
bank_df_clean.groupBy("default").count().sort(col("count")).toPandas().head(20)

Unnamed: 0,default,count
0,yes,1
1,unknown,721
2,no,3089


In [22]:
bank_df_clean = bank_df_clean.filter(col("default") != "yes")
bank_df_clean.count()

3810

###### 3.2.2. Remove the illiterate category, which is a rare class in the education feature

In [23]:
bank_df_clean = bank_df_clean.filter(col("education") != "illiterate")
bank_df_clean.count()

3809

### 3.3. Remove the trailing period (.) from admin. in the job feature

In [24]:
# Escape the dot so that it matches a literal "." rather than any character
bank_remove_point = bank_df_clean\
.withColumn("job", regexp_replace(col("job"), r"admin\.", "admin"))

bank_remove_point.filter(col("job").contains("admin")).select("job").toPandas().head()

Unnamed: 0,job
0,admin
1,admin
2,admin
3,admin
4,admin
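
In a regular expression, an unescaped `.` matches any single character, so a pattern meant to strip a literal period needs the dot escaped. The sketch below uses Python's `re` module as a stand-in for Spark's regex engine (both treat `.` this way); the value "administrator" is hypothetical and does not occur in the dataset.

```python
import re

# Hypothetical job values; only "admin." actually appears in the dataset
jobs = ["admin.", "administrator", "services"]

# Unescaped dot: matches "admin" followed by ANY character
loose = [re.sub("admin.", "admin", job) for job in jobs]

# Escaped dot: matches only the literal text "admin."
strict = [re.sub(r"admin\.", "admin", job) for job in jobs]

print(loose)
print(strict)
```

With the unescaped pattern the hypothetical value loses a character, while the escaped pattern leaves it untouched.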


### 3.4. Combining the Education categories

In [25]:
bank_remove_point.groupBy("education").count().toPandas().head(20)

Unnamed: 0,education,count
0,high.school,884
1,basic.6y,220
2,professional.course,517
3,university.degree,1239
4,basic.4y,405
5,basic.9y,544


We have six education categories. We combine some of them into one category as follows:

basic.4y, basic.6y, basic.9y ---> elementary-school

high.school ---> high-school

university.degree ---> university-degree

professional.course ---> professional-course
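
As a quick sanity check on the mapping above, the following plain-Python sketch applies it to the counts printed by the `groupBy("education")` cell. The dictionary names are illustrative and the counts are copied from that output.

```python
from collections import Counter

# Counts copied from the groupBy("education") output above
education_counts = {
    "high.school": 884, "basic.6y": 220, "professional.course": 517,
    "university.degree": 1239, "basic.4y": 405, "basic.9y": 544,
}

# Original category -> combined category
mapping = {
    "basic.4y": "elementary-school",
    "basic.6y": "elementary-school",
    "basic.9y": "elementary-school",
    "high.school": "high-school",
    "university.degree": "university-degree",
    "professional.course": "professional-course",
}

# Sum the counts of all original categories that share a combined name
combined = Counter()
for category, n in education_counts.items():
    combined[mapping[category]] += n

print(dict(combined))
```

The combined counts still add up to the 3809 rows that remain after cleaning, so no row is lost by the regrouping.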

In [26]:
bank_df_combined = bank_remove_point.withColumn("education",
                                               when(col("education").isin("basic.4y","basic.6y","basic.9y"), "elementary-school")
                                                .when(col("education").isin("high.school"), "high-school")
                                                .when(col("education").isin("university.degree"), "university-degree")
                                                .when(col("education").isin("professional.course"), "professional-course")
                                                .otherwise(col("education")))

bank_df_combined.toPandas().head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,subscription
0,30,blue-collar,married,elementary-school,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high-school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high-school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,47,admin,married,university-degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no
4,32,services,single,university-degree,no,no,no,cellular,sep,thu,...,3,999,2,failure,-1.1,94.199,-37.5,0.884,4963.6,no


In [27]:
bank_df_clean = bank_df_combined

## 4. Data Preparation

In [28]:
from pyspark.ml.feature import StringIndexer

We have to transform the categorical features into numeric values. The features to be transformed are as follows:

- job 

- marital

- education

- default

- housing

- loan 

- contact 

- month 

- day_of_week

- poutcome

### 4.1. StringIndexer (Categorical features)

In [29]:
job_indexer = StringIndexer()\
.setInputCol("job")\
.setOutputCol("job_index")

In [30]:
marital_indexer = StringIndexer()\
.setInputCol("marital")\
.setOutputCol("marital_index")

In [31]:
education_indexer = StringIndexer()\
.setInputCol("education")\
.setOutputCol("education_index")

In [32]:
default_indexer = StringIndexer()\
.setInputCol("default")\
.setOutputCol("default_index")

In [33]:
housing_indexer = StringIndexer()\
.setInputCol("housing")\
.setOutputCol("housing_index")

In [34]:
loan_indexer = StringIndexer()\
.setInputCol("loan")\
.setOutputCol("loan_index")

In [35]:
contact_indexer = StringIndexer()\
.setInputCol("contact")\
.setOutputCol("contact_index")

In [36]:
month_indexer = StringIndexer()\
.setInputCol("month")\
.setOutputCol("month_index")

In [37]:
day_of_weeks_indexer = StringIndexer()\
.setInputCol("day_of_week")\
.setOutputCol("day_of_week_index")

In [38]:
poutcomes = StringIndexer()\
.setInputCol("poutcome")\
.setOutputCol("poutcome_index")
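
By default, `StringIndexer` orders labels by descending frequency, so the most frequent category receives index 0.0. The plain-Python sketch below mimics that rule. It is an illustration only: the marital counts are taken from the earlier `groupBy` output (before cleaning), whereas the real indexer is fitted on the training split.

```python
from collections import Counter

# Marital counts from the earlier groupBy output (before cleaning);
# used here only to illustrate the ordering rule
counts = Counter({"married": 2509, "single": 1153, "divorced": 446})

# Most frequent label gets index 0.0, the next one 1.0, and so on
label_to_index = {
    label: float(i) for i, (label, _) in enumerate(counts.most_common())
}

print(label_to_index)
```

The same rule applies to the label column, which is why the majority class of a binary label ends up as 0.0.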

### 4.2. OneHotEncoderEstimator (Categorical Features)

In [39]:
# OneHotEncoderEstimator exists in Spark 2.3-2.4; Spark 3.0+ renames it to OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator

In [40]:
encoder = OneHotEncoderEstimator()\
.setInputCols(["job_index","marital_index","education_index", "default_index",
              "housing_index","loan_index","contact_index","month_index",
               "day_of_week_index","poutcome_index"])\
.setOutputCols(["job_encoded","marital_encoded","education_encoded", "default_encoded",
               "housing_encoded","loan_encoded","contact_encoded","month_encoded",
               "day_of_week_encoded","poutcome_encoded"])
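
One-hot encoding turns each category index into a binary vector. With Spark's default `dropLast=True`, a feature with n categories becomes a vector of length n - 1, and the last category is represented by all zeros. The helper below is a hypothetical plain-Python sketch of that behaviour; it returns dense lists for readability, whereas Spark produces sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index (illustrative helper)."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # With drop_last, the final category is encoded as the all-zeros vector
    if index < size:
        vector[index] = 1.0
    return vector

# Assumed example: three categories with indices 0, 1 and 2
for idx in range(3):
    print(idx, one_hot(idx, 3))
```

Dropping the last category avoids a redundant column, since its value is implied by the others.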

### 4.3. VectorAssembler

In [41]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler()\
.setInputCols(["age","job_encoded","marital_encoded","education_encoded",
              "default_encoded","housing_encoded","loan_encoded","contact_encoded",
              "month_encoded","day_of_week_encoded","duration","campaign",
              "pdays", "previous", "poutcome_encoded", "emp_var_rate","cons_price_idx",
              "cons_conf_idx","euribor3m","nr_employed"])\
.setOutputCol("vectorized_features")

            

### 4.4. LabelIndexer 

In [42]:
label_indexer = StringIndexer()\
.setInputCol("subscription")\
.setOutputCol("label")

### 4.5. Normalization-Standardization

In [43]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler()\
.setInputCol("vectorized_features")\
.setOutputCol("features")
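
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation without centering it. The plain-Python sketch below shows the same arithmetic on a single column; the ages are hypothetical values chosen for illustration.

```python
from statistics import stdev

# Hypothetical ages; not taken from the dataset
ages = [30.0, 39.0, 25.0, 38.0, 47.0]

# Unit-variance scaling without centering: divide by the sample std dev
sigma = stdev(ages)
scaled = [a / sigma for a in ages]

print(round(sigma, 4))
print([round(s, 4) for s in scaled])
```

Because nothing is subtracted, the scaled values stay positive and zero entries stay zero, which keeps sparse one-hot vectors sparse.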

### 4.6. Split dataset into Train and Test 

In [44]:
train_df, test_df = bank_df_clean.randomSplit([0.8, 0.2], seed=142)

## 5. Machine Learning Algorithm (Gradient Boost Tree)

### 5.1. Gradient Boost Tree process

In [45]:
from pyspark.ml.classification import GBTClassifier

In [46]:
gradient_boost_tree = GBTClassifier()\
.setFeaturesCol("features")\
.setLabelCol("label")\
.setPredictionCol("prediction")

### 5.2. Pipeline process

In [47]:
from pyspark.ml import Pipeline

In [48]:
pipeline_obj = Pipeline()\
.setStages([job_indexer, 
            marital_indexer,
            education_indexer,
            default_indexer,
            housing_indexer,
            loan_indexer,
            contact_indexer,
            month_indexer,
            day_of_weeks_indexer,
            poutcomes,
            encoder,
            assembler,
            label_indexer,
            scaler,
            gradient_boost_tree])

In [49]:
pipeline_model = pipeline_obj.fit(train_df)

result = pipeline_model.transform(test_df)

In [50]:
result.select("features","label","prediction").toPandas().head()

Unnamed: 0,features,label,prediction
0,"(1.9530460909171459, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0
1,"(2.0506983954630034, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0
2,"(2.0506983954630034, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.0,1.0
3,"(2.0506983954630034, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0
4,"(2.1483507000088604, 0.0, 0.0, 0.0, 3.44003024...",0.0,0.0


## 6. Evaluating the Model

In [51]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

### 6.1. Accuracy Evaluation

In [52]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName = "accuracy")

gbt = evaluator.evaluate(result)

In [53]:
print(" --- Gradient Boost Tree Classification ---")
print("Accuracy Rate: ", gbt)
print("Error Rate: ", (1.0 - gbt))

 --- Gradient Boost Tree Classification ---
Accuracy Rate:  0.9156939040207522
Error Rate:  0.08430609597924776


### 6.2. Confusion Matrix

In [54]:
from pyspark.mllib.evaluation import MulticlassMetrics

# MulticlassMetrics expects an RDD of (prediction, label) pairs
predictionAndLabel = result.select("prediction","label").rdd

metrics = MulticlassMetrics(predictionAndLabel)
cm = metrics.confusionMatrix()
rows = cm.toArray().tolist()

# Rows are the actual classes (no, yes); columns are the predicted classes
confusion_matrix = spark.createDataFrame(rows,["predicted_no","predicted_yes"])
confusion_matrix.show()

+------------+-------------+
|predicted_no|predicted_yes|
+------------+-------------+
|       669.0|         32.0|
|        33.0|         37.0|
+------------+-------------+



In [55]:
result.groupBy("label","prediction").count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  1.0|       1.0|   37|
|  0.0|       1.0|   32|
|  1.0|       0.0|   33|
|  0.0|       0.0|  669|
+-----+----------+-----+
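
Accuracy alone can hide how the model treats the minority class. The sketch below derives precision, recall and F1 for the positive class from the four counts printed above. It assumes that label 1.0 corresponds to "yes", which follows from `StringIndexer`'s default frequency ordering; the variable names are illustrative.

```python
# Counts copied from the groupBy("label", "prediction") output above,
# treating label 1.0 (assumed to be "yes") as the positive class
tp, fp, fn, tn = 37, 32, 33, 669

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (label 0.0) on this test set
baseline_accuracy = (tn + fp) / total

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1),
                    ("baseline", baseline_accuracy)]:
    print(f"{name}: {value:.4f}")
```

On these counts the accuracy is less than one percentage point above the majority-class baseline, and precision and recall for the subscribed class are both close to 0.53, so the headline accuracy should be interpreted with the class imbalance in mind.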



##### Results

As a result, an accuracy rate of 91.57% (0.9157) was obtained on the test set using Gradient Boost Tree.