# Bank Marketing

## Overview

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').  

We chose this dataset for two reasons. First, it is large enough that the training, validation and test subsets each remain sizeable after splitting, which should help our algorithms learn well and improve their accuracy and reliability. Second, marketing is a crucial part of every business, so learning how to win a customer is very interesting to us, since we hope to make an impact in the business field in the future.

## Dataset Overview
### The Dataset has 20 Input Variables (Features):

**#bank client data:**  
1 - _Age_ (numeric)  
2 - _Job_ **:** type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')  
3 - _Marital_ **:** marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)  
4 - _Education_ (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')  
5 - _Default_ **:** has credit in default? (categorical: 'no', 'yes', 'unknown')  
6 - _Housing_ **:** has housing loan? (categorical: 'no', 'yes', 'unknown')  
7 - _Loan_ **:** has personal loan? (categorical: 'no', 'yes', 'unknown')  

**#related with the last contact of the current campaign:**  
8 - _Contact_ **:** contact communication type (categorical: 'cellular', 'telephone')  
9 - _Month_ **:** last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')  
10 - _Day_of_week_ **:** last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')  
11 - _Duration_ **:** last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.  

**#other attributes:**  
12 - _Campaign_ **:** number of contacts performed during this campaign and for this client (numeric, includes last contact)  
13 - _pDays_ **:** number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)  
14 - _Previous_ **:** number of contacts performed before this campaign and for this client (numeric)  
15 - _pOutcome_ **:** outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')  

**#social and economic context attributes**  
16 - _emp.var.rate_ **:** employment variation rate - quarterly indicator (numeric)  
17 - _cons.price.idx_ **:** consumer price index - monthly indicator (numeric)  
18 - _cons.conf.idx_ **:** consumer confidence index - monthly indicator (numeric)  
19 - _Euribor3m_ **:** euribor 3 month rate - daily indicator (numeric)  
20 - _nr.employed_ **:** number of employees - quarterly indicator (numeric)  

### Output Variable (Desired Target):
y - has the client subscribed to a term deposit? (binary: 'yes', 'no')

## Contributors:
* Phillip Moyo – 2185695   
* Moshito Charles Makgakga – 1445435   
* Godfrey T Chamunogwa – 2234379
* Fankholoro Vincent Sebothoma – 1671848   

# Logistic Regression

# Import Libraries

In [112]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Import the Dataset


In [113]:
df = pd.read_csv('bank-full.csv', sep=";")
display(df)

|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45206 | 51 | technician | married | tertiary | no | 825 | no | no | cellular | 17 | nov | 977 | 3 | -1 | 0 | unknown | yes |
| 45207 | 71 | retired | divorced | primary | no | 1729 | no | no | cellular | 17 | nov | 456 | 2 | -1 | 0 | unknown | yes |
| 45208 | 72 | retired | married | secondary | no | 5715 | no | no | cellular | 17 | nov | 1127 | 5 | 184 | 3 | success | yes |
| 45209 | 57 | blue-collar | married | secondary | no | 668 | no | no | telephone | 17 | nov | 508 | 4 | -1 | 0 | unknown | no |


## Trimming the Data
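The data is heavily imbalanced: 'no' outcomes far outnumber 'yes' outcomes. We therefore trim the 'no' class down to the same number of rows as the 'yes' class, so that both output classes are equally represented.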

In [114]:
df_yes = df[df['y']=='yes']
df_no = df[df['y']=='no']
df_no = df_no.iloc[:len(df_yes), :]                 #keep as many 'no' rows as there are 'yes' rows (5289)

df = pd.concat([df_yes, df_no])
df = df.sample(frac=1).reset_index(drop=True)       #shuffle the rows
display(df)

|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47 | blue-collar | married | unknown | no | 306 | yes | no | unknown | 5 | may | 13 | 1 | -1 | 0 | unknown | no |
| 1 | 25 | student | single | secondary | no | 1868 | no | no | cellular | 26 | oct | 259 | 1 | 103 | 2 | other | yes |
| 2 | 43 | blue-collar | married | primary | no | 1401 | yes | no | unknown | 12 | may | 195 | 2 | -1 | 0 | unknown | no |
| 3 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 4 | 52 | admin. | single | secondary | no | 2398 | yes | no | cellular | 3 | nov | 412 | 1 | -1 | 0 | unknown | yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10573 | 27 | self-employed | single | tertiary | no | -503 | yes | no | unknown | 9 | may | 98 | 3 | -1 | 0 | unknown | no |
| 10574 | 34 | management | married | tertiary | no | 0 | yes | no | unknown | 21 | may | 88 | 2 | -1 | 0 | unknown | no |
| 10575 | 39 | technician | married | secondary | no | 116 | no | no | cellular | 22 | may | 554 | 2 | -1 | 0 | unknown | yes |
| 10576 | 57 | management | married | tertiary | no | 10583 | no | no | cellular | 28 | sep | 341 | 3 | 98 | 3 | success | yes |
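
The trimming cell keeps the first 'no' rows in file order, so the retained rows may all come from the same stretch of the campaign. One possible alternative, sketched here on a small made-up frame (the `toy` data and the `random_state` value are assumptions, not part of the notebook), is to draw the 'no' rows at random with a fixed seed so the result is reproducible:

```python
import pandas as pd

# Toy stand-in for the bank data: 8 'no' rows and 3 'yes' rows (hypothetical values).
toy = pd.DataFrame({
    "age": [30, 41, 25, 52, 38, 47, 33, 60, 29, 45, 36],
    "y": ["no"] * 8 + ["yes"] * 3,
})

minority = toy[toy["y"] == "yes"]
majority = toy[toy["y"] == "no"]

# Draw as many majority rows as there are minority rows, at random,
# instead of taking the first N rows in file order.
majority_down = majority.sample(n=len(minority), random_state=0)

balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=0)   # shuffle the rows
    .reset_index(drop=True)
)

print(balanced["y"].value_counts())
```

Fixing `random_state` also makes the later train/validation/test slices identical from run to run.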


# Encoding Categorical Features

In [115]:
##--encoding column(y)=> 'no'=0 and 'yes'=1; there are no null values in our dataset---##
y_encoder = LabelEncoder()
df['y'] = y_encoder.fit_transform(df['y'])

In [116]:
##--encoding column(poutcome)=> 'failure'=0, 'other'=1, 'success'=2, 'unknown'=3
poutcome = LabelEncoder()
df['poutcome'] = poutcome.fit_transform(df['poutcome'])

In [117]:
##--encoding column(contact)=> 'cellular'=0, 'telephone'=1, 'unknown'=2
contact = LabelEncoder()
df['contact'] = contact.fit_transform(df['contact'])

In [118]:
##--encoding column(marital)=> 'divorced'=0, 'married'=1, 'single'=2
marital = LabelEncoder()
df['marital'] = marital.fit_transform(df['marital'])

In [119]:
##--encoding column(education)=> 'primary'=0, 'secondary'=1, 'tertiary'=2, 'unknown'=3
education = LabelEncoder()
df['education'] = education.fit_transform(df['education'])

In [120]:
##--encoding column(default)=> 'no'=0, 'yes'=1
default = LabelEncoder()
df['default'] = default.fit_transform(df['default'])

In [121]:
##--encoding column(housing)=> 'no'=0, 'yes'=1
housing = LabelEncoder()
df['housing'] = housing.fit_transform(df['housing'])

In [122]:
##--encoding column(loan)=> 'no'=0, 'yes'=1
loan = LabelEncoder()
df['loan'] = loan.fit_transform(df['loan'])

In [123]:
##--encoding column(month)=> 'jan'=1, 'feb'=2, 'mar'=3, ..., 'dec'=12
month_order = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
df['month'] = df['month'].map({name: number for number, name in enumerate(month_order, start=1)})

In [124]:
##--encoding column(job)=>
#['admin.'=1, 'blue-collar'=2, 'entrepreneur'=3, 'housemaid'=4, 'management'=5,
#'retired'=6, 'self-employed'=7, 'services'=8, 'student'=9, 'technician'=10,
#'unemployed'=11, 'unknown'=12]

job_order = ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
       'retired', 'self-employed', 'services', 'student', 'technician',
       'unemployed', 'unknown']
df['job'] = df['job'].map({name: number for number, name in enumerate(job_order, start=1)})
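
Integer codes give nominal columns such as `job` or `marital` an artificial order (for example 'admin.' < 'blue-collar'), which a linear model like logistic regression reads as a numeric magnitude. A common alternative is one-hot encoding. The sketch below applies `pd.get_dummies` to a small made-up frame; the `toy` data is an assumption, and whether one-hot encoding helps on this dataset has not been tested here:

```python
import pandas as pd

# Three hypothetical rows with two nominal columns.
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "marital": ["married", "single", "married"],
})

# Each category becomes its own 0/1 column; numeric columns are left untouched.
encoded = pd.get_dummies(toy, columns=["job", "marital"], dtype=int)

print(encoded.columns.tolist())
```

The original `job` and `marital` columns are replaced by one indicator column per category, so no ordering between categories is implied.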

# Splitting the Dataset into Training, Validation and Test Sets

#### Training data (60% of the data)

In [125]:
# training dataset
train_data = df.iloc[:6347]

# training features
train_features = train_data.iloc[:,:-1].values

# training targets
train_targets = train_data.iloc[:,-1].values


#### Validation data (20% of the data)

In [126]:
# validation dataset
validate_data = df.iloc[6347:8463]

# validation features
validate_features = validate_data.iloc[:,:-1].values

# validation targets
validate_targets = validate_data.iloc[:,-1].values


#### Testing data (20% of the data)

In [127]:
# testing dataset
test_data = df.iloc[8463:]

# testing features
test_features = test_data.iloc[:,:-1].values

# testing targets
test_targets = test_data.iloc[:,-1].values
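
The same 60/20/20 split can also be produced without hard-coded row indices. The sketch below is one possible approach, assuming scikit-learn's `train_test_split` and synthetic stand-in arrays (`features`, `targets` and the `random_state` value are illustrative, not taken from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 4 features, balanced binary target.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 4))
targets = np.array([0, 1] * 50)

# First hold out 20% for testing, then split the remaining 80% into 60/20.
X_rest, X_test, y_rest, y_test = train_test_split(
    features, targets, test_size=0.20, stratify=targets, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the 'yes'/'no' ratio the same in every subset, and the proportions stay correct if the number of rows changes.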

# Training the Logistic Regression Model on the Training Set
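
A minimal sketch of this step, assuming scikit-learn's `LogisticRegression` preceded by a `StandardScaler` in a pipeline. The synthetic `train_features` and `train_targets` below only stand in for the arrays built in the split above, and the `max_iter` value is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training arrays (hypothetical data).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(200, 3))
train_targets = (train_features[:, 0] + 0.5 * train_features[:, 1] > 0).astype(int)

# Standardise the features, then fit the logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(train_features, train_targets)

print(model.score(train_features, train_targets))  # accuracy on the training data
```

Standardising first puts columns with very different ranges, such as `balance` and `campaign`, on a comparable scale, which usually helps the solver converge.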

# Predicting the Test Results
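
A hedged sketch of how predictions could be obtained from a fitted model, again on synthetic data (the arrays and the seed are assumptions). `predict` returns hard 0/1 labels, while `predict_proba` returns the estimated probability of each class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: the target depends only on the sign of the first feature.
rng = np.random.default_rng(1)
train_features = rng.normal(size=(200, 2))
train_targets = (train_features[:, 0] > 0).astype(int)

# Two hypothetical test rows, one clearly on each side of the boundary.
test_features = np.array([[3.0, 0.0], [-3.0, 0.0]])

model = LogisticRegression().fit(train_features, train_targets)

predicted_labels = model.predict(test_features)             # hard class labels
predicted_probs = model.predict_proba(test_features)[:, 1]  # P(y = 1) per row

print(predicted_labels, predicted_probs)
```

The probabilities are useful when a decision threshold other than 0.5 is wanted, for example to contact only the clients most likely to subscribe.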


# Analyzing the Accuracy of the Model
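
One way to quantify accuracy, sketched with scikit-learn's `accuracy_score` and `confusion_matrix`. The two label lists are made-up examples standing in for `test_targets` and the model's predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for eight clients (1 = 'yes', 0 = 'no').
test_targets     = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(test_targets, predicted_labels)
matrix = confusion_matrix(test_targets, predicted_labels)

print(accuracy)  # 6 of 8 predictions are correct -> 0.75
print(matrix)    # rows: actual no/yes, columns: predicted no/yes -> [[3 1] [1 3]]
```

The confusion matrix separates the two kinds of error: a client wrongly predicted to subscribe (top right) and a subscriber the model missed (bottom left).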



# Visualizing Set Results
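
A possible visualisation, sketched with matplotlib: a confusion matrix drawn as a heat map. The counts in `matrix` are hypothetical, and the `Agg` backend line is only there so the sketch runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

matrix = np.array([[3, 1], [1, 3]])  # hypothetical counts: rows actual, columns predicted
labels = ["no", "yes"]

fig, ax = plt.subplots()
image = ax.imshow(matrix, cmap="Blues")

# Label both axes with the class names.
ax.set_xticks([0, 1])
ax.set_xticklabels(labels)
ax.set_yticks([0, 1])
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")

# Write each count in the middle of its cell.
for row in range(2):
    for col in range(2):
        ax.text(col, row, str(matrix[row, col]), ha="center", va="center")

fig.colorbar(image, ax=ax)
```

In a notebook the figure is displayed when the cell finishes; in a script it can be written to disk with `fig.savefig`.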

