# Machine Learning Engineer Nanodegree
## Using Supervised Classification Algorithms to Predict Bank Term Deposit Subscription
Fabiano Shoji Yoschitaki  
June 28th, 2018

## Project Design

As described in the capstone proposal document, the project consists of the following activities:

- **Data and Library Loading:** load the Bank Marketing dataset in CSV format from the UCI Machine Learning Repository, along with all the libraries needed for the project.

- **Data Exploration:** visualize the data, print some samples, check its dimensions, identify the most relevant features, and show its statistical summary.

- **Data Preparation:** after exploring the data, apply the pre-processing tasks: clean the data, remove null values, convert categorical features into dummy/indicator variables, and split the data into training and testing datasets.

- **Model Selection:** with the prepared data, try several supervised classification algorithms, compare their results, and choose the best one (by accuracy score) for model tuning.

- **Model Tuning:** after choosing the best model, apply grid search cross-validation to tune its hyper-parameters.

- **Final Evaluation:** evaluate the accuracy score of the tuned model on the testing dataset.
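The preparation, selection, tuning, and evaluation activities above could be sketched end to end roughly as follows. This is a minimal illustration on a synthetic stand-in table, not the project's actual code: the `toy` frame and its values, the two candidate models, and the parameter grids are all assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (hypothetical values, not the real dataset).
rng = np.random.RandomState(42)
n = 400
toy = pd.DataFrame({
    "age": rng.randint(18, 90, size=n),
    "balance": rng.normal(1000, 500, size=n),
    "marital": rng.choice(["married", "single", "divorced"], size=n),
})
# The toy target depends on balance so the models have something to learn.
toy["y"] = np.where(toy["balance"] + rng.normal(0, 200, size=n) > 1000, "yes", "no")

# Data preparation: one-hot encode categorical features and binarize the target.
X = pd.get_dummies(toy.drop(columns="y"), dtype=float)
y = (toy["y"] == "yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Model selection: compare candidates by mean cross-validated accuracy.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Model tuning: grid search with cross-validation on the best candidate only.
param_grids = {
    "LogisticRegression": {"C": [0.1, 1.0, 10.0]},
    "DecisionTree": {"max_depth": [2, 4, 8]},
}
search = GridSearchCV(candidates[best_name], param_grids[best_name],
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Final evaluation: score the tuned model once on the held-out test split.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("Best model: {} (CV accuracy {:.3f})".format(best_name, cv_scores[best_name]))
print("Best parameters: {}".format(search.best_params_))
print("Test accuracy: {:.3f}".format(test_accuracy))
```

The test split is touched only in the last step, so the reported test accuracy is not influenced by model selection or tuning.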

-----------
### 1. Data and Library Loading
In this section, we will load the dataset and the libraries used in the project.  

#### 1.1. Library Loading
Loading all libraries needed for the project.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from time import time
from IPython.display import display
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

#### 1.2. Data Loading
Loading the dataset from the CSV file.

In [None]:
bank_full_data = pd.read_csv('bank-full.csv', delimiter=';')
print("Bank dataset was loaded successfully!")

-----------
### 2. Data Exploration
Here we will apply some methods/techniques for Exploratory Data Analysis to better understand the data.

#### 2.1. Data Dimensions
Printing the number of rows and columns in the data.

In [None]:
print("The dataset has {} rows and {} columns".format(bank_full_data.shape[0], bank_full_data.shape[1]))

#### 2.2. Data Info
Printing information about column dtypes, non-null counts, and memory usage.

In [None]:
bank_full_data.info()
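`info()` only counts true nulls. To the best of our knowledge, the UCI Bank Marketing files encode missing categorical values as the string `"unknown"` rather than as `NaN`, so such entries would not appear in the non-null counts. A minimal sketch of how they could be counted, using a hypothetical miniature frame (`sample`) in place of `bank_full_data`:

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few bank-full.csv columns.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "unknown"],
    "education": ["tertiary", "secondary", "unknown", "primary"],
    "age": [30, 45, 52, 38],
})

# True NaN counts per column (this is what info() reflects).
nan_counts = sample.isnull().sum()

# Count "unknown" placeholders in the non-numeric columns only.
categorical = sample.select_dtypes(exclude="number")
unknown_counts = (categorical == "unknown").sum()

print(nan_counts)
print(unknown_counts)
```

Whether to drop, impute, or keep `"unknown"` as its own category is a modeling decision that this sketch does not make.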

#### 2.3. Data Samples
Printing the first 10 rows of the data.

In [None]:
bank_full_data.head(10)

#### 2.4. Data Descriptive Statistics
Visualizing statistical summary of the data.

In [None]:
bank_full_data.describe()

#### 2.5. Data General Information
Computing how many clients subscribed and the overall subscription rate.

In [None]:
# Calculate number of clients
n_clients = len(bank_full_data)

# Calculate clients who have subscribed
n_clients_subscribed = len(bank_full_data[bank_full_data['y'] == 'yes'])

# Calculate clients who haven't subscribed
n_clients_not_subscribed = len(bank_full_data[bank_full_data['y'] == 'no'])

# Calculate subscription rate (as a percentage)
subscription_rate = float(n_clients_subscribed)/float(n_clients) * 100

# Print the results
print("Total number of clients: {}".format(n_clients))
print("Number of clients who have subscribed: {}".format(n_clients_subscribed))
print("Number of clients who haven't subscribed: {}".format(n_clients_not_subscribed))
print("Subscription rate of the dataset: {:.2f}%".format(subscription_rate))
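The same rate can also be obtained with `value_counts(normalize=True)`, which additionally gives the share of the majority class. That share is the accuracy a trivial model would reach by always predicting the most frequent class, so it can serve as a baseline when accuracy is the selection metric. A small sketch with a hypothetical target column (the ten values below are made up, not taken from the dataset):

```python
import pandas as pd

# Hypothetical target column standing in for bank_full_data['y'].
y = pd.Series(["no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"])

# Share of each class, as percentages.
class_share = y.value_counts(normalize=True) * 100
subscription_rate = class_share["yes"]

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = class_share.max()

print("Subscription rate: {:.2f}%".format(subscription_rate))
print("Majority-class baseline accuracy: {:.2f}%".format(majority_baseline))
```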

#### 2.6. Visualization
Generating some graphs for visualization.

In [None]:
plt.figure(figsize=(8,5))
plt.title("Distribution of Clients Subscribed vs Not Subscribed")
bank_full_data.groupby("y")['y'].count().plot.bar()

In [None]:
age_histogram = sns.distplot(bank_full_data['age'], bins=10)
plt.title('Distribution by Age')
age_histogram.figure.set_size_inches(16,6)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
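# numeric_only=True restricts the correlation to numeric columns;
# recent pandas versions raise an error on string columns otherwise.
sns.heatmap(bank_full_data.corr(numeric_only=True), annot=True, cmap="Blues")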
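Because the target `y` holds the strings `yes`/`no`, it is not part of a numeric correlation matrix. One possible way to see how the numeric features relate to the target is to encode it as 0/1 first. The sketch below assumes a hypothetical toy frame; `duration` and `campaign` are column names from the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "duration": [100, 250, 400, 50, 600, 300],
    "campaign": [3, 1, 2, 5, 1, 2],
    "y": ["no", "no", "yes", "no", "yes", "yes"],
})

# Encode the target as 1 (subscribed) / 0 (not subscribed).
encoded = toy.assign(y=(toy["y"] == "yes").astype(int))

# Pearson correlation of each numeric feature with the encoded target.
target_corr = encoded.corr()["y"].drop("y").sort_values(ascending=False)
print(target_corr)
```

Pearson correlation with a binary target only captures linear association, so a weak value here does not by itself mean a feature is uninformative.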

-----------
### 3. Data Preparation

-----------
### 4. Model Selection

-----------
### 5. Model Tuning

-----------
### 6. Final Evaluation