# Machine Learning Engineer Nanodegree
## Using Supervised Classification Algorithms to Predict Bank Term Deposit Subscription
Fabiano Shoji Yoschitaki  
June 28th, 2018

## Project Design

As described in the capstone proposal document, the project is composed of the following activities:

- **Data and Library Loading:** the first step is to load the Bank Marketing data set in CSV format from the UCI Machine Learning Repository, along with all the libraries needed for the project.

- **Data Exploration:** in this step, we'll visualize the data, print some samples, check its dimensions, check the most relevant features, and show its statistical summary.

- **Data Preparation:** after exploring the data, we'll pre-process it: clean the data, remove null values, convert categorical features into dummy/indicator variables, and split the data into training and testing datasets.

- **Model Selection:** with the prepared data, we'll try various supervised classification algorithms, compare their results, and choose the best one (by accuracy score) for model tuning.

- **Model Tuning:** after choosing the best model, we'll apply grid search with cross-validation to tune its hyper-parameters.

- **Final Evaluation:** in this step, we'll evaluate the accuracy score of the tuned model on the testing dataset.

-----------
### 1. Data and Library Loading
In this section, we will load the dataset and the libraries used in the project.  

#### 1.1. Library Loading
Loading all libraries needed for the project.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from time import time
from IPython.display import display
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import preprocessing, svm
from xgboost.sklearn import XGBClassifier
# GridSearchCV and train_test_split live in sklearn.model_selection
# (the old sklearn.grid_search and sklearn.cross_validation modules were removed)
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score, train_test_split
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')
sns.set(style="whitegrid")
%matplotlib inline

#### 1.2. Data Loading
Loading the dataset from the CSV file.

In [None]:
bank_full_data = pd.read_csv('bank-full.csv', delimiter=';')
print("Bank dataset was loaded successfully!")

-----------
### 2. Data Exploration
Here we will apply some methods/techniques for Exploratory Data Analysis to better understand the data.

#### 2.1. Data Dimensions
Printing the number of rows and columns of the data.

In [None]:
print("The dataset has {} rows and {} columns".format(bank_full_data.shape[0], bank_full_data.shape[1]))

#### 2.2. Data Information
Printing information about column dtypes, non null values and memory usage.

In [None]:
bank_full_data.info()

#### 2.3. Data Samples
Printing the first 10 rows of the data.

In [None]:
bank_full_data.head(10)

#### 2.4. Data Descriptive Statistics
Visualizing statistical summary of the data.

In [None]:
bank_full_data.describe()

#### 2.5. Data General Information
Exploring general information about the target feature `y` (subscribed/not subscribed).

In [None]:
# Calculate number of clients
n_clients = len(bank_full_data)

# Calculate clients who have subscribed
n_clients_subscribed = len(bank_full_data[bank_full_data['y'] == 'yes'])

# Calculate clients who haven't subscribed
n_clients_not_subscribed = len(bank_full_data[bank_full_data['y'] == 'no'])

# Calculate subscription rate
subscription_rate = float(n_clients_subscribed)/float(n_clients) * 100

# Print the results
print("Total number of clients: {}".format(n_clients))
print("Number of clients who have subscribed: {}".format(n_clients_subscribed))
print("Number of clients who haven't subscribed: {}".format(n_clients_not_subscribed))
print("Subscription rate of the dataset: {:.2f}%".format(subscription_rate))
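
The same class counts can also be read off with `value_counts`. The sketch below uses a small made-up `toy` frame rather than the bank data, and also derives the accuracy of a model that always predicts the most frequent class, which is a useful reference point when accuracy is the selection metric.

```python
import pandas as pd

# Toy stand-in for the 'y' column of the bank data (values are made up)
toy = pd.DataFrame({'y': ['no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no']})

# Share of each class, as a percentage
class_share = toy['y'].value_counts(normalize=True) * 100
print(class_share.round(2))

# Accuracy a model would reach by always predicting the most frequent class
majority_baseline = class_share.max()
print("Majority-class baseline accuracy: {:.2f}%".format(majority_baseline))
```

On the real data, replacing `toy` with `bank_full_data` should give the same rates as the cell above.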

#### 2.6. Visualization
Generating some graphs for visualization.

In [None]:
plt.figure(figsize=(8,5))
plt.title("Distribution of Clients Subscribed vs Not Subscribed")
bank_full_data.groupby("y")['y'].count().plot.bar()

In [None]:
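# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
age_histogram = sns.histplot(bank_full_data['age'], bins=10, kde=True)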
plt.title('Distribution by Age')
age_histogram.figure.set_size_inches(12,6)
plt.show()

In [None]:
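figure = plt.figure(figsize=(12, 6))
# Correlations are only defined for the numeric columns
corr = bank_full_data.select_dtypes(include='number').corr()
# Mask the upper triangle so each pair of features appears only once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, annot=True, cmap="Blues")
figure.suptitle('Correlation Matrix', fontsize=15)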

In [None]:
#pd.plotting.scatter_matrix(bank_full_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

-----------
### 3. Data Preparation
In this section we will apply some methods/techniques for Data Preprocessing.

#### 3.1. Checking for null values

In [None]:
bank_full_data.isnull().sum()
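
As far as the UCI dataset description indicates, missing categorical values in the Bank Marketing data are recorded as the string `'unknown'` rather than as NaN, so `isnull()` alone may report zero missing values. A minimal sketch of counting those placeholders, using a made-up `toy` frame instead of the real data:

```python
import pandas as pd

# Toy frame that mimics the 'unknown' placeholder (values are made up)
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'technician', 'unknown'],
    'education': ['secondary', 'tertiary', 'unknown', 'primary'],
    'age': [30, 41, 52, 28],
})

# Count 'unknown' placeholders per text column; numeric columns are skipped
text_cols = toy.select_dtypes(exclude='number')
unknown_counts = (text_cols == 'unknown').sum()
print(unknown_counts)
```

Whether to drop, impute, or keep `'unknown'` as its own category is a modelling choice; the preprocessing in this notebook keeps it as a category.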

#### 3.1. Preprocessing Features
Applying `pandas.get_dummies` to convert categorical features into binary indicator variables. Also, we'll replace 'yes' -> 1 and 'no' -> 0.

In [None]:
def preprocess_features(X):
    ''' Preprocesses the bank data: converts yes/no columns into binary (1/0)
        variables and other categorical columns into dummy variables. '''

    # Initialize new output DataFrame
    output = pd.DataFrame(index=X.index)

    # Investigate each feature column for the data
    # (DataFrame.iteritems was removed in pandas 2.0; items() is the equivalent)
    for col, col_data in X.items():

        if col_data.dtype == object:
            if set(col_data.unique()) <= {'yes', 'no'}:
                # Binary yes/no column: map the values to 1/0
                col_data = col_data.map({'yes': 1, 'no': 0})
            else:
                # Categorical column: convert to dummy variables
                # Example: 'marital' => 'marital_divorced', 'marital_married', 'marital_single'
                col_data = pd.get_dummies(col_data, prefix=col)

        # Collect the revised columns
        output = output.join(col_data)

    return output

In [None]:
bank_full_data = preprocess_features(bank_full_data)
print("Processed feature columns ({} total features): \n{}".format(len(bank_full_data.columns), list(bank_full_data.columns)))
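
To make the two encodings concrete, here is a small sketch on a made-up `toy` frame (not the bank data). It applies the same two steps by hand: a yes/no column mapped to 1/0, and a multi-valued categorical column expanded with `pd.get_dummies`.

```python
import pandas as pd

# Toy frame with one numeric, one categorical and one yes/no column (values are made up)
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['married', 'single', 'married'],
    'housing': ['yes', 'no', 'yes'],
})

# Binary yes/no column mapped to 1/0
housing = toy['housing'].map({'yes': 1, 'no': 0})

# Multi-valued categorical column expanded into one indicator column per category
marital = pd.get_dummies(toy['marital'], prefix='marital')

encoded = pd.concat([toy[['age']], marital, housing], axis=1)
print(list(encoded.columns))
print(encoded)
```

Each category becomes its own column, which is why the processed dataset has many more columns than the raw one.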

#### 3.2. Identifying Feature and Target Columns

In [None]:
# Extract feature columns
feature_cols = list(bank_full_data.columns[:-1])

# Extract target column 'y' (subscribed/not subscribed)
target_col = bank_full_data.columns[-1] 

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = bank_full_data[feature_cols]
y_all = bank_full_data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())

In [None]:
X_all.head(10)

#### 3.3. Splitting Data into Training and Testing datasets

In [None]:
# Shuffle and split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

print("Training set has {} samples with {:.2f}% of 'yes' (subscribed) and {:.2f}% of 'no' (not subscribed)."
      .format(X_train.shape[0], 
        100 * len(y_train[y_train == 1])/len(y_train), 
        100 * len(y_train[y_train == 0])/len(y_train)))

print("Testing set has {} samples with {:.2f}% of 'yes' (subscribed) and {:.2f}% of 'no' (not subscribed)."
      .format(X_test.shape[0], 
        100 * len(y_test[y_test == 1])/len(y_test), 
        100 * len(y_test[y_test == 0])/len(y_test)))

In [None]:
bank_full_data.isnull().sum()
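
When the target classes are imbalanced, a plain random split can leave the training and testing sets with slightly different class proportions. One option, not used in the split above, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 10% positives (made-up data, not the bank data)
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 180)

# stratify=y keeps the class proportions (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)

print("Train positive rate: {:.2f}%".format(100 * y_tr.mean()))
print("Test positive rate:  {:.2f}%".format(100 * y_te.mean()))
```

For this notebook the equivalent change would be passing `stratify=y_all` to the existing call.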

-----------
### 4. Model Selection
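
This section is not filled in yet. As a rough sketch of the comparison described in the project design, the code below scores a few candidate classifiers with 5-fold cross-validated accuracy. It runs on synthetic data from `make_classification`, not the bank data; the candidate list, the `cv=5` setting, and the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (not the bank data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=10)

# Illustrative subset of the classifiers imported in section 1.1
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=10),
    'KNeighbors': KNeighborsClassifier(),
    'GaussianNB': GaussianNB(),
}

# Mean 5-fold cross-validated accuracy per candidate
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print("{}: {:.3f}".format(name, cv_scores[name]))

best_name = max(cv_scores, key=cv_scores.get)
print("Best by CV accuracy:", best_name)
```

With the real data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`, so the testing set stays untouched until the final evaluation.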

-----------
### 5. Model Tuning
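
This section is not filled in yet. The sketch below shows what grid search with cross-validation could look like, using `GridSearchCV` on synthetic data. The choice of `RandomForestClassifier`, the parameter grid, and `cv=3` are hypothetical; the actual model and ranges depend on the outcome of the model selection step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (not the bank data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=10)

# Hypothetical, deliberately small grid; real ranges depend on the chosen model
param_grid = {'n_estimators': [20, 50], 'max_depth': [3, None]}

# Every parameter combination is scored with 3-fold cross-validated accuracy
grid = GridSearchCV(RandomForestClassifier(random_state=10),
                    param_grid, scoring='accuracy', cv=3)
grid.fit(X_demo, y_demo)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy: {:.3f}".format(grid.best_score_))
```

After fitting, `grid.best_estimator_` holds the model refit on the full training data with the best parameters found.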

-----------
### 6. Final Evaluation
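
This section is not filled in yet. As a sketch of the final step, the code below fits a model on a training split and evaluates it on a held-out split, using the metrics imported in section 1.1. The data is synthetic and imbalanced by construction, and `LogisticRegression` only stands in for whichever tuned model comes out of the previous step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (roughly 90% / 10% classes)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.9, 0.1], random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo)

# Placeholder for the tuned model from the previous step
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)
print("Test accuracy: {:.3f}".format(acc))
print(cm)
print(classification_report(y_te, y_pred))
```

Because the classes are imbalanced, the confusion matrix and per-class precision/recall are worth reading alongside the accuracy score: a model can reach high accuracy while rarely predicting the minority class.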