# Banking Project

***

>The bank wants to improve their services. For instance, the bank managers have only vague idea, who is a good client (whom to offer some additional services) and who is a bad client (whom to watch carefully to minimize the bank loses). Fortunately, the bank stores data about their clients, the accounts (transactions within several months), the loans already granted, the credit cards issued. The bank managers hope to improve their understanding of customers and seek specific actions to improve services. A mere application of a discovery tool will not be convincing for them.  

>To test a data mining approach to help the bank managers, it was decided to address two problems, a descriptive and a predictive one. While the descriptive problem was left open, the predictive problem is the prediction of whether a loan will end successfully.

> _in Banking Case Description, ECAC Moodle Page_

***

[Kaggle Challenge Page](https://www.kaggle.com/)

The steps performed are as follows:
* Data Loading and Preparation
* Descriptive Data Mining & Feature Engineering
* Predictive Data Mining

## Tools

For this work we will use tools commonly found in a data scientist's or data engineer's toolbox. They all work together seamlessly, and integrate well with the Jupyter Notebook (this enhanced interactive document).

* **NumPy** is the fundamental package for scientific computing with Python
* **pandas** provides high-performance, easy-to-use data structures (_e.g._ data frames) and data analysis tools
* **Matplotlib** implements plotting functionality
* **scikit-learn** provides machine learning algorithms and model evaluation tools

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import sklearn

plt.style.use('ggplot')
%matplotlib inline

## 1. Data Loading and Preparation

A key initial step in every data mining project is to prepare the data. This reduces the occurrence of unexpected behavior later on and gives a preliminary insight into the "raw" data.

The **transactions** records describe transactions on accounts, representing dynamic characteristics of the accounts.

In [None]:
transactions_df = pd.read_csv('./data/banking - transaction.csv',
                              sep=';',
                              parse_dates=['date'],
                              dtype={'bank': str},  # keep bank codes as strings (np.str was removed from NumPy)
                              index_col='trans_id')

In [None]:
transactions_df.head()

The column name **k_symbol** is not very descriptive, so we rename it to **trans_char** (characterization of the transaction).

In [None]:
transactions_df = transactions_df.rename(columns={
    'k_symbol': 'trans_char'
    })

In [None]:
transactions_df.head()

The **accounts** records contain static characteristics of the accounts.

In [None]:
accounts_df = pd.read_excel('./data/banking.xlsx', 
                            sheet_name='account',
                            parse_dates=['date'],
                            index_col='account_id'
                           )

In [None]:
accounts_df.head()

The **clients** records describe static characteristics of the clients.

In [None]:
clients_df = pd.read_excel('./data/banking.xlsx',
                           sheet_name='client',
                           index_col='client_id')

In [None]:
clients_df.head()

The **birth_number** feature is not readable in this representation, so we parse it into two new columns: **birthday** and **gender**. The number encodes the birth date as YYMMDD, with 50 added to the month for women.

In [None]:
clients_df['gender'] = clients_df.apply(lambda c: 'Male' if c['birth_number'] % 10000 < 5000 else 'Female', axis=1)

In [None]:
from datetime import date

def normalize_birth_number(client):
    birth_number = int(client['birth_number'])
    year = birth_number // 10000
    month = (birth_number // 100) % 100
    day = birth_number % 100
    
    month = month if month < 50 else month - 50
    
    return  "{0:02d}{1:02d}{2:02d}".format(year, month, day)


# undo the +50 month offset used for women
clients_df['birth_number'] = clients_df.apply(normalize_birth_number, axis=1)
clients_df['birthday'] = pd.to_datetime(clients_df['birth_number'], format='%y%m%d')
# a two-digit year may be parsed into the 2000s: a birthday in the future must belong to the 1900s
clients_df['birthday'] = clients_df.apply(
    lambda c: c['birthday'] if c['birthday'].date() <= date.today() else (c['birthday'] - pd.DateOffset(years=100)),
    axis=1)
clients_df = clients_df.drop('birth_number', axis=1)

In [None]:
clients_df.head()
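To make the encoding concrete, here is a minimal standalone sketch of the same decoding for a single value. The helper `decode_birth_number` and its `century` parameter are hypothetical names introduced only for this illustration, and the sketch assumes that every client was born in the 1900s.

```python
from datetime import date

def decode_birth_number(birth_number, century=1900):
    """Decode a YYMMDD birth number in which women have 50 added to the month.

    `century` is an assumption of this sketch: all clients are taken to be
    born in the 1900s.
    """
    year, rest = divmod(int(birth_number), 10000)
    month, day = divmod(rest, 100)
    gender = 'Female' if month > 50 else 'Male'
    if month > 50:
        month -= 50
    return date(century + year, month, day), gender

print(decode_birth_number(706213))  # (datetime.date(1970, 12, 13), 'Female')
print(decode_birth_number(450204))  # (datetime.date(1945, 2, 4), 'Male')
```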

The **dispositions** records relate clients to accounts, which makes them useful in join operations.

In [None]:
dispositions_df = pd.read_excel('./data/banking.xlsx',
                                sheet_name='disposition',
                                index_col='disp_id')

In [None]:
dispositions_df.head()

The **payment_orders** records, like **transaction** records, represent another dynamic characteristic of accounts.

In [None]:
payment_orders_df = pd.read_excel('./data/banking.xlsx',
                                  sheet_name='payment order',
                                  index_col='order_id')

In [None]:
payment_orders_df.head()

The **loans** records describe the loans granted for an account.

In [None]:
loans_df = pd.read_excel('./data/banking.xlsx',
                         sheet_name='loan',
                         parse_dates=['date'],
                         index_col='loan_id')

In [None]:
loans_df.head()

The **credit_cards** records describe static information about a credit card.

In [None]:
credit_cards_df = pd.read_excel('./data/banking.xlsx',
                                sheet_name='credit card',
                                parse_dates=['issued'],
                                index_col='card_id')

In [None]:
credit_cards_df.head()

The **districts** records provide demographic information about a district.

In [None]:
districts_df = pd.read_excel('./data/banking.xlsx',
                             sheet_name='district',
                             index_col='A1')

In [None]:
districts_df.head()

The column labels provided lack any useful information.

In [None]:
districts_df = districts_df.rename(columns={
        'A2': 'district_name',
        'A3': 'region',
        'A4': 'no_inhabitants',
        'A5': 'no_municipalities_w_inhabitants_<499',
        'A6': 'no_municipalities_w_inhabitants_500-1999',
        'A7': 'no_municipalities_w_inhabitants_2000-9999',
        'A8': 'no_municipalities_w_inhabitants_>10000',
        'A9': 'no_cities',
        'A10': 'ratio_urban_inhabitants',
        'A11': 'average_salary',
        'A12': 'unemployment_rate_95',
        'A13': 'unemployment_rate_96',
        'A14': 'no_enterpreneurs_per_1000_inhabitants',
        'A15': 'no_commited_crimes_95',
        'A16': 'no_commited_crimes_96',
    })

districts_df.index.name = 'district_id'

In [None]:
districts_df.head()

Let's check whether the types inferred by pandas are correct:

In [None]:
districts_df.dtypes

We see that **unemployment_rate_95** and **no_commited_crimes_95** are loaded as objects.

In [None]:
districts_df['unemployment_rate_95'].unique()

In [None]:
districts_df['no_commited_crimes_95'].unique()

We see that both use a question mark to mark missing values. We'll convert those columns to numeric types, turning the question marks into NaN:

In [None]:
districts_df['unemployment_rate_95'] = pd.to_numeric(districts_df['unemployment_rate_95'], errors='coerce')
districts_df['no_commited_crimes_95'] = pd.to_numeric(districts_df['no_commited_crimes_95'], errors='coerce')

districts_df.dtypes
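As an alternative, the missing-value marker can be declared when the data is read, so that the columns are numeric from the start. The sketch below uses `pd.read_csv` on a made-up three-row extract (the values are illustrative, not taken from the dataset); `pd.read_excel` accepts the same `na_values` argument.

```python
import io
import pandas as pd

# Hypothetical three-row extract that reuses the original column labels.
raw = io.StringIO("A1;A12;A15\n1;0.29;85677\n2;?;?\n3;1.67;2159\n")

# na_values='?' makes pandas treat the question marks as missing values
sample_df = pd.read_csv(raw, sep=';', index_col='A1', na_values='?')
print(sample_df.dtypes)
print(sample_df.isna().sum())
```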

## 2. Descriptive Data Mining & Feature Engineering

This section aims to provide ways to better understand and extract value from the data. This is mostly accomplished by gathering descriptive statistics and plotting.

Using this knowledge, the datasets are edited and joined, first into useful intermediate formats that represent the main entities in the data, and then into a format that machine learning algorithms can process (most of the time a single matrix without missing values).

The **loans** relate to the remaining entities through the **account** they are linked to. Therefore, the remaining entities should be summarized in such a way that the information about each **account** is given in a single row.

But before making any assumption, we should extract simple statistics about the feature space.

### Transactions Dataframe

In [None]:
transactions_df.describe(include='all')

**type**, **operation** and **trans_char** all seem to represent the same information. Let's evaluate that:

In [None]:
print("type:", transactions_df['type'].unique())

In [None]:
print("operation:", transactions_df['operation'].unique())

In [None]:
print("trans_char:", transactions_df['trans_char'].unique())

**operation** seems irrelevant given **type**:

In [None]:
transactions_df_e = transactions_df.drop('operation', axis=1).copy() # 'e' for edited

The distinction between *withdrawal* and *withdrawal in cash* is also irrelevant.

In [None]:
mask = transactions_df_e['type'] == 'withdrawal in cash'
transactions_df_e.loc[mask, 'type'] = 'withdrawal'  # .ix was removed from pandas; .loc is the label-based replacement

We will create an additional column to store the signed amount (given by the type of operation), and another with the normalized **signed_amount** value, according to the **balance** previous to the operation (if an operation is the first one, we store 0):

In [None]:
transactions_df_e['signed_amount'] = transactions_df_e.apply(lambda x: - x['amount'] if x['type'] == 'withdrawal' else x['amount'], axis=1)
transactions_df_e['norm_signed_amount'] = transactions_df_e.apply(lambda x: 
                                                                      0 if (x['balance'] - x['signed_amount']) == 0 
                                                                      else x['signed_amount'] / (x['balance'] - x['signed_amount']), 
                                                                  axis=1)
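The same two columns can also be computed without a row-wise `apply`, which tends to be much faster on a table with many transactions. The following is a sketch on a made-up three-row frame (`toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'type': ['credit', 'withdrawal', 'credit'],
                    'amount': [1000.0, 200.0, 400.0],
                    'balance': [1000.0, 800.0, 1200.0]})

# withdrawals get a negative sign, everything else a positive one
sign = np.where(toy['type'] == 'withdrawal', -1.0, 1.0)
toy['signed_amount'] = sign * toy['amount']

# balance before the operation; a zero previous balance is masked to NaN
# so that the division is skipped and the result is stored as 0
prev_balance = toy['balance'] - toy['signed_amount']
toy['norm_signed_amount'] = (toy['signed_amount'] / prev_balance.where(prev_balance != 0)).fillna(0.0)

print(toy['norm_signed_amount'].tolist())  # [0.0, -0.2, 0.5]
```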

From **trans_char** we can tell, among other things, whether the user is a pensioner or whether the user has been sanctioned for a negative balance. We will create an additional table with that information, with the values weighted by the normalized signed **amount**, indexed by **account_id**:

In [None]:
# select the needed columns from transactions_df_e
trans_temp_df = transactions_df_e[['account_id', 'trans_char', 'norm_signed_amount']].copy()

# remove the rows where trans_char is a blank string
mask = trans_temp_df.trans_char != ' '
trans_temp_df = trans_temp_df.loc[mask]

# remove the rows containing NaN
trans_temp_df = trans_temp_df.dropna(axis='index')

In [None]:
# create the dataframe indexed by account_id
account_features_df = trans_temp_df[['account_id']].drop_duplicates(subset=['account_id'])
account_features_df = account_features_df.set_index('account_id')

# create one column per trans_char value, holding the sum of the
# normalized signed amounts of the corresponding transactions:
    # pension
    # interest credited
    # household
    # statement
    # insurance payment
    # sanction for negative balance
    # loan payment

def create_trans_count_col(df, val):
    # sum only the numeric column, so that the text column is not aggregated
    new_df = df.loc[df['trans_char'] == val].groupby('account_id')[['norm_signed_amount']].sum()
    new_df = new_df.rename(columns={'norm_signed_amount': val})
    return new_df

additional_dfs = [create_trans_count_col(trans_temp_df, val) for val in trans_temp_df['trans_char'].unique()]
account_features_df = account_features_df.join(additional_dfs)

account_features_df = account_features_df.fillna(0)

account_features_df.head()

Now merge it with the **accounts** dataframe:

In [None]:
accounts_df = accounts_df.join(account_features_df)

# set to 0 the columns for accounts that do not have any transaction
accounts_df = accounts_df.fillna(0)

accounts_df.head()

Now we'll look into the **normalized signed amount** and **type** of each operation.

We will extract, for each **account**, the *count*, *mean* and *standard deviation* of operation values, as well as *mean* and *standard deviation* of the number of days between each **operation**, grouped by operation **type**.

In [None]:
temp_df = transactions_df_e[['account_id', 'type', 'norm_signed_amount', 'date']].copy()

# create columns with dates converted to days since 01-01-1970
#temp_df['date_days'] = temp_df.apply(lambda x: (x['date'] - pd.datetime(1970,1,1)).days, axis=1)

In [None]:
# sort firstly by account_id, then by date
temp_df = temp_df.sort_values(by=['account_id', 'date'])

# days elapsed since the previous operation of the same account
# (0 for the first operation of each account)
temp_df['date_delta'] = temp_df.groupby('account_id')['date'].diff().dt.days.fillna(0)

In [None]:
#temp_df = temp_df.drop('date', axis='columns')

def summarize_transactions(df):
    summaries_dfs = []
    for tp in df['type'].unique():
        # use the dataframe received as argument and keep only the numeric
        # columns of the rows with this operation type
        tmp_df = df.loc[df['type'] == tp, ['account_id', 'norm_signed_amount', 'date_delta']]

        # count, mean and (sample) standard deviation per account
        tmp_grp_df = tmp_df.groupby('account_id').agg(['count', 'mean', 'std'])

        ops_df = tmp_grp_df['norm_signed_amount']
        ops_df = ops_df.rename(columns={'count': tp + '_cnt',
                                        'mean': tp + '_avg',
                                        'std': tp + '_std',
                                       })

        dates_df = tmp_grp_df['date_delta']
        dates_df = dates_df.drop('count', axis='columns')
        dates_df = dates_df.rename(columns={'mean': tp + '_dates_avg',
                                            'std': tp + '_dates_std',
                                           })
        joined_summary_df = ops_df.join(dates_df)
        summaries_dfs.append(joined_summary_df)

    # now concatenate the summaries_dfs
    summaries_df = pd.concat(summaries_dfs, axis='columns')
    return summaries_df

trans_summaries_df = summarize_transactions(temp_df)
trans_summaries_df.head()
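For comparison, a per-type summary of this kind can also be obtained without an explicit loop, by grouping on both keys and unstacking the operation type. This is a sketch of an alternative on made-up data, not the procedure used above; the resulting column names (for example `credit_mean`) differ from the ones produced by `summarize_transactions`.

```python
import pandas as pd

toy = pd.DataFrame({
    'account_id': [1, 1, 1, 2, 2],
    'type': ['credit', 'credit', 'withdrawal', 'credit', 'withdrawal'],
    'norm_signed_amount': [0.5, 0.1, -0.2, 0.4, -0.3],
})

# one row per (account, type), then move the type level into the columns
summary = (toy.groupby(['account_id', 'type'])['norm_signed_amount']
              .agg(['count', 'mean', 'std'])
              .unstack('type'))

# flatten the (statistic, type) column MultiIndex into names such as 'credit_mean'
summary.columns = ['{}_{}'.format(tp, stat) for stat, tp in summary.columns]
print(summary)
```

A group with a single operation has no sample standard deviation, so its `_std` entry is NaN, just as in the loop-based version.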

Now merge with **accounts** dataframe:

In [None]:
accounts_df = accounts_df.join(trans_summaries_df)
accounts_df.head()

In [None]:
accounts_df[accounts_df['withdrawal_std'].isnull()]

### Clients

Now we'll take a look at the features that characterize each user.

In [None]:
clients_df.describe(include='all')

As each user is associated with a district, we'll look into the corresponding table:

In [None]:
districts_df.describe(include='all')

We can consider either the individual districts or the regions. 
To evaluate whether generalizing to regions hides interesting differences between districts, we'll use PCA to reduce the table to two dimensions, which makes it possible to plot.

In [None]:
districts_df_e = districts_df.drop(['district_name', 'region'], axis='columns')
districts_df_e = districts_df_e.fillna(0)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
districts_reduced = pca.fit_transform(districts_df_e.values)

In [None]:
districts_df_e['a'] = districts_reduced[:, 0]
districts_df_e['b'] = districts_reduced[:, 1]

# encode each region as an integer code, to use it as a color
regions = pd.factorize(districts_df['region'])[0]

districts_df_e.plot(kind='scatter', x='a', y='b', c=regions)

From the plot we conclude that it is probably beneficial to work with the individual districts.
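One caveat worth keeping in mind when reading such a plot: PCA is sensitive to the scale of the columns, and the districts table mixes counts of inhabitants with rates and ratios. The sketch below uses synthetic data (not the districts table) to show how a single large-scale column can dominate the first component unless the features are standardized first. Whether standardizing changes the conclusion for the real data is not verified here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# synthetic stand-in: the first column lives on a much larger scale
X = np.column_stack([rng.normal(100000, 50000, size=50),
                     rng.normal(5, 2, size=50),
                     rng.normal(9000, 800, size=50)])

raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(raw_ratio)     # the first component is dominated by the large-scale column
print(scaled_ratio)  # after standardization the variance is spread more evenly
```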

In [None]:
clients_df.head()

In [None]:
clients_districts_df = pd.merge(clients_df, districts_df.drop(['region', 'district_name'], axis='columns'), left_on='district_id', right_index=True, how='left', sort=False)

# we no longer need district_id column
clients_districts_df = clients_districts_df.drop('district_id', axis='columns')

clients_districts_df.head()

Now we'll merge the **Credit Cards** and **Clients** information with the accounts.

As there may be multiple credit cards and clients associated with a given account, we have to devise a heuristic.

In [None]:
dispositions_df.describe(include='all')

In [None]:
dispositions_df.type.unique()

To start, we'll simply use the information about the owner of the account.

We'll check whether there really is only one owner per account.

In [None]:
dispositions_owners_df = dispositions_df[dispositions_df['type'] == 'OWNER'].drop('type', axis='columns')

In [None]:
owners_per_account_df = dispositions_owners_df.groupby(['account_id']).count()

owners_per_account_df[owners_per_account_df['client_id'] != 1].count(axis='index')
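The check above can be reduced to a single boolean, which is convenient for an assertion in a data preparation script. A minimal sketch on a made-up dispositions table (the ids and the non-owner type label are illustrative values):

```python
import pandas as pd

toy_disp = pd.DataFrame({
    'client_id': [10, 11, 12, 13],
    'account_id': [1, 1, 2, 3],
    'type': ['OWNER', 'DISPONENT', 'OWNER', 'OWNER'],
})

owners = toy_disp[toy_disp['type'] == 'OWNER']
owners_per_account = owners.groupby('account_id')['client_id'].count()

# True only if every account that has an owner has exactly one
has_single_owner = bool((owners_per_account == 1).all())
print(has_single_owner)  # True
```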

We can now merge **clients** with **accounts**:

In [None]:
accounts_disp_merge = accounts_df.merge(dispositions_owners_df, left_index=True, right_on='account_id', how='left', sort=False)
accounts_disp_merge = accounts_disp_merge.set_index('account_id')


accounts_clients_df = accounts_disp_merge.merge(clients_districts_df, left_on='client_id', right_index=True, how='left', sort=False)
accounts_clients_df = accounts_clients_df.drop('client_id', axis='columns')

accounts_clients_df = accounts_clients_df.fillna(0)

accounts_clients_df.head()

### Loans

We want to keep a summary of the history of loans related to a given account.

In [None]:
loans_df.describe(include='all')

In [None]:
print(loans_df.status.unique())

For reference:
 * **A**: contract finished, no problems
 * **B**: contract finished, loan not paid
 * **C**: running contract, OK so far
 * **D**: running contract, client in debt

Let's see how the loans are distributed across the four categories:

In [None]:
loans_df.groupby('status').count()['account_id']

Many contracts are still running but, most importantly, there is a great disparity between the counts of the two outcomes of finished loans (**A** and **B**). We must be aware of this when training the algorithms: as we want to properly identify the loans that will end with status **B**, we have to be careful when structuring the training and testing datasets.

In order to summarize this data, we want to obtain the *count* of loans in each state, as well as the *average* and *standard deviation* of the amount and duration associated with each loan.

In [None]:
def summarize_loans(df):
    summaries_dfs = []
    for tp in df['status'].unique():
        # keep only the columns to summarize for the loans with this status
        tmp_df = df.loc[df['status'] == tp, ['account_id', 'amount', 'duration']]

        # count, mean and (sample) standard deviation per account
        tmp_grp_df = tmp_df.groupby('account_id').agg(['count', 'mean', 'std'])

        amount_df = tmp_grp_df['amount']
        amount_df = amount_df.rename(columns={'count': tp + '_cnt',
                                              'mean': tp + '_amount_avg',
                                              'std': tp + '_amount_std',
                                             })

        duration_df = tmp_grp_df['duration']
        duration_df = duration_df.drop('count', axis='columns')
        duration_df = duration_df.rename(columns={'mean': tp + '_duration_avg',
                                                  'std': tp + '_duration_std',
                                                 })
        joined_summary_df = amount_df.join(duration_df)
        summaries_dfs.append(joined_summary_df)

    # now concatenate the summaries_dfs
    summaries_df = pd.concat(summaries_dfs, axis='columns')
    return summaries_df


loans_summary_df = summarize_loans(loans_df)
loans_summary_df = loans_summary_df.fillna(0)
loans_summary_df.head()

Let's analyse whether the loan history of each account is long enough to be used:

In [None]:
loans_hist_df = loans_summary_df[['A_cnt', 'B_cnt', 'C_cnt', 'D_cnt']].sum(axis='columns')
loans_hist_df.plot(kind='hist', alpha=0.5)

It seems that there are no accounts with more than one loan through time. Therefore, we will not use the loans history of each account. We'll simply save a dataframe with this information merged with accounts:

In [None]:
accounts_loans_df = accounts_clients_df.join(loans_summary_df)
accounts_loans_df = accounts_loans_df.fillna(0)
accounts_loans_df.head()

In [None]:
accounts_loans_df.dtypes

We've gathered 55 features to characterize each account, covering information about the accounts, users (and respective credit cards), transactions and history of loans.

## 3. Predictive Data Mining

The above-mentioned accounts-indexed dataframe, combined with the specification of each loan (duration and amount), constitutes all the information needed to train our algorithms.


First we have to prepare a dataframe that the machine learning algorithms are able to understand. To accomplish that, we will merge the **accounts_clients_df** dataframe (which excludes the loan history, as decided above) with the original **loans_df** dataframe. For new loans that arrive the process will be the same, so we encapsulate it in a function:

In [None]:
accounts_final_df = accounts_clients_df.rename(columns={'date': 'date created'})
accounts_final_df['account_id'] = accounts_final_df.index

def join_loans_accounts(l_df):
    dataset = pd.merge(l_df, accounts_final_df, left_on='account_id', right_index=True, how='left', sort=False)
    return dataset

dataset = join_loans_accounts(loans_df)

In [None]:
accounts_final_df.head()

Now we have to extract the useful rows: the ones with **status** *A* and *B*:

In [None]:
def select_useful_rows(df):
    # filter the dataframe received as argument, not the global dataset
    mask = (df.status == 'A') | (df.status == 'B')
    return df[mask]

In [None]:
dataset_useful = select_useful_rows(dataset)

In [None]:
print(dataset.shape)
print(dataset_useful.shape)

We just have now to convert the timestamp columns to an integer value. We'll use Unix time:

In [None]:
def convert_date_cols(df):
    new_df = df.copy()
    for col in new_df.columns:
        if new_df[col].dtype == 'datetime64[ns]':
            new_df[col] = new_df[col].astype(np.int64) // 10**9
    return new_df
            

dataset_valid = convert_date_cols(dataset_useful)
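As a quick sanity check of what the conversion produces, the sketch below converts two dates to Unix time (seconds since 1970-01-01) using timedelta arithmetic, which is an equivalent way of expressing the integer cast used above. The dates are arbitrary examples.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['1970-01-02', '1993-07-05']))
epoch = pd.Timestamp('1970-01-01')

# whole seconds elapsed since the Unix epoch
unix_seconds = (dates - epoch) // pd.Timedelta(seconds=1)
print(unix_seconds.tolist())  # [86400, 741830400]
```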

and convert the categorical columns into new indicator (one-hot) columns:

In [None]:
features_mask = dataset_valid.columns[dataset_valid.columns != 'status']

dataset_valid_num = pd.get_dummies(dataset_valid[features_mask])
dataset_valid_num['status'] = dataset_valid['status']
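To illustrate what `pd.get_dummies` does, here is a sketch on a tiny made-up frame: numeric columns are kept as they are, and each text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

toy = pd.DataFrame({'amount': [100, 250, 80],
                    'frequency': ['monthly', 'weekly', 'monthly']})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['amount', 'frequency_monthly', 'frequency_weekly']
```

Note that the set of indicator columns depends on the values present in the data, so new loans must be encoded with the same columns as the training data.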

In [None]:
features = dataset_valid_num.columns[dataset_valid_num.columns != 'status']
label = 'status'
target_names = ['A', 'B']

### Testing classifiers

We'll start off by using a **decision tree** as our first machine learning classifier. It works with both continuous and discrete data types, and its results are easy to interpret.

We'll use stratified k-fold cross-validation to select the model, so that the choice does not depend on a single train/validation split and every fold keeps the class proportions.

The next function splits the data into training and test sets for a given model, and then performs stratified k-fold cross-validation on the training set to select the model instance that performs best on its validation fold. It returns the score of that classifier on the test set, the classification report, the confusion matrix and the classifier itself.

In [None]:
# sklearn.cross_validation was removed; these utilities live in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone as skl_clone
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def k_fold_model_select(data, features, label, raw_classifier, n_folds=10, weigh_samples_fn=None): 
    # weigh_samples_fn (explained below) is an optional callable
    # (y_true, y_pred) -> sample weights used when scoring
    
    # split into training and test data
    X_train, X_test, y_train, y_test = train_test_split(data[features].values, 
                                                        data[label].values,
                                                        test_size=0.3,
                                                        stratify=data[label],
                                                        random_state=5)
    
    
    # use stratified k-fold cross validation to select the model
    skf = StratifiedKFold(n_splits=n_folds)

    best_classifier = None
    best_score = float('-inf')

    for train_index, validation_index in skf.split(X_train, y_train):
        classifier = skl_clone(raw_classifier)
        classifier = classifier.fit(X_train[train_index], y_train[train_index])

        if weigh_samples_fn is not None:
            y_pred = classifier.predict(X_train[validation_index])
            sample_weight = weigh_samples_fn(y_train[validation_index], y_pred)
        else:
            sample_weight = None
            
        score = classifier.score(X_train[validation_index], y_train[validation_index],
                                 sample_weight=sample_weight)

        if score > best_score:
            best_classifier = classifier
            best_score = score
    
    # compute the confusion matrix
    y_pred = best_classifier.predict(X_test)
    conf_mat = confusion_matrix(y_test, y_pred)
    
    # now compute the score for the test data of the best found classifier
    if weigh_samples_fn is not None:
        sample_weight = weigh_samples_fn(y_test, y_pred)
    else:
        sample_weight = None
    test_score = best_classifier.score(X_test, y_test, sample_weight=sample_weight)
    
    # and obtain the classification report
    report = classification_report(y_test, y_pred, target_names=target_names)
    
    return (test_score, report, conf_mat, best_classifier)
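For reference, scikit-learn also offers `cross_val_score`, which runs the fit-and-score loop over the folds and returns one score per fold. The sketch below applies it to synthetic, imbalanced data generated with `make_classification` (not the banking dataset), and only estimates performance: unlike the function above, it does not keep the best fitted instance.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data with roughly 90% / 10% class proportions
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# each fold preserves the class proportions of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(min_samples_split=20, random_state=0)

scores = cross_val_score(clf, X, y, cv=skf)
print("accuracy per fold:", scores)
print("mean and standard deviation:", scores.mean(), scores.std())
```

Reporting the mean and spread across folds gives a less optimistic picture than the score of the single best fold.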

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(min_samples_split=20, random_state=0)
dtc_score, dtc_rep, dtc_cm, dtc_clf = k_fold_model_select(dataset_valid_num, features, label, dtc)

In [None]:
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
import pydot_ng as pydot
from IPython.display import Image

def display_tree(dtc_classifier):
    dot_data = StringIO()
    # export the classifier received as argument, labelling the nodes with
    # the one-hot encoded feature names it was trained on
    tree.export_graphviz(dtc_classifier, out_file=dot_data,
                         feature_names=list(features),
                         class_names=target_names,
                         filled=True,
                         rounded=True,
                         special_characters=True)
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())

In [None]:
display_tree(dtc_clf)

We see that the tree indicates that the accounts with low withdrawal values (*withdrawal_avg*) (relative to the account's balance) and high number of deposits (*credit_cnt*) are mostly classified as reliable accounts for loans.

On the other hand, accounts with higher withdrawal values (also relative to the account's balance), and higher variance in deposits periodicity (*credit_dates_std*) are mostly classified as non-reliable accounts for loans.

Let's analyse more deeply these results. For that, we'll use the obtained score in predicting the test set and the confusion matrix:

In [None]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def normalize_confusion_matrix(cm):
    # Normalize the confusion matrix by row (i.e by the number of samples
    # in each class)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    return cm_normalized
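As a worked example of the row normalization, consider a made-up confusion matrix with 50 true A loans and 10 true B loans. Each row is divided by its total, so the diagonal holds the recall of each class.

```python
import numpy as np

# rows: true labels (A, B); columns: predicted labels (A, B)
cm = np.array([[45, 5],
               [3, 7]])

cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)
print(cm_normalized)
# [[0.9 0.1]
#  [0.3 0.7]]
```

Here 90% of the A loans are recognized, but only 70% of the B loans, which the overall accuracy of 52/60 would hide.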

In [None]:
print("Score for best Decision Tree Classifier:", dtc_score)
print("Confusion matrix:", dtc_cm, sep='\n')
print("Classification report:", dtc_rep, sep='\n')

plot_confusion_matrix(normalize_confusion_matrix(dtc_cm))

Things seem pretty good: we have a good accuracy, and the plotted confusion matrix shows a high percentage of results on the main diagonal!

Nonetheless, there are two factors we have to take into account: 
 * As stated before, we have very few examples with label B (unsuccessful loans), and those events, despite being unusual, are very costly. Therefore we need to give them higher relevance.
 * Trustworthy users that are labeled as untrustworthy will most likely switch to a different bank, forfeiting all the future accumulated revenue. As shown by the confusion matrix, one of the users suffers from exactly this phenomenon, so we must pay attention to it.

Because of this, we must be pragmatic about these results and perform some changes.

In order to gain more insight, we'll continue by testing other classifiers, analysing their ROC curves. 

Where applicable, we'll also set the weights given to each label to be **balanced**: this way the classifiers take into account the frequency of each class and automatically adjust their behavior.
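With the **balanced** setting, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes these weights for a made-up label vector with nine A loans and one B loan, using `compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# made-up labels: 9 successful loans (A) and 1 unsuccessful loan (B)
y = np.array(['A'] * 9 + ['B'] * 1)
classes = np.unique(y)

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
for cls, w in zip(classes, weights):
    print(cls, round(float(w), 4))
```

Class A gets 10 / (2 * 9), about 0.56, and class B gets 10 / (2 * 1) = 5, so a mistake on a B loan counts nine times as much as a mistake on an A loan.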

In [None]:
# Decision Tree Classifier
dt = DecisionTreeClassifier(min_samples_split=20, random_state=0, class_weight='balanced')
dt_score, dt_rep, dt_cm, dt_clf = k_fold_model_select(dataset_valid_num, features, label, dt)

print("Score:", dt_score)
print("Confusion matrix:", dt_cm, sep='\n')
print("Classification report:", dt_rep, sep='\n')

In [None]:
# Linear Classifier (Logistic Regression)
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced')
lr_score, lr_rep, lr_cm, lr_clf = k_fold_model_select(dataset_valid_num, features, label, lr)

print("Score:", lr_score)
print("Confusion matrix:", lr_cm, sep='\n')
print("Classification report:", lr_rep, sep='\n')

In [None]:
# Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(weights='distance')
knn_score, knn_rep, knn_cm, knn_clf = k_fold_model_select(dataset_valid_num, features, label, knn)

print("Score:", knn_score)
print("Confusion matrix:", knn_cm, sep='\n')
print("Classification report:", knn_rep, sep='\n')

In [None]:
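# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
# k_fold_model_select returns four values: score, report, confusion matrix and classifier
nb_score, nb_rep, nb_cm, nb_clf = k_fold_model_select(dataset_valid_num, features, label, nb)

print("Score:", nb_score)
print("Confusion matrix:", nb_cm, sep='\n')
print("Classification report:", nb_rep, sep='\n')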

In [None]:
# Neural Network (Multi-Layer Perceptron)
from sknn.mlp import Classifier as MLPClassifier
from sknn.mlp import Layer as MLPLayer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# use a pipeline that first normalizes the data

nn = MLPClassifier(
    layers=[
        MLPLayer("Maxout", units=30, pieces=2),
        MLPLayer("Softmax")],
    learning_rate=0.000000001,
    n_iter=25)

pipeline = Pipeline([
        ('min/max scaler', MinMaxScaler(feature_range=(0.0, 1.0))),
        ('neural network', nn)])

# evaluate the pipeline (scaler + network) so that the data is actually normalized first
nn_score, nn_rep, nn_cm, nn_clf = k_fold_model_select(dataset_valid_num, features, label, pipeline)

print("Score:", nn_score)
print("Confusion matrix:", nn_cm, sep='\n')
print("Classification report:", nn_rep, sep='\n')

We'll test SVC with different kernels.

In [None]:
# SVC (linear kernel)
from sklearn.svm import SVC

#svc_linear = SVC(kernel='linear')
#svc_linear_score, svc_linear_rep, svc_linear_cm, svc_linear_clf = k_fold_model_select(dataset_valid_num, features, label, svc_linear,
#                                                                        n_folds=2)

#print("Score:", svc_linear_score)
#print("Confusion matrix:", svc_linear_cm, sep='\n')
#print("Classification report:", svc_linear_rep, sep='\n')

In [None]:
# SVC (sigmoid kernel)
from sklearn.svm import SVC

#svc_sigmoid = SVC(kernel='sigmoid')
#svc_sigmoid_score, svc_sigmoid_rep, svc_sigmoid_cm, svc_sigmoid_clf = k_fold_model_select(dataset_valid_num, features, label, svc_sigmoid,
#                                                                        n_folds=2)

#print("Score:", svc_sigmoid_score)
#print("Confusion matrix:", svc_sigmoid_cm, sep='\n')
#print("Classification report:", svc_sigmoid_rep, sep='\n')

In [None]:
# SVC (radial basis function kernel)
from sklearn.svm import SVC

#svc_rbf = SVC(kernel='rbf')
#svc_rbf_score, svc_rbf_rep, svc_rbf_cm, svc_rbf_clf = k_fold_model_select(dataset_valid_num, features, label, svc_rbf,
#                                                                        n_folds=2)

#print("Score:", svc_rbf_score)
#print("Confusion matrix:", svc_rbf_cm, sep='\n')
#print("Classification report:", svc_rbf_rep, sep='\n')

Now we'll test some ensemble methods.

We start with **AdaBoost**. In each of its iterations, this algorithm specializes a base classifier on the instances that were incorrectly classified in the previous iterations. We'll use the Decision Tree Classifier as base classifier, as it is very sensitive to the data on which it is trained.

In [None]:
# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier

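# the base estimator is passed positionally: AdaBoostClassifier has no base_classifier
# argument, and its keyword was renamed across scikit-learn versions (base_estimator, later estimator)
ab = AdaBoostClassifier(dt)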
ab_score, ab_rep, ab_cm, ab_clf = k_fold_model_select(dataset_valid_num, features, label, ab)

print("Score:", ab_score)
print("Confusion matrix:", ab_cm, sep='\n')
print("Classification report:", ab_rep, sep='\n')
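The ROC analysis mentioned earlier needs a continuous score per sample rather than a hard label. The following is a minimal sketch on synthetic data (not the banking dataset), assuming a classifier that exposes `predict_proba`. With the `'A'`/`'B'` string labels used in this notebook, `roc_curve` would additionally need `pos_label='B'`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# synthetic imbalanced data; class 1 plays the role of the unsuccessful loans
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=5)

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# probability of the positive class for each test sample
scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print("AUC:", auc)
```

Plotting `fpr` against `tpr` for each classifier on the same axes makes it possible to compare them across all decision thresholds, instead of at the single threshold behind a confusion matrix.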