## Problem Statement
Santander wants to identify which customers will make at least one transaction in the future, irrespective of the amount. The goal is therefore a model that predicts future customer transactions from real data that has been anonymized for security purposes.
The data set has 200 anonymized numerical features, which are used to analyse customer relations with the business.

## Project Planning
### Exploratory Data Analysis
### Data Visualization using Statistical methods
### Implementation of most suitable machine learning algorithm
### Predict and display future consumer transactions

In [1]:
#IMPORTING NECESSARY PACKAGES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import seaborn as sns
import warnings
from scipy.stats import pearsonr
import itertools
from scipy import stats

In [2]:
# link to data set from Kaggle
#
# https://www.kaggle.com/c/santander-customer-transaction-prediction/data

data = pd.read_csv('train.csv')

In [3]:
data.head(5)

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


In [None]:
# A DESCRIPTION OF THE DATA SET
data.describe()

In [None]:
# CALCULATE NUMBER AND PERCENTAGE OF MISSING VALUES PER COLUMN OF A DATAFRAME
def check_null_values(df):
    # Return False early when no column contains a missing value
    if not df.isna().any().any():
        return False
    total = df.isnull().sum()
    # Percentage of missing values per column: missing / rows * 100
    percent = total / len(df) * 100
    output = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    # Record each column's dtype alongside its missing-value counts
    output['Types'] = [str(df[col].dtype) for col in df.columns]
    return np.transpose(output)

In [None]:
check_null_values(data)

In [None]:
#DISTRIBUTION OF THE TARGET VALUE IN TRAIN DATASET
f,ax=plt.subplots(1,2,figsize=(18,8))
data['target'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('target')
ax[0].set_ylabel('')
sns.countplot(x='target',data=data,ax=ax[1])
ax[1].set_title('Target')
plt.show()
# SINCE ONLY ABOUT 10% OF THE RECORDS HAVE TARGET 1, THE DATA IS IMBALANCED WITH RESPECT TO THE TARGET
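
The imbalance read off the pie chart can also be checked numerically. The following is a minimal sketch on a hypothetical stand-in series (90 zeros and 10 ones, chosen only to mirror the roughly 10% share noted above), not on the actual `train.csv` data:

```python
import pandas as pd

# Hypothetical stand-in for the `target` column: 90 zeros and 10 ones.
target = pd.Series([0] * 90 + [1] * 10, name="target")

# Share of each class, as fractions that sum to 1.
class_share = target.value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class; a value far from 1 signals imbalance.
imbalance_ratio = class_share.max() / class_share.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data the same two calls would be applied to `data['target']`.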

In [None]:
# DENSITY PLOTS OF EACH FEATURE, SPLIT BY TARGET VALUE (0, 1), i.e.
# how the distribution of each feature differs between target 0 and target 1.

def plot_feature_distribution(df1, df2, label1, label2, features):
    sns.set_style('whitegrid')
    # One 10x10 grid of axes; plt.subplots already creates the figure
    fig, ax = plt.subplots(10,10,figsize=(18,22))

    for i, feature in enumerate(features, start=1):
        plt.subplot(10,10,i)
        # bw_method replaces the `bw` argument deprecated in seaborn 0.11
        sns.kdeplot(df1[feature], bw_method=0.5,label=label1)
        sns.kdeplot(df2[feature], bw_method=0.5,label=label2)
        plt.xlabel(feature, fontsize=9)
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.tight_layout()
    plt.show()
    
t0 = data.loc[data['target'] == 0]
t1 = data.loc[data['target'] == 1]
features_1 = data.columns.values[2:102]
plot_feature_distribution(t0, t1, '0', '1', features_1)

In [None]:
#The last 100 features
features_2 = data.columns.values[102:202]
plot_feature_distribution(t0, t1, '0', '1', features_2)

### Analysing the above Density plots and their influence on the target values
- Firstly, since the data in the train and test dataframes are similar, the analysis was done on the train dataframe and is assumed to apply to the test dataframe as well.
- From these density plots, most of the features have high standard deviations, which implies the data points are spread widely around the mean, i.e. very low minimum values and very high maximum values.
- Approximately 5% of the features have very low standard deviations, so their density plots peak at values of about 0.6. Note that this is a density height, not a probability: a narrow distribution produces a tall peak, and the peak alone does not indicate how likely a customer is to make a transaction in the future.
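
The link between a low standard deviation and a tall density peak can be made concrete. The sketch below uses two hypothetical features (`var_a`, `var_b`) on synthetic data and assumes roughly normal distributions, for which the peak height is $1/(\sigma\sqrt{2\pi})$; under that assumption a peak near 0.6 corresponds to a standard deviation of roughly 0.66.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are hypothetical, not from train.csv.
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "target": rng.integers(0, 2, size=n),
    "var_a": rng.normal(0.0, 5.0, size=n),    # widely spread feature
    "var_b": rng.normal(10.0, 0.5, size=n),   # tightly concentrated feature
})

# Standard deviation per feature, smallest first.
spread = demo.drop(columns="target").std().sort_values()
print(spread)

def normal_peak_height(sigma):
    """Peak of a normal density with standard deviation sigma."""
    return 1.0 / (sigma * np.sqrt(2.0 * np.pi))

# A standard deviation of about 0.665 gives a peak close to 0.6.
print(normal_peak_height(0.665))

# Mean and standard deviation of each feature, split by target class.
by_class = demo.groupby("target")[["var_a", "var_b"]].agg(["mean", "std"])
print(by_class)
```

Comparing the per-class means in `by_class` is a numeric counterpart to checking whether the two density curves in a panel are shifted relative to each other.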

### Pearson Correlation
* From the 200 features, var_91, var_108 and var_148 have the highest peak density values in their density plots.
* A Pearson correlation is applied between each of these features and the target to examine their relationship.
* Considering var_91:
* The null hypothesis is that there is no linear relationship between the feature and the target.
* The alternative hypothesis is that there is a linear relationship between them.
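
As an illustration of how such a test is usually read, here is a minimal sketch on synthetic data. The feature, the logistic link used to generate the binary target, and the 0.05 significance level are all assumptions made for the example:

```python
import numpy as np
from scipy import stats

# Synthetic data: a binary target weakly linked to one continuous feature.
rng = np.random.default_rng(42)
n = 5000
feature = rng.normal(size=n)
prob = 1.0 / (1.0 + np.exp(-(-2.0 + 0.3 * feature)))
target = (rng.random(n) < prob).astype(int)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(target, feature)
print(f"r = {r:.4f}, p = {p_value:.4g}")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject the null hypothesis of no linear relationship.")
else:
    print("Fail to reject the null hypothesis of no linear relationship.")
```

A small p-value only says the correlation is unlikely to be exactly zero; the size of `r` is what indicates how strong the relationship is.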

In [None]:
stats.pearsonr(data['target'], data['var_91'])
# since the r value is positive and < 0.02, the correlation between target and var_91 is a very weak positive correlation
# i.e. higher values of var_91 are only marginally associated with a customer making a future transaction
# Also, since the p value is greater than 0.05, the null hypothesis of no relationship cannot be rejected

In [None]:
# considering var_108
stats.pearsonr(data['target'], data['var_108'])
# The r value being negative with magnitude < 0.02 shows a weak negative correlation between these variables
# meaning higher values of var_108 are associated with slightly lower values of target (an association, not a causal effect).

In [None]:
# considering var_148
stats.pearsonr(data['target'], data['var_148'])
# The r value being negative with magnitude < 0.02 shows a weak negative correlation between these variables

### Pearson correlation analysis
* var_91 has a weak positive correlation with the target, meaning higher values of var_91 are associated with a slightly higher probability of the target being 1.
* var_108 and var_148 both have weak negative correlations with the target, meaning lower values of either feature are associated with a slightly higher probability of the target being 1.
* The correlation between var_91 and the target can inform the implementation, although with a correlation this weak the feature carries little predictive signal on its own, and correlation does not establish causation.
* If the meaning of this anonymized feature were known, measures to increase it could be explored as a step towards the project's aim.
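
Rather than testing features one at a time, all of them can be ranked by their correlation with the target in one pass. The following is a sketch on synthetic data with hypothetical column names (`var_pos`, `var_neg`, `var_noise`), not a reproduction of the results above:

```python
import numpy as np
import pandas as pd

# Synthetic data: one positively related, one negatively related, one unrelated feature.
rng = np.random.default_rng(1)
n = 2000
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "target": target,
    "var_pos": target * 1.0 + rng.normal(size=n),
    "var_neg": -target * 1.0 + rng.normal(size=n),
    "var_noise": rng.normal(size=n),
})

# Pearson correlation of every feature column with the target.
corr = demo.drop(columns="target").corrwith(demo["target"])

# Order by absolute correlation so strong negative links are not overlooked.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On the notebook's dataframe the equivalent call would be `data.iloc[:, 2:202].corrwith(data['target'])`.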

### Training and test datasets
* Two models are compared, and the one that produces the highest accuracy is considered the most suitable for the problem. They are:
* Logistic Regression Model with and without regularization
* Multinomial Naive Bayes Model
* Conclusions are then drawn from the accuracies of the models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# since the dataset is too heavy, training and testing of the model will be done on 20000 records
# Assumed here: keep the first 20000 rows (the original slice expression was incomplete)
data = data.iloc[:20000]
# Before splitting the dataset, verify that X is a matrix and y a vector
X = data.iloc[:, 2:202].values
print("X: ", type(X), X.shape)
y = data.iloc[:, 1].values
print('y: ', type(y), y.shape)

# Split the data into a training and test set.
Xlr, Xtestlr, ylr, ytestlr = train_test_split(X, y, random_state=5)

print('Accuracy of the Logistic Regression model on the training and test set')
clf = LogisticRegression()
# Fit the model on the training data.
clf.fit(Xlr, ylr)
# Accuracy from the training data
y_predict_train = clf.predict(Xlr)
print("[Train] Accuracy score (ylr, y_predict_train):", accuracy_score(ylr, y_predict_train))


In [None]:
# Print the accuracy from the testing data.
y_predict_test = clf.predict(Xtestlr)
print("[Test] Accuracy score (ytestlr, y_predict_test):", accuracy_score(ytestlr, y_predict_test))


* Since the training and test accuracy scores are close, there is no sign of overfitting.
* The model's accuracy is greater than 90%, which suggests it is suitable for this problem. However, because roughly 90% of the records have target 0, accuracy should be read together with the per-class precision and recall.
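
One way to put an accuracy figure in context on imbalanced data is to compare it with a majority-class baseline and with a threshold-independent metric such as ROC AUC. The sketch below runs on synthetic data with about 10% positives; the data generator and the choice of metrics are assumptions for illustration, not part of the original analysis:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: only the first feature carries signal.
rng = np.random.default_rng(5)
n = 4000
X = rng.normal(size=(n, 5))
logit = -2.5 + 1.0 * X[:, 0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=5, stratify=y)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
print(f"model ROC AUC:     {model_auc:.3f}")
```

If the model's accuracy is close to the baseline's, the accuracy figure mostly reflects the class balance, and the AUC gives a clearer picture of how well the positives are ranked.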

In [None]:
# printing out classification report
# Precision: Ability of a classifier not to label as positive an instance that is actually negative.
# Recall: Ability of a classifier to find all positive instances of each class.
# F1 score: Harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0.
# Support: Number of actual occurrences of the class in the specified dataset

from sklearn.metrics import classification_report

print("[Training Classification Report:]")
print(classification_report(ylr, y_predict_train))

print("[Test Classification Report:]")
print(classification_report(ytestlr, y_predict_test))

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def cv_score(clf, x, y, score_func=accuracy_score):
    result = 0
    nfold = 2
    for train, test in KFold(nfold).split(x): # split data into train/test groups, 2 times
        clf.fit(x[train], y[train]) # fit
        result += score_func(y[test], clf.predict(x[test])) # evaluate score function (y_true, y_pred) on held-out data
    return result / nfold # average
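
A comparable cross-validated score can be obtained from scikit-learn's built-in helpers. This is a sketch on synthetic data; the 5 stratified folds are a swapped-in choice (the helper above uses 2 plain folds), made here because stratification keeps the class ratio similar in every fold when the target is imbalanced:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the training split.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(-2.0 + X[:, 0])))).astype(int)

# Stratified folds preserve the share of each class in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```

The per-fold scores also show how much the estimate varies between splits, which a single averaged number hides.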

In [None]:
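# LGR with l1 regularization (the liblinear solver supports the l1 penalty; the default lbfgs solver does not)
clf_1 = LogisticRegression(penalty='l1', solver='liblinear', random_state = 0)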
score = cv_score(clf_1, Xlr, ylr)
print(score)

In [None]:
# Using the Multinomial Naive Bayes Model
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

X = data.iloc[:, 2:202].values
print("X: ", type(X), X.shape)
y = data.iloc[:, 1].values
print('y: ', type(y), y.shape)

# Split the data into a training and test set.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=5)

# MultinomialNB rejects negative feature values, and several features here are negative.
# Added step (an assumption, not in the original workflow): rescale each feature to [0, 1],
# fitting the scaler on the training data only.
scaler = MinMaxScaler()
Xtrain = scaler.fit_transform(Xtrain)
# Clip so test values outside the training range cannot become negative
Xtest = np.clip(scaler.transform(Xtest), 0, 1)

clf_mnb = MultinomialNB()
# Fit the model on the training data.
clf_mnb.fit(Xtrain, ytrain)
# Accuracy from the training data
y_predict_train = clf_mnb.predict(Xtrain)
print('Accuracy of the Multinomial Naive Bayes model on the training and test set')
print("[Train] Accuracy score (ytrain, y_predict_train):", accuracy_score(ytrain, y_predict_train))

# Print the accuracy from the testing data.
y_predict_test = clf_mnb.predict(Xtest)
print("[Test] Accuracy score (ytest, y_predict_test):", accuracy_score(ytest, y_predict_test))
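
Multinomial Naive Bayes is designed for count-like, non-negative features, whereas the features here are continuous and can be negative. A commonly used alternative for continuous inputs is Gaussian Naive Bayes. The sketch below swaps in `GaussianNB` on synthetic data; it is an illustrative alternative, not the model used in this notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic continuous features (including negative values), about 10% positives.
rng = np.random.default_rng(5)
n = 2000
y = (rng.random(n) < 0.1).astype(int)
X = rng.normal(size=(n, 3)) + y[:, None] * 1.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=5, stratify=y)

# GaussianNB models each feature as normally distributed within each class,
# so no rescaling to non-negative values is needed.
clf_gnb = GaussianNB().fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf_gnb.predict(X_train))
test_acc = accuracy_score(y_test, clf_gnb.predict(X_test))
print(f"[Train] accuracy: {train_acc:.3f}")
print(f"[Test]  accuracy: {test_acc:.3f}")
```

Whether this performs better than the rescaled multinomial model on the real data would need to be checked empirically.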


