**Universal Bank has recently trialed a marketing campaign to sell their new CD account product to existing customers. They contacted 5000 of their non-CD account customers with an offer. The data provided in universal.csv is the result of this market test.**

**Use the techniques covered in this class to load and clean the data. Then, identify the best predictive model (using only the models covered thus far). Use RandomizedSearchCV combined with GridSearchCV to identify the best parameters for each model tested.**

**Be sure to document your thought process using markdown. Think of this as a report that your manager will read. This assignment requires you to decide how to process the provided data best (i.e., encoding). Be sure to provide your arguments/observations in markdown as you progress through data preparation, fitting, and performance evaluation.**

**Add a conclusions/discussion section that summarizes the performance of each of the models you tested, and indicate which is the best model. To accomplish this, you must decide (and explain why) which evaluation metric you will use to choose 'the best' model.**

**Details of the data**

Id: Customer ID

Age: Customer's age in completed years.  

Experience: Number of years of professional experience.

Income: Annual income of the customer ($000s).

Family Size: Family size of the customer.

CCAvg: Average spending on credit cards per month ($000s).

Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional

Mortgage: Value of house mortgage if any ($000s).

Personal Loan: (1 if customer has personal loan with bank, 0 otherwise)

Securities Account: (1 if customer has securities account with bank, 0 otherwise)  

CD Account: (1 if customer has certificate of deposit (CD) account with bank, 0 otherwise)  

Online Banking: (1 if customer uses Internet banking facilities, 0 otherwise)  

Credit Card: (1 if customer uses credit card issued by Universal Bank, 0 otherwise) 

* Importing necessary packages

In [18]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


np.random.seed(86089106)

* Loading the data

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/prof-tcsmith/data/master/UniversalBank.csv')
df.head(5)

,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


* Exploring the data

In [4]:
# Explore the dataset 

print(df.describe())
print(df.info())

                ID          Age   Experience       Income      ZIP Code  \
count  5000.000000  5000.000000  5000.000000  5000.000000   5000.000000   
mean   2500.500000    45.338400    20.104600    73.774200  93152.503000   
std    1443.520003    11.463166    11.467954    46.033729   2121.852197   
min       1.000000    23.000000    -3.000000     8.000000   9307.000000   
25%    1250.750000    35.000000    10.000000    39.000000  91911.000000   
50%    2500.500000    45.000000    20.000000    64.000000  93437.000000   
75%    3750.250000    55.000000    30.000000    98.000000  94608.000000   
max    5000.000000    67.000000    43.000000   224.000000  96651.000000   

            Family        CCAvg    Education     Mortgage  Personal Loan  \
count  5000.000000  5000.000000  5000.000000  5000.000000    5000.000000   
mean      2.396400     1.937938     1.881000    56.498800       0.096000   
std       1.147663     1.747659     0.839869   101.713802       0.294621   
min       1.000000  
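
The `describe()` output above reports a minimum `Experience` of -3, which cannot be a real number of years. The notebook leaves these values as they are. As a minimal sketch on made-up data (the `toy` frame and the `Experience_clipped` column are hypothetical, and clipping to zero is only one possible treatment, not the one used here), such rows could be counted and handled like this:

```python
import pandas as pd

# Made-up stand-in for the real data; values are for illustration only
toy = pd.DataFrame({"Age": [23, 24, 45, 60], "Experience": [-3, -1, 20, 35]})

# Count rows where Experience is negative (an impossible value)
n_negative = int((toy["Experience"] < 0).sum())
print(n_negative)

# One possible treatment (an assumption): clip negative values to zero
toy["Experience_clipped"] = toy["Experience"].clip(lower=0)
print(toy)
```

Whether to clip, drop, or keep such rows is a judgment call that should be stated in the report.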

* Clean and Transform data

In [5]:
# Based on the data exploration, some column names have leading whitespace, so strip it from every name
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

* Implementing one-hot encoding for Education

In [6]:
df['Education'] = df['Education'].replace({1: "Undergrad", 2: 'Graduate', 3: 'Advanced/Professional'})

In [7]:
edu_dummies = pd.get_dummies(df['Education'], prefix='Education', drop_first=False)
df = df.join(edu_dummies)
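
To make the encoding step concrete, here is a small sketch on made-up data (the `toy` frame is hypothetical). It shows that each row ends up with exactly one of the three indicator columns set to 1. Passing `dtype=int` is an addition for this sketch so that the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up Education codes using the same 1/2/3 scheme as the dataset
toy = pd.DataFrame({"Education": [1, 2, 3, 1]})

# Map the numeric codes to readable labels, as in the cell above
labels = {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}
toy["Education"] = toy["Education"].map(labels)

# One indicator column per level; drop_first defaults to False, so all three are kept
dummies = pd.get_dummies(toy["Education"], prefix="Education", dtype=int)
print(dummies)
```

Keeping all three columns is redundant for linear models (any one column is implied by the other two), so `drop_first=True` may be worth considering depending on the models tested.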

* Drop the columns we are not using as predictors 

In [8]:
df = df.drop(columns=['ID', 'ZIP Code', 'Education'])

* View data

In [11]:
df.head(5)

,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard,Education_Advanced/Professional,Education_Graduate,Education_Undergrad
0,25,1,49,4,1.6,0,0,1,0,0,0,0,0,1
1,45,19,34,3,1.5,0,0,1,0,0,0,0,0,1
2,39,15,11,1,1.0,0,0,0,0,0,0,0,0,1
3,35,9,100,1,2.7,0,0,0,0,0,0,0,1,0
4,35,8,45,4,1.0,0,0,0,0,0,1,0,1,0


## Splitting data into training and test sets

In [14]:
# construct datasets for analysis

# split the data into training (70%) and test (30%) sets
train_df, test_df = train_test_split(df, test_size=0.3)

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'CD Account'
predictors = list(df.columns)
predictors.remove(target)
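
The split above is purely random. If the target turns out to be imbalanced, one optional variation is a stratified split, which keeps the class proportions the same in both sets. The sketch below uses made-up data (the `toy` frame, its 10% positive rate, and `random_state=0` are assumptions for illustration, not properties of the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 10 of which are positive for the target
toy = pd.DataFrame({"Income": range(100), "CD Account": [1] * 10 + [0] * 90})

# stratify keeps the share of positives equal in the train and test sets
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, stratify=toy["CD Account"], random_state=0
)

print(train_toy["CD Account"].mean(), test_toy["CD Account"].mean())
```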

* Conducting data preparation that should be done after split

In [15]:
# Check for columns with missing values; an empty list means no imputation is needed

numeric_cols_with_nas = list(train_df.isna().sum()[train_df.isna().sum() > 0].index)
numeric_cols_with_nas

[]
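
No columns have missing values, so nothing needs to be imputed here. For reference, had the list been non-empty, imputation would belong after the split so that the fill values come from the training set only. A minimal sketch on made-up data (the `train_toy` and `test_toy` frames are hypothetical, and the median strategy is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train/test frames with one missing value each
train_toy = pd.DataFrame({"Income": [40.0, np.nan, 60.0, 100.0]})
test_toy = pd.DataFrame({"Income": [np.nan, 80.0]})

# Learn the median from the training set only
imputer = SimpleImputer(strategy="median")
train_toy[["Income"]] = imputer.fit_transform(train_toy[["Income"]])

# Apply the training median to the test set, without refitting
test_toy[["Income"]] = imputer.transform(test_toy[["Income"]])

print(train_toy)
print(test_toy)
```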

* Standardizing numeric values

In [26]:
# creating a common scale between the numeric columns by standardizing each numeric column

# create a standard scaler and fit it to the training set of predictors
scaler = preprocessing.StandardScaler()
cols_to_stdize = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Family']               
               
# Fit the scaler on the training set only, then apply the same transformation to the test set
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])  # learns mean/std from the training data
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])  # reuses the training mean/std
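
As a quick sanity check of what standardizing does, the sketch below uses made-up data (the `train_toy` and `test_toy` frames are hypothetical). After fitting on the training set, the training column has mean 0 and standard deviation 1, while the test column is shifted and scaled with the training statistics and therefore need not have mean 0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up train/test frames
train_toy = pd.DataFrame({"Income": [10.0, 20.0, 30.0]})
test_toy = pd.DataFrame({"Income": [20.0, 40.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from training data
test_scaled = scaler.transform(test_toy)        # reuses the training mean and std

print(scaler.mean_)                       # mean learned from the training data
print(train_scaled.mean(), train_scaled.std())
print(test_scaled.ravel())
```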

* Save the data

In [27]:
X_train = train_df[predictors]
y_train = train_df[target]
X_test = test_df[predictors]
y_test = test_df[target]

X_train.to_csv('universal-bank-X-train-data.csv', index=False)
y_train.to_csv('universal-bank-y-train-data.csv', index=False)
X_test.to_csv('universal-bank-X-test-data.csv', index=False)
y_test.to_csv('universal-bank-y-test-data.csv', index=False)
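
Because the files are written with `index=False`, the original row labels are not saved, and the X and y files line up by row position only. A sketch of how a target file could be read back in the modelling notebook, using a made-up series and a temporary path (both hypothetical):

```python
import os
import tempfile
import pandas as pd

# Made-up target series, written the same way as above
y_toy = pd.Series([0, 1, 0], name="CD Account")
path = os.path.join(tempfile.mkdtemp(), "toy-y-train-data.csv")
y_toy.to_csv(path, index=False)

# read_csv returns a one-column DataFrame; select the column to get a Series back
y_loaded = pd.read_csv(path)["CD Account"]
print(y_loaded)
```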


In [28]:
X_train

,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,Online,CreditCard,Education_Advanced/Professional,Education_Graduate,Education_Undergrad
2243,0.764311,0.699395,0.114708,0.535093,-0.142208,0.895370,0,0,1,1,0,1,0
3131,0.152303,0.174568,-0.279606,0.535093,0.437125,1.069707,0,0,1,1,0,1,0
1351,1.201459,1.311695,0.224239,-1.203027,-0.084275,-0.557438,0,0,1,1,1,0,0
2741,-1.421432,-1.487387,-0.542482,-1.203027,-0.258075,-0.557438,0,0,0,0,0,0,1
2939,0.764311,0.349510,-1.068233,1.404152,-0.895341,-0.557438,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2551,0.064873,0.174568,-0.673920,-0.333967,0.089525,-0.557438,0,1,1,1,1,0,0
3834,0.239732,0.174568,-1.002514,-1.203027,-0.316008,0.168966,0,0,1,0,1,0,0
4053,-0.896854,-0.787617,0.355677,-0.333967,-1.127074,-0.557438,0,0,0,0,0,0,1
3635,1.114030,1.136752,-1.090140,-0.333967,-0.837408,-0.557438,0,0,1,0,0,1,0


## Conclusion

* In this notebook, I cleaned the data using the techniques discussed in class and exported the training and test sets to CSV files, which will be used in the modelling notebook.
 