<h2>Classification Case-Studies: Advanced Model Building - Strategies</h2>

<ul>
  <li>Data preprocessing</li>
  <li>Exploratory data analysis (EDA)</li>
  <li>Hyperparameter tuning</li>
  <li>Wrapper-based feature selection</li>
  <li>Classification using Logistic Regression, Decision Trees, and SVM</li>
</ul>

<h2>Case Study 2: Credit Defaulting Problem</h2>

Given a Taiwanese bank's database (<b>'taiwan-credit-data.xls'</b>) of credit loan history for its clients, develop a classification model that accurately estimates whether a client will pay back the loan or not. The dataset contains 23 features, described as follows:

<ul>
<li>X1: Amount of the given credit (NT dollar)</li>
<li>X2: Gender (1 = male; 2 = female)</li>
<li>X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)</li>
<li>X4: Marital status (1 = married; 2 = single; 3 = others)</li>
<li>X5: Age (years)</li>
<li>X6-X11: History of past payment</li>
<li>X12-X17: Amount of bill statement (NT dollar)</li>
<li>X18-X23: Amount of previous payment (NT dollar)</li>
</ul>

More details on the dataset (from the UCI Machine Learning Repository) can be found at the following link:
https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients

In [35]:
import warnings
warnings.filterwarnings('ignore')

<b style="color: blue;">Step 1: Load the dataset into pandas</b>

In [44]:
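import pandas as pd

# The sheet has two header rows: generic names (X1..X23, Y) and, below them,
# the descriptive names, which pandas reads as data row 0.
df = pd.read_excel('../datasets/taiwan-credit-data.xls')
column_names = list(df.iloc[0, :])  # keep the descriptive names for later
df.head()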

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [45]:
# Convert every column to numeric; cells that cannot be parsed become NaN
df = df.apply(pd.to_numeric, errors='coerce')
#df.info()

In [46]:
df = df.drop([0])          # drop the row that held the descriptive names
df.columns = column_names  # use those names as the column labels
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1.0,20000.0,2.0,2.0,1.0,24.0,2.0,2.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
2,2.0,120000.0,2.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
3,3.0,90000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
4,4.0,50000.0,2.0,2.0,1.0,37.0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
5,5.0,50000.0,1.0,2.0,1.0,57.0,-1.0,0.0,-1.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0


<b style="color: blue;">Step 2: Perform EDA using the following requirements</b>
<ol>
  <li>Count the number of rows with missing records and deal with the missing values accordingly</li>
  <li>Provide a boxplot and a density plot for each attribute (except Y) in the dataset (optional)</li>
  <li>Provide a bar plot that shows the number of data points per class label (see Y)</li>
</ol>
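Requirements 1 and 3 above can be sketched as follows. This is a minimal illustration on a small synthetic stand-in DataFrame (the values and the two injected gaps are made up); in the notebook, the same calls would be applied to <code>df</code>. Dropping incomplete rows is only one way to handle missing values and is an assumption here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit DataFrame (hypothetical values, not real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=200).astype(float),
    "AGE": rng.integers(21, 70, size=200).astype(float),
    "default payment next month": rng.integers(0, 2, size=200).astype(float),
})
demo.loc[[3, 17], "AGE"] = np.nan  # inject two missing values for illustration

# Requirement 1: count rows with at least one missing value, then drop them.
n_missing_rows = int(demo.isna().any(axis=1).sum())
print("rows with missing values:", n_missing_rows)
clean = demo.dropna()

# Requirement 3: number of data points per class label.
class_counts = clean["default payment next month"].value_counts().sort_index()
print(class_counts)

# Bar plot of the class counts (pandas plotting requires matplotlib).
ax = class_counts.plot.bar()
ax.set_xlabel("default payment next month")
ax.set_ylabel("number of clients")
```

If the classes turn out to be strongly imbalanced, accuracy alone will be a misleading metric in Step 6, which is one reason the bar plot is worth producing before modelling.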

<b style="color: blue;">Step 3: Identify features and the target variable in the problem</b>

In [47]:
X = df.drop(columns=['default payment next month','ID'])
y = df[['default payment next month']]

<b style="color: blue;">Step 4: Scale all features using a StandardScaler and split the dataset into training and test sets (80:20)</b>

In [48]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Sc = StandardScaler()
X_sc = Sc.fit_transform(X)  # standardize: zero mean, unit variance per feature
X_sc = pd.DataFrame(X_sc, columns=X.columns)  # wrap the array back into a DataFrame

X_train,X_test,y_train,y_test = train_test_split(X_sc,y,test_size=0.2,random_state=1234)
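One caveat with the cell above: fitting the scaler on all rows before splitting lets test-set statistics influence the training features. A possible alternative, sketched here on synthetic stand-in data (the column names <code>a</code>, <code>b</code>, <code>c</code> and all values are hypothetical), is to split first and fit the scaler on the training portion only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features/target (hypothetical; replace with X and y).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(50, 10, size=(100, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.integers(0, 2, size=100), name="target")

# Split first so the test rows never influence the scaler statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

scaler = StandardScaler()
# Learn mean/std on the training rows only, then reuse them for the test rows.
X_tr_sc = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_sc = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_sc.shape, X_te_sc.shape)
```

With a dataset of this size the difference in results is usually small, but the split-then-scale order keeps the test set strictly unseen.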

<b style="color: blue;">Step 5: Using SVM, Logistic Regression, and Decision Trees, perform hyperparameter tuning and build each model in its best possible configuration.</b>


In [None]:
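from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid (an assumption: the original cell never defined `parameters`)
parameters = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

svm = SVC()
svm_clf = GridSearchCV(svm, parameters, cv=5)
svm_clf.fit(X_train, y_train.values.ravel())  # flatten y to 1-D as scikit-learn expects
svm_clf.best_params_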
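Step 5 also asks for Logistic Regression and Decision Trees. The same <code>GridSearchCV</code> pattern could be applied to them as sketched below. The data is a synthetic stand-in from <code>make_classification</code>, and the grids are illustrative assumptions rather than tuned recommendations; in the notebook, <code>X_train</code> and <code>y_train</code> would take the place of <code>X_tr</code> and <code>y_tr</code>.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical); in the notebook use X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# Illustrative grids only; the value ranges are assumptions.
searches = {
    "logreg": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
        cv=5,
    ),
    "tree": GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
        cv=5,
    ),
}

# Run each 5-fold search and report the best configuration found.
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, each search object exposes the refitted best model as <code>best_estimator_</code>, which can be used directly for prediction.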

<b style="color: blue;">Step 6: Evaluate each model's performance on the test set and provide classification reports</b>
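A minimal sketch of this evaluation step, assuming a fitted classifier and a held-out test set. The data is again a synthetic stand-in, a plain Logistic Regression stands in for the tuned models, and the class names passed to <code>target_names</code> are labels chosen for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); in the notebook use the real split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# Any fitted classifier works here, e.g. the best_estimator_ of a grid search.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Per-class precision, recall, and F1 on the held-out test set.
report = classification_report(y_te, y_pred, target_names=["no default", "default"])
print(report)
```

Repeating the last three statements for each tuned model gives reports that can be compared side by side.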


<b style="color: blue;">Step 7: Retrain your models using wrapper-based feature selection</b><br/>
Evaluate whether you obtain a performance boost on the test set.
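
One way to approach this step is scikit-learn's <code>SequentialFeatureSelector</code>, a wrapper method that greedily adds features based on cross-validated performance of a chosen estimator. The sketch below uses synthetic stand-in data; the choice of Logistic Regression as the wrapped estimator, forward selection, and keeping 4 features are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); in the notebook use the real split.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# Forward selection: at each step, add the feature that most improves the CV score.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,  # assumed value, chosen only for illustration
    direction="forward",
    cv=5,
)
selector.fit(X_tr, y_tr)

# Keep only the selected columns in both splits.
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)

# Compare test accuracy with all features versus the selected subset.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
selected = LogisticRegression(max_iter=1000).fit(X_tr_sel, y_tr).score(X_te_sel, y_te)
print("selected feature mask:", selector.get_support())
print(f"all features: {baseline:.3f}, selected features: {selected:.3f}")
```

The selector is fitted on the training split only, so the comparison on the test set remains a fair estimate of whether the reduced feature set helps.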