# CS412 - Machine Learning - 2020

# Homework 2

100 pts

# Goal

The goal of this homework is to get familiar with feature handling and cross validation.


# Dataset

German Credit Risk dataset, prepared by Prof. Hoffman, classifies each person as having a good or bad credit risk. The dataset that we use consists of both numerical and categorical features.



# Task

Build a k-NN classifier with the scikit-learn library to classify people as bad or good risks for the German Credit dataset.

# Software

Documentation for the necessary functions can be accessed from the link below.

[http://scikit-learn.org/stable/supervised_learning.html](http://scikit-learn.org/stable/supervised_learning.html)

# Submission

Follow the instructions at the end.


# 1) Initialize

First, make a copy of this notebook in your drive.

In [None]:
# Mount your Google Drive so that you can reach the files stored in it
# Run this cell
# Open the link that will be shown below
# Select your Google Drive account, copy the authorization code, paste it into the output field and press Enter
# You can also follow the steps from this link:
# https://medium.com/ml-book/simplest-way-to-open-files-from-google-drive-in-google-colab-fae14810674

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 2) Load Dataset

To start working on your homework, copy the folder given in the link below to your own Google Drive. You will find the train and test data in this folder.

[https://drive.google.com/drive/folders/1DbW6VxLKZv2oqFn9SwxAnVadmn1_nPXi?usp=sharing](https://drive.google.com/drive/folders/1DbW6VxLKZv2oqFn9SwxAnVadmn1_nPXi?usp=sharing)

After copying the folder, copy the paths of the train and test datasets and paste them into the cell below to load your data.


In [None]:
import pandas as pd
from os.path import join

path_prefix = "/content/drive/My Drive"


train_df = pd.read_csv(join(path_prefix,'german_credit_train.csv'))
test_df = pd.read_csv(join(path_prefix,'german_credit_test.csv'))

# 3) Optional - Analyze the Dataset 

You can use the functions of the pandas library to analyze your train dataset in detail - **this part is OPTIONAL - look around the data as you wish**.


*   Display the number of instances and features in the train set **(the shape attribute can be used)**
*   Display 5 random examples from the train set **(the sample method can be used)**
*   Display information about each feature **(the info method can be used)**
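
The inspection steps above can be sketched on a small stand-in frame. This is only an illustration: `toy_df` and its values are invented here and are not the homework data, and the class-balance check is an extra that the list above does not ask for.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Duration": [12, 24, 9, 18],
    "Housing": ["A152", None, "A151", "A152"],
    "Risk": [1, 0, 1, 1],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = toy_df.shape

# Number of missing values per column
missing_per_column = toy_df.isnull().sum()

# Fraction of each label value, useful for spotting class imbalance
class_balance = toy_df["Risk"].value_counts(normalize=True)

print(n_rows, n_cols)
print(missing_per_column)
print(class_balance)
```

On the real data the same calls would be applied to `train_df` directly.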



In [None]:
# Print shape
print("Train data dimensionality: ")
print(train_df.shape)

# Print random 5 rows
print("Examples from train data: ")
print(train_df.sample(5))


Train data dimensionality: 
(800, 13)
Examples from train data: 
    AccountStatus  Duration CreditHistory  ...  OtherInstallPlans Housing Risk
585           A11        21           A32  ...               A143    A151    1
219           A11        12           A32  ...               A143    A152    1
216           A11        12           A32  ...               A143    A152    1
499           A14        24           A34  ...               A143    A152    1
782           A12        12           A34  ...               A143    A152    1

[5 rows x 13 columns]


In [None]:
# Print the information about the dataset
print("Information about train data:")
print (train_df.info())

Information about train data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   AccountStatus      800 non-null    object
 1   Duration           800 non-null    int64 
 2   CreditHistory      800 non-null    object
 3   CreditAmount       800 non-null    int64 
 4   SavingsAccount     800 non-null    object
 5   EmploymentSince    800 non-null    object
 6   PercentOfIncome    800 non-null    int64 
 7   PersonalStatus     800 non-null    object
 8   Property           800 non-null    object
 9   Age                800 non-null    int64 
 10  OtherInstallPlans  800 non-null    object
 11  Housing            720 non-null    object
 12  Risk               800 non-null    int64 
dtypes: int64(5), object(8)
memory usage: 81.4+ KB
None


# 4) Define your train and test labels

*  Define labels for both train and test data in new arrays 
*  Remove the label column from both train and test sets so that it is not used as a feature!


(**you can use pop method**)
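
As a minimal sketch of the `pop` hint, the snippet below separates the label from a toy frame in one step. `toy_df` and `toy_label` are hypothetical names with invented values, not the homework data.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration only.
toy_df = pd.DataFrame({"Duration": [12, 24, 9], "Risk": [1, 0, 1]})

# pop returns the column as a Series and removes it from the frame in place,
# so the label cannot leak into the features afterwards.
toy_label = toy_df.pop("Risk")

print(list(toy_df.columns))
print(toy_label.tolist())
```

Because `pop` modifies the frame in place, running the same cell twice raises a `KeyError` on the second run; using `drop(columns=...)` on a separate copy avoids that.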


In [None]:
# Define labels
train_label = train_df['Risk']
test_label = test_df['Risk']

#Remove label column
holding_train=train_df["Risk"]
holding_test=test_df["Risk"]


train_df=train_df.drop(columns=["Risk"])
test_df=test_df.drop(columns=["Risk"])


# 5) Handle missing values if any 

*   Print the columns that have **NaN** values (**isnull** method can be used)
*   You can impute missing values with mode of that feature or remove samples or attributes
*   To impute the test set, you should use the mode values that you obtain from **train** set, as **you should not be looking at your test data to gain any information or advantage.**
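
The point about using the train mode for the test set can be illustrated with a small sketch. The frames below are invented for this example; they are chosen so that the test set's own mode differs from the train mode.

```python
import pandas as pd

# Toy stand-ins; in the test frame the most frequent value is A153, not A152.
toy_train = pd.DataFrame({"Housing": ["A152", "A152", "A151", None]})
toy_test = pd.DataFrame({"Housing": [None, "A153", "A153"]})

# The mode is computed from the training column only (NaN values are ignored).
train_mode = toy_train["Housing"].mode()[0]

# The same train-derived value fills the gaps in both sets.
toy_train["Housing"] = toy_train["Housing"].fillna(train_mode)
toy_test["Housing"] = toy_test["Housing"].fillna(train_mode)

print(train_mode)
print(toy_test["Housing"].tolist())
```

The missing test value is filled with the train mode even though a different category is more frequent in the test frame, which is exactly what keeps test information out of the preprocessing.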



In [None]:
# Print columns with NaN values

print("Train Set:")
print(train_df.isnull().any())

# train_df.isnull().sum()
# The train set has 80 NaN values in the Housing column
print ("Test Set:")
print (test_df.isnull().any())
print ("Only Housing returned True so we have NaN values only in Housing column for both train and test sets")

Train Set:
AccountStatus        False
Duration             False
CreditHistory        False
CreditAmount         False
SavingsAccount       False
EmploymentSince      False
PercentOfIncome      False
PersonalStatus       False
Property             False
Age                  False
OtherInstallPlans    False
Housing               True
dtype: bool
Test Set:
AccountStatus        False
Duration             False
CreditHistory        False
CreditAmount         False
SavingsAccount       False
EmploymentSince      False
PercentOfIncome      False
PersonalStatus       False
Property             False
Age                  False
OtherInstallPlans    False
Housing               True
dtype: bool
Only Housing returned True so we have NaN values only in Housing column for both train and test sets


In [None]:
# Impute missing values by replacing them with the mode value

# The mode is computed on the train set only and reused for the test set
housing_mode = train_df["Housing"].mode()[0]

train_df["Housing"] = train_df["Housing"].fillna(housing_mode)
test_df["Housing"] = test_df["Housing"].fillna(housing_mode)

# Check if we still have NaN values
# train_df.isnull().sum()
# test_df.isnull().sum()


# 6) Transform categorical / ordinal features

* Transform all categorical / ordinal features using the methods that you have learnt in lectures and recitation 4 for both train and test data
* You saw the dictionary use for mapping in recitation. (You can use **replace function** to assign new values to the categories of a column).

*  The categories of the categorical attributes in the dataset are defined as follows:
   - Status of existing checking account
     - A11 :      ... <    0 DM
     - A12 : 0 <= ... <  200 DM
     - A13 :      ... >= 200 DM / salary assignments for at least 1 year
     - A14 : no checking account

   - Credit history
     - A30 : no credits taken/all credits paid back duly
     - A31 : all credits at this bank paid back duly
     - A32 : existing credits paid back duly till now
     - A33 : delay in paying off in the past
     - A34 : critical account/other credits existing (not at this bank)

   - Savings account
     - A61 :          ... <  100 DM
     - A62 :   100 <= ... <  500 DM
     - A63 :   500 <= ... < 1000 DM
     - A64 :          ... >= 1000 DM
     - A65 :   unknown/ no savings account

   - Employment Since
     - A71 : unemployed
     - A72 :       ... < 1 year
     - A73 : 1  <= ... < 4 years
     - A74 : 4  <= ... < 7 years
     - A75 :       ... >= 7 years

   - Personal Status
     - A91 : male   : divorced/separated
     - A92 : female : divorced/separated/married
     - A93 : male   : single
     - A94 : male   : married/widowed
     - A95 : female : single

   - Property
     - A121 : real estate
     - A122 : if not A121 : building society savings agreement/life insurance
     - A123 : if not A121/A122 : car or other, not in attribute 6
     - A124 : unknown / no property

   - OtherInstallPlans
     - A141 : bank
     - A142 : stores
     - A143 : none

   - Housing
     - A151 : rent
     - A152 : own
     - A153 : for free
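
For attributes whose categories have a natural order, a dictionary mapping can be sketched as follows. The toy frame is invented for illustration, and the choice of integer codes (0 to 4 in the listed order) is an assumption of this sketch rather than a required encoding.

```python
import pandas as pd

# Toy stand-in for one ordinal column; the values are invented for illustration.
toy_train = pd.DataFrame({"EmploymentSince": ["A71", "A73", "A75", "A72"]})

# Ordinal codes follow the order of the category definitions above.
emp_since_order = {"A71": 0, "A72": 1, "A73": 2, "A74": 3, "A75": 4}

toy_train["EmploymentSince"] = toy_train["EmploymentSince"].map(emp_since_order)

# map turns any category that is missing from the dictionary into NaN,
# so this check catches typos in the mapping early.
assert toy_train["EmploymentSince"].notnull().all()

print(toy_train["EmploymentSince"].tolist())
```

Unlike `replace`, which leaves unmapped values untouched, `map` makes a forgotten category visible as NaN. Attributes without a natural order (for example Personal Status) are usually better handled with one-hot encoding than with integer codes, because k-NN would otherwise treat the code differences as distances.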

In [None]:
# Transform the categorical / ordinal attributes

account_status_transformed={'A11':0,"A12" : 1,"A13" :2, "A14" : 3 }
train_df["AccountStatus"] = train_df["AccountStatus"].replace(account_status_transformed)
test_df["AccountStatus"] = test_df["AccountStatus"].replace(account_status_transformed)


## ASK ABOUT THIS PART
credit_history_transformed={ "A30" : 4,"A31" : 3,"A32" : 2,"A33" : 1, "A34" : 0}
train_df["CreditHistory"] = train_df["CreditHistory"].replace(credit_history_transformed)
test_df["CreditHistory"] = test_df["CreditHistory"].replace(credit_history_transformed)


savings_account_transformed={"A61" : 0,"A62" : 1, "A63" : 2, "A64" : 3, "A65" : 4}
train_df["SavingsAccount"] = train_df["SavingsAccount"].replace(savings_account_transformed)
test_df["SavingsAccount"] = test_df["SavingsAccount"].replace(savings_account_transformed)


emp_since_transformed={ "A71" : 0,"A72" : 1, "A73" : 2, "A74" : 3,"A75" : 4}
train_df["EmploymentSince"]=train_df["EmploymentSince"].replace(emp_since_transformed)
test_df["EmploymentSince"]=test_df["EmploymentSince"].replace(emp_since_transformed)


pers_status_transformed={ "A91" : "male : divorced/separated","A92" : "female : divorced/separated/married","A93" : "male : single","A94" : "male : married/widowed","A95" : "female : single"}
train_df['PersonalStatus']=train_df['PersonalStatus'].replace(pers_status_transformed)
test_df['PersonalStatus']=test_df['PersonalStatus'].replace(pers_status_transformed)


property_transformed={"A121" : 3,"A122" :2,"A123" : 1,"A124" : 0}
train_df["Property"]=train_df["Property"].replace(property_transformed)
test_df["Property"]=test_df["Property"].replace(property_transformed)


OtherInstallPlans_transformed={"A141":"bank","A142":"stores","A143":"none"}
train_df['OtherInstallPlans']=train_df['OtherInstallPlans'].replace(OtherInstallPlans_transformed)
test_df['OtherInstallPlans']=test_df['OtherInstallPlans'].replace(OtherInstallPlans_transformed)


Housing_transformed={"A151":"rent", "A152":"own","A153":"for free"}
train_df['Housing']=train_df['Housing'].replace(Housing_transformed)
test_df['Housing']=test_df['Housing'].replace(Housing_transformed)


train_df.head()
#test_df.head()                                                       






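|   | AccountStatus | Duration | CreditHistory | CreditAmount | SavingsAccount | EmploymentSince | PercentOfIncome | PersonalStatus | Property | Age | OtherInstallPlans | Housing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 12 | 2 | 2859 | 4 | 0 | 4 | male : single | 0 | 38 | none | own |
| 1 | 0 | 9 | 2 | 2136 | 0 | 2 | 3 | male : single | 3 | 25 | none | own |
| 2 | 0 | 18 | 0 | 5302 | 0 | 4 | 2 | male : single | 0 | 36 | none | for free |
| 3 | 0 | 14 | 2 | 8978 | 0 | 4 | 1 | male : divorced/separated | 2 | 45 | none | own |
| 4 | 3 | 15 | 2 | 4623 | 1 | 2 | 3 | male : single | 2 | 40 | none | own |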


In [None]:
from sklearn.preprocessing import OneHotEncoder

# Fit one encoder per nominal column on the train set only.
# categories_ is a list with one array per encoded column, so [0] selects the
# category names of the single column and gives plain string column names.
enc1 = OneHotEncoder(handle_unknown='ignore')
dummies_personal = enc1.fit_transform(train_df[['PersonalStatus']]).toarray()
dummies_personal = pd.DataFrame(dummies_personal, columns=enc1.categories_[0])

enc2 = OneHotEncoder(handle_unknown="ignore")
dummies_housing = enc2.fit_transform(train_df[['Housing']]).toarray()
dummies_housing = pd.DataFrame(dummies_housing, columns=enc2.categories_[0])

enc3 = OneHotEncoder(handle_unknown="ignore")
dummies_plans = enc3.fit_transform(train_df[['OtherInstallPlans']]).toarray()
dummies_plans = pd.DataFrame(dummies_plans, columns=enc3.categories_[0])


# Test set: reuse the encoders fitted on the train set (transform only, no fit)
dummies_personal_test = enc1.transform(test_df[['PersonalStatus']]).toarray()
dummies_personal_test = pd.DataFrame(dummies_personal_test, columns=enc1.categories_[0])

dummies_housing_test = enc2.transform(test_df[['Housing']]).toarray()
dummies_housing_test = pd.DataFrame(dummies_housing_test, columns=enc2.categories_[0])

dummies_plans_test = enc3.transform(test_df[['OtherInstallPlans']]).toarray()
dummies_plans_test = pd.DataFrame(dummies_plans_test, columns=enc3.categories_[0])




In [None]:
## MERGE DATASETS WITH DUMMY VARIABLES AND DROP ORIGINAL COLUMNS

train_df=pd.merge(train_df,dummies_housing,right_index=True,left_index=True)
train_df=train_df.drop(columns="Housing")



train_df=pd.merge(train_df,dummies_personal,right_index=True,left_index=True)
train_df=train_df.drop(columns="PersonalStatus")



train_df=pd.merge(train_df,dummies_plans,right_index=True,left_index=True)
train_df=train_df.drop(columns="OtherInstallPlans")



test_df=pd.merge(test_df,dummies_housing_test,right_index=True,left_index=True)
test_df=test_df.drop(columns="Housing")



test_df=pd.merge(test_df,dummies_personal_test,right_index=True,left_index=True)
test_df=test_df.drop(columns="PersonalStatus")


test_df=pd.merge(test_df,dummies_plans_test,right_index=True,left_index=True)
test_df=test_df.drop(columns="OtherInstallPlans")
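
The encode-and-merge steps can also be written as one small helper. The sketch below is an alternative formulation on invented toy frames, not the notebook's own code: `add_dummies`, `toy_train` and `toy_test` are hypothetical names, and `pd.concat` is used here in place of `pd.merge`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins; "for free" appears only in the test frame on purpose.
toy_train = pd.DataFrame({"Age": [38, 25, 36], "Housing": ["own", "rent", "own"]})
toy_test = pd.DataFrame({"Age": [45, 40], "Housing": ["rent", "for free"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(toy_train[["Housing"]])  # categories are learned from the train set only


def add_dummies(df):
    """Replace the Housing column of df with its one-hot columns."""
    dummies = pd.DataFrame(
        enc.transform(df[["Housing"]]).toarray(),
        columns=enc.categories_[0],  # plain string column names
        index=df.index,              # keep rows aligned with df
    )
    return pd.concat([df.drop(columns="Housing"), dummies], axis=1)


toy_train_enc = add_dummies(toy_train)
toy_test_enc = add_dummies(toy_test)

print(toy_train_enc)
print(toy_test_enc)
```

With `handle_unknown="ignore"`, a category that was never seen in training (here "for free") is encoded as all zeros instead of raising an error, and both frames end up with identical columns in the same order, which k-NN needs at prediction time.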


# 7) Build a k-NN classifier on training data and perform model selection using 5-fold cross validation

*  Initialize k-NN classifiers with **k= 5, 10, 15**
*  Calculate the cross validation scores using the cross_val_score method; the number of folds is 5.
*  Note: Cross validation (Xval) is performed on training data! Do not use test data in any way and do not separate a hold-out validation set; use cross-validation instead.

Documentation of the cross_val_score method:

[https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

*  Store the average accuracies of these folds
*  Select the value of k using the cross validation results. 
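
A minimal sketch of this selection loop, which also keeps the standard deviation across folds, is shown below. It runs on synthetic data from `make_classification` as a stand-in for the training set, so the numbers it prints say nothing about the German Credit data; `cv_summary` and `best_k` are names introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training set (not the German Credit data).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

cv_summary = {}
for k in [5, 10, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # One accuracy value per fold; cv=5 gives an array of length 5.
    scores = cross_val_score(knn, X_toy, y_toy, cv=5)
    cv_summary[k] = (scores.mean(), scores.std())

# Pick the k with the highest mean cross validation accuracy.
best_k = max(cv_summary, key=lambda k: cv_summary[k][0])

for k, (mean_acc, std_acc) in cv_summary.items():
    print(f"k={k}: mean={mean_acc:.3f}, std={std_acc:.3f}")
print("selected k:", best_k)
```

Keeping the standard deviation next to the mean helps judge whether the difference between two k values is larger than the fold-to-fold variation.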

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# k values
kVals = [5,10,15]

# Save the accuracies of each value of kVal in [accuracies] variable
accuracies = []

# Loop over values of k for the k-Nearest Neighbor classifier
for k in kVals:
  # Initialize a k-NN classifier with k neighbors
  knn = KNeighborsClassifier(n_neighbors=k)
  # Calculate the 5 fold cross validation scores using cross_val_score
  # cv parameter: number of folds, in our case it must be 5
  scores = cross_val_score(knn,train_df,train_label,cv=5)
  # Store the average accuracy of the scores in the accuracies variable, you can use the mean method
  accuracies.append(scores.mean())


print(accuracies)

[0.68, 0.7074999999999999, 0.7100000000000001]


# 8) Retrain using all training data and test on test set

* Train a classifier with the chosen k value of the best classifier using **all training data**.

Note: k-NN involves no explicit training, but this is what we would do after model selection with decision trees or any other ML approach. In the previous step we had 5 different models (one for each fold) for each k, and we do not know which one to submit. Even if we picked the best one, it would not use all training samples.

* Predict the labels of testing data 

* Report the accuracy 
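
The select-then-retrain workflow described above is also what scikit-learn's `GridSearchCV` does when `refit=True` (its default). The sketch below shows this as an alternative to the manual loop, on synthetic data with an invented train/test split; it is not a required part of the homework.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, split into a training part and a held-out test part.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# 5-fold cross validation over the candidate k values, on training data only.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [5, 10, 15]}, cv=5)

# With refit=True (the default), the best k is retrained on all of X_tr.
search.fit(X_tr, y_tr)

chosen_k = search.best_params_["n_neighbors"]
# score uses the refitted best estimator; for classifiers this is accuracy.
test_accuracy = search.score(X_te, y_te)

print("chosen k:", chosen_k)
print("test accuracy:", test_accuracy)
```

The test part is touched only once, after the value of k has been fixed, which mirrors the order of steps 7 and 8.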

In [None]:
from sklearn.metrics import accuracy_score
import numpy as np

# Train the best classifier using all training set
best_k = kVals[np.argmax(accuracies)]
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(train_df, train_label)

# Estimate the prediction of the test data
prediction = best_knn.predict(test_df)

# Print accuracy of test data (accuracy_score expects y_true first, then y_pred)
accuracy_sc = accuracy_score(test_label, prediction)
print("The accuracy score is: ", accuracy_sc)



The accuracy score is:  0.665


# 9) Bonus (5pts)

There is a limited bonus for any extra work that you do to improve the above results.

You may try larger k values, scale input features, remove some features, .... Please **do not overdo it**; maybe spend another 30-60 min on this. The idea is not to do an exhaustive search (which won't help your understanding of the ML process), but just to give some extra points to those who look at the problem a little more comprehensively.

**If you obtain better results than the above, please indicate the best model you have found and the corresponding accuracy.**

E.g. using feature normalization ..... and removing .... features and using a value k=...., I have obtained ....% accuracy.
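
As one possible direction for the bonus, feature scaling can be combined with k-NN in a pipeline so that the scaler is fitted inside each cross validation fold. This is only a sketch on synthetic data: the inflated first feature imitates a large-valued column such as a credit amount next to small integer codes, and no claim is made that scaling improves the results on the homework data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set (not the German Credit data).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_toy[:, 0] *= 1000.0  # one feature on a much larger scale than the others

# Without scaling, the large feature dominates the Euclidean distances.
raw_knn = KNeighborsClassifier(n_neighbors=15)
raw_scores = cross_val_score(raw_knn, X_toy, y_toy, cv=5)

# The pipeline refits the scaler on the training part of every fold,
# so the validation part never influences the scaling statistics.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
scaled_scores = cross_val_score(scaled_knn, X_toy, y_toy, cv=5)

print("raw mean accuracy:   ", raw_scores.mean())
print("scaled mean accuracy:", scaled_scores.mean())
```

Scaling the whole training frame once before calling `cross_val_score` would leak validation-fold statistics into the scaler; the pipeline avoids that.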


# 10) Notebook & Report

**Notebook:** We may just look at your notebook results; so make sure each cell is run and outputs are there.

**Report:** Write an at most 1/2 page summary of your approach to this problem at the end of your notebook; this should be like an abstract of a paper or the executive summary.

**Must include statements such as:**

(Include the problem definition: 1-2 lines)

(Talk about any preprocessing you do: how you handle missing values and categorical features)

(Give the average validation accuracies for the different k values and the standard deviations across the 5 folds for each k value; state which one you selected)

(State what your test results are with the chosen method and parameters, e.g. "We have obtained the best results with the ….. classifier (parameters=....), giving classification accuracy of …% on test data….")

State if there is any **bonus** work...

You will get full points from here as long as you have a good (enough) summary of your work, regardless of your best performance or what you have decided to talk about in the last few lines.

# <font color="coral"> **REPORT** </font>
<font color= "turquoise">We are given a credit risk classification problem to solve with the k-NN classifier. First I separated the target attribute, which is the "Risk" column. Then I found the column with NaN values, which was "Housing", and filled the NaN values with the mode of that column computed on the train set. After that I divided the non-numeric columns into ordinal and categorical features. I treated only "**Personal Status**", "**Other Install Plans**" and "**Housing**" as **categorical** and used the "**One Hot Encoder**" to turn them into numeric values; the other columns were mapped to ordinal integer codes. Before choosing the best k, I ran 5-fold cross validation on the training data for k = 5, 10 and 15 and obtained mean accuracies of 0.68, 0.7075 and 0.71, respectively. Finally I selected k=15, which had the highest cross validation accuracy, and got an accuracy score of **0.665** on the test data.</font>

# 11) Submission

Please submit your **"share link" INLINE in Sucourse submissions**. That is, we should be able to click on the link, go there and run (and possibly also modify) your code.

For us to be able to modify it, in case of errors etc., **you should get your "share link" as "share with anyone in edit mode"**.

 **Also submit your notebook as a PDF attachment**: choose print and save as PDF, and save it as hw2-lastname-firstname.pdf to facilitate grading.
