# **CS412 - Machine Learning - 2021 Summer**
## **Homework 1**
100 pts


## **Goal**

The goal of this homework is three-fold:

*   Introduction to the machine learning experimental setup
*   Gain experience with the decision tree approach
*   Gain experience with the scikit-learn library

## **Dataset**
This dataset is taken from the following Kaggle link and simplified for Homework 1: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=aug_train.csv

**Download the data from Sucourse.**
You must use 20% of the data for validation and 20% for test: **Training: 60%, Validation: 20%, Test: 20%**

## **Task**
Build a decision tree classifier with scikit-learn function calls to predict whether a data scientist candidate is going to look for a new job or will stay with the company. The **target** column is the target variable.

**You should read the document "cs412_hw1_desc" to understand the task fully.**

## **Software: You may find the necessary function references here:**
http://scikit-learn.org/stable/supervised_learning.html

## **Submission:**
Fill in this notebook and submit this document with a link to your Colab notebook
(make sure to include the link obtained from the share link at the top right)


## 0) Initialize

*   First make a copy of the notebook given to you as a starter.

*   Make sure you choose Connect from the upper right.

*   You may upload the data to the section on your left on Colab, then right-click on the .csv file and get the path of the file by clicking on "Copy Path". You will use it when loading the data.


## 1) Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from os.path import join
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics


## 2) Load training dataset

*  Read the .csv file with the pandas library



In [None]:
# Read data
from google.colab import drive
drive.mount("./drive")

path_prefix = "./drive/My Drive"



Mounted at ./drive


In [None]:
fname = "data_scientist_job_change.csv"
df = pd.read_csv(join(path_prefix, fname))

df.head()

Unnamed: 0,city_development_index,relevent_experience,education_level,experience,training_hours,target
0,0.92,1,2.0,25.0,36,1.0
1,0.776,0,2.0,15.0,47,0.0
2,0.624,0,2.0,5.0,83,0.0
3,0.789,0,2.0,0.0,52,1.0
4,0.767,1,3.0,25.0,8,0.0


## 3) Understanding the dataset

There are a lot of functions that can be used to learn more about this dataset:

- What is the shape of the training set (num of samples X number of attributes) ***[shape function can be used]***

- Display attribute names ***[columns function can be used]***

- Display the first 5 rows from training dataset ***[head or sample functions can be used]***

- Display number of nan values on each column ***[.isna() method can be used]***

Note: Understanding the features, possibly removing some features etc. is an important part in building an ML system, but for this homework this is not really necessary as the features are already transformed and simplified.


In [None]:
# print shape
print('Data Dimensionality: ',df.shape )



# print attribute (column) names
print('Head of Data: ', df.columns)



# print nan values for each column
print('NaN values: ',df.isna().sum() )


Data Dimensionality:  (19158, 6)
Head of Data:  Index(['city_development_index', 'relevent_experience', 'education_level',
       'experience', 'training_hours', 'target'],
      dtype='object')
NaN values:  city_development_index      0
relevent_experience         0
education_level           460
experience                 65
training_hours              0
target                      0
dtype: int64
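
To judge which features are binary, ordinal, or real-valued, it can help to count the distinct values in each column. The sketch below is only an illustration: `sample` is a toy frame built from the five rows shown by `df.head()` above, and `summary` is a hypothetical name. On the real data, `df` would take the place of `sample`.

```python
import pandas as pd

# Toy frame copied from the five rows displayed by df.head() above
sample = pd.DataFrame({
    "city_development_index": [0.92, 0.776, 0.624, 0.789, 0.767],
    "relevent_experience": [1, 0, 0, 0, 1],
    "education_level": [2.0, 2.0, 2.0, 2.0, 3.0],
    "experience": [25.0, 15.0, 5.0, 0.0, 25.0],
    "training_hours": [36, 47, 83, 52, 8],
    "target": [1.0, 0.0, 0.0, 1.0, 0.0],
})

# One row per column: its dtype and how many distinct values it takes
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
})
print(summary)
```

A column with exactly two distinct values is a candidate binary feature, while a column with a handful of ordered levels is a candidate ordinal feature.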


## 4) Handling Missing Data
Simply drop the rows with NaN values, or fill the NaN values with the mode, median, or mean of the column. The .dropna() method can be used to drop rows.

In [None]:
###
#print(df.mean())
#df['education_level'] = df['education_level'].fillna((df['education_level'].mean()))
#df['experience'] = df['experience'].fillna((df['experience'].mean()))
#print(df.isna().sum())

print(df.isna().sum())

df.dropna(subset=["education_level"], inplace = True)
df = df.reset_index(drop=True)
print("Missing values in column: ",df.education_level.isna().sum())

df.dropna(subset=["experience"], inplace = True)
df = df.reset_index(drop=True)
print("Missing values in column: ",df.experience.isna().sum())
print(df.isna().sum())

city_development_index      0
relevent_experience         0
education_level           460
experience                 65
training_hours              0
target                      0
dtype: int64
Missing values in column:  0
Missing values in column:  0
city_development_index    0
relevent_experience       0
education_level           0
experience                0
training_hours            0
target                    0
dtype: int64
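
As an alternative to dropping rows, the NaN values could be filled instead. The following is a minimal sketch on a made-up frame (`toy` and `filled` are hypothetical names, and the values are invented for illustration). It fills the ordinal `education_level` with its mode and `experience` with its median.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column (illustration only)
toy = pd.DataFrame({
    "education_level": [2.0, np.nan, 3.0, 2.0],
    "experience": [25.0, 15.0, np.nan, 5.0],
})

filled = toy.copy()
# mode() returns a Series because ties are possible; take the first entry
filled["education_level"] = filled["education_level"].fillna(
    filled["education_level"].mode()[0]
)
# The median is less sensitive to outliers than the mean
filled["experience"] = filled["experience"].fillna(filled["experience"].median())

print(filled)
print(filled.isna().sum())
```

When imputing on real data, computing the fill values from the training split only avoids leaking information from the validation and test sets.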


## 5) Shuffle and Split training, test and validation sets as 60%-20%-20%, respectively.

In [None]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split


# Define X and y (train_test_split shuffles the rows by default)

feature_cols = ['city_development_index', 'relevent_experience', 'education_level', 'experience', 'training_hours']
X = df[feature_cols]
y = df.target

# Split as 60%-20%-20%: first hold out 20% of the data for test,
# then take 25% of the remaining 80% (= 20% of the full data) for validation

X_trainx, X_test, y_trainx, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_trainx, y_trainx, test_size=0.25, random_state=1)
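
A quick way to check the proportions of a two-stage split is to run it on dummy data and count the rows. The sketch below assumes 1000 dummy samples and uses hypothetical names (`X_demo`, `X_rest`, and so on). It holds out 20% for test, then 25% of the remainder for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 1000 samples with a single feature (illustration only)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# Stage 1: hold out 20% of all samples for test
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
# Stage 2: 25% of the remaining 80% is 20% of the full data
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

print(len(X_tr), len(X_va), len(X_te))  # 600 200 200
```

The second `test_size` is a fraction of what is left after the first split, not of the full dataset, which is why it is 0.25 rather than 0.2.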


## 6) Train a decision tree classifier on development/train data and do model selection using the validation data

* Train 5 decision tree classifiers with different values of "min_samples_split" which is the minimum number of samples required to split an internal node:  min_samples_split = [2, 4, 6, 8, 10]. 
* Evaluate the 5 models on validation set and choose the best one.
* Plot the train and validation set errors for those 5 settings - on one plot. 


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


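# Train decision tree classifiers & Evaluate on validation set
# Hold-out validation: fit each tree on the training set, score it on the validation set
clf_1 = DecisionTreeClassifier()  # defaults: criterion="gini", min_samples_split=2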
clf_2 = DecisionTreeClassifier(criterion="entropy", min_samples_split = 4)
clf_3 = DecisionTreeClassifier(criterion="entropy", min_samples_split = 6)
clf_4 = DecisionTreeClassifier(criterion="entropy", min_samples_split = 8)
clf_5 = DecisionTreeClassifier(criterion="entropy", min_samples_split = 10)

clf_1 = clf_1.fit(X_train,y_train)
y_pred1 = clf_1.predict(X_val)
print("Accuracy1:",metrics.accuracy_score(y_val, y_pred1))

clf_2 = clf_2.fit(X_train,y_train)
y_pred2 = clf_2.predict(X_val)
print("Accuracy2:",metrics.accuracy_score(y_val, y_pred2))

clf_3 = clf_3.fit(X_train,y_train)
y_pred3 = clf_3.predict(X_val)
print("Accuracy3:",metrics.accuracy_score(y_val, y_pred3))

clf_4 = clf_4.fit(X_train,y_train)
y_pred4 = clf_4.predict(X_val)
print("Accuracy4:",metrics.accuracy_score(y_val, y_pred4))

clf_5 = clf_5.fit(X_train,y_train)
y_pred5 = clf_5.predict(X_val)
print("Accuracy5:",metrics.accuracy_score(y_val, y_pred5))





Accuracy1: 0.6928303236188093
Accuracy2: 0.7021276595744681
Accuracy3: 0.7126765599856965
Accuracy4: 0.7139281244412659
Accuracy5: 0.7205435365635616
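
Step 6 also asks for the train and validation errors of the five settings on one plot. The following is a rough sketch of one way to produce it, not the notebook's own result. It uses synthetic data from `make_classification` as a stand-in (an assumption, as are the names `split_values`, `train_errors`, and `val_errors`). With the real data, `X_train`, `X_val`, `y_train`, and `y_val` would replace the synthetic split.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in the real train/validation sets
X_syn, y_syn = make_classification(n_samples=1000, n_features=5, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

split_values = [2, 4, 6, 8, 10]
train_errors, val_errors = [], []
for m in split_values:
    tree = DecisionTreeClassifier(
        criterion="entropy", min_samples_split=m, random_state=1
    )
    tree.fit(X_tr, y_tr)
    # score() returns accuracy, so error = 1 - accuracy
    train_errors.append(1 - tree.score(X_tr, y_tr))
    val_errors.append(1 - tree.score(X_va, y_va))

# Model selection: pick the setting with the lowest validation error
best = split_values[val_errors.index(min(val_errors))]
print("best min_samples_split on validation:", best)

# Both error curves on one plot
fig, ax = plt.subplots()
ax.plot(split_values, train_errors, marker="o", label="train error")
ax.plot(split_values, val_errors, marker="o", label="validation error")
ax.set_xlabel("min_samples_split")
ax.set_ylabel("error (1 - accuracy)")
ax.set_xticks(split_values)
ax.legend()
plt.show()
```

Using the same `criterion` for every tree keeps `min_samples_split` as the only hyperparameter that varies between the five models.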


In [None]:
# Confusion matrix and classification report for each model on the validation set
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_val, y_pred1))
print(classification_report(y_val, y_pred1))

print(confusion_matrix(y_val, y_pred2))
print(classification_report(y_val, y_pred2))

print(confusion_matrix(y_val, y_pred3))
print(classification_report(y_val, y_pred3))

print(confusion_matrix(y_val, y_pred4))
print(classification_report(y_val, y_pred4))

print(confusion_matrix(y_val, y_pred5))
print(classification_report(y_val, y_pred5))
#clearly pred5 has better precision


[[3337  857]
 [ 861  538]]
              precision    recall  f1-score   support

         0.0       0.79      0.80      0.80      4194
         1.0       0.39      0.38      0.39      1399

    accuracy                           0.69      5593
   macro avg       0.59      0.59      0.59      5593
weighted avg       0.69      0.69      0.69      5593

[[3395  799]
 [ 867  532]]
              precision    recall  f1-score   support

         0.0       0.80      0.81      0.80      4194
         1.0       0.40      0.38      0.39      1399

    accuracy                           0.70      5593
   macro avg       0.60      0.59      0.60      5593
weighted avg       0.70      0.70      0.70      5593

[[3454  740]
 [ 867  532]]
              precision    recall  f1-score   support

         0.0       0.80      0.82      0.81      4194
         1.0       0.42      0.38      0.40      1399

    accuracy                           0.71      5593
   macro avg       0.61      0.60      0.60    

## 7) Test your CHOSEN classifier on Test set

- Apply same pre-processing as training data (probably none)
- Predict the labels of testing data **using the best chosen SINGLE model out of the models that you have tried from step 6 (you have selected your model according to your validation results)** and report the accuracy. 

In [None]:
# You may want to train your BEST decision tree model on both training and validation sets. To merge these two, you may use
# concat() method of pandas
# No merge is needed: X_trainx and y_trainx were created before the train/validation split, so they already contain both sets.

# I will use min_samples_split = 10 because it gave the best validation accuracy overall; only 8 occasionally came close.

# test prediction using a decision tree with all default parameters and ..... min-split value

clf = DecisionTreeClassifier(criterion="entropy", min_samples_split = 10)
clf = clf.fit(X_trainx,y_trainx)
y_pred = clf.predict(X_test)

# Report your accuracy

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


Accuracy: 0.7146688120139447
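
If the combined training and validation data had not been kept around, the two sets could be merged with pandas `concat()`, as the starter comment suggests. The sketch below uses tiny made-up frames, and all names ending in `_demo`, as well as `X_dev` and `y_dev`, are hypothetical.

```python
import pandas as pd

# Tiny made-up train/validation splits (illustration only)
X_train_demo = pd.DataFrame({"experience": [25.0, 15.0], "training_hours": [36, 47]})
X_val_demo = pd.DataFrame({"experience": [5.0], "training_hours": [83]})
y_train_demo = pd.Series([1.0, 0.0], name="target")
y_val_demo = pd.Series([0.0], name="target")

# Stack features and labels in the same order so the rows stay aligned
X_dev = pd.concat([X_train_demo, X_val_demo], ignore_index=True)
y_dev = pd.concat([y_train_demo, y_val_demo], ignore_index=True)

print(X_dev.shape, y_dev.shape)
```

The merged `X_dev` and `y_dev` could then be passed to `fit()` of the chosen classifier before predicting on the test set.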


## 8) Notebook & Report 

**Notebook: We may just look at your notebook results, so make sure each cell is run and the outputs are in place.**

**Report: Write an at most 1/2 page summary of your approach to this problem at the end of your notebook**; this should be like an abstract of a paper or the executive summary (you aim for clarity and passing on information, not going to details about known facts such as what dec. trees or dataset details are, assuming they are known to people in your research area). 

**Must include statements such as:**

- **20pts** - The problem definition in 1-2 lines and explanation of the features (which is an ordinal variable? which is binary? etc.)

- **20pts** - What type of model is Decision Tree? (Unsupervised or supervised - explanation? Classification or regression - explanation?)

- **20pts** - Why do we have a separate validation set?
 
- **20pts** - Give the validation accuracies for different hyperparameters **in a table** and state which one you selected
 
- **20pts** - State what your test results are with the chosen method, parameters: e.g. "We have obtained the best results with the min sample split = .... , giving classification accuracy of …% on test data…."

The **last day for the submissions** is Tuesday, 19 July 2021, 12:00 AM. **Late submissions** are accepted until Thursday 21 July 2021, 12:00 AM with a **-10 pts penalty**.

*You will get full points from here as long as you have a good (enough) summary of your work, regardless of your best performance or what you have decided to talk about in the last few lines.*


Q1: We use city_development_index (real-valued), relevent_experience (binary), education_level (ordinal), experience (real-valued), and training_hours (real-valued) to predict whether the target will be 0 or 1 (binary).

Q2: It is classification because we predict a discrete label: whether a candidate is going to look for a new job or will stay with the company. It is supervised because the data is labeled; the model learns from the labeled examples and makes predictions for new ones.

Q3: We can't use the training set for model selection, because the model has already seen its targets, and we can't use the test set, because it must stay unseen to give an honest estimate of final performance. A separate validation set lets us tune the decision tree (here, min_samples_split) and pick the setting with the best accuracy.

Q4:

| min_samples_split | accuracy |
|---|---|
| 2 | 0.69 |
| 4 | 0.70 |
| 6 | 0.712 |
| 8 | 0.713 |
| **10** | **0.72** |



Q5: We have obtained the best results with the min sample split = 10, giving a classification accuracy of 71.5% on the test data.