# HW2

#### Machine Learning in Korea University
#### COSE362, Fall 2018
#### Due : 11/26 (TUE) 11:59 PM

#### In this assignment, you will practice various classification methods on the given dataset.
* Implementation details: Anaconda 5.3 with Python 3.7.
* Use the given dataset. Please do not change the train / valid / test split.
* Use the numpy, scikit-learn, and matplotlib libraries.
* You don't have to use all of the packages imported below (some are optional). <br>
Also, you can import additional packages in the "(Option) Other Classifiers" part.
* <b>*DO NOT MODIFY ANY PART OF THE CODE EXCEPT "Your Code Here"*</b>

In [1]:
# Basic packages
%matplotlib inline
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

# Machine Learning Models
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Additional packages
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [2]:
# Import your own packages if you need them (only from scikit-learn, numpy, pandas).
# Your Code Here
from sklearn.model_selection import ShuffleSplit
#End Your Code

## Process

> 1. Load "train.csv". It includes all samples' features and labels.
> 2. Train four types of classifiers (logistic regression, decision tree, random forest, support vector machine) and <b>validate</b> them in your own way. <b>(You can't get full credit if you don't conduct validation.)</b>
> 3. Optionally, if you train your own classifier (e.g. ensembling or gradient boosting), you can evaluate it on the development data. <br>
> 4. <b>You should submit your predicted results on the test data, using the classifier you selected.</b>
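
Step 2 leaves the validation scheme open. As a minimal sketch, assuming 20 toy samples rather than the assignment data, the following compares two scikit-learn splitters that are easy to confuse: `KFold` partitions the data into disjoint validation folds, while `ShuffleSplit` draws every split independently, so a sample may land in several validation sets or in none.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(-1, 1)  # 20 toy samples with one feature (illustrative only)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)

# Collect the validation indices produced by each splitter.
kfold_valid = [valid for _, valid in kfold.split(X)]
shuffle_valid = [valid for _, valid in shuffle_split.split(X)]

# K-fold: the validation folds are disjoint and together cover every sample once.
kfold_covered = np.sort(np.concatenate(kfold_valid))
# ShuffleSplit: splits are independent, so coverage of the data is not guaranteed.
shuffle_covered = np.unique(np.concatenate(shuffle_valid))

print("K-fold validation coverage:", len(kfold_covered), "of", len(X))
print("ShuffleSplit distinct validation samples:", len(shuffle_covered), "of", len(X))
```

Either scheme counts as validation; the difference matters mainly when describing the procedure, since five shuffle splits are not the same thing as 5-fold cross-validation.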

## Task & dataset description
1. 6 Features (1~6)<br>
Feature 2, 4, 6 : Real-valued<br>
Feature 1, 3, 5 : Categorical <br>

2. Samples <br>
>In development set : 2,000 samples <br>
>In test set : 1,500 samples

## Load development dataset
Load your development dataset. You should read <b>"train.csv"</b>. This is a classification task, and you need to preprocess the data before training your model. <br>
> You need to use the <b>1-of-K coding scheme</b> to convert categorical features to one-hot vectors. <br>
> For example, if a feature has 3 categorical values, you can encode them as [1,0,0], [0,1,0], [0,0,1] under the 1-of-K coding scheme. <br>
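
The 1-of-K scheme described above can be sketched for a single column as follows. This is an illustrative example only: the helper name `one_hot_column` and the colour values are hypothetical, and sorting the distinct values is just one way to fix the column order.

```python
import numpy as np

def one_hot_column(values):
    """Encode a 1-D sequence of categorical values as 1-of-K row vectors."""
    # Sort the distinct values so that the column order is deterministic.
    categories = sorted(set(values))
    index_of = {value: idx for idx, value in enumerate(categories)}

    # One row per sample, one column per category, with a single 1 per row.
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index_of[value]] = 1
    return categories, encoded

# Hypothetical categorical feature with three distinct values.
categories, encoded = one_hot_column(["red", "green", "blue", "green"])
print(categories)
print(encoded)
```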

In [4]:
# For training your model, you need to convert categorical features to one-hot encoding vectors.
# Your Code Here
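train = pd.read_csv('./data/train.csv').values  # Load the data and convert it to np.ndarray

CATEGORICAL_COLS = [0, 2, 4]  # zero-based column indices of features 1, 3, 5

def one_hot(data):
    # Map each categorical column's distinct values to consecutive indices.
    # Note: the mapping is derived from `data` itself, so train and test are
    # encoded consistently only if both contain the same set of values.
    feature2idx = []
    for categorical in CATEGORICAL_COLS:
        f2i = {value : idx for idx, value in enumerate(sorted(set(data[:, categorical])))}
        feature2idx.append(f2i)

    new_data = []

    for i in range(len(data)):
        new_row = []
        row_length = len(data[0])
        cnt = 0  # position of the current categorical column in feature2idx

        for j in range(row_length):
            if j in CATEGORICAL_COLS:
                # Replace the categorical value with its 1-of-K vector.
                convert_each_value = [0] * len(feature2idx[cnt])
                convert_each_value[feature2idx[cnt][data[i, j]]] = 1
                new_row.extend(convert_each_value)
                cnt += 1
            else:
                # Real-valued features and the label are copied unchanged.
                new_row.append(data[i, j])

        new_data.append(new_row)

    return np.array(new_data)

encoded_train = one_hot(train)  # encode once and reuse the result
X_tr = encoded_train[:, :-1]
Y_tr = encoded_train[:, -1]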
# End Your Code

### Logistic Regression
Train and validate your <b>logistic regression classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>
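
To make the required metric concrete, here is a minimal hand-written sketch of macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are averaged with equal weight. The function name `macro_f1` and the toy labels are illustrative assumptions; in the assignment itself `sklearn.metrics.f1_score(..., average='macro')` does this work.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN). The denominator is never zero here,
        # because every label occurs at least once in y_true or y_pred.
        per_class.append(2 * tp / (2 * tp + fp + fn))
    return sum(per_class) / len(per_class)

# Toy labels (hypothetical): class 2 is never predicted, so its F1 is 0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1]
print(macro_f1(y_true, y_pred))
```

Because every class carries the same weight, a class that the model never predicts contributes an F1 of 0 and pulls the average down, which is one reason macro F1 can sit below plain accuracy.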

In [5]:
# Training your logistic regression classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
cv = ShuffleSplit(n_splits=5, test_size=1/5, random_state=1)

ith_f1_score = []
for train_index, valid_index in cv.split(X_tr):
    x_train = X_tr[train_index]
    x_valid = X_tr[valid_index]
    y_train = Y_tr[train_index]
    y_valid = Y_tr[valid_index]

    model = LogisticRegression(C=10, solver='lbfgs', max_iter=300)
    model.fit(x_train, y_train)
    y_predict = model.predict(x_valid)
    ith_f1_score.append(f1_score(y_valid, y_predict, average='macro'))

# cross_val_score refits the model on the same splits; its default score is accuracy.
print('mean cv acc : ', cross_val_score(model, X_tr, Y_tr, cv=cv).mean())
print('mean f1 score is : ', np.mean(ith_f1_score))

# End Your Code

mean cv acc :  0.2895
mean f1 score is :  0.26301157027955424
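
The `cross_val_score` call above uses the estimator's default scorer, which for scikit-learn classifiers is accuracy, so the first printed number is an accuracy rather than an F1 score. As a hedged sketch on synthetic data (the `make_classification` settings are assumptions and have nothing to do with `train.csv`), passing `scoring="f1_macro"` yields the macro F1 on the same splits without a manual loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the assignment data (assumption: 3 classes, 6 features).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
model = LogisticRegression(max_iter=300)

# Default scorer for classifiers: accuracy.
accuracy = cross_val_score(model, X, y, cv=cv)
# Macro-averaged F1 on the same splits (same random_state, so same indices).
macro_f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

print("mean cv accuracy :", accuracy.mean())
print("mean cv macro F1 :", macro_f1_scores.mean())
```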


### Decision Tree
Train and validate your <b>decision tree classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>

In [6]:
# Training your decision tree classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
cv = ShuffleSplit(n_splits=5, test_size=1/5, random_state=1)

ith_f1_score = []

for train_index, valid_index in cv.split(X_tr):
    x_train = X_tr[train_index]
    x_valid = X_tr[valid_index]
    y_train = Y_tr[train_index]
    y_valid = Y_tr[valid_index]
    
    model = DecisionTreeClassifier(max_depth=17)
    model.fit(x_train, y_train)
    
    y_predict = model.predict(x_valid)
    ith_f1_score.append(f1_score(y_valid, y_predict, average='macro'))
    
print('mean cv acc : ', cross_val_score(model, X_tr, Y_tr, cv=cv).mean())
print('mean f1 score : ', np.mean(ith_f1_score))
# End Your Code

mean cv acc :  0.438
mean f1 score :  0.37187414990073886



### Random Forest
Train and validate your <b>random forest classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>

In [None]:
# Training your random forest classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
cv = ShuffleSplit(n_splits=5, test_size=1/5, random_state=1)

num_trees = [100, 300, 500, 1000] # best : 500

for num in num_trees:
    # Reset for every setting so that the mean covers only this tree count.
    ith_f1_score = []

    for train_index, valid_index in cv.split(X_tr):
        x_train = X_tr[train_index]
        x_valid = X_tr[valid_index]
        y_train = Y_tr[train_index]
        y_valid = Y_tr[valid_index]

        model = RandomForestClassifier(n_estimators=num)
        model.fit(x_train, y_train)

        y_predict = model.predict(x_valid)
        ith_f1_score.append(f1_score(y_valid, y_predict, average='macro'))

    print('n_estimators : ', num)
    print('mean cv acc : ', cross_val_score(model, X_tr, Y_tr, cv=cv).mean())
    print('mean f1 score : ', np.mean(ith_f1_score))
# End Your Code

### Support Vector Machine
Train and validate your <b>support vector machine classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>

In [None]:
# Training your support vector machine classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
cv = ShuffleSplit(n_splits=5, test_size=1/5, random_state=1)

ith_f1_score = []

for train_index, valid_index in cv.split(X_tr):
    x_train = X_tr[train_index]
    x_valid = X_tr[valid_index]
    y_train = Y_tr[train_index]
    y_valid = Y_tr[valid_index]
    
    model = SVC(kernel='poly', gamma=0.1, coef0=10)
    model.fit(x_train, y_train)
    
    y_predict = model.predict(x_valid)
    ith_f1_score.append(f1_score(y_valid, y_predict, average='macro'))
    
print('mean cv acc : ', cross_val_score(model, X_tr, Y_tr, cv=cv).mean())
print('mean f1 score : ', np.mean(ith_f1_score))
# End Your Code

### (Option) Other Classifiers.
Train and validate other classifiers in your own way.
> <b> If needed, you can import other models in this cell only, and only from scikit-learn. </b>

In [None]:
# If you need additional packages, import your own packages below.
# Your Code Here

# End Your Code

## Submit your prediction on the test data.

* Select your model and explain it briefly.
* You should read <b>"test.csv"</b>.
* Output your model's predictions in array form.
* Prediction example <br>
[2, 6, 14, 8, $\cdots$]
* We will rank your result by the <b>F1 metric (with the 'macro' option)</b>.
* <b> If you don't submit a prediction file, or submit it in the wrong format, you will get no points for this part.</b>

# Explain your final model
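For the four models (Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine), I consulted the official sklearn documentation and tuned those parameters of each model that were covered in class, comparing F1 scores along the way. Based on that comparison, I selected Random Forest. (Number of trees: 500, validated over 5 splits with `ShuffleSplit`. In general, increasing the number of trees can be expected to improve results, but beyond a certain number it has no further effect.)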



In [None]:
# Load test dataset.
# Your Code Here
test = pd.read_csv('./data/test.csv').values
X_te = one_hot(test)  # the test set does not include the target column
# End Your Code
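
`one_hot` above derives its category-to-index map from whatever array it receives, so calling it separately on the test data gives the same column layout as the training matrix only if both files contain the same set of values in every categorical column. A minimal sketch of one way to guard against a mismatch is to learn the maps from the training data and reuse them for the test data. The helper names `fit_categories` and `apply_one_hot` and the toy arrays are hypothetical and not part of the assignment code.

```python
import numpy as np

def fit_categories(data, categorical_cols):
    """Learn a value -> index map per categorical column from the training data."""
    return [{v: i for i, v in enumerate(sorted(set(data[:, col])))}
            for col in categorical_cols]

def apply_one_hot(data, categorical_cols, maps):
    """Encode `data` with maps learned elsewhere; unseen values become all-zero vectors."""
    blocks = []
    for col in range(data.shape[1]):
        if col in categorical_cols:
            mapping = maps[categorical_cols.index(col)]
            block = np.zeros((len(data), len(mapping)))
            for row, value in enumerate(data[:, col]):
                if value in mapping:
                    block[row, mapping[value]] = 1
        else:
            # Non-categorical columns are kept as a single float column.
            block = data[:, [col]].astype(float)
        blocks.append(block)
    return np.hstack(blocks)

# Toy data (hypothetical): column 0 is categorical, column 1 is real-valued.
toy_train = np.array([["a", 1.0], ["b", 2.0], ["c", 3.0]], dtype=object)
toy_test = np.array([["c", 4.0], ["a", 5.0]], dtype=object)  # "b" never occurs here

maps = fit_categories(toy_train, [0])
encoded_train = apply_one_hot(toy_train, [0], maps)
encoded_test = apply_one_hot(toy_test, [0], maps)
print(encoded_train.shape, encoded_test.shape)
```

Both encoded arrays end up with the same number of columns even though the toy test set lacks one category, which is the property the classifier needs at prediction time.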

In [None]:
# Predict target class
# Make variable "my_answer", type of array, and fill this array with your class predictions.
# Modify file name into your student number and your name.
# Your Code Here
model = RandomForestClassifier(n_estimators=500)
model.fit(X_tr, Y_tr)

my_answer = model.predict(X_te)

file_name = "HW2_2013190702_이찬주.csv"
# End Your Code

In [None]:
# This section is for saving predicted answers. DO NOT MODIFY.
pd.Series(my_answer).to_csv("./data/" + file_name, header=None, index=None)