# Task Definition

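* Using the given data, design a classification model that detects network intrusions.

* You may process and modify the given data as needed.

* Try as many models as you can.

* Keep every model you used in the Jupyter Notebook (so that it can be executed during grading).

* Submit the report separately as a Word document (there is no required report format).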

# Data Description

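This data is a processed version of the KDD Cup 1999 Dataset, prepared for network intrusion detection systems. Each feature describes a property of a connection, such as which protocol type or which service it uses. The class indicates whether the network is currently under intrusion and, if so, which type of intrusion it is.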

# Feature Description

## Total 41 Features.

duration: continuous.

protocol_type: symbolic.

service: Numeric, categorical

flag: Numeric, categorical

src_bytes: continuous.

dst_bytes: continuous.

land: symbolic.

wrong_fragment: continuous.

urgent: continuous.

hot: continuous.

num_failed_logins: continuous.

logged_in: symbolic.

num_compromised: continuous.

root_shell: continuous.

su_attempted: continuous.

num_root: continuous.

num_file_creations: continuous.

num_shells: continuous.

num_access_files: continuous.

num_outbound_cmds: continuous.

is_host_login: symbolic.

is_guest_login: symbolic.

count: continuous.

srv_count: continuous.

serror_rate: continuous.

srv_serror_rate: continuous.

rerror_rate: continuous.

srv_rerror_rate: continuous.

same_srv_rate: continuous.

diff_srv_rate: continuous.

srv_diff_host_rate: continuous.

dst_host_count: continuous.

dst_host_srv_count: continuous.

dst_host_same_srv_rate: continuous.

dst_host_diff_srv_rate: continuous.

dst_host_same_src_port_rate: continuous.

dst_host_srv_diff_host_rate: continuous.

dst_host_serror_rate: continuous.

dst_host_srv_serror_rate: continuous.

dst_host_rerror_rate: continuous.

dst_host_srv_rerror_rate: continuous.

For a more detailed description, see http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (because the data has been processed, keep in mind that it differs from the site in some respects).

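# Class Description

There are 5 classes: "Normal", "dos", "u2r", "r2l", and "probe" (the normal class is stored as `normal` in the `xAttack` column).

Normal means normal traffic; the other four are the names of network intrusion techniques.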

# Import required modules

In [1]:
%matplotlib inline  
import numpy as np
import copy
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from collections import OrderedDict
import matplotlib.pyplot as plt
import collections


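# Legacy import: sklearn.cross_validation was removed in scikit-learn 0.20.
# The KFold(n, n_folds=...) calls below rely on this old API; newer releases
# provide KFold in sklearn.model_selection with a different signature.
from sklearn.cross_validation import KFold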
from sklearn.metrics import accuracy_score



# Load the data with Pandas

In [2]:
# Read the csv file and store it as a pandas DataFrame
data = pd.read_csv('desktop/train_data.csv')

# Step 1: Data preprocessing

###  Identify categorical features

In [3]:
# Columns that are categorical and not yet binary: protocol_type (column 2), service (column 3), flag (column 4).
# Explore the categorical features.
# Note: categorical features stored as numbers (service, flag) are not detected as object dtype.
print('Training set:')
for col_name in data.columns:
    if data[col_name].dtypes == 'object' :
        unique_cat = len(data[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} categories".format(col_name=col_name, unique_cat=unique_cat))

# Inspect how the categories of each categorical feature and of the class are distributed (top 5 per column).
print('Distribution of categories in protocol_type:')
print(data['protocol_type'].value_counts().sort_values(ascending=False).head())

print('Distribution of categories in service:')
print(data['service'].value_counts().sort_values(ascending=False).head())

print('Distribution of categories in flag:')
print(data['flag'].value_counts().sort_values(ascending=False).head())

print('Distribution of categories in xAttack:')
print(data['xAttack'].value_counts().sort_values(ascending=False).head())


Training set:
Feature 'protocol_type' has 3 categories
Feature 'xAttack' has 5 categories
Distribution of categories in protocol_type:
icmp    102689
udp      14993
tcp       8291
Name: protocol_type, dtype: int64
Distribution of categories in service:
25    40871
50    21853
12     9612
20     8614
55     7313
Name: service, dtype: int64
Distribution of categories in flag:
2    74945
4    34851
1    11233
5     2421
3     1665
Name: flag, dtype: int64
Distribution of categories in xAttack:
normal    67343
dos       45927
probe     11656
r2l         995
u2r          52
Name: xAttack, dtype: int64
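
The class counts printed above are quite imbalanced, which matters when reading the accuracies reported later. As a rough sketch in plain Python, using only the counts shown in the output, the share of each class and the accuracy of a trivial "always predict normal" baseline for the binary task can be worked out as follows:

```python
# Class counts copied from the xAttack output above.
counts = {"normal": 67343, "dos": 45927, "probe": 11656, "r2l": 995, "u2r": 52}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print("{:<7s}{:>8d}  {:6.2%}".format(name, counts[name], share))

# Binary task (normal vs. any intrusion): a model that always answers
# "normal" would already reach this accuracy.
baseline_accuracy = shares["normal"]
print("total rows:", total)
print("always-normal baseline accuracy: {:.4f}".format(baseline_accuracy))
```

Any cross-validated accuracy is easier to judge against this baseline than against 0.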


### Encoding categorical variables

In [4]:
# Reduce the class to binary: normal (1) vs. any intrusion (0)
new_columns = {"xAttack": {"normal": 1, "dos": 0, "probe": 0, "r2l": 0, "u2r": 0}}
data.replace(new_columns, inplace=True)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()


# Label-encode the data, then one-hot encode protocol_type.
# Caution: apply(le.fit_transform) re-fits the encoder on every column, so the
# continuous features are also replaced by the index of each value among that
# column's sorted unique values.
data = data.apply(le.fit_transform)

# For a Series, prefix must be a single string. The encoded values 0, 1, 2
# correspond to icmp, tcp, udp (LabelEncoder sorts the classes).
protocol_type_encoded = pd.get_dummies(data["protocol_type"], prefix='protocol_type')
data = data.drop('protocol_type', axis=1)
data = pd.concat([data,protocol_type_encoded],axis=1)


# Move the class column to the end (not the prettiest way...)
xAttack_data = data['xAttack']
data = data.drop('xAttack', axis=1)
data = pd.concat([data,xAttack_data],axis=1)

data.head()

Unnamed: 0,duration,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,...,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,protocol_type_0,protocol_type_1,protocol_type_2,xAttack
0,0,18,1,482,0,0,0,0,0,0,...,17,0,0,0,5,0,1,0,0,1
1,0,39,1,143,0,0,0,0,0,0,...,88,0,0,0,0,0,0,0,1,1
2,0,44,3,0,0,0,0,0,0,0,...,0,0,100,99,0,0,1,0,0,0
3,0,22,1,229,5608,0,0,0,0,0,...,3,4,3,1,0,1,1,0,0,1
4,0,22,1,196,405,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
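
One possible way to encode only the symbolic column while leaving the continuous features untouched is sketched below. The tiny `toy` DataFrame and its values are made up for illustration; only `pd.get_dummies` with its `columns`, `prefix` and `dtype` parameters is taken from pandas itself.

```python
import pandas as pd

# Hypothetical miniature of the training data (values are made up).
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes": [482, 143, 0, 229],
    "xAttack": ["normal", "normal", "dos", "normal"],
})

# Binary target: 1 for normal traffic, 0 for any intrusion.
toy["xAttack"] = (toy["xAttack"] == "normal").astype(int)

# One-hot encode only the symbolic column; numeric columns are left untouched.
encoded = pd.get_dummies(toy, columns=["protocol_type"],
                         prefix="protocol_type", dtype=int)

# Move the target to the last position.
feature_cols = [c for c in encoded.columns if c != "xAttack"]
encoded = encoded[feature_cols + ["xAttack"]]

print(encoded)
```

Passing a single string as `prefix` yields readable column names such as `protocol_type_tcp`.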


# Step 2: Feature Scaling:

In [5]:
# Split the DataFrame into X & Y:
# X is a DataFrame of features (predictor variables) and Y is a Series holding the outcome variable.
# Note: the unnamed index column read from the csv ('Unnamed: 0') is still in `data`, so it ends up in X as a feature.

Xdata = data.iloc[:,:-1]
Ydata = data.iloc[:,-1]


X = Xdata.values
Y = Ydata.values
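
The `data.head()` output earlier shows an `Unnamed: 0` column, which is the row index written into the csv. A minimal sketch of how `pd.read_csv(..., index_col=0)` keeps that column out of the features, using a made-up two-row csv rather than the real file:

```python
import io
import pandas as pd

# Hypothetical two-row csv with an unnamed leading index column,
# shaped like the file used above (values are made up).
csv_text = ",duration,src_bytes,xAttack\n0,0,482,normal\n1,0,143,normal\n"

with_index = pd.read_csv(io.StringIO(csv_text))
without_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether the index carries any signal depends on how the file was ordered, so dropping it is the more cautious choice.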

# Step 3: Feature Selection, plus helpers for normalization and Feature Importance

## 3-1) Select features by tree-based importance

In [6]:
#Feature Importance
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Fits a number of randomized decision trees
model = ExtraTreesClassifier(n_estimators = 30)
model.fit(X, Y)

# Select the important features of previous model
sel = SelectFromModel(model, prefit=True)

# Subset features
X_new = sel.transform(X)
X_new.shape
X = X_new
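
Selecting features on all of X before cross-validation lets the selector see rows that are later used for testing. One possible alternative, sketched here on synthetic data and with the current `sklearn.model_selection` API (not the legacy one this notebook imports), is to put the selector into a pipeline so that it is re-fitted on each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, Y); the real data is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=4, random_state=0)

pipe = make_pipeline(
    SelectFromModel(ExtraTreesClassifier(n_estimators=30, random_state=0)),
    DecisionTreeClassifier(random_state=0),
)

# cross_val_score clones and re-fits the whole pipeline on every training fold.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print("Accuracy per fold:", scores)
print("Average accuracy:", scores.mean())
```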

In [7]:
# Z-score standardization of the input variables, using the mean and std of the training data (Data2)
def standardization(Data,Data2):
    return ((Data - np.mean(Data2, axis=0)) / np.std(Data2, axis=0))
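
The `standardization` function divides by `np.std`, so a column that is constant within a training fold gives a zero denominator and produces NaN or inf. A hedged variant that guards against this is sketched below; `standardize_safe` is a hypothetical name and the arrays are toy values.

```python
import numpy as np

def standardize_safe(data, reference):
    """Z-score `data` with the column means and stds of `reference`."""
    mean = np.mean(reference, axis=0)
    std = np.std(reference, axis=0)
    # A constant column has std 0; dividing by 1 instead leaves it at 0
    # after centering rather than producing NaN or inf.
    std = np.where(std == 0, 1.0, std)
    return (data - mean) / std

# Toy data: the second column is constant in the training part.
train = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
test = np.array([[3.0, 5.0], [7.0, 6.0]])

train_z = standardize_safe(train, train)
test_z = standardize_safe(test, train)
print(train_z)
print(test_z)
```

As in the original function, the test part is scaled with the training statistics only.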

In [8]:
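# Feature Importance
def plot_feature_importances(model, feature_names=None):
    # Horizontal bar chart of a fitted model's feature_importances_.
    # feature_names is optional; without it the bars are labeled by column index.
    importances = model.feature_importances_
    n_features = len(importances)
    if feature_names is None:
        feature_names = np.arange(n_features)
    plt.rcParams.update({'font.size': 8})
    plt.barh(range(n_features), importances, align='center')
    plt.yticks(np.arange(n_features), feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)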

# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 1 ) Decision Tree 

In [200]:
tree = DecisionTreeClassifier()

In [201]:
tree

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [202]:
# Disabled: print the 10 most important features.
# (Run only after `tree` has been fitted; the indices refer to the columns of the reduced X.)
# importances = tree.feature_importances_
# indices = np.argsort(importances)[::-1]
# print("Feature Ranking")
# for i in range(10):
#     print(i+1, indices[i], importances[indices[i]])
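
When printing a feature ranking such as the one in the disabled cell above, the feature name has to be looked up with the sorted index rather than with the loop counter. A small sketch with made-up importances and names:

```python
import numpy as np

# Hypothetical importances and names, for illustration only.
feature_names = np.array(["duration", "service", "flag", "src_bytes", "dst_bytes"])
importances = np.array([0.05, 0.30, 0.10, 0.40, 0.15])

# Indices of the features, most important first.
indices = np.argsort(importances)[::-1]

print("Feature Ranking")
for rank, idx in enumerate(indices, start=1):
    # Index the names with idx (not with the loop counter) so each name
    # matches its own importance.
    print(rank, idx, importances[idx], feature_names[idx])

ranked_names = feature_names[indices].tolist()
```

In this notebook the names would also have to be restricted to the columns kept by `SelectFromModel` (for example through its `get_support()` mask), since X no longer has the same columns as `data`.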

In [203]:
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    tree.fit(Train_Input_Normalized, y_train)
    acc = tree.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc) 
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))
tree

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.96435942213049686, 0.96261311319257026, 0.9628512462295602, 0.96594427244582048, 0.96324521711518618, 0.96578550448519485, 0.96253076129237125, 0.9617369214892435, 0.96197507343018174, 0.96388028895768829] 

Average accuracy:  0.963492182077


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
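
For readers on a current scikit-learn release, the same 10-fold loop might look roughly like this. It is a sketch on synthetic data, not a re-run of the experiment above; `KFold(n_splits=...)` with `.split(X)` and `StandardScaler` stand in for the legacy `KFold(len(data), n_folds=10)` and the hand-written `standardization` helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, Y).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

cv = KFold(n_splits=10)  # replaces KFold(len(data), n_folds=10)
fold_accuracy = []

for train_idx, test_idx in cv.split(X_demo):
    # Fit the scaler on the training fold only, then apply it to both parts.
    scaler = StandardScaler().fit(X_demo[train_idx])
    X_train = scaler.transform(X_demo[train_idx])
    X_test = scaler.transform(X_demo[test_idx])

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_demo[train_idx])
    fold_accuracy.append(model.score(X_test, y_demo[test_idx]))

print("Accuracy per fold: ", fold_accuracy)
print("Average accuracy: ", np.mean(fold_accuracy))
```

Like the notebook's loop, this `KFold` does not shuffle, so each fold is a contiguous block of rows.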

## Tree using the entropy criterion

In [204]:
newTree = DecisionTreeClassifier(criterion='entropy')
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
    
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    newTree.fit(Train_Input_Normalized, y_train)
    acc = newTree.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc) 
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.96435942213049686, 0.96261311319257026, 0.9628512462295602, 0.96594427244582048, 0.96324521711518618, 0.96578550448519485, 0.96253076129237125, 0.9617369214892435, 0.96197507343018174, 0.96388028895768829] 

Average accuracy:  0.963492182077


In [205]:
newTree

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
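
The `criterion` argument switches the impurity measure used to score splits. As an illustration of the two standard textbook definitions (this is not scikit-learn's internal code), both measures are 0 for a pure node and largest for an evenly mixed one:

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping p_k = 0."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# Class proportions of three hypothetical nodes: mixed, mostly pure, pure.
for dist in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(dist, round(gini(dist), 4), round(entropy(dist), 4))
```

Because the two measures rank most candidate splits similarly, switching the criterion often changes the resulting tree only slightly.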

In [206]:
# Try setting max_depth
newTree = DecisionTreeClassifier(criterion='entropy',max_depth=10)
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
    
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    newTree.fit(Train_Input_Normalized, y_train)
    acc = newTree.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc) 
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.96491506588347353, 0.96293062390855688, 0.96301000158755357, 0.96578550448519485, 0.96300706517424783, 0.96665872826863541, 0.96308644915456065, 0.96276891323330949, 0.96165753750893068, 0.96388028895768829] 

Average accuracy:  0.963770017816



# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 2 ) RandomForest

In [12]:
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

In [13]:
# Create a random forest Classifier.
#rf = RandomForestClassifier(n_jobs=2, random_state=0)
rf = RandomForestClassifier(n_estimators=150,max_features="sqrt",
                           random_state=1026)
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

# Note: this RFECV object is created but never fitted or used, so it has no effect on the results.
RFECV(RandomForestClassifier(), scoring='accuracy')

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

        
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    rf.fit(Train_Input_Normalized, y_train)
    acc = rf.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.97372598825210355, 0.97197967931417684, 0.97237656770916014, 0.97491466222116374, 0.97094546320552511, 0.97356513455584659, 0.97269191077240613, 0.97094546320552511, 0.9714217670874018, 0.97308883067397001] 

Average accuracy:  0.9725655467


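### Selecting the most important features

The aim here is to improve performance by keeping only the most important features. In the outputs recorded below, however, the average accuracy (0.9726) is identical to the previous run.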

In [14]:
# Create a random forest Classifier.
#rf = RandomForestClassifier(n_jobs=2, random_state=0)
rf = RandomForestClassifier(n_estimators=150,max_features="sqrt",
                           random_state=1026)
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

#Identify And Select Most Important Features
sfm = SelectFromModel(rf, threshold=0.15)

# Various threshold values were also tried
#(The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded.)
# sfm = SelectFromModel(rf, threshold=0.1)


# Train the selector
sfm.fit(X, Y)
X = sfm.transform(X)

# Note: this RFECV object is created but never fitted or used, so it has no effect on the results.
RFECV(RandomForestClassifier(), scoring='accuracy')

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

        
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    rf.fit(Train_Input_Normalized, y_train)
    acc = rf.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.97372598825210355, 0.97197967931417684, 0.97237656770916014, 0.97491466222116374, 0.97094546320552511, 0.97356513455584659, 0.97269191077240613, 0.97094546320552511, 0.9714217670874018, 0.97308883067397001] 

Average accuracy:  0.9725655467
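
The `threshold` rule quoted in the cell above (keep a feature when its importance is greater than or equal to the threshold) can be illustrated without fitting anything. The importances below are made up. Since the feature importances of scikit-learn tree ensembles are normalized to sum to 1, a threshold of 0.15 can keep at most 6 features.

```python
import numpy as np

# Hypothetical importances for 8 features (they sum to 1).
importances = np.array([0.32, 0.21, 0.16, 0.12, 0.08, 0.06, 0.03, 0.02])

for threshold in (0.15, 0.10):
    keep = importances >= threshold   # same rule as in the comment above
    print("threshold={:.2f} keeps {} feature(s): indices {}".format(
        threshold, keep.sum(), np.flatnonzero(keep).tolist()))

kept_at_015 = int((importances >= 0.15).sum())
kept_at_010 = int((importances >= 0.10).sum())
```

Printing the shape of X after `sfm.transform(X)` is a quick way to see how many features a given threshold actually leaves.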


In [209]:
#plot_feature_importances(rf)

# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 3 ) GradientBoosting

In [210]:
from sklearn.ensemble import GradientBoostingClassifier

In [211]:
# Create a gradient boosting classifier.
#gbrt = GradientBoostingClassifier(random_state=0,max_depth=1,learning_rate=0.01)
gbrt = GradientBoostingClassifier(n_estimators=50,
                           random_state=1026)
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
    
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")

    gbrt.fit(Train_Input_Normalized, y_train)
    acc = gbrt.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.95967613906969362, 0.956262898872837, 0.95673916494681699, 0.96022862586330082, 0.95768833849329205, 0.96022862586330082, 0.95808525839485592, 0.9574501865523537, 0.95705326665078982, 0.95824402635548145] 

Average accuracy:  0.958165653106


In [212]:
#plot_feature_importances(gbrt)

# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 4 ) MLPClassifier

In [213]:
from sklearn.neural_network import MLPClassifier

In [214]:
mlp = MLPClassifier(hidden_layer_sizes=(10,10,10))
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")

    mlp.fit(Train_Input_Normalized, y_train)
    acc = mlp.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
  
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.91665343705350055, 0.91324019685664393, 0.91871725670741389, 0.93030086528538536, 0.91124871001031993, 0.92656981821068507, 0.92847503373819162, 0.91394776534095423, 0.9261728983091212, 0.91664682067158842] 

Average accuracy:  0.920197280218


## Trying different hidden layers

In [215]:
mlp = MLPClassifier(hidden_layer_sizes=(5,2))
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")

    mlp.fit(Train_Input_Normalized, y_train)
    acc = mlp.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.91657405937450387, 0.91355770757263055, 0.91887601206540725, 0.91871080415972062, 0.91132809399063264, 0.91886957212034615, 0.91513852504564575, 0.91394776534095423, 0.92220369929348256, 0.91688497261252677] 

Average accuracy:  0.916609121158


In [216]:
# One hidden layer with 5 units; the trailing comma makes this a 1-element tuple.
mlp = MLPClassifier(hidden_layer_sizes=(5,))
#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")

    mlp.fit(Train_Input_Normalized, y_train)
    acc = mlp.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc) 
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.91617717097952056, 0.91387521828861729, 0.91887601206540725, 0.91886957212034615, 0.91132809399063264, 0.91894895610065885, 0.91482098912439469, 0.91394776534095423, 0.91275700563626261, 0.91688497261252677] 

Average accuracy:  0.915648575626


# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 5 ) SVM 

In [217]:
from sklearn.svm import SVC

In [218]:
svc = SVC()

cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    svc.fit(Train_Input_Normalized, y_train)
    acc = svc.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.91792347991744716, 0.91601841562152719, 0.92125734243530721, 0.9218861633722315, 0.91283638961657543, 0.9202190997856633, 0.91521790902595856, 0.91664682067158842, 0.91513852504564575, 0.91752004445502899] 

Average accuracy:  0.917466418995


# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 6 ) LogisticRegression

In [13]:
from sklearn.linear_model import LogisticRegression

In [15]:
lr = LogisticRegression()

cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
    
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    lr.fit(Train_Input_Normalized, y_train)
    acc = lr.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.94046674075250036, 0.93760914430862041, 0.94014923003651374, 0.94030324680479482, 0.93760419147416052, 0.9401444788441693, 0.9385567992379138, 0.93903310311979038, 0.9391124871001032, 0.94062078272604588] 

Average accuracy:  0.93936002044


# Step 4: Build the model & Step 5: Prediction & Evaluation (validation):

# 7 ) KNN 

In [221]:
from sklearn.neighbors import KNeighborsClassifier

In [222]:
# k sets how many of the nearest data points are examined.

neigh = KNeighborsClassifier(n_neighbors=3)

cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
    
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")

    neigh.fit(Train_Input_Normalized, y_train)
    acc = neigh.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.96221622479758695, 0.9588823622797269, 0.96031116050166698, 0.96618242438675872, 0.96078431372549022, 0.96134000158767963, 0.96126061760736681, 0.96157815352861797, 0.96149876954830515, 0.96427720885925217] 

Average accuracy:  0.961833123682


In [223]:
# A different value of k

neigh = KNeighborsClassifier(n_neighbors=10)

cv = KFold(len(data), n_folds=10)

fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
    
    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")

    neigh.fit(Train_Input_Normalized, y_train)
    acc = neigh.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.96443879980949354, 0.96269249087156694, 0.9602317828226703, 0.96594427244582048, 0.96300706517424783, 0.96570612050488214, 0.96237199333174561, 0.96245137731205843, 0.9617369214892435, 0.96419782487893946] 

Average accuracy:  0.963277864864


# Step 5: My best model: RandomForestClassifier

In [10]:
from sklearn.feature_selection import SelectFromModel

# Create a random forest Classifier.
rf = RandomForestClassifier(n_estimators=150,max_features="sqrt",
                           random_state=1026)

#Identify And Select Most Important Features
sfm = SelectFromModel(rf, threshold=0.15)
# Train the selector
sfm.fit(X, Y)
X = sfm.transform(X)

#Simple K-Fold cross validation. 10 folds. (split the data into 10 parts & fit on 9-parts & test accuracy on the remaining part)
cv = KFold(len(data), n_folds=10)
fold_accuracy = []

for train, test in cv:
    
    X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]

    Train_Input_Normalized = copy.deepcopy(standardization(X_train,X_train))
    Test_Input_Normalized = copy.deepcopy(standardization(X_test,X_train))
    print("standardization complete!")
    
    rf.fit(Train_Input_Normalized, y_train)
    acc = rf.score(Test_Input_Normalized, y_test)
    fold_accuracy.append(acc)  
    
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))

standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
standardization complete!
Accuracy per fold:  [0.97372598825210355, 0.97197967931417684, 0.97237656770916014, 0.97491466222116374, 0.97094546320552511, 0.97356513455584659, 0.97269191077240613, 0.97094546320552511, 0.9714217670874018, 0.97308883067397001] 

Average accuracy:  0.9725655467
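
Accuracy alone does not show how many intrusions were missed. `confusion_matrix` is imported at the top of the notebook but never used. As a NumPy-only sketch of the same counts, with made-up labels and a hypothetical helper `binary_confusion`, the intrusion recall can be derived like this:

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Return (tn, fp, fn, tp) with class 1 treated as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tn, fp, fn, tp

# Made-up labels: 1 = normal, 0 = intrusion, as encoded in this notebook.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]

tn, fp, fn, tp = binary_confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Share of real intrusions (class 0) that were flagged as intrusions.
intrusion_recall = tn / (tn + fp)
print("tn, fp, fn, tp =", tn, fp, fn, tp)
print("accuracy =", accuracy, "intrusion recall =", intrusion_recall)
```

Collecting these counts inside each fold of the loop above would show whether the errors are mostly missed intrusions or false alarms.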
