# Poker Rule


* sampleSubmission.csv *

>  Submission file should predict the hand for each id in the test set.

id,hand
1,0
2,0
3,9
...
etc.


* train.csv, test.csv *

>  Each hand consists of five cards with a given suit and rank, drawn from a standard deck of 52.

<p>S1 “Suit of card #1”</p>
<p>Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs} </p>
<p>C1 “Rank of card #1”</p>
<p>Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King) </p>
...

<p>S5 “Suit of card #5”</p>
<p>C5 “Rank of card #5”</p>


>  Each row in the training set has the accompanying class label for the poker hand it comprises.

<p>0: Nothing in hand; not a recognized poker hand</p>
<p>1: One pair; one pair of equal ranks within five cards</p>
<p>2: Two pairs; two pairs of equal ranks within five cards</p>
<p>3: Three of a kind; three equal ranks within five cards</p>
<p>4: Straight; five cards, sequentially ranked with no gaps</p>
<p>5: Flush; five cards with the same suit</p>
<p>6: Full house; pair + different rank three of a kind</p>
<p>7: Four of a kind; four equal ranks within five cards</p>
<p>8: Straight flush; straight + flush</p>
<p>9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush</p>
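
The class definitions above can be written as a small rule-based function, which is useful for sanity-checking labels. This is a minimal sketch and not part of the original notebook: `classify_hand` is a hypothetical helper, and it assumes the Ace (rank 1) may sit at either end of a straight (A-2-3-4-5 or 10-J-Q-K-A), which the description above does not state explicitly.

```python
from collections import Counter


def classify_hand(suits, ranks):
    """Return the class label (0-9) for five cards given as suit and rank lists."""
    # Multiplicities of each rank, largest first, e.g. [3, 2] for a full house.
    counts = sorted(Counter(ranks).values(), reverse=True)
    unique = sorted(set(ranks))
    is_flush = len(set(suits)) == 1
    # Ace is encoded as 1, so the Ace-high straight needs a special case (assumption).
    is_royal_ranks = unique == [1, 10, 11, 12, 13]
    is_straight = len(unique) == 5 and (unique[-1] - unique[0] == 4 or is_royal_ranks)

    if is_flush and is_royal_ranks:
        return 9  # royal flush
    if is_flush and is_straight:
        return 8  # straight flush
    if counts[0] == 4:
        return 7  # four of a kind
    if counts == [3, 2]:
        return 6  # full house
    if is_flush:
        return 5  # flush
    if is_straight:
        return 4  # straight
    if counts[0] == 3:
        return 3  # three of a kind
    if counts == [2, 2, 1]:
        return 2  # two pairs
    if counts[0] == 2:
        return 1  # one pair
    return 0      # nothing in hand


# Example: suits and ranks taken from one row of the training data
print(classify_hand([1, 4, 3, 4, 2], [11, 1, 7, 11, 1]))
```

Because every label is a deterministic function of the ten input columns, a model that scores far below 100% is missing structure in the features rather than fighting noise.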

# KNN

In [40]:
import numpy as np
import cv2
import pandas as pd
from sklearn import metrics
from sklearn import model_selection as modsel
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [41]:
# train data

train_data = pd.read_csv("./data/Poker_Rule/train.csv")
print(train_data.shape)

(25010, 11)


In [42]:
print(train_data)

       S1  C1  S2  C2  S3  C3  S4  C4  S5  C5  hand
0       4   9   2   1   2   2   4   7   2   8     0
1       1   4   3   6   1  12   3  11   2   7     0
2       1  11   4   1   3   7   4  11   2   1     2
3       2   9   2   4   3   6   1   9   4   9     3
4       1   8   2   4   2  11   2   2   2   1     0
5       2   5   1   5   2  13   2   3   3  13     2
6       3  10   4   6   1   4   2  13   4   5     0
7       4  10   3   1   2  13   4   2   4   7     0
8       3   2   4  10   3   3   4   4   1   9     0
9       2   7   3   8   4   8   2  13   2  12     1
10      2   5   1   3   2  10   3   2   2   1     0
11      1   6   2  12   4   7   2  10   1   1     0
12      4   2   4   9   1  12   3   7   2  11     0
13      2   6   1   5   3   3   4   2   4   5     1
14      1   6   3  12   4  11   2  11   3  13     1
15      2   5   2   4   4   9   2   3   3   2     0
16      4   9   4  11   3   8   3   9   3   5     1
17      1   9   2   4   1  11   3   4   1  13     1
18      1  1

* S1, C1, ..., S5, C5 are the features.
* hand is the label.

In [43]:
labels = train_data['hand']
features = train_data.drop(['hand'], axis=1)

In [44]:
# create KNN
knn = KNeighborsClassifier(n_neighbors=100, weights='uniform', algorithm='auto')

# train the KNN model
knn.fit(features, labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=100, p=2,
           weights='uniform')

In [45]:
# test data

test_data = pd.read_csv("./data/Poker_Rule/test.csv")
print(test_data.shape)

(1000000, 11)


In [46]:
print(test_data)

             id  S1  C1  S2  C2  S3  C3  S4  C4  S5  C5
0             1   1  10   2   2   3   3   3   8   1   1
1             2   2  13   3   5   3   7   4   6   1   4
2             3   1   3   1  11   2   8   2   1   2   4
3             4   1   6   3   3   4   7   1   8   3  11
4             5   2  10   3   4   1   6   2  12   2   6
5             6   1   4   3  10   2  11   2   6   1   7
6             7   1  10   3   8   1   4   3  11   3   9
7             8   2  11   3   8   1   1   1  11   2   3
8             9   3   4   1   1   1   3   3   5   3   6
9            10   3  12   2   1   1   3   1   2   3  10
10           11   1   7   3   1   4   8   4  10   3  11
11           12   2   3   2  11   3   9   1  10   2   5
12           13   1   4   2  13   4   6   4   8   4   5
13           14   4   7   4   2   3   3   1   4   1   1
14           15   4   8   3  10   4  11   3   5   1  12
15           16   4  11   1  10   1   3   1   2   4   5
16           17   1   8   3   2   4   4   2   3 

* The ['hand'] column, which exists only in the train data, must be predicted for the test data.

In [47]:
# keep only the feature columns of the test dataframe for prediction
test_data = test_data.drop(['id'], axis=1)

# predict
predict = knn.predict(test_data)

In [48]:
print(predict)

[1 0 0 ..., 1 0 1]


* predict holds the predicted 'hand' values.

In [49]:
# training set accuracy score

print(knn.score(features, labels))

0.591923230708


In [51]:
# testing set accuracy score

print(knn.score(test_data, predict))

1.0


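<p>=> A score of 1.0 looks too good... (overfitting?)</p>
<p>=> The accuracy calculation itself seems to be wrong: knn.score(test_data, predict) compares the model's predictions with themselves, so it always returns 1.0. test.csv has no 'hand' column, so a true test accuracy cannot be computed locally.</p>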
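
One way to get a meaningful accuracy number is to hold out part of the labeled training data and score against its true labels. The sketch below is illustrative only: it uses a small synthetic stand-in for train.csv (random suits and ranks with a toy label), and the names `demo`, `knn_demo`, `self_score`, and `val_score` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for train.csv (assumption: random cards, toy label).
rng = np.random.default_rng(0)
rank_cols = ["C1", "C2", "C3", "C4", "C5"]
suit_cols = ["S1", "S2", "S3", "S4", "S5"]
demo = pd.DataFrame({c: rng.integers(1, 14, size=600) for c in rank_cols})
for c in suit_cols:
    demo[c] = rng.integers(1, 5, size=600)
# Toy label: 1 if any rank repeats within the hand, else 0.
demo["hand"] = (demo[rank_cols].nunique(axis=1) < 5).astype(int)

X = demo.drop(columns="hand")
y = demo["hand"]
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)

# Scoring predictions against themselves is 1.0 by construction.
self_score = knn_demo.score(X_val, knn_demo.predict(X_val))
# Scoring against held-out true labels is the number that matters.
val_score = knn_demo.score(X_val, y_val)
print(self_score, val_score)
```

With the real data, the same pattern would apply to `features` and `labels` from the cells above.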

# SVM

In [8]:
import pandas as pd
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

In [9]:
# load dataset
train_data = pd.read_csv("./data/Poker_Rule/train.csv")
test_data = pd.read_csv("./data/Poker_Rule/test.csv")

In [11]:
svm_model = svm.SVC(kernel='linear')           # create SVM - linear

In [13]:
csv_data = train_data[["S1","C1","S2","C2","S3","C3","S4","C4","S5","C5"]]
csv_label = train_data["hand"]

In [14]:
# training
svm_model.fit(csv_data, csv_label)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
# set test data
test_csv_data = test_data[["S1","C1","S2","C2","S3","C3","S4","C4","S5","C5"]]

# predict
predict = svm_model.predict(test_csv_data)

In [16]:
print(predict)

[0 0 0 ..., 0 0 0]


** Split the train data into train and test parts with a test ratio of 0.3 -> measure accuracy

In [19]:
Train_data, Test_data, Train_label, Test_label = train_test_split(csv_data, csv_label, test_size=0.3, random_state=0)

In [20]:
svm_model.fit(Train_data, Train_label)
pre = svm_model.predict(Test_data)
accuracy_score = metrics.accuracy_score(Test_label, pre)
print(accuracy_score)

0.494868719179


=> The accuracy turned out really low...

In [22]:
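svm_model_g = svm.SVC(kernel='rbf', C=10.0, gamma=0.10, random_state=0)   # Gaussian (RBF) kernel

svm_model_g.fit(Train_data, Train_label)
# predict with svm_model_g; the recorded run called svm_model.predict (the linear model) here,
# which is why the score below is identical to the linear result
pre_g = svm_model_g.predict(Test_data)
accuracy_score_g = metrics.accuracy_score(Test_label, pre_g)
print(accuracy_score_g)

0.494868719179


=> ?????  Even after switching to a nonlinear kernel...;;; (the identical score comes from predicting with the linear svm_model, so the RBF model was never actually evaluated)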
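
Keeping each model and its score under one name in a dictionary makes it harder to fit one model and predict with another. The following is a minimal sketch on synthetic data from `make_classification`, not the poker data; the names `models` and `scores` are illustrative assumptions.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the poker features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "linear": svm.SVC(kernel="linear"),
    "rbf": svm.SVC(kernel="rbf", C=10.0, gamma=0.10, random_state=0),
}

scores = {}
for name, model in models.items():
    # The same object is fitted and used for prediction, so results cannot be mixed up.
    model.fit(X_fit, y_fit)
    scores[name] = metrics.accuracy_score(y_val, model.predict(X_val))

print(scores)
```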

# KNN

In [7]:
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [8]:
# split the data roughly 2:1 (every third row goes to the test part, so not exactly 7:3)
test_index = [x for x in df_train.index if x % 3 == 0]
train_index = [x for x in df_train.index if x % 3 != 0]

In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25010 entries, 0 to 25009
Data columns (total 11 columns):
S1      25010 non-null int64
C1      25010 non-null int64
S2      25010 non-null int64
C2      25010 non-null int64
S3      25010 non-null int64
C3      25010 non-null int64
S4      25010 non-null int64
C4      25010 non-null int64
S5      25010 non-null int64
C5      25010 non-null int64
hand    25010 non-null int64
dtypes: int64(11)
memory usage: 2.1 MB


In [10]:
# one-hot encode the suit columns (S1..S5) with get_dummies
df_train1 = pd.concat([pd.get_dummies(df_train.S1),
pd.get_dummies(df_train.S2),
pd.get_dummies(df_train.S3),
pd.get_dummies(df_train.S4),
pd.get_dummies(df_train.S5)],axis=1)

In [11]:
# each dummy column is 1 if the card has that suit (1-4),
# else 0

df_train = pd.concat([df_train,df_train1],axis=1)
df_train = df_train.drop(['S1','S2','S3','S4','S5'],axis=1)

In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25010 entries, 0 to 25009
Data columns (total 26 columns):
C1      25010 non-null int64
C2      25010 non-null int64
C3      25010 non-null int64
C4      25010 non-null int64
C5      25010 non-null int64
hand    25010 non-null int64
1       25010 non-null uint8
2       25010 non-null uint8
3       25010 non-null uint8
4       25010 non-null uint8
1       25010 non-null uint8
2       25010 non-null uint8
3       25010 non-null uint8
4       25010 non-null uint8
1       25010 non-null uint8
2       25010 non-null uint8
3       25010 non-null uint8
4       25010 non-null uint8
1       25010 non-null uint8
2       25010 non-null uint8
3       25010 non-null uint8
4       25010 non-null uint8
1       25010 non-null uint8
2       25010 non-null uint8
3       25010 non-null uint8
4       25010 non-null uint8
dtypes: int64(6), uint8(20)
memory usage: 1.6 MB
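
The listing above shows the dummy columns named 1 to 4 five times over, so they cannot be told apart by name. A possible alternative, sketched here on a tiny hand-made frame, is to let `pd.get_dummies` prefix each column with its source column and to declare the four suit categories up front so that every frame gets the same columns. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame with two cards per row (values are made up).
demo = pd.DataFrame({"S1": [1, 2, 4], "C1": [9, 4, 11], "S2": [3, 3, 1], "C2": [1, 6, 1]})
suit_cols = ["S1", "S2"]

# Declaring all four suits guarantees a column for each suit,
# even when a suit never occurs in this particular frame.
for col in suit_cols:
    demo[col] = pd.Categorical(demo[col], categories=[1, 2, 3, 4])

# columns= makes get_dummies use "<column>_<value>" names such as S1_3.
encoded = pd.get_dummies(demo, columns=suit_cols)
print(list(encoded.columns))
```

Unique column names also keep the train and test frames aligned when they are encoded separately.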


In [13]:
# test data set
df_test1 = pd.concat([pd.get_dummies(df_test.S1),
pd.get_dummies(df_test.S2),
pd.get_dummies(df_test.S3),
pd.get_dummies(df_test.S4),
pd.get_dummies(df_test.S5)],axis=1)

df_test = pd.concat([df_test,df_test1],axis=1)
df_test = df_test.drop(['S1','S2','S3','S4','S5'],axis=1)

# features: drop the label column
xtrain = df_train.drop('hand',axis=1)

# labels, test features, and the ids needed for the submission file
ytrain = df_train['hand']
xtest = df_test.drop('id',axis=1)
id = df_test['id']

In [14]:
# knn modeling 

from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5) # k = 5 neighbors
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)

In [15]:
# write the submission file; the result is scored on Kaggle
submission = pd.DataFrame({'id':id, 'hand':y_pred})
submission = submission[['id','hand']]
submission.to_csv('sub.csv',index=False)

=> Accuracy Score : 53.57%

#### Data Preprocessing 

In [17]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [18]:
# combine suit + rank into one token per card (e.g. suit 4, rank 9 -> '49')

# train
df_train['SC1'] = df_train['S1'].apply(lambda x: str(x)) + df_train['C1'].apply(lambda x: str(x))
df_train['SC2'] = df_train['S2'].apply(lambda x: str(x)) + df_train['C2'].apply(lambda x: str(x))
df_train['SC3'] = df_train['S3'].apply(lambda x: str(x)) + df_train['C3'].apply(lambda x: str(x))
df_train['SC4'] = df_train['S4'].apply(lambda x: str(x)) + df_train['C4'].apply(lambda x: str(x))
df_train['SC5'] = df_train['S5'].apply(lambda x: str(x)) + df_train['C5'].apply(lambda x: str(x))
xtrain = pd.get_dummies(df_train['SC1']) + pd.get_dummies(df_train['SC2']) +pd.get_dummies(df_train['SC3']) + pd.get_dummies(df_train['SC4']) + pd.get_dummies(df_train['SC5'])

# test
df_test['SC1'] = df_test['S1'].apply(lambda x: str(x)) + df_test['C1'].apply(lambda x: str(x))
df_test['SC2'] = df_test['S2'].apply(lambda x: str(x)) + df_test['C2'].apply(lambda x: str(x))
df_test['SC3'] = df_test['S3'].apply(lambda x: str(x)) + df_test['C3'].apply(lambda x: str(x))
df_test['SC4'] = df_test['S4'].apply(lambda x: str(x)) + df_test['C4'].apply(lambda x: str(x))
df_test['SC5'] = df_test['S5'].apply(lambda x: str(x)) + df_test['C5'].apply(lambda x: str(x))
xtest = pd.get_dummies(df_test['SC1']) + pd.get_dummies(df_test['SC2']) +pd.get_dummies(df_test['SC3']) + pd.get_dummies(df_test['SC4']) + pd.get_dummies(df_test['SC5'])

df_train = df_train.drop(['S1','C1','S2','C2','S3','C3','S4','C4','S5','C5'],axis=1)
df_test = df_test.drop(['S1','C1','S2','C2','S3','C3','S4','C4','S5','C5'],axis=1)

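# NOTE: the assignments below overwrite the one-hot xtrain/xtest built above; the model is fit on the SC1..SC5 columns
xtrain = df_train.drop('hand',axis=1)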
ytrain = df_train['hand']
id = df_test['id']
xtest = df_test.drop('id',axis=1)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)

submission = pd.DataFrame({'id':id, 'hand':y_pred})
submission = submission[['id','hand']]
submission.to_csv('sub.csv',index=False)

=> Performance got worse

In [19]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [20]:
df_train = df_train.drop(['S1','S2','S3','S4','S5'],axis=1)
df_test = df_test.drop(['S1','S2','S3','S4','S5'],axis=1)

# use only the rank (number) columns: summing the dummies counts how many cards of each rank are in the hand
df_train1 = pd.get_dummies(df_train.C1) + pd.get_dummies(df_train.C2) + pd.get_dummies(df_train.C3) + pd.get_dummies(df_train.C4) + pd.get_dummies(df_train.C5)
df_test1 = pd.get_dummies(df_test.C1) + pd.get_dummies(df_test.C2) + pd.get_dummies(df_test.C3) + pd.get_dummies(df_test.C4) + pd.get_dummies(df_test.C5)

xtrain = df_train1
ytrain = df_train['hand']
id = df_test['id']
xtest = df_test1

model = KNeighborsClassifier(n_neighbors=1)
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)

submission = pd.DataFrame({'id':id, 'hand':y_pred})
submission = submission[['id','hand']]
submission.to_csv('sub.csv',index=False)

=> Performance improved (98.3%)
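
Summing the per-card dummies, as done above, turns each hand into a vector of 13 rank counts that does not depend on the order of the cards. Pairs, three of a kind, full house and four of a kind are determined entirely by these counts, which is a plausible reason for the jump in score. A minimal sketch on three hand-made rows follows; `rank_counts` is a hypothetical helper, and the `reindex` step is an addition that guarantees all 13 columns even when a rank is absent.

```python
import pandas as pd

rank_cols = ["C1", "C2", "C3", "C4", "C5"]
# Three illustrative hands: nothing, two pairs, three of a kind.
demo = pd.DataFrame(
    [[9, 1, 2, 7, 8], [11, 1, 7, 11, 1], [9, 4, 6, 9, 9]],
    columns=rank_cols,
)


def rank_counts(frame):
    """Return a frame with columns 1..13 holding how often each rank occurs per hand."""
    all_ranks = list(range(1, 14))
    per_card = [
        # One-hot encode one card position, then force the full 1..13 column set.
        pd.get_dummies(frame[c]).reindex(columns=all_ranks, fill_value=0).astype(int)
        for c in rank_cols
    ]
    return sum(per_card)


counts = rank_counts(demo)
print(counts)
```

Two hands with the same ranks in a different order map to the same count vector, so a 1-nearest-neighbor lookup can match them exactly.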

# Random Forest

In [21]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train = df_train.drop(['S1','S2','S3','S4','S5'],axis=1)
df_test = df_test.drop(['S1','S2','S3','S4','S5'],axis=1)

df_train1 = pd.get_dummies(df_train.C1) + pd.get_dummies(df_train.C2) + pd.get_dummies(df_train.C3) + pd.get_dummies(df_train.C4) + pd.get_dummies(df_train.C5)
df_test1 = pd.get_dummies(df_test.C1) + pd.get_dummies(df_test.C2) + pd.get_dummies(df_test.C3) + pd.get_dummies(df_test.C4) + pd.get_dummies(df_test.C5)

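# NOTE: df_train1/df_test1 (rank counts) are not used below; the forest is fit on the raw C1..C5 columns
xtrain = df_train.drop('hand',axis=1)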
ytrain = df_train['hand']
id = df_test['id']
xtest = df_test.drop('id',axis=1)

model = RandomForestClassifier(n_estimators=10)
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)

submission = pd.DataFrame({'id':id, 'hand':y_pred})
submission = submission[['id','hand']]
submission.to_csv('sub.csv',index=False)

=> Performs worse than KNN

# Ensemble & Stacking

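- Ensemble: a technique that combines several models with low individual performance (accuracy) to obtain better overall performance
- Stacking: one form of ensembling that combines models built with different learning algorithms; each classifier's predictions are used as inputs to a final model
<br><br>
- The final model is built by ensembling and stacking the following models
  - GradientBoostingClassifier()
  - RandomForestClassifier()
  - ExtraTreesClassifier()
  - KNeighborsClassifier()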
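
The stacking idea can be sketched compactly with `cross_val_predict`, which returns out-of-fold predictions so that no row is predicted by a model that saw it during training. This is an illustrative sketch on synthetic data, not the notebook's implementation; `base_models`, `meta_train`, `meta_new`, and `X_new` are assumed names, and only two base models are used to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the poker features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_new = X[:20]  # stands in for the unlabeled test set

base_models = [
    RandomForestClassifier(n_estimators=10, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Level 1: one column of out-of-fold predictions per base model.
meta_train = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base_models])

# Refit every base model on all labeled rows to build meta-features for new data.
meta_new = np.column_stack([m.fit(X, y).predict(X_new) for m in base_models])

# Level 2: the stacker learns how to combine the base models' predictions.
stacker = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(meta_train, y)
y_new = stacker.predict(meta_new)
print(y_new)
```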

In [22]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.linear_model import LassoCV

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

test_index = [x for x in df_train.index if x % 3 == 0] # every third row, about 33% (overwritten by the KFold loop below)
train_index = [x for x in df_train.index if x % 3 != 0] # remaining rows, about 67% (overwritten by the KFold loop below)

df_train = df_train.drop(['S1','S2','S3','S4','S5'],axis=1)
df_test = df_test.drop(['S1','S2','S3','S4','S5'],axis=1)

df_train1 = pd.get_dummies(df_train.C1) + pd.get_dummies(df_train.C2) + pd.get_dummies(df_train.C3) + pd.get_dummies(df_train.C4) + pd.get_dummies(df_train.C5)
df_test1 = pd.get_dummies(df_test.C1) + pd.get_dummies(df_test.C2) + pd.get_dummies(df_test.C3) + pd.get_dummies(df_test.C4) + pd.get_dummies(df_test.C5)

xtrain = df_train1
ytrain = df_train['hand']
id = df_test['id']
xtest = df_test1


# Cross validation: split the data into 5 groups and cross-validate with K-Fold
cv = KFold(n_splits=5, shuffle=True)
regs = [GradientBoostingClassifier(n_estimators=10),
        GradientBoostingClassifier(n_estimators=50),
        RandomForestClassifier(n_estimators=10),
        RandomForestClassifier(n_estimators=50),
        ExtraTreesClassifier(n_estimators=10),
        ExtraTreesClassifier(n_estimators=50),
        KNeighborsClassifier(n_neighbors=1),
        KNeighborsClassifier(n_neighbors=5),
        LassoCV(cv=10,n_alphas=1000),
        LassoCV(cv=5, n_alphas=1000)]

meta_feature = pd.DataFrame(np.zeros(xtrain.shape[0]))
for train_index, test_index in cv.split(xtrain):
    xtrain_train = xtrain.iloc[train_index]
    xtrain_test = xtrain.iloc[test_index]
    ytrain_train = ytrain.iloc[train_index]
    ytrain_test = ytrain.iloc[test_index]
    reg_num = 0
    for reg in regs:
        reg_name = str(reg_num) + str(reg.__class__).split('.')[-1].split("'")[0]
        reg.fit(xtrain_train, ytrain_train)
        meta_feature.loc[test_index, reg_name] = reg.predict(xtrain_test)
        reg_num += 1
meta_feature = meta_feature.drop(0,axis=1)

meta_feature_test = pd.DataFrame(np.zeros(xtest.shape[0]))
reg_num = 0
for reg in regs:
    reg_name = str(reg_num) + str(reg.__class__).split('.')[-1].split("'")[0]
    reg.fit(xtrain, ytrain)
    meta_feature_test[reg_name] = reg.predict(xtest)
    reg_num += 1
meta_feature_test = meta_feature_test.drop(0, axis=1)

stacker = GradientBoostingClassifier(n_estimators=10)
stacker.fit(meta_feature,ytrain)
y_pred = stacker.predict(meta_feature_test)

submission = pd.DataFrame({'id':id, 'hand':y_pred})
submission = submission[['id','hand']]
submission.to_csv('sub.csv',index=False)


<p>=> Using only the Number (rank) features gives better performance</p>
<p>=> Using the RandomForestClassifier, GradientBoostingClassifier, and ExtraTreesClassifier models together in a stacking ensemble works best</p>