# Import Library

The libraries used in this notebook are:
- pandas (simple operations on the datasets)
- sklearn (machine learning modules)

In [53]:
import pandas as pd
from sklearn import datasets, model_selection
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

## Load Datasets

In [71]:
cancer = datasets.load_breast_cancer()
dtcancer = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
dttennis = pd.read_csv("../data/play_tennis.csv")

In [55]:
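def simpleDescribe(df: pd.DataFrame) -> None:
    """Print the column count ('Jumlah kolom'), row count ('Jumlah baris'), and column names."""
    print('Jumlah kolom: ' + str(len(df.columns)))
    print('Jumlah baris: ' + str(len(df)))
    print()
    print('Kolom pada tabel:')
    print(df.columns.tolist())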


### Data cancer

A sample of the breast cancer data that ships with sklearn:

In [56]:
dtcancer.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [57]:
simpleDescribe(dtcancer)

Jumlah kolom: 30
Jumlah baris: 569

Kolom pada tabel:
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']


### Data Tennis
A sample of the tennis data, loaded from an external file in the data folder:

In [58]:
dttennis.head()

Unnamed: 0,day,outlook,temp,humidity,wind,play
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes


In [59]:
simpleDescribe(dttennis)

Jumlah kolom: 6
Jumlah baris: 14

Kolom pada tabel:
['day', 'outlook', 'temp', 'humidity', 'wind', 'play']


A few important things can be seen here:
1. The 'day' column holds the running day number (D1, D2, ...), which also determines each row's position within the week. To check whether the day of the week (Monday, Tuesday, etc.) influences the play attribute, take the day number modulo 7.
2. The values in this dataset are still strings and need to be encoded by category.

In [72]:
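# Strip the leading 'D' and reduce the day number modulo 7 (day of the week).
# Vectorized, so it avoids Series.iteritems() (removed in pandas 2.0) and chained assignment.
dttennis['day'] = dttennis['day'].str[1:].astype(int) % 7
dttennis.head(2)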

#### Encode Play Tennis Column

In [70]:
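le = LabelEncoder()
dttennis_old = dttennis.copy()
# Encode every column except 'day', which already holds numbers after the modulo step.
# Note: dttennis_old[1:] slices rows, not columns, so iterate over .columns[1:] instead.
for col in dttennis_old.columns[1:]:
    dttennis[col] = le.fit_transform(dttennis_old[col])
dttennis.head()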

Unnamed: 0,day,outlook,temp,humidity,wind,play
0,0,2,1,0,1,0
1,1,2,1,0,0,0
2,2,0,1,0,1,1
3,3,1,2,0,1,1
4,4,1,0,1,1,1
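
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `Sunny` appears as 2 and `Overcast` as 0 in the table above. The following minimal sketch shows how to inspect that mapping; the `outlook` list is a hypothetical stand-in that copies only the five rows shown earlier, not the full file.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in: the 'outlook' values of the five rows shown by head()
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]

le = LabelEncoder()
codes = le.fit_transform(outlook)

# classes_ is sorted, and the position of each class is its integer code
mapping = {label: int(code) for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)

# inverse_transform recovers the original strings from the codes
decoded = list(le.inverse_transform(codes))
print(decoded)
```

Because a single encoder that is refit for every column only remembers the last column's classes, keeping one encoder per column may be preferable when the codes need to be decoded later.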


## Split Dataset

Split each dataset into two parts: train and test.

In [4]:
ctrain, ctest = model_selection.train_test_split(dtcancer, test_size=0.2, train_size=0.8, random_state=1)
tntrain, tntest = model_selection.train_test_split(dttennis, test_size=0.2, train_size=0.8, random_state=1)
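
With only 14 rows in the tennis data, it helps to know how many rows end up on each side of the split. The sketch below uses a hypothetical 14-row frame named `toy` as a stand-in for `dttennis` and the same split parameters as above.

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical stand-in with the same number of rows as the tennis data
toy = pd.DataFrame({"row": range(14)})

train, test = model_selection.train_test_split(
    toy, test_size=0.2, train_size=0.8, random_state=1
)

# The test size is rounded up and the train size is rounded down
print(len(train), len(test))
```

A test set this small means every single misclassified row shifts the score by a third, so scores on the tennis data should be read with caution.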

In [5]:
### split into four data variables (x_train, x_test, y_train, and y_test)
# Dataset Cancer
##################
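# The class labels (malignant/benign) live in cancer.target; dtcancer holds only features,
# and 'worst fractal dimension' is a continuous feature, not a class label.
y = pd.Series(cancer.target, index=dtcancer.index, name='target')
x = dtcancer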

x_ctrain, x_ctest, y_ctrain, y_ctest = model_selection.train_test_split(x, y, test_size=0.2, train_size=0.8, random_state=1)

###
# Dataset Tennis
##################
y = dttennis['play']
x = dttennis.drop('play', axis=1)

x_tntrain, x_tntest, y_tntrain, y_tntest = model_selection.train_test_split(x, y, test_size=0.2, train_size=0.8, random_state=1)

<h3>Decision Tree Classifier</h3>

**Decision Tree Classifier** (DTC) is a supervised learning method that builds a model which repeatedly splits the data based on the values of selected attributes.

The steps for building a DTC are:
1. Select the best attribute using an ASM (Attribute Selection Measure)
2. Make that attribute a decision node and split the dataset into smaller subsets
3. Repeat recursively until one of the following conditions is met:
    - no attributes remain
    - no instances remain
    - all tuples share the same attribute value

There are many ASM methods; the one used here is *Information Gain*.
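
To make step 1 concrete, the following sketch computes information gain (the entropy of the labels minus the weighted entropy after splitting on an attribute) for each attribute. It assumes that `play_tennis.csv` holds the classic 14-row play-tennis table; only the first five rows are confirmed by the output earlier in this notebook, and the helper names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter

# Assumed contents of the dataset: (outlook, temp, humidity, wind, play)
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["outlook", "temp", "humidity", "wind"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(rows, i) for i, name in enumerate(attributes)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best attribute:", best)
```

Under this assumption, `outlook` has the highest gain and would become the root decision node; the same computation is then repeated on each subset.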

<h3>Id3 Estimator</h3>

<h3>KMeans</h3>

In [6]:
ckmeans = KMeans(n_clusters=2, random_state=42).fit(ctrain)
ckmeanspredict = ckmeans.predict(ctest)

tnkmeans = KMeans(n_clusters=2, random_state=42).fit(tntrain)
tnkmeanspredict = tnkmeans.predict(tntest)

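# Left disabled: labels_ holds the cluster ids of the training rows, so it does not
# line up with the test-set predictions (different lengths, and cluster ids are arbitrary).
# accuracy_score(ckmeans.labels_, ckmeanspredict, normalize=False)
# f1_score(ckmeans.labels_, ckmeanspredict)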
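
KMeans is unsupervised, so its cluster ids (0 and 1) carry no class meaning by themselves. One possible way to score it against the known cancer labels is to map each cluster to the majority class among its training members and then compare on the test set. This is only a sketch of that idea, not the notebook's own evaluation; it reloads the data so that it runs standalone, and `cluster_to_label` is an illustrative name.

```python
import numpy as np
from sklearn import datasets, model_selection
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    cancer.data, cancer.target, test_size=0.2, train_size=0.8, random_state=1
)

# Fit on features only; the true labels are never shown to KMeans
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(x_train)

# Map each cluster id to the most frequent true label among its training rows
cluster_to_label = {
    c: int(np.bincount(y_train[kmeans.labels_ == c]).argmax())
    for c in range(kmeans.n_clusters)
}

# Translate the predicted cluster ids of the test rows into class labels
y_pred = np.array([cluster_to_label[c] for c in kmeans.predict(x_test)])
acc = accuracy_score(y_test, y_pred)
print(acc)
```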

<h3>Logistic Regression</h3>

<h3>Neural Network</h3>

In [7]:
clf = MLPClassifier(random_state=1, max_iter=2000)

# cNeural = clf.fit(x_ctrain, y_ctrain)
# cNeuralpredict = cNeural.predict(x_ctest)

tnNeural = clf.fit(x_tntrain, y_tntrain)
tnNeuralpredict = tnNeural.predict(x_tntest)

tnNeural.score(x_tntest, y_tntest)

0.3333333333333333

<h3>SVM</h3>

In [8]:
# to do tomorrow
svc = SVC(C=1, kernel='linear')

# c_svc = svc.fit(x_ctrain, y_ctrain)
# c_svcpredict = c_svc.predict(x_ctest)

tn_svc = svc.fit(x_tntrain, y_tntrain)
tn_svcpredict = tn_svc.predict(x_tntest)
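
The SVM predictions above are computed but not yet scored. As a minimal sketch, `accuracy_score` and `f1_score` (both imported at the top of the notebook) could be applied to them; `y_true` and `y_pred` below are hypothetical three-row stand-ins for `y_tntest` and `tn_svcpredict`, not actual results.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical stand-ins for y_tntest and tn_svcpredict (1 = play, 0 = no play)
y_true = [1, 0, 1]
y_pred = [1, 1, 1]

# Fraction of rows predicted correctly
acc = accuracy_score(y_true, y_pred)

# Harmonic mean of precision and recall for the positive class (label 1)
f1 = f1_score(y_true, y_pred)

print(acc, f1)
```

In the notebook itself the call would be `accuracy_score(y_tntest, tn_svcpredict)`, with the true labels passed first.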