# Reading the Datasets
## 1. Load the Breast Cancer Dataset
Load the breast cancer data and split it into 80% training data and 20% test data.
No label encoding is needed for this dataset because it ships with the sklearn library in numeric form and can be used directly.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

#Load data cancer
data_cancer = load_breast_cancer()

#Splitting cancer data to 80% train and 20% test
X = data_cancer.data
Y = data_cancer.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, train_size=0.8)
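
The split above is random, so the scores reported later will differ from run to run. As a minimal sketch, assuming a fixed seed and a stratified split are acceptable for this experiment, the same 80/20 split can be made repeatable. The value `random_state=42` and the variable names `X_tr`, `X_te`, `Y_tr`, `Y_te` are arbitrary choices for this illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_all, Y_all = cancer.data, cancer.target

# random_state fixes the shuffle; stratify keeps the class ratio
# roughly the same in the training and test subsets.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_all, Y_all, test_size=0.2, random_state=42, stratify=Y_all
)

# The dataset has 569 samples and 30 features.
print(X_tr.shape, X_te.shape)
```

With 569 samples, a 20% test share rounds up to 114 test rows, leaving 455 rows for training.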

## 2. Load and Prepare the Play Tennis Dataset
Because the data is stored as a CSV file, it is loaded with the pandas library. The categorical columns must also be label-encoded before they can be used.
### 2.1 Load the Dataset into pandas
Load the Play Tennis dataset with pandas.

In [2]:
import pandas as pd
tennis_file = "play_tennis.csv"
data_tennis = pd.read_csv(tennis_file)
print(data_tennis)

    day   outlook  temp humidity    wind play
0    D1     Sunny   Hot     High    Weak   No
1    D2     Sunny   Hot     High  Strong   No
2    D3  Overcast   Hot     High    Weak  Yes
3    D4      Rain  Mild     High    Weak  Yes
4    D5      Rain  Cool   Normal    Weak  Yes
5    D6      Rain  Cool   Normal  Strong   No
6    D7  Overcast  Cool   Normal  Strong  Yes
7    D8     Sunny  Mild     High    Weak   No
8    D9     Sunny  Cool   Normal    Weak  Yes
9   D10      Rain  Mild   Normal    Weak  Yes
10  D11     Sunny  Mild   Normal  Strong  Yes
11  D12  Overcast  Mild     High  Strong  Yes
12  D13  Overcast   Hot   Normal    Weak  Yes
13  D14      Rain  Mild     High  Strong   No


### 2.2 Label-Encode the Categorical Data
First, the 'day' column is dropped because it is a unique identifier and carries no predictive information.
The remaining columns are then label-encoded. LabelEncoder assigns integer codes in alphabetical order, which gives the following mapping: <br>
<b>Outlook: </b><br>
Overcast = 0<br>
Rain = 1<br>
Sunny = 2<br>
<b>Temp: </b><br>
Cool = 0<br>
Hot = 1<br>
Mild = 2<br>
<b>Humidity: </b><br>
High = 0<br>
Normal = 1<br>
<b>Wind: </b><br>
Strong = 0<br>
Weak = 1<br>
<b>Play: </b><br>
No = 0<br>
Yes = 1<br>

In [3]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

tennis_file = "play_tennis.csv"
data_tennis = pd.read_csv(tennis_file)

#remove day
data_tennis = data_tennis.drop(columns=['day'])

#change column value to category
data_tennis['outlook'] = data_tennis['outlook'].astype('category')
data_tennis['temp'] = data_tennis['temp'].astype('category')
data_tennis['humidity'] = data_tennis['humidity'].astype('category')
data_tennis['wind'] = data_tennis['wind'].astype('category')
data_tennis['play'] = data_tennis['play'].astype('category')

#encode with label
encoder = LabelEncoder()
data_tennis['outlook'] = encoder.fit_transform(data_tennis.outlook)
data_tennis['temp'] = encoder.fit_transform(data_tennis.temp)
data_tennis['humidity'] = encoder.fit_transform(data_tennis.humidity)
data_tennis['wind'] = encoder.fit_transform(data_tennis.wind)
data_tennis['play'] = encoder.fit_transform(data_tennis.play)

print(data_tennis)

    outlook  temp  humidity  wind  play
0         2     1         0     1     0
1         2     1         0     0     0
2         0     1         0     1     1
3         1     2         0     1     1
4         1     0         1     1     1
5         1     0         1     0     0
6         0     0         1     0     1
7         2     2         0     1     0
8         2     0         1     1     1
9         1     2         1     1     1
10        2     2         1     0     1
11        0     2         0     0     1
12        0     1         1     1     1
13        1     2         0     0     0
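
The cell above reuses a single `LabelEncoder` for every column, so only the mapping of the last column fitted is still available afterwards. A minimal sketch, assuming one encoder per column is kept in a dictionary (the names `sample`, `encoders`, and `mapping` are hypothetical), shows how the mapping listed above can be read back from `classes_` and how codes can be decoded again. The four rows are copied from the table printed in section 2.1 (D1, D3, D4, D6) and cover every category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Four rows from the Play Tennis table, enough to cover every category.
sample = pd.DataFrame({
    "outlook": ["Sunny", "Overcast", "Rain", "Rain"],
    "temp": ["Hot", "Hot", "Mild", "Cool"],
    "humidity": ["High", "High", "High", "Normal"],
    "wind": ["Weak", "Weak", "Weak", "Strong"],
    "play": ["No", "Yes", "Yes", "No"],
})

# Keep one fitted encoder per column so each mapping stays accessible.
encoders = {}
encoded = sample.copy()
for column in sample.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(sample[column])

# classes_ is sorted, so the position of a value is its integer code.
mapping = {
    column: {str(value): code for code, value in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}
for column, codes in mapping.items():
    print(column, codes)

# Decode integer codes back into the original strings.
decoded_play = [str(v) for v in encoders["play"].inverse_transform([0, 1])]
print(decoded_play)
```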


Because all of these features are nominal (they have no natural order), the integer labels would suggest an ordering that does not exist. A further encoding step is therefore needed: the features are one-hot encoded with pd.get_dummies().

In [4]:
data_tennis = pd.get_dummies(data_tennis, columns=['outlook', 'temp', 'humidity', 'wind'], drop_first=False)
print(data_tennis)

    play  outlook_0  outlook_1  outlook_2  temp_0  temp_1  temp_2  humidity_0  \
0      0          0          0          1       0       1       0           1   
1      0          0          0          1       0       1       0           1   
2      1          1          0          0       0       1       0           1   
3      1          0          1          0       0       0       1           1   
4      1          0          1          0       1       0       0           0   
5      0          0          1          0       1       0       0           0   
6      1          1          0          0       1       0       0           0   
7      0          0          0          1       0       0       1           1   
8      1          0          0          1       1       0       0           0   
9      1          0          1          0       0       0       1           0   
10     1          0          0          1       0       0       1           0   
11     1          1         
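
Column names such as `outlook_2` are hard to read in the tree printouts later on. As a sketch under the assumption that readable names are preferred, `pd.get_dummies()` can also be applied directly to the original string columns, which yields names such as `outlook_Sunny`. The names `sample`, `features`, and `target` are hypothetical, and `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Four rows from the Play Tennis table (D1, D3, D4, D6).
sample = pd.DataFrame({
    "outlook": ["Sunny", "Overcast", "Rain", "Rain"],
    "temp": ["Hot", "Hot", "Mild", "Cool"],
    "humidity": ["High", "High", "High", "Normal"],
    "wind": ["Weak", "Weak", "Weak", "Strong"],
    "play": ["No", "Yes", "Yes", "No"],
})

# One-hot encode the string features directly; no LabelEncoder step is needed.
features = pd.get_dummies(sample.drop(columns=["play"]), dtype=int)

# Encode the target as 0/1 with the same meaning as above (No = 0, Yes = 1).
target = (sample["play"] == "Yes").astype(int)

print(features.columns.tolist())
print(target.tolist())
```

Each row has exactly one active indicator per original feature, so every row of `features` sums to 4.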

### 2.3 Split into Training and Test Data

In [5]:
from sklearn.model_selection import train_test_split
A = data_tennis.drop(['play'], axis=1)
B = data_tennis['play']
A_train, A_test, B_train, B_test = train_test_split(A, B, test_size=0.2)

# Training
## 1. Decision Tree Classifier
### 1.1 Training on the Breast Cancer Data
The breast cancer data is fitted below with a Decision Tree Classifier, and the resulting tree is displayed with export_text.

In [6]:
from sklearn import tree
from sklearn.tree import export_text
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
r = export_text(clf)
print(r)

|--- feature_7 <= 0.05
|   |--- feature_23 <= 957.45
|   |   |--- feature_13 <= 44.45
|   |   |   |--- feature_24 <= 0.18
|   |   |   |   |--- feature_7 <= 0.05
|   |   |   |   |   |--- feature_14 <= 0.00
|   |   |   |   |   |   |--- feature_8 <= 0.18
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- feature_8 >  0.18
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- feature_14 >  0.00
|   |   |   |   |   |   |--- feature_21 <= 33.27
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- feature_21 >  33.27
|   |   |   |   |   |   |   |--- feature_21 <= 33.56
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- feature_21 >  33.56
|   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- feature_7 >  0.05
|   |   |   |   |   |--- feature_23 <= 796.25
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- feature_23 >  796.25
|   |   |   |   |   |   |--- class: 0
|   |   |   |--- feature_24 
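
The printout above labels the splits as `feature_7`, `feature_23`, and so on, because no feature names were passed to `export_text`. A minimal sketch of a more readable printout follows, using the names shipped with the dataset. The tree here is a separate, shallow one: `max_depth=2` and `random_state=0` are choices made only to keep the output short and repeatable, and `small_tree` and `report` are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()
names = [str(name) for name in cancer.feature_names]

# A deliberately shallow tree, fitted on all samples, just for display.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(cancer.data, cancer.target)

# Passing feature_names replaces feature_<index> with the real column names.
report = export_text(small_tree, feature_names=names)
print(report)

# Index 7 and index 23 are the features that appear in the printout above.
print(names[7], "|", names[23])
```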

The model is evaluated below using the accuracy and F1 metrics:

In [7]:
from sklearn.metrics import accuracy_score, f1_score
predictions = clf.predict(X_test)
print("F1 Score (binary): " + str(f1_score(Y_test, predictions)))
print("Accuracy: " + str(accuracy_score(Y_test, predictions)))

F1 Score (binary): 0.9130434782608695
Accuracy: 0.8947368421052632
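
The same two evaluation lines are repeated for every model in this notebook. One possible way to factor them out is a small helper; this is only a sketch, and the function name `evaluate`, the seed values, and the variable names are assumptions introduced for the illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, features, labels, average="binary"):
    """Return accuracy and F1 for an already fitted model."""
    predicted = model.predict(features)
    return {
        "accuracy": accuracy_score(labels, predicted),
        "f1": f1_score(labels, predicted, average=average),
    }


cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=0
)
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

scores = evaluate(demo_tree, X_te, y_te)
print(scores)
```

The `average` argument is exposed so the same helper can report the macro-averaged F1 used in the KMeans section.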


### 1.2 Training on the Play Tennis Data
The Play Tennis data is fitted below with a Decision Tree Classifier, and the resulting tree is displayed with export_text.

In [8]:
from sklearn import tree
from sklearn.tree import export_text
clf = tree.DecisionTreeClassifier()
clf = clf.fit(A_train, B_train)

r = export_text(clf,feature_names = list(A.columns))
print(r)

|--- humidity_1 <= 0.50
|   |--- outlook_2 <= 0.50
|   |   |--- outlook_1 <= 0.50
|   |   |   |--- class: 1
|   |   |--- outlook_1 >  0.50
|   |   |   |--- wind_0 <= 0.50
|   |   |   |   |--- class: 1
|   |   |   |--- wind_0 >  0.50
|   |   |   |   |--- class: 0
|   |--- outlook_2 >  0.50
|   |   |--- class: 0
|--- humidity_1 >  0.50
|   |--- class: 1



The model is evaluated below using the accuracy and F1 metrics:

In [9]:
from sklearn.metrics import accuracy_score, f1_score
predictions = clf.predict(A_test)
print("F1 Score (binary): " + str(f1_score(B_test, predictions)))
print("Accuracy: " + str(accuracy_score(B_test, predictions)))

F1 Score (binary): 0.8
Accuracy: 0.6666666666666666
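
With 14 rows and `test_size=0.2`, the test set contains only 3 rows, so accuracy can only take the values 0, 1/3, 2/3, or 1. The worked example below reproduces the two numbers above from one hypothetical pattern that is consistent with them: two "Yes" days, one "No" day, and a model that predicts "Yes" for all three. The actual test rows are not shown in the notebook, so the labels here are an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 3-row test set: 1 = Yes, 0 = No.
true_labels = [1, 1, 0]
predicted_labels = [1, 1, 1]

# Manual calculation: 2 true positives, 1 false positive, 0 false negatives.
precision = 2 / 3
recall = 2 / 2
manual_f1 = 2 * precision * recall / (precision + recall)

library_f1 = f1_score(true_labels, predicted_labels)
accuracy = accuracy_score(true_labels, predicted_labels)

print(manual_f1, library_f1, accuracy)
```

Both F1 values come out at about 0.8 and the accuracy at about 0.667, matching the reported scores. A test set this small cannot distinguish between models reliably.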


## 2. Id3Estimator
### 2.1 Training on the Breast Cancer Data

In [10]:
import six
import sys
sys.modules['sklearn.externals.six'] = six
from id3 import Id3Estimator

estimator = Id3Estimator()
estimator = estimator.fit(X_train, Y_train)

The model is evaluated below using the accuracy and F1 metrics:

In [11]:
from sklearn.metrics import accuracy_score, f1_score
predictions = estimator.predict(X_test)
print("F1 Score (binary): " + str(f1_score(Y_test, predictions)))
print("Accuracy: " + str(accuracy_score(Y_test, predictions)))

F1 Score (binary): 0.9305555555555556
Accuracy: 0.9122807017543859


### 2.2 Training on the Play Tennis Data

In [12]:
import six
import sys
sys.modules['sklearn.externals.six'] = six
from id3 import Id3Estimator

estimator = Id3Estimator()
estimator = estimator.fit(A_train, B_train)
# r = export_text(estimator, feature_names=data_cancer['feature_names'])

The model is evaluated below using the accuracy and F1 metrics:

In [13]:
from sklearn.metrics import accuracy_score, f1_score
predictions = estimator.predict(A_test)
print("F1 Score (binary): " + str(f1_score(B_test, predictions)))
print("Accuracy: " + str(accuracy_score(B_test, predictions)))

F1 Score (binary): 0.8
Accuracy: 0.6666666666666666


## 3. KMeans
### 3.1 Training on the Breast Cancer Data

In [14]:
from sklearn.cluster import KMeans
kmeans = KMeans().fit(X_train)

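The model is evaluated below using the accuracy and F1 metrics. Note that KMeans is unsupervised and is left at its default of 8 clusters here, so the predicted cluster IDs do not correspond to the class labels 0 and 1, which explains the very low scores: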

In [15]:
from sklearn.metrics import accuracy_score, f1_score
predictions = kmeans.predict(X_test)
print("F1 Score (macro): " + str(f1_score(Y_test, predictions, average = 'macro')))
print("Accuracy: " + str(accuracy_score(Y_test, predictions)))

F1 Score (macro): 0.030303030303030304
Accuracy: 0.07017543859649122
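
Cluster IDs are arbitrary, so comparing them directly with class labels is not meaningful. One common workaround, sketched here under the assumption that two clusters are requested and that each cluster is mapped to its most frequent training label, makes the clustering comparable with the classifiers. The names `km`, `cluster_to_class`, and `mapped`, as well as the seed values, are hypothetical choices for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=0,
    stratify=cancer.target,
)

# Two clusters to match the two classes; seeds fixed for repeatability.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)

# Map each cluster ID to the most frequent training label in that cluster.
cluster_to_class = {
    cluster: int(np.bincount(y_tr[km.labels_ == cluster]).argmax())
    for cluster in range(km.n_clusters)
}

# Translate predicted cluster IDs into class labels before scoring.
mapped = np.array([cluster_to_class[int(c)] for c in km.predict(X_te)])
mapped_accuracy = accuracy_score(y_te, mapped)
print("Accuracy after mapping:", mapped_accuracy)
```

The mapping step uses the training labels, so this is no longer a purely unsupervised evaluation; it only makes the scores comparable in form.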


### 3.2 Training on the Play Tennis Data

In [16]:
from sklearn.cluster import KMeans
kmeans = KMeans().fit(A_train)

In [17]:
from sklearn.metrics import accuracy_score, f1_score
predictions = kmeans.predict(A_test)
print("F1 Score (macro): " + str(f1_score(B_test, predictions, average = 'macro')))
print("Accuracy: " + str(accuracy_score(B_test, predictions)))

F1 Score (macro): 0.0
Accuracy: 0.0


## 4. Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(random_state=0).fit(X_train, Y_train)
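
The Logistic Regression cell is left commented out in the notebook, and no results are reported for it. Purely as a sketch of how this section might be completed, the block below fits the model on the breast cancer data, following the same pipeline pattern as the SVM section. The scaling step, `max_iter=1000`, the seed values, and the names `logreg` and `logreg_scores` are assumptions, not the author's settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=0
)

# Standardizing the features first helps the solver converge.
logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, max_iter=1000),
)
logreg.fit(X_tr, y_tr)

predicted = logreg.predict(X_te)
logreg_scores = {
    "f1": f1_score(y_te, predicted),
    "accuracy": accuracy_score(y_te, predicted),
}
print(logreg_scores)
```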

## 5. Neural Network
### 5.1 Training on the Breast Cancer Data

In [19]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, Y_train)

In [20]:
from sklearn.metrics import accuracy_score, f1_score
predictions = clf.predict(X_test)
print("F1 Score (binary): " + str(f1_score(Y_test, predictions)))
print("Accuracy: " + str(accuracy_score(Y_test, predictions)))

F1 Score (binary): 0.9645390070921985
Accuracy: 0.956140350877193


### 5.2 Training on the Play Tennis Data

In [21]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=10000).fit(A_train, B_train)

In [22]:
from sklearn.metrics import accuracy_score, f1_score
predictions = clf.predict(A_test)
print("F1 Score (binary): " + str(f1_score(B_test, predictions)))
print("Accuracy: " + str(accuracy_score(B_test, predictions)))

F1 Score (binary): 0.8
Accuracy: 0.6666666666666666


## 6. SVM
### 6.1 Training on the Breast Cancer Data

In [23]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, Y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])

In [24]:
from sklearn.metrics import accuracy_score, f1_score
predictions = clf.predict(X_test)
print("F1 Score (binary): " + str(f1_score(Y_test, predictions)))
print("Accuracy: " + str(accuracy_score(Y_test, predictions)))

F1 Score (binary): 0.9722222222222222
Accuracy: 0.9649122807017544


### 6.2 Training on the Play Tennis Data

In [25]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(A_train, B_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])

In [26]:
from sklearn.metrics import accuracy_score, f1_score
predictions = clf.predict(A_test)
print("F1 Score (binary): " + str(f1_score(B_test, predictions)))
print("Accuracy: " + str(accuracy_score(B_test, predictions)))

F1 Score (binary): 0.8
Accuracy: 0.6666666666666666
