
## Bagging with RandomForest

In this exercise we use RandomForest, one of the bagging methods, for a classification task. The exercise was written around classifying tumors as malignant or benign with the [Wisconsin Breast Cancer Dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) from the UCI Machine Learning Repository. The cells below, however, load `mushrooms.csv` and predict the `population` column.

We will compare the performance of the Decision Tree and RandomForest algorithms on this task.
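To make the term "bagging" concrete, here is a minimal sketch of the idea done by hand: train several decision trees on bootstrap samples and combine them by majority vote. The synthetic data from `make_classification`, the choice of 25 trees, and all variable names are assumptions for illustration. `RandomForestClassifier` additionally samples a random subset of features at each split, which this sketch does not do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: a stand-in, not the notebook's dataset).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote over the trees for every test row.
votes = np.stack([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # labels are 0/1

single_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
bagged_acc = (bagged_pred == y_test).mean()
print("single tree:", single_acc, "bagged trees:", bagged_acc)
```

Whether the bagged vote beats the single tree depends on the data, so the two printed accuracies should be compared rather than assumed.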


**Import Libraries**

In [32]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import RandomForestClassifier # import RandomForest
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

**Data Preparation**

In [33]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
# Load data
df = pd.read_csv('/content/drive/MyDrive/ML/mushrooms.csv')

df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [35]:
# Use LabelEncoder to convert every categorical column to integer codes.
# The same encoder object is refit per column, so only the last column's mapping is retained.
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()
for column in df.columns:
    df[column] = labelencoder.fit_transform(df[column])

df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
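The encoding loop above refits one `LabelEncoder` for every column, so afterwards the integer codes cannot be mapped back to the original letters except for the last column. A small sketch of one alternative, keeping a separate encoder per column in a dictionary. The toy frame and the names `toy`, `encoders`, and `encoded` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical frame with the same kind of single-letter codes.
toy = pd.DataFrame({"class": ["p", "e", "e", "p"],
                    "habitat": ["u", "g", "m", "u"]})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for column in toy.columns:
    enc = LabelEncoder()
    encoded[column] = enc.fit_transform(toy[column])
    encoders[column] = enc

# Any column can now be decoded back to its original letters.
decoded_habitat = encoders["habitat"].inverse_transform(encoded["habitat"])
print(encoded)
print(list(decoded_habitat))
```

`LabelEncoder` assigns codes in sorted order of the labels, which is why `e` becomes 0 and `p` becomes 1 in the `class` column.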


In [36]:
df.shape

(8124, 23)

In [37]:
df.population.value_counts()

4    4040
5    1712
3    1248
2     400
0     384
1     340
Name: population, dtype: int64
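The counts above show that the six `population` classes are imbalanced. A quick sketch of the accuracy a trivial model would get by always predicting the most frequent class, computed from the printed counts. It is measured on the full dataset rather than on a test split, so treat it as a rough reference point only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({4: 4040, 5: 1712, 3: 1248, 2: 400, 0: 384, 1: 340})

majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()
print("majority class:", majority_class)
print("majority-class baseline accuracy:", round(baseline_accuracy, 4))
```

Any classifier for this target is worth comparing against this figure, since accuracy alone can look reasonable while being no better than guessing the majority class.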

In [38]:
# Check each column for null values
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

**Split training and testing data**

In [39]:
# Separate the independent variables (features) by dropping the target column
x = df.drop('population',axis=1)
x.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,2,7,7,0,2,1,4,2,5
1,0,5,2,9,1,0,1,0,0,4,...,2,2,7,7,0,2,1,4,3,1
2,0,0,2,8,1,3,1,0,0,5,...,2,2,7,7,0,2,1,4,3,3
3,1,5,3,8,1,6,1,0,1,5,...,2,2,7,7,0,2,1,4,2,5
4,0,5,2,3,0,5,1,1,0,4,...,2,2,7,7,0,2,1,0,3,1


In [40]:
# Separate the target variable (population)
y = df[['population']]
y.head()

Unnamed: 0,population
0,3
1,2
2,2
3,3
4,0


In [41]:
from sklearn.model_selection import train_test_split

#Separating the dataset into training and testing dataset
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3, random_state = 4)

In [42]:
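# Flatten the single-column target frames into 1-D arrays, the shape scikit-learn expects for labels
y_train = np.array(y_train['population'])
y_test = np.array(y_test['population'])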
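With an imbalanced target, a plain random split can give the training and test sets slightly different class proportions. A sketch of the `stratify` argument of `train_test_split`, shown on hypothetical labels (`y_demo`, `X_demo`) rather than on the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for the `population` column.
rng = np.random.default_rng(4)
y_demo = np.repeat([0, 1, 2], [600, 300, 100])
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train class shares:", train_share)
print("test class shares: ", test_share)
```

In the notebook this would correspond to passing `stratify=y` in the existing call, which is a suggestion and not something the original cell does.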

**Training Decision Tree**

1) Without Hyperparameter Optimization

In [43]:
#Fitting a decision tree with default hyper parameters
tree = DecisionTreeClassifier()
tree.fit(x_train,y_train)
pred_tree = tree.predict(x_test)

In [44]:
#Checking different metrics for decision tree model with default hyper parameters
print('Checking different metrics for decision tree model with default hyper parameters:\n')
print("Training accuracy: ",tree.score(x_train,y_train))
acc_score = accuracy_score(y_test, pred_tree)
print('Testing accuracy: ',acc_score)

Checking different metrics for decision tree model with default hyper parameters:

Training accuracy:  0.7460429124164615
Testing accuracy:  0.38474159146841674
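`classification_report` is imported at the top of the notebook but never called. A single accuracy number hides which classes are predicted badly, so a per-class breakdown can help explain a low test score. The sketch below uses scikit-learn's bundled iris data as a stand-in (an assumption); in the notebook the call would take the existing test labels and `pred_tree`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption: not the notebook's dataset).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y)

model = DecisionTreeClassifier(random_state=4).fit(X_train, y_train)
pred = model.predict(X_test)

# Human-readable table of precision, recall and F1 per class.
print(classification_report(y_test, pred, zero_division=0))

# The same information as a dictionary, convenient for further processing.
report = classification_report(y_test, pred, output_dict=True, zero_division=0)
for label in ("0", "1", "2"):
    print("class", label, "recall:", report[label]["recall"])
```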


2) With Hyperparameter Optimization

In [45]:
#Setting values for the parameters
#n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 10, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
max_features = [1, 2, 5, 10]

#Creating a dictionary for the hyper parameters
hyperT = dict(max_depth = max_depth, min_samples_split = min_samples_split, 
              min_samples_leaf = min_samples_leaf, max_features=max_features)

#Applying GridSearchCV to get the best value for hyperparameters
gridT = GridSearchCV(tree, hyperT, cv = 3, verbose = 1, n_jobs = -1)
bestT = gridT.fit(x_train, y_train)

Fitting 3 folds for each of 400 candidates, totalling 1200 fits


In [46]:
#Printing the best hyperparameters
print('The best hyper parameters are: \n',gridT.best_params_)

The best hyper parameters are: 
 {'max_depth': 5, 'max_features': 10, 'min_samples_leaf': 5, 'min_samples_split': 15}
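Retyping the reported values into a new constructor is error-prone. A sketch of two ways to reuse the search result directly: `best_estimator_`, which `GridSearchCV` refits on the whole training set by default, or unpacking `best_params_` into a new model. The iris data and the small grid here are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption: not the notebook's dataset).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=4), param_grid, cv=3)
grid.fit(X_train, y_train)

# Option 1: the search already refit the winning configuration on X_train.
best_tree = grid.best_estimator_

# Option 2: rebuild from the dictionary so no value is copied by hand.
rebuilt_tree = DecisionTreeClassifier(random_state=4, **grid.best_params_)
rebuilt_tree.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", best_tree.score(X_test, y_test))
```

Either option guarantees that the fitted model uses exactly the hyperparameters the search selected.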


3) Fitting Decision Tree with the best hyperparameters

In [47]:
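# Fit a decision tree with hyperparameters taken from the GridSearchCV result.
# Note: the search reported min_samples_split=15 and did not tune `splitter`;
# this cell uses min_samples_split=5 and splitter='random', so it is not exactly the configuration found.
tree1 = DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=5, min_samples_leaf=5, min_samples_split=5, max_features=10)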
tree1.fit(x_train,y_train)
pred_tree1 = tree1.predict(x_test)

In [48]:
#Checking different metrics for decision tree model after tuning the hyperparameters
print('Checking different metrics for decision tree model after tuning the hyperparameters:\n')
print("Training accuracy: ",tree1.score(x_train,y_train))
acc_score = accuracy_score(y_test, pred_tree1)
print('Testing accuracy: ',acc_score)

Checking different metrics for decision tree model after tuning the hyperparameters:

Training accuracy:  0.6188884980654239
Testing accuracy:  0.610746513535685


**Training RandomForest**

1) Without Hyperparameter Optimization

In [49]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)
pred_rf = rf.predict(x_test)

In [50]:
#Checking different metrics for random forest model with default hyper parameters
print('Checking different metrics for random forest model with default hyper parameters:\n')
print("Training accuracy: ",rf.score(x_train,y_train))
acc_score = accuracy_score(y_test, pred_rf)
print('Testing accuracy: ',acc_score)

Checking different metrics for random forest model with default hyper parameters:

Training accuracy:  0.7460429124164615
Testing accuracy:  0.38474159146841674
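Because each tree in a bagged ensemble is trained on a bootstrap sample, the rows left out of a tree's sample can act as a built-in validation set. A sketch of this out-of-bag estimate using the `oob_score=True` option of `RandomForestClassifier`, on synthetic data (an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: a stand-in, not the notebook's dataset).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

# oob_score=True scores every row using only the trees that did not see it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_acc = forest.oob_score_
train_acc = forest.score(X, y)
print("training accuracy:", train_acc)
print("out-of-bag accuracy:", oob_acc)
```

The out-of-bag figure is usually a more honest estimate than training accuracy, and it comes without setting aside extra data.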


2) With Hyperparameter Optimization

In [51]:
#Setting values for the parameters
#n_estimators = [100, 300, 500, 800, 1200]
max_depth = [5, 10, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
max_features = [1, 2, 5, 10]

#Creating a dictionary for the hyper parameters
hyper_rf = dict(max_depth = max_depth, min_samples_split = min_samples_split, 
              min_samples_leaf = min_samples_leaf, max_features=max_features)

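# Apply GridSearchCV to search for the best hyperparameter values.
# Note: the estimator passed here is `tree` (the DecisionTreeClassifier), so this search tunes a decision tree;
# pass `rf` instead to tune the random forest.
gridrf = GridSearchCV(tree, hyper_rf, cv = 3, verbose = 1, n_jobs = -1)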
bestrf = gridrf.fit(x_train, y_train)

Fitting 3 folds for each of 400 candidates, totalling 1200 fits


In [52]:
#Printing the best hyperparameters
print('The best hyper parameters are: \n',gridrf.best_params_)

The best hyper parameters are: 
 {'max_depth': 5, 'max_features': 10, 'min_samples_leaf': 2, 'min_samples_split': 100}


3) Fitting RandomForest with the best hyperparameters

In [53]:
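# Fit the model for this section.
# Note: despite its name, rf2 is a DecisionTreeClassifier, and min_samples_split=5 and max_features=5
# differ from the values reported by the search (min_samples_split=100, max_features=10).
rf2 = DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=5, min_samples_leaf=2, min_samples_split=5, max_features=5)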
rf2.fit(x_train,y_train)
pred_tree1 = rf2.predict(x_test)

In [54]:
#Checking different metrics for decision tree model after tuning the hyperparameters
print('Checking different metrics for decision tree model after tuning the hyperparameters:\n')
print("Training accuracy: ",rf2.score(x_train,y_train))
acc_score = accuracy_score(y_test, pred_tree1)
print('Testing accuracy: ',acc_score)

Checking different metrics for decision tree model after tuning the hyperparameters:

Training accuracy:  0.6053464650017587
Testing accuracy:  0.5783429040196882
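The search and the final fit in this section both use a decision tree estimator. For comparison, here is a sketch of what tuning an actual `RandomForestClassifier` could look like. The synthetic three-class data, the reduced grid, and the names `grid_rf` and `best_rf` are assumptions chosen so the example runs quickly; it does not reproduce the numbers printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multi-class data (assumption: a stand-in, not the notebook's dataset).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y)

# A deliberately small grid; n_estimators only applies to ensembles.
hyper_rf = {"n_estimators": [25, 50],
            "max_depth": [5, 10],
            "min_samples_leaf": [1, 5]}

# The estimator handed to the search is the random forest itself.
grid_rf = GridSearchCV(RandomForestClassifier(random_state=4), hyper_rf, cv=3)
grid_rf.fit(X_train, y_train)

best_rf = grid_rf.best_estimator_  # already refit on all of X_train
print("best params:", grid_rf.best_params_)
print("training accuracy:", best_rf.score(X_train, y_train))
print("testing accuracy:", best_rf.score(X_test, y_test))
```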


**Evaluation**

**Decision Tree.** Before hyperparameter tuning, the Decision Tree reaches a training accuracy of 0.7460429124164615 (about 74.6%). When the same model is applied to the test data, accuracy drops to 0.38474159146841674 (about 38.5%). This gap indicates that the model overfits: it does not carry what it learned from the training sample over to unseen samples. After hyperparameters are selected with GridSearchCV, the printed training accuracy is 0.6188884980654239 (about 61.9%) and the testing accuracy is 0.610746513535685 (about 61.1%).
This suggests that tuning the hyperparameters lowers the training accuracy but nearly closes the gap to the testing accuracy, that is, it reduces overfitting in the Decision Tree.

**Random Forest.** Before hyperparameter tuning, the printed training accuracy is 0.7460429124164615 (about 74.6%) and the testing accuracy is 0.38474159146841674 (about 38.5%), the same figures as for the default Decision Tree. After tuning, the model fitted in that section has a training accuracy of 0.6053464650017587 (about 60.5%) and a testing accuracy of 0.5783429040196882 (about 57.8%). As with the first model, tuning lowers the training accuracy and shrinks the gap between training and testing accuracy. Both the grid search and the final fit in that section use a DecisionTreeClassifier, however, so these tuned figures do not measure a tuned RandomForestClassifier.
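The overfitting argument rests on the difference between training and testing accuracy. A short sketch that computes this gap from the accuracies printed by the cells above; the dictionary keys are labels chosen here for readability.

```python
# (training accuracy, testing accuracy) copied from the printed cell outputs.
results = {
    "Decision Tree, default": (0.7460429124164615, 0.38474159146841674),
    "Decision Tree, tuned (tree1)": (0.6188884980654239, 0.610746513535685),
    "RandomForest, default": (0.7460429124164615, 0.38474159146841674),
    "Second tuned model (rf2)": (0.6053464650017587, 0.5783429040196882),
}

# A large positive gap means the model fits the training data much better than unseen data.
gaps = {name: train - test for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.3f}")
```

The default models show a gap of roughly 0.36, while the tuned models stay below 0.03, which is the quantitative basis for saying that tuning reduced overfitting.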