# Random Forests

The method is based on combining the predictions produced by multiple decision trees into a single prediction.

![image.png](image10.png)
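To make the aggregation step concrete, here is a minimal sketch on synthetic data (the `make_classification` dataset and the `*_demo` names are illustrative assumptions, not part of this notebook). In scikit-learn, `RandomForestClassifier` averages the class probabilities predicted by its trees and returns the class with the highest mean probability; the snippet reproduces that by hand from `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: used only for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the per-tree class probabilities, then pick the most probable class.
tree_probas = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

# Compare the hand-made aggregate with the forest's own prediction.
print(np.array_equal(manual_pred, forest.predict(X_demo)))
```

The individual trees are available in `forest.estimators_` after fitting, so the same comparison can be run on any fitted forest.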

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.model_selection import GridSearchCV

# Load the data and drop rows with missing values
diabetes = pd.read_csv('diabetes.csv')
df = diabetes.copy()
df = df.dropna()

# Separate the target column from the features
y = df['Outcome']
X = df.drop('Outcome', axis=1)

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=238)

In [2]:
from sklearn.ensemble import RandomForestClassifier

In [4]:
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

In [6]:
y_pred = rf_model.predict(X_test)
accuracy_score(y_test, y_pred)

0.7575757575757576

## Model Tuning

In [7]:
?rf_model

Type:        RandomForestClassifier
String form: RandomForestClassifier()
Length:      100
File:        c:\users\alperen arda\appdata\local\programs\python\python311\lib\site-packages\sklearn\ensemble\_forest.py
Docstring:
A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree
classifiers on various sub-samples of the dataset and uses averaging to
improve the predictive accuracy and control over-fitting.
Trees in the forest use the best split strategy, i.e. equivalent to passing
`splitter="best"` to the underlying :class:`~sklearn.tree.DecisionTreeClassifier`.
The sub-sample size is controlled with the `max_samples` parameter if
`bootstrap=True` (default), otherwise the whole dataset is used to build
each tree.

For a comparison between tree-based ensemble models see the example
:ref:`sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py`.

Read more in the :ref

In [11]:
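rf_params = {"max_depth": [2, 3, 5, 10, 20, 50],   # maximum depth of each tree
             "max_features": [2, 5, 8],            # features considered at each split
             "n_estimators": [10, 500, 1000],      # number of trees in the forest
             "min_samples_split": [2, 5, 10]}      # minimum samples needed to split a node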

If our dataset has 8 features and all 8 are made available to every decision tree, the method being applied is Bagging. Bagging (Bootstrap Aggregating) is an ensemble learning method in which each sub-model (such as a decision tree) is trained on a different bootstrap sample while using all of the features. This reduces the variance of the model and, with it, the risk of overfitting. In the grid above, `max_features=8` corresponds to this case, since `max_features` is the number of features considered at each split. Bagging is therefore a special case of Random Forests.
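A small sketch of this relationship, assuming synthetic data with 8 features (the dataset and the variable names are illustrative, not taken from the notebook). Setting `max_features=None` makes every split consider all features, which gives bagged trees; `max_features="sqrt"` restricts each split to a random subset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 8 features (assumption: illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Random forest: each split considers a random subset of int(sqrt(8)) = 2 features.
rf_subset = RandomForestClassifier(n_estimators=20, max_features="sqrt", random_state=0)
rf_subset.fit(X_demo, y_demo)

# Bagging as a special case: each split considers all 8 features.
rf_bagging = RandomForestClassifier(n_estimators=20, max_features=None, random_state=0)
rf_bagging.fit(X_demo, y_demo)

# max_features_ on each fitted tree reports how many features a split may use.
subset_sizes = {tree.max_features_ for tree in rf_subset.estimators_}
bagging_sizes = {tree.max_features_ for tree in rf_bagging.estimators_}
print(subset_sizes, bagging_sizes)
```

Both models still draw a bootstrap sample per tree; the only difference is how many features each split is allowed to look at.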

Random Forests and Bagging are among the methods whose hyperparameter search space grows quickly. For example, if GridSearchCV tries 10 different values for each of four hyperparameters under 10-fold cross-validation, that is 10^4 = 10,000 combinations and 100,000 model fits in total; with 500 trees per model, each of those fits in turn trains 500 trees. For this reason, the parameter combinations to tune should be chosen carefully for these algorithms.
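The size of a search can be checked before running it. The sketch below counts the candidates of the grid defined above with `ParameterGrid` and multiplies by the number of folds. The `big_grid` dictionary is hypothetical (its parameter names `a` to `d` are placeholders) and only mirrors the example of 10 values for each of four hyperparameters.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
rf_params = {"max_depth": [2, 3, 5, 10, 20, 50],
             "max_features": [2, 5, 8],
             "n_estimators": [10, 500, 1000],
             "min_samples_split": [2, 5, 10]}
n_folds = 10

# Number of parameter combinations: 6 * 3 * 3 * 3
n_candidates = len(ParameterGrid(rf_params))
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)

# Hypothetical grid: four placeholder parameters with 10 values each.
big_grid = {name: list(range(10)) for name in ["a", "b", "c", "d"]}
big_fits = len(ParameterGrid(big_grid)) * n_folds
print(big_fits)
```

With the default `refit=True`, GridSearchCV adds one more fit on the full training set once the best combination is known.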

In [12]:
rf_model = RandomForestClassifier()
rf_cv_model = GridSearchCV(rf_model, rf_params, cv=10, n_jobs=-1, verbose=2)

In [13]:
rf_cv_model.fit(X_train, y_train)

Fitting 10 folds for each of 162 candidates, totalling 1620 fits


In [14]:
rf_cv_model.best_params_

{'max_depth': 3, 'max_features': 5, 'min_samples_split': 5, 'n_estimators': 10}

In [15]:
rf_tuned = RandomForestClassifier(max_depth=3, max_features=5, min_samples_split=5, n_estimators=10)
rf_tuned.fit(X_train, y_train)
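As an alternative to retyping the best parameters by hand, `GridSearchCV` with its default `refit=True` keeps a model refitted on the whole training set in `best_estimator_`. A minimal sketch on synthetic data with a deliberately tiny grid (the data, the grid, and the variable names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# Small grid so the search finishes quickly.
small_grid = {"max_depth": [2, 5], "n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is already fitted with best_params_ on all of X_tr.
tuned = search.best_estimator_
print(search.best_params_)
print(tuned.score(X_te, y_te))
```

Fixing `random_state` on the forest, as done here, also makes the reported scores repeatable from run to run.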

In [17]:
y_train_pred = rf_tuned.predict(X_train)
acc_train = accuracy_score(y_train, y_train_pred)
acc_train

0.7914338919925512

In [19]:
y_test_pred = rf_tuned.predict(X_test)
acc_test = accuracy_score(y_test, y_test_pred)
acc_test

0.7575757575757576