# RANDOM FORESTS

Aitor Díez Mateo - 72834331R

- Use Random Forest to process the Iris and Wine datasets.
    - What are the hyperparameters of the algorithm?
    - How do they affect the accuracy of the models?
    - Do random forests offer better or worse results?

Importing the necessary libraries

In [1]:
from sklearn.datasets import load_wine
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


Importing the Iris and Wine datasets

In [3]:
wine = load_wine()
df_wine = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df_wine['target'] = wine.target.astype(str)
df_wine['target'] = "class_" + df_wine['target']

#Get the data
X_wine = df_wine.drop('target', axis=1)
y_wine = df_wine['target']

df_wine.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,class_0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,class_0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,class_0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,class_0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,class_0


In [4]:
iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target.astype(str)
df_iris['target'] = "class_" + df_iris['target']

#Get the data
X_iris = df_iris.drop('target', axis=1)
y_iris = df_iris['target']

df_iris.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,class_0
1,4.9,3.0,1.4,0.2,class_0
2,4.7,3.2,1.3,0.2,class_0
3,4.6,3.1,1.5,0.2,class_0
4,5.0,3.6,1.4,0.2,class_0


Splitting the datasets into training and testing sets

In [11]:
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, test_size=0.3, random_state=10)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.3, random_state=10)

Training the models

In [12]:
clf_wine = RandomForestClassifier(n_estimators=100)
clf_wine = clf_wine.fit(X_train_wine, y_train_wine)

clf_iris = RandomForestClassifier(n_estimators=100)
clf_iris = clf_iris.fit(X_train_iris, y_train_iris)

Make predictions

In [13]:
y_pred_wine = clf_wine.predict(X_test_wine)
y_pred_iris = clf_iris.predict(X_test_iris)

Computing the accuracy of the models

In [15]:
accuracy_wine = accuracy_score(y_test_wine, y_pred_wine)
accuracy_iris = accuracy_score(y_test_iris, y_pred_iris)

print(f"Accuracy of the Wine dataset: {accuracy_wine}")
print(f"Accuracy of the Iris dataset: {accuracy_iris}")

Accuracy of the Wine dataset: 0.9629629629629629
Accuracy of the Iris dataset: 0.9777777777777777


These results are better than those obtained with the CART algorithm.
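
The comparison with CART can be reproduced on the same split. The following is a minimal sketch, assuming that CART refers to scikit-learn's `DecisionTreeClassifier` with default settings. The fixed `random_state=0` on both models is an added assumption for repeatability and is not used in the cells above, so the exact numbers may differ from the ones reported here.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same split settings as in the notebook (test_size=0.3, random_state=10)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# A single CART-style tree versus a forest of 100 trees.
# random_state=0 is an assumption added here for repeatability.
cart = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

cart_accuracy = accuracy_score(y_test, cart.predict(X_test))
forest_accuracy = accuracy_score(y_test, forest.predict(X_test))

print(f"CART accuracy on Wine:          {cart_accuracy:.4f}")
print(f"Random Forest accuracy on Wine: {forest_accuracy:.4f}")
```

Because both models are scored on the same held-out samples, the two numbers are directly comparable.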

The default hyperparameters of scikit-learn's `RandomForestClassifier` are:

n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None, monotonic_cst=None

More information about the hyperparameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
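
The defaults can also be read from the estimator itself instead of the documentation. This is a small sketch using `get_params()`; the values printed depend on the installed scikit-learn version, so they may not match the list above exactly.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with no arguments so every hyperparameter keeps its default value
default_forest = RandomForestClassifier()
default_params = default_forest.get_params()

# Print the hyperparameters in alphabetical order
for name, value in sorted(default_params.items()):
    print(f"{name} = {value!r}")
```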

Next, we try different hyperparameter values to see how they affect the accuracy of the models, using GridSearchCV to find the best combination for each dataset.

In [33]:
param_grid = {
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]

}
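
Before running the search it helps to know how large the grid is. The sketch below counts the candidate combinations with `ParameterGrid`; the fold count of 5 is taken from the `cv=5` argument used in the search cells.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 3 * 3 * 2
cv_folds = 5
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(f"Candidate combinations: {n_candidates}")
print(f"Cross-validation fits:  {n_fits}")
```

Each dataset therefore needs 216 candidates and 1080 forest fits, plus one final refit on the whole training set, which explains why the search is slow. In scikit-learn, `'entropy'` and `'log_loss'` both select the Shannon information gain, so listing both repeats the same candidates.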

Searching for the best hyperparameters with GridSearchCV

In [34]:
clf_iris_gridSearch = RandomForestClassifier()
grid_search_iris = GridSearchCV(estimator=clf_iris_gridSearch, param_grid=param_grid,cv=5,scoring='accuracy')

clf_wine_gridSearch = RandomForestClassifier()
grid_search_wine = GridSearchCV(estimator=clf_wine_gridSearch, param_grid=param_grid,cv=5,scoring='accuracy')
print("Training the models...")
grid_search_iris.fit(X_train_iris, y_train_iris)
print("Iris trained")
grid_search_wine.fit(X_train_wine, y_train_wine)

Training the models...
Iris trained


Getting the best hyperparameters and the mean cross-validated accuracy of the models

In [35]:
print(f"Best hyperparameters for the Iris dataset: {grid_search_iris.best_params_}")
print(f"Accuracy of the Iris dataset: {grid_search_iris.best_score_}")

print(f"Best hyperparameters for the Wine dataset: {grid_search_wine.best_params_}")
print(f"Accuracy of the Wine dataset: {grid_search_wine.best_score_}")

Best hyperparameters for the Iris dataset: {'bootstrap': True, 'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 15}
Accuracy of the Iris dataset: 0.9428571428571428
Best hyperparameters for the Wine dataset: {'bootstrap': False, 'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy of the Wine dataset: 0.9916666666666666


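As the results show, the hyperparameter values affect the accuracy of the models, and the best combination differs between datasets. Note that `best_score_` is the mean cross-validated accuracy on the training set, so it is not directly comparable with the test-set accuracy reported earlier.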
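
To compare a tuned model with the untuned one on equal terms, the refitted best estimator can be scored on the held-out test set. The sketch below is illustrative only: it uses a much smaller grid (`small_grid`) and a fixed `random_state=0`, both assumptions made here to keep it fast and repeatable, so its numbers will not match the ones above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Same split settings as in the notebook
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Reduced grid (assumption): only two hyperparameters, two values each
small_grid = {
    'min_samples_split': [2, 15],
    'min_samples_leaf': [1, 2],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=small_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

# best_score_ is the mean accuracy over the 5 validation folds
cv_accuracy = search.best_score_

# With the default refit=True, predict() uses the best estimator
# retrained on the whole training set
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print(f"Best hyperparameters:    {search.best_params_}")
print(f"Cross-validated accuracy: {cv_accuracy:.4f}")
print(f"Test-set accuracy:        {test_accuracy:.4f}")
```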

- Use a Random Forest to classify the CSV file of the previous exercise.
    - How do the hyperparameters affect the accuracy of the models?
    - Do random forests offer better or worse results?

Loading and splitting the dataset

In [36]:
df = pd.read_csv("C:/Users/aitor/PycharmProjects/ML-Class/data/03_a_tree_dataset.csv", names=["Outlook","Humidity","Wind","PlayTennis"])
df = pd.get_dummies(df,columns=["Outlook","Humidity","Wind"])
X = df.drop('PlayTennis', axis=1)
y = df['PlayTennis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Training the model

In [37]:
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)

Make predictions and get the accuracy of the model

In [38]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the dataset: {accuracy}")

Accuracy of the dataset: 0.445


In this case the accuracy is quite similar to that obtained with the CART algorithm.

As before, we try different hyperparameter values to see how they affect the accuracy of the model, using GridSearchCV to find the best combination.

In [39]:
param_grid = {
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]

}

In [40]:
clf_gridSearch = RandomForestClassifier()
grid_search = GridSearchCV(estimator=clf_gridSearch, param_grid=param_grid,cv=5,scoring='accuracy')
print("Training the model...")
grid_search.fit(X_train, y_train)

Training the model...




In [42]:
print(f"Best hyperparameters for the dataset: {grid_search.best_params_}")
print(f"Accuracy of the dataset: {grid_search.best_score_}")

Best hyperparameters for the dataset: {'bootstrap': True, 'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 5}
Accuracy of the dataset: 0.51875


In summary, the hyperparameter configuration affects the accuracy of the model: depending on the hyperparameter values, we can obtain better or worse results.
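
The effect of each hyperparameter can be inspected through the `cv_results_` attribute of a fitted `GridSearchCV`. The following is a minimal sketch on the Iris data, since the CSV file is not available here. The reduced grid (`small_grid`) and the fixed `random_state=0` are assumptions chosen for speed and repeatability.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Reduced grid (assumption): 2 x 2 = 4 candidate combinations
small_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 4],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=small_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)

# One row per candidate combination, with its mean validation accuracy
results = pd.DataFrame(search.cv_results_)

# Average the validation accuracy over all candidates sharing a given value
for param in small_grid:
    column = f"param_{param}"
    per_value = results.groupby(results[column].astype(str))["mean_test_score"].mean()
    print(f"\nMean cross-validated accuracy by {param}:")
    print(per_value)
```

Values of a hyperparameter whose averages are nearly identical have little influence on accuracy for that dataset, while large gaps point to the settings worth tuning.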