<a href="https://colab.research.google.com/github/alfonsoayalapaloma/ml-2024/blob/main/ds_eda_04_classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">


## <center> Classifiers </center>

# Random Forest

## How it works

### Bootstrap Sampling:

Random Forest creates multiple decision trees using different subsets of the training data. Each subset is created by randomly sampling the data with replacement (bootstrap sampling).
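
A minimal sketch of bootstrap sampling, using only Python's standard library. The ten-row training set and the seed are illustrative assumptions, not values from this notebook. Rows that are never drawn form the "out-of-bag" set for that tree.

```python
import random

n_samples = 10                      # hypothetical size of the training set
rng = random.Random(0)              # fixed seed so the sketch is repeatable

# Draw n_samples row indices WITH replacement: some rows repeat, others are left out.
bootstrap_idx = rng.choices(range(n_samples), k=n_samples)

# Rows that were never drawn are "out-of-bag" for this tree.
out_of_bag = sorted(set(range(n_samples)) - set(bootstrap_idx))

print("bootstrap sample:", sorted(bootstrap_idx))
print("out-of-bag rows: ", out_of_bag)
```

For large datasets, a bootstrap sample contains roughly 63% of the distinct rows (the limit of 1 - 1/e), so each tree sees a noticeably different view of the data.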

### Feature Randomness:

When splitting nodes in each decision tree, Random Forest considers a random subset of features rather than all features. This introduces more diversity among the trees.
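
The idea can be sketched as follows. The helper `candidate_features` is hypothetical and written only for illustration; it assumes the common rule of considering about the square root of the number of features at each split, which is what scikit-learn's `max_features="sqrt"` setting does.

```python
import math
import random

feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

def candidate_features(names, rng):
    """Pick the features one split is allowed to consider (hypothetical helper)."""
    k = max(1, int(math.sqrt(len(names))))   # sqrt rule: 4 features -> 2 candidates
    return rng.sample(names, k)              # sample WITHOUT replacement

rng = random.Random(0)
for split in range(3):
    # Every split draws a fresh subset, so different nodes see different features.
    print(f"split {split}: {candidate_features(feature_names, rng)}")
```

Because a strong feature is not always available, the trees are forced to use other features as well, which is what makes them less correlated with each other.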

### Building Trees:

Each decision tree is built independently from its own bootstrap sample, drawing a fresh random subset of features at every split. By default, the trees are grown to their maximum depth without pruning.
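
As a rough illustration, the fitted trees can be inspected through the `estimators_` attribute of a scikit-learn forest. The variable names (`X_demo`, `y_demo`, `forest`) and the choice of 10 trees are assumptions made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=10,    # small forest, just for inspection
    max_depth=None,     # no depth limit: trees grow until leaves are pure
    bootstrap=True,     # each tree is trained on its own bootstrap sample
    random_state=0,
)
forest.fit(X_demo, y_demo)

# Each element of estimators_ is an independently built decision tree.
depths = [tree.get_depth() for tree in forest.estimators_]
print("number of trees:", len(forest.estimators_))
print("tree depths:    ", depths)
```

The depths usually differ from tree to tree, since each one was grown on different rows and different candidate features.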

### Aggregation:

For classification tasks, the final prediction is made by aggregating the predictions of all individual trees. This is typically done by majority voting (the class that gets the most votes from the trees is the final prediction).
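
A minimal sketch of majority voting over hypothetical per-tree predictions. The `majority_vote` helper and the vote list are made up for illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this shows the textbook idea, not that library's exact procedure.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions of five trees for one flower.
tree_votes = ["versicolor", "virginica", "versicolor", "setosa", "versicolor"]

print(majority_vote(tree_votes))  # versicolor (3 of 5 votes)
```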

## Advantages

### Improved Accuracy:

By combining the predictions of multiple trees, Random Forest often achieves higher accuracy than individual decision trees.
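
One way to check this claim on a given dataset is to compare cross-validated accuracy of a single tree and a forest. The sketch below assumes 5-fold cross-validation on scikit-learn's bundled Iris data. On a dataset this small and clean the gap may be tiny or absent, so treat it as a template rather than evidence.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 folds for each model.
tree_acc = cross_val_score(single_tree, X_demo, y_demo, cv=5).mean()
forest_acc = cross_val_score(forest, X_demo, y_demo, cv=5).mean()

print(f"single tree:   {tree_acc:.3f}")
print(f"random forest: {forest_acc:.3f}")
```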

### Robustness:

Averaging over many decorrelated trees reduces overfitting compared with a single deep tree and makes the model more robust to noise in the data.

### Feature Importance:

Random Forest can provide estimates of feature importance, helping to understand which features are most influential in making predictions.
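
In scikit-learn, these estimates are exposed through the `feature_importances_` attribute of a fitted forest. The sketch below assumes the bundled Iris data; `data`, `forest`, and `ranked` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Impurity-based importances: one non-negative score per feature, summing to 1.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

These scores are computed from impurity reduction on the training data. When a more careful estimate is needed, `sklearn.inspection.permutation_importance` on held-out data is a commonly used alternative.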

## Disadvantages

### Complexity:

Random Forest models can be more complex and harder to interpret compared to a single decision tree.

### Computational Cost:

Training multiple trees can be computationally expensive and require more memory.
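
A rough way to see the memory side of this cost is to count the nodes stored across all trees. The helper `total_nodes` and the forest sizes (5 and 50 trees) are assumptions chosen for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

def total_nodes(n_trees):
    """Fit a forest and return the number of nodes summed over all its trees."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_demo, y_demo)
    return sum(tree.tree_.node_count for tree in forest.estimators_)

small, large = total_nodes(5), total_nodes(50)
print("nodes with  5 trees:", small)
print("nodes with 50 trees:", large)
```

Model size grows roughly in proportion to the number of trees. Training time grows similarly, although the `n_jobs` parameter of `RandomForestClassifier` lets the independent trees be built on several CPU cores in parallel.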


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

import seaborn as sns

# Load the Iris dataset as a pandas DataFrame
iris = sns.load_dataset('iris')
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
target_col = "species"

# classification_report lists labels in sorted order, so sort the class names to match
target_names = sorted(iris[target_col].unique())
X = iris[numeric_cols]
y = iris[target_col]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report
report = classification_report(y_test, y_pred, target_names=target_names)
print("Classification Report:\n", report)

Accuracy: 1.00
Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

