<a href="https://colab.research.google.com/github/alfonsoayalapaloma/ml-2024/blob/main/ml_01_classifiers_iris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">


## <center> Classifiers

# Random Forest

## How it works

### Bootstrap Sampling:

Random Forest creates multiple decision trees using different subsets of the training data. Each subset is created by randomly sampling the data with replacement (bootstrap sampling).
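As a rough illustration of the sampling step, the sketch below draws a bootstrap sample from ten stand-in row indices using only the standard library. The helper name `bootstrap_sample` and the toy data are assumptions made for this example; they are not part of scikit-learn.

```python
import random

def bootstrap_sample(rows, seed=None):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]

rows = list(range(10))                        # stand-in for 10 training rows
sample = bootstrap_sample(rows, seed=0)       # same size as the original data
out_of_bag = sorted(set(rows) - set(sample))  # rows that were never drawn

print("bootstrap sample:", sample)
print("out-of-bag rows:", out_of_bag)
```

Because sampling is done with replacement, some rows usually appear more than once and others not at all. The rows left out of a given sample are often called "out-of-bag" rows for that tree.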

### Feature Randomness:

When splitting nodes in each decision tree, Random Forest considers a random subset of features rather than all features. This introduces more diversity among the trees.
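A minimal sketch of the idea, assuming the subset size is the square root of the number of features (a common choice for classification). The helper `random_feature_subset` is hypothetical and only mimics what a tree might do at a single split.

```python
import math
import random

def random_feature_subset(feature_names, seed=None):
    """Pick about sqrt(n) distinct features to consider at one split."""
    k = max(1, int(math.sqrt(len(feature_names))))
    rng = random.Random(seed)
    return rng.sample(feature_names, k)  # sampling without replacement

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
candidates = random_feature_subset(features, seed=1)
print("features considered at this split:", candidates)
```

In scikit-learn, the `max_features` parameter of `RandomForestClassifier` controls this subset size.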

### Building Trees:

Each decision tree is built independently from its own bootstrap sample, choosing each split from a random subset of features. The trees are typically grown to their maximum depth without pruning.
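One way to see the individual trees is to inspect a fitted forest. The sketch below assumes scikit-learn's bundled copy of the Iris data (`load_iris`) and a small forest of 10 trees chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# max_depth is left at its default (None), so trees grow until leaves are pure
small_forest = RandomForestClassifier(n_estimators=10, random_state=42)
small_forest.fit(X_demo, y_demo)

# Each fitted tree is available in the estimators_ list
depths = [tree.get_depth() for tree in small_forest.estimators_]
print("number of trees:", len(small_forest.estimators_))
print("depth of each tree:", depths)
```

The depths usually differ from tree to tree, since each tree sees a different bootstrap sample and different candidate features.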

### Aggregation:

For classification tasks, the final prediction is made by aggregating the predictions of all individual trees. This is typically done by majority voting (the class that gets the most votes from the trees is the final prediction).
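Majority voting for a single sample can be sketched in a few lines. The votes below are made-up labels for illustration, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the trees' predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Predictions from five hypothetical trees for the same flower
votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]
winner = majority_vote(votes)
print(winner)  # setosa (3 of 5 votes)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes; the two approaches often agree, but not always.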

## Advantages

### Improved Accuracy:

By combining the predictions of multiple trees, Random Forest often achieves higher accuracy than individual decision trees.

### Robustness:

Averaging many decorrelated trees reduces overfitting compared with a single deep tree and makes the model more robust to noise in the data.

### Feature Importance:

Random Forest can provide estimates of feature importance, helping to understand which features are most influential in making predictions.
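As a sketch of how these estimates can be read in scikit-learn, the example below fits a forest on the library's bundled Iris data (an assumption made to keep the example self-contained) and prints the `feature_importances_` attribute.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
fi_forest = RandomForestClassifier(n_estimators=100, random_state=42)
fi_forest.fit(data.data, data.target)

# Impurity-based importances, one per feature; they sum to 1
importances = dict(zip(data.feature_names, fi_forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

These impurity-based scores are relative to each other and depend on the fitted model, so they are best read as a ranking rather than as absolute measurements.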

## Disadvantages

### Complexity:

Random Forest models can be more complex and harder to interpret compared to a single decision tree.

### Computational Cost:

Training multiple trees can be computationally expensive and require more memory.


# Solving a Machine Learning Problem



1.   Problem analysis. Choosing a model.
2.   Dataset extraction. Cleaning.
3.   EDA. Exploratory analysis of the dataset.
4.   Feature engineering [determine the independent variables (X) and the dependent variable (y)].
5.   Split the dataset into train and test sets.
6.   Create the model and train it.
7.   Make predictions.
8.   Evaluate the model.
9.   Visualize the model's results.
10.  Conclusions.


# Classifying the Species of an Iris Flower

1. Problem analysis


2. Dataset extraction. Cleaning

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

import seaborn as sns

iris = sns.load_dataset('iris')
iris.sample(3)

     sepal_length  sepal_width  petal_length  petal_width    species
146           6.3          2.5           5.0          1.9  virginica
100           6.3          3.3           6.0          2.5  virginica
129           7.2          3.0           5.8          1.6  virginica


3. EDA. Dataset analysis
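A minimal exploratory pass might look like the sketch below. To stay self-contained it loads Iris from scikit-learn instead of seaborn and renames the columns to match the seaborn names used in this notebook; that loading code and the name `iris_df` are assumptions for this example.

```python
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
iris_df = bunch.frame.copy()

# Rename columns to the seaborn-style names used elsewhere in this notebook
iris_df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
id_to_name = {int(i): str(n) for i, n in enumerate(bunch.target_names)}
iris_df["species"] = iris_df["target"].map(id_to_name)
iris_df = iris_df.drop(columns="target")

print(iris_df.describe())                 # range and spread of each measurement
print(iris_df["species"].value_counts())  # class balance
print(iris_df.isna().sum())               # missing values per column
```

With the seaborn DataFrame already loaded as `iris`, the last three calls can be applied to it directly.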

4. Feature engineering [determine the independent variables (X) and the dependent variable (y)]

In [None]:
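numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
target_col = "species"

# Sorted so the names line up with the label order used by classification_report
target_names = sorted(iris[target_col].unique())
X = iris[numeric_cols]  # independent variables (features)
y = iris[target_col]    # dependent variable (class label)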

5. Split the dataset into train and test sets

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('X_train',X_train.shape)
print('X_test',X_test.shape)

X_train (120, 4)
X_test (30, 4)
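A plain random split does not guarantee that each species appears in the test set in the same proportion as in the full data. One option, sketched here on scikit-learn's bundled Iris data (an assumption for self-containment), is to pass `stratify` to `train_test_split`.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_s, y_s = load_iris(return_X_y=True)

# stratify=y_s keeps the class proportions equal in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42, stratify=y_s
)

test_counts = Counter(y_te.tolist())
print(test_counts)  # 10 test samples for each of the 3 classes
```

Whether to stratify is a design choice; it tends to matter more when the dataset is small or the classes are imbalanced.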


6. Create and train (fit) the model

In [None]:
# Initialize the classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

7. Make predictions

In [None]:
# Make predictions on the test set
y_pred = clf.predict(X_test)

8. Evaluate the model

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report
report = classification_report(y_test, y_pred, target_names=target_names)
print("Classification Report:\n", report)

Accuracy: 1.00
Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
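A confusion matrix complements the report by showing which classes get confused with which. The labels below are hand-made for illustration and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only: one virginica is misclassified as versicolor
y_true_demo = ["setosa", "versicolor", "virginica", "virginica", "versicolor"]
y_pred_demo = ["setosa", "versicolor", "versicolor", "virginica", "versicolor"]
labels = ["setosa", "versicolor", "virginica"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)
```

In this notebook the equivalent call would be `confusion_matrix(y_test, y_pred)`.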



9. Visualize the results

In [None]:
# Collect the test features together with the predicted and true labels
combined_df = X_test.copy()
combined_df['y_pred'] = y_pred
combined_df['y_test'] = y_test

# Color each test point by its predicted species
colors = {"versicolor": "red", "setosa": "green", "virginica": "blue"}
combined_df['colors'] = combined_df['y_pred'].map(colors)
combined_df.plot.scatter(x='sepal_width', y='sepal_length', color=combined_df['colors']);

10. Draw conclusions

The model achieves high accuracy, so it is accepted as a classifier for this dataset. However, it should be validated on a dataset with more rows.
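When more data is not available, cross-validation is one way to check whether a score from a single 30-row test set holds up. The sketch below is one possible setup, assuming scikit-learn's bundled Iris data and 5 stratified folds; the fold count is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_all, y_all = load_iris(return_X_y=True)

# Every row is used for testing exactly once across the 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(cv_model, X_all, y_all, cv=cv)

print("accuracy per fold:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A mean that stays high with a small spread across folds gives more confidence than one perfect score on a single split.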

# EXERCISE
Follow the same steps for the following problem: classify whether or not a purchase will be made, using the dataset that contains the fields age, estimated salary, and purchase made (`Age`, `EstimatedSalary`, `Purchased`).

The dataset is located at:



```
url="https://raw.githubusercontent.com/alfonsoayalapaloma/datasets/main/Social_Network_Ads.csv"

```





Example output from one run of a solution (your numbers may differ):

   Age  EstimatedSalary  Purchased
0   19            19000          0
1   35            20000          0
2   26            43000          0
3   27            57000          0
4   19            76000          0
Accuracy: 0.9083333333333333

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.90      0.92        73
           1       0.86      0.91      0.89        47

    accuracy                           0.91       120
   macro avg       0.90      0.91      0.90       120
weighted avg       0.91      0.91      0.91       120


