## Activity: Trees and Forests

In this exercise you will learn to:
- Train and tune a Decision Tree.
- Build a forest of trees by hand and combine their predictions by majority vote.
- Observe how an ensemble of models can outperform a single tree.

#### Part 1

1. Generate data
    - Create a moons-shaped dataset with make_moons(n_samples=10000, noise=0.4).

2. Split the data
    - Split it into a training set and a test set using train_test_split().

3. Tune the model
    - Use grid search with cross-validation (the GridSearchCV class) to find good hyperparameters for a DecisionTreeClassifier.
    - Hint: try different values of max_leaf_nodes.

4. Train and evaluate
    - Train the tree on the full training set using the best hyperparameters.
    - Evaluate it on the test set.

You should get roughly 85% to 87% accuracy.

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'max_leaf_nodes': range(2, 100)}
tree_clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(tree_clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_tree_clf = grid_search.best_estimator_

# GridSearchCV (refit=True by default) has already refit the best estimator on
# the full training set; the explicit fit below repeats that step as the
# exercise asks and yields the same tree because random_state is fixed.
best_tree_clf.fit(X_train, y_train)
y_pred = best_tree_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Test set accuracy: {accuracy:.4f}")

Best hyperparameters: {'max_leaf_nodes': 23}
Test set accuracy: 0.8735
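
To see how the cross-validated score changes across the grid, rather than only the winning value, one option is to read the cv_results_ attribute of the fitted search. The sketch below is a minimal illustration on a smaller dataset and a coarser grid; both are assumptions made here to keep it fast, not the settings used in the cell above, and the names X_demo, demo_grid and demo_search are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset and coarser grid than in the exercise, purely for speed (assumption).
X_demo, y_demo = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_grid = {"max_leaf_nodes": [2, 4, 8, 16, 32, 64]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=42), demo_grid, cv=5)
demo_search.fit(X_tr, y_tr)

# cv_results_ holds one entry per candidate; pair each value with its mean CV accuracy.
cv_scores = {
    params["max_leaf_nodes"]: score
    for params, score in zip(
        demo_search.cv_results_["params"],
        demo_search.cv_results_["mean_test_score"],
    )
}
for n_leaves, score in cv_scores.items():
    print(f"max_leaf_nodes={n_leaves:>2}: mean CV accuracy = {score:.4f}")

print("Best:", demo_search.best_params_, f"(CV accuracy {demo_search.best_score_:.4f})")
```

A flat stretch of scores around the best value suggests the exact choice of max_leaf_nodes matters little in that range.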


#### Part 2: Grow a forest

1. Generate subsets
    - Create 1,000 subsets of the training set, each containing 100 randomly selected instances.
    - Hint: use Scikit-Learn's ShuffleSplit.

2. Train multiple trees
    - Train a DecisionTreeClassifier on each subset, using the best hyperparameters found in Part 1.
    - Evaluate each individual tree on the test set.
    - Because they are trained on small sets, expect them to reach only ≈80% accuracy.

3. Combine predictions
    - For each test set instance, collect the predictions of the 1,000 trees.
    - Keep only the most frequent prediction, using SciPy's mode().
    - This implements a majority vote.

4. Evaluate the forest
    - Evaluate the combined predictions on the test set.
    - You should get slightly higher accuracy than the single tree from Part 1 (+0.5% to +1.5%).

🎉 Congratulations, you have implemented your own Random Forest from scratch!
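
The majority-vote step can be tried on a tiny array first. The sketch below uses made-up predictions (toy_votes is hypothetical) with one row per test instance and one column per tree, the same layout the exercise builds. It also shows a NumPy-only equivalent that holds for binary 0/1 labels.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical predictions: 4 test instances (rows) x 5 trees (columns).
toy_votes = np.array([
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# mode(..., axis=1) picks the most frequent label in each row.
# np.ravel makes the result 1D regardless of the SciPy version.
majority = np.ravel(mode(toy_votes, axis=1).mode)
print(majority)

# For binary 0/1 labels the same vote is "more than half of the trees voted 1".
majority_np = (toy_votes.mean(axis=1) > 0.5).astype(int)
print(majority_np)
```

Both approaches give the labels 1, 0, 1, 0 for these four rows. With an even number of trees a tie is possible; both versions then fall back to class 0, since mode returns the smallest of the tied values.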

In [None]:
from sklearn.model_selection import ShuffleSplit
from scipy.stats import mode
import numpy as np

# 1. Generate subsets
n_trees = 1000
n_instances = 100
shuffle_split = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)

# 2. Train multiple trees
trees = []
accuracy_scores = []

for train_index, test_index in shuffle_split.split(X_train):
    X_subset = X_train[train_index]
    y_subset = y_train[train_index]

    # Use the best hyperparameters found in Part 1
    tree_clf = DecisionTreeClassifier(max_leaf_nodes=grid_search.best_params_['max_leaf_nodes'], random_state=42)
    tree_clf.fit(X_subset, y_subset)
    trees.append(tree_clf)

    y_pred_individual = tree_clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_individual))

print(f"Mean accuracy of the individual trees: {np.mean(accuracy_scores):.4f}")

# 3. Combine predictions (majority vote)
y_pred_ensemble = np.empty([len(X_test), n_trees], dtype=np.int64)

for i, tree in enumerate(trees):
    y_pred_ensemble[:, i] = tree.predict(X_test)

y_pred_forest, _ = mode(y_pred_ensemble, axis=1)
y_pred_forest = np.ravel(y_pred_forest)  # flatten: older SciPy versions return a 2D array here

# 4. Evaluate the forest
accuracy_forest = accuracy_score(y_test, y_pred_forest)
print(f"Forest accuracy (majority vote): {accuracy_forest:.4f}")

Mean accuracy of the individual trees: 0.7988
Forest accuracy (majority vote): 0.8735
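
One way to build intuition for why the vote beats the average individual tree: if every tree were correct with probability p independently of the others, the chance that the majority is correct would follow a binomial sum. Independence is an idealization, since trees trained on subsets of the same data make correlated errors, so real gains are smaller than this calculation suggests. The helper majority_vote_accuracy is a hypothetical name introduced for this sketch.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p: float) -> float:
    """Probability that more than half of n independent voters are correct,
    when each voter is correct with probability p."""
    min_correct = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(min_correct, n_voters + 1)
    )


for n in (1, 3, 11, 101):
    prob = majority_vote_accuracy(n, 0.8)
    print(f"{n:>3} voters at p=0.80 -> majority correct with probability {prob:.4f}")
```

For three voters the sum is 3 × 0.8² × 0.2 + 0.8³ = 0.896, already above the 0.8 of a single voter.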


In [None]:
import pandas as pd

df = pd.read_csv('/content/Covid Data.csv')
display(df.head())

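| | USMER | MEDICAL_UNIT | SEX | PATIENT_TYPE | DATE_DIED | INTUBED | PNEUMONIA | AGE | PREGNANT | DIABETES | ... | ASTHMA | INMSUPR | HIPERTENSION | OTHER_DISEASE | CARDIOVASCULAR | OBESITY | RENAL_CHRONIC | TOBACCO | CLASIFFICATION_FINAL | ICU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 1 | 03/05/2020 | 97 | 1 | 65 | 2 | 2 | ... | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 97 |
| 1 | 2 | 1 | 2 | 1 | 03/06/2020 | 97 | 1 | 72 | 97 | 2 | ... | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 2 | 5 | 97 |
| 2 | 2 | 1 | 2 | 2 | 09/06/2020 | 1 | 2 | 55 | 97 | 1 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 |
| 3 | 2 | 1 | 1 | 1 | 12/06/2020 | 97 | 2 | 53 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 7 | 97 |
| 4 | 2 | 1 | 2 | 1 | 21/06/2020 | 97 | 2 | 68 | 97 | 1 | ... | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 97 |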


In [None]:
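# '9999-99-99' marks patients with no recorded date of death, so any other value means the patient died.
df['is_died'] = (df['DATE_DIED'] != '9999-99-99').astype(int)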
display(df.head())

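| | USMER | MEDICAL_UNIT | SEX | PATIENT_TYPE | DATE_DIED | INTUBED | PNEUMONIA | AGE | PREGNANT | DIABETES | ... | INMSUPR | HIPERTENSION | OTHER_DISEASE | CARDIOVASCULAR | OBESITY | RENAL_CHRONIC | TOBACCO | CLASIFFICATION_FINAL | ICU | is_died |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 1 | 03/05/2020 | 97 | 1 | 65 | 2 | 2 | ... | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 97 | 1 |
| 1 | 2 | 1 | 2 | 1 | 03/06/2020 | 97 | 1 | 72 | 97 | 2 | ... | 2 | 1 | 2 | 2 | 1 | 1 | 2 | 5 | 97 | 1 |
| 2 | 2 | 1 | 2 | 2 | 09/06/2020 | 1 | 2 | 55 | 97 | 1 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 1 |
| 3 | 2 | 1 | 1 | 1 | 12/06/2020 | 97 | 2 | 53 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 7 | 97 | 1 |
| 4 | 2 | 1 | 2 | 1 | 21/06/2020 | 97 | 2 | 68 | 97 | 1 | ... | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 97 | 1 |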
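
Before modelling, it can help to check how balanced the new label is, because a rare positive class makes accuracy hard to interpret. The sketch below applies the same labelling rule to a small made-up frame; toy_df and its values are hypothetical, not rows from the dataset.

```python
import pandas as pd

# Hypothetical mini-sample: '9999-99-99' marks patients with no recorded death date.
toy_df = pd.DataFrame({
    "DATE_DIED": ["9999-99-99", "03/05/2020", "9999-99-99", "9999-99-99", "9999-99-99"],
})
toy_df["is_died"] = (toy_df["DATE_DIED"] != "9999-99-99").astype(int)

# Share of each class; a heavily skewed split means a trivial
# "always predict 0" model would already score high accuracy.
class_share = toy_df["is_died"].value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be df['is_died'].value_counts(normalize=True).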


In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

feature_cols = [col for col in df.columns if col not in ['DATE_DIED', 'is_died']]
X = df[feature_cols]
y = df['is_died']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (838860, 20)
Shape of X_test: (209715, 20)
Shape of y_train: (838860,)
Shape of y_test: (209715,)
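
If the label turns out to be imbalanced, passing stratify=y to train_test_split keeps the class proportions the same in both splits. This is an optional variation, not what the cell above does. The sketch uses synthetic data; X_toy and y_toy are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10% positives (made-up data for illustration).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification both splits keep the 10% positive rate.
print(f"Positive rate, train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```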


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Instantiate a DecisionTreeClassifier object
tree_clf = DecisionTreeClassifier(random_state=42)

# Fit the decision tree model to the training data
tree_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = tree_clf.predict(X_test)

# Calculate the accuracy of the decision tree model on the test set
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the single Decision Tree: {accuracy:.4f}")

Accuracy of the single Decision Tree: 0.9398
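
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses made-up labels with 5% positives, an arbitrary choice and not the actual class balance of this dataset, to show that a model which always predicts the majority class can score high accuracy while finding no positive cases at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with 5% positives; the features are irrelevant to this baseline.
y_fake = np.array([1] * 50 + [0] * 950)
X_fake = np.zeros((len(y_fake), 1))

# "most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_fake, y_fake)
y_base = baseline.predict(X_fake)

acc_base = accuracy_score(y_fake, y_base)
rec_base = recall_score(y_fake, y_base)
print(f"Baseline accuracy: {acc_base:.4f}")
print(f"Baseline recall:   {rec_base:.4f}")
```

Fitting the same DummyClassifier on X_train and y_train would give the reference point for the 0.9398 reported above.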


In [None]:
# 1. Import necessary libraries (already imported in previous cells)

# 2. Define parameters
n_trees = 1000
n_instances = 100

# 3. Create ShuffleSplit object
shuffle_split = ShuffleSplit(n_splits=n_trees, train_size=n_instances, random_state=42)

# 4. Initialize lists
trees = []
accuracy_scores = []

# 5. Iterate through splits and train trees
for train_index, _ in shuffle_split.split(X_train):
    X_subset = X_train.iloc[train_index]
    y_subset = y_train.iloc[train_index]

    # Instantiate and fit the DecisionTreeClassifier (using default hyperparameters)
    tree_clf = DecisionTreeClassifier(random_state=42)
    tree_clf.fit(X_subset, y_subset)
    trees.append(tree_clf)

    # Evaluate individual tree on the test set
    y_pred_individual = tree_clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_individual))

# 6. Print average accuracy of individual trees
print(f"Mean accuracy of the individual trees: {np.mean(accuracy_scores):.4f}")

# 7. Combine predictions (majority voting)
y_pred_ensemble = np.empty([len(X_test), n_trees], dtype=np.int64)

for i, tree in enumerate(trees):
    y_pred_ensemble[:, i] = tree.predict(X_test)

# 8. Use mode to get majority vote
y_pred_forest, _ = mode(y_pred_ensemble, axis=1)
y_pred_forest = y_pred_forest.ravel()  # flatten: older SciPy versions return a 2D array here

# 9. Calculate accuracy of the forest
accuracy_forest = accuracy_score(y_test, y_pred_forest)

# 10. Print accuracy of the forest
print(f"Forest accuracy (majority vote): {accuracy_forest:.4f}")

Mean accuracy of the individual trees: 0.9174
Forest accuracy (majority vote): 0.9483
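
Scikit-Learn's BaggingClassifier can express a similar procedure: with bootstrap=False and max_samples=100, each tree is trained on 100 instances drawn without replacement. Treat it as a rough counterpart rather than an exact equivalent of the manual loop, because BaggingClassifier averages predicted class probabilities when the base estimator provides them instead of counting hard votes. The sketch runs on the moons data with fewer trees to stay quick; the sizes are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.2, random_state=42)

# 100 trees, each fit on 100 instances sampled without replacement.
# The base estimator is passed positionally because its keyword name
# differs between Scikit-Learn versions.
pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=100,
    bootstrap=False,
    random_state=42,
)
pasting_clf.fit(X_tr, y_tr)

acc_pasting = accuracy_score(y_te, pasting_clf.predict(X_te))
print(f"Ensemble accuracy: {acc_pasting:.4f}")
```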


In [None]:
from sklearn.metrics import accuracy_score, recall_score, f1_score

# `accuracy` and `y_pred` come from the single-tree cell above;
# `accuracy_forest` and `y_pred_forest` come from the forest cell.

print(f"Accuracy of the single Decision Tree: {accuracy:.4f}")
print(f"Accuracy of the Random Forest: {accuracy_forest:.4f}")

# Calculate and print Recall and F1 score for the single Decision Tree
recall_single_tree = recall_score(y_test, y_pred)
f1_single_tree = f1_score(y_test, y_pred)
print(f"Recall of the single Decision Tree: {recall_single_tree:.4f}")
print(f"F1 Score of the single Decision Tree: {f1_single_tree:.4f}")

# Calculate and print Recall and F1 score for the Random Forest
recall_forest = recall_score(y_test, y_pred_forest)
f1_forest = f1_score(y_test, y_pred_forest)
print(f"Recall of the Random Forest: {recall_forest:.4f}")
print(f"F1 Score of the Random Forest: {f1_forest:.4f}")

# Compare the two models metric by metric: (metric name, single tree value, forest value)
comparisons = [
    ("Accuracy", accuracy, accuracy_forest),
    ("Recall", recall_single_tree, recall_forest),
    ("F1 Score", f1_single_tree, f1_forest),
]

for metric, single, forest in comparisons:
    if forest > single:
        print(f"The Random Forest performed better than the single Decision Tree by {forest - single:.4f} in terms of {metric}.")
    elif forest < single:
        print(f"The single Decision Tree performed better than the Random Forest by {single - forest:.4f} in terms of {metric}.")
    else:
        print(f"The Random Forest and the single Decision Tree performed equally in terms of {metric}.")

Accuracy of the single Decision Tree: 0.9398
Accuracy of the Random Forest: 0.9483
Recall of the single Decision Tree: 0.4998
F1 Score of the single Decision Tree: 0.5468
Recall of the Random Forest: 0.4188
F1 Score of the Random Forest: 0.5408
The Random Forest performed better than the single Decision Tree by 0.0085 in terms of Accuracy.
The single Decision Tree performed better than the Random Forest by 0.0810 in terms of Recall.
The single Decision Tree performed better than the Random Forest by 0.0059 in terms of F1 Score.
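
The drop in recall is consistent with how a hard majority vote treats a rare class: an instance is labelled positive only when more than half of the trees agree. One possible adjustment, sketched below on made-up numbers, is to keep the fraction of trees voting 1 and lower the decision threshold. The threshold 0.3 and the toy arrays are arbitrary illustrations, not tuned values or results from this dataset.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up example: true labels and the fraction of trees voting 1 per instance.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
vote_fraction = np.array([0.9, 0.6, 0.4, 0.35, 0.45, 0.2, 0.1, 0.05, 0.0, 0.3])

results = {}
for threshold in (0.5, 0.3):
    # Label an instance positive when the share of positive votes exceeds the threshold.
    y_hat = (vote_fraction > threshold).astype(int)
    results[threshold] = (
        recall_score(y_true_toy, y_hat),
        precision_score(y_true_toy, y_hat),
    )
    recall, precision = results[threshold]
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

In the notebook, the vote fraction would be y_pred_ensemble.mean(axis=1). Lowering the threshold trades precision for recall, so the choice depends on which kind of error is costlier.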
