# Classification with Decision Trees and Random Forests

In this notebook, we'll build classifiers with two tree-based machine learning algorithms: Decision Trees and Random Forests. We'll use the California Housing dataset from scikit-learn. It is a regression dataset, so we turn it into a classification problem by predicting whether a district's median house value is "high-priced" or not.

### 1. Understanding Decision Trees and Random Forests

**Decision Trees:**

* **Intuitive:** Decision trees mimic human decision-making, creating a flowchart-like structure where each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label or prediction.

* **Interpretable:** Small decision trees are easy to understand and visualize, making them well suited to explaining model decisions; very deep trees become harder to read.

* **Limitations:** Decision trees are prone to overfitting, especially when they become deep and complex.
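The overfitting point can be illustrated with a minimal sketch. It assumes synthetic data from `make_classification` with deliberate label noise (a stand-in, not the housing data), and the variable names are illustrative. It compares an unconstrained tree with one limited to `max_depth=3`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the housing data).
# flip_y adds label noise, which an unconstrained tree tends to memorize.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One tree grown until its leaves are pure, one limited to three levels
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

deep_train, deep_test = deep_tree.score(X_tr, y_tr), deep_tree.score(X_te, y_te)
shallow_train = shallow_tree.score(X_tr, y_tr)
shallow_test = shallow_tree.score(X_te, y_te)

print(f"Deep tree:    train={deep_train:.2f}  test={deep_test:.2f}")
print(f"Shallow tree: train={shallow_train:.2f}  test={shallow_test:.2f}")
```

A large gap between training and test accuracy for the deep tree is the typical signature of overfitting; the shallow tree usually shows a much smaller gap.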

**Random Forests:**

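* **Ensemble Power:** Random Forests are an ensemble learning method that combines multiple decision trees to make predictions. Each tree is trained on a bootstrap sample of the training data and considers a random subset of features at each split, so the trees differ from one another.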

* **Reduced Overfitting:** By averaging the predictions of multiple trees, random forests reduce the risk of overfitting and improve generalization to new data.

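* **Robustness:** Random forests are relatively robust to noisy data, because errors made by individual trees tend to average out. Some implementations can also handle missing values natively; in scikit-learn this depends on the version, so check the documentation or impute missing values first.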
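To make the averaging idea concrete, here is a small sketch, again on synthetic stand-in data (an assumption, not the housing data). It averages the class probabilities of the individual trees stored in `estimators_` and compares the result with the forest's own `predict_proba` and `predict`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes)
per_tree_proba = np.stack([tree.predict_proba(X_te) for tree in forest.estimators_])

# Average over the trees, then pick the most probable class for each sample
mean_proba = per_tree_proba.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

matches_proba = np.allclose(mean_proba, forest.predict_proba(X_te))
matches_pred = bool((manual_pred == forest.predict(X_te)).all())
print(f"Averaged probabilities match the forest: {matches_proba}")
print(f"Predictions match the forest: {matches_pred}")
```

In scikit-learn the forest's prediction is the class with the highest mean probability across trees, which is why the manual average should agree with the built-in methods.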

### 2. Loading and Preprocessing the Data

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset
housing = fetch_california_housing(as_frame=True)
data = housing.data
target = housing.target

# Create a binary target variable (High-priced or not)
median_house_value = target
median_house_value_threshold = median_house_value.median()  # You can adjust this threshold
target_binary = (median_house_value > median_house_value_threshold).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target_binary, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

**Explanation:**

1.  **Load the dataset:** We use `fetch_california_housing` to load the dataset directly from scikit-learn.
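2.  **Create binary target:** We use the median of the house values as the threshold for "high-priced" and create a binary target that is 1 when a district's median house value is above the threshold and 0 otherwise. Splitting at the median gives two classes of roughly equal size, which keeps accuracy a meaningful metric.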
3.  **Split data:** We divide the data into training and testing sets for model training and evaluation.
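4.  **Standardize features:** We fit the scaler on the training set only and apply the same transformation to the test set, which avoids leaking test-set statistics into training. Tree-based models split on feature thresholds, so scaling is not required for them and should not noticeably change their accuracy; the step is useful mainly if you later compare against scale-sensitive models.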
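As a sanity check on the standardization step, the sketch below compares a decision tree trained on raw features with one trained on standardized features. It assumes synthetic stand-in data with artificially different feature scales (not the housing data). Because trees split on thresholds, the two models are expected to agree on nearly all test predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
# Put the features on very different scales
X = X * np.array([1.0, 10.0, 100.0, 2.0, 1000.0, 5.0])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler().fit(X_tr)

raw_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
scaled_tree = DecisionTreeClassifier(random_state=1).fit(scaler.transform(X_tr), y_tr)

# Share of test samples on which both trees predict the same class
agreement = np.mean(
    raw_tree.predict(X_te) == scaled_tree.predict(scaler.transform(X_te))
)
print(f"Share of identical test predictions: {agreement:.2f}")
```

Standardization does not change the ordering of values within a feature, so the same splits remain available to the tree; any small disagreement would come from floating-point rounding.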

### 3. Decision Tree Classifier

**Objective:** Build a Decision Tree model to predict whether a district's median house value is "high-priced" or not.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model on the training data
clf.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred = clf.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of Decision Tree Classifier: {accuracy:.2f}')
```

Accuracy of Decision Tree Classifier: 0.85


**Explanation:**

1.  **Import necessary modules:** We import the `DecisionTreeClassifier` and `accuracy_score` from scikit-learn.
2.  **Create the classifier:** We initialize a `DecisionTreeClassifier` object with a random state for reproducibility.
3.  **Train the model:** We fit the classifier to the scaled training data.
4.  **Make predictions:** We use the trained model to predict the class labels (high-priced or not) for the testing data.
5.  **Evaluate the model:** We calculate the accuracy, which is the proportion of correct predictions.
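To see what "proportion of correct predictions" means in practice, here is a tiny worked example with hand-made labels (illustrative values, not model output). It computes accuracy by hand and with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hand-made labels for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# 6 of the 8 predictions match the true labels
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy)   # 0.75
print(library_accuracy)  # 0.75
```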

### 4. Random Forest Classifier

**Objective:** Build a Random Forest model to predict whether a district's median house value is "high-priced" or not.

```python
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
rf_clf.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred_rf = rf_clf.predict(X_test_scaled)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Accuracy of Random Forest Classifier: {accuracy_rf:.2f}')
```

Accuracy of Random Forest Classifier: 0.89


**Explanation:**

1.  **Import necessary modules:** We import `RandomForestClassifier` from scikit-learn; `accuracy_score` was already imported in the previous section.
2.  **Create the classifier:** We initialize a `RandomForestClassifier` object with 100 trees and a random state for reproducibility.
3.  **Train the model:** We fit the classifier to the scaled training data.
4.  **Make predictions:** We use the trained model to predict the class labels for the testing data.
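5.  **Evaluate the model:** We calculate the accuracy in the same way as for the decision tree. Here the forest reaches 0.89, compared with 0.85 for the single tree.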

**Student Challenge (Colab):**

1.  **Experiment with hyperparameters:** Try changing the `max_depth` or `min_samples_split` parameters of the `DecisionTreeClassifier`. Observe how it affects the accuracy.
2.  **Evaluate other metrics:** Research and calculate other classification metrics like precision, recall, and F1-score for both models.
3.  **Visualize a Decision Tree:** Use `plot_tree` from `sklearn.tree` to visualize one of the decision trees in your Random Forest.
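As a possible starting point for the metrics challenge, the sketch below computes precision, recall, and F1-score on a small set of hand-made labels (illustrative values, not model output). The same functions can be applied to `y_test` and the predictions from either model:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hand-made labels for illustration only:
# 2 true positives, 1 false positive, 2 false negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0])

precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two = 4 / 7

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

For the visualization challenge, note that a fitted scikit-learn forest stores its individual trees in the `estimators_` attribute.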

This notebook provides a hands-on introduction to classification using Decision Trees and Random Forests. By experimenting with hyperparameters and evaluating different metrics, you can gain a deeper understanding of these algorithms and their performance on real-world datasets.
