### Random Forest Classifier Exercise with Digits Dataset

This notebook walks through implementing a Random Forest classifier with Python's `scikit-learn` library. We use the handwritten digits dataset, loaded with `load_digits`, to demonstrate what the model can do.

## Objectives
- Understand how to build a Random Forest classifier.
- Visualize sample images from the dataset.
- Learn how to interpret the model's feature importance.
- Apply model evaluation techniques to assess performance.

In [None]:
# Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

## Step 2: Load the Digits Dataset

We will use the digits dataset, loaded with `load_digits` from `scikit-learn`. It contains 1,797 grayscale images of handwritten digits (0-9), each stored as an 8x8 matrix of pixel intensities from 0 to 16.

In [None]:
# Load digits dataset from sklearn
digits = datasets.load_digits()

# Display dataset information
print(f"Number of samples: {len(digits.images)}")
print(f"Image shape: {digits.images[0].shape}")

# Convert the dataset to a DataFrame for easier handling
data = pd.DataFrame(data=digits.data, columns=[f'pixel_{i}' for i in range(digits.data.shape[1])])
data['target'] = digits.target

# Show the first few rows of the dataset
data.head()
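
The relationship between `digits.images` and `digits.data` is worth checking before going further. The short sketch below reloads the dataset on its own and confirms that each row of `data` is simply the corresponding 8x8 image flattened row by row. The variable names `flattened` and `same_values` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8x8 image row by row into a 64-value vector
flattened = digits.images.reshape(len(digits.images), -1)

# Compare the flattened images with the ready-made feature matrix
same_values = np.array_equal(flattened, digits.data)

print(f"images shape: {digits.images.shape}")
print(f"data shape: {digits.data.shape}")
print(f"flattened images equal data: {same_values}")
print(f"pixel value range: {digits.data.min()} to {digits.data.max()}")
```

This is why `pixel_i` in the DataFrame corresponds to row `i // 8` and column `i % 8` of the image.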


## Step 3: Visualize Sample Images

To better understand the dataset, we will visualize a few sample images along with their corresponding labels.

In [None]:
# Plotting a few samples of the dataset
fig, axes = plt.subplots(1, 5, figsize=(10, 4))
for i, ax in enumerate(axes):
    ax.imshow(digits.images[i], cmap='gray')
    ax.set_title(f"Label: {digits.target[i]}")
    ax.axis('off')
plt.suptitle("Sample Images from Digits Dataset")
plt.show()

## Step 4: Split the Data into Training and Test Sets

We will split the dataset into training and testing sets using a 70/30 ratio.


In [None]:
# Features (X) and target variable (y)
X = data.drop('target', axis=1)
y = data['target']

# Split the dataset into training and testing sets (70% for training, 30% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
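
The split above is purely random, so class proportions in the two sets can drift slightly. As a possible variation (an assumption, not something the original notebook does), passing `stratify=y` keeps each digit's share roughly equal in both sets. The sketch below is self-contained and compares the per-digit shares.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Same 70/30 split as above, but stratified on the labels (assumed variation)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Training samples: {len(X_train)}, test samples: {len(X_test)}")

# Fraction of each digit in the training and test sets
train_share = np.bincount(y_train, minlength=10) / len(y_train)
test_share = np.bincount(y_test, minlength=10) / len(y_test)
for digit in range(10):
    print(f"digit {digit}: train {train_share[digit]:.3f}, test {test_share[digit]:.3f}")
```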



## Step 5: Train the Random Forest Model

We will train a Random Forest classifier using the following important parameters:
- **n_estimators**: Number of trees in the forest.
- **max_depth**: Maximum depth of each tree. Capping it limits how closely a tree can fit the training data, which helps against overfitting.
- **random_state**: Ensures reproducibility of the results.


In [None]:

# Initialize and train the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_clf.fit(X_train, y_train)




## Step 6: Feature Importance

One benefit of a Random Forest is that it exposes feature importances. In `scikit-learn` these are impurity-based scores that sum to 1 across all features. In this dataset, each of the 64 pixels is a feature.

In [None]:
# Get feature importance
feature_importances = rf_clf.feature_importances_

# Create a DataFrame to visualize feature importance
importance_df = pd.DataFrame({
    'Feature': [f'pixel_{i}' for i in range(X.shape[1])],
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False).head(10)  # Show top 10 most important pixels

# Plotting feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Top 10 Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
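
A bar chart of `pixel_i` names does not show where those pixels sit in the image. One way to recover that, sketched here under the assumption that features are ordered row by row as in `digits.data`, is to reshape the importances into an 8x8 grid. The block retrains the same model so it can run on its own; `importance_grid` and `top_pixels` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_clf.fit(X_train, y_train)

# Arrange the 64 importances in the same 8x8 layout as the images
importance_grid = rf_clf.feature_importances_.reshape(8, 8)

# Indices of the five most important pixels, highest first
top_pixels = np.argsort(rf_clf.feature_importances_)[::-1][:5]
for idx in top_pixels:
    row, col = divmod(int(idx), 8)
    print(f"pixel_{idx}: row {row}, column {col}, "
          f"importance {importance_grid[row, col]:.4f}")
```

Passing `importance_grid` to `plt.imshow` would render the same information as a heatmap over the image area.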


## Step 7: Make Predictions and Evaluate the Model

We will make predictions on the test set and evaluate the model’s performance using the accuracy score, a per-class classification report, and a confusion matrix.


In [None]:
# Make predictions on the test set
y_pred = rf_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Plot the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=digits.target_names, yticklabels=digits.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

## Interpretation of the Results

- **Feature Importance**: The feature importance plot shows which pixels have the highest contribution to the model's classification decision. This allows us to understand which areas of the digit images are most critical.
- **Confusion Matrix**: The confusion matrix gives an overview of the model's predictions versus the actual classes, helping identify any misclassification patterns.
- **Accuracy and Classification Report**: The accuracy score summarizes overall performance in a single number, while the classification report breaks performance down into precision, recall, and F1-score for each digit.
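
To make the misclassification patterns concrete, the following sketch pulls the most frequent confusions out of the confusion matrix and shows that accuracy is the diagonal divided by the total. It retrains the same model so that it runs independently; `errors` and `accuracy_from_matrix` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)

# Rows are actual digits, columns are predicted digits
conf_matrix = confusion_matrix(y_test, y_pred)

# Correct predictions lie on the diagonal
accuracy_from_matrix = np.trace(conf_matrix) / conf_matrix.sum()
print(f"Accuracy from confusion matrix: {accuracy_from_matrix:.4f}")

# Keep only the mistakes by zeroing the diagonal
errors = conf_matrix.copy()
np.fill_diagonal(errors, 0)

# Report the three largest off-diagonal cells
for flat_idx in np.argsort(errors, axis=None)[::-1][:3]:
    actual, predicted = np.unravel_index(flat_idx, errors.shape)
    print(f"actual {actual} predicted as {predicted}: "
          f"{errors[actual, predicted]} times")
```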

## Key Points

- **n_estimators**: Defines the number of trees in the Random Forest. A higher number typically makes predictions more stable, with diminishing returns, and increases computational cost.
- **max_depth**: Limiting the depth of the trees helps in controlling overfitting and ensures the model generalizes better to unseen data.
- **Random Forest**: An ensemble of decision trees, each trained on a bootstrap sample with a random subset of features considered at each split. Combining their predictions usually reduces overfitting compared to a single decision tree.
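
The effect of `max_depth` described above can be observed directly. The sketch below trains the same forest at a few depths and compares training and test accuracy. The depth values `(2, 5, 10, None)` are arbitrary choices for illustration, and `None` lets trees grow without a depth limit.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Map each depth to (training accuracy, test accuracy)
results = {}
for depth in (2, 5, 10, None):
    model = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=42
    )
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train {train_acc:.3f}, test {test_acc:.3f}")
```

A large gap between training and test accuracy at a given depth is a sign of overfitting, while low accuracy on both suggests the trees are too shallow.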

### Final Thoughts

Random Forest is a powerful and flexible algorithm for classification tasks. It works well with the `digits` dataset, where each image contributes 64 pixel features. By analyzing feature importance, we can see which pixels contribute most to the model's decisions.

Feel free to experiment with other hyperparameters like **min_samples_split** or **max_features** to see how they impact the model's performance.
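
One way to run such an experiment is a small grid search with cross-validation. This is a minimal sketch, not a tuned setup: the grid values, `cv=3`, and `n_estimators=50` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Small illustrative grid over the two hyperparameters mentioned above
param_grid = {
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
# Cross-validation uses only the training data; the test set stays untouched
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy of the refit model: {search.score(X_test, y_test):.3f}")
```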
