# Project Overview
# Creating Random Forest Model (Using GridSearch CV)
Step 1 to Step 4

# Creating Professional Forest (By Using GridSearch CV)
# Scenario 1:
Based on (Using GridSearch CV + Trees Score)

*primary forest=1000, professional forest=200, Creating Trees with the Best Parameters
# Creating Professional Forest (Without Using GridSearch CV)
# Scenario 2:
Based on Trees Score

*primary forest=2000, professional forest=100, Using Naturally and Randomly Grown Trees
# Creating Professional Forest (Without Using GridSearch CV)
# Scenario 3:
Based on Trees Score + Using Very Professional Trees

*primary forest=4000, professional forest=50, Using Naturally and Randomly Grown Trees
# Conclusion

# Step 1: Load, Explore and Preprocess the Data
Let's get started by loading the dataset and performing the initial preprocessing.

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Display the first few rows of the DataFrame
print("First few rows of the dataset:")
print(df.head())

# Display basic statistics
print("\nBasic statistics of the dataset:")
print(df.describe())

# Check for null values
print("\nChecking for null values:")
print(df.isnull().sum())


First few rows of the dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimete

# Data Preprocessing
Since there are no missing values in this dataset, we can proceed directly to splitting and standardizing the data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the feature data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data preprocessing completed.")
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)


Data preprocessing completed.
Training data shape: (455, 30)
Testing data shape: (114, 30)


# Step 2: Creating a Random Forest model
# Hyperparameter Tuning with GridSearchCV
Let's start by finding the best hyperparameters for the Random Forest model using GridSearchCV.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and score
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')
print(f'Best Cross-Validation Score: {grid_search.best_score_:.2f}')


Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Cross-Validation Score: 0.96
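To put the cost of the search above in perspective, the size of the grid can be worked out by hand. The following is a small arithmetic sketch, not part of the original notebook; it repeats the same `param_grid` and `cv=5` and counts candidates, cross-validation fits, and the total number of trees grown during the search.

```python
from math import prod

# Same grid as in the GridSearchCV cell above.
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5  # matches cv=5 above

# Number of hyperparameter combinations: 5 * 4 * 3 * 3
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold (the final refit on the full
# training set adds one more fit on top of this).
n_cv_fits = n_candidates * cv_folds

# Every n_estimators value appears in the same number of combinations.
combos_per_size = n_candidates // len(param_grid['n_estimators'])
n_trees_grown = sum(param_grid['n_estimators']) * combos_per_size * cv_folds

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_cv_fits}")
print(f"Trees grown during cross-validation: {n_trees_grown}")
```

This count is useful later when comparing the search with the scenarios that skip GridSearchCV and grow a single primary forest instead.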


# Step 3: **Train** the Random Forest Model (Using GridSearch CV Parameters)

In [None]:
# Create the Random Forest model with best parameters
rf_best = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf'],
    random_state=42,
    n_jobs=-1
)

# Fit the model to the training data
rf_best.fit(X_train, y_train)


# Step 4: Evaluate the Random Forest Model (Using GridSearch CV)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the test set
rf_predictions = rf_best.predict(X_test)

# Calculate accuracy
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')

# Detailed classification report
rf_report = classification_report(y_test, rf_predictions)
print(f'\nClassification Report:\n{rf_report}')


Random Forest Accuracy: 0.96

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



# Professional Forest (PF) Plan: By Using GridSearch CV Parameters
# Scenario 1:
1. Create Primary Forest with 1000 Trees.

2. Score Trees and Select Top 20%.

3. Evaluate the PF Model.

4. Compare PF Model Performance with the Original RF Model

# Scenario 1_1: Create Primary Forest with 1000 Trees (Using GridSearch CV Parameters)

In [None]:
# Function to create and score primary forests
def create_and_score_forest(n_trees, X_train, y_train):
    ppf = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=best_params['max_depth'],
        min_samples_split=best_params['min_samples_split'],
        min_samples_leaf=best_params['min_samples_leaf'],
        random_state=42,
        n_jobs=-1
    )
    ppf.fit(X_train, y_train)

    scores = [tree.score(X_train, y_train) for tree in ppf.estimators_]
    return ppf, scores

# Create primary forest with 1000 trees
n_trees_primary_forest = 1000
primary_forest, primary_forest_scores = create_and_score_forest(n_trees_primary_forest, X_train, y_train)
primary_forest_scores_df = pd.DataFrame({'score': primary_forest_scores})


# Scenario 1_2: Select Top 20% Trees and Create PF Model

In [None]:
import numpy as np

# Function to select top-performing trees and create PF
def create_pf_from_ppf(ppf, tree_scores, top_percentage):
    top_trees = int(len(tree_scores) * top_percentage)
    top_tree_indices = tree_scores.nlargest(top_trees, 'score').index
    pf_trees = [ppf.estimators_[i] for i in top_tree_indices]
    return pf_trees

# Select top 20% of trees
top_percentage = 0.20
pf_trees = create_pf_from_ppf(primary_forest, primary_forest_scores_df, top_percentage)
print('Done')


Done


# Scenario 1_3: Evaluate the PF Model

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate the PF
pf_predictions = np.mean([tree.predict(X_test) for tree in pf_trees], axis=0)
pf_predictions = np.round(pf_predictions).astype(int)

# Calculate accuracy
pf_accuracy = accuracy_score(y_test, pf_predictions)
print(f'Professional Forest Accuracy: {pf_accuracy:.2f}')

# Detailed classification report
pf_report = classification_report(y_test, pf_predictions)
print(f'\nClassification Report:\n{pf_report}')
print('Done')

Professional Forest Accuracy: 0.96

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

Done
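
The evaluation cell above averages the hard 0/1 votes of the selected trees and rounds the mean. With an even number of trees a 50/50 tie is possible, and `np.round` rounds 0.5 to the nearest even value, which is 0. The sketch below uses a hypothetical vote matrix (not data from this notebook) to show that behavior next to an explicit threshold, which is one possible way to make the tie-breaking rule visible.

```python
import numpy as np

# Hypothetical hard votes: 4 trees (rows) voting on 3 samples (columns).
votes = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Mean vote per sample: a tie, a clear majority for 1, a clear majority for 0.
mean_vote = votes.mean(axis=0)

# Same rule as the evaluation cell: round, then cast to int.
# np.round uses round-half-to-even, so a 0.5 tie becomes class 0.
rounded = np.round(mean_vote).astype(int)

# Explicit threshold: ties are assigned to class 1 instead (a design choice).
thresholded = (mean_vote >= 0.5).astype(int)

print("mean vote:  ", mean_vote)
print("np.round:   ", rounded)
print("threshold:  ", thresholded)
```

The two rules differ only on exact ties, so they give the same predictions whenever no sample splits the trees evenly.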


# Professional Forest (PF) Plan: Without Using GridSearch CV Parameters
# Scenario 2:
Create Primary Forest with 2000 Trees.

Score Trees and List Them.

Select Top 5% Trees (100 Trees).

Evaluate the PF Model.

# Scenario 2_1: Create Primary Forest with 2000 Trees

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Function to create and score primary forests
def create_and_score_forest(n_trees, X_train, y_train):
    ppf = RandomForestClassifier(
        n_estimators=n_trees,
        random_state=42,
        n_jobs=-1
    )
    ppf.fit(X_train, y_train)

    scores = [tree.score(X_train, y_train) for tree in ppf.estimators_]
    return ppf, scores

# Create primary forest with 2000 trees
n_trees_primary_forest = 2000
primary_forest, primary_forest_scores = create_and_score_forest(n_trees_primary_forest, X_train, y_train)
primary_forest_scores_df = pd.DataFrame({'score': primary_forest_scores})
print("Primary forest created and scored.")


Primary forest created and scored.


# Scenario 2_2: Score Trees and List Them

In [None]:
# Display the scores
print(primary_forest_scores_df.describe())


             score
count  2000.000000
mean      0.969435
std       0.008079
min       0.938462
25%       0.964835
50%       0.969231
75%       0.975824
max       0.991209
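
Each score above is a training accuracy over 455 samples, so it can only take values of the form k/455 and many trees are likely to share the same score. When the cutoff falls inside such a group, `nlargest` with its default `keep='first'` keeps the tied trees that appear first. The sketch below uses a small hypothetical score table to show the difference between the default and `keep='all'`.

```python
import pandas as pd

# Hypothetical tree scores with a three-way tie at 0.97.
scores_df = pd.DataFrame({'score': [0.97, 0.99, 0.97, 0.98, 0.97, 0.96]})

# Default behavior: exactly 3 rows; the tie at 0.97 is broken by position.
top_first = scores_df.nlargest(3, 'score')

# keep='all': every row tied with the last selected score is kept,
# so more than 3 rows can come back.
top_all = scores_df.nlargest(3, 'score', keep='all')

print("keep='first' indices:", list(top_first.index))
print("keep='all' indices:  ", list(top_all.index))
```

Which variant is preferable depends on whether a fixed forest size or a fixed score cutoff matters more.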


# Scenario 2_3: Select Top 5% Trees (100 Trees)

In [None]:
# Function to select top-performing trees and create PF
def create_pf_from_ppf(ppf, tree_scores, top_percentage):
    top_trees = int(len(tree_scores) * top_percentage)
    top_tree_indices = tree_scores.nlargest(top_trees, 'score').index
    pf_trees = [ppf.estimators_[i] for i in top_tree_indices]
    return pf_trees

# Select top 5% of trees
top_percentage = 0.05
pf_trees = create_pf_from_ppf(primary_forest, primary_forest_scores_df, top_percentage)
print(f"Selected top {int(top_percentage * 100)}% of trees for PF.")


Selected top 5% of trees for PF.


# Scenario 2_4: Evaluate the PF Model


In [None]:
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Evaluate the PF
pf_predictions = np.mean([tree.predict(X_test) for tree in pf_trees], axis=0)
pf_predictions = np.round(pf_predictions).astype(int)

# Calculate accuracy
pf_accuracy = accuracy_score(y_test, pf_predictions)
print(f'Professional Forest Accuracy: {pf_accuracy:.2f}')

# Detailed classification report
pf_report = classification_report(y_test, pf_predictions)
print(f'\nClassification Report:\n{pf_report}')


Professional Forest Accuracy: 0.96

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



# Analysis:
Performance Metrics: All models show identical performance metrics across accuracy, precision, recall, and F1-score. This consistency highlights the robustness of the PF approach, even when selecting different numbers of top-performing trees.
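
Identical summary metrics do not by themselves show that two models make the same predictions. A minimal sketch of a per-sample comparison follows; the arrays are hypothetical stand-ins, and in the notebook the same check could be applied to `rf_predictions` and `pf_predictions` before the latter is overwritten.

```python
import numpy as np

# Hypothetical labels and predictions (not taken from the notebook).
y_true = np.array([0, 0, 1, 1, 1, 1, 0, 1])
rf_pred = np.array([0, 1, 1, 1, 1, 1, 0, 1])  # one error, at index 1
pf_pred = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # one error, at index 4

# Accuracy of each model against the true labels.
rf_acc = float(np.mean(rf_pred == y_true))
pf_acc = float(np.mean(pf_pred == y_true))

# Fraction of samples on which the two models give the same prediction.
agreement = float(np.mean(rf_pred == pf_pred))

# Indices where the models disagree.
disagree_idx = np.flatnonzero(rf_pred != pf_pred)

print(f"RF accuracy: {rf_acc:.3f}, PF accuracy: {pf_acc:.3f}")
print(f"Agreement: {agreement:.2f}, disagreements at {disagree_idx.tolist()}")
```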

Computational Efficiency: The PF model with 100 trees selected from 2000 matches the test performance of the well-tuned RF model (200 trees) with half as many trees. This reduction in model size can lead to faster prediction times and lower resource usage.

Versatility of PF Approach: Both scenarios of PF (200 top trees from 1000 and 100 top trees from 2000) show that the PF method can maintain high performance with different configurations. This flexibility allows for efficient model optimization depending on resource constraints and requirements.
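
The computational efficiency point above can be checked by timing predictions directly. The sketch below is one way to do it under stated assumptions: it uses a smaller primary forest (200 trees, top 10 kept) so it runs quickly, and it wraps the selected trees in a copy of the forest by overwriting `estimators_`, which relies on scikit-learn internals rather than a documented API. The helper `time_predict` is a hypothetical name.

```python
import copy
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller than the notebook's primary forests so the sketch runs quickly.
primary = RandomForestClassifier(n_estimators=200, random_state=42)
primary.fit(X_train, y_train)

# Same scoring rule as the notebook: accuracy of each tree on the training set.
scores = [tree.score(X_train, y_train) for tree in primary.estimators_]
top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]

# Assumption: replacing estimators_ on a copy depends on scikit-learn internals.
pruned = copy.deepcopy(primary)
pruned.estimators_ = [primary.estimators_[i] for i in top_idx]
pruned.n_estimators = len(pruned.estimators_)

def time_predict(model, X, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` predict calls."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(X)
        best = min(best, time.perf_counter() - start)
    return best

print(f"{len(primary.estimators_)} trees: {time_predict(primary, X_test):.4f} s")
print(f"{len(pruned.estimators_)} trees: {time_predict(pruned, X_test):.4f} s")
```

Timings vary by machine and by `n_jobs`, so the numbers are only meaningful relative to each other on the same setup.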

# Conclusion:
The PF approach, whether selecting the top 200 of 1000 trees "by using GridSearch CV" or the top 100 of 2000 trees "without using GridSearch CV", maintains the test performance of the original Random Forest model (200 trees) with the same number of trees or fewer. The 100-tree PF preserves accuracy and the other key metrics while halving the number of trees, which improves computational efficiency at prediction time and makes PF a valuable strategy in various scenarios.

# Professional Forest (PF) Plan: (Creating Very Fast Model)
# Scenario 3:
Create Primary Forest with 4000 Trees.

Score Trees and List Them.

Select Top 1.25% Trees (50 Trees).

Evaluate the PF Model.
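
The tree counts in the three plans follow from multiplying the primary forest size by the fraction kept. The following arithmetic check is a sketch that uses `round` rather than `int`; that is a defensive choice of this example, since `int` truncates and a floating-point product can land just below a whole number.

```python
# (primary forest size, fraction kept) as stated in the three scenario plans
scenarios = {
    "Scenario 1": (1000, 0.20),
    "Scenario 2": (2000, 0.05),
    "Scenario 3": (4000, 0.0125),
}

kept = {}
for name, (n_primary, fraction) in scenarios.items():
    # round() avoids losing a tree if the product is slightly below an integer
    kept[name] = round(n_primary * fraction)
    print(f"{name}: keep {kept[name]} of {n_primary} trees ({fraction * 100:g}%)")
```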

# Scenario 3_1: Create Primary Forest with 4000 Trees

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Function to create and score primary forests
def create_and_score_forest(n_trees, X_train, y_train):
    ppf = RandomForestClassifier(
        n_estimators=n_trees,
        random_state=42,
        n_jobs=-1
    )
    ppf.fit(X_train, y_train)

    scores = [tree.score(X_train, y_train) for tree in ppf.estimators_]
    return ppf, scores

# Create primary forest with 4000 trees
n_trees_primary_forest = 4000
primary_forest, primary_forest_scores = create_and_score_forest(n_trees_primary_forest, X_train, y_train)
primary_forest_scores_df = pd.DataFrame({'score': primary_forest_scores})
print("Primary forest created and scored.")


Primary forest created and scored.


# Scenario 3_2: Score Trees and List Them

In [None]:
# Display the scores
print(primary_forest_scores_df.describe())


             score
count  4000.000000
mean      0.969536
std       0.008176
min       0.934066
25%       0.964835
50%       0.969231
75%       0.975824
max       0.993407


# Scenario 3_3: Select Top 1.25% Trees (50 Trees)

In [None]:
import numpy as np

# Function to select top-performing trees and create PF
def create_pf_from_ppf(ppf, tree_scores, top_percentage):
    top_trees = int(len(tree_scores) * top_percentage)
    top_tree_indices = tree_scores.nlargest(top_trees, 'score').index
    pf_trees = [ppf.estimators_[i] for i in top_tree_indices]
    return pf_trees

# Select top 1.25% of trees
top_percentage = 0.0125
pf_trees = create_pf_from_ppf(primary_forest, primary_forest_scores_df, top_percentage)
print(f"Selected top {top_percentage * 100:g}% of trees for PF.")


Selected top 1.25% of trees for PF.


# Scenario 3_4: Evaluate the PF Model

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate the PF
pf_predictions = np.mean([tree.predict(X_test) for tree in pf_trees], axis=0)
pf_predictions = np.round(pf_predictions).astype(int)

# Calculate accuracy
pf_accuracy = accuracy_score(y_test, pf_predictions)
print(f'Professional Forest Accuracy: {pf_accuracy:.2f}')

# Detailed classification report
pf_report = classification_report(y_test, pf_predictions)
print(f'\nClassification Report:\n{pf_report}')


Professional Forest Accuracy: 0.96

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114



# Analysis:
Consistency in Accuracy: All models, including the PF with 50 trees, maintained an accuracy of 0.96, demonstrating the robustness of the PF approach even with fewer trees.

Slight Variations in Precision and Recall:

The PF model with 50 trees shows a very slight reduction in precision for Class 0 (0.95) compared to the other models (0.98).

The recall for Class 1 is slightly lower in the PF model with 50 trees (0.97) compared to the others (0.99).

F1-Score:

The F1-score for Class 0 in the PF model with 50 trees is marginally lower than in the other models (0.94 versus 0.95), while the F1-score for Class 1 is unchanged at 0.97.

Efficiency Gains:

Despite the slight variations, the PF model with 50 trees achieves performance comparable to the RF and other PF models while significantly reducing the number of trees. This translates into faster predictions and potentially lower computational costs.

# Conclusion:
The PF approach with 50 trees from a primary forest of 4000 maintains high performance, similar to the well-tuned RF model and the PF with 100 trees. The slight differences in precision and recall are minor and acceptable, especially given the efficiency gains from using fewer trees when a "tiny" and "very fast" model is needed.

This final experiment confirms the robustness and efficiency of the PF methodology in maintaining model performance with a more streamlined set of trees.
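
How "tiny" a reduced forest is can be estimated by counting tree nodes or by measuring the pickled size of the tree list. The sketch below is illustrative only: it trains a small stand-in forest, and the first 10 trees act as a placeholder for a score-based selection. The helpers `total_nodes` and `pickled_bytes` are hypothetical names.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small stand-in forest; the notebook's primary forests are much larger.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Placeholder selection: the first 10 trees stand in for a score-based choice.
selected_trees = forest.estimators_[:10]

def total_nodes(trees):
    """Sum the node counts of a list of fitted decision trees."""
    return sum(tree.tree_.node_count for tree in trees)

def pickled_bytes(trees):
    """Size in bytes of the pickled list of trees."""
    return len(pickle.dumps(trees))

print("nodes:", total_nodes(forest.estimators_), "->", total_nodes(selected_trees))
print("bytes:", pickled_bytes(forest.estimators_), "->", pickled_bytes(selected_trees))
```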

# Final Conclusion:
# By using the PF approach, we can create a "tiny", "impactful", "robust", and "very fast" model.
# We don't need to use GridSearch CV.

## License
This project is licensed under the MIT License - see the LICENSE file for details.

## © 2024 Ali M Shafiei. All rights reserved.
