In [16]:
X = data
y = df['category']

We use RandomizedSearchCV for hyperparameter tuning. Rather than searching the parameter grid exhaustively, RandomizedSearchCV samples a fixed number of parameter combinations from it and evaluates each one with cross-validation; best_params_ holds the combination with the best mean cross-validation score.
We tune the following hyperparameters of the Random Forest model, looking for the best-performing configuration among the candidate values listed.
We’ve chosen [10, 20] as candidate values for n_estimators, the number of trees in the forest, so the search can try forests of 10 or 20 trees. We’ve chosen [None, 10] for max_depth, so it can try either unlimited depth or a maximum depth of 10. min_samples_split is the minimum number of samples required to split an internal node; we’ve chosen [2, 5] as candidate values. min_samples_leaf is the minimum number of samples required at a leaf node; we’ve chosen [1, 2]. Together these give 2 × 2 × 2 × 2 = 16 possible combinations, of which the search evaluates only the n_iter combinations it samples.
cv is the number of cross-validation folds (3 here); cross-validation estimates how the model performs on unseen data. scoring is the metric used to evaluate each candidate; in this case, we’re using accuracy. The n_iter parameter determines how many parameter settings are sampled (5 here). Randomized search saves time compared to grid search, which tries every single combination of parameters. random_state makes the sampling of parameter settings reproducible; note that the RandomForestClassifier itself is created without a random_state, so its results can still vary between runs.
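To make the difference between the full grid and the sampled settings concrete, here is a minimal plain-Python sketch. It only mimics the idea of drawing 5 of the 16 combinations without replacement; it is not scikit-learn's implementation, and the combinations it picks are not expected to match the ones RandomizedSearchCV picks.

```python
import itertools
import random

# Same candidate values as in the notebook's param_distributions.
param_distributions = {
    "n_estimators": [10, 20],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Enumerate every combination, as a grid search would.
names = list(param_distributions)
all_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*param_distributions.values())
]

# Draw n_iter combinations without replacement (illustrative sampling only).
n_iter = 5
cv_folds = 3
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=n_iter)

print("Combinations in the full grid:", len(all_combinations))
print("Combinations tried with n_iter=5:", len(sampled))
print("Cross-validation fits (before the final refit):", n_iter * cv_folds)
for candidate in sampled:
    print(candidate)
```

With these settings the search runs 15 cross-validation fits, whereas a grid search over the same values would run 48.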

In [17]:


# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid
param_distributions = {
    'n_estimators': [10, 20],  # Number of trees in the forest
    'max_depth': [None, 10],  # Maximum depth of the tree
    'min_samples_split': [2, 5],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2],  # Minimum number of samples required to be at a leaf node
}

# Create a RandomForestClassifier
clf = RandomForestClassifier()

# Create the randomized search object
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_distributions,
    cv=3,
    scoring='accuracy',
    n_iter=5,
    random_state=42,
)

# Fit the randomized search object to the data
random_search.fit(X_train, y_train)

# Get the best parameters
best_params = random_search.best_params_

# Train a new classifier using the best parameters
# (random_search.best_estimator_ already holds a model refit with them)
clf_best = RandomForestClassifier(**best_params)
clf_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf_best.predict(X_test)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='weighted'))
print('Recall:', recall_score(y_test, y_pred, average='weighted'))
print('F1 Score:', f1_score(y_test, y_pred, average='weighted'))


Accuracy: 0.5448861738175917
Precision: 0.579680871093444
Recall: 0.5448861738175917
F1 Score: 0.49588607669881146
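
The accuracy and weighted recall above are identical, which is expected: per-class recall averaged with class-support weights reduces to overall accuracy. A small sketch on hypothetical toy labels (not this notebook's data) shows the arithmetic:

```python
from collections import Counter

# Hypothetical toy labels, used only to illustrate the calculation.
y_true = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "a", "c", "a", "a"]

n = len(y_true)
support = Counter(y_true)  # number of true samples per class

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Weighted recall: per-class recall, weighted by each class's share of samples.
weighted_recall = 0.0
for label, count in support.items():
    true_positives = sum(t == p == label for t, p in zip(y_true, y_pred))
    recall = true_positives / count
    weighted_recall += (count / n) * recall

print("Accuracy:", accuracy)
print("Weighted recall:", weighted_recall)
```

Each class's support cancels against the denominator of its recall, leaving the total number of correct predictions divided by the number of samples.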
