In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
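
As a rough illustration of what a bootstrap sample looks like, the sketch below draws row indices with replacement using NumPy. The sample size and the variable names are assumptions made for this example; scikit-learn performs the equivalent sampling internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # hypothetical training-set size

# One bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out-of-bag" for the tree built on this sample.
in_bag = np.unique(bootstrap_idx)
oob_idx = np.setdiff1d(np.arange(n_samples), in_bag)

print("unique rows in bag:", in_bag.size)
print("out-of-bag rows:", oob_idx.size)
```

For large training sets, roughly a fraction 1/e (about 37%) of the rows ends up out-of-bag for any given tree.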

Furthermore, when splitting each node during the construction of a tree, the best split is found through an exhaustive search of the feature values of either all input features or a random subset of size max_features. In scikit-learn, the default is max_features="sqrt" for RandomForestClassifier and max_features=1.0 (all features) for RandomForestRegressor.
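
To see how many features are actually sampled per split, one can inspect a fitted tree. This is a minimal sketch on synthetic data (the 11-feature dataset and the `_demo` names are assumptions for illustration); with max_features="sqrt" the subset size is the integer part of sqrt(n_features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 11 features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

forest_demo = RandomForestClassifier(
    n_estimators=10, max_features="sqrt", random_state=0
)
forest_demo.fit(X_demo, y_demo)

# Each fitted tree stores the inferred number of features considered per split.
features_per_split = forest_demo.estimators_[0].max_features_
print(features_per_split, int(np.sqrt(X_demo.shape[1])))
```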

The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice, the variance reduction is often significant, hence yielding an overall better model.
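
The effect of averaging partially decorrelated errors can be sketched with a small simulation. This is an idealized model, not a description of how real trees behave: it assumes every tree has the same error variance `sigma**2` and the same pairwise error correlation `rho` (both hypothetical values chosen for the example). Under that assumption the variance of the average is `rho * sigma**2 + (1 - rho) * sigma**2 / n_trees`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_trials, sigma, rho = 50, 20000, 1.0, 0.3  # hypothetical values

# Build errors with variance sigma^2 and pairwise correlation rho:
# one component shared by all trees plus one independent component per tree.
shared = rng.normal(size=(n_trials, 1)) * np.sqrt(rho) * sigma
own = rng.normal(size=(n_trials, n_trees)) * np.sqrt(1 - rho) * sigma
errors = shared + own

single_var = errors[:, 0].var()           # variance of one tree's error
averaged_var = errors.mean(axis=1).var()  # variance after averaging all trees
expected_var = rho * sigma**2 + (1 - rho) * sigma**2 / n_trees

print(f"single: {single_var:.3f}, averaged: {averaged_var:.3f}, "
      f"formula: {expected_var:.3f}")
```

The lower the correlation between trees, the closer the averaged variance gets to `sigma**2 / n_trees`, which is why both sources of randomness help.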

The scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
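
The averaging can be reproduced by hand from the fitted trees in `estimators_`. The following is a sketch on synthetic data (the dataset and the `_demo` names are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
forest_demo = RandomForestClassifier(n_estimators=25, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Stack the per-tree class probabilities and average them over the trees.
per_tree_proba = np.stack(
    [tree.predict_proba(X_demo) for tree in forest_demo.estimators_]
)
manual_proba = per_tree_proba.mean(axis=0)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest_demo.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, forest_demo.predict_proba(X_demo)))
```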

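scikit-learn can build the trees of a forest (and compute their predictions) in parallel. This is controlled by the n_jobs parameter; n_jobs=-1 uses all available cores.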
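
A minimal sketch of parallel fitting through the `n_jobs` parameter is shown below, on synthetic data (the dataset and variable names are assumptions for illustration). With a fixed `random_state`, the fitted forest should not depend on the number of jobs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Same seed, different degree of parallelism.
serial = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1)
parallel = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
serial.fit(X_demo, y_demo)
parallel.fit(X_demo, y_demo)

same = np.allclose(serial.predict_proba(X_demo), parallel.predict_proba(X_demo))
print("identical predictions:", same)
```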

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. In scikit-learn, the fraction of samples a feature contributes to is combined with the decrease in impurity from splitting them to create a normalized estimate of the predictive power of that feature.

By averaging the estimates of predictive ability over several randomized trees one can reduce the variance of such an estimate and use it for feature selection. This is known as the mean decrease in impurity, or MDI. Refer to [L2014] for more information on MDI and feature importance evaluation with Random Forests.
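
The averaging over trees can be sketched as follows, on synthetic data (dataset and names are assumptions for illustration). The sketch relies on every tree having at least one split, which holds for this data; the per-tree standard deviation gives a rough idea of how stable each estimate is.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
forest_demo = RandomForestClassifier(n_estimators=30, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances of each individual tree, one row per tree.
per_tree = np.array([tree.feature_importances_ for tree in forest_demo.estimators_])

# Average over trees, then renormalize so the values sum to 1.
mdi_manual = per_tree.mean(axis=0)
mdi_manual = mdi_manual / mdi_manual.sum()
mdi_spread = per_tree.std(axis=0)  # variability of the estimate across trees

print(np.allclose(mdi_manual, forest_demo.feature_importances_))
```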

The impurity-based feature importances computed on tree-based models suffer from two flaws that can lead to misleading conclusions. First, they are computed on statistics derived from the training dataset and therefore do not necessarily tell us which features are most important for making good predictions on a held-out dataset. Second, they favor high-cardinality features, that is, features with many unique values. Permutation feature importance is an alternative to impurity-based feature importance that does not suffer from these flaws. These two methods of obtaining feature importance are explored in: Permutation Importance vs Random Forest Feature Importance (MDI).
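
Permutation importance on held-out data can be computed with `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data; the dataset, the split, and `n_repeats=5` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
forest_demo = RandomForestClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_tr_demo, y_tr_demo)

# Shuffle each feature of the held-out set in turn and measure the score drop.
perm = permutation_importance(
    forest_demo, X_te_demo, y_te_demo, n_repeats=5, random_state=0
)
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} "
          f"+/- {perm.importances_std[i]:.3f}")
```

Unlike `feature_importances_`, these values are not normalized and can be negative when shuffling a feature happens to improve the score.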

In practice, those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are non-negative and sum to 1.0. The higher the value, the more the matching feature contributes to the prediction function.
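
A convenient way to read this array is to label it with the column names and sort it. The sketch below uses a synthetic DataFrame whose column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
# Hypothetical column names, used only to label the importances.
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest_demo = RandomForestClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Attach column names to the importances and sort from most to least important.
importances = pd.Series(
    forest_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```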

# Dataset

Dataset: [link](https://www.kaggle.com/datasets/rabieelkharoua/students-performance-dataset)

It contains the columns GPA and GradeClass. Both are grades, but GPA is a real-valued number that can be used for regression, while GradeClass can be used for classification.

The dataset has no missing values.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("student_performance.csv")

In [3]:
len(df)

2392

In [84]:
df.head()

Unnamed: 0,Age,Gender,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Workout_Type,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI
0,56,Male,88.3,1.71,180,157,60,1.69,1313.0,Yoga,12.6,3.5,4,3,30.2
1,46,Female,74.9,1.53,179,151,66,1.3,883.0,HIIT,33.9,2.1,4,2,32.0
2,32,Female,68.1,1.66,167,122,54,1.11,677.0,Cardio,33.4,2.3,4,2,24.71
3,25,Male,53.2,1.7,190,164,56,0.59,532.0,Strength,28.8,2.1,3,1,18.41
4,38,Male,46.1,1.79,188,158,68,0.64,556.0,Strength,29.2,2.8,3,1,14.39


# Classification

First, we will do classification.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [4]:
df_class = df.drop(columns=["Ethnicity", "StudentID", "GPA"])

In [85]:
X = df_class.drop(["GradeClass"], axis=1)
y = df_class["GradeClass"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [86]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [87]:
from sklearn.metrics import classification_report

In [88]:
y_pred = rf.predict(X_test)
print("Score", rf.score(X_test, y_test))

Score 0.7033426183844012


In [89]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.38      0.20      0.26        30
         1.0       0.53      0.44      0.48        88
         2.0       0.55      0.55      0.55       121
         3.0       0.52      0.47      0.50       125
         4.0       0.85      0.95      0.90       354

    accuracy                           0.70       718
   macro avg       0.56      0.52      0.54       718
weighted avg       0.68      0.70      0.69       718



In [90]:
features = pd.DataFrame(rf.feature_importances_, index=X_test.columns)
print(features.head(15))

                          0
Age                0.061112
Gender             0.029935
ParentalEducation  0.063453
StudyTimeWeekly    0.184300
Absences           0.461114
Tutoring           0.024797
ParentalSupport    0.077254
Extracurricular    0.026453
Sports             0.028739
Music              0.023371
Volunteering       0.019472


## Out of Bag

In [91]:
rf = RandomForestClassifier(
    oob_score=True,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Score", rf.score(X_test, y_test))
print("OOB", rf.oob_score_)

Score 0.692200557103064
OOB 0.7001194743130227


## Custom Parameters

Now let's try classification with custom parameters.

The main parameters to adjust when using these methods are n_estimators and max_features. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the random subsets of features to consider when splitting a node. The lower the value, the greater the reduction of variance, but also the greater the increase in bias.

Empirically good default values are max_features=1.0 or equivalently max_features=None (always considering all features instead of a random subset) for regression problems, and max_features="sqrt" (using a random subset of size sqrt(n_features)) for classification tasks (where n_features is the number of features in the data). The default value of max_features=1.0 is equivalent to bagged trees, and more randomness can be achieved by setting smaller values (e.g. 0.3 is a typical default in the literature).

Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees). Bear in mind though that these values are usually not optimal, and might result in models that consume a lot of RAM. The best parameter values should always be cross-validated.

In addition, note that in random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False). When using bootstrap sampling, the generalization error can be estimated on the left-out or out-of-bag samples. This can be enabled by setting oob_score=True.
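
As a sketch of how max_features could be cross-validated and how the out-of-bag estimate is obtained, consider the following example on synthetic data. The dataset, the candidate values, and the `_demo` names are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Compare a few max_features settings with 3-fold cross-validated accuracy.
cv_scores = {}
for max_features in ("sqrt", 0.5, 1.0):
    model = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    )
    cv_scores[max_features] = cross_val_score(model, X_demo, y_demo, cv=3).mean()
print(cv_scores)

# The out-of-bag estimate needs no separate validation split.
oob_model = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
oob_model.fit(X_demo, y_demo)
print("OOB accuracy:", oob_model.oob_score_)
```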

The size of the model with the default parameters is O(M N log N), where M is the number of trees and N is the number of samples. In order to reduce the size of the model, you can change these parameters: min_samples_split, max_leaf_nodes, max_depth and min_samples_leaf.
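
One way to see the effect of these parameters on model size is to count the nodes of the fitted trees. The sketch below uses synthetic, slightly noisy data; the dataset and the helper `total_nodes` are hypothetical and only serve the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

def total_nodes(forest):
    """Hypothetical helper: total number of tree nodes in a fitted forest."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)

# Fully developed trees (the defaults) versus trees with constrained growth.
full = RandomForestClassifier(n_estimators=20, random_state=0)
full.fit(X_demo, y_demo)
constrained = RandomForestClassifier(
    n_estimators=20, min_samples_split=20, min_samples_leaf=6, random_state=0
)
constrained.fit(X_demo, y_demo)

print("default:", total_nodes(full), "constrained:", total_nodes(constrained))
```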

In [92]:
rf = RandomForestClassifier(
    n_estimators=500,
    min_samples_split=20,
    min_samples_leaf=6,
)
rf.fit(X_train, y_train)

In [93]:
y_pred = rf.predict(X_test)
print("Score", rf.score(X_test, y_test))

Score 0.6866295264623955


In [94]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.50      0.03      0.06        30
         1.0       0.50      0.36      0.42        88
         2.0       0.48      0.61      0.54       121
         3.0       0.49      0.39      0.44       125
         4.0       0.84      0.95      0.89       354

    accuracy                           0.69       718
   macro avg       0.56      0.47      0.47       718
weighted avg       0.67      0.69      0.66       718



In [95]:
features = pd.DataFrame(rf.feature_importances_, index=X_test.columns)
print(features.head(15))

                          0
Age                0.025220
Gender             0.012671
ParentalEducation  0.023619
StudyTimeWeekly    0.096593
Absences           0.727672
Tutoring           0.019289
ParentalSupport    0.047308
Extracurricular    0.016799
Sports             0.014208
Music              0.010286
Volunteering       0.006336


# Classification Forest vs Tree

In [20]:
from sklearn.tree import DecisionTreeClassifier

In [21]:
dt = DecisionTreeClassifier(
    min_samples_split=20,
    min_samples_leaf=6,
)
dt.fit(X_train, y_train)

In [22]:
y_pred = dt.predict(X_test)
print("Score", dt.score(X_test, y_test))

Score 0.7019498607242339


In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.58      0.20      0.30        35
         1.0       0.52      0.53      0.53        75
         2.0       0.56      0.60      0.58       121
         3.0       0.47      0.45      0.46       121
         4.0       0.86      0.90      0.88       366

    accuracy                           0.70       718
   macro avg       0.60      0.54      0.55       718
weighted avg       0.70      0.70      0.69       718



In [24]:
features = pd.DataFrame(dt.feature_importances_, index=X_test.columns)
print(features.head(15))

                          0
Age                0.021807
Gender             0.005577
ParentalEducation  0.007284
StudyTimeWeekly    0.146213
Absences           0.665609
Tutoring           0.047890
ParentalSupport    0.061200
Extracurricular    0.021738
Sports             0.013688
Music              0.008529
Volunteering       0.000465


# Regression

Now let's look at regression.

In [25]:
df_regr = df.drop(columns=["Ethnicity", "StudentID", "GradeClass"])

In [26]:
X = df_regr.drop(["GPA"], axis=1)
y = df_regr["GPA"]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [28]:
from sklearn.ensemble import RandomForestRegressor

In [29]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

In [30]:
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

In [31]:
y_pred = rf.predict(X_test)
print("Score", rf.score(X_test, y_test))
print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2", r2_score(y_test, y_pred))

Score 0.9283164251766762
MAE 0.19070929452724653
MSE 0.05714057146270938
R2 0.9283164251766762


In [32]:
features = pd.DataFrame(rf.feature_importances_, index=X_test.columns)
print(features.head(15))

                          0
Age                0.006765
Gender             0.002799
ParentalEducation  0.006787
StudyTimeWeekly    0.057583
Absences           0.860559
Tutoring           0.011453
ParentalSupport    0.032929
Extracurricular    0.007207
Sports             0.008481
Music              0.003575
Volunteering       0.001862


# GridSearch

Now let's look at hyperparameter tuning with cross-validation.

In [33]:
from sklearn.model_selection import GridSearchCV

In [34]:
param_grid = {
    #"n_estimators": [100, 300, 600],
    #"max_depth": [20, 30],
    "min_samples_split": [15, 20, 25],
    "min_samples_leaf": [3, 6, 10],
    #"criterion": ["squared_error", "friedman_mse", "poisson"],
}

In [35]:
rf = RandomForestRegressor(
    n_estimators=300,
    min_samples_split=20,
    min_samples_leaf=6,
)
rf_cv = GridSearchCV(
    estimator=rf, 
    param_grid=param_grid, 
    cv=3, 
    scoring="neg_mean_squared_error",
    verbose=3,
)

In [36]:
rf_cv.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV 1/3] END min_samples_leaf=3, min_samples_split=15;, score=-0.066 total time=   1.6s
[CV 2/3] END min_samples_leaf=3, min_samples_split=15;, score=-0.065 total time=   1.6s
[CV 3/3] END min_samples_leaf=3, min_samples_split=15;, score=-0.065 total time=   1.6s
[CV 1/3] END min_samples_leaf=3, min_samples_split=20;, score=-0.069 total time=   1.5s
[CV 2/3] END min_samples_leaf=3, min_samples_split=20;, score=-0.067 total time=   1.5s
[CV 3/3] END min_samples_leaf=3, min_samples_split=20;, score=-0.068 total time=   1.5s
[CV 1/3] END min_samples_leaf=3, min_samples_split=25;, score=-0.073 total time=   1.5s
[CV 2/3] END min_samples_leaf=3, min_samples_split=25;, score=-0.071 total time=   1.5s
[CV 3/3] END min_samples_leaf=3, min_samples_split=25;, score=-0.071 total time=   1.5s
[CV 1/3] END min_samples_leaf=6, min_samples_split=15;, score=-0.069 total time=   1.6s
[CV 2/3] END min_samples_leaf=6, min_samples_split=15;, scor

In [37]:
print("Best params", rf_cv.best_params_)

Best params {'min_samples_leaf': 3, 'min_samples_split': 15}


In [38]:
y_pred = rf_cv.predict(X_test)
print("Score", rf_cv.score(X_test, y_test))
print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2", r2_score(y_test, y_pred))

Score -0.06040022020466688
MAE 0.19527124079694128
MSE 0.06040022020466688
R2 0.9242271543046767
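
The negative "Score" above comes from the `scoring="neg_mean_squared_error"` setting: `GridSearchCV.score` uses the configured scorer, so it returns the negated MSE rather than R2. The sketch below reproduces this on synthetic data (the dataset, the tiny grid, and the `_demo` names are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

search_demo = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"min_samples_leaf": [1, 5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search_demo.fit(X_tr_demo, y_tr_demo)
y_hat_demo = search_demo.predict(X_te_demo)

# score() follows the `scoring` argument: it is the negated test MSE.
neg_mse = search_demo.score(X_te_demo, y_te_demo)
mse = mean_squared_error(y_te_demo, y_hat_demo)

# The refitted best estimator still reports R2 from its own score() method.
r2 = search_demo.best_estimator_.score(X_te_demo, y_te_demo)
print(neg_mse, mse, r2)
```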


In [39]:
rf = RandomForestRegressor(
    n_estimators=300,
    min_samples_split=20,
    min_samples_leaf=6,
    criterion="squared_error",
)
rf.fit(X_train, y_train)

In [40]:
y_pred = rf.predict(X_test)
print("Score", rf.score(X_test, y_test))
print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2", r2_score(y_test, y_pred))

Score 0.9192793770492187
MAE 0.2015227031648326
MSE 0.06434420347480756
R2 0.9192793770492187


# Regression Forest vs Tree

Comparison with decision trees.

In [41]:
from sklearn.tree import DecisionTreeRegressor

In [42]:
dt = DecisionTreeRegressor(
    min_samples_split=20,
    min_samples_leaf=10,
)
dt.fit(X_train, y_train)

In [43]:
y_pred = dt.predict(X_test)
print("Score", dt.score(X_test, y_test))
print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2", r2_score(y_test, y_pred))

Score 0.8890802145009167
MAE 0.23437206846953196
MSE 0.08841662745698559
R2 0.8890802145009167


In [44]:
param_grid = {
    "min_samples_split": [15, 20, 25],
    "min_samples_leaf": [3, 6, 10],
    "criterion": ["squared_error", "poisson"],
}

In [45]:
dt = DecisionTreeRegressor()
dt_cv = GridSearchCV(
    estimator=dt, 
    param_grid=param_grid, 
    cv=3, 
    scoring="neg_mean_squared_error",
    verbose=3,
)

In [46]:
dt_cv.fit(X_train, y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV 1/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=15;, score=-0.104 total time=   0.0s
[CV 2/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=15;, score=-0.097 total time=   0.0s
[CV 3/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=15;, score=-0.100 total time=   0.0s
[CV 1/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=20;, score=-0.103 total time=   0.0s
[CV 2/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=20;, score=-0.097 total time=   0.0s
[CV 3/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=20;, score=-0.097 total time=   0.0s
[CV 1/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=25;, score=-0.104 total time=   0.0s
[CV 2/3] END criterion=squared_error, min_samples_leaf=3, min_samples_split=25;, score=-0.100 total time=   0.0s
[CV 3/3] END criterion=squared_erro

In [47]:
print("Best params", dt_cv.best_params_)

Best params {'criterion': 'poisson', 'min_samples_leaf': 6, 'min_samples_split': 20}


In [48]:
y_pred = dt_cv.predict(X_test)
print("Score", dt_cv.score(X_test, y_test))
print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2", r2_score(y_test, y_pred))

Score -0.08985373488192508
MAE 0.23997034585149418
MSE 0.08985373488192508
R2 0.8872773449287767


# Dataset Gym

Dataset: [link](https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset)

In [166]:
df = pd.read_csv("gym_members.csv")
len(df)

973

In [168]:
# Encode the binary target as integers: Female -> 0, Male -> 1.
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})

In [169]:
def encode_and_bind(original_dataframe, feature_to_encode):
    """Replace one categorical column with its one-hot (dummy) columns."""
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res

In [170]:
df = encode_and_bind(df, "Workout_Type")

In [171]:
df.head()

Unnamed: 0,Age,Gender,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI,Workout_Type_Cardio,Workout_Type_HIIT,Workout_Type_Strength,Workout_Type_Yoga
0,56,1,88.3,1.71,180,157,60,1.69,1313.0,12.6,3.5,4,3,30.2,False,False,False,True
1,46,0,74.9,1.53,179,151,66,1.3,883.0,33.9,2.1,4,2,32.0,False,True,False,False
2,32,0,68.1,1.66,167,122,54,1.11,677.0,33.4,2.3,4,2,24.71,True,False,False,False
3,25,1,53.2,1.7,190,164,56,0.59,532.0,28.8,2.1,3,1,18.41,False,False,True,False
4,38,1,46.1,1.79,188,158,68,0.64,556.0,29.2,2.8,3,1,14.39,False,False,True,False


In [172]:
df_class = df.copy()  # no columns need to be dropped for this dataset

In [173]:
X = df_class.drop(["Gender"], axis=1)
y = df_class["Gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Random Forest

In [174]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [175]:
y_pred = rf.predict(X_test)
print("Score", rf.score(X_test, y_test))

Score 0.958904109589041


In [176]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96       137
           1       0.99      0.93      0.96       155

    accuracy                           0.96       292
   macro avg       0.96      0.96      0.96       292
weighted avg       0.96      0.96      0.96       292



In [177]:
features = pd.DataFrame(rf.feature_importances_, index=X_test.columns)
print(features)

                                      0
Age                            0.008686
Weight (kg)                    0.214589
Height (m)                     0.147552
Max_BPM                        0.016862
Avg_BPM                        0.012364
Resting_BPM                    0.007964
Session_Duration (hours)       0.028828
Calories_Burned                0.021131
Fat_Percentage                 0.137041
Water_Intake (liters)          0.327621
Workout_Frequency (days/week)  0.003891
Experience_Level               0.012338
BMI                            0.054913
Workout_Type_Cardio            0.001406
Workout_Type_HIIT              0.001604
Workout_Type_Strength          0.002272
Workout_Type_Yoga              0.000936


## Decision Tree

In [178]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [179]:
y_pred = dt.predict(X_test)
print("Score", dt.score(X_test, y_test))

Score 0.934931506849315


In [180]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.95      0.93       137
           1       0.95      0.92      0.94       155

    accuracy                           0.93       292
   macro avg       0.93      0.94      0.93       292
weighted avg       0.94      0.93      0.93       292



In [181]:
features = pd.DataFrame(dt.feature_importances_, index=X_test.columns)
print(features)

                                      0
Age                            0.000000
Weight (kg)                    0.236785
Height (m)                     0.108447
Max_BPM                        0.003528
Avg_BPM                        0.006122
Resting_BPM                    0.000000
Session_Duration (hours)       0.000000
Calories_Burned                0.010590
Fat_Percentage                 0.121178
Water_Intake (liters)          0.509427
Workout_Frequency (days/week)  0.000000
Experience_Level               0.000000
BMI                            0.003924
Workout_Type_Cardio            0.000000
Workout_Type_HIIT              0.000000
Workout_Type_Strength          0.000000
Workout_Type_Yoga              0.000000
