# Generate the General Models

Here, the general models are created. Each model is trained on the data from 11 subjects and tested on the data from the remaining 3. They are called general models because every subject is used in each model, either for training or for testing.

Choosing the 3 testing subjects fixes the 11 training subjects, so the number of train/test splits of the 14 subjects is $\binom{14}{3} = 364$. For each split, 18 models are generated, so the total number of models is 364 × 18 = 6552.
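The counts above can be checked with a short standard-library sketch. The subject IDs 1 to 14 are placeholders assumed for illustration (the real IDs come from the data file), and `N_TEST` and `MODELS_PER_SPLIT` are illustrative names rather than names from this notebook.

```python
from itertools import combinations
from math import comb

# Assumption: placeholder subject IDs; the notebook reads the real ones from the data.
subject_ids = list(range(1, 15))
N_TEST = 3             # subjects held out for testing in each split
MODELS_PER_SPLIT = 18  # models generated per split

# Enumerate every split: pick 3 test subjects, train on the other 11.
splits = []
for test_subjects in combinations(subject_ids, N_TEST):
    train_subjects = [s for s in subject_ids if s not in test_subjects]
    splits.append((train_subjects, test_subjects))

n_splits = len(splits)
total_models = n_splits * MODELS_PER_SPLIT

assert n_splits == comb(len(subject_ids), N_TEST)
print(n_splits, total_models)  # 364 6552
```

Every split leaves exactly 11 training subjects, so no subject appears in both the training and testing sets of the same split.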

The results for each model are saved to CSV files: `train_general.csv` holds the training (cross-validation) results and `test_general.csv` holds the results on the held-out testing subjects. The feature importance data for each model is saved to `feature_importance_general.csv`. These files are in the directory `outputs/general/`.
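As a rough illustration of how a feature importance ranking can be read from fitted models of different kinds, the sketch below prefers a `feature_importances_` attribute (tree ensembles), falls back to the magnitude of `coef_` (linear models), and otherwise returns a placeholder row. The helper `rank_features`, the stub classes, and the feature names `hr` and `acc` are all hypothetical and exist only for this example; it is not the notebook's own code, which builds a DataFrame instead.

```python
def rank_features(model, feature_names):
    """Return (feature, |value|) pairs sorted by decreasing magnitude."""
    # Prefer tree-style importances, then fall back to linear coefficients.
    values = getattr(model, "feature_importances_", None)
    if values is None:
        values = getattr(model, "coef_", None)
    if values is None:
        # Neither attribute exists: return a placeholder row.
        return [("error", 0.0)]
    pairs = [(name, abs(float(v))) for name, v in zip(feature_names, values)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for fitted models, used only to exercise the helper.
class TreeLike:
    feature_importances_ = [0.2, 0.8]


class LinearLike:
    coef_ = [-1.5, 0.5]


tree_ranking = rank_features(TreeLike(), ["hr", "acc"])
linear_ranking = rank_features(LinearLike(), ["hr", "acc"])
missing_ranking = rank_features(object(), ["hr", "acc"])
```

Taking the absolute value of the coefficients makes negative and positive effects comparable in size, but coefficient magnitudes and tree importances are on different scales, so rankings are best compared within one model rather than across models.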

In [1]:
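from itertools import combinations

import pandas as pd
from pandas import read_csv as read
# Import only the PyCaret functions this notebook uses instead of a star import
from pycaret.regression import compare_models, get_config, predict_model, pull, setup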

In [2]:
%%capture
data_original = read('../data/initial_features_limited_interpolation.csv')

# Keep only the (subject, experimental_condition) pairs whose maximum RPE exceeds 4
temp_df = data_original.groupby(['subject', 'experimental_condition']).agg({'rpe': 'max'}).reset_index()
subjects = temp_df[temp_df['rpe'] > 4]

# Reuse the frame that is already loaded instead of reading the CSV a second time
df = data_original.merge(
    right=subjects[['subject', 'experimental_condition']],
    how='right',
    on=['subject', 'experimental_condition'],
)

# The number of subjects used in testing data. Should be set to 3 to match the data used in results.
COMBINATIONS = 3

combinations_list = list(combinations(df['subject'].unique(), COMBINATIONS))

## Cross Validation Results

In [6]:
train_general_df = pd.DataFrame()
test_general_df = pd.DataFrame()
feature_importance_general_df = pd.DataFrame()

for sub in [(2, 3, 4)]:  # single split shown here; iterate over combinations_list for the full run
    # Split data into training and testing based on subject
    train = df[~df['subject'].isin(sub)]
    test = df[df['subject'].isin(sub)]

    # IMPORTANT: CURRENTLY EXCLUDING wrist_acc_time. REMOVE IF NECESSARY.
    reg = setup(data=train, target='rpe', ignore_features=['experimental_condition', 'subject', 'wrist_acc_time'])
    best = compare_models(sort='MAE', n_select=18)
    # Score grid from compare_models (named cv_results to avoid shadowing the built-in all)
    cv_results = pull()
    cv_results['test_set'] = str(sub)

    # Append the training results for this split
    train_general_df = pd.concat([train_general_df, cv_results], ignore_index=True)
    
    
    test_results = pd.DataFrame()
    for model in best:
        # Run models on test data
        test_result = predict_model(model, data=test, verbose = False)
        test_result_df = pull()
        test_result_df['test_set'] = str(sub)
        test_results = pd.concat([test_results, test_result_df], ignore_index=True)
        
        # Add feature importance of model to dataframe
        try:
            # Tree-based models expose feature_importances_
            importance = pd.DataFrame({'Feature': get_config('X_train').columns, 'Value': abs(model.feature_importances_)}).sort_values(by='Value', ascending=False).reset_index(drop=True)
        except Exception:
            try:
                # Linear models expose coef_ instead
                importance = pd.DataFrame({'Feature': get_config('X_train').columns, 'Value': abs(model.coef_)}).sort_values(by='Value', ascending=False).reset_index(drop=True)
            except Exception:
                # Neither attribute is usable; record a placeholder row
                importance = pd.DataFrame({'Feature': ['error'], 'Value': [0]})

        # Append feature importance to dataframe
        importance['test_subjects'] = str(sub)
        feature_importance_general_df = pd.concat([feature_importance_general_df, importance], ignore_index=True)

    # Save test model results to csv
    test_general_df = pd.concat([test_general_df, test_results], ignore_index=True)

# This cell writes to outputs/; the introduction names outputs/general/ as the location of these files.
train_general_df.to_csv('outputs/train_general.csv', index=False)
test_general_df.to_csv('outputs/test_general.csv', index=False)
feature_importance_general_df.to_csv('outputs/feature_importance_general.csv', index=False)

|   | Description | Value |
|---|---|---|
| 0 | Session id | 5746 |
| 1 | Target | rpe |
| 2 | Target type | Regression |
| 3 | Original data shape | (481, 59) |
| 4 | Transformed data shape | (481, 56) |
| 5 | Transformed train set shape | (336, 56) |
| 6 | Transformed test set shape | (145, 56) |
| 7 | Ignore features | 3 |
| 8 | Numeric features | 55 |
| 9 | Preprocess | True |


|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| et | Extra Trees Regressor | 1.7277 | 4.8807 | 2.1894 | 0.235 | 0.5688 | 0.4672 | 0.117 |
| rf | Random Forest Regressor | 1.7768 | 4.9686 | 2.2115 | 0.2167 | 0.5866 | 0.4838 | 0.169 |
| catboost | CatBoost Regressor | 1.7934 | 5.0841 | 2.2376 | 0.1976 | 0.5891 | 0.4936 | 4.272 |
| lightgbm | Light Gradient Boosting Machine | 1.8069 | 5.2066 | 2.2689 | 0.1767 | 0.5952 | 0.4842 | 0.139 |
| xgboost | Extreme Gradient Boosting | 1.8186 | 5.4836 | 2.3228 | 0.1272 | 0.5929 | 0.4838 | 0.151 |
| gbr | Gradient Boosting Regressor | 1.8499 | 5.6253 | 2.3509 | 0.1149 | 0.6006 | 0.5003 | 0.146 |
| ridge | Ridge Regression | 1.8643 | 5.5767 | 2.3508 | 0.1072 | 0.6099 | 0.5076 | 0.061 |
| lr | Linear Regression | 1.8904 | 5.6096 | 2.3604 | 0.106 | 0.6101 | 0.5257 | 0.07 |
| ada | AdaBoost Regressor | 1.9401 | 5.7079 | 2.3736 | 0.1038 | 0.6214 | 0.515 | 0.092 |
| br | Bayesian Ridge | 1.9806 | 6.0397 | 2.4482 | 0.042 | 0.6399 | 0.5528 | 0.065 |

