# Models
## Creating General and Individual models to predict Rating of Perceived Exertion (RPE)
Using the `data/initial_features.csv` dataset and the PyCaret library, we will create models to predict the Rating of Perceived Exertion (RPE) of our subjects. We will create a general model, which predicts the RPE of any subject, and an individual model, which predicts the RPE of one specific subject.

#### What is Rating of Perceived Exertion?
The Borg Rating of Perceived Exertion (RPE) is a way of measuring physical activity intensity level. Perceived exertion is how hard you feel like your body is working.

An image of the RPE scale for reference:

![RPE Scale](https://www.researchgate.net/publication/327632653/figure/tbl2/AS:670492033830922@1536869166661/Rating-of-Perceived-Exertion-RPE-Scale-Borg-1962.png)
<sub>From researchgate.net</sub>

#### First, we will import necessary libraries

In [2]:
import matplotlib.pyplot as plt
from pycaret.datasets import get_data
from pycaret.regression import setup, compare_models, pull, predict_model

#### Next, we clean data
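The new columns *weight* and *pace* are parsed from the `experimental_condition` column, whose values look like `Condition 1.5-15` (weight before the hyphen, pace after it).
The `rpe` column has several NA values, so we fill each one with the most recent non-missing value (a forward fill).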
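
As a rough illustration of that parsing, the sketch below applies the same steps to a single label in plain Python. The helper name `parse_condition` is hypothetical, and the assumption that every label follows the `Condition <weight>-<pace>` layout is inferred only from the labels that appear in this notebook.

```python
def parse_condition(label: str) -> tuple[float, int]:
    """Split a label such as 'Condition 1.5-15' into (weight, pace).

    Hypothetical helper: assumes the 'Condition <weight>-<pace>' layout.
    """
    weight_part, pace_part = label.split('-')              # 'Condition 1.5', '15'
    weight = float(weight_part.replace('Condition ', ''))  # strip the text prefix
    pace = int(pace_part)
    return weight, pace


print(parse_condition('Condition 1.5-15'))  # (1.5, 15)
```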


In [3]:
df = get_data('data/initial_features')

df[['weight', 'pace']] = df['experimental_condition'].str.split('-', expand=True)

df['weight'] = df['weight'].str.replace('Condition ', '').astype(float)
df['pace'] = df['pace'].astype(int)

In [9]:
df.columns

Index(['subject', 'experimental_condition', 'rpe', 'wrist_acc_time',
       'wrist_acc_length', 'wrist_acc_mean', 'wrist_acc_rms', 'wrist_acc_mad',
       'wrist_acc_std', 'wrist_acc_min', 'wrist_acc_max', 'wrist_acc_med',
       'wrist_acc_perc25', 'wrist_acc_perc75', 'wrist_jerk_mean',
       'wrist_jerk_rms', 'wrist_jerk_mad', 'wrist_jerk_std', 'wrist_jerk_min',
       'wrist_jerk_max', 'wrist_jerk_med', 'wrist_jerk_perc25',
       'wrist_jerk_perc75', 'trunk_acc_mean', 'trunk_acc_rms', 'trunk_acc_mad',
       'trunk_acc_std', 'trunk_acc_min', 'trunk_acc_max', 'trunk_acc_med',
       'trunk_acc_perc25', 'trunk_acc_perc75', 'trunk_jerk_mean',
       'trunk_jerk_rms', 'trunk_jerk_mad', 'trunk_jerk_std', 'trunk_jerk_min',
       'trunk_jerk_max', 'trunk_jerk_med', 'trunk_jerk_perc25',
       'trunk_jerk_perc75', 'upperarm_acc_mean', 'upperarm_acc_rms',
       'upperarm_acc_mad', 'upperarm_acc_std', 'upperarm_acc_min',
       'upperarm_acc_max', 'upperarm_acc_med', 'upperarm_acc_perc25'

In [4]:
# Forward-fill missing ratings; fillna(method='ffill') is deprecated in recent pandas
df['rpe'] = df['rpe'].ffill()
df['rpe'] = df['rpe'].astype(int)       # Change this to float if necessary
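
Forward fill carries the most recent non-missing value forward over each gap. Here is a minimal sketch of the idea in plain Python, using a made-up list rather than the real `rpe` column (the function name `forward_fill` is hypothetical):

```python
def forward_fill(values):
    """Replace each None with the last non-missing value seen before it."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Made-up ratings; None marks a missing entry.
print(forward_fill([3, None, None, 5, None]))  # [3, 3, 3, 5, 5]
print(forward_fill([None, 4]))                 # [None, 4]: a leading gap stays missing
```

A leading gap has nothing to copy from, so it stays missing. In pandas that would leave a NaN in place, and the subsequent `astype(int)` would then fail.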

First, I will create a general model restricted to a single condition. In this example, I will use Condition 1.5-15, training on subjects 1 to 12 and testing on subjects 13 to 15.

In [5]:
filt_condition = df['experimental_condition'] == 'Condition 1.5-15'
filt_subject = df['subject'] == 1

df_condition = df[filt_condition]

In [12]:
# Training a general model using specific condition (here, Condition 1.5-15)
train = df_condition.loc[df_condition['subject'].isin(range(1, 13))]
test = df_condition.loc[df_condition['subject'].isin(range(13, 16))]

reg = setup(data=train, target='rpe')

| | Description | Value |
|---|---|---|
| 0 | Session id | 4767 |
| 1 | Target | rpe |
| 2 | Target type | Regression |
| 3 | Original data shape | (637, 61) |
| 4 | Transformed data shape | (637, 61) |
| 5 | Transformed train set shape | (445, 61) |
| 6 | Transformed test set shape | (192, 61) |
| 7 | Numeric features | 59 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |


In [15]:
model_condition = compare_models()

In [14]:
pull()

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| knn | K Neighbors Regressor | 0.1493 | 0.0979 | 0.3035 | 0.9826 | 0.1145 | 0.0882 | 0.494 |
| rf | Random Forest Regressor | 0.4128 | 0.3972 | 0.6167 | 0.9296 | 0.2367 | 0.1949 | 0.745 |
| gbr | Gradient Boosting Regressor | 0.4756 | 0.45 | 0.665 | 0.9173 | 0.2688 | 0.2254 | 0.674 |
| dt | Decision Tree Regressor | 0.3367 | 0.534 | 0.7203 | 0.8999 | 0.2776 | 0.181 | 0.494 |
| et | Extra Trees Regressor | 0.5399 | 0.5458 | 0.7307 | 0.8998 | 0.3067 | 0.2473 | 0.694 |
| lightgbm | Light Gradient Boosting Machine | 0.5028 | 0.612 | 0.7647 | 0.8919 | 0.2687 | 0.2451 | 0.587 |
| ada | AdaBoost Regressor | 0.7326 | 0.8379 | 0.9127 | 0.8463 | 0.3858 | 0.3275 | 0.582 |
| lr | Linear Regression | 1.0481 | 1.7644 | 1.322 | 0.6779 | 0.4651 | 0.5284 | 1.228 |
| ridge | Ridge Regression | 1.1822 | 2.2646 | 1.4872 | 0.5927 | 0.5011 | 0.6107 | 0.542 |
| br | Bayesian Ridge | 1.1986 | 2.3241 | 1.5073 | 0.5841 | 0.5055 | 0.6068 | 0.493 |


In [16]:
predictions = predict_model(model_condition, data=test)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Regressor | 2.6768 | 12.8598 | 3.5861 | -0.5473 | 1.1076 | 0.6974 |


Now, I'll create a general model for all conditions.

In [1]:
# Training a general model for all conditions
train = df.loc[df['subject'].isin(range(1, 13))]
test = df.loc[df['subject'].isin(range(13, 16))]

reg = setup(data=train, target='rpe')
model_general = compare_models()
# predictions = predict_model(model_general, data=test)

NameError: name 'df' is not defined

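In our runs, the model for all conditions was worse than the model for the specified condition, although neither is very good. (The cell above shows a `NameError`, apparently because it was last executed in a fresh kernel before `df` was loaded, so its metrics are not displayed here.) For the condition model, the cross-validated R<sup>2</sup> of 0.9826 drops to -0.5473 on the held-out subjects 13 to 15, which means it predicts unseen subjects worse than simply guessing their mean RPE.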
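
To see why a negative R<sup>2</sup> is a bad sign, recall that R<sup>2</sup> = 1 - SS_res / SS_tot, where SS_tot is the squared error you would get by always predicting the mean of the true values. The sketch below uses made-up numbers, not the notebook's data, and a hypothetical helper named `r2`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]              # made-up RPE values

print(r2(y_true, [1, 2, 3, 4, 5]))    # 1.0  (perfect predictions)
print(r2(y_true, [3, 3, 3, 3, 3]))    # 0.0  (always predicting the mean)
print(r2(y_true, [5, 4, 3, 2, 1]))    # -3.0 (worse than predicting the mean)
```

Any score below 0 therefore means the predictions carry less information than a constant guess.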

Now, we'll create an individual model for a sample subject. For this example, we will create an individual model for subject 1.

In [74]:
df_subject = df[filt_subject]

train = df_subject.sample(frac=0.8, random_state=42)
test = df_subject.drop(train.index)

reg = setup(data=train, target='rpe')
model_subject = compare_models()
predictions = predict_model(model_subject, data=test)

| | Description | Value |
|---|---|---|
| 0 | Session id | 6764 |
| 1 | Target | rpe |
| 2 | Target type | Regression |
| 3 | Original data shape | (224, 61) |
| 4 | Transformed data shape | (224, 64) |
| 5 | Transformed train set shape | (156, 64) |
| 6 | Transformed test set shape | (68, 64) |
| 7 | Numeric features | 59 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Regressor | 0.1143 | 0.0486 | 0.2204 | 0.9505 | 0.1263 | 0.0992 |



So far, this model is by far the most accurate. Its R<sup>2</sup> value is over 0.9 (0.9505), and its MAE, MSE, and RMSE are all much closer to 0 than those of the general model on its test subjects. These are all good signs, although here the test rows come from the same subject as the training rows. But what if we create an individual model for subject 1 and also restrict it to a single condition?
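
For reference, the three error metrics are related as follows: MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which puts it back on the RPE scale. A small sketch with made-up values (the function name `error_metrics` is hypothetical):

```python
import math


def error_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Made-up ratings and predictions, not taken from the dataset.
mae, mse, rmse = error_metrics([2, 4, 6, 8], [2, 5, 6, 6])
print(mae, mse, round(rmse, 3))  # 0.75 1.25 1.118
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much to MSE as the error of 1.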

In this next example, we will create an individual model for subject 1 under Condition 2.5-15.

In [75]:
df_subject_condition = df[(df['experimental_condition'] == 'Condition 2.5-15') & filt_subject]

train = df_subject_condition.sample(frac=0.8, random_state=42)
test = df_subject_condition.drop(train.index)

reg = setup(data=train, target='rpe')
model_subject_condition = compare_models()
predictions = predict_model(model_subject_condition, data=test)

| | Description | Value |
|---|---|---|
| 0 | Session id | 3511 |
| 1 | Target | rpe |
| 2 | Target type | Regression |
| 3 | Original data shape | (78, 61) |
| 4 | Transformed data shape | (78, 61) |
| 5 | Transformed train set shape | (54, 61) |
| 6 | Transformed test set shape | (24, 61) |
| 7 | Numeric features | 59 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Regressor | 0.08 | 0.044 | 0.2098 | 0.9617 | 0.0892 | 0.0611 |



This model is the most accurate so far (R<sup>2</sup> of 0.9617). However, it is built from only 98 rows of data, of which 78 are passed to `setup` after the 80/20 split, so its scores may not be reliable. We can see the size of each of the datasets used below:
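
The row counts in the setup tables can be reproduced by hand. This sketch rests on two assumptions that the notebook does not state: that `DataFrame.sample(frac=0.8)` keeps `round(0.8 * n)` rows, and that PyCaret's `setup` uses its default `train_size` of 0.7 with the training rows rounded down. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, sample_frac=0.8, train_size=0.7):
    """Return (rows given to setup, PyCaret train rows, PyCaret hold-out rows).

    Assumes round() for the pandas sample and floor() for the train split.
    """
    n_setup = round(sample_frac * n_rows)
    n_train = math.floor(train_size * n_setup)
    return n_setup, n_train, n_setup - n_train


print(split_sizes(98))   # (78, 54, 24)    subject-condition model
print(split_sizes(280))  # (224, 156, 68)  subject model
```

Under those assumptions, the subject-condition model is fitted on 54 rows and validated on 24, with the remaining 20 of the 98 rows kept aside as the final test set.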

In [67]:
print("--= Number of rows =--")
print("General model: \t\t" + str(df.shape[0]))
print("Subject model: \t\t" + str(df_subject.shape[0]))
print("Condition model: \t" + str(df_condition.shape[0]))
print("Subject-condition model: " + str(df_subject_condition.shape[0]))

--= Number of rows =--
General model: 		2895
Subject model: 		280
Condition model: 	899
Subject-condition model: 98


We can see that the subject-condition model has by far the smallest dataset, at only 35% of the size of the next smallest dataset (the subject model). This *may* be reason to doubt the accuracy of this model.
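
The 35% figure follows directly from the row counts printed above. The same arithmetic for all four datasets, as a quick check:

```python
# Row counts copied from the printed output above.
rows = {
    "General": 2895,
    "Condition": 899,
    "Subject": 280,
    "Subject-condition": 98,
}

smallest = rows["Subject-condition"]
for name, count in rows.items():
    # Share of each dataset's size that the subject-condition dataset represents.
    print(f"{name}: {count} rows, subject-condition is {smallest / count:.1%} of this")
```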