In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import GridSearchCV

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [None]:
from catboost import CatBoostRegressor

# Regression models

In this notebook we build several regression models and compare their scores. The goal is to predict the exact value of the target productivity metric from features extracted from the time series of employees' physical worklogs.

We compare the following models, starting with a simple **linear** baseline and moving on to more advanced ones, ending with a **boosting** model:

* Linear regression
* SVM
* Random forest
* CatBoost

We also want to determine which of the weighted metric strategies defined earlier works best, so the model comparison is repeated for each of the three metrics: **Even**, **Initiative** and **Absence**.
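Since the same steps are repeated for every metric, the per-metric comparison can be sketched as a loop over datasets. The helpers `baseline_mse` and `make_demo_frame` and the synthetic columns `f0`..`f2` are illustrative assumptions, not part of the original notebook; with the real data each frame would come from `pd.read_csv(...).set_index('author')`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def baseline_mse(df, target='target', test_size=0.1, random_state=42):
    """Fit a linear baseline on one metric's dataset and return its test MSE."""
    X, y = df.drop([target], axis=1), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


def make_demo_frame(seed, n_rows=200):
    """Synthetic stand-in for one metric's dataset (hypothetical columns f0..f2)."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_rows, 3))
    target = 0.7 + 0.05 * features[:, 0] + rng.normal(scale=0.02, size=n_rows)
    frame = pd.DataFrame(features, columns=['f0', 'f1', 'f2'])
    frame['target'] = target
    return frame


# One synthetic frame per metric name; replace with the real CSV files when available.
metric_names = ['even', 'initiative', 'absence']
frames = {name: make_demo_frame(seed) for seed, name in enumerate(metric_names)}
scores = {name: baseline_mse(frame) for name, frame in frames.items()}
print(scores)
```

The same loop could wrap any of the other models; only the estimator inside `baseline_mse` would change.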

To fit and evaluate the models we use the dataset introduced earlier: five quarters of physical worklogs for 700+ employees.

We measure the quality of each model's predictions with MSE on a held-out test set; CatBoost additionally reports RMSE, its training loss, in its training log. To choose each model's hyperparameters we use **Grid Search** with **cross-validation**.
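MSE and RMSE differ only by a square root, which matters when comparing `mean_squared_error` values with the `learn:` column of CatBoost's log. A minimal sketch with made-up numbers (the helper names `mse` and `rmse` are illustrative):

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in the target's own units."""
    return float(np.sqrt(mse(y_true, y_pred)))


# Made-up targets and predictions, for illustration only.
y_true = [0.70, 0.60, 0.90, 0.50]
y_pred = [0.65, 0.70, 0.80, 0.55]
print(mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE is expressed in the same units as the target, it is often easier to read than MSE when the target lies between 0 and 1.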

Beyond the scores themselves, we are also interested in feature importances, to understand which features of the dataset have the most impact on the predictions.

## **Even metric**

In [None]:
df = pd.read_csv('dataset_quarts_even.csv')

In [None]:
df = df.set_index('author')

In [None]:
df

author,period1,period2,period3,adf_c,adf_ct,adf_ctt,pp_c,pp_ct,kpss_c,kpss_ct,...,2,3,4,daily0,daily1,daily2,weekly0,weekly1,weekly2,target
author314_2023-07-01_2023-09-30,11,27,29,0,0,0,0,0,0,0,...,0.00,23.0,0.0,1,1,1,1,1,1,0.679866
author363_2023-07-01_2023-09-30,28,10,19,0,0,0,0,0,0,0,...,0.75,0.0,0.0,1,1,1,1,1,1,0.719305
author286_2023-07-01_2023-09-30,22,13,30,1,1,1,0,0,0,0,...,0.00,0.0,40.0,1,1,1,1,1,1,0.719847
author912_2023-07-01_2023-09-30,31,30,1,0,0,0,0,0,0,0,...,12.75,14.0,2.0,1,0,0,1,1,1,0.722459
author15_2023-07-01_2023-09-30,1,2,28,1,1,1,0,0,0,1,...,4.00,8.0,0.0,0,0,0,1,1,1,0.723281
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
author288_2024-07-01_2024-09-30,9,29,21,0,0,0,0,0,0,0,...,0.00,0.0,0.0,1,1,1,1,1,1,0.568182
author255_2024-07-01_2024-09-30,17,21,28,0,0,0,0,0,0,0,...,168.00,0.0,0.0,1,1,1,1,1,1,0.562500
author898_2024-07-01_2024-09-30,1,14,21,1,1,1,0,0,1,0,...,0.00,0.0,0.0,1,1,1,1,1,1,0.981061
author596_2024-07-01_2024-09-30,12,25,19,0,0,0,0,0,0,0,...,0.00,64.0,80.0,1,1,1,1,1,1,0.579545


In [None]:
X, y = df.drop(['target'], axis=1), df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

### **1. Linear regression (baseline)**

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)

In [None]:
mean_squared_error(y_test, preds)

0.0038667820233058907

In [None]:
model.coef_

array([ 6.43197672e-05, -9.22447915e-05,  1.54727156e-04,  2.53262070e-03,
        1.00087595e-04, -9.16154666e-03, -1.89302732e-02,  2.72210233e-02,
        5.47770983e-03,  1.93926716e-03, -3.65335915e-04,  7.15876909e-03,
       -8.51270239e-03, -3.05583996e-06, -1.79766932e-04, -2.14902279e-04,
       -2.49779301e-04, -2.31310586e-04, -1.69851456e-04,  1.83924838e-02,
       -1.76521311e-02, -1.05910103e-02, -8.43375919e-03, -8.43375919e-03,
       -8.43375919e-03])

In [None]:
model.score(X_train, y_train)

0.19323560897107905
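`model.score` returns the coefficient of determination R², and the cell above evaluates it on the training data. A hedged sketch of comparing training and test R² side by side, using synthetic data and illustrative names (`X_demo`, `y_demo`, `demo_model`) rather than the notebook's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: two informative features and one noise feature.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)
demo_model = LinearRegression().fit(X_tr, y_tr)

# R^2 on data the model has seen versus data it has not seen.
train_r2 = demo_model.score(X_tr, y_tr)
test_r2 = demo_model.score(X_te, y_te)
print(train_r2, test_r2)
```

A large gap between the two values would hint at overfitting; a low value on both, as with the training score above, suggests the linear model explains only a small share of the variance.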

### **2. Support Vector Machines**

In [None]:
svr_model = SVR()

In [None]:
svr_model.fit(X_train, y_train)

In [None]:
svr_preds = svr_model.predict(X_test)

In [None]:
mean_squared_error(y_test, svr_preds)

0.0045628414752972675
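`SVR` uses an RBF kernel by default, which is sensitive to feature scale, and the features here range from 0/1 flags to values such as 168.00. One option worth trying, shown only as a sketch on synthetic data, is to standardize the features inside a pipeline so that the scaler is fit on the training folds only. The data and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
# Synthetic features on very different scales (illustrative only).
X_demo = np.column_stack([
    rng.integers(0, 2, size=150),   # 0/1 flag
    rng.uniform(0, 168, size=150),  # large-valued feature
])
y_demo = (0.6 + 0.1 * X_demo[:, 0] + 0.001 * X_demo[:, 1]
          + rng.normal(scale=0.02, size=150))

# make_pipeline names each step after its lowercased class name ('svr').
scaled_svr = make_pipeline(StandardScaler(), SVR())
param_grid = {'svr__C': [0.1, 1.0], 'svr__epsilon': [0.01, 0.1]}
search = GridSearchCV(scaled_svr, param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Whether scaling helps on the worklog features would have to be checked empirically; this sketch makes no claim about the result.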

In [None]:
svr_param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 1.0],
                  'epsilon': [0.001, 0.01, 0.1, 0.5, 1.0],
                 }

In [None]:
svr_grid_search = GridSearchCV(estimator=svr_model, param_grid=svr_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
svr_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   1.0s
[CV 2/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   0.9s
[CV 3/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   0.9s
[CV 4/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   1.5s
[CV 5/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   1.6s
[CV 1/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.9s
[CV 2/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.8s
[CV 3/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.8s
[CV 4/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.8s
[CV 5/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.8s
[CV 1/5] END .............C=0.001, epsilon=0.1;, score=-0.005 total time=   0.1s
[CV 2/5] END .............C=0.001, epsilon=0.1;

In [None]:
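# GridSearchCV already refits the best estimator on the full training set (refit=True by default), so this call is redundant but harmless.
svr_grid_search.best_estimator_.fit(X_train, y_train)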

In [None]:
svr_preds = svr_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, svr_preds)

0.004097684432078389

### **3. Random Forest**

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
rf_preds = rf_model.predict(X_test)

In [None]:
mean_squared_error(y_test, rf_preds)

0.003721706985651599

In [None]:
rf_param_grid = {'n_estimators': [10, 100, 500, 1000],
                 'max_depth': [5, 8, 15],
                 'min_samples_split': [2, 5, 10]
                }

In [None]:
rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
rf_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.004 total time=   0.1s
[CV 2/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.004 total time=   0.1s
[CV 3/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.004 total time=   0.1s
[CV 4/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.005 total time=   0.1s
[CV 5/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.004 total time=   0.1s
[CV 1/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.004 total time=   1.1s
[CV 2/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.004 total time=   0.8s
[CV 3/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.004 total time=   0.8s
[CV 4/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.004 total time=   0.8s
[CV 5/5] END max_depth=5, min_samples_split=2, n_estimators=1

In [None]:
rf_preds = rf_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, rf_preds)

0.003620832233894536

In [None]:
rf_grid_search.best_estimator_.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 15,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 1000,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
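`get_params()` lists every hyperparameter, including defaults. To see only the values chosen by the search, together with the cross-validated score, `best_params_` and `best_score_` can be read directly. The following is a small sketch on synthetic data with a deliberately tiny grid; the names `demo_grid` and `demo_search` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 0.3 * X_demo[:, 0] + rng.normal(scale=0.05, size=120)

# Small grid so the example runs quickly.
demo_grid = {'n_estimators': [5, 20], 'max_depth': [2, 4]}
demo_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    demo_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
demo_search.fit(X_demo, y_demo)

tuned = demo_search.best_params_      # only the searched hyperparameters
cv_mse = -demo_search.best_score_     # scoring is negated MSE, so flip the sign
print(tuned, cv_mse)
```

The cross-validated MSE is computed on the training folds, so it can differ from the MSE on the held-out test set.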

In [None]:
rf_grid_search.best_estimator_.feature_importances_

array([8.06952595e-02, 6.86863536e-02, 6.29494674e-02, 4.22383141e-03,
       5.54462947e-03, 5.26805412e-03, 9.59379280e-04, 5.20442049e-04,
       2.91438479e-03, 2.98959791e-03, 2.18250143e-01, 1.03404911e-02,
       2.09049385e-01, 1.46119613e-01, 4.19647833e-02, 2.63320684e-02,
       4.55040481e-02, 3.09266480e-02, 3.23393918e-02, 1.22585408e-03,
       4.57369516e-04, 2.73848774e-03, 1.90584350e-07, 6.46148594e-08,
       6.20067538e-08])

In [None]:
pd.DataFrame({'features': X.columns, 'feature_importances': rf_grid_search.best_estimator_.feature_importances_})

,features,feature_importances
0,period1,0.08069526
1,period2,0.06868635
2,period3,0.06294947
3,adf_c,0.004223831
4,adf_ct,0.005544629
5,adf_ctt,0.005268054
6,pp_c,0.0009593793
7,pp_ct,0.000520442
8,kpss_c,0.002914385
9,kpss_ct,0.002989598
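The table above lists features in column order, so the strongest ones are not necessarily at the top. A minimal sketch of ranking them, where `rank_importances` is a hypothetical helper and the sample values are a few rows copied from the output above:

```python
import pandas as pd


def rank_importances(feature_names, importances, top=10):
    """Return the `top` features sorted by descending importance."""
    table = pd.DataFrame({
        'features': list(feature_names),
        'feature_importances': list(importances),
    })
    ranked = table.sort_values('feature_importances', ascending=False)
    return ranked.head(top).reset_index(drop=True)


# A handful of rows from the random forest output, deliberately shuffled.
names = ['adf_c', 'period2', 'pp_ct', 'period1', 'period3']
values = [0.004223831, 0.06868635, 0.000520442, 0.08069526, 0.06294947]
ranked = rank_importances(names, values, top=3)
print(ranked)
```

With the fitted search object, the call would be `rank_importances(X.columns, rf_grid_search.best_estimator_.feature_importances_)`.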


### **4. CatBoost**

In [None]:
cb_model = CatBoostRegressor(random_state=42)

In [None]:
cb_model.fit(X_train, y_train)

Learning rate set to 0.051888
0:	learn: 0.0739691	total: 3.66ms	remaining: 3.66s
1:	learn: 0.0733096	total: 6.33ms	remaining: 3.16s
2:	learn: 0.0727057	total: 9.44ms	remaining: 3.14s
3:	learn: 0.0721397	total: 12.5ms	remaining: 3.1s
4:	learn: 0.0716069	total: 15.5ms	remaining: 3.08s
5:	learn: 0.0711330	total: 18.2ms	remaining: 3.02s
6:	learn: 0.0707861	total: 21.2ms	remaining: 3.01s
7:	learn: 0.0702709	total: 24.1ms	remaining: 2.99s
8:	learn: 0.0698457	total: 27.1ms	remaining: 2.98s
9:	learn: 0.0695152	total: 30.2ms	remaining: 2.99s
10:	learn: 0.0691561	total: 32.9ms	remaining: 2.96s
11:	learn: 0.0688356	total: 35.5ms	remaining: 2.93s
12:	learn: 0.0685473	total: 38.6ms	remaining: 2.93s
13:	learn: 0.0682508	total: 41.7ms	remaining: 2.94s
14:	learn: 0.0679962	total: 45.1ms	remaining: 2.96s
15:	learn: 0.0677633	total: 48.4ms	remaining: 2.98s
16:	learn: 0.0674939	total: 51.5ms	remaining: 2.98s
17:	learn: 0.0672453	total: 54.5ms	remaining: 2.97s
18:	learn: 0.0670501	total: 57.5ms	remaining:

<catboost.core.CatBoostRegressor at 0x7a3467c88160>

In [None]:
cb_preds = cb_model.predict(X_test)

In [None]:
mean_squared_error(y_test, cb_preds)

0.0035223639789115245

In [None]:
cb_param_grid = {'iterations': [500, 700, 1000, 2000, 2500]}

In [None]:
cb_grid_search = GridSearchCV(estimator=cb_model, param_grid=cb_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
cb_grid_search.fit(X_train, y_train)

Streaming output truncated to the last 5000 lines.
504:	learn: 0.0566564	total: 1.44s	remaining: 5.68s
505:	learn: 0.0566395	total: 1.44s	remaining: 5.68s
506:	learn: 0.0566275	total: 1.44s	remaining: 5.67s
507:	learn: 0.0566161	total: 1.45s	remaining: 5.67s
508:	learn: 0.0566004	total: 1.45s	remaining: 5.67s
509:	learn: 0.0565816	total: 1.45s	remaining: 5.66s
510:	learn: 0.0565589	total: 1.45s	remaining: 5.66s
511:	learn: 0.0565392	total: 1.46s	remaining: 5.66s
512:	learn: 0.0565217	total: 1.46s	remaining: 5.66s
513:	learn: 0.0565125	total: 1.46s	remaining: 5.65s
514:	learn: 0.0564973	total: 1.47s	remaining: 5.66s
515:	learn: 0.0564768	total: 1.47s	remaining: 5.67s
516:	learn: 0.0564614	total: 1.48s	remaining: 5.67s
517:	learn: 0.0564421	total: 1.49s	remaining: 5.68s
518:	learn: 0.0564305	total: 1.49s	remaining: 5.69s
519:	learn: 0.0564232	total: 1.49s	remaining: 5.68s
520:	learn: 0.0564171	total: 1.5s	remaining: 5.68s
521:	learn: 0.0564099	total: 1.5s	remaining: 5.68s
5

In [None]:
cb_grid_search.best_estimator_.get_params()

{'loss_function': 'RMSE', 'random_state': 42, 'iterations': 500}

In [None]:
cb_preds = cb_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, cb_preds)

0.00368503230485713

In [None]:
cb_grid_search.best_estimator_.get_feature_importance()

array([1.20281936e+01, 8.68197520e+00, 6.05285397e+00, 1.70390923e-01,
       4.04398616e-01, 8.19666762e-01, 2.66167430e-02, 1.38884473e-02,
       3.44351562e-01, 5.09967981e-01, 1.03474235e+01, 1.24316266e+00,
       1.89620976e+01, 1.39710002e+01, 6.15648867e+00, 4.25316979e+00,
       6.36588476e+00, 4.31022811e+00, 4.98231375e+00, 2.38309975e-01,
       8.85430066e-02, 2.90271122e-02, 7.59982102e-07, 0.00000000e+00,
       4.63518285e-05])

In [None]:
pd.DataFrame({'features': X.columns, 'feature_importances': cb_grid_search.best_estimator_.get_feature_importance()})

,features,feature_importances
0,period1,12.02819
1,period2,8.681975
2,period3,6.052854
3,adf_c,0.1703909
4,adf_ct,0.4043986
5,adf_ctt,0.8196668
6,pp_c,0.02661674
7,pp_ct,0.01388845
8,kpss_c,0.3443516
9,kpss_ct,0.509968
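To compare the models for the Even metric at a glance, the test MSE values reported in the cells above can be collected into one table. The labels and the `summary` layout are illustrative; the numbers are copied from the outputs in this section.

```python
import numpy as np
import pandas as pd

# Test-set MSE values reported above for the Even metric.
even_mse = {
    'Linear regression': 0.0038667820233058907,
    'SVR (default)': 0.0045628414752972675,
    'SVR (grid search)': 0.004097684432078389,
    'Random forest (default)': 0.003721706985651599,
    'Random forest (grid search)': 0.003620832233894536,
    'CatBoost (default)': 0.0035223639789115245,
    'CatBoost (grid search)': 0.00368503230485713,
}

# Sort from best (lowest MSE) to worst and add RMSE for readability.
summary = pd.Series(even_mse, name='test_mse').sort_values().to_frame()
summary['test_rmse'] = np.sqrt(summary['test_mse'])
print(summary)
```

On this single 10% test split the differences are small, and the grid-searched CatBoost model scores slightly worse than the default one, so the ranking should be read with caution.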


## **Initiative metric**

In [None]:
df = pd.read_csv('dataset_quarts_initiative.csv')

In [None]:
df = df.set_index('author')

In [None]:
df

author,period1,period2,period3,adf_c,adf_ct,adf_ctt,pp_c,pp_ct,kpss_c,kpss_ct,...,2,3,4,daily0,daily1,daily2,weekly0,weekly1,weekly2,target
author314_2023-07-01_2023-09-30,11,27,29,0,0,0,0,0,0,0,...,0.00,23.0,0.0,1,1,1,1,1,1,0.487786
author363_2023-07-01_2023-09-30,28,10,19,0,0,0,0,0,0,0,...,0.75,0.0,0.0,1,1,1,1,1,1,0.586906
author286_2023-07-01_2023-09-30,22,13,30,1,1,1,0,0,0,0,...,0.00,0.0,40.0,1,1,1,1,1,1,0.560409
author912_2023-07-01_2023-09-30,31,30,1,0,0,0,0,0,0,0,...,12.75,14.0,2.0,1,0,0,1,1,1,0.560118
author15_2023-07-01_2023-09-30,1,2,28,1,1,1,0,0,0,1,...,4.00,8.0,0.0,0,0,0,1,1,1,0.557249
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
author288_2024-07-01_2024-09-30,9,29,21,0,0,0,0,0,0,0,...,0.00,0.0,0.0,1,1,1,1,1,1,0.377273
author255_2024-07-01_2024-09-30,17,21,28,0,0,0,0,0,0,0,...,168.00,0.0,0.0,1,1,1,1,1,1,0.300000
author898_2024-07-01_2024-09-30,1,14,21,1,1,1,0,0,1,0,...,0.00,0.0,0.0,1,1,1,1,1,1,0.969697
author596_2024-07-01_2024-09-30,12,25,19,0,0,0,0,0,0,0,...,0.00,64.0,80.0,1,1,1,1,1,1,0.418182


In [None]:
X, y = df.drop(['target'], axis=1), df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

### **1. Linear regression (baseline)**

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)

In [None]:
mean_squared_error(y_test, preds)

0.006927271009937399

In [None]:
model.coef_

array([-4.93583684e-04, -3.80248474e-04, -2.33146091e-04, -6.68070859e-04,
       -5.98542163e-03, -1.57059972e-03, -2.92081765e-02,  3.85183377e-02,
       -3.67092285e-03, -8.07096493e-03,  1.40325643e-07,  2.64153288e-03,
       -8.53260082e-03, -1.04103620e-04, -5.91920286e-05,  3.42630188e-05,
       -3.18692365e-04, -1.09457697e-04, -9.43734908e-05,  1.89135641e-02,
       -1.29150979e-02, -1.22644091e-02, -1.53728938e-02, -1.53728938e-02,
       -1.53728938e-02])

In [None]:
model.score(X_train, y_train)

0.18233642458075405

### **2. Support Vector Machines**

In [None]:
svr_model = SVR()

In [None]:
svr_model.fit(X_train, y_train)

In [None]:
svr_preds = svr_model.predict(X_test)

In [None]:
mean_squared_error(y_test, svr_preds)

0.007668768908156325

In [None]:
svr_param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 1.0],
                  'epsilon': [0.001, 0.01, 0.1, 0.5, 1.0],
                 }

In [None]:
svr_grid_search = GridSearchCV(estimator=svr_model, param_grid=svr_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
svr_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ...........C=0.001, epsilon=0.001;, score=-0.009 total time=   1.0s
[CV 2/5] END ...........C=0.001, epsilon=0.001;, score=-0.009 total time=   1.0s
[CV 3/5] END ...........C=0.001, epsilon=0.001;, score=-0.008 total time=   1.0s
[CV 4/5] END ...........C=0.001, epsilon=0.001;, score=-0.009 total time=   1.0s
[CV 5/5] END ...........C=0.001, epsilon=0.001;, score=-0.009 total time=   0.9s
[CV 1/5] END ............C=0.001, epsilon=0.01;, score=-0.009 total time=   0.8s
[CV 2/5] END ............C=0.001, epsilon=0.01;, score=-0.009 total time=   0.8s
[CV 3/5] END ............C=0.001, epsilon=0.01;, score=-0.008 total time=   1.3s
[CV 4/5] END ............C=0.001, epsilon=0.01;, score=-0.009 total time=   1.4s
[CV 5/5] END ............C=0.001, epsilon=0.01;, score=-0.009 total time=   1.1s
[CV 1/5] END .............C=0.001, epsilon=0.1;, score=-0.009 total time=   0.2s
[CV 2/5] END .............C=0.001, epsilon=0.1;

In [None]:
svr_grid_search.best_estimator_.fit(X_train, y_train)

In [None]:
svr_preds = svr_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, svr_preds)

0.007486190893980477

### **3. Random Forest**

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
rf_preds = rf_model.predict(X_test)

In [None]:
mean_squared_error(y_test, rf_preds)

0.0066906067090229514

In [None]:
rf_param_grid = {'n_estimators': [10, 100, 500, 1000],
                 'max_depth': [5, 8, 15],
                 'min_samples_split': [2, 5, 10]
                }

In [None]:
rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
rf_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.008 total time=   0.1s
[CV 2/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.008 total time=   0.1s
[CV 3/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.007 total time=   0.1s
[CV 4/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.008 total time=   0.1s
[CV 5/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.008 total time=   0.1s
[CV 1/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.008 total time=   0.8s
[CV 2/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.007 total time=   0.8s
[CV 3/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.007 total time=   0.8s
[CV 4/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.008 total time=   0.8s
[CV 5/5] END max_depth=5, min_samples_split=2, n_estimators=1

In [None]:
rf_preds = rf_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, rf_preds)

0.006429891258368871

In [None]:
rf_grid_search.best_estimator_.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 8,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 1000,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [None]:
rf_grid_search.best_estimator_.feature_importances_

array([5.22282482e-02, 4.39203622e-02, 4.41667277e-02, 2.84144083e-03,
       4.43784441e-03, 3.46703923e-03, 1.44221419e-03, 1.50064038e-03,
       1.92690704e-03, 3.44169601e-03, 7.91664388e-02, 1.07964175e-02,
       1.39509496e-01, 4.84602919e-01, 2.68972396e-02, 1.67822436e-02,
       3.89828441e-02, 2.59730610e-02, 1.53910151e-02, 6.92394340e-04,
       3.69226394e-04, 1.45979260e-03, 7.71175554e-07, 2.15712791e-06,
       8.63502762e-07])

In [None]:
pd.DataFrame({'features': X.columns, 'feature_importances': rf_grid_search.best_estimator_.feature_importances_})

,features,feature_importances
0,period1,0.05222825
1,period2,0.04392036
2,period3,0.04416673
3,adf_c,0.002841441
4,adf_ct,0.004437844
5,adf_ctt,0.003467039
6,pp_c,0.001442214
7,pp_ct,0.00150064
8,kpss_c,0.001926907
9,kpss_ct,0.003441696


### **4. CatBoost**

In [None]:
cb_model = CatBoostRegressor(random_state=42)

In [None]:
cb_model.fit(X_train, y_train)

Learning rate set to 0.051888
0:	learn: 0.0963809	total: 3.15ms	remaining: 3.14s
1:	learn: 0.0956716	total: 6ms	remaining: 2.99s
2:	learn: 0.0950687	total: 8.94ms	remaining: 2.97s
3:	learn: 0.0944152	total: 12ms	remaining: 2.99s
4:	learn: 0.0938578	total: 14.9ms	remaining: 2.96s
5:	learn: 0.0933148	total: 17.6ms	remaining: 2.92s
6:	learn: 0.0928766	total: 20.6ms	remaining: 2.92s
7:	learn: 0.0924206	total: 23.5ms	remaining: 2.92s
8:	learn: 0.0919995	total: 26.5ms	remaining: 2.91s
9:	learn: 0.0916081	total: 29.6ms	remaining: 2.93s
10:	learn: 0.0912629	total: 32.4ms	remaining: 2.91s
11:	learn: 0.0909256	total: 36.9ms	remaining: 3.04s
12:	learn: 0.0905791	total: 40.1ms	remaining: 3.05s
13:	learn: 0.0902862	total: 43.9ms	remaining: 3.09s
14:	learn: 0.0900103	total: 46.7ms	remaining: 3.06s
15:	learn: 0.0897680	total: 49.9ms	remaining: 3.07s
16:	learn: 0.0895278	total: 52.7ms	remaining: 3.05s
17:	learn: 0.0893044	total: 55.5ms	remaining: 3.02s
18:	learn: 0.0890658	total: 58.6ms	remaining: 3.0

<catboost.core.CatBoostRegressor at 0x7a3467cd8730>

In [None]:
cb_preds = cb_model.predict(X_test)

In [None]:
mean_squared_error(y_test, cb_preds)

0.006751363837522685

In [None]:
cb_param_grid = {'iterations': [500, 700, 1000, 2000, 2500]}

In [None]:
cb_grid_search = GridSearchCV(estimator=cb_model, param_grid=cb_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
cb_grid_search.fit(X_train, y_train)

Streaming output truncated to the last 5000 lines.
504:	learn: 0.0782337	total: 1.44s	remaining: 5.7s
505:	learn: 0.0782205	total: 1.45s	remaining: 5.7s
506:	learn: 0.0782022	total: 1.45s	remaining: 5.7s
507:	learn: 0.0781831	total: 1.45s	remaining: 5.69s
508:	learn: 0.0781817	total: 1.45s	remaining: 5.69s
509:	learn: 0.0781761	total: 1.46s	remaining: 5.68s
510:	learn: 0.0781539	total: 1.46s	remaining: 5.68s
511:	learn: 0.0781520	total: 1.46s	remaining: 5.68s
512:	learn: 0.0781394	total: 1.46s	remaining: 5.67s
513:	learn: 0.0781215	total: 1.47s	remaining: 5.67s
514:	learn: 0.0781009	total: 1.47s	remaining: 5.67s
515:	learn: 0.0780706	total: 1.47s	remaining: 5.66s
516:	learn: 0.0780495	total: 1.48s	remaining: 5.66s
517:	learn: 0.0780367	total: 1.48s	remaining: 5.66s
518:	learn: 0.0780163	total: 1.48s	remaining: 5.65s
519:	learn: 0.0779874	total: 1.48s	remaining: 5.65s
520:	learn: 0.0779627	total: 1.49s	remaining: 5.65s
521:	learn: 0.0779607	total: 1.49s	remaining: 5.64s
52

In [None]:
cb_grid_search.best_estimator_.get_params()

{'loss_function': 'RMSE', 'random_state': 42, 'iterations': 500}

In [None]:
cb_preds = cb_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, cb_preds)

0.006763138085033325

In [None]:
cb_grid_search.best_estimator_.get_feature_importance()

array([8.99044290e+00, 8.80526217e+00, 7.84352016e+00, 4.14981050e-01,
       6.17048043e-01, 7.56907308e-01, 3.22432588e-02, 4.80879940e-02,
       1.98568382e-01, 8.47685692e-01, 1.24705138e+01, 1.48484552e+00,
       1.52753211e+01, 1.97825918e+01, 4.04117455e+00, 4.60208743e+00,
       4.58283822e+00, 4.23179103e+00, 4.44982655e+00, 2.23766936e-01,
       4.57067456e-02, 2.53115795e-01, 1.65751108e-03, 0.00000000e+00,
       1.60097975e-05])

In [None]:
pd.DataFrame({'features': X.columns, 'feature_importances': cb_grid_search.best_estimator_.get_feature_importance()})

,features,feature_importances
0,period1,8.990443
1,period2,8.805262
2,period3,7.84352
3,adf_c,0.414981
4,adf_ct,0.617048
5,adf_ctt,0.756907
6,pp_c,0.032243
7,pp_ct,0.048088
8,kpss_c,0.198568
9,kpss_ct,0.847686


## **Absence metric**

In [None]:
df = pd.read_csv('dataset_quarts_absence.csv')

In [None]:
df = df.set_index('author')

In [None]:
df

author,period1,period2,period3,adf_c,adf_ct,adf_ctt,pp_c,pp_ct,kpss_c,kpss_ct,...,2,3,4,daily0,daily1,daily2,weekly0,weekly1,weekly2,target
author314_2023-07-01_2023-09-30,11,27,29,0,0,0,0,0,0,0,...,0.00,23.0,0.0,1,1,1,1,1,1,0.841081
author363_2023-07-01_2023-09-30,28,10,19,0,0,0,0,0,0,0,...,0.75,0.0,0.0,1,1,1,1,1,1,0.850033
author286_2023-07-01_2023-09-30,22,13,30,1,1,1,0,0,0,0,...,0.00,0.0,40.0,1,1,1,1,1,1,0.865727
author912_2023-07-01_2023-09-30,31,30,1,0,0,0,0,0,0,0,...,12.75,14.0,2.0,1,0,0,1,1,1,0.874368
author15_2023-07-01_2023-09-30,1,2,28,1,1,1,0,0,0,1,...,4.00,8.0,0.0,0,0,0,1,1,1,0.877582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
author288_2024-07-01_2024-09-30,9,29,21,0,0,0,0,0,0,0,...,0.00,0.0,0.0,1,1,1,1,1,1,0.636364
author255_2024-07-01_2024-09-30,17,21,28,0,0,0,0,0,0,0,...,168.00,0.0,0.0,1,1,1,1,1,1,0.725000
author898_2024-07-01_2024-09-30,1,14,21,1,1,1,0,0,1,0,...,0.00,0.0,0.0,1,1,1,1,1,1,0.984848
author596_2024-07-01_2024-09-30,12,25,19,0,0,0,0,0,0,0,...,0.00,64.0,80.0,1,1,1,1,1,1,0.622727


In [None]:
X, y = df.drop(['target'], axis=1), df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

### **1. Linear regression (baseline)**

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)

In [None]:
mean_squared_error(y_test, preds)

0.003534748392065086

In [None]:
model.coef_

array([ 1.41947282e-04,  2.88339138e-06,  3.26862899e-04, -3.03342413e-03,
       -9.38553656e-04, -5.79164222e-03, -2.14770285e-02,  1.89345399e-02,
       -1.54683071e-03,  5.86129355e-03, -8.92595578e-04,  1.04216080e-02,
       -1.40885812e-02,  6.88959488e-05, -2.63617317e-04, -3.10714516e-04,
       -2.52628967e-04, -2.52931963e-04, -2.75514893e-04,  1.75073102e-02,
       -1.86778071e-02, -5.20552162e-03,  9.05018497e-03,  9.05018497e-03,
        9.05018497e-03])

In [None]:
model.score(X_train, y_train)

0.3653216794879055

### **2. Support Vector Machines**

In [None]:
svr_model = SVR()

In [None]:
svr_model.fit(X_train, y_train)

In [None]:
svr_preds = svr_model.predict(X_test)

In [None]:
mean_squared_error(y_test, svr_preds)

0.0042154048922525954

In [None]:
svr_param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 1.0],
                  'epsilon': [0.001, 0.01, 0.1, 0.5, 1.0],
                 }

In [None]:
svr_grid_search = GridSearchCV(estimator=svr_model, param_grid=svr_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
svr_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   0.9s
[CV 2/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   0.9s
[CV 3/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   0.9s
[CV 4/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   0.9s
[CV 5/5] END ...........C=0.001, epsilon=0.001;, score=-0.005 total time=   1.0s
[CV 1/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   1.3s
[CV 2/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   1.4s
[CV 3/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.9s
[CV 4/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.8s
[CV 5/5] END ............C=0.001, epsilon=0.01;, score=-0.005 total time=   0.9s
[CV 1/5] END .............C=0.001, epsilon=0.1;, score=-0.005 total time=   0.1s
[CV 2/5] END .............C=0.001, epsilon=0.1;

In [None]:
svr_grid_search.best_estimator_.fit(X_train, y_train)

In [None]:
svr_preds = svr_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, svr_preds)

0.003636037675417072

### **3. Random Forest**

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
rf_preds = rf_model.predict(X_test)

In [None]:
mean_squared_error(y_test, rf_preds)

0.0028520469887143188

In [None]:
rf_param_grid = {'n_estimators': [10, 100, 500, 1000],
                 'max_depth': [5, 8, 15],
                 'min_samples_split': [2, 5, 10]
                }

In [None]:
rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
rf_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.003 total time=   0.1s
[CV 2/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.004 total time=   0.1s
[CV 3/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.003 total time=   0.1s
[CV 4/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.003 total time=   0.1s
[CV 5/5] END max_depth=5, min_samples_split=2, n_estimators=10;, score=-0.004 total time=   0.1s
[CV 1/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.003 total time=   0.8s
[CV 2/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.003 total time=   0.8s
[CV 3/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.003 total time=   0.8s
[CV 4/5] END max_depth=5, min_samples_split=2, n_estimators=100;, score=-0.003 total time=   0.8s
[CV 5/5] END max_depth=5, min_samples_split=2, n_estimators=1

In [None]:
rf_preds = rf_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, rf_preds)

0.0028186868980742802

In [None]:
rf_grid_search.best_estimator_.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 15,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 1000,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [None]:
rf_grid_search.best_estimator_.feature_importances_

array([5.79867077e-02, 5.52573628e-02, 5.01934025e-02, 2.94675518e-03,
       3.01724603e-03, 4.13657966e-03, 4.05133592e-04, 1.94090561e-04,
       1.90663704e-03, 1.99559869e-03, 2.87503101e-01, 8.27160535e-03,
       2.91220382e-01, 8.64426231e-02, 3.69929801e-02, 2.22555439e-02,
       3.65150956e-02, 2.26734920e-02, 2.72378402e-02, 6.48928559e-04,
       2.80792282e-04, 1.91672044e-03, 1.53868054e-07, 1.63354539e-07,
       1.06433210e-06])

In [None]:
pd.DataFrame({'features': X.columns, 'feature_importances': rf_grid_search.best_estimator_.feature_importances_})

,features,feature_importances
0,period1,0.05798671
1,period2,0.05525736
2,period3,0.0501934
3,adf_c,0.002946755
4,adf_ct,0.003017246
5,adf_ctt,0.00413658
6,pp_c,0.0004051336
7,pp_ct,0.0001940906
8,kpss_c,0.001906637
9,kpss_ct,0.001995599


### **4. CatBoost**

In [None]:
cb_model = CatBoostRegressor(random_state=42)

In [None]:
cb_model.fit(X_train, y_train)

Learning rate set to 0.051888
0:	learn: 0.0768812	total: 3.08ms	remaining: 3.07s
1:	learn: 0.0755228	total: 6.39ms	remaining: 3.19s
2:	learn: 0.0742369	total: 9.2ms	remaining: 3.06s
3:	learn: 0.0730690	total: 12.3ms	remaining: 3.05s
4:	learn: 0.0720376	total: 15.3ms	remaining: 3.04s
5:	learn: 0.0710716	total: 18.3ms	remaining: 3.03s
6:	learn: 0.0701518	total: 21.3ms	remaining: 3.01s
7:	learn: 0.0692530	total: 24.3ms	remaining: 3.02s
8:	learn: 0.0684660	total: 27.5ms	remaining: 3.02s
9:	learn: 0.0676867	total: 30.3ms	remaining: 3s
10:	learn: 0.0669912	total: 33.1ms	remaining: 2.97s
11:	learn: 0.0663142	total: 36.3ms	remaining: 2.99s
12:	learn: 0.0657688	total: 39.1ms	remaining: 2.97s
13:	learn: 0.0652468	total: 42.2ms	remaining: 2.97s
14:	learn: 0.0647383	total: 45.3ms	remaining: 2.98s
15:	learn: 0.0641861	total: 48.1ms	remaining: 2.96s
16:	learn: 0.0637022	total: 51ms	remaining: 2.95s
17:	learn: 0.0633026	total: 53.9ms	remaining: 2.94s
18:	learn: 0.0628884	total: 56.8ms	remaining: 2.93

<catboost.core.CatBoostRegressor at 0x7a34438f3be0>

In [None]:
cb_preds = cb_model.predict(X_test)

In [None]:
mean_squared_error(y_test, cb_preds)

0.0027809196865078204

In [None]:
cb_param_grid = {'iterations': [500, 700, 1000, 2000, 2500]}

In [None]:
cb_grid_search = GridSearchCV(estimator=cb_model, param_grid=cb_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

In [None]:
cb_grid_search.fit(X_train, y_train)

Streaming output truncated to the last 5000 lines.
2004:	learn: 0.0381415	total: 5.76s	remaining: 1.42s
2005:	learn: 0.0381362	total: 5.77s	remaining: 1.42s
2006:	learn: 0.0381301	total: 5.77s	remaining: 1.42s
2007:	learn: 0.0381263	total: 5.77s	remaining: 1.41s
2008:	learn: 0.0381209	total: 5.78s	remaining: 1.41s
2009:	learn: 0.0381114	total: 5.78s	remaining: 1.41s
2010:	learn: 0.0381029	total: 5.78s	remaining: 1.41s
2011:	learn: 0.0380974	total: 5.79s	remaining: 1.4s
2012:	learn: 0.0380895	total: 5.79s	remaining: 1.4s
2013:	learn: 0.0380838	total: 5.79s	remaining: 1.4s
2014:	learn: 0.0380796	total: 5.79s	remaining: 1.39s
2015:	learn: 0.0380755	total: 5.8s	remaining: 1.39s
2016:	learn: 0.0380699	total: 5.8s	remaining: 1.39s
2017:	learn: 0.0380661	total: 5.8s	remaining: 1.39s
2018:	learn: 0.0380601	total: 5.8s	remaining: 1.38s
2019:	learn: 0.0380541	total: 5.81s	remaining: 1.38s
2020:	learn: 0.0380433	total: 5.81s	remaining: 1.38s
2021:	learn: 0.0380396	total: 5.81s	remai

In [None]:
cb_grid_search.best_estimator_.get_params()

{'loss_function': 'RMSE', 'random_state': 42, 'iterations': 2000}

In [None]:
cb_preds = cb_grid_search.best_estimator_.predict(X_test)

In [None]:
mean_squared_error(y_test, cb_preds)

0.002695862702130857

In [None]:
cb_grid_search.best_estimator_.get_feature_importance()

array([9.16051027e+00, 7.96048513e+00, 5.62737714e+00, 1.62299708e-01,
       3.09595605e-01, 6.40886793e-01, 9.42410682e-03, 1.25421325e-02,
       2.63452219e-01, 3.92558826e-01, 1.52129874e+01, 1.35506572e+00,
       2.38728690e+01, 1.08305489e+01, 5.42410718e+00, 4.17224053e+00,
       5.83874357e+00, 3.45313213e+00, 4.89377825e+00, 1.82225881e-01,
       8.04978851e-02, 1.43835032e-01, 7.32674752e-04, 4.53176808e-05,
       5.85635570e-05])

In [None]:
pd.DataFrame({'features': X.columns, 'feature_importances': cb_grid_search.best_estimator_.get_feature_importance()})

Unnamed: 0,features,feature_importances
0,period1,9.16051
1,period2,7.960485
2,period3,5.627377
3,adf_c,0.1623
4,adf_ct,0.309596
5,adf_ctt,0.640887
6,pp_c,0.009424
7,pp_ct,0.012542
8,kpss_c,0.263452
9,kpss_ct,0.392559


## **Results**

We compared the models on all three metrics and obtained the following test MSE for each case:


|metric|linear regression|svm|random forest|catboost|
|------|-----------------|---|-------------|--------|
|even      |0.0039|0.0041|0.0036|0.0035|
|initiative|0.0069|0.0075|0.0064|0.0068|
|absence   |0.0035|0.0036|0.0028|0.0027|

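As expected, both tree ensembles (Random Forest, which uses bagging, and CatBoost, which uses boosting) showed better results than linear regression and SVM on every metric. The best model on our dataset in the regression case is either Random Forest or CatBoost: CatBoost is slightly better on 2 of the 3 metrics (Even and Absence), possibly due to its higher model complexity, while Random Forest is better on Initiative.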
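
The comparison above can be checked with a few lines of code. This is a minimal sketch that simply re-enters the rounded MSE values from the table by hand; the names `mse`, `best_model` and `wins` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
# Rounded test MSE values copied from the results table above.
mse = {
    "even":       {"linear regression": 0.0039, "svm": 0.0041,
                   "random forest": 0.0036, "catboost": 0.0035},
    "initiative": {"linear regression": 0.0069, "svm": 0.0075,
                   "random forest": 0.0064, "catboost": 0.0068},
    "absence":    {"linear regression": 0.0035, "svm": 0.0036,
                   "random forest": 0.0028, "catboost": 0.0027},
}

# For each metric, pick the model with the lowest MSE.
best_model = {metric: min(scores, key=scores.get) for metric, scores in mse.items()}

# Count how many metrics each model wins.
wins = {}
for model in best_model.values():
    wins[model] = wins.get(model, 0) + 1

print(best_model)
print(wins)
```

Because the table values are rounded to four decimals, differences of 0.0001 between Random Forest and CatBoost should be read as near-ties rather than clear wins.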

The best metric is the one with the **Absence** focus, as it gives the lowest MSE: **0.0027**. In practice this means that, based on the dataset, we can predict the target value of the productivity metric (which lies in the `[0; 1]` range by design) with a root-mean-square error of about **0.05**. This result is not bad, but it could possibly be improved by switching to a different type of task: classifying employees into productivity groups, i.e. lower, average and higher. We will use this idea in the further analysis of the productivity prediction problem.
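
The error of about 0.05 quoted above is the square root of the MSE. A minimal sketch of that arithmetic, using the rounded value from the table (the variable names are ours):

```python
import math

# Rounded test MSE of the best model (CatBoost, Absence metric) from the table.
best_mse = 0.0027

# RMSE is expressed in the same units as the target, which lies in [0, 1].
rmse = math.sqrt(best_mse)

print(f"RMSE = {rmse:.3f}")
```

Note that RMSE weights large errors more heavily than the mean absolute error does, so it is an upper bound on the mean absolute error rather than the mean absolute error itself.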

Another significant result is the feature importances of the models. We will take a closer look at those of the model with the best score (CatBoost):

|feature|feature importance (%, sums to 100)|
|-------|------------------|
|period1|9.160510|
|period2|7.960485|
|period3|5.627377|
|adf_c|0.162300|
|adf_ct|0.309596|
|adf_ctt|0.640887|
|pp_c|0.009424|
|pp_ct|0.012542|
|kpss_c|0.263452|
|kpss_ct|0.392559|
|max|15.212987|
|shift|1.355066|
|mean|23.872869|
|var|10.830549|
|0|5.424107|
|1|4.172241|
|2|5.838744|
|3|3.453132|
|4|4.893778|
|daily0|0.182226|
|daily1|0.080498|
|daily2|0.143835|
|weekly0|0.000733|
|weekly1|0.000045|
|weekly2|0.000059|

As we can see, the most important features are the mean, max and variance of the time series. This is logical, because larger fluctuations in the time series indicate a lower frequency of logging work and, to some extent, lower productivity (as shown earlier in the weighted metrics analysis).
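
The ranking can be reproduced from the table with a short sketch. The importances are re-entered by hand from the table above, and `importances`, `ranked`, `top3` and `top3_share` are illustrative names, not objects defined in the notebook.

```python
# CatBoost feature importances (in %) copied from the table above.
importances = {
    "period1": 9.160510, "period2": 7.960485, "period3": 5.627377,
    "adf_c": 0.162300, "adf_ct": 0.309596, "adf_ctt": 0.640887,
    "pp_c": 0.009424, "pp_ct": 0.012542,
    "kpss_c": 0.263452, "kpss_ct": 0.392559,
    "max": 15.212987, "shift": 1.355066, "mean": 23.872869, "var": 10.830549,
    "0": 5.424107, "1": 4.172241, "2": 5.838744, "3": 3.453132, "4": 4.893778,
    "daily0": 0.182226, "daily1": 0.080498, "daily2": 0.143835,
    "weekly0": 0.000733, "weekly1": 0.000045, "weekly2": 0.000059,
}

# Sort features from most to least important.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# Names of the three most important features and their combined share.
top3 = [name for name, _ in ranked[:3]]
top3_share = sum(value for _, value in ranked[:3])

print(top3)
print(f"combined share of top 3: {top3_share:.2f}%")
```

With these values the three summary statistics together account for roughly half of the total importance.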

Other features, such as the daily means within a week (features 0-4), the periods and, to a lesser extent, the shift (about 1.4%), also contribute noticeably to the model's predictions. Lower periods indicate a higher frequency of logging work, and the shift shows whether a change between two sub-samples is a structural break (an employee going on vacation and then returning to the previous logging habit) or just a regular high peak in the time series explained by a lower logging frequency.
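
To see how much these feature groups contribute as a whole, their importances can be summed. The sketch below is only illustrative: the grouping and the names `groups` and `group_totals` are our own assumption about how to bundle the columns, and the values are copied from the table above.

```python
# Importances (in %) of the feature groups discussed above, from the table.
groups = {
    "daily means (0-4)": [5.424107, 4.172241, 5.838744, 3.453132, 4.893778],
    "periods (period1-3)": [9.160510, 7.960485, 5.627377],
    "shift": [1.355066],
}

# Total importance per group.
group_totals = {name: sum(values) for name, values in groups.items()}

for name, total in group_totals.items():
    print(f"{name}: {total:.2f}%")
```

Summing importances like this is a rough aggregate only; it does not account for correlation between features within a group.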

We will continue our analysis with the clustering idea stated earlier and with further classification based on the resulting clusters.