## Overview

Modeling:
- [x] Splitting the data into a train, validation (depending on the model) and test set that are properly temporally separated. The temporal separation of traces depends on the selected prediction task and the start and end timestamps of these traces, in addition to the timestamp of the activity of interest (in the case of next activity prediction).
- [ ] Class balancing (varies depending on the chosen prediction task). **Not applicable.**
- [x] Optional: hyperparameter tuning of the machine learning model through cross-validation on the train set.
- [x] Training the machine learning model on the train set.


## Import libraries

In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestRegressor

## Read data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
agg_df = pd.read_csv('aggregation_enconding.csv',
                     encoding='latin1',
                     parse_dates=['time:timestamp'])

# Convert the timedelta columns to plain numbers
# Trip duration as a whole number of days
# (.astype('timedelta64[D]') is rejected by newer pandas versions, 2.x)
agg_df['case:TripDuration'] = pd.to_timedelta(agg_df['case:TripDuration']).dt.days
agg_df['time:Remaining'] = pd.to_timedelta(agg_df['time:Remaining']).dt.total_seconds()  # to total seconds
agg_df['time:relative'] = pd.to_timedelta(agg_df['time:relative']).dt.total_seconds()  # to total seconds

agg_df.head()

## Train and test split

The partitions are split temporally.

**improvement:** do the temporal split on cases instead of on events. As it stands, events from the same case can end up in both the train and the test set.
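
A minimal sketch of what a case-level temporal split could look like, under stated assumptions. The helper `temporal_case_split` and the toy event log are hypothetical and not part of this notebook; only the column names `case:id` and `time:timestamp` come from the data. Cases that finish before the cutoff go to train, cases that start at or after it go to test, and cases that straddle the cutoff are dropped so that no case contributes events to both sets.

```python
import pandas as pd


def temporal_case_split(events, case_col="case:id", time_col="time:timestamp",
                        train_frac=0.8):
    """Split an event log by whole cases (hypothetical helper, not from the notebook)."""
    # First and last event time of every case
    bounds = events.groupby(case_col)[time_col].agg(start="min", end="max")

    # Cutoff: the start time of the case sitting at the train_frac position
    starts = bounds["start"].sort_values()
    cutoff = starts.iloc[min(int(train_frac * len(starts)), len(starts) - 1)]

    # Train: cases finished before the cutoff. Test: cases started at or after it.
    # Cases that straddle the cutoff are left out of both sets.
    train_cases = bounds[bounds["end"] < cutoff].index
    test_cases = bounds[bounds["start"] >= cutoff].index

    train = events[events[case_col].isin(train_cases)]
    test = events[events[case_col].isin(test_cases)]
    return train, test


# Toy event log (made-up values, for illustration only)
toy = pd.DataFrame({
    "case:id": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "time:timestamp": pd.to_datetime([
        "2018-01-01", "2018-01-03",  # case a
        "2018-01-02", "2018-01-04",  # case b
        "2018-01-05", "2018-01-09",  # case c, still running at the cutoff
        "2018-01-06", "2018-01-07",  # case d
        "2018-01-08", "2018-01-10",  # case e
    ]),
})

train, test = temporal_case_split(toy)
print(sorted(train["case:id"].unique()))
print(sorted(test["case:id"].unique()))
```

Dropping the straddling cases costs some data; keeping only their events before the cutoff in the train set would be an alternative, at the price of truncated traces.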

**improvement:** can we find a way to use the categorical features?
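
One possible answer, sketched under assumptions: tree-based models can consume integer codes, so the categorical columns could be ordinal-encoded instead of dropped. The example below uses scikit-learn's `OrdinalEncoder`, fitted on the train rows only, with unseen test categories mapped to `-1`. The toy frames and their values are made up; only the two column names come from the data, and whether this encoding actually helps the model here is untested.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["case:Permit OrganizationalEntity", "case:BudgetNumber"]

# Toy train/test frames (made-up values, for illustration only)
train = pd.DataFrame({
    "case:Permit OrganizationalEntity": ["unit A", "unit B"],
    "case:BudgetNumber": ["budget 1", "budget 2"],
})
test = pd.DataFrame({
    "case:Permit OrganizationalEntity": ["unit B", "unit C"],  # "unit C" is unseen
    "case:BudgetNumber": ["budget 2", "budget 3"],             # "budget 3" is unseen
})

# Fit on the train rows only, so no information leaks from the test set.
# Categories that only appear in the test set are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = train.copy()
test_enc = test.copy()
train_enc[cat_cols] = encoder.fit_transform(train[cat_cols])
test_enc[cat_cols] = encoder.transform(test[cat_cols])

print(test_enc)
```

The integer codes carry no meaningful order, which a random forest can usually work around through repeated splits; one-hot encoding would avoid the issue but may produce very wide inputs for high-cardinality columns such as budget numbers.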

In [None]:
# Sort values by timestamp
agg_df.sort_values('time:timestamp', inplace=True)
# Drop the old index values: row position is used for the temporal split
agg_df.reset_index(drop=True, inplace=True)  

# Drop columns that are not suitable for random forest regressor
rf_input = agg_df.drop(columns=['time:timestamp', 'case:id', 'time:relative', 
                          'case:Permit BudgetNumber', 'case:Permit OrganizationalEntity',
                          'case:BudgetNumber'])

# Split into 80% train and 20% test (rows are already in time order)
split_idx = int(.8 * len(rf_input))
train_set, test_set = rf_input.iloc[:split_idx], rf_input.iloc[split_idx:]

feature_cols = rf_input.columns.drop('time:Remaining')

X_train = train_set[feature_cols]
y_train = train_set['time:Remaining']

X_test = test_set[feature_cols]
y_test = test_set['time:Remaining']

print('train size: {} rows'.format(len(X_train)))
print('test size: {} rows'.format(len(X_test)))

train size: 26600 rows
test size: 6650 rows


In [None]:
agg_df.head()

Unnamed: 0,time:timestamp,case:Amount,case:Permit BudgetNumber,case:Permit OrganizationalEntity,case:Permit RequestedBudget,case:id,case:BudgetNumber,case:TripDuration,case:TripStartMonth,case:TripEndMonth,...,act_Permit REJECTED,act_Permit SUBMITTED,act_Request Payment,act_Send Reminder,res_ADMINISTRATION,res_BUDGET OWNER,res_DIRECTOR,res_EMPLOYEE,res_SUPERVISOR,res_UNDEFINED
0,2018-01-06 10:13:23+00:00,662.163043,budget 2233,organizational unit 65455,694.336959,declaration 7194,budget 145023,3.0,1.0,1,...,0,1,0,0,0,0,0,1,0,0
1,2018-01-06 10:13:26+00:00,662.163043,budget 2233,organizational unit 65455,694.336959,declaration 7194,budget 145023,3.0,1.0,1,...,0,1,0,0,1,0,0,1,0,0
2,2018-01-06 10:16:06+00:00,160.987935,budget 2233,organizational unit 65455,367.590155,declaration 53983,budget 145820,1.0,2.0,2,...,0,1,0,0,0,0,0,1,0,0
3,2018-01-06 10:16:41+00:00,160.987935,budget 2233,organizational unit 65455,367.590155,declaration 53983,budget 145820,1.0,2.0,2,...,0,1,0,0,1,0,0,1,0,0
4,2018-01-06 10:21:34+00:00,372.23907,budget 2233,organizational unit 65455,694.336959,declaration 53993,budget 146836,2.0,5.0,5,...,0,1,0,0,0,0,0,1,0,0


## Model Tuning

### Random Forest Regressor

**improvement:** use grid search instead of random search and also split the cross-validation by case and not by event.
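
A rough sketch of how both ideas could be combined, under stated assumptions. The helper `case_time_series_splits` is hypothetical: it orders cases by their first event, applies `TimeSeriesSplit` to the ordered cases rather than to the rows, and maps each fold back to row indices so that a case never sits on both sides of a fold. The data is random toy data, the small parameter grid is only there to keep the example fast, and the MAE scoring is an assumption rather than the notebook's choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit


def case_time_series_splits(case_ids, timestamps, n_splits=3):
    """Time-ordered CV folds made of whole cases (hypothetical helper)."""
    meta = pd.DataFrame({"case": np.asarray(case_ids), "time": np.asarray(timestamps)})
    # Order the cases by the time of their first event
    ordered_cases = meta.groupby("case")["time"].min().sort_values().index.to_numpy()

    splits = []
    for train_c, val_c in TimeSeriesSplit(n_splits=n_splits).split(ordered_cases):
        # Map the selected cases back to the row positions of their events
        train_idx = np.flatnonzero(meta["case"].isin(ordered_cases[train_c]).to_numpy())
        val_idx = np.flatnonzero(meta["case"].isin(ordered_cases[val_c]).to_numpy())
        splits.append((train_idx, val_idx))
    return splits


# Toy data: 8 cases with 3 events each (random values, for illustration only)
rng = np.random.default_rng(0)
n_cases, events_per_case = 8, 3
case_ids = np.repeat([f"case {i}" for i in range(n_cases)], events_per_case)
timestamps = pd.date_range("2018-01-01", periods=n_cases * events_per_case, freq="h")
X = rng.normal(size=(len(case_ids), 4))
y = rng.normal(size=len(case_ids))

cv = case_time_series_splits(case_ids, timestamps, n_splits=3)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [5, None], "n_estimators": [25, 50]},
    cv=cv,  # a list of (train_idx, val_idx) pairs is accepted as a CV iterable
    scoring="neg_mean_absolute_error",  # assumed metric
)
search.fit(X, y)
print(search.best_params_)
```

Ordering cases by start time does not remove every overlap: a training case may still be running when a validation case begins. Filtering on case end times, as in a case-level train/test split, would be needed to rule that out.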

In [None]:
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV

# Define the search space
grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [25, 50, 100, 200, 400, 600, 800, 1000, 1200, 1400]
}

# Time series cross-validation.
# The default test_size is used: with test_size=1 every validation fold holds a
# single row, for which the default R^2 score is undefined (nan).
tscv = TimeSeriesSplit(n_splits=3)

# Hyperparameter tuning with random search, because grid search takes too long to run
rf_random = RandomizedSearchCV(estimator=RandomForestRegressor(),
                               param_distributions=grid,
                               n_iter=20,
                               cv=tscv,
                               verbose=2, random_state=42, n_jobs=-1)

model = rf_random.fit(X_train, y_train)

model.best_estimator_

Fitting 3 folds for each of 20 candidates, totalling 60 fits


RandomForestRegressor(bootstrap=False, max_depth=70, max_features='sqrt',
                      min_samples_split=10, n_estimators=400)

In [None]:
# Final model
rf = RandomForestRegressor(bootstrap=False,
                           max_depth=70,
                           max_features='sqrt',
                           min_samples_split=10,
                           n_estimators=400).fit(X_train, y_train)