<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_V_9_TelCoExample_ForwardAndBackwardSelection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Selection: Telecom Work Measurement Study

To illustrate why model selection matters, we consider an example from a [Telecom Work Measurement Study](http://www.statsci.org/data/oz/telecom.html). The data originates from a study for a telecom company (we are only using an excerpt). According to the website, “the purpose of the study was to model the total hours worked in a section of Telecom in terms of the counts of various tasks. It was hoped that such a model could be used to predict hours worked and hence staffing requirements in changing circumstances. The number of hours worked by employees in a fault reporting centre were recorded, together with the number of faults of each type which were recorded. Employees often work on a flexitime system which allows them to build up time and to leave early every second Friday.”

We will use it to showcase that the most complex model is not necessarily the best. Specifically, we will run different models (i.e., including different sets of features) and compare them.

## Preliminaries

As usual, we start by loading the packages we will use:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

And let's load our data:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
data = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_V_4_tel_base.csv')
data.head()

Our target/label variable $y$ is 'Hours' (hours worked on a given day), and we have features relating to the activities performed (e.g., the number of service orders of type A/B/C given by SOA/SOB/SOC; the number of hotlines given by Hot) and the characteristics of the day (e.g., which day of the week it is).

Let's take a look at a few variables:

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(data['Hot'], data['Hours'])
plt.xlabel('Hot')
plt.ylabel('Hours')

# Add a trendline
X = sm.add_constant(data['Hot'])
model = sm.OLS(data['Hours'], X).fit()
plt.plot(data['Hot'], model.predict(X), color='red')

plt.show()

So it seems like hours worked are increasing in Hot.

In [None]:
ys = [data['Hours'][data['Tuesday'] == 1],data['Hours'][data['Wednesday'] == 1],
               data['Hours'][data['Thursday'] == 1],
               data['Hours'][data['Friday'] == 1]]
plt.boxplot(ys, labels=['Tuesday','Wednesday','Thursday','Friday'])
plt.show()


And it seems like fewer hours are worked on Friday!

## Linear Regression models

Let's start running our linear regression models. In line with our approach so far, we begin with the "full" model that includes all the features:

In [None]:
y = data['Hours']
X = data.drop(columns=['Hours'])
X = sm.add_constant(X)
model_full = sm.OLS(y, X).fit()
model_full.summary()

So, in line with our visualizations above, Hot carries a positive coefficient and Friday carries a negative coefficient. However, many of the coefficients are not significant. For instance, the day dummies other than Friday don't seem to be significant. Since only Friday appears to matter, let's drop the other days:

In [None]:
X = data.drop(columns=['Hours','Tuesday','Wednesday','Thursday'])
X = sm.add_constant(X)
model_reduced_days = sm.OLS(y, X).fit()
model_reduced_days.summary()

Let's also look at a reduced model with just the four variables whose coefficients seem likely to be different from zero (their t-statistics are reasonably high):

In [None]:
X = data[['RWT','SOA','Hot','Friday']]
X = sm.add_constant(X)
model_reduced = sm.OLS(y, X).fit()
model_reduced.summary()

Finally, let's run a model with Friday as the only feature:

In [None]:
X = data[['Friday']]
X = sm.add_constant(X)
model_simplest = sm.OLS(y, X).fit()
model_simplest.summary()

Let's compare the four models. We will do it in two different ways: On the $x$-axis, let's go from simple to more complex models (with the "full" model being the most complex). On the $y$-axis, we will look first at the **Mean-squared-Error** (average of the squared errors: $\frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2$) and then at the **R-squared**.
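To make the two measures concrete, here is a minimal sketch on made-up numbers. The arrays `y_obs` and `y_hat` are hypothetical and not taken from the telecom data. The MSE follows the formula above, and the R-squared is computed as one minus the ratio of the sum of squared errors to the total sum of squares.

```python
import numpy as np

# Hypothetical observed values and predictions (not from the telecom data)
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# Mean-squared error: average of the squared errors
mse = np.mean((y_obs - y_hat) ** 2)

# R-squared: 1 - SSE / SST
sse = np.sum((y_obs - y_hat) ** 2)
sst = np.sum((y_obs - np.mean(y_obs)) ** 2)
r_squared = 1 - sse / sst

print(f"MSE = {mse:.3f}, R-squared = {r_squared:.3f}")
# prints: MSE = 0.250, R-squared = 0.950
```

A lower MSE and a higher R-squared both indicate a closer fit, so the two plots should tell the same story when they are computed on the same data.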

Let's define the components:

In [None]:
mse_simplest = np.average(model_simplest.resid ** 2)
mse_reduced = np.average(model_reduced.resid ** 2)
mse_reduced_days = np.average(model_reduced_days.resid ** 2)
mse_full = np.average(model_full.resid ** 2)

model_names = ['Simplest','Reduced', 'Reduced Days', 'Full']
mse_values = [mse_simplest,mse_reduced, mse_reduced_days, mse_full]
rsquared_values = [model_simplest.rsquared, model_reduced.rsquared, model_reduced_days.rsquared, model_full.rsquared]

And plot the two:

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot the MSE values on the first subplot
axs[0].plot(model_names, mse_values, marker='o')
axs[0].set_xlabel('Model')
axs[0].set_ylabel('Average of Squared Errors (MSE)')
axs[0].set_title('Comparison of Model MSE')

# Plot the R-squared values on the second subplot
axs[1].plot(model_names, rsquared_values, marker='o')
axs[1].set_xlabel('Model')
axs[1].set_ylabel('R-squared')
axs[1].set_title('Comparison of Model R-squared')

# Adjust spacing between subplots
plt.tight_layout()

# Display the plot
plt.show()


So the most complex model, the "Full" model, produces the lowest mean-squared error and the highest R-squared. Judged this way, the **most complex model appears to be the best model**.

## Comparison based on New, Unseen Data

Let's follow the outlined strategy and compare the models based on previously unseen data:

In [None]:
data_addtl = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_V_4_tel_addtl.csv')
data_addtl.head()

Let's generate predictions for our four models based on this new dataset:

In [None]:
#Feature matrices for the four models
X_new = data_addtl.drop(columns=['Hours'])
X_new = sm.add_constant(X_new)
X_new_simplest = X_new[['Friday']]
X_new_simplest = sm.add_constant(X_new_simplest)
X_new_reduced = X_new[['RWT','SOA','Hot','Friday']]
X_new_reduced = sm.add_constant(X_new_reduced)
X_new_reduced_days = X_new.drop(columns=['Tuesday','Wednesday','Thursday'])

# Predict using the four models
predictions_full = model_full.predict(X_new)
predictions_simplest = model_simplest.predict(X_new_simplest)
predictions_reduced = model_reduced.predict(X_new_reduced)
predictions_reduced_days = model_reduced_days.predict(X_new_reduced_days)

And let's regenerate the plots from above using the new predictions:

In [None]:
# Calculate MSE for the new predictions
y_true = data_addtl['Hours']
mse_simplest_new = np.average((y_true - predictions_simplest) ** 2)
mse_reduced_new = np.average((y_true - predictions_reduced) ** 2)
mse_reduced_days_new = np.average((y_true - predictions_reduced_days) ** 2)
mse_full_new = np.average((y_true - predictions_full) ** 2)

# Calculate R-squared for the new predictions
sse_simplest_new = np.sum((y_true - predictions_simplest) ** 2)
sse_reduced_new = np.sum((y_true - predictions_reduced) ** 2)
sse_reduced_days_new = np.sum((y_true - predictions_reduced_days) ** 2)
sse_full_new = np.sum((y_true - predictions_full) ** 2)

sst_new = np.sum((y_true - np.mean(y_true)) ** 2)

rsquared_simplest_new = 1 - (sse_simplest_new / sst_new)
rsquared_reduced_new = 1 - (sse_reduced_new / sst_new)
rsquared_reduced_days_new = 1 - (sse_reduced_days_new / sst_new)
rsquared_full_new = 1 - (sse_full_new / sst_new)

# Plot the MSE and R-squared values for the new predictions
mse_values_new = [mse_simplest_new, mse_reduced_new, mse_reduced_days_new, mse_full_new]
rsquared_values_new = [rsquared_simplest_new, rsquared_reduced_new, rsquared_reduced_days_new, rsquared_full_new]

fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot the MSE values on the first subplot
axs[0].plot(model_names, mse_values_new, marker='o')
axs[0].set_xlabel('Model')
axs[0].set_ylabel('Average of Squared Errors (MSE)')
axs[0].set_title('Comparison of Model MSE (New Data)')

# Plot the R-squared values on the second subplot
axs[1].plot(model_names, rsquared_values_new, marker='o')
axs[1].set_xlabel('Model')
axs[1].set_ylabel('R-squared')
axs[1].set_title('Comparison of Model R-squared (New Data)')

# Adjust spacing between subplots
plt.tight_layout()

# Display the plot
plt.show()


So, clearly, the situation changes. The most complex model ("Full") is no longer the best model. Indeed, it turns out that the "Reduced Days" model produces the smallest MSE and the highest R-squared.

Notice that since typically we are predicting for previously unseen data, this perspective is the more relevant one. Hence, **generally, the most complex model is NOT the best model**.
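The contrast between in-sample and out-of-sample fit can also be explored on synthetic data. The following is a rough sketch under stated assumptions, not a reproduction of the telecom results: the data-generating process (one relevant feature plus four pure-noise features), the sample sizes, and the helper names `make_data`, `fit_ols`, and `compute_mse` are all made up for illustration. It fits a sequence of nested least-squares models with NumPy and reports the MSE on the training sample and on a fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic data (illustrative assumption): only the first feature
    # drives the target; the other four columns are pure noise.
    X = rng.normal(size=(n, 5))
    y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=n)
    return X, y

def fit_ols(X, y):
    # Least-squares coefficients for a model with an intercept
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def compute_mse(X, y, coef):
    # Average squared prediction error for the given coefficients
    A = np.column_stack([np.ones(len(y)), X])
    return np.mean((y - A @ coef) ** 2)

X_train, y_train = make_data(30)
X_test, y_test = make_data(30)

train_mse, test_mse = [], []
for k in range(1, 6):  # nested models using the first k features
    coef = fit_ols(X_train[:, :k], y_train)
    train_mse.append(compute_mse(X_train[:, :k], y_train, coef))
    test_mse.append(compute_mse(X_test[:, :k], y_test, coef))

print("in-sample MSE:     ", np.round(train_mse, 3))
print("out-of-sample MSE: ", np.round(test_mse, 3))
```

For nested least-squares models, the in-sample MSE cannot increase as features are added, which is why the in-sample comparison always favors the largest model. The out-of-sample MSE carries no such guarantee.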

## ANOVA

Let's compare the models based on an ANOVA. The advantage is that we can use the full dataset!

In [None]:
data_combined = pd.concat([data, data_addtl], ignore_index=True)

# Run the four models on the combined dataset
y_combined = data_combined['Hours']
X_combined = data_combined.drop(columns=['Hours'])
X_combined = sm.add_constant(X_combined)
model_full_combined = sm.OLS(y_combined, X_combined).fit()

X_combined_reduced_days = data_combined.drop(columns=['Hours','Tuesday','Wednesday','Thursday'])
X_combined_reduced_days = sm.add_constant(X_combined_reduced_days)
model_reduced_days_combined = sm.OLS(y_combined, X_combined_reduced_days).fit()

X_combined_reduced = data_combined[['RWT','SOA','Hot','Friday']]
X_combined_reduced = sm.add_constant(X_combined_reduced)
model_reduced_combined = sm.OLS(y_combined, X_combined_reduced).fit()

X_combined_simplest = data_combined[['Friday']]
X_combined_simplest = sm.add_constant(X_combined_simplest)
model_simplest_combined = sm.OLS(y_combined, X_combined_simplest).fit()

In [None]:
from statsmodels.stats.anova import anova_lm
anova_results = anova_lm(model_simplest_combined, model_reduced_combined, model_reduced_days_combined, model_full_combined)
print(anova_results)

So, again, there is no evidence that the more complex model 4 (full) is superior to the reduced days model 3 (bottom row). However, in this case there is also no evidence that the more complex model 3 is superior to the reduced model (row 2). In contrast, the reduced model does seem superior to the simplest model (row 1): The F-statistic is fairly large and the p-value is very small.
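For intuition on where such an F-statistic comes from, here is a minimal sketch of the standard F-test for two nested models, computed by hand with NumPy. The data are synthetic and the names `x1`, `x2`, `y_sim`, and `ssr_and_df` are hypothetical. Note also that `anova_lm` may scale by a different residual variance when more than two models are passed, so these numbers need not match its table exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative assumption): two features,
# only the first one truly affects the target.
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_sim = 1.0 + 2.0 * x1 + rng.normal(size=n)

def ssr_and_df(columns, y):
    # Fit OLS with an intercept; return the residual sum of squares and
    # the residual degrees of freedom (n minus number of coefficients).
    A = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return np.sum(resid ** 2), len(y) - A.shape[1]

ssr_small, df_small = ssr_and_df([x1], y_sim)      # restricted model
ssr_large, df_large = ssr_and_df([x1, x2], y_sim)  # larger model

# F = [(SSR_small - SSR_large) / (df_small - df_large)] / (SSR_large / df_large)
f_stat = ((ssr_small - ssr_large) / (df_small - df_large)) / (ssr_large / df_large)
print(f"F-statistic for adding x2: {f_stat:.3f}")
```

The numerator measures how much the residual sum of squares drops per added feature, and the denominator is the residual variance of the larger model. A large ratio suggests that the added features explain more than noise alone would.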

So, the ANOVA guidance is different: **The best model is the reduced model**! This emphasizes one problem with the "new data" approach above: It really depends on the specific sample we are looking at.

However, this raises the question of whether any of our four models is actually the best. Are there approaches to pick a suitable model / set of features more strategically?


## Model Selection

To select a suitable model, we will use *forward* and *backward selection*:

* **Forward selection**: Starting from the constant model, add features one-by-one; at each step, add the feature that improves the fit the most (as measured, e.g., by R-squared or AIC). We then compare AIC across the resulting models with different numbers of features to select the model at the *sweet spot*.

* **Backward selection**: Starting from the full model, drop features one-by-one; at each step, drop the feature whose removal hurts the fit the least (as measured, e.g., by R-squared or AIC). We then compare AIC across the resulting models with different numbers of features to select the model at the *sweet spot*.
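Since both procedures rely on AIC, it may help to see how it trades off fit against complexity. The sketch below assumes the usual Gaussian log-likelihood of a linear model and counts only the regression coefficients (including the constant) as parameters. The function `gaussian_aic` and the numbers fed into it are hypothetical. Counting conventions differ across software (some also count the error variance), which shifts every AIC by the same constant and so leaves the ranking unchanged.

```python
import numpy as np

def gaussian_aic(ssr, n_obs, n_params):
    # Maximized Gaussian log-likelihood of a linear model, written in
    # terms of the residual sum of squares (ssr)
    llf = -0.5 * n_obs * (np.log(2 * np.pi) + np.log(ssr / n_obs) + 1)
    # AIC = 2 * (number of estimated coefficients) - 2 * log-likelihood
    return 2 * n_params - 2 * llf

# Hypothetical numbers: an extra feature lowers the SSR from 120 to 118
aic_small = gaussian_aic(ssr=120.0, n_obs=60, n_params=3)
aic_large = gaussian_aic(ssr=118.0, n_obs=60, n_params=4)
print(f"AIC (3 coefficients): {aic_small:.2f}")
print(f"AIC (4 coefficients): {aic_large:.2f}")
```

In this made-up case the larger model ends up with the higher AIC: the small reduction in the sum of squared residuals does not offset the penalty of 2 for the additional coefficient. Lower AIC is better, so the smaller model would be preferred.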

`statsmodels` does not have an automatic procedure for this, so we need to code it up ourselves (although the AI functionality of Colab helps).

### Forward Selection

Let's start with forward selection:

In [None]:
remaining_features = list(X_combined.columns[1:])  # Exclude the constant term
selected_features = []
model_sequence = []
aic_values = []

while remaining_features:
    best_aic = float('inf')
    best_feature = None
    best_model = None

    for feature in remaining_features:
        candidate_features = selected_features + [feature]
        X_candidate = X_combined[candidate_features]
        X_candidate = sm.add_constant(X_candidate)
        candidate_model = sm.OLS(y_combined, X_candidate).fit()
        candidate_aic = candidate_model.aic

        if candidate_aic < best_aic:
            best_aic = candidate_aic
            best_feature = feature
            best_model = candidate_model

    selected_features.append(best_feature)
    remaining_features.remove(best_feature)
    model_sequence.append(best_model)
    aic_values.append(best_aic)

# Plot AIC values against number of features
plt.plot(range(1, len(aic_values) + 1), aic_values, marker='o')
plt.xlabel('Number of Features')
plt.ylabel('AIC')
plt.title('Forward Selection Based on AIC')
plt.show()

# Find the model with the lowest AIC
best_model_index = np.argmin(aic_values)
best_model = model_sequence[best_model_index]

# Print the summary of the best model
print(best_model.summary())


So, we clearly see the U-shape in the complexity-AIC plot. The *sweet spot* is a model with five features, which is similar to our intuitively selected model above but not identical.

### Backward Selection

Let's also look at a backward-selected model:

In [None]:
remaining_features = list(X_combined.columns[1:])
selected_features = remaining_features.copy()
model_sequence = []
aic_values = []

while remaining_features:
    best_aic = float('inf')
    worst_feature = None
    best_model = None

    X_current = X_combined[selected_features]
    X_current = sm.add_constant(X_current)
    current_model = sm.OLS(y_combined, X_current).fit()
    current_aic = current_model.aic

    for feature in remaining_features:
        candidate_features = selected_features.copy()
        candidate_features.remove(feature)

        X_candidate = X_combined[candidate_features]
        X_candidate = sm.add_constant(X_candidate)
        candidate_model = sm.OLS(y_combined, X_candidate).fit()
        candidate_aic = candidate_model.aic

        if candidate_aic < best_aic:
            best_aic = candidate_aic
            worst_feature = feature
            best_model = candidate_model

    if best_aic >= current_aic:
        break  # Stop if removing any feature increases AIC

    selected_features.remove(worst_feature)
    remaining_features.remove(worst_feature)
    model_sequence.append(best_model)
    aic_values.append(best_aic)

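# The last model in the sequence is the best one; if no removal lowered
# the AIC, fall back to the full model fitted in the loop
best_model = model_sequence[-1] if model_sequence else current_model
print(best_model.summary().tables[1])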


So, it turns out that forward and backward selection lead us to the identical model. However, this doesn't need to be the case!