# Least Squares Regression Analysis of Hell Let Loose Game Data

This analysis will show which factors are the most important when predicting the number of "victory points" (and therefore Win or Loss result) gained during a match of Hell Let Loose. The prevailing assumption is that the number of nodes built and squad leader quality have the greatest effect on match outcome. Using regression analysis on data that we've collected, we will attempt to see if that assumption holds up.

## Importing and prepping the data:

In [57]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.api as sms

In [58]:
import warnings

warnings.filterwarnings("ignore")

In [69]:
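import os

# Build the path to the CSV relative to the current working directory
path = os.path.join(os.getcwd(), "raw_data", "HLL.csv")

# Game_ID is a plain integer index, so no date parsing is requested for it
hll_df = pd.read_csv(path, index_col='Game_ID')

hll_df.info()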

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 1 to 60
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      60 non-null     object
 1   Map       60 non-null     object
 2   Mode      60 non-null     object
 3   Side      60 non-null     object
 4   Takeover  60 non-null     int64 
 5   Nodes     60 non-null     int64 
 6   SL_Qual   60 non-null     int64 
 7   Points    60 non-null     int64 
 8   Win       60 non-null     int64 
dtypes: int64(5), object(4)
memory usage: 4.7+ KB


## Building a Model:

Because we are working with a relatively small dataset (<100 points), and the emphasis is on model fit as opposed to prediction, we will not utilize a test/training split.

We will not use the `Takeover` factor, as it is exclusive to commander stats and we want the model to apply to all matches.

In [93]:
from statsmodels.formula.api import ols

# Response vector, reused later for the scikit-learn fits
y = hll_df['Points']

# The formula interface adds the intercept and dummy-codes Map, Mode, and Side itself
est = ols('Points ~ Map + Mode + Side + Nodes + SL_Qual', data=hll_df).fit()
est.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | Points | R-squared | 0.686 |
| Model | OLS | Adj. R-squared | 0.578 |
| Method | Least Squares | F-statistic | 6.398 |
| Date | Sun, 30 Jan 2022 | Prob (F-statistic) | 6.72e-07 |
| Time | 18:47:46 | Log-Likelihood | -93.914 |
| No. Observations | 60 | AIC | 219.8 |
| Df Residuals | 44 | BIC | 253.3 |
| Df Model | 15 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -4.1214 | 1.127 | -3.657 | 0.001 | -6.393 | -1.850 |
| Map[T.Foy] | 0.3690 | 0.800 | 0.461 | 0.647 | -1.243 | 1.981 |
| Map[T.Hill 400] | -0.1938 | 0.879 | -0.220 | 0.827 | -1.966 | 1.578 |
| Map[T.Hurtgen Forest] | -0.6907 | 1.009 | -0.685 | 0.497 | -2.724 | 1.343 |
| Map[T.Kursk] | 0.0093 | 1.073 | 0.009 | 0.993 | -2.153 | 2.171 |
| Map[T.Omaha Beach] | 1.1954 | 1.047 | 1.142 | 0.260 | -0.914 | 3.305 |
| Map[T.Purple Heart Lane] | -0.7306 | 0.911 | -0.802 | 0.427 | -2.566 | 1.105 |
| Map[T.St Marie Du Mont] | -0.7825 | 0.883 | -0.887 | 0.380 | -2.561 | 0.996 |
| Map[T.St Mere Eglise] | -0.5611 | 0.797 | -0.704 | 0.485 | -2.166 | 1.044 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 1.027 | Durbin-Watson | 2.526 |
| Prob(Omnibus) | 0.598 | Jarque-Bera (JB) | 0.816 |
| Skew | -0.284 | Prob(JB) | 0.665 |
| Kurtosis | 2.939 | Cond. No. | 46.1 |


ANOVA:

In [94]:
from statsmodels.stats.anova import anova_lm

# Type II ANOVA: each factor is tested after accounting for all the others
table = anova_lm(est, typ=2)

table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| Map | 12.889119 | 10.0 | 0.705431 | 0.7142762 |
| Mode | 20.027299 | 1.0 | 10.961089 | 0.001864208 |
| Side | 4.226684 | 2.0 | 1.156648 | 0.323917 |
| Nodes | 0.120838 | 1.0 | 0.066135 | 0.7982478 |
| SL_Qual | 62.52237 | 1.0 | 34.218955 | 5.609745e-07 |
| Residual | 80.393579 | 44.0 | | |
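As a sanity check on where the F column comes from: each F value is the factor's mean square (`sum_sq / df`) divided by the residual mean square. The snippet below is a minimal sketch that reproduces two of the values from the ANOVA table; the helper name `f_stat` is only illustrative and is not part of statsmodels.

```python
# Residual row copied from the ANOVA table above
resid_ss, resid_df = 80.393579, 44.0

def f_stat(sum_sq, df):
    """F = (factor mean square) / (residual mean square)."""
    return (sum_sq / df) / (resid_ss / resid_df)

f_mode = f_stat(20.027299, 1.0)   # Mode row
f_sl = f_stat(62.52237, 1.0)      # SL_Qual row
f_nodes = f_stat(0.120838, 1.0)   # Nodes row

print(round(f_mode, 3), round(f_sl, 3), round(f_nodes, 3))
```

Seen this way, the tiny F value for Nodes simply says that Nodes explains far less variation in Points than the leftover noise does.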


Before we eliminate any factors as not significant, we will check for multicollinearity using the Variance Inflation Factor (VIF) as our metric. We will need to create dummy variables for our categorical factors:

In [95]:
# Create Dummy Variables

hll_df2 = pd.get_dummies(data = hll_df, columns = ["Map", "Mode", "Side"])

hll_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 1 to 60
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Date                   60 non-null     object
 1   Takeover               60 non-null     int64 
 2   Nodes                  60 non-null     int64 
 3   SL_Qual                60 non-null     int64 
 4   Points                 60 non-null     int64 
 5   Win                    60 non-null     int64 
 6   Map_Carentan           60 non-null     uint8 
 7   Map_Foy                60 non-null     uint8 
 8   Map_Hill 400           60 non-null     uint8 
 9   Map_Hurtgen Forest     60 non-null     uint8 
 10  Map_Kursk              60 non-null     uint8 
 11  Map_Omaha Beach        60 non-null     uint8 
 12  Map_Purple Heart Lane  60 non-null     uint8 
 13  Map_St Marie Du Mont   60 non-null     uint8 
 14  Map_St Mere Eglise     60 non-null     uint8 
 15  Map_Stalingrad         60

In [96]:
# Drop last dummy variable for each category (following "n-1" rule)

hll_df2.drop(["Map_Utah Beach", "Mode_W", "Side_U"], axis = 1, inplace = True)

hll_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 1 to 60
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Date                   60 non-null     object
 1   Takeover               60 non-null     int64 
 2   Nodes                  60 non-null     int64 
 3   SL_Qual                60 non-null     int64 
 4   Points                 60 non-null     int64 
 5   Win                    60 non-null     int64 
 6   Map_Carentan           60 non-null     uint8 
 7   Map_Foy                60 non-null     uint8 
 8   Map_Hill 400           60 non-null     uint8 
 9   Map_Hurtgen Forest     60 non-null     uint8 
 10  Map_Kursk              60 non-null     uint8 
 11  Map_Omaha Beach        60 non-null     uint8 
 12  Map_Purple Heart Lane  60 non-null     uint8 
 13  Map_St Marie Du Mont   60 non-null     uint8 
 14  Map_St Mere Eglise     60 non-null     uint8 
 15  Map_Stalingrad         60

In [97]:
hll_df2.columns

Index(['Date', 'Takeover', 'Nodes', 'SL_Qual', 'Points', 'Win', 'Map_Carentan',
       'Map_Foy', 'Map_Hill 400', 'Map_Hurtgen Forest', 'Map_Kursk',
       'Map_Omaha Beach', 'Map_Purple Heart Lane', 'Map_St Marie Du Mont',
       'Map_St Mere Eglise', 'Map_Stalingrad', 'Mode_O', 'Side_G', 'Side_S'],
      dtype='object')
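To see why one level per category is dropped (the "n-1" rule), note that a full set of dummy columns sums to 1 in every row, which duplicates the intercept column and leaves the design matrix rank-deficient. Here is a minimal sketch with a made-up three-level factor; the toy data is hypothetical and not taken from HLL.csv.

```python
import numpy as np

# Hypothetical factor with three levels observed over six rows
levels = np.array([0, 1, 2, 0, 1, 2])

# One dummy column per level (the full encoding, nothing dropped)
full_dummies = np.eye(3)[levels]
const = np.ones((len(levels), 1))

X_full = np.hstack([const, full_dummies])             # intercept + 3 dummies
X_dropped = np.hstack([const, full_dummies[:, :-1]])  # intercept + 2 dummies

# The full encoding has 4 columns but only rank 3,
# because the dummies add up to the constant column
rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)
print(X_dropped.shape[1], rank_dropped)
```

Dropping one level restores full column rank, and the dropped level becomes the baseline that the remaining dummy coefficients are measured against.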

In [98]:
# Design matrix for the VIF check: numeric factors plus the remaining dummy columns
X2 = hll_df2[['Nodes', 'SL_Qual', 'Map_Carentan',
       'Map_Foy', 'Map_Hill 400', 'Map_Hurtgen Forest', 'Map_Kursk',
       'Map_Omaha Beach', 'Map_Purple Heart Lane', 'Map_St Marie Du Mont',
       'Map_St Mere Eglise', 'Map_Stalingrad', 'Mode_O', 'Side_G', 'Side_S']]

# variance_inflation_factor does not add an intercept itself, so add one here
X2 = sm.add_constant(X2)



In [99]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Creating a dataframe that will contain the names of all the feature variables and their VIFs
vif = pd.DataFrame()
vif['Features'] = X2.columns
vif['VIF'] = [variance_inflation_factor(X2.values, i) for i in range(X2.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

| | Features | VIF |
|---|---|---|
| 0 | const | 28.18 |
| 15 | Side_S | 3.16 |
| 12 | Map_Stalingrad | 3.11 |
| 8 | Map_Omaha Beach | 2.4 |
| 7 | Map_Kursk | 2.38 |
| 2 | SL_Qual | 2.22 |
| 4 | Map_Foy | 2.02 |
| 1 | Nodes | 1.96 |
| 13 | Mode_O | 1.96 |
| 11 | Map_St Mere Eglise | 1.82 |


We will use the generally accepted rule of thumb that any factor with a VIF > 5 is highly collinear with the other factors and should be removed from our model. Here, none of our factors meets that criterion (the `const` row is the intercept rather than a factor, so it is ignored).
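For reference, the VIF of a factor is 1 / (1 - R²), where R² comes from regressing that factor on all of the other factors. The sketch below computes it by hand on synthetic data; the function `vif` and the columns `a`, `b`, `c` are hypothetical and exist only for illustration. Unlike the statsmodels helper, this version adds its own intercept to each auxiliary regression.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the remaining columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)             # independent of a
c = a + 0.1 * rng.normal(size=200)   # nearly a copy of a

X_demo = np.column_stack([a, b, c])
vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
print(np.round(vifs, 2))
```

In this toy setup the near-duplicate columns `a` and `c` land far above the VIF > 5 threshold, while the independent column `b` stays close to 1.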

From our first look, it appears that the most important factors in point outcome are Squad Leader Quality (`SL_Qual`) and Mode. We'll confirm this using Recursive Feature Elimination (RFE).

## Recursive Feature Elimination:

This is an automated process that repeatedly fits the model and removes the weakest features (for a linear model, those with the smallest coefficient magnitudes) until only the requested number remain. Let's start with the top 5 most important features:

In [100]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV

In [106]:
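lm = LinearRegression()

# n_features_to_select is passed by keyword, as newer scikit-learn releases require
rfe = RFE(lm, n_features_to_select=5)
rfe = rfe.fit(X2, y)

# Pair each column with its keep/drop flag and its elimination rank (1 = kept)
list(zip(X2.columns, rfe.support_, rfe.ranking_))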

[('const', False, 12),
 ('Nodes', False, 10),
 ('SL_Qual', True, 1),
 ('Map_Carentan', False, 9),
 ('Map_Foy', False, 7),
 ('Map_Hill 400', False, 11),
 ('Map_Hurtgen Forest', False, 5),
 ('Map_Kursk', False, 8),
 ('Map_Omaha Beach', True, 1),
 ('Map_Purple Heart Lane', False, 4),
 ('Map_St Marie Du Mont', True, 1),
 ('Map_St Mere Eglise', False, 6),
 ('Map_Stalingrad', True, 1),
 ('Mode_O', True, 1),
 ('Side_G', False, 3),
 ('Side_S', False, 2)]
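The ranking above is easier to read with the underlying loop in mind: fit the model, drop the feature with the smallest absolute coefficient, and repeat. The following is a rough sketch of that idea on synthetic data. It is not scikit-learn's exact implementation, and `manual_rfe`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np

def manual_rfe(X, y, n_keep):
    """Return the surviving column indices and the order in which others were dropped."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))  # skip the intercept
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
# Only columns 0 and 2 drive the response; columns 1 and 3 are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.normal(size=300)

kept, dropped = manual_rfe(X_demo, y_demo, n_keep=2)
print(sorted(kept), dropped)
```

One caveat: because the criterion is coefficient size, features measured on different scales (a 0/1 dummy versus a count such as Nodes) are not directly comparable unless they are standardized first.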

In [108]:
rfecv = RFECV(
    estimator=LinearRegression(),
    min_features_to_select=1,
    step=5,
    n_jobs=-1,
    scoring="r2",
    cv=5,
)

_ = rfecv.fit(X2, y)

X2.columns[rfecv.support_]

Index(['SL_Qual', 'Map_Omaha Beach', 'Map_St Marie Du Mont', 'Map_Stalingrad',
       'Mode_O', 'Side_S'],
      dtype='object')

In [114]:
X2 = hll_df2[['SL_Qual', 'Map_Omaha Beach', 'Map_St Marie Du Mont', 'Map_Stalingrad',
       'Mode_O', 'Side_S']] 

y = hll_df2['Points']

X2 = sm.add_constant(X2)
est = sm.OLS(y, X2).fit()
est.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | Points | R-squared | 0.65 |
| Model | OLS | Adj. R-squared | 0.611 |
| Method | Least Squares | F-statistic | 16.42 |
| Date | Sun, 30 Jan 2022 | Prob (F-statistic) | 1.41e-10 |
| Time | 19:05:30 | Log-Likelihood | -97.121 |
| No. Observations | 60 | AIC | 208.2 |
| Df Residuals | 53 | BIC | 222.9 |
| Df Model | 6 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.9092 | 0.559 | -3.416 | 0.001 | -3.030 | -0.788 |
| SL_Qual | 2.1986 | 0.232 | 9.458 | 0.000 | 1.732 | 2.665 |
| Map_Omaha Beach | 1.0479 | 0.784 | 1.337 | 0.187 | -0.524 | 2.620 |
| Map_St Marie Du Mont | -0.7675 | 0.617 | -1.244 | 0.219 | -2.005 | 0.470 |
| Map_Stalingrad | 0.9183 | 0.724 | 1.268 | 0.210 | -0.535 | 2.371 |
| Mode_O | -2.0689 | 0.656 | -3.153 | 0.003 | -3.385 | -0.753 |
| Side_S | -0.5734 | 0.646 | -0.887 | 0.379 | -1.870 | 0.723 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 0.114 | Durbin-Watson | 2.604 |
| Prob(Omnibus) | 0.945 | Jarque-Bera (JB) | 0.309 |
| Skew | -0.038 | Prob(JB) | 0.857 |
| Kurtosis | 2.657 | Cond. No. | 15.8 |
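The fit statistics in the two summaries can be reproduced by hand, which makes the comparison between the full and reduced models more concrete. The sketch below plugs in the reported log-likelihoods and R-squared values; the helper names `aic` and `adj_r2` are illustrative, and `k` is assumed to be the number of estimated coefficients including the intercept (Df Model + 1).

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2*lnL (lower is better)."""
    return 2 * k - 2 * log_likelihood

def adj_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 60

# Full model: Df Model = 15, Log-Likelihood = -93.914, R-squared = 0.686
aic_full = aic(-93.914, k=15 + 1)
adj_full = adj_r2(0.686, n, p=15)

# Reduced model: Df Model = 6, Log-Likelihood = -97.121, R-squared = 0.65
aic_reduced = aic(-97.121, k=6 + 1)
adj_reduced = adj_r2(0.65, n, p=6)

print(round(aic_full, 1), round(aic_reduced, 1))
print(round(adj_full, 3), round(adj_reduced, 3))
```

The reduced model gives up a little raw R-squared but has the lower AIC and the higher adjusted R-squared, which is consistent with the dropped factors adding little. Small differences in the last digit of the adjusted values come from using the rounded R-squared figures.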


VIF Check:

In [110]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = X2.columns
vif['VIF'] = [variance_inflation_factor(X2.values, i) for i in range(X2.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

| | Features | VIF |
|---|---|---|
| 0 | const | 11.1 |
| 5 | Mode_O | 1.77 |
| 4 | Map_Stalingrad | 1.68 |
| 2 | Map_Omaha Beach | 1.67 |
| 6 | Side_S | 1.53 |
| 1 | SL_Qual | 1.09 |
| 3 | Map_St Marie Du Mont | 1.03 |


## Conclusion

**When it comes to victory points, Nodes don't matter.** We were able to eliminate this factor fairly early in our analysis (ANOVA p ≈ 0.80). Interestingly, this analysis period covers the Update 11 change to the Commander's ability "Encourage." Even with this nerf, nodes are still not a significant factor in the outcome of the game. The Node mechanics in Hell Let Loose need to be changed in order to make this a value-added game feature.

The top three factors that affect a match's outcome are, in order: Squad Leader Quality, Game Mode, and whether or not you're playing Omaha Beach. Only the first two are statistically significant at the 0.05 level in the reduced model (p < 0.001 and p = 0.003); the Omaha Beach effect (p = 0.187) is suggestive at best. These conclusions are not particularly groundbreaking to any long-term veteran of the game, but I hope that this analysis has shed some light on the assumptions that most players hold.

In the future, we should look to incorporate Tank effectiveness as a separate category in our analysis, since there is a good chance that it affects game outcome. Did I miss anything else? Let me know what you think should be included!

*Note: All of these findings are simply results from my specific sample. They could easily change with more data.*