# Major Leagues Assignment # 04
---

In [21]:
%%time

# Import the required modules
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Import the regressors and the evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

sns.set (color_codes = True)

# For plotting inline
%matplotlib inline

CPU times: user 2.63 ms, sys: 0 ns, total: 2.63 ms
Wall time: 2.32 ms


## Loading the NFL Data
---
For this task, I selected the NFL games dataset for exploration and regression-based modeling. First, I load the data into a pandas DataFrame and view the first five rows to get a rough idea of the dataset.

In [22]:
game_df = pd.read_csv ("../data/nfl_games.csv", parse_dates = ['date'], dtype = {'season': np.int32, 'neutral': np.int32, 'playoff': np.int32, 'score1': np.int32, 'score2': np.int32})
game_df.head ()

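| | date | season | neutral | playoff | team1 | team2 | elo1 | elo2 | elo_prob1 | score1 | score2 | result1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1920-09-26 | 1920 | 0 | 0 | RII | STP | 1503.947 | 1300.0 | 0.824651 | 48 | 0 | 1.0 |
| 1 | 1920-10-03 | 1920 | 0 | 0 | AKR | WHE | 1503.42 | 1300.0 | 0.824212 | 43 | 0 | 1.0 |
| 2 | 1920-10-03 | 1920 | 0 | 0 | RCH | ABU | 1503.42 | 1300.0 | 0.824212 | 10 | 0 | 1.0 |
| 3 | 1920-10-03 | 1920 | 0 | 0 | DAY | COL | 1493.002 | 1504.908 | 0.575819 | 14 | 0 | 1.0 |
| 4 | 1920-10-03 | 1920 | 0 | 0 | RII | MUN | 1516.108 | 1478.004 | 0.644171 | 45 | 0 | 1.0 |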


## Feature Engineering - New Column Match Addition
---
A new column, "match", is created to identify the two teams playing each game. It is used below to look up previous encounters between the same two teams.

In [23]:
game_df['match'] = game_df['team1'] + '-' + game_df['team2']
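
Note that this key is order-sensitive: "PIT-JAX" and "JAX-PIT" are different strings, which is why the versus-history lookup below has to check both orders. As a minimal sketch, an order-independent key could be built by sorting the two team codes before joining them. The toy frame and the `pair` column are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame (hypothetical data, same column names as game_df)
toy = pd.DataFrame({'team1': ['PIT', 'JAX', 'NE'],
                    'team2': ['JAX', 'PIT', 'PHI']})

# Order-sensitive key, built the same way as in the notebook
toy['match'] = toy['team1'] + '-' + toy['team2']

# Hypothetical order-independent key: sort the two codes before joining
toy['pair'] = ['-'.join(sorted(p)) for p in zip(toy['team1'], toy['team2'])]

print(toy)
```

With such a key, both home/away orderings of the same fixture would share one value, at the cost of losing the information about which side was team1.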

## Feature Engineering - Streaks and Versus History
---
To predict the match outcome, it helps to take into account the "current form" of both teams and their history in previous encounters against each other. These are captured by the streak features (streak1, streak2) and the prev_result1 feature respectively. Both are computed with the following functions.

In [24]:
%%time
pd.options.mode.chained_assignment = None

def get_last_matches (db, team, date, x = 5):
    ''' Return the last x matches for the input team '''
    
    # Filter team matches from matches
    team_matches = db [(db ['team1'] == team) | (db ['team2'] == team)]
                           
    # Filter x last matches from team matches
    last_matches = team_matches [team_matches.date < date].sort_values (by = 'date', ascending = False).iloc[0:x,:]
    
    # Return last matches
    return last_matches

def get_last_match_against (db, team1, team2, date, x = 5):
    ''' Return the number of wins of team1 in its last x matches against team2 before the given date '''
    
    # Filter team matches from matches
    team_matches = db [(db ['match'] == '-'.join ([team1, team2])) | (db ['match'] == '-'.join ([team2, team1]))]
    team_matches ['result2'] = 1 - team_matches ['result1']
                           
    # Filter x last matches from vs matches
    last_matches = team_matches [team_matches.date < date].sort_values (by = 'date', ascending = False).iloc[0:x,:]
    
    team1_wins_as_team1 = last_matches [last_matches ['team1'] == team1]['result1'].sum ()
    team1_wins_as_team2 = last_matches [last_matches ['team2'] == team1]['result2'].sum ()
    team1_wins = team1_wins_as_team1 + team1_wins_as_team2
    
    # Return the number of team1 wins in those matches
    return team1_wins

game_df ['prev_result1'] = game_df.apply (lambda x: get_last_match_against (game_df, x ['team1'], x ['team2'], x ['date']), axis = 1)

CPU times: user 2min 52s, sys: 104 ms, total: 2min 52s
Wall time: 2min 52s
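
To make the versus-history feature concrete, here is a small sketch that restates the same idea on a four-row toy frame. The function name `head_to_head_wins` and the data are hypothetical; it is meant as an illustration of the logic, not a replacement for the cell above.

```python
import pandas as pd

# Toy schedule (hypothetical data); result1 is the outcome from team1's side
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01']),
    'team1': ['NE', 'PHI', 'NE', 'NE'],
    'team2': ['PHI', 'NE', 'PHI', 'PHI'],
    'result1': [1.0, 1.0, 0.0, 1.0],
})

def head_to_head_wins(db, team1, team2, date, x=5):
    """Wins of team1 over team2 in their last x meetings before date."""
    pair = (((db['team1'] == team1) & (db['team2'] == team2)) |
            ((db['team1'] == team2) & (db['team2'] == team1)))
    past = db[pair & (db['date'] < date)].sort_values('date', ascending=False).head(x)
    # Keep result1 where team1 was listed first, otherwise flip it
    wins = past['result1'].where(past['team1'] == team1, 1 - past['result1'])
    return wins.sum()

wins = head_to_head_wins(toy, 'NE', 'PHI', pd.Timestamp('2017-04-01'))
print(wins)  # 1.0: NE won the first meeting, lost the second and the third
```

Only games strictly before the given date are counted, so the feature never uses the result of the game being predicted. Because every row triggers a filter over the whole table, the runtime of the `apply` call grows quickly with the number of games, which is consistent with the timing of several minutes shown above.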


In [25]:
%%time


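def get_streak (db, team, date, x = 5):
    ''' Return 1 if the team avoided defeat in all of its last x matches before the given date, else 0 '''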
    match_history = get_last_matches (db, team, date, x)
    match_history ['result2'] = 1 - match_history ['result1']
    match_history ['result1'] = match_history ['result1'].apply (np.ceil)
    match_history ['result2'] = match_history ['result2'].apply (np.ceil)
   
    result_as_team1 = match_history [match_history ['team1'] == team]['result1'].sum ()
    result_as_team2 = match_history [match_history ['team2'] == team]['result2'].sum ()
    result = result_as_team1 + result_as_team2
    
    if result == x:
        return 1
    else:
        return 0

game_df ['streak1'] = game_df.apply (lambda x: get_streak (game_df, x ['team1'], x ['date']), axis = 1)

CPU times: user 3min 5s, sys: 87.9 ms, total: 3min 5s
Wall time: 3min 5s


In [26]:
%%time
game_df ['streak2'] = game_df.apply (lambda x: get_streak (game_df, x ['team2'], x ['date']), axis = 1)

CPU times: user 3min 13s, sys: 99.9 ms, total: 3min 14s
Wall time: 3min 14s
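
The streak flag can be illustrated on a toy schedule. This is a sketch under two assumptions: ties are stored as 0.5 in result1 (which the `np.ceil` calls above suggest), and the helper name `on_streak` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy schedule (hypothetical data); 0.5 stands for a tie
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-01-08', '2017-01-15', '2017-01-22']),
    'team1': ['NE', 'MIA', 'NE', 'NE'],
    'team2': ['MIA', 'NE', 'BUF', 'NYJ'],
    'result1': [1.0, 0.0, 0.5, 1.0],
})

def on_streak(db, team, date, x=3):
    """Return 1 if team avoided defeat in all of its last x games before date, else 0."""
    played = (db['team1'] == team) | (db['team2'] == team)
    past = db[played & (db['date'] < date)].sort_values('date', ascending=False).head(x)
    # Outcome from the team's own side; ceil turns a tie (0.5) into 1
    own = past['result1'].where(past['team1'] == team, 1 - past['result1'])
    return int(np.ceil(own).sum() == x)

streak = on_streak(toy, 'NE', pd.Timestamp('2017-01-22'))
print(streak)  # 1: two wins and one tie in the three previous games
```

A team with fewer than x earlier games can never reach the required total, so it always gets 0.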


## Further Data Cleanup - Factorization and Dropping Redundant/Irrelevant Features
---
Since regressors work with numeric inputs, I factorize a few categorical features to make them usable for the regression-based modeling later. I also drop a few features that will not be used (season and the Elo columns).

In [27]:
# Factorize the categorical feature columns
game_df ['matchId'] = game_df ['match'].factorize () [0]
game_df ['result1'] = game_df ['result1'].apply (np.int32)

game_df ['team1Id'] = game_df ['team1'].factorize () [0]
game_df ['team2Id'] = game_df ['team2'].factorize () [0]

game_df = game_df.drop (['season', 'elo1', 'elo2', 'elo_prob1'], axis = 1)

game_df.tail ()

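| | date | neutral | playoff | team1 | team2 | score1 | score2 | result1 | match | prev_result1 | streak1 | streak2 | matchId | team1Id | team2Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16002 | 2018-01-14 | 0 | 1 | PIT | JAX | 42 | 45 | 0 | PIT-JAX | 3.0 | 0 | 0 | 1411 | 61 | 105 |
| 16003 | 2018-01-14 | 0 | 1 | MIN | NO | 29 | 24 | 1 | MIN-NO | 1.0 | 0 | 0 | 972 | 90 | 100 |
| 16004 | 2018-01-21 | 0 | 1 | NE | JAX | 24 | 20 | 1 | NE-JAX | 5.0 | 0 | 0 | 1431 | 81 | 105 |
| 16005 | 2018-01-21 | 0 | 1 | PHI | MIN | 38 | 7 | 1 | PHI-MIN | 3.0 | 0 | 0 | 866 | 63 | 97 |
| 16006 | 2018-02-04 | 1 | 1 | NE | PHI | 33 | 41 | 0 | NE-PHI | 4.0 | 1 | 0 | 1288 | 81 | 70 |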
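
One detail visible in the output above: team1 and team2 are factorized independently, so the same team can receive two different codes (PHI has team1Id 63 but team2Id 70). The sketch below shows this on toy data and, as a hypothetical alternative, factorizes both columns against one shared vocabulary. The `team1Shared` and `team2Shared` column names are made up for the illustration.

```python
import pandas as pd

# Toy frame (hypothetical data)
toy = pd.DataFrame({'team1': ['PIT', 'NE', 'PHI'],
                    'team2': ['JAX', 'PHI', 'NE']})

# Independent factorization, as in the notebook:
# codes follow the order of first appearance within each column
toy['team1Id'] = toy['team1'].factorize()[0]
toy['team2Id'] = toy['team2'].factorize()[0]

# Hypothetical alternative: one shared vocabulary for both columns
codes, teams = pd.factorize(pd.concat([toy['team1'], toy['team2']]))
toy['team1Shared'] = codes[:len(toy)]
toy['team2Shared'] = codes[len(toy):]

print(toy)
```

Here NE gets code 1 as team1 and code 2 as team2 under independent factorization, but code 1 in both shared columns. Whether this matters depends on the model; tree-based regressors can still split on either column, but the codes are easier to interpret when they agree.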


## Regressing the Scores for Both Teams - MultiOutputRegressor
---
Since we need two outputs (the scores of both teams) from our regression model, I use MultiOutputRegressor from the scikit-learn library. MultiOutputRegressor fits one copy of a standard single-output scikit-learn regressor per target, so here it fits two, one for each score. For example, wrapping a Random Forest Regressor in MultiOutputRegressor gives multi-output regression. The targets to be predicted are score1 and score2. First I separate the features from the targets and remove a number of other columns from the data table. After this step, the data are ready for modeling.

In [28]:
labels = np.array ([game_df ['score1'],game_df ['score2']]).T
features = game_df.drop (['date', 'score1', 'score2','match','team1','team2','matchId'], axis = 1)
feature_names = list (features.columns)
features = np.array (features)
features.shape

(16007, 8)
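
As a quick illustration of what the wrapper does, the sketch below fits a MultiOutputRegressor on synthetic data (the arrays and the name `mor_demo` are hypothetical stand-ins, not the NFL features) and inspects the fitted sub-models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                      # synthetic features
Y = np.column_stack([X[:, 0] * 30,        # stand-in for score1
                     X[:, 1] * 30])       # stand-in for score2

mor_demo = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
mor_demo.fit(X, Y)

print(len(mor_demo.estimators_))          # one fitted forest per target column
print(mor_demo.predict(X[:3]).shape)      # one row per sample, one column per target
```

Each target gets its own independent model, so the two predicted scores do not share any information beyond the input features.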

Next, we split the data 80/20 into training and test sets.

In [29]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.20, random_state = 0)
train_features.shape, test_features.shape, train_labels.shape, test_labels.shape

((12805, 8), (3202, 8), (12805, 2), (3202, 2))

We will investigate three regressors (**Random Forest Regressor, XGBRegressor, Linear Regressor**) and evaluate them on three metrics: **Mean Absolute Error (MAE), Mean Squared Error (MSE), MultiOutput Regression Score** (the coefficient of determination, R², returned by the score method).
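
To show how the three metrics behave with two targets, here is a small worked example on made-up numbers. It assumes the uniform averaging over output columns that recent scikit-learn versions use by default; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Three games, two targets each (hypothetical true and predicted scores)
y_true = np.array([[20, 10], [30, 20], [10, 30]])
y_pred = np.array([[22, 10], [27, 24], [10, 25]])

mae = mean_absolute_error(y_true, y_pred)   # (5/3 + 9/3) / 2 = 2.333...
mse = mean_squared_error(y_true, y_pred)    # (13/3 + 41/3) / 2 = 9.0
r2 = r2_score(y_true, y_pred, multioutput='uniform_average')  # (0.935 + 0.795) / 2 = 0.865

print(mae, mse, r2)
```

MAE and MSE are in score units (points and squared points), while R² is unitless: 1 means perfect predictions and 0 means no better than always predicting the mean.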

## MultiOutputRegressor Regression #01 - Random Forest Regressor
---
First, I use the Random Forest Regressor from scikit-learn inside the MultiOutputRegressor wrapper. **Score1** and **Score2** are the target labels. Evaluation is based on MAE, MSE and the score, as described above.

In [30]:
%%time

mor = MultiOutputRegressor(RandomForestRegressor(n_estimators=100,random_state=0))
#training process
mor.fit (features, labels)

CPU times: user 5.14 s, sys: 48 ms, total: 5.18 s
Wall time: 5.19 s


MultiOutputRegressor(estimator=RandomForestRegressor(bootstrap=True,
                                                     criterion='mse',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                                     n_jobs=None,
                                                     oob_score=False,
                                                 

Once the model is trained, we evaluate it on the three evaluation metrics discussed above.

In [31]:

# Predict once and reuse the result; the metrics take (y_true, y_pred)
predictions = mor.predict(test_features)
print('MAE:', mean_absolute_error(test_labels, predictions))
print('MSE:', mean_squared_error(test_labels, predictions))
print('Multi Target Output Random Forest Based Regressor Score:', round(mor.score(test_features, test_labels), 3))

MAE: 4.54832675696375
MSE: 36.48255827784976
Multi Target Output Random Forest Based Regressor Score: 0.701


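The Random Forest based MultiOutputRegressor reaches an MAE of about 4.55 points and a score of 0.701 on the test split. Note that the model above was fit on the full features and labels arrays rather than on train_features and train_labels, so the test rows were seen during training and these figures are likely optimistic.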
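
The training cell above calls `mor.fit (features, labels)`, i.e. it fits on all rows, including those later used for testing. The sketch below illustrates on synthetic noise (not the NFL data, so the numbers are not comparable with the results above) how much this can inflate a test score. The names `leaky`, `clean` and `make_model` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
Y = rng.rand(500, 2) * 40          # pure-noise targets: nothing real to learn

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)

def make_model():
    return MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))

leaky = make_model().fit(X, Y)                # has seen the test rows
clean = make_model().fit(X_train, Y_train)    # has seen the training rows only

leaky_score = leaky.score(X_test, Y_test)
clean_score = clean.score(X_test, Y_test)
print(round(leaky_score, 3), round(clean_score, 3))
```

Even though the targets are random, the model that saw the test rows scores far higher on them than the model fit on the training split alone. Fitting with `train_features` and `train_labels` would give an estimate that reflects performance on unseen games.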

## MultiOutputRegressor Regression #02 - XGBoost Regressor
---
Next, we train and evaluate the XGBRegressor from the xgboost library, wrapped in scikit-learn's MultiOutputRegressor.

In [32]:
%%time

mor = MultiOutputRegressor(XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,max_depth = 5, alpha = 10, n_estimators = 100))
#training process
mor.fit (features, labels)

CPU times: user 1.02 s, sys: 12 ms, total: 1.03 s
Wall time: 1.03 s


MultiOutputRegressor(estimator=XGBRegressor(alpha=10, base_score=0.5,
                                            booster='gbtree',
                                            colsample_bylevel=1,
                                            colsample_bynode=1,
                                            colsample_bytree=0.3, gamma=0,
                                            importance_type='gain',
                                            learning_rate=0.1, max_delta_step=0,
                                            max_depth=5, min_child_weight=1,
                                            missing=None, n_estimators=100,
                                            n_jobs=1, nthread=None,
                                            objective='reg:linear',
                                            random_state=0, reg_alpha=0,
                                            reg_lambda=1, scale_pos_weight=1,
                                            seed=None, silent=None, subsamp

Once the model is trained, we evaluate it on the three evaluation metrics discussed above.

In [33]:

# Predict once and reuse the result; the metrics take (y_true, y_pred)
predictions = mor.predict(test_features)
print('MAE:', mean_absolute_error(test_labels, predictions))
print('MSE:', mean_squared_error(test_labels, predictions))
print('Multi Target Output XGBoost based Regressor Score:', round(mor.score(test_features, test_labels), 3))

MAE: 6.956283390349816
MSE: 77.05458382962796
Multi Target Output XGBoost based Regressor Score: 0.37


The XGBoost based MultiOutputRegressor performs noticeably worse than the Random Forest based one: the MAE is about 6.96 versus 4.55, and the score is 0.37 versus 0.701.

## MultiOutputRegressor Regression #03 - Linear Regressor
---
Next, we train and evaluate the Linear Regressor from scikit-learn. LinearRegression supports multiple targets natively, so it is used directly here rather than wrapped in MultiOutputRegressor.

In [34]:
%%time

mor = LinearRegression()
#training process
mor.fit (features, labels)

CPU times: user 6.84 ms, sys: 8 µs, total: 6.85 ms
Wall time: 5.21 ms


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
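
The cell above uses LinearRegression directly instead of wrapping it. As a small check on synthetic data (the arrays and the names `direct` and `wrapped` are hypothetical), the sketch below fits it both ways and compares the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
# Two linear targets with a little noise (stand-ins for score1 and score2)
Y = np.column_stack([X @ [3.0, 1.0, 0.0] + 10,
                     X @ [0.0, 2.0, 5.0] + 7]) + rng.normal(scale=0.1, size=(100, 2))

direct = LinearRegression().fit(X, Y)                           # native multi-output
wrapped = MultiOutputRegressor(LinearRegression()).fit(X, Y)    # one model per target

same = np.allclose(direct.predict(X), wrapped.predict(X))
print(direct.coef_.shape, same)
```

Ordinary least squares solves each target column independently, so both routes give the same predictions and the wrapper is optional for this model.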

Once the model is trained, we evaluate it on the three evaluation metrics discussed above.

In [35]:

# Predict once and reuse the result; the metrics take (y_true, y_pred)
predictions = mor.predict(test_features)
print('MAE:', mean_absolute_error(test_labels, predictions))
print('MSE:', mean_squared_error(test_labels, predictions))
print('Multi Target Output Linear Regressor Score:', round(mor.score(test_features, test_labels), 3))

MAE: 7.233133556966823
MSE: 82.39727180323428
Multi Target Output Linear Regressor Score: 0.326




The Linear Regressor gives the weakest results of the three methods, with an MAE of about 7.23 and a score of 0.326.

## Conclusion
This assignment was about applying regression to predict the scores of the two teams playing a game. After training and evaluating three regression models (Random Forest, XGBoost, Linear Regressor) in multi-output mode, we found that the Random Forest based approach performs best on all three evaluation metrics (**Mean Absolute Error (MAE): 4.548, Mean Squared Error (MSE): 36.483, Regression Score: 0.701**). Feature engineering, including combining and removing different features, was also an important part of the pipeline, although its effect on the scores was not measured separately here.

---
## Reference Notes
The 'streaks and history' idea and some of the code used in the Feature Engineering part are based on the following link (which was mentioned in the lecture slides for Course Regression lecture # 04).

> https://www.kaggle.com/airback/match-outcome-prediction-in-football