# ML Modelling

After having thoroughly explored the dataset and building multiple simple baseline models using the Probability scores or Point Difference or Form Difference, we were able to achieve an approx 72%. Any ML we perform should perform better than our Baseline models. We will be exploring different ML algorithms to build this strategy.


## Important Features

Upon performing EDA, we have identified a few essential features that could determine the game's outcome. They are listed below - 

1. Winning Probabilities from different Betting Houses - 
    - Just on building a simple comparison model, we were able to generate 72% accuracy.
    - We will also look at the difference between Opening and Closing Bets of various houses.
    
2. Point difference
    - It measures the difference in point gained by the teams through the season. The baseline created using point difference gave us a 70% accuracy.
    
3. Form Difference
    - We have built a custom formula to measure the current form of a team, which was discussed in `2. Analysis-v2 (Team Performance).ipynb`. Just using this formula, we were able to achieve 70 to 72% accuracy.
    
4. Elite Teams
    - Let's create a unique feature to indicate that a Team is Elite. It has been observed that Betting Probabilities have a hard time predicting when Elite Teams are playing away. Elite Teams are - {'Man United', 'Chelsea', 'Tottenham', 'Liverpool', 'Man City', 'Arsenal'}
    
5. New Teams of Season
    - Every season, the bottom three teams get relegated to a lower division, and three new teams from the lower division are included. These teams tend to be weaker playing sides. New Teams are - {'Norwich', 'Watford', 'Brentford'}


## Cross-Validation

It was observed that our Baseline model's performance was a bit weaker in the half of the tournament. So to effectively model our system, we adopted two validation strategies -

1. We train a model on the first 30 weeks of the dataset and validate its performance on the last 8 weeks.
2. We Randomly sample 80% of our dataset for the Train set and the rest of the 20% for validation. We might also experiment with K-fold random splitting as well

## Evaluation

For the task, we need to successfully predict if the Away team has won the game or not. Since we do not know how detrimental a False Positive or False Negative could be, both Recall & Precision are essential. Here is the list of metrics we will note to evaluate the performance of our ML model - 

1. Recall
2. Precision
3. Accuracy
4. ROC AUC score
5. Confusion Matrix

## Features Neglected or will be Later explored

1. We have decided not to use the Asian Handicap Betting Odds as there seems to be a slight confusion. The resources I had read online mentioned that a small penalty value gets added to the final score. But the key value descriptions do not provide information about the type of Asian handicap being used. Hence, I have decided to move forward with things that I already know.

2. We would like to later explore the effect of Total Goals and its betting odds on the outcome of the game.

3. We would like to explore the effect of information about the Day of the week and the week when the game was played.


In [1]:
import math
import random
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from utils import (
    build_ongoing_points_table, find_optimal_threshold,
    generate_betting_probablity
)
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import (
    confusion_matrix, recall_score, precision_score, 
    accuracy_score, roc_auc_score
)
from imblearn.over_sampling import SMOTE
random.seed(211)
np.random.seed(211)

In [2]:
ori_df = pd.read_csv('E0.csv')
print(f"Shape of dataset {ori_df.shape}")
ori_df.head()

Shape of dataset (366, 106)


Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,E0,13/08/2021,20:00,Brentford,Arsenal,2,0,H,1,0,...,1.62,0.5,1.75,2.05,1.81,2.13,2.05,2.17,1.8,2.09
1,E0,14/08/2021,12:30,Man United,Leeds,5,1,H,1,0,...,2.25,-1.0,2.05,1.75,2.17,1.77,2.19,1.93,2.1,1.79
2,E0,14/08/2021,15:00,Burnley,Brighton,1,2,A,1,0,...,1.62,0.25,1.79,2.15,1.81,2.14,1.82,2.19,1.79,2.12
3,E0,14/08/2021,15:00,Chelsea,Crystal Palace,3,0,H,2,0,...,1.94,-1.5,2.05,1.75,2.12,1.81,2.16,1.93,2.06,1.82
4,E0,14/08/2021,15:00,Everton,Southampton,3,1,H,0,1,...,1.67,-0.5,2.05,1.88,2.05,1.88,2.08,1.9,2.03,1.86


# Target Variable and Time Stamping

As per the goal of the project, we need to predict succesfully if the Away team can win or not.

In [3]:
# Adding Target
ori_df['Result'] = (ori_df['FTR'] == 'A').astype(int)

# Adding TimeStamp and Week No.
ori_df['Datetime'] = ori_df['Date'] + " " + ori_df['Time']+":00"
ori_df['Datetime'] = pd.to_datetime(ori_df['Datetime'], infer_datetime_format=True)
min_date = ori_df['Datetime'].min()
ori_df['Week No.'] = ori_df['Datetime'].apply(lambda x: int((x - min_date).days / 7))

# Features Generation

As discussed above, we will be generating all the important features that we will be using for building our machine learning model.


In [4]:
betting_houses_original_arg = {
    'Bet365': ['B365H', 'B365D', 'B365A'],
    'Bet&Win': ['BWH', 'BWD', 'BWA'],
    'Interwetten': ['IWH', 'IWD', 'IWA'],
    'Pinnacle': ['PSH', 'PSD', 'PSA'],
    'VC Bet': ['VCH', 'VCD', 'VCA'],
    'William Hill': ['WHH', 'WHD', 'WHA'],
    'Average': ['AvgH', 'AvgD', 'AvgA'],
    'Maximum': ['MaxH', 'MaxD', 'MaxA']
}

betting_houses_closing_original_arg = {
    'Bet365': ['B365CH', 'B365CD', 'B365CA'],
    'Bet&Win': ['BWCH', 'BWCD', 'BWCA'],
    'Interwetten': ['IWCH', 'IWCD', 'IWCA'],
    'Pinnacle': ['PSCH', 'PSCD', 'PSCA'],
    'VC Bet': ['VCCH', 'VCCD', 'VCCA'],
    'William Hill': ['WHCH', 'WHCD', 'WHCA'],
    'Average': ['AvgCH', 'AvgCD', 'AvgCA'],
    'Maximum': ['MaxCH', 'MaxCD', 'MaxCA']
}

elite_teams_original_arg = [
    'Man United', 'Chelsea', 'Tottenham', 'Liverpool', 'Man City', 'Arsenal'
]

new_teams_original_arg = [
    'Norwich', 'Watford', 'Brentford'
]
team_names_original_arg = ori_df['HomeTeam'].unique().tolist()
performance_feat_names_original_arg = [
    'Point Diff', 'Form Diff 1', 'Form Diff 2'
]

past_record_feat_original_arg = ['Elite Home', 'Elite Away', 'New Team Home', 'New Team Away']

def build_features(
    arg_df, 
    betting_houses = betting_houses_original_arg, 
    betting_houses_closing = betting_houses_closing_original_arg,
    only_away = False,
    all_teams = team_names_original_arg,
    elite_teams = elite_teams_original_arg, 
    new_teams = new_teams_original_arg,
    performance_feat_names = performance_feat_names_original_arg,
    past_record_feat_names = past_record_feat_original_arg,
    filter_names = None
    ):
    
    arg_df = generate_betting_probablity(arg_df, betting_houses, betting_houses_closing)
    arg_df = build_ongoing_points_table(arg_df, all_teams, elite_teams)
    arg_df['Point Diff'] = arg_df['AwayPoint'] - arg_df['HomePoint']
    arg_df['Form Diff 1'] = arg_df['AwayForm1'] - arg_df['HomeForm1']
    arg_df['Form Diff 2'] = arg_df['AwayForm2'] - arg_df['HomeForm2']
    arg_df['Elite Home'] = arg_df['HomeTeam'].apply(lambda x: 1 if x in elite_teams else 0)
    arg_df['Elite Away'] = arg_df['AwayTeam'].apply(lambda x: 1 if x in elite_teams else 0)
    arg_df['New Team Home'] = arg_df['HomeTeam'].apply(lambda x: 1 if x in new_teams else 0)
    arg_df['New Team Away'] = arg_df['AwayTeam'].apply(lambda x: 1 if x in new_teams else 0)
    
    feature_names = []
    if only_away:
        feature_names += [betting_houses[k][2] + ' Prob' for k in betting_houses]
        feature_names += [betting_houses_closing[k][2] + ' Prob' for k in betting_houses_closing]
    else:
        feature_names += [code + ' Prob' for k in betting_houses for code in betting_houses[k]]
        feature_names += [code + ' Prob' for k in betting_houses_closing for code in betting_houses_closing[k]]
    feature_names += performance_feat_names
    feature_names += past_record_feat_names
    feature_names += ['Result']
    arg_df = arg_df[feature_names]
    if filter_names is None:
        return arg_df
    else:
        return arg_df[filter_names + ['Result']]

# Validation Methods

We will implement our two main stratergies to validate our dataset.

In [5]:
def cross_validation_split_by_dates(arg_df, ori_df, threshold_week = 30):
    train_len, *_ = ori_df[ori_df['Week No.'] <= threshold_week]['Week No.'].shape
    train_df = arg_df.iloc[:train_len]
    val_df = arg_df.iloc[train_len:]
    y_train = train_df['Result'].to_numpy()
    train_df = train_df.drop(['Result'], axis=1)
    X_train = train_df.to_numpy()

    y_test = val_df['Result'].to_numpy()
    val_df = val_df.drop(['Result'], axis=1)
    X_test = val_df.to_numpy()

    return X_train, X_test, y_train, y_test

def cross_validation_random_sample(arg_df, split_ratio=0.2, seed = 211):
    Y = arg_df['Result'].to_numpy()
    arg_df = arg_df.drop(['Result'], axis=1)
    X = arg_df.to_numpy()
    X_train, X_test, y_train, y_test = \
        train_test_split(X, Y, test_size=split_ratio, random_state=seed)
    return X_train, X_test, y_train, y_test

def k_fold_cross_validation(arg_df, folds = 5, seed=211):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    y = arg_df['Result'].to_numpy()
    arg_df = arg_df.drop(['Result'], axis=1)
    X = arg_df.to_numpy()
    datasets = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        datasets.append((X_train, X_test, y_train, y_test))
    return datasets


# Evaulation Metrics

Lets include a function to calculate all the metrics needed for task

In [6]:
def evaluate_performance(y_pred, y_true, threshold = 0.5):
    roc_auc = None
    if threshold is not None:
        roc_auc = roc_auc_score(y_true, y_pred)
        y_pred = y_pred > threshold
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    if threshold is not None:
        return accuracy, precision, recall, roc_auc
    else:
        return accuracy, precision, recall
    # print("Evaulation Report\n")

    # print(f"Accuracy={accuracy:.4f}, Precision={precision:.4f}, Recall={recall:.4f} ")
    # print(f"AUC Score={roc_auc:.4f} \n")
    # print("Confusion Matrix\n")
    # print("         | Predicted 0 | Predicted 1")
    # print(f"Actual 0 | {tn:11} | {fp:11}")
    # print(f"Actual 1 | {fn:11} | {tp:11}")
    # print("\n")
       


# ML Models

Lets build essential functions for training the ML model. It will include features to where any Model following the Scikit-Learn API will be accepted,choose the important features and include SMOTE sampling.

In [7]:
def model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, filter_names = None, SMOTE_apply = False):

    feat_df = build_features(ori_df, betting_houses_arg, betting_houses_closing_arg, filter_names = filter_names)
    sm = SMOTE(random_state=211)
    print("Time Based Cross Validation - ")
    X_train, X_test, y_train, y_test = cross_validation_split_by_dates(feat_df, ori_df)
    if SMOTE_apply:
        X_train, y_train = sm.fit_resample(X_train, y_train)
    clf.fit(X_train, y_train)
    y_pred = clf.predict_proba(X_test)[:, 1]
    acc, pre, rec, roc_auc = evaluate_performance(y_pred, y_test)
    print(f"Accuracy={acc:.3f}, Precison={acc:.3f}, Recall={rec:.3f}, AUC={roc_auc:.3f} \n")
    
    print("K fold Cross Validation - ")
    datasets = k_fold_cross_validation(feat_df)
    avg_acc, avg_pre, avg_rec, avg_auc = 0, 0, 0, 0 
    folds = 5
    datasets = k_fold_cross_validation(feat_df, folds=folds)

    for data in datasets:
        X_train, X_test, y_train, y_test = data
        if SMOTE_apply:
            X_train, y_train = sm.fit_resample(X_train, y_train)
        clf.fit(X_train, y_train)
        y_pred = clf.predict_proba(X_test)[:,1]
        acc, pre, rec, roc_auc = evaluate_performance(y_pred, y_test)
        avg_acc += acc
        avg_pre += pre
        avg_rec += rec
        avg_auc += roc_auc
    avg_acc /= folds
    avg_pre /= folds
    avg_rec /= folds
    avg_auc /= folds
    print(f"Accuracy={avg_acc:.3f}, Precison={avg_pre:.3f}, Recall={avg_rec:.3f}, AUC={avg_auc:.3f} \n")


Before we procede, lets just list out the performance our Baseline models on the validation dataset taken in the last 8 weeks. We seem to achieve highest accuracy with Closing Odds Probablity.

In [8]:
betting_houses_arg = {
    'Average': ['AvgH', 'AvgD', 'AvgA'],
    'Maximum': ['MaxH', 'MaxD', 'MaxA']
}
betting_houses_closing_arg = {
    'Average': ['AvgCH', 'AvgCD', 'AvgCA'],
    'Maximum': ['MaxCH', 'MaxCD', 'MaxCA']
}
baseline_df = build_features(ori_df, betting_houses_arg, betting_houses_closing_arg)
train_len, *_ = ori_df[ori_df['Week No.'] <= 30]['Week No.'].shape
baseline_df = baseline_df.iloc[train_len:]

for houses in betting_houses_arg:
    codes = betting_houses_arg[houses]
    y_pred = ((baseline_df[codes[2] + ' Prob'] > baseline_df[codes[0] + ' Prob']) & (baseline_df[codes[2] + ' Prob'] > baseline_df[codes[1] + ' Prob'])).astype(int)
    y_true = baseline_df['Result']
    print(f"Results from {houses} Betting House")
    acc, pre, rec = evaluate_performance(y_pred, y_true, None)
    print(f"Accuracy={acc:.3f}, Precison={acc:.3f}, Recall={rec:.3f} \n")

for houses in betting_houses_closing_arg:
    codes = betting_houses_closing_arg[houses]
    y_pred = ((baseline_df[codes[2] + ' Prob'] > baseline_df[codes[0] + ' Prob']) & (baseline_df[codes[2] + ' Prob'] > baseline_df[codes[1] + ' Prob'])).astype(int)
    y_true = baseline_df['Result']
    print(f"Results from {houses} Closing Betting House")
    acc, pre, rec = evaluate_performance(y_pred, y_true, None)
    print(f"Accuracy={acc:.3f}, Precison={acc:.3f}, Recall={rec:.3f} \n")

y_pred = (baseline_df['Point Diff'] > 4).astype(int)
y_true = baseline_df['Result']
print(f"Results from Point Difference")
acc, pre, rec = evaluate_performance(y_pred, y_true, None)
print(f"Accuracy={acc:.3f}, Precison={acc:.3f}, Recall={rec:.3f} \n")

y_pred = (baseline_df['Form Diff 1'] > 1.0).astype(int)
y_true = baseline_df['Result']
print(f"Results from Form Difference 1")
acc, pre, rec = evaluate_performance(y_pred, y_true, None)
print(f"Accuracy={acc:.3f}, Precison={acc:.3f}, Recall={rec:.3f} \n")

y_pred = (baseline_df['Form Diff 2'] > 3.0).astype(int)
y_true = baseline_df['Result']
print(f"Results from Form Difference 2")
acc, pre, rec = evaluate_performance(y_pred, y_true, None)
print(f"Accuracy={acc:.3f}, Precison={acc:.3f}, Recall={rec:.3f} \n")



Results from Average Betting House
Accuracy=0.695, Precison=0.695, Recall=0.615 

Results from Maximum Betting House
Accuracy=0.695, Precison=0.695, Recall=0.615 

Results from Average Closing Betting House
Accuracy=0.720, Precison=0.720, Recall=0.615 

Results from Maximum Closing Betting House
Accuracy=0.720, Precison=0.720, Recall=0.615 

Results from Point Difference
Accuracy=0.671, Precison=0.671, Recall=0.692 

Results from Form Difference 1
Accuracy=0.707, Precison=0.707, Recall=0.538 

Results from Form Difference 2
Accuracy=0.671, Precison=0.671, Recall=0.654 



---

After careful exploration, I had chosen LogisticRegression & XGBoost as they seemed to Consistently give better results.

In our first case, we achieved (70% accuracy, 65% Precison and 36% recall)

In [9]:
    
from sklearn.linear_model import LogisticRegression

betting_houses_arg = {
    'Average': ['AvgH', 'AvgD', 'AvgA'],
    'Maximum': ['MaxH', 'MaxD', 'MaxA']
}
betting_houses_closing_arg = {
    'Average': ['AvgCH', 'AvgCD', 'AvgCA'],
    'Maximum': ['MaxCH', 'MaxCD', 'MaxCA']
}
clf = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    C=0.1
)
print("Without SMOTE Sampling: \n")
model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, SMOTE_apply = False)

# print("With SMOTE Analysis:")
# model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, SMOTE_apply = True)


Without SMOTE Sampling: 

Time Based Cross Validation - 
Accuracy=0.707, Precison=0.707, Recall=0.577, AUC=0.689 

K fold Cross Validation - 
Accuracy=0.713, Precison=0.651, Recall=0.360, AUC=0.687 



In [10]:
# from sklearn.linear_model import LogisticRegression

# betting_houses_arg = {
#     'Bet365': ['B365H', 'B365D', 'B365A'],
#     'Bet&Win': ['BWH', 'BWD', 'BWA'],
#     'Pinnacle': ['PSH', 'PSD', 'PSA'],
#     'VC Bet': ['VCH', 'VCD', 'VCA'],
#     'William Hill': ['WHH', 'WHD', 'WHA'],
#     'Average': ['AvgH', 'AvgD', 'AvgA'],
#     'Maximum': ['MaxH', 'MaxD', 'MaxA']

# }
# betting_houses_closing_arg = {
#     'Bet365': ['B365CH', 'B365CD', 'B365CA'],
#     'Bet&Win': ['BWCH', 'BWCD', 'BWCA'],
#     'Pinnacle': ['PSCH', 'PSCD', 'PSCA'],
#     'VC Bet': ['VCCH', 'VCCD', 'VCCA'],
#     'William Hill': ['WHCH', 'WHCD', 'WHCA'],
#     'Average': ['AvgCH', 'AvgCD', 'AvgCA'],
#     'Maximum': ['MaxCH', 'MaxCD', 'MaxCA']

# }
# filter_names = [betting_houses_arg[k][2]+ ' Prob' for k in betting_houses_arg]
# clf = LogisticRegression(
#     penalty='l1',
#     solver='liblinear',
#     C=0.1
# )
# print("Without SMOTE Sampling: \n")
# model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, filter_names, SMOTE_apply = False)



---

The XGBoost classifier trained from SMOTE smapling seems to be performing very well on the Week Based Cross Validation dataset with 73% accuracy, 73% precison and 69% recall. Its performance is slightly below average in the random sampling scenario with 67% accuracy

In [11]:
from xgboost import XGBClassifier

betting_houses_arg = {
    'Bet365': ['B365H', 'B365D', 'B365A'],
    'Bet&Win': ['BWH', 'BWD', 'BWA'],
    'Pinnacle': ['PSH', 'PSD', 'PSA'],
    'VC Bet': ['VCH', 'VCD', 'VCA'],
    'William Hill': ['WHH', 'WHD', 'WHA'],
    'Average': ['AvgH', 'AvgD', 'AvgA'],
    'Maximum': ['MaxH', 'MaxD', 'MaxA']

}
betting_houses_closing_arg = {
    'Bet365': ['B365CH', 'B365CD', 'B365CA'],
    'Bet&Win': ['BWCH', 'BWCD', 'BWCA'],
    'Pinnacle': ['PSCH', 'PSCD', 'PSCA'],
    'VC Bet': ['VCCH', 'VCCD', 'VCCA'],
    'William Hill': ['WHCH', 'WHCD', 'WHCA'],
    'Average': ['AvgCH', 'AvgCD', 'AvgCA'],
    'Maximum': ['MaxCH', 'MaxCD', 'MaxCA']

}
clf = XGBClassifier(
    max_depth=2,
    gamma=2,
    eta=0.8,
    reg_alpha=0.5,
    reg_lambda=0.5
)
print("Without SMOTE Sampling: \n")
model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, SMOTE_apply = False)

print("With SMOTE Analysis:")
model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, SMOTE_apply = True)


Without SMOTE Sampling: 

Time Based Cross Validation - 
Accuracy=0.695, Precison=0.695, Recall=0.577, AUC=0.663 

K fold Cross Validation - 
Accuracy=0.680, Precison=0.548, Recall=0.464, AUC=0.715 

With SMOTE Analysis:
Time Based Cross Validation - 
Accuracy=0.732, Precison=0.732, Recall=0.692, AUC=0.714 

K fold Cross Validation - 
Accuracy=0.669, Precison=0.515, Recall=0.576, AUC=0.703 



---

We also wanted to explore the possibility of Ensembling the results of XGboost & Logistic Regression. But we did not recieve a very high accuracy performance. The time based cross validation performed well on SMOTE sampling and k-fold cross validation performed better without SMOTE sampling by achieving a 0.712 AUC score

In [12]:
from mlxtend.classifier import StackingClassifier

betting_houses_arg = {
    'Bet365': ['B365H', 'B365D', 'B365A'],
    'Bet&Win': ['BWH', 'BWD', 'BWA'],
    'Pinnacle': ['PSH', 'PSD', 'PSA'],
    'VC Bet': ['VCH', 'VCD', 'VCA'],
    'William Hill': ['WHH', 'WHD', 'WHA'],
    'Average': ['AvgH', 'AvgD', 'AvgA'],
    'Maximum': ['MaxH', 'MaxD', 'MaxA']

}
betting_houses_closing_arg = {
    'Bet365': ['B365CH', 'B365CD', 'B365CA'],
    'Bet&Win': ['BWCH', 'BWCD', 'BWCA'],
    'Pinnacle': ['PSCH', 'PSCD', 'PSCA'],
    'VC Bet': ['VCCH', 'VCCD', 'VCCA'],
    'William Hill': ['WHCH', 'WHCD', 'WHCA'],
    'Average': ['AvgCH', 'AvgCD', 'AvgCA'],
    'Maximum': ['MaxCH', 'MaxCD', 'MaxCA']

}
clf = StackingClassifier(
    classifiers=[
        LogisticRegression(
            penalty='l1',
            solver='liblinear',
            C=0.1
        ),
        XGBClassifier(
            max_depth=2,
            gamma=2,
            eta=0.8,
            reg_alpha=0.5,
            reg_lambda=0.5
        )
    ],
    use_probas=True,
    meta_classifier=LogisticRegression()
)
print("Without SMOTE Sampling: \n")
model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, SMOTE_apply = False)

print("With SMOTE Analysis:")
model_training_evaluation(clf, ori_df, betting_houses_arg, betting_houses_closing_arg, SMOTE_apply = True)


Without SMOTE Sampling: 

Time Based Cross Validation - 
Accuracy=0.671, Precison=0.671, Recall=0.538, AUC=0.648 

K fold Cross Validation - 
Accuracy=0.683, Precison=0.549, Recall=0.480, AUC=0.712 

With SMOTE Analysis:
Time Based Cross Validation - 
Accuracy=0.707, Precison=0.707, Recall=0.692, AUC=0.716 

K fold Cross Validation - 
Accuracy=0.669, Precison=0.515, Recall=0.552, AUC=0.697 

