# Title: Predicting Fantasy Football Teams using Regression with Regularisation

#### Group Member Names : 
* Viral Nena (200531848)
* Akash Chatra (200532153)


### INTRODUCTION:
*********************************************************************************************************************
#### AIM : The aim of this project is to build a model capable of predicting high performing fantasy football teams

*********************************************************************************************************************
#### Github Repo: https://github.com/JoshuaPlacidi/Fantasy-Football-Team-Predictions.git

*********************************************************************************************************************
#### DESCRIPTION OF PAPER: 
The paper presents a hybrid methodology combining ARIMA and RNN models to predict football player points in Fantasy Premier League, optimizing player selection through Linear Programming. The approach, validated in the ongoing season, demonstrates effective performance prediction and potential for use by official on-field managers in the English Premier League.

*********************************************************************************************************************
#### PROBLEM STATEMENT : 
* Trying to replicate the process of the time series models explained in the research paper.
* chose the respective available dataset. (Dataset was scaraped from the official FPL API and mergered into a single file)
* Link for dataset before merging into single csv file :- https://github.com/vaastav/Fantasy-Premier-League/
* The performance of football players is predicted using regularised regression-based machine learning algorithms, which are then compared.
*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
*  The challenge of optimizing player selection in Fantasy Premier League due to varying player performance across seasons, proposing a hybrid ARIMA-RNN model and Linear Programming to predict and maximize player points in the simulated environment.
*********************************************************************************************************************
#### SOLUTION:
* The study aims to develop a predictive system for high-performing fantasy football teams using FPL data.
* This task will be accomplished through linear regression, linear optimization, Arima-Rnn model. 


# Background
*********************************************************************************************************************

**Time series Modelling using Arima and RNN Algortihms**


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|Gupta, A. (2019).| Time series modeling for dream team in fantasy premier league.|players dataset|a study focused on using time series analysis techniques to improve the selection of players for a fantasy football team in the Fantasy Premier League (FPL).|


**MDP (Bayesian belief model)**


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|Matthews, T., Ramchurn, S., & Chalkiadakis, G. (2012).|Competing with humans at fantasy football: Team formation in large partially-observable domains.| English Premier League(EPL) dataset|solve the sequential team formation problem posed by a popular online Fantasy Football game known as Fantasy Premier League (FPL), where a participant’s task (as manager) is to repeatedly select highly-constrained sets of players in order to maximise a score reflecting the real-world performances of those selected players in the English Premier League.|


**Predictions using Machine learning**


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|Parikh, N. (2015).| Interactive tools for fantasy football analytics and predictions using machine learning|Fantasy Sports API dataset|A web-based tool for in-depth fantasy football analytics. We intend to build reliable predictive algorithms to project the success of specific football players.|





*********************************************************************************************************************






# Implement paper code :
*********************************************************************************************************************

#### Importing all the Library 


If Pulp is not installed, install using this "!pip install pulp"

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import pulp
from sklearn import linear_model
pd.options.mode.chained_assignment = None

#### Predictions

Calculate predictions with individual models trained for each position

In [2]:
data = pd.read_csv('Player_Data.csv', index_col = 0)
train_data = data[data.season != 1920]
test_data = data[data.season == 1920]

gk_train = train_data[train_data.position == 1]
gk_test = test_data[test_data.position == 1]

def_train = train_data[train_data.position == 2]
def_test = test_data[test_data.position == 2]

mid_train = train_data[train_data.position == 3]
mid_test = test_data[test_data.position == 3]

fwd_train = train_data[train_data.position == 4]
fwd_test = test_data[test_data.position == 4]

features = ['opp_diff','was_home','minutes_sum','bps_sum',
              'influence_sum','threat_sum','ict_sum','creat_sum',
              'yel_sum','red_sum','selected_by','tran_sum',
              'goals_sum','assists_sum','points_sum','value',
              'saves_sum','goals_con_sum','clean_sheets_sum']

gk_model = linear_model.LinearRegression()
gk_model.fit(gk_train[features],gk_train.points)

def_model = linear_model.LinearRegression()
def_model.fit(def_train[features],def_train.points)

mid_model = linear_model.LinearRegression()
mid_model.fit(mid_train[features],mid_train.points)

fwd_model = linear_model.LinearRegression()
fwd_model.fit(fwd_train[features],fwd_train.points)

gk_test['prediction'] = gk_model.predict(gk_test[features])
gk_test['prediction_error'] = abs(gk_test.prediction - gk_test.points)

def_test['prediction'] = def_model.predict(def_test[features])
def_test['prediction_error'] = abs(def_test.prediction - def_test.points)

mid_test['prediction'] = mid_model.predict(mid_test[features])
mid_test['prediction_error'] = abs(mid_test.prediction - mid_test.points)

fwd_test['prediction'] = fwd_model.predict(fwd_test[features])
fwd_test['prediction_error'] = abs(fwd_test.prediction - fwd_test.points)

all_predictions = pd.concat([gk_test, def_test, mid_test, fwd_test])

print('GoalKeeper  Mean Error: ' + str(round(gk_test.prediction_error.mean(),3)))
print('Defender Mean Error: ' + str(round(def_test.prediction_error.mean(),3)))
print('Mid Fielder Mean Error: ' + str(round(mid_test.prediction_error.mean(),3)))
print('Forward Mean Error: ' + str(round(fwd_test.prediction_error.mean(),3)) + '\n')
print('Total Mean Error: ' + str(round((all_predictions.prediction_error.mean()),3)))

GoalKeeper  Mean Error: 2.346
Defender Mean Error: 2.308
Mid Fielder Mean Error: 1.768
Forward Mean Error: 2.147

Total Mean Error: 2.048


#### Select Team Functions

Linear optimisation used to calculate the best legal team for each gameweek

In [3]:
# Select a team for a given gameweek
def select(gw, data_in, print_output=False):
    sub_factor = 0.1
    data_in = data_in[data_in.GW == gw]
    first_team, captain, subs, cal_points = select_team(data_in, 100, sub_factor)

    real_points_total = 0
    predicted_points_total = 0
    total_cost = 0

    if(print_output):
        print('Starting team')

    for i in range(data_in.shape[0]):

        if captain[i].value() != 0:
            if(print_output):
                print(print_player(data_in.iloc[i]) + ' (Captain)')
            predicted_points_total += (data_in.iloc[i].prediction * 2)
            real_points_total += (data_in.iloc[i].points * 2)
            total_cost += data_in.iloc[i].value

        elif first_team[i].value() != 0:
            if(print_output):
                print(print_player(data_in.iloc[i]))

            predicted_points_total += data_in.iloc[i].prediction
            real_points_total += data_in.iloc[i].points
            total_cost += data_in.iloc[i].value

    if(print_output):
        print('\n' + 'Substitutes')
  
    sub_points = 0
    for i in range(data_in.shape[0]):
        if subs[i].value() != 0:
            if(print_output):
                print(print_player(data_in.iloc[i]))
            total_cost += data_in.iloc[i].value

    #total_points = total_points - (sub_points * sub_factor)
    error = abs(real_points_total - predicted_points_total)

    if(print_output):
        print('\n' + 'Predicted Points    ' + str(round(predicted_points_total,2)))
        print('Real Points         '        + str(real_points_total))
        print('Error               '        + str(abs(round(error,2))))
        print('Cost                '        + '£' + str(round(total_cost/10.0,2)) + 'M\n')  

    return predicted_points_total, real_points_total, total_cost/10.0

In [4]:
def print_player(player):
    return  ' [' + str(player.player_id) + '] ' + ' (Pred:' + str(round(player.prediction,1)) + ' | Real:' + str(round(player.points,2)) + ') ' +  player.first_name + ' ' + player.second_name


In [5]:
# Runs selection over a range of gameweeks
def select_range(start_gw, end_gw, data_in):
    total_error = 0
    points = 0
    real_points = 0
    for gw in range(start_gw, end_gw):
        predicted_score, real_score, error = select(gw, data_in, False)
        print('GW' + str(gw) + '---------------------------')
        print('Predicted Score : ' + str(round(predicted_score,2)) + ' Points')
        print('Real Score      : ' + str(real_score) + '.00 Points')
        total_error += abs(error)
        points += predicted_score
        real_points += real_score

    print('\nPredicted       ' + str(round(points,2)))
    print('Real            ' + str(round(real_points,2)))
    print('Total Error     ' + str(round(total_error,2)))
    print('Average Error   ' + str(round(total_error / (end_gw - start_gw),2)))


In [6]:
def select_team(player_data, budget, sub_factor):
    num_players = len(player_data)
    model = pulp.LpProblem("Constrained_value_maximisation", pulp.LpMaximize)

    # Array to store players selected for the starting team
    decisions = [ pulp.LpVariable("x{}".format(i), lowBound=0, upBound=1, cat='Integer') for i in range(num_players)]

    # Array to captain decision
    captain_decisions = [pulp.LpVariable("y{}".format(i), lowBound=0, upBound=1, cat='Integer')for i in range(num_players)]

    # Array to store sub decisions
    sub_decisions = [pulp.LpVariable("z{}".format(i), lowBound=0, upBound=1, cat='Integer') for i in range(num_players)]

    # objective function
    model += sum((captain_decisions[i] + decisions[i] + sub_decisions[i]*sub_factor) * player_data.iloc[i].prediction
                for i in range(num_players)), "Objective"

    # cost constraint
    model += sum((decisions[i] + sub_decisions[i]) * (player_data.iloc[i].value / 10.0)
              for i in range(num_players)) <= budget  # total cost

    # position constraints
    # 1 starting goalkeeper
    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 1) == 1
    # 2 total goalkeepers
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if player_data.iloc[i].position == 1) == 2

    # Select the starting defenders
    # Must be between 3 and 5 starting defenders
    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 2) >= 3

    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 2) <= 5

    # Select all defenders
    # Must be 5 defenders selected
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if player_data.iloc[i].position == 2) == 5

    # Select midfielders
    # Must be between 3 and 5 starting midfielders selected
    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 3) >= 3
    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 3) <= 5


    # 5 all midfielders
    # Must be 5 midfielders selected
    model += sum(decisions[i] + sub_decisions[i]
               for i in range(num_players) if player_data.iloc[i].position == 3) == 5

    # Select forwards
    # Must be between 1 and 3 starting forwards
    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 4) >= 1
    model += sum(decisions[i] for i in range(num_players) if player_data.iloc[i].position == 4) <= 3

    # Must be 3 forwards selected
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if player_data.iloc[i].position == 4) == 3

    # Only 3 players can be selected from a single club
    team_codes = np.unique(player_data.team_code)
    for team_id in np.unique(team_codes):
        model += sum(decisions[i] + sub_decisions[i]
                   for i in range(num_players) if player_data.iloc[i].team_code == team_id) <= 3  # max 3 players

    # 11 starting players must be selected
    model += sum(decisions) == 11

    # 1 of the starting players must be selected as captain
    model += sum(captain_decisions) == 1  # 1 captain
  
    # Check player selections are valid
    for i in range(num_players):  
        # Captain has to be present in starting team
        model += (decisions[i] - captain_decisions[i]) >= 0
        # Subs cannot be present in starting team
        model += (decisions[i] + sub_decisions[i]) <= 1 

    model.solve()

    return decisions, captain_decisions, sub_decisions, model.objective.value()

#### Generate Predictions


gw = the gameweek to make predictions for

In [7]:
gw = 40
predicted_points, real_points, cost = select(gw,all_predictions,True)

Starting team
 [168]  (Pred:4.1 | Real:14) Kasper Schmeichel
 [401]  (Pred:5.0 | Real:6) Matt Doherty
 [251]  (Pred:4.6 | Real:8) Matt Ritchie
 [182]  (Pred:6.4 | Real:14) Trent Alexander-Arnold
 [183]  (Pred:5.8 | Real:9) Virgil van Dijk
 [239]  (Pred:5.2 | Real:21) Anthony Martial
 [344]  (Pred:5.3 | Real:1) Bamidele Alli
 [172]  (Pred:4.6 | Real:1) Harvey Barnes
 [171]  (Pred:4.8 | Real:3) James Maddison
 [191]  (Pred:7.3 | Real:11) Mohamed Salah (Captain)
 [409]  (Pred:5.0 | Real:9) Raúl Jiménez

Substitutes
 [427]  (Pred:3.6 | Real:8) Emiliano Martínez
 [256]  (Pred:3.6 | Real:2) Javier Manquillo
 [234]  (Pred:2.0 | Real:2) Mason Greenwood
 [554]  (Pred:3.4 | Real:8) Dwight Gayle

Predicted Points    65.52
Real Points         108
Error               42.48
Cost                £100.0M



In [8]:
select_range(4,9, all_predictions)

GW4---------------------------
Predicted Score : 65.79 Points
Real Score      : 60.00 Points
GW5---------------------------
Predicted Score : 65.24 Points
Real Score      : 60.00 Points
GW6---------------------------
Predicted Score : 64.24 Points
Real Score      : 70.00 Points
GW7---------------------------
Predicted Score : 64.48 Points
Real Score      : 49.00 Points
GW8---------------------------
Predicted Score : 64.62 Points
Real Score      : 56.00 Points

Predicted       324.38
Real            295
Total Error     499.7
Average Error   99.94


*********************************************************************************************************************
### Contribution  Code :
* 

implementing ridge regression model instead of linear regression model on players statistics to reduce the error. 

In [9]:
from sklearn.linear_model import Ridge

In [10]:
# Initialize Ridge regression models
gk_model = Ridge(alpha=0.8)  
def_model = Ridge(alpha=0.8)
mid_model = Ridge(alpha=0.8)
fwd_model = Ridge(alpha=0.8)

In [11]:
# Fit Ridge regression models
gk_model.fit(gk_train[features], gk_train.points)
def_model.fit(def_train[features], def_train.points)
mid_model.fit(mid_train[features], mid_train.points)
fwd_model.fit(fwd_train[features], fwd_train.points)

  return linalg.solve(A, Xy, sym_pos=True, overwrite_a=True).T


Ridge(alpha=0.8)

In [12]:
# Make predictions
gk_test['prediction'] = gk_model.predict(gk_test[features])
def_test['prediction'] = def_model.predict(def_test[features])
mid_test['prediction'] = mid_model.predict(mid_test[features])
fwd_test['prediction'] = fwd_model.predict(fwd_test[features])

In [13]:
# Calculate prediction errors
gk_test['prediction_error'] = abs(gk_test.prediction - gk_test.points)
def_test['prediction_error'] = abs(def_test.prediction - def_test.points)
mid_test['prediction_error'] = abs(mid_test.prediction - mid_test.points)
fwd_test['prediction_error'] = abs(fwd_test.prediction - fwd_test.points)

In [14]:
# Calculate mean errors and print
print('GoalKeeper Mean Error: ' + str(round(gk_test.prediction_error.mean(), 3)))
print('Defender Mean Error: ' + str(round(def_test.prediction_error.mean(), 3)))
print('Mid Fielder Mean Error: ' + str(round(mid_test.prediction_error.mean(), 3)))
print('Forward Mean Error: ' + str(round(fwd_test.prediction_error.mean(), 3)) + '\n')

GoalKeeper Mean Error: 2.346
Defender Mean Error: 2.308
Mid Fielder Mean Error: 1.768
Forward Mean Error: 2.147



### Results :
**************************************************************************************************************************
For both the models Linear regression and Ridge model, we can verify it with the help of mean error score to compare the scores.

#### Observations :
*******************************************************************************************************************************
* We have implemented both the models on the same data set (players dataset)
* we can compare both the models here with the mean error score.
* both the models has performed almost similar and got similar results.


### Conclusion and Future Direction :
*******************************************************************************************************************************
#### Learnings : 
During the implementations of this project, i learnt how to replicate the research paper using the github repo. Here i have learnt how how to reduce the chances of overfitting the model. Undestanding the importance of features in predicting performance. For evaluating the model use of metrics such as mean error. experimenting with the features, parameter on models to improve accuracy of the model. 
*******************************************************************************************************************************
#### Results Discussion :

Two models—Linear Regression and Ridge Regression—were used in this project's outcomes, and their outcomes were compared. The performance of both models was assessed using the mean error score after they were applied to a dataset, specifically the players dataset.

It was clear from the results that both models behaved very similarly, producing similar mean error ratings. This implies that the target variable on this dataset was successfully predicted using both linear regression and ridge regression. Their different approaches to handling the correlation between characteristics and the goal variable may be responsible for the slight variation in their performances.

Overfitting was a problem, and balanced training and testing data helped. The model generalization and feature significance of Ridge Regression were noticed. Future plans call for the use of bigger datasets and more sophisticated algorithms. The performance of the models, the author's learning, limitations, and future directions are highlighted in the discussion. the discussion of results emphasizes the comparable performance of Linear Regression and Ridge Regression models on the given dataset. The project served as a platform for the author to learn, experiment, and draw insights into the intricacies of model implementation and evaluation. While highlighting limitations and potential extensions, the author's understanding of the underlying concepts and their practical implications is evident. 
#### Limitations :
* Over fitting of model can be real challenge , so balancing training testing can help in generalizing the unseen data.  
* These model shrinks coefficients towards zero, potentially keeping less important features in the model.
*******************************************************************************************************************************
#### Future Extension :
* To improve the accuracy of the model we implement the model on a larger data set. 
* for predections the algorithms can be implemented with some advance model.

# References:

* Interactive Tools for Fantasy Football Analytics and Predictions using Machine Learning Neena Parikh 2014: https://dspace.mit.edu/handle/1721.1/100687

* Time Series Modelling for Dream Team in Fantasy Premier League Akhil Gupta 2017: https://arxiv.org/abs/1909.12938

* Han J, Kamber M, Pei J. Data Preprocessing. Data Mining Concepts and Techniques 2011; 83-124: Morgan Kauffman Publications.

* Shivani, K. S. Sandhu and A. Ramachandran Nair, "A Comparative Study of ARIMA and RNN for Short Term Wind Speed Forecasting," 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India, 2019, pp. 1-7, doi: 10.1109/ICCCNT45670.2019.8944466.
