# <center> Squad Selector </center>

Author:  Farhan Kassam

In this notebook, we solve the Fantasy Premier League (FPL) squad selection problem as a mixed-integer linear program (MILP), which is a constrained variant of the knapsack problem.

The knapsack problem is a combinatorial optimization problem where you are given a set of items, each with a weight and a value. The program must determine a subset of items to include in the knapsack so that the total weight is less than or equal to a given limit and the total value is as large as possible.
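To make the definition concrete, here is a minimal sketch of the classic 0/1 knapsack solved by dynamic programming. The function name `knapsack` and the item weights and values are made up for illustration; the FPL problem below adds further constraints, so it is solved with a MILP solver rather than this routine.

```python
def knapsack(items, capacity):
    """Return (best_value, chosen_indices) for a 0/1 knapsack.

    items: list of (weight, value) pairs with non-negative integer weights.
    capacity: non-negative integer weight limit.
    """
    # best[c] holds (value, chosen indices) for the best subset of weight <= c
    best = [(0, ())] * (capacity + 1)
    for idx, (weight, value) in enumerate(items):
        # Walk capacities downwards so each item is used at most once
        for c in range(capacity, weight - 1, -1):
            candidate = best[c - weight][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - weight][1] + (idx,))
    return best[capacity]


# Illustrative (weight, value) pairs, not FPL data
items = [(12, 4), (2, 2), (1, 1), (4, 10), (1, 2)]
best_value, chosen = knapsack(items, capacity=15)
print(best_value, chosen)  # 15 (1, 2, 3, 4)
```

In FPL terms, the weight plays the role of a player's cost and the value plays the role of the player's points.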

**Reference**: <br>
Khalid, Irfan. Sep, 2021. <i>How to Build A Fantasy Premier League Team with Data Science</i>. https://towardsdatascience.com/how-to-build-a-fantasy-premier-league-team-with-data-science-f01283281236.

# FPL MILP

Below is a review of the MILP with respect to the FPL-specific constraints.

In FPL, there is an objective and a set of rules (constraints) that we must adhere to when selecting a team. We will explain the problem in words and mathematical terms.

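Objective: Maximize the total points earned by the selected squad

$F = \sum_{i=0}^{N}(x_i + y_i) * V_i$

Here $x_i = 1$ if player $i$ is in the starting lineup and $0$ otherwise, $y_i = 1$ if player $i$ is a substitute and $0$ otherwise, $V_i$ is the player's points, and $C_i$ is the player's cost. In the code below, substitutes are weighted by 0.1 in the objective, i.e. $(x_i + 0.1\,y_i) * V_i$, which makes the solver prefer placing higher-scoring players in the starting lineup.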

Constraints:
- The total cost must not exceed the 100m budget (1000 in our dataset, since values are stored as integers in units of 0.1m rather than as floats)
    - $\sum_{i=0}^{N}(x_i+y_i) * C_i \le 1000$
<br></br>
- There are 11 players in the starting lineup
    - $\sum_{i=0}^{N}x_i = 11$
<br></br>
- There are 15 players in the squad
    - $\sum_{i=0}^{N}(x_i + y_i) = 15$
<br></br>
- There are 2 goalkeepers in the squad and only 1 in the lineup
    - $\sum_{j \in G}x_j = 1$
    - $\sum_{j \in G}(x_j + y_j) = 2$
<br></br>  
- There are between 3 and 5 defenders in the starting lineup and 5 in the squad
    - $3 \le \sum_{j \in D}x_j \le 5$
    - $\sum_{j \in D}(x_j + y_j) = 5$
<br></br>
- There are between 3 and 5 midfielders in the starting lineup and 5 in the squad
    - $3 \le \sum_{j \in M}x_j \le 5$
    - $\sum_{j \in M}(x_j + y_j) = 5$
<br></br> 
- There are between 1 and 3 forwards in the starting lineup and 3 in the squad
    - $1 \le \sum_{j \in F}x_j \le 3$
    - $\sum_{j \in F}(x_j + y_j) = 3$
<br></br>
- There cannot be more than 3 players from the same team in the squad
    - $\sum_{j \in T_k}(x_j + y_j) \le 3$ for every team $k$
<br></br>
- A player cannot be both a starter and a substitute
    - $x_i + y_i \le 1$ for every player $i$
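The rules above can also be checked directly on a candidate squad. The following is a minimal sketch of such a check in plain Python; the function `squad_violations`, the dict-based player format, and the example players are assumptions made for illustration (they are not part of the notebook), and players are assumed to have unique names.

```python
from collections import Counter

# Squad quotas and starting-lineup ranges, copied from the constraint list above
SQUAD_QUOTA = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}
LINEUP_RANGE = {'GK': (1, 1), 'DEF': (3, 5), 'MID': (3, 5), 'FWD': (1, 3)}


def squad_violations(starters, subs, budget=1000, max_per_club=3):
    """Return a list of broken rules (an empty list means the squad is valid).

    starters / subs: lists of dicts with keys 'name', 'team', 'position', 'value'.
    """
    squad = starters + subs
    problems = []
    if len(starters) != 11:
        problems.append('lineup size')
    if len(squad) != 15:
        problems.append('squad size')
    # Same player listed twice (assumes names are unique identifiers)
    if len({p['name'] for p in squad}) != len(squad):
        problems.append('player selected twice')
    if sum(p['value'] for p in squad) > budget:
        problems.append('budget')
    lineup_counts = Counter(p['position'] for p in starters)
    squad_counts = Counter(p['position'] for p in squad)
    for pos, quota in SQUAD_QUOTA.items():
        low, high = LINEUP_RANGE[pos]
        if not low <= lineup_counts[pos] <= high:
            problems.append(f'{pos} lineup count')
        if squad_counts[pos] != quota:
            problems.append(f'{pos} squad count')
    club_counts = Counter(p['team'] for p in squad)
    if any(n > max_per_club for n in club_counts.values()):
        problems.append('club limit')
    return problems


# Hypothetical 15-player squad: made-up names and clubs, every player costs 60
positions = ['GK'] * 2 + ['DEF'] * 5 + ['MID'] * 5 + ['FWD'] * 3
squad = [
    {'name': f'Player {i}', 'team': f'Club {i % 5}', 'position': pos, 'value': 60}
    for i, pos in enumerate(positions)
]
bench_idx = {1, 6, 11, 14}  # one GK, one DEF, one MID, one FWD on the bench
starters = [p for i, p in enumerate(squad) if i not in bench_idx]
subs = [p for i, p in enumerate(squad) if i in bench_idx]
print(squad_violations(starters, subs))  # []
```

A check like this is useful as a sanity test on the solver's output, since it is written independently of the MILP formulation.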

In [1]:
import pandas as pd
import numpy as np
import pulp

# The Problem

We will first import the data and compare the squad selected from our predictions with the squad that would have scored the most points based on the true FPL results. If the actual points of our predicted squad are close to those of the highest-scoring squad, then our model's selections are a good indicator of which players to choose for your FPL squad.

In [3]:
df = pd.read_csv('../data/pred_2022_23.csv', index_col=0)
df.head()

Unnamed: 0,name,team,position,GW,value,total_points,pred
0,Nathan Redmond,Southampton,MID,6,53,0,0.485144
1,Junior Stanislas,Bournemouth,MID,6,48,0,0.465642
2,Armando Broja,Chelsea,FWD,6,54,1,0.692717
3,Fabian Schär,Newcastle,DEF,6,47,6,2.541648
4,Jonny Evans,Leicester,DEF,6,44,0,2.376991


Function to select my model's predicted squad.

In [15]:
def mypred_squad(data, budget):
    '''This function returns the selected 15-man squad (11 starters and 4 subs) in a dataframe, with rows in the
    same order as data. The squad is selected by mixed-integer linear programming with the pred column as the objective.'''
    
    assert isinstance(data, pd.DataFrame), "Data Must Be Pandas DataFrame"
    assert isinstance(budget, int), "Budget Must Be Integer"
    assert set(['name', 'position', 'team', 'GW', 'value', 'pred', 'total_points']).issubset(data.columns), "Must Have Required Columns: name, position, team, GW, value, pred, total_points"
    
    # Helper Variables
    POS = data['position'].unique()
    CLUBS = data['team'].unique()
    pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

    positions = np.array(data.position)
    costs = np.array(data.value)
    points = np.array(data.pred)
    teams = np.array(data.team)
    
    # initializing the model
    model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
    # decision types
    # the format function inserts i into empty placeholder {} to create a list of possible inclusions for the model

    lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]
    subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]

    # defining model objective

    model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(data))), "Objective"

    # defining constraints

    # Budget constraint
    model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(data))) <= budget

    # Starting Goalkeeper constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'GK') == 1

    # Starting Defender constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') <= 5

    # Starting Midfielder constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') <= 5

    # Starting Forward constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') >= 1
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') <= 3

    # Team position constraints
    for pos in POS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if positions[i] == pos) == pos_available[pos]

    # Club constraint for team
    for club in CLUBS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if teams[i] == club) <= 3

    # Lineup size constraint

    model += pulp.lpSum(lineup[i] for i in range(len(data))) == 11

    # total team size constraint

    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data))) == 15

    # A player cannot be both a starter and a substitute
    for i in range(len(data)):
        model += (lineup[i] + subs[i]) <= 1

    model.solve()
    
    # Collect every selected player (starter or substitute) in the row order of data.
    # Rows are read by position with .iloc because data may be a filtered frame
    # whose index labels do not run from 0 to len(data) - 1.
    cols = ['name', 'team', 'position', 'GW', 'pred', 'total_points', 'value']
    squad_array = []
    for i in range(len(lineup)):
        # Compare against 0.5 since solvers can return values such as 0.9999999
        if lineup[i].value() > 0.5 or subs[i].value() > 0.5:
            squad_array.append(data.iloc[i][cols].tolist())
    
    squad_df = pd.DataFrame(data=squad_array, columns=cols)
    
    print(f"Total Predicted Score = {model.objective.value()}\nSquad Value = {squad_df.value.sum()}\nTrue Score = {squad_df.total_points.sum()}\n")
    return squad_df
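
The function above collects selected players in the row order of `data`, so starters and substitutes can be interleaved in the returned dataframe. If a lineup-first ordering is wanted, one option is to split the solved variable values into two index lists first. The helper `split_roles` below is a hypothetical sketch, not part of the notebook; it works on plain lists of values such as `[v.value() for v in lineup]`.

```python
def split_roles(x_values, y_values, threshold=0.5):
    """Return (starters, bench) as lists of row positions.

    x_values / y_values: solved values of the lineup and substitute variables.
    Solvers may return floats such as 0.9999999, so values are compared
    against a threshold instead of being tested for equality with 0 or 1.
    """
    starters = [i for i, v in enumerate(x_values) if v is not None and v > threshold]
    bench = [i for i, v in enumerate(y_values) if v is not None and v > threshold]
    return starters, bench


# Illustrative solver output for four players
starters, bench = split_roles([1.0, 0.0, 0.9999999, 0.0], [0.0, 1.0, 0.0, 0.0])
print(starters, bench)  # [0, 2] [1]
```

With these lists, `data.iloc[starters + bench]` would return the starting lineup first, followed by the substitutes.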

Function to select the highest-scoring squad based on the actual results.

In [6]:
def true_squad(data, budget):
    '''This function returns the selected 15-man squad (11 starters and 4 subs) in a dataframe, with rows in the
    same order as data. The squad is selected by mixed-integer linear programming with the total_points column as the objective.'''
    
    assert isinstance(data, pd.DataFrame), "Data Must Be Pandas DataFrame"
    assert isinstance(budget, int), "Budget Must Be Integer"
    assert set(['name', 'position', 'team', 'GW', 'value', 'pred', 'total_points']).issubset(data.columns), "Must Have Required Columns: name, position, team, GW, value, pred, total_points"
    
    # Helper Variables
    POS = data['position'].unique()
    CLUBS = data['team'].unique()
    pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

    positions = np.array(data.position)
    costs = np.array(data.value)
    points = np.array(data.total_points)
    teams = np.array(data.team)
    
    # initializing the model
    model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
    # decision types
    # the format function inserts i into empty placeholder {} to create a list of possible inclusions for the model

    lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]
    subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]

    # defining model objective

    model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(data))), "Objective"

    # defining constraints

    # Budget constraint
    model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(data))) <= budget

    # Starting Goalkeeper constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'GK') == 1

    # Starting Defender constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') <= 5

    # Starting Midfielder constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') <= 5

    # Starting Forward constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') >= 1
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') <= 3

    # Team position constraints
    for pos in POS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if positions[i] == pos) == pos_available[pos]

    # Club constraint for team
    for club in CLUBS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if teams[i] == club) <= 3

    # Lineup size constraint

    model += pulp.lpSum(lineup[i] for i in range(len(data))) == 11

    # total team size constraint

    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data))) == 15

    # A player cannot be both a starter and a substitute
    for i in range(len(data)):
        model += (lineup[i] + subs[i]) <= 1

    model.solve()
    
    # Collect every selected player (starter or substitute) in the row order of data.
    # Rows are read by position with .iloc because data may be a filtered frame
    # whose index labels do not run from 0 to len(data) - 1.
    cols = ['name', 'team', 'position', 'GW', 'pred', 'total_points', 'value']
    squad_array = []
    for i in range(len(lineup)):
        # Compare against 0.5 since solvers can return values such as 0.9999999
        if lineup[i].value() > 0.5 or subs[i].value() > 0.5:
            squad_array.append(data.iloc[i][cols].tolist())
    
    squad_df = pd.DataFrame(data=squad_array, columns=cols)
    
    print(f"True Score = {model.objective.value()}\nSquad Value = {squad_df.value.sum()}\nPredicted Score = {squad_df.pred.sum()}\n")
    return squad_df

In [16]:
# Testing optimization for gameweek 6

mypred_squad(df.loc[df['GW']==6], 1000)

Total Predicted Score = 47.50478077300091
Squad Value = 1000
True Score = 56



Unnamed: 0,name,team,position,GW,pred,total_points,value
0,Jack Harrison,Leeds,MID,6,3.633601,1,61
1,Erling Haaland,Man City,FWD,6,5.998943,9,119
2,Danny Ward,Leicester,GK,6,2.671414,2,40
3,Ben Davies,Liverpool,DEF,6,1.680175,0,40
4,Trent Alexander-Arnold,Liverpool,DEF,6,3.868259,1,75
5,Mohamed Salah,Liverpool,MID,6,5.909416,3,130
6,Nathan Patterson,Everton,DEF,6,2.453857,6,40
7,Gabriel Martinelli Silva,Arsenal,MID,6,4.070874,2,65
8,Neco Williams,Nott'm Forest,DEF,6,2.889454,4,41
9,Jordan Pickford,Everton,GK,6,3.071837,9,45


In [17]:
true_squad(df.loc[df['GW']==6], 1000)

True Score = 138.4
Squad Value = 823
Predicted Score = 38.743725482205704



Unnamed: 0,name,team,position,GW,pred,total_points,value
0,Ivan Toney,Brentford,FWD,6,3.758937,17,71
1,Dominic Solanke,Bournemouth,FWD,6,1.817277,12,57
2,Marcus Rashford,Man Utd,MID,6,3.519832,18,64
3,Ben Chilwell,Chelsea,DEF,6,1.365034,10,58
4,Kieran Trippier,Newcastle,DEF,6,3.094774,8,51
5,Daniel Castelo Podence,Wolves,MID,6,2.168623,11,55
6,Leandro Trossard,Brighton,MID,6,3.54785,12,65
7,Patson Daka,Leicester,FWD,6,0.891711,11,57
8,Thiago Emiliano da Silva,Chelsea,DEF,6,2.919174,8,55
9,Matheus Luiz Nunes,Wolves,MID,6,1.732177,9,50


For gameweek 6, the squad selected from our predictions had a predicted score of about 47.5 points and actually scored 56 points, which is about average for that gameweek. By comparison, the highest-scoring squad achieved a true score of 138.4 points, yet our model predicted only about 38.7 points for those players.
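
When reading these numbers, note that the two kinds of score printed by the functions are computed differently. The value taken from `model.objective.value()` weights substitutes by 0.1, while the other score is a plain sum over all 15 selected players. The sketch below illustrates the difference with made-up points; the function names and the numbers are hypothetical and not taken from the dataset.

```python
def weighted_objective(starter_points, bench_points, bench_weight=0.1):
    """Objective as used in the MILP: substitutes count at a reduced weight."""
    return sum(starter_points) + bench_weight * sum(bench_points)


def plain_total(starter_points, bench_points):
    """Plain sum over the whole squad, as computed from the squad dataframe."""
    return sum(starter_points) + sum(bench_points)


# Hypothetical points for 11 starters and 4 substitutes
starter_points = [9, 6, 4, 2, 2, 1, 1, 0, 3, 5, 7]
bench_points = [2, 1, 0, 1]

print(f"{weighted_objective(starter_points, bench_points):.1f}")  # 40.4
print(plain_total(starter_points, bench_points))                  # 44
```

Comparing like with like (both weighted, or both plain sums) gives a fairer picture of how far the predicted squad is from the best possible one.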

As reported in the previous notebook, our model largely overlooks emerging star performers and has no information about the opposing team, which can strongly influence a player's performance. Additionally, our model appears to be biased toward players with a track record of good performances, predicting that they will continue on the same trajectory rather than adjusting for external factors. This may reflect a flaw in our feature selection, since we only considered overall history and the previous 5 matches.