# <center> Squad Selector </center>

In this notebook, we will solve an FPL (Fantasy Premier League) squad-selection problem as a mixed-integer linear program (MILP). It is a variant of the knapsack problem.

The knapsack problem is a combinatorial optimization problem where you are given a set of items, each with a weight and a value. The program must determine a subset of items to include in the knapsack so that the total weight is less than or equal to a given limit and the total value is as large as possible.
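
To make the definition concrete, here is a minimal sketch of the classic 0/1 knapsack problem solved with dynamic programming. The function name `knapsack` and the item weights and values are made up for illustration; the FPL problem below adds more constraints and is solved with a MILP solver instead.

```python
def knapsack(weights, values, capacity):
    """Return (best_value, chosen_indices) for the 0/1 knapsack problem."""
    n = len(weights)
    # best[i][c] = best total value using only the first i items with capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: leave item i-1 out
            if w <= c:  # option 2: take item i-1 if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)

    # Walk back through the table to recover which items were taken
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return best[n][capacity], sorted(chosen)


# Hypothetical items (not FPL data): five items and a weight limit of 15
best_value, chosen = knapsack([12, 2, 1, 4, 1], [4, 2, 1, 10, 2], 15)
print(best_value, chosen)
```

In the FPL setting, the "weight" is a player's cost, the "value" is the player's points, and the capacity is the budget.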

# FPL MILP

In FPL, there is an objective and a set of rules (constraints) that we must adhere to when selecting a team. We will explain the problem in words and mathematical terms.

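Objective: Maximize the total points earned by the selected squad, where $x_i = 1$ if player $i$ is in the starting lineup, $y_i = 1$ if player $i$ is a substitute, and $V_i$ is the points of player $i$.

$F = \sum_{i=0}^{N}(x_i + y_i) * V_i$

Note that the code below multiplies $y_i$ by 0.1 in the objective, so starters' points count in full and substitutes' points count for a tenth.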

Constraints:
- The total cost must not exceed 100m (1000 in our dataset, since values are stored as integers in tenths of a million rather than floats)
    - $\sum_{i=0}^{N}(x_i+y_i) * C_i \le 1000$, where $C_i$ is the cost of player $i$
<br></br>
- There are 11 players in the starting lineup
    - $\sum_{i=0}^{N}x_i = 11$
<br></br>
- There are 15 players in the squad
    - $\sum_{i=0}^{N}(x_i + y_i) = 15$
<br></br>
- There are 2 Goalkeepers in the squad and only 1 in the lineup
    - $\sum_{j \in G}x_j = 1$
    - $\sum_{j \in G}(x_j + y_j) = 2$
<br></br>  
- There are between 3 and 5 defenders in the starting lineup and 5 in the squad
    - $3 \le \sum_{j \in D}x_j \le 5$
    - $\sum_{j \in D}(x_j + y_j) = 5$
<br></br>
- There are between 3 and 5 midfielders in the starting lineup and 5 in the squad
    - $3 \le \sum_{j \in M}x_j \le 5$
    - $\sum_{j \in M}(x_j + y_j) = 5$
<br></br> 
- There are between 1 and 3 forwards in the starting lineup and 3 in the squad
    - $1 \le \sum_{j \in F}x_j \le 3$
    - $\sum_{j \in F}(x_j + y_j) = 3$
<br></br>
- There cannot be more than 3 players from the same team in the squad
    - $\sum_{j \in T_k}(x_j + y_j) \le 3$ for every team $k$, where $T_k$ is the set of players at team $k$
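
The rules above can also be written as a plain-Python checker, which is handy for sanity-checking any squad the solver returns. This is only a sketch: the function `check_squad`, the dictionary layout of each player, and the example squad are assumptions for illustration, not part of the notebook's data.

```python
from collections import Counter

SQUAD_QUOTA = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}
LINEUP_RANGE = {'GK': (1, 1), 'DEF': (3, 5), 'MID': (3, 5), 'FWD': (1, 3)}


def check_squad(players, budget=1000, max_per_club=3):
    """Return the list of violated rules (an empty list means the squad is valid).

    Each player is a dict with keys: position, team, value, starter (bool).
    """
    problems = []
    starters = [p for p in players if p['starter']]
    if sum(p['value'] for p in players) > budget:
        problems.append('budget')
    if len(starters) != 11:
        problems.append('lineup size')
    if len(players) != 15:
        problems.append('squad size')
    squad_pos = Counter(p['position'] for p in players)
    lineup_pos = Counter(p['position'] for p in starters)
    for pos, quota in SQUAD_QUOTA.items():
        if squad_pos[pos] != quota:
            problems.append(f'{pos} squad count')
        low, high = LINEUP_RANGE[pos]
        if not low <= lineup_pos[pos] <= high:
            problems.append(f'{pos} lineup count')
    club_counts = Counter(p['team'] for p in players)
    if any(n > max_per_club for n in club_counts.values()):
        problems.append('club limit')
    return problems


def make(position, team, starter, value=60):
    """Build one hypothetical player record."""
    return {'position': position, 'team': team, 'value': value, 'starter': starter}


# Hypothetical squad: 5 clubs (A-E), 3 players each, 11 starters, total cost 900
squad = (
    [make('GK', 'A', True), make('GK', 'B', False)]
    + [make('DEF', t, s) for t, s in zip('ABCDE', [True, True, True, True, False])]
    + [make('MID', t, s) for t, s in zip('ABCDE', [True, True, True, True, False])]
    + [make('FWD', t, s) for t, s in zip('CDE', [True, True, False])]
)
print(check_squad(squad))
```

Raising every player's value to 70 pushes the total to 1050, and the checker then reports a budget violation.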

Let's start by selecting a team for the season so far, subject to the above constraints, using the aggregated data and PuLP (`pulp`), a Python library for linear and mixed-integer programming.

In [1]:
import pandas as pd
import numpy as np
import pulp

In [2]:
agg = pd.read_csv('../data/aggregated.csv', index_col=0)
agg.head()

Unnamed: 0,name,team,position,total_points,minutes,goals_scored,assists,clean_sheets,saves,penalties_saved,penalties_missed,goals_conceded,own_goals,yellow_cards,red_cards,value,PV-ratio
0,Erling Haaland,Man City,FWD,189,1950,27,4,7,0,0,0,23,0,4,0,122,1.54918
1,Harry Kane,Spurs,FWD,157,2236,17,6,9,0,0,1,36,0,4,0,118,1.330508
2,Marcus Rashford,Man Utd,MID,156,1999,15,4,9,0,0,0,25,0,3,0,73,2.136986
3,Kieran Trippier,Newcastle,DEF,154,2107,1,5,15,0,0,0,13,0,6,0,61,2.52459
4,Martin Ødegaard,Arsenal,MID,144,1954,9,8,8,0,0,0,24,0,3,0,70,2.057143


Below we create helper variables that will help us set and adhere to the model's constraints.

In [3]:
# Creating helper variables
POS = agg['position'].unique()
CLUBS = agg['team'].unique()
budget = 1000
pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

positions = np.array(agg.position)
costs = np.array(agg.value)
points = np.array(agg.total_points)
teams = np.array(agg.team)

In [4]:
# initializing the model
model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
# decision types
# the format function inserts i into empty placeholder {} to create a list of possible inclusions for the model

lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(agg))]
subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(agg))]

# defining model objective

model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(agg))), "Objective"

# defining constraints

# Budget constraint
model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(agg))) <= budget

# Starting Goalkeeper constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'GK') == 1

# Starting Defender constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'DEF') >= 3
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'DEF') <= 5

# Starting Midfielder constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'MID') >= 3
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'MID') <= 5

# Starting Forward constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'FWD') >= 1
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'FWD') <= 3

# Team position constraints
for pos in POS:
    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(agg)) if positions[i] == pos) == pos_available[pos]

# Club constraint for team
for club in CLUBS:
    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(agg)) if teams[i] == club) <= 3

# Lineup size constraint

model += pulp.lpSum(lineup[i] for i in range(len(agg))) == 11

# total team size constraint

model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(agg))) == 15

for i in range(len(agg)):
    model += (lineup[i] + subs[i]) <= 1  # a player cannot be both a starter and a substitute
    
model.solve()

1

The returned value of 1 is PuLP's status code for an optimal solution (`pulp.LpStatus[1]` is `'Optimal'`), so the model successfully found a solution. The following code blocks extract the selected players and create a dataframe of the squad chosen by the MILP model.

In [5]:
squad_array = []
for i in range(len(lineup)):
    if lineup[i].value() != 0:
        squad_array.append([agg.name[i], agg.team[i], agg.position[i], agg.total_points[i], agg.value[i]])
    if subs[i].value() != 0:
        squad_array.append([agg.name[i], agg.team[i], agg.position[i], agg.total_points[i], agg.value[i]])
    
squad_df = pd.DataFrame(data=squad_array,columns=['name', 'team', 'position', 'points','value'])
    
print(f"Total Score = {model.objective.value()}")
print(f"Squad Cost = {squad_df.value.sum()}")
squad_df

Total Score = 1563.3000000000002
Squad Cost = 1000


Unnamed: 0,name,team,position,points,value
0,Erling Haaland,Man City,FWD,189,122
1,Harry Kane,Spurs,FWD,157,118
2,Marcus Rashford,Man Utd,MID,156,73
3,Kieran Trippier,Newcastle,DEF,154,61
4,Martin Ødegaard,Arsenal,MID,144,70
5,Bukayo Saka,Arsenal,MID,138,84
6,Ivan Toney,Brentford,FWD,134,77
7,Miguel Almirón Rejala,Newcastle,MID,128,56
8,David Raya Martin,Brentford,GK,116,47
9,Fabian Schär,Newcastle,DEF,108,52


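The above is the best overall squad for the season so far. The rows are appended in dataset order, so starters and substitutes are interleaved rather than listed as the starting XI followed by the 4 substitutes, and the displayed table is truncated. The reported score of 1563.3 is the objective value, that is, the starters' points plus a tenth of the substitutes' points. The total cost is 1000, which uses the entire budget.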
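
Because the objective applies a 0.1 weight to substitutes (the `subs[i]*0.1` term in the model), the printed score differs from the squad's raw points total. The sketch below shows one way to separate the two and to list starters before substitutes. The helper `summarize`, the tuple layout, and the three example rows are hypothetical, not taken from the notebook's data.

```python
BENCH_WEIGHT = 0.1  # same weight the model's objective applies to substitutes


def summarize(selection):
    """Summarize a selection given as (name, points, is_starter) tuples."""
    # sorted() is stable, so players keep their original order within each group
    ordered = sorted(selection, key=lambda row: not row[2])  # starters first
    starter_pts = sum(pts for _, pts, is_starter in selection if is_starter)
    bench_pts = sum(pts for _, pts, is_starter in selection if not is_starter)
    return {
        'ordered_names': [name for name, _, _ in ordered],
        'objective': starter_pts + BENCH_WEIGHT * bench_pts,
        'raw_total': starter_pts + bench_pts,
    }


# Hypothetical mini-selection: two starters and one substitute
summary = summarize([('A', 100, True), ('B', 40, False), ('C', 80, True)])
print(summary)
```

With the solver's variables, the `is_starter` flag would come from `lineup[i].value()`, and a player is a substitute when `subs[i].value()` is 1.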

Since the model worked, we can now generalize it into a function that selects a squad based on the `xP` column on a week-by-week basis. We can create another function that selects a squad based on the ground truth (`total_points`) column, then compare the two to determine whether the expected-points squad did as well as the truly highest-scoring squad in a particular week. Lastly, we can create another function that selects a squad based on our model's predictions **WHATEVER OUR MODEL VERSION IS CALLED** and compare the three selected squads.
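
As a rough sketch of what such a comparison could look like, the snippet below scores two lineups against the points that were actually achieved. The function names, the player names, and the numbers are all hypothetical; the notebook's real comparison may well be structured differently.

```python
def actual_lineup_points(starters, actual_points):
    """Sum the realised points of the named starters (unknown players count as 0)."""
    return sum(actual_points.get(name, 0) for name in starters)


def compare(expected_starters, best_starters, actual_points):
    """Compare the xP-based lineup with the lineup that was best in hindsight."""
    expected = actual_lineup_points(expected_starters, actual_points)
    best = actual_lineup_points(best_starters, actual_points)
    return {'expected': expected, 'best': best, 'gap': best - expected}


# Hypothetical realised points for one gameweek
actual = {'P1': 12, 'P2': 2, 'P3': 9, 'P4': 6}
result = compare(['P1', 'P2'], ['P1', 'P3'], actual)
print(result)
```

A gap of 0 would mean the expected-points lineup matched the best possible lineup for that week.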

Function to select squad based on `xP` column.

In [6]:
def xP_squad(data, budget):
    '''Return a 15-player squad as a dataframe, selected by mixed-integer linear programming with the xP column
    as the objective. Rows follow the order of `data`, so starters and substitutes are interleaved.'''
    
    assert isinstance(data, pd.DataFrame), "Data Must Be Pandas DataFrame"
    assert isinstance(budget, int), "Budget Must Be Integer"
    assert {'name', 'GW', 'position', 'team', 'value', 'xP'}.issubset(data.columns), "Must Have Required Columns: name, GW, position, team, value, xP"
    
    # Helper Variables
    POS = data['position'].unique()
    CLUBS = data['team'].unique()
    pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

    positions = np.array(data.position)
    costs = np.array(data.value)
    points = np.array(data.xP)
    teams = np.array(data.team)
    
    # initializing the model
    model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
    # decision types
    # the format function inserts i into empty placeholder {} to create a list of possible inclusions for the model

    lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]
    subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]

    # defining model objective

    model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(data))), "Objective"

    # defining constraints

    # Budget constraint
    model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(data))) <= budget

    # Starting Goalkeeper constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'GK') == 1

    # Starting Defender constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') <= 5

    # Starting Midfielder constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') <= 5

    # Starting Forward constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') >= 1
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') <= 3

    # Team position constraints
    for pos in POS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if positions[i] == pos) == pos_available[pos]

    # Club constraint for team
    for club in CLUBS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if teams[i] == club) <= 3

    # Lineup size constraint

    model += pulp.lpSum(lineup[i] for i in range(len(data))) == 11

    # total team size constraint

    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data))) == 15

    for i in range(len(data)):
        model += (lineup[i] + subs[i]) <= 1  # a player cannot be both a starter and a substitute

    model.solve()
    
    squad_array = []
    for i in range(len(lineup)):
        if lineup[i].value() != 0:
            squad_array.append([data.name[i], data.team[i], data.position[i],  data.GW[i], data.xP[i], data.value[i]])
        if subs[i].value() != 0:
            squad_array.append([data.name[i], data.team[i], data.position[i],  data.GW[i], data.xP[i], data.value[i]])
    
    squad_df = pd.DataFrame(data=squad_array,columns=['name', 'team', 'position', 'GW', 'xP','value'])
    
    print(f"Total Score = {model.objective.value()}\nSquad Value = {squad_df.value.sum()}\n\n")
    return squad_df

In [7]:
# reading in the cleaned dataframe to select a squad by gameweek
df = pd.read_csv('../data/cleaned.csv', index_col=0)
df.head()

Unnamed: 0,name,position,team,GW,xP,minutes,goals_scored,assists,clean_sheets,saves,...,penalties_missed,goals_conceded,own_goals,yellow_cards,red_cards,influence,creativity,threat,value,total_points
0,Nathan Redmond,MID,Southampton,1,1.5,1,0,0,0,0,...,0,0,0,0,0,0.0,0.0,0.0,55,1
1,Junior Stanislas,MID,Bournemouth,1,1.1,1,0,0,0,0,...,0,0,0,0,0,0.0,0.0,0.0,50,1
2,Armando Broja,FWD,Chelsea,1,2.0,15,0,0,0,0,...,0,0,0,0,0,5.2,0.3,19.0,55,1
3,Fabian Schär,DEF,Newcastle,1,2.4,90,1,0,1,0,...,0,0,0,0,0,66.0,14.6,25.0,45,15
4,Jonny Evans,DEF,Leicester,1,1.9,90,0,0,0,0,...,0,2,0,0,0,14.0,1.3,0.0,45,1


In [8]:
# Selecting squads based on predicted points for following gameweek and storing them in an array
expected_squads = []
for gw in df['GW'].unique():
    data = df.loc[df['GW'] == gw].reset_index()
    print(f'Prediction for GW_{gw +1}')
    expected_squads.append(xP_squad(data, 1000))

Prediction for GW_2
Total Score = 47.269999999999996
Squad Value = 1000


Prediction for GW_3
Total Score = 108.39999999999999
Squad Value = 979


Prediction for GW_4
Total Score = 93.62
Squad Value = 996


Prediction for GW_5
Total Score = 93.23
Squad Value = 995


Prediction for GW_6
Total Score = 86.60000000000001
Squad Value = 998


Prediction for GW_7
Total Score = 84.28
Squad Value = 999


Prediction for GW_9
Total Score = 79.35000000000001
Squad Value = 999


Prediction for GW_10
Total Score = 122.14999999999999
Squad Value = 996


Prediction for GW_11
Total Score = 107.04000000000002
Squad Value = 999


Prediction for GW_12
Total Score = 90.13
Squad Value = 1000


Prediction for GW_13
Total Score = None
Squad Value = 771


Prediction for GW_14
Total Score = 88.03
Squad Value = 997


Prediction for GW_15
Total Score = 81.27000000000001
Squad Value = 985


Prediction for GW_16
Total Score = 79.86999999999999
Squad Value = 998


Prediction for GW_17
Total Score = 95.66000000000001

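Note that there is no gameweek 7 in the data since it was cancelled by the Premier League, so no prediction labelled GW_8 is produced.

The xP values recorded in gameweek 12, which are used to predict gameweek 13, are all 0. This is not correct, and is possibly due to gameweek 12 being a blank gameweek, leaving too little data to predict gameweek 13. With every xP equal to 0 the reported score is `None`, and the GW_13 squad should be read as an arbitrary feasible selection rather than a meaningful prediction.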
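
One way to guard against both issues is to derive each prediction label from the gameweek value itself and to skip weeks whose xP values are all 0. The sketch below works on plain `(GW, xP)` pairs; the function `gameweeks_to_predict` and the sample rows are hypothetical and only illustrate the idea.

```python
def gameweeks_to_predict(rows):
    """Map each usable source gameweek to the gameweek it predicts.

    rows is an iterable of (GW, xP) pairs. A source week is skipped when
    every xP in it is 0, since the objective would then carry no information.
    """
    xp_by_gw = {}
    for gw, xp in rows:
        xp_by_gw.setdefault(gw, []).append(xp)

    usable, skipped = {}, []
    for gw in sorted(xp_by_gw):
        if any(xp != 0 for xp in xp_by_gw[gw]):
            usable[gw] = gw + 1  # data from week gw predicts week gw + 1
        else:
            skipped.append(gw)
    return usable, skipped


# Hypothetical rows: week 7 is absent and week 12 has only zeros
rows = [(6, 2.0), (6, 1.5), (8, 3.0), (12, 0.0), (12, 0.0)]
usable, skipped = gameweeks_to_predict(rows)
print(usable, skipped)
```

Labelling by `gw + 1` avoids hard-coding the position of the missing week in any later loop.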

In [9]:
# Showing predicted team for gameweek 2 based on xP from GW 1
expected_squads[0]

Unnamed: 0,name,team,position,GW,xP,value
0,Erling Haaland,Man City,FWD,1,5.0,115
1,João Cancelo,Man City,DEF,1,4.7,70
2,David Raya Martin,Brentford,GK,1,2.2,45
3,Trent Alexander-Arnold,Liverpool,DEF,1,5.5,75
4,Cameron Archer,Aston Villa,FWD,1,1.5,45
5,Ben Chilwell,Chelsea,DEF,1,3.5,60
6,Emiliano Buendía Stati,Aston Villa,MID,1,2.7,60
7,Joelinton Cássio Apolinário de Lira,Newcastle,MID,1,2.7,60
8,Sean Longstaff,Newcastle,MID,1,1.5,45
9,Jacob Murphy,Newcastle,MID,1,1.5,45


In [10]:
# Expected points for gameweek 13
expected_squads[10]

Unnamed: 0,name,team,position,GW,xP,value
0,Junior Stanislas,Bournemouth,MID,12,0.0,48
1,Armando Broja,Chelsea,FWD,12,0.0,53
2,Leander Dendoncker,Aston Villa,MID,12,0.0,47
3,Roberto Firmino,Liverpool,FWD,12,0.0,81
4,João Palhinha Gonçalves,Fulham,MID,12,0.0,50
5,Kieran Trippier,Newcastle,DEF,12,0.0,57
6,Bernd Leno,Fulham,GK,12,0.0,45
7,Kyle Walker-Peters,Southampton,DEF,12,0.0,45
8,Joe Rothwell,Bournemouth,MID,12,0.0,49
9,Saman Ghoddos,Brentford,MID,12,0.0,48


In [11]:
# Showing the selected squads by gameweek
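for i in range(len(expected_squads)):
    if i < 6:
        print(f"xP_GW_{i+2}\n\n{expected_squads[i]}\n\n")
    else:
        if i == 6:
            # GW 7 has no data, so there is no GW_8 prediction
            print(f"xP_GW_{i+2} DOES NOT EXIST")
        # expected_squads[6] holds the GW 8 data (the GW_9 prediction), so it must still be printed
        print(f"xP_GW_{i+3}\n\n{expected_squads[i]}\n\n")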

xP_GW_2

                                   name         team position  GW   xP  value
0                        Erling Haaland     Man City      FWD   1  5.0    115
1                          João Cancelo     Man City      DEF   1  4.7     70
2                     David Raya Martin    Brentford       GK   1  2.2     45
3                Trent Alexander-Arnold    Liverpool      DEF   1  5.5     75
4                        Cameron Archer  Aston Villa      FWD   1  1.5     45
5                          Ben Chilwell      Chelsea      DEF   1  3.5     60
6                Emiliano Buendía Stati  Aston Villa      MID   1  2.7     60
7   Joelinton Cássio Apolinário de Lira    Newcastle      MID   1  2.7     60
8                        Sean Longstaff    Newcastle      MID   1  1.5     45
9                          Jacob Murphy    Newcastle      MID   1  1.5     45
10                       Ilkay Gündogan     Man City      MID   1  3.6     75
11                     Andrew Robertson    Liverpool   