# <center> Squad Selector </center>

Author:  Farhan Kassam

In this notebook, we will formulate Fantasy Premier League (FPL) squad selection as a mixed-integer linear program (MILP), a constrained variant of the knapsack problem.

The knapsack problem is a combinatorial optimization problem where you are given a set of items, each with a weight and a value. The program must determine a subset of items to include in the knapsack so that the total weight is less than or equal to a given limit and the total value is as large as possible.
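
To make the definition concrete, here is a minimal sketch of the classic 0/1 knapsack solved with dynamic programming. The `knapsack` helper and the item weights and values are made up for illustration and are not part of this notebook's data.

```python
def knapsack(weights, values, capacity):
    """Return the best total value of a subset whose total weight fits in capacity."""
    # best[c] = highest value achievable with total weight at most c
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # walk capacities downwards so that each item is used at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


# Hypothetical items
weights = [12, 7, 6, 5]
values = [24, 13, 12, 9]
print(knapsack(weights, values, capacity=18))
```

The FPL version adds further constraints (positions, clubs, lineup size) on top of the single weight limit, which is why a MILP solver is used later instead of this simple recurrence.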

**Reference**: <br>
Khalid, Irfan. Sep, 2021. <i>How to Build A Fantasy Premier League Team with Data Science</i>. https://towardsdatascience.com/how-to-build-a-fantasy-premier-league-team-with-data-science-f01283281236.

# FPL MILP

In FPL, there is an objective and a set of rules (constraints) that we must adhere to when selecting a team. We will explain the problem in words and mathematical terms.

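Objective: Maximize the total points earned by the selected squad

$F = \sum_{i=0}^{N}(x_i + y_i) \cdot V_i$

where $x_i = 1$ if player $i$ is in the starting lineup, $y_i = 1$ if player $i$ is a substitute (both are 0 otherwise), $V_i$ is the player's points and $C_i$ is the player's cost. In the code below, the substitute term is weighted by 0.1, so the solver favours putting the highest scorers in the starting lineup.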

Constraints:
- The total cost must not exceed 100m (1000 in our dataset, where values are stored as integers in units of 0.1m)
    - $\sum_{i=0}^{N}(x_i+y_i) \cdot C_i \le 1000$
<br></br>
- There are 11 players in the starting lineup
    - $\sum_{i=0}^{N}x_i = 11$
<br></br>
- There are 15 players in the squad
    - $\sum_{i=0}^{N}(x_i + y_i) = 15$
<br></br>
- There are 2 Goalkeepers in the squad and only 1 in the lineup
    - $\sum_{j \in G}x_j = 1$
    - $\sum_{j \in G}(x_j + y_j) = 2$
<br></br>  
- There are between 3-5 defenders in the starting lineup and 5 in the squad
    - $3 \le \sum_{j \in D}x_j \le 5$
    - $\sum_{j \in D}(x_j + y_j) = 5$
<br></br>
- There are between 3-5 midfielders in the starting lineup and 5 in the squad
    - $3 \le \sum_{j \in M}x_j \le 5$
    - $\sum_{j \in M}(x_j + y_j) = 5$
<br></br> 
- There are between 1-3 forwards in the starting lineup and 3 in the squad
    - $1 \le \sum_{j \in F}x_j \le 3$
    - $\sum_{j \in F}(x_j + y_j) = 3$
<br></br>
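- There cannot be more than 3 players from the same team in the squad
    - $\sum_{j \in T_k}(x_j + y_j) \le 3$ for every club $k$, where $T_k$ is the set of players at club $k$
<br></br>
- A player cannot be both a starter and a substitute
    - $x_i + y_i \le 1$ for every player $i$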
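
The constraints above can also be read as a checklist. The sketch below tests a candidate squad against them in plain Python. The `is_valid_squad` helper, the tuple layout and the toy squad are assumptions for illustration; the notebook itself enforces the rules inside the PuLP model instead.

```python
from collections import Counter

SQUAD_QUOTA = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}
LINEUP_RANGE = {"GK": (1, 1), "DEF": (3, 5), "MID": (3, 5), "FWD": (1, 3)}


def is_valid_squad(lineup, subs, budget=1000):
    """Check a squad against the listed rules.

    lineup and subs are lists of (position, team, cost) tuples.
    """
    squad = lineup + subs
    if len(lineup) != 11 or len(squad) != 15:
        return False
    if sum(cost for _, _, cost in squad) > budget:
        return False
    squad_pos = Counter(pos for pos, _, _ in squad)
    lineup_pos = Counter(pos for pos, _, _ in lineup)
    for pos, quota in SQUAD_QUOTA.items():
        low, high = LINEUP_RANGE[pos]
        if squad_pos[pos] != quota or not low <= lineup_pos[pos] <= high:
            return False
    # at most three players from any one club
    clubs = Counter(team for _, team, _ in squad)
    return max(clubs.values()) <= 3


# Hypothetical squad: every player costs 60 and clubs A-F are spread out
lineup = ([("GK", "A", 60)] + [("DEF", t, 60) for t in "ABCD"]
          + [("MID", t, 60) for t in "BCDE"] + [("FWD", t, 60) for t in "AE"])
subs = [("GK", "C", 60), ("DEF", "E", 60), ("MID", "F", 60), ("FWD", "D", 60)]
print(is_valid_squad(lineup, subs))
```

A checker like this is handy for sanity-testing whatever squad the solver returns, since it is independent of the solver's own constraint definitions.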

Let's start by selecting a team under the above constraints for the season so far, using the aggregated data and PuLP, a Python library for linear and mixed-integer programming.

In [1]:
import pandas as pd
import numpy as np
import pulp

In [2]:
agg = pd.read_csv('../data/aggregated.csv', index_col=0)
agg.head()

Unnamed: 0,name,team,position,total_points,minutes,goals_scored,assists,clean_sheets,saves,penalties_saved,penalties_missed,goals_conceded,own_goals,yellow_cards,red_cards,value,PV-ratio
0,Erling Haaland,Man City,FWD,247,2404,34,8,11,0,0,0,24,0,6,0,122,2.02459
1,Harry Kane,Spurs,FWD,219,3046,24,8,10,0,0,1,54,0,5,0,118,1.855932
2,Gabriel Martinelli Silva,Arsenal,MID,200,2750,16,9,13,0,0,0,34,0,3,0,65,3.076923
3,Martin Ødegaard,Arsenal,MID,193,2809,13,9,11,0,0,0,37,0,4,0,70,2.757143
4,Bukayo Saka,Arsenal,MID,193,2875,14,12,11,0,0,1,39,0,7,0,84,2.297619


Below we create helper variables that will help us set and adhere to the model's constraints.

In [3]:
# Creating helper variables
POS = agg['position'].unique()
CLUBS = agg['team'].unique()
budget = 1000
pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

positions = np.array(agg.position)
costs = np.array(agg.value)
points = np.array(agg.total_points)
teams = np.array(agg.team)

In [4]:
# initializing the model
model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
# decision variables: x_i = 1 if player i starts, y_i = 1 if player i is a substitute
# str.format inserts i into the {} placeholder so that each variable gets a unique name

lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(agg))]
subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(agg))]

# defining model objective

model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(agg))), "Objective"

# defining constraints

# Budget constraint
model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(agg))) <= budget

# Starting Goalkeeper constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'GK') == 1

# Starting Defender constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'DEF') >= 3
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'DEF') <= 5

# Starting Midfielder constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'MID') >= 3
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'MID') <= 5

# Starting Forward constraint
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'FWD') >= 1
model += pulp.lpSum(lineup[i] for i in range(len(agg)) if positions[i] == 'FWD') <= 3

# Team position constraints
for pos in POS:
    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(agg)) if positions[i] == pos) == pos_available[pos]

# Club constraint for team
for club in CLUBS:
    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(agg)) if teams[i] == club) <= 3

# Lineup size constraint

model += pulp.lpSum(lineup[i] for i in range(len(agg))) == 11

# total team size constraint

model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(agg))) == 15

for i in range(len(agg)):
    model += (lineup[i] + subs[i]) <= 1  # a player cannot be both a starter and a substitute
    
model.solve()

1

`model.solve()` returned 1, which is PuLP's status code for an optimal solution, so the model successfully found a solution. The following code blocks extract the players and create a dataframe of the squad selected by the MILP model.

In [5]:
squad_array = []
for i in range(len(lineup)):
    # solver values are floats, so test against 0.5 instead of != 0;
    # the x_i + y_i <= 1 constraint guarantees at most one of the two is set
    if lineup[i].value() > 0.5 or subs[i].value() > 0.5:
        squad_array.append([agg.name[i], agg.team[i], agg.position[i], agg.total_points[i], agg.value[i]])
    
squad_df = pd.DataFrame(data=squad_array,columns=['name', 'team', 'position', 'points','value'])
    
print(f"Total Score = {model.objective.value()}")
print(f"Squad Cost = {squad_df.value.sum()}")
squad_df

Total Score = 2077.2
Squad Cost = 1000


Unnamed: 0,name,team,position,points,value
0,Erling Haaland,Man City,FWD,247,122
1,Harry Kane,Spurs,FWD,219,118
2,Gabriel Martinelli Silva,Arsenal,MID,200,65
3,Martin Ødegaard,Arsenal,MID,193,70
4,Bukayo Saka,Arsenal,MID,193,84
5,Marcus Rashford,Man Utd,MID,191,73
6,Ivan Toney,Brentford,FWD,185,77
7,Kieran Trippier,Newcastle,DEF,183,61
8,David Raya Martin,Brentford,GK,153,47
9,José Malheiro de Sá,Wolves,GK,143,50


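The above is the best overall squad for the season so far: 11 starters and 4 substitutes. The rows follow the order of the original dataframe, so starters and substitutes are interleaved rather than listed as a starting XI followed by the bench. The reported score of 2077.2 is the objective value, which counts the starting XI's points in full and the substitutes' points at 10%. The total cost is 1000, using the entire budget.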
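
The `Total Score` printed by the model is its objective value, in which substitutes' points are multiplied by 0.1, so it differs from a plain sum of the squad's points. The sketch below shows the arithmetic on made-up point totals; `objective_value` and `sub_weight` are hypothetical names used only here.

```python
def objective_value(lineup_points, sub_points, sub_weight=0.1):
    """Objective used in the model: starters count in full, substitutes at sub_weight."""
    return sum(lineup_points) + sub_weight * sum(sub_points)


# Made-up point totals for 11 starters and 4 substitutes
lineup_points = [150] * 11
sub_points = [100] * 4

print("objective value:", objective_value(lineup_points, sub_points))
print("plain squad sum:", sum(lineup_points) + sum(sub_points))
```

The small weight still rewards stronger substitutes, but a point on the bench is worth only a tenth of a point in the starting XI.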

# Generalizing The Problem

Since the model worked, we can now generalize it into a function that selects a squad based on the `xP` column on a week-by-week basis. We can create a second function that selects a squad based on the ground truth (`actual_points`) column and then compare the two to determine whether the expected-points squad did as well as the truly highest-scoring squad in a particular week. Lastly, we can create a third function that selects a squad based on our model's predictions (`mypred`) and compare the three selected squads.

Function to select squad based on `xP` column.

In [6]:
def xP_squad(data, budget):
    '''Return a 15-man squad (11 starters and 4 substitutes) as a dataframe, selected by mixed-integer linear
    programming with the xP column as the objective. Rows keep the order of the input dataframe, so starters
    and substitutes are interleaved. The input needs a default integer index (for example after reset_index()),
    because rows are looked up by position i.'''
    
    assert isinstance(data, pd.DataFrame), "Data Must Be Pandas DataFrame"
    assert isinstance(budget, int), "Budget Must Be Integer"
    required = {'name', 'team', 'position', 'GW', 'value', 'xP'}  # every column used below
    assert required.issubset(data.columns), f"Must Have Required Columns: {sorted(required)}"
    
    # Helper Variables
    POS = data['position'].unique()
    CLUBS = data['team'].unique()
    budget = budget
    pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

    positions = np.array(data.position)
    costs = np.array(data.value)
    points = np.array(data.xP)
    teams = np.array(data.team)
    
    # initializing the model
    model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
    # decision variables: x_i = 1 if player i starts, y_i = 1 if player i is a substitute
    # str.format inserts i into the {} placeholder so that each variable gets a unique name

    lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]
    subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]

    # defining model objective

    model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(data))), "Objective"

    # defining constraints

    # Budget constraint
    model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(data))) <= budget

    # Starting Goalkeeper constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'GK') == 1

    # Starting Defender constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') <= 5

    # Starting Midfielder constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') <= 5

    # Starting Forward constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') >= 1
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') <= 3

    # Team position constraints
    for pos in POS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if positions[i] == pos) == pos_available[pos]

    # Club constraint for team
    for club in CLUBS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if teams[i] == club) <= 3

    # Lineup size constraint

    model += pulp.lpSum(lineup[i] for i in range(len(data))) == 11

    # total team size constraint

    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data))) == 15

    for i in range(len(data)):
        model += (lineup[i] + subs[i]) <= 1  # a player cannot be both a starter and a substitute

    model.solve()
    
    squad_array = []
    for i in range(len(lineup)):
        # solver values are floats, so test against 0.5 instead of != 0;
        # the constraint above guarantees at most one of the two is set
        if lineup[i].value() > 0.5 or subs[i].value() > 0.5:
            squad_array.append([data.name[i], data.team[i], data.position[i],  data.GW[i], data.xP[i], data.value[i]])
    
    squad_df = pd.DataFrame(data=squad_array,columns=['name', 'team', 'position', 'GW', 'xP','value'])
    
    print(f"Total Score = {model.objective.value()}\nSquad Value = {squad_df.value.sum()}\n\n")
    return squad_df

Function to select my model's predicted squad.

In [7]:
def mypred_squad(data, budget):
    '''Return a 15-man squad (11 starters and 4 substitutes) as a dataframe, selected by mixed-integer linear
    programming with the mypred column as the objective. Rows keep the order of the input dataframe, so starters
    and substitutes are interleaved. The input needs a default integer index (for example after reset_index()),
    because rows are looked up by position i.'''
    
    assert isinstance(data, pd.DataFrame), "Data Must Be Pandas DataFrame"
    assert isinstance(budget, int), "Budget Must Be Integer"
    required = {'name', 'team', 'position', 'GW', 'value', 'mypred'}  # every column used below
    assert required.issubset(data.columns), f"Must Have Required Columns: {sorted(required)}"
    
    # Helper Variables
    POS = data['position'].unique()
    CLUBS = data['team'].unique()
    budget = budget
    pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

    positions = np.array(data.position)
    costs = np.array(data.value)
    points = np.array(data.mypred)
    teams = np.array(data.team)
    
    # initializing the model
    model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
    # decision variables: x_i = 1 if player i starts, y_i = 1 if player i is a substitute
    # str.format inserts i into the {} placeholder so that each variable gets a unique name

    lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]
    subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]

    # defining model objective

    model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(data))), "Objective"

    # defining constraints

    # Budget constraint
    model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(data))) <= budget

    # Starting Goalkeeper constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'GK') == 1

    # Starting Defender constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') <= 5

    # Starting Midfielder constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') <= 5

    # Starting Forward constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') >= 1
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') <= 3

    # Team position constraints
    for pos in POS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if positions[i] == pos) == pos_available[pos]

    # Club constraint for team
    for club in CLUBS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if teams[i] == club) <= 3

    # Lineup size constraint

    model += pulp.lpSum(lineup[i] for i in range(len(data))) == 11

    # total team size constraint

    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data))) == 15

    for i in range(len(data)):
        model += (lineup[i] + subs[i]) <= 1  # a player cannot be both a starter and a substitute

    model.solve()
    
    squad_array = []
    for i in range(len(lineup)):
        # solver values are floats, so test against 0.5 instead of != 0;
        # the constraint above guarantees at most one of the two is set
        if lineup[i].value() > 0.5 or subs[i].value() > 0.5:
            squad_array.append([data.name[i], data.team[i], data.position[i],  data.GW[i], data.mypred[i], data.value[i]])
    
    squad_df = pd.DataFrame(data=squad_array,columns=['name', 'team', 'position', 'GW', 'mypred','value'])
    
    print(f"Total Score = {model.objective.value()}\nSquad Value = {squad_df.value.sum()}\n\n")
    return squad_df

Function to return actual results squad.

In [8]:
def true_squad(data, budget):
    '''Return a 15-man squad (11 starters and 4 substitutes) as a dataframe, selected by mixed-integer linear
    programming with the actual_points column as the objective. Rows keep the order of the input dataframe, so
    starters and substitutes are interleaved. The input needs a default integer index (for example after
    reset_index()), because rows are looked up by position i.'''
    
    assert isinstance(data, pd.DataFrame), "Data Must Be Pandas DataFrame"
    assert isinstance(budget, int), "Budget Must Be Integer"
    required = {'name', 'team', 'position', 'GW', 'value', 'actual_points'}  # every column used below
    assert required.issubset(data.columns), f"Must Have Required Columns: {sorted(required)}"
    
    # Helper Variables
    POS = data['position'].unique()
    CLUBS = data['team'].unique()
    budget = budget
    pos_available = {'GK': 2, 'DEF': 5, 'MID': 5, 'FWD': 3}

    positions = np.array(data.position)
    costs = np.array(data.value)
    points = np.array(data.actual_points)
    teams = np.array(data.team)
    
    # initializing the model
    model = pulp.LpProblem("FPL-Optimization", pulp.LpMaximize)
    # decision variables: x_i = 1 if player i starts, y_i = 1 if player i is a substitute
    # str.format inserts i into the {} placeholder so that each variable gets a unique name

    lineup = [pulp.LpVariable("x_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]
    subs = [pulp.LpVariable("y_{}".format(i), lowBound = 0, upBound = 1, cat = 'Integer') for i in range(len(data))]

    # defining model objective

    model += pulp.lpSum((lineup[i] + subs[i]*0.1) * points[i] for i in range(len(data))), "Objective"

    # defining constraints

    # Budget constraint
    model += pulp.lpSum((lineup[i] + subs[i]) * costs[i] for i in range(len(data))) <= budget

    # Starting Goalkeeper constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'GK') == 1

    # Starting Defender constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'DEF') <= 5

    # Starting Midfielder constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') >= 3
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'MID') <= 5

    # Starting Forward constraint
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') >= 1
    model += pulp.lpSum(lineup[i] for i in range(len(data)) if positions[i] == 'FWD') <= 3

    # Team position constraints
    for pos in POS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if positions[i] == pos) == pos_available[pos]

    # Club constraint for team
    for club in CLUBS:
        model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data)) if teams[i] == club) <= 3

    # Lineup size constraint

    model += pulp.lpSum(lineup[i] for i in range(len(data))) == 11

    # total team size constraint

    model += pulp.lpSum(lineup[i] + subs[i] for i in range(len(data))) == 15

    for i in range(len(data)):
        model += (lineup[i] + subs[i]) <= 1  # a player cannot be both a starter and a substitute

    model.solve()
    
    squad_array = []
    for i in range(len(lineup)):
        # solver values are floats, so test against 0.5 instead of != 0;
        # the constraint above guarantees at most one of the two is set
        if lineup[i].value() > 0.5 or subs[i].value() > 0.5:
            squad_array.append([data.name[i], data.team[i], data.position[i],  data.GW[i], data.actual_points[i], data.value[i]])
    
    squad_df = pd.DataFrame(data=squad_array,columns=['name', 'team', 'position', 'GW', 'actual_points','value'])
    
    print(f"Total Score = {model.objective.value()}\nSquad Value = {squad_df.value.sum()}\n\n")
    return squad_df

Function that returns all three squads from the predictions dataframe in the Modeling notebook.

In [9]:
def squads(data, budget):
    '''Returns 3 different squads:
    1. FPL Predicted squad, 
    2. RandomForest Predicted squad
    3. True Results Squad'''
    
    xP_team = xP_squad(data, budget)
    mypred_team = mypred_squad(data, budget)
    true_team = true_squad(data, budget)
    return xP_team, mypred_team, true_team

# Prediction Squads vs Actual Squad

Now that we have defined all the necessary functions, we will compare the predicted squads to the true squad in two ways: the percentage of selected players that match, and the accuracy of the predicted points columns against `actual_points`.

In [10]:
# reading in the predictions data
pred_df = pd.read_csv('../data/pred_df.csv', index_col = 0)
pred_df

Unnamed: 0,name,position,team,GW,value,xP,mypred,actual_points
0,Fabian Schär,DEF,Newcastle,27,51,1.1,2.610179,2
1,Jonny Evans,DEF,Leicester,27,44,0.2,1.878152,1
2,Enzo Fernández,MID,Chelsea,27,50,3.0,3.091144,5
3,Brennan Johnson,FWD,Nott'm Forest,27,57,4.8,4.505072,2
4,Cheick Doucouré,MID,Crystal Palace,27,50,0.0,1.962224,0
...,...,...,...,...,...,...,...,...
2185,Çaglar Söyüncü,DEF,Leicester,33,42,0.7,2.858711,2
2186,Nick Pope,GK,Newcastle,33,54,4.8,3.776287,3
2187,Oliver Skipp,MID,Spurs,33,43,1.8,2.733489,2
2188,Ashley Young,DEF,Aston Villa,33,43,4.7,2.781020,5


In [11]:
# Determining the number of correct results between xP and actual_points, and between mypred and actual_points
pred_df['Match_xP_actual'] = pred_df['xP'].eq(pred_df['actual_points'])
pred_df['Match_mypred_actual'] = pred_df['mypred'].round().eq(pred_df['actual_points'])

# viewing the dataframe again
pred_df

Unnamed: 0,name,position,team,GW,value,xP,mypred,actual_points,Match_xP_actual,Match_mypred_actual
0,Fabian Schär,DEF,Newcastle,27,51,1.1,2.610179,2,False,False
1,Jonny Evans,DEF,Leicester,27,44,0.2,1.878152,1,False,False
2,Enzo Fernández,MID,Chelsea,27,50,3.0,3.091144,5,False,False
3,Brennan Johnson,FWD,Nott'm Forest,27,57,4.8,4.505072,2,False,False
4,Cheick Doucouré,MID,Crystal Palace,27,50,0.0,1.962224,0,True,False
...,...,...,...,...,...,...,...,...,...,...
2185,Çaglar Söyüncü,DEF,Leicester,33,42,0.7,2.858711,2,False,False
2186,Nick Pope,GK,Newcastle,33,54,4.8,3.776287,3,False,False
2187,Oliver Skipp,MID,Spurs,33,43,1.8,2.733489,2,False,False
2188,Ashley Young,DEF,Aston Villa,33,43,4.7,2.781020,5,False,False
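
Note that the two match columns are built differently: `xP` is compared as-is, whereas `mypred` is rounded first. Since `xP` is reported to one decimal place, it can only match when it lands exactly on a whole number. The sketch below illustrates the difference using the Ashley Young row from the table above; `exact_match` and `rounded_match` are hypothetical helpers, not functions from this notebook.

```python
def exact_match(pred, actual):
    """Match only when the raw prediction equals the actual points."""
    return pred == actual


def rounded_match(pred, actual):
    """Match when the prediction rounded to the nearest integer equals the actual points."""
    return round(pred) == actual


# Ashley Young, gameweek 33: xP = 4.7, mypred = 2.781020, actual_points = 5
print(exact_match(4.7, 5))         # xP compared without rounding
print(rounded_match(4.7, 5))       # xP compared after rounding
print(rounded_match(2.781020, 5))  # mypred compared after rounding
```

Rounding both columns before comparing would put the two predictors on an equal footing.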


In [12]:
# Calculating the percentage of correct predictions from xP and mypred
print(f"xP Correct Percentage: {len(pred_df.loc[pred_df['Match_xP_actual']==True]) / len(pred_df)*100}")
print(f"mypred Correct Percentage: {len(pred_df.loc[pred_df['Match_mypred_actual']==True]) / len(pred_df)*100}")

xP Correct Percentage: 5.525114155251141
mypred Correct Percentage: 7.579908675799087


In [13]:
# Correct GW 33 predictions from xP and mypred, as a percentage of all rows in pred_df (every gameweek)
print(f"xP Correct Percentage: {len(pred_df.loc[(pred_df['Match_xP_actual']==True)&(pred_df['GW']==33)]) / len(pred_df)*100}")
print(f"mypred Correct Percentage: {len(pred_df.loc[(pred_df['Match_mypred_actual']==True)&(pred_df['GW']==33)]) / len(pred_df)*100}")

xP Correct Percentage: 0.684931506849315
mypred Correct Percentage: 1.0502283105022832


In [14]:
pred_df.loc[(pred_df['Match_mypred_actual']==True)&(pred_df['GW']==33)]

Unnamed: 0,name,position,team,GW,value,xP,mypred,actual_points,Match_xP_actual,Match_mypred_actual
1906,Keylor Navas,GK,Nott'm Forest,33,45,1.8,3.423691,3,False,True
1911,Will Hughes,MID,Crystal Palace,33,48,1.8,2.067849,2,False,True
1936,Daniel James,MID,Fulham,33,59,3.2,2.327251,2,False,True
1938,Bruno Borges Fernandes,MID,Man Utd,33,95,2.2,5.261157,5,False,True
1946,Ryan Christie,MID,Bournemouth,33,53,2.3,2.963589,3,False,True
1952,Dominic Solanke,FWD,Bournemouth,33,56,5.8,4.969646,5,False,True
1953,Vitaly Janelt,MID,Brentford,33,55,1.2,2.917945,3,False,True
1963,Solly March,MID,Brighton,33,52,5.0,5.306709,5,True,True
1970,Jan Bednarek,DEF,Southampton,33,42,0.6,2.423543,2,False,True
1982,Trent Alexander-Arnold,DEF,Liverpool,33,75,4.5,4.832591,5,False,True


In [15]:
pred_df.loc[(pred_df['Match_xP_actual']==True)&(pred_df['GW']==33)]

Unnamed: 0,name,position,team,GW,value,xP,mypred,actual_points,Match_xP_actual,Match_mypred_actual
1912,Orel Mangala,MID,Nott'm Forest,33,49,2.0,2.629225,2,True,False
1913,Cristian Romero,DEF,Spurs,33,49,1.0,3.057738,1,True,False
1917,Stuart Armstrong,MID,Southampton,33,47,1.0,1.909163,1,True,False
1922,Frederico Rodrigues de Paula Santos,MID,Man Utd,33,51,1.0,2.103465,1,True,False
1961,Aaron Hickey,DEF,Brentford,33,49,1.0,2.818996,1,True,False
1963,Solly March,MID,Brighton,33,52,5.0,5.306709,5,True,True
2002,Jefferson Lerma Solís,MID,Bournemouth,33,47,2.0,2.84199,2,True,False
2012,Daniel Iversen,GK,Leicester,33,38,2.0,3.370475,2,True,False
2023,Sean Longstaff,MID,Newcastle,33,43,3.0,3.42338,3,True,True
2071,Adama Traoré Diarra,MID,Wolves,33,54,1.0,2.094207,1,True,False


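For gameweeks 27-33, the FPL predictor (`xP`) exactly matched the actual points for 5.5% of player rows, whereas our ridge prediction model's rounded predictions (`mypred`) matched for 7.6%. Note that `xP` is compared without rounding, so it can only match when it is a whole number.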

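For gameweek 33 alone, the FPL predictor matched on 0.7% of all rows and our predictor on 1.1%. Both figures are divided by the full length of `pred_df` (all gameweeks), not by the number of gameweek 33 rows, so they understate the hit rate within that gameweek.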
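
The gameweek 33 cell divides by `len(pred_df)`, which counts rows from every gameweek. A rate for a single gameweek would divide by that gameweek's rows only. Here is a minimal sketch in plain Python on made-up rows; `match_rate` and the `pred`/`actual` keys are assumptions for illustration.

```python
def match_rate(rows, gameweek):
    """Percentage of one gameweek's rows whose rounded prediction equals the actual points."""
    gw_rows = [r for r in rows if r["GW"] == gameweek]
    if not gw_rows:
        return 0.0
    hits = sum(1 for r in gw_rows if round(r["pred"]) == r["actual"])
    return 100 * hits / len(gw_rows)


# Made-up rows for illustration only
rows = [
    {"GW": 32, "pred": 2.2, "actual": 6},
    {"GW": 32, "pred": 1.4, "actual": 1},
    {"GW": 33, "pred": 4.8, "actual": 5},
    {"GW": 33, "pred": 2.9, "actual": 1},
]
print(match_rate(rows, 33))
```

With the notebook's dataframe, one way to get the same quantity would be `pred_df.loc[pred_df['GW'] == 33, 'Match_mypred_actual'].mean() * 100`, since the mean of a boolean column is the fraction of `True` values.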

By this exact-match measure, our predictor appears to be more accurate than the FPL predictor overall. Let's check what the selected teams look like and compare them to the true highest-scoring team.

In [16]:
# Extracting predicted and true squads for gameweek 33

xP_team, mypred_team, true_team = squads(pred_df.loc[pred_df['GW']==33].reset_index(), 1000)

Total Score = 84.40999999999998
Squad Value = 985


Total Score = 58.99351922951429
Squad Value = 1000


Total Score = 143.5
Squad Value = 988




The objective scores of the predicted squads (starting XI points plus 10% of the substitutes' points) were:
- FPL predicted squad: 84.4 points with a squad value of 985
- My predicted squad: 58.99 points with a squad value of 1000

In contrast, the true best squad had:
- A total score of 143.5 points and a squad value of 988

Let's see which players the true selection and the predicted squads have in common.

In [17]:
# seeing if there are similar players in FPL xP and my predicted squad
set(xP_team['name']).intersection(mypred_team['name'])

{'Gabriel Martinelli Silva', 'Ollie Watkins'}

In [18]:
# seeing if there are similar players in FPL xP and true squad
set(true_team['name']).intersection(xP_team['name'])

{'Callum Wilson',
 'Erling Haaland',
 'Joelinton Cássio Apolinário de Lira',
 'Kevin De Bruyne',
 'Tyrone Mings'}

In [19]:
# seeing if there are similar players in my predicted and true squad
set(true_team['name']).intersection(mypred_team['name'])

{'Harry Kane'}

The percentage match (shared players out of each 15-man squad) between:
- FPL xP team and my team: 13% (2 of 15)
- FPL xP team and true squad: 33% (5 of 15)
- My team and true squad: 7% (1 of 15)
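
The match percentage is the number of shared names divided by the squad size, and each squad returned by the functions above has 15 players. Below is a minimal sketch with placeholder names; `overlap_pct` is a hypothetical helper.

```python
def overlap_pct(squad_a, squad_b):
    """Shared players as a percentage of squad size (squads must be the same size)."""
    a, b = set(squad_a), set(squad_b)
    if len(a) != len(b) or not a:
        raise ValueError("squads must be non-empty and the same size")
    return 100 * len(a & b) / len(a)


# Placeholder names: two 15-man squads that share exactly two players
squad_a = [f"player_{i}" for i in range(15)]
squad_b = [f"player_{i}" for i in range(13, 28)]
print(f"{overlap_pct(squad_a, squad_b):.1f}%")
```

In the notebook this could be called as `overlap_pct(xP_team['name'], true_team['name'])`, assuming player names are unique within a gameweek.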

Below we will view the true squad and then summarize the results.

In [20]:
true_team

Unnamed: 0,name,team,position,GW,actual_points,value
0,Jason Steele,Brighton,GK,33,10,39
1,Erling Haaland,Man City,FWD,33,14,123
2,Kevin De Bruyne,Man City,MID,33,19,121
3,David Raya Martin,Brentford,GK,33,9,49
4,Marcus Rashford,Man Utd,MID,33,12,71
5,John Stones,Man City,DEF,33,9,55
6,Lloyd Kelly,Bournemouth,DEF,33,8,43
7,Mathias Jorgensen,Brentford,DEF,33,12,39
8,Joelinton Cássio Apolinário de Lira,Newcastle,MID,33,9,60
9,Joel Matip,Liverpool,DEF,33,10,59


In [21]:
xP_team

Unnamed: 0,name,team,position,GW,xP,value
0,Jack Grealish,Man City,MID,33,8.5,72
1,Erling Haaland,Man City,FWD,33,10.0,123
2,Kevin De Bruyne,Man City,MID,33,12.7,121
3,Alexandre Moreno Lopera,Aston Villa,DEF,33,5.3,45
4,Craig Dawson,Wolves,DEF,33,4.6,48
5,Joelinton Cássio Apolinário de Lira,Newcastle,MID,33,7.8,60
6,Ollie Watkins,Aston Villa,FWD,33,7.3,77
7,Gabriel Martinelli Silva,Arsenal,MID,33,6.1,69
8,Callum Wilson,Newcastle,FWD,33,7.5,69
9,Tyrone Mings,Aston Villa,DEF,33,6.5,46


In [22]:
mypred_team

Unnamed: 0,name,team,position,GW,mypred,value
0,Ivan Toney,Brentford,FWD,33,5.761853,76
1,Solly March,Brighton,MID,33,5.306709,52
2,Bukayo Saka,Arsenal,MID,33,5.870399,84
3,Trent Alexander-Arnold,Liverpool,DEF,33,4.832591,75
4,Mohamed Salah,Liverpool,MID,33,5.737337,130
5,Kieran Trippier,Newcastle,DEF,33,4.577401,62
6,Andreas Hoelgebaum Pereira,Fulham,MID,33,4.460987,44
7,Ollie Watkins,Aston Villa,FWD,33,5.458232,77
8,Sam Johnstone,Crystal Palace,GK,33,3.857337,44
9,Gabriel Martinelli Silva,Arsenal,MID,33,6.129783,69


Although my predictor had a higher exact-match rate overall, it was less accurate at choosing the right set of players. For example, looking at Erling Haaland below, the FPL predictor more closely matches the true points achieved, whereas my predictor shows a larger gap. The FPL predictor appears to achieve a lower root mean squared error (RMSE), possibly because it draws on data from previous seasons and on team statistics.

In [23]:
pred_df.loc[(pred_df['name']=='Erling Haaland')]

Unnamed: 0,name,position,team,GW,value,xP,mypred,actual_points,Match_xP_actual,Match_mypred_actual
40,Erling Haaland,FWD,Man City,27,122,6.6,5.562641,6,False,True
1083,Erling Haaland,FWD,Man City,30,121,7.5,4.311619,12,False,False
1385,Erling Haaland,FWD,Man City,31,122,9.0,6.024013,12,False,False
1928,Erling Haaland,FWD,Man City,33,123,10.0,5.763254,14,False,False
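
The gap can be quantified for the four Erling Haaland rows shown above. This is a minimal sketch, with `rmse` as an illustrative helper rather than code from the Modeling notebook. On these four rows the `xP` error comes out lower than the `mypred` error, although four rows are far too few to generalize from.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Erling Haaland's four rows from the table above
actual = [6, 12, 12, 14]
xp = [6.6, 7.5, 9.0, 10.0]
mypred = [5.562641, 4.311619, 6.024013, 5.763254]

print(f"xP RMSE:     {rmse(xp, actual):.2f}")
print(f"mypred RMSE: {rmse(mypred, actual):.2f}")
```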


To improve our predictor, the next steps would be to include more data from previous seasons, team statistics (wins, home/away statistics), and opposition strength. These statistics contribute to a player's performance and may help to reduce the RMSE and give more accurate predictions to choose from.