# FPL Gameweek Player Predictions

Managers in Fantasy Premier League (FPL) earn points from their players for a number of actions. These include goals, assists, clean sheets and saves. They can also earn additional bonus points if they are among the top-performing players in the Bonus Points System (BPS) in any given match.

You can look at a detailed breakdown of the scoring system [here.](https://fantasy.premierleague.com/help/rules)

## FPL Points Prediction Model

In this notebook I created a model that predicts how many points a player will score for a specific gameweek during the 22-23 PL Season. I set up a Machine Learning Linear Multiple Regression Model using the Scikit-Learn Python library. Later in the notebook I will specify the model's variables and other details. 

## Index
* [Data](#data)
* [Lags](#lags)
* [Model](#model)
    * [Training Data](#training_data)
    * [Multiple Linear Regression](#regression)
    * [Measuring Error](#error)
* [Predictions](#predictions)
    * [Predictions - Gameweek 3](#predictions_gw3)
    * [Predictions - Gameweek 4](#predictions_gw4)
* [Ideal Team](#ideal_team)
    * [Ideal Team - Gameweek 3 (No Budget Constraint)](#ideal_team_gw3_no_budget)
    * [Ideal Team - Gameweek 3 (Budget Constraint)](#ideal_team_gw3_budget)
    * [Ideal Team - Gameweek 4 (No Budget Constraint)](#ideal_team_gw4_no_budget)
    * [Ideal Team - Gameweek 4 (Budget Constraint)](#ideal_team_gw4_budget)
    

In [1]:
#Import relevant libraries and packages
import pandas as pd
import numpy as np
import os
import sys
import plotly.express as px
from pathlib import Path
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from pulp import *

## Data <a class="anchor" id="data"></a>

In [2]:
#Paths
path = Path('Data')
path_22_23 = Path('Data/2022-23')

#Import datasets
data = pd.read_csv(path/'training_data_updated.csv', 
                       index_col=0, 
                       dtype={'season':str,
                              'squad':str,
                              'comp':str})
season_gws = pd.read_csv(path/'remaining_season.csv', index_col=0)
players_raw = pd.read_csv(path_22_23/'players_raw.csv')
player_stats_2223 = pd.read_csv(path_22_23/'gws/merged_gw.csv')
team_standard_stats_2223 = pd.read_csv(path_22_23/'team_standard_stats_2223.csv')
player_standard_stats_2223 = pd.read_csv(path_22_23/'player_standard_stats_2223.csv')
teams = pd.read_csv(path_22_23/'teams.csv')
cleaned_players = pd.read_csv(path_22_23/'cleaned_players.csv')

#Reset index and drop duplicates (just in case)
data = data.reset_index()
data = data.drop_duplicates()

The data has one row per player, per gameweek, for each player and gameweek since the 2020-2021 season. Each row contains information and statistics for each player and gameweek. The dataframe's columns are:

In [3]:
#Data info
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51548 entries, 0 to 51547
Data columns (total 31 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   player           51548 non-null  object 
 1   position         51548 non-null  int64  
 2   gw               51548 non-null  int64  
 3   team             51548 non-null  object 
 4   opponent_team    51548 non-null  object 
 5   was_home         51548 non-null  bool   
 6   season           51548 non-null  object 
 7   minutes          51548 non-null  int64  
 8   total_points     51548 non-null  int64  
 9   assists          51548 non-null  int64  
 10  bonus            51548 non-null  int64  
 11  bps              51548 non-null  int64  
 12  clean_sheets     51548 non-null  int64  
 13  creativity       51548 non-null  float64
 14  goals_conceded   51548 non-null  int64  
 15  goals_scored     51548 non-null  int64  
 16  ict_index        51548 non-null  float64
 17  influence   

We add a column to the dataframe called 'fixture difficulty rating' (FDR). The FDR creates a value that offers a perceived fixture difficulty for each team when facing another team. These values are then simplified into ratings from 1 to 5, with 5 being the highest difficulty value.

FPL develops FDR based on a complex algorithm that analyses the performance statistics for each team across their home and away matches. It then combines this data with each team's home and away form over the past six fixtures. 

In FPL, FDR can change from week to week. In other words, Team X might have an FDR of 3 this gameweek, and an FDR of 4 next gameweek. For the sake of simplicity and to maintain consistency across the time-series data, I will assign a constant FDR for each team based on historical PL standings and historical FDRs (dating back to the 2017-18 season). These are my FDR assignments and the logic behind them:

**FDR = 5 &rarr; Manchester City and Liverpool**
- Since the 2017-18 season, Manchester City and Liverpool are the highest-achieving and most consistent teams. They are arguably the most difficult teams to play against and have ended all seasons in the top-4.

**FDR = 4 &rarr; Arsenal, Chelsea, Manchester United, Tottenham Hotspur**
- These four teams, along with Man City and Liverpool are considered the PL's "big six", and thus the most difficult teams to play against. I didn't assign these teams an FDR of 5 because they haven't been as consistent as Man City and Liverpool and haven't made as many points as them since the 2017-18 season.

**FDR = 3 &rarr; Brighton, Crystal Palace, Everton, Leicester City, Newcaste United, West Ham United, Wolves**
- These teams are considered "mid-table teams". Although consistency and regularity across these teams varies, and some of them are arguably more difficult to play against than others, it makes sense to group them under the same FDR rating due to their historic standings (and similar consistency) since the 2017-18 season. 

**FDR = 2 &rarr; Aston Villa, Brentford, Burnley, Leeds, Norwich, Southampton, Watford, Hull City, Middlesbrough, Bournemouth, Sunderland, Swansea, West Brom, Stoke City, Huddersfield, Fulham, Cardiff City, Sheffield United, Nottingham Forest**
- All of these teams (with the exception of Southampton) have been relegated at least once in the past 5 seasons, and during their time in the Premier League have struggled to make it out of the relegation zone or past the 10th standing. The reason I grouped Southampton with the rest of the teams here is because it is the only team that despite not having been relegated, hasn't finished a season above the 11th position (since the 2017-18 season). 

**FDR = 1 &rarr; NONE**
- I didn't assign a score of 1 to any of the teams because FPL rarely gives an FDR of 1 to any fixture.

In [4]:
#Function to add fdr (fixture difficulty rating) to dataframe
def fdr_assignment(data):
    if data['opponent_team'] == 'Arsenal':
        return 4
    if data['opponent_team'] == 'Aston Villa':
        return 2
    if data['opponent_team'] == 'Brentford':
        return 2
    if data['opponent_team'] == 'Brighton':
        return 3
    if data['opponent_team'] == 'Burnley':
        return 2
    if data['opponent_team'] == 'Chelsea':
        return 4
    if data['opponent_team'] == 'Crystal Palace':
        return 3
    if data['opponent_team'] == 'Everton':
        return 3
    if data['opponent_team'] == 'Leeds':
        return 2
    if data['opponent_team'] == 'Leicester City':
        return 3
    if data['opponent_team'] == 'Liverpool':
        return 5
    if data['opponent_team'] == 'Manchester City':
        return 5
    if data['opponent_team'] == 'Manchester Utd':
        return 4
    if data['opponent_team'] == 'Newcastle Utd':
        return 3
    if data['opponent_team'] == 'Norwich':
        return 2
    if data['opponent_team'] == 'Southampton':
        return 2
    if data['opponent_team'] == 'Tottenham':
        return 4
    if data['opponent_team'] == 'Watford':
        return 2
    if data['opponent_team'] == 'West Ham':
        return 3
    if data['opponent_team'] == 'Wolves':
        return 3
    if data['opponent_team'] == 'Hull City':
        return 2
    if data['opponent_team'] == 'Middlesbrough':
        return 2
    if data['opponent_team'] == 'Bournemouth':
        return 2
    if data['opponent_team'] == 'Sunderland':
        return 2
    if data['opponent_team'] == 'Swansea City':
        return 2
    if data['opponent_team'] == 'West Brom':
        return 2
    if data['opponent_team'] == 'Stoke City':
        return 2
    if data['opponent_team'] == 'Huddersfield Town':
        return 2
    if data['opponent_team'] == 'Fulham':
        return 2
    if data['opponent_team'] == 'Cardiff City':
        return 2
    if data['opponent_team'] == 'Sheffield Utd':
        return 2
    if data['opponent_team'] == 'Nottingham Forest':
        return 2
    
data['fdr'] = data.apply(fdr_assignment, axis = 1)

We add a column to the dataframe called 'team_won' that specifies whether the player's team won that game or not (1 if team won, 0 if otherwise).

In [5]:
def team_won(data):
    if data['was_home'] == True and data['team_h_score'] > data['team_a_score']:
        return 1
    if data['was_home'] == False and data['team_h_score'] < data['team_a_score']:
        return 1
    if data['was_home'] == True and data['team_h_score'] < data['team_a_score']:
        return 0
    if data['was_home'] == False and data['team_h_score'] > data['team_a_score']:
        return 0
    else:
        return 0
        
data['team_won'] = data.apply(team_won, axis = 1)

We add a column to the dataframe called 'team_mv' that assigns each team a market value. The market value info was scraped from [transfermarkt](https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1).

*Note: for teams that are not currently in the PL, we assign them the market value in their last season they were in the PL. We also assign the latest available market value to each team for consistency (regardelss of season).

In [6]:
def team_market_value(data):
    if data['team'] == 'Arsenal':
        return 671.5
    if data['team'] == 'Aston Villa':
        return 499.6
    if data['team'] == 'Brentford':
        return 292.65
    if data['team'] == 'Brighton':
        return 264.7
    if data['team'] == 'Burnley':
        return 138.05
    if data['team'] == 'Chelsea':
        return 823.7
    if data['team'] == 'Crystal Palace':
        return 268.80
    if data['team'] == 'Everton':
        return 415.95
    if data['team'] == 'Leeds':
        return 275.30
    if data['team'] == 'Leicester City':
        return 508.30
    if data['team'] == 'Liverpool':
        return 870
    if data['team'] == 'Manchester City':
        return 1010
    if data['team'] == 'Manchester Utd':
        return 708.8
    if data['team'] == 'Newcastle Utd':
        return 333.6
    if data['team'] == 'Norwich':
        return 163.2
    if data['team'] == 'Southampton':
        return 271.45
    if data['team'] == 'Tottenham':
        return 727.3
    if data['team'] == 'Watford':
        return 156.2
    if data['team'] == 'West Ham':
        return 384
    if data['team'] == 'Wolves':
        return 385.95
    if data['team'] == 'Hull City':
        return 135.85
    if data['team'] == 'Middlesbrough':
        return 128.8
    if data['team'] == 'Bournemouth':
        return 160.4
    if data['team'] == 'Sunderland':
        return 132
    if data['team'] == 'Swansea City':
        return 165.49
    if data['team'] == 'West Brom':
        return 141.15
    if data['team'] == 'Stoke City':
        return 192.45
    if data['team'] == 'Huddersfield Town':
        return 137.45
    if data['team'] == 'Fulham':
        return 202.5
    if data['team'] == 'Cardiff City':
        return 113
    if data['team'] == 'Sheffield Utd':
        return 148.85
    if data['team'] == 'Nottingham Forest':
        return 189.8
            
data['team_mv'] = data.apply(team_market_value, axis = 1)

We do the same as above (adding a market value column), but this time for the opponent_team. We call the column 'opponent_team_mv'.

In [7]:
def opponent_team_market_value(data):
    if data['opponent_team'] == 'Arsenal':
        return 671.5
    if data['opponent_team'] == 'Aston Villa':
        return 499.6
    if data['opponent_team'] == 'Brentford':
        return 292.65
    if data['opponent_team'] == 'Brighton':
        return 264.7
    if data['opponent_team'] == 'Burnley':
        return 138.05
    if data['opponent_team'] == 'Chelsea':
        return 823.7
    if data['opponent_team'] == 'Crystal Palace':
        return 268.80
    if data['opponent_team'] == 'Everton':
        return 415.95
    if data['opponent_team'] == 'Leeds':
        return 275.30
    if data['opponent_team'] == 'Leicester City':
        return 508.30
    if data['opponent_team'] == 'Liverpool':
        return 870
    if data['opponent_team'] == 'Manchester City':
        return 1010
    if data['opponent_team'] == 'Manchester Utd':
        return 708.8
    if data['opponent_team'] == 'Newcastle Utd':
        return 333.6
    if data['opponent_team'] == 'Norwich':
        return 163.2
    if data['opponent_team'] == 'Southampton':
        return 271.45
    if data['opponent_team'] == 'Tottenham':
        return 727.3
    if data['opponent_team'] == 'Watford':
        return 156.2
    if data['opponent_team'] == 'West Ham':
        return 384
    if data['opponent_team'] == 'Wolves':
        return 385.95
    if data['opponent_team'] == 'Hull City':
        return 135.85
    if data['opponent_team'] == 'Middlesbrough':
        return 128.8
    if data['opponent_team'] == 'Bournemouth':
        return 160.4
    if data['opponent_team'] == 'Sunderland':
        return 132
    if data['opponent_team'] == 'Swansea City':
        return 165.49
    if data['opponent_team'] == 'West Brom':
        return 141.15
    if data['opponent_team'] == 'Stoke City':
        return 192.45
    if data['opponent_team'] == 'Huddersfield Town':
        return 137.45
    if data['opponent_team'] == 'Fulham':
        return 202.5
    if data['opponent_team'] == 'Cardiff City':
        return 113
    if data['opponent_team'] == 'Sheffield Utd':
        return 148.85
    if data['opponent_team'] == 'Nottingham Forest':
        return 189.8
        
data['opponent_team_mv'] = data.apply(opponent_team_market_value, axis = 1)

## Lags <a class="anchor" id="lags"></a>

Since we are dealing with time series data, we create two functions to keep track of lags (a fixed amount of passing time) on both player and team levels. In this case, lags are a certain amount of gameweeks. The functions return lagged statistics for the x amount of lags that we specify and adds them to the original dataframe as new columns, which will be helpful later for modeling. Our function also returns rolling averages - lagged statistics per game.

Let's look at the two functions:

- **player_lag_stats** &rarr; this function returns the lagged statistic we specify for each player and each specified lag. Let's say we want Kevin de Bruyne's lagged goals_scored (statistic) for the last 1, 2, and 3 gameweeks (lags). Let's assume De Bruyne scored  1, 2, and 0 goals in the past three gameweeks (respectively). This is what our lags would look like:
    - *goals_scored_last_1 = 0*
    - *goals_scored_last_2 = 0 + 2 = 2* 
    - *goals_scored_last_3 = 0 + 2 + 1 = 3* 
    
  Rolling averages (assuming De Bruyne played the full 90min per game):
    - *goals_scored_pg_last_1 = 0/1 = 0*
    - *goals_scored_pg_last_2 = (0 + 2)/2 = 1*
    - *goals_scored_pg_last_3 = (0 + 2 + 1)/3 = 1*
    
    
- **team_lag_stats** &rarr; this function does the same as the function above, but on a team level - it returns the lagged statistic for the team as a whole, not just the player. It also returns their *conceded* lagged statistic, and their opponent's lagged and conceded lagged statistic. For example, if we want a team's goals_scored (statistic) in the last 1 gameweek (lag), the function would return how many goals the team scored and conceded in the last gameweek, and how many goals their opponent scored and conceded in the last gameweek. 

We need lagged statistics because we want to predict a player's expected points based on historical data.

In [8]:
#Lagged stats for players
def player_lag_stats(df, stats, lags):    
    player_lag = []
    updated_df = df.copy()
    stats.insert(0, 'minutes')
    for stat in stats:
        for lag in lags:
            stat_name = stat + '_last_' + str(lag)
            minute_game = 'minutes_last_' + str(lag)
            if lag == 'all':
                updated_df[stat_name] = updated_df.groupby(['player'])[stat].apply(lambda x: x.cumsum() - x)
            else: 
                updated_df[stat_name] = updated_df.groupby(['player'])[stat].apply(lambda x: x.rolling(min_periods=1, 
                                                                                            window=lag+1).sum() - x)
            if stat != 'minutes':
                pg_stat_name = stat + '_pg_last_' + str(lag)
                player_lag.append(pg_stat_name)
                updated_df[pg_stat_name] = 90 * updated_df[stat_name] / updated_df[minute_game]
                #Adjusting for negative values and 0 minutes played
                updated_df[pg_stat_name] = updated_df[pg_stat_name].replace([np.inf, -np.inf], np.nan)
            else: player_lag.append(minute_game)
                
    return updated_df, player_lag

In [9]:
#Lagged stats for teams
def team_lag_stats(df, stats, lags):
    team_lag = []
    updated_new = df.copy()
    for stat in stats:
        stat_name_team = stat + '_team'
        stat_conceded_team = stat_name_team + '_conceded'
        stat_team = (df.groupby(['team', 'season', 'gw',
                                   'opponent_team'])
                        [stat].sum().rename(stat_name_team).reset_index())
        stat_team = stat_team.merge(stat_team,
                           left_on=['team', 'season', 'gw',
                                    'opponent_team'],
                           right_on=['opponent_team', 'season', 'gw',
                                     'team'],
                           how='left',
                           suffixes = ('', '_conceded'))
        stat_team.drop(['team_conceded', 'opponent_team_conceded'], axis=1, inplace=True)
        for lag in lags:
            stat_name = stat + '_team_last_' + str(lag)
            stat_conceded_name = stat + '_team_conceded_last_' + str(lag)
            pg_stat_name = stat + '_team_pg_last_' + str(lag)
            pg_stat_conceded_name = stat + '_team_conceded_pg_last_' + str(lag)
            team_lag.extend([pg_stat_name])
            if lag == 'all':
                stat_team[stat_name] = (stat_team.groupby('team')[stat_name_team]
                                              .apply(lambda x: x.cumsum() - x))
                
                stat_team[stat_conceded_name] = (stat_team.groupby('team')[stat_conceded_team]
                                              .apply(lambda x: x.cumsum() - x))
                stat_team[pg_stat_name] = (stat_team[stat_name]
                                                 / stat_team.groupby('team').cumcount())
                stat_team[pg_stat_conceded_name] = (stat_team[stat_conceded_name]
                                                 / stat_team.groupby('team').cumcount())
            else:
                stat_team[stat_name] = (stat_team.groupby('team')[stat_name_team]
                                              .apply(lambda x: x.rolling(min_periods=1, 
                                                                         window=lag + 1).sum() - x))
                stat_team[stat_conceded_name] = (stat_team.groupby('team')[stat_conceded_team]
                                              .apply(lambda x: x.rolling(min_periods=1, 
                                                                         window=lag + 1).sum() - x))
                stat_team[pg_stat_name] = (stat_team[stat_name] / 
                                                 stat_team.groupby('team')[stat_name_team]
                                                 .apply(lambda x: x.rolling(min_periods=1, 
                                                                            window=lag + 1).count() - 1))
                stat_team[pg_stat_conceded_name] = (stat_team[stat_conceded_name] / 
                                                    stat_team.groupby('team')[stat_conceded_name]
                                                 .apply(lambda x: x.rolling(min_periods=1, 
                                                                            window=lag + 1).count() - 1))
        updated_new = updated_new.merge(stat_team, 
                          on=['team', 'season', 'gw', 'opponent_team'], 
                          how='left')
        updated_new = updated_new.merge(stat_team,
                 left_on=['team', 'season', 'gw', 'opponent_team'],
                 right_on=['opponent_team', 'season', 'gw', 'team'],
                 how='left',
                 suffixes = ('', '_opponent'))
        updated_new.drop(['team_opponent', 'opponent_team_opponent'], axis=1, inplace=True)
        
    team_lag = team_lag + [team_lag + '_opponent' for team_lag in team_lag]  

    return updated_new, team_lag

## Model <a class="anchor" id="model"></a>

### Training Data <a class="anchor" id="training_data"></a>

Now that we have our data and our lag functions, we can proceed to create the training data for our model.

Our model will use the following features to make its predictions:

- **total_points**: total points scored.
- **minutes**: minutes played.
- **assists**: assists made.
- **bonus**: bonus points scored.
- **clean_sheets**: if player made clean sheet.
- **goals_conceded**: goals conceded.
- **goals_scored**: goals scored.
- **penalties_saved**: penalties saved (GK only).
- **red_cards**: if player got red card.
- **saves**: saves (GK only).
- **yellow_cards**: if player got yellow card.
- **position**: the player's position.
- **was_home**: if player played home match.
- **threat**: This is a value that examines a player's threat on goal. It gauges the individuals most likely to score goals. While attempts are the key action, the Index looks at pitch location, giving greater weight to actions that are regarded as the best chances to score.
- **ict_index**: All of the statistics above are combined to create an overall Influence, Creativity, Threat (ICT) Index score.
- **creativity**: Creativity assesses player performance in terms of producing goalscoring opportunities for others. It can be used as a guide to identify the players most likely to supply assists. While this analyzes frequency of passing and crossing, it also considers pitch location and quality of the final ball.
- **influence**: Influence evaluates the degree to which a player has made an impact on a single match or throughout the season. It takes into account events and actions that could directly or indirectly effect the outcome of the fixture. At the top level these are decisive actions like goals and assists. But the Influence score also processes significant defensive actions to analyse the effectiveness of defenders and goalkeepers.
- **fdr**: fixture difficulty rating.
- **team_wins**: whether the team won that game.
- **team_mv**: the team's market value.
- **opponent_team_mv**: the opponent team's market value.
- **xg**: player expected goals (per90min).
- **xa**: player expected assists (per90min).
- **npxg**: player non-penalty expected goals (per90min).
- **team_xg**: team expected goals (per90min).
- **team_xa**: team expected assists (per90min).
- **team_npxg**: team non-penalty expected goals (per90min).

**Note: Each statistic is per player, per gameweek. The specific lags we're using, and whether we are using player and team levels, is specified below.*

In [10]:
#Drop duplicates
training_data = data.drop_duplicates()

#Square and cube FDR, xg, xa (for both players and teams)
training_data['fdr_squared'] = training_data['fdr']**2
training_data['fdr_cubed'] = training_data['fdr']**3
training_data['xg_squared'] = training_data['xg']**2
training_data['xg_cubed'] = training_data['xg']**3
training_data['xa_squared'] = training_data['xa']**2
training_data['xa_cubed'] = training_data['xa']**3
training_data['team_xg_squared'] = training_data['team_xg']**2
training_data['team_xg_cubed'] = training_data['team_xg']**3
training_data['team_xa_squared'] = training_data['team_xa']**2
training_data['team_xa_cubed'] = training_data['team_xa']**3

#Total points
training_data, teams_lag = team_lag_stats(training_data, ['total_points'], ['all'])
training_data, players_lag = player_lag_stats(training_data, ['total_points'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Minutes
training_data, players_lag = player_lag_stats(training_data, ['minutes'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Assists
training_data, teams_lag = team_lag_stats(training_data, ['assists'], ['all'])
training_data, players_lag = player_lag_stats(training_data, ['assists'], ['all', 1, 2, 3, 4 , 5, 10, 20])

# #Bonus
training_data, teams_lag = team_lag_stats(training_data, ['bonus'], ['all'])
training_data, players_lag = player_lag_stats(training_data, ['bonus'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Clean sheets
training_data, players_lag = player_lag_stats(training_data, ['clean_sheets'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Goals conceded
training_data, players_lag = player_lag_stats(training_data, ['goals_conceded'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Goals scored
training_data, teams_lag = team_lag_stats(training_data, ['goals_scored'], ['all'])
training_data, players_lag = player_lag_stats(training_data, ['goals_scored'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Penalties Saved
training_data, players_lag = player_lag_stats(training_data, ['penalties_saved'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Red Cards
training_data, players_lag = player_lag_stats(training_data, ['red_cards'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Saves
training_data, players_lag = player_lag_stats(training_data, ['saves'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Yellow Cards
training_data, players_lag = player_lag_stats(training_data, ['yellow_cards'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Team wins
training_data, players_lag = player_lag_stats(training_data, ['team_won'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Columns to drop
drop_columns = ['gw', 'player', 'minutes', 'team', 'opponent_team',
                'assists', 'bonus', 'bps', 'clean_sheets','goals_conceded', 
                'goals_scored', 'penalties_saved', 'red_cards', 'saves',
                'yellow_cards', 'season', 'team_a_score', 'team_h_score', 
                'team_won']

training_data = training_data.drop(drop_columns,axis = 1)

#Fill NaN values with 0
training_data = training_data.fillna(0)

#Convert was_home column values to integers (1 for was_home, 0 otherwise)
training_data['was_home'] = training_data["was_home"].astype(int)

#Round all numbers to two decimal points for simplicity
training_data = training_data.round(2)

In [11]:
training_data

Unnamed: 0,position,was_home,total_points,creativity,ict_index,influence,threat,xg,xa,npxg,...,team_won_last_3,team_won_pg_last_3,team_won_last_4,team_won_pg_last_4,team_won_last_5,team_won_pg_last_5,team_won_last_10,team_won_pg_last_10,team_won_last_20,team_won_pg_last_20
0,2,0,6,2.8,1.7,10.2,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
1,4,0,7,12.7,9.9,38.6,48.0,0.4,0.2,0.4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
2,1,0,7,0.0,1.4,14.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
3,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
4,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51543,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
51544,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.78,4.0,0.62
51545,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
51546,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.00,4.0,0.00


### Multiple Linear Regression <a class="anchor" id="regression"></a>

We use the scikit learn Python library to develop a multiple linear regression prediction model.

In [12]:
#Multiple Linear Regression Prediction Model

#Features to make our predictions
x = training_data.drop('total_points', axis=1)

#What we want to predict
y = training_data['total_points'] 

#Split up data into train and test sets, fit model, and make predictions
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.10, random_state = 42)
linear_reg = LinearRegression()
linear_reg.fit(x_train,y_train)
y_prediction = linear_reg.predict(x_test)
y_prediction



array([0.10718749, 1.94804739, 0.66578573, ..., 1.30518413, 0.37797201,
       0.44936321])

### Measuring Error <a class="anchor" id="error"></a>

To measure the accuracy of our model, we will look at 3 metrics:
- r2 score
- MSE
- RMSE

In [13]:
#Calculating r2_score, mse, and rmse
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
score = r2_score(y_test,y_prediction)
print('r2 score is ',score)
print('MSE is ',mean_squared_error(y_test,y_prediction))
print('RMSE is ',np.sqrt(mean_squared_error(y_test,y_prediction)))

r2 score is  0.7431862960528433
MSE is  1.488201871533926
RMSE is  1.219918797106564


r2 score = ~0.74 → an r2 above 0.7 would generally be seen as showing a high level of correlation between the dependent and independent variables.

MSE = ~1.49 & RMSE = ~1.22 → ideally, these values should be close to 0, however, they are low, which is a good sign of our model's accuracy. We should expected these values to start to get closer to 0 as the season progresses.

## Predictions <a class="anchor" id="predictions"></a>

Now that we trained our model, we are ready to make predictions for the upcoming PL gameweek.

First, we import data with upcoming gameweek information, merge with relevant player statistics from last gameweek (creativity, ict_index, influence, threat), and then add the rest of the statistics we use for the model with a value of zero (we do this in order to get lagged statistics later):

**Note: remember to change the gameweek weekly to update the data.*

### Gameweek 3 Predictions <a class="anchor" id="predictions_gw3"></a>

Actual vs predicted total points:

In [14]:
player_stats_gw3 = player_stats_2223[player_stats_2223['GW'] == 3]
relevant_columns = ['name', 'total_points']
player_stats_gw3 = player_stats_gw3[relevant_columns]
player_stats_gw3 = player_stats_gw3.rename(columns={'name': 'player'})
gw3_predictions = pd.read_csv(path/'gw3_predictions.csv')
gw3_predictions_vs_actual = gw3_predictions.merge(player_stats_gw3, on = 'player')
gw3_predictions_vs_actual[['player', 'predicted_total_points', 'total_points']]

Unnamed: 0,player,predicted_total_points,total_points
0,Mateusz Klich,1,1
1,Adam Forshaw,0,1
2,Patrick Bamford,1,0
3,Diego Llorente,2,8
4,Robin Koch,2,6
...,...,...,...
476,Will Smallbone,0,0
477,Tino Livramento,0,0
478,Mateusz Lis,0,0
479,Willy Caballero,0,0


### Gameweek 4 Predictions <a class="anchor" id="predictions_gw4"></a>

We create a dataframe with the upcoming gw's data:

# CHANGE GAMEWEEK HERE

In [15]:
#Most recent (played) gw
gameweek = 3

#Player stats for most recent gameweek
player_stats = player_stats_2223[player_stats_2223['GW'] == gameweek]
relevant_columns = ['name','creativity', 'ict_index','influence', 'threat']
player_stats = player_stats[relevant_columns]
player_stats = player_stats.rename(columns={'name': 'player'})

#Add relevant statistics with value = 0
player_stats[['minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'goals_conceded', 'goals_scored',
       'penalties_saved', 'red_cards', 'saves', 'yellow_cards', 'team_a_score', 'team_h_score']] = 0

#Merge dataframes and make some adjustments
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with boths keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

#Player Raw Data Merged with player's team
players_raw = players_raw[['first_name', 'second_name', 'team_code']]
teams = teams[['code', 'name']]
players_raw = players_raw.merge(teams, left_on = 'team_code', right_on= 'code')
players_raw['player'] = players_raw['first_name'] + ' ' + players_raw['second_name']
players_raw = players_raw.rename(columns={'name': 'team'})
players_raw = players_raw[['player', 'team']]
players_raw['team'] = players_raw['team'].replace({'Man Utd': 'Manchester Utd', 
                                          'Newcastle United': 'Newcastle Utd',
                                          'West Ham United': 'West Ham',
                                          'Tottenham Hotspur': 'Tottenham',
                                          'Brighton and Hove Albion': 'Brighton',
                                          'Wolverhampton Wanderers': 'Wolves',
                                          'Leicester': 'Leicester City',
                                          'Man City': 'Manchester City',
                                          'Newcastle': 'Newcastle Utd',
                                          "Nott'm Forest": 'Nottingham Forest',
                                          'Spurs': 'Tottenham'})

#Gameweek Data
season_gws['opponent_team'] = season_gws['opponent_team'].replace({'Manchester United': 'Manchester Utd', 
                                          'Newcastle United': 'Newcastle Utd',
                                          'West Ham United': 'West Ham',
                                          'Tottenham Hotspur': 'Tottenham',
                                          'Brighton and Hove Albion': 'Brighton',
                                          'Wolverhampton Wanderers': 'Wolves'})

season_gws['team'] = season_gws['team'].replace({'Manchester United': 'Manchester Utd', 
                                          'Newcastle United': 'Newcastle Utd',
                                          'West Ham United': 'West Ham',
                                          'Tottenham Hotspur': 'Tottenham',
                                          'Brighton and Hove Albion': 'Brighton',
                                          'Wolverhampton Wanderers': 'Wolves'})

season_gws = season_gws[['gw','team', 'opponent_team', 'was_home', 'season']]
season_gws = season_gws.drop_duplicates()
season_gws = season_gws.reset_index().drop('index', axis=1)

#CHANGE GAMEWEEK TO NEXT GAMEWWEK HERE:
season_gws = season_gws[season_gws['gw'] == 4]

#Merge gameweek info with player names
season_player_merge = season_gws.merge(players_raw, on='team')
season_player_merge = season_player_merge[['player', 'gw', 'team', 'opponent_team', 'was_home', 'season']]
season_player_merge

#Add player's position
cleaned_players['player'] = cleaned_players['first_name'] + ' ' + cleaned_players['second_name']
cleaned_players = cleaned_players[['player', 'element_type']]
cleaned_players = cleaned_players.rename(columns={'element_type': 'position'})
season_player_merge = season_player_merge.merge(cleaned_players, on='player')

#Ordered and clean df with player gw data
season_player_merge = season_player_merge[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home', 'season']]
season_player_merge = season_player_merge.drop_duplicates()

season_gw = fuzzy_merge(season_player_merge, player_stats, 'player', 'player', threshold=91)
season_gw_stats = season_gw.merge(player_stats, left_on = 'matches', right_on = 'player')
season_gw_stats = season_gw_stats.drop(['player_x', 'matches'], axis=1)
season_gw_stats = season_gw_stats.rename(columns={'player_y': 'player'})
season_gw_stats = season_gw_stats[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home',
       'season', 'minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'goals_conceded', 'goals_scored',
       'ict_index', 'influence', 'penalties_saved', 'red_cards', 'saves',
       'threat', 'yellow_cards', 'team_a_score', 'team_h_score']]

#Add player's xg, xa, npxg
season_gw_stats = fuzzy_merge(season_gw_stats, player_standard_stats_2223, 'player', 'player', threshold=91)
season_gw_stats['matches'].replace('', np.nan, inplace=True)
season_gw_no_match = season_gw_stats[season_gw_stats['matches'].isna()]
season_gw_no_match[['xg', 'xa', 'npxg']] = 0
season_gw_no_match = season_gw_no_match.drop('matches', axis=1)
season_gw_stats = season_gw_stats.dropna(subset=['matches'])
season_gw_stats = season_gw_stats.merge(player_standard_stats_2223, left_on = 'matches', right_on = 'player')
season_gw_stats = season_gw_stats.drop(['player_x', 'matches'], axis=1)
season_gw_stats = season_gw_stats.rename(columns={'player_y': 'player'})
season_gw_stats = season_gw_stats[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home',
       'season', 'minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'goals_conceded', 'goals_scored',
       'ict_index', 'influence', 'penalties_saved', 'red_cards', 'saves',
       'threat', 'yellow_cards', 'team_a_score', 'team_h_score', 'xg', 'xa', 'npxg']]
season_gw_stats = pd.concat([season_gw_stats, season_gw_no_match])

#Add team's xg, xa, and npxg
next_gw = season_gw_stats 
next_gw = next_gw.merge(team_standard_stats_2223, left_on = 'team', right_on ='team')
next_gw = next_gw[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home',
       'season', 'minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'goals_conceded', 'goals_scored',
       'ict_index', 'influence', 'penalties_saved', 'red_cards', 'saves',
       'threat', 'yellow_cards', 'team_a_score', 'team_h_score', 'xg', 'xa', 'npxg',
       'team_xg', 'team_xa', 'team_npxg']]

#FDR, team_won, team_mv, and opponent_team_mv assignment
next_gw['fdr'] = next_gw.apply(fdr_assignment, axis = 1)
next_gw['team_won'] = next_gw.apply(team_won, axis = 1)
next_gw['team_mv'] = next_gw.apply(team_market_value, axis = 1)
next_gw['opponent_team_mv'] = next_gw.apply(opponent_team_market_value, axis = 1)

#Re-order columns
next_gw = next_gw[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home',
       'season', 'minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'goals_conceded', 'goals_scored',
       'ict_index', 'influence', 'penalties_saved', 'red_cards', 'saves',
       'threat', 'yellow_cards', 'team_a_score', 'team_h_score', 'team_won',
       'team_mv','opponent_team_mv','xg', 'xa', 'npxg','team_xg', 'team_xa', 'team_npxg']]

#Convert 'season' column into string
next_gw['season'] = next_gw['season'].apply(str)

next_gw

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  season_gw_no_match[['xg', 'xa', 'npxg']] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  season_gw_no_match[['xg', 'xa', 'npxg']] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  season_gw_no_match[['xg', 'xa', 'npxg']] = 0


Unnamed: 0,player,position,gw,team,opponent_team,was_home,season,minutes,total_points,assists,...,team_h_score,team_won,team_mv,opponent_team_mv,xg,xa,npxg,team_xg,team_xa,team_npxg
0,Neco Williams,DEF,4,Nottingham Forest,Tottenham,True,2223,0,0,0,...,0,0,189.8,727.3,0.10,0.23,0.10,0.92,0.48,0.92
1,Steve Cook,DEF,4,Nottingham Forest,Tottenham,True,2223,0,0,0,...,0,0,189.8,727.3,0.00,0.00,0.00,0.92,0.48,0.92
2,Jack Colback,MID,4,Nottingham Forest,Tottenham,True,2223,0,0,0,...,0,0,189.8,727.3,0.00,0.00,0.00,0.92,0.48,0.92
3,Scott McKenna,DEF,4,Nottingham Forest,Tottenham,True,2223,0,0,0,...,0,0,189.8,727.3,0.00,0.00,0.00,0.92,0.48,0.92
4,Ryan Yates,MID,4,Nottingham Forest,Tottenham,True,2223,0,0,0,...,0,0,189.8,727.3,0.86,0.00,0.86,0.92,0.48,0.92
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
584,Boubakary Soumaré,MID,4,Leicester City,Chelsea,False,2223,0,0,0,...,0,0,508.3,823.7,0.00,0.00,0.00,0.48,0.35,0.48
585,Luke Thomas,DEF,4,Leicester City,Chelsea,False,2223,0,0,0,...,0,0,508.3,823.7,0.00,0.00,0.00,0.48,0.35,0.48
586,Lewis Brunt,DEF,4,Leicester City,Chelsea,False,2223,0,0,0,...,0,0,508.3,823.7,0.00,0.00,0.00,0.48,0.35,0.48
587,Daniel Iversen,GK,4,Leicester City,Chelsea,False,2223,0,0,0,...,0,0,508.3,823.7,0.00,0.00,0.00,0.48,0.35,0.48


Now, we take the dataframe we just created, and concatenate it to the original data (with all previous gws' info), in order to get the relevant lagged statistics - the ones we used in our original model. Then, we only keep next gameweek's rows and drop the rest to make our predictions:

In [16]:
#Adjusting original data to concatenate with upcoming gameweek dataframe
data_adjusted = data[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home', 'total_points',
                         'creativity','ict_index','influence','threat','xg', 'xa', 'npxg', 'team_xg',
                         'team_xa', 'team_npxg', 'fdr','season', 'minutes', 'assists', 
                         'bonus', 'bps', 'clean_sheets', 'goals_conceded', 'goals_scored', 'penalties_saved', 
                         'red_cards', 'saves', 'yellow_cards', 'team_a_score', 'team_h_score', 'team_won', 
                         'team_mv', 'opponent_team_mv']]

#We concatenate adjusted original data with next gameweek's dataframe
data_adjusted = pd.concat([data_adjusted, next_gw])
data_adjusted = data_adjusted.drop_duplicates().reset_index()
data_adjusted = data_adjusted.drop('index', axis=1)

#Square and cube FDR, xg, xa (for both players and teams)
data_adjusted['fdr_squared'] = data_adjusted['fdr']**2
data_adjusted['fdr_cubed'] = data_adjusted['fdr']**3
data_adjusted['xg_squared'] = data_adjusted['xg']**2
data_adjusted['xg_cubed'] = data_adjusted['xg']**3
data_adjusted['xa_squared'] = data_adjusted['xa']**2
data_adjusted['xa_cubed'] = data_adjusted['xa']**3
data_adjusted['team_xg_squared'] = data_adjusted['team_xg']**2
data_adjusted['team_xg_cubed'] = data_adjusted['team_xg']**3
data_adjusted['team_xa_squared'] = data_adjusted['team_xa']**2
data_adjusted['team_xa_cubed'] = data_adjusted['team_xa']**3

#Total points
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['total_points'], ['all'])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['total_points'], ['all', 1, 2, 3, 4 , 5, 10, 20])

# #Minutes
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['minutes'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Assists
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['assists'], ['all'])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['assists'], ['all', 1, 2, 3, 4 , 5, 10, 20])

# #Bonus
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['bonus'], ['all'])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['bonus'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Clean sheets
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['clean_sheets'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Goals conceded
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['goals_conceded'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Goals scored
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['goals_scored'], ['all'])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['goals_scored'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Penalties Saved
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['penalties_saved'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Red Cards
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['red_cards'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Saves
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['saves'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Yellow Cards
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['yellow_cards'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Team wins
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['team_won'], ['all', 1, 2, 3, 4 , 5, 10, 20])

#Only keep data for upcoming gw
data_adjusted = data_adjusted.loc[(data_adjusted['gw'] == gameweek)]
data_adjusted = data_adjusted.loc[(data_adjusted['season'] == '2223')]

#Drop irrelevant columns
drop_columns = ['gw', 'minutes', 'player','team', 'opponent_team',
                'assists', 'bonus', 'bps', 'clean_sheets','goals_conceded', 
                'goals_scored', 'penalties_saved', 'red_cards', 'saves',
                'yellow_cards', 'season', 'team_a_score', 'team_h_score', 'team_won']
next_gw_stats = data_adjusted.drop(drop_columns,axis = 1)

#Fill NaN values with 0
next_gw_stats = next_gw_stats.fillna(0)

#Convert was_home column to int (1 if was_home, 0 if otherwise)
next_gw_stats['was_home'] = next_gw_stats["was_home"].astype(int)

#Round all values to two decimals
next_gw_stats = next_gw_stats.round(2)

#CHANGE GAMEWEEK
next_gw_stats.to_csv(path/'gw4_test_data.csv', index=False)
next_gw_stats 

Unnamed: 0,position,was_home,total_points,creativity,ict_index,influence,threat,xg,xa,npxg,...,team_won_last_3,team_won_pg_last_3,team_won_last_4,team_won_pg_last_4,team_won_last_5,team_won_pg_last_5,team_won_last_10,team_won_pg_last_10,team_won_last_20,team_won_pg_last_20
50959,3,1,1,0.3,0.0,0.0,0.0,0.00,0.25,0.00,...,2.0,5.00,2.0,1.53,2.0,1.21,3.0,0.60,5.0,0.42
50960,3,1,1,0.4,0.1,0.8,0.0,0.00,0.00,0.00,...,2.0,90.00,2.0,90.00,2.0,90.00,3.0,2.93,5.0,0.66
50961,4,1,0,0.0,0.0,0.0,0.0,0.37,0.00,0.37,...,2.0,1.64,2.0,1.64,2.0,1.64,3.0,2.45,5.0,2.15
50962,2,1,8,0.2,1.6,11.6,4.0,0.03,0.00,0.03,...,2.0,0.67,2.0,0.50,2.0,0.40,3.0,0.33,5.0,0.33
50963,2,1,6,0.4,1.5,14.2,0.0,0.01,0.00,0.01,...,2.0,0.67,2.0,0.53,2.0,0.42,3.0,0.36,5.0,0.32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51543,3,0,0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,...,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00
51544,2,0,0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,...,0.0,0.00,0.0,0.00,0.0,0.00,1.0,0.78,4.0,0.62
51545,1,0,0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,...,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00
51546,1,0,0,0.0,0.0,0.0,0.0,0.00,0.00,0.00,...,0.0,0.00,0.0,0.00,0.0,0.00,1.0,0.00,4.0,0.00


Now we make our predictions and add them to the upcoming gameweek dataframe:

In [17]:
#Features to make our predictions
gw4 = next_gw_stats.drop('total_points', axis=1)

#Make predictions
predictions_next_gw = linear_reg.predict(gw4)

In [18]:
#We add predictions to the original dataframe
data_adjusted['predicted_total_points'] = predictions_next_gw
predictions_gw4 = data_adjusted[['player', 'gw', 'position', 'team', 'opponent_team', 'season', 'predicted_total_points']]
predictions_gw4 = predictions_gw4.reset_index()
predictions_gw4 = predictions_gw4.drop('index', axis=1)

#Function to change position column values from int to string with position abbreviation
def position_assignment(data):
    if data['position'] == 1:
        return 'GK'
    if data['position'] == 2:
        return 'DEF'
    if data['position'] == 3:
        return 'MID'
    if data['position'] == 4:
        return 'FWD'

predictions_gw4['position'] = predictions_gw4.apply(position_assignment, axis = 1)

#We make all negative values equal to 0
predictions_gw4['predicted_total_points'] = predictions_gw4['predicted_total_points'].round(0).astype(int)
predictions_gw4['predicted_total_points'] = predictions_gw4['predicted_total_points'].where(predictions_gw4['predicted_total_points'] > 0, other=0)

#Player data from this season
players_2223 = pd.read_csv(path/'2022-23/cleaned_players.csv')

#Adjust 2022-23 players dataframe - get players' full names and keep cost column
players_2223['player'] = players_2223['first_name'] + ' ' + players_2223['second_name']
players_2223 = players_2223.set_index('player')
players_2223 = players_2223.drop(['first_name', 'second_name'], axis=1)
players_2223 = players_2223[['now_cost']]
players_2223 = players_2223.rename({'now_cost': 'cost'}, axis=1)
players_2223 = players_2223.astype(str)

#Adjust cost values to represent actual FPL costs
for index, row in players_2223.iterrows():
    if (len(row['cost'])) == 3:
        row['cost'] = (row['cost'][:2] + '.' + row['cost'][2:])
    if (len(row['cost'])) == 2:
        row['cost'] = (row['cost'][:1] + '.' + row['cost'][1:])
        
players_2223['cost'] = players_2223['cost'].astype(float)
players_2223 = players_2223.reset_index()

#Merge predictions with players' costs
predictions_gw4 = predictions_gw4.merge(players_2223, on='player')
predictions_gw4

predictions_gw4.to_csv(path/'gw4_predictions.csv', index=False)

In [19]:
predictions_gw4

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points,cost
0,Mateusz Klich,3,MID,Leeds,Chelsea,2223,1,4.9
1,Adam Forshaw,3,MID,Leeds,Chelsea,2223,1,4.5
2,Patrick Bamford,3,FWD,Leeds,Chelsea,2223,2,7.4
3,Diego Llorente,3,DEF,Leeds,Chelsea,2223,3,4.5
4,Robin Koch,3,DEF,Leeds,Chelsea,2223,3,4.5
...,...,...,...,...,...,...,...,...
567,Will Smallbone,3,MID,Southampton,Leicester City,2223,0,4.5
568,Tino Livramento,3,DEF,Southampton,Leicester City,2223,0,4.5
569,Mateusz Lis,3,GK,Southampton,Leicester City,2223,0,4.0
570,Willy Caballero,3,GK,Southampton,Leicester City,2223,0,4.0


Let's take a look at our highest-expected scorers:

In [20]:
#Highest-expected scorers for Gameweek 4
highest_expected_scorers_gw4 = predictions_gw4.sort_values('predicted_total_points', ascending=False)
highest_expected_scorers_gw4.head(25)

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points,cost
142,Wilfried Zaha,3,MID,Crystal Palace,Aston Villa,2223,13,7.1
6,Jack Harrison,3,MID,Leeds,Chelsea,2223,12,6.0
403,Martin Ødegaard,3,MID,Arsenal,Bournemouth,2223,12,6.4
555,Che Adams,3,FWD,Southampton,Leicester City,2223,10,6.4
225,Harry Kane,3,FWD,Tottenham,Wolves,2223,9,11.4
318,Mohamed Salah,3,MID,Liverpool,Manchester Utd,2223,9,13.0
171,Aleksandar Mitrović,3,FWD,Fulham,Brentford,2223,9,6.6
16,Rodrigo Moreno,3,MID,Leeds,Chelsea,2223,9,6.3
12,Brenden Aaronson,3,MID,Leeds,Chelsea,2223,9,5.5
362,Bernardo Veiga de Carvalho e Silva,3,MID,Manchester City,Newcastle Utd,2223,9,6.8


## Ideal Team <a class="anchor" id="ideal_team"></a>

### Ideal Team - Gameweek 3 (No Budget Constraint) <a class="anchor" id="ideal_team_gw3_no_budget"></a>

In [21]:
ideal_team_gw3_no_budget = pd.read_csv(path/'gw3_ideal_team_no_budget.csv')
ideal_team_gw3_no_budget = ideal_team_gw3_no_budget.merge(player_stats_gw3, on='player')
ideal_team_gw3_no_budget

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points,cost,total_points
0,Lukasz Fabianski,3,GK,West Ham United,Brighton and Hove Albion,2223,6,5.0,1
1,Nick Pope,3,GK,Newcastle United,Manchester City,2223,5,5.0,3
2,Trent Alexander-Arnold,3,DEF,Liverpool,Manchester United,2223,9,7.5,0
3,Aaron Cresswell,3,DEF,West Ham United,Brighton and Hove Albion,2223,7,5.0,0
4,Reece James,3,DEF,Chelsea,Leeds,2223,6,6.0,1
5,Kyle Walker-Peters,3,DEF,Southampton,Leicester City,2223,6,4.5,2
6,Pascal Struijk,3,DEF,Leeds,Chelsea,2223,5,4.5,6
7,Kevin De Bruyne,3,MID,Manchester City,Newcastle United,2223,11,12.1,5
8,Mohamed Salah,3,MID,Liverpool,Manchester United,2223,10,13.0,8
9,Rodrigo Moreno,3,MID,Leeds,Chelsea,2223,10,6.1,13


In [22]:
#Ideal team expected total points:
predicted_points = str(sum(ideal_team_gw3_no_budget['predicted_total_points']))
print("Ideal team expected total points: " + predicted_points)

Ideal team expected total points: 115


In [23]:
#Ideal team expected total points:
actual_points = str(sum(ideal_team_gw3_no_budget['total_points']))
print("Ideal team actual total points: " + actual_points)

Ideal team actual total points: 63


### Ideal Team - Gameweek 3 (Budget Constraint) <a class="anchor" id="ideal_team_gw3_budget"></a>

In [24]:
ideal_team_gw3_budget = pd.read_csv(path/'gw3_ideal_team_budget.csv')
ideal_team_gw3_budget = ideal_team_gw3_budget.merge(player_stats_gw3, on='player')
ideal_team_gw3_budget

Unnamed: 0,player,team,opponent_team,position,predicted_total_points,cost,total_points
0,Nick Pope,Newcastle United,Manchester City,GK,5,5.0,3
1,Lukasz Fabianski,West Ham United,Brighton and Hove Albion,GK,6,5.0,1
2,Reece James,Chelsea,Leeds,DEF,6,6.0,1
3,Trent Alexander-Arnold,Liverpool,Manchester United,DEF,9,7.5,0
4,Ben Mee,Brentford,Fulham,DEF,5,4.5,0
5,Kyle Walker-Peters,Southampton,Leicester City,DEF,6,4.5,2
6,Aaron Cresswell,West Ham United,Brighton and Hove Albion,DEF,7,5.0,0
7,Rodrigo Moreno,Leeds,Chelsea,MID,10,6.1,13
8,Kevin De Bruyne,Manchester City,Newcastle United,MID,11,12.1,5
9,Phil Foden,Manchester City,Newcastle United,MID,9,8.0,2


In [25]:
#Ideal team expected total points
predicted_points = str(sum(ideal_team_gw3_budget['predicted_total_points']))
print("Ideal team expected total points: " + predicted_points)

Ideal team expected total points: 112


In [26]:
#Ideal team expected total points
actual_points = str(sum(ideal_team_gw3_budget['total_points']))
print("Ideal team expected total points: " + actual_points)

Ideal team expected total points: 50


### Ideal Team - Gameweek 4 (No Budget Constraint) <a class="anchor" id="ideal_team_gw4_no_budget"></a>

The following algorithm returns an ideal team, according to **predicted_total_points**. The team satisfies the position requirements (2 goalkeepers, 5 defenders, 5 midfielders, and 3 forwards) and the team constraint (no more than 3 players per team), however, it does **NOT** satisfy the budget constraint.

If we were to pick players based on our model's **predicted_total_points** (without any budget constraint), we should pick the following players:

*Why would we want a team that doesn't satisfy the budget constraint?*
- Although we could not put together this squad for regular FPL, we could still select it for FPL draft, which doesn't have any budget constraint.

In [27]:
def get_ideal_team(gk = 2, df = 5, md = 5, fwd = 3, team_max = 3):
    ideal_team = []
    positions = {'GK': gk, 'DEF': df, 'MID': md, 'FWD': fwd}
    teams = {'Arsenal': team_max, 'Leeds': team_max, 'Manchester City': team_max, 
             'Tottenham': team_max, 'Liverpool': team_max, 'Southampton': team_max, 
             'Chelsea': team_max, 'Brentford': team_max, 'Nottingham Forest': team_max, 
             'Wolves': team_max, 'Aston Villa': team_max, 
             'Crystal Palace': team_max, 'West Ham': team_max, 'Leicester City': team_max, 
             'Newcastle Utd': team_max, 'Bournemouth': team_max, 'Everton': team_max, 
             'Brighton': team_max, 
             'Manchester Utd': team_max, 'Fulham': team_max}
    t = highest_expected_scorers_gw4.iterrows()
    for i, row1 in t:
        if (positions[row1['position']] > 0):
            ideal_team.append(row1['player'])
            positions[row1['position']] = positions[row1['position']] - 1
            teams[row1['team']] = teams[row1['team']] - 1
    return ideal_team

ideal_team_gw4_no_budget = pd.DataFrame(get_ideal_team()) 

In [28]:
ideal_team_gw4_no_budget = ideal_team_gw4_no_budget.rename({0: 'player'}, axis=1)
ideal_team_gw4_no_budget = ideal_team_gw4_no_budget.merge(highest_expected_scorers_gw4, on='player')
ideal_team_gw4_no_budget.position = pd.Categorical(ideal_team_gw4_no_budget.position, categories=['GK', 'DEF', 'MID', 'FWD'])
ideal_team_gw4_no_budget = ideal_team_gw4_no_budget.sort_values('position')
ideal_team_gw4_no_budget = ideal_team_gw4_no_budget.reset_index().drop('index', axis=1)
ideal_team_gw4_no_budget.to_csv(path/'gw4_ideal_team_no_budget.csv', index=False)
ideal_team_gw4_no_budget

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points,cost
0,David Raya Martin,3,GK,Brentford,Fulham,2223,7,4.5
1,Jordan Pickford,3,GK,Everton,Nottingham Forest,2223,7,4.5
2,William Saliba,3,DEF,Arsenal,Bournemouth,2223,8,4.6
3,Ivan Perišić,3,DEF,Tottenham,Wolves,2223,7,5.5
4,Kieran Trippier,3,DEF,Newcastle Utd,Manchester City,2223,7,5.1
5,Joël Veltman,3,DEF,Brighton,West Ham,2223,6,4.5
6,Lewis Dunk,3,DEF,Brighton,West Ham,2223,5,4.5
7,Wilfried Zaha,3,MID,Crystal Palace,Aston Villa,2223,13,7.1
8,Jack Harrison,3,MID,Leeds,Chelsea,2223,12,6.0
9,Martin Ødegaard,3,MID,Arsenal,Bournemouth,2223,12,6.4


In [29]:
#Ideal team expected total points
predicted_points = str(sum(ideal_team_gw4_no_budget['predicted_total_points']))
print("Ideal team expected total points: " + predicted_points)

Ideal team expected total points: 130


### Ideal Team - Gameweek 4 (Budget Constraint) <a class="anchor" id="ideal_team_gw4_budget"></a>

The following algorithm returns an ideal team, according to **predicted_total_points**. The team satisfies the position requirements (2 goalkeepers, 5 defenders, 5 midfielders, and 3 forwards), the team constraint (no more than 3 players per team), **AND** the budget constraint (squad cost must not exceed £100 million).

If we were to pick players based on our model's **predicted_total_points** (satisfying the budget constraint), we should pick the following players:

In [30]:
positions = highest_expected_scorers_gw4.position.unique()
clubs = highest_expected_scorers_gw4.team.unique()
budget = 100
available_roles = {
    'GK': 2,
    'DEF': 5,
    'MID': 5,
    'FWD': 3    
}

names = [highest_expected_scorers_gw4.player[i] for i in highest_expected_scorers_gw4.index]
teams = [highest_expected_scorers_gw4.team[i] for i in highest_expected_scorers_gw4.index]
roles = [highest_expected_scorers_gw4.position[i] for i in highest_expected_scorers_gw4.index]
costs = [highest_expected_scorers_gw4.cost[i] for i in highest_expected_scorers_gw4.index]
predicted_points = [highest_expected_scorers_gw4.predicted_total_points[i] for i in highest_expected_scorers_gw4.index]
players = [LpVariable("player_" + str(i), cat="Binary") for i in highest_expected_scorers_gw4.index]
prob = LpProblem("Fantasy Ideal Team (total_points)", LpMaximize)

#Maximize predicted_total_points
prob += lpSum(players[i] * predicted_points[i] for i in range(len(highest_expected_scorers_gw4)))
#Budget constraint
prob += lpSum(players[i] * highest_expected_scorers_gw4.cost[highest_expected_scorers_gw4.index[i]] for i in range(len(highest_expected_scorers_gw4))) <= budget

for pos in positions:
    prob += lpSum(players[i] for i in range(len(highest_expected_scorers_gw4)) if roles[i] == pos) <= available_roles[pos]
#Max 3 per team constraint
for club in clubs:
    prob += lpSum(players[i] for i in range(len(highest_expected_scorers_gw4)) if teams[i] == club) <= 3
prob.solve()
df_list = []
for variable in prob.variables():
    if variable.varValue != 0:
        name = highest_expected_scorers_gw4.player[int(variable.name.split("_")[1])]
        club = highest_expected_scorers_gw4.team[int(variable.name.split("_")[1])]
        role = highest_expected_scorers_gw4.position[int(variable.name.split("_")[1])]
        predicted_points = highest_expected_scorers_gw4.predicted_total_points[int(variable.name.split("_")[1])]
        cost = highest_expected_scorers_gw4.cost[int(variable.name.split("_")[1])]
        opponent_team = highest_expected_scorers_gw4.opponent_team[int(variable.name.split("_")[1])]
        df_list.append((name, club, opponent_team, role, predicted_points, cost))
    

# Dataframe with name, club, position, points, cost
ideal_team_gw4_budget = pd.DataFrame(df_list, columns = ['player', 'team', 'opponent_team', 'position', 'predicted_total_points', 'cost'])



Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Users/amirgrunhaus/opt/miniconda3/lib/python3.9/site-packages/pulp/apis/../solverdir/cbc/osx/64/cbc /var/folders/mh/djnbz20x6yx5_43kjtkrmxmh0000gn/T/e80bf977e4204b839a0fe5ce95aa4e67-pulp.mps max timeMode elapsed branch printingOptions all solution /var/folders/mh/djnbz20x6yx5_43kjtkrmxmh0000gn/T/e80bf977e4204b839a0fe5ce95aa4e67-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 30 COLUMNS
At line 3200 RHS
At line 3226 BOUNDS
At line 3799 ENDATA
Problem MODEL has 25 rows, 572 columns and 1716 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Continuous objective value is 130 - 0.00 seconds
Cgl0004I processed model has 25 rows, 268 columns (268 integer (240 of which binary)) and 804 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of -130


In [31]:
ideal_team_gw4_budget.position = pd.Categorical(ideal_team_gw4_budget.position, categories=['GK', 'DEF', 'MID', 'FWD'])
ideal_team_gw4_budget = ideal_team_gw4_budget.sort_values('position')
ideal_team_gw4_budget = ideal_team_gw4_budget.reset_index().drop('index', axis=1)
ideal_team_gw4_budget.to_csv(path/'gw4_ideal_team_budget.csv', index=False)
ideal_team_gw4_budget

Unnamed: 0,player,team,opponent_team,position,predicted_total_points,cost
0,David Raya Martin,Brentford,Fulham,GK,7,4.5
1,Dean Henderson,Nottingham Forest,Everton,GK,7,4.5
2,Kevin Mbabu,Fulham,Brentford,DEF,5,4.5
3,Ivan Perišić,Tottenham,Wolves,DEF,7,5.5
4,Joël Veltman,Brighton,West Ham,DEF,6,4.5
5,William Saliba,Arsenal,Bournemouth,DEF,8,4.6
6,Kieran Trippier,Newcastle Utd,Manchester City,DEF,7,5.1
7,Wilfried Zaha,Crystal Palace,Aston Villa,MID,13,7.1
8,Rodrigo Moreno,Leeds,Chelsea,MID,9,6.3
9,Bernardo Veiga de Carvalho e Silva,Manchester City,Newcastle Utd,MID,9,6.8


In [32]:
#Ideal team expected total points
predicted_points = str(sum(ideal_team_gw4_budget['predicted_total_points']))
print("Ideal team expected total points: " + predicted_points)

Ideal team expected total points: 130
