# FPL Gameweek Player Predictions

Managers in Fantasy Premier League (FPL) earn points from their players for a number of actions. These include goals, assists, clean sheets and saves. They can also earn additional bonus points if they are among the top-performing players in the Bonus Points System (BPS) in any given match.

You can find a detailed breakdown of the scoring system [here](https://fantasy.premierleague.com/help/rules).

## FPL Points Prediction Model

In this notebook I build a model that predicts how many points a player will score in a given gameweek of the 2022-23 Premier League (PL) season, using data going back to the 2017-18 season. The model is a multiple linear regression fitted with the scikit-learn Python library. Its variables and other details are specified later in the notebook.

## Index
* [Data](#data)
* [Lags](#lags)
* [Model](#model)
* [Predictions](#predictions)
* [Ideal Team](#ideal_team)

In [30]:
#Import relevant libraries and packages
import pandas as pd
import numpy as np
import os
import sys
import plotly.express as px
from pathlib import Path
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

## Data <a class="anchor" id="data"></a>

In [31]:
#Paths
path = Path('Data')
path_22_23 = Path('Data/2022-23')

#Import datasets
data = pd.read_csv(path/'training_data_updated.csv', 
                       index_col=0, 
                       dtype={'season':str,
                              'squad':str,
                              'comp':str})
season_gws = pd.read_csv(path/'remaining_season.csv', index_col=0)
player_stats = pd.read_csv(path_22_23/'gws/merged_gw.csv')

#Reset index and drop duplicates (just in case)
data = data.reset_index()
data = data.drop_duplicates()

The data has one row per player per gameweek, going back to the 2016-17 season. Each row holds that player's information and statistics for that gameweek. The dataframe's columns are:

In [32]:
#Data info
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141316 entries, 0 to 141315
Data columns (total 25 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   player           141316 non-null  object 
 1   position         141316 non-null  int64  
 2   gw               141316 non-null  int64  
 3   team             141316 non-null  object 
 4   opponent_team    141316 non-null  object 
 5   was_home         141316 non-null  bool   
 6   season           141316 non-null  object 
 7   minutes          141316 non-null  int64  
 8   total_points     141316 non-null  int64  
 9   assists          141316 non-null  int64  
 10  bonus            141316 non-null  int64  
 11  bps              141316 non-null  int64  
 12  clean_sheets     141316 non-null  int64  
 13  creativity       141316 non-null  float64
 14  goals_conceded   141316 non-null  int64  
 15  goals_scored     141316 non-null  int64  
 16  ict_index        141316 non-null  floa

First, we change the position column values from int to string with the position abbreviation:
- 1 &rarr; Goalkeeper (GK)
- 2 &rarr; Defender (DEF)
- 3 &rarr; Midfielder (MID)
- 4 &rarr; Forward (FWD)

In [33]:
#Function to change position column values from int to string with position abbreviation
def position_assignment(data):
    if data['position'] == 1:
        return 'GK'
    if data['position'] == 2:
        return 'DEF'
    if data['position'] == 3:
        return 'MID'
    if data['position'] == 4:
        return 'FWD'
    
data['position'] = data.apply(position_assignment, axis = 1)
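The same relabelling can also be expressed without a row-wise function. The following is a minimal sketch, assuming a toy dataframe (`toy`) and a lookup dictionary (`POSITION_LABELS`); both names are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative lookup table: FPL position code -> abbreviation
POSITION_LABELS = {1: "GK", 2: "DEF", 3: "MID", 4: "FWD"}

# Toy data standing in for the real dataframe
toy = pd.DataFrame({"player": ["a", "b", "c", "d"], "position": [1, 2, 3, 4]})

# Series.map replaces each code with its label; codes missing from the
# dictionary would become NaN
toy["position"] = toy["position"].map(POSITION_LABELS)
print(toy)
```

On a dataframe of this size, a vectorised `map` is usually much faster than `apply(..., axis=1)`, though the result should be the same for the four known codes.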

We add a column to the dataframe called `fdr`, the fixture difficulty rating (FDR). The FDR expresses how difficult a fixture is perceived to be for a team facing a given opponent. It is a rating from 1 to 5, with 5 being the most difficult.

FPL develops FDR based on a complex algorithm that analyses the performance statistics for each team across their home and away matches. It then combines this data with each team's home and away form over the past six fixtures. 

In FPL, FDR can change from week to week. In other words, Team X might have an FDR of 3 this gameweek, and an FDR of 4 next gameweek. For the sake of simplicity and to maintain consistency across the time-series data, I will assign a constant FDR for each team based on historical PL standings and historical FDRs (dating back to the 2017-18 season). These are my FDR assignments and the logic behind them:

**FDR = 5 &rarr; Manchester City and Liverpool**
- Since the 2017-18 season, Manchester City and Liverpool have been the highest-achieving and most consistent teams. They are arguably the most difficult teams to play against and have finished every season in the top four.

**FDR = 4 &rarr; Arsenal, Chelsea, Manchester United, Tottenham Hotspur**
- These four teams, along with Man City and Liverpool, are considered the PL's "big six", and thus the most difficult teams to play against. I didn't assign them an FDR of 5 because they haven't been as consistent as Man City and Liverpool and haven't earned as many points since the 2017-18 season.

**FDR = 3 &rarr; Brighton, Crystal Palace, Everton, Leicester City, Newcastle United, West Ham United, Wolves**
- These teams are considered "mid-table teams". Although their consistency varies, and some of them are arguably more difficult to play against than others, it makes sense to group them under the same FDR because of their similar league standings since the 2017-18 season.

**FDR = 2 &rarr; Aston Villa, Brentford, Burnley, Leeds, Norwich, Southampton, Watford, Hull City, Middlesbrough, Bournemouth, Sunderland, Swansea, West Brom, Stoke City, Huddersfield, Fulham, Cardiff City, Sheffield United, Nottingham Forest**
- All of these teams (with the exception of Southampton) have been relegated from, or promoted to, the Premier League at least once since the 2016-17 season, and during their time in the Premier League have struggled to stay out of the relegation zone or to finish above 10th place. I grouped Southampton with them because, despite not having been relegated, it hasn't finished a season above 11th place since the 2017-18 season.

**FDR = 1 &rarr; NONE**
- I didn't assign a score of 1 to any of the teams because FPL rarely gives an FDR of 1 to any fixture.

In [34]:
#Lookup table: opponent team name -> constant fixture difficulty rating (FDR)
FDR_BY_TEAM = {
    'Arsenal': 4,
    'Aston Villa': 2,
    'Brentford': 2,
    'Brighton and Hove Albion': 3,
    'Burnley': 2,
    'Chelsea': 4,
    'Crystal Palace': 3,
    'Everton': 3,
    'Leeds': 2,
    'Leicester City': 3,
    'Liverpool': 5,
    'Manchester City': 5,
    'Manchester United': 4,
    'Newcastle United': 3,
    'Norwich': 2,
    'Southampton': 2,
    'Tottenham Hotspur': 4,
    'Spurs': 4,
    'Watford': 2,
    'West Ham United': 3,
    'Wolverhampton Wanderers': 3,
    'Hull City': 2,
    'Middlesbrough': 2,
    'Bournemouth': 2,
    'Sunderland': 2,
    'Swansea City': 2,
    'West Bromwich Albion': 2,
    'Stoke City': 2,
    'Huddersfield Town': 2,
    'Fulham': 2,
    'Cardiff City': 2,
    'Sheffield United': 2,
    'Nottingham Forest': 2,
}

#Function to add fdr (fixture difficulty rating) to dataframe
def fdr_assignment(data):
    #Returns None when the opponent is not in the lookup table
    return FDR_BY_TEAM.get(data['opponent_team'])

data['fdr'] = data.apply(fdr_assignment, axis = 1)
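
Because the rating depends on exact team-name strings, a name that is spelled differently in the data ends up with no FDR. A quick check for this could look like the sketch below. The lookup table here is a deliberately truncated, hypothetical one, and `fixtures` is toy data; "Luton Town" is used only as an example of a name missing from the table.

```python
import pandas as pd

# Hypothetical, truncated lookup table used only for this illustration
fdr_lookup = {"Liverpool": 5, "Arsenal": 4, "Everton": 3, "Fulham": 2}

# Toy fixtures; the last opponent is intentionally absent from the lookup
fixtures = pd.DataFrame({"opponent_team": ["Liverpool", "Fulham", "Luton Town"]})
fixtures["fdr"] = fixtures["opponent_team"].map(fdr_lookup)

# Opponents that received no rating show up as NaN in the fdr column
unmapped = sorted(fixtures.loc[fixtures["fdr"].isna(), "opponent_team"].unique())
print(unmapped)
```

Running a check like this after assigning `fdr` would surface spelling mismatches before the missing values are later filled with 0.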

## Lags <a class="anchor" id="lags"></a>

Since we are dealing with time-series data, we create two functions that compute lagged statistics at the player and team levels. A lag here is a number of past gameweeks. For each lag we specify, the functions compute the statistic's total over that many previous gameweeks and add it to the original dataframe as a new column, which will be useful later for modeling. They also return per-game versions of the lagged statistics (per 90 minutes at the player level, per match at the team level).

Let's look at the two functions:

- **player_lag_stats** &rarr; this function returns the lagged statistic we specify for each player and each specified lag. Let's say we want Kevin De Bruyne's lagged goals_scored (statistic) for the last 1, 2, and 3 gameweeks (lags). Let's assume De Bruyne scored 1, 2, and 0 goals in the past three gameweeks (oldest first, so 0 goals in the most recent one). This is what our lags would look like:
    - *goals_scored_last_1 = 0*
    - *goals_scored_last_2 = 0 + 2 = 2* 
    - *goals_scored_last_3 = 0 + 2 + 1 = 3* 
    
  Rolling averages (assuming De Bruyne played the full 90min per game):
    - *goals_scored_pg_last_1 = 0/1 = 0*
    - *goals_scored_pg_last_2 = (0 + 2)/2 = 1*
    - *goals_scored_pg_last_3 = (0 + 2 + 1)/3 = 1*
    
    
- **team_lag_stats** &rarr; this function does the same as the function above, but on a team level - it returns the lagged statistic for the team as a whole, not just the player. It also returns their *conceded* lagged statistic, and their opponent's lagged and conceded lagged statistic. For example, if we want a team's goals_scored (statistic) in the last 1 gameweek (lag), the function would return how many goals the team scored and conceded in the last gameweek, and how many goals their opponent scored and conceded in the last gameweek. 

We need lagged statistics because we want to predict a player's expected points based on historical data.
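
The De Bruyne example above can be reproduced on a toy dataframe. This is a minimal sketch, assuming four hypothetical rows (`kdb`) in which the last row is the gameweek being predicted, so its own statistics are still 0. It uses the same "rolling sum minus the current row" idea as `player_lag_stats`, written with `groupby(...).transform`.

```python
import pandas as pd

# Hypothetical rows for one player, oldest gameweek first; the last row is the
# gameweek to predict, so its own stats are not yet known (set to 0)
kdb = pd.DataFrame({
    "player": ["Kevin De Bruyne"] * 4,
    "gw": [1, 2, 3, 4],
    "goals_scored": [1, 2, 0, 0],
    "minutes": [90, 90, 90, 0],
})

for lag in (1, 2, 3):
    for stat in ("goals_scored", "minutes"):
        # Sum over the current row plus `lag` previous rows, then subtract the
        # current row so only past gameweeks are counted
        kdb[f"{stat}_last_{lag}"] = kdb.groupby("player")[stat].transform(
            lambda x: x.rolling(window=lag + 1, min_periods=1).sum() - x
        )
    # Per-90-minutes version of the lagged stat
    kdb[f"goals_scored_pg_last_{lag}"] = (
        90 * kdb[f"goals_scored_last_{lag}"] / kdb[f"minutes_last_{lag}"]
    )

print(kdb.iloc[-1])
```

For the last row this gives lagged goals of 0, 2 and 3, and per-90 values of 0, 1 and 1, matching the figures in the list above.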

In [35]:
#Lagged stats for players
def player_lag_stats(df, stats, lags):    
    player_lag = []
    updated_df = df.copy()
    #Minutes go first so the per-90 columns below can use them; build a new list
    #instead of inserting into the caller's list
    stats = ['minutes'] + list(stats)
    for stat in stats:
        for lag in lags:
            stat_name = stat + '_last_' + str(lag)
            minute_game = 'minutes_last_' + str(lag)
            #transform keeps the original row index, so the result aligns with updated_df
            if lag == 'all':
                updated_df[stat_name] = updated_df.groupby(['player'])[stat].transform(lambda x: x.cumsum() - x)
            else: 
                updated_df[stat_name] = updated_df.groupby(['player'])[stat].transform(lambda x: x.rolling(min_periods=1, 
                                                                                            window=lag+1).sum() - x)
            if stat != 'minutes':
                pg_stat_name = stat + '_pg_last_' + str(lag)
                player_lag.append(pg_stat_name)
                #Per-90-minutes value of the lagged stat
                updated_df[pg_stat_name] = 90 * updated_df[stat_name] / updated_df[minute_game]
                #Division by 0 minutes played gives +/-inf; replace it with NaN
                updated_df[pg_stat_name] = updated_df[pg_stat_name].replace([np.inf, -np.inf], np.nan)
            else: player_lag.append(minute_game)
                
    return updated_df, player_lag

In [36]:
#Lagged stats for teams
def team_lag_stats(df, stats, lags):
    team_lag = []
    updated_new = df.copy()
    for stat in stats:
        stat_name_team = stat + '_team'
        stat_conceded_team = stat_name_team + '_conceded'
        stat_team = (df.groupby(['team', 'season', 'gw',
                                   'opponent_team'])
                        [stat].sum().rename(stat_name_team).reset_index())
        stat_team = stat_team.merge(stat_team,
                           left_on=['team', 'season', 'gw',
                                    'opponent_team'],
                           right_on=['opponent_team', 'season', 'gw',
                                     'team'],
                           how='left',
                           suffixes = ('', '_conceded'))
        stat_team.drop(['team_conceded', 'opponent_team_conceded'], axis=1, inplace=True)
        for lag in lags:
            stat_name = stat + '_team_last_' + str(lag)
            stat_conceded_name = stat + '_team_conceded_last_' + str(lag)
            pg_stat_name = stat + '_team_pg_last_' + str(lag)
            pg_stat_conceded_name = stat + '_team_conceded_pg_last_' + str(lag)
            team_lag.extend([pg_stat_name])
            #transform keeps the original row index, so each result aligns with stat_team
            if lag == 'all':
                stat_team[stat_name] = (stat_team.groupby('team')[stat_name_team]
                                              .transform(lambda x: x.cumsum() - x))
                
                stat_team[stat_conceded_name] = (stat_team.groupby('team')[stat_conceded_team]
                                              .transform(lambda x: x.cumsum() - x))
                stat_team[pg_stat_name] = (stat_team[stat_name]
                                                 / stat_team.groupby('team').cumcount())
                stat_team[pg_stat_conceded_name] = (stat_team[stat_conceded_name]
                                                 / stat_team.groupby('team').cumcount())
            else:
                stat_team[stat_name] = (stat_team.groupby('team')[stat_name_team]
                                              .transform(lambda x: x.rolling(min_periods=1, 
                                                                             window=lag + 1).sum() - x))
                stat_team[stat_conceded_name] = (stat_team.groupby('team')[stat_conceded_team]
                                              .transform(lambda x: x.rolling(min_periods=1, 
                                                                             window=lag + 1).sum() - x))
                stat_team[pg_stat_name] = (stat_team[stat_name] / 
                                                 stat_team.groupby('team')[stat_name_team]
                                                 .transform(lambda x: x.rolling(min_periods=1, 
                                                                                window=lag + 1).count() - 1))
                stat_team[pg_stat_conceded_name] = (stat_team[stat_conceded_name] / 
                                                    stat_team.groupby('team')[stat_conceded_name]
                                                 .transform(lambda x: x.rolling(min_periods=1, 
                                                                                window=lag + 1).count() - 1))
        updated_new = updated_new.merge(stat_team, 
                          on=['team', 'season', 'gw', 'opponent_team'], 
                          how='left')
        updated_new = updated_new.merge(stat_team,
                 left_on=['team', 'season', 'gw', 'opponent_team'],
                 right_on=['opponent_team', 'season', 'gw', 'team'],
                 how='left',
                 suffixes = ('', '_opponent'))
        updated_new.drop(['team_opponent', 'opponent_team_opponent'], axis=1, inplace=True)
        
    #Add the matching '_opponent' column name for every team-level lag column
    team_lag = team_lag + [name + '_opponent' for name in team_lag]  

    return updated_new, team_lag
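
The core of `team_lag_stats` is pairing each team's match total with its opponent's total, which gives the *conceded* value. The sketch below illustrates that step on hypothetical data for a single match. It uses a rename-then-merge formulation, which is an assumption made here for readability and should be equivalent in effect to the self-merge with `left_on`/`right_on` used in the function.

```python
import pandas as pd

# Hypothetical player rows for a single match: team A beat team B 2-1
rows = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "opponent_team": ["B", "B", "A", "A"],
    "season": ["2223"] * 4,
    "gw": [1] * 4,
    "goals_scored": [1, 1, 0, 1],
})
keys = ["team", "season", "gw", "opponent_team"]

# Step 1: sum the player rows into one row per team and match
team_totals = (rows.groupby(keys)["goals_scored"].sum()
                   .rename("goals_scored_team").reset_index())

# Step 2: swap the team/opponent labels so each row describes the opponent's
# total, then join it back on the same keys
conceded = team_totals.rename(columns={
    "team": "opponent_team",
    "opponent_team": "team",
    "goals_scored_team": "goals_scored_team_conceded",
})
paired = team_totals.merge(conceded, on=keys, how="left")
print(paired)
```

Each row of `paired` now holds what the team scored and what it conceded in that match, which is the input the lag windows are then applied to.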

## Model <a class="anchor" id="model"></a>

### Training Data

Now that we have our data and our lag functions, we can proceed to create the training data for our model.

Our model builds its features from the following statistics (total_points is also the value we predict, so only its lagged versions are used as features):

- **total_points**: total points scored.
- **minutes**: minutes played.
- **assists**: assists made.
- **bonus**: bonus points scored.
- **clean_sheets**: whether the player kept a clean sheet.
- **goals_conceded**: goals conceded.
- **goals_scored**: goals scored.
- **penalties_saved**: penalties saved (GK only).
- **red_cards**: whether the player received a red card.
- **saves**: saves (GK only).
- **yellow_cards**: whether the player received a yellow card.
- **was_home**: whether the player played at home.
- **creativity**: assesses player performance in terms of producing goalscoring opportunities for others. It can be used as a guide to identify the players most likely to supply assists. It analyses frequency of passing and crossing, pitch location and quality of final ball.
- **influence**: Influence evaluates the degree to which a player has made an impact on a single match or throughout the season. It takes into account events and actions that could directly or indirectly affect the outcome of the fixture. At the top level these are decisive actions like goals and assists. But the Influence score also processes significant defensive actions to analyse the effectiveness of defenders and goalkeepers.
- **threat**: This is a value that examines a player's threat on goal. It gauges the individuals most likely to score goals. While attempts are the key action, the Index looks at pitch location, giving greater weight to actions that are regarded as the best chances to score.
- **ict_index**: Influence, creativity, and threat are combined to create an overall Influence, Creativity, Threat (ICT) Index score.
- **fdr**: fixture difficulty rating.

*Note: Each statistic is per player, per gameweek. The specific lags we're using, and whether each statistic is lagged at the player level, the team level, or both, are specified below.*

In [37]:
#Create training data by adding lag features, dropping irrelevant columns, and adjusting data types

#We drop the '1617' season because of lacking data, and drop duplicates (just in case)
training_data = data[data['season'] != '1617']
training_data = training_data.drop_duplicates()

#Total points
training_data, teams_lag = team_lag_stats(training_data, ['total_points'], ['all', 1, 2, 3, 4, 5])
training_data, players_lag = player_lag_stats(training_data, ['total_points'], ['all', 1, 2, 3, 4, 5])

#Minutes
training_data, players_lag = player_lag_stats(training_data, ['minutes'], ['all', 1, 2, 3, 4, 5])

#Assists
training_data, players_lag = player_lag_stats(training_data, ['assists'], ['all', 1, 2, 3, 4, 5])

#Bonus
training_data, teams_lag = team_lag_stats(training_data, ['bonus'], [1, 2, 3])
training_data, players_lag = player_lag_stats(training_data, ['bonus'], ['all', 1, 2, 3, 4, 5])

#Clean sheets
training_data, players_lag = player_lag_stats(training_data, ['clean_sheets'], ['all', 1, 2, 3, 4, 5])

#Goals conceded
training_data, teams_lag = team_lag_stats(training_data, ['goals_conceded'], [1, 2, 3, 4, 5])
training_data, players_lag = player_lag_stats(training_data, ['goals_conceded'], ['all', 1, 2, 3, 4, 5])

#Goals scored
training_data, teams_lag = team_lag_stats(training_data, ['goals_scored'], [1, 2, 3, 4, 5])
training_data, players_lag = player_lag_stats(training_data, ['goals_scored'], ['all', 1, 2, 3, 4, 5])

#Penalties Saved
training_data, players_lag = player_lag_stats(training_data, ['penalties_saved'], ['all', 1, 2, 3, 4, 5])

#Red Cards
training_data, teams_lag = team_lag_stats(training_data, ['red_cards'], [1, 2, 3, 4, 5])
training_data, players_lag = player_lag_stats(training_data, ['red_cards'], ['all', 1, 2, 3, 4, 5])

#Saves
training_data, players_lag = player_lag_stats(training_data, ['saves'], ['all', 1, 2, 3, 4, 5])

#Yellow Cards
training_data, teams_lag = team_lag_stats(training_data, ['yellow_cards'], [1, 2, 3, 4, 5])
training_data, players_lag = player_lag_stats(training_data, ['yellow_cards'], ['all', 1, 2, 3, 4, 5])

#Columns to drop
drop_columns = ['gw', 'player', 'minutes', 'position', 'team', 'opponent_team',
                'assists', 'bonus', 'bps', 'clean_sheets','goals_conceded', 
                'goals_scored', 'penalties_saved', 'red_cards', 'saves',
                'yellow_cards', 'season', 'team_a_score', 'team_h_score']
training_data = training_data.drop(drop_columns,axis = 1)

#Fill NaN values with 0
training_data = training_data.fillna(0)

#Convert was_home column values to integers (1 for was_home, 0 otherwise)
training_data['was_home'] = training_data["was_home"].astype(int)

#Round all numbers to two decimal points for simplicity
training_data = training_data.round(2)

training_data

| | was_home | total_points | creativity | ict_index | influence | threat | fdr | total_points_team | total_points_team_conceded | total_points_team_last_all | ... | yellow_cards_last_1 | yellow_cards_pg_last_1 | yellow_cards_last_2 | yellow_cards_pg_last_2 | yellow_cards_last_3 | yellow_cards_pg_last_3 | yellow_cards_last_4 | yellow_cards_pg_last_4 | yellow_cards_last_5 | yellow_cards_pg_last_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.6 | 1.9 | 0.4 | 18.0 | 4.0 | 11 | 83.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 53 | 24.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 6 | 46.9 | 8.7 | 40.2 | 0.0 | 3.0 | 66 | 15.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 6 | 11.2 | 6.7 | 29.6 | 26.0 | 3.0 | 52 | 40.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 9 | 25.2 | 10.9 | 48.6 | 35.0 | 5.0 | 43 | 44.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 117632 | 0 | 2 | 12.3 | 2.8 | 15.6 | 0.0 | 2.0 | 38 | 39.0 | 2918 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 117633 | 0 | 1 | 24.1 | 5.5 | 7.6 | 23.0 | 2.0 | 38 | 39.0 | 2918 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 117634 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 38 | 39.0 | 2918 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 117635 | 0 | 2 | 1.9 | 2.3 | 21.4 | 0.0 | 2.0 | 38 | 39.0 | 2918 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Multiple Linear Regression Prediction Model

We use the scikit-learn Python library to fit a multiple linear regression model on a random 80/20 train/test split of the training data.

In [38]:
#Multiple Linear Regression Prediction Model

#Features to make our predictions
x = training_data.drop('total_points', axis=1)

#What we want to predict
y = training_data['total_points'] 

#Split up data into train and test sets, fit model, and make predictions
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
linear_reg = LinearRegression()
linear_reg.fit(x_train,y_train)
y_prediction = linear_reg.predict(x_test)
y_prediction

array([5.84843155, 1.06885603, 2.01884218, ..., 4.56584004, 4.62581418,
       1.16539636])

### Measuring Error

To measure the accuracy of our model, we will look at 3 metrics:
- r2 score (coefficient of determination)
- MSE (mean squared error)
- RMSE (root mean squared error)

In [39]:
#Calculating r2_score, mse, and rmse
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
score = r2_score(y_test,y_prediction)
print('r2 score is ',score)
print('MSE is ',mean_squared_error(y_test,y_prediction))
print('RMSE is ',np.sqrt(mean_squared_error(y_test,y_prediction)))

r2 score is  0.743917709887884
MSE is  1.5609080705381004
RMSE is  1.249363065941242


**r2 score = ~0.74** &rarr; the model explains roughly 74% of the variance in total points on the test set. An r2 above 0.7 is generally seen as a strong fit between the dependent and independent variables.

**MSE = ~1.56 & RMSE = ~1.25** &rarr; the closer these values are to 0, the better. An RMSE of about 1.25 means that, in root-mean-square terms, predictions miss by about 1.25 points, which is a good sign of our model's accuracy. We should expect these values to get closer to 0 as the season progresses.
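
To make the three metrics concrete, they can be computed by hand with NumPy. This is a small sketch on made-up values (`y_true` and `y_pred` are hypothetical, not taken from the model), using the standard definitions that `r2_score` and `mean_squared_error` implement.

```python
import numpy as np

# Hypothetical actual and predicted points for four player-gameweeks
y_true = np.array([2.0, 6.0, 1.0, 9.0])
y_pred = np.array([3.0, 5.0, 1.0, 7.0])

residuals = y_true - y_pred

# MSE: average squared error; RMSE: its square root, in points
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# r2: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

Because RMSE is expressed in the same unit as the target (FPL points), it is usually the easiest of the three to interpret.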

### Predictions (Training & Test Data)

We now take a deeper look at our predictions for our training and test data, as they will give us an idea of what we should expect for the actual 22-23 season gameweek predictions.

In [40]:
#Make predictions for all data points in original data
all_data_predictions = linear_reg.predict(x)

#Add predictions to original data
data = data[data['season'] != '1617'].copy()
data["predicted_total_points"] = all_data_predictions

#Predictions Dataframe
predictions_vs_actual = data[['player', 'gw', 'position','team', 'opponent_team', 'season', 'total_points', 'predicted_total_points']]
predictions_vs_actual = predictions_vs_actual.reset_index()
predictions_vs_actual = predictions_vs_actual.drop('index', axis=1)

#We make all negative values equal to 0 and round predictions to whole points
predictions_vs_actual['predicted_total_points'] = predictions_vs_actual['predicted_total_points'].clip(lower=0).round(0).astype(int)
predictions_vs_actual

| | player | gw | position | team | opponent_team | season | total_points | predicted_total_points |
|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Cresswell | 1 | DEF | West Ham United | Manchester United | 1718 | 0 | 0 |
| 1 | Aaron Lennon | 1 | MID | Everton | Stoke City | 1718 | 0 | 1 |
| 2 | Aaron Mooy | 1 | MID | Huddersfield Town | Crystal Palace | 1718 | 6 | 7 |
| 3 | Aaron Ramsey | 1 | MID | Arsenal | Leicester City | 1718 | 6 | 5 |
| 4 | Abdoulaye Doucouré | 1 | MID | Watford | Liverpool | 1718 | 9 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 117632 | Marc Roca Junqué | 2 | MID | Leeds | Southampton | 2223 | 2 | 2 |
| 117633 | Brenden Aaronson | 2 | MID | Leeds | Southampton | 2223 | 1 | 2 |
| 117634 | Darko Gyabi | 2 | MID | Leeds | Southampton | 2223 | 0 | 0 |
| 117635 | Tyler Adams | 2 | MID | Leeds | Southampton | 2223 | 2 | 3 |


Let's look at the actual vs predicted total points for Mohamed Salah, the highest-scoring FPL player, during the 21-22 season.

In [41]:
#Mohamed Salah 21-22 predictions
salah_predictions = predictions_vs_actual[(predictions_vs_actual['player'] == 'Mohamed Salah') & (predictions_vs_actual['season'] == '2122')]
salah_predictions

| | player | gw | position | team | opponent_team | season | total_points | predicted_total_points |
|---|---|---|---|---|---|---|---|---|
| 91511 | Mohamed Salah | 1 | MID | Liverpool | Norwich | 2122 | 17 | 14 |
| 92070 | Mohamed Salah | 2 | MID | Liverpool | Burnley | 2122 | 3 | 5 |
| 92645 | Mohamed Salah | 3 | MID | Liverpool | Chelsea | 2122 | 10 | 9 |
| 93235 | Mohamed Salah | 4 | MID | Liverpool | Leeds | 2122 | 8 | 11 |
| 93838 | Mohamed Salah | 5 | MID | Liverpool | Crystal Palace | 2122 | 12 | 12 |
| 94449 | Mohamed Salah | 6 | MID | Liverpool | Brentford | 2122 | 7 | 8 |
| 95062 | Mohamed Salah | 7 | MID | Liverpool | Manchester City | 2122 | 13 | 13 |
| 95678 | Mohamed Salah | 8 | MID | Liverpool | Watford | 2122 | 13 | 13 |
| 96295 | Mohamed Salah | 9 | MID | Liverpool | Manchester United | 2122 | 24 | 22 |
| 96916 | Mohamed Salah | 10 | MID | Liverpool | Brighton and Hove Albion | 2122 | 5 | 7 |


Now let's look at the actual vs predicted total points for Harry Kane.

In [42]:
#Harry Kane 21-22 predictions
kane_predictions = predictions_vs_actual[(predictions_vs_actual['player'] == 'Harry Kane') & (predictions_vs_actual['season'] == '2122')]
kane_predictions

| | player | gw | position | team | opponent_team | season | total_points | predicted_total_points |
|---|---|---|---|---|---|---|---|---|
| 91320 | Harry Kane | 1 | FWD | Tottenham Hotspur | Manchester City | 2122 | 0 | 1 |
| 91877 | Harry Kane | 2 | FWD | Tottenham Hotspur | Wolverhampton Wanderers | 2122 | 0 | 2 |
| 92449 | Harry Kane | 3 | FWD | Tottenham Hotspur | Watford | 2122 | 1 | 3 |
| 93032 | Harry Kane | 4 | FWD | Tottenham Hotspur | Crystal Palace | 2122 | 2 | 2 |
| 93632 | Harry Kane | 5 | FWD | Tottenham Hotspur | Chelsea | 2122 | 2 | 1 |
| 94239 | Harry Kane | 6 | FWD | Tottenham Hotspur | Arsenal | 2122 | 2 | 1 |
| 94851 | Harry Kane | 7 | FWD | Tottenham Hotspur | Aston Villa | 2122 | 2 | 4 |
| 95466 | Harry Kane | 8 | FWD | Tottenham Hotspur | Newcastle United | 2122 | 12 | 8 |
| 96083 | Harry Kane | 9 | FWD | Tottenham Hotspur | West Ham United | 2122 | 2 | 2 |
| 96702 | Harry Kane | 10 | FWD | Tottenham Hotspur | Manchester United | 2122 | 2 | 1 |


## Predictions <a class="anchor" id="predictions"></a>

Now that we have trained our model, we are ready to make predictions for the upcoming PL gameweek.

First, we import the upcoming gameweek's fixture information and merge it with each player's statistics from the last gameweek (creativity, ict_index, influence, threat). We then add the remaining statistics used by the model with a value of zero, which lets us compute their lagged versions later:

*Note: remember to change the gameweek every week to update the data.*

# CHANGE GAMEWEEK HERE

In [43]:
#Dataframe with upcoming gameweek information
gameweek = 3
next_gw = season_gws[season_gws['gw'] == gameweek]
next_gw = next_gw[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home', 'season']]

#Dataframe with player's creativity, ict_index, influence, and threat, from last gameweek
player_stats_last_gw = player_stats[player_stats['GW'] == gameweek - 1]
relevant_columns = ['name','creativity', 'ict_index','influence', 'threat']
player_stats_last_gw = player_stats_last_gw[relevant_columns]
player_stats_last_gw = player_stats_last_gw.rename(columns={'name': 'player'})

#Merge dataframes and make some adjustments
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will get returned, sorted high to low
    :return: copy of df_1 with a 'matches' column holding the matched keys from df_2
    """
    #Work on a copy so the caller's dataframe (which may be a slice) is not modified
    df_1 = df_1.copy()
    s = df_2[key2].tolist()
    
    #Best `limit` candidates (name, score) from df_2 for every key in df_1
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    #Keep only candidates that reach the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

next_gw = fuzzy_merge(next_gw, player_stats_last_gw, 'player', 'player', threshold=91)
next_gw = next_gw.merge(player_stats_last_gw, left_on = 'matches', right_on = 'player')
next_gw = next_gw.drop(['player_x', 'matches'], axis=1)
next_gw = next_gw.rename(columns={'player_y': 'player'})

#Add relevant statistics with value = 0
next_gw[['minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'goals_conceded', 'goals_scored',
       'penalties_saved', 'red_cards', 'saves', 'yellow_cards', 'team_a_score', 'team_h_score']] = 0

#FDR assignment
next_gw['fdr'] = next_gw.apply(fdr_assignment, axis = 1)

#Re-order columns
next_gw = next_gw[['player', 'position', 'gw', 'team', 'opponent_team', 'fdr','was_home',
       'season', 'minutes', 'total_points', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'goals_conceded', 'goals_scored',
       'ict_index', 'influence', 'penalties_saved', 'red_cards', 'saves',
       'threat', 'yellow_cards', 'team_a_score', 'team_h_score']]

#Convert 'season' column into string
next_gw['season'] = next_gw['season'].apply(str)

next_gw

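| | player | position | gw | team | opponent_team | fdr | was_home | season | minutes | total_points | ... | goals_scored | ict_index | influence | penalties_saved | red_cards | saves | threat | yellow_cards | team_a_score | team_h_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Liam Cooper | 2 | 3 | Leeds | Chelsea | 4 | True | 2223 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 1 | Luke Ayling | 2 | 3 | Leeds | Chelsea | 4 | True | 2223 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 2 | Mateusz Klich | 3 | 3 | Leeds | Chelsea | 4 | True | 2223 | 0 | 0 | ... | 0 | 2.7 | 6.2 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 3 | Adam Forshaw | 3 | 3 | Leeds | Chelsea | 4 | True | 2223 | 0 | 0 | ... | 0 | 0.1 | 0.6 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 4 | Rodrigo Moreno | 3 | 3 | Leeds | Chelsea | 4 | True | 2223 | 0 | 0 | ... | 0 | 22.0 | 75.4 | 0 | 0 | 0 | 113.0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 529 | Gavin Bazunu | 1 | 3 | Southampton | Leicester City | 3 | False | 2223 | 0 | 0 | ... | 0 | 2.7 | 27.4 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 530 | Armel Bella-Kotchap | 2 | 3 | Southampton | Leicester City | 3 | False | 2223 | 0 | 0 | ... | 0 | 0.9 | 7.8 | 0 | 0 | 0 | 1.0 | 0 | 0 | 0 |
| 531 | Willy Caballero | 1 | 3 | Southampton | Leicester City | 3 | False | 2223 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 532 | Joe Ayodele-Aribo | 3 | 3 | Southampton | Leicester City | 3 | False | 2223 | 0 | 0 | ... | 0 | 9.5 | 39.2 | 0 | 0 | 0 | 54.0 | 0 | 0 | 0 |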
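
The name matching used in the merge above relies on a similarity threshold. The idea can be illustrated with Python's standard-library `difflib` as a stand-in. Note that this is an assumption made for the sketch: `difflib` scores similarity on a 0 to 1 scale with its own algorithm, so its scores are not interchangeable with fuzzywuzzy's 0 to 100 scores. `best_match` and the candidate list are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical player names as they might appear in the stats table
candidates = ["Mohamed Salah", "Harry Kane", "Kevin De Bruyne"]

def best_match(name, names, cutoff=0.9):
    """Return the closest name scoring at least `cutoff`, or None if there is none."""
    matches = get_close_matches(name, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A slightly different spelling still clears the cutoff
print(best_match("Mohammed Salah", candidates))

# A player who is not in the table has no match and would be dropped by the merge
print(best_match("Erling Haaland", candidates))
```

As in the notebook's merge, players whose names find no match above the threshold silently drop out, so it can be worth comparing row counts before and after the merge.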


Now we take the dataframe we just created and concatenate it to the original data (which holds all previous gameweeks) in order to compute the lagged statistics we used in our original model. Afterwards, we keep only the next gameweek's rows and drop the rest to make our predictions:
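
A toy version of this step is sketched below. The two frames (`history`, `upcoming`) and the players "P" and "Q" are hypothetical; the point is only to show why the zero-valued placeholder rows are appended before the lags are computed.

```python
import pandas as pd

# Hypothetical history: two players, two completed gameweeks each
history = pd.DataFrame({
    "player": ["P", "P", "Q", "Q"],
    "gw": [1, 2, 1, 2],
    "total_points": [6, 2, 1, 9],
})

# Placeholder rows for the gameweek to predict: the outcome is unknown, so it is 0
upcoming = pd.DataFrame({"player": ["P", "Q"], "gw": [3, 3], "total_points": [0, 0]})

combined = pd.concat([history, upcoming], ignore_index=True)

# Points over the previous 2 gameweeks: rolling sum of 3 rows minus the current row
combined["total_points_last_2"] = combined.groupby("player")["total_points"].transform(
    lambda x: x.rolling(window=3, min_periods=1).sum() - x
)

# Keep only the rows of the gameweek we want to predict
next_rows = combined[combined["gw"] == 3]
print(next_rows)
```

The placeholder rows contribute nothing to their own lag values, so each upcoming row ends up carrying only information from earlier gameweeks.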

In [44]:
#Adjusting original data to concatenate with upcoming gameweek dataframe
data_adjusted = data[['player', 'position', 'gw', 'team', 'opponent_team', 'was_home', 'total_points',
                         'creativity','ict_index','influence','threat','fdr','season', 'minutes', 'assists', 
                         'bonus', 'bps', 'clean_sheets', 'goals_conceded', 'goals_scored', 'penalties_saved', 
                         'red_cards', 'saves', 'yellow_cards', 'team_a_score', 'team_h_score']]

#We concatenate adjusted original data with next gameweek's dataframe
data_adjusted = pd.concat([data_adjusted, next_gw])
data_adjusted = data_adjusted.drop_duplicates().reset_index(drop=True)

#Lagged values
#Total points
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['total_points'], ['all', 1, 2, 3, 4, 5])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['total_points'], ['all', 1, 2, 3, 4, 5])

#Minutes
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['minutes'], ['all', 1, 2, 3, 4, 5])

#Assists
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['assists'], ['all', 1, 2, 3, 4, 5])

#Bonus
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['bonus'], [1, 2, 3])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['bonus'], ['all', 1, 2, 3, 4, 5])

#Clean sheets
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['clean_sheets'], ['all', 1, 2, 3, 4, 5])

#Goals conceded
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['goals_conceded'], [1, 2, 3, 4, 5])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['goals_conceded'], ['all', 1, 2, 3, 4, 5])

#Goals scored
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['goals_scored'], [1, 2, 3, 4, 5])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['goals_scored'], ['all', 1, 2, 3, 4, 5])

#Penalties Saved
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['penalties_saved'], ['all', 1, 2, 3, 4, 5])

#Red Cards
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['red_cards'], [1, 2, 3, 4, 5])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['red_cards'], ['all', 1, 2, 3, 4, 5])

#Saves
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['saves'], ['all', 1, 2, 3, 4, 5])

#Yellow Cards
data_adjusted, teams_lag = team_lag_stats(data_adjusted, ['yellow_cards'], [1, 2, 3, 4, 5])
data_adjusted, players_lag = player_lag_stats(data_adjusted, ['yellow_cards'], ['all', 1, 2, 3, 4, 5])

#Fill NaN values with 0
data_adjusted = data_adjusted.fillna(0)

#Only keep data for upcoming gw (copy so that later column assignments don't act on a slice)
data_adjusted = data_adjusted.loc[(data_adjusted['gw'] == gameweek) &
                                  (data_adjusted['season'] == '2223')].copy()

#Drop irrelevant columns
drop_columns = ['gw', 'minutes', 'player', 'position', 'team', 'opponent_team',
                'assists', 'bonus', 'bps', 'clean_sheets','goals_conceded', 
                'goals_scored', 'penalties_saved', 'red_cards', 'saves',
                'yellow_cards', 'season', 'team_a_score', 'team_h_score']
next_gw_stats = data_adjusted.drop(drop_columns,axis = 1)

#Convert was_home column to int (1 if was_home, 0 otherwise)
next_gw_stats['was_home'] = next_gw_stats['was_home'].astype(int)

#Round all values to two decimals
next_gw_stats = next_gw_stats.round(2)

next_gw_stats 

Unnamed: 0,was_home,total_points,creativity,ict_index,influence,threat,fdr,total_points_team,total_points_team_conceded,total_points_team_last_all,...,yellow_cards_last_1,yellow_cards_pg_last_1,yellow_cards_last_2,yellow_cards_pg_last_2,yellow_cards_last_3,yellow_cards_pg_last_3,yellow_cards_last_4,yellow_cards_pg_last_4,yellow_cards_last_5,yellow_cards_pg_last_5
117637,1,0,0.0,0.0,0.0,0.0,4.0,0,0.0,2956,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.50,1.0,0.33
117638,1,0,0.0,0.0,0.0,0.0,4.0,0,0.0,2956,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
117639,1,0,21.0,2.7,6.2,0.0,4.0,0,0.0,2956,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
117640,1,0,0.1,0.1,0.6,0.0,4.0,0,0.0,2956,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
117641,1,0,31.8,22.0,75.4,113.0,4.0,0,0.0,2956,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.27,1.0,0.21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118166,0,0,0.0,2.7,27.4,0.0,3.0,0,0.0,7037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
118167,0,0,0.3,0.9,7.8,1.0,3.0,0,0.0,7037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
118168,0,0,0.0,0.0,0.0,0.0,3.0,0,0.0,7037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
118169,0,0,2.1,9.5,39.2,54.0,3.0,0,0.0,7037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.00
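
The lag step above relies on the `team_lag_stats` and `player_lag_stats` helpers defined earlier in the notebook. As a rough illustration of the idea only, the sketch below computes "sum over the previous n appearances" features on a toy dataframe. The helper `add_player_lags` and the toy data are hypothetical and are not the notebook's own implementation; the `shift(1)` is an assumption about how the current gameweek is kept out of its own lag.

```python
import pandas as pd

def add_player_lags(df, feature, windows):
    """Hypothetical helper: sum of `feature` over each player's previous n rows."""
    out = df.sort_values(['player', 'season', 'gw']).copy()
    for n in windows:
        # shift(1) drops the current gameweek from the window,
        # so the value we want to predict never leaks into its own features
        out[f'{feature}_last_{n}'] = out.groupby('player')[feature].transform(
            lambda s: s.shift(1).rolling(n, min_periods=1).sum()
        )
    return out

# Toy data: gameweek 4 for player A stands in for the "upcoming" row
toy = pd.DataFrame({
    'player': ['A', 'A', 'A', 'A', 'B', 'B'],
    'season': ['2223'] * 6,
    'gw': [1, 2, 3, 4, 1, 2],
    'total_points': [2, 6, 1, 0, 5, 3],
})

# Players with no history get NaN lags, which we fill with 0 as in the cell above
lagged = add_player_lags(toy, 'total_points', [1, 2]).fillna(0)

# Keep only the upcoming gameweek's rows
upcoming = lagged[(lagged['gw'] == 4) & (lagged['season'] == '2223')]
print(upcoming)
```

Computing the lags on the concatenated history first and filtering afterwards is what lets the upcoming rows inherit statistics from earlier gameweeks.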


Now we make our predictions and add them to the upcoming gameweek dataframe:

### Gameweek 3 Predictions

In [45]:
# #Features to make our predictions
# gw3 = next_gw_stats.drop('total_points', axis=1)

# #Make predictions
# predictions_next_gw = linear_reg.predict(gw3)

In [46]:
# #We add predictions to the original dataframe
# data_adjusted['predicted_total_points'] = predictions_next_gw
# predictions_gw3 = data_adjusted[['player', 'gw', 'position', 'team', 'opponent_team', 'season', 'predicted_total_points']]
# predictions_gw3 = predictions_gw3.reset_index(drop=True)
# predictions_gw3['position'] = predictions_gw3.apply(position_assignment, axis = 1)

# #We round predictions to the nearest integer and set all negative values to 0
# predictions_gw3['predicted_total_points'] = predictions_gw3['predicted_total_points'].round(0).astype(int)
# predictions_gw3['predicted_total_points'] = predictions_gw3['predicted_total_points'].where(predictions_gw3['predicted_total_points'] > 0, other=0)

# predictions_gw3

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points
0,Liam Cooper,3,DEF,Leeds,Chelsea,2223,0
1,Luke Ayling,3,DEF,Leeds,Chelsea,2223,0
2,Mateusz Klich,3,MID,Leeds,Chelsea,2223,0
3,Adam Forshaw,3,MID,Leeds,Chelsea,2223,0
4,Rodrigo Moreno,3,MID,Leeds,Chelsea,2223,13
...,...,...,...,...,...,...,...
529,Gavin Bazunu,3,GK,Southampton,Leicester City,2223,3
530,Armel Bella-Kotchap,3,DEF,Southampton,Leicester City,2223,0
531,Willy Caballero,3,GK,Southampton,Leicester City,2223,0
532,Joe Ayodele-Aribo,3,MID,Southampton,Leicester City,2223,6
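
A linear regression can output negative or fractional points, which is why the predictions are rounded and floored at zero. A minimal sketch of that post-processing step on made-up raw predictions (the numbers below are illustrative, not model output) could look like this:

```python
import numpy as np

# Hypothetical raw regression outputs, for illustration only
raw_predictions = np.array([-1.3, 0.4, 2.6, 12.7])

# Round to the nearest integer, then replace anything below 0 with 0
predicted_points = np.clip(np.round(raw_predictions), 0, None).astype(int)
print(predicted_points)  # [ 0  0  3 13]
```

`np.clip` with a lower bound of 0 has the same effect as the `where(... > 0, other=0)` call used above.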


Let's take a look at our highest-expected scorers:

In [47]:
# #Highest-expected scorers for Gameweek 3
# highest_expected_scorers_gw3 = predictions_gw3.sort_values('predicted_total_points', ascending=False)
# highest_expected_scorers_gw3.head(25)

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points
395,Gabriel Fernando de Jesus,3,FWD,Arsenal,Bournemouth,2223,16
4,Rodrigo Moreno,3,MID,Leeds,Chelsea,2223,13
319,Kevin De Bruyne,3,MID,Manchester City,Newcastle United,2223,11
330,Phil Foden,3,MID,Manchester City,Newcastle United,2223,9
212,Pierre-Emile Højbjerg,3,MID,Tottenham Hotspur,Wolverhampton Wanderers,2223,9
308,Luis Díaz,3,MID,Liverpool,Manchester United,2223,8
516,Kyle Walker-Peters,3,DEF,Southampton,Leicester City,2223,7
288,Kalidou Koulibaly,3,DEF,Chelsea,Leeds,2223,7
436,Mathias Jensen,3,MID,Brentford,Fulham,2223,7
453,Neco Williams,3,DEF,Nottingham Forest,Everton,2223,7


## Ideal Team <a class="anchor" id="ideal_team"></a>

The following greedy algorithm builds an ideal squad from **predicted_total_points**: it walks down the players from the highest to the lowest prediction and picks each one whose position still has an open slot. The resulting squad fulfills the FPL position requirements (2 goalkeepers, 5 defenders, 5 midfielders, and 3 forwards).

Picking purely on our model's **predicted_total_points**, without any budget constraint, gives the following players:

### Ideal Team - Gameweek 3

In [48]:
# def get_ideal_team(gk = 2, df = 5, md = 5, fwd = 3, team_max = 3):
#     ideal_team = []
#     positions = {'GK': gk, 'DEF': df, 'MID': md, 'FWD': fwd}
#     teams = {'Arsenal': team_max, 'Leeds': team_max, 'Manchester City': team_max, 
#              'Tottenham Hotspur': team_max, 'Liverpool': team_max, 'Southampton': team_max, 
#              'Chelsea': team_max, 'Brentford': team_max, 'Nottingham Forest': team_max, 
#              'Wolverhampton Wanderers': team_max, 'Aston Villa': team_max, 
#              'Crystal Palace': team_max, 'West Ham United': team_max, 'Leicester City': team_max, 
#              'Newcastle United': team_max, 'Bournemouth': team_max, 'Everton': team_max, 
#              'Brighton and Hove Albion': team_max, 
#              'Manchester United': team_max, 'Fulham': team_max}
#     t = highest_expected_scorers_gw3.iterrows()
#     for i, row1 in t:
#         if positions[row1['position']] > 0 and teams[row1['team']] > 0:
#             ideal_team.append(row1['player'])
#             positions[row1['position']] = positions[row1['position']] - 1
#             teams[row1['team']] = teams[row1['team']] - 1
#     return ideal_team

# ideal_team_gw3 = pd.DataFrame(get_ideal_team()) 

In [49]:
# ideal_team_gw3 = ideal_team_gw3.rename({0: 'player'}, axis=1)
# ideal_team_gw3 = ideal_team_gw3.merge(highest_expected_scorers_gw3, on='player')
# ideal_team_gw3.position = pd.Categorical(ideal_team_gw3.position, categories=['GK', 'DEF', 'MID', 'FWD'])
# ideal_team_gw3 = ideal_team_gw3.sort_values('position')
# ideal_team_gw3 = ideal_team_gw3.reset_index(drop=True)
# ideal_team_gw3

Unnamed: 0,player,gw,position,team,opponent_team,season,predicted_total_points
0,Dean Henderson,3,GK,Nottingham Forest,Everton,2223,7
1,José Malheiro de Sá,3,GK,Wolverhampton Wanderers,Tottenham Hotspur,2223,6
2,Kyle Walker-Peters,3,DEF,Southampton,Leicester City,2223,7
3,Kalidou Koulibaly,3,DEF,Chelsea,Leeds,2223,7
4,Neco Williams,3,DEF,Nottingham Forest,Everton,2223,7
5,Reece James,3,DEF,Chelsea,Leeds,2223,7
6,Ben Mee,3,DEF,Brentford,Fulham,2223,6
7,Rodrigo Moreno,3,MID,Leeds,Chelsea,2223,13
8,Kevin De Bruyne,3,MID,Manchester City,Newcastle United,2223,11
9,Phil Foden,3,MID,Manchester City,Newcastle United,2223,9
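
To illustrate how a greedy selection can respect both the position quotas and a per-club cap (FPL allows at most three players from the same club), here is a minimal sketch on toy data. The function `pick_team`, the toy players and the small quotas are all hypothetical; this is an illustration of the technique rather than the notebook's `get_ideal_team`.

```python
import pandas as pd

def pick_team(ranked, quotas, team_max=3):
    """Greedy pick: walk players from highest to lowest prediction."""
    remaining = dict(quotas)   # open slots per position
    per_club = {}              # players already picked per club
    picked = []
    for _, row in ranked.iterrows():
        pos, club = row['position'], row['team']
        # Pick only if the position has a free slot AND the club cap is not reached
        if remaining.get(pos, 0) > 0 and per_club.get(club, 0) < team_max:
            picked.append(row['player'])
            remaining[pos] -= 1
            per_club[club] = per_club.get(club, 0) + 1
    return picked

# Toy predictions: three of the top players share club X
toy = pd.DataFrame({
    'player': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'position': ['MID', 'MID', 'FWD', 'MID', 'FWD'],
    'team': ['X', 'X', 'X', 'Y', 'Y'],
    'predicted_total_points': [9, 8, 7, 6, 5],
})
ranked = toy.sort_values('predicted_total_points', ascending=False)

# With a cap of 2 per club, P3 is skipped and the forward slot goes to P5
team = pick_team(ranked, {'MID': 2, 'FWD': 1}, team_max=2)
print(team)  # ['P1', 'P2', 'P5']
```

Counting picks per club in a dictionary that starts empty avoids having to list every club name up front.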
