### This analysis uses the FPL dataset from the FPL API. The ultimate goal is to build models that predict player points for the upcoming FPL game week.

Specifically, we will: 
    
    1. Load the FPL player datasets from JSON
    2. Build pandas dataframes for the 22/23 season with current prices
    3. Conduct some EDA
    4. Split the data into train and test sets
    5. Find the optimal parameters for each model and evaluate performance
    6. Fit the models for prediction
    7. Make predictions
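
For step 4, a random split could let later game weeks leak into the training set. One option is a chronological split. The sketch below assumes the game-week data carries a game-week column named `GW`; the `split_by_gameweek` helper and the toy frame are hypothetical and only illustrate the idea, not necessarily the split used later in this analysis.

```python
import pandas as pd

def split_by_gameweek(df, last_train_gw, gw_col="GW"):
    """Rows up to and including last_train_gw go to train, later rows to test."""
    train = df[df[gw_col] <= last_train_gw]
    test = df[df[gw_col] > last_train_gw]
    return train, test

# Toy data for illustration only (not real FPL records)
toy = pd.DataFrame({
    "name": ["A", "B", "A", "B", "A", "B"],
    "GW": [1, 1, 2, 2, 3, 3],
    "total_points": [2, 6, 1, 9, 5, 2],
})

train, test = split_by_gameweek(toy, last_train_gw=2)
print(len(train), len(test))
```

Holding out the latest game weeks mirrors how the model would be used in practice: trained on the past, asked about the next game week.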

    
Hopefully, after this analysis we will be able to answer the following questions: 
 - How can the FPL dataset be obtained in a format ready for analysis?
 - Which players offer the best value, based on previous-season performance and current price?
 - What are some good predictors of FPL player points for the upcoming game week?
 - Which predictors are correlated and should therefore be dropped before modelling?
 - How should the models be constructed to balance bias and variance?
 - How should model parameters be fine-tuned to avoid overfitting?
 - What are the predicted player points for the upcoming FPL game week using the trained models?
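
As a rough illustration of the correlated-predictors question, the sketch below drops one column from each highly correlated pair. The `drop_correlated` helper, the 0.9 threshold, and the synthetic data are all assumptions made for this example; the column names are borrowed from FPL only as labels.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic predictors for illustration only (not real FPL values)
rng = np.random.default_rng(0)
threat = rng.normal(size=200)
toy = pd.DataFrame({
    "threat": threat,
    "ict_index": threat * 2 + rng.normal(scale=0.01, size=200),  # nearly a copy of threat
    "minutes": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy)
print(dropped)
```

The later column of each correlated pair is the one dropped, so column order decides which predictor survives; a more careful version might keep whichever correlates better with the target.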

In [58]:
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
from IPython.display import HTML
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
import chardet
pd.set_option('display.max_rows', None)

#### A detailed guide to all currently available Fantasy Premier League API endpoints: 
https://medium.com/@frenzelts/fantasy-premier-league-api-endpoints-a-detailed-guide-acbd5598eb19

#### Data limitation:
The API does not provide game-by-game, player-level data for past seasons
#### Solution:
Use this GitHub repo: https://github.com/vaastav/Fantasy-Premier-League/tree/master

In [32]:
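# request JSON from an API endpoint; element_id is appended for per-player endpoints
def getJson(base_url, end_point_path, element_id=''):
    response = requests.get(base_url + end_point_path + str(element_id))
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    return response.json()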

In [33]:
# get json data
base_url = 'https://fantasy.premierleague.com/api/'
end_point_general = 'bootstrap-static'
end_point_fixtures = 'fixtures'
end_point_player = 'element-summary/'

# note: the name `json` now refers to the response dict and shadows the imported json module
json = getJson(base_url, end_point_general)
# check keys
json.keys()

dict_keys(['events', 'game_settings', 'phases', 'teams', 'total_players', 'elements', 'element_stats', 'element_types'])

#### Desired data
##### Dataframe #1:
- all 38 game weeks of the 22/23 season
- all 10 games in each of the 38 game weeks
- all players involved [i.e. ~(11 + 4) x 2] in each of those games

Expected number of records = 38 x 10 x (11 + 4) x 2 = ~11.4K
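
The estimate can be checked with a few lines of arithmetic. The variable names are made up for this sketch, and the `11 + 4` figure is taken directly from the text above.

```python
game_weeks = 38
matches_per_game_week = 10
players_per_team = 11 + 4  # 11 starters plus roughly 4 further players, as assumed in the text
teams_per_match = 2

expected_records = game_weeks * matches_per_game_week * players_per_team * teams_per_match
print(expected_records)  # 11400
```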

##### Dataframe #2:
- current player info including cost, position etc.

##### Dataframe #3:
- Join df1 and df2 to get aggregated df for info such as value = average points per minute / current cost
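
A minimal sketch of how Dataframe #3 might be assembled. It assumes the game-week data has `name`, `minutes` and `total_points` columns and joins on `name`, on the further assumption that player ids may not carry over between seasons. The toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy per-game-week rows standing in for Dataframe #1 (illustrative values only)
df1 = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "minutes": [90, 90, 90, 0],
    "total_points": [6, 2, 9, 0],
})

# Toy current-player info standing in for Dataframe #2
df2 = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "position": ["Defender", "Midfielder"],
    "now_cost": [5.0, 10.0],
})

# Aggregate the season per player, then compute points per minute
season = df1.groupby("name", as_index=False)[["minutes", "total_points"]].sum()
season["points_per_minute"] = season["total_points"] / season["minutes"]

# Join on name (an assumption for this sketch) and derive value
df3 = season.merge(df2, on="name", how="inner")
df3["value"] = df3["points_per_minute"] / df3["now_cost"]
print(df3[["name", "points_per_minute", "now_cost", "value"]])
```

An inner join silently drops players whose names do not match across the two sources, so it is worth comparing row counts before and after the merge.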

In [34]:
# build dataframes
df_elements = pd.DataFrame(json['elements'])
df_elements_types = pd.DataFrame(json['element_types'])
df_teams = pd.DataFrame(json['teams'])
total_players = json['total_players']

# .copy() avoids modifying a slice of df_elements when adding columns below
df_elements_short = df_elements[['id', 'first_name', 'second_name','team','element_type','selected_by_percent','now_cost','minutes','transfers_in','value_season','event_points', 'total_points']].copy()

# restructure player dataframe
df_elements_short['position'] = df_elements_short.element_type.map(df_elements_types.set_index('id').singular_name)
df_elements_short['team'] = df_elements_short.team.map(df_teams.set_index('id').name)
df_elements_short['name'] = df_elements_short['first_name'] + ' ' + df_elements_short['second_name']
# players with 0 minutes give NaN or inf; treat both as 0
df_elements_short['points_per_minute'] = df_elements_short['total_points'] / df_elements_short['minutes']
df_elements_short['points_per_minute'] = df_elements_short['points_per_minute'].replace([np.inf, -np.inf], np.nan).fillna(0)
# now_cost is stored in tenths of a million
df_elements_short['now_cost'] = df_elements_short['now_cost'] / 10
df_elements_short['ppm_over_cost'] = df_elements_short['points_per_minute'] / df_elements_short['now_cost']
# keep only players in the top 20% by minutes played
df_elements_short = df_elements_short[df_elements_short['minutes'] >= df_elements_short['minutes'].quantile(0.8)]
df_elements_short = df_elements_short[['id', 'name', 'team', 'position', 'selected_by_percent', 'minutes', 'now_cost', 'points_per_minute', 'ppm_over_cost']]

In [41]:
# show value players
df_elements_short = df_elements_short.sort_values(by=['ppm_over_cost'], ascending=False)
df_elements_short.groupby(['position']).head(5)

Unnamed: 0,id,name,team,position,selected_by_percent,minutes,now_cost,points_per_minute,ppm_over_cost
25,26,Leandro Trossard,Arsenal,Midfielder,2.7,2237,7.0,0.068842,0.009835
416,402,Miguel Almirón Rejala,Newcastle,Midfielder,6.2,2487,6.5,0.06353,0.009774
282,275,Bernd Leno,Fulham,Goalkeeper,9.3,3240,4.5,0.043827,0.009739
116,113,David Raya Martin,Brentford,Goalkeeper,10.0,3420,5.0,0.048538,0.009708
19,20,William Saliba,Arsenal,Defender,14.8,2415,5.0,0.048447,0.009689
154,151,Joël Veltman,Brighton,Defender,1.3,2183,4.5,0.043518,0.009671
515,495,Ben Davies,Spurs,Defender,0.7,2284,4.5,0.043345,0.009632
134,131,Pervis Estupiñán,Brighton,Defender,51.2,2674,5.0,0.047868,0.009574
390,377,Diogo Dalot Teixeira,Man Utd,Defender,2.0,2152,5.0,0.047398,0.00948
231,226,Eberechi Eze,Crystal Palace,Midfielder,12.2,2631,6.5,0.060433,0.009297


In [36]:
df_mo = df_elements_short[(df_elements_short['name']=='Mohamed Salah')]
df_mo

Unnamed: 0,id,name,team,position,selected_by_percent,minutes,now_cost,points_per_minute,ppm_over_cost
316,308,Mohamed Salah,Liverpool,Midfielder,24.6,3290,12.5,0.072644,0.005812


In [37]:
element_id = 308
json = getJson(base_url, end_point_player, element_id)
# check keys
json.keys()

dict_keys(['fixtures', 'history', 'history_past'])

In [22]:
df_player_mo = pd.DataFrame(json['history_past'])
df_player_mo.tail()

Unnamed: 0,season_name,element_code,start_cost,end_cost,total_points,minutes,goals_scored,assists,clean_sheets,goals_conceded,...,bps,influence,creativity,threat,ict_index,starts,expected_goals,expected_assists,expected_goal_involvements,expected_goals_conceded
3,2018/19,118748,130,132,259,3254,22,12,21,20,...,687,1186.8,973.9,2168.0,432.7,0,0.0,0.0,0.0,0.0
4,2019/20,118748,125,125,233,2879,19,10,16,26,...,661,1061.2,834.8,2156.0,405.1,0,0.0,0.0,0.0,0.0
5,2020/21,118748,120,129,231,3077,22,6,11,41,...,657,1056.0,825.7,1980.0,385.8,0,0.0,0.0,0.0,0.0
6,2021/22,118748,125,131,265,2758,23,14,17,22,...,756,1241.0,875.9,2230.0,434.8,0,0.0,0.0,0.0,0.0
7,2022/23,118748,130,131,239,3290,19,13,13,45,...,651,1067.4,899.2,1688.0,365.6,37,21.01,7.03,28.04,47.47


##### Data limitation
As observed above, the FPL API does not provide player-level game week data for past seasons; `history_past` holds season totals only

##### Solution
Use the GitHub repo https://github.com/vaastav/Fantasy-Premier-League/tree/master, cloned locally

In [51]:
# helper functions to build the players master dataframe for the seasons of interest
def getEncoding(file):
    # detect the file encoding from the first 100,000 bytes
    with open(file, 'rb') as rawdata:
        result = chardet.detect(rawdata.read(100000))
    return result['encoding']

def getMasterDataFrame(seasons):
    # read each season's merged game-week file and stack them into one dataframe
    frames = []
    for season in seasons:
        file = 'Fantasy-Premier-League/data/' + season + '/gws/merged_gw.csv'
        frames.append(pd.read_csv(file, encoding=getEncoding(file)))
    return pd.concat(frames, ignore_index=True)

In [54]:
# retrieve player master dataframe by specifying the seasons to pull
# e.g. to pull player data for season 22/23 only, assign seasons = ['2022-23']
seasons = ['2022-23', '2021-22', '2020-21', '2019-20', '2018-19', '2017-18', '2016-17']

df_master = getMasterDataFrame(seasons)
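
The concatenation above stacks seasons without recording which season each row came from. A possible refinement is to tag each frame before stacking. The `concat_with_season` helper and the `season` column are hypothetical additions for this sketch, and the toy frames stand in for the per-season CSV files.

```python
import pandas as pd

def concat_with_season(frames_by_season):
    """Stack per-season frames and record each row's season in a 'season' column."""
    tagged = [df.assign(season=season) for season, df in frames_by_season.items()]
    return pd.concat(tagged, ignore_index=True)

# Toy frames for illustration only; real ones would come from pd.read_csv per season
frames = {
    "2022-23": pd.DataFrame({"name": ["A", "B"], "total_points": [2, 6]}),
    "2021-22": pd.DataFrame({"name": ["A"], "total_points": [5], "tackles": [3]}),
}

df_all = concat_with_season(frames)
print(df_all)
```

Columns that exist in only some seasons (here `tackles`) come out as NaN for the other seasons' rows, which is presumably why a stacked multi-season frame shows NaN-filled columns for its most recent rows.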

In [None]:
df_master.head()

Unnamed: 0,name,position,team,xP,assists,bonus,bps,clean_sheets,creativity,element,...,loaned_in,loaned_out,offside,open_play_crosses,penalties_conceded,recoveries,tackled,tackles,target_missed,winning_goals
0,Nathan Redmond,MID,Southampton,1.5,0,0,3,0,0.0,403,...,,,,,,,,,,
1,Junior Stanislas,MID,Bournemouth,1.1,0,0,3,0,0.0,58,...,,,,,,,,,,
2,Armando Broja,FWD,Chelsea,2.0,0,0,3,0,0.3,150,...,,,,,,,,,,
3,Fabian Schär,DEF,Newcastle,2.4,0,3,43,1,14.6,366,...,,,,,,,,,,
4,Jonny Evans,DEF,Leicester,1.9,0,0,15,0,1.3,249,...,,,,,,,,,,


##### Data limitation
Predictors from FPL (e.g. threat, ICT index, influence) are mostly lagging indicators, and the data lacks player-attribute predictors

##### Solution
Engineer player-attribute features from FIFA and Football Manager data
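
Bringing in attributes from other sources will likely mean matching players by name, and spellings (accents, spacing, capitalisation) can differ between sources. The sketch below builds a normalised join key; the `name_key` helper is hypothetical, and matching on names at all is an assumption about how the external data would be linked.

```python
import unicodedata

def name_key(name):
    """Lowercase, strip accents, and collapse whitespace to build a join key."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())

print(name_key("Pervis Estupiñán"))
print(name_key("  Miguel  Almirón Rejala "))
```

Letters without a Unicode decomposition, such as "ø", pass through unchanged, so some names would still need a manual mapping.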