This program gathers data with SoccerData, a Python library that scrapes soccer data from multiple sites.
https://soccerdata.readthedocs.io/en/stable/intro.html
After a league name and season are entered in cell 3, the program pulls the match sheet for every match of that league and season.
SoccerData caches downloaded data automatically, so repeated requests for the same data are served from the local cache instead of being scraped again. Cell 10 cleans the data by keeping only the columns of interest. Cell 12 saves the result as a CSV file in the performanceData folder; rename the CSV file in that cell to match the league and season before running it.

There is already sample data inside performanceData, so it is not necessary to run this program.

In [1]:
import soccerdata as sd
import pandas as pd

In [2]:
# League identifiers that can be used as the `leagues` argument in cell 3
league_names = ["ENG-Premier League",
                "ESP-La Liga",
                "ITA-Serie A",
                "GER-Bundesliga",
                "FRA-Ligue 1"]

In [3]:
# Choose the league and season to scrape (cell 2 lists the league names)
espn = sd.ESPN(leagues="ENG-Premier League", seasons=2021)

In [4]:
matchsheet = espn.read_matchsheet(match_id=541465)
matchsheet.head()

league,season,game,team,is_home,venue,attendance,capacity,roster,fouls_committed,yellow_cards,red_cards,offsides,won_corners,...,total_long_balls,accurate_long_balls,longball_pct,blocked_shots,effective_tackles,total_tackles,tackle_pct,interceptions,effective_clearance,total_clearance
ENG-Premier League,2021,2020-07-26 West Ham United-Aston Villa,Aston Villa,False,London Stadium,0,,"[{'active': True, 'starter': True, 'jersey': '...",13,1,0,1,7,...,63,22,0.3,3,3,13,0.2,10,11,11
ENG-Premier League,2021,2020-07-26 West Ham United-Aston Villa,West Ham United,True,London Stadium,0,,"[{'active': True, 'starter': True, 'jersey': '...",16,2,0,0,0,...,61,25,0.4,3,8,15,0.5,6,26,26


In [5]:
epl_schedule = espn.read_schedule()

In [6]:
matchsheet_data = []

In [7]:
# Loop through each row in the epl_schedule dataframe
for index, row in epl_schedule.iterrows():
    game_id = row['game_id']
    home_team = row['home_team']
    away_team = row['away_team']
    try:
        # Read match sheet data for the current game_id
        matchsheet = espn.read_matchsheet(match_id=game_id)
        
        # Add home_team, away_team and a season label to the matchsheet dataframe.
        # The season label is hardcoded: update it to match the season set in cell 3.
        matchsheet['home_team'] = home_team
        matchsheet['away_team'] = away_team
        matchsheet['season'] = '2015'
        
        # Append the match sheet dataframe to the list
        matchsheet_data.append(matchsheet)

    except Exception as e:
        # Handle any errors gracefully and log them
        print(f"Error fetching matchsheet for game_id {game_id}: {e}")
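
The loop above prints failures but does not keep track of them. A minimal sketch of the same pattern that also records which game IDs failed, so they can be retried later, might look like the following. The helper name `collect_matchsheets` and its `fetch` parameter are hypothetical and not part of SoccerData; in this notebook, `fetch` would be something like `lambda gid: espn.read_matchsheet(match_id=gid)`. The stub data exists only so the sketch runs without network access.

```python
import pandas as pd


def collect_matchsheets(schedule, fetch, season_label):
    """Fetch one match sheet per schedule row; return (frames, failed_ids).

    `fetch` is any callable that takes a game_id and returns a DataFrame
    (a hypothetical stand-in for espn.read_matchsheet).
    """
    frames, failed_ids = [], []
    for _, row in schedule.iterrows():
        game_id = row["game_id"]
        try:
            sheet = fetch(game_id).copy()
        except Exception as exc:  # broad on purpose: one bad match should not stop the run
            print(f"Error fetching matchsheet for game_id {game_id}: {exc}")
            failed_ids.append(game_id)
            continue
        # Attach the schedule context to every row of this match sheet
        sheet["home_team"] = row["home_team"]
        sheet["away_team"] = row["away_team"]
        sheet["season"] = season_label
        frames.append(sheet)
    return frames, failed_ids


# Made-up schedule and fetcher, for illustration only
schedule = pd.DataFrame(
    {"game_id": [1, 2], "home_team": ["A", "C"], "away_team": ["B", "D"]}
)


def fake_fetch(game_id):
    if game_id == 2:
        raise ValueError("no data")
    return pd.DataFrame({"is_home": [True, False], "total_shots": [10, 7]})


frames, failed_ids = collect_matchsheets(schedule, fake_fetch, "2021")
```

Passing the season label as an argument also avoids having to edit a hardcoded value inside the loop each time the season changes.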

In [8]:
all_matchsheets = pd.concat(matchsheet_data, ignore_index=True)
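
Judging by the output of cell 4, each match sheet is indexed by league, season, game and team, and `ignore_index=True` discards those index levels. That is why the team and season columns are attached by hand in cell 7. As an alternative sketch, assuming the index levels are named as in that output, calling `reset_index()` on each frame before concatenating keeps them as ordinary columns. The toy frames below are made up for illustration.

```python
import pandas as pd

index_names = ["league", "season", "game", "team"]


def toy_sheet(game, home, away):
    """Build a two-row frame shaped like a match sheet (made-up data)."""
    idx = pd.MultiIndex.from_tuples(
        [
            ("ENG-Premier League", 2021, game, away),
            ("ENG-Premier League", 2021, game, home),
        ],
        names=index_names,
    )
    return pd.DataFrame({"is_home": [False, True]}, index=idx)


sheets = [toy_sheet("g1", "A", "B"), toy_sheet("g2", "C", "D")]

# ignore_index=True replaces the MultiIndex with 0..n-1, losing its levels
dropped = pd.concat(sheets, ignore_index=True)

# reset_index() first turns the index levels into regular columns
kept = pd.concat([s.reset_index() for s in sheets], ignore_index=True)

print(list(dropped.columns))
print(list(kept.columns))
```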

In [9]:
print(all_matchsheets.head())

   is_home             venue  attendance capacity  \
0     True           Anfield       53333     None   
1    False           Anfield       53333     None   
2     True  Vitality Stadium       10714     None   
3    False  Vitality Stadium       10714     None   
4     True         Turf Moor       19784     None   

                                              roster fouls_committed  \
0  [{'active': True, 'starter': True, 'jersey': '...               9   
1  [{'active': True, 'starter': True, 'jersey': '...               9   
2  [{'active': True, 'starter': True, 'jersey': '...              10   
3  [{'active': True, 'starter': True, 'jersey': '...              19   
4  [{'active': True, 'starter': True, 'jersey': '...               6   

  yellow_cards red_cards offsides won_corners  ... blocked_shots  \
0            0         0        0          11  ...             3   
1            2         0        5           2  ...             2   
2            2         0        1           

In [10]:
columns_to_keep = [
    'home_team', 'away_team', 'is_home', 'season', 'venue', 'attendance',
    'total_shots', 'fouls_committed', 'yellow_cards', 'won_corners',
    'red_cards', 'saves', 'possession_pct', 'accurate_passes', 'total_passes',
    'accurate_crosses', 'total_crosses', 'longball_pct', 'blocked_shots',
    'effective_tackles', 'total_tackles', 'interceptions', 'total_clearance'
]

# Filter the dataframe to keep only the specified columns
filtered_matchsheets = all_matchsheets[columns_to_keep]
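
Indexing a dataframe with a list of column names raises a `KeyError` if any name is absent, which could happen if a match sheet lacks one of the statistics. A defensive sketch, using a hypothetical helper `select_available` that is not part of pandas or of this notebook, keeps the columns that exist and reports the rest:

```python
import pandas as pd


def select_available(df, wanted):
    """Return (filtered copy with the wanted columns that exist, missing names)."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present].copy(), missing


# Made-up frame: 'saves' is deliberately absent
toy = pd.DataFrame({"home_team": ["A"], "venue": ["X"], "roster": [[]]})
filtered, missing = select_available(toy, ["home_team", "venue", "saves"])

if missing:
    print(f"Columns not found and skipped: {missing}")
```

Returning a copy rather than a view means later edits to the filtered frame cannot trigger pandas' chained-assignment warning.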

In [11]:
print(filtered_matchsheets.head())

         home_team         away_team  is_home season             venue  \
0        Liverpool      Norwich City     True   2015           Anfield   
1        Liverpool      Norwich City    False   2015           Anfield   
2  AFC Bournemouth  Sheffield United     True   2015  Vitality Stadium   
3  AFC Bournemouth  Sheffield United    False   2015  Vitality Stadium   
4          Burnley       Southampton     True   2015         Turf Moor   

   attendance total_shots fouls_committed yellow_cards won_corners  ...  \
0       53333          15               9            0          11  ...   
1       53333          12               9            2           2  ...   
2       10714          13              10            2           3  ...   
3       10714           8              19            1           4  ...   
4       19784          10               6            0           2  ...   

  accurate_passes total_passes accurate_crosses total_crosses longball_pct  \
0             387         

In [None]:
# Rename the file to match the league and season before running this cell
filtered_matchsheets.to_csv('performanceData/EPL_2015_matchsheets.csv', index=False)
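
Because the output name has to be changed by hand for each league and season, one option is to derive it from those two values. The sketch below assumes a hypothetical `LEAGUE_CODES` mapping and hypothetical helpers `matchsheet_path` and `save_matchsheets`. Only the `EPL` code and the `performanceData/<code>_<season>_matchsheets.csv` pattern come from this notebook; everything else is illustrative. It also creates the folder when it is missing, since `to_csv` does not create directories. The demo writes to a temporary directory so it leaves no files behind.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical short codes; only "EPL" appears in the notebook's file name
LEAGUE_CODES = {"ENG-Premier League": "EPL"}


def matchsheet_path(league, season, root="performanceData"):
    """Build <root>/<code>_<season>_matchsheets.csv for a league and season."""
    # Fall back to the league name itself (spaces replaced) if no code is known
    code = LEAGUE_CODES.get(league, league.replace(" ", "_"))
    return Path(root) / f"{code}_{season}_matchsheets.csv"


def save_matchsheets(df, league, season, root="performanceData"):
    """Write df to the derived path, creating the folder if needed."""
    path = matchsheet_path(league, season, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Demo with made-up data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_matchsheets(
        pd.DataFrame({"venue": ["Anfield"]}),
        "ENG-Premier League",
        2021,
        root=Path(tmp) / "performanceData",
    )
    saved_name = saved.name
    saved_exists = saved.exists()
```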