### <div align="center">Introduction</div>
Even after the 2022/2023 soccer season was over, I had a burning desire to do a pet project and analyze that season's data. I kept putting it off and could not have imagined that by the beginning of August the data for the last season would simply not be freely available on the Internet. So I decided to collect the data myself, and in particular to learn web scraping and gather data for research purposes.

This part of the project is devoted to web scraping and data collection from __[understat.com](https://understat.com)__. I initially relied on this author's __[notebook](https://www.kaggle.com/code/slehkyi/web-scraping-football-statistics-per-game-data/notebook)__, for which I thank him very much, and supplemented it with other data from the site.

Of course, there are many interactive graphs and tables on these kinds of sites, but sometimes you want to look at the data and build something of your own. In particular, as an Atletico Madrid fan, I would like to see how successful this season has been for some of the players and for the team as a whole.


### Loading the libraries needed to work with the data:

In [1]:
import numpy as np 
import pandas as pd 
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import json
import plotly.express as px

### Some information about data

#### Let's review the data structure of understat.com.
On the main page of the site the user can choose one of 6 leagues (I will take only the top 5 leagues, without the RFPL):
* EPL 
* La liga
* Ligue 1
* Serie A
* Bundesliga
* RFPL

By clicking on each tab, we can select any season we are interested in from 2014/2015 through 2023/2024. In addition, we can go to a team page and to the detailed stats for each player of the team.

The author's notebook described the process of scraping these leagues. However, this didn't seem like enough to me, and I wanted to get player and team data as well.

The foundation of this approach is the URL. The base is clear: https://understat.com. Depending on the information we want to retrieve, the rest of the URL changes:
* for <span style="color:red">league</span> it is <b>league/</b> + <b>league_name</b> + <b>/season</b>
* for a <span style="color:red">team</span> it is <b>team/</b> + <b>team_name</b> + <b>/season</b>
* the <span style="color:red">player</span> is the most difficult case. Here everything is built from the player id, that is <b>player/</b> + <b>player_id</b>
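
The three URL patterns listed above can be sketched as a small helper. This is only an illustration: the name `build_understat_url` is hypothetical and is not used elsewhere in this notebook, and the explicit percent-encoding of spaces (for names such as 'La liga') is my assumption about how to keep the URL well-formed when building it by hand.

```python
from urllib.parse import quote

BASE_URL = "https://understat.com/"

def build_understat_url(item_type, identifier, season=None):
    """Build a page URL following the league / team / player patterns."""
    if item_type == "player":
        # Player pages are addressed by id only, with no season part
        return f"{BASE_URL}player/{identifier}"
    if item_type in ("league", "team"):
        if season is None:
            raise ValueError(f"season is required for item_type '{item_type}'")
        # Percent-encode spaces in names such as 'La liga' or 'Serie A'
        return f"{BASE_URL}{item_type}/{quote(str(identifier))}/{season}"
    raise ValueError(f"unknown item_type '{item_type}'")

print(build_understat_url("league", "La liga", "2022"))
print(build_understat_url("player", 502))
```

Raising an error for an unknown type is a design choice of this sketch; the function later in the notebook prints a message and returns `None` instead.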

By analogy, I found where the data I needed is located. By finding the "script" tags with BeautifulSoup, I locate the necessary part for each URL:
* for <span style="color:red">leagues</span> it is <b>'teamsData'</b> with information and advanced statistics for each played match of a team.
* for <span style="color:red">teams</span> it is <b>'playersData'</b> with brief statistics of each player of the team in the season
* for a <span style="color:red">player</span> several datasets were available, but I decided to take 2: <b>'shotsData'</b> with information on each shot of a certain player and <b>'matchesData'</b> with information on each match in which the player participated.
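
To make the extraction step more concrete, here is a minimal sketch of how a payload embedded in a script tag can be decoded. The `sample_script` string is made up: it is only shaped the way the parsing code in this notebook expects (a `JSON.parse('...')` call whose content is written with `\x..` escape sequences), and the helper name `parse_embedded_json` is hypothetical.

```python
import json

def parse_embedded_json(script_text):
    """Extract and decode the JSON payload wrapped in JSON.parse('...')."""
    # The payload sits between "('" and "')"
    start = script_text.index("('") + 2
    end = script_text.index("')")
    raw = script_text[start:end]
    # Turn escape sequences such as \x7B into real characters ('{')
    decoded = raw.encode("utf8").decode("unicode_escape")
    return json.loads(decoded)

# Made-up sample; the raw string keeps the backslashes literal
sample_script = r"var teamsData = JSON.parse('\x7B\x22id\x22\x3A\x2289\x22\x7D');"
parsed = parse_embedded_json(sample_script)
print(parsed)
```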

I decided not to write a separate function for each item and made everything in one function:

In [58]:
def get_understat_data(item_type, # one of 3 types: 'team', 'player' or 'league'
                       identifier = None, # league name, team name or player id, depending on item_type
                       season = None, # required only for item_type == 'league' or 'team'
                       player_data_type = None): # required only for item_type == 'player'. Can be 'shots' or 'matches'
    
    base_url = 'https://understat.com/'
    
    #URL and data_key generation depending on the item_type
    if item_type == 'team':
        data_key = 'playersData'
        url = f'{base_url}team/{identifier}/{season}' if identifier and season else None
        
    elif item_type == 'player':
        url = f'{base_url}player/{identifier}' if identifier else None
        
        if player_data_type == 'shots':
            data_key = 'shotsData'   
        elif player_data_type == 'matches':
            data_key = 'matchesData'
        else:
            print(f'Invalid player data type "{player_data_type}". Valid types are "shots" or "matches"')
            return None
        
    elif item_type == 'league':
        data_key = 'teamsData'
        url = f'{base_url}league/{identifier}/{season}' if identifier and season else None
    else:
        print(f'Invalid item type: "{item_type}". Valid types are "player", "team", "league".')
        return None
    
    if not url:
        print(f'Invalid combination of identifier and season for item type "{item_type}".')
        return None
    
    try:
        # Making a get request
        res = requests.get(url)
        
        # Check if an error has occurred
        res.raise_for_status()
        
        # Use lxml parser
        soup = BeautifulSoup(res.content, "lxml")
        
        # Find all <script> tags
        scripts = soup.find_all('script')
        
        # Find the required data by data_key and strip it 
        string_with_json_obj = next(el.text.strip() for el in scripts if data_key in el.text)
        ind_start = string_with_json_obj.index("('") + 2
        ind_end = string_with_json_obj.index("')")
        json_data = string_with_json_obj[ind_start:ind_end]
        
        # Decode the JSON data, converting any Unicode sequences into corresponding Unicode characters
        json_data = json_data.encode('utf8').decode('unicode_escape')
        
        # Load data in Python dict
        data = json.loads(json_data)
        
        return data
    
    except Exception as e:
        print(f"Error occurred while processing {item_type}: {e}")
        return None

### Work with leagues

This was one of the most difficult parts. The code turned out to be rather cumbersome because of the nested dictionary and the large number of iterations. There may be a simpler and faster way, but I decided to keep the code easy to follow.
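
The nested structure can be illustrated with a small sketch that flattens it into one row per match. The sample dictionary below is abbreviated and illustrative (a real response has many more fields per match), and `flatten_league_data` is a hypothetical helper name. The resulting list of dicts could then be passed to `pd.DataFrame(rows)`.

```python
def flatten_league_data(league_data):
    """Turn {team_id: {'id', 'title', 'history': [...]}} into a flat list of rows."""
    rows = []
    for team in league_data.values():
        for match in team["history"]:
            # Every row carries the team id and title plus the match statistics
            row = {"id": team["id"], "title": team["title"], **match}
            # 'ppda' and 'ppda_allowed' are small dicts with 'att' and 'def' keys
            for key in ("ppda", "ppda_allowed"):
                nested = row.pop(key, {})
                row[f"att_{key}"] = nested.get("att")
                row[f"def_{key}"] = nested.get("def")
            rows.append(row)
    return rows

# Abbreviated, illustrative sample with a single team and a single match
sample = {
    "166": {
        "id": "166",
        "title": "Montpellier",
        "history": [
            {
                "h_a": "a",
                "xG": 0.78,
                "ppda": {"att": 226, "def": 13},
                "ppda_allowed": {"att": 291, "def": 39},
            }
        ],
    }
}
rows = flatten_league_data(sample)
print(rows)
```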

In [59]:
# Lists for iterations
leagues = ['EPL', 'La liga', 'Bundesliga', 'Serie A', 'Ligue 1']
seasons = ['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022']
# We will need this to autocomplete the team names
team_titles_list = []

for league in leagues:
    for season in seasons:
        # Call the function for league
        data = get_understat_data(item_type = 'league', identifier = league, season = season)
        # Skip this league/season if the request or the parsing failed
        if data is None:
            continue
        # Collect one dict per match; the dataframe is built once at the end
        # (DataFrame.append was removed in pandas 2.0)
        rows = []

        # data is a nested dict: team_id -> {'id', 'title', 'history': [match, ...]}
        for team_id, team_data in data.items():
            id_value = team_data['id']
            title_value = team_data['title']

            # That's the list for the teams
            team_titles_list.append(title_value)

            # The very important innards that we'll be unpacking: one item per match
            for history_item in team_data['history']:
                # Add 'id' and 'title' to every row
                unpacked_item = {'id' : id_value, 'title' : title_value}
                unpacked_item.update(history_item)

                # Unpack ppda
                ppda_dict = unpacked_item.pop('ppda', {})
                unpacked_item['att_ppda'] = ppda_dict.get('att')
                unpacked_item['def_ppda'] = ppda_dict.get('def')

                # Unpack ppda_allowed
                ppda_dict = unpacked_item.pop('ppda_allowed', {})
                unpacked_item['att_ppda_allowed'] = ppda_dict.get('att')
                unpacked_item['def_ppda_allowed'] = ppda_dict.get('def')

                rows.append(unpacked_item)

        # Final dataframe for a particular league in a particular season
        season_league_df = pd.DataFrame(rows)
        
        '''Uncomment to load each dataframe in csv format ↓ '''
        #season_league_df.to_csv(f'{league}_{season}_data.csv', index = False)


In [62]:
# Let's look at the last iterated dataframe and the columns in it
display(season_league_df.sample(4), season_league_df.columns)

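| | id | title | h_a | xG | xGA | npxG | npxGA | deep | deep_allowed | scored | ... | date | wins | draws | loses | pts | npxGD | att_ppda | def_ppda | att_ppda_allowed | def_ppda_allowed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 274 | Ajaccio | a | 0.478923 | 1.59182 | 0.478923 | 0.831729 | 5 | 3 | 1 | ... | 2023-04-16 13:00:00 | 0 | 0 | 1 | 0 | -0.352806 | 194 | 21 | 305 | 31 |
| 0 | 174 | Toulouse | a | 1.73074 | 3.90797 | 0.970646 | 3.14787 | 8 | 13 | 1 | ... | 2022-12-29 20:00:00 | 0 | 0 | 1 | 0 | -2.177224 | 242 | 22 | 238 | 27 |
| 0 | 166 | Montpellier | a | 0.778496 | 3.3388 | 0.778496 | 1.81861 | 1 | 21 | 2 | ... | 2022-08-13 19:00:00 | 0 | 0 | 1 | 0 | -1.040114 | 226 | 13 | 291 | 39 |
| 0 | 168 | Nantes | a | 0.506816 | 3.70953 | 0.506816 | 2.94943 | 7 | 12 | 0 | ... | 2022-11-06 14:00:00 | 0 | 0 | 1 | 0 | -2.442614 | 272 | 18 | 218 | 28 |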


Index(['id', 'title', 'h_a', 'xG', 'xGA', 'npxG', 'npxGA', 'deep',
       'deep_allowed', 'scored', 'missed', 'xpts', 'result', 'date', 'wins',
       'draws', 'loses', 'pts', 'npxGD', 'att_ppda', 'def_ppda',
       'att_ppda_allowed', 'def_ppda_allowed'],
      dtype='object')

### Work with teams

This part is simpler. Iterating through the leagues already gave us a list of team names.

I decided to make a single file with the data for all teams and seasons instead of splitting the data by league and year again.

In [31]:
# Find unique team names for list
teams_name = list(np.unique(np.array(team_titles_list)))
# A list of dataframes, and that says it all.
dataframes_list = []

for team in teams_name:
    for season in seasons:
        # Call the function for team
        data = get_understat_data(item_type = 'team', identifier=team, season=season)
        current_df = pd.DataFrame(data)
        # Add the season to current_df
        current_df['season'] = season
        dataframes_list.append(current_df)
        
#Combine all dataframes to get a total with statistics of players in the team
teams_data = pd.concat(dataframes_list, ignore_index = True)

In [32]:
# Let's look at some rows for the dataframe
teams_data.sample(4)

| | id | player_name | games | time | goals | xG | assists | xA | shots | key_passes | yellow_cards | red_cards | position | team_title | npg | npxG | xGChain | xGBuildup | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31366 | 7614 | Lucas Perrin | 26 | 1746 | 0 | 0.8301517376676202 | 1 | 0.8441031081601977 | 8 | 8 | 4 | 0 | D S | Strasbourg | 0 | 0.8301517376676202 | 3.864677901379764 | 3.6756010903045535 | 2022 |
| 8074 | 5648 | Malang Sarr | 8 | 549 | 0 | 0.0206521991640329 | 0 | 0.3875728938728571 | 1 | 3 | 2 | 0 | D S | Chelsea | 0 | 0.0206521991640329 | 1.337369790300727 | 0.9393736999481916 | 2021 |
| 25830 | 7080 | Matheus Cunha | 25 | 849 | 2 | 3.923799779266119 | 0 | 0.8961888998746872 | 34 | 7 | 1 | 0 | F M S | RasenBallsport Leipzig | 2 | 3.923799779266119 | 6.534436486661434 | 2.590898845344782 | 2018 |
| 25896 | 8962 | Joscha Wosz | 2 | 30 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | S | RasenBallsport Leipzig | 0 | 0.0 | 0.0 | 0.0 | 2020 |


In [None]:
# Save teams dataframe to csv
# teams_data.to_csv('team_players_stat.csv', index = False)

### Work with players

Of course, similar to the team names, we could build a list of the player ids that interest us for the top 5 leagues while working with the teams, but I decided to download the data for all players (the maximum id I found was 11689; there may be more by now).
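
For comparison, the alternative mentioned above (taking the ids from the team data instead of trying every number) could look roughly like this. It is a sketch on plain dicts with a hypothetical helper name `collect_player_ids`; the sample rows reuse two players from the teams table shown earlier. With the `teams_data` dataframe itself, `teams_data['id'].unique()` would give the same set of ids.

```python
def collect_player_ids(player_rows):
    """Return the sorted unique player ids found in the team-level rows."""
    # int() handles ids stored either as strings or as numbers
    return sorted({int(row["id"]) for row in player_rows})

# The same player can appear in several seasons, hence the duplicate id
sample_rows = [
    {"id": "7614", "player_name": "Lucas Perrin", "season": "2022"},
    {"id": "5648", "player_name": "Malang Sarr", "season": "2021"},
    {"id": "7614", "player_name": "Lucas Perrin", "season": "2021"},
]
player_ids = collect_player_ids(sample_rows)
print(player_ids)
```

Such a list would skip ids that have no page at all, at the cost of missing players who never appeared in a top 5 league squad.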

In [35]:
# It was interesting how long it would take to complete such an extensive job
import time
start_time = time.time()
# A list of dataframes for matches and shots
dataframes_list_for_shots = []
dataframes_list_for_matches = []

for identifier in range(1,11689+1):
    
    #  Call the function for player and data_type == 'shots'
    data_1 = get_understat_data(item_type='player', identifier=identifier, player_data_type = 'shots')
    current_df_1 = pd.DataFrame(data_1)
    dataframes_list_for_shots.append(current_df_1)
    
    # Call the function for player and data_type == 'matches'
    data_2 = get_understat_data(item_type='player', identifier=identifier, player_data_type = 'matches')
    current_df_2 = pd.DataFrame(data_2)
    dataframes_list_for_matches.append(current_df_2)
    

# Final dataframes
player_shots_data = pd.concat(dataframes_list_for_shots, ignore_index = True)
player_matches_data = pd.concat(dataframes_list_for_matches, ignore_index = True)
print("--- %s seconds ---" % (time.time() - start_time))

Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5835
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5835
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5836
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5836
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5837
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5837
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5838
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5838
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5839
...

Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5872
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5873
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5873
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5874
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5874
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5875
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5875
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5876
Error occurred while processing player: 404 Client Error: Not Found for url: https://understat.com/player/5876
...

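Many 404 errors popped up for missing player pages, but they did not interrupt the code execution in any way. Each id appears twice in the log because the function is called once for shots and once for matches.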
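
Since both datasets come from the same player page, one way to halve the number of requests would be to download the page once and extract both keys from it. The sketch below works on an HTML string that has already been fetched. It uses a regular expression instead of BeautifulSoup only to stay dependency-free, the helper name `extract_embedded_json` is hypothetical, and `sample_html` is a made-up page, so the real layout of the script tags is an assumption.

```python
import json
import re

def extract_embedded_json(html, data_keys):
    """Return {data_key: parsed JSON} for every key found in the <script> blocks."""
    found = {}
    # Body of each <script>...</script> block
    scripts = re.findall(r"<script[^>]*>(.*?)</script>", html, flags=re.DOTALL)
    for key in data_keys:
        for script in scripts:
            if key not in script:
                continue
            # Take the JSON.parse('...') payload that follows this particular key
            start = script.index("('", script.index(key)) + 2
            end = script.index("')", start)
            payload = script[start:end].encode("utf8").decode("unicode_escape")
            found[key] = json.loads(payload)
            break
    return found

# Made-up page; the raw string keeps the \x.. sequences literal
sample_html = r"""
<html><body>
<script>
var shotsData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221\x22\x7D\x5D'),
    matchesData = JSON.parse('\x5B\x7B\x22goals\x22\x3A\x220\x22\x7D\x5D');
</script>
</body></html>
"""
result = extract_embedded_json(sample_html, ["shotsData", "matchesData"])
print(result)
```

Keys that are not present are simply left out of the result, so a caller can check which datasets a page actually contained.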

In [66]:
# Let's look at the players_shots dataframe and the columns in it
display(player_shots_data.sample(4), player_shots_data.columns)

| | id | minute | result | X | Y | xG | player | h_a | player_id | situation | season | shotType | match_id | h_team | a_team | h_goals | a_goals | date | player_assisted | lastAction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50232 | 399110 | 50 | BlockedShot | 0.955 | 0.4309999847412109 | 0.1428630053997039 | Olivier Giroud | a | 502 | OpenPlay | 2020 | RightFoot | 14617 | Fulham | Chelsea | 0 | 1 | 2021-01-16 17:30:00 | Mason Mount | Pass |
| 161322 | 121386 | 51 | MissedShots | 0.905 | 0.6780000305175782 | 0.091026596724987 | Marcos Alonso | h | 1621 | OpenPlay | 2016 | LeftFoot | 3493 | Chelsea | Middlesbrough | 3 | 0 | 2017-05-08 20:00:00 | Pedro | Pass |
| 252906 | 262980 | 11 | Goal | 0.855 | 0.4309999847412109 | 0.1244440004229545 | Joãozinho | h | 2928 | OpenPlay | 2018 | LeftFoot | 9078 | Dinamo Moscow | Ural | 4 | 0 | 2018-12-09 11:00:00 | Evgeni Lutsenko | Pass |
| 30493 | 134846 | 22 | MissedShots | 0.7559999847412109 | 0.655 | 0.02278389967978 | Marcel Schmelzer | h | 313 | OpenPlay | 2016 | LeftFoot | 3860 | Borussia Dortmund | Hoffenheim | 2 | 1 | 2017-05-06 14:30:00 | | |


Index(['id', 'minute', 'result', 'X', 'Y', 'xG', 'player', 'h_a', 'player_id',
       'situation', 'season', 'shotType', 'match_id', 'h_team', 'a_team',
       'h_goals', 'a_goals', 'date', 'player_assisted', 'lastAction'],
      dtype='object')

In [44]:
# Save it to csv
#player_shots_data.to_csv('shots_statistics.csv', index=False)

In [67]:
# Let's look at the players_matches dataframe and the columns in it
display(player_matches_data.sample(4), player_matches_data.columns)

| | goals | shots | xG | time | position | h_team | a_team | h_goals | a_goals | date | id | season | roster_id | xA | assists | key_passes | npg | npxG | xGChain | xGBuildup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201925 | 0 | 1 | 0.0409737005829811 | 36 | AMC | Hamburger SV | Paderborn | 0 | 3 | 2014-08-30 | 5459 | 2014 | 27927 | 0.0 | 0 | 0 | 0 | 0.0409737005829811 | 0.0 | 0.0 |
| 307521 | 0 | 2 | 0.0853141024708747 | 90 | FWR | Wolverhampton Wanderers | Watford | 0 | 2 | 2018-10-20 | 9278 | 2018 | 276293 | 0.0142571004107594 | 0 | 1 | 0 | 0.0853141024708747 | 0.0995713025331497 | 0.0142571004107594 |
| 189606 | 0 | 1 | 0.0412378013134002 | 90 | MC | Verona | Empoli | 2 | 1 | 2015-05-17 | 4804 | 2014 | 55413 | 0.0496909990906715 | 0 | 1 | 0 | 0.0412378013134002 | 0.3621990084648132 | 0.3125079870223999 |
| 343293 | 0 | 1 | 0.0729828998446464 | 24 | Sub | Union Berlin | Mainz 05 | 1 | 1 | 2020-05-27 | 12651 | 2019 | 398821 | 0.0 | 0 | 0 | 0 | 0.0729828998446464 | 0.0729828998446464 | 0.0 |


Index(['goals', 'shots', 'xG', 'time', 'position', 'h_team', 'a_team',
       'h_goals', 'a_goals', 'date', 'id', 'season', 'roster_id', 'xA',
       'assists', 'key_passes', 'npg', 'npxG', 'xGChain', 'xGBuildup'],
      dtype='object')

In [68]:
# Save it to csv
# player_matches_data.to_csv('matches_statistics.csv', index=False)