# Olympics Basketball - Web Scraping

In this notebook I retrieve data on the Men's Basketball tournament at the 2020 Summer Olympics from the FIBA website. I build a dataset with multiple columns describing the performance of every player in each game of the tournament and store it in a CSV file.

## 1. Load libraries

The only libraries required to run this notebook are:

- **time**: to delay the HTTP requests (it is important to be respectful when web scraping, or else you might overload the server that provides the data).
- **requests**: to send HTTP requests to the FIBA website.
- **bs4 (BeautifulSoup)**: to parse HTML.
- **pandas**: to make working with the data easier.


In [1]:
import time

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
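
The delay mentioned above can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `fetch_politely`, its `fetch` argument, and its `sleep` parameter are hypothetical names chosen for illustration, and the three-second default is only an example value.

```python
import time


def fetch_politely(urls, fetch, delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    `fetch` is any callable that takes a URL and returns the page text,
    for example `lambda u: requests.get(u).text` (an assumption).
    """
    pages = {}
    for n, url in enumerate(urls):
        if n > 0:
            sleep(delay)  # no pause is needed before the first request
        pages[url] = fetch(url)
    return pages
```

Passing `fetch` and `sleep` as arguments keeps the helper easy to try out without touching the network.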

## 2. Prepare for web scraping

The data comes from **https://www.fiba.basketball**, so we store this address in **BASE_DATA**:

In [2]:
BASE_DATA = "https://www.fiba.basketball"

The following function eases the web scraping task. Given the URL of a game of the Men's Basketball tournament at the 2020 Summer Olympics, it returns a pandas DataFrame with the statistics of all the players who took part in that game:

In [3]:
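def get_boxscore(game_url):
    """Return a DataFrame with one row per player for the game at `game_url`."""
    # Game page: holds the final score and the location of the box score
    box_sc_p = requests.get(game_url).text
    box_sc = bs(box_sc_p, "html.parser")

    scores = [s for s in box_sc.find("div", class_="final-score").text.split("\n") if s]
    date = game_url.split("/")[-2]  # second-to-last URL segment, e.g. "2507"

    # The box score itself is served from a separate AJAX URL
    BASE_DATA = "https://www.fiba.basketball"
    data_dir = box_sc.find("li", {"data-tab-content": "boxscore"}).get("data-ajax-url")
    boxsc_p = requests.get(BASE_DATA + data_dir).text
    boxsc = bs(boxsc_p, "html.parser")

    # One <tbody> per team; column names are read from the first <thead>
    loc_box, aw_box = boxsc.find_all("tbody")
    colnames = [d.text for d in boxsc.find("thead").find_all("th")]
    names = game_url.split("/")[-1].split("-")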
    
    all_players = []

    # "Czech-Republic" is split in two by the hyphen: merge it back into a
    # single entry so that names always holds exactly the two team names
    if "Czech" in names:
        k = names.index("Czech")
        names[k:k + 2] = ["Czech Republic"]

    for i, team in enumerate([loc_box, aw_box]):
        for player in team.find_all("tr"):
            example = [d.text.strip().split("\n")[0] for d in player.find_all("td") if d != "\n"]
            if len(example) < len(colnames):
                example = example[:-1]
                example = example + ["0:0", "0"] + ["0/0"] * 4 + ["0"] * 10
            player_data = {"country": names[i], "vs": names[i-1],
                           "team_score": scores[i], "vs_score": scores[i-1],
                           "date": date}
            for a,b in zip(colnames, example):
                player_data[a] = b
            
            all_players.append(player_data)
    
    return pd.DataFrame(all_players)
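
In `get_boxscore`, the team names come from the last segment of the game URL, where a hyphen separates the two teams but also appears inside "Czech-Republic". The following is a minimal sketch of that step in isolation. `split_team_names` and its `multiword` parameter are hypothetical names introduced here, and the sketch assumes that every hyphenated team name is listed in `multiword`.

```python
def split_team_names(slug, multiword=("Czech-Republic",)):
    """Split a URL slug such as 'Iran-Czech-Republic' into (first, second).

    Names listed in `multiword` get their hyphen replaced by a space
    before the slug is split, so they survive as a single name.
    """
    for name in multiword:
        slug = slug.replace(name, name.replace("-", " "))
    first, second = slug.split("-")  # raises ValueError on any other extra hyphen
    return first, second
```

Because the unpacking raises `ValueError` for an unexpected slug, a game with an unlisted multi-word team would be reported rather than silently mislabelled.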

## 3. Web scraping

First, we extract the URLs of all <a href="https://www.fiba.basketball/olympics/men/2020/games">this tournament's games</a>:

In [4]:
page = requests.get(BASE_DATA + "/olympics/men/2020/games").text
soup = bs(page, "html.parser")
link_ls = [a.get("href") 
           for a in soup.find_all("a") 
           if a.get("href") and a.get("href").startswith("/olympics/men/2020/game/")]
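
If the games page links to the same game more than once (an assumption that has not been checked here), the list above would contain duplicates and the same game would be scraped twice. The following sketch filters and deduplicates a list of `href` values while keeping their order; `unique_game_links` is a hypothetical helper name.

```python
def unique_game_links(hrefs, prefix="/olympics/men/2020/game/"):
    """Keep hrefs that start with `prefix`, dropping repeats but not order."""
    wanted = (h for h in hrefs if h and h.startswith(prefix))
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(wanted))
```

It could be applied as `unique_game_links(a.get("href") for a in soup.find_all("a"))`.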

For each link, we request the corresponding game page from the FIBA website and parse it, storing the returned ```pandas.DataFrame``` objects in the **all_tables** list:

In [5]:
all_tables = []

for a in link_ls:
    try:
        all_tables.append(get_boxscore(BASE_DATA + a))
        print(a)
    except (ValueError, TypeError, AttributeError):
        print("Problem with", BASE_DATA + a)
    # Pause after every request, including the ones that failed
    time.sleep(3)

/olympics/men/2020/game/2507/Iran-Czech-Republic
/olympics/men/2020/game/2507/Germany-Italy
/olympics/men/2020/game/2507/Australia-Nigeria
/olympics/men/2020/game/2507/France-USA
/olympics/men/2020/game/2607/Argentina-Slovenia
Problem with https://www.fiba.basketball/olympics/men/2020/game/2607/Japan-Spain
/olympics/men/2020/game/2807/Nigeria-Germany
/olympics/men/2020/game/2807/USA-Iran
/olympics/men/2020/game/2807/Italy-Australia
/olympics/men/2020/game/2807/Czech-Republic-France
Problem with https://www.fiba.basketball/olympics/men/2020/game/2907/Slovenia-Japan
Problem with https://www.fiba.basketball/olympics/men/2020/game/2907/Spain-Argentina
/olympics/men/2020/game/3107/Iran-France
/olympics/men/2020/game/3107/Italy-Nigeria
/olympics/men/2020/game/3107/Australia-Germany
/olympics/men/2020/game/3107/USA-Czech-Republic
/olympics/men/2020/game/0108/Argentina-Japan
/olympics/men/2020/game/0108/Spain-Slovenia
Problem with https://www.fiba.basketball/olympics/men/2020/game/0308/Sloveni
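
The output above shows several games whose box score could not be parsed. One way to keep track of them for a later retry is sketched below. `scrape_all`, `scrape_one`, and `pause` are hypothetical names, and the caught exception types simply mirror the loop above.

```python
def scrape_all(links, scrape_one, pause=lambda: None):
    """Apply scrape_one to every link; return (results, failed_links)."""
    results, failed = [], []
    for link in links:
        try:
            results.append(scrape_one(link))
        except (ValueError, TypeError, AttributeError):
            failed.append(link)  # remember the link instead of only printing it
        pause()  # called after every attempt, successful or not
    return results, failed
```

In this notebook it could be called as `scrape_all(link_ls, lambda a: get_boxscore(BASE_DATA + a), pause=lambda: time.sleep(3))`.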

## 4. Storing the data

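To allow future use, the data must be stored in a file. First, all tables are merged into a single ```pandas.DataFrame``` (in the cell below that line is commented out, and a previously saved copy of the CSV is loaded instead):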

In [6]:
#df = pd.concat(all_tables)
import pandas as pd

df = pd.read_csv("basketball_olympic_players_game_stats.csv")
df.head()

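| Unnamed: 0 | country | vs | team_score | vs_score | date | # | Players | Min | Pts | FG | ... | OREB | DREB | REB | AST | PF | TO | STL | BLK | +/- | EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Iran | Czech Republic | 78 | 84 | 2507 | 3 | Mohammadsina Vahedi | 02:33 | 0 | 0/2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | -3 | -3 |
| 1 | Iran | Czech Republic | 78 | 84 | 2507 | 5 | Pujan Jalalpoor | 09:04 | 3 | 1/3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | -6 | 0 |
| 2 | Iran | Czech Republic | 78 | 84 | 2507 | 7 | Mohammad Hassanzadeh | 0:0 | 0 | 0/0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Iran | Czech Republic | 78 | 84 | 2507 | 8 | Saeid Davarpanah | 0:0 | 0 | 0/0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Iran | Czech Republic | 78 | 84 | 2507 | 13 | Mohammad Jamshidijafarabadi | 28:36 | 16 | 7/11 | ... | 0 | 1 | 1 | 7 | 1 | 7 | 1 | 0 | -5 | 13 |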
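
Merging the per-game tables with `pd.concat`, as in the commented-out line of the cell above, can be tried on toy data. This is only a sketch: the two small frames and their values are made up, and `ignore_index=True` is an optional choice that the original line does not use.

```python
import pandas as pd

# Two tiny stand-in tables; the values are invented for illustration only
game_a = pd.DataFrame({"country": ["A", "A"], "Pts": ["10", "4"]})
game_b = pd.DataFrame({"country": ["B"], "Pts": ["7"]})

# ignore_index=True renumbers the rows from 0 instead of keeping each
# game's own 0, 1, ... labels
merged = pd.concat([game_a, game_b], ignore_index=True)
```

Without `ignore_index=True` the merged frame keeps each game's own row labels, so the same label appears once per game.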


The dataset is stored in the file **basketball_olympic_players_game_stats.csv**. It is still a bit untidy, so it will be cleaned in **basketball_olympic_players_game_stats.csv**.

In [7]:
df.to_csv("basketball_olympic_players_game_stats.csv", index=False)
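
An `Unnamed: 0` column, like the one in the preview above, is what pandas produces when a CSV that was written together with its index is read back with the default options. The following sketch reproduces this on made-up data using an in-memory buffer; `roundtrip_columns` is a hypothetical helper name.

```python
import io

import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "Pts": [10, 7]})  # invented values


def roundtrip_columns(frame, **to_csv_kwargs):
    """Write `frame` to an in-memory CSV and return the columns read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


with_index = roundtrip_columns(toy)                   # index saved as an unnamed column
without_index = roundtrip_columns(toy, index=False)   # index left out of the file
```

Writing without the index, as the cell above does, avoids accumulating such a column each time the file is saved and reloaded.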

Finally, the dataset columns are annotated (description and data type) in **basketball_olympic_players_game_stats.json**.