# Olympic Athletes (1896 - 2022) - Data Scaper
___
## About this file
### PART I : Scrapping Biographical Data for Olympic Athletes
This file is designed to scrape biographical data for all Olympic athletes from 1896 to 2022. The athlete data is obtained using a list of athlete URLs generated by the `url_collector.ipynb` notebook (see that notebook for more details and code on how to collect the URLs).

The scraped data will be organized into a dataframe, with each row corresponding to an individual athlete who participated in a specific Olympic event (athlete-events). The dataframe will include the following columns:
- `ID` : Unique number for each athlete, used to access their webpage on Olympedia at ```https://www.olympedia.org/athletes/[athlete-id]```.
- `Name` : Athlete's name
- `Gender` : Male or Female
- `Born` : Birthdate
- `Died` : Date of death (if applicable)
- `Height` : In centimeters
- `Weight` : In kilograms
- `Team` : Team name
- `NOC` : National Olympic Committee 
- `Games` : Year and season
- `Sport` : Sport
- `Event` : Event
- `Medal` : Gold, Silver, Bronze, or NA

### PART II: Retrieving Host Cities for Olympic Games
In addition to scraping athlete data, this script also retrieves information about the host cities for both Summer and Winter Olympic Games from 1896 to 2022. The dataset will include the following information for each Olympic Game:
- `Year`: The year in which the Olympic Games took place.
- `Season`: The season of the Olympic Games (Summer or Winter).
- `Game`: A concatenated field combining year and season for easy reference.
- `Host City`: The city that hosted the Olympic Games in that year.


### PART III: Retrieving Country Names and Their Corresponding NOCs

___
### Web Scraping Techniques:
To speed up the data scraping process (given more than 155,600 URLs to request), we are going to use the threading concurrency method by using the `concurrent.futures` library. Threading concurrency is a programming concept that allows handling and executing multiple tasks concurrently, making progress simultaneously.
___
### Import necessary libraries

In [1]:
from time import sleep, perf_counter
import concurrent.futures
from bs4 import BeautifulSoup 
import requests
import json
import gzip
import pandas as pd

# PART I - Scrapping Biographical Data for Olympic Athletes

### Step 1: Extract blocks of information from each athlete's webpage

In [21]:
all_content = []

def get_content(url): 
    """
    This function retrieves blocks of information from an athlete's webpage given the input url. 
    It stores the scraped block of data into a string, then appends it to the list "all_content".
    
    :param [url]: url of the athlete's webpage, with the following format: https://www.olympedia.org/athletes/[athlete-id].
    :type [url]: string
    
    :return : None - This function append the scraped data into a string and then into the list "all_content".
    :rtype: None 
    """

    response = requests.get(url)
    if response.status_code == 200:
        # Extract blocks of contents
        page = bs(response.content, 'lxml')
        name = page.find("h1")
        bio_summary = page.find(attrs = {"class":"biodata"})
        game_result = page.find('table', attrs = {"class":"table"})
        content = "<div>" + str(name) + "\n" + f"<h4>{url}</h4>" + str(bio_summary) + "\n" + str(game_result) + "</div>"
        all_content.append(content)
        sleep(1)
    else:
        # Skip webpage if it does not exist
        print(f"Error fetching {url}")

Access the JSON file containing the list of athletes URLs `athletes_urls.json`.

In [70]:
with open('raw_data/athletes_urls.json', 'r') as file:
    athletes_urls = json.load(file)
    

Perform information extraction using the threading concurrency approach.

In [None]:
start = perf_counter()

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    executor.map(get_content, athletes_urls)

end = perf_counter()
print(f"Elapsed Time: {end-start:.3f} seconds")

# Elapsed Time: 7740.37 seconds (2,15 hours)

### Save into a compressed JSON file

In [None]:
with gzip.open("raw_data/athletes_content.json.gz", 'wb') as file:
    file.write(json.dumps(all_content).encode('utf-8'))

___
### Step 2: Extract individual data

In [18]:
def extract_data(athlete_content):
    """
    This function retrieves specific individual data from the block of data scraped for each athlete.
    It stores the data into a dictionary "athlete_data" for each athlete-event record,
    and appends each record to a list "athlete_stats".
    
    :param [athlete_content]: Blocks of data scraped from each athlete's webpage
    :type [athlete_content]: str
    
    :return: Records of all events in which the athlete participated (biographic data and game results)
    :rtype: list of dictionaries
    """

    
    # Extract biographical data
    athlete_stats = []
    athlete_info = BeautifulSoup(athlete_content, 'lxml')
    
    id = int(athlete_info.find("h4").text.split("/")[-1])
    name = athlete_info.find("h1").text.strip()
    gender, born, died, weight, height, noc = None, None, None, None, None, None
    
    bio_summary = athlete_info.find(attrs = {"class":"biodata"}).find_all('tr')
    for row in bio_summary:
        header = row.find("th").text
        data = row.find("td").text
        
        if header == 'Sex':
            gender = data
        elif header == 'Born':
            born = data.split('in')[0].strip()
        elif header == 'Died':
            died = data.split('in')[0].strip()
        elif header == 'Measurements':
            measurements = data.split(' / ')
            if len(measurements) == 2: 
                height, weight = measurements 
            elif "kg" in data:
                weight = data
            elif "cm" in data:
                height = data
        elif header == 'NOC':
            noc = data.strip()
           
    # Extract all games played by the athlete
    table_result = athlete_info.find('tbody').find_all('tr')    
    game, sport, team = None, None, None
    for row in table_result:
        
        if row.has_attr('class'):
            game, sport, team = [td.text.strip() for td in row.find_all('td')[:3]]
        else:
            # Store in the dictionary
            athlete_data = {
                'id' : id,
                'name' : name,
                'gender' : gender,
                'born' : born,
                'died' : died,
                'height' : height,
                'weight' : weight,
                'noc' : noc,
                'game' : game,
                'team' : team,
                'sport' : sport,
                'event' : f"{sport}, {row.find_all('td')[1].text.strip('n')}".replace('\n', ''),
                'medal' : row.find_all('td')[4].text
            }
            athlete_stats.append(athlete_data)
        
    return athlete_stats

Access the compressed JSON file containing the scraped block of data.

In [3]:
with gzip.open("raw_data/athletes_content.json.gz", 'rb') as file:
    decompressed_data = file.read().decode('utf-8')

athletes_content = json.loads(decompressed_data)

Use ThreadPoolExecutor to parallelize the extraction 

In [None]:
start = perf_counter()

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    all_athletes_data = list(executor.map(extract_data, athletes_content))
    
end = perf_counter()
print(f"Elapsed Time: {end-start:.02f} seconds")

# Elapsed Time : 6367.20 sec

In [None]:
# Flatten lists of lists

flatten_athletes_data = []
for athlete_stats in all_athletes_data:
    if len(athlete_stats) > 1:
        
        for athlete_data in athlete_stats:
            flatten_athletes_data.append(athlete_data)
            
        flatten_athletes_data.append(athlete_data)

### Save into a csv file

In [None]:
athletes_df = pd.DataFrame(flatten_athletes_data)
athletes_df.to_csv("data/athletes.csv")

### Here is a preview of our dataset

In [2]:
pd.read_csv("data/athletes.csv")

Unnamed: 0.1,Unnamed: 0,id,name,gender,born,died,height,weight,noc,game,team,sport,event,medal
0,0,131892,Meryem Erdoğan,Female,24 April 1990,,172 cm,55 kg,Türkiye,2016 Summer Olympics,TUR,Athletics,"Athletics, Marathon, Women(Olympic)",
1,1,131892,Meryem Erdoğan,Female,24 April 1990,,172 cm,55 kg,Türkiye,2020 Summer Olympics,TUR,Athletics,"Athletics, Marathon, Women(Olympic)",
2,2,131892,Meryem Erdoğan,Female,24 April 1990,,172 cm,55 kg,Türkiye,2020 Summer Olympics,TUR,Athletics,"Athletics, Marathon, Women(Olympic)",
3,3,4300,Maurice Maina,Male,1 January 1963,,158 cm,47 kg,Kenya,1988 Summer Olympics,KEN,Boxing,"Boxing, Light-Flyweight, Men(Olympic)",
4,4,4300,Maurice Maina,Male,1 January 1963,,158 cm,47 kg,Kenya,1988 Summer Olympics,KEN,Boxing,"Boxing, Light-Flyweight, Men(Olympic)",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
476343,476343,20989,Caitlin Bilodeaux-Banos,Female,17 March 1965,,170 cm,64 kg,United States,1988 Summer Olympics,USA,Fencing,"Fencing, Foil, Individual, Women(Olympic)",
476344,476344,20989,Caitlin Bilodeaux-Banos,Female,17 March 1965,,170 cm,64 kg,United States,1988 Summer Olympics,USA,Fencing,"Fencing, Foil, Team, Women(Olympic)",
476345,476345,20989,Caitlin Bilodeaux-Banos,Female,17 March 1965,,170 cm,64 kg,United States,1992 Summer Olympics,USA,Fencing,"Fencing, Foil, Individual, Women(Olympic)",
476346,476346,20989,Caitlin Bilodeaux-Banos,Female,17 March 1965,,170 cm,64 kg,United States,1992 Summer Olympics,USA,Fencing,"Fencing, Foil, Team, Women(Olympic)",


# PART II: Retrieving Host Cities for Olympic Games

In [66]:
def get_host_cities(all_cities, season_table_element, season):
    """
    This function retrieves host cities data for Olympic Games from the Olympedia.org website 
    and stores it in a dictionary named "game". The data is then appended to the input list "all_cities".
    
    :param [all_cities]: A list to store host city data.
    :param [table_element]: A BeautifulSoup Tag object representing a <table> element containing host city data
    :param [season]: Season of the Olympic Games: "Summer" or "Winter"
    :type [all_cities]: List
    :type [table_element]: BeautifulSoup.Tag
    :type [season]: String
    
    :return : None - This function appends the scraped data to the input "all_cities" list
    :rtype : None
    """
      
    for row in season_table_element.find_all("tr")[1:]:
        year = row.find_all('td')[1].text
        
        game = {
            "year" : year,
            "season" : season,
            "game" : f"{year} {season} Olympics",
            "host_city" : row.find_all("td")[2].text
        }
        all_cities.append(game)

In [67]:
base_url = "https://www.olympedia.org/editions"


host_cities = []
resp = requests.get(base_url)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, 'html.parser')
    
    # Locate the table
    summer_table = soup.find_all("table")[0]
    winter_table = soup.find_all("table")[1]
    
    # Retrieve data
    get_host_cities(host_cities, summer_table, "Summer")
    get_host_cities(host_cities, winter_table, "Winter")
  
else:
    print(f"Error fetching {base_url}")

### Save into a csv file

In [68]:
host_cities_df = pd.DataFrame(host_cities)
host_cities_df.to_csv("data/host_cities.csv")

### Preview of the `host_cities` dataset

In [69]:
host_cities_df

Unnamed: 0,year,season,game,host_city
0,1896,Summer,1896 Summer Olympics,Athina
1,1900,Summer,1900 Summer Olympics,Paris
2,1904,Summer,1904 Summer Olympics,St. Louis
3,1908,Summer,1908 Summer Olympics,London
4,1912,Summer,1912 Summer Olympics,Stockholm
...,...,...,...,...
57,2010,Winter,2010 Winter Olympics,Vancouver
58,2014,Winter,2014 Winter Olympics,Sochi
59,2018,Winter,2018 Winter Olympics,PyeongChang
60,2022,Winter,2022 Winter Olympics,Beijing


___
# PART III: Retrieving Country Names and Their Corresponding NOCs

In this section, we will scrape all the names and NOCs of the countries that **Competed in Modern Olympics** (referred to as `glyphicon glyphicon-ok` in the Olympedia website). If the country did not compete in Moderin Olympics, the class of the tag will be named as `glyphicon glyphicon-remove`.

In [17]:
all_nocs = []
url = "https://www.olympedia.org/countries"

response = requests.get(url)
if response.status_code == 200:
    
    # Page Successfully Fetched - Get NOCs and Country Names
    soup = BeautifulSoup(response.content, 'lxml')
    table = soup.find('tbody').find_all("tr")
    for row in table:
        if row.find("span", class_ = "glyphicon glyphicon-ok"):
            noc_country = {
                "noc" : row.find_all("a")[0].text,
                "country" : row.find_all("a")[1].text
            }
            all_nocs.append(noc_country)
    
    # Save scraped data into CSV file
    noc_df = pd.DataFrame(all_nocs)
    noc_df.to_csv("data/noc_countries.csv", index = False)
    
else:
    print(f"Error fetching {url} - {response.status_code}")

In [19]:
# Here is the preview of the dataset
pd.read_csv("data/noc_countries.csv").head(5)

Unnamed: 0,noc,country
0,AFG,Afghanistan
1,ALB,Albania
2,ALG,Algeria
3,ASA,American Samoa
4,AND,Andorra


___
# Project Update - 2023-08-23


### Goal
My goal in this web scraping project is to create datasets that are compatible with multiple languages and tools. Initially, the format I provided was limited to Python. In response, I have made several updates to ensure broader compatibility.


### Updates
#### 1. Handling NULL Values
Previously, NULL values in the CSV files were represented by empty values separated by commas (e.g., `,,,`). This caused issues when importing data into PostgreSQL, as the system couldn't recognize the number of columns to import. To address this, I've updated the datasets so that 'NULL' is explicitly written in place of empty spaces, making them PostgreSQL-compatible.

#### 2. Index Column
The index column was not intentionally included when saving the datasets to CSV files from pandas. To enhance dataset organization and maintain consistency, I have removed this index column.

#### 3. Column Name Swap
I've swapped the column names "noc" and "team" to align with the common abbreviation for National Olympic Committee (NOC), which is typically three letters. This change enhances the clarity and consistency of the dataset.

These updates ensure that the datasets are more versatile and can be seamlessly imported and used across various languages and tools.

### Updates on `athletes.csv `

In [38]:
athletes_df = pd.read_csv("data/athletes.csv")

athletes_df.drop('index', axis = 1, inplace = True)                             # remove "index" column
athletes_df.rename(columns = {"noc" : "team", "team" : "noc"}, inplace = True)  # rename columns
athletes_df.fillna("NULL", inplace = True)                                      # handle missing values

# Save the new version (without including the index column)
athletes_df.to_csv("data/athletes.csv", index = False)

### Updates on `host_cities.csv `

In [37]:
# remove "index" column
cities_df = pd.read_csv("data/host_cities.csv").drop("index", axis=1)
cities_df.to_csv("data/host_cities.csv", index = False)

# Project Update - 2023-08-30

While conducting my exploratory data analysis on the Olympics, I noticed that the total number of participants from each country does not align with the data recorded by the Olympedia website. To address this issue, I need to scrape additional data regarding athlete roles. According to the website, the available roles for Olympic athletes are as follows. This information will enable me to identify those participants who officially competed in the Olympic Games as athletes:

- `Competed in Olympic Games`
- `Non-starter`
- `Coach`
- `Referee`
- `Other`
- etc

In [25]:
# Access the compressed JSON file containing the scraped block of data.

with gzip.open("raw_data/athletes_content.json.gz", 'rb') as file:
    decompressed_data = file.read().decode('utf-8')

athletes_content = json.loads(decompressed_data)

In [26]:
def get_roles(athlete_data):
    
    athlete_info = BeautifulSoup(athlete_data, 'lxml')
    has_table = athlete_info.find(attrs = {"class":"biodata"})
    
    if has_table:
        bio_summary = has_table.find_all('tr')
        for row in bio_summary:
            if row.find("th").text == 'Roles':
                
                athlete_roles = {
                    'id' : int(athlete_info.find("h4").text.split("/")[-1]),
                    'name' : athlete_info.find("h1").text.strip(),
                    'roles' : row.find("td").text
                }
                return athlete_roles
    

Perform data scraping using the method above.

In [27]:
start = perf_counter()

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    all_athletes_roles = list(executor.map(get_roles, athletes_content))
    
end = perf_counter()
print(f"Elapsed Time: {end-start:.02f} seconds")

Elapsed Time: 1714.15 seconds


In [44]:
# Flatten the list
filtered_athletes_roles = [data for data in all_athletes_roles if data is not None]

# Save to CSV file 
athletes_roles_df = pd.DataFrame(filtered_athletes_roles)
athletes_roles_df.to_csv("data/athletes_roles.csv", index = False)
athletes_roles_df

Unnamed: 0,id,name,roles
0,131892,Meryem Erdoğan,Competed in Olympic Games
1,4300,Maurice Maina,Competed in Olympic Games
2,60239,Stanislav Tůma,Competed in Olympic Games
3,129369,Eunice Kirwa,Competed in Olympic Games
4,142670,Sinem Kurtbay,Competed in Olympic Games
...,...,...,...
155653,122196,Aleksa Šaponjić,Competed in Olympic Games
155654,52168,Zhang Yousheng,Competed in Olympic Games
155655,18974,Werner Delmes,Competed in Olympic Games • Coach
155656,126253,Tim Payne,Competed in Olympic Games


In [88]:
# Here are the 4 websites that went wrong
for index, value in enumerate(all_athletes_roles):

    if value is None:
        soup = BeautifulSoup(athletes_content[index])
        print(f"{soup.find('h1').text} : {soup.find('h4').text}")

The page you were looking for doesn't exist. :          https://www.olympedia.org/athletes/892593
We're sorry, but something went wrong. :          https://www.olympedia.org/athletes/115872
We're sorry, but something went wrong. :          https://www.olympedia.org/athletes/129723
We're sorry, but something went wrong. :          https://www.olympedia.org/athletes/133835


# Project Update - 2023-09-02
Another modification needs to be made to the `athletes.csv` dataset. While conducting the exploratory data analysis, I have noticed that there are NULL values in the `game` column, rendering the records invalid. Each athlete's record should indicate the year and Olympic games in which they participated, as well as the specific game and event in which they competed. Therefore, after conducting further research on Olympedia, I will fill the NULL values in the 'game' column with the most recent non-null value above them.

In [13]:
df = pd.read_csv("data/athletes.csv")
df.isna().sum()

id             0
name           0
gender         0
born        9316
died      359755
height    127511
weight    136329
team           0
game        3138
noc            0
sport          0
event          0
medal     410322
dtype: int64

In [15]:
# Replace null values with the most above non-null value using method = 'ffill'
df['game'].fillna(method='ffill', inplace=True)
print("Number of Null values on the game column: ", df['game'].isna().sum())


# Save the updated DataFrame back to a CSV file
df.to_csv("data/athletes.csv", index = False)

Number of Null values on the game column:  0
