# <a id='toc1_'></a>[JO 2024 project](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [JO 2024 project](#toc1_)    
  - [Prelude](#toc1_1_)    
  - [Imports](#toc1_2_)    
  - [Functions](#toc1_3_)    
  - [Data collection](#toc1_4_)    
    - [Extract one country data](#toc1_4_1_)    
    - [Extract All countries data](#toc1_4_2_)    
    - [Extract all data](#toc1_4_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Prelude](#toc0_)

Work in progress...  

Project summary:
- Data source: <a href="https://www.olympic.org/news">International Olympic Committee</a>
- Data extracted from: <a href="http://olympanalyt.com/OlympAnalytics.php">olympanalyt.com</a> (contact: sportsencyclo@gmail.com)

## <a id='toc1_2_'></a>[Imports](#toc0_)

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## <a id='toc1_3_'></a>[Functions](#toc0_)

In [2]:
def GetHTML(url) :
    """
    Needs : URL of the web page
    Return : HTML text of the response
             (the body is returned even when the status code is not 200)
    """
    response = requests.get(url)

    if response.status_code == 200 :
        print("Response OK, continue.")
    else :
        print("Access impossible.")
    
    return response.text
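
The function above only prints a message when the status code is not 200 and still returns the body, so an error page could reach the parser. A minimal sketch of a stricter variant follows. The name `get_html_strict`, the `get` injection parameter and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
def get_html_strict(url, get=None, timeout=10):
    """Return the HTML text of `url`, raising on any HTTP error status.

    `get` is a hypothetical injection point so the function can be
    exercised without network access; by default it is `requests.get`.
    """
    if get is None:
        import requests  # imported lazily so a stub can replace it in tests
        get = requests.get

    response = get(url, timeout=timeout)
    response.raise_for_status()  # requests raises HTTPError on 4xx/5xx
    return response.text
```

With this variant, a failed download stops the scraping loop immediately instead of silently producing an empty or misleading table.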

In [3]:
def Getdataframe(html) :

    """
    Needs : HTML text of a medals page,
            e.g. requests.get(url).text
    Return : DataFrame with one row per medal : medal, country, athlete_name, games, sport, event, sex
    Requirement : The page must come from a URL of the form prefix[COUNTRY]suffix,
                e.g. : http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=[COUNTRY]&param_games=ALL&param_sport=ALL
    """

    # Parse the HTML text into a BeautifulSoup object
    soup = BeautifulSoup(html, "html5lib") #parse

    # Finding the correct table according to description
    table = soup.find("table", class_="simpletable", style="text-align:left;")

    # Retrieving the list (find_all) of all rows
    rows = table.find_all("tr")

    # Creating the table (future dataframe)
    athletes = []

    # For each row (= athlete)
    for row in rows:
        #describe_athlete = row.text.split() # Get the text
        cells = row.find_all("td") # Retrieving the list (find_all) of all cells in the row

        if len(cells) == 7: # If there are 7 cells then it's an athlete
            medal = cells[0].find('img')['title'][0]  # Medal title
            country = cells[1].find('img')['title']  # Athlete's country
            athlete_name = str(cells[2]).replace("<br/>","/").replace("</td>","").replace("<td>","") # Athlete's name; team members are separated by "/"
            games = cells[3].find('img')['title']  # Olympic Games
            sport = cells[4].find('img')['title']  # Sport
            sex = cells[5].text.strip()  # Sex
            event = cells[6].text.strip()  # Event

            # Add the information to the list of athletes
            athletes.append({
                'medal': medal,
                'country': country,
                'athlete_name': athlete_name,
                'games': games,
                'sport': sport,
                'event': event,
                'sex' : sex
                })

        # Rows without exactly 7 cells are not athletes and are skipped

    return pd.DataFrame(athletes)

url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=FRA&param_games=ALL&param_sport=ALL"
# Try with one country - FRA (France)
html = GetHTML(url)

Getdataframe(html)

Response OK, continue.


| | medal | country | athlete_name | games | sport | event | sex |
|---|---|---|---|---|---|---|---|
| 0 | 1 | France | Paul MASSON | Greece | Cycling Track | sprint individual | Men's |
| 1 | 1 | France | Paul MASSON | Greece | Cycling Track | 1/3km time trial | Men's |
| 2 | 1 | France | Paul MASSON | Greece | Cycling Track | 10km | Men's |
| 3 | 1 | France | Léon FLAMENG | Greece | Cycling Track | 100km | Men's |
| 4 | 1 | France | Eugène-Henri GRAVELOTTE | Greece | Fencing | foil individual | Men's |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 884 | 2 | France | SIMON Julia/FILLON MAILLET Quentin/CHEVALIER-B... | China | Biathlon | Relay mix | Mixed |
| 885 | 2 | France | LEDEUX Tess | China | Freestyle Skiing | Big air | Women's |
| 886 | 2 | France | TRESPEUCH Chloe | China | Snowboard | Snowboard Cross | Women's |
| 887 | 3 | France | FAIVRE Mathieu | China | Alpine Skiing | giant slalom | Men's |
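
The `athlete_name` step in `Getdataframe` works on the raw markup of the third cell: each `<br/>` becomes a `/` and the `<td>` wrapper is removed. The sketch below isolates that step on a plain string. The helper name `cell_to_names` and the sample markup are assumptions for illustration; the real cells may carry attributes that this simple replacement would not strip.

```python
def cell_to_names(cell_html):
    """Mirror the athlete_name step: turn a <td> cell into 'A/B/C'."""
    return (
        cell_html
        .replace("<br/>", "/")   # one team member per line break
        .replace("</td>", "")
        .replace("<td>", "")
    )

# Hypothetical cell markup for a team entry (one name per <br/>)
team_cell = "<td>Henri Monnot<br/>Gaston Cailleux<br/>Léon Tellier</td>"

joined = cell_to_names(team_cell)
print(joined)             # single string, members separated by "/"
print(joined.split("/"))  # list of individual team members
```

Splitting the result on `/` gives the individual team members, which is what the later team/solo check relies on.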


## <a id='toc1_4_'></a>[Data collection](#toc0_)

### <a id='toc1_4_1_'></a>[Extract one country data](#toc0_)

In [4]:
url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=FRA&param_games=ALL&param_sport=ALL"

In [5]:
# Try with one country - FRA (France)
html = GetHTML(url)

Response OK, continue.


In [6]:
# Try to get the DF on one country - FRA (France)
df_medalists = Getdataframe(html)

df_medalists.sample(5)

| | medal | country | athlete_name | games | sport | event | sex |
|---|---|---|---|---|---|---|---|
| 181 | 3 | France | Roger François DUCRET | Belgium | Fencing | foil individual | Men's |
| 322 | 2 | France | James COUTTET | Switzerland | Alpine Skiing | slalom | Men's |
| 95 | 3 | France | Henri Monnot/Gaston Cailleux/Léon Tellier | France | Sailing | 0.5t | Mixed |
| 231 | 1 | France | Lucien GAUDIN | Netherlands | Fencing | épée individual | Men's |
| 578 | 3 | France | Коваль/Барата | United States | Rowing | double sculls (2x) | Men's |


### <a id='toc1_4_2_'></a>[Extract All countries data](#toc0_)

Scrape the country codes so that we can loop over each of them by changing the `param_country` value in the URL.  
Country codes list: <a href="http://olympanalyt.com/OlympAnalytics.php?param_pagetype=RefCountries&param_dbversion=&param_country=CIV&param_games=ALL&param_sport=ALL">Countries list</a>
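
Since only the country code changes from one medals page to the next, the URL can be built from its query parameters. The following is a small sketch using the standard library; the helper name `build_medals_url` and the `BASE_URL` constant are introduced here for illustration, while the parameter names and values are copied from the URLs used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://olympanalyt.com/OlympAnalytics.php"

def build_medals_url(country_code):
    """Build the medals page URL for one three-letter country code."""
    params = {
        "param_pagetype": "Medals",
        "param_dbversion": "",
        "param_country": country_code,
        "param_games": "ALL",
        "param_sport": "ALL",
    }
    return BASE_URL + "?" + urlencode(params)

print(build_medals_url("FRA"))
```

For `"FRA"` this reproduces the URL used in the single-country test.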

In [7]:
# Countries list
url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=RefCountries&param_dbversion=&param_country=CIV&param_games=ALL&param_sport=ALL"

html = GetHTML(url)

Response OK, continue.


In [8]:
# Parse the HTML text into a BeautifulSoup object
soup = BeautifulSoup(html, "html5lib")

In [9]:
# Select the right table
table = soup.find_all("table", class_="simpletable", style="text-align:center;")[1] # 2nd table
rows = table.find_all("tr")
countries = []

for row in rows[2:]: # ignore the column names
    # Note: splitting on whitespace yields more than 5 tokens for multi-word
    # country names, so those rows fail the check below and are skipped
    data = row.text.split()

    if len(data) == 5 and len(data[0]) == 3:
        countries.append(
            {
                'Code':data[0], 
                'Country':data[1], 
                'Continent':data[2], 
                'Firstparticipation':data[3], 
                'Lastparticipation':data[4]
            }
        )

df_countries = pd.DataFrame(countries)

In [10]:
df_countries.sample(5)

| | Code | Country | Continent | Firstparticipation | Lastparticipation |
|---|---|---|---|---|---|
| 15 | BLR | Belarus | Europe | 1994 | 2022 |
| 12 | BRN | Bahrain | Asia | 1984 | 2021 |
| 61 | GUY | Guyana | America | 1948 | 2021 |
| 13 | BAN | Bangladesh | Asia | 1984 | 2021 |
| 156 | UZB | Uzbekistan | Asia | 1994 | 2022 |
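
Because the row text is split on whitespace and exactly 5 tokens are required, any country whose name contains a space is dropped. One possible workaround, sketched below, reads the fixed fields from both ends of the token list and joins whatever remains into the country name. It assumes every row reads as code, name, a single-word continent, then two years; the helper name `parse_country_row` and the second sample row are placeholders, not data from the site.

```python
def parse_country_row(text):
    """Split a row's text into the five fields, keeping multi-word names.

    Assumes the row reads: CODE, country name (one or more words),
    a single-word continent, then two numeric years.
    """
    tokens = text.split()
    is_country_row = (
        len(tokens) >= 5
        and len(tokens[0]) == 3
        and tokens[-2].isdigit()
        and tokens[-1].isdigit()
    )
    if not is_country_row:
        return None  # header or other non-country row

    return {
        "Code": tokens[0],
        "Country": " ".join(tokens[1:-3]),  # everything between code and continent
        "Continent": tokens[-3],
        "Firstparticipation": tokens[-2],
        "Lastparticipation": tokens[-1],
    }

print(parse_country_row("BLR Belarus Europe 1994 2022"))
# Made-up placeholder row with a two-word name
print(parse_country_row("XYZ Some Country Europe 1900 2000"))
```

Reading the cells of each row individually would be more robust, but that depends on the exact table markup, which is not shown here.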


### <a id='toc1_4_3_'></a>[Extract all data](#toc0_)

In [11]:
# Run scraping on all countries 
code_countries = df_countries["Code"]
name_countries = df_countries["Country"]
frames = []  # one DataFrame per country, concatenated once after the loop

for i, (country, country_name) in enumerate(zip(code_countries, name_countries)) :
    url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=" + country + "&param_games=ALL&param_sport=ALL"
    html = GetHTML(url)
    frames.append(Getdataframe(html))

    # Temporary limit while testing: remove this check to scrape every country
    if i > 5 :
        break

    print("for : {}".format(country_name))

df_olympic = pd.concat(frames, ignore_index=True)

Response OK, continue.
for : Afghanistan
Response OK, continue.
for : Albania
Response OK, continue.
for : Algeria
Response OK, continue.
for : Andorra
Response OK, continue.
for : Angola
Response OK, continue.
for : Argentina
Response OK, continue.


In [12]:
df_olympic.sample(15)

| | medal | country | athlete_name | games | sport | event | sex |
|---|---|---|---|---|---|---|---|
| 30 | 2 | Argentina | Manuel FERREIRA/Angel BOSIO/N. PERINETTI/Angel... | Netherlands | Football | football | Men's |
| 111 | 2 | Armenia | ALEKSANYAN Artur | Japan | Wrestling Greco-Roman | 97kg | Men's |
| 39 | 2 | Argentina | Jeanette CAMPBELL | Germany | Swimming | 100m freestyle | Women's |
| 72 | 3 | Argentina | Серена Амато | Australia | Sailing | single-handed dinghy (Europe) | Women's |
| 9 | 1 | Algeria | Нурия Мерах-Бенида | Australia | Athletics | 1500m | Women's |
| 68 | 3 | Argentina | Чакон | United States | Boxing | 54 - 57kg (featherweight) | Men's |
| 24 | 3 | Argentina | Alfredo PORZIO | France | Boxing | 66.68 - 72.57kg (middleweight) | Men's |
| 49 | 3 | Argentina | Mauro CIA | Great Britain | Boxing | 73 - 80kg (light-heavyweight) | Men's |
| 92 | 2 | Argentina | DEL POTRO Juan Martin | Brazil | Tennis | singles | Men's |
| 113 | 3 | Armenia | DAVTYAN Artur | Japan | Gymnastics Artistic | vault | Men's |


In [74]:
# Check whether a name contains any non-ASCII character
def contains_non_ascii(name):
    """
    Return: True if at least one character of the name is outside the ASCII range, False otherwise
    """
    return any(ord(char) > 127 for char in name)


# Split the rows into names with non-ASCII characters and ASCII-only names
df_non_ascii = df_olympic[df_olympic['athlete_name'].apply(contains_non_ascii)]
df_ascii = df_olympic[~df_olympic['athlete_name'].apply(contains_non_ascii)]
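
As a quick illustration of what this filter catches, the sketch below runs the same character test on three names taken from the outputs above. Note that accented Latin names are flagged as non-ASCII just like Cyrillic ones. The comparison with `str.isascii()` (available from Python 3.7) is an addition for illustration, not part of the original notebook.

```python
def contains_non_ascii(text):
    """Return True if any character is outside the ASCII range."""
    return any(ord(char) > 127 for char in text)

samples = ["Paul MASSON", "Léon FLAMENG", "Чакон"]

for name in samples:
    # `not name.isascii()` is the built-in equivalent of the check above
    print(name, contains_non_ascii(name), not name.isascii())
```

Both checks agree on every sample, so `~df_olympic['athlete_name'].str.isascii()` style shortcuts are unnecessary; the explicit helper is enough.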

In [80]:
# Check whether a name entry lists more than one athlete
def is_team(name):
    """
    Return: True if the entry contains several names separated by "/", False otherwise
    """
    return len(name.split("/")) > 1

# Team entries
df_non_ascii_team = df_non_ascii[df_non_ascii["athlete_name"].apply(is_team)]
df_ascii_team = df_ascii[df_ascii["athlete_name"].apply(is_team)]

# Solo athletes
df_non_ascii_solo = df_non_ascii[~df_non_ascii["athlete_name"].apply(is_team)]
df_ascii_solo = df_ascii[~df_ascii["athlete_name"].apply(is_team)]

In [86]:
from transliterate import translit

# Transliterate the non-ASCII solo athlete names to Latin script
names = df_non_ascii_solo.athlete_name
names_translated = [translit(name, reversed=True) for name in names]

In [87]:
names_translated

['Morseli',
 'Soltani',
 'Bahari',
 'Nurija Merah-Benida',
 'Ali Said-Sif',
 'Dzhabir Said-Gerni',
 'Abderahman Hammad',
 'Mohamed Allalu',
 'Hector MENDEZ',
 'Raul LANDINI',
 'Espinosa',
 'Chakon',
 'Karlos Espinola',
 'Serena Amato',
 'Dzhordzhina Bardash',
 'Nazarjan',
 'Mhitarjan',
 'Arsen Melikjan']
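
Some of the names flagged as non-ASCII are Latin-script names with accents rather than Cyrillic ones. For those, a different technique from transliteration may be enough: Unicode decomposition followed by removal of the combining marks. The sketch below uses only the standard library; the helper name `strip_accents` is an assumption, and the approach is meant for Latin-script names only (Cyrillic names still need transliteration).

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics from Latin-script text, e.g. 'é' -> 'e'."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keep everything else unchanged
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for name in ["Léon FLAMENG", "Eugène-Henri GRAVELOTTE", "Чакон"]:
    print(name, "->", strip_accents(name))
```

The Cyrillic sample passes through unchanged, which shows why the two cases need separate handling.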

In [82]:
df_ascii_solo.sample(10)

| | medal | country | athlete_name | games | sport | event | sex |
|---|---|---|---|---|---|---|---|
| 20 | 2 | Argentina | Luis BRUNETTO | France | Athletics | triple jump | Men's |
| 56 | 3 | Argentina | Victor ZALAZAR | Australia / Sweden | Boxing | 71 - 75kg (middleweight) | Men's |
| 108 | 2 | Armenia | MINASYAN Gor | Brazil | Weightlifting | + 105kg (super heavyweight) | Men's |
| 26 | 1 | Argentina | Arturo RODRIGUEZ JURADO | Netherlands | Boxing | + 79.38kg (heavyweight) | Men's |
| 102 | 3 | Armenia | Roman AMOYAN | China | Wrestling Greco-Roman | - 55kg | Men's |
| 5 | 3 | Algeria | Hocine SOLTANI | Spain | Boxing | 54 - 57kg (featherweight) | Men's |
| 69 | 2 | Argentina | | Australia | Hockey | hockey | Women's |
| 106 | 1 | Armenia | ALEKSANYAN Artur | Brazil | Wrestling Greco-Roman | 98kg | Men's |
| 89 | 1 | Argentina | Argentina | Brazil | Hockey | hockey | Men's |
| 74 | 1 | Argentina | | Greece | Football | football | Men's |


To do:
- Build a simplified sport list
- Build a simplified event list (split out weight, length, height, etc.?)
- Handle empty names (sometimes filled with the country name)
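
For the last item, a possible starting point is a small predicate that flags a row when the name is empty or merely repeats the country. This is only a sketch: the helper name `is_missing_name` is hypothetical, and it assumes names are plain strings (or `None`), as produced by `Getdataframe`.

```python
def is_missing_name(athlete_name, country):
    """Return True if the name is empty or just repeats the country name."""
    if athlete_name is None:
        return True
    cleaned = str(athlete_name).strip()
    return cleaned == "" or cleaned.lower() == str(country).strip().lower()

# (athlete_name, country) pairs taken from the sample output above
rows = [
    ("Luis BRUNETTO", "Argentina"),
    ("", "Argentina"),
    ("Argentina", "Argentina"),
]

for name, country in rows:
    print(repr(name), country, is_missing_name(name, country))
```

Applied row by row to `df_olympic`, the same predicate could isolate the team events whose members were not listed.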