# <a id='toc1_'></a>[JO 2024 project](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [JO 2024 project](#toc1_)    
  - [Prelude](#toc1_1_)    
  - [Imports](#toc1_2_)    
  - [Functions](#toc1_3_)    
  - [Data collect](#toc1_4_)    
    - [Extract one country data](#toc1_4_1_)    
    - [Extract All countries data](#toc1_4_2_)    
    - [Extract all data](#toc1_4_3_)    
    - [Description of data](#toc1_4_4_)    
  - [Transform](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Prelude](#toc0_)

Work in progress...  

Project summary:
- Data source: <a href="https://www.olympic.org/news">International Olympic Committee</a>
- Data extracted from: <a href="http://olympanalyt.com/OlympAnalytics.php">olympanalyt.com - mail : sportsencyclo@gmail.com</a>

## <a id='toc1_2_'></a>[Imports](#toc0_)

In [119]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

## <a id='toc1_3_'></a>[Functions](#toc0_)

In [120]:
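def GetHTML(url):
    """
    Needs : URL of the page to download
    Return : HTML text of the response
    """
    response = requests.get(url)

    if response.status_code == 200:
        print("Response OK, continue.")
    else:
        # The body is still returned below, even when the request failed
        print("Access impossible (status {}).".format(response.status_code))

    return response.text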
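
`GetHTML` prints a message but still returns the body when the status is not 200, so a failed request only shows up later, during parsing. A minimal sketch of a stricter variant is given here. The name `fetch_html`, the `session` parameter and the 10-second timeout are assumptions, not part of the notebook.

```python
def fetch_html(url, session, timeout=10):
    """Return the page body, or raise if the request did not succeed."""
    # `session` is assumed to behave like requests.Session (hypothetical parameter)
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Possible use in the notebook (needs network access):
# import requests
# html = fetch_html(url, requests.Session())
```

Passing the session in keeps the function easy to exercise with a stub object, without contacting the website.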

In [121]:
def Getdataframe(html) :

    """
    Needs : HTML text, e.g. html = requests.get(url).text
    Return : DataFrame with one row per medal : medal, country, athlete_name, games, sport, event, sex
    Requirement : Must be applied to a medals page whose URL follows the pattern address[COUNTRY]address
                e.g. : http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=[COUNTRY]&param_games=ALL&param_sport=ALL
    """

    # Parse the HTML text into a BeautifulSoup object
    soup = BeautifulSoup(html, "html5lib")

    # Finding the correct table according to description
    table = soup.find("table", class_="simpletable", style="text-align:left;")

    # Retrieving the list (find_all) of all rows
    rows = table.find_all("tr")

    # Creating the table (future dataframe)
    athletes = []

    # For each row (= one medal entry)
    for row in rows:
        cells = row.find_all("td") # Retrieving the list (find_all) of all cells in the row

        if len(cells) == 7: # A row with 7 cells describes a medal entry
            medal = cells[0].find('img')['title'][0]  # First character of the medal image title
            country = cells[1].find('img')['title']  # Athlete's country
            athlete_name = str(cells[2]).replace("<br/>","/").replace("</td>","").replace("<td>","") # Athlete's name, split by "/" if it's a team
            games = cells[3].find('img')['title']  # Olympic Games (host country)
            sport = cells[4].find('img')['title']  # Sport
            sex = cells[5].text.strip()  # Sex
            event = cells[6].text.strip()  # Event

            # Add the information to the list of athletes
            athletes.append({
                'medal': medal,
                'country': country,
                'athlete_name': athlete_name,
                'games': games,
                'sport': sport,
                'event': event,
                'sex' : sex
                })

        # Rows that do not have exactly 7 cells are not medal entries and are skipped

    return pd.DataFrame(athletes)

# Try with one country - FRA (France)
url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=FRA&param_games=ALL&param_sport=ALL"
html = GetHTML(url)

Getdataframe(html)

Response OK, continue.


Unnamed: 0,medal,country,athlete_name,games,sport,event,sex
0,1,France,Paul MASSON,Greece,Cycling Track,sprint individual,Men's
1,1,France,Paul MASSON,Greece,Cycling Track,1/3km time trial,Men's
2,1,France,Paul MASSON,Greece,Cycling Track,10km,Men's
3,1,France,Léon FLAMENG,Greece,Cycling Track,100km,Men's
4,1,France,Eugène-Henri GRAVELOTTE,Greece,Fencing,foil individual,Men's
...,...,...,...,...,...,...,...
884,2,France,SIMON Julia/FILLON MAILLET Quentin/CHEVALIER-B...,China,Biathlon,Relay mix,Mixed
885,2,France,LEDEUX Tess,China,Freestyle Skiing,Big air,Women's
886,2,France,TRESPEUCH Chloe,China,Snowboard,Snowboard Cross,Women's
887,3,France,FAIVRE Mathieu,China,Alpine Skiing,giant slalom,Men's
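
The `athlete_name` column is built from the raw HTML of the name cell rather than from its text. The sketch below isolates that step on a plain string. The helper name `names_from_cell_html` and the sample cell are hypothetical; the sample is modelled on a team row of the output, not copied from the website.

```python
def names_from_cell_html(cell_html):
    """Mirror the replace chain applied to str(cells[2]) in Getdataframe."""
    return (
        cell_html
        .replace("<br/>", "/")   # one athlete per line break -> "/" separator
        .replace("</td>", "")    # drop the closing cell tag
        .replace("<td>", "")     # drop the opening cell tag
    )


# Hypothetical cell content for a two-person team
sample = "<td>Luc PILLOT<br/>Thierry PEPONNET</td>"
print(names_from_cell_html(sample))
```

Only a bare `<td>` is removed, so a cell carrying attributes would keep its opening tag in the name.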


## <a id='toc1_4_'></a>[Data collect](#toc0_)

### <a id='toc1_4_1_'></a>[Extract one country data](#toc0_)

In [122]:
url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=FRA&param_games=ALL&param_sport=ALL"

In [123]:
# Try with one country - FRA (France)
html = GetHTML(url)

Response OK, continue.


In [124]:
# Try to get the DF on one country - FRA (France)
df_medalists = Getdataframe(html)

df_medalists.sample(5)

Unnamed: 0,medal,country,athlete_name,games,sport,event,sex
267,3,France,Charles RAMPELBERG,United States,Cycling Track,1km time trial,Men's
620,3,France,Жанни Лонго-Сипрелли,Australia,Cycling Road,individual time trial,Women's
422,3,France,Jean-Jacques MOUNIER,Germany,Judo,- 63kg (lightweight),Men's
479,3,France,Luc PILLOT/Thierry PEPONNET,United States,Sailing,470 - Two Person Dinghy,Men's
252,1,France,Xavier LESAGE,United States,Equestrian Dressage,individual,Mixed
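
Only the `param_country` value changes from one country page to the next, so the URL can be assembled from its parameters instead of being written out each time. This is a minimal sketch; `BASE_URL` and `medals_url` are hypothetical names, and the parameters are the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://olympanalyt.com/OlympAnalytics.php"


def medals_url(country_code):
    """Build the medals-page URL for one country code (hypothetical helper)."""
    params = {
        "param_pagetype": "Medals",
        "param_dbversion": "",          # left empty, as in the notebook URL
        "param_country": country_code,  # the only value that changes
        "param_games": "ALL",
        "param_sport": "ALL",
    }
    return BASE_URL + "?" + urlencode(params)


print(medals_url("FRA"))
```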


### <a id='toc1_4_2_'></a>[Extract All countries data](#toc0_)

Scrape the country codes so that the medals URL can be rebuilt for each country (only the `param_country` value changes).  
Country codes page: <a href="http://olympanalyt.com/OlympAnalytics.php?param_pagetype=RefCountries&param_dbversion=&param_country=CIV&param_games=ALL&param_sport=ALL"> Countries list</a>

In [125]:
# Countries list
url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=RefCountries&param_dbversion=&param_country=CIV&param_games=ALL&param_sport=ALL"

html = GetHTML(url)

Response OK, continue.


In [126]:
# Parse the HTML text into a BeautifulSoup object
soup = BeautifulSoup(html, "html5lib")

In [127]:
# Select the right table
table = soup.find_all("table", class_="simpletable", style="text-align:center;")[1] # 2nd table
rows = table.find_all("tr")
countries = []

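for row in rows[2:]: # ignore the column names
    # Whitespace split: a row only yields 5 tokens when neither the country name nor the continent contains a space
    data = row.text.split()

    # Keep rows made of a 3-letter code followed by 4 fields; other rows are skipped
    if len(data) == 5 and len(data[0]) == 3:
        countries.append(
            {
                'Code':data[0], 
                'Country':data[1], 
                'Continent':data[2], 
                'Firstparticipation':data[3], 
                'Lastparticipation':data[4]
            }
        )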

df_countries = pd.DataFrame(countries)

In [128]:
df_countries.sample(5)

Unnamed: 0,Code,Country,Continent,Firstparticipation,Lastparticipation
35,CRO,Croatia,Europe,1992,2022
66,IND,India,Asia,1900,2022
127,ROU,Romania,Europe,1900,2022
87,LBA,Libya,Africa,1968,2021
40,DMA,Dominica,America,1996,2021
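
Because each row is split on whitespace, a country whose name contains a space produces more than 5 tokens and is dropped by the `len(data) == 5` test. One possible workaround is sketched here. It assumes that the continent and both years are always single tokens at the end of the row. The helper name `parse_country_row` and the second sample row are made up for illustration.

```python
def parse_country_row(text):
    """Split a country row, allowing spaces inside the country name."""
    tokens = text.split()
    # Need at least: code, one name token, continent, first year, last year
    if len(tokens) < 5 or len(tokens[0]) != 3:
        return None
    return {
        "Code": tokens[0],
        "Country": " ".join(tokens[1:-3]),  # everything between code and continent
        "Continent": tokens[-3],
        "Firstparticipation": tokens[-2],
        "Lastparticipation": tokens[-1],
    }


print(parse_country_row("CRO Croatia Europe 1992 2022"))
print(parse_country_row("ABC Some Country Europe 1900 2022"))  # invented row
```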


### <a id='toc1_4_3_'></a>[Extract all data](#toc0_)

In [129]:
# Set to True to re-run the scraping on all countries (one request per country).
# When False, this cell does nothing and the data is read from DB_olympic.csv instead.
RUN_SCRAPING = False

if RUN_SCRAPING:
    code_countries = df_countries["Code"]
    name_countries = df_countries["Country"]
    frames = []

    for country, country_name in zip(code_countries, name_countries):
        url = "http://olympanalyt.com/OlympAnalytics.php?param_pagetype=Medals&param_dbversion=&param_country=" + country + "&param_games=ALL&param_sport=ALL"
        html = GetHTML(url)
        frames.append(Getdataframe(html))
        print("for : {}".format(country_name))

    # Concatenate once at the end instead of growing the DataFrame inside the loop
    df_olympic = pd.concat(frames, ignore_index=True)

In [130]:
#df_olympic.to_csv("DB_olympic.csv", sep=";", index=False)

In [131]:
df_olympic = pd.read_csv("DB_olympic.csv", delimiter=";")
df_olympic.fillna("", inplace=True)

In [132]:
df_olympic.describe(include="object")

Unnamed: 0,country,athlete_name,games,sport,event,sex
count,12317,12317.0,12317,12317,12317,12317
unique,119,9998.0,25,76,659,3
top,Germany,,United States,Athletics,individual,Men's
freq,922,309.0,1528,1464,357,8052


### <a id='toc1_4_4_'></a>[Description of data](#toc0_)

In [133]:
# Looking for non-Latin names (rows without any Latin letter, which includes empty names)
latin_regex = r'[A-Za-zÀ-ÿ]'

df_filtered = df_olympic[df_olympic["athlete_name"].str.contains(latin_regex, regex=True)]
df_none_latine = df_olympic[~df_olympic["athlete_name"].str.contains(latin_regex, regex=True)]

In [134]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10093 entries, 0 to 12316
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   medal         10093 non-null  int64 
 1   country       10093 non-null  object
 2   athlete_name  10093 non-null  object
 3   games         10093 non-null  object
 4   sport         10093 non-null  object
 5   event         10093 non-null  object
 6   sex           10093 non-null  object
dtypes: int64(1), object(6)
memory usage: 630.8+ KB


In [135]:
df_filtered.sample(5)

Unnamed: 0,medal,country,athlete_name,games,sport,event,sex
5486,1,Germany,Markus WASMEIER,Norway,Alpine Skiing,giant slalom,Men's
8434,1,Kenya,Mathew BIRIR,Spain,Athletics,3000m steeplechase,Men's
1803,1,Canada,Jack CAMERON/Reginald SMITH/Ernie COLLETT/Harr...,France,Ice Hockey,ice hockey,Men's
6764,2,Indonesia,Triyatno TRIYATNO,Great Britain,Weightlifting,62 - 69kg (lightweight),Men's
198,3,Australia,Murray RILEY/Mervyn WOOD,Australia / Sweden,Rowing,double sculls (2x),Men's


In [136]:
df_none_latine.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2224 entries, 6 to 12278
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   medal         2224 non-null   int64 
 1   country       2224 non-null   object
 2   athlete_name  2224 non-null   object
 3   games         2224 non-null   object
 4   sport         2224 non-null   object
 5   event         2224 non-null   object
 6   sex           2224 non-null   object
dtypes: int64(1), object(6)
memory usage: 139.0+ KB


In [137]:
df_none_latine.sample(5)

Unnamed: 0,medal,country,athlete_name,games,sport,event,sex
10445,3,Romania,Симион,United States,Boxing,63.5 - 67kg (welterweight),Men's
1341,3,Belgium,,Australia,Tennis,doubles,Women's
4124,3,Finland,Рети,United States,Athletics,javelin throw,Men's
9028,1,Netherlands,Ван Велде,United States,Speed skating,1000m,Men's
407,1,Australia,Майкл Даймонд,Australia,Shooting,Trap,Men's
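
The empty name in the sample above comes from the way the filter works: a name is kept in `df_filtered` as soon as it contains one character matched by the regex, so an empty string lands in `df_none_latine`. The sketch below applies the same pattern to a few plain strings; `has_latin_letter` is a hypothetical helper name.

```python
import re

# Same character class as latin_regex in the notebook
LATIN_REGEX = re.compile(r"[A-Za-zÀ-ÿ]")


def has_latin_letter(name):
    """True when at least one character of `name` matches the regex."""
    return LATIN_REGEX.search(name) is not None


for name in ["Paul MASSON", "Léon FLAMENG", "Симион", ""]:
    print(repr(name), has_latin_letter(name))
```

The range `À-ÿ` also covers the signs `×` and `÷`, which sit between the accented letters in Unicode, so it is slightly wider than "Latin letters".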


## <a id='toc1_5_'></a>[Transform](#toc0_)

In [138]:
# Transliterate non-Latin names into Latin characters
from unidecode import unidecode

df_olympic["athlete_name"] = df_olympic["athlete_name"].apply(unidecode)

In [139]:
df_olympic.sample(5)

Unnamed: 0,medal,country,athlete_name,games,sport,event,sex
6008,3,Germany,PUNZEL Tina/HENTSCHEL Lena,Japan,Diving,synchronized diving 3m springboard,Women's
9122,2,Netherlands,Sven KRAMER,Russian Federation,Speed skating,10000m,Men's
5990,1,Germany,ROTTER FOCKEN Aline,Japan,Wrestling Freestyle,76kg,Women's
496,3,Australia,Shein Kelli,Greece,Cycling Track,Keirin,Men's
10295,2,Romania,Alexandru FOLKER/Cezar DRAGANITA/Werner STOCKL...,Canada,Handball,handball,Men's


In [144]:
# Check whether a row lists more than one athlete
def is_team(names):
    """
    Information : Athlete names are separated by "/" when there is more than one
    Return : Boolean
    """
    return "/" in names

# Create a new column with the team information
df_olympic["team"] = ["Yes" if is_team(athlete) else "No" for athlete in df_olympic["athlete_name"]]

# Team athletes
#df_final_team = df_olympic[df_olympic["athlete_name"].apply(is_team)]

# Solo athlete
#df_final_solo = df_olympic[~df_olympic["athlete_name"].apply(is_team)]

In [145]:
df_olympic

Unnamed: 0,medal,country,athlete_name,games,sport,event,sex,team
0,3,Afghanistan,Rohullah NIKPAI,China,Taekwondo,- 58 kg,Men's,No
1,3,Afghanistan,Rohullah NIKPAI,Great Britain,Taekwondo,58 - 68 kg,Men's,No
2,3,Algeria,Mohamed ZAOUI,United States,Boxing,71 - 75kg (middleweight),Men's,No
3,3,Algeria,Mustapha MOUSSA,United States,Boxing,75 - 81kg (light-heavyweight),Men's,No
4,1,Algeria,Hassiba BOULMERKA,Spain,Athletics,1500m,Women's,No
...,...,...,...,...,...,...,...,...
12312,3,Mixed Team,Robert Fournier-Sarloveze (FRA)/Frederick Agne...,France,Polo,polo,Men's,Yes
12313,1,Mixed Team,Ramon FONST (CUB)/Albertson VAN ZO POST (USA)/...,United States,Fencing,foil team,Men's,Yes
12314,1,Mixed Team,Philipp KASSEL (USA)/Max HESS (USA)/John GRIEB...,United States,Gymnastics Artistic,team competition,Men's,Yes
12315,2,Mixed Team,James LIGHTBODY (USA)/Lacey HEARN (USA)/Albert...,United States,Athletics,4miles team,Men's,Yes
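
The same "/" separator can be used to break a team entry into its members. In the Mixed Team rows each name is followed by a three-letter country code in parentheses. The sketch below assumes that this layout holds for every such row; `split_team` and `MEMBER_REGEX` are hypothetical names.

```python
import re

# Name, optionally followed by a 3-letter code in parentheses, e.g. "Ramon FONST (CUB)"
MEMBER_REGEX = re.compile(r"^(?P<name>.*?)\s*(?:\((?P<noc>[A-Z]{3})\))?$")


def split_team(athlete_name):
    """Return a list of (name, country_code_or_None), one item per athlete."""
    members = []
    for part in athlete_name.split("/"):
        part = part.strip()
        if not part:
            continue  # empty name or stray separator
        match = MEMBER_REGEX.match(part)
        if match is None:
            members.append((part, None))
        else:
            members.append((match.group("name"), match.group("noc")))
    return members


print(split_team("Ramon FONST (CUB)/Albertson VAN ZO POST (USA)"))
print(split_team("Rohullah NIKPAI"))
```

`len(split_team(name))` then gives the team size, which is more informative than the Yes/No flag when teams need to be compared.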


To do:
- Build a simplified sport list
- Build a simplified event list (split weight, length, height, etc.?)
- Look for empty names (sometimes filled with the country name)
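
For the event list, one possible starting point is to pull the weight class out of events such as `63.5 - 67kg (welterweight)` or `- 58 kg`. This is only a sketch of the idea: the regex is written against the few event strings visible in the outputs above, a single number is assumed to be the upper limit, and `parse_weight_event` is a hypothetical helper.

```python
import re

WEIGHT_REGEX = re.compile(
    r"^\s*(?:(?P<low>\d+(?:\.\d+)?)\s*-\s*|-\s*)?"  # optional "low -" or leading "-"
    r"(?P<high>\d+(?:\.\d+)?)\s*kg"                 # upper limit followed by "kg"
    r"(?:\s*\((?P<label>[^)]*)\))?\s*$"             # optional "(label)"
)


def parse_weight_event(event):
    """Return the weight bounds of an event, or None if it is not a weight class."""
    match = WEIGHT_REGEX.match(event)
    if match is None:
        return None
    low = match.group("low")
    return {
        "low_kg": float(low) if low is not None else None,
        "high_kg": float(match.group("high")),
        "label": match.group("label"),
    }


for event in ["63.5 - 67kg (welterweight)", "- 58 kg", "76kg", "1500m"]:
    print(event, "->", parse_weight_event(event))
```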

1. git branch **branch_name**
2. git checkout **branch_name**
3. *make changes*
4. git add . 
5. git commit -m "*commit message*"

6. git push --set-upstream origin notice_alex
6. **(alternative)** git push (*to push to the branch once the upstream is set*)
