# RXposé (Version 2.0) #

As of March 2024, I greatly prefer working in Python over R. In this file, I will build a better pipeline than last year's R version and (hopefully) incorporate a better data source so that the original analysis can be finished. Ideally, this project will also produce a searchable database, allowing other contributors to perform future analysis.

## Data Collection ##

The first step is to collect better data. [This website](https://iwrp.net/) has results in PDF format going back to 1928. Since this format is not helpful for computer analysis, we need to scrape it and convert it to a .csv or similarly usable file.

In [55]:
# Imports for data collection

import requests
from bs4 import BeautifulSoup

import re

import numpy as np
import pandas as pd

import warnings

In [2]:
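from io import StringIO

def download_table(url):
    """Download a page and return its first HTML table as a DataFrame, or None if there is none."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    tables = soup.find_all('table')

    if tables:
        # Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated
        return pd.read_html(StringIO(str(tables)))[0]

    print("No table found on page")
    return None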

In [3]:
df = download_table('https://iwrp.net/')



In [4]:
print(df.head())
print(df.size)  # .size counts cells (rows x columns), not rows

         Date                              Name        Place Nation   Gender  \
0  2023-12-04                    IWF Grand Prix         Doha    QAT    Males   
1  2023-12-04                    IWF Grand Prix         Doha    QAT  Females   
2  2023-11-15  48 th Junior World Championships  Guadalajara    MEX    Males   
3  2023-11-15  27 th Junior World Championships  Guadalajara    MEX  Females   
4  2023-10-20  11 th Youth Polish Championships     Biłgoraj    POL    Males   

  Age category  
0       Senior  
1       Senior  
2       Junior  
3       Junior  
4     Youth 15  
15510


In [5]:
# .size counts cells (rows x columns); divide by the 6 columns for the number of events
print(df[df['Gender'] == 'Males'].size)
print(df[df['Gender'] == 'Females'].size)

6276
4656


In [6]:
df.dtypes

Date            object
Name            object
Place           object
Nation          object
Gender          object
Age category    object
dtype: object

In [7]:
def is_valid_date(date_string):
    """Return True if pandas can parse date_string as a date."""
    try:
        pd.to_datetime(date_string)
        return True
    except (ValueError, TypeError):
        return False

In [8]:
# df['Date'] = pd.to_datetime(df['Date'])

# This throws an error unless the one problematic input is fixed first


In [9]:
valid_dates_mask = df['Date'].apply(is_valid_date)

invalid_dates = df[~valid_dates_mask]
invalid_dates

Unnamed: 0,Date,Name,Place,Nation,Gender,Age category
2538,1979-00-00,57 th European Championships,Varna,BUL,Males,Senior
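
The row-by-row check above can also be done in one vectorized pass. The following is a minimal sketch, assuming the dates are ISO-formatted strings like the ones in this table; `find_invalid_dates` is a hypothetical helper name, not something from the original pipeline.

```python
import pandas as pd

def find_invalid_dates(dates: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where a date string cannot be parsed."""
    # errors='coerce' turns unparseable values into NaT instead of raising
    parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
    # Exclude values that were missing to begin with
    return parsed.isna() & dates.notna()

# Small stand-in for df['Date'], including the problematic value found above
sample = pd.Series(['2023-12-04', '1979-00-00', '1979-11-08'])
invalid_mask = find_invalid_dates(sample)
print(sample[invalid_mask])
```

On the real data this would be called as `find_invalid_dates(df['Date'])`, which avoids calling `pd.to_datetime` once per row.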


In [10]:
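# Quick internet search for the correct date

def replace_invalid_date(date_string):
    """Return date_string unchanged if it parses; otherwise return the corrected date."""
    try:
        pd.to_datetime(date_string)
        return date_string
    except (ValueError, TypeError):
        # Only one row fails to parse: the 1979 European Championships in Varna
        return '1979-05-19'

df['Date'] = df['Date'].apply(replace_invalid_date)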

In [11]:
df['Date'] = pd.to_datetime(df['Date'])

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2585 entries, 0 to 2584
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          2585 non-null   datetime64[ns]
 1   Name          2585 non-null   object        
 2   Place         2583 non-null   object        
 3   Nation        2585 non-null   object        
 4   Gender        2585 non-null   object        
 5   Age category  2585 non-null   object        
dtypes: datetime64[ns](1), object(5)
memory usage: 121.3+ KB


Okay, what I actually need to do for now:

1. I need to have a single dataframe with all the results, but I need the date included. I think I need to go back and get a datetime object instead of a numerical year. Then I can avoid duplicate rows, since a single lifter's total, name, and date of competition will be uniquely identifying. Not all of this data will be used in this analysis, but we want to be able to easily search for all international performances of any lifter.

2. I need to then retrieve the actual lifting information from the links. The smaller dataframe generated by importing the table(s) at each link can then be appended to the large dataframe. I NEED TO MAKE SURE THAT THE DATE AND THE WEIGHT CATEGORY ARE INCLUDED IN EACH ROW AS PART OF THIS TRANSFORMATION. This should be fairly simple. Create the data frame with the necessary columns, then populate all the scraped values as additional columns.
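
The two steps above can be sketched on a toy example. This is only an illustration under stated assumptions: `attach_event_metadata` and `drop_repeated_performances` are hypothetical helper names, and the deduplication key (name, date, total) is the one proposed in step 1.

```python
import pandas as pd

def attach_event_metadata(results: pd.DataFrame, event: dict) -> pd.DataFrame:
    """Return a copy of one event's results with the event fields repeated on every row."""
    return results.assign(**event)

def drop_repeated_performances(results: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (athlete, date, total), the key proposed in step 1."""
    key = ['Athlete Name', 'Date', 'Total (kg)']
    return results.drop_duplicates(subset=key).reset_index(drop=True)

# Stand-in for one scraped results table, with a repeated row
event = {'Date': pd.Timestamp('2023-12-04'), 'Competition': 'IWF Grand Prix'}
scraped = pd.DataFrame({
    'Athlete Name': ['Jiang Huihua', 'Ri Song gum', 'Jiang Huihua'],
    'Total (kg)': [216.0, 213.0, 216.0],
})

combined = drop_repeated_performances(attach_event_metadata(scraped, event))
print(combined)
```

Because the date travels with every row, two different competitions with the same athlete and total stay distinct.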

In [13]:
# Collect every link on the page; the event links are picked out of this list below

def download_links(url: str, events_list: list) -> list:
    # events_list is currently unused; it is kept so the call below works unchanged
    # Send GET request to URL
    response = requests.get(url)

    # Parse the HTML
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all links on the page to sort later
    return list(soup.find_all('a'))
    

In [14]:
links = download_links('https://iwrp.net/', df['Name'])
print(type(links))

<class 'list'>


In [15]:
print(links[:10])
print(len(links))
print(len(df['Name']))


[<a href="/">IWRP</a>, <a href="/global-statistics">Global Statistics</a>, <a href="/../"><div class="header__logo"></div></a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2753">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2748">48 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2749">27 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2746">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2747">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2745">28 th Waldemar Malak Memorial	</a>]
2590
2585


In [16]:
links = links[3:-2]
links[:10]

[<a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2753">IWF Grand Prix</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2748">48 th Junior World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2749">27 th Junior World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2746">11 th Youth Polish Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2747">11 th Youth Polish Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2745">28 th Waldemar Malak Memorial	</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2743">88 th World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2744">30 th World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2741">37th Asian JuniorChampionship</a>]

In [17]:
len(links)

2585

In [19]:
df['link'] = [str(link) for link in links]
df.head()

Unnamed: 0,Date,Name,Place,Nation,Gender,Age category,link
0,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
1,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
2,2023-11-15,48 th Junior World Championships,Guadalajara,MEX,Males,Junior,"<a href=""/component/cwyniki/?view=contest&amp;..."
3,2023-11-15,27 th Junior World Championships,Guadalajara,MEX,Females,Junior,"<a href=""/component/cwyniki/?view=contest&amp;..."
4,2023-10-20,11 th Youth Polish Championships,Biłgoraj,POL,Males,Youth 15,"<a href=""/component/cwyniki/?view=contest&amp;..."
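
The `link` column stores the whole anchor tag as a string, so the `href` still contains the HTML entity `&amp;` and begins with a slash. A minimal sketch of turning such a string into a clean absolute URL follows; `anchor_to_url` and `BASE_URL` are hypothetical names, and the approach (regex plus `urljoin`) is an assumption, not what the notebook currently does.

```python
import re
from html import unescape
from urllib.parse import urljoin

BASE_URL = 'https://iwrp.net/'
HREF_PATTERN = re.compile(r'href="([^"]*)"')

def anchor_to_url(anchor_html, base_url=BASE_URL):
    """Extract the href from an anchor tag string and return an absolute URL, or None."""
    match = HREF_PATTERN.search(anchor_html)
    if match is None:
        return None
    # unescape turns &amp; back into &; urljoin avoids a doubled slash after the domain
    return urljoin(base_url, unescape(match.group(1)))

anchor = '<a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>'
print(anchor_to_url(anchor))
```

Storing the result of `anchor_to_url` instead of the raw tag would make the later scraping loop a plain iteration over URLs.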


In [20]:
df.rename({'Name': 'Competition', 'Place': 'Host City', 'Nation': 'Host Nation', 'Age category': 'Age Category'}, axis=1, inplace=True)
df.head()

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,link
0,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
1,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
2,2023-11-15,48 th Junior World Championships,Guadalajara,MEX,Males,Junior,"<a href=""/component/cwyniki/?view=contest&amp;..."
3,2023-11-15,27 th Junior World Championships,Guadalajara,MEX,Females,Junior,"<a href=""/component/cwyniki/?view=contest&amp;..."
4,2023-10-20,11 th Youth Polish Championships,Biłgoraj,POL,Males,Youth 15,"<a href=""/component/cwyniki/?view=contest&amp;..."


**Current Project**
Due to rule changes in the sport of weightlifting, we will focus only on events after the 1972 Munich Olympics. Those Games ended on September 11, 1972, so we filter for events after that date. Almost all data from this source is from after this time anyway, but the filter will also likely help prevent some errors when reading in the data.

In [21]:
# Boolean masks for filtering events

modern_event_mask = df['Date'] > '1972-09-11'
gender_mask = df['Gender'] == 'Females'
senior_mask = df['Age Category'] == 'Senior'

# Combine the masks with & rather than chaining [], which makes pandas reindex each mask and warn
len(df[modern_event_mask & gender_mask & senior_mask])



359

In [139]:
# This is taking my machine about 40 seconds for 10 separate event links

def all_data(guiding_data_frame: pd.DataFrame) -> pd.DataFrame:
    '''Returns a dataframe with all results for the events listed in the guiding dataframe'''
    
    # For events with 15 columns
    full_dataframe = pd.DataFrame(columns= ['Date', 'Competition', 'Host City', 'Host Nation', 'Gender', 'Age Category',
                                            'Overall Rank', 'Athlete Name', 'Athlete Nation', 'Bodyweight (kg)', 'Session', 
                                            'Snatch 1', 'Snatch 2', 'Snatch 3', 'Snatch Rank',
                                            'C&J 1', 'C&J 2', 'C&J 3', 'C&J Rank',
                                            'Total (kg)', 'Sinclair'])
    
    # For events with 14 columns (no lettered sessions)
    full_dataframe_short = pd.DataFrame(columns=['Date', 'Competition', 'Host City', 'Host Nation', 'Gender', 'Age Category',
                                            'Overall Rank', 'Athlete Name', 'Athlete Nation', 'Bodyweight (kg)', 
                                            'Snatch 1', 'Snatch 2', 'Snatch 3', 'Snatch Rank',
                                            'C&J 1', 'C&J 2', 'C&J 3', 'C&J Rank',
                                            'Total (kg)', 'Sinclair'])
    
    pattern = re.compile(r'"([^"]*)"')
    for idx, link in enumerate(guiding_data_frame['link']):
        match = pattern.search(link)
        if match:
            event_link = match.group(1)
            absolute_url = 'https://iwrp.net/' + event_link

            response = requests.get(absolute_url)
            if response.status_code == 200:  # Check if request was successful
                soup = BeautifulSoup(response.content, 'html.parser')
                tables = soup.find_all('table')
                for table in tables[:1]: # Currently limiting tables, as I believe all information is contained in the first table.
                    with warnings.catch_warnings():
                        warnings.filterwarnings('ignore', category = FutureWarning)
                        try:
                            # This block is executed with 15 columns
                            temp_df = pd.read_html(str(table))[0]
                            # Extract column names from the first row
                            temp_df.columns = temp_df.iloc[0]
                            # Drop the first row (header row)
                            temp_df = temp_df.drop(0)
                            temp_df.columns = ['Overall Rank', 'Athlete Name', 'Athlete Nation', 'Bodyweight (kg)', 'Session', 
                                            'Snatch 1', 'Snatch 2', 'Snatch 3', 'Snatch Rank',
                                            'C&J 1', 'C&J 2', 'C&J 3', 'C&J Rank',
                                            'Total (kg)', 'Sinclair']
                            
                            # Reset index to ensure uniqueness
                            temp_df.reset_index(drop=True, inplace=True)
                            
                            temp_df['Date'] = guiding_data_frame['Date'][idx]
                            temp_df['Competition'] = guiding_data_frame['Competition'][idx]
                            temp_df['Host City'] =  guiding_data_frame['Host City'][idx]
                            temp_df['Host Nation'] = guiding_data_frame['Host Nation'][idx]
                            temp_df['Gender'] = guiding_data_frame['Gender'][idx]
                            temp_df['Age Category'] = guiding_data_frame['Age Category'][idx]
                            
                            # This is to save memory for this expensive process (expensive for my laptop)
                            # temp_df = temp_df.drop_duplicates()
                            
                            # Concatenate the data frames
                            full_dataframe = pd.concat([full_dataframe, temp_df], ignore_index=True)
                            
                        except ValueError:
                            # This block is executed with 14 columns
                            temp_df = pd.read_html(str(table))[0]
                            # Extract column names from the first row
                            temp_df.columns = temp_df.iloc[0]
                            # Drop the first row (header row)
                            temp_df = temp_df.drop(0)
                            if len(temp_df.columns) == 14:
                                temp_df.columns = ['Overall Rank', 'Athlete Name', 'Athlete Nation', 'Bodyweight (kg)', 
                                            'Snatch 1', 'Snatch 2', 'Snatch 3', 'Snatch Rank',
                                            'C&J 1', 'C&J 2', 'C&J 3', 'C&J Rank',
                                            'Total (kg)', 'Sinclair']
                            
                                # Reset index to ensure uniqueness
                                temp_df.reset_index(drop=True, inplace=True)
                            
                                temp_df['Date'] = guiding_data_frame['Date'][idx]
                                temp_df['Competition'] = guiding_data_frame['Competition'][idx]
                                temp_df['Host City'] =  guiding_data_frame['Host City'][idx]
                                temp_df['Host Nation'] = guiding_data_frame['Host Nation'][idx]
                                temp_df['Gender'] = guiding_data_frame['Gender'][idx]
                                temp_df['Age Category'] = guiding_data_frame['Age Category'][idx]
                            else:
                                print(f"Skipped {guiding_data_frame['Competition'][idx]} due to errant columns")
                                continue
                            
                            # This is to save memory for this expensive process (expensive for my laptop)
                            # temp_df = temp_df.drop_duplicates()
                            
                            # Concatenate the data frames
                            full_dataframe_short = pd.concat([full_dataframe_short, temp_df], ignore_index=True)
                        
                       
                # Implement a progress checker to watch for failing internet connection 
                if idx % 10 == 9:
                    print(f"successfully added information from page {idx+1} of {len(guiding_data_frame)}")
                        
                
            else:
                print(f"Failed to fetch data from {absolute_url}. Status code: {response.status_code}")
        else:
            print(f"No match found for link: {link}")
    
    return pd.concat([full_dataframe, full_dataframe_short], ignore_index = True, sort = False) # Join dataframes, filling in NaN as necessary

In [99]:
# As a test, this is only Senior Women's events from after Munich 1972 (no women's events before then, mask is unnecessary)
# I believe that only international competitions will have the correct data format to be included here.

senior_women_guiding_df = df[modern_event_mask & gender_mask & senior_mask].reset_index(drop=True)

senior_women_data = all_data(senior_women_guiding_df)



successfully added information from page 10 of 359
successfully added information from page 20 of 359
successfully added information from page 30 of 359
successfully added information from page 40 of 359
successfully added information from page 50 of 359
successfully added information from page 60 of 359
successfully added information from page 70 of 359
successfully added information from page 80 of 359
successfully added information from page 90 of 359
Skipped Team Women Polish Cup due to errant columns
successfully added information from page 100 of 359
successfully added information from page 110 of 359
successfully added information from page 120 of 359
successfully added information from page 130 of 359
Skipped Polish Women Cup due to errant columns
successfully added information from page 140 of 359
successfully added information from page 150 of 359
successfully added information from page 160 of 359
successfully added information from page 170 of 359
successfully added informa

Maybe I need to work with these in series individually, then create the dataframe from that. That would solve the problem of misaligned columns, and it would likely be faster.
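
One way this idea could look: name each table's columns according to its width, collect the labelled tables in a list, and concatenate once at the end. This is a sketch under assumptions, not the notebook's current method; `label_columns` and `combine_events` are hypothetical names, and the tiny integer tables stand in for scraped results.

```python
import pandas as pd

COLUMNS_WITH_SESSION = ['Overall Rank', 'Athlete Name', 'Athlete Nation', 'Bodyweight (kg)', 'Session',
                        'Snatch 1', 'Snatch 2', 'Snatch 3', 'Snatch Rank',
                        'C&J 1', 'C&J 2', 'C&J 3', 'C&J Rank',
                        'Total (kg)', 'Sinclair']
COLUMNS_WITHOUT_SESSION = [name for name in COLUMNS_WITH_SESSION if name != 'Session']

def label_columns(raw):
    """Name the columns by table width; return None for layouts that are not recognised."""
    if raw.shape[1] == len(COLUMNS_WITH_SESSION):
        names = COLUMNS_WITH_SESSION
    elif raw.shape[1] == len(COLUMNS_WITHOUT_SESSION):
        names = COLUMNS_WITHOUT_SESSION
    else:
        return None
    labelled = raw.copy()
    labelled.columns = names
    return labelled

def combine_events(raw_tables):
    """Label every table, skip unrecognised layouts, and concatenate once."""
    labelled = [label_columns(raw) for raw in raw_tables]
    kept = [frame for frame in labelled if frame is not None]
    if not kept:
        return pd.DataFrame(columns=COLUMNS_WITH_SESSION)
    # Columns are aligned by name, so the 14-column tables get NaN in 'Session'
    return pd.concat(kept, ignore_index=True, sort=False)

# Stand-in tables: one with 15 columns, one with 14, one with an unexpected layout
wide = pd.DataFrame([list(range(15))])
narrow = pd.DataFrame([list(range(14))])
odd = pd.DataFrame([list(range(3))])
combined = combine_events([wide, narrow, odd])
print(combined.shape)
```

Concatenating once avoids re-copying the growing dataframe on every iteration, which is likely part of why the loop above is slow.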

In [100]:
senior_women_guiding_df.head()

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,link
0,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
1,2023-09-04,30 th World Championships,Riyadh,KSA,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
2,2023-06-15,30 th Polish Championships,Gdańsk,POL,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
3,2023-05-05,31 st Asian Championships,Jinju,KOR,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."
4,2023-04-15,35 th European Championships,Yerevan,ARM,Females,Senior,"<a href=""/component/cwyniki/?view=contest&amp;..."


In [101]:
senior_women_data.shape

(33697, 21)

It's working!!

In [102]:
# These are leftovers from the import process

senior_women_data[senior_women_data['Sinclair'] == 'Sincler']

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
0,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,Pl,Surname and name,Nation,B.W,...,1,2,3,,1,2,3,,,Sincler
4,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,Pl,Surname and name,Nation,B.W,...,Snatch,Snatch,Snatch,,Cl&Jerk,Cl&Jerk,Cl&Jerk,,,Sincler
5,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,Pl,Surname and name,Nation,B.W,...,1,2,3,,1,2,3,,,Sincler
32,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,Pl,Surname and name,Nation,B.W,...,Snatch,Snatch,Snatch,,Cl&Jerk,Cl&Jerk,Cl&Jerk,,,Sincler
33,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,Pl,Surname and name,Nation,B.W,...,1,2,3,,1,2,3,,,Sincler
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33678,1994-03-04,1 st Polish Championships,Elblag,POL,Females,Senior,Pl,Surname and name,Club,B.W,...,1,2,3,,1,2,3,,,Sincler
33686,1994-03-04,1 st Polish Championships,Elblag,POL,Females,Senior,Pl,Surname and name,Club,B.W,...,Snatch,Snatch,Snatch,,Cl&Jerk,Cl&Jerk,Cl&Jerk,,,Sincler
33687,1994-03-04,1 st Polish Championships,Elblag,POL,Females,Senior,Pl,Surname and name,Club,B.W,...,1,2,3,,1,2,3,,,Sincler
33692,1994-03-04,1 st Polish Championships,Elblag,POL,Females,Senior,Pl,Surname and name,Club,B.W,...,Snatch,Snatch,Snatch,,Cl&Jerk,Cl&Jerk,Cl&Jerk,,,Sincler


In [103]:
senior_women_data = senior_women_data[senior_women_data['Sinclair'] != 'Sincler']
senior_women_data.shape

(28850, 21)

In [104]:
senior_women_data['Session'].unique()

array(['A', '49 kg', 'B', 'C', '55 kg', '59 kg', '64 kg', '71 kg',
       '76 kg', '81 kg', '87 kg', '+ 87 kg', nan, 'D', 'E', '53 kg',
       '58 kg', '63 kg', '69 kg', '75 kg', '90 kg', '+ 90 kg', '+ 75 kg',
       'X', 'c', '50 kg', '54 kg', '70 kg', '83 kg', '+ 83 kg', '48 kg',
       '52 kg', '56 kg', '60 kg', '67,5 kg', '82,5 kg', '+ 82,5 kg'],
      dtype=object)

In [105]:
senior_women_data = senior_women_data.drop_duplicates()
senior_women_data.shape

(28848, 21)

In [107]:
len(senior_women_data['Date'].unique())

344

In [121]:
senior_women_data.head()

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
1,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,1,Won Hyon Sim,PRK,45.00,...,77.0,82.0,86.0,1,90.0,95.0,99.0,1,181.0,303.2
2,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,2,Jean Ramos Rose,PHI,45.00,...,68.0,70.0,70.0,2,85.0,86.0,87.0,2,155.0,259.7
3,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,49 kg,49 kg,49 kg,49 kg,...,49 kg,49 kg,49 kg,49 kg,49 kg,49 kg,49 kg,49 kg,49 kg,49 kg
6,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,1,Jiang Huihua,CHN,49.00,...,90.0,94.0,96.0,2,113.0,118.0,120.0,2,216.0,337.7
7,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,2,Ri Song gum,PRK,49.00,...,90.0,93.0,95.0,3,120.0,120.0,124.0,1,213.0,333.0


In [123]:
# This should remove all instances of the extra html chimeras

senior_women_data = senior_women_data[senior_women_data['Overall Rank'] != senior_women_data['Athlete Name']]
senior_women_data.shape

(26603, 21)
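
Dropping these rows also discards the weight category they carry. A minimal sketch of keeping it, assuming a category row is exactly one where `Overall Rank` equals `Athlete Name` (as in the `49 kg` row above); `carry_weight_category` and the `Weight Category` column are hypothetical names.

```python
import pandas as pd

def carry_weight_category(results):
    """Copy each category row's label onto the athlete rows below it, then drop the category rows."""
    is_category_row = results['Overall Rank'] == results['Athlete Name']
    labelled = results.copy()
    # Keep the label only on category rows, then fill it downwards
    labelled['Weight Category'] = labelled['Overall Rank'].where(is_category_row).ffill()
    return labelled[~is_category_row].reset_index(drop=True)

# Stand-in for the first rows of the IWF Grand Prix table shown above
sample = pd.DataFrame({
    'Overall Rank': ['1', '2', '49 kg', '1', '2'],
    'Athlete Name': ['Won Hyon Sim', 'Jean Ramos Rose', '49 kg', 'Jiang Huihua', 'Ri Song gum'],
})
with_category = carry_weight_category(sample)
print(with_category)
```

Two caveats: the first group in a table appears to have no label row, so it stays empty here, and the fill would need to be applied per event so that a label does not spill into the next competition.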

In [132]:
columns_to_convert = ['Overall Rank', 'Bodyweight (kg)', 'Snatch 1', 'Snatch 2', 'Snatch 3', 'Snatch Rank', 
                      'C&J 1', 'C&J 2', 'C&J 3', 'C&J Rank', 'Total (kg)', 'Sinclair']

# Work on a copy so the original dataframe is left untouched
senior_women_data_numeric = senior_women_data.copy()

senior_women_data_numeric[columns_to_convert] = senior_women_data_numeric[columns_to_convert].apply(pd.to_numeric, errors='coerce')

In [133]:
senior_women_data_numeric.dtypes

Date               datetime64[ns]
Competition                object
Host City                  object
Host Nation                object
Gender                     object
Age Category               object
Overall Rank              float64
Athlete Name               object
Athlete Nation             object
Bodyweight (kg)           float64
Session                    object
Snatch 1                  float64
Snatch 2                  float64
Snatch 3                  float64
Snatch Rank               float64
C&J 1                     float64
C&J 2                     float64
C&J 3                     float64
C&J Rank                  float64
Total (kg)                float64
Sinclair                  float64
dtype: object

In [136]:
senior_women_data_numeric[senior_women_data_numeric['Sinclair'] == 0]

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
26,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,,Piron Beatriz,DOM,47.68,...,,,,,,,,,,0.0
27,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,,Hariroh Siti Nafisatul,AFG,48.82,...,72.0,75.0,77.0,13.0,92.0,95.0,95.0,,,0.0
28,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,,Lopez Ferrer Ana Gabriela,MEX,48.97,...,83.0,83.0,83.0,,95.0,100.0,103.0,9.0,,0.0
29,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,,Delacruz Jourdan Elizabeth,USA,49.00,...,86.0,89.0,91.0,,,,,,,0.0
30,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,,Saikhom Mirabai Chanu,IND,49.00,...,,,,,,,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33429,1997-10-26,4 th Polish Championships,Wiecbork,POL,Females,Senior,,Pruszek Karolina,TKKF Belfer,63.20,...,52.5,52.5,52.5,,67.5,72.5,72.5,6.0,,0.0
33492,1997-07-19,British Championships,Abbey,GBR,Females,Senior,,Hayes Suzanne,,75.70,...,52.5,,,2.0,,,,,,0.0
33547,1996-05-31,3 rd Polish Championships,Ciechanow,POL,Females,Senior,,Gawryluk Izabela,KS AZS-AWF,64.10,...,,,,,,,,,,0.0
33548,1996-05-31,3 rd Polish Championships,Ciechanow,POL,Females,Senior,,Gabrusewicz Aneta,,65.20,...,,,,,,,,,,0.0


April 3, 2024: At this point, we would be able to fill in any missing total or Sinclair values! Since the parser is now working well, I am going to save this data and repeat the process for senior men. Further cleaning will be done later.
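
As a sketch of how a missing Sinclair value might be recomputed from the total and bodyweight, the standard formula multiplies the total by 10^(A * log10(bodyweight / b)^2) for lifters lighter than b. The coefficient values below are an assumption: they are the women's values for the 2021-2024 cycle as quoted from memory and should be checked against the official IWF tables before use. Totals are not recomputed here, because the attempt columns do not seem to mark failed lifts (in the first row above, 86 + 99 does not equal the recorded 181).

```python
import math

def sinclair_coefficient(bodyweight_kg, a, b):
    """Sinclair coefficient; lifters at or above the reference bodyweight b get 1.0."""
    if bodyweight_kg >= b:
        return 1.0
    return 10 ** (a * math.log10(bodyweight_kg / b) ** 2)

def sinclair_total(total_kg, bodyweight_kg, a, b):
    """Scale a total by the Sinclair coefficient for the given bodyweight."""
    return total_kg * sinclair_coefficient(bodyweight_kg, a, b)

# ASSUMPTION: women's coefficients for the 2021-2024 cycle, quoted from memory
WOMEN_A = 0.787004341
WOMEN_B = 153.757

# Won Hyon Sim's row above: 181 kg total at 45 kg bodyweight
example = sinclair_total(181.0, 45.0, WOMEN_A, WOMEN_B)
print(round(example, 1))
```

With these assumed coefficients the result should land close to the 303.2 listed for that row. Older events would need the coefficients of their own cycle.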

At this point, these data could be turned into a database. I would prefer to be able to do that with all the data at once, however. I am going to save this dataframe as a .csv, then I will be able to load everything at once. For now, we will focus on only Senior men and women.

**Future Steps**
1. Turn all numeric columns from strings to numbers. Will need to look for outliers/obviously wrong values.

2. Concatenate dataframes and create database -- Postgres? MySQL? SQLite?
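
To get a feel for the database step, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The table layout, the column names, and the helper names `build_results_db` and `performances_for` are all assumptions for illustration; the final schema is still undecided.

```python
import sqlite3

def build_results_db(rows):
    """Create an in-memory SQLite database holding a small results table."""
    connection = sqlite3.connect(':memory:')
    connection.execute(
        '''CREATE TABLE results (
               date TEXT, competition TEXT, athlete_name TEXT,
               bodyweight_kg REAL, total_kg REAL, sinclair REAL
           )'''
    )
    connection.executemany('INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)', rows)
    connection.commit()
    return connection

def performances_for(connection, athlete_name):
    """Return (date, competition, total) for every recorded performance of one lifter."""
    cursor = connection.execute(
        'SELECT date, competition, total_kg FROM results '
        'WHERE athlete_name = ? ORDER BY date',
        (athlete_name,),
    )
    return cursor.fetchall()

# Two rows taken from the senior women's data shown earlier
rows = [
    ('2023-12-04', 'IWF Grand Prix', 'Jiang Huihua', 49.0, 216.0, 337.7),
    ('2023-12-04', 'IWF Grand Prix', 'Ri Song gum', 49.0, 213.0, 333.0),
]
connection = build_results_db(rows)
print(performances_for(connection, 'Jiang Huihua'))
```

SQLite needs no server, which makes it a convenient first target; replacing `':memory:'` with a file path would persist the database.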

In [137]:
# Save the current data
# We can change the variable names for future use

path = '/Users/aaronkeeney/Documents/Data_Analytics_Projects/Rxpose/senior_women_data.csv'

senior_women_data_numeric.to_csv(path, index=False)



Next, we will repeat this process with senior men.

In [145]:
modern_event_mask = df['Date'] > '1972-09-11'
gender_mask = df['Gender'] == 'Males'
senior_mask = df['Age Category'] == 'Senior'

senior_men_guiding_df = df[modern_event_mask & gender_mask & senior_mask].reset_index()
senior_men_guiding_df.shape



(520, 8)

In [146]:
senior_men_data = all_data(senior_men_guiding_df)

successfully added information from page 10 of 520
successfully added information from page 20 of 520
successfully added information from page 30 of 520
successfully added information from page 40 of 520
successfully added information from page 50 of 520
successfully added information from page 60 of 520
successfully added information from page 70 of 520
successfully added information from page 80 of 520
successfully added information from page 90 of 520
successfully added information from page 100 of 520
Skipped Polish Team Championships 2016 due to errant columns
Skipped Polish Team Championships 2016 due to errant columns
Skipped Polish Team Championships 2016 due to errant columns
Skipped Polish Team Championships 2016 due to errant columns
Skipped Polish Team Championships 2016 due to errant columns
Skipped Polish Team Championships 2016 due to errant columns
successfully added information from page 110 of 520
Skipped Polish Team Championships 2016 due to errant columns
Skipped Po

Obviously, there is something different about the Polish Team Championships, but that is outside of our purview for this study.

In [160]:
senior_men_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43097 entries, 1 to 52115
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             43097 non-null  datetime64[ns]
 1   Competition      43097 non-null  object        
 2   Host City        42969 non-null  object        
 3   Host Nation      43097 non-null  object        
 4   Gender           43097 non-null  object        
 5   Age Category     43097 non-null  object        
 6   Overall Rank     37762 non-null  float64       
 7   Athlete Name     43097 non-null  object        
 8   Athlete Nation   30171 non-null  object        
 9   Bodyweight (kg)  43097 non-null  float64       
 10  Session          20777 non-null  object        
 11  Snatch 1         42394 non-null  float64       
 12  Snatch 2         35893 non-null  float64       
 13  Snatch 3         35520 non-null  float64       
 14  Snatch Rank      40007 non-null  float64   

Next, we apply the same transformations for data cleaning that we used for the women's events.

In [161]:
# Remove title rows
senior_men_data = senior_men_data[senior_men_data['Sinclair'] != 'Sincler']

# Remove category rows
senior_men_data = senior_men_data[senior_men_data['Overall Rank'] != senior_men_data['Athlete Name']]

# Remove Duplicates
senior_men_data = senior_men_data.drop_duplicates()

senior_men_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43097 entries, 1 to 52115
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             43097 non-null  datetime64[ns]
 1   Competition      43097 non-null  object        
 2   Host City        42969 non-null  object        
 3   Host Nation      43097 non-null  object        
 4   Gender           43097 non-null  object        
 5   Age Category     43097 non-null  object        
 6   Overall Rank     37762 non-null  float64       
 7   Athlete Name     43097 non-null  object        
 8   Athlete Nation   30171 non-null  object        
 9   Bodyweight (kg)  43097 non-null  float64       
 10  Session          20777 non-null  object        
 11  Snatch 1         42394 non-null  float64       
 12  Snatch 2         35893 non-null  float64       
 13  Snatch 3         35520 non-null  float64       
 14  Snatch Rank      40007 non-null  float64   

In [162]:
# Copy the dataframe and convert columns to numeric

senior_men_data_numeric = senior_men_data.reset_index()

senior_men_data_numeric[columns_to_convert] = senior_men_data_numeric[columns_to_convert].apply(pd.to_numeric, errors='coerce')

senior_men_data_numeric.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43097 entries, 0 to 43096
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   index            43097 non-null  int64         
 1   Date             43097 non-null  datetime64[ns]
 2   Competition      43097 non-null  object        
 3   Host City        42969 non-null  object        
 4   Host Nation      43097 non-null  object        
 5   Gender           43097 non-null  object        
 6   Age Category     43097 non-null  object        
 7   Overall Rank     37762 non-null  float64       
 8   Athlete Name     43097 non-null  object        
 9   Athlete Nation   30171 non-null  object        
 10  Bodyweight (kg)  43097 non-null  float64       
 11  Session          20777 non-null  object        
 12  Snatch 1         42394 non-null  float64       
 13  Snatch 2         35893 non-null  float64       
 14  Snatch 3         35520 non-null  float

In [164]:
senior_men_data_numeric.sort_values(by='Sinclair', ascending=False)

Unnamed: 0,index,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
25909,30454,2017-11-03,German Championships,Speyer,GER,Males,Senior,6.0,Schiffer Karl Waldemar,,...,115.0,120.0,125.0,6.0,140.0,146.0,151.0,6.0,271.0,111132.0
1454,1680,2022-05-28,100 th European Championships,Tirana,ALB,Males,Senior,1.0,Talakhadze Lasha,GEO,...,208.0,212.0,217.0,1.0,245.0,253.0,,1.0,462.0,510.8
1455,1681,2022-05-28,100 th European Championships,Tirana,ALB,Males,Senior,2.0,Varazdat Lalayan,ARM,...,203.0,211.0,211.0,2.0,240.0,252.0,252.0,2.0,451.0,498.6
1456,1682,2022-05-28,100 th European Championships,Tirana,ALB,Males,Senior,3.0,Minasyan Gor,ARM,...,202.0,210.0,210.0,3.0,236.0,245.0,245.0,4.0,446.0,493.1
1677,1931,2021-12-07,86 th World Championships,Tashkent,UZB,Males,Senior,1.0,Talakhadze Lasha,GEO,...,210.0,218.0,225.0,1.0,247.0,257.0,267.0,1.0,492.0,492.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21372,24988,1979-11-08,53 rd World Championships,Saloniki,GRE,Males,Senior,8.0,Chavigny Jean-Claude,FRA,...,115.0,,,10.0,147.5,,,5.0,262.5,0.0
21373,24989,1979-11-08,53 rd World Championships,Saloniki,GRE,Males,Senior,9.0,Saito Takashi,JPN,...,115.0,,,11.0,147.5,,,6.0,262.5,0.0
21374,24990,1979-11-08,53 rd World Championships,Saloniki,GRE,Males,Senior,10.0,Tan Hanyong,CHN,...,115.0,,,12.0,147.5,,,9.0,262.5,0.0
21375,24991,1979-11-08,53 rd World Championships,Saloniki,GRE,Males,Senior,11.0,Loscos Rodrigues Julio,CUB,...,117.5,,,9.0,145.0,,,13.0,262.5,0.0


There appear to be a few errors: one Sinclair value is nonsensical, and the others should be checked. As far as I know, Lasha has not been credited with a Sinclair over 500...

In [165]:
# Save men's data in case it gets messed up

path = '/Users/aaronkeeney/Documents/Data_Analytics_Projects/Rxpose/senior_men_data.csv'

senior_men_data_numeric.to_csv(path, index=False)

In [139]:
# In case of crashes (or just future data analysis), use this cell to reload data from source

senior_women_data_loaded = pd.read_csv('/Users/aaronkeeney/Documents/Data_Analytics_Projects/Rxpose/senior_women_data.csv')

senior_men_data_loaded = pd.read_csv( '/Users/aaronkeeney/Documents/Data_Analytics_Projects/Rxpose/senior_men_data.csv')




In [140]:
# Combining all these data into one data frame

# In the future, these extra columns should disappear, as I added 'index=False' to the to_csv() function

all_senior_data_loaded = pd.concat([senior_men_data_loaded, senior_women_data_loaded], axis=0)

all_senior_data_loaded = all_senior_data_loaded.drop(all_senior_data_loaded.columns[[0,1]], axis=1)

all_senior_data_loaded.head()

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
0,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,1.0,Pang Un Chol,PRK,54.94,...,110.0,114.0,116.0,1.0,142.0,148.0,152.0,1.0,268.0,441.0
1,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,2.0,Nugroho Satrio Adi,INA,54.79,...,108.0,112.0,115.0,2.0,135.0,139.0,144.0,2.0,254.0,418.9
2,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,3.0,Yodage Dilanka Isuru Kumara,SRI,55.0,...,106.0,112.0,114.0,4.0,133.0,139.0,140.0,3.0,245.0,402.8
3,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,4.0,Rizqih Muhammad Ibnu,INA,55.0,...,111.0,111.0,113.0,3.0,130.0,134.0,134.0,4.0,243.0,399.5
4,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,5.0,Taj Md Ashikur Rahman,BAN,55.0,...,93.0,97.0,100.0,6.0,112.0,118.0,118.0,5.0,212.0,348.5
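
The extra leading columns that get dropped above come from saving a frame together with its row index. As a minimal sketch (toy data and in-memory buffers instead of the real files, so every name here is illustrative), this is how `index=False` changes what comes back from `pd.read_csv`:

```python
import io

import pandas as pd

# Toy frame standing in for the real results table (values are illustrative).
toy = pd.DataFrame({"Athlete Name": ["A", "B"], "Total (kg)": [268.0, 254.0]})

# Default to_csv writes the row index as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Once every saved file is written with `index=False`, the positional `drop` of the first columns should no longer be needed, and leaving it in would remove real data columns.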


In [141]:
all_senior_data_loaded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 69700 entries, 0 to 26602
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             69700 non-null  object 
 1   Competition      69700 non-null  object 
 2   Host City        69472 non-null  object 
 3   Host Nation      69700 non-null  object 
 4   Gender           69700 non-null  object 
 5   Age Category     69700 non-null  object 
 6   Overall Rank     62303 non-null  float64
 7   Athlete Name     69700 non-null  object 
 8   Athlete Nation   48424 non-null  object 
 9   Bodyweight (kg)  69700 non-null  float64
 10  Session          33815 non-null  object 
 11  Snatch 1         68826 non-null  float64
 12  Snatch 2         60058 non-null  float64
 13  Snatch 3         59548 non-null  float64
 14  Snatch Rank      65138 non-null  float64
 15  C&J 1            67033 non-null  float64
 16  C&J 2            58074 non-null  float64
 17  C&J 3            

The Date column is no longer a datetime object after reloading the data. This is expected: a CSV file stores dates as plain text, and `pd.read_csv` leaves them as strings by default. Easy fix, though.
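
An alternative to converting afterwards is to ask `pd.read_csv` to parse the column while loading. A small sketch, assuming a toy CSV held in memory rather than the real file:

```python
import io

import pandas as pd

# Two illustrative rows in the same layout as the saved file.
csv_text = (
    "Date,Competition\n"
    "2023-12-04,IWF Grand Prix\n"
    "2022-05-28,100 th European Championships\n"
)

# Default load: the Date column stays as text.
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named columns to datetime64 during the load.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(plain["Date"].dtype)
print(parsed["Date"].dtype)
```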

In [142]:
all_senior_data_loaded['Date'] = pd.to_datetime(all_senior_data_loaded['Date'])
all_senior_data_loaded = all_senior_data_loaded.reset_index(drop=True)
all_senior_data_loaded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69700 entries, 0 to 69699
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             69700 non-null  datetime64[ns]
 1   Competition      69700 non-null  object        
 2   Host City        69472 non-null  object        
 3   Host Nation      69700 non-null  object        
 4   Gender           69700 non-null  object        
 5   Age Category     69700 non-null  object        
 6   Overall Rank     62303 non-null  float64       
 7   Athlete Name     69700 non-null  object        
 8   Athlete Nation   48424 non-null  object        
 9   Bodyweight (kg)  69700 non-null  float64       
 10  Session          33815 non-null  object        
 11  Snatch 1         68826 non-null  float64       
 12  Snatch 2         60058 non-null  float64       
 13  Snatch 3         59548 non-null  float64       
 14  Snatch Rank      65138 non-null  float

~~At this point, we could upload the senior data to the database. Below, I am going to filter these data and preserve only the pieces necessary for the current analysis project.~~

April 4, 2024: I am going to continue cleaning the data first and create a usable database. We can return to the actual analysis later; cleaning all the data first will reduce the amount of work in the long term.

In [143]:
all_senior_data_loaded.sort_values(by = 'Sinclair', ascending = False)

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
25909,2017-11-03,German Championships,Speyer,GER,Males,Senior,6.0,Schiffer Karl Waldemar,,2.40,...,115.0,120.0,125.0,6.0,140.0,146.0,151.0,6.0,271.0,111132.0
49505,2014-05-26,24 th Pan Americana Championships,Santo Domingo,DOM,Females,Senior,7.0,Lima De Araujo Monique Maria,BRA,4.70,...,100.0,105.0,105.0,4.0,121.0,124.0,126.0,8.0,221.0,22829.4
59576,2017-11-03,German Championships,Speyer,GER,Females,Senior,3.0,Winterholler Nicole,,3.50,...,65.0,69.0,73.0,3.0,85.0,89.0,92.0,3.0,162.0,21046.7
59577,2017-11-03,German Championships,Speyer,GER,Females,Senior,4.0,Jacobs Sarah,,3.60,...,68.0,68.0,73.0,4.0,85.0,89.0,92.0,4.0,157.0,18975.7
1454,2022-05-28,100 th European Championships,Tirana,ALB,Males,Senior,1.0,Talakhadze Lasha,GEO,110.00,...,208.0,212.0,217.0,1.0,245.0,253.0,,1.0,462.0,510.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
770,2023-04-15,101 st European Championships,Yerevan,ARM,Males,Senior,,Imadouchene Romain,FRA,92.52,...,160.0,160.0,160.0,,,,,,,0.0
11540,2007-09-17,76 th World Championships,Chiang Mai,THA,Males,Senior,,Qiu Le,CHN,61.39,...,137.0,141.0,141.0,4.0,,,,,,0.0
11541,2007-09-17,76 th World Championships,Chiang Mai,THA,Males,Senior,,Alpanov Ruslan,UZB,61.58,...,120.0,125.0,125.0,28.0,145.0,145.0,145.0,,,0.0
11542,2007-09-17,76 th World Championships,Chiang Mai,THA,Males,Senior,,Vidanage Chinthana,SRI,61.75,...,121.0,125.0,125.0,24.0,160.0,160.0,160.0,,,0.0


In [144]:
all_senior_data_loaded[all_senior_data_loaded['Athlete Name'] == 'Talakhadze Lasha'].sort_values(by='Sinclair', ascending=False)

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 1,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair
1454,2022-05-28,100 th European Championships,Tirana,ALB,Males,Senior,1.0,Talakhadze Lasha,GEO,110.0,...,208.0,212.0,217.0,1.0,245.0,253.0,,1.0,462.0,510.8
1677,2021-12-07,86 th World Championships,Tashkent,UZB,Males,Senior,1.0,Talakhadze Lasha,GEO,182.9,...,210.0,218.0,225.0,1.0,247.0,257.0,267.0,1.0,492.0,492.5
1883,2021-07-23,XXXII Olympics Games,Tokyo,JPN,Males,Senior,1.0,Talakhadze Lasha,GEO,177.0,...,208.0,215.0,223.0,1.0,245.0,255.0,265.0,1.0,488.0,489.2
2228,2021-04-03,99 th European Championships,Moscow,RUS,Males,Senior,1.0,Talakhadze Lasha,GEO,176.3,...,211.0,217.0,222.0,1.0,245.0,253.0,263.0,1.0,485.0,486.3
2871,2019-09-18,85 th World Championships,Pattaya,THA,Males,Senior,1.0,Talakhadze Lasha,GEO,168.65,...,208.0,215.0,220.0,1.0,247.0,255.0,264.0,1.0,484.0,484.2
3424,2019-04-06,98 th European Championships,Batumi,GEO,Males,Senior,1.0,Talakhadze Lasha,GEO,170.1,...,208.0,218.0,,1.0,245.0,260.0,,1.0,478.0,478.1
4450,2017-11-29,83 rd World Championships,Anaheim,USA,Males,Senior,1.0,Talakhadze Lasha,GEO,166.0,...,210.0,215.0,220.0,1.0,243.0,250.0,257.0,1.0,477.0,477.5
806,2023-04-15,101 st European Championships,Yerevan,ARM,Males,Senior,1.0,Talakhadze Lasha,GEO,175.78,...,210.0,217.0,222.0,1.0,246.0,252.0,,1.0,474.0,475.4
5359,2016-08-10,XXXI Olympics Games,Rio de Janeiro,BRA,Males,Senior,1.0,Talakhadze Lasha,GEO,157.34,...,205.0,210.0,215.0,2.0,242.0,247.0,258.0,1.0,473.0,474.7
3774,2018-11-01,84 th World Championships,Ashgabat,TKM,Males,Senior,1.0,Talakhadze Lasha,GEO,169.31,...,207.0,212.0,217.0,1.0,245.0,252.0,257.0,1.0,474.0,474.2


From this small cross section, we can see that a few errors are present. The Sinclair over 500 is too high, most likely because Lasha is credited with a bodyweight of 110 kg for that competition. He was certainly heavier than that: his other listed bodyweights range from about 157 kg to 183 kg. ~~We can also see some missing values for the total when there are valid numbers in both the snatch and clean and jerk. This should be filled in, and there are likely other places in this data set where this is the case.~~

It turns out the numbers in the Snatch and C&J columns are attempts, not makes. Thus, we will trust the 'Total (kg)' values as being correct.

In [145]:
# THIS FUNCTION IS NO LONGER USEFUL (APRIL 4, 2024)

# Fill in totals that have valid snatches and clean and jerks

def update_totals(row):
    if pd.isnull(row['Total (kg)']):
        snatch_max = max(row[['Snatch 1', 'Snatch 2', 'Snatch 3']])
        cj_max = max(row[['C&J 1', 'C&J 2', 'C&J 3']])
        if not np.isnan(snatch_max) and not np.isnan(cj_max):
            return snatch_max + cj_max
    return row['Total (kg)']

#all_senior_data_loaded['Total (kg)'] = all_senior_data_loaded.apply(update_totals, axis=1)


### Sinclair Experiment ###

Below, we are going to calculate our own Sinclair to try to find a little more consistency. We are going to use the total and the credited bodyweight from the table to do this calculation. The Sinclair coefficients change every Olympic cycle. However, the changes are small, and if we scale everyone relative to the same numbers, I believe this will be a simple and equitable way to proceed. From the previous version (done within SQLite):

WHEN (sex = 'F')

    THEN  total_kg* POWER(10, 0.787004341*POWER(LOG10(weight_class_kg/153.757),2))
    
    ELSE  total_kg* POWER(10, 0.722762521*POWER(LOG10(weight_class_kg/193.609),2))

In this Python version, we will be using the total kilograms and the listed bodyweight for each athlete. For a detailed explanation of Sinclair values, see this [link](https://vektlofting.no/siteassets/dokumenter/aktiv-idrett/stevneprotokoller/sinclair-koeffisienter-2021-2024/2021-sinclair_coefficients.pdf).
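
As a quick sanity check of the formula, here is a small standalone sketch that applies it to a single result. The coefficients are the ones quoted in the SQLite snippet; the helper name `sinclair_total` and the example figures (a 342.5 kg total at 59.7 kg bodyweight, taken from one of the rows in the results table) are only for illustration.

```python
import math

# Coefficients as quoted in the SQLite snippet (A, reference bodyweight b).
MEN_A, MEN_B = 0.722762521, 193.609
WOMEN_A, WOMEN_B = 0.787004341, 153.757


def sinclair_total(total_kg, bodyweight_kg, a, b):
    """Scale a total by 10 ** (a * log10(bodyweight / b) ** 2)."""
    # Athletes at or above the reference bodyweight keep their raw total.
    if bodyweight_kg >= b:
        return total_kg
    return total_kg * 10 ** (a * math.log10(bodyweight_kg / b) ** 2)


# A 342.5 kg total at 59.7 kg bodyweight, men's coefficients.
light = sinclair_total(342.5, 59.7, MEN_A, MEN_B)
# A 462 kg total at the (suspicious) 110 kg bodyweight, men's coefficients.
heavy = sinclair_total(462.0, 110.0, MEN_A, MEN_B)

print(round(light, 2), round(heavy, 2))
```

The lighter the athlete relative to the reference bodyweight, the larger the multiplier, which is why a wrongly low bodyweight inflates the result so dramatically.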

In [146]:
def calculate_sinclair(row):
    # Sinclair total using the 2021-2024 coefficients from the linked PDF.
    # Rows with no usable bodyweight (0 after cleaning), or with a bodyweight
    # at or above the reference bodyweight, keep their raw total.
    if not row['Bodyweight (kg)']:
        return row['Total (kg)']
    if row['Gender'] == 'Females' and row['Bodyweight (kg)'] < 153.757:
        return row['Total (kg)'] * 10**(0.787004341 * (np.log10(row['Bodyweight (kg)'] / 153.757)) ** 2)
    if row['Gender'] == 'Males' and row['Bodyweight (kg)'] < 193.609:
        return row['Total (kg)'] * 10**(0.722762521 * (np.log10(row['Bodyweight (kg)'] / 193.609)) ** 2)
    return row['Total (kg)']



In [147]:
all_senior_data_loaded['Sinclair (calculated)'] = all_senior_data_loaded.apply(calculate_sinclair, axis=1)

In [148]:
all_senior_data_loaded.sort_values(by='Sinclair (calculated)', ascending=False)[:20]

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair,Sinclair (calculated)
25909,2017-11-03,German Championships,Speyer,GER,Males,Senior,6.0,Schiffer Karl Waldemar,,2.4,...,120.0,125.0,6.0,140.0,146.0,151.0,6.0,271.0,111132.0,114978.167885
59576,2017-11-03,German Championships,Speyer,GER,Females,Senior,3.0,Winterholler Nicole,,3.5,...,69.0,73.0,3.0,85.0,89.0,92.0,3.0,162.0,21046.7,21547.167949
59577,2017-11-03,German Championships,Speyer,GER,Females,Senior,4.0,Jacobs Sarah,,3.6,...,68.0,73.0,4.0,85.0,89.0,92.0,4.0,157.0,18975.7,19420.376385
49505,2014-05-26,24 th Pan Americana Championships,Santo Domingo,DOM,Females,Senior,7.0,Lima De Araujo Monique Maria,BRA,4.7,...,105.0,105.0,4.0,121.0,124.0,126.0,8.0,221.0,22829.4,14129.191246
19283,1988-09-18,XXIV Olympics Games,Seoul,KOR,Males,Senior,1.0,Suleymanoglu Naim,TUR,59.7,...,150.5,152.5,1.0,175.0,188.5,190.0,1.0,342.5,0.0,528.874967
19664,1986-11-08,60 th World Championships,Sofia,BUL,Males,Senior,1.0,Suleymanoglu Naim,TUR,59.8,...,,,1.0,187.5,,,1.0,335.0,0.0,516.655165
1454,2022-05-28,100 th European Championships,Tirana,ALB,Males,Senior,1.0,Talakhadze Lasha,GEO,110.0,...,212.0,217.0,1.0,245.0,253.0,,1.0,462.0,510.8,510.757455
19728,1986-11-08,60 th World Championships,Sofia,BUL,Males,Senior,1.0,Zlatev Asen,BUL,82.3,...,,,1.0,225.0,,,1.0,405.0,0.0,509.587669
21248,1980-07-20,54 th World Championships,Moskau,URS,Males,Senior,1.0,Vardanyan Yurik,URS,81.7,...,172.5,177.5,1.0,205.0,215.5,222.5,1.0,400.0,0.0,505.286624
21075,1980-07-20,XXII Olympics Games,Moskau,URS,Males,Senior,1.0,Vardanyan Yurik,URS,81.7,...,172.5,177.5,1.0,205.0,215.5,222.5,1.0,400.0,0.0,505.286624


This has done a pretty decent job, at least judging by the top performances. We should remove all the absurdly large Sinclair values (those over 550 kg), and we need to update Lasha Talakhadze's bodyweight where he is listed as weighing 110 kg. Even though he is the strongest weightlifter in history, this value will mess up future calculations.
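
Before changing anything, it can help to list the rows that look wrong. A minimal sketch on toy data, assuming a 550 cutoff for the calculated Sinclair and a 10 kg floor for bodyweight (both thresholds are judgment calls, and the lifter names are placeholders):

```python
import pandas as pd

# Toy rows in the same column layout as the real table (values illustrative).
toy = pd.DataFrame({
    "Athlete Name": ["Lifter A", "Lifter B", "Lifter C"],
    "Bodyweight (kg)": [2.4, 59.7, 110.0],
    "Sinclair (calculated)": [114978.2, 528.9, 510.8],
})

MIN_BODYWEIGHT_KG = 10  # assumed floor for a plausible senior bodyweight
MAX_SINCLAIR = 550      # assumed ceiling for a plausible Sinclair

# Boolean mask marking rows that fail either plausibility check.
suspect = (toy["Bodyweight (kg)"] < MIN_BODYWEIGHT_KG) | (
    toy["Sinclair (calculated)"] > MAX_SINCLAIR
)

print(toy[suspect])   # rows to inspect or repair
clean = toy[~suspect]  # rows that pass both checks
```

Note that a mask like this will not catch a case such as the 110 kg entry, which is plausible in isolation and only wrong for that particular athlete.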

In [149]:
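def fix_bodyweights(row):
    # Bodyweights under 10 kg are data-entry errors; set them to 0 so that
    # calculate_sinclair() falls back to the raw total for these rows
    if row['Bodyweight (kg)'] < 10:
        return 0
    # The 110 kg entry for Lasha is wrong; substitute a realistic value
    if row['Bodyweight (kg)'] == 110 and row['Athlete Name'] == 'Talakhadze Lasha':
        return 183 # Lasha's weight from Wikipedia as of April 4, 2024
    return row['Bodyweight (kg)']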

In [150]:
all_senior_data_loaded['Bodyweight (kg)'] = all_senior_data_loaded.apply(fix_bodyweights, axis=1)
all_senior_data_loaded['Sinclair (calculated)'] = all_senior_data_loaded.apply(calculate_sinclair, axis=1)

In [152]:
all_senior_data_loaded.sort_values(by='Sinclair (calculated)', ascending=False)[:20]

Unnamed: 0,Date,Competition,Host City,Host Nation,Gender,Age Category,Overall Rank,Athlete Name,Athlete Nation,Bodyweight (kg),...,Snatch 2,Snatch 3,Snatch Rank,C&J 1,C&J 2,C&J 3,C&J Rank,Total (kg),Sinclair,Sinclair (calculated)
19283,1988-09-18,XXIV Olympics Games,Seoul,KOR,Males,Senior,1.0,Suleymanoglu Naim,TUR,59.7,...,150.5,152.5,1.0,175.0,188.5,190.0,1.0,342.5,0.0,528.874967
19664,1986-11-08,60 th World Championships,Sofia,BUL,Males,Senior,1.0,Suleymanoglu Naim,TUR,59.8,...,,,1.0,187.5,,,1.0,335.0,0.0,516.655165
19728,1986-11-08,60 th World Championships,Sofia,BUL,Males,Senior,1.0,Zlatev Asen,BUL,82.3,...,,,1.0,225.0,,,1.0,405.0,0.0,509.587669
21075,1980-07-20,XXII Olympics Games,Moskau,URS,Males,Senior,1.0,Vardanyan Yurik,URS,81.7,...,172.5,177.5,1.0,205.0,215.5,222.5,1.0,400.0,0.0,505.286624
21248,1980-07-20,54 th World Championships,Moskau,URS,Males,Senior,1.0,Vardanyan Yurik,URS,81.7,...,172.5,177.5,1.0,205.0,215.5,222.5,1.0,400.0,0.0,505.286624
20679,1982-09-18,56 th World Championships,Ljubljana,YUG,Males,Senior,1.0,Zlatev Asen,BUL,81.8,...,,,1.0,220.0,,,1.0,400.0,0.0,504.952189
19427,1988-09-18,XXIV Olympics Games,Seoul,KOR,Males,Senior,1.0,Zakharevich Yuri,URS,109.55,...,205.0,210.0,1.0,245.0,251.0,251.0,1.0,455.0,0.0,503.753757
20500,1983-10-22,57 th World Championships,Moskau,URS,Males,Senior,1.0,Blagoev Blagoi,BUL,89.55,...,,,1.0,227.5,,,1.0,417.5,0.0,503.153733
19565,1987-09-07,61 st World Championships,Ostrava,TCH,Males,Senior,1.0,Khrapaty Anatoli,URS,89.6,...,,,2.0,232.5,,,1.0,417.5,0.0,503.017853
6849,2014-11-04,81 st World Championships,Almaty,KAZ,Males,Senior,1.0,Liao Hui,CHN,68.68,...,160.0,166.0,1.0,185.0,193.0,,1.0,359.0,484.4,502.939595


This is looking fairly promising! It is not perfect, but it will work for our future analyses. The next step is EDA, and making sure that I can cross-reference the table from before with positive results. It is probably worthwhile to set up a database now, so that cleaned data can be queried directly.
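
One lightweight way to do that is SQLite, which pandas can write to and read from directly. The following is only a sketch under assumptions: an in-memory database, a hypothetical table name `senior_results`, and a few toy rows in place of the cleaned frame.

```python
import sqlite3

import pandas as pd

# Toy rows standing in for the cleaned data (values illustrative).
toy = pd.DataFrame({
    "Athlete Name": ["Lifter A", "Lifter B"],
    "Gender": ["Males", "Females"],
    "Total (kg)": [342.5, 221.0],
})

# ":memory:" keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")

# Write the frame as a table (hypothetical name), replacing any earlier copy.
toy.to_sql("senior_results", conn, index=False, if_exists="replace")

# Column names with spaces or parentheses need double quotes in SQL.
men = pd.read_sql_query(
    'SELECT "Athlete Name", "Total (kg)" FROM senior_results WHERE Gender = ?',
    conn,
    params=("Males",),
)
conn.close()

print(men)
```

Renaming columns to plain identifiers (for example `total_kg`) before loading would make the SQL easier to write, at the cost of diverging from the names used in the notebook.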

PROBLEM: To solve the issue of the column/series name, I think I can take the first or second (or both) rows and make them the column names. Then, once the names match, I can add them to the correct series that is (or will become) part of the data frame.

SOLVED: 4/3/2024. Instead of the above solution, I forced the columns to have specific names. This required a try/except clause to make sure that I forced the correct number of columns onto each individual table. See the function all_data() for details.
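
For reference, the general shape of that idea can be sketched as follows. This is not the code from `all_data()`: the column lists and the helper name `force_columns` are hypothetical, and the sketch assumes only two possible table widths. It relies on pandas raising `ValueError` when the number of names does not match the number of columns.

```python
import pandas as pd

# Hypothetical layouts: some scraped tables have a nation column, some do not.
FULL_COLUMNS = ["Rank", "Athlete Name", "Nation", "Total (kg)"]
SHORT_COLUMNS = ["Rank", "Athlete Name", "Total (kg)"]


def force_columns(table):
    """Return a copy of the table with one of the known column layouts applied."""
    table = table.copy()
    try:
        # Assigning a list of the wrong length raises ValueError.
        table.columns = FULL_COLUMNS
    except ValueError:
        table.columns = SHORT_COLUMNS
    return table


wide = force_columns(pd.DataFrame([[1, "Lifter A", "GEO", 462.0]]))
narrow = force_columns(pd.DataFrame([[1, "Lifter B", 400.0]]))

print(list(wide.columns))
print(list(narrow.columns))
```

A table matching neither layout still raises `ValueError` from the fallback assignment, which is arguably the right outcome: an unexpected shape should be inspected rather than silently mislabelled.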

PROBLEM: 5/20/23. Preview on GitHub not working. Reloading to hopefully fix that.