# RXposé (Version 2.0) #

As of March 2024, I greatly prefer working in Python over R. In this file, I will build a better pipeline than last year's R version and (hopefully) incorporate a better data source so that I can finish the original analysis. Ideally, this project will also produce a searchable database, allowing other contributors to perform future analyses.

## Data Collection ##

The first step is to collect better data. [This website](https://iwrp.net/) has results in PDF format going back to 1928. Since this format is not helpful for computer analysis, we need to scrape it and convert it to a .csv or similarly usable file.

In [72]:
# Imports for data collection
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

import re

import pandas as pd

In [27]:
# Write function to safely download PDF files from a given link

# This function is not useful anymore, but we will use it as a model for logic

def download_pdfs(url, save_directory):
    # Send GET request to URL
    response = requests.get(url)
    
    # Parse the HTML
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all links on the page to sort later
    links = soup.find_all('a')
    
    print(links)
    print(len(links))
    
    # Ensure directory is correctly set
    if not os.path.exists(save_directory):
        os.makedirs(save_directory)

    # Loop through links, downloading only the PDFs
    for link in links[:5]:
        href = link.get('href')
        if href:
            # Make correct URL for GET request
            absolute_url = urljoin(url, href)
            
            linked_response = requests.get(absolute_url)
            
            linked_soup = BeautifulSoup(linked_response.content, 'html.parser')
            
            tables = linked_soup.find_all('table')
            
            filename = os.path.basename(absolute_url)
            
            with open(os.path.join(save_directory, filename), 'wb') as f:
                pdf_response = requests.get(absolute_url)
                f.write(pdf_response.content)
        
            print(f"Downloaded {filename} to {save_directory}")

In [22]:
# Download files

download_pdfs('https://iwrp.net/', '/Users/aaronkeeney/Documents/Data_Analytics_Projects/Rxpose')

[<a href="/">IWRP</a>, <a href="/global-statistics">Global Statistics</a>, <a href="/../"><div class="header__logo"></div></a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2753">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2748">48 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2749">27 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2746">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2747">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2745">28 th Waldemar Malak Memorial	</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2743">88 th World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2744">30 th World Championships</a>, <a href="/component/cwyn

In [2]:
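def download_table(url):
    # Local import so this cell runs on its own
    from io import StringIO

    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    tables = soup.find_all('table')

    if tables:
        # Wrap the HTML in StringIO; passing a literal HTML string to read_html is deprecated
        df = pd.read_html(StringIO(str(tables)))[0]
        return df

    print("No table found on page")
    return None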

In [3]:
df = download_table('https://iwrp.net/')


In [4]:
print(df.head())
print(df.size)  # .size counts cells (rows x columns), not rows

         Date                              Name        Place Nation   Gender  \
0  2023-12-04                    IWF Grand Prix         Doha    QAT    Males   
1  2023-12-04                    IWF Grand Prix         Doha    QAT  Females   
2  2023-11-15  48 th Junior World Championships  Guadalajara    MEX    Males   
3  2023-11-15  27 th Junior World Championships  Guadalajara    MEX  Females   
4  2023-10-20  11 th Youth Polish Championships     Biłgoraj    POL    Males   

  Age category  
0       Senior  
1       Senior  
2       Junior  
3       Junior  
4     Youth 15  
15510
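
Note that `df.size` is the number of cells (rows times columns), so 15510 here corresponds to 2585 rows of 6 columns. The sketch below shows the difference on a two-row frame built from the first two rows printed above; the variable names are only illustrative.

```python
import pandas as pd

# Two-row frame with the same six columns as the scraped table
toy = pd.DataFrame({
    "Date": ["2023-12-04", "2023-12-04"],
    "Name": ["IWF Grand Prix", "IWF Grand Prix"],
    "Place": ["Doha", "Doha"],
    "Nation": ["QAT", "QAT"],
    "Gender": ["Males", "Females"],
    "Age category": ["Senior", "Senior"],
})

n_cells = toy.size                  # rows x columns
n_rows = len(toy)                   # rows only
n_rows_alt, n_cols = toy.shape      # (rows, columns)

# Counting rows for one gender: sum a boolean mask instead of using .size
males = int((toy["Gender"] == "Males").sum())
```

For row counts, `len(df)` or `df.shape[0]` is probably what is wanted in the cells that follow.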


In [5]:
# .size counts cells, so each figure below is the number of events times 6 columns
print(df[df['Gender'] == 'Males'].size)
print(df[df['Gender'] == 'Females'].size)

6276
4656


In [6]:
df.dtypes

Date            object
Name            object
Place           object
Nation          object
Gender          object
Age category    object
dtype: object

In [7]:
def is_valid_date(date_string):
    try:
        pd.to_datetime(date_string)
        return True
    except ValueError:
        return False

In [8]:
df['Date'] = pd.to_datetime(df['Date'])


ValueError: time data "1979-00-00" doesn't match format "%Y-%m-%d", at position 983. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

In [9]:
valid_dates_mask = df['Date'].apply(is_valid_date)

invalid_dates = df[~valid_dates_mask]
invalid_dates

Unnamed: 0,Date,Name,Place,Nation,Gender,Age category
2538,1979-00-00,57 th European Championships,Varna,BUL,Males,Senior
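
As an alternative to checking each value with a helper function, `pd.to_datetime` can be asked to coerce failures to `NaT`, which makes the bad rows easy to mask. This is a minimal sketch on three sample strings (two valid dates from the table plus the malformed 1979 entry), not a replacement for the cell above.

```python
import pandas as pd

# Sample values: two valid dates from the table plus the malformed 1979 entry
dates = pd.Series(["2023-12-04", "2023-11-15", "1979-00-00"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# Values that were present originally but failed to parse
invalid_mask = parsed.isna() & dates.notna()
invalid = dates[invalid_mask]
```

On the full dataframe, the same mask could be used as `df[invalid_mask]` to list the offending rows.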


In [10]:
# Quick internet search for the correct date

def replace_invalid_date(date_string):
    try:
        pd.to_datetime(date_string)
        return date_string
    except ValueError:
        # Only one row fails to parse (the 1979 European Championships), so one hard-coded date is enough
        return '1979-05-19'
    
df['Date'] = df['Date'].apply(replace_invalid_date)
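
The function above substitutes the same date for any value that fails to parse, which is fine while there is only one bad row. A slightly stricter sketch is to map known bad values to their researched replacements, so that any new malformed date still raises an error instead of silently becoming 1979-05-19. The names `DATE_FIXES` and `fix_known_dates` are hypothetical.

```python
import pandas as pd

# Hypothetical lookup of known-bad values and their researched replacements
DATE_FIXES = {"1979-00-00": "1979-05-19"}

def fix_known_dates(dates: pd.Series) -> pd.Series:
    """Replace only the listed values, then parse; unknown bad values still raise."""
    return pd.to_datetime(dates.replace(DATE_FIXES))

sample = pd.Series(["2023-12-04", "1979-00-00"])
fixed = fix_known_dates(sample)
```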

In [11]:
df['Date'] = pd.to_datetime(df['Date'])

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2585 entries, 0 to 2584
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          2585 non-null   datetime64[ns]
 1   Name          2585 non-null   object        
 2   Place         2583 non-null   object        
 3   Nation        2585 non-null   object        
 4   Gender        2585 non-null   object        
 5   Age category  2585 non-null   object        
dtypes: datetime64[ns](1), object(5)
memory usage: 121.3+ KB


Okay, what I actually need to do for now:

1. I need a single dataframe with all the results, including the competition date. I think I need to go back and store a datetime object instead of a numerical year. That lets me avoid duplicate rows, since a lifter's name, total, and competition date together uniquely identify a performance. Not all of this data will be used in this analysis, but we want to be able to easily search for all international performances of any lifter.

2. I then need to retrieve the actual lifting information from the links. The smaller dataframe generated by importing the table(s) at each link can then be appended to the large dataframe. The date and the weight category must be included in each row as part of this transformation. This should be fairly simple: create the dataframe with the necessary columns, then populate the scraped values as additional columns.
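
The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the lifter rows are made up, the `Total` and `Weight class` column names are guesses at what the scraped tables will provide, and `tag_results` is a hypothetical helper.

```python
import pandas as pd

def tag_results(results: pd.DataFrame, date: pd.Timestamp, weight_class: str) -> pd.DataFrame:
    """Return a copy of one weight-class table with the event date and category on every row."""
    tagged = results.copy()
    tagged["Date"] = date
    tagged["Weight class"] = weight_class
    return tagged

# Made-up rows for illustration only (not real results)
table_a = pd.DataFrame({"Surname and name": ["Lifter A", "Lifter B"], "Total": [250, 245]})
table_b = pd.DataFrame({"Surname and name": ["Lifter C"], "Total": [300]})

event_date = pd.Timestamp("2023-12-04")
all_results = pd.concat(
    [
        tag_results(table_a, event_date, "55 kg"),
        tag_results(table_b, event_date, "61 kg"),
        tag_results(table_a, event_date, "55 kg"),  # the same table scraped twice
    ],
    ignore_index=True,
)

# Name, total, and date together identify a performance, so repeats can be dropped
all_results = all_results.drop_duplicates(subset=["Surname and name", "Total", "Date"])
```

Tagging each small table before concatenating keeps the date and weight class attached to every row, so nothing has to be reconstructed from position afterwards.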

In [64]:
# Collect every link on the page; the event links are separated out in a later cell

def download_links(url: str, events_list: list) -> list:
    # events_list is currently unused; it is kept so the existing call still works

    # Send GET request to URL
    response = requests.get(url)
    
    # Parse the HTML
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all links on the page to sort later
    links = soup.find_all('a')
    
    return list(links)
    

In [65]:
links = download_links('https://iwrp.net/', df['Name'])
print(type(links))

<class 'list'>


In [66]:
print(links[:10])
print(len(links))
print(len(df['Name']))


[<a href="/">IWRP</a>, <a href="/global-statistics">Global Statistics</a>, <a href="/../"><div class="header__logo"></div></a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2753">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2748">48 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2749">27 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2746">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2747">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2745">28 th Waldemar Malak Memorial	</a>]
2590
2585


In [67]:
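# Drop the 3 navigation links at the start and the last 2 links, leaving one link per event (2590 - 5 = 2585)
links = links[3:-2]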
links[:10]

[<a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2753">IWF Grand Prix</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2748">48 th Junior World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2749">27 th Junior World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2746">11 th Youth Polish Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2747">11 th Youth Polish Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2745">28 th Waldemar Malak Memorial	</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2743">88 th World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2744">30 th World Championships</a>,
 <a href="/component/cwyniki/?view=contest&amp;id_zawody=2741">37th Asian JuniorChampionship</a>]

In [68]:
len(links)

2585
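
Slicing by position works here, but it depends on the page layout staying the same. A possibly more robust sketch is to keep only links whose href carries the `id_zawody` query parameter, which every event link in the output above has. The sample hrefs are copied from that output with `&amp;` unescaped to `&` (which is how BeautifulSoup's `link.get('href')` should return them); `is_contest_href` is a hypothetical helper.

```python
from urllib.parse import urlparse, parse_qs

def is_contest_href(href: str) -> bool:
    """True if the href carries an id_zawody query parameter, as the event links do."""
    query = parse_qs(urlparse(href).query)
    return "id_zawody" in query

# hrefs copied from the printed output above
hrefs = [
    "/",
    "/global-statistics",
    "/../",
    "/component/cwyniki/?view=contest&id_zawody=2752",
    "/component/cwyniki/?view=contest&id_zawody=2753",
]

contest_hrefs = [h for h in hrefs if is_contest_href(h)]
```

If the filtered list has the same length as `df`, that is a useful sanity check that each row has exactly one link.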

In [70]:
print(links)
print(links[::-1])

[<a href="/component/cwyniki/?view=contest&amp;id_zawody=2752">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2753">IWF Grand Prix</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2748">48 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2749">27 th Junior World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2746">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2747">11 th Youth Polish Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2745">28 th Waldemar Malak Memorial	</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2743">88 th World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2744">30 th World Championships</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawody=2741">37th Asian JuniorChampionship</a>, <a href="/component/cwyniki/?view=contest&amp;id_zawod

In [61]:
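# Note: str(links) is one string holding the entire list, so every row receives the same value
# (see the identical 'link' entries in the output); only the first href is usable from it
df['link'] = str(links)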
df.head()

Unnamed: 0,Date,Name,Place,Nation,Gender,Age category,link
0,2023-12-04,IWF Grand Prix,Doha,QAT,Males,Senior,"[<a href=""/component/cwyniki/?view=contest&amp..."
1,2023-12-04,IWF Grand Prix,Doha,QAT,Females,Senior,"[<a href=""/component/cwyniki/?view=contest&amp..."
2,2023-11-15,48 th Junior World Championships,Guadalajara,MEX,Males,Junior,"[<a href=""/component/cwyniki/?view=contest&amp..."
3,2023-11-15,27 th Junior World Championships,Guadalajara,MEX,Females,Junior,"[<a href=""/component/cwyniki/?view=contest&amp..."
4,2023-10-20,11 th Youth Polish Championships,Biłgoraj,POL,Males,Youth 15,"[<a href=""/component/cwyniki/?view=contest&amp..."
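
Every row in the output above holds the same string, because a single string assigned to a column is broadcast to all rows. A minimal sketch of a per-row assignment is shown below, assuming the links appear in the same order as the table rows (the matching counts and names suggest this, but it is not verified). `BASE_URL` and the two-row `events` frame are illustrative stand-ins for the real data.

```python
import pandas as pd
from urllib.parse import urljoin

BASE_URL = "https://iwrp.net/"

# Two events and their hrefs, taken from the outputs above
events = pd.DataFrame({
    "Name": ["IWF Grand Prix", "IWF Grand Prix"],
    "Gender": ["Males", "Females"],
})
hrefs = [
    "/component/cwyniki/?view=contest&id_zawody=2752",
    "/component/cwyniki/?view=contest&id_zawody=2753",
]

# Guard against silently misaligned rows before assigning
if len(hrefs) != len(events):
    raise ValueError("number of links does not match number of events")

# A list assigns one value per row, unlike a single string, which is broadcast to all rows
events["link"] = [urljoin(BASE_URL, h) for h in hrefs]
```

Storing the absolute URL directly would also remove the need to pull the href back out of a string with a regular expression later.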


In [145]:
def all_data(guiding_data_frame: pd.DataFrame) -> None:
    # Local imports so this cell runs on its own
    from html import unescape
    from io import StringIO

    pattern = re.compile(r'"([^"]*)"')  # first double-quoted string, i.e. the first href
    for link in guiding_data_frame['link'][:1]:  # Consider removing [:1] for the full dataframe
        match = pattern.search(link)
        if match:
            # str(tag) escapes '&' as '&amp;', so unescape before building the URL
            event_link = unescape(match.group(1))
            absolute_url = urljoin('https://iwrp.net/', event_link)

            response = requests.get(absolute_url)
            if response.status_code == 200:  # Check if request was successful
                soup = BeautifulSoup(response.content, 'html.parser')
                tables = soup.find_all('table')
                print(len(tables))
                for table in tables:
                    # Wrap the HTML in StringIO; passing a literal HTML string to read_html is deprecated
                    print(pd.read_html(StringIO(str(table)))[0])
                    # Do something with each table, such as concatenating it with another DataFrame
                    # Example: final_df = pd.concat([final_df, temp_df], ignore_index=True)
            else:
                print(f"Failed to fetch data from {absolute_url}. Status code: {response.status_code}")
        else:
            print(f"No match found for link: {link}")

In [146]:
all_data(df)

10




       55 kg                                                                \
       61 kg                                                                 
       67 kg                                                                 
       73 kg                                                                 
       81 kg                                                                 
       89 kg                                                                 
       96 kg                                                                 
      102 kg                                                                 
      109 kg                                                                 
    + 109 kg                   + 109 kg.1 + 109 kg.2 + 109 kg.3 + 109 kg.4   
0         Pl             Surname and name     Nation        B.W        Gr.   
1         Pl             Surname and name     Nation        B.W        Gr.   
2          1                 Pang Un Chol        PRK      54.94 



       61 kg                                                               \
       67 kg                                                                
       73 kg                                                                
       81 kg                                                                
       89 kg                                                                
       96 kg                                                                
      102 kg                                                                
      109 kg                                                                
    + 109 kg                  + 109 kg.1 + 109 kg.2 + 109 kg.3 + 109 kg.4   
0         Pl            Surname and name     Nation        B.W        Gr.   
1         Pl            Surname and name     Nation        B.W        Gr.   
2          1               Pak Myong Jin        PRK      60.97          A   
3          2          Ceniza John Fabuar        PHI      61.00          A   



       73 kg                                                               \
       81 kg                                                                
       89 kg                                                                
       96 kg                                                                
      102 kg                                                                
      109 kg                                                                
    + 109 kg                  + 109 kg.1 + 109 kg.2 + 109 kg.3 + 109 kg.4   
0         Pl            Surname and name     Nation        B.W        Gr.   
1         Pl            Surname and name     Nation        B.W        Gr.   
2          1            Suharevs Ritvars        LAT      73.00          A   
3          2                 Shi Zhiyong        CHN      73.00          A   
4          3           Wichuma Weeraphon        THA      73.00          A   
..       ...                         ...        ...        ...        ...   



       89 kg                                                               \
       96 kg                                                                
      102 kg                                                                
      109 kg                                                                
    + 109 kg                  + 109 kg.1 + 109 kg.2 + 109 kg.3 + 109 kg.4   
0         Pl            Surname and name     Nation        B.W        Gr.   
1         Pl            Surname and name     Nation        B.W        Gr.   
2          1                Nasar Karlos        BUL      88.02          A   
3          2          Lopez Lopez Yeison        COL      88.98          A   
4          3          Pizzolato Antonino        ITA      89.00          A   
..       ...                         ...        ...        ...        ...   
118      NaN                Minasyan Gor        ARM     150.00          B   
119      NaN             Coullet Anthony        FRA     152.19          A   



Something is odd about how these tables are constructed: the list of weight classes is being interpreted as the column names, while the real header (Pl, Surname and name, Nation, B.W, Gr.) shows up as the first data rows. I don't know if there is a way to fix this directly. I would like to ignore those labels, since bodyweight and Sinclair are already calculated for me...

With that in mind, it appears that all relevant information is in the first table, so I can probably just take the first table from each page.
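
One possible workaround for the header problem, sketched under assumptions: discard the misparsed column labels, promote the first data row (the real header) to column names, and drop any later rows that just repeat it. The toy frame below only mimics the shape of the printed output; the lifter rows are placeholders and `promote_header_row` is a hypothetical helper, so this has not been checked against the real pages.

```python
import pandas as pd

def promote_header_row(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as the header and drop later rows that repeat it."""
    header = raw.iloc[0].tolist()
    body = raw.iloc[1:].copy()
    body.columns = header  # replaces the weight-class labels
    is_repeat = body.apply(lambda row: row.tolist() == header, axis=1)
    return body[~is_repeat].reset_index(drop=True)

# Toy frame mimicking the misparsed output: weight classes as column labels,
# the real header repeated in the first two rows (lifter rows are placeholders)
raw = pd.DataFrame(
    [
        ["Pl", "Surname and name", "Nation"],
        ["Pl", "Surname and name", "Nation"],
        ["1", "Lifter A", "AAA"],
        ["2", "Lifter B", "BBB"],
    ],
    columns=pd.MultiIndex.from_tuples(
        [("55 kg", "+ 109 kg"), ("55 kg", "+ 109 kg.1"), ("55 kg", "+ 109 kg.2")]
    ),
)

clean = promote_header_row(raw)
```

The weight class itself would still need to come from somewhere else, for example from the position of each table on the page.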

I emailed the owner to ask for access to their data for this project. Going to pause here and wait for a bit.