# Webscraping - A Notebook from Robin
In this notebook, I'll explore ways to scrape the information we need from:

* HuggingFace Leaderboards
* LLM-stats.com

For this purpose, I'll use Python libraries such as BeautifulSoup, MechanicalSoup, and Selenium.

In [None]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import mechanicalsoup as ms
import re
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

The [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) has a fairly simple structure.
It is a single page containing a table.
This table is scrollable and contains all the available models and their information.
There is also a search bar for filtering the table, and other advanced filters are available.

Let's first create a `StatefulBrowser` instance from `MechanicalSoup`, which enables interaction with websites.

In [None]:
browser = ms.StatefulBrowser(soup_config={"features": "lxml"}, raise_on_404=True)

In [None]:
# Then, we'll open the Hugging Face Leaderboards page.
# We can use the search bar feature directly in the URL by adding `?search=[model_name]`.
# Alternatively, interacting with the filters would require a library that handles dynamic websites, such as `Selenium`.

In [None]:
url = "https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/"
browser.open(url)
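
The `?search=` trick mentioned above can be wrapped in a small helper. This is only a sketch: `build_search_url` is a hypothetical name, and percent-encoding the model name with `urllib.parse.quote` is my assumption about what the page accepts (slashes are left untouched, as in a model ID of the form `org/model`).

```python
from urllib.parse import quote

BASE_URL = "https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/"


def build_search_url(model_name, base_url=BASE_URL):
    """Return the leaderboard URL filtered on `model_name` (hypothetical helper)."""
    # quote() keeps "/" by default and escapes spaces and other special characters.
    return f"{base_url}?search={quote(model_name)}"


# The model name below is made up for illustration.
print(build_search_url("some-org/some model"))
```

The resulting string can be passed to `browser.open(...)` or to a Selenium driver in the same way as the plain URL.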

Now, some features of the website are dynamic.
Hence, we'll set up Selenium to interact with the webpage dynamically.

First, we'll set up the driver and navigate to the desired webpage.

In [None]:
driver = webdriver.Safari()
driver.get(url)

Since the information we're looking for is nested inside an HTML 'table' tag, we can first locate this element and then look at what is inside recursively.
However, this table sits inside an iframe, an embedded document, which we need to switch to beforehand.

In [None]:
# Get the table
iframe = driver.find_element(By.XPATH, '//*[@id="iFrameResizer0"]')
driver.switch_to.frame(iframe)
table = driver.find_element(By.TAG_NAME, "table")

An HTML table contains a 'thead' element, which holds the header information.
In our case, 'thead' contains 'th' elements, each of which contains a 'p' element.
That 'p' element contains the name of the corresponding column.

Now that we know this, we can just look for the 'p' element inside each 'th' element.

In [None]:
thead = table.find_element(By.TAG_NAME, "thead")
colnames = [
    th.find_element(By.TAG_NAME, "p").text
    for th in thead.find_elements(By.TAG_NAME, "th")
]

Next up are the rows.
The rows are contained inside a 'tbody' element, and the information for each row is inside a 'tr' element.
Each 'tr' holds a list of 'td' elements, each containing a 'p' element with the cell text.

One small difference is the model name, which is actually a link. It is the first link in the row, so we can just look for it.

In [None]:
tbody = table.find_element(By.TAG_NAME, "tbody")
rows = tbody.find_elements(By.TAG_NAME, "tr")

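# Print each row as a list of cell texts
for row in rows:
    obs = [p.text for p in row.find_elements(By.TAG_NAME, "p")]
    # The model name is a link rather than a 'p', so insert it at its column position
    obs.insert(2, row.find_element(By.TAG_NAME, "a").text)
    print(obs)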
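
The structure described above (headers in 'thead', cell text in 'p' elements, the model name as a link) can also be checked offline, without a browser. The sketch below uses the standard-library `html.parser` instead of Selenium. The sample HTML and the `TableParser` class are invented for illustration and are much simpler than the real page.

```python
from html.parser import HTMLParser

# Invented, simplified sample: the real leaderboard markup is more complex.
SAMPLE_TABLE = """
<table>
  <thead><tr><th><p>Rank</p></th><th><p>Type</p></th><th><p>Model</p></th><th><p>Average</p></th></tr></thead>
  <tbody>
    <tr><td><p>1</p></td><td><p>chat</p></td><td><a href="#">org/model-a</a></td><td><p>50.1</p></td></tr>
  </tbody>
</table>
"""


class TableParser(HTMLParser):
    """Collect header names and row values from a table shaped like the sample."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._section = None  # "thead" or "tbody" while inside that element
        self._capture = None  # "p" or "a" while inside such a tag
        self._cells = []      # 'p' texts of the current row
        self._link = None     # text of the first link of the current row

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tbody"):
            self._section = tag
        elif tag == "tr" and self._section == "tbody":
            self._cells, self._link = [], None
        elif tag in ("p", "a"):
            self._capture = tag

    def handle_endtag(self, tag):
        if tag in ("p", "a"):
            self._capture = None
        elif tag == "tr" and self._section == "tbody":
            row = list(self._cells)
            if self._link is not None:
                # Same idea as obs.insert(2, ...): put the link text at its column
                row.insert(2, self._link)
            self.rows.append(row)
        elif tag in ("thead", "tbody"):
            self._section = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._capture is None:
            return
        if self._capture == "p" and self._section == "thead":
            self.headers.append(text)
        elif self._capture == "p" and self._section == "tbody":
            self._cells.append(text)
        elif self._capture == "a" and self._section == "tbody" and self._link is None:
            self._link = text


parser = TableParser()
parser.feed(SAMPLE_TABLE)
print(parser.headers)
print(parser.rows)
```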

Now that we have a framework for scraping all the information inside the table that we want, we only need to store it.
Since the information is gathered row-wise, we'll write it sequentially as a `CSV` file.

In [None]:
with open("hf_leaderboard.csv", "w") as f:
    # Write the column names (skipping the first header cell) without modifying `colnames`
    f.write(",".join(colnames[1:] + ["\n"]))

    # Loop over each row and write them sequentially
    for i, row in enumerate(rows):
        obs = [p.text for p in row.find_elements(By.TAG_NAME, "p")]
        obs.insert(2, row.find_element(By.TAG_NAME, "a").text)
        obs.append("\n")
        f.write(",".join(obs))
        print(f"Row {i} has been written!")
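
One caveat with joining the values by hand: a cell that itself contains a comma would split into two columns, and adding `"\n"` as a list element leaves a trailing comma on every line. A possible alternative, sketched here with the standard-library `csv` module, lets the writer handle quoting. The `write_rows` helper and the sample rows are made up for illustration, and the sketch writes to an in-memory buffer rather than to `hf_leaderboard.csv`.

```python
import csv
import io


def write_rows(handle, header, rows):
    """Write a header and data rows as CSV, quoting cells only when needed."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Invented sample data; the second row has a comma inside a cell.
buffer = io.StringIO()
write_rows(
    buffer,
    ["Type", "Model", "Average"],
    [["chat", "org/model-a", "50.1"], ["base", "model, with comma", "42.0"]],
)
print(buffer.getvalue())
```

With a real file, the same helper would receive the handle returned by `open("hf_leaderboard.csv", "w", newline="")`, which is how the `csv` documentation recommends opening files.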

The last thing we need to take care of is that we need to scroll down the table to load the rest of the models.
However, since I had trouble with the scrolling, I'll just use the search bar to filter the table for each model I want.

In [None]:
hf_meta = pd.read_csv("huggingface_llm_metadata.csv")
models = hf_meta.modelId.values

The first step is to write the column names to the CSV file.

In [None]:
driver = webdriver.Safari()
url = "https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/"
driver.get(url)
time.sleep(2)
iframe = driver.find_element(By.XPATH, '//*[@id="iFrameResizer0"]')
driver.switch_to.frame(iframe)
with open("hf_leaderboard.csv", "w") as f:
    table = driver.find_element(By.TAG_NAME, "table")
    thead = table.find_element(By.TAG_NAME, "thead")
    colnames = [
        th.find_element(By.TAG_NAME, "p").text
        for th in thead.find_elements(By.TAG_NAME, "th")
    ]
    # Write columns or variable names
    colnames.append("\n")
    f.write(",".join(colnames[1:]))

Now that we have the column names written, we can append the rest of the rows to it.

**Remarks**: I encountered some issues when switching pages.
Hence, I had to close the driver and open it again every time.
Additionally, it is necessary to give the driver time to open the page and scrape it. Incorporating waiting time is therefore of utmost importance; otherwise the scraping fails.
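
Fixed pauses work, but they wait the full duration even when the page is ready earlier, and they can still be too short on a slow connection. A common alternative is to poll for a condition until a timeout expires. The sketch below shows that idea with the standard library only. `wait_until` is a hypothetical helper, and the fake lookup stands in for a page that becomes ready after a few checks.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page whose table only appears on the third check.
calls = {"count": 0}


def fake_table_lookup():
    calls["count"] += 1
    return ["<table>"] if calls["count"] >= 3 else []


print(wait_until(fake_table_lookup, timeout=2.0, poll=0.01))
```

With a real driver, the condition could be `lambda: driver.find_elements(By.TAG_NAME, "table")`, since `find_elements` returns an empty list while nothing matches. Selenium's own `WebDriverWait` and `expected_conditions`, both imported at the top of this notebook, provide the same pattern.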

In [None]:
with open("hf_leaderboard.csv", "a") as f:
    for model in models:
        driver = webdriver.Safari()
        time.sleep(5)
        try:
            # Open the webpage with the filter and give it time to load completely
            driver.get(url + f"?search={model}")
            time.sleep(5)
            # Switch to the corresponding iframe
            iframe = driver.find_element(By.XPATH, '//*[@id="iFrameResizer0"]')
            driver.switch_to.frame(iframe)
            # Get the table, its body, and the first row (the first model in the search)
            table = driver.find_element(By.TAG_NAME, "table")
            tbody = table.find_element(By.TAG_NAME, "tbody")
            row = tbody.find_element(By.TAG_NAME, "tr")
            # Get all the information from the row
            obs = [p.text for p in row.find_elements(By.TAG_NAME, "p")]
            # Make sure to also get the model name, which is a link
            obs.insert(2, row.find_element(By.TAG_NAME, "a").text)
        except NoSuchElementException:
            # The model is unavailable on the website: skip it
            continue
        finally:
            # Always close the driver, whether or not the model was found
            driver.close()
        # Write everything to the CSV file as a new row
        obs.append("\n")
        f.write(",".join(obs))