# ClimateQ&A
---
**Goal of the notebook: Scraping all charts from Our World In Data**

Inputs of the notebook:
- *blablabli blablablou*

Output of the notebook:
- *blablabli blablablou*

Takeaways:
- *blablabli blablablou*

## Questions, remarks and ideas:
- *blablabli blablablou*


Prio High:
- Do we scrape all charts, or only the categories of topics linked to ClimateQ&A?
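
If the answer is to keep only the related categories, one low-cost option is to scrape everything once and filter afterwards. The sketch below assumes chart records shaped like the dicts built later in this notebook. `RELEVANT_CATEGORIES` and `filter_charts` are hypothetical names, and which categories actually count as relevant is still the open question.

```python
# Hypothetical allow-list: which categories are "linked to ClimateQ&A" is undecided,
# so this set is a placeholder to be replaced by the real selection.
RELEVANT_CATEGORIES = {"Access to Energy"}


def filter_charts(charts, categories):
    """Keep only the chart records whose category is in `categories` (case-insensitive)."""
    wanted = {name.casefold() for name in categories}
    return [chart for chart in charts if chart["category"].casefold() in wanted]


# Two sample records using category names that appear in the scraped output
sample_charts = [
    {"category": "Access to Energy", "title": "Number of people without access to clean fuels for cooking"},
    {"category": "Working Hours", "title": "Weekly working hours"},
]
climate_charts = filter_charts(sample_charts, RELEVANT_CATEGORIES)
```

Filtering after scraping keeps the full dataset available if the category selection changes later.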

Prio Mid:
- *blablabli blablablou*

Prio Low:
- How to dynamically build the embed link with the right parameters (country or countries, time period, tab: table, chart or map):
  1. Bind a function to the LLM that returns the arguments: country, time, tab
  2. Write a function that inserts these arguments into the link and returns the matching embed link
  3. If the embed link works, great. If it does not, fall back to other parameters that do work and have the LLM apologize: "Sorry, we don't have the exact information you are looking for, but this is closely related."
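
Step 2 above could look like the following sketch. It assumes the grapher pages accept `tab`, `country` (codes joined by `~`) and `time` (`start..end`) as query parameters; this should be checked against real chart URLs before relying on it. `build_embed_url` and `build_embed_iframe` are hypothetical helper names, not part of the existing code base.

```python
from html import escape
from urllib.parse import urlencode

# Assumption: these are the tab values the grapher understands.
ALLOWED_TABS = ("chart", "map", "table")


def build_embed_url(url, countries=None, time_range=None, tab="chart"):
    """Return the chart URL with the requested view parameters appended."""
    if tab not in ALLOWED_TABS:
        tab = "chart"  # unknown tab requested: fall back to the default view
    params = {"tab": tab}
    if countries:
        params["country"] = "~".join(countries)  # assumed separator
    if time_range:
        start, end = time_range
        params["time"] = f"{start}..{end}"  # assumed range syntax
    return f"{url}?{urlencode(params, safe='~.')}"


def build_embed_iframe(url, **kwargs):
    """Wrap the parameterized URL in the iframe markup used in this notebook."""
    src = escape(build_embed_url(url, **kwargs), quote=True)
    return (
        f'<iframe src="{src}" loading="lazy" '
        'style="width: 100%; height: 600px; border: 0px none;" '
        'allow="web-share; clipboard-write"></iframe>'
    )


example = build_embed_url(
    "https://ourworldindata.org/grapher/number-with-without-clean-cooking-fuels",
    countries=["FRA", "USA"],
    time_range=(2000, 2020),
    tab="chart",
)
```

The sketch only validates the tab name. Checking that a given combination of parameters actually renders (step 3) would require loading the link, which is not shown here.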

## Dependencies and path
Adjust the argument in `sys.path.append` to align with your specific requirements.

In [171]:
# Standard library imports
import os
import sys

# Third-party imports
import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Change this path to align with your specific requirements
base_dir = '/home/dora/climate-question-answering'
sys.path.append(os.path.join(base_dir, 'climateqa'))

# Load environment variables
load_dotenv()

# IPython extension for auto-reloading (automatically reloads modules before executing code)
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Introduction
Scrape all charts from https://ourworldindata.org/charts and not explorers:
- All the information covered by charts under the tab "Explorers" seems to be contained in the charts under the "Charts" tabs
- The interactive added value of the explorer is already captured by the LLM

Elements to scrape:
- Category of the graph
- Title of the graph
- Subtitle of the graph
- Embedding link
- URL to the page
- Later: any additional metadata as well, found under `Learn more about this data` > `select an indicator` (sometimes) > `"What you should know about this indicator/data"` and `"How this data is described by its producer"` (NOT "Additional information about this data", which mostly discusses how the data was collected, by whom, etc.)

### 1.1 Set up chromedriver and chrome in WSL

In [172]:
import time
import os.path
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless") # Ensure GUI is off
chrome_options.add_argument("--no-sandbox")

# Set path to chrome/chromedriver as per your configuration
homedir = os.path.expanduser("~")
chrome_options.binary_location = f"{homedir}/chrome-linux64/chrome"
webdriver_service = Service(f"{homedir}/chromedriver-linux64/chromedriver")

# Choose Chrome Browser
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

# Use WebDriverWait to allow dynamic page content to load
wait = WebDriverWait(driver, 10)

## 2. Scraping charts
### 2.1 Test for one chart
https://ourworldindata.org/grapher/number-with-without-clean-cooking-fuels

Pseudo code:
1. Get all the section headers `<section[i]> <h2>`
2. For loop through all the section headers
   1. `category = <section[i]> <h2> .text`
   2. For loop through all the child elements `> ul/li[i]`
      1. Get the `<a>` element and read:
         - `url = .href`
         - `embedding = f'<iframe src="{url}?tab=map" ...></iframe>'`
         - `title = .text`
         - `category` (taken from the enclosing section header)
      2. Navigate to the url and get:
         - `Subtitle`
         - Later:
           - Any additional metadata as well, found under `Learn more about this data` > `select an indicator` (sometimes) > `"What you should know about this indicator/data"` and `"How this data is described by its producer"` (NOT "Additional information about this data", which mostly discusses how the data was collected, by whom, etc.)

In [173]:
from IPython.display import display, Markdown

# Get page
driver.get("https://ourworldindata.org/charts")

# Find all section headers (one <h2> per chart category)
sections = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//section/h2")))

# Test with a single section
section = sections[1]
category = section.text
# Find the ul element that is a sibling to the current h2 section
ul_element = section.find_element(By.XPATH, './following-sibling::ul')

# Find all a elements within the li elements of the ul
items = ul_element.find_elements(By.XPATH, './/li/a')
url = items[-1].get_attribute("href")
embedding = f"""<iframe src="{url}?tab=map" loading="lazy" style="width: 100%; height: 600px; border: 0px none;" allow="web-share; clipboard-write"></iframe>"""


driver.get(url)

subtitle_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.markdown-text-wrap")))
subtitle_element.text.replace("\n", " ")

'Having access to electricity is defined in international statistics as having an electricity source that can provide very basic lighting, and charge a phone or power a radio for 4 hours per day. Primary energy is measured in kilowatt-hours per person, using the substitution method.'

### 2.2 For all charts

In [174]:
# Get page
home_url = "https://ourworldindata.org/charts"
driver.get(home_url)

# Find all section headers (one <h2> per chart category)
sections = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//section/h2")))

# Initialize an empty list to store one dict per chart
charts = []

for section in sections[1:]:
    category = section.text

    # Find the ul element that is a sibling to the current h2 section
    ul_element = section.find_element(By.XPATH, './following-sibling::ul')

    # Find all a elements within the li elements of the ul
    items = ul_element.find_elements(By.XPATH, './/li/a')

    for item in items:
        url = item.get_attribute("href")
        title = item.text
        chart = {
            "category": category,
            "title": title,
            "url": url,
            "embedding": f"""<iframe src="{url}?tab=map" loading="lazy" style="width: 100%; height: 600px; border: 0px none;" allow="web-share; clipboard-write"></iframe>""",
        }
        charts.append(chart)

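from selenium.common.exceptions import TimeoutException

# Visit each chart page to collect its subtitle
for chart in charts:
    driver.get(chart["url"])
    try:
        subtitle_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.markdown-text-wrap")))
        chart["subtitle"] = subtitle_element.text.replace("\n", " ")
    except TimeoutException:
        # No subtitle element appeared within the wait window: keep the chart with an empty subtitle
        chart["subtitle"] = ""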

In [None]:
charts

[{'category': 'Access to Energy',
  'title': 'Number of people with and without access to clean cooking fuels',
  'url': 'https://ourworldindata.org/grapher/number-with-without-clean-cooking-fuels',
  'embedding': '<iframe src="https://ourworldindata.org/grapher/number-with-without-clean-cooking-fuels?tab=map" loading="lazy" style="width: 100%; height: 600px; border: 0px none;" allow="web-share; clipboard-write"></iframe>',
  'subtitle': 'Clean cooking fuels and technologies represent non-solid fuels such as natural gas, ethanol or electric technologies.'},
 {'category': 'Access to Energy',
  'title': 'Number of people without access to clean fuels for cooking',
  'url': 'https://ourworldindata.org/grapher/number-without-clean-cooking-fuel',
  'embedding': '<iframe src="https://ourworldindata.org/grapher/number-without-clean-cooking-fuel?tab=map" loading="lazy" style="width: 100%; height: 600px; border: 0px none;" allow="web-share; clipboard-write"></iframe>',
  'subtitle': 'Clean cook

## 3. Load as Pandas dataframe and export

In [None]:
df = pd.DataFrame(charts)
df

| | category | title | url | embedding | subtitle |
|---|---|---|---|---|---|
| 0 | Access to Energy | Number of people with and without access to cl... | https://ourworldindata.org/grapher/number-with... | `<iframe src="https://ourworldindata.org/graphe...` | Clean cooking fuels and technologies represent... |
| 1 | Access to Energy | Number of people without access to clean fuels... | https://ourworldindata.org/grapher/number-with... | `<iframe src="https://ourworldindata.org/graphe...` | Clean cooking fuels and technologies represent... |
| 2 | Access to Energy | People without clean fuels for cooking, by wor... | https://ourworldindata.org/grapher/people-with... | `<iframe src="https://ourworldindata.org/graphe...` | Data source: World Bank |
| 3 | Access to Energy | Share of the population without access to clea... | https://ourworldindata.org/grapher/share-of-th... | `<iframe src="https://ourworldindata.org/graphe...` | Access to clean fuels or technologies such as ... |
| 4 | Access to Energy | Share with access to electricity vs. per capit... | https://ourworldindata.org/grapher/share-with-... | `<iframe src="https://ourworldindata.org/graphe...` | Having access to electricity is defined in int... |
| ... | ... | ... | ... | ... | ... |
| 6425 | Working Hours | Weekly hours dedicated to home production in t... | https://ourworldindata.org/grapher/weekly-hour... | `<iframe src="https://ourworldindata.org/graphe...` | |
| 6426 | Working Hours | Weekly hours worked by age group, United States | https://ourworldindata.org/grapher/average-wee... | `<iframe src="https://ourworldindata.org/graphe...` | |
| 6427 | Working Hours | Weekly working hours | https://ourworldindata.org/grapher/work-hours-... | `<iframe src="https://ourworldindata.org/graphe...` | |
| 6428 | Working Hours | Weekly working hours vs. hourly wage, by wage ... | https://ourworldindata.org/grapher/working-hou... | `<iframe src="https://ourworldindata.org/graphe...` | |


In [None]:
df.to_csv(os.path.join(base_dir, "data/owid_charts.csv"), index=False)