**Website:** https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_(United_Nations)

**What I will scrape:**
I will extract a list of all countries, including:
- **Country name**
- **Type** (sovereign state or dependency)
- **Total population**

**Why suitable:**
- Fully static HTML table
- No login or authentication required
- Lists all countries on a single page, making it easy to scrape

In [74]:
!pip install requests beautifulsoup4 pandas lxml




In [75]:
from __future__ import annotations
from typing import Any, Dict, List, Optional
import re
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd


In [76]:
# Fetcher
# Default request headers

DEFAULT_HEADERS: Dict[str, str] = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:143.0) Gecko/20100101 Firefox/143.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.5",
    "Connection": "keep-alive",
}

def fetch_html(url: str, headers: Optional[Dict[str, str]] = None, timeout_s: float = 15.0) -> str:
    """Fetch a page and return its HTML; raises requests.HTTPError on a 4xx/5xx response."""
    # Caller-supplied headers override the defaults
    merged_headers: Dict[str, str] = {**DEFAULT_HEADERS, **(headers or {})}
    response = requests.get(url, headers=merged_headers, timeout=timeout_s)
    response.raise_for_status()

    # Politeness delay after each request so back-to-back calls do not overwhelm the server
    time.sleep(1.5)

    return response.text

# Test Fetch
try:
    html_preview: str = fetch_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_(United_Nations)")
    print(html_preview[:500])  # preview first 500 characters
except Exception as e:
    print(f"Fetch failed: {e}")


<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vect


In [77]:
# Parse function
import re
from typing import Any, Dict, List

def parse_countries_page(html: str) -> List[Dict[str, Any]]:
    """Return one dict per data row of the first wikitable whose first header cell mentions "Country"."""
    soup = BeautifulSoup(html, "html.parser")

    # Find all tables with class "wikitable"
    tables = soup.find_all("table", class_="wikitable")
    table = None

    # Select the first table whose first header cell contains "Country"
    for t in tables:
        header = t.find("th")
        if header and "Country" in header.get_text():
            table = t
            break

    if table is None:
        print("No table found!")
        return []

    data: List[Dict[str, Any]] = []

    # Iterate over table rows, skipping the header row
    for tr in table.find_all("tr")[1:]:
        # Extract text from each cell (td or th)
        cols = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]

        # Skip rows with fewer than five cells
        if len(cols) < 5:
            continue

        # Strip footnote markers such as "[a]" from the second cell.
        # NOTE: keys are assigned purely by cell position. In the sample output
        # below, cols[0] holds the country name, so the labels are shifted and
        # 'China[a]' keeps its marker.
        country_name_clean = re.sub(r"\[.*?\]", "", cols[1]).strip()

        country_info = {
            "Rank": cols[0],
            "Country": country_name_clean,  # second cell with footnote markers removed
            "Population": cols[2],
            "Date": cols[3],
            "Source": cols[4],
        }
        data.append(country_info)

    return data


# Run scraping
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_(United_Nations)"
html = fetch_html(url)
countries = parse_countries_page(html)

print(f"Extracted {len(countries)} countries")
for c in countries[:5]:
    print(c)


Extracted 238 countries
{'Rank': 'World', 'Country': '8,021,407,192', 'Population': '8,091,734,930', 'Date': '+0.88%', 'Source': '–'}
{'Rank': 'India', 'Country': '1,425,423,212', 'Population': '1,438,069,596', 'Date': '+0.89%', 'Source': 'Asia'}
{'Rank': 'China[a]', 'Country': '1,425,179,569', 'Population': '1,422,584,933', 'Date': '−0.18%', 'Source': 'Asia'}
{'Rank': 'United States', 'Country': '341,534,046', 'Population': '343,477,335', 'Date': '+0.57%', 'Source': 'Americas'}
{'Rank': 'Indonesia', 'Country': '278,830,529', 'Population': '281,190,067', 'Date': '+0.85%', 'Source': 'Asia'}
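
The sample output above suggests that the dictionary keys do not line up with the table's columns: the value under `'Rank'` is a country name, and `'Country'` and `'Population'` both hold population figures. One way to avoid hard-coded positions is to key each cell by the header text of its column. The following is a minimal sketch, assuming the header and cell texts have already been extracted as lists of strings. `rows_to_records` is a hypothetical helper, and the header labels in the example are placeholders rather than the page's actual headers.

```python
from typing import Dict, List


def rows_to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Pair each cell with the header label of its column."""
    records: List[Dict[str, str]] = []
    for cells in rows:
        # Skip rows that have fewer cells than there are header labels
        if len(cells) < len(header):
            continue
        # zip() stops at the shorter sequence, so extra trailing cells are ignored
        records.append({label: cell.strip() for label, cell in zip(header, cells)})
    return records


# Placeholder header labels (assumed, not taken from the page)
header = ["Location", "Population A", "Population B", "Change", "Region"]

# Cell values copied from the sample output above, plus one deliberately short row
rows = [
    ["India", "1,425,423,212", "1,438,069,596", "+0.89%", "Asia"],
    ["Indonesia", "278,830,529", "281,190,067", "+0.85%", "Asia"],
    ["short row"],
]

records = rows_to_records(header, rows)
for record in records:
    print(record)
```

With this approach a reordered or inserted column changes which label a value gets, rather than silently shifting every field.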


In [78]:
# Pagination and scraping
from typing import Optional, List, Dict, Any

# Wikipedia's population list is on a single page, so there is no pagination to follow
def find_next_page_url(html: str, base_url: str) -> Optional[str]:
    """Return the URL of the next page, or None when there is none (always None here)."""
    return None


# Main scraping function; follows next-page links for up to max_pages pages
def scrape_all_countries(start_url: str, max_pages: int = 1) -> List[Dict[str, Any]]:
    """Fetch and parse pages starting at start_url and return all rows combined."""
    all_rows: List[Dict[str, Any]] = []
    url: Optional[str] = start_url
    pages_visited: int = 0

    while url and pages_visited < max_pages:
        # Fetch HTML from the current URL
        html: str = fetch_html(url)

        # Parse the table and extract country data
        page_rows: List[Dict[str, Any]] = parse_countries_page(html)
        all_rows.extend(page_rows)

        # Wikipedia does not paginate, so this will always return None
        url = find_next_page_url(html, base_url=url)
        pages_visited += 1

    return all_rows
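
For a site that does paginate, `find_next_page_url()` would need to locate a link to the following page. The sketch below is one hypothetical way to do that, assuming the target site marks the link with `rel="next"` (many sites do, but not all). It uses the standard-library `html.parser` rather than BeautifulSoup so that it runs without extra dependencies. The name `find_next_page_url_rel` and the `example.com` URLs are illustrative only.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> or <link> tag whose rel attribute includes 'next'."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # Stop looking once a match has been found; ignore unrelated tags
        if self.next_href is not None or tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "next" in rel_tokens and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_page_url_rel(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the rel="next" link, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # Resolve relative links against the URL of the current page
    return urljoin(base_url, finder.next_href)


sample_html = (
    '<html><body><a href="/items?page=1">1</a>'
    '<a rel="next" href="/items?page=2">Next</a></body></html>'
)
next_url = find_next_page_url_rel(sample_html, "https://example.com/items?page=1")
print(next_url)
```

Because the function returns `None` when no such link exists, it could be dropped into the `while url and pages_visited < max_pages` loop above without changing the loop's stopping condition.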


In [79]:
# CSV export
from __future__ import annotations
from typing import List, Dict, Any
import pandas as pd


# Convert list of country dicts to DataFrame
def countries_to_dataframe(rows: List[Dict[str, Any]]) -> pd.DataFrame:

    return pd.DataFrame.from_records(rows, columns=["Rank", "Country", "Population", "Date", "Source"])


# Save countries DataFrame to CSV with ; separator
def save_countries_csv(rows: List[Dict[str, Any]], path: str) -> None:
    df = countries_to_dataframe(rows)
    df.to_csv(path, sep=";", index=False)


# Run scraping and save to CSV
try:
    countries = scrape_all_countries(
        "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_(United_Nations)"
    )
    save_path = "countries_population.csv"
    save_countries_csv(countries, save_path)
    print(f"Saved {len(countries)} countries to {save_path}")
    display(countries_to_dataframe(countries).head(5))
except Exception as e:
    print(f"Save failed: {e}")


Saved 238 countries to countries_population.csv


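|   | Rank | Country | Population | Date | Source |
|---|------|---------|------------|------|--------|
| 0 | World | 8,021,407,192 | 8,091,734,930 | +0.88% | – |
| 1 | India | 1,425,423,212 | 1,438,069,596 | +0.89% | Asia |
| 2 | China[a] | 1,425,179,569 | 1,422,584,933 | −0.18% | Asia |
| 3 | United States | 341,534,046 | 343,477,335 | +0.57% | Americas |
| 4 | Indonesia | 278,830,529 | 281,190,067 | +0.85% | Asia |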


# Documentation

**Target:**
I've scraped the Wikipedia page *"List of countries and dependencies by population (United Nations)"* because it contains a well-structured table of countries with population data, suitable for educational web scraping without login or authentication.

**Previous attempts:**
I initially tried scraping three websites: two educational sandboxes and one regular site. These attempts failed because the sites were dynamic and relied heavily on JavaScript, so the data was not present in the HTML returned by a plain HTTP request.

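**What worked:**
- Fetched the HTML successfully using `fetch_html()`.
- Parsed the main table and extracted five cells per row, stored under the keys **Rank**, **Country**, **Population**, **Date**, and **Source**. The sample output shows that these labels are shifted relative to the data: **Rank** holds the country name, **Country** and **Population** hold two population figures, **Date** holds a percentage change, and **Source** holds a region.
- Converted the data into a Pandas DataFrame and exported it to CSV using `;` as the separator.
- Kept the scraping polite with a 1.5-second `time.sleep()` after each request.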
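
The population strings contain commas as thousands separators, so a `;` separator keeps the raw file unambiguous to read by eye. The sketch below illustrates the same round trip with the standard-library `csv` module instead of pandas, writing to an in-memory buffer rather than a file. `write_rows` and `read_rows` are hypothetical helpers, and the sample row is copied from the output shown earlier.

```python
import csv
import io
from typing import Dict, List, TextIO

FIELDS = ["Rank", "Country", "Population", "Date", "Source"]


def write_rows(rows: List[Dict[str, str]], handle: TextIO) -> None:
    """Write rows as semicolon-separated values with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter=";")
    writer.writeheader()
    writer.writerows(rows)


def read_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Read semicolon-separated rows back into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=";"))


rows = [
    {
        "Rank": "India",
        "Country": "1,425,423,212",
        "Population": "1,438,069,596",
        "Date": "+0.89%",
        "Source": "Asia",
    }
]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())

# Rewind and read the data back to confirm nothing was lost
buffer.seek(0)
restored = read_rows(buffer)
```

When reading the exported file back with pandas, the same separator has to be passed explicitly, for example `pd.read_csv(path, sep=";")`.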

**Challenges:**
- Wikipedia does not use pagination, so `find_next_page_url()` always returns `None`.
- Some country names include footnote markers (e.g., `[a]`), which appear in the scraped text.

**How I handled it:**
- `scrape_all_countries()` is designed to handle multiple pages, but for Wikipedia it collects all data from the single page.
- A regular expression strips footnote markers during parsing, but it is applied to the second cell of each row while the country name sits in the first, so markers such as `[a]` in `China[a]` still appear in the output.
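
The scraped values are all strings, including the population figures and percentage changes. The sketch below shows one possible cleaning step: stripping footnote markers, converting comma-grouped numbers to `int`, and converting percentages to `float`. The helper names are hypothetical, and the code assumes the minus sign in values such as `−0.18%` is the Unicode minus (U+2212), which `float()` does not accept.

```python
import re

# Matches bracketed footnote markers such as "[a]" or "[12]"
FOOTNOTE_RE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text: str) -> str:
    """Remove bracketed footnote markers and surrounding whitespace."""
    return FOOTNOTE_RE.sub("", text).strip()


def parse_population(text: str) -> int:
    """Convert a string like '1,425,179,569' to an integer."""
    return int(strip_footnotes(text).replace(",", ""))


def parse_percent(text: str) -> float:
    """Convert a string like '+0.88%' or '−0.18%' to a float in percent units."""
    # Replace the Unicode minus (U+2212) with an ASCII hyphen-minus before parsing
    normalized = strip_footnotes(text).replace("\u2212", "-").rstrip("%")
    return float(normalized)


name = strip_footnotes("China[a]")
population = parse_population("1,425,179,569")
change = parse_percent("\u22120.18%")
print(name, population, change)
```

Applying `strip_footnotes` to the cell that actually holds the country name would remove the `[a]` marker seen in the output.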
