# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


In [None]:
1) #Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


### 2) Common Imports & Polite Headers


In [23]:
# Common Imports & Polite Headers
from bs4 import BeautifulSoup
import requests
import pandas as pd
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)

**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.


In [None]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [38]:
# Q1 — Write your answer here
def q1_read_table(html: str) -> pd.DataFrame:
    tables = pd.read_html(html, header=0)               # read all tables
    chosen = next(t for t in tables if t.shape[1] >= 3) # pick first with ≥3 cols
    return chosen

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns={
        "Alpha-2 code": "Alpha-2",
        "Alpha-3 code": "Alpha-3",
        "Numeric code": "Numeric",
    })
    for c in df.columns:
        df[c] = df[c].astype(str).str.strip()

    df['Alpha-2'] = df['Alpha-2'].str.upper()
    df['Alpha-3'] = df['Alpha-3'].str.upper()
    df['Numeric'] = pd.to_numeric(df['Numeric'], errors='coerce').astype("Int64")  # <- use nullable int
    return df

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    return (
        df.sort_values("Numeric", ascending=False, na_position="last")
          .head(top)
          .reset_index(drop=True)
    )


html = fetch_html("https://www.iban.com/country-codes")
df = q1_read_table(html)
df = q1_clean(df)
df = q1_sort_top(df)
df_full = q1_clean(q1_read_table(html))
df_full.to_csv("data_q1.csv", index=False)

# load the data using pandas
df_loaded = pd.read_csv("data_q1.csv")
print(df_loaded.head(16))









                Country Alpha-2 Alpha-3  Numeric
0           Afghanistan      AF     AFG        4
1         Åland Islands      AX     ALA      248
2               Albania      AL     ALB        8
3               Algeria      DZ     DZA       12
4        American Samoa      AS     ASM       16
5               Andorra      AD     AND       20
6                Angola      AO     AGO       24
7              Anguilla      AI     AIA      660
8            Antarctica      AQ     ATA       10
9   Antigua and Barbuda      AG     ATG       28
10            Argentina      AR     ARG       32
11              Armenia      AM     ARM       51
12                Aruba      AW     ABW      533
13            Australia      AU     AUS       36
14              Austria      AT     AUT       40
15           Azerbaijan      AZ     AZE       31


  tables = pd.read_html(html, header=0)               # read all tables
  tables = pd.read_html(html, header=0)               # read all tables


Fetched the page HTML and used pandas.read_html(html) to read all tables.

Selected the first table with ≥ 3 columns and standardized headers to: Country, Alpha-2, Alpha-3, Numeric.

Cleaned the data:

Trimmed leading/trailing spaces for all columns

Converted Alpha-2 and Alpha-3 to UPPERCASE

Cast Numeric to nullable integer (Int64)

Saved the full cleaned table to data_q1.csv.

Printed the Top-15 rows sorted by Numeric in descending order (no charts).

Output

data_q1.csv — full cleaned dataset

Notebook cell output — Top-15 by Numeric (desc)


## Question 2 — Hacker News (front page)

**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")


In [None]:

def q2_parse_items(html: str) -> pd.DataFrame:
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".athing"):
        rank = item.select_one(".rank").text.strip().rstrip(".")
        title = item.select_one(".titleline a").text.strip()
        link = item.select_one(".titleline a")["href"].strip()
        subtext = item.find_next_sibling("tr").select_one(".subtext")
        points = subtext.select_one(".score").text.strip().split()[0] if subtext and subtext.select_one(".score") else "0"
        comments = subtext.select_one("a:last-child").text.strip().split()[0] if subtext and subtext.select_one("a:last-child") else "0"
        user = subtext.select_one(".hnuser").text.strip() if subtext and subtext.select_one(".hnuser") else ""
        rows.append({
            "rank": rank,
            "title": title,
            "link": link,
            "points": points,
            "comments": comments,
            "user": user,
        })

    return pd.DataFrame(rows)

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    df["rank"] = pd.to_numeric(df["rank"], errors="coerce").fillna(0).astype(int)
    df["points"] = pd.to_numeric(df["points"], errors="coerce").fillna(0).astype(int)
    df["comments"] = pd.to_numeric(df["comments"], errors="coerce").fillna(0).astype(int)
    df["title"] = df["title"].fillna("")
    df["link"] = df["link"].fillna("")
    return df

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    df = df.sort_values("points", ascending=False).head(top)
    return df
#  usage example (optional) ---
html = requests.get("https://news.ycombinator.com/", timeout=20).text
df = q2_parse_items(html)
df = q2_clean(df)
df.to_csv("data_q2.csv", index=False)
df_full = pd.read_csv("data_q2.csv")
df_full.head(16)




Unnamed: 0,rank,title,link,points,comments,user
0,1,Myna: Monospace typeface designed for symbol-h...,https://github.com/sayyadirfanali/Myna,100,3,birdculture
1,2,Ruby Solved My Problem,https://newsletter.masilotti.com/p/ruby-alread...,91,2,joemasilotti
2,3,How did I get here?,https://how-did-i-get-here.net/,34,1,zachlatta
3,4,Ribir: Non-intrusive GUI framework for Rust/WASM,https://github.com/RibirX/Ribir,18,1,adamnemecek
4,5,I Love OCaml,https://mccd.space/posts/ocaml-the-worlds-best/,249,7,art-w
5,6,Venn Diagram for 7 Sets,https://moebio.com/research/sevensets/,53,3,bramadityaw
6,7,James Watson has died,https://www.nytimes.com/2025/11/07/science/jam...,125,2,granzymes
7,8,"YouTube Removes Windows 11 Bypass Tutorials, C...",https://news.itsfoss.com/youtube-removes-windo...,43,47,WaitWaitWha
8,9,Leaving Meta and PyTorch,https://soumith.ch/blog/2025-11-06-leaving-met...,649,15,saikatsg
9,10,"Angel Investors, a Field Guide",https://www.jeanyang.com/posts/angel-investors...,61,4,azhenley
