
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


In [2]:
#Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")

Dependencies installed.


### 2) Common Imports & Polite Headers

In [3]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
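
The tip above can be sketched as follows. This is a minimal illustration, not the required solution: `SAMPLE_HTML` and `first_wide_table` are hypothetical names, and the inline HTML stands in for the page that `fetch_html(url)` would return. `keep_default_na=False` is passed because Namibia's Alpha-2 code is the literal string `NA`, which pandas would otherwise read as a missing value.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded page (two tables, only one is wide enough).
SAMPLE_HTML = """
<table><tr><th>Note</th></tr><tr><td>not the data table</td></tr></table>
<table>
  <thead><tr><th>Country</th><th>Alpha-2</th><th>Alpha-3</th><th>Numeric</th></tr></thead>
  <tbody>
    <tr><td>Zambia</td><td>ZM</td><td>ZMB</td><td>894</td></tr>
    <tr><td>Namibia</td><td>NA</td><td>NAM</td><td>516</td></tr>
  </tbody>
</table>
"""

def first_wide_table(html: str, min_cols: int = 3) -> pd.DataFrame:
    """Return the first parsed table that has at least `min_cols` columns."""
    # Wrap the string in StringIO: recent pandas versions expect a file-like object.
    tables = pd.read_html(io.StringIO(html), keep_default_na=False)
    for table in tables:
        if table.shape[1] >= min_cols:
            return table
    raise ValueError(f"no table with at least {min_cols} columns found")

wide = first_wide_table(SAMPLE_HTML)
print(wide.columns.tolist())
```

`pd.read_html` needs an HTML parser such as `lxml`, which the install cell already provides.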


In [None]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")
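
One possible shape for the cleaning and sorting steps, shown as a sketch on a tiny hand-made frame. The function name `clean_country_codes` and the sample rows (including the invalid "Nowhere" row) are assumptions for illustration; they are not part of the skeleton.

```python
import pandas as pd

def clean_country_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Trim text, uppercase the alpha codes, and cast Numeric to nullable Int64."""
    out = df.copy()
    for col in ["Country", "Alpha-2", "Alpha-3"]:
        out[col] = out[col].astype("string").str.strip()
    out["Alpha-2"] = out["Alpha-2"].str.upper()
    out["Alpha-3"] = out["Alpha-3"].str.upper()
    # Non-numeric values become <NA> instead of raising.
    out["Numeric"] = pd.to_numeric(out["Numeric"], errors="coerce").astype("Int64")
    return out

# Hypothetical raw rows with the kinds of noise the cleaning step targets.
raw = pd.DataFrame({
    "Country": [" Zambia ", "Yemen", "Nowhere"],
    "Alpha-2": ["zm", "YE ", "xx"],
    "Alpha-3": ["zmb", "yem", "xxx"],
    "Numeric": ["894", "887", "n/a"],
})

cleaned = clean_country_codes(raw)
# Missing values sort last by default, so they never crowd out the Top-N.
top = cleaned.sort_values("Numeric", ascending=False).head(15)
print(top)
```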


In [None]:
# Question 1 — IBAN Country Codes Scraper

### Source URL
https://www.iban.com/country-codes

### What the script does
1. Downloads the HTML using `requests`
2. Parses the page using `BeautifulSoup`
3. Extracts the country code table (`Country`, `Alpha-2`, `Alpha-3`, `Numeric`)
4. Cleans the data:
   - Trims whitespace from strings
   - Converts Alpha-2 and Alpha-3 codes to UPPERCASE
   - Converts Numeric to an integer type (nullable)
5. Saves the full cleaned dataset as **data_q1.csv**
6. Prints a **Top-15 preview** sorted by `Numeric` descending

### Deliverables Included
- Code notebook (`.ipynb`)
- Clean dataset: `data_q1.csv`
- This README.md

### Limitation
The website's structure may change over time, which could break the selectors the script relies on (`table`, `tbody tr`, `td`) or change the scraped results on future runs.
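
One way to notice such a change early is to check for the expected columns before cleaning. The sketch below is an assumption about how that guard might look; `EXPECTED_COLUMNS`, `missing_columns`, and the renamed columns in `changed` are hypothetical.

```python
import pandas as pd

# Columns the rest of the script depends on.
EXPECTED_COLUMNS = ("Country", "Alpha-2", "Alpha-3", "Numeric")

def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

good = pd.DataFrame(columns=list(EXPECTED_COLUMNS))
changed = pd.DataFrame(columns=["Country", "ISO2", "ISO3", "Numeric"])  # hypothetical renamed layout

print(missing_columns(good))     # []
print(missing_columns(changed))  # ['Alpha-2', 'Alpha-3']
```

A non-empty result is a signal to stop and inspect the page rather than write a misleading CSV.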


In [21]:
# Question 1 — IBAN Country Codes Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd

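url = "https://www.iban.com/country-codes"
# Reuse the polite HEADERS from the common-imports cell and fail fast on HTTP errors
resp = requests.get(url, headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")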

rows = []
table = soup.find("table")  # the first + main country code table

for tr in table.select("tbody tr"):
    cols = [td.get_text(strip=True) for td in tr.select("td")]
    if len(cols) >= 4:
        country, alpha2, alpha3, numeric = cols[:4]

        # Clean fields
        alpha2 = alpha2.upper().strip()
        alpha3 = alpha3.upper().strip()
        try:
            numeric = int(numeric)
        except ValueError:
            numeric = None  # keep nullable, as the instructions allow

        rows.append({
            "Country": country,
            "Alpha-2": alpha2,
            "Alpha-3": alpha3,
            "Numeric": numeric
        })

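df = pd.DataFrame(rows)
# Use pandas' nullable integer dtype so missing values do not turn the column into floats
df["Numeric"] = df["Numeric"].astype("Int64")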

# Sort Numeric descending and preview Top-15
df_top15 = df.sort_values(by="Numeric", ascending=False).head(15)

# Save full cleaned dataset
df.to_csv("data_q1.csv", index=False)

print("Saved data_q1.csv (full dataset); preview rows:", len(df_top15))
df_top15


Saved data_q1.csv (full dataset); preview rows: 15


Unnamed: 0,Country,Alpha-2,Alpha-3,Numeric
247,Zambia,ZM,ZMB,894
246,Yemen,YE,YEM,887
192,Samoa,WS,WSM,882
244,Wallis and Futuna,WF,WLF,876
240,Venezuela (Bolivarian Republic of),VE,VEN,862
238,Uzbekistan,UZ,UZB,860
237,Uruguay,UY,URY,858
35,Burkina Faso,BF,BFA,854
243,Virgin Islands (U.S.),VI,VIR,850
236,United States of America (the),US,USA,840


## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
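
The row pairing described in the tip can be sketched on a small inline sample. The markup in `SAMPLE_HTML` is a simplified assumption modeled on the selectors named above, not a copy of the live page, and `parse_stories` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: each story row is followed by a details row.
SAMPLE_HTML = """
<table>
  <tr class="athing" id="1">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr><td class="subtext">
    <span class="score">120 points</span> by <a class="hnuser">alice</a>
    | <a href="item?id=1">45&nbsp;comments</a>
  </td></tr>
  <tr class="athing" id="2">
    <td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="item?id=2">Job post</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="age">1 hour ago</span></td></tr>
</table>
"""

def parse_stories(html: str) -> list:
    """Pair each .athing row with the .subtext cell in the row that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for item in soup.select("tr.athing"):
        title_link = item.select_one("span.titleline a")
        detail_row = item.find_next_sibling("tr")  # may be None on malformed pages
        sub = detail_row.select_one("td.subtext") if detail_row else None
        score = sub.select_one("span.score") if sub else None
        stories.append({
            "title": title_link.get_text(strip=True) if title_link else "",
            "link": title_link.get("href", "") if title_link else "",
            "score_text": score.get_text(strip=True) if score else "",
        })
    return stories

stories = parse_stories(SAMPLE_HTML)
print(stories)
```

The second sample row has no score, which mirrors posts that carry no points and shows why every lookup is guarded.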


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")
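
The "non-digits → 0" rule can be captured in one small helper. This sketch uses a regular expression, which is a different technique from the split-and-`int()` approach used in the answer cell below; `to_int` is a hypothetical name.

```python
import re

def to_int(text, default: int = 0) -> int:
    """Return the first run of digits in text as an int, or default if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default

print(to_int("120 points"))      # 120
print(to_int("45\xa0comments"))  # 45 (non-breaking space between number and word)
print(to_int("discuss"))         # 0
print(to_int(None))              # 0
print(to_int("1."))              # 1
```

Applying the same helper to `rank`, `points`, and `comments` keeps the three casts consistent.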


In [20]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

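url = "https://news.ycombinator.com/"
# Reuse the polite HEADERS from the common-imports cell and fail fast on HTTP errors
resp = requests.get(url, headers=HEADERS, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")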

stories = []

for item in soup.select("tr.athing"):
    # Rank
    rank_tag = item.select_one("span.rank")
    rank_str = rank_tag.get_text(strip=True).rstrip('.') if rank_tag else ""
    try:
        rank = int(rank_str)
    except ValueError:
        rank = 0

    # Title + link
    titleline_span = item.select_one("span.titleline")
    if titleline_span:
        a_tag = titleline_span.find("a")
        if a_tag:
            title = a_tag.get_text(strip=True)
            link = a_tag.get("href", "").strip()
        else:
            title = ""
            link = ""
    else:
        title = ""
        link = ""


    # Subtext (next row)
    sub_row = item.find_next_sibling("tr")  # None if the story row is the last row
    sub = sub_row.select_one("td.subtext") if sub_row else None
    if sub:
        # Points
        pts_tag = sub.select_one("span.score")
        pts_str = pts_tag.get_text(strip=True).split()[0] if pts_tag else ""
        try:
            points = int(pts_str)
        except ValueError:
            points = 0

        # Comments
        comment_tags = sub.select("a")
        if comment_tags:
            last_link_text = comment_tags[-1].get_text(strip=True)
            if "comment" in last_link_text:
                comments_str = last_link_text.split()[0]
            else:
                comments_str = "0"
        else:
            comments_str = "0"
    else:
        points = 0
        comments_str = "0"

    try:
        comments = int(comments_str)
    except ValueError:
        comments = 0

    stories.append({
        "rank": rank,
        "title": title,
        "link": link,
        "points": points,
        "comments": comments
    })

df = pd.DataFrame(stories)

# Sort Top-15 by points (desc)
df_top15 = df.sort_values(by="points", ascending=False).head(15).reset_index(drop=True)

# Save to CSV
df.to_csv("data_q2.csv", index=False)

print("Saved data_q2.csv (full dataset); preview rows:", len(df_top15))
print(df_top15)

Saved data_q2.csv (full dataset); preview rows: 15
    rank                                              title  \
0      5                                            Mr TIFF   
1     28        I’m worried that they put co-pilot in Excel   
2     14          UPS plane crashes near Louisville airport   
3     21    Bluetui – A TUI for managing Bluetooth on Linux   
4     17  RISC-V takes first step toward international I...   
5     25  Apple’s Persona technology uses Gaussian splat...   
6     18      Hypothesis: Property-Based Testing for Python   
7      6  iOS 26.2 to allow third-party app stores in Ja...   
8      8  SPy: An interpreter and compiler for a fast st...   
9     27  Grayskull: A tiny computer vision library in C...   
10    19  Asus Announces October Availability of ProArt ...   
11     1               The shadows lurking in the equations   
12     2     An eBPF Loophole: Using XDP for Egress Traffic   
13    13                                   Radiant Computer   
14    30  The Micros