
# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


### 1) Install Required Libraries

In [None]:
# Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


### 2) Common Imports & Polite Headers

In [None]:
# Common Imports & Polite Headers
import re
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request.
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}


def fetch_html(url: str, timeout: int = 20) -> str:
    """Download a page and return its HTML; raise on HTTP error status."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text


def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse MultiIndex column headers into single stripped strings."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join(str(x) for x in tup if str(x) != "nan").strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df


print("Common helpers loaded.")


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
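
The cleaning rules above can be sketched on a small hand-made frame. This is only an illustration under stated assumptions: the helper name `clean_country_codes` and the sample rows (including the made-up "Testland" entry) are hypothetical and not part of the skeleton; the column names are the ones listed above.

```python
import pandas as pd


def clean_country_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the Q1 cleaning rules to a copy of df."""
    out = df.copy()
    # Trim surrounding whitespace in the text columns.
    for col in ["Country", "Alpha-2", "Alpha-3"]:
        out[col] = out[col].astype(str).str.strip()
    # Normalize both code columns to uppercase.
    out["Alpha-2"] = out["Alpha-2"].str.upper()
    out["Alpha-3"] = out["Alpha-3"].str.upper()
    # Non-numeric values become <NA>; "Int64" is pandas' nullable integer dtype.
    out["Numeric"] = pd.to_numeric(out["Numeric"], errors="coerce").astype("Int64")
    return out


# Made-up sample rows, used only to exercise the helper.
sample = pd.DataFrame({
    "Country": [" Albania ", "Algeria", "Testland"],
    "Alpha-2": ["al", "DZ ", "tl"],
    "Alpha-3": ["alb", "dza", " tst"],
    "Numeric": ["8", "12", "n/a"],
})

cleaned = clean_country_codes(sample)
# Sort descending by Numeric; rows with a missing Numeric sort last by default.
top = cleaned.sort_values("Numeric", ascending=False).head(15)
print(top)
```

A frame cleaned this way could then be saved with `cleaned.to_csv("data_q1.csv", index=False)`.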


In [None]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [2]:
# Q1 — Write your answer here

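from bs4 import BeautifulSoup

# Read a locally saved copy of the country-codes page.
with open("List of country codes by alpha-2, alpha-3 code (ISO 3166).html", "r", encoding="utf-8") as file:
    html = file.read()

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # first table in the document
if table is None:
    raise ValueError("No <table> found in the saved HTML file")

# Skip the header row and keep the first 10 data rows.
rows = table.find_all("tr")[1:11]

for row in rows:
    cols = row.find_all("td")
    if len(cols) < 3:
        continue  # skip rows that lack the three expected cells
    country = cols[0].text.strip()
    alpha2 = cols[1].text.strip()
    alpha3 = cols[2].text.strip()
    print(f"{country:25} {alpha2:5} {alpha3}")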





Afghanistan               AF    AFG
Åland Islands             AX    ALA
Albania                   AL    ALB
Algeria                   DZ    DZA
American Samoa            AS    ASM
Andorra                   AD    AND
Angola                    AO    AGO
Anguilla                  AI    AIA
Antarctica                AQ    ATA
Antigua and Barbuda       AG    ATG


## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
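
The "non-digits → 0" rule can be sketched with a small regular-expression helper. The name `to_int` and the sample strings are assumptions chosen for illustration; the idea is simply to take the first run of digits in a value such as `187 points` and fall back to 0 when there is none.

```python
import re


def to_int(text) -> int:
    """Hypothetical helper: first run of digits in text as an int, else 0."""
    match = re.search(r"\d+", str(text or ""))
    return int(match.group()) if match else 0


# Illustrative inputs: a score, a comment count with a non-breaking space,
# labels without digits, a missing value, and a rank with a trailing dot.
for raw in ["187 points", "42\xa0comments", "discuss", "No score", None, "1."]:
    print(repr(raw), "->", to_int(raw))
```

Applied column by column (for example with `df["points"].map(to_int)`), a helper like this leaves every numeric field as a plain `int`.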


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")


In [4]:
# Q2 — Write your answer here

from bs4 import BeautifulSoup

# Read a locally saved copy of the Hacker News front page.
with open("Hacker News.html", "r", encoding="utf-8") as file:
    html = file.read()

soup = BeautifulSoup(html, "html.parser")
# Each story is a <tr class="athing"> row.
stories = soup.find_all("tr", class_="athing")

for story in stories[:10]:
    title_tag = story.select_one("span.titleline a")
    title = title_tag.text.strip() if title_tag else "No title found"

    # Score and author sit in the next <tr>, inside <td class="subtext">.
    subtext_row = story.find_next_sibling("tr")
    subtext = subtext_row.find("td", class_="subtext") if subtext_row else None
    score_tag = subtext.find("span", class_="score") if subtext else None
    author_tag = subtext.find("a", class_="hnuser") if subtext else None
    score = score_tag.text.strip() if score_tag else "No score"
    author = author_tag.text.strip() if author_tag else "No author"

    print(f"Title: {title}\nAuthor: {author}\nScore: {score}\n")


Title: Why is Zig so cool?
Author: vitalnodo
Score: 187 points

Title: Snapchat open-sources Valdi a cross-platform UI framework
Author: yehiaabdelm
Score: 112 points

Title: Becoming a Compiler Engineer
Author: lalitkale
Score: 136 points

Title: Mullvad: Shutting down our search proxy Leta
Author: holysoles
Score: 55 points

Title: Immutable Software Deploys Using ZFS Jails on FreeBSD
Author: vermaden
Score: 37 points

Title: Myna: Monospace typeface designed for symbol-heavy programming languages
Author: birdculture
Score: 216 points

Title: How did I get here?
Author: zachlatta
Score: 141 points

Title: Ruby Solved My Problem
Author: joemasilotti
Score: 183 points

Title: Why I love OCaml (2023)
Author: art-w
Score: 303 points

Title: Analysis of Hedy Lamarr's Contribution to Spread-Spectrum Communication
Author: drmpeg
Score: 36 points

