<a href="https://colab.research.google.com/github/aayushmarajlawat/Assignment_6_WebScraping_/blob/main/Assignment_6_WebScraping_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by `Numeric` (descending).
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points` (descending).
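
The guide mentions a Runner step that prints a Top‑15 preview and saves the CSV. As a rough sketch of what that step might look like, the snippet below saves a table and returns its top rows. The function name `preview_and_save`, the sample rows, and the temporary output path are all hypothetical and not part of the assignment.

```python
import os
import tempfile

import pandas as pd


def preview_and_save(df: pd.DataFrame, sort_col: str, path: str, top: int = 15) -> pd.DataFrame:
    """Hypothetical helper: save the full table, return the Top-N rows by sort_col (descending)."""
    df.to_csv(path, index=False)  # the CSV holds every row, not just the preview
    preview = (
        df.sort_values(sort_col, ascending=False, na_position="last")
        .head(top)
        .reset_index(drop=True)
    )
    print(preview)
    return preview


# Made-up sample rows, only to show the call pattern
sample = pd.DataFrame({"title": ["a", "b", "c"], "points": [10, 30, 20]})
out_path = os.path.join(tempfile.gettempdir(), "sample_preview.csv")
top2 = preview_and_save(sample, "points", out_path, top=2)
```

Saving the unsorted full table and sorting only for the preview keeps the CSV complete while the printed output stays short.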


### 1) Install Required Libraries

In [None]:
# Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


Dependencies installed.


### 2) Common Imports & Polite Headers

In [None]:
# Common Imports & Polite Headers
import re
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}


def fetch_html(url: str, timeout: int = 20) -> str:
    """GET the URL with HEADERS and return the body text; raise on HTTP errors."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text


def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Join MultiIndex column levels into single strings and strip whitespace from names."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x) != "nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df


print("Common helpers loaded.")


Common helpers loaded.
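
To see what `flatten_headers` does, here is a small sketch that applies it to a hand-built table. The helper is repeated so the snippet runs on its own, and the two-level header and row values are made up for illustration.

```python
import pandas as pd


def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Same logic as the helper above, repeated so this example runs on its own."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join(str(x) for x in tup if str(x) != "nan").strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df


# Hypothetical two-level header, as a table with stacked header rows might produce
two_level = pd.DataFrame(
    [["ZM", "ZMB"]],
    columns=pd.MultiIndex.from_tuples([("Code", "Alpha-2"), ("Code", "Alpha-3")]),
)
flat = flatten_headers(two_level)
print(list(flat.columns))

# Single-level headers are only converted to str and stripped
padded = flatten_headers(pd.DataFrame([[1, 2]], columns=[" Country ", "Numeric"]))
print(list(padded.columns))
```

Note that the helper modifies the DataFrame it receives and returns the same object, so pass a copy if the original headers are still needed.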


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

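**Tip:** You can use `pandas.read_html(StringIO(html))` (with `from io import StringIO`, because recent pandas versions deprecate passing a raw HTML string) to read tables and then pick one with ≥3 columns.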
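
The cleaning rules above can be tried on a tiny hand-made table before touching the real page. This is a minimal sketch under the assumption that the columns are already named `Country`, `Alpha-2`, `Alpha-3`, and `Numeric`. The rows are made up for illustration and are not scraped data.

```python
import pandas as pd

# Made-up rows in the shape of the country-code table (not scraped data)
raw = pd.DataFrame({
    "Country": [" Zambia ", "Yemen", "Samoa"],
    "Alpha-2": ["zm", "YE ", "ws"],
    "Alpha-3": ["zmb", "yem", " WSM"],
    "Numeric": ["894", "887", None],
})

clean = raw.copy()
clean["Country"] = clean["Country"].str.strip()
for col in ["Alpha-2", "Alpha-3"]:
    clean[col] = clean[col].str.strip().str.upper()

# to_numeric turns the missing value into NaN; the nullable Int64 dtype stores it as <NA>
clean["Numeric"] = pd.to_numeric(clean["Numeric"], errors="coerce").astype("Int64")

# Descending sort with missing values pushed to the end
top = clean.sort_values("Numeric", ascending=False, na_position="last").reset_index(drop=True)
print(top)
```

The nullable `Int64` dtype is what allows a column to hold integers and missing values at the same time; a plain `int` cast would fail on the missing entry.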


In [None]:
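# --- Q1 Skeleton (fill the TODOs) ---
from io import StringIO


def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML."""
    # Wrap the string in StringIO: passing literal HTML to read_html is deprecated in recent pandas.
    # keep_default_na=False keeps the literal code "NA" (Namibia) from being parsed as missing.
    tables = pd.read_html(StringIO(html), keep_default_na=False)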
    if not tables:
        raise ValueError("No tables found on the page.")
    # pick first table with >=3 cols
    df = None
    for t in tables:
        if t.shape[1] >= 3:
            df = t.copy()
            break
    if df is None:
        raise ValueError("No suitable table (>=3 columns) found.")

    df = flatten_headers(df)

    # Normalize column names for easier downstream use
    col = {c: c.strip() for c in df.columns}
    df.rename(columns=col, inplace=True)

    # Try to standardize common header variants
    rename_map = {}
    for c in df.columns:
        lc = c.lower()
        if "country" in lc and "code" not in lc:
            rename_map[c] = "Country"
        elif "alpha-2" in lc or ("alpha" in lc and "2" in lc):
            rename_map[c] = "Alpha-2"
        elif "alpha-3" in lc or ("alpha" in lc and "3" in lc):
            rename_map[c] = "Alpha-3"
        elif ("numeric" in lc and "code" in lc) or lc.strip() == "numeric":
            rename_map[c] = "Numeric"
    df.rename(columns=rename_map, inplace=True)

    return df


def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids."""
    # Trim all string cells (column-wise map; DataFrame.applymap is deprecated in recent pandas)
    df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

    # Ensure required columns exist
    def pick(colname, fallback_options):
        if colname in df.columns:
            return colname
        for opt in fallback_options:
            if opt in df.columns:
                return opt
        return None

    c_country = pick("Country", [c for c in df.columns if "country" in c.lower()])
    c_a2 = pick("Alpha-2", [c for c in df.columns if "alpha" in c.lower() and "2" in c])
    c_a3 = pick("Alpha-3", [c for c in df.columns if "alpha" in c.lower() and "3" in c])
    c_num = pick("Numeric", [c for c in df.columns if "numeric" in c.lower()])

    # Keep at least these 4 when present
    keep = [c for c in [c_country, c_a2, c_a3, c_num] if c is not None]
    out = df[keep].copy()

    # Rename to canonical names
    ren = {}
    if c_country: ren[c_country] = "Country"
    if c_a2: ren[c_a2] = "Alpha-2"
    if c_a3: ren[c_a3] = "Alpha-3"
    if c_num: ren[c_num] = "Numeric"
    out.rename(columns=ren, inplace=True)


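    # The nullable "string" dtype keeps missing codes as <NA> instead of turning them into "NAN"
    if "Alpha-2" in out.columns:
        out["Alpha-2"] = out["Alpha-2"].astype("string").str.strip().str.upper()
    if "Alpha-3" in out.columns:
        out["Alpha-3"] = out["Alpha-3"].astype("string").str.strip().str.upper()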

    if "Numeric" in out.columns:
        out["Numeric"] = (
            out["Numeric"].astype(str).str.extract(r"(\d+)", expand=False)
            .astype("Int64")
        )

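    # Drop rows without a country name (missing or empty); skip if the column was not found
    if "Country" in out.columns:
        has_name = out["Country"].notna() & out["Country"].astype(str).str.strip().ne("")
        out = out[has_name].reset_index(drop=True)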

    return out


def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N."""
    if "Numeric" not in df.columns:
        raise ValueError("Column 'Numeric' not found after cleaning.")
    return df.sort_values("Numeric", ascending=False, na_position="last").head(top).reset_index(drop=True)




In [None]:
# Q1 — Write your answer here

html_q1 = fetch_html("https://www.iban.com/country-codes")
df_q1_raw = q1_read_table(html_q1)
df_q1 = q1_clean(df_q1_raw)

top15_q1 = q1_sort_top(df_q1, top=15)
print("Q1 — Top 15 by Numeric (desc):")
display(top15_q1)  # if in Jupyter; otherwise print(top15_q1)

df_q1.to_csv("data_q1.csv", index=False)
print("Saved: data_q1.csv")



Q1 — Top 15 by Numeric (desc):




|  | Country | Alpha-2 | Alpha-3 | Numeric |
|---|---|---|---|---|
| 0 | Zambia | ZM | ZMB | 894 |
| 1 | Yemen | YE | YEM | 887 |
| 2 | Samoa | WS | WSM | 882 |
| 3 | Wallis and Futuna | WF | WLF | 876 |
| 4 | Venezuela (Bolivarian Republic of) | VE | VEN | 862 |
| 5 | Uzbekistan | UZ | UZB | 860 |
| 6 | Uruguay | UY | URY | 858 |
| 7 | Burkina Faso | BF | BFA | 854 |
| 8 | Virgin Islands (U.S.) | VI | VIR | 850 |
| 9 | United States of America (the) | US | USA | 840 |


Saved: data_q1.csv


## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (`user` is optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
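
The row pairing described in the tip can be tried offline first. The sketch below parses a hand-written HTML snippet that imitates the `.athing` / `.subtext` layout. The snippet, the helper name `first_int`, and all values in it are hypothetical, and the real page markup may differ.

```python
import re

from bs4 import BeautifulSoup

# Hand-written snippet that imitates the row layout described in the tip (not real page data)
SAMPLE_HTML = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/post">Example story</a></span></td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score">42 points</span> by <a class="hnuser">alice</a>
      | <a href="item?id=1">7&nbsp;comments</a>
    </td>
  </tr>
</table>
"""


def first_int(text: str) -> int:
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")  # stdlib parser, so lxml is not needed here
story = soup.select_one("tr.athing")
# The details live in the <tr> that directly follows the story row
detail = story.find_next_sibling("tr").select_one(".subtext")

record = {
    "rank": first_int(story.select_one("span.rank").get_text()),
    "title": story.select_one("span.titleline > a").get_text(strip=True),
    "link": story.select_one("span.titleline > a").get("href", ""),
    "points": first_int(detail.select_one("span.score").get_text()),
    "comments": first_int(detail.select("a")[-1].get_text()),
    "user": detail.select_one("a.hnuser").get_text(strip=True),
}
print(record)
```

On the live page some rows lack a score or a comment link, so each `select_one` result should be checked for `None` before calling `get_text`, as the skeleton does.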


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---

def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    """
    soup = BeautifulSoup(html, "lxml")
    rows = soup.select("tr.athing")
    items = []

    for row in rows:
        # rank
        rank_text = row.select_one("span.rank")
        rank = rank_text.get_text(strip=True).replace(".", "") if rank_text else ""

        # title & link
        title_a = row.select_one("span.titleline > a")
        title = title_a.get_text(strip=True) if title_a else ""
        link = title_a.get("href", "") if title_a else ""

        # subtext is in the next <tr>
        sub = row.find_next_sibling("tr")
        points = ""
        comments = ""
        user = ""

        if sub:
            subtext = sub.select_one(".subtext")
            if subtext:
                # points
                score = subtext.select_one("span.score")
                points = score.get_text(strip=True) if score else ""

                # user
                user_a = subtext.select_one("a.hnuser")
                user = user_a.get_text(strip=True) if user_a else ""

                # comments
                a_tags = subtext.select("a")
                c_val = ""
                for a in reversed(a_tags):
                    txt = a.get_text(strip=True).lower()
                    if "comment" in txt or "discuss" in txt:
                        c_val = a.get_text(strip=True)
                        break
                comments = c_val

        items.append({
            "rank": rank,
            "title": title,
            "link": link,
            "points": points,
            "comments": comments,
            "user": user
        })

    return pd.DataFrame(items)


def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values."""
    def to_int(s):
        if pd.isna(s):
            return 0
        s = str(s)

        m = re.search(r"(\d+)", s)
        return int(m.group(1)) if m else 0

    out = df.copy()

    # Fill text columns first; create required ones if missing so the selection below cannot fail
    for col in ["title", "link", "user"]:
        if col in out.columns:
            out[col] = out[col].fillna("").astype(str).str.strip()
        elif col != "user":
            out[col] = ""

    # Numeric conversions (non-digits and missing columns become 0)
    for col in ["rank", "points", "comments"]:
        if col in out.columns:
            out[col] = out[col].apply(to_int).astype(int)
        else:
            out[col] = 0

    # Keep a reasonable column order
    cols = ["rank", "title", "link", "points", "comments"]
    if "user" in out.columns:
        cols.append("user")
    out = out[cols]

    return out


def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N."""
    return df.sort_values("points", ascending=False).head(top).reset_index(drop=True)



In [None]:
# Q2 — Write your answer here

html_q2 = fetch_html("https://news.ycombinator.com/")
df_q2_raw = q2_parse_items(html_q2)
df_q2 = q2_clean(df_q2_raw)

top15_q2 = q2_sort_top(df_q2, top=15)
print("Q2 — Top 15 by points (desc):")
display(top15_q2)  # if in Jupyter; otherwise print(top15_q2)

df_q2.to_csv("data_q2.csv", index=False)
print("Saved: data_q2.csv")



Q2 — Top 15 by points (desc):


|  | rank | title | link | points | comments | user |
|---|---|---|---|---|---|---|
| 0 | 11 | You should write an agent | https://fly.io/blog/everyone-write-an-agent/ | 923 | 364 | tabletcorry |
| 1 | 3 | Leaving Meta and PyTorch | https://soumith.ch/blog/2025-11-06-leaving-met... | 586 | 135 | saikatsg |
| 2 | 17 | Two billion email addresses were exposed | https://www.troyhunt.com/2-billion-email-addre... | 571 | 402 | esnard |
| 3 | 24 | Show HN: I scraped 3B Goodreads reviews to tra... | https://book.sv | 526 | 211 | costco |
| 4 | 7 | A Fond Farewell | https://www.farmersalmanac.com/fond-farewell-f... | 481 | 168 | erhuve |
| 5 | 16 | Meta projected 10% of 2024 revenue came from s... | https://sherwood.news/tech/meta-projected-10-o... | 450 | 355 | donohoe |
| 6 | 25 | Game design is simple | https://www.raphkoster.com/2025/11/03/game-des... | 448 | 138 | vrnvu |
| 7 | 28 | Analysis indicates that the universe’s expansi... | https://ras.ac.uk/news-and-press/research-high... | 228 | 182 | chrka |
| 8 | 10 | OpenMW 0.50.0 Released – open-source Morrowind... | https://openmw.org/2025/openmw-0-50-0-released/ | 177 | 63 | agluszak |
| 9 | 1 | I Love OCaml | https://mccd.space/posts/ocaml-the-worlds-best/ | 156 | 84 | art-w |


Saved: data_q2.csv
