<a href="https://colab.research.google.com/github/caffeinated-beverage/NikkiSingh_DTSC3020_Fall2025/blob/main/assignment06_ns1239.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


In [10]:
# 1) Install Required Libraries
# %pip installs into the environment of the running kernel.
%pip install -q requests beautifulsoup4 lxml pandas
print("Dependencies installed.")



### 2) Common Imports & Polite Headers

In [13]:
# Common Imports & Polite Headers
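import re
import sys
from io import StringIO  # used by q1_read_table to wrap literal HTML for pd.read_html

import pandas as pd
import requests
from bs4 import BeautifulSoup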
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
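def fetch_html(url: str, timeout: int = 20) -> str:
    """GET `url` with the shared HEADERS, raise on HTTP errors, and return the body text."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text

def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse MultiIndex column headers into single strings and strip whitespace.

    Note: this modifies `df.columns` in place and returns the same DataFrame.
    """
    if isinstance(df.columns, pd.MultiIndex):
        # Join the header levels with spaces, skipping levels that are NaN.
        df.columns = [" ".join([str(x) for x in tup if str(x) != "nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df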
print("Common helpers loaded.")


Common helpers loaded.


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
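
The cleaning rules above can be tried offline before touching the live page. The sketch below is a minimal illustration, not the assignment solution. It assumes only that pandas is installed, and the `raw` DataFrame holds hand-written sample rows (the third row is a made-up placeholder with a deliberately invalid `Numeric` value).

```python
import pandas as pd

# Hand-written sample rows standing in for the scraped table.
# The third row is a hypothetical placeholder, not a real country.
raw = pd.DataFrame({
    "Country": ["  Yemen ", "Zambia", "Placeholder row"],
    "Alpha-2": ["ye", " zm", "xx"],
    "Alpha-3": ["yem", "zmb ", "xxx"],
    "Numeric": ["887", "894", "n/a"],
})

clean = raw.copy()

# Trim surrounding whitespace from the text columns.
clean["Country"] = clean["Country"].str.strip()

# Alpha codes: trim, then force uppercase.
for col in ("Alpha-2", "Alpha-3"):
    clean[col] = clean[col].str.strip().str.upper()

# errors="coerce" turns unparseable values into missing values;
# "Int64" (capital I) is pandas' nullable integer dtype, so they can stay missing.
clean["Numeric"] = pd.to_numeric(clean["Numeric"], errors="coerce").astype("Int64")

# Missing values sort last by default, so valid codes come first.
top = clean.sort_values("Numeric", ascending=False).head(15)
print(top.to_string(index=False))
```

Keeping the invalid row with a missing `Numeric` is one option ("nullable OK"); calling `dropna(subset=["Numeric"])` before the cast is the stricter alternative.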


In [17]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """
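    Return the first table with >= 3 columns from the HTML.

    Parses every <table> with pd.read_html and returns a copy of the first one
    that is wide enough. Returns an empty DataFrame if parsing fails; raises
    ValueError if tables were parsed but none has >= 3 columns.
    """
    try:
        # keep_default_na=False stops pandas from reading literal strings such
        # as "NA" (Namibia's Alpha-2 code) as missing values.
        tables = pd.read_html(StringIO(html), keep_default_na=False)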
    except ImportError:
        print("Error: `lxml` library not found. Please install it with 'pip install lxml'", file=sys.stderr)
        return pd.DataFrame()
    except ValueError:
        print("Error: No tables were found in the provided HTML.", file=sys.stderr)
        return pd.DataFrame()

    for df in tables:
        if df.shape[1] >= 3:
            return df.copy()

    raise ValueError("No table with >= 3 columns was found in the HTML.")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean columns: strip whitespace, uppercase Alpha-2/Alpha-3, drop rows whose
    Numeric value is not a number, and cast Numeric to nullable Int64.
    """
    df_clean = df.copy()

    rename_map = {
        'Country': 'Country',
        'Alpha-2 code': 'Alpha-2',
        'Alpha-3 code': 'Alpha-3',
        'Numeric': 'Numeric'
    }
    df_clean = df_clean.rename(columns=rename_map)

    required_cols = ['Country', 'Alpha-2', 'Alpha-3', 'Numeric']

    missing_cols = [col for col in required_cols if col not in df_clean.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns after rename: {missing_cols}")

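    # .copy() makes the subset an independent frame, so the column
    # assignments below do not trigger SettingWithCopyWarning.
    df_clean = df_clean[required_cols].copy()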

    df_clean['Country'] = df_clean['Country'].str.strip()
    df_clean['Alpha-2'] = df_clean['Alpha-2'].str.strip().str.upper()
    df_clean['Alpha-3'] = df_clean['Alpha-3'].str.strip().str.upper()

    df_clean['Numeric'] = pd.to_numeric(df_clean['Numeric'], errors='coerce')

    df_clean = df_clean.dropna(subset=['Numeric'])

    df_clean['Numeric'] = df_clean['Numeric'].astype(pd.Int64Dtype())

    return df_clean

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """
    Sort descending by Numeric and return Top-N.
    """
    df_sorted = df.sort_values(by='Numeric', ascending=False)

    return df_sorted.head(top)


In [19]:
# Q1 — Write your answer here
import requests
import pandas as pd
import sys
from io import StringIO

def main():
    URL = "https://www.iban.com/country-codes"
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }

    print(f"Fetching data from {URL}...")
    try:
        # A timeout keeps a stalled connection from hanging the notebook.
        response = requests.get(URL, headers=HEADERS, timeout=20)
        response.raise_for_status()

        print("Parsing HTML table...")
        df_raw = q1_read_table(response.text)

        print("Cleaning data...")
        df_clean = q1_clean(df_raw)

        print("Sorting and getting top 15...")
        df_top15 = q1_sort_top(df_clean, top=15)

        print("\n--- Top 15 Countries by Numeric Code (Descending) ---")
        print(df_top15.to_string(index=False))

        output_filename = "data_q1.csv"
        df_clean.to_csv(output_filename, index=False, encoding='utf-8')
        print(f"\nSuccessfully saved all {len(df_clean)} cleaned rows to {output_filename}")

    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}", file=sys.stderr)
    except (ValueError, KeyError) as e:
        print(f"Error during data processing: {e}", file=sys.stderr)
    except Exception as e:
        print(f"An unexpected error occurred: {e}", file=sys.stderr)

if __name__ == "__main__":
    main()

Fetching data from https://www.iban.com/country-codes...
Parsing HTML table...
Cleaning data...
Sorting and getting top 15...

--- Top 15 Countries by Numeric Code (Descending) ---
                                                   Country Alpha-2 Alpha-3  Numeric
                                                    Zambia      ZM     ZMB      894
                                                     Yemen      YE     YEM      887
                                                     Samoa      WS     WSM      882
                                         Wallis and Futuna      WF     WLF      876
                        Venezuela (Bolivarian Republic of)      VE     VEN      862
                                                Uzbekistan      UZ     UZB      860
                                                   Uruguay      UY     URY      858
                                              Burkina Faso      BF     BFA      854
                                     Virgin Islands (U.S.)     

## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
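
The row-pairing described in the tip, together with the "non-digits → 0" rule, can be sketched on a small offline sample. This is a minimal illustration under stated assumptions, not the assignment solution: `SAMPLE_HTML` is simplified, hand-written markup modeled on the tip (the live page carries more attributes and links), `to_int` is a hypothetical helper name, and the built-in `html.parser` is used so the example does not depend on `lxml`.

```python
import re
from bs4 import BeautifulSoup

# Simplified, hand-written markup modeled on the tip above (an assumption,
# not a copy of the live page).
SAMPLE_HTML = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score">42 points</span> by <a class="hnuser">alice</a>
      | <a href="item?id=1">7&nbsp;comments</a>
    </td>
  </tr>
  <tr class="athing">
    <td><span class="rank">2.</span></td>
    <td><span class="titleline"><a href="https://example.com/b">Second story</a></span></td>
  </tr>
  <tr>
    <td class="subtext"><a href="item?id=2">discuss</a></td>
  </tr>
</table>
"""


def to_int(text):
    """Return the first run of digits in `text` as an int, or 0 if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else 0


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for story in soup.find_all("tr", class_="athing"):
    rank = story.find("span", class_="rank")
    link = story.select_one("span.titleline a")

    # The details live in the <tr> immediately after the story row.
    detail_row = story.find_next_sibling("tr")
    subtext = detail_row.find("td", class_="subtext") if detail_row else None
    score = subtext.find("span", class_="score") if subtext else None

    # The comment count, when present, is the text of the last link in the subtext.
    links = subtext.find_all("a") if subtext else []
    last_text = links[-1].get_text() if links else ""

    rows.append({
        "rank": to_int(rank.get_text() if rank else None),
        "title": link.get_text(strip=True) if link else "",
        "link": link.get("href", "") if link else "",
        "points": to_int(score.get_text() if score else None),
        "comments": to_int(last_text) if "comment" in last_text else 0,
    })

for row in rows:
    print(row)
```

The second sample story has no score and only a "discuss" link, so both `points` and `comments` fall back to 0, which is the behavior the cleaning rule asks for.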


In [26]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """
    Parse front page items into DataFrame columns:
        rank, title, link, points, comments, user.
    Uses BeautifulSoup to find '.athing' rows and their
    following '.subtext' sibling rows.
    """
    soup = BeautifulSoup(html, 'lxml')
    stories = []

    item_rows = soup.find_all('tr', class_='athing')

    for item_row in item_rows:
        rank_tag = item_row.find('span', class_='rank')
        title_span = item_row.find('span', class_='titleline')
        title_link = title_span.find('a') if title_span else None

        subtext_td = None
        next_tr = item_row.find_next_sibling('tr')
        if next_tr:
            subtext_td = next_tr.find('td', class_='subtext')

        points_tag = None
        user_tag = None
        comments_text = None

        if subtext_td:
            points_tag = subtext_td.find('span', class_='score')

            user_tag = subtext_td.find('a', class_='hnuser')

            comment_links = subtext_td.find_all('a')
            if comment_links:
                last_link = comment_links[-1]
                if 'comment' in last_link.text or 'discuss' in last_link.text:
                    comments_text = last_link.text

        story_data = {
            'rank': rank_tag.text if rank_tag else None,
            'title': title_link.text if title_link else None,
            'link': title_link.get('href') if title_link else None,
            'points': points_tag.text if points_tag else None,
            'user': user_tag.text if user_tag else None,
            'comments': comments_text
        }
        stories.append(story_data)

    return pd.DataFrame(stories)


def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean numeric fields and fill missing values.
    Casts points/comments/rank to int (non-digits -> 0).
    Fills missing text fields with empty string.
    """
    df_clean = df.copy()

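    # expand=False makes str.extract return a Series (not a one-column DataFrame);
    # values without digits become NaN and are then replaced by 0.
    df_clean['rank'] = df_clean['rank'].astype(str).str.extract(r'(\d+)', expand=False).fillna(0).astype(int)
    df_clean['points'] = df_clean['points'].astype(str).str.extract(r'(\d+)', expand=False).fillna(0).astype(int)
    df_clean['comments'] = df_clean['comments'].astype(str).str.extract(r'(\d+)', expand=False).fillna(0).astype(int)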

    df_clean['title'] = df_clean['title'].fillna('')
    df_clean['link'] = df_clean['link'].fillna('')
    df_clean['user'] = df_clean['user'].fillna('')

    return df_clean

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N."""
    df_sorted = df.sort_values(by='points', ascending=False)
    return df_sorted.head(top)

In [27]:
# Q2 — Write your answer here

import requests
import pandas as pd
import sys
import re
from bs4 import BeautifulSoup

def main():
    URL = "https://news.ycombinator.com/"
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }

    print(f"Fetching data from {URL}...")
    try:
        # A timeout keeps a stalled connection from hanging the notebook.
        response = requests.get(URL, headers=HEADERS, timeout=20)
        response.raise_for_status()

        print("Parsing HTML...")
        df_raw = q2_parse_items(response.text)

        print("Cleaning data...")
        df_clean = q2_clean(df_raw)

        print("Sorting and getting top 15...")
        df_top15 = q2_sort_top(df_clean, top=15)

        print("\n--- Top 15 Hacker News Posts by Points (Descending) ---")
        display_cols = ['rank', 'points', 'comments', 'title']
        print(df_top15[display_cols].to_string(index=False))

        output_filename = "data_q2.csv"
        df_clean.to_csv(output_filename, index=False, encoding='utf-8')
        print(f"\nSuccessfully saved all {len(df_clean)} posts to {output_filename}")

    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}", file=sys.stderr)
    except (ValueError, KeyError, AttributeError) as e:
        print(f"Error during data processing: {e}", file=sys.stderr)
    except Exception as e:
        print(f"An unexpected error occurred: {e}", file=sys.stderr)

if __name__ == "__main__":
    main()

Fetching data from https://news.ycombinator.com/...
Parsing HTML...
Cleaning data...
Sorting and getting top 15...

--- Top 15 Hacker News Posts by Points (Descending) ---
 rank  points  comments                                                                           title
    1     536       265                                                Solarpunk is happening in Africa
   28     326       178 Norway reviews cybersecurity after remote-access feature found in Chinese buses
    6     322       142             New gel restores dental enamel and could revolutionise tooth repair
   20     284       176                I was right about dishwasher pods and now I can prove it [video]
   13     252        81                                            The shadows lurking in the equations
   24     232       108  SPy: An interpreter and compiler for a fast statically typed variant of Python
    2     229        80                                   Dillo, a multi-platform graphical web brow