<a href="https://colab.research.google.com/github/chenjeraichantelle-gif/web-scraping_Chenjerai/blob/main/Assignment_6_WebScraping_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6 (4 points) — Web Scraping

In this assignment you will complete **two questions**. The **deadline is posted on Canvas**.


## Assignment Guide (Read Me First)

- This notebook provides an **Install Required Libraries** cell and a **Common Imports & Polite Headers** cell. Run them first.
- Each question includes a **skeleton**. The skeleton is **not** a solution; it is a lightweight scaffold you may reuse.
- Under each skeleton you will find a **“Write your answer here”** code cell. Implement your scraping, cleaning, and saving logic there.
- When your code is complete, run the **Runner** cell to print a Top‑15 preview and save the CSV.
- Expected outputs:
  - **Q1:** `data_q1.csv` + Top‑15 sorted by the specified numeric column.
  - **Q2:** `data_q2.csv` + Top‑15 sorted by `points`.


### 1) Install Required Libraries

In [None]:
# Install Required Libraries
!pip -q install requests beautifulsoup4 lxml pandas
print("Dependencies installed.")


### 2) Common Imports & Polite Headers

In [None]:
# Common Imports & Polite Headers
import re, sys, pandas as pd, requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0 Safari/537.36")}
def fetch_html(url: str, timeout: int = 20) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text
def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x)!="nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df
print("Common helpers loaded.")
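
To see what `flatten_headers` does with a two-level header, the minimal sketch below repeats the helper so that it runs on its own and applies it to a toy table. The column names and rows are invented for illustration and are not taken from either target page.

```python
import pandas as pd

def flatten_headers(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the helper above, repeated so this example is self-contained.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [" ".join([str(x) for x in tup if str(x) != "nan"]).strip()
                      for tup in df.columns.values]
    else:
        df.columns = [str(c).strip() for c in df.columns]
    return df

# Toy table with a two-level header (illustrative data only).
toy = pd.DataFrame(
    [["Chad", "TD", "TCD"], ["Peru", "PE", "PER"]],
    columns=pd.MultiIndex.from_tuples(
        [("Country", "Name"), ("Code", "Alpha-2"), ("Code", "Alpha-3")]
    ),
)

# The helper modifies the frame it receives, so pass a copy to keep `toy` intact.
flat = flatten_headers(toy.copy())
print(list(flat.columns))
```

Each tuple of header levels is joined with a space, so the columns become `Country Name`, `Code Alpha-2`, and `Code Alpha-3`.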


## Question 1 — IBAN Country Codes (table)
**URL:** https://www.iban.com/country-codes  
**Extract at least:** `Country`, `Alpha-2`, `Alpha-3`, `Numeric` (≥4 cols; you may add more)  
**Clean:** trim spaces; `Alpha-2/Alpha-3` → **UPPERCASE**; `Numeric` → **int** (nullable OK)  
**Output:** write **`data_q1.csv`** and **print a Top-15** sorted by `Numeric` (desc, no charts)  
**Deliverables:** notebook + `data_q1.csv` + short `README.md` (URL, steps, 1 limitation)

**Tip:** You can use `pandas.read_html(html)` to read tables and then pick one with ≥3 columns.
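
The tip above can be sketched on a small hand-written page, so nothing is fetched from the network. The HTML, the header labels, and the function name `first_wide_table` are assumptions made for this example; the live page may be laid out differently. The sketch also passes `keep_default_na=False`, because by default `read_html` treats the text `NA` (Namibia's Alpha-2 code) as a missing value.

```python
import io

import pandas as pd

# Hand-written sample page: a 2-column navigation table, then the data table.
SAMPLE_HTML = """
<table>
  <tr><td>Home</td><td>About</td></tr>
</table>
<table>
  <thead>
    <tr><th>Country</th><th>Alpha-2 code</th><th>Alpha-3 code</th><th>Numeric</th></tr>
  </thead>
  <tbody>
    <tr><td>Afghanistan</td><td>AF</td><td>AFG</td><td>004</td></tr>
    <tr><td>Namibia</td><td>NA</td><td>NAM</td><td>516</td></tr>
  </tbody>
</table>
"""

def first_wide_table(html: str, min_cols: int = 3) -> pd.DataFrame:
    # read_html returns one DataFrame per <table>. The string is wrapped in
    # StringIO because recent pandas versions deprecate passing literal HTML.
    # keep_default_na=False stops the text "NA" from being read as missing.
    tables = pd.read_html(io.StringIO(html), keep_default_na=False)
    for table in tables:
        if table.shape[1] >= min_cols:
            return table
    raise ValueError(f"No table with at least {min_cols} columns found.")

codes = first_wide_table(SAMPLE_HTML)
print(codes)
```

The navigation table is skipped because it has only two columns. `read_html` needs an HTML parser such as `lxml`, which the install cell above provides.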


In [None]:
# --- Q1 Skeleton (fill the TODOs) ---
def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML.
    TODO: implement with pd.read_html(html), pick a reasonable table, then flatten headers.
    """
    raise NotImplementedError("TODO: implement q1_read_table")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids.
    TODO: implement cleaning steps.
    """
    raise NotImplementedError("TODO: implement q1_clean")

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N.
    TODO: implement.
    """
    raise NotImplementedError("TODO: implement q1_sort_top")


In [2]:
import io

import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def q1_read_table(html: str) -> pd.DataFrame:
    """Return the first table with >= 3 columns from the HTML."""
    # Wrap the HTML in StringIO (recent pandas versions deprecate passing a raw string).
    # keep_default_na=False keeps Namibia's Alpha-2 code "NA" from being parsed as NaN.
    tables = pd.read_html(io.StringIO(html), keep_default_na=False)
    for table in tables:
        if len(table.columns) >= 3:
            # Collapse a multi-level header to its innermost level.
            if isinstance(table.columns, pd.MultiIndex):
                table.columns = table.columns.get_level_values(-1)
            return table
    raise ValueError("No table with at least 3 columns found.")

def q1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean columns: strip, UPPER Alpha-2/Alpha-3, cast Numeric to int (nullable), drop invalids."""
    df_clean = df.copy()

    # Strip surrounding whitespace from every string cell; leave other values as they are.
    df_clean = df_clean.apply(
        lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x)
    )

    # Map whatever header wording the page uses onto the four required names.
    column_mapping = {}
    for col in df_clean.columns:
        col_lower = str(col).lower()
        if 'country' in col_lower:
            column_mapping[col] = 'Country'
        elif 'alpha-2' in col_lower or 'code2' in col_lower or 'two' in col_lower:
            column_mapping[col] = 'Alpha-2'
        elif 'alpha-3' in col_lower or 'code3' in col_lower or 'three' in col_lower:
            column_mapping[col] = 'Alpha-3'
        elif 'numeric' in col_lower or 'number' in col_lower:
            column_mapping[col] = 'Numeric'
    df_clean = df_clean.rename(columns=column_mapping)

    required_cols = ['Country', 'Alpha-2', 'Alpha-3', 'Numeric']
    missing_cols = [col for col in required_cols if col not in df_clean.columns]
    if missing_cols:
        # Fall back to positional names when header matching fails.
        if len(df_clean.columns) >= 4:
            df_clean.columns = required_cols + list(df_clean.columns[4:])
        else:
            raise ValueError(f"Missing required columns: {missing_cols}")

    df_clean['Alpha-2'] = df_clean['Alpha-2'].astype(str).str.upper().str.strip()
    df_clean['Alpha-3'] = df_clean['Alpha-3'].astype(str).str.upper().str.strip()

    # Coerce non-numeric values to NaN, drop those rows, then store as nullable integers.
    df_clean['Numeric'] = pd.to_numeric(df_clean['Numeric'], errors='coerce')
    df_clean = df_clean.dropna(subset=['Numeric'])
    df_clean['Numeric'] = df_clean['Numeric'].astype('Int64')

    df_clean = df_clean.drop_duplicates().reset_index(drop=True)
    return df_clean

def q1_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort descending by Numeric and return Top-N."""
    return df.sort_values(by='Numeric', ascending=False).head(top)

def run_q1():
    url = 'https://www.iban.com/country-codes'
    try:
        response = requests.get(url, headers=headers, timeout=20)
        response.raise_for_status()

        df_raw = q1_read_table(response.text)
        df_clean = q1_clean(df_raw)
        df_top = q1_sort_top(df_clean)

        # Save the full cleaned table, not just the preview.
        df_clean.to_csv('data_q1.csv', index=False)

        print('Top 15 countries by Numeric code (descending):')
        print(df_top.to_string(index=False))
        print(f"\nFull dataset saved to data_q1.csv with {len(df_clean)} records")
        return df_clean
    except Exception as e:
        print(f"Error: {e}")
        return None

# Run Q1
df_q1 = run_q1()

## Question 2 — Hacker News (front page)
**URL:** https://news.ycombinator.com/  
**Extract at least:** `rank`, `title`, `link`, `points`, `comments` (user optional)  
**Clean:** cast `points`/`comments`/`rank` → **int** (non-digits → 0), fill missing text fields  
**Output:** write **`data_q2.csv`** and **print a Top-15** sorted by `points` (desc, no charts)  
**Tip:** Each story is a `.athing` row; details (points/comments/user) are in the next `<tr>` with `.subtext`.
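
The row pairing described in the tip can be sketched on a hand-written fragment rather than the live page. The class names `rank`, `titleline`, `score`, and `hnuser`, as well as the helper names `text_of` and `parse_items`, are assumptions for this example; inspect the real markup before relying on them. The sketch returns raw text only and leaves the integer casting to the cleaning step.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating two stories; the second has no score or comments.
SAMPLE_HTML = """
<table>
  <tr class="athing" id="1">
    <td class="title"><span class="rank">1.</span></td>
    <td class="title"><span class="titleline"><a href="https://example.com/a">First story</a></span></td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score">120 points</span> by <a class="hnuser">alice</a>
      | <a href="item?id=1">45&nbsp;comments</a>
    </td>
  </tr>
  <tr class="athing" id="2">
    <td class="title"><span class="rank">2.</span></td>
    <td class="title"><span class="titleline"><a href="https://example.com/b">Job post</a></span></td>
  </tr>
  <tr>
    <td class="subtext">3 hours ago</td>
  </tr>
</table>
"""

def text_of(node) -> str:
    # Stripped text of an element, or "" when the element is missing.
    return node.get_text(strip=True) if node is not None else ""

def parse_items(html: str) -> list:
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for story in soup.select("tr.athing"):
        link = story.select_one(".titleline a")
        # Points, user and comments sit in the next <tr>, inside .subtext.
        detail_row = story.find_next_sibling("tr")
        subtext = detail_row.select_one(".subtext") if detail_row is not None else None
        sub_links = subtext.find_all("a") if subtext is not None else []
        comment_links = [a for a in sub_links if "comment" in a.get_text()]
        items.append({
            "rank": text_of(story.select_one(".rank")),
            "title": text_of(link),
            "link": link.get("href", "") if link is not None else "",
            "points": text_of(subtext.select_one(".score")) if subtext is not None else "",
            "user": text_of(subtext.select_one(".hnuser")) if subtext is not None else "",
            "comments": text_of(comment_links[-1]) if comment_links else "",
        })
    return items

items = parse_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Every lookup falls back to an empty string, so rows without a score or comment link (job posts, for example) still produce a complete record.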


In [None]:
# --- Q2 Skeleton (fill the TODOs) ---
def q2_parse_items(html: str) -> pd.DataFrame:
    """Parse front page items into DataFrame columns:
       rank, title, link, points, comments, user (optional).
    TODO: implement with BeautifulSoup on '.athing' and its sibling '.subtext'.
    """
    raise NotImplementedError("TODO: implement q2_parse_items")

def q2_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numeric fields and fill missing values.
    TODO: cast points/comments/rank to int (non-digits -> 0). Fill text fields.
    """
    raise NotImplementedError("TODO: implement q2_clean")

def q2_sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    """Sort by points desc and return Top-N. TODO: implement."""
    raise NotImplementedError("TODO: implement q2_sort_top")
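
As one possible reading of the cleaning and sorting rules in the skeleton docstrings, the sketch below works on a few made-up rows shaped like raw parser output. The function names `digits_to_int`, `clean_items`, and `sort_top`, and all sample values, are hypothetical; this is an illustration under those assumptions, not the expected solution.

```python
import pandas as pd

# Made-up rows: every value is raw text or missing.
raw = pd.DataFrame({
    "rank": ["1.", "2.", "3."],
    "title": ["First story", None, "Third story"],
    "link": ["https://example.com/a", "https://example.com/b", None],
    "points": ["120 points", "", "87 points"],
    "comments": ["45\xa0comments", "discuss", None],
})

def digits_to_int(series: pd.Series) -> pd.Series:
    # Take the first run of digits in each value; values without digits become 0.
    extracted = series.fillna("").astype(str).str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(extracted, errors="coerce").fillna(0).astype(int)

def clean_items(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in ("rank", "points", "comments"):
        out[col] = digits_to_int(out[col])
    # Missing text fields become empty strings.
    for col in ("title", "link"):
        out[col] = out[col].fillna("").astype(str).str.strip()
    return out

def sort_top(df: pd.DataFrame, top: int = 15) -> pd.DataFrame:
    return df.sort_values("points", ascending=False).head(top)

cleaned = clean_items(raw)
top_rows = sort_top(cleaned)
print(top_rows.to_string(index=False))
```

Extracting digits with a regular expression handles values such as `120 points` and `45 comments` in one pass, and `discuss` or an empty string falls through to 0.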


In [None]:
# Q2 — Write your answer here


