# Demo: scraping and parsing a website

### 1. Inspecting the website

Inspect the following website: [source](https://news.ycombinator.com/)

Copy and paste these commands into the browser console:
```console.log("Hello world");```

```window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });```

```document.body.style.background = "#111";```

```document.body.style.color = "#fff";```

### 2. Scraping the website page

In [2]:
import time
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

In [3]:
BASE = "https://news.ycombinator.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}

In [4]:
url = f"{BASE}news?p=1"
r = requests.get(url, headers=HEADERS, timeout=20)
r.raise_for_status()

In [5]:
soup = BeautifulSoup(r.text, "html.parser")

In [6]:
soup

<html lang="en" op="news"><head><meta content="origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="news.css?qJ9W32h7VuXKTHuLDtqT" rel="stylesheet" type="text/css"/><link href="y18.svg" rel="icon"/><link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/><title>Hacker News</title></head><body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%"><tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.svg" style="border:1px white solid; display:block" width="18"/></a></td><td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b><a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="sho
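The cells above fetch a single page (`news?p=1`), and `time` is imported but not yet used. A minimal sketch of how the same request could be repeated over several pages with a pause between calls is shown below. The helper names `page_url` and `fetch_pages`, the `fetch` callable, and the one-second default delay are assumptions made for illustration; they are not part of the notebook.

```python
import time
from typing import Callable, Dict, Iterable
from urllib.parse import urlencode, urljoin

BASE = "https://news.ycombinator.com/"


def page_url(page: int, base: str = BASE) -> str:
    """Build the listing URL for a page number (hypothetical helper)."""
    return urljoin(base, "news?" + urlencode({"p": page}))


def fetch_pages(
    pages: Iterable[int],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,  # assumed value, tune as needed
) -> Dict[int, str]:
    """Fetch several listing pages, sleeping between consecutive requests.

    `fetch` is any callable that takes a URL and returns the HTML text.
    """
    html_by_page: Dict[int, str] = {}
    for i, page in enumerate(pages):
        if i > 0:
            # Pause between requests, but not before the first one.
            time.sleep(delay_seconds)
        html_by_page[page] = fetch(page_url(page))
    return html_by_page
```

In this notebook, `fetch` could be a small function wrapping the `requests.get(url, headers=HEADERS, timeout=20)` call and `raise_for_status()` from the cells above and returning `r.text`. Passing the fetcher in as an argument also makes the loop easy to try out with a fake function that returns canned HTML.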

### 3. Parsing soup

In [14]:
rows = []

# Each story uses a <tr class="athing"> and the very next <tr> has the subtext
for athing in soup.select("tr.athing"):
    rank_el = athing.select_one("span.rank")
    title_el = athing.select_one("span.titleline > a")

    subtext_row = athing.find_next_sibling("tr")
    subtext = subtext_row.select_one("td.subtext") if subtext_row else None

    points_el = subtext.select_one("span.score") if subtext else None
    author_el = subtext.select_one("a.hnuser") if subtext else None
    age_el = subtext.select_one("span.age") if subtext else None

    # comments link is usually the last <a> in subtext
    comment_links = subtext.select("a") if subtext else []
    comments_el = comment_links[-1] if comment_links else None

    story = {
        "page": 1,
        "rank": int(re.search(r"\d+", rank_el.get_text(strip=True).replace(",", "")).group()) if rank_el else None,
        "title": title_el.get_text(strip=True) if title_el else None,
        "url": title_el["href"] if (title_el and title_el.has_attr("href")) else None,
        "points": int(re.search(r"\d+", points_el.get_text(strip=True).replace(",", "")).group()) if points_el else None,
        "author": author_el.get_text(strip=True) if author_el else None,
        "age": age_el.get_text(strip=True) if age_el else None,
        # "comments_count": int(re.search(r"\d+", comments_el.get_text(strip=True).replace(",", "")).group()) if comments_el else None,
        "hn_item_url": urljoin(BASE, comments_el["href"]) if (comments_el and comments_el.has_attr("href")) else None,
    }
    rows.append(story)
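
The `rank` and `points` fields call `.group()` directly on the result of `re.search`, which raises an `AttributeError` when the text contains no digits. That is a likely reason the `comments_count` line is commented out, since the last link in the subtext does not always contain a number. A minimal sketch of a safer helper follows; the name `first_int` is hypothetical and not part of the notebook.

```python
import re
from typing import Optional


def first_int(text: Optional[str]) -> Optional[int]:
    """Return the first integer found in `text`, or None if there is none."""
    if not text:
        return None
    # Drop thousands separators so "1,234" is read as 1234.
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None
```

With such a helper, a field like `"points"` could be written as `first_int(points_el.get_text(strip=True)) if points_el else None`, and the same pattern would apply to `comments_el` without failing on link text that has no digits.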


In [15]:
rows

[{'page': 1,
  'rank': 1,
  'title': 'The Overcomplexity of the Shadcn Radio Button',
  'url': 'https://paulmakeswebsites.com/writing/shadcn-radio-button/',
  'points': 239,
  'author': 'dbushell',
  'age': '2 hours ago',
  'hn_item_url': 'https://news.ycombinator.com/item?id=46688971'},
 {'page': 1,
  'rank': 2,
  'title': 'Level S4 solar radiation event',
  'url': 'https://www.swpc.noaa.gov/news/g4-severe-geomagnetic-storm-levels-reached-19-jan-2026',
  'points': 439,
  'author': 'WorldPeas',
  'age': '13 hours ago',
  'hn_item_url': 'https://news.ycombinator.com/item?id=46684056'},
 {'page': 1,
  'rank': 3,
  'title': 'Linux kernel framework for PCIe device emulation, in userspace',
  'url': 'https://github.com/cakehonolulu/pciem',
  'points': 23,
  'author': '71bw',
  'age': '2 hours ago',
  'hn_item_url': 'https://news.ycombinator.com/item?id=46689065'},
 {'page': 1,
  'rank': 4,
  'title': 'Reticulum, a secure and anonymous mesh networking stack',
  'url': 'https://github.com/mar

### 4. Display it as a DataFrame

In [18]:


df = pd.DataFrame(rows)

# quick cleanup / sanity checks
# df["comments_count"] = df["comments_count"].fillna(0).astype(int)
df = df.sort_values(["page", "rank"], ascending=True).reset_index(drop=True)

df.head(10)


| | page | rank | title | url | points | author | age | hn_item_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | The Overcomplexity of the Shadcn Radio Button | https://paulmakeswebsites.com/writing/shadcn-r... | 239 | dbushell | 2 hours ago | https://news.ycombinator.com/item?id=46688971 |
| 1 | 1 | 2 | Level S4 solar radiation event | https://www.swpc.noaa.gov/news/g4-severe-geoma... | 439 | WorldPeas | 13 hours ago | https://news.ycombinator.com/item?id=46684056 |
| 2 | 1 | 3 | Linux kernel framework for PCIe device emulati... | https://github.com/cakehonolulu/pciem | 23 | 71bw | 2 hours ago | https://news.ycombinator.com/item?id=46689065 |
| 3 | 1 | 4 | Reticulum, a secure and anonymous mesh network... | https://github.com/markqvist/Reticulum | 211 | brogu | 10 hours ago | https://news.ycombinator.com/item?id=46686273 |
| 4 | 1 | 5 | Increasing the performance of WebAssembly Text... | https://blog.gplane.win/posts/improve-wat-pars... | 17 | gplane | 2 hours ago | https://news.ycombinator.com/item?id=46629399 |
| 5 | 1 | 6 | King – man + woman is queen; but why? (2017) | https://p.migdal.pl/blog/2017/01/king-man-woma... | 7 | CGMthrowaway | 2 hours ago | https://news.ycombinator.com/item?id=46641145 |
| 6 | 1 | 7 | x86 prefixes and escape opcodes flowchart | https://soc.me/interfaces/x86-prefixes-and-esc... | 55 | gaul | 6 hours ago | https://news.ycombinator.com/item?id=46687705 |
| 7 | 1 | 8 | Apple testing new App Store design that blurs ... | https://9to5mac.com/2026/01/16/iphone-apple-ap... | 387 | ksec | 17 hours ago | https://news.ycombinator.com/item?id=46680974 |
| 8 | 1 | 9 | What came first: the CNAME or the A record? | https://blog.cloudflare.com/cname-a-record-ord... | 368 | linolevan | 17 hours ago | https://news.ycombinator.com/item?id=46681611 |
| 9 | 1 | 10 | Nanolang: A tiny experimental language designe... | https://github.com/jordanhubbard/nanolang | 157 | Scramblejams | 12 hours ago | https://news.ycombinator.com/item?id=46684958 |
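
The `age` column holds free text such as `2 hours ago`, which cannot be sorted or compared numerically. One possible cleanup step is to convert it into a `timedelta`. The sketch below assumes ages always look like `<number> <unit>(s) ago` with units of minutes, hours, or days, as in the output above; the helper name `parse_age` is hypothetical, and any other wording is returned as `None` rather than guessed.

```python
import re
from datetime import timedelta
from typing import Optional

# Maps the singular unit in the text to the matching timedelta keyword.
_UNIT_TO_KWARG = {"minute": "minutes", "hour": "hours", "day": "days"}


def parse_age(age: Optional[str]) -> Optional[timedelta]:
    """Convert text like '2 hours ago' into a timedelta, or None if unrecognized."""
    if not age:
        return None
    match = re.fullmatch(r"(\d+)\s+(minute|hour|day)s?\s+ago", age.strip())
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNIT_TO_KWARG[unit]: amount})
```

Applied to the DataFrame, something like `df["age"].map(parse_age)` should yield a column that can be sorted by recency.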
