# Scratch: HTML Table Extraction with `pandas.read_html`

**Author:** Dhruv Singh  
**Last updated:** 2025-12-15  

This scratch notebook collects **tables embedded in HTML pages** using `pandas.read_html()`.
It’s intentionally separate from the main API project notebook to keep the portfolio narrative clean.

---
## What this demonstrates
- Fetching HTML responsibly (User-Agent header)
- Extracting `<table>` elements into DataFrames
- Inspecting / selecting tables from a list


## Example 1 — Wikipedia table extraction

In [1]:
import pandas as pd
import requests

pd.set_option('display.max_columns', None)


In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_Australian_capital_cities'

In [3]:
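from io import StringIO  # passing a literal HTML string to read_html is deprecated since pandas 2.1

html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
tables = pd.read_html(StringIO(html))
len(tables)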


10

In [5]:
tables

[                                                   0
 0  WESTERN AUSTRALIA NORTHERN TERRITORY SOUTH AUS...,
                 State/territory    Capital  City population[2]  \
 0               New South Wales     Sydney             5029768   
 1                      Victoria  Melbourne             4725316   
 2                    Queensland   Brisbane             2360241   
 3             Western Australia      Perth             2022044   
 4               South Australia   Adelaide             1324279   
 5                      Tasmania     Hobart              224462   
 6  Australian Capital Territory   Canberra              403468   
 7            Northern Territory     Darwin              145916   
 
    State/territory population[3]  \
 0                        7759274   
 1                        6179249   
 2                        4848877   
 3                        2558951   
 4                        1713054   
 5                         517588   
 6                         

In [6]:
type(tables)

list

### Inspect and select a table

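`pd.read_html` returns a **list** of DataFrames, one per table found on the page. In the output above, `tables[0]` is a single-cell table of region names, so the capital-city data sits at index 1.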
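
To make the list behaviour concrete without a network call, here is a small sketch on an inline page. The markup, `sample_html`, `sample_tables`, and `capitals` are made up for illustration. It also shows the `match` argument, which keeps only tables whose text matches a pattern and so avoids relying on a positional index.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, small enough to read at a glance.
sample_html = """
<html><body>
<table><tr><td>Map legend</td></tr></table>
<table>
  <tr><th>State/territory</th><th>Capital</th></tr>
  <tr><td>New South Wales</td><td>Sydney</td></tr>
  <tr><td>Victoria</td><td>Melbourne</td></tr>
</table>
</body></html>
"""

# One DataFrame per <table>, in document order.
sample_tables = pd.read_html(StringIO(sample_html))

# A quick way to see which index holds the data: list each table's shape.
summary = [(i, t.shape) for i, t in enumerate(sample_tables)]

# Alternative to indexing: keep only tables containing the text "Capital".
capitals = pd.read_html(StringIO(sample_html), match="Capital")[0]
```

Selecting by `match` tends to survive page edits that add or reorder tables, whereas `tables[1]` silently points at a different table.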

In [7]:
tables[1]

| | State/territory | Capital | City population[2] | State/territory population[3] | Percentage of state/territory population in capital city | Established | Capital since | Image |
|---|---|---|---|---|---|---|---|---|
| 0 | New South Wales | Sydney | 5029768 | 7759274 | 64.82% | 1788 | 1788 | |
| 1 | Victoria | Melbourne | 4725316 | 6179249 | 76.47% | 1835 | 1851 | |
| 2 | Queensland | Brisbane | 2360241 | 4848877 | 48.68% | 1825 | 1860 | |
| 3 | Western Australia | Perth | 2022044 | 2558951 | 79.02% | 1829 | 1829 | |
| 4 | South Australia | Adelaide | 1324279 | 1713054 | 77.31% | 1836 | 1836 | |
| 5 | Tasmania | Hobart | 224462 | 517588 | 43.37% | 1804 | 1826 | |
| 6 | Australian Capital Territory | Canberra | 403468 | 403468 | 100.00% | 1913 | 1913 | |
| 7 | Northern Territory | Darwin | 145916 | 245740 | 59.38% | 1869 | 1911 | |
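
The column names above still carry Wikipedia footnote markers (`[2]`, `[3]`), and the percentage column is text. A possible clean-up step, sketched on three rows copied from the output rather than on the live table; `df` and `pct_col` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the output above (not re-fetched).
df = pd.DataFrame({
    "State/territory": ["New South Wales", "Victoria", "Queensland"],
    "City population[2]": [5029768, 4725316, 2360241],
    "Percentage of state/territory population in capital city": [
        "64.82%", "76.47%", "48.68%",
    ],
})

# Drop footnote markers such as "[2]" from the column names.
df.columns = df.columns.str.replace(r"\[\d+\]", "", regex=True).str.strip()

# Convert "64.82%" style strings to floats.
pct_col = "Percentage of state/territory population in capital city"
df[pct_col] = df[pct_col].str.rstrip("%").astype(float)
```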


## Example 2 — Finance page table extraction (may change over time)

In [8]:
url2 = 'https://finance.yahoo.com/markets/commodities/'

In [9]:
html2 = requests.get(url2, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text

In [10]:
from io import StringIO  # literal HTML strings passed to read_html are deprecated since pandas 2.1
tables = pd.read_html(StringIO(html2))


In [11]:
tables

[   Symbol                             Name  Unnamed: 2  \
 0    ES=F            E-Mini S&P 500 Dec 25         NaN   
 1    YM=F  Mini Dow Jones Indus.-$5 Dec 25         NaN   
 2    NQ=F                Nasdaq 100 Dec 25         NaN   
 3   RTY=F  E-mini Russell 2000 Index Futur         NaN   
 4    ZB=F  U.S. Treasury Bond Futures,Mar-         NaN   
 5    ZN=F  10-Year T-Note Futures,Dec-2025         NaN   
 6    ZF=F  Five-Year US Treasury Note Futu         NaN   
 7    ZT=F   2-Year T-Note Futures,Mar-2026         NaN   
 8    GC=F                      Gold Feb 26         NaN   
 9   MGC=F      Micro Gold Futures,Dec-2025         NaN   
 10   SI=F                    Silver Dec 25         NaN   
 11  SIL=F    Micro Silver Futures,Dec-2025         NaN   
 12   PL=F                  Platinum Jan 26         NaN   
 13   HG=F                    Copper Mar 26         NaN   
 14   PA=F                 Palladium Mar 26         NaN   
 15   CL=F                 Crude Oil Jan 26         NaN 

In [12]:
tables[0]

| | Symbol | Name | Unnamed: 2 | Price | Market Time | Change | Change % | Volume | Open Interest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ES=F | E-Mini S&P 500 Dec 25 | | 6,822.50 -8.25 (-0.12%) | 2:30PM EST | -8.25 | -0.12% | 1.372M | 1.577M |
| 1 | YM=F | Mini Dow Jones Indus.-$5 Dec 25 | | 48,782.00 -78.00 (-0.16%) | 2:30PM EST | -78.0 | -0.16% | 78426 | 70455 |
| 2 | NQ=F | Nasdaq 100 Dec 25 | | 25,136.50 -77.00 (-0.31%) | 2:30PM EST | -77.0 | -0.31% | 373650 | 270908 |
| 3 | RTY=F | E-mini Russell 2000 Index Futur | | 2,540.40 -13.30 (-0.52%) | 2:30PM EST | -13.3 | -0.52% | 247992 | 330017 |
| 4 | ZB=F | U.S. Treasury Bond Futures,Mar- | | 114.88 +0.22 (+0.19%) | 2:30PM EST | 0.22 | +0.19% | 276863 | 1.821M |
| 5 | ZN=F | 10-Year T-Note Futures,Dec-2025 | | 112.28 +0.11 (+0.10%) | 2:30PM EST | 0.11 | +0.10% | 963918 | 5.396M |
| 6 | ZF=F | Five-Year US Treasury Note Futu | | 109.17 +0.08 (+0.07%) | 2:30PM EST | 0.08 | +0.07% | 794708 | 6.643M |
| 7 | ZT=F | 2-Year T-Note Futures,Mar-2026 | | 104.35 +0.04 (+0.04%) | 2:30PM EST | 0.04 | +0.04% | 590301 | 4.496M |
| 8 | GC=F | Gold Feb 26 | | 4,337.90 +9.60 (+0.22%) | 2:30PM EST | 9.6 | +0.22% | 189821 | 335268 |
| 9 | MGC=F | Micro Gold Futures,Dec-2025 | | 4,337.80 +9.50 (+0.22%) | 2:30PM EST | 9.5 | +0.22% | 313402 | 44234 |
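
In this output the `Price` cell bundles the last price with the change and percentage, and `Volume` mixes plain digits with abbreviated values such as `1.372M`. One way to get numeric columns is sketched below on three rows copied from the output. `parse_count`, `SUFFIXES`, `quotes`, and the new column names are hypothetical, and the `K` and `B` suffixes are an assumption, since only `M` appears in the data shown.

```python
import pandas as pd

# Rows copied from the output above (not re-fetched).
quotes = pd.DataFrame({
    "Symbol": ["ES=F", "YM=F", "ZB=F"],
    "Price": [
        "6,822.50 -8.25 (-0.12%)",
        "48,782.00 -78.00 (-0.16%)",
        "114.88 +0.22 (+0.19%)",
    ],
    "Volume": ["1.372M", "78426", "276863"],
})

# Only "M" is seen in the source output; "K" and "B" are assumed.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Turn '1.372M' or '78426' into an integer count."""
    text = text.strip().replace(",", "")
    suffix = text[-1].upper()
    if suffix in SUFFIXES:
        # round() absorbs floating-point error from the multiplication.
        return round(float(text[:-1]) * SUFFIXES[suffix])
    return round(float(text))


# The last price is the first whitespace-separated token of the Price cell.
quotes["last_price"] = (
    quotes["Price"].str.split().str[0].str.replace(",", "", regex=False).astype(float)
)
quotes["volume_num"] = quotes["Volume"].astype(str).map(parse_count)
```

Because the page layout may change over time, a parser like this is best treated as tied to the column format captured on the day of the scrape.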
