# HtmlTableScraping Demo

This notebook demonstrates how to use the HtmlTableScraping library to convert HTML tables into pandas DataFrames.

## Installation

First, make sure you have the required dependencies installed:

```bash
uv sync
```

In [None]:
from bs4 import BeautifulSoup
from scraper import parse_table, TableCell
import pandas as pd

## Basic Example

Let's start with a simple HTML table and convert it to a pandas DataFrame:

In [None]:
html = '''
<table>
    <tr><th>Year</th><th>Change</th></tr>
    <tr><td>1970</td><td>0.10%</td></tr>
    <tr><td>1971</td><td>10.79%</td></tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
result = parse_table(table)
print(result)
print(f"\nDataFrame type: {type(result)}")
print(f"Columns: {list(result.columns)}")

## Table with Complex Content

The library can handle tables with hidden elements, superscripts, line breaks, and hyperlinks:

In [None]:
complex_html = '''
<table>
    <tr><th>Country</th><th>Population</th><th>Notes</th></tr>
    <tr>
        <td><a href="/usa">United States</a></td>
        <td>331 million<sup>1</sup></td>
        <td>Data from <span style="display:none">hidden text</span>2020 census</td>
    </tr>
    <tr>
        <td><a href="/china">China</a></td>
        <td>1.4 billion<sup>2</sup></td>
        <td>Estimated<br/>population</td>
    </tr>
</table>
'''

soup = BeautifulSoup(complex_html, "html.parser")
table = soup.find("table")
result = parse_table(table)
print("Parsed table:")
print(result)
print("\nNotice how:")
print("- Hidden spans are removed")
print("- Superscripts are stripped from main text")
print("- Line breaks are converted to spaces")
print("- Links are converted to plain text")
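
To make the four cleanup rules above concrete, here is a minimal sketch that applies them to a single cell's HTML using only the standard library. It is not the library's implementation: `CellTextExtractor` and `clean_cell_text` are hypothetical names introduced for illustration, and the sketch assumes hidden elements are marked with an inline `display:none` style, as in the example above.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect visible cell text, skipping <sup> and display:none elements."""

    # Tags that never have a closing tag, so they must not affect nesting depth.
    VOID_TAGS = {"br", "img", "hr", "input", "wbr"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside an element whose text is dropped

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")  # a line break becomes a space
            return
        if tag in self.VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.skip_depth or tag == "sup" or "display:none" in style:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def clean_cell_text(cell_html):
    """Return the visible text of a cell with whitespace collapsed."""
    parser = CellTextExtractor()
    parser.feed(cell_html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(clean_cell_text('<a href="/usa">United States</a>'))  # United States
print(clean_cell_text('331 million<sup>1</sup>'))           # 331 million
print(clean_cell_text('Estimated<br/>population'))          # Estimated population
```

Links need no special handling in this sketch because only text nodes are collected, so the anchor text survives while the `href` is dropped.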

## Table without Headers

You can also parse tables that don't have header rows:

In [None]:
no_header_html = '''
<table>
    <tr><td>Apple</td><td>Red</td><td>Sweet</td></tr>
    <tr><td>Banana</td><td>Yellow</td><td>Sweet</td></tr>
    <tr><td>Lemon</td><td>Yellow</td><td>Sour</td></tr>
</table>
'''

soup = BeautifulSoup(no_header_html, "html.parser")
table = soup.find("table")
result = parse_table(table, first_row_as_col_titles=False)
print("Table without headers:")
print(result)
print(f"\nColumns: {list(result.columns)}")
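
The effect of a flag like `first_row_as_col_titles` can be sketched on plain lists of rows. This is only an illustration: `split_header` is a hypothetical helper, and the integer column labels it produces when the flag is off are an assumption, since this notebook does not state which labels the library assigns.

```python
def split_header(rows, first_row_as_col_titles=True):
    """Return (column_labels, data_rows) for a list of row lists."""
    if not rows:
        return [], []
    if first_row_as_col_titles:
        # The first row supplies the labels; the rest is data.
        return list(rows[0]), [list(r) for r in rows[1:]]
    # Assumption: fall back to positional labels 0..n-1 and keep every row as data.
    width = max(len(r) for r in rows)
    return list(range(width)), [list(r) for r in rows]


fruit_rows = [
    ["Apple", "Red", "Sweet"],
    ["Banana", "Yellow", "Sweet"],
    ["Lemon", "Yellow", "Sour"],
]
columns, data = split_header(fruit_rows, first_row_as_col_titles=False)
print(columns)    # [0, 1, 2]
print(len(data))  # 3
```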

## Using the Table.pretty_print() Method

The `Table` class (which extends pandas DataFrame) includes a `pretty_print()` method for rich display in Jupyter notebooks:

In [None]:
html_with_title = '''
<table>
    <tr><th>Product</th><th>Price</th><th>Stock</th></tr>
    <tr><td>Laptop</td><td>$999</td><td>15</td></tr>
    <tr><td>Mouse</td><td>$25</td><td>50</td></tr>
    <tr><td>Keyboard</td><td>$75</td><td>30</td></tr>
</table>
'''

soup = BeautifulSoup(html_with_title, "html.parser")
table = soup.find("table")
result = parse_table(table)
result.title = "Product Inventory"

print("Using pretty_print():")
result.pretty_print()

## Working with TableCell Objects

For more advanced use cases, you can work with individual `TableCell` objects to access metadata like links and superscripts:

In [None]:
cell_html = '<td><a href="/wikipedia">Wikipedia</a> article<sup>citation needed</sup></td>'
soup = BeautifulSoup(cell_html, "html.parser")
cell = TableCell.from_soup(soup.td)

print(f"Cell text: '{cell.text}'")
print(f"Links found: {cell.links}")
print(f"Superscripts found: {cell.sups}")

cell_df = cell.to_df()
print("\nCell as DataFrame:")
print(cell_df)

## Handling Edge Cases

The library gracefully handles several edge cases, including an empty table, a `None` input, and rows with uneven numbers of cells:

In [None]:
# Case 1: a <table> element with no rows
empty_html = '<table></table>'
soup = BeautifulSoup(empty_html, "html.parser")
empty_result = parse_table(soup.find("table"))
print(f"Empty table result: {empty_result}")
print(f"Shape: {empty_result.shape}")

# Case 2: None instead of a table, e.g. when soup.find("table") finds no match
none_result = parse_table(None)
print(f"\nNone input result: {none_result}")
print(f"Shape: {none_result.shape}")

# Case 3: a data row with fewer cells than the header row
uneven_html = '''
<table>
    <tr><th>A</th><th>B</th><th>C</th></tr>
    <tr><td>1</td><td>2</td></tr>
    <tr><td>3</td><td>4</td><td>5</td></tr>
</table>
'''
soup = BeautifulSoup(uneven_html, "html.parser")
uneven_result = parse_table(soup.find("table"))
print("\nUneven rows (automatically normalized):")
print(uneven_result)
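
One plausible way to normalize uneven rows is to pad short rows up to the width of the longest one. The sketch below shows that idea on plain lists. `pad_rows` is a hypothetical helper, and using `None` as the fill value is an assumption: the notebook does not say what the library puts in the missing cells.

```python
def pad_rows(rows, fill=None):
    """Pad every row with `fill` up to the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


rows = [["A", "B", "C"], ["1", "2"], ["3", "4", "5"]]
padded = pad_rows(rows)
for row in padded:
    print(row)
# ['A', 'B', 'C']
# ['1', '2', None]
# ['3', '4', '5']
```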

## Limitations

As noted in the documentation, `parse_table` currently does not handle `rowspan` or `colspan` attributes. Tables using these features may not be parsed correctly:

In [None]:
colspan_html = '''
<table>
    <tr><th>Name</th><th colspan="2">Details</th></tr>
    <tr><td>John</td><td>Age: 30</td><td>City: NYC</td></tr>
</table>
'''

soup = BeautifulSoup(colspan_html, "html.parser")
colspan_result = parse_table(soup.find("table"))
print("Table with colspan (may not parse as expected):")
print(colspan_result)
print("\nNote: The colspan attribute is ignored, which may lead to misaligned data.")
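
Until spans are supported, one possible workaround is to expand `colspan` yourself before building a DataFrame. The sketch below does this with the standard library by repeating a cell's text once per spanned column. It is an assumption-laden illustration rather than part of the library: `ColspanRowParser` and `expand_colspans` are hypothetical names, and `rowspan` is not handled.

```python
from html.parser import HTMLParser


class ColspanRowParser(HTMLParser):
    """Parse table rows, repeating each cell's text across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell currently being read
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []
            raw = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, non-numeric, or zero values.
            self._span = int(raw) if raw.isdigit() and int(raw) > 0 else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            if not self.rows:  # cell seen before any <tr>
                self.rows.append([])
            self.rows[-1].extend([text] * self._span)
            self._cell = None


def expand_colspans(table_html):
    """Return the table as a list of rows with colspans expanded."""
    parser = ColspanRowParser()
    parser.feed(table_html)
    parser.close()
    return parser.rows


demo_html = '''
<table>
    <tr><th>Name</th><th colspan="2">Details</th></tr>
    <tr><td>John</td><td>Age: 30</td><td>City: NYC</td></tr>
</table>
'''
expanded = expand_colspans(demo_html)
print(expanded[0])  # ['Name', 'Details', 'Details']
print(expanded[1])  # ['John', 'Age: 30', 'City: NYC']
```

Repeating the header text keeps every row the same width, at the cost of duplicate column labels that you may want to rename afterwards.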

## Conclusion

The HtmlTableScraping library provides a simple and effective way to convert HTML tables into pandas DataFrames. Key features include:

- Automatic handling of header rows, which can be turned off with `first_row_as_col_titles=False`
- Removal of hidden elements and superscripts
- Preservation of metadata (links, superscripts) when needed
- Graceful handling of edge cases
- Rich display capabilities in Jupyter notebooks

For more information, see the [project README](README.md) and the source code documentation.