This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Add support for raw_html extraction in html parser (#341)
* Add support for raw_html extraction in html parser

* Adhere better to the tabulator standard
akariv committed Nov 9, 2020
1 parent 609563f commit 3b06de8
Showing 4 changed files with 18 additions and 9 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -668,13 +668,15 @@ Supports simple tables (no merged cells) with any legal combination of the td, t
 Usually `format='html'` would need to be specified explicitly as web URLs don't always use the `.html` extension.
 
 ```python
-stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1')
+stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1', raw_html=True)
 ```
 
 **Options**
 
 - **selector**: CSS selector for specifying which `table` element to extract. By default it's `table`, which takes the first `table` element in the document. If empty, will assume the entire page is the table to be extracted (useful with some Excel formats).
 
+- **raw_html**: `False` (default) extracts the textual contents of each cell. `True` returns each cell's inner HTML without modification.
+
 ### Custom file sources and formats
 
 Tabulator is written with extensibility in mind, allowing you to add support for
2 changes: 1 addition & 1 deletion data/table3.html
@@ -25,7 +25,7 @@
     </tr>
     <tr>
       <td>1</td>
-      <td>english</td>
+      <td><b>english</b></td>
     </tr>
     <tr>
       <td>2</td>
13 changes: 6 additions & 7 deletions tabulator/parsers/html.py
@@ -19,15 +19,17 @@ class HTMLTableParser(Parser):
 
     options = [
         'selector',
+        'raw_html'
     ]
 
-    def __init__(self, loader, force_parse=False, selector='table'):
+    def __init__(self, loader, force_parse=False, selector='table', raw_html=False):
         self.__loader = loader
         self.__selector = selector
         self.__force_parse = force_parse
         self.__extended_rows = None
         self.__encoding = None
         self.__chars = None
+        self.__extractor = (lambda x: x.html()) if raw_html else (lambda x: x.text())
 
     @property
     def closed(self):
@@ -78,14 +80,11 @@ def __iter_extended_rows(self):
             table.children('tbody').children('tr')
         )
         rows = [pq(r) for r in rows if len(r) > 0]
-        first_row = rows.pop(0)
-        headers = [pq(th).text() for th in first_row.find('th,td')]
-
         # Extract rows
-        rows = [pq(tr).find('td') for tr in rows]
-        rows = [[pq(td).text() for td in tr]
+        rows = [pq(tr).children('td,th') for tr in rows]
+        rows = [[self.__extractor(pq(td)) for td in tr]
                 for tr in rows if len(tr) > 0]
 
         # Yield rows
         for row_number, row in enumerate(rows, start=1):
-            yield (row_number, headers, row)
+            yield (row_number, None, row)
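
The difference between the two extractor modes can be illustrated with a small standalone sketch. This is not the parser's code: the commit uses pyquery's `.text()` and `.html()`, while the sketch below swaps in the standard library's `html.parser` so it runs without dependencies. `CellCollector` and `extract_rows` are hypothetical names, and the sketch assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collect table cells as text or as inner HTML."""

    def __init__(self, raw_html=False):
        super().__init__(convert_charrefs=True)
        self.raw_html = raw_html
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []
        elif self._cell is not None and self.raw_html:
            # Nested markup is kept verbatim only in raw_html mode.
            self._cell.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # skip rows without cells
                self.rows.append(self._row)
            self._row = None
        elif self._cell is not None and self.raw_html:
            self._cell.append('</%s>' % tag)

    def handle_data(self, data):
        # Note: character references arrive already unescaped here.
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html, raw_html=False):
    """Return every row (header row included) as a list of cell strings."""
    parser = CellCollector(raw_html=raw_html)
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = ('<table><tr><th>id</th><th>name</th></tr>'
          '<tr><td>1</td><td><b>english</b></td></tr></table>')

print(extract_rows(SAMPLE))                 # [['id', 'name'], ['1', 'english']]
print(extract_rows(SAMPLE, raw_html=True))  # [['id', 'name'], ['1', '<b>english</b>']]
```

As in the changed parser, the header row is returned as an ordinary row rather than being split off.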
8 changes: 8 additions & 0 deletions tests/formats/test_html.py
@@ -26,3 +26,11 @@ def test_stream_html(source, selector):
             {'id': '1', 'name': 'english'},
             {'id': '2', 'name': '中国人'}]
 
+def test_stream_html_raw_html():
+    with Stream('data/table3.html', selector='.mememe', headers=1, encoding='utf8', raw_html=True) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read(keyed=True) == [
+            {'id': '1', 'name': '<b>english</b>'},
+            {'id': '2', 'name': '中国人'}]
+
+
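
The new test passes `headers=1`, which fits the parser change: the parser no longer pops the first row and yields `None` in the headers slot, so the header row has to be picked out downstream. A rough sketch of that division of labour follows. It is an assumption for illustration only, not tabulator's actual `Stream` logic, and `iter_extended_rows` and `read_keyed` are hypothetical stand-ins.

```python
def iter_extended_rows(rows):
    """Mimic the parser after this commit: every row, headers slot left as None."""
    for row_number, row in enumerate(rows, start=1):
        yield (row_number, None, row)


def read_keyed(extended_rows, headers_row=1):
    """Hypothetical consumer: take headers from the given row, key the rest."""
    headers = None
    records = []
    for row_number, _, row in extended_rows:
        if row_number == headers_row:
            headers = row
        elif row_number > headers_row:
            records.append(dict(zip(headers, row)))
    return headers, records


rows = [['id', 'name'], ['1', '<b>english</b>'], ['2', '中国人']]
headers, records = read_keyed(iter_extended_rows(rows), headers_row=1)
print(headers)  # ['id', 'name']
print(records)
```

Keeping the header row in the row stream leaves the choice of header row to the caller instead of hard-coding it in the format parser.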