Fix and add html extractor #201

grodino · 2022-07-19T09:51:02Z

Hi!

Lately, I tried using the HTML extractor wrapper for ClueWeb documents. Wrapping the corpus docstore directly works fine, but when composing wrappers (dataset -> html extractor -> cachedocstore), I got an error because the HTML docstore wrapper never calls super().__init__().

This PR thus does two things:

Call super().__init__() in HtmlDocExtractorDocStoreWrapper
Add a new HTML extractor based on inscriptis (empirically, it seems to produce fewer junk characters than the vanilla bs4 extractor)
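
To illustrate the first point, here is a minimal sketch of how a skipped super().__init__() can break wrapper composition. Every class and function name below (BaseDocstore, DictDocstore, BrokenExtractorWrapper, FixedExtractorWrapper, CachingWrapper, strip_tags) is a hypothetical stand-in written for this example, not the actual ir_datasets code, and the regex tag stripper is only a toy replacement for a real HTML extractor.

```python
import re
from collections import namedtuple

Doc = namedtuple("Doc", ["doc_id", "body"])


class BaseDocstore:
    """Hypothetical base class: stores metadata that outer wrappers rely on."""

    def __init__(self, doc_cls, id_field="doc_id"):
        self._doc_cls = doc_cls
        self._id_field = id_field

    def get_many(self, doc_ids):
        raise NotImplementedError


class DictDocstore(BaseDocstore):
    """Toy in-memory docstore standing in for the corpus docstore."""

    def __init__(self, docs):
        super().__init__(Doc)
        self._docs = {d.doc_id: d for d in docs}

    def get_many(self, doc_ids):
        return {i: self._docs[i] for i in doc_ids if i in self._docs}


def strip_tags(html):
    # Toy extractor (assumption): drops anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", html)


class BrokenExtractorWrapper(BaseDocstore):
    def __init__(self, inner, extract):
        # Bug: super().__init__() is never called, so the attributes
        # _doc_cls and _id_field do not exist on this instance.
        self._inner = inner
        self._extract = extract

    def get_many(self, doc_ids):
        docs = self._inner.get_many(doc_ids)
        return {i: d._replace(body=self._extract(d.body)) for i, d in docs.items()}


class FixedExtractorWrapper(BrokenExtractorWrapper):
    def __init__(self, inner, extract):
        # Fix: initialise the base class before storing wrapper state.
        BaseDocstore.__init__(self, inner._doc_cls, inner._id_field)
        self._inner = inner
        self._extract = extract


class CachingWrapper(BaseDocstore):
    """Outer wrapper that reads the inner store's base-class attributes."""

    def __init__(self, inner):
        super().__init__(inner._doc_cls, inner._id_field)
        self._inner = inner
        self._cache = {}

    def get_many(self, doc_ids):
        missing = [i for i in doc_ids if i not in self._cache]
        self._cache.update(self._inner.get_many(missing))
        return {i: self._cache[i] for i in doc_ids if i in self._cache}


store = DictDocstore([Doc("d1", "<p>Hello</p>")])

# Used on its own, the broken wrapper works, which hides the bug.
direct = BrokenExtractorWrapper(store, strip_tags).get_many(["d1"])

# Composed under another wrapper, it fails with AttributeError.
try:
    CachingWrapper(BrokenExtractorWrapper(store, strip_tags))
    composition_failed = False
except AttributeError:
    composition_failed = True

cached = CachingWrapper(FixedExtractorWrapper(store, strip_tags))
result = cached.get_many(["d1"])
```

In this sketch the broken wrapper behaves correctly when used alone and only fails once another wrapper reads attributes the base initialiser would have set, which matches the symptom described above.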

Cheers,
Augustin

seanmacavaney · 2022-07-19T10:26:01Z

Thanks! The HtmlDocExtractorDocStoreWrapper doesn't get much attention, so thanks for the improvements and bug fix.

Mind adding a unit test for it?

grodino · 2022-07-19T12:59:50Z

> Mind adding a unit test for it?

Done, but I still couldn't run the tests myself.

I tried python -m test.integration.clueweb12 TestClueWeb12.test_clueweb12_docs_html and got this error:

======================================================================
ERROR: test_clueweb12_docs_html (__main__.TestClueWeb12) [docs_iter split] (dataset=<ir_datasets.wrappers.html_extractor.HtmlDocExtractor object at 0x7f42cb444130>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/agodinot/experiments/ir_datasets/test/integration/base.py", line 38, in _test_docs
    self._assert_namedtuple(next(it[idx:idx+1]), doc)
  File "/home/agodinot/experiments/ir_datasets/ir_datasets/wrappers/html_extractor.py", line 33, in __next__
    return next(self.mapped_it)
StopIteration

It looks as if the test case could not retrieve the documents :/ (I tried just calling docstore.get_many() in the test case, and that worked.)

seanmacavaney · 2022-07-20T16:13:06Z

Thanks! I didn't have time to go over this today, but I'll look into it tomorrow.

heinrichreimer · 2022-10-18T16:50:06Z

The code looks good to me 👍
The added dependency inscriptis is not too big (~40kb) and seems to be well-maintained.
Tests are passing. Let's get this merged 😉

seanmacavaney · 2022-10-19T18:38:08Z

Thanks for bumping this PR @heinrichreimer, and thanks @grodino for the contribution!

Init super class in html doc extractor

657df8a

Augustin Godinot added 2 commits July 19, 2022 14:04

Add inscriptis html extractor

9ee3685

Add integration tests

9835c23

grodino force-pushed the fix-html-extractor branch from 5daace8 to 9835c23 on July 19, 2022 12:57

seanmacavaney merged commit e3d8e1c into allenai:master Oct 19, 2022

grodino deleted the fix-html-extractor branch October 21, 2022 19:38
