Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and add html extractor #201

Merged
merged 3 commits into from
Oct 19, 2022
Merged

Conversation

grodino
Copy link
Contributor

@grodino grodino commented Jul 19, 2022

Hi !

Lately, I tried using the HTML extractor wrapper for clueweb documents. When wrapping directly the corpus docstore, everything works fine but when composing vrappers (dataset -> html extractor -> cachedocstore), I had an error because the html docstore wrapper did not super().__init__ itself.

This PR thus does two things :

  • Init the HtmlDocExtractorDocStoreWrapper
  • Add a new html extractor based on inscriptis (it seem that empirically, there is less junk chars than the vanilla bs4 extractor)

Cheers,
Augustin

@seanmacavaney
Copy link
Collaborator

Thanks! The HtmlDocExtractorDocStoreWrapper doesn't get much attention, so thanks for the improvements and bug fix.

Mind adding a unit test for it?

@grodino
Copy link
Contributor Author

grodino commented Jul 19, 2022

Mind adding a unit test for it?

Done but I still couldn't run the tests myself.

I tried python -m test.integration.clueweb12 TestClueWeb12.test_clueweb12_docs_html and got an error

======================================================================
ERROR: test_clueweb12_docs_html (__main__.TestClueWeb12) [docs_iter split] (dataset=<ir_datasets.wrappers.html_extractor.HtmlDocExtractor object at 0x7f42cb444130>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/agodinot/experiments/ir_datasets/test/integration/base.py", line 38, in _test_docs
    self._assert_namedtuple(next(it[idx:idx+1]), doc)
  File "/home/agodinot/experiments/ir_datasets/ir_datasets/wrappers/html_extractor.py", line 33, in __next__
    return next(self.mapped_it)
StopIteration

As if the test case could not retrieve the documents :/ (I tried just using docstore.get_many() in the test case and it worked)

@seanmacavaney
Copy link
Collaborator

Thanks! I didn't have time to go over this today, but I'll look into it tomorrow.

@heinrichreimer
Copy link
Contributor

The code looks good to me 👍
The added dependency inscriptis is not too big (~40kb) and seems to be well-maintained.
Tests are passing. Let's get this merged 😉

@seanmacavaney
Copy link
Collaborator

Thanks for bumping this PR @heinrichreimer, and thanks @grodino for the contribution!

@seanmacavaney seanmacavaney merged commit e3d8e1c into allenai:master Oct 19, 2022
@grodino grodino deleted the fix-html-extractor branch October 21, 2022 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants