Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature allowing working with Wikinews media dumps #219

Merged
merged 6 commits into from Jan 8, 2019
Merged

New feature allowing working with Wikinews media dumps #219

merged 6 commits into from Jan 8, 2019

Conversation

ckot
Copy link
Contributor

@ckot ckot commented Jan 3, 2019

I added a project named param with default value 'wiki' to the Wikipedia constructor, which allows one to specify a wikimedia project other than 'wikipedia'. This PR only adds support for 'wikinews'.

Description

  • added a PROJECTS dictionary to allow lookup of values, allowing project aliases as well, and added exception handling for bad values
  • actually a simple change. file_stub format is now '{lang}{project}/{version}/{lang}{project}-{version}-pages-articles.xml.bz2'

Motivation and Context

I wanted to work with Wikinews data dumps. Other packages exist for extracting the page text but this project already performs category extraction, which is what I needed, so simply adding a simple feature to this project rather than a more complicated one to another project seemed to be my best option.

How Has This Been Tested?

  • As the feature is quite simple, I merely manually verified that the URLs generated by the modified file_stub value were still sane, and that my exception handling for unhandled values for the new project named param worked.
  • I did think about defining _verify_url(url) and calling it in download() just prior to the actual downloading of the URL, which would simply do a HEAD request, catch a 404 and then backoff to see if BASE_URL/{lang}{project} existed to determine what error message to provide in the case that whatever mediawiki project doesn't exist for a particular language, but figured I should talk with someone about this first. This would provide a function which could be added to a unit test though.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • [x ] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [ x] My code follows the code style of this project.
  • [ x] My change requires a change to the documentation, and I have updated it accordingly.

@ckot
Copy link
Contributor Author

ckot commented Jan 3, 2019

I only added support for Wikinews as I wasn't sure if my modifications to file_stub were robust enough to handle other Wikimedia projects. I suppose if it isn't, a _gen_file_stub(), which does project-specific branching to compute file_stub could be added

@ckot ckot changed the title Feat wikiprojects New feature allowing working with Wikinews media dumps Jan 3, 2019
Copy link
Collaborator

@bdewilde bdewilde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for the clean and detailed PR!

@@ -51,6 +51,13 @@
'de': 'Kategorie:'
}

PROJECTS = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the project name aliases purely for developer convenience, or are they "official" in some sense? If the former, I think I'd prefer simplicity and just use the "wiki" and "wikinews" names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the aliases are non-official and simply for convenience. I can add a commit which removes them if you prefer

Copy link
Collaborator

@bdewilde bdewilde Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's best. It's nice to have that 1:1 link between project name and file name. Thank you! So this can just be PROJECTS = {"wiki", "wikinews"}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my original change was exactly that, remove the aliases and convert the dict to a set. However, when creating error messages for 404s I thought it was better to implement as

PROJECTS = {'wikipedia': 'wiki', 'wikinews': 'wikinews'}
PROJECTS_INV = {v, k for k, v in PROJECTS.items()}

and rather than have the default value for project be 'wiki', instead have it be 'wikipedia' but self.project would get set to 'wiki via the lookup in PROJECTS. Then in the case of a 404 error I can report that {self.lang} variant of {PROJECTS_INV[self.project]} doesn't exist, basically I think 'foo' variant of 'wikipedia' doesn't exist is kinder to the user than 'foo' variant of 'wiki' doesn't exist. I'll push out the commit and you can see what you think.

version=self.version, lang=self.lang)
proj = PROJECTS.get(project, None)
if proj is None:
raise AttributeError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a ValueError makes more sense here, especially from a user's perspective?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine with me

project=project
))
self.project = proj
self.filestub = '{lang}{project}/{version}/{lang}{project}-{version}-pages-articles.xml.bz2'.format(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a spot-check, and it looks like this filestub format still works for wikinews in various languages. I'm thinking the "check URL via HEAD request" idea you mentioned is probably worth doing, but that can be added separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this feature (as well as the other changes you suggested above) in a commit. This will be more robust than merely relying on both my/your spot-checking of the URLs it produces as it can return an error to the user in the event of a HEAD 404. In the event of a 404 I'll additionally backoff from the data dump file URL to the {lang}{project} root URL and verify that the project for that language exists and report a different message if it doesn't.

@bdewilde
Copy link
Collaborator

bdewilde commented Jan 8, 2019

Thanks for the updates — this looks good to me!

@bdewilde bdewilde merged commit db63765 into chartbeat-labs:master Jan 8, 2019
@ckot
Copy link
Contributor Author

ckot commented Jan 10, 2019

Great! Thanks!

@ckot ckot deleted the feat-wikiprojects branch January 11, 2019 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants