New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New feature allowing working with Wikinews media dumps #219
Conversation
I only added support for Wikinews as I wasn't sure if my modifications to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks for the clean and detailed PR!
@@ -51,6 +51,13 @@ | |||
'de': 'Kategorie:' | |||
} | |||
|
|||
PROJECTS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the project name aliases purely for developer convenience, or are they "official" in some sense? If the former, I think I'd prefer simplicity and just use the "wiki"
and "wikinews"
names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the aliases are non-official and simply for convenience. I can add a commit which removes them if you prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's best. It's nice to have that 1:1 link between project name and file name. Thank you! So this can just be PROJECTS = {"wiki", "wikinews"}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my original change was exactly that, remove the aliases and convert the dict to a set. However, when creating error messages for 404s I thought it was better to implement as
PROJECTS = {'wikipedia': 'wiki', 'wikinews': 'wikinews'}
PROJECTS_INV = {v, k for k, v in PROJECTS.items()}
and rather than have the default value for project be 'wiki', instead have it be 'wikipedia' but self.project
would get set to 'wiki via the lookup in PROJECTS. Then in the case of a 404 error I can report that {self.lang} variant of {PROJECTS_INV[self.project]}
doesn't exist, basically I think 'foo' variant of 'wikipedia' doesn't exist
is kinder to the user than 'foo' variant of 'wiki'
doesn't exist. I'll push out the commit and you can see what you think.
textacy/datasets/wikipedia.py
Outdated
version=self.version, lang=self.lang) | ||
proj = PROJECTS.get(project, None) | ||
if proj is None: | ||
raise AttributeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a ValueError
makes more sense here, especially from a user's perspective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine with me
project=project | ||
)) | ||
self.project = proj | ||
self.filestub = '{lang}{project}/{version}/{lang}{project}-{version}-pages-articles.xml.bz2'.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a spot-check, and it looks like this filestub format still works for wikinews in various languages. I'm thinking the "check URL via HEAD request" idea you mentioned is probably worth doing, but that can be added separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add this feature (as well as the other changes you suggested above) in a commit. This will be more robust than merely relying on both my/your spot-checking of the URLs it produces as it can return an error to the user in the event of a HEAD 404. In the event of a 404 I'll additionally backoff from the data dump file URL to the {lang}{project} root URL and verify that the project for that language exists and report a different message if it doesn't.
merging upstream changes to my fork
Thanks for the updates — this looks good to me! |
Great! Thanks! |
I added a
project
named param with default value 'wiki' to the Wikipedia constructor, which allows one to specify a wikimedia project other than 'wikipedia'. This PR only adds support for 'wikinews'.Description
file_stub
format is now'{lang}{project}/{version}/{lang}{project}-{version}-pages-articles.xml.bz2'
Motivation and Context
I wanted to work with Wikinews data dumps. Other packages exist for extracting the page text but this project already performs category extraction, which is what I needed, so simply adding a simple feature to this project rather than a more complicated one to another project seemed to be my best option.
How Has This Been Tested?
file_stub
value were still sane, and that my exception handling for unhandled values for the newproject
named param worked._verify_url(url)
and calling it indownload()
just prior to the actual downloading of the URL, which would simply do a HEAD request, catch a 404 and then backoff to see if BASE_URL/{lang}{project} existed to determine what error message to provide in the case that whatever mediawiki project doesn't exist for a particular language, but figured I should talk with someone about this first. This would provide a function which could be added to a unit test though.Screenshots (if appropriate):
Types of changes
Checklist: