
Import Wikipedia abstracts #11

Open

sylvinus opened this issue Feb 22, 2016 · 6 comments

sylvinus commented Feb 22, 2016

We will end up importing the full Wikipedia dumps soon enough; however, a simple first step would be to import just the abstracts once #10 is finished.

That could allow us to add good descriptions, possibly of better quality than DMOZ. That would be helpful for commonsearch/cosr-results#1 for instance.

We could also possibly start including all Wikipedia URLs in the results, even if we haven't indexed their whole content yet.

https://meta.wikimedia.org/wiki/Data_dumps

There seems to be a combined abstract.xml file ("Recombine extracted page abstracts for Yahoo"). Is this the one we should use?
https://dumps.wikimedia.org/enwiki/20160204/
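If we do go with that file, a minimal Python sketch of streaming it could look like the following; the `<doc>`/`<title>`/`<url>`/`<abstract>` element names are assumed from the dump's layout, so treat this as illustrative only:

```python
# Minimal sketch: stream entries out of the combined abstract dump without
# loading it fully in memory. The <doc>/<title>/<url>/<abstract> element
# names are assumed from the dump's layout.
import xml.etree.ElementTree as ET

def iter_abstracts(path):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "doc":
            yield {
                "title": elem.findtext("title"),
                "url": elem.findtext("url"),
                "abstract": elem.findtext("abstract") or "",
            }
            elem.clear()  # free already-processed subtrees

if __name__ == "__main__":
    for doc in iter_abstracts("enwiki-20160204-abstract.xml"):
        print(doc["url"], "->", doc["abstract"][:80])
        break
```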

Tpt commented Mar 8, 2016

The "Recombine extracted page abstracts for Yahoo" quality seems very low (a lot of summary are actually infobox invocations) and is polluted with redirection. And I don't manage to find an other dump with article abstracts.

The relevant REST API call to get the summary and image for an article is the /page/summary operation: https://rest.wikimedia.org/fr.wikipedia.org/v1/?doc#/
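For a quick test, here is a minimal sketch of calling that operation; it assumes the {lang}.wikipedia.org/api/rest_v1/... host layout (adjust the URL if you target the rest.wikimedia.org deployment linked above) and the "extract"/"thumbnail" fields the endpoint returns:

```python
# Minimal sketch: fetch the plain-text summary and thumbnail for one article
# via the REST API's /page/summary operation. Host/path follow the
# {lang}.wikipedia.org/api/rest_v1/... layout; adjust if targeting the
# rest.wikimedia.org deployment linked above.
import urllib.parse
import requests

def fetch_summary(title, lang="en"):
    url = "https://%s.wikipedia.org/api/rest_v1/page/summary/%s" % (
        lang, urllib.parse.quote(title, safe=""))
    resp = requests.get(url, headers={"User-Agent": "cosr-abstract-test"})
    resp.raise_for_status()
    data = resp.json()
    # "extract" holds the plain-text summary; "thumbnail" is optional.
    return data.get("extract"), data.get("thumbnail", {}).get("source")

print(fetch_summary("Anarchism"))
```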


sylvinus commented Mar 8, 2016

Thanks for looking into it! Too bad for the abstracts dump.

We can't really use the REST API because we want all articles at once.

I guess we'll have to use the larger dumps. The solutions I can think of:

- The full XML dumps
- The ZIM dumps

Any other ideas?


wumpus commented Jul 12, 2016

I've downloaded the SQL dumps and used them directly; they're quick to download and not so hard to parse, although you do need to get smart about the HTML that's in the "text". (I can contribute Perl code that does it...) blekko got pretty good "abstracts" by just extracting the first couple of sentences.
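For illustration, a rough Python sketch of that "first couple of sentences" approach; the regexes are only a crude stand-in for real wikitext/HTML handling:

```python
# Rough sketch of the "first couple of sentences" abstract trick: strip tags,
# collapse whitespace, then cut at sentence boundaries. Crude on purpose;
# real dump text needs a proper wikitext/HTML parser.
import re

def naive_abstract(text, max_sentences=2, max_chars=300):
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return " ".join(sentences[:max_sentences])[:max_chars]

print(naive_abstract("<p>Anarchism is a political philosophy. "
                     "It advocates self-governed societies. More text.</p>"))
```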

sylvinus commented

There has been some recent activity on Wikimedia's side on the HTML dumps: https://phabricator.wikimedia.org/T133547

I hope it goes through so that we can index not only the abstracts but the whole articles as well.

In the meantime, either the SQL or the ZIM dumps seem like the best bets. @wumpus, feel free to share your Perl code; I'd be curious to see the hacks you did :)


wumpus commented Jul 12, 2016

Hack alert!

extract-raw-wikipedia-articles.pl.txt

sylvinus commented

Hack alert indeed!! It's almost as scary as I imagined ;-)

I just merged support for pluggable document sources, so it should be very easy to write one that reads the .zim files and see how it behaves.
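To make that concrete, here is a purely hypothetical sketch of what such a source could look like; the DocumentSource base class, iter_documents() method and read_zim_articles() helper are made-up names (not the actual cosr-back interface), and the ZIM parsing itself is stubbed out:

```python
# Hypothetical sketch of a pluggable document source backed by a .zim file.
# DocumentSource, iter_documents() and read_zim_articles() are made-up names
# for illustration only; the ZIM decoding is intentionally left as a stub.
class DocumentSource(object):
    """Anything that can yield (url, title, html) tuples to the indexer."""
    def iter_documents(self):
        raise NotImplementedError

def read_zim_articles(path):
    """Stub: would iterate over article entries of the ZIM archive at `path`
    and yield (url, title, html) tuples, e.g. via a libzim binding."""
    raise NotImplementedError

class ZimDocumentSource(DocumentSource):
    def __init__(self, path):
        self.path = path

    def iter_documents(self):
        for url, title, html in read_zim_articles(self.path):
            # Filtering (article namespace only, skip redirects) would go here.
            yield url, title, html
```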
