GitHub - frnsys/focusgroup: build ground-truth news event clusters by sampling wikinews dumps

This package digests WikiNews pages-articles XML dumps to assemble evaluation data for clustering.

It takes a WikiNews page with at least n cited sources and treats it as an "event" whose cited sources are its member articles. This data is saved to MongoDB and can later be used as ground-truth clusters.

You can download the latest pages-articles dump at http://dumps.wikimedia.org/enwikinews/latest/.

Setup & usage

Install the requirements:

$ pip install -r requirements.txt

To use it, run:

$ python run.py

That will parse the pages, and for any page that has over n cited sources (default n=3), it will fetch the article data for those sources and save everything to MongoDB.
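As a rough illustration of the source-counting step, here is a minimal sketch. It is not the code in run.py. It assumes that WikiNews pages cite their sources with the `{{source|url=...|title=...}}` template, and the names `extract_source_urls` and `is_candidate_event` are hypothetical. It uses "over n" as described above; switch the comparison to `>=` if "at least n" is what you want.

```python
import re

# Assumption: sources are cited with the {{source|url=...|...}} wikitext template.
SOURCE_URL_RE = re.compile(
    r"\{\{\s*source\s*\|[^{}]*?\burl\s*=\s*([^|}\s]+)",
    re.IGNORECASE,
)


def extract_source_urls(wikitext):
    """Return the URL of every {{source}} template found in a page's wikitext."""
    return SOURCE_URL_RE.findall(wikitext)


def is_candidate_event(wikitext, n=3):
    """True if the page cites more than n sources."""
    return len(extract_source_urls(wikitext)) > n
```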

Some heuristics are used to try and build quality clusters:

only whitelisted sources are considered
source articles should be published within 3 days of each other (sometimes WikiNews entries will refer to older stories)
articles of 400 characters or fewer are skipped - bodies this short usually indicate that we've hit a 404 page, and articles this short are generally uninteresting anyway
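The heuristics above could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual implementation: the whitelist entries, the article dict keys (`url`, `body`, `published`), and the choice to reject a whole event when its articles span more than 3 days are all hypothetical.

```python
from datetime import timedelta
from urllib.parse import urlparse

WHITELIST = {"bbc.co.uk", "reuters.com"}  # hypothetical entries
MIN_BODY_CHARS = 400                      # bodies of this length or less are skipped
MAX_SPREAD = timedelta(days=3)            # max gap between earliest and latest article


def domain(url):
    """Return the host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def filter_articles(articles):
    """Apply the three heuristics to one event's articles.

    Each article is assumed to be a dict with 'url', 'body' and
    'published' (a datetime). Returns the kept articles, or an empty
    list if they were published too far apart.
    """
    kept = [
        a for a in articles
        if domain(a["url"]) in WHITELIST and len(a["body"]) > MIN_BODY_CHARS
    ]
    if not kept:
        return []
    dates = [a["published"] for a in kept]
    # Assumption: an event whose articles span more than 3 days is dropped entirely.
    if max(dates) - min(dates) > MAX_SPREAD:
        return []
    return kept
```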

Then you can export that data, e.g.:

$ mongoexport -d focusgroup -c event --jsonArray -o ~/Desktop/sample_events.json
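With `--jsonArray`, mongoexport writes a single JSON array of documents, so the export can be read with the standard `json` module. The sketch below turns such a file into parallel lists of texts and ground-truth cluster labels. The document schema is an assumption: the `members` and `body` keys are hypothetical defaults, so check an exported event and adjust them to match.

```python
import json


def flatten_events(events, members_key="members", text_key="body"):
    """Turn a list of event documents into (texts, labels).

    Every article gets the index of its event as its cluster label.
    The key names are assumptions about the exported schema.
    """
    texts, labels = [], []
    for label, event in enumerate(events):
        for article in event.get(members_key, []):
            texts.append(article[text_key])
            labels.append(label)
    return texts, labels


def load_clusters(path, members_key="members", text_key="body"):
    """Load a mongoexport --jsonArray file and flatten it."""
    with open(path, encoding="utf-8") as f:
        events = json.load(f)
    return flatten_events(events, members_key, text_key)
```

The resulting `labels` list can be compared against a clustering algorithm's output on `texts` to score it.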

You may need to do a bit of manual cleaning afterwards. The included preview.py script randomly grabs an event and prints out its articles' bodies. Sometimes the content extraction algorithm pulls out the wrong content, or extra cruft gets pulled in (social media prompts, advertisement captions, etc.); running the preview script a few times can help you build heuristics to better clean the data. Or maybe you want to keep it in! Who knows.
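One possible shape for such a cleaning heuristic is a list of line-level patterns that gets extended as new cruft turns up in the preview output. The patterns and the `strip_cruft` name below are purely illustrative and are not part of the package.

```python
import re

# Hypothetical patterns; extend these based on what the preview script shows.
CRUFT_PATTERNS = [
    re.compile(r"^\s*(share|follow us|like us)\b", re.IGNORECASE),
    re.compile(r"^\s*advertisement\s*$", re.IGNORECASE),
]


def strip_cruft(body, patterns=CRUFT_PATTERNS):
    """Drop every line of an article body that matches a cruft pattern."""
    kept = [
        line for line in body.splitlines()
        if not any(p.search(line) for p in patterns)
    ]
    return "\n".join(kept)
```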

