GitHub - apmoore1/LU-Newshack: LU Team A newshack entry

Lancaster University Team A BBC NewsHack 2016 entry

This project is a small tool that examines news data (as downloaded from the BBC's CANDY API) and uses a number of simple techniques to align articles written on the same topic.

Approach

BBC news articles are written in various languages by independent teams. This means that no direct translations are available (unless machine translated), so journalists covering the same topic will do so with differing document structures and sources.

In order to align articles with differing content, we focus on features that summarise the topic. This means:

Entities
Verbs
Dates

We also tried the image path and some other features, but these proved ineffective. We restrict searches to the title, summary, and image alt text only. To allow the tool to operate on many languages, we translate the text before running the feature extractors.

Using these features we compute a similarity matrix for documents in the corpus, which we use to look up similar articles. These are then displayed in the sample web interface.
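As an illustration of the similarity step, here is a minimal sketch. It assumes each article has already been reduced to a set of extracted feature strings (entities, verbs, dates) and that Jaccard overlap is used as the score. The choice of measure, the features dictionary, and the function names are all assumptions made for this example; the scoring actually used in extract/ may differ.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(features):
    """Map every article URL to a dict of {other URL: score}."""
    matrix = {url: {} for url in features}
    for u, v in combinations(features, 2):
        score = jaccard(features[u], features[v])
        # The measure is symmetric, so fill both directions.
        matrix[u][v] = score
        matrix[v][u] = score
    return matrix


# Hypothetical feature sets, as if extracted from translated text.
features = {
    "/news/a": {"opec", "oil", "cut", "2016-03-11"},
    "/arabic/b": {"opec", "oil", "rise", "2016-03-11"},
    "/news/c": {"trump", "rally", "cancel"},
}
matrix = similarity_matrix(features)
```

A set-based measure like this ignores word order and document structure, which suits articles that cover the same topic but are written independently.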

Execution

The tool works in several stages:

Download a corpus of news data for several languages using the download/ tool.
Run the translation system in translate/ to create a translated corpus.
Annotate the translations using the tool in extract/ to create the similarity matrix.
Run the UI by passing the translated corpus and similarity matrix to the Flask app in ui/: python ui.py articles.json similarities.json
Visit http://localhost:5000 and enter a URL.
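The lookup performed once a URL is entered could look something like the sketch below. It assumes similarities.json holds a mapping from each article URL to a mapping of other URLs and their scores. That file layout and the most_similar helper are hypothetical, chosen only to illustrate the idea, and are not the repository's documented format.

```python
import json


def most_similar(similarities, url, top_n=5):
    """Return up to top_n (url, score) pairs for url, best match first."""
    scores = similarities.get(url, {})
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Assumed layout of similarities.json (hypothetical data).
similarities = json.loads("""
{
  "/news/a": {"/arabic/b": 0.6, "/news/c": 0.0, "/russian/d": 0.25}
}
""")

top = most_similar(similarities, "/news/a", top_n=2)
```

An unknown URL returns an empty list here, which a UI could report as "no similar articles found".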

Sample data

There are two corpora, gathered over the course of the hackathon, left in data/. To use them, you must find some URLs that existed at that time. The ones we gave as samples in our presentation are:

/news/business-35782239 (Oil)
/news/election-us-2016-35790460 (Trump)
/arabic/middleeast/2016/03/160312_zind_egypt_profile (Egypt)
