felixludos/babel-briefings

No Nonsense News

Scraping and Presenting World News Headlines

See the final product: No Nonsense News Headlines

For more information about the project visit the project page

Here are a few scripts that scrape news headlines from all over the world using News API, translate them with the Helsinki-NLP Marian machine translation models (via HuggingFace), and display the articles on a Notion page, all from the comfort of Python.

Usage

Aside from the requirements in the requirements.txt file, you also need a News API key (which you can get by signing up for free at News API). Additionally, if you wish to present the scraped articles with Notion, you need a Notion account and its corresponding token (which can be found like this).
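As a rough illustration of what the News API key is used for, the sketch below builds the query parameters for one top-headlines request per country and category. The endpoint and the parameter names (country, category, apiKey) follow the public News API documentation; the environment variable name NEWS_API_KEY, the helper build_headline_queries, and the example country codes are assumptions made for this example and are not taken from the project's scripts.

```python
import itertools
import os

# Public News API endpoint for top headlines.
TOP_HEADLINES_URL = "https://newsapi.org/v2/top-headlines"

# The categories accepted by the top-headlines endpoint.
CATEGORIES = ["business", "entertainment", "general", "health",
              "science", "sports", "technology"]


def build_headline_queries(countries, categories=CATEGORIES, api_key=None):
    """Return one parameter dict per (country, category) pair."""
    # NEWS_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("NEWS_API_KEY", "")
    return [
        {"country": country, "category": category, "apiKey": key}
        for country, category in itertools.product(countries, categories)
    ]
```

Each dict could then be passed as the query string of a GET request to TOP_HEADLINES_URL, for example via the params argument of requests.get. The number of dicts grows as countries times categories, which is why a full scrape needs a few hundred requests.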

This project heavily relies on omni-fig for organizing the scripts and config files.

The recommended way to update (scrape, translate, and present) all articles is to first replace the link in the config file config/usual.yaml with a link to one of your Notion pages, where the table of articles will be created (an empty page is recommended).

After that, you only need to run:

python main.py usual

Or equivalently,

fig nnn usual

Due to the high volume of requests sent to the Notion server, you may receive a 504 error while uploading the articles to the Notion page. If this occurs, wait 30 seconds to a minute, and then resume the upload with the following command:

python main.py usual --resume
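The wait-and-resume routine above can be automated with a small wrapper. This is only a sketch under stated assumptions: it assumes main.py exits with a non-zero code when the upload fails (which is not confirmed by this README), it retries on any failure rather than on a 504 specifically, and the names run_update and update_with_retries are hypothetical helpers, not part of the project.

```python
import subprocess
import sys
import time


def run_update(config="usual", resume=False):
    """Run main.py once and return its exit code."""
    cmd = [sys.executable, "main.py", config]
    if resume:
        cmd.append("--resume")
    return subprocess.run(cmd).returncode


def update_with_retries(run=run_update, max_attempts=5, wait_seconds=45,
                        sleep=time.sleep):
    """Run the update, resuming after a pause whenever a run fails.

    Returns True once a run exits with code 0, False if every attempt fails.
    Assumption: a failed upload makes the script exit with a non-zero code.
    """
    for attempt in range(max_attempts):
        # Only the first attempt starts from scratch; later ones resume.
        if run(resume=attempt > 0) == 0:
            return True
        # Pause between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(wait_seconds)
    return False
```

Calling update_with_retries() from the repository root would run the full update and then retry with --resume up to four more times, waiting 45 seconds between attempts.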

Each of the three steps can be done separately using:

fig scrape-news usual
fig sanitize-news usual
fig present-notion usual

(See omni-fig and/or the code for more information about the available arguments.)

Performance

Scraping all top-headline articles for all countries and all categories requires about 350 requests to the News API, takes less than ten minutes, and collects around 1700-2000 articles. Thanks to parallelism, formatting and translating the articles is significantly faster and takes around two minutes. Presenting the articles on Notion is more problematic: with parallelism the upload could finish in less than five minutes, but the volume of requests overloads the Notion server, so the number of workers must be decreased. In practice, a full update takes around 15-20 minutes, although the last step may require a few tries to coax the Notion servers into accepting all requests and displaying the articles.

Nevertheless, I reckon this performance is sufficient for a common use case: you run the script while making breakfast, and by the time you are back at your computer, over a thousand headlines from all over the world are there to greet you. (A perhaps more important use case focuses on the scraping and formatting steps, which provide a dataset of headlines from all over the world for various NLP settings and analyses.)
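The trade-off between parallelism and server load described in this section can be sketched with a bounded thread pool. This is a minimal illustration, not the project's actual upload code: upload_all and its upload_one callback are hypothetical names, and the default of two workers is an arbitrary value chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(articles, upload_one, max_workers=2):
    """Upload articles with at most `max_workers` requests in flight.

    `upload_one` is a caller-supplied function that sends one article;
    a smaller `max_workers` trades speed for fewer simultaneous requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order and re-raises the first
        # exception from a worker when its result is consumed.
        return list(pool.map(upload_one, articles))
```

Lowering max_workers reduces the number of requests hitting the server at once, at the cost of a longer total upload time.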

About

Dataset of 4.7M News Headlines from around the world in 30 Languages
