NewsCollector Usage Documentation

Elise Landman edited this page Feb 10, 2023 · 7 revisions

What is NewsCollector?

NewsCollector is a Python algorithm for automated collection and comparison of news articles, as well as auto-generation of an HTML newsletter from the collected content.

Why NewsCollector?

As the internet has grown, so has the number of information sources at our disposal. Nowadays, if you want to catch up on the most important news of the day, you have a vast variety of news sources to choose from. With that many sources available, instead of manually going through all the news content…

👉 With NewsCollector, we can let automation pick the top news stories from various newspapers for us and combine them into a single, good-looking newsletter.

👉 This lets us save valuable minutes of our time and makes sure we instantly get access to news that is actually relevant.

📕 Read more about how the algorithm of NewsCollector works in my Medium article.

How does NewsCollector work?

Basic Usage

You can run the NewsCollector algorithm as follows:

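from newscollector import NewsCollector

# Run the full pipeline with the package defaults
newsletter = NewsCollector()
output = newsletter.create()  # path of the generated HTML newsletter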

This runs the full NewsCollector pipeline: it scrapes the sources listed in the package default sources.json file and outputs the HTML newsletter. If it finds multiple articles from different sources covering similar topics, it considers them relevant and includes them in the newsletter.

The output variable holds the path of the generated newsletter, so that you can easily retrieve it programmatically:

output
> 'C:\\Output\\Path\\newsletter.html'

CLI Usage

NewsCollector can also be run directly from the command line with the following parameters:

newscollector.py [-h] [-s [SOURCES]] [-n [NEWS_NAME]] [-d [NEWS_DATE]] 
                 [-t [TEMPLATE]] [-o [OUTPUT_FILENAME]] [-a [AUTO_OPEN]]
                 [-r [RETURN_DETAILS]]
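
The bracketed form `[-s [SOURCES]]` is what Python's argparse prints for flags declared with `nargs='?'`, so the interface above could plausibly be declared as in the sketch below. This is an illustration only, not the parser shipped in newscollector.py: the short flags and value names come from the usage line, the defaults mirror the constructor parameters documented further down, and the `str_to_bool` helper and the treatment of a bare `-a` or `-r` as True are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret common textual spellings of a boolean flag value."""
    return value.strip().lower() in {"1", "true", "yes", "y"}


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction: only the short flags and value names
    # are taken from the usage line; everything else is assumed.
    parser = argparse.ArgumentParser(prog="newscollector.py")
    parser.add_argument("-s", dest="sources", nargs="?", default="sources.json")
    parser.add_argument("-n", dest="news_name", nargs="?", default="Daily News Update")
    parser.add_argument("-d", dest="news_date", nargs="?", default=None)
    parser.add_argument("-t", dest="template", nargs="?", default="newsletter.html")
    parser.add_argument("-o", dest="output_filename", nargs="?", default="default")
    # Assumption: a bare "-a" or "-r" (flag without a value) means True.
    parser.add_argument("-a", dest="auto_open", nargs="?", type=str_to_bool,
                        const=True, default=False)
    parser.add_argument("-r", dest="return_details", nargs="?", type=str_to_bool,
                        const=True, default=False)
    return parser


# Parse an example command line instead of sys.argv
args = build_parser().parse_args(["-n", "Morning Briefing", "-d", "2023-02-10", "-a"])
print(vars(args))
```

Each parsed attribute has the same name as the matching constructor parameter, so the result could be passed on with `NewsCollector(**vars(args))` if the real parser follows this naming.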

Output

The NewsCollector will output an HTML newsletter with the most relevant articles it found while scraping the sources provided.

View the full sample newsletter in PDF format here.

By default, the output newsletter is created as an HTML file in the installation directory of your package. It is saved in the rendered folder under the filename newsletter_YYYY-MM-DD.html, where the date is the date for which NewsCollector scraped its articles.

To adjust the default settings, please refer to Additional Parameters.

How to customize NewsCollector?

Additional Parameters

You can customize the NewsCollector algorithm with the following optional parameters:

from datetime import date
from newscollector import NewsCollector

newsletter = NewsCollector(sources='sources.json', news_name='Daily News Update', news_date=date.today(),
                           template='newsletter.html', output_filename='default', auto_open=False,
                           return_details=False)

| Parameter | Type | Default Value | Other Values |
|---|---|---|---|
| sources | str | 'sources.json' | filename of any JSON sources file |
| news_name | str | 'Daily News Update' | any str |
| news_date | str | date.today() | any string in date format 'YYYY-MM-DD' |
| template | str | 'newsletter.html' | filename of any HTML template |
| output_filename | str | 'newsletter_YYYY-MM-DD.html' | filename or path to be used for the output HTML |
| auto_open | bool | False | True |
| return_details | bool | False | True |

sources (str)

The input of NewsCollector is a JSON-formatted file with RSS news source links, which is processed in 3 major steps:

  • Scraping the RSS news sources
  • Comparing the scraped articles for similarity
  • Formatting and rendering the final newsletter

If NewsCollector finds multiple articles from different sources covering similar topics, it considers them relevant and includes them in the output newsletter.

By default, NewsCollector will use the package default sources.json source file.

Using custom sources

You can customize the input source file by adjusting the sources parameter.

The sources file must contain a JSON-formatted collection of RSS links from various online news providers. The more news sources you provide, the better NewsCollector can capture relevant articles. You can view the sample sources.json file here.
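
Before passing a custom file to the sources parameter, it can help to check that it parses as JSON and that every entry looks like a feed link. The sketch below does that under a loudly stated assumption: the schema (a JSON object mapping a provider name to one RSS URL) is hypothetical, so compare it with the sample sources.json file and adapt it to the real layout. The `load_sources` helper and the example.com feeds are placeholders, not part of NewsCollector.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def load_sources(path):
    """Load a sources file and check that every entry looks like a feed URL.

    Assumed (hypothetical) schema: {"provider name": "RSS feed URL", ...}.
    """
    with open(path, encoding="utf-8") as fh:
        sources = json.load(fh)
    if not isinstance(sources, dict):
        raise ValueError("expected a JSON object mapping provider -> feed URL")
    for name, url in sources.items():
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"{name!r} does not have a valid http(s) feed URL: {url!r}")
    return sources


# Placeholder feeds; replace them with real RSS links.
custom_sources = {
    "Example Daily": "https://example.com/daily/rss.xml",
    "Example Times": "https://example.org/times/feed",
}

# Write the file to a temporary folder so the sketch leaves no files behind
sources_path = Path(tempfile.mkdtemp()) / "my_sources.json"
sources_path.write_text(json.dumps(custom_sources, indent=2), encoding="utf-8")

loaded = load_sources(sources_path)
print(f"{len(loaded)} sources loaded from {sources_path.name}")
```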

news_name (str)

The title you would like to give to the output newsletter. By default, the title will be 'Daily News Update'.

news_date (str)

The date for which the newsletter should be generated. By default, NewsCollector scrapes articles published today.

An article is scraped if its publishedDate in the RSS feed matches news_date, so you can also generate a newsletter from articles published in the past. However, RSS feeds often contain only recent articles, so scraping past dates might not give satisfying results.
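
The date matching described above can be sketched with the standard library alone. This is not NewsCollector's own code: the `published_on` helper and the entry dictionaries are hypothetical, and the sketch assumes the feed reports dates in the RFC 822 style that RSS 2.0 uses.

```python
from datetime import date
from email.utils import parsedate_to_datetime


def published_on(entries, news_date):
    """Keep only the entries whose publication date equals news_date."""
    wanted = date.fromisoformat(news_date)  # 'YYYY-MM-DD' -> date
    kept = []
    for entry in entries:
        # Example of the expected format: 'Fri, 10 Feb 2023 22:24:58 GMT'
        published = parsedate_to_datetime(entry["publishedDate"])
        if published.date() == wanted:
            kept.append(entry)
    return kept


# Hypothetical feed entries, reduced to the fields the filter needs
entries = [
    {"title": "Article A", "publishedDate": "Fri, 10 Feb 2023 22:24:58 GMT"},
    {"title": "Article B", "publishedDate": "Thu, 09 Feb 2023 08:00:00 GMT"},
]
todays = published_on(entries, "2023-02-10")
print([entry["title"] for entry in todays])
```

The comparison uses the calendar date in the time zone stated by the feed, so an article published shortly before midnight in another time zone can land on a different day than you expect.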

template (str)

The HTML template NewsCollector will use to generate the output HTML newsletter. By default, NewsCollector will use the package default newsletter.html template.

Using a custom template

You can customize NewsCollector to render with your own newsletter template. The templates used by NewsCollector must be located within the \templates folder of the working directory (see Path Structure).

A custom HTML template should include standardized variable placeholders. When rendering the newsletter, NewsCollector fills these placeholders with the relevant article content it found. Variable placeholders must be surrounded by double curly brackets {{ }}, the same syntax used by templates rendered with Flask's render_template() function. You can read more about this in Flask's documentation.

Below is the list of required placeholder variables:
{{news_name}}, {{news_date}}, {{sourceXX}}, {{urlXX}}, {{picXX}}, {{titleXX}}, {{bodyXX}}, {{clusterXX_Y_source}} and {{clusterXX_Y_url}}

where:
  • XX ranges from 00 to 05
  • Y ranges from 0 to 2

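This means you will need to include a total of 68 placeholder variables: 2 for the name and date, 6 × 5 = 30 for the article fields, and 6 × 3 × 2 = 36 for the cluster fields.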
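
The full list of names can be generated from the two ranges, which is handy for checking a custom template before using it. The sketch below is a standalone helper, not part of NewsCollector: `required_placeholders` and `missing_placeholders` are hypothetical names, and the check only looks for the `{{ name }}` pattern in the template text.

```python
import re


def required_placeholders():
    """Build the placeholder names from the documented XX and Y ranges."""
    names = ["news_name", "news_date"]
    for xx in range(6):                      # XX: 00 .. 05
        for field in ("source", "url", "pic", "title", "body"):
            names.append(f"{field}{xx:02d}")
        for y in range(3):                   # Y: 0 .. 2
            names.append(f"cluster{xx:02d}_{y}_source")
            names.append(f"cluster{xx:02d}_{y}_url")
    return names


def missing_placeholders(template_text):
    """Return the required names that never appear as {{ name }} in the text."""
    found = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template_text))
    return [name for name in required_placeholders() if name not in found]


names = required_placeholders()
print(len(names))  # 2 + 6*5 + 6*3*2 = 68

# A tiny template fragment that only contains two of the placeholders
snippet = "<h1>{{news_name}}</h1><p>{{ news_date }}</p>"
print(len(missing_placeholders(snippet)))  # 66
```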

As an example, you can view the default newsletter.html template here.

output_filename (str)

The filename or path of the output HTML newsletter.

By default, the output newsletter is created as an HTML file in the installation directory of your package. It is saved in the rendered folder under the filename newsletter_YYYY-MM-DD.html, where the date is the date for which NewsCollector scraped its articles.

Using a custom filename or path

You can customize the output filename, as well as the output location of the newsletter, by setting the output_filename parameter to any string formatted as a path. output_filename has to end in .html so that the HTML newsletter can be properly rendered. Below are some examples of valid filenames and paths:

'C:\\Output\\Path\\newsletter.html'
'custom_filename.html'

If the given output path does not exist, the required folder will be created by NewsCollector before it saves the output newsletter there.
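
The behaviour described in this section (default filename, .html requirement, creating missing folders) can be sketched with pathlib as follows. This is an illustration under assumptions, not NewsCollector's implementation: `resolve_output_path` and `base_dir` are hypothetical, and resolving relative names against the package directory is a guess.

```python
import tempfile
from datetime import date
from pathlib import Path


def resolve_output_path(output_filename, news_date, base_dir):
    """Return the path to write to, creating missing parent folders."""
    if output_filename == "default":
        # Documented default: rendered\newsletter_YYYY-MM-DD.html
        path = Path(base_dir) / "rendered" / f"newsletter_{news_date.isoformat()}.html"
    else:
        path = Path(output_filename)
        if not path.is_absolute():
            # Assumption: relative names are resolved against base_dir
            path = Path(base_dir) / path
    if path.suffix.lower() != ".html":
        raise ValueError(f"output_filename must end in .html, got {path.name!r}")
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # stand-in for the package directory
default_path = resolve_output_path("default", date(2023, 2, 10), base)
custom_path = resolve_output_path("archive/2023/custom_filename.html", date(2023, 2, 10), base)
print(default_path.name)
print(custom_path.parent.is_dir())
```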

auto_open (bool)

Choose whether NewsCollector should open the output newsletter in your browser after it has finished. By default, it is set to False.

return_details (bool)

Choose whether NewsCollector should return more detailed data about the article clusters it found. By default, it is set to False.

Set return_details to True and NewsCollector will additionally return the article clusters it found, as well as the clusters it featured in the newsletter.

from newscollector import NewsCollector

newsletter = NewsCollector(return_details=True)
# With return_details=True, create() returns three values instead of one
output, clusters, featured_clusters = newsletter.create()

clusters contains all article clusters found by NewsCollector. featured_clusters is the filtered subset: it includes only clusters that contain articles from at least 2 different news sources, i.e. the clusters considered more relevant.

Use clusters and featured_clusters to include the raw NewsCollector output in your own processing steps or front end.

clusters
> {0: [{'source': 'CNN',
        'url': 'https://www.cnn.com/politics/live-new...',
        'date': '2023-02-10',
        'time': '22:24:58 UTC',
        'title': 'Stocks fall as the News was announc...',
        'body': 'As this afternoon was announced, the...',
         ... },
       {'source': 'CNBC',
        'url': 'https://www.cnbc.com/2023/02/10/us-sh...',
        'date': '2023-02-10',
         ... }],
   1: ... }
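
The rule that turns clusters into featured_clusters (at least 2 different news sources per cluster) can be reproduced on your own data. The sketch below is not NewsCollector's code: `select_featured` is a hypothetical helper, and the toy dictionary only imitates the shape of the sample output above with most fields trimmed.

```python
def select_featured(clusters, min_sources=2):
    """Keep clusters whose articles come from at least min_sources providers."""
    featured = {}
    for cluster_id, articles in clusters.items():
        distinct_sources = {article["source"] for article in articles}
        if len(distinct_sources) >= min_sources:
            featured[cluster_id] = articles
    return featured


# Toy data shaped like the sample output (fields trimmed)
clusters = {
    0: [{"source": "CNN", "title": "Story A"}, {"source": "CNBC", "title": "Story A, again"}],
    1: [{"source": "CNN", "title": "Story B"}],
    2: [{"source": "CNBC", "title": "Story C"}, {"source": "CNBC", "title": "Story C update"}],
}
featured_clusters = select_featured(clusters)
print(sorted(featured_clusters))  # [0]
```

Cluster 2 is dropped even though it holds two articles, because both come from the same source.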

Path Structure

NewsCollector relies on a specific path structure when generating its newsletter. The structure should look as follows:

- news-collector\
  newscollector.py
  - templates\
    newsletter.html
  - static\
    - css\
      newsletter.css
    - assets\
      logo.png
  - rendered\
    newsletter_YYYY-MM-DD.html

Whether you use the default newsletter.html template or use a custom template, it must be located within the \templates directory as this is the folder NewsCollector will use to search for the templates.

When using the default template, the newsletter.css file and the logo.png file must be located within the static\ folder, under css\ and assets\ respectively.


❤️ Open Source