<a href="https://www.kaggle.com/code/gpreda/bbc-news-rss-feeds?scriptVersionId=136914430" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook collects data from the BBC News RSS feeds.

The notebook is run periodically to collect new data.
The existing data (read from the database) is merged with the new data, and duplicates are removed.
The updated data is then saved as a new version of the database.

We also show how to use Neptune.ai with Kaggle.

# Install and import packages

In [1]:
!pip3 install requests_html

Collecting requests_html
  Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting pyppeteer>=0.0.14
  Downloading pyppeteer-1.0.2-py3-none-any.whl (83 kB)
Collecting w3lib
  Downloading w3lib-2.1.1-py3-none-any.whl (21 kB)
Collecting parse
  Downloading parse-1.19.1-py2.py3-none-any.whl (18 kB)
Collecting pyquery
  Downloading pyquery-2.0.0-py3-none-any.whl (22 kB)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting fake-useragent
  Downloading fake_useragent-1.1.3-py3-none-any.whl (50 kB)
Collecting pyee<9.0.0,>=8.1.0
  Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting cssselect>=1.2.0
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... 

In [2]:
!pip3 install neptune-client==1.2.0

Collecting neptune-client==1.2.0
  Downloading neptune_client-1.2.0-py3-none-any.whl (448 kB)
Collecting swagger-spec-validator>=2.7.4
  Downloading swagger_spec_validator-3.0.3-py2.py3-none-any.whl (27 kB)
Collecting bravado<12.0.0,>=11.0.0
  Downloading bravado-11.0.3-py2.py3-none-any.whl (38 kB)
Collecting monotonic
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting bravado-core>=5.16.1
  Downloading bravado_core-5.17.1-py2.py3-none-any.whl (67 kB)
Collecting jsonref
  Downloading jsonref-1.1.0-py3-none-any.whl (9.4 kB)
Collecting rfc3987
  Downloading rfc3987-1.3.8-py2.py3-none-any.whl (13 kB)
Collecting uri-template
  Downloading uri_template-1.3.0-py3-none-any.whl (11 kB)
Collecting jsonpointer>1.13
  Downloading jsonpointer-2.4-py2.py3-none-any.whl (7.8 kB)
Collecting isoduration
  Downloading isoduration-2

In [3]:
import requests
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import neptune  # `neptune.new` is a deprecated alias in neptune-client 1.x
from kaggle_secrets import UserSecretsClient



# RSS Feed Parsing Functions

In [4]:
def get_html_source(url):
    """
    Return the HTTP response for the provided URL.
    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
            or None if the request failed.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as ex:
        # Report the failure and signal it to the caller explicitly
        print(ex)
        return None

In [5]:
def get_rss_feed(url):
    """
    Return a Pandas dataframe containing the RSS feed contents.
    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python
    Modified to use BeautifulSoup (bs4)

    Args:
        url (string): URL of the RSS feed to read.

    Returns:
        df (dataframe): Pandas dataframe containing the RSS feed contents.
    """
    columns = ['title', 'pubDate', 'guid', 'link', 'description']

    response = get_html_source(url)
    if response is None:
        # The request failed; return an empty dataframe with the expected columns
        return pd.DataFrame(columns=columns)

    rows = []
    with response as r:
        # we use BeautifulSoup with `lxml-xml` type to parse the rss feed
        soup = BeautifulSoup(r.text, 'lxml-xml')
        items = soup.find_all('item')

        for item in items:
            try:
                # item.find() returns None for a missing tag, so .text raises AttributeError
                rows.append({column: item.find(column).text for column in columns})
            except AttributeError as ex:
                print(ex)
                continue

    # Build the dataframe once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=columns)
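
To make the item-extraction step easier to follow without network access, here is a minimal sketch that parses a small inline RSS document. It swaps in the standard library's `xml.etree.ElementTree` in place of BeautifulSoup purely for illustration; `SAMPLE_RSS`, `FIELDS`, and `parse_items` are hypothetical names, and the feed content is made up.

```python
import xml.etree.ElementTree as ET

# Made-up feed with the same item fields the notebook extracts.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <pubDate>Sun, 16 Jul 2023 01:32:07 GMT</pubDate>
      <guid>https://example.com/news/1</guid>
      <link>https://example.com/news/1?src=rss</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Story without a description</title>
      <pubDate>Sun, 16 Jul 2023 02:00:00 GMT</pubDate>
      <guid>https://example.com/news/2</guid>
      <link>https://example.com/news/2?src=rss</link>
    </item>
  </channel>
</rss>"""

FIELDS = ["title", "pubDate", "guid", "link", "description"]


def parse_items(xml_text):
    """Return one dict per <item> that has all expected fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        # findtext() returns None when the child element is missing
        row = {field: item.findtext(field) for field in FIELDS}
        if None in row.values():
            continue  # skip incomplete items, as the notebook's try/except does
        rows.append(row)
    return rows


rows = parse_items(SAMPLE_RSS)
print(len(rows), rows[0]["title"])
```

The second item is skipped because it has no `description`, which mirrors how an incomplete item is dropped in `get_rss_feed`.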

In [6]:
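user_secrets = UserSecretsClient()
neptune_api_token = user_secrets.get_secret("neptune_api")
run = None
try:
    # `neptune.init` is not available in neptune-client 1.2.0 (the original call
    # failed with "module 'neptune.new' has no attribute 'init'"); use `init_run`.
    run = neptune.init_run(
        project="preda/BBCNews",
        api_token=neptune_api_token,
    )  # your credentials
except Exception as ex:
    # Tracking is optional: on failure `run` stays None and the notebook continues
    print(ex)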


# Read BBC News RSS Feeds

Set the RSS feed URL.

In [7]:
url = "http://feeds.bbci.co.uk/news/rss.xml"

Get the RSS Feed.

In [8]:
data_df = get_rss_feed(url)
if run:
    run["new_data_rows"] = data_df.shape[0]
    run["new_data_columns"] = data_df.shape[1]
print(f"New data collected: {data_df.shape[0]}")
data_df.head()

New data collected: 52


| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 0 | UK signs off membership to Indo-Pacific trade ... | Sun, 16 Jul 2023 01:32:07 GMT | https://www.bbc.co.uk/news/explainers-55858490 | https://www.bbc.co.uk/news/explainers-55858490... | It may sound like an official has leant on the... |
| 1 | US heatwave: 'Dangerous’ temperatures could se... | Sun, 16 Jul 2023 01:17:36 GMT | https://www.bbc.co.uk/news/world-us-canada-661... | https://www.bbc.co.uk/news/world-us-canada-661... | Nearly a third of Americans - about 113 millio... |
| 2 | Mortgage rates: Six reasons why the pain isn't... | Sat, 15 Jul 2023 23:13:54 GMT | https://www.bbc.co.uk/news/business-66187232 | https://www.bbc.co.uk/news/business-66187232?a... | Mortgage rates are at a 15-year high, but ther... |
| 3 | Djokovic relishing Alcaraz Wimbledon showdown | Sat, 15 Jul 2023 18:21:27 GMT | https://www.bbc.co.uk/sport/tennis/66207600 | https://www.bbc.co.uk/sport/tennis/66207600?at... | Novak Djokovic believes his eagerly anticipate... |
| 4 | Brighton hotel blaze: Winds hamper firefighters | Sun, 16 Jul 2023 07:14:13 GMT | https://www.bbc.co.uk/news/uk-england-sussex-6... | https://www.bbc.co.uk/news/uk-england-sussex-6... | No-one was hurt but "difficult conditions" mea... |
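
The `pubDate` values above are plain strings in the RFC 822 style date format used by RSS. As a small sketch (not part of the original notebook), they can be converted to timezone-aware datetimes with the standard library, which makes sorting or filtering by publication time possible. The list `pub_dates` is copied from the output above; the variable names are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Two pubDate strings taken from the feed output above
pub_dates = [
    "Sun, 16 Jul 2023 01:32:07 GMT",
    "Sat, 15 Jul 2023 23:13:54 GMT",
]

# parsedate_to_datetime returns an aware datetime for a "GMT" suffix
parsed = [parsedate_to_datetime(value) for value in pub_dates]

newest = max(parsed)
print(newest.isoformat())  # 2023-07-16T01:32:07+00:00
```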


# Load data from database and concatenate old and new data

Load the existing data from the database (a CSV file in the Kaggle dataset).

In [9]:
old_data_df = pd.read_csv("/kaggle/input/bbc-news/bbc_news.csv")
if run:
    run["old_data_rows"] = old_data_df.shape[0]
    run["old_data_columns"] = old_data_df.shape[1]
print(f"Old data: {old_data_df.shape[0]}")
old_data_df.head()

Old data: 19220


| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | Mon, 07 Mar 2022 08:01:56 GMT | https://www.bbc.co.uk/news/world-europe-60638042 | https://www.bbc.co.uk/news/world-europe-606380... | The Ukrainian president says the country will ... |
| 1 | War in Ukraine: Taking cover in a town under a... | Sun, 06 Mar 2022 22:49:58 GMT | https://www.bbc.co.uk/news/world-europe-60641873 | https://www.bbc.co.uk/news/world-europe-606418... | Jeremy Bowen was on the frontline in Irpin, as... |
| 2 | Ukraine war 'catastrophic for global food' | Mon, 07 Mar 2022 00:14:42 GMT | https://www.bbc.co.uk/news/business-60623941 | https://www.bbc.co.uk/news/business-60623941?a... | One of the world's biggest fertiliser firms sa... |
| 3 | Manchester Arena bombing: Saffie Roussos's par... | Mon, 07 Mar 2022 00:05:40 GMT | https://www.bbc.co.uk/news/uk-60579079 | https://www.bbc.co.uk/news/uk-60579079?at_medi... | The parents of the Manchester Arena bombing's ... |
| 4 | Ukraine conflict: Oil price soars to highest l... | Mon, 07 Mar 2022 08:15:53 GMT | https://www.bbc.co.uk/news/business-60642786 | https://www.bbc.co.uk/news/business-60642786?a... | Consumers are feeling the impact of higher ene... |


Let's also look at the tail of the dataset.

In [10]:
old_data_df.tail()

| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 19215 | Wimbledon 2023: Murray, Alcaraz & Jabeur featu... | Tue, 04 Jul 2023 18:57:39 GMT | https://www.bbc.co.uk/sport/av/tennis/66035962 | https://www.bbc.co.uk/sport/av/tennis/66035962... | Watch the best shots from day two of Wimbledon... |
| 19216 | Tour de France 2023: Jasper Philipsen wins aga... | Tue, 04 Jul 2023 17:09:55 GMT | https://www.bbc.co.uk/sport/cycling/66102679 | https://www.bbc.co.uk/sport/cycling/66102679?a... | Belgium's Jasper Philipsen wins for the second... |
| 19217 | Wimbledon 2023: Eight-time champion Roger Fede... | Tue, 04 Jul 2023 20:07:45 GMT | https://www.bbc.co.uk/sport/tennis/66097818 | https://www.bbc.co.uk/sport/tennis/66097818?at... | Eight-time Wimbledon champion Roger Federer ta... |
| 19218 | Roberto Firmino: Saudi Arabian side Al-Ahli si... | Tue, 04 Jul 2023 21:49:54 GMT | https://www.bbc.co.uk/sport/football/66105022 | https://www.bbc.co.uk/sport/football/66105022?... | Roberto Firmino joins Saudi Pro League side Al... |
| 19219 | Cost of living: What are your rights as a tena... | Tue, 04 Jul 2023 10:23:26 GMT | https://www.bbc.co.uk/news/technology-65038459 | https://www.bbc.co.uk/news/technology-65038459... | With one in five people now renting in the UK,... |


Concatenate the newly parsed data with the existing data, then remove duplicate rows.

In [11]:
new_data_df = pd.concat([old_data_df, data_df], axis=0)
print(f"Data after concatenation: {new_data_df.shape[0]}")
new_data_df = new_data_df.drop_duplicates()
if run:
    run["merged_data_rows"] = new_data_df.shape[0]
    run["merged_data_columns"] = new_data_df.shape[1]
print(f"Data after dropping duplicates: {new_data_df.shape[0]}")
new_data_df.head()

Data after concatenation: 19272
Data after dropping duplicates: 19266


| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | Mon, 07 Mar 2022 08:01:56 GMT | https://www.bbc.co.uk/news/world-europe-60638042 | https://www.bbc.co.uk/news/world-europe-606380... | The Ukrainian president says the country will ... |
| 1 | War in Ukraine: Taking cover in a town under a... | Sun, 06 Mar 2022 22:49:58 GMT | https://www.bbc.co.uk/news/world-europe-60641873 | https://www.bbc.co.uk/news/world-europe-606418... | Jeremy Bowen was on the frontline in Irpin, as... |
| 2 | Ukraine war 'catastrophic for global food' | Mon, 07 Mar 2022 00:14:42 GMT | https://www.bbc.co.uk/news/business-60623941 | https://www.bbc.co.uk/news/business-60623941?a... | One of the world's biggest fertiliser firms sa... |
| 3 | Manchester Arena bombing: Saffie Roussos's par... | Mon, 07 Mar 2022 00:05:40 GMT | https://www.bbc.co.uk/news/uk-60579079 | https://www.bbc.co.uk/news/uk-60579079?at_medi... | The parents of the Manchester Arena bombing's ... |
| 4 | Ukraine conflict: Oil price soars to highest l... | Mon, 07 Mar 2022 08:15:53 GMT | https://www.bbc.co.uk/news/business-60642786 | https://www.bbc.co.uk/news/business-60642786?a... | Consumers are feeling the impact of higher ene... |


Let's also look at the tail of the merged dataset.

In [12]:
new_data_df.tail()

| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 41 | Cost-of-living payments: Who is getting them a... | Tue, 20 Jun 2023 08:57:34 GMT | https://www.bbc.co.uk/news/business-61592496 | https://www.bbc.co.uk/news/business-61592496?a... | Low-income households, pensioners and some dis... |
| 43 | Mortgages: What happens if I am struggling to ... | Wed, 12 Jul 2023 11:38:53 GMT | https://www.bbc.co.uk/news/business-63486782 | https://www.bbc.co.uk/news/business-63486782?a... | Many homeowners are worried about higher mortg... |
| 48 | What is a recession and how could it affect me? | Fri, 31 Mar 2023 08:55:27 GMT | https://www.bbc.co.uk/news/business-52986863 | https://www.bbc.co.uk/news/business-52986863?a... | Britain's economy is expected to shrink in 202... |
| 49 | What is the UK inflation rate and why is it so... | Tue, 11 Jul 2023 09:20:07 GMT | https://www.bbc.co.uk/news/business-12196322 | https://www.bbc.co.uk/news/business-12196322?a... | The rate at which prices are rising remains hi... |
| 51 | Mortgage calculator: how much will my mortgage... | Wed, 12 Jul 2023 16:29:36 GMT | https://www.bbc.co.uk/news/business-63474582 | https://www.bbc.co.uk/news/business-63474582?a... | Use our calculator to find out how much mortga... |
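
The merge above removes only rows that are identical in every column. The toy sketch below contrasts that with deduplicating on a single key column. It uses made-up data, and treating `guid` as a stable article identifier is an assumption rather than something the notebook verifies. `ignore_index=True` is also added here so the merged frame gets a fresh 0..n-1 index instead of repeating index values from both inputs.

```python
import pandas as pd

# Made-up "old" and "new" batches; the story with guid "b" was re-published
# with an edited title, so the two rows are not exact duplicates.
old = pd.DataFrame({"guid": ["a", "b"], "title": ["A", "B"]})
new = pd.DataFrame({"guid": ["b", "c"], "title": ["B (updated)", "C"]})

merged = pd.concat([old, new], axis=0, ignore_index=True)

# Exact-row deduplication, as in the notebook: the edited story is kept twice
exact = merged.drop_duplicates()

# Key-based deduplication: keep the first row seen for each guid
by_guid = merged.drop_duplicates(subset="guid", keep="first").reset_index(drop=True)

print(len(exact), len(by_guid))
```

Which variant is appropriate depends on whether edited versions of a story should be kept as separate rows.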


# Save merged data

After merging the data, save it to a CSV file (this will populate the next version of the dataset).

In [13]:
new_data_df.to_csv("bbc_news.csv", index=False)
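
As a brief illustration of why `index=False` is passed when saving, the sketch below round-trips a tiny made-up dataframe through CSV in memory. With the default `index=True`, the index is written as an unnamed first column and comes back as `Unnamed: 0` on reload. The dataframe and variable names are hypothetical.

```python
import io

import pandas as pd

df = pd.DataFrame({"title": ["A", "B"], "guid": ["g1", "g2"]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```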

# Stop Neptune.ai session

In [14]:
if run:
    run.stop()