<a href="https://www.kaggle.com/code/gpreda/bbc-news-rss-feeds?scriptVersionId=126162188" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook collects data from the BBC News RSS feeds.

The notebook is run periodically to collect new data.
On each run, the existing data (read from the database) is merged with the newly collected data, and duplicates are removed.
The resulting updated data is then saved as a new version of the database.

The notebook also shows how to use Neptune.ai with Kaggle.

# Install and import packages

In [1]:
!pip3 install requests_html

Collecting requests_html
  Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting pyppeteer>=0.0.14
  Downloading pyppeteer-1.0.2-py3-none-any.whl (83 kB)
     |████████████████████████████████| 83 kB 1.2 MB/s             
Collecting parse
  Downloading parse-1.19.0.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
Collecting pyquery
  Downloading pyquery-2.0.0-py3-none-any.whl (22 kB)
Collecting w3lib
  Downloading w3lib-2.1.1-py3-none-any.whl (21 kB)
Collecting fake-useragent
  Downloading fake_useragent-1.1.3-py3-none-any.whl (50 kB)
     |████████████████████████████████| 50 kB 4.5 MB/s             
Collecting pyee<9.0.0,>=8.1.0
  Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting cssselect>=1.2.0
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Building wheels for collected packages: bs

In [2]:
!pip3 install neptune-client

Collecting neptune-client
  Downloading neptune_client-1.1.1-py3-none-any.whl (442 kB)
     |████████████████████████████████| 442 kB 9.8 MB/s            
Collecting bravado<12.0.0,>=11.0.0
  Downloading bravado-11.0.3-py2.py3-none-any.whl (38 kB)
Collecting swagger-spec-validator>=2.7.4
  Downloading swagger_spec_validator-3.0.3-py2.py3-none-any.whl (27 kB)
Collecting bravado-core>=5.16.1
  Downloading bravado_core-5.17.1-py2.py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 3.9 MB/s             
Collecting monotonic
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting jsonref
  Downloading jsonref-1.1.0-py3-none-any.whl (9.4 kB)
Collecting rfc3339-validator
  Downloading rfc3339_validator-0.1.4-py2.py3-none-any.whl (3.5 kB)
Collecting isoduration
  Downloading isoduration-20.11.0-py3-none-any.whl (11 kB)
Collecting rfc3987
  Downloading rfc3987-1.3.8-py2.py3-none-any.whl (13 kB)
Collecting uri-template
  Downloading uri_templa

In [3]:
import requests
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import neptune.new as neptune
from kaggle_secrets import UserSecretsClient

  from neptune.version import version as neptune_client_version
  


# RSS Feed Parsing Functions

In [4]:
def get_html_source(url):
    """
    Return the HTTP response for the provided URL.
    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
            or None if the request fails.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as ex:
        # Report the failure and signal it to the caller with None.
        print(ex)
        return None

In [5]:
def get_rss_feed(url):
    """
    Return a Pandas dataframe containing the RSS feed contents.
    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python
    Modified to use BeautifulSoup (bs4).

    Args:
        url (string): URL of the RSS feed to read.

    Returns:
        df (dataframe): Pandas dataframe containing the RSS feed contents.
    """

    columns = ['title', 'pubDate', 'guid', 'link', 'description']
    rows = []

    response = get_html_source(url)
    if response is None:
        # The request failed; return an empty dataframe with the expected columns.
        return pd.DataFrame(columns=columns)

    with response as r:
        # Parse the RSS feed with BeautifulSoup's `lxml-xml` parser.
        soup = BeautifulSoup(r.text, 'lxml-xml')
        items = soup.find_all('item')

        for item in items:
            try:
                # Read the text of each expected child tag of the item.
                rows.append({column: item.find(column).text for column in columns})
            except Exception as ex:
                # A missing tag makes find() return None; report it and skip the item.
                print(ex)
                continue

    # DataFrame.append was removed in pandas 2.0, so build the dataframe once from the rows.
    df = pd.DataFrame(rows, columns=columns)
    return df
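
To see what the item loop in `get_rss_feed` extracts without calling the live feed, the sketch below parses a small, made-up RSS snippet. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages. The sample XML, the `parse_items` name, and the `FIELDS` tuple are assumptions for illustration and are not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Made-up two-item feed; the second item lacks a <guid> on purpose.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <pubDate>Mon, 17 Apr 2023 15:53:05 GMT</pubDate>
      <guid>https://example.org/news/1</guid>
      <link>https://example.org/news/1?ref=rss</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <pubDate>Mon, 17 Apr 2023 14:22:15 GMT</pubDate>
      <link>https://example.org/news/2?ref=rss</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>
"""

FIELDS = ("title", "pubDate", "guid", "link", "description")


def parse_items(xml_text):
    """Return one dict per <item>; a missing child tag becomes None."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        # findtext returns None when the child tag is absent.
        rows.append({field: item.findtext(field) for field in FIELDS})
    return rows


rows = parse_items(SAMPLE_RSS)
for row in rows:
    print(row["title"], "|", row["guid"])
```

Unlike the notebook's loop, which skips an item when any tag is missing, this variant keeps the row and records `None` for the missing field. Which behavior is preferable depends on how the data is used downstream.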

In [6]:
user_secrets = UserSecretsClient()
neptune_api_token = user_secrets.get_secret("neptune_api")
run = None
try:
    # neptune-client 1.x no longer provides `init`; `init_run` starts a run.
    run = neptune.init_run(
        project="preda/BBCNews",
        api_token=neptune_api_token,
    )  # your credentials
except Exception as ex:
    # Continue without Neptune logging if the run cannot be created.
    print(ex)


# Read BBC News RSS Feeds

Set the RSS feed URL.

In [7]:
url = "http://feeds.bbci.co.uk/news/rss.xml"

Get the RSS Feed.

In [8]:
data_df = get_rss_feed(url)
if run:
    run["new_data_rows"] = data_df.shape[0]
    run["new_data_columns"] = data_df.shape[1]
print(f"New data collected: {data_df.shape[0]}")
data_df.head()

New data collected: 36


Unnamed: 0,title,pubDate,guid,link,description
0,Rishi Sunak investigated over declaration of i...,"Mon, 17 Apr 2023 15:53:05 GMT",https://www.bbc.co.uk/news/uk-politics-65301099,https://www.bbc.co.uk/news/uk-politics-6530109...,The prime minister faces a declaration of inte...
1,Vladimir Kara-Murza: Russian opposition figure...,"Mon, 17 Apr 2023 14:22:15 GMT",https://www.bbc.co.uk/news/world-europe-65297003,https://www.bbc.co.uk/news/world-europe-652970...,Vladimir Kara-Murza says the harsh sentence sh...
2,Lucy Letby trial: Nurse's notes read 'I killed...,"Mon, 17 Apr 2023 15:45:10 GMT",https://www.bbc.co.uk/news/uk-england-merseysi...,https://www.bbc.co.uk/news/uk-england-merseysi...,"A note found at Lucy Letby's home stated ""mayb..."
3,Sergeant charged with rape of woman while on d...,"Mon, 17 Apr 2023 11:29:47 GMT",https://www.bbc.co.uk/news/uk-england-devon-65...,https://www.bbc.co.uk/news/uk-england-devon-65...,Sgt David Stansbury is charged with three coun...
4,Ralph Yarl: Black teen shot by homeowner after...,"Mon, 17 Apr 2023 15:45:05 GMT",https://www.bbc.co.uk/news/world-us-canada-652...,https://www.bbc.co.uk/news/world-us-canada-652...,Ralph Yarl's parents sent him to pick up his b...
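
The `pubDate` values above are plain text in the RFC 822 style date format that RSS uses, so sorting the column as-is would order rows by weekday abbreviation rather than by time. A minimal sketch of converting them with the standard library's `email.utils.parsedate_to_datetime` follows. The `pub_dates` list copies three values from the output above, and the variable names are illustrative only.

```python
from email.utils import parsedate_to_datetime

pub_dates = [
    "Mon, 17 Apr 2023 15:53:05 GMT",
    "Mon, 17 Apr 2023 14:22:15 GMT",
    "Mon, 17 Apr 2023 11:29:47 GMT",
]

# Each string becomes a timezone-aware datetime (GMT maps to a zero UTC offset).
parsed = [parsedate_to_datetime(value) for value in pub_dates]

# Sorting the datetimes gives chronological order, oldest first.
chronological = sorted(parsed)
for moment in chronological:
    print(moment.isoformat())
```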


# Load data from database and concatenate old and new data

Load the data from the database.

In [9]:
old_data_df = pd.read_csv("/kaggle/input/bbc-news/bbc_news.csv")
if run:
    run["old_data_rows"] = old_data_df.shape[0]
    run["old_data_columns"] = old_data_df.shape[1]
print(f"Old data: {old_data_df.shape[0]}")
old_data_df.head()

Old data: 15882


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


Let's also look at the tail of the dataset.

In [10]:
old_data_df.tail()

Unnamed: 0,title,pubDate,guid,link,description
15877,RBC Heritage: Matt Fitzpatrick beats Jordan Sp...,"Sun, 16 Apr 2023 23:38:10 GMT",https://www.bbc.co.uk/sport/golf/65294712,https://www.bbc.co.uk/sport/golf/65294712?at_m...,England's Matt Fitzpatrick beats American Jord...
15878,Women's FA Cup: Sam Kerr scores semi-final win...,"Sun, 16 Apr 2023 16:57:08 GMT",https://www.bbc.co.uk/sport/av/football/65294118,https://www.bbc.co.uk/sport/av/football/652941...,Holders Chelsea book their place in the Women'...
15879,World Snooker Championship 2023: Neil Robertso...,"Sun, 16 Apr 2023 21:45:19 GMT",https://www.bbc.co.uk/sport/snooker/65293696,https://www.bbc.co.uk/sport/snooker/65293696?a...,Neil Robertson produces a sublime display to d...
15880,"Garth Crooks' Team of the Week: Stones, Fernan...","Sun, 16 Apr 2023 21:07:01 GMT",https://www.bbc.co.uk/sport/football/65294896,https://www.bbc.co.uk/sport/football/65294896?...,Which Premier League players impressed our foo...
15881,Ashton reaches 100 Premiership tries as Tigers...,"Sun, 16 Apr 2023 16:55:41 GMT",https://www.bbc.co.uk/sport/rugby-union/65266094,https://www.bbc.co.uk/sport/rugby-union/652660...,Chris Ashton becomes the first player to score...


Merge the newly parsed data with the existing data, then remove duplicates.

In [11]:
new_data_df = pd.concat([old_data_df, data_df], axis=0)
print(f"Data after concatenation: {new_data_df.shape[0]}")
new_data_df = new_data_df.drop_duplicates()
if run:
    run["merged_data_rows"] = new_data_df.shape[0]
    run["merged_data_columns"] = new_data_df.shape[1]
print(f"Data after dropping duplicates: {new_data_df.shape[0]}")
new_data_df.head()

Data after concatenation: 15918
Data after dropping duplicates: 15910


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...
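
Called without arguments, `drop_duplicates()` treats two rows as duplicates only when every column matches. If the same story were fetched twice with a different query suffix on `link`, both rows would survive. The sketch below shows the alternative of keying on `guid`, using plain dictionaries so it runs without pandas; in pandas the corresponding call would be `drop_duplicates(subset=["guid"])`. It assumes that `guid` uniquely identifies a story, which the notebook does not state, and `merge_by_guid` and the sample rows are hypothetical.

```python
def merge_by_guid(old_rows, new_rows):
    """Merge two lists of row dicts, keeping the first row seen for each guid."""
    merged = {}
    for row in list(old_rows) + list(new_rows):
        # setdefault leaves an existing entry untouched, so old rows win.
        merged.setdefault(row["guid"], row)
    return list(merged.values())


old_rows = [
    {"guid": "https://example.org/news/1", "link": "https://example.org/news/1?ref=a"},
    {"guid": "https://example.org/news/2", "link": "https://example.org/news/2?ref=a"},
]
new_rows = [
    # Same story as the first old row, but the link differs.
    {"guid": "https://example.org/news/1", "link": "https://example.org/news/1?ref=b"},
    {"guid": "https://example.org/news/3", "link": "https://example.org/news/3?ref=b"},
]

merged_rows = merge_by_guid(old_rows, new_rows)
print(len(merged_rows))
```

Keeping the first occurrence preserves the row already stored in the database. Whether the older or the newer copy should win is a design choice that depends on whether feed items are expected to be edited after publication.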


Let's also look at the tail of the new dataset.

In [12]:
new_data_df.tail()

Unnamed: 0,title,pubDate,guid,link,description
31,Supreme Court: Is India on the cusp of legalis...,"Mon, 17 Apr 2023 01:08:00 GMT",https://www.bbc.co.uk/news/world-asia-india-65...,https://www.bbc.co.uk/news/world-asia-india-65...,All eyes are on the Supreme Court as it readie...
32,'Ignorant' protesters blamed for Grand Nationa...,"Mon, 17 Apr 2023 10:29:26 GMT",https://www.bbc.co.uk/sport/horse-racing/65296693,https://www.bbc.co.uk/sport/horse-racing/65296...,Horse trainer Sandy Thomson says the interrupt...
33,Alex Fletcher: Bath City striker on impact of ...,"Mon, 17 Apr 2023 11:40:43 GMT",https://www.bbc.co.uk/sport/football/65296643,https://www.bbc.co.uk/sport/football/65296643?...,"Alex Fletcher, a non-league striker who fractu..."
34,Women's Six Nations 2023: England v Ireland co...,"Mon, 17 Apr 2023 14:17:35 GMT",https://www.bbc.co.uk/sport/rugby-union/65301890,https://www.bbc.co.uk/sport/rugby-union/653018...,BBC Sport reviews round three of the Women's S...
35,Women's Six Nations: Abby Dow and a brave litt...,"Mon, 17 Apr 2023 07:37:32 GMT",https://www.bbc.co.uk/sport/av/rugby-union/652...,https://www.bbc.co.uk/sport/av/rugby-union/652...,Watch the top five moments from the third roun...


# Save merged data

After merging the data, save it (this will populate the next version of the dataset).

In [13]:
new_data_df.to_csv("bbc_news.csv", index=False)

# Stop Neptune.ai session

In [14]:
if run:
    run.stop()
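
The `if run:` guard is repeated before each group of logging calls in the cells above. One way to centralize it is a small helper, sketched here. `log_shape` is a hypothetical name, and a plain dict stands in for the Neptune run object because both accept `obj[key] = value` assignment.

```python
def log_shape(run, prefix, shape):
    """Record row and column counts under `<prefix>_rows` and `<prefix>_columns`.

    Does nothing when `run` is None, for example when no run could be created.
    """
    if run is None:
        return
    rows, columns = shape
    run[f"{prefix}_rows"] = rows
    run[f"{prefix}_columns"] = columns


fake_run = {}  # stand-in for a Neptune run object
log_shape(fake_run, "new_data", (36, 5))
log_shape(None, "old_data", (0, 0))  # skipped because there is no run
print(fake_run)
```

With such a helper, a cell would call `log_shape(run, "new_data", data_df.shape)` instead of repeating the guard.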