# Introduction

This notebook collects data from the BBC News RSS Feeds.

The notebook is run periodically to collect new data.
Existing data (read from the database) is merged with the new data, and duplicates are removed.
The updated data is then saved as a new version of the database.

# Install and import packages

In [1]:
!pip3 install requests_html

Collecting requests_html
  Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting parse
  Downloading parse-1.19.0.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
Collecting w3lib
  Downloading w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting pyquery
  Downloading pyquery-1.4.3-py3-none-any.whl (22 kB)
Collecting pyppeteer>=0.0.14
  Downloading pyppeteer-1.0.2-py3-none-any.whl (83 kB)
     |████████████████████████████████| 83 kB 707 kB/s
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting fake-useragent
  Downloading fake-useragent-0.1.11.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting pyee<9.0.0,>=8.1.0
  Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting cssselect>0.7.9
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Building wheels for collected packages: bs4, fake-useragent, pars

In [2]:
import requests
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
from bs4 import BeautifulSoup

# RSS Feed Parsing Functions

In [3]:
def get_html_source(url):
    """
    Return the HTTP response for the provided URL.

    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
            or None if the request fails.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as ex:
        # Report the error; the caller receives None and must handle it.
        print(ex)
        return None

In [4]:
def get_rss_feed(url):
    """
    Return a Pandas dataframe containing the RSS feed contents.

    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python
    Modified to use BeautifulSoup (bs4).

    Args:
        url (string): URL of the RSS feed to read.

    Returns:
        df (dataframe): Pandas dataframe containing the RSS feed contents.
    """

    columns = ['title', 'pubDate', 'guid', 'link', 'description']
    response = get_html_source(url)

    # get_html_source returns None when the request fails
    if response is None:
        return pd.DataFrame(columns=columns)

    rows = []
    with response as r:
        # Parse the RSS feed with BeautifulSoup's `lxml-xml` parser
        soup = BeautifulSoup(r.text, 'lxml-xml')
        items = soup.find_all('item')

        for item in items:
            try:
                # Read each field by tag name; a missing tag raises
                # AttributeError and the item is skipped
                rows.append({col: item.find(col).text for col in columns})
            except Exception as ex:
                print(ex)
                continue

    # DataFrame.append was removed in pandas 2.0, so build the frame once
    return pd.DataFrame(rows, columns=columns)
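
The item-by-item extraction done by `get_rss_feed` can be tried offline. The following is a minimal sketch that swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, so it runs without extra packages. The sample feed, the `FIELDS` list and the `parse_rss_items` helper are made up for this illustration and are not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, made up for this example (not real BBC data).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <pubDate>Wed, 09 Mar 2022 01:57:37 GMT</pubDate>
      <guid>https://example.com/news/1</guid>
      <link>https://example.com/news/1?ref=rss</link>
      <description>First summary.</description>
    </item>
    <item>
      <title>Item without a guid</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>
"""

# Same five fields that get_rss_feed extracts
FIELDS = ['title', 'pubDate', 'guid', 'link', 'description']


def parse_rss_items(xml_text):
    """Return one dict per <item> that contains every field in FIELDS."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter('item'):
        # findtext returns None when the child tag is absent
        row = {field: item.findtext(field) for field in FIELDS}
        if None in row.values():
            continue  # skip incomplete items, as get_rss_feed does
        rows.append(row)
    return rows


rows = parse_rss_items(SAMPLE_RSS)
print(len(rows), rows[0]['title'])
```

The second sample item lacks `pubDate`, `guid` and `description`, so only the first item is returned. A list of dicts like `rows` can be passed directly to `pd.DataFrame`.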

# Read BBC News RSS Feeds

Set the RSS feed URL.

In [5]:
url = "http://feeds.bbci.co.uk/news/rss.xml"

Fetch and parse the RSS feed.

In [6]:
data_df = get_rss_feed(url)
print(f"New data collected: {data_df.shape[0]}")
data_df.head()

New data collected: 55


Unnamed: 0,title,pubDate,guid,link,description
0,War in Ukraine: West hits Russia with oil bans...,"Wed, 09 Mar 2022 01:57:37 GMT",https://www.bbc.co.uk/news/world-us-canada-606...,https://www.bbc.co.uk/news/world-us-canada-606...,"The US bans Russian oil, targeting the Russian..."
1,War in Ukraine: Troops dig in near Kyiv,"Wed, 09 Mar 2022 00:00:25 GMT",https://www.bbc.co.uk/news/world-europe-60671329,https://www.bbc.co.uk/news/world-europe-606713...,Ukrainians are determined to defend their capi...
2,Ros Atkins on… The UK’s refugee response,"Tue, 08 Mar 2022 21:57:50 GMT",https://www.bbc.co.uk/news/uk-60668779,https://www.bbc.co.uk/news/uk-60668779?at_medi...,Ros Atkins looks at the UK's response to refug...
3,"War in Ukraine: McDonald’s, Coca-Cola and Star...","Wed, 09 Mar 2022 02:56:24 GMT",https://www.bbc.co.uk/news/business-60665877,https://www.bbc.co.uk/news/business-60665877?a...,Western companies are turning their backs on R...
4,War in Ukraine: Warning oil sanctions will fur...,"Wed, 09 Mar 2022 01:29:26 GMT",https://www.bbc.co.uk/news/business-60670120,https://www.bbc.co.uk/news/business-60670120?a...,Plans to ban or curb Russian oil and gas impor...
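
The `pubDate` values are stored as plain strings in the RFC 822 style used by RSS. If they ever need to be sorted or compared as dates, one possible approach (not something this notebook does) is the standard library's `email.utils.parsedate_to_datetime`. A minimal sketch, using a value copied from the first row of the output above:

```python
from email.utils import parsedate_to_datetime

# pubDate string copied from the first row of the output above
pub_date = "Wed, 09 Mar 2022 01:57:37 GMT"

# Returns a timezone-aware datetime; "GMT" maps to a zero UTC offset
parsed = parsedate_to_datetime(pub_date)
print(parsed.isoformat())
```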


# Load data from database and concatenate old and new data

Load the existing data from the database.

In [7]:
old_data_df = pd.read_csv("/kaggle/input/bbc-news/bbc_news.csv")
print(f"Old data: {old_data_df.shape[0]}")
old_data_df.head()

Old data: 165


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


Let's also look at the tail of the dataset.

In [8]:
old_data_df.tail()

Unnamed: 0,title,pubDate,guid,link,description
160,Ukraine war: UK can and will do more for refug...,"Tue, 08 Mar 2022 08:23:59 GMT",https://www.bbc.co.uk/news/uk-60655788,https://www.bbc.co.uk/news/uk-60655788?at_medi...,The UK has granted visas to 300 refugees fleei...
161,"Covid: Vaccines not linked to deaths, says stu...","Tue, 08 Mar 2022 07:53:27 GMT",https://www.bbc.co.uk/news/uk-60652743,https://www.bbc.co.uk/news/uk-60652743?at_medi...,Five things you need to know about the coronav...
162,Shane Warne: Do liquid diets work and are they...,"Tue, 08 Mar 2022 08:10:30 GMT",https://www.bbc.co.uk/news/health-60647276,https://www.bbc.co.uk/news/health-60647276?at_...,What do experts make of the diet Shane Warne w...
163,"Fury's future plans, AJ's new coach and Khan's...","Tue, 08 Mar 2022 08:06:41 GMT",https://www.bbc.co.uk/sport/boxing/60549538,https://www.bbc.co.uk/sport/boxing/60549538?at...,Will Tyson Fury hang up his gloves after facin...
164,Healy helps Australia beat Pakistan at World Cup,"Tue, 08 Mar 2022 08:16:28 GMT",https://www.bbc.co.uk/sport/cricket/60658487,https://www.bbc.co.uk/sport/cricket/60658487?a...,Alyssa Healy helps Australia secure a comforta...


Merge the newly parsed data with the existing data, then remove duplicates.

In [9]:
new_data_df = pd.concat([old_data_df, data_df], axis=0)
print(f"Data after concatenation: {new_data_df.shape[0]}")
new_data_df = new_data_df.drop_duplicates()
print(f"Data after dropping duplicates: {new_data_df.shape[0]}")
new_data_df.head()

Data after concatenation: 220
Data after dropping duplicates: 205


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...
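
Called without arguments, `drop_duplicates()` only removes rows that are identical in every column, and `pd.concat` keeps the original row labels of both frames. The sketch below shows one possible variant that deduplicates on `guid` and renumbers the rows. It assumes that `guid` is a stable unique key for an article, which the notebook does not confirm. The two small frames are hypothetical stand-ins for `old_data_df` and `data_df`.

```python
import pandas as pd

# Hypothetical rows standing in for old_data_df and data_df
old_df = pd.DataFrame({
    'title': ['Story A', 'Story B'],
    'guid': ['https://example.com/a', 'https://example.com/b'],
})
new_df = pd.DataFrame({
    'title': ['Story B (updated)', 'Story C'],
    'guid': ['https://example.com/b', 'https://example.com/c'],
})

# ignore_index=True renumbers rows 0..n-1 instead of repeating old labels
merged = pd.concat([old_df, new_df], ignore_index=True)

# Treat rows with the same guid as duplicates and keep the first one seen
deduped = merged.drop_duplicates(subset='guid', keep='first').reset_index(drop=True)
print(deduped)
```

With `keep='first'` the previously stored version of an article wins; `keep='last'` would prefer the newly fetched one. Which is preferable depends on how the dataset is meant to be used.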


Let's also look at the tail of the new dataset.

In [10]:
new_data_df.tail()

Unnamed: 0,title,pubDate,guid,link,description
50,Bairstow ton rescues England against West Indies,"Tue, 08 Mar 2022 22:41:57 GMT",https://www.bbc.co.uk/sport/cricket/60665827,https://www.bbc.co.uk/sport/cricket/60665827?a...,Jonny Bairstow's hundred rescues England after...
51,Lewandowski scores earliest Champions League h...,"Tue, 08 Mar 2022 23:04:53 GMT",https://www.bbc.co.uk/sport/football/60667847,https://www.bbc.co.uk/sport/football/60667847?...,Robert Lewandowski scores the earliest ever Ch...
52,Andy Murray pledges to donate prize money to h...,"Tue, 08 Mar 2022 20:42:38 GMT",https://www.bbc.co.uk/sport/tennis/60667095,https://www.bbc.co.uk/sport/tennis/60667095?at...,Andy Murray pledges to donate his earnings fro...
53,World Cup 2022: Wales' play-off with Austria t...,"Tue, 08 Mar 2022 19:11:57 GMT",https://www.bbc.co.uk/sport/football/60648692,https://www.bbc.co.uk/sport/football/60648692?...,Wales' World Cup play-off semi-final against A...
54,Women's World Cup: Deandra Dottin takes sensat...,"Wed, 09 Mar 2022 02:44:31 GMT",https://www.bbc.co.uk/sport/av/cricket/60672395,https://www.bbc.co.uk/sport/av/cricket/6067239...,West Indies' Deandra Dottin takes a sensationa...


# Save merged data

After merging the data, save it (this will populate the next version of the dataset).

In [11]:
new_data_df.to_csv("bbc_news.csv", index=False)