# Introduction

This notebook collects data from BBC News RSS feeds.

The notebook is run periodically to collect new data.
Existing data (read from the database) is merged with the new data, and duplicates are removed.
The resulting data is then saved as a new version of the database.

# Install and import packages

In [1]:
!pip3 install requests_html

Collecting requests_html
  Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting parse
  Downloading parse-1.19.0.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting w3lib
  Downloading w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting pyquery
  Downloading pyquery-1.4.3-py3-none-any.whl (22 kB)
Collecting pyppeteer>=0.0.14
  Downloading pyppeteer-1.0.2-py3-none-any.whl (83 kB)
Collecting fake-useragent
  Downloading fake-useragent-0.1.11.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting pyee<9.0.0,>=8.1.0
  Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting cssselect>0.7.9
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Building wheels for collected packages: bs4, fake-useragent, parse
  B

In [2]:
import requests
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
from bs4 import BeautifulSoup

# RSS Feed Parsing Functions

In [3]:
def get_html_source(url):
    """
    Return the HTTP response (page source) for the provided URL.
    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
            or None if the request fails.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as ex:
        # report the error and return None explicitly so callers can check for it
        print(ex)
        return None

In [4]:
def get_rss_feed(url):
    """
    Return a Pandas dataframe containing the RSS feed contents.
    Source: https://practicaldatascience.co.uk/data-science/how-to-read-an-rss-feed-in-python
    Modified to use BeautifulSoup (bs4).
       
    Args: 
        url (string): URL of the RSS feed to read.

    Returns:
        df (dataframe): Pandas dataframe containing the RSS feed contents.
    """
    
    response = get_html_source(url)
    columns = ['title', 'pubDate', 'guid', 'link', 'description']

    # get_html_source returns None when the request fails
    if response is None:
        return pd.DataFrame(columns=columns)

    rows = []
    with response as r:
        # parse the RSS feed with BeautifulSoup's `lxml-xml` parser
        soup = BeautifulSoup(r.text, 'lxml-xml')
        items = soup.find_all('item')

        for item in items:
            try:
                # .text raises AttributeError if a tag is missing; such items are skipped
                rows.append({col: item.find(col).text for col in columns})
            except Exception as ex:
                print(ex)
                continue

    # build the dataframe once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=columns)
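
The item-extraction step in `get_rss_feed` can be tried without network access. The following is a minimal sketch, not the notebook's own method: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, and the sample feed, `FIELDS`, and `parse_rss_items` are hypothetical names invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <pubDate>Wed, 30 Mar 2022 19:50:37 GMT</pubDate>
      <guid>https://example.com/news/1</guid>
      <link>https://example.com/news/1?src=rss</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <guid>https://example.com/news/2</guid>
      <link>https://example.com/news/2?src=rss</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""

FIELDS = ["title", "pubDate", "guid", "link", "description"]


def parse_rss_items(xml_text):
    """Return one dict per <item>, with None for any missing field."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        # findtext returns None when the child tag is absent
        rows.append({field: item.findtext(field) for field in FIELDS})
    return rows


rows = parse_rss_items(SAMPLE_RSS)
print(len(rows))
print(rows[1]["pubDate"])
```

Unlike `get_rss_feed`, which skips an item when a tag is missing, this sketch keeps the item and records the missing field as `None`. Which behaviour is preferable depends on how the data is used downstream.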

# Read BBC News RSS Feeds

Initialize the RSS feed URL.

In [5]:
url = "http://feeds.bbci.co.uk/news/rss.xml"

Get the RSS Feed.

In [6]:
data_df = get_rss_feed(url)
print(f"New data collected: {data_df.shape[0]}")
data_df.head()

New data collected: 53


Unnamed: 0,title,pubDate,guid,link,description
0,Shrewsbury maternity deaths scandal will spark...,"Wed, 30 Mar 2022 19:50:37 GMT",https://www.bbc.co.uk/news/uk-england-shropshi...,https://www.bbc.co.uk/news/uk-england-shropshi...,Sajid Javid apologises over the maternity scan...
1,Ukraine War: Putin demands Mariupol surrender ...,"Wed, 30 Mar 2022 21:50:03 GMT",https://www.bbc.co.uk/news/world-europe-60926470,https://www.bbc.co.uk/news/world-europe-609264...,Russia's defence ministry has since announced ...
2,Germany and Austria take step towards gas rati...,"Wed, 30 Mar 2022 16:48:46 GMT",https://www.bbc.co.uk/news/business-60925016,https://www.bbc.co.uk/news/business-60925016?a...,Germany and Austria have both issued gas suppl...
3,Tom Parker: The Wanted singer dies aged 33,"Wed, 30 Mar 2022 19:27:37 GMT",https://www.bbc.co.uk/news/entertainment-arts-...,https://www.bbc.co.uk/news/entertainment-arts-...,The British boy band star told fans in 2020 he...
4,Boris Johnson must resign over lawbreaking at ...,"Wed, 30 Mar 2022 15:17:18 GMT",https://www.bbc.co.uk/news/uk-politics-60928083,https://www.bbc.co.uk/news/uk-politics-6092808...,The Labour leader accuses Boris Johnson of mis...
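
The `pubDate` values in the output above are plain strings. If they need to be sorted or filtered by time, one option is to convert them to timezone-aware datetimes. This is a sketch using the standard library's `email.utils.parsedate_to_datetime`; the notebook itself stores the strings unchanged, and the variable names here are illustrative.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# A pubDate value copied from the output above.
raw = "Wed, 30 Mar 2022 19:50:37 GMT"

# parsedate_to_datetime understands RFC 2822 style dates, including the GMT zone name.
published = parsedate_to_datetime(raw)

print(published.isoformat())
print(published.utcoffset() == timedelta(0))
```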


# Load data from database and concatenate old and new data

Load the data from the database.

In [7]:
old_data_df = pd.read_csv("/kaggle/input/bbc-news/bbc_news.csv")
print(f"Old data: {old_data_df.shape[0]}")
old_data_df.head()

Old data: 1070


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


Let's also look at the tail of the dataset.

In [8]:
old_data_df.tail()

Unnamed: 0,title,pubDate,guid,link,description
1065,Portugal 2-0 North Macedonia: Bruno Fernandes ...,"Tue, 29 Mar 2022 21:07:28 GMT",https://www.bbc.co.uk/sport/football/60907078,https://www.bbc.co.uk/sport/football/60907078?...,Bruno Fernandes sent Portugal to the World Cup...
1066,Ex-speed skater Christie set for new e-scooter...,"Tue, 29 Mar 2022 20:03:36 GMT",https://www.bbc.co.uk/sport/motorsport/60922046,https://www.bbc.co.uk/sport/motorsport/6092204...,Former speed skater Elise Christie is amongst ...
1067,Miami Open: Cameron Norrie loses to Casper Ruu...,"Tue, 29 Mar 2022 21:14:40 GMT",https://www.bbc.co.uk/sport/tennis/60920528,https://www.bbc.co.uk/sport/tennis/60920528?at...,British number one Cameron Norrie is hampered ...
1068,Transgender cyclist Bridges set to race in fir...,"Tue, 29 Mar 2022 20:28:27 GMT",https://www.bbc.co.uk/sport/cycling/60911823,https://www.bbc.co.uk/sport/cycling/60911823?a...,Transgender athlete Emily Bridges may compete ...
1069,Election 2022: Is there an election in my area?,"Thu, 17 Mar 2022 17:42:01 GMT",https://www.bbc.co.uk/news/uk-politics-60695244,https://www.bbc.co.uk/news/uk-politics-6069524...,National elections in Northern Ireland and loc...


Merge the newly parsed data with the existing data.
Then remove the duplicates.

In [9]:
new_data_df = pd.concat([old_data_df, data_df], axis=0)
print(f"Data after concatenation: {new_data_df.shape[0]}")
new_data_df = new_data_df.drop_duplicates()
print(f"Data after dropping duplicates: {new_data_df.shape[0]}")
new_data_df.head()

Data after concatenation: 1123
Data after dropping duplicates: 1112


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...
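
Called without arguments, `drop_duplicates()` compares entire rows, so a story whose title or description was edited between runs would be kept twice. The sketch below compares on `guid` only. It assumes that `guid` uniquely identifies a story, which the notebook does not state, and all data in it is hypothetical.

```python
import pandas as pd

# Hypothetical rows: the story with guid "b" was re-published with an edited title.
old_df = pd.DataFrame({
    "guid": ["a", "b"],
    "title": ["Story A", "Story B"],
})
new_df = pd.DataFrame({
    "guid": ["b", "c"],
    "title": ["Story B (updated)", "Story C"],
})

combined = pd.concat([old_df, new_df], ignore_index=True)

# Full-row comparison keeps both versions of "b" because the titles differ.
by_row = combined.drop_duplicates()

# Comparing on guid alone keeps one row per story; keep="last" prefers the newest copy.
by_guid = combined.drop_duplicates(subset="guid", keep="last", ignore_index=True)

print(len(by_row), len(by_guid))
```

Passing `ignore_index=True` also renumbers the rows, so the merged frame gets one continuous index instead of repeating the index values of the two inputs.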


Let's also look at the tail of the new dataset.

In [10]:
new_data_df.tail()

Unnamed: 0,title,pubDate,guid,link,description
42,Women's World Cup: Australia's Beth Mooney tak...,"Wed, 30 Mar 2022 04:14:00 GMT",https://www.bbc.co.uk/sport/av/cricket/60923258,https://www.bbc.co.uk/sport/av/cricket/6092325...,Australia's Beth Mooney takes a superb diving ...
43,Roman Abramovich sanctions: Chelsea can't be '...,"Wed, 30 Mar 2022 14:26:02 GMT",https://www.bbc.co.uk/sport/football/60931486,https://www.bbc.co.uk/sport/football/60931486?...,Chelsea cannot be allowed to operate like it i...
44,Women's World Cup: England set to face South A...,"Wed, 30 Mar 2022 09:28:33 GMT",https://www.bbc.co.uk/sport/cricket/60923228,https://www.bbc.co.uk/sport/cricket/60923228?a...,England will look to book a Women's World Cup ...
45,Paris St-Germain president Nasser Al-Khelaifi ...,"Wed, 30 Mar 2022 17:35:45 GMT",https://www.bbc.co.uk/sport/football/60929239,https://www.bbc.co.uk/sport/football/60929239?...,Almost a year on from the European Super Leagu...
46,Barcelona Femenino 5-2 Real Madrid Femenino (8...,"Wed, 30 Mar 2022 19:07:28 GMT",https://www.bbc.co.uk/sport/football/60934500,https://www.bbc.co.uk/sport/football/60934500?...,Barcelona beat Real Madrid to reach the Women'...


# Save merged data

After merging the data, save it (this will populate the next version of the dataset).

In [11]:
new_data_df.to_csv("bbc_news.csv", index=False)
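
The load, merge, and save steps can also be wrapped in a single function that handles the first run, when no previous CSV exists yet. This is a sketch under stated assumptions: `update_database` is a hypothetical helper, not part of the notebook, and the example writes invented data to a temporary directory rather than to the real dataset.

```python
import os
import tempfile

import pandas as pd


def update_database(path, new_df):
    """Merge new_df into the CSV at path (hypothetical helper) and return the result."""
    if os.path.exists(path):
        merged = pd.concat([pd.read_csv(path), new_df], ignore_index=True)
    else:
        # First run: there is no previous version to merge with.
        merged = new_df.copy()
    merged = merged.drop_duplicates(ignore_index=True)
    merged.to_csv(path, index=False)
    return merged


# Hypothetical data written to a temporary directory, not to the real dataset.
batch = pd.DataFrame({"guid": ["a", "b"], "title": ["Story A", "Story B"]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "bbc_news.csv")
    first = update_database(csv_path, batch)
    second = update_database(csv_path, batch)  # the same batch again adds nothing
    print(len(first), len(second))
```

Running the same batch twice leaves the row count unchanged, which is the property the periodic collection relies on.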