<a href="https://colab.research.google.com/github/dominikklepl/News-search-engine/blob/master/Code_explanation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News search engine
In this notebook we'll go through all the code for building our own search engine specialised in news articles.
A search engine consists of three main components:


* Crawler
* Indexer
* Query processor



## A. Crawler
We start by building a crawler that visits an RSS feed and extracts each entry's title, description (if any), publication date and link. It then processes this information and saves it in a meta-data "database" (actually a pandas DataFrame that is stored as a CSV file).

### Import packages

In [0]:
!pip install feedparser

import feedparser
import pandas as pd

Read the RSS feed from a URL. We also keep track of how many news items are in the feed and how many of them were already scraped in a previous run of the crawler. This will matter later, when we automatically adjust how often the crawler visits the RSS feed to scrape new information.
For testing we use just one URL: BBC World News.

In [9]:
URL = "http://feeds.bbci.co.uk/news/world/rss.xml"
feed = feedparser.parse(URL)

feed_len = len(feed.entries) #number of news in feed
old_news = 0  # count how many news in feed were already scraped

print("There are {} news in the RSS feed." .format(feed_len))

There are 38 news in the RSS feed.


Load the meta-data database stored as a CSV file. If this is the first time the crawler is let loose, this will be just an empty file with prepared column names (ID, title, link and published).
I'll keep the loading step commented out here and instead just create an empty DataFrame at this point.

In [36]:
#meta_data = pd.read_csv(PATH + "database.csv", index_col = 'Unnamed: 0')
meta_data = pd.DataFrame(columns=['ID', 'title', 'link', 'published'])
meta_data.head()

Unnamed: 0,ID,title,link,published
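
The choice between loading the stored database and starting from an empty one can also be made automatically. The following is a minimal sketch, not the notebook's own code: `load_meta_data`, `COLUMNS` and the path `database.csv` are hypothetical names, and it assumes the CSV was written by `DataFrame.to_csv` with the default index column.

```python
import os
import pandas as pd

COLUMNS = ['ID', 'title', 'link', 'published']

def load_meta_data(path):
    """Load the meta-data CSV if it exists, otherwise start with an empty table."""
    if os.path.exists(path):
        # index_col=0 reads the first (unnamed) CSV column back as the index
        return pd.read_csv(path, index_col=0)
    return pd.DataFrame(columns=COLUMNS)

# Hypothetical path; on a first run the file is assumed not to exist yet
meta_data_example = load_meta_data("database.csv")
print(list(meta_data_example.columns))
```

With a helper like this the same cell works on the very first run and on every later one, so nothing has to be commented in or out by hand.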


Now we write a function that accepts one entry from the feed, parses its contents into a list with the title, link and publication date, and assigns it a unique ID (which also reflects the order in which the entry was scraped and entered into our search engine).
We normalize the date to the format day/month/year. We don't care about a more precise time.

In [0]:
def process_entry(entry, ID):
  """Return [ID, title, link, published] for one feed entry."""
  title = entry.title
  link = entry.link
  #format the date as day/month/year (no zero-padding), dropping the time
  date = entry.published_parsed
  published = '{}/{}/{}'.format(date.tm_mday, date.tm_mon, date.tm_year)
  return [ID, title, link, published]

Test the function on one entry

In [33]:
test_entry = feed.entries[0]
process_entry(test_entry, 1)

[1,
 "MH17 disaster: Phone-taps 'show Russia directed Ukraine rebels'",
 'https://www.bbc.co.uk/news/world-europe-50419669',
 '14/11/2019']
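
Because the date string is assembled from the individual fields, single-digit days and months are not zero-padded (for example `5/3/2019`). If a fixed-width format is preferred, `time.strftime` can format the `time.struct_time` that feedparser exposes as `published_parsed`. This is only a sketch: `format_date` is a hypothetical helper, the sample date is made up, and returning `None` for a missing date is an assumption about how such entries should be handled.

```python
import time

def format_date(parsed):
    """Format a time.struct_time as zero-padded day/month/year."""
    if parsed is None:  # assumption: entries without a date are passed as None
        return None
    return time.strftime("%d/%m/%Y", parsed)

# Stand-in for entry.published_parsed (hypothetical value)
sample = time.strptime("2019-03-05", "%Y-%m-%d")
print(format_date(sample))
```

A fixed-width string is easier to parse back later, for instance if the index should be filtered by date.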

Now we iterate over all entries in the feed and check whether each entry is already in the meta-data database. If it is not, the entry is processed, assigned an ID and appended to a list of entries that will later be added to the database.

In [0]:
data = [] #list for collecting the newly processed entries
n = len(meta_data)+1 #next ID, assuming IDs run from 1 to the number of rows
for entry in feed.entries:
  #check that link isn't in the database yet
  if entry.link not in meta_data['link'].values:
    processed = process_entry(entry=entry, ID=n)
    data.append(processed)
    n += 1 #increase the ID value
  else:
    old_news += 1 #count already scraped entries
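
The membership test against `meta_data['link'].values` scans the whole column for every entry. As the database grows, a set of known links might be a cheaper alternative. The sketch below illustrates the idea with plain Python objects; `split_new_and_old` and the example links are hypothetical. Unlike the loop above, it also treats a link that appears twice within one feed as already seen.

```python
from types import SimpleNamespace

def split_new_and_old(entries, known_links):
    """Return (new_entries, old_count) given the links already stored."""
    seen = set(known_links)  # set lookups take constant time on average
    new_entries, old_count = [], 0
    for entry in entries:
        if entry.link in seen:
            old_count += 1
        else:
            new_entries.append(entry)
            seen.add(entry.link)  # guards against duplicates within one feed
    return new_entries, old_count

# Hypothetical stand-ins for feed entries
entries = [SimpleNamespace(link="https://example.com/a"),
           SimpleNamespace(link="https://example.com/b"),
           SimpleNamespace(link="https://example.com/a")]
new, old = split_new_and_old(entries, known_links=["https://example.com/b"])
print(len(new), old)
```

In the notebook's setting, `known_links` would be the `link` column of `meta_data`, and each entry in `new` would then be passed to `process_entry`.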

If there was at least one newly scraped entry, we add it to the database.

In [0]:
if len(data) > 0:
  #transform data to pandas DataFrame
  news_extracted = pd.DataFrame(data, columns=['ID', 'title', 'link', 'published'])

  #add new news to the database; ignore_index keeps row labels unique across runs
  meta_data = pd.concat([meta_data, news_extracted], axis = 0, ignore_index = True)

  #write database to a csv file
  #meta_data.to_csv(PATH + "database.csv")

Look at the database

In [39]:
meta_data.head(5)

Unnamed: 0,ID,title,link,published
0,1,MH17 disaster: Phone-taps 'show Russia directe...,https://www.bbc.co.uk/news/world-europe-50419669,14/11/2019
1,2,California school shooting: Santa Clarita atta...,https://www.bbc.co.uk/news/world-us-canada-504...,14/11/2019
2,3,Venice floods: Italy to declare state of emerg...,https://www.bbc.co.uk/news/world-europe-50416306,14/11/2019
3,4,Israel-Gaza ceasefire holding despite rocket fire,https://www.bbc.co.uk/news/world-middle-east-5...,14/11/2019
4,5,Trump impeachment inquiry: New claims amid pub...,https://www.bbc.co.uk/news/world-us-canada-503...,14/11/2019


Get percentage of already-scraped entries

In [42]:
print("{} % of entries were already scraped." .format((old_news/feed_len)*100))

0.0 % of entries were already scraped.
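
The share of already-scraped entries is what could later drive the crawling frequency. One possible rule, sketched here purely as an assumption and not as the notebook's eventual implementation, is to visit the feed more often when most entries are new and less often when most are already known. The function name `next_interval`, the thresholds and the time limits are all hypothetical.

```python
def next_interval(current_minutes, old_news, feed_len,
                  low=0.25, high=0.75, min_minutes=5, max_minutes=720):
    """Suggest the waiting time (in minutes) before the next crawl."""
    if feed_len == 0:
        return current_minutes  # empty feed: nothing to learn from
    old_share = old_news / feed_len
    if old_share < low:      # mostly new entries: we may be missing some
        proposed = current_minutes / 2
    elif old_share > high:   # mostly known entries: we are visiting too often
        proposed = current_minutes * 2
    else:
        proposed = current_minutes
    # keep the interval within the assumed bounds
    return max(min_minutes, min(max_minutes, proposed))

print(next_interval(60, old_news=0, feed_len=38))
print(next_interval(60, old_news=36, feed_len=38))
```

The first call mirrors the run above, where none of the 38 entries had been scraped before, so the suggested interval is halved. The second shows the opposite case, where almost everything was already known.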
