<a href="https://colab.research.google.com/github/dominikklepl/News-search-engine/blob/master/Code_explanation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News search engine
In this notebook we'll go through all the code for building our own search engine specialised in news articles.
Every search engine consists of three main components:


* Crawler
* Indexer
* Query processor



## A. Crawler
We start by building a crawler that visits an RSS feed and extracts the title, description (if any), publication date and link of each entry. It then processes this information and saves it in a meta-data "database" (actually a pandas DataFrame that is stored as a CSV file).

### Import packages

In [110]:
!pip install feedparser

import feedparser
import pandas as pd



Read the RSS feed from a URL. We also keep track of how many news items are in the feed and how many of them were already scraped in a previous iteration of the crawler. This will matter later, when we automatically adjust how often the crawler visits the RSS feed to scrape new information.
For testing we use just one URL: BBC World News.

In [111]:
URL = "http://feeds.bbci.co.uk/news/world/rss.xml"
feed = feedparser.parse(URL)

feed_len = len(feed.entries) #number of news in feed
old_news = 0  # count how many news in feed were already scraped

print("There are {} news in the RSS feed." .format(feed_len))

There are 39 news in the RSS feed.
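
To make it clearer which fields the crawler is after, here is a minimal sketch that pulls the same four fields out of a tiny RSS snippet. It uses the standard library's `xml.etree.ElementTree` instead of feedparser, and the feed content, the URLs and the `list_items` helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, invented for this example only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <description>Summary of the first story.</description>
      <link>https://example.com/news/1</link>
      <pubDate>Fri, 15 Nov 2019 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/news/2</link>
      <pubDate>Fri, 15 Nov 2019 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def list_items(rss_text):
    """Return one dict per <item>, using '' for fields that are missing."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items

sample_items = list_items(SAMPLE_RSS)
print(len(sample_items))
print(sample_items[1])
```

The second item has no description, which is why the summary has to be treated as optional.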


Load the meta-data database stored as a CSV file. If this is the first time the crawler is let loose, this will be just an empty file with prepared column names (ID, title, summary, link and published).
I'll keep it commented out here and instead just create an empty DataFrame at this point.

In [112]:
#meta_data = pd.read_csv(PATH + "database.csv", index_col = 'Unnamed: 0')
meta_data = pd.DataFrame(columns=['ID', 'title', 'summary', 'link', 'published'])
meta_data.head()

Unnamed: 0,ID,title,summary,link,published


Now we write a function that accepts one entry from the feed, parses its contents into a list with the title, summary, link and publication date, and assigns it a unique ID (which also reflects the order in which entries were scraped and added to our search engine).
We normalize the date to the format day/month/year. We don't care about more precise time.

In [0]:
def process_entry(entry, ID):
  title = entry.title
  summary = entry.get('summary', '')  # the description is optional, fall back to an empty string
  link = entry.link
  # keep only the date, formatted as day/month/year
  date = entry.published_parsed
  published = '{}/{}/{}'.format(date.tm_mday, date.tm_mon, date.tm_year)
  return [ID, title, summary, link, published]

Test the function on one entry

In [114]:
test_entry = feed.entries[0]
process_entry(test_entry, 1)

[1,
 "Trump impeachment inquiry: Envoy 'intimidated' by tweets during testimony",
 'The president criticised the former US envoy to Ukraine in the middle of her impeachment testimony.',
 'https://www.bbc.co.uk/news/world-us-canada-50436521',
 '15/11/2019']
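
As an alternative, the date could be formatted with `time.strftime`, which works directly on the `time.struct_time` that feedparser stores in `published_parsed`. This is only a sketch: `format_date` is a hypothetical helper and the timestamp is invented. Note that it zero-pads the day and month, whereas `process_entry` does not.

```python
import time

def format_date(parsed):
    """Format a time.struct_time as zero-padded DD/MM/YYYY."""
    return time.strftime("%d/%m/%Y", parsed)

# Made-up timestamp standing in for entry.published_parsed
sample_time = time.strptime("2019-11-05 08:30:00", "%Y-%m-%d %H:%M:%S")
print(format_date(sample_time))  # 05/11/2019
```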

Now we iterate over all entries in the feed and check whether each entry is already in the meta-data database. If it is not, the entry is processed, assigned an ID and appended to a list of entries that will later be added to the database.

In [0]:
data = [] #list for collecting the newly processed entries
n = len(meta_data)+1 #next ID, assuming IDs run from 1 to the number of rows in the database
for entry in feed.entries:

  #check that link isn't in the database yet
  if entry.link not in meta_data['link'].values:
    processed = process_entry(entry = entry, ID=n)
    data.append(processed)
    n += 1 #increase the ID value
  else:
    old_news += 1 #count already scraped entries
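
The membership test against `meta_data['link'].values` scans the whole column for every entry, which is fine for a small database. Once the database grows, the known links could be put into a set first. Here is a sketch of that idea on plain dictionaries; `split_new_and_old` and the sample links are hypothetical and not part of the crawler.

```python
def split_new_and_old(entries, known_links):
    """Return the entries whose link is unseen, plus a count of the seen ones."""
    seen = set(known_links)  # set lookups take constant time on average
    new_entries, old_count = [], 0
    for entry in entries:
        if entry["link"] in seen:
            old_count += 1
        else:
            new_entries.append(entry)
            seen.add(entry["link"])  # also catches duplicates inside one feed
    return new_entries, old_count

# Made-up entries and database content
sample_entries = [
    {"title": "A", "link": "https://example.com/a"},
    {"title": "B", "link": "https://example.com/b"},
    {"title": "A again", "link": "https://example.com/a"},
]
new_entries, old_count = split_new_and_old(sample_entries, ["https://example.com/b"])
print([e["title"] for e in new_entries], old_count)  # ['A'] 2
```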

If there was at least one newly scraped entry, we add it to the database.

In [0]:
if len(data) > 0:
  #transform data to pandas DataFrame
  news_extracted = pd.DataFrame(data, columns=['ID', 'title', 'summary', 'link', 'published'])

  #add new news to the database; ignore_index keeps the row labels unique
  meta_data = pd.concat([meta_data, news_extracted], axis = 0, ignore_index = True)

  #write database to a csv file
  #meta_data.to_csv(PATH + "database.csv")

Look at the database

In [117]:
meta_data.head(5)

Unnamed: 0,ID,title,summary,link,published
0,1,Trump impeachment inquiry: Envoy 'intimidated'...,The president criticised the former US envoy t...,https://www.bbc.co.uk/news/world-us-canada-504...,15/11/2019
1,2,Roger Stone: Trump ally convicted of lying to ...,The former adviser to President Donald Trump i...,https://www.bbc.co.uk/news/world-us-canada-504...,15/11/2019
2,3,Hong Kong protests: China condemns 'appalling'...,Hong Kong's Justice Secretary Teresa Cheng was...,https://www.bbc.co.uk/news/world-asia-china-50...,15/11/2019
3,4,Chile protests: Government bows to demands for...,"After weeks of unrest in the country, Chile ha...",https://www.bbc.co.uk/news/world-latin-america...,15/11/2019
4,5,Inventor of the famed 'Sourtoe Cocktail' dies,Captain Dick Stevenson invented the alcoholic ...,https://www.bbc.co.uk/news/world-us-canada-503...,15/11/2019


Get percentage of already-scraped entries

In [118]:
print("{} % of entries were already scraped." .format((old_news/feed_len)*100))

0.0 % of entries were already scraped.
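
The share of already-scraped entries is meant to steer how often the crawler visits the feed. The notebook doesn't spell out the rule, so the following is only one possible sketch: `next_interval`, its thresholds and its limits are all assumptions chosen for illustration.

```python
def next_interval(current_minutes, old_fraction, low=0.5, high=0.9,
                  min_minutes=5, max_minutes=240):
    """Halve the wait if few entries were already known, double it if almost all were.

    All thresholds and limits here are hypothetical defaults.
    """
    if old_fraction < low:
        proposed = current_minutes / 2   # lots of fresh news: visit more often
    elif old_fraction > high:
        proposed = current_minutes * 2   # almost nothing new: visit less often
    else:
        proposed = current_minutes
    # keep the interval inside the allowed range
    return max(min_minutes, min(max_minutes, proposed))

print(next_interval(60, 0.0))   # 30.0
print(next_interval(60, 0.95))  # 120
```

With the 0.0 % measured above, this rule would shorten the interval, since every entry in the feed was new.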


## B. Indexer
The second part of a search engine is the indexer. It's basically a smart storage for our news articles from which we can later easily retrieve relevant articles given a search query.
It parses the title and description of the news articles scraped by the crawler into single words. All these words make up the vocabulary of our index. The next step is to put the ID of each article into the posting lists of the words that the article contains. For example, an article called "This happened today" will be stored in the posting lists of the terms "this", "happened" and "today".
Before creating the index we preprocess the text of the articles in order to get rid of useless information. We strip the text of punctuation and turn everything to lowercase. Next we perform lemmatization, a slightly smarter version of stemming. Essentially, it's word normalization, e.g. turning all nouns to singular, all verbs to present tense, etc.

Let's do it.

### Importing

In [119]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer 
import string

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We start with an empty dictionary as our index. Later, as we scrape more articles, we won't start from an empty index but will update the one already created.

The index is organized as:
```
{
  "word1": [ID1, ID2, ...],
  "word2": [ID5, ID8, ...],
  ...
}
```
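
To make this structure concrete, here is a toy version built from two made-up article titles. It's just a sketch of the idea and skips the preprocessing that the real index gets; `toy_articles` and `toy_index` are illustrative names.

```python
from collections import defaultdict

# Two invented articles: ID -> already lowercased title
toy_articles = {1: "this happened today", 2: "today it rained"}

toy_index = defaultdict(list)
for article_id, text in toy_articles.items():
    # dict.fromkeys keeps each word once, in its original order
    for word in dict.fromkeys(text.split()):
        toy_index[word].append(article_id)

print(dict(toy_index))
# {'this': [1], 'happened': [1], 'today': [1, 2], 'it': [2], 'rained': [2]}
```

The word "today" appears in both articles, so its posting list holds both IDs.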

We start with just a single article from our meta-data database.

In [0]:
entry = meta_data.loc[0,:].copy()

In [121]:
entry

ID                                                           1
title        Trump impeachment inquiry: Envoy 'intimidated'...
summary      The president criticised the former US envoy t...
link         https://www.bbc.co.uk/news/world-us-canada-504...
published                                           15/11/2019
Name: 0, dtype: object

### Text preprocessing
Turn the title to lowercase and remove punctuation.

In [0]:
def process_string(text):
  text = text.lower() #to lowercase
  text = text.translate(str.maketrans('', '', string.punctuation)) #strip punctuation
  return text

In [123]:
process_string(entry.title)

'trump impeachment inquiry envoy intimidated by tweets during testimony'
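
`process_string` lowercases the text and strips punctuation, but accented characters pass through unchanged. If accents should be folded as well, one option is the standard library's `unicodedata` module. The helper below is a sketch, not part of the notebook's pipeline, and `strip_accents` is a hypothetical name.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("São Paulo café"))  # Sao Paulo cafe
```

Letters that have no decomposition (for example "ø") are left as they are, so this is a partial solution at best.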

Now we lemmatize, i.e. normalize the words.

This method requires some additional information about the words: we need to find the word category (part of speech) of each word, e.g. verb, noun etc.

In [0]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

Test the function

In [125]:
print("Apple: {}\n Run: {}\n Happy: {}" .format(get_wordnet_pos("apple"), get_wordnet_pos("run"), get_wordnet_pos("happy")))

Apple: n
 Run: v
 Happy: a
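
The chain `pos_tag([word])[0][1][0]` is a bit dense, so here is the same mapping spelled out on a hand-written sample, without calling NLTK. The tagged pairs are written by hand in the shape `pos_tag` returns, and `to_wordnet_pos` is a hypothetical stand-in for the lookup inside `get_wordnet_pos`.

```python
# Hand-written (word, tag) pairs with Penn Treebank tags, for illustration only.
tagged_sample = [("quick", "JJ"), ("foxes", "NNS"), ("jumped", "VBD"),
                 ("quickly", "RB"), ("the", "DT")]

# WordNet's part-of-speech constants are single letters:
# wordnet.ADJ == 'a', wordnet.NOUN == 'n', wordnet.VERB == 'v', wordnet.ADV == 'r'
FIRST_LETTER_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def to_wordnet_pos(treebank_tag):
    """Use the first letter of the tag; anything unknown falls back to noun."""
    return FIRST_LETTER_TO_WORDNET.get(treebank_tag[0].upper(), "n")

print([to_wordnet_pos(tag) for _, tag in tagged_sample])
# ['a', 'n', 'v', 'r', 'n']
```

Because only the first letter is used, every tag starting with "R" counts as an adverb, including the particle tag "RP". That's a simplification worth keeping in mind.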


We also need to remove stopwords, i.e. words with low informational value.

In [0]:
stop = stopwords.words('english')
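
`stopwords.words('english')` returns a list, so every membership test walks through it. Wrapping the list in `set()` would make those lookups faster without changing the result. The sketch below shows the filtering step with a tiny hand-picked stop set; `toy_stop` and `drop_stopwords` are illustrative names, and NLTK's real list is much longer.

```python
# A tiny stop list made up for this example.
toy_stop = {"the", "of", "in", "by", "during", "to", "her"}

def drop_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = "the president criticised the former us envoy to ukraine".split()
print(drop_stopwords(tokens, toy_stop))
# ['president', 'criticised', 'former', 'us', 'envoy', 'ukraine']
```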

Now we'll iterate over all words in the text, skip the stopwords, lemmatize the rest and return the transformed string.

In [0]:
lem = WordNetLemmatizer()

def stop_lemmatize(doc):
    tokens = nltk.word_tokenize(doc)
    tmp = ""
    for w in tokens:
        if w not in stop:
            tmp += lem.lemmatize(w, get_wordnet_pos(w)) + " "
    return tmp

In [128]:
stop_lemmatize(doc = entry.title)

"Trump impeachment inquiry : Envoy 'intimidated ' tweet testimony "

In [0]:
def process_string(text):
  text = text.lower() #to lowercase
  text = text.translate(str.maketrans('', '', string.punctuation)) #strip punctuation
  text = stop_lemmatize(text)
  return text

In [130]:
%time process_string(entry.title)

CPU times: user 4.04 ms, sys: 4 µs, total: 4.04 ms
Wall time: 4.68 ms


'trump impeachment inquiry envoy intimidate tweet testimony '

Now we apply the process_string function to all titles and summaries in our database.

In [0]:
meta_processed = meta_data.copy()

In [0]:
def transform_df(df):
  df['title'] = df['title'].apply(process_string)
  df['summary'] = df['summary'].apply(process_string)

In [133]:
%time transform_df(meta_processed)

CPU times: user 119 ms, sys: 6.96 ms, total: 126 ms
Wall time: 132 ms


In [134]:
meta_processed.head(5)

Unnamed: 0,ID,title,summary,link,published
0,1,trump impeachment inquiry envoy intimidate twe...,president criticise former u envoy ukraine mid...,https://www.bbc.co.uk/news/world-us-canada-504...,15/11/2019
1,2,roger stone trump ally convict lie congress,former adviser president donald trump convict ...,https://www.bbc.co.uk/news/world-us-canada-504...,15/11/2019
2,3,hong kong protest china condemns appal attack ...,hong kongs justice secretary teresa cheng surr...,https://www.bbc.co.uk/news/world-asia-china-50...,15/11/2019
3,4,chile protest government bow demand referendum,week unrest country chile agree hold referendu...,https://www.bbc.co.uk/news/world-latin-america...,15/11/2019
4,5,inventor famed sourtoe cocktail dy,captain dick stevenson invent alcoholic cockta...,https://www.bbc.co.uk/news/world-us-canada-503...,15/11/2019


In practice, we won't transform the whole meta-data database, since that would mean rebuilding the index from scratch after every crawler iteration. Instead we would use only the subset of the database containing newly added articles.

Now we can iterate over all entries to create the index. We'll go step by step again before wrapping it all in one nice function.

Merge the title and summary into one field and drop all columns except ID, as we don't need them anymore.

In [0]:
meta_processed['text'] = meta_processed['title'] + " " + meta_processed['summary']
drop_cols = ['title', 'summary', 'published', 'link']
meta_processed = meta_processed.drop(drop_cols, axis=1)

In [136]:
meta_processed.head(5)

Unnamed: 0,ID,text
0,1,trump impeachment inquiry envoy intimidate twe...
1,2,roger stone trump ally convict lie congress f...
2,3,hong kong protest china condemns appal attack ...
3,4,chile protest government bow demand referendum...
4,5,inventor famed sourtoe cocktail dy captain di...


Add this part to the transform_df function.

In [0]:
def transform_df(df):
  df = df.copy() #work on a copy so the original meta-data stays untouched
  df['title'] = df['title'].apply(process_string)
  df['summary'] = df['summary'].apply(process_string)
  df['text'] = df['title'] + " " + df['summary']
  drop_cols = ['title', 'summary', 'published', 'link']
  df = df.drop(drop_cols, axis=1)
  return df

Now we'll build an index with just one entry.

In [138]:
entry = meta_processed.loc[0,:].copy()
print(entry)
index_test = {}

ID                                                      1
text    trump impeachment inquiry envoy intimidate twe...
Name: 0, dtype: object


Split the entry's text into a list of single words and store the entry's ID in a variable.

In [0]:
words = entry.text.split()
ID = entry.ID

Each word in the index's vocabulary is a dictionary key and has its own posting list of IDs. Let's construct a one-word vocabulary as an example.

In [140]:
word = words[0]
sample = {word: [ID]}
print(sample)

{'trump': [1]}


Now we iterate over all words and add those that aren't in the vocabulary yet. For each word we also append the entry ID to its posting list, unless it is already there.

In [0]:
for word in words:
  if word in index_test.keys():
    if ID not in index_test[word]:
      index_test[word].append(ID)
  else:
    index_test[word] = [ID]

In [142]:
print(index_test)

{'trump': [1], 'impeachment': [1], 'inquiry': [1], 'envoy': [1], 'intimidate': [1], 'tweet': [1], 'testimony': [1], 'president': [1], 'criticise': [1], 'former': [1], 'u': [1], 'ukraine': [1], 'middle': [1]}


To repeat this process for all entries in the database, we wrap it in a function.

In [0]:
def index_it(entry, index):
  words = entry.text.split()
  ID = entry.ID
  for word in words:
    if word in index:
      #known word: add the ID only once per article
      if ID not in index[word]:
        index[word].append(ID)
    else:
      #new word: start its posting list
      index[word] = [ID]
  return index

In [144]:
ind = index_it(entry=entry, index= {})
print(ind)

{'trump': [1], 'impeachment': [1], 'inquiry': [1], 'envoy': [1], 'intimidate': [1], 'tweet': [1], 'testimony': [1], 'president': [1], 'criticise': [1], 'former': [1], 'u': [1], 'ukraine': [1], 'middle': [1]}


Again we can iterate over all entries in the database of scraped articles, process them and add them to the index.

In [0]:
def index_all(df, index):
  #iterrows works for any row labels, e.g. a subset with only newly added articles
  for _, entry in df.iterrows():
    index = index_it(entry = entry, index = index)
  return index

In [146]:
index = index_all(meta_processed, index = {})
len(index)

447
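
As a quick sanity check on what the index is good for: looking up a term is a single dictionary access, and articles containing several terms can be found by intersecting posting lists. The sketch below uses a small made-up index. `lookup_all` is a hypothetical helper and not the query processor itself.

```python
def lookup_all(index, terms):
    """Return the sorted IDs of articles whose posting lists contain every term."""
    postings = [set(index.get(term, [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Invented index in the same word -> [IDs] shape as above
small_index = {"trump": [1, 2], "envoy": [1], "protest": [3, 4]}

print(lookup_all(small_index, ["trump", "envoy"]))    # [1]
print(lookup_all(small_index, ["trump", "protest"]))  # []
```

The query terms would need the same preprocessing as the articles, otherwise "Protests" would never match the stored "protest".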

Finally we wrap everything in one nice function.

In [0]:
def build_index(df, index):
    to_add = transform_df(df)
    index = index_all(df = to_add, index = index)
    return index

In [0]:
idx = build_index(df = meta_data, index = {})

In [149]:
len(idx)

447
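
Since the index is meant to be updated after each crawl rather than rebuilt, here is a sketch of what such an update could look like on plain Python data. It assumes that IDs only ever increase, so anything above the largest indexed ID counts as new. `newest_indexed_id`, `add_to_index` and the sample articles are hypothetical.

```python
def newest_indexed_id(index):
    """Largest article ID found in any posting list, or 0 for an empty index."""
    return max((max(ids) for ids in index.values() if ids), default=0)

def add_to_index(index, articles):
    """Add (article_id, processed_text) pairs to an existing index in place."""
    for article_id, text in articles:
        for word in text.split():
            postings = index.setdefault(word, [])
            if article_id not in postings:
                postings.append(article_id)
    return index

# Made-up existing index and already preprocessed articles
existing = {"trump": [1], "envoy": [1]}
all_articles = [(1, "trump envoy"), (2, "roger stone trump ally"), (3, "hong kong protest")]

last_id = newest_indexed_id(existing)
new_batch = [a for a in all_articles if a[0] > last_id]  # only articles 2 and 3
add_to_index(existing, new_batch)
print(existing["trump"])  # [1, 2]
```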

And for future use we save the index to a JSON file.

In [0]:
import json

with open('index.json', 'w') as fp:
    json.dump(idx, fp, sort_keys=True, indent=4)

It can of course be loaded again with the following code.

In [0]:
with open('index.json', 'r') as f:
    data = json.load(f)
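
To check that nothing gets lost on the way, here is a small round trip on a made-up index. It writes to a temporary directory so it doesn't touch the real `index.json`; the file name `index_demo.json` is just a placeholder.

```python
import json
import os
import tempfile

# Invented index in the same word -> [IDs] shape
sample_index = {"trump": [1, 2], "envoy": [1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index_demo.json")  # throwaway file
    with open(path, "w") as fp:
        json.dump(sample_index, fp, sort_keys=True, indent=4)
    with open(path, "r") as fp:
        restored = json.load(fp)

print(restored == sample_index)  # True
```

The round trip is lossless here because the keys are strings and the IDs are plain integers inside lists. JSON object keys are always strings, so an index keyed by integer IDs would come back with string keys.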