<a href="https://colab.research.google.com/github/dominikklepl/News-search-engine/blob/master/Code_explanation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News search engine
In this notebook we'll go through all the code for building our own search engine specialised in news articles.
A search engine consists of three main components:


* Crawler
* Indexer
* Query processor



## A. Crawler
We start by building a crawler that visits an RSS feed and extracts the title, description (if any), publication date and link of each entry. It then processes this information and saves it in a meta-data "database" (actually a pandas DataFrame that is stored as a csv file).

### Import packages

In [0]:
!pip install feedparser

import feedparser
import pandas as pd


Read the RSS feed from a URL. We also keep track of how many news items are in the feed and how many of them were already scraped in a previous iteration of the crawler. This will be important later for automatically adjusting how often the crawler visits the RSS feed.
For testing we use just one URL: BBC World News.

In [0]:
URL = "http://feeds.bbci.co.uk/news/world/rss.xml"
feed = feedparser.parse(URL)

feed_len = len(feed.entries) #number of news in feed
old_news = 0  # count how many news in feed were already scraped

print("There are {} news in the RSS feed." .format(feed_len))

There are 29 news in the RSS feed.


Load the meta-data database stored as a csv file. If this is the first time the crawler is let loose, this will be just an empty file with prepared column names (ID, title, summary, link and published).
I'll keep it commented out here and instead just create an empty dataframe at this point.

In [0]:
#meta_data = pd.read_csv(PATH + "database.csv", index_col = 'Unnamed: 0')
meta_data = pd.DataFrame(columns=['ID', 'title', 'summary', 'link', 'published'])
meta_data.head()

Unnamed: 0,ID,title,summary,link,published


Now we write a function that accepts one entry from the feed and parses its contents into a list with the title, summary, link and publication date. It also assigns the entry a unique ID (which reflects the order in which entries were scraped and added to our search engine).
We normalize the date to the format day/month/year. We don't care about more precise time.

In [0]:
import time

def process_entry(entry, ID):
  title = entry.title
  summary = entry.get('summary', '') #the description is optional, default to empty
  link = entry.link
  #normalize the date to zero-padded day/month/year (same format as get_today)
  published = time.strftime("%d/%m/%Y", entry.published_parsed)
  return [ID, title, summary, link, published]

Test the function on one entry

In [0]:
test_entry = feed.entries[0]
process_entry(test_entry, 1)

[1,
 'Benjamin Netanyahu: Israel PM charged with corruption',
 'The prime minister described the charges as an "attempted coup", blaming a "tainted" process.',
 'https://www.bbc.co.uk/news/world-middle-east-50508399',
 '21/11/2019']

Now we iterate over all entries in the feed and check whether each entry is already in the meta-data database. If it isn't, the entry is processed, assigned an ID and appended to a list of entries that will later be added to the database.

In [0]:
data = [] #list of newly processed entries
n = len(meta_data)+1 #next free ID (IDs are assigned consecutively from 1)
for entry in feed.entries:
  #check that link isn't in the database yet
  if entry.link not in meta_data['link'].values:
    processed = process_entry(entry=entry, ID=n)
    data.append(processed)
    n += 1 #increase the ID value
  else:
    old_news += 1 #count already scraped entries
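The check against `meta_data['link'].values` scans the whole column for every entry. For a larger database, a set of known links may be a cheaper way to do the same test. The sketch below uses plain dictionaries as hypothetical stand-ins for feed entries (the names `known_links`, `feed_entries` and `split_new_and_old` are made up for illustration) and also guards against a link appearing twice within one feed.

```python
# Hypothetical stand-ins for the links already stored and for feed entries.
known_links = {"https://example.com/a", "https://example.com/b"}
feed_entries = [
    {"title": "Old story", "link": "https://example.com/a"},
    {"title": "New story", "link": "https://example.com/c"},
    {"title": "New story (repeated in feed)", "link": "https://example.com/c"},
]

def split_new_and_old(entries, known_links):
    """Return (new_entries, old_count) without mutating known_links."""
    seen = set(known_links)  # copy so the caller's set stays untouched
    new_entries, old_count = [], 0
    for entry in entries:
        if entry["link"] in seen:
            old_count += 1
        else:
            seen.add(entry["link"])
            new_entries.append(entry)
    return new_entries, old_count

new_entries, old_count = split_new_and_old(feed_entries, known_links)
print(len(new_entries), old_count)  # 1 2
```

In the notebook the set could be built once per crawl with `set(meta_data['link'])`.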

If there was at least one newly scraped entry, we add it to the database.

In [0]:
if len(data) > 0:
  #transform data to pandas DataFrame
  news_extracted = pd.DataFrame(data, columns=['ID', 'title', 'summary', 'link', 'published'])

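  #add new entries to the database; ignore_index keeps a clean 0..n-1 index,
  #which the later df.loc[i,:] loops rely on
  meta_data = pd.concat([meta_data, news_extracted], axis = 0, ignore_index=True)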

  #write database to a csv file
  #meta_data.to_csv(PATH + "database.csv")

Look at the database

In [0]:
meta_data.head(5)

Unnamed: 0,ID,title,summary,link,published
0,1,Benjamin Netanyahu: Israel PM charged with cor...,The prime minister described the charges as an...,https://www.bbc.co.uk/news/world-middle-east-5...,21/11/2019
1,2,UK to repatriate first citizens from north-eas...,The individuals will be returned to the UK fro...,https://www.bbc.co.uk/news/uk-50506909,21/11/2019
2,3,Russia bans sale of gadgets without Russian-ma...,Supporters say the law on new sales promotes R...,https://www.bbc.co.uk/news/world-europe-50507849,21/11/2019
3,4,"DR Congo measles: Nearly 5,000 dead in major o...","Almost 250,000 people in the Democratic Republ...",https://www.bbc.co.uk/news/world-africa-50506743,21/11/2019
4,5,Australia fires: Sea of fire races across fiel...,Bushfires have hit areas near Adelaide and cre...,https://www.bbc.co.uk/news/world-australia-504...,21/11/2019


Get percentage of already-scraped entries

In [0]:
print("{} % of entries were already scraped." .format((old_news/feed_len)*100))

0.0 % of entries were already scraped.
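This share of already-scraped entries is what could drive the crawl frequency mentioned earlier. The following is only a sketch of one possible rule: the function name `next_interval`, the thresholds (0.9 and 0.5) and the bounds are illustrative assumptions, not part of the crawler described here. It takes the fraction `old_news/feed_len` (between 0 and 1) and doubles or halves the pause between visits.

```python
def next_interval(current_seconds, old_fraction,
                  min_seconds=15 * 60, max_seconds=6 * 60 * 60):
    """Pick the next pause between crawls from the share of old entries.

    The thresholds (0.9 and 0.5) and the bounds are illustrative guesses.
    """
    if old_fraction >= 0.9:      # almost nothing new: visit less often
        proposed = current_seconds * 2
    elif old_fraction <= 0.5:    # many new entries: visit more often
        proposed = current_seconds / 2
    else:                        # in between: keep the current pace
        proposed = current_seconds
    # clamp the result to the allowed range
    return int(min(max(proposed, min_seconds), max_seconds))

print(next_interval(3600, 0.0))   # 1800
print(next_interval(3600, 0.95))  # 7200
print(next_interval(3600, 0.7))   # 3600
```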


## B. Indexer
The second part of a search engine is an indexer. It's basically a smart storage for our news articles from which we can later easily retrieve relevant articles given a search query.
It splits the title and description of the news articles scraped by the crawler into single words. All these words make up the vocabulary of our index. The next step is to put the ID of the article in the posting lists of the words that the article contains. For example, an article called "This happened today" will be stored in the posting lists of the terms "this", "happened" and "today".
Before creating the index we preprocess the text of the articles in order to get rid of useless information. We strip punctuation from the text and turn everything to lowercase. Next we remove stopwords and perform lemmatization. This is a slightly smarter version of stemming: essentially a word normalization, e.g. turning all nouns to singular and all verbs to their base form.

Let's do it.

#### Importing

In [0]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer 
import string

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


We start with an empty dictionary as our index. Later, as we scrape more articles, we won't start from an empty index again but will just update the one we already created.

The index is organized as:
```
{
  "word1": [ID1, ID2, ...],
  "word2": [ID5, ID8, ...],
  ...
}
```

We start with just a single article from our meta-data database.

In [0]:
entry = meta_data.loc[0,:].copy()

In [0]:
entry

ID                                                           1
title        Benjamin Netanyahu: Israel PM charged with cor...
summary      The prime minister described the charges as an...
link         https://www.bbc.co.uk/news/world-middle-east-5...
published                                           21/11/2019
Name: 0, dtype: object

### Text preprocessing
Turn the title to lowercase and remove punctuation.

In [0]:
def process_string(text):
  text = text.lower() #to lowercase
  text = text.translate(str.maketrans('', '', string.punctuation)) #strip punctuation
  return text

In [0]:
process_string(entry.title)

'benjamin netanyahu israel pm charged with corruption'
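Note that `process_string` lowercases and strips punctuation but leaves accented characters as they are. If accents should be removed as well, one possible approach (a sketch using only the standard library; the helper names `strip_accents` and `process_string_with_accents` are made up here) is to decompose the characters with Unicode normalization and drop the combining marks.

```python
import string
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def process_string_with_accents(text):
    text = strip_accents(text.lower())  # lowercase, then remove accents
    return text.translate(str.maketrans("", "", string.punctuation))

print(process_string_with_accents("Côte d'Ivoire: Président visits São Paulo"))
# cote divoire president visits sao paulo
```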

Now, lemmatize, i.e. word normalization.

This method requires some additional information about the words. We need to find the word category of each word, e.g. verb, noun etc.

In [0]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

Test the function

In [0]:
print("Apple: {}\n Run: {}\n Happy: {}" .format(get_wordnet_pos("apple"), get_wordnet_pos("run"), get_wordnet_pos("happy")))

Apple: n
 Run: v
 Happy: a


We also need to remove stopwords, i.e. words with low informational value.

In [0]:
stop = stopwords.words('english')

Now we'll iterate over all words in text, lemmatize and return the transformed string.

In [0]:
lem = WordNetLemmatizer()

def stop_lemmatize(doc):
    tokens = nltk.word_tokenize(doc)
    tmp = ""
    for w in tokens:
        if w not in stop:
            tmp += lem.lemmatize(w, get_wordnet_pos(w)) + " "
    return tmp

In [0]:
stop_lemmatize(doc = entry.title)

'Benjamin Netanyahu : Israel PM charge corruption '

In [0]:
def process_string(text):
  text = text.lower() #to lowercase
  text = text.translate(str.maketrans('', '', string.punctuation)) #strip punctuation
  text = stop_lemmatize(text)
  return text

In [0]:
%time process_string(entry.title)

CPU times: user 2.14 ms, sys: 1.78 ms, total: 3.92 ms
Wall time: 9.52 ms


'benjamin netanyahu israel pm charge corruption '

Now we apply the process_string function to all titles and summaries in our database.

In [0]:
meta_processed = meta_data.copy()

In [0]:
def transform_df(df):
  df['title'] = df['title'].apply(process_string)
  df['summary'] = df['summary'].apply(process_string)

In [0]:
%time transform_df(meta_processed)

CPU times: user 103 ms, sys: 4.83 ms, total: 108 ms
Wall time: 118 ms


In [0]:
meta_processed.head(5)

Unnamed: 0,ID,title,summary,link,published
0,1,benjamin netanyahu israel pm charge corruption,prime minister described charge attempt coup b...,https://www.bbc.co.uk/news/world-middle-east-5...,21/11/2019
1,2,uk repatriate first citizen northeastern syria,individual return uk area formerly control com...,https://www.bbc.co.uk/news/uk-50506909,21/11/2019
2,3,russia ban sale gadget without russianmade sof...,supporter say law new sale promotes russian te...,https://www.bbc.co.uk/news/world-europe-50507849,21/11/2019
3,4,dr congo measles nearly 5000 dead major outbreak,almost 250000 people democratic republic congo...,https://www.bbc.co.uk/news/world-africa-50506743,21/11/2019
4,5,australia fire sea fire race across field near...,bushfires hit area near adelaide create smoky ...,https://www.bbc.co.uk/news/world-australia-504...,21/11/2019


In practice, we won't transform the whole meta-data database, since that would mean creating the index from scratch after every crawler iteration. Instead we would use only the subset of the database that contains the newly added articles.

Now we can iterate over all entries to create the index. We'll go step by step again before wrapping it all in one nice function.

Merge title and summary into one field and drop all columns except for ID as we don't need those anymore.

In [0]:
meta_processed['text'] = meta_processed['title'] + " " + meta_processed['summary']
drop_cols = ['title', 'summary', 'published', 'link']
meta_processed = meta_processed.drop(drop_cols, axis=1)

In [0]:
meta_processed.head(5)

Unnamed: 0,ID,text
0,1,benjamin netanyahu israel pm charge corruption...
1,2,uk repatriate first citizen northeastern syria...
2,3,russia ban sale gadget without russianmade sof...
3,4,dr congo measles nearly 5000 dead major outbre...
4,5,australia fire sea fire race across field near...


Add this part to the transform_df function.

In [0]:
def transform_df(df):
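  #note: the column assignments below modify the DataFrame that was passed in;
  #only the final drop() returns a new object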
  df['title'] = df['title'].apply(process_string)
  df['summary'] = df['summary'].apply(process_string)
  df['text'] = df['title'] + " " + df['summary']
  drop_cols = ['title', 'summary', 'published', 'link']
  df = df.drop(drop_cols, axis=1)
  return df

### Build index

Now we'll build index with just one entry.

In [0]:
entry = meta_processed.loc[0,:].copy()
print(entry)
index_test = {}

ID                                                      1
text    benjamin netanyahu israel pm charge corruption...
Name: 0, dtype: object


Split the entry's text into single words and store the entry's ID in a variable.

In [0]:
words = entry.text.split()
ID = entry.ID

Each word in the index's vocabulary is a dictionary key and has its own posting list with IDs. Let's construct a one-word vocabulary as an example.

In [0]:
word = words[0]
sample = {word: [ID]}
print(sample)

{'benjamin': [1]}


Now we iterate over all words and if they aren't in the vocabulary yet we add them. Also for each word we append the entry ID to the posting list.

In [0]:
for word in words:
  if word in index_test.keys():
    if ID not in index_test[word]:
      index_test[word].append(ID)
  else:
    index_test[word] = [ID]

In [0]:
print(index_test)

{'benjamin': [1], 'netanyahu': [1], 'israel': [1], 'pm': [1], 'charge': [1], 'corruption': [1], 'prime': [1], 'minister': [1], 'described': [1], 'attempt': [1], 'coup': [1], 'blame': [1], 'taint': [1], 'process': [1]}


Now this process can be repeated for all entries in the database

In [0]:
def index_it(entry, index):
  words = entry.text.split()
  ID = entry.ID
  for word in words:
    if word in index.keys():
      if ID not in index[word]:
        index[word].append(ID)
    else:
      index[word] = [ID]
  return index

In [0]:
ind = index_it(entry=entry, index= {})
print(ind)

{'benjamin': [1], 'netanyahu': [1], 'israel': [1], 'pm': [1], 'charge': [1], 'corruption': [1], 'prime': [1], 'minister': [1], 'described': [1], 'attempt': [1], 'coup': [1], 'blame': [1], 'taint': [1], 'process': [1]}
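The check `ID not in index[word]` scans a list, which gets slower as posting lists grow. An alternative sketch, not the approach used in the rest of this notebook, keeps posting lists as sets while building and converts them to sorted lists at the end. The names `index_docs` and `to_posting_lists` and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

def index_docs(docs):
    """Build {word: set_of_IDs} from an iterable of (ID, text) pairs."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for word in text.split():
            index[word].add(doc_id)  # a set ignores repeated IDs
    return index

def to_posting_lists(index):
    """Convert sets to sorted lists, e.g. before saving as JSON."""
    return {word: sorted(ids) for word, ids in index.items()}

docs = [(1, "israel pm charge corruption"), (2, "uk pm visit israel")]
postings = to_posting_lists(index_docs(docs))
print(postings["pm"], postings["uk"])  # [1, 2] [2]
```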


Again we can iterate over all entries in the database of scraped articles, process them and append them to the index.

In [0]:
def index_all(df, index):
  for i in range(len(df)):
    entry = df.loc[i,:]
    index = index_it(entry = entry, index = index)
  return index

In [0]:
index = index_all(meta_processed, index = {})
len(index)

383

Finally we wrap everything in one nice function.

In [0]:
def build_index(df, index):
    to_add = transform_df(df)
    index = index_all(df = to_add, index = index)
    return index

In [0]:
idx = build_index(df = meta_data, index = {})

In [0]:
len(idx)

383

And for future use we save the index to a json file.

In [0]:
import json

with open('index.json', 'w') as fp:
    json.dump(idx, fp, sort_keys=True, indent=4)

It can of course be loaded again with the following code.

In [0]:
with open('index.json', 'r') as f:
    data = json.load(f)
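Saving and loading is also what makes it possible to update an existing index instead of rebuilding it. Below is a small sketch of that round trip; the helpers `load_index` and `save_index` are hypothetical, and the demo writes to a temporary directory so no real index file is touched.

```python
import json
import os
import tempfile

def load_index(path):
    """Return the saved index, or an empty one if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r") as f:
        return json.load(f)

def save_index(index, path):
    with open(path, "w") as f:
        json.dump(index, f, sort_keys=True, indent=4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    index = load_index(path)                 # first run: {}
    index.setdefault("pm", []).append(1)     # pretend article 1 was indexed
    save_index(index, path)

    index = load_index(path)                 # later run: previous state
    index.setdefault("pm", []).append(2)     # pretend article 2 was indexed
    save_index(index, path)
    reloaded = load_index(path)

print(reloaded)  # {'pm': [1, 2]}
```

This works without conversion because the index keys are strings and the posting lists contain plain integers, both of which JSON stores directly.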

## B.b Ranked retrieval
The user would probably prefer the more relevant pages to be displayed before those that are less relevant (hopefully they're at least a bit relevant).
For our search engine to support this, we need to store some information about the scraped documents that can later be used for ranking.
We'll use averaged word2vec for this purpose. A word2vec model is a neural network with a single hidden layer, and the hidden layer is what makes the model so useful: given a word, the layer's activation gives a vector representing that word. For each document we can iterate over all its words, extract their vectors and average them to obtain a document vector.
Compared to other methods, averaged word2vec has several advantages. Unlike simpler methods such as bag-of-words, n-grams and tf-idf, the size of the vectors is fixed. For example, bag-of-words also uses vectors, but their size equals the number of unique words in the corpus, so the computational and storage requirements grow as the corpus grows.
Averaged word2vec is also able to represent documents on a more abstract level than the simpler methods and should therefore provide a better basis for ranking.
We're using word2vec rather than doc2vec because we can simply use a pretrained word2vec model to compute the document vectors. Using doc2vec would mean training a neural network from scratch, which requires computational power, time and a rather large dataset.
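To make the averaging step concrete, here is a toy sketch with made-up 3-dimensional vectors (real word2vec vectors from the pretrained model have 300 dimensions). The dictionary `toy_vectors` and the function `average_toy_vectors` are illustrative assumptions; words without a vector are simply skipped.

```python
# Made-up 3-dimensional "embeddings", purely for illustration.
toy_vectors = {
    "prime":    [1.0, 0.0, 2.0],
    "minister": [3.0, 2.0, 0.0],
    "charge":   [2.0, 4.0, 1.0],
}

def average_toy_vectors(vectors, words, size=3):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * size  # no known word at all
    # zip(*known) groups the values dimension by dimension
    return [sum(column) / len(known) for column in zip(*known)]

doc = ["prime", "minister", "charge", "netanyahu"]  # last word has no vector
print(average_toy_vectors(toy_vectors, doc))  # [2.0, 2.0, 1.0]
```

Whatever the length of the document, the result always has the same number of dimensions, which is the fixed-size property described above.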

Import and download pretrained word2vec model

In [0]:
import gensim
import numpy as np
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

--2019-11-21 19:10:30--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.38.22
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.38.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2019-11-21 19:11:05 (44.7 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



Load word2vec model

In [0]:
word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)



Try getting vectors for all words in the text and averaging to get single vector.

In [0]:
print(words)

['benjamin', 'netanyahu', 'israel', 'pm', 'charge', 'corruption', 'prime', 'minister', 'described', 'charge', 'attempt', 'coup', 'blame', 'taint', 'process']


In [0]:
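def average_vectors(word2vec_model, doc):
    # remove out-of-vocabulary words
    # (`word in model` works across gensim versions; `.vocab` was removed in gensim 4)
    doc = [word for word in doc if word in word2vec_model]
    if len(doc) == 0:
        # no known word: fall back to a zero vector of matching size
        return np.zeros(word2vec_model.vector_size)
    else:
        return np.mean(word2vec_model[doc], axis=0)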

In [0]:
%time test_vec = average_vectors(word2vec, words)

CPU times: user 967 µs, sys: 92 µs, total: 1.06 ms
Wall time: 823 µs


Now we can iterate over documents, compute their vectors and construct a document vectors database.

In [0]:
def prepare_ranking(df):
  corpus = df[['ID', 'text']].copy()
  doc_vecs = {}
  for i in range(len(corpus)):
    row = corpus.loc[i,:]
    text = row.text.split()
    doc_vecs[row.ID]=average_vectors(word2vec, text)
  doc_vecs = pd.DataFrame.from_dict(data=doc_vecs, orient="index")
  doc_vecs['ID'] = doc_vecs.index
  return doc_vecs

In [0]:
doc_vecs = prepare_ranking(df=meta_data)

## C. Query processor
The final part of a search engine is a query processor, which actually performs the search. Given a user's query, the processor should return a list of relevant documents.
There are multiple types of queries. We'll start with a simple "google-ish" query where we assume the user looks for documents relevant to all words in the query. Therefore we transform the query to a boolean one by connecting all words with the AND operator.

First, the processor preprocesses the query the same way as the indexer preprocessed the text. In other words, we normalize the query to match the format of the text in the index. Next, the query is split into single words. We check in the index whether these words are part of the vocabulary. If a word is in the index, we retrieve its posting list. Finally, we compute the intersection of all retrieved posting lists. The result is the list of document IDs that the user asked for.
However, we need to return something more useful than just a list of IDs. Therefore, we retrieve the information stored about the documents in the meta-data database. Before printing the results we should also rank the documents by their relevance to the query.
Optionally, the user may ask for news only from a limited time window, e.g. published today or last week. In that case we need to filter the retrieved documents.

-----
To implement:
 - Boolean query
 - phrase matching
 -----


### Normalize query

Let's define an example query.

In [0]:
test = "Trump Ukraine China"

Now we use the `process_string` function used by the indexer to normalize the query.

In [0]:
print("User query: {}." .format(test))
test_norm = process_string(test)
print("Normalized query: {}." .format(test_norm))

User query: Trump Ukraine China.
Normalized query: trump ukraine china .


Now we split the query into words.

In [0]:
test_split = test_norm.split()

And we wrap this in function

In [0]:
def process_query(query):
  norm = process_string(query)
  return norm.split()

### Retrieve from index

And we iterate over the words, looking if they're in the index vocabulary. If so then we retrieve the associated posting list.

In [0]:
retrieved = []
for word in test_split:
  if word in index.keys():
    retrieved.append(index[word])

Now we look for the intersection of all posting lists.

In [0]:
def lists_intersection(lists):
  intersect = list(set.intersection(*map(set, lists)))
  intersect.sort()
  return intersect

In [0]:
result = lists_intersection(retrieved)
print(result)

[9, 14]


Let's wrap this part in a function before proceeding to formatting the results. The additional if statement handles the case when nothing is retrieved.

In [0]:
def search_googleish(query, index=idx):
  query_split = process_query(query)
  retrieved = []
  for word in query_split:
    if word in index: #words missing from the vocabulary are skipped
      retrieved.append(index[word])
  if len(retrieved)>0:
    result = lists_intersection(retrieved)
  else:
    result = [0] #placeholder ID that matches no document (IDs start at 1)
  return result

In [0]:
result_IDs = search_googleish("Trump", index)
print(result_IDs)

[0]
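Because `search_googleish` skips query words that are missing from the index, a query with an unknown word is answered as if that word had not been typed. A stricter reading of the AND operator would return nothing in that case. The sketch below shows that variant on a made-up index; `search_all_terms` and `toy_index` are hypothetical names, and the terms are assumed to be already normalized.

```python
def search_all_terms(terms, index):
    """Return sorted IDs of documents containing every term (strict AND)."""
    postings = []
    for term in terms:
        if term not in index:
            return []          # one unknown term makes the AND query empty
        postings.append(set(index[term]))
    if not postings:
        return []              # empty query
    return sorted(set.intersection(*postings))

toy_index = {"trump": [9, 12, 14], "ukraine": [9, 14, 20], "china": [3, 9, 14]}
print(search_all_terms(["trump", "ukraine", "china"], toy_index))  # [9, 14]
print(search_all_terms(["trump", "brexit"], toy_index))            # []
```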


-----

*TO DO:
If there's no document retrieved, try removing one term and looking for simplified query + tell user that such document doesn't include term X.*

-----
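One way the to-do above could be approached is sketched here, under the assumption that query terms are already normalized. It tries the full query first and then every query with exactly one term removed, reporting which term was dropped. The function names and the toy index are made up for illustration.

```python
def intersect(terms, index):
    """IDs of documents that contain every term; [] if any term is unknown."""
    if not terms or any(t not in index for t in terms):
        return []
    return sorted(set.intersection(*(set(index[t]) for t in terms)))

def search_with_relaxation(terms, index):
    """Try the full query, then every query with exactly one term removed.

    Returns (ids, dropped_term); dropped_term is None when nothing was dropped.
    """
    ids = intersect(terms, index)
    if ids or len(terms) < 2:
        return ids, None
    for i, dropped in enumerate(terms):
        ids = intersect(terms[:i] + terms[i + 1:], index)
        if ids:
            return ids, dropped
    return [], None

toy_index = {"trump": [9, 14], "ukraine": [9, 14, 20], "brexit": [2, 7]}
ids, dropped = search_with_relaxation(["trump", "ukraine", "brexit"], toy_index)
print(ids, dropped)  # [9, 14] brexit
```

The returned `dropped` term is what the search function could show to the user, e.g. "no results include brexit".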

### Retrieve meta-data
Now we need to connect the retrieved IDs with the useful information stored in the database. We first use it to refine the results and then to print a nice result for the user.

In [0]:
#in real setting we'll read the database from file here
#meta = pd.read_csv("database.csv")

#this is our database
meta = meta_data.drop(['text'], axis=1).copy()
meta.head(5)

Unnamed: 0,ID,title,summary,link,published
0,1,benjamin netanyahu israel pm charge corruption,prime minister described charge attempt coup b...,https://www.bbc.co.uk/news/world-middle-east-5...,21/11/2019
1,2,uk repatriate first citizen northeastern syria,individual return uk area formerly control com...,https://www.bbc.co.uk/news/uk-50506909,21/11/2019
2,3,russia ban sale gadget without russianmade sof...,supporter say law new sale promotes russian te...,https://www.bbc.co.uk/news/world-europe-50507849,21/11/2019
3,4,dr congo measles nearly 5000 dead major outbreak,almost 250000 people democratic republic congo...,https://www.bbc.co.uk/news/world-africa-50506743,21/11/2019
4,5,australia fire sea fire race across field near...,bushfires hit area near adelaide create smoky ...,https://www.bbc.co.uk/news/world-australia-504...,21/11/2019


Query from database to get only rows of retrieved IDs

In [0]:
def connect_id_df(retrieved_id, df):
    return df[df.ID.isin(retrieved_id)].reset_index(drop=True)

In [0]:
result_meta = connect_id_df(result_IDs, meta)
result_meta.head(5)

Unnamed: 0,ID,title,summary,link,published


### Ranked retrieval
Now we return to the word2vec vectors we computed after indexing the documents.
We'll compute the vector for the query as well and then use cosine similarity to measure how relevant each retrieved document is to the query.

Compute vector for query

In [0]:
query_vec = average_vectors(word2vec, test_split)

Retrieve the vectors of the retrieved documents.

In [0]:
result_vecs = connect_id_df(result_IDs, doc_vecs)

Compute cosine similarity between retrieved documents and query

In [0]:
def cos_similarity(a, b):
  dot = np.dot(a, b)
  norma = np.linalg.norm(a)
  normb = np.linalg.norm(b)
  if norma == 0 or normb == 0:
    return 0.0 #zero vector (no known words): treat as not similar
  return dot / (norma * normb)
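For reference, the quantity computed by this function is the standard cosine similarity between the query vector $a$ and a document vector $b$:

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
$$

As a small worked example, for $a = (1, 0)$ and $b = (1, 1)$ we get $a \cdot b = 1$, $\lVert a \rVert = 1$ and $\lVert b \rVert = \sqrt{2}$, so $\cos(a, b) = 1/\sqrt{2} \approx 0.707$. The value ranges from $-1$ to $1$, with higher values meaning the vectors point in more similar directions. It is undefined when either vector is all zeros, which is what `average_vectors` returns when none of the words are in the word2vec vocabulary.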

In [0]:
cos_sim = []
for i in range(len(result_vecs)):
  doc_vec = result_vecs.loc[i,:].drop(['ID'])
  cos_sim.append(cos_similarity(doc_vec, query_vec))
result_meta['rank'] = cos_sim

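Sort the retrieved docs by cosine similarity, which is a proxy for relevance. The most similar documents should come first, so we sort in descending order.

In [0]:
result_meta.sort_values('rank', axis=0, ascending=False)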

Unnamed: 0,ID,title,summary,link,published,rank


Wrap this in function

In [0]:
def rank_results(query, results):
  query_norm = process_query(query)
  query_vec = average_vectors(word2vec, query_norm)
  result_vecs = connect_id_df(results.ID, doc_vecs)
  cos_sim = []
  for i in range(len(result_vecs)):
    doc_vec = result_vecs.loc[i,:].drop(['ID'])
    cos_sim.append(cos_similarity(doc_vec, query_vec))
  results['rank'] = cos_sim
  results = results.sort_values('rank', axis=0, ascending=False) #most relevant first
  return results

In [0]:
final_result = rank_results("Trump", result_meta)

### Date filtering
The user might ask for news from a specific day or date range.
For simplicity let's assume the user enters dates in the format day/month/year.
Now we can define 2 types of date restrictions:
* single day
* date range - from X/X/X - to X/X/X

Another option is restricting the results to 'today'.



Restricting to a single day

In [0]:
test = "16/11/2019"

#get news published on "test"
results_single = result_meta[result_meta.published==test].reset_index(drop=True)
results_single.head()

Unnamed: 0,ID,title,summary,link,published,rank


Restricting to "today"

In [0]:
#get today's date
from datetime import date, timedelta

def get_today():
  today = date.today()
  today = today.strftime("%d/%m/%Y")
  return [today]

results_today = result_meta[result_meta.published.isin(get_today())].reset_index(drop=True)
results_today.head()

Unnamed: 0,ID,title,summary,link,published,rank


In [0]:
get_today()

['21/11/2019']

Restricting to time interval

In [0]:
def daterange(start, end):
    for n in range(int ((end - start).days)+1):
        yield start + timedelta(n)

def format_date(dt):
  dt = dt.split("/")
  dt = date(int(dt[2]), int(dt[1]), int(dt[0]))
  return(dt)

def date_interval(interval):
  interval = interval.split("-")
  start = format_date(interval[0])
  end = format_date(interval[1])
  interval = []
  for dt in daterange(start, end):
      interval.append(dt.strftime("%d/%m/%Y"))
  return interval

In [0]:
date_interval("15/11/2019 - 16/11/2019")

['15/11/2019', '16/11/2019']
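As an alternative to splitting the date string by hand, `datetime.strptime` can parse and validate the input in one step. This is a sketch rather than a replacement for the functions above; `parse_day` and `days_between` are hypothetical names. Like `date_interval`, it returns zero-padded days, so the stored `published` values need to use the same format for the comparison to match.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%Y"

def parse_day(text):
    """Parse 'day/month/year'; raises ValueError for malformed or impossible dates."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()

def days_between(interval):
    """Expand 'start - end' into a list of formatted days, both ends included."""
    start_text, end_text = interval.split("-")
    start, end = parse_day(start_text), parse_day(end_text)
    n_days = (end - start).days + 1
    return [(start + timedelta(days=n)).strftime(DATE_FORMAT) for n in range(n_days)]

print(days_between("30/11/2019 - 2/12/2019"))
# ['30/11/2019', '01/12/2019', '02/12/2019']
```

An impossible date such as `31/11/2019` raises a `ValueError` here, which the search function could catch to show the user a helpful message.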

In [0]:
s = "16/11/2019"
len(s)

10

We also need to decide on the format in which the user tells the search engine about the date restriction.
Let's pass such a request as a new argument of the search function. We can then either simply ask the user for two separate inputs or find a more elegant solution.

In [0]:
def filter_date(dat, df):
  dat = dat.strip()
  if dat == "today":
    dates = get_today()
  elif "-" in dat: #interval "day/month/year - day/month/year"
    dates = date_interval(dat)
  else: #single day
    dates = [dat]

  result = df[df.published.isin(dates)].reset_index(drop=True)
  return result

In [0]:
dat = "today"
if dat == "today": #compare strings with ==, not `is`
  get_today()

### Print results to user

In [0]:
def print_results(result_df):
  for i in range(len(result_df)):
    res = result_df.iloc[i] #positional access keeps the ranked order
    print(res.title)
    print(res.summary)
    if i == len(result_df) - 1: #no blank line after the last result
        print(res.link)
    else:
        print("{}\n" .format(res.link))

In [0]:
print_results(final_result)

### Put it all together

In [0]:
def search(query, date=None):
  result = search_googleish(query)
  result = connect_id_df(result, meta)
  result = rank_results(query, result)

  if date is not None:
    result = filter_date(date, result)

  print_results(result)

In [0]:
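query = input("Search for:")
#not named `date`: that would shadow datetime.date, which get_today() uses
date_filter = input("Date:").strip()
search(query, date_filter or None) #empty input means no date restriction

Search for:pm
Date:today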

### Automated crawling every hour
Now we want the crawler to scrape new articles every hour by itself. Probably the simplest and most pythonic solution is a while loop and time.sleep().

First we create a list of feeds that we want to scrape.

In [0]:
URLS = ['http://feeds.bbci.co.uk/news/world/rss.xml', 
        'http://feeds.bbci.co.uk/news/uk/rss.xml', 
        'http://www.independent.co.uk/news/uk/rss', 
        'feed:https://rss.nytimes.com/services/xml/rss/nyt/World.xml', 
        'feed:https://rss.nytimes.com/services/xml/rss/nyt/US.xml', 
        'feed://feeds.washingtonpost.com/rss/world?tid=lk_inline_manual_13']

Now we iterate over these URLs, scrape the data and update the index. NOTE that this is pseudo-code; the function definitions are stored in their respective scripts.

In [6]:
for url in URLS:
  print ("Crawling {}" .format(url))
  #perc, added = crawl(URL=url, PATH=OUTPUT_DIR)
  print("XXX % of entries were already scraped.\n")
  #update_index_vecs(df=added, index_path=INDEX_PATH, vec_path=VECTOR_PATH)

Crawling http://feeds.bbci.co.uk/news/world/rss.xml
XXX % of entries were already scraped.

Crawling http://feeds.bbci.co.uk/news/uk/rss.xml
XXX % of entries were already scraped.

Crawling http://www.independent.co.uk/news/uk/rss
XXX % of entries were already scraped.

Crawling feed:https://rss.nytimes.com/services/xml/rss/nyt/World.xml
XXX % of entries were already scraped.

Crawling feed:https://rss.nytimes.com/services/xml/rss/nyt/US.xml
XXX % of entries were already scraped.

Crawling feed://feeds.washingtonpost.com/rss/world?tid=lk_inline_manual_13
XXX % of entries were already scraped.



And we put this loop in a while loop. After the for loop is finished we pause the code execution for an hour. Then the code starts from the top again.
Such a while loop would run forever (literally, don't try!!!) so I'm leaving it commented out here.

In [0]:
import time

#while True:
#  for url in URLS:
#    print ("Crawling {}" .format(url))
#    #perc, added = crawl(URL=url, PATH=OUTPUT_DIR)
#    print("XXX % of entries were already scraped.\n")
#    #update_index_vecs(df=added, index_path=INDEX_PATH, vec_path=VECTOR_PATH)
#
#  #pause for an hour
#  time.sleep(3600)

In a real setting you'll need some stopping condition for this while loop. For example, letting it run for a day can be done as follows (also not recommended to run).

In [0]:
hours = 0

while hours < 24:
  #for url in URLS:
  #  print ("Crawling {}" .format(url))
  #  #perc, added = crawl(URL=url, PATH=OUTPUT_DIR)
  #  print("XXX % of entries were already scraped.\n")
  #  #update_index_vecs(df=added, index_path=INDEX_PATH, vec_path=VECTOR_PATH)

  #pause for an hour
  #time.sleep(3600)
  hours += 1
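The bounded loop can also be wrapped in a function whose pause is injectable, which makes it possible to try the control flow without actually waiting for hours. This is a sketch under stated assumptions: `run_crawler` is a made-up name, and `crawl_feed` stands for any callable that takes a URL (for example a wrapper around the crawl and index update steps, which are not defined in this notebook).

```python
import time

def run_crawler(urls, crawl_feed, rounds=24, pause_seconds=3600, sleep=time.sleep):
    """Crawl every URL once per round, pausing between rounds.

    crawl_feed is any callable taking a URL; sleep is injectable for testing.
    """
    for round_number in range(rounds):
        for url in urls:
            try:
                crawl_feed(url)
            except Exception as error:  # one failing feed should not stop the loop
                print("Crawling {} failed: {}".format(url, error))
        if round_number < rounds - 1:   # no need to wait after the last round
            sleep(pause_seconds)

# Dry run with stand-ins: record the calls instead of crawling or waiting.
crawled, pauses = [], []
run_crawler(["feed-a", "feed-b"], crawl_feed=crawled.append,
            rounds=3, sleep=pauses.append)
print(len(crawled), pauses)  # 6 [3600, 3600]
```

With the default `sleep=time.sleep` and `rounds=24` the function behaves like the one-day loop above, so the same caution about running it applies.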