<a href="https://colab.research.google.com/github/ayomibamm/Vertical-Search-Engine/blob/main/Vertical_Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **robots.txt check**

https://pureportal.coventry.ac.uk/robots.txt:

- User-Agent: *

- Crawl-Delay: 5

- Disallow: /*?*format=rss

- Disallow: /*?*export=xls

- Sitemap: https://pureportal.coventry.ac.uk/sitemap.xml
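
The rules listed above can also be checked offline, without fetching anything. The following is a minimal sketch, assuming the directives are exactly as listed; the `rules` list and `sample_url` are illustrative names and not part of the notebook. As far as I am aware, the standard-library parser matches `Disallow` paths by prefix and does not expand `*` wildcards, so treat the `can_fetch` result as indicative only.

```python
import urllib.robotparser

# robots.txt directives copied from the list above
rules = [
    "User-Agent: *",
    "Crawl-Delay: 5",
    "Disallow: /*?*format=rss",
    "Disallow: /*?*export=xls",
    "Sitemap: https://pureportal.coventry.ac.uk/sitemap.xml",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)  # parse in-memory lines instead of fetching over the network

# hypothetical URL used only to exercise the parser
sample_url = "https://pureportal.coventry.ac.uk/en/publications/"

delay = parser.crawl_delay("*")              # seconds to wait between requests
allowed = parser.can_fetch("*", sample_url)  # True if no Disallow rule matches
print(f"crawl delay: {delay}, allowed: {allowed}")
```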

## **Library imports**

In [None]:
import json
import string
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup as bs
import requests
from collections import Counter
import urllib.robotparser
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# RobotFileParser reads the site's robots.txt, which is served from the site root
rule_preservation = urllib.robotparser.RobotFileParser()
rule_preservation.set_url("https://pureportal.coventry.ac.uk/robots.txt")
rule_preservation.read()

# Check that the publications listing to be crawled may be fetched
url = "https://pureportal.coventry.ac.uk/en/organisations/research-centre-for-computational-science-and-mathematical-modell/publications/"
if rule_preservation.can_fetch("*", url):
    print("Website crawling allowed")
else:
    print("Website crawling disallowed")

Website crawling allowed


## **Web crawler processing**

In [None]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}

data =[]
aut_l = []

def getInfo(page):
    """Scrape one listing page and append each publication's details to `data`."""
    url = f'https://pureportal.coventry.ac.uk/en/organisations/research-centre-for-computational-science-and-mathematical-modell/publications/?page={page}&pagesize=50'
    response = requests.get(url, headers=headers)
    soup = bs(response.text, "html.parser")
    box = soup.find_all('h3', class_='title')

    for dom in box:
        pub = {}
        link = dom.find('a', class_='link')
        if link is None:  # skip headings that carry no publication link
            continue
        title = dom.get_text().strip().lower()  # publication title
        links = link.get('href')  # publication link

        # fetch the publication page to extract further information
        html2 = requests.get(links, headers=headers)
        soup2 = bs(html2.text, 'html.parser')

        pub_abstract = soup2.find('div', class_='textblock')  # publication abstract

        if pub_abstract:
            abstracts = pub_abstract.text.strip()
            abstract = abstracts.partition('.')[0] + '...'  # keep only the first sentence of the abstract
        else:
            abstract = "not available"  # some publications have no abstract

        aut_details = soup2.find_all('a', class_='link person')
        aut_data = []
        for author in aut_details:
            aut_name = author.get_text().strip()  # author name
            if aut_name:
                aut_links = author.get('href')  # author profile link
                aut_l.append(aut_links)
                aut_data.append(f'{aut_name}, {aut_links}')

        # guard against publication pages that have no date element
        date_tag = soup2.find('span', class_='date')
        pub_date = date_tag.get_text().strip() if date_tag else "not available"

        if aut_data:  # keep the publication only if at least one author has both a name and a profile link
            pub['Publication title'] = title
            pub['Link'] = links
            pub['Abstract'] = abstract
            pub['Date'] = pub_date
            pub['Author information'] = aut_data
            data.append(pub)

# crawl the first five listing pages (pages 0-4), pausing 5 seconds between listing pages
for x in range(0,5):
    getInfo(x)
    sleep(5)

print(f'Total number of publication details with at least 1 co-author full details: {len(data)}')
print(f'Total number of publications per staff: {Counter(aut_l)}')


Total number of publication details with at least 1 co-author full details: 202
Total number of publications per staff: Counter({'https://pureportal.coventry.ac.uk/en/persons/vasile-palade': 47, 'https://pureportal.coventry.ac.uk/en/persons/alireza-daneshkhah': 37, 'https://pureportal.coventry.ac.uk/en/persons/alison-halford': 22, 'https://pureportal.coventry.ac.uk/en/persons/elena-gaura': 19, 'https://pureportal.coventry.ac.uk/en/persons/fei-he': 17, 'https://pureportal.coventry.ac.uk/en/persons/matthew-england': 17, 'https://pureportal.coventry.ac.uk/en/persons/james-brusey': 13, 'https://pureportal.coventry.ac.uk/en/persons/amirhosein-sadeghimanesh-sadeghi-manesh': 11, 'https://pureportal.coventry.ac.uk/en/persons/jonathan-nixon': 11, 'https://pureportal.coventry.ac.uk/en/persons/sariya-cheruvallil-contractor': 9, 'https://pureportal.coventry.ac.uk/en/persons/xiaorui-jiang': 8, 'https://pureportal.coventry.ac.uk/en/persons/omid-chatrabgoun': 8, 'https://pureportal.coventry.ac.uk/en/
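
The crawl loop pauses between listing pages, while `Crawl-Delay: 5` in robots.txt is usually read as a minimum gap between any two requests. One way to enforce that gap is a small helper that tracks the time of the last request. This is a sketch under that assumption, not the notebook's method; `CrawlDelay`, `clock` and `sleeper` are hypothetical names introduced here.

```python
import time

class CrawlDelay:
    """Blocks so that consecutive calls to wait() are at least `seconds` apart."""

    def __init__(self, seconds, clock=time.monotonic, sleeper=time.sleep):
        self.seconds = seconds
        self.clock = clock        # injectable so the helper can be tested without real waiting
        self.sleeper = sleeper
        self.last_request = None  # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last_request is not None:
            remaining = self.seconds - (now - self.last_request)
            if remaining > 0:
                self.sleeper(remaining)  # sleep only for the part of the delay still outstanding
                now = self.clock()
        self.last_request = now
        return now
```

With `delay = CrawlDelay(5)`, calling `delay.wait()` immediately before each `requests.get` call would space out both the listing-page and the publication-page requests, at the cost of a much longer crawl.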

In [None]:
data

[{'Publication title': 'an extended plant circadian clock model for characterising flowering time under different light quality conditions',
  'Link': 'https://pureportal.coventry.ac.uk/en/publications/an-extended-plant-circadian-clock-model-for-characterising-flower',
  'Abstract': 'Speed breeding has recently emerged as an innovative agricultural technology solution to meet the ever-increasing global food demand...',
  'Date': '9 Jan 2023',
  'Author information': ['Miao Lin Pay, https://pureportal.coventry.ac.uk/en/persons/miao-lin-pay',
   'Jesper Christensen, https://pureportal.coventry.ac.uk/en/persons/jesper-christensen',
   'Fei He, https://pureportal.coventry.ac.uk/en/persons/fei-he',
   'Laura Roden, https://pureportal.coventry.ac.uk/en/persons/laura-roden']},
 {'Publication title': 'challenges and prospects of climate change impact assessment on mangrove environments through mathematical models',
  'Link': 'https://pureportal.coventry.ac.uk/en/publications/challenges-and-pro

## **Inverted Index implementation**

In [None]:
# save the crawled data as a JSON file
with open('publication.json', 'w') as f:
    json.dump(data, f)

publications = 'publication.json'

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # built once rather than once per token

def data_processing(path):
    """Build an inverted index over the stemmed terms of the publication titles."""
    inverted_index = {}
    cleaned_titles = []
    with open(path, 'r') as f:
        publications = json.load(f)

    data_titles = [publication['Publication title'] for publication in publications]

    # pre-processing: tokenise, lowercase, drop punctuation and stop words, stem
    for doc_id, title in enumerate(data_titles):
        tokenised_title = word_tokenize(title)
        tokenised_lower = [w.lower() for w in tokenised_title if w.isalnum()]
        stopped_title = [w for w in tokenised_lower if w not in stop_words]
        stemmed_title = [ps.stem(w) for w in stopped_title]
        cleaned_titles.append(stemmed_title)

        # each term maps to [document frequency, postings list of doc ids]
        for stemmed in stemmed_title:
            if stemmed not in inverted_index:
                inverted_index[stemmed] = [1, [doc_id]]
            else:
                count = inverted_index[stemmed]
                if doc_id not in count[1]:
                    count[1].append(doc_id)
                    count[0] += 1

    return inverted_index, cleaned_titles, data_titles

inverted_index, cleaned_titles, data_titles = data_processing(publications)
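
The same index structure can be illustrated on a handful of titles without NLTK. The sketch below assumes the titles are already stemmed (the three strings are taken from the stemmed titles this notebook prints) and uses a `defaultdict` of sets; `build_index` and the local 0-2 doc ids are illustrative only and do not match the notebook's ids.

```python
from collections import defaultdict

# stemmed titles as printed by the notebook; ids here are local to this sketch
docs = [
    "extend plant circadian clock model characteris flower time differ light qualiti condit",
    "challeng prospect climat chang impact assess mangrov environ mathemat model",
    "model mangrov impos tidal wave use finit element discontinu galerkin method",
]

def build_index(stemmed_docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(stemmed_docs):
        for term in text.split():
            postings[term].add(doc_id)  # a set ignores repeated terms within a title
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index(docs)
print(index["model"])    # the document frequency is len(index["model"])
print(index["mangrov"])
```

Storing only the postings list and deriving the document frequency with `len` avoids keeping the two values in sync by hand.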

In [None]:
inverted_index

{'extend': [1, [0]],
 'plant': [2, [0, 32]],
 'circadian': [1, [0]],
 'clock': [1, [0]],
 'model': [17,
  [0, 1, 6, 32, 34, 38, 40, 45, 46, 86, 95, 98, 99, 121, 148, 149, 174]],
 'characteris': [3, [0, 26, 107]],
 'flower': [1, [0]],
 'time': [4, [0, 25, 102, 175]],
 'differ': [3, [0, 180, 197]],
 'light': [3, [0, 124, 171]],
 'qualiti': [1, [0]],
 'condit': [1, [0]],
 'challeng': [5, [1, 59, 126, 145, 199]],
 'prospect': [1, [1]],
 'climat': [1, [1]],
 'chang': [2, [1, 25]],
 'impact': [3, [1, 60, 179]],
 'assess': [5, [1, 22, 99, 164, 168]],
 'mangrov': [3, [1, 6, 86]],
 'environ': [6, [1, 86, 126, 127, 153, 168]],
 'mathemat': [2, [1, 38]],
 'diabet': [5, [2, 38, 40, 80, 148]],
 'retinopathi': [2, [2, 80]],
 'detect': [8, [2, 13, 31, 33, 82, 83, 92, 136]],
 'use': [33,
  [2,
   5,
   6,
   24,
   33,
   38,
   40,
   41,
   49,
   66,
   71,
   82,
   83,
   84,
   85,
   86,
   87,
   105,
   110,
   113,
   114,
   118,
   122,
   126,
   128,
   141,
   143,
   152,
   162,
   16

### **Example of implementation**

In [None]:
# example use of the index, limited to queries of at most two terms

flatten_list = [item for items in cleaned_titles for item in items]

query = 'gaussian distribution'
token_query = word_tokenize(query)
token_lower = [w.lower() for w in token_query if w.isalnum()]
stopped_query = [w for w in token_lower if w not in stopwords.words('english')]
stemmed = [ps.stem(w) for w in stopped_query]
print(stemmed)

if len(stemmed) == 1 and (stemmed[-1] in flatten_list):
    infor = inverted_index[stemmed[-1]]
    post_list = infor[1]
    print(f'{stemmed[-1]} present in these docid :{post_list}')

elif len(stemmed) == 2:
    result =  []
    for word in stemmed:
        if word in flatten_list:
            infor = inverted_index[word]
            lists = infor[1]
            print(f'{word} is present in these docid: {lists}')
            result.append(lists)

        else:
            print(f'{word} is not present in any publication')

    if len(result) > 1:
        post_list = sorted(set.intersection(*map(set, result)))
        print(f'both words are present in {post_list}')
    elif result:
        # only one term was found, so fall back to its postings list
        post_list = result[0]
        print(post_list)

['gaussian', 'distribut']
gaussian is present in these docid: [17, 45, 60, 75, 141]
distribut is present in these docid: [17, 46, 52, 60, 71, 88, 102, 116, 194]
both words are present in [17, 60]
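
The two-term limit is not inherent to the approach: intersecting postings lists works for any number of terms. A minimal sketch, assuming an index that maps each stemmed term directly to its postings list (the notebook's index also stores a count alongside it); `and_query` is a hypothetical helper, and the postings are copied from the output above.

```python
def and_query(index, terms):
    """Return the sorted doc ids that contain every term (empty if any term is missing)."""
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# postings copied from the example output above
index = {
    "gaussian": [17, 45, 60, 75, 141],
    "distribut": [17, 46, 52, 60, 71, 88, 102, 116, 194],
}
print(and_query(index, ["gaussian", "distribut"]))
```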


## **Implementation of Query Processor**

In [None]:
# doc ids for the query processor start at 1 (the inverted index above numbers documents from 0)
docid = list(range(1, len(data) + 1))

# dictionary keyed by doc id, with each publication's details as the nested dictionary;
# the query processor looks its results up here
data_dict = dict(zip(docid, data))

# convert data to a data frame with docid as index to help visualise each data better
df = pd.DataFrame(data)
df['doc_id'] = docid
df = df.set_index('doc_id')

data_dict

{1: {'Publication title': 'an extended plant circadian clock model for characterising flowering time under different light quality conditions',
  'Link': 'https://pureportal.coventry.ac.uk/en/publications/an-extended-plant-circadian-clock-model-for-characterising-flower',
  'Abstract': 'Speed breeding has recently emerged as an innovative agricultural technology solution to meet the ever-increasing global food demand...',
  'Date': '9 Jan 2023',
  'Author information': ['Miao Lin Pay, https://pureportal.coventry.ac.uk/en/persons/miao-lin-pay',
   'Jesper Christensen, https://pureportal.coventry.ac.uk/en/persons/jesper-christensen',
   'Fei He, https://pureportal.coventry.ac.uk/en/persons/fei-he',
   'Laura Roden, https://pureportal.coventry.ac.uk/en/persons/laura-roden']},
 2: {'Publication title': 'challenges and prospects of climate change impact assessment on mangrove environments through mathematical models',
  'Link': 'https://pureportal.coventry.ac.uk/en/publications/challenges-a

In [None]:
# combining the tokenised words in cleaned titles to form a cleaned sentence for vectorization
filtered_docs = []
for title in cleaned_titles:
    title = ' '.join(title)
    filtered_docs.append(title)
filtered_docs

['extend plant circadian clock model characteris flower time differ light qualiti condit',
 'challeng prospect climat chang impact assess mangrov environ mathemat model',
 'diabet retinopathi detect use transfer reinforc learn effect imag preprocess data augment techniqu',
 'lamarckian particl swarm optim flexibl ligand dock',
 'extract evolutionari backbon scientif domain semant main path network approach base citat context analysi',
 'gene regulatori network infer link predict use graph neural network',
 'model mangrov impos tidal wave use finit element discontinu galerkin method',
 'nonlinear manifold learn eeg function connect analysi applic alzheim diseas',
 'born woman gender robot',
 'posenormnet postur normal 3d bodi scan arbitrari postur',
 'unidirect migrat popul alle effect',
 'xdll explain deep learn local map method vehicl',
 'commun energi manag system smart microgrid',
 'hybrid linear iter cluster bay grabcut segment scheme dynam detect cervic cancer',
 'alloc medic reso

In [None]:
def queryprocessor(query):
    # apply the same pre-processing as the titles: tokenise, lowercase, drop stop words, stem
    tokens = word_tokenize(query.lower())
    stop_words = set(stopwords.words('english'))
    tmp = " ".join(ps.stem(w) for w in tokens if w not in stop_words)

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(filtered_docs)  # TF-IDF vectors of the documents
    query_vec = vectorizer.transform([tmp])  # TF-IDF vector of the query
    results = cosine_similarity(X, query_vec).reshape((-1,))  # similarity of each document to the query
    res = results.tolist()
    similarity_data = {'doc_id': docid, 'similarity_score': res, 'title': data_titles}
    df_similarity = pd.DataFrame(similarity_data)  # dataframe to visualise the similarity of the query to each doc
    df_similarity.sort_values(by=['similarity_score'], inplace=True, ascending=False)  # most similar documents first
    query_relevance = df_similarity[df_similarity['similarity_score'] > 0]  # keep only documents with a non-zero score
    post_lists = query_relevance['doc_id'].tolist()

    # print the publication details for each matching doc id, best match first
    if post_lists:
        for doc in post_lists:
            print()
            for key, value in data_dict[doc].items():
                print(key + ':', value)
    else:
        print('please try searching another term or phrase')

    return post_lists, results, query_relevance

user_input = input('Query: ')
post_lists, results, query_relevance = queryprocessor(user_input)

Query: machine learnin

Publication title: machine learning for computer algebra
Link: https://pureportal.coventry.ac.uk/en/publications/machine-learning-for-computer-algebra
Abstract: not available
Date: 2022
Author information: ['Rashid Barket, https://pureportal.coventry.ac.uk/en/persons/rashid-barket', 'Tereso del Río, https://pureportal.coventry.ac.uk/en/persons/tereso-del-r%C3%ADo-almajano', 'Matthew England, https://pureportal.coventry.ac.uk/en/persons/matthew-england']

Publication title: using machine learning in sc2
Link: https://pureportal.coventry.ac.uk/en/publications/using-machine-learning-in-sc2
Abstract: This talk exposes many possible uses of Machine Learning (ML) in the context of SC2, and how this approach differs from human-made heuristics...
Date: 23 Aug 2022
Author information: ['Tereso del Río, https://pureportal.coventry.ac.uk/en/persons/tereso-del-r%C3%ADo-almajano']

Publication title: sc-square: future progress with machine learning
Link: https://pureportal.c

In [None]:
query_relevance

| | doc_id | similarity_score | title |
|---|---|---|---|
| 51 | 52 | 0.495055 | machine learning for computer algebra |
| 84 | 85 | 0.467051 | using machine learning in sc2 |
| 73 | 74 | 0.416693 | sc-square: future progress with machine learning |
| 177 | 178 | 0.361861 | some computational considerations for kernel-b... |
| 82 | 83 | 0.327064 | using machine learning for anomaly detection o... |
| 83 | 84 | 0.327064 | using machine learning for anomaly detection o... |
| 152 | 153 | 0.303623 | using machine learning algorithms to develop a... |
| 75 | 76 | 0.254152 | stable likelihood computation for machine lear... |
| 156 | 157 | 0.250876 | a leap from randomized to quantum clustering w... |
| 66 | 67 | 0.238679 | predicting primary sequence-based protein-prot... |
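
To make the ranking step less opaque, the scoring can be reproduced in plain Python. The sketch below assumes what I understand to be `TfidfVectorizer`'s defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) but splits on whitespace rather than using scikit-learn's tokeniser, so scores may differ slightly from the library's. The helper names are hypothetical, and the first two documents are hand-stemmed approximations of titles from the results above.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Weight raw term counts by idf, ignoring unseen terms, then L2-normalise."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def tfidf_vectors(docs):
    """Return (document vectors, idf table) for whitespace-tokenised documents."""
    n = len(docs)
    tokenised = [doc.split() for doc in docs]
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [weigh(tokens, idf) for tokens in tokenised], idf

def cosine(a, b):
    """Dot product of two sparse unit-length vectors, which equals their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    "machin learn comput algebra",                 # hand-stemmed approximation
    "use machin learn sc2",                        # hand-stemmed approximation
    "commun energi manag system smart microgrid",  # from the stemmed titles printed earlier
]
vectors, idf = tfidf_vectors(docs)
query = weigh("machin learn".split(), idf)
scores = [cosine(query, v) for v in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

Documents that share no term with the query score exactly zero, which is why the query processor can drop them with the `similarity_score > 0` filter.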
