This notebook builds a crawler that starts from the seed URL given below. <br>
The scraped data is stored in <i>database.csv</i> in the current working directory.<br><br>
seed_url = https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning

# Import Required Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import math
import re
import os
import pandas as pd
import schedule
import time
import datetime

# Check the robots.txt file.

In [2]:
robots_text_page = requests.get("https://pureportal.coventry.ac.uk/en/robots.txt")
print(robots_text_page.text)

User-Agent: *
Crawl-Delay: 5
Disallow: /*?*format=rss
Disallow: /*?*export=xls
Sitemap: https://pureportal.coventry.ac.uk/sitemap.xml


The `Crawl-Delay: 5` directive asks crawlers to wait 5 seconds between requests, so the crawler pauses for 5 seconds after each request it makes.
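The delay does not have to be read off by eye. As a minimal sketch, assuming the robots.txt text printed above, Python's standard `urllib.robotparser` module can extract the `Crawl-Delay` value. The variable names are illustrative and are not used elsewhere in this notebook.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as printed above (copied here so the sketch runs offline)
robots_txt = """User-Agent: *
Crawl-Delay: 5
Disallow: /*?*format=rss
Disallow: /*?*export=xls
Sitemap: https://pureportal.coventry.ac.uk/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# crawl_delay() returns the delay for the given user agent, or None if unset
crawl_delay = parser.crawl_delay("*")
print(crawl_delay)
```

The resulting value could be passed to `time.sleep()` in place of the literal 5, so the crawler would follow the site if the directive changed.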

# Create some helper functions.

### Find total number of pages to crawl

In [3]:
def findTotalPages():
    """
    This function finds the total number of pages to crawl.
    """
    total_research = soup.select_one(".count").get_text()  # gives string
    total_research = int(''.join(filter(str.isdigit, total_research)))  # convert to int

    total_pages = math.ceil(total_research / 50)  # since 50 results in one page
    return total_pages

###### Demonstration

In [4]:
seed_url = "https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning"
page = requests.get(seed_url)
soup = BeautifulSoup(page.text, "html.parser")
findTotalPages()

6
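The page count is a ceiling division of the number of results by the page size. The sketch below isolates that arithmetic so it can be checked without a network request. It assumes the text of the `.count` element contains a single number. The helper `pages_needed` and the sample input `"286"` are hypothetical (286 is the crawl total reported at the end of this notebook).

```python
import math

RESULTS_PER_PAGE = 50  # the portal lists 50 results per page

def pages_needed(count_text, per_page=RESULTS_PER_PAGE):
    """Hypothetical helper: turn the text of the count element into a page count."""
    # Keep only the digits, as findTotalPages() does, then convert to int
    total = int("".join(filter(str.isdigit, count_text)))
    return math.ceil(total / per_page)

print(pages_needed("286"))
```

Because the digit filter concatenates every digit in the string, text that contains more than one number would be misread. This approach relies on the element holding only the total.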

### Create list of urls to crawl

In [5]:
def getPageUrls():
    """
    This function creates the list of urls to crawl.
    """
    total_pages = findTotalPages()
    urls = [(seed_url + "/publications/?page=" + str(page)) for page in range(0, total_pages)]
    return urls

###### Demonstration

In [6]:
getPageUrls()

['https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning/publications/?page=0',
 'https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning/publications/?page=1',
 'https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning/publications/?page=2',
 'https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning/publications/?page=3',
 'https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning/publications/?page=4',
 'https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning/publications/?page=5']

### Create list of CGL profiles

In [7]:
def CGLprofiles():
    """
    This function creates a list of profiles on CGL.
    """
    profiles_link = seed_url + "/persons"
    profile_page = requests.get(profiles_link)
    soup3 = BeautifulSoup(profile_page.text, "html.parser")

    persons_list = soup3.find_all("a", class_="link person")
    profiles = []
    for person in persons_list:
        profiles.append(person.get_text())
    return profiles

###### Demonstration

In [8]:
profiles = CGLprofiles()
profiles

['Sian Alsop',
 'Dimitar Angelov',
 'Rami Ayoubi',
 'Ema Baukaite',
 'Julia Carroll',
 'Jacqueline Cawston',
 'Megan Crawford',
 'QueAnh Dang',
 'Alun DeWinter',
 'Ken Fero',
 'Mark Hodds',
 'Sylwia Holmes',
 'Elizabeth Horton',
 'Jaya Jacobo',
 'Emmanuel Johnson',
 'Mehmet Karakus',
 'Luca Morini',
 'Marina Orsini-Jones',
 'Charlotte Price',
 'Steve Raven',
 'Carlo Tramontano',
 'Katherine Wimpenny']

### Create or update the database

In [9]:
def updateDB():
    """
    This function updates the database.
    If no database exists, it creates one.
    If database already exists, it removes the duplicates.
    """
    if not os.path.isfile('database.csv'):
        database = pd.DataFrame(columns=['Title', 'Authors', 'Date', 'Publication_Link'])
        database.to_csv('database.csv', index=False)
    else:
        database = pd.read_csv("database.csv")
        database = database.drop_duplicates()
        database.to_csv('database.csv', index=False)

# Crawler Function

In [10]:
def crawler():
    """
    This is the main crawler.
    It crawls the seed url and stores the Title, Authors, Date and Publication Link
    in a database.
    It checks whether any author is from CGL, and only stores the data if at least one of
    the authors has a profile on CGL.
    It also counts the total number of publications crawled, total CGL publications,
    and non CGL publications.
    """
    url_list = getPageUrls()
    updateDB()

    count = 0
    count_cgl = 0
    count_non_cgl = 0

    while url_list:
        url = url_list.pop(0)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        publications = soup.find_all('div', class_='result-container')

        for publication in publications:
            title = publication.select_one('.result-container .title').get_text()

            link = publication.select_one('.result-container .title .link').get('href')

            date = publication.select_one('.date').get_text()

            time.sleep(5)

            paper = requests.get(link)
            soup2 = BeautifulSoup(paper.text, "html.parser")
            # find() accepts class_; select_one() takes a CSS selector and ignores class_
            authors = soup2.find("p", class_="relations persons")
            authors = authors.get_text()
            authors = re.sub(r'\s*\([^)]*\)', '', authors).split(", ")  # list of authors
            
            # check if any author is from CGL
            if any(author in authors for author in profiles):
                count_cgl += 1

                db = pd.read_csv('database.csv')
                new_data = pd.DataFrame({'Title': [title],
                                         'Authors': [authors],
                                         'Date': [date],
                                         'Publication_Link': [link]})
                db = pd.concat([db, new_data], ignore_index=True)
                db.to_csv('database.csv', index=False)

            else:
                count_non_cgl += 1
            count += 1

        time.sleep(5)

    updateDB()
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print('Last crawl at: ', current_time)
    
    print('total publications crawled:', count)
    print('cgl publications:', count_cgl)
    print('non-cgl publications:', count_non_cgl)
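The author handling inside the crawler can be tried in isolation. The following is a minimal sketch of that step, not a replacement for it. The helper `parse_authors`, the name "Jane Doe", and the "(Editor)" note are invented for illustration. The three profile names are taken from the CGL list shown earlier.

```python
import re

# A few names from the CGL profile list shown earlier
cgl_profiles = {"Katherine Wimpenny", "Luca Morini", "Marina Orsini-Jones"}

def parse_authors(raw):
    """Strip parenthesised notes, then split on ', ' as the crawler does."""
    cleaned = re.sub(r"\s*\([^)]*\)", "", raw)
    return [name.strip() for name in cleaned.split(", ")]

# Hypothetical author line, invented for illustration
authors = parse_authors("Katherine Wimpenny (Editor), Jane Doe")

# A publication counts as CGL if at least one author has a CGL profile
is_cgl = any(author in cgl_profiles for author in authors)
print(authors, is_cgl)
```

The match is an exact string comparison, so a name that is spelled differently on the publication page than on the profile page would not be recognised.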

# Run and Schedule the crawler.

Please note that the code below runs indefinitely until it is interrupted.<br>
It crawls once immediately and is then scheduled to crawl again every Monday at 00:00 (midnight).

In [None]:
seed_url = "https://pureportal.coventry.ac.uk/en/organisations/centre-global-learning"
page = requests.get(seed_url)
soup = BeautifulSoup(page.text, "html.parser")
time.sleep(5)
profiles = CGLprofiles()

# start crawling
crawler()

# schedule the crawler to run every monday at 12:00 AM
schedule.every().monday.at("00:00").do(crawler)

# run the scheduler continuously
while True:
    schedule.run_pending()
    time.sleep(1)
    # Sleeping for 1 second between checks avoids unnecessary CPU usage.



Last crawl at:  2023-07-27 23:03:11
total publications crawled: 286
cgl publications: 195
non-cgl publications: 91
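To get a feel for when the scheduled job would fire after this run, the next Monday at 00:00 can be computed with the standard `datetime` module. This is a rough sketch under the assumption that the job runs at the first Monday midnight strictly after the current time. It does not reproduce the internals of the `schedule` library, and `next_monday_midnight` is a hypothetical helper.

```python
import datetime

def next_monday_midnight(now):
    """Hypothetical helper: first Monday 00:00 strictly after `now`."""
    days_ahead = (7 - now.weekday()) % 7  # Monday is weekday 0
    candidate = datetime.datetime.combine(
        now.date() + datetime.timedelta(days=days_ahead),
        datetime.time(0, 0),
    )
    # If today is Monday, midnight has already passed, so move one week on
    if candidate <= now:
        candidate += datetime.timedelta(days=7)
    return candidate

# Timestamp of the crawl printed above
last_crawl = datetime.datetime(2023, 7, 27, 23, 3, 11)
print(next_monday_midnight(last_crawl))
```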


Remember to terminate the program manually when you want to stop the crawler.

Please also check the <i>indexing_and_query_processing.ipynb</i> after you crawl.

<i>Thank you for reading my notebook.
<br>
<br>Avishek K C<br>
2023</i>