# Biomedical Data Bases, 2021-2022
### Web Scraping
These are notes by prof. Davide Salomoni (d.salomoni@unibo.it) for the Biomedical Data Base course at the University of Bologna, academic year 2021-2022.

Here we demonstrate, with a simple example, how to fetch and process some data, in particular postings of bioinformatics jobs. Study the slides for more details, and remember **not to overuse** web scraping.
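
One common courtesy when scraping is to check a site's `robots.txt` rules before fetching its pages. The following is only a minimal sketch of such a check, using the standard-library `urllib.robotparser`. The `robots.txt` content, the `example.org` URLs and the helper name `may_fetch` are made up for illustration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules from a list of lines

def may_fetch(url, user_agent='*'):
    """Return True if the parsed robots.txt rules allow fetching url."""
    return parser.can_fetch(user_agent, url)

allowed = may_fetch('https://example.org/jobs/')
blocked = may_fetch('https://example.org/private/data')
delay = parser.crawl_delay('*')  # requested pause between requests, if any

print(allowed, blocked, delay)
```

With a real site you would typically call `parser.set_url(...)` and `parser.read()` instead of `parser.parse(...)`, and sleep between consecutive requests.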

## Import all necessary libraries

In [None]:
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Example 1: scraping bioinformatics jobs

### Get the page

In [None]:
LENGTH = 10 # number of jobs to retrieve
JOB_URL = 'https://www.bioinformatics.org/jobs/?group_id=101&summaries=1&length=%s' % (LENGTH-1)

# get the HTML of the page with requests.get()
r = requests.get(JOB_URL)

# now pass the page content to bs4
soup = BeautifulSoup(r.content)

### Get the elements that interest us

Looking at the HTML of the web page, it turns out that the job opportunities are in some _tables_. But the web page has many tables: how do we find the ones that interest us? We notice that the job opportunities are in tables where the text "Opportunity" is part of the content.

So, we first get all the tables, and then we retain only those with "Opportunity" in the content, storing them in a list.
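
The filtering idea can be tried offline on a small piece of HTML before running it on the real page. The snippet below is a sketch under that assumption: the HTML string and the `example.org` links are invented, and only the `find_all` / `.text` / `find` calls mirror what is used in these notes.

```python
from bs4 import BeautifulSoup

# Invented HTML with three tables, only one of which describes a job
HTML = """
<table><tr><td>Site navigation</td></tr></table>
<table><tr><td>
  <a href="https://example.org/forum.php?forum_id=1">Opportunity: Analyst @ Example Lab</a>
  Submitted by <a href="https://example.org/users/jdoe">J. Doe</a>
</td></tr></table>
<table><tr><td>Footer</td></tr></table>
"""

demo_soup = BeautifulSoup(HTML, 'html.parser')

# keep only the tables whose text contains the word "Opportunity"
demo_entries = [t for t in demo_soup.find_all('table') if 'Opportunity' in t.text]

# for each retained table, take the href of its first link
first_links = [t.find('a', href=True)['href'] for t in demo_entries]

print(len(demo_entries), first_links)
```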

In [None]:
# get all the tables
tables = soup.find_all('table')

# store the interesting tables in a list
entries = list()
for table in tables:
    if 'Opportunity' in table.text:
        entries.append(table)

In [None]:
# if we look at the list 'entries', we realize that the first two tables are not interesting (check for yourself the content of entries[0] and entries[1])
# therefore, we remove the first two items of the list and retain the remaining items
entries = entries[2:]

In [None]:
# check what an entry looks like, using entries[0] as an example
entries[0]

In [None]:
# each entry is a BeautifulSoup object (more precisely, a bs4.element.Tag)
type(entries[0])

In [None]:
# entries[0] looks OK. We can apply several bs4 methods to it. For instance, let's see the text
entries[0].text

In [None]:
# OK, there is some parsing to do on the text to remove unwanted elements, such as those \n characters. We will do it later.

# Let's now use the bs4 find_all method to find the 'a' HTML tags (identifying links)
links = entries[0].find_all('a', href=True)

# check what we got
links

In [None]:
# we got a list, of which we want to retain only the first element (the second refers to the person who posted the job opportunity)

# let's print out the "href", i.e. the actual link of the HTML tag
for link in links:
    print(link['href'])

In [None]:
# so, if we want only the first element:
print(links[0]['href'])

### Parse the text and store it in a list

Now that we have understood how to process the various elements of the HTML page, let's put everything together.

We will parse the text of each item of the `entries` list using some Python constructs, as well as a _regular expression_. Let's look at the text for an item (see above); it could look something like this:

```
'\n\n\n\nOpportunity: Bioinformatics Data Analyst @ Bowie State University -- Bowie, MD (US)\nSubmitted by Konda Reddy Karnati; posted on Friday,\xa0January\xa028,\xa02022 \n\n\n\n'
```

- we use strip() to remove the \n characters (and other whitespace) at the beginning and at the end of the text
- the text contained between ':' and '@' is the job title
- the text contained between '@' and '\n' is the job location
- the text after 'posted on ' is the publication date. Here we also need to replace \xa0 with a regular space

Also, a link will be something like this (see above):

```
https://www.bioinformatics.org/forums/forum.php?forum_id=14619
```

- the text after 'forum_id=' seems to be the job number, so we will extract that as well

We will put all the extracted elements in the list called `my_jobs`.
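
The parsing rules listed above can be checked in isolation on the sample text and link shown earlier. This is a sketch rather than the code used in the next cell: it adds a guard for the case where the pattern does not match, and it reads the job number with `urllib.parse` instead of a second regular expression, which is a swapped-in alternative.

```python
import re
from urllib.parse import urlparse, parse_qs

# Sample text and link copied from the examples above
text = ('\n\n\n\nOpportunity: Bioinformatics Data Analyst @ '
        'Bowie State University -- Bowie, MD (US)\n'
        'Submitted by Konda Reddy Karnati; posted on '
        'Friday,\xa0January\xa028,\xa02022 \n\n\n\n')
link = 'https://www.bioinformatics.org/forums/forum.php?forum_id=14619'

# group 1: title, group 2: location, group 3: publication date
pattern = re.compile(r'Opportunity: (.+?) @ (.+?)\n.+?; posted on (.+?)$')

m = pattern.search(text.strip())
if m is None:
    # the entry does not have the expected layout
    raise ValueError('unexpected entry format')

title = m.group(1)
location = m.group(2)
date = m.group(3).replace('\xa0', ' ')  # non-breaking spaces become spaces

# the job number is the value of the forum_id query parameter
job_id = parse_qs(urlparse(link).query)['forum_id'][0]

print(job_id, title, location, date, sep=' | ')
```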

In [None]:
my_jobs = list()
for entry in entries:
    link = entry.find_all('a', href=True)[0]['href'] # we keep only the first link (see above)
    text = entry.text.strip()
    
    # parse the text, grouping the interesting parts as explained above
    # look up 'regular expressions in Python' if you are not familiar with the re.search syntax 
    m = re.search('Opportunity: (.+?) @ (.+?)\n.+?; posted on (.+?)$', text)
    
    # extract the various groups of the regular expression
    title = m.group(1)
    location = m.group(2)
    date = m.group(3).replace('\xa0', ' ') # replace also \xa0 with a space
    
    # extract the job id from the link
    m = re.search('.*forum_id=(.+)$', link)
    job_id = m.group(1)
    
    # finally, append all the extracted elements to a list
    my_jobs.append([job_id, title, location, link, date])

In [None]:
# check what we got in the end
my_jobs

In [None]:
# let's print it with some formatting
for job in my_jobs:
    print("Job number: %s" % job[0])
    print("\tTitle: %s" % job[1])
    print("\tLocation: %s" % job[2])
    print("\tURL: %s" % job[3])
    print("\tPublished on: %s\n" % job[4])

### Get information from a secondary page

You can also apply web scraping to the pages detailing each job. For example, let's go through the list of jobs we have retrieved so far, visit the respective URLs, and fetch the "DEADLINE". We will store this deadline in a Python dictionary called `my_deadlines`; in this dictionary, the key will be the job ID (which looks like a unique key for each job) and the value will be the deadline.

Looking at the HTML of one of the URLs, we notice that the deadline label is in an element with a class called `sf-news`, and that the text of that element is "DEADLINE". There might be more than one element with the `sf-news` class, so we loop through all of them, and stop when we find the one with the DEADLINE text. The actual deadline is contained in the next element, which can be found in the `next_sibling` attribute of the element.

The deadline string will have some \r or \n characters before and after it, so we will remove them with `strip()`.

__However__, note that the "DEADLINE" field is _not_ mandatory, so it might not be present on a page. We should therefore consider also this case.
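
A handy Python construct for an optional field like this is `for ... else`: the `else` branch runs only when the loop finishes without hitting `break`. The sketch below shows the construct on plain Python data; the `(label, value)` pairs are a hypothetical stand-in for the `sf-news` elements and their next siblings, and `find_deadline` is an illustrative helper name.

```python
def find_deadline(fields):
    """Return the value following the 'DEADLINE' label, or 'No deadline'."""
    for label, value in fields:
        if label == 'DEADLINE':
            # found it: clean the value and leave the loop
            deadline = value.strip()
            break
    else:
        # reached only if the loop ended without a break
        deadline = 'No deadline'
    return deadline

# hypothetical pages: one with a deadline, one without
page_with = [('LOCATION', ' Bologna '), ('DEADLINE', '\r\n2022-02-28\r\n')]
page_without = [('LOCATION', ' Bologna ')]

print(find_deadline(page_with))
print(find_deadline(page_without))
```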

In [None]:
# verify what a job looks like, printing for example my_jobs[0]
my_jobs[0]

In [None]:
my_deadlines = dict()
for job in my_jobs:
    job_id = job[0]
    job_link = job[3]
    
    # get the page referenced by the current job
    r = requests.get(job_link)
    
    # pass the page content to bs4
    soup = BeautifulSoup(r.content)
    
    # find all the "sf-news" classes and get the one with text "DEADLINE"
    for c in soup.find_all(class_ = 'sf-news'):
        if c.text == "DEADLINE":
            # get the deadline, create a dictionary item and then exit from this loop
            my_deadlines[job_id] = c.next_sibling.strip()
            break
    else:
        # this part gets executed if no "break" was encountered.
        my_deadlines[job_id] = 'No deadline'

In [None]:
# check that my_deadlines contains what we want
my_deadlines

### Combine everything into a single block of code and store the info in a database

In the cells above, we first created the list `my_jobs` and then the dictionary `my_deadlines` to try out the code. Now that we have it working, we can combine everything into a single block of code, and for example print out all the information we have gathered.

We could then store everything in a permanent place, for example in a database. __Try this out for yourself__, for instance using a Redis database, configured with persistence. Then see if you can retrieve the info from the DB using various types of searches.
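
As a possible starting point for this exercise, here is a sketch of how one job could be laid out as a Redis hash. The key scheme `job:<id>`, the field names and the helper `job_to_hash` are assumptions chosen for illustration, not a prescribed design; the deadline value is a placeholder. The calls that talk to Redis are left as comments because they need the `redis` package and a running server.

```python
def job_to_hash(job_id, title, location, link, date, deadline):
    """Return a (key, fields) pair describing one job as a Redis hash."""
    key = 'job:%s' % job_id  # assumed key scheme: one hash per job
    fields = {
        'title': title,
        'location': location,
        'url': link,
        'published': date,
        'deadline': deadline,
    }
    return key, fields

key, fields = job_to_hash(
    '14619',
    'Bioinformatics Data Analyst',
    'Bowie State University -- Bowie, MD (US)',
    'https://www.bioinformatics.org/forums/forum.php?forum_id=14619',
    'Friday, January 28, 2022',
    'No deadline',  # placeholder value
)

# With redis-py installed and a server running (assumption):
# import redis
# r = redis.Redis(decode_responses=True)
# r.hset(key, mapping=fields)   # store the job
# r.hgetall(key)                # read it back as a dictionary

print(key, fields['title'])
```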

In [None]:
# this would be the final code, from start to end

LENGTH = 10 # number of jobs to retrieve
JOB_URL = 'https://www.bioinformatics.org/jobs/?group_id=101&summaries=1&length=%s' % (LENGTH-1)

# get the HTML of the page with requests.get()
r = requests.get(JOB_URL)

# now pass the page content to bs4
soup = BeautifulSoup(r.content)

# get all the tables
tables = soup.find_all('table')

# store the interesting tables in a list
entries = list()
for table in tables:
    if 'Opportunity' in table.text:
        entries.append(table)

# remove the first two entries
entries = entries[2:]

for entry in entries:
    link = entry.find_all('a', href=True)[0]['href'] # we keep only the first link (see above)
    text = entry.text.strip()
    
    # parse the text, grouping the interesting parts as explained above
    # look up 'regular expressions in Python' if you are not familiar with the re.search syntax 
    m = re.search('Opportunity: (.+?) @ (.+?)\n.+?; posted on (.+?)$', text)
    
    # extract the various groups of the regular expression
    title = m.group(1)
    location = m.group(2)
    date = m.group(3).replace('\xa0', ' ') # replace also \xa0 with a space
    
    # extract the job id from the link
    m = re.search('.*forum_id=(.+)$', link)
    job_id = m.group(1)
    
    # get the deadline by scraping the link
    link_page = requests.get(link)
    # pass the page content to bs4
    link_soup = BeautifulSoup(link_page.content)
    
    # find all the "sf-news" classes and get the one with text "DEADLINE"
    for c in link_soup.find_all(class_ = 'sf-news'):
        if c.text == "DEADLINE":
            # get the deadline text. remove spurious characters and exit from this loop
            deadline = c.next_sibling.strip()
            break
    else:
        # this part gets executed if no "break" was encountered.
        deadline = 'No deadline'
   
    # print out all that we have gathered, with some formatting
    print("Job number: %s" % job_id)
    print("\tTitle: %s" % title)
    print("\tLocation: %s" % location)
    print("\tURL: %s" % link)
    print("\tPublished on: %s" % date)
    print("\tDeadline: %s\n" % deadline)
    
    # here we could insert all the info into a list as we did earlier
    # or, instead of creating the list, we could write the info to a DB
    # <do something with the info>


## Example 2: getting hourly weather forecasts

This example and the following one are provided without many explanations. You should analyze the code, break it down into multiple cells and verify what each operation does. There are also other ways to extract the same info: you are encouraged to find and test them.
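
One step worth trying in isolation is the clean-up of the scraped rows: keeping only the rows that start with an hour, and turning a temperature string into a number. The sketch below uses invented rows and assumes that temperatures look like `'22°'`, i.e. a number followed by a degree sign; `to_degrees` is an illustrative helper name.

```python
# Hypothetical rows, shaped like the cleaned-up cell texts of an hourly table
rows = [
    ['Ora', 'Tempo', 'T'],         # header-like row: first cell is not a number
    [],                            # row whose cells were all empty
    ['14', 'sereno', '21.5°'],
    ['15', 'nubi sparse', '22°'],
]

def to_degrees(text):
    """Convert a string such as '21.5°' to a float, dropping the degree sign."""
    return float(text.rstrip('°'))

# keep only rows that are non-empty and start with an hour number
hourly = [row for row in rows if row and row[0].isdigit()]

# convert the third cell of each remaining row
temperatures = [to_degrees(row[2]) for row in hourly]

print(temperatures)
```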

In [None]:
# we use the datetime module to get the current date
from datetime import date

WEATHER_URL = 'https://www.ilmeteo.it/meteo/Bologna'
r = requests.get(WEATHER_URL)
soup = BeautifulSoup(r.content) 

# extract the main table, identified in the HTML code by a class called 'datatable'
infos = soup.find_all(class_ = 'datatable')

# extract rows and columns of that table, looking at the tags <tr> and <td>
data = list()
rows = infos[0].find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if len(cols):
        # just keep the text of each cell
        cols = [i.text.strip() for i in cols]
        # put everything in data, removing empty entries
        data.append([i for i in cols if i])

# note that at the beginning of the data list there are several uninteresting elements
# let's keep only the non-empty list elements whose first element is a number
data = [i for i in data if i and i[0].isdigit()]

# create a Pandas data frame from the data list. Column names will be by default 0, 1, 2, ...
df = pd.DataFrame(data)

# keep only the columns I am interested in (0, 1, 2) and convert column 2 into float
weather = df.iloc[:, [0,1,2]].copy()
# assign by column label, so that the column is replaced and gets the float dtype
weather[2] = weather[2].map(lambda x: float(x[:-1]))

# see what it looks like
print(weather)

# plot it 
my_date = date.today().strftime('%d/%m/%Y') # get today's date in the d/m/Y format
weather.plot(x=0,
             y=2,
             xlabel='Hour',
             ylabel='T (°C)',
             legend=False,
             title='Hourly weather forecast for Bologna on %s' % my_date
            )

## Example 3: getting weekly weather forecasts

In [None]:
WEATHER_URL = 'https://www.ilmeteo.it/meteo/Bologna'
r = requests.get(WEATHER_URL)
soup = BeautifulSoup(r.content)

# find all the tags like <ul id='daytabs' ...>
infos = soup.find_all('ul', id='daytabs')

# find all the HTML lists (tag 'li')
entries = infos[0].find_all('li')

# looking at the entries, we notice that we must remove those that do not start with a weekday
days = ['Lun', 'Mar', 'Mer', 'Gio', 'Ven', 'Sab', 'Dom']
data = list()
for entry in entries:
    text = entry.text.split() # split() already returns a list
    if text and text[0] in days: # skip empty entries
        data.append(text)

# the list now looks almost ok, but in order to plot it we must:
# 1) join the first two fields: 'Sab' and '7' should become 'Sab 7' 
# 2) separate the third field: 19°35° should become '19' and '35'
for row in data:
    one, two, three = row
    tmin, tmax, _ = three.split('°')
    row[0] = one + ' ' + two
    row[1] = int(tmin)
    row[2] = int(tmax)

# good, we can now create a Pandas data frame from that list
df = pd.DataFrame(data, columns = ['Date', 'T(min)', 'T(max)'])
print(df)

# and finally plot it (my_date was defined in the Example 2 cell)
df.plot(x='Date', 
        y=['T(min)','T(max)'], 
        rot=45, 
        ylabel='T (°C)', 
        title='Weekly weather forecast for Bologna on %s' % my_date
       )
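
The row clean-up used in this example can also be tested on its own. The sketch below wraps it in a hypothetical helper, `tidy_row`; the first sample row comes from the comments in the code above, while the second one (with a negative minimum) is invented to show that the same split still works.

```python
def tidy_row(row):
    """Turn ['Sab', '7', '19°35°'] into ['Sab 7', 19, 35]."""
    weekday, day, temps = row
    # '19°35°'.split('°') gives ['19', '35', '']; drop the empty tail
    tmin, tmax = [int(t) for t in temps.split('°') if t]
    return [weekday + ' ' + day, tmin, tmax]

sample = [['Sab', '7', '19°35°'], ['Dom', '8', '-2°5°']]
tidy = [tidy_row(row) for row in sample]

print(tidy)
```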