### Full scraping workflow using Requests and BeautifulSoup combined with regex

First we import the libraries we need.

In [16]:
import sys
import time
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import itertools

We load the input .csv file containing the MR numbers and convert it to a Python `list`.

In [17]:
# read every value as a string so that leading zeros survive and str methods work later
input_test = pd.read_csv('GapBibMR.csv', dtype=str)
type_change = input_test.values.tolist()
mrn_numbers_only = list(itertools.chain(*type_change))
type(mrn_numbers_only)

list

Before we continue, we need to verify that each entry is exactly 7 digits long, to make sure none of the old-style MR numbers remain.

In [18]:
mrn = []  # standard MR numbers (exactly 7 digits) that we will search for citations
non_standard_mrn = []  # non-standard MR numbers, set aside for updating

for value in mrn_numbers_only:
    # pandas may parse the column as int, and int has no .isnumeric(),
    # so convert each value to str before validating it
    text = str(value)
    if text.isnumeric() and len(text) == 7:
        mrn.append('MR' + text)
    else:
        non_standard_mrn.append(value)
print(mrn)

In [5]:
print('Total input elements:')
print(len(mrn_numbers_only))
print('Number of non-standard elements separated for updating:')
print(len(non_standard_mrn))
print('Number of standard MR elements, that will be searched for GAP citations:')
print(len(mrn))

Total input elements:
3162
Number of non-standard elements separated for updating:
328
Number of standard MR elements, that will be searched for GAP citations:
2834
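The 7-digit check can also be written with a regular expression. The sketch below is an alternative to the loop above, not the notebook's own method; the helper name `split_mr_numbers` and the sample values are made up for illustration. It converts each value to `str` first, so it works whether pandas delivers integers or strings, and it accepts ASCII digits only, whereas `str.isnumeric()` also accepts some non-ASCII numeric characters.

```python
import re

# Exactly seven ASCII digits, nothing before or after.
SEVEN_DIGITS = re.compile(r'[0-9]{7}')


def split_mr_numbers(values):
    """Split raw values into standard ('MR' + 7 digits) and non-standard ones.

    Hypothetical helper, shown only to illustrate the validation step.
    """
    standard, non_standard = [], []
    for value in values:
        text = str(value).strip()
        if SEVEN_DIGITS.fullmatch(text):
            standard.append('MR' + text)
        else:
            non_standard.append(value)
    return standard, non_standard


# Made-up sample input: two valid numbers (one str, one int) and two invalid ones.
good, bad = split_mr_numbers(['1234567', 7654321, '12345', 'MR99a'])
print(good)  # ['MR1234567', 'MR7654321']
print(bad)   # ['12345', 'MR99a']
```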


We define two functions that are used together to find all GAP citations by HTML element and the text inside it. They can be reused in future web-scraping projects too.

In [6]:
MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


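def find_by_text(soup, text, tag, mrn, **kwargs):
    """
    Find every tag in soup that matches all provided kwargs and contains
    the text.

    Return a flat list in which each match contributes two items: the MR
    number followed by ':' and the stripped text of the element. If no
    match is found, return None.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        # string= is the current name of the older text= argument
        if element.find(string=like(text)):
            matches.append(mrn + ':')
            matches.append(element.text.strip())
    # the scan loop relies on None to flag pages without a match
    return matches or None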
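To see what the regex built by `like` accepts, here is a small standalone check. It rebuilds the same pattern as `like('GAP')` inline so the snippet runs on its own; the sample strings are made up. Note that the match is a case-sensitive substring match, so a word that merely contains "GAP" also counts as a hit and may need manual review.

```python
import re

# Same construction as like('GAP'): any prefix, the escaped text, any suffix.
pattern = re.compile(r'.*' + re.escape('GAP') + r'.*', flags=re.DOTALL)

print(bool(pattern.match('The GAP Group, GAP 4.8')))   # True
print(bool(pattern.match('line one\nsoftware GAP\n')))  # True: DOTALL lets .* cross newlines
print(bool(pattern.match('no citation here')))          # False
print(bool(pattern.match('mind the gap')))              # False: matching is case-sensitive
print(bool(pattern.match('A GAPPED alignment')))        # True: substring, not whole word
```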

In [11]:
print('Initiate GAP citation scan...')

base_URL = "http://www.ams.org/mathscinet/search/publications.html?fmt=html&pg1=MR&s1="
all_matches = []
review_later = []
actual_scrapes = []
for i in range(len(mrn)):
    url = base_URL + mrn[i]
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    match = find_by_text(soup, 'GAP', 'li', mrn[i])
    if match is None:
        review_later.append(mrn[i])
    else:
        all_matches.append(match)
        actual_scrapes.append(mrn[i])
    # the following print statements allow the user to track progress
    print('Working on page:')
    print(i)
    print('from a total of:')
    print(len(mrn))
    print('Citations found in page:')
    print(match)
    print(' ')  # blank line for better readability
    # rest 5 seconds between requests to avoid overloading the source
    # website and to avoid triggering its automated-traffic protection
    time.sleep(5)

print('Finished GAP citation scan...')

Initiate GAP citation scan...
Working on page:
0
from a total of:
2834
Citations found in page:
None
 
Working on page:
1
from a total of:
2834
Citations found in page:
None
 
Working on page:
2
from a total of:
2834
Citations found in page:
None
 
Working on page:
3
from a total of:
2834
Citations found in page:
None
 


KeyboardInterrupt: 
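With 2834 pages and a 5-second pause after each one, the sleep time alone adds up to close to four hours, so an interruption like the one above throws away a lot of progress. One possible way to make the scan resumable is to save the state after every page. The sketch below is an assumption-laden illustration, not part of the original notebook: `scan_with_checkpoint`, `fetch_matches` and `checkpoint_path` are hypothetical names, and `fetch_matches` stands in for the download-and-`find_by_text` step.

```python
import json
import os
import time


def scan_with_checkpoint(mrns, fetch_matches, checkpoint_path, delay=5.0):
    """Scan MR numbers, saving progress after each page so a run can resume.

    fetch_matches is any callable that takes an MR number and returns a
    list of matches or None (hypothetical stand-in for the scraping step).
    """
    # Load earlier progress if a checkpoint file exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding='utf-8') as handle:
            state = json.load(handle)
    else:
        state = {'done': [], 'matches': [], 'review_later': []}

    done = set(state['done'])
    for number in mrns:
        if number in done:
            continue  # already processed in an earlier run
        match = fetch_matches(number)
        if match is None:
            state['review_later'].append(number)
        else:
            state['matches'].append(match)
        state['done'].append(number)
        done.add(number)
        # Write the whole state after every page; simple, and cheap next
        # to the pause between requests.
        with open(checkpoint_path, 'w', encoding='utf-8') as handle:
            json.dump(state, handle)
        time.sleep(delay)
    return state
```

Calling the function again with the same `checkpoint_path` skips every MR number already listed under `'done'`, so a restarted run continues where the previous one stopped.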

Pages that do not contain the word GAP make `find_by_text` return `None`. The following list comprehension removes any such entries from the results before we continue.

In [8]:
all_matches = [i for i in all_matches if i is not None]
all_matches

[]

In [9]:
print(type(match))
print(type(all_matches))
print('Results count is:', len(all_matches))
print(all_matches[2])

<class 'NoneType'>
<class 'list'>
Results count is: 0


IndexError: list index out of range

In [None]:
joined = list(itertools.chain(*all_matches))
joined
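The flattening step is easier to follow on a small example. The nested list below is made-up sample data in the shape that `find_by_text` returns (one inner list per page, alternating MR number and citation); `itertools.chain.from_iterable` is equivalent to `itertools.chain(*...)` used above.

```python
import itertools

# Hypothetical sample: the second page holds two citations.
nested = [
    ['MR0000001:', 'citation A'],
    ['MR0000002:', 'citation B', 'MR0000002:', 'citation C'],
]

# Concatenate the inner lists into one flat list.
flat = list(itertools.chain.from_iterable(nested))
print(flat)
print(len(flat))  # 6
```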

In [None]:
print(joined[3])
print(type(joined))
print(type(joined[1]))
print('Now the results count is:', len(joined), 'which confirms that our program also catches GAP package citations as separate results.')

In [None]:
final = [item.strip() for item in joined]  # strip stray whitespace from every element
final

### Converting our data to Pandas dataframe for further analysis

In [None]:
df=pd.DataFrame(final)
df

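Some MR numbers have more than one GAP citation, so the same MR number can appear several times in the data, which alternates between MR numbers and citations. We take every element at an even index (the 1st, 3rd, 5th, ... item), which is an MR number, and put each one in its own row of an 'MR' column. We then take every element at an odd index (the 2nd, 4th, 6th, ... item), which is the corresponding citation, and place it next to its MR number in a second column called 'Citation'.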

In [None]:
check = df.index % 2 == 0  # True at even index positions; MR numbers and citations alternate
final_df = pd.DataFrame([df.loc[check, 0].str.strip(':').tolist(),  # even index positions: MR numbers
                         df.loc[~check, 0].tolist()],  # odd index positions: citations
                         index=['MR', 'Citation']).T  # name the two rows, then transpose them into columns

In [None]:
final_df
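The same pairing can be illustrated without pandas, using list slicing on made-up sample data. This is only a sketch of the idea behind the even/odd split; the names `mr_numbers`, `citations` and `rows` are hypothetical and not used elsewhere in the notebook.

```python
# Hypothetical flat list: MR number, citation, MR number, citation, ...
final = [
    'MR0000001:', 'citation A',
    'MR0000002:', 'citation B',
    'MR0000002:', 'citation C',
]

mr_numbers = [item.rstrip(':') for item in final[0::2]]  # even index positions
citations = final[1::2]                                   # odd index positions
rows = list(zip(mr_numbers, citations))

for mr_number, citation in rows:
    print(mr_number, '|', citation)
```

An MR number with several citations simply appears in several rows, once per citation.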

The resulting pandas DataFrame has two columns. Now we can export it to a .csv file, which will be picked up by the next Jupyter notebook in our pipeline.

In [None]:
final_df.to_csv('output.csv', index=False, encoding='utf-8')