### Full scraping workflow using Requests and BeautifulSoup combined with regex

First we import the libraries needed.

In [18]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import itertools

We load the input .csv file containing the MR numbers and convert it to a Python ```list```.

In [21]:
input_test = pd.read_csv('test_input.csv')
type_change = input_test.values.tolist()
mrn_numbers_only = list(itertools.chain(*type_change))
type(mrn_numbers_only)
mrn_numbers_only

['MR3',
 'MR4044696',
 'MR2900886',
 'MR3169623',
 'MR4180136',
 'MR11',
 'MR1111111',
 'MR5',
 'MR7',
 'MR9',
 'MRMR4044697']
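
The last entry in the list above has a doubled prefix, so it may be worth checking the input before any pages are opened. The following is only a sketch: it assumes that a valid identifier is the letters MR followed by digits, and the names `MR_PATTERN` and `split_valid` are hypothetical helpers, not part of the original notebook.

```python
import re

# Assumption: a valid identifier is "MR" followed by one or more digits.
MR_PATTERN = re.compile(r"MR\d+")


def split_valid(mr_numbers):
    """Return (valid, rejected) lists, preserving the input order."""
    valid, rejected = [], []
    for value in mr_numbers:
        if MR_PATTERN.fullmatch(value):
            valid.append(value)
        else:
            rejected.append(value)
    return valid, rejected


sample = ["MR3", "MR4044696", "MRMR4044697"]
valid, rejected = split_valid(sample)
print(valid)     # entries that look like MR numbers
print(rejected)  # entries to inspect by hand
```

Keeping the rejected values in a separate list, rather than silently dropping them, makes it easier to correct the input file later.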

We define two functions that are used together to find all GAP citations by HTML element and by the text contained inside it. They can be reused in future web-scraping projects too.

In [2]:
MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, mrn, **kwargs):
    """
    Find every tag in soup that matches all provided kwargs and contains
    the text.

    Return a flat list in which each match contributes two items: the label
    mrn + ':' followed by the stripped text of the element. If no match is
    found, return None.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        # string= is the current name of the older text= argument
        if element.find(string=like(text)):
            matches.append(mrn + ':')
            matches.append(element.text.strip())
    return matches if matches else None
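
To see what the pattern built by `like` accepts, here is a small sketch that uses only the `re` module. It repeats a compact copy of `like` so that it runs on its own; the sample strings and the stricter `\bGAP\b` pattern are illustrative assumptions, not part of the original workflow.

```python
import re


def like(string):
    """Compact copy of the helper above: escaped text with any prefix and suffix."""
    return re.compile(r".*" + re.escape(str(string)) + r".*", flags=re.DOTALL)


gap = like("GAP")

samples = [
    "The GAP Group, version 4.10",  # genuine citation
    "a gap in the literature",      # lower case, so no match
    "STOPGAPS",                     # substring match, probably unwanted
]
results = [bool(gap.search(s)) for s in samples]
print(results)

# re.escape keeps the dot literal, so "4.10" does not match "4x10".
escaped_ok = like("4.10").search("version 4x10") is None

# With match(), DOTALL lets the leading ".*" span the newline.
multiline_ok = gap.match("line one\nGAP on line two") is not None

# Hypothetical stricter alternative: require word boundaries around GAP.
strict = re.compile(r"\bGAP\b")
strict_results = [bool(strict.search(s)) for s in ["$GAP$ groups", "STOPGAPS"]]
print(strict_results)
```

The substring behaviour explains why any list item that merely contains the letters GAP is reported; a word-boundary pattern is one possible way to narrow this if false positives become a problem.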

In [7]:
base_URL = "D:\\"
mrn = ["MR3", "MR4044696", "MR2900886", "MR4180136", "MR4044697", "MR7", "MR5", "MR11"]
all_matches = []

for mr_number in mrn:
    # Path of the locally saved page for this MR number
    url = base_URL + mr_number + '.html'
    print(url)
    # Read the saved page as UTF-8 so that non-ASCII characters are decoded correctly
    with open(url, encoding='utf-8') as page:
        soup = BeautifulSoup(page.read(), 'html.parser')
    match = find_by_text(soup, 'GAP', 'li', mr_number)
    all_matches.append(match)

all_matches

D:\MR3.html
D:\MR4044696.html
D:\MR2900886.html
D:\MR4180136.html
D:\MR4044697.html
D:\MR7.html
D:\MR5.html
D:\MR11.html


[None,
 ['MR4044696:',
  'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.'],
 ['MR2900886:',
  'The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.'],
 ['MR4180136:',
  'The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.'],
 ['MR4044697:',
  'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.'],
 ['MR7:',
  'V. A. Artamonov and A. A. Bovdi, Integral gro GAP up rings: Groups of invertible elements and classical $K$-theory, in Algebra, Topology, Geometry, Vol. 27 (Russian), Itogi Nauki i Tekhniki, 232. (Vsesoyuz. Inst. Nauchn. i Tekhn. Inform., Moscow, 1989), pp. 3–43. \nMR1039822',
  'MR7:',
  "V. Bovdi, A. Grishkov and A. Konovalov, Kimmerle @GAP a lapa conjecture for the Held and O'Nan sporadic simple groups, Sci. Math. Jpn. 69(3

In [8]:
print(type(all_matches[0]))

<class 'NoneType'>


Some of the test HTML files did not contain the word GAP, so the function returned ```None``` for them. The following list comprehension removes those entries from the results before we continue.

In [9]:
all_matches = [i for i in all_matches if i is not None]
all_matches

[['MR4044696:',
  'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.'],
 ['MR2900886:',
  'The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.'],
 ['MR4180136:',
  'The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.'],
 ['MR4044697:',
  'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.'],
 ['MR7:',
  'V. A. Artamonov and A. A. Bovdi, Integral gro GAP up rings: Groups of invertible elements and classical $K$-theory, in Algebra, Topology, Geometry, Vol. 27 (Russian), Itogi Nauki i Tekhniki, 232. (Vsesoyuz. Inst. Nauchn. i Tekhn. Inform., Moscow, 1989), pp. 3–43. \nMR1039822',
  'MR7:',
  "V. Bovdi, A. Grishkov and A. Konovalov, Kimmerle @GAP a lapa conjecture for the Held and O'Nan sporadic simple groups, Sci. Math. Jpn. 69(3) (2009

In [10]:
print(type(match))
print(type(all_matches))
print('Results count is:', len(all_matches))
print(all_matches[2])

<class 'list'>
<class 'list'>
Results count is: 7
['MR4180136:', 'The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.']


In [11]:
joined = list(itertools.chain(*all_matches))
joined

['MR4044696:',
 'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.',
 'MR2900886:',
 'The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.',
 'MR4180136:',
 'The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.',
 'MR4044697:',
 'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.',
 'MR7:',
 'V. A. Artamonov and A. A. Bovdi, Integral gro GAP up rings: Groups of invertible elements and classical $K$-theory, in Algebra, Topology, Geometry, Vol. 27 (Russian), Itogi Nauki i Tekhniki, 232. (Vsesoyuz. Inst. Nauchn. i Tekhn. Inform., Moscow, 1989), pp. 3–43. \nMR1039822',
 'MR7:',
 "V. Bovdi, A. Grishkov and A. Konovalov, Kimmerle @GAP a lapa conjecture for the Held and O'Nan sporadic simple groups, Sci. Math. Jpn. 69(3) (2009) 353–361. \nM

In [12]:
print(joined[3])
print(type(joined))
print(type(joined[1]))
print('Now the Results count is:', len(joined), ' which confirms that our program also catches GAP Packages citation as separate results.')

The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.
<class 'list'>
<class 'str'>
Now the Results count is: 24  which confirms that our program also catches GAP Packages citation as separate results.
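
The jump in the count comes from flattening: a page with several citations contributes several label and citation pairs. A minimal sketch with made-up data (the `per_page` list and its contents are purely illustrative) shows the same effect on a small scale.

```python
from itertools import chain

# Made-up results: one page without a match, one with a single citation,
# and one with two citations.
per_page = [
    None,
    ["MR1:", "citation A"],
    ["MR2:", "citation B", "MR2:", "citation C"],
]

# Drop pages that produced no match, then flatten the nested lists.
found = [m for m in per_page if m is not None]
flat = list(chain.from_iterable(found))

print(len(found))  # number of pages with at least one match
print(len(flat))   # two items (label, citation) per citation
print(flat)
```

Here `chain.from_iterable(found)` does the same job as `chain(*found)` without unpacking the outer list into arguments.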


In [13]:
final = [item.strip() for item in joined]  # remove leading and trailing whitespace from every item
final

['MR4044696:',
 'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.',
 'MR2900886:',
 'The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.',
 'MR4180136:',
 'The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.',
 'MR4044697:',
 'The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.',
 'MR7:',
 'V. A. Artamonov and A. A. Bovdi, Integral gro GAP up rings: Groups of invertible elements and classical $K$-theory, in Algebra, Topology, Geometry, Vol. 27 (Russian), Itogi Nauki i Tekhniki, 232. (Vsesoyuz. Inst. Nauchn. i Tekhn. Inform., Moscow, 1989), pp. 3–43. \nMR1039822',
 'MR7:',
 "V. Bovdi, A. Grishkov and A. Konovalov, Kimmerle @GAP a lapa conjecture for the Held and O'Nan sporadic simple groups, Sci. Math. Jpn. 69(3) (2009) 353–361. \nM

### Converting our data to a pandas DataFrame for further analysis

In [14]:
df=pd.DataFrame(final)
df

Unnamed: 0,0
0,MR4044696:
1,"The GAP Group, GAP – groups, algorithms and ..."
2,MR2900886:
3,"The GAP Group, $GAP$ groups, algorithms, and p..."
4,MR4180136:
5,"The GAP Group, 2019. GAP – Groups, Algorithm..."
6,MR4044697:
7,"The GAP Group, GAP – groups, algorithms and ..."
8,MR7:
9,"V. A. Artamonov and A. A. Bovdi, Integral gro ..."


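Some MR numbers have more than one GAP citation, so the same MR number can appear several times in the data. Because the MR label is stored before each citation, the single column strictly alternates: rows at even index positions (0, 2, 4, ...) hold an MR number and rows at odd index positions (1, 3, 5, ...) hold the corresponding citation. We take the even-position rows into a column called 'MR' and join each of them with the citation that follows it in a second column called 'Citation'.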

In [15]:
check = df.index % 2 == 0  # True at even index positions; MR numbers and citations alternate in consecutive order
separated = pd.DataFrame([df.loc[check, 0].str.strip(':').tolist(),  # even positions (0, 2, 4, ...): the MR numbers
                          df.loc[~check, 0].tolist()],  # odd positions (1, 3, 5, ...): the citations
                         index=['MR', 'Citation']).T  # name the two rows, then transpose them into columns
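
The same pairing can be illustrated without pandas. The sketch below is a plain-Python equivalent of the boolean mask, using list slicing on made-up data; the `flat` list and the variable names are assumptions for illustration only.

```python
# Made-up flat list in the same alternating layout: label, citation, label, citation, ...
flat = ["MR1:", "citation A", "MR2:", "citation B", "MR2:", "citation C"]

# The layout only makes sense if every label has a citation after it.
assert len(flat) % 2 == 0

mr_numbers = [label.rstrip(":") for label in flat[0::2]]  # even positions
citations = flat[1::2]                                    # odd positions

rows = list(zip(mr_numbers, citations))
for mr_number, citation in rows:
    print(mr_number, "->", citation)
```

An MR number with two citations simply appears in two rows, which matches the repeated MR values in the resulting table.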

In [16]:
separated

Unnamed: 0,MR,Citation
0,MR4044696,"The GAP Group, GAP – groups, algorithms and ..."
1,MR2900886,"The GAP Group, $GAP$ groups, algorithms, and p..."
2,MR4180136,"The GAP Group, 2019. GAP – Groups, Algorithm..."
3,MR4044697,"The GAP Group, GAP – groups, algorithms and ..."
4,MR7,"V. A. Artamonov and A. A. Bovdi, Integral gro ..."
5,MR7,"V. Bovdi, A. Grishkov and A. Konovalov, Kimmer..."
6,MR7,"V. Bovdi, E. Jespers and A. Konovalov, Tors ga..."
7,MR7,"V. Bovdi and A. Konovalov, Integral group ring..."
8,MR5,"V. A. Artamonov and A. A. Bovdi, Integral gro ..."
9,MR5,"V. Bovdi, A. Grishkov and A. Konovalov, Kimmer..."


The resulting pandas DataFrame has two columns. Now we can export it to a .csv file, which will be picked up by the next Jupyter Notebook in our pipeline.

In [17]:
separated.to_csv('local_test_output.csv', index=False, encoding='utf-8')