## Lingbuzz Scraper

This Python notebook shows code for scraping the [Lingbuzz](http://ling.auf.net/lingbuzz) website in order to build a linguistics corpus. The code runs on Python 2.7 as written. A few modifications, noted where they apply, are needed for compatibility with Python 3.4.

Lingbuzz is a repository of linguistics papers covering different fields and languages. Each paper has a unique 6-digit ID and can be downloaded as a `.pdf`. Unfortunately, the website is quirky and unstable: while navigating it, one often runs into the dreaded _HTTP Error 503: Service Temporarily Unavailable_. The scraping step may therefore have to be run several times to download the whole corpus.

Lingbuzz lists more than 5000 entries, but I was able to retrieve only ~2700 of them. The root cause is still under investigation (trying `scrapy` is a possible future direction).

The generated corpus is saved as a JSON file, which loads back into a dictionary. The structure of each entry is:

* A unique ID used as the key: a `string` of 6 digits, each from 0 to 9.
    + _cnt_: the number of times the related document was downloaded (`int`).
    + _kwd_: a list of keywords related to the document (`[string]`).
    + _tit_: the title of the document (`string`).
    + _pub_: where the document has been published (`string`); if the document hasn't been published, then it is `N/A`.
    + _dat_: the date of publication on Lingbuzz (`string`).
    + _aut_: a list containing the names of the authors (`[string]`).
    + _ref_: the document ID, same as the key (`string`).
    + _exc_: a short excerpt describing the work presented in the document (`string`).
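
As an illustration of this layout, the sketch below builds a single entry by hand and reads a few fields back after a JSON round trip. Every value (ID, title, authors, date, count) is invented for the example and does not come from Lingbuzz; only the field names follow the list above.

```python
import json

# Hypothetical entry: every value below is invented for illustration.
sample_corpus = {
    "000123": {
        "ref": "000123",
        "tit": "a made-up title",
        "aut": ["jane doe", "john doe"],
        "dat": "january 2015",
        "pub": "N/A",
        "kwd": ["syntax", "semantics"],
        "cnt": 42,
        "exc": "a short made-up excerpt",
    }
}

# Round-trip through JSON, as happens when the corpus is saved and reloaded.
restored = json.loads(json.dumps(sample_corpus))

entry = restored["000123"]
print(entry["tit"], entry["cnt"], entry["kwd"])
```

Note that the ID appears twice, once as the dictionary key and once as the `ref` field, so an entry remains identifiable when it is passed around on its own.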

### 1 - Preliminaries

This part imports the necessary packages and sets up the variables used later. The imports are wrapped in `try`/`except` blocks to ensure compatibility with Python 3.4.

In [1]:
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
    
try:
    import http.client as httplib
except ImportError:
    import httplib
    
from bs4 import BeautifulSoup
import re
import json

If a corpus already exists, then it is loaded. Otherwise, an empty dictionary is generated, ready to store the data.

_(Note: to ensure compatibility with Python 3.4, replace `'rb'` with `'r'`.)_

In [2]:
try:
    with open('Data/lingbuzzCorpus.json', 'rb') as infile:
        ref_dict = json.load(infile)
        print('Corpus successfully loaded!')
        print('Number of entries: {0}'.format(len(ref_dict.keys())))
except IOError:
    ref_dict = dict()
    print('No existing corpus found, empty dictionary generated')

Corpus successfully loaded!
Number of entries: 2723


Defining the user agent can be useful, as some servers react poorly to the default Python user agent. We set it to a Firefox (Mozilla) user-agent string. The `url_base` variable holds the base address of the website.

In [3]:
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0"
url_base = "http://lingbuzz.auf.net"

### 2 - Scraping

The following three helper functions perform all the work:

* `fetchDocIdAndNextPage` takes a URL (`string`) as a parameter and returns a tuple `(id_list, next_page)`, where `id_list` is a list of the document IDs found on the page and `next_page` is the part of the URL leading to the next page (an empty string if no such link is found). The parser used is `html5lib` because, although slower than other parsers, it is extremely consistent and prevents the loss of information.
* `fetch_doc` takes a document ID (`string`) as a parameter and returns the raw HTML of the related webpage, ready to be parsed by BeautifulSoup.
* `parse_doc` takes the document returned by `fetch_doc` and the document ID as a `string`; the document is parsed with BeautifulSoup, and the useful information is extracted and stored in a dictionary, which the function returns.
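
The link classification done by `fetchDocIdAndNextPage` can be tried offline with the two regular expressions it uses. The sketch below is a simplified stand-in, not the notebook's function: `classify_links` is a hypothetical helper that works on plain `(href, text)` pairs instead of BeautifulSoup tags, and the sample links are invented. It also assumes that some anchors may have no `href` at all, which it treats as an empty string.

```python
import re

ID_PATTERN = re.compile(r'\d{6}$')               # href ending in a 6-digit ID
NEXT_PATTERN = re.compile(r'Next \d+ articles')  # text of the pagination link


def classify_links(links):
    """Split (href, text) pairs into document IDs and the next-page href."""
    id_list = []
    next_page = ''
    for href, text in links:
        # Assumption: anchors without an href are passed in as None.
        match_id = ID_PATTERN.search(href or '')
        if match_id:
            id_list.append(match_id.group())
        elif NEXT_PATTERN.search(text):
            next_page = href
    return id_list, next_page


# Hypothetical links, invented for illustration.
sample_links = [
    ("/lingbuzz/002345", "a made-up paper"),
    ("/lingbuzz/002346", "another made-up paper"),
    ("/lingbuzz?start=31", "Next 100 articles"),
    (None, "anchor without href"),
    ("/lingbuzz/_login", "log in"),
]

ids, nxt = classify_links(sample_links)
print(ids, nxt)
```

Because the ID pattern is anchored at the end of the `href`, pagination links whose query string ends in only a few digits are not mistaken for documents.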

_Note: to ensure compatibility with Python 3.4:_
* _replace all instances of `except httplib.IncompleteRead, e:` with `except httplib.IncompleteRead as e:`._
* _replace `except ValueError, v:` with `except ValueError as v:`._

In [4]:
def fetchDocIdAndNextPage(url):
    
    req = urllib2.Request(url)
    req.add_header("User-Agent", user_agent)  # the HTTP header name uses a hyphen
    response = urllib2.urlopen(req)
    
    try:
        page = response.read()
    except httplib.IncompleteRead, e:
        page = e.partial
    response.close()
    
    id_list = list()
    next_page = ''
    bs = BeautifulSoup(page, "html5lib")
    
    for item in bs.find_all('a'):
        # Anchors without an href return None, which re.search cannot handle
        match_id = re.search(r'\d{6}$', item.get("href") or "")
        match_lk = re.search(r'Next \d+ articles', item.get_text(strip = True))
        if match_id:
            id_list.append(match_id.group())
        elif match_lk:
            next_page = item.get("href")
    
    return (id_list, next_page)

def fetch_doc(doc_id):
    
    doc_req = urllib2.Request(url_base + "/lingbuzz/" + doc_id)
    doc_req.add_header("User-Agent", user_agent)  # the HTTP header name uses a hyphen
    doc_response = urllib2.urlopen(doc_req)
    
    try:
        doc = doc_response.read()
    except httplib.IncompleteRead, e:
        doc = e.partial
    doc_response.close()

    return doc

def parse_doc(html_doc, doc_id):
    
    parsed = BeautifulSoup(html_doc, "html5lib")
    
    if parsed:
        article = dict()
        article["ref"] = doc_id
        text_head = parsed.body.center.get_text("\n", strip = True).lower().split("\n")
        text_body = parsed.body.get_text("\n", strip = True).lower().split("\n")
        article["tit"] = text_head[0]
        article["dat"] = text_head[-1]
        article["aut"] = list()
        
        for val in text_head[1:-1]:
            if val != ",":
                article["aut"].append(str(val.encode("ascii", "ignore")))
        
        try:
            article["pub"] = text_body[text_body.index("published in:")+1]
        except ValueError, v:
            article["pub"] = 'N/A'
        
        article["kwd"] = text_body[text_body.index("keywords:")+1]
        article["cnt"] = text_body[text_body.index("downloaded:")+1]
        text_bits = list()
        
        for tb in text_body[text_body.index(article["dat"])+1:text_body.index("format:")]:
            text_bits.append(tb)
        article["exc"] = ' '.join(text_bits)
        
        return article
    
    else:
        return None
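
`parse_doc` finds most fields by locating a label line such as `keywords:` and taking the line that follows it. Only the `published in:` lookup is guarded, so a page without `keywords:` or `downloaded:` would raise a `ValueError`. The sketch below shows the same lookup with a fallback value. `field_after` is a hypothetical helper, not part of the notebook, and the sample lines (including the wording of the download count) are invented and may not match the real page layout.

```python
def field_after(lines, label, default=None):
    """Return the line following `label`, or `default` if the label is absent."""
    try:
        return lines[lines.index(label) + 1]
    except (ValueError, IndexError):
        # ValueError: label not found; IndexError: label is the last line.
        return default


# Hypothetical page text, already lowercased and split on newlines.
sample_lines = [
    "a made-up title",
    "jane doe",
    "january 2015",
    "a short made-up excerpt",
    "format:",
    "[ pdf ]",
    "keywords:",
    "syntax, semantics",
    "downloaded:",
    "42 times",
]

print(field_after(sample_lines, "keywords:"))
print(field_after(sample_lines, "published in:", default="N/A"))
```

With a helper of this kind, a missing label yields a placeholder instead of stopping the whole run on one malformed page.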

The cell below may need to be run several times, as _Error 503_ may pop up from time to time; documents already stored in `ref_dict` are skipped on each new run.
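
One possible way to reduce manual re-runs is to retry a request when the server answers with a 503. The sketch below is an assumption about how this could be done, not something the notebook implements: `fetch_with_retry` is a hypothetical wrapper, and the retry count and delays are arbitrary. The demonstration uses a stand-in function that fails twice, so it runs without network access.

```python
import time

try:
    from urllib.error import HTTPError   # Python 3
except ImportError:
    from urllib2 import HTTPError        # Python 2


def fetch_with_retry(fetch, arg, retries=5, delay=2.0, sleep=time.sleep):
    """Call fetch(arg), retrying on HTTP 503 with a growing pause."""
    for attempt in range(retries):
        try:
            return fetch(arg)
        except HTTPError as err:
            # Only 503 is treated as transient; anything else is re-raised,
            # and so is a 503 on the last attempt.
            if err.code != 503 or attempt == retries - 1:
                raise
            sleep(delay * (2 ** attempt))


# Offline demonstration: a stand-in for fetch_doc that fails twice.
calls = {"n": 0}


def flaky_fetch(doc_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPError("http://example.invalid/" + doc_id, 503,
                        "Service Temporarily Unavailable", None, None)
    return "<html>" + doc_id + "</html>"


result = fetch_with_retry(flaky_fetch, "000123", sleep=lambda seconds: None)
print(result, calls["n"])
```

In the notebook, the wrapper could be applied as `fetch_with_retry(fetch_doc, doc_id)`, leaving the helper functions themselves unchanged.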

In [5]:
next_url = '/lingbuzz'
next_page = url_base + next_url
list_url = list()
while True:
    print('Processing...\t{0}'.format(next_page))
    id_list, next_url = fetchDocIdAndNextPage(next_page)
    # Process the IDs before testing for the last page, so that the
    # entries listed on the final page are not skipped.
    for doc_id in id_list:
        if doc_id not in ref_dict:
            ref_dict[doc_id] = parse_doc(fetch_doc(doc_id), doc_id)
    if next_url == '':
        break
    next_page = url_base + next_url
    list_url.append(next_url)
print('done; compiled {0} entries'.format(len(ref_dict.keys())))

Processing...	http://lingbuzz.auf.net/lingbuzz
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=iEVU1YfGUtL7_g76&start=31&74
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=_sMLn7_8XjZuQ8qf&start=131&186
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=KP3dwKIpL1VdmSfS&start=231&169
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=cYQGFOyNiM9PcOlP&start=331&181
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=Xj0-j0uO4ea2nzmS&start=431&195
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=1WpDW1_N6vHPVI6d&start=531&173
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=j5-mDWwX74lv3mLs&start=631&191
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=J44-RLTkYesfoNcb&start=731&167
Processing...	http://lingbuzz.auf.net/lingbuzz?_s=yOMERdpG_3N09Dtf&_k=BmPtWBlqKylPpkfj&start=831&174
Processing...	http://lingbuzz.auf.net/lingbuzz

Supplementary processing:

* Convert the number of downloads to an integer.
* Remove punctuation marks from the excerpt and convert it to lowercase.
* Convert the keywords entry to a list of lowercase keywords.

In [6]:
for k in ref_dict.keys():
    
    if type(ref_dict[k]['cnt']) != int: 
        ref_dict[k]['cnt'] = int(re.search(r'\d+', ref_dict[k]['cnt']).group())
    
    ref_dict[k]['exc'] = re.sub(r'[;:.\,()]', '', ref_dict[k]['exc']).lower()

    if type(ref_dict[k]['kwd']) == list:
        ref_dict[k]['kwd'] = [kw.lower() for kw in ref_dict[k]['kwd']]
    else:
        keywords = list()
        for kw in ref_dict[k]['kwd'].split(','):
            keywords.append(kw.strip().lower())
        ref_dict[k]['kwd'] = keywords
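
The three steps can also be expressed as a function applied to one entry, which makes them easy to check on a small sample. This is a sketch under assumptions: `normalize_entry` is a hypothetical name, the raw values below are invented, and unlike the cell above it also strips whitespace from keywords that are already stored as a list.

```python
import re


def normalize_entry(entry):
    """Return a copy of `entry` with cnt as int, exc cleaned, kwd as a list."""
    result = dict(entry)

    # Keep only the first run of digits, e.g. '42 times' becomes 42.
    if not isinstance(result['cnt'], int):
        result['cnt'] = int(re.search(r'\d+', result['cnt']).group())

    # Drop the same punctuation marks as the cell above, then lowercase.
    result['exc'] = re.sub(r'[;:.,()]', '', result['exc']).lower()

    # Accept either a comma-separated string or an existing list.
    kwd = result['kwd']
    if not isinstance(kwd, list):
        kwd = kwd.split(',')
    result['kwd'] = [kw.strip().lower() for kw in kwd]
    return result


# Hypothetical raw entry, invented for illustration.
raw = {
    'cnt': '42 times',
    'exc': 'We argue (briefly): Agree is local.',
    'kwd': 'Syntax, Agree , locality',
}
clean = normalize_entry(raw)
print(clean)
```

Applying the function to an already normalized entry leaves it unchanged, which matters here because the corpus is reloaded and post-processed again on every run.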

### 3 - Save Corpus

Save the corpus as a JSON file.

_(Note: to ensure compatibility with Python 3.4, replace `'wb'` with `'w'`.)_

In [7]:
with open('Data/lingbuzzCorpus.json', 'wb') as output:
    json.dump(ref_dict, output, indent = 4)
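
As an alternative to switching file modes by hand, the save and load steps could be written once for both Python versions with `io.open`. The sketch below is one way to do this, not the notebook's approach: `save_corpus` and `load_corpus` are hypothetical names, and the demonstration writes an invented entry to a temporary directory rather than to `Data/lingbuzzCorpus.json`.

```python
import io
import json
import os
import tempfile


def save_corpus(corpus, path):
    """Write `corpus` to `path` as UTF-8 JSON (Python 2 and 3)."""
    text = json.dumps(corpus, indent=4, sort_keys=True)
    if isinstance(text, bytes):      # Python 2: dumps returns a byte string
        text = text.decode('ascii')
    with io.open(path, 'w', encoding='utf-8') as output:
        output.write(text)


def load_corpus(path):
    """Read the corpus back, or return an empty dict if the file is missing."""
    try:
        with io.open(path, 'r', encoding='utf-8') as infile:
            return json.load(infile)
    except IOError:
        return dict()


# Demonstration with a throwaway file and an invented entry.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_corpus.json')
demo_corpus = {'000123': {'tit': 'a made-up title', 'cnt': 42}}
save_corpus(demo_corpus, demo_path)
print(load_corpus(demo_path))
```

Sorting the keys keeps the file stable between runs, so successive versions of the corpus can be compared with an ordinary diff.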