# Scraping LingBuzz

[LingBuzz](https://ling.auf.net/) is an openly accessible repository of scholarly papers, discussions and other documents for linguistics. 
This resource was chosen for practical reasons:
* LingBuzz collects a large number of linguistics papers (old and new, published or awaiting publication) in one place, so they can be scraped without searching all over the internet for open-access papers.
* Authors submit their papers to LingBuzz voluntarily, with no [money](https://www.theatlantic.com/science/archive/2016/01/elsevier-academic-publishing-petition/427059/) involved.
* It covers a wide variety of subjects and authors.

There are four categories of papers:
* Syntax
* Morphology
* Phonology
* Semantics

In [1]:
import pandas as pd
import json
from sklearn.utils import shuffle

Get a list of the URLs of all the PDFs, using Scrapy:

In [None]:
import scrapy


class LingbuzzSpider(scrapy.Spider):
    name = 'lingbuzz'

    # Be polite to the server: throttle requests and cache responses.
    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 3,
        "HTTPCACHE_ENABLED": True
    }

    start_urls = ['https://ling.auf.net/']

    def parse(self, response):
        # Yield the (partial) href of every "[pdf]" link on the page.
        for href in response.xpath('//a[contains(text(), "[pdf]")]/@href').extract():
            yield {'url': href}

        # Follow the last "Next" link, if any; the final page has none.
        next_links = response.xpath('//a[contains(text(), "Next")]/@href').extract()
        if next_links:
            next_url = 'https://ling.auf.net' + next_links[-1]
            yield scrapy.Request(url=next_url, callback=self.parse)

In [3]:
urls = pd.read_json('lingbuzz/lingbuzz.json', orient='records')

These are partial URLs: 'https://ling.auf.net' must be prepended to each of them before they can be downloaded. Also, not all links point to actual PDFs; some just redirect elsewhere.

In [9]:
# Keep only the links that contain "current" and make them absolute.
final = []
for url in urls.url:
    if "current" in url:
        final.append('https://ling.auf.net' + url)
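The prefix-and-filter step can also be written with `urllib.parse.urljoin` from the standard library, which handles a missing or duplicated slash between the base and the path. This is only a minimal sketch, assuming the scraped values are site-relative paths; `build_pdf_urls` is a hypothetical helper, and the sample paths are made up for illustration rather than taken from the real scrape.

```python
from urllib.parse import urljoin

BASE_URL = "https://ling.auf.net"


def build_pdf_urls(partial_urls, base_url=BASE_URL):
    """Return absolute URLs for the links that contain "current"."""
    return [urljoin(base_url, u) for u in partial_urls if "current" in u]


# Hypothetical sample paths, not taken from the real scrape.
sample = ["/lingbuzz/000001/current.pdf", "/lingbuzz/000002"]
sample_urls = build_pdf_urls(sample)
```

Applied to the real data, this would be called as `build_pdf_urls(urls.url)` and should give the same result as the loop whenever every partial URL starts with a slash.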

In [11]:
len(final)

3473

Randomly select 750 of the papers:

In [12]:
subset = shuffle(final)[:750]

In [14]:
len(subset)

750
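Without a fixed seed, the random selection differs on every run, so the same 750 papers cannot be recovered later. One possible way to make the subset reproducible is sketched here with the standard library; `pick_subset` and its `seed` parameter are hypothetical names, and the demo list is a stand-in for `final`. (`sklearn.utils.shuffle` also accepts a `random_state` argument for the same purpose.)

```python
import random


def pick_subset(items, k, seed=0):
    """Pick k distinct items, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(list(items), k)


# Hypothetical stand-in for the `final` list of URLs.
demo = ["url-%d" % i for i in range(10)]
picked = pick_subset(demo, 3, seed=42)
```

Calling `pick_subset(final, 750, seed=42)` twice would return the same 750 URLs in the same order.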

Write the list of URLs to a text file, for bulk download on a remote machine.

In [13]:
with open('urls.txt', 'w') as f:
    for item in subset:
        f.write("%s\n" % item)
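The download step itself is not shown in this notebook. A rough sketch of how `urls.txt` might be consumed on the remote machine follows, using only the standard library. The helper names, the output directory and the 3-second delay are assumptions, not the setup actually used. Because many links may end in the same file name, the local name is built from the whole URL path.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url):
    """Build a local file name from the full URL path to avoid clashes."""
    return urlparse(url).path.strip("/").replace("/", "_")


def download_all(url_file, out_dir, delay=3.0):
    """Fetch every URL listed in url_file into out_dir, one at a time."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        target = out / filename_for(url)
        if target.exists():
            continue  # already downloaded on an earlier run
        try:
            urllib.request.urlretrieve(url, str(target))
        except OSError as err:  # URLError and HTTPError subclass OSError
            print("skipped %s: %s" % (url, err))
        time.sleep(delay)  # pause between requests to stay polite
```

A call such as `download_all("urls.txt", "pdfs")` would then fetch the files one by one. Since some links redirect to pages that are not PDFs, the downloaded files would still need checking afterwards.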