Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.4: Extracting Publications

The ACL Anthology is the most relevant resource for research publications on natural language processing.

**Take a look at its search options in the browser:**
https://www.aclweb.org/anthology/search/?q=opinion+mining



## 1. Querying BibTeX

Publications are commonly stored as BibTeX files. In this notebook, we work with a small subset of the anthology (the first 20,000 lines): anthology_small.bib

**Inspect the file and make sure you understand the structure.**

To get better results for your queries, download the full anthology from https://www.aclweb.org/anthology/anthology.bib.gz, extract it to your LaD/Lab1 folder, and change the path in the open() call below so that it points to the extracted file.

Let's load the file and parse it using bibtexparser (this takes a moment): 

In [None]:
import bibtexparser

with open("../data/anthology_small.bib", encoding="utf-8") as bibtex_file:
    # Parse the BibTeX file - this takes a while.
    # Note: bparser.BibTexParser belongs to the bibtexparser 1.x API.
    parser = bibtexparser.bparser.BibTexParser(common_strings=True)
    print("Loading...")
    anthology = bibtexparser.load(bibtex_file, parser)
    print("Done.")
    # Keep only entries of the type "inproceedings"
    articles = [article for article in anthology.entries if article["ENTRYTYPE"] == "inproceedings"]
    
    print("Number of articles: " + str(len(articles)))
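
To get a feel for the structure after parsing, it can help to count entry types and fields. The sketch below assumes the representation used in the cell above, where each entry is a dictionary with the keys ENTRYTYPE and ID plus one key per BibTeX field. The three sample entries are made up for illustration and stand in for anthology.entries.

```python
from collections import Counter

# Made-up entries in the shape that the parsing cell above produces
# (hypothetical IDs, titles and URL; not taken from the anthology)
sample_entries = [
    {"ENTRYTYPE": "inproceedings", "ID": "doe-2020-example",
     "title": "An Example Paper", "year": "2020"},
    {"ENTRYTYPE": "proceedings", "ID": "example-2020",
     "title": "Proceedings of an Example Workshop", "year": "2020"},
    {"ENTRYTYPE": "inproceedings", "ID": "roe-2018-example",
     "title": "Another Example", "year": "2018",
     "url": "https://example.org/roe-2018-example"},
]

# How many entries of each type are there?
type_counts = Counter(entry["ENTRYTYPE"] for entry in sample_entries)

# In how many entries does each field occur?
field_counts = Counter(field for entry in sample_entries for field in entry)

print(type_counts.most_common())
print(field_counts.most_common())
```

Running the same two counters on anthology.entries shows which fields are present in only some entries, which is why code that reads a field such as url should expect a KeyError.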
    

## 2. Saving results

We can now query the articles: for every article whose title contains the query, we look up its URL and save both the abstract and the full PDF.

**Try out different queries. Add code to filter by author, year, or booktitle. You can also modify the code to use regular expressions as queries.**
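
One possible way to combine such filters is sketched here. The function names clean_title and matches, their parameters, and the two sample entries are hypothetical; the entries only mimic the dictionaries stored in articles, where every field value (including the year) is a string.

```python
import re


def clean_title(title):
    """Lowercase a title and drop the curly braces that protect capitals."""
    return title.replace("{", "").replace("}", "").lower()


def matches(entry, title_pattern=None, author=None, year=None, booktitle=None):
    """Return True if the entry passes every filter that was given."""
    if title_pattern and not re.search(title_pattern, clean_title(entry.get("title", ""))):
        return False
    if author and author.lower() not in entry.get("author", "").lower():
        return False
    if year and entry.get("year") != str(year):
        return False
    if booktitle and booktitle.lower() not in entry.get("booktitle", "").lower():
        return False
    return True


# Made-up entries for illustration only
sample_articles = [
    {"ID": "doe-2020-example", "title": "Opinion Mining on {S}ocial Media",
     "author": "Doe, Jane and Roe, Richard", "year": "2020",
     "booktitle": "Proceedings of an Example Workshop"},
    {"ID": "roe-2018-example", "title": "Parsing News Text",
     "author": "Roe, Richard", "year": "2018",
     "booktitle": "Proceedings of an Example Conference"},
]

# Regular expression on the title, combined with an exact year
hits = [e["ID"] for e in sample_articles
        if matches(e, title_pattern=r"social\s+media|opinion", year=2020)]
print(hits)
```

Because the title is lowercased before matching, the regular expression should be written in lowercase as well. Missing fields are read with entry.get, so an entry without an author or booktitle is simply treated as not matching that filter.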

Note that the following error may occur: "Not Acceptable! An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security". According to the Stack Overflow thread linked here, this happens if cookies are turned off and mod_security requires cookies to match session data. Instead of the requested page, the server then returns only this error message. (https://stackoverflow.com/questions/28090737/not-acceptable-an-appropriate-representation-of-the-requested-resource-could-no)
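
When this happens, the download step still receives a response, so the error page can end up on disk under a .pdf name. A minimal sketch of a sanity check follows. The two helper names and the sample byte strings are hypothetical; the check relies on two general facts: "Not Acceptable" is the reason phrase of HTTP status 406, and PDF files start with the marker %PDF.

```python
def is_mod_security_error(status_code, content):
    """Heuristic check for the 'Not Acceptable!' page described above."""
    return status_code == 406 or b"Mod_Security" in content


def looks_like_pdf(content):
    """PDF files begin with the marker %PDF."""
    return content.startswith(b"%PDF")


# Made-up response bodies for illustration
error_page = b"<html>Not Acceptable! This error was generated by Mod_Security</html>"
pdf_bytes = b"%PDF-1.5\n(binary data would follow here)"

print(is_mod_security_error(406, error_page), looks_like_pdf(error_page))
print(is_mod_security_error(200, pdf_bytes), looks_like_pdf(pdf_bytes))
```

With the requests library, the two arguments would be pdf_response.status_code and pdf_response.content, checked before the file is written.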



In [None]:
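import os
import requests
from util_html import url_to_html

query = "social media"

pdf_path = "../results/acl_results/pdf/"
abstracts_path = "../results/acl_results/abstracts/"

# Create the output folders if they do not exist yet
os.makedirs(pdf_path, exist_ok=True)
os.makedirs(abstracts_path, exist_ok=True)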


# Some servers reject requests that do not send a User-Agent header:
headers = requests.utils.default_headers()
headers.update(
    {
        'User-Agent': 'My User Agent 1.0',
    }
)
# Alternative: send a browser-like User-Agent instead
# headers = {
#     'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
# }

for entry in articles:
    # Some titles contain curly braces to protect uppercase names. We remove them from the title.
    title = entry["title"].lower().replace("{", "").replace("}", "")

    # Skip articles that do not match the query
    if query not in title:
        continue

    try:
        # Get metadata (entry_id avoids shadowing the built-in id())
        entry_id = entry["ID"]
        # Collapse line breaks and repeated spaces in the author list
        author = " ".join(entry["author"].split())

        # Save abstract
        url = entry["url"]
        response = url_to_html(url)
        abstract = response.find(attrs={'class': 'card-body acl-abstract'}).text
        abstract = abstract.replace("Abstract", "", 1)

        with open(abstracts_path + entry_id + ".txt", 'w', encoding="utf-8") as f:
            f.write(author + "\n" + title + "\n" + abstract)

        # Save pdf
        pdf_response = requests.get(url + ".pdf", headers=headers)
        with open(pdf_path + entry_id + ".pdf", 'wb') as f:
            f.write(pdf_response.content)

        print(entry_id, title)
        print(url)
        print()

    # Ignore entries that lack an author or url field
    except KeyError:
        pass



## 3. Extracting PDFs

We would like to run our analyses on the full texts. However, extracting text from PDFs is not easy if you do not want to buy commercial software.

Try out the code below for extracting text from PDFs. You can ignore the warnings for now. Currently, the code prints only the first 10,000 characters of the first file.

In [None]:
from util_pdf import convert_pdf_to_txt
import os

for pdf_file in os.listdir(pdf_path):
    pdf = os.path.join(pdf_path,pdf_file)
    print(pdf)
    text = convert_pdf_to_txt(pdf)
    print(text[0:10000])
    print("\n\n")
    break
    
   

**Remove the "break" statement and save the extracted texts as txt files instead of printing them. Inspect the quality of the output and discuss which tasks it would be good enough for.**

One of the articles throws an error. What could be the reason?
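
A minimal sketch of how the saving step could look is given here. The function save_texts and its parameters are hypothetical; the extractor is passed in as an argument, so the function makes no assumptions about how convert_pdf_to_txt works internally. Files that fail are collected instead of stopping the loop, which makes it easier to find the article that throws the error.

```python
import os


def save_texts(pdf_dir, txt_dir, extract):
    """Convert every PDF in pdf_dir with extract(path) and write the text to txt_dir.

    Returns the names of the files that could not be converted.
    """
    os.makedirs(txt_dir, exist_ok=True)
    failed = []
    for name in sorted(os.listdir(pdf_dir)):
        # Skip anything that is not a PDF (e.g. hidden system files)
        if not name.lower().endswith(".pdf"):
            continue
        source = os.path.join(pdf_dir, name)
        target = os.path.join(txt_dir, os.path.splitext(name)[0] + ".txt")
        try:
            text = extract(source)
        except Exception as error:
            # Broad on purpose: we do not know which errors the extractor raises
            print("Could not convert", name, "->", repr(error))
            failed.append(name)
            continue
        with open(target, "w", encoding="utf-8") as f:
            f.write(text)
    return failed
```

In the notebook, a call could look like failed = save_texts(pdf_path, "../results/acl_results/txt/", convert_pdf_to_txt), where the txt folder name is only an assumption. The printed error message for each failed file is a starting point for answering the question above.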