# Data extraction

This notebook provides code for extracting ACL Anthology data following the documentation here: https://acl-anthology.readthedocs.io/latest/api/anthology/ and arXiv data following the documentation here: https://www.kaggle.com/datasets/Cornell-University/arxiv/code. The extracted sentences are only candidates for the evaluation sets. The final selection is made manually, based on my linguistic judgement.

#### 1. Get ACL Anthology data

In [1]:
from acl_anthology import Anthology

anthology = Anthology.from_repo()
!pip show acl-anthology

Name: acl-anthology
Version: 0.5.1
Summary: A library for accessing the ACL Anthology
Home-page: https://github.com/acl-org/acl-anthology
Author: Marcel Bollmann
Author-email: marcel@bollmann.me
License: Apache-2.0
Location: /Users/doriellelonke/Desktop/thesis/.venv/lib/python3.12/site-packages
Requires: app-paths, attrs, citeproc-py, diskcache, docopt, gitpython, langcodes, lxml, numpy, omegaconf, platformdirs, pylatexenc, python-slugify, PyYAML, rich, rnc2rng, scipy, texsoup
Required-by: 


#### 2. Get arXiv data

In [10]:
# code snippet below taken from Kaggle docs

import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "arxiv-metadata-oai-snapshot.json"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "Cornell-University/arxiv",
  file_path,
  pandas_kwargs={"lines": True}
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)


Resuming download from 3353346048 bytes (1305981737 bytes left)...
Resuming download from https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv?dataset_version_number=232&file_name=arxiv-metadata-oai-snapshot.json (3353346048/4659327785) bytes left.


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.34G/4.34G [01:01<00:00, 21.4MB/s]
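
The snapshot is loaded with `lines=True`, which means it is a JSON Lines file with one record per line. If holding the full 4.34 GB file in a DataFrame is too heavy, one option is to stream it and keep only the three fields used later in this notebook. The following is a minimal sketch under that assumption: `iter_arxiv_triplets` is a hypothetical helper, the sample records are made up, and it collapses all whitespace runs, which is slightly more aggressive than the `replace('\n', ' ')` used in the next cell.

```python
import io
import json


def iter_arxiv_triplets(lines):
    """Yield (id, title, abstract) from an iterable of JSON Lines strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # collapse line breaks and repeated spaces inside titles and abstracts
        title = " ".join(str(record.get("title", "")).split())
        abstract = " ".join(str(record.get("abstract", "")).split())
        yield str(record.get("id", "")), title, abstract


# tiny in-memory stand-in for the snapshot file (made-up records)
sample = io.StringIO(
    '{"id": "0000.00001", "title": "A toy\\n  title", "abstract": "First line.\\nSecond line."}\n'
    '{"id": "0000.00002", "title": "Another title", "abstract": "One sentence."}\n'
)
triplets = list(iter_arxiv_triplets(sample))
print(triplets[0])
```

For the real data, an open file handle for the downloaded snapshot can be passed in place of `sample`.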


#### 3. Get papers in the form of a list of 3-tuples containing paper id, title and abstract

In [11]:
import spacy
nlp = spacy.load("en_core_web_md")

# zip 3-tuples of id, title and abstract from the arXiv dataframe
arxiv_paper_triplets = list(zip(df['id'].astype(str),df['title'].astype(str), df['abstract'].str.replace('\n', ' ')))

all_acl_papers = anthology.papers()
acl_paper_triplets = []

# iterate over acl object and obtain relevant data
for paper in all_acl_papers:
    if paper.abstract:
        paper_id = str(paper.id)
        paper_title = str(paper.title)
        paper_abstract = str(paper.abstract)
        paper_triplet = (paper_id,paper_title,paper_abstract)
        acl_paper_triplets.append(paper_triplet)

Unknown TeX-math command: \choose
Unknown TeX-math command: \textless
Unknown TeX-math command: \textgreater
Unknown TeX-math command: \textless
Unknown TeX-math command: \textgreater


#### 4. Initiate keyword list for finding relevant papers

In [59]:
# for a case-insensitive re.match with words from title:
title_keywords = ['AI','LM','LLM','GPT','ChatGPT'] 

# for a case insensitive re.search in title:
title_phrases = ['artificial intelligence','language model']

# for a case-insensitive whole-word re.search against AI entities in the abstract:
keywords = ['AI','LM','LMs','LLM','LLMs','model','system','algorithm','GPT','chatGPT'] 
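
The comments above name different matching strategies, and the choice matters for short keywords. `re.match` anchors only at the start of the string, so with `re.IGNORECASE` a keyword such as 'AI' also matches the beginning of longer words. The sketch below contrasts this with `re.fullmatch`. The helper names `prefix_hits` and `whole_word_hits` are hypothetical and only serve to illustrate the difference; they are not part of the extraction code.

```python
import re

title_keywords = ['AI', 'LM', 'LLM', 'GPT', 'ChatGPT']


def prefix_hits(word):
    """Keywords that match at the start of `word` (re.match behaviour)."""
    return [k for k in title_keywords if re.match(k, word, re.IGNORECASE)]


def whole_word_hits(word):
    """Keywords that match `word` in full (re.fullmatch behaviour)."""
    return [k for k in title_keywords if re.fullmatch(k, word, re.IGNORECASE)]


for word in ["AI", "LLMs", "Aims", "GPT-4"]:
    print(word, prefix_hits(word), whole_word_hits(word))
```

With prefix matching, "LLMs" is picked up through 'LLM', which is useful for plurals, but "Aims" is also picked up through 'AI'. With full matching, neither of the two is matched.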

#### 5. Functions for identifying specific linguistic patterns

In [73]:
import re
from tools import wordnet_syns as syns
import spacy
nlp = spacy.load("en_core_web_md")


def agent_subjects(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    active voice structure in which an AI entity is the nsubj of an anthropomorphic predicate

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """ 
    stop_words = ['do','be','have','show'] 
    extended_agent_verbs = syns.extend_word_list('agent_verbs','v') # extend the list of words with similar words using WordNet
    anthro_words = [w for w in extended_agent_verbs if w not in stop_words] # exclude stop words

    check = 0

    for chunk in sent.noun_chunks:
        match = any(re.search(rf"\b{re.escape(word)}\b", chunk.text, re.IGNORECASE) for word in ai_words)
        if match and chunk.root.dep_ == 'nsubj' and chunk.root.head.lemma_ in anthro_words:
            check += 1

    if check > 0:
        return True
    else:
        return False

def agent_objects(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    passive voice structure in which an AI entity is the object of an anthropomorphic predicate
    by checking whether there is a verb given in passive voice, and whose pobj is an AI entity

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """ 
    stop_words = ['do','be','have','show'] 
    extended_agent_verbs = syns.extend_word_list('agent_verbs','v') # extend the list of words with similar words using WordNet
    anthro_words = [w for w in extended_agent_verbs if w not in stop_words] # exclude stop words
    
    first_check = 0
    second_check = 0

    for chunk in sent.noun_chunks:
        if chunk.root.dep_ == 'nsubjpass' and chunk.root.head.lemma_ in anthro_words: # check that there is a passive anthro verb
            first_check += 1
    for chunk in sent.noun_chunks:
        match = any(re.search(rf"\b{re.escape(word)}\b", chunk.text, re.IGNORECASE) for word in ai_words) # check that the AI entity is pobj
        if match and first_check > 0 and chunk.root.dep_ == 'pobj':
            second_check += 1

    if second_check > 0:
        return True
    else:
        return False

def nonagent_objects(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    AI entity is the object of an anthropomorphic predicate
    by identifying AI entities as direct (dobj) / indirect (pobj) objects of anthropomorphic verbs

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """ 
    with open("../wordlists/nonagent_verbs_dobj.txt","r") as file:
        nonagent_verbs_dobj = [word.strip() for word in file.readlines()] 
    with open("../wordlists/nonagent_verbs_pobj.txt","r") as file:
        nonagent_verbs_pobj = [word.strip() for word in file.readlines()]

    check = 0

    for chunk in sent.noun_chunks:
        match = any(re.search(rf"\b{re.escape(word)}\b", chunk.text, re.IGNORECASE) for word in ai_words)
        if match and chunk.root.dep_ == 'dobj' and chunk.root.head.lemma_ in nonagent_verbs_dobj:
            check += 1
        if match and chunk.root.dep_ == 'pobj' and chunk.root.head.head.lemma_ in nonagent_verbs_pobj:
            check += 1

    if check > 0:
        return True
    else:
        return False


def adjective_phrases(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    AI entity is modified or complemented by an anthropomorphic adjective

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """     
    extended_adjectives = syns.extend_word_list('adjectives','n') # extend the list of words with similar words using WordNet

    check = 0

    for token in [token for token in sent if token.lemma_ in extended_adjectives]:
        if token.dep_ == 'amod' and any(re.search(rf"\b{re.escape(word)}\b", token.head.text, re.IGNORECASE) for word in ai_words) :
            check += 1
        elif token.dep_ == 'acomp':
            for descendant in token.head.subtree:
                if any(re.search(rf"\b{re.escape(word)}\b", descendant.text, re.IGNORECASE) for word in ai_words):
                    check += 1
                    
    if check > 0:
        return True
    else:
        return False

def noun_phrases(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    AI entity is part of an NP whose head is an anthropomorphic noun (assistant,teacher,...)

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """ 
    extended_nouns = syns.extend_word_list('nouns','n') # extend the list of words with similar words using WordNet
    
    check = 0

    for chunk in sent.noun_chunks:
        if chunk.root.lemma_ in extended_nouns and any(re.search(rf"\b{re.escape(word)}\b", chunk.text, re.IGNORECASE) for word in ai_words):
            check += 1
                    
    if check > 0:
        return True
    else:
        return False

def possessives(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    AI entity is followed by a possessive marker 's

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """ 
    check = 0

    for i,token in enumerate(sent):
        if i >= 1:
            prev_token = sent[i-1].text
        else:
            prev_token = "" # handling for first tokens in sentence which are never a possessive marker in well-formed sentences
        if token.text == "'s":
            if any(re.search(rf"\b{re.escape(word)}\b", prev_token, re.IGNORECASE) for word in ai_words):
                check += 1
                    
    if check > 0:
        return True
    else:
        return False

def comparisons(sent,ai_words):
    """
    this function checks whether a spaCy sentence adheres to the following linguistic structure:
    AI entity is being compared to human beings, by checking for specific comparison and human keywords

    :param sent: sentence from an abstract of a relevant (in-domain) paper
    :type sent: spacy.tokens.span.Span
    :param ai_words: list of AI entities to match inside the sentence
    :type ai_words: list of strings
    :return: True or False
    """ 
    if any(re.search(rf"\b{re.escape(word)}\b", str(sent), re.IGNORECASE) for word in ai_words):

        comparative_phrases = ["like","compared to","similarly to","similar to","resembles",
                               "resemble","resembling","mimic","mimics","better than","as"]
        human_phrases = ["humans","people","human beings","humanity","mankind","human","person",
                         "child","childlike","humanlike","human-like","children"]

        comparatives = any(re.search(rf"\b{re.escape(phrase)}\b", str(sent), re.IGNORECASE) for phrase in comparative_phrases)
        human = any(re.search(rf"\b{re.escape(phrase)}\b", str(sent), re.IGNORECASE) for phrase in human_phrases)
    
        if comparatives and human:
            return True
        else:
            return False

    else:
        return False
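
All of the functions above locate AI entities with the same whole-word regular expression over the raw text of a chunk, token or sentence. This matches surface forms rather than lemmas, so 'model' does not match "models", whereas the plurals 'LMs' and 'LLMs' are covered because they are listed explicitly. The sketch below isolates that check in a hypothetical helper, `mentions_ai`, to make the behaviour easy to inspect.

```python
import re

keywords = ['AI', 'LM', 'LMs', 'LLM', 'LLMs', 'model', 'system', 'algorithm', 'GPT', 'chatGPT']


def mentions_ai(text, ai_words=keywords):
    """Return True if any keyword occurs in `text` as a whole word, ignoring case."""
    return any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in ai_words
    )


for chunk_text in ["the language model", "large language models", "ChatGPT", "our aim"]:
    print(f"{chunk_text!r}: {mentions_ai(chunk_text)}")
```

If plural forms such as "models" or "systems" should count as AI entities, they would need to be added to the keyword list, or the comparison would need to run on lemmas instead.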

#### 6. Functions for iterating over papers and obtaining matching sentences, writing to .txt and .pkl files

The .txt files contain only the sentences and their unique IDs. Each ID consists of the paper id, the index of the paper in the tuple list, and the index of the sentence within the abstract. The IDs are used later to retrieve the selected sentences from the .pkl dataframe, which also stores the previous and next sentences needed for the Atypical Animacy evaluation.
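
As a rough illustration of the ID scheme and the tab-separated .txt layout, the sketch below builds an ID, reads a candidate line, and splits the ID back into its parts. The helper names `make_sentence_id`, `parse_sentence_id` and `read_candidates` are hypothetical, and the example paper id and sentence are made up. Splitting from the right is a precaution in case a paper id itself contains an underscore.

```python
import io


def make_sentence_id(paper_id, paper_idx, sent_idx):
    """Build an ID of the form <paper id>_<paper index>_<sentence index>."""
    return f"{paper_id}_{paper_idx}_{sent_idx}"


def parse_sentence_id(sent_id):
    """Split an ID back into its parts, splitting from the right so that
    underscores inside the paper id are left alone."""
    paper_id, paper_idx, sent_idx = sent_id.rsplit("_", 2)
    return paper_id, int(paper_idx), int(sent_idx)


def read_candidates(lines):
    """Read '<id><TAB><sentence>' lines into a dict keyed by sentence ID."""
    candidates = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sent_id, sentence = line.split("\t", 1)
        candidates[sent_id] = sentence
    return candidates


# made-up example line in the .txt layout
sample = io.StringIO(make_sentence_id("2020.toy-1.1", 42, 3) + "\tThe model understands the task.\n")
for sent_id, sentence in read_candidates(sample).items():
    print(parse_sentence_id(sent_id), sentence)
```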

In [74]:
import pandas as pd
import pickle 

def get_sentences(dataset,cat,lim):
    """
    this function finds possible candidates for sentences adhering to various linguistic structures.

    :param dataset: name of dataset from which sentences are being extracted
    :type dataset: string
    :param cat: class of linguistic structures from the taxonomy of anthropomorphic language
    :type cat: string
    :param lim: maximum number of matching sentences to collect
    :type lim: int
    :return: dictionary containing candidate sentences 
    """ 
    
    with open(f"../preprocessed_data/{dataset}_{cat}.txt","w") as file:

        done = False

        print(f"Looking for matching sentences for {cat} in the {dataset} dataset...")
        
        if dataset == "acl":
            paper_triplets = acl_paper_triplets
        elif dataset == "arxiv":
            paper_triplets = arxiv_paper_triplets

        if cat == "agent_subjects":
            criterion_met = agent_subjects
        elif cat == "agent_objects":
            criterion_met = agent_objects
        elif cat == "nonagent_objects":
            criterion_met = nonagent_objects
        elif cat == "adjective_phrases":
            criterion_met = adjective_phrases
        elif cat == "noun_phrases":
            criterion_met = noun_phrases
        elif cat == "possessives":
            criterion_met = possessives
        elif cat == "comparisons":
            criterion_met = comparisons
        else:
            print("The provided class of structures is not valid. Terminated process.")
            return
            
        counter = 0 # initiate counter
        sentences_dict = {"SentenceID":[],"currentSentence":[],"prevSentence":[],"nextSentence":[],"Abstract":[]}

        try: 
            
            for idx,paper in enumerate(paper_triplets):

                paper_id = paper[0]
                title = paper[1]
                abstract = paper[2]

                if done:
                    print(f"Process has finished successfully. {counter} sentences were logged.")
                    return sentences_dict

                words_in_title = [token.text for token in nlp(title)]
                keyword_match = any(re.match(keyword, word, re.IGNORECASE) for keyword in title_keywords for word in words_in_title)
                phrase_match = any(re.search(phrase, title.casefold(), re.IGNORECASE) for phrase in title_phrases)
        
                if keyword_match or phrase_match:
                    doc = nlp(abstract)
            
                    for i,sent in enumerate(doc.sents): # check for matches with the keywords in the noun chunks to find AI entities

                        if counter >= lim:
                            done = True
                            print(f"reached limit of {lim} sentences.")
                            break # stop when counter reaches the configured limit

                        sent_id = paper_id + "_" + str(idx) + "_" + str(i)

                        # check if the sentence adheres to the given structure
                        if criterion_met(sent,keywords):
                            counter += 1
                            file.write(sent_id+'\t'+sent.text+'\n')
                            sentences_dict["SentenceID"].append(sent_id)
                            sents = list(doc.sents)
                            sentences_dict["currentSentence"].append(sents[i].text)
                            sentences_dict["Abstract"].append(abstract)
                            # a negative index would wrap around to the last sentence, so guard i == 0 explicitly
                            sentences_dict["prevSentence"].append(sents[i-1].text if i > 0 else "")
                            # the last sentence of an abstract has no successor
                            sentences_dict["nextSentence"].append(sents[i+1].text if i + 1 < len(sents) else "")
                            #print(f"Found sentence - wrote to file and added to dictionary. counter is {counter}")

            print(f"Finished iterating through the sentences. {counter} sentences were found.")
        
        except UnboundLocalError:
            print("The provided dataset is not valid. Terminated process.")
            return
                            
    return sentences_dict

def write_to_files(dataset,cat,lim):
    sentence_dict = get_sentences(dataset,cat,lim)
    sentence_df = pd.DataFrame(data=sentence_dict)
    sentence_df.to_pickle(f"../preprocessed_data/dataframes/{dataset}_{lim}_{cat}.pkl")
    print(f"{dataset} {cat} dataframe was saved as .pkl file.")

#### 7. Retrieve candidate sentences for each category

Change the cat argument of write_to_files to select a category. The options are:
1. agent_subjects - sentences in which the AI entity is the subject of an anthropomorphic verb (nsubj)
2. agent_objects - sentences in which the AI entity is the agent of an anthropomorphic verb in the passive voice, realised as the object of a preposition (pobj)
3. nonagent_objects - sentences in which the AI entity is the object (cognizer) of an anthropomorphic verb
4. adjective_phrases - sentences in which the AI entity is part of an anthropomorphic adjectival phrase
5. noun_phrases - sentences in which the AI entity is part of an anthropomorphic noun phrase
6. possessives - sentences in which the AI entity is immediately followed by a possessive marker
7. comparisons - sentences in which the AI entity is being compared to humans explicitly
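
Because the category is passed as a plain string, a misspelled name only surfaces as the "not valid" message at run time. One possible safeguard, sketched here with a hypothetical helper `check_category` that is not part of the notebook, is to validate the name up front and suggest the closest valid option using the standard library's `difflib`.

```python
import difflib

# the seven category names listed above
CATEGORIES = (
    "agent_subjects",
    "agent_objects",
    "nonagent_objects",
    "adjective_phrases",
    "noun_phrases",
    "possessives",
    "comparisons",
)


def check_category(cat):
    """Return `cat` if it is a known category, otherwise raise with a suggestion."""
    if cat in CATEGORIES:
        return cat
    close = difflib.get_close_matches(cat, CATEGORIES, n=1)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    raise ValueError(f"Unknown category '{cat}'.{hint}")


print(check_category("comparisons"))
try:
    check_category("adjective_phrasess")  # deliberately misspelled
except ValueError as err:
    print(err)
```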

In [75]:
write_to_files("acl","comparisons",1000)

Looking for matching sentences for comparisons in the acl dataset...
Finished iterating through the sentences. 534 sentences were found.
acl comparisons dataframe was saved as .pkl file.
