In [1]:
import fitz  # PyMuPDF: Python package for efficient text extraction from PDFs

import re

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

from nltk.corpus import stopwords


The first step in the project is to reliably read a PDF and extract its text. The function below, `read_pdf`, does just that using the PyMuPDF package.

In [2]:
def read_pdf(file_path):
    '''
    This function takes an argument of the file path of a linguistic paper in PDF format and returns its text content.
    '''
    text_content = ''
    
    try:
        # open the PDF file
        with fitz.open(file_path) as pdf:
            # iterate over each page
            for page in pdf:
                # extract text from the page and add it to the content
                text_content += page.get_text()
    # error handling            
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

    return text_content

Here is an example of this function in action, taking in Professor Danis' paper as a PDF, extracting the text and printing a segment to confirm its functionality.

In [3]:
file_path = 'data/papers/paper1.pdf'
paper_text = read_pdf(file_path)
print(paper_text[500:1000])

acent vowel deletes is slightly but significantly longer than a short vowel in non-deletional 
contexts (p < 0.001). In the configuration studied here, deletion occurs in the vowel of a CV 
verb when occurring before a V-initial direct object (/CV1 +V2 / → [CV2]). However, instead 
of full vowel deletion as it is previously analysed (e.g. Akinlabi and Oyebade 1987, Ola Orie 
and Pulleyblank 2002), a compensatory lengthening analysis is proposed based on this new 
phonetic evidence. The experimen


The next step is to preprocess the text, removing the parts of the paper that are not needed. I designed the function `preprocess_text` to drop the references section, strip email addresses, URLs, and DOIs, and apply some basic normalization. Acknowledgements are not removed by this version of the function.

In [4]:
def preprocess_text(text_content):
    '''
    This function takes a text argument and returns it preprocessed: the references section, email addresses, URLs, and DOIs are removed, and the text is normalized (lowercased, special characters replaced with spaces).
    '''
    # remove references section
    text_content = re.sub(r'\n?References\n?.*', '', text_content, flags=re.DOTALL)
    # remove email addresses, URLs, and DOIs (only the matching token, not the whole line)
    text_content = re.sub(r'\S*@\S*\s?', '', text_content)
    text_content = re.sub(r'http\S+', '', text_content)
    text_content = re.sub(r'doi:\S+', '', text_content)
    # normalize text by converting to lowercase and removing special characters
    # keep hyphenated terms and apostrophes within words
    text_content = re.sub(r'[^a-zA-Z0-9\s\-\']', ' ', text_content)
    text_content = text_content.lower()
    
    # collapse runs of two or more spaces into a single space (newlines are kept)
    text_content = re.sub(' +', ' ', text_content)
    
    # strip whitespace at the beginning and end of the text
    text_content = text_content.strip()

    return text_content



Here is an example of this function in action, taking in the textualized PDF from the last step and cleaning it up.

In [5]:
preprocessed_text = preprocess_text(paper_text)
print(preprocessed_text[500:1000])

 significantly longer than a short vowel in non-deletional 
contexts p 0 001 in the configuration studied here deletion occurs in the vowel of a cv 
verb when occurring before a v-initial direct object cv1 v2 cv2 however instead 
of full vowel deletion as it is previously analysed e g akinlabi and oyebade 1987 ola orie 
and pulleyblank 2002 a compensatory lengthening analysis is proposed based on this new 
phonetic evidence the experiment for this study controlled for inherent vowel duration 
vo
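
One thing to watch in `preprocess_text`: the reference-stripping pattern is not anchored to a line, so it cuts the text at the first capitalized "References" wherever it appears, including mid-sentence. Below is a minimal sketch of a heading-anchored alternative. The helper `strip_references` and the sample string are hypothetical, and the sketch assumes the section heading sits on a line of its own in the extracted text.

```python
import re

# Pattern used in preprocess_text: matches "References" anywhere in the text.
ORIGINAL = re.compile(r'\n?References\n?.*', flags=re.DOTALL)

# Hypothetical alternative: only match a line consisting solely of "References",
# then drop everything from that heading to the end of the text.
HEADING = re.compile(r'^[ \t]*References[ \t]*$.*', flags=re.MULTILINE | re.DOTALL)

def strip_references(text):
    """Remove the references section, assuming its heading is on its own line."""
    return HEADING.sub('', text)

sample = (
    "Intro.\n"
    "References to prior work are given below.\n"
    "Body text.\n"
    "References\n"
    "Akinlabi and Oyebade 1987.\n"
)

original_result = ORIGINAL.sub('', sample)   # cuts at the in-sentence "References"
anchored_result = strip_references(sample)   # cuts only at the heading line
print(repr(original_result))
print(repr(anchored_result))
```

If a paper uses a different heading ("Bibliography", "Works Cited", numbered headings), the pattern would need to be extended accordingly.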


The next step is to process the text. The function `process_text` uses NLTK's `word_tokenize` and `pos_tag` to tokenize the text and tag each token with its part of speech (POS).

In [6]:


def process_text(text):
    '''
    This function takes a text argument and returns it tokenized and POS tagged, using NLTK's word_tokenize and pos_tag.
    '''
    # tokenization
    tokens = word_tokenize(text)
    
    # POS tagging
    tagged_tokens = pos_tag(tokens)
    
    return tagged_tokens



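Here is an example of this function in action, tokenizing and POS tagging the paper. Note that the call below passes the raw extracted text (`paper_text`) rather than `preprocessed_text`, which is why the output retains capitalization and punctuation.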

In [7]:
tagged_paper_text = process_text(paper_text)
print(tagged_paper_text[500:550])

[('future', 'JJ'), ('research', 'NN'), ('.', '.'), ('2', 'CD'), ('.', '.'), ('Vowel', 'NNP'), ('deletion', 'NN'), ('process', 'NN'), ('In', 'IN'), ('discussing', 'VBG'), ('the', 'DT'), ('deletion', 'NN'), ('process', 'NN'), (',', ','), ('the', 'DT'), ('vowel', 'NN'), ('that', 'WDT'), ('remains', 'VBZ'), ('after', 'IN'), ('an', 'DT'), ('adjacent', 'JJ'), ('vowel', 'NN'), ('deletes', 'NNS'), ('is', 'VBZ'), ('the', 'DT'), ('remnant', 'JJ'), ('vowel', 'NN'), ('(', '('), ('as', 'IN'), ('stated', 'VBN'), ('above', 'IN'), (')', ')'), ('.', '.'), ('Likewise', 'NNP'), (',', ','), ('a', 'DT'), ('short', 'JJ'), ('vowel', 'NN'), ('outside', 'IN'), ('of', 'IN'), ('deletion', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('“', 'JJ'), ('simple', 'JJ'), ('vowel', 'NN'), ('”', 'NNP'), ('.', '.'), ('Any', 'CC'), ('analysis', 'NN')]
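
In the output above, the curly quotation marks are tagged as `JJ` and `NNP`, so they could slip into a noun-phrase pattern later on. One possible guard, sketched here under the assumption that a useful term always contains at least one letter, is to drop tokens with no alphabetic characters before term extraction. The helper `drop_non_word_tokens` is hypothetical and is not part of the notebook's pipeline.

```python
def drop_non_word_tokens(tagged_tokens):
    """Keep only (word, tag) pairs whose word contains at least one letter."""
    return [(word, tag) for word, tag in tagged_tokens
            if any(ch.isalpha() for ch in word)]

# Hand-copied fragment in the style of the tagger output shown above.
sample = [('a', 'DT'), ('“', 'JJ'), ('simple', 'JJ'), ('vowel', 'NN'),
          ('”', 'NNP'), ('.', '.'), ('2', 'CD')]

cleaned = drop_non_word_tokens(sample)
print(cleaned)
```

This also removes bare numbers such as `2`, while mixed tokens such as `CV1` would be kept.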


The next step is where things get more difficult: term extraction. I had several ideas for how to approach this, and ended up combining a few of them.

I knew one of the initial steps would have to be stopword filtering.

My first idea was to identify noun phrases as potential terms. These often consist of a noun (NN, NNS), sometimes preceded by a determiner (DT), adjectives (JJ), or another noun in the case of a compound noun.
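
To make the tag pattern concrete, here is a small pure-Python sketch that matches an optional determiner, any number of adjectives, and one or more nouns over a list of tagged tokens. The function `chunk_noun_phrases` and the hand-tagged sample are illustrative assumptions; it is a simplified stand-in for a chunk grammar, not the implementation used later in this notebook.

```python
def chunk_noun_phrases(tagged_tokens):
    """Greedy left-to-right matcher for the tag pattern DT? JJ* (NN|NNS)+."""
    phrases = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        # optional determiner
        if tagged_tokens[j][1] == 'DT':
            j += 1
        # zero or more adjectives
        while j < n and tagged_tokens[j][1] == 'JJ':
            j += 1
        # one or more nouns
        k = j
        while k < n and tagged_tokens[k][1] in ('NN', 'NNS'):
            k += 1
        if k > j:
            # at least one noun was found: record the phrase and skip past it
            phrases.append(' '.join(word for word, _ in tagged_tokens[i:k]))
            i = k
        else:
            # no noun phrase starts here: move on by one token
            i += 1
    return phrases

# Hand-tagged sample, so no tagger is needed to run the sketch.
sample = [('the', 'DT'), ('remnant', 'JJ'), ('vowel', 'NN'), ('is', 'VBZ'),
          ('a', 'DT'), ('simple', 'JJ'), ('short', 'JJ'), ('vowel', 'NN'),
          ('.', '.')]

noun_phrases = chunk_noun_phrases(sample)
print(noun_phrases)
```

Determiners are included in the matched phrase here; they would be removed afterwards by stopword filtering.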

Another idea was frequency filtering: looking for words that appear multiple times in the text, suggesting increased importance in the paper.
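
As a rough sketch of frequency filtering, the snippet below counts candidate terms and keeps those that occur at least a given number of times. The function name `frequent_terms`, the `min_count` parameter, and the sample candidates are assumptions made for illustration. Lowercasing before counting is also an assumption, added so that case variants are counted together.

```python
from collections import Counter

def frequent_terms(candidates, min_count=2):
    """Count candidate terms case-insensitively and keep the frequent ones."""
    counts = Counter(term.lower() for term in candidates)
    return {term: count for term, count in counts.items() if count >= min_count}

# Made-up candidate list for demonstration only.
candidates = ['vowel deletion', 'Vowel deletion', 'mora',
              'pilot study', 'mora', 'mora']

kept = frequent_terms(candidates)
print(kept)
```

The threshold is a judgment call: a low value keeps more noise, while a high value can drop genuine terms that a paper only mentions once or twice.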

Further research led me to Named Entity Recognition, which I thought could be a valuable addition to some of the other methods I had already thought up.

A less solid idea was to check the text against a predefined list of linguistic terms, which would be less reliable but could help increase the validity of my approach by being one of the later steps in the workflow.
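
A minimal sketch of that last idea: given a dictionary of candidate terms and their counts, keep only those found in a known-term list. The helper `filter_by_glossary` and the tiny term list are hypothetical; a real list would have to come from an external source such as a linguistics glossary.

```python
def filter_by_glossary(term_counts, known_terms):
    """Keep only candidate terms that appear in the known-term list (case-insensitive)."""
    known = {term.lower() for term in known_terms}
    return {term: count for term, count in term_counts.items()
            if term.lower() in known}

# Hypothetical inputs for illustration only.
term_counts = {'compensatory lengthening': 18, 'results': 12, 'Mora': 6, 'work': 4}
known_terms = ['compensatory lengthening', 'mora', 'phoneme']

glossary_terms = filter_by_glossary(term_counts, known_terms)
print(glossary_terms)
```

An exact-match filter like this is strict: a phrase such as "full vowel deletion" would be dropped unless that exact phrase is in the list, which is one reason to use it as a late, optional step rather than the main extraction method.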


In [8]:
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')

def enhanced_term_identification(tagged_tokens, linguistic_terms_set=None):
    '''
    This function combines several strategies to identify and filter terms for a glossary.
    It applies noun phrase extraction, stopword removal, an optional linguistic term
    filter, and frequency and length filtering. Named Entity Recognition (NER) is not
    implemented in this version.

    Parameters:
    - tagged_tokens: a list of POS-tagged tokens from the text.
    - linguistic_terms_set: an optional set of known linguistic terms for additional filtering.

    Returns:
    - A dictionary of filtered terms and their frequencies.
    '''
    stop_words = set(stopwords.words('english'))
    noun_phrases = []

    grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(tagged_tokens)

    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        phrase = " ".join(word for word, tag in subtree.leaves() if word.lower() not in stop_words and len(word) > 1)
        if phrase:
            noun_phrases.append(phrase)

    term_counts = Counter(noun_phrases)

    if linguistic_terms_set:
        term_counts = {term: count for term, count in term_counts.items() if term in linguistic_terms_set}

    filtered_terms = {term: count for term, count in term_counts.items() if count > 1 and len(term) > 3}

    return filtered_terms


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ethannussinov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
blah = enhanced_term_identification(tagged_paper_text)

In [10]:
blah

{'work': 4,
 'vowel deletion': 14,
 'compensatory lengthening': 18,
 'Evidence': 6,
 'phonetics': 12,
 'vowel': 15,
 'adjacent vowel deletes': 2,
 'short vowel': 13,
 'configuration': 2,
 'verb': 7,
 'full vowel deletion': 2,
 'compensatory': 4,
 'analysis': 7,
 'experiment': 3,
 'study': 3,
 'voicing': 3,
 'manner': 2,
 'articulation': 3,
 'results': 12,
 'tone': 8,
 'direct object': 3,
 'phonology': 6,
 'deletion': 14,
 'vowel duration': 3,
 'vowel deletion process': 2,
 'pilot study': 2,
 'duration': 19,
 'underived short vowel': 2,
 'remnant vowel': 19,
 'sequence': 7,
 'tata': 2,
 'grasshopper': 4,
 'process': 10,
 'full deletion': 5,
 'difference': 6,
 'account': 2,
 'mora': 6,
 'phonetic module': 2,
 'http': 6,
 '//spilplus.journals.ac.za': 2,
 'speaker': 4,
 'vowels': 9,
 'data': 2,
 'contexts': 2,
 'deletion process': 3,
 'simple vowel': 3,
 'simple short vowel': 4,
 'standard phonological account': 5,
 'word': 2,
 'structure': 4,
 'projects': 3,
 'phonological account': 4,
 '