# Computational Linguistics Final Project
### Automatic Glossary Creator

The idea for my final project is a Python workflow that reads in an academic paper, identifies relevant/important terms, then creates a glossary with a generated definition for each term.


## Step 1: Term Extraction

In [23]:
from utils import read_pdf, preprocess_text, process_text, extract_terms_with_tfidf, enhanced_term_identification

# Approach 1: TF-IDF

My initial approach was to extract terms using TF-IDF. This involved processing the text extracted from the academic papers and scoring each term by how often it appears, weighted against how common it is overall, to identify terms that are 'statistically significant' within the document.
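To make the scoring concrete, here is a minimal pure-Python sketch of TF-IDF. It is not the implementation behind `extract_terms_with_tfidf` (that code lives in `utils` and is not shown here); the function names `tfidf_scores` and `top_terms`, and the choice of treating each document as a list of tokens, are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Score every term in every document (hypothetical helper).

    documents: list of token lists, one per document.
    Returns one {term: score} dict per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    all_scores = []
    for tokens in documents:
        term_freq = Counter(tokens)
        total = len(tokens)
        scores = {}
        for term, count in term_freq.items():
            tf = count / total                      # relative frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # 0 when the term is in every document
            scores[term] = tf * idf
        all_scores.append(scores)
    return all_scores


def top_terms(documents, doc_index=0, k=5):
    """Return the k highest-scoring terms of one document (ties broken alphabetically)."""
    scores = tfidf_scores(documents)[doc_index]
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


docs = [
    ["vowel", "deletion", "vowel", "the"],
    ["the", "study"],
    ["the", "results"],
]
print(top_terms(docs, doc_index=0, k=2))
```

With this plain formulation, a term that occurs in every document gets an IDF of zero, and a corpus containing only one document gives every term the same IDF. TF-IDF therefore needs several documents (or chunks of one paper treated as documents) to say anything about distinctiveness.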

## Workflow:

### Pros: 
1. Automated, quantitative: works well at scale.
2. Easy to set up.

### Cons: 
1. No contextual understanding: elevates general or irrelevant terms that show up a lot but aren't necessarily significant.
2. Overemphasis on uninformative words: even after stop-word removal, this approach prioritizes obscure or infrequently used words that aren't useful for term identification.


This approach could likely be refined to work better, and I will look into it more over the next week, but at this point I decided to move on to a new approach.


In [24]:
file_path = 'data/papers/paper1.pdf'
paper_text = read_pdf(file_path)
preprocessed_text = preprocess_text(paper_text)


tfidf_terms = extract_terms_with_tfidf(preprocessed_text)

print(tfidf_terms)

['account', 'al', 'aj', 'additionally', 'approach', 'analysed', 'assimilation', 'articulation', 'assume', 'akinlabi', 'adjacent', 'appear', 'ak', 'akin', 'anonymous', 'anderson', 'ap', 'akg', 'alternate', 'amp', 'ambiguous', 'alternatively', 'acoustic', 'acknowledgements', 'adebola', 'accounts', 'acal', 'absolute', 'actually', 'aided', 'aid', 'advanced', 'amplitude', 'argued', 'analyses', 'analyse', 'assistance', 'assimilates', 'aaron']


# Approach 2: The Conglomerate Approach

The idea with this second approach was to combine a variety of different techniques that I had thought would all be useful, and see if combining them would lead to more effective results.

## Workflow:
1. Input preparation through `read_pdf` function
2. Text preprocessing through `preprocess_text`: removing references, URLs, email addresses, and special characters, normalizing text (by converting to lowercase), collapsing extraneous spaces, and stripping whitespace.
3. Term extraction setup through POS tagging, using the `process_text` function, which in turn uses the NLTK tokenizer and POS tagger. This is crucial for identifying noun phrases in the next step.
4. Noun phrase extraction. A specific grammar pattern (`"NP: {<DT>?<JJ>*<NN|NNS>+}"`) is defined to identify noun phrases. This pattern matches an optional determiner, any number of adjectives, and then one or more nouns. Using the defined grammar, a parser (in this case, `nltk.RegexpParser`) parses the tagged tokens and extracts sequences that match the noun phrase grammar.
5. Term identification and filtering -- extracted noun phrases are collected. Each phrase undergoes further filtering to exclude stopwords and single characters, ideally focusing on phrases that meaningfully represent concepts. The frequency of each noun phrase is calculated and stored, allowing for frequency-based filtering.
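
Steps 4 and 5 can be sketched without any dependencies. The notebook itself uses `nltk.RegexpParser`; the version below is a stand-in that applies the same noun phrase pattern as a plain regular expression over the sequence of POS tags. The names `chunk_noun_phrases` and `count_terms`, the tiny `STOPWORDS` set, and the `min_freq` threshold are assumptions for illustration, not the contents of `enhanced_term_identification`.

```python
import re
from collections import Counter

# Same shape as the grammar "NP: {<DT>?<JJ>*<NN|NNS>+}", applied to a string of tags.
NP_TAG_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "this", "that", "these", "those"}


def chunk_noun_phrases(tagged_tokens):
    """Return the word lists whose tag sequence matches NP_TAG_PATTERN.

    tagged_tokens: list of (word, tag) pairs, e.g. the output of a POS tagger.
    """
    tag_string = ""
    offset_to_index = {}  # character offset in tag_string -> token index
    for index, (_, tag) in enumerate(tagged_tokens):
        offset_to_index[len(tag_string)] = index
        tag_string += f"<{tag}>"
    offset_to_index[len(tag_string)] = len(tagged_tokens)

    phrases = []
    for match in NP_TAG_PATTERN.finditer(tag_string):
        start = offset_to_index[match.start()]
        end = offset_to_index[match.end()]
        phrases.append([word for word, _ in tagged_tokens[start:end]])
    return phrases


def count_terms(tagged_tokens, min_freq=2):
    """Drop stopwords and single characters, then keep phrases seen at least min_freq times."""
    counts = Counter()
    for words in chunk_noun_phrases(tagged_tokens):
        kept = [w.lower() for w in words if w.lower() not in STOPWORDS and len(w) > 1]
        if kept:
            counts[" ".join(kept)] += 1
    return {term: n for term, n in counts.items() if n >= min_freq}


tagged = [
    ("the", "DT"), ("short", "JJ"), ("vowel", "NN"), ("deletes", "VBZ"),
    ("before", "IN"), ("a", "DT"), ("short", "JJ"), ("vowel", "NN"),
]
print(chunk_noun_phrases(tagged))
print(count_terms(tagged))
```

Because the determiner is part of the matched phrase, it has to be removed afterwards by the stopword filter. That is why "the short vowel" and "a short vowel" end up counted as the same term.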

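The preprocessing in step 2 could look roughly like the sketch below. The real `preprocess_text` is not shown in this notebook, so the function name `preprocess_text_sketch`, the regular expressions, and the assumption that the reference list starts at a line reading "References" are all illustrative guesses.

```python
import re


def preprocess_text_sketch(text):
    """Hypothetical stand-in for preprocess_text; the patterns are assumptions."""
    # Cut everything from a line that only says "References" (any casing) onward.
    text = re.split(r"^\s*references\s*$", text, maxsplit=1, flags=re.I | re.M)[0]
    # Remove URLs and email addresses.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    # Normalize case.
    text = text.lower()
    # Remove special characters; \w is Unicode-aware, so precomposed accented letters survive.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and strip the ends.
    return re.sub(r"\s+", " ", text).strip()


raw = "Yor\u00f9b\u00e1 vowel deletion! See https://example.com or mail a@b.org\nReferences\nSmith 2020"
print(preprocess_text_sketch(raw))
```

An ASCII-only character class such as `[^a-z0-9\s]` would also strip accented letters. That may be worth checking, since fragments like 'yor vowel deletion' in the output further down could be the result of exactly that.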

### Pros:
1. Contextual relevance: by extracting noun phrases, this approach better captures the context in which the terms we seek to extract are used.
2. Domain-specific filtering: incorporating this favors terms that are more relevant to the specified field, in this case linguistics.
3. Heuristic rules: filtering by length and frequency allows us to eliminate irrelevant or less important terms.
4. Noise reduction through stopword filtering and single character filtering.

### Cons:
1. Complexity: sort of a mishmash, computationally more complex and generally harder to follow.
2. Dependency on tagged tokens: the quality of the results depends heavily on the accuracy of the POS tagging, which is largely out of my control because it relies on preexisting libraries. Tagging errors would heavily impact the results of my model.
3. Overfiltering: an issue I haven't explored much yet.
4. Maintenance of the linguistic term set: the need to maintain/update the term set adds administrative burden -- a problem I hope to circumvent by web scraping sites that list significant terms.
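
One possible shape for the domain-specific filter is sketched below, assuming the term set is available as a plain collection of strings. The function name `filter_by_domain` and the example term set are hypothetical. The counts are a few entries taken from the output of the run below.

```python
def filter_by_domain(term_counts, domain_terms):
    """Keep candidate phrases that are, or contain, a word from the domain term set."""
    domain = {term.casefold() for term in domain_terms}
    kept = {}
    for phrase, count in term_counts.items():
        normalized = phrase.casefold()
        if normalized in domain or any(word in domain for word in normalized.split()):
            kept[phrase] = count
    return kept


# Hypothetical term set; in practice it would be loaded from a maintained or scraped list.
linguistic_terms = {"vowel", "mora", "deletion"}
candidates = {"work": 4, "vowel duration": 2, "mora": 5, "time": 2}
print(filter_by_domain(candidates, linguistic_terms))
```

A strict filter like this one is also where overfiltering could creep in, because any valid term missing from the set is dropped outright. Using the set to boost scores rather than to exclude candidates would be a softer alternative.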



In [25]:
file_path = 'data/papers/paper1.pdf'
paper_text = read_pdf(file_path)
preprocessed_text = preprocess_text(paper_text)
tagged_tokens = process_text(preprocessed_text)


terms = enhanced_term_identification(tagged_tokens)

print(terms)

{'work': 4, 'evidence': 6, 'phonetics': 14, 'yor vowel deletion': 4, 'vowel': 16, 'adjacent vowel deletes': 2, 'short vowel': 12, 'deletion': 13, 'full vowel deletion': 2, 'akinlabi': 3, 'oyebade': 2, 'ola orie': 7, 'pulleyblank': 10, 'compensatory': 10, 'analysis': 5, 'experiment': 2, 'study': 3, 'manner': 2, 'articulation': 3, 'results': 12, 'vowel duration': 2, 'vowel deletion process': 3, 'pilot study': 2, 'duration': 18, 'remnant vowel': 16, 'vv sequence': 3, 'tata': 2, 'process': 10, 'full deletion': 5, 'difference': 6, 'account': 3, 'mora': 5, 'remnant': 2, 'phonetic module': 2, 'speaker': 4, 'vowels': 6, 'data': 3, 'contexts': 2, 'research': 2, 'deletion process': 2, 'simple vowel': 2, 'simple short vowel': 4, 'standard phonological account': 6, 'structure': 4, 'projects': 3, 'standard account': 5, 'form': 4, 'phonetic duration': 2, 'case': 6, 'native speaker': 2, 'language': 6, 'subject': 7, 'verb': 5, 'quality': 3, 'analysis yor vowel deletion': 2, 'time': 2, 'elicitation': 2

## Next Steps:
1. Refinement of the Conglomerate Approach -- incorporation of the term set, improvement of the heuristic rules, more focus on Named Entity Recognition


In [1]:
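from transformers import GPT2Tokenizer, GPT2LMHeadModel


def generate_definition_gpt2(term):
    """Generate a definition for `term` by prompting the base GPT-2 model."""
    model_name = "gpt2"
    # The tokenizer and model are reloaded on every call.
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    # GPT-2 ships without a pad token, so reuse the end-of-sequence token.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Pad on the left so generation continues from the prompt, not from padding.
    tokenizer.padding_side = 'left'

    # The quote opened before {term} is not closed, so the model sees an unbalanced prompt.
    prompt = f"Define the term '{term} in the context of an academic linguistic paper"
    inputs = tokenizer(prompt, return_tensors='pt', padding='max_length', max_length=150)

    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        # Both limits are set; max_new_tokens takes precedence over max_length.
        max_length=200,
        max_new_tokens=70,
        # Beam search with no repeated bigrams; no sampling is involved.
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    # The decoded text includes the prompt followed by the continuation.
    definition = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return definition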

## Step 2: Definition Generation

In [2]:
term = "vowel deletion"
definition = generate_definition_gpt2(term)
print("Generated Definition:", definition)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=70) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated Definition: Define the term 'vowel deletion in the context of an academic linguistic paper'.

In this case, it is important to note that this is not the first time that academic papers have been deleted in this way. In fact, the most recent example of this was published in The Journal of the American Academy of Child and Adolescent Psychiatry in 2012, where a paper was deleted by the journal's editors after it had been


### Progress

Much more nitpicky than initially expected. I thought I'd be able to just throw in a ChatGPT-style prompt, but base GPT-2 continues the prompt text rather than answering it, so I will have to do some proper prompt engineering, as well as tuning temperature (as I understand it, how random the model's sampling is) and the other generation settings.
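
To show what temperature does, the sketch below applies it to a made-up set of next-token scores. The logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The function name `softmax_with_temperature` and the example logits are assumptions for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As far as I'm aware, `generate` only applies `temperature` when sampling is enabled with `do_sample=True`, so a beam-search call like the one above would not be affected by it.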

# Questions/Advice?



Usage of `.lower()` vs. `.casefold()`:
- `.casefold()` is a more aggressive version of `.lower()`
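
A small illustration of the difference, using the standard German example. The word list is only there for demonstration.

```python
words = ["Stra\u00dfe", "STRASSE", "strasse"]  # "Straße" written with an escape for ß

lowered = {w.lower() for w in words}
folded = {w.casefold() for w in words}

print(sorted(lowered))  # lower() keeps ß, so two distinct forms remain
print(sorted(folded))   # casefold() maps ß to "ss", so all three forms collapse
```

For plain ASCII text the two methods give the same result. The difference only matters for characters whose case-insensitive form differs from their lowercase form.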