# Acronym Extraction

(C) 2022 by [Damir Cavar](http://damir.cavar.me/)

This notebook shows how to use the *[abbreviations](https://github.com/philgooch/abbreviation-extraction)* module to extract acronyms and their definitions from text documents.

Install the module using:

    pip install abbreviations

For the *FileChooser* widget in this Jupyter notebook, you may also need to install *[ipyfilechooser](https://github.com/crahan/ipyfilechooser)*:

    pip install ipyfilechooser

The code below assumes that the text is encoded as UTF-8. If it is not, either change the *encoding* argument in the *get_abbreviations* function below or convert your text to UTF-8.
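
If converting the files is the easier route, the re-encoding step could look roughly like the sketch below. The helper name *convert_to_utf8* and the default source encoding *latin-1* are assumptions made for illustration; they are not part of this notebook, and the actual source encoding of your files has to be known or detected separately.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode one text file as UTF-8 (hypothetical helper for illustration)."""
    # Decode with the encoding the file was originally written in ...
    text = Path(src).read_text(encoding=source_encoding)
    # ... and write the same characters back out as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
    return dst
```

Writing to a separate *dst* path keeps the original file untouched, which makes it easy to retry with a different *source_encoding* if the result contains garbled characters.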

Run the following code to activate the *FileChooser* and select a folder with the target text files in it. The target text files can be in subfolders of arbitrary depth within this folder.

In [1]:
from ipyfilechooser import FileChooser
fc = FileChooser()
display(fc)

FileChooser(path='/home/damir/iCatalyst/Dev/similarityServiceMy/acronym_extraction', filename='', title='', sh…

The following code cell imports the modules used in the functions below: *[abbreviations](https://github.com/philgooch/abbreviation-extraction)* to extract the abbreviations, and *os* to process subfolders and find the target text files.

In [2]:
from abbreviations import schwartz_hearst
import os

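The following function reads the text file at *file_name*, extracts abbreviation-definition pairs from it, removes any abbreviations listed in *filter_terms*, and returns a sorted list of abbreviations with their definitions. It collects both the first and the most common definition of each abbreviation, so one abbreviation can map to more than one full form.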

In [7]:
def get_abbreviations(file_name="", filter_terms=None):
    """Return a sorted list of (abbreviation, set of definitions) for one text file."""
    # An empty list (not None) is returned on failure, so callers can always iterate.
    if not os.path.isfile(file_name):
        return []
    print("Processing file:", file_name)
    try:
        with open(file_name, mode='r', encoding='utf-8') as ifp:
            text = ifp.read()
    except (OSError, UnicodeDecodeError):
        return []
    if not text:
        return []
    most_common_defs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text, most_common_definition=True)
    first_defs       = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text, first_definition=True)
    # Drop unwanted entries from both result dictionaries.
    for x in filter_terms or []:
        most_common_defs.pop(x, None)
        first_defs.pop(x, None)
    # Merge both dictionaries; title-casing collapses definitions that differ only in case.
    res = {}
    for s in (first_defs, most_common_defs):
        for x in s:
            res.setdefault(x, set()).add(s[x].lower().title())
    return sorted(res.items())
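
The merging step at the end of the function can be tried in isolation, without the *abbreviations* module. The sketch below is a minimal stand-alone version under that assumption: *merge_definitions* is a hypothetical helper name, and the two input dictionaries are made-up sample data, not real extractor output.

```python
def merge_definitions(*definition_dicts):
    """Union several {abbreviation: definition} dicts into sorted (abbr, [defs]) pairs."""
    merged = {}
    for definitions in definition_dicts:
        for abbr, full_form in definitions.items():
            # Title-case so that variants differing only in case collapse into one.
            merged.setdefault(abbr, set()).add(full_form.lower().title())
    # Sort abbreviations and, within each, the definitions for stable output.
    return [(abbr, sorted(forms)) for abbr, forms in sorted(merged.items())]


# Made-up sample data standing in for the "first" and "most common" results.
first = {"NAS": "National Airspace System", "AC": "advisory circular"}
common = {"NAS": "national airspace system", "AC": "Advisory Circulars"}

print(merge_definitions(first, common))
# [('AC', ['Advisory Circular', 'Advisory Circulars']), ('NAS', ['National Airspace System'])]
```

Case-only variants such as the two spellings of *NAS* collapse into one entry, while genuinely different full forms such as the singular and plural of *AC* are both kept.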

The folders to be skipped are defined in the following list. For example, this specification ignores the folders *_test* and *not_relevant* in the target path:

In [8]:
skip_folders = ["_test", "not_relevant"]

The following loop walks through all subfolders, skipping the ones specified above, and processes each file ending in *.txt*. When calling *get_abbreviations*, the second parameter can be a list of abbreviations that should be excluded from the output of the abbreviation extractor.

In [17]:
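for root, dirs, files in os.walk(fc.selected_path):
    # Prune skipped folders in place so that os.walk does not descend into them.
    dirs[:] = [d for d in dirs if d not in skip_folders]
    for file in files:
        if file.endswith(".txt"):
            # "or []" guards against get_abbreviations returning None for unreadable files.
            for abbr, defs in get_abbreviations(os.path.join(root, file), ["if any", "The"]) or []:
                print(f"{abbr}: {', '.join(sorted(defs))}")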


Processing file: /home/damir/iCatalyst/Dev/similarityServiceMy/Data/FAA_Order_1100.161A/FAA_Order_1100.161A_tab.txt
AC: Advisory Circular
AOV: Air Traffic Safety Oversight Service
ATO: Air Traffic Organization
AVS: Aviation Safety
FAA: Federal Aviation Administration
ICAO: International Civil Aviation Organization
NAS: National Airspace System
NOTAM: Notices To Airmen
SMS: Safety Management System
Processing file: /home/damir/iCatalyst/Dev/similarityServiceMy/Data/FAA_Order_1100.161A/FAA_Order_1100.161A_adobe_tab.txt
AC: Advisory Circular
AOV: Air Traffic Safety Oversight Service
ATO: Air Traffic Organization
AVS: Aviation Safety
FAA: Federal Aviation Administration
ICAO: International Civil Aviation Organization
NAS: National Airspace System
NOTAM: Notices To Airmen
SMS: Safety Management System
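
To check the folder-skipping behavior of the walk without touching real data, a sketch along the following lines may help. It assumes a hypothetical generator *find_text_files* and builds a throwaway directory tree with *tempfile*; neither is part of the notebook above.

```python
import os
import tempfile


def find_text_files(top, skip_folders=(), suffix=".txt"):
    """Yield paths of files ending in `suffix` under `top`, pruning skipped folders."""
    for root, dirs, files in os.walk(top):
        # Assigning to dirs[:] edits the list in place, which is what keeps
        # os.walk (in its default top-down mode) from entering those folders.
        dirs[:] = sorted(d for d in dirs if d not in skip_folders)
        for name in sorted(files):
            if name.endswith(suffix):
                yield os.path.join(root, name)


# Throwaway directory tree, used only for this demonstration.
with tempfile.TemporaryDirectory() as top:
    for rel in ("a.txt", "sub/b.txt", "_test/c.txt", "_test/deep/d.txt", "sub/e.pdf"):
        path = os.path.join(top, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w", encoding="utf-8").close()
    found = [os.path.relpath(p, top) for p in find_text_files(top, ["_test"])]
    print(found)
```

Only *a.txt* and the file in *sub* are reported: the *.pdf* file is filtered by its suffix, and everything below *_test*, including the nested *deep* folder, is never visited.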

