# Acronym Extraction

(C) 2022-2023 by [Damir Cavar](http://damir.cavar.me/)

This notebook shows how to use the *[abbreviations](https://github.com/philgooch/abbreviation-extraction)* module to extract acronyms and their definitions from documents.

Install the module using:

    pip install abbreviations

For the *FileChooser* widget in this Jupyter notebook you might also need to install *[ipyfilechooser](https://github.com/crahan/ipyfilechooser)*:

    pip install ipyfilechooser

The code below assumes that the text is encoded as UTF-8. If this is not the case for you, adapt the encoding specification in the *get_abbreviations* function below or convert your text to UTF-8.
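
If the encoding of a file is not known in advance, one option is to try a short list of candidate encodings in order. The following is a minimal sketch and not part of the original notebook code; the helper name `read_text_with_fallback` and the list of candidate encodings are assumptions chosen for illustration.

```python
def read_text_with_fallback(file_name, encodings=("utf-8", "cp1252", "latin-1")):
    """Return the decoded content of file_name, trying each encoding in turn.

    Hypothetical helper: the candidate encodings are an assumption and should
    be adapted to the data at hand. latin-1 accepts every byte sequence, so it
    only serves as a last resort and may produce wrong characters.
    """
    # Read the raw bytes once, then attempt to decode them.
    with open(file_name, mode="rb") as ifp:
        raw = ifp.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding.
    raise ValueError(f"Could not decode {file_name} with any of {encodings}")
```

The decoded string could then be written back with `encoding='utf-8'` to convert the file.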

Run the following code to activate the *FileChooser* and select the target text file. The code below processes the single file that you select. A good example file is ```bio_1.txt``` in the ```data``` subfolder.

In [1]:
from ipyfilechooser import FileChooser
fc = FileChooser()
display(fc)

FileChooser(path='C:\Users\damir\Develop\python-tutorial-for-ipython\notebooks', filename='', title='', show_h…

In the following code cell we import the necessary modules: *[abbreviations](https://github.com/philgooch/abbreviation-extraction)* to extract the abbreviations from the text, and *os* to build and check the file path.

In [17]:
from abbreviations import schwartz_hearst
import os

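The following function reads the content of the text file *file_name* and returns two dictionaries: one maps each abbreviation to its most common definition in the text, the other to its first definition. It returns *None* if the file does not exist, cannot be read as UTF-8, or is empty.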

In [18]:
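def get_abbreviations(file_name=""):
	"""Return (most common, first) definition dictionaries for file_name, or None."""
	if not os.path.isfile(file_name):
		return None
	print("Processing file:", file_name)
	try:
		# The context manager closes the file even if reading fails.
		with open(file_name, mode='r', encoding='utf-8') as ifp:
			text = ifp.read()
	except (OSError, UnicodeDecodeError):
		# The file is unreadable or its content is not valid UTF-8.
		return None
	if not text:
		return None
	# Map each abbreviation to its most frequent definition in the text.
	most_common_defs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text, most_common_definition=True)
	# Map each abbreviation to the first definition found in the text.
	first_defs       = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text, first_definition=True)
	return most_common_defs, first_defs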
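
The function above processes a single file. To process all text files in a folder, including subfolders of arbitrary depth, the file paths could be collected with *os.walk* first. This is a minimal sketch under assumptions: the helper name `find_text_files` and the `.txt` extension filter are hypothetical and not part of the original notebook.

```python
import os


def find_text_files(folder_path, extension=".txt"):
    """Yield the paths of all files below folder_path that end in extension.

    Hypothetical helper: the extension filter is an assumption; the comparison
    ignores case, so names ending in .TXT are also matched.
    """
    # os.walk visits folder_path and every subfolder below it.
    for directory, _subfolders, file_names in os.walk(folder_path):
        for file_name in sorted(file_names):
            if file_name.lower().endswith(extension):
                yield os.path.join(directory, file_name)
```

Each yielded path can be passed to *get_abbreviations*, for example `results = [get_abbreviations(p) for p in find_text_files(fc.selected_path)]`.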

We load the selected text file and print the resulting abbreviations:

In [19]:
abbreviations = get_abbreviations(os.path.join(fc.selected_path, fc.selected_filename))
print(abbreviations)

Processing file: C:\Users\damir\Develop\python-tutorial-for-ipython\notebooks\data\bio_1.txt
({'ER': 'endoplasmic reticulum'}, {'ER': 'endoplasmic reticulum'})
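
The result is a tuple of two dictionaries that map each abbreviation to one definition. When results from several files are collected, it can be useful to see every definition that was found for an abbreviation. The following sketch shows one way to merge such results; the helper name `merge_definitions` is hypothetical, and it assumes that each result is either *None* or a tuple of dictionaries as printed above.

```python
from collections import defaultdict


def merge_definitions(results):
    """Merge result tuples into a dict of abbreviation -> set of definitions.

    Hypothetical helper: each element of results is assumed to be None or a
    tuple of dictionaries that map an abbreviation to a definition.
    """
    merged = defaultdict(set)
    for result in results:
        if result is None:
            continue  # Skip files that could not be processed.
        for pairs in result:
            for abbreviation, definition in pairs.items():
                merged[abbreviation].add(definition)
    return dict(merged)


# Example with the output shown above and one unprocessed file.
example = [({'ER': 'endoplasmic reticulum'}, {'ER': 'endoplasmic reticulum'}), None]
print(merge_definitions(example))  # {'ER': {'endoplasmic reticulum'}}
```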

