# Acronym Extraction

(C) 2022-2024 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 0.3, September 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is an example of the use of the *[abbreviations](https://github.com/philgooch/abbreviation-extraction)* module to extract acronyms from documents.

Install the module using:

    pip install abbreviations

For the *FileChooser* widget in this Jupyter notebook, you might also need to install *[ipyfilechooser](https://github.com/crahan/ipyfilechooser)*:

    pip install ipyfilechooser

The code below assumes that the text is encoded as UTF-8. If this is not the case for you, adapt the encoding specification in the *get_abbreviations* function below or convert your text to the UTF-8 character encoding.
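If the encoding of a file is unknown, one option is to try a few candidate encodings and re-save the text as UTF-8. The following is a minimal sketch under that assumption; the helper names `decode_with_fallback` and `convert_to_utf8` and the list of candidate encodings are illustrative and not part of the notebook. Note that this is only a heuristic: a file in some other encoding may decode without error and still contain wrong characters.

```python
from pathlib import Path


def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first candidate encoding that does not fail."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


def convert_to_utf8(source, target, encodings=("utf-8", "cp1252", "latin-1")):
    """Read source with a guessed encoding and write it to target as UTF-8."""
    text = decode_with_fallback(Path(source).read_bytes(), encodings)
    Path(target).write_text(text, encoding="utf-8")
```

UTF-8 is tried first because valid UTF-8 text rarely decodes by accident; `latin-1` comes last because it accepts any byte sequence.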

In [1]:
!pip install -U abbreviations

Collecting abbreviations
  Using cached abbreviations-0.2.5-py3-none-any.whl.metadata (550 bytes)
Collecting regex (from abbreviations)
  Downloading regex-2024.9.11-cp312-cp312-win_amd64.whl.metadata (41 kB)
Using cached abbreviations-0.2.5-py3-none-any.whl (5.7 kB)
Downloading regex-2024.9.11-cp312-cp312-win_amd64.whl (273 kB)
Installing collected packages: regex, abbreviations
Successfully installed abbreviations-0.2.5 regex-2024.9.11



[notice] A new release of pip is available: 24.1.2 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
!pip install -U ipyfilechooser

Collecting ipyfilechooser
  Using cached ipyfilechooser-0.6.0-py3-none-any.whl.metadata (6.4 kB)
Collecting ipywidgets (from ipyfilechooser)
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets->ipyfilechooser)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets->ipyfilechooser)
  Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl.metadata (4.1 kB)
Using cached ipyfilechooser-0.6.0-py3-none-any.whl (11 kB)
Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)


[notice] A new release of pip is available: 24.1.2 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Run the following code to activate the *FileChooser* and select the target text file. The cells below process the single file that you select. A good example file is ```bio_1.txt``` in the ```data``` subfolder.

In [3]:
from ipyfilechooser import FileChooser

fc = FileChooser()
display(fc)

FileChooser(path='C:\Users\damir\Dropbox\Develop\python-tutorial-notebooks\notebooks', filename='', title='', …

In the following code cell we import the modules *[abbreviations](https://github.com/philgooch/abbreviation-extraction)* and *os*, which the function below uses to check the file path and to extract all abbreviations from the selected text file.

In [4]:
from abbreviations import schwartz_hearst
import os
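The cells in this notebook process one selected file. To process every text file below a folder, including subfolders of arbitrary depth, *os.walk* could be used. The following is a minimal sketch; the function name `find_text_files` and the `.txt` extension filter are assumptions made for illustration.

```python
import os


def find_text_files(root_folder, extension=".txt"):
    """Yield the paths of all files below root_folder that end with extension."""
    # os.walk visits root_folder and every nested subfolder.
    for current_dir, _subdirs, file_names in os.walk(root_folder):
        for file_name in sorted(file_names):
            # Compare case-insensitively so that "NOTES.TXT" is found as well.
            if file_name.lower().endswith(extension):
                yield os.path.join(current_dir, file_name)
```

Each yielded path could then be passed to a per-file function such as the *get_abbreviations* function in this notebook, for example by looping over `find_text_files(fc.selected_path)`.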

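The following function reads the content of the text file *file_name* and returns a pair of dictionaries that map each abbreviation to its definition: the first uses the most common definition found in the text, the second uses the first definition found. It returns *None* if the file does not exist, cannot be read, or is empty.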

In [33]:
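def get_abbreviations(file_name=""):
    """Return (most common, first) abbreviation definitions from file_name, or None."""
    if not os.path.isfile(file_name):
        return None
    print("Processing file:", file_name)
    try:
        # The context manager closes the file even if reading fails.
        with open(file_name, mode='r', encoding='utf-8') as ifp:
            text = ifp.read()
    except (OSError, UnicodeDecodeError):
        return None
    if not text:
        return None
    # Keep the most frequent definition of each abbreviation ...
    most_common_defs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text, most_common_definition=True)
    # ... and the first definition of each abbreviation in the text.
    first_defs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text, first_definition=True)
    return most_common_defs, first_defs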
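For intuition about what the *schwartz_hearst* module does, the following sketch re-implements the core idea of the Schwartz and Hearst (2003) algorithm for the pattern "long form (SF)": the characters of the short form are matched right to left against the words before the parenthesis, and the first character must start a word. This is a simplified illustration, not the code of the *abbreviations* module, which applies further candidate filters; the function names and the regular expression are assumptions.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short_form right to left against candidate; return the long form or None."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        char = short_form[s_index].lower()
        if not char.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the beginning of a word.
        while ((l_index >= 0 and candidate[l_index].lower() != char)
               or (s_index == 0 and l_index > 0
                   and candidate[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Start the long form at the beginning of the word matched last.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return {short form: long form} for 'long form (SF)' patterns in sentence."""
    pairs = {}
    for match in re.finditer(r"\(([^()]{2,10})\)", sentence):
        short_form = match.group(1).strip()
        if not any(c.isalpha() for c in short_form):
            continue
        # Search window from the paper: at most min(|SF| + 5, 2 * |SF|) words.
        words = sentence[:match.start()].split()
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        long_form = find_best_long_form(short_form, " ".join(words[-max_words:]))
        if long_form:
            pairs[short_form] = long_form
    return pairs
```

For a sentence such as "Proteins are folded in the endoplasmic reticulum (ER) of the cell.", `extract_pairs` maps `ER` to `endoplasmic reticulum`.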

We load the selected text file and print the resulting abbreviations:

In [34]:
abbreviations = get_abbreviations(os.path.join(fc.selected_path, fc.selected_filename))
print(abbreviations)

Processing file: C:\Users\damir\Dropbox\Develop\python-tutorial-notebooks\notebooks\data\bio_1.txt
({'ER': 'endoplasmic reticulum'}, {'ER': 'endoplasmic reticulum'})
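The two dictionaries differ only when an abbreviation is defined more than once with different long forms. The following standard-library sketch illustrates the difference between picking the first and the most common definition; the occurrences in the usage note are made up for illustration, and the function `first_and_most_common` is not part of the *abbreviations* module.

```python
from collections import Counter


def first_and_most_common(occurrences):
    """Return (most common, first) definition dictionaries for (abbr, definition) pairs."""
    first = {}
    counts = {}
    for abbreviation, definition in occurrences:
        # setdefault keeps the definition that was seen first.
        first.setdefault(abbreviation, definition)
        counts.setdefault(abbreviation, Counter())[definition] += 1
    most_common = {abbr: counter.most_common(1)[0][0]
                   for abbr, counter in counts.items()}
    return most_common, first
```

With the hypothetical occurrences `[("ER", "estrogen receptor"), ("ER", "endoplasmic reticulum"), ("ER", "endoplasmic reticulum")]`, the first dictionary maps `ER` to `endoplasmic reticulum` and the second maps it to `estrogen receptor`.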


In [35]:
uri_prefix = "http://www.indiana.edu/nlplab/bioterminology#"

In [36]:
from rdflib.namespace import RDF, RDFS, SKOS, OWL, DC, DCTERMS, XSD, TIME, NamespaceManager
from rdflib import Graph, URIRef, Literal, Namespace

In [37]:
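dictionary = {}
# abbreviations holds two dictionaries (most common and first definitions);
# collect every (abbreviation, definition) pair from both of them.
for defs in abbreviations:
    for entry in defs.items():
        # Build a CamelCase URI from the definition, e.g. ...#EndoplasmicReticulum
        dictionary[uri_prefix + "".join([z.title() for z in entry[1].split()])] = entry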
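The keys of the dictionary are URIs built from the prefix and a CamelCase version of the definition. The following sketch isolates this step in a helper; the name `make_uri` is hypothetical, and the percent-encoding with `urllib.parse.quote` is an addition that the notebook does not use, included here because definitions may contain characters that are not valid in a URI.

```python
from urllib.parse import quote

URI_PREFIX = "http://www.indiana.edu/nlplab/bioterminology#"


def make_uri(label, prefix=URI_PREFIX):
    """Return prefix plus a CamelCase, percent-encoded version of label."""
    # str.title() capitalizes each word and lowercases its remaining letters.
    camel_case = "".join(word.title() for word in label.split())
    # Percent-encode everything except unreserved URI characters.
    return prefix + quote(camel_case, safe="")
```

Note that `str.title()` also changes capitalization inside a word, so a definition such as "DNA polymerase" becomes `DnaPolymerase`.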


In [38]:
from pprint import pprint

g = Graph()
vaem_acronym = URIRef("http://www.linkedmodel.org/schema/vaem#acronym")
for key in dictionary:
    g.add((URIRef(key), RDFS.label, Literal(dictionary[key][1])))
    g.add((URIRef(key), vaem_acronym, Literal(dictionary[key][0])))
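To make the structure of the graph concrete, the following sketch renders the same two triples per entry as N-Triples lines using only the standard library. It is an illustration of the data model and not a replacement for *rdflib*; the function names are assumptions, and the sample entry in the usage note mirrors the output shown earlier.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
VAEM_ACRONYM = "http://www.linkedmodel.org/schema/vaem#acronym"


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(dictionary):
    """Return N-Triples lines for {uri: (acronym, definition)} entries."""
    lines = []
    for subject, (acronym, definition) in dictionary.items():
        # One label triple and one acronym triple per subject.
        lines.append(f'<{subject}> <{RDFS_LABEL}> "{escape_literal(definition)}" .')
        lines.append(f'<{subject}> <{VAEM_ACRONYM}> "{escape_literal(acronym)}" .')
    return lines
```

For an entry that maps `http://www.indiana.edu/nlplab/bioterminology#EndoplasmicReticulum` to `("ER", "endoplasmic reticulum")`, this yields one line with the label "endoplasmic reticulum" and one line with the acronym "ER".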


In [39]:
import spacy
import pytextrank

In [50]:
nlp = spacy.load("en_core_web_trf")  # alternatively: "en_core_web_sm"
nlp.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x22b2e203010>

In [53]:
with open(os.path.join(fc.selected_path, fc.selected_filename), mode='r', encoding='utf-8') as ifp:
    text = ifp.read()

In [54]:
doc = nlp(text)

for phrase in doc._.phrases:
    print(phrase.text)
    print(phrase.rank, phrase.count)
    print(phrase.chunks)
    print()

In [46]:
# Add every TextRank phrase to the graph as a labeled resource.
for phrase in doc._.phrases:
    entry = phrase.text
    key = URIRef(uri_prefix + "".join([z.title() for z in entry.split()]))
    g.add((key, RDFS.label, Literal(entry)))


In [63]:
# Map spaCy entity labels to the class names used in the graph.
nelabels = {"ORG": "Organization",
            "PERSON": "Person"}
# Declare one OWL class per entity type.
for k in nelabels:
    g.add((URIRef(uri_prefix + nelabels[k]), RDF.type, OWL.Class))
# Add each organization and person as a labeled instance of its class.
for ent in doc.ents:
    if ent.label_ in nelabels:
        key = URIRef(uri_prefix + "".join([z.title() for z in ent.text.split()]))
        class_name = nelabels[ent.label_]  # avoids shadowing the built-in type()
        g.add((key, RDF.type, URIRef(uri_prefix + class_name)))
        g.add((key, RDFS.label, Literal(ent.text)))


In [64]:
g.serialize(destination="data/test_graph.ttl", format="turtle", encoding="utf-8")

<Graph identifier=N8793d07f04814241b59925a0d5c3aeae (<class 'rdflib.graph.Graph'>)>
