This notebook generates summaries of the descriptions in an HTO knowledge graph, creates triples for these summaries, and adds them to the graph.

We use the Hugging Face `transformers` library to perform abstractive summarization.

In [1]:
# Load the graph
from rdflib import Graph, URIRef, Namespace

# Create a new RDFLib Graph
graph = Graph()

# Load hto ontology file into the graph
ontology_file = "../results/hto_eb_7th_hq.ttl"
graph.parse(ontology_file, format="turtle")
hto = Namespace("https://w3id.org/hto#")

In [2]:
len(graph)

417418

In [3]:
import pandas as pd
from rdflib.plugins.sparql import prepareQuery

# Get the original descriptions of all topic terms
q1 = prepareQuery('''
    SELECT ?description ?text WHERE {
        ?term a hto:TopicTermRecord;
            hto:hasOriginalDescription ?description.
        ?description hto:text ?text.
    }
  ''',
  initNs = { "hto": hto}
)

uri_description_list = []
for r in graph.query(q1):
    uri_description = {
        "description_uri": r.description,
        "description": str(r.text),
        "summary": None
    }
    uri_description_list.append(uri_description)
    print("%s %s" % (r.description, len(r.text)))


df_uri_description = pd.DataFrame(data=uri_description_list, columns=["description_uri", "description", "summary"])

https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_HYDRODYNAMICS_0NCKP 576502
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_HYGROMETRY_0NCKP 167459
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_ICELAND_0NCKP 53301
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_ICHTHYOLOGY_0NCKP 637554
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INDEPENDENTS_0NCKP 34153
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INK_0NCKP 35103
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INQUISITION_0NCKP 52745
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INSCRIPTION_0NCKP 25113
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INSTINCT_0NCKP 83434
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INSURANCE_0NCKP 57018
https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INTEREST_0NCKP 68974
htt

In [4]:
print(df_uri_description.loc[1])

description_uri    https://w3id.org/hto/OriginalDescription/99107...
description        The formation of steam or aqueous vapour, and ...
summary                                                         None
Name: 1, dtype: object


## Define a function for summarizing text

In [101]:
from transformers import pipeline
import nltk  # sent_tokenize needs the NLTK Punkt data, e.g. nltk.download("punkt")

summarizer = pipeline("summarization", model="Falconsai/text_summarization")

def summarise_text_abstractive(text):
    # Split the text into sentences and keep at most MAX_SENTENCES of them
    MAX_SENTENCES = 100
    sentences = nltk.sent_tokenize(text)
    print(len(sentences))
    if len(sentences) > MAX_SENTENCES:
        sentences = sentences[:MAX_SENTENCES]

    print("chunking the text....")
    # Group sentences into chunks whose token count stays within the maximum length the model allows.
    tokenizer = summarizer.tokenizer
    max_token_length = tokenizer.model_max_length - 2  # Leave headroom for special tokens added at encoding time
    chunks = []
    current_chunk = []

    # Greedily pack whole sentences into chunks without exceeding max_token_length
    for sentence in sentences:
        # Truncate any single sentence that is longer than the limit on its own
        tokenized_sentence = tokenizer.encode(sentence, add_special_tokens=False)[:max_token_length]
        if len(current_chunk) + len(tokenized_sentence) <= max_token_length:
            current_chunk.extend(tokenized_sentence)
        else:
            chunks.append(current_chunk)
            current_chunk = list(tokenized_sentence)

    if current_chunk:
        chunks.append(current_chunk)

    # Convert token IDs back to text
    grouped_sentences = [tokenizer.decode(chunk) for chunk in chunks]
    print(f"text is chunked into {len(grouped_sentences)} pieces")

    MAX_SUMMARY_LENGTH = 100
    MIN_SUMMARY_LENGTH = 5
    summaries = []
    for chunk, chunk_tokens in zip(grouped_sentences, chunks):
        # Cap the summary at half the chunk length so that short chunks are still condensed,
        # but never let max_length drop below min_length
        max_length = max(MIN_SUMMARY_LENGTH, min(MAX_SUMMARY_LENGTH, len(chunk_tokens) // 2))
        summary = summarizer(chunk, max_length=max_length, min_length=MIN_SUMMARY_LENGTH, do_sample=False)
        summaries.append(summary[0]['summary_text'])
    return ' '.join(summaries)
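
The chunking step in `summarise_text_abstractive` can be tried in isolation, without loading a model. The following is a minimal sketch under stated assumptions: `chunk_token_ids` and `toy_encode` are hypothetical names, the toy encoder emits one fake ID per word instead of calling the real tokenizer, and truncating an oversized sentence is one possible policy rather than something the notebook prescribes.

```python
from typing import Callable, List


def chunk_token_ids(
    sentences: List[str],
    encode: Callable[[str], List[int]],
    max_tokens: int,
) -> List[List[int]]:
    """Greedily pack sentence token IDs into chunks of at most max_tokens."""
    chunks: List[List[int]] = []
    current: List[int] = []
    for sentence in sentences:
        # Assumed policy: a sentence longer than the budget is truncated.
        ids = encode(sentence)[:max_tokens]
        # Start a new chunk when adding this sentence would overflow the budget.
        if current and len(current) + len(ids) > max_tokens:
            chunks.append(current)
            current = []
        current.extend(ids)
    if current:
        chunks.append(current)
    return chunks


def toy_encode(sentence: str) -> List[int]:
    """Stand-in tokenizer (hypothetical): one fake ID per whitespace-separated word."""
    return [len(word) for word in sentence.split()]


sentences = ["a b c", "d e", "f g h i j k", "l"]
chunks = chunk_token_ids(sentences, toy_encode, max_tokens=4)
print([len(chunk) for chunk in chunks])  # [3, 2, 4, 1]
```

No chunk exceeds the budget, including the one built from the six-word sentence, and no empty chunk is emitted when a sentence fills the budget on its own.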

In [80]:
input_text = uri_description_list[1]["description"]
print(input_text)

The formation of steam or aqueous vapour, and its diffusion in space or in a gaseous medium, have already been considered under the article Evaporation. We now propose first to take a view of various methods and devices which have been employed to detect the presence of aqueous vapour, and to ascertain its amount, or how much of it is contained in a given volume, whether when alone or diffused in a gaseous medium. This is necessarily connected with, and will in a great measure consist of, the description, theory, and use of such instruments as almost exclusively belong to this branch of inquiry, and which are usually denominated hygrometers, from υγgος, moist, and μi r rgiω, I measure. We shall then consider under what circumstances moisture is deposited from the atmosphere, and shall examine some of the more remarkable phenomena resulting from or connected with the condensation of aqueous vapour. Many contrivances bearing the name of hygrometers have appeared from time to time, and th

In [99]:
import time
start_time = time.time()
summarised_text = summarise_text_abstractive(input_text)
end_time = time.time()
elapsed_time = end_time - start_time
print(elapsed_time)
print(summarised_text)

801
chunking the text....
text is chunked into 11 pieces
22.687130212783813
The formation of steam or aqueous vapour, and its diffusion in space or in a gaseous medium, have already been considered under the article Evaporation . We will then consider under what circumstances moisture is deposited from the atmosphere, and will examine some of the more remarkable phenomena resulting from or connected with the condensation . The earliest account of hygrometers perhaps worth no-’ ticing is that contained in the Philosophical Transaction a beam of straight-grained wood becomes thicker and broader by absorbing moisture, and vice versa . In bodies of a fibrous structure, however, the change of dimension occurs principally in a transverse direction, or across the fibres, and but very slightly in the direction of their length . hygrometer or weather-house, commonly sold as a toy, depends on this principle . It usually consists of a kind of box, representing a building with two doors, within wh
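
Each chunk contributes at most 100 tokens to the summary, or half the chunk's token count when the chunk is short. The arithmetic can be sketched as a small helper; `summary_length_budget` is a hypothetical name, and the lower bound of 5 (matching `min_length`) is an assumption added here so that the maximum never falls below the minimum.

```python
def summary_length_budget(chunk_token_length: int, cap: int = 100, floor: int = 5) -> int:
    """Return the maximum summary length for a chunk of the given token count."""
    # Half the chunk length, limited to the cap, but never below the floor.
    return max(floor, min(cap, chunk_token_length // 2))


for n in (510, 200, 120, 8):
    print(n, summary_length_budget(n))
# 510 100
# 200 100
# 120 60
# 8 5
```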

## Generate a summary for each description in `uri_description_list`

In [102]:
print(f"total number of topics: {len(uri_description_list)}")
for index, uri_description in enumerate(uri_description_list, start=1):
    print(f"------Summarising {index}th description ---------")
    uri_description["summary"] = summarise_text_abstractive(uri_description["description"])

total number of topics: 959
------Summarising 1th description ---------
6732
chunking the text....
text is chunked into 9 pieces
------Summarising 2th description ---------
801
chunking the text....
text is chunked into 11 pieces
------Summarising 3th description ---------
257
chunking the text....
text is chunked into 11 pieces
------Summarising 4th description ---------
5464
chunking the text....
text is chunked into 12 pieces
------Summarising 5th description ---------
188
chunking the text....
text is chunked into 10 pieces


Token indices sequence length is longer than the specified maximum sequence length for this model (581 > 512). Running this sequence through the model will result in indexing errors


------Summarising 6th description ---------
180
chunking the text....
text is chunked into 12 pieces
------Summarising 7th description ---------
230
chunking the text....
text is chunked into 14 pieces
------Summarising 8th description ---------
106
chunking the text....
text is chunked into 13 pieces
------Summarising 9th description ---------
371
chunking the text....
text is chunked into 10 pieces
------Summarising 10th description ---------
299
chunking the text....
text is chunked into 10 pieces
------Summarising 11th description ---------
388
chunking the text....
text is chunked into 9 pieces
------Summarising 12th description ---------
109
chunking the text....
text is chunked into 11 pieces
------Summarising 13th description ---------
420
chunking the text....
text is chunked into 7 pieces
------Summarising 14th description ---------
2383
chunking the text....
text is chunked into 13 pieces
------Summarising 15th description ---------
253
chunking the text....
text is chunked 

KeyboardInterrupt: 
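
The run above was interrupted after about 15 of the 959 descriptions. If the 22.7 s measured earlier for a single description were typical, a full pass would take roughly six hours, so it may help to make the loop resumable. The following is a minimal sketch, not the notebook's method: `summarise_all`, the JSON checkpoint file, and the stand-in `fake_summarise` are all hypothetical, and in the notebook `summarise_text_abstractive` would be passed as the `summarise` argument.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def summarise_all(
    records: List[Dict[str, object]],
    summarise: Callable[[str], str],
    checkpoint_path: str,
) -> None:
    """Fill in each record's summary, saving progress after every new summary."""
    done: Dict[str, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = json.load(f)
    for record in records:
        key = str(record["description_uri"])
        if key not in done:
            # Only descriptions without a saved summary are processed.
            done[key] = summarise(str(record["description"]))
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(done, f)
        record["summary"] = done[key]


calls: List[str] = []


def fake_summarise(text: str) -> str:
    """Stand-in summariser (hypothetical): records the call and keeps 10 characters."""
    calls.append(text)
    return text[:10]


records = [
    {"description_uri": "urn:example:a", "description": "First long description.", "summary": None},
    {"description_uri": "urn:example:b", "description": "Second long description.", "summary": None},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summaries.json")
    summarise_all(records, fake_summarise, path)
    summarise_all(records, fake_summarise, path)  # second run reuses the checkpoint
print(len(calls))  # 2
```

The second call performs no summarisation because every URI is already in the checkpoint, which is the behaviour wanted after an interruption.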

In [None]:
from rdflib import RDF, Literal, XSD

for uri_description in uri_description_list:
    summary = uri_description["summary"]
    if summary is not None and summary != "":
        description_uri = uri_description["description_uri"]
        description_id = str(description_uri).split("/")[-1]
        summary_uri = URIRef("https://w3id.org/hto/Summary/" + description_id)
        graph.add((summary_uri, RDF.type, hto.Summary))
        graph.add((description_uri, hto.hasSummary, summary_uri))
        graph.add((summary_uri, hto.text, Literal(summary, datatype=XSD.string)))
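
The summary URI reuses the last path segment of the description URI, so each summary can be traced back to its description by name. A minimal sketch of that derivation using plain strings follows; `summary_uri_for` is a hypothetical helper, and stripping a trailing slash is an extra precaution assumed here that the cell above does not take.

```python
SUMMARY_BASE = "https://w3id.org/hto/Summary/"


def summary_uri_for(description_uri: str) -> str:
    """Build the summary URI from the last path segment of a description URI."""
    # rstrip guards against a trailing slash (an assumption, not in the notebook).
    local_id = description_uri.rstrip("/").rsplit("/", 1)[-1]
    return SUMMARY_BASE + local_id


example = "https://w3id.org/hto/OriginalDescription/9910796273804340_192693199_INK_0NCKP"
print(summary_uri_for(example))
# https://w3id.org/hto/Summary/9910796273804340_192693199_INK_0NCKP
```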

In [None]:
# Save the Graph in the RDF Turtle format
graph.serialize(format="turtle", destination="../results/hto_eb_7th_hq_summary.ttl")