# Mixed source EB Dataframe to RDF

This notebook creates RDF triples from EB dataframes based on the [Heritage Text Ontology](https://github.com/frances-ai/HeritageTextOntology) and exports them as a Turtle (`.ttl`) file. It first loads the graph generated from a single source by the [single source notebook](Single_Source_EBDataFrame2RDF.ipynb), then adds triples that represent data from other sources. As a result, one EB term can have multiple original descriptions, each extracted from a different source.

## Load and check dataframe

Each entry in the dataframes should have the following columns (the example below is one entry from the first edition):

- MMSID:
- editor:                                                  Smellie, William
- editor_date:                                                   1740-1795
- genre:                                                       encyclopedia
- language:                                                             eng
- termsOfAddress:                                                       NaN
- physicalDescription:               3 v., 160 plates : ill. ; 26 cm. (4to)
- place:                                                         Edinburgh
- publisher:              Printed for A. Bell and C. Macfarquhar; and so...
- referencedBy:           [Alston, R.C.  Engl. language III, 560, ESTC T...
- shelfLocator:                                                        EB.1
- editionSubTitle:        Illustrated with one hundred and sixty copperp...
- volumeTitle:            Encyclopaedia Britannica; or, A dictionary of ...
- year:                                                                1771
- volumeId:                                                       144133901
- permanentURL:                            https://digital.nls.uk/144133901
- publisherPersons:                     [C. Macfarquhar, Colin Macfarquhar]
- volumeNum:                                                              1
- letters:                                                              A-B
- part:                                                                   0
- editionNum:                                                             1
- supplementTitle:
- supplementSubTitle:
- supplementsTo:                                                         []
- term:                                                                  OR
- definition:             A NEW A D I C T I A A, the name of several riv...
- relatedTerms:                                                          []
- header:                                           EncyclopaediaBritannica
- startsAt:                                                              15
- endsAt:                                                                15
- position:                                                           0
- termType:                                                         Article
- filePath:                                  144133901/alto/188082904.34.xml
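
The conversion function in this notebook reads only a subset of these columns for every entry. A minimal sketch of a pre-flight check is shown here; the helper name `missing_columns` and the `REQUIRED_COLUMNS` set are illustrative assumptions (the set lists the columns that `dataframe_to_rdf` accesses unconditionally) and are not part of the original notebook.

```python
# Columns that dataframe_to_rdf reads unconditionally for every entry.
# ("note" and "alter_names" are optional there, so they are not listed.)
REQUIRED_COLUMNS = {
    "MMSID", "volumeId", "volumeNum", "term", "definition",
    "termType", "position", "startsAt", "endsAt", "filePath",
}


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`, sorted."""
    return sorted(set(required) - set(columns))


# Example with a plain list of names (hypothetical, deliberately incomplete input).
print(missing_columns(["MMSID", "term", "definition"]))
```

With pandas, the same check could be run as `missing_columns(df.columns)` before calling the conversion, so a malformed source file fails early rather than midway through a volume.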

In [141]:
import re

import pandas as pd

# Nineteenth-Century Knowledge Project Dataframe
df = pd.read_json('../source_dataframes/eb/nckp_final_eb_7_dataframe_clean_Damon', orient="index")
df = df.fillna(0)

len(df)

23970

In [142]:
edition_mmsids = df["MMSID"].unique()
print(edition_mmsids)

[9910796273804340]


In [143]:
df_edition = df[df["MMSID"] == edition_mmsids[0]].reset_index(drop=True)
df_edition

Unnamed: 0,term,note,alter_names,reference_terms,definition,startsAt,endsAt,position,termType,filePath,...,volumeId,permanentURL,publisherPersons,volumeNum,editionNum,numberOfVolumes,numberOfTerms,supplementTitle,supplementSubTitle,supplementsTo
0,A,0,[],[],The first letter of the alphabet in every know...,11,12,1,Article,./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
1,A,0,[],[],"as an abbreviation, is likewise of frequent oc...",12,12,2,Article,./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
2,AA,0,[],[],"a river of the province of Groningen, in the k...",12,12,3,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
3,AA,0,[],[],a river in the province of Overyssel. in the N...,12,12,4,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
4,AA,0,[],[],"a river of the province of Antwerp, in the Net...",12,12,5,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,192984259,https://digital.nls.uk/192984259,[],2,7,22,0.0,,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23965,ZWENIGORODKA,0,[],[],a circle of the Russian government of Kiew. It...,1037,1037,4,Article,./eb07_TXT_v2/z21/kp-eb0721-102704-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]
23966,ZWICKAU,0,[],[],"a city of the kingdom of Saxony, the capital o...",1037,1037,5,Article,./eb07_TXT_v2/z21/kp-eb0721-102705-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]
23967,ZWOLLE,0,[],[],"a city, the capital of the circle of the same ...",1037,1037,6,Article,./eb07_TXT_v2/z21/kp-eb0721-102706-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]
23968,ZYGHUR,0,[],[],"a town of Hindustan, in the province of Bejapo...",1037,1037,7,Article,./eb07_TXT_v2/z21/kp-eb0721-102707-1077-v2.txt,...,193819045,https://digital.nls.uk/193819045,[],21,7,22,0.0,,,[]


## Load the graph and add extracted triples

In [144]:
from rdflib import Graph, URIRef, Namespace

graph = Graph()

# Load the ttl file into the graph
ontology_file = "results/hto_eb_total_lq.ttl"
graph.parse(ontology_file, format="turtle")
hto = Namespace("https://w3id.org/hto#")

In [160]:
# Print the number of "triples" in the Graph
print(f"Graph g has {len(graph)} statements.")

Graph g has 2664274 statements.


In [146]:
import pandas as pd
single_source_dataframe_with_uris = pd.read_json("final_eb_total_dataframe_lq_with_uris", orient="index")
len(single_source_dataframe_with_uris)

151065

In [147]:
from rdflib import URIRef
test_term_ref = URIRef("https://w3id.org/hto/ArticleTermRecord/992277653804341_144133901_OR_0")

In [148]:
single_source_dataframe_with_uris[single_source_dataframe_with_uris["uri"] == str(test_term_ref)]

Unnamed: 0,MMSID,editionTitle,editor,editor_date,genre,language,termsOfAddress,numberOfPages,physicalDescription,place,...,header,startsAt,endsAt,numberOfTerms,numberOfWords,position,termType,filePath,id,uri
0,992277653804341,"First edition, 1771, Volume 1, A-B","Smellie, William",1740-1795,encyclopedia,eng,0.0,832,"3 v., 160 plates : ill. ; 26 cm. (4to)",Edinburgh,...,EBAA,15,15,22,54,0,Article,144133901/alto/188082904.34.xml,0,https://w3id.org/hto/ArticleTermRecord/9922776...


In [149]:
import regex

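# Matches any character that is not a Unicode letter or number.
# A raw string avoids Python treating \p as an invalid escape sequence.
NON_AZ09_REGEXP = regex.compile(r'[^\p{L}\p{N}]')
def name_to_uri_name(name):
    uri_name = NON_AZ09_REGEXP.sub('', name)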
    return uri_name
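
For readers without the third-party `regex` package, roughly the same normalisation can be sketched with the standard library. The function name `name_to_uri_name_stdlib` is hypothetical, and results may differ from `regex` for a few characters if the two libraries ship different Unicode versions.

```python
import unicodedata


def name_to_uri_name_stdlib(name):
    """Keep only characters whose Unicode general category is a letter (L*) or a number (N*)."""
    return "".join(ch for ch in name if unicodedata.category(ch)[0] in "LN")


# Hyphens, spaces and punctuation are dropped; letters and digits are kept.
print(name_to_uri_name_stdlib("Aix-la-Chapelle"))  # AixlaChapelle
```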

In [150]:
from datetime import datetime
from rdflib import Literal, XSD, RDF
from rdflib.namespace import FOAF, PROV, SDO
# Agents (organisations and people) that produced the source datasets
agents = {
    "NCKP": ["Nineteenth-Century Knowledge Project", hto.Organization],
    "Ash": ["Ash Charlton", hto.Person],
    "NLS": ["National Library of Scotland", hto.Organization]
}

def create_organization(graph, agent):
    agent_uri = URIRef("https://w3id.org/hto/Organization/" + agent)
    graph.add((agent_uri, RDF.type, agents[agent][1]))
    graph.add((agent_uri, FOAF.name, Literal(agents[agent][0], datatype=XSD.string)))
    return agent_uri

In [151]:
def create_eb_text_dataset(graph, agent_uri, agent):
    eb_text_dataset = URIRef("https://w3id.org/hto/Collection/" + agent + "_eb_dataset")
    graph.add((eb_text_dataset, RDF.type, PROV.Collection))
    graph.add((eb_text_dataset, PROV.wasAttributedTo, agent_uri))

    # Create digitalising activity
    digitalising_activity = URIRef("https://w3id.org/hto/Activity/" + agent + "_digitalising_activity")
    graph.add((digitalising_activity, RDF.type, hto.Activity))
    graph.add((digitalising_activity, PROV.generated, eb_text_dataset))
    graph.add((digitalising_activity, PROV.wasAssociatedWith, agent_uri))
    graph.add((eb_text_dataset, PROV.wasGeneratedBy, digitalising_activity))
    return eb_text_dataset

In [152]:
def get_term_class_name(term_type, hto):
    term_class_name = hto.ArticleTermRecord
    if term_type == "TopicTermRecord":
        term_class_name = hto.TopicTermRecord
    return term_class_name

In [153]:
from difflib import SequenceMatcher

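def is_descriptions_for_same_term(description_1, description_2):
    # Compare only the first MAX_COMPARE_LENGTH characters of each description.
    # A match needs a quick_ratio of at least 0.7, or above 0.85 after the
    # longer text is cut to the length of the shorter one.
    MAX_COMPARE_LENGTH = 200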
    if len(description_1) > MAX_COMPARE_LENGTH:
        description_1 = description_1[:MAX_COMPARE_LENGTH]
    if len(description_2) > MAX_COMPARE_LENGTH:
        description_2 = description_2[:MAX_COMPARE_LENGTH]

    similarity_ratio = SequenceMatcher(None, description_1, description_2).quick_ratio()

    threshold = 0.7

    if similarity_ratio >= threshold:
        return True
    else:
        len_1 = len(description_1)
        len_2 = len(description_2)
        recheck = False
        if len_1 > len_2:
            recheck = True
            description_1 = description_1[:len_2]
        elif len_2 > len_1:
            recheck = True
            description_2 = description_2[:len_1]
        if recheck:
            similarity_ratio = SequenceMatcher(None, description_1, description_2).quick_ratio()
            threshold = 0.85
            if similarity_ratio > threshold:
                return True
        return False


def find_existing_term(same_vol_term_name_in_graph, description):
    for index in range(0, len(same_vol_term_name_in_graph)):
        term_info_in_graph = same_vol_term_name_in_graph.loc[index]
        if not term_info_in_graph["match"]:
            description_in_graph = term_info_in_graph["definition"]
            if is_descriptions_for_same_term(description_in_graph, description):
                description_pair = {
                    "new": description,
                    "graph": description_in_graph,
                    "uri": term_info_in_graph["uri"]
                }
                description_pairs.append(description_pair)
                same_vol_term_name_in_graph.loc[index, "match"] = True
                return term_info_in_graph["uri"]
    return None
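
The matching above relies on `SequenceMatcher.quick_ratio()`, which the Python documentation describes as a fast upper bound on `ratio()`. The sketch below compares the two measures on one OCR pair taken from this notebook's `description_pairs` output, and on a short anagram (a made-up input) to show that `quick_ratio()` ignores character order. It is an illustration of the trade-off, not a change to the notebook's logic.

```python
from difflib import SequenceMatcher

# One pair from the description_pairs output: clean text vs. OCR with long-s errors
clean = "a term, among alchemists, for lead."
ocr = "a term, among alchemifts, for lead,"

matcher = SequenceMatcher(None, clean, ocr)
print("OCR pair:", matcher.quick_ratio(), matcher.ratio())

# quick_ratio() only compares how often each character occurs,
# so an anagram scores as a perfect match while ratio() does not.
anagram = SequenceMatcher(None, "abc", "cba")
print("anagram: ", anagram.quick_ratio(), anagram.ratio())
```

Because entries are only compared when they share a term name and volume, the looser measure is probably acceptable here; swapping in `ratio()` would be stricter but slower on a large dataframe.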

In [154]:
# create software uris
defoe = URIRef("https://github.com/defoe-code/defoe")
graph.add((defoe, RDF.type, hto.SoftwareAgent))
frances_information_extraction = URIRef("https://github.com/frances-ai/frances-InformationExtraction")
graph.add((frances_information_extraction, RDF.type, hto.SoftwareAgent))
ABBYYFineReader = URIRef("https://pdf.abbyy.com")
graph.add((ABBYYFineReader, RDF.type, hto.SoftwareAgent))

<Graph identifier=N7b6c8a3db7d248c88b7920827c835faf (<class 'rdflib.graph.Graph'>)>

In [155]:
def link_entity_with_software(graph, entity, entity_type, agent):
    software = None
    if entity_type == "description":
        if agent == "NLS":
            software = defoe
        else:
            software = frances_information_extraction
    else:
        if agent == "NCKP":
            software = ABBYYFineReader

    if software:
        graph.add((entity, PROV.wasAttributedTo, software))

In [156]:
def get_source_ref(filePath, agent):
    if agent == "NCKP":
        parts = filePath.split("/")
        if len(parts) < 3:
            raise Exception("Wrong input format")
        edition_parts = parts[-3].split("_", 1)
        file_uri = "https://raw.githubusercontent.com/TU-plogan/kp-editions/main/" + edition_parts[0] + "/" +  edition_parts[1] + "/" +  parts[-2] + "/" + parts[-1]
        source_ref = URIRef(file_uri)
    else:
        source_uri_name = filePath.replace("/", "_").replace(".", "_")
        source_ref = URIRef("https://w3id.org/hto/InformationResource/" + source_uri_name)
    return source_ref
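
To make the two path mappings in `get_source_ref` concrete, here is a plain-string sketch that mirrors its logic without `rdflib`. The function name `source_uri_string` and the two constants are hypothetical; the example paths are the ones shown in the dataframes of this notebook.

```python
NCKP_BASE = "https://raw.githubusercontent.com/TU-plogan/kp-editions/main/"
HTO_RESOURCE_BASE = "https://w3id.org/hto/InformationResource/"


def source_uri_string(file_path, agent):
    """Plain-string counterpart of get_source_ref, for illustration only."""
    if agent == "NCKP":
        parts = file_path.split("/")
        if len(parts) < 3:
            raise ValueError("Expected <edition>_<format>/<folder>/<file>")
        # e.g. "eb07_TXT_v2" -> edition "eb07", text format "TXT_v2"
        edition, text_format = parts[-3].split("_", 1)
        return NCKP_BASE + "/".join([edition, text_format, parts[-2], parts[-1]])
    # Other agents: flatten the path into a single URI-safe name
    return HTO_RESOURCE_BASE + file_path.replace("/", "_").replace(".", "_")


print(source_uri_string("./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt", "NCKP"))
print(source_uri_string("144133901/alto/188082904.34.xml", "NLS"))
```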

In [157]:
# Add triples to graph
# description_pairs records every new entry that was matched to a term already in the graph.
# Each item holds both descriptions (new: the one being added; graph: the one already in the graph) and the URI of the matched term.
description_pairs = []
def dataframe_to_rdf(dataframe, graph, hto, agent_uri, agent, eb_dataset):
    count = 0
    edition_mmsids = dataframe["MMSID"].unique()
    for mmsid in edition_mmsids:
        df_edition = dataframe[dataframe["MMSID"] == mmsid]
        # VOLUMES
        vol_numbers = df_edition["volumeNum"].unique()
        for vol_number in vol_numbers:
            df_vol = df_edition[df_edition["volumeNum"] == vol_number].reset_index(drop=True)
            volume_info = df_vol.loc[0]
            volume_id=volume_info["volumeId"]
            volume_ref = URIRef("https://w3id.org/hto/Volume/"+str(volume_info["MMSID"])+"_"+str(volume_id))

            df_vol_by_term=df_vol.groupby(['term'],)["term"].count().reset_index(name='counts')
            # print(df_vol_by_term)
            #### TERMS

            for t_index in range(0, len(df_vol_by_term)):
                term=df_vol_by_term.loc[t_index]["term"]
                term_counts=df_vol_by_term.loc[t_index]["counts"]
                term_uri_name = name_to_uri_name(term)

                same_vol_term_name_in_graph = single_source_dataframe_with_uris[(single_source_dataframe_with_uris["volumeId"] == volume_id) & (single_source_dataframe_with_uris["term"] == term)].reset_index(drop=True)
                same_vol_term_name_in_graph["match"] = False
                # print(term_uri_name)
                # All terms in one volume with name equals to value of term
                df_entries= df_vol[df_vol["term"] == term].reset_index(drop=True)
                new_term_count = 0
                len_same_vol_term_name_in_graph = len(same_vol_term_name_in_graph)
                for t_count in range(0, term_counts):
                    df_entry= df_entries.loc[t_count]
                    description = str(df_entry["definition"])
                    existing_term = find_existing_term(same_vol_term_name_in_graph, description)
                    term_type = str(df_entry["termType"]) + "TermRecord"
                    term_class_name = get_term_class_name(term_type, hto)
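                    if existing_term:
                        term_ref = URIRef(existing_term)
                        # Reuse the ID of the matched record (last URI segment) so that the
                        # description URI below is not built from an unset or stale term_id
                        term_id = str(existing_term).rsplit("/", 1)[-1]
                        count += 1
                        print(count)
                        # new description, new source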
                    else:
                        # add new term record
                        term_id = str(mmsid)+"_"+str(df_entry["volumeId"])+"_"+term_uri_name+"_"+str(new_term_count + len_same_vol_term_name_in_graph)
                        term_ref = URIRef("https://w3id.org/hto/" + term_type + "/" + term_id)
                        graph.add((term_ref, RDF.type, term_class_name))
                        graph.add((term_ref, hto.name, Literal(term, datatype=XSD.string)))
                        graph.add((term_ref, hto.position, Literal(df_entry["position"], datatype=XSD.int)))


                        ## startsAt
                        page_startsAt= URIRef("https://w3id.org/hto/Page/"+ str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+str(df_entry["startsAt"]))
                        graph.add((page_startsAt, RDF.type, hto.Page))
                        graph.add((page_startsAt, hto.number, Literal(df_entry["startsAt"], datatype=XSD.int)))
                        graph.add((volume_ref, hto.hadMember, page_startsAt))
                        graph.add((term_ref, hto.startsAtPage, page_startsAt))
                        graph.add((page_startsAt, RDF.type, hto.WorkCollection))
                        graph.add((page_startsAt, hto.hadMember, term_ref))

                        ## endsAt
                        page_endsAt= URIRef("https://w3id.org/hto/Page/"+ str(df_entry["MMSID"])+"_"+str(df_entry["volumeId"])+"_"+str(df_entry["endsAt"]))
                        graph.add((page_endsAt, RDF.type, hto.Page))
                        graph.add((page_endsAt, hto.number, Literal(df_entry["endsAt"], datatype=XSD.int)))
                        graph.add((volume_ref, hto.hadMember, page_endsAt))
                        graph.add((term_ref, hto.endsAtPage, page_endsAt))
                        graph.add((page_endsAt, RDF.type, hto.WorkCollection))
                        graph.add((page_endsAt, hto.hadMember, term_ref))

                        new_term_count += 1
                        # new term, add new page, new description, new source

                    if "note" in df_entry:
                        note = df_entry["note"]
                        if note != 0:
                            graph.add((term_ref, hto.note, Literal(note, datatype=XSD.string)))

                    if "alter_names" in df_entry:
                        alter_names = df_entry["alter_names"]
                        for alter_name in alter_names:
                            graph.add((term_ref, hto.name, Literal(alter_name, datatype=XSD.string)))

                    # Create original description instance
                    if description != "":
                        term_original_description = URIRef("https://w3id.org/hto/OriginalDescription/" + term_id + agent)
                        graph.add((term_original_description, RDF.type, hto.OriginalDescription))
                        text_quality = hto.Low
                        if agent == "Ash":
                            text_quality = hto.Moderate
                        elif agent == "NCKP":
                            text_quality = hto.High
                        graph.add((term_original_description, hto.hasTextQuality, text_quality))
                        # graph.add((term_original_description, hto.numberOfWords, Literal(df_entry["numberOfWords"], datatype=XSD.int)))
                        graph.add((term_original_description, hto.text, Literal(description, datatype=XSD.string)))
                        graph.add((term_ref, hto.hasOriginalDescription, term_original_description))
                        link_entity_with_software(graph, term_original_description, "description", agent)

                        # Create source entity where original description was extracted
                        # source location
                        # source_path_name = df_entry["altoXML"]
                        # source_path_ref = URIRef("https://w3id.org/eb/Location/" + source_path_name)
                        # graph.add((source_path_ref, RDF.type, PROV.Location))
                        # source
                        file_path = str(df_entry["filePath"])
                        source_ref = get_source_ref(file_path, agent)
                        graph.add((source_ref, RDF.type, hto.InformationResource))
                        graph.add((source_ref, PROV.value, Literal(file_path, datatype=XSD.string)))
                        graph.add((eb_dataset, hto.hadMember, source_ref))
                        graph.add((source_ref, PROV.wasAttributedTo, agent_uri))
                        link_entity_with_software(graph, source_ref, "source", agent)

                        #graph.add((source_ref, PROV.atLocation, source_path_ref))
                        # related agent and activity

                        """
                        source_digitalising_activity = URIRef("https://w3id.org/eb/Activity/nls_digitalising_activity" + source_name)
                        graph.add((source_digitalising_activity, RDF.type, PROV.Activity))
                        graph.add((source_digitalising_activity, PROV.generated, source_ref))
                        graph.add((source_digitalising_activity, PROV.wasAssociatedWith, nls))
                        graph.add((source_ref, PROV.wasGeneratedBy, source_digitalising_activity))
                        """
                        graph.add((term_original_description, hto.wasExtractedFrom, source_ref))
    return graph
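
The term record URIs used in `dataframe_to_rdf` follow the pattern `<MMSID>_<volumeId>_<term name>_<index>`, where the index for a new record continues after the same-named terms already in the graph. The sketch below rebuilds that pattern with plain strings and shows how the ID can be recovered from an existing URI; `build_term_uri` and `term_id_from_uri` are hypothetical helper names, not functions from the notebook.

```python
HTO_BASE = "https://w3id.org/hto/"


def build_term_uri(mmsid, volume_id, term_uri_name, index, term_type="ArticleTermRecord"):
    """Assemble a term record URI from its parts, following the notebook's pattern."""
    term_id = f"{mmsid}_{volume_id}_{term_uri_name}_{index}"
    return HTO_BASE + term_type + "/" + term_id


def term_id_from_uri(term_uri):
    """Return the last path segment of a term record URI, i.e. its ID."""
    return term_uri.rsplit("/", 1)[-1]


uri = build_term_uri(992277653804341, 144133901, "OR", 0)
print(uri)                    # the test URI used earlier in this notebook
print(term_id_from_uri(uri))  # 992277653804341_144133901_OR_0
```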

In [158]:
print(1)
# Ash Edition 1
agent = "Ash"
agent_uri = create_organization(graph, agent)
eb_text_dataset = create_eb_text_dataset(graph, agent_uri, agent)
# import data from 1st edition
df_1= pd.read_json('../source_dataframes/eb/ash_final_eb_1_dataframe_clean_Damon', orient="index")
graph = dataframe_to_rdf(df_1, graph, hto,  agent_uri, agent, eb_text_dataset)

print(7)
# NCKP Edition 7
agent = "NCKP"
agent_uri = create_organization(graph, agent)
eb_text_dataset = create_eb_text_dataset(graph, agent_uri, agent)
# import data from 7th edition
df_7 = pd.read_json('../source_dataframes/eb/nckp_final_eb_7_dataframe_clean_Damon', orient="index")
graph = dataframe_to_rdf(df_7, graph, hto,  agent_uri, agent, eb_text_dataset)

1
1
2
3
...
276
[output truncated]

In [159]:
description_pairs

[{'new': 'a term, among alchemists, for lead.',
  'graph': 'a term, among alchemifts, for lead,',
  'uri': 'https://w3id.org/hto/ArticleTermRecord/992277653804341_144133901_AABAM_0'},
 {'new': 'the name of a town and river in Swabia. It is also a name sometimes given to Aix-la-chapelle.',
  'graph': 'the name of a town and river in Swabia. It is also a name sometimes given toAix-la-chapelle.',
  'uri': 'https://w3id.org/hto/ArticleTermRecord/992277653804341_144133901_AACH_0'},
 {'new': 'the name of two rivers, one in the country of the Grisons in Switzerland, and the other in Dutch Brabant.',
  'graph': 'the name of two rivers, one in the country of the Grifons in Switzerland, and the other in Dutch',
  'uri': 'https://w3id.org/hto/ArticleTermRecord/992277653804341_144133901_AADE_0'},
 {'new': 'a small town and district in Westphalia.',
  'graph': 'a small town and diftrift in Weftphalia.',
  'uri': 'https://w3id.org/hto/ArticleTermRecord/992277653804341_144133901_AAHUS_0'},
 ...]

In [161]:
len(description_pairs)

13407
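
With this many matched pairs, manual review is only practical for the weakest matches. One possible approach, sketched under the assumption that the stricter `SequenceMatcher.ratio()` is a reasonable ranking score, is to sort the pairs by similarity and inspect the lowest-scoring ones first. The helper `rank_pairs_by_similarity` is hypothetical; `sample_pairs` copies the first four pairs shown in the output above.

```python
from difflib import SequenceMatcher

URI_PREFIX = "https://w3id.org/hto/ArticleTermRecord/992277653804341_144133901_"

sample_pairs = [
    {"new": "a term, among alchemists, for lead.",
     "graph": "a term, among alchemifts, for lead,",
     "uri": URI_PREFIX + "AABAM_0"},
    {"new": "the name of a town and river in Swabia. It is also a name sometimes given to Aix-la-chapelle.",
     "graph": "the name of a town and river in Swabia. It is also a name sometimes given toAix-la-chapelle.",
     "uri": URI_PREFIX + "AACH_0"},
    {"new": "the name of two rivers, one in the country of the Grisons in Switzerland, and the other in Dutch Brabant.",
     "graph": "the name of two rivers, one in the country of the Grifons in Switzerland, and the other in Dutch",
     "uri": URI_PREFIX + "AADE_0"},
    {"new": "a small town and district in Westphalia.",
     "graph": "a small town and diftrift in Weftphalia.",
     "uri": URI_PREFIX + "AAHUS_0"},
]


def rank_pairs_by_similarity(pairs):
    """Return (score, uri) tuples sorted from least to most similar."""
    scored = [(SequenceMatcher(None, p["new"], p["graph"]).ratio(), p["uri"]) for p in pairs]
    return sorted(scored)


ranked = rank_pairs_by_similarity(sample_pairs)
for score, uri in ranked:
    print(f"{score:.3f}  {uri}")
```

On the full list, the same call would be `rank_pairs_by_similarity(description_pairs)`, after which the first few dozen rows are the candidates most worth checking by eye.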

In [162]:
# Save the Graph in the RDF Turtle format
graph.serialize(format="turtle", destination="../results/hto_eb_total.ttl")

<Graph identifier=N7b6c8a3db7d248c88b7920827c835faf (<class 'rdflib.graph.Graph'>)>