# The Occurrences of Named Entities: Concept and Workflow
by Tina Chen & Adam Anderson (PI)


## Concept:
Citing the occurrence of a source is a well-established convention in scientific publishing. A citation in an accepted format (e.g. MLA) provides evidence of the exact reference to a given entity. Although this convention has been in place for centuries, there has been no attempt to make it comprehensive: most publications cite only the first, latest, or most relevant source on a subject. This limited approach has been sufficient for human scholarship, but it does not go far enough to enable automated entity detection in machine learning, primarily because named entities are ambiguous and are spelled differently across the languages in which they are published.
The result of this work will add valuable linked data for machine learning. Once these entities are properly cited, each reference can be included in Linked Data (e.g. Wikidata).

The following workflow establishes a comprehensive method for detecting and disambiguating named entities in a collection of publications, beginning with a batch of 20k documents. Each occurrence of an entity is counted and given a page number for formal citation purposes. Additional contextualizing methods will be added at a later stage to address any ambiguity in the resulting occurrences.

## Workflow:
* Dataset: The first batch included 20k documents. 
* Data Dictionary: 600 Geographic names (GN), both modern and ancient names for each site.

1. Count the ‘occurrences’ of the GNs in the dataset (done by Circle in GSoC 2022)
2. Collect the occurrences into a DataFrame for each GN 
3. Obtain page numbers for each occurrence
4. Make an equivalency to the actual page numbers of the document
5. Provide the bibliographic data for each document
6. Format with Wikidata properties and item IDs
7. Export dataset to Zenodo (CSV, JSON-LD, BibTeX)
8. Export dataset to Wikidata using QuickStatements (CSV)
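
Steps 1 to 3 of this list can be illustrated on a toy document. The following is a minimal sketch only, not the GSoC implementation: `count_occurrences` and the sample pages are hypothetical, and it assumes plain case-sensitive substring matching on a list of page texts.

```python
from collections import Counter

def count_occurrences(pages, names):
    """Return ({name: total count}, {name: [page indices]}) for one document.

    `pages` is a list of page texts (a stand-in for the OCR output);
    `names` is a list of geographic names to look for.
    """
    totals = Counter()
    page_hits = {name: [] for name in names}
    for page_index, text in enumerate(pages):
        for name in names:
            n = text.count(name)  # plain substring count, case-sensitive
            if n:
                totals[name] += n
                page_hits[name].append(page_index)
    return dict(totals), page_hits

# Made-up page texts for illustration.
pages = ["Ur and Uruk were cities.", "No names here.", "Uruk again; Uruk once more."]
totals, page_hits = count_occurrences(pages, ["Uruk", "Nippur"])
print(totals)     # {'Uruk': 3}
print(page_hits)  # {'Uruk': [0, 2], 'Nippur': []}
```

Substring matching of this kind would also count a short name such as "Ur" inside "Uruk", which is one form of the ambiguity that the later contextualizing methods are meant to address.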


In [None]:
!pip3 install geojson
!pip3 install shapely  # "shapely.constructive" is not a package on PyPI; install shapely itself
!pip install geopandas
import pandas as pd
import numpy as np
import csv 
import plotly.express as px
import geopandas as gpd
import json
import requests
%matplotlib inline
from shapely.geometry import Point
from geopandas import datasets, GeoDataFrame, read_file

In [None]:
from google.colab import drive
drive.mount('/content/drive')
#workdir = '/content/drive/MyDrive/Sumerian Network' # for Tina
workdir = '/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/' # for Adam

Mounted at /content/drive


### I. LoD Dictionary: FactGrid
The first step is to load the Linked Open Data (LoD) dictionary of Geographic Names (GN) that is used to find their occurrences. We work from the FactGrid dataset because it includes labels for both the modern and the ancient name of each GN, along with links to Wikidata.

In [None]:
factgrid_merge = pd.read_csv("/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/factgrid_merge.csv")
factgrid_merge.head()

Unnamed: 0.2,Unnamed: 0.1,ancientplace,coord,namehistory,cdli2,pleiades,Unnamed: 0,Double record?,qid,Sarwiki,...,wik_en_y,wik_ara_y,wik_fas_y,wik_gre_y,wik_heb_y,wik_tr_y,geometry,_merge,P402_(Wikidata),P10689_(Wikidata)
0,0,https://database.factgrid.de/entity/Q389901,Point(43.2304 35.5931),تل حويش,95.0,,58.0,,Q389901,,...,,,,,,,POINT (43.2305 35.5931),both,,
1,1,https://database.factgrid.de/entity/Q389900,Point(41.166 36.816),تل حميدي,,874740.0,303.0,,Q389900,,...,0,,,,,,POINT (41.1661 36.8161),both,567216082.0,node
2,2,https://database.factgrid.de/entity/Q389898,Point(40.0399 36.8268),تل حلف,,874739.0,289.0,,Q389898,,...,,,,,,,POINT (40.04 36.8269),both,573540773.0,node
3,3,https://database.factgrid.de/entity/Q389892,Point(45.7032 31.8254),تل جدر,318.0,912957.0,195.0,,Q389892,,...,0,,,,,,POINT (45.7033 31.8255),both,,
4,4,https://database.factgrid.de/entity/Q389891,Point(40.5872 36.7381),تل بيدر,260.0,423885388.0,155.0,,Q389891,تل_بيدر,...,https://en.wikipedia.org/wiki/Tell_Beydar,https://ar.wikipedia.org//wiki/تل_بيدر,,,,,POINT (40.5873 36.7382),both,226879817.0,way


### II. Occurrence List
The next step is to obtain the list of documents and their 'occurences.csv' files, which were created by counting the occurrences of each GN in the 20k-document dataset (in Google Drive).

In [None]:
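ocurrance_list = pd.read_csv("/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/catalog_1.csv")
# Keep only the catalog rows that point at an occurences.csv file
ocurrance_list = ocurrance_list.loc[ocurrance_list["filename"] == "occurences.csv"]
# reset_index() keeps the old index as a column, so "ocr-output" and "filename" sit at positions 1 and 2
ocurrance_list = ocurrance_list.reset_index()
ocurrance_list.reset_index()  # display only; this adds the level_0 column shown in the output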

Unnamed: 0,level_0,index,ocr-output,filename
0,0,2,ocr-output/0_Attinger - A propos de AK «faire»...,occurences.csv
1,1,7,"ocr-output/10000_JCS 19, Borger, Aufstieg NB R...",occurences.csv
2,2,11,"ocr-output/10000_JCS 19, Borger, Aufstieg NB R...",occurences.csv
3,3,15,ocr-output/10001_Soden,occurences.csv
4,4,19,ocr-output/10003_Conservation the Core of Arch...,occurences.csv
...,...,...,...,...
20907,20907,83691,ocr-output/9997_Coleman - A History Of Politic...,occurences.csv
20908,20908,83695,ocr-output/9998_Yun 2008 diss Tell Fekheriyeh,occurences.csv
20909,20909,83699,ocr-output/999_41103868,occurences.csv
20910,20910,83703,ocr-output/99_Fensham--Some Remarks on the Fir...,occurences.csv


In [None]:
# occur1 = pd.read_csv(f"/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{ocurrance_list.iloc[0,1]}/{ocurrance_list.iloc[0,2]}")
# occur1["path"] = f"/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{ocurrance_list.iloc[0,1]}/page.csv"
# not_exist = pd.read_csv(f'/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/ocr-output/{name} /{ocurrance_list.iloc[3660,2]}')
# not_exist

In [None]:
# try:
#   tbl = pd.read_csv(f"/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{name}/{ocurrance_list.iloc[0,2]}")
#   tbl["path"] = f"{ocurrance_list.iloc[0,1]}/page.csv"
# except FileNotFoundError:
#   tbl = pd.read_csv(f"/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{name} /{ocurrance_list.iloc[0,2]}")
#   tbl["path"] = f"{ocurrance_list.iloc[0,1]} /page.csv"
# tbl

In [None]:
# occur1 = pd.DataFrame()
# for i in range(len(ocurrance_list)):
#   try:
#     tbl = pd.read_csv(f"/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{ocurrance_list.iloc[i,1]}/{ocurrance_list.iloc[i,2]}")
#     tbl["path"] = f"{ocurrance_list.iloc[i,1]}/page.csv"
#   except FileNotFoundError:
#     try:
#       tbl = pd.read_csv(f"/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{ocurrance_list.iloc[i,1]} /{ocurrance_list.iloc[i,2]}")
#       tbl["path"] = f"{ocurrance_list.iloc[i,1]} /page.csv"
#     except FileNotFoundError:
#       print(f"{ocurrance_list.iloc[i,0]}/{ocurrance_list.iloc[i,1]}")
#   occur1 = pd.concat([occur1,tbl])
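
The commented-out loop above retries each folder name with a trailing space when the first path is not found. A minimal sketch of the same fallback as a helper; `resolve_occurrence_file` and the folder names are hypothetical, and the demonstration assumes a file system that allows trailing spaces in directory names.

```python
from pathlib import Path
import tempfile

def resolve_occurrence_file(root, folder, filename="occurences.csv"):
    """Return the first existing candidate path, or None.

    Tries `folder` as given and then with a trailing space.
    """
    for candidate in (folder, folder + " "):
        path = Path(root) / candidate / filename
        if path.is_file():
            return path
    return None

# Tiny demonstration in a temporary directory (names are made up).
with tempfile.TemporaryDirectory() as root:
    odd_folder = Path(root) / "demo_doc "   # note the trailing space
    odd_folder.mkdir()
    (odd_folder / "occurences.csv").write_text("id,total_occurences\n21,57\n")

    found = resolve_occurrence_file(root, "demo_doc")
    missing = resolve_occurrence_file(root, "no_such_folder")
    found_name = found.parent.name if found else None

print(repr(found_name))  # 'demo_doc '
print(missing)           # None
```

Returning `None` lets the caller skip a document that cannot be found, rather than concatenating whatever table was read in the previous iteration.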

In [None]:
# occur1.to_csv("/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/occur_sum.csv")

In [None]:
fact = factgrid_merge[["qid","Len"]]
fact

Unnamed: 0,qid,Len
0,Q389901,Tall Ḥuwaysh
1,Q389900,Tall Ḥamīdī
2,Q389898,Guzana
3,Q389892,Tall Jidar
4,Q389891,Tell Beydar
...,...,...
603,Q390009,Kānī Shāyah
604,Q390019,Nigūb
605,Q390073,Ziyaret Tepe
606,Q390044,Kuşaklı


### III. Occurrences Sum
Here we load the complete list of occurrences. Each row pairs a document with the ancient and modern name of a site, and many rows have a count of zero. We will remove those zero-occurrence rows in a later step, which leaves the full dataset of occurrences along with the directory paths and names of the documents.

In [None]:
ocurr_sum = pd.read_csv("/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/occur_sum.csv")
ocurr_sum

Unnamed: 0.1,Unnamed: 0,id,provenience,ancient_name,modern_name,total_occurences,path
0,0,21,Ur (mod. Tell Muqayyar),Ur,Tell Muqayyar,57,ocr-output/0_Attinger - A propos de AK «faire»...
1,1,149,Me-Turran (mod. Tell Haddad),Me-Turran,Tell Haddad,21,ocr-output/0_Attinger - A propos de AK «faire»...
2,2,22,Nippur (mod. Nuffar),Nippur,Nuffar,4,ocr-output/0_Attinger - A propos de AK «faire»...
3,3,303,Isin (mod. Bahriyat),Isin,Bahriyat,3,ocr-output/0_Attinger - A propos de AK «faire»...
4,4,105,Uruk (mod. Warka),Uruk,Warka,2,ocr-output/0_Attinger - A propos de AK «faire»...
...,...,...,...,...,...,...,...
7653787,361,137,uncertain (mod. Chogha Mish),uncertain,Chogha Mish,0,ocr-output/9_van den Hout (2006) - Life and Ti...
7653788,362,136,Ašnakkum (mod. Chagar Bazar),Ašnakkum,Chagar Bazar,0,ocr-output/9_van den Hout (2006) - Life and Ti...
7653789,363,135,Dur-Katlimmu (mod. Tall Shekh Hamad),Dur-Katlimmu,Tall Shekh Hamad,0,ocr-output/9_van den Hout (2006) - Life and Ti...
7653790,364,134,Kar-Nabu (mod. uncertain),Kar-Nabu,uncertain,0,ocr-output/9_van den Hout (2006) - Life and Ti...
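
The table above has more than seven million rows, many of them with a count of zero. If memory becomes a constraint, one option is to filter while reading. A minimal sketch using a small in-memory CSV as a stand-in for `occur_sum.csv`; the sample values and the chunk size are illustrative assumptions.

```python
import io
import pandas as pd

# Small in-memory stand-in for occur_sum.csv (values are illustrative).
csv_text = """id,ancient_name,modern_name,total_occurences,path
21,Ur,Tell Muqayyar,57,doc_a/page.csv
137,uncertain,Chogha Mish,0,doc_a/page.csv
22,Nippur,Nuffar,4,doc_b/page.csv
134,Kar-Nabu,uncertain,0,doc_b/page.csv
"""

kept = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    kept.append(chunk.loc[chunk["total_occurences"] != 0])

nonzero = pd.concat(kept, ignore_index=True)
print(nonzero[["ancient_name", "total_occurences"]])
```

On the real file a much larger chunk size would be used; the filter itself is the same comparison the notebook applies to the merged tables.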


In [None]:
# occur1.groupby("path").count()

### IV. Modern names
Here we match the FactGrid labels (`Len`) against the modern names and list the occurrence counts for each document.

In [None]:
sum_tbl_modern = fact.merge(ocurr_sum, left_on = "Len", right_on = "modern_name")
sum_tbl_modern.sort_values(by = "total_occurences").drop_duplicates()

Unnamed: 0.1,qid,Len,Unnamed: 0,id,provenience,ancient_name,modern_name,total_occurences,path
0,Q390165,Ozbaki,204,64,uncertain (mod. Ozbaki),uncertain,Ozbaki,0,ocr-output/0_Attinger - A propos de AK «faire»...
570821,Q390036,Kültepe,17,291,Kanesh (mod. Kültepe),Kanesh,Kültepe,0,ocr-output/16944_Lambert 1985 The pair Lahmu-L...
570823,Q390036,Kültepe,30,291,Kanesh (mod. Kültepe),Kanesh,Kültepe,0,"ocr-output/16945_CANE, Green, Iconography/page..."
570824,Q390036,Kültepe,22,291,Kanesh (mod. Kültepe),Kanesh,Kültepe,0,ocr-output/16947_0415149282/page.csv
570825,Q390036,Kültepe,24,291,Kanesh (mod. Kültepe),Kanesh,Kültepe,0,ocr-output/16948_zava/page.csv
...,...,...,...,...,...,...,...,...,...
277169,Q389820,Babylon,1,200,Bābili (mod. Babylon),Bābili,Babylon,846,ocr-output/1558_Heimpel 2003 Letters to the Ki...
616709,Q390059,Harran,0,202,Harran (mod. Harran),Harran,Harran,888,"ocr-output/26935_CHANE 10 Holloway, Assur is k..."
278413,Q389820,Babylon,0,200,Bābili (mod. Babylon),Bābili,Babylon,891,ocr-output/17557_Jursa_2010_Aspects_of_the_Eco...
567727,Q390036,Kültepe,0,291,Kanesh (mod. Kültepe),Kanesh,Kültepe,891,ocr-output/12820_JCSSupp4_2014 (5)/page.csv


### V. Ancient names
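Here we first drop the zero-occurrence rows from the modern-name table. We then match the FactGrid labels against the ancient names as they were identified in each document, and finally combine both tables.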

In [None]:
sum_tbl_modern = sum_tbl_modern.loc[sum_tbl_modern["total_occurences"] != 0]
sum_tbl_modern

Unnamed: 0.1,qid,Len,Unnamed: 0,id,provenience,ancient_name,modern_name,total_occurences,path
711,Q390165,Ozbaki,96,64,uncertain (mod. Ozbaki),uncertain,Ozbaki,2,"ocr-output/10641_Ebeling, E/page.csv"
712,Q390165,Ozbaki,96,64,uncertain (mod. Ozbaki),uncertain,Ozbaki,2,"ocr-output/10641_Ebeling, E/page.csv"
1852,Q390165,Ozbaki,69,64,uncertain (mod. Ozbaki),uncertain,Ozbaki,1,ocr-output/11693_Mieroop - A History of the An...
1853,Q390165,Ozbaki,69,64,uncertain (mod. Ozbaki),uncertain,Ozbaki,1,ocr-output/11693_Mieroop - A History of the An...
6789,Q390165,Ozbaki,25,64,uncertain (mod. Ozbaki),uncertain,Ozbaki,1,ocr-output/17974_FrancfMarha-4/page.csv
...,...,...,...,...,...,...,...,...,...
856223,Q390044,Kuşaklı,3,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,2,"ocr-output/8944_TAS, KARAKIZ/page.csv"
856522,Q390044,Kuşaklı,0,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,3,"ocr-output/9205_Melchert, H/page.csv"
856523,Q390044,Kuşaklı,0,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,3,"ocr-output/9205_Melchert, H/page.csv"
856606,Q390044,Kuşaklı,4,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,1,ocr-output/9281_ArAn8-KUB 5/page.csv


In [None]:
sum_tbl_ancient = fact.merge(ocurr_sum, left_on = "Len", right_on = "ancient_name")
sum_tbl_ancient.sort_values(by = "total_occurences").drop_duplicates()

Unnamed: 0.1,qid,Len,Unnamed: 0,id,provenience,ancient_name,modern_name,total_occurences,path
0,Q389898,Guzana,365,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,0,ocr-output/0_Attinger - A propos de AK «faire»...
728921,Q390014,Kish,178,391,Kish (mod. Tell el-Bender),Kish,Tell el-Bender,0,"ocr-output/1440_Moor, Johannes C de - The poet..."
728922,Q390014,Kish,224,85,Kish (mod. Tell Ingharra),Kish,Tell Ingharra,0,"ocr-output/1440_Moor, Johannes C de - The poet..."
728923,Q390014,Kish,319,184,Kish (mod. Tell Uhaimir),Kish,Tell Uhaimir,0,"ocr-output/1440_Moor, Johannes C de - The poet..."
728924,Q390014,Kish,117,361,Kish (mod. Tell Barguthiat),Kish,Tell Barguthiat,0,ocr-output/14414_Attinger - La malédiction d'...
...,...,...,...,...,...,...,...,...,...
160183,Q390001,Assur,0,211,Assur (mod. Qalat Sherqat),Assur,Qalat Sherqat,2335,ocr-output/4582_PNA 1/page.csv
26054,Q389896,Mari,2,161,Mari (mod. Tell Hariri),Mari,Tell Hariri,2494,ocr-output/15347_AfO Register 1974-2004_bea2/p...
150108,Q390001,Assur,0,211,Assur (mod. Qalat Sherqat),Assur,Qalat Sherqat,2896,ocr-output/13455_PNA 2/page.csv
172438,Q389938,Ebla,1,79,Ebla (mod. Tell Mardikh),Ebla,Tell Mardikh,2959,ocr-output/15347_AfO Register 1974-2004_bea2/p...


In [None]:
factgrid_sum = pd.concat([sum_tbl_ancient, sum_tbl_modern], axis = 0)
factgrid_sum = factgrid_sum.drop_duplicates()
factgrid_sum = factgrid_sum.loc[factgrid_sum["total_occurences"] != 0]
factgrid_sum["Pages"] = ""
factgrid_sum = factgrid_sum.drop_duplicates().reset_index(drop = True)
factgrid_sum

Unnamed: 0.1,qid,Len,Unnamed: 0,id,provenience,ancient_name,modern_name,total_occurences,path,Pages
0,Q389898,Guzana,19,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,16,ocr-output/10013_akkermansschwartz/page.csv,
1,Q389898,Guzana,8,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,11,ocr-output/10035_Radner_2014_State_Corresponde...,
2,Q389898,Guzana,4,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,24,ocr-output/10048_SAAS11 = Mattila_2000_Magnate...,
3,Q389898,Guzana,25,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,1,ocr-output/10085_Allred PhD2006_é/page.csv,
4,Q389898,Guzana,25,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,10,ocr-output/10088_cad_p/page.csv,
...,...,...,...,...,...,...,...,...,...,...
67885,Q390044,Kuşaklı,4,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,1,"ocr-output/7644_Kulakoglu, F/page.csv",
67886,Q390044,Kuşaklı,3,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,2,"ocr-output/8944_TAS, KARAKIZ/page.csv",
67887,Q390044,Kuşaklı,0,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,3,"ocr-output/9205_Melchert, H/page.csv",
67888,Q390044,Kuşaklı,4,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,1,ocr-output/9281_ArAn8-KUB 5/page.csv,


### VI. Occurrence page count
Once all the possible names have been collected for each toponym, along with their occurrences in the documents, we can list the page numbers for each occurrence.

**Note** that this code cell takes at least 3 hours to complete. You may need to allow Chrome to run in the background for it to finish.
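
One possible way to shorten the run time is to let pandas test a whole `text` column at once instead of looping over the rows in Python. A minimal sketch under that assumption; `pages_mentioning` and the sample table are hypothetical, and it has not been checked against the real data.

```python
import pandas as pd

def pages_mentioning(page_tbl, names):
    """Return row labels of `page_tbl` whose `text` contains any of `names`.

    Uses literal (non-regex) substring matching; missing text counts as no match.
    """
    mask = pd.Series(False, index=page_tbl.index)
    for name in set(names):
        mask |= page_tbl["text"].str.contains(name, regex=False, na=False)
    return page_tbl.index[mask.to_numpy()].tolist()

# Illustrative stand-in for one document's page.csv.
page_tbl = pd.DataFrame({"text": [
    "Guzana lies on the Khabur.",
    None,
    "Tell Halaf was excavated.",
    "Nothing relevant.",
]})
hits = pages_mentioning(page_tbl, ["Guzana", "Guzana", "Tell Halaf"])
print(hits)                             # [0, 2]
print(" ".join(str(i) for i in hits))   # 0 2
```

Two design points are worth weighing. Skipping `dropna()` keeps the row labels equal to the row positions in `page.csv`. Placeholder values such as `uncertain`, which appear in the name columns above, would need to be excluded from `names`, since they would otherwise match ordinary prose.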

In [None]:
factgrid_sum_with_pages = factgrid_sum.copy()
for j in range(len(factgrid_sum_with_pages)):
  # Column 8 is "path": the page.csv with the OCR text of this document
  tbl = pd.read_csv(f'/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{factgrid_sum_with_pages.iloc[j,8]}')
  tbl = tbl.dropna()
  result = ''
  try:
    for i in range(len(tbl)):
      # i is the row position in the filtered page table, not yet the printed page number
      a = tbl.iloc[i,:]["text"]
      if factgrid_sum_with_pages.iloc[j,:]["Len"] in a or factgrid_sum_with_pages.iloc[j,:]["ancient_name"] in a or factgrid_sum_with_pages.iloc[j,:]["modern_name"] in a:
        result += f'{str(i)} '
  except (TypeError, OSError):  # a tuple is required; `TypeError or OSError` catches only TypeError
      print(j)
  factgrid_sum_with_pages.loc[j,"Pages"] = result
factgrid_sum_with_pages

Unnamed: 0.1,qid,Len,Unnamed: 0,id,provenience,ancient_name,modern_name,total_occurences,path,Pages
0,Q389898,Guzana,19,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,16,ocr-output/10013_akkermansschwartz/page.csv,4 13 58 66 76 78 196 201 202 213 229 238
1,Q389898,Guzana,8,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,11,ocr-output/10035_Radner_2014_State_Corresponde...,
2,Q389898,Guzana,4,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,24,ocr-output/10048_SAAS11 = Mattila_2000_Magnate...,11 41 65 68 82 89 94 97 98 99 122 131 132 147 ...
3,Q389898,Guzana,25,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,1,ocr-output/10085_Allred PhD2006_é/page.csv,
4,Q389898,Guzana,25,637,Guzana (mod. Tell Halaf),Guzana,Tell Halaf,10,ocr-output/10088_cad_p/page.csv,
...,...,...,...,...,...,...,...,...,...,...
67885,Q390044,Kuşaklı,4,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,1,"ocr-output/7644_Kulakoglu, F/page.csv",
67886,Q390044,Kuşaklı,3,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,2,"ocr-output/8944_TAS, KARAKIZ/page.csv",
67887,Q390044,Kuşaklı,0,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,3,"ocr-output/9205_Melchert, H/page.csv",
67888,Q390044,Kuşaklı,4,190,uncertain (mod. Kuşaklı),uncertain,Kuşaklı,1,ocr-output/9281_ArAn8-KUB 5/page.csv,
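
The `Pages` column above holds row positions in each `page.csv`, which workflow step 4 still has to map to the printed page numbers of the document. A minimal sketch of one possible mapping, assuming one row per scanned page and a constant offset; `to_printed_pages`, `first_printed_page` and `front_matter_rows` are hypothetical names, not part of the original notebook.

```python
def to_printed_pages(row_indices, first_printed_page=1, front_matter_rows=0):
    """Map page.csv row positions to printed page numbers.

    Rows that fall in the front matter (before the first numbered page)
    are returned as None.
    """
    printed = []
    for idx in row_indices:
        if idx < front_matter_rows:
            printed.append(None)
        else:
            printed.append(first_printed_page + (idx - front_matter_rows))
    return printed

# "4 13 58" is the kind of space-separated string stored in the Pages column.
indices = [int(tok) for tok in "4 13 58".split()]
printed = to_printed_pages(indices, first_printed_page=1, front_matter_rows=10)
print(printed)  # [None, 4, 49]
```

A constant offset will not hold for every document (unnumbered plates or journal offprints that start at a high page number, for example), so the offset would need to be recorded per document.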


### VII. Save file as CSV and Pickle
Lastly, we save the resulting data table as a CSV. We can also make a pickle file for ongoing use so we won't have to run this again (3 hours).

In [None]:
factgrid_sum_with_pages.to_csv('/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/occurance_pagecount.csv')
factgrid_sum_with_pages.to_pickle('/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/geography/occurance_pagecount.p')
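
In a later session the pickle can be read back instead of rerunning the page count. A minimal sketch using a temporary directory and a made-up two-row table in place of the Drive path; the `page_list` column is a hypothetical addition.

```python
import os
import tempfile
import pandas as pd

# Illustrative table in the shape of the saved result.
tbl = pd.DataFrame({"qid": ["Q389898", "Q389898"], "Pages": ["4 13 58 ", ""]})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "occurance_pagecount.p")
    tbl.to_pickle(pickle_path)
    reloaded = pd.read_pickle(pickle_path)

# Turn the space-separated string into a list of integers per row.
reloaded["page_list"] = reloaded["Pages"].apply(lambda s: [int(tok) for tok in s.split()])
print(reloaded["page_list"].tolist())  # [[4, 13, 58], []]
```

The pickle preserves empty strings in `Pages`, whereas reading the CSV back with default settings turns them into missing values. Pickle files should only be loaded from trusted sources.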

In [None]:
# j = 2
# tbl = pd.read_csv( f'/content/drive/MyDrive/FactGrid Cuneiform (AWCA)/ocr/{factgrid_sum_with_pages.iloc[j,8]}'
#                     )
# result = ''
# for i in range(len(tbl)):
#   try:
#     a = tbl.iloc[i,:]["text"]
#     if factgrid_sum_with_pages.iloc[j,:]["Len"] in a or factgrid_sum_with_pages.iloc[j,:]["ancient_name"] in a or factgrid_sum_with_pages.iloc[j,:]["modern_name"] in a:
#       result += f'{str(i)} '
#   except TypeError:
#       print(j,a)
# factgrid_sum_with_pages.loc[j,"Pages"] = result
# factgrid_sum_with_pages