# Test: using OGSL to apply the Name Authority in ORACC
The Drehem name authority was built by extracting names from [BDTNS](http://bdtns.filol.csic.es/). Applying this name authority to the Ur III data set in [ePSD2](http://oracc.org/epsd2/admin/u3adm/pager) runs into the problem that the two projects follow (slightly) different transliteration conventions. For instance, where [BDTNS](http://bdtns.filol.csic.es/) writes "uru", the [ePSD2](http://oracc.org/epsd2/admin/u3adm/pager) transliterations (which derive from [CDLI](http://cdli.ucla.edu)) have "iri".

Similarly, transliterations of names in [CDLI](http://cdli.ucla.edu) (and therefore in [ePSD2](http://oracc.org/epsd2/admin/u3adm/pager)) are often inconsistent. **FIND GOOD EXAMPLE**

To work around these differences, the name instances found in the Ur III corpus can be reduced to sequences of sign names, which can then be compared to the sequences of sign names in the name authority.
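To illustrate the idea, the sketch below shows two spellings that differ only in the reading chosen for one sign collapsing to the same sign-name sequence. The mapping `toy_values` is a small hand-written stand-in for the OGSL-derived dictionary, the helper `to_sign_names` is hypothetical, and the spellings `uru-ki` and `iri-ki` are made up for the illustration.

```python
# Toy excerpt of a value -> sign name mapping; the real mapping is derived from OGSL.
toy_values = {"uru": "URU", "iri": "URU", "ki": "KI"}

def to_sign_names(translit, values):
    """Split a hyphenated transliteration and map each reading to its sign name."""
    return [values.get(sign, sign) for sign in translit.lower().split("-")]

bdtns_form = to_sign_names("uru-ki", toy_values)   # BDTNS-style reading
epsd2_form = to_sign_names("iri-ki", toy_values)   # ePSD2/CDLI-style reading
print(bdtns_form)
print(bdtns_form == epsd2_form)
```

Because the comparison is made on sign names, it does not matter which reading a project prefers, as long as both readings are listed under the same sign.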

First parse the [OGSL](http://oracc.org/ogsl) database with the notebook 1-parse_ogsl.ipynb. This results in a DataFrame that is pickled as "output/ogsl.p".

In [1]:
import pandas as pd
import pickle
import re

In [3]:
with open("output/ogsl.p", "rb") as p:
    ogsl_df = pd.read_pickle(p)
ogsl_df

Unnamed: 0,hex,name,utf8,value
0,x12000,A,𒀀,ʾu₄
1,x12000,A,𒀀,a
2,x12000,A,𒀀,aia₂
3,x12000,A,𒀀,aya₂
4,x12000,A,𒀀,barₓ
5,x12000,A,𒀀,buniŋₓ
6,x12000,A,𒀀,burₓ
7,x12000,A,𒀀,dur₅
8,x12000,A,𒀀,duru₅
9,x12000,A,𒀀,e₄


Create a dictionary that maps each sign value (reading) to its sign name.

In [4]:
val = list(ogsl_df["value"])
names = list(ogsl_df["name"])
d = dict(zip(val,names))
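Note that `dict(zip(...))` keeps only the last sign name for any value that occurs more than once in the `value` column. Whether the OGSL data contains such values is not checked in this notebook; the sketch below is one possible way to find out. The helper `ambiguous_values` and the rows in `toy_pairs` are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def ambiguous_values(pairs):
    """Return {value: sorted sign names} for values listed under more than one sign."""
    seen = defaultdict(set)
    for value, name in pairs:
        seen[value].add(name)
    return {v: sorted(n) for v, n in seen.items() if len(n) > 1}

# Made-up rows for illustration only; with the real data one would pass
# zip(ogsl_df["value"], ogsl_df["name"]).
toy_pairs = [("a", "A"), ("e₄", "A"), ("foo", "SIGN1"), ("foo", "SIGN2")]
print(ambiguous_values(toy_pairs))
```

An empty result would mean that the dictionary loses no information.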

In [5]:
file = "name_authority/Drehem_name_authority.atf"
with open(file, "r", encoding="utf-8") as f:
    z = f.readlines()
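# collapse runs of tabs, then split each line into its tab-separated fields
y = [re.sub(r"\t+", "\t", line).strip().split('\t') for line in z]
del y[0]  # drop the first line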
y

[['0', 'A-AN-ba-az', 'A.AN.ba.az[]PN', 'MVN 13 464 r 10 (copy/photo)'],
 ['1', 'A-KU-um', 'A.KU.um[]PN', 'Aegyptus 10, 270 27 o 7 (copy)'],
 ['2', 'A-KU.KU-ta', 'A.KU.KU[]PN', "AnOr 12 277 o i 17' (copy)"],
 ['3', 'A-NI-ta', 'A.NI[]PN', 'Babyl. 7 pl. 22 18 o 3 (copy)'],
 ['4',
  'A-U.E2-nu-tuku',
  'A.U.E₂.nu.tuku[]PN',
  'AnOr 07 150 o 2: A-U.KID-nu-tuku IŠ (copy/photo)'],
 ['5', 'A-a-', 'unkn', '5654 possibilities'],
 ['6', 'A-a-bad3', 'Ayabad[]PN'],
 ['7', 'A-a-ce3', 'Aya[]PN'],
 ['8', 'A-a-dingir', 'Ayadiŋir[]PN'],
 ['9', 'A-a-dingir-mu', 'Ayadiŋirŋu[]PN'],
 ['10', 'A-a-dingir-mu-ta', 'Ayadiŋirŋu[]PN'],
 ['11', 'A-a-dingir-ta', 'Ayadiŋir[]PN'],
 ['12', 'A-a-i3-li2', 'Abuʾili[]PN'],
 ['13', 'A-a-i3-li2-cu', 'Abuʾilišu[]PN', 'uncertain (Ayailišu?)'],
 ['14', 'A-a-kal-la', 'Ayakala[]PN'],
 ['15', 'A-a-kal-la-mu', 'Ayakalaŋu[]PN'],
 ['16', 'A-a-kal-la-ta', 'Ayakala[]PN'],
 ['17', 'A-a-ma', 'Ayama[]PN', 'hypo?'],
 ['18', 'A-a-mu', 'Ayaŋu[]PN'],
 ['19', 'A-a-mu-ce3', 'Ayaŋu[]PN'],
 ['20'

In [6]:
cols = ["index", "translit_bdtns", "cf_oracc", "notes"]
na_df = pd.DataFrame(y, columns = cols).fillna("")
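Entries without a note have only three fields, so the DataFrame constructor pads them and `fillna("")` turns the padding into empty strings. The same parsing and padding can be sketched in plain Python as follows. The helper `parse_line` is hypothetical, and the tab layout of the sample lines is an assumption modelled on the output above.

```python
import re

def parse_line(line, width=4):
    """Collapse runs of tabs, split into fields, and pad short rows with ''."""
    fields = re.sub(r"\t+", "\t", line).strip().split("\t")
    return fields + [""] * (width - len(fields))

# Sample lines modelled on the output above; the exact tab layout is assumed.
short_row = parse_line("6\tA-a-bad3\t\tAyabad[]PN\n")
full_row = parse_line("13\tA-a-i3-li2-cu\tAbuʾilišu[]PN\tuncertain (Ayailišu?)\n")
print(short_row)
print(full_row)
```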

In [7]:
na_df = na_df.drop(["index"], axis=1)
na_df

Unnamed: 0,translit_bdtns,cf_oracc,notes
0,A-AN-ba-az,A.AN.ba.az[]PN,MVN 13 464 r 10 (copy/photo)
1,A-KU-um,A.KU.um[]PN,"Aegyptus 10, 270 27 o 7 (copy)"
2,A-KU.KU-ta,A.KU.KU[]PN,AnOr 12 277 o i 17' (copy)
3,A-NI-ta,A.NI[]PN,Babyl. 7 pl. 22 18 o 3 (copy)
4,A-U.E2-nu-tuku,A.U.E₂.nu.tuku[]PN,AnOr 07 150 o 2: A-U.KID-nu-tuku IŠ (copy/photo)
5,A-a-,unkn,5654 possibilities
6,A-a-bad3,Ayabad[]PN,
7,A-a-ce3,Aya[]PN,
8,A-a-dingir,Ayadiŋir[]PN,
9,A-a-dingir-mu,Ayadiŋirŋu[]PN,


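Some data cleaning: drop the entries marked `unkn` or `not PN`, then convert the ASCII conventions of BDTNS (`c`, `t,`, `s,`, index numbers) to their Unicode equivalents (`š`, `ṭ`, `ṣ`, subscript digits).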

In [8]:
na_df = na_df.loc[~(na_df["cf_oracc"] == "unkn")]
na_df = na_df.loc[~(na_df.cf_oracc.str.contains("not PN"))]

In [9]:
replace = {"c": "š" , "C" : "Š", "t," : "ṭ", "T," : "Ṭ", "s," : "ṣ", "S," : "Ṣ", "1" : "₁", "2" :  "₂", "3": "₃", "4" : "₄", "5": "₅", 
              "6" : "₆", "7" : "₇", "8" :  "₈", "9" : "₉", "0" : "₀", "x" : "×"}
na_df = na_df.replace({"translit_bdtns" : replace}, regex=True)
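The effect of these substitutions on a single string can be sketched in plain Python. None of the keys contains a regular-expression metacharacter, so plain string replacement should behave the same way as the regex-based replacement here; the helper `normalize` is hypothetical.

```python
# Same substitutions as in the `replace` dictionary, applied in order.
substitutions = {"c": "š", "C": "Š", "t,": "ṭ", "T,": "Ṭ", "s,": "ṣ", "S,": "Ṣ",
                 "1": "₁", "2": "₂", "3": "₃", "4": "₄", "5": "₅",
                 "6": "₆", "7": "₇", "8": "₈", "9": "₉", "0": "₀", "x": "×"}

def normalize(translit):
    """Convert BDTNS ASCII conventions in one transliteration to Unicode."""
    for old, new in substitutions.items():
        translit = translit.replace(old, new)
    return translit

print(normalize("A-a-ce3"))        # A-a-še₃
print(normalize("A-a-i3-li2-cu"))  # A-a-i₃-li₂-šu
```

Note that every `x` in a transliteration becomes `×`, not only those inside compound signs.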

# Create signs and sign_names columns

In [11]:
separators = ['{', '}', '-', '.', '+']
def separate(e):
    """Split a transliterated word into a list of lower-cased signs."""
    for s in separators:  # replace every separator by a space
        e = e.replace(s, ' ')
    return e.lower().split()

In [12]:
na_df["signs"] = na_df["translit_bdtns"].apply(separate)
# look up the name of each sign; signs that are not in the dictionary are kept as-is
na_df["sign_names"] = na_df["signs"].apply(lambda x: [d.get(s, s) for s in x])
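Signs that are not found in `d` pass through unchanged, so a failed lookup does not raise an error and is easy to miss. A minimal sketch for listing such tokens follows; the helper `unresolved` is hypothetical, and the toy data stands in for `d` and `na_df["signs"]`.

```python
from collections import Counter

def unresolved(sign_lists, values):
    """Count tokens that have no entry in the value -> sign name dictionary."""
    return Counter(s for signs in sign_lists for s in signs if s not in values)

toy_values = {"a": "A", "ku": "KU", "um": "UM"}               # stand-in for `d`
toy_signs = [["a", "ku", "um"], ["a", "xyz"], ["xyz", "um"]]  # stand-in for na_df["signs"]
missing = unresolved(toy_signs, toy_values)
print(missing.most_common())
```

With the real data the call would be `unresolved(na_df["signs"], d)`. The most frequent tokens in the result are the ones most worth checking against OGSL.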

In [13]:
na_df

Unnamed: 0,translit_bdtns,cf_oracc,notes,signs,sign_names
0,A-AN-ba-az,A.AN.ba.az[]PN,MVN 13 464 r 10 (copy/photo),"[a, an, ba, az]","[A, AN, BA, |PIRIG×ZA|]"
1,A-KU-um,A.KU.um[]PN,"Aegyptus 10, 270 27 o 7 (copy)","[a, ku, um]","[A, KU, UM]"
2,A-KU.KU-ta,A.KU.KU[]PN,AnOr 12 277 o i 17' (copy),"[a, ku, ku, ta]","[A, KU, KU, TA]"
3,A-NI-ta,A.NI[]PN,Babyl. 7 pl. 22 18 o 3 (copy),"[a, ni, ta]","[A, NI, TA]"
4,A-U.E₂-nu-tuku,A.U.E₂.nu.tuku[]PN,AnOr 07 150 o 2: A-U.KID-nu-tuku IŠ (copy/photo),"[a, u, e₂, nu, tuku]","[A, U, E₂, NU, TUK]"
6,A-a-bad₃,Ayabad[]PN,,"[a, a, bad₃]","[A, A, |EZEN×BAD|]"
7,A-a-še₃,Aya[]PN,,"[a, a, še₃]","[A, A, EŠ₂]"
8,A-a-dingir,Ayadiŋir[]PN,,"[a, a, dingir]","[A, A, AN]"
9,A-a-dingir-mu,Ayadiŋirŋu[]PN,,"[a, a, dingir, mu]","[A, A, AN, MU]"
10,A-a-dingir-mu-ta,Ayadiŋirŋu[]PN,,"[a, a, dingir, mu, ta]","[A, A, AN, MU, TA]"
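With the `sign_names` column in place, matching a name from the corpus amounts to a lookup on its sign-name sequence. The following is only a sketch of how such a lookup might work, not part of the notebook: `build_index` is a hypothetical helper, and `corpus_name` is an invented input.

```python
def build_index(rows):
    """Map each sign-name sequence (as a tuple) to the set of Oracc forms that share it."""
    index = {}
    for sign_names, cf in rows:
        index.setdefault(tuple(sign_names), set()).add(cf)
    return index

# Two rows copied from the table above; with the DataFrame one would pass
# zip(na_df["sign_names"], na_df["cf_oracc"]).
rows = [(["A", "A", "AN"], "Ayadiŋir[]PN"),
        (["A", "A", "AN", "MU"], "Ayadiŋirŋu[]PN")]
index = build_index(rows)

corpus_name = ["A", "A", "AN", "MU"]  # hypothetical sign-name sequence from the corpus
match = index.get(tuple(corpus_name), set())
print(match)
```

Tuples are used as keys because lists are not hashable. Collecting the forms in a set makes it visible when one sign sequence corresponds to more than one entry in the authority.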


In [16]:
with open("name_authority/Drehem_na.json", "w", encoding = "utf-8")  as j: 
    na_df.to_json(j, force_ascii=False, orient="table", index=False)

In [17]:
import json
with open("name_authority/Drehem_na.json", "r", encoding = "utf-8")  as j: 
    n = json.load(j)

In [28]:
n["authors"] = "Niek Veldhuis and John Carnahan"

In [31]:
n["license"] = "CC0; https://creativecommons.org/share-your-work/public-domain/cc0/; Open Domain"

In [32]:
n["website"] = "https://github.com/niekveldhuis/UrIII-names"

In [33]:
n["notes"] = "Based on the BDTNS (http://bdtns.filol.csic.es/) dataset, December 2016. Proper nouns in BDTNS are marked by initial capital. Proper nouns were extracted and normalized with a script, authored by Niek Veldhuis (https://github.com/niekveldhuis/UrIII-names). Drehem proper nouns were checked and hand-edited by John Carnahan."

In [36]:
with open("name_authority/Drehem_na2.json", "w", encoding = "utf-8") as j: 
    json.dump(n, j, ensure_ascii=False, sort_keys=True, indent=4, separators=(',', ': '))
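
The resulting file combines the pandas `orient="table"` layout (top-level `schema` and `data` keys) with the metadata keys added above. A minimal sketch of how a consumer might separate the two follows, using a hand-written and abbreviated stand-in for the file content; the variable names are hypothetical.

```python
import json

# Hand-written, abbreviated stand-in for Drehem_na2.json ("schema" is left empty here).
sample = """
{
    "authors": "Niek Veldhuis and John Carnahan",
    "data": [
        {"translit_bdtns": "A-KU-um", "cf_oracc": "A.KU.um[]PN",
         "notes": "Aegyptus 10, 270 27 o 7 (copy)",
         "signs": ["a", "ku", "um"], "sign_names": ["A", "KU", "UM"]}
    ],
    "schema": {"fields": []},
    "website": "https://github.com/niekveldhuis/UrIII-names"
}
"""
doc = json.loads(sample)

# Everything except "schema" and "data" is file-level metadata.
metadata = {k: v for k, v in doc.items() if k not in ("schema", "data")}
records = doc["data"]

print(sorted(metadata))
print(records[0]["sign_names"])
```

The top-level `notes` key describes the whole file and is distinct from the `notes` field inside each record.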