# Test: using OGSL to apply the Name Authority in ORACC
The Drehem name authority was made by extracting names from [BDTNS](http://bdtns.filol.csic.es/). Using this name authority for the Ur III data set in [ePSD2](http://oracc.org/epsd2/admin/u3adm/pager) runs into the problem of (slightly) different transliteration conventions. For instance, where [BDTNS](http://bdtns.filol.csic.es/) writes "uru", the [ePSD2](http://oracc.org/epsd2/admin/u3adm/pager) transliterations (which derive from [CDLI](http://cdli.ucla.edu)) have "iri".

Similarly, transliteration of names in [CDLI](http://cdli.ucla.edu) (and therefore in [ePSD2](http://oracc.org/epsd2/admin/u3adm/pager)) is often inconsistent. **FIND GOOD EXAMPLE**

The name instances found in the Ur III corpus, therefore, are reduced to sequences of sign names, which may be compared to sequences of sign names in the name authority.
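
As a minimal sketch of this idea, the "uru"/"iri" difference mentioned above disappears once both readings are replaced by their sign name. The small dictionary `toy_d` and the helper `to_sign_names` are hand-written for illustration only; the real mapping is built from OGSL further down.

```python
# Hand-written excerpt of a value -> sign-name mapping (illustration only;
# the full mapping is derived from the OGSL DataFrame).
toy_d = {"a": "A", "uru": "URU", "iri": "URU", "in": "IN"}

def to_sign_names(signs):
    """Join the sign names of a list of sign values with hyphens."""
    return "-".join(toy_d.get(s, s) for s in signs)

bdtns_form = to_sign_names(["a", "a", "uru", "in"])   # BDTNS convention
oracc_form = to_sign_names(["a", "a", "iri", "in"])   # CDLI/ORACC convention

print(bdtns_form, oracc_form, bdtns_form == oracc_form)
# A-A-URU-IN A-A-URU-IN True
```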

First, parse the [OGSL](http://oracc.org/ogsl) database with the notebook `1-parse_ogsl.ipynb`. This results in a DataFrame that is pickled as `output/ogsl.p`.

In [1]:
import pandas as pd
import pickle
import re
from collections import OrderedDict
import datetime
import json

In [2]:
with open("output/ogsl.p", "rb") as p:
    ogsl_df = pd.read_pickle(p)
ogsl_df

Unnamed: 0,hex,name,utf8,value
0,x12000,A,𒀀,ʾu₄
1,x12000,A,𒀀,a
2,x12000,A,𒀀,aia₂
3,x12000,A,𒀀,aya₂
4,x12000,A,𒀀,barₓ
5,x12000,A,𒀀,buniŋₓ
6,x12000,A,𒀀,burₓ
7,x12000,A,𒀀,dur₅
8,x12000,A,𒀀,duru₅
9,x12000,A,𒀀,e₄


Create a dictionary where the keys are sign values and the values are sign names. The resulting dictionary can be used to transform a sign reading (such as "buru₁₄") into a sign name ("EN×KAR₂@g"), in the following way:

```python
d["buru₁₄"]
```

In [3]:
val = list(ogsl_df["value"])
names = list(ogsl_df["name"])
d = dict(zip(val,names))
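
One property of `dict(zip(...))` is worth keeping in mind: if a value were listed under more than one sign name, only the last pair would survive. The sketch below shows one way to check for that on a toy table. The rows with `SIGN1`, `SIGN2`, and `foo` are hypothetical and do not come from OGSL.

```python
import pandas as pd

# Toy stand-in for ogsl_df; the last two rows are invented to show a clash.
toy_ogsl = pd.DataFrame({
    "name":  ["A", "A", "EŠ₂", "EŠ₂", "SIGN1", "SIGN2"],
    "value": ["a", "e₄", "še₃", "gir₁₅", "foo", "foo"],
})

# Count how many distinct sign names each value belongs to.
names_per_value = toy_ogsl.groupby("value")["name"].nunique()
ambiguous = names_per_value[names_per_value > 1]

# dict(zip(...)) silently keeps the last sign name for an ambiguous value.
toy_d = dict(zip(toy_ogsl["value"], toy_ogsl["name"]))

print(list(ambiguous.index), toy_d["foo"])
```

Running the same check on `ogsl_df` would show whether any lookups in `d` depend on row order.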

In [None]:
#file = "name_authority/Drehem_name_authority.atf"
#with open(file, "r", encoding="utf-8") as f:
#    z = f.readlines()
#y = [re.sub(r"\t+", "\t", l).strip().split('\t') for l in z]  # replace multiple tabs by single tab and split on TAB
#del y[0]  # remove line with column names

In [10]:
cols = ["index", "translit_bdtns", "cf_oracc", "notes"]
#na_df = pd.DataFrame(y, columns = cols).fillna("")

In [13]:
file = "../../UrIII-names/output/UrIII-Names.csv"
with open(file, 'r', encoding = 'utf8') as f: 
    na_df = pd.read_csv(f, sep='\t')
na_df.columns = cols[:3]
na_df

Unnamed: 0,index,translit_bdtns,cf_oracc
0,0,A-AMA-a2-a,A.AMA.a₂.a[]PN
1,1,A-AN-ba-az,A.AN.ba.az[]PN
2,2,A-Ab-ba-ge-na-ta,A.Ab.ba.ge.na[]PN
3,3,A-Ad-da,A.Ad.da[]PN
4,4,A-Ad-da-kal-la,A.Ad.da.kal.la[]PN
5,5,A-Ad-da-mu,A.Ad.da.ŋu₁₀[]PN
6,6,A-An-na-hi-li-bi,A.An.na.hi.li.bi[]PN
7,7,A-DU-a,A.DU.a[]PN
8,8,A-DU-ba,A.DU.ba[]PN
9,9,A-DU-ba-bi,A.DU.ba.bi[]PN


In [14]:
na_df = na_df.drop(["index"], axis=1)
na_df

Unnamed: 0,translit_bdtns,cf_oracc
0,A-AMA-a2-a,A.AMA.a₂.a[]PN
1,A-AN-ba-az,A.AN.ba.az[]PN
2,A-Ab-ba-ge-na-ta,A.Ab.ba.ge.na[]PN
3,A-Ad-da,A.Ad.da[]PN
4,A-Ad-da-kal-la,A.Ad.da.kal.la[]PN
5,A-Ad-da-mu,A.Ad.da.ŋu₁₀[]PN
6,A-An-na-hi-li-bi,A.An.na.hi.li.bi[]PN
7,A-DU-a,A.DU.a[]PN
8,A-DU-ba,A.DU.ba[]PN
9,A-DU-ba-bi,A.DU.ba.bi[]PN


# Some data cleaning
Remove lines where the ORACC Citation Form has "unkn" or "not PN" and reset the index afterwards.

In [15]:
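# Drop rows whose citation form is "unkn" or is flagged as "not PN"
na_df = na_df.loc[~(na_df["cf_oracc"] == "unkn")]
# na=False treats missing citation forms as "does not contain"
na_df = na_df.loc[~(na_df.cf_oracc.str.contains("not PN", na=False))]
na_df = na_df.reset_index(drop=True)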

Deal with differences in transliteration conventions between BDTNS and CDLI/ORACC.

In [16]:
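# Regex with a back-reference: capital X after a letter becomes subscript ₓ
replace1 = {"([a-wy-zA-WY-Z])X" : "\\1ₓ"} 
# Literal replacements (BDTNS convention -> CDLI/ORACC convention)
replace2 = {"c": "š" , "C" : "Š", "ty" : "ṭ", "Ty" : "Ṭ", "sy" : "ṣ", "Sy" : "Ṣ", "1" : "₁", "2" :  "₂", "3": "₃", "4" : "₄", "5": "₅", 
              "6" : "₆", "7" : "₇", "8" :  "₈", "9" : "₉", "0" : "₀", "x" : "×", "nigarₓ" : "nigar", 
           "nemurₓ(|PIRIG.TUR|)" : "nemur₂", "nagₓ(GAZ)" : "naŋ₃" }
# The keys of replace2 are literal strings: escape them so that "(", "|" and "." are not read as regex syntax
replace2 = {re.escape(k): v for k, v in replace2.items()}
na_df = na_df.replace({"translit_bdtns" : replace1}, regex=True)
na_df = na_df.replace({"translit_bdtns" : replace2}, regex=True)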
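
Because the replacements are applied with `regex=True`, any key that contains `(`, `|`, or `.` is interpreted as regular-expression syntax rather than as literal text. The sketch below, on a hand-made Series, shows how `re.escape` changes the result for one such key. The sample string is made up for illustration.

```python
import re
import pandas as pd

s = pd.Series(["nagₓ(GAZ)-a"])           # made-up sample form
literal = {"nagₓ(GAZ)": "naŋ₃"}          # intended as a literal replacement

# Unescaped: "(GAZ)" is a regex group, so the pattern looks for "nagₓGAZ"
# and does not match the parentheses in the data.
unescaped = s.replace(literal, regex=True)

# Escaped: the parentheses are matched literally.
escaped = s.replace({re.escape(k): v for k, v in literal.items()}, regex=True)

print(unescaped[0], escaped[0])
```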

# Create signs and sign_names columns
Split a word (name) into signs by replacing sign separators by blanks. All signs are lowercased and the string is split into a list.

Use the dictionary `d`, created above, to find the sign name for each sign; a sign that is not found in `d` is kept unchanged. Each form is now reduced to a list of sign names. The sign names are rejoined into a single string (separated by hyphens) in order to make comparison easier.

In [17]:
def separate(e): 
    """Split a word into a list of lowercased signs."""
    separators = ['{', '}', '-', '.', "+"]
    for s in separators: # replace each sign separator by a blank
        e = e.replace(s, ' ')
    return e.lower().split()

In [18]:
na_df["signs"] = na_df["translit_bdtns"].apply(separate)
na_df["sign_names"] = na_df["signs"].apply(lambda x: "-".join([d[s] if s in d else s for s in x]))
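
Since signs that are missing from the dictionary are passed through unchanged, they remain visible as lowercase tokens in `sign_names`. A minimal sketch for listing such unresolved signs follows. It uses a hand-written `toy_d` and an invented token `zzz`; neither is part of the real data.

```python
def separate(e):
    """Split a word into lowercased signs (same logic as the cell above)."""
    for s in ['{', '}', '-', '.', '+']:
        e = e.replace(s, ' ')
    return e.lower().split()

# Hand-written excerpt of the value -> sign-name dictionary (illustration only).
toy_d = {"a": "A", "ad": "AD", "da": "DA"}

signs = separate("A-Ad-da-zzz")           # "zzz" is an invented, unknown sign
sign_names = "-".join(toy_d.get(s, s) for s in signs)
unresolved = [s for s in signs if s not in toy_d]

print(sign_names, unresolved)
```

Collecting `unresolved` over the whole column gives a quick list of readings that still need a transliteration fix or an addition to the sign list.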

In [19]:
na_df

Unnamed: 0,translit_bdtns,cf_oracc,signs,sign_names
0,A-AMA-a₂-a,A.AMA.a₂.a[]PN,"[a, ama, a₂, a]",A-|GA₂×AN|-A₂-A
1,A-AN-ba-az,A.AN.ba.az[]PN,"[a, an, ba, az]",A-AN-BA-|PIRIG×ZA|
2,A-Ab-ba-ge-na-ta,A.Ab.ba.ge.na[]PN,"[a, ab, ba, ge, na, ta]",A-AB-BA-GI-NA-TA
3,A-Ad-da,A.Ad.da[]PN,"[a, ad, da]",A-AD-DA
4,A-Ad-da-kal-la,A.Ad.da.kal.la[]PN,"[a, ad, da, kal, la]",A-AD-DA-KAL-LA
5,A-Ad-da-mu,A.Ad.da.ŋu₁₀[]PN,"[a, ad, da, mu]",A-AD-DA-MU
6,A-An-na-hi-li-bi,A.An.na.hi.li.bi[]PN,"[a, an, na, hi, li, bi]",A-AN-NA-HI-LI-BI
7,A-DU-a,A.DU.a[]PN,"[a, du, a]",A-DU-A
8,A-DU-ba,A.DU.ba[]PN,"[a, du, ba]",A-DU-BA
9,A-DU-ba-bi,A.DU.ba.bi[]PN,"[a, du, ba, bi]",A-DU-BA-BI


# Find Duplicates
Duplicate sequences of sign names should result in the same reading in the column `cf_oracc`.

In [20]:
dups = na_df.loc[na_df.duplicated(["sign_names"], keep=False)]
dups = dups.reset_index()
dups

Unnamed: 0,index,translit_bdtns,cf_oracc,signs,sign_names
0,3,A-Ad-da,A.Ad.da[]PN,"[a, ad, da]",A-AD-DA
1,4,A-Ad-da-kal-la,A.Ad.da.kal.la[]PN,"[a, ad, da, kal, la]",A-AD-DA-KAL-LA
2,5,A-Ad-da-mu,A.Ad.da.ŋu₁₀[]PN,"[a, ad, da, mu]",A-AD-DA-MU
3,7,A-DU-a,A.DU.a[]PN,"[a, du, a]",A-DU-A
4,8,A-DU-ba,A.DU.ba[]PN,"[a, du, ba]",A-DU-BA
5,9,A-DU-ba-bi,A.DU.ba.bi[]PN,"[a, du, ba, bi]",A-DU-BA-BI
6,12,A-DU-mu,A.DU.ŋu₁₀[]PN,"[a, du, mu]",A-DU-MU
7,13,A-DU-mu-ta,A.DU.ŋu₁₀[]PN,"[a, du, mu, ta]",A-DU-MU-TA
8,15,A-DU-ta,A.DU[]PN,"[a, du, ta]",A-DU-TA
9,18,A-DU.DU-ta,A.DU.DU[]PN,"[a, du, du, ta]",A-DU-DU-TA


Create output for Manuel Molina

In [29]:
dups = dups.sort_values(by="sign_names").reset_index(drop=True)
with open("output/duplicate_names_bdtns.csv", "w", encoding='utf8') as o: 
    dups.to_csv(o, sep = "\t", index=False)
dups

Unnamed: 0,translit_bdtns,cf_oracc,signs,sign_names
0,A-a-,Aya[]PN,"[a, a]",A-A
1,A-a,Aya[]PN,"[a, a]",A-A
2,A-a-še₃,Aya[]PN,"[a, a, še₃]",A-A-EŠ₂
3,A-a-gir₁₅,Ayagir[]PN,"[a, a, gir₁₅]",A-A-EŠ₂
4,A-a-Kal-la,Aya.Kal.la[]PN,"[a, a, kal, la]",A-A-KAL-LA
5,A-a-kal-la-,Ayakala[]PN,"[a, a, kal, la]",A-A-KAL-LA
6,A-a-kal-la,Ayakala[]PN,"[a, a, kal, la]",A-A-KAL-LA
7,A-a-ni,Ayani[]PN,"[a, a, ni]",A-A-NI
8,A-a-NI,Aya.NI[]PN,"[a, a, ni]",A-A-NI
9,A-a-uru-IN,Aya.iri.IN[]PN,"[a, a, uru, in]",A-A-URU-IN


# Same Normalization?
If the duplicates do *not* have the same normalization (in `cf_oracc`), add them to a list for inspection. The nested loop compares every pair of rows, so it is awfully slow on a large table; there is probably a better way of doing this.

In [None]:
dups_l = []
for i, n in enumerate(dups["sign_names"]):
    row_i = dups.iloc[i]
    for o in range(i + 1, len(dups)): 
        row_o = dups.iloc[o]
        # same sign sequence but different normalization: keep the pair for inspection
        if n == row_o["sign_names"] and row_i["cf_oracc"] != row_o["cf_oracc"]: 
            l = [row_i["index"], row_i["sign_names"], row_i["cf_oracc"], row_o["index"], 
                 row_o["sign_names"], row_o["cf_oracc"]]
            dups_l.append(l)
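
One possible faster alternative, sketched here on a toy table, is to let pandas count the distinct normalizations per sign sequence with `groupby`. Note that this returns the conflicting rows themselves rather than the list of pairs built by the loop, so it is a different presentation of the same information, not a drop-in replacement.

```python
import pandas as pd

# Toy stand-in for dups, using a few rows from the table above.
toy_dups = pd.DataFrame({
    "index":      [0, 1, 2, 3, 4],
    "cf_oracc":   ["Aya[]PN", "Aya[]PN", "Aya[]PN", "Ayagir[]PN", "Ayani[]PN"],
    "sign_names": ["A-A", "A-A", "A-A-EŠ₂", "A-A-EŠ₂", "A-A-NI"],
})

# For every row, the number of distinct normalizations in its sign_names group.
n_forms = toy_dups.groupby("sign_names")["cf_oracc"].transform("nunique")

# Rows whose sign sequence has more than one normalization need inspection.
conflicts = toy_dups.loc[n_forms > 1].sort_values("sign_names")
print(conflicts)
```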

In [None]:
# Build the table outside an if-block so that the cell displays it (empty if there are no conflicts)
dups_df = pd.DataFrame(dups_l)
dups_df

In [None]:
len(dups_l)

In [None]:
r = na_df.to_dict("records")
r = {"data" : r}

p = OrderedDict()
p["authors"] = "Niek Veldhuis and John Carnahan"
p["license"] = "CC0; https://creativecommons.org/share-your-work/public-domain/cc0/; Open Domain"
p["website"] = "https://github.com/niekveldhuis/UrIII-names"
p["notes"] = "Based on the BDTNS (http://bdtns.filol.csic.es/) dataset, December 2016. Proper nouns in BDTNS, marked by initial capital, were extracted and normalized with a script, authored by Niek Veldhuis (https://github.com/niekveldhuis/UrIII-names). Drehem proper nouns were checked and hand-edited by John Carnahan."
fmt='%Y-%m-%d'
p["date"] = datetime.datetime.now().strftime(fmt)
p.update(r)

In [None]:
with open("name_authority/Drehem_na.json", "w", encoding = "utf-8") as j: 
    json.dump(p, j, ensure_ascii=False, sort_keys=False, indent=4, separators=(',', ': '))

In [None]:
with open("name_authority/Drehem_na.json", "r", encoding = "utf-8") as k: 
    l = json.load(k)
df = pd.DataFrame(l["data"])
df

In [None]:
with open("name_authority/people.csv", "r", encoding="utf-8") as f: 
    all_names = pd.read_csv(f)
all_names

In [None]:
with open("name_authority/Drehem_P_BDTNS.txt", "r", encoding="utf-8") as f:
    P_nos = pd.read_csv(f, sep="\t", header=None, usecols=[1]).fillna("")
P_nos = P_nos[P_nos[1] != ""]
P_nos = list(P_nos[1])
P_nos = [int(n[1:]) for n in P_nos]
P_nos

In [None]:
drehem_names = all_names.loc[all_names["p index"].isin(P_nos)].copy()

In [None]:
len(all_names), len(drehem_names)

In [None]:
drehem_names["signs"] = drehem_names["name"].apply(separate)
drehem_names["sign_names"] = drehem_names["signs"].apply(lambda x: "-".join([d[s] if s in d else s for s in x]))

In [None]:
drehem_names

In [None]:
na_d = dict(zip(na_df["sign_names"], na_df["cf_oracc"]))

In [None]:
drehem_names["norm2"] = drehem_names["sign_names"].apply(lambda x: na_d[x] if x in na_d else "not found")

In [None]:
perc = len(drehem_names[drehem_names["norm2"] == "not found"]) / len(drehem_names) * 100

In [None]:
print(f"Percentage of name instances not recognized in normalization: {perc:.2f}%")

In [None]:
drehem_names.loc[drehem_names["norm2"] == "not found", "norm2"] = drehem_names["name"]
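
The lookup, the percentage, and the fallback in the last few cells can also be sketched more compactly with `Series.map`, which returns `NaN` for keys that are missing from the dictionary. The toy dictionary and table below are hand-written for illustration and do not reproduce the real data.

```python
import pandas as pd

# Hand-written stand-ins for na_d and drehem_names (illustration only).
toy_na_d = {"A-A": "Aya[]PN", "A-A-NI": "Ayani[]PN"}
toy_names = pd.DataFrame({
    "name":       ["a-a", "a-a-ni", "lu₂-kal-la"],
    "sign_names": ["A-A", "A-A-NI", "LU₂-KAL-LA"],
})

# Missing keys become NaN instead of the string "not found".
matched = toy_names["sign_names"].map(toy_na_d)

# Share of name instances without a match, as a percentage.
perc_not_found = matched.isna().mean() * 100

# Fall back to the original transliteration where no match was found.
toy_names["norm2"] = matched.fillna(toy_names["name"])

print(f"{perc_not_found:.1f}% not found")
print(toy_names)
```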