# ePSD2 Sux-Gloss Notebook
This Jupyter Notebook was developed through the UC Berkeley Data Science Discovery Program and is released under the Creative Commons CC0 license.

The purpose of the ePSD2 Sux-Gloss Notebook is to build DataFrames from the JSON files of the glossary of the electronic Pennsylvania Sumerian Dictionary (ePSD2). These DataFrames can then be used for Linked Open Data and for subsequent natural language processing and machine learning tasks.

**Cite as:**

> Kim, M. and Anderson, A. 2023. ePSD2 Sux-Gloss Notebook: A Python Notebook Pipeline from the ePSD2 Sumerian Glossary to Wikidata Lexemes and URIs.


**Authors:** 
1. Minoo Kim (minookim@berkeley.edu). UC Berkeley Data Science major (2025)
2. Dr. Adam Anderson (adamganderson@gmail.com). UC Berkeley Data Science Discovery Partner, FactGrid Cuneiform project PI

# Contents:
# Intro to ORACC Headwords and Forms
## Initial Headword DataFrame
## Initial Forms DataFrame
## Linking Headwords in the Forms DataFrame
## Final Headwords DataFrame: headword_df
## Final Forms DataFrame: forms_df
# Labeling Forms: grammatical features, esp. suffixes
## Labels for Nouns with suffix cases
## Labels for Pronouns
## Labels for Demonstratives
## Wikidata formatting and Labels to Q-ids
# Formatting for Wikidata QuickStatements

# Intro to ORACC ePSD2 Headwords and Forms

ORACC makes the JSON files for its main glossary readily available at this URL: http://oracc.museum.upenn.edu/epsd2/json/index.html

The zipped folder is 213.4 MB and must be unzipped in order to access the `gloss-sux.json` file. When unzipped, the following files will be visible:
* catalogue.json = 76.6 MB
* corpus.json = 3.4 MB
* corpusjson (folder)
* epsd2-portal.json = 88 KB
* epsd2-sl.json
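
The unzipping step can also be done from Python. The following is a minimal sketch, assuming the archive has already been downloaded; the function name `extract_glossary` and the example paths are hypothetical, while the member name `gloss-sux.json` is the file opened later in this notebook.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_glossary(zip_path, out_dir, member_name="gloss-sux.json"):
    """Extract the first archive member whose file name equals `member_name`.

    Returns the path of the extracted file. The helper name and its
    arguments are illustrative and not part of the original notebook.
    """
    out_dir = Path(out_dir)
    with ZipFile(zip_path) as archive:
        # Members may sit inside nested folders, so compare file names only.
        matches = [name for name in archive.namelist()
                   if Path(name).name == member_name]
        if not matches:
            raise FileNotFoundError(f"{member_name} not found in {zip_path}")
        return Path(archive.extract(matches[0], path=out_dir))


# Example call (paths are assumptions about where the archive was saved):
# extract_glossary("/content/drive/My Drive/Colab Notebooks/epsd2/epsd2.zip",
#                  "/content/drive/My Drive/Colab Notebooks/epsd2/unzipped")
```

Extracting only the glossary avoids unpacking the much larger catalogue file when it is not needed.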

This notebook begins by mounting Google Drive in Google Colab. Once this is done, we can use the directory path to load the ePSD2 JSON files, which we downloaded previously.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pwd

/content


In [None]:
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import os
import ipywidgets as widgets
from zipfile import ZipFile
import pandas as pd
import numpy as np
import json

In [None]:
!ls -R "/content/drive/My Drive/Colab Notebooks/epsd2"

'/content/drive/My Drive/Colab Notebooks/epsd2':
catalogue.json	epsd2-portal.json  gloss-sux.json  index-sux.json  unzipped
corpus.json	epsd2-sl.json	   index-cat.json  metadata.json
corpusjson	epsd2.zip	   index-lem.json  sortcodes.json

'/content/drive/My Drive/Colab Notebooks/epsd2/corpusjson':

'/content/drive/My Drive/Colab Notebooks/epsd2/unzipped':
epsd2

'/content/drive/My Drive/Colab Notebooks/epsd2/unzipped/epsd2':
epsd2

'/content/drive/My Drive/Colab Notebooks/epsd2/unzipped/epsd2/epsd2':
catalogue.json	corpusjson     index-lem.json  sortcodes.json
corpus.json	epsd2-sl.json  index-sux.json

'/content/drive/My Drive/Colab Notebooks/epsd2/unzipped/epsd2/epsd2/corpusjson':


In [None]:
with open('/content/drive/My Drive/Colab Notebooks/epsd2/gloss-sux.json') as project_file:    
    data = json.load(project_file)  
df = pd.json_normalize(data)

In [None]:
df

Unnamed: 0,type,project,source,license,license-url,more-info,UTC-timestamp,lang,entries,instances.sux.r0028b3,...,summaries.o0048608,summaries.o0048610,summaries.o0043041,summaries.o0043043,summaries.o0043045,summaries.o0043047,summaries.o0043051,summaries.o0043053,summaries.o0043056,summaries.o0048612
0,glossary,epsd2,http://oracc.org/epsd2,This data is released under the CC0 license,https://creativecommons.org/publicdomain/zero/...,http://oracc.org/doc/opendata/,2021-12-21T03:21:45,sux,"[{'headword': 'a[arm]N', 'id': 'o0023086', 'oi...",[epsd2/literary:Q000372.157.3],...,"<p class=""summary"" id=""o0048608""><span class=""...","<p class=""summary"" id=""o0048610""><span class=""...","<p class=""summary"" id=""o0043041""><span class=""...","<p class=""summary"" id=""o0043043""><span class=""...","<p class=""summary"" id=""o0043045""><span class=""...","<p class=""summary"" id=""o0043047""><span class=""...","<p class=""summary"" id=""o0043051""><span class=""...","<p class=""summary"" id=""o0043053""><span class=""...","<p class=""summary"" id=""o0043056""><span class=""...","<p class=""summary"" id=""o0048612""><span class=""..."


## Initial Headword DataFrame
This DataFrame shows the `headword` for each lemma (lexeme) in the glossary. Each `headword` has an identifier `id`, a count of occurrences of the headword in ORACC `icount`, the citation form of the headword `cf`, the English gloss `gw`, and the part-of-speech tag `pos`.

In [None]:
normalized = pd.json_normalize(df['entries'][0])
headwords = normalized.loc[:, ["headword", "id", "oid", "icount", "cf", "gw", "pos"]]
headwords

Unnamed: 0,headword,id,oid,icount,cf,gw,pos
0,a[arm]N,o0023086,o0023086,11722,a,arm,N
1,a[bird-cry]N,o0023098,o0023098,2,a,bird-cry,N
2,a[time]N,o0023100,o0023100,19,a,time,N
3,a[water]N,o0023102,o0023102,5347,a,water,N
4,a aŋ[command]V/t,o0023107,o0023107,143,a aŋ,command,V/t
...,...,...,...,...,...,...,...
14607,zurzur[official]N,o0043047,o0043047,38,zurzur,official,N
14608,zuses[bird]N,o0043051,o0043051,3,zuses,bird,N
14609,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N
14610,zuša[roaring]N,o0043056,o0043056,0,zuša,roaring,N


In [None]:
forms = pd.json_normalize(normalized['forms'][0])
forms = forms.loc[:, ["id", "n", "c", "xis"]]
pd.DataFrame(forms.values)

Unnamed: 0,0,1,2,3
0,o0023086.0,a,50,sux.r002e75
1,o0023086.1,a-bi,324,sux.r000005
2,o0023086.2,a₂,1774,sux.r002e76
3,o0023086.3,a₂\a,1775,sux.r002e77
4,o0023086.4,A₂,1776,sux.r002e78
...,...,...,...,...
69,o0023086.69,a₂-zu-še₃-ne-ne,2422,sux.r002eb2
70,o0023086.70,a₂-zu-ta,2424,sux.r002eb3
71,o0023086.71,a₂-zu\eL,2427,sux.r002eb4
72,o0023086.72,an,6303,sux.r000005


## Initial Forms DataFrame
This DataFrame lists the forms of each headword, identified by the form `id`. Each form is shown in `n`, along with a count `c` of how many times that form occurs in ORACC and its corresponding identifier `xis`.

In [None]:
form_df_lst = []
for i in range(headwords.shape[0]):
  form = pd.json_normalize(normalized['forms'][i])
  form_df_lst.append(form.values)
lst_of_dfs = [pd.DataFrame(form_df_lst[j]) for j in range(len(form_df_lst))]
forms = pd.concat(lst_of_dfs)
forms = forms.iloc[:, [1, 2, 3, 6]]
forms = forms.rename(columns = {1: "id_form", 2: "n", 3: "c", 6: "xis"}) # I renamed the identifier id_form to avoid confusion with the headword id
forms

Unnamed: 0,id_form,n,c,xis
0,o0023086.0,a,50,sux.r002e75
1,o0023086.1,a-bi,324,sux.r000005
2,o0023086.2,a₂,1774,sux.r002e76
3,o0023086.3,a₂\a,1775,sux.r002e77
4,o0023086.4,A₂,1776,sux.r002e78
...,...,...,...,...
11,o0043053.11,zu₂-sig,104655,sux.r01d04c
12,o0043053.12,zu₂-x,104672,sux.r000005
0,o0043056.0,zu₄-ša₄,104684,sux.r000005
0,o0048612.0,{m}zu-zu,104476,sux.r002bb0


## Linking Headwords in the Forms DataFrame
In this step we join the forms `n` to their headword. The example above used only the first headword (index 0), so we now repeat the process iteratively for all 14611 headwords. Every headword field is included with each form; this adds redundancy to the DataFrame, but it is necessary for importing into Wikidata.

For each row of `headwords`, we duplicate the row once for every entry in that headword's `forms`, and then stack the repeated rows into a single DataFrame with `pd.concat`. The resulting `headwords_long` has the same number of rows, in the same order, as `forms`.
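
The same headword-to-form link can also be produced in a single call. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it passes `record_path` and `meta` to `pd.json_normalize`, and uses `record_prefix` because both the headword and its forms carry an `id` key. The two entries are a hand-made excerpt shaped like the glossary's `entries` list, with values copied from the outputs shown in this notebook; the value types in the real JSON may differ.

```python
import pandas as pd

# Hypothetical excerpt shaped like the `entries` list of gloss-sux.json.
entries = [
    {"headword": "a[arm]N", "id": "o0023086", "oid": "o0023086",
     "icount": 11722, "cf": "a", "gw": "arm", "pos": "N",
     "forms": [
         {"id": "o0023086.0", "n": "a", "c": 50, "xis": "sux.r002e75"},
         {"id": "o0023086.1", "n": "a-bi", "c": 324, "xis": "sux.r000005"},
     ]},
    {"headword": "zuša[roaring]N", "id": "o0043056", "oid": "o0043056",
     "icount": 0, "cf": "zuša", "gw": "roaring", "pos": "N",
     "forms": [
         {"id": "o0043056.0", "n": "zu₄-ša₄", "c": 104684, "xis": "sux.r000005"},
     ]},
]

meta = ["headword", "id", "oid", "icount", "cf", "gw", "pos"]

# One row per form; the headword fields are repeated alongside each form.
linked = pd.json_normalize(entries, record_path="forms", meta=meta,
                           record_prefix="form_")
linked = linked.rename(columns={"form_id": "id_form", "form_n": "n",
                                "form_c": "c", "form_xis": "xis"})
linked = linked[meta + ["id_form", "n", "c", "xis"]]
print(linked)
```

Because each form row is generated together with its headword fields, this variant does not depend on two separately built DataFrames staying in the same row order.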

In [None]:
lst = []
for i in range(headwords.shape[0]):
  lst.append(np.repeat(headwords.iloc[i:(i+1)].values, pd.json_normalize(normalized['forms'][i]).shape[0], axis = 0))

temp = [pd.DataFrame(lst[j]) for j in range(len(lst))]
headwords_long = pd.concat(temp)
headwords_long = headwords_long.rename(columns = {0: "headword", 1: "id", 2: "oid", 3: "icount", 4: "cf", 5: "gw", 6: "pos"})
headwords_long

Unnamed: 0,headword,id,oid,icount,cf,gw,pos
0,a[arm]N,o0023086,o0023086,11722,a,arm,N
1,a[arm]N,o0023086,o0023086,11722,a,arm,N
2,a[arm]N,o0023086,o0023086,11722,a,arm,N
3,a[arm]N,o0023086,o0023086,11722,a,arm,N
4,a[arm]N,o0023086,o0023086,11722,a,arm,N
...,...,...,...,...,...,...,...
11,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N
12,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N
0,zuša[roaring]N,o0043056,o0043056,0,zuša,roaring,N
0,Zuzu[1]PN,o0048612,o0048612,7,Zuzu,1,PN


## Final Headword DataFrame: headword_df
In this DataFrame we return to the glossary in order to include the following three fields for each `headword` `id` (i.e. the `headwords` DataFrame above with 14612 rows):

1. time period (string data)
2. additional glosses in English (string data)
3. URL path for each headword (`http://oracc.museum.upenn.edu/epsd2/` + `id`)

* It may be helpful to use Python's HTML parser: https://docs.python.org/3/library/html.parser.html
* The data we need in the JSON begins with this key:
  "summaries"
* Here are some examples of how the JSON includes HTML at the bottom of the file:

1. "o0023086": "<p class=\"summary\" id=\"o0023086\"><span class=\"summary\"><span class=\"summary-headword\"><a href=\"javascript:p3Article('/epsd2/cbd/sux/o0023086.html')\"><span class=\"cf\">a</span> [<span class=\"gw\">ARM</span>] <span class=\"cf\">N</span></a> (11722x) </span>Early Dynastic IIIa, Early Dynastic IIIb, Old Akkadian, Lagash II, Ur III, Old Babylonian, Middle Assyrian, Middle Babylonian, Neo-Assyrian, Neo-Babylonian, Persian, Hellenistic, Uncertain, unknown  wr. <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><sup class=\"sux\">ŋeš</sup><span x=\"3\" class=\"sux\">a₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><sup class=\"sux\">kuš</sup><span x=\"3\" class=\"sux\">a₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><sup class=\"sux\">a</sup><span x=\"3\" class=\"sux\">a₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><sup class=\"sux\">urud</sup><span x=\"3\" class=\"sux\">a₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">an</span>-<span x=\"3\" class=\"sux\">na</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a₂</span><sup class=\"sux\">a</sup></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><sup class=\"sux\">na₄</sup><span x=\"3\" class=\"sux\">a₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">an</span></span></span> \"arm; plow handle; side; wing; horn; strength; power; wage, wages; rent; strap; part of a scale; weapon; work\"</span></p>(Citation URL http://oracc.org/epsd2/o0023086)

2. "o0023709": "<p class=\"summary\" id=\"o0023709\"><span class=\"summary\"><span class=\"summary-headword\"><a href=\"javascript:p3Article('/epsd2/cbd/sux/o0023709.html')\"><span class=\"cf\">agin</span> [<span class=\"gw\">THUS</span>] <span class=\"cf\">N</span></a> (145x) </span>Early Dynastic IIIb, Old Babylonian, Middle Babylonian, Neo-Babylonian, Hellenistic, unknown  wr. <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a</span>-<span x=\"3\" class=\"sux\">gin₇</span></span></span> \"thus; how?\"</span></p>(Citation URL http://oracc.museum.upenn.edu/epsd2/o0023709)

3. "o0023728": "<p class=\"summary\" id=\"o0023728\"><span class=\"summary\"><span class=\"summary-headword\"><a href=\"javascript:p3Article('/epsd2/cbd/sux/o0023728.html')\"><span class=\"cf\">aguba</span> [<span class=\"gw\">VESSEL</span>] <span class=\"cf\">N</span></a> (33x) </span>Old Akkadian, Ur III, Old Babylonian, Middle Babylonian, Neo-Assyrian, Neo-Babylonian, Hellenistic, Uncertain, unknown  wr. <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a</span>-<span x=\"3\" class=\"sux\">gub₂</span>-<span x=\"3\" class=\"sux\">ba</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a</span>-<span x=\"3\" class=\"sux\">gub₂</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><span x=\"3\" class=\"sux\">a₂</span>-<span x=\"3\" class=\"sux\">gub</span></span></span>; <span class=\"wr\"><span class=\"w sux \" id=\"\"><sup class=\"sux\">dug</sup><span x=\"3\" class=\"sux\">a</span>-<span x=\"3\" class=\"sux\">gub₂</span>-<span x=\"3\" class=\"sux\">ba</span></span></span> \"a cultic vessel for water\"</span></p> (Citation URL http://oracc.museum.upenn.edu/epsd2/o0023728)
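
The structure of these summaries can be illustrated with a small parser. This is a minimal sketch using the standard `re` module, not the extraction used in the cells below: the pattern, the function name `parse_summary`, and the abbreviated sample string (with its `href` shortened) are assumptions made for illustration. It splits the time periods on commas and the English glosses on semicolons, and builds the URL from the headword `id`.

```python
import re

# Periods sit between "(NNNx) </span>" and the "wr." marker; the glosses are
# the quoted string just before the closing "</span></p>".
SUMMARY_RE = re.compile(
    r'\(\d+x\) </span>(?P<periods>.*?)\s*wr\. .*</span> "(?P<glosses>[^"]*)"</span></p>',
    re.DOTALL,
)


def parse_summary(html, headword_id):
    """Return periods, glosses, and URL for one summary, or None if no match."""
    match = SUMMARY_RE.search(html)
    if match is None:
        return None
    periods = [p.strip() for p in match.group("periods").split(",") if p.strip()]
    glosses = [g.strip() for g in match.group("glosses").split(";") if g.strip()]
    return {
        "time_period": periods,
        "translations": glosses,
        "url": "http://oracc.museum.upenn.edu/epsd2/" + headword_id,
    }


# Abbreviated version of the o0023709 summary shown above.
sample = (
    '<p class="summary" id="o0023709"><span class="summary">'
    '<span class="summary-headword"><a href="#"><span class="cf">agin</span> '
    '[<span class="gw">THUS</span>] <span class="cf">N</span></a> (145x) </span>'
    'Early Dynastic IIIb, Old Babylonian, unknown  wr. '
    '<span class="wr"><span class="w sux " id=""><span x="3" class="sux">a</span>-'
    '<span x="3" class="sux">gin₇</span></span></span> "thus; how?"</span></p>'
)
parsed = parse_summary(sample, "o0023709")
print(parsed)
```

Keeping the periods and glosses as lists makes it easier to map each value to its own Wikidata statement later on.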

In [None]:
summary = df.copy().loc[:, 'summaries.o0023086':].T

In [None]:
# note: the first capture group (time periods) keeps the trailing "wr." marker
summary = summary[0].str.extract(r'\d+x\) </span>(.+wr.) <span class="wr">.*</span>(.*)</span></p>').reset_index()
summary

Unnamed: 0,index,0,1
0,summaries.o0023086,"Early Dynastic IIIa, Early Dynastic IIIb, Old ...","""arm; plow handle; side; wing; horn; strength..."
1,summaries.o0023098,Lagash II wr.,"""a bird-cry"""
2,summaries.o0023100,Old Babylonian wr.,"""time"""
3,summaries.o0023102,"Early Dynastic IIIa, Early Dynastic IIIb, Ebla...","""water; watercourse; semen, sperm; progeny"""
4,summaries.o0023107,"Early Dynastic IIIb, Ur III, Old Babylonian, M...","""to command; to instruct"""
...,...,...,...
14607,summaries.o0043047,"Early Dynastic IIIb, Old Akkadian wr.","""animal keeper?"""
14608,summaries.o0043051,Old Babylonian wr.,"""a bird"""
14609,summaries.o0043053,"Early Dynastic IIIb, Old Akkadian, Ur III, Old...","""plucking time; plucked (said of sheep)"""
14610,summaries.o0043056,wr.,"""roaring; murmuring"""


In [None]:
headword_df = pd.concat([headwords, summary], axis = 1)
headword_df = headword_df.drop(['index'], axis = 1).rename({0: "time_period", 1: "translations"}, axis = 1)
headword_df

Unnamed: 0,headword,id,oid,icount,cf,gw,pos,time_period,translations
0,a[arm]N,o0023086,o0023086,11722,a,arm,N,"Early Dynastic IIIa, Early Dynastic IIIb, Old ...","""arm; plow handle; side; wing; horn; strength..."
1,a[bird-cry]N,o0023098,o0023098,2,a,bird-cry,N,Lagash II wr.,"""a bird-cry"""
2,a[time]N,o0023100,o0023100,19,a,time,N,Old Babylonian wr.,"""time"""
3,a[water]N,o0023102,o0023102,5347,a,water,N,"Early Dynastic IIIa, Early Dynastic IIIb, Ebla...","""water; watercourse; semen, sperm; progeny"""
4,a aŋ[command]V/t,o0023107,o0023107,143,a aŋ,command,V/t,"Early Dynastic IIIb, Ur III, Old Babylonian, M...","""to command; to instruct"""
...,...,...,...,...,...,...,...,...,...
14607,zurzur[official]N,o0043047,o0043047,38,zurzur,official,N,"Early Dynastic IIIb, Old Akkadian wr.","""animal keeper?"""
14608,zuses[bird]N,o0043051,o0043051,3,zuses,bird,N,Old Babylonian wr.,"""a bird"""
14609,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N,"Early Dynastic IIIb, Old Akkadian, Ur III, Old...","""plucking time; plucked (said of sheep)"""
14610,zuša[roaring]N,o0043056,o0043056,0,zuša,roaring,N,wr.,"""roaring; murmuring"""


### Use this line of code to export the resulting DataFrame to a CSV file:

In [None]:
headword_df.to_csv('headword_id_oid_icount_cf_gw_pos_period_translations.csv')

## Headword DataFrame: headword_df
This can now be used for Linked Data, with the URL for each `headword`. 

The DataFrame includes the following fields:
1. `headword` is the lemmatized form of the word
2. `id` is the headword identifier in ePSD2
3. `oid` is the ORACC identifier, which is identical to the ePSD2 headword `id`
4. `icount` is the count for all linked headwords in ORACC
5. `cf` is the normalized lemma for each headword
6. `gw` is the English glossary term for the headword
7. `pos` is the part of speech tag used in ORACC
8. `time_period` includes all time periods listed for each headword in the ePSD2
9. `translations` includes all translations listed for each headword in the ePSD2
10. `url` is the URL for each headword in the ePSD2

In [None]:
url_list = ["http://oracc.museum.upenn.edu/epsd2/" + headword_id for headword_id in headword_df['id']]
headword_df['url'] = url_list
headword_df


## Final Forms DataFrame: forms_df

The Forms DataFrame links each attested form in ORACC to the glossary `headwords`. It includes each form `n` for every `headword` or `cf`, along with a count `c` of each unique form and the ePSD2 identifier `xis` in the JSON. Note that the cell that builds `forms_df` appears at the end of this notebook and must be run before the labeling cells below.

The DataFrame includes the following fields:
1. `headword` includes the lemmatization, i.e. the normalized lemma, the English gloss in square brackets, and the part of speech tag
2. `id` is the ePSD2 identifier for the headword
3. `oid` is the ORACC identifier for the headword, which is identical to `id`, the ePSD2 identifier
4. `icount` is the count for each lemma in ORACC linked to the headword in ePSD2
5. `cf` (citation form) is the normalized lemma, as initially seen in the headword
6. `gw` (guide word) is the English glossary term for the headword
7. `pos` is a basic part-of-speech tag assigned to each lemma
8. `id_form` is the identifier for each form of the headword in ePSD2
9. `n` is the written form as it appears in the ORACC ATF. These are not normalized forms, like the `cf`, but preserve the exact spellings of each lemma
10. `c` is the count of each unique form in ORACC
11. `xis` is the ePSD2 identifier for the form, which is used internally in ORACC to link each attested form in ORACC to the `xis` id in the ePSD2

For more details, see the documentation in ORACC:
http://oracc.museum.upenn.edu/doc/help/glossaries/index.html



# **Labeling Forms: grammatical features, esp. suffix and prefix**

After having extracted the different forms for each lexeme, the final step in our workflow is to label them with their grammatical features, including: gender (personal / impersonal), person (first, second, third), number (singular, plural), and case (absolutive, dative, genitive, locative, etc.).

It should be noted that Sumerologists do not fully agree on the labels for these morphological particles. For the labeling task we will begin by referring to a published Sumerian grammar in English by Abraham Jagersma (2010), an unpublished Sumerian grammar in German by Walther Sallaberger (2007), a published Sumerian grammar in French by Attinger, and a published Sumerian grammar in Spanish by Miguel Civil (2020). We are building these references in Wikidata so that other Sumerologists can contribute to this process.

For the specifics on the particles we are labeling, we have a Google Sheet which is open for edits. We welcome contributions:
https://docs.google.com/spreadsheets/d/1L9cwl9V7N3oikbeimNwxga7j1nquRDXdHCFSFbTrcpU/edit?usp=sharing

In [None]:
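label_df = forms_df.copy()
# suffix: everything from the first hyphen onward; this pattern needs at least two
# characters after that hyphen, so one-character suffixes (e.g. -a, -e) are left as NaN
label_df["suffix"] = label_df["n"].str.extract(r"(-.+-?.+?)")
# prefix: a leading determinative written in curly braces, e.g. {m} or {geš}
label_df["prefix"] = label_df["n"].str.extract(r"(^{.+})")
label_df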

Unnamed: 0,headword,id,oid,icount,cf,gw,pos,id_form,n,c,xis,suffix,prefix
0,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.0,a,50,sux.r002e75,,
1,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.1,a-bi,324,sux.r000005,-bi,
2,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.2,a₂,1774,sux.r002e76,,
3,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.3,a₂\a,1775,sux.r002e77,,
4,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.4,A₂,1776,sux.r002e78,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N,o0043053.11,zu₂-sig,104655,sux.r01d04c,-sig,
12,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N,o0043053.12,zu₂-x,104672,sux.r000005,,
0,zuša[roaring]N,o0043056,o0043056,0,zuša,roaring,N,o0043056.0,zu₄-ša₄,104684,sux.r000005,-ša₄,
0,Zuzu[1]PN,o0048612,o0048612,7,Zuzu,1,PN,o0048612.0,{m}zu-zu,104476,sux.r002bb0,-zu,{m}
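
The extraction step can be tried out on a handful of forms. The sketch below is a variant under stated assumptions, not the notebook's own patterns: it uses `(-.+)$` so that one-character suffixes are also captured, and `^(\{[^}]+\})` so that only the first braced determinative is taken. The form `lugal-e` is an illustrative example that is not taken from the outputs above.

```python
import pandas as pd

# Sample forms; "lugal-e" is a hypothetical example with a one-character suffix.
sample_forms = pd.Series(["a", "a-bi", "lugal-e", "{m}zu-zu", "a₂{+a}"])

# Everything from the first hyphen to the end of the form.
suffix = sample_forms.str.extract(r"(-.+)$", expand=False)

# A determinative in curly braces at the very start of the form.
prefix = sample_forms.str.extract(r"^(\{[^}]+\})", expand=False)

print(pd.DataFrame({"n": sample_forms, "suffix": suffix, "prefix": prefix}))
```

Forms without a hyphen, or without a leading determinative, are left as missing values in the corresponding column.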


## Labels for Nouns with suffix cases

We have created a Google Doc to describe the pertinent morphological features of the nouns, pronouns, demonstratives, adjectives and verbs:

https://docs.google.com/document/d/11TuOSj5g3L_myqrVSaOxdMQPMUhHmtRBhiC84oYB96A/edit?usp=sharing

In [None]:
# function that matches an input noun suffix to its case label
def noun_suffix_labeler(pos, suffix):
  # note: this is an exact match on the whole suffix string; a case marker that is
  # only part of a longer suffix chain is not labeled
  if pos == 'N':
    if suffix == '-a':
      return 'Locative'
    elif suffix == '-ak':
      return 'Genitive'
    elif suffix == '-da':
      return 'Comitative'
    elif suffix == '-e':
      return ['Ergative (personal)', 'Directive (impersonal)']
    elif suffix == '-eš':
      return 'Adverbiative'
    elif suffix == '-gin₇':
      return 'Equative'
    elif suffix == '-ne':
      return 'Locative 2'
    elif suffix == '-ra':
      return 'Dative'
    elif suffix == '-še':
      return 'Terminative'
    elif suffix == '-ta':
      return 'Ablative'

In [None]:
label_df['noun_q-id'] = label_df.apply(lambda row: noun_suffix_labeler(row['pos'], row['suffix']), axis = 1)
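
One way to label a case marker that closes a longer suffix chain is to look up only the last hyphen-separated segment. The following is a minimal sketch of that idea, assuming the final segment is the case marker; this is a heuristic that can mislabel forms whose last syllable belongs to the stem. The dictionary reuses the case labels listed in `noun_suffix_labeler`, and the names `NOUN_CASE_LABELS` and `label_last_segment` are hypothetical.

```python
# Case labels as listed in noun_suffix_labeler, keyed by the final segment.
NOUN_CASE_LABELS = {
    "-a": "Locative",
    "-ak": "Genitive",
    "-da": "Comitative",
    "-e": ["Ergative (personal)", "Directive (impersonal)"],
    "-eš": "Adverbiative",
    "-gin₇": "Equative",
    "-ne": "Locative 2",
    "-ra": "Dative",
    "-še": "Terminative",
    "-ta": "Ablative",
}


def label_last_segment(pos, suffix):
    """Label a noun by the last hyphen-separated segment of its suffix chain."""
    # Missing suffixes arrive as NaN (a float), so only strings are handled.
    if pos != "N" or not isinstance(suffix, str):
        return None
    last_segment = "-" + suffix.rsplit("-", 1)[-1]
    return NOUN_CASE_LABELS.get(last_segment)


print(label_last_segment("N", "-zu-ta"))
```

A dictionary lookup also keeps the suffix inventory in one place, which makes it easier to compare against the shared spreadsheet.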

## Labels for Pronouns
This includes gender ([personal](https://www.wikidata.org/wiki/Q67372736) / [impersonal](https://www.wikidata.org/wiki/Q67372837)), person (first, second, third), number (singular, plural), and case (see above).

In [None]:
# function that matches an input pronoun suffix to its label
def pronoun_suffix_labeler(pos, suffix):
  if pos == 'N':
    if suffix == '-ŋu':
      return 'First person singular + personal'
    if '-ŋu₁₀' in str(suffix):
      if '-uš' in str(suffix) or '-uš-še₃' in str(suffix):
        return 'First person + personal + Terminative -še'
      if '-ur₂' in str(suffix):
        return 'First person + personal + Dative -r(a)'
      else:
        return ['First person singular + personal', 'First person + personal + Directive -e', 'First person + personal + Ergative -e']

    elif suffix == '-me':
      return 'First person plural + personal'

    elif '-zu' in str(suffix):
      if '-e-ne-ne' in str(suffix):
        return 'Second person plural + personal'
      if '-ne' in str(suffix):
        return 'Second person plural + personal'
      if '-ur₂' in str(suffix):
        return 'Second person + personal + Dative -r(a)'
      if '-uš' in str(suffix):
        return 'Second person + personal + Terminative -še'
      else:
        return ['Second person singular + personal', 'Second person + personal + Directive -e']

    elif suffix == '-a-ne-ne':
      return 'Third person plural + personal'
    elif suffix == '-a-ne':
      return 'Third person singular + personal'

    elif suffix == '-a-ni-ir':
      return 'Third person + personal + Dative -r(a)'
    elif suffix == '-a-ni-še₃':
      return 'Third person + personal + Terminative -še'
    elif suffix == '-a-ni':
      return ['Third person singular + personal', 'Third person + personal + Directive -e', 'Third person + personal + Ergative -e'] # same condition -a-ni

    elif suffix == '-bi':
      return ['Third person + impersonal', 'Third person + impersonal + Directive -e', 'Third person + impersonal + Ergative -e'] # same condition -bi

    elif suffix == '-be₂':
      return ['Third person + impersonal', 'Third person + impersonal + Directive -e', 'Third person + impersonal + Ergative -e'] # same condition -be2
    elif suffix == '-ŋa₂':
      return ['First person + personal + Genitive {ak}', 'First person + personal + Locative {a}']
    elif suffix == '-ra':
      return ['First person + personal + Dative -r(a)', 'Second person + personal + Dative -r(a)', '-a-ni-ir / -ra', 'Third person + impersonal + Dative -r(a)'] # same condition -ra
    elif suffix == '-za':
      return 'Second person + personal + Locative {a}'
    elif suffix == '-še3':
      return 'Second person + personal + Terminative -še'
    elif suffix == '-a-na':
      return ['Third person + personal + Genitive {ak}', 'Third person + personal + Locative {a}'] # same condition -a-na
    elif suffix == '-a-ne₂':
      return ['Third person + personal + Directive -e', 'Third person + personal + Ergative -e'] # same condition -a-ne2
    elif suffix == '-ba':
      return ['Third person + impersonal + Genitive {ak}', 'Third person + impersonal + Locative {a}'] # same condition -ba
    elif suffix == '-bi-a':
      return ['Third person + impersonal + Genitive {ak}', 'Third person + impersonal + Locative {a}'] # same condition -bi-a
    elif suffix == '-bi-ir':
      return 'Third person + impersonal + Dative -r(a)'
    elif suffix == '-biš':
      return 'Third person + impersonal + Terminative -še'
    elif suffix == '-bi-še₃':
      return 'Third person + impersonal + Terminative -še'

label_df['pronoun_q-id'] = label_df.apply(lambda row: pronoun_suffix_labeler(row['pos'], row['suffix']), axis = 1)
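
The forms in the glossary write sign indices with Unicode subscripts (for example `a₂`, `še₃`), while a few comparison literals in the labelers use ASCII digits (for example `-še3`). One option is to normalize digits before comparing. The helper below is a minimal sketch of that idea; the name `normalize_index_digits` is hypothetical, and it assumes that every digit in a suffix string is a sign index.

```python
# Translation table from ASCII digits to Unicode subscript digits.
SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def normalize_index_digits(suffix):
    """Rewrite ASCII digits as subscripts; leave non-string values untouched."""
    if not isinstance(suffix, str):
        return suffix
    return suffix.translate(SUBSCRIPT_DIGITS)


print(normalize_index_digits("-še3"))
```

Applying the same helper to both the suffix column and the literals used in the comparisons keeps the two sides in one notation.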

## Labels for Demonstratives

In [None]:
# function that matches input demonstrative suffix to the correct label
def demonstrative_suffix_labeler(pos, suffix):
  if pos == 'N':
    if suffix == '-e':
      return '“this”, Directive'
    elif suffix == '-e-en':
      return '“this”, Directive'
    elif suffix == '-ri':
      return '“that”; Adjective'
    elif suffix == '-re':
      return '“that”; Adjective'
    elif suffix == '-re-en3':
      return '“that”; Adjective'
    elif suffix == '-ne-(e/en)':
      return '“this one” = personal pronoun: 3 person, impersonal'
    elif suffix == '-bi':
      return '“this” = possessive pronoun, 3 person impersonal'
    elif suffix == '-še':
      return '“hither”'
    elif suffix == 'ur5-gin₇':
      return '“so” = Noun used for personal pronoun, 3 person impersonal'
    elif suffix == 'ur5-ta':
      return '“because” = Noun used for personal pronoun, 3 person impersonal'

label_df['demonstrative_q-id'] = label_df.apply(lambda row: demonstrative_suffix_labeler(row['pos'], row['suffix']), axis = 1)

## Wikidata formatting and Labels to Q-ids

Some of these suffixes have multiple hits, so we should include all possible options. For example, the -bi suffix could be "possessive pronoun, 3 person impersonal" and "Third person + impersonal + Ergative -e", etc.

In such cases, the row should be duplicated so that the label column stays one-dimensional, with a single label per row (rather than adding multiple label columns to a single row).

See this doc for more info: https://docs.google.com/document/d/11TuOSj5g3L_myqrVSaOxdMQPMUhHmtRBhiC84oYB96A/edit?usp=sharing

The last step before we can upload the labeled lemmas to Wikidata is to use the proper Q-ids for each of the Sumerian labels we've identified:

Use the following Q-ids for each label column:
1. noun_q-id = Q1084
2. pronoun_q-id = Q36224
3. demonstrative_q-id = Q282301


In [None]:

q_id_df = label_df.copy()
q_id_df['q-id'] = q_id_df['noun_q-id'].fillna(q_id_df['pronoun_q-id']).fillna(q_id_df['demonstrative_q-id'])
q_id_df = q_id_df.drop(columns = ['noun_q-id', 'pronoun_q-id', 'demonstrative_q-id'])
q_id_df = q_id_df.explode('q-id')
q_id_df.head(50)

Unnamed: 0,headword,id,oid,icount,cf,gw,pos,id_form,n,c,xis,suffix,prefix,q-id
0,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.0,a,50,sux.r002e75,,,
1,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.1,a-bi,324,sux.r000005,-bi,,Third person + impersonal
1,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.1,a-bi,324,sux.r000005,-bi,,Third person + impersonal + Directive -e
1,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.1,a-bi,324,sux.r000005,-bi,,Third person + impersonal + Ergative -e
2,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.2,a₂,1774,sux.r002e76,,,
3,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.3,a₂\a,1775,sux.r002e77,,,
4,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.4,A₂,1776,sux.r002e78,,,
5,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.5,a₂{+a},1777,sux.r002e79,,,
6,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.6,{+a}a₂,1790,sux.r002e7a,,{+a},
7,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.7,{geš}a₂,1791,sux.r002e7b,,{geš},
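
The cell above chains `fillna`, so a row keeps the labels of only the first labeler that returned a value. To keep every candidate label, the three columns can be merged before exploding. This is a minimal sketch under stated assumptions, not the notebook's method; the helper names `as_list` and `explode_all_labels` and the two-row demo frame are hypothetical, with labels taken from the labeler functions above.

```python
import pandas as pd


def as_list(value):
    """Wrap a single label in a list; map missing values to an empty list."""
    if isinstance(value, list):
        return value
    if pd.isna(value):
        return []
    return [value]


def explode_all_labels(df, label_columns, target="q-id"):
    """Merge several label columns into one and give each label its own row."""
    out = df.copy()
    merged = [
        [label for value in values for label in as_list(value)]
        for values in zip(*(out[col] for col in label_columns))
    ]
    out[target] = pd.Series(merged, index=out.index, dtype=object)
    # Rows without any label are kept, with a missing value in `target`.
    return out.drop(columns=label_columns).explode(target)


demo = pd.DataFrame({
    "n": ["a-bi", "a₂"],
    "noun_q-id": [None, None],
    "pronoun_q-id": [["Third person + impersonal",
                      "Third person + impersonal + Ergative -e"], None],
    "demonstrative_q-id": ["“this” = possessive pronoun, 3 person impersonal", None],
})
result = explode_all_labels(
    demo, ["noun_q-id", "pronoun_q-id", "demonstrative_q-id"])
print(result)
```

In the demo, the form `a-bi` keeps both pronoun readings and the demonstrative reading, each on its own row.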


# Formatting for Wikidata QuickStatements
* add the Sumerian language item for each = Q36790 (https://www.wikidata.org/wiki/Q36790)
* add 'instance of' (P31) 'compound' (https://www.wikidata.org/wiki/Q245423) for every lexeme with more than one sign 
* add statement 'described by source' (https://www.wikidata.org/wiki/Property:P1343) = https://www.wikidata.org/wiki/Q7164210
* add the 'lexical category' for each pos:
  * N = Q1084 (https://www.wikidata.org/wiki/Q1084)
    * grammatical gender (https://www.wikidata.org/wiki/Property:P5185) = personal (https://www.wikidata.org/wiki/Q67372736) or impersonal (https://www.wikidata.org/wiki/Q67372837)
  * PN = Q25047676 (https://www.wikidata.org/wiki/Q25047676)
    * Instance of = full name (Q1071027)
    * Lexical category = name (Q82799)
  * RN = royal name (Q116)
    * Instance of = full name (Q1071027)
    * Lexical category = name (Q82799)
  * DN = deity name (Q108524837)
    * Instance of = full name (Q1071027)
    * Lexical category = name (Q82799)
  * MN = month name (Q56413401)
  * (see others here: https://docs.google.com/document/d/10toOySKDERGlMmRlY7kUG-7Lz0ijC15gz8M_J4BVwVo/edit?usp=sharing)


Rename Header Labels (for Headwords DataFrame):
* cf = Lsux-latn
  * convert `n` to unicode (sux-xsux), use the first `n` for the label (Lsux-xsux)
* id = P11062 (https://www.wikidata.org/wiki/Property:P11062)
* For every word in `translations`:
  1. relationship from Lexeme to sense: ontolex:sense
  2. relationship sense to English label: skos:definition
  3. manually assign an item for this sense
* time_period = P2348 (https://www.wikidata.org/wiki/Property:P2348)
  * with a mapping to the Q-items (see here: 

Rename Header Labels (for Forms DataFrame):
* n = P2440 (https://www.wikidata.org/wiki/Property:P2440)
* Add a column P459 = Q114871134 (in every row)
* connect each lexeme to the form: ontolex:lexicalForm
* Add label for each form using ontolex:representation
  * sux-latn:
  * convert `n` to unicode (sux-xsux), use the Canonical ASCII version for the label (Lsux-latn)

* convert each form to CDLI format (Canonical ASCII version: https://www.wikidata.org/wiki/Q114871020)
* Add another column P459 = Q114871020 (in every row)
* Add grammatical features: https://www.wikidata.org/wiki/Lexeme:L714279
  * Grammatical features = absolutive case, etc. (we can add these later, probably manually...)
* transliteration (https://www.wikidata.org/wiki/Property:P2440): the readings attached to the transliteration
  * determination method (https://www.wikidata.org/wiki/Property:P459) = 
    * https://www.wikidata.org/wiki/Q114871020 (cdli)
    * https://www.wikidata.org/wiki/Q114871134 (oracc)

Before import:
1. cross-check the data already added to Wikidata (with SPARQL). Make sure we are not adding duplicate statements and forms.
2. identify null-values (as these will stop the import process)
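
The Forms renaming and the null-value check listed above can be sketched as follows. This is a minimal illustration, not a complete QuickStatements export: the function name `prepare_forms_for_import` and the choice of which columns count as required are assumptions, while the property and item ids (P2440, P459, Q114871134) are the ones named in the lists above.

```python
import pandas as pd


def prepare_forms_for_import(forms_df):
    """Rename `n` to P2440, add P459 to every row, and collect rows with nulls."""
    out = forms_df.rename(columns={"n": "P2440"})
    out["P459"] = "Q114871134"  # determination method: ORACC
    # Assumption: these columns must be filled for a row to be importable.
    required = ["id", "id_form", "P2440"]
    problems = out[out[required].isna().any(axis=1)]
    return out, problems


demo_forms = pd.DataFrame({
    "id": ["o0023086", "o0023086"],
    "id_form": ["o0023086.0", "o0023086.1"],
    "n": ["a", None],  # the missing value is deliberate, to show the check
})
prepared, problems = prepare_forms_for_import(demo_forms)
print(problems)
```

Rows returned in `problems` can be reviewed or dropped before the import is attempted.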

Later work / Challenge:
* combines lexemes https://www.wikidata.org/wiki/Property:P5238
  * this probably can't be done automatically (ordinal, object form, object sense - which needs to be assigned manually)
* add a usage example for each lemma (lexeme and form) (search Oracc for usage examples and their n-grams)

In [None]:
forms_df = pd.concat([headwords_long, forms], axis = 1)
forms_df

Unnamed: 0,headword,id,oid,icount,cf,gw,pos,id_form,n,c,xis
0,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.0,a,50,sux.r002e75
1,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.1,a-bi,324,sux.r000005
2,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.2,a₂,1774,sux.r002e76
3,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.3,a₂\a,1775,sux.r002e77
4,a[arm]N,o0023086,o0023086,11722,a,arm,N,o0023086.4,A₂,1776,sux.r002e78
...,...,...,...,...,...,...,...,...,...,...,...
11,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N,o0043053.11,zu₂-sig,104655,sux.r01d04c
12,zusik[plucking]N,o0043053,o0043053,166,zusik,plucking,N,o0043053.12,zu₂-x,104672,sux.r000005
0,zuša[roaring]N,o0043056,o0043056,0,zuša,roaring,N,o0043056.0,zu₄-ša₄,104684,sux.r000005
0,Zuzu[1]PN,o0048612,o0048612,7,Zuzu,1,PN,o0048612.0,{m}zu-zu,104476,sux.r002bb0
