# RulerCatalogue

Notebook by Melinee Her

Cleans the megacatalogue to produce and export a dataframe titled "RulerCatalogue" that captures fields such as ruler, eponym, year, and month.


## The workflow from [Megacatalogue](https://colab.research.google.com/drive/17bsHjB8_o8ydYjdbTsafZDGNpBj-gDTi?usp=sharing) accomplished:

The creation of a large dataframe of ORACC catalogue data.

## Next Steps:

1. Resolving duplicate fields
  * lowercase all fields and merge duplicates (as long as there are no merge conflicts)
2. Resolving date fields
  * cross-validate & harmonize the date fields between ORACC projects:
  1. `ruler`
  2. `date_of_origin`
  3. `date`
  4. `long_date`
  5. `date_gen`
  6. `day`
  7. `long_date_gen`
  8. `month`
  9. `year`
  10. `eponym`
  11. `regnal_dates`
  12. `ancient_year`
  13. `date_bce`
  14. `months_recorded`
  15. `date_comments`
  16. `proposed_date`
  17. `eponym_title`
  18. `astron_date`
  19. `Reg_year`
  20. `Reg_no`
  21. `Ruler`
  22. `Day`
  23. `Month`
  24. `Year`
  25. `dynastic_seat`
  26. `date remarks`
  27. `year_name_eponym`
  28. `ancient_date`
  29. `century`
  30. `modern_converted_date`
  31. `accounting_period`
  * We can use this Chronology notebook to check against the CDLI dates: https://colab.research.google.com/drive/1ZYWIapSC6za-WJd6EOA7xjSm6o6CPKPu?usp=sharing

3. Harmonizing with the CDLI catalogue
  1. [GitHub repo cdli_cat.csv](https://github.com/cdli-gh/data/blob/master/cdli_cat.csv)
  * [Zenodo](https://zenodo.org/record/6975724) (should be the same as above)
  2. Processed & cleaned CDLI catalogue subset: https://github.com/ancient-world-citation-analysis/CDLI2LoD

4. Formatting for LOD in FactGrid
  * Example from ORACC: http://oracc.museum.upenn.edu/epsd2/admin/ur3/P123456
    * Dates Referenced: SH44 - 01 - 26
    * SH = Šulgi

|id_text|Ruler|Year_number|Month|Day|Earliest_P41|Latest_P43|
|--|--|--|--|--|--|--|
|P123456|Šulgi|44|01|26|Earliest|Latest|
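
The date reference above could be split into the table's columns roughly as follows. This is a minimal sketch, not an ORACC utility: `parse_date_reference` and `RULER_ABBREVIATIONS` are hypothetical names, the mapping holds only the one abbreviation documented here (SH = Šulgi), and the pattern assumes every reference has the `ABBRyear - month - day` shape of this single example. The `Earliest_P41` and `Latest_P43` columns are left out because they need chronology data that this notebook does not have.

```python
import re

# Only the abbreviation documented above; a fuller mapping would have to be
# compiled from project documentation (assumption).
RULER_ABBREVIATIONS = {"SH": "Šulgi"}

# Matches e.g. "SH44 - 01 - 26": ruler abbreviation, regnal year, month, day.
DATE_REFERENCE = re.compile(r"^([A-Za-z]+)(\d+)\s*-\s*(\d+)\s*-\s*(\d+)$")


def parse_date_reference(id_text, reference):
    """Split a date reference into the columns of the table above.

    Returns None when the reference does not match the assumed pattern.
    """
    match = DATE_REFERENCE.match(reference.strip())
    if match is None:
        return None
    abbreviation, year, month, day = match.groups()
    return {
        "id_text": id_text,
        # Fall back to the raw abbreviation if it is not in the mapping.
        "Ruler": RULER_ABBREVIATIONS.get(abbreviation, abbreviation),
        "Year_number": year,
        "Month": month,
        "Day": day,
    }


row = parse_date_reference("P123456", "SH44 - 01 - 26")
print(row)
```

Month and day are kept as strings so that leading zeros such as `01` survive, matching the table.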



# Mount Google Drive folder + imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#any necessary imports
import pandas as pd
import zipfile
from zipfile import ZipFile
import json
import requests
from tqdm import tqdm
import os
import errno
import re
import random
import numpy as np
import sys
import copy
import networkx as nx
from pathlib import Path

#Set folder for remote drive
#folder = '/content/drive/My Drive/FactGrid Cuneiform (AWCA)/people/Melinee'
folder = '/content/drive/MyDrive/Melinee/'

#importing utils for the method which downloads the current text json files
os.chdir(folder + 'network/utils/')
from utils import oracc_download

# This is a user defined module that searches through the texts to find the entities in the text that
# are people and places, to be imported as nodes into the network
os.chdir(folder + 'network/')
import rank_parser4 as rp

Retrieving the megacatalogue from drive

In [None]:
path = '/content/drive/MyDrive/Melinee/ORACC_DFS/megacatalogue.csv'
megacatalogue = pd.read_csv(path, low_memory=False, index_col=False)
print(megacatalogue.shape)

(171145, 405)


# Cleaning the Megacatalogue
1. Drop any columns with all null values
2. Attempt to remove duplicates

In [None]:
#drops any columns with all null values
nonullmegacat = megacatalogue.dropna(axis='columns', how = 'all')

In [None]:
#drops duplicate columns (transpose, drop duplicate rows, transpose back)
megacatalogue = nonullmegacat.T.drop_duplicates().T

In [None]:
print(megacatalogue.shape)
megacatalogue.head(3)

(171145, 339)


Unnamed: 0.1,Unnamed: 0,id_text,langs,project,id_text.1,primary_publication,provenience,pleiades_id,pleiades_coord,excavation_no,...,Delnero_subgenre_no,deity,museum_URL,Delnero_remarks,Cohen_balag,external_URL_name,external_URL,google_earth_provenience,alternative_years,oracc_id
0,0,P522592,0x08000000,tilbarsip,P522592,Til-Barsip 01,Tell Ahmar (Til Barsip),658410.0,"[38.1191944, 36.6749623]",T 01,...,,,,,,,,,,
1,1,P522593,0x08000000,tilbarsip,P522593,Til-Barsip 02,Tell Ahmar (Til Barsip),658410.0,"[38.1191944, 36.6749623]",T 02,...,,,,,,,,,,
2,2,P522594,0x08000000,tilbarsip,P522594,Til-Barsip 03,Tell Ahmar (Til Barsip),658410.0,"[38.1191944, 36.6749623]",T 03,...,,,,,,,,,,


As we can see, there are no "perfectly" duplicated columns. The column "id_text.1" differs from "id_text" in around 9000 values and is mostly null. "primary_publication" and "designation" are very similar as well.

Also note that the column "has_date" is unreliable for most projects.
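
Near-duplicate pairs like the ones mentioned above, and case variants such as `ruler`/`Ruler` from the Next Steps list, could be merged only when they never disagree. The sketch below is one way to do that under stated assumptions: `merge_if_compatible` is a hypothetical helper, values are compared as strings, and the toy data is made up rather than taken from the megacatalogue.

```python
import pandas as pd


def merge_if_compatible(df, keep, other):
    """Fold column `other` into `keep` when the two never disagree.

    Two values conflict only when both are non-null and different.
    Returns (new_df, n_conflicts); df is returned unchanged on conflict.
    """
    both_present = df[keep].notna() & df[other].notna()
    conflicts = both_present & (df[keep].astype(str) != df[other].astype(str))
    n_conflicts = int(conflicts.sum())
    if n_conflicts:
        return df, n_conflicts
    merged = df.copy()
    # Fill gaps in `keep` with values from `other`, then drop `other`.
    merged[keep] = merged[keep].combine_first(merged[other])
    return merged.drop(columns=[other]), 0


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "ruler": ["Ashurbanipal", None, "Sennacherib"],
    "Ruler": [None, "Esarhaddon", "Sennacherib"],
})
merged, n_conflicts = merge_if_compatible(toy, "ruler", "Ruler")
print(merged)
print(n_conflicts)
```

Returning the conflict count instead of raising makes it easy to log which column pairs need manual review.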


# Creating the Ruler Catalogue
1. Filter for columns related to rulers and date information
2. Sort and export

First get a subset of the megacatalogue with these columns:

    'id_text', 'ruler', 'year', 'month', 'day', 'eponym', 'regnal_dates', 'provenience', 'eponym_title', 'proposed_date', 'date',
    'date_of_origin','long_date', 'date_gen', 'long_date_gen', 'ancient_year', 'date_bce', 'months_recorded', 'date_comments',
    'astron_date', 'Reg_year', 'Reg_no', 'dynastic_seat', 'ancient_date'

Columns in final rulercatalogue:

    'id_text', 'ruler', 'year', 'month', 'day', 'eponym', 'regnal_dates', 'provenience', 'eponym_title', 'proposed_date', 'date',
    'date_of_origin', 'long_date', 'date_gen', 'long_date_gen','astron_date', 'dynastic_seat'

In [None]:
#gets subset of megacatalogue with ruler and date information
datemegacat = megacatalogue[['id_text', 'ruler', 'year', 'month', 'day', 'eponym', 'regnal_dates', 'provenience', 'eponym_title', 'proposed_date', 'date', 'date_of_origin','long_date', 'date_gen', 'long_date_gen', 'ancient_year', 'date_bce', 'months_recorded', 'date_comments', 'astron_date', 'Reg_year', 'Reg_no', 'dynastic_seat', 'ancient_date']]

In [None]:
print(datemegacat.shape)
datemegacat.head(3)

(171145, 24)


Unnamed: 0,id_text,ruler,year,month,day,eponym,regnal_dates,provenience,eponym_title,proposed_date,...,long_date_gen,ancient_year,date_bce,months_recorded,date_comments,astron_date,Reg_year,Reg_no,dynastic_seat,ancient_date
0,P522592,,,,,,,Tell Ahmar (Til Barsip),,,...,,,,,,,,,,
1,P522593,,[...],VII,[...],,,Tell Ahmar (Til Barsip),,,...,"Tašrītu ...th, [eponymy of ...]",,,,,,,,,
2,P522594,Ashurbanipal,650,VII,01,Bēl-Harrān-šaddû’a,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,...,"Tašrītu 1st, eponymy of Bēl-Harrān-šaddû’a",,,,,,,,,


In [None]:
#keeps only the megacatalogue rows where 'ruler' is identified, then drops any columns that are entirely NaN
rulercatalogue = datemegacat[~datemegacat['ruler'].isna()]
rulercatalogue = rulercatalogue.dropna(axis='columns', how='all', inplace=False)
print(rulercatalogue.shape)
rulercatalogue.head(10)

(25158, 17)


Unnamed: 0,id_text,ruler,year,month,day,eponym,regnal_dates,provenience,eponym_title,proposed_date,date,date_of_origin,long_date,date_gen,long_date_gen,astron_date,dynastic_seat
2,P522594,Ashurbanipal,650,VII,1.0,Bēl-Harrān-šaddû’a,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,650-VII-01,Assurbanipal.limu Bel-Harran-shaddu’a.07.01,"Tašrītu 1[st], eponymy of Bēl-Harrān-šadd[û’a]",650-VII-01,"Tašrītu 1st, eponymy of Bēl-Harrān-šaddû’a",,
3,P522595,Ashurbanipal,650,VII,1.0,Bēl-Harrān-šaddû’a,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,650-VII-01,Assurbanipal.limu Bel-Harran-shaddu’a.07.01,"Tašrītu 1st, eponymy of Bēl-Harrān-šadd[û’a]",650-VII-01,"Tašrītu 1st, eponymy of Bēl-Harrān-šaddû’a",,
5,P522597,Ashurbanipal,640*,IX,,Šarru-mētu-uballiṭ,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,640*-IX,Assurbanipal.limu Shamash-metu-uballit.09.00,"Kislīmu, eponymy after that of Aššur-garu’a-[n...",640*-IX,"Kislīmu, eponymy of Šarru-mētu-uballiṭ",,
12,P522604,Ashurbanipal,643*,II,21.0,Aššur-šarru-uṣur,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,643*-II-21,Assurbanipal.limu Assur-sharru-usur.02.21,"Ayyāru 21st, eponymy of Šamaš-da[’’in]anni",643*-II-21,"Ayyāru 21st, eponymy of Aššur-šarru-uṣur",,
13,P522605,Ashurbanipal,658,XI,7.0,Ša-Nabû-šû,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,658-XI-07,Assurbanipal.limu Sha-Nabu-shu.11.07,"Šabāṭu 7th, eponymy of Ša-Nabû-šû",658-XI-07,"Šabāṭu 7th, eponymy of Ša-Nabû-šû",,
14,P522606,Sennacherib,683,X,1.0,Mannu-kī-Adad,704–681 BC,Tell Ahmar (Til Barsip),,,683-X-01,Sennacherib.limu Mannu-ki-Adad.10.01,"Kanūnu 1st, eponymy of Mannu-kī-Adad",683-X-01,"Kanūnu 1st, eponymy of Mannu-kī-Adad",,
18,P522610,Ashurbanipal,649,III,2.0,Ahu-ilā’ī,668–ca. 631 BC,Tell Ahmar (Til Barsip),,,649-III-02,Assurbanipal.limu Ahu-ila’i.03.02,"Simānu 2nd, eponymy after that of Bēl-Harrān-š...",649-III-02,"Simānu 2nd, eponymy of Ahu-ilā’ī",,
819,P223391,uncertain,,,,,,Qalat Sherqat (Assur),,(7th century),,00.000.00.00,,,,,
820,P250657,Esarhaddon,684,III,11.0,Manzernê,680–669 BC,Qalat Sherqat (Assur),,,684-III-11,Esarhaddon.limu Manzerne.03.11,"Simānu 11th, [eponymy of] Manzarnia, governor ...",684-III-11,"Simānu 11th, eponymy of Manzernê",,
821,P250905,Sîn-šarru-iškun,621* or 619*,II,26.0,,ca. 626–612 BC,Qalat Sherqat (Assur),,,621*-II-26 or 619*-II-26,Sin-sharru-ishkun.000.02.26,"Ayyāru 26th, eponymy of Bel-iqbi",621* or 619*-II-26,Ayyāru 26th,,


In [None]:
#Notice that some rows have the ruler recorded as "uncertain"; should we remove these?
#Here's a closer look at the uncertain rulers
uncertains = rulercatalogue[rulercatalogue['ruler'].str.contains('uncertain')]
uncertains.head(10)

Unnamed: 0,id_text,ruler,year,month,day,eponym,regnal_dates,provenience,eponym_title,proposed_date,date,date_of_origin,long_date,date_gen,long_date_gen,astron_date,dynastic_seat
819,P223391,uncertain,,,,,,Qalat Sherqat (Assur),,(7th century),,00.000.00.00,,,,,
824,P282261,uncertain,,,,,,Qalat Sherqat (Assur),,(7th century),,00.000.00.00,,,,,
825,P282609,uncertain,,,,,,Qalat Sherqat (Assur),,(9th-7th century),,00.000.00.00,,,,,
826,P282610,uncertain,,,,,,Qalat Sherqat (Assur),,(7th century),,00.000.00.00,,,,,
827,P285502,uncertain,,,,,,Qalat Sherqat (Assur),,(719-662),719-662,00.000.00.00,,,,,
828,P285503,uncertain,,,,,,Qalat Sherqat (Assur),,(872-659),872-659,00.000.00.00,,,,,
830,P285525,uncertain,,,,,,Qalat Sherqat (Assur),,(7th century),,00.000.00.00,,,,,
832,P285555,uncertain,,,,,,Qalat Sherqat (Assur),,(8th century),,00.000.00.00,,,,,
843,P336141,uncertain,,,,,,Qalat Sherqat (Assur),,(9th-7th century),,00.000.00.00,,,,,
844,P336142,uncertain,,,,,,Qalat Sherqat (Assur),,(7th century),,00.000.00.00,,,,,


In [None]:
#Sorting the rulercatalogue by year ascending
#(named sorted_catalogue so the built-in sorted() is not shadowed)
sorted_catalogue = rulercatalogue.sort_values(by='year', ascending=True)
sorted_catalogue

Unnamed: 0,id_text,ruler,year,month,day,eponym,regnal_dates,provenience,eponym_title,proposed_date,date,date_of_origin,long_date,date_gen,long_date_gen,astron_date,dynastic_seat
28835,P527165,Nebuchadnezzar II,600,I,10,,,Tell Sheikh Hamad (Dur-Katlimmu),,,600-I-10,Nebuchadnezzar2.01.10,"Nisannu 10th, year 5 of Nebuchadnezzar (II), k...",600-I-10,Nisannu 10th,,
28834,P527164,Nebuchadnezzar II,603,XII,,,,Tell Sheikh Hamad (Dur-Katlimmu),,,603-XII-[...],Nebuchadnezzar2.12.00,"Addaru, year 2 of Nebuchadnezzar (II), king of...",603-XII,Addaru,,
28833,P527163,Nebuchadnezzar II,603,XI,10,,,Tell Sheikh Hamad (Dur-Katlimmu),,,603-XI-10,Nebuchadnezzar2.11.10,"Šabāṭu 10th, year 2 of Nebuchadnezzar (II), ki...",603-XI-10,Šabāṭu 10th,,
28832,P527162,Nebuchadnezzar II,603,VIII,,,,Tell Sheikh Hamad (Dur-Katlimmu),,,603-VIII,Nebuchadnezzar2.08.00,"Arahsamna, year 2 of Nebuchadnezzar (II), king...",603-VIII,Arahsamna,,
28526,P527428,Aššur-uballiṭ II,610*,VIII,15,Nabû-šarru-uṣur,,Tell Halaf (Guzana),,,610*-VIII-15,00.limu Nabu-sharru-usur.08.15,"Arahsamna 15th, eponymy of Nabû-šarru-uṣur, ch...",610*-VIII-15,"Arahsamna 15th, eponymy of Nabû-šarru-uṣur",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169218,Q004773,Adad-nerari III,,,,,,,,,,,,,,,
169219,Q004774,Adad-nerari III,,,,,,,,,,,,,,,
169220,Q004775,Adad-nerari III,,,,,,,,,,,,,,,
169221,Q004776,Adad-nerari III,,,,,,,,,,,,,,,
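
The `year` column holds values such as `640*` and `621* or 619*`, so it appears to be sorted as text rather than as numbers. A numeric sort key could be derived as sketched below. The `year_num` column is a hypothetical addition, the toy rows reuse values seen in the outputs above, and keeping only the first year of an "or" alternative is a simplification.

```python
import pandas as pd

# Toy rows with `year` values of the kind seen in the outputs above.
toy = pd.DataFrame({
    "id_text": ["P522594", "P522597", "P250905", "P522593", "P527165"],
    "year": ["650", "640*", "621* or 619*", "[...]", "600"],
})

# Take the first run of digits as a numeric sort key; entries with no digits
# (e.g. "[...]") become NaN.
toy["year_num"] = pd.to_numeric(
    toy["year"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# NaN keys sort last by default (na_position="last").
by_year = toy.sort_values(by="year_num", ascending=True)
print(by_year[["id_text", "year", "year_num"]])
```

If these are BC years, as the `regnal_dates` column suggests, `ascending=False` would give chronological order.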


We can see that a lot of data is missing for year, month, and day, even where the ruler is identified in a given text. I think this data is still useful because identifying rulers is important. However, the refined ORACC data alone is not yet enough to identify their time period.
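
The amount of missing date data could be quantified per column, for instance as below. This is a sketch on toy rows shaped like the outputs above; on the real data, `toy` would be replaced by `rulercatalogue`.

```python
import pandas as pd

# Toy rows shaped like the rulercatalogue output above.
toy = pd.DataFrame({
    "id_text": ["P522594", "P522597", "P223391", "Q004773"],
    "ruler": ["Ashurbanipal", "Ashurbanipal", "uncertain", "Adad-nerari III"],
    "year": ["650", "640*", None, None],
    "month": ["VII", "IX", None, None],
    "day": ["01", None, None, None],
})

date_columns = ["year", "month", "day"]

# Fraction of rows with a missing value, per date column.
missing_share = toy[date_columns].isna().mean()
print(missing_share)

# Rows where a ruler is identified but no date component is recorded at all.
undated = toy[toy[date_columns].isna().all(axis=1)]
print(len(undated))
```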

For the "uncertain" rulers, most have a provenience and proposed date, and some have date data that is not split into year, month, and day. This information may be useful if another data source has similar information with a ruler identified or suspected.
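
To make the proposed dates of the "uncertain" rows comparable with another data source, the explicit year ranges could be pulled into numeric columns. This is a sketch: `proposed_start` and `proposed_end` are hypothetical column names, the toy rows are copied from the "uncertain" output above, and century-style entries are deliberately left unparsed.

```python
import pandas as pd

# Toy rows copied from the "uncertain" output above.
toy = pd.DataFrame({
    "id_text": ["P223391", "P282609", "P285502", "P285503"],
    "ruler": ["uncertain"] * 4,
    "provenience": ["Qalat Sherqat (Assur)"] * 4,
    "proposed_date": ["(7th century)", "(9th-7th century)", "(719-662)", "(872-659)"],
})

uncertain = toy[toy["ruler"] == "uncertain"].copy()

# Pull "start-end" year ranges such as "(719-662)" into numeric columns;
# century-style entries like "(7th century)" do not match and stay NaN.
ranges = uncertain["proposed_date"].str.extract(r"^\((\d+)-(\d+)\)$")
uncertain["proposed_start"] = pd.to_numeric(ranges[0])
uncertain["proposed_end"] = pd.to_numeric(ranges[1])
print(uncertain[["id_text", "proposed_date", "proposed_start", "proposed_end"]])
```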

## Exporting the rulercatalogue as a CSV file

In [None]:
#exports ruler catalogue to the folder ORACC_DFS
rulercatalogue.to_csv(folder + 'ORACC_DFS/RulerCatalogue.csv')

End of Notebook.