# Princeton Geniza Project

This notebook demonstrates parsing dates from non-Gregorian calendars and working with mixed-calendar dates.

This notebook uses document data from the [Princeton Geniza Project](https://geniza.princeton.edu/), a database of fragmentary medieval documents found in the Cairo Geniza. The documents are written largely in Hebrew script, in both Hebrew and Arabic, and use a range of calendars, including:
- Hebrew _Anno Mundi_
- Islamic _Hijri_
- Hebrew Seleucid calendar (_Anno Mundi_ calendar with a 3449 year offset)
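
The Seleucid offset in the list above is plain year arithmetic. As a minimal sketch (the name `seleucid_to_anno_mundi` is hypothetical and is not part of `undate` or the PGP codebase):

```python
SELEUCID_OFFSET = 3449  # years between the Seleucid and Anno Mundi year counts


def seleucid_to_anno_mundi(seleucid_year: int) -> int:
    """Convert a Seleucid-era year to the equivalent Anno Mundi year."""
    return seleucid_year + SELEUCID_OFFSET


# a document dated Adar 1427 (Seleucid) falls in the Hebrew year 4876
print(seleucid_to_anno_mundi(1427))  # 4876
```

Months and days carry over unchanged, since only the year count differs.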

The dataset includes original dates and standardized Common Era dates (Julian before 1583, Gregorian after).

This notebook uses the data published on GitHub at https://github.com/princetongenizalab/pgp-metadata


*Notebook authored by Rebecca Sutton Koeser, 2025.*

## Load and filter data

Limit to documents with authoritative "date on document" set in the metadata.

In [206]:
import pandas as pd

pgp_documents_csv = "https://github.com/princetongenizalab/pgp-metadata/raw/main/data/documents.csv"
documents = pd.read_csv(pgp_documents_csv)

In [207]:
# limit to documents with dates
docs_with_dates = documents[documents.doc_date_standard.notna() | documents.inferred_date_standard.notna()]
docs_with_docdate = documents[documents.doc_date_standard.notna()].copy()
docs_with_inferreddate = documents[documents.inferred_date_standard.notna()]

print(f"""
Total documents:      {len(documents):,}
Documents with dates: {len(docs_with_dates):,}
    date on document: {len(docs_with_docdate):,}
     inferred dating:  {len(docs_with_inferreddate):,}""")


Total documents:      35,156
Documents with dates: 4,432
    date on document: 4,107
     inferred dating:  331
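
The two subsets overlap: the totals above imply that a few documents carry both a date on the document and an inferred date. A quick check of that arithmetic, using only the numbers printed above:

```python
# totals copied from the output above
with_docdate = 4107
with_inferred = 331
with_either = 4432

# inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
with_both = with_docdate + with_inferred - with_either
print(with_both)  # 6
```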


In [208]:
docs_with_docdate[['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard']].head(10)

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard
5,449,1570,Seleucid,1259
16,463,19 Adar 1427,Seleucid,1116-03-05
23,472,1337,Seleucid,1025-08-28/1026-09-14
36,491,,,1131
41,499,"Wednesday, 15 Kislev 1500",Seleucid,1188-12-07
43,502,Tevet 1548,Seleucid,1236-11-30/1236-12-28
47,506,Elul 1428,Seleucid,1117-08-01/1117-08-29
55,516,First decade of Ḥeshvan 1442,Seleucid,1130-10-06/1130-10-15
61,524,"Thursday, 12 Sivan 4795",Anno Mundi,1035-05-22
62,525,Shawwāl 425,Hijrī,1034-08-29/1034-09-07


## Parse dates (standard and original)

Parse the standardized date (Julian/Gregorian) as EDTF; in some cases this may fail due to invalid user-entered data.

In [209]:
from lark.visitors import VisitError

# first, how far can we get with the standard dates? can we parse as edtf and sort, render?
from undate import Undate 

def parse_standard_date(value):
    # parse a standardized date as EDTF; report invalid values and return None
    try:
        return Undate.parse(value, "EDTF")
    except VisitError as err:
        print(f"Parse error on {value}: {err}")
        return None

# ignore gregorian/julian distinction for now
# from pgp code:
# Julian Thursday, 4 October 1582, being followed by Gregorian Friday, 15 October
# cut off between gregorian/julian dates, in julian days
#gregorian_start_jd = convertdate.julianday.from_julian(1582, 10, 5)

docs_with_docdate['undate_standard'] = docs_with_docdate.doc_date_standard.apply(parse_standard_date)

Parse error on 1217-02-20/1217-02-29: Error trying to process rule "date":

Day out of range in datetime string "1217-02-29"
Parse error on 1747-02-29: Error trying to process rule "date":

Day out of range in datetime string "1747-02-29"


In [210]:
# what are the records with standardized dates that couldn't be parsed?

# this is probably a data error in the original

docs_with_docdate[docs_with_docdate.undate_standard.isna()][['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard', 'last_modified']]

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,last_modified
3190,3957,middle decade of Adar 1528,Seleucid,1217-02-20/1217-02-29,2025-04-12 20:45:36.603800+00:00
34439,40006,,,1747-02-29,2024-08-07 18:24:19.425288+00:00
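
Both failures are 29 February dates in years that are not leap years (1217 and 1747), which supports reading them as data entry errors. As a rough sketch, values like these could be flagged before parsing with the standard library alone. The helper name `invalid_date_parts` is hypothetical, and `datetime.date` uses the proleptic Gregorian calendar, so a genuine Julian 29 February in a year such as 1100 would be flagged incorrectly.

```python
from datetime import date


def invalid_date_parts(value: str) -> list[str]:
    """Return the parts of a date or date interval that are not valid calendar dates."""
    invalid = []
    for part in value.split("/"):  # intervals are written earliest/latest
        fields = [int(f) for f in part.split("-")]
        # pad a missing month or day with 1 so year-only and year-month values validate
        year, month, day = (fields + [1, 1])[:3]
        try:
            date(year, month, day)
        except ValueError:
            invalid.append(part)
    return invalid


print(invalid_date_parts("1217-02-20/1217-02-29"))  # ['1217-02-29']
print(invalid_date_parts("1025-08-28/1026-09-14"))  # []
```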


What calendars are used by documents with original dates?

In [211]:
docs_with_docdate.doc_date_calendar.value_counts()

doc_date_calendar
Seleucid      1600
Anno Mundi    1144
Hijrī          871
Kharājī          8
Name: count, dtype: int64

In [212]:
# example hebrew dates
docs_with_docdate[docs_with_docdate.doc_date_calendar == "Anno Mundi"][['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard']].head(10)

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard
61,524,"Thursday, 12 Sivan 4795",Anno Mundi,1035-05-22
90,561,10 Nisan 4716,Anno Mundi,0956-03-24
111,582,"Thursday, 6 Adar 4996",Anno Mundi,1236-02-14
119,591,"Sunday, 29 Tammuz 4898",Anno Mundi,1138-07-10
131,603,4805/4806,Anno Mundi,1044-08-27/1045-09-13
177,660,22 Sivan 4974,Anno Mundi,1214-06-01
207,695,"Friday, [25] Nisan [4810]",Anno Mundi,1050-04-20
215,703,8 Elul (4)811,Anno Mundi,1051-08-18
255,750,"Friday, 24 Ḥeshvan 4765",Anno Mundi,1004-11-10
264,760,"Thursday, 11 Av 4783",Anno Mundi,1023-08-01


### Inspect variations in the data that may cause problems for parsing

There are some idiosyncrasies in the original dates, since some of them were entered before the PGPv4 system supported built-in conversion.

- calendar abbreviation included in the date string (i.e., AM, AH for _Anno Mundi_, _Anno Hegirae_ respectively)
- brackets for inferred digits or unknown digits (e.g., `152[.]` or `[4]82[.]`)
- ordinals instead of numerals for the day of the month (e.g., "11th Tammuz 4767" or "Monday, 27th Ṭevet 4797")

In [213]:
# how many end with AM ?
# combine both conditions in one mask; chaining two boolean selections triggers a pandas reindexing warning
hebrew_dates = docs_with_docdate[(docs_with_docdate.doc_date_calendar == "Anno Mundi") & docs_with_docdate.doc_date_original.notna()]
hebrew_dates[hebrew_dates.doc_date_original.str.endswith("AM")][['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard']]


Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard
702,1223,"Wednesday, 9 Tammuz 4912 AM",Anno Mundi,1152-06-13
16699,19975,"Sunday, 10 Kislev 5583 AM",Anno Mundi,1822-11-24
25417,30550,Tammuz 5537 AM,Anno Mundi,1777-07-06/1777-08-03


In [214]:
# how many include periods?
docs_with_docdate[docs_with_docdate.doc_date_original.notna() & docs_with_docdate.doc_date_original.str.contains("\\.")][['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard']]

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard
1556,2163,first third of Tammuz 500[.],Anno Mundi,1244/1249
1567,2175,End of Sivan 152[.],Seleucid,1209/1218
1753,2460,13[..],Seleucid,988/1088
2018,2745,1[.] Kislev 48[..],Anno Mundi,1039-11-30/1138-11-24
3044,3805,13[..],Seleucid,988/1087
...,...,...,...,...
30591,35955,12 Muḥarram 52[.],Hijrī,1126/1134
31228,36738,54[.],Hijrī,1145/1154
32550,38077,14[...],Seleucid,1088-09-19/1188-09-23
34654,40226,49[.],Hijrī,1096-12-19/1106-09-01


In [215]:
# how many use ordinals instead of numerals?
hebrew_dates[hebrew_dates.doc_date_original.str.contains("st") | hebrew_dates.doc_date_original.str.contains("rd") | hebrew_dates.doc_date_original.str.contains("th")][['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard']].head(10)

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard
635,1154,Last decade of Kislev 5004,Anno Mundi,1243-12
1172,1750,11th Tammuz 4767,Anno Mundi,1007
1173,1751,"Monday, 27th Ṭevet 4797",Anno Mundi,1037-01-23
1556,2163,first third of Tammuz 500[.],Anno Mundi,1244/1249
5142,6795,last decade of Tishrei 4991,Anno Mundi,1230-09-29/1230-10-08
5223,6892,last decade of Iyyar 4906,Anno Mundi,1146-05-04/1146-05-13
5664,7409,last third of Ḥeshvan 4965,Anno Mundi,1204-10-17/1204-10-25
5812,7581,middle third of Adar 4876,Anno Mundi,1116-05
7024,9068,Last decade of Ṭevet 4898,Anno Mundi,1138-01
8639,11215,Middle third of Av 4889,Anno Mundi,1129-07-29/1129-08-07
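
The substring checks above also match ordinary words such as "Last" and "first third", so the result mixes true ordinals with date modifiers. A stricter check, sketched here with the standard library, only matches digits followed directly by an ordinal suffix (the names `ORDINAL_DAY` and `has_ordinal_day` are illustrative):

```python
import re

# digits immediately followed by an English ordinal suffix, e.g. "11th" or "1st"
ORDINAL_DAY = re.compile(r"\b\d+(?:st|nd|rd|th)\b")


def has_ordinal_day(label: str) -> bool:
    """True if the date label contains a numeric ordinal such as 27th."""
    return ORDINAL_DAY.search(label) is not None


for label in [
    "11th Tammuz 4767",
    "Monday, 27th Ṭevet 4797",
    "Last decade of Kislev 5004",
    "first third of Tammuz 500[.]",
]:
    print(f"{label}: {has_ordinal_day(label)}")
```

The same pattern string can be passed to pandas `str.contains`, which treats its argument as a regular expression by default.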


In [216]:
import re

def remove_ordinals(val):
    return re.sub(r'(\d+)(st|nd|rd|th)', "\\1", val)

# test removing ordinals without removing the numbers
for val in ['11th Tammuz 4767', "27th Tevet", "8th Kislev"]:
    print(f"{val}: { remove_ordinals(val)}")

11th Tammuz 4767: 11 Tammuz 4767
27th Tevet: 27 Tevet
8th Kislev: 8 Kislev


Since this dataset has a mix of calendars and has known inconsistencies that may need cleaning,
we define a custom parsing method that selects the appropriate calendar and simplifies date portions that are not currently supported by the undate parsers.

In [217]:
# parse hijri, anno mundi, and seleucid dates as undates

import re
from lark.exceptions import UnexpectedEOF

# set this to True to see details about parsing
VERBOSE_PARSE_OUTPUT = False 


def parse_original_date(row):
    # print(f"PGPID {row.pgpid} {row.doc_date_original} ({row.doc_date_calendar})")
    undate_calendar = None
    if row.doc_date_calendar == "Anno Mundi":
        undate_calendar = "Hebrew"
    elif row.doc_date_calendar == "Hijrī":
        undate_calendar = "Islamic"
    elif row.doc_date_calendar == "Seleucid":
        # handle seleucid as hebrew with offset (adapt from pgp code)
        undate_calendar = "Seleucid"

    
    if undate_calendar:
        value = row.doc_date_original

        # some dates have unknown digits, e.g. 1[.] Kislev 48[..] or 152[.]
        # ... the calendar parsers don't support this, even though Undate does support unknown digits
        # in future, perhaps we can add missing digit logic with this syntax to share across appropriate parsers
        if '[.' in value:
            if VERBOSE_PARSE_OUTPUT:
                print(f"ignoring missing digits for now {value}")
            value = value.replace("[.]", "0").replace("[..]", "00").replace("[...]", "000")         
        
        # some dates have inferred numbers, e.g. Friday, [25] Nisan [4810] or 8 Elul (4)811
        # for now, just strip out brackets before parsing; 
        # in future, could potentially infer uncertainty based on these
        value = value.replace('[', '').replace(']', '').replace('(', '').replace(')', '')

        # for now, remove modifiers that are not supported by undate parser:
        #   Late Tevet 4903, Last decade of Kislev 5004, first third of ...
        #   some dates include of, e.g. day of month
        modifiers = ["Late ", "(first|middle|last)( third|half|decade|tenth)? (of )?", "(Beginning|end) of ", "last day", "First 10 days", " of", "spring", "decade ", "night, "]
        for mod in modifiers:
            value = re.sub(mod, "", value, flags=re.I)

        # there are a handful of misspelled wednesdays...
        value = value.replace("Wedensday", "Wednesday")
        # and a Thrusday
        value = value.replace("Thrusday", "Thursday")

        # three Hebrew calendar dates include text "AM" at end; at least one AH date
        if value.endswith(" AM") or value.endswith(" AH"):
            value = value[:-3]
        if value.endswith("."):  # strip off trailing period
            value = value[:-1]

        # about 62 have ordinals; strip them out
        value = remove_ordinals(value)
        
        try:
            return Undate.parse(value, undate_calendar)
        except (VisitError, ValueError, UnexpectedEOF) as err:
            if VERBOSE_PARSE_OUTPUT:
                print(f"Parse error on PGPID {row.pgpid} {value} ({undate_calendar}): {err}")

            # there are a handful of cases in PGP where calendars are mixed,
            # i.e. hebrew months used for hijri calendar

            # some dates are entered in ISO format for another calendar; can we parse and set calendar?
            if "-" in value and "/" not in value:  # exclude intervals for now
                try:
                    parsed = Undate.parse(value, "ISO8601")
                    if parsed:
                        parsed = parsed.as_calendar(undate_calendar)
                        if VERBOSE_PARSE_OUTPUT:
                            print(f"parsed {value} with ISO8601 format and calendar {undate_calendar}, result is {parsed} ({parsed.earliest}/{parsed.latest})")
                        return parsed
                except ValueError as err:
                    if VERBOSE_PARSE_OUTPUT:
                        print(f"Could not parse {value} as ISO date: {err}")

docs_with_docdate['undate_orig'] = docs_with_docdate.apply(parse_original_date, axis=1)
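
The cell above replaces missing digits with zeros, so `152[.]` is parsed as the single year 1520. As a hedged sketch of what a fuller treatment might compute, the helper below returns the range of years a bracketed label could stand for. `missing_digit_year_range` is a hypothetical name, it handles year labels only, and it is not how `undate` represents unknown digits.

```python
import re


def missing_digit_year_range(year_label: str) -> tuple[int, int]:
    """Earliest and latest year implied by a label with missing digits, e.g. 152[.]"""

    def fill(digit: str) -> int:
        # replace each "." inside a bracket group such as [..] with the given digit
        filled = re.sub(r"\[(\.+)\]", lambda m: digit * len(m.group(1)), year_label)
        # drop brackets and parentheses around inferred digits, e.g. [4]82 or (4)811
        return int(re.sub(r"[\[\]()]", "", filled))

    return fill("0"), fill("9")


print(missing_digit_year_range("152[.]"))    # (1520, 1529)
print(missing_digit_year_range("[4]82[.]"))  # (4820, 4829)
```

The Seleucid decade 1520 to 1529 lines up with the standardized range `1209/1218` recorded for `End of Sivan 152[.]` in the table earlier in this notebook.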

### Review parsing results 

How many of the dates in supported calendars were parsed?

In [218]:
orig_dates_parsed = docs_with_docdate[docs_with_docdate.undate_orig.notna()].copy()
orig_dates_unparsed = docs_with_docdate[docs_with_docdate.doc_date_original.notna() & docs_with_docdate.doc_date_calendar.isin(['Anno Mundi', 'Hijrī', 'Seleucid']) & docs_with_docdate.undate_orig.isna()] 

total_parsed = len(orig_dates_parsed)
total_unparsed = len(orig_dates_unparsed)
print(f"""original dates parsed: {total_parsed}
original dates unparsed: {total_unparsed} (anno mundi, hijri, and seleucid calendars)
proportion parsed: {(total_parsed/(total_parsed + total_unparsed))*100:0.2f}%""")

original dates parsed: 3443
original dates unparsed: 172 (anno mundi, hijri, and seleucid calendars)
proportion parsed: 95.24%


What is the date granularity of the dates that were parsed?

Note that these results are skewed somewhat due to the modifiers and uncertainty that we are simplifying in order to parse the dates.

In [219]:
# determine original date precision based on parsed undate
orig_dates_parsed['orig_date_precision'] = orig_dates_parsed.undate_orig.apply(lambda x: str(x.precision).lower())
orig_dates_parsed[['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard', 'undate_standard', 'undate_orig', 'orig_date_precision']].head()

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,orig_date_precision
5,449,1570,Seleucid,1259,1259,1570,year
16,463,19 Adar 1427,Seleucid,1116-03-05,1116-03-05,1427-12-19,day
23,472,1337,Seleucid,1025-08-28/1026-09-14,1025-08-28/1026-09-14,1337,year
41,499,"Wednesday, 15 Kislev 1500",Seleucid,1188-12-07,1188-12-07,1500-09-15,day
43,502,Tevet 1548,Seleucid,1236-11-30/1236-12-28,1236-11-30/1236-12-28,1548-10,month


In [220]:
# this is skewed because of the kinds of dates we're not able to parse or modifiers we're omitting entirely
orig_dates_parsed.orig_date_precision.value_counts()

orig_date_precision
day      1593
month    1022
year      828
Name: count, dtype: int64

Check on the Seleucid date parsing by comparing undate calendar conversion with the standardized CE date included in the dataset.

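We expect `undate` dates before 1583 to differ from the standardized dates by several days, since we did not adjust for the Julian calendar: the gap is 6 to 7 days in the examples below and grows to 10 days by 1582.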
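
To make the size of that gap concrete, here is a standard-library sketch that converts a proleptic Gregorian date to its Julian calendar equivalent by way of the Julian Day Number. The function name `gregorian_to_julian` is illustrative; this is not the conversion used by `undate` or by the PGP codebase.

```python
from datetime import date


def gregorian_to_julian(year: int, month: int, day: int) -> tuple[int, int, int]:
    """Convert a proleptic Gregorian date to the equivalent Julian calendar date."""
    # Julian Day Number of the Gregorian date (toordinal counts days from 0001-01-01)
    jdn = date(year, month, day).toordinal() + 1721425
    # standard integer algorithm: Julian Day Number -> Julian calendar date
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    jday = (h % 153) // 5 + 1
    jmonth = (h // 153 + 2) % 12 + 1
    jyear = e // 1461 - 4716 + (12 + 2 - jmonth) // 12
    return jyear, jmonth, jday


# 19 Adar 1427 (Seleucid): Gregorian 1116-03-12, standardized as 1116-03-05
print(gregorian_to_julian(1116, 3, 12))   # (1116, 3, 5)
# first day of the Gregorian calendar is the Julian 5 October 1582
print(gregorian_to_julian(1582, 10, 15))  # (1582, 10, 5)
```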

In [221]:
seleucid_dates = orig_dates_parsed[orig_dates_parsed.doc_date_calendar == 'Seleucid'].copy()
# add undate earliest/latest (Gregorian) for comparison with dataset standardized date 
seleucid_dates['undate_earliest'] = seleucid_dates.undate_orig.apply(lambda x: x.earliest)
seleucid_dates['undate_latest'] = seleucid_dates.undate_orig.apply(lambda x: x.latest)

seleucid_dates[['pgpid', 'doc_date_original', 'doc_date_calendar', 'undate_orig', 'orig_date_precision', 'doc_date_standard', 'undate_earliest', 'undate_latest']].head(10)
                

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,undate_orig,orig_date_precision,doc_date_standard,undate_earliest,undate_latest
5,449,1570,Seleucid,1570,year,1259,1258-09-07,1259-09-26
16,463,19 Adar 1427,Seleucid,1427-12-19,day,1116-03-05,1116-03-12,1116-03-12
23,472,1337,Seleucid,1337,year,1025-08-28/1026-09-14,1025-09-03,1026-09-20
41,499,"Wednesday, 15 Kislev 1500",Seleucid,1500-09-15,day,1188-12-07,1188-12-14,1188-12-14
43,502,Tevet 1548,Seleucid,1548-10,month,1236-11-30/1236-12-28,1236-12-07,1237-01-04
47,506,Elul 1428,Seleucid,1428-06,month,1117-08-01/1117-08-29,1117-08-08,1117-09-05
55,516,First decade of Ḥeshvan 1442,Seleucid,1442-08,month,1130-10-06/1130-10-15,1130-10-13,1130-11-10
73,537,Ḥeshvan 1453,Seleucid,1453-08,month,1141,1141-10-11,1141-11-08
75,544,"Sunday, 21 Kislev 1355",Seleucid,1355-09-21,day,1043-11-26,1043-12-02,1043-12-02
91,562,15 Av 1346,Seleucid,1346-05-15,day,1035-07-23,1035-07-29,1035-07-29


In [222]:
# can we sort by parsed original dates? 
# doesn't work currently because of overlapping dates / different granularity
#orig_dates_parsed.sort_values(by='undate_orig') #, key=lambda col: col.value.earliest)

## Plot documents by date

For the dates we could parse, how are the documents distributed over time and calendar?

First let's graph by year based on the midpoint of the date range.

In [223]:
# set earliest/latest for graphing

# NOTE: we have to cast type to something pandas/altair supports

orig_dates_parsed['orig_date_earliest'] = orig_dates_parsed.undate_orig.apply(lambda x: x.earliest).astype('datetime64[s]')
orig_dates_parsed['orig_date_latest'] = orig_dates_parsed.undate_orig.apply(lambda x: x.latest).astype('datetime64[s]')
orig_dates_parsed['orig_date_mid'] = orig_dates_parsed.undate_orig.apply(lambda x: x.earliest + (x.latest - x.earliest)/2).astype('datetime64[s]')

In [224]:
orig_dates_parsed[['orig_date_earliest', 'orig_date_latest', 'orig_date_mid', 'pgpid', 'doc_date_calendar']].head(10)

Unnamed: 0,orig_date_earliest,orig_date_latest,orig_date_mid,pgpid,doc_date_calendar
5,1258-09-07,1259-09-26,1259-03-18,449,Seleucid
16,1116-03-12,1116-03-12,1116-03-12,463,Seleucid
23,1025-09-03,1026-09-20,1026-03-13,472,Seleucid
41,1188-12-14,1188-12-14,1188-12-14,499,Seleucid
43,1236-12-07,1237-01-04,1236-12-21,502,Seleucid
47,1117-08-08,1117-09-05,1117-08-22,506,Seleucid
55,1130-10-13,1130-11-10,1130-10-27,516,Seleucid
61,1035-05-28,1035-05-28,1035-05-28,524,Anno Mundi
62,1034-08-25,1034-09-22,1034-09-08,525,Hijrī
73,1141-10-11,1141-11-08,1141-10-25,537,Seleucid
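
The midpoint values can be checked by hand. This sketch repeats the calculation with `datetime.date` instead of the date type returned by `undate` (an assumption made so the example needs no third-party packages); `date_midpoint` is a hypothetical helper.

```python
from datetime import date


def date_midpoint(earliest: date, latest: date) -> date:
    """Midpoint of a date range; ranges with an odd number of days round down."""
    # adding a timedelta to a date ignores the sub-day remainder
    return earliest + (latest - earliest) / 2


# Seleucid year 1570 spans 1258-09-07 to 1259-09-26
print(date_midpoint(date(1258, 9, 7), date(1259, 9, 26)))  # 1259-03-18
# Tevet 1548 spans 1236-12-07 to 1237-01-04
print(date_midpoint(date(1236, 12, 7), date(1237, 1, 4)))  # 1236-12-21
```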


In [225]:
# graph documents by calendar
import altair as alt

date_docs_cal = orig_dates_parsed[orig_dates_parsed.doc_date_standard.notna()]

dated_docs_cal = date_docs_cal.fillna({'doc_date_calendar': 'Unspecified'})
dated_docs_cal['midpoint_year'] = dated_docs_cal.orig_date_mid.apply(lambda x: x.year)

orig_dates_calendars_chart = alt.Chart(dated_docs_cal[['pgpid', 'midpoint_year', 'doc_date_calendar']]).mark_area(opacity=0.7).encode(
  x=alt.X('midpoint_year', title="Year (midpoint)", bin=alt.Bin(maxbins=120), axis=alt.Axis(format="r")),
  y=alt.Y('count(pgpid)', title='Documents'),
  color=alt.Color("doc_date_calendar", title="Calendar")
).properties(width=900, height=200, title="Documents by calendar (original date)")

orig_dates_calendars_chart

For comparison, what does it look like if we graph by the standardized dates in the dataset?

In [226]:
# graph documents with calendars

def undate_midpoint(value):
    # parsed standard date could be an undate or an interval; handle either
    if isinstance(value, Undate):
        earliest = value.earliest
        latest = value.latest
    else: # interval
        earliest = value.earliest.earliest
        latest = value.latest.latest
    return earliest + (latest - earliest)/2
    

dated_docs_cal = docs_with_docdate.copy()
dated_docs_cal = dated_docs_cal.fillna({'doc_date_calendar': 'Unspecified'})
# get the midpoint from the parsed standard date; convert to supported type
dated_docs_cal['midpoint'] = dated_docs_cal.undate_standard.apply(lambda x: undate_midpoint(x) if pd.notna(x) else None).astype("datetime64[s]")
dated_docs_cal['midpoint_year'] = dated_docs_cal.midpoint.apply(lambda x: x.year if pd.notna(x) else None)


std_dates_calendars_chart = alt.Chart(dated_docs_cal[['pgpid', 'midpoint_year', 'doc_date_calendar']]).mark_area(opacity=0.7).encode(
  x=alt.X('midpoint_year', title="Year", bin=alt.Bin(maxbins=120), axis=alt.Axis(format="r")),
  y=alt.Y('count(pgpid)', title='Documents'),
  color=alt.Color("doc_date_calendar", title="Calendar").scale(domain=['Anno Mundi', 'Hijrī', 'Seleucid', 'Kharājī', 'Unspecified'])
).properties(width=900, height=200, title="Documents by calendar (standard date)")

std_dates_calendars_chart

Here are the two plots together. The unspecified calendars are most likely Julian/Gregorian dates.

In [227]:
orig_dates_calendars_chart & std_dates_calendars_chart

We can try graphing by range, but our parsing currently excludes the original dates with larger ranges.

In [228]:
import altair as alt

graphable_data = orig_dates_parsed[['orig_date_earliest', 'orig_date_latest', 'orig_date_mid', 'pgpid', 'doc_date_calendar']].copy()
# graphable_data['midpoint'] = graphable_data.undate_standard.apply(lambda x: undate_midpoint(x) if pd.notna(x) else None).astype("datetime64[s]")
graphable_data['midpoint_year'] = graphable_data.orig_date_mid.apply(lambda x: x.year if pd.notna(x) else None)


bar_chart = alt.Chart(graphable_data).mark_bar(opacity=0.5).encode(
    x=alt.X('orig_date_earliest:T', title="original date (range)"), # , axis=alt.Axis(format="r")),
    x2='orig_date_latest:T',
    y=alt.Y('count(pgpid)', title='Count of Documents')
).properties(width=1200, height=150)

line_chart = alt.Chart(graphable_data).mark_line(opacity=0.6, color="green", interpolate="monotone").encode(
 x=alt.X('orig_date_mid:T', title="Year (midpoint)"),
 y=alt.Y('count(pgpid)', title='Documents')
).properties(width=1200, height=150)

(bar_chart & line_chart).properties(title="Documents by date (1000-1300)").interactive()

## Compare weekdays

Sometimes the original date includes a day of the week; we don't expect these to be completely reliable, but let's compare the weekdays in the original date with the weekday as determined by the parsed `Undate`.

In [230]:
weekday_dates = orig_dates_parsed[orig_dates_parsed.doc_date_original.str.contains('day ')][['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard', 'undate_standard', 'undate_orig', 'orig_date_precision', 'type']]
weekday_dates

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,orig_date_precision,type
851,1377,"Wednesday night, 28 Sivan 1581",Seleucid,1270,1270,1581-03-28,day,Legal document
1714,2418,Monday 20 Tevet 1520,Seleucid,1208-12-29,1208-12-29,1520-10-20,day,Legal document
1929,2649,"Sunday night, 25 Kislev 1444",Seleucid,1133,1133,1444-09-25,day,Legal document
2013,2739,Wednesday 29th Elul 1354,Seleucid,1043-09-07,1043-09-07,1354-06-29,day,Legal document
3257,4026,"Wednesday night, 29 Tishrei 1541",Seleucid,1229-09-18,1229-09-18,1541-07-29,day,Legal document
...,...,...,...,...,...,...,...,...
29305,34623,"Sunday night, 20 Ṭevet 1578",Seleucid,1266/1267,1266/1267,1578-10-20,day,Legal document
29926,35264,Wednesday 13 Ṭevet 1526,Seleucid,1214/1215,1214/1215,1526-10-13,day,Legal document
34010,39564,Monday 16 Tevet 1339,Seleucid,1027-12-18,1027-12-18,1339-10-16,day,Legal document
34468,40035,Monday 1st Iyyar 1437,Seleucid,1126-04-26,1126-04-26,1437-02-01,day,Legal document


Extract the weekday from the original date and determine the undate weekday.

Both Arabic and Hebrew days begin in the evening, so if the date string includes the text "night" we shift the original day by one for comparison.

In [231]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# get numeric weekday; since these dates are all day-precision we can just use the earliest date
weekday_dates['undate_weekday'] = weekday_dates.undate_orig.apply(lambda x: x.earliest.weekday)
weekday_dates['undate_weekday_name'] = weekday_dates.undate_weekday.apply(lambda x: days[x])
# extract weekday from date label
weekday_dates['orig_weekday'] = weekday_dates.doc_date_original.str.extract('([a-zA-Z]+day)', expand=False).str.strip()
# correct misspellings
misspelled_days = {
    "Wedensday": "Wednesday",
    "Thrusday": "Thursday",
}
weekday_dates['orig_weekday'] = weekday_dates.orig_weekday.apply(lambda x: misspelled_days.get(x, x))

# shift night to next day, e.g. Wednesday night should be Thursday
# NOTE: this must be done immediately after the day extraction, otherwise repeated runs continue shifting to the next day
def next_day(weekday):
    return days[(days.index(weekday) +1) % 7]

weekday_dates['orig_weekday'] = weekday_dates.apply(lambda row: next_day(row.orig_weekday) if " night" in row.doc_date_original else row.orig_weekday, axis=1)

weekday_dates

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,orig_date_precision,type,undate_weekday,undate_weekday_name,orig_weekday
851,1377,"Wednesday night, 28 Sivan 1581",Seleucid,1270,1270,1581-03-28,day,Legal document,3,Thursday,Thursday
1714,2418,Monday 20 Tevet 1520,Seleucid,1208-12-29,1208-12-29,1520-10-20,day,Legal document,0,Monday,Monday
1929,2649,"Sunday night, 25 Kislev 1444",Seleucid,1133,1133,1444-09-25,day,Legal document,0,Monday,Monday
2013,2739,Wednesday 29th Elul 1354,Seleucid,1043-09-07,1043-09-07,1354-06-29,day,Legal document,2,Wednesday,Wednesday
3257,4026,"Wednesday night, 29 Tishrei 1541",Seleucid,1229-09-18,1229-09-18,1541-07-29,day,Legal document,3,Thursday,Thursday
...,...,...,...,...,...,...,...,...,...,...,...
29305,34623,"Sunday night, 20 Ṭevet 1578",Seleucid,1266/1267,1266/1267,1578-10-20,day,Legal document,0,Monday,Monday
29926,35264,Wednesday 13 Ṭevet 1526,Seleucid,1214/1215,1214/1215,1526-10-13,day,Legal document,2,Wednesday,Wednesday
34010,39564,Monday 16 Tevet 1339,Seleucid,1027-12-18,1027-12-18,1339-10-16,day,Legal document,0,Monday,Monday
34468,40035,Monday 1st Iyyar 1437,Seleucid,1126-04-26,1126-04-26,1437-02-01,day,Legal document,0,Monday,Monday


Here is the subset of records that specify "night":

In [232]:
weekday_dates[weekday_dates.doc_date_original.str.contains(" night")]

Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,orig_date_precision,type,undate_weekday,undate_weekday_name,orig_weekday
851,1377,"Wednesday night, 28 Sivan 1581",Seleucid,1270,1270,1581-03-28,day,Legal document,3,Thursday,Thursday
1929,2649,"Sunday night, 25 Kislev 1444",Seleucid,1133,1133,1444-09-25,day,Legal document,0,Monday,Monday
3257,4026,"Wednesday night, 29 Tishrei 1541",Seleucid,1229-09-18,1229-09-18,1541-07-29,day,Legal document,3,Thursday,Thursday
5511,7237,"Tuesday night, 22 Kislev 1435",Seleucid,1123-12-12,1123-12-12,1435-09-22,day,Legal document,2,Wednesday,Wednesday
5854,7637,"Monday night, 29 Ṭevet 1438",Seleucid,1127,1127,1438-10-29,day,Legal document,4,Friday,Tuesday
5857,7642,"Thursday night, 23 Tammuz 1538",Seleucid,1227-07-09,1227-07-09,1538-04-23,day,Legal document,4,Friday,Friday
6419,8332,"Friday night, 20 Iyar 4957",Anno Mundi,1197-05,1197-05,4957-02-20,day,Legal document,5,Saturday,Saturday
29305,34623,"Sunday night, 20 Ṭevet 1578",Seleucid,1266/1267,1266/1267,1578-10-20,day,Legal document,0,Monday,Monday


How many of the original and undate weekdays match?

In [233]:
matches = weekday_dates[weekday_dates.undate_weekday_name == weekday_dates.orig_weekday]

mismatches = weekday_dates[weekday_dates.undate_weekday_name != weekday_dates.orig_weekday]

print(f"{len(matches)} matches, {len(mismatches)} mismatches ({(len(matches)/(len(matches)+len(mismatches)))*100:0.2f}%)")
mismatches.head(20)

44 matches, 60 mismatches (42.31%)


Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,orig_date_precision,type,undate_weekday,undate_weekday_name,orig_weekday
5271,6947,Monday 3 Iyyar 1740,Seleucid,1429-04-07,1429-04-07,1740-02-03,day,Legal document,3,Thursday,Monday
5854,7637,"Monday night, 29 Ṭevet 1438",Seleucid,1127,1127,1438-10-29,day,Legal document,4,Friday,Tuesday
8649,11227,Monday 24 Jumādā I 517,Hijrī,1123-07-20,1123-07-20,0517-05-24,day,Paraliterary text,4,Friday,Monday
16398,19649,Thursday 26 Iyyar 5306,Anno Mundi,1546-04-28,1546-04-28,5306-02-26,day,Legal document,2,Wednesday,Thursday
17724,21094,Saturday 20 Rajab 550,Hijrī,1155-09-19,1155-09-19,0550-07-20,day,Legal document,0,Monday,Saturday
23101,27479,Tuesday 11 Tammuz 5525,Anno Mundi,1765-06-30,1765-06-30,5525-04-11,day,Legal document,6,Sunday,Tuesday
23106,27484,Friday 20th Shevat 5405,Anno Mundi,1645,1645,5405-11-20,day,Legal document,3,Thursday,Friday
23107,27485,Sunday 22 Adar 5590,Anno Mundi,1830-03-17,1830-03-17,5590-12-22,day,Legal document,2,Wednesday,Sunday
23109,27487,Thursday 15th Shevat 5450,Anno Mundi,1690,1690,5450-11-15,day,Legal document,2,Wednesday,Thursday
23111,27489,Sunday 6 Nisan 5528,Anno Mundi,1768-03-24,1768-03-24,5528-01-06,day,Legal document,3,Thursday,Sunday


Is there any noticeable pattern in where the mismatches come from, by calendar or by day of week?

In [234]:
mismatches.doc_date_calendar.value_counts()

doc_date_calendar
Anno Mundi    55
Seleucid       3
Hijrī          2
Name: count, dtype: int64

In [235]:
mismatches.orig_weekday.value_counts()

orig_weekday
Wednesday    17
Sunday       12
Monday       10
Thursday      9
Tuesday       7
Friday        4
Saturday      1
Name: count, dtype: int64
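
Another way to look for a pattern is to measure how far apart the two weekdays are, since a constant offset would point to a systematic cause rather than scattered errors. A minimal sketch, using a hypothetical `weekday_offset` helper and the same Monday-first ordering as the `days` list above:

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_offset(recorded: str, computed: str) -> int:
    """Days forward (0-6) from the recorded weekday to the computed weekday."""
    return (DAYS.index(computed) - DAYS.index(recorded)) % 7


# PGPID 6947: recorded Monday, computed Thursday
print(weekday_offset("Monday", "Thursday"))     # 3
# PGPID 19649: recorded Thursday, computed Wednesday
print(weekday_offset("Thursday", "Wednesday"))  # 6
```

Applied row by row to `orig_weekday` and `undate_weekday_name`, the distribution of offsets would show whether the mismatches cluster on one value.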

In [236]:
# how many mismatches are due to night?
night_mismatches = mismatches[mismatches.doc_date_original.str.contains(" night")]
print(f"{len(night_mismatches)} mismatches that include text 'night'")
night_mismatches

1 mismatches that include text 'night'


Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,orig_date_precision,type,undate_weekday,undate_weekday_name,orig_weekday
5854,7637,"Monday night, 29 Ṭevet 1438",Seleucid,1127,1127,1438-10-29,day,Legal document,4,Friday,Tuesday


### Plot document frequency by day

Because we're preserving as much date information as possible, we can plot based on things like weekday, even across different calendars.

For documents with day-level date precision, how are they distributed by weekday?

In [237]:
# get numeric weekday
orig_dates_parsed['undate_weekday'] = orig_dates_parsed.undate_orig.apply(lambda x: x.earliest.weekday)
orig_dates_parsed['undate_weekday_name'] = orig_dates_parsed.undate_weekday.apply(lambda x: days[x])

# restrict to dates with day precision; the rest are just using earliest day
orig_dates_days = orig_dates_parsed[orig_dates_parsed.orig_date_precision == 'day']

alt.Chart(orig_dates_days[['undate_weekday', 'undate_weekday_name', 'pgpid']]).mark_rect().encode(
    alt.X('undate_weekday_name', sort=days, title='weekday'),
    alt.Color('count(pgpid)', title='# of documents')
).properties(title='document frequency by weekday')


In [238]:
orig_dates_days.undate_weekday_name.value_counts()

undate_weekday_name
Monday       304
Thursday     282
Tuesday      239
Sunday       229
Wednesday    228
Friday       214
Saturday      97
Name: count, dtype: int64

In [255]:
weekday_calendar_chart = alt.Chart(weekday_dates[['undate_weekday', 'undate_weekday_name', 'pgpid', 'doc_date_calendar']]).mark_rect().encode(
    alt.X('undate_weekday_name', sort=days, title='weekday'),
    # alt.Y('doc_date_calendar'),
    alt.Color('count(pgpid)')
).facet(row=alt.Facet('doc_date_calendar', title="Original Calendar")).properties(title='document frequency by weekday and calendar')
weekday_calendar_chart

This chart is skewed because we have so many more day-precision dates from the Hebrew calendar than from any other.

In [240]:
weekday_dates.doc_date_calendar.value_counts()

doc_date_calendar
Anno Mundi    82
Seleucid      20
Hijrī          2
Name: count, dtype: int64

This is more obvious if we use independent color scales.

In [256]:
weekday_calendar_chart.resolve_scale(color='independent')

What about weekday by century?

In [242]:
# get rough century (gregorian calendar)
# integer division avoids nested quotes in the f-string, which require Python 3.12+
weekday_dates['century'] = orig_dates_days.undate_orig.apply(lambda x: f"{x.earliest.year // 100:02d}00s")

weekday_dates[['pgpid', 'doc_date_original', 'doc_date_calendar', 'doc_date_standard', 'undate_standard', 'undate_orig', 'century']].head()


Unnamed: 0,pgpid,doc_date_original,doc_date_calendar,doc_date_standard,undate_standard,undate_orig,century
851,1377,"Wednesday night, 28 Sivan 1581",Seleucid,1270,1270,1581-03-28,1200s
1714,2418,Monday 20 Tevet 1520,Seleucid,1208-12-29,1208-12-29,1520-10-20,1200s
1929,2649,"Sunday night, 25 Kislev 1444",Seleucid,1133,1133,1444-09-25,1100s
2013,2739,Wednesday 29th Elul 1354,Seleucid,1043-09-07,1043-09-07,1354-06-29,1000s
3257,4026,"Wednesday night, 29 Tishrei 1541",Seleucid,1229-09-18,1229-09-18,1541-07-29,1200s


In [243]:

alt.Chart(weekday_dates[['undate_weekday', 'undate_weekday_name', 'pgpid', 'century']]).mark_rect().encode(
    alt.X('undate_weekday_name', sort=days, title='weekday'),
    alt.Y('century'),
    alt.Color('count(pgpid)')
).properties(title='document frequency by weekday and century')


The weekday + century heatmap suggests we're more likely to have day-level precision dates from the 1700s than any other time period in the dataset.

## Plot frequency by month and calendar

In [247]:
# what about heat map by month?

# get numeric month
orig_dates_parsed['undate_month'] = orig_dates_parsed.undate_orig.apply(lambda x: x.month)
# orig_dates_parsed['undate_weekday_name'] = orig_dates_parsed.undate_weekday.apply(lambda x: days[x])

has_month = orig_dates_parsed[orig_dates_parsed.undate_month.notna()]

alt.Chart(has_month[['undate_month', 'pgpid', 'doc_date_calendar']]).mark_rect().encode(
    alt.X('undate_month', title='month'),
    alt.Color('count(pgpid)', title='# of documents')
).facet(
    row=alt.Facet('doc_date_calendar', title="Original Calendar")
).properties(title='Document frequency by month and calendar')

That very light month 13 in the Hebrew and Seleucid calendars reflects the fact that the Hebrew calendar has a leap _month_.
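
For reference, the leap-month rule follows a 19-year cycle in which seven years have thirteen months. A minimal sketch of the standard arithmetic test (the function names are illustrative, and this is independent of how `undate` implements the Hebrew calendar):

```python
def is_hebrew_leap_year(year: int) -> bool:
    """True if the Anno Mundi year has a thirteenth (leap) month."""
    # 7 of every 19 years in the cycle are leap years
    return (7 * year + 1) % 19 < 7


def months_in_hebrew_year(year: int) -> int:
    return 13 if is_hebrew_leap_year(year) else 12


print(is_hebrew_leap_year(5784))    # True
print(months_in_hebrew_year(5783))  # 12
```

For a Seleucid year, add the 3449-year offset before applying the test.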

In [248]:
has_month.doc_date_calendar.value_counts()

doc_date_calendar
Seleucid      1196
Anno Mundi     903
Hijrī          516
Name: count, dtype: int64

In [249]:
orig_dates_days[orig_dates_days.undate_weekday_name.notna()].shape

(1593, 38)

In [250]:
# weekday frequency by month?

# work on an explicit copy so the new column is not set on a slice of orig_dates_parsed
orig_dates_days = orig_dates_days.copy()
orig_dates_days['undate_month'] = orig_dates_days.undate_orig.apply(lambda x: x.month)

alt.Chart(orig_dates_days[['undate_weekday', 'undate_weekday_name', 'pgpid', 'undate_month', 'doc_date_calendar']]).mark_rect().encode(
    alt.X('undate_weekday_name', sort=days, title='weekday'),
    alt.Y('undate_month', title="month"),
    alt.Color('count(pgpid)')
).facet(
    column=alt.Facet('doc_date_calendar', title="Original Calendar")
).properties(title=f'Document frequency by weekday and month ({len(orig_dates_days):,} documents)')
