# Compare partial date duration logic

Comparing `UndateInterval` with similar work from the Shakespeare and Company Project (S&co for short).

This notebook compares the `UndateInterval` duration calculation for date ranges between partially known dates with the similar logic implemented in the [Shakespeare and Company Project](https://shakespeareandco.princeton.edu/) [events dataset](https://doi.org/10.34770/nz90-ym25). Event start and end dates are in ISO 8601 format and include as much precision as is known for each date; the format is one of `YYYY`, `YYYY-MM`, `YYYY-MM-DD`, or `--MM-DD`.
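
As a rough illustration of the four date formats listed above, the sketch below classifies a date string by its precision using regular expressions. The `date_precision` helper and its labels are hypothetical names made up for this example; they are not part of Undate or the S&co codebase.

```python
import re

# Hypothetical labels for the four formats named in the text
PATTERNS = {
    "year": re.compile(r"\d{4}"),
    "year-month": re.compile(r"\d{4}-\d{2}"),
    "year-month-day": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "month-day": re.compile(r"--\d{2}-\d{2}"),
}


def date_precision(value):
    """Return a label for the precision of a date string, or None if unrecognized."""
    for label, pattern in PATTERNS.items():
        # fullmatch requires the whole string to match the pattern
        if pattern.fullmatch(value):
            return label
    return None


for example in ["1920", "1921-07", "1922-08-23", "--01-07"]:
    print(example, date_precision(example))
```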

Deciding how to calculate date ranges may be contextual; the current `UndateInterval` logic includes both the start and the end date, while the S&co logic does not, so the two differ by one day. Once we make that adjustment, the borrowing durations in the S&co data match the logic in Undate.

Subscription durations in S&co are sometimes known to be for a particular term (e.g. a year or six months) but without specific dates, perhaps only a year or year and month; Undate calculates durations based on the earliest and latest days in the range, so it overestimates these durations.

*Notebook authored by Rebecca Sutton Koeser, 2023.*


In [2]:
%pip install -q pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Users/rkoeser/workarea/env/undate/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd

# load the 1.2 version of S&co events dataset; we have a copy in our use-cases folder
events_df = pd.read_csv("../use-cases/shakespeare-and-company-project/SCoData_events_v1.2_2022-01.csv", low_memory=False)
events_df.head()

Unnamed: 0,event_type,start_date,end_date,member_uris,member_names,member_sort_names,subscription_price_paid,subscription_deposit,subscription_duration,subscription_duration_days,...,item_uri,item_title,item_volume,item_authors,item_year,item_notes,source_type,source_citation,source_manifest,source_image
0,Generic,1920,,https://shakespeareandco.princeton.edu/members...,Raymonde Linossier,"Linossier, Raymonde",,,,,...,https://shakespeareandco.princeton.edu/books/b...,Pigs Is Pigs,,"Butler, Ellis Parker",1906.0,,Lending Library Card,"Sylvia Beach, Raymonde Linossier Lending Libra...",https://figgy.princeton.edu/concern/scanned_re...,https://iiif.princeton.edu/loris/figgy_prod/00...
1,Subscription,1921,,https://shakespeareandco.princeton.edu/members...,Mme Garreta,"Garreta, Mme",,,,,...,,,,,,,Address Book,"Sylvia Beach, Address Book 1919–1935, box 69, ...",,
2,Borrow,1922,1922-08-23,https://shakespeareandco.princeton.edu/members...,Mr. Rhys,"Rhys, Mr.",,,,,...,https://shakespeareandco.princeton.edu/books/c...,Typhoon,,"Conrad, Joseph",1902.0,,Lending Library Card,"Sylvia Beach, Rhys Lending Library Card, Box 4...",https://figgy.princeton.edu/concern/scanned_re...,https://iiif.princeton.edu/loris/figgy_prod/67...
3,Generic,1922,,https://shakespeareandco.princeton.edu/members...,Ernest Walsh,"Walsh, Ernest",,,,,...,https://shakespeareandco.princeton.edu/books/b...,The Pretty Lady,,"Bennett, Arnold",1918.0,,Lending Library Card,"Sylvia Beach, Ernest Walsh Lending Library Car...",https://figgy.princeton.edu/concern/scanned_re...,https://iiif.princeton.edu/loris/figgy_prod/af...
4,Subscription,1922,,https://shakespeareandco.princeton.edu/members...,Mr. Lincoln,"Lincoln, Mr.",,7.0,,,...,,,,,,,Address Book,"Sylvia Beach, Address Book 1919–1935, box 69, ...",,


## Define a method to parse dates and calculate duration

Define a method to initialize an `UndateInterval` from start and end date strings in ISO format, as used in the S&co datasets, and return its duration.

**Note:** There's an off-by-one discrepancy between how we currently calculate duration in Undate and in the Shakespeare and Company Project code. This is because S&co code counts the first day in the range but not the last (this could also be thought of as counting half of the start and end dates). For simplicity of comparison here, we subtract one day from the result returned by `UndateInterval.duration`.

In [1]:
from undate import UndateInterval
from undate.date import ONE_DAY
from undate.converters.iso8601 import ISO8601DateFormat

def undate_duration(start_date, end_date):
    """Parse ISO date strings and return the interval duration, less one day."""
    isoformat = ISO8601DateFormat()

    # parse each date string into an Undate object
    unstart = isoformat.parse(start_date)
    unend = isoformat.parse(end_date)
    interval = UndateInterval(earliest=unstart, latest=unend)

    # subtract one here for simplicity of comparison,
    # to reconcile differences between duration logic
    return interval.duration() - ONE_DAY
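
To make the off-by-one difference concrete, here is a minimal sketch using only the standard library `datetime` module rather than Undate. The two helper functions are hypothetical names for this illustration; they mirror the two counting conventions described in the note above, not the actual implementation in either project.

```python
from datetime import date


def exclusive_days(start, end):
    """Count days from start up to, but not including, the end date."""
    return (end - start).days


def inclusive_days(start, end):
    """Count days including both the start and the end date."""
    return (end - start).days + 1


# one of the fully known subscription date ranges from the dataset
start, end = date(1941, 11, 24), date(1941, 12, 24)
print(exclusive_days(start, end))  # 30
print(inclusive_days(start, end))  # 31
```

The two conventions always differ by exactly one day, which is why subtracting a single day is enough to make the values comparable.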

## Compare subscription event durations

S&co data includes membership subscriptions with known duration; the dataset includes them in a human readable format (`subscription_duration`) and in a numeric form (`subscription_duration_days`).

Select subscription events with available duration information to compare with Undate logic.

In [5]:
# identify subscription events with duration information
subs_duration = events_df[events_df.subscription_duration_days.notna()]
# limit to fields that are relevant for this exploration
subs_duration = subs_duration[['member_names', 'start_date', 'end_date', 'subscription_duration', 'subscription_duration_days']]
subs_duration.head()

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days
28,Arthur Elliott Felkin,1927,1928,1 year,365.0
70,Geraldine Deknatel;William Deknatel,1931,1932,1 year,365.0
233,Mrs. G. S. Madam,1921-07,1921-08,1 month,31.0
234,Anne Moderwell;Hiram Moderwell / H. K. Moderwell,1921-09,1922-02,5 months,153.0
260,Victor Llona,1923-06,1923-10,4 months,122.0


### Subscription duration exploration

Briefly explore the duration data information for these subscriptions.

What do the duration day values look like? What range of values?

In [6]:
# What do the subscription duration day values look like?
subs_duration.subscription_duration_days.value_counts()

subscription_duration_days
31.0     2997
30.0     1975
92.0      936
91.0      397
365.0     337
         ... 
69.0        1
36.0        1
73.0        1
574.0       1
171.0       1
Name: count, Length: 133, dtype: int64

In [7]:
subs_duration.subscription_duration_days.describe()

count    9146.000000
mean       72.142685
std        81.559368
min         1.000000
25%        30.000000
50%        31.000000
75%        91.000000
max       574.000000
Name: subscription_duration_days, dtype: float64

Do we have any subscriptions with known duration but unknown start or end date?

In [8]:
# events with unknown start date
subs_duration[subs_duration.start_date.isna()]

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days


In [9]:
# events with unknown end date
subs_duration[subs_duration.end_date.isna()]

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days
13168,Jean (Bakewell) Connolly / Mrs. Cyril Connolly,1932-10-06,,,31.0
13686,Stanislas Pascal Franchot,1933-03-02,,,31.0


There are two one-month subscriptions with known start date but end date not set. Exclude those from our comparison.

In [10]:
# omit events with unknown end date since we can't recalculate duration
# (duration in the dataset is based on the subscription duration)
subs_duration = subs_duration[subs_duration.end_date.notna()]

### Calculate durations with Undate and compare

In [11]:
# add a new field for duration as calculated by Undate using the method defined previously
subs_duration["undate_duration"] = subs_duration.apply(lambda row: undate_duration(str(row.start_date), str(row.end_date)), axis=1)
subs_duration.head()

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days,undate_duration
28,Arthur Elliott Felkin,1927,1928,1 year,365.0,730 days
70,Geraldine Deknatel;William Deknatel,1931,1932,1 year,365.0,730 days
233,Mrs. G. S. Madam,1921-07,1921-08,1 month,31.0,61 days
234,Anne Moderwell;Hiram Moderwell / H. K. Moderwell,1921-09,1922-02,5 months,153.0,180 days
260,Victor Llona,1923-06,1923-10,4 months,122.0,152 days


In [12]:
# Compare undate duration with dataset duration
subs_duration.head()

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days,undate_duration
28,Arthur Elliott Felkin,1927,1928,1 year,365.0,730 days
70,Geraldine Deknatel;William Deknatel,1931,1932,1 year,365.0,730 days
233,Mrs. G. S. Madam,1921-07,1921-08,1 month,31.0,61 days
234,Anne Moderwell;Hiram Moderwell / H. K. Moderwell,1921-09,1922-02,5 months,153.0,180 days
260,Victor Llona,1923-06,1923-10,4 months,122.0,152 days


In [13]:
# what's the difference between the two?
subs_duration['duration_diff'] = subs_duration.apply(lambda row: row.undate_duration.astype("int") - row.subscription_duration_days, axis=1)
subs_duration

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days,undate_duration,duration_diff
28,Arthur Elliott Felkin,1927,1928,1 year,365.0,730 days,365.0
70,Geraldine Deknatel;William Deknatel,1931,1932,1 year,365.0,730 days,365.0
233,Mrs. G. S. Madam,1921-07,1921-08,1 month,31.0,61 days,30.0
234,Anne Moderwell;Hiram Moderwell / H. K. Moderwell,1921-09,1922-02,5 months,153.0,180 days,27.0
260,Victor Llona,1923-06,1923-10,4 months,122.0,152 days,30.0
...,...,...,...,...,...,...,...
35114,Capon,1941-11-24,1941-12-24,1 month,30.0,30 days,0.0
35115,Mme Domer,1941-11-24,1941-12-24,1 month,30.0,30 days,0.0
35116,Quesney,1941-12-04,1942-01-04,1 month,31.0,31 days,0.0
35118,Mlle Renauld,1941-12-08,1942-03-08,3 months,90.0,90 days,0.0


In [14]:
subs_duration['duration_diff'].value_counts()

duration_diff
0.0      9065
30.0       30
29.0       21
1.0        10
-1.0        9
28.0        4
365.0       2
27.0        1
2.0         1
-3.0        1
Name: count, dtype: int64

### Investigate discrepancies

In [15]:
# investigate the ones with any nonzero difference
subset_subdurations = subs_duration[subs_duration.duration_diff != 0]
subset_subdurations.head(10)

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days,undate_duration,duration_diff
28,Arthur Elliott Felkin,1927,1928,1 year,365.0,730 days,365.0
70,Geraldine Deknatel;William Deknatel,1931,1932,1 year,365.0,730 days,365.0
233,Mrs. G. S. Madam,1921-07,1921-08,1 month,31.0,61 days,30.0
234,Anne Moderwell;Hiram Moderwell / H. K. Moderwell,1921-09,1922-02,5 months,153.0,180 days,27.0
260,Victor Llona,1923-06,1923-10,4 months,122.0,152 days,30.0
261,Mrs. L. McNair,1923-08,1923-09,1 month,31.0,60 days,29.0
271,René Martin,1924-02,1924-03,1 month,29.0,59 days,30.0
272,Nigel Monro,1924-02,1924-04,2 months,60.0,89 days,29.0
293,Madeleine Lorsignol,1926-03,1926-10,7 months,214.0,244 days,30.0
313,M. Mathieu,1926-11,1926-12,1 month,30.0,60 days,30.0


In [16]:
# too many to look at all at once; can we segment by subscription duration?
subset_subdurations.subscription_duration.value_counts()

subscription_duration
1 month      38
3 months     12
2 months      7
6 months      6
4 months      5
5 months      3
1 year        2
7 months      2
8 months      2
11 months     1
10 months     1
Name: count, dtype: int64

In [17]:
# lots of one-month subscriptions, what do the discrepancies look like?
subset_subdurations[subset_subdurations.subscription_duration == '1 month'].head(15)

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days,undate_duration,duration_diff
233,Mrs. G. S. Madam,1921-07,1921-08,1 month,31.0,61 days,30.0
261,Mrs. L. McNair,1923-08,1923-09,1 month,31.0,60 days,29.0
271,René Martin,1924-02,1924-03,1 month,29.0,59 days,30.0
313,M. Mathieu,1926-11,1926-12,1 month,30.0,60 days,30.0
354,Emmanuel Leopold,1928-02,1928-03,1 month,29.0,59 days,30.0
356,Louis Lozowick,1928-02,1928-03,1 month,29.0,59 days,30.0
393,B. Malbert,1929-08,1929-09,1 month,31.0,60 days,29.0
394,M. McPherson,1929-08,1929-09,1 month,31.0,60 days,29.0
430,R. L. Lowey,1930-05,1930-06,1 month,31.0,60 days,29.0
444,Marguerite Gay Hutchinson,1930-11,1930-12,1 month,30.0,60 days,30.0


The first set of these is calculated differently because the dates are partial: Undate calculates from the earliest possible date through the latest possible date, but in these cases we have additional, project-specific information that Undate can't take into account, i.e. the subscription lasts one month starting sometime in a known year or month.
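
The effect of measuring from the earliest possible day to the latest possible day can be sketched with the standard library alone. The helpers below (`earliest_latest`, `widest_span_days`) are hypothetical and only approximate the idea; they are not Undate's implementation, and they handle only the `YYYY`, `YYYY-MM`, and `YYYY-MM-DD` formats.

```python
import calendar
from datetime import date


def earliest_latest(value):
    """Return the earliest and latest possible dates for a partial date string."""
    parts = [int(part) for part in value.split("-")]
    year = parts[0]
    if len(parts) == 1:
        # year only: anywhere from January 1 to December 31
        return date(year, 1, 1), date(year, 12, 31)
    month = parts[1]
    if len(parts) == 2:
        # year and month: anywhere from the first to the last day of the month
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    # fully known date: earliest and latest are the same day
    day = date(year, month, parts[2])
    return day, day


def widest_span_days(start, end):
    """Days from the earliest possible start to the latest possible end (end not counted)."""
    earliest, _ = earliest_latest(start)
    _, latest = earliest_latest(end)
    return (latest - earliest).days


print(widest_span_days("1921-07", "1921-08"))  # 61, versus 31 in the dataset
print(widest_span_days("1927", "1928"))        # 730, versus 365 in the dataset
```

Under these assumptions the sketch gives the same values as the `undate_duration` column for these rows, which shows that the gap comes from the width of the partial dates rather than from a calculation error.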

The handful towards the end that are off by one in either direction (+/-) are a little more concerning... (potential bug in S&co code? or value calculated based on known semantic duration?)

In [18]:
# durations other than one month
subset_subdurations[subset_subdurations.subscription_duration != '1 month'].head(15)

Unnamed: 0,member_names,start_date,end_date,subscription_duration,subscription_duration_days,undate_duration,duration_diff
28,Arthur Elliott Felkin,1927,1928,1 year,365.0,730 days,365.0
70,Geraldine Deknatel;William Deknatel,1931,1932,1 year,365.0,730 days,365.0
234,Anne Moderwell;Hiram Moderwell / H. K. Moderwell,1921-09,1922-02,5 months,153.0,180 days,27.0
260,Victor Llona,1923-06,1923-10,4 months,122.0,152 days,30.0
272,Nigel Monro,1924-02,1924-04,2 months,60.0,89 days,29.0
293,Madeleine Lorsignol,1926-03,1926-10,7 months,214.0,244 days,30.0
321,Thomas MacGreevy,1927-03,1928-02,11 months,337.0,365 days,28.0
331,Arthur Moss,1927-07,1927-10,3 months,92.0,122 days,30.0
337,Ruth Meyer,1927-10,1928-06,8 months,244.0,273 days,29.0
349,René Leroi,1928-01,1928-04,3 months,91.0,120 days,29.0


## Compare Borrow event durations

S&co data also includes borrowing events with known duration; the dataset includes the duration in numeric form (`borrow_duration_days`).

Select borrow events with available duration information to compare with Undate logic.

In [19]:
borrow_duration = events_df[events_df.borrow_duration_days.notna()]
# limit to fields we care about for this check
borrow_duration = borrow_duration[['member_names', 'start_date', 'end_date', 'borrow_duration_days']]
borrow_duration.head()

Unnamed: 0,member_names,start_date,end_date,borrow_duration_days
602,G. E. Pulsford,--01-07,--01-13,6.0
603,G. E. Pulsford,--01-12,--01-20,8.0
604,Robert D. Sage,--01-16,--02-16,31.0
605,Gertrude Stein,--01-19,--01-24,5.0
606,G. E. Pulsford,--01-20,--01-28,8.0


In [20]:
borrow_duration.tail()

Unnamed: 0,member_names,start_date,end_date,borrow_duration_days
29903,Henri Michaux,1961-06-30,1961-10-04,96.0
29904,Henri Michaux,1961-06-30,1961-10-04,96.0
29905,Henri Michaux,1961-06-30,1961-10-04,96.0
29907,Ann Samyn,1961-10-04,1962-03-21,168.0
29908,Ann Samyn,1961-10-04,1962-03-21,168.0


In [21]:
# add a new field for duration as calculated by undate
borrow_duration["undate_duration"] = borrow_duration.apply(lambda row: undate_duration(str(row.start_date), str(row.end_date)), axis=1)
borrow_duration.head(10)

Unnamed: 0,member_names,start_date,end_date,borrow_duration_days,undate_duration
602,G. E. Pulsford,--01-07,--01-13,6.0,6 days
603,G. E. Pulsford,--01-12,--01-20,8.0,8 days
604,Robert D. Sage,--01-16,--02-16,31.0,31 days
605,Gertrude Stein,--01-19,--01-24,5.0,5 days
606,G. E. Pulsford,--01-20,--01-28,8.0,8 days
607,Gertrude Stein,--01-24,--03-20,55.0,55 days
608,Gertrude Stein,--01-24,--03-20,55.0,55 days
609,Gertrude Stein,--01-24,--03-20,55.0,55 days
610,Gertrude Stein,--01-24,--05-30,126.0,126 days
611,Gertrude Stein,--01-24,--05-30,126.0,126 days


In [22]:
# what's the difference between the two?
borrow_duration['duration_diff'] = borrow_duration.apply(lambda row: row.undate_duration.astype("int") - row.borrow_duration_days, axis=1)
borrow_duration.head(10)

Unnamed: 0,member_names,start_date,end_date,borrow_duration_days,undate_duration,duration_diff
602,G. E. Pulsford,--01-07,--01-13,6.0,6 days,0.0
603,G. E. Pulsford,--01-12,--01-20,8.0,8 days,0.0
604,Robert D. Sage,--01-16,--02-16,31.0,31 days,0.0
605,Gertrude Stein,--01-19,--01-24,5.0,5 days,0.0
606,G. E. Pulsford,--01-20,--01-28,8.0,8 days,0.0
607,Gertrude Stein,--01-24,--03-20,55.0,55 days,0.0
608,Gertrude Stein,--01-24,--03-20,55.0,55 days,0.0
609,Gertrude Stein,--01-24,--03-20,55.0,55 days,0.0
610,Gertrude Stein,--01-24,--05-30,126.0,126 days,0.0
611,Gertrude Stein,--01-24,--05-30,126.0,126 days,0.0


In [23]:
# what do the duration differences look like?
borrow_duration.duration_diff.value_counts()

duration_diff
0.0    19728
Name: count, dtype: int64

Woohoo, everything matches! 🎉

* * * 

In a previous run, there were two borrow events where the calculation did not match. This was due to an error in the Undate duration method when the start and end dates have unknown years and the range wraps into the following year (e.g., December to January); that error has now been corrected.

**Note:** One of those events has a range (--06-07/--06-06) that looks like a data error in S&co, but the data matches what is [written on the lending card](https://shakespeareandco.princeton.edu/members/davet-yvonne/cards/cf96d38f-e651-491c-a575-131ea32ce425/#).
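
The year-wrapping case described above can be sketched as follows, again with the standard library only. The placeholder year and both helper functions are assumptions made for this illustration and do not reflect how Undate represents unknown years internally.

```python
from datetime import date

# Hypothetical stand-in year for dates with no known year (1901 is not a leap year)
PLACEHOLDER_YEAR = 1901


def parse_month_day(value):
    """Parse a --MM-DD string into a (month, day) tuple."""
    month, day = value[2:].split("-")
    return int(month), int(day)


def unknown_year_days(start, end):
    """Days between two --MM-DD dates, rolling the end into the next year if needed."""
    start_md = parse_month_day(start)
    end_md = parse_month_day(end)
    start_date = date(PLACEHOLDER_YEAR, *start_md)
    # if the end falls earlier in the calendar than the start,
    # assume the range wraps into the following year
    end_year = PLACEHOLDER_YEAR + 1 if end_md < start_md else PLACEHOLDER_YEAR
    return (date(end_year, *end_md) - start_date).days


print(unknown_year_days("--01-07", "--01-13"))  # 6
print(unknown_year_days("--12-28", "--01-04"))  # 7
```

Without the wrap check, the second example would produce a negative duration, which is the kind of mismatch a December to January range can cause.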

* * * 

In a preliminary implementation of the numpy datetime64 integration, the new earliest possible year turned out to be a leap year, which caused the counts for Gertrude Stein's borrows from January to March to be off by one. This was corrected by adjusting the minimum year by one to ensure it is not a leap year.
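
The leap-year effect is easy to reproduce with plain `datetime` arithmetic. In this sketch the years 1901 and 1904 are arbitrary examples of a non-leap and a leap year, chosen only for illustration; they are not the values Undate uses.

```python
import calendar
from datetime import date


def days_between(year, start_md, end_md):
    """Days between two (month, day) pairs placed in the same stand-in year."""
    return (date(year, *end_md) - date(year, *start_md)).days


# Gertrude Stein's borrow from --01-24 to --03-20 is 55 days in the dataset
for year in (1901, 1904):
    span = days_between(year, (1, 24), (3, 20))
    print(year, calendar.isleap(year), span)
```

Any range with an unknown year that crosses the end of February gains a day when the stand-in year is a leap year, so a non-leap stand-in keeps the result consistent with the dataset's 55 days.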



In [24]:
# Confirm that we have no mismatches
assert len(borrow_duration[borrow_duration.duration_diff != 0]) == 0