# Exploration: broken date tags

#### Date-tag resolution definitions
Each date tag is assigned a resolution according to which regex pattern it matches:
```
year : \d{4}-xx-xx
month : \d{4}-\d{2}-xx
day : \d{4}-\d{2}-\d{2}
broken : none of the above 
```
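The definitions above can be sketched as a small classifier. This is a minimal illustration, not necessarily the code that produced `df_date`: the function name `classify_resolution` is hypothetical, and requiring the whole tag to match (`fullmatch`) is an assumption.

```python
import re

# Patterns copied from the definitions above.
# Assumption: the entire tag must match, hence fullmatch().
RESOLUTION_PATTERNS = [
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("month", re.compile(r"\d{4}-\d{2}-xx")),
    ("year", re.compile(r"\d{4}-xx-xx")),
]


def classify_resolution(date_tag: str) -> str:
    """Return 'day', 'month', 'year', or 'broken' for a date tag."""
    for label, pattern in RESOLUTION_PATTERNS:
        if pattern.fullmatch(date_tag):
            return label
    # none of the above
    return "broken"
```

Under these assumptions, `1584-xx-xx` is classified as `year`, while a bare `1003` or a partially masked `x524-xx-xx` falls through to `broken`.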

In [1]:
import re
import ndjson
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# import primitives
path_primitives = '../data/primitives_220303/primitives.ndjson'
with open(path_primitives) as fin:
    primitives = ndjson.load(fin)

# import date tags
path_dates = '../data/primitives_220303/df_date.csv'
df_date = pd.read_csv(path_dates)

#### Overview of corpus by date-tag resolution

In [3]:
df_date.groupby(['resolution']).describe()

| resolution | call_nr count | call_nr unique | call_nr top | call_nr freq | date count | date unique | date top | date freq |
|---|---|---|---|---|---|---|---|---|
| broken | 557 | 40 | 1668_Gent_Bill_07 | 164 | 557 | 446 | xxxx-xx-xx | 25 |
| day | 15974 | 63 | 1574_Antw_Haec | 1357 | 15974 | 8951 | 1776-09-14 | 110 |
| month | 1450 | 61 | 1574_Antw_Haec | 256 | 1450 | 755 | 1566-04-xx | 16 |
| year | 6385 | 62 | 1666_Gent_Bill_01B | 899 | 6385 | 1238 | 1584-xx-xx | 184 |


#### Where are the broken tags?

In [4]:
(df_date
    .groupby(['resolution', 'call_nr'])
    .describe()
    .query('resolution == "broken"')
    .sort_values(by=[('date', 'count')], ascending=False)
)

| resolution | call_nr | date count | date unique | date top | date freq |
|---|---|---|---|---|---|
| broken | 1668_Gent_Bill_07 | 164 | 124 | 1003 | 4 |
| broken | 1602_Brus_Pott | 111 | 109 | jaer LXX | 2 |
| broken | 1575_Antw_Ulle | 54 | 46 | x524-xx-xx | 2 |
| broken | 1791_Purm_Louw_01 | 36 | 22 | xxxx-xx-xx | 14 |
| broken | 1574_Antw_Haec | 26 | 23 | xxxx-04-01 | 3 |
| broken | 1628_Alkm_Wijn | 21 | 15 | 145x-xx-xx | 4 |
| broken | 1719_Hoor_Spru | 17 | 4 | xxxx-xx-xx | 9 |
| broken | 1791_Purm_Louw_02 | 14 | 10 | 1766-03-x8 | 3 |
| broken | 1687_Rott_Waer | 11 | 9 | xxxx-03-02 | 2 |
| broken | 1671_Rott_Lois | 10 | 10 | 1347. | 1 |
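The most frequent broken values in the table above seem to fall into a few recurring shapes: bare years (`1003`, `1347.`), tags in the expected layout with masked digits (`x524-xx-xx`, `xxxx-04-01`), and free text (`jaer LXX`). A rough sketch for separating them follows; the function `broken_kind` and its three category names are hypothetical, introduced here only for illustration.

```python
import re


def broken_kind(date_tag: str) -> str:
    """Roughly categorize a tag that is already known to be broken.

    Intended only for tags with resolution == "broken"; a well-formed
    tag such as 1776-09-14 would also fit the 'partial' layout.
    """
    tag = date_tag.strip()
    # Up to four digits, optionally followed by a period, e.g. "1003", "1347."
    if re.fullmatch(r"\d{1,4}\.?", tag):
        return "bare_year"
    # Expected yyyy-mm-dd layout, but with 'x' placeholders mixed in
    if re.fullmatch(r"[\dx]{4}-[\dx]{2}-[\dx]{2}", tag):
        return "partial"
    # Everything else, e.g. "jaer LXX"
    return "free_text"
```

Assuming the `date` column holds strings, something like `df_date.query('resolution == "broken"')['date'].map(broken_kind).value_counts()` would give a first count per category.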


#### Broken chronicle 1: Gent police
Problem: non-standardized annotation. Years are tagged as bare numbers (e.g. `1003` instead of `1003-xx-xx`).

In [5]:
(df_date
    .query('call_nr == "1668_Gent_Bill_07"')
    .query('resolution == "broken"')
    .groupby('date')
    .size()
    .to_frame(name='count')
    .sort_values(by='count', ascending=False)
)

| date | count |
|---|---|
| 1003 | 4 |
| 1067 | 3 |
| 941 | 3 |
| 984 | 3 |
| 985 | 3 |
| ... | ... |
| 1120 | 1 |
| 1119 | 1 |
| 1118 | 1 |
| 1114 | 1 |
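One possible repair for this chronicle is to rewrite bare years into the year-resolution format. The sketch below is an illustration under assumptions: the helper `normalize_bare_year` is hypothetical, and zero-padding three-digit years (`941` to `0941`) so that they fit the four-digit year pattern is a choice that would need checking against the source.

```python
import re


def normalize_bare_year(date_tag: str) -> str:
    """Rewrite a bare 3- or 4-digit year as 'yyyy-xx-xx'; leave other tags untouched."""
    tag = date_tag.strip()
    if re.fullmatch(r"\d{3,4}", tag):
        # Assumption: three-digit years are zero-padded to four digits
        return f"{int(tag):04d}-xx-xx"
    return date_tag
```

Tags that are not bare years are returned unchanged, so the helper could be mapped over the whole `date` column of this chronicle without touching anything else.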


#### Broken chronicle 2: Brussels
Problem: dates are tagged, but the tag attributes are empty, leaving only raw text (e.g. `jaer LXX`).

In [6]:
(df_date
    .query('call_nr == "1602_Brus_Pott"')
    .query('resolution == "broken"')
    .groupby('date')
    .size()
    .to_frame(name='count')
    .sort_values(by='count', ascending=False)
)

| date | count |
|---|---|
| Assensiondach | 2 |
| jaer LXX | 2 |
| iersten januario op\t\t\t een jaerdach, ao LII | 1 |
| heylich sacramentdach | 1 |
| dit jaer van LXXXII | 1 |
| ... | ... |
| VIIIsten octobre | 1 |
| VIIIsten februario ao LXXI | 1 |
| Sint Andries-dach | 1 |
| Sinssendach ende den twedren dach | 1 |
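Several of these raw strings appear to contain a Roman-numeral year (e.g. `jaer LXX`, `ao LXXI`). As a first step toward recovering them, the sketch below extracts standalone Roman numerals and converts them to integers. Reading these tokens as years is an assumption, the helper names are hypothetical, and the century is deliberately left unresolved because it cannot be inferred from the tag alone.

```python
import re
from typing import List

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Standalone uppercase tokens only; "VIIIsten" is skipped because the
# numeral runs straight into lowercase letters (no word boundary).
ROMAN_TOKEN = re.compile(r"\b[IVXLCDM]+\b")


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer (no validation of well-formedness)."""
    total = 0
    for current, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[current]
        # Subtractive notation: a smaller value before a larger one is subtracted
        if ROMAN_VALUES.get(following, 0) > value:
            total -= value
        else:
            total += value
    return total


def roman_numbers(date_tag: str) -> List[int]:
    """Return the integer value of every standalone Roman numeral in the tag."""
    return [roman_to_int(token) for token in ROMAN_TOKEN.findall(date_tag)]
```

Tags without a standalone numeral, such as feast-day names, yield an empty list and would still need a separate treatment.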
