### How to get the dataset ready
This notebook documents how the scraped Cantus Index data were modified (cleaned) before being exported as a dataset.  
The procedure is shown on data from May 2025, but the paths to the genre files and to the other CSV files can be changed, so the code can help anyone build and lightly clean a dataset from other data with the same structure (for example, data produced by the scraping scripts we provide).  
  
We provide the dataset as two CSV files:  
  - chants
  - sources
  
Main processing steps:
- join all genres into one file
- discard duplicate chantlinks
- standardize genre based on the CI genre list each record came from (issues only around Tp...)
- discard data (from chants and sources) where we cannot collect additional information about the source
- add a numerical century to sources
- inspect duplicate sources, then discard or unify them
  
Finally, we add some basic statistics about the newly constructed dataset; these can again be run on different data by changing the paths.

In [None]:
import pandas as pd
import glob
import os

In [None]:
CHANTS_DIR_PATH = 'data/chants' # Rename to fit your directory structure
CHANTS_CSV_PATH = 'data/chants/all_chants.csv' # Rename to fit your directory structure
SOURCES_CSV_PATH = 'data/sources.csv' # Rename to fit your directory structure

In [3]:
# Read static files
feast = pd.read_csv('static/feast.csv', dtype={'feast_code': str})
genre = pd.read_csv('static/genre.csv')
office = pd.read_csv('static/office.csv')

In [4]:
sources = pd.read_csv(SOURCES_CSV_PATH)

In [5]:
# Prepare May 2025 data
concat_chants_files = glob.glob(CHANTS_DIR_PATH + '/*.csv')
chants_dfs_dict = {os.path.splitext(os.path.basename(file))[0] : pd.read_csv(file,  dtype=str) for file in concat_chants_files}

chants = pd.concat(chants_dfs_dict, ignore_index=True)

non_empty_genre = 0
for file, df in chants_dfs_dict.items():
    if len(df) > 0:
        non_empty_genre += 1
    df['genre_file'] = file
print('number of genre files may25:', non_empty_genre)

chants_genre_file = pd.concat(chants_dfs_dict, ignore_index=True)

print('number of records in may25:', len(chants))

number of genre files may25: 106
number of records in may25: 1005793
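
As an aside, the source file of each record can also be attached in a single `pd.concat` call. The following is a minimal sketch on toy in-memory frames (the frames, chantlinks and genre values are made up for illustration); it is an alternative to the loop above, not the procedure used for the dataset.

```python
import pandas as pd

# Toy stand-ins for the per-genre CSV files (hypothetical data)
frames = {
    'A': pd.DataFrame({'chantlink': ['link/1', 'link/2'], 'genre': ['A', 'A10']}),
    'R': pd.DataFrame({'chantlink': ['link/3'], 'genre': ['R']}),
}

# The dict keys become the outer index level; we name that level
# and then move it into an ordinary column
combined = (
    pd.concat(frames, names=['genre_file', 'row'])
    .reset_index(level='genre_file')
    .reset_index(drop=True)
)
print(combined)
```

With this variant the `genre_file` column is filled while the frames are being joined, so no second concatenation is needed.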


In [6]:
print("number of chantlinks that occur more than once in may25:", len(chants["chantlink"].value_counts()[lambda x: x > 1].index))
print("number of fully duplicated rows in may25:", len(chants) - len(chants.drop_duplicates()))
print("number of non-duplicate records with a duplicate chantlink:", len(chants.drop_duplicates()) - len(chants['chantlink'].drop_duplicates()))
duplicate_chantlinks = chants.drop_duplicates()["chantlink"].value_counts()[lambda x: x > 1].index
chants[chants["chantlink"].isin(duplicate_chantlinks)].sort_values(by="chantlink").to_csv('duplicated_chantlinks.csv')

number of chantlinks that occur more than once in may25: 117100
number of fully duplicated rows in may25: 117088
number of non-duplicate records with a duplicate chantlink: 12


The following step is specific to the May 2025 data (CantusCorpus 1.0)...

In [7]:
# Chants without duplicates
# it turns out those 12 were records of AH49403 vs ah49403, so we keep the lowercase version as the standard one...

# Step 1: Drop fully duplicated rows
df = chants.drop_duplicates()

# Step 2: Find `chantlink` values that are still duplicated
dup_chantlinks = df['chantlink'].value_counts()[lambda x: x > 1].index

# Step 3: Keep only rows with duplicated chantlink **and** cantus_id starting with lowercase letter
mask = (
    df['chantlink'].isin(dup_chantlinks) & 
    df['cantus_id'].str.match(r'^[a-z]')
)

# Step 4: Keep rows that are either:
# - Not part of duplicated chantlinks
# - Or part of duplicated chantlinks AND their cantus_id starts with lowercase
chants = df[~df['chantlink'].isin(dup_chantlinks) | mask]
print("number of chants records without duplicates:", len(chants))

number of chants records without duplicates: 888693
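
To make the rule above easier to follow, here is a minimal sketch on toy data (the chantlinks and the second and third Cantus IDs are made up): fully duplicated rows are dropped first, and among rows that still share a chantlink only the one whose `cantus_id` starts with a lowercase letter is kept.

```python
import pandas as pd

toy = pd.DataFrame({
    'chantlink': ['link/1', 'link/1', 'link/2', 'link/2', 'link/3'],
    'cantus_id': ['AH49403', 'ah49403', '001234', '001234', '005678'],
})

# Step 1: drop rows that are identical in every column
deduplicated = toy.drop_duplicates()

# Step 2: chantlinks that still occur more than once
repeated = deduplicated['chantlink'].value_counts()[lambda x: x > 1].index
is_repeated = deduplicated['chantlink'].isin(repeated)

# Step 3: keep rows with a unique chantlink, or the lowercase variant of a repeated one
starts_lowercase = deduplicated['cantus_id'].str.match(r'^[a-z]')
result = deduplicated[~is_repeated | starts_lowercase]
print(result)
```

Note that a repeated chantlink whose variants all start with an uppercase letter or a digit would lose every row under this rule, so it is worth checking the record count afterwards, as the cell above does.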


#### Genre
Getting an overview of how varied the genre field is.  
If it is too messy, we can try to standardize it with the help of the "from this CI genre list" value we have for each chant record.

In [8]:
genres_in_data = set(chants['genre'])
print('Genres present in data and not in CI genre list:')
print(genres_in_data.difference(set(genre['genre_name'])))

Genres present in data and not in CI genre list:
{'Varia', 'Am5', 'R5', 'A10', 'Tr3', 'R14', 'R4', 'V124', 'Vs', 'Am6', 'a1', 'R3+', 'Ant/Resp', 'Am4', 'V31', 'A8', 'La', 'TrV', 'Im', 'Gr3V', 'V9', 'LDM', 'All2', 'IntrV', 'Ap+', 'Ant', 'Aproc', 'V5', 'All-1', 'V11', 'A13', 'All6', 'R1', 'Am7', 'V152', 'M', 'A7', 'R9', 'Tr', 'Gr2', 'Resp', 'AntV', '\xa0?', 'R+', 'V10', 'Off', 'Am9', 'V8', 'V14', 'RespV', 'Tr2', 'V6', 'Am', 'Hymn', 'V122', 'V32', 'Gr5', 'Tr4', 'Seq', 'An', 'Varia/A', 'V7', 'Ab+', 'Gr2V', 'All-2', 'Am+', '[a3]', 'V21', 'Am1', 'AllV', 'A12', 'All1', 'Ant1', 'Gr1V', 'All+', 'Gr4V', 'R15', 'Am3', 'Gr+', 'V3', 'Tr1', 'A9', 'V1', 'Dox', 'All5', 'Be', 'V+', 'r', nan, 'CommV', 'a', 'V126', 'All1V', 'OffV', 'V151', 'All', 'a+', 'V2', 'V12', 'a3+', 'R10', 'Gr-V', 'All-V', 'A14', 'R8', 'a2+', 'R11', 'V123', 'V3+', 'An+', 'All3', 'Intr', 'a2', 'Comm+', 'Ab', 'Gr3', 'Comm', 'Off+', 'Ant3', 'A6', 'Ap', 'a5+', 'V33', 'Gr4', 'R13', 'V4', 'Ant4', 'R7', 'R6', 'R12', 'Gr1', 'R2', 'GRCV', '

Because these all seem to be existing genres with some numbers added, which probably indicate position, and because we consider **genre** to be important information, it makes sense to standardize it to CI values using our knowledge of which CI genre list each record is displayed in...

In [9]:
# Chantlinks are now unique, so we can use them to match unique records with the genre files
# just make sure we keep V and not [GV], which has disappeared from CI
print('all scraped chants:', len(chants_genre_file))
# Known problem: drop [GV] records, which are no longer displayed in CI
chants_genre_file = chants_genre_file[chants_genre_file['genre_file'] != '[GV]'] 
print('all scraped chants without records from the [GV] file:', len(chants_genre_file))

# Identify ambiguous chantlinks (appearing more than once)
ambiguous_chantlinks = chants_genre_file[chants_genre_file['chantlink'].duplicated(keep=False)]
ambiguous_rows = chants_genre_file[chants_genre_file['chantlink'].isin(ambiguous_chantlinks['chantlink'])]

# Check whether we can keep their genre value as it was scraped:
print('genre values :', set(ambiguous_rows['genre']).difference(set(genre['genre_name'])))
# if we get set() we can: they all come from the known vocabulary
if len(set(ambiguous_rows['genre']).difference(set(genre['genre_name']))) == 0:
    chants = chants.copy()
    for chantlink, group in ambiguous_chantlinks.groupby('chantlink'):
        genre_files = group['genre_file'].dropna().unique().tolist()
        chant_genre = group['genre'].tolist()[0] # known to be always the same within a group
        if chant_genre in genre_files:
            chants.loc[chants['chantlink'] == chantlink, 'genre'] = chant_genre
        else:
            if chant_genre == '[M]': # [M] is to be reassigned and TpBD is the more specific option
                chants.loc[chants['chantlink'] == chantlink, 'genre'] = 'TpBD' # not Tp
            # genre: Sq, files: Psl and Tp
            # genre: V, files: Psl and Tp
            # genre: BD, files: TpBD and Tp
            # We do not want to make any editorial decision (working downstream...), so we keep the genre value of those records
            else:
                chants.loc[chants['chantlink'] == chantlink, 'genre'] = chant_genre


print('number of empty genre before final mapping:' ,len(chants) - len(chants['genre'].dropna())) # check

# Filter rows where each chantlink appears exactly once
unique_genre_map = (
    chants_genre_file[~chants_genre_file['chantlink'].duplicated(keep=False)]
    .set_index('chantlink')['genre_file']
)

# Fill missing genre values from the unambiguous genre_file mapping (by chantlink); existing genre values are kept
chants = chants.copy()
chants['genre'] = (chants['genre'].combine_first(chants['chantlink'].map(unique_genre_map)))
chants.to_csv(CHANTS_CSV_PATH, index=False)

print('number of not solved records:' ,len(chants) - len(chants['genre'].dropna())) # check

all scraped chants: 1005793
all scraped chants without records from the [GV] file: 888788
genre values : set()
number of empty genre before final mapping: 1680
number of not solved records: 0
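
The final mapping step relies on `Series.combine_first`, which may be easier to see on a tiny example. This is a minimal sketch on made-up data: values already present in `genre` are kept, and only missing ones are filled from the chantlink-to-genre-file mapping.

```python
import pandas as pd

chants_toy = pd.DataFrame({
    'chantlink': ['link/1', 'link/2', 'link/3'],
    'genre': ['A', None, 'R5'],
})
# Hypothetical mapping: chantlink -> genre file the record was scraped from
genre_file_by_chantlink = pd.Series({'link/1': 'A', 'link/2': 'V', 'link/3': 'R'})

# combine_first keeps the caller's non-missing values and fills the gaps
filled = chants_toy['genre'].combine_first(
    chants_toy['chantlink'].map(genre_file_by_chantlink)
)
print(filled.tolist())
```

In this toy case the missing genre of `link/2` is filled with `V`, while the scraped value `R5` of `link/3` is left untouched.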


#### Office
Simply to be aware of how non-standardized this field is... there is not much to do about it besides passing the information on.  

In [10]:
offices_in_data = set(chants['office'])
print('Offices present in data and not in CDB office list:')
numeric, alpha = [], []
for o in offices_in_data.difference(set(office['name'])):
    if str(o).isdigit():
        numeric.append(o)
    else:
        alpha.append(o)
print('\tnumeric:', numeric)
print('\tother:', alpha)

Offices present in data and not in CDB office list:
	numeric: ['1003', '1002', '969', '980', '977', '976', '963', '970', '975', '1004', '979', '972', '964', '965', '978', '974', '968', '971', '966', '967']
	other: ['AL', 'DU&D', 'S&O', 'Pec', 'Q&Q', 'C2', 'MH', nan, 'P&S', 'Noc', 'MASS', 'MN']


In [11]:
# Look how frequent office values are
chants['office'].value_counts()

office
M       311232
MASS    140836
L        95628
V        53038
V2       52106
E        19116
X        16570
MI       16512
T        12421
N        11530
S        11136
MH       10359
P         6652
C         4518
H         4110
R         2015
D         1442
964       1150
974        886
963        871
975        818
967        813
966        459
965        350
Noc        261
979        243
972        140
C2         115
S&O        106
?           89
MN          73
976         60
971         59
970         57
969         56
DU&D        44
968         44
P&S         41
Q&Q         24
AL          24
980         19
Pec         11
1004         7
1003         7
1002         7
978          5
CA           4
977          1
Name: count, dtype: int64

Numeric values come from the Hungarian database.  
It is hard to say whether MI (from CDB) and MASS (from SEMM) really always mean the same thing...  
Overall, we stick to the policy of "being downstream", so we leave the data as they are.

#### Melody overview

In [12]:
print('Number of melody_ids records:', len(chants['melody_id'].dropna()))
print('Number of melody_id values in data:', len(set(chants['melody_id'].dropna())))

Number of melody_ids records: 0
Number of melody_id values in data: 0


In [13]:
chants['mode'].value_counts().head(40)

mode
*        121318
8         89894
1         79870
7         59778
4         44864
2         43350
3         30378
5         23932
r         22351
?         20456
6         18895
6T         4460
4T         3900
2T         3110
1S         3047
1T         2244
8S         1427
2S         1210
3S         1186
6S         1102
5S          986
7S          936
4S          805
G           738
8*          673
1*          666
7*          458
8?          456
1?          424
5T          423
D           416
Gd          396
7T          385
-           355
4*          354
?S          328
Gc          324
2 Trp       315
S           297
E           294
Name: count, dtype: int64

#### Feasts
Since no clear standard exists for the feast field, we can provide only simple statistics.

In [14]:
print('number of feasts recognized in CI list:', len(feast))
print('number of feast values in data:', len(set(chants['feast'])))

number of feasts recognized in CI list: 1794
number of feast values in data: 2401


## Sources
Just a quick look at the scraped sources.  
There is a problem with http -> https: the redirect works correctly, so the scraper did not notice the change.  
Should geocoding data be used to unify locations where we know about them?

In [15]:
# HTTP -> HTTPS
# all databases moved to https except musmed (http in data is a mistake in sources scraping)
sources['srclink'] = sources['srclink'].apply(
    lambda x: x if not isinstance(x, str) else (
        x if x.startswith('http://musmed') else x.replace('http://', 'https://')
    )
)
# Clean spaces
sources['siglum'] = sources['siglum'].str.strip()
sources['title'] = sources['title'].str.strip()
sources['provenance'] = sources['provenance'].str.strip()
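
The http -> https rule can also be written as a small named function, which makes the exception explicit. This is only a sketch: `normalize_srclink` and its `keep_http_prefix` parameter are hypothetical names, the URLs are placeholders, and unlike the `replace` call above it rewrites only the scheme at the start of the URL.

```python
import pandas as pd

def normalize_srclink(url, keep_http_prefix='http://musmed'):
    """Switch the scheme to https unless the URL starts with the excepted prefix."""
    if not isinstance(url, str):  # missing values pass through unchanged
        return url
    if url.startswith(keep_http_prefix):
        return url
    if url.startswith('http://'):
        return 'https://' + url[len('http://'):]
    return url

# Placeholder URLs, not real source pages
links = pd.Series([
    'http://example.org/source/1',
    'https://example.org/source/2',
    'http://musmed.example/source/3',
    None,
])
normalized = links.apply(normalize_srclink)
print(normalized.tolist())
```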

In [16]:
# How many sources mentioned in the data lack scraped source information?
sources_in_data = set(chants['srclink'])
scraped_sources = set(sources['srclink'])
print('Sources being scraped and not present in data:', scraped_sources.difference(sources_in_data))
print()
print('Sources being in data and not in scraped sources info:')
print(sources_in_data.difference(scraped_sources))
print(len(sources_in_data.difference(scraped_sources)))

Sources being scraped and not present in data: set()

Sources being in data and not in scraped sources info:
{'https://cantusbohemiae.cz/source/22098', 'https://cantusbohemiae.cz/source/9188', 'https://musicahispanica.eu/source/25470', 'https://cantusbohemiae.cz/source/11619', 'https://musicahispanica.eu/source/25465', 'https://cantusbohemiae.cz/source/21983', 'https://musicahispanica.eu/source/25463', 'https://cantusbohemiae.cz/source/22705', 'https://musicahispanica.eu/source/25468', 'https://cantusbohemiae.cz/source/9198', 'https://cantusbohemiae.cz/source/2147', 'https://cantusbohemiae.cz/source/22179', 'https://musicahispanica.eu/source/25466', 'https://musicahispanica.eu/source/25464', 'https://cantusbohemiae.cz/source/2153', 'https://cantusbohemiae.cz/source/9185', 'https://cantusbohemiae.cz/source/10804', 'https://cantusbohemiae.cz/source/22046', 'https://musicahispanica.eu/source/25461', 'https://musicahispanica.eu/source/25469', 'https://musicahispanica.eu/source/25467', 'htt

A piece of code very specific to this data version follows:

In [17]:
# Inspect those 30 troublemakers 
hispanica_once = []
fontes_once = []
others = []
for trouble_source_URL in sources_in_data.difference(scraped_sources):
    if 'hispanica' in trouble_source_URL:
        hispanica_once.append(trouble_source_URL)
    elif 'cantusbohemiae' in trouble_source_URL:
        fontes_once.append(trouble_source_URL)
    else:
        others.append(trouble_source_URL)

print('hispanica:', len(hispanica_once))
for url in hispanica_once:
    print(url)
print('FCB:', len(fontes_once))
for url in fontes_once:
    print(url)
print('others:', len(others))
for url in others:
    print(url)

hispanica: 12
https://musicahispanica.eu/source/25470
https://musicahispanica.eu/source/25465
https://musicahispanica.eu/source/25463
https://musicahispanica.eu/source/25468
https://musicahispanica.eu/source/25466
https://musicahispanica.eu/source/25464
https://musicahispanica.eu/source/25461
https://musicahispanica.eu/source/25469
https://musicahispanica.eu/source/25467
https://musicahispanica.eu/source/25319
https://musicahispanica.eu/source/25462
https://musicahispanica.eu/source/25460
FCB: 18
https://cantusbohemiae.cz/source/22098
https://cantusbohemiae.cz/source/9188
https://cantusbohemiae.cz/source/11619
https://cantusbohemiae.cz/source/21983
https://cantusbohemiae.cz/source/22705
https://cantusbohemiae.cz/source/9198
https://cantusbohemiae.cz/source/2147
https://cantusbohemiae.cz/source/22179
https://cantusbohemiae.cz/source/2153
https://cantusbohemiae.cz/source/9185
https://cantusbohemiae.cz/source/10804
https://cantusbohemiae.cz/source/22046
https://cantusbohemiae.cz/source/91

Those hispanica sources are all fragments of one manuscript and all are missing the Shelfmark (-> siglum), but we can get that value directly from their chant records - we will add them semi-manually before the dataset release.  
  
Those FCB source pages return 'Access denied'...  
Since we could not find out why they are hidden, we decided to discard their chant records in case these sources were hidden due to quality problems etc.

In [None]:
for srclink in hispanica_once:
    # The siglum is taken from the chant records of the source and reused as its title
    siglum = chants.loc[chants['srclink'] == srclink, 'siglum'].unique()[0]
    # Note: .loc with this mask only updates rows that already exist in `sources`
    sources.loc[sources['srclink'] == srclink, ['title', 'siglum']] = siglum
    # Rest of the information should be added manually!

In [19]:
print('number of chants records before discarding problematic FCB sources:', len(chants))
# Discard chant records of the "FCB hidden sources"
chants = chants[~chants['srclink'].isin(fontes_once)]
print('number of chants records after discarding problematic FCB sources:', len(chants))

number of chants records before discarding problematic FCB sources: 888693
number of chants records after discarding problematic FCB sources: 888110


##### Duplicates in sources...?
We want to see how unique the siglum value is.

In [20]:
# Look for duplicates in sigla
print(sources['siglum'].value_counts()[lambda x : x > 1])

siglum
PL-PŁsem MsEPl 12                                 2
CZ-OLu M III 6                                    2
SK-KRE 1625                                       2
CZ-Pn XII A 24                                    2
A-KN CCl 1018                                     2
CZ-Pn XV A 10                                     2
CZ-Pu VI G 3a                                     2
P-LA Caixa 2, Fragmento 017                       2
CZ-Pu XIV G 46                                    2
SK-KRE Tom. 1, Fons 32, Fasc. 9, Nro. 83, 1583    2
SK-KRE Tom. 2, Fons 41, Fasc. 1, Nro. 3, 1601     2
Name: count, dtype: int64


In [21]:
# Let's inspect them
for siglum in sources['siglum'].value_counts()[lambda x : x > 1].index:
    print(sources[sources['siglum'] == siglum][['title', 'siglum', 'srclink']])
    srclinks = list(sources[sources['siglum'] == siglum]['srclink'])
    srclink1 = srclinks[0]
    print('number of chants in', srclink1, ':', len(chants[chants['srclink'] == srclink1]))
    srclink2 = srclinks[1]
    print('number of chants in', srclink2, ':', len(chants[chants['srclink'] == srclink2]))
    print('---------------------')

                                                 title             siglum  \
136  PL-PŁsem MsEPl 12 Antyfonarz z Płocka, pars de...  PL-PŁsem MsEPl 12   
162  PL-PŁsem MsEPl 12 Antyfonarz z Płocka, pars de...  PL-PŁsem MsEPl 12   

                                  srclink  
136  https://cantusplanus.pl/source/14457  
162  https://cantusplanus.pl/source/14458  
number of chants in https://cantusplanus.pl/source/14457 : 2019
number of chants in https://cantusplanus.pl/source/14458 : 1825
---------------------
                                       title          siglum  \
696                 CZ-OLu (Olomouc) M III 6  CZ-OLu M III 6   
2122  Missale Olomucense scriptoris Stephani  CZ-OLu M III 6   

                                     srclink  
696   https://cantusbohemiae.cz/source/11616  
2122      https://hymnologica.cz/source/6983  
number of chants in https://cantusbohemiae.cz/source/11616 : 467
number of chants in https://hymnologica.cz/source/6983 : 156
---------------------
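
The same inspection can be condensed into one overview table. A minimal sketch on toy data follows (sigla, URLs and counts are made up): `duplicated(keep=False)` selects every source that shares its siglum with another one, and the chant counts are attached by mapping `srclink` onto `value_counts()`.

```python
import pandas as pd

# Toy data with placeholder sigla and URLs
sources_toy = pd.DataFrame({
    'siglum': ['X-Aa 1', 'X-Aa 1', 'X-Bb 2'],
    'srclink': [
        'https://example.org/source/1',
        'https://example.net/source/9',
        'https://example.org/source/2',
    ],
})
chants_toy = pd.DataFrame({
    'srclink': ['https://example.org/source/1'] * 3 + ['https://example.org/source/2'] * 2,
})

# All sources whose siglum occurs more than once
shared = sources_toy[sources_toy['siglum'].duplicated(keep=False)]

# Number of chant records per source; sources without chants get 0
chant_counts = chants_toy['srclink'].value_counts()
overview = shared.assign(
    n_chants=shared['srclink'].map(chant_counts).fillna(0).astype(int)
)
print(overview)
```

Unlike the loop above, this also works when a siglum is shared by more than two sources.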
    

A piece of code very specific to this data version follows:

In [22]:
# And then handle problematic cases manually
# A-KN CCl 1018 - TWO different books having the same siglum on their URL pages - in CDB and AustriaManus
#                 probably the AM one is a piece of parchment inserted inside the book referred to by CDB
# SK-KRE Tom. 2, Fons 41, Fasc. 1, Nro. 3, 1601 - two parts of book with separate URL entries
# SK-KRE Tom. 1, Fons 32, Fasc. 9, Nro. 83, 1583 - two parts of book with separate URL entries
# PL-PŁsem MsEPl 12 - two parts of book with separate URL entries

# And these need to be inspected for overlapping chant records:
# P-LA Caixa 2, Fragmento 017 - https://musicahispanica.eu/source/62316 and https://pemdatabase.eu/source/46528
caixa2PEM = chants[chants['srclink'] == "https://pemdatabase.eu/source/46528"]
caixa2SEMM = chants[chants['srclink'] == "https://musicahispanica.eu/source/62316"]
print('P-LA Caixa 2, Fragmento 017')
print('number of chants in manuscripts:', len(caixa2PEM), len(caixa2SEMM))
print('size of interesection:', len(set(caixa2SEMM['cantus_id']).intersection(set(caixa2PEM['cantus_id']))))
print()
# Both records complete -> should we discard one - in sources as well as in chants...

# CZ-Pu VI G 3a - https://hymnologica.cz/source/5364 and https://cantusbohemiae.cz/source/9147
viFCB = chants[chants['srclink'] == "https://cantusbohemiae.cz/source/9147"]
viHYM = chants[chants['srclink'] == "https://hymnologica.cz/source/5364"]
# Compare folio values (set() over a two-column DataFrame would only yield its column names)
print('VI G 3a, folios in HYM & not in FCB:', set(viHYM['folio']).difference(set(viFCB['folio'])))
print('VI G 3a, folios in HYM & in FCB:', set(viHYM['folio']).intersection(set(viFCB['folio'])))
vi_dupl_folios_cids = set(zip(viHYM['folio'], viHYM['cantus_id'])).intersection(set(zip(viFCB['folio'], viFCB['cantus_id'])))
print()
# CZ-Pn XII A 24 https://hymnologica.cz/source/10619  and https://cantusbohemiae.cz/source/33177
xiiFCB = chants[chants['srclink'] == "https://cantusbohemiae.cz/source/33177"]
xiiHYM = chants[chants['srclink'] == "https://hymnologica.cz/source/10619"]
print('XII A 24, folios in HYM & not in FCB:', set(xiiHYM['folio']).difference(set(xiiFCB['folio'])))
print('XII A 24, folios in HYM & in FCB:', set(xiiHYM['folio']).intersection(set(xiiFCB['folio'])))
xii_dupl_folios_cids = set(zip(xiiHYM['folio'], xiiHYM['cantus_id'])).intersection(set(zip(xiiFCB['folio'], xiiFCB['cantus_id'])))
print()
# CZ-Pn XV A 10 - https://hymnologica.cz/source/47  and https://cantusbohemiae.cz/source/28509
xvFCB = chants[chants['srclink'] == "https://cantusbohemiae.cz/source/28509"]
xvHYM = chants[chants['srclink'] == "https://hymnologica.cz/source/47"]
print('XV A 10, folios in HYM & not in FCB:', set(xvHYM['folio']).difference(set(xvFCB['folio'])))
print('XV A 10, folios in HYM & in FCB:', set(xvHYM['folio']).intersection(set(xvFCB['folio'])))
xv_dupl_folios_cids = set(zip(xvHYM['folio'], xvHYM['cantus_id'])).intersection(set(zip(xvFCB['folio'], xvFCB['cantus_id'])))
print()
# CZ-Pu XIV G 46 - https://hymnologica.cz/source/5366 and https://cantusbohemiae.cz/source/9194
xivFCB = chants[chants['srclink'] == "https://cantusbohemiae.cz/source/9194"]
xivHYM = chants[chants['srclink'] == "https://hymnologica.cz/source/5366"]
print('XIV G 46, folios in HYM & not in FCB:', set(xivHYM['folio']).difference(set(xivFCB['folio'])))
print('XIV G 46, folios in HYM & in FCB:', set(xivHYM['folio']).intersection(set(xivFCB['folio'])))
xiv_dupl_folios_cids = set(zip(xivHYM['folio'], xivHYM['cantus_id'])).intersection(set(zip(xivFCB['folio'], xivFCB['cantus_id'])))
print()
# CZ-OLu M III 6 - https://hymnologica.cz/source/6983 and https://cantusbohemiae.cz/source/11616
iiiFCB = chants[chants['srclink'] == "https://cantusbohemiae.cz/source/11616"]
iiiHYM = chants[chants['srclink'] == "https://hymnologica.cz/source/6983"]
iii_dupl_folios_cids = set(zip(iiiHYM['folio'], iiiHYM['cantus_id'])).intersection(set(zip(iiiFCB['folio'], iiiFCB['cantus_id'])))
print('M III 6: FCB:', len(iiiFCB), 'HYM:', len(iiiHYM))
print('number of folios in HYM that are not in FCB:', len(set(iiiHYM['folio']).difference(set(iiiFCB['folio']))))
print('Genres in HYM chants:', iiiHYM['genre'].value_counts())

P-LA Caixa 2, Fragmento 017
number of chants in manuscripts: 8 8
size of interesection: 8

VI G 3a, folios in HYM & not in FCB: set()
VI G 3a, folios in HYM & in FCB: {'096v', '108v', '104r', '112r', '105v', '108r', '102v', '099v', '110r', '102r', '103r', '062v', '114v', '113v', '106v', '109r', '112v', '101v', '103v', '097r', '105r', '111r', '110v', '099r', '104v', '056r', '107v', '097v', '109v', '101r', '114r', '106r'}

XII A 24, folios in HYM & not in FCB: set()
XII A 24, folios in HYM & in FCB: {'029v', '002r', '029r', '028v', '028r', '001v'}

XV A 10, folios in HYM & not in FCB: set()
XV A 10, folios in HYM & in FCB: {'007r', '040v', '007v'}

XIV G 46, folios in HYM & not in FCB: {'074v', '080r'}
XIV G 46, folios in HYM & in FCB: {'116v'}

M III 6: FCB: 467 HYM: 156
number of folios in HYM that are not in FCB: 112
Genres in HYM chants: genre
Sq    156
Name: count, dtype: int64


##### Troublemakers
P-LA Caixa 2, Fragmento 017  
-> let's discard the PEM record since SEMM has full_text

In [23]:
# Discard duplicate PEM source
chants = chants[chants['srclink'] != "https://pemdatabase.eu/source/46528"]
sources = sources[sources['srclink'] != "https://pemdatabase.eu/source/46528"]

FCB 'vs' HYM  
-> let's keep the non-duplicate chant records from both - just change the srclink from the HYM one to the FCB one (but keep the HYM chantlink and db)  
-> discard the HYM records in sources

In [None]:
hymnologica_links = [
    "https://hymnologica.cz/source/5364",
    "https://hymnologica.cz/source/10619",
    "https://hymnologica.cz/source/47",
    "https://hymnologica.cz/source/5366",
    "https://hymnologica.cz/source/6983"
]
# Discard HYM chant records where an FCB equivalent exists
# we try to detect this based on folio and cantus_id
# (a set keeps the membership test in the row-wise check below fast)
duplicate_pairs = set().union(xii_dupl_folios_cids, xiv_dupl_folios_cids, xv_dupl_folios_cids, iii_dupl_folios_cids, vi_dupl_folios_cids)
mask = chants.apply(
    lambda row: ((row['folio'], row['cantus_id']) in duplicate_pairs) and (row['srclink'] in hymnologica_links),
    axis=1
)

print('number of chant records before discarding HYM duplicates:', len(chants))
# Filter out the rows where mask is True
chants = chants[~mask].reset_index(drop=True)
print('number of chant records after discarding HYM duplicates:', len(chants))

number of chant records before discarding HYM duplicates: 888102
number of chant records after discarding HYM duplicates: 888010


In [25]:
# Change HYM srclinks to FCB ones in chants
chants.loc[chants['srclink'] == "https://hymnologica.cz/source/5364", 'srclink'] = "https://cantusbohemiae.cz/source/9147"
chants.loc[chants['srclink'] == "https://hymnologica.cz/source/10619", 'srclink'] = "https://cantusbohemiae.cz/source/33177"
chants.loc[chants['srclink'] == "https://hymnologica.cz/source/47", 'srclink'] = "https://cantusbohemiae.cz/source/28509"
chants.loc[chants['srclink'] == "https://hymnologica.cz/source/5366", 'srclink'] = "https://cantusbohemiae.cz/source/9194"
chants.loc[chants['srclink'] == "https://hymnologica.cz/source/6983", 'srclink'] = "https://cantusbohemiae.cz/source/11616"

# Discard HYM sources in sources
sources = sources[~sources['srclink'].isin(hymnologica_links)]
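
The "discard duplicates, then relink" idea can be sketched with a single mapping from HYM to FCB source links. The example below uses toy data with placeholder URLs, and `hym_to_fcb` is a hypothetical name. Note that this variant compares (folio, cantus_id) pairs per manuscript, whereas the cells above pool the pairs of all five manuscripts.

```python
import pandas as pd

# Hypothetical mapping: HYM source link -> FCB source link of the same manuscript
hym_to_fcb = {'https://example.org/hym/1': 'https://example.org/fcb/1'}

chants_toy = pd.DataFrame({
    'srclink': ['https://example.org/fcb/1', 'https://example.org/hym/1', 'https://example.org/hym/1'],
    'folio': ['001r', '001r', '002v'],
    'cantus_id': ['001234', '001234', '005678'],
})

# Source each record will belong to after relinking
target = chants_toy['srclink'].replace(hym_to_fcb)
is_hym = chants_toy['srclink'].isin(list(hym_to_fcb))

# (source, folio, cantus_id) keys of records that are already in FCB
fcb_keys = set(zip(target[~is_hym],
                   chants_toy.loc[~is_hym, 'folio'],
                   chants_toy.loc[~is_hym, 'cantus_id']))
row_keys = zip(target, chants_toy['folio'], chants_toy['cantus_id'])
has_fcb_twin = pd.Series([key in fcb_keys for key in row_keys], index=chants_toy.index)

# Drop HYM records that have an FCB twin, then relink the remaining ones
is_duplicate = is_hym & has_fcb_twin
merged = (
    chants_toy[~is_duplicate]
    .assign(srclink=target[~is_duplicate])
    .reset_index(drop=True)
)
print(merged)
```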

##### Numerical century
- general code again (not specific to one data version)

In [26]:
import re

In [27]:
# Numerical century
def get_numerical_century(century):
    """Return the century as int, '' for a missing value, or None (after printing) if nothing is recognized."""
    if not isinstance(century, str):  # NaN for sources without a century value
        return ''
    # Two-digit number; if there are several, take the first one anyway
    two_digits_match = re.findall(r'(?<!\d)\d{2}(?!\d)', century)
    if two_digits_match:
        return int(two_digits_match[0])
    # Exactly one single-digit number
    one_digit_match = re.findall(r'(?<!\d)\d(?!\d)', century)
    if len(one_digit_match) == 1:
        return int(one_digit_match[0])
    # Four-digit year: first two digits + 1; if there are several, take the first one anyway
    four_digits_match = re.findall(r'(?<!\d)\d{4}(?!\d)', century)
    if four_digits_match:
        return int(four_digits_match[0][0:2]) + 1
    print('PROBLEM:', century)
    return None

In [28]:
# Apply numerical century creation on source data
# (work on a copy so the filtered `sources` frame is not modified in place)
sources_num_cent = sources.copy()
sources_num_cent['num_century'] = sources_num_cent['century'].apply(get_numerical_century)
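
To illustrate the matching order used by `get_numerical_century` (two-digit number first, then a single digit, then a four-digit year), here is a self-contained sketch. `century_from_text` is a hypothetical stand-in that follows the same order but does not print anything, and the example strings are made up rather than taken from the data.

```python
import re

def century_from_text(text):
    """Hypothetical stand-in following the same matching order as the notebook function."""
    if not isinstance(text, str):  # e.g. NaN for a missing value
        return ''
    two_digits = re.findall(r'(?<!\d)\d{2}(?!\d)', text)
    if two_digits:
        return int(two_digits[0])
    one_digit = re.findall(r'(?<!\d)\d(?!\d)', text)
    if len(one_digit) == 1:
        return int(one_digit[0])
    four_digits = re.findall(r'(?<!\d)\d{4}(?!\d)', text)
    if four_digits:
        # A year such as 1583 falls into the 16th century: first two digits + 1
        return int(four_digits[0][:2]) + 1
    return None  # nothing recognized

examples = ['14th century', '13th/14th century', '9th century', '1583', 'unknown', float('nan')]
parsed = [century_from_text(example) for example in examples]
print(parsed)
```

For a range such as `13th/14th century` the first number wins, which is the "take first anyway" choice made in the function above.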

#### Cursus

In [29]:
print('number of sources:', len(sources))
print()
print(sources['cursus'].value_counts())

number of sources: 2266

cursus
Secular      196
Monastic      86
cathedral     49
unknown       35
Romanum       14
Name: count, dtype: int64


#### Provenance


In [30]:
print('number of provenance values in data:', len(set(sources['provenance'])))
print(sources['provenance'].value_counts())

number of provenance values in data: 642
provenance
Slovakia                                                            110
Bohemia                                                              63
Klosterneuburg                                                       63
Hungary                                                              57
Austria/Germany                                                      44
                                                                   ... 
Bellelay Abbey                                                        1
Troyes                                                                1
Coimbra, Catedral                                                     1
France, Abbaye Saint-Pierre d’Angers (?) puis abbaye Saint-Aubin      1
Troyes, France                                                        1
Name: count, Length: 641, dtype: int64


In [31]:
# Save chants after all changes
chants.to_csv(CHANTS_CSV_PATH[:-4]+'_processed.csv', index=False)
# Save sources after all changes
sources_num_cent.to_csv(SOURCES_CSV_PATH[:-4]+'_processed.csv', index=False)

## Statistics
Here we compute basic statistics about our dataset.
- number of chant records

- number of source manuscripts of these records
- out of them, how many have:
    - provenance
    - century
    - cursus

- number of chant records with a melody in volpiano - more than 20 notes

- mode distribution? - tricky because of lots of weird values (category other?)
- office distribution? - tricky because of lots of weird values (category other?)  
- genre distribution? - OK, it is standardized... still many values  
(these as little plots, I guess)
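
One possible way to handle the "category other" idea for mode and office is to collapse all values below a frequency threshold into a single bucket. This is only a sketch: `distribution_with_other`, the `min_count` threshold and the toy values are assumptions, not part of the dataset procedure.

```python
import pandas as pd

def distribution_with_other(values, min_count):
    """Count values and sum everything rarer than min_count under the label 'other'."""
    counts = values.value_counts(dropna=True)
    frequent = counts[counts >= min_count]
    rare_total = int(counts[counts < min_count].sum())
    if rare_total > 0:
        frequent = pd.concat([frequent, pd.Series({'other': rare_total})])
    return frequent

# Toy mode values; missing values are ignored
modes = pd.Series(['8', '8', '8', '1', '1', '2 Trp', 'Gd', None])
summary = distribution_with_other(modes, min_count=2)
print(summary)
```

The resulting series can be passed directly to a bar plot. If the data already contain a value literally named `other`, a different label for the bucket would be needed.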

In [32]:
chants = pd.read_csv(CHANTS_CSV_PATH[:-4]+'_processed.csv', dtype=str)
sources = pd.read_csv(SOURCES_CSV_PATH[:-4]+'_processed.csv', dtype=str)

In [45]:
print('number of chants records after all processing:', len(chants))
print('out of them number of:')

number of chants records after all processing: 888010
out of them number of:


In [39]:
print('number of sources records after all processing:', len(sources))
print('out of them number of:')
print('\tsources with provenance value:', len(sources[sources['provenance'].notna()]))
print('\tsources with century value:', len(sources[sources['century'].notna()]))
print('\tsources with cursus value:', len(sources[sources['cursus'].notna()]) - len(sources[sources['cursus'] == 'unknown']))

number of sources records after all processing: 2266
out of them number of:
	sources with provenance value: 1606
	sources with century value: 2228
	sources with cursus value: 345
