# Intro

This Jupyter Notebook has been created for the <a href="https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2021/443749" target="_blank">90154 - Electronic Publishing and Digital Storytelling</a> course taught by **Prof. Marilena Daquino** in the framework of the 2nd year of the <a href="https://corsi.unibo.it/2cycle/DigitalHumanitiesKnowledge"  target="_blank">DHDK Master Degree</a>, a.a. 2021-22.<br>
Listed below are the main steps in the realization of the project **Partizione Antica**: 
https://mybinder.org/v2/gh/https%3A%2F%2Fenri-ca.github.io%2FEPDS_EZ/main
      
       
1. Data Preparation:
    - creation of two combined XML files for the F and OA records coming from the Federico Zeri Foundation catalogues
    - extraction of the information relevant to the project from the nested XML structure, and its arrangement in a plain tabular format
2. Data Elaboration: search for further elements of analysis via:
    - in-depth work on the photographers, to enrich the information about them
    - in-depth work on places
    - work on unstructured annotations: NER
3. Data Visualization


---

# 1. Data preparation

This research started from a **record data extraction** provided by the Federico Zeri Foundation: the original data comprise 3,260 F records and 2,634 OA records (respectively for the photographs and for the depicted works of art) of the Supino Partizione Antica fund. <br>
The original data have been used for **illustrative and teaching purposes only**: requests for credits and reuse authorizations must be addressed to the <a href="mailto:fondazionezeri.fototeca@unibo.it">Federico Zeri Foundation</a>.

**1.1 Creation of the combined F and OA XML files**

To make the data easier to manage and manipulate, and to anonymize personal data, combined files have been created (via the collection command in <a href="/content/sample_data/0_Creation_UniqeXML.xquery" target="_blank">0_Creation_UniqeXML.xquery</a>) and published. 
They collect:

*   the records of all the single photograph XML files, in the F_entries.xml file (data/0_source_data)
*   the records of all the single work of art XML files, in the OA_entries.xml file (data/0_source_data)
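
The project builds the combined files with XQuery. As a rough Python equivalent, the sketch below collects the records of several single XML files under one new root. It assumes that each record is a `SCHEDA` element; the function name `merge_records` and the root tag `ENTRIES` are hypothetical, and the anonymization step is not covered.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_records(source_dir, output_file, record_tag="SCHEDA", root_tag="ENTRIES"):
    """Collect every record_tag element found in the XML files of source_dir
    under a single new root, write the result to output_file and return
    the number of records collected."""
    merged_root = ET.Element(root_tag)  # hypothetical name for the new root
    for xml_path in sorted(Path(source_dir).glob("*.xml")):
        file_root = ET.parse(xml_path).getroot()
        if file_root.tag == record_tag:
            # the file contains a single record as its root element
            merged_root.append(file_root)
        else:
            # the records are nested somewhere below the root of the file
            merged_root.extend(list(file_root.iter(record_tag)))
    ET.ElementTree(merged_root).write(
        output_file, encoding="utf-8", xml_declaration=True
    )
    return len(merged_root)
```

A call such as `merge_records("single_F_files", "F_entries.xml")`, where the directory name is only a placeholder, would write one file holding all the F records.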

**1.2 Creation of the flat tabular dataset by extracting the information relevant to the project from the nested XML elements and attributes**

Because the XML is deeply nested and elements are not consistently present at the different levels, the `pandas.read_xml` method did not effectively parse what was needed.
The `xml.etree.ElementTree` library has therefore been preferred, because it allows single elements to be retrieved at different nesting levels. Nevertheless, this approach has some drawbacks: it requires a prior and thorough knowledge of the database structure, and it does not allow the discovery of unexpected correlations that the exploration of a comprehensive dataset would make possible.
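
As a minimal illustration of this approach, the sketch below reads two invented records with `xml.etree.ElementTree`; the second one lacks the `RIPETIZIONE` block. The sample XML, its values and the helper name `text_or_none` are assumptions made for the example: only the element names come from the project data.

```python
import xml.etree.ElementTree as ET

# Invented sample: the second record has no RIPETIZIONE/AUFN element.
SAMPLE = """
<ROOT>
  <SCHEDA sercdf="1">
    <PARAGRAFO><INVN>100</INVN></PARAGRAFO>
    <PARAGRAFO><RIPETIZIONE><AUFN>Photographer A</AUFN></RIPETIZIONE></PARAGRAFO>
  </SCHEDA>
  <SCHEDA sercdf="2">
    <PARAGRAFO><INVN>101</INVN></PARAGRAFO>
  </SCHEDA>
</ROOT>
"""


def text_or_none(record, path):
    """Return the text of the first element matching path, or None if absent."""
    element = record.find(path)
    return element.text if element is not None else None


root = ET.fromstring(SAMPLE)
# one flat tuple per record, whatever the nesting level of each field
rows = [
    (
        record.get("sercdf"),
        text_or_none(record, "./PARAGRAFO/INVN"),
        text_or_none(record, "./PARAGRAFO/RIPETIZIONE/AUFN"),
    )
    for record in root.findall("SCHEDA")
]
print(rows)  # [('1', '100', 'Photographer A'), ('2', '101', None)]
```

Each path has to be known in advance, which is the drawback mentioned above: a field that nobody asks for never reaches the table.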

**1.3 Preliminary installation** (uncomment the first line to install the library)
- libraries
- imports:
  - xml.etree.ElementTree, pandas, csv for managing the dataset
  - ...


In [2]:
#preliminary imports
import csv
#from csv import DictReader
import xml.etree.ElementTree as ET


#return the text of the requested element of the current SCHEDA if present, otherwise None
#(SCHEDA is the global variable set by the loops below)
def extract_data(path):
    element = SCHEDA.find(path)
    if element is not None:
        name = element.text
    else:
        name = None
    return name

In [4]:
#parse the combined F and OA xml files

F_tree = ET.parse('data/0_source_data/F_entries.xml')
F_root = F_tree.getroot()

OA_tree = ET.parse('data/0_source_data/OA_entries.xml')
OA_root = OA_tree.getroot()

#set the column headers for the chosen elements
header = ["sercdf_F_ser", "sercdoa_OA_ser", "INVN_F", "UBFC_Fshelfmark", #ids
          "PVCS_OAcountry", "PVCC_OAtown", "LDCN_OArep", "PRVC_OAprev_town", "AUFI_Fatelier_town", #places
          "AUFN_Faut", "SGLT_Ftitle", "SGTT_OAtitle", "AUTN_OAaut", #authors/titles
          "OGTT_OAtype", "AUTB_Fsubj_main", "OGTDOA_OAsubj_sub", #subjects
          "ROFI_Fneg", #external relations
          "OSS_Fnotes", "OSS_OAnotes", #unstructured infos
         "FTAN_filename", "NCTN_F_entry", "NRSCHEDA_OA_entry", #2ary ids
          "LRD_Fshotdates", "DTSI_Fprintdates", "DTSF_Fprintdates", "AUFA_Faut_dates"] #time

#setting an empty list
data = []

#iterate over the SCHEDA records of the F file - and over the corresponding SCHEDA of the OA file -
#to extract the text of the elements, store it in a list and append the list to the data
#two fields of the original data are further modified for our purposes

for SCHEDA in F_root.findall("SCHEDA"):
    oa_ser = SCHEDA.get("sercdoa")
    f_ser = SCHEDA.get("sercdf")
    inv = extract_data("./PARAGRAFO/INVN")
    container = extract_data("./PARAGRAFO/UBFT")
    shelf = extract_data("./PARAGRAFO/UBFC")
    title_f = extract_data("./PARAGRAFO/RIPETIZIONE/SGLT")
    aut_f = extract_data("./PARAGRAFO/RIPETIZIONE/AUFN") #the original data do not distinguish AUFN and AUFB for collective agents
    aut_f_dates = extract_data("./PARAGRAFO/RIPETIZIONE/AUFA")#timespan of photographer's activity
    aut_f_place = extract_data("./PARAGRAFO/RIPETIZIONE/AUFI")#place of photographer's activity as reported in the photograph
    aut_oa = extract_data("./PARAGRAFO/RIPETIZIONE/AUTN")
    subj_main = extract_data("./PARAGRAFO/RIPETIZIONE/AUTB")
    subj_sub = extract_data("./PARAGRAFO/RIPETIZIONE/OGTDOA")
    notes_f = extract_data("./PARAGRAFO/OSS")
    neg_num = extract_data("./PARAGRAFO/ROFI")
    f_entry = extract_data("./PARAGRAFO/NCTN")
    filename = extract_data("./PARAGRAFO/FTAN")
    shotdates = extract_data("./PARAGRAFO/LRD")
    if shotdates is not None:
        #reduce uncertainty: if "/ante" is in the field, use 1826 as the conventional start date
        #(first negative: https://en.wikipedia.org/wiki/Negative_(photography))
        #str.replace returns a new string, so the result must be reassigned
        if "/ante" in shotdates:
            shotdates = shotdates.replace("/ante", "/ ante")
        #the slice assumes that "/ ante" closes the field
        if "/ ante" in shotdates:
            shotdates = "1826-"+shotdates[:-6]
    printdates_start = extract_data("./PARAGRAFO/DTSI")
    printdates_end = extract_data("./PARAGRAFO/DTSF")

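    #reset the OA fields, so that an F record without a matching OA record
    #does not raise a NameError or inherit the values of the previous iteration
    title_oa = country_oa = town_oa = rep_oa = prev_town_oa = None
    type_oa = notes_oa = oa_entry = None
    for SCHEDA in OA_root.findall("SCHEDA"):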
        if SCHEDA.get("sercdoa") == oa_ser:
            title_oa = extract_data("./PARAGRAFO/SGTT")
            date_from_oa = extract_data("./PARAGRAFO/DTSI")
            date_to_oa = extract_data("./PARAGRAFO/DTSF")
            country_oa = extract_data("./PARAGRAFO/PVCS") #the original data contain only 2 LRCS values (name of the country where the shot was taken)
            town_oa = extract_data("./PARAGRAFO/PVCC") #the original data contain only 2 LRCC values (name of the town where the shot was taken)
            rep_oa = extract_data("./PARAGRAFO/LDCN")
            prev_town_oa = extract_data("./PARAGRAFO/RIPETIZIONE/PRVC")
            if prev_town_oa is not None:
                #keep the previous location only if its last date (PRDU) falls in 1800-1899, otherwise put "NR" (not relevant)
                prdu = extract_data("./PARAGRAFO/RIPETIZIONE/PRDU") #PRDU: last date the OA was in that location
                if prdu is not None:
                    if "1799" < prdu < "1900": #string comparison: assumes a four-digit year
                        prev_town_oa = prev_town_oa + " | " + prdu
                    else:
                        prev_town_oa = "NR"
            type_oa = extract_data("./PARAGRAFO/OGTT")
            notes_oa = extract_data("./PARAGRAFO/OSS")
            oa_entry = extract_data("./PARAGRAFO/NRSCHEDA")

    #the values follow the order of the header (sercdf first, then sercdoa)
    row = [f_ser, oa_ser, inv, shelf,
           country_oa, town_oa, rep_oa, prev_town_oa, aut_f_place,
           aut_f, title_f, title_oa, aut_oa,
           type_oa, subj_main, subj_sub,
           neg_num,
           notes_f, notes_oa,
           filename, f_entry, oa_entry,
           shotdates, printdates_start, printdates_end, aut_f_dates]
    data.append(row)

#Write the data and their header in a new csv dataset
with open('data/data/F_OA_selected_data', 'w', encoding='UTF8', newline='') as tabular_data:
    writer = csv.writer(tabular_data)
    writer.writerow(header)
    writer.writerows(data)
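
The nested loop above scans every OA record for each F record. As a possible alternative, the sketch below indexes the OA records by their `sercdoa` attribute once and then looks each one up directly. The sample XML, its values and the helper `text_or_none` are invented for the example; this is not the method used in the notebook.

```python
import xml.etree.ElementTree as ET

# Invented samples: record F2 points to an OA record that does not exist.
F_SAMPLE = """<ROOT>
  <SCHEDA sercdf="F1" sercdoa="OA1"><PARAGRAFO><INVN>100</INVN></PARAGRAFO></SCHEDA>
  <SCHEDA sercdf="F2" sercdoa="OA9"><PARAGRAFO><INVN>101</INVN></PARAGRAFO></SCHEDA>
</ROOT>"""
OA_SAMPLE = """<ROOT>
  <SCHEDA sercdoa="OA1"><PARAGRAFO><SGTT>Title one</SGTT></PARAGRAFO></SCHEDA>
</ROOT>"""


def text_or_none(record, path):
    """Return the text of the first element matching path, or None if absent."""
    element = record.find(path)
    return element.text if element is not None else None


f_root = ET.fromstring(F_SAMPLE)
oa_root = ET.fromstring(OA_SAMPLE)

# build the index once: sercdoa attribute -> OA record
oa_by_id = {oa.get("sercdoa"): oa for oa in oa_root.findall("SCHEDA")}

paired = []
for f_record in f_root.findall("SCHEDA"):
    oa_record = oa_by_id.get(f_record.get("sercdoa"))
    # compare with None explicitly: an Element without children is falsy
    if oa_record is not None:
        title_oa = text_or_none(oa_record, "./PARAGRAFO/SGTT")
    else:
        title_oa = None
    paired.append(
        (f_record.get("sercdf"), text_or_none(f_record, "./PARAGRAFO/INVN"), title_oa)
    )
print(paired)  # [('F1', '100', 'Title one'), ('F2', '101', None)]
```

With a few thousand records on each side, the lookup avoids millions of comparisons, and an F record without a matching OA record explicitly gets empty OA fields.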

# 2. Data analysis




# 3. Data visualization

# New section


---
**Analysis**
We use the pandas library to examine our data.
     
       