# Intro

This Jupyter Notebook was created for the project **Partizione Antica. Looking for a XIXth century art historian identity**, the final project of the <a href="https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2021/443749" target="_blank">90154 - Electronic Publishing and Digital Storytelling</a> course, taught by **Prof. Marilena Daquino**, in the framework of the 2nd year of the <a href="https://corsi.unibo.it/2cycle/DigitalHumanitiesKnowledge" target="_blank">DHDK Master Degree</a>, a.a. 2021-22.<br>
The main steps are listed below:

    1. Data Preparation:
          1.0 creation of two aggregated XML files for the F and OA entries extracted from the Federico Zeri Foundation catalogues
          1.1 extraction of the information relevant to the project from the nested XML structure and its arrangement in plain tabular format
          1.2 extraction of unstructured annotations from the previous tabular data
    2. Data Elaboration: looking for further elements of analysis via:
          2.1 deeper work on photographers to enrich their information (workplace, timespan of activity, etc.)
          2.2 deeper work on places to enrich their geolocation
          2.3 work on unstructured annotations through NLP and NER
    3. Data Visualization
          3.1 Depicted works of art typologies and geographic distribution
          3.2 Photographers and anonymous photos
          3.3 Annotations
          3.4 Derived information from annotations

# 1. Data preparation

This research started from a **record data extraction** of the Supino Partizione Antica fund provided by the Federico Zeri Foundation: the original data comprised 3,260 records for photographs and 2,634 records for depicted works of art. <br>
The original data have been used for **illustrative and didactic purposes only**: all credits and reuse authorizations must be requested from the <a href="mailto:fondazionezeri.fototeca@unibo.it">Federico Zeri Foundation</a>.

**1.0 Creation of the F and OA aggregated XML files**

To allow better management and manipulation of the data, as well as to anonymize personal data, aggregated files have been created (via the <a href="/content/sample_data/0_Creation_UniqeXML.xquery" target="_blank">0_Creation_UniqeXML.xquery</a> collection command) and published.
They collect:

*   all the single photograph xml files' records in the F_entries.xml file (data/0_source_data)
*   all the single works of art xml files' records in the OA_entries.xml file (data/0_source_data)

## XML metadata
**1.1 Creation of the flat tabular dataset extracting relevant information for the project from the nested xml elements and attributes**

Because the XML is deeply nested and elements are not consistently present at every level, the `pandas.read_xml` method did not parse the needed fields effectively.
The `xml.etree.ElementTree` library was therefore preferred, because it allows single elements to be addressed at different nesting levels. This approach has a drawback, however: it requires a thorough prior knowledge of the database structure, and so it does not help uncover the unexpected correlations that exploring a comprehensive dataset might reveal.
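As a minimal sketch of the approach described above, `ElementTree` paths can address an element at a given nesting level and signal its absence with `None`. The XML fragment is invented for illustration and only mimics the SCHEDA/PARAGRAFO nesting; `text_or_none` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Invented sample: the second record lacks the INVN element
SAMPLE = """
<ROOT>
  <SCHEDA sercdf="1"><PARAGRAFO><INVN>748</INVN></PARAGRAFO></SCHEDA>
  <SCHEDA sercdf="2"><PARAGRAFO></PARAGRAFO></SCHEDA>
</ROOT>
"""

def text_or_none(record, path):
    """Return the text of the first element matching path, or None if it is missing."""
    element = record.find(path)
    # "is not None" is needed: an Element without children evaluates as False
    return element.text if element is not None else None

root = ET.fromstring(SAMPLE)
rows = [(record.get("sercdf"), text_or_none(record, "./PARAGRAFO/INVN"))
        for record in root.findall("SCHEDA")]
print(rows)
```

Passing the record explicitly, rather than reading it from a global variable, keeps the helper usable on both F and OA records.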

**1.3 Preliminary installation**(Uncomment the first line to install the library)
- libraries
- imports:
  - xml.etree.ElementTree, pandas, csv for managing the dataset
  - ...

In [2]:
#preliminary imports
#!pip install python-csv
#!pip install elementpath
!pip install pandas

import xml.etree.ElementTree as ET
import csv
import pandas as pd



1.1 Prepare structured data from metadata

In [3]:
#preliminary imports
#!pip install python-csv
#!pip install elementpath
#!pip install pandas

import xml.etree.ElementTree as ET
import csv
import pandas as pd

In [4]:
#function to have back the element texts
#note: it reads the global SCHEDA variable, i.e. the record currently being iterated
def extract_data(path):
    element = SCHEDA.find(path)
    if element is not None:
        name = element.text
    else:
        name = None
    return name

#parse the complexive Fxml and OAxml files
F_tree = ET.parse("data/0_source_data/F_entries.xml")
F_root = F_tree.getroot()
F_root.attrib["test"]

OA_tree = ET.parse("data/0_source_data/OA_entries.xml")
OA_root = OA_tree.getroot()
OA_root.attrib["test2"]

#set the columns' headers for the chosen elements
header = ["sercdf_F_ser", "sercdoa_OA_ser", "INVN_F", "UBFC_Fshelfmark", #ids
          "PVCS_OAcountry", "PVCC_OAtown", "LDCN_OArep", "PRVC_OAprev_town", "AUFI_Fatelier_address", #places
          "AUFN_Faut", "SGLT_Ftitle", "SGTT_OAtitle", "AUTN_OAaut", #authors/titles
          "OGTT_OAtype", "AUTB_Fsubj_main", "OGTDOA_OAsubj_sub", #subjects
          "ROFI_Fneg", "BIBA_OAbib",#external relations
          "OSS_Fnotes", "OSS_OAnotes", #unstructured infos
         "FTAN_filename", "NCTN_F_entry", "NRSCHEDA_OA_entry", #2ary ids
        "DTZG_OAcentury", "LRD_Fshotdates", "DTSI_Fprintdates", "DTSF_Fprintdates", "AUFA_Faut_dates"] #time

#setting an empty list
data = []

#iterate on F_entries - and on corresponding OA_entries -
#for extracting elements texts, store them in a list and add it to the data
#two fields from the original data are further modified for our purposes

for SCHEDA in F_root.findall("SCHEDA"):
    oa_ser = SCHEDA.get("sercdoa")
    f_ser = SCHEDA.get("sercdf")
    inv = extract_data("./PARAGRAFO/INVN")
    container = extract_data("./PARAGRAFO/UBFT")
    shelf = extract_data("./PARAGRAFO/UBFC")
    title_f = extract_data("./PARAGRAFO/RIPETIZIONE/SGLT")
    aut_f = extract_data("./PARAGRAFO/RIPETIZIONE/AUFN") #the original data do not distinguish AUFN and AUFB for collective agents
    aut_f_dates = extract_data("./PARAGRAFO/RIPETIZIONE/AUFA") #timespan of photographer's activity
    aut_f_addr = extract_data("./PARAGRAFO/RIPETIZIONE/AUFI") #place of photographer's activity as reported in the photograph >AF of variants
    aut_oa = extract_data("./PARAGRAFO/RIPETIZIONE/AUTN")
    subj_main = extract_data("./PARAGRAFO/RIPETIZIONE/AUTB")
    subj_sub = extract_data("./PARAGRAFO/RIPETIZIONE/OGTDOA")
    notes_f = extract_data("./PARAGRAFO/OSS")
    neg_num = extract_data("./PARAGRAFO/ROFI")
    f_entry = extract_data("./PARAGRAFO/NCTN")
    filename = extract_data("./PARAGRAFO/FTAN")
    shotdates = extract_data("./PARAGRAFO/LRD")
    if shotdates is not None:
        #reduce uncertainty: if /ante in field, put 1855 as conventional beginning date
        #for collodion negatives (according to other Zeri cataloguing)
        #str.replace returns a new string, so the result must be reassigned
        shotdates = shotdates.replace("/ante", "/ ante")
        if "/ ante" in shotdates:
            shotdates = "1855-"+shotdates[:-6]
    printdates_start = extract_data("./PARAGRAFO/DTSI")
    printdates_end = extract_data("./PARAGRAFO/DTSF")

    for SCHEDA in OA_root.findall("SCHEDA"):
        if SCHEDA.get("sercdoa") == oa_ser:
            title_oa = extract_data("./PARAGRAFO/SGTT")
            century_oa = extract_data("./PARAGRAFO/DTZG")
            country_oa = extract_data("./PARAGRAFO/PVCS") #Original data report just 2 LRCS: name of the country where the shot was taken.
            town_oa = extract_data("./PARAGRAFO/PVCC") #Original data report just 2 LRCC: name of the town where the shot was taken.
            rep_oa = extract_data("./PARAGRAFO/LDCN")
            prev_town_oa = extract_data("./PARAGRAFO/RIPETIZIONE/PRVC")
            if prev_town_oa != None:
                #save the previous locations only if in 1800-1899 timespan (PRDU) otherwise put "NR" (not relevant)
                if extract_data("./PARAGRAFO/RIPETIZIONE/PRDU") != None:
                    if "1799" < extract_data("./PARAGRAFO/RIPETIZIONE/PRDU") < "1900": #PRDU last date the OA was in that location
                        prev_town_oa = prev_town_oa + " | " + str(extract_data("./PARAGRAFO/RIPETIZIONE/PRDU"))
                    else:
                        prev_town_oa = "NR"
            type_oa = extract_data("./PARAGRAFO/OGTT")
            notes_oa = extract_data("./PARAGRAFO/OSS")
            oa_entry = extract_data("./PARAGRAFO/NRSCHEDA")
            beg_date_oa = extract_data("./PARAGRAFO/DTSI")
            bib_oa = None #default, so that no value from a previous record is carried over
            if extract_data("./PARAGRAFO/RIPETIZIONE/BIBA") is not None:
                #save the bib ref only if in 1800-1899 timespan (BIBD) otherwise put "NR" (not relevant)
                bib_date = extract_data("./PARAGRAFO/RIPETIZIONE/BIBD")
                if bib_date is not None:
                    if "1799" < bib_date < "1900": #BIBD: date of the bibliographic reference
                        bib_oa = extract_data("./PARAGRAFO/RIPETIZIONE/BIBA")
                        bib_oa = bib_oa + " | " + str(bib_date)
                    else:
                        bib_oa = "NR"

    row = [oa_ser, f_ser, inv, shelf,
           country_oa, town_oa, rep_oa, prev_town_oa, aut_f_addr,
           aut_f, title_f, title_oa, aut_oa,
           type_oa, subj_main, subj_sub,
           neg_num, bib_oa,
           notes_f, notes_oa,
           filename, f_entry, oa_entry,
           century_oa, shotdates, printdates_start, printdates_end, aut_f_dates]
    data.append(row)

#Write the data and their header in a new csv dataset
with open("data/F_OA_selected_data.csv", "w", encoding="utf-8", newline="") as tabular_data:
    # create the csv writer
    writer = csv.writer(tabular_data)
    writer.writerow(header)
    writer.writerows(data)

#have a look at the data
data_df = pd.read_csv('data/F_OA_selected_data.csv')
#data_df.describe()
#print(data_df.head(10))
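The conventional treatment of "ante" shot dates in the cell above can be isolated in a small function, which makes it easier to check. This is only a sketch: `normalize_shot_date` is a hypothetical name, the sample values are invented, and it assumes the "/ante" marker always closes the field.

```python
def normalize_shot_date(raw, conventional_start="1855"):
    """Turn 'YYYY/ante' or 'YYYY/ ante' into '1855-YYYY'; leave other values untouched."""
    if raw is None:
        return None
    for suffix in ("/ ante", "/ante"):
        if raw.endswith(suffix):
            # Drop the marker and prepend the conventional beginning date
            return conventional_start + "-" + raw[:-len(suffix)]
    return raw

samples = ["1890/ante", "1890/ ante", "1880-1885", None]
normalized = [normalize_shot_date(value) for value in samples]
print(normalized)
```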

## 1.2 Unstructured data from transcribed annotations

Natural-language annotations from the Note (`OSS`) field of the structured data have been extracted and stored in a new file as a single corpus to be used with NER and NLP libraries.
A restricted dataframe with only the inventory and annotation columns has also been created.
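To illustrate the idea on a single entry, the sketch below separates the inventory reference from the transcribed text using only the standard library. The sample entry is invented (it imitates the "Foto sup N, verso: ...: "text"" shape of the records) and `split_annotation` is a hypothetical helper, so the real data may need the manual adjustments described further on.

```python
import re

def split_annotation(entry):
    """Split one entry into (inventory, note text) at the first ': "' separator."""
    parts = entry.split(': "', 1)
    if len(parts) < 2:
        return None  # no transcription in this entry
    head, note = parts
    # Keep only the inventory reference: drop "Foto " and whatever follows the first comma
    inventory = re.sub(r"Foto |, (.+)", "", head)
    return inventory, note

sample = 'Foto sup 748, verso: nota anonima: "Near Avezzano."'
parsed = split_annotation(sample)
print(parsed)
```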

In [5]:
#!pip install pandas
import pandas as pd

In [6]:
data_df = pd.read_csv('data/F_OA_selected_data.csv')

#reduce the dataset to just the columns needed, the not-empty and not-duplicates rows
OAnotes_df = data_df[["OSS_OAnotes"]].dropna()
OAnotes_df = OAnotes_df.drop_duplicates()

#split multilines rows and once again remove duplicates rows
OAnotes_df["OSS_OAnotes"] = OAnotes_df["OSS_OAnotes"].str.split("&#10;|"". Foto ", expand = False)
OAnotes_df = OAnotes_df.explode("OSS_OAnotes")
OAnotes_df = OAnotes_df.drop_duplicates()

#save just rows with transcriptions notes (including "Foto sup \d{1,4}" string)
#the raw string keeps \d from being read as a Python escape sequence
OAnotes_df = OAnotes_df[OAnotes_df["OSS_OAnotes"].str.contains(r"sup \d{1,4}")== True].reset_index(drop=True)
OAnotes_df = OAnotes_df[OAnotes_df["OSS_OAnotes"].str.startswith("La foto ")== False].reset_index(drop=True)

#separate note texts from other information, remove the column containing the whole text, save and check the result
OAnotes_df[["Inv", "Note"]] = OAnotes_df["OSS_OAnotes"].str.split(': "', n=1, expand=True)
OAnotes_df= OAnotes_df.drop(columns=["OSS_OAnotes"]).reset_index(drop=True)
print("Photographs which annotations have been transcribed in OA entries: ", OAnotes_df.shape[0], "(/over 3.222 photographs") #1839
print(OAnotes_df.head(10))
OAnotes_df.to_csv("data/1_working_data/1_OAnotes01.csv", encoding="utf-8")

Photographs which annotations have been transcribed in OA entries:  1839 (/over 3.222 photographs
                                                 Inv  \
0                  Foto sup 748, verso: nota anonima   
1      Foto sup 763, verso: nota anonima manoscritta   
2      Foto sup 982, verso: nota anonima manoscritta   
3  Foto sup 893, verso: nota anonima manoscritta:...   
4      Foto sup 988, verso: nota anonima manoscritta   
5      Foto sup 991, verso: nota anonima manoscritta   
6      Foto sup 998, verso: nota manoscritta anonima   
7     Foto sup 1000, verso: nota anonima manoscritta   
8     Foto sup 1002, verso: nota anonima manoscritta   
9     Foto sup 1003, verso: nota anonima manoscritta   

                                                Note  
0  Near Avezzano and not far from Tagliacozzo. He...  
1  Aquila. S. Maria di Collemaggio. Founded by Pi...  
2  The pulpit of San Giovanni del Toro, of the mi...  
3                                               None  
4  Here al

In [7]:
#manual checking and adjusting for 1)"manoscritta:">"manoscritta"," 2)"".",>""." 3)\n",>"
#inv: 943, 75, 559, 1635, 1648, 1702, (1768 does not report it), 1787, 2397, 2789, 2869, 2849,
# 2984, 2222,2270,2880, saved in data\OAnotes02.csv

#open the manually modified dataframe, search for unnecessary information in 'Inv' and remove it
OAnotes2_df = pd.read_csv('data/1_working_data/1_OAnotes02.csv', encoding="utf-8").dropna(subset=['Inv']).reset_index(drop=True)
pattern = 'Foto |, (.+)'
OAnotes2_df["Inv"] = OAnotes2_df["Inv"].replace(to_replace=pattern, value='', regex=True).reset_index(drop=True)

#check and save the third version of OAnotes_df
print(OAnotes2_df.head(15))
OAnotes2_df.to_csv("data/1_working_data/1_OAnotes03.csv", encoding="utf-8")

#check how many of them are incomplete (regex=False matches the literal "[...]" marker
#instead of treating it as a regex character class that matches any dot)
annotations_incompleted = OAnotes2_df[OAnotes2_df['Note'].str.contains("[...]", regex=False)== True].reset_index(drop=True)
print("Photographs which transcribed annotations are likely to be incomplete: ", annotations_incompleted.shape[0], "(/over ",OAnotes_df.shape[0]," transcribed)")

#create the corpus to be passed with spacy
corpus = ""
for OAnote in OAnotes2_df["Note"]:
    corpus = corpus+"---"+str(OAnote)+"---\n"
with open("data/OAnotes_corpus.txt", "w", encoding="utf-8") as f:
        f.write(corpus)

    Unnamed: 0       Inv                                               Note
0            0   sup 748  Near Avezzano and not far from Tagliacozzo. He...
1            1   sup 763  Aquila. S. Maria di Collemaggio. Founded by Pi...
2            2   sup 982  The pulpit of San Giovanni del Toro, of the mi...
3            3   sup 893  Clara pudicicie dux Paulabianca potentis / A g...
4            4   sup 988  Here also his buried Sibylla of Burgundy. "Rex...
5            5   sup 991  Queen Margherita widow of Carlo III, who died ...
6            6   sup 998  Piissimi Patris Nicolai Piscicelli optimi pres...
7            7  sup 1000  Rude sarcophagus in the porch of the church. T...
8            8  sup 1002  Amalfi. This campanile is said to date from 11...
9            9  sup 1003  Cloister of the Canonica founded in 1213 by Ca...
10          10  sup 1012  In cloister of Amalfi Duomo. Sarc. of an archb...
11          11  sup 1005  Amalfi. Cloister of San Francesco founded by t...
12          
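The incompleteness check searches for the `[...]` marker, which deserves care because square brackets and dots are regex metacharacters. The sketch below, on two invented notes, shows the difference between a regex search and a literal one using the standard `re` module; in pandas, `Series.str.contains(..., regex=False)` gives the same literal behaviour.

```python
import re

# Invented notes: only the second one carries the "[...]" omission marker
notes = ["Amalfi. Cloister of San Francesco", "Founded in 1213 [...] by the cardinal"]

# As a regex, "[...]" is a character class that matches any single dot
as_regex = [bool(re.search("[...]", note)) for note in notes]

# Escaping the marker makes the search literal
literal = [bool(re.search(re.escape("[...]"), note)) for note in notes]

print(as_regex)
print(literal)
```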

# 2. Data elaboration




## 2.1 Work on photographers

In [8]:
!pip install SPARQLWrapper
!pip install geopy
from csv import DictReader
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import ssl
from geopy.geocoders import Nominatim


In [9]:
ssl._create_default_https_context = ssl._create_unverified_context
geolocator = Nominatim(timeout=10, user_agent="myGeolocator")

In [10]:
# functions
# define a function to open file in reading mode
def process_csv(data_file_path):
    import csv
    #the context manager closes the file once the rows have been read
    with open(data_file_path, mode="r", encoding="UTF8") as source:
        source_data = list(csv.DictReader(source))
    return source_data

#define a function for transforming lists of elements in strings
def write_string(source, output_txt_name):
    #join the elements with "|" so that they can be used as regex alternatives
    string = "|".join(source)
    with open(output_txt_name, "w", encoding="utf-8") as f:
        f.write(string)
    return string

#define a function to query endpoints
def query_endpoint(endpoint_url, SPRQL_query):
    get_endpoint = endpoint_url
    sparql_w = SPARQLWrapper(get_endpoint)
    sparql_w.setQuery(SPRQL_query)
    sparql_w.setReturnFormat(JSON)
    spqrl_w_res = sparql_w.query().convert()
    return spqrl_w_res

#define a function to manipulate results and have back 1. a set of wd_URI corresponding to our wd_names,
# 2. update of ph_matrix, 3. not matched wd_names

def manipulate(spqrl_w_res, dataset_to_enhance):
    res_dic = {}
    res_NF_tem = set()
    res_F = set()
    for res in spqrl_w_res["results"]["bindings"]:
        for datum in dataset_to_enhance:
            if datum["ph_wd_URI"]:
                continue
            else:
                if datum["ph_wd_name"] not in res_dic:
                    if res["fLabel"]["value"] == datum["ph_wd_name"]:
                        res_F.add(res["f"]["value"])
                        new_pairs = {"ph_wd_URI": res["f"]["value"]}
                        res_dic.update({datum["ph_wd_name"]: new_pairs})
                        datum.update([("ph_wd_URI", res["f"]["value"])])
                    else:
                        res_NF_tem.add(datum["ph_wd_name"])
    res_NF_def = res_NF_tem - set(list(res_dic.keys()))
    print("labels matched: ", len(res_F))
    print("labels not found: ", len(res_NF_def))
    return res_F, res_NF_def

In [11]:
#import pandas as pd
# open source data with pandas
data_df = pd.read_csv("data/F_OA_selected_data.csv")

#initialize a photographers' frequency dataframe
ph_freq = pd.DataFrame(data_df["AUFN_Faut"].value_counts().reset_index().values, columns=["AUFN_Faut", "count"])

#extend dataframe columns to host the next data
ph_freq["ph_wd_name"], ph_freq["ph_wd_URI"], ph_freq["gender"], ph_freq["workplace"], ph_freq["lat"], ph_freq["lon"],\
ph_freq["born"], ph_freq["died"] = ["", "", "", "", "", "", "", ""]

In [12]:
#create the first string for the SPARQL query by 
#normalizing (personal) names in form "surname, name" to "name surname" as in wikidata
#and create a list of the modified names to be added to the dataframe

first_ph_names_string =""
ph_wd_name_list = []
for ph in ph_freq.index:
    ph_name = str(ph_freq["AUFN_Faut"][ph])
    # reverse only (personal) names in form "surname, name" > "name surname"
    if ", " in ph_name:
        ph_split = ph_name.split(", ")
        ph_wd_name = ph_split[1] + " " + ph_split[0]
    else:
        ph_wd_name = ph_name
    ph_wd_name_list.append(ph_wd_name)
    first_ph_names_string = first_ph_names_string + ph_wd_name + "|"
first_ph_names_string = first_ph_names_string[:-1]

#show a sample of the string and save it
print(first_ph_names_string[0:200])
with open("data/1_working_data/2_PHstring01.txt", "w", encoding="utf-8") as f:
    f.write(first_ph_names_string)

Anonimo|Fratelli Alinari|Romualdo Moscioni|Brogi|Giorgio Sommer|Jean Laurent|Incorpora|Giraudon|Paolo Lombardi|Naya|Carlo Baldassarre Simelli|Pietro Poppi|Séraphin-Médéric  Mieusement|Robert Rive|John
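The normalization of catalogue names into Wikidata-style labels can be sketched as a standalone function. `to_wikidata_label` is a hypothetical name; unlike the cell above, it splits only at the first comma, which is an assumption about how names with more than one comma should be handled.

```python
def to_wikidata_label(name):
    """Reverse 'surname, name' into 'name surname'; leave single-part names untouched."""
    if ", " in name:
        surname, forename = name.split(", ", 1)
        return forename + " " + surname
    return name

# Names taken from the photographers' column of the dataset
names = ["Alinari, Fratelli", "Moscioni, Romualdo", "Brogi"]
query_alternatives = "|".join(to_wikidata_label(name) for name in names)
print(query_alternatives)
```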


In [13]:
#add ph_wd_name_list to the dataframe and show a sample of the current dataframe
ph_freq["ph_wd_name"] = ph_wd_name_list
print(ph_freq.head(10))
#save the dataframe in a csv file and open it as a dictionary to iterate
ph_freq.to_csv("data/1_working_data/2_PH_freq_01.csv", encoding="utf-8")

            AUFN_Faut count         ph_wd_name ph_wd_URI gender workplace lat  \
0             Anonimo  1336            Anonimo                                  
1   Alinari, Fratelli   556   Fratelli Alinari                                  
2  Moscioni, Romualdo   159  Romualdo Moscioni                                  
3               Brogi   158              Brogi                                  
4     Sommer, Giorgio   147     Giorgio Sommer                                  
5       Laurent, Jean    73       Jean Laurent                                  
6           Incorpora    57          Incorpora                                  
7            Giraudon    55           Giraudon                                  
8     Lombardi, Paolo    53     Paolo Lombardi                                  
9                Naya    48               Naya                                  

  lon born died  
0                
1                
2                
3                
4                


In [14]:
ph_matrix = process_csv("data/1_working_data/2_PH_freq_01.csv")
first_ph_names_string = open('data/1_working_data/2_PHstring01.txt', 'r', encoding="utf-8").read()

#prepare the first query string to collect wikidata URI
first_ph_SPARQL_query = """
SELECT DISTINCT ?f ?fLabel
WHERE
{    { ?f wdt:P106 wd:Q33231 } UNION { ?f wdt:P31 wd:Q672070}. #P106_has_for_occupation wd:Q33231_photographer 
                                                                #P31_is instance wd:Q672070_studios
    ?f rdfs:label ?fLabel.
     FILTER regex(?fLabel, \" """+first_ph_names_string+""" \")
     FILTER(LANG(?fLabel) = "en").
}"""

#perform the first SPARQL query and result manipulation
first_ph_wd_res = query_endpoint("https://query.wikidata.org/bigdata/namespace/wdq/sparql", first_ph_SPARQL_query)
first_ph_manipulate = manipulate(first_ph_wd_res, ph_matrix)
first_F_set = first_ph_manipulate[0]
first_NF = first_ph_manipulate[1]
#check not found
print(first_NF)

labels matched:  45
labels not found:  67
{'Fotografia A. Premi', 'Pere Pallejá Domenech', 'Tuminello Lodovico', 'Giraudon', 'Robert MacPherson', 'Bruckmann Verlag', 'Dimitris Konstantinou', 'Pascal Sébah', 'Carl Prior Merlin', 'Vasari', 'Albert', 'Istituto Fotografico Antonio Fortunato Perini', 'Goupil & C.ie Editeurs', 'Giovanni Battista Unterverger', 'Lyon E.D.', 'Lombardi', 'Fratelli Esposito', 'J. Garrigues', 'Incorpora', 'Abdullah Frères', 'Stereoscopic Co.', 'A. Dumaine', 'Neurdein', 'Filippo Lais', 'J. Kuhn', 'Fratelli Amodio', 'Istituto Centrale per il Catalogo e la Documentazione: Fototeca Nazionale', 'Zedler & Vogel', 'G. Stuffler', 'Studio Fotografico Ciappei', 'Sommer & Behles', 'W. F. Mansell', 'Budtz Muller & Co.', 'Anonimo', 'William Lawrence', 'George Wilson Washington', 'Naya', 'Brogi', 'Guillaume Gustave Berggren', 'Poulton Series', 'Neue Photographische Gesellschaft', 'Giuseppe Polozzi', 'Edith Emily Coulson James', 'Francesco Venturi', 'Vincenzo Paganori', 'Johanne
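Several unmatched labels contain characters such as `.` or `&`, which behave differently inside a regex. One possible alternative, not the one used in this notebook, is to match labels exactly with a SPARQL `VALUES` clause. The sketch below only builds the query text (`build_label_values` is a hypothetical helper and the query is not executed here); exact matching is stricter than the regex filter, so the results may differ.

```python
def build_label_values(labels, lang="en"):
    """Build the body of a SPARQL VALUES clause from plain-text labels."""
    literals = []
    for label in labels:
        # Escape backslashes and double quotes so that each literal stays well formed
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        literals.append('"' + escaped + '"@' + lang)
    return " ".join(literals)

labels = ["Giorgio Sommer", "Goupil & C.ie Editeurs"]
values_body = build_label_values(labels)

query = """
SELECT DISTINCT ?f ?fLabel
WHERE {
  VALUES ?fLabel { """ + values_body + """ }
  ?f rdfs:label ?fLabel .
  { ?f wdt:P106 wd:Q33231 } UNION { ?f wdt:P31 wd:Q672070 }
}"""
print(query)
```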

In [15]:
#after revising first results, refine the unmatched labels
new_list = []
for ph_wd_NF in first_NF:
    if "  " in ph_wd_NF:
        ph_wd_new = ph_wd_NF.replace("  ", " ") #cancel double spaces
    elif "Fratelli" in ph_wd_NF:
        ph_wd_new = ph_wd_NF.replace("Fratelli", "") #cancel "Fratelli"
    elif "&" in ph_wd_NF:
        ph_wd_new = ph_wd_NF.replace("&", "and") #change "&" in "and"
    #check for corresponding form
    elif "Brogi" == ph_wd_NF:
        ph_wd_new = "Giacomo Brogi" 
    elif "Incorpora" == ph_wd_NF:
        ph_wd_new = "Giuseppe Incorpora"
    elif "Giraudon" == ph_wd_NF:
        ph_wd_new = "Adolphe Giraudon"
    else:
        continue
    new_list.append(ph_wd_new)
    for ph_data in ph_matrix:
        if ph_data["ph_wd_name"] == ph_wd_NF:
            ph_data.update([("ph_wd_name", ph_wd_new)])

#from the new modified names, by using the function, obtain a second string to query 
second_ph_string = write_string(new_list, "data/1_working_data/2_PHstring02.txt")
print(second_ph_string)

Adolphe Giraudon|Goupil and C.ie Editeurs| Esposito|Giuseppe Incorpora| Amodio|Zedler and Vogel|Sommer and Behles|Budtz Muller and Co.|Giacomo Brogi|Johannes Jaeger|Séraphin-Médéric Mieusement|Pierre Henry Voland|Clarke and Davies|P. Famin and Cie.|Alary and Geiser


In [16]:
#prepare the second query string to collect wikidata URI
second_ph_SPARQL_query = """
SELECT DISTINCT ?f ?fLabel
WHERE
{    { ?f wdt:P106 wd:Q33231 } UNION { ?f wdt:P31 wd:Q672070}. #P106_has_for_occupation wd:Q33231_photographer 
                                                                #P31_is instance wd:Q672070_studios
    ?f rdfs:label ?fLabel.
     FILTER regex(?fLabel, \" """+second_ph_string+""" \")
     FILTER(LANG(?fLabel) = "en").
}
"""
#perform the second SPARQL query and result manipulation
second_ph_wd_res = query_endpoint("https://query.wikidata.org/bigdata/namespace/wdq/sparql", second_ph_SPARQL_query)
second_manipulate = manipulate(second_ph_wd_res, ph_matrix)
second_F_set = second_manipulate[0]

labels matched:  5
labels not found:  62


In [17]:
#obtain the list of found wikidata URI
complex_F_set = second_F_set.union(first_F_set)

#prepare the third string to be passed in SPARQL query and save it
third_ph_string_URI =""
for F_URI in complex_F_set:
    third_ph_string_URI = third_ph_string_URI+"<"+F_URI+">"
    
with open("data/1_working_data/2_PHstring03.txt", "w", encoding="utf8") as f:
    f.write(third_ph_string_URI)

#third query
third_ph_SPARQL_query = """
SELECT DISTINCT ?ph ?genderLabel ?countryLabel ?birthyear ?deathyear
    WHERE
    { VALUES ?ph {"""+third_ph_string_URI+"""} 
        ?ph rdfs:label ?phLabel;
        wdt:P937 ?country; #P937_worklocation
        #wdt:P27 ?citiz;        
        wdt:P569 ?birth;
        wdt:P570 ?death.
        OPTIONAL {FILTER(LANG(?fLabel) = "en").
                    ?ph wdt:P21 ?gender;
                    #wdt:P937 ?worklocation; #P937_worklocation
        }
        BIND(year(?birth) AS ?birthyear)
        BIND(year(?death) AS ?deathyear)

        #BIND(COALESCE(?worklocation, ?citiz, "NaN") AS ?country).
        #BIND(IF(BOUND(?worklocation),?worklocation,?citiz) AS ?country).
    SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}     
    }"""
#OPTIONAL { ?ph wdt:P569 ?birthdate;        wdt:P570 ?deathdate.} we need them...

#perform the third query
third_ph_wd_res = query_endpoint("https://query.wikidata.org/bigdata/namespace/wdq/sparql", third_ph_SPARQL_query)

#manipulate results
wd_total_dic = {}
for result in third_ph_wd_res["results"]["bindings"]:
    item_key = result["ph"]["value"]
    #skip URIs already processed, so that each workplace is geocoded only once
    if item_key in wd_total_dic:
        continue
    workplace = result["countryLabel"]["value"]
    location = geolocator.geocode(workplace) #single request; None if the place is not found
    item_value = {"workplace": workplace,
                  "lat": location.latitude if location is not None else "",
                  "lon": location.longitude if location is not None else "",
                  "born": result["birthyear"]["value"],
                  "died": result["deathyear"]["value"]}
    wd_total_dic[item_key] = item_value
    for ph_data in ph_matrix:
        if ph_data["ph_wd_URI"] == item_key:
            ph_data.update(item_value)

#wd_total_list = list(wd_total_dic.values())
#print(wd_total_list)

#save the enhanced matrix
keys = ph_matrix[0].keys()
with open("data/2_PHfreq.csv", "w", encoding="utf-8", newline="") as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(ph_matrix)

#transform the enhanced matrix in a df and have a look at it
ph_freq_df = pd.DataFrame.from_dict(ph_matrix, orient='columns', dtype=None, columns=None)
print(ph_freq_df.head())

               AUFN_Faut count         ph_wd_name  \
0  0             Anonimo  1336            Anonimo   
1  1   Alinari, Fratelli   556   Fratelli Alinari   
2  2  Moscioni, Romualdo   159  Romualdo Moscioni   
3  3               Brogi   158      Giacomo Brogi   
4  4     Sommer, Giorgio   147     Giorgio Sommer   

                                 ph_wd_URI gender workplace        lat  \
0                                                                        
1   http://www.wikidata.org/entity/Q644689                               
2  http://www.wikidata.org/entity/Q3441292             Rome   41.89332   
3  http://www.wikidata.org/entity/Q2346257         Florence  43.769871   
4    http://www.wikidata.org/entity/Q64212           Naples  40.835885   

         lon  born  died  
0                         
1                         
2  12.482932  1849  1925  
3  11.255576  1822  1881  
4  14.248768  1834  1914  
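Variables placed in an `OPTIONAL` block may be missing from a result row, so reading them with plain indexing can raise a `KeyError`. A minimal sketch of a tolerant accessor follows; `binding_value` is a hypothetical helper and the row is a hand-written example shaped like SPARQL JSON results.

```python
def binding_value(result, variable, default=""):
    """Return the value bound to a variable in one SPARQL JSON result row, or a default."""
    return result.get(variable, {}).get("value", default)

# Hand-written row: ?genderLabel is not bound
row = {
    "ph": {"type": "uri", "value": "http://www.wikidata.org/entity/Q64212"},
    "countryLabel": {"type": "literal", "value": "Naples"},
}

workplace = binding_value(row, "countryLabel")
gender = binding_value(row, "genderLabel")
print(workplace, repr(gender))
```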


## 2.2 Work on places

In [18]:
#define function to store lat-lon from a list of places
def get_coordinates(places, df):
    check = set(df["place"].unique())
    for place in places:
        if place in check:
            continue
        location = geolocator.geocode(place) #one request per place
        if location is not None:
            lat = location.latitude
            lon = location.longitude
        else:
            lat = "NaN"
            lon = "NaN"
        #add the row in both cases, so that places not found are recorded too
        df.loc[len(df)] = [str(place), lat, lon]
        check.add(place)
    df.to_csv("data/3_PLcoordinates.csv", encoding="UTF-8")

In [19]:
#open the saved file, reduce columns and change column name, check first rows
ph_freq_df = pd.read_csv("data/2_PHfreq.csv", encoding="UTF-8", index_col=0)
places_F = ph_freq_df[['workplace', "lat", "lon"]].dropna().reset_index(drop=True)
places_F = places_F.rename(columns={"workplace": "place"}).sort_values(by=['place']).reset_index(drop=True)
places_F.head()

#extract towns unique names from original dataframe
data_df = pd.read_csv("data/F_OA_selected_data.csv", encoding="UTF-8")
towns_OA = data_df['PVCC_OAtown'].unique().tolist()

#obtain coordinates from the town list and store them in a df
get_coordinates(towns_OA, places_F)

In [27]:
#extract country unique names from original dataframe
data_df = pd.read_csv("data/F_OA_selected_data.csv", encoding="UTF-8")
countries_OA = data_df['PVCS_OAcountry'].unique().tolist()

# open the new file and enrich it with data from countries
places_df = pd.read_csv("data/3_PLcoordinates.csv", encoding="UTF-8", index_col=0)
get_coordinates(countries_OA, places_df)
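Since the same place names recur across photographers, towns and countries, caching the lookups would avoid repeated requests to the geocoding service. The sketch below wraps any geocoding callable in a dictionary-based cache; `make_cached_geocoder`, `FakeLocation` and `fake_geocode` are hypothetical stand-ins so that the example runs without network access.

```python
def make_cached_geocoder(geocode):
    """Wrap a geocoding callable so that each place is looked up only once."""
    cache = {}

    def lookup(place):
        if place not in cache:
            location = geocode(place)
            if location is None:
                cache[place] = (None, None)
            else:
                cache[place] = (location.latitude, location.longitude)
        return cache[place]

    return lookup

class FakeLocation:
    """Stand-in exposing the two attributes read from a geocoding result."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

calls = []

def fake_geocode(place):
    # Record every call to show that repeated places are not requested again
    calls.append(place)
    known = {"Rome": FakeLocation(41.89332, 12.482932)}
    return known.get(place)

lookup = make_cached_geocoder(fake_geocode)
coords = [lookup("Rome"), lookup("Rome"), lookup("Atlantis")]
print(coords, calls)
```

In the notebook the wrapper could presumably be built as `make_cached_geocoder(geolocator.geocode)`.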

## 2.3 Work on Annotations

In [20]:
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download xx_ent_wiki_sm
import spacy
from spacy.matcher import Matcher
from spacy.attrs import POS
#import pandas as pd


### 2.3.1 Finding personal experiences

In [21]:
#open the file with the annotation texts
with open("data/OAnotes_corpus.txt", mode="r", encoding="utf8") as f:
    contents = f.read()

#load the model
nlp = spacy.load("en_core_web_sm")

#look for the pattern "I" (lemma) + verb and list the matches
matcher = Matcher(nlp.vocab)
matcher.add("PA_creator", [[{"LEMMA": "I"}, {POS: 'VERB'}]])
doc = nlp(contents)
matches = matcher(doc)

matched = []
for match_id,start,end in matches:
    I_verb = str(doc[start:end])
    matched.append(I_verb)
#print(matched)

#look at the frequency of each match
matched_df = pd.DataFrame()
matched_df["Match"] = matched
matched_freq = pd.DataFrame(matched_df["Match"].value_counts().reset_index().values, columns=["Match", "count"])
print(matched_freq.head(25))

           Match count
0         I have     7
1        I think     4
2          I saw     4
3        I doubt     2
4         I page     2
5    I respected     1
6          I put     1
7      I believe     1
8        i corpi     1
9        me look     1
10        i farn     1
11            Im     1
12        I told     1
13  I discovered     1
14    i ritratti     1
15          I et     1
16     me posuit     1
17        i suoi     1
18     i diritti     1


In [22]:
#select the non-pertinent items to be removed
items_to_remove = {'I page', 'I et', 'i ritratti', 'Im', 'i suoi', 'i cittadini', 'i migliori', 'i farn', 'i diritti'}

#obtain the set of pertinent matches and a regex alternation to search for them in the Notes df
pertinent_matched_set = set(matched_freq["Match"])-items_to_remove
#print(pertinent_matched_set)

matched_list = "|".join(pertinent_matched_set)

print("Personal experiences (I+verbs) combinations in Notes: ", matched_list)

Personal experiences (I+verbs) combinations in Notes:  I have|I told|I doubt|I believe|I think|I put|I respected|me posuit|I saw|me look|I discovered|i corpi
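
Because `str.contains` interprets `matched_list` as a regular expression, a phrase containing a metacharacter (a dot, a parenthesis) would silently change the pattern. The following is a minimal sketch, using only the standard library, of building the same alternation with each phrase escaped and bounded by word boundaries. `build_pattern` and the sample notes are illustrative assumptions, not part of the notebook.

```python
import re

def build_pattern(phrases):
    """Join phrases into one alternation, escaping regex metacharacters."""
    # sorted() only makes the resulting pattern reproducible between runs
    escaped = [re.escape(p) for p in sorted(phrases)]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")

# hypothetical sample data, not taken from the fund
phrases = {"I saw", "I think", "I have"}
notes = [
    "I saw this relief in Naples.",
    "Porta dell'Arco. Volterra.",
    "The tower, I think, is medieval.",
]

pattern = build_pattern(phrases)
personal = [n for n in notes if pattern.search(n)]
print(personal)
```

The string in `pattern.pattern` could be passed to `Series.str.contains` in place of `matched_list`.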


In [23]:
#keep only the annotations that contain one of the matched combinations, then save them
OA_data = pd.read_csv("data/1_working_data/1_OAnotes03.csv")
personal_experiences_df = OA_data[OA_data["Note"].str.contains(matched_list, na=False)].reset_index(drop=True)
personal_experiences_df.to_csv("data/1_working_data/1_OAnotes04_Iverb.csv", encoding="UTF-8")
        
print("Number of personal experiences (I+verbs) combinations in Notes: ", personal_experiences_df.shape[0])
print(personal_experiences_df.head())

Number of personal experiences (I+verbs) combinations in Notes:  24
   Unnamed: 0  Unnamed: 0.1       Inv  \
0          20            20  sup 1009   
1          76            76   sup 188   
2         153           153  sup 1227   
3         170           170  sup 1259   
4         571           571  sup 1909   

                                                Note  
0  Porta della Sirena. Pastum. Lenormant thinks t...  
1  The so called Apotheosis of Augustus - Sacrest...  
2   I have but little doubs that the Columbarium ...  
3  Santa Chiara. Naples. Ancient sarcophagus with...  
4  Porta dell'Arco. Volterra. / The masonry for t...  


In [37]:
#save the restricted corpus
personal_experiences_df = pd.read_csv("data/1_working_data/1_OAnotes04_Iverb.csv", encoding="UTF-8")
restricted_corpus = ""
for OAnote in personal_experiences_df["Note"]:
    restricted_corpus = restricted_corpus+"---"+str(OAnote)+"---\n"
with open("data/OAnotes_corpus2.txt", "w", encoding="utf8") as file:
    file.write(restricted_corpus)

### 2.3.2 Finding dated and located personal experiences

In [30]:
#keep the personal-experience notes that contain a contemporary date (1840-1899)
personal_experiences_df = pd.read_csv("data/1_working_data/1_OAnotes04_Iverb.csv", encoding="utf8")
dated_personal_experiences_df = personal_experiences_df[personal_experiences_df["Note"].str.contains(r"18[4-9]\d", na=False)].reset_index(drop=True)
dated_personal_experiences_df.to_csv("data/1_working_data/1_OAnotes05_Iverb_data.csv", encoding="UTF-8")
print("Number of dated personal experiences in Notes: ", dated_personal_experiences_df.shape[0])

Number of dated personal experiences in Notes:  5
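
The filter above only says whether a note contains a date. As a hedged sketch of how candidate years could be listed before the manual extraction of time-place pairs, the snippet below pulls every four-digit year between 1840 and 1899 out of a note. `find_years` and the sample notes are hypothetical, and the candidates would still need to be checked by hand.

```python
import re

# years 1840-1899, not embedded in a longer number
YEAR_RE = re.compile(r"(?<!\d)18[4-9]\d(?!\d)")

def find_years(note):
    """Return all candidate years found in a note, as integers."""
    return [int(y) for y in YEAR_RE.findall(note)]

# hypothetical sample notes
notes = {
    "sup A": "I saw it at Chiusi in 1866, and again in 1871.",
    "sup B": "Built in 1284; inventory no. 18651.",
}
for inv, note in notes.items():
    print(inv, find_years(note))
```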


In [29]:
#the time-place pairs were extracted manually and saved in "data/1_working_data/6_Time-place_visited.csv"
visited_places_df = pd.read_csv("data/1_working_data/6_Time-place_visited.csv", encoding="utf8")
print(visited_places_df.head())

        Inv                                               Note    place  date  \
0   sup 218  Corneto Vitelleschi. Court of Cardinal Vitelle...  Corneto  1861   
1  sup 1351  This grand headless Etruscan lady came with th...   Chiusi  1866   
2  sup 1368  The sarcophagus is of solid porphyry; the six ...  Palermo  1892   
3  sup 1811  Agrippa. The celebrated statue which the great...    Corfù  1894   
4  sup 2470  This is the fine tower built in the 13.th cent...    Atene  1865   

                                           photo_url  
0  http://catalogo.fondazionezeri.unibo.it/scheda...  
1  <a href="http://catalogo.fondazionezeri.unibo....  
2  <a href="http://catalogo.fondazionezeri.unibo....  
3  http://catalogo.fondazionezeri.unibo.it/scheda...  
4  http://catalogo.fondazionezeri.unibo.it/scheda...  


In [34]:
#run the get_coordinates function on the newly named places
visited_places_df = pd.read_csv("data/1_working_data/6_Time-place_visited.csv", encoding="utf-8")
places_df = pd.read_csv("data/3_PLcoordinates.csv", encoding="UTF-8", index_col=0)
visited_places = visited_places_df['place'].unique().tolist()
get_coordinates(visited_places, places_df)

### 2.3.3 NER explorations

In [35]:
import pandas as pd
#define a function that runs NER on a text file and returns (and saves) a df of entity counts for one label
def get_entities(source_file_path, model, LABEL, path):
    with open(source_file_path, mode="r", encoding="utf8") as file:
        contents = file.read()
    NER = spacy.load(model)  # possible models for our Notes: it_core_news_md | xx_ent_wiki_sm | en_core_web_sm
    parsed = NER(contents)
    #collect the text of every entity carrying the requested label
    ent_list = [str(ent) for ent in parsed.ents if ent.label_ == LABEL]
    #count the occurrences of each entity, most frequent first
    ent_df = pd.DataFrame(ent_list, columns=["ent"])
    ent_freq = pd.DataFrame(ent_df["ent"].value_counts().reset_index().values, columns=["ent", "count"])
    ent_freq.to_csv(path, encoding="UTF-8")
    return ent_freq

#explore some NER extractions
PERSON = get_entities("data/OAnotes_corpus.txt", "en_core_web_sm", "PERSON", "data/1_working_data/4_OAentities0_PERSON.csv")
PER = get_entities("data/OAnotes_corpus.txt", "xx_ent_wiki_sm", "PER", "data/1_working_data/4_OAentities01_PER.csv")
print("Results for en_core_web_sm model: ")
print(PERSON.head(10))
print("Results for xx_ent_wiki_sm model: ")
print(PER.head(10))

Results for en_core_web_sm model: 
                ent count
0         Wolters N    20
1  Maria del Popolo    18
2              Zeus    16
3           Madonna    16
4             Dante    16
5           Bologna    15
6           Ravenna    15
7           Minerva    12
8          S. Maria    11
9            Dennis    11
Results for xx_ent_wiki_sm model: 
                ent count
0         Donatello    23
1            Christ    19
2           Perkins    18
3        Praxiteles    18
4              Zeus    17
5             Dante    15
6            N. Cat    15
7  Maria del Popolo    15
8            Hermes    14
9            Apollo    13
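
To compare the two models beyond reading the two tables side by side, one option is to look at which of the most frequent entities they share. The sketch below works on plain lists of entity strings; `compare_entities` and the toy lists are assumptions made for illustration and do not reproduce the real extraction.

```python
from collections import Counter

def compare_entities(ents_a, ents_b, top=10):
    """Return (shared, only_a, only_b) among the top-N entities of each list."""
    top_a = {e for e, _ in Counter(ents_a).most_common(top)}
    top_b = {e for e, _ in Counter(ents_b).most_common(top)}
    return top_a & top_b, top_a - top_b, top_b - top_a

# hypothetical entity lists standing in for the two models' outputs
ents_en = ["Zeus", "Zeus", "Dante", "Bologna", "Bologna"]
ents_xx = ["Zeus", "Dante", "Dante", "Donatello", "Donatello"]

shared, only_en, only_xx = compare_entities(ents_en, ents_xx)
print("shared:", sorted(shared))
print("only en_core_web_sm:", sorted(only_en))
print("only xx_ent_wiki_sm:", sorted(only_xx))
```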


In [38]:
#The "xx_ent_wiki_sm" model seems to return more pertinent results for our corpus than
# the "en_core_web_sm" model, so the remaining extractions use it whenever it provides
# the label ('PER', 'ORG', 'MISC', 'LOC')
get_entities("data/OAnotes_corpus2.txt", "xx_ent_wiki_sm", "PER", "data/1_working_data/4_OAentities01_PER2.csv")
get_entities("data/OAnotes_corpus.txt", "xx_ent_wiki_sm", "LOC", "data/1_working_data/4_OAentities02_LOC.csv")
get_entities("data/OAnotes_corpus2.txt", "xx_ent_wiki_sm", "LOC", "data/1_working_data/4_OAentities02_LOC2.csv")
get_entities("data/OAnotes_corpus.txt", "xx_ent_wiki_sm", "ORG", "data/1_working_data/4_OAentities03_ORG.csv")
get_entities("data/OAnotes_corpus2.txt", "xx_ent_wiki_sm", "ORG", "data/1_working_data/4_OAentities03_ORG2.csv")

#look at other NER labels, available only in en_core_web_sm: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC,
# MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
get_entities("data/OAnotes_corpus.txt", "en_core_web_sm", "DATE", "data/1_working_data/4_OAentities04_DATE.csv")
get_entities("data/OAnotes_corpus2.txt", "en_core_web_sm", "DATE", "data/1_working_data/4_OAentities04_DATE2.csv")
get_entities("data/OAnotes_corpus.txt", "en_core_web_sm", "WORK_OF_ART", "data/1_working_data/4_OAentities05_OA.csv")

#get_entities("data/OAnotes_corpus.txt", "en_core_web_sm", "LANGUAGE", "data/1_working_data/4_0Aentities06_LANGUAGE.csv")
#get_entities("data/OAnotes_corpus.txt", "en_core_web_sm", "MONEY", "data/1_working_data/4_OAentities07_MONEY.csv")

Unnamed: 0,ent,count
0,manoscritta,12
1,Paolo,2
2,La più,2
3,Fedeltà di Capua,1
4,the Stoa of Eumenes,1
5,Naim,1
6,Paolo e Giorgio,1
7,Specimen in the Museum,1
8,the Heraion of / Argos,1
9,Prison of Socrates,1


# 3. Data visualization


---
**Analysis**
The pandas library is used to examine our data. The steps of the project were:

    1. Data Preparation:
          - creation of two comprehensive xml files for the F and OA records coming from the Federico Zeri Foundation catalogues
          - extraction of the relevant information from the nested xml structure and structuring it in plain tabular format
    2. Data Elaboration: looking for further elements of analysis via:
          - deeper work on photographers to enhance their information
          - deeper work on places
          - work on unstructured annotations: NER
    3. Data Visualization


## 3.0. Data overview

In [49]:
!pip install pandas_profiling
import pandas as pd
import pandas_profiling as pp
%matplotlib inline



In [46]:
# parse the csv into a dataframe
data_df = pd.read_csv('data/F_OA_selected_data.csv')

# reduce the dataset to just the columns needed
data_sample_df = data_df[['INVN_F', 'PVCS_OAcountry', 'PVCC_OAtown', 'LDCN_OArep', 'PRVC_OAprev_town',
          'AUFN_Faut', 'LRD_Fshotdates', 'OGTT_OAtype', 'AUTB_Fsubj_main', 'OGTDOA_OAsubj_sub', 'AUTN_OAaut']]
#print(data_df.head(15))
data_sample_df.describe()

Unnamed: 0,INVN_F,PVCS_OAcountry,PVCC_OAtown,LDCN_OArep,PRVC_OAprev_town,AUFN_Faut,LRD_Fshotdates,OGTT_OAtype,AUTB_Fsubj_main,OGTDOA_OAsubj_sub,AUTN_OAaut
count,3260,3202,3169,2320,366,3260,3208,3260,3260,3260,3260
unique,3222,27,331,478,137,112,215,13,130,180,668
top,sup 2180,Italia,Roma,Museo Archeologico Nazionale,NR,Anonimo,1855-1899,scultura,Arte romana,monumento funebre,Anonimo sec. IV a.C.
freq,3,2006,417,155,88,1336,1271,1999,453,397,120


In [50]:
report = pp.ProfileReport(data_df, title="Partizione Antica Fund - overview")
report.to_file("data/2_data_viz/0.1.html")


In [51]:
report



## 3.1. Depicted works of art

### 3.1.1 Works of art for type - pie chart

In [53]:
!pip install plotly



In [56]:
import pandas as pd
import plotly.express as px
%matplotlib inline

# open the data and count photos by type of depicted work of art
data_df = pd.read_csv("data/F_OA_selected_data.csv", encoding="UTF-8")
OAt_df = pd.DataFrame(data_df["OGTT_OAtype"].value_counts().reset_index().values, columns=["OGTT_OAtype", "count"])

# create a pie chart of the types
fig1_1 = px.pie(OAt_df, values='count', names="OGTT_OAtype",
            title='Which types of works of art are most represented in the fund? Photographs per type of depicted work of art',
            color_discrete_sequence=px.colors.sequential.RdBu,
            hover_name = 'OGTT_OAtype',
            hover_data = {'OGTT_OAtype':False}
            )

fig1_1.write_html("data/2_data_viz/1.1.html")
fig1_1.show()

### 3.1.2 Works of art for object type - bar chart

In [57]:
import plotly.graph_objects as go

# open the data and count photos by object type of depicted work of art
data_df = pd.read_csv("data/F_OA_selected_data.csv", encoding="UTF-8")
OA_object_type_df = pd.DataFrame(data_df["OGTDOA_OAsubj_sub"].value_counts().reset_index().values, columns=["OGTDOA_OAsubj_sub", "count"])
OA_object_type_df = OA_object_type_df.rename(columns={"OGTDOA_OAsubj_sub": "Object_type"})

#keep the object types with at least 30 occurrences
OA_object_type_df = OA_object_type_df[OA_object_type_df["count"]>=30]

fig1_2 = px.bar(OA_object_type_df, x="Object_type", y="count",
                title="Photographs per object type of depicted work of art (>= 30 occurrences)",
                color="Object_type",
                color_discrete_sequence=px.colors.sequential.turbid,
                hover_data = {'Object_type':False, 'count':True}
                )
fig1_2.write_html("data/2_data_viz/1.2.html")
fig1_2.show()

### 3.1.3 Works of art for current location - map

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from geopy.geocoders import Nominatim
geolocator = Nominatim(timeout=10, user_agent = "myGeolocator")

import plotly
import plotly.graph_objects as go
import csv

# load the number of photos per country, with the coordinates of each country
df_data = pd.read_csv("OAcountry_freq.csv", encoding="utf8")
ph_geol = px.scatter_mapbox(df_data, lon=df_data['lon'],
                            lat=df_data['lat'], size=df_data["count"], zoom=2, color=df_data['PVCS_OAcountry'],
                            color_continuous_scale=px.colors.cyclical.Twilight,
                            #color_discrete_sequence=px.colors.sequential.RdBu,
                            title="Depicted OA",
                            size_max=80,
                            hover_name='PVCS_OAcountry',
                            hover_data={'PVCS_OAcountry':False, 'lat':False, 'lon':False})

# mapbox style
ph_geol.update_layout(mapbox_style='carto-positron')
ph_geol.show()
ph_geol.write_html("data/2_data_viz/1.3.html")

## 3.2 Photographers

### 3.2.1 Photographs for photographers - barchart

In [58]:
ph_df = pd.read_csv("data/2_PHfreq.csv", encoding="utf-8")
ph_df_main = ph_df.rename(columns={"AUFN_Faut": 'Photographer'})
#group the photographers with fewer than 20 photos under a single label
ph_df_main.loc[ph_df_main['count'] < 20, 'Photographer'] = 'Photographers with fewer than 20 photos'


fig2_1 = px.bar(ph_df_main, x="Photographer", y="count", title="Photographs per photographer",
                color="count",
                color_discrete_sequence=px.colors.qualitative.Dark24,
                hover_name = 'Photographer',
                hover_data = {'Photographer': False, 'workplace': True}
              )

fig2_1.write_html("data/2_data_viz/2.1.html")
fig2_1.show()

### 3.2.2 Anonymous photographers: depicted works of art locations and types - barchart

In [60]:
#open the source data, filter for anonymous photographs and extract places and type columns
data_df = pd.read_csv("data/F_OA_selected_data.csv", encoding="UTF-8")
ph_anonymous_df = data_df[data_df.AUFN_Faut == "Anonimo"]
ph_anonymous_df = ph_anonymous_df[["AUFN_Faut", 'PVCS_OAcountry', 'OGTT_OAtype', 'PVCC_OAtown']]

#relabel all the immovable types of works of art as "immovable"
ph_anonymous_df = ph_anonymous_df.replace(to_replace =r'architettura scultura|architettura|scultura|complesso archeologico|sito archeologico', value = 'immovable', regex = True)

#group by type and rename the columns
ph_anonymous_by_country_type = ph_anonymous_df.groupby(['PVCS_OAcountry', 'OGTT_OAtype']).size().reset_index()
ph_anonymous_by_country_type.columns = ["country", "type", "count"]

fig2_2 = px.bar(ph_anonymous_by_country_type,
                x="count", y="country", color="type",
                title="Distribution by country of the works of art depicted in anonymous photos (1336/3260)"
                )

fig2_2.write_html("data/2_data_viz/2.2.html")
fig2_2.show()

### 3.2.3 Anonymous photographers: immovable depicted works of art locations - map

In [61]:
#this cell reuses ph_anonymous_df as prepared in 3.2.2
#(anonymous photos only, with the immovable types relabelled as "immovable")

#filter just the immovable type and group by country-town; rename the columns
ph_anonymous_geo = ph_anonymous_df[ph_anonymous_df.OGTT_OAtype == "immovable"]
ph_anonymous_geo = ph_anonymous_geo.groupby(['PVCS_OAcountry', 'PVCC_OAtown']).size().reset_index()
ph_anonymous_geo.columns = ["country", "place", "count"]

#add geoloc infos to the data
geo_df = pd.read_csv("data/3_PLcoordinates.csv", encoding="UTF-8")
ph_anonymous_geo = ph_anonymous_geo.merge(geo_df, how='left', on="place")

fig2_3_bis = px.scatter_mapbox(ph_anonymous_geo,
                           lon=ph_anonymous_geo['lon'], lat=ph_anonymous_geo['lat'],
                           size=pd.to_numeric(ph_anonymous_geo["count"]),
                           zoom=3,
                            color=ph_anonymous_geo['country'],
                            color_continuous_scale=px.colors.cyclical.Twilight,
                            title="Photos for locations of immovable works of art depicted in anonymous photos",
                            size_max=80,
                            hover_name='place', #TODO: rename to city
                            hover_data={'place':False, 'lat':False, 'lon':False})

# mapbox style
fig2_3_bis.update_layout(mapbox_style='carto-positron')
fig2_3_bis.write_html("data/2_data_viz/2.3.bis.html")
fig2_3_bis.show()

## 3.3 Annotations

### 3.3.1 Complete/incomplete/missing transcriptions - pie chart

In [4]:
import pandas as pd
import plotly.express as px

#overall vision on Annotations
data_df = pd.read_csv('data/F_OA_selected_data.csv')
all_inv = data_df[["INVN_F", "DTZG_OAcentury"]]
all_inv = all_inv.rename(columns={"INVN_F": "Inv", "DTZG_OAcentury": "century"}).reset_index(drop=True)
OAnotes_df = pd.read_csv('data/1_working_data/1_OAnotes03.csv')

#label each note as complete or incomplete ("..." marks a partial transcription)
def check(row):
    if "..." in str(row["Note"]):
        status = "incomplete"
    else:
        status = "complete"
    return status

OAnotes_df["status"] = OAnotes_df.apply(check, axis=1)
merged = all_inv.merge(OAnotes_df, how='left', on="Inv").reset_index(drop=True)
merged['status'] = merged['status'].fillna('missing')

#count the photographs per transcription status
status_counts = merged.groupby("status").size().reset_index(name="count")
fig3_1 = px.pie(status_counts, values='count', names="status",
              title='Annotations on photographs (transcribed in OA entries)',
              color="status",
              color_discrete_map={
                  "complete": "rgb(211, 156, 131)",
                  "incomplete": "rgb(224, 194, 162)",
                  "missing": "rgb(237, 229, 207)"},
              hover_name='status',
              hover_data = {'status':True}
             )

fig3_1.show()
fig3_1.write_html("data/2_data_viz/3.1.html")
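
The three-way classification behind this chart can be checked on a toy example. The sketch below is a standard-library illustration under stated assumptions: the inventory numbers and notes are invented, and `transcription_status` is a hypothetical helper that treats a photograph without a note as missing and a note containing "..." as incomplete.

```python
from collections import Counter

def transcription_status(note):
    """Classify a note: None = missing, '...' inside = incomplete."""
    if note is None:
        return "missing"
    return "incomplete" if "..." in note else "complete"

# hypothetical inventory numbers and transcribed notes
all_inv = ["sup 1", "sup 2", "sup 3", "sup 4"]
notes = {"sup 1": "Porta della Sirena.", "sup 3": "The masonry ... arch."}

# photographs without an entry in `notes` count as missing
counts = Counter(transcription_status(notes.get(inv)) for inv in all_inv)
print(counts)
```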

### 3.3.2 Complete/incomplete/missing transcriptions by century of the work of art - barchart

In [4]:
#this cell reuses the `merged` dataframe built in the previous cell
#(inventory numbers, century and transcription status)

# group rows by century and status, then count the photographs in each group
data_by_year = merged.groupby(["century", "status"]).size().reset_index()
# rename the columns
data_by_year.columns = ["century", "status", "count"]
data_by_year.to_csv("data/1_working_data/5_Notes01.csv", encoding="UTF-8")

#when more than one date is given, keep just the first one
data_by_year = data_by_year.replace(to_replace ='sec. |,.*', value = '', regex = True)
#for a range (e.g. sec. I a.C./ I) keep the last one
data_by_year = data_by_year.replace(to_replace ='.*/', value = '', regex = True)
#remove the leading white space
data_by_year = data_by_year.replace(to_replace ='^ ', value = '', regex = True)
data_by_year = data_by_year.groupby(['century','status'], as_index=False)['count'].sum()
data_by_year.to_csv("data/1_working_data/5_Notes02.csv", encoding="UTF-8")

#manual revision, saved in data/1_working_data/5_Notes03.csv
data_by_year2 = pd.read_csv('data/1_working_data/5_Notes03.csv')

fig3_3_2 = px.bar(data_by_year2,
                x="century", y="count", color="status",
                title="Transcribed annotations over total photos for depicted work of art century",
                #color_discrete_sequence=px.colors.sequential.Brwnyl
                color_discrete_map={ # replaces default color mapping by value
                "complete": "rgb(211, 156, 131)", "incomplete": "rgb(224, 194, 162)", "missing": "rgb(237, 229, 207)"},
                )

fig3_3_2.write_html("data/2_data_viz/3.3.2.html")
fig3_3_2.show()
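
The century clean-up applied in this cell can be tried out on single strings. This is a minimal sketch that mirrors the same substitutions in a standalone function; `normalize_century` is a hypothetical name, the dot in "sec." is escaped here, and `strip()` trims both ends rather than only one leading space.

```python
import re

def normalize_century(value):
    """Reduce a catalogue century string to a single century label."""
    # drop the "sec. " prefix and anything after a comma (keep the first date)
    value = re.sub(r"sec\. |,.*", "", value)
    # for a range, keep only the part after the slash
    value = re.sub(r".*/", "", value)
    return value.strip()

for raw in ["sec. XV", "sec. XV, sec. XVI", "sec. I a.C./ I"]:
    print(repr(raw), "->", repr(normalize_century(raw)))
```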

### 3.3.3 Century distribution

### 3.3.4 Annotations and metadata: compared dates distribution

## 3.4 Derived infos: dated and located personal experiences from annotations - map

In [None]:
data_df = pd.read_csv('data/F_OA_selected_data.csv')
visited_places_ph_df = data_df[["INVN_F", "sercdf_F_ser"]]
visited_places_ph_df = visited_places_ph_df.rename(columns={"sercdf_F_ser": "Fentry_id", "INVN_F":"Inv"})

#attach the coordinates to the visited places
#(assumes 3_PLcoordinates.csv has "place", "lat" and "lon" columns, as used in 3.2.3)
visited_places_df = pd.read_csv("data/1_working_data/6_Time-place_visited.csv", encoding="utf-8")
places_df = pd.read_csv("data/3_PLcoordinates.csv", encoding="UTF-8", index_col=0)
visited_place_coor = visited_places_df.merge(places_df, how='left', on="place")

visited_place_coor = visited_place_coor.merge(visited_places_ph_df, how='left', on="Inv")
visited_place_coor = visited_place_coor.rename(columns={"Note":"Note_one_line",
                                                      "lat":"lats", "lon":"lons"})

#break the note every 30 characters so that it fits in the hover box
visited_place_coor['Note'] = visited_place_coor.apply(lambda row: ('<br>'.join(str(row.Note_one_line)[i:i+30] for i in range(0, len(str(row.Note_one_line)), 30))), axis = 1)
#build a link to the catalogue entry of the photograph
visited_place_coor['photo_url'] = visited_place_coor.apply(lambda row: ('<a href="http://catalogo.fondazionezeri.unibo.it/scheda/fotografia/'+str(row.Fentry_id)+'" target="_blank">http://catalogo.fondazionezeri.unibo.it/scheda/fotografia/'+str(row.Fentry_id)+'</a>'), axis = 1)

visited_place_coor.to_csv("data/1_working_data/6_02.csv", encoding="UTF-8")
print(visited_place_coor.head())

import plotly.express as px
#plot the visited places, with colour and size driven by the year of the visit
fig = px.scatter_geo(visited_place_coor, color="date",
                lat=visited_place_coor["lats"].values.tolist(),
                lon=visited_place_coor["lons"].values.tolist(),
                title="Places visited on known dates according to the annotations", size="date",
                projection="natural earth", scope="europe",
                hover_name='place',
                hover_data = {'place':False, 'Inv':True, 'photo_url':True, 'Note':True, 'lats': False, 'lons':False}
                )

fig.write_html("data/2_data_viz/4.1.html")
fig.show()
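
The hover text above is cut every 30 characters, which can split words in the middle. As a possible alternative, the sketch below uses the standard `textwrap` module to break on whitespace instead; `wrap_for_hover` is a hypothetical helper and the sample note is shortened for illustration.

```python
import textwrap

def wrap_for_hover(text, width=30):
    """Break text into lines of at most `width` characters, joined by <br>."""
    return "<br>".join(textwrap.wrap(text, width=width))

note = "This is the fine tower built in the 13th century near the gate."
print(wrap_for_hover(note))
```

It could replace the slicing expression inside the `apply` call, for instance as `wrap_for_hover(str(row.Note_one_line))`.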