# Final project

This Jupyter Notebook was created for the exam in __Electronic Publishing and Digital Storytelling__ taught by Prof. Marilena Daquino at the University of Bologna for the year 2021-2022.

The team project is composed by Alice Bordignon, Federico Cagnola and Gabriele Fiorenza. 

## About 
The project explores the relation between italian historians of __Dictionary of Art Historians (DoAH)__ and __ArtChives__ with keepers of their collections and fonds from a temporal and geospatial points of view. 

# Step 1
We imported the __.csv file__ of __DoAH__ filtered by nationality __('it')__ and manually integrated it with __Wikidata labels__.
We selected the columns we were interested in, removing the historians without any archive or keeper associated and reorganizing data.
Then, we created a pandas dataframe to easily store and manipulate this information.

In [14]:
import pandas as pd
import re
# python3 -m pip install qwikidata
# python library for working with sparql and linked data from WikiData
from qwikidata.sparql import return_sparql_query_results

In [8]:
# create first dataframe only using the specified columns 
data = pd.read_csv("DoAH_StoriciItaliani_integrato.csv", sep=",",
                    usecols=["Full Name", "Gender", "Collection", "Keeper"], encoding="utf-8")

# axis 0 to drop the rows, subset to only remove NaNs from the column Archives
data.dropna(axis=0, subset=["Keeper"], inplace=True)

# resetting the index because all deleted rows have changed the length of the dataframe
data.reset_index(inplace=True, drop=True)

# .pickle is a python serialization format for easy and quick read-write, and pandas supports it natively
data.to_pickle("00_first_db.pickle")

# the first table we have looks like this:
pd.set_option("display.max_rows", None)
data.head(120)

Unnamed: 0,Full Name,Gender,Collection,Keeper
0,"Accascina, Maria",female,,Comune di Palermo
1,"Agostini, Leonardo",male,,Scuola Normale Superiore
2,"Alfieri, Vittorio",male,,Biblioteca Medicea Laurenziana
3,"Alinari, Giuseppe",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia
4,"Alinari, Leopoldo",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia
5,"Arcangeli, Francesco",male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio
6,"Aretino, Pietro",male,Fondo Bongi,State Archives of Lucca
7,"Argan, Giulio Carlo",male,,Archivio privato a Roma
8,"Arias, Paolo Enrico",male,,Scuola Normale Superiore
9,"Baglione, Giovanni",male,,Archivio di Stato di Roma


We saved our starting database into __pickle format__. Pickle can be used to serialize Python object structures, which refers to the process of converting an object in the memory to a byte stream that can be stored as a binary file on disk. When we load it back to a Python program, this binary file can be de-serialized back to a Python object.
It is much __faster__ when compared to CSV files and __reduces the file size__ to almost half of CSV files using its compression techniques.

# Step 2 
Looking at the database, we faced the first problems:
1. Full names are reversed (`surname, name`). We created a function to fix them (`name surname`).
2. We need to have a controlled entity (`wd:xyz`) for each name and keeper, to be able to link them to other information.

### Step 2.1
#### Historian Names
We applied the `reformat_names` function on the historians' full names, removing duplicate whitespaces using regular expressions. 

In [9]:
def reformat_names(name):
    """ reverse names from surname,name format to name surname """
    l = name.split(", ")
    new = " ".join(reversed(l))
    # compile regex for multiple consecutive spaces
    return re.sub(r"\s+", " ", new)

In [10]:
# reverse names and remove duplicate whitespace
data["Full Name"] = data["Full Name"].apply(reformat_names)
data.describe()

Unnamed: 0,Full Name,Gender,Collection,Keeper
count,118,118,50,118
unique,118,2,49,77
top,Maria Accascina,male,Archivio Alinari,BEIC Digital Library
freq,1,108,2,10


### Step 2.2
#### Historian Entities 
Once the historians' names were fixed, we proceeded with the __search of the historians' entities on Wikidata.__
The __SPARQL query__ search for human individuals that speak Latin or Italian and work as: `art historian`, `historian`, `university teacher`, `archaeologist`, `artist`, `art critic`, `philosopher`, `antiquarian`, or `photographer`. The `{}`will be reserved to the historian's label in the database to match its Wikidata entity. 

The `find_historian_entity_from_name` function matches the value finding out the wd entity of each historian in the database. 

In [11]:
historian_entity_from_label = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artHistorian

WHERE {{
    ?artHistorian wdt:P31 wd:Q5 ;
                  wdt:P1412 ?language
                  FILTER (?language IN (wd:Q652, wd:Q397 ) ) 
    ?artHistorian wdt:P106 ?occupation
                  FILTER (?occupation IN (wd:Q1792450, wd:Q201788, wd:Q1622272, wd:Q3621491, wd:Q483501, wd:Q4164507, wd:Q4964182, wd:Q5697103, wd:Q33231 ) )    
    ?artHistorian rdfs:label ?o
                  FILTER ( str(?o) = "{}" )  .
}}
"""

In [12]:
def find_historian_entity_from_name(name: str):
    query = historian_entity_from_label.format(name)
    res = return_sparql_query_results(query_string=query)
    try:
        wdt_uri = res['results']['bindings'][0]['artHistorian']['value']
    except (IndexError, KeyError):
        return ""
    return wdt_uri.split("/")[-1]

We implemented the function adding a __new column__ (`Historian Entity`) for the entities found. 

In [13]:
data["Historian Entity"] = data["Full Name"].apply(find_historian_entity_from_name)

In [16]:
data.to_pickle("00_first_db.pickle")

In [14]:
pd.set_option("display.max_rows", None)
data.head(120)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332


We checked the results and integrated manually the missing 13 historian entities. Then, we saved the new database.

In [15]:
import numpy as np
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name            0
Gender               0
Collection          68
Keeper               0
Historian Entity    13
dtype: int64


In [17]:
import pandas as pd
data = pd.read_pickle("00_first_db.pickle")
data.to_json("00_first_db.json")


### Step 2.3
#### Keepers 
We did the same process with keepers. This time, the SPARQL query filter the search for the institutions of a certain type: `archive`, `library`, `university`, `museum`, `state archive`, `research institute`, `foundation`, `academy`, etc. The `{}`will be reserved to the keeper's label in the database to match its Wikidata entity.  

The `find_keeper_entity_from_label` function matches the value finding out the wd entity of each keeper in the database. 

In [1]:
import pandas as pd
from json import JSONDecodeError

In [61]:
data = pd.read_json("00_first_db.json")

In [62]:
keeper_entity_from_label = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?keeper

WHERE {{
    VALUES ?keeperRole {{wd:Q166118 wd:Q7075 wd:Q3953379 wd:Q3918 wd:Q33506 wd:Q17620767 wd:Q43229 wd:Q31855 wd:Q212805
                        wd:Q2352616 wd:Q1966910 wd:Q157031 wd:Q207694 wd:Q414147 wd:Q22806 wd:Q28564 wd:Q1329623 wd:Q44796387
                        wd:Q856234 wd:Q2122214 }}
    ?keeper wdt:P31 ?keeperRole .
    ?keeper rdfs:label ?o 
                  FILTER ( str(?o) = "{}" )  .
    
}}
"""

In [63]:
def find_keeper_entity_from_label(label: str):
    # remove trailing and leading whitespace
    label = label.strip()
    # substitute multiple spaces with a single one
    label = re.sub(r"\s+", " ", label)
    query = keeper_entity_from_label.format(label)
    try:
        res = return_sparql_query_results(query_string=query)
        wdt_uri = res['results']['bindings'][0]['keeper']['value']
    except (IndexError, KeyError, JSONDecodeError):
        return ""
    return wdt_uri.split("/")[-1]

We implemented the function adding a __new column__ (`Keeper Entity`) for the entities found. 

In [2]:
def create_keeper_col(data):
    # create a new column computing the keeper entity from the keeper label
    data["Keeper Entity"] = data["Keeper"].apply(find_keeper_entity_from_label)

After gathering entities for keepers a problem we found was that the research was really slow __(~35mins for the full dataframe lookup)__.
To speed things up this time we tried __splitting the dataframe in two equal parts__ (around index 58) and launching two separate threads, each on a portion of the dataframe. This way, if the SPARQL engine takes a long time to respond, we have 2 concurrent calls being made: it __reduced our running time to around 18 minutes__ for a full dataframe apply.

Since the approach seems to be successful we should probably try with 4/8 concurrent threads.

NB: this does __not speed up computation__, instead, when the operation is waiting for an IO task (waiting for sparql to respond with the JSON result for the query) it launches other requests or handle other responses without blocking.

In [65]:
# split df in half
df1 = data.iloc[:58, :]
df2 = data.iloc[58:, :]

# launch two threads running create_keeper_col on df1 and df2
from threading import Thread
t1 = Thread(target=create_keeper_col, args=(df1,))
t2 = Thread(target=create_keeper_col, args=(df2,))
t1.start()
t2.start()
# wait for the threads to finish
t1.join()
t2.join()

# concatenate the two dataframes
data = pd.concat([df1, df2], axis=0)

data.head(120)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Keeper Entity"] = data["Keeper"].apply(find_keeper_entity_from_label)


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424


In [66]:
data.to_pickle("01_second_db.pickle")

In [24]:
import pandas as pd
import numpy as np


data = pd.read_pickle("01_second_db.pickle")    
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name            0
Gender               0
Collection          68
Keeper               0
Historian Entity     0
Keeper Entity       19
dtype: int64


19 entities could not be found automatically: we exported the db to json and manually integrated those which were missing.

In [19]:
data.to_json("01_second_db.json")

# Step 3
### Database intersection: merge doAH and ARTchives historians 

After saving our new database in JSON format, we added and filled the missing keepers' entities manually. In some cases, we added __new items on Wikidata__ to give the organization a controlled entity. In the case of __private archives__, we decided to use the same entity (`wd:Q12161242`) to identify "archival collection or institution that is not accessible to the public".

The next step is to __integrate ARTchives historians, collections and keepers.__ 

In [310]:
#Runna da qui
import pandas as pd
data = pd.read_json("01_second_db.json")

### Find italian historians on ARTchives
We query the remote endpoint of ARTchives. The following __SPARQL query__ returns: 
1. Full Names of historians
2. Collections
3. Keepers 
4. Historian Entity
5. Keepers entity

Of the italian historians on ARTchives

In [15]:
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

In [65]:
ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
artchives_endpoint = "http://artchives.fondazionezeri.unibo.it/sparql"

In [66]:
artchives_italy = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdp: <http://www.wikidata.org/wiki/Property:>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?HistorianEntity (SAMPLE(?HistorianName) AS ?FullName) ?CollectionEntity (SAMPLE(?CollectionName) AS ?Collection) ?KeeperEntity (SAMPLE(?KeeperName) AS ?Keeper)
WHERE {
?HistorianEntity a wd:Q5 ; rdfs:label ?HistorianName ; wdp:P27 wd:Q38 .
?CollectionEntity wdp:P170 ?HistorianEntity ; rdfs:label ?CollectionName .
?KeeperEntity wdp:P1830 ?CollectionEntity ; rdfs:label ?KeeperName .
}  
  GROUP BY ?HistorianEntity ?CollectionEntity ?KeeperEntity
"""

In [67]:
# set the endpoint 
sparql_wd = SPARQLWrapper(artchives_endpoint)
# set the query
sparql_wd.setQuery(artchives_italy)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    Full_Name = result["FullName"]["value"]
    Hist_entity = result["HistorianEntity"]["value"].split("/")[-1]
    Coll = result["Collection"]["value"]
    Keeper = result["Keeper"]["value"]
    Keeper_entity = result["KeeperEntity"]["value"].split("/")[-1]
    print(Full_Name + "; " + Coll + "; " + Keeper + "; HIST: " + Hist_entity + "; KEEP: " + Keeper_entity)

Federico Zeri; Fototeca Zeri; Fondazione Federico Zeri; HIST: Q1089074; KEEP: Q23687322
Stefano Tumidei; Fototeca Stefano Tumidei; Fondazione Federico Zeri; HIST: Q55453618; KEEP: Q23687322
Luisa Vertova; Archivio Luisa Vertova; Fondazione Federico Zeri; HIST: Q61913691; KEEP: Q23687322
Eugenio Battisti; Battisti Eugenio (complex of fonds); Scuola Normale Superiore; HIST: Q1373290; KEEP: Q672416
Adolfo Venturi; Venturi Adolfo (complex of fonds); Scuola Normale Superiore; HIST: Q2824734; KEEP: Q672416
Luigi Salerno; Luigi Salerno research papers; Getty Research Institute; HIST: Q6700132; KEEP: Q11203476
Roberto Longhi; Archivio Longhi; Fondazione Roberto Longhi; HIST: Q1361667; KEEP: Q1634770
Cesare Brandi; Archive Cesare Brandi; Direzione regionale musei della Toscana; HIST: Q1056780; KEEP: Q108323065


Since some historians were already in the database, we __manually integrated the new ones__ into the `01_second_db.json` and saved the database:

118. Stefano Tumidei
119. Luisa Vertova
120. Eugenio Battisti
121. Cesare Brandi 

We can see now the integrated database. 

In [68]:
pd.set_option("display.max_rows", None)
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424


# Step 4 
### Use Wikidata entities to query relevant information about historians and keepers

We created some SPARQL queries to find out relevant information about historians and keepers to finalize our database: birth and death places and dates for historians, locations for keepers.

In [16]:
import pandas as pd
import numpy as np
from qwikidata.sparql import return_sparql_query_results
from json import JSONDecodeError
import time
from collections import defaultdict

In [3]:
data = pd.read_json("01_second_db.json")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416


### Step 4.1
#### Find keepers' place
The following SPARQL query returns the __administrative territorial entity__ of the keepers, filtered for __english labels__ only.

The `find_keepers_place` function checks the existence of the query results for each keeper, adding into the database the two new columns `Keeper Place` and `Keeper Place Label` containing the information found. __(da finire di spiegare)__   

In [4]:
keepers_place_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?keeperPlace ?keeperPlaceLabel
WHERE {{
    {} wdt:P131 ?keeperPlace .
    ?keeperPlace rdfs:label ?keeperPlaceLabel .
    FILTER (lang(?keeperPlaceLabel) = 'en')
}}
"""

In [7]:
def find_keepers_place(entity: str):
    query = keepers_place_query.format(f"wd:{entity}")
    try:
        res = return_sparql_query_results(query_string=query)
        return_entity = res['results']['bindings'][0]['keeperPlace']['value']
        return_label = res['results']['bindings'][0]['keeperPlaceLabel']['value']
    except (IndexError):
        return None
    return [return_entity.split("/")[-1], return_label]

found = defaultdict(lambda: "-")
entities = pd.Series(name="Keeper Place Entity")
labels = pd.Series(name="Keeper Place")
for index, row in data.iterrows():
    if found[row['Keeper Entity']] == "-":
        cnt = 0
        while cnt < 10:
            try:
                found[row['Keeper Entity']] = find_keepers_place(row["Keeper Entity"])
                break
            except (JSONDecodeError):
                time.sleep(0.5)
                cnt += 1 
                continue
    
    if found[row['Keeper Entity']]:
        entities.loc[index] = found[row['Keeper Entity']][0]
        labels.loc[index] = found[row['Keeper Entity']][1]
    else:
        print("Not found: ", row['Keeper'])

data = pd.concat([data, entities, labels], axis=1)

  if sys.path[0] == "":
  del sys.path[0]


Not found:  Archivio privato a Roma
Not found:  Getty Research Institute
Not found:  Lombard Institute Academy of Science and Letters
Not found:  Getty Research Institute
Not found:  Fondazione "Biblioteca Benedetto Croce"
Not found:  Lombardia Beni Culturali
Not found:  Fondazione Cassa di Risparmio di Perugia
Not found:  Getty Research Institute
Not found:  Getty Research Institute
Not found:  Archivio privato di Calcata
Not found:  Archivio privato di Meleto
Not found:  Getty Research Institute
Not found:  Istituto Lombardo Accademia di Scienze e Lettere
Not found:  Direzione regionale musei della Toscana


We will integrate the missing data later, we just save the new database for now. 

In [10]:
data.head(122)
data.to_pickle("02_place_db.pickle")

### Step 4.2 
#### Find date of birth and death of the historians 

In [21]:
from datetime import datetime
import pandas as pd
from collections import defaultdict
from json import JSONDecodeError
from qwikidata.sparql import return_sparql_query_results
import time
from requests.exceptions import ChunkedEncodingError

In [22]:
data = pd.read_pickle("02_place_db.pickle")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence
...,...,...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322,Q1891,Bologna
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416,Q13375,Pisa


In [23]:
historians_dob_dod_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dob ?dod
WHERE {{
    VALUES ?historian {historian} . 
    ?historian wdt:P569 ?dob .
    OPTIONAL {{ ?historian wdt:P570 ?dod }}.
}}
"""

In [14]:
def find_historian_dob_dod(entity: str):
    query = historians_dob_dod_query.format(historian=f"{{wd:{entity}}}")
    res = []
    sparql_res = return_sparql_query_results(query_string=query)
    try:
        if sparql_res['results']['bindings'][0]['dob']['type'] == "uri":
            # unknown dates are mapped to URIs
            res.append(None)
        else:
            res.append(sparql_res['results']['bindings'][0]['dob']['value'].rstrip("Z"))
    except (IndexError, KeyError):
        res.append(None)
    try:
        if sparql_res['results']['bindings'][0]['dod']['type'] == "uri":
            # unknown dates are mapped to URIs
            res.append(None)
        else:
            res.append(sparql_res['results']['bindings'][0]['dod']['value'].rstrip("Z"))
    except (IndexError, KeyError):
        res.append(None)
    return res

found = defaultdict(lambda: None)

dob_list = pd.Series(name="Historian Birth", dtype="datetime64[ns]")
dod_list = pd.Series(name="Historian Death", dtype="datetime64[ns]")

for index, row in data.iterrows():
    cnt = 0
    while cnt < 7:
        try:
            found[row['Historian Entity']] = find_historian_dob_dod(row["Historian Entity"])
            break
        except (JSONDecodeError, ChunkedEncodingError):
            time.sleep(0.5)
            cnt += 1 
            continue
    
    if any(found[row['Historian Entity']]):
        dob, dod = found[row['Historian Entity']]
        if dob:
            dob_list.loc[index] = datetime.fromisoformat(dob).date()
        else:
            dob_list.loc[index] = None
        if dod:
            dod_list.loc[index] = datetime.fromisoformat(dod).date()
        else:
            dod_list.loc[index] = None

data = pd.concat([data, dob_list, dod_list], axis=1)

In [16]:
data.head(122)
data.to_pickle("03_dob_dod_db.pickle")

### Step 4.3 
#### Find place of birth and place of death of the historians

In [8]:
from datetime import datetime
import pandas as pd
from collections import defaultdict
from json import JSONDecodeError
from qwikidata.sparql import return_sparql_query_results
import time
from requests.exceptions import ChunkedEncodingError

In [25]:
data = pd.read_pickle("03_dob_dod_db.pickle")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01
...,...,...,...,...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome,1918-01-22,2000-01-09
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna,1962-08-15,2008-05-09
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322,Q1891,Bologna,1921-01-01,2021-06-28
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416,Q13375,Pisa,1924-12-14,1989-10-17


In [26]:
historians_pob_pod_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?pob ?pod ?pobLabel ?podLabel
WHERE {{
    VALUES ?historian {historian} . 
  
    OPTIONAL {{ ?historian wdt:P19 ?pob }}.
    OPTIONAL {{ ?historian wdt:P20 ?pod }}.
    
    ?pob rdfs:label ?pobLabel 
      FILTER (lang(?pobLabel) = 'en') .

    ?pod rdfs:label ?podLabel
      FILTER (lang(?podLabel) = 'en') .
}}
"""

In [27]:

def find_historian_pob_pod(entity: str):
    query = historians_pob_pod_query.format(historian=f"{{wd:{entity}}}")
    res = dict()
    sparql_res = return_sparql_query_results(query_string=query)
    for v in ["pod", "pob", "podLabel", "pobLabel"]:
        try:
            res[v] = sparql_res['results']['bindings'][0][v]['value'] if v in {"podLabel", "pobLabel"} else sparql_res['results']['bindings'][0][v]['value'].split("/")[-1]
        except (IndexError, KeyError):
            res[v] = None
    return res

pob_list = pd.Series(name="Historian Birthplace Entity")
pobL_list = pd.Series(name="Historian Birthplace")

pod_list = pd.Series(name="Historian Deathplace Entity")
podL_list = pd.Series(name="Historian Deathplace")

found = defaultdict(lambda: None)
for index, row in data.iterrows():
    historian = row['Historian Entity']
    cnt = 0
    while cnt < 7:
        try:
            found[historian] = find_historian_pob_pod(historian)
            break
        except (JSONDecodeError, ChunkedEncodingError):
            time.sleep(0.5)
            cnt += 1 
            continue
    if found[historian] and any(found[historian].values()):
        if found[historian]['pob']:
            pob_list.loc[index] = found[historian]['pob']
        else:
            pob_list.loc[index] = None

        if found[historian]['pod']:
            pod_list.loc[index] = found[historian]['pod']
        else:
            pod_list.loc[index] = None

        if found[historian]['pobLabel']:
            pobL_list.loc[index] = found[historian]['pobLabel']
        else:
            pobL_list.loc[index] = None

        if found[historian]['podLabel']:
            podL_list.loc[index] = found[historian]['podLabel']
        else:
            podL_list.loc[index] = None
        
data = pd.concat([data, pod_list, podL_list, pob_list, pobL_list], axis=1)
data.head(120)


  pob_list = pd.Series(name="Historian Birthplace Entity")
  pobL_list = pd.Series(name="Historian Birthplace")
  pod_list = pd.Series(name="Historian Deathplace Entity")
  podL_list = pd.Series(name="Historian Deathplace")


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,Antonio Maria Zanetti,male,,Biblioteca Nazionale Marciana,Q944948,Q578460,Q641,Venice,1680-02-20,1757-12-31,Q641,Venice,Q641,Venice
116,Federico Zeri,male,Federico Zeri Archives,Fondazione Federico Zeri,Q1089074,Q23687322,Q1891,Bologna,1921-08-12,1998-10-05,Q242942,Mentana,Q220,Rome
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome,1918-01-22,2000-01-09,Q220,Rome,Q220,Rome
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna,1962-08-15,2008-05-09,Q1891,Bologna,Q13367,Forlì


In [28]:
# save the data to a pickle file named  "04_full_db.pickle"
data.to_pickle("04_full_db.pickle")


Now we just need to fill the few missing data manually, and save our final database. 

In [18]:
!pip3 install pickle5
import pickle5 as pickle
import numpy as np
with open("04_full_db.pickle", "rb") as fh:
  data = pickle.load(fh)
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name                       0
Gender                          0
Collection                     68
Keeper                          0
Historian Entity                0
Keeper Entity                   0
Keeper Place Entity            14
Keeper Place                   14
Historian Birth                 1
Historian Death                 3
Historian Deathplace Entity    14
Historian Deathplace           14
Historian Birthplace Entity    14
Historian Birthplace           14
dtype: int64


In [12]:
data.to_json("04_full_db.json")
#Caso 88, Charlotte-Catherine Patin è parigina 

Since some information were not available even manually, we left some missing data to not undermine the objectivity of the analysis. Some keepers place's labels were also fixed because they were referring the neighborhoods rather than the city.    

((We also removed Charlotte-Catherine Patin from our historians, because she was born in Paris.))

In [29]:
data = pd.read_json("04_full_db.json")
#pd.set_option("display.max_rows", None)
#data.head(120) 

In [30]:
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name                       0
Gender                          0
Collection                     68
Keeper                          0
Historian Entity                0
Keeper Entity                   0
Keeper Place Entity             2
Keeper Place                    2
Historian Birth                 1
Historian Death                 3
Historian Deathplace Entity     6
Historian Deathplace            6
Historian Birthplace Entity     3
Historian Birthplace            3
dtype: int64


In [31]:
#data = data.drop(data.index[88]) #eliminare sta Charlotte? 
pd.set_option("display.max_rows", None)
data.head(120) 

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,-2272061000000.0,283996800000.0,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,6572274000000.0,9187429000000.0,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,-6972739000000.0,-5245862000000.0,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,-4218394000000.0,-2514758000000.0,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,-4354906000000.0,-3313440000000.0,Q2044,Florence,Q2044,Florence
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645,Q1891,Bologna,-1719274000000.0,130032000000.0,Q1891,Bologna,Q1891,Bologna
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654,Q13373,Lucca,3370376000000.0,5408379000000.0,Q641,Venice,Q13378,Arezzo
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242,Q220,Rome,-1913242000000.0,721526400000.0,Q220,Rome,Q495,Turin
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416,Q13375,Pisa,-1971130000000.0,912643200000.0,Q13375,Pisa,Q34130,Vittoria
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424,Q220,Rome,5697733000000.0,8190632000000.0,Q220,Rome,Q220,Rome
