# Step 1
We imported the .csv file of the __Dictionary of Art Historians (DoAH)__ filtered by nationality __('it')__ and manually integrated with __Wikidata labels__.
We selected the columns we were interested in, removing the historians without any archive or keeper associated and reorganizing data.
Then, we created a pandas dataframe to easily store and manipulate this information.

In [10]:
import pandas as pd
import re
# python3 -m pip install qwikidata
# python library for working with sparql and linked data from WikiData
from qwikidata.sparql import return_sparql_query_results

In [8]:
# create first dataframe only using the specified columns 
data = pd.read_csv("DoAH_StoriciItaliani_integrato.csv", sep=",",
                    usecols=["Full Name", "Gender", "Collection", "Keeper"], encoding="utf-8")

# axis 0 to drop the rows, subset to only remove NaNs from the column Archives
data.dropna(axis=0, subset=["Keeper"], inplace=True)

# resetting the index because all deleted rows have changed the length of the dataframe
data.reset_index(inplace=True, drop=True)

# .pickle is a python serialization format for easy and quick read-write, and pandas supports it natively
data.to_pickle("00_first_db.pickle")

# the first table we have looks like this:
pd.set_option("display.max_rows", None)
data.head(120)

Unnamed: 0,Full Name,Gender,Collection,Keeper
0,"Accascina, Maria",female,,Comune di Palermo
1,"Agostini, Leonardo",male,,Scuola Normale Superiore
2,"Alfieri, Vittorio",male,,Biblioteca Medicea Laurenziana
3,"Alinari, Giuseppe",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia
4,"Alinari, Leopoldo",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia
5,"Arcangeli, Francesco",male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio
6,"Aretino, Pietro",male,Fondo Bongi,State Archives of Lucca
7,"Argan, Giulio Carlo",male,,Archivio privato a Roma
8,"Arias, Paolo Enrico",male,,Scuola Normale Superiore
9,"Baglione, Giovanni",male,,Archivio di Stato di Roma


We saved our starting database into __pickle format__. Pickle can be used to serialize Python object structures, which refers to the process of converting an object in the memory to a byte stream that can be stored as a binary file on disk. When we load it back to a Python program, this binary file can be de-serialized back to a Python object.
It is much __faster__ when compared to CSV files and __reduces the file size__ to almost half of CSV files using its compression techniques.

# Step 2 
Looking at the database, we faced the first problems:
1. Full names are reversed (`surname, name`). We created a function to fix them (`name surname`).
2. We need to have a controlled entity (`wd:xyz`) for each name and keeper, to be able to link them to other information.

### Step 2.1 
We applied the `reformat_names` function on the historians' full names, removing duplicate whitespaces using regular expressions. 

In [9]:
def reformat_names(name):
    """ reverse names from surname,name format to name surname """
    l = name.split(", ")
    new = " ".join(reversed(l))
    # compile regex for multiple consecutive spaces
    return re.sub(r"\s+", " ", new)

In [10]:
# reverse names and remove duplicate whitespace
data["Full Name"] = data["Full Name"].apply(reformat_names)
data.describe()

Unnamed: 0,Full Name,Gender,Collection,Keeper
count,118,118,50,118
unique,118,2,49,77
top,Maria Accascina,male,Archivio Alinari,BEIC Digital Library
freq,1,108,2,10


### Step 2.2
Once the historians' names were fixed, we proceeded with the search of the historians' entity on Wikidata.
The SPARQL query search for all the human individuals that speaks latin or italian and works as one of the listed occupations: `art historian`, `historian`, `university teacher`, `archaeologist`, `artist`, `art critic`, `philosopher`, `antiquarian`, `photographer`. The `{}`will be reserved to the historian label in the database to match its corresponding wikidata entity. 

The `find_historian_entity_from_name` function matches the value finding out the wd entity. 

In [11]:
historian_entity_from_label = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artHistorian

WHERE {{
    ?artHistorian wdt:P31 wd:Q5 ;
                  wdt:P1412 ?language
                  FILTER (?language IN (wd:Q652, wd:Q397 ) ) 
    ?artHistorian wdt:P106 ?occupation
                  FILTER (?occupation IN (wd:Q1792450, wd:Q201788, wd:Q1622272, wd:Q3621491, wd:Q483501, wd:Q4164507, wd:Q4964182, wd:Q5697103, wd:Q33231 ) )    
    ?artHistorian rdfs:label ?o
                  FILTER ( str(?o) = "{}" )  .
}}
"""

In [12]:
def find_historian_entity_from_name(name: str):
    query = historian_entity_from_label.format(name)
    res = return_sparql_query_results(query_string=query)
    try:
        wdt_uri = res['results']['bindings'][0]['artHistorian']['value']
    except (IndexError, KeyError):
        return ""
    return wdt_uri.split("/")[-1]

Finally, we implemented the function on the `Full Name` column, adding a new one (`Historian Entity`) reserved to the entities. 

In [13]:
data["Historian Entity"] = data["Full Name"].apply(find_historian_entity_from_name)

In [16]:
data.to_pickle("00_first_db.pickle")

In [14]:
pd.set_option("display.max_rows", None)
data.head(120)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332


In [15]:
import numpy as np

print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name            0
Gender               0
Collection          68
Keeper               0
Historian Entity    13
dtype: int64


In [17]:
import pandas as pd
data = pd.read_pickle("00_first_db.pickle")
data.to_json("00_first_db.json")


# Step 3
___
### add a column with controlled entities for institutions with the role of "keeper"

In [1]:
import pandas as pd
from json import JSONDecodeError

In [61]:
data = pd.read_json("00_first_db.json")

In [62]:
keeper_entity_from_label = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?keeper

WHERE {{
    VALUES ?keeperRole {{wd:Q166118 wd:Q7075 wd:Q3953379 wd:Q3918 wd:Q33506 wd:Q17620767 wd:Q43229 wd:Q31855 wd:Q212805
                        wd:Q2352616 wd:Q1966910 wd:Q157031 wd:Q207694 wd:Q414147 wd:Q22806 wd:Q28564 wd:Q1329623 wd:Q44796387
                        wd:Q856234 wd:Q2122214 }}
    ?keeper wdt:P31 ?keeperRole .
    ?keeper rdfs:label ?o 
                  FILTER ( str(?o) = "{}" )  .
    
}}
"""

In [63]:
def find_keeper_entity_from_label(label: str):
    # remove trailing and leading whitespace
    label = label.strip()
    # substitute multiple spaces with a single one
    label = re.sub(r"\s+", " ", label)
    query = keeper_entity_from_label.format(label)
    try:
        res = return_sparql_query_results(query_string=query)
        wdt_uri = res['results']['bindings'][0]['keeper']['value']
    except (IndexError, KeyError, JSONDecodeError):
        return ""
    return wdt_uri.split("/")[-1]

In [64]:
def create_keeper_col(data):
    # create a new column computing the keeper entity from the keeper label
    data["Keeper Entity"] = data["Keeper"].apply(find_keeper_entity_from_label)

After gathering entities for art historians a problem we found was that the research was really slow (~35mins for the full dataframe lookup).
To speed things up this time we tried splitting the dataframe in two equal parts (around index 58) and launching two separate threads, each on a portion of the dataframe. This way, if the SPARQL engine takes a long time to respond, we have 2 concurrent calls being made: it reduced our running time to around 18 minutes for a full dataframe apply.

Since the approach seems to be successful we should probably try with 4/8 concurrent threads.

NB: this does not __speed up computation__, instead, when the operation is waiting for an IO task (waiting for sparql to respond with the JSON result for the query) it launches other requests or handle other responses without blocking.

In [65]:
# split df in half
df1 = data.iloc[:58, :]
df2 = data.iloc[58:, :]

# launch two threads running create_keeper_col on df1 and df2
from threading import Thread
t1 = Thread(target=create_keeper_col, args=(df1,))
t2 = Thread(target=create_keeper_col, args=(df2,))
t1.start()
t2.start()
# wait for the threads to finish
t1.join()
t2.join()

# concatenate the two dataframes
data = pd.concat([df1, df2], axis=0)

data.head(120)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Keeper Entity"] = data["Keeper"].apply(find_keeper_entity_from_label)


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424


In [66]:
data.to_pickle("01_second_db.pickle")

In [24]:
import pandas as pd
import numpy as np


data = pd.read_pickle("01_second_db.pickle")    
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name            0
Gender               0
Collection          68
Keeper               0
Historian Entity     0
Keeper Entity       19
dtype: int64


19 entities could not be found automatically: we exported the db to json and manually integrated those which were missing.

In [19]:
data.to_json("01_second_db.json")

# Step 4
## Db intersection: merge doAH and Artchives historians 

After saving our new database in JSON format, we added and filled the missing keepers' entities manually. In some cases, we added __new items on Wikidata__ to give the organization a controlled entity. In the case of __private archives__, we decided to use the same entity (wd:Q12161242) to identify "archival collection or institution that is not accessible to the public".

The next step is to integrate Artchives historians, collections and keepers and finalize our database. 

In [310]:
#Runna da qui
import pandas as pd
data = pd.read_json("01_second_db.json")

# Find italian historians on Artchives
The following SPARQL query returns: 
1. Full Names of historians
2. __Gender__: not available on artchives, needs to be integrated
3. Collections
4. Keepers 
5. Historian Entity
6. Keepers entity

Of the 8 italian historians on artchives

In [64]:
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

In [65]:
ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
artchives_endpoint = "http://artchives.fondazionezeri.unibo.it/sparql"

In [66]:
artchives_italy = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdp: <http://www.wikidata.org/wiki/Property:>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?HistorianEntity (SAMPLE(?HistorianName) AS ?FullName) ?CollectionEntity (SAMPLE(?CollectionName) AS ?Collection) ?KeeperEntity (SAMPLE(?KeeperName) AS ?Keeper)
WHERE {
?HistorianEntity a wd:Q5 ; rdfs:label ?HistorianName ; wdp:P27 wd:Q38 .
?CollectionEntity wdp:P170 ?HistorianEntity ; rdfs:label ?CollectionName .
?KeeperEntity wdp:P1830 ?CollectionEntity ; rdfs:label ?KeeperName .
}  
  GROUP BY ?HistorianEntity ?CollectionEntity ?KeeperEntity
"""

In [67]:
# set the endpoint 
sparql_wd = SPARQLWrapper(artchives_endpoint)
# set the query
sparql_wd.setQuery(artchives_italy)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    Full_Name = result["FullName"]["value"]
    Hist_entity = result["HistorianEntity"]["value"].split("/")[-1]
    Coll = result["Collection"]["value"]
    Keeper = result["Keeper"]["value"]
    Keeper_entity = result["KeeperEntity"]["value"].split("/")[-1]
    print(Full_Name + "; " + Coll + "; " + Keeper + "; HIST: " + Hist_entity + "; KEEP: " + Keeper_entity)

Federico Zeri; Fototeca Zeri; Fondazione Federico Zeri; HIST: Q1089074; KEEP: Q23687322
Stefano Tumidei; Fototeca Stefano Tumidei; Fondazione Federico Zeri; HIST: Q55453618; KEEP: Q23687322
Luisa Vertova; Archivio Luisa Vertova; Fondazione Federico Zeri; HIST: Q61913691; KEEP: Q23687322
Eugenio Battisti; Battisti Eugenio (complex of fonds); Scuola Normale Superiore; HIST: Q1373290; KEEP: Q672416
Adolfo Venturi; Venturi Adolfo (complex of fonds); Scuola Normale Superiore; HIST: Q2824734; KEEP: Q672416
Luigi Salerno; Luigi Salerno research papers; Getty Research Institute; HIST: Q6700132; KEEP: Q11203476
Roberto Longhi; Archivio Longhi; Fondazione Roberto Longhi; HIST: Q1361667; KEEP: Q1634770
Cesare Brandi; Archive Cesare Brandi; Direzione regionale musei della Toscana; HIST: Q1056780; KEEP: Q108323065


Since some historians were already in the database, we manually added the new ones:

118. Stefano Tumidei
119. Luisa Vertova
120. Eugenio Battisti
121. Cesare Brandi 

# Final Database (doAH + ARTchives)

In [68]:
pd.set_option("display.max_rows", None)
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424


## Use wikidata entities to query relevant information about historians and keepers

We created some queries to find out relevant information about historians and keepers to finalize our database: birth and death places and dates for historians, locations for keepers.

In [129]:
import pandas as pd
import numpy as np
from qwikidata.sparql import return_sparql_query_results
from json import JSONDecodeError
import time
from collections import defaultdict

In [130]:
data = pd.read_json("01_second_db.json")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416


In [131]:
keepers_place_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?keeperPlace ?keeperPlaceLabel
WHERE {{
    {} wdt:P131 ?keeperPlace .
    ?keeperPlace rdfs:label ?keeperPlaceLabel .
    FILTER (lang(?keeperPlaceLabel) = 'en')
}}
"""

In [132]:
def find_keepers_place(entity: str):
    query = keepers_place_query.format(f"wd:{entity}")
    try:
        res = return_sparql_query_results(query_string=query)
        return_entity = res['results']['bindings'][0]['keeperPlace']['value']
        return_label = res['results']['bindings'][0]['keeperPlaceLabel']['value']
    except (IndexError):
        return None
    return [return_entity.split("/")[-1], return_label]

found = defaultdict(lambda: "-")
entities = pd.Series(name="Keeper Place Entity")
labels = pd.Series(name="Keeper Place")
for index, row in data.iterrows():
    if found[row['Keeper Entity']] == "-":
        cnt = 0
        while cnt < 10:
            try:
                found[row['Keeper Entity']] = find_keepers_place(row["Keeper Entity"])
                break
            except (JSONDecodeError):
                time.sleep(0.5)
                cnt += 1 
                continue
    
    if found[row['Keeper Entity']]:
        entities.loc[index] = found[row['Keeper Entity']][0]
        labels.loc[index] = found[row['Keeper Entity']][1]
    else:
        print("Not found: ", row['Keeper'])

pd.concat([data, entities, labels], axis=1)

  entities = pd.Series(name="Keeper Place Entity")
  labels = pd.Series(name="Keeper Place")


Not found:  Archivio privato a Roma
Not found:  Getty Research Institute
Not found:  Lombard Institute Academy of Science and Letters
Not found:  Getty Research Institute
Not found:  Fondazione "Biblioteca Benedetto Croce"
Not found:  Lombardia Beni Culturali
Not found:  Fondazione Cassa di Risparmio di Perugia
Not found:  Getty Research Institute
Not found:  Getty Research Institute
Not found:  Archivio privato di Calcata
Not found:  Archivio privato di Meleto
Not found:  Getty Research Institute
Not found:  Istituto Lombardo Accademia di Scienze e Lettere
Not found:  Direzione regionale musei della Toscana


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q2044,Florence
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence
...,...,...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322,Q1891,Bologna
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416,Q2044,Florence


In [134]:
data.to_pickle("02_place_db.pickle")

NEW STEP: find date of birth and death

In [72]:
from datetime import datetime
import pandas as pd
from collections import defaultdict
from json import JSONDecodeError
from qwikidata.sparql import return_sparql_query_results
import time

In [73]:

data = pd.read_pickle("02_place_db.pickle")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416


In [74]:
historians_dob_dod_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dob ?dod
WHERE {{
    VALUES ?historian {historian} . 
    ?historian wdt:P569 ?dob .
    OPTIONAL {{ ?historian wdt:P570 ?dod }}.
}}
"""

In [75]:
def find_historian_dob_dod(entity: str):
    query = historians_dob_dod_query.format(historian=f"{{wd:{entity}}}")
    res = []
    sparql_res = return_sparql_query_results(query_string=query)
    try:
        if sparql_res['results']['bindings'][0]['dob']['type'] == "uri":
            # unknown dates are mapped to URIs
            res.append(None)
        else:
            res.append(sparql_res['results']['bindings'][0]['dob']['value'].rstrip("Z"))
    except (IndexError, KeyError):
        res.append(None)
    try:
        if sparql_res['results']['bindings'][0]['dod']['type'] == "uri":
            # unknown dates are mapped to URIs
            res.append(None)
        else:
            res.append(sparql_res['results']['bindings'][0]['dod']['value'].rstrip("Z"))
    except (IndexError, KeyError):
        res.append(None)
    return res

found = defaultdict(lambda: None)

dob_list = pd.Series(name="Historian Birth", dtype="datetime64[ns]")
dod_list = pd.Series(name="Historian Death", dtype="datetime64[ns]")

for index, row in data.iterrows():
    cnt = 0
    while cnt < 10:
        try:
            found[row['Historian Entity']] = find_historian_dob_dod(row["Historian Entity"])
            break
        except (JSONDecodeError):
            time.sleep(0.5)
            cnt += 1 
            continue
    
    if any(found[row['Historian Entity']]):
        dob, dod = found[row['Historian Entity']]
        if dob:
            dob_list.loc[index] = datetime.fromisoformat(dob).date()
        else:
            dob_list.loc[index] = None
        if dod:
            dod_list.loc[index] = datetime.fromisoformat(dod).date()
        else:
            dod_list.loc[index] = None

data = pd.concat([data, dob_list, dod_list], axis=1)

Q98804253 1898-01-01T00:00:00 1979-01-01T00:00:00
Q1054161 1593-09-18T00:00:00 1676-08-01T00:00:00
Q296244 1749-01-16T00:00:00 1803-10-08T00:00:00
Q18934975 1836-04-29T00:00:00 1890-04-24T00:00:00
Q16164590 1832-01-01T00:00:00 1865-01-01T00:00:00
Q1121086 1915-07-10T00:00:00 1974-02-14T00:00:00
Q296272 1492-04-01T00:00:00 1556-10-31T00:00:00
Q778445 1909-05-17T00:00:00 1992-11-12T00:00:00
Q3894011 1907-07-17T00:00:00 1998-12-03T00:00:00
Q983332 1566-01-01T00:00:00 1644-12-30T00:00:00
Q979574 1625-06-03T00:00:00 1697-01-01T00:00:00
Q358348 1488-10-16T00:00:00 1560-02-17T00:00:00
Q3840434 1894-07-13T00:00:00 1978-02-17T00:00:00
Q19754060 1927-04-02T00:00:00 2016-01-01T00:00:00
Q17279844 1905-08-12T00:00:00 1956-03-03T00:00:00
Q1058859 1936-07-07T00:00:00 2011-04-26T00:00:00
Q19753744 1882-10-03T00:00:00 1955-03-06T00:00:00
Q471179 1900-02-19T00:00:00 1975-01-17T00:00:00
Q5563896 1717-09-30T00:00:00 1781-01-01T00:00:00
Q18934852 1480-01-01T00:00:00 1536-01-01T00:00:00
Q16657844 1500-10-05

In [77]:
data.head(122)
data.to_pickle("03_dob_dod_db.pickle")

# NEW STEP: ADD place of birth and place of death

In [81]:
from datetime import datetime
import pandas as pd
from collections import defaultdict
from json import JSONDecodeError
from qwikidata.sparql import return_sparql_query_results
import time

In [82]:
data = pd.read_pickle("03_dob_dod_db.pickle")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Historian Birth,Historian Death
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,1898-01-01,1979-01-01
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,1593-09-18,1676-08-01
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,1749-01-16,1803-10-08
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,1836-04-29,1890-04-24
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,1832-01-01,1865-01-01
...,...,...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,1918-01-22,2000-01-09
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,1962-08-15,2008-05-09
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322,1921-01-01,2021-06-28
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416,1924-12-14,1989-10-17


In [83]:
historians_pob_pod_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?pob ?pod ?pobLabel ?podLabel
WHERE {{
    VALUES ?historian {historian} . 
  
    OPTIONAL {{ ?historian wdt:P19 ?pob }}.
    OPTIONAL {{ ?historian wdt:P20 ?pod }}.
    
    ?pob rdfs:label ?pobLabel 
      FILTER (lang(?pobLabel) = 'en') .

    ?pod rdfs:label ?podLabel
      FILTER (lang(?podLabel) = 'en') .
}}
"""

In [85]:
def find_historian_pob_pod(entity: str):
    query = historians_pob_pod_query.format(historian=f"{{wd:{entity}}}")
    res = dict()
    sparql_res = return_sparql_query_results(query_string=query)
    for v in ["pod", "pob", "podLabel", "pobLabel"]:
        try:
            res[v] = (sparql_res['results']['bindings'][0][v]['value'])
        except (IndexError, KeyError):
            res.append(None)
    return res

found = defaultdict(lambda: None)

pob_list = pd.Series(name="Historian Birthplace Entity")
pobL_list = pd.Series(name="Historian Birthplace")

pod_list = pd.Series(name="Historian Deathplace Entity")
podL_list = pd.Series(name="Historian Deathplace")


for index, row in data.iterrows():
    cnt = 0
    while cnt < 10:
        try:
            found[row['Historian Entity']] = find_historian_pob_pod(row["Historian Entity"])
            break
        except (JSONDecodeError):
            time.sleep(0.5)
            cnt += 1 
            continue
    
    if found[row['Historian Entity']]:
        if found[row['Historian Entity']]['pob']:
            pob_list.loc[index] = found[row['Historian Entity']]['pob']
        else:
            pob_list.loc[index] = None

        if found[row['Historian Entity']]['pod']:
            pod_list.loc[index] = found[row['Historian Entity']]['pod']
        else:
            pod_list.loc[index] = None

        if found[row['Historian Entity']]['pobLabel']:
            pobL_list.loc[index] = found[row['Historian Entity']]['pobLabel']
        else:
            pobL_list.loc[index] = None

        if found[row['Historian Entity']]['podLabel']:
            podL_list.loc[index] = found[row['Historian Entity']]['podLabel']
        else:
            podL_list.loc[index] = None
        

data = pd.concat([data, pod_list, podL_list, pob_list, pobL_list], axis=1)

  pob_list = pd.Series(name="Historian Birthplace Entity")
  pobL_list = pd.Series(name="Historian Birthplace")
  pod_list = pd.Series(name="Historian Deathplace Entity")
  podL_list = pd.Series(name="Historian Deathplace")


ChunkedEncodingError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))