# ITArtHistorians project

This Jupyter Notebook was created for the exam in __Electronic Publishing and Digital Storytelling__ taught by Prof. Marilena Daquino at the University of Bologna for the year 2021-2022.

The team project is composed by Digital Humanities and Digital Knowledge master's students [Alice Bordignon](mailto:alice.bordignon@studio.unibo.it), [Federico Cagnola](mailto:federico.cagnola@studio.unibo.it) and [Gabriele Fiorenza](mailto:gabriele.fiorenza@studio.unibo.it). 

## About 
The project explores the relations between italian 🇮🇹 art historians from the [__Dictionary of Art Historians (DoAH)__](https://arthistorians.info) and [__ArtChives__](http://artchives.fondazionezeri.unibo.it) datasets and the cultural institutions who are classified as *keepers* of their collections and fonds from a temporal and geospatial point of view.

___

# Step 1
We imported the `.csv` file of the open-sourced [__DoAH__](https://arthistorians.info) filtered by nationality __('it')__ and manually integrated it with __Wikidata labels__.
We selected the columns we were interested in, dropping all rows where historians weren't associated to any archive or keeper instance and reorganized the data.

In [None]:
import pandas as pd
import numpy as np
import re
from datetime import datetime
from collections import defaultdict
from json import JSONDecodeError
from qwikidata.sparql import return_sparql_query_results # python library for working with sparql and linked data from WikiData
import time
from requests.exceptions import ChunkedEncodingError
import math
#from SPARQLWrapper import SPARQLWrapper, JSON # sparql query library
import ssl

After importing the necessary libraries, we gather existing data from external sources into a Pandas `Dataframe` for easier data manipulation and table operations.

In [15]:
# create first dataframe only using the specified columns 
data = pd.read_csv("DoAH_StoriciItaliani_integrato.csv", sep=",",
                    usecols=["Full Name", "Gender", "Collection", "Keeper"], encoding="utf-8")

# axis 0 to drop the rows, subset to only remove NaNs from the column Archives
data.dropna(axis=0, subset=["Keeper"], inplace=True)

# resetting the index because all deleted rows have changed the length of the dataframe
data.reset_index(inplace=True, drop=True)

# .pickle is a python serialization format for easy and quick read-write, and pandas supports it natively
data.to_pickle("00_first_db.pickle")

# the first table we have looks like this:
data.head()

Unnamed: 0,Full Name,Gender,Collection,Keeper
0,"Accascina, Maria",female,,Comune di Palermo
1,"Agostini, Leonardo",male,,Scuola Normale Superiore
2,"Alfieri, Vittorio",male,,Biblioteca Medicea Laurenziana
3,"Alinari, Giuseppe",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia
4,"Alinari, Leopoldo",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia


The first database is then __pickled__: the `pickle` format can be used to serialize Python object structures, which refers to the process of converting an object in the memory to a byte stream that can be stored as a file on disk. When we load it back to Python, this binary file can be de-serialized back to a Python object.
It is much __faster__ when compared to CSV files and __reduces the file size__ to almost half of CSV files using its compression techniques.

___

# Step 2 

After inspecting the first dataframe, we quickly identified the first problems:
1. Full names are reversed (`surname, name`). -> We created a function to fix them (to format `name surname`).
2. We need to have a controlled entity (`wd:xyz`) for each name and keeper, to be able to link them to other information.

### Step 2.1
#### Historian Names
We applied the `reformat_names` function on the historians' full names, removing duplicate whitespaces using regular expressions. 

In [9]:
def reformat_names(name):
    """ reverse names from surname,name format to name surname """
    l = name.split(", ")
    new = " ".join(reversed(l))
    # compile regex for multiple consecutive spaces
    return re.sub(r"\s+", " ", new)

In [10]:
# reverse names and remove duplicate whitespace
data["Full Name"] = data["Full Name"].apply(reformat_names)
data.describe()

Unnamed: 0,Full Name,Gender,Collection,Keeper
count,118,118,50,118
unique,118,2,49,77
top,Maria Accascina,male,Archivio Alinari,BEIC Digital Library
freq,1,108,2,10


### Step 2.2
#### Historian Entities 
Once the historians' names were fixed, we proceeded with the __search of the historians' entities on Wikidata.__
The __SPARQL query__ search for human individuals that speak Latin or Italian and work as: `art historian`, `historian`, `university teacher`, `archaeologist`, `artist`, `art critic`, `philosopher`, `antiquarian`, or `photographer`. The `{}`will be reserved to the historian's label in the database to match its Wikidata entity. 

The `find_historian_entity_from_name` function matches the value finding out the wd entity of each historian in the database. 

In [11]:
historian_entity_from_label = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artHistorian

WHERE {{
    ?artHistorian wdt:P31 wd:Q5 ;
                  wdt:P1412 ?language
                  FILTER (?language IN (wd:Q652, wd:Q397 ) ) 
    ?artHistorian wdt:P106 ?occupation
                  FILTER (?occupation IN (wd:Q1792450, wd:Q201788, wd:Q1622272, wd:Q3621491, wd:Q483501, wd:Q4164507, wd:Q4964182, wd:Q5697103, wd:Q33231 ) )    
    ?artHistorian rdfs:label ?o
                  FILTER ( str(?o) = "{}" )  .
}}
"""

In [12]:
def find_historian_entity_from_name(name: str):
    query = historian_entity_from_label.format(name)
    res = return_sparql_query_results(query_string=query)
    try:
        wdt_uri = res['results']['bindings'][0]['artHistorian']['value']
    except (IndexError, KeyError):
        return ""
    return wdt_uri.split("/")[-1]

Let's add a __new column__ with label `Historian Entity` while applying this function: the result will be a column of all the entities we found. 

In [13]:
data["Historian Entity"] = data["Full Name"].apply(find_historian_entity_from_name)

After long-running steps we export the data in .pickle format and commit the directory to a remote source control repository hosted on Github.

In [16]:
data.to_pickle("00_first_db.pickle")

In [17]:
data.head()

Unnamed: 0,Full Name,Gender,Collection,Keeper
0,"Accascina, Maria",female,,Comune di Palermo
1,"Agostini, Leonardo",male,,Scuola Normale Superiore
2,"Alfieri, Vittorio",male,,Biblioteca Medicea Laurenziana
3,"Alinari, Giuseppe",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia
4,"Alinari, Leopoldo",male,Archivio Alinari,Museo Nazionale Alinari della Fotografia


We checked the results and integrated manually the missing 13 historian entities. Then, we saved the new database.

In [15]:
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name            0
Gender               0
Collection          68
Keeper               0
Historian Entity    13
dtype: int64


In [17]:
data = pd.read_pickle("00_first_db.pickle")
data.to_json("00_first_db.json")


### Step 2.3
#### Keepers 
We did the same process with keepers. This time, the SPARQL query filter the search for the institutions of a certain type: `archive`, `library`, `university`, `museum`, `state archive`, `research institute`, `foundation`, `academy`, etc. The `{}` will be used to interpolate the keeper's label from the database to match its Wikidata entity.  

The `find_keeper_entity_from_label` function matches the value finding out the wd entity of each keeper in the database. 

In [61]:
data = pd.read_json("00_first_db.json")

In [62]:
keeper_entity_from_label = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?keeper

WHERE {{
    VALUES ?keeperRole {{wd:Q166118 wd:Q7075 wd:Q3953379 wd:Q3918 wd:Q33506 wd:Q17620767 wd:Q43229 wd:Q31855 wd:Q212805
                        wd:Q2352616 wd:Q1966910 wd:Q157031 wd:Q207694 wd:Q414147 wd:Q22806 wd:Q28564 wd:Q1329623 wd:Q44796387
                        wd:Q856234 wd:Q2122214 }}
    ?keeper wdt:P31 ?keeperRole .
    ?keeper rdfs:label ?o 
                  FILTER ( str(?o) = "{}" )  .
    
}}
"""

In [63]:
def find_keeper_entity_from_label(label: str):
    # remove trailing and leading whitespace
    label = label.strip()
    # substitute multiple spaces with a single one
    label = re.sub(r"\s+", " ", label)
    query = keeper_entity_from_label.format(label)
    try:
        res = return_sparql_query_results(query_string=query)
        wdt_uri = res['results']['bindings'][0]['keeper']['value']
    except (IndexError, KeyError, JSONDecodeError):
        return ""
    return wdt_uri.split("/")[-1]

We implemented the function adding a __new column__ (`Keeper Entity`) for the entities found. 

In [2]:
def create_keeper_col(data):
    # create a new column computing the keeper entity from the keeper label
    data["Keeper Entity"] = data["Keeper"].apply(find_keeper_entity_from_label)

After gathering entities for keepers a problem we found was that the research was really slow __(~35mins for the full dataframe lookup)__.
To speed things up this time we tried __splitting the dataframe in two equal parts__ (around index 58) and launching two separate threads, each on a portion of the dataframe. This way, if the SPARQL engine takes a long time to respond, we have 2 concurrent calls being made: it __reduced our running time to around 18 minutes__ for a full dataframe apply.

NB: this does __not speed up computation__, instead, when the operation is waiting for an IO task (waiting for sparql to respond with the JSON result for the query) it launches other requests or handle other responses without blocking.

In [65]:
# split df in half
df1 = data.iloc[:58, :]
df2 = data.iloc[58:, :]

# launch two threads running create_keeper_col on df1 and df2
from threading import Thread
t1 = Thread(target=create_keeper_col, args=(df1,))
t2 = Thread(target=create_keeper_col, args=(df2,))
t1.start()
t2.start()
# wait for the threads to finish
t1.join()
t2.join()

# concatenate the two dataframes
data = pd.concat([df1, df2], axis=0)

data.head(120)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["Keeper Entity"] = data["Keeper"].apply(find_keeper_entity_from_label)


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424


In [66]:
data.to_pickle("01_second_db.pickle")

In [24]:
data = pd.read_pickle("01_second_db.pickle")    
print(data.replace(r'^\s*$', np.nan, regex=True).isnull().sum())

Full Name            0
Gender               0
Collection          68
Keeper               0
Historian Entity     0
Keeper Entity       19
dtype: int64


19 entities could not be found automatically: we exported the db to json and manually integrated those which were missing.

In [19]:
data.to_json("01_second_db.json")

___

# Step 3
### Database intersection: merge doAH and ARTchives historians 

After saving our new database in JSON format, we added and filled the missing keepers' entities manually. In some cases, we had to manually add __new entries on Wikidata__, thus contributing directly to linking open data, to give the organization a controlled entity. In the case of __private archives__, we decided to use a single entity (`wd:Q12161242`) to identify "archival collection or institution that is not accessible to the public". We will explain better how we treated these exceptions in __section 4.4__. 

The next step is to __integrate ARTchives italian historians, collections and keepers.__ 

In [310]:
data = pd.read_json("01_second_db.json")

### Find italian historians on ARTchives
We query the remote endpoint of ARTchives. The following __SPARQL query__ returns: 
1. Full Names of historians
2. Collections
3. Keepers 
4. Historian Entity
5. Keepers entity

Of the italian historians on ARTchives

In [65]:
# The library we're using requires us to manually create a TLS context
ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
artchives_endpoint = "http://artchives.fondazionezeri.unibo.it/sparql"

In [66]:
artchives_italy = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdp: <http://www.wikidata.org/wiki/Property:>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?HistorianEntity (SAMPLE(?HistorianName) AS ?FullName) ?CollectionEntity (SAMPLE(?CollectionName) AS ?Collection) ?KeeperEntity (SAMPLE(?KeeperName) AS ?Keeper)
WHERE {
?HistorianEntity a wd:Q5 ; rdfs:label ?HistorianName ; wdp:P27 wd:Q38 .
?CollectionEntity wdp:P170 ?HistorianEntity ; rdfs:label ?CollectionName .
?KeeperEntity wdp:P1830 ?CollectionEntity ; rdfs:label ?KeeperName .
}  
  GROUP BY ?HistorianEntity ?CollectionEntity ?KeeperEntity
"""

We instantiate the client, set the query and ask the server to return a `json` response for easier parsing. Then we loop over the dictionray and inspect the response before integrating into our dataframe.

In [67]:
# set the endpoint 
sparql_wd = SPARQLWrapper(artchives_endpoint)
# set the query
sparql_wd.setQuery(artchives_italy)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    Full_Name = result["FullName"]["value"]
    Hist_entity = result["HistorianEntity"]["value"].split("/")[-1]
    Coll = result["Collection"]["value"]
    Keeper = result["Keeper"]["value"]
    Keeper_entity = result["KeeperEntity"]["value"].split("/")[-1]
    print(Full_Name + "; " + Coll + "; " + Keeper + "; HIST: " + Hist_entity + "; KEEP: " + Keeper_entity)

Federico Zeri; Fototeca Zeri; Fondazione Federico Zeri; HIST: Q1089074; KEEP: Q23687322
Stefano Tumidei; Fototeca Stefano Tumidei; Fondazione Federico Zeri; HIST: Q55453618; KEEP: Q23687322
Luisa Vertova; Archivio Luisa Vertova; Fondazione Federico Zeri; HIST: Q61913691; KEEP: Q23687322
Eugenio Battisti; Battisti Eugenio (complex of fonds); Scuola Normale Superiore; HIST: Q1373290; KEEP: Q672416
Adolfo Venturi; Venturi Adolfo (complex of fonds); Scuola Normale Superiore; HIST: Q2824734; KEEP: Q672416
Luigi Salerno; Luigi Salerno research papers; Getty Research Institute; HIST: Q6700132; KEEP: Q11203476
Roberto Longhi; Archivio Longhi; Fondazione Roberto Longhi; HIST: Q1361667; KEEP: Q1634770
Cesare Brandi; Archive Cesare Brandi; Direzione regionale musei della Toscana; HIST: Q1056780; KEEP: Q108323065


Since some of the historians were already in the database, we __manually integrated the new ones__ into the `01_second_db.json` and saved the database:

118. Stefano Tumidei
119. Luisa Vertova
120. Eugenio Battisti
121. Cesare Brandi 

We can see now the integrated database. 

In [68]:
pd.set_option("display.max_rows", None)
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424


___

# Step 4 
### Use Wikidata entities to query relevant information about historians and keepers

We created some SPARQL queries to find out relevant information about historians and keepers to finalize our database: birth and death places and dates for historians, locations for keepers.

In [3]:
data = pd.read_json("01_second_db.json")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580
...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416


### Step 4.1
#### Find keepers' place
The following SPARQL query returns the __administrative territorial entity__ of the keepers, filtered for __english labels__ only.

The `find_keepers_place` function checks the existence of the query results for each keeper, adding into the database the two new columns `Keeper Place` and `Keeper Place Label` containing the information found.  

In [4]:
keepers_place_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?keeperPlace ?keeperPlaceLabel
WHERE {{
    {} wdt:P131 ?keeperPlace .
    ?keeperPlace rdfs:label ?keeperPlaceLabel .
    FILTER (lang(?keeperPlaceLabel) = 'en')
}}
"""

In [7]:
def find_keepers_place(entity: str):
    """ Tries to return the controlled entity for a cultural keeper location """
    # create the final sparql query by adding the keeper's entity to the query
    query = keepers_place_query.format(f"wd:{entity}")
    try:
        res = return_sparql_query_results(query_string=query)
        return_entity = res['results']['bindings'][0]['keeperPlace']['value']
        return_label = res['results']['bindings'][0]['keeperPlaceLabel']['value']
    except (IndexError): 
        # the response did not contain any results, we just skip those by returning None
        return None
    return [return_entity.split("/")[-1], return_label]

found = defaultdict(lambda: "-")
# Entities and labels will be two pd.Series which we'll turn to columns to add to the dataframe
entities = pd.Series(name="Keeper Place Entity")
labels = pd.Series(name="Keeper Place")

# Loop over the dataframe (slow but we don't care since it's under 300 rows)
for index, row in data.iterrows():
    if found[row['Keeper Entity']] == "-":
        cnt = 0
        # SPARQL endpoint throttles requests and returns 429-Resource exhausted so we try a few (10) times
        while cnt < 10:
            try:
                found[row['Keeper Entity']] = find_keepers_place(row["Keeper Entity"])
                break
            except (JSONDecodeError):
                time.sleep(0.5)
                cnt += 1 
                continue
    # In case we have a result (and not None) we add it to the appropriate series
    if found[row['Keeper Entity']]:
        entities.loc[index] = found[row['Keeper Entity']][0]
        labels.loc[index] = found[row['Keeper Entity']][1]
    else:
        print("Not found: ", row['Keeper'])

# Finally, we add the series to the dataframe on axis=1 (as columns)
data = pd.concat([data, entities, labels], axis=1)

  if sys.path[0] == "":
  del sys.path[0]


Not found:  Archivio privato a Roma
Not found:  Getty Research Institute
Not found:  Lombard Institute Academy of Science and Letters
Not found:  Getty Research Institute
Not found:  Fondazione "Biblioteca Benedetto Croce"
Not found:  Lombardia Beni Culturali
Not found:  Fondazione Cassa di Risparmio di Perugia
Not found:  Getty Research Institute
Not found:  Getty Research Institute
Not found:  Archivio privato di Calcata
Not found:  Archivio privato di Meleto
Not found:  Getty Research Institute
Not found:  Istituto Lombardo Accademia di Scienze e Lettere
Not found:  Direzione regionale musei della Toscana


We will integrate the missing data later, we just save the new database for now. 

In [10]:
data.head(122)
data.to_pickle("02_place_db.pickle")

### Step 4.2 
#### Find date of birth and death of the historians 

Spiegazione

In [22]:
data = pd.read_pickle("02_place_db.pickle")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence
...,...,...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322,Q1891,Bologna
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416,Q13375,Pisa


In [23]:
historians_dob_dod_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dob ?dod
WHERE {{
    VALUES ?historian {historian} . 
    ?historian wdt:P569 ?dob .
    OPTIONAL {{ ?historian wdt:P570 ?dod }}.
}}
"""

In [14]:
def find_historian_dob_dod(entity: str):
    # Create query by concatenating the entity
    query = historians_dob_dod_query.format(historian=f"{{wd:{entity}}}")
    res = []
    sparql_res = return_sparql_query_results(query_string=query)
    # Retrieve dates of birth and death, if present, and return them as a list of 2 items
    try:
        if sparql_res['results']['bindings'][0]['dob']['type'] == "uri":
            # unknown dates are mapped to URIs
            res.append(None)
        else:
            res.append(sparql_res['results']['bindings'][0]['dob']['value'].rstrip("Z"))
    except (IndexError, KeyError):
        res.append(None)
    try:
        if sparql_res['results']['bindings'][0]['dod']['type'] == "uri":
            # unknown dates are mapped to URIs
            res.append(None)
        else:
            res.append(sparql_res['results']['bindings'][0]['dod']['value'].rstrip("Z"))
    except (IndexError, KeyError):
        res.append(None)
    return res

found = defaultdict(lambda: None)

# Again, we use Series as columns before concatenating them to the db, and explicitly set the data type
dob_list = pd.Series(name="Historian Birth", dtype="datetime64[ns]")
dod_list = pd.Series(name="Historian Death", dtype="datetime64[ns]")

for index, row in data.iterrows():
    cnt = 0
    # Try minimizing throttled requests by waiting 0.5 seconds between each request in case of errors
    while cnt < 7:
        try:
            found[row['Historian Entity']] = find_historian_dob_dod(row["Historian Entity"])
            break
        except (JSONDecodeError, ChunkedEncodingError):
            time.sleep(0.5)
            cnt += 1 
            continue
    
    # We may have just a date of birth or, weirdly, just a date of death, so we check for both
    if any(found[row['Historian Entity']]):
        dob, dod = found[row['Historian Entity']]
        if dob:
            dob_list.loc[index] = datetime.fromisoformat(dob).date()
        else:
            dob_list.loc[index] = None
        if dod:
            dod_list.loc[index] = datetime.fromisoformat(dod).date()
        else:
            dod_list.loc[index] = None

# Finally, join the created columns to the dataframe
data = pd.concat([data, dob_list, dod_list], axis=1)

In [16]:
data.head(122)
data.to_pickle("03_dob_dod_db.pickle")

### Step 4.3 
#### Find place of birth and place of death of the historians

Then, we used another SPARQL query to return the places of birth and death of our historians, filtered for english labels only.

The find_historian_pob_pod function checks the existence of the query results for each historian, adding into the database the four new columns related to historian birthplaces and deathplaces containing the information found. 

In [25]:
data = pd.read_pickle("03_dob_dod_db.pickle")
data.head(122)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01
...,...,...,...,...,...,...,...,...,...,...
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome,1918-01-22,2000-01-09
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna,1962-08-15,2008-05-09
119,Luisa Vertova,female,Archivio Luisa Vertova,Fondazione Federico Zeri,Q61913691,Q23687322,Q1891,Bologna,1921-01-01,2021-06-28
120,Eugenio Battisti,male,Battisti Eugenio (complex of fonds),Scuola Normale Superiore,Q1373290,Q672416,Q13375,Pisa,1924-12-14,1989-10-17


In [26]:
historians_pob_pod_query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?pob ?pod ?pobLabel ?podLabel
WHERE {{
    VALUES ?historian {historian} . 
  
    OPTIONAL {{ ?historian wdt:P19 ?pob }}.
    OPTIONAL {{ ?historian wdt:P20 ?pod }}.
    
    ?pob rdfs:label ?pobLabel 
      FILTER (lang(?pobLabel) = 'en') .

    ?pod rdfs:label ?podLabel
      FILTER (lang(?podLabel) = 'en') .
}}
"""

In [27]:

def find_historian_pob_pod(entity: str):
    # Create query by concatenating the entity
    query = historians_pob_pod_query.format(historian=f"{{wd:{entity}}}")
    res = dict()
    sparql_res = return_sparql_query_results(query_string=query)
    # The query has multiple optional results, so we check for each one
    for v in ["pod", "pob", "podLabel", "pobLabel"]:
        try:
            res[v] = sparql_res['results']['bindings'][0][v]['value'] if v in {"podLabel", "pobLabel"} else sparql_res['results']['bindings'][0][v]['value'].split("/")[-1]
        except (IndexError, KeyError):
            res[v] = None
    return res
# Labels and entities are both needed
pob_list = pd.Series(name="Historian Birthplace Entity")
pobL_list = pd.Series(name="Historian Birthplace")

pod_list = pd.Series(name="Historian Deathplace Entity")
podL_list = pd.Series(name="Historian Deathplace")

found = defaultdict(lambda: None)
# Loop over the dataframe row by row
for index, row in data.iterrows():
    historian = row['Historian Entity']
    cnt = 0
    # Try minimizing throttled requests by waiting 0.5 seconds between each request in case of errors
    while cnt < 7:
        try:
            found[historian] = find_historian_pob_pod(historian)
            break
        except (JSONDecodeError, ChunkedEncodingError):
            time.sleep(0.5)
            cnt += 1 
            continue
    # Create series with place of birth and death labels and entities
    if found[historian] and any(found[historian].values()):
        if found[historian]['pob']:
            pob_list.loc[index] = found[historian]['pob']
        else:
            pob_list.loc[index] = None

        if found[historian]['pod']:
            pod_list.loc[index] = found[historian]['pod']
        else:
            pod_list.loc[index] = None

        if found[historian]['pobLabel']:
            pobL_list.loc[index] = found[historian]['pobLabel']
        else:
            pobL_list.loc[index] = None

        if found[historian]['podLabel']:
            podL_list.loc[index] = found[historian]['podLabel']
        else:
            podL_list.loc[index] = None

# Concatenate the series to the dataframe as columns
data = pd.concat([data, pod_list, podL_list, pob_list, pobL_list], axis=1)
data.head(120)


  pob_list = pd.Series(name="Historian Birthplace Entity")
  pobL_list = pd.Series(name="Historian Birthplace")
  pod_list = pd.Series(name="Historian Deathplace Entity")
  podL_list = pd.Series(name="Historian Deathplace")


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,Antonio Maria Zanetti,male,,Biblioteca Nazionale Marciana,Q944948,Q578460,Q641,Venice,1680-02-20,1757-12-31,Q641,Venice,Q641,Venice
116,Federico Zeri,male,Federico Zeri Archives,Fondazione Federico Zeri,Q1089074,Q23687322,Q1891,Bologna,1921-08-12,1998-10-05,Q242942,Mentana,Q220,Rome
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome,1918-01-22,2000-01-09,Q220,Rome,Q220,Rome
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna,1962-08-15,2008-05-09,Q1891,Bologna,Q13367,Forlì


In [28]:
data.to_pickle("04_full_db.pickle")

In [None]:
data.to_json("04_full_db.json")

### Step 4.4
#### Uniform private archives and missing values

In [27]:
data = pd.read_pickle("04_full_db.pickle")
data.head(10)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645,Q1891,Bologna,1915-07-10,1974-02-14,Q1891,Bologna,Q1891,Bologna
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654,Q13373,Lucca,1492-04-01,1556-10-31,Q641,Venice,Q13378,Arezzo
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242,,,1909-05-17,1992-11-12,Q220,Rome,Q495,Turin
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416,Q13375,Pisa,1907-07-17,1998-12-03,Q13375,Pisa,Q34130,Vittoria
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424,Q220,Rome,1566-01-01,1644-12-30,Q220,Rome,Q220,Rome


In order to have a uniform database we need to replace all empty/falsy values with numpy's `NaN`.
The Dataframe method `.fillna()` replaces all `None`-like values with the value provided as argument.

In [28]:
data.fillna(value=np.nan, inplace=True)
data.head(10)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645,Q1891,Bologna,1915-07-10,1974-02-14,Q1891,Bologna,Q1891,Bologna
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654,Q13373,Lucca,1492-04-01,1556-10-31,Q641,Venice,Q13378,Arezzo
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242,,,1909-05-17,1992-11-12,Q220,Rome,Q495,Turin
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416,Q13375,Pisa,1907-07-17,1998-12-03,Q13375,Pisa,Q34130,Vittoria
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424,Q220,Rome,1566-01-01,1644-12-30,Q220,Rome,Q220,Rome


Some of the collected "Keepers" are labeled as variations of "private archive". The team decided to treat all these occurrences as a controlled entity `wd:Q12161242` ([wikidata](https://www.wikidata.org/wiki/Q12161242)).

In [29]:
def set_private_archives_entity(keeper: str, entity: str):
    # if the keeper column contains the word private, set the entity to private archive
    if keeper:
        if re.search("privat", keeper, re.IGNORECASE):
            return "Q12161242"
    return entity

# .apply is very fast, to apply to every row we use an anonymous function
data['Keeper Entity'] = data.apply(lambda x: set_private_archives_entity(x['Keeper'], x['Keeper Entity']), axis=1)
data.head(10)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645,Q1891,Bologna,1915-07-10,1974-02-14,Q1891,Bologna,Q1891,Bologna
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654,Q13373,Lucca,1492-04-01,1556-10-31,Q641,Venice,Q13378,Arezzo
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242,,,1909-05-17,1992-11-12,Q220,Rome,Q495,Turin
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416,Q13375,Pisa,1907-07-17,1998-12-03,Q13375,Pisa,Q34130,Vittoria
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424,Q220,Rome,1566-01-01,1644-12-30,Q220,Rome,Q220,Rome


The rows containing private archives are just three, and filtering them with pandas is quite easy, so we can take note of the cities of these archives and fill in the place and place entity columns as well.

In [30]:
entities = {"Rome": "Q220",
            "Calcata": "Q159696",
            "Meleto": "Q18487110"}

# Set entities:
def set_private_archives_place(keeper: str, existing_entity: str):
    if keeper:
        for city in entities.keys():
            if re.search(city[:-1], keeper, re.IGNORECASE):
                return entities[city]
    return existing_entity

data['Keeper Place Entity'] = data.apply(lambda x: set_private_archives_place(x['Keeper'], x['Keeper Place Entity']), axis=1)


In [31]:
# Set labels
def set_private_archives_place_label(keeper: str, existing_entity: str):
    if keeper:
        for city in entities.keys():
            if re.search(city[:-1], keeper, re.IGNORECASE):
                return city
    return existing_entity

data['Keeper Place'] = data.apply(lambda x: set_private_archives_place_label(x['Keeper'], x['Keeper Place']), axis=1)

In [32]:
# the final result is this 
data.head(10)

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence
5,Francesco Arcangeli,male,"Fondo speciale Angelo, Gaetano, Bianca e Franc...",Biblioteca comunale dell'Archiginnasio,Q1121086,Q3639645,Q1891,Bologna,1915-07-10,1974-02-14,Q1891,Bologna,Q1891,Bologna
6,Pietro Aretino,male,Fondo Bongi,State Archives of Lucca,Q296272,Q3621654,Q13373,Lucca,1492-04-01,1556-10-31,Q641,Venice,Q13378,Arezzo
7,Giulio Carlo Argan,male,,Archivio privato a Roma,Q778445,Q12161242,Q220,Rome,1909-05-17,1992-11-12,Q220,Rome,Q495,Turin
8,Paolo Enrico Arias,male,,Scuola Normale Superiore,Q3894011,Q672416,Q13375,Pisa,1907-07-17,1998-12-03,Q13375,Pisa,Q34130,Vittoria
9,Giovanni Baglione,male,,Archivio di Stato di Roma,Q983332,Q2860424,Q220,Rome,1566-01-01,1644-12-30,Q220,Rome,Q220,Rome


In [38]:
# Datetime format can be complicated to inspect and compare in external tools, so we stick to a simple format as a string
data["Historian Birth"] = data["Historian Birth"].astype(str)
data["Historian Death"] = data["Historian Death"].astype(str)

Since some information were not available even manually, we left some missing data to not undermine the objectivity of the analysis. Some keepers place's labels were also fixed because they were referring the neighborhoods rather than the city.

### Step 4.5
#### Keepers' instances

Adding the instances of the keepers in our data allows us to investigate the __presence of patterns__ between the places, the historians, and the instances of the keepers. We did the same process as before, and saved the results. 

In [2]:
data = pd.read_json("05_db.json")

In [3]:
instances_of_keepers = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?type ?typeLabel 
WHERE {{
    VALUES ?keeper {keeper} . 
    OPTIONAL {{ ?keeper wdt:P31 ?type . ?type rdfs:label ?typeLabel }} . 
    FILTER (lang(?typeLabel) = 'en') 
}}
"""

In [22]:
def find_keepers_instance(entity: str):
    query = instances_of_keepers.format(keeper=f"{{wd:{entity}}}")

    res = return_sparql_query_results(query_string=query)

    final_result = []
    # Instance-Of property usually has multiple entities, so we check for each one and store them as a list of strings
    for n in range(len(res['results']['bindings'])):
        return_entity = res['results']['bindings'][n]['type']['value']
        return_label = res['results']['bindings'][n]['typeLabel']['value']
        final_result.append((return_entity.split("/")[-1], return_label))
    return final_result

keeper_instances = pd.Series(name="Keeper Instances")

found = defaultdict(lambda: [])
for index, row in data.iterrows():
    keeper = row['Keeper Entity']
    cnt = 0
    # Try minimizing throttled requests by waiting 0.5 seconds between each request in case of errors
    while cnt < 5:
        try:
            found[keeper] = find_keepers_instance(keeper)
            break
        except JSONDecodeError:
            time.sleep(1)
            cnt += 1 
            continue
       
    if len(found[keeper]) > 0:
        keeper_instances.loc[index] = found[keeper]
    else:
        print("Not found: ", row['Keeper'])

# Concatenate new column to existing dataframe
data = pd.concat([data, keeper_instances], axis=1)
data.head(120) 

  del sys.path[0]


Not found:  Archivio privato a Roma
Not found:  Fondazione Benetton
Not found:  Nuova Fondazione Rossana e Carlo Pedretti
Not found:  Archivio privato di Calcata
Not found:  Archivio privato di Meleto


Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace,Keeper Instances
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples,"[(Q166118, archive)]"
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano,"[(Q3953379, superior graduate school in Italy)]"
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti,"[(Q11834910, state public library), (Q684740, ..."
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence,"[(Q33506, museum), (Q684740, real property), (..."
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence,"[(Q94701721, Tuscan museum of regional importa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,Antonio Maria Zanetti,male,,Biblioteca Nazionale Marciana,Q944948,Q578460,Q641,Venice,1680-02-20,1757-12-31,Q641,Venice,Q641,Venice,"[(Q856584, library building), (Q11834910, stat..."
116,Federico Zeri,male,Federico Zeri Archives,Fondazione Federico Zeri,Q1089074,Q23687322,Q1891,Bologna,1921-08-12,1998-10-05,Q242942,Mentana,Q220,Rome,"[(Q31855, research institute)]"
117,Bruno Zevi,male,Archivio Bruno Zevi,Fondazione Bruno Zevi,Q558155,Q73016367,Q220,Rome,1918-01-22,2000-01-09,Q220,Rome,Q220,Rome,"[(Q1329623, cultural center)]"
118,Stefano Tumidei,male,Fototeca Stefano Tumidei,Fondazione Federico Zeri,Q55453618,Q23687322,Q1891,Bologna,1962-08-15,2008-05-09,Q1891,Bologna,Q13367,Forlì,"[(Q31855, research institute)]"


In [23]:
data.to_pickle("06_db.pickle")

Since some data were not automatically found, we fixed the missing ones manually. Then, we saved the database. 

In [14]:
data.to_json("06_db.json")

___

# Step 5

## Locations' coordinates

After exploring multiple approaches we finally decided to create a __second database__ only containing locations, by using the `Keeper Place Entities`, the `Historian Birthplace Entities` and the `Historian Deathplace Entities` of our data as *foreign keys* to our new database. The new database will be called __Locations__, and will contain the coordinates, labels and entities of places. 

In [58]:
import pandas as pd
import re
from collections import defaultdict
from qwikidata.sparql import return_sparql_query_results
from json import JSONDecodeError
import time

In [59]:
data = pd.read_json("06_db.json")

In [60]:
places = []

# Grab existing columns and use them to populate an initial list of places
for col in ["Keeper Place Entity", "Historian Birthplace Entity", "Historian Deathplace Entity"]:
    places.extend(data[col].unique())

# Removed duplicated values between columns by casting to python set
locations = set(places)
# We have 90 locations!
print(len(locations))
print(locations)

90
{'Q17660', 'Q90', 'Q5836', 'Q56086', 'Q617', 'Q13364', 'Q8621', 'Q237', 'Q48027', 'Q216853', 'Q46931', 'Q3661494', 'Q13375', 'Q2028', 'Q2751', 'Q6285', 'Q6596', 'Q6537', 'Q6122', 'Q243371', 'Q2402810', 'Q18341', 'Q2759', 'Q641', 'Q270328', 'Q13367', 'Q82822', 'Q13135', 'Q5475', 'Q48457', 'Q2790', 'Q242942', 'Q18485635', 'Q101388', 'Q51871', 'Q1449', 'Q10226', 'Q102599', 'Q12892', 'Q72672', 'Q220', 'Q13362', 'Q20146', 'Q2634', 'Q18484193', 'Q18021', 'Q103305', 'Q34130', 'Q210098', 'Q2966', 'Q138823', 'Q41819', 'Q65', 'Q13373', 'Q6247', 'Q60', None, 'Q34560', 'Q279', 'Q91228', 'Q266975', 'Q490', 'Q11299', 'Q1155', 'Q3437', 'Q1533', 'Q13498', 'Q47385', 'Q2807', 'Q2044', 'Q3415', 'Q279373', 'Q13378', 'Q8611', 'Q52097', 'Q159696', 'Q1210', 'Q20571', 'Q190584', 'Q29080', 'Q2656', 'Q628', 'Q1891', 'Q495', 'Q30028725', 'Q244388', 'Q94638', 'Q13134', 'Q269516', 'Q50121'}


In [61]:
locations = pd.DataFrame(locations, columns=['Place Entity'], dtype = np.object)

In [62]:
locations.head()

Unnamed: 0,Place Entity
0,Q17660
1,Q90
2,Q5836
3,Q56086
4,Q617


### Finding coordinates

In [63]:
coordinates = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?coordinates ?placeLabel
WHERE {{
    VALUES ?location {location} 
    ?location rdfs:label ?placeLabel .
    FILTER (lang(?placeLabel) = 'en')
    OPTIONAL {{?location wdt:P625 ?coordinates}}
}}
"""

In [64]:
def deg_to_dms(deg, kind='lat'):
    # Thanks StackOverflow for this useful thread, used multiple answers and merged them into something that could work for us
    # https://stackoverflow.com/questions/2579535/convert-dd-decimal-degrees-to-dms-degrees-minutes-seconds-in-python
    decimals, number = math.modf(deg)
    d = int(number)
    m = int(decimals * 60)
    s = (deg - d - m / 60) * 3600.00
    compass = {
        'lat': ('N','S'),
        'lon': ('E','W')
    }
    compass_str = compass[kind][0 if d >= 0 else 1]
    return '{}º{}\'{:.2f}"{}'.format(abs(d), abs(m), abs(s), compass_str)

In [67]:
def find_coordinates(entity: str):
    query = coordinates.format(location=f"{{wd:{entity}}}")
    coords_regex = re.compile(r"([\d\.]*)\s([\d\.]*)")
    try:
        res = return_sparql_query_results(query_string=query)
        # Coordinates are in format 'Point(13.6193 45.9352)': select two decimals
        return_coords = re.search(coords_regex, res['results']['bindings'][0]['coordinates']['value'])
        if return_coords:
          coord_1 = float(return_coords.group(1))
          coord_2 = float(return_coords.group(2))
        else:
          return None
        return_labels = res['results']['bindings'][0]['placeLabel']['value']
    except (IndexError, KeyError):
        return None
    #return [f"{deg_to_dms(coord_1, 'lon')}, {deg_to_dms(coord_2, 'lat')}", return_labels]
    return [coord_1, coord_2, return_labels]


found = defaultdict(lambda: None)
# Again, we create a series for each new column
wd_coordinates = pd.Series(name="Coordinates")
longitudes = pd.Series(name="Longitude")
latitudes = pd.Series(name="Latitude")
wd_labels = pd.Series(name="Place")

for index, row in locations.iterrows():
    place = row['Place Entity']
    cnt = 0
    while cnt < 5:
        # Try minimizing throttled requests by waiting 0.5 seconds between each request in case of errors
        try:
            found[place] = find_coordinates(place)
            break
        except JSONDecodeError:
            time.sleep(1)
            cnt += 1 
            continue
       
    if found[place]:
        longitudes.loc[index] = found[place][0]
        latitudes.loc[index] = found[place][1]
        wd_labels.loc[index] = found[place][2]
    else:
        print("Not found: ", row['Place Entity'])

# Concatenate new values to our dataframe
locations = pd.concat([locations, latitudes, longitudes, wd_labels], axis=1)
locations.head(120)  

  wd_coordinates = pd.Series(name="Coordinates")
  longitudes = pd.Series(name="Longitude")
  latitudes = pd.Series(name="Latitude")
  wd_labels = pd.Series(name="Place")


Not found:  None


Unnamed: 0,Place Entity,Coordinates,Place,lat,lon,Latitude,Longitude,Place.1
0,Q2759,"12º38'13.92""E, 43º43'30.86""N",Urbino,"12º38'13.92""E","43º43'30.86""N",43.725239,12.637200,Urbino
1,Q13134,"12º54'47.88""E, 43º54'36.54""N",Pesaro,"12º54'47.88""E","43º54'36.54""N",43.910150,12.913300,Pesaro
2,Q34130,"14º31'60.00""E, 36º57'0.00""N",Vittoria,"14º31'60.00""E","36º57'0.00""N",36.950000,14.533333,Vittoria
3,Q8621,"12º39'0.00""E, 42º33'60.00""N",Terni,"12º39'0.00""E","42º33'60.00""N",42.566667,12.650000,Terni
4,Q29080,"11º28'60.00""E, 44º27'0.00""N",Ozzano dell'Emilia,"11º28'60.00""E","44º27'0.00""N",44.450000,11.483333,Ozzano dell'Emilia
...,...,...,...,...,...,...,...,...
85,Q13373,"10º30'60.00""E, 43º51'0.00""N",Lucca,"10º30'60.00""E","43º51'0.00""N",43.850000,10.516667,Lucca
86,Q52097,"11º31'58.00""E, 43º33'52.00""N",San Giovanni Valdarno,"11º31'58.00""E","43º33'52.00""N",43.564444,11.532778,San Giovanni Valdarno
87,Q90,"2º21'5.00""E, 48º51'25.00""N",Paris,"2º21'5.00""E","48º51'25.00""N",48.856944,2.351389,Paris
88,Q11299,"73º59'39.00""E, 40º43'42.00""N",Manhattan,"73º59'39.00""E","40º43'42.00""N",40.728333,73.994167,Manhattan


In [71]:
locations.to_pickle("db_locations.pickle")

In [72]:
locations.to_csv("db_locations.csv", sep=',', encoding='utf-8', index=False)

___

# Data gathering: END

We finally have our two final databases, from which we will proceed to the data analysis step. 

___

# Data analysis

With our final databases we can now start to delve into the gathered data and extract useful knowledge. For practical reasons we decided to carry out the analysis in Python to take advantage of pandas' API to make queries and extract answers to our research questions.

After testing out many approaches we decided to perform the data analyses on the dataframe to take advantage of the flexibility and power or python/pandas and to export the result of each analysis in `json` format for visualization purposes.

In [6]:
import pandas as pd
import numpy as np
import json

In [7]:
database = pd.read_pickle("06_db.pickle")

- Historians who were born in the same place where the keeper is located
- Historians who died in the same place where the keeper is located

In [8]:
# Create columns for historian born or dead in the same place as their keeper
database["Born_same_Keeper"] = np.where(database["Keeper Place Entity"] == database["Historian Birthplace Entity"], True, False)
database["Died_same_Keeper"] = np.where(database["Keeper Place Entity"] == database["Historian Deathplace Entity"], True, False)

database.head()

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace,Keeper Instances,Born_same_Keeper,Died_same_Keeper
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples,"[(Q166118, archive)]",False,True
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano,"[(Q3953379, superior graduate school in Italy)]",False,False
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti,"[(Q11834910, state public library), (Q684740, ...",False,True
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence,"[(Q33506, museum), (Q684740, real property), (...",True,True
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence,"[(Q94701721, Tuscan museum of regional importa...",True,True


### Data analysis questions:

 1. Most common instance between keepers with location == historian birthplace
 2. Most common instance between keepers with location == historian deathplace
 3. Most common instance between keepers with location != historian deathplace and != historian birthplace


In [11]:
def sort_shrink_by_frequency(lst: list)-> dict:
    import collections
    # Sort by frequency
    counts = collections.Counter(lst)
    new_list = sorted(lst, key=lambda x: (counts[x], x), reverse=True)

    counts = dict(counts)
    ordered = dict()
    for item in new_list:
        ordered[item[1]] = counts[item]
    print(ordered)
    return ordered

In [12]:
# We select the database columns where the historian was born in the same place 
# as their keeper and then extract the instances column and remove NaN values
instances_birthplace_keeper_location = database[database["Born_same_Keeper"]]['Keeper Instances'].dropna()

# instances are lists of lists, we need to flatten to single big list
flattened_instances = list([i for instance in instances_birthplace_keeper_location for i in instance])

# sort them with our custom function
sorted_instances = sort_shrink_by_frequency(flattened_instances)

# store the results in json for visualization
with open('instances_birthplace_keeper_same_location_by_frequency.json', 'w+') as fp:
    json.dump(sorted_instances, fp)


{'library': 7, 'archive': 7, 'museum': 5, 'state public library': 5, 'real property': 4, 'public library': 4, 'national museum': 4, 'Tuscan museum of regional importance': 2, 'national archives': 2, 'art museum': 2, 'state archive (Italy)': 2, 'foundation': 2, 'cultural center': 2, 'library building': 1, 'organization': 1, 'university': 1, 'historical archive': 1, 'cultural institution': 1, 'national library': 1, 'United Nations Depository Library': 1}


In [13]:
# Same thing with death locations, we select only rows where the historian died in the same place as their keeper
# and extract the instances column and remove NaN values
instances_deathplace_keeper_location = database[database["Died_same_Keeper"]]['Keeper Instances'].dropna()

# again, instances are lists of lists, we need to flatten to single big list
flattened_instances = list([i for instance in instances_deathplace_keeper_location for i in instance])

# sort with custom function
sorted_instances = sort_shrink_by_frequency(flattened_instances)

# export in json for visualizations (yes, we got creative with filenames)
with open('instances_deathplace_keeper_same_location_by_frequency.json', 'w+') as fp:
    json.dump(sorted_instances, fp)


{'archive': 11, 'library': 9, 'state public library': 8, 'real property': 7, 'public library': 7, 'museum': 5, 'university': 3, 'national museum': 3, 'foundation': 3, 'Tuscan museum of regional importance': 2, 'public university': 2, 'open-access publisher': 2, 'superior graduate school in Italy': 2, 'research institute': 2, 'national archives': 2, 'library building': 1, 'organization': 1, 'historical archive': 1, 'cultural institution': 1, 'national library': 1, 'art museum': 1, 'state archive (Italy)': 1, 'cultural center': 1, 'United Nations Depository Library': 1}


In [14]:
# This was a bit more complicated: select rows where the keeper was born and died in a different place as their keeper
instances_different_place_keeper_location = database[(~database['Born_same_Keeper']) & (~database['Died_same_Keeper']) ]

# Only select the instances column and drop NaN values
instances_different_place_keeper_location = instances_different_place_keeper_location['Keeper Instances'].dropna()

# flatten list of list to single long list
flattened_instances = list([i for instance in instances_different_place_keeper_location for i in instance])

# sort and shrink list with custom function
sorted_instances = sort_shrink_by_frequency(flattened_instances)

# export in json for visualizations
with open('instances_place_keeper_different_location_by_frequency.json', 'w+') as fp:
    json.dump(sorted_instances, fp)


{'open-access publisher': 12, 'university': 11, 'website': 11, 'digital library': 11, 'catalogue': 10, 'research institute': 8, 'organization': 7, 'art museum': 7, 'state archive (Italy)': 5, 'state public library': 5, 'library': 3, 'academy of sciences': 3, 'superior graduate school in Italy': 3, 'national library': 3, 'publisher': 3, 'archive': 3, 'library building': 2, 'academic library': 2, 'Accademia delle lettere': 2, 'public library': 2, 'national museum': 2, 'scientific collection': 2, 'online database': 1, 'enterprise': 1, 'former hospital': 1, 'monument': 1, 'business': 1, 'state archives section': 1, 'museum': 1, 'church archive': 1, 'government organization': 1, 'municipal library': 1, 'bank': 1, 'national archives': 1, 'national academy': 1, 'web portal': 1, 'cultural center': 1, 'United Nations Depository Library': 1}


In [4]:
# For a custom viz we export all rows at selected columns to have a slim file with info on death and birthplaces and keeper locations
database.loc[:, ["Full Name", "Born_same_Keeper", "Died_same_Keeper"]].to_json("historians_birth_deathplaces_keepers.json")

___

### Data analysis questions (contd.):
Frequency-ordered lists for:

4. Historian birthplaces
5. Historian deathplaces
6. Keeper locations

In [5]:
# Export these analyses to JSON in an ordered fashion
database["Historian Birthplace"].value_counts().to_json("historian_birthplace_count.json")
database['Historian Deathplace'].value_counts().to_json("historian_deathplace_count.json")
database['Keeper Place'].value_counts().to_json("keeper_place_count.json")

___


### Data analysis questions (contd.):

7. Location frequency for places where historian birth == keeper location

In [6]:
# Select rows where 'Historian Birthplace Entity' is equal to 'Keeper Place Entity'. Export to JSON the 'Keeper Place' for each row

d = database.loc[database["Born_same_Keeper"] == True, ["Keeper Place"]].value_counts().to_dict()
# for some reason string values are exported as tuples, so extract just the string for the city
d = {k[0]: v for k, v in d.items()}
with open("born_same_keeper_places_frequency.json", "w+") as f:
    json.dump(d, f)

___

8. Keepers ordered by frequency

In [9]:
# Count values for 'Keeper' column
database['Keeper'].value_counts().to_json("keeper_frequency.json")

___

9. Keeper location frequency: how many keepers are in each unique location

In [30]:
keeper_locations_freq_dict = dict()

# select the columns 'Keeper', 'Keeper Place' and only pick unique occurrences of 'Keeper Place'
keeper_location_frequency = database[["Keeper", "Keeper Place"]].drop_duplicates(subset=['Keeper']).value_counts(subset="Keeper Place").to_dict()
for city in keeper_location_frequency.keys():
    frequency = keeper_location_frequency[city]
    # find all rows which have city as 'Keeper Place' and extract 'Keeper'
    keepers = database.loc[database["Keeper Place"] == city, "Keeper"].unique().tolist()
    keeper_locations_freq_dict[city] = {"frequency": frequency, "keepers": keepers}

with open('keepers_location_frequency.json', 'w+') as f:
    json.dump(keeper_locations_freq_dict, f)
    

___

10. Keeper instances frequency
11. Instance frequency based on location

In [57]:
# Create a table of only Keepers (unique), their location and what they're instances of
keepers_locations_instances = database[["Keeper", "Keeper Place", "Keeper Instances"]].drop_duplicates(subset=['Keeper']).dropna(subset="Keeper Instances").reset_index(drop=True)

# Create a set of all possible Keeper instances
instances = set()
for idx, k in keepers_locations_instances.iterrows():
    instances.update([i for i in k['Keeper Instances']])

instance_counts = dict()
for idx, keeper in keepers_locations_instances.iterrows():
    for instance in keeper['Keeper Instances']:
        instance_name = instance[1]
        # get doesn't raise exception if key doesn't exist
        instance_counts[instance_name] = instance_counts.get(instance_name, 0) + 1

with open('instance_frequency.json', 'w+') as f:
    json.dump(instance_counts, f)


In [69]:
# Create a dictionary with key 'Keeper Places' unique values and value a list of all instances for that place
keepers_instances_dict = dict()
for location in database['Keeper Place'].unique():
    all_location_instances = database.loc[database['Keeper Place'] == location, 'Keeper Instances'].dropna().tolist()
    flattened_instances = list(set([i for instance in all_location_instances for i in instance]))
    keepers_instances_dict[location] = flattened_instances

with open('keeper_instance_location_frequency.json', 'w+') as f:
    json.dump(keepers_instances_dict, f)

___

12. Keepers where their locations is not historian birthplace nor deathplace. 

In [91]:
# find rows where keeper location is different from historian birthplace and historian deathplace
keeper_locations_different = database.loc[(database["Born_same_Keeper"] == False) & (database['Died_same_Keeper'] == False), ["Keeper Place", "Keeper"]].drop_duplicates(subset="Keeper").reset_index(drop=True)

{'Bologna': 2,
 'Calcata': 1,
 'Camaiore': 1,
 'Florence': 4,
 'Heidelberg': 1,
 'Kansas City': 1,
 'Lombardy': 1,
 'Los Angeles': 1,
 'Lucca': 2,
 'Madrid': 1,
 'Manhattan': 1,
 'Mantua': 1,
 'Meleto Valdarno': 1,
 'Milan': 5,
 'Naples': 1,
 'New York City': 1,
 'Paris': 1,
 'Perugia': 1,
 'Pisa': 1,
 'Rome': 2,
 'Rovereto': 1,
 'Siena': 3,
 'Spoleto': 1,
 'Terni': 1,
 'Turin': 1,
 'Udine': 1,
 'Varese': 1,
 'Vatican City': 2,
 'Venice': 1,
 'Verona': 1}

___

13.	For the first 3 keepers of a frequency-ordered list how many historians born in the 1400/1500/1600/1700/1800/1900 are preserved there.
14.	For the first 3 keepers of a frequency-ordered list how many historians dead in the 1400/1500/1600/1700/1800/1900 are preserved there.

In [19]:
import pandas as pd
import numpy as np
import json

In [20]:
database = pd.read_pickle("06_db.pickle")

In [21]:
instances = database['Keeper Instances'].dropna().to_list()
# remove duplicates by casting to set
instances = set([i for instance in instances for i in instance])

In [22]:
# create a dictionary to keep score of centuries: one key for every century from 1400 to 1900
centuries_dict = {century: dict() for century in range(1400, 2000, 100)}

def is_instance_of(instance, instances):
    # many different datatypes are possible, so we need to check for all of them
    if isinstance(instances, tuple):
        return instance == instances
    elif isinstance(instances, list):
        return instance in instances
    else:
        return False

for century in centuries_dict.keys():
    for instance in instances:
        # find all rows where current 'instance' is in 'Keeper Instances'
        keeper_rows = database.loc[database['Keeper Instances'].apply(lambda x: is_instance_of(instance, x)), :]

        # filter rows based on 'Historian Birth': if it is a string and 'Historian Birth'.split('-')[0] == century, add 1 to the score
        century_rows = keeper_rows.loc[keeper_rows['Historian Birth'].apply(lambda x: isinstance(x, str) and int(x.split('-')[0]) >= century and int(x.split('-')[0]) < century + 100), :]

        if centuries_dict[century].get(instance[1]):
            centuries_dict[century][instance[1]] += len(century_rows)
        else:
            centuries_dict[century][instance[1]] = len(century_rows)

from pprint import pprint
pprint(centuries_dict)

# export to json for visualization
with open('centuries_historian_births_by_keeper_kind.json', 'w+') as fp:
    json.dump(centuries_dict, fp)


{1400: {'Accademia delle lettere': 0,
        'Tuscan museum of regional importance': 0,
        'United Nations Depository Library': 1,
        'academic library': 0,
        'academy of sciences': 1,
        'archive': 0,
        'art museum': 1,
        'bank': 0,
        'business': 0,
        'catalogue': 2,
        'church archive': 0,
        'cultural center': 0,
        'cultural institution': 0,
        'digital library': 2,
        'enterprise': 0,
        'former hospital': 0,
        'foundation': 0,
        'government organization': 0,
        'historical archive': 0,
        'library': 1,
        'library building': 0,
        'monument': 0,
        'municipal library': 0,
        'museum': 0,
        'national academy': 1,
        'national archives': 0,
        'national library': 1,
        'national museum': 0,
        'online database': 0,
        'open-access publisher': 0,
        'organization': 0,
        'public library': 0,
        'public university': 0,
   

In [24]:
# create a dictionary to keep score of centuries: one key for every century from 1400 to 1900
centuries_dict = {century: dict() for century in range(1400, 2000, 100)}

def is_instance_of(instance, instances):
    if isinstance(instances, tuple):
        return instance == instances
    elif isinstance(instances, list):
        return instance in instances
    else:
        return False

for century in centuries_dict.keys():
    for instance in instances:

        # find all rows where current 'instance' is in 'Keeper Instances'
        keeper_rows = database.loc[database['Keeper Instances'].apply(lambda x: is_instance_of(instance, x)), :]

        # filter rows based on 'Historian Birth': if it is a string and 'Historian Birth'.split('-')[0] == century, add 1 to the score
        century_rows = keeper_rows.loc[keeper_rows['Historian Death'].apply(lambda x: isinstance(x, str) and int(x.split('-')[0]) >= century and int(x.split('-')[0]) < century + 100), :]

        if centuries_dict[century].get(instance[1]):
            centuries_dict[century][instance[1]] += len(century_rows)
        else:
            centuries_dict[century][instance[1]] = len(century_rows)

from pprint import pprint
pprint(centuries_dict)

with open('centuries_historian_deaths_by_keeper_kind.json', 'w+') as fp:
    json.dump(centuries_dict, fp)


{1400: {'Accademia delle lettere': 0,
        'Tuscan museum of regional importance': 0,
        'United Nations Depository Library': 0,
        'academic library': 0,
        'academy of sciences': 1,
        'archive': 1,
        'art museum': 1,
        'bank': 0,
        'business': 0,
        'catalogue': 1,
        'church archive': 0,
        'cultural center': 1,
        'cultural institution': 0,
        'digital library': 1,
        'enterprise': 0,
        'former hospital': 0,
        'foundation': 0,
        'government organization': 0,
        'historical archive': 0,
        'library': 0,
        'library building': 0,
        'monument': 0,
        'municipal library': 0,
        'museum': 0,
        'national academy': 1,
        'national archives': 0,
        'national library': 0,
        'national museum': 0,
        'online database': 0,
        'open-access publisher': 0,
        'organization': 0,
        'public library': 0,
        'public university': 0,
   

___


# Location extraction
To perform a geographical visualization of the keepers' locations we extract the coordinates, entities and cities from our existing databases.

In [2]:
import pandas as pd
import numpy as np
import json

In [8]:
database = pd.read_pickle("06_db.pickle")
database.head()

Unnamed: 0,Full Name,Gender,Collection,Keeper,Historian Entity,Keeper Entity,Keeper Place Entity,Keeper Place,Historian Birth,Historian Death,Historian Deathplace Entity,Historian Deathplace,Historian Birthplace Entity,Historian Birthplace,Keeper Instances
0,Maria Accascina,female,,Comune di Palermo,Q98804253,Q81174665,Q2656,Palermo,1898-01-01,1979-01-01,Q2656,Palermo,Q2634,Naples,"[(Q166118, archive)]"
1,Leonardo Agostini,male,,Scuola Normale Superiore,Q1054161,Q672416,Q13375,Pisa,1593-09-18,1676-08-01,Q220,Rome,Q2402810,Boccheggiano,"[(Q3953379, superior graduate school in Italy)]"
2,Vittorio Alfieri,male,,Biblioteca Medicea Laurenziana,Q296244,Q856419,Q2044,Florence,1749-01-16,1803-10-08,Q2044,Florence,Q6122,Asti,"[(Q11834910, state public library), (Q684740, ..."
3,Giuseppe Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q18934975,Q1075580,Q2044,Florence,1836-04-29,1890-04-24,Q2044,Florence,Q2044,Florence,"[(Q33506, museum), (Q684740, real property), (..."
4,Leopoldo Alinari,male,Archivio Alinari,Museo Nazionale Alinari della Fotografia,Q16164590,Q1075580,Q2044,Florence,1832-01-01,1865-01-01,Q2044,Florence,Q2044,Florence,"[(Q94701721, Tuscan museum of regional importa..."


In [7]:
locations = pd.read_pickle("db_locations.pickle")
locations.head()

Unnamed: 0,Place Entity,Coordinates,Place,lat,lon,Latitude,Longitude,Place.1
0,Q2759,"12º38'13.92""E, 43º43'30.86""N",Urbino,"12º38'13.92""E","43º43'30.86""N",43.725239,12.6372,Urbino
1,Q13134,"12º54'47.88""E, 43º54'36.54""N",Pesaro,"12º54'47.88""E","43º54'36.54""N",43.91015,12.9133,Pesaro
2,Q34130,"14º31'60.00""E, 36º57'0.00""N",Vittoria,"14º31'60.00""E","36º57'0.00""N",36.95,14.533333,Vittoria
3,Q8621,"12º39'0.00""E, 42º33'60.00""N",Terni,"12º39'0.00""E","42º33'60.00""N",42.566667,12.65,Terni
4,Q29080,"11º28'60.00""E, 44º27'0.00""N",Ozzano dell'Emilia,"11º28'60.00""E","44º27'0.00""N",44.45,11.483333,Ozzano dell'Emilia


In [11]:
# Ignore rows without an entity, because they'll be without place and coordinates
keepers = database[['Keeper', 'Keeper Place', 'Keeper Place Entity']].dropna(subset=['Keeper Place Entity']).reset_index(drop=True)
keepers.head()

Unnamed: 0,Keeper,Keeper Place,Keeper Place Entity
0,Comune di Palermo,Palermo,Q2656
1,Scuola Normale Superiore,Pisa,Q13375
2,Biblioteca Medicea Laurenziana,Florence,Q2044
3,Museo Nazionale Alinari della Fotografia,Florence,Q2044
4,Museo Nazionale Alinari della Fotografia,Florence,Q2044


In [12]:
# Create Latitude and Longitude column on the keepers dataframe where the value is the 'Latitude' of the row in the locations dataframe where the 'Keeper Place Entity' is the same as the 'Keeper Place'
keepers['Latitude'] = keepers['Keeper Place Entity'].apply(lambda x: locations.loc[locations['Place Entity'] == x, 'Latitude'].values[0])
keepers['Longitude'] = keepers['Keeper Place Entity'].apply(lambda x: locations.loc[locations['Place Entity'] == x, 'Longitude'].values[0])
# Create Place column on the keepers dataframe where the value is the 'Place' of the row in the locations dataframe where the 'Keeper Place Entity' is the same as the 'Keeper Place'
keepers.head()

Unnamed: 0,Keeper,Keeper Place,Keeper Place Entity,Latitude,Longitude
0,Comune di Palermo,Palermo,Q2656,38.115658,13.361262
1,Scuola Normale Superiore,Pisa,Q13375,43.716667,10.4
2,Biblioteca Medicea Laurenziana,Florence,Q2044,43.771389,11.254167
3,Museo Nazionale Alinari della Fotografia,Florence,Q2044,43.771389,11.254167
4,Museo Nazionale Alinari della Fotografia,Florence,Q2044,43.771389,11.254167


In [13]:
keepers.to_csv('keepers_locations.csv', sep=";")