# City Networks - Full example

This notebook presents a full example of constructing, analysing and visualising city data from the CERL Thesaurus, from retrieving the data from the database to plotting it on an HTML map that can easily be shared.

Our goal is to create a map that shows places where people related to Göttingen were also active.

Author: Andreas Lüschow

Last updated: 2021/09/28

-----

## Part 1 - Querying the database

First of all, we import some Python modules and packages to work with our data.

In [1]:
import json  # for reading/writing JSON data
import os  # for file system manipulation (i.e., creating files etc.)

from cerl import ample_query, ample_record, ids_from_result, by_dot, the  # CERL specific python library, see https://pypi.org/project/cerl/

At the beginning of each Python script, it's a good idea to define the variables that are used throughout the script. This makes it easy to find and change our parameters.

In [2]:
# the file path
DOWNLOAD_PATH = "./data/ct.json"  # data from the CERL Thesaurus will be downloaded into this file

# create the directory in the download path if it does not already exist
os.makedirs(os.path.dirname(DOWNLOAD_PATH), exist_ok=True)
    
# the output data    
download_data = {}  # a python dictionary that will contain the data we collect    

# the database used
DATABASE = 'data.cerl.org/thesaurus'

# the search query used
GOETTINGEN_ID = "cnl00029316"
QUERY = f"related_to:{GOETTINGEN_ID} AND type:cnp"  # i.e., get people records related to Göttingen

Next, we have to make a connection to the database, send a search query and handle the retrieved data sets afterwards. This may take a minute or two.

__As the search query defined above shows, we are interested in people who were active in Göttingen. From these person records we will collect all information about places of activity. We will then construct a network of cities.__

In [3]:
# Connect to the CERL Thesaurus and run the search query
result = ample_query(DATABASE, QUERY)

In [4]:
# iterate over the search results
for idx in ids_from_result(result):
    record = ample_record(DATABASE, idx)  # get the record as a python dictionary
    cid = the(by_dot(record, '_id'))  # access the record by dot notation (see https://pypi.org/project/cerl/ for more explanation)
    assert cid == idx  # just to make sure the correct record was downloaded

    download_data[cid] = {"515": {"a": []}}  # add information about the record ID, the field and subfield to the output dictionary
    download_data[cid]["515"]["a"].append(GOETTINGEN_ID)

    # iterate over places in the record
    for places in by_dot(record, "data.place"):
        for place in places:
            try:
                type_place = place["typeOfPlace"]  # try to get type of place
                place_id = place["id"]  # try to get place identifier
            except (KeyError, TypeError):
                continue  # skip entries without a place type or identifier

            # add the place ID to the output dictionary if it is a place of activity
            if type_place == "actv" and place_id is not None:
                download_data[cid]["515"]["a"].append(place_id)

    # safeguard: remove entries with an empty list
    # (because GOETTINGEN_ID is always added above, the list is never empty here,
    # so records without any place of activity are kept with Göttingen as their only place)
    if len(download_data[cid]["515"]["a"]) == 0:
        del download_data[cid]
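
To make the filtering step above easier to follow in isolation, here is a minimal sketch that applies the same rule to a hand-made list of place entries. The `sample_places` data and the helper name `activity_place_ids` are hypothetical; only the keys `typeOfPlace` and `id` and the type code `actv` are taken from the loop above.

```python
def activity_place_ids(places):
    """Return the IDs of all entries marked as places of activity ("actv")."""
    ids = []
    for place in places:
        # .get() returns None for missing keys, so incomplete entries are skipped
        if place.get("typeOfPlace") == "actv" and place.get("id") is not None:
            ids.append(place["id"])
    return ids


# Hypothetical place entries, shaped like the dictionaries accessed above
sample_places = [
    {"typeOfPlace": "actv", "id": "cnl00009244"},
    {"typeOfPlace": "actv"},                       # no identifier: skipped
    {"id": "cnl00010570"},                         # no place type: skipped
    {"typeOfPlace": "actv", "id": "cnl00016122"},
]

print(activity_place_ids(sample_places))
```

Using `.get()` instead of a `try` block is only one possible design; it keeps the rule for skipping incomplete entries visible in a single condition.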

We can now save the data to a JSON file.

In [5]:
with open(DOWNLOAD_PATH, "w", encoding="utf-8") as f:
    json.dump(download_data, f, indent=4)

Let's have a look at a part of the data we just created.

In [6]:
with open(DOWNLOAD_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)
    
# show five entries from the data
for x in list(data)[:5]:
    print(x, data[x])

cnp01074965 {'515': {'a': ['cnl00029316', 'cnl00009244', 'cnl00029316', 'cnl00010570', 'cnl00006169', 'cnl00016122']}}
cnp02240049 {'515': {'a': ['cnl00029316', 'cnl00016078', 'cnl00000707', 'cnl00016122']}}
cnp02244016 {'515': {'a': ['cnl00029316']}}
cnp01498318 {'515': {'a': ['cnl00029316', 'cnl00009564']}}
cnp02317375 {'515': {'a': ['cnl00029316', 'cnl00029316', 'cnl00010570']}}
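
The sample output shows that an ID can occur more than once in a record's list (for example `cnl00029316` in the first entry), because the Göttingen ID is added first and may appear again among the places of activity. Whether such repeats matter in the later steps is not checked here; if they are unwanted, one possible way to drop them while keeping the original order is sketched below. The helper name `unique_in_order` is an assumption made for illustration.

```python
def unique_in_order(items):
    """Remove repeated items but keep the order of first occurrence."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# the first entry from the output above
places = ["cnl00029316", "cnl00009244", "cnl00029316",
          "cnl00010570", "cnl00006169", "cnl00016122"]

print(unique_in_order(places))
```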


-----

## Part 2 - Creating an edge list

Next, we create a graph representation of our data, i.e., a format where nodes and edges are defined. We use the Python library _Bibliometa_ to this end.

In [7]:
import pandas as pd  # python library for handling tabular data

from bibliometa.graph.conversion import JSON2EdgeList
from bibliometa.graph.similarity import Similarity

Let's define some paths where our input and output data will be located.

In [8]:
SIMILARITY_PATH = "./data/similarity.csv"
GRAPH_CORPUS_PATH = "./data/graph_corpus.json"
LOG_PATH = "./data/logs/j2e.log"

The Bibliometa class `JSON2EdgeList` needs some configuration to work properly.

In [9]:
# fields from the input data that will be considered
FIELDS = [("515", "a")]

# whether keys and values from the input data will be switched
# since we downloaded person records but want to create a network of places, we set this parameter to True
SWAP = True  

# Similarity functions that will be used in graph creation.
# Two nodes (i.e., cities) are only considered connected if their similarity is greater than zero.
# See the Bibliometa documentation for more information about this.
# The following similarity function considers two cities connected if they occur together in at least one person record.
SIM_FUNCTIONS = [
    {"name": "mint_1",
     "function": Similarity.Functions.mint,
     "args": {
         "f": lambda a, b: len(list(a.intersection(b))),
         "t": 1}
     }
]
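
As a rough illustration of what the `SWAP` setting and the intersection-based similarity are meant to achieve, the following sketch inverts a person-to-places mapping into a place-to-persons mapping and then counts the shared persons for each pair of places. This is plain Python written for this explanation, not Bibliometa's actual implementation; the helper names `swap_mapping` and `shared_counts` and the person IDs `p1` to `p3` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def swap_mapping(person_to_places):
    """Turn {person: [places]} into {place: {persons}}."""
    place_to_persons = defaultdict(set)
    for person, places in person_to_places.items():
        for place in places:
            place_to_persons[place].add(person)
    return dict(place_to_persons)


def shared_counts(place_to_persons):
    """Count shared persons for every pair of places; keep pairs with at least one."""
    edges = {}
    for a, b in combinations(sorted(place_to_persons), 2):
        shared = len(place_to_persons[a] & place_to_persons[b])
        if shared > 0:
            edges[(a, b)] = shared
    return edges


# Hypothetical person records (p1..p3) listing place IDs
persons = {
    "p1": ["cnl00029316", "cnl00009244"],
    "p2": ["cnl00029316", "cnl00009244", "cnl00010570"],
    "p3": ["cnl00029316"],
}

edges = shared_counts(swap_mapping(persons))
for pair, count in edges.items():
    print(pair, count)
```

Because the places are stored as sets of persons, repeated place IDs within one person record do not inflate the counts in this sketch.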

We can now run the JSON to edge list conversion.

In [10]:
j2e = JSON2EdgeList()

j2e.set_config(i=DOWNLOAD_PATH,
               o=SIMILARITY_PATH,
               create_corpus=True,
               corpus=GRAPH_CORPUS_PATH,
               log=LOG_PATH,
               fields=FIELDS,
               sim_functions=SIM_FUNCTIONS,
               swap=SWAP
               )
j2e.start()

The edge list looks as follows. Columns 1 and 2 contain the IDs of two connected nodes (i.e., places); column 3 contains the number of connections between these nodes.

In [11]:
with open(SIMILARITY_PATH, "r", encoding="utf-8") as f:
    df = pd.read_csv(f, sep="\t", header=None, index_col=0)
df[:10]

| 0 | 1 | 2 | 3 |
|------|-------------|-------------|---|
| 339 | cnl00008232 | cnl00009858 | 1 |
| 514 | cnl00008232 | cnl00029316 | 1 |
| 596 | cnl00029794 | cnl00015003 | 1 |
| 1052 | cnl00029794 | cnl00029316 | 1 |
| 1097 | cnl00029167 | cnl00033301 | 1 |
| 1111 | cnl00029167 | cnl00027430 | 1 |
| 1192 | cnl00029167 | cnl00016003 | 1 |
| 1260 | cnl00029167 | cnl00029175 | 1 |
| 1516 | cnl00029167 | cnl00032895 | 1 |
| 1589 | cnl00029167 | cnl00029316 | 1 |


-----

## Part 3 - Graph Analysis

Based on the edge list, we can run some graph analysis algorithms. This will also create a graph representation in GraphML format, which we will use later on.

In [12]:
# imports
from bibliometa.graph.analysis import GraphAnalysis

In [13]:
SIMILARITY_FILE = "./data/similarity.tar.gz"
OUTPUT_FILE = "./data/graph_analysis.txt"
IMG_FOLDER = "./data/img/"
LOG_PATH = "./data/logs/ga.log"

CREATE_GRAPHML = True
GRAPHML_FILE = "./data/graphml.graphml"

NODES = "cities"  # name for nodes
EDGES = "similarity"  # name for edges
SIMILARITY_FUNCTION = "mint_1"  # name of similarity function used in graph creation
SIMILARITY_FUNCTIONS_ALL = ["mint_1"]  # list of available similarity functions (see previous conversion step from JSON to edge list)
THRESHOLD = 0  # threshold of similarity function
WEIGHTED = True  # whether similarity function is weighted

# analyses that will be used
TYPES = ["node_count",
         "edge_count",
         "component_count",
         "max_component"]

Run the analysis.

In [14]:
ga = GraphAnalysis()

ga.set_config(i=SIMILARITY_FILE,
              o=OUTPUT_FILE,
              create_graphml=CREATE_GRAPHML,
              graphml=GRAPHML_FILE,
              img=IMG_FOLDER,
              log=LOG_PATH,
              n=NODES,
              e=EDGES,
              sim=SIMILARITY_FUNCTION,
              sim_functions=SIMILARITY_FUNCTIONS_ALL,
              t=THRESHOLD,
              weighted=WEIGHTED,
              types=TYPES
              )
ga.start()
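
The four names in `TYPES` suggest basic graph statistics. Assuming they stand for the number of nodes, the number of edges, the number of connected components and the size of the largest component (an interpretation of the names, not something verified against the Bibliometa documentation), these values could be computed from a small edge list as follows. The function `basic_stats` and the sample edges are hypothetical.

```python
from collections import defaultdict


def basic_stats(edges):
    """Compute simple statistics for an undirected graph given as (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    component_sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # depth-first search collects one connected component
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        component_sizes.append(size)

    return {
        "node_count": len(adjacency),
        "edge_count": len({frozenset(e) for e in edges}),
        "component_count": len(component_sizes),
        "max_component": max(component_sizes, default=0),
    }


# Hypothetical edge list: one component of three places, one of two
sample_edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(basic_stats(sample_edges))
```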

-----

## Part 4 - Network creation

With the GraphML file created in the previous step, we can now draw a network of cities and their connections.


In [15]:
import networkx as nx  # a library for network modeling and analysis
import osmnx as ox  # a library for geospatial networks, used here to plot the network on a map

At this point, we have to decide for which city we want to draw the network. Since we downloaded records related to Göttingen, it's reasonable to draw a network around Göttingen.

In [16]:
COORDINATES = "../data/lng_lat.csv"  # path to a CSV file with longitude/latitude information 
KEYS_LABELS = ("id", "city")  # needed for parsing the coordinates file

ego = GOETTINGEN_ID  # CERL Thesaurus ID for Göttingen 
EGO_DEPTH = 1  # degree of neighbor nodes considered in the network
EGO_FILE = "./data/ego.graphml"  # a network centered around the "ego" city
HTML_FILE = "./data/network.html"  # output HTML representation of the network

CRS = "epsg:4326"  # coordinate reference system
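
`EGO_DEPTH` controls how far the network extends from the ego city: with a depth of 1, only direct neighbours of Göttingen are kept. The idea can be sketched in plain Python with a breadth-first search. This is an illustration of the concept with a hypothetical helper `ego_nodes` and made-up node names and connections, not the code used in the following cells.

```python
from collections import deque


def ego_nodes(adjacency, ego, depth):
    """Return all nodes reachable from `ego` in at most `depth` steps."""
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Made-up undirected graph as an adjacency mapping
adjacency = {
    "Goettingen": ["Kassel", "Hannover"],
    "Kassel": ["Goettingen", "Marburg"],
    "Hannover": ["Goettingen"],
    "Marburg": ["Kassel"],
}

print(sorted(ego_nodes(adjacency, "Goettingen", 1)))
print(sorted(ego_nodes(adjacency, "Goettingen", 2)))
```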

The following code is a bit longer than in the previous steps, so we split it into three parts.

In [17]:
# function for getting position of graph nodes
def create_pos(graph, df, key):
    """Map each node to [lng, lat]; nodes not found in df get [None, None]."""
    pos = {}
    for n in graph.nodes:
        row = df[df[key] == n]  # rows of the coordinates table that match the node ID
        try:
            pos[n] = [float(row["lng"].iloc[0]), float(row["lat"].iloc[0])]
        except (IndexError, TypeError, ValueError):
            pos[n] = [None, None]  # no matching row or no usable coordinates
    return pos
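
Filtering the DataFrame once per node works, but for larger graphs a single pass over the coordinates table may be easier to reason about. The sketch below builds the same kind of lookup with the standard `csv` module. It assumes a tab-separated file with the columns `id`, `city`, `lat` and `lng`, which is what the code in this part reads; the sample rows, the coordinate values and the helper name `load_positions` are made up for illustration.

```python
import csv
import io


def load_positions(handle):
    """Read tab-separated rows and return {id: [lng, lat]} for valid rows."""
    positions = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            positions[row["id"]] = [float(row["lng"]), float(row["lat"])]
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric coordinates
    return positions


# Made-up sample data in the assumed column layout
sample = io.StringIO(
    "id\tcity\tlat\tlng\n"
    "place_a\tCity A\t51.5\t9.9\n"
    "place_b\tCity B\t\t\n"
)

positions = load_positions(sample)
print(positions.get("place_a"))
print(positions.get("place_b"))
```

Nodes that are missing from the resulting dictionary can then be treated like the `[None, None]` entries returned by `create_pos`.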

In [18]:
# read GraphML file
g = nx.read_graphml(GRAPHML_FILE)

# create ego network around ego node
g = nx.ego_graph(g, ego, radius=EGO_DEPTH)

# read coordinates
df = pd.read_csv(COORDINATES, sep="\t")

# create positions for nodes in the ego network
pos = create_pos(g, df, KEYS_LABELS[0])

# attach coordinates to the nodes and remove nodes without coordinate information
for node, (x, y) in pos.items():
    if x is not None and y is not None:  # explicit check, so that a coordinate of 0.0 is kept
        g.nodes[node]['x'] = x
        g.nodes[node]['y'] = y
    else:
        g.remove_node(node)

# save ego network as GraphML
nx.write_graphml(g, EGO_FILE)

In [19]:
import folium

# create MultiDiGraph from ego network
G = nx.MultiDiGraph(crs=CRS)
for n in g.nodes:
    G.add_node(n, x=g.nodes[n]["x"], y=g.nodes[n]["y"])

for e in g.edges(data=True):
    if ego in e:
        G.add_edge(e[0], e[1], weight=e[2]["weight"])

# remove nodes without edges
G.remove_nodes_from(list(nx.isolates(G)))

# plot ego network on map and save as HTML
orange = "#fe7830"
orange_2 = "#fe7877"
green = "green"

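# note: the ox.folium module was removed in OSMnx 2.0, so this call requires an older OSMnx release
m = ox.folium.plot_graph_folium(G, color=orange, weight=0.7, opacity=0.7)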
factor = 250

for n in G.nodes(data=True):
    coords = [n[1]["y"], n[1]["x"]]  # city coordinates
    if n[0] != ego:  # exclude ego node
        tooltip = df[df["id"] == n[0]]["city"].values[0]  # city name from coordinates table
        radius = G.get_edge_data(ego, n[0], default=G.get_edge_data(n[0], ego))[0]["weight"] * factor
        popup = f"<strong>{tooltip}</strong><br>{int(radius/factor)} person(s)"
        folium.Circle(location=coords, radius=radius, popup=popup, fill=True, color=orange_2, 
                      fill_color=orange_2).add_to(m)
    else:  # the ego node itself
        tooltip = df[df["id"] == ego]["city"].values[0]  # city name from coordinates table
        try:
            # only succeeds if the ego node has an edge to itself
            radius = G.get_edge_data(ego, n[0], default=G.get_edge_data(n[0], ego))[0]["weight"] * factor
            popup = f"<strong>{tooltip}</strong><br>{int(radius/factor)} person(s)"
        except (TypeError, KeyError):
            popup = f"<strong>{tooltip}</strong>"
            radius = 1
        folium.Circle(location=coords, radius=radius, popup=popup, fill=True, color=green,
                      fill_color=green).add_to(m)


# note: the map is saved under the ego ID; the HTML_FILE constant defined above is not used
html_file = f"./data/{GOETTINGEN_ID}.html"
m.save(html_file)

Opening the created HTML file in a browser, we can see the network of people related to Göttingen and the places where these people were also active.
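
The circle size in the last cell is derived directly from the edge weight: the weight is multiplied by `factor` to obtain the radius passed to `folium.Circle` (which interprets it in meters), and the popup recovers the count by dividing again. A small helper makes this relationship explicit without the round trip. The function name `circle_style` and the example city are hypothetical; the factor of 250 and the popup format are copied from the cell above.

```python
def circle_style(city, weight, factor=250):
    """Return the radius and popup text for a city with the given edge weight."""
    radius = weight * factor
    popup = f"<strong>{city}</strong><br>{int(weight)} person(s)"
    return radius, popup


# made-up example: a city connected to the ego city with weight 3
radius, popup = circle_style("Kassel", 3)
print(radius)
print(popup)
```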

## Concluding remarks

The workflow presented above can be adjusted in many ways.

For example, we could take other definitions of "relatedness" into account. The workflow above queries the CERL Thesaurus for people related to Göttingen and collects their places of activity. Instead of places of activity, we could just as well look at places of birth or death. This would let us visualise where people who were active in Göttingen came from or went to.
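
One way to prepare for such variations is to collect the place IDs for every place type in a single pass and choose the type of interest afterwards. In the sketch below, only the keys `typeOfPlace` and `id` and the code `actv` come from the workflow above; the helper `places_by_type` and the second type code `other` are placeholders, since the codes the CERL Thesaurus uses for places of birth or death are not shown in this notebook.

```python
from collections import defaultdict


def places_by_type(places):
    """Group place IDs by their type of place."""
    grouped = defaultdict(list)
    for place in places:
        place_type = place.get("typeOfPlace")
        place_id = place.get("id")
        if place_type is not None and place_id is not None:
            grouped[place_type].append(place_id)
    return dict(grouped)


# Hypothetical entries; "other" is a placeholder type code
sample_places = [
    {"typeOfPlace": "actv", "id": "cnl00029316"},
    {"typeOfPlace": "other", "id": "cnl00009244"},
    {"typeOfPlace": "actv", "id": "cnl00010570"},
]

print(places_by_type(sample_places))
```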

Likewise, we could run the scripts for a number of cities, not only Göttingen, to allow for a comparison of different networks. We could also create networks for certain time periods by taking into account only those person records whose activity, birth or death dates fall within a certain interval.
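
Restricting the network to a time period amounts to filtering the person records before the edge list is created. The sketch below assumes that each record has already been reduced to a start and end year of activity; the field names `start` and `end`, the sample records and the helper `active_in_interval` are hypothetical, and extracting such years from CERL Thesaurus records is not covered here.

```python
def active_in_interval(records, first_year, last_year):
    """Keep records whose activity period overlaps [first_year, last_year]."""
    selected = {}
    for record_id, info in records.items():
        start, end = info.get("start"), info.get("end")
        if start is None or end is None:
            continue  # no usable dates: leave the record out
        if start <= last_year and end >= first_year:
            selected[record_id] = info
    return selected


# Hypothetical, already simplified person records
records = {
    "person_1": {"start": 1740, "end": 1770},
    "person_2": {"start": 1800, "end": 1830},
    "person_3": {"start": None, "end": None},
}

print(sorted(active_in_interval(records, 1750, 1799)))
```

Records without dates are dropped in this sketch; depending on the research question, keeping them in a separate group might be preferable.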

-----