# Tutorial: JSON2EdgeList

The _JSON2EdgeList_ class is used to transform data from JSON files to an edge list graph representation by using similarity functions.

Author: Andreas Lüschow

Last updated: 2021/07/28

-----

## Import

Import the appropriate class from __Bibliometa__'s _graph_ module:

In [1]:
from bibliometa.graph.conversion import JSON2EdgeList

As you can see from the following output, the _JSON2EdgeList_ class has a lot of built-in functions:

In [2]:
dir(JSON2EdgeList)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 'get_config',
 'set_config',
 'start']

We are only interested in the public methods, so let's have a look at them:

In [3]:
[m for m in dir(JSON2EdgeList) if not m.startswith('_')]

['get_config', 'set_config', 'start']

Using the _JSON2EdgeList_ class is quite simple: there are two methods for working with the class configuration, and only one method to actually start the conversion process.

-----

## Configuration

Most __Bibliometa__ classes come with a predefined configuration for their class attributes. You can see the default configuration by calling the _get_config()_ function on a new _JSON2EdgeList_ object. So let's create an object first:

In [4]:
j2e = JSON2EdgeList()

### Input, output, graph corpus

And now let's have a look at the default configuration values:

In [5]:
j2e.get_config()

('i', None)
('o', None)
('create_corpus', False)
('corpus', None)
('name', '')
('fields', None)
('swap', False)
('sim_functions', None)
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', None)
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

As a shortcut, you can also simply evaluate the object itself, which displays a representation of the configuration values:

In [6]:
j2e

('i', None)
('o', None)
('create_corpus', False)
('corpus', None)
('name', '')
('fields', None)
('swap', False)
('sim_functions', None)
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', None)
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

There are a lot of configuration options, so let's go through them step by step.

__i__ (str): Input JSON file
* Path to the JSON file that will be converted to an edge list.

__o__ (str): Output similarity file
* Path to the CSV file that will be created by calculating similarities between data sets. If the path contains folders that do not exist yet, they will be created during the conversion process.

__create_corpus__ (bool): Whether a graph corpus is created
* If this parameter is set to True, a graph corpus will be created during the conversion process. See further below in this tutorial for an explanation of the graph corpus.

__corpus__ (str): Graph corpus file
* Path to the graph corpus JSON file that will be created only if _create_corpus_ == True. If the path contains folders that do not exist yet, they will be created during the conversion process.

__name__ (str): Name for single conversion step
* If the conversion from JSON to an edge list representation is conducted on more than one file, you somehow have to make sure that for each file a separate similarity file (and probably graph corpus) is created. Using the _name_ parameter allows you to assign each file a unique identifier. For example, if you created your input JSON files using `bibliometa.conversion.CSV2JSON`, the year information inside the output file names may be used as unique identifier for the JSON2EdgeList conversion. See below for an example.

__fields__ (list of tuples): Fields and subfields to consider
* This parameter defines which field/subfield combinations from the input JSON file are considered in creating a graph corpus. The cumulative content of all fields specified in this parameter will be used as representation for a single data set, i.e., if you specify two or more fields, all values in these fields are collected and the set of these values will be used to calculate similarities between two data sets.
    
__swap__ (bool): Whether keys become nodes or keys become values
* If _swap_ == True, keys from the input JSON file will become values in the graph corpus. For example, let's assume your input JSON data represents metadata records and the keywords used in them. Keys are record IDs and values are the keywords. If _swap_ == True, those keywords will become keys in the graph corpus (which later will be the nodes in the graph) and the record IDs will become the values of these keyword keys. If _swap_ == False, keys from the input JSON will also be keys in the graph corpus. See below for an example.
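To make the _fields_ parameter more concrete, here is a minimal sketch in plain Python. It is not __Bibliometa__'s implementation: the helper name `collect_values` and the second field "551" are made up for illustration, and we assume that records are nested as field → subfield → list of values (the shape the demo data has later in this tutorial). The sketch shows how the values of several field/subfield combinations could be pooled into one set per data set.

```python
# Hypothetical helper, not part of Bibliometa: pool the values of all
# requested (field, subfield) combinations of one record into a single set.
def collect_values(record, fields):
    values = set()
    for field, subfield in fields:
        # Missing fields or subfields simply contribute nothing.
        values.update(record.get(field, {}).get(subfield, []))
    return values


# One record in the assumed input shape; field "551" is invented.
record = {
    "515": {"a": ["Bayreuth", "Schleusingen"]},
    "551": {"a": ["Bayreuth", "Halle"]},
}

one_field = collect_values(record, [("515", "a")])
two_fields = collect_values(record, [("515", "a"), ("551", "a")])

print(sorted(one_field))   # ['Bayreuth', 'Schleusingen']
print(sorted(two_fields))  # ['Bayreuth', 'Halle', 'Schleusingen']
```

Because the values are collected in a set, a value that occurs in more than one field (here "Bayreuth") is counted only once.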

At this point, let's try to change a configuration parameter using the _set_config()_ function. After each function call the current configuration is printed out automatically to check if your changes worked as expected:

In [7]:
j2e.set_config(create_corpus=True)

('i', None)
('o', None)
('create_corpus', True)
('corpus', None)
('name', '')
('fields', None)
('swap', False)
('sim_functions', None)
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', None)
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

Calling the _set_config()_ function with keyword arguments allows you to change the configuration parameters according to your needs. Since it is a bit cumbersome to find individual parameters in the output above, you can also pass one or more parameter names as strings to the _get_config()_ function to check specific configuration parameters:

In [8]:
j2e.get_config("create_corpus")

('create_corpus', True)

In [9]:
j2e.get_config("i", "o", "corpus") 

('i', None)
('o', None)
('corpus', None)

As you can see, working with configuration parameters is quite simple.

In [10]:
j2e.get_config("i")

('i', None)

In [11]:
j2e.set_config(i="../data/my_own_data.csv")
j2e.get_config("i")

('i', '../data/my_own_data.csv')

Actually, if you know the parameter you want to change, you can also set and get configuration parameters using a dot notation. This is the preferred way if you need to change or access only a single parameter value, since the output does not include the parameter key itself:

In [12]:
j2e.config.i = "../data/my_very_own_data.csv"
j2e.config.i

'../data/my_very_own_data.csv'

However, if you need to change or access more than one configuration parameter, using the _set_config()_ and _get_config()_ functions is the way to go.

-----

## Graph corpus

Let's see how a graph corpus is created by using different configuration options.

We use a demo file that can be created using the _CSV2JSON_ tutorial (but that is also already available in the "examples" folder). First, we check our configuration:

In [13]:
j2e.get_config()

('i', '../data/my_very_own_data.csv')
('o', None)
('create_corpus', True)
('corpus', None)
('name', '')
('fields', None)
('swap', False)
('sim_functions', None)
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', None)
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

The parameter _create_corpus_ is already set to True. We want to see what happens if we do not create a corpus but try to start the conversion directly. So we have to change the input file, output file, and the _create_corpus_ parameter:

In [14]:
j2e.config.i = "../data/examples/demo_1700.json"
j2e.config.o = "../data/output/similarity/similarity.csv"
j2e.config.log = "../data/logs/json2edgelist_demo.out"
j2e.config.create_corpus = False
j2e.config.i, j2e.config.create_corpus

('../data/examples/demo_1700.json', False)

Now let's start the JSON2EdgeList conversion:

In [15]:
j2e.start()

TypeError: expected str, bytes or os.PathLike object, not NoneType

As the error message suggests, there is no graph corpus available. Since the graph corpus is the basis on which the similarity calculation is conducted, you have to create a graph corpus first. In subsequent runs you may skip the graph corpus creation if there were no changes to your input data and you only want to re-run the similarity calculation.

But for now, we have to create a corpus for the first time.

In [16]:
j2e.config.create_corpus = True
j2e.config.corpus = "../data/output/graph_corpus/graph_corpus.json"
j2e.start()

TypeError: 'NoneType' object is not iterable

Again, you can get some valuable information from the error message. The _fields_ parameter in our configuration is not set, so that the conversion does not know which fields to process. Let's look into the input data to find out which fields are available.

In [17]:
import json

with open(j2e.config.i, "r", encoding=j2e.config.encoding) as f:
    d = json.load(f)

for i in list(d.items())[:5]:
    print(i)

('cnp01287518', {'515': {'a': ['Breslau']}})
('cnp01287801', {'515': {'a': ['Helmstedt']}})
('cnp01417221', {'515': {'a': ['Bayreuth', 'Schleusingen']}})
('cnp01418335', {'515': {'a': ['Halle']}})
('cnp01289912', {'515': {'a': ['Venedig']}})


It seems like there is only one field/subfield combination in our data: "515", subfield "a". But we should check this more systematically:

In [18]:
combinations = []
for k, v in d.items():
    for field in v.keys():
        for subfield in v[field].keys():
            combinations.append((field, subfield))
set(combinations)

{('515', 'a')}
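If you also want to know how often each combination occurs, a `collections.Counter` is a small variation of the loop above. The sketch below uses three records copied from the demo data so that it runs on its own; with the real file you would iterate over the loaded dictionary instead.

```python
from collections import Counter

# A few records copied from the demo data shown above.
sample_records = {
    "cnp01287518": {"515": {"a": ["Breslau"]}},
    "cnp01287801": {"515": {"a": ["Helmstedt"]}},
    "cnp01417221": {"515": {"a": ["Bayreuth", "Schleusingen"]}},
}

# Count in how many records each (field, subfield) combination occurs.
combination_counts = Counter(
    (field, subfield)
    for record in sample_records.values()
    for field, subfields in record.items()
    for subfield in subfields
)

print(combination_counts)  # Counter({('515', 'a'): 3})
```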

There is indeed only this single field/subfield combination in the data. We use this to configure our _JSON2EdgeList_ object accordingly:

In [19]:
j2e.config.fields = [("515", "a")]
j2e.config.fields

[('515', 'a')]

And now let's try again to start the graph corpus creation:

In [20]:
j2e.start()

  0%|          | 0/67 [00:00<?, ?it/s]

TypeError: 'NoneType' object is not iterable

It seems like nothing really happened and that our configuration is still missing a reasonable value for a mandatory parameter (called _sim_functions_). But this is not the case: actually, a graph corpus was created before the error message occurred. We can check this by looking into the file we specified in the configuration parameter _corpus_:

In [21]:
import os

os.listdir(os.path.dirname(j2e.config.corpus))

['graph_corpus.json']

Let's understand what is inside this file by looking at the first 10 elements:

In [22]:
with open(j2e.config.corpus, "r", encoding=j2e.config.encoding) as f:
    d = json.load(f)
list(d.items())[:10]

[('cnp01287518', ['Breslau']),
 ('cnp01287801', ['Helmstedt']),
 ('cnp01417221', ['Bayreuth', 'Schleusingen']),
 ('cnp01418335', ['Halle']),
 ('cnp01289912', ['Venedig']),
 ('cnp01286053', ['Leipzig']),
 ('cnp01937459', ['Bergisches Land']),
 ('cnp02047908', ['Pirna', 'Leipzig']),
 ('cnp01416622', ['Bad Frankenhausen']),
 ('cnp01286888', ['Weißenfels'])]

We can see that for each key present in the input JSON, a list of values was created. This is our graph corpus on which further analyses are based.

But what happens if we change the _swap_ parameter?

In [23]:
j2e.config.swap = True
j2e.start()

  0%|          | 0/552 [00:00<?, ?it/s]

  0%|          | 0/15 [00:00<?, ?it/s]

TypeError: 'NoneType' object is not iterable

We can ignore the error message again and look into the corpus file that was created. First of all, you can see that previous results get overwritten: there is still only one file in the corresponding folder (since we did not change the path to the corpus file in the _corpus_ parameter):

In [24]:
os.listdir(os.path.dirname(j2e.config.corpus))

['graph_corpus.json']

Indeed, the file content did change:

In [25]:
with open(j2e.config.corpus, "r", encoding=j2e.config.encoding) as f:
    d = json.load(f)
list(d.items())[:5]

[('Praha', ['cnp01332556']),
 ('Roma',
  ['cnp01396704',
   'cnp01390940',
   'cnp01402089',
   'cnp01337377',
   'cnp01331949',
   'cnp01331836']),
 ('Distrikt Angus', ['cnp00475273']),
 ('Poschiavo', ['cnp00100939']),
 ('Flatow', ['cnp01299679'])]

Keys and values have switched places. Now, for each value that was found in the input data (i.e., for each place name), a list of the record IDs in which that place occurs was compiled.

By using the _swap_ parameter you can thus create different graph corpora, depending on the "view" you want to have on your input data. This might come in handy in some cases.
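Conceptually, swapping means inverting a mapping from keys to lists of values. The following sketch shows this inversion in plain Python. The helper `invert_corpus` is hypothetical and only illustrates the idea; it is not taken from __Bibliometa__'s source code. The entries are copied from the unswapped graph corpus shown earlier.

```python
from collections import defaultdict

# Hypothetical helper, not Bibliometa's code:
# turn {key: [values]} into {value: [keys]}.
def invert_corpus(corpus):
    inverted = defaultdict(list)
    for key, values in corpus.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)


# Entries copied from the unswapped graph corpus shown earlier.
corpus = {
    "cnp01286053": ["Leipzig"],
    "cnp02047908": ["Pirna", "Leipzig"],
    "cnp01418335": ["Halle"],
}

swapped = invert_corpus(corpus)
print(swapped["Leipzig"])  # ['cnp01286053', 'cnp02047908']
```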

-----

## Similarity functions

Remember that we are now able to create a graph corpus from a single file, but that we still get an error message that says our configuration is missing a reasonable _sim_functions_ parameter.

We will now use predefined similarity functions to get rid of this error message. We will learn afterwards how we can define our own similarity functions to be applied on the graph corpora created in the previous step.

Finally, we will see how to run the conversion process on a list of files (e.g., created by the _CSV2JSON_ conversion).

We re-run the previous function to see the error message again. We also change the _swap_ parameter (only because this runs a bit faster).

In [26]:
j2e.config.swap = False
j2e.start()

  0%|          | 0/67 [00:00<?, ?it/s]

TypeError: 'NoneType' object is not iterable

We still get the error message that our configuration is missing a value for the parameter _sim_functions_. Why does this configuration parameter exist?

__sim_functions__ (list of dict): Similarity functions
* This parameter takes a list of dictionaries. Each dictionary contains the configuration for one similarity function. Example:

  `{
    'name': 'oned',
    'function': Similarity.Functions.mint,
    'args': {'f': 1, 't': 1}
  }`
    
    Each similarity function needs a "name" (which can be any string), a "function" (which needs to be defined either in the __Bibliometa__ package or elsewhere), and two arguments ("args") "f" and "t". The argument "f" defines the value that is returned when two sets a and b are considered similar; it can be a constant or a function whose result is returned. The argument "t" defines the threshold that the similarity has to reach for "f" to be returned. If the similarity is below t, a similarity of 0 will be returned.

We already have three similarity functions available in the __Bibliometa__ package:

In [27]:
from bibliometa.graph.similarity import Similarity
[m for m in dir(Similarity.Functions) if not m.startswith('_')]

['jaccard', 'mint', 'overlap']

Note that three predefined functions are available in the _Functions_ class which is a class __inside__ the _Similarity_ class: _jaccard()_, _mint()_, and _overlap()_.

You may be interested in the help text of these functions:

In [28]:
help(Similarity.Functions.jaccard)

Help on function jaccard in module bibliometa.graph.similarity:

jaccard(a, b, f, t=0)
    The Jaccard Index. a and b are considered similar if the size of their intersection divided by their
    union is greater than or equal to t.
    
    :param a: Set of values for item a
    :type a: `set`
    :param b: Set of values for item b
    :type b: `set`
    :param f: This value (or the result of this function) will be returned if similarity between
        a and b >= t
    :type f: function or `int`
    :param t: Threshold
    :type t: `int`
    :return: Similarity value
    :rtype: `float` or `int`
    :raise ValueError: If f is neither a function nor an `int` or `float`
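To illustrate how "f" and "t" interact, here is a small re-implementation written only from the help text above. It is a sketch, not __Bibliometa__'s source code: the name `jaccard_sketch`, the handling of two empty sets, and the assumption that a callable "f" receives the raw score are ours.

```python
# Sketch of a Jaccard-style similarity function following the help text above.
# Written for illustration only; this is not Bibliometa's source code.
def jaccard_sketch(a, b, f, t=0):
    union = a | b
    # Assumption: two empty sets get a score of 0.
    score = len(a & b) / len(union) if union else 0.0
    if score < t:
        return 0  # threshold not reached
    if callable(f):
        return f(score)  # e.g. round the score
    if isinstance(f, (int, float)):
        return f  # constant value
    raise ValueError("f must be a function, an int, or a float")


a = {"Bayreuth", "Schleusingen"}
b = {"Bayreuth", "Halle"}

rounded = jaccard_sketch(a, b, f=lambda s: round(s, 4))
below_threshold = jaccard_sketch(a, b, f=1, t=0.5)
above_threshold = jaccard_sketch(a, b, f=1, t=0.3)

print(rounded)          # 0.3333
print(below_threshold)  # 0
print(above_threshold)  # 1
```

The two sets share one of three distinct values, so the score is 1/3: it passes a threshold of 0.3 but not one of 0.5.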



Or maybe even the source code is of interest:

In [29]:
Similarity.Functions.jaccard??

For the moment, it is enough to know that you can use these predefined functions in defining your similarity functions. So let's set three functions in our configuration:

In [30]:
SIM_FUNCTIONS = [
    {"name": "mint_1",
     "function": Similarity.Functions.mint,
     "args": {
         "f": lambda a, b: len(list(a.intersection(b))),
         "t": 1}
     },
    {"name": "jacc",
     "function": Similarity.Functions.jaccard,
     "args": {
         "f": lambda a: round(a, 4),
         "t": 0}
     },
    {"name": "ovlp",
     "function": Similarity.Functions.overlap,
     "args": {
         "f": lambda a: round(a, 4),
         "t": 0}
     },
]
# The function "mint_1" will return the length of the intersection of a and b ("f")
# only if it is greater than or equal to 1 ("t" == threshold).
# For "jacc" and "ovlp", the function "f" is applied on the resulting value before it is returned.
# That means that the result of these functions is rounded to 4 digits after the decimal point.

In [31]:
j2e.config.sim_functions = SIM_FUNCTIONS
j2e.config.sim_functions

[{'name': 'mint_1',
  'function': <function bibliometa.graph.similarity.Similarity.Functions.mint(a, b, f, t=0)>,
  'args': {'f': <function __main__.<lambda>(a, b)>, 't': 1}},
 {'name': 'jacc',
  'function': <function bibliometa.graph.similarity.Similarity.Functions.jaccard(a, b, f, t=0)>,
  'args': {'f': <function __main__.<lambda>(a)>, 't': 0}},
 {'name': 'ovlp',
  'function': <function bibliometa.graph.similarity.Similarity.Functions.overlap(a, b, f, t=0)>,
  'args': {'f': <function __main__.<lambda>(a)>, 't': 0}}]

After having defined three similarity functions we can try to start the conversion again. Just to be clear about our configuration, we show it on screen first:

In [32]:
j2e.get_config()

('i', '../data/examples/demo_1700.json')
('o', '../data/output/similarity/similarity.csv')
('create_corpus', True)
('corpus', '../data/output/graph_corpus/graph_corpus.json')
('name', '')
('fields', [('515', 'a')])
('swap', False)
('sim_functions', [{'name': 'mint_1', 'function': <function Similarity.Functions.mint at 0x7fcf7b0e1940>, 'args': {'f': <function <lambda> at 0x7fcf79daedc0>, 't': 1}}, {'name': 'jacc', 'function': <function Similarity.Functions.jaccard at 0x7fcf7b0e19d0>, 'args': {'f': <function <lambda> at 0x7fcf79daec10>, 't': 0}}, {'name': 'ovlp', 'function': <function Similarity.Functions.overlap at 0x7fcf7b0e1a60>, 'args': {'f': <function <lambda> at 0x7fcf79daeb80>, 't': 0}}])
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', '../data/logs/json2edgelist_demo.out')
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

Looks fine. We can start the conversion:

In [33]:
j2e.start() 

  0%|          | 0/67 [00:00<?, ?it/s]

No error message! (hopefully)

-----

## Output files

So let's see if all output files were created as desired.

In [34]:
os.listdir(os.path.dirname(j2e.config.corpus))

['graph_corpus.json']

In [35]:
with open(j2e.config.corpus, "r", encoding=j2e.config.encoding) as f:
    d = json.load(f)
list(d.items())[:5]

[('cnp01287518', ['Breslau']),
 ('cnp01287801', ['Helmstedt']),
 ('cnp01417221', ['Bayreuth', 'Schleusingen']),
 ('cnp01418335', ['Halle']),
 ('cnp01289912', ['Venedig'])]

The graph corpus looks good. How about the output similarity file? Remember its definition in the configuration:

In [36]:
j2e.config.o

'../data/output/similarity/similarity.csv'

In [37]:
os.listdir(os.path.dirname(j2e.config.o))

['similarity.csv', 'similarity.tar.gz']

Wait, why are there two files?

To improve the efficiency of subsequent graph analysis, the _JSON2EdgeList_ conversion automatically creates a ".tar.gz" archive if you did not configure it otherwise. This can be switched on or off in the configuration:

In [38]:
j2e.get_config()

('i', '../data/examples/demo_1700.json')
('o', '../data/output/similarity/similarity.csv')
('create_corpus', True)
('corpus', '../data/output/graph_corpus/graph_corpus.json')
('name', '')
('fields', [('515', 'a')])
('swap', False)
('sim_functions', [{'name': 'mint_1', 'function': <function Similarity.Functions.mint at 0x7fcf7b0e1940>, 'args': {'f': <function <lambda> at 0x7fcf79daedc0>, 't': 1}}, {'name': 'jacc', 'function': <function Similarity.Functions.jaccard at 0x7fcf7b0e19d0>, 'args': {'f': <function <lambda> at 0x7fcf79daec10>, 't': 0}}, {'name': 'ovlp', 'function': <function Similarity.Functions.overlap at 0x7fcf7b0e1a60>, 'args': {'f': <function <lambda> at 0x7fcf79daeb80>, 't': 0}}])
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', '../data/logs/json2edgelist_demo.out')
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

__archive__ (bool): Whether the similarity file is put into an archive
* If this parameter is True, the similarity CSV file is also put into an archive.

__archive_ext__ (str): File extension of archive
* This is the type of archive that will be used. This parameter has no effect if _archive_ == False. Note: Currently only ".tar.gz" archives are supported!

__csv_sep__ (str): CSV separator
* CSV separator used in output similarity CSV file
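If you are curious what such an archive contains, the following sketch packs a tab-separated file into a ".tar.gz" archive with Python's standard `tarfile` module and reads it back without unpacking it to disk. This is only an illustration of the file format under our own assumptions (temporary folder, a single made-up row layout); it does not show how __Bibliometa__ creates or reads its archives.

```python
import csv
import io
import tarfile
import tempfile
from pathlib import Path

# Illustration only: pack a tab-separated file into a .tar.gz archive and
# read it back directly from the archive. All paths are temporary.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "similarity.csv"
    csv_path.write_text(
        "284\tcnp01287518\tcnp01130525\t1\t0.2\t1.0\n", encoding="utf-8"
    )

    # "similarity.csv" -> "similarity.tar.gz"
    archive_path = csv_path.with_suffix(".tar.gz")
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(csv_path), arcname=csv_path.name)

    archive_name = archive_path.name
    with tarfile.open(str(archive_path), "r:gz") as tar:
        member = tar.extractfile(csv_path.name)
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        rows = list(csv.reader(text, delimiter="\t"))

print(archive_name)  # similarity.tar.gz
print(rows)
```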

The output CSV looks as follows:

In [39]:
import pandas as pd

with open(j2e.config.o, "r", encoding=j2e.config.encoding) as f:
    df = pd.read_csv(f, sep=j2e.config.csv_sep, header=None, index_col=0)
    
df.head()

0,1,2,3,4,5
284,cnp01287518,cnp01130525,1,0.2,1.0
380,cnp01287518,cnp00654099,1,0.5,1.0
414,cnp01287518,cnp00523315,1,1.0,1.0
415,cnp01287518,cnp00682190,1,0.3333,1.0
437,cnp01287518,cnp00649657,1,1.0,1.0


The first column is the row ID; the second and third columns contain the two nodes of an edge. After these, there is one column for each similarity function, in the same order in which the functions are defined in the _sim_functions_ configuration parameter. (In this case, first _mint_1_, then _jacc_, then _ovlp_.)
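As a worked example, the values 1, 0.2 and 1.0 in the first row of the output can be reproduced by hand. The two sets below are hypothetical (we did not look up the real values of these records); only their sizes and their overlap matter. We also assume that _overlap()_ is the usual overlap coefficient, i.e., the size of the intersection divided by the size of the smaller set.

```python
# Two hypothetical value sets; only their sizes and overlap matter here.
a = {"Breslau"}
b = {"Breslau", "Leipzig", "Halle", "Pirna", "Venedig"}

shared = a & b

mint_1 = len(shared)                                # size of the intersection
jacc = round(len(shared) / len(a | b), 4)           # intersection / union
ovlp = round(len(shared) / min(len(a), len(b)), 4)  # intersection / smaller set

print(mint_1, jacc, ovlp)  # 1 0.2 1.0
```

An overlap of 1.0 means that the smaller set is completely contained in the larger one, while the Jaccard value also reflects how much larger the other set is.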

We can filter the data based on certain values. For example, to get only those rows where the jaccard function returned a value greater than 0.5, we could do the following:

In [40]:
jacc = df[df[4] > 0.5]
jacc

0,1,2,3,4,5
414,cnp01287518,cnp00523315,1,1.0,1.0
437,cnp01287518,cnp00649657,1,1.0,1.0
486,cnp01287518,cnp00483281,1,1.0,1.0
1261,cnp01287801,cnp01277806,1,1.0,1.0
1380,cnp01287801,cnp01223930,1,1.0,1.0
...,...,...,...,...,...
676547,cnp01346309,cnp02232017,1,1.0,1.0
676676,cnp02231699,cnp02232017,1,1.0,1.0
676831,cnp02162198,cnp02162203,1,1.0,1.0
676835,cnp02162198,cnp02162197,1,1.0,1.0


We can also compare how many data sets achieved a certain similarity score:

In [41]:
ovlp = df[df[5] < 0.5]
print(f"Rows with jaccard > 0.5: {jacc.shape[0]}")
print(f"Rows with overlap < 0.5: {ovlp.shape[0]}")
ovlp = df[df[5] == 1]
print(f"Rows with overlap == 1: {ovlp.shape[0]}")

Rows with jaccard > 0.5: 7194
Rows with overlap < 0.5: 66
Rows with overlap == 1: 11392
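The same kind of filtering also works without pandas. The sketch below reads a few rows (copied from the output above) with the standard `csv` module and turns the edges with a Jaccard value above 0.5 into an undirected adjacency mapping. The in-memory sample replaces the real similarity file, and the adjacency structure is our own choice, not a __Bibliometa__ data format.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the similarity output above (tab-separated, no header):
# row ID, node 1, node 2, mint_1, jacc, ovlp
sample = (
    "284\tcnp01287518\tcnp01130525\t1\t0.2\t1.0\n"
    "380\tcnp01287518\tcnp00654099\t1\t0.5\t1.0\n"
    "414\tcnp01287518\tcnp00523315\t1\t1.0\t1.0\n"
    "415\tcnp01287518\tcnp00682190\t1\t0.3333\t1.0\n"
)

# Keep only edges whose Jaccard value exceeds 0.5 and store them as an
# undirected adjacency mapping {node: {neighbour: weight}}.
adjacency = defaultdict(dict)
reader = csv.reader(io.StringIO(sample), delimiter="\t")
for _row_id, source, target, _mint, jacc_value, _ovlp in reader:
    weight = float(jacc_value)
    if weight > 0.5:
        adjacency[source][target] = weight
        adjacency[target][source] = weight

adjacency = dict(adjacency)
print(adjacency)
```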


-----

## Logging, encoding, verbose

There are only a couple of configuration parameters left, so let's have a look at them.

In [42]:
j2e.get_config()

('i', '../data/examples/demo_1700.json')
('o', '../data/output/similarity/similarity.csv')
('create_corpus', True)
('corpus', '../data/output/graph_corpus/graph_corpus.json')
('name', '')
('fields', [('515', 'a')])
('swap', False)
('sim_functions', [{'name': 'mint_1', 'function': <function Similarity.Functions.mint at 0x7fcf7b0e1940>, 'args': {'f': <function <lambda> at 0x7fcf79daedc0>, 't': 1}}, {'name': 'jacc', 'function': <function Similarity.Functions.jaccard at 0x7fcf7b0e19d0>, 'args': {'f': <function <lambda> at 0x7fcf79daec10>, 't': 0}}, {'name': 'ovlp', 'function': <function Similarity.Functions.overlap at 0x7fcf7b0e1a60>, 'args': {'f': <function <lambda> at 0x7fcf79daeb80>, 't': 0}}])
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', '../data/logs/json2edgelist_demo.out')
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

__log__ (str): Path to log file
* The conversion process and its errors are documented in a log file. If _verbose_ == True (see below), the logging information is also shown on standard output if its level is _log_level_std_ or above.

__log_level_std__ (str): Logging level considered for standard output
* Only log messages with this level (or above) are shown on the standard output. This parameter has no effect if _verbose_ == False. Possible severity levels can be found in the documentation of the logging package `loguru`: https://loguru.readthedocs.io/en/stable/api/logger.html

__log_level_file__ (str): Logging level considered for log file
* Only log messages with this level (or above) are shown in the log file.

__verbose__ (bool): Show detailed information on standard output
* Whether logging information is not only written to the log file but also shown on the standard output.

__encoding__ (str): File encoding
* File encoding of input and output files. The default value is "utf-8" and there is usually no need to change this.
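The idea of two separate logging levels can be tried out in isolation. The sketch below is a stand-in that uses Python's built-in `logging` module instead of `loguru`, and two in-memory sinks instead of a log file and the standard output. The logger name and message format are made up; only the behaviour of the two levels is meant to mirror _log_level_file_ and _log_level_std_.

```python
import io
import logging

# Stand-in sketch with the built-in logging module (not loguru).
# The two in-memory sinks replace the log file and the standard output.
file_sink, std_sink = io.StringIO(), io.StringIO()

logger = logging.getLogger("json2edgelist_sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False

file_handler = logging.StreamHandler(file_sink)
file_handler.setLevel(logging.DEBUG)  # plays the role of log_level_file
std_handler = logging.StreamHandler(std_sink)
std_handler.setLevel(logging.INFO)    # plays the role of log_level_std

for handler in (file_handler, std_handler):
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)

logger.info("Start JSON2EdgeList conversion.")
logger.debug("Corpus keys: 1164")

file_lines = file_sink.getvalue().splitlines()
std_lines = std_sink.getvalue().splitlines()

print(file_lines)  # both messages
print(std_lines)   # only the INFO message
```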

We can see what's in the log file and check if everything went as expected:

In [43]:
with open(j2e.config.log, "r", encoding=j2e.config.encoding) as f:
    log_text = f.read().splitlines()

log_text

['2021-07-28T15:41:00.274425+0200 INFO Start JSON2EdgeList conversion.',
 '2021-07-28T15:41:00.650325+0200 INFO Start JSON2EdgeList conversion.',
 '2021-07-28T15:41:02.023603+0200 INFO Start JSON2EdgeList conversion.',
 '2021-07-28T15:41:02.030828+0200 DEBUG Corpus keys: 1164',
 '2021-07-28T15:41:02.035515+0200 DEBUG Unique values: 552',
 '2021-07-28T15:41:02.046869+0200 INFO Corpus written to file ../data/output/graph_corpus/graph_corpus.json.',
 '2021-07-28T15:41:02.049565+0200 INFO Start similarity calculation.',
 '2021-07-28T15:41:03.103462+0200 INFO Start JSON2EdgeList conversion.',
 '2021-07-28T15:41:05.382170+0200 DEBUG Corpus keys: 552',
 '2021-07-28T15:41:05.392618+0200 INFO Corpus written to file ../data/output/graph_corpus/graph_corpus.json.',
 '2021-07-28T15:41:05.392939+0200 INFO Start similarity calculation.',
 '2021-07-28T15:41:17.868326+0200 INFO Start JSON2EdgeList conversion.',
 '2021-07-28T15:41:17.884078+0200 DEBUG Corpus keys: 1164',
 '2021-07-28T15:41:17.884349+02

-----

## Custom similarity functions

You can define your own similarity function outside of __Bibliometa__'s source code. Just make sure that you have all necessary parameters a, b, f, and t in the function definition.

In [44]:
# This function returns the value f if one or more elements from set a can be found in set b. 
# Otherwise it returns 0.111.
# That means EVERY combination of two nodes has a similarity!
def custom_function(a, b, f, t=0):
    return f if any(x in b for x in a) else 0.111
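Note that the function above accepts "t" but never uses it. A variant that does honour the threshold could look like the following sketch. The function `min_shared` is hypothetical; in particular, letting a callable "f" receive the number of shared elements is our own convention, not something required by __Bibliometa__.

```python
# Hypothetical custom similarity function: two sets count as similar only if
# they share at least t elements; otherwise the similarity is 0.
def min_shared(a, b, f, t=1):
    shared = len(a & b)
    if shared < t:
        return 0
    # f may be a constant or a function of the shared count (our convention).
    return f(shared) if callable(f) else f


a = {"Pirna", "Leipzig", "Halle"}
b = {"Leipzig", "Halle", "Venedig"}

two_needed = min_shared(a, b, f=0.99, t=2)
three_needed = min_shared(a, b, f=0.99, t=3)

print(two_needed)    # 0.99
print(three_needed)  # 0
```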

After defining a similarity function, add it to the _JSON2EdgeList_ configuration:

In [45]:
SIM_FUNCTIONS = [
    {"name": "custom",
     "function": custom_function,
     "args": {
         "f": 0.99,
         "t": 1}
     }
]

In [46]:
j2e.config.sim_functions = SIM_FUNCTIONS

Then start the conversion:

In [47]:
j2e.start()

  0%|          | 0/67 [00:00<?, ?it/s]

Finally, check the produced similarity file:

In [48]:
with open(j2e.config.o, "r", encoding=j2e.config.encoding) as f:
    df = pd.read_csv(f, sep=j2e.config.csv_sep, header=None, index_col=0)
    
df.head()

0,1,2,3
0,cnp01287518,cnp01287801,0.111
1,cnp01287518,cnp01417221,0.111
2,cnp01287518,cnp01418335,0.111
3,cnp01287518,cnp01289912,0.111
4,cnp01287518,cnp01286053,0.111


We can count how often the custom similarity function returned 0.111 and 0.99, respectively:

In [49]:
df[df[3] == 0.111].shape[0]

664824

In [50]:
df[df[3] == 0.99].shape[0]

12042

Just for fun, let's test if all combinations of two nodes were considered in similarity calculation:

In [51]:
from scipy.special import binom

# get number of input data sets
with open(j2e.config.i, "r", encoding=j2e.config.encoding) as f:
    d = json.load(f)
size_input = len(d.keys())

# get number of combinations
combinations = binom(size_input, 2)

# check if result similarity file has as many rows as possible combinations exist
assert (
    df[df[3] == 0.111].shape[0] + df[df[3] == 0.99].shape[0]
) == combinations

No output message means: Everything is perfect.
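The same check can be done by hand with the numbers shown in this tutorial. The sketch below assumes that the input contains 1164 data sets, which is the "Corpus keys" value reported in the log for the unswapped corpus. It uses `math.comb` from the standard library, which returns an exact integer.

```python
import math

corpus_keys = 1164        # "Corpus keys" reported in the log above
rows_not_shared = 664824  # rows with value 0.111
rows_shared = 12042       # rows with value 0.99

# Number of unordered pairs: n * (n - 1) / 2, computed exactly as an integer.
pairs = math.comb(corpus_keys, 2)

print(pairs)  # 676866
assert pairs == rows_not_shared + rows_shared
```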

-----

## Iterating over multiple files

There is one use case where the configuration parameter _name_ is needed.

__name__ (str): Name for single conversion step
* If the conversion from JSON to an edge list representation is conducted on more than one file, you somehow have to make sure that for each file a separate similarity file (and probably graph corpus) is created. Using the _name_ parameter allows you to assign each file a unique identifier. For example, if you created your input JSON files using `bibliometa.conversion.CSV2JSON`, the year information inside the output file names may be used as unique identifier for the JSON2EdgeList conversion.

We start by creating a new _JSON2EdgeList_ object:

In [52]:
j2e = JSON2EdgeList()
j2e

('i', None)
('o', None)
('create_corpus', False)
('corpus', None)
('name', '')
('fields', None)
('swap', False)
('sim_functions', None)
('archive', True)
('archive_ext', '.tar.gz')
('csv_sep', '\t')
('log', None)
('log_level_std', 'INFO')
('log_level_file', 'DEBUG')
('verbose', False)
('encoding', 'utf-8')

We add our similarity functions:

In [53]:
SIM_FUNCTIONS = [
    {"name": "mint_1",
     "function": Similarity.Functions.mint,
     "args": {
         "f": lambda a, b: len(list(a.intersection(b))),
         "t": 1}
     },
    {"name": "jacc",
     "function": Similarity.Functions.jaccard,
     "args": {
         "f": lambda a: round(a, 4),
         "t": 0}
     },
    {"name": "ovlp",
     "function": Similarity.Functions.overlap,
     "args": {
         "f": lambda a: round(a, 4),
         "t": 0}
     },
]

In the next step, we iterate over multiple files in a folder. It is especially important that we take care to set the _name_ configuration parameter with a unique value in each iteration and that the _i_ parameter uses the single files of the iteration (and not always the same one). The file names of the output, graph corpus and log files are generated automatically if the _name_ parameter is set, i.e., the _name_ is appended to their file names to make sure each iteration produces a new and unique similarity/corpus/log file.

Alternatively, you could use a different value for the _o_ and _corpus_ parameter in each iteration.

We set _swap_ to True to get meaningful graphs that can be visualized in one of the next tutorials (04-visualization). This will result in cities as nodes; edges will be drawn between similar cities (i.e., cities that appear in the same data set).

In [54]:
j2e = JSON2EdgeList()

# iterate over files in folder
for root, dirs, files in os.walk(os.path.dirname("../data/examples/multiple_files/")):
    for file in files:
        filename = os.path.splitext(file)[0]
        j2e.set_config(i=root + os.sep + file,
                       o=f"../data/output/similarity/similarity_{filename}.csv",
                       corpus=f"../data/output/graph_corpus/graph_corpus_{filename}.json",
                       create_corpus=True,
                       name=filename,
                       fields=[
                           ("515", "a"),
                       ],
                       swap=True,
                       sim_functions=SIM_FUNCTIONS,
                       log=f"../data/logs/json2edgelist_{filename}.out"
                      ).start()

  0%|          | 0/544 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]

  0%|          | 0/563 [00:00<?, ?it/s]

  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/552 [00:00<?, ?it/s]

  0%|          | 0/15 [00:00<?, ?it/s]

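Looking into the folders for similarity, graph corpus, and log files, we see that separate files were created for each input file. Because the _name_ is appended automatically and we also included the file name in the _o_ and _corpus_ paths, the identifier appears twice in the similarity and corpus file names: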

In [55]:
[x for x in os.listdir(os.path.dirname(j2e.config.o)) if not x.endswith(".tar.gz")]

['similarity_demo_1710_demo_1710.csv',
 'similarity.csv',
 'similarity_demo_1720_demo_1720.csv',
 'similarity_demo_1700_demo_1700.csv']

In [56]:
[x for x in os.listdir(os.path.dirname(j2e.config.corpus)) if not x.endswith(".tar.gz")]

['graph_corpus_demo_1710_demo_1710.json',
 'graph_corpus_demo_1700_demo_1700.json',
 'graph_corpus.json',
 'graph_corpus_demo_1720_demo_1720.json']

In [57]:
[x for x in os.listdir(os.path.dirname(j2e.config.log)) if not x.endswith(".tar.gz")]

['json2edgelist_demo_1710.out',
 'json2edgelist_demo_1700.out',
 'json2edgelist_demo_progress.out',
 'csv2json_demo.out',
 'json2edgelist_demo.out',
 'json2edgelist_demo_1700_demo_1700_progress.out',
 'json2edgelist_demo_1710_demo_1710_progress.out',
 'json2edgelist_demo_1720_demo_1720_progress.out',
 'json2edgelist_demo_1720.out']
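The doubled identifiers in the listings above (for example `similarity_demo_1710_demo_1710.csv`) are consistent with the _name_ being inserted between the file stem and the file extension of a path that already contained it. The following sketch mimics this behaviour. The helper `append_name` is hypothetical, and the naming rule is inferred from the file listings only, not from __Bibliometa__'s source code.

```python
from pathlib import Path

# Hypothetical helper (not a Bibliometa function): insert an identifier
# between a file's stem and its extension.
def append_name(path, name):
    p = Path(path)
    return p.with_name(f"{p.stem}_{name}{p.suffix}")


plain = append_name("../data/output/similarity/similarity.csv", "demo_1710")
doubled = append_name("../data/output/similarity/similarity_demo_1710.csv", "demo_1710")

print(plain.name)    # similarity_demo_1710.csv
print(doubled.name)  # similarity_demo_1710_demo_1710.csv
```

If this reading is correct, keeping the _o_ and _corpus_ paths constant and varying only the _name_ parameter should be enough to get one uniquely named file per iteration.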

-----