# Kialo 0 - Repickle `.csv` File into Folder of NetworkX Digraphs

## A. P. Young

### 2022 May 1

In [1]:
import time
START = time.time()
import datetime
from datetime import timedelta
import pandas as pd
import networkx as nx
from collections import Counter
from tqdm import tqdm

# Introduction

This notebook takes the single `.csv` file of Kialo debates, `kialo_debates.csv`, output by the [previous notebook](https://github.com/apy05/kialo_debates/blob/main/Kialo0_Dataset_Compatibility.ipynb), and repickles the debates as directed graphs that are compatible with NetworkX `2.x`.

**Steps**

  1. Go to https://netsys.surrey.ac.uk/datasets/graphnli/ (last accessed 1 May 2022).
  
  2. Fill out the form and request the dataset.

  3. Once the dataset request is approved, download and unzip the dataset.
  
  4. Run the [previous notebook](https://github.com/apy05/kialo_debates/blob/main/Kialo0_Dataset_Compatibility.ipynb) to obtain the `.csv` file output.
  
We have NetworkX version `2.x` here:

In [2]:
nx.__version__

'2.8'

## References

If you find this notebook useful, please cite:

  * Young, A.P., Joglekar, S., Boschi, G. and Sastry, N., 2021. Ranking comment sorting policies in online debates. Argument & Computation, 12(2), pp.265-285.

Paper [here](https://content.iospress.com/articles/argument-and-computation/aac200909).

Since you are working with the Kialo dataset, please also cite:

  * Agarwal, V., Joglekar, S., Young, A.P. and Sastry, N., 2022. GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates. Proceedings of the ACM Web Conference 2022, pp.2729-2737.
  
Paper [here](https://dl.acm.org/doi/abs/10.1145/3485447.3512144).

# Get Data

We get the `.csv` file:

In [3]:
data = pd.read_csv('kialo_debates.csv').fillna('')

# Some Functions

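The following function takes a dataframe of the rows of a single debate and returns a NetworkX digraph with all node and edge attributes. Every column other than `target` and `weight` becomes a node attribute, `weight` becomes an edge attribute, and no edge is added for rows whose `target` is `'-1'`.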

In [4]:
def dataframe_into_graph(mydf):
    """
    Input dataframe of columns at least: source, target, weight
    Output networkx digraph with all attributes
    """
    # does not mutate argument
    mydf2 = mydf.copy()
    # sort by node index suffix
    mydf2_indices = [int(item.split('_')[1]) for item in mydf2['source']]
    mydf2.insert(0, 'source_index', mydf2_indices)
    mydf2 = mydf2.sort_values(by = 'source_index')
    mydf2 = mydf2.drop(columns = ['source_index'])
    node_df = mydf2.drop(columns = ['target', 'weight'])
    edge_df = mydf2[['source', 'target', 'weight']]
    answer = nx.DiGraph()
    # add node attributes
    for rownum in range(node_df.shape[0]):
        myrow = node_df.iloc[rownum]
        source = myrow['source']
        # ignore source
        attributes = dict(myrow[1:])
        answer.add_node(source, **attributes)
    # add edge attributes
    for rownum in range(edge_df.shape[0]):
        myrow = edge_df.iloc[rownum]
        source = myrow['source']
        target = myrow['target']
        if target != '-1':
            attributes = dict(myrow[2:])
            answer.add_edge(source, target, **attributes)
    return answer
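To make the expected input concrete, here is a minimal sketch of the same row-to-graph logic on a toy debate. The `text` column, the `weight` values and the debate index `7` are hypothetical and chosen only for illustration; the reading of `'-1'` as "this node replies to nothing" is an assumption based on the check in the code above.

```python
import networkx as nx
import pandas as pd

# Toy debate with index 7; `text` and the weight values are hypothetical.
toy = pd.DataFrame({
    'source': ['7_2', '7_0', '7_1'],
    'target': ['7_0', '-1', '7_0'],
    'weight': [-1, 0, 1],
    'text': ['con claim', 'thesis', 'pro claim'],
})

graph = nx.DiGraph()
for row in toy.to_dict('records'):
    source = row.pop('source')
    target = row.pop('target')
    weight = row.pop('weight')
    # every remaining column becomes a node attribute
    graph.add_node(source, **row)
    # a target of '-1' produces no edge (assumed to mark a node with no parent)
    if target != '-1':
        graph.add_edge(source, target, weight=weight)
```

The result has three nodes and two edges, both pointing at `7_0`, with `weight` stored on the edges and `text` stored on the nodes.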

The following function applies `dataframe_into_graph` to each debate in the dataframe. It relies on the naming convention for nodes: each node name is a string of the form `X_Y`, where `X` and `Y` are integers, and nodes with the same prefix `X` belong to the same debate.

In [5]:
def construct_list_of_digraphs(mydf):
    """
    Input dataframe
    Output list of digraphs
    - nodes and edges as in the dataframe
    - attributes all preserved
    """
    answer = []
    all_graph_indices = sorted(list(set([int(item.split('_')[0]) for item in list(mydf['source'])])))
    # progress bar
    for index in tqdm(all_graph_indices):
        subdf = mydf[mydf['source'].str.startswith(str(index) + '_')]
        digraph = dataframe_into_graph(subdf)
        pair = (index, digraph)
        answer.append(pair)
    return answer
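The function above filters the whole dataframe once per debate, and the trailing `'_'` in the `startswith` test is what stops prefix `1` from also matching `10_0`. As an alternative, the split by prefix can be sketched with a single pandas `groupby`. This is a sketch of a different technique under the same `X_Y` naming assumption, not the method used in this notebook, and the toy node names are hypothetical.

```python
import pandas as pd

# Toy node names; debates 1 and 10 share a leading digit on purpose.
toy = pd.DataFrame({'source': ['0_0', '0_1', '10_0', '1_0', '1_1']})

# integer debate index X taken from each node name X_Y
prefix = toy['source'].str.split('_').str[0].astype(int)

# one sub-dataframe per debate, keyed by integer debate index
groups = {int(index): subdf for index, subdf in toy.groupby(prefix)}
```

Each value of `groups` could then be passed to `dataframe_into_graph` in place of `subdf`.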

# Construct the Digraphs

We construct the digraphs; this takes several minutes.

In [6]:
data_digraphs = construct_list_of_digraphs(data)

100%|███████████████████████████████████████| 1560/1560 [03:50<00:00,  6.76it/s]


There are $1,560$ such debates, which is consistent with the previous notebook.

In [7]:
len(data_digraphs)

1560

# Pickle the Digraphs

We first create a folder called `serialisedGraphs` in this directory (by hand), then pickle the digraphs using `nx.write_gpickle`, which is deprecated in NetworkX `2.x` and was removed in NetworkX `3.0`. This also takes several minutes.

In [8]:
# progress bar
for index, digraph in tqdm(data_digraphs):
    filepath = './serialisedGraphs/' + str(index) + '.pkl'
    nx.write_gpickle(digraph, filepath)

100%|███████████████████████████████████████| 1560/1560 [02:03<00:00, 12.64it/s]
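Because `nx.write_gpickle` is no longer available from NetworkX `3.0` onwards, the same step can be sketched with Python's standard `pickle` module. The helper names `write_digraphs` and `read_digraph` are hypothetical, and the sketch writes to a temporary folder rather than `serialisedGraphs` so that it can be run in isolation.

```python
import os
import pickle
import tempfile

import networkx as nx


def write_digraphs(pairs, folder):
    """Pickle each (index, digraph) pair to <folder>/<index>.pkl."""
    # create the folder if it does not exist yet, instead of by hand
    os.makedirs(folder, exist_ok=True)
    for index, digraph in pairs:
        filepath = os.path.join(folder, str(index) + '.pkl')
        with open(filepath, 'wb') as handle:
            pickle.dump(digraph, handle, pickle.HIGHEST_PROTOCOL)


def read_digraph(folder, index):
    """Load the digraph pickled under the given debate index."""
    with open(os.path.join(folder, str(index) + '.pkl'), 'rb') as handle:
        return pickle.load(handle)


# round trip on a toy digraph
toy = nx.DiGraph()
toy.add_edge('0_1', '0_0', weight=1)
with tempfile.TemporaryDirectory() as folder:
    write_digraphs([(0, toy)], folder)
    restored = read_digraph(folder, 0)
```

As with any pickle file, these files should only be loaded from a trusted source.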


The result is a folder of $1,560$ pickled directed graphs of Kialo debates, each preserving all node and edge attributes.

**N.B.** The main difference between these pickled graphs and those provided by the data link from Step 1 in the Introduction above is how the five Kialo vote categories are stored. They were initially represented as a single list of five integers, whereas the pickled graphs here have separated that list into five distinct graph attributes, each a single integer. This was mainly to avoid complex data types (e.g. a list, which is more complex than an integer) appearing in the `.csv` file.
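If a downstream task needs the original list form, the five integers can be gathered back into a list. The sketch below assumes the votes are stored as node attributes and uses the hypothetical names `votes_1` to `votes_5`; the real attribute names come from the columns of `kialo_debates.csv` and should be substituted.

```python
import networkx as nx

# Hypothetical attribute names; replace with the actual column names.
VOTE_KEYS = ['votes_1', 'votes_2', 'votes_3', 'votes_4', 'votes_5']


def votes_as_list(digraph, node, keys=VOTE_KEYS):
    """Return the five vote counts of a node as a list of integers."""
    attributes = digraph.nodes[node]
    return [int(attributes[key]) for key in keys]


# toy node carrying five separate vote attributes
toy = nx.DiGraph()
toy.add_node('0_0', votes_1=3, votes_2=0, votes_3=5, votes_4=2, votes_5=1)
votes = votes_as_list(toy, '0_0')
```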

# Conclusion

In this short notebook, we have taken the `.csv` output of the previous notebook, formed it into a list of NetworkX digraphs (under version `2.x`) that preserves all node and edge attributes, and repickled them into a folder of `.pkl` files. The purpose is to provide a set of digraphs compatible with NetworkX `2.x` for downstream tasks.

In [9]:
"Notebook done in " + str(timedelta(seconds = time.time() - START)) + '.'

'Notebook done in 0:05:59.416048.'