# Kialo 0 - Transform Pickled Files of Kialo Debates to a Single Table

## A. P. Young

### 2022 April 13

In [1]:
import time
START = time.time()
import datetime
from datetime import timedelta
import os
import pickle
from tqdm import tqdm
import networkx as nx
print("NetworkX version", nx.__version__)
import pandas as pd
from collections import Counter
import numpy as np

NetworkX version 1.10


# Introduction

This notebook converts pickled files of [Kialo](https://www.kialo.com/) debates into a single `.csv` file containing the same data. It is needed because the files provided were pickled with NetworkX version 1.x, and the methods in NetworkX version 2.x do not work on them due to version incompatibility.
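Since everything below depends on the NetworkX major version, a small guard can make a mismatch fail early. The following is a minimal sketch; the helper name `is_networkx_1x` is hypothetical and is not part of NetworkX or of this notebook.

```python
def is_networkx_1x(version_string):
    """Return True if a version string such as '1.10' has major version 1."""
    major = version_string.split(".")[0]
    return major == "1"

# Hypothetical usage inside this notebook:
#     import networkx as nx
#     assert is_networkx_1x(nx.__version__), "This notebook expects NetworkX 1.x"
print(is_networkx_1x("1.10"))   # True
print(is_networkx_1x("2.6.3"))  # False
```

Comparing only the text before the first dot avoids treating a version such as `10.1` as version 1.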

**To run this notebook, we assume you have NetworkX 1.x** (see the version print out above). Now do the following:

  1. Go to https://netsys.surrey.ac.uk/datasets/graphnli/ (last accessed 13 April 2022).
  
  2. Fill out the form and request the dataset.
  
  3. Once the dataset request is approved, download and unzip the dataset. The folder's name is:

In [2]:
folder = './serializedGraphs/'

  4. Then run the notebook steps below, assuming that the folder is in the same directory as this notebook.
  
  5. The result is a `.csv` file, `kialo_debates.csv`, which you can use for downstream analysis, e.g. in online debate networks or NLP of the texts of the debates.

## References

If you find this notebook useful, please cite:

  * Young, A.P., Joglekar, S., Boschi, G. and Sastry, N., 2021. *Ranking comment sorting policies in online debates*. Argument & Computation, 12(2), pp.265-285.

Paper [here](https://content.iospress.com/articles/argument-and-computation/aac200909).

Given you are working with the Kialo dataset, please also cite:

  * Agarwal, V., Joglekar, S., Young, A.P. and Sastry, N., 2022. *GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates*. arXiv preprint arXiv:2202.08175.

Preprint [here](https://arxiv.org/abs/2202.08175), paper to appear at The ACM Web Conference 2022, end-April 2022.

# Unpickle the Files

We unpickle the files.

In [3]:
def create_filepaths_excluding_hidden_files(directory_string):
    """
    Input string that points to directory,
    e.g. relative to this notebook (must end with a path separator)
    Output sorted list of strings,
    i.e. the filepath of each item in the directory
    Ignores items whose names start with '._' (see condition in list comprehension)
    """
    return sorted([directory_string + item for item in os.listdir(directory_string) if not item.startswith('._')])
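To see what the filter does, here is a small self-contained sketch on a temporary directory. It is a variant, not the function above: the name `list_visible_filepaths` and the toy file names are hypothetical, and it uses `os.path.join` so the directory string does not need a trailing separator.

```python
import os
import tempfile

def list_visible_filepaths(directory):
    """Sorted full paths of items in `directory`, skipping names that start with '._'."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if not name.startswith("._")
    )

with tempfile.TemporaryDirectory() as tmp:
    # create three empty toy files, one of which has the '._' prefix
    for name in ["b.pkl", "a.pkl", "._a.pkl"]:
        open(os.path.join(tmp, name), "wb").close()
    visible_names = [os.path.basename(p) for p in list_visible_filepaths(tmp)]

print(visible_names)  # ['a.pkl', 'b.pkl']
```

Note that the condition only matches the `._` prefix, so other dot-prefixed names (for example `.DS_Store`) would not be skipped by it.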

First get the filepaths.

In [4]:
filepaths = create_filepaths_excluding_hidden_files(folder)

Then unpickle the files.

In [5]:
def unpickle_file(filepath):
    """
    Input filepath string of .pkl file
    Output unpickled file
    """
    with open(filepath, 'rb') as f:
        answer = pickle.load(f)
    return answer
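As a quick illustration of the round trip that `pickle` performs, the sketch below writes a toy object to a temporary file and loads it back. The object and file name are hypothetical stand-ins for the debate graphs.

```python
import os
import pickle
import tempfile

def load_pickle(filepath):
    """Open a file in binary mode and return the unpickled object."""
    with open(filepath, "rb") as f:
        return pickle.load(f)

# hypothetical toy object standing in for a pickled debate graph
toy_object = {"nodes": ["a", "b"], "edges": [("b", "a")]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_object, f)
    restored = load_pickle(path)

print(restored == toy_object)  # True
```

For class instances, a pickle stores a reference to the class together with the instance state, so loading relies on a compatible class definition being importable. This is consistent with the NetworkX version requirement stated in the introduction.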

In [6]:
def unpickle_file_from_list_of_filepaths(list_of_filepaths):
    """
    Input list of strings (filepaths)
    Apply the preceding unpickle_file function to all items in the input list
    """
    # progress bar
    return [unpickle_file(item) for item in tqdm(list_of_filepaths)]

In [7]:
debates = unpickle_file_from_list_of_filepaths(filepaths)

100%|██████████| 1560/1560 [00:14<00:00, 107.29it/s]


# Get Graph Node and Edge Information

Each pickled file is a NetworkX digraph, pickled in NetworkX version 1.x. The following methods follow the syntax of version 1.x. This is why this notebook assumes you have NetworkX 1.x.

We first extract all nodes and their attributes into a list.

In [8]:
def extract_all_nodes_with_date_into_list(mylist):
    """
    Input list of nx graphs
    Output list of pairs
    - string node name
    - node attributes dictionary
    """
    answer = []
    # progress bar
    for item in tqdm(mylist):
        answer += item.nodes(data = True)
    return answer

In [9]:
debate_nodes = extract_all_nodes_with_date_into_list(debates)

100%|██████████| 1560/1560 [00:00<00:00, 1829.63it/s]


We do the same for the edges.

In [10]:
def extract_all_edges_with_date_into_list(mylist):
    """
    Input list of nx graphs
    Output list of triples
    - string node name source node of edges
    - string node name target node of edges
    - dictionary of edge attributes
    """
    answer = []
    for item in tqdm(mylist):
        answer += item.edges(data = True)
    return answer

In [11]:
debate_edges = extract_all_edges_with_date_into_list(debates)

100%|██████████| 1560/1560 [00:01<00:00, 1056.06it/s]


# Merge Node and Edge Information into a Single Dataframe

We turn each such list into a dataframe:

In [12]:
def structure_list_of_nodes_or_edges_with_data_into_dataframe(mylist):
    """
    Input a list of pairs (string, dict) or a list of triples (string1, string2, dict)
    Output a dataframe with all of the dict keys as their separate columns,
    plus a 'source' column (and a 'target' column for triples)
    """
    answer = []
    # progress bar
    for item in tqdm(mylist):
        # structure is either (a, mydict) or (a, b, mydict)
        # copy the attribute dict so that the graph's own data is not modified
        myrow = dict(item[-1])
        myrow['source'] = item[0]
        if len(item) == 3:
            myrow['target'] = item[1]
        answer.append(myrow)
    return pd.DataFrame(answer)
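The following sketch shows the same idea on toy records shaped like the node pairs and edge triples described above. The records and the name `records_to_dataframe` are hypothetical, and the progress bar is omitted.

```python
import pandas as pd

# hypothetical toy records: (name, attributes) pairs and (source, target, attributes) triples
toy_nodes = [("c1", {"text": "Root claim", "relation": 0}),
             ("c2", {"text": "A reply", "relation": 1})]
toy_edges = [("c2", "c1", {"weight": 1})]

def records_to_dataframe(records):
    """Turn pairs or triples ending in an attribute dict into one dataframe row each."""
    rows = []
    for record in records:
        row = dict(record[-1])      # copy, so the input dictionaries stay unchanged
        row["source"] = record[0]
        if len(record) == 3:
            row["target"] = record[1]
        rows.append(row)
    return pd.DataFrame(rows)

toy_nodes_df = records_to_dataframe(toy_nodes)
toy_edges_df = records_to_dataframe(toy_edges)
print(sorted(toy_nodes_df.columns))  # ['relation', 'source', 'text']
print(sorted(toy_edges_df.columns))  # ['source', 'target', 'weight']
```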

In [13]:
debate_nodes_df = structure_list_of_nodes_or_edges_with_data_into_dataframe(debate_nodes)

100%|██████████| 329013/329013 [00:00<00:00, 440685.33it/s]


In [14]:
debate_edges_df = structure_list_of_nodes_or_edges_with_data_into_dataframe(debate_edges)

100%|██████████| 327453/327453 [00:00<00:00, 404032.52it/s]


We change the order of the columns (optional, for aesthetic reasons):

In [15]:
node_column_order = ['source',
                     'author',
                     'text',
                     'created',
                     'edited',
                     'votes',
                     'relation']

In [16]:
edge_column_order = ['source',
                     'target',
                     'weight']

In [17]:
debate_nodes_df = debate_nodes_df[node_column_order]

In [18]:
debate_edges_df = debate_edges_df[edge_column_order]

We then merge the edge information into the node information. In the resulting dataframe, each row is a (source) node, together with the (target) node it points to and any edge attributes.

In [19]:
debates_df = pd.merge(debate_nodes_df, debate_edges_df, how = 'left', on = ['source'])
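A toy version of this left merge, with hypothetical node names, shows what happens to a node that has no outgoing edge. The `validate` argument is an optional pandas check that is not used in the cell above; it is included here only as an assumption that each node is the source of at most one edge.

```python
import pandas as pd

# hypothetical toy data: c1 has no outgoing edge, c2 and c3 both point to c1
toy_nodes_df = pd.DataFrame({"source": ["c1", "c2", "c3"],
                             "text": ["Root claim", "A reply", "Another reply"]})
toy_edges_df = pd.DataFrame({"source": ["c2", "c3"],
                             "target": ["c1", "c1"],
                             "weight": [1, -1]})

# left merge keeps every node row; validate raises if 'source' is duplicated on either side
merged = pd.merge(toy_nodes_df, toy_edges_df, how="left", on=["source"],
                  validate="one_to_one")

print(merged.shape)                      # (3, 4)
print(merged["target"].isna().tolist())  # [True, False, False]
```

The row for `c1` is kept, with missing values in the `target` and `weight` columns.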

In [20]:
debates_df.shape

(329013, 9)

In this dataframe, the `source` (node) column serves as an identifier for each row. This is verified below: every value in the `source` column appears exactly once.

In [21]:
set(Counter(debates_df['source']).values())

{1}
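The same check can be phrased with pandas directly. The sketch below, on a hypothetical toy column, compares the `Counter` approach with `Series.is_unique` and `Series.duplicated`.

```python
import pandas as pd
from collections import Counter

# hypothetical toy identifier column
toy_sources = pd.Series(["c1", "c2", "c3"], name="source")

counts_seen = set(Counter(toy_sources).values())
has_duplicates = bool(toy_sources.duplicated().any())

print(counts_seen)            # {1}
print(toy_sources.is_unique)  # True
print(has_duplicates)         # False
```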

After the merge, nodes that are not the source of any edge have missing values for `target` and `weight`. We set the target of such nodes to `-1` and their edge weight to zero.

In [22]:
debates_df['target'] = debates_df['target'].fillna(-1)

In [23]:
debates_df['weight'] = debates_df['weight'].fillna(0.0)
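With the sentinel in place, nodes without an outgoing edge can be selected by comparing `target` with `-1`. A minimal sketch on hypothetical toy data (the `target` column is given `object` dtype explicitly so that it can hold both node names and the integer sentinel):

```python
import pandas as pd

# hypothetical toy table after a left merge: c1 has no outgoing edge
toy = pd.DataFrame({"source": ["c1", "c2"],
                    "target": pd.Series([None, "c1"], dtype=object),
                    "weight": [float("nan"), 1.0]})

toy["target"] = toy["target"].fillna(-1)
toy["weight"] = toy["weight"].fillna(0.0)

# nodes whose target is the sentinel -1
roots = toy.loc[toy["target"] == -1, "source"].tolist()
print(roots)  # ['c1']
```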

# Delete Redundant Feature - `relation`

There is a column called `relation` which has the same information as the `weight` column, as shown by the following correlation matrix:

In [24]:
debates_df[['relation', 'weight']].corr()

| | relation | weight |
|---|---|---|
| relation | 1.0 | 1.0 |
| weight | 1.0 | 1.0 |


... and also by the following counts of (`relation`, `weight`) value pairs. Notice that $1$ is always paired with $1$, $-1$ with $-1$ and $0$ with $0$.

In [25]:
Counter(zip(debates_df['relation'], debates_df['weight']))

Counter({(1, 1.0): 139722, (-1, -1.0): 184651, (0, 0.0): 4640})
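A correlation of 1 on its own only indicates a perfect linear relationship, so the pair counts are the stronger evidence. An equivalent direct check is an elementwise comparison of the two columns, sketched here on hypothetical toy values:

```python
import pandas as pd

# hypothetical toy values in the same style as the two columns
toy = pd.DataFrame({"relation": [0, 1, -1, 1],
                    "weight": [0.0, 1.0, -1.0, 1.0]})

# True only if every row has the same value in both columns
columns_agree = bool((toy["relation"] == toy["weight"]).all())
# rows where the two columns differ (empty if they agree everywhere)
mismatches = toy[toy["relation"] != toy["weight"]]

print(columns_agree)    # True
print(len(mismatches))  # 0
```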

This means we are justified in dropping the `relation` column, because its information is already contained in the `weight` column.

In [26]:
debates_df = debates_df.drop(columns = 'relation')

We also change the values of the `weight` column into integers.

In [27]:
debates_df['weight'] = debates_df['weight'].astype(int)

# Splitting the `votes` Column

Each entry of the `votes` column is a list of five integers, one for each of the five categories in Kialo voting. The number at each list position counts the votes in that category:

  0. Index `0` refers to the category "This claim is false."
  
  1. Index `1` refers to the category "This claim is improbable."
  
  2. Index `2` refers to the category "This claim is plausible."
  
  3. Index `3` refers to the category "This claim is probable."
  
  4. Index `4` refers to the category "This claim is true."
  
See [here](https://www.kialo.com/the-existence-of-god-2629) for an example: click on the horizontal blue bar above the claim, and then the little icon on the upper left (with three vertical bars of ascending height).

We split the `votes` column into five columns and rename them as `vote_category0`, `vote_category1`, and so on.

In [28]:
votes_subdf = pd.DataFrame(list(debates_df['votes']))

In [29]:
vote_column_names = []
for index in range(5):
    item = 'vote_category' + str(index)
    vote_column_names.append(item)

In [30]:
votes_subdf.columns = vote_column_names
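On a toy table the split looks as follows. The vote counts are hypothetical, and the total number of votes per claim is computed only as an example of what the separate columns make easy.

```python
import pandas as pd

# hypothetical toy table with a length-5 list of vote counts per row
toy = pd.DataFrame({"source": ["c1", "c2"],
                    "votes": [[0, 1, 2, 3, 4], [5, 0, 0, 1, 2]]})

vote_column_names = ["vote_category" + str(index) for index in range(5)]
toy_votes = pd.DataFrame(toy["votes"].tolist(),
                         columns=vote_column_names,
                         index=toy.index)

total_votes = toy_votes.sum(axis=1).tolist()
print(list(toy_votes))  # ['vote_category0', ..., 'vote_category4']
print(total_votes)      # [10, 8]
```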

We partition the original dataframe and insert the five `votes` columns in the middle.

In [31]:
votes_index = list(debates_df).index('votes')

In [32]:
df1 = debates_df[list(debates_df)[:votes_index]]

In [33]:
df2 = debates_df[list(debates_df)[votes_index+1:]]

In [34]:
debates_df = pd.concat([df1, votes_subdf, df2], axis = 1)

In [35]:
debates_df.shape

(329013, 12)
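The partition-and-concatenate step can be traced on a small example. This is a sketch with hypothetical toy data; note that `pd.concat` with `axis = 1` aligns rows on the index, so the three pieces should share the same index.

```python
import pandas as pd

# hypothetical toy table with 'votes' as the middle column
toy = pd.DataFrame({"source": ["c1", "c2"],
                    "votes": [[1, 0, 0, 0, 0], [0, 0, 0, 0, 2]],
                    "weight": [0, 1]})

votes_index = list(toy).index("votes")
left = toy[list(toy)[:votes_index]]        # columns before 'votes'
right = toy[list(toy)[votes_index + 1:]]   # columns after 'votes'
middle = pd.DataFrame(toy["votes"].tolist(),
                      columns=["vote_category" + str(i) for i in range(5)],
                      index=toy.index)

result = pd.concat([left, middle, right], axis=1)
print(list(result))
print(result.shape)  # (2, 7)
```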

# Output Dataframe

We have thus cleaned the pickled files into a single table, which overcomes any NetworkX incompatibility issues, and allows for the debate structure to be reconstructed (by following `source` and `target` relations, where `weight` is an edge attribute and everything else is a node attribute).

We output the dataframe into a single `csv` file.

In [36]:
debates_df.to_csv('kialo_debates.csv', index = False)
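As a sketch of how the debate structure might be rebuilt downstream, the code below reads a toy CSV with the same `source`, `target` and `weight` columns and groups each node under the node it points to. The rows are hypothetical, and the code assumes that the sentinel target is written to the file as the text `-1`.

```python
import csv
import io
from collections import defaultdict

# hypothetical toy CSV with the same key columns as kialo_debates.csv
toy_csv = io.StringIO(
    "source,text,target,weight\n"
    "c1,Root claim,-1,0\n"
    "c2,A reply,c1,1\n"
    "c3,Another reply,c1,-1\n"
    "c4,A reply to a reply,c2,-1\n"
)

children = defaultdict(list)  # target node -> list of (source node, weight)
roots = []                    # nodes that point to no other node
for row in csv.DictReader(toy_csv):
    if row["target"] == "-1":
        roots.append(row["source"])
    else:
        children[row["target"]].append((row["source"], int(row["weight"])))

print(roots)           # ['c1']
print(children["c1"])  # [('c2', 1), ('c3', -1)]
print(children["c2"])  # [('c4', -1)]
```

Starting from each root and following `children` recursively recovers the tree of replies, with the edge weight attached to each reply.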

# Conclusions

This notebook has cleaned a series of pickled Kialo debates (NetworkX directed graphs) into a single table for downstream analysis, from which the debate structure can be reconstructed, while also overcoming any version incompatibilities from NetworkX.

In [37]:
"Notebook done in " + str(timedelta(seconds = time.time() - START)) + '.'

'Notebook done in 0:00:34.535702.'