# Experiment 3: Pandas Dataframes as data structures for Graphs

[//]: # (------------------------------------------    DO NOT MODIFY THIS    ------------------------------------------)
<style type="text/css">
.tg  {border-collapse:collapse;
      border-spacing:0;
     }
.tg td{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg th{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       font-weight:normal;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg .tg-fymr{border-color:inherit;
             font-weight:bold;
             text-align:left;
             vertical-align:top
            }
.tg .tg-0pky{border-color:inherit;
             text-align:left;
             vertical-align:top
            }
[//]: # (--------------------------------------------------------------------------------------------------------------)

[//]: # (-------------------------------------    FILL THIS OUT WITH YOUR DATA    -------------------------------------)
</style>
<table class="tg">
    <tbody>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Title:</td>
        <td class="tg-0pky">Experiment 3: Pandas Dataframes as data structures for Graphs</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Authors:</td>
        <td class="tg-0pky">
            <a href="https://github.com/ecarrenolozano" target="_blank" rel="noopener noreferrer">Edwin Carreño</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Affiliations:</td>
        <td class="tg-0pky">
            <a href="https://www.ssc.uni-heidelberg.de/en" target="_blank" rel="noopener noreferrer">Scientific Software Center</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Date Created:</td>
        <td class="tg-0pky">30.10.2024</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Description:</td>
        <td class="tg-0pky">Creation of a graph using Pandas dataframes and data from CSV files. Conversion to NetworkX is tested too.</td>
      </tr>
    </tbody>
</table>

[//]: # (--------------------------------------------------------------------------------------------------------------)


## Overview

In this notebook we are going to:

1. Import CSV (comma-separated values) data for nodes and edges.
2. Describe the data from the CSVs.
3. Create a data pipeline that:
   - loads CSV data as Pandas dataframes.
4. Create a NetworkX graph from Pandas dataframes.

## Setup (if required)

The following function checks if a library of interest is installed. Depending on the parameter `install`, you can install it automatically.

In [1]:
import subprocess
import sys
from importlib.metadata import version
from importlib.util import find_spec

def check_package_install(package, install=False):
    if find_spec(package) is not None:
        print(package, "is installed with version: ", version(package))
        return
    print(package, "is NOT installed in Python environment")
    if install:
        try:
            # Install with the interpreter that runs this notebook,
            # not with whichever "python" happens to be first on PATH.
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        except subprocess.CalledProcessError as e:
            print(f"\tError occurred: {e}")
            print("\tCheck the name of your package!")
        else:
            # Report the version only when the installation succeeded.
            print(package, "has been installed with version: ", version(package))

### NetworkX installation

In [2]:
check_package_install("networkx", install=False)

networkx is installed with version:  3.4.2


### Pandas installation

In [3]:
check_package_install("pandas", install=False)

pandas is installed with version:  2.2.2


## Importing Libraries

In [4]:
"""
Recommendations:
    - Respect the order of the imports, they are indicated by the numbers 1, 2, 3.
    - One import per line is recommended, with this we can track easily any modified line when we use git.
    - Absolute imports are recommended (see 3. Local application/library specific imports below), they improve readability and give better error messages.
    - You should put a blank line between each group of imports.
"""

# future-imports (for instance: from __future__ import barry_as_FLUFL)
# from __future__ import barry_as_FLUFL  

# 1. Standard library imports
import ast
import csv
import os
from itertools import islice

# 2. Related third party imports
import networkx as nx
import pandas as pd
from IPython.display import Image 

# 3. Local application/library specific imports
# import <mypackage>.<MyClass>         # this is an example
# from <mypackage> import <MyClass>    # this is another example 

## Helper Functions

In [5]:
def networkx_graph_from_pandas(df_nodes, df_edges, graph_type=None):
    # Create a fresh graph on every call. A default of nx.DiGraph() would be
    # evaluated only once and the same graph object would be shared between calls.
    networkx_graph = nx.DiGraph() if graph_type is None else graph_type

    # Alternative, kept for reference (commented out): create_graph_from_edgeslist
    # networkx_graph = nx.from_pandas_edgelist(df_edges,
    #                                          source='Source ID',
    #                                          target='Target ID',
    #                                          create_using=nx.DiGraph()
    #                                         )

    # populate_graph_edges_properties: tuples of (source, target, attributes)
    attr = get_attributes(df_edges, 'y')
    networkx_graph.add_edges_from(pd.concat([df_edges[['Source ID', 'Target ID']],
                                             attr
                                            ],
                                            axis=1).itertuples(index=False, name=None)
                                  )

    # populate_graph_nodes_properties: tuples of (node id, attributes)
    networkx_graph.add_nodes_from(pd.concat([df_nodes['UniProt ID'],
                                             df_nodes['properties'].map(ast.literal_eval)
                                            ],
                                            axis=1).itertuples(index=False, name=None)
                                 )

    return networkx_graph

def get_attributes(df, flag):
    # flag == 'y': merge the relationship id and label into the parsed properties.
    # Any other value: return only the parsed properties.
    if flag == 'y':
        attr = pd.DataFrame({
            'properties': df.apply(
                lambda row: {
                    'id': row['Relationship ID'],
                    'label': row['label'],
                    **ast.literal_eval(row['properties'])
                },
                axis=1
            )
        })
    else:
        attr = df['properties'].map(ast.literal_eval)
    return attr

def create_dataframes(file_path_nodes, file_path_edges):
    df_nodes = from_csv_to_pandasdf(file_path_nodes, delimiter=',')
    df_edges = from_csv_to_pandasdf(file_path_edges, delimiter=',')
    return df_nodes, df_edges    

def from_csv_to_pandasdf(file_path, delimiter=','):
    return pd.read_csv(file_path,
                       delimiter=delimiter)

def load_csv_generator(file_path, header=True):
    with open(file_path, "r") as file:
        reader = csv.reader(file)
        if header:
            next(reader)
        for row in reader:
            yield tuple(row)
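
The commented-out call inside `networkx_graph_from_pandas` points to a shorter route: `nx.from_pandas_edgelist` can build the graph directly from the edges dataframe. The following is a minimal sketch of that alternative, not the approach used in this notebook. The dataframe `df_demo` and its values are made up for illustration, and only the `label` column is carried over as an edge attribute.

```python
import networkx as nx
import pandas as pd

# Made-up edge data using the column names of the edges CSV.
df_demo = pd.DataFrame({
    "Source ID": ["P00001", "P00002"],
    "Target ID": ["P00002", "P00003"],
    "label": ["interacts_with", "interacts_with"],
})

# Build a directed graph; every listed column becomes an edge attribute.
graph_demo = nx.from_pandas_edgelist(
    df_demo,
    source="Source ID",
    target="Target ID",
    edge_attr=["label"],
    create_using=nx.DiGraph,
)

print(list(graph_demo.edges(data=True)))
```

Note that this route only covers edges. Node properties would still have to be added separately, for example with `add_nodes_from`.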

## Introduction

A graph is a collection of nodes and edges, where the edges express the relationships between nodes. A minimal node contains only its `node label`; similarly, a minimal edge contains a `source id`, a `target id` and a `label`.

The data for nodes and edges can be stored as tuples. Figure 1 shows the graph that can be constructed from the datasets `dataset_dummy2_edges.csv` and `dataset_dummy2_nodes.csv`.

In [6]:
Image(filename="./images/graph_dataset_dummy2.png", width=600, height=600) 


In this notebook we work with slightly more complex graphs (more nodes and more edges). Additionally, each node and edge carries properties that we want to keep for use in subsequent frameworks, e.g. machine learning and optimization frameworks.
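
As a small illustration of the tuple representation described in this introduction, the sketch below stores minimal edges as `(source id, target id, label)` tuples and then attaches a property dictionary to each one. All identifiers, labels and property values are invented for this example and are not read from the dataset files.

```python
# Toy data invented for this sketch (not read from the CSV files).
# Minimal edge: (source id, target id, label).
edges = [
    ("A", "B", "interacts_with"),
    ("B", "C", "interacts_with"),
    ("A", "C", "binds_to"),
]

# The node ids of the graph can be recovered from the edge endpoints.
node_ids = sorted(
    {endpoint for source, target, _ in edges for endpoint in (source, target)}
)

# Richer form: every edge tuple additionally carries a dictionary of properties.
edges_with_properties = [
    (source, target, label, {"weight": 1.0}) for source, target, label in edges
]

print(node_ids)
print(edges_with_properties[0])
```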

## Section 1: Load the CSV data that contains nodes and edges

For this exercise, each graph is represented by two CSV files: one containing information about the **nodes** and the other about the **edges**. To indicate that both files correspond to the same graph, their names include the same number of nodes. For example:

- `dataset_30_nodes_proteins.csv`: contains 30 rows (nodes).
- `dataset_30_edges_interactions.csv`: contains 47 rows (edges).

We reference each CSV file or dataset as follows:

In [None]:
#filename_nodes = "dataset_dummy2_nodes.csv"
#filename_edges = "dataset_dummy2_edges.csv"

filename_nodes = "dataset_30_nodes_proteins.csv"
filename_edges = "dataset_30_edges_interactions.csv"

#-----------  Change the path to where you have the datasets
# FILE_PATH_DATASETS = "../../../DATASETS"
FILE_PATH_DATASETS = "../data_examples"

### 1.1: Load Nodes

The CSV file for nodes contains three columns:
- `UniProt ID`
- `label`
- `properties`

We are going to load the information of nodes as a **list of tuples**. Each tuple represents a node with the structure:
- `(id, label, properties)`
- Each field in the tuple is a `string`
- The `properties` field is a string containing a dictionary of properties.

In [None]:
file_path_nodes = os.path.join(FILE_PATH_DATASETS, filename_nodes)

list_nodes_example = list(islice(load_csv_generator(file_path_nodes, header=True), 3))

In [None]:
print("The list of NODES contains: {} nodes".format(len(list_nodes_example)))
print("Examples:")
for node in list_nodes_example:
    print("{}".format(node))
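
Because every field arrives as a string, the `properties` field has to be parsed before it can be used as a dictionary. A minimal sketch, assuming the string holds a valid Python dictionary literal, is shown below. The node tuple is hypothetical and only mimics the `(id, label, properties)` layout.

```python
import ast

# Hypothetical node tuple in the (id, label, properties) layout; values are made up.
node_example = ("P00001", "protein", "{'name': 'example protein', 'length': 120}")

node_id, label, properties_str = node_example

# ast.literal_eval parses Python literals only, so it does not execute arbitrary code.
properties = ast.literal_eval(properties_str)

print(type(properties_str).__name__, "->", type(properties).__name__)
print(properties["name"])
```

This is the same parsing step that the helper functions apply to the `properties` column of the dataframes.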

### 1.2: Load Edges

The CSV file for edges contains five columns:
- `Relationship ID`
- `Source ID`
- `Target ID`
- `label`
- `properties`

We are going to load the information of edges as a **list of tuples**. Each tuple represents an edge with the structure:
- `(id, source, target, label, properties)`
- Each field in the tuple is a `string`
- The `properties` field is a string containing a dictionary of properties.

In [None]:
file_path_edges = os.path.join(FILE_PATH_DATASETS, filename_edges)

list_edges_example = list(islice(load_csv_generator(file_path_edges, header=True), 3))

In [None]:
print("The list of EDGES contains: {} edges".format(len(list_edges_example)))
print("Examples:")
for edge in list_edges_example:
    print("{}".format(edge))
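
NetworkX accepts edges as `(source, target, attributes)` triples, so an edge tuple in the five-field layout needs a small conversion step. The sketch below mirrors what `get_attributes` does for dataframe rows. The function name `edge_tuple_to_triple` and the sample edge are assumptions made for illustration.

```python
import ast

# Hypothetical edge tuple in the (id, source, target, label, properties) layout.
edge_example = ("R1", "P00001", "P00002", "interacts_with", "{'score': 0.9}")

def edge_tuple_to_triple(edge):
    """Convert a five-field edge tuple into (source, target, attributes)."""
    relationship_id, source, target, label, properties_str = edge
    attributes = {
        "id": relationship_id,
        "label": label,
        **ast.literal_eval(properties_str),  # parse the stringified dictionary
    }
    return source, target, attributes

print(edge_tuple_to_triple(edge_example))
```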

## Section 2: Create Data Pipeline
- **input:** CSV data of edges
- **output:** Pandas dataframe containing information of edges

The pipeline consists of two consecutive stages:

| Stage | Function                              | Description |
|-------|---------------------------------------| ----------- |
| 1     | `create_pandasdf_from_csv()`          | create a pandas dataframe in memory        |
| 2     | `edges_format_converter(dataframe)`   | transform dataframe rows to desired format |
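
The two stages in the table can be sketched as follows. The function names are taken from the table, but their bodies are assumptions made for illustration, not the implementation used in this notebook. The CSV content is made up and read from memory, so the example does not depend on the dataset files.

```python
import ast
import io

import pandas as pd

# Made-up CSV content with the five edge columns described in Section 1.2.
CSV_EDGES = """Relationship ID,Source ID,Target ID,label,properties
R1,P00001,P00002,interacts_with,"{'score': 0.9}"
R2,P00002,P00003,interacts_with,"{'score': 0.4}"
"""

def create_pandasdf_from_csv(file_or_path, delimiter=","):
    """Stage 1: load the CSV data into an in-memory dataframe."""
    return pd.read_csv(file_or_path, delimiter=delimiter)

def edges_format_converter(dataframe):
    """Stage 2: turn each row into a (source, target, attributes) tuple."""
    return [
        (
            row["Source ID"],
            row["Target ID"],
            {
                "id": row["Relationship ID"],
                "label": row["label"],
                **ast.literal_eval(row["properties"]),
            },
        )
        for _, row in dataframe.iterrows()
    ]

df_edges_demo = create_pandasdf_from_csv(io.StringIO(CSV_EDGES))
edges_demo = edges_format_converter(df_edges_demo)
print(edges_demo[0])
```

Keeping the two stages as separate functions makes it possible to swap the loading step (for example, another delimiter or file source) without touching the conversion step.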

### 2.1 Create Pandas Dataframe from CSV file (data structure)

In [None]:
df_nodes, df_edges = create_dataframes(file_path_nodes, file_path_edges)

In [None]:
df_nodes.head()

In [None]:
df_edges.head()

## Section 3: Converting to NetworkX graph

In [None]:
try:
    G.clear()
    print("Graph has been cleared!")
except NameError:
    # G has not been defined yet in this session
    print("Graph G doesn't exist")

### 3.1 Create a Directed Graph

In [None]:
G = networkx_graph_from_pandas(df_nodes, df_edges, graph_type=nx.DiGraph())

Note that the graph contains a dictionary with a single key, `"properties"`. Inside this dictionary is another dictionary that holds each property. However, this is not the desired format. Instead, we want a single dictionary containing all the properties directly.
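
If an attribute dictionary does end up nested under a single `"properties"` key, it can be flattened into the desired format with a small helper. The sketch below is one possible way to do it. The function `flatten_properties` and the sample dictionary are hypothetical and not part of this notebook.

```python
def flatten_properties(attributes):
    """Merge a nested 'properties' dictionary into its parent dictionary."""
    # Keep every top-level entry except the nested dictionary itself.
    flat = {key: value for key, value in attributes.items() if key != "properties"}
    # Lift the nested entries to the top level.
    flat.update(attributes.get("properties", {}))
    return flat

# Made-up attribute dictionary in the nested format.
nested = {"properties": {"name": "example protein", "length": 120}}

print(flatten_properties(nested))
```

Dictionaries that are already flat pass through unchanged, so the helper can be applied to every node or edge without checking the format first.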

### 3.2 Draw Graph

In [None]:
nx.draw(G, with_labels=True)

### 3.3 Some statistics

In [None]:
print("Number of nodes: {}".format(G.number_of_nodes()))
print("Number of edges: {}".format(G.number_of_edges()))

In [None]:
limit = 5
for index, edge in enumerate(G.edges(data=True)):
    if index == limit:
        break
    print(edge)

In [None]:
limit = 5
for index, node in enumerate(G.nodes(data=True)):
    if index == limit:
        break
    print(node)
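
Beyond node and edge counts, NetworkX offers further summary statistics for directed graphs. The following sketch computes the density and the out-degree of each node on a small toy graph. The graph `toy_graph` is invented for this example, and the same calls should work on `G`.

```python
import networkx as nx

# Toy directed graph invented for this sketch.
toy_graph = nx.DiGraph()
toy_graph.add_edges_from([("A", "B"), ("B", "C"), ("A", "C")])

# Density of a directed graph: edges / (nodes * (nodes - 1)).
density = nx.density(toy_graph)

# Number of outgoing edges per node.
out_degrees = dict(toy_graph.out_degree())

print("Density: {}".format(density))
print("Out-degrees: {}".format(out_degrees))
```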