In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Introduction

In order to get you familiar with graph ideas,
I have deliberately chosen to steer away from
the more pedantic matters
of loading graph data to and from disk.
That said, sooner or later
a graph dataset will land in your lap,
and you'll need to load it in and run with it.

Thus, we're going to go through graph I/O,
specifically the APIs for converting
graph data that comes to you
into that magical NetworkX object `G`.

Let's get going!

## Graph Data as Tables

Let's recall what we've learned in the introductory chapters.
Graphs can be represented using two **sets**:

- Node set
- Edge set

### Node set as tables

Let's say we had a graph with 3 nodes in it: `A, B, C`.
We could represent it in plain text, computer-readable format:

```
A
B
C
```

Suppose the nodes also had metadata.
Then, we could tag on metadata as well:

```
A, circle, 5
B, circle, 7
C, square, 9
```

Does this look familiar to you?
Yes, node sets can be stored in CSV format,
with one of the columns being node ID,
and the rest of the columns being metadata.
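
As a minimal sketch of how such a node table could become a graph, the snippet below reads the three toy rows above with pandas and adds each row as a node with metadata. The column names `node_id`, `shape`, and `num` and the variable `G_nodes` are assumptions made for illustration; the toy table itself has no header.

```python
import io

import networkx as nx
import pandas as pd

# Toy node table from the text above. It has no header row,
# so the column names here are made up for illustration.
node_csv = io.StringIO("A, circle, 5\nB, circle, 7\nC, square, 9\n")
nodes = pd.read_csv(
    node_csv,
    names=["node_id", "shape", "num"],
    skipinitialspace=True,  # drop the space that follows each comma
)

G_nodes = nx.DiGraph()
for record in nodes.to_dict(orient="records"):
    node_id = record.pop("node_id")
    # Every remaining column is stored as node metadata.
    G_nodes.add_node(node_id, **record)
```

Looking up `G_nodes.nodes["C"]` then gives back the metadata stored for node `C`.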

### Edge set as tables

If, between the nodes, we had 4 edges (this is a directed graph),
we could also represent those edges in plain text, computer-readable format:

```
A, C
B, C
A, B
C, A
```

If the edges also had metadata,
we could represent it in the same CSV format:

```
A, C, red
B, C, orange
A, B, yellow
C, A, green
```

If you've been in the data world for a while,
this should not look foreign to you.
Yes, edge sets can be stored in CSV format too!
Two of the columns represent the nodes involved in an edge,
and the rest of the columns represent the metadata.
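
One possible way to turn such an edge table into a directed graph is `nx.from_pandas_edgelist`. This is only a sketch on the four toy edges above; the column names `n1`, `n2`, and `colour` and the variable `G_edges` are assumptions made for illustration.

```python
import io

import networkx as nx
import pandas as pd

# Toy edge table from the text above. It has no header row,
# so the column names here are made up for illustration.
edge_csv = io.StringIO("A, C, red\nB, C, orange\nA, B, yellow\nC, A, green\n")
edges = pd.read_csv(
    edge_csv,
    names=["n1", "n2", "colour"],
    skipinitialspace=True,  # drop the space that follows each comma
)

# Build a directed graph, keeping the colour column as edge metadata.
G_edges = nx.from_pandas_edgelist(
    edges,
    source="n1",
    target="n2",
    edge_attr="colour",
    create_using=nx.DiGraph,
)
```

Because the graph is directed, `A -> C` and `C -> A` are kept as two separate edges, each with its own colour.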

### Combined Representation

In fact, one might also choose to combine
the node set and edge set tables together in a merged format:

```
n1, n2, colour, shape1, num1, shape2, num2
A,  C,  red,    circle, 5,    square, 9
B,  C,  orange, circle, 7,    square, 9
A,  B,  yellow, circle, 5,    circle, 7
C,  A,  green,  square, 9,    circle, 5
```
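
A merged table like this can be unpacked in a single pass over its rows: each row contributes two nodes with their metadata and one edge. The following is a sketch under the assumption that the table is read with pandas; the variable name `G_combined` is made up for illustration.

```python
import io

import networkx as nx
import pandas as pd

# The merged toy table from above, including its header row.
combined_csv = io.StringIO(
    "n1, n2, colour, shape1, num1, shape2, num2\n"
    "A,  C,  red,    circle, 5,    square, 9\n"
    "B,  C,  orange, circle, 7,    square, 9\n"
    "A,  B,  yellow, circle, 5,    circle, 7\n"
    "C,  A,  green,  square, 9,    circle, 5\n"
)
combined = pd.read_csv(combined_csv, skipinitialspace=True)

G_combined = nx.DiGraph()
for row in combined.to_dict(orient="records"):
    # Re-adding an existing node simply updates its metadata.
    G_combined.add_node(row["n1"], shape=row["shape1"], num=row["num1"])
    G_combined.add_node(row["n2"], shape=row["shape2"], num=row["num2"])
    G_combined.add_edge(row["n1"], row["n2"], colour=row["colour"])
```

Note that the node metadata is repeated on every row a node appears in, which is the price paid for keeping everything in one table.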

In this chapter, the datasets that we will be looking at
are going to be formatted in both ways.
Let's get going.

## Dataset

We will be working with the Divvy bike sharing dataset.

To illustrate working with data in the separated node- and edge-set format,
we will be loading in the 2013 data.
On the other hand, to illustrate how we handle data in combined/merged format,
we will be loading in the 2019 Q4 data.

> Divvy is a bike sharing service in Chicago.
> Since 2013, Divvy has released their bike sharing dataset to the public.
> The 2013 dataset consists of two files:
> - `Divvy_Stations_2013.csv`, containing the stations in the system, and
> - `Divvy_Trips_2013.csv`, containing the trips.

Let's dig in.

In [None]:
from pyprojroot import here

Firstly, we need to unzip the dataset.
The code cell below does so if the `divvy_2013` directory is not already present.

In [None]:
import zipfile
import os

# Unzip the dataset only if the divvy_2013 directory is not already present.
if "divvy_2013" not in os.listdir(here() / "datasets"):
    print("Unzipping the divvy_2013.zip file in the datasets folder.")
    with zipfile.ZipFile(here() / "datasets/divvy_2013.zip", "r") as zip_ref:
        zip_ref.extractall(here() / "datasets")

In [None]:
import pandas as pd

stations = pd.read_csv(
    here() / 'datasets/divvy_2013/Divvy_Stations_2013.csv',
    parse_dates=['online date'],
    encoding='utf-8',
)
stations.head(10)

In [None]:
trips = pd.read_csv(here() / 'datasets/divvy_2013/Divvy_Trips_2013.csv', 
                    parse_dates=['starttime', 'stoptime'])
trips.head(10)

In [None]:
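import janitor  # pyjanitor: adds select_columns and rename_column to DataFrames

# Count the number of trips between each ordered pair of stations.
(
    trips.groupby(["from_station_id", "to_station_id"])
    .count()
    .reset_index()
    .select_columns(["from_station_id", "to_station_id", "trip_id"])
    .rename_column("trip_id", "num_trips")
)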
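
For comparison, the same kind of aggregation can be sketched in plain pandas, without pyjanitor. The `toy_trips` table below is made up for illustration and only mimics the three columns used above; it is not the Divvy data.

```python
import pandas as pd

# A made-up stand-in for the trips table, with only the columns needed here.
toy_trips = pd.DataFrame(
    {
        "trip_id": [1, 2, 3, 4],
        "from_station_id": [5, 5, 5, 7],
        "to_station_id": [9, 9, 7, 5],
    }
)

# Count trips per (from, to) station pair and name the count column directly.
trip_counts = (
    toy_trips.groupby(["from_station_id", "to_station_id"])["trip_id"]
    .count()
    .reset_index(name="num_trips")
)
```

Each row of `trip_counts` describes one directed edge between two stations, with `num_trips` as its metadata.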