# Supply chain partitioning example


In this notebook, we use cuGraph to prototype partitioning of the JDA supply chain example graph.

* Created:   12/13/2019
* Last Edit: 12/13/2019

RAPIDS Versions: 0.11.0

Test Hardware
* P100 16G, CUDA 10.0

Using docker container: rapidsai/rapidsai-nightly:cuda10.0-runtime-centos7

## cuGraph Notice 
The current version of cuGraph has some limitations:

* Vertex IDs need to be 32-bit integers.
* Vertex IDs are expected to be contiguous integers starting from 0.

cuGraph provides the renumber function to mitigate this problem. Input vertex IDs for the renumber function can be either 32-bit or 64-bit integers, can be non-contiguous, and can start from an arbitrary number. The renumber function maps the provided input vertex IDs to 32-bit contiguous integers starting from 0. cuGraph still requires the renumbered vertex IDs to be representable in 32-bit integers. These limitations are being addressed and will be fixed soon. 
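
To make the mapping concrete, here is a minimal pure-Python sketch of what renumbering does. It is not cuGraph's implementation: the helper name `renumber_sketch` and the choice to hand out new IDs in sorted order are assumptions made for illustration, and cuGraph may assign the new IDs in a different order.

```python
def renumber_sketch(src, dst):
    """Map arbitrary integer vertex IDs to contiguous IDs starting at 0.

    Hypothetical helper for illustration only; assigning new IDs in
    sorted order of the original IDs is an assumption of this sketch.
    """
    # Collect every distinct vertex ID that appears on either end of an edge.
    originals = sorted(set(src) | set(dst))
    # The position in the sorted list becomes the new, dense ID.
    to_dense = {orig: new for new, orig in enumerate(originals)}
    new_src = [to_dense[v] for v in src]
    new_dst = [to_dense[v] for v in dst]
    # `originals` doubles as the reverse mapping: originals[new_id] is the old ID.
    return new_src, new_dst, originals


# Sparse IDs, one of which does not fit in 32 bits.
src = [10, 10, 5_000_000_000]
dst = [7, 5_000_000_000, 42]
new_src, new_dst, mapping = renumber_sketch(src, dst)
```

In this toy input the 64-bit ID 5000000000 becomes 3, and `mapping[new_id]` recovers the original ID for any renumbered vertex.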

### Test Data
We will use a larger example dataset provided by Arijit.

### Prep

In [1]:
# Import needed libraries
import cugraph
import cudf
import numpy as np

In [2]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

### Read data using cuDF and pandas

In [3]:
# Test file    
datafile='../data/73.prefix.edgelist.csv'

In [4]:
# read the data using cuDF
gdf = cudf.read_csv(datafile, delimiter=",", names=['src', 'dst'], dtype=['int32', 'int32'] )
df = pd.read_csv(datafile, delimiter=",", names=['src', 'dst'])

In [5]:
# Let's adjust all the vertex IDs to be zero based because of the cuGraph limitations
gdf["src"] = gdf["src"] - 1
gdf["dst"] = gdf["dst"] - 1
df["src"] = df["src"] - 1
df["dst"] = df["dst"] - 1

In [6]:
gdf.head().to_pandas()

Unnamed: 0,src,dst
0,0,1880276
1,0,2075383
2,0,1998368
3,0,2026114
4,0,2011687


In [7]:
df.head()

Unnamed: 0,src,dst
0,0,1880276
1,0,2075383
2,0,1998368
3,0,2026114
4,0,2011687


### Create the directed graph using NetworkX

In [8]:
# NetworkX SCC and WCC functions are only defined for directed graphs, so build a DiGraph
cpuG = nx.from_pandas_edgelist(df, source='src', target='dst', create_using=nx.DiGraph)
#nx.draw(cpuG, with_labels=True,pos=nx.circular_layout(cpuG), node_color='r', edge_color='b')
#plt.show()

In [9]:
print("cpu Graph")
print("\tNumber of Vertices: " + str(cpuG.number_of_nodes()))
print("\tNumber of Edges:    " + str(cpuG.number_of_edges()))

cpu Graph
	Number of Vertices: 894976
	Number of Edges:    1048575


In [10]:
%%time
nx.is_strongly_connected(cpuG)

CPU times: user 7.3 s, sys: 165 ms, total: 7.47 s
Wall time: 7.46 s


False

In [11]:
%%time
nx.is_weakly_connected(cpuG)

CPU times: user 2.88 s, sys: 26.7 ms, total: 2.91 s
Wall time: 2.9 s


False

In [12]:
%%time
print("\tNumber strongly connected components: " + str(nx.number_strongly_connected_components(cpuG)))

	Number strongly connected components: 894976
CPU times: user 4.4 s, sys: 33.4 ms, total: 4.43 s
Wall time: 4.43 s
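
A component count equal to the vertex count means every strongly connected component is a single vertex, so the graph has no directed cycle through two or more vertices. The sketch below illustrates this on a toy graph using Kosaraju's two-pass algorithm in plain Python. It is only an illustration: `count_sccs` is a hypothetical helper, not the NetworkX or cuGraph implementation.

```python
from collections import defaultdict


def count_sccs(num_vertices, edges):
    """Count strongly connected components with Kosaraju's two-pass algorithm."""
    fwd = defaultdict(list)
    rev = defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record vertices in order of DFS completion on the forward graph.
    visited = [False] * num_vertices
    order = []
    for start in range(num_vertices):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All neighbours explored: this vertex is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order;
    # each new sweep discovers exactly one component.
    assigned = [False] * num_vertices
    count = 0
    for start in reversed(order):
        if assigned[start]:
            continue
        count += 1
        assigned[start] = True
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if not assigned[prev]:
                    assigned[prev] = True
                    stack.append(prev)
    return count


# Acyclic chain 0 -> 1 -> 2 -> 3: every vertex is its own component.
dag_count = count_sccs(4, [(0, 1), (1, 2), (2, 3)])
# Adding the edge 3 -> 0 closes a cycle and merges all four vertices.
cycle_count = count_sccs(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

The acyclic chain yields as many components as vertices, which is the same pattern the supply chain graph shows.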


In [13]:
%%time
print("\tNumber weakly connected components: " + str(nx.number_weakly_connected_components(cpuG)))

	Number weakly connected components: 21619
CPU times: user 2.69 s, sys: 54.9 ms, total: 2.74 s
Wall time: 2.74 s


In [14]:
# Generate the WCCs of cpuG: a generator of sets of nodes, one set per weakly connected component
#[len(c) for c in sorted(nx.weakly_connected_components(cpuG), key=len, reverse=True)]
largest_cc = max(nx.weakly_connected_components(cpuG), key=len)
len(largest_cc)

659200
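
Weak connectivity ignores edge direction, so the components can also be found with a disjoint-set (union-find) structure. The following is a minimal sketch in plain Python on a toy edge list. `weakly_connected_sets` is a hypothetical helper and is not how NetworkX or cuGraph compute the result.

```python
from collections import defaultdict


def weakly_connected_sets(num_vertices, edges):
    """Return the weakly connected components as a list of vertex sets."""
    parent = list(range(num_vertices))

    def find(v):
        # Walk to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        # Direction is ignored: u -> v and v -> u join the same two sets.
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    # Group vertices by the root of their set.
    components = defaultdict(set)
    for v in range(num_vertices):
        components[find(v)].add(v)
    return list(components.values())


# Vertices 0-2 are joined even though the edges point in different directions;
# vertex 5 has no edges and forms a component on its own.
edges = [(0, 1), (2, 1), (3, 4)]
components = weakly_connected_sets(6, edges)
largest_cc = max(components, key=len)
```

As in the cell above, `max(..., key=len)` picks out the largest component.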

### Create the directed graph using cugraph

In [15]:
# Note: cuGraph WCC currently supports only undirected graphs, so the WCC step further down uses Graph().
# Here we build a DiGraph() to match the NetworkX graph.
gpuG = cugraph.DiGraph()
#gpuG = cugraph.Graph()
#gpuG.from_cudf_edgelist(gdf, source='src', target='dst')
gpuG.add_edge_list(gdf['src'], gdf['dst'])

  Use from_cudf_edgelist instead')


In [16]:
print("gpu Graph")
print("\tNumber of Vertices: " + str(gpuG.number_of_vertices()))
print("\tNumber of Edges:    " + str(gpuG.number_of_edges()))

gpu Graph
	Number of Vertices: 2393432
	Number of Edges:    1048575


In [17]:
gdf['src'].unique().count()

467098

In [18]:
gdf['dst'].unique().count()

427878

In [19]:
# Note: the vertex count reported by gpuG (2393432) does not match the NetworkX count (894976).
# It also does not match gdf['src'].unique().count() + gdf['dst'].unique().count() (467098 + 427878 = 894976).

# Reason: the src and dst vertex IDs are sparse. Renumber them to a dense set of contiguous IDs starting from 0.
gdf['renumbered_src'], gdf['renumbered_dst'], mapping = cugraph.renumber(gdf['src'], gdf['dst'])
gdf.head()

Unnamed: 0,src,dst,renumbered_src,renumbered_dst
0,0,1880276,0,495142
1,0,2075383,0,333932
2,0,1998368,0,869324
3,0,2026114,0,320706
4,0,2011687,0,534132
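
The mismatch can be reproduced on a toy edge list. The sketch below counts vertices three ways. It assumes that, without renumbering, the vertex count is taken as the largest vertex ID plus one; that is an assumption about this cuGraph version, not something the notebook output confirms. The variable names are illustrative.

```python
# Toy edge list with sparse vertex IDs.
src = [0, 0, 7]
dst = [1_000, 2_000, 2_000]

# Distinct IDs that actually occur as an endpoint of some edge.
distinct_vertices = len(set(src) | set(dst))

# Size of an ID space that runs 0..max_id with no renumbering (assumed behaviour).
dense_id_space = max(src + dst) + 1

# Sum of the per-column unique counts; this equals distinct_vertices
# only when no ID appears in both columns.
per_column_sum = len(set(src)) + len(set(dst))
```

In this dataset the per-column sum (467098 + 427878 = 894976) equals the NetworkX vertex count, which implies that no vertex ID appears both as a source and as a destination.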


In [20]:
gpuG = cugraph.DiGraph()
gpuG.add_edge_list(gdf['renumbered_src'], gdf['renumbered_dst'])

  Use from_cudf_edgelist instead')


In [21]:
print("gpu Graph")
print("\tNumber of Vertices: " + str(gpuG.number_of_vertices()))
print("\tNumber of Edges:    " + str(gpuG.number_of_edges()))

gpu Graph
	Number of Vertices: 894976
	Number of Edges:    1048575


In [22]:
# The current version of SCC cannot handle a graph with more than 50K vertices
df = cugraph.strongly_connected_components(gpuG)

RuntimeError: cuGraph failure at: /conda/conda-bld/libcugraph_1576317568753/work/cpp/src/components/connectivity.cu:165: ERROR: Insufficient device memory for SCC

In [None]:
%%time
# Use the prototype, experimental SCC implementation; its performance will be tuned in RAPIDS 12/13
# https://github.com/rapidsai/cugraph/tree/branch-0.11/python/cugraph/proto/components
from cugraph.proto.components import strong_connected_component

df1 = strong_connected_component(gdf['renumbered_src'], gdf['renumbered_dst'])

In [24]:
%%time
# cuGraph WCC currently supports only undirected graphs, so we build an undirected Graph()
gpuG_undirected = cugraph.Graph()
gpuG_undirected.add_edge_list(gdf['renumbered_src'], gdf['renumbered_dst'])
df1 = cugraph.weakly_connected_components(gpuG_undirected)

CPU times: user 127 ms, sys: 55 ms, total: 182 ms
Wall time: 182 ms


In [25]:
label_gby = df1.groupby('labels')
label_count = label_gby.count()
print("Total number of components found : ", len(label_count))

Total number of components found :  21619


In [26]:
label_gby.size().head()

labels
1     659200
5         26
9          4
10         6
15       102
dtype: int32
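
The per-label sizes above are listed in label order, not by size. As a rough sketch of how the components could be ranked, the plain-Python example below counts vertices per label and sorts the result. The `labels` list is made-up data standing in for the `labels` column, not values from this dataset.

```python
from collections import Counter

# Hypothetical per-vertex component labels (made-up data).
labels = [1, 1, 1, 5, 5, 9, 1, 15, 15, 15]

# label -> number of vertices carrying that label
sizes = Counter(labels)

# Components ordered from largest to smallest.
ranked = sizes.most_common()
largest_label, largest_size = ranked[0]

# Fraction of all vertices that fall in the largest component.
share = largest_size / len(labels)
```

The same ranking applied to the real labels would show how much of the graph sits in the single largest component (659200 of 894976 vertices here).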