# Community Detection Using the `NETWORK` Actionset in SAS Viya and Python

In this example, we load a small undirected sample graph into CAS and show how to detect communities using the `network` actionset. We also demonstrate how the resolution and fixed node groups affect the detected communities.

----------------

The basic flow of this notebook is as follows:
1. Load the sample graph into a Pandas DataFrame as a set of links that represent the total graph. 
2. Connect to our CAS server and load the actionsets we require.
3. Upload our sample graph to our CAS server.
4. Execute the community detection without fixed nodes using two resolutions (0.5 and 1.0).
5. Execute the community detection with fixed nodes.
6. Prepare and display the network plots showing the communities.

----------------
__Prepared by:__
Damian Herrick (damian.herrick@sas.com)

#### Imports

Our imports are broken out as follows:

| Module        | Description                                                                        |
|:--------------|:----------------------------------------------------------------------------------:|
| `os`          | Allows access to environment variables.                                            |
| `swat`        | SAS Python module that orchestrates communication with a CAS server.               |
| `pandas`      | Data management module we use for preparation of local data.                       |
| `networkx`    | Used to manage graph data structures when plotting.                                |
| `bokeh`       | Module used to generate interactive plots of graphs.                               |

In [8]:
import os
from typing import Any, List
import swat
import pandas as pd

import networkx as nx
from networkx import Graph

from bokeh.core.enums import Palette
from bokeh.io import output_notebook, show, save
from bokeh.layouts import gridplot
from bokeh.models import Circle, MultiLine, Range1d
from bokeh.models.annotations import LabelSet
from bokeh.models.graphs import NodesAndLinkedEdges
from bokeh.models.plots import Plot
from bokeh.models.sources import ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.palettes import Blues8, Spectral8
from bokeh.plotting import from_networkx

NX_SPRING_SEED = 96201546

In [9]:
def set_node_colors(
    graph: Graph, attr_to_highlight: Any, ref_attr: str, color_palette: Palette
) -> None:
    """set_node_colors: set colors based on binary settings.

    Args:
        graph (Graph): The input graph.
        attr_to_highlight (Any): Value of the attribute to highlight.
        ref_attr (str): label of the attribute.
        color_palette (Palette): Bokeh palette to use.
    """
    for node in graph.nodes():
        # Look up the color by the attribute named in ref_attr rather than a hard-coded label.
        graph.nodes[node]["highlight"] = color_palette[graph.nodes[node][ref_attr]]

def render_plot(
    graph: Graph,
    title: str,
    hover_tooltips: List,
    node_size: int = 15,
    node_color: str = Blues8[-1],
    node_alpha: float = 1.0,
    aspect_ratio: float = None,
    width: int = 600,
    height: int = 600,
    outfile: str = None,
) -> Plot:
    """render_plot :: simple function to plot a graph using bokeh.

    Args:
        graph (Graph): The input, fully prepared graph.
        title (str): Title of the graph we're creating.
        hover_tooltips (List): the list of tuples that should display on hover.
        node_size (int): Optional, default 15. Set the size of the nodes.
        node_color (str): Optional. A color value, or the label of the node attribute holding each node's color.
        node_alpha (float): Optional. Fill opacity of the nodes and label backgrounds.
        aspect_ratio (float): Optional. Defined aspect ratio of the figure.
        width (int): Optional, default 600px. Non-negative integer pixel width.
        height (int): Optional, default 600px. Non-negative integer pixel height.
        outfile (str): Optional. Name of the export file. Defaults to None.

    Returns:
        Plot: The prepared Bokeh figure ready for display.
    """
    # Create a plot — set dimensions, toolbar, and title
    plot = Plot(
        x_range=Range1d(-10.1, 10.1),
        title=title,
        width=width,
        height=height,
        aspect_ratio=aspect_ratio,
    )

    plot.xgrid.grid_line_color = None
    plot.ygrid.grid_line_color = None

    # Create a network graph object with spring layout
    nw_graph = from_networkx(
        graph, nx.spring_layout, scale=10, center=(2, 0), seed=NX_SPRING_SEED
    )

    # Set node size and color
    nw_graph.node_renderer.glyph = Circle(
        size=node_size, fill_color=node_color, fill_alpha=node_alpha
    )

    # Set edge opacity and width
    nw_graph.edge_renderer.glyph = MultiLine(line_alpha=0.5, line_width=2)

    # green hover for both nodes and edges
    nw_graph.node_renderer.hover_glyph = Circle(size=node_size, fill_color="#abdda4")
    nw_graph.edge_renderer.hover_glyph = MultiLine(line_color="#abdda4", line_width=4)

    nw_graph.inspection_policy = NodesAndLinkedEdges()

    plot.add_tools(HoverTool(tooltips=hover_tooltips))

    # Add network graph to the plot
    plot.renderers.append(nw_graph)

    # Add Labels
    x, y = zip(*nw_graph.layout_provider.graph_layout.values())
    node_labels = list(graph.nodes())
    source = ColumnDataSource(
        {"x": x, "y": y, "name": [node_labels[i] for i in range(len(x))]}
    )
    labels = LabelSet(
        x="x",
        x_offset=-5,
        y="y",
        y_offset=-5,
        text="name",
        source=source,
        background_fill_color=node_color,
        text_font_size="13px",
        background_fill_alpha=node_alpha,
    )
    plot.renderers.append(labels)

    if outfile is not None:
        save(plot, filename=outfile)

    return plot

The call to `output_notebook` is required by `bokeh` to render plots inside Jupyter Notebooks.

In [10]:
output_notebook()

### Step 1: Prepare the sample graph. 
* We pass a set of links and a set of nodes. The node set is needed this time because it carries the `fixGroup` column that defines the fixed groups used in the later calculation.

In [11]:
colNames = ["from", "to"]
links = [
    ("A", "B"),
    ("A", "F"),
    ("A", "G"),
    ("B", "C"),
    ("B", "D"),
    ("B", "E"),
    ("C", "D"),
    ("E", "F"),
    ("G", "I"),
    ("G", "H"),
    ("H", "I"),
]

dfLinkSetIn = pd.DataFrame(links, columns=colNames)

colNames = ["node", "fixGroup"]
nodes = [("A", 1), ("B", 1), ("C", 2), ("D", 2), ("H", 3), ("I", 3)]

dfNodeSetIn = pd.DataFrame(nodes, columns=colNames)
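
Only six of the nine nodes appear in the node list, so three nodes carry no fix group. A quick local check, sketched here with plain Python sets (the names `link_nodes`, `fixed`, and `unfixed` are illustrative and not part of the notebook), lists them:

```python
# Same links and fix-group assignments as in the cell above.
links = [
    ("A", "B"), ("A", "F"), ("A", "G"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "D"), ("E", "F"), ("G", "I"), ("G", "H"), ("H", "I"),
]
nodes = [("A", 1), ("B", 1), ("C", 2), ("D", 2), ("H", 3), ("I", 3)]

# Every node that appears at either end of a link.
link_nodes = {n for link in links for n in link}

# Nodes that were given an explicit fix group.
fixed = {name for name, _ in nodes}

# Nodes present in the graph but without a fix group.
unfixed = sorted(link_nodes - fixed)
print(len(link_nodes), unfixed)
```

Nodes E, F, and G have no fix group, so only the pairs A/B, C/D, and H/I are tied together in the fixed-group run.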

Let's start by looking at the basic network itself.

We create a `networkx` graph and pass it to our `bokeh` helper function to create the initial plot.

In [13]:
G_comm = nx.from_pandas_edgelist(dfLinkSetIn, 'from', 'to')

title = 'Sample Undirected Graph for Community Detection'
hover = [('Node', '@index')]
nodeSize = 25

plot = render_plot(G_comm, title, hover, nodeSize)
show(plot)

### Step 2: Connect to CAS, load the actionsets we'll need, and upload our graph to the CAS server.

In [15]:
host = os.environ['CAS_HOST_ORGRD']
port = int(os.environ['CAS_PORT'])
print(f"{host}:{port}")

orgrd061.unx.sas.com:23404


In [16]:
conn = swat.CAS(host, port)

_ = conn.loadactionset("network")

NOTE: Added action set 'network'.


Before we load the data, we should verify which caslib is active. Since we just connected and have not specified, the active library should map to our user ID.

Only one caslib can be active at a time. As long as we are happy with the active caslib, we do not need to reference the caslib in subsequent calls to CAS through `swat` methods. Note that this is slightly different from the corresponding SAS code we reference.

In [17]:
conn.caslibinfo()

Unnamed: 0,Name,Type,Description,Path,Definition,Subdirs,Local,Active,Personal,Hidden,Transient
0,CASTestTmp,PATH,castest's test files,/bigdisk/lax/castest/,,1.0,0.0,0.0,0.0,0.0,0.0
1,CASUSER(daherr),PATH,Personal File System Caslib,/u/daherr/,,1.0,0.0,0.0,1.0,0.0,1.0
2,CASUSERHDFS(daherr),HDFS,Personal HDFS Caslib,/user/daherr/,,1.0,0.0,1.0,1.0,0.0,1.0
3,EngTest,HDFS,engtest's HDAT files,/user/engtest/,,1.0,0.0,0.0,0.0,0.0,0.0
4,Formats,PATH,Format Caslib,/bigdisk/lax/formats/,,1.0,0.0,0.0,0.0,0.0,0.0
5,HPS,HDFS,HDAT files on /hps,/hps/,,1.0,0.0,0.0,0.0,0.0,0.0
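
In the listing above, the `Active` column marks the caslib that unqualified table names resolve to, here `CASUSERHDFS(daherr)`. As a small sketch, the helper below (`find_active_caslib` is a hypothetical name) picks that row out of a list of records. With a live session the records might be obtained by converting the `caslibinfo` result table with `to_dict("records")`; that conversion is an assumption about the result layout and is not shown in this notebook.

```python
from typing import Dict, List, Optional


def find_active_caslib(rows: List[Dict]) -> Optional[str]:
    """Return the name of the first caslib whose Active flag is set, if any."""
    for row in rows:
        if row.get("Active") == 1:
            return row["Name"]
    return None


# A few rows copied from the caslibinfo output above (only the columns used here).
rows = [
    {"Name": "CASTestTmp", "Active": 0.0},
    {"Name": "CASUSER(daherr)", "Active": 0.0},
    {"Name": "CASUSERHDFS(daherr)", "Active": 1.0},
]
print(find_active_caslib(rows))
```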


#### Upload the local dataframes into CAS

In [18]:
_ = conn.upload(dfLinkSetIn, casout=dict(name='LinkSetIn'))
_ = conn.upload(dfNodeSetIn, casout=dict(name='NodeSetIn'))

NOTE: Cloud Analytic Services made the uploaded file available as table LINKSETIN in caslib CASUSERHDFS(daherr).
NOTE: The table LINKSETIN has been created in caslib CASUSERHDFS(daherr) from binary data uploaded to Cloud Analytic Services.
NOTE: Cloud Analytic Services made the uploaded file available as table NODESETIN in caslib CASUSERHDFS(daherr).
NOTE: The table NODESETIN has been created in caslib CASUSERHDFS(daherr) from binary data uploaded to Cloud Analytic Services.


### Step 3a: Calculate the communities (without fixed groups) in our graph using the `network` actionset.

Since we've loaded our actionset, we can reference it using dot notation from our connection object.

We expect that resolution of 0.5 will detect two communities; resolution of 1.0 will detect three communities.

Note that the Python code below is equivalent to this `PROC NETWORK` call (apart from the output table names):
```
proc network
   links              = mycas.LinkSetIn
   outNodes           = mycas.NodeSetOut;
   community
      resolutionList  = 1.0 0.5
      outLevel        = mycas.CommLevelOut
      outCommunity    = mycas.CommOut
      outOverlap      = mycas.CommOverlapOut
      outCommLinks    = mycas.CommLinksOut;
run;
```

In [19]:
conn.network.community(links=dict(name='LinkSetIn'),
                       outnodes=dict(name='nodeSetOutA'),
                       outLevel=dict(name='CommLevelOut'),
                       outCommunity=dict(name='CommOut'),
                       outOverlap=dict(name='CommOverlapOut'),
                       outCommLinks=dict(name='CommLinksOut'),
                       resolutionList=[0.5, 1])

NOTE: The number of nodes in the input graph is 9.
NOTE: The number of links in the input graph is 11.
NOTE: Processing community detection using 1 threads across 1 machines.
NOTE: At resolution=1, the community algorithm found 3 communities with modularity=0.392562.
NOTE: At resolution=0.5, the community algorithm found 2 communities with modularity=0.342975.
NOTE: Processing community detection used 0.00 (cpu: 0.00) seconds.


Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,CASUSERHDFS(daherr),nodeSetOutA,,9,3,"CASTable('nodeSetOutA', caslib='CASUSERHDFS(da..."
1,CASUSERHDFS(daherr),CommLinksOut,,3,5,"CASTable('CommLinksOut', caslib='CASUSERHDFS(d..."
2,CASUSERHDFS(daherr),CommOut,,5,9,"CASTable('CommOut', caslib='CASUSERHDFS(daherr)')"
3,CASUSERHDFS(daherr),CommLevelOut,,2,4,"CASTable('CommLevelOut', caslib='CASUSERHDFS(d..."
4,CASUSERHDFS(daherr),CommOverlapOut,,11,3,"CASTable('CommOverlapOut', caslib='CASUSERHDFS..."

Unnamed: 0,Name1,Label1,cValue1,nValue1
0,numNodes,Number of Nodes,9,9.0
1,numLinks,Number of Links,11,11.0
2,graphDirection,Graph Direction,Undirected,

Unnamed: 0,Name1,Label1,cValue1,nValue1
0,problemType,Problem Type,Community Detection,
1,status,Solution Status,OK,
2,cpuTime,CPU Time,0.00,0.0
3,realTime,Real Time,0.00,0.000156
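
To see why a lower resolution favors fewer, larger communities, we can score candidate partitions by hand. The sketch below uses a common resolution-weighted modularity, `Q = sum over communities of (L_c / m - resolution * (d_c / (2 * m)) ** 2)`, where `L_c` is the number of links inside community `c`, `d_c` is the sum of its node degrees, and `m` is the total number of links. Whether the `network` actionset optimizes exactly this objective is an assumption. The two partitions are also assumptions: the notebook does not print the node assignments, and these were chosen because they reproduce the modularity values reported in the log.

```python
from typing import List, Set, Tuple

links = [
    ("A", "B"), ("A", "F"), ("A", "G"), ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "D"), ("E", "F"), ("G", "I"), ("G", "H"), ("H", "I"),
]


def modularity(
    links: List[Tuple[str, str]],
    communities: List[Set[str]],
    resolution: float = 1.0,
) -> float:
    """Resolution-weighted modularity of an unweighted, undirected graph."""
    m = len(links)
    community_of = {n: i for i, c in enumerate(communities) for n in c}
    internal = [0] * len(communities)  # links with both ends in the community
    degree = [0] * len(communities)    # sum of node degrees per community
    for u, v in links:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[i] / m - resolution * (degree[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


# Assumed partitions (not copied from CAS output).
three = [{"A", "E", "F"}, {"B", "C", "D"}, {"G", "H", "I"}]
two = [{"A", "B", "C", "D", "E", "F"}, {"G", "H", "I"}]

for gamma in (1.0, 0.5):
    print(gamma, round(modularity(links, three, gamma), 6),
          round(modularity(links, two, gamma), 6))
```

At resolution 1.0 the three-community partition scores 0.392562 and the two-community partition 0.342975, matching the log. At resolution 0.5 the penalty term is halved and the two-community partition scores higher, which is consistent with the two communities reported at that resolution.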


### Step 3b: Calculate the communities (with fixed groups) in our graph using the `network` actionset.

Using fixed node groups, we expect to find three communities.

The Python code in the subsequent block is equivalent to this `PROC NETWORK` call (apart from the output table name):
```
proc network
   nodes             = mycas.NodeSetIn
   links             = mycas.LinkSetIn
   outNodes          = mycas.NodeSetOut;
   community
      resolutionList = 1.0
      fix            = fixGroup;
run;
```

In [20]:
conn.network.community(nodes=dict(name='NodeSetIn'),
                       links=dict(name='LinkSetIn'),
                       outnodes=dict(name='NodeSetOutB'),
                       resolutionList=[1.0],
                       fix='fixGroup')

NOTE: The number of nodes in the input graph is 9.
NOTE: The number of links in the input graph is 11.
NOTE: Processing community detection using 1 threads across 1 machines.
NOTE: At resolution=1, the community algorithm found 3 communities with modularity=0.342975.
NOTE: Processing community detection used 0.00 (cpu: 0.00) seconds.


Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,CASUSERHDFS(daherr),NodeSetOutB,,9,2,"CASTable('NodeSetOutB', caslib='CASUSERHDFS(da..."

Unnamed: 0,Name1,Label1,cValue1,nValue1
0,numNodes,Number of Nodes,9,9.0
1,numLinks,Number of Links,11,11.0
2,graphDirection,Graph Direction,Undirected,

Unnamed: 0,Name1,Label1,cValue1,nValue1
0,problemType,Problem Type,Community Detection,
1,status,Solution Status,OK,
2,cpuTime,CPU Time,0.00,0.0
3,realTime,Real Time,0.00,0.000134
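
The `fix` column asks that nodes sharing a fix-group value stay in the same community. A minimal local check of that property is sketched below. `fix_group_violations` is a hypothetical helper, and both assignment dictionaries are assumed partitions used for illustration only, not output copied from CAS.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def fix_group_violations(
    fix_groups: List[Tuple[str, int]], assignment: Dict[str, int]
) -> Dict[int, Set[int]]:
    """Return the fix groups whose members landed in more than one community."""
    seen = defaultdict(set)
    for node, group in fix_groups:
        seen[group].add(assignment[node])
    return {g: comms for g, comms in seen.items() if len(comms) > 1}


fix_groups = [("A", 1), ("B", 1), ("C", 2), ("D", 2), ("H", 3), ("I", 3)]

# Assumed partition that keeps every fix group together.
respects = {"A": 0, "B": 0, "E": 0, "F": 0, "C": 1, "D": 1, "G": 2, "H": 2, "I": 2}
print(fix_group_violations(fix_groups, respects))

# Assumed partition that separates A from B, breaking fix group 1.
splits = {"A": 0, "E": 0, "F": 0, "B": 1, "C": 1, "D": 1, "G": 2, "H": 2, "I": 2}
print(fix_group_violations(fix_groups, splits))
```

The first call returns an empty dictionary; the second reports fix group 1 as spread over two communities.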


### Step 4: Get the community results from CAS and prepare data for plotting

------
In this step we fetch the node results from CAS, then add the community assignments and node fill colors as node attributes in our `networkx` graph.
Since we do the same thing for each of the three community assignments, we combine the work into a single cell. In production code we would probably factor this into a helper function.

| Table      | Description                                                                                 |
|------------|---------------------------------------------------------------------------------------------|
| `NodeSetOutA` | Results and community labels for the non-fixed group calculations, resolutions 0.5 and 1.0. |
| `NodeSetOutB` | Results and community labels for the fixed group calculation at resolution 1.0.             |

| Attribute Label   | Description                                               |
|-------------------|-----------------------------------------------------------|
| `community_0`     | Community assignment for non-fixed groups, resolution 1.0 |
| `community_1`     | Community assignment for non-fixed groups, resolution 0.5 |
| `community_fixed` | Community assignment for fixed groups, resolution 1.0     |

In [21]:
# pull the node set locally so we can plot
comm_nodes_cas = conn.CASTable('NodeSetOutA').to_dict(orient='index')
comm_fixed_nodes_cas = conn.CASTable('NodeSetOutB').to_dict(orient='index')

# make our mapping dictionaries that allow us to assign attributes
comm_nodes_0 = {v['node']:v['community_0'] for v in comm_nodes_cas.values()}
comm_nodes_1 = {v['node']:v['community_1'] for v in comm_nodes_cas.values()}
comm_fixed_nodes = {v['node']:v['community_0'] for v in comm_fixed_nodes_cas.values()}

# set the attributes
nx.set_node_attributes(G_comm, comm_nodes_0, 'community_0')
nx.set_node_attributes(G_comm, comm_nodes_1, 'community_1')
nx.set_node_attributes(G_comm, comm_fixed_nodes, 'community_fixed')

# Assign the fill colors for the nodes.
for node in G_comm.nodes:
    G_comm.nodes[node]['highlight_0'] = Spectral8[int(G_comm.nodes[node]['community_0'])]
    G_comm.nodes[node]['highlight_1'] = Spectral8[int(G_comm.nodes[node]['community_1'])]
    G_comm.nodes[node]['highlight_fixed'] = Spectral8[int(G_comm.nodes[node]['community_fixed'])]
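
The cell above repeats the same lookup, attribute assignment, and color assignment three times. One possible shape for a helper is sketched below. `community_attrs`, the sample records, and the three-color stand-in palette are all illustrative assumptions; the modulo on the palette index is a small design change that avoids an `IndexError` if there are more communities than colors.

```python
from typing import Any, Dict, Iterable, Sequence


def community_attrs(
    records: Iterable[Dict[str, Any]],
    column: str,
    palette: Sequence[str],
    suffix: str,
) -> Dict[str, Dict[str, Any]]:
    """Build per-node attributes: a community label plus a fill color."""
    attrs = {}
    for rec in records:
        community = int(rec[column])
        attrs[rec["node"]] = {
            f"community_{suffix}": community,
            f"highlight_{suffix}": palette[community % len(palette)],
        }
    return attrs


# Illustrative records and palette (not CAS output, not Bokeh's Spectral8).
records = [
    {"node": "A", "community_0": 0.0},
    {"node": "B", "community_0": 1.0},
]
palette = ["#111111", "#222222", "#333333"]

attrs = community_attrs(records, "community_0", palette, "fixed")
print(attrs)
```

Because the result is a dictionary of per-node attribute dictionaries, it can be passed directly to `nx.set_node_attributes(G_comm, attrs)`, once per result column.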

### Step 5: Create the three plots and display them

In [22]:
title_0 = 'Community Detection Example 1: Resolution 1'
hover_0 = [('Node', '@index'), ('Community', '@community_0')]

title_1 = 'Community Detection Example 2: Resolution 0.5'
hover_1 = [('Node', '@index'), ('Community', '@community_1')]

title_fixed = 'Community Detection Example 3: Fixed Nodes'
hover_fixed = [('Node', '@index'), ('Community', '@community_fixed')]

# render the plots.
# reminder - we set nodeSize earlier in the notebook. Its value is 25.
plot_0 = render_plot(G_comm, title_0, hover_0, node_size=nodeSize, node_color='highlight_0')
plot_1 = render_plot(G_comm, title_1, hover_1, node_size=nodeSize, node_color='highlight_1')
plot_fixed = render_plot(G_comm, title_fixed, hover_fixed, node_size=nodeSize, node_color='highlight_fixed')

In [23]:
grid = gridplot([plot_0, plot_1, plot_fixed], ncols=2)
show(grid)

### Step 6: Clean up everything.

Make sure we know what tables we created, drop them, and close our connection.
(This is probably overkill, since everything in this session is ephemeral anyway, but it is good practice nonetheless.)

In [24]:
conn.tableinfo()

Unnamed: 0,Name,Rows,Columns,IndexedColumns,Encoding,CreateTimeFormatted,ModTimeFormatted,AccessTimeFormatted,JavaCharSet,CreateTime,Repeated,View,MultiPart,SourceName,SourceCaslib,Compressed,Creator,Modifier,SourceModTimeFormatted,SourceModTime
0,LINKSETIN,11,2,0,utf-8,2021-10-22T11:30:35-04:00,2021-10-22T11:30:35-04:00,2021-10-22T11:30:53-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,2021-10-22T11:30:35-04:00,1950536000.0
1,NODESETIN,6,2,0,utf-8,2021-10-22T11:30:35-04:00,2021-10-22T11:30:35-04:00,2021-10-22T11:30:53-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,2021-10-22T11:30:35-04:00,1950536000.0
2,NODESETOUTA,9,3,0,utf-8,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,2021-10-22T11:31:03-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,,
3,COMMLEVELOUT,2,4,0,utf-8,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,,
4,COMMOUT,5,9,0,utf-8,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,,
5,COMMLINKSOUT,3,5,0,utf-8,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,,
6,COMMOVERLAPOUT,11,3,0,utf-8,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,2021-10-22T11:30:45-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,,
7,NODESETOUTB,9,2,0,utf-8,2021-10-22T11:30:53-04:00,2021-10-22T11:30:53-04:00,2021-10-22T11:31:03-04:00,UTF8,1950536000.0,0,0,0,,,0,daherr,,,


In [25]:
conn.droptable(name='LinkSetIn', quiet=True)
conn.droptable(name='NodeSetIn', quiet=True)
conn.droptable(name='NodeSetOutA', quiet=True)
conn.droptable(name='NodeSetOutB', quiet=True)
conn.droptable(name='CommOut', quiet=True)
conn.droptable(name='CommLevelOut', quiet=True)
conn.droptable(name='CommLinksOut', quiet=True)
conn.droptable(name='CommOverlapOut', quiet=True)
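
The eight calls above differ only in the table name, so the same cleanup can be written as a loop. In the sketch below, `drop_tables` is a hypothetical helper and `_RecordingConn` is a stand-in object that lets the example run without a CAS server; with a real session we would pass `conn` instead.

```python
from typing import Iterable, List


def drop_tables(conn, names: Iterable[str]) -> None:
    """Drop each named table; quiet=True is passed as in the calls above."""
    for name in names:
        conn.droptable(name=name, quiet=True)


class _RecordingConn:
    """Stand-in for a CAS connection that only records the requested drops."""

    def __init__(self) -> None:
        self.dropped: List[str] = []

    def droptable(self, name: str, quiet: bool = False) -> None:
        self.dropped.append(name)


TABLES = [
    "LinkSetIn", "NodeSetIn", "NodeSetOutA", "NodeSetOutB",
    "CommOut", "CommLevelOut", "CommLinksOut", "CommOverlapOut",
]

fake = _RecordingConn()
drop_tables(fake, TABLES)
print(fake.dropped)
```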

In [26]:
conn.close()