## The Node in the Machine: Software Architecture as Network


### Bobby Norton
### Windy City GraphDB
### September 22, 2016

## A design / refactoring experiment...

Take a code base you know well. Put everything in one namespace / package / your language's equivalent of organization.

The tests still pass.

Open the code. Do you like this better?

There are fewer things, right? Fewer files. Fewer directories.

_What's not to love?_

## Organization matters to us...not the machine.

There aren't fewer functions, just fewer _containers_.

Everything is exposed to us every time we look at this file.

What about our test cases that defined boundaries between the sub-systems?

Do we open up some of the functions that we want to test in isolation?

### We like architecture diagrams because they provide a compact visual description of complicated engineered systems.

There is no substitute for quick exploratory analysis and pattern recognition.

![](./img/lein-topology-faad435.png)

Start at the lower left, in this case at leiningen.topology/topology. The flow of control starts here and moves in a depth-first search from the lower left to the upper right. At the end of each path, control returns to the caller and proceeds across the next outgoing edge.

Notice how the five namespaces in this library are arranged to be in close proximity.

The program is tree-like...more precisely, a directed acyclic graph.

Test coverage can be seen at a glance by following edges from the test vertices on the right.

If this diagram were static, it would be an infographic. Informative, perhaps, but ultimately prone to error as the system changes. What we'd like is the ability to _generate_ this diagram from underlying data. In fact, this diagram was mostly generated automatically, and the remaining manual steps could be automated as well.

### We don't like pushing pixels.

Laying out by hand isn't going to happen on every commit.

This is why diagrams get stale.

### End-to-end automation creates some interesting possibilities:

* Generate edge data.
* Generate a visualization based on the previous commit.
* Save the coordinates of existing vertices.
* Generate a visualization of the latest code based on recent changes, laying out by hand only the things that have changed.
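The steps above can be sketched in a few lines. This is a minimal illustration, not the workflow used to produce the diagram: the function names, the vertex names, and the coordinates are all hypothetical. It keeps saved positions for vertices that still exist and reports the ones that still need to be placed by hand.

```python
def vertices(edges):
    """All vertex names that appear in a collection of (source, target) edges."""
    return {v for edge in edges for v in edge}


def carry_over_layout(old_positions, new_edges):
    """Split the new graph's vertices into those with a saved position
    and those that still need to be laid out."""
    current = vertices(new_edges)
    kept = {v: xy for v, xy in old_positions.items() if v in current}
    to_place = sorted(current - set(kept))
    return kept, to_place


# Hypothetical positions saved from the previous commit's diagram...
old_positions = {
    "app.core/main": (0, 0),
    "app.db/query": (120, 40),
    "app.old/helper": (60, 90),  # removed in the latest commit
}
# ...and hypothetical edges generated from the latest commit.
new_edges = {
    ("app.core/main", "app.db/query"),
    ("app.core/main", "app.cache/lookup"),
}

kept, to_place = carry_over_layout(old_positions, new_edges)
print("reuse:", sorted(kept))
print("lay out by hand:", to_place)
```

Only `app.cache/lookup` needs attention here; everything else keeps its place, which also helps preserve the reader's mental map between commits.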

### The problems...

* People capture architecture as marketecture chartjunk.

* People create their own vocabularies to describe architecture, then try to build a business off of these ideas. UML is the most infamous of these vocabularies. Plenty of money and time has been wasted on that effort.

### The alternative: Architecture as Network

A dependency network can be represented as an edge list of the form "source,target,weight", e.g.:

```
topology.core/print-weighted-edges,clojure.core/defn,1
topology.core/print-weighted-edges,clojure.core/doseq,1
topology.core/print-weighted-edges,clojure.core/println,1
topology.core/print-weighted-edges,clojure.string/join,1
```

This raw data can be imported into visualization tools, organized as a graph, and treated like a database.

Equivalently, network diagrams created in tools like Cytoscape can be saved as network data.
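As a rough sketch of how little code it takes to load such an edge list, the following uses only the Python standard library. The helper names `read_edges` and `adjacency` are made up for this example, and the sample rows are the ones shown above.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the edge list above.
RAW = """\
topology.core/print-weighted-edges,clojure.core/defn,1
topology.core/print-weighted-edges,clojure.core/doseq,1
topology.core/print-weighted-edges,clojure.core/println,1
topology.core/print-weighted-edges,clojure.string/join,1
"""


def read_edges(lines):
    """Parse 'source,target,weight' rows into (source, target, weight) tuples."""
    return [(row[0], row[1], int(row[2])) for row in csv.reader(lines) if row]


def adjacency(edges):
    """Index edges as {source: {target: weight}} for quick lookups."""
    graph = defaultdict(dict)
    for source, target, weight in edges:
        graph[source][target] = weight
    return graph


edges = read_edges(io.StringIO(RAW))
graph = adjacency(edges)
print(sorted(graph["topology.core/print-weighted-edges"]))
```

Passing an open file instead of the `io.StringIO` wrapper would work the same way, since `csv.reader` accepts any iterable of lines.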

## Demo: Visualizing Code with Cytoscape

[lein-topology](https://github.com/testedminds/lein-topology) is a Leiningen plugin that generates the data for a Clojure project's function dependency structure matrix.

A demonstration of analyzing this network was done with a set of [Jupyter Notebooks in the sandbook](https://github.com/bobbyno/sandbook) project.

## Collect the Dots 

Functions aren't the only artifacts in your system...you might not be looking at this level at all.

You've seen topology create a function dependency graph from a Clojure repo. 

The same approach is generic: Sources of data to mine "edgewise" include git repos, Jenkins / CI, AWS infra like Route 53, and hub.docker.com.

## Analyze...Connect the Dots

In-memory graph analysis is appropriate for N < 1M.

[Yes, your data fits in RAM...](http://archive.is/http://yourdatafitsinram.com/) _(probably)_

[4clojure example in Gorilla REPL](http://viewer.gorilla-repl.org/view.html?source=github&user=bobbyno&repo=code_as_network&path=doc/foreclojure.clj)

## Visualize

_"A fundamental challenge in moving from the static to the dynamic is the need to respect, in the case of the latter, what is referred to as the user’s mental map._
    
_This term is used to describe the result of the process by which, **upon studying a given static network map, a user becomes familiar with it, interprets it, and navigates about it.**_
    
_Simply put, we would expect a certain amount of ‘stability’ across visualizations."_
    
Statistical Analysis of Network Data in R

## Visualizing code with a Dependency Structure Matrix (DSM)

`'s','t',1` => `[{'source': 's','target': 't','weight': '1'}]`

Let's try visualizing as a Dependency Structure Matrix.

```python
from sand.io import *

network_data = csv_to_edgelist('./data/lein-topology-faad435.csv')
```

```python
len(network_data)
```

```
204
```

```python
list(network_data)[:5]
```

```
[{'source': 'topology.dependencies/dependencies',
  'target': 'clojure.core/defn-',
  'weight': '1'},
 {'source': 'topology.dependencies/filtered',
  'target': 'clojure.core/filter',
  'weight': '1'},
 {'source': 'topology.dependencies-test/should-compute-fn-calls-in-namespace',
  'target': 'clojure.core/defn',
  'weight': '1'},
 {'source': 'example/test-when', 'target': 'clojure.core/cons', 'weight': '1'},
 {'source': 'leiningen.topology/topology',
  'target': 'org.clojure/clojure',
  'weight': '1'}]
```

Let's look at the Dependency Structure Matrix or DSM.

This is an $N \times N$ matrix representing the network of relationships `row -depends-> column`.

Try sorting the entries by outdegree. Notice how most of the lower half of the matrix is empty?

There is more to come in later work about [DSMs for software](https://en.wikipedia.org/wiki/Design_structure_matrix).
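To make the `row -depends-> column` convention concrete, here is a small standalone sketch that builds the matrix from an edge list and orders rows and columns by outdegree. It does not use `sand.matrix`; the function name `dsm` and the toy edges are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical edges, written as (source, target): source depends on target.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]


def dsm(edges):
    """Return (labels, matrix) where matrix[i][j] == 1 if labels[i] depends on
    labels[j]. Labels are sorted by outdegree, highest first, ties by name."""
    outdegree = Counter(source for source, _ in edges)
    names = {v for edge in edges for v in edge}
    labels = sorted(names, key=lambda v: (-outdegree[v], v))
    index = {v: i for i, v in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return labels, matrix


labels, matrix = dsm(edges)
for label, row in zip(labels, matrix):
    print(label, row)
```

In this toy example every dependency lands above the diagonal, so the lower half is completely empty. Real code bases are messier, but an entry below the diagonal is a useful prompt to ask why a low-outdegree function calls a high-outdegree one.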

```python
from sand.matrix import *

matrix(network_data, 800)
```

_screenshot of the iframe content in the cell above_

<img src='./img/matrix.png' width=600/>

### DSM Visualization Exercise Ideas

* We would ideally like to order this by group. In this case, namespace is a reasonable way to group. There are many potential options.

* Coloring cells could be done in a more interesting way.

* Abbreviating columns would be useful.
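Two of these ideas, grouping by namespace and abbreviating labels, could be prototyped along the following lines. This is only a sketch: the helper names are invented here, and it assumes vertex names follow the `namespace/function` shape seen in the edge data.

```python
from collections import defaultdict


def namespace(vertex):
    """'topology.dependencies/filtered' -> 'topology.dependencies'."""
    return vertex.split("/", 1)[0]


def group_by_namespace(vertices):
    """Map each namespace to its vertices, sorted by name."""
    groups = defaultdict(list)
    for v in sorted(vertices):
        groups[namespace(v)].append(v)
    return dict(groups)


def abbreviate(vertex):
    """Shorten each namespace segment to its first letter:
    'topology.dependencies/filtered' -> 't.d/filtered'."""
    ns, _, name = vertex.partition("/")
    short = ".".join(part[0] for part in ns.split("."))
    return f"{short}/{name}" if name else vertex


# Vertex names taken from the sample output earlier in this section.
vertices = [
    "topology.dependencies/filtered",
    "clojure.core/filter",
    "topology.dependencies/dependencies",
    "clojure.core/defn",
]

groups = group_by_namespace(vertices)
# Row/column order for the matrix: namespaces in sorted order, members together.
ordered = [v for ns in sorted(groups) for v in groups[ns]]
print([abbreviate(v) for v in ordered])
```

Abbreviations can collide when two namespaces share initials, so a real version would need a fallback such as keeping the last segment in full.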

## Act: Refactor...Restructure


"_With the adoption of a graph-based framework for representing relational data in network analysis we inherit a rich vocabulary for discussing various important concepts related to graphs._

_...questions of interest can often be re-phrased in a useful manner as questions regarding some aspect of the structure or characteristics of a corresponding network graph._"
    
Statistical Analysis of Network Data in R

### Descriptive architecture based on observations...

### ...over prescriptive architecture based on prognostications.

Given the level of complication, it's tough to know a priori what you are about to create.

Reverse engineer the structure of an existing system, then bring in structural analysis to your red-green-refactor cycle.

## Other applications of software architecture networks...

### Orienting devs and ops...where are things in this system?

**Form follows function.**

* Collections of vertices: Communities/Clusters => Packages/Namespaces

* How do we describe the flow of control through the program? Does the package structure reflect a description of that flow of control?

### YAGNI: You aren't gonna need it

Find all nodes with no incoming edges that aren't in a certain namespace (like the one with the main method).

These are candidates for deletion.
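A minimal sketch of that query over an edge list, with hypothetical vertex names and a made-up helper `deletion_candidates`. One caveat: a function with no edges at all never shows up in an edge list, so this only finds unused functions that themselves call something.

```python
def namespace(vertex):
    """'app.util/unused' -> 'app.util'."""
    return vertex.split("/", 1)[0]


def deletion_candidates(edges, entry_namespaces):
    """Vertices with no incoming edges, excluding known entry points."""
    sources = {s for s, _ in edges}
    targets = {t for _, t in edges}
    return sorted(
        v for v in sources - targets
        if namespace(v) not in entry_namespaces
    )


# Hypothetical project: a main namespace, a test namespace, and some library code.
edges = [
    ("app.main/-main", "app.core/run"),
    ("app.core/run", "app.util/fmt"),
    ("app.util/unused", "app.util/fmt"),
    ("app.core-test/run-test", "app.core/run"),
]

candidates = deletion_candidates(edges, entry_namespaces={"app.main", "app.core-test"})
print(candidates)
```

Test namespaces are treated as entry points here because nothing calls a test function directly; whether code reachable only from tests should also be flagged is a judgment call.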

### SRP enforcement

Which namespaces are the consumers of any other given namespace?

Does the provider expose a consistent interface to consumers?

### Which of the containers are "hidden", visible from only a small number of consumers? 

`topology.symbols`, for example, is hidden behind `topology.dependencies`. The entire implementation could be swapped out without impacting the rest of the program if the contract with `topology.dependencies` is maintained. 

**This is encapsulation in network terms, which can be mathematically defined!**
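One possible way to express "hidden" as a computation is to count, for each namespace, how many other namespaces call into it. The sketch below uses invented helper names and hypothetical `app.*` edges rather than real lein-topology output, and the threshold of one consumer is an arbitrary choice.

```python
from collections import defaultdict


def namespace(vertex):
    """'app.deps/collect' -> 'app.deps'."""
    return vertex.split("/", 1)[0]


def consumers(edges):
    """Map each namespace to the set of *other* namespaces that call into it."""
    result = defaultdict(set)
    for source, target in edges:
        src_ns, dst_ns = namespace(source), namespace(target)
        if src_ns != dst_ns:
            result[dst_ns].add(src_ns)
    return dict(result)


def hidden(edges, max_consumers=1):
    """Namespaces visible from at most `max_consumers` other namespaces."""
    return sorted(
        ns for ns, users in consumers(edges).items()
        if len(users) <= max_consumers
    )


# Hypothetical edges: app.symbols is only ever reached through app.deps.
edges = [
    ("app.cli/main", "app.deps/collect"),
    ("app.report/print", "app.deps/collect"),
    ("app.deps/collect", "app.symbols/resolve"),
    ("app.deps/filtered", "app.symbols/resolve"),
]

print(consumers(edges))
print(hidden(edges))
```

The same `consumers` map also answers the SRP question of which namespaces consume a given provider.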

### Root cause analysis

Paths from a temporary root node to the node where a problem is observed. 

pathfinding + changelogs.
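As an illustration, the sketch below enumerates simple paths with a depth-first search and then intersects them with a set of recently changed vertices. The graph, the vertex names, and the `changed` set (standing in for whatever a changelog would provide) are all hypothetical.

```python
def all_paths(graph, start, goal, path=()):
    """Yield every simple path from start to goal as a tuple of vertices."""
    path = path + (start,)
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, ()):
        if nxt not in path:  # skip vertices already on this path (avoids cycles)
            yield from all_paths(graph, nxt, goal, path)


# Hypothetical call graph: vertex -> vertices it calls.
graph = {
    "request": ["handler"],
    "handler": ["parse", "db"],
    "parse": ["db"],
    "db": [],
}

# "request" is the temporary root; the problem is observed at "db".
paths = list(all_paths(graph, "request", "db"))

# Hypothetical changelog: vertices touched by recent commits.
changed = {"parse", "logger"}
suspects = sorted({v for p in paths for v in p} & changed)

for p in paths:
    print(" -> ".join(p))
print("recently changed on a path:", suspects)
```

Enumerating all simple paths can blow up on large, densely connected graphs, so a shortest-path search may be a more practical starting point there.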

### Change propagators: High in-degree and out-degree centrality

"Change agents make systems brittle because they increase the likelihood that the effect of a change will propagate to a disproportionately large portion of the system."
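A rough way to surface such vertices is to count incoming and outgoing edges and keep the ones that score high on both. The sketch uses raw degree counts rather than normalized centrality, and the function names, edges, and threshold are assumptions made for the example.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for (source, target) edges."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree


def propagators(edges, threshold):
    """Vertices whose in-degree and out-degree both reach the threshold."""
    in_degree, out_degree = degrees(edges)
    return sorted(
        v for v in set(in_degree) & set(out_degree)
        if in_degree[v] >= threshold and out_degree[v] >= threshold
    )


# Hypothetical edges: "hub" is called by two vertices and calls two others.
edges = [
    ("a", "hub"),
    ("b", "hub"),
    ("hub", "x"),
    ("hub", "y"),
    ("a", "x"),
]

print(propagators(edges, threshold=2))
```

Here `x` also has two incoming edges, but it calls nothing, so a change to it has nowhere further to travel downstream and it is not flagged.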

### This doesn't mean BDUF is back in style...

...but equally, NDUF (No Design Up Front) and NDE (No Design Ever) aren't cool anymore now that you have a powerful architecture model.

* Start with the simplest structure that can possibly work.

* Once desirable structural patterns are known amongst the team, you can start to write tests that express these rules.

* Techniques like TDD and BDD are design techniques, not only verification steps. The structural modeling allows you to visualize and navigate the structure of your code, however it was produced. Given the assertion that TDD / BDD result in "better" designs, network modeling may provide a means for objective evidence.
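A structural rule can be written as an ordinary test over the edge data. The following sketch assumes a hypothetical rule, "the UI never calls the database directly", along with invented namespaces and helper names; it is one possible shape for such a test, not an established convention.

```python
def namespace(vertex):
    """'app.ui/render' -> 'app.ui'."""
    return vertex.split("/", 1)[0]


def violations(edges, forbidden):
    """Edges whose (source namespace, target namespace) pair is forbidden."""
    return [
        (source, target)
        for source, target in edges
        if (namespace(source), namespace(target)) in forbidden
    ]


# Hypothetical edges generated from the current commit.
edges = [
    ("app.ui/render", "app.service/load"),
    ("app.service/load", "app.db/query"),
]

# Hypothetical team rule: app.ui must go through app.service.
FORBIDDEN = {("app.ui", "app.db")}


def test_ui_does_not_call_db_directly():
    assert violations(edges, FORBIDDEN) == []


test_ui_does_not_call_db_directly()
print("structural rules hold")
```

Run under a test runner, a function like this would fail the build as soon as someone adds a direct `app.ui` to `app.db` call, which turns the architecture diagram into something that is checked rather than merely drawn.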

### Don't mix networks in the same data set (unless you know what you're doing)

* We've been exploring tools for network analysis. Property graph db vendors will encourage you to put everything in a graph, then query what you need. The downside is more complicated schema management.

While you _can_ use a graph database in lieu of RDBMS, it's not entirely clear that you _should_.

Be clear about what your vertices and edges represent.

### Graph Databases?

At no point until now have I said anything about 'graph databases' like Neo4j and Titan. These are persistence stores. They offer a query language, scalability, transactional support, and security, along with other concerns found in an RDBMS / NoSQL persistence tier.

Use the simplest structure that can possibly work. Given that yourdatafitsinram, you can likely go very far with an in-memory approach that reads in all the data upon system startup. If you're at a point where you _know_ you need to solve the concerns a graph database can handle, then everything we've seen today still applies to the analysis steps.

The graph db vendors don't often spend much time on network analysis ideas beyond some of the basics. Most of what you'll find from them involves using the query language, or converting their particular graph representation into others like `networkx` or `igraph`.


## Homework: Dependency Graphing Exercise

Take a library you'd like to better understand...probably one you want to change.

Start with the output, then walk the dependency graph. 

**You can start with just package / namespace / class level dependencies**

Create the network manually in Cytoscape. We can export the data as a node list and an edge list when we're done.

You might even find it easier to work in text, building out the dependencies edgewise, then importing into Cytoscape when you're done.
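If you take the text route, a few lines of Python are enough to turn hand-collected dependencies into a CSV file with a header row, which can then be imported as a network table. The package names, the `write_edges` helper, and the column names are illustrative assumptions; the columns are mapped to source and target during import.

```python
import csv
import io


def write_edges(edges, out):
    """Write (source, target, weight) rows with a header line."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)


# Hypothetical package-level dependencies collected by reading the code.
edges = [
    ("app.cli", "app.core", 1),
    ("app.core", "app.db", 1),
    ("app.core", "app.util", 1),
]

buffer = io.StringIO()
write_edges(edges, buffer)
print(buffer.getvalue())
```

To write a real file, pass `open("edges.csv", "w", newline="")` instead of the in-memory buffer.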

Now try modifying the `lein-topology` notebook to work with your data.

### A "Software Architecture Network Data" Community...SAND?

Where should we continue the conversation?

_I think we need a Google Group._

In the meantime:

Cytoscape has a thriving app community. 

http://www.slideshare.net/keiono/introduction-to-biological-network-analysis-and-visualization-with-cytoscape-part1

cytoscape-discuss@googlegroups.com

cytoscape-help@googlegroups.com

[SOCNET](https://insna.org/socnet.html)

**And you can find me @bobbynorton and bobby@testedminds.com.**