# University of Illinois Data Mining Specialization
## Course 01: Data Visualization
*2018-05-07 to 2018-05-13 - Week 03*

### Programming Assignment

Find some network data that you think is suitable and that you would like to visualize. Here are some sites that provide links to a wide variety of different graph/network datasets:

* [Stanford Large Network Dataset Collection](http://snap.stanford.edu/data/index.html)
* [UCI Network Data Repository](https://networkdata.ics.uci.edu/)

Choose a visualization platform and parse the data into a format suitable for the tools you will use. You must upload an image of your visualization for peer evaluation.

In addition to your visualization, please include a paragraph that helps explain your submission. A few questions that your paragraph could answer include:

* What is the data set that you chose? Why?
* Did you use a subset of the data? If so, what was it?
* Are there any particular aspects of your visualization to which you would like to bring attention?
* What do you think the data and your visualization show?

#### Grading Rubric

| Criteria | Poor (1–2 points) | Fair (3 points) | Good (4 points) | Great (5 points) |
| --- | --- | --- | --- | --- |
| *Proximate Layout*<sup>1</sup> | Relationship between items cannot be discerned because of poor layout. | Major problems with the layout, leading to many long edges and/or overlaps that distract from the data. | Minor problems with the layout, resulting in long edges or unnecessary overlaps in objects or edges. | Related items are placed near each other and intersections of visualization elements are not unnecessarily distracting. |
| *Design of the Chart*<sup>2</sup> | Relationship between items cannot be discerned because of poor element and/or design choices. | Major problems with some elements and/or design choices that interfere with the display of the data. | Minor problems with some elements and/or design choices that distract from the display of the data. | Visualization effectively uses elements and design to display the data. |
| *Content*<sup>3</sup> | Misleading | Boring | Not boring | Interesting |

<sup>1</sup>How well are related items placed near each other? Do the edges cross or do items overlap when perhaps they do not need to? Are the crossings distracting?
<br><sup>2</sup>Does the visualization effectively utilize the assignment of variables to elements and design of a visualization described in Week 2?
<br><sup>3</sup>How interesting is the result? Does this represent an interesting choice of data and/or an interesting way to display the data?

#### Math Overflow Comments to Answers

I chose to graph Math Overflow's data for user comments to answers. The [Stanford Network Analysis Project](http://snap.stanford.edu/index.html) ("SNAP") includes multiple datasets from Stack Overflow and its subject-specific offshoots. I use Stack Overflow daily, and even more when researching these programming assignments. I would have picked a [dataset from Stack Overflow](http://snap.stanford.edu/data/sx-stackoverflow.html), but they're huge! The smallest has ~1.6 million user nodes.

I may *eventually* explore such a large network. For now, [the Math Overflow datasets](http://snap.stanford.edu/data/sx-mathoverflow.html) have more manageable node counts, ranging from 10k to 25k. I further narrowed the size by examining comments-to-answers (~13k nodes and ~195k edges) instead of the larger full dataset (~25k nodes and ~500k edges).

In [32]:
# Imports
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import networkx as nx
from itertools import islice
import datetime as dte

# Set Plotly to display in notebook
init_notebook_mode(connected=True)

In [None]:
# Define helper functions
def take(n, iterable):
    """Return first n items of an iterable as a list."""
    return list(islice(iterable, n))

#### Plot Math Overflow Comments to Answers

In [5]:
# Load data and inspect
dataMathExCtoA = pd.read_table("input/sx-mathoverflow-c2a.txt", sep=" ", names=["tail", "head", "unix-ts"])
print("Shape: %s" % (dataMathExCtoA.shape, ))
print(dataMathExCtoA.head(5))

Shape: (195330, 3)
   tail  head     unix-ts
0     3     1  1254206196
1     1     1  1254207602
2     2     1  1254249757
3     1    25  1254259818
4     1    22  1254273152


In [8]:
# Convert Unix timestamp to datetime
dataMathExCtoA["datetime"] = pd.to_datetime(dataMathExCtoA["unix-ts"], unit="s", origin="unix")
print(dataMathExCtoA.head(5))

   tail  head     unix-ts            datetime
0     3     1  1254206196 2009-09-29 06:36:36
1     1     1  1254207602 2009-09-29 07:00:02
2     2     1  1254249757 2009-09-29 18:42:37
3     1    25  1254259818 2009-09-29 21:30:18
4     1    22  1254273152 2009-09-30 01:12:32
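
As a sanity check on the conversion above, the first timestamp can be decoded with only the standard library. This is a minimal sketch rather than part of the original notebook; it assumes the `unix-ts` values are seconds since the Unix epoch in UTC, which is how `unit="s"` is interpreted.

```python
from datetime import datetime, timezone

# First "unix-ts" value from the data preview above
first_ts = 1254206196

# Interpret it as seconds since 1970-01-01 00:00:00 UTC
decoded = datetime.fromtimestamp(first_ts, tz=timezone.utc)
print(decoded.strftime("%Y-%m-%d %H:%M:%S"))  # 2009-09-29 06:36:36
```

The result matches the first row of the `datetime` column.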


In [27]:
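# Subset to 2016 to simplify initial graph development
# (compare against pd.Timestamp; recent pandas versions reject comparisons with datetime.date)
dataMathExCtoA2016 = dataMathExCtoA[(dataMathExCtoA["datetime"] >= pd.Timestamp("2016-01-01")) &
                                    (dataMathExCtoA["datetime"] < pd.Timestamp("2017-01-01"))]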
print("Shape 2016: %s" % (dataMathExCtoA2016.shape, ))

Shape 2016: (3911, 4)


In [28]:
# Create and populate 2016 directed graph
graphMathExCtoA2016 = nx.DiGraph()
for edge in dataMathExCtoA2016.itertuples():
    graphMathExCtoA2016.add_edge(edge.tail, edge.head, object={"datetime": edge.datetime})
print("Graph Dimensions 2016: (%i nodes, %i edges)" % (graphMathExCtoA2016.number_of_nodes(), \
                                                       graphMathExCtoA2016.number_of_edges()))

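# Why are there fewer edges than rows in the dataset?
# nx.DiGraph stores at most one edge per (tail, head) pair, so repeated rows for
# the same pair collapse into a single edge (the last row's attributes win).
# nx.MultiDiGraph would keep every row as its own edge.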

Graph Dimensions 2016: (1281 nodes, 2121 edges)
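
The gap between 3911 rows and 2121 edges can be reproduced without NetworkX. The sketch below uses a small made-up edge list (hypothetical values, not taken from the dataset) and counts distinct `(tail, head)` pairs, which is what a simple directed graph ends up storing.

```python
# Hypothetical (tail, head) rows; the pair (3, 1) appears twice
rows = [(3, 1), (1, 1), (2, 1), (3, 1), (1, 25)]

# A simple directed graph keeps one edge per ordered pair
distinct_edges = set(rows)

# Every user id that appears on either end of an edge is a node
nodes = {user for pair in rows for user in pair}

print(len(rows), len(distinct_edges), len(nodes))  # 5 4 4
```

Five rows yield four edges because the repeated pair is stored once. Keeping the repeats would require a multigraph or an explicit count stored as an edge attribute.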


In [72]:
# Calculate node position for drawing using the Fruchterman-Reingold force-directed algorithm
nodePos = nx.spring_layout(graphMathExCtoA2016)

In [73]:
# Quick peek at the result
print(take(5, nodePos.items()))

[(46290, array([-0.01569599, -0.01622575])), (28128, array([-0.07733241, -0.07768194])), (4312, array([-0.08468484, -0.03861823])), (613, array([0.1620408 , 0.18577061])), (84747, array([-0.2462773 ,  0.38009689]))]
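
To give a feel for what a Fruchterman-Reingold layout does, here is a simplified pure-Python sketch. It is an illustration under stated assumptions, not NetworkX's implementation: the function name `spring_layout_sketch`, the starting temperature, and the linear cooling schedule are all choices made for this example. Every pair of nodes repels with force k²/d, every edge attracts with force d²/k, and the maximum step shrinks each iteration.

```python
import math
import random

def spring_layout_sketch(nodes, edges, iterations=50, seed=0):
    """Simplified Fruchterman-Reingold layout (illustrative only)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    k = math.sqrt(1.0 / len(nodes))        # ideal distance for a unit-area frame
    temperature = 0.1                      # maximum step size per iteration
    cooling = temperature / (iterations + 1)

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes: magnitude k^2 / d
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                disp[u][0] += dx / d * f
                disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f
                disp[v][1] -= dy / d * f

        # Attraction along each edge: magnitude d^2 / k (self-loops ignored)
        for u, v in edges:
            if u == v:
                continue
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = d * d / k
            disp[u][0] -= dx / d * f
            disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f
            disp[v][1] += dy / d * f

        # Move each node along its net force, capped by the temperature
        for n in nodes:
            length = max(math.hypot(*disp[n]), 1e-9)
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature -= cooling

    return pos

# Lay out a four-node path graph
layout = spring_layout_sketch([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(layout)
```

The all-pairs repulsion loop is quadratic in the number of nodes, which is one reason subsetting to 2016 keeps the layout step quick.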


In [93]:
#
# Draw graph; based on example in https://plot.ly/python/network-graphs/
#
# Collect node positions, in degrees, and hover text (sorted by ascending in degree)
nodeX, nodeY, nodeInDegree, nodeText = [], [], [], []
for node, inDegree in sorted(graphMathExCtoA2016.in_degree(), key=lambda t: t[1]):
    # Append node position
    x, y = nodePos[node]
    nodeX.append(x)
    nodeY.append(y)

    # Set in degree for color scale and hover text
    nodeInDegree.append(inDegree)
    nodeText.append("in degree: " + str(inDegree))

# Create scatter plot of nodes; plain lists and dicts replace the deprecated
# go.Marker class and the in-place appends that newer Plotly versions reject
scatterNodes = go.Scatter(
    x = nodeX,
    y = nodeY,
    text = nodeText,
    mode = "markers",
    hoverinfo = "text",
    marker = dict(
        showscale = True,
        colorscale = "Greens",
        reversescale = True, # For largest to smallest
        color = nodeInDegree,
        size = 7,
        colorbar = dict(
            thickness = 15,
            title = dict(text = "Node In Degree", side = "right"),
            xanchor = "left"
        ),
        line = dict(width = 0.5)
    )
)
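
The in degree used for the color scale is simply the number of distinct edges pointing at a node. As a minimal sketch of that calculation, the example below recomputes it with the standard library for the five `(tail, head)` pairs shown in the first data preview; those rows are from 2009, so they only illustrate the counting and are not part of the 2016 subset.

```python
from collections import Counter

# (tail, head) pairs from the first five rows of the data preview
rows = [(3, 1), (1, 1), (2, 1), (1, 25), (1, 22)]

# Count each distinct pair once, keyed by the node the edge points to
in_degree = Counter(head for _tail, head in set(rows))

print(sorted(in_degree.items()))  # [(1, 3), (22, 1), (25, 1)]
```

Node 1 has in degree 3 because the self-loop `(1, 1)` counts as an incoming edge, which is also how `DiGraph.in_degree` treats self-loops.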

In [96]:
# Collect edge endpoints; the None after each pair separates one edge from the next
edgeX, edgeY = [], []
for edge in graphMathExCtoA2016.edges():
    xTail, yTail = nodePos[edge[0]]
    xHead, yHead = nodePos[edge[1]]
    edgeX += [xTail, xHead, None]
    edgeY += [yTail, yHead, None]

# Create scatter plot of edges ("#333" replaces the invalid color string "#3")
scatterEdges = go.Scatter(
    x = edgeX,
    y = edgeY,
    text = [],
    mode = "lines",
    hoverinfo = "text",
    line = dict(width=0.25, color="#333")
)
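
The `None` entries appended after each edge are what keep the edges visually separate: all edges live in a single `lines` trace, and with the default `connectgaps` setting Plotly breaks the line wherever a coordinate is missing. The sketch below isolates that step with a hypothetical helper, `edge_coordinates`, and made-up positions, so the resulting lists can be inspected directly.

```python
def edge_coordinates(positions, edges):
    """Flatten edges into x and y lists, with None marking the end of each edge."""
    xs, ys = [], []
    for tail, head in edges:
        x_tail, y_tail = positions[tail]
        x_head, y_head = positions[head]
        xs += [x_tail, x_head, None]
        ys += [y_tail, y_head, None]
    return xs, ys

# Made-up positions for three nodes and two edges
positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.0, 1.0)}
edges = [("a", "b"), ("b", "c")]

xs, ys = edge_coordinates(positions, edges)
print(xs)  # [0.0, 1.0, None, 1.0, 1.0, None]
print(ys)  # [0.0, 0.0, None, 0.0, 1.0, None]
```

Each edge contributes three entries per axis, so one trace can hold thousands of segments without creating a separate trace per edge.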

In [97]:
# Create figure and plot (plain list and dicts replace the deprecated go.Data, go.XAxis, and go.YAxis)
figure = go.Figure(
    data = [scatterEdges, scatterNodes],
    layout = go.Layout(
        title = "Math Overflow Comments to Answers - 2016",
        showlegend = False,
        hovermode = "closest",
        xaxis = dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis = dict(showgrid=False, zeroline=False, showticklabels=False)
    )
)
iplot(figure)
plot(figure, filename="output/math-exchange-comments-to-answers-2016.html")

'file:///mnt/d/GitHub/uoi-coursera-data-mining/crs01/wk03/programming-assignment/output/math-exchange-comments-to-answers-2016.html'