# GraphLab Modeling Examples
from https://dato.com/learn/userguide/graph_analytics/graph_analytics.html
December 30, 2015

To illustrate some of these methods, we'll use a previously created SGraph where vertices represent Wikipedia articles about US businesses and edges represent hyperlinks between articles. This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License (more details here: http://en.wikipedia.org/wiki/Wikipedia:Copyrights). It can be downloaded from Dato's public datasets bucket on Amazon S3. The remaining methods are left for the reader to explore in the exercises at the end of the chapter.

In [1]:
import os
data_file = 'US_business_links'

In [2]:
import graphlab as gl

In [8]:
os.getcwd()

'C:\\Users\\kesj\\Documents\\IPython Notebooks\\GeneralDataAnalysis\\graphLabExamples'

In [9]:
os.chdir('../../../../data/graphlabData/')

In [10]:
if os.path.exists(data_file):
    sg = gl.load_sgraph(data_file)
else:
    url = 'http://s3.amazonaws.com/dato-datasets/' + data_file
    sg = gl.load_sgraph(url)
    sg.save(data_file)

[INFO] 1451494970 : INFO:     (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to C:\Anaconda\envs\dato-env\lib\site-packages\certifi\cacert.pem
1451494970 : INFO:     (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
This trial license of GraphLab Create is assigned to aj.rader.kesj@statefarm.com and will expire on February 05, 2016. Please contact trial@dato.com for licensing options or to request a free non-commercial license for personal or academic use.

[INFO] Start server at: ipc:///tmp/graphlab_server-26016 - Server binary: C:\Anaconda\envs\dato-env\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\kesj\AppData\Local\Temp\graphlab_server_1451494970.log.0
[INFO] GraphLab Server Version: 1.7.1


PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/US_business_links/dir_archive.ini to C:/Users/kesj/AppData/Local/Temp/graphlab-KESJ/26016/eea28cbd-613a-490d-b2a1-b7f4bcb3bf3b.ini
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/US_business_links/objects.bin to C:/Users/kesj/AppData/Local/Temp/graphlab-KESJ/26016/33be2280-3ed8-4cec-b6e6-38e2c2aec6b7.bin
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/US_business_links/m_74b0dc51.frame_idx to C:/Users/kesj/AppData/Local/Temp/graphlab-KESJ/26016/c5a8e872-a199-40ce-aac0-55fff7fd8ae1.frame_idx
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/US_business_links/m_74b0dc51.sidx to C:/Users/kesj/AppData/Local/Temp/graphlab-KESJ/26016/56df59d6-d2c5-4e46-bdac-6b4f7c897e5b.sidx
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/US_business_links/m_19495cff.frame_idx to C:/Users/kesj/AppData/Local/Temp/graphlab-KESJ/26016/4db1cc34-1ea4-4c69-a532-2e82b242caf7.frame_idx
PROGRESS: Downloading http

In [22]:
sg.summary?

In [None]:
print sg.summary()

In [26]:
sg.get_neighborhood(ids=['State Farm','Allstate','Progressive','USAA']).show()

Canvas is updated and available in a tab in the default browser.


## PageRank
PageRank is an iterative algorithm to compute the most influential nodes in a network. In each iteration, a vertex's influence is measured as the sum of influence of the nodes that point at the vertex. Each node's score is updated until the scores converge, or until the user-specified maximum number of iterations is reached.
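
The iteration described above can be sketched in plain Python on a toy edge list. This is a minimal illustration, not GraphLab Create's implementation: the function name `pagerank`, the `reset_probability` and `threshold` parameters, the starting score of 1.0 per vertex, and the choice to leave scores unnormalized are all assumptions made for the sketch.

```python
from collections import defaultdict


def pagerank(edges, reset_probability=0.15, max_iterations=10, threshold=1e-2):
    """Toy PageRank over a list of directed (source, destination) edges.

    Assumption: scores start at 1.0 and are not normalized to sum to 1.
    """
    out_degree = defaultdict(int)
    in_neighbors = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbors[dst].append(src)
        vertices.update((src, dst))

    rank = {v: 1.0 for v in vertices}
    for _ in range(max_iterations):
        new_rank = {}
        for v in vertices:
            # Each in-neighbor shares its score equally among its out-links.
            incoming = sum(rank[u] / out_degree[u] for u in in_neighbors[v])
            new_rank[v] = reset_probability + (1.0 - reset_probability) * incoming
        # L1 change between consecutive iterations, used as the stopping test.
        l1_change = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if l1_change < threshold:
            break
    return rank


# Three vertices that all link to "a": "a" collects the most influence.
star = pagerank([("b", "a"), ("c", "a"), ("d", "a")])
print(max(star, key=star.get))
```

Stopping either on a small L1 change or on the iteration cap corresponds to the two termination conditions named in the paragraph above.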

As with other GraphLab Create methods, the model object is constructed with the create function. The summary() method provides a snapshot of the result, and the list_fields() method gives the attributes of the model that can be queried.

In [12]:
pr = gl.pagerank.create(sg, max_iterations=10)
print pr.summary()

PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 371040                |
PROGRESS: | 2         | 273072                |
PROGRESS: | 3         | 161198                |
PROGRESS: | 4         | 87525                 |
PROGRESS: | 5         | 53120.5               |
PROGRESS: | 6         | 29993.1               |
PROGRESS: | 7         | 18550.2               |
PROGRESS: | 8         | 10595.4               |
PROGRESS: | 9         | 6661.93               |
PROGRESS: | 10        | 3824.08               |
PROGRESS: +-----------+-----------------------+
Class                                   : PagerankModel

Graph
-----
num_edges                               : 517127
num_vertices                            : 233121

Results
-------
graph                                   : SGraph. See m['graph']
change in last

In [13]:
print pr['training_time']

2.201


In [14]:
pr_out = pr['pagerank']
print pr_out.topk('pagerank',k=10)

+-------------------------------+---------------+---------------+
|              __id             |    pagerank   |     delta     |
+-------------------------------+---------------+---------------+
| American Broadcasting Company | 3050.12285481 | 171.095519607 |
|           Microsoft           | 1640.93294801 | 79.4193105615 |
|           DC Comics           | 1623.76580168 |  231.05676926 |
|       Paramount Pictures      | 1341.29716993 | 74.7486642804 |
|            Facebook           | 1218.69295972 | 23.8510621222 |
|             Google            | 1180.48509934 |  41.271924661 |
|       Ford Motor Company      | 1156.31695182 | 84.5552251247 |
|            Twitter            | 1074.30512161 | 16.3932159159 |
|       Columbia Pictures       | 921.042912728 | 55.2951992749 |
|    The Walt Disney Company    | 878.301063411 | 74.9881435439 |
+-------------------------------+---------------+---------------+
[10 rows x 3 columns]



In [39]:
pr_out[pr_out['__id'] == 'State Farm']

__id,pagerank,delta
State Farm,1.56287782312,0.126397550662


### Triangle counting

The number of triangles in a vertex's immediate neighborhood is a measure of the "density" of that neighborhood. In both of the figures below, vertex A has three immediate neighbors, ignoring edge directions. In the top figure, none of A's neighbors is connected to any other neighbor, indicating a very loosely connected network. In the bottom figure, by contrast, all three neighbors are connected to each other, indicating a tightly connected network.
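
The two situations described above can be reproduced with a small pure-Python sketch. It is only an illustration of the idea, not the toolkit's algorithm: the helper name `triangle_counts` and the toy vertex labels are hypothetical, and edge directions are ignored by storing each edge in both directions.

```python
from collections import defaultdict
from itertools import combinations


def triangle_counts(edges):
    """Count, for every vertex, the triangles it takes part in.

    Edges are treated as undirected; self-loops and duplicates are ignored.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src != dst:
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    counts = {}
    for v, nbrs in neighbors.items():
        # A triangle at v is a pair of v's neighbors that are themselves linked.
        counts[v] = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return counts


# Loosely connected: A's three neighbors are not linked to each other.
loose = triangle_counts([("A", "B"), ("A", "C"), ("A", "D")])

# Tightly connected: every pair of A's neighbors is also linked.
tight = triangle_counts([("A", "B"), ("A", "C"), ("A", "D"),
                         ("B", "C"), ("C", "D"), ("B", "D")])

print(loose["A"], tight["A"])
```

Because every triangle is counted once at each of its three corners, the total number of triangles in the graph is the sum of the per-vertex counts divided by three.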

In [15]:
tri = gl.triangle_counting.create(sg)
print tri.summary()

PROGRESS: Initializing vertex ids.
PROGRESS: Removing duplicate (bidirectional) edges.
PROGRESS: Counting triangles...
PROGRESS: Finished in 7.028 secs.
PROGRESS: Total triangles in the graph : 171968
Class                                   : TriangleCountingModel

Graph
-----
num_edges                               : 517127
num_vertices                            : 233121

Results
-------
graph                                   : SGraph. See m['graph']
total number of triangles               : 171968
vertex triangle count                   : SFrame. See m['triangle_count']

Metrics
-------
training time (secs)                    : 7.073

Queryable Fields
----------------
graph                                   : A new SGraph with the triangle count as a vertex property.
num_triangles                           : Total number of triangles in the graph.
triangle_count                          : An SFrame with the triangle count for each vertex.
training_time                           : T

In this dataset, the companies with large influence in the Wikipedia article network (as measured by PageRank) overlap heavily with the companies that have the largest and densest neighborhoods (as measured by triangle count), but there are notable differences. ABC has only the sixth most triangles despite having the highest PageRank, while Microsoft ranks very high by both statistics, suggesting the Microsoft article is slightly more central in the network.

In [16]:
tri_out = tri['triangle_count']
print tri_out.topk('triangle_count',k=10)

+-------------------------------+----------------+
|              __id             | triangle_count |
+-------------------------------+----------------+
|           Microsoft           |     21447      |
|             Google            |     15491      |
|            Facebook           |     14200      |
|              IBM              |     11716      |
|       Paramount Pictures      |     10547      |
| American Broadcasting Company |     10514      |
|            Twitter            |      9219      |
|       Target Corporation      |      8474      |
|        Delta Air Lines        |      8195      |
|             Intel             |      7952      |
+-------------------------------+----------------+
[10 rows x 2 columns]
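
One way to make the comparison between the two top-10 tables concrete is to line up the names that appear on both. The sketch below copies the two lists from the outputs above; the helper `rank_table` is a hypothetical name introduced only for this illustration.

```python
# Names copied from the PageRank and triangle-count top-10 outputs above.
pagerank_top = [
    "American Broadcasting Company", "Microsoft", "DC Comics",
    "Paramount Pictures", "Facebook", "Google", "Ford Motor Company",
    "Twitter", "Columbia Pictures", "The Walt Disney Company",
]
triangle_top = [
    "Microsoft", "Google", "Facebook", "IBM", "Paramount Pictures",
    "American Broadcasting Company", "Twitter", "Target Corporation",
    "Delta Air Lines", "Intel",
]


def rank_table(first, second):
    """Return (name, rank_in_first, rank_in_second) for names on both lists."""
    second_rank = {name: i + 1 for i, name in enumerate(second)}
    return [(name, i + 1, second_rank[name])
            for i, name in enumerate(first) if name in second_rank]


shared = rank_table(pagerank_top, triangle_top)
for name, pr_rank, tri_rank in shared:
    print("%-30s PageRank #%d, triangles #%d" % (name, pr_rank, tri_rank))
```

Six of the ten names appear on both lists, with ABC dropping from first to sixth and Microsoft moving from second to first.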



### Single-source shortest path

A path in a graph is a sequence of vertices in which each consecutive pair is connected by an edge. The single-source shortest path problem is to find the shortest path from a user-specified source vertex to every other vertex in the graph. A source vertex whose shortest-path distances to the rest of the graph are small overall can be considered one of the most central nodes in the graph.

Because GraphLab Create SGraphs use directed edges, the shortest path toolkit finds the shortest directed paths from the source vertex. In this example we find all shortest paths from the node for the Microsoft article, then visualize the shortest path from the Microsoft article to the Weyerhaeuser article. Interestingly, the quickest way to get from Microsoft to Weyerhaeuser is via the articles on Google and tax avoidance.
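
For an unweighted directed graph, shortest paths from a single source vertex can be found with a breadth-first search. The sketch below is a generic illustration on a toy graph, not the toolkit's implementation: the function names `shortest_paths` and `path_to` and the vertex labels are hypothetical, and every edge is assumed to have length 1.

```python
from collections import deque


def shortest_paths(edges, source):
    """Breadth-first search along directed edges, each of length 1."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    distance = {source: 0}
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in successors.get(u, []):
            if v not in distance:  # first visit is along a shortest path
                distance[v] = distance[u] + 1
                parent[v] = u
                queue.append(v)
    return distance, parent


def path_to(distance, parent, target):
    """Rebuild the path as (vertex, distance) pairs; empty if unreachable."""
    if target not in distance:
        return []
    path = []
    node = target
    while node is not None:
        path.append((node, distance[node]))
        node = parent[node]
    return path[::-1]


toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("D", "A")]
distance, parent = shortest_paths(toy_edges, source="A")
print(path_to(distance, parent, "D"))
```

Since edges are followed only in their stored direction, a vertex that links to the source but cannot be reached from it gets no distance and an empty path.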

In [18]:
sssp = gl.shortest_path.create(sg, source_vid='Microsoft')
sssp.get_path(vid='Weyerhaeuser', show=True,
              highlight=['Microsoft', 'Weyerhaeuser'], arrows=True, ewidth=1.5)

PROGRESS: +----------------------------+
PROGRESS: | Number of vertices updated |
PROGRESS: +----------------------------+
PROGRESS: | 14129                      |
PROGRESS: | 24513                      |
PROGRESS: | 3939                       |
PROGRESS: | 566                        |
PROGRESS: | 0                          |
PROGRESS: +----------------------------+
Canvas is accessible via web browser at the URL: http://localhost:55384/index.html
Opening Canvas in default web browser.


None

[('Microsoft', 0.0),
 ('Google', 1.0),
 ('Tax avoidance', 2.0),
 ('Weyerhaeuser', 3.0)]
