In [1]:
from __future__ import print_function  # __future__ imports must come first

import os
import igraph as ig

import plotly.plotly as py
from plotly.graph_objs import *

import pandas as pd
import sys
# Make the local urlembed package importable from the parent directory
sys.path.append(os.path.abspath(".."))

from urlembed.util.seqmanager import *
from urlembed.util.plotter import *
from urlembed.util.metrics import *

from sklearn import metrics

The crawling process was run in two different ways:
- **No constraint**: the crawler follows a random outlink chosen among all the outlinks of a given page
- **List constraint**: the crawler follows a random outlink, but only among the outlinks that appear in "lists"
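
The difference between the two strategies can be sketched as follows. This is only an illustration, not the crawler's actual code: `pick_outlink`, its arguments, and the example links are hypothetical, and the way the crawler detects "lists" is not covered here.

```python
import random

def pick_outlink(all_outlinks, list_outlinks, list_constraint, rng=random):
    """Return a random outlink to follow, or None when there is no candidate.

    all_outlinks:  every outlink found in the page (hypothetical input)
    list_outlinks: the subset of outlinks that appear inside "lists"
    """
    candidates = list_outlinks if list_constraint else all_outlinks
    return rng.choice(candidates) if candidates else None

# Hypothetical page with four outlinks, two of which appear in a list
all_links = ["/a", "/b", "/c", "/d"]
in_lists = ["/b", "/d"]

rng = random.Random(0)  # seeded only to make the sketch repeatable
no_constraint_choice = pick_outlink(all_links, in_lists, False, rng)
list_constraint_choice = pick_outlink(all_links, in_lists, True, rng)
print(no_constraint_choice, list_constraint_choice)
```

With the list constraint enabled, the chosen link can only come from `in_lists`; without it, any of the four links may be followed.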

<div style="text-align:center"><h1> NO-CONSTRAINT GRAPH </h1></div>

Here we load the files generated by the crawler (the URL map and the edge list) and build the web graph from them.

In [2]:
nocostraint_path        = os.getcwd() + "/../dataset/cs.illinois.edu_NoConstraint.words1000.depth10/"
nocostraint_urlmap_path = nocostraint_path + "urlsMap.txt"
nocostraint_edges_path  = nocostraint_path + "edges.txt"

codeurlmap_nocostraint  = get_urlmap(nocostraint_urlmap_path)
nocostraint_graph       = graph_from_file(nocostraint_edges_path, codeurlmap_nocostraint)

print("Number of vertices:", len(nocostraint_graph.vs))
print("Number of edges:",    len(nocostraint_graph.es))

Number of vertices: 807
Number of edges: 16993


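Load the ground truth, drop the vertices named "missing", and give every remaining vertex its ground-truth label and the matching color.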

In [3]:
gt = GroundTruth(os.getcwd() + "/../dataset/ground_truth/urlToMembership.txt")
nocostraint_graph.delete_vertices([vertex.index for vertex in nocostraint_graph.vs if vertex["name"] == "missing"])

for vertex in nocostraint_graph.vs:
    # Look up the ground-truth class once and reuse it for both attributes
    label = int(gt.get_groundtruth(vertex["name"]))
    vertex["color"] = get_color(label)
    vertex["true_label"] = label

In [4]:
nocostraint_fig = graph3d_plot(nocostraint_graph, "No-costraint Network - Manually clustered")
py.iplot(nocostraint_fig, filename="No-costraint Network - Manually clustered")



The draw time for this plot will be slow for clients without much RAM.


<div>
    <a href="https://plot.ly/~chrispolo/44/" target="_blank" title="No-costraint network" style="display: block; text-align: center;"><img src="https://plot.ly/~chrispolo/44.png" alt="No-costraint network" style="max-width: 100%;width: 1000px;"  width="1000" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
    <script data-plotly="chrispolo:44"  src="https://plot.ly/embed.js" async></script>
</div>


### Clustering on the No-Constraint Graph

Two clustering algorithms are compared:
- **fastgreedy**: a hierarchical, bottom-up (agglomerative) approach. It greedily optimizes a quality function called modularity. Initially every vertex belongs to its own community, and communities are merged iteratively so that each merge is locally optimal (i.e. it yields the largest increase in the current value of modularity). The algorithm stops when modularity can no longer be increased, so it yields a grouping as well as a dendrogram. The method is fast and is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit: communities below a size threshold that depends on the overall size of the network will always be merged with neighboring communities.


- **walktrap**: an approach based on random walks. The general idea is that random walks on the graph are likely to stay within the same community, because only a few edges lead outside a given community. Walktrap runs short random walks (typically 3 to 5 steps, set by its `steps` parameter) and uses their results to merge separate communities bottom-up, as fastgreedy does. Again, the modularity score can be used to choose where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate.
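
To make the greedy modularity idea concrete, here is a minimal sketch in plain Python. It is a naive illustration, not igraph's implementation: it tries every pair of communities and recomputes modularity from scratch for each candidate merge, which is far slower than the real algorithm. The function names and the toy graph (two triangles joined by one edge) are assumptions made for the example.

```python
from itertools import combinations

def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = float(len(edges))
    label = {v: i for i, c in enumerate(communities) for v in c}
    internal = [0] * len(communities)   # edges with both ends in the community
    degree = [0] * len(communities)     # sum of vertex degrees in the community
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[i] / m - (degree[i] / (2.0 * m)) ** 2
               for i in range(len(communities)))

def greedy_merge(edges):
    """Start from singletons and repeatedly apply the merge that raises
    modularity the most; stop when no merge improves it."""
    nodes = sorted({v for e in edges for v in e})
    communities = [{v} for v in nodes]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        best = None
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            q = modularity(edges, merged)
            if q > best_q + 1e-12 and (best is None or q > best[0]):
                best = (q, merged)
        if best is None:  # no merge increases modularity any more
            break
        best_q, communities = best
    return communities, best_q

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
communities, q = greedy_merge(edges)
print(sorted(sorted(c) for c in communities), round(q, 4))
```

On this toy graph the procedure stops with the two triangles as communities, since merging them would lower modularity.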

In [41]:
fastgreedy_dendogram = nocostraint_graph.community_fastgreedy()
fastgreedy_clustering = fastgreedy_dendogram.as_clustering(16)

print(fastgreedy_clustering.sizes())

walktrap_dendogram = nocostraint_graph.community_walktrap(steps=3)
walktrap_clustering = walktrap_dendogram.as_clustering(16)

print(walktrap_clustering.sizes())

[449, 54, 50, 96, 68, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[82, 261, 141, 1, 71, 32, 13, 30, 3, 54, 32, 4, 1, 1, 1, 1]
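
The random-walk intuition behind WalkTrap can be checked on a toy graph. The sketch below is not the WalkTrap algorithm itself; it only computes where a short uniform random walk ends up, on an assumed example graph (two triangles joined by one edge), with a hypothetical helper `transition_distribution`.

```python
def transition_distribution(adjacency, start, steps):
    """Probability of being at each vertex after `steps` uniform random-walk steps."""
    dist = {v: 0.0 for v in adjacency}
    dist[start] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in adjacency}
        for v, p in dist.items():
            # The walker moves to each neighbor with equal probability
            for w in adjacency[v]:
                nxt[w] += p / len(adjacency[v])
        dist = nxt
    return dist

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3)
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

dist = transition_distribution(adjacency, start=0, steps=3)
inside = sum(dist[v] for v in (0, 1, 2))
print("Probability of ending inside the starting triangle:", round(inside, 3))
```

Most of the probability mass remains in the triangle where the walk started, which is the signal WalkTrap uses to decide which communities to merge.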


Store the predicted cluster labels as vertex attributes (used later for evaluation) and color each vertex by its WalkTrap cluster (used for visualization).

In [7]:
for i in range(len(walktrap_clustering.membership)):
    nocostraint_graph.vs[i]["color"] = get_color(walktrap_clustering.membership[i])
    nocostraint_graph.vs[i]["pred_walktrap_label"] = walktrap_clustering.membership[i]
    nocostraint_graph.vs[i]["pred_fastgreedy_label"] = fastgreedy_clustering.membership[i]

In [8]:
nocostraint_fig = graph3d_plot(nocostraint_graph, "No-costraint Network - WalkTrap clustered")
py.iplot(nocostraint_fig, filename="No-costraint Network - WalkTrap clustered")



<div>
    <a href="https://plot.ly/~chrispolo/46/" target="_blank" title="No-costraint Network - WalkTrap clusterized" style="display: block; text-align: center;"><img src="https://plot.ly/~chrispolo/46.png" alt="No-costraint Network - WalkTrap clusterized" style="max-width: 100%;width: 1000px;"  width="1000" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
    <script data-plotly="chrispolo:46"  src="https://plot.ly/embed.js" async></script>
</div>


### Evaluation


- **Confusion matrix**: each row represents the instances of an actual (ground-truth) class, while each column represents the instances assigned to a predicted cluster.
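
As a small illustration of this layout, the sketch below builds such a table from two label lists. `confusion_table` is a hypothetical stand-in written for this example; the notebook itself uses the project helper `get_confusion_table`, whose implementation is not shown here and may differ.

```python
from collections import Counter

def confusion_table(true_labels, pred_labels):
    """Rows are ground-truth classes, columns are predicted clusters."""
    rows = sorted(set(true_labels))
    cols = sorted(set(pred_labels))
    counts = Counter(zip(true_labels, pred_labels))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Made-up labels for six samples
true_labels = [0, 0, 1, 1, 1, 2]
pred_labels = [1, 1, 0, 0, 2, 2]

rows, cols, table = confusion_table(true_labels, pred_labels)
for r, line in zip(rows, table):
    print("class", r, line)
```

Cluster ids are arbitrary, so a good clustering does not need a diagonal table: it is enough that each row concentrates its counts in one column and each column in one row.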

In [9]:
nocostraint_conftable_df = pd.DataFrame(get_confusion_table(nocostraint_graph.vs["true_label"], nocostraint_graph.vs["pred_walktrap_label"]), 
             index=set(nocostraint_graph.vs["true_label"]),
             columns=set(nocostraint_graph.vs["pred_walktrap_label"]))

nocostraint_conftable_df

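| True \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
| 2 | 47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 4 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 47 | 13 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 14 | 197 | 108 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 11 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 12 | 0 | 14 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |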


### Other metrics:
- **Adjusted Rand index**: given the *ground truth* class assignments (*true_label*) and the assignments produced by our clustering algorithm for the same samples (e.g. *pred_walktrap_label*), the adjusted Rand index measures the similarity of the two assignments, ignoring permutations of the labels and correcting for chance.


- **Mutual Information based scores**: given the same two assignments, the Mutual Information measures their agreement, ignoring permutations. Two normalized versions of this measure are available, Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used in the literature, while AMI was proposed more recently and is normalized against chance. The score computed in this notebook is AMI.


- **Homogeneity**: a clustering is perfectly homogeneous when each cluster contains only members of a single class.


- **Completeness**: a clustering is perfectly complete when all members of a given class are assigned to the same cluster.


- **V-measure**: the harmonic mean of homogeneity and completeness. It is equivalent to NMI when the mutual information is normalized by the arithmetic mean of the label entropies.
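
The definitions above can be worked through by hand on a tiny example. The sketch below is a from-scratch illustration in plain Python, with helper names chosen for this example; the notebook itself relies on the `sklearn.metrics` functions, and for these inputs the values should agree with them. The adjusted Rand index helper assumes at least two samples.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = float(len(labels))
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given), computed from the joint label counts."""
    n = float(len(labels))
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / float(given_counts[g]))
                for (g, _), c in joint.items())

def homogeneity_completeness_v(true_labels, pred_labels):
    h_class, h_cluster = entropy(true_labels), entropy(pred_labels)
    homogeneity = 1.0 if h_class == 0 else \
        1.0 - conditional_entropy(true_labels, pred_labels) / h_class
    completeness = 1.0 if h_cluster == 0 else \
        1.0 - conditional_entropy(pred_labels, true_labels) / h_cluster
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v

def adjusted_rand_index(true_labels, pred_labels):
    def pairs(k):
        return k * (k - 1) / 2.0  # number of unordered pairs among k items
    index = sum(pairs(c) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_rows = sum(pairs(c) for c in Counter(true_labels).values())
    sum_cols = sum(pairs(c) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / pairs(len(true_labels))
    maximum = (sum_rows + sum_cols) / 2.0
    if maximum == expected:
        return 1.0
    return (index - expected) / (maximum - expected)

# Two classes; the second class is split across two clusters
true_labels = [0, 0, 1, 1]
pred_labels = [0, 0, 1, 2]

h, c, v = homogeneity_completeness_v(true_labels, pred_labels)
ari = adjusted_rand_index(true_labels, pred_labels)
print(round(h, 3), round(c, 3), round(v, 3), round(ari, 3))
```

Here every cluster is pure, so homogeneity is 1, while completeness drops to 2/3 because one class is split in two; the V-measure (0.8) sits between them.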

In [40]:
true_label_nc = nocostraint_graph.vs["true_label"]
pred_walktrap_label_nc = nocostraint_graph.vs["pred_walktrap_label"]
pred_fastgreedy_label_nc = nocostraint_graph.vs["pred_fastgreedy_label"]

nocostraint_metrics_df = pd.DataFrame([
        [
            metrics.homogeneity_score(true_label_nc, pred_walktrap_label_nc),
            metrics.completeness_score(true_label_nc, pred_walktrap_label_nc),
            metrics.v_measure_score(true_label_nc, pred_walktrap_label_nc),
            metrics.adjusted_rand_score(true_label_nc, pred_walktrap_label_nc),
            metrics.adjusted_mutual_info_score(true_label_nc, pred_walktrap_label_nc)
        ],
        [
            metrics.homogeneity_score(true_label_nc, pred_fastgreedy_label_nc),
            metrics.completeness_score(true_label_nc, pred_fastgreedy_label_nc),
            metrics.v_measure_score(true_label_nc, pred_fastgreedy_label_nc),
            metrics.adjusted_rand_score(true_label_nc, pred_fastgreedy_label_nc),
            metrics.adjusted_mutual_info_score(true_label_nc, pred_fastgreedy_label_nc)
        ]],
        index=["WalkTrap", "FastGreedy"],
        columns=["Homogeneity", "Completeness", "V-Measure score", "Adjusted Rand index", "Adjusted Mutual Information"])

print("WalkTrap modularity:", walktrap_clustering.modularity)
print("FastGreedy modularity:", fastgreedy_clustering.modularity)
nocostraint_metrics_df

WalkTrap modularity: 0.00369857293816
FastGreedy modularity: 0.212391372801


| | Homogeneity | Completeness | V-Measure score | Adjusted Rand index | Adjusted Mutual Information |
|---|---|---|---|---|---|
| WalkTrap | 0.647093 | 0.658495 | 0.652744 | 0.436288 | 0.628108 |
| FastGreedy | 0.551864 | 0.856322 | 0.67118 | 0.576429 | 0.535485 |


<div style="text-align:center"><h1> LIST-CONSTRAINT GRAPH </h1></div>

In [11]:
listcostraint_path        = os.getcwd() + "/../dataset/cs.illinois.edu_ListConstraint.words1000.depth10/"
listcostraint_urlmap_path = listcostraint_path + "urlsMap.txt"
listcostraint_edges_path  = listcostraint_path + "edges.txt"

urlmap_listcostraint      = get_urlmap(listcostraint_urlmap_path)
listcostraint_graph       = graph_from_file(listcostraint_edges_path, urlmap_listcostraint)

print("Number of vertices:", len(listcostraint_graph.vs))
print("Number of edges:",    len(listcostraint_graph.es))

Number of vertices: 1090
Number of edges: 19742


In [12]:
gt = GroundTruth(os.getcwd() + "/../dataset/ground_truth/urlToMembership.txt")
listcostraint_graph.delete_vertices([vertex.index for vertex in listcostraint_graph.vs if vertex["name"] == "missing"])

for vertex in listcostraint_graph.vs:
    # Look up the ground-truth class once and reuse it for both attributes
    label = int(gt.get_groundtruth(vertex["name"]))
    vertex["color"] = get_color(label)
    vertex["true_label"] = label

In [13]:
listcostraint_fig = graph3d_plot(listcostraint_graph, "List-costraint Network - Manually clustered")
py.iplot(listcostraint_fig, filename="List-costraint Network - Manually clustered")



<div>
    <a href="https://plot.ly/~chrispolo/50/" target="_blank" title="List-costraint Network - Manually clusterized" style="display: block; text-align: center;"><img src="https://plot.ly/~chrispolo/50.png" alt="List-costraint Network - Manually clusterized" style="max-width: 100%;width: 1000px;"  width="1000" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
    <script data-plotly="chrispolo:50"  src="https://plot.ly/embed.js" async></script>
</div>


In [14]:
fastgreedy_lc_dendogram = listcostraint_graph.community_fastgreedy()
fastgreedy_lc_clustering = fastgreedy_lc_dendogram.as_clustering(16)

print(fastgreedy_lc_clustering.sizes())

walktrap_lc_dendogram = listcostraint_graph.community_walktrap(steps=3)
walktrap_lc_clustering = walktrap_lc_dendogram.as_clustering(16)

print(walktrap_lc_clustering.sizes())

[343, 471, 48, 42, 47, 59, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3]
[367, 156, 47, 59, 32, 199, 40, 4, 15, 4, 4, 1, 1, 1, 1, 91]


In [31]:
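# NOTE: the loop bound is the length of walktrap_clustering (the no-constraint
# run), so list-constraint vertices beyond that index get no predicted labels here.
for i in range(len(walktrap_clustering.membership)):
    listcostraint_graph.vs[i]["color"] = get_color(walktrap_lc_clustering.membership[i])
    listcostraint_graph.vs[i]["pred_walktrap_label"] = walktrap_lc_clustering.membership[i]
    listcostraint_graph.vs[i]["pred_fastgreedy_label"] = fastgreedy_lc_clustering.membership[i]
    
# Vertices left without a FastGreedy label are assigned the placeholder cluster -1
for vertex in listcostraint_graph.vs:
    if vertex["pred_fastgreedy_label"] is None:
        vertex["pred_fastgreedy_label"] = -1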

In [16]:
listcostraint_fig = graph3d_plot(listcostraint_graph, "List-costraint Network - WalkTrap clustered")
py.iplot(listcostraint_fig, filename="List-costraint Network - WalkTrap clustered")



In [32]:
listcostraint_conftable_df = pd.DataFrame(get_confusion_table(listcostraint_graph.vs["true_label"], listcostraint_graph.vs["pred_fastgreedy_label"]), 
             index=set(listcostraint_graph.vs["true_label"]),
             columns=set(listcostraint_graph.vs["pred_fastgreedy_label"]))

print(set(listcostraint_graph.vs["pred_fastgreedy_label"]))
listcostraint_conftable_df

set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, -1])


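| True \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | -1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 146 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 14 |
| 3 | 0 | 0 | 0 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 14 | 139 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 137 |
| 10 | 181 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 131 |
| 12 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| 13 | 0 | 0 | 0 | 0 | 0 | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |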


In [38]:
true_label_lc = listcostraint_graph.vs["true_label"]
pred_walktrap_label_lc = listcostraint_graph.vs["pred_walktrap_label"]
pred_fastgreedy_label_lc = listcostraint_graph.vs["pred_fastgreedy_label"]

listcostraint_metrics_df = pd.DataFrame([
        [
            metrics.homogeneity_score(true_label_lc, pred_walktrap_label_lc),
            metrics.completeness_score(true_label_lc, pred_walktrap_label_lc),
            metrics.v_measure_score(true_label_lc, pred_walktrap_label_lc),
            metrics.adjusted_rand_score(true_label_lc, pred_walktrap_label_lc),
            metrics.adjusted_mutual_info_score(true_label_lc, pred_walktrap_label_lc)
        ],
        [
            metrics.homogeneity_score(true_label_lc, pred_fastgreedy_label_lc),
            metrics.completeness_score(true_label_lc, pred_fastgreedy_label_lc),
            metrics.v_measure_score(true_label_lc, pred_fastgreedy_label_lc),
            metrics.adjusted_rand_score(true_label_lc, pred_fastgreedy_label_lc),
            metrics.adjusted_mutual_info_score(true_label_lc, pred_fastgreedy_label_lc)
        ]],
        index=["WalkTrap", "FastGreedy"],
        columns=["Homogeneity", "Completeness", "V-Measure score", "Adjusted Rand index", "Adjusted Mutual Information"])

print("WalkTrap modularity:", walktrap_lc_clustering.modularity)
print("FastGreedy modularity:", fastgreedy_lc_clustering.modularity)
listcostraint_metrics_df

WalkTrap modularity: 0.0697452319297
FastGreedy modularity: 0.239621877718


| | Homogeneity | Completeness | V-Measure score | Adjusted Rand index | Adjusted Mutual Information |
|---|---|---|---|---|---|
| WalkTrap | 0.509364 | 0.48927 | 0.499115 | 0.276233 | 0.472249 |
| FastGreedy | 0.55226 | 0.60351 | 0.576749 | 0.365613 | 0.538205 |


<div style="text-align:center"><h1> CLUSTERING </h1></div>

<div>
    <a href="https://plot.ly/~chrispolo/38" 
        target="_blank" title="y" 
        style="display: block; text-align: center;">
            <img src="../dataset/img/cn.png" 
                alt="y" style="max-width: 100%;width: 1121px;"  
                width="100%" onerror="this.onerror=null;this.src='https://plot.ly/404';" />
    </a>
    <script data-plotly="chrispolo:38"  src="https://plot.ly/embed.js" async></script>
</div>