# Classification

[EDeN](https://github.com/fabriziocosta/EDeN) can be used together with the popular scikit-learn library to induce predictive estimators for graphs.

The first thing to do is to load the graphs. 
EDeN offers an io module that can read popular formats like the [gSpan](http://ieeexplore.ieee.org/document/1184038/?arnumber=1184038&tag=1) graph format or the JSON serialization of [networkx](https://networkx.github.io) graphs.  

In [2]:
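import numpy as np
from eden.io.gspan import load

# Load the positive and negative example graphs from gSpan files.
pos_graphs = list(load('data/bursi.pos.gspan'))
neg_graphs = list(load('data/bursi.neg.gspan'))
graphs = pos_graphs + neg_graphs

# Target vector: +1 for positive graphs, -1 for negative graphs.
y = np.array([1] * len(pos_graphs) + [-1] * len(neg_graphs))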
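
For readers who have not seen the gSpan format, the sketch below parses a tiny gSpan-style string into plain Python dictionaries. It is an illustration only and not EDeN's loader: the function name `parse_gspan`, the dictionary layout, and the toy data are hypothetical, and it assumes the usual three line types (`t # <id>` starts a graph, `v <id> <label>` declares a node, `e <source> <target> <label>` declares an edge).

```python
# Minimal gSpan-style parser (a sketch; NOT EDeN's loader).
# Assumed line types:
#   t # <graph_id>                 starts a new graph
#   v <node_id> <label>            declares a labeled node
#   e <source> <target> <label>    declares a labeled edge


def parse_gspan(text):
    """Return a list of graphs, each a dict with 'nodes' and 'edges'."""
    graphs = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind = fields[0]
        if kind == 't':
            # A 't' line opens a new, empty graph.
            current = {'nodes': {}, 'edges': []}
            graphs.append(current)
        elif kind == 'v' and current is not None:
            # Map node id -> node label.
            current['nodes'][int(fields[1])] = fields[2]
        elif kind == 'e' and current is not None:
            # Store each edge as (source id, target id, edge label).
            current['edges'].append((int(fields[1]), int(fields[2]), fields[3]))
    return graphs


# Hypothetical toy input: two graphs, the first with one labeled edge.
example = """t # 0
v 0 C
v 1 O
e 0 1 2
t # 1
v 0 N
"""

parsed = parse_gspan(example)
print(len(parsed))
print(parsed[0]['nodes'])
print(parsed[0]['edges'])
```

In the notebook itself, `load` takes care of this step and yields graph objects that `vectorize` can consume.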

EDeN exports a `vectorize` function that converts a list of graphs into a data matrix with one row per graph.

The output format is a SciPy [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix).

In [3]:
%%time
from eden.graph import vectorize

X = vectorize(graphs, complexity=2)
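# Integer division keeps the average an int on both Python 2 and 3.
avg_features = X.getnnz() // X.shape[0]
print('Instances: %d Features: %d with an avg of %d features per instance' % (X.shape[0], X.shape[1], avg_features))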

Instances: 4337 Features: 65537 with an avg of 184 features per instance
CPU times: user 11.7 s, sys: 2.37 s, total: 14.1 s
Wall time: 13.6 s
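
To make the sparse layout and the reported average more concrete, here is a small pure-Python sketch of the Compressed Sparse Row idea. It does not use SciPy; the helper `dense_to_csr` and the toy matrix are hypothetical. The three lists it builds play the same role as the `data`, `indices`, and `indptr` attributes of a SciPy `csr_matrix`, and the average number of features per instance is the number of stored nonzeros divided by the number of rows.

```python
def dense_to_csr(rows):
    """Build CSR-style arrays (data, indices, indptr) from a dense list of rows."""
    data = []      # nonzero values, stored row by row
    indices = []   # column index of each stored value
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i inside data/indices
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


# Hypothetical toy matrix: 3 instances, 4 features, mostly zeros.
dense = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
]

data, indices, indptr = dense_to_csr(dense)

# Nonzeros per row follow directly from consecutive indptr entries.
nnz_per_row = [indptr[i + 1] - indptr[i] for i in range(len(dense))]

# Same quantity as the notebook's average: stored nonzeros / number of rows.
avg_features = len(data) / float(len(dense))

print(data, indices, indptr)
print(nnz_per_row, avg_features)
```

Only the nonzero entries are stored, which is why a matrix with 65537 columns but about 184 nonzeros per row stays small in memory.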


Several predictive algorithms from the [scikit-learn](http://scikit-learn.org/stable/) library can process data in CSR format directly.

In [4]:
%%time
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Linear model trained with SGD; 'balanced' weights classes inversely to their frequency.
predictor = SGDClassifier(average=True, class_weight='balanced', shuffle=True, n_jobs=-1)
# 10-fold cross-validation, scored by the area under the ROC curve.
scores = cross_val_score(predictor, X, y, cv=10, scoring='roc_auc')
print('AUC ROC: %.4f +- %.4f' % (np.mean(scores), np.std(scores)))

AUC ROC: 0.9017 +- 0.0157
CPU times: user 392 ms, sys: 17.7 ms, total: 410 ms
Wall time: 416 ms
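
The `roc_auc` score reported above can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity by brute force on hypothetical toy data. It illustrates the definition only and is not how scikit-learn computes the metric internally; the function name `pairwise_roc_auc` and the toy labels and scores are assumptions made for this example.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label != 1]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (+1 / -1, as in the notebook) and decision scores.
y_toy = [1, 1, -1, -1]
s_toy = [0.9, 0.4, 0.5, 0.1]

auc = pairwise_roc_auc(y_toy, s_toy)
print('AUC ROC: %.2f' % auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the toy AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to a perfect one.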


---