In [17]:
from process import ProcessADS, ProcessGraph

DEV_KEY = "kNUoTurJ5TXV9hsw9KQN1k8wH4U0D7Oy0CJoOvyw"


### Data Warehousing API for NASA ADS

These two classes from the process module pull our data out of NASA ADS and into tables suitable for warehousing in the SQLite database on our server.

The ProcessADS class takes a query string q, a dev key for the ADS API, and an integer max_pages (values from 1 to 5 seem to work best). Because each query gets its own instance, we can run several queries in the same Python script or session.


In [18]:
# topics: 'stars (demo)', cosmology, exoplanet astronomy, high energy astrophysics

p = ProcessADS(q='exoplanet astronomy', key=DEV_KEY, max_pages=4)


We can inspect the results of the query through the nodes and edges attributes. These are ordinary pandas dataframes and support all the usual DataFrame methods.



In [19]:
p.nodes.head()

Unnamed: 0,label,node_type,id
0,"Kroupa, Pavel",Author,0
1,"Burrows, A.",Author,1
2,"Marley, M.",Author,2
3,"Hubbard, W. B.",Author,3
4,"Lunine, J. I.",Author,4


The fields that store the tables in both Process classes are private, so they cannot be set directly (or accidentally overwritten). If you want to manipulate a dataframe for any reason, store it in a new variable first and work on that. Don't do anything that changes the dataframes in place inside the objects!


In [4]:
#p.edges = None  # direct setting throws an error
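
The read-only behaviour described above can be reproduced with a property that has no setter. The sketch below is an assumption about how such a class might be written, not the actual process source; ReadOnlyTables and its _nodes field are hypothetical names, and plain lists of dicts stand in for the pandas dataframes.

```python
import copy


class ReadOnlyTables:
    """Hypothetical stand-in for a Process class with a read-only table."""

    def __init__(self, rows):
        self._nodes = rows  # internal storage, not meant to be touched

    @property
    def nodes(self):
        # No setter is defined, so `obj.nodes = ...` raises AttributeError.
        return self._nodes


tables = ReadOnlyTables([{"id": 0, "label": "Kroupa, Pavel", "node_type": "Author"}])

try:
    tables.nodes = None
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True

# Work on an independent copy so the rows held by the object stay unchanged.
my_nodes = copy.deepcopy(tables.nodes)
my_nodes[0]["label"] = "edited"

print(assignment_blocked)        # True
print(tables.nodes[0]["label"])  # Kroupa, Pavel
```

With a real ProcessADS instance the analogous step would be my_nodes = p.nodes.copy(), since DataFrame.copy() returns a new object by default.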

ProcessGraph processes the query object with networkx and delivers three sets of graphs, each as a node dataframe and an edge-list dataframe. main is the large main query graph, lg_cc_subgraph holds the largest bipartite connected components, and islands are the bipartite subgraphs with the highest edge weight for each node type. Betweenness centrality, degree centrality, and PageRank are also added for all nodes in each graph. These will factor into node size in the sigma.js visualization.
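
As a rough illustration of one of these measures, the degree centrality of a node is its number of neighbours divided by n - 1, where n is the number of nodes in the graph (this is the normalization networkx applies in degree_centrality). The sketch below computes it in plain Python on a tiny made-up author-paper edge list. The edge data and the function are hypothetical; ProcessGraph presumably relies on networkx itself rather than on code like this.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    n = len(neighbours)
    if n <= 1:
        # Degenerate case; returning 0.0 is a choice made for this sketch only.
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


# Made-up bipartite edges: (author, paper)
edges = [
    ("Author A", "Paper 1"),
    ("Author B", "Paper 1"),
    ("Author B", "Paper 2"),
    ("Author C", "Paper 2"),
]

centrality = degree_centrality(edges)
print(centrality["Author B"])  # 2 neighbours / (5 nodes - 1) = 0.5
```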

The dataframes you can look at (assuming graphs = ProcessGraph(p)):
* graphs.main_nodes
* graphs.main_edges
* graphs.lg_cc_nodes
* graphs.lg_cc_edges
* graphs.islands_nodes
* graphs.islands_edges

You can also check out the three networkx graph objects:
* graphs.g   (main graph)
* graphs.lg_cc_subgraph
* graphs.islands_graph



In [20]:
graphs = ProcessGraph(p)

graphs.main_nodes.head()

Unnamed: 0,id,label,node_type,zbetween_central,zdeg_central,zpagerank
0,0,"Kroupa, Pavel",Author,0.0,0.001497,0.000713
1,1,"Burrows, A.",Author,0.00758,0.002994,0.001307
2,2,"Marley, M.",Author,0.0,0.001497,0.000735
3,3,"Hubbard, W. B.",Author,0.001004,0.002994,0.001222
4,4,"Lunine, J. I.",Author,0.001004,0.002994,0.001222


In [13]:
graphs.a_lg_cc_subgraph.nodes(data=True)[0:5]

[(0,
  {'id': 0,
   'label': u'Schlegel, David J.',
   'node_type': 'Author',
   'zbetween_central': 0.0,
   'zdeg_central': 0.31100478468899523,
   'zpagerank': 0.002844346215282877}),
 (1,
  {'id': 1,
   'label': u'Finkbeiner, Douglas P.',
   'node_type': 'Author',
   'zbetween_central': 0.0,
   'zdeg_central': 0.31100478468899523,
   'zpagerank': 0.002844346215282877}),
 (2,
  {'id': 2,
   'label': u'Davis, Marc',
   'node_type': 'Author',
   'zbetween_central': 0.0,
   'zdeg_central': 0.31100478468899523,
   'zpagerank': 0.002844346215282877}),
 (3,
  {'id': 3,
   'label': u'Riess, Adam G.',
   'node_type': 'Author',
   'zbetween_central': 0.010458095342608765,
   'zdeg_central': 0.38516746411483255,
   'zpagerank': 0.004105197589831577}),
 (4,
  {'id': 4,
   'label': u'Filippenko, Alexei V.',
   'node_type': 'Author',
   'zbetween_central': 0.010458095342608765,
   'zdeg_central': 0.38516746411483255,
   'zpagerank': 0.004105197589831577})]

In [21]:
graphs.islands_graph.nodes(data=True)[:5]

[(3,
  {'id': 3,
   'label': u'Hubbard, W. B.',
   'node_type': 'Author',
   'zbetween_central': 0.0,
   'zdeg_central': 0.025974025974025976,
   'zpagerank': 0.004696763813513676}),
 (4,
  {'id': 4,
   'label': u'Lunine, J. I.',
   'node_type': 'Author',
   'zbetween_central': 0.0,
   'zdeg_central': 0.025974025974025976,
   'zpagerank': 0.004696763813513676}),
 (379,
  {'id': 379,
   'label': u'Weiss, W.',
   'node_type': 'Author',
   'zbetween_central': 0.0,
   'zdeg_central': 0.025974025974025976,
   'zpagerank': 0.01282051282051282}),
 (10,
  {'id': 10,
   'label': u'Marois, Christian',
   'node_type': 'Author',
   'zbetween_central': 0.0066729323308270675,
   'zdeg_central': 0.3116883116883117,
   'zpagerank': 0.020633581764866907}),
 (11,
  {'id': 11,
   'label': u'Macintosh, Bruce',
   'node_type': 'Author',
   'zbetween_central': 0.0066729323308270675,
   'zdeg_central': 0.3116883116883117,
   'zpagerank': 0.020633581764866907})]

If everything looks good, call the two export methods to write the graphs to CSV files. A subdirectory named csvs is created in the current working directory for the output. For our web app these CSVs will be piped into a SQLite database that sits on the server, but they could also be used with other graph APIs and databases.

In [22]:
graphs.export_main_to_csv()

In [23]:
graphs.export_subgraphs_to_csv()
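
To sketch the step from exported CSVs into SQLite, the example below loads a nodes file using the standard-library csv and sqlite3 modules. The file name main_nodes.csv, the table name main_nodes, and the three-column layout are assumptions for illustration; the files written by the export methods may be named and shaped differently. A sample file is written to a temporary directory first so the snippet runs on its own.

```python
import csv
import os
import sqlite3
import tempfile

# Write a small sample file; the name and columns are assumed,
# not taken from the export code.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "main_nodes.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "label", "node_type"])
    writer.writerow([0, "Kroupa, Pavel", "Author"])
    writer.writerow([1, "Burrows, A.", "Author"])

# Load the CSV rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE main_nodes (id INTEGER PRIMARY KEY, label TEXT, node_type TEXT)"
)
with open(csv_path, newline="") as handle:
    reader = csv.DictReader(handle)
    rows = [(int(r["id"]), r["label"], r["node_type"]) for r in reader]
conn.executemany("INSERT INTO main_nodes VALUES (?, ?, ?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM main_nodes").fetchone()[0]
first_label = conn.execute(
    "SELECT label FROM main_nodes WHERE id = 0"
).fetchone()[0]
print(row_count, first_label)  # 2 Kroupa, Pavel
```

Pointing sqlite3.connect at a file path instead of ":memory:" would persist the database on disk.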