<a href="https://colab.research.google.com/github/cmzwolf/JupyterVAMDCPortal/blob/main/portalTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VAMDC portal in a Jupyter Notebook 
This notebook reproduces, in Google Colab, functionality similar to that of the VAMDC portal, in particular querying databases and downloading data. **This is a first prototype**: the code will be refactored into functions and new features will be added.

Let us start with a configuration phase, where we fetch all the required components.

In [None]:
# fetching the VAMDC Python libraries
!git clone https://github.com/notlaast/vamdclib.git 
%cd vamdclib

Cloning into 'vamdclib'...
remote: Enumerating objects: 181, done.
remote: Total 181 (delta 0), reused 0 (delta 0), pack-reused 181
Receiving objects: 100% (181/181), 166.20 KiB | 914.00 KiB/s, done.
Resolving deltas: 100% (85/85), done.
/content/vamdclib


In [None]:
# installing the VAMDC libraries as a Python package (-e installs in editable mode, linking to the cloned sources)
!pip install -e .

Obtaining file:///content/vamdclib
Installing collected packages: vamdclib
  Running setup.py develop for vamdclib
Successfully installed vamdclib-0.3


In [None]:
# installing dependencies for interacting with the VAMDC registries
!pip install suds-jurko

Collecting suds-jurko
  Downloading suds-jurko-0.6.zip (255 kB)
[?25l[K     |█▎                              | 10 kB 22.0 MB/s eta 0:00:01[K     |██▋                             | 20 kB 26.5 MB/s eta 0:00:01[K     |███▉                            | 30 kB 21.1 MB/s eta 0:00:01[K     |█████▏                          | 40 kB 17.2 MB/s eta 0:00:01[K     |██████▍                         | 51 kB 7.3 MB/s eta 0:00:01[K     |███████▊                        | 61 kB 8.5 MB/s eta 0:00:01[K     |█████████                       | 71 kB 7.9 MB/s eta 0:00:01[K     |██████████▎                     | 81 kB 8.8 MB/s eta 0:00:01[K     |███████████▌                    | 92 kB 9.5 MB/s eta 0:00:01[K     |████████████▉                   | 102 kB 7.1 MB/s eta 0:00:01[K     |██████████████                  | 112 kB 7.1 MB/s eta 0:00:01[K     |███████████████▍                | 122 kB 7.1 MB/s eta 0:00:01[K     |████████████████▋               | 133 kB 7.1 MB/s eta 0:00:01[K     |██

**We are ready to go!** Let us check the registries to see what nodes are registered

In [None]:
# querying the registries to get the node list
import pandas as pd
from vamdclib import nodes
from vamdclib import request
nl = nodes.Nodelist()

In [None]:
# display the nodes in a table, using a pandas DataFrame
%load_ext google.colab.data_table
nodeNames = []
nodesUrls = []
nodesIdentifiers = []
nodesMaintainers = []
for node in nl:
  nodeNames.append(node.name)
  nodesUrls.append(node.url)
  nodesIdentifiers.append(node.identifier)
  #nodesMaintainers.append(node.maintainer)
  nodesMaintainers.append("mokeMail@forAvoiding.spam")

df = pd.DataFrame(list(zip(nodeNames, nodesUrls, nodesIdentifiers, nodesMaintainers)),
               columns =['NodeName', 'NodeURL','nodeIdentifiers', 'nodeMaintainer'])
df

The google.colab.data_table extension is already loaded. To reload it, use:
  %reload_ext google.colab.data_table


Unnamed: 0,NodeName,NodeURL,nodeIdentifiers,nodeMaintainer
0,Water internet Accessible Distributed Informat...,http://vamdc.saga.iao.ru/node/wadis/tap/,ivo://vamdc/wadis/vamdc-tap,mokeMail@forAvoiding.spam
1,NIST Atomic Spectra Database,https://physics.nist.gov:8000/nodes/asd/tap/,ivo://vamdc/nist/vamdc-tap_12.07,mokeMail@forAvoiding.spam
2,CDMS,https://cdms.astro.uni-koeln.de/cdms/tap/,ivo://vamdc/cdms/vamdc-tap_12.07,mokeMail@forAvoiding.spam
3,MeCaSDa - Methane Calculated Spectroscopic Dat...,http://vamdc.icb.cnrs.fr/mecasda-12.07/tap/,ivo://vamdc/dijon-methane-lines,mokeMail@forAvoiding.spam
4,GeCaSDa: Gemane Calculated Spectroscopic Database,http://vamdc.icb.cnrs.fr/gecasda/tap/,ivo://vamdc/dijon-GeH4-lines,mokeMail@forAvoiding.spam
5,Theoretical spectral database of polycyclic a...,http://vamdc-pah.oa-cagliari.inaf.it/tap/,ivo://vamdc/OA-Cagliari/PAH,mokeMail@forAvoiding.spam
6,IDEADB - Innsbruck Dissociative Electron Attac...,https://ideadb.uibk.ac.at/tap/,ivo://vamdc/IDEADB,mokeMail@forAvoiding.spam
7,OACT - LASP Database,http://dblasp.oact.inaf.it/node1207/OACT/tap/,ivo://vamdc/OACatania/LASP,mokeMail@forAvoiding.spam
8,Carbon Dioxide Spectroscopic Databank 296K (VA...,http://lts.iao.ru/node/cdsd-296-xsams1/tap/,ivo://vamdc/cdsd-296,mokeMail@forAvoiding.spam
9,VALD (atoms),http://vald.astro.uu.se/atoms-12.07/tap/,ivo://vamdc/vald/uu/django,mokeMail@forAvoiding.spam


We can now define the queries we would like to submit. Each query will be submitted to every node we select.

In [None]:
# define the set of queries to perform
queries = []
# list all the species available at the node
queries.append("select species")
# select all data for HCN, identified by its InChIKey
queries.append("select * where ((InchiKey = 'LELOWRISYMNNSU-UHFFFAOYSA-N'))")


In [None]:
# selecting the nodes to query by their index in the table above
selectedNodes = []
selectedNodesIndexes = [2,9]
for i in selectedNodesIndexes:
  currentNodeIdentifier = df['nodeIdentifiers'][i]
  currentNode = nl.getnode(currentNodeIdentifier)
  selectedNodes.append(currentNode)
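
Selecting nodes by position relies on the registry returning them in the same order every time. As a hedged alternative, the sketch below selects nodes by a fragment of their name instead. The `select_by_name` helper is hypothetical (it is not part of vamdclib), and `SimpleNamespace` objects stand in for real node objects, carrying names and identifiers copied from the table above.

```python
from types import SimpleNamespace

def select_by_name(nodes, fragments):
    """Return the nodes whose name contains any of the fragments (case-insensitive)."""
    lowered = [fragment.lower() for fragment in fragments]
    return [n for n in nodes if any(f in n.name.lower() for f in lowered)]

# Stand-ins for node objects; only .name and .identifier are assumed to exist
example_nodes = [
    SimpleNamespace(name="NIST Atomic Spectra Database",
                    identifier="ivo://vamdc/nist/vamdc-tap_12.07"),
    SimpleNamespace(name="CDMS",
                    identifier="ivo://vamdc/cdms/vamdc-tap_12.07"),
    SimpleNamespace(name="VALD (atoms)",
                    identifier="ivo://vamdc/vald/uu/django"),
]

selected = select_by_name(example_nodes, ["cdms", "vald"])
for n in selected:
    print(n.name, n.identifier)
```

The returned list keeps the registry order, so it can be used wherever `selectedNodes` is used.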

In [None]:
# build all the requests, one for each combination of selected node and query
from vamdclib import request
requests = []

for currentNode in selectedNodes:
  for currentQuery in queries:
    req = request.Request()
    req.setnode(currentNode)
    req.setquery(currentQuery)
    requests.append(req)
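
The nested loop pairs every selected node with every query. A minimal sketch of the same pairing with `itertools.product` follows; plain strings stand in for node objects and queries, and `build_pairs` is a hypothetical helper, not a vamdclib function.

```python
from itertools import product

def build_pairs(nodes, queries):
    """Return every (node, query) combination, with nodes varying slowest."""
    return list(product(nodes, queries))

# Placeholder values standing in for the selected nodes and the query list
example_nodes = ["cdms", "vald"]
example_queries = ["query A", "query B"]

pairs = build_pairs(example_nodes, example_queries)
for node, query in pairs:
    print(node, "<-", query)
```

The ordering matches the nested loop: all queries for the first node come first, then all queries for the second node, so position `i` in `pairs` corresponds to position `i` in `requests`.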

In [None]:
# making head requests
from vamdclib import request
for req in requests:
  req.doheadrequest()

In [None]:
# Displaying the performed head queries 
headerColumns = ["VAMDC-COUNT-SPECIES", "VAMDC-COUNT-STATES", "VAMDC-TRUNCATED", "VAMDC-COUNT-MOLECULES", "VAMDC-COUNT-SOURCES", "VAMDC-APPROX-SIZE", "VAMDC-COUNT-RADIATIVE", "VAMDC-COUNT-ATOMS", "VAMDC-REQUEST-TOKEN", "Query"]
countSpecies = []
countStates =[]
truncated = []
countMolecules = []
countSources = []
approxSize = []
countRadiative = []
countAtoms = []
requestToken = []
submittedQueries = []
for req in requests:
  countSpecies.append(req.headers.get(headerColumns[0]))
  countStates.append(req.headers.get(headerColumns[1]))
  truncated.append(req.headers.get(headerColumns[2]))
  countMolecules.append(req.headers.get(headerColumns[3]))
  countSources.append(req.headers.get(headerColumns[4]))
  approxSize.append(req.headers.get(headerColumns[5]))
  countRadiative.append(req.headers.get(headerColumns[6]))
  countAtoms.append(req.headers.get(headerColumns[7]))
  requestToken.append(req.headers.get(headerColumns[8]))
  submittedQueries.append(req.query.Query)

HeadDF = pd.DataFrame(list(zip(countSpecies, countStates, truncated, countMolecules, countSources, approxSize, countRadiative, countAtoms, requestToken, submittedQueries)),
            columns = headerColumns   )


In [None]:
HeadDF

Unnamed: 0,VAMDC-COUNT-SPECIES,VAMDC-COUNT-STATES,VAMDC-TRUNCATED,VAMDC-COUNT-MOLECULES,VAMDC-COUNT-SOURCES,VAMDC-APPROX-SIZE,VAMDC-COUNT-RADIATIVE,VAMDC-COUNT-ATOMS,VAMDC-REQUEST-TOKEN,Query
0,1073.0,0.0,100.0,1062.0,0.0,3084.32,5494236.0,11.0,cdms:309d0fd7-7443-4c3c-a00f-a0fcfe9393d8:head,select species
1,6.0,766.0,100.0,6.0,16.0,1.23,936.0,0.0,cdms:af894d31-cff5-4fad-907b-15c79c38e7ed:head,select * where ((InchiKey = 'LELOWRISYMNNSU-UH...
2,309.0,,,,,,,309.0,vald:2070668d-5689-435f-804e-21db481d5d2d:head,select species
3,,,,,,,,,,select * where ((InchiKey = 'LELOWRISYMNNSU-UH...


**The last row of the previous table contains no data.** The reason is simple and depends on the headers each node returns: some nodes use upper-case header names, others lower-case, and the code above only looks up the upper-case names. Should the client code adapt to the variety of node behaviours, or should all nodes answer in the same way? **To be discussed further in VAMDC...**
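
Until that question is settled, the client can tolerate both spellings. The sketch below is one possible approach, assuming the headers are available as a dict-like mapping; `get_header` is a hypothetical helper and the two header sets are made-up illustrations, not real node responses.

```python
def get_header(headers, name, default=None):
    """Look up a header by name, ignoring letter case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default

# Made-up header sets illustrating the two spellings a node might use
upper_case_headers = {"VAMDC-COUNT-SPECIES": "6"}
lower_case_headers = {"vamdc-count-species": "1"}

print(get_header(upper_case_headers, "VAMDC-COUNT-SPECIES"))
print(get_header(lower_case_headers, "VAMDC-COUNT-SPECIES"))
print(get_header({}, "VAMDC-COUNT-SPECIES"))
```

The first two calls return the stored value regardless of case; the last one returns `None` because the header is absent.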

We are going to retrieve data only for the query at index 1. Once the data is fetched, we save it to a file. We name this file after the query token, since the token is unique.
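
The choice of which query to run can also be made from the head results. The sketch below keeps only the requests whose reported size is present and below a threshold. It assumes `VAMDC-APPROX-SIZE` is expressed in megabytes and arrives as a string; `to_number`, `worth_downloading`, and the 10 MB limit are hypothetical choices for illustration. The sizes are the ones shown in the head table above.

```python
def to_number(value):
    """Convert a header value to float; return None if missing or malformed."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def worth_downloading(headers, max_size_mb=10.0):
    """True when the approximate size is known, positive, and within the limit."""
    size = to_number(headers.get("VAMDC-APPROX-SIZE"))
    return size is not None and 0 < size <= max_size_mb

# Sizes copied from the head table above (None where the node returned nothing)
head_results = [
    {"VAMDC-APPROX-SIZE": "3084.32"},
    {"VAMDC-APPROX-SIZE": "1.23"},
    {"VAMDC-APPROX-SIZE": None},
    {},
]

candidates = [i for i, h in enumerate(head_results) if worth_downloading(h)]
print(candidates)
```

With these values only index 1 passes the filter, which is the query retrieved in the next cells.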

In [None]:
# selecting the query indexed by 1 to retrieve data
requestToPerform = requests[1]

In [None]:
# running the query to get the results
result = requestToPerform.dorequest()

In [None]:
# we use the query token to name the file in a unique way (later we may rename it using the query store ID) and save the result XSAMS file
resultFileName = "/content/"+ HeadDF["VAMDC-REQUEST-TOKEN"][1]+".xsams"
with open(resultFileName, "wb") as text_file:
    text_file.write(result.Xml)
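
The request token contains colons, which some operating systems reject in file names once the file is downloaded. A minimal sketch of a sanitising step follows; `token_to_filename` is a hypothetical helper, and the token is the one shown in the head table above.

```python
import re

def token_to_filename(token, extension=".xsams"):
    """Replace every character outside [A-Za-z0-9._-] with an underscore."""
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", token)
    return safe + extension

example_name = token_to_filename("cdms:af894d31-cff5-4fad-907b-15c79c38e7ed:head")
print(example_name)
```

The UUID part of the token is left untouched, so the file name stays unique.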

**Excellent!** Now open the file explorer on the left of this Colab notebook, find your file and download it. You have your XSAMS file! Congratulations.


---
# Applying processors
Let us now try to apply a processor to the retrieved data.


To begin with, let us download the file defining the molecular processor, which fits the data we have fetched.

In [None]:
!wget https://raw.githubusercontent.com/VAMDC/Processors/master/static/xsl/molecularxsams2html.xsl

--2021-12-14 10:38:09--  https://raw.githubusercontent.com/VAMDC/Processors/master/static/xsl/molecularxsams2html.xsl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 92777 (91K) [text/plain]
Saving to: ‘molecularxsams2html.xsl’


2021-12-14 10:38:10 (8.10 MB/s) - ‘molecularxsams2html.xsl’ saved [92777/92777]



Then we use the downloaded processor to transform our data, via a standard XSLT transformation.

In [None]:
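import lxml.etree as ET
# compile the downloaded stylesheet (wget saved it in the current directory, /content/vamdclib)
transform = ET.XSLT(ET.parse('/content/vamdclib/molecularxsams2html.xsl'))
# parse the XSAMS file saved earlier, apply the transformation and write the HTML result
xsamsTree = ET.parse(resultFileName)
transform(xsamsTree).write_output('test1.html')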

Let us use pandas to read the tables in the produced HTML file.

In [None]:
tableHTML = pd.read_html("test1.html")

Let us display the first table that was read.

In [None]:
tableHTML[0]

Unnamed: 0,Id,Title,Origin,Authors,Year,Link
0,BCDMS-68,Pure Rotational Spectrum of HCN in the Teraher...,"journal : J. Mol. Spectrosc. ( Vol : 202 , ...","Maiwald, F.; Lewen, F.; Ahrens, V.; Beaky, ...",2000,
1,BCDMS-70,High-Temperature Infrared Measurements in the ...,"journal : J. Mol. Spectrosc. ( Vol : 202 , ...","Maki, A. G.; Mellau, G. C.; Klee, S.; Winne...",2000,
2,BCDMS-130,Vibrational predissociation in the hydrogen fl...,"journal : J. Chem. Phys. ( Vol : 80 , Page ...","DeLeon, R. L.; Muenter, J. S.;",1984,
3,BCDMS-381,Sub-Doppler Saturation Spectroscopy of HCN up ...,"journal : Z. Naturforsch. ( Vol : 57a , Pag...","Ahrens, V.; Lewen, F.; Takano, S.; Winnewis...",2002,
4,BCDMS-382,A Concise New Look at the [CLC][ITAL]l[/ITAL][...,"journal : Astrophys. J. ( Vol : 585 , Page ...","Thorwirth, S.; Müller, H. S. P.; Lewen, F.; ...",2003,
5,BCDMS-383,,"journal : Proc. SPIE ( Vol : 6580 , Page Be...","Lapinov, A. V.;",2006,
6,BCDMS-385,Dipole moment and hyperfine properties of the ...,"journal : J. Chem. Phys. ( Vol : 80 , Page ...","Ebenstein, W. L.; Muenter, J. S.;",1984,
7,BCDMS-479,Submillimeter-wave spectroscopy of HCN in exci...,"journal : J. Mol. Spectrosc. ( Vol : 220 , ...","Zelinger, Z.; Amano, T.; Ahrens, V.; Brünke...",2003,
8,BCDMS-828,Stark effect and hyperfine structure of HCN me...,journal : J. Res. Natl. Bur. Stand. ( Vol : ...,"Radford, H. E.; Kurtz, C. V.;",1970,
9,BCDMS-829,Microwave Spectra of Molecules of Astrophysica...,journal : J. Phys. Chem. Ref. Data ( Vol : 3...,"Maki, A. G.;",2000,


Then, let us display the second table that was read.

In [None]:
tableHTML[1]

Unnamed: 0,Unselect all,Chemical nameX,Stoichiometric formulaX,Ordinary structural formulaX,FrequencyX,AX,Lower energy(1/cm)X,Lower total statistical weightX,Lower nuclear statistical weightX,Lower QNsX,Upper energy(1/cm)X,Upper total statistical weightX,Upper nuclear statistical weightX,Upper QNsX
0,,Hydrogen Cyanide,CHN,HCN,4.471766e+02,0.0000,714.9356,3.0,1.0,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=1 p...,714.9506,1.0,,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=0 p...
1,,Hydrogen Cyanide,CHN,HCN,4.482061e+02,0.0000,714.9356,3.0,1.0,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=1 p...,714.9506,5.0,,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=2 p...
2,,Hydrogen Cyanide,CHN,HCN,4.488446e+02,0.0000,714.9356,3.0,1.0,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=1 p...,714.9506,3.0,,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=1 p...
3,,Hydrogen Cyanide,CHN,HCN,4.489430e+02,0.0000,714.9356,9.0,3.0,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 parit...,714.9506,9.0,3.0,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 parit...
4,,Hydrogen Cyanide,CHN,HCN,4.489625e+02,0.0000,714.9356,5.0,1.0,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=2 p...,714.9506,5.0,,ElecStateLabel=X v1=0 v2=1 l2=1 v3=0 J=1 F=2 p...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
931,,Hydrogen Cyanide,CHN,HCN,7.402896e+06,20.8773,10651.4728,513.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=85 kron...,10898.4068,519.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=86 kron...
932,,Hydrogen Cyanide,CHN,HCN,7.483843e+06,21.5714,10898.4068,519.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=86 kron...,11148.0410,525.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=87 kron...
933,,Hydrogen Cyanide,CHN,HCN,7.564614e+06,22.2788,11148.0410,525.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=87 kron...,11400.3693,531.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=88 kron...
934,,Hydrogen Cyanide,CHN,HCN,7.645208e+06,22.9986,11400.3693,531.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=88 kron...,11655.3860,537.0,3.0,ElecStateLabel=X v1=0 v2=0 l2=0 v3=0 J=89 kron...
