In [1]:
# from IPython.display import HTML

# HTML('''<script>
# code_show=true; 
# function code_toggle() {
#  if (code_show){
#  $('div.input').hide();
#  } else {
#  $('div.input').show();
#  }
#  code_show = !code_show
# } 
# $( document ).ready(code_toggle);
# </script>
# <form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')


# Processing spectra from VERITAS database

In this notebook we structure the pipeline that processes the publicly available
[VERITAS blazar spectra][VDB].
The goal is to collect the (ASCII) data files listed at the bottom of [that page][VDB]
and transform them into VO-compliant data structures.
Flux data should also be standardized to [CGS][CGS] units, $erg\,s^{-1}\,cm^{-2}$,
and wavelength/frequency to $Hz$.

[VDB]: http://veritas.sao.arizona.edu/veritas-science/veritas-blazar-spectra
[CGS]: https://en.wikipedia.org/wiki/Centimetre%E2%80%93gram%E2%80%93second_system_of_units

The table of interest at the [VERITAS page][VDB] is the last one on that page, which we call the
*target table* here. It has the following columns:

| Blazar | Publication | VERITAS results page | Ascii file with spectral data |
|--------|-------------|----------------------|-------------------------------|
| ...    | ...         | ...                  | ...                           |

Where:
* `Blazar` contains the name of the object and, possibly, a qualifier for the blazar's state of activity in parentheses after the name, e.g. `1ES2344+514 (high)`. This column is *not* unique!
* `Publication` is not a unique column either. It contains the identifier of the article where the respective data was published; this is *not* a link or a standard (e.g., DOI) value.
* `VERITAS results page` contains links to another (internal) page where plots from the article are provided; the link to the `ArXiv` version of the article is given there.
* `Ascii file with spectral data` contains links to the data tables.

The ASCII files have the following structure, as given by the header line of those files:
```
    //VHE points: E[TeV] phi[m-2 s-1 TeV-1] ephi_low ephi_up
```

All files have the same header line, which means all tables share the same metadata: column order and units are identical across files.
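
Since every file shares that header, a single parser should be enough for all of them. The block below is a minimal sketch, assuming the data rows that follow the `//`-prefixed header are whitespace-separated numbers. The function name `parse_veritas_ascii` and the sample values are made up for illustration; they are not taken from the VERITAS files.

```python
import re

def parse_veritas_ascii(text):
    """Split a VERITAS-like ASCII table into column metadata and numeric rows."""
    columns, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('//'):
            # Header line: keep what follows the colon,
            # e.g. " E[TeV] phi[m-2 s-1 TeV-1] ephi_low ephi_up"
            spec = line.split(':', 1)[1]
            # Each column is a name optionally followed by "[unit]"
            for name, unit in re.findall(r'(\w+)(?:\[([^\]]*)\])?', spec):
                columns.append((name, unit or None))
            continue
        rows.append([float(v) for v in line.split()])
    return columns, rows

# Hypothetical file content (the numbers are NOT real measurements)
sample = """//VHE points: E[TeV] phi[m-2 s-1 TeV-1] ephi_low ephi_up
0.5 1.0e-7 8.0e-8 1.2e-7
1.0 2.0e-8 1.5e-8 2.5e-8
"""
columns, rows = parse_veritas_ascii(sample)
```

Here `columns` holds `(name, unit)` pairs; the header gives no unit for `ephi_low` and `ephi_up`, so the sketch stores `None` for them and leaves it to the caller to reuse the unit of `phi`.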

**AlreadyGood**:
* all data files have the same structure, the same metadata and the same units;

**ToImprove**:
* the `Blazar` column should contain *only* the object name; observations such as the object's state of activity (e.g., high/low) are better placed in another column;
* `MJD` can be added as an extra column;
* `Publication` could keep its label as it is, but become a hyperlink to the ArXiv version;
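
Separating the object name from the activity qualifier could look like the sketch below. It assumes the qualifier, when present, is a single parenthesized suffix as in `1ES2344+514 (high)`; the helper name `split_blazar_label` is hypothetical.

```python
import re

def split_blazar_label(label):
    """Return (object_name, note) from a label such as '1ES2344+514 (high)'."""
    # Lazy name group, then an optional "(...)" suffix at the end of the string
    match = re.match(r'^\s*(.*?)\s*(?:\(([^)]*)\))?\s*$', label)
    name, note = match.group(1), match.group(2)
    return name, (note.strip() if note else None)

with_note = split_blazar_label('1ES2344+514 (high)')
without_note = split_blazar_label('1ES1218+304')
```

The first element would feed the `Blazar name` column and the second one the `Note` column; labels without a qualifier get `None` as their note.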

<div style="margin-top:30px; margin-bottom:30px;
            background-color:orange; color:black; opacity:0.75; padding:10px;">
    Each data file is assumed to be the result of <b>one</b>, contiguous observation.
</div>


The final result we are looking for is a master table with:

| Blazar name | ICRS position (x2) | MJD (x2) | Publication (hyperlink) | Data file (FITS) | Note |
|-------------|--------------------|----------|-------------------------|------------------|------|
| ...         | (RA & Dec)         | (Start & End) | (ArXiv)            | ...              | ...  |


The `ICRS position` is **not** provided; we have to query [Simbad][SIMBAD] for it.

`MJD` (epoch) is not directly provided either, but we should probably be able to get it from the other
tables on the [same page][VDB], or build a separate, local table from the articles.
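
Whichever source ends up providing the observation dates, they will most likely come as calendar dates and need converting. A minimal sketch with the standard library only, assuming dates at 00:00 UTC; the function name `date_to_mjd` is hypothetical:

```python
from datetime import date

# MJD 0 corresponds to 1858-11-17 00:00 UTC (MJD = JD - 2400000.5)
MJD_EPOCH = date(1858, 11, 17)

def date_to_mjd(year, month, day):
    """Integer MJD at 00:00 UTC of the given calendar date."""
    return (date(year, month, day) - MJD_EPOCH).days

mjd_2000 = date_to_mjd(2000, 1, 1)
```

Applying it to the start and end dates of an observation gives the two `MJD` values of the master table.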

The `Publication` (ArXiv) link should be parsed from the page linked in the
*target table*'s `VERITAS results page` column.

Finally, `Blazar name` and `Note` are already given, and `Data file` will be the result of formatting the inputs.

[SIMBAD]: http://simbad.u-strasbg.fr/simbad/

The results of this processing are publicly available through a [Virtual Observatory (VO)][IVOA] 
compatible framework at the [Brazilian Science Data Center (BSDC)][BSDCVO].
A [web interface][BSDCWEB] and an [SSAP service][BSDCSAP] are provided for accessing the data.

[IVOA]: http://www.ivoa.net/
[BSDCVO]: http://vo.bsdc.icranet.org/
[BSDCWEB]: http://vo.bsdc.icranet.org/veritas/q/web/form
[BSDCSAP]: http://vo.bsdc.icranet.org/veritas/q/ssa/info


## The workflow

This is the workflow we should implement below:

1. Retrieve the 'target table' from 'http://veritas.sao.arizona.edu/veritas-science/veritas-blazar-spectra'
1. Parse each line into
 * object name
 * note
 * publication "bibcode"
 * results url
 * data url
 * **Process data file**
1. Create the output table

*Process data file*:
1. retrieve ICRS position from object name
1. parse results page to get publication's ArXiv link
1. download ascii data file
 * transform first column, `E`, from $TeV$ to $Hz$
 * transform second column, `phi`, from $m^{-2} s^{-1} TeV^{-1}$ to $erg s^{-1} cm^{-2}$ (which requires multiplying by $E^2$)
 * transform 3rd and 4th columns, `ephi_low` and `ephi_up`, as done for `phi`
 * associate those columns with proper UCDs
1. write everything in a FITS file
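
The two unit transformations in the list above can be sketched with plain constants. This is only an illustration of the arithmetic, not the notebook's implementation (the code further down relies on `astropy.units`); the function names are hypothetical, and the flux conversion assumes the target quantity is $E^2 \phi$.

```python
# Constants (exact SI values, expressed in CGS-friendly form)
ERG_PER_TEV = 1.602176634       # 1 TeV = 1e12 eV and 1 eV = 1.602176634e-12 erg
PLANCK_ERG_S = 6.62607015e-27   # Planck constant h, in erg s
CM2_PER_M2 = 1.0e4              # 1 m^2 = 1e4 cm^2

def energy_tev_to_hz(e_tev):
    """Photon frequency from its energy: nu = E / h."""
    return e_tev * ERG_PER_TEV / PLANCK_ERG_S

def diff_flux_to_energy_flux(e_tev, phi):
    """E^2 * phi, converted from [TeV m-2 s-1] to [erg s-1 cm-2]."""
    return e_tev ** 2 * phi * ERG_PER_TEV / CM2_PER_M2

# Hypothetical data point: E = 1 TeV, phi = 1e-7 m-2 s-1 TeV-1
nu = energy_tev_to_hz(1.0)
flux = diff_flux_to_energy_flux(1.0, 1.0e-7)
```

The error columns `ephi_low` and `ephi_up` would go through `diff_flux_to_energy_flux` with the same energy value as the measurement they belong to.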


In [2]:
veritas_url = 'http://veritas.sao.arizona.edu'

In [3]:
import pandas

# first, set the display width to a smaller number than the default (80)
pandas.set_option('display.width', 50)

# print one column at a time ("per line")
pandas.set_option('display.max_colwidth', pandas.get_option('display.width') )

# columns' header justified at the left, next to index
# pandas.set_option('display.colheader_justify', 'left')

# limit the number of rows to just a few
pandas.set_option('display.max_rows', 10)

# import wget
# import bs4

def eprint(string):
#     from sys import stderr
#     print >> stderr, "{}".format(string)
#     stderr.flush()
    print "\nERROR:{}".format(string)

In [4]:
class HTMLBase(object):
    fields = {}

    def __init__(self,html):
        self._html = html

    def html(self):
        return self._html

    def extract_fields(self):
        assert False, "Not implemented. This is a base class."

        
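def check_table_header(row,fields):
    """
    Map each expected column label in 'fields' to its position in 'row'.

    'fields' is updated in place; returns True if every label was found.
    """
    found = dict.fromkeys(fields)
    cells = row.findAll('td')
    for i,cell in enumerate(cells):
        txt = cell.find(text=True)
        if txt is None:
            continue
        txt = txt.strip()
        if txt in found:
            found[txt] = i
    ok = all([ v is not None for v in found.values() ])
    if ok:
        fields.update(found)
    return ok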


def process_row(row,fields,get_doi=False):
    cells = row.findAll('td')
    if len(cells)==4:
        # Object name, possibly followed by a state qualifier, e.g. "(high)"
        _i_ = fields['Blazar']
        src = cells[_i_].find(text=True)
        src = src.strip()
        # Link to the VERITAS results page of the publication
        _i_ = fields['VERITAS results page']
        art = cells[_i_].find('a',href=True)
        url = art['href']
        if get_doi:
            # NOTE: 'get_doi_url' is not defined in this notebook
            url = get_doi_url(url)
        # We skip the 'Publication' column (plain-text reference).
        # Link to the ASCII data file, to be downloaded later on
        _i_ = fields['Ascii file with spectral data']
        data_cell = cells[_i_]
        ffile = data_cell.find('a',href=True)
        try:
            ffile = ffile['href']
        except TypeError:
            # no link found in the cell ('ffile' is None)
            ffile = None
        return (src,url,ffile)
    return None


class HTMLVeritasTable(HTMLBase):
    fields = {'Blazar'                       :0,
              'Publication'                  :1,
              'VERITAS results page'         :2,
              'Ascii file with spectral data':3}

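    def extract_fields(self):
        """
        Return a dictionary with the 'Object', 'URL' and 'File' columns
        read from the table rows, or None if the header is not recognized.
        """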
        _table = {'Object':[], 'URL':[], 'File':[]}

        for i,row in enumerate(self._html.findAll('tr')):
            if i==0:
                ok = check_table_header(row,self.fields)
                if not ok:
                    return None
                continue
            vals = process_row(row,self.fields.copy())
            if vals is not None:
                src,url,ffile = vals
                _table['Object'].append(src)
                _table['URL'].append(url)
                _table['File'].append(ffile)
        return _table
    
class Web(object):
    url   = './'
    table = {}
    _html = None
    HTML_parser = None

    def __init__(self,url,HTMLParserClass):
        self.url = url
        self.HTML_parser = HTMLParserClass
    
    def __str__(self):
        return str(self.html)

    def get_table(self,table=None):
        from bs4 import BeautifulSoup as BS
        import urllib2

        if table is not None and isinstance(table,dict):
            self.table = table
            
        assert self.url is not None, self.url
        assert isinstance(self.url,(str,unicode)), type(self.url)

        soup = BS(urllib2.urlopen( self.url ).read(),"html.parser")
        table = soup.find('table', self.table )
        self._html = self.HTML_parser(table)

    @property
    def html(self):
        return self._html.html()
    
    def get_table_fields(self):
        return self._html.extract_fields()
    

### Input/Web table

In [5]:
web = Web('http://veritas.sao.arizona.edu/veritas-science/veritas-blazar-spectra',HTMLVeritasTable)
web.get_table(
    table={'style':"border: 1px solid #000000; height: 587px;",
           'border':"1",
           'width':"669",
           'frame':"vsides",
           'align':"center"
           }
)
from IPython.display import HTML
HTML(unicode(web.html))

0,1,2,3
Blazar,Publication,VERITAS results page,Ascii file with spectral data
1ES0229+200,"Ap.J. 782, 13 (2014)",results,Ascii data
1ES0414+009,"Ap.J. 755, 118 (2012)",results,Ascii data
1ES0806+524,"Ap.J. 690, L126 (2009)",results,Ascii data
1ES1215+303,"Ap.J. 779, 92 (2013)",results,Ascii data
1ES1218+304,"Ap.J. 695, 1370 (2009)",results,Ascii data
1ES1218+304,"Ap.J. 709, L163 (2010)",results,Ascii data
1ES1959+650,"Ap.J. 775, 3 (2013)",results,Ascii data
1ES2344+514 (high),"Ap.J. 738, 169 (2011)",results,Ascii data
1ES2344+514 (low),"Ap.J. 738, 169 (2011)",results,Ascii data


In [6]:
f = web.get_table_fields()

In [7]:
import pandas
class Local(pandas.DataFrame):
    def __init__(self,table):
        super(Local,self).__init__(table)
    
    def describe(self):
        print super(Local,self).describe()
        print "\n-> Has Nil?"
        hows_nil = self.isnull().any()
        print hows_nil
        for c in hows_nil.index:
            if not hows_nil[c]: continue
            print "\n-> Indexes where column '{}' is null:".format(c)
            print self[self[c].isnull()].index.values


In [8]:
table = Local(f)
table.describe()
# print table

                                                     File  \
count                                                  29   
unique                                                 28   
top     /documents/PKS1424+240_VERITAS_2009_2014ApJ......   
freq                                                    2   

             Object  \
count            29   
unique           26   
top     PKS1424+240   
freq              3   

                                                      URL  
count                                                  29  
unique                                                 19  
top     http://veritas.sao.arizona.edu/veritas-science...  
freq                                                    7  

-> Has Nil?
File      False
Object    False
URL       False
dtype: bool


In [9]:
def clean_dir(_dir,ext):
    import os
    from glob import glob
    if not os.path.exists(_dir):
        os.mkdir(_dir)
    if os.path.isdir(_dir):
        files = glob(os.path.join(_dir,ext))
        for f in files:
            os.remove(f)

class Download(object):
    def __init__(self,outdir,clean=True):
        import os
        if not os.path.exists(outdir):
            os.mkdir(outdir)
        self._outdir = outdir
        if clean:
            self.clean_outdir()
        self._md5 = os.path.join(self._outdir,'md5sum')
        
    def download(self,url):
        import wget
        filename = wget.download(url,out=self._outdir)
        return filename

    def clean_outdir(self,ext="*"):
        _dir = self._outdir
        clean_dir(_dir,ext)
                

    def create_md5sum_file(self,files_list,_dir=None):
        import os
        import hashlib
        # 'self._md5' already includes the output directory
        md5txt = os.path.join(_dir,'md5sum') if _dir else self._md5
        md5 = {}
        for fname in files_list:
            # hash the file content
            with open(fname,'rb') as fp:
                md5[fname] = hashlib.md5(fp.read()).hexdigest()
        # one "<hash>    <filename>" entry per line
        with open(md5txt,'w') as fp:
            for _file,_hash in md5.items():
                fp.write("%s    %s\n"%(_hash,_file))
        return md5

    def is_exist_files(self,files_list,_dir=None):
        import os
        if _dir is None:
            _dir = self._outdir
        md5_file = self._md5
        
        # First we see if there is a file list (md5sum) to look for
        def check_md5sum(files_list,md5_file):
            if os.path.isfile(md5_file):
                md5 = read_md5sum_file(md5_file)
                md5_files_list = md5.keys()
                leng_inters = len(set(md5_files_list).intersection(files_list))
                return leng_inters == len(files_list)
            # If there is *no* md5-file, return *None*
            return None

        # Also, check if the files are actually there (inside the dir)
        def check_glob(files_list,_dir):
            files_ext = '*.txt'
            # 'read_dir_content' returns basenames, so compare basenames
            dir_files_list = read_dir_content(_dir,files_ext)
            leng_matches = sum(os.path.basename(v) in dir_files_list for v in files_list)
            return leng_matches == len(files_list)

        md5_check = check_md5sum(files_list,md5_file)
        if md5_check in (True,False):
            return md5_check
        glob_check = check_glob(files_list,_dir)
        if glob_check:
            self.create_md5sum_file(files_list,_dir=_dir)
        return glob_check

def read_dir_content(_dir,ext='*'):
    import os
    from glob import glob
    # list the files in '_dir' matching the pattern 'ext'; return basenames only
    dir_files_list = glob(os.path.join(_dir,ext))
    return [ os.path.basename(f) for f in dir_files_list ]

def read_md5sum_file(md5txt):
    import os
    assert os.path.isfile(md5txt)

    md5_hashs,md5_files = [],[]
    with open(md5txt,'r') as mdf:
        for line in mdf.readlines():
            _h,_f = line.split(None,1)
            md5_hashs.append(_h.strip())
            md5_files.append(_f.strip())
    md5 = dict(zip(md5_files,md5_hashs))
    return md5

In [10]:
download_dir = 'data/'
download_handler = Download(download_dir)

In [11]:
import os

_files = table['File'].dropna().apply(lambda f:os.path.join(download_dir,os.path.basename(f)))
if download_handler.is_exist_files(_files):
    print("FITS files exist locally. Passing by download step..")
    # '_files' already includes 'download_dir' (see above); no need to join it again
else:
    print("FITS files do not exist locally. Downloading them...")
    furls = veritas_url + table['File']
    _files = furls.apply(lambda f: download_handler.download(f))
    md5s = download_handler.create_md5sum_file(_files)
    del furls

table['File'] = _files

FITS files do not exist locally. Downloading them...


In [12]:
print table

                                                 File  \
0   data//1ES0229+200_VERITAS_2009_2012_2014ApJ......   
1   data//1ES0414+009_VERITAS_2008-2011_2012ApJ......   
2   data//1ES0806+524_VERITAS_2006-2008_2009ApJ......   
3   data//1ES1215+303_VERITAS_2008-2012_2013ApJ......   
4   data//1ES1218+304_VERITAS_2007_2009ApJ...695.1...   
..                                                ...   
24  data//RGBJ0710+591_VERITAS_2008-2009_2010ApJ.....   
25  data//RXJ0648.7+1516_VERITAS_2010_2011ApJ...74...   
26  data//VERJ0521+211_VERITAS_2009-2010_2013ApJ.....   
27  data//WComae_VERITAS_2008-01-04_2008ApJ...684L...   
28         data//Mrk421_VERITAS_2011ApJ...738..25.txt   

            Object  \
0      1ES0229+200   
1      1ES0414+009   
2      1ES0806+524   
3      1ES1215+303   
4      1ES1218+304   
..             ...   
24     RGB0710+591   
25  RXJ0648.7+1516   
26     VER0521+211   
27          WComae   
28         Mrk 421   

                                                  

In [13]:
# Now we can process the fits files themselves.
# We start by noting that we want the SPECTRUM Data Unit(s)
#  available (or not) in the fits files; discard the other DUs.
# Things we want to do:
# - get the OBJECT name
# - get each object's position
# - get the observation date
# - transform the data vectors (x) to frequency(Hz) and (y) to flux(erg/s/cm2)
# Then we should follow this workflow:
# - open the fits file
# - find the necessary data unit (SPECTRUM)
# - open its header
#  - get some keywords from the header
# - open its data; data here are vectors
#  - it can be from 2 to 4 vectors
#   - energy
#   - flux
#   - Denergy
#   - Dflux
#  - convert the ?energy vectors to 'Hz' units
#  - convert the ?flux vectors to 'erg/s/cm2' units

# Here we just define the functions we'll need..
def resolve_name(name):
    from astropy.coordinates import get_icrs_coordinates as get_coords
    try:
        icrs = get_coords(name)
        pos = (icrs.ra.value,icrs.dec.value)
    except:
        pos = None
    return pos

def fix_dateobs(date):
    try:
        dt = str(date).split('-')
        y = int(dt[0])
    except:
        return '1999-01-01'
    try:
        m = int(dt[1])
    except:
        m = 1
    try:
        d = int(dt[2])
    except:
        d = 1
    return '{:4d}-{:02d}-{:02d}'.format(y,m,d)

def merge_header_keywords(header_p,header_s):
    # Extension's header has the highest priority; keywords there
    # should not be overwritten. Relevant keywords are the ones in:
    # 'FITS_KEYWORDS'
    f_header = {'COMMENT':[]}
    _kw = list(set(header_p.keys()).intersection(FITS_KEYWORDS))
    for k in _kw:
        f_header.update({k : header_p[k]})
    if 'COMMENT' in header_p.keys():
        f_header['COMMENT'].extend(header_p['COMMENT'])
    _kw = list(set(header_s.keys()).intersection(FITS_KEYWORDS))
    for k in _kw:
        f_header.update({k : header_s[k]})
    if 'COMMENT' in header_s.keys():
        f_header['COMMENT'].extend(header_s['COMMENT'])
    return f_header
    
def trans_data(table):
    import numpy as np
    from astropy import units
    Unit = units.Unit
    
    units.set_enabled_equivalencies(units.spectral())
    uEn = Unit('Hz')
    uFn = Unit('erg s-1 cm-2')
    uEc = Unit('TeV')
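    conv = {Unit('ph TeV s-1 cm-2') : lambda x,y: (x/Unit('ph')).to(uFn),
            Unit('ph TeV-1 s-1 cm-2') : lambda x,y: ((y.to(uEc)**2)*(x/Unit('ph'))).to(uFn),
            Unit('ph s-1 cm-2') : lambda x,y: None,
            # energies: VERITAS files give 'E' in TeV (see the files' header line)
            Unit('TeV') : lambda x: x.to(uEn, equivalencies=units.spectral()),
            Unit('GeV') : lambda x: x.to(uEn, equivalencies=units.spectral())}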

    vE = table['energy']
    uE = vE.unit
    vEn = conv[uE](vE)

    vF = table['flux']
    uF = vF.unit
    vFn = conv[uF](vF,vE)

    if vFn is None:
        print("Flux data could not be transformed. Unrecognised units ({})?".format(uF.to_string()))
        return False

    def set_null(column,null_expression,new_null_value=-999):
        _idx = np.where(null_expression(column))
        column[_idx] = new_null_value
        column.null = new_null_value
        
    nullval = -999
    new_nullval = nullval
    
    table['energy'] = vEn
    table['energy'].unit = vEn.unit
    set_null( table['energy'], lambda x:x==0.0)
    table['flux'] = vFn
    table['flux'].unit = vFn.unit
    set_null( table['flux'], lambda x:x==0.0)
    set_null( table['flux'], lambda x:x>0.001)

    if 'Denergy' in table.colnames:
        vDE = table['Denergy']
        uDE = vDE.unit
        vDEn = conv[uDE](vDE)
        table['energy_error'] = vDEn
        table['energy_error'].unit = vDEn.unit
        set_null( table['energy_error'], lambda x:x==0.0)
        del table['Denergy']
    else:
        uDE = table['energy'].unit
        vDEn = np.asarray([nullval]*len(vE),dtype=int)
        table['energy_error'] = vDEn
        table['energy_error'].unit = uDE
        table['energy_error'].null = nullval

    if 'Dflux' in table.colnames:
        vDF = table['Dflux']
        uDF = vDF.unit
        vDFn = conv[uDF](vDF,vE) # Notice we use the energy bin/value of the measurement.
        table['flux_error'] = vDFn
        table['flux_error'].unit = vDFn.unit
        set_null( table['flux_error'], lambda x:x==0.0)
        del table['Dflux']
    else:
        uDF = table['flux'].unit
        vDFn = np.asarray([nullval]*len(vE),dtype=int)
        table['flux_error'] = vDFn
        table['flux_error'].unit = uDF
        table['flux_error'].null = nullval

    return True

# def header_to_dict(header):
#     from collections import OrderedDict
#     out = OrderedDict()
#     for card in header.cards:
#         k = card[0]
#         v = card[1]
#         c = card[2]
#         out[k] = v   # 'c' is out for the time being
#     return out

In [14]:
def proc_source(filename,source_name,url_results):
    """
    Returns a (plain) list with all valid spectra in it.
    """
    
    print "\n======================================================================"
    print "Taking file: ",filename
    print "Source name: ",source_name

    # Clean source name
    source_name = clean_source_name(source_name)
    
    # Parse the results page to 
    # Open the data file, if something goes wrong, return 'None'
    data = read_file(filename)
    
    if not data:
        eprint("***: File opening failed. Moving on.")
        return None
    
    # Verify whether what we have from the first (and only) header line is OK
    # (basically, check whether units make sense and are compatible to what we want)
    is_metadata_ok = verify_metadata(data)
    if not is_metadata_ok:
        eprint("***: Something is wrong with {}:{} metadata.".format(source_name,filename))
        return None
    
    # Transform our data (energy and flux) to what we want
    data.transform_data()
    
    # VERITAS data files are not expected to contain 'NA' values,
    # which means 'del_rows' should be empty.
    del_rows = data.dropna(columns=['energy','flux'],na_value=-999)
    assert not del_rows, 'VERITAS data files should not present NA values!'

    print "\nOutput filename: {}".format(data.suggest_output_filename(output_dir=''))
    print "**********************************************************************"
    # Each data file holds one spectrum; return it as a (plain) list
    return [data]
    


### Files processing logfile

In [15]:
table['SPECTRUM'] = table.apply(lambda x: proc_source(x.File,x.Object,x.URL), axis=1)


Taking file:  data//1ES0229+200_VERITAS_2009_2012_2014ApJ...782...13A.txt
Source name:  1ES0229+200

Taking file:  data//1ES0229+200_VERITAS_2009_2012_2014ApJ...782...13A.txt
Source name:  1ES0229+200


NameError: ("global name 'clean_source_name' is not defined", u'occurred at index 0')

In [None]:
table

In [None]:
def fix_degeneracy(group):
    from collections import OrderedDict
    row = group.iloc[0]  # 'irow' is deprecated in pandas; 'iloc' is the equivalent
    columns = row.to_dict()
    specs = row['SPECTRUM']
    del columns['SPECTRUM']
    tdf = OrderedDict()
    tdf['OBJECT'] = []
    tdf['RA'] = []
    tdf['DEC'] = []
    tdf['DATE-OBS'] = []
    tdf['SPECTRUM'] = []
    tdf['SRCPOS1'] = []
    tdf['SRCPOS2'] = []
    tdf['EBL_CORR'] = []
    cnt = 0
    for s in specs:
        t = s.retrieve_table()
        tdf['RA'].append( t.meta['RA'] )
        tdf['DEC'].append( t.meta['DEC'] )
        tdf['OBJECT'].append( t.meta['OBJECT'] )
        tdf['DATE-OBS'].append( t.meta['DATE-OBS'] )
        tdf['SPECTRUM'].append( s )
        tdf['SRCPOS1'].append( t.meta['SRCPOS1'] )
        tdf['SRCPOS2'].append( t.meta['SRCPOS2'] )
        tdf['EBL_CORR'].append( t.meta['EBL_CORR'] )
        cnt += 1
    for c in columns:
        tdf[c] = [ row[c] ] * cnt
    return Local( tdf )

table_proc = table.dropna().groupby('URL',group_keys=False).apply(fix_degeneracy).reset_index(drop=True)

In [None]:
table_proc

In [None]:
outdir = 'FITS_out/'
clean_dir(outdir,'*')

for _spec in table_proc.SPECTRUM:
    write_to_fits(_spec,outdir)
#_bla = table_proc.apply(lambda d:write_to_fits(d.SPECTRUM,outdir),axis=1)
#del _bla

table_proc.describe()

In [None]:
import pandas as pd
pd.set_option('display.max_rows',200)
pd.set_option('display.max_columns',10)
pd.set_option('display.width',500)

print table_proc


In [None]:
table_final = table_proc.dropna()[['OBJECT','RA','DEC','URL','FITS','DATE-OBS']]
print table_final