# Suggestion engine of UCDs based on common metadata descriptors

Unified Content Descriptors (UCDs) are machine-readable descriptions of astronomical data.
In astronomical data tables, together with column name, description, and data type, they compose the set of metadata used to describe the content (_i.e._, data) of a dataset.
Whereas column name and description are free-text components (any words of the English vocabulary can be used to name and describe the columns of a data table), UCDs are built from a minimal set of words (or _atoms_) arranged under a specific set of rules (a grammar, if you will).
UCDs are one of the products of the Semantics Working Group of the International Virtual Observatory Alliance (IVOA) and can be used in any Virtual Observatory (VO) data resource.

For example, a column of a (VO) catalog (_table_ in astronomical jargon) that provides an external URL for each row of the table may be called '`access_url`', have the description "`URL used to access (download) dataset`", and carry the UCD '`meta.ref.url`' (with datatype '`string`').
On the one hand, the (English) high-level _description_ of the column, or even its _name_, states clearly to a human what the column contains. On the other hand, a machine (_i.e._, software) would have a hard time "understanding" the (human) description. This is where UCDs play their role: properly set, they let a machine understand that `meta.ref.url` contains some _meta_ information (a non-astrophysical quantity) _referring_ to another resource that happens to be a _URL_.
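
To make this structure concrete, here is a minimal sketch of how software might split a UCD into its parts. It assumes only the conventions visible in the examples of this note: words separated by `;` and atoms within a word separated by `.`. The helper name `parse_ucd` is hypothetical and does not belong to any IVOA library.

```python
def parse_ucd(ucd):
    """Split a UCD into words (';'-separated), each a list of atoms ('.'-separated).

    Hypothetical helper for illustration; it does not validate the UCD
    against the official vocabulary or grammar.
    """
    words = [w.strip() for w in ucd.split(';') if w.strip()]
    return [w.split('.') for w in words]


print(parse_ucd('meta.ref.url'))   # one word made of three atoms
print(parse_ucd('meta.id;src'))    # two words; the first one is the primary word
```

The first element of the returned list corresponds to the "primary" word that `read_columns_ucd` keeps further down in this note.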

The full explanation of UCDs, their vocabulary and grammar, can be found at http://www.ivoa.net/documents/latest/UCDlist.html and associated documents. The Data section of this document presents other examples of such metadata sets.

Although the UCD vocabulary is relatively small and the grammar imposes clear logical rules, data providers (especially newcomers) may find assigning UCDs hard and time-consuming. It is much like choosing the keywords for a scientific paper: we know the words, but representing the content of the article in a handful of logically connected keywords demands some level of librarianship.

The goal of this note is to apply machine learning (ML) techniques to help the owner of a dataset (_i.e._, catalog) fill in its metadata fields, in particular the UCD field.
To do so, we will use natural language processing to clean and normalize the data so that we can feed them to ML algorithms.
We will use a couple of commonly used ML algorithms, Support Vector Machines (SVM) and Naive Bayes, to test our predictive hypothesis.
The data we are going to use relate to VO Simple Cone Search (SCS) services, which provide catalogs of astronomical objects (galaxies, stars, planets) of any, non-specific, content.

* remove empty column names and empty descriptions, together with their respective UCDs, from the data
* remove "None" UCDs
* display cleaned versions of column names and descriptions
* present the first 3 results of the predictions
* check if the 'truth' value is among the first 3 predicted results

In [1]:
# The following pipeline was taken from
# https://medium.com/towards-data-science/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
# An analogous, but more qualitative, text is
# https://medium.com/moosend-engineering-data-science/how-to-build-a-machine-learning-industry-classifier-5d19156d692f
#
def read_columns(config,parameter):
    '''
    Return list of 'parameter' from sections in 'config'
    '''
    sections = list(config.keys())
    sections.sort()
    return [ config[s].get(parameter) for s in sections ]


def read_columns_name(config):
    '''
    Return "clean" 'columns' (names) from 'config'
    
    "clean" means every character outside '0-9a-zA-Z+-/*.' is replaced by a space
    '''
    import re
    name_columns = read_columns(config,'columns')
    out = []
    for columns in name_columns:
        blk = []
        for i,column in enumerate(columns):
            # Raw string avoids invalid-escape warnings; the character class is unchanged
            clean_column = re.sub(r'[^0-9a-zA-Z+\-/*.]',' ',column).strip()
            blk.append(clean_column)
        out.append(blk)
    return out

def read_columns_description(config):
    '''
    Return "clean" 'descriptions' from 'config'
    
    "clean" means every character outside '0-9a-zA-Z ' is removed
    '''
    import re
    desc_columns = read_columns(config,'descriptions')
    out = []
    nil = []
    for columns in desc_columns:
        blk = []
        for i,column in enumerate(columns):
            try:
                clean_column = re.sub('[^0-9a-zA-Z ]','',column).strip()
            except TypeError:
                # Missing (None) descriptions cannot be cleaned; record them as empty
                nil.append((i,column))
                clean_column = ''
            blk.append(clean_column)
        out.append(blk)
    print("{:d} empty columns".format(len(nil)))
    return out

def read_columns_ucd(config):
    '''
    Return "primary" 'ucds' from 'config'
    
    "primary" means the first UCD word, i.e. the text before the first ';'
    '''
    ucd_columns = read_columns(config,'ucds')
    out = []
    for columns in ucd_columns:
        blk = []
        for i,column in enumerate(columns):
            primary_ucd = column.split(';')[0]
            blk.append(primary_ucd)
#             blk.append(column)
        out.append(blk)
    return out

def remove_empty_pairs(ucds, other):
    '''
    Remove ucds/other pairs when either the UCD or 'other' is empty or "none"
    '''
    assert len(ucds) == len(other)
    ucds_clean = ucds[:]
    other_clean = other[:]
    empty_ucd_indexes = [i for i,u in enumerate(ucds) if u.lower() == "none" or u.strip() == '']
    print("Number of 'none' UCDs:",len(empty_ucd_indexes))
    empty_other_indexes = [i for i,o in enumerate(other) if o.lower() == "none" or o.strip() == '']
    print("Number of 'none' columns:",len(empty_other_indexes))
    # Use a set: an index flagged in both lists must be popped only once,
    # otherwise a valid neighbouring pair would be removed as well
    empty_indexes = sorted(set(empty_ucd_indexes + empty_other_indexes), reverse=True)
    for i in empty_indexes:
        ucds_clean.pop(i)
        other_clean.pop(i)
    return ucds_clean,other_clean

## Data

We are going to use metadata of VO cone search (SCS) catalogs from the [VizieR](https://vizier.u-strasbg.fr/viz-bin/VizieR) catalogue service, as it provides a curated set of resources.

In [2]:
import json

with open('data/radio/CATALOGS.json','r') as f:
    config = json.load(f)

print("The number of catalog metadata sets we are going to use:",len(config))

The number of catalog metadata sets we are going to use: 1367


In [3]:
print("Example of metadata sets:\n")

# from random import shuffle
# keys = list(config.keys())
# shuffle(keys)

import json
for k in config.keys():
    print(json.dumps(config[k], indent=4))
    break

Example of metadata sets:

{
    "title": "CSIRO ASKAP Science Data Archive Cone Search Service",
    "url": "https://casda.csiro.au/casda_vo_tools/scs/obscore?",
    "ivoid": "ivo://au.csiro/casda/scs",
    "creators": [
        "CSIRO"
    ],
    "description": "Cone search service for querying catalogues from ASKAP radio astronomy observations",
    "columns": [
        "obs_id",
        "access_url",
        "target_name",
        "s_ra",
        "s_dec"
    ],
    "ucds": [
        "ID_MAIN",
        "meta.ref.url",
        "meta.id;src",
        "POS_EQ_RA_MAIN",
        "POS_EQ_DEC_MAIN"
    ],
    "units": [
        "None",
        "None",
        "None",
        "deg",
        "deg"
    ],
    "descriptions": [
        "Observation ID",
        "URL used to access (download) dataset",
        "Astronomical object observed, if any",
        "Central right ascension, ICRS",
        "Central declination, ICRS"
    ]
}


In [4]:
ucd_columns = read_columns_ucd(config)

UCDs we have:

In [5]:
ucd_columns

[['ID_MAIN',
  'meta.id.assoc',
  'POS_EQ_RA_MAIN',
  'POS_EQ_DEC_MAIN',
  'phot.flux.density',
  'phot.flux.density',
  'None'],
 ['ID_MAIN',
  'POS_EQ_RA_MAIN',
  'POS_EQ_DEC_MAIN',
  'phot.flux.density',
  'stat.error',
  'stat.error',
  'phot.flux.density',
  'stat.error',
  'stat.error',
  'None'],
 ['ID_MAIN',
  'POS_EQ_RA_MAIN',
  'POS_EQ_DEC_MAIN',
  'stat.error',
  'stat.error',
  'phot.flux.density',
  'stat.error',
  'phot.flux.density',
  'stat.error',
  'None'],
 ['meta.id',
  'None',
  'phot.flux.density',
  'stat.error',
  'meta.id.cross',
  'ID_MAIN',
  'POS_EQ_RA_MAIN',
  'POS_EQ_DEC_MAIN',
  'src.redshift',
  'phot.mag',
  'stat.error',
  'None'],
 ['ID_MAIN',
  'POS_EQ_RA_MAIN',
  'POS_EQ_DEC_MAIN',
  'phot.flux.density',
  'stat.error',
  'phot.flux.density',
  'stat.error',
  'src.class',
  'None'],
 ['ID_MAIN',
  'POS_EQ_RA_MAIN',
  'POS_EQ_DEC_MAIN',
  'phot.flux.density',
  'stat.error',
  'pos.eq.ra',
  'pos.eq.dec',
  'phot.flux.density',
  'stat.error',
  'No

## Classifying the column names

In [6]:
import numpy as np

name_columns = read_columns_name(config)
assert len(name_columns) == len(ucd_columns)

names = [ n for names in name_columns for n in names ]
np.array(names).shape

(24511,)

Features we have:

In [7]:
names

['name',
 'at20g name',
 'ra',
 'dec',
 'meas flux 148 ghz',
 'flux 148 ghz',
 'Search Offset',
 'name',
 'ra',
 'dec',
 'flux 148 ghz',
 'flux 148 ghz neg err',
 'flux 148 ghz pos err',
 'flux 218 ghz',
 'flux 218 ghz neg err',
 'flux 218 ghz pos err',
 'Search Offset',
 'name',
 'ra',
 'dec',
 'ra error',
 'dec error',
 'flux 20 cm',
 'flux 20 cm error',
 'int flux 20 cm',
 'int flux 20 cm error',
 'Search Offset',
 'source number',
 'aegis20 name',
 'int flux 20 cm',
 'int flux 20 cm error',
 'counterpart id',
 'name',
 'ra',
 'dec',
 'redshift',
 'irac 3p6 um mag',
 'irac 3p6 um mag error',
 'Search Offset',
 'name',
 'ra',
 'dec',
 'flux 2 cm',
 'flux 2 cm error',
 'int flux 2 cm',
 'int flux 2 cm error',
 'source type',
 'Search Offset',
 'name',
 'ra',
 'dec',
 'flux 2 cm',
 'flux 2 cm error',
 'centroid ra',
 'centroid dec',
 'int flux 2 cm',
 'int flux 2 cm error',
 'Search Offset',
 'name',
 'ra',
 'ra error',
 'dec',
 'dec error',
 'flux 863 mhz',
 'flux 863 mhz error',
 'in

In [8]:
target_ucd = [ u for ucds in ucd_columns for u in ucds ]
print(len(target_ucd))

24511


Now we map the UCDs to numerical values.

In [9]:
d_ucd2id = { u:i for i,u in enumerate(set(target_ucd)) }
d_id2ucd = { d_ucd2id[u]:u for u in d_ucd2id }
target_id = [ d_ucd2id[u] for u in target_ucd ]
assert len(target_ucd) == len(target_id)

Scikit-learn provides a set of tools to transform text data into numerical values, where the values are weighted according to their frequency and then normalized.

Counting and weighting terms matters for assessing their relevance: words that are common to many texts get a lower weight because they do little to distinguish one text from another. This process is done by the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (frequency counter) and [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) (weighting).
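
As a small illustration of these two steps, the sketch below runs them on four made-up column names (toy inputs chosen for this example, not taken from the catalogs). It uses the default settings of both transformers, which differ slightly from the pipeline defined next (no stop-word list here).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy column names, for illustration only
docs = ['flux 20 cm', 'flux 20 cm error', 'ra error', 'dec error']

# Step 1: count how often each term occurs in each text
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Step 2: down-weight terms that occur in many texts, then normalize each row
tfidf = TfidfTransformer(use_idf=True)
weights = tfidf.fit_transform(counts)

# Inverse document frequency per term: the more texts a term appears in, the lower it is
idf = {term: tfidf.idf_[col] for term, col in vect.vocabulary_.items()}
for term in sorted(idf, key=idf.get):
    print('{:6s} {:.3f}'.format(term, idf[term]))

# Each row of the weighted matrix has unit Euclidean length
print(np.linalg.norm(weights.toarray(), axis=1))
```

Here `error` appears in three of the four texts and therefore receives the lowest inverse document frequency, while `ra` and `dec`, which appear only once each, receive the highest.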

In [10]:
## Pipeline
# Naive Bayes

def MNB(features, targets, use_idf=True):
    # Multinomial Naive Bayes
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline

    text_clf_nb = Pipeline([('vect', CountVectorizer(stop_words='english')),
                        ('tfidf', TfidfTransformer(use_idf=use_idf)),
                        ('clf', MultinomialNB()),
    ])
    text_clf_nb = text_clf_nb.fit(features,targets)
    return text_clf_nb


def CNB(features, targets, use_idf=True):
    # Complement Naive Bayes
    from sklearn.naive_bayes import ComplementNB
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline

    text_clf_nb = Pipeline([('vect', CountVectorizer(stop_words='english')),
                        ('tfidf', TfidfTransformer(use_idf=use_idf)),
                        ('clf', ComplementNB()),
    ])
    text_clf_nb = text_clf_nb.fit(features,targets)
    return text_clf_nb


def best_prediction(classifier):
    return lambda w:d_id2ucd.get(classifier.predict([w])[0])


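def top_predictions(classifier):
    def top3(text):
        probs = classifier.predict_proba([text]).flatten()
        indxs = probs.argsort()
        # Probability columns follow classifier.classes_, so map through it
        # instead of assuming that column index and class id coincide
        return [(d_id2ucd.get(classifier.classes_[i]),probs[i]) for i in indxs[-3:]][::-1]
    return top3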


def print_predict(predicted):
    for u,p in predicted:
        print('{} : {:.5f}'.format(u,p))


def test_evaluate(classifier, features, targets, test_size=0.2):
    from sklearn.model_selection import train_test_split
    import pandas as pd

    assert classifier.lower() in ('mnb','cnb')

    x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=test_size)
    
    if classifier.lower() == 'mnb':
        text_clf_nb = MNB(x_train,y_train)
    else:
        text_clf_nb = CNB(x_train,y_train)
    
    predicted = [d_id2ucd[i] for i in text_clf_nb.predict(x_test)]
    assert len(predicted)==len(y_test)

    df_eval = pd.DataFrame([(predicted[i],d_id2ucd[y_test[i]]) 
                            for i in range(len(y_test))], 
                           columns=['predicted','truth'])
    return df_eval

In [11]:
mnb = MNB(names, target_id)
mnb_best_predict = best_prediction(mnb)
mnb_top_predicts = top_predictions(mnb)

In [12]:
mnb_best_predict('magnitude')

'stat.error'

In [13]:
p = mnb_top_predicts('magnitude')
print_predict(p)

stat.error : 0.09947
phot.flux.density : 0.08074
POS_EQ_DEC_MAIN : 0.05569


In [14]:
df_eval = test_evaluate('mnb', names, target_id, 0.2)

sum(df_eval.predicted == df_eval.truth)/len(df_eval)

0.4126045278400979

In [15]:
cnb_name = CNB(names, target_id)
cnb_name_best_predict = best_prediction(cnb_name)
cnb_name_top_predicts = top_predictions(cnb_name)

In [16]:
cnb_name_best_predict('magnitude')

'phys.area'

In [17]:
p = cnb_name_top_predicts('magnitude')
print_predict(p)

POS_EQ_DEC_MAIN : 0.00291
instr.background : 0.00291
INST_DET_SIZE : 0.00291


In [18]:
df_eval = test_evaluate('cnb', names, target_id, 0.2)

sum(df_eval.predicted == df_eval.truth)/len(df_eval)

0.4495207016112584

## Classifying column descriptions

In [19]:
desc_columns = read_columns_description(config)
assert len(desc_columns) == len(ucd_columns)

descriptions = [ d for desc in desc_columns for d in desc ]
np.array(descriptions).shape

171 empty columns


(24511,)

In [20]:
mnb_desc = MNB(descriptions, target_id)
mnb_desc_best_predict = best_prediction(mnb_desc)
mnb_desc_top_predicts = top_predictions(mnb_desc)

In [21]:
mnb_desc_best_predict('magnitude')

'phot.mag'

In [22]:
p = mnb_desc_top_predicts('magnitude')
print_predict(p)

phot.mag : 0.35827
stat.error : 0.12935
phot.flux.density : 0.12434


In [23]:
df_eval = test_evaluate('mnb', descriptions, target_id, 0.2)

sum(df_eval.predicted == df_eval.truth)/len(df_eval)

0.6865184580868856

In [24]:
cnb_desc = CNB(descriptions, target_id)
cnb_desc_best_predict = best_prediction(cnb_desc)
cnb_desc_top_predicts = top_predictions(cnb_desc)

In [25]:
cnb_desc_best_predict('magnitude')

'phot.mag'

In [26]:
p = cnb_desc_top_predicts('magnitude')
print_predict(p)

phot.mag : 0.01266
phys.magAbs : 0.00305
None : 0.00292


In [27]:
df_eval = test_evaluate('cnb', descriptions, target_id, 0.2)

sum(df_eval.predicted == df_eval.truth)/len(df_eval)

0.7358759942892107
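
One of the goals listed at the start of this note is to check whether the 'truth' value is among the first 3 predicted results. A minimal sketch of such a check follows, assuming a fitted scikit-learn classifier that exposes `predict_proba` and `classes_`. The function name `top_k_hits` and the toy training texts are hypothetical and serve only to keep the example self-contained; the pipeline mirrors the `MNB` function above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def top_k_hits(classifier, texts, truths, k=3):
    """Boolean array: is the true label among the k most probable classes?"""
    probs = classifier.predict_proba(texts)
    # argsort is ascending, so the last k columns hold the k highest probabilities
    top_k = classifier.classes_[np.argsort(probs, axis=1)[:, -k:]]
    return np.array([truth in row for truth, row in zip(truths, top_k)])


# Toy data (made up for this sketch, not taken from the catalogs)
texts = ['right ascension', 'right ascension j2000',
         'declination', 'declination j2000',
         'flux density', 'flux density 20 cm',
         'flux error', 'flux density error']
labels = ['pos.eq.ra', 'pos.eq.ra',
          'pos.eq.dec', 'pos.eq.dec',
          'phot.flux.density', 'phot.flux.density',
          'stat.error', 'stat.error']

clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())]).fit(texts, labels)

hits_top1 = top_k_hits(clf, texts, labels, k=1)
hits_top3 = top_k_hits(clf, texts, labels, k=3)
print('top-1 fraction:', hits_top1.mean())
print('top-3 fraction:', hits_top3.mean())
```

With the notebook's data, the same function could be applied to the held-out split inside `test_evaluate`, passing the numerical targets as `truths`, since the classifiers there are trained on `target_id`.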