# Query the Metastore

![query](../images/notebook/query.jpg)

AIQC uses SQLite under the hood as a machine learning metastore. It persists critical information at every step of the workflow that helps practitioners:

- Interpret the performance of models
- Reproduce experiments
- Encode new samples during inference

Each object in the [Low-Level API](api_low_level.html) (e.g. `Job`, `Predictor`, `Splitset`, `Feature`, `Dataset`, and many more) is a relational table in a local SQLite file. The Low-Level API serves as an object-relational model (ORM) that lets you traverse and inspect that metastore as Python objects. Note that these examples only scratch the surface of the Low-Level API.
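To make the table-to-object idea concrete, here is a minimal sketch that uses only Python's built-in `sqlite3` module. The `job` and `predictor` tables, their columns, and the `predictors_of` helper are hypothetical stand-ins, not AIQC's actual schema. The point is only that a relationship such as `job.predictors` comes down to a foreign-key query.

```python
import sqlite3

# In-memory database standing in for the local SQLite file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be indexed by column name
conn.executescript(
    """
    CREATE TABLE job (id INTEGER PRIMARY KEY, queue_id INTEGER);
    CREATE TABLE predictor (
        id INTEGER PRIMARY KEY,
        job_id INTEGER REFERENCES job(id)
    );
    INSERT INTO job (id, queue_id) VALUES (2, 1), (3, 1);
    INSERT INTO predictor (id, job_id) VALUES (2, 2), (3, 3);
    """
)

def predictors_of(job_id):
    """Return the predictor rows that point back at the given job."""
    cursor = conn.execute(
        "SELECT id, job_id FROM predictor WHERE job_id = ? ORDER BY id",
        (job_id,),
    )
    return cursor.fetchall()

# Each ORM class corresponds to a table; list the tables SQLite knows about.
tables = [
    row["name"]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
first_predictor = predictors_of(2)[0]
print(tables, first_predictor["id"])
```

An ORM wraps queries like these behind attribute access, which is why the rest of this page reads like ordinary Python rather than SQL.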

---

Let's rapidly create a trained queue of models so that we have some information to work with. We'll use one of AIQC's tests for the sake of brevity.

In [50]:
from aiqc import mlops, datum, lab, tests
queue = tests.tf_multi_tab.make_queue()
queue.run_jobs()

---

## Inspecting the Modeling Process

### How well did the models in the `Queue` perform?

In [None]:
queue.plot_performance(min_score=0.94, max_loss=0.09)

![boomerang](../images/visualization/classify_boomerang.png)

In [51]:
queue.metrics_to_pandas()

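|    | hyperparamcombo_id | job_id | predictor_id | split | accuracy | f1 | loss | precision | recall | roc_auc |
|----|----|----|----|----|----|----|----|----|----|----|
| 23 | 9 | 9 | 9 | test | 1.0 | 1.0 | 0.029 | 1.0 | 1.0 | 1.0 |
| 22 | 9 | 9 | 9 | validation | 0.926 | 0.925 | 0.095 | 0.939 | 0.926 | 0.996 |
| 21 | 9 | 9 | 9 | train | 0.962 | 0.962 | 0.059 | 0.963 | 0.962 | 0.999 |
| 20 | 8 | 8 | 8 | test | 1.0 | 1.0 | 0.048 | 1.0 | 1.0 | 1.0 |
| 19 | 8 | 8 | 8 | validation | 0.963 | 0.963 | 0.082 | 0.967 | 0.963 | 1.0 |
| 18 | 8 | 8 | 8 | train | 0.962 | 0.962 | 0.073 | 0.963 | 0.962 | 0.998 |
| 17 | 7 | 7 | 7 | test | 1.0 | 1.0 | 0.031 | 1.0 | 1.0 | 1.0 |
| 16 | 7 | 7 | 7 | validation | 0.963 | 0.963 | 0.077 | 0.967 | 0.963 | 0.998 |
| 15 | 7 | 7 | 7 | train | 0.981 | 0.981 | 0.058 | 0.981 | 0.981 | 0.999 |
| 13 | 6 | 6 | 6 | validation | 0.963 | 0.963 | 0.094 | 0.967 | 0.963 | 1.0 |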
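The `min_score` and `max_loss` arguments passed to `plot_performance` suggest a threshold filter. As a rough sketch of that idea, the snippet below applies the same two thresholds to the validation rows from the metrics table above. It uses plain Python rather than the dataframe, and `passing_jobs` is a hypothetical helper; how `plot_performance` applies its thresholds internally is not shown on this page.

```python
# Validation rows copied from the metrics table above.
validation_metrics = [
    {"job_id": 9, "accuracy": 0.926, "loss": 0.095},
    {"job_id": 8, "accuracy": 0.963, "loss": 0.082},
    {"job_id": 7, "accuracy": 0.963, "loss": 0.077},
    {"job_id": 6, "accuracy": 0.963, "loss": 0.094},
]

def passing_jobs(rows, min_score, max_loss):
    """Keep job ids whose accuracy and loss both satisfy the thresholds."""
    return [
        row["job_id"]
        for row in rows
        if row["accuracy"] >= min_score and row["loss"] <= max_loss
    ]

selected = passing_jobs(validation_metrics, min_score=0.94, max_loss=0.09)
print(selected)  # [8, 7]
```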


### What training `Jobs` belong to the `Queue`?

In [23]:
list(queue.jobs)

[<Job: 2>,
 <Job: 3>,
 <Job: 4>,
 <Job: 5>,
 <Job: 6>,
 <Job: 7>,
 <Job: 8>,
 <Job: 9>]

### A `Job` stores information about its trained model in a `Predictor`

In [52]:
predictor = queue.jobs[0].predictors[0]

The model object

In [54]:
predictor.get_model()

<keras.engine.sequential.Sequential at 0x191368e90>

The user-defined training metrics

The hyperparameters that were fed to this specific job

In [57]:
predictor.get_hyperparameters(as_pandas=True)

|   | param | value |
|---|---|---|
| 0 | neuron_count | 9.0 |
| 1 | batch_size | 3.0 |
| 2 | learning_rate | 0.03 |
| 3 | epoch_count | 30.0 |


### How did this model perform?

In [55]:
predictor.history.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

In [56]:
predictor.history['val_loss'][:10]

[0.7967078685760498,
 0.6094738841056824,
 0.5092198848724365,
 0.4443624019622803,
 0.38863012194633484,
 0.34954071044921875,
 0.3077627122402191,
 0.2807725667953491,
 0.2461201548576355,
 0.2245447188615799]
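Because the history is an ordinary list of floats per metric, it can be inspected without any plotting. The sketch below uses the ten values printed above to find the epoch with the lowest validation loss and to check whether the loss was still falling. The variable names are illustrative only.

```python
# First ten validation-loss values, copied from the output above.
val_loss = [
    0.7967078685760498,
    0.6094738841056824,
    0.5092198848724365,
    0.4443624019622803,
    0.38863012194633484,
    0.34954071044921875,
    0.3077627122402191,
    0.2807725667953491,
    0.2461201548576355,
    0.2245447188615799,
]

# Zero-based index of the epoch with the smallest validation loss.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])

# Change in loss between consecutive epochs (negative means improvement).
deltas = [later - earlier for earlier, later in zip(val_loss, val_loss[1:])]
still_improving = all(delta < 0 for delta in deltas)

print(best_epoch, still_improving)  # 9 True
```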

In [None]:
predictor.plot_learning_curve()

![Classify Learn](../images/visualization/classify_learn.png)

View the decoded predictions

In [103]:
prediction = predictor.predictions[0]

In [104]:
prediction.predictions.keys()

dict_keys(['train', 'validation', 'test'])

In [105]:
prediction.predictions['test'][:10]

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'versicolor', 'versicolor', 'versicolor', 'versicolor'],
      dtype=object)

In [None]:
prediction.plot_confusion_matrix()

![Plot Confusion](../images/visualization/classify_confusion.png)
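A confusion matrix is a count of (actual, predicted) pairs. The sketch below builds one by hand from the ten decoded test predictions shown earlier. The `actual` labels are invented purely for illustration, since the true labels are not printed on this page, so the resulting counts are not AIQC's results.

```python
from collections import Counter

classes = ["setosa", "versicolor", "virginica"]

# First ten decoded test predictions, copied from the output above.
predicted = ["setosa"] * 6 + ["versicolor"] * 4

# Hypothetical ground truth, invented for illustration only.
actual = ["setosa"] * 6 + ["versicolor"] * 3 + ["virginica"]

pair_counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes.
matrix = [[pair_counts[(a, p)] for p in classes] for a in classes]

for class_name, row in zip(classes, matrix):
    print(f"{class_name:<12}{row}")
```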

In [108]:
# etc
prediction.plot_roc_curve()
prediction.plot_precision_recall()

What features were driving the model?

In [None]:
prediction.plot_feature_importance(top_n=4)

![Classify Features](../images/visualization/classify_features.png)

### What `Algorithm` was the `Queue` training?

In [67]:
algorithm = queue.algorithm

In [74]:
algorithm.library

'keras'

In [73]:
algorithm.analysis_type

'classification_multi'

What did the architecture look like?

In [80]:
from aiqc.utils.dill import reveal_code
# Looks like I can make these calls more elegant/ concise =)

In [96]:
reveal_code(algorithm.fn_train)[0]

def fn_train(model, loser, optimizer, samples_train, samples_evaluate, **hp):
    model.compile(
        loss = loser
        , optimizer = optimizer
        , metrics = ['accuracy']
    )
    model.fit(
        samples_train["features"]
        , samples_train["labels"]
        , validation_data = (
            samples_evaluate["features"]
            , samples_evaluate["labels"]
        )
        , verbose = 0
        , batch_size = hp['batch_size']
        , epochs = hp['epoch_count']
        , callbacks=[tf.keras.callbacks.History()]
    )
    return model

In [None]:
# etc
reveal_code(algorithm.fn_train)[0]
reveal_code(algorithm.fn_lose)[0]
reveal_code(algorithm.fn_optimize)[0]
reveal_code(algorithm.fn_predict)[0]

### What `hyperparameter` space was tested?

In [87]:
queue.algorithm.hyperparamsets[0].hyperparameters

{'neuron_count': [9, 12],
 'batch_size': [3],
 'learning_rate': [0.03, 0.05],
 'epoch_count': [30, 60]}
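This space contains 2 × 1 × 2 × 2 = 8 combinations, which matches the eight `Jobs` listed earlier. The sketch below expands the dictionary with `itertools.product`, assuming an exhaustive grid. Whether AIQC enumerates the combinations in this exact order is an assumption.

```python
from itertools import product

# Hyperparameter space copied from the output above.
hyperparameters = {
    "neuron_count": [9, 12],
    "batch_size": [3],
    "learning_rate": [0.03, 0.05],
    "epoch_count": [30, 60],
}

names = list(hyperparameters)

# One dict per combination: the Cartesian product of all value lists.
combos = [
    dict(zip(names, values))
    for values in product(*hyperparameters.values())
]

print(len(combos))  # 8
print(combos[0])
```

The first combination produced here has the same values as the hyperparameters shown for the first `Predictor` above.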

---

## Inspecting the Data

### What `samples` were fed to the algorithm?

In [88]:
splitset = queue.splitset

How was the data divided into splits?

In [92]:
splitset.sizes

{'validation': {'percent': 0.18, 'count': 27},
 'test': {'percent': 0.12, 'count': 18},
 'train': {'percent': 0.7, 'count': 105}}
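The percentages and counts are two views of the same split. As a quick sanity check, the sketch below recomputes each percentage from the counts shown above; the variable names are illustrative.

```python
# Sample counts per split, copied from the output above.
counts = {"validation": 27, "test": 18, "train": 105}

total = sum(counts.values())

# Fraction of all samples that falls into each split, rounded to 2 places.
percents = {split: round(count / total, 2) for split, count in counts.items()}

print(total)     # 150
print(percents)  # {'validation': 0.18, 'test': 0.12, 'train': 0.7}
```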

Which sample indices belong to which split?

In [98]:
splitset.samples['test'][:10]

[4, 6, 23, 32, 38, 49, 54, 64, 71, 78]

### What `Features` were used to train the model?

There can be multiple `Features` because AIQC supports multi-modal analysis.

In [113]:
feature = splitset.get_features()[0]

In [114]:
feature.columns

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [115]:
feature.columns_excluded

['species']
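In this example, the included and excluded columns together account for every column in the underlying file. A minimal sketch of that relationship follows, assuming the excluded columns are simply the ones held out of the `Feature` (here, the label column). That matches the output on this page but may not hold for every `Feature`.

```python
# All columns of the source file, as listed later on this page.
file_columns = [
    "sepal_length", "sepal_width", "petal_length", "petal_width", "species",
]

# Columns held out of the feature set (here, the label column).
columns_excluded = ["species"]

# Everything that is not excluded, preserving the original column order.
feature_columns = [c for c in file_columns if c not in columns_excluded]

print(feature_columns)
```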

### What were the `Labels`?

In [116]:
label = splitset.label

In [118]:
label.columns

['species']

In [119]:
label.unique_classes

['setosa', 'versicolor', 'virginica']

### What `Dataset` did these `Features` and `Labels` come from?

In [120]:
dataset = feature.dataset

In [121]:
dataset.dataset_type

'tabular'

In [123]:
dataset.file_count

1

This dataset was created from an in-memory dataframe, so there is no file path associated with it and `source_path` returns nothing.

In [129]:
dataset.source_path

In [135]:
file = dataset.files[0]

In [136]:
file.is_ingested

True

In [137]:
file.shape

{'rows': 150, 'columns': 5}

In [138]:
file.columns

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

In [139]:
file.dtypes

{'sepal_length': 'float64',
 'sepal_width': 'float64',
 'petal_length': 'float64',
 'petal_width': 'float64',
 'species': 'object'}

### How were the `Labels` and `Features` encoded during training?

In [177]:
job = queue.jobs[0]

In [183]:
job.fittedlabelcoders[0]

<FittedLabelCoder: 2>

In [186]:
labelcoder = job.fittedlabelcoders[0].labelcoder

In [187]:
labelcoder.only_fit_train

False

In [189]:
labelcoder.sklearn_preprocess

OneHotEncoder(sparse=False)

In [190]:
labelcoder.matching_columns

['species']

In [201]:
fitted_encoder = job.fittedlabelcoders[0].fitted_encoders

In [203]:
fitted_encoder

OneHotEncoder(sparse=False)

In [202]:
fitted_encoder.categories_

[array(['setosa', 'versicolor', 'virginica'], dtype=object)]

This information is *CRITICAL* for:

- Decoding raw predictions into human-readable insight.
- Encoding new samples during inference.
- Recreating the experiment. 
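To illustrate the first two points, here is a pure-Python sketch of one-hot encoding and decoding driven by the stored category order. It does not call scikit-learn. The `one_hot` and `decode` helpers and the `raw_prediction` values are hypothetical, and only the category list comes from the `categories_` output above.

```python
# Category order copied from the fitted encoder's `categories_` output above.
categories = ["setosa", "versicolor", "virginica"]

def one_hot(label):
    """Encode a class label as a one-hot vector in the stored category order."""
    return [1.0 if label == category else 0.0 for category in categories]

def decode(probabilities):
    """Map a vector of class probabilities back to the most likely label."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return categories[best_index]

encoded = one_hot("versicolor")

# Hypothetical softmax output for a single sample.
raw_prediction = [0.05, 0.15, 0.80]
decoded = decode(raw_prediction)

print(encoded)  # [0.0, 1.0, 0.0]
print(decoded)  # virginica
```

If the category order were lost or changed between training and inference, the same raw prediction would decode to a different label, which is why the fitted encoder is persisted alongside the `Job`.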