# Complex query

## Query the database using a provenance relationship with the QueryBuilder

Time: 3 mins

##### In this example, we query the calculations in our database that are part of specific groups, and analyze the output. We want to get the magnetization of each structure that we computed. We are also interested in the smearing contribution to the total energy as an indicator of the existence and magnitude of the bandgap.

<div class="alert alert-box alert-info">
To run this example, you must first import the sample database provided with the demos.
Otherwise, the queries will return no results.
</div>

In [None]:
import sys, numpy as np
from argparse import ArgumentParser
from matplotlib import gridspec, pyplot as plt

from aiida import load_dbenv, is_dbenv_loaded
if not is_dbenv_loaded():
    load_dbenv()
from aiida.orm import CalculationFactory, QueryBuilder, load_node
from aiida.orm.data.structure import StructureData
from aiida.orm.data.parameter import ParameterData
from aiida.orm.group import Group

from notebook_helpers import generate_query_graph

PwCalculation = CalculationFactory('quantumespresso.pw')

In [None]:
# Each group of calculations of interest is named "tutorial_" + pseudo,
# where pseudo is lda, pbe or pbesol
group_basename = 'tutorial_%'
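
Assuming the `like` filter follows SQL `LIKE` semantics, `%` matches any sequence of characters and `_` matches exactly one character, so the underscore in `'tutorial_%'` is itself a wildcard. The following is a rough standalone sketch of that matching rule. The helper `like_to_regex` is hypothetical (it is not part of AiiDA) and the group names are made up for illustration.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an anchored regular expression.

    Hypothetical helper for illustration only; escape characters are not handled.
    """
    parts = []
    for char in pattern:
        if char == '%':
            parts.append('.*')   # any sequence of characters, including none
        elif char == '_':
            parts.append('.')    # exactly one character
        else:
            parts.append(re.escape(char))
    return re.compile('^' + ''.join(parts) + '$', re.DOTALL)

matcher = like_to_regex('tutorial_%')
# Made-up names, only to show which ones the pattern would accept
candidates = ['tutorial_lda', 'tutorial_pbesol', 'tutorialXpbe', 'tutorial', 'my_tutorial_lda']
matched = [name for name in candidates if matcher.match(name)]
print(matched)
```

Under these assumptions the pattern accepts any name that starts with `tutorial` followed by at least one more character, which is broader than "starts with `tutorial_`".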

#### Start building the query

In [None]:
# Instantiate QB:
qb = QueryBuilder()
# Append the Group to the entities returned, with a filter on the name:
qb.append(Group, filters={'name':{'like':group_basename}}, project='name', tag='group')

#### Visualize the query so far

In [None]:
from IPython.display import Image

In [None]:
generate_query_graph(qb.get_json_compatible_queryhelp(), 'query1.png')
Image(filename='query1.png')

#### Append the calculations that are members of each group

In [None]:
# I want every PwCalculation that is a member of the specified groups:
qb.append(PwCalculation, tag='calculation', member_of='group')

#### Visualize the current status of the query

In [None]:
generate_query_graph(qb.get_json_compatible_queryhelp(), 'query2.png') 
Image(filename='query2.png')

#### Append the structures that are input of the calculation. Project the id of the structure and the formula, stored in the extras under the key 'formula'. 
The first time you run this, `extras.formula` is not set, so every returned value will be `None`. We will see how to fix this later.

In [None]:
qb.append(StructureData, project=['id', 'extras.formula'], tag='structure', input_of='calculation')

#### Visualize the current status of the query

In [None]:
generate_query_graph(qb.get_json_compatible_queryhelp(), 'query3.png')
Image(filename='query3.png')

#### Append the parameters that are an output of the calculation

Project:
* The smearing contribution and its units
* The magnetization and its units

In [None]:
qb.append(
    ParameterData,
    tag='results',
    project=[
        'attributes.energy_smearing', 'attributes.energy_smearing_units',
        'attributes.total_magnetization', 'attributes.total_magnetization_units',
    ],
    output_of='calculation',
)

#### Visualize the final query

In [None]:
generate_query_graph(qb.get_json_compatible_queryhelp(), 'query4.png') 
Image(filename='query4.png')

#### Print the query results

In [None]:
results = qb.all()
for item in results:
    print ', '.join(map(str, item))

The first time you run this query, the third column (`extras.formula` of the structure) will be `None`, because the extras are not set. We now set the extra for those structures and re-run the query.
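
Each row returned by `qb.all()` lists the projected values in the order in which the projections were appended: group name, structure id, formula, then the four result attributes. As a minimal sketch, the columns can be given names with a `namedtuple`. The `Row` type and the two sample rows are hypothetical, invented for illustration and not taken from the sample database.

```python
from collections import namedtuple

# Field order mirrors the order of the projections in the query
Row = namedtuple('Row', [
    'group_name', 'structure_pk', 'formula',
    'smearing', 'smearing_units', 'magnetization', 'magnetization_units',
])

# Invented rows, not real output from the sample database
sample_results = [
    ['tutorial_lda', 101, None, -0.01, 'eV', 0.0, 'Bohrmag / cell'],
    ['tutorial_pbe', 102, 'BaTiO3', 0.0, 'eV', 1.5, 'Bohrmag / cell'],
]

rows = [Row(*item) for item in sample_results]
# Structures whose formula column is still empty
missing = sorted(set(row.structure_pk for row in rows if row.formula is None))
print(missing)
```

Named fields make it harder to mix up positional indices such as `res[1]` and `res[2]` when the list of projections changes.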

In [None]:
# Column 1 is the structure id, column 2 is extras.formula
missing_formulas_pk = set([res[1] for res in results if res[2] is None])
print "{} structures do not have extras.formula set yet.".format(len(missing_formulas_pk))
if missing_formulas_pk:
    print "We will set this extra now."

In [None]:
for structure_pk in missing_formulas_pk:
    structure = load_node(structure_pk)
    formula = structure.get_formula()
    structure.set_extra('formula', formula)
print "Extra added to {} structures.".format(len(missing_formulas_pk))

We now run the query again so that the formula is returned as well.

Note that to run the query again we have to create it again: it has already been executed, so we cannot just modify it.
The line below replaces `qb` with a new query that has the same appended filters and projections.

In [None]:
qb = QueryBuilder(**qb.get_json_compatible_queryhelp())
results = qb.all()
for item in results:
    print ', '.join(map(str, item))

#### Plot the results
A long list of values is not always easy to read. We prepared a function that displays the results of the query as bar charts.

Don't be put off by its length: most of the code below only controls the appearance of the matplotlib figure. You already obtained the results in the previous step!
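
The data-handling part of the plotting function is small: it pivots the flat result rows into a nested dictionary keyed by formula and then by pseudopotential family, and checks that all values share one unit. Here is a minimal standalone sketch of that step. The rows are invented for illustration and are not taken from the sample database.

```python
# Invented rows in the same column order as the query results
query_res = [
    ['tutorial_lda', 1, 'Fe', -0.02, 'eV', 2.1, 'Bohrmag / cell'],
    ['tutorial_pbe', 1, 'Fe', -0.03, 'eV', 2.3, 'Bohrmag / cell'],
    ['tutorial_lda', 2, 'Si', 0.0, 'eV', 0.0, 'Bohrmag / cell'],
]

results_dict = {}
smearing_units_seen = set()
for family, _pk, formula, smearing, smearing_units, mag, _mag_units in query_res:
    # One entry per formula, holding one (smearing, magnetization) pair per family
    results_dict.setdefault(formula, {})[family] = (smearing, mag)
    smearing_units_seen.add(smearing_units)

# A single shared unit is required before the values can go on one axis
if len(smearing_units_seen) != 1:
    raise ValueError('Found different units for smearing')

# Sort by formula so that the bars appear in a stable order
sorted_results = sorted(results_dict.items())
formula_list = [formula for formula, _ in sorted_results]
print(formula_list)
```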

In [None]:
def plot_results(query_res):
    """
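    Plot the smearing energy and the total magnetization per formula and pseudo family.

    :param query_res: the rows returned by QueryBuilder.all() for the query built above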
    """
    smearing_unit_set,magnetization_unit_set,pseudo_family_set = set(), set(), set()
    # Storing results:
    results_dict = {}
    for pseudo_family, structure_pk, formula, smearing, smearing_units, mag, mag_units in query_res:
        if formula not in results_dict:
            results_dict[formula] = {}
        # Storing the results:
        results_dict[formula][pseudo_family] = (smearing, mag)
        # Adding to the unit set:
        smearing_unit_set.add(smearing_units)
        magnetization_unit_set.add(mag_units)
        pseudo_family_set.add(pseudo_family)

    # Sorting by formula:
    sorted_results = sorted(results_dict.items())
    formula_list = [formula for formula, _ in sorted_results]
    nr_of_results = len(formula_list)

    # Check that there are no more than 3 pseudo families.
    # If more are needed, define more colors below.
    if len(pseudo_family_set) > 3:
        raise Exception('I was expecting 3 or fewer pseudo families')

    colors = ['b', 'r', 'g']

    # Plotting:
    plt.clf()
    fig=plt.figure(figsize=(16, 9), facecolor='w', edgecolor=None)
    gs  = gridspec.GridSpec(2,1, hspace=0.01, left=0.1, right=0.94)

    # Defining barwidth
    barwidth = 1. / (len(pseudo_family_set)+1)
    offset = [-0.5+(0.5+n)*barwidth for n in range(len(pseudo_family_set))]
    # Axis labels with units:
    yaxis = ("Smearing energy [{}]".format(smearing_unit_set.pop()),
        "Total magnetization [{}]".format(magnetization_unit_set.pop()))
    # If any unit is left after the pop() above, more than one unit was found:
    if smearing_unit_set:
        raise Exception('Found different units for smearing')
    if magnetization_unit_set:
        raise Exception('Found different units for magnetization')
    
    # Making two plots, upper for the smearing, the lower for magnetization
    for index in range(2):
        ax=fig.add_subplot(gs[index])
        for i,pseudo_family in enumerate(pseudo_family_set):
            X = np.arange(nr_of_results)+offset[i]
            Y = np.array([thisres[1][pseudo_family][index] for thisres in sorted_results])
            ax.bar(X, Y,  width=0.2, facecolor=colors[i], edgecolor=colors[i], label=pseudo_family)
        ax.set_ylabel(yaxis[index], fontsize=14, labelpad=15*index+5)
        ax.set_xlim(-0.5, nr_of_results-0.5)
        ax.set_xticks(np.arange(nr_of_results))
        if index == 0:
            plt.setp(ax.get_yticklabels()[0], visible=False)
            ax.xaxis.tick_top()
            ax.legend(loc=3, prop={'size': 18})
        else:
            plt.setp(ax.get_yticklabels()[-1], visible=False)
        for i in range(0, nr_of_results, 2):
            ax.axvspan(i-0.5, i+0.5, facecolor='y', alpha=0.2)
        ax.set_xticklabels(list(formula_list),rotation=90, size=14, ha='center')
    plt.show()

In [None]:
plot_results(results)
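
The horizontal placement of the bars comes from `barwidth` and `offset` in `plot_results`. Here is a small worked example of that arithmetic for three pseudopotential families, in plain Python so that it runs without matplotlib. The name `n_families` is introduced here for illustration.

```python
n_families = 3

# Same formulas as in plot_results: the unit interval around each tick is split
# into n_families + 1 slots, and the offsets are the centres of the first
# n_families slots, which leaves the last slot empty as a gap
barwidth = 1. / (n_families + 1)
offset = [-0.5 + (0.5 + n) * barwidth for n in range(n_families)]
print(barwidth)
print(offset)

# x values passed to ax.bar for the three bars of the second formula (tick index 1)
positions = [1 + shift for shift in offset]
print(positions)
```

Whether `ax.bar` treats these x values as bar centres or as left edges depends on its `align` argument, whose default may differ between matplotlib versions.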