# Tutorial 1: Basics

This tutorial shows how to use clusterking from your own Python project or Jupyter notebook.
There is also a command line interface that lets you do the same with just two lines on the command line.

**NOTE FOR CONTRIBUTORS: Always clear all output before committing (``Cell`` > ``All Output`` > ``Clear``)**!

In [1]:
# Magic
%matplotlib inline
# Reload modules whenever they change
%load_ext autoreload
%autoreload 2

# Make clusterking package available even without installation
import sys
sys.path = ["../../"] + sys.path

import clusterking as ck

## Scanning

### Setting it up

Let's set up a scanner object and configure it.

In [2]:
s = ck.scan.WilsonScanner()

First we set up the function/distribution that we want to consider. Here we look at the branching ratio of $B\to D \,\tau\, \bar\nu_\tau$ as a function of $q^2$. The function for the differential branching ratio is taken from the flavio package (https://flav-io.github.io/). The $q^2$ binning is chosen to have 9 bins between $3.2 \,\text{GeV}^2$ and $11.6\,\text{GeV}^2$ and is implemented as follows:

In [3]:
import flavio
import numpy as np

def dBrdq2(w, q):
    return flavio.np_prediction("dBR/dq2(B+->Dtaunu)", w, q)

s.set_dfunction(
    dBrdq2,
    binning=np.linspace(3.2, 11.6, 10),
    normalize=True
)
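As a rough illustration of what the binning and ``normalize=True`` mean, the sketch below builds the same ten bin edges and normalizes a binned toy distribution so that its bins sum to one (as the rows of the result table further down do). The function ``toy_rate`` is a hypothetical stand-in and not the flavio prediction, and the midpoint rule is an assumption about the integration, not a description of what clusterking does internally.

```python
import numpy as np

# Ten equally spaced edges between 3.2 and 11.6 define nine q^2 bins.
edges = np.linspace(3.2, 11.6, 10)
n_bins = len(edges) - 1
widths = np.diff(edges)

# Hypothetical stand-in for a differential rate (NOT the flavio prediction).
def toy_rate(q2):
    return (q2 - 3.2) * (11.6 - q2)

# Approximate the integral over each bin with the midpoint rule
# (an illustrative choice only).
centers = 0.5 * (edges[:-1] + edges[1:])
binned = toy_rate(centers) * widths

# Normalizing rescales the bin contents so that they sum to one.
normalized = binned / binned.sum()

print(n_bins)
print(normalized.round(4))
```

Normalized distributions of this kind can be compared by shape alone, independent of the overall rate.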

Next, let's set up the Wilson coefficients that need to be sampled. The Wilson coefficients are implemented using the Wilson package (https://wilson-eft.github.io/), which allows the use of a variety of bases and EFTs and matches them to user-specified scales.
Using the example of $B\to D \,\tau\, \bar\nu_\tau$, we sample the coefficients ``CVL_bctaunutau``, ``CSL_bctaunutau`` and ``CT_bctaunutau`` from the ``flavio`` basis (https://wcxf.github.io/assets/pdf/WET.flavio.pdf) with 10 points each between $-1$ and $1$ at a scale of 5 GeV:

In [4]:
s.set_spoints_equidist(
    {
        "CVL_bctaunutau": (-1, 1, 10),
        "CSL_bctaunutau": (-1, 1, 10),
        "CT_bctaunutau": (-1, 1, 10)
    },
    scale=5,
    eft='WET',
    basis='flavio'
)
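To see how many sample points such an equidistant grid contains, here is a minimal sketch that builds the Cartesian product of the three ranges with plain NumPy. It only illustrates the idea; the ordering of the points is an assumption chosen to match the result table shown later, not a statement about how ``set_spoints_equidist`` is implemented.

```python
import itertools
import numpy as np

# Same (min, max, number of points) triples as in the call above.
ranges = {
    "CVL_bctaunutau": (-1, 1, 10),
    "CSL_bctaunutau": (-1, 1, 10),
    "CT_bctaunutau": (-1, 1, 10),
}

# One equidistant axis per coefficient.
axes = [np.linspace(lo, hi, n) for lo, hi, n in ranges.values()]

# Cartesian product of the axes: the last coefficient varies fastest.
grid = np.array(list(itertools.product(*axes)))

print(grid.shape)  # (1000, 3)
print(grid[1])
```

Three coefficients with 10 values each give $10^3 = 1000$ sample points, and the number of points grows exponentially with the number of coefficients.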

### Running it

In [5]:
d = ck.Data()

In [6]:
s.run(d)

Scanner:INFO:Started queue with 1000 job(s) distributed over up to 8 core(s)/worker(s).
Scanning: 100%|████████████████████████| 1000/1000 [00:24<00:00, 40.07 spoint/s]
Scanner:DEBUG:Converting data to pandas dataframe.
Scanner:INFO:Integration done.


The results are saved in a dataframe, ``d.df``. Let's have a look:

In [7]:
d.df.head()

| index | CVL_bctaunutau | CSL_bctaunutau | CT_bctaunutau | bin0 | bin1 | bin2 | bin3 | bin4 | bin5 | bin6 | bin7 | bin8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 0.013125 | 0.058725 | 0.10786 | 0.146354 | 0.168836 | 0.172496 | 0.155692 | 0.117966 | 0.058947 |
| 1 | -1.0 | -1.0 | -0.777778 | 0.011128 | 0.052279 | 0.100732 | 0.142333 | 0.169616 | 0.177547 | 0.162658 | 0.12344 | 0.060268 |
| 2 | -1.0 | -1.0 | -0.555556 | 0.009775 | 0.046469 | 0.093154 | 0.137575 | 0.170368 | 0.1835 | 0.170686 | 0.128749 | 0.059724 |
| 3 | -1.0 | -1.0 | -0.333333 | 0.013228 | 0.051945 | 0.094669 | 0.136531 | 0.169901 | 0.184716 | 0.171651 | 0.12558 | 0.05178 |
| 4 | -1.0 | -1.0 | -0.111111 | 0.026508 | 0.085055 | 0.123471 | 0.149511 | 0.166231 | 0.168245 | 0.147711 | 0.100012 | 0.033255 |


## Clustering

Let's build a hierarchy cluster out of the data object we created above.

In [8]:
c = ck.cluster.HierarchyCluster(d)

First, we have to specify the metric we want to use to measure the distance between different distributions. If no argument is specified, the common $\chi^2$ metric is used.

In [10]:
c.set_metric()
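To give a feeling for what a $\chi^2$-type distance between two binned distributions looks like, the sketch below implements one textbook form, $\sum_i (p_i - q_i)^2 / (p_i + q_i)$, and applies it to two rows of the table above. This is an assumption for illustration only: the exact definition used by ``set_metric()`` (for example how uncertainties enter) is not spelled out in this tutorial, and ``chi2_distance`` is a hypothetical helper, not part of clusterking.

```python
import numpy as np

def chi2_distance(p, q):
    """One textbook chi^2-type distance between two histograms (illustrative)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = p + q
    mask = denom > 0  # skip bins that are empty in both histograms
    return float(np.sum((p[mask] - q[mask]) ** 2 / denom[mask]))

# Rows 0 and 4 of the dataframe shown above (bin0 ... bin8).
row0 = [0.013125, 0.058725, 0.10786, 0.146354, 0.168836,
        0.172496, 0.155692, 0.117966, 0.058947]
row4 = [0.026508, 0.085055, 0.123471, 0.149511, 0.166231,
        0.168245, 0.147711, 0.100012, 0.033255]

print(chi2_distance(row0, row0))  # 0.0
print(chi2_distance(row0, row4))
```

Identical distributions have distance zero, and the distance is symmetric in its two arguments.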

Now let's build the hierarchy:

In [22]:
c.build_hierarchy()

Scanner:DEBUG:Building hierarchy.
Scanner:DEBUG:Done


The maximal distance between the individual clusters ``max_d`` can be chosen as follows:

In [29]:
c.cluster(max_d=0.15)

Scanner:INFO:Performing clustering.
Scanner:INFO:Clustering resulted in 4 get_clusters.
Scanner:INFO:Done


Now we add the information about the clusters to the dataframe created above:

In [33]:
c.write()
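The role of the ``max_d`` cut used in ``c.cluster(max_d=0.15)`` can be illustrated with a small self-contained sketch. It groups toy points by single linkage: two points end up in the same cluster whenever they are connected by a chain of neighbours that are at most ``max_d`` apart. The Euclidean distance, the single-linkage rule and the helper ``single_linkage_clusters`` are all assumptions made for illustration; the linkage method and metric that clusterking actually uses may differ.

```python
import numpy as np

def single_linkage_clusters(points, max_d):
    """Label points so that chains of neighbours closer than max_d share a label."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances (illustrative metric).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    # Union-find structure to merge points into clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= max_d:
                parent[find(i)] = find(j)

    # Relabel the cluster roots as 0, 1, 2, ... in order of appearance.
    label_of_root = {}
    labels = [label_of_root.setdefault(find(i), len(label_of_root)) for i in range(n)]
    return np.array(labels)

points = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [1.0, 1.0], [1.05, 1.0]]
labels = single_linkage_clusters(points, max_d=0.15)
print(labels)  # [0 0 0 1 1]
```

A larger ``max_d`` merges more points and therefore gives fewer clusters; a smaller one gives more.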

Let's take a look and notice the new column ``cluster`` at the end of the data frame:

In [34]:
d.df.head()

| index | CVL_bctaunutau | CSL_bctaunutau | CT_bctaunutau | bin0 | bin1 | bin2 | bin3 | bin4 | bin5 | bin6 | bin7 | bin8 | cluster | bpoint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 0.013125 | 0.058725 | 0.10786 | 0.146354 | 0.168836 | 0.172496 | 0.155692 | 0.117966 | 0.058947 | 0 | False |
| 1 | -1.0 | -1.0 | -0.777778 | 0.011128 | 0.052279 | 0.100732 | 0.142333 | 0.169616 | 0.177547 | 0.162658 | 0.12344 | 0.060268 | 0 | False |
| 2 | -1.0 | -1.0 | -0.555556 | 0.009775 | 0.046469 | 0.093154 | 0.137575 | 0.170368 | 0.1835 | 0.170686 | 0.128749 | 0.059724 | 0 | False |
| 3 | -1.0 | -1.0 | -0.333333 | 0.013228 | 0.051945 | 0.094669 | 0.136531 | 0.169901 | 0.184716 | 0.171651 | 0.12558 | 0.05178 | 0 | False |
| 4 | -1.0 | -1.0 | -0.111111 | 0.026508 | 0.085055 | 0.123471 | 0.149511 | 0.166231 | 0.168245 | 0.147711 | 0.100012 | 0.033255 | 0 | False |
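Since ``d.df`` is an ordinary pandas dataframe, the ``cluster`` column can be inspected with standard pandas operations. The sketch below uses a tiny made-up dataframe as a stand-in for ``d.df`` (the values and cluster labels are hypothetical, not scan results) to count the points per cluster and to find the range of one coefficient within each cluster.

```python
import pandas as pd

# Toy stand-in for d.df; values and labels are made up for illustration.
df = pd.DataFrame(
    {
        "CT_bctaunutau": [-1.0, -0.5, 0.0, 0.5, 1.0],
        "cluster": [0, 0, 1, 1, 1],
    }
)

# Number of sample points in each cluster.
sizes = df["cluster"].value_counts().sort_index()

# Range of the coefficient covered by each cluster.
ct_range = df.groupby("cluster")["CT_bctaunutau"].agg(["min", "max"])

print(sizes)
print(ct_range)
```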


## Selecting benchmark points

In a similar way we can determine the benchmark points representing the individual clusters. Initializing a benchmark point object

In [35]:
b = ck.Benchmark(d)

and again choosing a metric ($\chi^2$ metric is default)

In [16]:
b.set_metric()

the benchmark points can be computed

In [17]:
b.select_bpoints()

and written in the dataframe:

In [36]:
b.write()
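One simple way to pick a representative point for each cluster is to take the member with the smallest summed distance to all other members of the same cluster (the medoid). The sketch below does this for toy data. It is only meant to illustrate the idea: the criterion, the Euclidean distance and the helper ``select_medoids`` are assumptions, and this tutorial does not state which figure of merit ``select_bpoints()`` actually uses.

```python
import numpy as np

def select_medoids(points, labels):
    """Flag, per cluster, the member with the smallest summed distance to the rest."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    is_bpoint = np.zeros(len(points), dtype=bool)
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        members = points[idx]
        # Pairwise Euclidean distances within the cluster (illustrative metric).
        dist = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
        is_bpoint[idx[np.argmin(dist.sum(axis=1))]] = True
    return is_bpoint

points = [[0.0], [0.1], [0.3], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
bpoint = select_medoids(points, labels)
print(bpoint)
```

The result is a boolean flag per point with exactly one ``True`` per cluster, which is the same shape of information as a boolean ``bpoint`` column.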

Let's take a look and notice the new column ``bpoint`` at the end of the data frame:

In [37]:
d.df.head()

| index | CVL_bctaunutau | CSL_bctaunutau | CT_bctaunutau | bin0 | bin1 | bin2 | bin3 | bin4 | bin5 | bin6 | bin7 | bin8 | cluster | bpoint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 0.013125 | 0.058725 | 0.10786 | 0.146354 | 0.168836 | 0.172496 | 0.155692 | 0.117966 | 0.058947 | 0 | |
| 1 | -1.0 | -1.0 | -0.777778 | 0.011128 | 0.052279 | 0.100732 | 0.142333 | 0.169616 | 0.177547 | 0.162658 | 0.12344 | 0.060268 | 0 | |
| 2 | -1.0 | -1.0 | -0.555556 | 0.009775 | 0.046469 | 0.093154 | 0.137575 | 0.170368 | 0.1835 | 0.170686 | 0.128749 | 0.059724 | 0 | |
| 3 | -1.0 | -1.0 | -0.333333 | 0.013228 | 0.051945 | 0.094669 | 0.136531 | 0.169901 | 0.184716 | 0.171651 | 0.12558 | 0.05178 | 0 | |
| 4 | -1.0 | -1.0 | -0.111111 | 0.026508 | 0.085055 | 0.123471 | 0.149511 | 0.166231 | 0.168245 | 0.147711 | 0.100012 | 0.033255 | 0 | |


## Preserving results

Now it's time to write out the results for later use.

In [20]:
d.write("output/cluster", "tutorial_basics", overwrite="overwrite")

DFMD:INFO:Will write dataframe to 'output/cluster/tutorial_basics_data.csv'.
DFMD:DEBUG:Done
DFMD:INFO:Will write metadata to 'output/cluster/tutorial_basics_metadata.json'.
DFMD:DEBUG:Done


This writes out not only the data itself, but also the associated metadata, which makes it easy to reconstruct later what the data actually represents. The metadata was accumulated in the attribute ``d.md`` over all steps:

In [21]:
d.md

defaultdict(<function clusterking.util.metadata.nested_dict>,
            {'bpoint': defaultdict(<function clusterking.util.metadata.nested_dict>,
                         {'bpoint': defaultdict(<function clusterking.util.metadata.nested_dict>,
                                      {'cluster_column': 'cluster',
                                       'metric': defaultdict(<function clusterking.util.metadata.nested_dict>,
                                                   {'args': [],
                                                    'kwargs': {}})})}),
             'cluster': defaultdict(<function clusterking.util.metadata.nested_dict>,
                         {'cluster': defaultdict(<function clusterking.util.metadata.nested_dict>,
                                      {'cluster_args': {'max_d': 0.2},
                                       'git': {'branch': 'HEAD',
                                        'msg': 'added test file',
                                        'sha': '36202