In [1]:
data_type = 'red_data' # reduced data

In [2]:
import sys
sys.path.insert(0, '..')
from definition import (
    data_names,
    limits_m_Kpipi,
    columns
)

MC_name   = data_names['MC']
data_name = data_names[data_type]

low_m_Kpipi = limits_m_Kpipi[data_type]['low']
high_m_Kpipi = limits_m_Kpipi[data_type]['high']

print("MC name: ", MC_name)
print("data name: ", data_name)
print('----')
print('m_Kpipi')
print("low: ", low_m_Kpipi)
print("high: ", high_m_Kpipi)

MC name:  BTODstDX_MC
data name:  BTODstDX_reduced
----
m_Kpipi
low:  1820
high:  1950


The columns that need to be looked at are:
- $q^2$: `q2_reco`
- *isolation BDT*: `isolation_bdt`
- $t_\tau$: `tau_life_reco`
- $m(D^*K\pi\pi)$: `m_DstKpipi`
- The angles 
    - $\theta_X$ ($=\theta_D$ of the paper?): `theta_X_reco`
    - $\theta_L$: `theta_L_reco`
    - $\chi$: `chi_reco`

# Reweighting the MC sample of the $B \to D^{*-}\left(D^{+} \to K^+ \pi^+ \pi^- \right)X$ background

**INPUTS**
- `MC`: $D^+ \to K^+ \pi^+ \pi^-$
- `data`: LHCb data, with $_s$Weights that isolate the $D^+ \to K^+ \pi^+ \pi^-$ contribution and statistically subtract the other contributions.

**GOALS**
1. Learn the weights that align MC with data for the $D^+ \to K^+ \pi^+ \pi^-$ decays, by training a BDT-based reweighter on the MC and the $_s$Weighted LHCb data.
2. Apply these weights to the general MC.

The hope is that the same weights also improve the agreement of the general MC sample for $B \to D^{*-} (D^+ \to 3\pi X) X$ with data.

In [3]:
# python libraries
import zfit
import timeit
from scipy.stats import ks_2samp

# hep_ml
from hep_ml.reweight import GBReweighter

# bd2dsttaunu
from bd2dsttaunu.locations import loc

# HEA library
from HEA.plot import plot_hist_auto, plot_hist, save_fig, plot_divide_auto
from HEA import load_dataframe
from HEA.plot.tools import draw_vline
from HEA.definition import latex_params
from HEA.pandas_root import load_saved_root
from HEA.pandas_root import save_root
import HEA.BDT as bt
from HEA.tools.serial import dump_joblib, retrieve_joblib

Welcome to JupyROOT 6.22/06


## Read the dataframe

In [4]:
df = {}
df['MC'] = load_dataframe(loc.B2DstDplusX_MC, tree_name='DecayTreeTuple/DecayTree', columns=columns+['m_Kpipi'])
df['data'] = load_saved_root(data_name + '_with_sWeights', folder_name=data_name)

Loading /data/lhcb/users/scantlebury-smead/angular_analysis/double_charm/final_ds_selection_B_DstDpX_Kpipi_truth_matched.root
Loading /home/correiaa/bd2dsttaunu/output//root/BTODstDX_reduced/BTODstDX_reduced_with_sWeights.root


In [5]:
df['MC'] = df['MC'].query(f'm_Kpipi > {low_m_Kpipi} and m_Kpipi < {high_m_Kpipi}')
# df['data'] = df['data'].query(f'm_Kpipi > {low_m_Kpipi} and m_Kpipi < {high_m_Kpipi}')

## Theory - reweighting the MC data
https://indico.cern.ch/event/397113/contributions/1837841/attachments/1213955/1771752/ACAT2016-reweighting.pdf

Use of the `hep_ml` package: documentation [here](https://arogozhnikov.github.io/hep_ml/reweight.html)

### Explanation
#### 1D comparison cannot be used
We could compute the weights bin by bin from one-dimensional histograms. For bin $i$:
$$w_{i}= \frac{\text{#data}[i]}{\text{#MC}[i]}$$

Disadvantages:
- Reweighting one variable may spoil the agreement in the others.\
$\Rightarrow$ we need a multi-dimensional reweighting procedure.\
$\Rightarrow$ we need to compare multi-dimensional histograms!
- We need to choose the number of bins
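
As a minimal sketch of the bin-based rule above, the following pure-Python snippet computes $w_i = \text{data}[i]/\text{MC}[i]$ on toy values. The helper names (`bin_index`, `histogram`, `bin_weights`), the bin edges and the toy samples are all hypothetical and only serve the illustration; they are not part of this analysis.

```python
from bisect import bisect_right


def bin_index(x, edges):
    """Index of the bin containing x, or None if x is outside [edges[0], edges[-1])."""
    if x < edges[0] or x >= edges[-1]:
        return None
    return bisect_right(edges, x) - 1


def histogram(values, edges, weights=None):
    """Weighted bin contents (unit weights if none are given)."""
    counts = [0.0] * (len(edges) - 1)
    if weights is None:
        weights = [1.0] * len(values)
    for x, w in zip(values, weights):
        i = bin_index(x, edges)
        if i is not None:
            counts[i] += w
    return counts


def bin_weights(mc, data, edges, data_weights=None):
    """Per-bin weight w_i = data[i] / MC[i]; bins without MC get weight 0."""
    n_mc = histogram(mc, edges)
    n_data = histogram(data, edges, data_weights)
    return [d / m if m > 0 else 0.0 for d, m in zip(n_data, n_mc)]


# Toy one-dimensional samples (hypothetical values)
edges = [0.0, 1.0, 2.0, 3.0]
mc = [0.5, 0.6, 1.5, 2.5]
data = [0.2, 1.1, 1.2, 1.3, 2.7, 2.9]

w = bin_weights(mc, data, edges)
# Each MC event receives the weight of the bin it falls into
mc_event_weights = [w[bin_index(x, edges)] for x in mc]

print(w)                 # [0.5, 3.0, 2.0]
print(mc_event_weights)  # [0.5, 0.5, 3.0, 2.0]
```

The result depends directly on the choice of `edges`, and the weights only correct the one variable that was binned, which is the limitation listed above.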

#### Use ML (boosted reweighting)
The procedure is automated with an ML classifier based on decision trees (e.g., a gradient boosting classifier).
1. The dataset is the concatenation of MC and data (goal: separate MC events from data events with the classifier).
2. Each tree splits the space of variables orthogonally, choosing the regions that maximise the difference between MC and LHCb data. The difference is evaluated with the symmetrised $\chi^2$ (instead of the usual *MSE*):
$$\chi^2 = \sum_{\text{bin }i} \frac{\left(\text{#data}[i]-\text{#MC}[i]\right)^2}{\text{#data}[i]+\text{#MC}[i]}$$
3. The weight predictions are computed in the leaves.
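
To make the split criterion concrete, here is a simplified sketch for a single one-dimensional cut: it scans a few candidate thresholds, keeps the one with the largest symmetrised $\chi^2$ between data and MC yields, and takes the data/MC ratio in each resulting leaf as the weight. This only illustrates the idea; the actual `hep_ml` implementation uses gradient boosting over many trees with its own loss function. The function names, thresholds and toy values are hypothetical.

```python
def symmetrised_chi2(n_data, n_mc):
    """Sum over regions of (data - MC)^2 / (data + MC), skipping empty regions."""
    return sum((d - m) ** 2 / (d + m) for d, m in zip(n_data, n_mc) if d + m > 0)


def leaf_counts(values, weights, threshold):
    """Weighted yields in the two leaves x < threshold and x >= threshold."""
    left = sum(w for x, w in zip(values, weights) if x < threshold)
    right = sum(w for x, w in zip(values, weights) if x >= threshold)
    return [left, right]


# Toy samples with unit weights (for real data the sWeights would be used)
mc = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
data = [0.3, 0.6, 0.7, 0.8, 0.9, 0.95]
mc_w = [1.0] * len(mc)
data_w = [1.0] * len(data)

# Score each candidate cut by the symmetrised chi2 of the two leaves
candidates = [0.25, 0.5, 0.75]
scores = {
    t: symmetrised_chi2(leaf_counts(data, data_w, t), leaf_counts(mc, mc_w, t))
    for t in candidates
}
best = max(scores, key=scores.get)

# Weight prediction in each leaf: ratio of data yield to MC yield
n_data = leaf_counts(data, data_w, best)
n_mc = leaf_counts(mc, mc_w, best)
leaf_weights = [d / m for d, m in zip(n_data, n_mc)]

print(best)          # 0.5
print(leaf_weights)  # [0.25, 2.5]
```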

Advantages:
- Optimal choice of region
- Information about the efficiency of the procedure (via the ROC curve for instance)
- The ML classifier can be used to re-weight other MC samples
- Possible computation of feature importances

Disadvantage: slow



### `GBReweighter` in `hep_ml`

We are going to use a `GBReweighter` from the `hep_ml` package. Its documentation describes it as follows:

*Gradient Boosted Reweighter - a reweighter algorithm based on ensemble of regression trees. Parameters have the same role, as in gradient boosting. Special loss function is used, trees are trained to maximize symmetrized binned chi-squared statistics.*

*Training takes much more time than for bin-based versions, but GBReweighter is capable to work in high dimensions while keeping reweighting rule reliable and precise (and even smooth if many trees are used).*

## Prepare the dataframes

### Pick up the variables

In [6]:
print(f"{len(columns)} available columns:")
for column in columns:
    print("- " + column)

13 available columns:
- m_Kpipi
- q2_reco
- isolation_bdt
- tau_life_reco
- m_DstKpipi
- theta_X_reco
- theta_L_reco
- chi_reco
- costheta_X_reco
- costheta_L_reco
- coschi_reco
- tau_M
- B0_M


In [7]:
training_columns = [
    'costheta_X_reco',
    'costheta_L_reco',
    'chi_reco',
    'isolation_bdt',
    'q2_reco'
]

print(f"{len(training_columns)} columns used for training of the GBReweighter:")
for training_column in training_columns:
    print("- " + training_column)

5 columns used for training of the GBReweighter:
- costheta_X_reco
- costheta_L_reco
- chi_reco
- isolation_bdt
- q2_reco


We also try adding the $B^0$ candidate mass `B0_M`, i.e. $m(D^*3\pi)$, to the training variables.

In [8]:
training_columns_2 = training_columns + ['B0_M']

In [9]:
training_columns_dict = {
    "basic": training_columns, 
    "with_B0_M": training_columns_2,   
}

### Get the dataframes with only the training variables

In [10]:
# df_train = {
#     'MC': df['MC'][training_columns],
#     'data': df['data'][training_columns]
# }

In [11]:
# df_train['MC']

## Training

### Hyperparameters

In [12]:
# hyperparams ={
#     'n_estimators': 100,
#     'learning_rate': 0.1,
#     'max_depth': 3,
#     'min_samples_leaf': 200
# }

### Reweighter

In [13]:
# reweighter = GBReweighter(**hyperparams)

# reweighter = reweighter.fit(
#     original=df_train['MC'], 
#     target=df_train['data'], 
#     target_weight=df['data']['sWeight']
# )

In [14]:
# MC_weights = reweighter.predict_weights(df_train['MC'])
# MC_weights

## Save

In [15]:
# df['MC']['weight'] = MC_weights
# save_root(df['MC'], MC_name + '_B0_M_with_weights', 'DecayTree', folder_name=MC_name)

### Loop over the sets of training variables

In [16]:
hyperparams = {
#     'n_estimators': 200,
    'n_estimators': 30,       # hep_ml default: 40
    'learning_rate': 0.2,     # hep_ml default
    'max_depth': 3,           # hep_ml default
    'min_samples_leaf': 200   # hep_ml default
}

In [29]:
for name_training, train_cols in training_columns_dict.items():
    print(f"Reweighting of: {name_training}")
    
    # .copy() so that adding the 'weight' column below does not write to a slice of df['MC']
    df_train = {
        'MC': df['MC'][train_cols].copy(),
        'data': df['data'][train_cols]
    }
    
    # Learn the MC -> data weights; the sWeights select the D+ -> K pi pi component in data
    reweighter = GBReweighter(**hyperparams)
    reweighter.fit(original=df_train['MC'], target=df_train['data'], 
                   target_weight=df['data']['sWeight'])
    
    df_train['MC']['weight'] = reweighter.predict_weights(df_train['MC'])
    
    MC_file_name = MC_name + f'_{name_training}_with_weights'
    
    # Save the reweighted MC and the trained reweighter under the same name
    save_root(df_train['MC'], MC_file_name, 'DecayTree', folder_name=MC_name)
    dump_joblib(reweighter, MC_file_name, folder_name=MC_name)

Reweighting of: basic
Root file saved in /home/correiaa/bd2dsttaunu/output/root/BTODstDX_MC/BTODstDX_MC_basic_with_weights.root
Joblib file saved in /home/correiaa/bd2dsttaunu/output/joblib/BTODstDX_MC/BTODstDX_MC_basic_with_weights.joblib
Reweighting of: with_B0_M
Root file saved in /home/correiaa/bd2dsttaunu/output/root/BTODstDX_MC/BTODstDX_MC_with_B0_M_with_weights.root
Joblib file saved in /home/correiaa/bd2dsttaunu/output/joblib/BTODstDX_MC/BTODstDX_MC_with_B0_M_with_weights.joblib


Check that the joblib file has been correctly saved.

In [39]:
retrieved_reweighter = retrieve_joblib(MC_file_name, folder_name=MC_name)

In [37]:
X_MC = df_train['MC'].drop(columns='weight')  # inputs without the predicted weights
assert (reweighter.predict_weights(X_MC) == retrieved_reweighter.predict_weights(X_MC)).all()