## Part 3: Systematic Uncertainties
In this section, we will learn how to add systematic uncertainties to a parametric fit analysis. The python commands in this notebook are taken from the `systematics.py` script. 

For uncertainties which only affect the process normalisation, we can simply implement these as `lnN` uncertainties in the datacard. The file `mc_part3.root` contains the systematic-varied trees, i.e. Monte-Carlo events where some systematic uncertainty source `{photonID,JEC,scale,smear}` has been varied up and down by $1\sigma$.

In [None]:
import ROOT
from IPython.display import Image

In [None]:
f = ROOT.TFile("mc_part3.root")
f.ls()

Let's first load the systematic-varied trees as RooDataSets and store them in a python dictionary, `mc`:

In [None]:
# Define mass and weight variables
mass = ROOT.RooRealVar("CMS_hgg_mass", "CMS_hgg_mass", 125, 100, 180)
weight = ROOT.RooRealVar("weight","weight",0,0,1)

mc = {}

# Load the nominal dataset
t = f.Get("ggH_Tag0")
mc['nominal'] = ROOT.RooDataSet("ggH_Tag0","ggH_Tag0", t, ROOT.RooArgSet(mass,weight), "", "weight" )

# Load the systematic-varied datasets
for syst in ['JEC','photonID','scale','smear']:
    for direction in ['Up','Down']:
        key = "%s%s01Sigma"%(syst,direction)
        name = "ggH_Tag0_%s"%(key)
        t = f.Get(name)
        mc[key] = ROOT.RooDataSet(name, name, t, ROOT.RooArgSet(mass,weight), "", "weight" )

The jet energy scale (JEC) and photon identification (photonID) uncertainties do not affect the shape of the $m_{\gamma\gamma}$ distribution, i.e. they only affect the signal yield estimate. We can calculate their impact by comparing the sum of weights to that of the nominal dataset. Note that the photonID uncertainty changes the weight of the events in the tree, whereas the JEC-varied trees contain a different set of events, generated by shifting the jet energy scale in the simulation. In either case, the method for calculating the yield variations is the same:

In [None]:
for syst in ['JEC','photonID']:
    for direction in ['Up','Down']:
        yield_variation = mc['%s%s01Sigma'%(syst,direction)].sumEntries()/mc['nominal'].sumEntries()
        print("Systematic varied yield (%s,%s): %.3f"%(syst,direction,yield_variation))

We can write these yield variations in the datacard with the lines:

```
CMS_scale_j           lnN      0.951/1.056      -
CMS_hgg_phoIdMva      lnN      1.05             -   
```

* Why is the photonID uncertainty expressed as one number, whereas the JEC uncertainty is defined by two?

Note that in this analysis there are no systematic uncertainties affecting the background estimate (hence the "-" in the datacard), as the background model has been derived directly from data.
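
As a small illustration of how the two `lnN` lines above are read, the sketch below converts a datacard value into the down and up yield multipliers it implies. It assumes the usual combine convention that an asymmetric entry is written as `down/up` and that a single value `k` means the yield scales by `k` for the up variation and by `1/k` for the down variation. The helper name `lnn_variations` is hypothetical and not part of `systematics.py`.

```python
def lnn_variations(entry):
    """Return the (down, up) yield multipliers implied by an lnN datacard value."""
    if "/" in entry:
        # Asymmetric entry, assumed to be written as "down/up"
        down, up = (float(x) for x in entry.split("/"))
    else:
        # Symmetric entry: a single kappa, with the down variation taken as 1/kappa
        up = float(entry)
        down = 1.0 / up
    return down, up


for name, entry in [("CMS_scale_j", "0.951/1.056"), ("CMS_hgg_phoIdMva", "1.05")]:
    down, up = lnn_variations(entry)
    print("%-18s down = %.3f, up = %.3f" % (name, down, up))
```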

### Parametric shape uncertainties
What about systematic uncertainties which affect the shape of the mass distribution?

In a parametric analysis, we need to build the dependence directly into the model parameters. The example uncertainty sources in this tutorial are the photon energy scale and smearing uncertainties. From the names alone we can expect that the **scale** uncertainty will affect the mean of the signal Gaussian, and the **smear** uncertainty will impact the resolution (sigma). Let's first take a look at the `scaleUp01Sigma` dataset:

In [None]:
# Build the model to fit the systematic-varied datasets
mean = ROOT.RooRealVar("mean", "mean", 125, 124, 126)
sigma = ROOT.RooRealVar("sigma", "sigma", 2, 1.5, 2.5)
gaus = ROOT.RooGaussian("model", "model", mass, mean, sigma)

# Run the fits twice (second time from the best-fit of first run) to obtain more reliable results
gaus.fitTo(mc['scaleUp01Sigma'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1))
gaus.fitTo(mc['scaleUp01Sigma'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1))
print("Mean = %.3f +- %.3f GeV, Sigma = %.3f +- %.3f GeV"%(mean.getVal(),mean.getError(),sigma.getVal(),sigma.getError()) )

Now let's compare the values to the nominal fit for all systematic-varied trees. We observe a significant variation in the mean for the **scale** uncertainty, and a significant variation in sigma for the **smear** uncertainty. 

In [None]:
# First fit the nominal dataset
gaus.fitTo(mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )
gaus.fitTo(mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )
# Save the mean and sigma values and errors to python dicts
mean_values, sigma_values = {}, {}
mean_values['nominal'] = [mean.getVal(),mean.getError()]
sigma_values['nominal'] = [sigma.getVal(),sigma.getError()]

# Next for the systematic varied datasets
for syst in ['scale','smear']:
    for direction in ['Up','Down']:
        key = "%s%s01Sigma"%(syst,direction)
        gaus.fitTo(mc[key], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1))
        gaus.fitTo(mc[key], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1))
        mean_values[key] = [mean.getVal(), mean.getError()]
        sigma_values[key] = [sigma.getVal(), sigma.getError()]

# Print the variations in mean and sigma
for key in mean_values.keys():
    print("%s: mean = %.3f +- %.3f GeV, sigma = %.3f +- %.3f GeV"%(key,mean_values[key][0],mean_values[key][1],sigma_values[key][0],sigma_values[key][1]))

The values tell us that the scale uncertainty (at $\pm 1 \sigma$) varies the signal peak mean by around 0.3%, and the smear uncertainty (at $\pm 1 \sigma$) varies the signal width (sigma) by around 4.5% (average of up and down variations). 
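
The averaging of the up and down variations can be written out explicitly. This is a minimal sketch with a hypothetical helper, `relative_shift`; the numbers are illustrative placeholders chosen to reproduce the quoted percentages, not the actual fit output. In the notebook you would pass in the entries of `mean_values` and `sigma_values` instead.

```python
def relative_shift(nominal, up, down):
    """Average of |up - nominal| and |down - nominal|, relative to nominal."""
    return 0.5 * (abs(up - nominal) + abs(down - nominal)) / nominal


# Illustrative placeholder values (NOT the tutorial's fit results)
example_mean = {"nominal": 125.0, "up": 125.375, "down": 124.625}
example_sigma = {"nominal": 2.0, "up": 2.09, "down": 1.91}

print("scale: mean varies by %.2f%%" % (100 * relative_shift(**example_mean)))
print("smear: sigma varies by %.2f%%" % (100 * relative_shift(**example_sigma)))
```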

Now we need to bake these effects into the parametric signal model. The mean of the Gaussian was previously defined as:

$$ \mu = m_H + \delta$$

We introduce the nuisance parameter `nuisance_scale` = $\eta$ to account for a shift in the signal peak mean using:

$$ \mu = (m_H + \delta) \cdot (1+0.003\eta)$$

At $\eta = +1 (-1)$ the signal peak mean will shift up (down) by 0.3%. To build this into the RooFit signal model we simply define a new parameter, $\eta$, and update the definition of the mean formula variable:

In [None]:
# Building the workspace with systematic variations
MH = ROOT.RooRealVar("MH", "MH", 125, 120, 130 )
MH.setConstant(True)

# Define formula for mean of Gaussian
dMH = ROOT.RooRealVar("dMH_ggH_Tag0", "dMH_ggH_Tag0", 0, -5, 5 )
eta = ROOT.RooRealVar("nuisance_scale", "nuisance_scale", 0, -5, 5)
eta.setConstant(True)
mean_formula = ROOT.RooFormulaVar("mean_ggH_Tag0", "mean_ggH_Tag0", "(@0+@1)*(1+0.003*@2)", ROOT.RooArgList(MH,dMH,eta))

* Why do we set the nuisance parameter to constant at this stage?

Similarly for the width, we introduce a nuisance parameter, $\chi$:

$$ \sigma = \sigma_{\mathrm{nominal}} \cdot (1+0.045\chi)$$

In [None]:
sigma = ROOT.RooRealVar("sigma_ggH_Tag0_nominal", "sigma_ggH_Tag0_nominal", 2, 1, 5)
chi = ROOT.RooRealVar("nuisance_smear", "nuisance_smear", 0, -5, 5)
chi.setConstant(True)
sigma_formula = ROOT.RooFormulaVar("sigma_ggH_Tag0", "sigma_ggH_Tag0", "@0*(1+0.045*@1)", ROOT.RooArgList(sigma,chi))
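
To see what these two formula variables do, the same expressions can be evaluated in plain Python without ROOT. This is only a sketch: the function names are hypothetical, and the nominal width of 2 GeV is an illustrative placeholder rather than the fitted value.

```python
def shifted_mean(m_h, delta, eta, scale_unc=0.003):
    """Same expression as the mean formula: (@0+@1)*(1+0.003*@2)."""
    return (m_h + delta) * (1.0 + scale_unc * eta)


def shifted_sigma(sigma_nominal, chi, smear_unc=0.045):
    """Same expression as the sigma formula: @0*(1+0.045*@1)."""
    return sigma_nominal * (1.0 + smear_unc * chi)


# Evaluate at the -1 sigma, nominal and +1 sigma values of each nuisance parameter
for value in (-1, 0, 1):
    print("nuisance = %+d: mean = %.3f GeV, sigma = %.3f GeV"
          % (value, shifted_mean(125.0, 0.0, value), shifted_sigma(2.0, value)))
```

With both nuisance parameters at 0 the nominal model is recovered, which is why the fit to the nominal Monte-Carlo dataset is unaffected by their presence.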

Let's now fit the new model to the signal Monte-Carlo dataset, build the normalisation object and save the workspace.

In [None]:
# Define Gaussian
model = ROOT.RooGaussian( "model_ggH_Tag0", "model_ggH_Tag0", mass, mean_formula, sigma_formula )

# Fit model to MC
model.fitTo( mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )

# Build signal model normalisation object
xs_ggH = ROOT.RooRealVar("xs_ggH", "Cross section of ggH in [pb]", 48.58 )
br_gamgam = ROOT.RooRealVar("BR_gamgam", "Branching ratio of Higgs to gamma gamma", 0.0027 )
eff = mc['nominal'].sumEntries()/(xs_ggH.getVal()*br_gamgam.getVal())
eff_ggH_Tag0 = ROOT.RooRealVar("eff_ggH_Tag0", "Efficiency for ggH events to land in Tag0", eff )
# Set values to be constant
xs_ggH.setConstant(True)
br_gamgam.setConstant(True)
eff_ggH_Tag0.setConstant(True)
# Define normalisation component as product of these three variables
norm_sig = ROOT.RooProduct("model_ggH_Tag0_norm", "Normalisation term for ggH in Tag 0", ROOT.RooArgList(xs_ggH,br_gamgam,eff_ggH_Tag0))

# Set shape parameters of model to be constant (i.e. fixed in fit to data)
dMH.setConstant(True)
sigma.setConstant(True)

# Build new signal model workspace with signal normalisation term. 
f_out = ROOT.TFile("workspace_sig_with_syst.root", "RECREATE")
w_sig = ROOT.RooWorkspace("workspace_sig","workspace_sig")
getattr(w_sig, "import")(model)
getattr(w_sig, "import")(norm_sig)
w_sig.Print()
w_sig.Write()
f_out.Close()

The final step is to add the parametric uncertainties as Gaussian-constrained nuisance parameters into the datacard. The syntax means the Gaussian constraint term in the likelihood function will have a mean of 0 and a width of 1.
```
nuisance_scale        param    0.0    1.0
nuisance_smear        param    0.0    1.0
```
* Try adding these lines to `datacard_part1_with_norm.txt`, along with the lines for the JEC and photonID yield uncertainties above, and compiling with the `text2workspace` command. Open the workspace and look at its contents. You will need to change the signal process workspace file name in the datacard to point to the new workspace (`workspace_sig_with_syst.root`).
* Can you see the new objects in the compiled datacard that have been created for the systematic uncertainties? What do they correspond to?
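
For intuition, the effect of a `param` line with mean 0 and width 1 can be sketched as a penalty added to the negative log-likelihood. The function below is a plain-Python illustration under that assumption (dropping terms that do not depend on the nuisance parameter); it is not the object that combine builds internally, and its name is hypothetical.

```python
def gaussian_constraint_nll(theta, mean=0.0, width=1.0):
    """Negative log of a Gaussian constraint term, up to a theta-independent constant."""
    return 0.5 * ((theta - mean) / width) ** 2


# Penalty paid for pulling the nuisance parameter away from its nominal value
for theta in (0.0, 1.0, 2.0):
    print("theta = %.1f -> penalty = %.2f" % (theta, gaussian_constraint_nll(theta)))
```

A pull of one unit costs 0.5 units of negative log-likelihood, so the data must improve the fit by more than that for the nuisance parameter to move that far.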

In [None]:
%%bash
# Compile datacard with systematic uncertainties included

We can now run a fit with the systematic uncertainties included. The option `--saveSpecifiedNuis` can be used to save the postfit nuisance parameter values in the combine output limit tree.

In [None]:
%%bash
combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH --saveWorkspace \
-n .bestfit.with_syst --saveSpecifiedNuis CMS_scale_j,CMS_hgg_phoIdMva,nuisance_scale,nuisance_smear

* What do the postfit values of the nuisances tell us here? You can check them by opening the output file (`root higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root`) and running `limit->Show(0)`.
* Try plotting the postfit mass distribution (as detailed in part 2). Do you notice any difference?

### Uncertainty breakdown
A more complete datacard with additional nuisance parameters is stored in `datacard_part3.txt`. We will use this datacard for the rest of part 3. Open the text file and have a look at the contents.

In [None]:
# Let's open the datacard and take a look
with open("datacard_part3.txt","r") as f:
    lines = f.readlines()
    
print("".join(lines))

The following line has been appended to the end of the datacard to define the set of theory nuisance parameters. This will come in handy when calculating the uncertainty breakdown.
```
theory group = BR_hgg QCDscale_ggH pdf_Higgs_ggH alphaS_ggH UnderlyingEvent PartonShower
```
Compile the datacard and run an observed `MultiDimFit` likelihood scan over the signal strength, r:

In [None]:
%%bash
text2workspace.py datacard_part3.txt -m 125

combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH \
-n .scan.with_syst --algo grid --points 20 --setParameterRanges r=0.5,2.5
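
The 68% confidence interval is read off a scan like this from the points where $2\Delta\mathrm{NLL}$ crosses 1. The sketch below illustrates that step with linear interpolation on a toy parabolic scan; the helper `crossings`, the toy best-fit value and the toy width are all assumptions for illustration, and the plotting script used later may interpolate differently.

```python
def crossings(r_values, two_dnll, level=1.0):
    """Linearly interpolate the r values at which 2*deltaNLL crosses `level`."""
    found = []
    points = list(zip(r_values, two_dnll))
    for (r0, y0), (r1, y1) in zip(points[:-1], points[1:]):
        # A sign change of (y - level) between neighbours brackets a crossing
        if (y0 - level) * (y1 - level) < 0:
            found.append(r0 + (level - y0) * (r1 - r0) / (y1 - y0))
    return found


# Toy parabolic scan (illustrative numbers only): best fit 1.4, uncertainty 0.25
r_hat, width = 1.4, 0.25
r_grid = [0.5 + 0.1 * i for i in range(21)]
scan = [((r - r_hat) / width) ** 2 for r in r_grid]

lo, hi = crossings(r_grid, scan)
print("68%% interval: [%.3f, %.3f]" % (lo, hi))
```

The interpolated interval is close to, but not exactly, the true 1.15 to 1.65 of the toy parabola; a finer grid reduces the difference.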

Our aim is to break down the total uncertainty into the systematic and statistical components. To get the statistical-uncertainty-only scan it should be as simple as freezing the nuisance parameters in the fit... right? 

Try it by adding `,allConstrainedNuisances` to the `--freezeParameters` option. This will freeze all (constrained) nuisance parameters in the fit. You can also feed in regular expressions with wildcards using `rgx{.*}`. For instance to freeze only the `nuisance_scale` and `nuisance_smear` you could run with `--freezeParameters MH,rgx{nuisance_.*}`.
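
To check which parameters a pattern such as `rgx{nuisance_.*}` would select, the expression inside the braces can be tried out with Python's `re` module. This sketch assumes the pattern must match the full parameter name, which may differ in detail from combine's own matching; the helper name is hypothetical, and the parameter list is just a few of the names that appear in this part of the tutorial.

```python
import re


def frozen_by_pattern(pattern, parameter_names):
    """Return the parameter names fully matched by the regular expression."""
    regex = re.compile(pattern)
    return [name for name in parameter_names if regex.fullmatch(name)]


params = ["CMS_scale_j", "CMS_hgg_phoIdMva", "nuisance_scale",
          "nuisance_smear", "BR_hgg", "QCDscale_ggH"]

print(frozen_by_pattern("nuisance_.*", params))
```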

In [None]:
%%bash
combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH,allConstrainedNuisances \
-n .scan.with_syst.statonly --algo grid --points 20 --setParameterRanges r=0.5,2.5

You can plot the two likelihood scans on the same axis with the command:

In [None]:
%%bash
plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label "With systematics" \
--main-color 1 --others higgsCombine.scan.with_syst.statonly.MultiDimFit.mH125.root:"Stat-only":2 -o part3_scan_v0

In [None]:
# Let's open the png file and plot it here
Image(filename='part3_scan_v0.png', width=500) 

* Can you spot the problem? 

The nuisance parameters introduced into the model have pulled the best-fit signal strength point! Therefore we cannot simply subtract the uncertainties in quadrature to get an estimate for the systematic/statistical uncertainty breakdown. 

The correct approach is to freeze the nuisance parameters to their respective best-fit values in the stat-only scan. We can do this by first saving a postfit workspace with all nuisance parameters profiled in the fit. Then we load the postfit snapshot values of the nuisance parameters (with the option `--snapshotName MultiDimFit`) from the combine output of the previous step, and then freeze the nuisance parameters for the stat-only scan.

In [None]:
%%bash
combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH \
-n .bestfit.with_syst --setParameterRanges r=0.5,2.5 --saveWorkspace

combine -M MultiDimFit higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root -m 125 \
--freezeParameters MH,allConstrainedNuisances -n .scan.with_syst.statonly_correct --algo grid --points 20 \
--setParameterRanges r=0.5,2.5 --snapshotName MultiDimFit

Adding the option `--breakdown syst,stat` to the `plot1DScan.py` command will automatically calculate the uncertainty breakdown for you.

In [None]:
%%bash
plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label "With systematics" \
--main-color 1 --others higgsCombine.scan.with_syst.statonly_correct.MultiDimFit.mH125.root:"Stat-only":2 \
-o part3_scan_v1 --breakdown syst,stat

In [None]:
# Let's open the png file and plot it here
Image(filename='part3_scan_v1.png', width=500) 

We can also freeze groups of nuisance parameters defined in the datacard with the option `--freezeNuisanceGroups`. Let's run a scan freezing only the theory uncertainties (using the nuisance group we defined in the datacard):

In [None]:
%%bash
combine -M MultiDimFit higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root -m 125 --freezeParameters MH \
--freezeNuisanceGroups theory -n .scan.with_syst.freezeTheory --algo grid --points 20 \
--setParameterRanges r=0.5,2.5 --snapshotName MultiDimFit

To break down the total uncertainty into the theory, experimental and statistical components, we can then use:

In [None]:
%%bash
plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label Total --main-color 1 \
--others higgsCombine.scan.with_syst.freezeTheory.MultiDimFit.mH125.root:"Freeze theory":4 \
higgsCombine.scan.with_syst.statonly_correct.MultiDimFit.mH125.root:"Stat-only":2 \
-o part3_scan_v2 --breakdown theory,exp,stat

In [None]:
# Let's open the png file and plot it here
Image(filename='part3_scan_v2.png', width=500) 
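
The arithmetic behind such a breakdown is a chain of subtractions in quadrature between scans that freeze progressively more nuisance parameters. The sketch below assumes that procedure; the helper `breakdown` is hypothetical and the input uncertainties are illustrative placeholders, not results from this analysis.

```python
import math


def breakdown(uncertainties):
    """Split a total uncertainty into components.

    `uncertainties` is ordered from the full scan to the stat-only scan, each
    scan freezing more nuisance parameters than the previous one. Successive
    entries are subtracted in quadrature; the last entry is kept as it is.
    """
    parts = []
    for current, following in zip(uncertainties[:-1], uncertainties[1:]):
        parts.append(math.sqrt(max(current ** 2 - following ** 2, 0.0)))
    parts.append(uncertainties[-1])
    return parts


# Illustrative placeholders: total, theory frozen, all constrained nuisances frozen
total, freeze_theory, stat_only = 0.30, 0.27, 0.25
theory, exp, stat = breakdown([total, freeze_theory, stat_only])
print("theory = %.3f, exp = %.3f, stat = %.3f" % (theory, exp, stat))
```

By construction the three components add back in quadrature to the total, which only makes sense when all scans share the same best-fit point.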

These methods are not limited to this particular grouping of systematics. We can use the above procedure to assess the impact of any nuisance parameter(s) on the signal strength confidence interval. 
* Try and calculate the contribution to the total uncertainty from the luminosity estimate using this approach.

### Impacts
It is often useful/required to check the impacts of the nuisance parameters (NP) on the parameter of interest, r. The impact of a NP is defined as the shift $\Delta r$ induced as the NP, $\theta$, is fixed to its $\pm1\sigma$ values, with all other parameters profiled as normal. More information can be found in the combine documentation via this [link](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/nonstandard/#nuisance-parameter-impacts).
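
The definition above can be sketched in a few lines: for each NP, take the best-fit r with the NP fixed at its $+1\sigma$ and $-1\sigma$ values and subtract the nominal best-fit r. The helper names and all numbers below are hypothetical placeholders for illustration; ordering the parameters by the size of their impact is an assumption about how one would typically summarise them.

```python
def impacts(r_best, r_at_theta_up, r_at_theta_down):
    """Shifts in r when the NP is fixed to its +1 sigma and -1 sigma values."""
    return r_at_theta_up - r_best, r_at_theta_down - r_best


def rank_by_impact(r_best, fits):
    """Order NP names by their largest absolute shift in r.

    `fits` maps each NP name to (r at theta +1 sigma, r at theta -1 sigma).
    """
    def size(name):
        return max(abs(shift) for shift in impacts(r_best, *fits[name]))
    return sorted(fits, key=size, reverse=True)


# Placeholder fit results (NOT from this analysis)
example_fits = {
    "nuisance_scale": (1.43, 1.37),
    "CMS_hgg_phoIdMva": (1.33, 1.47),
    "CMS_scale_j": (1.38, 1.42),
}
print(rank_by_impact(1.40, example_fits))
```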

Let's calculate the impacts for our analysis. We can use the `combineTool.py` from the `CombineHarvester` package to automate the scripts. The impacts are calculated in a few stages:

In [None]:
%%bash
# 1) Do an initial fit for the parameter of interest, adding the `--robustFit 1` option:
combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts \
--setParameterRanges r=0.5,2.5 --doInitialFit --robustFit 1

* What does the option `--robustFit 1` do? 

In [None]:
%%bash
# 2) Next perform a similar scan for each NP with the `--doFits` option. This may take a few minutes
combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH \
-n .impacts --setParameterRanges r=0.5,2.5 --doFits --robustFit 1

In [None]:
%%bash
# 3) Collect the outputs from the previous step and write the results to a json file:
combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH \
-n .impacts --setParameterRanges r=0.5,2.5 -o impacts_part3.json

In [None]:
%%bash
# 4) Produce a plot summarising the nuisance parameter values and impacts:
plotImpacts.py -i impacts_part3.json -o impacts_part3

Open the output pdf file. There is a lot of information in these plots, which can be invaluable to analysers in understanding the fit. Do you understand everything that the plot is showing?
* Which NP has the highest impact on the signal strength measurement?
* Which NP is pulled the most in the fit to data? What does this information imply about the signal model mean in relation to the data?
* Which NP is the most constrained in the fit to the data? What does it mean for a nuisance parameter to be constrained?
* Try adding the option `--summary` to the impacts plotting command. This is a nice new feature in combine!