# NPS-23-005 

Template: Getting_started.ipynb from hepdata_lib

From hepdata_lib:
The following instructions and examples should get you started to get your analysis into [HEPData](https://hepdata.net) using `hepdata_lib`. Please also refer to the [documentation](http://hepdata-lib.readthedocs.io/). While you can also run `hepdata_lib` on your local computer, you can use the [binder](https://mybinder.org/) or [SWAN](http://swan.cern.ch/) services in the browser. Mind that SWAN is only available for people with a CERN account.

Also useful reference: https://github.com/jalimena/HepData_EXO-23-016/tree/main
See "main" function in createHepData_all.py

## General setup

To make sure things are working and `hepdata_lib` is available, run the following command:

In [1]:
import hepdata_lib
print("hepdata_lib version", hepdata_lib.__version__)

hepdata_lib version 0.16.0


## Creating your HEPData submission

The `Submission` object represents the whole HEPData entry and thus carries the top-level meta data that is equally valid for all the tables and variables you may want to enter. The object is also used to create the physical submission files you will upload to the HEPData web interface.

When using `hepdata_lib` to make an entry, you always need to create a `Submission` object. Let's do that now, and then add data to it step by step:

In [2]:
from hepdata_lib import Submission
submission = Submission()

In general, a `Submission` should contain details on the analysis, such as its abstract and links to the publication. The abstract should be in a plain text file. For `inspire` there's a special `record_id`, while links to `arXiv` etc. should be plain hyperlinks.

In [3]:
submission.read_abstract("NPS25003_inputs/abstract.txt")

#Production cross section

#Pythia configurations

#Signal model UFO Files

#Generator Process cards

#Cut flow tables

#Data distributions of relevant ML input features

#Small set of input vectors & ML outputs

#Signal Efficiencies for simplified models' model points

#Statistical model


#submission.add_link("Webpage with all figures and tables", "https://cms-results.web.cern.ch/cms-results/public-results/publications/B2G-16-029/")
#submission.add_link("arXiv", "http://arxiv.org/abs/arXiv:1802.09407")
#submission.add_record_id(1657397, "inspire")
#submission.add_additional_resource("Original abstract file", "example_inputs/abstract.txt", copy_file=True)  # for illustration, probably not useful
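
Because the abstract is read from a plain text file, LaTeX macros copied from the paper source (for example particle macros such as `\PQb`) are passed through unchanged and may not render on the record page. The following is a minimal sketch for spotting them before calling `read_abstract`. The helper name `find_latex_macros` is hypothetical and not part of `hepdata_lib`, and the sample sentence is only illustrative:

```python
import re

def find_latex_macros(text):
    """Return the sorted set of backslash macros (e.g. \\PQb) found in text."""
    return sorted(set(re.findall(r"\\[A-Za-z]+", text)))

# Sample sentence in the style of a paper abstract (illustrative only).
sample = r"final states with \PQb quarks and \PGt leptons at $\sqrt{s}=13\TeV$"
print(find_latex_macros(sample))
```

Standard macros such as `\sqrt` are reported too, so the list is a prompt for a manual check rather than a list of errors.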

## Adding a table/figure

In HEPData, figures and tables are both `Table` objects. The example here reads a plain text file containing the cross section limit for each pair of scalar masses. The file is stored in the `NPS25003_inputs` directory. For your own submission, create a new directory, e.g. named after the analysis identifier.

Let's have a look at the file:

In [4]:
!head NPS25003_inputs/median_limits_allchannels.txt

40	15	47.6794
50	15	33.4076
60	15	35.9371
70	15	21.6529
80	15	22.4775
90	15	29.3897
100	15	38.8081
110	15	28.113
50	20	30.801
60	20	14.8941


The first two columns are the scalar mass values, and the third column contains the limit for that mass point.

Let's create the table/figure. First, we need to give it a name, which is usually just the identifier in the paper, here "Figure 7 Bottom Right". The table also needs a description, which is usually the caption. You also need to describe the location, i.e. where to find it in the publication:

In [5]:
from hepdata_lib import Table
table = Table("Figure 7 Bottom Right")
table.description = "Combined 95% CL observed upper limit on the cross section, using the BDT-based event categorization, as a function of scalar masses."
table.location = "Results"

Now we need to provide more information on what is actually shown, which is done via `keywords`. The ones that are available can be taken from the documentation:
- [Observables](https://hepdata-submission.readthedocs.io/en/latest/keywords/observables.html)
- [Phrases](https://hepdata-submission.readthedocs.io/en/latest/keywords/phrases.html)
- [Particles](https://hepdata-submission.readthedocs.io/en/latest/keywords/partlist.html)

In [6]:
#very unsure
table.keywords["observables"] = ["SIG"]
table.keywords["phrases"] = ["Cross Section"] #do I need phrases 
table.keywords["reactions"] = ["H --> {phi}_1 {phi}_2 --> 2{tau}4b/2{tau}2b"]
#do I need "particles"?

Let's read in the file. For this purpose, `numpy` is very handy. The file has no header rows, so nothing is skipped (`skiprows=0`):

In [7]:
import numpy as np
data = np.loadtxt("NPS25003_inputs/median_limits_allchannels.txt", skiprows=0)
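
The effect of `skiprows` can be checked on an in-memory sample before pointing `np.loadtxt` at the real file. The sketch below uses the first three rows of the input file shown earlier. The two-line header in the second sample is made up for illustration and does not exist in the actual file:

```python
import io
import numpy as np

rows = "40\t15\t47.6794\n50\t15\t33.4076\n60\t15\t35.9371\n"

# No header lines: read every row.
no_header = np.loadtxt(io.StringIO(rows), skiprows=0)

# Hypothetical two-line header: it must be skipped explicitly,
# because the labels cannot be parsed as numbers.
header = "mass_a\tmass_b\tlimit\nGeV\tGeV\tpb\n"
with_header = np.loadtxt(io.StringIO(header + rows), skiprows=2)

print(no_header.shape, with_header.shape)
print(np.array_equal(no_header, with_header))
```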

`numpy` stores the content as a two-dimensional array with one row per mass point and three columns:

In [8]:
print(data)

[[ 40.       15.       47.6794 ]
 [ 50.       15.       33.4076 ]
 [ 60.       15.       35.9371 ]
 [ 70.       15.       21.6529 ]
 [ 80.       15.       22.4775 ]
 [ 90.       15.       29.3897 ]
 [100.       15.       38.8081 ]
 [110.       15.       28.113  ]
 [ 50.       20.       30.801  ]
 [ 60.       20.       14.8941 ]
 [ 70.       20.       14.7805 ]
 [ 80.       20.       16.5984 ]
 [ 90.       20.       22.9752 ]
 [100.       20.       24.729  ]
 [ 70.       30.       33.5855 ]
 [ 80.       30.       29.888  ]
 [ 90.       30.       26.7847 ]
 [ 20.       15.        2.94763]
 [ 30.       15.        7.50414]
 [ 30.       20.        2.69308]
 [ 40.       20.        4.64869]
 [ 40.       30.        1.03961]
 [ 50.       30.        1.05894]
 [ 60.       30.        2.54844]
 [ 50.       40.        1.12181]
 [ 60.       40.        1.48343]
 [ 70.       40.        5.11465]
 [ 80.       40.        8.32959]
 [ 60.       50.        2.03326]
 [ 70.       50.        3.38112]]


We will now use this for our `Variable` definitions. Here the two scalar masses are the independent variables (`is_independent=True`), whereas the limit is the dependent variable (i.e. a function of the former). You also need to declare whether each variable is binned, as well as its units. As with the `keywords` used above, it is important to provide additional information that can be found via the HEPData web interface using the observables and particles linked above. The values assigned are built from slices of the `data` array:

In [9]:
from hepdata_lib import Variable
import numpy as np

# Column meaning
y_vals = data[:, 0]   # FIRST column = y bin centers
x_vals = data[:, 1]   # SECOND column = x bin centers
z_vals = data[:, 2]   # bin content

# Build bin edges from centers
def make_edges(centers):
    centers = np.unique(centers.astype(float))
    edges = np.zeros(len(centers) + 1)
    edges[1:-1] = 0.5 * (centers[1:] + centers[:-1])
    edges[0] = centers[0] - (edges[1] - centers[0])
    edges[-1] = centers[-1] + (centers[-1] - edges[-2])
    return centers, edges

y_centers, y_edges = make_edges(y_vals)
x_centers, x_edges = make_edges(x_vals)

# Independent variables
phi2_mass = Variable(
    "phi_2 mass",
    is_independent=True,
    is_binned=True,
    units="GeV"
)

phi1_mass = Variable(
    "phi_1 mass",
    is_independent=True,
    is_binned=True,
    units="GeV"
)

# Map center -> edge tuple
y_edges_map = {y: (y_edges[i], y_edges[i+1]) for i, y in enumerate(y_centers)}
x_edges_map = {x: (x_edges[i], x_edges[i+1]) for i, x in enumerate(x_centers)}

# Only include bins that exist in your data
phi2_mass.values = [y_edges_map[y] for y in y_vals]
phi1_mass.values = [x_edges_map[x] for x in x_vals]

# Dependent variable
median_limit = Variable(
    "Median limit",
    is_independent=False,
    is_binned=False,
    units="pb"
)

median_limit.values = [float(v) for v in z_vals]
# median_limit.add_qualifier("95% CL", "upper limit")

# Add to table
table.add_variable(phi2_mass)
table.add_variable(phi1_mass)
table.add_variable(median_limit)
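
To make the bin construction concrete, here is what a midpoint-based edge builder like `make_edges` returns for the distinct second-column values in the data (15, 20, 30, 40, 50). The function is repeated so that the block runs on its own, with only the input conversion adjusted to accept a plain list:

```python
import numpy as np

def make_edges(centers):
    """Edges halfway between neighbouring centers; outer edges mirrored."""
    centers = np.unique(np.asarray(centers, dtype=float))
    edges = np.zeros(len(centers) + 1)
    edges[1:-1] = 0.5 * (centers[1:] + centers[:-1])
    edges[0] = centers[0] - (edges[1] - centers[0])
    edges[-1] = centers[-1] + (centers[-1] - edges[-2])
    return centers, edges

# Duplicates are removed by np.unique before the edges are built.
centers, edges = make_edges([15, 15, 20, 30, 40, 50])
print(edges)  # values: 12.5, 17.5, 25, 35, 45, 55
```

Because the spacing between centers is uneven, the bins are not symmetric around their centers: the point at 20 gets the bin (17.5, 25). Whether that matches the binning drawn in the figure should be checked against the plotting code.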


In case of a plot, you should also add the original figure itself. `hepdata_lib` will take care of creating the thumbnail as well. Just add the figure as below.

*WARNING*: This needs `ImageMagick` to be installed (this is the case when running on Binder and SWAN with LCG_94 or later). Executing the following line will fail if it is missing. In this case, comment out this line and restart from the top.

In [10]:
#table.add_image("NPS25003_inputs/plotLimit_2d_allchannels-2.pdf")

If you want, the original data file can be attached to the table as an additional resource file.

In [11]:
table.add_additional_resource("Original data file", "NPS25003_inputs/median_limits_allchannels.txt", copy_file=True)

This is all that's needed for the table/figure. We still need to add it to the submission:

In [12]:
submission.add_table(table)

Once you've added all tables/figures and the general submission details, you should add a few more keywords to all tables for better identification and searchability, e.g. the centre-of-mass energy:

In [13]:
for table in submission.tables:
    table.keywords["cmenergies"] = [13000]

Now it's time to create the submission for the upload. We first print a quick sanity check of the variable contents, then write the files to the `NPS25003_output` directory:

In [14]:
print("Independent phi2_mass:", phi2_mass.values[:5])
print("Independent phi1_mass:", phi1_mass.values[:5])
print("Dependent median_limit (first 10):", median_limit.values[:10])
print("NaNs in dependent:", np.sum(np.isnan(median_limit.values)))
print("Non-float dependent entries:", sum(not isinstance(v,float) for v in median_limit.values))


Independent phi2_mass: [(35.0, 45.0), (45.0, 55.0), (55.0, 65.0), (65.0, 75.0), (75.0, 85.0)]
Independent phi1_mass: [(12.5, 17.5), (12.5, 17.5), (12.5, 17.5), (12.5, 17.5), (12.5, 17.5)]
Dependent median_limit (first 10): [47.6794, 33.4076, 35.9371, 21.6529, 22.4775, 29.3897, 38.8081, 28.113, 30.801, 14.8941]
NaNs in dependent: 0
Non-float dependent entries: 0
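
The same checks can be bundled into a small helper that fails loudly instead of printing. This is a sketch only: `check_table_columns` is a hypothetical name, and it works on plain lists so it does not depend on `hepdata_lib`:

```python
import math

def check_table_columns(bins_a, bins_b, values):
    """Raise ValueError if the columns are inconsistent; return the row count."""
    if not (len(bins_a) == len(bins_b) == len(values)):
        raise ValueError("columns have different lengths")
    for low, high in list(bins_a) + list(bins_b):
        if not low < high:
            raise ValueError(f"bad bin ({low}, {high})")
    if any(math.isnan(v) for v in values):
        raise ValueError("NaN in dependent values")
    return len(values)

# First two rows of the printed output above.
n = check_table_columns(
    [(35.0, 45.0), (45.0, 55.0)],
    [(12.5, 17.5), (12.5, 17.5)],
    [47.6794, 33.4076],
)
print(n)
```

In this notebook it could be called as `check_table_columns(phi2_mass.values, phi1_mass.values, median_limit.values)`.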


In [15]:
outdir = "NPS25003_output"
submission.create_files(outdir, remove_old=True)

In the working directory, you will now find a `submission.tar.gz` file, which you can use for uploading to your HEPData sandbox:

In [16]:
!ls submission.tar.gz

submission.tar.gz
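
Before uploading, it can be useful to confirm what ended up in the archive. The following is a minimal sketch using the standard-library `tarfile` module. The helper name `list_archive` is hypothetical, and the archive is assumed to be the `submission.tar.gz` in the working directory:

```python
import os
import tarfile

def list_archive(path):
    """Return the sorted member names of a (possibly compressed) tar archive."""
    with tarfile.open(path, "r:*") as archive:
        return sorted(archive.getnames())

# Only inspect the archive if it has already been created.
if os.path.exists("submission.tar.gz"):
    for name in list_archive("submission.tar.gz"):
        print(name)
```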


And the `NPS25003_output` directory will contain the generated `yaml` files:

In [17]:
!ls NPS25003_output

plotLimit_2d_allchannels-2.png	thumb_plotLimit_2d_allchannels-2.png
submission.yaml


In [18]:
!cat NPS25003_output/submission.yaml

---
additional_resources:
- description: Created with hepdata_lib 0.16.0
  location: https://doi.org/10.5281/zenodo.1217998
comment: A search for Higgs boson decays to a pair of neutral scalars \Paa and \Pab
  with unequal masses is performed in final states with \PQb quarks and \PGt leptons.
  Depending on the masses of the neutral scalars, \Pab can undergo a cascade decay
  to a pair of \Paa scalars. For both the cascade and non-cascade scenario, one of
  the \Paa is always considered to decay to \PGt leptons, used for the online selection
  of events. A data sample of proton-proton collisions at $\sqrt{s}=13\TeV$ corresponding
  to an integrated luminosity of 138\fbinv recorded with the CMS detector at the LHC
  is analyzed. No statistically significant excess is observed over the standard model
  backgrounds. Upper limits are set on the Higgs boson branching fraction to $\Paa\Pab
  \to 2\PGt4\PQb$ and to $\Paa\Pab \to 2\PGt2\PQb$ along with the corresponding cross
  s

In [19]:
!cat NPS25003_output/figure_7_bottom_right.yaml

cat: example_output/additional_figure_1.yaml: No such file or directory
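
When the exact name of a table file is uncertain, listing the YAML files in the output directory avoids guessing. A small sketch with `pathlib` follows; `list_yaml_files` is a hypothetical helper, and the directory name is the `outdir` used above:

```python
from pathlib import Path

def list_yaml_files(directory):
    """Return the names of all .yaml files directly inside directory, sorted."""
    return sorted(p.name for p in Path(directory).glob("*.yaml"))

print(list_yaml_files("NPS25003_output"))
```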
