In [3]:
import pandas as pd

The predictions for the observables are stored in a pandas DataFrame with a MultiIndex. The first level of the MultiIndex is the name of the dataset, while the second is an index for the individual data point within that dataset (and should be irrelevant for your purposes). The following cell lists the datasets for which observables are computed:

In [4]:
predictions_by_replicas = pd.read_pickle('predictions_with_NNPDF40_by_replica.pkl')
dataset_name_list = predictions_by_replicas.index.get_level_values(0).unique().tolist()
for name in dataset_name_list:
    print(name)

SLAC_NC_NOTFIXED_P_EM-F2
SLAC_NC_NOTFIXED_D_EM-F2
BCDMS_NC_NOTFIXED_D_EM-F2
DYE906_Z0_120GEV_DW_PDXSECRATIO
CHORUS_CC_NOTFIXED_PB_NU-SIGMARED
CHORUS_CC_NOTFIXED_PB_NB-SIGMARED
HERA_NC_318GEV_EP-SIGMARED
HERA_NC_318GEV_EAVG_BOTTOM-SIGMARED
ATLAS_WPWM_7TEV_36PB_ETA
ATLAS_Z0_7TEV_LOMASS_M
ATLAS_WPWM_7TEV_46FB_CC-ETA
ATLAS_Z0_7TEV_46FB_CC-Y
ATLAS_Z0J_8TEV_PT-M
ATLAS_Z0J_8TEV_PT-Y
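
To make the two-level index structure concrete, here is a minimal sketch on a toy DataFrame that can be run without the pickle file. The dataset names, the level names (`dataset`, `id`), and all numbers are made up for illustration; only the layout (dataset name on level 0, point index on level 1, one column per replica) is meant to mirror the real table.

```python
import pandas as pd

# Toy stand-in for the real predictions table (names and values are invented).
index = pd.MultiIndex.from_tuples(
    [("TOY_DATASET_A", 0), ("TOY_DATASET_A", 1), ("TOY_DATASET_B", 0)],
    names=["dataset", "id"],  # hypothetical level names
)
toy = pd.DataFrame(
    {
        "data_central": [1.0, 2.0, 3.0],
        "rep_00001": [1.1, 1.9, 3.2],
        "rep_00002": [0.9, 2.1, 2.8],
    },
    index=index,
)

# Level 0 holds the dataset names; unique() keeps the order of first appearance.
toy_names = toy.index.get_level_values(0).unique().tolist()

# Number of data points in each dataset (size of each level-0 group).
points_per_dataset = toy.groupby(level=0).size()

# Cross-section: all rows of one dataset, with level 0 dropped from the index.
toy_a = toy.xs("TOY_DATASET_A", level=0)

print(toy_names)
print(points_per_dataset)
print(toy_a)
```

The same three calls should work unchanged on `predictions_by_replicas`, assuming it has the layout described above.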


In the following, we select a specific dataset in the DataFrame:

In [None]:
predictions_by_replicas.xs("SLAC_NC_NOTFIXED_P_EM-F2", level=0)


id,data_central,theory_central,rep_00001,rep_00002,rep_00003,rep_00004,rep_00005,rep_00006,rep_00007,rep_00008,...,rep_00091,rep_00092,rep_00093,rep_00094,rep_00095,rep_00096,rep_00097,rep_00098,rep_00099,rep_00100
41,0.35854,0.357822,0.360565,0.359336,0.358618,0.356378,0.357733,0.363331,0.358229,0.356117,...,0.358364,0.350213,0.35838,0.356822,0.362868,0.357341,0.356719,0.35538,0.3565,0.354294
42,0.35072,0.358886,0.361493,0.360313,0.359445,0.357405,0.358811,0.364014,0.35931,0.357453,...,0.359496,0.351686,0.359204,0.357928,0.363713,0.358377,0.357617,0.35681,0.357541,0.355689
57,0.3441,0.343689,0.34683,0.344875,0.344585,0.343591,0.34317,0.344822,0.343565,0.344195,...,0.345943,0.338203,0.343885,0.342003,0.34648,0.344866,0.34109,0.340011,0.344322,0.340218
58,0.3438,0.345588,0.348446,0.346461,0.346181,0.34553,0.344823,0.346339,0.345859,0.346189,...,0.347652,0.340932,0.345698,0.344099,0.34809,0.346774,0.34307,0.342641,0.346101,0.34315
59,0.321,0.346467,0.349184,0.347168,0.346894,0.346392,0.345565,0.347109,0.346943,0.347097,...,0.348402,0.34218,0.346521,0.345105,0.348831,0.347646,0.344025,0.343907,0.346916,0.344591
60,0.35465,0.347115,0.349717,0.347666,0.347394,0.346998,0.346093,0.34772,0.347764,0.347756,...,0.348921,0.343106,0.347112,0.345877,0.349368,0.348277,0.344758,0.344887,0.34751,0.345731
61,0.33297,0.347402,0.349944,0.347873,0.347599,0.34725,0.346316,0.348008,0.348141,0.348044,...,0.349131,0.343527,0.347363,0.346235,0.349597,0.348548,0.345097,0.345352,0.347769,0.346284
75,0.33138,0.327077,0.329742,0.327843,0.32761,0.32689,0.326406,0.326221,0.327348,0.32692,...,0.328442,0.326363,0.32717,0.325159,0.326369,0.32817,0.32164,0.324972,0.329305,0.327491
76,0.33353,0.327671,0.330335,0.328376,0.328203,0.327586,0.326798,0.326678,0.328215,0.327495,...,0.328885,0.326902,0.32808,0.326046,0.327167,0.328863,0.322661,0.325883,0.329822,0.328579
77,0.32763,0.327823,0.330468,0.328479,0.32833,0.327734,0.326747,0.326763,0.328593,0.327656,...,0.328893,0.326937,0.328455,0.32647,0.327546,0.329074,0.323277,0.326278,0.329873,0.329135


As you can see, the DataFrame has many columns. The first one (`data_central`) is the experimental data point. The second one (`theory_central`) is the mean value of the predictions computed over the ensemble of replicas. The remaining columns (`rep_00001` to `rep_00100`) hold the prediction computed for each individual replica. You can therefore reconstruct the `theory_central` column by averaging over the replica columns.

In [None]:
# Reconstruct theory_central from the individual replicas

# Choose a dataset
df = predictions_by_replicas.xs("SLAC_NC_NOTFIXED_D_EM-F2", level=0)

# Pick out the replica columns (rep_00001, rep_00002, ...)
replica_cols = [col for col in df.columns if col.startswith("rep_")]

# Mean over replicas, computed separately for each data point (row)
theory_central_reconstructed = df[replica_cols].mean(axis=1)

# Difference to the stored central value, rounded to 8 decimal places
diffs = (df['theory_central'] - theory_central_reconstructed).round(8)

# create a new DataFrame to show side by side
comparison_df = pd.DataFrame({
    'theory_central': df['theory_central'],
    'theory_central_reconstructed': theory_central_reconstructed,
    'difference (8 dp)': diffs
})

print(comparison_df)


     theory_central  theory_central_reconstructed  difference (8 dp)
id                                                                  
41         0.323348                      0.323348                0.0
42         0.324227                      0.324227                0.0
43         0.325513                      0.325513                0.0
57         0.294775                      0.294775               -0.0
58         0.296025                      0.296025               -0.0
59         0.296386                      0.296386               -0.0
60         0.296538                      0.296538               -0.0
74         0.283146                      0.283146                0.0
75         0.283304                      0.283304                0.0
76         0.283150                      0.283150                0.0
77         0.282942                      0.282942                0.0
78         0.282629                      0.282629                0.0
79         0.282287               

# Important
There are two families of predictions in the DataFrame that I have provided. The predictions in the first family are linear in the PDFs, so schematically we have
$$
\mathcal{O}_I = \sum_{i}^{N_{\rm flav}}\sum_{\alpha}^{N_{\rm grid}}(FK)_{I\alpha}^{i} f_{i\alpha} \,.
$$
The other family of predictions is non-linear (quadratic) in the PDFs, and schematically we have
$$
\mathcal{O}_I = \sum_{i,j}^{N_{\rm flav}}\sum_{\alpha \beta}^{N_{\rm grid}}(FK)_{I\alpha \beta}^{ij} f_{i\alpha} f_{j\beta} \,.
$$

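For linear predictions, the shape of the distribution in PDF space should be preserved in the space of observables: for instance, a linear map sends a Gaussian distribution of PDFs to a Gaussian distribution of observables. For non-linear predictions, this is not necessarily true.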
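
The difference between the two families can be illustrated with a toy numerical sketch. Everything here is an assumption made for illustration: the "PDF replicas" are independent Gaussian samples of three parameters, the flavour and grid indices are collapsed into a single index, and the kernels `fk_lin` and `fk_quad` are invented numbers rather than real FK tables. The skewness is used as a simple measure of departure from a Gaussian shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "PDF replicas": Gaussian samples of 3 parameters (purely illustrative).
n_rep, n_par = 100_000, 3
f = rng.normal(loc=1.0, scale=0.3, size=(n_rep, n_par))

# Hypothetical linear kernel: O = sum_a FK_a f_a
fk_lin = np.array([0.5, -1.0, 2.0])
obs_lin = f @ fk_lin

# Hypothetical quadratic kernel: O = sum_ab FK_ab f_a f_b
fk_quad = np.array(
    [
        [1.0, 0.2, 0.0],
        [0.2, 0.5, 0.1],
        [0.0, 0.1, 2.0],
    ]
)
obs_quad = np.einsum("ra,ab,rb->r", f, fk_quad, f)


def skewness(x):
    """Sample skewness: third moment of the standardised values."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))


skew_lin = skewness(obs_lin)
skew_quad = skewness(obs_quad)

print(f"skewness, linear observable:    {skew_lin:+.3f}")
print(f"skewness, quadratic observable: {skew_quad:+.3f}")
```

In this toy setup the linear observable has a skewness compatible with zero, as expected for a Gaussian, while the quadratic observable is visibly skewed. How large the effect is for the real datasets depends on the actual FK tables and on the spread of the replicas.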

Note that the predictions in the DataFrame are computed using the NNPDF4.0 PDFs. Moreover, the datasets for which predictions are non-linear are those whose names begin with `ATLAS` or `DY`. All the other datasets have linear predictions.
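
The naming rule above can be turned into a small helper that splits the dataset names into the two families. This is a minimal sketch: the names are copied from the list printed earlier, while `NONLINEAR_PREFIXES` and `is_nonlinear` are hypothetical helpers introduced here, not part of any library.

```python
# Dataset names as printed earlier in this notebook.
dataset_name_list = [
    "SLAC_NC_NOTFIXED_P_EM-F2",
    "SLAC_NC_NOTFIXED_D_EM-F2",
    "BCDMS_NC_NOTFIXED_D_EM-F2",
    "DYE906_Z0_120GEV_DW_PDXSECRATIO",
    "CHORUS_CC_NOTFIXED_PB_NU-SIGMARED",
    "CHORUS_CC_NOTFIXED_PB_NB-SIGMARED",
    "HERA_NC_318GEV_EP-SIGMARED",
    "HERA_NC_318GEV_EAVG_BOTTOM-SIGMARED",
    "ATLAS_WPWM_7TEV_36PB_ETA",
    "ATLAS_Z0_7TEV_LOMASS_M",
    "ATLAS_WPWM_7TEV_46FB_CC-ETA",
    "ATLAS_Z0_7TEV_46FB_CC-Y",
    "ATLAS_Z0J_8TEV_PT-M",
    "ATLAS_Z0J_8TEV_PT-Y",
]

# Prefixes that mark non-linear (quadratic) predictions, per the rule above.
NONLINEAR_PREFIXES = ("ATLAS", "DY")


def is_nonlinear(name):
    """Return True if the dataset name starts with one of the non-linear prefixes."""
    return name.startswith(NONLINEAR_PREFIXES)


nonlinear_datasets = [n for n in dataset_name_list if is_nonlinear(n)]
linear_datasets = [n for n in dataset_name_list if not is_nonlinear(n)]

print("Linear:    ", linear_datasets)
print("Non-linear:", nonlinear_datasets)
```

With the real table loaded, `dataset_name_list` can be taken directly from `predictions_by_replicas.index.get_level_values(0).unique().tolist()` instead of being typed out.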