# DAML Project 2 - Callum Smith

## Data Preprocessing: Convert the csv file into an appropriate format for our neural networks

1. Create variables where you count the number of electrons, photons, muons, jets and bjets in the event (ignore charge)
    * each line in the csv is a single event, with objects `obj` separated by semicolons:  
    `event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...`<br>
    * e.g. row 1 of the `SM` dataset looks like:    
    `5702564; z_jets; 1; 102549; -2.9662; j,335587,132261,-1.57823,1.02902; j,107341,106680,-0.0989776, -2.67901; j,85720.1,62009,0.840127,-1.73805; j,270540,58844.5,2.20566,1.6064; j,55173.9,52433.5,-0.183147,2.62501; j,48698.6,37306.4,-0.719927,-1.7898; j,148467,23648,-2.52332,-1.70799; e-,186937,131480,0.888915,-0.185666; e+,80014.3,79281.7,0.135844,0.275231;`<br>
    * `obj`: the object type:

        |Key|Particle|
        |---|---|
        |j|jet|
        |b|b-jet|
        |e-|electron|
        |e+|positron|
        |m-|muon|
        |m+|antimuon|
        |g|photon|

    * for each event, e.g. `e` = number of electrons plus positrons (charge ignored) in that event
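
The counting rule above can be sketched on a single line of text before moving to pandas. This is a minimal illustration, not the notebook's method: `count_objects` is a hypothetical helper, and the example line is row 1 of the `SM` dataset shortened to two jets and the two leptons.

```python
from collections import Counter

def count_objects(line):
    """Count objects per type in one semicolon-separated event line, ignoring charge."""
    fields = [f.strip() for f in line.split(";")]
    objects = [f for f in fields[5:] if f]  # skip the 5 event-level fields and any empty trailing field
    codes = [obj.split(",", 1)[0].strip() for obj in objects]  # text before the first comma
    merged = Counter(code.rstrip("+-") for code in codes)  # "e-" and "e+" both become "e"
    return {key: merged.get(key, 0) for key in ["e", "m", "j", "b", "g"]}

# Shortened version of row 1 (assumption: only 4 of its 9 objects are kept here)
line = ("5702564; z_jets; 1; 102549; -2.9662; "
        "j,335587,132261,-1.57823,1.02902; "
        "j,107341,106680,-0.0989776, -2.67901; "
        "e-,186937,131480,0.888915,-0.185666; "
        "e+,80014.3,79281.7,0.135844,0.275231;")

counts = count_objects(line)
print(counts)
```

For this shortened line the helper reports two electrons, two jets and zero of everything else.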

Imports

In [149]:

import numpy as np
import pandas as pd


Read the file

In [150]:

# read our data file
path = "background_chan2b_7.8.csv" # file name
base_cols = ["event_id", "process_id", "event_weight", "MET", "METphi"] # first 5 columns before objects

# Count semicolons per line to get the max number of fields.
# N semicolons delimit N + 1 fields; if a line ends with ";" (as row 1 does), the last field is empty.
with open(path, encoding="utf-8") as f: # close the file once counted
    max_cols = max(line.count(";") for line in f) + 1
names = base_cols + [f"obj{i+1}" for i in range(max_cols - len(base_cols))] # column names: the 5 base columns, then obj1, obj2, ...

df = pd.read_csv(path, sep=";", header=None, names=names, engine="python", dtype=str) # read data separated by ; as strings


In [151]:
print(df)

        event_id process_id event_weight      MET     METphi  \
0        5702564     z_jets            1   102549    -2.9662   
1       13085335     z_jets            1   103468    1.96193   
2          74025    wtopbar            1   129408   -1.17889   
3        2419445     z_jets            1  77774.2   -1.09171   
4          43639       wtop            1   107151   -1.02642   
...          ...        ...          ...      ...        ...   
340263        30      ttbar            1  65677.3    -1.1536   
340264       111      ttbar            1  58730.1   0.529769   
340265        75      ttbar            1   342729   0.804597   
340266  15181306     z_jets            1   246999  -0.849401   
340267  19552503     z_jets            1   109060    1.48996   

                                         obj1  \
0            j,335587,132261,-1.57823,1.02902   
1           j,224322,109177,-1.34681,-1.16114   
2            b,169640,104808,-1.05742,1.86718   
3             j,220498,108012,1.333

Convert columns which are numbers to floats: `event ID`, `event weight`, `MET`, `METphi`

In [None]:

# convert number columns from strings to floats
num_cols = ["event_id", "event_weight", "MET", "METphi"]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")


Count the number of electrons, photons, muons, jets and bjets in the event

In [154]:
obj_keys = ["j", "b", "e-", "e+", "m-", "m+", "g"] # keys for each object type
obj_cols = df.columns[len(base_cols):] # take columns after the first 5, i.e [5:]

# Extract the object type from every object cell
types = df[obj_cols].map(lambda s: str(s).split(",", 1)[0].strip() if pd.notna(s) and str(s).strip() else None) # lambda applied to every cell if condition met
"""
    If the cell is not NaN and not an empty string, convert to str, split on the first comma, take the token before it (the object code), and strip whitespace.
    Otherwise return None (for missing slots) 
"""

# Count per row how many times each type appears
counts = pd.DataFrame({i: (types == i).sum(axis=1) for i in obj_keys}).astype("int64")
"""
    types == i creates a boolean DataFrame (True where the cell equals the object key i).
    .sum(axis=1) counts True per row (because True=1, False=0), i.e., how many times i appears in that event across all object slots.
    The dict comprehension builds a column per object type.
    pd.DataFrame(...) turns that dict into a DataFrame with columns in obj_keys order.
    .astype("int64") makes sure the counts are stored as 64-bit integers
"""

# columns for total electrons and total muons, regardless of charge
counts["e"] = counts["e-"] + counts["e+"]
counts["m"] = counts["m-"] + counts["m+"]


# reorder columns: |N ele| N muon| N jets| N bjets| N photons|
counts = (counts
          .drop(columns=["e-", "e+", "m-", "m+"]) # get rid of charged columns
          .reindex(columns=["e","m","j", "b", "g"], fill_value=0)) # reorder remaining columns; fill_value=0 fills empty columns with 0's not NaNs


# add these columns to the overall dataframe as integers
training = pd.concat([counts, df], axis=1)
training[["e","m","j","b","g"]] = training[["e","m","j","b","g"]].astype("int64") # convert new columns to integers


print([c for c in training.columns])  # quick sanity check


['e', 'm', 'j', 'b', 'g', 'event_id', 'process_id', 'event_weight', 'MET', 'METphi', 'obj1', 'obj2', 'obj3', 'obj4', 'obj5', 'obj6', 'obj7', 'obj8', 'obj9', 'obj10', 'obj11', 'obj12', 'obj13', 'obj14']


In [159]:
training.dtypes

e                 int64
m                 int64
j                 int64
b                 int64
g                 int64
event_id          int64
process_id       object
event_weight      int64
MET             float64
METphi          float64
obj1             object
obj2             object
obj3             object
obj4             object
obj5             object
obj6             object
obj7             object
obj8             object
obj9             object
obj10            object
obj11            object
obj12            object
obj13            object
obj14            object
dtype: object

2. Choose an appropriate number of particles to study per event (recommended: **8**)
3. **Sort by energy** (largest to smallest)
    * remember the main df is ordered as follows:  
    `event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...`<br>
4. If the event has more than 8 particles, choose the **8 particles** with the **highest energy and truncate** the rest
    * objects in the csv are not ordered by energy: in row 1 above, the fourth jet (E = 270540) has more energy than the second (E = 107341)
    * so sort by energy within each event first, then take the first 8 objects and drop the rest
5. Take the logarithm of the MET, energy and momentum variables
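
As a quick check of the sort, truncate and log steps, here is a minimal pure-Python sketch on the nine objects of row 1 of the `SM` dataset (only type, E and pt are kept). The names `objects`, `K`, `top` and `dropped` are illustrative and not part of the notebook.

```python
import math

# (type, E, pt) for the nine objects in row 1, in file order
objects = [
    ("j", 335587.0, 132261.0),
    ("j", 107341.0, 106680.0),
    ("j", 85720.1, 62009.0),
    ("j", 270540.0, 58844.5),
    ("j", 55173.9, 52433.5),
    ("j", 48698.6, 37306.4),
    ("j", 148467.0, 23648.0),
    ("e-", 186937.0, 131480.0),
    ("e+", 80014.3, 79281.7),
]

K = 8  # number of objects kept per event

# Sort by energy, largest first, then split into kept and dropped objects
by_energy = sorted(objects, key=lambda o: o[1], reverse=True)
top, dropped = by_energy[:K], by_energy[K:]

# Take logs of the energy and momentum of the kept objects
logged = [(t, math.log(E), math.log(pt)) for t, E, pt in top]

print(logged[5])  # sixth-highest-energy object
print(dropped)    # the single object that does not make the cut
```

The sixth entry is the jet with E = 85720.1, whose log energy agrees with the `logE6` value printed for row 0 in the pandas output of this section.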

In [177]:

base_cols = [i for i in training.columns[:10]]
obj_cols  = sorted([c for c in training.columns if c.startswith("obj")],
                   key=lambda s: int(s[3:]))


# Wide -> long and split into fields
S = training[obj_cols].stack() # (row, obj_col)
parts = S.str.split(",", n=4, expand=True).apply(lambda c: c.str.strip())
parts.columns = ["type","E","pt","eta","phi"]


# Numeric conversion for object fields
for i in ["E","pt","eta","phi"]:
    parts[i] = pd.to_numeric(parts[i], errors="coerce")


# Name the index levels (event row, object slot) and sort by E, largest first, within each event
parts.index = parts.index.set_names(["row","slot"])
long = parts.reset_index().sort_values(["row","E"], ascending=[True, False])


# Keep top 8 per event
long["rank"] = long.groupby("row").cumcount() + 1
topK = long[long["rank"] <= 8].copy()


# calculate logE and logpt 
eps = 1e-12 # so we have no divergence log(0)
topK["logE"]  = np.log(np.clip(topK["E"],  eps, None))
topK["logpt"] = np.log(np.clip(topK["pt"], eps, None))


# Pivot back to wide columns (type1, logE1, logpt1, ...)
wide = (topK
        .set_index(["row","rank"])[["type","logE","logpt","eta","phi"]]
        .unstack("rank"))


# Flatten columns like ('logE', 1) -> 'logE1'
wide.columns = [f"{f}{k}" for f, k in wide.columns]


# Order columns per slot: type, logE, logpt, eta, phi
ordered = []
for i in range(1, 8+1):
    ordered += [f"type{i}", f"logE{i}", f"logpt{i}", f"eta{i}", f"phi{i}"]
wide = wide.reindex(columns=[c for c in ordered if c in wide.columns])


# Merge back base/meta columns
out = (training.reset_index()
       .rename(columns={"index": "row"})[["row"] + base_cols]
       .merge(wide.reset_index(), on="row", how="left")
       .drop(columns="row"))


# calculate log(MET)
met_num = pd.to_numeric(out["MET"], errors="coerce")
out["MET"] = np.log(np.clip(met_num, eps, None))
out.rename(columns={"MET": "logMET"}, inplace=True) # rename column


In [204]:
print(out.columns)
# note: .loc slices by label, so -5: selects every row here (use .tail(5) for the last five rows)
print("\n", out.loc[-5:, "type6":"phi6"])
print("\n", out.iloc[:, 15:20].head())

Index(['e', 'm', 'j', 'b', 'g', 'event_id', 'process_id', 'event_weight',
       'logMET', 'METphi', 'type1', 'logE1', 'logpt1', 'eta1', 'phi1', 'type2',
       'logE2', 'logpt2', 'eta2', 'phi2', 'type3', 'logE3', 'logpt3', 'eta3',
       'phi3', 'type4', 'logE4', 'logpt4', 'eta4', 'phi4', 'type5', 'logE5',
       'logpt5', 'eta5', 'phi5', 'type6', 'logE6', 'logpt6', 'eta6', 'phi6',
       'type7', 'logE7', 'logpt7', 'eta7', 'phi7', 'type8', 'logE8', 'logpt8',
       'eta8', 'phi8'],
      dtype='object')

        type6      logE6     logpt6      eta6      phi6
0          j  11.358843  11.035035  0.840127 -1.738050
1        NaN        NaN        NaN       NaN       NaN
2        NaN        NaN        NaN       NaN       NaN
3        NaN        NaN        NaN       NaN       NaN
4        NaN        NaN        NaN       NaN       NaN
...      ...        ...        ...       ...       ...
340263   NaN        NaN        NaN       NaN       NaN
340264   NaN        NaN        NaN       NaN   

6. If the event has fewer than 8 particles, create kinematic variables with 0 values for the missing particles...
    * when to do this?
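
One possible answer, offered as a sketch rather than a settled choice: pad after the log transform. With the `np.clip(..., eps, None)` used above, a 0 inserted before the log would become log(1e-12) ≈ -27.6 instead of staying 0. The toy frame below stands in for `out` with two slots instead of eight, and the `"none"` label for missing object types is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `out`: the second event has only one object, so slot 2 is missing
toy = pd.DataFrame({
    "type1": ["j", "e-"], "logE1": [12.72, 12.14], "logpt1": [11.79, 11.79],
    "eta1": [-1.58, 0.89], "phi1": [1.03, -0.19],
    "type2": ["j", None], "logE2": [11.58, np.nan], "logpt2": [11.58, np.nan],
    "eta2": [-0.10, np.nan], "phi2": [-2.68, np.nan],
})

type_cols = [c for c in toy.columns if c.startswith("type")]
kin_cols = [c for c in toy.columns if c not in type_cols]

padded = toy.copy()
padded[kin_cols] = padded[kin_cols].fillna(0.0)        # missing kinematics -> 0
padded[type_cols] = padded[type_cols].fillna("none")   # hypothetical label for "no object"

print(padded)
```

Whether the type columns are kept at all depends on how the network encodes object types, which the notes above do not yet decide.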

## TEMP PROGRESS REPORT
- the ordering of training_top8 is now `e; m; j; b; g; event ID; process ID; event weight; MET; METphi; obj1; obj2; obj3; obj4; obj5; obj6; obj7; obj8;`<br>
where for each object we have `obji, log(Ei), log(pti), etai, phii`<br>  

    * ~~still need to take the log of MET: log(MET) for each event~~ __done__
    * ~~maybe go back and convert all columns which are numbers to floats instead of switching back and forth to strings~~ __done__
    * ~~maybe don't use fields for each object (also what is a field here?) but also were given in ;,,,,,; format...~~ __done__
    * what is "`MinMaxScaler` or similar" and what does it mean to standardise the variables?
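
On the last question, a short NumPy sketch of the two usual rescalings may help. Min-max scaling maps a column onto [0, 1], which is what scikit-learn's `MinMaxScaler` computes per column by default. Standardising subtracts the mean and divides by the standard deviation, which is what `StandardScaler` computes. The array `x` is made-up data, not taken from the dataset.

```python
import numpy as np

# Made-up values standing in for one input column, e.g. logE1
x = np.array([10.0, 11.0, 12.0, 14.0])

# Min-max scaling: smallest value -> 0, largest value -> 1
minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: zero mean and unit standard deviation
standard = (x - x.mean()) / x.std()

print(minmax)
print(standard.mean(), standard.std())
```

In practice the minimum, maximum, mean and standard deviation are usually computed on the training sample only and then reused for validation and test data.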

later ordering of final df: |N ele| N muon| N jets| N bjets| N photons| log(MET)| METphi| log(E1)| log(pt1)| eta1| phi1| ... | phi8|