# DAML Project 2 - Callum Smith

## Data Preprocessing: Convert the csv file into an appropriate format for our neural networks

1. Create variables where you count the number of electrons, photons, muons, jets and bjets in the event (ignore charge)
    * each line in the csv is a single event, with objects `obj` separated by semicolons:  
    `event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...`<br>
    * e.g. row 1 of the `SM` dataset looks like:    
    `5702564; z_jets; 1; 102549; -2.9662; j,335587,132261,-1.57823,1.02902; j,107341,106680,-0.0989776, -2.67901; j,85720.1,62009,0.840127,-1.73805; j,270540,58844.5,2.20566,1.6064; j,55173.9,52433.5,-0.183147,2.62501; j,48698.6,37306.4,-0.719927,-1.7898; j,148467,23648,-2.52332,-1.70799; e-,186937,131480,0.888915,-0.185666; e+,80014.3,79281.7,0.135844,0.275231;`<br>
    * `obj`: the object type:

        |Key|Particle|
        |---|---|
        |j|jet|
        |b|b-jet|
        |e-|electron|
        |e+|positron|
        |m-|muon|
        |m+|antimuon|
        |g|photon|

    * for each event e.g. electron = number of electrons/positrons (ignore charge) in that event
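
The counting rule above can be illustrated on a single raw line without pandas. This is only a minimal sketch, not the notebook's implementation: `count_objects` and `CHARGE_BLIND` are hypothetical names, and the sample line is a shortened version of the row shown above.

```python
from collections import Counter

# Map each object key to a charge-blind label (hypothetical helper table).
CHARGE_BLIND = {"j": "j", "b": "b", "g": "g",
                "e-": "e", "e+": "e", "m-": "m", "m+": "m"}

def count_objects(line):
    """Count objects per charge-blind type in one semicolon-separated event line."""
    fields = [f.strip() for f in line.strip().split(";")]
    # Skip the 5 event-level fields and drop the empty field left by the trailing ';'.
    objects = [f for f in fields[5:] if f]
    keys = [obj.split(",", 1)[0].strip() for obj in objects]
    counts = Counter(CHARGE_BLIND[k] for k in keys)
    return {label: counts.get(label, 0) for label in ("e", "m", "j", "b", "g")}

# Shortened sample: two jets, one electron, one positron.
sample = ("5702564; z_jets; 1; 102549; -2.9662; "
          "j,335587,132261,-1.57823,1.02902; "
          "j,107341,106680,-0.0989776, -2.67901; "
          "e-,186937,131480,0.888915,-0.185666; "
          "e+,80014.3,79281.7,0.135844,0.275231;")

print(count_objects(sample))  # {'e': 2, 'm': 0, 'j': 2, 'b': 0, 'g': 0}
```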

Imports

In [15]:

import numpy as np # numerical computing with arrays
import pandas as pd # data manipulation and analysis
from sklearn.preprocessing import StandardScaler # normalize features to mean=0, std=1 


Read the file

In [16]:

# first 5 columns before objects
base_cols = ["event_id", "process_id", "event_weight", "MET", "METphi"] 


# read our data file
def read(path):  # file name as argument

    # Count semicolons per line to get the max number of fields (semicolons + 1)
    with open(path, encoding="utf-8") as f: # close the file once the lines are counted
        max_cols = max(line.count(";") for line in f) + 1 # n separators split a line into n + 1 fields
    names = base_cols + [f"obj{i+1}" for i in range(max_cols - len(base_cols))] # full list of column names: base columns, then obj1, obj2, ...
    df = pd.read_csv(path, sep=";", header=None, names=names, engine="python", dtype=str) # read data separated by ; as strings
    
    return df
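
On the `+ 1` in `max_cols`: splitting a string on `n` separators always yields `n + 1` fields, and because each line in the sample above ends with a `;`, the last field is empty. A quick check with a made-up line (the values are illustrative only):

```python
# A made-up event line with 5 event-level fields and 2 objects, ending in ';'.
line = "1; z_jets; 1; 100.0; 0.5; j,10,9,0.1,0.2; e-,8,7,0.3,0.4;"

n_semicolons = line.count(";")
fields = line.split(";")

print(n_semicolons)      # 7
print(len(fields))       # 8, i.e. semicolons + 1
print(repr(fields[-1]))  # '' (empty field created by the trailing ';')
```

So the last `obj` column produced by `read` should be empty for every event, which is harmless as long as empty cells are treated as missing.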


Convert columns which are numbers to integers or floats: `event ID`, `event weight`, `MET`, `METphi`

In [17]:

# convert number columns from strings to integers or floats
def numeric(df):

    num_cols = ["event_id", "event_weight", "MET", "METphi"]
    df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce") # float64 or int64 depending on the data supplied


Count the number of electrons, photons, muons, jets and bjets in the event

In [18]:

# count particles
def particle_count(df):

    obj_keys = ["j", "b", "e-", "e+", "m-", "m+", "g"] # keys for each object type
    obj_cols = df.columns[len(base_cols):] # take columns after the first 5, i.e [5:]

    # Extract the object type from every object cell
    types = df[obj_cols].map(lambda s: str(s).split(",", 1)[0].strip() if pd.notna(s) and str(s).strip() else None) # lambda applied to every cell if condition met
    """
        If the cell is not NaN and not an empty string, convert to str, split on the first comma, take the token before it (the object code), and strip whitespace.
        Otherwise return None (for missing slots) 
    """

    # Count per row how many times each type appears
    counts = pd.DataFrame({i: (types == i).sum(axis=1) for i in obj_keys}).astype("int64")
    """
        types == i creates a boolean DataFrame (True where the cell equals the object key i).
        .sum(axis=1) counts True per row (because True=1, False=0), i.e., how many times i appears in that event across all object slots.
        The dict comprehension builds a column per object type.
        pd.DataFrame(...) turns that dict into a DataFrame with columns in obj_keys order.
        .astype("int64") makes sure the counts are stored as 64-bit integers
    """

    # columns for total electrons and total muons, regardless of charge
    counts["e"] = counts["e-"] + counts["e+"]
    counts["m"] = counts["m-"] + counts["m+"]


    # reorder columns: |N ele| N muon| N jets| N bjets| N photons|
    counts = (counts
            .drop(columns=["e-", "e+", "m-", "m+"]) # get rid of charged columns
            .reindex(columns=["e","m","j", "b", "g"], fill_value=0)) # reorder remaining columns; fill_value=0 fills empty columns with 0's not NaNs


    # add these columns to the overall dataframe as integers
    df = pd.concat([counts, df], axis=1)
    df[["e","m","j","b","g"]] = df[["e","m","j","b","g"]].astype("int64") # convert new columns to integers

    return df


2. Choose an appropriate number of particles to study per event (recommended: **8**)  
3. **Sort by energy** (largest to smallest)
    * remember the main df is ordered as follows:  
    `event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...`<br>
4. If the event has more than 8 particles, choose the **8 particles** with the **highest energy** and **truncate** the rest
    * objects in the file are not in order of decreasing energy (the jets in row 1 of the `SM` dataset are listed by decreasing pt), so sort by `E` first
    * then take the first 8 objects and drop the rest
5. Take the logarithm of the MET, energy and momentum variables
6. If the event has fewer than 8 particles, create kinematic variables with 0 values for the missing particles
    * replace NaN with 0

- Take the sine and cosine of METphi and phiN for later standardisation
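
The steps above can be sketched for a single event in plain Python. This is an illustration under stated assumptions rather than the notebook's method: `event_features` is a hypothetical helper, `K = 3` is used instead of 8 to keep the example short, and the object values are made up.

```python
import math

K = 3  # small K for illustration; the notebook uses 8

def event_features(objects, k=K, eps=1e-12):
    """Sort objects by energy, keep the k most energetic, pad with zeros."""
    ranked = sorted(objects, key=lambda o: o["E"], reverse=True)[:k]
    rows = []
    for o in ranked:
        rows.append([
            math.log(max(o["E"], eps)),   # logE (eps guards against log(0))
            math.log(max(o["pt"], eps)),  # logpt
            o["eta"],
            math.sin(o["phi"]),
            math.cos(o["phi"]),
        ])
    # Zero-pad events with fewer than k objects so every event has k rows.
    rows += [[0.0] * 5 for _ in range(k - len(rows))]
    return rows

# Two objects listed in pt order, not energy order (made-up values).
event = [
    {"E": 100.0, "pt": 90.0, "eta": 0.1, "phi": math.pi},
    {"E": 300.0, "pt": 50.0, "eta": 2.0, "phi": -math.pi},
]
features = event_features(event)
```

The two objects have `phi = pi` and `phi = -pi`, which are numerically far apart but describe the same direction. Their `(sin, cos)` pairs agree to within floating-point error, which is the reason for replacing the raw angle before standardising.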

In [19]:

# truncate and mask
def trunc_mask(df):

    # Vectorized NumPy/pandas operations - operate on entire columns at once using optimized C code
    base_cols = list(df.columns[:10]) # 5 count columns + 5 event-level columns (shadows the global base_cols)
    obj_cols  = sorted([c for c in df.columns if c.startswith("obj")],
                    key=lambda s: int(s[3:])) # numeric key: a plain string sort would put "obj10" before "obj2"


    # Wide -> long and split into fields
    S = df[obj_cols].stack() # (row, obj_col)
    parts = S.str.split(",", n=4, expand=True).apply(lambda c: c.str.strip())
    parts.columns = ["type","E","pt","eta","phi"]


    # Numeric conversion for object fields
    for i in ["E","pt","eta","phi"]:
        parts[i] = pd.to_numeric(parts[i], errors="coerce")


    # Name the index levels, then sort by E (descending) within each event so that rank 1 is the most energetic object
    parts.index = parts.index.set_names(["row","slot"])
    long = parts.reset_index().sort_values(["row","E"], ascending=[True, False])


    # Keep top 8 per event
    long["rank"] = long.groupby("row").cumcount() + 1
    topK = long[long["rank"] <= 8].copy()


    # calculate logE and logpt 
    eps = 1e-12 # so we have no divergence log(0)
    topK["logE"]  = np.log(np.clip(topK["E"],  eps, None))
    topK["logpt"] = np.log(np.clip(topK["pt"], eps, None))


    # calculate sinphi and cosphi
    topK["sinphi"] = np.sin(topK["phi"])
    topK["cosphi"] = np.cos(topK["phi"])


    # Pivot back to wide numeric columns (type1, logE1, logpt1, ...)
    wide = (topK
            .set_index(["row","rank"])[["type", "logE", "logpt", "eta", "sinphi", "cosphi"]]
            .unstack("rank"))


    # Flatten columns like ('logE', 1) -> 'logE1'
    wide.columns = [f"{f}{k}" for f, k in wide.columns]


    # Order columns per slot: type, logE, logpt, eta, sinphi, cosphi
    ordered = []
    for i in range(1, 9):
        ordered += [f"type{i}", f"logE{i}", f"logpt{i}", f"eta{i}", f"sinphi{i}", f"cosphi{i}"]
    wide = wide.reindex(columns=[c for c in ordered if c in wide.columns])


    # Merge back base/meta columns
    df = (df.reset_index()
        .rename(columns={"index": "row"})[["row"] + base_cols]
        .merge(wide.reset_index(), on="row", how="left")
        .drop(columns="row"))


    # calculate log(MET)
    met_num = pd.to_numeric(df["MET"], errors="coerce")
    df["MET"] = np.log(np.clip(met_num, eps, None))
    df.rename(columns={"MET": "logMET"}, inplace=True) # rename column


    # calculate sin(METphi) and cos(METphi)
    metphi_num = pd.to_numeric(df["METphi"], errors="coerce")
    df.drop(columns=["METphi"], inplace=True) # get rid of METphi
    df.insert(df.columns.get_loc("type1"), "sin_METphi", np.sin(metphi_num)) # insert sin(METphi) just before type1
    df.insert(df.columns.get_loc("type1"), "cos_METphi", np.cos(metphi_num)) # then cos(METphi), which lands between sin_METphi and type1


    # Convert NaNs to 0s (fillna acts on every column, so the typeN label of a missing object also becomes 0)
    df = df.fillna(0)

    return df
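
The numeric `key` used when sorting the `obj` columns in `trunc_mask` matters once there are ten or more object slots, because a plain string sort compares the names character by character. A small check with made-up column names:

```python
cols = ["obj1", "obj2", "obj10", "obj3"]

# Default string sort: "obj10" lands before "obj2".
print(sorted(cols))                            # ['obj1', 'obj10', 'obj2', 'obj3']

# Sorting on the integer after the "obj" prefix restores slot order.
print(sorted(cols, key=lambda s: int(s[3:])))  # ['obj1', 'obj2', 'obj3', 'obj10']
```

Because the column names are generated in order by `read`, the sort is mainly a safeguard in case the columns have been rearranged upstream.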


## TEMP PROGRESS REPORT
- the ordering of training_top8 is now: `e; m; j; b; g; event ID; process ID; event weight; MET; METphi; obj1; obj2; obj3; obj4; obj5; obj6; obj7; obj8;`<br>
where for each object we have `obji, log(Ei), log(pti), etai, phii`<br>  

    * ~~still need to take the log of MET: log(MET) for each event~~ __done__
    * ~~maybe go back and convert all columns which are numbers to floats instead of switching back and forth to strings~~ __done__
    * ~~maybe don't use fields for each object (also what is a field here?) but also were given in ;,,,,,; format...~~ __done__
    * ~~what is "`MinMaxScaler` or similar" and what does it mean to standardise the variables?~~ __use `StandardScaler` instead__

- ordering now: `e; m; j; b; g; event ID; process ID; event weight; log(MET); sin(METphi); cos(METphi); obj1; log(E1); log(pt1); eta1; sin(phi1); cos(phi1) ... `<br>

Preprocessing function

In [20]:
# define an overall function for preprocessing
def pre_process(file):

    data = read(file) # read the csv file into a DataFrame of strings
    numeric(data) # convert number columns to numeric dtypes (modifies data in place)
    data = particle_count(data) # count particles
    data = trunc_mask(data) # truncate and mask

    return data


test = pre_process("background_chan2b_7.8.csv")


In [21]:
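print(list(test.columns))
print("\n", test.loc[-5:, "type6":"cosphi6"]) # .loc slices by label, so -5: keeps every row from label 0 onward; use test.loc[:, "type6":"cosphi6"].tail() for the last 5 rows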
print("\n", test.iloc[:, 17:23].head())

['e', 'm', 'j', 'b', 'g', 'event_id', 'process_id', 'event_weight', 'logMET', 'sin_METphi', 'cos_METphi', 'type1', 'logE1', 'logpt1', 'eta1', 'sinphi1', 'cosphi1', 'type2', 'logE2', 'logpt2', 'eta2', 'sinphi2', 'cosphi2', 'type3', 'logE3', 'logpt3', 'eta3', 'sinphi3', 'cosphi3', 'type4', 'logE4', 'logpt4', 'eta4', 'sinphi4', 'cosphi4', 'type5', 'logE5', 'logpt5', 'eta5', 'sinphi5', 'cosphi5', 'type6', 'logE6', 'logpt6', 'eta6', 'sinphi6', 'cosphi6', 'type7', 'logE7', 'logpt7', 'eta7', 'sinphi7', 'cosphi7', 'type8', 'logE8', 'logpt8', 'eta8', 'sinphi8', 'cosphi8']

        type6      logE6     logpt6      eta6   sinphi6   cosphi6
0          j  11.358843  11.035035  0.840127 -0.986046 -0.166475
1          0   0.000000   0.000000  0.000000  0.000000  0.000000
2          0   0.000000   0.000000  0.000000  0.000000  0.000000
3          0   0.000000   0.000000  0.000000  0.000000  0.000000
4          0   0.000000   0.000000  0.000000  0.000000  0.000000
...      ...        ...        ...    

8. After the dataset is ready, use `MinMaxScaler` or similar to standardise the training variables over the SM dataset
    * Scaling and normalizing features to improve model performance and training stability
    * i.e. normalisation; giving all data points the same scale
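    * z-score standardisation per feature: $$z = \frac{x - \mu}{\sigma}$$ (the “Mahalanobis distance” $$d = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$ generalises this to correlated features)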
    * ~~standardise counts: `e; m; j; b; g`~~ maybe not...
    * convert `METphi` to `sin(METphi); cos(METphi)` and `phiN` to `sin(phiN); cos(phiN)`
    * "train on the event level variables `MET; METphi` and the kinematics of the particle level objects `log(EN); log(ptN); etaN; phiN`"
    * make the standardisation process a class ~~(for some reason...)~~ "*Do not recalculate the standardisation*"
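
One way to read "*Do not recalculate the standardisation*" is: compute the mean and standard deviation once on the SM dataset, store them, and apply the same numbers to every other dataset. The sketch below shows that idea with NumPy only. `FrozenStandardiser` is a hypothetical class name and the arrays are made-up stand-ins for real features, so treat it as an illustration rather than the required implementation.

```python
import numpy as np

class FrozenStandardiser:
    """Fit mean and std once, then reuse them for every later dataset."""

    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X):
        X = np.asarray(X, dtype="float64")
        self.mean_ = X.mean(axis=0)
        std = X.std(axis=0)  # population std (ddof=0)
        self.std_ = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
        return self

    def transform(self, X):
        if self.mean_ is None:
            raise RuntimeError("call fit() on the SM dataset first")
        return (np.asarray(X, dtype="float64") - self.mean_) / self.std_

# Made-up numbers standing in for SM (training) and other (test) features.
sm = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
other = np.array([[3.0, 30.0], [7.0, 70.0]])

frozen = FrozenStandardiser().fit(sm)
sm_z = frozen.transform(sm)        # mean 0, std 1 per column
other_z = frozen.transform(other)  # reuses the SM mean and std, so it is not centred on 0
```

With scikit-learn the same pattern is to call `fit` (or `fit_transform`) once on the SM data and only `transform` on everything else.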

In [22]:

# Standardise the training dataset
training = test.copy()
train_cols = ["logMET", "sin_METphi", "cos_METphi"] + [f"{j}{i}" for j in ("logE", "logpt", "eta", "sinphi", "cosphi") for i in range(1, 9)] # columns to be standardised
training[train_cols] = training[train_cols].astype("float64") # convert training columns to floats before scaling


scaler = StandardScaler()
training.loc[:, train_cols] = scaler.fit_transform(training[train_cols]) # Replace in place
# For any other dataset, reuse the fitted scaler instead of refitting, e.g.:
# other.loc[:, train_cols] = scaler.transform(other[train_cols])  # `other` is a placeholder name


In [23]:
print(test)
print(training)

        e  m  j  b  g  event_id process_id  event_weight     logMET  \
0       2  0  7  0  0   5702564     z_jets             1  11.538096   
1       0  2  1  0  0  13085335     z_jets             1  11.547018   
2       0  2  1  1  0     74025    wtopbar             1  11.770725   
3       0  2  2  0  0   2419445     z_jets             1  11.261565   
4       0  2  1  1  0     43639       wtop             1  11.581994   
...    .. .. .. .. ..       ...        ...           ...        ...   
340263  0  2  1  1  0        30      ttbar             1  11.092509   
340264  0  2  2  1  0       111      ttbar             1  10.980708   
340265  0  2  4  1  0        75      ttbar             1  12.744695   
340266  0  2  3  0  0  15181306     z_jets             1  12.417140   
340267  0  2  3  0  0  19552503     z_jets             1  11.599653   

        sin_METphi  ...     logpt7      eta7   sinphi7   cosphi7  type8  \
0        -0.174495  ...  11.280763  0.135844  0.271769  0.962362      j 