# Project 2: Anomaly Detection for Exotic Event Identification at the Large Hadron Collider




## Project Marking Scheme

This project is assessed based on the following components:

| **Assessment Component**  | **Percentage** | **Description** |
|---------------------------|----------------|-----------------|
| **Data Preprocessing Pipeline** | 20% | Robust data loading (using pandas or equivalent), filtering and scaling, with proper validation |
| **Model Design** | 20% | Working model which uses an `AutoEncoder` design to train on `SM` background data |
| **Training Implementation** | 20% | Proper training loop with reasonable stopping decision, validation, and hyperparameter selection |
| **Anomaly Detection Strategy** | 20% | Effective BSM/SM separation using reconstruction error |
| **Code Quality & Documentation** | 20% | Clean, well-commented code with proper function documentation and cross-checks |


### Key Success Criteria:
- **Preprocessing**: Handle angular variables correctly, implement caching, remove non-physical features
- **Visualisation**: Visualise the data to check that the relative distributions haven't been distorted
- **Architecture**: Auto-Encoder model design with sensible defaults
- **Training**: Sensible early stopping, validation monitoring
- **Evaluation**: Low false-positive strategy, visualisation of trained model outputs
- **Documentation**: Clear explanations, cross-checks, professional presentation

---

## Brief Introduction to the Standard Model and Large Hadron Collider


The Standard Model (`SM`) of Particle Physics is the most complete model physicists have for understanding the interactions of the fundamental particles in the universe. The elementary particles of the SM are shown in Fig.1.

---
<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Standard_Model_of_Elementary_Particles.svg/627px-Standard_Model_of_Elementary_Particles.svg.png" alt="SM" style="width: 600px;"/>
    <figcaption>Fig.1 - Elementary particles of the Standard Model.</figcaption>
</figure>

---

It comprises matter particles (**fermions**):
- **leptons**
    - electrons
    - muons
    - taus
    - and their respective neutrinos
- **quarks**, which are the building blocks of protons

as well as force carrier particles (**bosons**):
- photon and W/Z bosons (electroweak force)
- gluons (strong force)

and the Higgs boson, which is associated with the mechanism that gives particles their mass.


Though the SM has experimentally stood the test of time, many outstanding questions about the universe and the model itself remain, and scientists continue to probe for inconsistencies in the SM in order to find new physics. More exotic models such as **Supersymmetry (SUSY)** predict mirror particles which may exist and have eluded detection thus far.

---

The **Large Hadron Collider** (LHC) is a particle smasher capable of colliding protons at a centre-of-mass energy of 14 TeV.
**ATLAS** is a general-purpose particle detector tasked with recording the remnants of proton collisions at the collision point. The main purpose of this experiment is to test the SM rigorously, and ATLAS was one of the two experiments (ATLAS and CMS) responsible for the discovery of the **Higgs boson in 2012**.

Find an animation of how particles are reconstructed within a slice of the ATLAS detector here: https://videos.cern.ch/record/2770812. Electrons, muons, photons, quark jets, etc, will interact with different layers of the detector in different ways, making it possible to design algorithms which distinguish reconstructed particles, measure their trajectories, charge and energy, and identify them as particular types.

Figure 2 shows an event display from a data event in ATLAS in which 2 muons (red), 2 electrons (green), and 1 quark-jet (purple cone) are found. This event is a candidate for a Higgs boson decaying to four leptons with an associated jet: $$H (+j)\rightarrow 2\mu 2e (+j)$$



---

<figure>
    <img src="https://twiki.cern.ch/twiki/pub/AtlasPublic/EventDisplayRun2Physics/JiveXML_327636_1535020856-RZ-LegoPlot-EventInfo-2017-10-18-19-01-24.png" alt="Higgs to leptons" style="width: 600px;"/>
    <figcaption>Fig.2 - Event display of a Higgs candidate decaying to two muons and two electrons.</figcaption>
</figure>

---


Particles are shown traversing the detector material. The 3D histogram shows the positions of the particle directions with respect to the interaction point in
* the azimuth $\phi$ (angle around the beam, 0 is up) and
* the pseudo-rapidity $\eta$ (trajectory along the beam).

The height of the histogram bars shows the transverse momentum ($p_T$) deposited by each particle in giga-electronvolts (GeV). The total energy measured for the particle is denoted by $E$.

A particle's kinematics can then be described by a four-vector  $$\bar{p} = (E,p_T,\eta,\phi)$$

An additional important quantity is the missing energy in the transverse plane (MET). This is calculated as the magnitude of the negative vector sum of the transverse momenta of all particles in the event:
$$\mathrm{MET} = \left|-\sum_i \vec{p}_{T,i}\right|$$

With perfect detector performance the MET will be 0 if all outgoing particles are observed by the detector. Neutrinos cannot be measured by the detector and hence their presence produces non-zero MET.
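
As a minimal sketch of this definition, the MET can be computed from each particle's $p_T$ and $\phi$ by summing the transverse components and taking the magnitude of the negated sum. This assumes the sum is a vector sum in the transverse plane; the helper name `missing_et` and the toy particle values are illustrative and not part of the dataset.

```python
import math

def missing_et(particles):
    """Return (MET, METphi) from a list of (pt, phi) pairs.

    Hypothetical helper: treats MET as the magnitude of the negative
    vector sum of the visible transverse momenta.
    """
    px = sum(pt * math.cos(phi) for pt, phi in particles)
    py = sum(pt * math.sin(phi) for pt, phi in particles)
    met = math.hypot(px, py)
    met_phi = math.atan2(-py, -px)  # direction opposite to the visible sum
    return met, met_phi

# Two back-to-back particles with equal pT balance each other.
balanced = [(50.0, 0.0), (50.0, math.pi)]
# A single visible particle must be balanced by something unseen.
unbalanced = [(50.0, 0.0)]

met_balanced, _ = missing_et(balanced)
met_unbalanced, phi_unbalanced = missing_et(unbalanced)
print(f"balanced MET = {met_balanced:.3f}")
print(f"unbalanced MET = {met_unbalanced:.3f}, METphi = {phi_unbalanced:.3f}")
```

In the second toy case the MET points opposite to the single visible particle, which is the signature an undetected neutrino would leave.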

## Anomaly detection dataset

For the anomaly detection project we will use the dataset discussed in this publication: <p><a href="https://arxiv.org/pdf/2105.14027.pdf" title="Anomalies">The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider</a></p>

Familiarise yourself with the paper, in particular from sections 2.1 to 4.4.

---

The dataset contains a collection of simulated proton-proton collisions in a general particle physics detector (such as ATLAS). We will use a dataset containing `340 000` SM events (referred to as channel 2b in the paper) which have at least 2 electrons/muons in the event with $p_T>15$ GeV. 

**The events can be found in `background_chan2b_7.8.csv`**


You can see all the SM processes that are simulated in Table 2 of the paper. For example, an event with a process ID of `w_jets` is a simulated event of two protons producing a lepton and neutrino and at least two jets:

$$pp\rightarrow \ell\nu(+2j)$$

---

The datasets are collected as CSV files where each line represents a single event, with the following format:

`event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...`<br>

See Section 2.2 for a description of the dataset.<br>
Variables are split by a semicolon `";"`
- `event ID`: an identifier for the event number in the simulation
- `process ID`: an identifier for the event simulation type
- `event weight`: the weight associated to the simulated event (how important that event is)
- `MET`: the missing transverse energy
- `METphi`: the azimuth angle (direction) of the MET

Followed by a list of objects (particles) whose variables are split by commas `","` in the following order:
- `obj`: the object type,

    |Key|Particle|
    |---|---|
    |j|jet|
    |b|b-jet|
    |e-|electron|
    |e+|positron|
    |m-|muon|
    |m+|muon+|
    |g|photon|
    
    *see Table 1 of the paper*
- `E`: the total measured particle energy in MeV, [0,inf]
- `pt`: the transverse momentum in MeV, [0,inf]
- `eta`: pseudo-rapidity, [-inf,inf]
- `phi`: azimuth angle, radians [-3.14,3.14]

e.g. row 1 of the `SM` dataset looks like:<br>
`5702564;z_jets;1;102549;-2.9662;j,335587,132261,-1.57823,1.02902;j,107341,106680,-0.0989776,-2.67901;j,85720.1,62009,0.840127,-1.73805;j,270540,58844.5,2.20566,1.6064;j,55173.9,52433.5,-0.183147,2.62501;j,48698.6,37306.4,-0.719927,-1.7898;j,148467,23648,-2.52332,-1.70799;e-,186937,131480,0.888915,-0.185666;e+,80014.3,79281.7,0.135844,0.275231;`
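
A minimal sketch of how one such line could be split into metadata and objects is shown below. It assumes every object carries exactly five comma-separated values and that a trailing semicolon may be present; the function name `parse_event` and the dictionary layout are illustrative choices, not a required format.

```python
def parse_event(line):
    """Split one semicolon-separated event line into metadata and objects."""
    # Drop empty fields caused by the trailing ';' at the end of a line.
    fields = [f for f in line.strip().split(";") if f]
    event_id, process_id, weight, met, met_phi = fields[:5]
    objects = []
    for obj in fields[5:]:
        kind, energy, pt, eta, phi = obj.split(",")
        objects.append({"obj": kind, "E": float(energy), "pt": float(pt),
                        "eta": float(eta), "phi": float(phi)})
    return {
        "event_id": event_id,          # kept as text, it is only a label
        "process_id": process_id,
        "weight": float(weight),
        "MET": float(met),
        "METphi": float(met_phi),
        "objects": objects,
    }

# Shortened version of the example row (two jets and two leptons kept).
row = ("5702564;z_jets;1;102549;-2.9662;"
       "j,335587,132261,-1.57823,1.02902;"
       "j,107341,106680,-0.0989776,-2.67901;"
       "e-,186937,131480,0.888915,-0.185666;"
       "e+,80014.3,79281.7,0.135844,0.275231;")
event = parse_event(row)
print(event["process_id"], len(event["objects"]), event["objects"][0]["obj"])
```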

---

In addition to the `SM` events we are also provided with simulated events from `Beyond Standard Model` (`BSM`) exotic physics models. They are summarised here:

|Model | File Name | 
|---|---|
|**SUSY chargino-chargino process**||
||`chacha_cha300_neut140_chan2b.csv`|
||`chacha_cha400_neut60_chan2b.csv`|
||`chacha_cha600_neut200_chan2b.csv`|
|**SUSY chargino-neutralino processes**||
||`chaneut_cha200_neut50_chan2b.csv`|
||`chaneut_cha250_neut150_chan2b.csv`|
|**$Z'$ decay to leptons**||
||`pp23mt_50_chan2b.csv`|
||`pp24mt_50_chan2b.csv`|
|**Gluino and RPV SUSY**||
||`gluino_1000.0_neutralino_1.0_chan2b.csv`|
||`stlp_st1000_chan2b.csv`|



## Project description
*Responsible:* Robert Currie (<rob.currie@ed.ac.uk>, JCMB 3406)

### Overview
The task is to design an anomaly detection algorithm which is trained on the `SM` dataset and which can be used to flag up interesting (exotic) events from the BSM physics models.

You will do this by designing a robust `AutoEncoder` which is trained on the event-level variables `MET; METphi` and the kinematics of the particle-level objects. The `AutoEncoder` needs to reproduce its input at the output as closely as possible while passing it through a latent space (bottleneck).
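
To make the bottleneck idea concrete, here is a minimal PyTorch sketch of a fully connected autoencoder. The layer widths, the latent size of 4, and the input size of 39 (5 object counts, 2 MET variables and 4 kinematic variables for each of 8 particles) are illustrative assumptions, not a recommended design.

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Small fully connected autoencoder (layer sizes are illustrative)."""

    def __init__(self, n_features=39, latent_dim=4):
        super().__init__()
        # Encoder compresses the input down to the latent space.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 16), nn.ReLU(),
            nn.Linear(16, latent_dim),
        )
        # Decoder mirrors the encoder and maps back to the input size.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

torch.manual_seed(0)
model = AutoEncoder()
batch = torch.rand(5, 39)          # random stand-in for scaled event features
reconstruction = model(batch)
# Per-event mean squared reconstruction error
errors = ((reconstruction - batch) ** 2).mean(dim=1)
print(reconstruction.shape, errors.shape)
```

The per-event reconstruction error computed at the end is one possible starting point for an anomaly score; other choices are equally valid.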

You will then need to evaluate and discuss the performance of your `AutoEncoder` at identifying the exotic models listed above, and come up with an appropriate metric to identify events from `non SM` physics after being trained on just `SM` background.

### Data Preprocessing
* The data is provided in a CSV (text) format as a semicolon- and comma-separated list with **one line per event**. We need to convert this into an appropriate format for our neural networks.
* Since the number of particles per event is variable you will need to **truncate** and **mask** particles in the event. The following steps need to be performed on the SM (background) sample:
     1. Create variables where you count the number of electrons, photons, muons, jets and b-jets in the event (ignore charge) before any truncation.
     2. Choose an appropriate number of particles to study per event (recommended: **8** particles are used in the paper)
     3. Check the particles are sorted by energy (largest to smallest)
     4. If the event has more than 8 particles choose the **8 particles** with **highest energy and truncate** the rest.
     5. Convert energy and momentum variables by logarithm (e.g., `log`) - this is to prioritise differences in energy **scale** over more minor differences.
     6. If the event has fewer than 8 particles, create kinematic variables with 0 values for the missing particles.
     7. The final set of training variables should look something like this (the exact format is up to you):

        |N ele| N muon| N jets| N bjets| N photons| log(MET)| METphi| log(E1)| log(pt1)| eta1| phi1| ... | phi8|
        |-|-|-|-|-|-|-|-|-|-|-|-|-|

     8. After the dataset is ready, use `MinMaxScaler` or similar to standardise the training variables over the SM dataset
* After the SM dataset has been processed use the same processing for the BSM (signal samples). Use the same standardisation functions as on the SM dataset.<br> *Do not recalculate the standardisation*.
* Keep associated metadata (`event ID; process ID; event weight;`) though this does not need processing.
* Randomise and split the SM (background) dataset into training and testing datasets.<br> (the BSM samples don't need to be split (*Can you explain why?*)).
* *Hint*: It is suggested that you write a class or function for the preprocessing which takes a CSV path as input and provides the processed dataset. After you have done the data processing, it's suggested you save the datasets so as to not have to recalculate them again if the kernel is restarted.<br><br>
* **The data should be checked for consistency between raw `csv` input and the final processed/normalized dataset that is to be passed to the model for fitting. This can either be through graphical comparisons or another well-defined method.**
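
The steps above can be sketched in plain Python as follows. This is only an outline under stated assumptions: each object is a dictionary with keys `obj`, `E`, `pt`, `eta` and `phi`; MET, `E` and `pt` are strictly positive so that the logarithm is defined; and the helper names (`count_objects`, `event_to_features`, `fit_min_max`, `apply_min_max`) and toy events are hypothetical. In practice `MinMaxScaler` or similar would replace the hand-written scaling.

```python
import math

N_MAX = 8  # particles kept per event, as recommended above

def count_objects(objects):
    """Count object types, ignoring charge ('e-' and 'e+' both count as electrons)."""
    counts = {"e": 0, "m": 0, "j": 0, "b": 0, "g": 0}
    for obj in objects:
        counts[obj["obj"][0]] += 1
    return [counts[k] for k in ("e", "m", "j", "b", "g")]

def event_to_features(met, met_phi, objects, n_max=N_MAX):
    """Build one fixed-length row: counts, log(MET), METphi, then n_max particles."""
    row = count_objects(objects)               # counted before truncation
    row += [math.log(met), met_phi]            # assumes MET > 0
    kept = sorted(objects, key=lambda o: o["E"], reverse=True)[:n_max]
    for obj in kept:
        row += [math.log(obj["E"]), math.log(obj["pt"]), obj["eta"], obj["phi"]]
    row += [0.0] * (4 * (n_max - len(kept)))   # zero-pad missing particles
    return row

def fit_min_max(rows):
    """Return per-column (min, max), to be computed on the SM rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(rows, limits):
    """Scale rows with previously fitted limits; constant columns map to 0."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, limits)] for row in rows]

# Toy events (values are illustrative only): (MET, METphi, objects)
sm_events = [
    (102549.0, -2.9662, [
        {"obj": "j", "E": 335587.0, "pt": 132261.0, "eta": -1.58, "phi": 1.03},
        {"obj": "e-", "E": 186937.0, "pt": 131480.0, "eta": 0.89, "phi": -0.19},
        {"obj": "e+", "E": 80014.3, "pt": 79281.7, "eta": 0.14, "phi": 0.28}]),
    (50000.0, 1.2, [
        {"obj": "m+", "E": 90000.0, "pt": 60000.0, "eta": 0.5, "phi": 2.0},
        {"obj": "m-", "E": 70000.0, "pt": 40000.0, "eta": -0.3, "phi": -1.0}]),
]
sm_rows = [event_to_features(*ev) for ev in sm_events]
limits = fit_min_max(sm_rows)              # fitted once, on SM only
sm_scaled = apply_min_max(sm_rows, limits)
print(len(sm_rows[0]), sm_rows[0][:5])
```

The same `limits` would then be passed to `apply_min_max` for the BSM rows, so scaled BSM values can fall outside the range seen for the SM sample.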

### Model Evaluation
In the evaluation, explore different datasets and try to answer as many questions about the performance as possible.
* Evaluate the performance of the `AE` model on `SM` and `BSM` datasets. How does model design impact performance?
* Explore using an anomaly score as a handle on finding new physics.<br> Consider scanning over different anomaly scores and calculating the signal and background efficiencies at each point (plot this for different BSM models). How might you choose a value which flags up a non-SM event? 
* Explore SM events. Which look more anomalous than others? Are there any particular features which are responsible, e.g. particle counts, MET ranges, etc.?
* Discuss any limitations your algorithm has. How might you update and improve your model in future? Discuss any issues you had, or things you would have liked to try given more time.<br><br>
* **How should you pick the best anomaly score? How good is the separation between `SM` and `BSM` events? Are the different events easily separable/identifiable?**
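
One way to approach the threshold scan is sketched below. It assumes an anomaly score is already available for each event (for example a reconstruction error) and, as a simplification, ignores event weights. The scores are made-up numbers, not real model output, and the helper names are hypothetical.

```python
def efficiency(scores, threshold):
    """Fraction of events with anomaly score above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

def scan_thresholds(sm_scores, bsm_scores, thresholds):
    """Return (threshold, background efficiency, signal efficiency) tuples."""
    return [(t, efficiency(sm_scores, t), efficiency(bsm_scores, t))
            for t in thresholds]

def threshold_for_fpr(sm_scores, max_fpr):
    """Smallest observed SM score whose false-positive rate is <= max_fpr."""
    for t in sorted(sm_scores):
        if efficiency(sm_scores, t) <= max_fpr:
            return t
    return max(sm_scores)

# Toy anomaly scores (made-up numbers for illustration)
sm_scores = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 0.30]
bsm_scores = [0.05, 0.09, 0.12, 0.20, 0.25, 0.40, 0.55, 0.70, 0.90, 1.20]

# Pick the cut from the SM sample alone, then check what it does to signal.
cut = threshold_for_fpr(sm_scores, max_fpr=0.1)
for t, bkg_eff, sig_eff in scan_thresholds(sm_scores, bsm_scores, [0.05, cut, 0.5]):
    print(f"threshold={t:.2f}  background eff={bkg_eff:.2f}  signal eff={sig_eff:.2f}")
```

Fixing the cut from the SM false-positive rate alone keeps the choice independent of any particular BSM model, which is one possible reading of a "low false-positive strategy".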

---
## Submission


To complete this project, you should **Submit your Jupyter notebook** as a "report." See the comments below on documentation.

**You should submit by Friday 14th November**


For all tasks we're not looking for exceptional model performance and high scores (although those are nice too), **we're mostly concerned with _best practices:_** If you are careful and deliberate in your work, and show us that you can use the tools introduced in the course so far, we're happy!

Training all of these models in sequence takes a very long time so **don't spend hours on training hundreds of epochs.** Be conservative on epoch numbers (30 is normally more than enough) and use appropriate techniques like EarlyStopping to speed things up. Once you land on a good model you can allow for longer training times if performance can still improve.
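
As an illustration of patience-based early stopping, the sketch below tracks the best validation loss and stops once it has failed to improve for a set number of epochs. `EarlyStopper` is a hypothetical helper written for this example, not a library class, and the validation losses are made-up stand-ins for a real training loop.

```python
class EarlyStopper:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses standing in for a real training loop
val_losses = [0.90, 0.60, 0.45, 0.44, 0.46, 0.45, 0.47, 0.43]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(f"stopped after epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

In a real loop you would typically also save the model weights whenever the best loss improves, so that the best epoch can be restored after stopping.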


## Documentation: Annotation and Commentary

It is important that __all__ code is annotated and that you provide brief commentary __at each step__ to explain your approach. We expect well-documented Jupyter notebooks, not an unordered collection of code snippets.

Unlike weekly checkpoints where you were being guided towards the *''correct''* answer, this project is by design more open ended. It is, therefore, necessary to give some justification for choosing one method over another.
Explain *why* you chose a given approach and *discuss* the results. You can also include any failed approaches if you provide a reasonable explanation; we care more about you making an effort and showing that you understand the core concepts.

This is not in the form of a written report so do not provide pages of background material. Only provide a brief explanation for each step. Aim to clearly present your work so that the markers can easily follow your reasoning and can reproduce each of your steps through your analysis. Aim to convince us that you have understood the material covered in the course.

To add commentary above (or below) a code snippet create a new cell and add your text in markdown format. __Do not__ add commentary or significant text as a code comment in the same cell as the code.
(Code comments are still helpful!)

Using multiple cells as needed in the jupyter notebook makes separating code/sections easier. Please don't put all code in one huge cell. 

__20\% of the mark for this project is allocated to coding style and clarity of comments and approach.__

## Submission Steps

It is important your code is fully functional before it is submitted or this will affect your final mark. 

When you are ready to submit your report perform the following steps: 

 -  In Jupyter run `Kernel` -> `Restart Kernel and Clear All Outputs`
 -  Then `Kernel` -> `Restart & Run All` to ensure that all your analysis is reproducible and all output can be regenerated
 -  Save the notebook, and close Jupyter
 -  **Change the filename to contain Name_Surname**
 -  Tar and zip your project folder if you have multiple files in a working directory. You are free to include any supporting code. Make sure this belongs in the project folder and is referenced correctly in your notebook. **Do not include any of the input CSV data.**
 -  Submit this file or zipped folder through Learn<br>
 In case of problems or if your compressed project folder exceeds 20 MB (first **make sure you are not including any CSV files**) email your submission to Kieran (the course administrator) at the Teaching Office and me.

### Notes on submission

1. Only use libraries in the DAML conda env provided for this course or _you may lose marks_.
2. Submissions using any framework other than `PyTorch` will **_almost certainly lose marks_**.
3. Non-run/empty notebooks with just code will _almost certainly lose marks_.
4. Working with peers is encouraged. Submitting significant chunks of code which are exactly the same as your peers is not. Please only submit your own work.
5. Lack of basic documentation, even for clearly written code, _may lose marks_.



# Happy Anomaly Hunting
---
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.</p>&mdash; Josh Wills (@josh_wills) <a href="https://twitter.com/josh_wills/status/198093512149958656?ref_src=twsrc%5Etfw">May 3, 2012</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

---

Your code follows....