# Part 3: Introduction to High Energy Physics data
A perspective from the ATLAS experiment at CERN

## The LHC and the ATLAS experiment

The ATLAS experiment at CERN explores vast amounts of physics data to address some of the most fundamental questions about the Universe:

* What is dark matter, and can we produce dark matter particles at the LHC?
* Are there extra spatial dimensions beyond the familiar three?
* Why is there more matter than antimatter in the universe?
* Are there new fundamental particles or forces?
* ...

The Large Hadron Collider (LHC) is a circular particle accelerator that provides high-energy proton-proton collisions (_events_). In these collisions new particles are produced, as allowed by the basic equation that relates energy and mass:  
$$E = mc^2$$
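
As a rough illustration of this relation, the proton rest energy can be compared with the collision energy. The numbers below are approximate values that are not quoted elsewhere in this text: a proton rest energy of about 0.938 GeV and a centre-of-mass energy $\sqrt{s}$ of 13 TeV for the 2015 and 2016 data taking.

$$E_{\text{rest}} = m_p c^2 \approx 0.938\ \text{GeV} \ll \sqrt{s} = 13\ \text{TeV} = 13\,000\ \text{GeV}$$

The collision energy is therefore thousands of times larger than the rest energy of the colliding protons, which is what makes it possible to create particles much heavier than a proton, for example the Higgs boson with $m_H c^2 \approx 125\ \text{GeV}$. Only part of $\sqrt{s}$ is carried by the proton constituents that actually collide, so 13 TeV should be read as an upper bound on the energy available in a single event.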

<img src="../img/LHC_collisions.jpeg" width="800"/>

The ATLAS detector is installed at one of the four interaction points of the LHC -- other detectors are installed at the other interaction points.

<img src="../img/LHC.png" width="800"/>

ATLAS is a complex detector, composed of various sub-detectors, each specialized in detecting a specific signature or particle.

The outgoing, newly produced particles leave signatures in the ATLAS detector. Physicists reconstruct particle objects from those signatures.

<img src="../img/ATLASImage.jpg" width="800"/>

A transverse view of the detector shows the specific signatures that different particles leave:

<img src="../img/Schematic-of-how-different-particles-interact-with-the-ATLAS-detector.png" width="700"/>

## The ATLAS physics data

Those reconstructed particle objects, after various processing and data reduction steps, are stored in a compact data format called [PHYSLITE](https://opendata.atlas.cern/docs/documentation/data_format/physlite/). Physicists analyze PHYSLITE data to study the properties of the particles produced in each event.

Key Features of PHYSLITE:
- **Reduced File Size:** PHYSLITE targets a file size of 10 kB per event for data and 12 kB per event for simulated Monte Carlo (MC) samples, a significant reduction compared to previous formats.
- **CPU Efficiency:** Preliminary evaluations show a 25% reduction in CPU usage compared to previous models.
- **Unskimmed and Monolithic:** PHYSLITE is designed to be a one-size-fits-all solution, fitting various use cases without the need for multiple versions.
- **Direct Analysis Capability:** PHYSLITE can be analyzed directly, avoiding the need for creating flat n-tuples and further reducing storage demands.

PHYSLITE data are highly structured and can be represented in a tabular format. However, since each event can contain a variable number of particles, this will be a _ragged_ or _jagged_ table.

Therefore, we will use the [`awkward`](https://github.com/scikit-hep/awkward) Python library, which provides NumPy-like idioms for arbitrary data structures.

For example:

In [1]:
import awkward as ak

# Each inner list is one event; each record is one electron with properties x, y, and z.
example = ak.Array([
    # this event contains three electrons
    [{"x": 1.1, "y": 1.2, "z": 3.1}, {"x": 2.2, "y": 1.3, "z": 2}, {"x": 3.3, "y": 2.4, "z": 4.2}],
    # this event contains no electrons
    [],
    # this event contains two electrons
    [{"x": 4.4, "y": 1.1, "z": 1}, {"x": 5.5, "y": 4.2, "z": 3.2}],
])

ak.to_dataframe(example)

| entry | subentry | x | y | z |
|---|---|---|---|---|
| 0 | 0 | 1.1 | 1.2 | 3.1 |
| 0 | 1 | 2.2 | 1.3 | 2.0 |
| 0 | 2 | 3.3 | 2.4 | 4.2 |
| 2 | 0 | 4.4 | 1.1 | 1.0 |
| 2 | 1 | 5.5 | 4.2 | 3.2 |


The properties of the particles (aka *variables*) we store in PHYSLITE files can all be found in:

https://atlas-physlite-content-opendata.web.cern.ch/

PHYSLITE files are [ROOT](https://root.cern/) files. This is a very common file format in particle physics, and it needs specialized decompression and interpretation routines to load the data into memory (e.g. as awkward arrays). For this purpose we will use the [`uproot`](https://github.com/scikit-hep/uproot5) library.

So far we have introduced the two main libraries we utilize for particle physics data analysis in the Scientific Python ecosystem:

1. `awkward`
2. `uproot`

These two libraries are, in principle, enough for a physicist to obtain an in-memory representation of the data.

### ATLAS Open Data

**ATLAS has recently released 65 TB of PHYSLITE open data for research -- this is over 7 billion LHC collision events!** This is all the data collected by the experiment during 2015 and 2016. The release is accompanied by an additional 2 billion events of simulated “Monte Carlo” data, which are essential for carrying out a physics analysis. The simulated data have almost the same structure as the real data.
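
As a back-of-envelope check, these numbers can be compared with the PHYSLITE target of 10 kB per event for data. The sketch below assumes that the 65 TB refers to the real collision data only, uses decimal units (1 TB = 10^12 bytes), and treats "over 7 billion" as exactly 7 billion, so the result is only a rough upper estimate.

```python
# Figures quoted in the text; interpretation as data-only is an assumption.
released_bytes = 65e12   # 65 TB in decimal units
n_data_events = 7e9      # "over 7 billion" events, taken as exactly 7 billion

# Average size per event in kB (1 kB = 1000 bytes).
kb_per_event = released_bytes / n_data_events / 1e3

print(f"{kb_per_event:.1f} kB per event")  # 9.3 kB per event
```

Under these assumptions the average comes out slightly below 10 kB per event, which is in line with the PHYSLITE size target.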

Read more about the open data release at: 

https://atlas.cern/Updates/News/Open-Data-Research

The open data portal provides in-depth information about the data, along with analysis tutorials:

https://opendata.atlas.cern/