# Getting Started

The ePIC collaboration is providing full simulation data files in the ROOT data format through the XRootD service at Jefferson Lab. This allows analysis without the need to download any data files.

In this notebook we show how to load a file from the XRootD service using the [uproot](https://pypi.org/project/uproot/) Python library. This allows for seamless interfacing with many data science and machine learning tools.

## Importing uproot

Depending on the versions of uproot and XRootD that you have installed, you may encounter a warning from uproot below. Because of the simple data format of the ePIC ROOT files, we can ignore this warning.

In [None]:
import uproot as ur
print('Uproot version: ' + ur.__version__)

## Opening a file with uproot

To test uproot, we will open a sample file (a single reconstructed DIS NC output file):

In [None]:
server = 'root://dtn-eic.jlab.org//work/eic2/'
dir = 'EPIC/RECO/23.06.1/epic_brycecanyon/DIS/NC/18x275/minQ2=10/'
file = 'pythia8NCDIS_18x275_minQ2=10_beamEffects_xAngle=-0.025_hiDiv_1.0000.eicrecon.tree.edm4eic.root'

In [None]:
events = ur.open(server + dir + file + ':events')
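
The string passed to `ur.open` has four parts: the XRootD server prefix, the directory, the file name and, after a final colon, the name of the object to read (here the `events` tree). The following is a minimal sketch of a helper that assembles such a string for other files. The name `build_xrootd_url` is hypothetical and introduced only for illustration; it is not part of uproot.

```python
def build_xrootd_url(server, directory, filename, tree="events"):
    """Join the pieces of an XRootD location into 'url:tree' form (illustrative helper)."""
    # Strip redundant slashes at the joins, but keep the double slash that
    # follows the host name inside `server`.
    base = server.rstrip("/")
    path = directory.strip("/")
    return f"{base}/{path}/{filename}:{tree}"


url = build_xrootd_url(
    "root://dtn-eic.jlab.org//work/eic2/",
    "EPIC/RECO/23.06.1/epic_brycecanyon/DIS/NC/18x275/minQ2=10/",
    "pythia8NCDIS_18x275_minQ2=10_beamEffects_xAngle=-0.025_hiDiv_1.0000.eicrecon.tree.edm4eic.root",
)
print(url)
```

The result is the same string as `server + dir + file + ':events'` above, so it can be passed to `ur.open` in the same way.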

## Exploring the file contents

We can now look into the file, including all its branches. Let's take a look at the possible 'keys':

In [None]:
events.keys()

That is a lot of branches!

Maybe we are only interested in a few branches. Let's look at the branches with particles reconstructed by the track reconstruction algorithms:

In [None]:
events.keys(filter_name='ReconstructedChargedParticles.*')
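
The `*` in the filter is a glob-style wildcard that matches any run of characters. As a rough illustration of that kind of matching, the sketch below applies the same pattern to a short list of branch names using Python's standard `fnmatch` module. It does not call uproot, and the list is illustrative only: apart from `ReconstructedChargedParticles.energy`, the names are assumed for the example and are not taken from the file.

```python
from fnmatch import fnmatchcase

# Illustrative branch names only; the real file has many more.
branch_names = [
    "MCParticles.PDG",
    "ReconstructedChargedParticles.energy",
    "ReconstructedChargedParticles.charge",
    "ReconstructedParticles.energy",
]

pattern = "ReconstructedChargedParticles.*"

# Keep only the names that match the glob pattern (case-sensitive).
selected = [name for name in branch_names if fnmatchcase(name, pattern)]
print(selected)
```

Only the two names that start with `ReconstructedChargedParticles.` survive; `ReconstructedParticles.energy` does not match because the literal prefix differs.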

## Making a simple plot

Of course, we came here to create plots, not just to look at branches. Uproot can give us the data from branches as arrays that behave much like `numpy` arrays. From there, we can use `matplotlib` to create a histogram. Let's do this with the energy of the reconstructed charged particles.

In [None]:
reconstructed_charged_particles = events['ReconstructedChargedParticles'].arrays()

If you are running this on a Jupyter instance that displays memory use, you will see it increase after the previous step. Keep this in mind: some of the files you are accessing are several GB in size, so you will likely want to avoid reading all arrays from an entire file into memory, even on regular servers.
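
One way to limit memory use is to read only the branches you need, and only a limited number of events. The sketch below wraps a call to the tree's `arrays` method with its `filter_name` and `entry_stop` arguments. The helper name `read_selected` and the value of 1000 events in the usage comment are assumptions made for illustration.

```python
def read_selected(tree, branch_names, max_entries=None):
    """Read only the named branches, and at most `max_entries` events."""
    # `filter_name` restricts which branches are read;
    # `entry_stop` restricts how many events are read (None means all).
    return tree.arrays(filter_name=branch_names, entry_stop=max_entries)


# Example use with the tree opened earlier in this notebook:
# energy_only = read_selected(
#     events, ['ReconstructedChargedParticles.energy'], max_entries=1000
# )
```

Because the helper only forwards arguments, the amount of data transferred is decided by the branch list and the event limit you pass in.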

Let's start by taking a look at the `energy` variables in the array we just obtained.

In [None]:
reconstructed_charged_particles['ReconstructedChargedParticles.energy']

As is very common in nuclear and high energy physics, these are not 'regular' `numpy` arrays, as indicated by the `var` in the dimension. This is because the number of reconstructed particles varies from event to event. We use the `awkward` package to deal with these 'awkward' arrays. In particular, we can 'regularize' them using a `flatten` operation.
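
To make the jagged structure concrete, the following pure-Python sketch mimics flattening with plain nested lists. The energy values are made up for illustration, and the code deliberately uses only the standard library rather than `awkward`, so it shows the idea and not the actual implementation.

```python
from itertools import chain

# Illustrative energies (GeV) for three events with 2, 0 and 3 particles.
energies_per_event = [[1.2, 7.5], [], [0.4, 3.3, 12.0]]

# The varying inner length is what the `var` dimension refers to.
particles_per_event = [len(event) for event in energies_per_event]

# Flattening removes the event boundaries and leaves one entry per particle.
flat_energies = list(chain.from_iterable(energies_per_event))

print(particles_per_event)  # [2, 0, 3]
print(flat_energies)        # [1.2, 7.5, 0.4, 3.3, 12.0]
```

After flattening, the information about which particle belonged to which event is gone. That is fine for a histogram of all particle energies, but not for per-event quantities.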

In [None]:
import numpy as np
import awkward as ak
import matplotlib.pyplot as plt

In [None]:
ak.flatten(reconstructed_charged_particles['ReconstructedChargedParticles.energy'])

In [None]:
# One entry per particle; 50 bins over 0 to 50 GeV gives bins that are 1 GeV wide
plt.hist(ak.flatten(reconstructed_charged_particles['ReconstructedChargedParticles.energy']), range=(0, 50), bins=50)
plt.xlabel('Energy [GeV]')
plt.ylabel('Particles / GeV')
plt.yscale('log')
plt.show()