<a href="https://colab.research.google.com/github/duberii/pid-playground/blob/main/activities/Introduction_to_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Pandas**
---
Welcome to Particle Identification Playground! This activity is designed to introduce you to Pandas, a Python library that provides tools for working with large amounts of data.

**Before completing this notebook, complete the following activities:**
*   Introduction to Vectors
*   Introduction to Special Relativity

**After completing this notebook, you will be able to:**
*   Add new columns to existing DataFrames
*   Iterate over the rows of a DataFrame
*   Select rows from a DataFrame based on column values






In [None]:
#@title Run this cell to import Pandas
import pandas as pd
import numpy as np
# Particle masses in GeV/c^2
masses = {'Proton':0.93827,'Neutron':0.93957,'Kaon':0.49367,'Muon':0.10566,'Pion':0.13957,'Electron':0.000511,'Photon':0}
df = pd.DataFrame()
# Build the Particle column: 10 rows for each particle type (70 rows in total)
ptypes = []
for ptype in masses.keys():
  for i in range(10):
    ptypes.append(ptype)
df['Particle'] = ptypes
# Draw each momentum component uniformly between 0 and 1
pxs = [np.random.uniform(0,1) for i in range(70)]
pys = [np.random.uniform(0,1) for i in range(70)]
pzs = [np.random.uniform(0,1) for i in range(70)]
# Compute E from the mass and momentum, smearing the mass by 0.1% to mimic measurement error
Es = [((masses[df.iloc[i]['Particle']]+np.random.normal(0,0.001*masses[df.iloc[i]['Particle']]))**2+pxs[i]**2+pys[i]**2+pzs[i]**2)**0.5 for i in range(70)]
df['E'] = Es
df['px'] = pxs
df['py'] = pys
df['pz'] = pzs
# Shuffle the rows so the particle types are mixed together
df = df.sample(frac=1, ignore_index=True)

---
## **Why Pandas?**
---
Particle physics experiments can produce up to a petabyte (1 million gigabytes) of data per day. Needless to say, working with this much data can be challenging. Fortunately, several tools allow us to work with large datasets. In this notebook, we will use Pandas, a Python library for manipulating large amounts of data. In future notebooks, we will use the most popular tool in particle physics data analysis, called ROOT. Although ROOT has its own file format, we will continue to use Pandas throughout the remaining activities, as it is faster and easier to use for our purposes.

---
## **How do we use Pandas?**
---
Pandas is a Python library that allows us to create objects called DataFrames. DataFrames are structured like tables. Run the cell below to display a DataFrame.

In [None]:
df.head()

| | Particle | E | px | py | pz |
|---|---|---|---|---|---|
| 0 | Muon | 0.778794 | 0.278722 | 0.107772 | 0.711386 |
| 1 | Electron | 0.901993 | 0.026161 | 0.079285 | 0.89812 |
| 2 | Neutron | 1.309605 | 0.37927 | 0.787745 | 0.261113 |
| 3 | Neutron | 1.352814 | 0.795809 | 0.212556 | 0.518904 |
| 4 | Neutron | 1.055682 | 0.142361 | 0.362502 | 0.286489 |


The `head` method of a dataframe will display the first 5 rows of the table. As we can see, this dataframe has 5 columns: **Particle**, **E**, **px**, **py**, and **pz**. We can obtain a column from the dataframe just like we would access information in a dictionary. Run the code below to print out the **E** column.

In [None]:
print(df["E"])

0     0.778794
1     0.901993
2     1.309605
3     1.352814
4     1.055682
        ...   
65    1.250769
66    1.353361
67    0.742434
68    1.289715
69    1.211714
Name: E, Length: 70, dtype: float64
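
To make the table picture concrete, here is a minimal sketch of building a small dataframe by hand and reading a column back out. The `toy` dataframe and its values are made up for illustration and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical three-row table; the values are invented for illustration.
toy = pd.DataFrame({
    "Particle": ["Proton", "Pion", "Photon"],
    "E": [1.2, 0.5, 0.3],
})

print(toy.head(2))              # first two rows instead of the default five
print(list(toy.columns))        # the column names

energies = toy["E"]             # a single column comes back as a pandas Series
print(type(energies).__name__)  # Series
print(energies.iloc[0])         # first entry of the column, selected by position
```

Passing a number to `head` changes how many rows are shown, and a single column is itself an object (a Series) that can be stored in a variable.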


This particular dataframe contains the energy and momentum components of a variety of particles that we observe in our particle detectors. As seen in the [Introduction to Special Relativity notebook](https://colab.research.google.com/drive/1LhSPaw8m91syKz5EVh_IvA6f1CGkbZPk?usp=sharing), we can calculate the mass of a particle by solving the relativistic energy equation for the mass (in units where $c = 1$):
$$m = \sqrt{E^2-|p|^2},$$
where $E$ is the relativistic energy of the particle and
$$|p| = \sqrt{p_x^2+p_y^2+p_z^2}$$
is the magnitude of the relativistic momentum vector of the particle. In Pandas, we can define new columns using basic arithmetic on other columns. Run the cell below to calculate the magnitude of the momentum vector based on its components.

In [None]:
df['p'] = (df['px']**2 + df['py']**2 + df['pz']**2)**0.5
df.head()

| | Particle | E | px | py | pz | p |
|---|---|---|---|---|---|---|
| 0 | Muon | 0.778794 | 0.278722 | 0.107772 | 0.711386 | 0.771602 |
| 1 | Electron | 0.901993 | 0.026161 | 0.079285 | 0.89812 | 0.901993 |
| 2 | Neutron | 1.309605 | 0.37927 | 0.787745 | 0.261113 | 0.912452 |
| 3 | Neutron | 1.352814 | 0.795809 | 0.212556 | 0.518904 | 0.973526 |
| 4 | Neutron | 1.055682 | 0.142361 | 0.362502 | 0.286489 | 0.483477 |
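
The same kind of column arithmetic works for any quantity built from existing columns. As a rough sketch, the cell below computes the transverse momentum $p_T=\sqrt{p_x^2+p_y^2}$, a quantity chosen only for illustration that is not used elsewhere in this notebook. The `toy` dataframe and its values are assumptions, picked so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical momentum components, chosen so the square roots come out exact.
toy = pd.DataFrame({"px": [3.0, 0.0], "py": [4.0, 1.0], "pz": [12.0, 0.0]})

# Arithmetic on columns is applied to every row at once; no explicit loop is needed.
toy["pT"] = np.sqrt(toy["px"]**2 + toy["py"]**2)

# A new column can be used right away to define further columns.
toy["p"] = np.sqrt(toy["pT"]**2 + toy["pz"]**2)

print(toy)
```

Here `np.sqrt` plays the same role as `**0.5` in the cell above; either form operates on the whole column.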


#### **Question #1:**
---
Write code in the cell below that defines another column of the dataframe called `m`, which contains the mass of the particle. **Hint:** You can calculate mass based on the magnitude of the momentum vector and the energy:
$$m = \sqrt{E^2-|p|^2}$$

In [None]:
#Complete this code
df.head()

##### **Solution:**

In [None]:
df['m'] = (df['E']**2 - df['p']**2)**0.5
df.head()

| | Particle | E | px | py | pz | p | m |
|---|---|---|---|---|---|---|---|
| 0 | Muon | 0.778794 | 0.278722 | 0.107772 | 0.711386 | 0.771602 | 0.105594 |
| 1 | Electron | 0.901993 | 0.026161 | 0.079285 | 0.89812 | 0.901993 | 0.000511 |
| 2 | Neutron | 1.309605 | 0.37927 | 0.787745 | 0.261113 | 0.912452 | 0.939413 |
| 3 | Neutron | 1.352814 | 0.795809 | 0.212556 | 0.518904 | 0.973526 | 0.939336 |
| 4 | Neutron | 1.055682 | 0.142361 | 0.362502 | 0.286489 | 0.483477 | 0.938463 |


---
## **Filtering Dataframes**
---
Notice that the calculated masses differ slightly, even between particles of the same type! This is common, because energy and momentum cannot be measured exactly. To estimate the true mass of each particle type, we will average all of the calculated mass values for that type. However, this dataset contains several particle types! Thus, we first need a dataframe that contains only one type of particle.

We can do this using the dataframe's `loc` property, which allows us to select which rows we want to work with. First, however, note that we can also use comparison operators on columns. For example, run the code below to see the effect of the "is equal to" condition on the `Particle` column.

In [None]:
df['Particle']=="Proton"

0     False
1     False
2     False
3     False
4     False
      ...  
65    False
66    False
67    False
68    False
69    False
Name: Particle, Length: 70, dtype: bool

This operation returns another Pandas object (called a **Series**), which contains `True` for each row where the condition holds and `False` for each row where it does not. For our purposes, we can treat a Series much like a list.
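
As a small sketch of what can be done with such a Series of `True`/`False` values, the cell below counts the matching rows and combines two conditions. The `toy` dataframe, its values, and the 1.5 energy threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame({
    "Particle": ["Proton", "Pion", "Proton", "Photon"],
    "E": [1.2, 0.5, 2.0, 0.3],
})

is_proton = toy["Particle"] == "Proton"

# True counts as 1 and False as 0, so the sum is the number of matching rows.
n_protons = is_proton.sum()

# Conditions are combined with & (and) or | (or); each comparison needs its own parentheses.
energetic_protons = is_proton & (toy["E"] > 1.5)

print(n_protons)                   # 2
print(energetic_protons.tolist())  # [False, False, True, False]
```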

We can pass this series to the `loc` property to select the rows that satisfy the given condition. Run the code below to create a dataframe called `protons`, which contains only the rows for which the `Particle` column contains the string `Proton`. The next line of code computes the mean value of the `m` column.

In [None]:
protons = df.loc[df['Particle']=='Proton']
proton_mass = protons['m'].mean()
print("Proton Mass: " + str(proton_mass) + " GeV/c^2")

Proton Mass: 0.938067689897155 GeV/c^2


#### **Question #2:**
---
Write code in the cell below to compute the mean calculated mass for the following particles:

*   Pion
*   Electron




In [None]:
pions =
pion_mass =
print("Pion Mass: " + str(pion_mass) + " GeV/c^2")
electrons =
electron_mass =
print("Electron Mass: " + str(electron_mass) + " GeV/c^2")

##### **Solution:**

In [None]:
pions = df.loc[df['Particle']=='Pion']
pion_mass = pions['m'].mean()
print("Pion Mass: " + str(pion_mass) + " GeV/c^2")
electrons = df.loc[df['Particle']=='Electron']
electron_mass = electrons['m'].mean()
print("Electron Mass: " + str(electron_mass) + " GeV/c^2")

Pion Mass: 0.13958965652145236 GeV/c^2
Electron Mass: 0.0005111072523878897 GeV/c^2
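
Filtering one particle type at a time works well for a few types. As an alternative sketch, not the approach used in this notebook, Pandas can also compute the mean for every particle type in one step with `groupby`. The `toy` dataframe and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical calculated masses, invented for illustration.
toy = pd.DataFrame({
    "Particle": ["Pion", "Electron", "Pion", "Electron"],
    "m": [0.139, 0.0005, 0.141, 0.0007],
})

# Group the rows by particle type, then average the m column within each group.
mean_masses = toy.groupby("Particle")["m"].mean()

print(mean_masses)
print("Pion Mass: " + str(mean_masses["Pion"]) + " GeV/c^2")
```

The result is a Series indexed by particle type, so an individual mean can be looked up by name.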


#### **Question #3:**
---
Create a dataframe containing all of the particles whose mass is less than 0.00001 GeV/c^2. What particle types do you see?

In [None]:
light_particles =
light_particles.head()

Double click to edit this cell and answer the following question: What particle types do you see?

##### **Solution:**

In [None]:
light_particles = df.loc[df['m'] < 0.00001]
light_particles.head()

| | Particle | E | px | py | pz | p | m |
|---|---|---|---|---|---|---|---|
| 11 | Photon | 0.09021 | 0.017221 | 0.088219 | 0.007663 | 0.09021 | 0.0 |
| 14 | Photon | 0.928015 | 0.172499 | 0.406059 | 0.816439 | 0.928015 | 0.0 |
| 15 | Photon | 1.001158 | 0.569728 | 0.283825 | 0.772768 | 1.001158 | 0.0 |
| 35 | Photon | 1.112853 | 0.665238 | 0.004575 | 0.89212 | 1.112853 | 0.0 |
| 36 | Photon | 0.498579 | 0.127265 | 0.309535 | 0.369558 | 0.498579 | 0.0 |


Only photons, which have 0 mass, appear in the dataframe.

#### **Question #4:**
---
We can calculate the speed of each particle from its momentum by solving the relativistic momentum equation for $v$:
$$
v=\frac{p}{\sqrt{m^2+p^2}}
$$
Here $v$ is expressed as a fraction of the speed of light $c$. Add a column to the original dataframe with the calculated speeds. How does the average speed of a proton compare to the average speed of an electron?

In [None]:
df['v'] =
protons =
proton_velocity =
print("Proton Average Velocity: " + str(proton_velocity) + " c")
electrons =
electron_velocity =
print("Electron Average Velocity: " + str(electron_velocity) + " c")

##### **Solution:**

In [None]:
df['v'] = df['p']/(df['m']**2+df['p']**2)**0.5
protons = df.loc[df['Particle']=='Proton']
proton_velocity = protons['v'].mean()
print("Proton Average Velocity: " + str(proton_velocity) + " c")
electrons = df.loc[df['Particle']=='Electron']
electron_velocity = electrons['v'].mean()
print("Electron Average Velocity: " + str(electron_velocity) + " c")

Proton Average Velocity: 0.7251066164906019 c
Electron Average Velocity: 0.9999998136448405 c
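
The speed formula used in Question #4 can be recovered in a few lines. This is a sketch, assuming units where $c = 1$ and the standard definitions $E = \gamma m$ and $p = \gamma m v$, where $\gamma$ is the Lorentz factor (a symbol not used elsewhere in this notebook):

$$
\frac{p}{E} = \frac{\gamma m v}{\gamma m} = v, \qquad E = \sqrt{m^2+p^2} \quad\Rightarrow\quad v = \frac{p}{\sqrt{m^2+p^2}}
$$

Because $\sqrt{m^2+p^2} \geq p$, the speed never exceeds 1 (that is, $c$). When $m$ is much smaller than $p$, as for the electrons in this dataset, the ratio is very close to 1, which matches the output above.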


---
## **Iterating over Dataframes**
---
Sometimes, we want to iterate over the rows of a dataframe. For example, without modifying the dataframe, it would be difficult to print
```
[Particle] has a mass of [m] GeV/c^2
```
for each row, where `[Particle]` and `[m]` are replaced with the corresponding entries in each row. However, we can iterate over the `.iloc` property, which yields one row at a time. Run the code below to see how `.iloc` solves this problem.

In [None]:
for row in df.iloc:
  Particle = row['Particle']
  m = str(row['m'])
  print(Particle + " has a mass of " + m + " GeV/c^2")

Muon has a mass of 0.10559427612647053 GeV/c^2
Electron has a mass of 0.00051115143483826 GeV/c^2
Neutron has a mass of 0.9394126213690044 GeV/c^2
Neutron has a mass of 0.9393361923202903 GeV/c^2
Neutron has a mass of 0.9384633706125254 GeV/c^2
Kaon has a mass of 0.4937196129595681 GeV/c^2
Kaon has a mass of 0.49398395698151776 GeV/c^2
Muon has a mass of 0.10567671486206286 GeV/c^2
Muon has a mass of 0.10563108661476055 GeV/c^2
Muon has a mass of 0.10550267828556674 GeV/c^2
Kaon has a mass of 0.49384258087824534 GeV/c^2
Photon has a mass of 0.0 GeV/c^2
Proton has a mass of 0.9381856803282945 GeV/c^2
Proton has a mass of 0.939281977547552 GeV/c^2
Photon has a mass of 0.0 GeV/c^2
Photon has a mass of 0.0 GeV/c^2
Kaon has a mass of 0.49312817091222616 GeV/c^2
Neutron has a mass of 0.9412350280534708 GeV/c^2
Electron has a mass of 0.0005114029179846808 GeV/c^2
Kaon has a mass of 0.4951875393495249 GeV/c^2
Proton has a mass of 0.937046407232431 GeV/c^2
Muon has a mass of 0.10573919708070241 GeV/c^2

We use a for loop to iterate over the rows of the dataframe, then we access the columns to fill in the print statement. We will make use of this method in later activities.
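
For comparison, Pandas also provides the `iterrows` and `itertuples` methods for looping over rows. The sketch below shows both as an alternative to the `.iloc` loop above, not a replacement for it. The `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row table, invented for illustration.
toy = pd.DataFrame({"Particle": ["Muon", "Kaon"], "m": [0.1057, 0.4937]})

lines = []

# iterrows yields (index label, row) pairs; each row is a Series keyed by column name.
for index, row in toy.iterrows():
    lines.append(row["Particle"] + " has a mass of " + str(row["m"]) + " GeV/c^2")

# itertuples yields one named tuple per row; columns are read as attributes.
for row in toy.itertuples(index=False):
    lines.append(f"{row.Particle} has a mass of {row.m} GeV/c^2")

print("\n".join(lines))
```

Both loops build the same sentences. `itertuples` is usually the faster of the two on large dataframes, while `iterrows` keeps the dictionary-style column access used throughout this notebook.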