# Data Handling with Pandas

In [None]:
# Imports
import matplotlib
import matplotlib.colors  # provides LogNorm, used for the log-scale histogram below
import numpy
import pylab
import pandas as pd
%matplotlib inline

There are a few ways to read and write data in Python.  If you have a simple text file, you can do this:

In [None]:
Events=numpy.loadtxt("Events.txt")

For more power over your data, you can use Pandas, a data-analysis library that makes it easy to manipulate and combine large data tables. It also has convenient routines for reading and writing other formats, such as Excel, which often prove useful. To open a file in Pandas, do this:

In [None]:
Events=pd.read_csv("Events.txt",delimiter=' ')

You can look at what's in the data file like this:

In [None]:
Events

We see that the column titles are taken from the first row of the file. Since the data file has no header row, that is not ideal: the first event is lost as data and the labels are meaningless. We can give the table proper column titles like this:

In [None]:
columnnames=['event_number',
'run_number',
'p1_Px',
'p1_Py',
'p1_Pz',
'p1_E',
'p1_Q',
'p1_ID',
'p2_Px',
'p2_Py',
'p2_Pz',
'p2_E',
'p2_Q',
'p2_ID']

Events=pd.read_csv("Events.txt",delimiter=' ',names=columnnames)
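
To see what `names=` changes, here is a minimal sketch on a made-up two-row table held in memory (the values and the short column names are illustrative assumptions, not taken from Events.txt). It compares the default behaviour with `header=None` and with `names=`:

```python
import io
import pandas as pd

# Two rows of made-up, space-separated numbers standing in for a headerless file.
text = "1 100 0.5\n2 100 -1.5\n"

# Default: the first row is consumed as the header, so only one data row remains.
with_default = pd.read_csv(io.StringIO(text), delimiter=" ")

# header=None keeps every row as data and numbers the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(text), delimiter=" ", header=None)

# names=... also keeps every row as data, and labels the columns.
named = pd.read_csv(io.StringIO(text), delimiter=" ", names=["event", "run", "px"])

print(len(with_default), len(no_header), len(named))  # prints: 1 2 2
```

Passing `names=` without a `header` argument is what the cell above does: every line of the file is treated as data.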

In [None]:
Events

To access a particular column, we call it like this:

In [None]:
Events.p1_E

The type of this object is a pandas Series. We can handle it as it is, or we can convert it to a numpy array if we so choose:

In [None]:
ar=numpy.array(Events.p1_E)
print(len(ar))
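
As an alternative sketch, a Series also has a `to_numpy()` method that does the same conversion. The values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# A small made-up Series standing in for Events.p1_E.
energies = pd.Series([10.0, 20.0, 30.0], name="p1_E")

# to_numpy() returns the underlying values as a numpy array.
as_array = energies.to_numpy()

print(type(as_array).__name__, as_array.shape)  # prints: ndarray (3,)
```

Either way, the row labels (the index) are dropped and only the values are kept.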

We can histogram it either as a pandas series, or as an array, whatever is easiest:

In [None]:
pylab.hist(Events.p1_E,bins=100)
pylab.title("Histogramming the Pandas series")
pylab.show()
pylab.hist(ar,bins=100)
pylab.title("Histogramming the array")
pylab.show()

We can plot two series against each other like this:

In [None]:
pylab.plot(Events.p1_E,Events.p1_Px,'.')

In a plot like the one above, the dots all sit on top of each other. To better show the dot density, a useful trick is to make each dot slightly transparent:

In [None]:
pylab.plot(Events.p1_E,Events.p1_Px,'.',alpha=0.02)

Even better would be to make a 2D histogram:

In [None]:
pylab.hist2d(Events.p1_E,Events.p1_Px,bins=(100,100),cmap='Blues')
pylab.colorbar()
pylab.show()

Even better is a 2D histogram with a log scale:


In [None]:
pylab.hist2d(Events.p1_E,Events.p1_Px,bins=(100,100),cmap='Blues',norm = matplotlib.colors.LogNorm())
pylab.colorbar()
pylab.show()

And so on. 

# Slicing up data tables

Just like with numpy arrays, we can select sub-tables from Pandas, either by picking out interesting rows or interesting columns.  If we only care about the kinematics of particle 1, for example, we can pick out only the particle 1 columns.  Try to understand what the following code is doing:

In [None]:
# This gives us the list of column names
Columns=Events.columns
print(Columns)

In [None]:
# Find out if each column has "p1" in its name:
whichones=["p1" in name for name in Columns]
print(whichones)
columnstokeep=numpy.array(Columns[whichones])
print(columnstokeep)

In [None]:
SubTable=Events[columnstokeep]
print(SubTable)
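
The same selection can be written more compactly. The sketch below uses a made-up four-column table (an assumption for illustration) and shows two equivalent routes: a boolean mask built with the `.str` accessor on the column labels, and `DataFrame.filter(like=...)`:

```python
import pandas as pd

# A made-up table with a mix of p1 and non-p1 columns.
toy = pd.DataFrame({
    "event_number": [1, 2],
    "p1_Px": [0.1, -0.2],
    "p1_E": [5.0, 6.0],
    "p2_Px": [0.3, 0.4],
})

# Boolean mask over the column labels, then select all rows and the masked columns.
mask = toy.columns.str.contains("p1")
by_mask = toy.loc[:, mask]

# filter(like=...) keeps the columns whose label contains the substring.
by_filter = toy.filter(like="p1")

print(list(by_mask.columns))  # prints: ['p1_Px', 'p1_E']
```

Both give the same sub-table as the list-comprehension approach above; which one reads best is a matter of taste.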

Another thing we might want to do is keep only events matching a certain criterion. To access a set of rows rather than a set of columns, we can use the loc indexer with a boolean condition. The following picks out every row of the subtable with p1_Px>0:

In [None]:
OnlyPositiveP1=SubTable.loc[SubTable.p1_Px>0]
print(OnlyPositiveP1)
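
It is easy to mix up `loc` and `iloc`. As a minimal sketch on a made-up table (values are illustrative assumptions): `loc` selects by label or by boolean condition, while `iloc` selects by integer position.

```python
import pandas as pd

# A made-up table with four rows, labelled 0..3 by default.
toy = pd.DataFrame({
    "p1_Px": [1.5, -0.5, 2.0, -3.0],
    "p1_Py": [0.1, 0.2, 0.3, 0.4],
})

# loc with a boolean condition: keep the rows where the condition is True.
positive = toy.loc[toy.p1_Px > 0]

# iloc with a slice: keep the first two rows by position, whatever their values.
first_two = toy.iloc[:2]

print(list(positive.index))   # prints: [0, 2]
print(list(first_two.index))  # prints: [0, 1]
```

Note that the filtered table keeps the original row labels (0 and 2 here). If you want them renumbered from zero, `positive.reset_index(drop=True)` does that.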

In [None]:
# Check whether we succeeded by making a plot
pylab.plot(SubTable.p1_Px,SubTable.p1_Py,'o',alpha=0.02,color='red')
pylab.plot(OnlyPositiveP1.p1_Px,OnlyPositiveP1.p1_Py,'o',alpha=0.02,color='black')


We can save our sliced-up data table to disk in a variety of formats like this:

In [None]:
OnlyPositiveP1.to_csv("./DataInCSV.csv")
OnlyPositiveP1.to_excel("./DataInExcel.xlsx",engine='openpyxl')
OnlyPositiveP1.to_html("./DataInhtml.html")
# And so on. (tab complete 'OnlyPositiveP1.to' to see what else is available)
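
One detail worth knowing when you read a saved CSV back in: by default `to_csv` also writes the row labels as an extra, unnamed first column. The sketch below uses a made-up table and an in-memory buffer in place of a file on disk (both are assumptions for illustration) to show the difference `index=False` makes:

```python
import io
import pandas as pd

# A made-up filtered table whose row labels are no longer consecutive.
toy = pd.DataFrame({"p1_Px": [1.5, 2.0], "p1_Py": [0.1, 0.3]}, index=[0, 2])

# Default: the index is written too, and comes back as an extra column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False: only the data columns are written.
buffer2 = io.StringIO()
toy.to_csv(buffer2, index=False)
buffer2.seek(0)
without_index = pd.read_csv(buffer2)

print(list(with_index.columns))     # prints: ['Unnamed: 0', 'p1_Px', 'p1_Py']
print(list(without_index.columns))  # prints: ['p1_Px', 'p1_Py']
```

Whether to keep the index depends on whether the original row labels carry meaning for you, for example as a record of which events survived the cut.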

# Exercise

Using Pandas data tables, select only the columns that hold a momentum component (Px, Py or Pz).

Make a plot of the total momentum of particle 1 vs the total momentum of particle 2, for all events where the Z component of p1 is greater than zero.

(Total momentum is the norm of the vector (Px, Py, Pz).)
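
For reference, the norm mentioned above is the usual Euclidean length of the momentum vector, written here with the component names used in the column titles:

$$
|\vec{p}| = \sqrt{P_x^2 + P_y^2 + P_z^2}
$$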

# Getting good at Pandas


If you want to go beyond the basics with Pandas, check out this mini-course:
https://www.kaggle.com/learn/pandas