# Pandas Exercises

Tamás Gál (tamas.gal@fau.de)

The latest version of this notebook is available at [https://github.com/escape2020/school2022](https://github.com/escape2020/school2022)

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as ml
import matplotlib.pyplot as plt  # explicit import; `ml.pyplot` only exists once the submodule has been loaded
import sys
ml.rcParams['figure.figsize'] = (10.0, 5.0)

print(f"Python version: {sys.version}\n"
      f"Pandas version: {pd.__version__}\n"
      f"NumPy version: {np.__version__}\n"
      f"Matplotlib version: {ml.__version__}\n"
      f"seaborn version: {sns.__version__}")

Python version: 3.9.12 (main, Mar 26 2022, 15:44:31) 
[Clang 13.1.6 (clang-1316.0.21.2)]
Pandas version: 1.4.2
NumPy version: 1.22.4
Matplotlib version: 3.5.2
seaborn version: 0.11.2


In [2]:
from IPython.core.magic import register_line_magic

@register_line_magic
def shorterr(line):
    """Show only the exception message if one is raised."""
    try:
        output = eval(line)
    except Exception as e:
        print("\x1b[31m\x1b[1m{e.__class__.__name__}: {e}\x1b[0m".format(e=e))
    else:
        return output
    
del shorterr

In [3]:
import warnings
warnings.filterwarnings('ignore')  # annoying UserWarnings from Jupyter/seaborn which are not fixed yet

## Exercise 1

Use the `pd.read_csv()` function to create a `DataFrame` from the dataset `data/neutrinos.csv`. You will encounter a few obstacles, but make sure you do not modify the raw data. It is always a good idea to open the CSV file in an editor first to see how it is formatted.

Every column of the resulting `DataFrame` needs to have a well-defined `dtype` (something other than `object`).

**Notice: energies are measured in `GeV`, distances and positions in `m`, angles in radians and time in `ns` for all of the provided datasets.**
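
The obstacles in `data/neutrinos.csv` are not listed here, so the following sketch works on a made-up CSV snippet (not the real file) and only illustrates a few `pd.read_csv()` parameters that tend to help with awkwardly formatted files. The column names `id`, `energy` and `is_cc` are invented for the illustration.

```python
import io
import pandas as pd

# Hypothetical file content, NOT the real data/neutrinos.csv.
raw = """# generated for illustration only
id;energy;is_cc
0;1,5;True
1;20,25;False
2;3,0;True
"""

df = pd.read_csv(
    io.StringIO(raw),  # a file path works the same way
    sep=";",           # column delimiter used in the file
    decimal=",",       # decimal mark used for floating point numbers
    comment="#",       # skip everything after this character
)

# Every column should report a dtype other than `object`.
print(df.dtypes)
```

Checking `df.dtypes` after each attempt is a quick way to see which columns still need attention.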

## Exercise 2

Create a histogram of the neutrino energies (`energy`).

## Exercise 3

Use the `pd.read_csv()` function to create a `DataFrame` from the dataset `data/reco.csv`. Make sure the index column is parsed correctly too. This dataset contains the reconstruction results for each of the neutrino events in the other data file.
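
As a small illustration of index parsing, the sketch below reads a made-up CSV snippet whose first column has no header. The layout of the real `data/reco.csv` may differ, and the values are invented.

```python
import io
import pandas as pd

# Hypothetical content with an unnamed first column, NOT the real data/reco.csv.
raw = """,zenith,energy
0,0.5,12.0
1,2.1,3.5
"""

# Without `index_col`, the first column ends up as a regular data column.
naive = pd.read_csv(io.StringIO(raw))

# With `index_col=0`, the first column becomes the index of the DataFrame.
reco = pd.read_csv(io.StringIO(raw), index_col=0)

print(naive.columns.tolist())
print(reco.columns.tolist())
```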

## Exercise 4

Combine the `neutrinos` and `reco` `DataFrames` into a single `DataFrame`. This makes it easier to compare the MC and reconstruction parameters.

Hint: `pd.concat()`

## Exercise 5

Create a plot to visualise the zenith reconstruction quality (true vs. reconstructed zenith).

In case you did not complete exercise 4:

    data = pd.concat([neutrinos, reco.add_prefix('reco_')], axis="columns")
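
To see what this call does, here is a minimal sketch on two toy frames (the values are invented). `add_prefix()` renames the columns of the second frame so that identically named columns do not collide, and `pd.concat()` with `axis="columns"` aligns the rows on the index.

```python
import pandas as pd

# Toy stand-ins for the two datasets; both share the same index.
neutrinos = pd.DataFrame({"zenith": [0.1, 1.5, 3.0]}, index=[0, 1, 2])
reco = pd.DataFrame({"zenith": [0.2, 1.4, 2.9]}, index=[0, 1, 2])

# Side-by-side combination: one row per event, prefixed reco columns.
data = pd.concat([neutrinos, reco.add_prefix("reco_")], axis="columns")

print(data)
```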

## Exercise 6

Create a histogram of the cascade probabilities (__`neutrinos`__ dataset: `proba_cscd` column) for the energy ranges 1-5 GeV, 5-10 GeV, 10-20 GeV and 20-100 GeV.

Hint: `pd.cut()`
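
A minimal sketch of how `pd.cut()` assigns values to the energy ranges, using invented toy values rather than the real dataset. The helper column name `energy_range` is an arbitrary choice.

```python
import pandas as pd

# Toy values, not taken from the neutrinos dataset.
df = pd.DataFrame({
    "energy": [1.5, 4.0, 7.5, 12.0, 19.0, 50.0],
    "proba_cscd": [0.1, 0.3, 0.5, 0.6, 0.8, 0.9],
})

bins = [1, 5, 10, 20, 100]  # bin edges in GeV

# Each energy is mapped to an interval; intervals are right-closed by default,
# so an energy of exactly 5 GeV would land in (1, 5].
df["energy_range"] = pd.cut(df["energy"], bins=bins)

# Number of events per energy range.
counts = df.groupby("energy_range", observed=False).size()
print(counts)
```

The new categorical column can then serve as a grouping key, for example via the `by` parameter of `DataFrame.hist()`.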

## Exercise 7

Create a 2D histogram showing the distribution of the `x` and `y` values of the starting positions (`pos_x` and `pos_y`) of the neutrinos.
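
The binning behind a 2D histogram can be sketched with `np.histogram2d()` on randomly generated positions (the numbers below are synthetic, not the neutrino data). `plt.hist2d()` accepts the same `x`, `y` and `bins` arguments and draws the result directly.

```python
import numpy as np
import pandas as pd

# Synthetic positions in metres, for illustration only.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "pos_x": rng.normal(0.0, 50.0, size=1000),
    "pos_y": rng.normal(0.0, 50.0, size=1000),
})

# counts[i, j] is the number of entries in x-bin i and y-bin j.
counts, x_edges, y_edges = np.histogram2d(df["pos_x"], df["pos_y"], bins=20)

print(counts.shape)
```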

## Exercise 8

Use `seaborn` (`import seaborn as sns`) to recreate the 2D histogram from Exercise 7. The functions `sns.displot()` or `sns.jointplot()` will be handy.

## Exercise 9

Create two histograms of the `azimuth` and `zenith` distribution side by side, in one plot (two subplots).

Try both the built-in matplotlib wrapper of `pandas` and the raw matplotlib library.
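
One possible layout, sketched on synthetic angles (not the real data): a single figure with two axes, where the left histogram goes through the `pandas` plotting wrapper and the right one through matplotlib directly. The bin count and figure size are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic angles in radians, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "azimuth": rng.uniform(0, 2 * np.pi, size=500),
    "zenith": rng.uniform(0, np.pi, size=500),
})

fig, (ax_az, ax_ze) = plt.subplots(ncols=2, figsize=(10, 4))

# pandas wrapper: draws into the axes passed via `ax`.
df["azimuth"].plot.hist(bins=30, ax=ax_az)
ax_az.set_xlabel("azimuth [rad]")

# Raw matplotlib: call the method on the axes object.
ax_ze.hist(df["zenith"], bins=30)
ax_ze.set_xlabel("zenith [rad]")

fig.tight_layout()
```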

## Exercise 10

Split the data into two groups: `upgoing` and `downgoing`, based on the `zenith` value (`zenith == 0` is coming directly from above, `zenith == π` from below).

Try out `sns.stripplot` to verify your "cut" on the data!
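
A boolean mask is one way to express such a cut. The sketch below uses a handful of invented zenith values and puts the boundary at π/2 (horizontal); assigning exactly horizontal events to `downgoing` is an arbitrary choice made here.

```python
import numpy as np
import pandas as pd

# Toy zenith angles in radians, for illustration only.
df = pd.DataFrame({"zenith": [0.0, 0.4, 1.2, 2.0, 3.0, np.pi]})

# zenith == 0 comes from above, zenith == pi from below,
# so everything beyond pi/2 is travelling upwards.
is_upgoing = df["zenith"] > np.pi / 2

upgoing = df[is_upgoing]
downgoing = df[~is_upgoing]

print(len(upgoing), len(downgoing))
```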

## Exercise 11

Create a combined histogram (two histograms overlaid in the same plot) for both `upgoing` and `downgoing` datasets, showing the `zenith` angle.

## Exercise 12

Read a KM3NeT event file (`data/hits.h5`) and examine the PMT signals (`tot`) of the hits (mean, min, max) as well as the earliest hit, separately for each digital optical module (identified by `dom_id`). Create a histogram of the `tot` distribution for each module. The function `pd.read_hdf()` is a good start. The dataset path inside the HDF5 file is `/hits`.
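
The per-module statistics can be sketched with `groupby()` and named aggregation. The toy table below stands in for the real hits, and the column name `time` for the hit time is an assumption that may not match the file.

```python
import pandas as pd

# Toy hits, NOT read from data/hits.h5; `time` is an assumed column name.
hits = pd.DataFrame({
    "dom_id": [1, 1, 1, 2, 2],
    "tot":    [20, 30, 25, 40, 10],
    "time":   [105.0, 100.0, 130.0, 90.0, 95.0],
})

# One row per DOM, one column per requested statistic.
summary = hits.groupby("dom_id").agg(
    tot_mean=("tot", "mean"),
    tot_min=("tot", "min"),
    tot_max=("tot", "max"),
    first_hit=("time", "min"),
)

print(summary)
```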

### Exercise: Create a histogram of all time differences between consecutive hits calculated on each DOM independently
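
A possible approach, sketched on invented hit times: sort the hits by time within each DOM and let `groupby().diff()` compute the difference to the previous hit of the same DOM. The column names `time` and `dt` are assumptions for this illustration.

```python
import pandas as pd

# Toy hits in arbitrary order; `time` (in ns) is an assumed column name.
hits = pd.DataFrame({
    "dom_id": [1, 2, 1, 2, 1],
    "time":   [100.0, 90.0, 130.0, 95.0, 105.0],
})

# Differences are only meaningful on time-ordered hits.
hits = hits.sort_values(["dom_id", "time"])

# The first hit of each DOM has no predecessor and gets NaN.
hits["dt"] = hits.groupby("dom_id")["time"].diff()

print(hits)
```

The resulting `dt` column can be histogrammed directly or filtered with a boolean mask such as `hits["dt"] < 50`.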

### Exercise: Examine the data for time differences below 50 ns

### Exercise: Examine the hits on each PMT of each DOM

Calculate the number of hits and the min/max/mean `tot` for each PMT on each DOM.
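
Grouping by two keys gives one row per PMT per DOM. In the sketch below the PMT identifier is called `channel_id`, which is an assumption; check the columns of the real dataset for the actual name. The values are invented.

```python
import pandas as pd

# Toy hits; `channel_id` as the PMT identifier is an assumed column name.
hits = pd.DataFrame({
    "dom_id":     [1, 1, 1, 2, 2],
    "channel_id": [0, 0, 3, 0, 0],
    "tot":        [20, 30, 25, 40, 10],
})

# The result carries a (dom_id, channel_id) MultiIndex.
per_pmt = hits.groupby(["dom_id", "channel_id"])["tot"].agg(
    ["count", "min", "max", "mean"]
)

print(per_pmt)
```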