# 1. Data Exploration and Significance Estimation, Crab Nebula

## 1.1. Context
In this first tutorial we will consider the brightest steady gamma-ray source in the sky: the Crab Nebula.   
Due to these properties, the source is considered a reference source in gamma-ray astronomy at very high energies (VHE, $E>100\,{\rm GeV}$).   
We will explore the gamma-ray data, focusing on the list of events (i.e. individual gamma-ray photon detections), and then learn how to estimate the significance of the gamma-ray signal coming from the source.


In [None]:
# - basic imports (numpy, astropy, regions, matplotlib)
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord, Angle
from regions import PointSkyRegion, CircleSkyRegion
import matplotlib.pyplot as plt
import logging
import warnings

# - Gammapy's imports
from gammapy.maps import Map, MapAxis, WcsGeom, RegionGeom
from gammapy.data import DataStore, Observation
from gammapy.datasets import SpectrumDataset, Datasets
from gammapy.estimators import FluxPointsEstimator
from gammapy.stats import wstat, WStatCountsStatistic

# - setting up logging and ignoring warnings
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
warnings.filterwarnings("ignore")

## 1.2. Data exploration

Let us first take a look at the content of the high-level data. `Gammapy`'s `DataStore` will allow us to load all the data in a directory at once, that is all the Crab Nebula data for this case. Each file corresponds to a batch of data acquisition of roughly 20 minutes, that we call "run", and is represented by `Gammapy`'s `Observation` class.
We require that certain properties of the instrument response function (IRF) are available: the maximum angular separation used for event selection (rad_max), the effective area (aeff), and the energy dispersion (edisp). See Sec. 1.4 for details on the latter two and Sec. 2.1.1 for the former.

In [None]:
datastore = DataStore.from_dir("../acme_magic_odas_data/data/CrabNebula")
observations = datastore.get_observations(required_irf=["rad_max", "aeff", "edisp"])
print(len(observations))

### 1.2.1. Event lists
We have 98 runs in our Crab Nebula data set. Each run, or `Observation`, has two essential components: an **Event List** and an **Instrument Response Function**. Let us focus on the first for this tutorial. The event list is a table that contains, for each event classified as a gamma ray, the information essential to perform astronomy: arrival time, sky coordinates, and energy. Let us look at the event list of the first observation.

In [None]:
events = observations[0].events
events.table

Each column in the event list is stored as an [Astropy Quantity](https://docs.astropy.org/en/stable/units/quantity.html), meaning that the values carry explicit physical units (e.g. `events.energy` is an array of energies with the unit `TeV` attached). This allows safe unit conversions, which will be used later when plotting.

Let us set aside the temporal information for the moment, and let us try to histogram the energy and coordinates to check their distribution.

In [None]:
# energy histogram
energy_bins = np.logspace(1, 4, 20) * u.GeV

fig, ax = plt.subplots()
ax.hist(
    events.energy.to("GeV"),
    bins=energy_bins,
)
ax.set_xlabel(r"$E\,/\,{\rm GeV}$")
ax.set_yscale("log")
ax.set_xscale("log")
plt.show()

We will see, in the next tutorial, how this type of histogram of events vs energy plays an important part in the spectrum estimation.   
Let us now quickly plot the coordinates of all the events.

In [None]:
# coordinates plot
fig, ax = plt.subplots()
ax.plot(
    events.radec.ra.degree,
    events.radec.dec.degree,
    marker=".",
    ls="",
    color="C0",
    markersize=0.6,
    label="events",
)
ax.set_xlabel(r"${\rm R.A.}\,/\,{\rm deg}$")
ax.set_ylabel(r"${\rm Dec}\,/\,{\rm deg}$")
ax.grid(ls="--")
ax.legend()
plt.show()

An excess of events appears around coordinates ${\rm R.A.} \sim 83.5^{\circ}$ and ${\rm Dec} \sim 22^{\circ}$, very close to the coordinates of the Crab Nebula. Let us then add the coordinates of the Crab Nebula from [Simbad](https://simbad.u-strasbg.fr/simbad/sim-id?Ident=Crab+Nebula) to the plot.

In [None]:
# - Crab Nebula Coordinates
crab_coords = SkyCoord.from_name("Crab Nebula")
fig, ax = plt.subplots()
ax.plot(
    events.radec.ra.degree,
    events.radec.dec.degree,
    marker=".",
    ls="",
    color="C0",
    markersize=0.6,
    label="events",
)
ax.plot(
    crab_coords.ra.degree,
    crab_coords.dec.degree,
    marker="*",
    color="goldenrod",
    markeredgecolor="k",
    ls="",
    label="Crab Nebula",
    markersize=10,
)
ax.set_xlabel(r"$R.A.\,/\,{\rm deg}$")
ax.set_ylabel(r"$Dec\,/\,{\rm deg}$")
ax.grid(ls="--")
ax.legend()
plt.show()

It appears that this _excess_ of events is indeed located around the source position. How could we estimate how many gamma rays are coming from the source?

## 1.3. Estimating the significance of a signal
Not all the events (i.e. not all the dots in the map) correspond to gamma rays emitted from the source. Most of them actually correspond to cosmic-ray-induced showers misidentified as gamma rays. Therefore, if we were simply to estimate the _source_ events by drawing a small region - say a circle - around the source coordinates and counting all the events that fall within it, we would be counting the cosmic-ray _background_ together with the gamma-ray _signal_.

One option to estimate this _background_ would be, for example, to choose a region far from the source and take a sample of events there. Where and how big does this background estimation region have to be? The answer is intrinsic to one of the most common observation modes of Imaging Air Cherenkov Telescopes (IACTs) such as MAGIC.

### 1.3.1. Wobble-mode pointing
The observation method known as _wobble mode_ consists in the telescopes tracking sky coordinates which are, for MAGIC, $0.4^{\circ}$ away from the nominal source position. This results in the source having a projected position in the camera plane $0.4^{\circ}$ from its centre, which facilitates the background estimation, as we shall see in a minute.

**A little note**: It is not adequate to plot celestial coordinates in a cartesian plane, as we did in the previous cell, because they are not Euclidean coordinates but coordinates on a sphere. Let us use `Gammapy` to handle the proper projection of coordinates on a plane. We will be creating a _sky map_, in particular a _count map_, that is, a 2D histogram of the coordinates of the events.
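
To get a feeling for the size of the distortion mentioned in the note, the following sketch compares the great-circle separation of two directions with the separation one would read off a cartesian (R.A., Dec) plot. The two directions are hypothetical, chosen at roughly the declination of the Crab Nebula; the haversine formula is used here only for illustration and is not how `Gammapy` handles projections.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a)))


# two hypothetical directions differing by 1 deg in right ascension only
dec = 22.0
true_sep = angular_separation(83.0, dec, 84.0, dec)
naive_sep = 1.0  # what a cartesian (R.A., Dec) plot would suggest

print(f"true separation  = {true_sep:.3f} deg")
print(f"naive separation = {naive_sep:.3f} deg")
print(f"cos(Dec)         = {math.cos(math.radians(dec)):.3f}")
```

At this declination a step in right ascension corresponds to a shorter arc on the sky, by a factor close to $\cos({\rm Dec})$, which is what a proper sky projection accounts for.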

In [None]:
# let us center the skymap on the pointing position
pointing_coords = observations[0].pointing.get_icrs()
# the geometry is the structure of the map, the binning of the 2D histogram
countmap_geom = WcsGeom.create(
    skydir=pointing_coords, binsz=0.05, width="4 deg", frame="icrs", proj="TAN"
)
counts = Map.from_geom(countmap_geom)
# let us fill the histogram (will automatically read the coordinates of the events)
counts.fill_events(events)
# now let us plot the histogram
ax = counts.plot(cmap="viridis", add_cbar=True)
wcs = counts.geom.wcs

# let us also add the pointing position
pointing = PointSkyRegion(pointing_coords)
pointing.to_pixel(wcs).plot(
    ax=ax, color="goldenrod", marker="+", markersize=15, label="pointing"
)

ax.legend()
ax.set_xlabel(r"$R.A.$")
ax.set_ylabel(r"$Dec$")

We can see that indeed the telescope was pointing at a slight offset from the source, visible from the yellowish hotspot in the histogram.   
In order to start to estimate our signal, let us create a small circular region enclosing the source - let us call it _on_ region - and let us count all the events within. As already explained, this counts both gamma and cosmic-ray events. To subtract those, we estimate them from another region, that we refer to as _off_ region, identical in size and symmetric to the _on_ with respect to the camera centre. Let us draw these two regions.

In [None]:
# radius of the ON and OFF regions
region_radius = 0.2 * u.deg

# ON region definition: centered around the source
on_region = CircleSkyRegion(center=crab_coords, radius=region_radius)

# the OFF region is symmetric to the on w.r.t. the camera centre
# - compute the pointing offset and the source angle
pointing_offset = pointing_coords.separation(crab_coords)
source_angle = pointing_coords.position_angle(crab_coords)
# - rotate the off by 180 degrees
off_region_angle = source_angle + 180 * u.deg
# - define the off
off_region_centre = pointing_coords.directional_offset_by(
    position_angle=off_region_angle, separation=pointing_offset
)
off_region = CircleSkyRegion(center=off_region_centre, radius=region_radius)

# replot counts and pointing and add the ON and OFF region
ax = counts.plot(cmap="viridis", add_cbar=True)
wcs = counts.geom.wcs

pointing.to_pixel(wcs).plot(
    ax=ax, color="goldenrod", marker="+", markersize=15, label="pointing"
)
on_region.to_pixel(wcs).plot(ax=ax, edgecolor="crimson", linewidth=2, label="on region")
off_region.to_pixel(wcs).plot(
    ax=ax, edgecolor="dodgerblue", linewidth=2, label="off region"
)

ax.legend()
ax.set_xlabel(r"$R.A.$")
ax.set_ylabel(r"$Dec$")

Gammapy offers a functionality to filter the events in a `SkyRegion`. Let us use it and count the events in the _on_ and _off_ regions.  
**_Question 1.1_**: Could you guess why we stated that the _off_ region had to be symmetric to the _on_ w.r.t. the camera centre?

In [None]:
on_events = events.select_region(on_region)
N_on = len(on_events.table)
off_events = events.select_region(off_region)
N_off = len(off_events.table)

print(f"N_on = {N_on}, N_off = {N_off}")

### 1.3.2. Computing the significance, the Li & Ma formula

What do we do with these two numbers? What we are really interested in determining is whether the signal is real, or if this data could plausibly come from fluctuations of the background. How should we proceed about it?

First of all, we assume that both the events in the _on_ and _off_ regions follow a Poisson distribution. If $g$, and $b$ represent the _expected number of signal and background events_, which we can’t observe directly but want to estimate, then we have

$$
P(g, b; N_{\rm on}) = \frac{(g + b)^{N_{\rm on}} \exp{[-(g + b)]}}{N_{\rm on}!},
\quad 
P(b; N_{\rm off}) = \frac{b^{N_{\rm off}} \exp{(-b)}}{N_{\rm off}!}.
\tag{1.1}
$$

If we estimate the background from more than one region, we can introduce the factor $\alpha$ which is the ratio of the exposure (might mean area or also observation time) of the _on_ region to the exposure of the _off_ region. In this case, $b$ denotes the expected number of background events in all _off_ regions combined, and $\alpha b$ gives the expected background in the _on_ region.
If instead of using a single _off_ region as we did in our example, we had considered $3$ other circular regions (always symmetric from the centre), then we would have $\alpha = 1/3$. In this case, we modify our probability for the _on_ events into
$$
P(g, b, \alpha; N_{\rm on}) = \frac{(g + \alpha b)^{N_{\rm on}} \exp{[-(g + \alpha b)]}}{N_{\rm on}!}.
\tag{1.2}
$$
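
The three-region case mentioned above can be sketched in a flat camera-plane approximation (an assumption made for simplicity; on the sky one would rotate around the pointing position as done earlier for the single _off_ region). The names `wobble_offset` and `n_off_regions` are illustrative only.

```python
import math

wobble_offset = 0.4  # deg, source offset from the camera centre (MAGIC value from the text)
region_radius = 0.2  # deg, same radius as the on region used earlier
n_off_regions = 3

# flat camera-plane approximation: camera centre at the origin, source on the x axis
on_centre = (wobble_offset, 0.0)

# rotate the on-region centre around the camera centre in equal angular steps
step = 2 * math.pi / (n_off_regions + 1)
off_centres = [
    (wobble_offset * math.cos(k * step), wobble_offset * math.sin(k * step))
    for k in range(1, n_off_regions + 1)
]

# equal-size regions at the same offset are assumed to have the same exposure
alpha = 1 / n_off_regions

# neighbouring regions do not overlap if their centres are more than two radii apart
min_distance = 2 * wobble_offset * math.sin(step / 2)

for x, y in off_centres:
    print(f"off centre: ({x:+.2f}, {y:+.2f}) deg")
print(f"alpha = {alpha:.3f}, distance between neighbouring centres = {min_distance:.3f} deg")
```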

From the two probabilities we can build the likelihood

$$
\mathcal{L}(g, b, \alpha \,|\, N_{\rm on}, N_{\rm off}) = P(g, b, \alpha; N_{\rm on}) \times P(b; N_{\rm off}).
\tag{1.3}
$$

For computations, especially for minimisation, it is common to use $-2$ times the log of this quantity

$$
-2 \log \mathcal{L}(g, b, \alpha \,|\, N_{\rm on}, N_{\rm off}) = 
2 \left[ g + ( 1 + \alpha) b - N_{\rm on} \log (g + \alpha b) - N_{\rm off} \log(b)  \right]
\tag{1.4}
$$

where we have removed constant terms (i.e. terms that do not depend on the parameters of interest $(g, b, \alpha)$).    

Let us suppose that in our observations we have measured $30$ events from $3$ _off_ regions and $20$ events in the _on_ region, and then plot the negative log-likelihood as a function of $g$ and $b$ to see where its minimum occurs.

In [None]:
n_on = 20
n_off = 30
alpha = 1 / 3


def minus_2_log_L(n_on, n_off, alpha, g, b):
    return 2 * (g + (1 + alpha) * b - n_on * np.log(g + alpha * b) - n_off * np.log(b))


g = np.linspace(0, 50, 51)
b = np.linspace(0, 50, 51)
b = b[:, np.newaxis]

stats = minus_2_log_L(n_on=n_on, n_off=n_off, alpha=alpha, g=g, b=b)

# plot the likelihood
fig, ax = plt.subplots()
CS = plt.contour(stats, 8, origin="lower")
ax.clabel(CS, fontsize=10)

# - find the values of g and b corresponding to the minimum
# and plot them
index_stat_min = np.unravel_index(stats.argmin(), stats.shape)
plt.plot(
    index_stat_min[1],
    index_stat_min[0],
    marker="+",
    markersize=12,
    color="k",
    label="minimum",
)

ax.legend()
ax.set_xlabel(r"$g$", fontsize=11)
ax.set_ylabel(r"$b$", fontsize=11)
ax.set_title(r"$N_{\rm on} = 20, \; N_{\rm off} = 30, \; \alpha = 1/3$")
plt.show()

As we can see, the negative log of the likelihood has a minimum at $\hat{b} = 30$, which corresponds to the number of _off_ events,   
and $\hat{g} = 10$, which corresponds to the number of estimated excess events
$N_{\rm ex} = N_{\rm on} - \alpha N_{\rm off} = 10$. 
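
The position of the minimum can also be checked without a grid. Setting both partial derivatives of Eq. 1.4 to zero gives $\hat{g} = N_{\rm on} - \alpha N_{\rm off}$ and $\hat{b} = N_{\rm off}$; the sketch below verifies numerically, with finite differences, that the gradient vanishes there for the example numbers.

```python
import math

n_on, n_off, alpha = 20, 30, 1 / 3


def minus_2_log_l(g, b):
    """Eq. 1.4, with the constant terms dropped."""
    return 2 * (
        g + (1 + alpha) * b - n_on * math.log(g + alpha * b) - n_off * math.log(b)
    )


# closed-form estimates from setting the partial derivatives of Eq. 1.4 to zero
g_hat = n_on - alpha * n_off
b_hat = n_off

# central finite differences approximate the gradient at (g_hat, b_hat)
eps = 1e-5
d_dg = (minus_2_log_l(g_hat + eps, b_hat) - minus_2_log_l(g_hat - eps, b_hat)) / (2 * eps)
d_db = (minus_2_log_l(g_hat, b_hat + eps) - minus_2_log_l(g_hat, b_hat - eps)) / (2 * eps)

print(f"g_hat = {g_hat:.2f}, b_hat = {b_hat:.2f}")
print(f"gradient at the estimate: ({d_dg:.1e}, {d_db:.1e})")
```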

But instead of just finding the best-fit values of $g$ and $b$ by minimising the log-likelihood, our real goal is to test if the signal (i.e. $g$) is significantly different from $0$, or if it could simply be produced from fluctuations of the background ($b$). In other words, we want to test the null hypothesis $H_0: \, g = 0$, against the alternative hypothesis $H_1: \, g \neq 0$. This can be done with the likelihood ratio test (LRT), referring to Eq. 1.4

$$
- 2 \log \left(
\frac{
    \mathcal{L}(g=0, \hat{b}_0, \alpha \,|\, N_{\rm on}, N_{\rm off})
  }{
    \mathcal{L}(\hat{g}, \hat{b}, \alpha \,|\, N_{\rm on}, N_{\rm off})
}
\right),
\tag{1.5}
$$

where $\hat{g}$ and $\hat{b}$ are the values maximising the likelihood, and $\hat{b}_0$ is the value maximising the likelihood under the null hypothesis ($g = 0$).
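
As a worked example, Eq. 1.5 can be evaluated directly from Eq. 1.4 for $N_{\rm on} = 20$, $N_{\rm off} = 30$, $\alpha = 1/3$. The sketch assumes the closed-form estimates $\hat{g} = N_{\rm on} - \alpha N_{\rm off}$, $\hat{b} = N_{\rm off}$ and, under the null hypothesis, $\hat{b}_0 = (N_{\rm on} + N_{\rm off}) / (1 + \alpha)$, which follows from setting the derivative of Eq. 1.4 with respect to $b$ to zero at $g = 0$.

```python
import math

n_on, n_off, alpha = 20, 30, 1 / 3


def minus_2_log_l(g, b):
    """Eq. 1.4, with the constant terms dropped."""
    return 2 * (
        g + (1 + alpha) * b - n_on * math.log(g + alpha * b) - n_off * math.log(b)
    )


# unconstrained maximum-likelihood estimates
g_hat = n_on - alpha * n_off
b_hat = n_off

# background estimate under the null hypothesis (g = 0)
b_hat_0 = (n_on + n_off) / (1 + alpha)

# Eq. 1.5: the dropped constants are identical in both terms and cancel
lrt = minus_2_log_l(0, b_hat_0) - minus_2_log_l(g_hat, b_hat)

print(f"b_hat_0 = {b_hat_0:.2f}")
print(f"LRT = {lrt:.2f}")
```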

This LRT, according to Wilks' theorem, follows a $\chi^2$ distribution with one degree of freedom ($\chi_1^2$). As the square of a normal variable is also distributed as a $\chi_1^2$, it is easy to "translate" the probability corresponding to the output of the LRT, to units of sigma (i.e. of std. deviations of a normal distribution). In particular, one just needs to take the square root of Eq. 1.5.
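
The "translation" described above can be sketched with the standard library alone. The helper names are illustrative; the p-value is the tail probability of a $\chi_1^2$ variable, which equals the two-sided tail of a normal distribution (some analyses quote a one-sided value instead, which is half of it).

```python
import math


def ts_to_sigma(ts):
    """Square root of the LRT value, i.e. the significance in units of sigma."""
    return math.sqrt(ts)


def ts_to_p_value(ts):
    """Survival function of a chi-square with one degree of freedom, P(chi2_1 > ts)."""
    return math.erfc(math.sqrt(ts / 2))


for ts in (1.0, 9.0, 25.0):
    print(f"LRT = {ts:5.1f} -> {ts_to_sigma(ts):.1f} sigma, p-value = {ts_to_p_value(ts):.2e}")
```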

Eq. 17 from [Li & Ma (1983)](https://ui.adsabs.harvard.edu/abs/1983ApJ...272..317L/abstract) returns the square root of the LRT in Eq. 1.5. It is the formula adopted in VHE astrophysics to claim the significance of a signal. The statistic in Eq. 1.4 is handled by `Gammapy`'s [`WStatCountsStatistic`](https://docs.gammapy.org/dev/api/gammapy.stats.WStatCountsStatistic.html#gammapy.stats.WStatCountsStatistic).

In [None]:
statistics = WStatCountsStatistic(n_on=20, n_off=30, alpha=1 / 3)
# we can compute both the LRT
print(f"LRT = {statistics.ts:.2f}")
# and the corresponding probability in units of sigma
print(f"{statistics.sqrt_ts:.2f} sigma")
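
As a cross-check that does not rely on `Gammapy`, Eq. 17 of Li & Ma can be written out explicitly. The function name `li_ma_significance` is illustrative; the sketch assumes $N_{\rm on} > 0$ and $N_{\rm off} > 0$ and does not keep track of the sign of the excess.

```python
import math


def li_ma_significance(n_on, n_off, alpha):
    """Li & Ma (1983), Eq. 17; requires n_on > 0 and n_off > 0."""
    n_tot = n_on + n_off
    term_on = n_on * math.log((1 + alpha) / alpha * n_on / n_tot)
    term_off = n_off * math.log((1 + alpha) * n_off / n_tot)
    return math.sqrt(2 * (term_on + term_off))


significance = li_ma_significance(n_on=20, n_off=30, alpha=1 / 3)
print(f"LRT = {significance**2:.2f}")
print(f"{significance:.2f} sigma")
```

If the two approaches are equivalent, as stated above, these numbers should agree with the ones printed by the previous cell.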

We see that, in the hypothetical case $N_{\rm on} = 20, \; N_{\rm off} = 30, \; \alpha = 1/3$, we do not have enough evidence to reject the null hypothesis (no signal): $5$ sigma is the threshold commonly required to do so. But what about the actual numbers we found in our Crab Nebula observation?

In [None]:
statistics = WStatCountsStatistic(n_on=N_on, n_off=N_off, alpha=1)
# we can compute both the LRT
print(f"LRT = {statistics.ts:.2f}")
# and the corresponding probability in units of sigma
print(f"{statistics.sqrt_ts:.2f} sigma")

For the Crab Nebula observation we can conclude that the null hypothesis is statistically unlikely. For a source as bright as the Crab Nebula, even with just $20$ minutes of observations, we get a significant signal.

### 1.3.3. $\theta^2$ histogram

Another common method to visualise the presence of a signal, and to compute its statistical significance, is to build a $\theta^2$ histogram. Instead of simply counting the events in the _on_ and _off_ regions, we make a histogram of their squared angular distances from the region centres: the source position for _on_, and the centre of the _off_ region for _off_. To determine $N_{\rm on}$ and $N_{\rm off}$, we sum the bin contents within a cut in $\theta^2$ equal to the squared radius of the _on_ region. With this method one can nicely visualise the presence of the signal in the _on_ region, as the counts steadily increase towards zero, that is towards the centre of the _on_ region, which coincides with the source position.

Let us build the $\theta^2$ distribution, the squared angular distance between the event directions and the region centers.
By plotting the histograms for the _on_ and _off_ regions, and their difference, we can see how the signal stands out above the background.


In [None]:
# build the theta2 distances
# - compute the distances
on_offsets = on_region.center.separation(events.radec)
off_offsets = off_region.center.separation(events.radec)

# - theta2 bins
theta2_bins_edges = np.linspace(0, 0.4, 41) * u.Unit("deg2")
theta2_bins = 0.5 * (theta2_bins_edges[:-1] + theta2_bins_edges[1:])

fig, ax = plt.subplots(1, 2, sharey=True, figsize=(10, 5), gridspec_kw={"wspace": 0.05})

# on and off histograms
on_cts, _, __ = ax[0].hist(
    on_offsets**2,
    bins=theta2_bins_edges,
    histtype="step",
    color="crimson",
    label="on events",
)
off_cts, _, __ = ax[0].hist(
    off_offsets**2,
    bins=theta2_bins_edges,
    histtype="step",
    color="dodgerblue",
    label="off events",
)
ax[0].axvline((region_radius**2).to_value("deg2"), ls="--", color="gray")
ax[0].text(
    0.045, 130, r"$\theta^2$ cut (squared size of the on/off region)", rotation=90
)
ax[0].set_xlabel(r"$\theta^2\,/\,{\rm deg}^2$")
ax[0].set_ylabel("counts")
ax[0].set_ylim([0, 500])
ax[0].legend()

# excess histogram
ex_cts, _, __ = ax[1].hist(
    theta2_bins_edges[:-1],
    theta2_bins_edges,
    weights=(on_cts - off_cts),
    histtype="step",
    color="k",
    label="(on - off) counts",
)
ax[1].axvline((region_radius**2).to_value("deg2"), ls="--", color="gray")
ax[1].text(
    0.045, 130, r"$\theta^2$ cut (squared size of the on/off region)", rotation=90
)
ax[1].set_xlabel(r"$\theta^2\,/\,{\rm deg}^2$")
ax[1].legend()

plt.show()

**Question 1.2**: Why a histogram of the distances **squared**?

Let us check that, by counting the events within the $\theta^2$ cut, we obtain the same numbers that we obtained with the `select_region` method applied to the `Gammapy` event list.

In [None]:
n_on = np.sum(on_cts[theta2_bins <= region_radius**2])
n_off = np.sum(off_cts[theta2_bins <= region_radius**2])

print(f"N_on = {n_on}, N_off = {n_off}")

### Exercise 1.1.

Create a $\theta^2$ plot using **all** the 98 observations in the Crab Nebula data set. What is the total significance of the signal?


### Exercise 1.2.
It might be useful to determine whether our source has significant emission above a certain energy.   
The presence of gamma rays above a given energy can indeed directly be related to a certain energy of the parent cosmic rays. For example, in certain physical scenarios, the gamma-ray energy is approximately a factor 10 lower than the parent cosmic-ray energy. Thus, a source with gamma-ray emission up to $10\,{\rm TeV}$, would be accelerating cosmic rays to hundreds of ${\rm TeV}$.

Using **all** the 98 observations in the Crab Nebula data set, compute the significance of the emission above $1$ and $10$ ${\rm TeV}$.

## 1.4. Instrument response function 

The second important component of each `Observation` is the instrument response function (IRF).    
This is a parametrisation of the response of the system to incoming gamma rays, expressed as a function of energy and coordinates.   
The IRF is factored in _components_, representing the collection area and the probability distributions of the energy and direction estimators. They are built from Monte Carlo (MC) simulations, using histograms of simulated events after applying the same selection cuts as for the event list selection.

If we are observing a small area of the sky, the spatial dependence of the IRF is more conveniently expressed as a function of the detector coordinates (offset from the camera centre) rather than in sky coordinates.

Let us start by examining the _effective area_. We can see that there is no offset dependency. This is because most MAGIC observations occur with the source at $0.4^{\circ}$ offset, and the MC events are simulated accordingly.

In [None]:
observations[0].aeff.plot()
plt.show()

As we only have energy dependence, we can then plot this energy dependence for each of the four offsets, which are identical in this case.

In [None]:
observations[0].aeff.plot_energy_dependence()
plt.yscale("log")
plt.show()

The other important component is the _energy dispersion_, which represents the probability density function (PDF) of the energy estimator.

In [None]:
observations[0].edisp.to_edisp_kernel(offset=Angle("0.4 deg")).plot_matrix()
plt.show()

Given an event with a certain true energy ($x$ axis), we can visualise what is the most probable value at which its energy will be estimated ($y$ axis). We will see in detail in the next tutorial the role that the IRF plays in the spectrum estimation.
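
To make the idea of a migration matrix concrete, here is a toy sketch in which a Gaussian in $\log_{10}(E_{\rm est} / E_{\rm true})$ redistributes counts from true-energy bins to estimated-energy bins. Everything in it (energy axis, $0.1$ dex resolution, input spectrum) is made up for illustration and is not the MAGIC response; normalising each row to one also ignores events that would migrate outside the axis range.

```python
import numpy as np

# hypothetical log-spaced energy axis, shared by true and estimated energy
edges = np.logspace(-1, 1, 9)  # TeV, 8 bins
centres = np.sqrt(edges[:-1] * edges[1:])

# toy migration matrix: rows are true-energy bins, columns estimated-energy bins
resolution = 0.1  # width of the Gaussian in log10(E_est / E_true), an assumed value
log_ratio = np.log10(centres[np.newaxis, :] / centres[:, np.newaxis])
matrix = np.exp(-0.5 * (log_ratio / resolution) ** 2)
matrix /= matrix.sum(axis=1, keepdims=True)  # each row is a probability distribution

# steeply falling toy distribution of counts in true energy
true_counts = 1000 * centres**-2.0

# counts in estimated energy: sum over true-energy bins weighted by the matrix
estimated_counts = true_counts @ matrix

for e, n_true, n_est in zip(centres, true_counts, estimated_counts):
    print(f"E = {e:6.2f} TeV: true = {n_true:10.1f}, estimated = {n_est:10.1f}")
```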

## 1.5. Before moving to the next tutorial
Let us plot a histogram for the energies of the events in the _on_ and _off_ regions.

In [None]:
fig, ax = plt.subplots()
ax.hist(
    on_events.energy.to("GeV"),
    bins=energy_bins,
    histtype="step",
    color="crimson",
    label="on events",
)
ax.hist(
    off_events.energy.to("GeV"),
    bins=energy_bins,
    histtype="step",
    color="dodgerblue",
    label="off events",
)
ax.set_xlabel(r"$E\,/\,{\rm GeV}$")
ax.set_yscale("log")
ax.set_xscale("log")
ax.legend()
plt.show()

### Homework 1.1.
Could you make a similar histogram using **all** the 98 observations in the Crab Nebula data set?