# Estimating the lifetime of a muon

First, we import the required modules and libraries.

In [None]:
# Import standard libraries
import sys
from pathlib import Path

# Add the project's src directory to the module search path
project_root = Path("..").resolve()
sys.path.append(str(project_root / "src"))

# Import custom modules
from preprocessing import get_data
from preprocessing import remove_anom_pmt
from preprocessing import format_data
from preprocessing import average_decay_by_event
from preprocessing import save_data

Then we can load our dataset and take our first look at it. 

Note that the rounding shown by the `.head()` method affects only how the values are displayed; the full-precision values are still held in memory.

In [None]:
df = get_data()

df.head()
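As a quick check of the point about display precision, the sketch below compares the printed form of a value with the value actually stored. The one-column frame and its value are purely illustrative and are not taken from the dataset; the display precision is set explicitly to 6, which is assumed to match the pandas default.

```python
import pandas as pd

# Hypothetical one-column frame; the value is illustrative, not from the dataset.
demo = pd.DataFrame({"time": [0.123456789012]})

# Render the frame as text with six decimal places, as .head() would display it.
with pd.option_context("display.precision", 6):
    shown = repr(demo.head())

# Read the same value back from memory.
stored = demo["time"].iloc[0]

print(shown)   # rounded for display
print(stored)  # full stored value
```

Only the text representation is rounded; any later arithmetic uses the stored value.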

Next we can inspect the datatypes to see if any are incorrect.

In [None]:
df.dtypes

Although there are no immediate errors, `channel_numb` is simply an identifier, so we can convert it to the `object` dtype instead.

In [None]:
df["channel_numb"] = df["channel_numb"].astype(object, copy=False)

We can now look over basic descriptive statistics to identify any possible errors in values.

Although `julian_day` is just a label, it is used in later calculations, so it must remain an integer.

In [None]:
df.describe()

As our data has passed the basic 'sanity checks', we can start to clean the dataset by removing NaN values.

In [None]:
nan_counts = df.isna().sum()
print(nan_counts)

As there are no NaN values, we can move on and check for duplicated rows.

In [None]:
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)


We can clearly see that all the data is unique. 

Next we must consider any impossible values. According to the data's documentation, all readings are presented in the form M.N, where M is the run and N is the PMT channel. The latter should only have values of 1 or 2, depending on the specific PMT used.

In [None]:
unique_values = df["channel_numb"].unique()
print(f"Unique values of 'channel_numb': {unique_values}")

We can see that there are some irregular channel numbers, since a value such as 6234.4 should not exist, so we can remove these rows without further consideration.

In [None]:
df = remove_anom_pmt(df)

unique_values = df["channel_numb"].unique()
print(f"Unique values of 'channel_numb': {unique_values}")

Now we can see that all erroneous data values have been removed. 
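The implementation of `remove_anom_pmt` is not shown in this notebook. As a rough sketch of one way such a filter could work, the function below extracts the digit after the decimal point from each M.N value and keeps only rows where it is 1 or 2. The function name `keep_valid_channels`, the single-digit channel assumption, and the sample values are all hypothetical.

```python
import pandas as pd

def keep_valid_channels(df, column="channel_numb", valid=(1, 2)):
    """Keep rows whose M.N identifier has a channel part N listed in `valid`.

    Assumes N is a single digit stored as the first decimal place.
    """
    values = df[column].astype(float)
    # Fractional part times ten, rounded to absorb floating-point error.
    channel = ((values % 1) * 10).round().astype(int)
    return df[channel.isin(list(valid))]

# Illustrative values only: three valid readings and one irregular channel.
demo = pd.DataFrame({"channel_numb": [6234.1, 6234.2, 6234.4, 6235.1]})
cleaned = keep_valid_channels(demo)
print(cleaned)
```

Rounding before the integer conversion matters here, because a value such as 6234.2 is not stored exactly and its fractional part comes out slightly below 0.2.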

Now that the data has been cleaned, we can do some basic feature engineering: converting times from unusable fractions of a Julian day into more standard time units, and calculating the possible decay time.

In [None]:

df = format_data(df)
df.head()
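The unit conversion itself is simple arithmetic: one day is 86,400 seconds, so a fraction of a day is multiplied by that factor. The helper names below are hypothetical and only illustrate the conversion; they are not the functions used inside `format_data`, whose implementation is not shown here.

```python
SECONDS_PER_DAY = 86_400  # 24 h * 60 min * 60 s

def day_fraction_to_seconds(fraction):
    """Convert a fraction of a day into seconds."""
    return fraction * SECONDS_PER_DAY

def day_fraction_to_microseconds(fraction):
    """Convert a fraction of a day into microseconds."""
    return day_fraction_to_seconds(fraction) * 1e6

# Half a day and a quarter of a day, expressed in seconds.
print(day_fraction_to_seconds(0.5))
print(day_fraction_to_seconds(0.25))
```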

Although it may initially appear that the data is now clean, a closer look shows that each decay is detected by both PMTs, so these results should be aggregated to reduce noise. Furthermore, some PMTs fire twice, leading to erroneous results, as can be seen at index 18 when compared with the pairs of decays at indices 16 and 17, and 19 and 20.

In [None]:
df[16:21]

Events recorded by different detectors were associated using a coincidence window, Δt, accounting for digitisation and clock jitter. Measurements within each coincidence group were cleaned using a median-based outlier rejection before averaging to reduce detector noise.

In [None]:
df_old = df.copy()
df = average_decay_by_event(df)
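The internals of `average_decay_by_event` are not shown in this notebook. The sketch below illustrates the general idea under stated assumptions: records are `(event_time, decay_time)` pairs, a new group starts whenever the gap to the previous record exceeds the window, and values further than three median absolute deviations from the group median are rejected before averaging. The function names, the threshold of three, the window of 0.01, and the sample records are all hypothetical.

```python
from statistics import mean, median

def group_by_coincidence(records, window):
    """Group (event_time, decay_time) records whose times lie within `window`
    of the previous record in the same group."""
    groups = []
    for record in sorted(records, key=lambda r: r[0]):
        if groups and record[0] - groups[-1][-1][0] <= window:
            groups[-1].append(record)
        else:
            groups.append([record])
    return groups

def robust_mean(values, n_mads=3.0):
    """Average the values after dropping those more than `n_mads` median
    absolute deviations away from the median."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    kept = [v for v in values if abs(v - centre) <= n_mads * mad]
    return mean(kept)

# Illustrative records only: two events, the second with one spurious reading.
records = [(0.000, 2.1), (0.001, 2.3), (5.000, 1.0), (5.001, 1.2), (5.002, 9.0)]
averaged = [
    robust_mean([decay for _, decay in group])
    for group in group_by_coincidence(records, window=0.01)
]
print(averaged)
```

Grouping by the gap to the previous record lets a group chain over several closely spaced readings. That suits a double-firing PMT, but it could merge genuinely separate events if they arrive less than one window apart.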

Now we can observe the noise reduction by comparing the distributions of decay time. 

In [None]:
#comparison histograms
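The comparison cell is left as a placeholder. One possible starting point, sketched below, is to bin the decay times before and after aggregation on the same bin edges so that the two histograms are directly comparable. The function name `shared_bin_histograms` and the sample values are hypothetical, and the decay-time column name in `df_old` and `df` is not assumed here.

```python
import numpy as np

def shared_bin_histograms(before, after, bins=10):
    """Histogram two samples on common bin edges so their counts line up."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    # Derive the edges from both samples together so neither is clipped.
    edges = np.histogram_bin_edges(np.concatenate([before, after]), bins=bins)
    counts_before, _ = np.histogram(before, bins=edges)
    counts_after, _ = np.histogram(after, bins=edges)
    return edges, counts_before, counts_after

# Illustrative values only, not taken from the dataset.
edges, counts_before, counts_after = shared_bin_histograms(
    [1.0, 1.2, 2.1, 2.3, 9.0], [1.1, 2.2], bins=4
)
print(edges, counts_before, counts_after)
```

The returned edges and counts can then be passed to whichever plotting library is preferred to draw the two distributions side by side.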

This final dataframe can be saved so that it can be processed more easily in later analyses.

In [None]:
# If the data needs to be saved, uncomment the following line and run:

# save_data(df)