## 0. Notes based on common homework issues
* When turning in Jupyter notebooks by committing them to a git repository, please make sure you have run every cell in your notebook and then saved the state of the notebook. This will allow us to view your notebook output through GitHub and it will ensure that all of your cells are working as expected prior to final submission.

## 0.a. Git repository clean-up

* Make sure both your Lab 1 Jupyter notebook and Python script are committed and pushed to the branch you made the first week.
* Make sure your homework from last week is committed and pushed to GitHub.

## 0.b. We will now each create a private repository through GitHub

* Go to your GitHub profile page.
* Click on Repositories
* Click on "New"
* Give your new repository the name E11Homework (or something similar)
* IMPORTANT: Check "Private"
* Check "Initialize this repository with a README"
* Click "Create repository"
* Go to "Settings"
* Click on "Collaborators" on the left side-bar (you will be prompted for your GitHub password)
* Add me (alihanks) and Chris (cllamb0) as collaborators so that we can view your future homework submissions.


## 0.c. Now clone this new repository locally
* Click on "Clone or Download" and copy the HTTPS url for your new repository 
* At your GitBash or Terminal command line:
```
$ cd ~
$ git clone <url-to-your-repository>
``` 
(paste the url after git clone)
* Open Jupyter notebook and navigate to your new repository folder (directory)
  * This should be a folder in your home directory - where you start when you start Jupyter
* Create a new notebook and name it "Lab3-Activity"

# Homework Extra Credit:
1. Correct your lab 1 activity notebook based on the solutions posted here: https://github.com/engineering-11/Activities/blob/master/Results/Lab%201%20Activity-Solutions.ipynb
2. Move your corrected notebook, the python script you created last week, and the homework you submitted today to your private repository and add/commit everything to that repository.

# We are ready to start the lab!

## 1. Getting your data

We have a Raspberry Pi computer running as a webserver that is hosting the data we will be using today. The first step is to download that data.

* Over WiFi connect to `RPiTouchServer`. The WPA2 password will be provided in lab.
* Make a folder for keeping data - at the command-line prompt:
```
$ cd ~
$ mkdir E11data
```

* Copy the data from the RPi server to this folder - at your command-line prompt:
```
$ scp pi@192.168.4.1:~/data/Inside* ~/E11data
```

* This step will prompt you for the password of the `pi` user on that server. This will be provided in lab.

## 2. Prepare your Jupyter Notebook

Import the Python libraries we will need for this activity.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math

## 3. Read in your data csv (comma-separated values) file using pandas

In [None]:
data = pd.read_csv("~/E11data/Inside_p1_g3_2019-09-17_2s_D3S.csv")

Take a look at the output. Each row in this file is a list of 4096 values, one per column. Each value is the number of counts (interactions in the detector) measured at the energy corresponding to that column (0-4095), and each row holds the data from one collection interval.

In [None]:
data

In [None]:
# Put this pandas DataFrame into a numpy array so that we can perform fast manipulations of the data
spectra = np.array(data)

### 3.a. Let's take a look at the integrated spectrum
By this I mean we will sum over all rows in our array to see the total counts across the full energy range recorded.

In [None]:
spectrum = spectra.sum(axis=0)
plt.plot(spectrum)
plt.show()

### 3.b. Let's try that as a log plot

In [None]:
plt.plot(spectrum)
plt.yscale("log")
plt.show()

### 3.c. Try zooming in on the counts between columns 1000 and 2000 (x-axis range)

In [None]:
plt.plot(spectrum[1000:2000])
#plt.yscale("log")
plt.show()

### Why are these points fluctuating so much and is this what we expect?

* We can think of each column as a bin in a histogram
* To improve these fluctuations we need more counts in each bin
* We could do this by taking more data but we can also do it now by combining counts in adjacent bins - rebinning

In [None]:
print(spectrum.shape)
# 4096 bins / 4 = 1024 groups, so each row of the resized array holds 4 adjacent bins
spectrum_resize = np.resize(spectrum,(1024,4))
print(spectrum_resize.shape)

In [None]:
# We've split our 4096 bins into 1024 groups of 4 adjacent bins, we now just need to sum each group
spectrum_rebin = spectrum_resize.sum(axis=1)
print(spectrum_rebin.shape)
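
The rebinning step is easier to check on a tiny array. The sketch below uses a made-up 12-bin spectrum (`toy_spectrum` is not part of the lab data) and `reshape` instead of `np.resize`. That swap is a choice made here, not the lab's method: `np.resize` silently repeats or drops values when the sizes do not match, while `reshape` raises an error.

```python
import numpy as np

# Toy spectrum with 12 bins; the values are made up for illustration
toy_spectrum = np.arange(12)

# Group adjacent bins in fours: each row holds 4 neighbouring bins.
# reshape raises an error if the length is not divisible by the rebin factor.
rebin_factor = 4
grouped = toy_spectrum.reshape(-1, rebin_factor)
print(grouped)

# Summing along axis=1 adds the 4 neighbours in each row together
toy_rebin = grouped.sum(axis=1)
print(toy_rebin)  # [ 6 22 38]
```

The total number of counts is unchanged by rebinning, which is a quick sanity check you can run on the real spectrum as well.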

In [None]:
plt.plot(spectrum_rebin)
plt.yscale("log")
plt.show()

### If we now want to zoom in on the same section we looked at in the initial spectrum, what index range do we want to look at?

We've rebinned by 4, so the array elements that were previously ranging from 1000 - 2000 are now ranging from 250 - 500

In [None]:
plt.plot(spectrum_rebin[250:500])
#plt.yscale("log")
plt.show()
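
As a quick check of that index conversion, the sketch below builds a made-up 4096-bin array (random values, not the lab data) and confirms that original bins 1000 - 2000 hold the same counts as rebinned bins 250 - 500.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.integers(0, 50, size=4096)          # made-up counts in 4096 bins
toy_rebin = toy.reshape(-1, 4).sum(axis=1)    # rebin by a factor of 4

lo, hi = 1000, 2000
# Integer division by the rebin factor converts original indices to rebinned ones
same_counts = toy[lo:hi].sum() == toy_rebin[lo // 4 : hi // 4].sum()
print(lo // 4, hi // 4)  # 250 500
print(same_counts)       # True
```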

## 4. Now we will look at total counts

* This involves summing the original 2D array across the other axis 
* So instead of summing over all rows, we will sum across all columns for each row and get the total counts within each time interval over which data was collected
  * that is, the data that goes into each row

In [None]:
counts = spectra.sum(axis=1)
counts

In [None]:
plt.plot(counts)
plt.show()

### Each entry in `counts` contains the total number of counts, `N`, collected over a 2s interval, what is the uncertainty on that number?

Let's zoom in again

In [None]:
plt.plot(counts[100:200])
plt.show()

Do these fluctuations make sense? Based on our answer above about the uncertainty on the number of counts, we can plot this distribution with error bars.

In [None]:
x = range(len(counts[100:200]))
plt.errorbar(x,counts[100:200],np.sqrt(counts[100:200]))
plt.show()
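
The error bars above use $\sqrt{N}$ as the uncertainty on each count. A minimal simulation can make that plausible, assuming the counts follow Poisson statistics with some fixed mean (the value 40 below is an arbitrary assumption, not taken from the lab data): the spread of many simulated counts comes out close to the square root of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 40  # assumed mean counts per interval (illustrative only)

# Simulate many independent counting measurements
sim_counts = rng.poisson(true_mean, size=100_000)

print(sim_counts.std())     # spread of the simulated counts
print(np.sqrt(true_mean))   # sqrt(N) prediction, about 6.32
```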

### How would we decrease these error bars without adding new data?

Just like we did for the spectrum, we can combine adjacent values, or "bins", to rebin this distribution - putting more counts in each entry.

In [None]:
print(counts.shape)
counts_resize = np.resize(counts,(int(counts.shape[0]/4),4))
print(counts_resize.shape)
counts_rebin = counts_resize.sum(axis=1)
print(counts_rebin.shape)
x = range(len(counts_rebin[25:50]))
plt.errorbar(x,counts_rebin[25:50],np.sqrt(counts_rebin[25:50]))
plt.show()

In this data, each entry in our array of counts is from a 2 second data collection interval. So what we are looking at is the total counts per 2s interval. If we rebin this by a factor of 4, like we did with the spectrum, what does each entry in our new array of counts represent?

We took our 2s intervals and combined them in 4 interval chunks, so now each entry corresponds to 2x4 = 8 seconds of data collection.
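
The reason the error bars look smaller after rebinning is that the relative uncertainty, $\sqrt{N}/N$, shrinks as $N$ grows. The sketch below works this through for an assumed typical value of 40 counts per 2s interval (an illustrative number, not measured from the data).

```python
import numpy as np

n_2s = 40.0        # assumed typical counts in one 2 s interval
n_8s = 4 * n_2s    # four intervals combined into one 8 s entry

rel_2s = np.sqrt(n_2s) / n_2s   # relative uncertainty before rebinning
rel_8s = np.sqrt(n_8s) / n_8s   # relative uncertainty after rebinning

print(rel_2s, rel_8s)
print(rel_2s / rel_8s)  # approximately 2, i.e. sqrt(4)
```

The absolute error bar grows (from $\sqrt{40}$ to $\sqrt{160}$), but relative to the value it is attached to it has halved.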

### 4.a. Make a histogram of this data

To start with, determine the total range of values in counts. Use that range to set the number of bins so that the bin-width (range of each bin) is 1 count. So a bin would go from 5-6, or 6-7, etc.

In [None]:
xmin = np.min(counts)
xmax = np.max(counts)
print(xmin)
print(xmax)
nbins = xmax-xmin+1

In [None]:
plt.hist(counts,bins=nbins)
plt.show()
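
One subtlety: when `bins` is an integer, the range from the minimum to the maximum value is divided into that many equal bins, so with `nbins = xmax-xmin+1` each bin is slightly narrower than 1 count. A possible alternative, sketched here on made-up integer counts, is to pass explicit bin edges centred on each integer. The name `edges` is introduced only for this illustration.

```python
import numpy as np

toy_counts = np.array([5, 6, 6, 7, 7, 7, 9])  # made-up integer counts

# Edges at 4.5, 5.5, ..., 9.5 so that every integer sits in the middle of its own bin
edges = np.arange(toy_counts.min(), toy_counts.max() + 2) - 0.5
hist, _ = np.histogram(toy_counts, bins=edges)

print(edges)
print(hist)  # [1 2 3 0 1]
```

The same `edges` array can be passed to `plt.hist` through its `bins` argument.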

### 4.b. Determine the mean and standard deviation and plot the corresponding Gaussian (Normal) distribution

For this, recall that the functional form for the Gaussian distribution is:

$ \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Where $\mu = \overline x$ - the mean of the distribution, and $\sigma$ is the standard deviation.

We can rely on a python package that has predefined the normal distribution, but we will also go through making our own, which has the advantage of being able to control the normalization.

In [None]:
def gaussian(x, mu, sigma):
    func = 1/(sigma*math.sqrt(2*math.pi))*np.exp(-np.power((x-mu),2)/(2*math.pow(sigma,2)))
    return func

In [None]:
mu = np.mean(counts)
sigma = np.std(counts)

plt.hist(counts,bins=nbins)
x = np.arange(xmin, xmax, 1)
plt.plot(x, gaussian(x, mu, sigma))
plt.show()

### Why can't we see the Gaussian distribution? Try graphing it by itself

In [None]:
x = np.arange(xmin, xmax, 1)
plt.plot(x, gaussian(x, mu, sigma))
plt.show()

Let's also try the pre-defined method to see how that works

In [None]:
# We need to import the library where this predefined function is created
from scipy.stats import norm

x = np.arange(xmin, xmax, 1)
plt.plot(x, norm.pdf(x, mu, sigma))
plt.show()

This Gaussian distribution is a probability distribution, which means it is normalized so that $\Sigma(P(x)) = 1$.

We need to normalize our histogram as well! How do we do that?

In [None]:
plt.hist(counts,bins=nbins,density=True)
x = np.arange(xmin, xmax, 1)
plt.plot(x, gaussian(x, mu, sigma))
plt.show()
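
To see what `density=True` is doing, the sketch below reproduces the normalization by hand with `np.histogram` on simulated counts (Poisson-distributed with an assumed mean of 40, not the lab data): each bin is divided by the total number of entries times the bin width.

```python
import numpy as np

rng = np.random.default_rng(2)
toy_counts = rng.poisson(40, size=5000)  # simulated counts, illustrative only

edges = np.arange(toy_counts.min(), toy_counts.max() + 2) - 0.5
widths = np.diff(edges)

# Manual normalization: raw counts / (total entries * bin width)
raw, _ = np.histogram(toy_counts, bins=edges)
manual_density = raw / (raw.sum() * widths)

# Built-in normalization
dens, _ = np.histogram(toy_counts, bins=edges, density=True)

print(np.allclose(manual_density, dens))  # True
print((dens * widths).sum())              # 1.0 up to rounding
```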

### So is a Normal distribution the right distribution to describe this data?

What is the average number of counts?

In [None]:
print(np.mean(counts))

At what mean value of our measured quantity did we say that the Gaussian distribution became applicable?

Let's try Poisson and see if it does better. Recall that the functional form for the Poisson distribution is:

$ \frac{\mu^{x}e^{-\mu}}{x!}$

again $\mu = \overline x$ - the mean value of the distribution.

In [None]:
# For the predefined Poisson distribution function we need to import the relevant library
from scipy.stats import poisson

# Because we imported an object called poisson, we need to call our own function something else
# We also need a way to take factorials of arrays, and we find that here
from scipy.special import factorial

# Now we're ready to make our function
def my_poisson(x, mu):
    func = np.power(mu,x)*np.exp(-1*mu)/factorial(x)
    return func

In [None]:
plt.hist(counts,bins=nbins,density=True)
x = np.arange(xmin, xmax, 1)
plt.plot(x, poisson.pmf(x, mu)) # This is the predefined way to get a Poisson distribution
plt.plot(x, my_poisson(x, mu))
plt.show()
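
The direct formula works for the count levels in this lab, but $\mu^x$ and $x!$ both overflow floating-point numbers once the counts get large. One common workaround, sketched below as a hypothetical helper `log_poisson` (not part of the lab code), is to evaluate the same expression in logarithms, using $\ln(x!) = \ln\Gamma(x+1)$.

```python
import math
import numpy as np

def log_poisson(x, mu):
    """Poisson probability evaluated via logarithms (hypothetical helper)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # lgamma(x + 1) is ln(x!), which stays small even when x! itself overflows
    log_fact = np.array([math.lgamma(v + 1.0) for v in x])
    return np.exp(x * np.log(mu) - mu - log_fact)

# A mean of 1000 is an arbitrary value chosen to be too large for the direct formula
x_vals = np.arange(0, 2001)
probs = log_poisson(x_vals, 1000.0)

print(np.all(np.isfinite(probs)))
print(probs.sum())  # close to 1
```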

### NOTE: next week we'll discuss how we might compare which of these distributions is doing a better job describing the data more quantitatively

## 5. Now back to that spectrum to study average "energy"

Let's remind ourselves what the total energy spectrum looks like first. Plot it again.

In [None]:
plt.plot(spectrum_rebin)
plt.yscale("log")
plt.show()

### 5.a. What's the average energy measured across this entire data set?

Recall from class that we can calculate an average value for some measurement, x, as $\overline x = \left<x\right> = \Sigma xF(x)$

This holds for determining the expectation value for any variable x described by the frequency distribution F(x), provided we've defined F(x) such that $\Sigma F(x) = 1$.

Our energy spectrum above provides the frequency with which counts at a particular energy occur. We just need to normalize this distribution to sum to one.

In [None]:
spectrum_integral = spectrum_rebin.sum()
spectrum_norm = spectrum_rebin/spectrum_integral

In [None]:
plt.plot(spectrum_norm)
plt.yscale("log")
plt.show()

In [None]:
# We can do this using a for loop to iterate over each index (x value) in our spectrum array
mean_x = 0
for x,F in enumerate(spectrum_norm):
    mean_x += x*F

print("The mean value along the x-axis (energy) = {}".format(mean_x))
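
The same sum $\Sigma xF(x)$ can also be written without an explicit loop by letting NumPy multiply the arrays element by element. The sketch below uses a made-up 4-bin spectrum so the arithmetic can be followed by hand: $0(0.1) + 1(0.3) + 2(0.4) + 3(0.2) = 1.7$.

```python
import numpy as np

toy_spectrum = np.array([10, 30, 40, 20])   # made-up counts in 4 bins

F = toy_spectrum / toy_spectrum.sum()       # normalized so that F sums to 1
x = np.arange(len(F))                       # bin indices 0, 1, 2, 3

mean_x = np.sum(x * F)                      # sum of x*F(x), approximately 1.7
print(mean_x)

# np.average with weights handles the normalization internally
print(np.average(x, weights=toy_spectrum))
```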

### 5.b. What would happen if we remove the excess counts at high energy?

These are an artifact of the detector, and not really part of the real radiation distribution.

In [None]:
sub_spectrum = spectrum_rebin[:1000]
plt.plot(sub_spectrum)
plt.yscale("log")
plt.show()

In [None]:
spectrum_integral = sub_spectrum.sum()
spectrum_norm = sub_spectrum/spectrum_integral
plt.plot(spectrum_norm)
plt.yscale("log")
plt.show()

In [None]:
# Note, we used a for loop above, but we can do this a little more "Pythonically"
mean_x = sum([x*F for x,F in enumerate(spectrum_norm)])
print(mean_x)

### 5.c. Now we will look at "samples" within this data set to explore variations in the measured energy

To do this, we will split up our spectra into equal chunks by slicing the original 2D array

In [None]:
spectra_sample = spectra[0:100]
spectrum_sample = spectra_sample.sum(axis=0)
# We want to cut out the high energy junk but we haven't rebinned this version, so we need to cut at 4000 not 1000
spectrum_sample = spectrum_sample[:4000]

In [None]:
plt.plot(spectrum_sample)
plt.yscale("log")
plt.show()

In [None]:
spectrum_integral = spectrum_sample.sum()
norm_sample = spectrum_sample/spectrum_integral
plt.plot(norm_sample)
plt.yscale("log")
plt.show()

In [None]:
sample_mean_x = 0
for x,F in enumerate(norm_sample):
    sample_mean_x += x*F
print(sample_mean_x)

### This is surprising! But wait: in this distribution `x` goes from 0 - 4000, not 0 - 1000. We rebinned before.

These x values are effectively 4 times the previous case. So we can simply divide this answer by the rebinning factor, 4.

In [None]:
sample_mean_x = 0
for x,F in enumerate(norm_sample):
    sample_mean_x += x*F
sample_mean_x = sample_mean_x/4
print(sample_mean_x)
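
Dividing by 4 is a close approximation rather than an exact conversion: an original bin `i` lands in rebinned bin `i // 4`, so the two means can differ by up to 0.75 of a rebinned bin. The sketch below checks this on a made-up spectrum (random values, not the lab data). The helper `mean_index` is introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
toy = rng.integers(1, 100, size=4000).astype(float)  # made-up 4000-bin spectrum

def mean_index(spec):
    """Mean bin index of a spectrum, i.e. sum of x*F(x) with F normalized to 1."""
    F = spec / spec.sum()
    return np.sum(np.arange(len(spec)) * F)

fine_mean = mean_index(toy)                                # mean in original bins
coarse_mean = mean_index(toy.reshape(-1, 4).sum(axis=1))   # mean in rebinned bins

print(fine_mean / 4, coarse_mean)
print(fine_mean / 4 - coarse_mean)  # between 0 and 0.75
```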

# Homework Part 1:

## 6. We can now do this same thing for all of our samples of the spectra

### 6.a. Following the same steps as we took in 5.c. (not including plotting), build up a list of the sample mean "Energy" - $\left<x\right>$ - for N slices, where each slice grabs 100 entries (spectra).

* **First:** How many spectra did we take? We can find this out by looking at the length of the `counts` array from earlier, because each entry in that array is the sum of all the counts collected in that spectrum.

In [None]:
len(counts)

There are almost 12000 spectra! 

If we divide these up into chunks of 100, we will have to repeat the work done in 5.c 117 times. We should not do this manually! That would mean writing out code like this:
```
spectra_sample1 = spectra[0:100]
...


```

```
spectra_sample2 = spectra[100:200]
...


```

```
spectra_sample3 = spectra[200:300]
...


```

And doing that 117 times!

Instead, we can define a function that executes all of the steps needed and returns the resulting $\left<x\right>$ for the specified chunk: `spectra[start:stop]`.

We can then create a loop that iterates over the desired `start` and `stop` until we have a list of 117 $\left<x\right>$ values, one for each chunk of spectra.
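
The general pattern of "function plus loop over `start` and `stop`" can be sketched on a toy array. Everything below is illustrative: `toy_spectra`, `chunk_size`, and `chunk_total` are made-up names, and the placeholder function only returns the total counts in a chunk. Replacing its body with the steps from 5.c is left as the homework.

```python
import numpy as np

# 10 made-up "spectra" of 3 bins each, standing in for the real 2D array
toy_spectra = np.arange(10 * 3).reshape(10, 3)
chunk_size = 4

def chunk_total(spectra_2d, start, stop):
    """Placeholder per-chunk calculation: total counts in spectra_2d[start:stop]."""
    return int(spectra_2d[start:stop].sum())

results = []
# Step through the rows in strides of chunk_size; an incomplete final chunk is skipped
for start in range(0, len(toy_spectra) - chunk_size + 1, chunk_size):
    stop = start + chunk_size
    results.append(chunk_total(toy_spectra, start, stop))

print(results)  # [66, 210]
```

Here the last two rows do not fill a complete chunk and are dropped, which mirrors how "almost 12000" spectra give 117 full chunks of 100.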

### 6.b. Plot your list of mean x values (the mean energy of each sample spectrum) in a histogram

### 6.c. What are the mean and standard deviation of this distribution?

### 6.d. Add the Gaussian (Normal) distribution curve defined by this mean and sigma to your plot of the histogrammed average energies

Don't forget to plot your histogram as a normalized distribution!

# Homework Part 2: Looking at a second set of data

Our other data file is: ~/E11data/Inside_p1_g3_2019-09-17_10s_D3S.csv

**NOTE:** There should be two files with very similar names. You should have read in one of these files during the lab, and you will now follow the same procedure for the second file. The names may be swapped in your case based on which file we used during the lab. 

The file with `10s` in the name indicates that data was collected in 10 second intervals, the file with `2s` in the name indicates that the data was collected in 2 second intervals. You should keep track of which is which because it will be relevant for later parts of the homework.

## 1. Read in this data in the same way that we read the first data set

## 2. Create your integrated spectrum and counts arrays as we did for the first data set

## 3. Plot the spectrum
### 3.a. Cut out the high energy spike at the end of the distribution

## 4. Plot the counts
### 4.a. Plot the counts across the full time-series
### 4.b. Plot the counts as a histogram - as we did for the first data set
**NOTE:** Be sure to set the number of bins in the same way we did before, do not use the same number of bins as you did for the first data set.

## 5. Comparing data
### 5.a. Plot this counts histogram together with the original data set (with labels!)
####   5.a.i. what differences do you see?
####   5.a.ii. Calculate the mean and standard deviation for each set of counts, how do they compare?
### 5.b. This data was taken over 10s intervals so each entry in the counts array is the total counts over 10s. Convert both sets of counts to counts-per-second
### 5.c. Plot these new counts-per-second arrays from both data sets together, as histograms, with labels!
**HINT:** Don't forget to plot these as normalized frequency distributions - that is, use the `density=True` argument when plotting the histograms.

**NOTE:** Keep the binning the same as for previous plots. So each data set should have its own binning determined from the original counts arrays.
#### 5.c.i. What differences do you see now? Has your answer changed from 5.a.i?
#### 5.c.ii. Calculate the mean and standard deviation from these counts-per-second distributions - how do they compare?
#### 5.c.iii. Is your answer here different than for 5.a.ii? Why or why not? Is this what you expected?

## 6. How long was data collected for each data set?

**HINT:** I've told you the time interval for each entry in the array of counts

### 6.a. The integrated spectrum from each data set shows the counts in each "energy" bin integrated over the full data collection time. Based on your answer above, which spectrum (the spectrum from the first data set or the second) do you expect to have smaller uncertainties at each energy?

**HINT:** I'm asking you to compare total counts, and the corresponding uncertainty in those counts, in the same energy bin (x-axis entry in the spectrum distribution).