# Assignment 2b Due: 9/15

Please submit this assignment as Assignment2b_FirstName_LastName

In this assignment you will explore fitting data and assessing how well your fit describes the different data sets.

Assignment Overview:
* Fit data and use $\chi^2$ and the $\chi^2$ test to assess how well the fit describes the data
* Analyze the efficiency of your data selection at different threshold levels using your fit results

For this assignment you can make use of the numpy, matplotlib, and scipy packages.

In [1]:
#specify imports here
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy import integrate
from scipy.optimize import curve_fit
import scipy.special as sf
%matplotlib notebook

# Problem 1: W Boson Mass

Finding the *true* value of a quantity relies on analyzing results from many experiments. One quantity that has been measured many times is the W boson mass; see Wikipedia https://en.wikipedia.org/wiki/W_and_Z_bosons and the Particle Data Group (PDG) https://pdg.lbl.gov/2018/listings/rpp2018-list-w-boson.pdf

**a)** In this problem you will analyze measurements of the W boson mass from various experiments, determine whether the values are consistent, and find the best fit value for this data set. Start by reading in the data file Wmass_data.txt, which contains an experiment number, the W mass in units of $GeV/c^2$, and its uncertainty.


In [2]:
wm_n, wmass, wmerror = np.loadtxt('data/Wmass_data.txt', unpack=True)

**b)** Compute the error weighted mean of the W mass and its uncertainty. How does the weighted mean compare to the bold faced average of the PDG?

In [3]:
weights = 1 / (wmerror ** 2)
sum(wmass * weights) / sum(weights)

80.37914612783635

Our computed weighted mean (approximately 80.379) agrees with the bold faced average of the PDG ($80.379 \pm 0.012$) to well within its quoted uncertainty.
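Part b) also asks for the uncertainty on the weighted mean. A minimal sketch, assuming independent Gaussian uncertainties so that the standard inverse-variance result $\sigma_{\bar{x}} = 1/\sqrt{\sum_i w_i}$ applies. The helper name `weighted_mean` and the sample numbers are illustrative and are not taken from Wmass_data.txt:

```python
import numpy as np

def weighted_mean(values, errors):
    """Return the inverse-variance weighted mean and its uncertainty."""
    values = np.asarray(values, dtype=float)
    errors = np.asarray(errors, dtype=float)
    weights = 1.0 / errors**2              # w_i = 1 / sigma_i^2
    mean = np.sum(values * weights) / np.sum(weights)
    mean_error = 1.0 / np.sqrt(np.sum(weights))
    return mean, mean_error

# Illustrative (made-up) measurements, not the assignment data
demo_mean, demo_error = weighted_mean([80.40, 80.36, 80.38], [0.03, 0.02, 0.04])
print(f"{demo_mean:.4f} +/- {demo_error:.4f}")
```

Applied to `wmass` and `wmerror`, the second return value should equal the square root of the single covariance entry that `curve_fit` reports for a constant model when `absolute_sigma=True`.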

**c)** Calculate the $\chi^2$, degrees of freedom, reduced $\chi^2$, and p-value. The p-value can be calculated using *gammaincc(dof / 2.0, chisq / 2.0)* from *scipy.special*. Based on the p-value, are the data consistent?

In [4]:
def func_constant(t,a):
    return a

In [5]:
wmfit, wmcov = curve_fit(func_constant, wm_n, wmass, sigma=wmerror, absolute_sigma=True)

print('Initial fit:')
print(wmfit)
print('\nCovariance matrix:')
print(wmcov)

Initial fit:
[80.37914613]

Covariance matrix:
[[0.00010688]]


In [6]:
wm_residual = wmass - wmfit
wm_chisq = np.sum(wm_residual ** 2 / wmerror ** 2)
wm_dof = len(wm_n) - len(wmfit) - 1
wm_pvalue = sf.gammaincc(wm_dof / 2.0, wm_chisq / 2.0)
print(f'{"chi2 =":>15} {wm_chisq} \n{"dof =":>15} {wm_dof} \n{"reduced chi2 =":>15} {wm_chisq / wm_dof} \n{"p-value =":>15} {wm_pvalue}')

         chi2 = 8.706836513332618 
          dof = 7 
 reduced chi2 = 1.2438337876189454 
      p-value = 0.27439480922611664


The p-value is greater than five percent ($\text{p-value} > 5\%$), so the data are consistent with a single common value.
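The four statistics can be wrapped in one reusable helper. This is only a sketch: the name `chi2_summary` is hypothetical, and it assumes the common convention dof = N - k for N data points and k fitted parameters, which is one more than the value the cell above uses. It also checks that `gammaincc(dof / 2, chisq / 2)` is the same quantity as the $\chi^2$ survival function in `scipy.stats`:

```python
import numpy as np
from scipy import special, stats

def chi2_summary(data, model, errors, n_params):
    """Return chi2, dof, reduced chi2, and p-value for a fit."""
    data, model, errors = (np.asarray(a, dtype=float) for a in (data, model, errors))
    chisq = np.sum(((data - model) / errors) ** 2)
    dof = data.size - n_params             # assumed convention: N - k
    pvalue = special.gammaincc(dof / 2.0, chisq / 2.0)
    return chisq, dof, chisq / dof, pvalue

# Made-up numbers: three points described by one fitted constant
chisq, dof, red, p = chi2_summary([1.0, 2.0, 3.0], 2.0, [1.0, 1.0, 1.0], n_params=1)

# gammaincc(dof/2, chisq/2) is the chi-squared survival function
assert np.isclose(p, stats.chi2.sf(chisq, dof))
```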

**d)** Plot the measurement number vs. the W mass. Don't forget to include the error bars on the W mass measurements. Then fit a line of the form $y = p_0$, where $p_0$ is a constant parameter.

How does your $p_0$ value compare to the weighted mean you calculated earlier in part b)?

In [7]:
wm_fig = plt.figure()
wm_axes = wm_fig.add_axes([0.15,0.12,0.8,0.8])
wm_axes.set_title('Initial fit')
wm_axes.set_xlabel('Experiment number')
wm_axes.set_ylabel('W mass (GeV/c^2)')
wm_axes.errorbar(wm_n, wmass, yerr = wmerror, fmt='.')
wm_axes.plot(wm_n,[func_constant(n,*wmfit) for n in wm_n],'r-',label='Fit')
wm_axes.legend();


The fit parameter $p_0$ matches my weighted mean that I calculated before (approximately 80.379).
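This agreement is expected. A short derivation, assuming independent uncertainties $\sigma_i$ on the measurements $y_i$: for the constant model $y = p_0$,

$$\chi^2(p_0) = \sum_i \frac{(y_i - p_0)^2}{\sigma_i^2}, \qquad \frac{d\chi^2}{dp_0} = -2 \sum_i \frac{y_i - p_0}{\sigma_i^2} = 0 \;\Rightarrow\; p_0 = \frac{\sum_i y_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$$

which is the weighted mean with weights $w_i = 1/\sigma_i^2$, the same weights used in part b).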

# Problem 2: Proton Charge Radius

We will carry out an identical analysis to the one in Problem 1, but on a different quantity: the proton charge radius. The proton charge radius has recently been a hot topic in nuclear physics, as newly designed experiments using muonic hydrogen have made very precise measurements of it. See https://www.nature.com/articles/s41586-019-1721-2

There is an approachable video that reviews the history of the proton size and its measurements: https://www.youtube.com/watch?v=C5B_ZfGy4d0

**a)** Import the data set proton_radius_data.txt, which includes the experiment number, the proton charge radius, and its uncertainty measured in $fm$. 

In [8]:
pcr_n, pcradius, pcrerror = np.loadtxt('data/proton_radius_data.txt', unpack=True)

**b)** Compute the error weighted mean of the proton charge radius and its uncertainty. 

You can also compare this to the PDG value (pgs. 6 and 7): https://pdg.lbl.gov/2018/listings/rpp2018-list-p.pdf 

In [9]:
pcrweights = 1 / (pcrerror ** 2)
sum(pcradius * pcrweights) / sum(pcrweights)

0.8416225242550589

**c)** Calculate the $\chi^2$, degrees of freedom, reduced $\chi^2$, and p-value. Based on the p-value, are the data consistent? Do you see what all of the fuss is about?

In [10]:
def func_constant(t,a):
    return a

In [11]:
pcrfit, pcrcov = curve_fit(func_constant, pcr_n, pcradius, sigma=pcrerror, absolute_sigma=True)

print('Initial fit:')
print(pcrfit)
print('\nCovariance matrix:')
print(pcrcov)

Initial fit:
[0.84162252]

Covariance matrix:
[[1.11377068e-07]]


In [12]:
pcr_residual = pcradius - pcrfit
pcr_chisq = np.sum(pcr_residual ** 2 / pcrerror ** 2)
pcr_dof = len(pcr_n) - len(pcrfit) - 1
pcr_pvalue = sf.gammaincc(pcr_dof / 2.0, pcr_chisq / 2.0)
print(f'{"chi2 =":>15} {pcr_chisq} \n{"dof =":>15} {pcr_dof} \n{"reduced chi2 =":>15} {pcr_chisq / pcr_dof} \n{"p-value =":>15} {pcr_pvalue}')

         chi2 = 222.8986182713528 
          dof = 13 
 reduced chi2 = 17.146047559334832 
      p-value = 2.6281901330385047e-40


The p-value is much less than five percent ($\text{p-value} \ll 5\%$), so the data are very inconsistent with a single common value.
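To see which measurements drive such a large $\chi^2$, one option is to look at the pull of each point, (value - reference) / uncertainty, since $\chi^2$ is the sum of the squared pulls. The sketch below uses made-up radii rather than proton_radius_data.txt, and the helper name `pulls` is hypothetical:

```python
import numpy as np

def pulls(values, errors, reference):
    """Return (value - reference) / error for each measurement."""
    values = np.asarray(values, dtype=float)
    errors = np.asarray(errors, dtype=float)
    return (values - reference) / errors

# Made-up radii in fm: two clusters that disagree with each other
demo_radius = np.array([0.88, 0.87, 0.84, 0.841])
demo_error = np.array([0.01, 0.01, 0.001, 0.001])
demo_pulls = pulls(demo_radius, demo_error, reference=0.8416)

# chi2 is the sum of squared pulls, so large |pull| values dominate it
demo_chisq = np.sum(demo_pulls**2)
print(np.round(demo_pulls, 2), round(float(demo_chisq), 2))
```

With the real arrays, `pulls(pcradius, pcrerror, pcrfit[0])` would give one pull per experiment.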

**d)** Plot the measurement number vs. the proton charge radius. Don't forget to include the error bars on the proton charge radius measurements. Then fit a line of the form $y = p_0$, where $p_0$ is a constant parameter.

How does your $p_0$ value compare to the weighted mean you calculated earlier in part b)?

In [13]:
pcr_fig = plt.figure()
pcr_axes = pcr_fig.add_axes([0.15,0.12,0.8,0.8])
pcr_axes.set_title('Initial fit')
pcr_axes.set_xlabel('Experiment number')
pcr_axes.set_ylabel('Proton charge radius (fm)')
pcr_axes.errorbar(pcr_n, pcradius, yerr = pcrerror, fmt='.')
pcr_axes.plot(pcr_n, [func_constant(n,*pcrfit) for n in pcr_n],'r-',label='Fit')
pcr_axes.legend();


The fit parameter $p_0$ matches my computed weighted mean (approximately 0.8416).

# Problem 3: Selecting Data

In particle physics we sometimes want to measure one particular type of particle among the many that are created in a collision in a particle collider. In recording these collision events we typically also measure other particles that are not the ones we are interested in. The events we are interested in are referred to as our signal, whereas the ones we are not interested in are referred to as background.

**a)** The provided data set (Ep_data.txt) contains values of particle energy/momentum (E/p), the number of particles, and the uncertainty on the number of particles. Import the data and plot the number of particles vs. E/p and be sure to include the error bars on the particle counts. 

In [14]:
ptcl_energy, ptcl_num, ptcl_error = np.loadtxt('data/Ep_data.txt', unpack=True)

ptcl_fig = plt.figure()
ptcl_axes = ptcl_fig.add_axes([0.1,0.1,0.8,0.8])
ptcl_axes.set_xlabel('Particle energy/momentum $(E/p)$')
ptcl_axes.set_ylabel('Number of particles')
ptcl_axes.set_title('Particle collision')
ptcl_axes.errorbar(ptcl_energy, ptcl_num, yerr = ptcl_error, fmt='.', label = 'Data')
ptcl_axes.legend();


**b)** You should notice that there appear to be two clear distributions here: one that seems to be centered at E/p = 0.6 and another around E/p = 1. The population at the lower E/p represents pions, whereas the population around E/p = 1 represents electrons. For this exercise we will treat the pions as background and the electrons as our signal. We will model each particle type with a Gaussian distribution. Define two python functions: one that returns a value computed from a single Gaussian function, and another that returns a value computed from the sum of two Gaussian functions. Then make a fit to the data using the sum of two Gaussian functions. Each of your Gaussian functions can take the form:

$G_1(x) = p_1 e^{-(x-p_2)^2/(2p_3)}$

where $p_1$, $p_2$, and $p_3$ are the three parameters of one Gaussian function (in this form, $p_3$ plays the role of the variance $\sigma^2$). The other Gaussian function $G_2(x)$ has three more parameters of its own, so we want to fit our E/p distribution with the function $G_1(x) + G_2(x)$. The image below shows my fit, with the $G_1(x) + G_2(x)$ fit being the black curve. From this fit I can use the fit parameters to draw $G_1(x)$ (blue curve) and $G_2(x)$ (red curve).

Note: Did you get a negative value for the Gaussian widths from your fit? We know that a negative value is not physical. Try giving the fit some initial parameter values to start from.

![Screen%20Shot%202021-07-15%20at%209.57.45%20AM.png](attachment:Screen%20Shot%202021-07-15%20at%209.57.45%20AM.png)
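The note above can be sketched with the `p0` and `bounds` arguments of `curve_fit`. The data here are synthetic (generated with a fixed seed as an assumed stand-in for Ep_data.txt), and the starting values and limits are illustrative guesses of the kind one would read off the plot, not recommended settings:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, p1, p2, p3):
    return p1 * np.exp(-(x - p2) ** 2 / (2 * p3))

def two_gauss(x, p1, p2, p3, p4, p5, p6):
    return gauss(x, p1, p2, p3) + gauss(x, p4, p5, p6)

# Synthetic stand-in for Ep_data.txt (assumed shape, fixed seed)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 100)
truth = (10.0, 1.0, 0.06, 20.0, 0.6, 0.01)
noise = 0.3
y = two_gauss(x, *truth) + rng.normal(0.0, noise, x.size)
yerr = np.full(x.size, noise)

# Rough starting values: amplitude, center, p3 for each peak
p0 = (8.0, 0.9, 0.05, 15.0, 0.55, 0.02)
# Keep amplitudes and p3 positive and the centers inside the plotted range
lower = (0.0, 0.0, 1e-4, 0.0, 0.0, 1e-4)
upper = (np.inf, 2.0, 1.0, np.inf, 2.0, 1.0)

popt, pcov = curve_fit(two_gauss, x, y, p0=p0, sigma=yerr,
                       absolute_sigma=True, bounds=(lower, upper))
print(popt)
```

Passing `bounds` makes `curve_fit` use a bounded least-squares method instead of its default unconstrained one, so a good `p0` alone is often enough and bounds are optional.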

In [15]:
def func_gauss(x,p1,p2,p3):
    return p1*np.exp((-(x-p2)**2)/(2*p3))
    
def func_gauss2(x,p1,p2,p3,p4,p5,p6):
    return func_gauss(x,p1,p2,p3) + func_gauss(x,p4,p5,p6)

ptcl_fit, ptcl_cov = curve_fit(func_gauss2, ptcl_energy, ptcl_num, sigma = ptcl_error, absolute_sigma = True)

print('Initial fit:')
print(ptcl_fit)
print('\nCovariance matrix:')
print(ptcl_cov)

Initial fit:
[9.74415553e+00 1.00366136e+00 5.93338492e-02 2.02093707e+01
 6.01230392e-01 1.05958691e-02]

Covariance matrix:
[[ 1.89694965e-02  5.69432587e-06 -1.70257958e-04  6.72801531e-03
  -3.19625065e-05 -4.71470510e-06]
 [ 5.69432587e-06  7.59662808e-05 -2.62954965e-05  2.44115915e-03
   8.14668262e-06  2.46372611e-06]
 [-1.70257958e-04 -2.62954965e-05  1.51639803e-05 -1.14433843e-03
  -2.91391202e-06 -9.85902130e-07]
 [ 6.72801531e-03  2.44115915e-03 -1.14433843e-03  1.34649438e-01
   2.68237194e-04  5.83639879e-05]
 [-3.19625065e-05  8.14668262e-06 -2.91391202e-06  2.68237194e-04
   2.44558029e-06  2.80424680e-07]
 [-4.71470510e-06  2.46372611e-06 -9.85902130e-07  5.83639879e-05
   2.80424680e-07  1.49037206e-07]]


**c)** Calculate your $\chi^2$, degrees of freedom, reduced $\chi^2$, and p-value for the fit to the data.
Based on those statistics above is this a good fit? Explain.

In [16]:
ptcl_residual = ptcl_num - func_gauss2(ptcl_energy, *ptcl_fit)
ptcl_chisq = np.sum(ptcl_residual ** 2 / ptcl_error ** 2)
ptcl_dof = len(ptcl_num) - len(ptcl_fit) - 1
ptcl_pvalue = sf.gammaincc(ptcl_dof / 2.0, ptcl_chisq / 2.0)
print(f'{"chi2 =":>15} {ptcl_chisq} \n{"dof =":>15} {ptcl_dof} \n{"reduced chi2 =":>15} {ptcl_chisq / ptcl_dof} \n{"p-value =":>15} {ptcl_pvalue}')

         chi2 = 108.4209411082341 
          dof = 93 
 reduced chi2 = 1.1658165710562807 
      p-value = 0.13094980565355063


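Yes, this is a good fit: the reduced $\chi^2$ is close to 1 (${\chi_R}^2 \approx 1.17$), and the p-value ($\approx 0.13$) is above 5%, so the two-Gaussian model is consistent with the data.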

**d)** On the same graph, plot your data, the total fit to it, and the single Gaussian functions computed using the parameter results from your 2 Gaussian function fit (e.g. reproduce my fit figure). 

In [17]:
ptcl_fig2 = plt.figure()
ptcl_axes2 = ptcl_fig2.add_axes([0.1,0.1,0.8,0.8])
ptcl_axes2.set_xlabel('Particle energy/momentum $(E/p)$')
ptcl_axes2.set_ylabel('Number of particles')
ptcl_axes2.set_title('Particle collision')
ptcl_axes2.errorbar(ptcl_energy, ptcl_num, yerr = ptcl_error, fmt='.', label = 'Data')
ptcl_axes2.plot(ptcl_energy, func_gauss2(ptcl_energy, *ptcl_fit), 'k-', label='Total fit')
ptcl_axes2.plot(ptcl_energy, func_gauss(ptcl_energy, *ptcl_fit[3:]), 'b--', label='Pion fit')
ptcl_axes2.plot(ptcl_energy, func_gauss(ptcl_energy, *ptcl_fit[:3]), 'r--', label='Electron fit')
ptcl_axes2.legend();


**e)** We can use the $E/p$ distribution to select the maximum number of electrons while minimizing the number of pions that *leak* into our electron signal. We do this by requiring our selected sample to have $E/p$ larger than some threshold value; any data with an $E/p$ value lower than the threshold is thrown out. In a physics analysis this is called a cut. However, we need to be careful: if we place the cut at an $E/p$ that is too large, we will have a really clean electron sample but throw away a lot of good electrons. On the other hand, if we make the $E/p$ cut too low, we will keep most of our electrons but let in a lot of background (pions). So we must compromise between clean data and statistics.

To do this, let's calculate the total number of electrons we have in $0.0 < E/p < 2$. This can be obtained by integrating the electron contribution from our fit (you can use scipy integrators; I used *integrate.quad* when doing this exercise). We will call this number e_tot. Do a similar thing for the total pions and call that number pi_tot.

For 10 equally spaced E/p thresholds between 0.3 and 0.8, calculate the number of electrons that are above each of the thresholds. We can call this array e_sig; it can be obtained by integrating from the E/p threshold value to E/p = 2. Do a similar thing for the pion distribution.
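Because each component is a Gaussian, these integrals also have a closed form in terms of the error function, which gives a cross-check on the numerical integrator. The sketch below assumes the $G_1(x)$ form given in part b), with $p_3$ acting as the variance; the helper name `gauss_integral` and the parameter values are illustrative and are not fit results:

```python
import numpy as np
from scipy import integrate, special

def gauss(x, p1, p2, p3):
    return p1 * np.exp(-(x - p2) ** 2 / (2 * p3))

def gauss_integral(a, b, p1, p2, p3):
    """Closed-form integral of gauss from a to b (p3 is the variance)."""
    scale = np.sqrt(2.0 * p3)
    return p1 * np.sqrt(np.pi * p3 / 2.0) * (
        special.erf((b - p2) / scale) - special.erf((a - p2) / scale)
    )

# Illustrative parameters (amplitude, center, variance), not fit results
params = (10.0, 1.0, 0.06)
thresholds = np.linspace(0.3, 0.8, 10)

total = gauss_integral(0.0, 2.0, *params)
above = gauss_integral(thresholds, 2.0, *params)   # vectorized over thresholds
efficiency = above / total

# Cross-check one threshold against numerical integration
numeric, _ = integrate.quad(gauss, thresholds[0], 2.0, args=params)
assert np.isclose(above[0], numeric)
```

The closed form evaluates all thresholds in one call, whereas `integrate.quad` has to be called once per threshold.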

Below is what your graph in part f) should look like.

![Screen%20Shot%202021-07-15%20at%209.57.52%20AM.png](attachment:Screen%20Shot%202021-07-15%20at%209.57.52%20AM.png)



In [18]:
e_tot, e_tot_err = integrate.quad(func_gauss, 0, 2, args = tuple(ptcl_fit[:3]))
p_tot, p_tot_err = integrate.quad(func_gauss, 0, 2, args = tuple(ptcl_fit[3:]))
print(f'{"Total number of electrons:":>26} {e_tot}')
print(f'{"Total number of pions:":>26} {p_tot}')

Total number of electrons: 5.949326913613009
    Total number of pions: 5.214480134869143


In [19]:
ep_thresholds = np.linspace(0.3, 0.8, 10)
e_sig_quads = np.array([integrate.quad(func_gauss, threshold, 2, args = tuple(ptcl_fit[:3])) for threshold in ep_thresholds])
p_sig_quads = np.array([integrate.quad(func_gauss, threshold, 2, args = tuple(ptcl_fit[3:])) for threshold in ep_thresholds])
e_sig = e_sig_quads[:,0]
p_sig = p_sig_quads[:,0]

e_sig, p_sig

(array([5.93793428, 5.92624202, 5.90484712, 5.86767417, 5.80634752,
        5.71027992, 5.56738708, 5.36557365, 5.09493127, 4.75030539]),
 array([5.20553905, 5.17015252, 5.04565431, 4.71616686, 4.06010455,
        3.07713387, 1.96882534, 1.02841789, 0.4279467 , 0.139444  ]))

**f)** Plot the ratios e_sig/e_tot and pi_sig/pi_tot as a function of E/p threshold on the same graph. 

In [20]:
eprat_fig = plt.figure()
eprat_axes = eprat_fig.add_axes([0.1,0.1,0.8,0.8])
eprat_axes.set_title('Ratios of the total number of electrons and pions and\n the number of electrons and pions above a certain threshold')
eprat_axes.set_xlabel('$E/p$ cut threshold')
eprat_axes.set_ylabel('Ratio (fraction above threshold)')
eprat_axes.scatter(ep_thresholds, e_sig/e_tot, label = 'Electron')
eprat_axes.scatter(ep_thresholds, p_sig/p_tot, label = 'Pion')
eprat_axes.legend(loc='lower left');


**g)** When the e_sig/e_tot ratio is 90%, what percentage of the pion distribution is contaminating our electron sample?

In [21]:
e_sig/e_tot

array([0.99808506, 0.99611975, 0.99252356, 0.9862753 , 0.97596713,
       0.95981949, 0.93580117, 0.90187911, 0.85638785, 0.79846098])

In [22]:
# Pick the threshold whose electron ratio is closest to 90% (avoids exact float comparison)
erat_90_index = np.argmin(np.abs(e_sig/e_tot - 0.9))
erat_90_index

7

In [23]:
(p_sig/p_tot)[erat_90_index]

0.19722347438277668

Approximately 19.7% of the pion distribution is contaminating our electron sample when the e_sig/e_tot ratio is about 90%, which occurs at the threshold $E/p \approx 0.69$.
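The grid point used above sits at an electron ratio of about 90.2% rather than exactly 90%. A rough refinement is to interpolate linearly between neighboring thresholds. The sketch below reuses the numbers printed in the cells above; linear interpolation between grid points is itself an approximation, and a finer threshold grid or a root finder on the integral would be more exact:

```python
import numpy as np

# Values copied from the printed outputs of the cells above
thresholds = np.linspace(0.3, 0.8, 10)
e_ratio = np.array([0.99808506, 0.99611975, 0.99252356, 0.9862753, 0.97596713,
                    0.95981949, 0.93580117, 0.90187911, 0.85638785, 0.79846098])
p_sig = np.array([5.20553905, 5.17015252, 5.04565431, 4.71616686, 4.06010455,
                  3.07713387, 1.96882534, 1.02841789, 0.4279467, 0.139444])
p_ratio = p_sig / 5.214480134869143

# np.interp needs increasing x values, so reverse the decreasing e_ratio
target = 0.90
cut_at_90 = np.interp(target, e_ratio[::-1], thresholds[::-1])
pion_at_90 = np.interp(target, e_ratio[::-1], p_ratio[::-1])
print(f"E/p cut = {cut_at_90:.3f}, pion fraction = {pion_at_90:.3f}")
```

The interpolated pion fraction comes out slightly below the grid-point value, since the cut moves a little higher to reach exactly 90% electron efficiency.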