# Previous week's highlights


## Mean, standard deviation, SE(M)

The sample mean:


$$\bar{x} = \frac{1}{N} \sum\limits_{i=1}^{N} x_i$$

which is the unbiased estimator of the population mean:

$$E\left(\bar{x}\right) = \mu$$


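**Sample variance**:  $$ s^2 = \frac{1}{N-1} \sum\limits_{i=1}^{N} \left(x_i - \bar{x}\right)^2 $$

which, thanks to the $N-1$ in the denominator (Bessel's correction), is an unbiased estimator of the population variance $\sigma^2$:

$$E\left(s^2\right) = \sigma^2$$

The **sample standard deviation** is $s = \sqrt{s^2}$. Note that $s$ itself is a slightly biased estimator of $\sigma$.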

**Population variance** (the population standard deviation $\sigma$ is its square root):  $$ \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{N} \left({x_i} - \mu\right)^2 $$



**Standard error of the mean (SEM)**:  $$SE(\bar{x}) = \frac{\sigma}{\sqrt{N}}$$
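
The estimators above can be sketched with NumPy as follows. The array `x` is a made-up sample used only for illustration, not data from this notebook. Note that `np.std` divides by $N$ unless `ddof=1` is passed, and that in practice $\sigma$ is unknown, so $s$ stands in for it in the SEM.

```python
import numpy as np

# Hypothetical sample, for illustration only
x = np.array([2.9, 3.1, 3.3, 3.0, 3.4, 3.2])
N = len(x)

mean = np.mean(x)            # sample mean
s = np.std(x, ddof=1)        # sample standard deviation (divides by N - 1)
sigma_biased = np.std(x)     # default ddof=0 divides by N instead
sem = s / np.sqrt(N)         # standard error of the mean, with s standing in for sigma

print("mean:", mean)
print("s (ddof=1):", s, " std (ddof=0):", sigma_biased)
print("SEM:", sem)
```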



## Central Limit Theorem


"Given a population with a finite mean μ and a finite non-zero variance σ$^2$, the sampling distribution of the mean approaches a normal distribution as N increases."

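A small simulation can make the theorem concrete. The sketch below assumes an exponential population (a strongly skewed distribution, chosen only for illustration) and compares the skewness of the sample means for a small and a large $N$. The helper names `sample_means` and `skewness` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed so the run is reproducible

def sample_means(n, n_trials=20000):
    """Means of n_trials samples of size n drawn from an exponential(1) population."""
    return rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)

def skewness(v):
    """Moment-based skewness; close to 0 for a symmetric (e.g. normal) distribution."""
    d = v - v.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

means_small = sample_means(2)
means_large = sample_means(200)

print("skewness of the means, N=2:  ", skewness(means_small))
print("skewness of the means, N=200:", skewness(means_large))
print("std of the means, N=200:", means_large.std(), " sigma/sqrt(N):", 1 / np.sqrt(200))
```

The skewness shrinks towards zero as $N$ grows, and the spread of the means follows $\sigma/\sqrt{N}$, i.e. the standard error of the mean.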



## Confidence intervals



### Approximating the sampling distribution by assuming normality:

- When $\sigma$ is known:


$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{N}} \qquad \equiv \qquad \bar{x} \pm z^* \times SE\left(\bar{x}\right)$$

where $z^*$ is the critical value for the chosen confidence level: the number of standard errors to add to and subtract from $\bar{x}$ to reach that level (e.g. $z^* \approx 1.96$ for 95%).

- When $\sigma$ is unknown (use the Student's t distribution; this matters most for small samples, $N < 30$):

$$\bar{x} \pm t_c\left(\frac{a}{2}, N-1\right) \frac{s}{\sqrt{N}} \qquad \equiv \qquad \bar{x} \pm t_c \times SE\left(\bar{x}\right)$$

where $t_c$ is a critical value that depends on the requested significance level $a$ (or equivalently the confidence level $C = 1-a$) and the degrees of freedom (here $N-1$)
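
Both intervals can be sketched with `scipy.stats`. The sample `x` and the "known" value `sigma = 0.2` below are assumptions made purely for illustration.

```python
import numpy as np
import scipy.stats as st

# Hypothetical small sample (N < 30), for illustration only
x = np.array([3.1, 3.4, 2.9, 3.3, 3.0, 3.6, 3.2, 3.1])
N = len(x)

confidence_level = 0.95      # C = 1 - a
a = 1 - confidence_level

x_bar = np.mean(x)
s = np.std(x, ddof=1)        # sample standard deviation
se = s / np.sqrt(N)          # estimated standard error of the mean

# Case 1: sigma known (here we simply pretend sigma = 0.2)
sigma = 0.2
z_star = st.norm.ppf(1 - a / 2)
ci_z = (x_bar - z_star * sigma / np.sqrt(N), x_bar + z_star * sigma / np.sqrt(N))

# Case 2: sigma unknown, Student's t with N - 1 degrees of freedom
t_c = st.t.ppf(1 - a / 2, df=N - 1)
ci_t = (x_bar - t_c * se, x_bar + t_c * se)

print("z* =", z_star, " CI (sigma known):  ", ci_z)
print("t_c =", t_c, " CI (sigma unknown):", ci_t)
```

For the same confidence level $t_c > z^*$, so the t-based interval is wider for a given standard error; the difference fades as $N$ grows.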


### When we cannot assume a distribution..?

We bootstrap...



## Let's see all that on an example regarding distance uncertainties


In Leonidaki et al. (2013), numerous supernova remnants (SNRs) were detected in NGC 2403. The luminosities of the SNR population were computed using a distance of 3.2 Mpc, taken from Freedman & Madore (1988), who derived it from I-band photometry of 8 Cepheids.

Taking advantage of Konstantinos' work on gathering all available distances for a large number of galaxies, let's compare the distance value we used with a mean value from a larger sample of data.





In [None]:
# Load needed libraries

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import glob
import math

%matplotlib inline

In [None]:
# Load up the data for NGC 2403:

 

# Check the length of the dataset:

print("Number of all available distances for NGC2403:", len(NGC2403_dist))

# Find the mean, standard deviation and SEM:




# Now, let's plot a histogram

# density=True replaces the removed `normed` argument of plt.hist
plt.hist(NGC2403_dist, density=True, color='b', alpha=0.5, bins=10)
plt.title("All available distances for NGC2403")
plt.show()


### Confidence levels

In [None]:
#Compute the desired confidence interval:

confidence_level=0.99

CI_norm= st.norm.interval(confidence_level, loc = , scale = )
print("Confidence interval at", confidence_level * 100, "%:", CI_norm)


# Plot the histogram and an assumed normal distribution:


plt.title("All available distances for NGC2403")




#NGC2403 distance that was used in Leonidaki et al. (2013)
x_mine=3.2
mu_K=3.175

# LINES
plt.axvline(mu, ymin=0., ymax = 1.0, linewidth=1.2, color='k')
plt.axvline(CI_norm[0], ymin=0., ymax = 1.0, linestyle='--', linewidth=1.0, color='black')
plt.axvline(CI_norm[1], ymin=0., ymax = 1.0, linestyle='--', linewidth=1.0, color='black')
plt.axvline(x_mine, ymin=0., ymax = 1.0, linestyle='--', linewidth=1.0, color='red')
plt.axvline(mu_K, ymin=0., ymax = 1.0, linewidth=1.2, color='green')
plt.show()

In [None]:
# Read the first column of the file into a list of floats
ratio = []
with open('ratios_Ha') as f:
    for line in f:
        ratio.append(float(line.split()[0]))

N=len(ratio)
data_median=np.median(ratio) 
data_std= np.std(ratio) 

 
print("data median and std:", data_median, data_std)
plt.axis([0.1,0.5,0,250])
plt.hist(ratio,500, histtype = "step")
plt.axvline(x=data_median, color='r')
plt.show()

### Bootstrap

Bootstrapping is a resampling method (sampling with replacement). As we saw before, drawing many samples from a population lets us approximate the sampling distribution of the mean. With real data we usually have a single sample, so this is impossible unless we assume a distribution; the bootstrap instead resamples the data themselves.

The bootstrap method randomly constructs $B$ new samples from the original one by sampling with replacement. Each resample should have the same size as the original sample. For example, with the sample $X$ below, we can create $B = 5$ new samples $Y_i$:

$$X = \left[1, 8, 3, 4, 7\right]$$

$$\begin{align}
Y_1 &= \left[8, 3, 3, 7, 1\right] \\
Y_2 &= \left[3, 1, 4, 4, 1\right] \\
Y_3 &= \left[3, 7, 1, 8, 7\right] \\
Y_4 &= \left[7, 7, 4, 3, 1\right] \\
Y_5 &= \left[1, 7, 8, 3, 4\right]
\end{align}$$

Then, we compute the desired sample statistic for each of those samples to form an empirical sampling distribution. The standard deviation of the $B$ sample statistics is the bootstrap estimate of the standard error of the statistic.
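
The procedure can be sketched on the toy sample $X$ above. The number of resamples `B = 5000` and the choice of the mean as the statistic are arbitrary assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, for reproducibility

X = np.array([1, 8, 3, 4, 7])
B = 5000                          # number of resamples (arbitrary choice)

# Each row is one resample Y_i: same size as X, drawn with replacement
Y = rng.choice(X, size=(B, len(X)), replace=True)

# Statistic of interest for every resample (here the mean)
boot_means = Y.mean(axis=1)

# Bootstrap estimate of the standard error of the mean
se_boot = boot_means.std(ddof=1)

print("bootstrap SE of the mean:", se_boot)
print("formula s / sqrt(N):     ", X.std(ddof=1) / np.sqrt(len(X)))
```

The two numbers are close but not identical: the bootstrap treats the sample as if it were the population, so its estimate for the mean tends to $\sqrt{(N-1)/N}\; s/\sqrt{N}$, which is noticeably smaller for such a tiny $N$.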

In [None]:
boot_samples = np.random.choice(ratio, (10000, N), replace = True)
m_boot = np.median(boot_samples, axis = 1)

plt.figure(figsize = [12, 4])
plt.subplot(1, 2, 1)
plt.hist(m_boot,histtype = "step")
plt.title("Distribution of median values (DMV)")
plt.subplot(1, 2, 2)
plt.axis([0.22,0.37,0,600])
plt.hist(m_boot,500,histtype = "step", color='b')
plt.hist(ratio,500, histtype ="step", color='g')
plt.title("DMV-Data histogram")
plt.show()

print("std_bootstrap:", np.std(m_boot))

In [None]:
# compute 95% confidence intervals around the median  
CI95s = #........................#
print("95% confidence interval:\nLow:", CI95s[0], "\nHigh:", CI95s[1])
print()
# 80% confidence interval  
CI80s = #........................#
print("80% confidence interval: \nLow:", CI80s[0], "\nHigh:", CI80s[1])
print()
# 68% confidence interval  
CI68s = #........................#
print("68% confidence interval: \nLow:", CI68s[0], "\nHigh:", CI68s[1])
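
One common way to turn a bootstrap distribution into a confidence interval is the percentile method: take the central fraction of the bootstrap statistics. The sketch below uses synthetic data (`data` is a hypothetical stand-in, not the `ratio` sample) and a hypothetical helper `percentile_ci`; other bootstrap interval definitions exist.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a measured sample (illustration only)
data = rng.normal(loc=0.3, scale=0.05, size=200)

# Bootstrap distribution of the median
boot = rng.choice(data, size=(5000, len(data)), replace=True)
medians = np.median(boot, axis=1)

def percentile_ci(stats, confidence_level):
    """Central interval containing `confidence_level` of the bootstrap statistics."""
    tail = 100 * (1 - confidence_level) / 2
    return np.percentile(stats, [tail, 100 - tail])

for level in (0.95, 0.80, 0.68):
    low, high = percentile_ci(medians, level)
    print(f"{level:.0%} interval: {low:.4f} to {high:.4f}")
```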

### Jackknife resampling

This older method inspired the bootstrap, which can be seen as its generalization (the jackknife is a linear approximation of the bootstrap). It estimates the sampling distribution of a parameter from an $N$-sized sample using a collection of $N$ sub-samples, each obtained by removing one element.

E.g. the sample $X$ leads to the <b>Jackknife samples</b> $Y_i$:

$$ X = \left[1, 7, 3\right] $$

$$
\begin{align}
Y_1 &= \left[7, 3\right] \\
Y_2 &= \left[1, 3\right] \\
Y_3 &= \left[1, 7\right]
\end{align}
$$

The <b>Jackknife Replicate</b> $\hat\theta_{\left(i\right)}$ is the value of the estimator of interest $f(x)$ (e.g. mean, median, skewness) for the $i$-th subsample and $\hat\theta_{\left(\cdot\right)}$ is the sample mean of all replicates:

$$
\begin{align}
\hat\theta_{\left(i\right)} &= f\left(Y_i\right) \\
\hat\theta_{\left(\cdot\right)} &= \frac{1}{N}\sum\limits_{i=1}^N {\hat\theta_{\left(i\right)}}
\end{align}
$$

and the <b>Jackknife Standard Error</b> of $\hat\theta$ is computed using the formula:
 
$$ SE_{jack}(\hat\theta) = \sqrt{\frac{N-1}{N}\sum\limits_{i=1}^N \left[\hat\theta_{\left(i\right)} - \hat\theta_{\left(\cdot\right)} \right]^2} = \cdots = \frac{N-1}{\sqrt{N}} s$$

where $s$ is the sample standard deviation of the replicates, computed with the $1/(N-1)$ normalisation.
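
As a check of the formula, here is a minimal sketch on the toy sample $X = [1, 7, 3]$ above, taking the mean as the estimator $f$ (an assumption for this illustration). For the mean, the jackknife standard error should reproduce the familiar $s/\sqrt{N}$ of the original sample.

```python
import numpy as np

X = np.array([1.0, 7.0, 3.0])
N = len(X)

# Jackknife samples: drop the i-th element in turn
samples = [np.delete(X, i) for i in range(N)]

# Replicates of the estimator (here the mean) and their average
theta_i = np.array([np.mean(y) for y in samples])
theta_dot = theta_i.mean()

# Both forms of the standard error formula
se_sum = np.sqrt((N - 1) / N * np.sum((theta_i - theta_dot) ** 2))
se_std = (N - 1) / np.sqrt(N) * np.std(theta_i, ddof=1)   # note ddof=1

print("replicates:", theta_i)
print("SE_jack (sum form, std form):", se_sum, se_std)
print("s / sqrt(N) of X:", np.std(X, ddof=1) / np.sqrt(N))
```

The two forms agree only when the standard deviation of the replicates is computed with `ddof=1`; with NumPy's default `ddof=0` the second form underestimates the error by a factor $\sqrt{(N-1)/N}$.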

In [None]:
def jackknife(x):
    return [[x[j] for j in range(len(x)) if j != i] for i in range(len(x))]

jack_samples = jackknife(ratio)
jack_medians = np.median(jack_samples, axis = 1)
# ddof=1: the (N - 1) / sqrt(N) form of the formula uses the sample std of the replicates
SE_median_jack = np.std(jack_medians, ddof=1) * (N - 1.0) / np.sqrt(N)
print("std_jackknife:",SE_median_jack)

plt.hist(jack_medians,histtype = "step")
plt.show()

<b> Why this distribution..?</b>