In the last notebook we covered some core definitions needed to talk about statistics. The ideas of a population, a sampling process, and the standard error are crucial to understanding statistics. The difference between how we treat a sample and a population will, as you will see from here on, form the basis of that understanding.

In this notebook we add a new idea: statistical estimation theory. We will break estimation theory into elementary and advanced parts, and this notebook covers the elementary part. After we introduce random variables and probability, we shall delve into more advanced ideas such as maximum likelihood estimation and maximum a posteriori (MAP) estimation. Here we introduce the general ideas of point and interval estimates and then move on to confidence intervals. Hence our sections are -

1) Point estimates and Interval estimates <br>
2) Confidence intervals<br>
3) Visualization of confidence intervals<br>

This is already quite a bit, so let's get started.

Before we start talking about estimates, let's define a few things so we can talk about estimates easily- <br>

1) Population parameters - Parameters that pertain to the population like the population mean $\mu$ or population standard deviation  $\sigma$

2) Sample parameters - Parameters pertaining to the sample (more commonly called sample statistics); for example, the sample mean is denoted by $\bar x$ and the sample standard deviation by $s$

3) Sampling distribution parameters - These parameters pertain to the sampling distribution that we introduced in the last notebook: the mean of the sampling distribution of means is denoted by $\bar \mu$, and the standard error, which is the standard deviation of the sampling distribution, is denoted by $\bar \sigma$

If you ever get confused about what each term means in the later part of the notebook, you can always come back here and clarify doubts. 
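
To make the three groups of symbols concrete, here is a minimal sketch that computes each of them on simulated data. The population (10,000 weighings centred on 50 kg with a spread of 0.5 kg), the sample size of 10, and all variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 weighings centred on 50 kg
population = rng.normal(loc=50, scale=0.5, size=10_000)
mu = population.mean()           # population mean
sigma = population.std()         # population standard deviation

# One sample of 10 measurements drawn without replacement
sample = rng.choice(population, size=10, replace=False)
x_bar = sample.mean()            # sample mean
s = sample.std(ddof=1)           # sample standard deviation

# Sampling distribution of the mean: 1,000 samples of size 10
sample_means = np.array([
    rng.choice(population, size=10, replace=False).mean()
    for _ in range(1000)
])
mu_bar = sample_means.mean()     # mean of the sampling distribution
sigma_bar = sample_means.std()   # standard error

print(f"population:            mu = {mu:.3f}, sigma = {sigma:.3f}")
print(f"sample:                x_bar = {x_bar:.3f}, s = {s:.3f}")
print(f"sampling distribution: mu_bar = {mu_bar:.3f}, sigma_bar = {sigma_bar:.3f}")
```

Note that `sigma_bar` comes out much smaller than `sigma`: averaging 10 measurements smooths out a good part of the measurement noise.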


# Point estimate and Interval Estimates

First, let's look at why we even need to talk about estimates. The whole point of understanding estimates is to provide better information about the measurements we are making. Let us take the example of measuring the weight of Diljit. When we say that Diljit weighs 50 kg, we are making a point estimate of his weight. But we can also say that Diljit weighs 50 kg $\pm$ 5 kg. This is an example of an interval estimate. In general, we prefer interval estimates over point estimates. Want to hazard a guess as to why? 

Question:  Why do you think interval estimates are better than point estimates? Hint: How much information is being provided by interval estimates vs point estimates?

Answer: The answer is really in the hint. An interval estimate provides us with more information than a point estimate. It gives us not just the value of a measurement but also how much error there can be in that value. Whenever we deal with any kind of measurement, we will always have an error range. A tape measure, a weighing scale, or any other measuring tool always has an inherent range within which it can measure. 
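
As a small illustration of the difference, the sketch below turns a set of repeated weighings into a point estimate and an interval estimate. The 20 simulated readings and the choice of two sample standard deviations as the half-width are assumptions made for this example, not a rule from the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical readings: 20 weighings on a scale with about 0.5 kg of noise
readings = rng.normal(loc=50, scale=0.5, size=20)

# Point estimate: a single number
point_estimate = readings.mean()

# Interval estimate: the same number plus a statement about the error.
# Here the half-width is two sample standard deviations (an illustrative choice).
half_width = 2 * readings.std(ddof=1)
lower, upper = point_estimate - half_width, point_estimate + half_width

print(f"point estimate:    {point_estimate:.2f} kg")
print(f"interval estimate: {point_estimate:.2f} ± {half_width:.2f} kg "
      f"({lower:.2f} kg to {upper:.2f} kg)")
```

The interval carries everything the point estimate does, plus an indication of how far individual readings tend to stray from it.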

Keeping this in mind, we can think of a confidence interval as a type of interval estimate. Unlike a point estimate, which provides a single value as an estimate of a population parameter, a confidence interval provides a range of values in which the population parameter may lie. This is easier to see with an example. 

Let's return to the example of Diljit's weight. Say the true distribution of the readings of Diljit's weight on the weighing scale is given by the plot below - 

In [1]:
# Just run the code below

# All libraries that would be used in this notebook
import numpy as np 
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

import random

from IPython.display import display, Math

from bokeh.io import show, output_notebook
from bokeh.plotting import figure, show
from scipy.stats import norm 
from bokeh import plotting as pl
from bokeh.models import HoverTool
 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go
init_notebook_mode(connected=True)

output_notebook()

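random.seed(1)      # seed Python's RNG (used by random.sample further down)
np.random.seed(1)   # seed NumPy's RNG (used by np.random.normal below)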
mean= 50
std =0.5
weight_dist = np.random.normal(mean, std, 10000)
count, bins= np.histogram(weight_dist, 50)

tools_to_show= 'box_zoom,pan,save,hover,reset,tap,wheel_zoom'        

fig = pl.figure(x_range=[45,55], plot_height=250, tools = tools_to_show,title="Diljit's weight" , x_axis_label= "Diljit's Weight [kg]", y_axis_label ="Count" )

fig.vbar(x= bins[:-1], top= count, width = 0.02 )

fig.xgrid.grid_line_color = None
fig.y_range.start = 0

hover = fig.select(dict(type=HoverTool))
hover.tooltips = [("xvalue", "@x"), ("yvalue", "@top")]

show(fig)


### Solution code

```python
# Just run the above code
```

In the plot above we have used the bokeh plotting library to generate a frequency distribution of Diljit's weight. 

We can consider this distribution as a population which contains all the possible measurements of Diljit's weight. This population has many parameters, two of which are the population mean $\mu$ and the population standard deviation $\sigma$. Since the measurements follow a normal distribution, we can say that the mean of the population distribution is $\mu$ = 50 kg and the standard deviation is $\sigma$ = 0.5 kg. 

Suppose we don't have access to this population data about Diljit's weight and we want to infer what his weight could be. First, we draw a sample from this distribution. 

Question: Draw a sample of size 10 from the distribution, repeat this 1000 times, and plot the sampling distribution of the mean.

(Run the below code to observe the answer)

In [11]:
# Just run the below code

#Answer 
# so first we take the variable weight_dist and randomly sample from it 

number_times = np.arange(0,1000)
sample_size = 10
samples= [random.sample(list(weight_dist),sample_size) for x in number_times] 
sampling_distribution_mean  =[np.mean(x) for x in samples]
count_mean, bins_mean = np.histogram(sampling_distribution_mean,50)
# normalization 
count_mean_normalized = count_mean/sum(count_mean)


# plotting 
fig = pl.figure(x_range=[48,52], plot_height=300, tools = tools_to_show,title="Sampling distribution of means" , x_axis_label= "mean weight [kg]", y_axis_label ="Count" )

fig.vbar(x= bins_mean[:-1], top= count_mean_normalized, width = 0.01 )

fig.xgrid.grid_line_color = None
fig.y_range.start = 0

hover = fig.select(dict(type=HoverTool))
hover.tooltips = [("xvalue", "@x"), ("yvalue", "@top")]

show(fig)

### Solution code

```python
# Just run the above code
```

In the above, we have a discrete distribution which represents the sampling distribution of mean.

Question : What did we actually do in calculating the sampling distribution?

Answer: What we have essentially done is weigh Diljit 10 times, and we have done this 1000 times. It's the equivalent of, say, weighing Diljit 10 times a day for 1000 days to collect data. We do this because we want to understand the variation in the measurement of the weight. 

Now that we have this information, how can we estimate the population mean from these weighings? Well, we have all the sample means; in fact we have a distribution of sample means. So we can look at the mean of the sampling distribution and compare it to the mean of the population distribution. That would be - <br>

In [3]:
# Just run the below code

mean_of_sampling_dist = np.mean(sampling_distribution_mean)
# Approximate the population mean by the left edge of the tallest histogram bin
# (the mode); for this symmetric distribution it sits close to np.mean(weight_dist).
population_mean =  bins[np.argmax(count)]

print("population mean -  {}" .format(round(population_mean,2)))
print("mean of the sampling distribution -  {}".format(round(mean_of_sampling_dist,2)))

population mean -  49.94
mean of the sampling distribution -  50.01


### Solution code

```python
# Just run the above code
```

We can see that both these values are very close. That's good: the mean of our sampling distribution $\bar\mu$ is close to the population mean $\mu$. 

Of course all of this was possible since we had a sampling distribution.

Next, we figure out what the error in the sampling distribution is. Why is this important? Because it helps us quantify the variation in the sampling distribution. If that variation is really large, then there is a large error in $\bar\mu$, and this is bad because $\bar\mu$ is our estimate of the population mean $\mu$. 

We have actually quantified the error in the sampling distribution before: it's nothing but the standard error $\bar\sigma$. Wait, but if we want to look at the variation in the sampling distribution, why don't we just use the standard deviation of the sampling distribution? Well, the standard deviation of the sampling distribution of means is the standard error. We established this in the last notebook, and now we are using that fact to describe the sampling distribution. 
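
The sketch below measures the standard error directly as the standard deviation of simulated sample means and compares it with the standard textbook result $\bar\sigma = \sigma / \sqrt{n}$. The population settings mirror the example above ($\mu$ = 50 kg, $\sigma$ = 0.5 kg, samples of size 10), but the data are regenerated here so the block runs on its own. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma, n = 0.5, 10
population = rng.normal(loc=50, scale=sigma, size=10_000)

# 1,000 samples of size n, each reduced to its mean
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(1000)
])

# Standard error measured from the sampling distribution
standard_error_simulated = sample_means.std()

# Standard error predicted from the population standard deviation
standard_error_formula = sigma / np.sqrt(n)

print(f"simulated standard error: {standard_error_simulated:.3f}")
print(f"sigma / sqrt(n):          {standard_error_formula:.3f}")
```

The two numbers should agree closely, which is why the standard error can be quoted even when only $\sigma$ and the sample size are known.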

#  Confidence Intervals
The other way we can describe the sampling distribution of means is using confidence intervals. Here, instead of quoting a single value for the error, we say that a sample mean lies within a certain interval around $\bar\mu$. We define the confidence intervals in terms of "distance" from the mean, measured in units of the standard error. So for instance we can say that - 


<br>
68.27% of the sample means fall within $\bar\mu \pm \bar\sigma$

<br>
95.45% of the sample means fall within $\bar\mu \pm 2\bar\sigma$

<br>
99.73% of the sample means fall within $\bar\mu \pm 3\bar\sigma$

Well, that seems rather magical. How did we get these numbers? For that, let's go back to our good old normal distribution.


In [4]:
# Just run the below code

xrange =np.linspace(-5,5,10000)
pdf = norm(0,1).pdf(xrange)
data = [go.Scatter(x = xrange, y = pdf, mode ="lines")]
layout = go.Layout(title="Sampling distribution of mean", xaxis = {"title": "x values",
                                                                        "range": [-5,5]}
                                                            , yaxis= {"title": "PDF" ,         
                                                                       "range": [0,0.5]         
                                                                                })
# Named normal_fig so it does not shadow bokeh's `figure` imported earlier
normal_fig = go.Figure(data = data, layout = layout)
iplot(normal_fig,filename = "Normal distribution")


### Solution code

```python
# Just run the above code
```

What I have done here is use `norm.pdf()` from the `scipy.stats` module. It takes either an array or a single value and returns the value of the normal probability density at each point. Now, those numbers 68.27%, 95.45%, etc. are essentially the areas under the curve within $\bar\mu \pm \bar\sigma$, $\bar\mu \pm 2\bar\sigma$ and so on, respectively. So how do we get them? We could write a piece of code that calculates this area, or we can use the cumulative distribution function of the normal distribution to get these percentages. Luckily for us, scipy provides this as well, as `norm.cdf()`.

In [9]:
# Just run the below code

# area within one, two and three standard deviations of the mean
first_confidence_interval = (norm.cdf(1) -norm.cdf(-1))*100
second_confidence_interval = (norm.cdf(2) -norm.cdf(-2))*100
third_confidence_interval = (norm.cdf(3) -norm.cdf(-3))*100
print("confidence interval values -> {} {} {} ".format(first_confidence_interval, 
                                                       second_confidence_interval,
                                                       third_confidence_interval))

confidence interval values -> 68.26894921370858 95.44997361036415 99.73002039367398 


### Solution code

```python
# Just run the above code
```
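
For comparison, the same areas can be obtained by integrating the density numerically, which is the "write a piece of code that calculates this area" route. This is a minimal sketch using `scipy.integrate.quad`. The dictionary name `areas` is just an illustrative choice.

```python
from scipy.integrate import quad
from scipy.stats import norm

areas = {}
for z in (1, 2, 3):
    # quad returns (integral, estimated absolute error)
    area, _ = quad(norm.pdf, -z, z)
    areas[z] = area
    from_cdf = norm.cdf(z) - norm.cdf(-z)
    print(f"z = {z}: integral = {area * 100:.4f} %, cdf difference = {from_cdf * 100:.4f} %")
```

Both routes give the same percentages. The cumulative distribution function is simply the integral of the density already worked out for us.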

We can actually build a table of confidence interval values based on standard deviation values. To do this let's first write down a general way of coming up with confidence intervals -

confidence interval = $\bar\mu \pm z\bar\sigma$

where $z$ is called the confidence coefficient or the critical value, depending on the use case. $z$ is nothing fancy: for example, when $z = 1$, i.e. we are 1 standard deviation away from the mean, the confidence interval is 68.27%, and when $z = 2$ it is 95.45%. Now $z$ can be any positive real number, but since $z = 3$ already covers 99.73% of the area under the normal curve, we rarely need to go above it. Following this logic, let's build a calculator for confidence intervals.
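
Before building the calculator, here is a rough empirical check of the formula on simulated data: build $\bar\mu \pm z\bar\sigma$ from a sampling distribution and count the fraction of sample means that land inside. The simulated population and the helper name `interval` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

population = rng.normal(loc=50, scale=0.5, size=10_000)
sample_means = np.array([
    rng.choice(population, size=10, replace=False).mean()
    for _ in range(1000)
])

mu_bar = sample_means.mean()     # mean of the sampling distribution
sigma_bar = sample_means.std()   # standard error

def interval(z):
    """Return the interval mu_bar ± z * sigma_bar as (lower, upper)."""
    return mu_bar - z * sigma_bar, mu_bar + z * sigma_bar

fractions = {}
for z in (1, 2, 3):
    lower, upper = interval(z)
    inside = np.mean((sample_means >= lower) & (sample_means <= upper))
    fractions[z] = inside
    print(f"z = {z}: [{lower:.3f}, {upper:.3f}] kg contains {inside * 100:.1f} % of sample means")
```

With only 1,000 sample means the fractions will not match 68.27%, 95.45% and 99.73% exactly, but they should land close to them.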

Question: For a normal distribution with $\mu =0$ and $\sigma =1$, get the values of the confidence interval for z values from 0 to 3 in increments of 0.1. Plot these using either plotly or bokeh. Label the axes appropriately. 



In [6]:
# Just run the below code

#answer
z_values  = np.arange(0,3.1,0.1)
confidence_intervals  = [(norm.cdf(x) -norm.cdf(-x))*100 for x in z_values]


fig = pl.figure(x_range=[0,3], plot_height=400, tools = tools_to_show,title="Confidence Intervals" , x_axis_label= "z-values", y_axis_label ="Confidence interval" )

fig.line(x=z_values, y= confidence_intervals )
fig.xgrid.grid_line_color = None
fig.y_range.start = 0

hover = fig.select(dict(type=HoverTool))
hover.tooltips = [("xvalue", "@x"), ("yvalue", "@y")]



show(fig)


### Solution code

```python
# Just run the above code
```
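
The curve above goes from a z value to a percentage. Going the other way, from a desired percentage to the z value that produces it, can be sketched with `norm.ppf`, the inverse of `norm.cdf`. The helper name `z_for_level` is hypothetical.

```python
from scipy.stats import norm

def z_for_level(level_percent):
    """Return z such that the central area under N(0, 1) equals level_percent."""
    tail = (1 - level_percent / 100) / 2   # area left over in each tail
    return norm.ppf(1 - tail)

for level in (90, 95, 99):
    z = z_for_level(level)
    # Round trip: plugging z back into the cdf recovers the level
    recovered = (norm.cdf(z) - norm.cdf(-z)) * 100
    print(f"{level} % -> z = {z:.3f} (check: {recovered:.2f} %)")
```

This is where the commonly quoted value of roughly 1.96 for a 95% interval comes from.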

Here I would like to get a bit philosophical. What does a confidence interval of 95.45% mean? Does it mean that there is a 95.45% probability that our population mean lies within 2 standard deviations of the mean of the sampling distribution? NO. A confidence interval is really an interval estimate of the error in your sampling method, which is represented by the sampling distribution. Notice that when we calculated the confidence intervals, we never used any information about the population mean. Hence to say that the population mean lies within the 95.45% region of the sampling distribution would be an incorrect statement. To make that kind of statement we need Bayesian methods of estimation, a topic beyond the scope of this section. We will get there. For now, remember that confidence intervals are error estimates for the sampling method you are using, not probabilities about the population mean. 
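
One way to see the percentage as a property of the sampling method is to repeat the whole procedure many times and count how often the interval $\bar x \pm 2\bar\sigma$ built from each sample happens to contain the value used to generate the data. The sketch below does this under stated assumptions: the generating mean of 50 kg and $\sigma$ = 0.5 kg are treated as known, and $\bar\sigma$ is taken as $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma, n, z = 50.0, 0.5, 10, 2
sigma_bar = sigma / np.sqrt(n)   # standard error, assuming sigma is known

trials = 2000
samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))
x_bars = samples.mean(axis=1)    # one sample mean per trial

# For each trial, does the interval x_bar ± z * sigma_bar contain mu?
covered = (x_bars - z * sigma_bar <= mu) & (mu <= x_bars + z * sigma_bar)
coverage = covered.mean()

print(f"{coverage * 100:.1f} % of the {trials} intervals contain mu")
```

The result should land close to 95.45%. It describes how the procedure behaves over many repetitions, and says nothing about any single interval taken on its own.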

# Visualization of confidence intervals 

Lastly, we have built a visualization for you showing the confidence interval for a given z value. The red region is the area under the curve between $-z$ and $+z$. Change the z value on the slider and hit "Run Interact" to play around with the visualization. 

In [10]:
# Just run the below code

# normal distribution of bokeh

conf_int =0 


def get_z(z_value): 
    fig = pl.figure(x_range=[-5,5], plot_height=400, tools = tools_to_show,title="Sampling distribution of means" , x_axis_label= "z value (standard errors from the mean)", y_axis_label ="PDF" )

    fig.line(x=xrange, y= pdf, line_width = 4)

    fig.xgrid.grid_line_color = None
    fig.y_range.start = 0
    hover = fig.select(dict(type=HoverTool))
    hover.tooltips = [("xvalue", "@x"), ("yvalue", "@y")]

    shade_x = np.arange(-z_value,z_value,0.001)
    shade_region=norm.pdf(shade_x)
    shade_region[0] = 0 
    shade_region[-1] = 0
    fig.patch(shade_x, shade_region, color="red", alpha =0.4)
     
    show(fig)

    return z_value


def conf_interval(z_value): 
    z = get_z(z_value )
    conf_int = (norm.cdf(z)-norm.cdf(-z))*100
    print("Confidence interval {} %".format(round(conf_int,2)))
    return None



interact_manual(conf_interval, 
                z_value=widgets.FloatSlider(value =0.5,
                                            min = 0.1,
                                            max =3.0,
                                            continuous_update = False),
                );



### Solution code

```python
# Just run the above code
```

With that we shall be wrapping up this notebook. I hope that you have developed a basic understanding of what confidence intervals are. If not, it would be great to go through the notebook again and attempt some of the labs to gain a better understanding. 

The next topic we are going to cover is one of the most important topics in statistics. So get ready for it!
