# Confidence Interval

In statistics, parameters of a population (for instance, the mean and covariance) are usually estimated from a sample, because the entire population is rarely available (for instance, the heights of everyone in a country). The mean (or any other parameter) computed from a sample may differ slightly from the true population mean, and each new sample may give a slightly different value. We therefore look for a range that contains the true mean with a stated level of confidence. This range is called a **Confidence Interval**.

Confidence levels of $95 \%$ or $99 \%$ are commonly used. Informally, a $99 \%$ confidence interval is read as being $99 \%$ confident that the true value lies within it. More precisely, the procedure that builds the interval captures the true value in about $99 \%$ of repeated samples.

## Confidence Interval for Data with Normal Distribution

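A confidence interval for the mean of normally distributed data is built from the following components:

1. **Sample mean ($\bar{x}$)**: The average value of the sample data.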
2. **Z-score or t-score**: A value from the Z or t distribution that corresponds to the desired confidence level. For example, a 95% confidence level corresponds to a Z-score of approximately 1.96 (for large sample sizes or known population standard deviation) or a t-score from the t-distribution for smaller samples with unknown population standard deviation.
3. **Standard deviation ($\sigma$) or Standard error of the mean (SEM)**: The SEM is calculated as $\sigma/\sqrt{n}$ where $\sigma$ is the population standard deviation and $n$ is the sample size. If the population standard deviation is unknown, it is estimated using the sample standard deviation ($s$).
4. **Sample size ($n$)**: The number of observations in the sample.

The formula for a CI around a sample mean, when the population standard deviation is known, is given by:

$ \text{CI} = \bar{x} \pm Z \times \frac{\sigma}{\sqrt{n}} $

where:
- $\bar{x}$ is the sample mean,
- $Z$ is the Z-score corresponding to the desired confidence level,
- $\sigma$ is the population standard deviation, and
- $n$ is the sample size.
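
As a minimal sketch of the formula above, the helper below computes the interval for a given confidence level, assuming the population standard deviation is known. The function name `z_confidence_interval` and the input values (mean 50, $\sigma$ = 10, $n$ = 100) are hypothetical and chosen only for illustration. The critical value comes from Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for the mean when sigma is known.

    Hypothetical helper for illustration; not part of any library.
    """
    # Two-sided critical value: leave (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Illustrative inputs (assumed values, not taken from real data)
lower, upper = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # 95% CI: (48.04, 51.96)
```

Passing `confidence=0.99` widens the interval, because a larger critical value is needed to reach the higher confidence level.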



The Z-score, also known as a standard score, measures the number of standard deviations an individual data point is from the mean of a dataset. It's a way to compare results from different data sets or to standardize scores across different scales. The formula to calculate a Z-score for a given data point is:

$ Z = \frac{(X - \mu)}{\sigma} $

Where:
- $Z$ is the Z-score.
- $X$ is the value of the data point.
- $\mu$ is the mean of the dataset.
- $\sigma$ is the standard deviation of the dataset.

### Steps to Calculate a Z-score:

1. **Calculate the mean ($\mu$) of the dataset:** Add up all the numbers, then divide by the number of numbers.
2. **Calculate the standard deviation ($\sigma$) of the dataset:** This measures the dispersion of the dataset. When the dataset is a sample from a larger population, $\sigma$ is estimated by the sample standard deviation $s$, which divides by $n-1$:
   $ s = \sqrt{\frac{\sum (X_i - \mu)^2}{n-1}} $
   where $X_i$ are the data points, $\mu$ is the mean of the dataset calculated in step 1, and $n$ is the number of data points.
3. **Calculate the Z-score ($Z$) for each data point:** Subtract the mean from the data point, and then divide this by the standard deviation.
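
The three steps can be sketched in a few lines of Python. The `scores` list is made-up data used only for illustration. Note that `statistics.stdev` divides by $n-1$ (the sample standard deviation), so `statistics.pstdev` would be the choice if the data were the whole population.

```python
from statistics import mean, stdev

scores = [70, 75, 80, 85, 90]  # hypothetical data for illustration

# Step 1: mean of the dataset
mu = mean(scores)

# Step 2: sample standard deviation (divides by n - 1)
s = stdev(scores)

# Step 3: Z-score of each data point
z_scores = [(x - mu) / s for x in scores]

for x, z in zip(scores, z_scores):
    print(f"score={x}  z={z:+.3f}")
```

The Z-scores of a dataset always sum to zero, and the point equal to the mean (80 here) gets a Z-score of exactly 0.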

### Example:

Let's say we have a test score of 85 from a class where the mean test score is 80 and the standard deviation is 5.

$ Z = \frac{(85 - 80)}{5} = \frac{5}{5} = 1 $

This means the test score of 85 is 1 standard deviation above the mean of the class scores.

### Interpretation:

- A Z-score of 0 indicates the score is exactly at the mean.
- A positive Z-score indicates the score is above the mean, and a negative Z-score indicates it is below the mean.
- The magnitude of the Z-score shows how far from the mean the data point is in terms of standard deviations. For example, a Z-score of 2 means the data point is 2 standard deviations above the mean, and a Z-score of -2 means it is 2 standard deviations below.



To understand how a $95 \%$ confidence level corresponds to a Z-score of approximately $1.96$, it's important to dive into the concept of normal distribution and the properties of Z-scores.

A Z-score measures how many standard deviations an element is from the mean. In the context of confidence intervals, the Z-score indicates how far we must go from the mean of a distribution to capture a certain proportion of the data.

### Normal Distribution

A normal distribution is a symmetric, bell-shaped distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Key characteristics include:

- It is defined by its mean ($\mu$) and standard deviation ($\sigma$).
- Approximately 68% of the data falls within one standard deviation of the mean.
- Approximately 95% of the data falls within two standard deviations of the mean.
- Approximately 99.7% of the data falls within three standard deviations of the mean.

However, these percentages (68%, 95%, 99.7%) are approximations. For precise calculations, especially for creating confidence intervals, we use the Z-score to determine the exact cutoff points.
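
To see how close the rounded percentages are, the exact proportions can be computed from the standard normal cumulative distribution function. This is a small sketch using the standard-library `statistics.NormalDist`. The names `std_normal` and `coverage` are chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Proportion of the distribution within k standard deviations of the mean
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {p:.4%}")
```

Two standard deviations actually cover about 95.45% of the distribution, which is why a slightly smaller multiplier is needed to capture exactly 95%.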

### Calculating the Z-score for a 95% Confidence Level

The 95% confidence level refers to the percentage of all possible samples whose confidence intervals can be expected to include the true population parameter. To find the Z-score that corresponds to the middle 95% of the distribution, we look at the properties of the normal distribution:

- The middle 95% is centered around the mean, leaving 2.5% of the observations in each tail of the distribution (100% - 95% = 5%, divided by 2 because it's symmetric).
- A Z-score table or the inverse of the standard normal cumulative distribution function can be used to find the critical value (often denoted $Z_{\alpha/2}$, or $Z_{0.025}$ for a 95% confidence interval). This is the Z-score with a cumulative area of 0.975 (100% - 2.5% = 97.5%) to its left.

When we look up this cumulative area in a Z-score table, or use statistical software or a calculator to find the inverse of the standard normal distribution, we find that the Z-score that leaves 2.5% in the upper tail (or captures 97.5% of the data from the left) is approximately 1.96.

This means that to create a 95% confidence interval for the mean of a normally distributed dataset, we would use a Z-score of 1.96 to calculate the margin of error, extending 1.96 standard errors ($\sigma/\sqrt{n}$) to the left and right of the sample mean. Because about 95% of possible sample means fall within 1.96 standard errors of the true mean, an interval built this way contains the true mean for about 95% of samples.
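
The table lookup described above can also be done in code. The sketch below uses `NormalDist().inv_cdf` from Python's standard library to get the critical value for a few common confidence levels. The dictionary name `z_values` is an illustrative choice.

```python
from statistics import NormalDist

z_values = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence  # total area left in the two tails
    # Z-score with cumulative area 1 - alpha/2 to its left
    z_values[confidence] = NormalDist().inv_cdf(1 - alpha / 2)

for confidence, z in z_values.items():
    print(f"{confidence:.0%} confidence level -> Z = {z:.4f}")
```

For the 95% level this gives about 1.9600, matching the value read from a Z-score table.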



## Numerical Example for Confidence Interval

Let's calculate a 95% confidence interval for the mean of a sample with the following characteristics:
- Sample mean ($\bar{x}$) = 100
- Population standard deviation ($\sigma$) = 15
- Sample size ($n$) = 25

The Z-score for a $95\%$ confidence level is approximately $1.96$.

Applying the formula, we first compute the standard error of the mean and then the interval:

$ \text{SEM} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{25}} = 3 $

$ \text{CI} = \bar{x} \pm Z \times \text{SEM} = 100 \pm 1.96 \times 3 = 100 \pm 5.88 $

The $95 \%$ confidence interval for the population mean is therefore approximately $94.12$ to $105.88$. This means we are $95 \%$ confident that the true population mean falls within this range. The margin of error in this case is $5.88$, which is the distance above and below the sample mean that the confidence interval spans.

To interpret this, if you were to take many samples and calculate a $95 \%$ confidence interval for each, about 95% of these intervals would contain the true population mean.
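
This repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a true population mean of 100 and standard deviation of 15 (known here only because the data are simulated), draws many samples of size 25, and counts how often the 95% interval contains the true mean. The seed and the number of trials are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)  # arbitrary seed so the run is repeatable

true_mean, sigma, n = 100.0, 15.0, 25  # assumed population parameters
z = NormalDist().inv_cdf(0.975)        # critical value for 95% confidence
margin = z * sigma / sqrt(n)           # same margin for every sample (sigma known)

trials = 10_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    # Does this sample's interval contain the true mean?
    if x_bar - margin <= true_mean <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, with small run-to-run variation that depends on the seed and the number of trials.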

```python
# Given values
sample_mean = 100  # Sample mean (x̄)
population_std_dev = 15  # Population standard deviation (σ)
sample_size = 25  # Sample size (n)
confidence_level_z_score = 1.96  # Z-score for 95% confidence level

# Calculating the Standard Error of the Mean (SEM)
sem = population_std_dev / (sample_size**0.5)

# Calculating the margin of error
margin_of_error = confidence_level_z_score * sem

# Calculating the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Round to two decimals for display
print((round(lower_bound, 2), round(upper_bound, 2), round(margin_of_error, 2)))
```

```
(94.12, 105.88, 5.88)
```