## Inferential Statistics

Inferential statistics is a branch of statistics that allows us to make predictions or inferences about a population based on a sample of data taken from that population. It involves the use of statistical tests to make these inferences.

For example, let's say you want to know the average height of all adults in a city. It would be impractical to measure everyone, so instead, you take a random sample of 100 adults, measure their heights, and calculate the average. This average is a statistic.

Now, you want to infer the average height of all adults in the city based on your sample. You might use a confidence interval to say, "I am 95% confident that the average height of all adults in the city is between 5.5 and 6 feet." This is an example of inferential statistics.


## Probability Theory

Probability theory is a branch of mathematics that deals with uncertainty. It provides a mathematical framework for quantifying and reasoning about uncertainty.

Here's a simple example: 

Suppose you have a fair six-sided die and you want to find the probability of rolling a 3. Since the die is fair, each of the six outcomes (1, 2, 3, 4, 5, 6) is equally likely. So, the probability of rolling a 3 is 1 out of 6, or approximately 0.167.

In Python, you might simulate this scenario with the following code:

```python
import random

# Simulate a large number of die rolls
num_rolls = 1000000
rolls = [random.randint(1, 6) for _ in range(num_rolls)]

# Count the number of 3s
num_threes = rolls.count(3)

# Calculate the probability of rolling a 3
probability = num_threes / num_rolls

print(f"The estimated probability of rolling a 3 is {probability}")
```

This code simulates a large number of die rolls and estimates the probability of rolling a 3 by counting the number of 3s and dividing by the total number of rolls. If you run this code, you should find that the estimated probability is close to the theoretical probability of 1/6.

## Probability Distributions

A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events.

For example, if we roll a fair six-sided die, the probability distribution of the outcome is as follows:

- The outcome 1 has a probability of 1/6.
- The outcome 2 has a probability of 1/6.
- The outcome 3 has a probability of 1/6.
- The outcome 4 has a probability of 1/6.
- The outcome 5 has a probability of 1/6.
- The outcome 6 has a probability of 1/6.

This can be represented in Python using a dictionary:

```python
probability_distribution = {
    1: 1/6,
    2: 1/6,
    3: 1/6,
    4: 1/6,
    5: 1/6,
    6: 1/6
}
```

This dictionary represents the probability distribution of the outcome of a single roll of a fair six-sided die. Each key is an outcome and each value is the probability of that outcome.
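As a rough sketch of how such a dictionary might be used, the snippet below checks that the probabilities form a valid distribution (non-negative and summing to 1) and then draws a few rolls from it. The names `is_valid` and `rolls` and the seed value are illustrative choices, not part of the original example.

```python
import math
import random

# Same fair-die distribution as above (outcome -> probability)
probability_distribution = {face: 1/6 for face in range(1, 7)}

# A valid distribution has non-negative probabilities that sum to 1
total = sum(probability_distribution.values())
is_valid = (
    all(p >= 0 for p in probability_distribution.values())
    and math.isclose(total, 1.0)
)
print(f"Valid distribution: {is_valid}")

# Draw 10 rolls, weighting each outcome by its probability
random.seed(0)  # arbitrary seed, used only to make the sketch reproducible
rolls = random.choices(
    population=list(probability_distribution.keys()),
    weights=list(probability_distribution.values()),
    k=10,
)
print(f"Simulated rolls: {rolls}")
```

The `math.isclose` comparison is used instead of `==` because summing floating-point fractions such as 1/6 may not give exactly 1.0.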

## Expected Values

The expected value of a random variable is a key concept in probability and statistics and refers to the long-term average of a random variable over many independent repetitions of an experiment. It's essentially a weighted average of all possible values a random variable can take on, with each value being weighted by its respective probability.

For example, consider a fair six-sided die. The possible outcomes when you roll the die are 1, 2, 3, 4, 5, and 6. Each outcome has a probability of 1/6. The expected value (E) is calculated as follows:

E = (1/6)*1 + (1/6)*2 + (1/6)*3 + (1/6)*4 + (1/6)*5 + (1/6)*6 = 3.5

So, if you roll a fair six-sided die many times, you would expect the average outcome to be 3.5.

Here's how you might calculate this in Python:

```python
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

expected_value = sum(outcome * probability for outcome, probability in zip(outcomes, probabilities))

print(f"The expected value is {expected_value}")
```

This code calculates the expected value of a fair six-sided die roll by summing the product of each outcome and its probability.
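The "long-term average" interpretation can also be checked by simulation. The following is a minimal sketch, assuming NumPy is available; the seed and the number of rolls are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility

# Roll a fair die many times; integers(1, 7) draws whole numbers from 1 to 6
rolls = rng.integers(1, 7, size=100_000)

# The sample mean should settle near the expected value of 3.5
sample_mean = rolls.mean()
print(f"Average of {rolls.size} rolls: {sample_mean:.3f}")
```

With more rolls, the average tends to move closer to 3.5, even though 3.5 itself can never appear on a single roll.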

## Binomial Distribution

The binomial distribution is a probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials with the same probability of success. A Bernoulli trial is an experiment that results in a success with probability p and a failure with probability 1-p.

For example, consider flipping a fair coin 10 times. Each flip is a Bernoulli trial with a success probability (getting heads) of 0.5. The binomial distribution can tell us the probability of getting a certain number of heads (successes) in 10 flips (trials).

The probability mass function of a binomial distribution is given by:

P(X=k) = C(n, k) * (p^k) * ((1-p)^(n-k))

where:
- P(X=k) is the probability of k successes in n trials
- C(n, k) is the number of combinations of n items taken k at a time
- p is the probability of success on a single trial
- n is the number of trials
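To make the formula concrete, here is a small sketch that evaluates it directly with the standard library. The helper name `binomial_pmf` is hypothetical and introduced only for this illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution, computed straight from the formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 5 heads in 10 flips of a fair coin
prob_5_heads = binomial_pmf(5, 10, 0.5)
print(f"P(X = 5) = {prob_5_heads:.4f}")

# The probabilities over all possible k (0 through n) should add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(f"Sum over all k = {total:.4f}")
```

Here C(10, 5) = 252 and 0.5^10 = 1/1024, so the result is 252/1024, or roughly 0.246.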

Here's how you might calculate the probability of getting exactly 5 heads in 10 flips of a fair coin in Python using the `scipy.stats` module:

```python
from scipy.stats import binom

# Number of trials and probability of success
n = 10
p = 0.5

# Number of successes
k = 5

# Calculate binomial probability
probability = binom.pmf(k, n, p)

print(f"The probability of getting exactly {k} heads in {n} flips is {probability}")
```

This code calculates the probability mass function (pmf) of the binomial distribution for k=5, n=10, and p=0.5, which gives the probability of getting exactly 5 heads in 10 flips of a fair coin.

## Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) for a random variable is defined as the probability that the variable takes a value less than or equal to a certain value.

The CDF never decreases: it rises from 0 to 1 as you move from left to right along the x-axis. At any given x, the value of the CDF is the probability that the random variable will take a value less than or equal to x.

For example, if we have a random variable X that follows a uniform distribution from 0 to 10, the CDF at x=5 would be 0.5. This means that there is a 50% chance that X will take a value less than or equal to 5.
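The uniform example can be checked with `scipy.stats.uniform`. This is a minimal sketch; note that SciPy describes the interval by `loc` (the lower bound) and `scale` (the width), so the range 0 to 10 is written as `loc=0, scale=10`.

```python
from scipy.stats import uniform

# Uniform distribution on [loc, loc + scale] = [0, 10]
X = uniform(loc=0, scale=10)

cdf_at_5 = X.cdf(5)     # P(X <= 5), halfway through the interval
cdf_below = X.cdf(-1)   # below the interval, so no probability accumulated yet
cdf_above = X.cdf(12)   # above the interval, so all probability accumulated

print(f"CDF at 5:  {cdf_at_5}")
print(f"CDF at -1: {cdf_below}")
print(f"CDF at 12: {cdf_above}")
```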

Here's how you might plot the CDF for a standard normal distribution in Python using the `scipy.stats` module:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate a range of x values
x = np.linspace(-4, 4, 1000)

# Calculate the CDF at each x value
cdf = norm.cdf(x)

# Plot the CDF
plt.plot(x, cdf)
plt.title('CDF of a Standard Normal Distribution')
plt.xlabel('x')
plt.ylabel('CDF')
plt.grid(True)
plt.show()
```

This code generates a range of x values from -4 to 4, calculates the CDF of the standard normal distribution at each x value, and then plots the CDF. The resulting plot shows that the CDF of the standard normal distribution increases monotonically from 0 to 1 as you move along the x-axis.

## Probability Density Function (PDF)

The probability density function (PDF) describes the probability distribution of a continuous random variable, as opposed to a discrete one. For a continuous variable, the probability of taking any single exact value is zero, so the PDF is used to find the probability that the variable falls within a range of values: that probability is the area under the PDF curve over the range.

For example, consider a random variable X that follows a standard normal distribution. The PDF of the standard normal distribution is given by:

f(x) = (1 / sqrt(2π)) * e^(-(x^2)/2)

Here's how you might plot the PDF for a standard normal distribution in Python using the `scipy.stats` module:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate a range of x values
x = np.linspace(-4, 4, 1000)

# Calculate the PDF at each x value
pdf = norm.pdf(x)

# Plot the PDF
plt.plot(x, pdf)
plt.title('PDF of a Standard Normal Distribution')
plt.xlabel('x')
plt.ylabel('PDF')
plt.grid(True)
plt.show()
```

This code generates a range of x values from -4 to 4, calculates the PDF of the standard normal distribution at each x value, and then plots the PDF. The resulting plot shows the familiar bell curve shape of the standard normal distribution.
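Because probabilities for a continuous variable come from areas under the PDF, it can help to compute one such area. The sketch below, which assumes `scipy.integrate.quad` for numerical integration, finds the probability that a standard normal variable lies between -1 and 1 in two ways.

```python
from scipy.integrate import quad
from scipy.stats import norm

# P(-1 <= X <= 1) as the area under the PDF between -1 and 1
area, _abs_error = quad(norm.pdf, -1, 1)

# The same probability obtained as a difference of CDF values
cdf_difference = norm.cdf(1) - norm.cdf(-1)

print(f"Area under PDF: {area:.4f}")
print(f"CDF difference: {cdf_difference:.4f}")
```

Both approaches should agree at roughly 0.68, which also illustrates that the CDF is the running integral of the PDF.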

## Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a type of continuous probability distribution for a real-valued random variable. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. 

The normal distribution is defined by two parameters: the mean (μ) and the standard deviation (σ). The mean defines the location of the peak and the standard deviation defines the height and width of the distribution.

The probability density function of a normal distribution is given by:

f(x) = (1 / (σ * sqrt(2π))) * e^(-(x-μ)^2 / (2σ^2))

Here's how you might plot the PDF of a normal distribution with a mean of 0 and a standard deviation of 1 (a standard normal distribution) in Python:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters of the distribution
mu = 0
sigma = 1

# Generate a range of x values
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 100)

# Calculate the PDF at each x value
pdf = norm.pdf(x, mu, sigma)

# Plot the PDF
plt.plot(x, pdf)
plt.title('Normal Distribution')
plt.xlabel('x')
plt.ylabel('PDF')
plt.grid(True)
plt.show()
```

This code generates a range of x values from μ - 4σ to μ + 4σ, calculates the PDF of the normal distribution at each x value, and then plots the PDF. The resulting plot shows the familiar bell curve shape of the normal distribution.
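To connect the formula to the library call, the following sketch evaluates the density directly from the formula and compares it with `norm.pdf`. The function name `normal_pdf` and the parameter values (a mean of 170 and a standard deviation of 10) are hypothetical, chosen only to show a non-standard case.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Normal density evaluated directly from the formula above."""
    coefficient = 1 / (sigma * np.sqrt(2 * np.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * np.exp(exponent)

# Hypothetical parameters, for illustration only
mu, sigma = 170, 10
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)

# The hand-written formula and scipy should give the same values
matches = np.allclose(normal_pdf(x, mu, sigma), norm.pdf(x, mu, sigma))
print(f"Formula matches scipy: {matches}")
```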

## Z-Score

A Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values. It is measured in standard deviations from the mean. A Z-score of 0 indicates that the value is identical to the mean. A Z-score of 1.0 indicates a value one standard deviation above the mean, and a Z-score of -1.0 indicates a value one standard deviation below it.

The formula to calculate a Z-score is:

Z = (X - μ) / σ

where:
- Z is the Z-score,
- X is the value of the element,
- μ is the population mean,
- σ is the population standard deviation.

Here's how you might calculate the Z-score for a given array of values in Python:



```python
import numpy as np

# Given data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Calculate mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)

# Calculate Z-scores
z_scores = (data - mean) / std_dev

print(f"Z-scores: {z_scores}")
```

Output:

```
Z-scores: [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```




This code calculates the Z-scores for each value in the data array. The Z-scores tell you how many standard deviations each value is from the mean.
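SciPy also provides `scipy.stats.zscore`, which performs the same calculation. The sketch below is one possible use of it; the `ddof` argument controls whether the population standard deviation (`ddof=0`, the default, matching `np.std`) or the sample standard deviation (`ddof=1`) is used.

```python
import numpy as np
from scipy.stats import zscore

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population version: divides by n, same result as (data - mean) / np.std(data)
z_population = zscore(data)

# Sample version: divides by n - 1, so the Z-scores are slightly smaller in magnitude
z_sample = zscore(data, ddof=1)

print(f"Population Z-scores: {z_population}")
print(f"Sample Z-scores:     {np.round(z_sample, 3)}")
```

Which version is appropriate depends on whether the data are treated as the whole population or as a sample from it.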

## Sampling

Sampling is a fundamental concept in inferential statistics, where a subset of data is taken from a larger population to make inferences about the entire population. Here's a simple example to illustrate sampling for beginners:

Suppose you have a bag containing 100 marbles, with 50 red marbles and 50 blue marbles. You want to know the proportion of red marbles in the bag without looking at all the marbles.

Here's how you can use sampling to estimate the proportion of red marbles:

1. **Define the population and the parameter**: The population is the set of all marbles in the bag, and the parameter of interest is the proportion of red marbles.

2. **Draw a sample**: Randomly select a subset of marbles from the bag. For example, you could draw 10 marbles.

3. **Calculate the sample statistic**: Count the number of red marbles in the sample. In this case, if you drew 6 red marbles, the sample statistic would be 6/10 or 0.6.

4. **Repeat the sampling process**: Draw another sample of 10 marbles and calculate the proportion of red marbles. Repeat this process multiple times (e.g., 100 times) to get a collection of sample statistics.

5. **Analyze the sample statistics**: Calculate the mean and standard deviation of the sample statistics. The mean will be an estimate of the population parameter (the proportion of red marbles), and the standard deviation will give you an idea of how much variability there is in your estimates.

6. **Interpret the results**: Based on the mean and standard deviation of your sample statistics, you can make inferences about the proportion of red marbles in the bag. For example, if the mean is 0.6 and the standard deviation is 0.1, you can say that your estimate of the proportion of red marbles is 0.6 ± 0.1, or between 0.5 and 0.7.

This example demonstrates how sampling can be used to estimate population parameters, even when it's not feasible to look at the entire population.

Here's a Python code snippet that simulates the sampling process:

```python
import numpy as np

# Define the population
population = np.array(['red'] * 50 + ['blue'] * 50)

# Set the number of samples and the size of each sample
num_samples = 100
sample_size = 10

# Initialize an empty list to store the sample statistics
sample_statistics = []

# Draw samples and calculate the proportion of red marbles in each sample
# (np.random.choice samples with replacement by default)
for _ in range(num_samples):
    sample = np.random.choice(population, size=sample_size)
    red_marbles_proportion = np.mean(sample == 'red')
    sample_statistics.append(red_marbles_proportion)

# Calculate the mean and standard deviation of the sample statistics
mean_proportion = np.mean(sample_statistics)
std_dev_proportion = np.std(sample_statistics)

print(f"Mean proportion of red marbles: {mean_proportion:.2f}")
print(f"Standard deviation of proportion: {std_dev_proportion:.2f}")
```

This code will output the mean and standard deviation of the sample statistics, which can be used to make inferences about the proportion of red marbles in the bag.

## Sampling Distribution

The sampling distribution is a probability distribution that describes the possible values of a sample statistic (e.g., the mean, median, or proportion) when repeatedly drawing samples from a population. It is a fundamental concept in inferential statistics, as it helps us understand the variability and uncertainty in our estimates.

Here's a simple example to illustrate the concept of a sampling distribution:

Suppose you have a bag containing 100 marbles, with 50 red marbles and 50 blue marbles. You want to know the proportion of red marbles in the bag without looking at all the marbles.

As mentioned earlier, you can use sampling to estimate the proportion of red marbles. Let's repeatedly draw samples of 10 marbles from the bag and calculate the proportion of red marbles in each sample:

1. Sample 1: Draw 10 marbles, and the proportion of red marbles is 6/10 or 0.6.
2. Sample 2: Draw another 10 marbles, and the proportion of red marbles is 4/10 or 0.4.
3. Sample 3: Draw another 10 marbles, and the proportion of red marbles is 7/10 or 0.7.
4. ...

By repeating this process many times (e.g., 100 times), you can create a collection of sample statistics (proportions of red marbles). The distribution of these sample statistics will form a sampling distribution.

In this example, the sampling distribution of the proportion of red marbles will have the following characteristics:

1. **Central tendency**: The mean of the sampling distribution will be close to the true population parameter (the proportion of red marbles in the bag). In this case, the mean will be approximately 0.5.

2. **Spread**: The standard deviation of the sampling distribution, also called the standard error, indicates the variability of the sample statistics. A larger standard deviation means that the sample statistics are more spread out, indicating higher variability in the estimates. In this case, with samples of 10 marbles and a true proportion of 0.5, the standard deviation will be about 0.16, since sqrt(0.5 * 0.5 / 10) ≈ 0.158.

3. **Normality**: For many population distributions and sample sizes, the sampling distribution will be approximately normal. This means that the distribution will have a bell-shaped curve, with most of the values falling within one standard deviation of the mean.

The sampling distribution is crucial for making inferences about the population parameter from a sample. By understanding the properties of the sampling distribution, we can estimate the precision of our estimates and make more informed decisions.

Here's a Python code snippet that simulates the sampling distribution of the proportion of red marbles:

```python
import numpy as np

# Define the population
population = np.array(['red'] * 50 + ['blue'] * 50)

# Set the number of samples and the size of each sample
num_samples = 1000
sample_size = 10

# Initialize an empty list to store the sample statistics
sample_statistics = []

# Draw samples and calculate the proportion of red marbles in each sample
# (np.random.choice samples with replacement by default)
for _ in range(num_samples):
    sample = np.random.choice(population, size=sample_size)
    red_marbles_proportion = np.mean(sample == 'red')
    sample_statistics.append(red_marbles_proportion)

# Calculate the mean and standard deviation of the sample statistics
mean_proportion = np.mean(sample_statistics)
std_dev_proportion = np.std(sample_statistics)

print(f"Mean proportion of red marbles: {mean_proportion:.2f}")
print(f"Standard deviation of proportion: {std_dev_proportion:.2f}")
```

This code will output the mean and standard deviation of the sample statistics, which can be used to understand the properties of the sampling distribution.
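The spread of a simulated sampling distribution can be compared with the standard formula for the standard error of a proportion, sqrt(p(1 - p) / n). The sketch below computes it for the marble example. The finite population correction shown for drawing without replacement is an addition for illustration; the simulation code in this section samples with replacement by default.

```python
import math

p = 0.5           # true proportion of red marbles in the bag
n = 10            # marbles per sample
population = 100  # total marbles in the bag

# Standard error when sampling with replacement
standard_error = math.sqrt(p * (1 - p) / n)

# When marbles are not put back, a finite population correction shrinks it slightly
correction = math.sqrt((population - n) / (population - 1))
se_without_replacement = standard_error * correction

print(f"Standard error (with replacement):    {standard_error:.4f}")
print(f"Standard error (without replacement): {se_without_replacement:.4f}")
```

A simulated standard deviation close to these values suggests the simulation is behaving as the theory predicts.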

## Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental result in probability theory and statistics. It states that the distribution of the sum (or average) of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution, regardless of the shape of the original distribution.

Here's a simple example to illustrate the Central Limit Theorem:

Suppose you have a fair six-sided die (a uniform distribution) and you roll the die 100 times. The average of the rolls will be approximately 3.5 (since the expected value of a fair six-sided die is 3.5).

Now, let's repeat this experiment 10,000 times and calculate the average of each set of 100 rolls. The distribution of these averages will approach a normal distribution, even though the original distribution of individual rolls is not normal.

The Central Limit Theorem is important because it allows us to make inferences about a population based on a sample even when the population itself is not normally distributed, provided the sample size is large enough. It also helps us understand the properties of statistical estimators, such as the sample mean, and how they are related to the underlying population.

Here's a Python code snippet that simulates the Central Limit Theorem:

```python
import numpy as np
import matplotlib.pyplot as plt

# Set the number of experiments and the number of rolls per experiment
num_experiments = 10000
num_rolls = 100

# Initialize an empty list to store the averages of each experiment
averages = []

# Run the experiments
for _ in range(num_experiments):
    # Generate random rolls from a fair six-sided die
    rolls = np.random.randint(1, 7, size=num_rolls)
    
    # Calculate the average of the rolls
    average = np.mean(rolls)
    
    # Add the average to the list of averages
    averages.append(average)

# Plot the distribution of averages
plt.hist(averages, bins=20, density=True)
plt.title('Distribution of Averages')
plt.xlabel('Average')
plt.ylabel('Density')  # density=True scales the bars so the total area is 1
plt.show()
```

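This code will generate a histogram of the distribution of averages, which should look approximately normal and be centered near 3.5. The approximation to a normal distribution improves as the number of rolls per experiment increases; running more experiments only makes the histogram smoother.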
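Beyond the shape of the histogram, the CLT also predicts the center and spread of the averages: the mean should be near 3.5 and the standard deviation near σ / sqrt(n), where σ is the standard deviation of a single roll. The following sketch checks this numerically, using a vectorized simulation and an arbitrary seed; for a fair six-sided die the variance of one roll is 35/12.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

num_experiments = 10_000
num_rolls = 100

# One row per experiment, one column per roll
rolls = rng.integers(1, 7, size=(num_experiments, num_rolls))
averages = rolls.mean(axis=1)

# Standard deviation of a single fair die roll: sqrt(35/12)
sigma = np.sqrt(35 / 12)
predicted_std = sigma / np.sqrt(num_rolls)

print(f"Mean of averages:       {averages.mean():.3f}")
print(f"Std of averages:        {averages.std():.4f}")
print(f"Predicted std (CLT):    {predicted_std:.4f}")
```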

In summary, the Central Limit Theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed random variables approaches a normal distribution, regardless of the shape of the original distribution. This is what lets us make inferences about a population from a sample even when the population is not normally distributed, as long as the sample size is large enough.

## Confidence Interval

A confidence interval is a statistical concept that provides a range of values within which an unknown population parameter is likely to fall. It is used to estimate the uncertainty or variability of a population parameter, such as the mean, proportion, or difference between two means.

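The confidence interval is calculated based on a sample from the population and is used to make inferences about the population parameter. The confidence level, often denoted by "1 - α" or "100(1 - α)%", describes the long-run behavior of the procedure: if you repeated the sampling many times and built an interval from each sample, about 100(1 - α)% of those intervals would contain the true population parameter. It is not the probability that one particular computed interval contains the parameter, because the parameter is fixed and any given interval either contains it or does not.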

Here's a simple example to illustrate the concept of a confidence interval:

Suppose you have a sample of 100 students' heights, and you want to estimate the average height of all students in the population. You calculate the sample mean and find that it is 175 cm. To determine the confidence interval, you would use a t-distribution or a normal distribution, depending on the sample size and the assumptions made about the population.

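Let's assume you use a t-distribution and choose a confidence level of 95% (α = 0.05). You would then calculate the margin of error, which is the half-width of the interval: the critical t-value multiplied by the standard error of the mean (the sample standard deviation divided by the square root of the sample size). As an illustration, if the sample standard deviation were about 10 cm (an assumed figure, not given above), the standard error would be 10 / sqrt(100) = 1 cm, the critical value for 99 degrees of freedom is about 1.98, and the margin of error would be approximately 2 cm.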

The confidence interval would then be calculated as follows:

Lower limit = Sample mean - Margin of error
Upper limit = Sample mean + Margin of error

Lower limit = 175 cm - 2 cm = 173 cm
Upper limit = 175 cm + 2 cm = 177 cm

So, with a 95% confidence level, you would estimate that the average height of all students in the population is likely to be between 173 cm and 177 cm.
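The calculation above can be sketched in Python with a t-distribution. The heights below are simulated, not real data: the population mean of 175 cm, the standard deviation of 10 cm, and the seed are all assumptions made purely so the example has something to run on.

```python
import numpy as np
from scipy import stats

# Simulated sample of 100 heights in cm (hypothetical data, for illustration only)
rng = np.random.default_rng(seed=1)
heights = rng.normal(loc=175, scale=10, size=100)

n = heights.size
sample_mean = heights.mean()

# Standard error uses the sample standard deviation (ddof=1)
standard_error = heights.std(ddof=1) / np.sqrt(n)

# Two-sided critical value for a 95% interval with n - 1 degrees of freedom
confidence = 0.95
t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

margin_of_error = t_critical * standard_error
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"Sample mean:     {sample_mean:.2f} cm")
print(f"Margin of error: {margin_of_error:.2f} cm")
print(f"95% CI:          ({lower:.2f} cm, {upper:.2f} cm)")
```

With a sample of 100, the critical t-value (about 1.98) is already close to the normal value of 1.96, which is why the two distributions give similar intervals for large samples.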

In summary, a confidence interval provides a range of plausible values for an unknown population parameter, based on a sample from the population. The confidence level describes how often intervals built this way would contain the true parameter over many repeated samples; it is not the probability that a single computed interval contains it. Confidence intervals are used to make inferences about population parameters and to express the uncertainty in their estimates.