# Introduction to Statistics in Python: The Normal Distribution

---

## 1. The Normal Distribution

- The **normal distribution** is among the most important probability distributions in statistics.
- Its shape is the famous “bell curve.”
- Many statistical methods assume or use the normal distribution due to its prevalence in real-world data.
![image.png](attachment:99ba7a05-2478-4c30-aff4-2a2a81e9591b.png)

---

## 2. Properties of the Normal Distribution

- **Symmetrical:**  
  The left and right sides of the curve are mirror images.
![image.png](attachment:729c9592-9343-4715-853b-701cef3b6e3b.png)

- **Area under the curve = 1:**  
  For any continuous probability distribution, the total area is 1.
![image.png](attachment:8d4dfb4e-4786-426c-96c1-5c835168f27a.png)

- **Curve never hits zero:**  
  The probability density gets very small at the tails but never actually reaches zero.
![image.png](attachment:2ff7a3b3-8c60-48f1-a0a8-447dca95fc44.png)
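
These three properties can be checked numerically. The sketch below uses the standard normal distribution as an arbitrary example; `scipy.integrate.quad` is brought in here only for numerical integration and is not part of the original notes.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Symmetry: the density at -x equals the density at +x
x = 1.3  # arbitrary illustrative point
is_symmetric = np.isclose(norm.pdf(-x), norm.pdf(x))

# Total area under the curve, integrated numerically over the whole real line
total_area, _ = quad(norm.pdf, -np.inf, np.inf)

# Far out in the tail the density is tiny, but still positive
tail_density = norm.pdf(10)

print(is_symmetric, round(total_area, 6), tail_density)
```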

---

## 3. Described by Mean and Standard Deviation

- A normal distribution is defined by:
    - **Mean (μ):** Center of the distribution.
    - **Standard deviation (σ):** Measures spread (how "wide" the bell is).
- **Standard normal distribution:**  
  Mean 0, standard deviation 1 (μ=0, σ=1).
- Changing mean or standard deviation changes the center and width, but not the shape.
![image.png](attachment:ef2d920e-f5e7-45e6-a57c-475596055bf2.png)
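
Because only the center and width change, any normal distribution can be mapped onto the standard normal by converting a value to a z-score. The sketch below illustrates this with made-up values for `mu`, `sigma`, and `x`.

```python
from scipy.stats import norm

mu, sigma = 161, 7  # illustrative mean and standard deviation
x = 154             # illustrative value

# z-score: how many standard deviations x lies from the mean
z = (x - mu) / sigma

p_direct = norm.cdf(x, loc=mu, scale=sigma)  # probability on the original scale
p_standard = norm.cdf(z)                     # same probability on the standard normal

print(z, p_direct, p_standard)
```

Both calls should return the same probability, which is why tables and rules stated for the standard normal carry over to any mean and standard deviation.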

---

## 4. The Empirical Rule (68-95-99.7 Rule)

- **68%** of data within **1 standard deviation** of the mean
- **95%** within **2 standard deviations**
- **99.7%** within **3 standard deviations**
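
The rule can be reproduced from the standard normal CDF. This is a minimal check, assuming only `scipy.stats.norm`; the dictionary name `within` is introduced here for illustration.

```python
from scipy.stats import norm

# Probability of falling within k standard deviations of the mean
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(k, round(p, 4))
```

The three values come out at roughly 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.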

---

## 5. Real-World Example: Heights

- Many real-world measurements (like human heights) are approximately normally distributed.
- Example:  
  Heights of women (from a survey)
    - Mean (μ): 161 cm
    - Standard deviation (σ): 7 cm
![image.png](attachment:d1033299-0f5c-4c37-b613-b36890258f0f.png)

---

## 6. Approximating Percentages with the Normal Distribution

Since this height data closely resembles the normal distribution, we can take the area under a normal distribution with mean 161 and standard deviation 7 to approximate what percent of women fall into different height ranges.
![image.png](attachment:8d0e18c1-689c-483b-bc11-3022418fc0c5.png)

We use the normal distribution to answer questions like:

### What percent of women are shorter than 154 cm?

![image.png](attachment:aa17abd5-cdb2-437b-9ab9-e9784d2175b7.png)

```python
from scipy.stats import norm

# Calculate cumulative probability up to 154 cm
percent_shorter = norm.cdf(154, 161, 7)
print(percent_shorter)
```

**Output:**
```
0.158655
```

### Line-by-Line Explanation

- `from scipy.stats import norm`  
  *Imports the normal distribution object from scipy.stats.*
- `norm.cdf(154, 161, 7)`  
  *Calculates the cumulative probability (area under the curve) up to 154 cm, for a normal distribution with mean 161 and standard deviation 7.*
- `print(percent_shorter)`  
  *Displays the result.*

**Significance:**  
*About 16% of women are shorter than 154 cm.*

---

### What percent of women are taller than 154 cm?

```python
percent_taller = 1 - norm.cdf(154, 161, 7)
print(percent_taller)
```

**Output:**
```
0.841345
```

### Explanation

- `1 - norm.cdf(154, 161, 7)`  
  *Subtracts the cumulative probability up to 154 cm from 1, giving the area to the right (taller than 154 cm).*

**Significance:**  
*About 84% of women are taller than 154 cm.*
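
As an alternative to subtracting from 1, `scipy.stats` distributions also expose a survival function, `sf`, which returns the upper-tail probability directly. The variable names below are illustrative.

```python
from scipy.stats import norm

percent_taller_sf = norm.sf(154, 161, 7)        # survival function: P(X > 154)
percent_taller_cdf = 1 - norm.cdf(154, 161, 7)  # complement of the CDF

print(percent_taller_sf, percent_taller_cdf)
```

The two approaches agree here; `sf` tends to be more accurate when the upper-tail probability is extremely small.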

---

### What percent of women are between 154 cm and 157 cm?

```python
percent_between = norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)
print(percent_between)
```

**Output:**
```
0.1252
```

### Explanation

- `norm.cdf(157, 161, 7)`  
  *Area (probability) below 157 cm.*
- `norm.cdf(154, 161, 7)`  
  *Area (probability) below 154 cm.*
- `norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)`  
  *Difference gives probability between 154 and 157 cm.*

**Significance:**  
*About 12.5% of women are between 154 and 157 cm tall.*

---

## 7. Using the Inverse: From Percentile to Height

### What height are 90% of women shorter than?

```python
height_90_percentile = norm.ppf(0.9, 161, 7)
print(height_90_percentile)
```

**Output:**
```
169.97086
```

### Explanation

- `norm.ppf(0.9, 161, 7)`  
  *Finds the height below which 90% of the data falls (the 90th percentile).*

**Significance:**  
*90% of women are shorter than about 170 cm.*

---

### What height are 90% of women taller than?

```python
height_10_percentile = norm.ppf(1 - 0.9, 161, 7)
print(height_10_percentile)
```

**Output:**
```
152.029
```

### Explanation

- `norm.ppf(1 - 0.9, 161, 7)` or `norm.ppf(0.1, 161, 7)`  
  *Finds the height below which 10% of the data falls (so 90% are taller).*

**Significance:**  
*90% of women are taller than about 152 cm.*
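
The same answer can be obtained with `norm.isf`, the inverse survival function, which takes the fraction lying above the value. The sketch also feeds the result back into `cdf` to show that `ppf` and `cdf` undo each other; variable names are illustrative.

```python
from scipy.stats import norm

# isf(q) returns the value that a fraction q of the distribution lies above
height_isf = norm.isf(0.9, 161, 7)
height_ppf = norm.ppf(0.1, 161, 7)

# Round trip: passing the height back into cdf recovers the probability 0.1
recovered = norm.cdf(height_ppf, 161, 7)

print(height_isf, height_ppf, recovered)
```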

---

## 8. Generating Random Numbers from the Normal Distribution

### Example: Generate 10 random heights

```python
random_heights = norm.rvs(161, 7, size=10)
print(random_heights)
```

**Possible Output:**
```
[155.5758223  155.13133235 160.06377097 168.33345778 165.92273375
 163.32677057 165.13280753 146.36133538 149.07845021 160.5790856 ]
```

### Line-by-Line Explanation

- `norm.rvs(161, 7, size=10)`  
  *Generates 10 random samples from a normal distribution with mean 161 and standard deviation 7.*
- `print(random_heights)`  
  *Prints the array of simulated heights.*

**Significance:**  
*Each value represents a simulated woman's height, based on our normal model.*
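
Because the output changes on every run, it can help to fix the random seed when results need to be reproducible. The sketch below uses the `random_state` argument of `rvs`; the seed value 42 is arbitrary.

```python
from scipy.stats import norm

# The same seed gives the same simulated heights
heights_a = norm.rvs(161, 7, size=10, random_state=42)
heights_b = norm.rvs(161, 7, size=10, random_state=42)

print((heights_a == heights_b).all())
```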

---

## 9. Key Points

- The **normal distribution** models many real-world variables.
- Use `norm.cdf` to find the probability of being below a certain value.
- Use `1 - norm.cdf` for the probability above a value.
- Subtract two `cdf` values to find the probability between two values.
- Use `norm.ppf` for percentiles (inverse: find value for a given probability).
- Use `norm.rvs` to generate random samples.
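
To avoid repeating the mean and standard deviation in every call, the parameters can be fixed once by creating a "frozen" distribution. This is an optional convenience, not something the examples above rely on; the name `heights` is illustrative.

```python
from scipy.stats import norm

# Freeze the parameters once: mean 161 cm, standard deviation 7 cm
heights = norm(loc=161, scale=7)

below_154 = heights.cdf(154)                    # P(X < 154)
above_154 = heights.sf(154)                     # P(X > 154)
between = heights.cdf(157) - heights.cdf(154)   # P(154 < X < 157)
p90 = heights.ppf(0.9)                          # 90th percentile

print(below_154, above_154, between, p90)
```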

---


### Exercise: Distribution of Amir's sales

Since each deal Amir worked on (both won and lost) was different, each was worth a different amount of money. These values are stored in the amount column of amir_deals. As part of Amir's performance review, you want to be able to estimate the probability of him selling different amounts, but before you can do this, you'll need to determine what kind of distribution the amount variable follows.

Both pandas as pd and matplotlib.pyplot as plt are loaded and amir_deals is available.

Instructions

Create a histogram with 10 bins to visualize the distribution of the amount. Show the plot.
```python
# Histogram of amount with 10 bins and show plot
amir_deals['amount'].hist(bins =10)
plt.show()

```
![image.png](attachment:0bc6d289-b69a-4d6d-b9d7-9c33174f010d.png)

**Question:**  
Which probability distribution do the sales amounts most closely follow?

Possible answers:

- Uniform
- Binomial
- Normal
- None of the above

**Answer:** Normal (option C).


### Exercise: Probabilities from the normal distribution

Since each deal Amir worked on (both won and lost) was different, each was worth a different amount of money. These values are stored in the amount column of amir_deals and follow a normal distribution with a mean of 5000 dollars and a standard deviation of 2000 dollars. As part of his performance metrics, you want to calculate the probability of Amir closing a deal worth various amounts.

norm from scipy.stats is imported as well as pandas as pd. The DataFrame amir_deals is loaded.

Instructions
What's the probability of Amir closing a deal worth less than $7500?


```python
# Probability of deal < 7500
prob_less_7500 = norm.cdf( 7500, 5000, 2000)

print(prob_less_7500)

# Output:
#     0.8943502263331446

```

What's the probability of Amir closing a deal worth more than $1000?

```python
# Probability of deal > 1000
prob_over_1000 = 1 - norm.cdf( 1000, 5000, 2000)

print(prob_over_1000)

# Output:
#     0.9772498680518208
```

What's the probability of Amir closing a deal worth between $3000 and $7000?

```python
# Probability of deal between 3000 and 7000
prob_3000_to_7000 = norm.cdf(7000, 5000, 2000) - norm.cdf(3000, 5000, 2000)

print(prob_3000_to_7000)

# Output:
#     0.6826894921370859
```

What amount will 25% of Amir's sales be less than?

```python
# Calculate amount that 25% of deals will be less than
pct_25 = norm.ppf(0.25, 5000, 2000)

print(pct_25)

# Output:
#     3651.0204996078364
```

### Exercise: Simulating sales under new market conditions

The company's financial analyst is predicting that next quarter, the worth of each sale will increase by 20% and the volatility, or standard deviation, of each sale's worth will increase by 30%. To see what Amir's sales might look like next quarter under these new market conditions, you'll simulate new sales amounts using the normal distribution and store these in the new_sales DataFrame, which has already been created for you.

In addition, norm from scipy.stats, pandas as pd, and matplotlib.pyplot as plt are loaded.

Instructions

- Currently, Amir's average sale amount is $5000. Calculate what his new average amount will be if it increases by 20% and store this in new_mean.
- Amir's current standard deviation is $2000. Calculate what his new standard deviation will be if it increases by 30% and store this in new_sd.
- Create a variable called new_sales, which contains 36 simulated amounts from a normal distribution with a mean of new_mean and a standard deviation of new_sd.
- Plot the distribution of the new_sales amounts using a histogram and show the plot.

```python
# Calculate new average amount
new_mean = 5000 + 5000*0.2

# Calculate new standard deviation
new_sd = 2000 + 2000*0.3

# Simulate 36 new sales
new_sales = norm.rvs( new_mean, new_sd, size=36 )

# Create histogram and show
plt.hist(new_sales)
plt.show()

```
![image.png](attachment:530c5431-9262-4ff0-87af-926feb2a4549.png)

### Exercise: Which market is better?

The key metric that the company uses to evaluate salespeople is the percent of sales they make over $1000, since the time put into each sale is usually worth a bit more than that. The higher this metric, the better the salesperson is performing.

Recall that Amir's current sales amounts have a mean of $5000 and a standard deviation of $2000, and Amir's predicted amounts in next quarter's market have a mean of $6000 and a standard deviation of $2600.

norm from scipy.stats is imported.

Based only on the metric of percent of sales over $1000, does Amir perform better in the current market or the predicted market?

Possible answers:

- Amir performs much better in the current market.
- Amir performs much better in next quarter's predicted market.
- Amir performs about equally in both markets.

**Answer:** Amir performs about equally in both markets (option C).


```python
from scipy.stats import norm

# Current market parameters
mean_current = 5000
sd_current = 2000

# Predicted market parameters
mean_pred = 6000
sd_pred = 2600

# Probability of sale > 1000 in current market
p_current = 1 - norm.cdf(1000, loc=mean_current, scale=sd_current)

# Probability of sale > 1000 in predicted market
p_pred = 1 - norm.cdf(1000, loc=mean_pred, scale=sd_pred)

print("Current market:", p_current)
print("Predicted market:", p_pred)

# Output (approximately):
#     Current market: 0.9772
#     Predicted market: 0.9727
```

# The Central Limit Theorem (CLT) in Python

---

## 1. Introduction: Why the Normal Distribution is Important

- The **Central Limit Theorem (CLT)** explains why the normal distribution appears so frequently in statistics.
- It helps us understand the behavior of sampling distributions, even when the original data is not normal.

---

## 2. Rolling the Dice: Simulating Sampling

Suppose we have a die with faces 1 to 6 stored as a pandas Series called `die`:

```python
import pandas as pd
import numpy as np

die = pd.Series([1, 2, 3, 4, 5, 6])
```

**Simulate rolling the die 5 times and compute the mean:**

```python
sample = die.sample(5, replace=True)
mean_result = np.mean(sample)
print(sample.values)
print(mean_result)
```

**Sample Output:**
```
[5 2 2 1 4]
2.8
```

### Explanation

- `die.sample(5, replace=True)`  
  *Randomly samples 5 values from the die, with replacement (so the same number can appear multiple times).*
- `np.mean(sample)`  
  *Calculates the average (mean) of the 5 sampled values.*
- `print(sample.values)`  
  *Displays the 5 numbers rolled.*
- `print(mean_result)`  
  *Displays the mean of those rolls.*

**Significance:**  
Each time you do this, you get a different sample mean. The process introduces variability, which is at the heart of sampling distributions.

---

## 3. Repeating the Process: Building a Sampling Distribution

### Rolling the die 5 times, repeated 10 times

```python
sample_means = []
for i in range(10):
    sample = die.sample(5, replace=True)
    sample_means.append(np.mean(sample))
print(sample_means)
```

**Sample Output:**
```
[3.2, 2.8, 3.6, 4.0, 2.4, 3.2, 4.2, 3.4, 3.0, 3.8]
```

### Explanation

- `sample_means = []`  
  *Initializes an empty list to store sample means.*
- `for i in range(10):`  
  *Repeats the sampling process 10 times.*
- `sample = die.sample(5, replace=True)`  
  *Each iteration, sample 5 dice rolls with replacement.*
- `np.mean(sample)`  
  *Computes the mean of each sample.*
- `sample_means.append(...)`  
  *Appends each mean to the list.*
- `print(sample_means)`  
  *Shows all 10 sample means.*

**Significance:**  
This creates a *sampling distribution* of the mean from repeated samples.

![image.png](attachment:335238bb-5f22-4631-af28-693c6abaacb5.png)

---

## 4. Increasing Sample Size: 100 and 1000 Sample Means

### 100 Sample Means

![image.png](attachment:9dbc1587-6ea7-40d2-95c5-967a45782a04.png)

```python
sample_means = []
for i in range(100):
    sample_means.append(np.mean(die.sample(5, replace=True)))
print(sample_means[:10])  # Show first 10 means
```

**Sample Output:**
```
[3.2, 3.0, 3.6, 3.8, 2.4, 3.4, 2.2, 2.6, 2.8, 3.2]
```

### 1000 Sample Means

![image.png](attachment:139d6022-1905-45ac-b9e0-f25e09b176ea.png)

This sampling distribution more closely resembles the normal distribution.

```python
sample_means = []
for i in range(1000):
    sample_means.append(np.mean(die.sample(5, replace=True)))
print(sample_means[:10])  # Show first 10 means
```

**Sample Output:**
```
[3.8, 2.6, 2.8, 3.2, 2.8, 2.4, 3.0, 3.8, 3.6, 3.0]
```

### Explanation

- The same logic as above, just more repetitions (100 or 1000).
- With more sample means, the histogram traces the shape of the sampling distribution more clearly, and that shape is approximately normal (bell-shaped).

**Significance:**  
- Even though the original die is uniform (all outcomes equally likely), the distribution of the sample means is approximately normal.  
- This is the Central Limit Theorem in action.
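
The loops above can also be written without an explicit `for` loop. The sketch below swaps `die.sample` for NumPy's random `Generator` (an alternative technique, not the one used in these notes) and draws all 1000 samples of 5 rolls at once; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# 1000 samples of 5 rolls each; every row is one sample (upper bound 7 is exclusive)
rolls = rng.integers(1, 7, size=(1000, 5))

# One mean per row gives 1000 sample means
sample_means = rolls.mean(axis=1)

print(sample_means[:10])    # first 10 sample means
print(sample_means.mean())  # should land near 3.5, the expected value of a fair die
```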

---

## 5. Central Limit Theorem Explained

- **CLT Statement:**  
  The sampling distribution of a statistic (like the mean) becomes closer to a normal distribution as the sample size increases, regardless of the underlying population's distribution. Drawing more samples does not change that distribution; it only reveals its shape more clearly.
- **Requirements:**  
  - Samples must be random.
  - Samples must be independent.
![image.png](attachment:7040d716-9792-449b-9863-ce53641e0035.png)
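
One way to see the role of sample size is to watch the spread of the sample means shrink as each sample gets larger. For the mean, theory gives a spread of about σ/√n. The sketch below is a rough simulation under assumed settings (5000 repetitions, sample sizes 5, 20, and 80, arbitrary seed), using NumPy directly rather than `die.sample`.

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed
die_sd = np.std(np.arange(1, 7))      # population SD of a fair die

spread = {}
for n in (5, 20, 80):
    # 5000 samples of size n, one mean per sample
    means = rng.integers(1, 7, size=(5000, n)).mean(axis=1)
    spread[n] = means.std()
    # Compare the simulated spread with the theoretical value sigma / sqrt(n)
    print(n, round(spread[n], 3), round(die_sd / np.sqrt(n), 3))
```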

---

## 6. Standard Deviation and the CLT

You can also look at the **standard deviation** of samples:

```python
sample_sds = []
for i in range(1000):
    sample_sds.append(np.std(die.sample(5, replace=True)))
print(sample_sds[:10])
```

**Sample Output:**
```
[1.9235384061671346, 2.0591260281974, 1.4142135623730951, 2.0591260281974, 2.0591260281974, 1.6733200530681511, 1.6733200530681511, 2.0591260281974, 1.6733200530681511, 1.6733200530681511]
```

### Explanation

- `sample_sds = []`  
  *Empty list for sample standard deviations.*
- `for i in range(1000):`  
  *Repeat 1000 times.*
- `die.sample(5, replace=True)`  
  *Sample 5 dice rolls with replacement.*
- `np.std(...)`  
  *Calculate the standard deviation of each sample.*
- `sample_sds.append(...)`  
  *Store each result.*
![image.png](attachment:a534cc29-cf4d-4978-b46f-cc90905d26c3.png)

**Significance:**  
The distribution of sample standard deviations is also approximately normal, centered around the true SD of the die.
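
For reference, the SD of the die itself can be computed directly from its six faces. The sketch below shows both the population form (`ddof=0`, NumPy's default) and the sample form (`ddof=1`); which one a simulation should be compared against depends on which form was used to compute the sample SDs.

```python
import numpy as np

faces = np.arange(1, 7)

population_sd = np.std(faces)       # divides by n, about 1.708
sample_sd = np.std(faces, ddof=1)   # divides by n - 1, about 1.871

print(population_sd, sample_sd)
```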

---

## 7. CLT for Proportions

### Example: Sampling from a Sales Team

![image.png](attachment:560c285c-ba82-47c4-b7d1-39988a9ae647.png)

```python
import pandas as pd
sales_team = pd.Series(["Amir", "Brian", "Claire", "Damian"])

# Sample 10 team members with replacement
sample1 = sales_team.sample(10, replace=True)
print(sample1.values)

sample2 = sales_team.sample(10, replace=True)
print(sample2.values)
```

**Sample Output:**
```
['Claire' 'Damian' 'Brian' 'Damian' 'Damian' 'Amir' 'Amir' 'Amir' 'Amir' 'Damian']
['Brian' 'Amir' 'Brian' 'Claire' 'Brian' 'Damian' 'Claire' 'Brian' 'Claire' 'Claire']
```

### Counting Proportions

Suppose we want to know the proportion of "Claire" in each sample:

```python
prop_claire1 = np.mean(sample1 == "Claire")
prop_claire2 = np.mean(sample2 == "Claire")
print(prop_claire1)
print(prop_claire2)
```

**Sample Output:**
```
0.1
0.4
```

### Explanation

- `sales_team.sample(10, replace=True)`  
  *Randomly picks 10 people, with replacement.*
- `sample1 == "Claire"`  
  *Creates a boolean array where True indicates "Claire".*
- `np.mean(...)`  
  *Fraction of "Claire" in the sample.*

**Significance:**  
Sample proportions vary, but if you repeat many times, the distribution of proportions becomes approximately normal, centered around the true proportion (0.25 for "Claire").

---

## 8. Sampling Distribution of Proportion

If you repeat sampling many times and plot the distribution of sample proportions, it forms a normal-shaped curve centered on the population proportion (here, 0.25).
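
A minimal sketch of that repetition is shown below. It builds a list named `sample_props` with 1000 sample proportions; passing `random_state=i` is an assumption added here purely so the run is reproducible.

```python
import numpy as np
import pandas as pd

sales_team = pd.Series(["Amir", "Brian", "Claire", "Damian"])

sample_props = []
for i in range(1000):
    # Draw 10 names with replacement (random_state only for reproducibility)
    sample = sales_team.sample(10, replace=True, random_state=i)
    # Proportion of the sample that is "Claire"
    sample_props.append(np.mean(sample == "Claire"))

print(np.mean(sample_props))  # should be close to the true proportion, 0.25
```

Plotting `pd.Series(sample_props).hist()` would show the roughly bell-shaped distribution described above.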

---

## 9. Estimating Population Characteristics

### Mean of Sampling Distribution

To estimate the **expected value** of the die (mean of means):
![image.png](attachment:47982d66-efb8-4bb2-be6b-9022f1cb2b4b.png)

```python
# After generating 1000 sample means:
expected_die_mean = np.mean(sample_means)
print(expected_die_mean)
```

**Sample Output:**
```
3.48
```

### Estimate Proportion of "Claire"s

Suppose you saved 1000 sample proportions in a list called `sample_props`:

```python
expected_prop_claire = np.mean(sample_props)
print(expected_prop_claire)
```

**Sample Output:**
```
0.26
```

### Explanation

- `np.mean(sample_means)`  
  *Averages all sample means to estimate the population mean.*
- `np.mean(sample_props)`  
  *Averages all sample proportions to estimate the population proportion.*

**Significance:**  
- These estimates are close to the true values (mean = 3.5 for a fair die, prop = 0.25 for "Claire").
- Even without knowing the full population, repeated sampling gives good estimates.

---

## 10. Why the CLT Matters

- The **Central Limit Theorem** lets us estimate characteristics of large populations using repeated random samples.
- Essential for inferential statistics, confidence intervals, and hypothesis testing.
- Useful when the population is too large to measure directly.

---

## 11. Key Points Recap

- **Sampling distributions** summarize statistics from repeated samples.
- As the sample size grows, the sampling distribution of a statistic such as the mean approaches a normal distribution, even if the original data is not normal.
- The CLT applies to means, standard deviations, proportions, and more.
- You can use Python (NumPy, pandas) to simulate and visualize these concepts.

---


### Exercise: The CLT in action

The central limit theorem states that a sampling distribution of a sample statistic approaches the normal distribution as you take more samples, no matter the original distribution being sampled from.

In this exercise, you'll focus on the sample mean and see the central limit theorem in action while examining the num_users column of amir_deals more closely, which contains the number of people who intend to use the product Amir is selling.

pandas as pd, numpy as np, and matplotlib.pyplot as plt are loaded and amir_deals is available.

Instructions

Create a histogram of the num_users column of amir_deals and show the plot.

```python
# Create a histogram of num_users and show
amir_deals['num_users'].hist()
plt.show()
```
![image.png](attachment:6b457a8c-fb25-4f85-aed0-c19e91d810d1.png)

Set the seed to 104.
Take a sample of size 20 with replacement from the num_users column of amir_deals, and take the mean.

```python
# Set seed to 104
np.random.seed(104)

# Sample 20 num_users with replacement from amir_deals
samp_20 = amir_deals['num_users'].sample(20, replace=True)

# Take mean of samp_20
print(samp_20.mean())

# Output:
#     32.0
```
Repeat this 100 times using a for loop and store as sample_means. This will take 100 different samples and calculate the mean of each.


```python
# Set seed to 104
np.random.seed(104)

# Sample 20 num_users with replacement from amir_deals and take mean
samp_20 = amir_deals['num_users'].sample(20, replace=True)
np.mean(samp_20)

sample_means = []
# Loop 100 times
for i in range(100):
  # Take sample of 20 num_users
  samp_20 = amir_deals['num_users'].sample(20, replace=True)
  # Calculate mean of samp_20
  samp_20_mean = np.mean(samp_20)
  # Append samp_20_mean to sample_means
  sample_means.append(samp_20_mean)
  
print(sample_means)

# Output (truncated):
#     [31.35, 45.05, 33.55, 38.15, 50.85, 31.85, 34.65, 36.25, 38.9, 44.05, 35.45, 37.6, 37.95, 28.85, ...]
```
Convert sample_means into a pd.Series, create a histogram of the sample_means, and show the plot.

```python
# Set seed to 104
np.random.seed(104)

sample_means = []
# Loop 100 times
for i in range(100):
  # Take sample of 20 num_users
  samp_20 = amir_deals['num_users'].sample(20, replace=True)
  # Calculate mean of samp_20
  samp_20_mean = np.mean(samp_20)
  # Append samp_20_mean to sample_means
  sample_means.append(samp_20_mean)
  
# Convert to Series and plot histogram
sample_means_series = pd.Series(sample_means)
sample_means_series.hist()
# Show plot
plt.show()
```

![image.png](attachment:7d3151a1-dedb-4edb-9349-96a89a37bf41.png)

### Exercise: The mean of means

You want to know what the average number of users (num_users) is per deal, but you want to know this number for the entire company so that you can see if Amir's deals have more or fewer users than the company's average deal. The problem is that over the past year, the company has worked on more than ten thousand deals, so it's not realistic to compile all the data. Instead, you'll estimate the mean by taking several random samples of deals, since this is much easier than collecting data from everyone in the company.

amir_deals is available and the user data for all the company's deals is available in all_deals. Both pandas as pd and numpy as np are loaded.

Instructions

- Set the random seed to 321.
- Take 30 samples (with replacement) of size 20 from all_deals['num_users'] and take the mean of each sample. Store the sample means in sample_means.
- Print the mean of sample_means.
- Print the mean of the num_users column of amir_deals.

```python
# Set seed to 321
np.random.seed(321)

sample_means = []
# Loop 30 times to take 30 means
for i in range(30):
  # Take sample of size 20 from num_users col of all_deals with replacement
  cur_sample = all_deals['num_users'].sample(20, replace=True)
  # Take mean of cur_sample
  cur_mean = np.mean(cur_sample)
  # Append cur_mean to sample_means
  sample_means.append(cur_mean)

# Print mean of sample_means
print(np.mean(sample_means))

# Print mean of num_users in amir_deals
print(np.mean(amir_deals['num_users']))

# Output:
#     38.31333333333332
#     37.651685393258425
```

# Introduction to Statistics in Python: The Poisson Distribution

---

## 1. The Poisson Distribution

- The **Poisson distribution** is a probability distribution used to model the number of times an event happens in a fixed interval of time or space.
- **Key characteristics:**
    - Events must occur independently.
    - Events happen at a constant average rate.
    - Events are discrete (countable).

---

## 2. Poisson Processes

- A **Poisson process** is a model for events that happen randomly but at a predictable average rate.
- **Examples:**
    - Number of animal adoptions from a shelter per week (e.g., 8 adoptions/week on average).
    - Number of people arriving at a restaurant per hour.
    - Number of earthquakes in California per year.
- **Key point:** The time unit (hour, week, year) can vary, as long as it’s consistent for your data.

---

## 3. The Poisson Distribution in Practice

- The **Poisson distribution** describes the probability of observing a given number of events in a fixed period.
- **Typical questions:**  
    - What is the probability of at least 5 adoptions in a week?  
    - What is the probability of fewer than 20 earthquakes in a year?

---

## 4. Lambda (λ): The Average Rate

The Poisson distribution is described by a value called lambda, which represents the average number of events per time period.

- **Lambda (λ)** is the average number of events per time interval.
    - In the animal shelter example, **λ = 8** (average adoptions per week).
- **λ is also the expected value (mean) of the distribution.**
- The Poisson distribution is **discrete**, since it counts events.
    - The most likely number of adoptions per week is near λ (here, 7 or 8).
![image.png](attachment:da7d87e2-d2e5-40de-9731-6ca412528c30.png)
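
The "7 or 8" above can be checked by evaluating the probability mass function over a range of counts. The sketch below assumes counts from 0 to 20 are enough to cover nearly all of the probability for λ = 8.

```python
import numpy as np
from scipy.stats import poisson

k = np.arange(0, 21)       # counts 0, 1, ..., 20
probs = poisson.pmf(k, 8)  # P(X = k) for lambda = 8

# The most likely count, and a check that 7 and 8 are tied
most_likely = k[np.argmax(probs)]
tied = np.isclose(probs[7], probs[8])

print(most_likely, tied, probs.sum())
```

When λ is a whole number, the counts λ - 1 and λ are equally likely, which is why both 7 and 8 qualify as the peak here.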

---

## 5. Lambda is the Distribution's Peak

- Changing λ changes the **shape** of the distribution.
- The **peak** (mode) of the distribution is at or near λ.
    - For λ = 1, the distribution is more skewed.
    - For λ = 8, the distribution peaks around 8.
![image.png](attachment:98ca5933-9dfb-41af-9538-80826c898376.png)

Lambda changes the shape of the distribution: a Poisson distribution with λ = 1 (blue) looks quite different from one with λ = 8 (green). In every case, the peak sits at or just below λ.

---

## 6. Probability of a Single Value

**Question:**  
If the average number of adoptions per week is 8, what is the probability of exactly 5 adoptions in a week?

**Python Code Example:**

```python
from scipy.stats import poisson

poisson.pmf(5, 8)
```

**Output:**

```
0.09160366159280817
```

**Line-by-Line Explanation:**

- `from scipy.stats import poisson`
    - **What:** Imports the `poisson` distribution object from `scipy.stats`.
    - **Why:** To access Poisson distribution functions, such as `pmf` (probability mass function).
    - **Output/Result:** Enables Poisson probability calculations.
- `poisson.pmf(5, 8)`
    - **What:** Computes the probability of exactly 5 events given λ = 8.
    - **Why:** `pmf(k, mu)` returns P(X = k) for Poisson(μ).
    - **Output/Result:** Returns about `0.0916`, meaning there's a **9.2% chance** of exactly 5 adoptions in a week.

**Significance:**  
This tells us that, even though the average is 8, there's a ~9% chance that exactly 5 adoptions will occur in a week.
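
The value returned by `pmf` follows from the Poisson formula P(X = k) = e^(-λ) · λ^k / k!. As a sanity check, the sketch below evaluates that formula by hand with the standard `math` module and compares it to SciPy; the names `lam`, `k`, and `manual` are illustrative.

```python
import math
from scipy.stats import poisson

lam, k = 8, 5

# Poisson formula: e^(-lambda) * lambda^k / k!
manual = math.exp(-lam) * lam**k / math.factorial(k)

print(manual, poisson.pmf(k, lam))
```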

---

## 7. Probability of Less Than or Equal To

**Question:**  
What is the probability of 5 or fewer adoptions in a week (P(X ≤ 5)), given λ = 8?

**Python Code Example:**

```python
poisson.cdf(5, 8)
```

**Output:**

```
0.19123611850470666
```

**Line-by-Line Explanation:**

- `poisson.cdf(5, 8)`
    - **What:** Computes the cumulative probability (P(X ≤ 5)) for λ = 8.
    - **Why:** `cdf(k, mu)` gives the sum of probabilities of 0, 1, ..., 5 events.
    - **Output/Result:** Returns about `0.1912`, so there's a **19.1% chance** of 5 or fewer adoptions in a week.

**Significance:**  
This cumulative probability is useful for understanding the likelihood of observing up to a certain number of events.
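
Since the distribution is discrete, the cumulative probability is simply the sum of the individual probabilities for 0 through 5. A short check, with the variable name `summed` introduced for illustration:

```python
import numpy as np
from scipy.stats import poisson

# Add up P(X = 0), P(X = 1), ..., P(X = 5) for lambda = 8
summed = poisson.pmf(np.arange(0, 6), 8).sum()

print(summed, poisson.cdf(5, 8))
```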

---

## 8. Probability of Greater Than

**Question:**  
What is the probability of more than 5 adoptions in a week (P(X > 5)), given λ = 8?

**Python Code Example:**

```python
1 - poisson.cdf(5, 8)
```

**Output:**

```
0.8087638814952933
```

**Line-by-Line Explanation:**

- `poisson.cdf(5, 8)`
    - **What:** Calculates P(X ≤ 5).
    - **Why:** We want P(X > 5), so we use the complement.
    - **Output/Result:** Returns the cumulative probability for 5 or fewer adoptions.
- `1 - poisson.cdf(5, 8)`
    - **What:** Calculates P(X > 5).
    - **Why:** The probability of more than 5 events is 1 minus the probability of 5 or fewer.
    - **Output/Result:** Returns about `0.8088`, so there's an **80.9% chance** of more than 5 adoptions in a week.

**Another Example:**  
If λ = 10, what is P(X > 5)?

```python
1 - poisson.cdf(5, 10)
```

**Output:**

```
0.9329140371209682
```

- **Interpretation:** If the average rises to 10 adoptions per week, there's a **93.3% chance** of more than 5 adoptions in a week.
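
With a discrete distribution, "more than 5" and "at least 5" are different questions. The sketch below uses the survival function `poisson.sf` as an alternative to `1 - cdf`, and shows how the boundary shifts by one for "at least"; variable names are illustrative.

```python
from scipy.stats import poisson

more_than_5 = poisson.sf(5, 8)  # P(X > 5), same as 1 - poisson.cdf(5, 8)
at_least_5 = poisson.sf(4, 8)   # P(X >= 5), same as 1 - poisson.cdf(4, 8)

# The two differ by exactly P(X = 5)
difference = at_least_5 - more_than_5

print(more_than_5, at_least_5, difference)
```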

---

## 9. Sampling from a Poisson Distribution

- We can **simulate random data** from a Poisson distribution using `poisson.rvs`.

**Python Code Example:**

```python
poisson.rvs(8, size=10)
```

**Output:**

```
array([ 9,  9,  8,  7, 11,  3, 10,  6,  8, 14])
```

**Line-by-Line Explanation:**

- `poisson.rvs(8, size=10)`
    - **What:** Generates 10 random samples from a Poisson distribution with λ = 8.
    - **Why:** To simulate the number of adoptions that might happen in 10 different weeks.
    - **Output/Result:** Returns a NumPy array, e.g., `[9, 9, 8, 7, 11, 3, 10, 6, 8, 14]`. Each value is the simulated number of adoptions in a particular week.

**Significance:**  
This allows us to model and analyze real-world random processes, like adoption counts per week.
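
With a larger simulation, the sample mean and the sample variance should both land near λ, since a Poisson distribution's mean and variance are both equal to λ. The sketch below assumes 10,000 simulated weeks and an arbitrary seed.

```python
from scipy.stats import poisson

# Simulate 10,000 weeks of adoptions with lambda = 8
weekly_adoptions = poisson.rvs(8, size=10_000, random_state=0)

print(weekly_adoptions.mean(), weekly_adoptions.var())
```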

---

## 10. The Central Limit Theorem (CLT) Still Applies

- **Key Point:**  
    - Even though the Poisson distribution is discrete and often skewed for small λ, the **distribution of sample means** is approximately normal (bell-shaped), especially when each sample is large.
    - This is the **Central Limit Theorem** in action.

![image.png](attachment:97621b8f-2294-4f21-bfea-cd91ca98dee8.png)
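
A rough way to see this is to compare the skewness of raw Poisson draws with the skewness of their sample means. The sketch below assumes λ = 1, 2000 samples of size 30, and an arbitrary seed; `scipy.stats.skew` is used as the skewness measure.

```python
from scipy.stats import poisson, skew

# 2000 samples, each containing 30 draws from a Poisson distribution with lambda = 1
draws = poisson.rvs(1, size=(2000, 30), random_state=0)
sample_means = draws.mean(axis=1)

raw_skew = skew(draws.ravel())  # skewness of the individual counts
mean_skew = skew(sample_means)  # skewness of the sample means

print(raw_skew, mean_skew)
```

The sample means should be noticeably less skewed than the raw counts, which is the move toward a bell shape that the CLT describes.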

---

# Summary Table

| λ (average) | P(X=5) | P(X≤5) | P(X>5) |
|-------------|--------|--------|--------|
| 8           | 0.092  | 0.191  | 0.809  |
| 10          |   –    |   –    | 0.933  |

---

# Key Takeaways

- **Poisson distribution** models the count of random events in a fixed interval.
- **λ (lambda)** = average rate; determines the distribution’s center and shape.
- Use `poisson.pmf` for **exact** probabilities, `poisson.cdf` for **cumulative** probabilities, and `poisson.rvs` for **sampling**.
- The **CLT** means the average of many Poisson samples is approximately normal for large sample sizes.
- Poisson is widely used for event count modeling in real-world scenarios (adoptions, arrivals, accidents, etc.).

### Exercise: Identifying lambda

Now that you've learned about the Poisson distribution, you know that its shape is described by a value called lambda. In this exercise, you'll match histograms to lambda values.

Instructions

Match each Poisson distribution to its lambda value.


![image.png](attachment:1a52e496-8c46-4dca-af32-8c222f1b84c1.png)

### Exercise: Tracking lead responses

Your company uses sales software to keep track of new sales leads. It organizes them into a queue so that anyone can follow up on one when they have a bit of free time. Since the number of lead responses is a countable outcome over a period of time, this scenario corresponds to a Poisson distribution. On average, Amir responds to 4 leads each day. In this exercise, you'll calculate probabilities of Amir responding to different numbers of leads.

Instructions

Import poisson from scipy.stats and calculate the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4.

```python
# Import poisson from scipy.stats
from scipy.stats import poisson

# Probability of 5 responses
prob_5 = poisson.pmf( 5, 4)

print(prob_5)

# Output:
#     0.1562934518505317
```
Amir's coworker responds to an average of 5.5 leads per day. What is the probability that she answers 5 leads in a day?

```python
# Import poisson from scipy.stats
from scipy.stats import poisson

# Probability of 5 responses
prob_coworker = poisson.pmf(5, 5.5)

print(prob_coworker)

# Output:
# 0.17140068409793663
```
What's the probability that Amir responds to 2 or fewer leads in a day?

```python
# Import poisson from scipy.stats
from scipy.stats import poisson

# Probability of 2 or fewer responses
prob_2_or_less = poisson.cdf(2, 4)

print(prob_2_or_less)

# Output:
# 0.23810330555354436
```
What's the probability that Amir responds to more than 10 leads in a day?

```python
# Import poisson from scipy.stats
from scipy.stats import poisson

# Probability of > 10 responses
prob_over_10 = 1 - poisson.cdf(10, 4)

print(prob_over_10)

# Output:
# 0.0028397661205137315
```

# More Probability Distributions in Python

---

## 1. Overview

In this lesson, we introduce several important probability distributions beyond the Poisson distribution, focusing on:

- The **Exponential distribution**
- The **Student's t-distribution**
- The **Log-normal distribution**

---

## 2. Exponential Distribution

### What is the Exponential Distribution?

- **Purpose:** Models the amount of time between consecutive events in a *Poisson process* (where events occur independently and at a constant average rate).
- **Examples:**
    - Probability of *more than 1 day* between pet adoptions.
    - Probability of *less than 10 minutes* between restaurant arrivals.
    - Probability of *6–8 months* between earthquakes in a region.
- **Key Points:**
    - Uses the same rate parameter (lambda, λ) as the Poisson distribution.
    - It is **continuous** (models time/intervals), unlike Poisson which is **discrete**.
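
The link between the two distributions can be checked with a small simulation. This is a sketch under assumptions: the rate `lam = 0.5` events per minute and all variable names are illustrative. It draws exponential gaps between events, lines the events up on a timeline, and counts how many fall in each one-minute window.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5  # illustrative rate: events per minute

# Exponential gaps between consecutive events (NumPy also takes scale = 1/lambda)
gaps = rng.exponential(scale=1 / lam, size=100_000)
arrival_times = np.cumsum(gaps)

# Count the events that fall in each whole one-minute window
n_minutes = int(arrival_times[-1])
in_range = arrival_times[arrival_times < n_minutes]
counts = np.bincount(in_range.astype(int), minlength=n_minutes)

print(gaps.mean())    # close to 1 / lam
print(counts.mean())  # close to lam
print(counts.var())   # also close to lam, as expected for Poisson counts
```

The gaps are continuous waiting times, while the per-minute counts are whole numbers whose mean and variance both sit near λ.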

---

### Example: Customer Service Requests

Suppose **on average, one customer service ticket is created every 2 minutes**.

- **λ (lambda)** = 0.5 (because 0.5 tickets are created per minute)
- **Time unit**: minutes
![image.png](attachment:05d55def-f119-48f5-934d-6bab07de7a4b.png)

#### Lambda's Role

- **Lambda (λ)** is the *rate parameter*.
- The rate affects the **shape** of the exponential distribution (how quickly the probability declines as time increases).
![image.png](attachment:1a95461f-889d-4229-985f-23c6f2444865.png)
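
To see how the rate changes the shape, the following sketch evaluates the exponential density at a few waiting times for three rates. The rates and evaluation points are chosen only for illustration.

```python
import numpy as np
from scipy.stats import expon

x = np.array([0.0, 1.0, 2.0, 4.0])  # waiting times, in minutes

densities = {}
for lam in (0.5, 1.0, 2.0):
    # SciPy takes scale = 1 / lambda, not lambda itself
    densities[lam] = expon.pdf(x, scale=1 / lam)
    print(f"lambda={lam}: {np.round(densities[lam], 3)}")
```

The density at a waiting time of 0 equals λ, and a larger λ makes the curve fall off faster, so long waits become less likely.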

---

### Expected Value of the Exponential Distribution

- In the **Poisson distribution**: λ = expected number of events per unit of time (e.g., 0.5 requests per minute).
- In the **Exponential distribution**: *Expected time between events* = 1/λ
    - Here: 1/λ = 1/0.5 = 2 minutes (on average, wait 2 minutes for a new request).
---
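
The expected waiting time of 1/λ described above can be confirmed in SciPy. This is a minimal sketch using the λ = 0.5 from the ticket example; the variable names are illustrative.

```python
from scipy.stats import expon

lam = 0.5        # requests per minute, from the ticket example
scale = 1 / lam  # SciPy's scale parameter is the mean waiting time

mean_wait = expon.mean(scale=scale)

# Simulated waiting times should average out near the same value
simulated_waits = expon.rvs(scale=scale, size=50_000, random_state=0)

print(mean_wait)               # 2.0
print(simulated_waits.mean())  # close to 2
```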

### How long until a new request is created?

We use the exponential cumulative distribution function (`expon.cdf`) to answer questions about waiting times.

#### **Probability of waiting less than 1 minute for a request**

```python
from scipy.stats import expon

scale = 1 / 0.5  # = 2

expon.cdf(1, scale=2)
```

**Output:**
```
0.3934693402873666
```

**Line-by-Line Explanation:**

- `from scipy.stats import expon`
    - **What:** Imports the exponential distribution object from `scipy.stats`.
    - **Why:** To access exponential distribution probability functions (`cdf`).
    - **Result:** Enables use of exponential distribution in Python.
- `scale = 1 / 0.5`
    - **What:** Calculates the scale parameter, which is the mean time between events.
    - **Why:** SciPy parameterizes the exponential distribution by `scale = 1/λ` rather than by λ directly.
    - **Result:** Here, `scale = 2` minutes.
- `expon.cdf(1, scale=2)`
    - **What:** Computes the probability that the waiting time is *less than 1 minute*.
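    - **Why:** `cdf(x, scale=...)` returns P(T ≤ x), which equals P(T < x) for a continuous distribution. Pass `scale` by keyword, because the second positional argument of `expon.cdf` is `loc`, not `scale`.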
    - **Result:** About **39.3% chance** that you’ll wait less than 1 minute for a new request.

**Significance:**  
We have about a 39% chance of waiting less than 1 minute for a new ticket.
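
For readers who want to see where the number comes from, the exponential CDF has the closed form P(T ≤ x) = 1 − e^(−λx). The sketch below compares that formula with SciPy's result; the variable names are illustrative.

```python
import math
from scipy.stats import expon

lam = 0.5  # tickets per minute
x = 1      # minutes

closed_form = 1 - math.exp(-lam * x)
scipy_value = expon.cdf(x, scale=1 / lam)

print(closed_form)
print(scipy_value)
```

Both lines should print the same probability of about 0.393.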

---

#### **Probability of waiting more than 4 minutes for a request**

```python
1 - expon.cdf(4, scale=2)
```

**Output:**
```
0.1353352832366127
```

**Line-by-Line Explanation:**

- `expon.cdf(4, scale=2)`
    - **What:** Computes P(T < 4).
    - **Why:** To find the probability of waiting *less than* 4 minutes.
    - **Result:** About 0.865 (not printed on its own); it is subtracted from 1 to get the complement.
- `1 - expon.cdf(4, scale=2)`
    - **What:** Computes P(T > 4) (the complement).
    - **Why:** To find the probability of *waiting more than 4 minutes*.
    - **Result:** About **13.5% chance** of waiting more than 4 minutes.

**Significance:**  
There's about a 13.5% chance of having to wait longer than 4 minutes.
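
SciPy distributions also provide a survival function, `sf`, defined as 1 − cdf. The following sketch shows that it gives the same answer as the manual complement; it is offered as an alternative, not as the method the lesson uses.

```python
from scipy.stats import expon

p_complement = 1 - expon.cdf(4, scale=2)
p_survival = expon.sf(4, scale=2)  # survival function: P(T > 4)

print(p_complement)
print(p_survival)
```

The two values agree up to floating-point rounding. For probabilities very far in the tail, `sf` can be more accurate than subtracting from 1.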

---

#### **Probability of waiting between 1 and 4 minutes**

```python
expon.cdf(4, scale=2) - expon.cdf(1, scale=2)
```

**Output:**
```
0.4711953764760207
```

**Line-by-Line Explanation:**

- `expon.cdf(4, scale=2)`
    - **What:** Computes P(T < 4).
    - **Why:** Upper bound for the interval.
    - **Result:** Probability that wait is less than 4 minutes.
- `expon.cdf(1, scale=2)`
    - **What:** Computes P(T < 1).
    - **Why:** Lower bound for the interval.
    - **Result:** Probability that wait is less than 1 minute.
- Subtraction: `expon.cdf(4, scale=2) - expon.cdf(1, scale=2)`
    - **What:** Probability that wait is *between* 1 and 4 minutes.
    - **Why:** P(1 < T < 4) = P(T < 4) - P(T < 1).
    - **Result:** About **47.1% chance** of waiting between 1 and 4 minutes.

**Significance:**  
Almost half the time (47%), you'll wait between 1 and 4 minutes for a new request.
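
As a sanity check on the subtraction, the interval probability can be estimated by simulation. This sketch draws 100,000 waiting times with a fixed seed; the sample size and seed are arbitrary choices.

```python
from scipy.stats import expon

# Simulated waiting times in minutes; the seed is fixed so the run is repeatable
waits = expon.rvs(scale=2, size=100_000, random_state=1)

# Fraction of simulated waits that land strictly between 1 and 4 minutes
simulated = ((waits > 1) & (waits < 4)).mean()
exact = expon.cdf(4, scale=2) - expon.cdf(1, scale=2)

print(simulated)
print(exact)
```

The simulated fraction should fall close to the exact value of about 0.471.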

---

## 3. (Student's) t-Distribution

### What is the t-Distribution?

- **Shape:** Similar to the normal (Gaussian) distribution but with **thicker tails** (more probability far from the mean).
- **Use:** Commonly used when estimating the mean of a normally distributed population in situations where the sample size is small **and** the population standard deviation is unknown.
![image.png](attachment:d9dc2e59-7d09-4032-8b97-9aa91b907df4.png)

### Degrees of Freedom (df)

- **Parameter:** The t-distribution has a "degrees of freedom" (df) parameter.
    - **Lower df:** Thicker tails, higher standard deviation.
    - **Higher df:** The t-distribution approaches the normal distribution.

**Summary Table:**

| Degrees of Freedom (df) | Shape Characteristic          |
|-------------------------|------------------------------|
| Low (e.g., 1)           | Much thicker tails           |
| High (e.g., 30, 100)    | Nearly normal distribution   |

![image.png](attachment:5915f2c0-87c7-4231-b386-ef257ab9d85b.png)
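
The effect of degrees of freedom on the tails can be made concrete by comparing upper-tail probabilities. In this sketch, the cutoff of 2 and the df values are illustrative choices.

```python
from scipy.stats import norm, t

# Probability of a value more than 2 units above the center
normal_tail = norm.sf(2)
tail_by_df = {df: t.sf(2, df) for df in (1, 5, 30, 100)}

print(f"normal: {normal_tail:.4f}")
for df, tail in tail_by_df.items():
    print(f"t, df={df}: {tail:.4f}")
```

The tail probability shrinks as df grows and approaches the normal value, which is the "thicker tails at low df" behavior summarized in the table.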

---

## 4. Log-Normal Distribution

![image.png](attachment:df408f5e-9692-4546-b7dc-6ae3a759543e.png)

### What is a Log-Normal Distribution?

- A random variable is **log-normally distributed** if **its logarithm is normally distributed**.
- **Shape:** Skewed (not symmetric like the normal distribution).
- **Real-world examples:**
    - Length of chess games.
    - Adult blood pressure.
    - Number of hospitalizations during the 2003 SARS outbreak.

**Key point:**  
Log-normal distributions often arise when many small multiplicative effects combine (e.g., processes that grow by percentages).
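
The defining property can be checked numerically: log-normal samples are skewed, but their logarithms look normal. This is a sketch with assumed parameters (mean 0 and standard deviation 0.5 for the underlying normal), chosen only for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Illustrative parameters of the underlying normal distribution
samples = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)
logs = np.log(samples)

skew_raw = skew(samples)
skew_log = skew(logs)

print(skew_raw)                 # clearly positive: long right tail
print(skew_log)                 # near zero: roughly symmetric
print(logs.mean(), logs.std())  # near 0 and 0.5
```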

---

# Summary Table: Example Probabilities with Exponential Distribution

| Scenario                    | Code                                            | Output         | Interpretation                                      |
|-----------------------------|-------------------------------------------------|----------------|-----------------------------------------------------|
| Wait < 1 min                | `expon.cdf(1, scale=2)`                         | 0.393          | 39.3% chance of waiting less than 1 minute          |
| Wait > 4 min                | `1 - expon.cdf(4, scale=2)`                     | 0.135          | 13.5% chance of waiting more than 4 minutes         |
| 1 min < Wait < 4 min        | `expon.cdf(4, scale=2) - expon.cdf(1, scale=2)` | 0.471          | 47.1% chance of waiting between 1 and 4 minutes     |

---

# Key Takeaways

- **Exponential distribution**: Models the time between random events; parameterized by λ (rate).
- **t-distribution**: Like normal, but with thicker tails for small samples; controlled by degrees of freedom.
- **Log-normal distribution**: Variable's *log* is normally distributed; results in a skewed distribution.
- Understanding these distributions helps model real-world randomness in time, measurement, and counts.

### Exercise: Distribution dragging and dropping

By this point, you've learned about so many different probability distributions that it can be difficult to remember which is which. In this exercise, you'll practice distinguishing between distributions and identifying the distribution that best matches different scenarios.

Instructions

Match each situation to the distribution that best models it.

![image.png](attachment:fe662b20-ab10-4b68-a8a0-b05bab1e6326.png)

### Exercise: Modeling time between leads

To further evaluate Amir's performance, you want to know how much time it takes him to respond to a lead after he opens it. On average, he responds to 1 request every 2.5 hours. In this exercise, you'll calculate probabilities of different amounts of time passing between Amir receiving a lead and sending a response.

Instructions 1/3

Import expon from scipy.stats. What's the probability it takes Amir less than an hour to respond to a lead?

```python
# Import expon from scipy.stats
from scipy.stats import expon

# Print probability response takes < 1 hour
print(expon.cdf(1, scale=2.5))

# Output:
# 0.3296799539643607
```
What's the probability it takes Amir more than 4 hours to respond to a lead?
```python
# Import expon from scipy.stats
from scipy.stats import expon

# Print probability response takes > 4 hours
print(1 - expon.cdf(4, scale=2.5))

# Output:
# 0.20189651799465536
```
What's the probability it takes Amir 3-4 hours to respond to a lead?

```python
# Import expon from scipy.stats
from scipy.stats import expon

# Print probability response takes 3-4 hours
print(expon.cdf(4, scale=2.5) - expon.cdf(3, scale=2.5))

# Output:
# 0.09929769391754684
```
