In [1]:
import numpy as np
from scipy import stats

In [2]:
rng = np.random.default_rng(2022)
arr = rng.integers(low=0, high=10, size=10)
arr

array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

---

# Modeling Data Distribution

## [Z-Scores](https://en.wikipedia.org/wiki/Standard_score)

In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), the **standard score** is the number of [standard deviations](https://en.wikipedia.org/wiki/Standard_deviation "Standard deviation") by which the value of a [raw score](https://en.wikipedia.org/wiki/Raw_score) (i.e., an observed value or data point) is above or below the [mean](https://en.wikipedia.org/wiki/Mean "Mean") value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.

Standard scores are most commonly called **_z_-scores**; the two terms are used interchangeably here. Other equivalent terms include **z-values**, **normal scores**, **standardized variables**, and **pull** in [high energy physics](https://en.wikipedia.org/wiki/High_energy_physics "High energy physics").[[1]](https://en.wikipedia.org/wiki/Standard_score#cite_note-1)[[2]](https://en.wikipedia.org/wiki/Standard_score#cite_note-2)

Computing a z-score requires knowledge of the mean and standard deviation of the complete population to which a data point belongs; if one only has a [sample](https://en.wikipedia.org/wiki/Sample_(statistics) "Sample (statistics)") of observations from the population, then the analogous computation using the sample mean and sample standard deviation yields the [_t_-statistic](https://en.wikipedia.org/wiki/T-statistic).
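The definition above amounts to $z = (x - \mu) / \sigma$. As a minimal sketch, assuming the population mean and standard deviation are known, it could be written as follows. The helper `z_score` is hypothetical, written only for illustration; it is not the `utils.zscore` module that this notebook imports.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations by which x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# A raw score below the mean yields a negative z-score.
below = z_score(1, 10, 2)
# A raw score above the mean yields a positive z-score.
above = z_score(14, 10, 2)
print(below, above)
```

The sign carries the direction: negative values sit below the mean, positive values above it.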

The `zscore` module used below is available [here](utils/zscore.py).

In [3]:
from utils.zscore import zscore

In [4]:
zscore(1, 10, 2)

-4.5

In [5]:
# Z-Scores calculation for every item in an array
stats.zscore(arr)

array([ 0.95789494, -0.53881591,  0.95789494, -1.13750025, -0.83815808,
        0.65855277,  1.55657928, -1.13750025, -1.13750025,  0.65855277])
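To connect this output to the definition: `stats.zscore` standardizes with the array's own mean and, by default, the population standard deviation (`ddof=0`). The sketch below reproduces that by hand and shows the sample-based variant with `ddof=1`. The literal array is copied from the `arr` output shown earlier so that the block is self-contained.

```python
import numpy as np
from scipy import stats

# Values copied from the `arr` output shown earlier.
arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Manual z-scores using the population standard deviation (ddof=0),
# which is the default used by stats.zscore.
manual = (arr - arr.mean()) / arr.std(ddof=0)
print(np.allclose(manual, stats.zscore(arr)))

# Treating the data as a sample instead: divide by the sample standard deviation.
sample_based = stats.zscore(arr, ddof=1)
print(np.round(sample_based, 4))
```

Because the sample standard deviation is slightly larger than the population one, the `ddof=1` scores are slightly smaller in magnitude.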

# [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution)

A **normal distribution** is a [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution "Probability distribution") used to model phenomena that have a default behaviour and cumulative possible deviations from that behaviour.

More rigorously, in [probability theory](https://en.wikipedia.org/wiki/Probability_theory "Probability theory"), a **normal distribution** (also known as **Gaussian**, **Gauss**, or **Laplace–Gauss** **distribution**) is a type of [continuous probability distribution](https://en.wikipedia.org/wiki/Continuous_probability_distribution "Continuous probability distribution") for a [real-valued](https://en.wikipedia.org/wiki/Real_number "Real number") [random variable](https://en.wikipedia.org/wiki/Random_variable "Random variable"). The general form of its [probability density function](https://en.wikipedia.org/wiki/Probability_density_function "Probability density function") is

${\displaystyle f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}}$
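As a rough check of the formula, the density can be evaluated term by term and compared with `scipy.stats.norm.pdf`. The helper name `normal_pdf` and the sample inputs are assumptions chosen for illustration.

```python
import math

from scipy import stats


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Evaluate the normal density formula directly."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * math.exp(exponent)


# Arbitrary illustrative inputs: x = 1.2 for a normal with mean 0.5 and sd 2.
by_hand = normal_pdf(1.2, mu=0.5, sigma=2.0)
by_scipy = stats.norm.pdf(1.2, loc=0.5, scale=2.0)
print(by_hand, by_scipy)
```

In SciPy's parameterization, `loc` plays the role of $\mu$ and `scale` the role of $\sigma$.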

## [Normal Distribution Empirical Rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)

In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), the **68–95–99.7 rule**, also known as the **empirical rule**, is a shorthand used to remember the percentage of values that lie within an [interval estimate](https://en.wikipedia.org/wiki/Interval_estimate "Interval estimate") in a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution "Normal distribution"): 68%, 95%, and 99.7% of the values lie within one, two, and three [standard deviations](https://en.wikipedia.org/wiki/Standard_deviation "Standard deviation") of the [mean](https://en.wikipedia.org/wiki/Arithmetic_mean "Arithmetic mean"), respectively.
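The three percentages are rounded values of the standard normal CDF. A short sketch, assuming only `scipy.stats.norm`, recovers them:

```python
from scipy import stats

# Probability mass within k standard deviations of the mean for a standard normal:
# P(-k <= Z <= k) = cdf(k) - cdf(-k).
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```

The same proportions hold for any mean and standard deviation, since standardizing maps every normal distribution onto the standard normal.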

### Example 1: Proportion below a value

The lifespans of tigers in a particular zoo are normally distributed. The average tiger lives $13.1$ years; the standard deviation is $1.5$ years.

Use the empirical rule (68 - 95 - 99.7%) to estimate the probability of a tiger living less than $14.6$ years.

In [6]:
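# Proportion below a value
mu, sd = 13.1, 1.5
dp = 14.6
# z-score of the data point: positive because dp is above the mean
n_sd = (dp - mu) / sd
print("n_sd =", round(n_sd))
# Half of the distribution lies below the mean; add half of the 68% within 1 sd
p = 0.5 + 0.68 / 2
round(p, 4)

n_sd = 1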


0.84

### Example 2: Proportion above a value

The lifespans of tigers in a particular zoo are normally distributed. The average tiger lives $22.4$ years; the standard deviation is $2.7$ years.

Use the empirical rule (68 - 95 - 99.7%) to estimate the probability of a tiger living longer than $14.3$ years.

In [7]:
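# Proportion above a value
mu, sd = 22.4, 2.7
dp = 14.3
# z-score of the data point: negative because dp is below the mean
n_sd = (dp - mu) / sd
print("n_sd =", round(n_sd))
# 0.3% lies outside 3 sd; half of that is in the left tail below dp
p = 1 - (0.003) / 2
round(p, 4)

n_sd = -3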


0.9985

### Example 3: Proportion between two values

The lifespans of tigers in a particular zoo are normally distributed. The average tiger lives $12.5$ years; the standard deviation is $2.4$ years.

Use the empirical rule (68 - 95 - 99.7%) to estimate the probability of a tiger living between $5.3$ and $10.1$ years.

In [8]:
# Proportion between two values
mu, sd = 12.5, 2.4
dp_low, dp_high = 5.3, 10.1
# z-scores of both bounds: negative because both lie below the mean
n_sd_low = (dp_low - mu) / sd
print("n_sd_low =", round(n_sd_low))
n_sd_high = (dp_high - mu) / sd
print("n_sd_high =", round(n_sd_high))
# Left-side band between 3 sd and 1 sd below the mean
p = (0.997 - 0.68) / 2
round(p, 4)

n_sd_low = -3
n_sd_high = -1


0.1585
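The empirical rule gives rounded estimates. As a cross-check, the same three proportions can be computed from the normal CDF. This is a sketch using `scipy.stats.norm` with the means, standard deviations, and cut-offs from the three examples above; the variable names are illustrative.

```python
from scipy import stats

# Example 1: P(X < 14.6) with mean 13.1 and standard deviation 1.5.
p1 = stats.norm.cdf(14.6, loc=13.1, scale=1.5)

# Example 2: P(X > 14.3) with mean 22.4 and standard deviation 2.7.
# sf is the survival function, 1 - cdf.
p2 = stats.norm.sf(14.3, loc=22.4, scale=2.7)

# Example 3: P(5.3 < X < 10.1) with mean 12.5 and standard deviation 2.4.
dist3 = stats.norm(loc=12.5, scale=2.4)
p3 = dist3.cdf(10.1) - dist3.cdf(5.3)

print(round(p1, 4), round(p2, 4), round(p3, 4))
```

The results should land close to the rule-based answers, differing only because 68%, 95%, and 99.7% are themselves rounded.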

## Normal Distribution Calculations

> [Probability density function](https://en.wikipedia.org/wiki/Probability_density_function)<br>
> In [probability theory](https://en.wikipedia.org/wiki/Probability_theory "Probability theory"), a **probability density function** (**PDF**), or **density** of a [continuous random variable](https://en.wikipedia.org/wiki/Continuous_random_variable "Continuous random variable"), is a [function](https://en.wikipedia.org/wiki/Function_(mathematics) "Function (mathematics)") whose value at any given sample (or point) in the [sample space](https://en.wikipedia.org/wiki/Sample_space "Sample space") (the set of possible values taken by the random variable) can be interpreted as providing a _relative likelihood_ that the value of the random variable would be close to that sample.[[2]](https://en.wikipedia.org/wiki/Probability_density_function#cite_note-2)[[3]](https://en.wikipedia.org/wiki/Probability_density_function#cite_note-3) In other words, while the _absolute likelihood_ for a continuous random variable to take on any particular value is 0 (since there is an infinite set of possible values to begin with), the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.

> [Cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)<br>
> In [probability theory](https://en.wikipedia.org/wiki/Probability_theory "Probability theory") and [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), the **cumulative distribution function** (**CDF**) of a real-valued [random variable](https://en.wikipedia.org/wiki/Random_variable "Random variable") ${\displaystyle X}$, or just **distribution function** of ${\displaystyle X}$, evaluated at ${\displaystyle x}$, is the probability that ${\displaystyle X}$ will take a value less than or equal to ${\displaystyle x}$.[[1]](https://en.wikipedia.org/wiki/Cumulative_distribution_function#cite_note-1)<br>
> In the case of a scalar [continuous distribution](https://en.wikipedia.org/wiki/Continuous_distribution "Continuous distribution"), it gives the area under the [probability density function](https://en.wikipedia.org/wiki/Probability_density_function "Probability density function") from minus infinity to ${\displaystyle x}$. Cumulative distribution functions are also used to specify the distribution of [multivariate random variables](https://en.wikipedia.org/wiki/Multivariate_random_variable "Multivariate random variable").

> [Percent point function](https://en.wikipedia.org/wiki/Quantile_function)<br>
> In [probability](https://en.wikipedia.org/wiki/Probability "Probability") and [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), the **[quantile](https://en.wikipedia.org/wiki/Quantile "Quantile") function**, associated with a [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution "Probability distribution") of a [random variable](https://en.wikipedia.org/wiki/Random_variable "Random variable"), specifies the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability. Intuitively, the quantile function associates with a range at and below a probability input the likelihood that a random variable is realized in that range for some probability distribution. It is also called the **percentile function**, **percent-point function** or **inverse cumulative distribution function**.

The `norm_prop` module used below is available [here](utils/norm/norm_prop.py).
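The three functions quoted above (PDF, CDF, and percent point function) correspond to the `pdf`, `cdf`, and `ppf` methods of `scipy.stats.norm`. The sketch below, using a standard normal and an arbitrary point chosen for illustration, shows how they relate: the CDF is the area under the PDF, and the percent point function inverts the CDF.

```python
import numpy as np
from scipy import integrate, stats

dist = stats.norm(loc=0, scale=1)  # standard normal, chosen for illustration

x = 1.0
density = dist.pdf(x)             # relative likelihood near x, not a probability
cumulative = dist.cdf(x)          # P(X <= x)
recovered = dist.ppf(cumulative)  # the quantile function undoes the CDF

# The CDF equals the area under the PDF from minus infinity to x.
area, _abs_err = integrate.quad(dist.pdf, -np.inf, x)

print(density, cumulative, recovered, area)
```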

In [9]:
from utils.norm.norm_prop import norm_prop

### Example 1: Proportion below a value

A set of elementary school student heights are normally distributed with a mean of $105$ centimeters and a standard deviation of $10$ centimeters. Ikue is an elementary school student with a height of $90.4$ centimeters.

What proportion of student heights are lower than Ikue's height?

In [10]:
# Proportion below a value
dp, mu, sd = 90.4, 105, 10
p = norm_prop(dp, mu, sd)
p

0.07214503696589383

### Example 2: Proportion above a value

A set of chemistry exam scores are normally distributed with a mean of $70$ points and a standard deviation of $4$ points. Cam got a score of $65$ points on the exam.

What proportion of exam scores are higher than Cam's score?

In [11]:
# Proportion above a value
dp, mu, sd = 65, 70, 4
p = norm_prop(dp, mu, sd, 'above')
p

0.8943502263331446

### Example 3: Proportion between two values

A set of average city temperatures in October are normally distributed with a mean of $21.02$ degrees and a standard deviation of $2$ degrees.

What proportion of temperatures are between $17.02$ degrees and $25$ degrees?

In [12]:
# Proportion between two values
dp, mu, sd = [17.02, 25], 21.02, 2
p = norm_prop(dp, mu, sd, 'between')
p

0.9539544003016089
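The source of `norm_prop` is not shown in this notebook. The three calls above are consistent with a thin wrapper around the normal CDF, which might look like the sketch below. The function `norm_prop_sketch`, its parameter name `side`, and its default value are assumptions for illustration; the real module may differ.

```python
from scipy import stats


def norm_prop_sketch(dp, mu, sd, side="below"):
    """Proportion of a Normal(mu, sd) population below, above, or between values.

    Hypothetical stand-in for utils.norm.norm_prop; the real module may differ.
    """
    dist = stats.norm(loc=mu, scale=sd)
    if side == "below":
        return float(dist.cdf(dp))
    if side == "above":
        return float(dist.sf(dp))  # survival function: 1 - cdf
    if side == "between":
        low, high = dp
        return float(dist.cdf(high) - dist.cdf(low))
    raise ValueError(f"unknown side: {side!r}")


# Same inputs as Examples 1 to 3; compare with the outputs shown above.
print(norm_prop_sketch(90.4, 105, 10))
print(norm_prop_sketch(65, 70, 4, "above"))
print(norm_prop_sketch([17.02, 25], 21.02, 2, "between"))
```

Using `sf` for the upper tail, instead of `1 - cdf`, avoids a loss of precision when the tail probability is very small.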

### Example 4: Finding the z-score and threshold value for a left-tail probability

The distribution of reading scale scores in the 4th grade at Truman Elementary School was approximately normal with mean $\mu = 221$ and standard deviation $\sigma = 36$.

Educational researchers are conducting an intervention experiment. To participate in the experiment, a student must have a reading scale score in the lower $50\%$ of the scores in their grade level.

What is the maximum reading scale score for students to participate in the intervention experiment?

In [13]:
# Finding Z-Score for a left tail probability and Proportions under the left tail probability
mu, sd = 221, 36
z = stats.norm.ppf(0.5)
print("Z Score =", z)
# Inverse Z Score formula
dp = z * sd + mu
print("Data Point =", dp)

Z Score = 0.0
Data Point = 221.0
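The two steps above (standard normal quantile, then rescaling) can also be folded into a single call, since `stats.norm.ppf` accepts `loc` and `scale`. A sketch with the same numbers, where the variable names are illustrative:

```python
from scipy import stats

mu, sd = 221, 36

# Two-step route used above: standard-normal quantile, then rescale.
z = stats.norm.ppf(0.5)
cutoff_two_step = z * sd + mu

# One-step route: let ppf apply the location and scale itself.
cutoff_one_step = stats.norm.ppf(0.5, loc=mu, scale=sd)

# Round trip: the CDF at the cutoff returns the requested left-tail probability.
check = stats.norm.cdf(cutoff_one_step, loc=mu, scale=sd)
print(cutoff_two_step, cutoff_one_step, check)
```

Because the lower $50\%$ boundary is the median, the cutoff coincides with the mean for any normal distribution.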


### Example 5: Finding the z-score and threshold value for a right-tail probability

The distribution of SAT scores of all college-bound seniors taking the SAT in 2014 was approximately normal with mean $\mu = 1497$ and standard deviation $\sigma = 322$.

A certain summer program only admits students whose SAT scores are in the top $15\%$ of those who take the test in a given year.

What is the minimum SAT score in 2014 that meets the program’s requirements?

In [14]:
# Finding Z-Score for a right tail probability and Proportions above the right tail probability
mu, sd = 1497, 322
z = stats.norm.ppf(1 - 0.15)
print("Z Score =", z)
# Inverse Z Score formula
dp = z * sd + mu
print("Data Point =", dp)

Z Score = 1.0364333894937898
Data Point = 1830.7315514170004
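For right-tail questions, SciPy also provides the inverse survival function `isf`, which takes the upper-tail probability directly instead of `1 - p`. A sketch with the same numbers, where the variable names are illustrative:

```python
from scipy import stats

mu, sd = 1497, 322
top_share = 0.15

# isf is the inverse survival function: the value with `top_share` of the
# distribution above it, equivalent to ppf(1 - top_share).
cutoff = stats.norm.isf(top_share, loc=mu, scale=sd)
print(cutoff)

# Check: the share of scores above the cutoff should be about 15%.
share_above = stats.norm.sf(cutoff, loc=mu, scale=sd)
print(share_above)
```

If a whole-number score is needed, the cutoff would have to be rounded up so that the admitted share does not exceed $15\%$.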
