diff --git a/docs/mathematical_notation.md b/docs/mathematical_notation.md index d464dec..a3a7baa 100644 --- a/docs/mathematical_notation.md +++ b/docs/mathematical_notation.md @@ -12,7 +12,7 @@ * - Symbol - Formula - - Article + - Explained * - $\mu$ - $\sum_{x} x P(X=x) = \int_{-\infty}^{\infty} x f(x) d x$ - [🔗](expected-value) diff --git a/docs/probability/continuous_distributions.md b/docs/probability/continuous_distributions.md index a2f02a2..d413180 100644 --- a/docs/probability/continuous_distributions.md +++ b/docs/probability/continuous_distributions.md @@ -54,8 +54,24 @@ $P(X=a)=\int_{a}^{a} f(x) d x=0 \text { for all real numbers } a$ Random variable $X \sim U[a,b]$ has the uniform distribution on the interval \[a, b\] if its density function is -```{image} https://cdn.mathpix.com/snip/images/C3YIEOiPSsTEyCokT28x7xwBtWiAMEuJgXY7ljXUKpM.original.fullsize.png -:width: 600 +```{code-cell} +import matplotlib.pyplot as plt +import seaborn as sns +from scipy.stats import uniform + +sns.set_theme(style="darkgrid") + +# random numbers from uniform distribution +n = 10000 +start = 10 +width = 20 +data_uniform = uniform.rvs(size=n, loc=start, scale=width) +ax = sns.displot(data_uniform, + bins=100, + kde=True) +ax.set(xlabel='Uniform Distribution', ylabel='Frequency') +plt.show() ``` $$ @@ -105,8 +121,13 @@ For random variable $X \sim U(0,23)$. Find P(2 \< X \< 18) $P(2 < X < 18) = (18-2)\cdot \frac 1 {23-0} = \frac {16}{23}$ -## Exponential rv +## Exponential Distribution +The exponential distribution is a continuous probability distribution that describes the time until some +specific event happens. +It models a process in which events occur continuously and independently at a constant average rate. The exponential +distribution has the key property of being memoryless.
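The memoryless property, $P(X > s + t \mid X > s) = P(X > t)$, can be checked numerically. A minimal sketch with `scipy` (the rate $\lambda = 0.5$ and the times $s$ and $t$ are arbitrary illustrative choices, not values from the text):

```{code-cell}
from scipy.stats import expon

lam = 0.5          # illustrative rate parameter (an assumption for this sketch)
s, t = 2.0, 3.0    # arbitrary elapsed time and additional waiting time

X = expon(scale=1 / lam)   # scipy parameterizes the exponential by scale = 1/lambda

# memorylessness: P(X > s + t | X > s) should equal P(X > t)
conditional = X.sf(s + t) / X.sf(s)
print(conditional, X.sf(t))
```

Because the survival function of the exponential is $e^{-\lambda x}$, the ratio cancels the elapsed time $s$ exactly.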
+### Applications The family of exponential distributions provides probability models that are widely used in engineering and science disciplines to describe **time-to-event** data. @@ -115,22 +136,65 @@ disciplines to describe **time-to-event** data. - Waiting time in a queue - Length of service time - Time between customer arrivals +- Amount of money spent by a customer +- Time until a radioactive particle decays ### PDF +A continuous random variable $X$ is said to have an exponential distribution if it has the following probability +density function: $$ -f(x;\lambda) = \begin{cases} \lambda e^{ - \lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} =\lambda e^{-\lambda x} I_{(0, \infty)}(x) +\large f(x;\lambda) = \begin{cases} \lambda e^{ - \lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} =\lambda e^{-\lambda x} I_{(0, \infty)}(x) $$ +$\lambda$ is called the rate parameter of the distribution. + +```{code-cell} +import matplotlib.pyplot as plt +import seaborn as sns +from scipy.stats import expon + +sns.set_theme(style="darkgrid") + +data_expon = expon.rvs(scale=1, loc=0, size=1000) +ax = sns.displot(data_expon, + kde=True, + bins=100) +ax.set(xlabel='Exponential Distribution', ylabel='Frequency') +plt.show() +``` + ### Expected Value +The mean of the exponential distribution is calculated using integration by parts.
+ +$$ +\begin{aligned} +&E[X]=\int_{0}^{\infty} x f(x) d x=\int_0^{\infty} x \lambda e^{-\lambda x} d x \\ +&=\lambda\left[\left|\frac{-x e^{-\lambda x}}{\lambda}\right|_0^{\infty}+\frac{1}{\lambda} \int_0^{\infty} e^{-\lambda x} d x\right] \\ +&=\lambda\left[0+\frac{1}{\lambda} \frac{-e^{-\lambda x}}{\lambda}\right]_0^{\infty} \\ +&=\lambda \frac{1}{\lambda^2} \\ +&=\frac{1}{\lambda} +\end{aligned} +$$ -$E(X) = \int_{0}^{\infty} x f(x) d x = \int_{0}^{\infty} x \lambda e^{ - \lambda x} d x = \frac{1}{\lambda}$ + +$$ +\begin{aligned} +E[X^2] &= \int_{0}^{\infty} x^2 f(x) d x \\ +&= \int_{0}^{\infty} x^2 \lambda e^{ - \lambda x} d x \\ +&= \frac{2}{\lambda^2} +\end{aligned} +$$ -$E(X^2) = \int_{0}^{\infty} x^2 f(x) d x = \int_{0}^{\infty} x^2 \lambda e^{ - \lambda x} d x = \frac{2}{\lambda^2}$ ### Variance +To find the variance of the exponential distribution, we need the second moment of the distribution. + +$$ +\begin{aligned} +V(X) &= E(X^2) - E(X)^2 \\ +&= \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 \\ +&= \frac{1}{\lambda^2} +\end{aligned} +$$ -$V(X) = E(X^2) - E(X)^2 = \frac{2}{\lambda^2} - (\frac{1}{\lambda})^2 = \frac{1}{\lambda^2}$ +### Properties +The most important property of the exponential distribution is the memoryless property. This property is also +applicable to the geometric distribution. ## Normal (Gaussian) Distribution @@ -529,6 +593,38 @@ $$ R code: pnorm(1.2) +#### Find $P(X < 4.1)$ when $X \sim N(2, 3)$ + +Let $X \sim N(2,3)$.
Then + +$$ +\begin{aligned} +P ( X \leq 4.1) &= P \left(\frac{ X -\mu}{\sigma} \leq \frac{4.1-2}{\sqrt{3}}\right) \\ +&= P (Z \leq 1.21) \\ +& \approx 0.8868 +\end{aligned} +$$ + +R Code: pnorm(1.21) + +```R +z_score <- (4.1 - 2) / sqrt(3) +pnorm(z_score) +``` + +For the sample mean of $n = 10$ such observations: + +$$ +\begin{aligned} +& X _1, X _2, \ldots, X _{10} \stackrel{\text{iid}}{\sim} N (2,3) \\ +&\overline{ X } \sim N \left(\mu, \sigma^2 / n \right)= N (2,3 / 10) \\ +& P (\overline{ X } \leq 2.3)= P \left(\frac{\overline{ X }-\mu_{\overline{ X }}}{\sigma_{\overline{ X }}} \leq \frac{2.3-2}{\sqrt{3 / 10}}\right) \\ +&= P ( Z \leq 0.5477) \\ +& \approx 0.7081 +\end{aligned} +$$ + #### Interval between variables To find the probability of an interval between certain variables, you need to subtract cdf from another cdf. @@ -625,4 +721,28 @@ pro=norm(1, 2).cdf(3.5) - norm(1,2).cdf(0) ax.text(0.2,0.02,round(pro,2), fontsize=20) plt.show() -``` \ No newline at end of file +``` + +## Gamma Distribution +The gamma distribution is a two-parameter family of continuous probability distributions, defined by a shape parameter +and an inverse scale (rate) parameter. Its importance is largely due to its relation to the +exponential and normal distributions. + +Gamma distributions have two free parameters, called alpha (α) and beta (β), where: + +- α = Shape parameter +- β = Rate parameter (the reciprocal of the scale parameter) + +The parameter β only rescales the distribution: wherever the random variable x appears in the probability +density, it is multiplied by the rate β (equivalently, divided by the scale 1/β). Since the scale carries the +dimensional information, data are seldom modeled directly with the “standard” gamma distribution, i.e., with β = 1. + +### Gamma function + +The gamma function [10], denoted by $\Gamma( x )$, is an extension of the factorial function to real (and complex) +numbers.
Specifically, if $n \in\{1,2,3, \ldots\}$, then + +$$ +\Gamma( n )=( n -1) ! +$$ + diff --git a/docs/probability/hypothesis_testing.md b/docs/probability/hypothesis_testing.md index d1693c9..7c9816e 100644 --- a/docs/probability/hypothesis_testing.md +++ b/docs/probability/hypothesis_testing.md @@ -38,11 +38,10 @@ from scipy.stats import norm sns.set_theme(style="darkgrid") -sample = torch.normal(mean = 8, std = 16, size=(1,1000)) +sample = torch.normal(mean = 0, std = 1, size=(1,1000)) sns.displot(sample[0], kde=True, stat = 'density',) plt.axvline(torch.mean(sample[0]), color='red', label='mean') - plt.show() ``` Example of random sample after it is observed: @@ -54,14 +53,8 @@ $$ Based on what you are seeing, do you believe that the true population mean $\mu$ is $$ - -\begin{align} -\mu<=3 \\ -or \\ -\mu>3 \\ -\text { The sample is } \overline{\mathrm{x}}=2.799 -\end{align} - + \mu<=3 \text{ or } \mu>3 \\ +\text { The sample mean is } \overline{\mathrm{x}}=2.799 $$ This is below 3 , but can we say that $\mu<3$ ? @@ -92,16 +85,22 @@ $$ **How do we formalize this stuff, We use hypothesis testing** -Hypotheses: +### Notation $\mathrm{H}_0: \mu \leq 3$ <- Null hypothesis \ $\mathrm{H}_1: \mu>3 \quad$ Alternate hypothesis -### Null hypothesis -The null hypothesis is assumed to be true. +#### Null hypothesis +The null hypothesis is a hypothesis that is assumed to be true. We denote it with an $H_0$. -### Alternate hypothesis +#### Alternate hypothesis The alternate hypothesis is what we are out to show. +The alternative hypothesis is a hypothesis that we are looking for evidence for or **out to show**. +We denote it with an $H_1$. + +:::{note} +Some people use the notation $H_a$ here +::: **Conclusion is either**:\ Reject $\mathrm{H}_0 \quad$ OR $\quad$ Fail to Reject $\mathrm{H}_0$ @@ -115,6 +114,9 @@ You don't know the exact distribution.\ Means you know the distribution is normal but you don't know the mean and variance. 
#### Critical values +Critical values for distributions are numbers that cut off specified areas under pdfs. For the +N(0, 1) distribution, we will use the notation $z_\alpha$ to denote the value that cuts off area $\alpha$ to +the right as depicted here. ```{image} https://cdn.mathpix.com/snip/images/VhPT2BPUY6gNGGTSOLvZuK6iXJSLNFeOwMU3aI8Droc.original.fullsize.png :align: center @@ -183,6 +185,30 @@ $= P \left(\right.$ Reject $H _0$ when $\left.\mu=5\right)$ $\alpha$ is called the level of significance of the test. It is also sometimes referred to as the size of the test. +$$ +\begin{aligned} +\alpha &=\max P (\text { Type I Error }) \\ +&=\max _{\mu \in H _0} P \left(\text { Reject } H _0 ; \mu\right) \\ +\beta &=\max P (\text { Type II Error }) \\ +&=\max _{\mu \in H _1} P \left(\text { Fail to Reject } H _0 ; \mu\right) +\end{aligned} +$$ + +### Power of the test + +$1-\beta$ is known as the +power of the test +$$ +\begin{gathered} +1-\beta=1-\max _{\mu \in H _1} P \left(\text { Fail to Reject } H _0 ; \mu\right) \\ +=\min _{\mu \in H _1}\left(1- P \left(\text { Fail to Reject } H _0 ; \mu\right)\right) \\ +=\min _{\mu \in H _1} P \left(\text { Reject } H _0 ; \mu\right) \quad \begin{array}{c} +\text { High power } \\ +\text { is good! } +\end{array} +\end{gathered} +$$ + ### Step One Choose an estimator for μ. @@ -224,10 +250,10 @@ Give a conclusion! 
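Once the four steps are laid out, the cutoff $c$ can also be computed directly. A minimal sketch with `scipy` (it assumes the same setup as the derivation that follows: reject $H_0$ when $\overline{X} < c$, with $\mu_0 = 5$, $\sigma = 2$, $n = 10$, $\alpha = 0.05$):

```{code-cell}
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 5, 2, 10, 0.05

# c solves P(Xbar < c; mu = mu0) = alpha, where Xbar ~ N(mu0, sigma^2 / n)
c = mu0 + norm.ppf(alpha) * sigma / sqrt(n)
print(round(c, 4))
```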
$0.05= P ($ Type I Error) \ $= P \left(\right.$ Reject $H _0$ when true $)$ \ -$= P (\overline{ X }< c$ when $\mu=5)$ +$= P (\overline{ X } < c \text { when } \mu=5)$ -$ = P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}<\frac{ c -5}{2 / \sqrt{10}}\right.$ when $\left.\mu=5\right) +$ = P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}<\frac{ c -5}{2 / \sqrt{10}}\right.$ when $\left.\mu=5\right)$ ```{image} https://cdn.mathpix.com/snip/images/A2zQa5iD99VnS5sLbiZ947KpZWH7i7xSbnJ6IZ88j2w.original.fullsize.png @@ -248,3 +274,522 @@ $ = P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}<\frac{ c -5}{2 / \s :alt: Errors in Hypothesis Testing :width: 80% ``` + +### Formula + +Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$ and known variance $\sigma^2$. + +Consider testing the simple versus simple hypotheses + +$$ +H _0: \mu=\mu_0 \quad H _1: \mu=\mu_1 +$$ + +where $\mu_0$ and $\mu_1$ are fixed and known. + +If $\mu_0<\mu_1$, reject $H_0$, in favor of $H_1$, if + +$$ +\large \overline{ X }>\mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }} +$$ + +If $\mu_0>\mu_1$, reject $H_0$, in favor of $H_1$, if + +$$ +\large \overline{ X }<\mu_0+ z_{1-\alpha} \frac{\sigma}{\sqrt{ n }} +$$ + +### Type II Error + +$$ +H_0: \mu=\mu_0 \\ +H _1: \mu=\mu_1 \\ +\mu_0<\mu_1 +$$ + +$$ +\begin{aligned} +& \beta= P (\text { Type II Error }) \\ +=& P \left(\text { Fail to Reject } H _0 \text { when false }\right) \\ +=& P \left(\overline{ X } \leq \mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }} \text { when } \mu=\mu_1\right) \\ +=& P \left(\overline{ X } \leq \mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }} ; \mu_1\right) +\end{aligned} +$$ + +$$ +\begin{aligned} +\beta &= P \left(\left(\frac{\overline{X} -\mu_1}{\sigma / \sqrt{ n }}\right) \leq \frac{\mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }}-\mu_1}{\sigma / \sqrt{ n }} ; \mu_1\right) \\ +&= P \left( Z \leq 
\frac{\mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }}-\mu_1}{\sigma / \sqrt{ n }}\right) +\end{aligned} +$$ + +## Composite vs Composite Hypothesis + +$$ +\begin{aligned} +& X _1, X _2, \ldots, X _{ n } \sim N \left(\mu, \sigma^2\right), \sigma^2 \text { known } \\ +& H _0: \mu \leq \mu_0 \quad \text { vs } \quad H _1: \mu>\mu_0 +\end{aligned} +$$ + +- Step One Choose an estimator for μ. +- Step Two Choose a test statistic: Reject $H_0$, in favor of $H_1$, if $\bar{X} > c$, where $c$ is to be determined. +- Step Three Find c. + +## One-Tailed Tests + +Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$ and known variance $\sigma^2$. +Consider testing the hypotheses + +$$ +H _0: \mu \geq \mu_0 \quad H _1: \mu<\mu_0 +$$ + +where $\mu_0$ is fixed and known. + + +### Step One +Choose an estimator for μ. + +$$ +\widehat{\mu}=\bar{X} +$$ + +### Step Two + +Choose a test statistic or Give the “form” of the test. + +Reject $H _0$, in favor of $H _1$, if $\overline{ X }< c$ for some c to be determined. + +### Step Three + +Find c.
+ +$$ +\begin{aligned} +\alpha &=\max _{\mu \geq \mu_0} P (\text { Type I Error }) \\ +&=\max _{\mu \geq \mu_0} P \left(\text { Reject } H _0 ; \mu\right) \\ +&=\max _{\mu \geq \mu_0} P (\overline{ X }< c ; \mu) +\end{aligned} +$$ + +$$ +\begin{aligned} +\alpha &=\max _{\mu \geq \mu_0} P (\overline{ X }< c ; \mu) \\ +&=\max _{\mu \geq \mu_0} P \left( Z <\frac{ c -\mu}{\sigma / \sqrt{ n }}\right) \\ +&=\max _{\mu \geq \mu_0} \Phi\left(\frac{ c -\mu}{\sigma / \sqrt{ n }}\right) +\end{aligned} +$$ + +Since $\Phi\left(\frac{ c -\mu}{\sigma / \sqrt{ n }}\right)$ is decreasing in $\mu$, the maximum over $\mu \geq \mu_0$ is attained at $\mu=\mu_0$, so $\alpha=\Phi\left(\frac{ c -\mu_0}{\sigma / \sqrt{ n }}\right)$. + +### Step Four + +Reject $H _0$, in favor of $H _1$, if +$$ +\overline{ X }<\mu_0+ z _{1-\alpha} \frac{\sigma}{\sqrt{ n }} +$$ + +### Example + +In 2019, the average health care annual premium for a family of 4 in the United States was reported to be $\$ 6,015$. + +In a more recent survey, 100 randomly sampled families of 4 reported an average annual health care premium of $\$ 6,537$. +Can we say that the true average is currently greater than $\$ 6,015$ for all families of 4? + +Assume that annual health care premiums are normally distributed with a standard deviation of $\$ 814$. +Let $\mu$ be the true average for all families of 4. + +#### Step Zero +Set up the hypotheses. + +$$ +H _0: \mu=6015 \quad H _1: \mu>6015 +$$ + +Decide on a level of significance. $\alpha=0.10$ + +#### Step One +Choose an estimator for $\mu$. + +$$ +\hat{\mu}=\bar{X} +$$ + +#### Step Two +Give the form of the test. +Reject $H _0$, in favor of $H _1$, if + +$$ +\bar{X}>c +$$ + +for some $c$ to be determined. + +#### Step Three +Find c.
+ +$$ +\begin{aligned} +\alpha &=\max _{\mu=\mu_0} P (\text { Type I Error; } \mu) \\ +&= P \left(\text { Type I Error; } \mu_0\right) +\end{aligned} +$$ + +$$ +\begin{aligned} +\alpha &= P \left(\text { Reject } H _0 ; \mu_0\right) \\ +&= P \left(\overline{ X }> c ; \mu_0\right) \\ +&= P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}>\frac{ c -6015}{814 / \sqrt{100}} ; \mu_0\right)\\ +&=P\left(Z>\frac{c-6015}{814 / \sqrt{100}}\right) +\end{aligned} +$$ + +$$ +\frac{c-6015}{814 / \sqrt{100}}=z_{0.10}=1.28 +$$ + +#### Step Four +Conclusion. Reject $H _0$, in favor of $H _1$, if + +$$ +\bar{X}>6119.19 +$$ + +From the data, where $\bar{x}=6537$, we reject $H _0$ in favor of $H _1$.\ +The data suggests that the true mean annual health care premium is greater than $\$ 6015$. + + +## Power Tests +Let $X_1, X_2, \ldots, X_n$ be a random sample from any distribution with unknown parameter $\theta$, which takes values +in a parameter space $\Theta$. + +We ultimately want to test + +$$ +\begin{aligned} +& H _0: \theta \in \Theta_0 \\ +& H _1: \theta \in \Theta \backslash \Theta_0 +\end{aligned} +$$ + +where $\Theta_0$ is some subset of $\Theta$. + +In other words: if, for an exponential distribution, the null hypothesis is that $\lambda$ lies between 0 and 2, the +complement is not the rest of the real number line, because the parameter space contains only non-negative values. +The complement of the interval $[0, 2]$ in that space is $(2, \infty)$. + + +$\gamma(\theta)= P \left(\right.$ Reject $H _0$ when the parameter is $\left.\theta\right)$ $$ +\gamma(\theta)= P \left(\text { Reject } H _0 ; \theta\right) +$$ $\theta$ is an argument that can be anywhere in the parameter space $\Theta$.
it could be a $\theta$ from $H _0$ +it could be a $\theta$ from $H _1$ + + +$$ +\begin{aligned} +&\alpha=\max P \left(\text { Reject } H _0 \text { when true }\right) \\ +&=\max _{\theta \in \Theta_0} P \left(\text { Reject } H _0 ; \theta\right) \\ +&=\max _{\theta \in \Theta_0} \gamma(\theta) +\end{aligned} +$$ + ++(Other notation for $\max _{\theta \in \Theta_0}$ is $\max _{\theta \in H _0}$.) + + +## Hypothesis Testing with P-Values + +Recall that p-values are defined as the following: +A p-value is the probability that we observe a test statistic at least as extreme as the one we calculated, assuming the null hypothesis is true. +It isn't immediately obvious what that definition means, so let's look at some examples to really get an idea of what p-values are, and how they work. + +Let's start very simple and say we have 5 data points: x = <1, 2, 3, 4, 5>. Let's also assume the data were generated +from some normal distribution with a known variance $\sigma^2$ but an unknown mean $\mu_0$. What would be a good guess +for the true mean? +We know that this data could come from *any* normal distribution, so let's make two wild guesses: + +1. The true mean is 100. +2. The true mean is 3. + +Intuitively, we know that 3 is the better guess. But how do we actually determine which of these guesses is more likely? +By looking at the data and asking "how likely was the data to occur, assuming the guess is true?" + +1. What is the probability that we observed x=<1,2,3,4,5> assuming the mean is 100? Probably pretty low. And because the p-value is low, we "reject the null hypothesis" that $\mu_0 = 100$. +2. What is the probability that we observed x=<1,2,3,4,5> assuming the mean is 3? Seems reasonable. However, something to be careful of is that p-values do not **prove** anything. Just because it is probable for the true mean to be 3, does not mean we know the true mean is 3.
If we have a high p-value, we "fail to reject the null hypothesis" that $\mu_0 = 3$. + +What do "low" and "high" mean? That is where your significance level $\alpha$ comes back into play. We consider a p-value low if the p-value is less than $\alpha$, and high if it is greater than $\alpha$. + +## Two-Tailed Tests + +Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$ and known variance $\sigma^2$. + +Derive a hypothesis test of size $\alpha$ for testing + +$$ +\begin{aligned} +& H _0: \mu=\mu_0 \\ +& H _1: \mu \neq \mu_0 +\end{aligned} +$$ + +We will look at the sample mean $\bar{X}$ and reject if it is either too high or too low. + +### Step One +Choose an estimator for μ. + +$$ +\widehat{\mu}=\bar{X} +$$ + +### Step Two +Choose a test statistic or Give the “form” of the test. + + +Reject $H _0$, in favor of $H _1$ if either $\overline{ X }< c$ or $\bar{X}>d$ for some $c$ and $d$ to be determined. + +Easier to make it symmetric! +Reject $H _0$, in favor of $H _1$ if either + +$$ +\begin{aligned} +&\overline{ X }>\mu_0+ c \\ +&\overline{ X }<\mu_0- c +\end{aligned} +$$ +for some $c$ to be determined. + +### Step Three +Find c.
+ +$$ +\begin{aligned} +\alpha &=\max _{\mu=\mu_0} P (\text { Type I Error }) \\ +&=\max _{\mu=\mu_0} P \left(\text { Reject } H _0 ; \mu\right) \\ +&= P \left(\text { Reject } H _0 ; \mu_0\right) +\end{aligned} +$$ + +$$ +\begin{aligned} +&\alpha= P \left(\overline{ X }<\mu_0- c \text { or } \overline{ X }>\mu_0+ c ; \mu_0\right) \\ +&=1- P \left(\mu_0- c \leq \overline{ X } \leq \mu_0+ c ; \mu_0\right) +\end{aligned} +$$ + +$$ +\begin{gathered} +\alpha=1- P \left(\frac{- c }{\sigma / \sqrt{ n }} \leq Z \leq \frac{ c }{\sigma / \sqrt{ n }}\right) \\ +1-\alpha= P \left(\frac{- c }{\sigma / \sqrt{ n }} \leq Z \leq \frac{ c }{\sigma / \sqrt{ n }}\right) +\end{gathered} +$$ + +$$ +\begin{gathered} +\frac{c}{\sigma / \sqrt{n}}=z_{\alpha / 2} \\ +c=z_{\alpha / 2} \frac{\sigma}{\sqrt{n}} +\end{gathered} +$$ + +### Step Four +Conclusion + +Reject $H _0$, in favor of $H _1$, if + +$$ +\begin{aligned} +&\overline{ X }>\mu_0+ z _{\alpha / 2} \frac{\sigma}{\sqrt{n}} \\ +&\overline{ X }<\mu_0- z _{\alpha / 2} \frac{\sigma}{\sqrt{ n }} +\end{aligned} +$$ + + +### Example +In 2019, the average health care annual premium for a family of 4 in the United States was reported to be $\$ 6,015$. + +In a more recent survey, 100 randomly sampled families of 4 reported an average annual health care premium of $\$ 6,177$. +Can we say that the true average, for all families of 4, is currently different from the 2019 average? +$$ +\sigma=814 \quad \text { Use } \alpha=0.05 +$$ + +Assume that annual health care premiums are normally distributed with a standard deviation of $\$ 814$. +Let $\mu$ be the true average for all families of 4.
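The arithmetic for this example can be sketched with `scipy` before working through the steps by hand (the values $\bar{x} = 6177$, $\sigma = 814$, $n = 100$, $\alpha = 0.05$ come from the problem statement above):

```{code-cell}
from math import sqrt
from scipy.stats import norm

mu0, xbar, sigma, n, alpha = 6015, 6177, 814, 100, 0.05

z = norm.ppf(1 - alpha / 2)          # two-sided critical value, about 1.96
half_width = z * sigma / sqrt(n)     # distance from mu0 to each rejection cutoff
lower, upper = mu0 - half_width, mu0 + half_width

print(round(lower, 1), round(upper, 1))
print("reject H0" if xbar > upper or xbar < lower else "fail to reject H0")
```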
+Hypotheses: + +$$ +\begin{aligned} +& H _0: \mu=6015 \\ +& H _1: \mu \neq 6015 +\end{aligned} +$$ + +$$ +\begin{aligned} +&\bar{x}=6177 \quad \sigma=814 \quad n=100 \\ +&z_{\alpha / 2}=z_{0.025}=1.96 \\ +&\text { In R: qnorm(0.975) } \\ +&6015+1.96 \frac{814}{\sqrt{100}}=6174.5 \\ +&6015-1.96 \frac{814}{\sqrt{100}}=5855.5 +\end{aligned} +$$ + +We reject $H _0$, in favor of $H _1$. The data suggests that the true current average, for all families of 4 , is different than it was in 2019. + +```{image} https://cdn.mathpix.com/snip/images/_oA87qNHdN5Ozd0kgQL7PxguB7Yc7zoi__lLKXJGuZU.original.fullsize.png +:align: center +:alt: Errors in Hypothesis Testing +:width: 80% +``` + +## Hypothesis Tests for Proportions + +A random sample of 500 people in a certain country which is about to have a national election were asked whether they preferred "Candidate A" or "Candidate B". +From this sample, 320 people responded that they preferred Candidate A. + +Let $p$ be the true proportion of the people in the country who prefer Candidate A. + +Test the hypotheses +$H _0: p \leq 0.65$ versus +$H _1: p>0.65$ +Use level of significance $0.10$. +We have an estimate + +$$ +\hat{p}=\frac{320}{500}=\frac{16}{25} +$$ + + +### The Model + +Take a random sample of size $n$. +Record $X_1, X_2, \ldots, X_n$ where +$X_i= \begin{cases}1 & \text { person i likes Candidate A } \\ 0 & \text { person i likes Candidate B }\end{cases}$ +Then $X_1, X_2, \ldots, X_n$ is a random sample from the Bernoulli distribution with parameter $p$. + +Note that, with these 1's and 0's, +$$ +\begin{aligned} +\hat{p} &=\frac{\# \text { in the sample who like A }}{\# \text { in the sample }} \\ +&=\frac{\sum_{ i =1}^{ n } X _{ i }}{ n }=\overline{ X } +\end{aligned} +$$ +By the Central Limit Theorem, $\hat{p}=\overline{ X }$ has, for large samples, an approximately normal distribution. 
+ +$$ +\begin{aligned} +E[\hat{p}] &=E\left[X_1\right]=p \\ +\operatorname{Var}[\hat{p}] &=\frac{\operatorname{Var}\left[X_1\right]}{n}=\frac{p(1-p)}{n} +\end{aligned} +$$ +So, $\quad \hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right)$ + +$$ +\hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right) +$$ +In particular, +$$ +\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} +$$ +behaves roughly like a $N(0,1)$ as $n$ gets large. + +$n >30$ is a rule of thumb to apply to all distributions, but we can (and should!) do better with specific +distributions. + +- $\hat{p}$ lives between 0 and 1. +- The normal distribution lives between $-\infty$ and $\infty$. +- However, $99.7 \%$ of the area under a $N(0,1)$ curve lies between $-3$ and 3 , + + +$$ +\begin{aligned} +&\hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \\ +&\Rightarrow \sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}} +\end{aligned} +$$ + +Go forward using normality if the interval +$$ +\left(\hat{p}-3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) +$$ +is completely contained within $[0,1]$. + +### Step One + +Choose a statistic. +$\widehat{p}=$ sample proportion for Candidate $A$ + +### Step Two + +Form of the test. +Reject $H _0$, in favor of $H _1$, if $\hat{ p }> c$. + +### Step Three +Use $\alpha$ to find $c$ +Assume normality of $\hat{p}$ ? +It is a sample mean and $n>30$. 
+- The interval +$$ +\left(\hat{p}-3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) +$$ +is $(0.5756,0.7044)$ + +$$ +\begin{aligned} +\alpha &=\max _{p \in H_0} P (\text { Type I Error }) \\ +&=\max _{p \leq 0.65} P \left(\text { Reject } H _0 ; p \right) \\ +&=\max _{ p \leq 0.65} P (\hat{ p }> c ; p ) +\end{aligned} +$$ + +$$ +\begin{aligned} +\alpha &=\max _{p \leq 0.65} P\left(\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}>\frac{c-p}{\sqrt{\frac{p(1-p)}{n}}} ; p\right) \\ +& \approx \max _{p \leq 0.65} P\left(Z>\frac{c-p}{\sqrt{\frac{p(1-p)}{n}}}\right) +\end{aligned} +$$ + +$$ +\begin{aligned} +0.10 &=\max _{p \leq 0.65} P \left(Z>\frac{c-p}{\sqrt{\frac{p(1-p)}{n}}}\right) \\ +&=P\left(Z>\frac{c-0.65}{\sqrt{\frac{0.65(1-0.65)}{n}}}\right) \\ +& \Rightarrow \frac{c-0.65}{\sqrt{\frac{0.65(1-0.65)}{n}}}=z_{0.10} +\end{aligned} +$$ + +Reject $H _0$ if + +$$ +\hat{p}>0.65+z_{0.10} \sqrt{\frac{0.65(1-0.65)}{n}} +$$ +Formula + +$$ +\hat{p}> p +z_{0.10} \sqrt{\frac{p(1-p)}{n}} +$$ + + + diff --git a/docs/probability/random_variable.md b/docs/probability/random_variable.md index 6abac7e..8c33771 100644 --- a/docs/probability/random_variable.md +++ b/docs/probability/random_variable.md @@ -335,7 +335,11 @@ $E[g(X)]=\int_{-\infty}^{\infty} g(x) f_{X}(x)) d x$ - Measures of **spread** of a distribution. - Variance is a measure of dispersion. -Defined as $\sigma^2$ or V(X). +### Denoted by + +$$ +\large \sigma^2 \text{ or } V(X). +$$ $$ V(X) = E[(X - E[X])^2] = E[(X - \mu)^2] = E[X^2] - E[X]^2 diff --git a/docs/probability/what_is_probability.md b/docs/probability/what_is_probability.md index 8a251c1..f43955c 100644 --- a/docs/probability/what_is_probability.md +++ b/docs/probability/what_is_probability.md @@ -66,7 +66,7 @@ Tossing a coin, Sample Space = {H,T}.

Image from byjus.com

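For a finite sample space with equally likely outcomes, a probability is just a ratio of counts. A small sketch (the two-dice experiment is an illustrative choice; the event "the faces sum to 7" is an assumption for this example):

```{code-cell}
from itertools import product

# sample space for throwing two dice
omega = list(product(range(1, 7), repeat=2))

# event: the faces sum to 7
event = [w for w in omega if sum(w) == 7]

p = len(event) / len(omega)   # 6 favourable outcomes out of 36
print(p)
```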
### Experiment or Trial -Experiment is any action or process that generates observations or outcomes. \ +Experiment is any action or process that generates observations or outcomes.\ E.g. The tossing of a coin, selecting a card from a deck of cards, throwing a dice etc. ### Outcome or Sample Point diff --git a/docs/requirements.txt b/docs/requirements.txt index 7cdb7b2..17fc53b 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,6 +1,6 @@ -torch~=1.12 -seaborn~=0.11 +torch~=1.13 +seaborn~=0.12 scipy~=1.9 -myst-nb~=0.16 -sphinx-design~=0.2 +myst-nb~=0.17 +sphinx-design~=0.3 sphinx-copybutton \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 7822bd8..5f88124 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,4 @@ -Sphinx~=5.1 +Sphinx~=5.3 -r docs/requirements.txt sphinx_rtd_theme sphinx-autobuild \ No newline at end of file