In [31]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
DATA_FILE = './case_study_files/IMB 737 Sericulture  DATA.xlsx'
sheet_names = {0: 'Hypotheis Test Data', 1:'Regression Data' }

---
INSTRUCTIONS:
1.	Answer all the questions listed below using the data provided along with the case.
2.	For each question, you must clearly state the following:
    * a.	Type of test that you are using (One sample Z or t; Two Sample Z or t; non-parametric test and so on)
    * b.	Clearly state the null and alternative hypotheses
    * c.	The test statistic value (mathematical form)
    * d.	The p-value (or critical value)
    * e.	The decision regarding the null hypothesis (reject or fail to reject)
3.	Provide the response in a Word or PDF document.
4.	The last date for submission of the completed assignments is 16 August 2020.
---

## Question 1

>	Jayalaxmi Agro Tech (JAT) believes that the average income per acre from sericulture is at least Rs. 35,000 with a standard deviation of Rs. 40,000. Apply an appropriate hypothesis test to check this claim.

### Solution:
#### Type of test
> `1 sample Z test`
#### Null Hypothesis
> Avg Income
$
\begin{align}
\mu \leq 35000
\end{align}
$
#### Alternative Hypothesis
> Avg Income 
$
\begin{align}
\mu > 35000
\end{align}
$

####  The test statistic value
$
\begin{align}
\alpha &= 0.05 \\
\sigma &= 40000 \\
\mu_0 &= 35000 \\
\bar{x} &= 36741.29 \\
n &= 1218 \\
Z &= \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
\end{align}
$ 

* `Z statistic value` = 1.519271305296625

* `critical value` (right tail, $\alpha = 0.05$) = 1.6448536269514715

#### Decision on the Null Hypothesis
> **`Retain`**, since the Z statistic (1.5193) is less than the critical value (1.6449).
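
The calculation above can be reproduced with a minimal sketch that uses only the Python standard library. The variable names are illustrative, not taken from the original notebook, and the right-tailed direction is assumed from the positive critical value quoted above.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the solution above (variable names are illustrative)
mu_0 = 35000       # hypothesized mean income per acre (Rs.)
sigma = 40000      # population standard deviation (Rs.)
x_bar = 36741.29   # sample mean, rounded as in the text
n = 1218           # sample size
alpha = 0.05       # significance level

standard_error = sigma / sqrt(n)
z_stat = (x_bar - mu_0) / standard_error

# Right-tailed critical value and p-value from the standard normal distribution
z_critical = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z_stat)

print(f"Z statistic    = {z_stat:.4f}")
print(f"Critical value = {z_critical:.4f}")
print(f"p-value        = {p_value:.4f}")
print("Reject H0" if z_stat > z_critical else "Retain H0")
```

Comparing `p_value` with `alpha` leads to the same decision as comparing the statistic with the critical value.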

---


## Question 2
> In the context of question 1, what is the probability of incorrectly concluding that the average income per acre is less than Rs. 35,000 when in fact the average income per acre is Rs. 36,000. What is the power of the hypothesis test?

### Solution:
> Concluding that avg income < 35000 when it is in fact 36000 means retaining the null hypothesis when it should be rejected. The required probability is therefore the `Type II error` $\beta$.

#### Type of test

> 1 sample z test

#### given values

$ 
\mu_1 = 35000 \\
\mu_2 = 36000 \\
\sigma = 40000 \\
n = 1218 \\
\alpha = 0.05 \\
$

critical Z value at $\alpha = 0.05$ = 1.6448536269514715 (computed in question 1)

#### critical value of the sample mean

$
\bar{X}_{critical} = \mu_1 + z_{\alpha} \times \frac{\sigma}{\sqrt{n}}
$

##### substituting


$\bar{X}_{critical} = 35000  +  1.6448536269514715 \times \frac{40000}{\sqrt{1218}} \approx 36885.23$

#### find the probability that the sample mean stays below 36885.23 when the true mean is 36000

$P(\bar{X} \leq 36885.23 \mid \mu = \mu_2) = P\left( Z \leq \frac{36885.23 - 36000}{\sigma / \sqrt{n}} \right) = P(Z \leq 0.7724) \approx 0.780$

#### Type II Error $\beta$
$\beta \approx 0.780$

#### Power of test

$1 - \beta \approx 0.220$
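
As a cross-check, the Type II error and power can be sketched in a few lines of standard-library Python. The helper name `type_two_error` is hypothetical, and a right-tailed test at $\alpha = 0.05$ is assumed. The retention threshold is built from the critical z value (about 1.645), not from the observed Z statistic of question 1. Plugging in the observed 1.519 instead would give roughly 0.741.

```python
from math import sqrt
from statistics import NormalDist


def type_two_error(mu_0, mu_true, sigma, n, alpha=0.05):
    """Return (beta, power) for a right-tailed one-sample Z test.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    std_normal = NormalDist()
    standard_error = sigma / sqrt(n)
    # Largest sample mean for which H0 is still retained
    x_critical = mu_0 + std_normal.inv_cdf(1 - alpha) * standard_error
    # Probability of staying below that threshold when the true mean is mu_true
    beta = std_normal.cdf((x_critical - mu_true) / standard_error)
    return beta, 1 - beta


beta, power = type_two_error(mu_0=35000, mu_true=36000, sigma=40000, n=1218)
print(f"beta  = {beta:.4f}")
print(f"power = {power:.4f}")
```

Calling the helper with a larger `mu_true` shows how the power grows as the true mean moves further from the hypothesized 35000.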

---

## Question 3	
>Realistically, it is rare that the population standard deviation is known. If the population standard deviation is not known, validate the hypothesis set up in question 1 using an appropriate test. 

### Solution:
#### Type of test
> `1 sample T test`
#### Null Hypothesis
> Avg Income
$
\begin{align}
\mu \leq 35000
\end{align}
$
#### Alternative Hypothesis
> Avg Income 
$
\begin{align}
\mu > 35000
\end{align}
$

####  The test statistic value
$
\begin{align}
\alpha &= 0.05 \\
s &= 41175 \\
\mu_0 &= 35000 \\
\bar{x} &= 36741.29 \\
n &= 1218 \\
t_{value} &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
\end{align}
$ 

* `T statistic value` = 1.4758861486893562

* `critical value` ($t_{0.05}$ with $n - 1 = 1217$ degrees of freedom) = 1.646106656353215

#### Decision on the Null Hypothesis
> **`Retain`**, since the T statistic (1.4759) is less than the critical value (1.6461).
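
The t statistic can be checked with a short sketch. The Python standard library has no t distribution, so the critical value is copied from the solution above and not recomputed. With SciPy installed it could be obtained from `scipy.stats.t.ppf(1 - alpha, df)`. Variable names are illustrative.

```python
from math import sqrt

# Values taken from the solution above (variable names are illustrative)
mu_0 = 35000      # hypothesized mean income per acre (Rs.)
s = 41175         # sample standard deviation, rounded as in the text
x_bar = 36741.29  # sample mean, rounded as in the text
n = 1218          # sample size

df = n - 1
t_stat = (x_bar - mu_0) / (s / sqrt(n))

# Critical value copied from the solution above (right tail, alpha = 0.05);
# with SciPy it could be computed as scipy.stats.t.ppf(0.95, df)
t_critical = 1.646106656353215

print(f"t statistic    = {t_stat:.4f} (df = {df})")
print(f"Critical value = {t_critical:.4f}")
print("Reject H0" if t_stat > t_critical else "Retain H0")
```

Because the inputs here are rounded, the statistic agrees with the value quoted above only to about four decimal places.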

---

## Question 4.
>JAT believes that there is significant gender disparity among sericulturists in Karnataka. The claim is that the proportion of female sericulturists is less than 15%. Conduct an appropriate hypothesis test to validate this claim.

#### Null Hypothesis $H_0$
>p >= 0.15

#### Alternate Hypothesis $H_A$
>p < 0.15

#### Type of test

>1 sample Z test for proportion. The normal approximation is appropriate because $n \times \hat{p} \times ( 1 - \hat{p}) \geq 10$.

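#### Test statistic value
$
\alpha = 0.05 \\
\hat{p} = 0.09523809523809523 \\
n = 1218 \\
p = 0.15 \\
\begin{align} \\
z_{stat}  = \frac{\hat{p} - p}{\sqrt{\frac{p \times (1- p)}{n}}} \\
\end{align} \\
$

`Z statistic value:` approximately -5.3524 (using the hypothesized $p = 0.15$ in the standard error, as in the formula above)

`Critical Value:` -1.6448536269514726

#### Decision on the Null Hypothesis
> `Reject`, since the Z statistic lies below the left-tail critical value.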
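
A minimal standard-library sketch of the proportion test follows. It uses the hypothesized proportion $p = 0.15$ in the standard error, exactly as the formula above is written. Substituting the sample proportion $\hat{p}$ there instead gives a statistic near -6.51. The decision is the same either way. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the solution above (variable names are illustrative)
p_0 = 0.15                    # hypothesized proportion of female sericulturists
p_hat = 0.09523809523809523   # sample proportion
n = 1218                      # sample size
alpha = 0.05                  # significance level

# Rule-of-thumb check for the normal approximation
normal_ok = n * p_hat * (1 - p_hat) >= 10

# Standard error under H0 uses the hypothesized proportion p_0
se_null = sqrt(p_0 * (1 - p_0) / n)
z_stat = (p_hat - p_0) / se_null

# Left-tailed critical value
z_critical = NormalDist().inv_cdf(alpha)

print(f"Normal approximation ok: {normal_ok}")
print(f"Z statistic    = {z_stat:.4f}")
print(f"Critical value = {z_critical:.4f}")
print("Reject H0" if z_stat < z_critical else "Retain H0")
```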

---

## Question 5
>Intuitively, the average income per acre of sericulturists who did not receive training on sericulture should be less than that of sericulturists who received training. Conduct an appropriate hypothesis test to validate this claim.

$\mu_1$ = Average income of sericulturists without training

$\mu_2$ = Average income of sericulturists with training

#### Null Hypothesis $H_0$

$\mu_1 -  \mu_2 \geq 0$

#### Alternative Hypothesis $H_A$

$\mu_1 -  \mu_2 < 0$

#### Type of test

> 2 sample T test with `unequal variance`

#### Test statistic

$
\alpha = 0.05 \\
n_1 = 849 \\
n_2 = 369 \\
\bar{x}_1 = 40084.15415223522 \\
\bar{x}_2 = 29049.992304885367 \\
s_1  = 42456.43301375929 \\
s_2 = 36988.7315378972 \\
S_p^2 = 1669717736.3792605 \\
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \\
t_{stat} = 4.569525905038689 \\
df = 798 \\
P_{value} \text{ (two-sided)} = 5.663339262546472e-06
$

Under $H_0$ the hypothesized difference $\mu_1 - \mu_2$ is set to 0. For the left-tailed alternative the p-value is $1 - P_{value}/2 \approx 1$, which agrees with the critical-value decision below.

`Critical value at alpha = 0.05 and degree of freedom 798`:  -1.646765344238518

`T stat value`: 4.569525905038689

#### Decision on the Null Hypothesis
> `Retain`: this is a left-tailed test and the T statistic (4.5695) is far greater than the critical value (-1.6468), so it does not fall in the rejection region.
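
The unequal-variance statistic can be reproduced from the summary values with a short standard-library sketch. The helper name `welch_t` is hypothetical. It uses the textbook Welch-Satterthwaite formula for the degrees of freedom, which lands close to, though not necessarily exactly on, the 798 quoted above. The critical value is copied from the text and not recomputed.

```python
from math import sqrt


def welch_t(mean_1, s_1, n_1, mean_2, s_2, n_2):
    """Return (t statistic, Welch-Satterthwaite df) from summary statistics.

    Hypothetical helper for illustration; assumes H0: mu_1 - mu_2 = 0.
    """
    v_1 = s_1 ** 2 / n_1   # squared standard error of group 1 mean
    v_2 = s_2 ** 2 / n_2   # squared standard error of group 2 mean
    t_stat = (mean_1 - mean_2) / sqrt(v_1 + v_2)
    df = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))
    return t_stat, df


# Summary values taken from the solution above
t_stat, df = welch_t(
    mean_1=40084.15415223522, s_1=42456.43301375929, n_1=849,
    mean_2=29049.992304885367, s_2=36988.7315378972, n_2=369,
)

# Left-tail critical value copied from the text (alpha = 0.05)
t_critical = -1.646765344238518

print(f"t statistic = {t_stat:.4f}, df = {df:.1f}")
print("Reject H0" if t_stat < t_critical else "Retain H0")
```

If SciPy is available, `scipy.stats.ttest_ind_from_stats(..., equal_var=False)` accepts the same summary values. It reports a two-sided p-value by default, which has to be converted for a one-tailed test.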
---