In [31]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
DATA_FILE = './case_study_files/IMB 737 Sericulture  DATA.xlsx'
sheet_names = {0: 'Hypotheis Test Data', 1:'Regression Data' }

---
INSTRUCTIONS:
1.	Answer all the questions listed below using the data provided along with the case.
2.	For each question, you must clearly state the following:
    * a.	Type of test that you are using (One sample Z or t; Two Sample Z or t; non-parametric test and so on)
    * b.	Clearly state the null and alternative hypotheses
    * c.	The test statistic value (mathematical form)
    * d.	The p-value (or critical value)
    * e.	The decision regarding the null hypothesis (reject or fail to reject)
3.	Provide the response in a word or pdf document.
4.	The last date for submission of the completed assignments is 16 August 2020.
---

## Question 1

>	Jayalaxmi Agro Tech (JAT) believes that the average income per acre from sericulture is at least Rs. 35,000 with a standard deviation of Rs. 40,000. Apply an appropriate hypothesis test to check this claim.

### Solution:
#### Type of test
> `1 sample Z test`
#### Null Hypothesis
> Avg Income
$
\begin{align}
\mu \leq 35000
\end{align}
$
#### Alternative Hypothesis
> Avg Income 
$
\begin{align}
\mu > 35000
\end{align}
$

####  The test statistic value
$
\begin{align}
\alpha = 0.05 \\
\sigma = 40000 \\
\mu = 35000 \\
\bar{x}  =  36741.29 \\
n = 1218 \\
Z = \frac{(\bar{x}  - \mu)}{ (\sigma / \sqrt{n} )} \\
\end{align}
$ 

* `Z statistic value` = 1.519271305296625

* `critical value` ($z_{0.95}$, right-tailed) = 1.6448536269514715

#### Null Hypothesis
> **`Retain`** - the Z statistic (1.519) is below the critical value (1.645)
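
The calculation above can be reproduced from the summary statistics alone. The following is a minimal sketch, assuming a right-tailed test at $\alpha = 0.05$; the variable names are illustrative and are not taken from the original notebook.

```python
import math
from scipy import stats

# Summary statistics quoted in the solution above
x_bar, mu_0, sigma, n, alpha = 36741.29, 35000, 40000, 1218, 0.05

# One-sample Z statistic: (sample mean - hypothesised mean) / standard error
z_stat = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Right-tailed critical value and p-value from the standard normal distribution
z_crit = stats.norm.ppf(1 - alpha)
p_value = stats.norm.sf(z_stat)

decision = "Reject" if z_stat > z_crit else "Retain"
print(z_stat, z_crit, p_value, decision)
```

Comparing the p-value with $\alpha$ gives the same decision as comparing the statistic with the critical value.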

---


## Question 2
> In the context of question 1, what is the probability of incorrectly concluding that the average income per acre is less than Rs. 35,000 when in fact the average income per acre is Rs. 36,000. What is the power of the hypothesis test?

### Solution:
> Concluding that the avg income is < 35000 when it is in fact 36000 is a `Type II error` (retaining the null hypothesis when it should be rejected). Its probability is $\beta$, and the power of the test is $1 - \beta$.

#### Type of test

> 1 sample z test

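#### given values

$ 
\mu_1 = 35000 \\
\mu_2 = 36000 \\
$

critical Z value at $\alpha = 0.05$ = 1.6448536269514715 (right-tailed, from question 1)

#### critical value of the sample mean

$
\bar{x}_{critical} = \mu_1 + z_{\alpha} * \frac{\sigma}{\sqrt{n}}
$

##### substituting


$\bar{x}_{critical} = 35000  +  1.6448536269514715 * \frac{40000}{\sqrt{1218}} \approx 36885.2$

The null hypothesis is retained whenever the sample mean falls below this cut-off. The cut-off is set by the critical value rather than by the observed Z statistic (1.519), because the Type II error is a property of the rejection region.

#### probability that the sample mean is below 36885.2 when the true mean is 36000

$P(\bar{X} \leq 36885.2) = P\left( Z \leq \frac{36885.2 - 36000}{\sigma / \sqrt{n}} \right) = P(Z \leq 0.772) \approx 0.780$

#### Type II Error $\beta$
$\beta \approx 0.780$

#### Power of test

$1 - \beta \approx 0.220$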
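
A minimal sketch of the Type II error and power computation, assuming the rejection cut-off for the sample mean is set by the right-tailed critical value $z_{0.95}$ (the conventional definition) and that the true mean is 36000. Variable names are illustrative.

```python
import math
from scipy import stats

mu_0, mu_true, sigma, n, alpha = 35000, 36000, 40000, 1218, 0.05
se = sigma / math.sqrt(n)  # standard error of the sample mean

# Cut-off for the sample mean: H0 is retained when x_bar falls below it
x_crit = mu_0 + stats.norm.ppf(1 - alpha) * se

# Type II error: probability of staying below the cut-off when the true mean is mu_true
beta = stats.norm.cdf((x_crit - mu_true) / se)
power = 1 - beta
print(round(x_crit, 2), round(beta, 4), round(power, 4))
```

Because the true mean of 36000 is less than one standard error above 35000, the test has little power to detect the difference at this sample size and standard deviation.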

---

## Question 3	
>Realistically, it is rare that the population standard deviation is known. If the population standard deviation is not known, validate the hypothesis set up in question 1 using an appropriate test. 

### Solution:
#### Type of test
> `1 sample T test`
#### Null Hypothesis
> Avg Income
$
\begin{align}
\mu \leq 35000
\end{align}
$
#### Alternative Hypothesis
> Avg Income 
$
\begin{align}
\mu > 35000
\end{align}
$

####  The test statistic value
$
\begin{align}
\alpha = 0.05 \\
s = 41175 \\
\mu = 35000 \\
\bar{x}  =  36741.29 \\
n = 1218 \\
t_{value} = \frac{(\bar{x}  - \mu)}{ (s / \sqrt{n} )} \\
\end{align}
$ 

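* `T statistic value` = 1.4758861486893562

* `critical value` ($t_{0.95}$ with $n - 1 = 1217$ degrees of freedom) = 1.646106656353215

#### Null Hypothesis
> **`Retain`** - the T statistic (1.476) is below the critical value (1.646)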
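
The same test can be sketched from the summary statistics, assuming a right-tailed test and the sample standard deviation $s = 41175$ quoted above. The variable names are illustrative.

```python
import math
from scipy import stats

x_bar, mu_0, s, n, alpha = 36741.29, 35000, 41175, 1218, 0.05

# One-sample t statistic with the sample standard deviation in the standard error
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Right-tailed critical value and p-value from Student's t with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha, df=n - 1)
p_value = stats.t.sf(t_stat, df=n - 1)

decision = "Reject" if t_stat > t_crit else "Retain"
print(t_stat, t_crit, p_value, decision)
```

With $n = 1218$ the t distribution is almost indistinguishable from the standard normal, which is why the critical value is so close to the one in question 1.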

---

## Question 4.
>JAT believes that there is significant gender disparity among sericulturists in Karnataka. The claim is that the proportion of female sericulturists is less than 15%. Conduct an appropriate hypothesis test to validate this claim.

#### Null Hypothesis $H_0$
>$p \geq 0.15$

#### Alternate Hypothesis $H_A$
>$p < 0.15$

#### Type of test

>1 sample Z test for proportion, as $n \times \hat{p} \times ( 1 - \hat{p}) \geq 10$

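#### Test statistic value
$
\alpha = 0.05 \\
\hat{p} = 0.09523809523809523 \\
n = 1218 \\
p = 0.15 \\
\begin{align}
z_{stat}  = \frac{\hat{p} - p}{\sqrt{\frac{p * (1- p)}{n}}}
\end{align}
$

`Z statistic value:` approximately -5.352 with $p = 0.15$ in the standard error, as in the formula above (-6.510730010003325 results when $\hat{p}$ is used in the standard error instead)

`Critical Value:` -1.6448536269514726

#### Null Hypothesis
> `Reject` - the Z statistic is below the critical value under either form of the standard error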
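
A minimal sketch of the one-sample proportion test, assuming a left-tailed test at $\alpha = 0.05$. It computes the statistic both with the hypothesised proportion in the standard error (the usual form under $H_0$) and with the sample proportion, so the two values can be compared. Variable names are illustrative.

```python
import math
from scipy import stats

p_hat, p_0, n, alpha = 0.09523809523809523, 0.15, 1218, 0.05

# Standard error under H0 uses the hypothesised proportion p_0
z_null = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)

# Variant that plugs the sample proportion into the standard error
z_sample = (p_hat - p_0) / math.sqrt(p_hat * (1 - p_hat) / n)

# Left-tailed critical value
z_crit = stats.norm.ppf(alpha)

decision = "Reject" if z_null < z_crit else "Retain"
print(z_null, z_sample, z_crit, decision)
```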

---

## Question 5
>Intuitively, the average income per acre of sericulturists who did not receive training on sericulture should be less than that of sericulturists who received training. Conduct an appropriate hypothesis test to validate this claim.

### Solution:

$\mu_1$ = Average income of sericulturists without training

$\mu_2$ = Average income of sericulturists with training

#### Null Hypothesis $H_0$

$\mu_1 -  \mu_2 \geq 0$

#### Alternative Hypothesis $H_A$

$\mu_1 -  \mu_2 < 0$

#### Type of test

> 2 sample T test with `unequal variance`

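#### Test statistic

$
\alpha = 0.05 \\
n_1 = 849 \\
n_2 = 369 \\
\bar{x}_1 = 40084.15415223522 \\
\bar{x}_2 = 29049.992304885367 \\
s_1  = 42456.43301375929 \\
s_2 = 36988.7315378972 \\
S_p^2 = 1669717736.3792605 \\
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \quad \mu_1 - \mu_2 = 0 \text{ under } H_0 \\
t_{stat} = 4.569525905038689 \\
P_{value} \text{ (two-sided)} = 5.663339262546472e-06 \\
df = 798
$

For the left-tailed alternative $\mu_1 - \mu_2 < 0$ the relevant p-value is $P(T \leq t_{stat}) = 1 - \frac{5.663339262546472e-06}{2} \approx 1$.

`Critical value at alpha = 0.05 and degree of freedom 798`:  -1.646765344238518

`T stat value`: 4.569525905038689

#### Null Hypothesis
> `Retain` - the T statistic (4.57) is not below the critical value (-1.647). In this sample the farmers without training have the higher average income.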
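
A minimal sketch of the unequal-variance (Welch) two-sample t test computed from the summary statistics above, assuming a left-tailed alternative. The degrees of freedom come from the Welch-Satterthwaite formula, so they may differ slightly from the 798 quoted above. Variable names are illustrative.

```python
import math
from scipy import stats

n1, n2 = 849, 369                                   # without / with training
x1, x2 = 40084.15415223522, 29049.992304885367      # sample means
s1, s2 = 42456.43301375929, 36988.7315378972        # sample standard deviations
alpha = 0.05

# Per-group variance of the sample mean
v1, v2 = s1**2 / n1, s2**2 / n2
t_stat = (x1 - x2) / math.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Left-tailed critical value and p-value
t_crit = stats.t.ppf(alpha, df)
p_left = stats.t.cdf(t_stat, df)

decision = "Reject" if t_stat < t_crit else "Retain"
print(t_stat, df, t_crit, p_left, decision)
```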

---

## Question 6
>JAT believes that farmers who underwent training in sericulture have more awareness about crop insurance, hence they are more likely to buy crop insurance. Use an appropriate hypothesis test to check the claim that the proportion of farmers who took crop insurance is greater among the farmers who underwent training on sericulture. 

### Solution:

$p_1$ = Proportion of farmers with crop insurance among those who received training

$p_2$ = Proportion of farmers with crop insurance among those who did not receive training

#### Null Hypothesis $H_0$
$p_1 \leq p_2$

#### Alternate Hypothesis $H_A$
$p_1 > p_2$

#### Type of Test
>2 sample Z-test (proportion) - Right Tailed Test

$
\alpha = 0.05 \\
n = 1218 \\
n_1 = 369 \\
n_2 = 849 \\
\hat{p_1} = 0.16260162601626016 \\
\hat{p_2} = 0.06478209658421673 \\
\begin{align}
\hat{p} = \frac{\hat{p_1} * n_1 + \hat{p_2} * n_2}{n_1 + n_2} = 0.09441707717569786 \\
\end{align}
$

#### Z Statistics value:
$
\begin{align} 
Z = \frac{(\hat{p_1} - \hat{p_2})}{\sqrt{\hat{p}  * ( 1 - \hat{p}) * (\frac{1}{n_1} + \frac{1}{n_2})}}  = 5.365121545386228
\end{align}
$

#### Critical Value
>1.6448536269514715


#### Null Hypothesis
   >`Rejected` - the Z statistic (5.365) exceeds the critical value (1.645)
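
A minimal sketch of the pooled two-sample proportion Z test, using the sample proportions and group sizes quoted above and assuming a right-tailed test at $\alpha = 0.05$. Variable names are illustrative.

```python
import math
from scipy import stats

n1, n2 = 369, 849                                        # trained / not trained
p1_hat, p2_hat = 0.16260162601626016, 0.06478209658421673
alpha = 0.05

# Pooled proportion under H0 (both groups share one insurance rate)
p_pool = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)

# Z statistic with the pooled standard error
z_stat = (p1_hat - p2_hat) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Right-tailed critical value and p-value
z_crit = stats.norm.ppf(1 - alpha)
p_value = stats.norm.sf(z_stat)

decision = "Reject" if z_stat > z_crit else "Retain"
print(p_pool, z_stat, z_crit, p_value, decision)
```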

---

## Question 7
>Inferring an association between incidence of pest infestations and geographical location might help farmers to emphasize on pest control activities. Use a suitable test of hypothesis to infer whether geographical location and incidence of pest infestations are associated.

### Solution:

#### Null Hypothesis $H_0$
> Geographical location and incidence of pest are independent

#### Alternate Hypothesis $H_A$
> Geographical location and incidence of pest are dependent

#### Type of Test
>Chi square test of independence

In [7]:
import pandas as pd

sheet_names = {0: 'Hypotheis Test Data', 1: 'Regression Data'}
df_hypothesis_test = pd.read_excel('./case_study_files/IMB 737 Sericulture  DATA.xlsx', sheet_name=sheet_names[0])
d1 = df_hypothesis_test
gp_dist = d1.groupby(by=['district'])
# Farmers per district, and farmers affected by pest per district
dist_count = gp_dist[['district']].count()
dist_pest = gp_dist[['affected_by_pest']].sum()
d2 = pd.concat([dist_count, dist_pest], axis=1)
# Same layout for the farmers not affected by pest
d3 = d2.copy()
d3['not_affected_by_pest'] = d2['district'] - d2['affected_by_pest']
d3 = d3.drop(columns='affected_by_pest')
tot_count, tot_pest, tot_no_pest = d2['district'].sum(), d2['affected_by_pest'].sum(), d3['not_affected_by_pest'].sum()
# Expected count under independence = row total * column total / grand total
d2['expected'] = (d2.district / tot_count) * tot_pest
d3['expected'] = (d2.district / tot_count) * tot_no_pest
d2.columns = ['district', 'observations', 'expected']
d3.columns = ['district', 'observations', 'expected']
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the affected rows first, then the not-affected rows
d = pd.concat([d2, d3])
d = d.drop(columns='district')
d.index = d.index.rename('classes')
# Per-cell contribution to the chi-square statistic
d['chi_statistic'] = (d['observations'] - d['expected'])**2 / d['expected']
d

| classes | observations | expected | chi_statistic |
|---|---|---|---|
| Belagavi | 3 | 35.908046 | 30.158686 |
| Bellary | 184 | 150.206897 | 7.602672 |
| Chikballapur | 243 | 202.298851 | 8.188794 |
| Mandya | 17 | 34.896552 | 9.178172 |
| Tumakuru | 169 | 192.689655 | 2.912454 |
| Belagavi | 68 | 35.091954 | 30.860051 |
| Bellary | 113 | 146.793103 | 7.779479 |
| Chikballapur | 157 | 197.701149 | 8.379231 |
| Mandya | 52 | 34.103448 | 9.391618 |
| Tumakuru | 212 | 188.310345 | 2.980186 |


#### Calculate degree of freedom

In [18]:
nr_of_districts = len(df_hypothesis_test['district'].unique())
categories = 2 #affected by pest and not affected by pest
degree_of_freedom = (nr_of_districts-1) * (categories - 1)
degree_of_freedom

4

#### Test statistics

$\alpha = 0.05$

degree_of_freedom = 4

#### Test statistics value:

In [16]:
chi_test_independence = d.chi_statistic.sum()
chi_test_independence

117.43134302392417

#### Critical Value

In [20]:
import scipy.stats
scipy.stats.chi2.ppf(1 - 0.05, degree_of_freedom)

9.487729036781154

#### Null Hypothesis
>`Reject` - the chi-square test is right-tailed and the test statistic (117.43) exceeds the critical value (9.49)
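
As a cross-check, the same statistic can be obtained with `scipy.stats.chi2_contingency`. This is a minimal sketch that hard-codes the observed counts from the table above (affected, not affected) per district instead of reading the Excel file; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

districts = ["Belagavi", "Bellary", "Chikballapur", "Mandya", "Tumakuru"]
# Columns: affected by pest, not affected by pest (counts from the table above)
observed = np.array([
    [3, 68],
    [184, 113],
    [243, 157],
    [17, 52],
    [169, 212],
])

# Returns the statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
chi2_crit = stats.chi2.ppf(1 - 0.05, dof)

decision = "Reject" if chi2 > chi2_crit else "Retain"
print(chi2, p_value, dof, chi2_crit, decision)
```

The `expected` array holds the same expected counts as the `expected` column computed by hand above.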

---

## Question 8
>JAT suspects that there is equal variability in income per acre when the sericulturists are using bivoltine hybrids alone as compared to when they are using a combination of bivoltine hybrids and other hybrids or other hybrids exclusively. Check this claim by conducting an appropriate hypothesis test

### Solution:

#### Null Hypothesis $H_0$
$\sigma_1 = \sigma_2$

#### Alternate Hypothesis $H_A$
$\sigma_1 \neq \sigma_2$

#### Type of test 
>F test for equality of population variances

#### Test Statistic

$
\alpha = 0.05 \\
n_1 = 402 \\
n_2 = 817 \\
s_1^2 = 1431970641.7151263 \\
s_2^2 = 1786531982.6340976 \\ 
$

$
\begin{align}
F_{n_1-1,n_2-1} = \frac{s_1^2}{s_2^2}
\end{align}
$

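#### F Statistic:
F_stat = `0.8015365275486425`

#### Critical Value:
Upper critical value $F_{0.975,\,401,\,816}$ = `1.181471943529887`

Lower critical value $F_{0.025,\,401,\,816} \approx 0.84$

The test is two-tailed, so the null hypothesis is rejected when F_stat falls below the lower critical value or above the upper one. Since F_stat < 1, the relevant comparison is with the lower critical value.

#### Null Hypothesis
>`Reject` - F_stat (0.80) is below the lower critical value (about 0.84)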
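
A minimal sketch of the two-tailed F test for equality of variances, computed from the sample variances and sizes quoted above. It evaluates both critical values and a two-sided p-value, assuming $\alpha = 0.05$ is split equally between the tails. Variable names are illustrative.

```python
from scipy import stats

n1, n2 = 402, 817
var1, var2 = 1431970641.7151263, 1786531982.6340976
alpha = 0.05

f_stat = var1 / var2
dfn, dfd = n1 - 1, n2 - 1

# Two-tailed test: alpha/2 in each tail of the F distribution
f_lower = stats.f.ppf(alpha / 2, dfn, dfd)
f_upper = stats.f.ppf(1 - alpha / 2, dfn, dfd)

# Two-sided p-value: twice the smaller tail probability
p_value = 2 * min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))

decision = "Reject" if (f_stat < f_lower or f_stat > f_upper) else "Retain"
print(f_stat, f_lower, f_upper, p_value, decision)
```

An equivalent convention puts the larger sample variance in the numerator, so that only the upper critical value is needed.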

---

## Question 9
>From a policy perspective, some districts of Karnataka might require more attention in terms of aid provided. One way to validate this is to check if there is significant disparity in the average income per acre of sericulture farmers in different districts. Use a suitable statistical method to check this.

### Solution:

#### Null Hypothesis $H_0$
*The average income per acre is same for all districts*

$\mu_1 = \mu_2 =\mu_3 = \mu_4 .... $

#### Alternate Hypothesis $H_A$

*Not all district means are equal* (at least one $\mu_i$ differs from the others)

#### Type of Test
> `One way analysis of variance (ANOVA)`

#### Statistic

$\alpha = 0.05$

k = 5

n = 1218

$F_{value} = 92.58682070439953$

$p_{value} = 9.408845402203028e-69$

$critical_{value} = F_{0.95,\,4,\,1213} \approx 2.38$

#### Null Hypothesis
>`Reject` - Right tailed test - F-value is higher than critical value

In [50]:
import pandas as pd
import scipy.stats
alpha = 0.05
sheet_names = {0: 'Hypotheis Test Data', 1:'Regression Data' }
df = pd.read_excel('./case_study_files/IMB 737 Sericulture  DATA.xlsx', sheet_name=sheet_names[0])
districts = df['district'].unique()
k = len(districts)
n = len(df)
# One array of income per acre for each district
input_array = [df[df.district == district]['income_per_acre'].to_numpy() for district in districts]
F, p_value = scipy.stats.f_oneway(*input_array)
# ANOVA is right-tailed, so the whole alpha sits in the upper tail (1 - alpha, not 1 - alpha/2)
critical_value = scipy.stats.f.ppf(1 - alpha, k - 1, n - k)
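
To show what `f_oneway` computes, the sketch below builds the F statistic by hand from the between-group and within-group sums of squares and compares it with SciPy's result. The three groups are hypothetical values chosen for illustration; they are not taken from the case data.

```python
import numpy as np
from scipy import stats

# Hypothetical incomes for three groups (illustrative values, not from the case data)
groups = [
    np.array([30.0, 35.0, 32.0, 38.0]),
    np.array([45.0, 50.0, 47.0, 52.0]),
    np.array([28.0, 31.0, 27.0, 33.0]),
]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_value = stats.f_oneway(*groups)

# Right-tailed critical value at alpha = 0.05
f_crit = stats.f.ppf(1 - 0.05, k - 1, n - k)
print(f_manual, f_scipy, p_value, f_crit)
```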