Let’s put our theoretical knowledge into practice by implementing the two-sample Z-test on a coronavirus dataset. You can download the dataset [here](https://drive.google.com/file/d/1IJija1WXjC6gDtVAczyTsvSOmLe3iu6Y/view.)

This dataset is taken from the Johns Hopkins University repository, which you can find [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports).

**The dataset has the following features:**

- Province/State
- Country/Region
- Last Update
- Confirmed
- Deaths
- Recovered
- Latitude
- Longitude

We have also added Temperature and Humidity features for each Latitude and Longitude using Python’s weather API, Pyweatherbit. A common perception about COVID-19 is that warm climates are more resistant to the outbreak, and we want to verify this using hypothesis testing.

### Hypothesis Declaration

- Null Hypothesis: Temperature doesn’t affect the COVID-19 outbreak
- Alternate Hypothesis: Temperature does affect the COVID-19 outbreak

***We treat temperatures below 24 as Cold Climate and temperatures of 24 or above as Hot Climate in our dataset.***
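
Because the test that follows compares the average number of confirmed cases in the two temperature groups, the hypotheses can also be written in terms of group means. The symbols $\mu_{\text{hot}}$ and $\mu_{\text{cold}}$ are notation introduced here for illustration and stand for the mean confirmed count in each group:

$$
H_0: \mu_{\text{hot}} = \mu_{\text{cold}} \qquad H_1: \mu_{\text{hot}} \neq \mu_{\text{cold}}
$$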

In [1]:
import pandas as pd
import numpy as np
corona = pd.read_csv('./data/Corona_Updated.csv')

In [2]:
corona['Temp_Cat'] = corona['Temprature'].apply(lambda x : 0 if x < 24 else 1)
corona_t = corona[['Confirmed', 'Temp_Cat']]

In [3]:
corona_t.head()

| | Confirmed | Temp_Cat |
|---|---|---|
| 0 | 67760 | 0 |
| 1 | 10149 | 0 |
| 2 | 8042 | 0 |
| 3 | 7513 | 0 |
| 4 | 1784 | 0 |


### Z-test 

In [10]:
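from numpy import sqrt
from scipy.stats import norm

def TwoSampZ(X1, X2, sigma1, sigma2, N1, N2):
    # Standard error of the difference between the two sample means
    ovr_sigma = sqrt(sigma1**2/N1 + sigma2**2/N2)
    # Z statistic for the difference in means
    z = (X1 - X2)/ovr_sigma
    # Two-sided p-value from the standard normal distribution
    pval = 2*(1 - norm.cdf(abs(z)))
    return z, pval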
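
For reference, the function above computes the following statistic and two-sided p-value. The symbols mirror the function arguments: $\bar{X}_1, \bar{X}_2$ are the sample means, $\sigma_1, \sigma_2$ the standard deviations, $N_1, N_2$ the sample sizes, and $\Phi$ denotes the standard normal CDF:

$$
z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{N_1} + \dfrac{\sigma_2^2}{N_2}}}, \qquad p = 2\left(1 - \Phi(|z|)\right)
$$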

In [11]:
d1 = corona_t[(corona_t['Temp_Cat']==1)]['Confirmed']
d2 = corona_t[(corona_t['Temp_Cat']==0)]['Confirmed']

In [12]:
m1, m2 = d1.mean(), d2.mean()
sd1, sd2 = d1.std(), d2.std()
n1, n2 = d1.shape[0], d2.shape[0]

In [13]:
print(m1)
print(m2)
print(sd1)
print(sd2)
print(n1)
print(n2)

26.548387096774192
672.9085714285715
45.93969867095881
5228.628797929271
31
175


In [14]:
z, p = TwoSampZ(m1, m2, sd1, sd2, n1, n2)

In [15]:
z_score = np.round(z,8)
p_val = np.round(p,6)

In [17]:
print(z_score)
print(p_val)

-1.63497531
0.102054
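
As a quick sanity check, the z-score and p-value can be reproduced from the summary statistics printed earlier, using only the standard library. This is a minimal sketch: the variable names are illustrative, and `math.erfc` is swapped in for `scipy.stats.norm` to obtain the two-sided normal p-value.

```python
from math import sqrt, erfc

# Summary statistics copied from the printed output of In [13]
m1, sd1, n1 = 26.548387096774192, 45.93969867095881, 31   # hot group (Temp_Cat == 1)
m2, sd2, n2 = 672.9085714285715, 5228.628797929271, 175   # cold group (Temp_Cat == 0)

# Standard error of the difference between the two sample means
se = sqrt(sd1**2 / n1 + sd2**2 / n2)

# Z statistic and two-sided p-value.
# erfc(|z| / sqrt(2)) is mathematically equal to 2 * (1 - Phi(|z|)).
z = (m1 - m2) / se
p = erfc(abs(z) / sqrt(2))

print(round(z, 8), round(p, 6))
```

The printed values should agree with the notebook output above up to rounding.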


In [18]:
if (p_val<0.05):
    Hypothesis_Status = 'Reject Null Hypothesis : Significant'
else:
    Hypothesis_Status = 'Do not reject Null Hypothesis : Not Significant'

In [19]:
print (p_val)
print (Hypothesis_Status)

0.102054
Do not reject Null Hypothesis : Not Significant


Thus, at the 5% significance level, we do not have enough evidence to reject our Null Hypothesis that temperature doesn’t affect the COVID-19 outbreak.

There are certain limitations to applying the Z-test to COVID-19 datasets:

- The sample may not be representative of the population
- The sample variance may not be a good estimator of the population variance
- Variability in each state’s capacity to deal with the pandemic
- Socio-economic factors
- Early outbreaks in certain places
- Some states may be withholding data for geopolitical reasons
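
One possible way to hedge against the second limitation is Welch's t-test, which uses the same statistic but refers it to a t-distribution with estimated degrees of freedom, so the sample variances are not treated as known population values. The sketch below assumes SciPy is available and reuses the summary statistics printed earlier. It is an alternative technique shown for comparison, not the Z-test used in this article.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the printed output of In [13]
m1, sd1, n1 = 26.548387096774192, 45.93969867095881, 31   # hot group (Temp_Cat == 1)
m2, sd2, n2 = 672.9085714285715, 5228.628797929271, 175   # cold group (Temp_Cat == 0)

# equal_var=False selects Welch's t-test (unequal variances)
t_stat, p_welch = ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=False)

print(t_stat, p_welch)
```

The statistic is the same as the z-score computed earlier. The p-value should come out slightly larger than the Z-test p-value because the t-distribution has heavier tails, so the conclusion at the 5% level would be expected to stay the same.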