#### EXERCISE 1. 
The hourly wages in a particular industry are normally distributed with a mean of 13.20 dollars and a standard deviation of 2.50 dollars. A company in this industry employs 40 workers and pays them an average of 12.20 dollars per hour. Can this company be accused of paying substandard wages? Use an α = 0.01 level test.

In [1]:
import random
from math import sqrt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
random.seed(101)

In [2]:
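# H0: mu = 13.2  (null hypothesis)
# H1: mu < 13.2  (alternative hypothesis, left-tailed)
mu0=13.2    # hypothesized population mean
sigma=2.5   # known population standard deviation
n=40        # sample size
x=12.2      # sample mean
alpha=0.01  # significance level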

In [3]:
# n > 30, sigma is known, and the observations are independent, so we use a z-test.
z= (x-mu0)/(sigma/sqrt(n))
z

-2.5298221281347035

In [4]:
Zα= stats.norm.ppf(alpha) # Zα: critical z value
Zα

-2.3263478740408408

In [5]:
# Since z = -2.53 ≤ Zα = -2.33, we reject the null hypothesis (H0).

In [6]:
pV=stats.norm.cdf(z)
pV

0.005706018193000826

In [7]:
# P(Z < -2.53 | H0 is true) = 0.0057

# Decision rule: reject the null hypothesis (H0) when p-value ≤ alpha.
# pV < alpha, so we reject the null hypothesis.

# Conclusion: at the 0.01 significance level, the data support the alternative hypothesis
# that this company's mean wage is below the industry mean, so it can be accused of paying substandard wages.

In [8]:
# Critical value of the sample mean (Xc): H0 is rejected when the sample mean is ≤ Xc.
# z = (x-mu0)/(sigma/sqrt(n))  =>  Xc = mu0 + Zα*(sigma/sqrt(n))

mu0 + stats.norm.ppf(alpha)*(sigma/sqrt(n))

12.280430261017555
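
The steps of Exercise 1 can be gathered into one function as a cross-check. This is a minimal sketch, not part of the original notebook: the function name `left_tailed_z_test` and its return keys are hypothetical, and it uses the standard library's `statistics.NormalDist` instead of `scipy.stats` so the numbers can be verified independently.

```python
from math import sqrt
from statistics import NormalDist


def left_tailed_z_test(xbar, mu0, sigma, n, alpha):
    """One-sample, left-tailed z-test with a known population sigma.

    Hypothetical helper for illustration only.
    """
    se = sigma / sqrt(n)                  # standard error of the mean
    z = (xbar - mu0) / se                 # test statistic
    z_crit = NormalDist().inv_cdf(alpha)  # left-tail critical z value
    p_value = NormalDist().cdf(z)         # P(Z <= z) under H0
    x_crit = mu0 + z_crit * se            # critical value of the sample mean
    return {
        "z": z,
        "z_crit": z_crit,
        "p_value": p_value,
        "x_crit": x_crit,
        "reject": z <= z_crit,            # True when H0 is rejected
    }


# Values taken from the exercise statement
result = left_tailed_z_test(xbar=12.2, mu0=13.2, sigma=2.5, n=40, alpha=0.01)
for name, value in result.items():
    print(name, value)
```

The three decision rules (z against the critical z, the p-value against alpha, and the sample mean against the critical sample mean) are equivalent, so they should always lead to the same decision.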

#### EXERCISE 2.
Shear strength measurements derived from unconfined compression tests for two types of soils gave the results shown in the following document (measurements in tons per square foot). Do the soils appear to differ with respect to average shear strength, at the 1% significance level?

In [1]:
# Difference between two means with large samples: we use a z-test.

# H0: mu1 - mu2 = 0   (null hypothesis)
# H1: mu1 - mu2 != 0  (alternative hypothesis, two-tailed)

data = pd.read_csv("soil.csv")
alpha = 0.01  # significance level
delta = 0     # hypothesized difference between the population means



In [47]:
n1=data['Soil1'].count()  # number of non-missing measurements
n2=data['Soil2'].count()
x1=data['Soil1'].mean()   # sample means
x2=data['Soil2'].mean()
s1=data['Soil1'].std()    # sample standard deviations
s2=data['Soil2'].std()

In [48]:
print('n1 = {:.0f}'.format(n1))
print('n2 = {:.0f}'.format(n2))

print('x1 = {:.3f}'.format(x1))
print('x2 = {:.3f}'.format(x2))

print('s1 = {:.3f}'.format(s1))
print('s2 = {:.3f}'.format(s2))

n1 = 30
n2 = 35
x1 = 1.692
x2 = 1.417
s1 = 0.207
s2 = 0.219


In [49]:
z=(x1-x2-delta)/sqrt(s1**2/n1+s2**2/n2)
z

5.191460504717386

In [50]:
Zc=stats.norm.ppf(1-alpha/2) # critical z value for a two-tailed test (alpha/2 in each tail)
Zc

2.5758293035489004

In [51]:
# Since |z| = 5.19 is greater than Zc = 2.576, we reject the null hypothesis in favor of Ha.

In [52]:
pV=2*stats.norm.cdf(-z)
pV

2.0865085144430152e-07

In [53]:
if pV < alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of Ha.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.01 level of significance, we can reject the null hypothesis in favor of Ha.


In [54]:
# P(|Z| > 5.19 | H0 is true) ≈ 0

# Decision rule: reject the null hypothesis (H0) when p-value ≤ alpha.
# pV < alpha, so we reject the null hypothesis.

In [43]:
# At the 0.01 level of significance, we reject the null hypothesis in favor of Ha.
# There is strong evidence that the two soils differ in average shear strength.
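
The two-tailed test above can also be reproduced from the printed summary statistics alone. The following is a sketch under stated assumptions: the function name `two_sample_z_test` is hypothetical, the inputs are the rounded values printed earlier (so the statistic will differ slightly from the one computed on the raw data), and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(x1, s1, n1, x2, s2, n2, alpha, delta=0.0):
    """Two-tailed, large-sample z-test for a difference of two means.

    Hypothetical helper for illustration only.
    """
    se = sqrt(s1**2 / n1 + s2**2 / n2)            # standard error of x1 - x2
    z = (x1 - x2 - delta) / se                    # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    p_value = 2 * NormalDist().cdf(-abs(z))       # two-tailed p-value
    return z, z_crit, p_value, abs(z) > z_crit


# Rounded summary statistics as printed in the notebook output
z, z_crit, p_value, reject = two_sample_z_test(
    x1=1.692, s1=0.207, n1=30,
    x2=1.417, s2=0.219, n2=35,
    alpha=0.01,
)
print(f"z = {z:.3f}, critical z = {z_crit:.3f}, p-value = {p_value:.2e}")
print("reject H0" if reject else "fail to reject H0")
```

Comparing `abs(z)` with the `1 - alpha/2` quantile and doubling the one-tail probability are two views of the same two-tailed rule, so the critical-value decision and the p-value decision agree.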

#### EXERCISE 3. 
The following dataset, the 2015 PISA Test Dataset, is based on data provided by World Bank EdStats (https://datacatalog.worldbank.org/dataset/education-statistics).

Get descriptive statistics (the central tendency, dispersion, and shape of the dataset's distribution) for each continent group (AS, EU, AF, NA, SA, OC).
Determine whether there is any difference, on average, between the math scores of European (EU) and Asian (AS) countries (assume normality and equal variances). Draw side-by-side box plots.
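
Under the stated assumptions of normality and equal variances, the comparison is typically made with a pooled two-sample t-test. The sketch below only illustrates how the pooled statistic is formed: the function name `pooled_t_statistic` is hypothetical, and the score lists are invented placeholders, not values from the PISA dataset. The p-value requires a t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=True)` performs the full test.

```python
from math import sqrt
from statistics import mean, variance

# Invented placeholder scores, NOT taken from the PISA dataset
eu_scores = [492.0, 511.0, 486.0, 504.0, 478.0, 520.0]
as_scores = [532.0, 470.0, 548.0, 415.0, 495.0]


def pooled_t_statistic(a, b):
    """Return the equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


t_stat, dof = pooled_t_statistic(eu_scores, as_scores)
print(f"t = {t_stat:.3f}, df = {dof}")
```

For a two-tailed test at level alpha, H0 (equal means) would be rejected when the absolute value of `t_stat` exceeds the `1 - alpha/2` quantile of the t distribution with `dof` degrees of freedom.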

In [5]:
df = pd.read_excel("EdStatsEXCEL.xlsx", engine= "openpyxl")
df.head()


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2055,2060,2065,2070,2075,2080,2085,2090,2095,2100
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,,,,,,,,,,


In [6]:
df.to_csv('who.csv')