
## CONFIDENCE INTERVALS

### EXERCISE 1. What is the normal body temperature for healthy humans? A random sample of 130 healthy human body temperatures provided by Allen Shoemaker yielded a mean of 98.25 degrees and a standard deviation of 0.73 degrees.

### Give a 99% confidence interval for the average body temperature of healthy people.

In [None]:
import numpy as np
from scipy import stats

In [None]:
mean_temp = 98.25
std_temp = 0.73
n = 130

In [None]:
sem_temp = std_temp / np.sqrt(n)   # standard error of the mean: the std of the sampling distribution of the sample mean (SDSM)
sem_temp

0.06402523540941313

In [None]:
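# LONG WAY: build the interval by hand

moe = 2.58 * sem_temp   # margin of error (99%); 2.58 is the rounded z critical value (about 2.5758), so this interval is slightly wider than the stats.norm.interval result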
moe

0.1651851073562859

In [None]:
upper_lim = mean_temp + moe    # upper confidence limit
upper_lim

98.41518510735628

In [None]:
lower_lim = mean_temp - moe    # lower confidence limit
lower_lim

98.08481489264372

In [None]:
print(f'The 99% confidence interval for the mean body temperature of healthy humans is ({round(lower_lim,2)}, {round(upper_lim,2)})')

The 99% confidence interval for the mean body temperature of healthy humans is (98.08, 98.42)


In [None]:
# SHORT WAY: let SciPy compute the z-interval (arguments: confidence level, center, standard error)
ci_z_temp = stats.norm.interval(0.99, mean_temp, sem_temp)
ci_z_temp

(98.08508192246582, 98.41491807753418)
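The long and short answers differ in the fourth decimal place because the long way hard-codes the critical value as 2.58. A minimal sketch of the same calculation, assuming the critical value is taken from `stats.norm.ppf` instead of a rounded table value (the names `confidence`, `z_crit`, `ci_manual`, and `ci_scipy` are illustrative):

```python
import numpy as np
from scipy import stats

mean_temp, std_temp, n = 98.25, 0.73, 130
confidence = 0.99

sem_temp = std_temp / np.sqrt(n)

# Two-sided interval: put (1 - confidence) / 2 in each tail.
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)   # about 2.5758, not 2.58

moe = z_crit * sem_temp
ci_manual = (mean_temp - moe, mean_temp + moe)

# Same interval from SciPy's helper, for comparison.
ci_scipy = stats.norm.interval(confidence, loc=mean_temp, scale=sem_temp)

print(z_crit)
print(ci_manual)
print(ci_scipy)
```

With the unrounded critical value, the hand-built interval should agree with `stats.norm.interval` to floating-point precision.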

### EXERCISE 2. The administrators for a hospital wished to estimate the average number of days required for inpatient treatment of patients between the ages of 25 and 34. A random sample of 500 hospital patients between these ages produced a mean and standard deviation equal to 5.4 and 3.1 days, respectively.


### Construct a 95% confidence interval for the mean length of stay for the population of patients from which the sample was drawn.

In [None]:
n2 = 500
mean_pat = 5.4
std_pat = 3.1

In [None]:
sem_pat = std_pat / np.sqrt(n2)   # standard error of the mean: the std of the sampling distribution of the sample mean (SDSM)
sem_pat

0.13863621460498696

In [None]:
ci_z_pat = stats.norm.interval(0.95, mean_pat, sem_pat)
ci_z_pat

(5.12827801242126, 5.67172198757874)
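Because the standard deviation here comes from the sample, a t-interval is the stricter choice. With n = 500 the difference is tiny, which is why the z-interval is acceptable. A short sketch comparing the two, assuming the same summary statistics (the names `ci_z` and `ci_t` are illustrative):

```python
import numpy as np
from scipy import stats

n2, mean_pat, std_pat = 500, 5.4, 3.1
sem_pat = std_pat / np.sqrt(n2)

# z-interval, as in the cell above.
ci_z = stats.norm.interval(0.95, loc=mean_pat, scale=sem_pat)

# t-interval with n - 1 degrees of freedom.
ci_t = stats.t.interval(0.95, n2 - 1, loc=mean_pat, scale=sem_pat)

width_z = ci_z[1] - ci_z[0]
width_t = ci_t[1] - ci_t[0]

print(ci_z)
print(ci_t)
print(width_t - width_z)   # extra width from using t instead of z
```

The t-interval is always a little wider; for small samples the gap grows and the t-interval should be preferred.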

# HYPOTHESIS TESTING

### EXERCISE 3. The hourly wages in a particular industry are normally distributed with mean 13.20 dollars and standard deviation 2.50 dollars. A company in this industry employs 40 workers, paying them an average of $12.20 per hour. Can this company be accused of paying substandard wages? Use an α = .01 level test. (Wackerly, Ex.10.18)

### CHECK: statistic: -2.5298221281347035, pvalue= 0.005706018193000826

In [None]:
x_bar = 12.20  # sample mean
n = 40         # sample size
sigma = 2.50   # population standard deviation
mu = 13.20     # population mean under the null hypothesis

Calculate the test statistic

In [None]:
z = (x_bar - mu)/(sigma/np.sqrt(n))
z

-2.5298221281347035

Calculate the p-value

In [None]:
p_value = stats.norm.cdf(z)   # lower-tailed test (H1: mu < 13.20): the p-value is the area to the left of z, so pass the signed z-score itself
p_value

0.005706018193000826

In [None]:
alpha = 0.01

if p_value < alpha:
  print(f'At {alpha} level of significance, we can reject the null hypothesis in favor of the alternative hypothesis.')
else:
  print(f'At {alpha} level of significance, we fail to reject the null hypothesis.')

At 0.01 level of significance, we can reject the null hypothesis in favor of the alternative hypothesis.
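For a lower-tailed z-test the p-value is the area to the left of the signed z-score, that is `stats.norm.cdf(z)`. The sketch below wraps the three usual alternatives in a small helper and reproduces the CHECK values given for this exercise. The helper name `z_test_p_value` and its `alternative` argument are illustrative, not part of SciPy.

```python
import numpy as np
from scipy import stats

def z_test_p_value(z, alternative="less"):
    """p-value of a z statistic for the chosen alternative (illustrative helper)."""
    if alternative == "less":        # H1: mu < mu0, left tail
        return stats.norm.cdf(z)
    if alternative == "greater":     # H1: mu > mu0, right tail
        return stats.norm.sf(z)
    if alternative == "two-sided":   # H1: mu != mu0, both tails
        return 2 * stats.norm.sf(abs(z))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

x_bar, mu, sigma, n = 12.20, 13.20, 2.50, 40
z = (x_bar - mu) / (sigma / np.sqrt(n))

p_less = z_test_p_value(z, "less")
print(z, p_less)
print(p_less < 0.01)   # True means H0 is rejected at alpha = 0.01
```

"Substandard wages" means the company mean is below the industry mean, so the left-tail alternative is the one that matches the question.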


### EXERCISE 4. Shear strength measurements derived from unconfined compression tests for two types of soils gave the results shown in the following document (measurements in tons per square foot). Do the soils appear to differ with respect to average shear strength, at the 1% significance level?

### Results for two types of soils

### CHECK: statistic: 5.1681473319343345, pvalue= 2.593228732352821e-06

In [None]:
import pandas as pd

In [None]:
soil_df = pd.read_excel('soil.xlsx')
soil_df.head()

Unnamed: 0,Soil1,Soil2
0,1.442,1.364
1,1.943,1.878
2,1.11,1.337
3,1.912,1.828
4,1.553,1.371


In [None]:
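soil1 = soil_df['Soil1'].dropna()   # drop missing values (e.g. padding when the two columns have different lengths)

In [None]:
soil2 = soil_df['Soil2'].dropna()   # same cleanup for consistency; a NaN here would make the t-test return nan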

In [None]:
levenetest = stats.levene(soil1, soil2)   # Levene's test, H0: the two populations have equal variances
levenetest                                # p = 0.577 is far above 0.01, so H0 is not rejected and equal_var=True is reasonable for the t-test

LeveneResult(statistic=0.31486292982090475, pvalue=0.5767018253541134)

In [None]:
indTest = stats.ttest_ind(soil1, soil2, equal_var=True, alternative='two-sided')

In [None]:
indTest.pvalue

2.593228732352821e-06

In [None]:
alpha = 0.01

if indTest.pvalue < alpha:
  print(f'At {alpha} level of significance, the soils appear to differ with respect to average shear strength.')
else:
  print(f'At {alpha} level of significance, the soils don\'t appear to differ with respect to average shear strength')

At 0.01 level of significance, the soils appear to differ with respect to average shear strength.
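The workflow above (Levene's test first, then a t-test whose `equal_var` setting follows from it) can be sketched as one function. This is only an illustration under stated assumptions: the function name `compare_means`, the choice to reuse `alpha` as the Levene threshold, and the demo arrays are all hypothetical and are not the soil measurements.

```python
import numpy as np
from scipy import stats

def compare_means(a, b, alpha=0.01):
    """Levene's test, then a two-sided t-test matching its outcome (illustrative)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]   # drop missing values before testing
    b = b[~np.isnan(b)]

    levene_p = stats.levene(a, b).pvalue
    # Assumption: treat variances as equal unless Levene rejects at alpha.
    equal_var = bool(levene_p >= alpha)

    # equal_var=False switches ttest_ind to Welch's t-test.
    result = stats.ttest_ind(a, b, equal_var=equal_var)
    return equal_var, result.statistic, result.pvalue

# Hypothetical data: same spread, clearly different means.
a_demo = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, np.nan]
b_demo = [2.0, 2.1, 1.9, 2.2, 2.05, 1.95]

equal_var, t_stat, p_val = compare_means(a_demo, b_demo)
print(equal_var, t_stat, p_val)
```

Falling back to Welch's test when Levene rejects is a common convention rather than a rule; some analysts use Welch's test by default.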


### EXERCISE 5. The following dataset is based on data provided by the World Bank (https://datacatalog.worldbank.org/dataset/education-statistics). World Bank Edstats.  2015 PISA Test Dataset

### 1. Get descriptive statistics (the central tendency, dispersion and shape of a dataset’s distribution) for each continent group (AS, EU, AF, NA, SA, OC).
### 2. Determine whether there is any difference (on the average) for the math scores among European (EU) and Asian (AS) countries (assume normality and equal variances). Draw side-by-side box plots.
### CHECK: statistic=0.870055317967983, pvalue=0.38826888111307345

In [None]:
df0 = pd.read_excel('2015 PISA Test.xlsx')
df = df0.copy()
df.head()

Unnamed: 0,Country Code,Continent_Code,internet_users_per_100,Math,Reading,Science
0,ALB,EU,63.252933,413.157,405.2588,427.225
1,ARE,AS,90.5,427.4827,433.5423,436.7311
2,ARG,SA,68.043064,409.0333,425.3031,432.2262
3,AUS,OC,84.560519,493.8962,502.9006,509.9939
4,AUT,EU,83.940142,496.7423,484.8656,495.0375


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
internet_users_per_100,70.0,71.973099,16.390632,21.976068,60.89902,72.99935,85.026763,98.2
Math,70.0,460.971557,53.327205,327.702,417.416075,477.60715,500.482925,564.1897
Reading,70.0,460.997291,49.502679,346.549,426.948625,480.19985,499.687475,535.1002
Science,70.0,465.439093,48.397254,331.6388,425.923375,475.40005,502.43125,555.5747


In [None]:
df.shape

(70, 6)

In [None]:
df.groupby('Continent_Code')['Math'].mean(numeric_only=True)

Continent_Code
AF    363.212100
AS    466.216647
EU    477.981449
OC    494.559750
SA    402.887700
Name: Math, dtype: float64
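The exercise lists six continent codes, but the grouped output above has no `NA` group. One likely cause, not confirmed from the file itself, is that pandas treats the string `NA` as a missing value when reading data. The sketch below reproduces the effect on a small made-up CSV (the country codes and scores are hypothetical) and then uses `describe()` to get per-group descriptive statistics. `pd.read_excel` accepts the same `keep_default_na` and `na_values` parameters.

```python
import io
import pandas as pd

# Hypothetical rows; "NA" stands for North America.
csv_text = """Country Code,Continent_Code,Math
AAA,NA,500.0
BBB,NA,480.0
CCC,EU,490.0
DDD,EU,470.0
"""

# Default parsing: the string "NA" becomes NaN and drops out of groupby.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Continent_Code"].isna().sum())

# Keep "NA" as text; only empty cells count as missing.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

# Count, mean, std, min, quartiles and max for each continent group.
summary = kept_df.groupby("Continent_Code")["Math"].describe()
print(summary)
```

For the shape of each distribution, `kept_df.groupby("Continent_Code")["Math"].skew()` adds skewness per group, although it returns NaN for groups with fewer than three values.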

In [None]:
df_as = df.Math[df.Continent_Code=='AS']
df_as

1     427.4827
11    531.2961
25    403.8332
27    547.9310
30    386.1096
33    469.6695
35    380.2590
36    532.4399
37    459.8160
38    524.1062
39    396.2497
43    543.8078
49    446.1098
56    402.4007
59    564.1897
63    415.4638
69    494.5183
Name: Math, dtype: float64

In [None]:
df_eu = df.Math[df.Continent_Code=='EU']
df_eu

0     413.1570
4     496.7423
5     506.9844
6     441.1899
9     521.2506
14    437.1443
15    492.3254
16    505.9713
17    511.0876
20    485.8432
21    519.5291
22    511.0769
23    492.9204
24    492.4785
26    453.6299
28    464.0401
29    476.8309
31    503.7220
32    488.0332
34    489.7287
40    478.3834
41    485.7706
42    482.3051
44    419.6635
46    371.3114
47    478.6448
48    417.9341
50    512.2528
51    501.7298
54    504.4693
55    491.6270
57    443.9543
58    494.0600
60    475.2301
61    509.9196
62    493.9181
66    420.4540
Name: Math, dtype: float64
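The exercise also asks for side-by-side box plots. A minimal sketch with matplotlib, assuming a frame with `Continent_Code` and `Math` columns like the one above; the `demo` values are made up for illustration and are not the PISA scores.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the PISA frame; only the two columns used here.
demo = pd.DataFrame({
    "Continent_Code": ["AS", "AS", "AS", "AS", "EU", "EU", "EU", "EU"],
    "Math": [390.0, 450.0, 530.0, 560.0, 470.0, 485.0, 495.0, 510.0],
})

groups = ["AS", "EU"]
samples = [demo.loc[demo["Continent_Code"] == g, "Math"].to_numpy() for g in groups]

fig, ax = plt.subplots(figsize=(6, 4))
box = ax.boxplot(samples)      # one box per group, drawn side by side
ax.set_xticks([1, 2])          # default box positions are 1, 2, ...
ax.set_xticklabels(groups)
ax.set_ylabel("Math score")
ax.set_title("Math scores by continent (illustrative data)")
```

With the real data, replacing `samples` by `[df_as, df_eu]` should give the plot for this exercise. A visibly taller box for one group is a first hint that the variances differ.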

In [None]:
levenetest = stats.levene(df_as, df_eu)         # Levene's test, H0: the two populations have equal variances
levenetest                                      # here p = 0.0004 is small, which points to unequal variances; the exercise nevertheless asks us to assume equal variances

LeveneResult(statistic=14.300030628780679, pvalue=0.0004037413184451079)

In [None]:
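indTest = stats.ttest_ind(df_as, df_eu)          # equal_var=True by default. The statistic is negative because the AS mean (466.2) is below the EU mean (478.0); ttest_ind(df_eu, df_as) gives the same value with a positive sign, as in the CHECK line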
indTest

Ttest_indResult(statistic=-0.8700553179679789, pvalue=0.38826888111307556)
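The sign of the t statistic only reflects the order of the arguments: it is computed from the first sample's mean minus the second's. A small sketch with made-up scores (the arrays `group_a` and `group_b` are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical samples; group_a has the lower mean.
group_a = np.array([410.0, 455.0, 530.0, 390.0, 545.0])
group_b = np.array([480.0, 495.0, 505.0, 470.0, 510.0])

ab = stats.ttest_ind(group_a, group_b)   # mean(a) - mean(b) < 0
ba = stats.ttest_ind(group_b, group_a)   # mean(b) - mean(a) > 0

print(ab.statistic, ab.pvalue)
print(ba.statistic, ba.pvalue)
```

Swapping the samples flips the sign of the statistic but leaves the two-sided p-value unchanged, so the conclusion of the test is the same either way.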

In [None]:
stats.ttest_ind(15.9 , 91.81, equal_var=True, alternative='two-sided')   # each scalar is a sample of size 1, so the sample variance divides by zero (RuntimeWarning) and the result is nan; ttest_ind needs arrays of observations

Ttest_indResult(statistic=nan, pvalue=nan)

In [None]:
standard = np.array([1.5, 2.3, 4.7, 6.1, -1.2, 2.6, 1.5, -0.4, -2.7, 1.9, 1.1, -1.5])
standard

array([ 1.5,  2.3,  4.7,  6.1, -1.2,  2.6,  1.5, -0.4, -2.7,  1.9,  1.1,
       -1.5])

In [None]:
new = [i**2 for i in standard]   # square each value
new

[2.25,
 5.289999999999999,
 22.090000000000003,
 37.209999999999994,
 1.44,
 6.760000000000001,
 2.25,
 0.16000000000000003,
 7.290000000000001,
 3.61,
 1.2100000000000002,
 2.25]

In [None]:
stats.ttest_ind(standard , new)

Ttest_indResult(statistic=-1.9427696130852292, pvalue=0.0649511173847011)