In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as scs 

# import cars data
filepath = '../Uppgift/Data/auto-mpg.csv'
cars = pd.read_csv(filepath)
cars = cars.dropna() # Remove rows containing NaN
cars.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


### Construct a hypothesis test that examines whether the average fuel efficiency of all cars manufactured in 1982 equals or differs from a hypothesized population mean of $\mu_0 = 40$ mpg. Use a significance level of $\alpha = 0.05$.

Since the population standard deviation is unknown, we will use a t-test here.

In [2]:
from scipy.stats import t # Import the t-distribution from SciPy

Alternative hypothesis : The average fuel efficiency of cars manufactured in the year 1982 is not equal to 40 mpg.

$H_A: \mu \neq 40\,mpg$

    
Null hypothesis : The average fuel efficiency of cars manufactured in the year 1982 is equal to 40 mpg.

$H_0: \mu = 40\,mpg$


Then, we calculate the test statistic for our null hypothesis, i.e.:

$t = \frac{\overline{X} - \mu_0}{s \,/\sqrt{n}}$

- $|\text{test statistic}| < \text{critical value}$: fail to reject the null hypothesis.

- $|\text{test statistic}| \geq \text{critical value}$: reject the null hypothesis.

- If $p\text{-value} \leq \alpha$, reject the null hypothesis.

- If $p\text{-value} > \alpha$, fail to reject the null hypothesis.
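The critical-value rule and the p-value rule always lead to the same decision in a two-sided t-test. The following is a minimal sketch of that equivalence, assuming a small made-up sample (the values in `sample` are hypothetical and are not taken from auto-mpg):

```python
import numpy as np
from scipy import stats

# Hypothetical mpg values, made up for illustration only
sample = np.array([31.0, 36.0, 28.0, 34.0, 38.0, 32.0, 27.0, 36.0])
mu0 = 40.0      # hypothesized population mean
alpha = 0.05    # significance level
n = len(sample)

# Test statistic t = (xbar - mu0) / (s / sqrt(n))
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided critical value and two-sided p-value with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Both decision rules, evaluated side by side
reject_by_critical_value = abs(t_stat) >= t_crit
reject_by_p_value = p_value <= alpha

print("t statistic:", t_stat)
print("critical value:", t_crit)
print("p-value:", p_value)
print("same decision:", reject_by_critical_value == reject_by_p_value)
```

The two-sided p-value is twice the upper-tail probability of $|t|$, which is why it falls below $\alpha$ exactly when $|t|$ exceeds the critical value.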

We first carry out the test by hand, using NumPy and the t-distribution from SciPy.

In [35]:
year = cars[cars['model_year'] == 82] # Data of cars manufactured in 1982
sample = year['mpg'] # Select the column mpg as sample

xbar = sample.mean() # Calculate sample mean
mu = 40 # Set mu as the value for the null hypothesis
std = sample.std(ddof=1) # calculate the sample standard deviation
n = len(sample) # calculate the size of the sample
alpha = 0.05

statistic = (xbar - mu) / (std / np.sqrt(n)) # Calculate the value of the test statistic
print('Test statistic: ' +str(statistic)) # Print the test statistics


t_crit = t.ppf(1-alpha/2, n-1) # Calculate the value of the critical statistic
print('Critical statistic: ' +str(t_crit)) # Print the critical statistics

if abs(statistic) < t_crit: #  To check condition of test result
    print("Fail to reject the null hypothesis, mean value is same")
else:
    print("Reject the null hypothesis, mean value is different")

Test statistic: -8.374123511411977
Critical statistic: 2.045229642132703
Reject the null hypothesis, mean value is different


$|\text{test statistic}| = 8.374 \geq \text{critical value} = 2.045$, so we reject the null hypothesis $H_0$ at the significance level $\alpha = 0.05$: the average fuel efficiency of cars manufactured in 1982 differs from 40 mpg.


We can also use the built-in function ttest_1samp(), which calculates the t-test statistic and its p-value given a set of data and a hypothesized population mean. Here we use alternative='two-sided', which tests whether the population mean differs from the hypothesized value in either direction.


In [14]:
result,pvalue= scs.ttest_1samp(a=sample, popmean=mu, alternative='two-sided') # Carry out two-sided t-test including ttest_1samp() in SciPy

print(result,pvalue) # Print the result

if pvalue > alpha:  # Compare the p-value with the significance level
    print("Fail to reject the null hypothesis, no evidence that the mean differs from 40 mpg")
else:
    print("Reject the null hypothesis, mean value is not equal to 40 mpg")

-8.374123511411977 3.1374935328662237e-09
Reject the null hypothesis, mean value is not equal to 40 mpg


$p\text{-value} \leq \alpha$, so we reject the null hypothesis. We conclude that the average fuel efficiency of cars manufactured in 1982 is not equal to 40 mpg.


### Construct a hypothesis test that examines whether more than one fourth of the cars with 4 cylinders are manufactured in the USA. Use a significance level of $\alpha=0.1$.
For proportion tests, we use the standard normal distribution Z, so we start by importing it from SciPy.


In [15]:
from scipy.stats import norm # Import the normal distribution from SciPy

Alternative hypothesis: More than one fourth of the cars with 4 cylinders are manufactured in the USA.

$H_A: p > 1/4$

    
Null hypothesis: At most one fourth of the cars with 4 cylinders are manufactured in the USA.

$H_0: p \leq 1/4$


Then we calculate the test statistic for our hypothesis, which is given by

$Z = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}$
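The formula can be evaluated by hand before using a library routine. The following is a minimal sketch, assuming hypothetical counts (`count = 30` and `n = 100` are made up and are not taken from auto-mpg). It uses the standard library's `NormalDist().cdf`, which plays the same role as `scipy.stats.norm.cdf`:

```python
import math
from statistics import NormalDist

count = 30     # hypothetical number of "successes"
n = 100        # hypothetical sample size
p0 = 0.25      # proportion under the null hypothesis
alpha = 0.1    # significance level

# Sample proportion and test statistic Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_hat = count / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# One-sided p-value for H_A: p > p0 (upper tail of the standard normal)
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.4f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

Because the alternative hypothesis is one-sided, only the upper tail counts toward the p-value.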

In [23]:
from statsmodels.stats import proportion

Here we use the proportions_ztest() test with alternative='larger', because our alternative hypothesis is that p is <strong>more</strong> than 1/4.


In [38]:
cars_usa = cars[cars['origin'] == 'usa'] # Cars manufactured in the USA
usa_4 = cars_usa[cars_usa['cylinders'] == 4] # USA cars with 4 cylinders

p_pop = 1/4 # Proportion under the null hypothesis
n = len(cars_usa) # Number of cars manufactured in the USA

alpha = 0.1 # Set the significance level
statistic, pvalue = proportion.proportions_ztest(count=len(usa_4), nobs=len(cars_usa), value=p_pop, prop_var=p_pop, alternative='larger')
print('statistic :' + str(statistic))
print('pvalue :' + str(pvalue))

if pvalue > alpha:# To check condition of test result
    print("Fail to reject the null hypothesis")
else:
    print("Reject the null hypothesis")


statistic :1.1434522260231417
pvalue :0.12642543963172764
Fail to reject the null hypothesis


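Since $p\text{-value} = 0.126 > \alpha = 0.1$, we fail to reject the null hypothesis. At a significance level of 0.1, we cannot conclude that the tested proportion exceeds one fourth. Note that the proportion computed above is the number of 4-cylinder USA cars divided by the number of all USA cars (`nobs=len(cars_usa)`).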
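Which proportion is tested depends on the group used as the denominator. The following is a minimal sketch of the two possible choices, assuming a small made-up DataFrame (`toy` and its rows are hypothetical and are not taken from auto-mpg):

```python
import pandas as pd

# Hypothetical data, made up for illustration only
toy = pd.DataFrame({
    'origin':    ['usa', 'usa', 'usa', 'usa', 'usa', 'usa', 'japan', 'japan'],
    'cylinders': [4,     4,     6,     8,     8,     6,     4,       4],
})

usa = toy[toy['origin'] == 'usa']          # all USA cars
four_cyl = toy[toy['cylinders'] == 4]      # all 4-cylinder cars
usa_four_cyl = usa[usa['cylinders'] == 4]  # USA cars with 4 cylinders

# Share of USA cars that have 4 cylinders (denominator: USA cars)
share_of_usa = len(usa_four_cyl) / len(usa)

# Share of 4-cylinder cars that come from the USA (denominator: 4-cylinder cars)
share_of_four_cyl = len(usa_four_cyl) / len(four_cyl)

print("4-cylinder share among USA cars:", share_of_usa)
print("USA share among 4-cylinder cars:", share_of_four_cyl)
```

The count in the numerator is the same in both cases; only `nobs` would change in a call to proportions_ztest().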
