# Step 1: Generating sample data

This block of Python code generates a random sample of 50 values that you will use throughout this discussion. Because the values are drawn at random, your sample, and therefore your answers, will differ from anyone else's. The numpy module draws the values from a Normal distribution whose mean and standard deviation were chosen for you. The sample is stored in a pandas DataFrame that is used in later calculations.

In [4]:
import pandas as pd
import numpy as np
import math
import scipy.stats as st

# create 50 randomly chosen values from a Normal distribution. (arbitrarily using mean=2.48 and standard deviation=0.50).
diameters = np.random.normal(2.4800,0.500,50)

# convert the array into a dataframe with the column name "diameters" using pandas library.
diameters_df = pd.DataFrame(diameters, columns=['diameters'])
diameters_df = diameters_df.round(2)

# print the dataframe (note that the index of dataframe starts at 0).
print("Diameters data frame\n")
print(diameters_df)

Diameters data frame

    diameters
0        2.64
1        3.02
2        1.84
3        2.52
4        2.81
5        1.94
6        2.19
7        2.90
8        1.77
9        2.52
10       2.04
11       2.66
12       2.55
13       3.18
14       1.96
15       2.30
16       1.91
17       1.64
18       2.20
19       3.14
20       1.70
21       2.02
22       2.74
23       1.61
24       2.65
25       1.53
26       3.60
27       2.53
28       3.20
29       2.23
30       3.11
31       2.76
32       3.19
33       2.59
34       2.61
35       3.10
36       2.95
37       4.01
38       2.26
39       1.94
40       3.13
41       2.33
42       1.52
43       2.16
44       3.52
45       2.52
46       1.95
47       2.80
48       1.95
49       3.13
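
Because the sample is redrawn on every run, the values printed above will not match yours. If you ever need to reproduce one particular sample, one option is to seed a random generator. The sketch below assumes NumPy's default_rng interface; the seed value 42 and the names rng and repeat are arbitrary choices for illustration and are not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical seed: any fixed integer gives a repeatable sample.
rng = np.random.default_rng(42)

# Same distribution as above: mean 2.48, standard deviation 0.50, 50 values.
diameters = rng.normal(loc=2.48, scale=0.50, size=50)
diameters_df = pd.DataFrame(diameters, columns=["diameters"]).round(2)

# Re-creating the generator with the same seed yields the same sample.
repeat = np.random.default_rng(42).normal(loc=2.48, scale=0.50, size=50)
print(np.array_equal(diameters, repeat))
```

Leaving the generator unseeded, as the cell above does, keeps each sample unique.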


# Step 2: Constructing confidence intervals

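You will assume that the population standard deviation is known (0.50, from Step 1) and that the sample size (n = 50) is sufficiently large, so the sampling distribution of the sample mean is approximately Normal. Each confidence interval then has the form sample mean ± z* × standard error, where the standard error is the population standard deviation divided by sqrt(n) and z* is the critical value for the chosen confidence level. You will use the scipy.stats submodule to construct these intervals from your sample data.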

In [7]:
# st.norm.interval takes the confidence level, the sample mean, and the standard error as inputs.

# calculate the sample mean
mean = diameters_df['diameters'].mean()
print(diameters_df['diameters'])

# input the population standard deviation, which was given in Step 1.
std_deviation = 0.5000

# calculate standard error = standard deviation / sqrt(n)   where n is the sample size.
stderr = std_deviation/math.sqrt(len(diameters_df['diameters']))

# construct a 90% confidence interval.
conf_int_90 = st.norm.interval(0.90, mean, stderr)
print("90% confidence interval (unrounded) =", conf_int_90)
print("90% confidence interval (rounded) = (", round(conf_int_90[0], 2), ",", round(conf_int_90[1], 2), ")")
print("")

# construct a 99% confidence interval.
conf_int_99 = st.norm.interval(0.99, mean, stderr)
print("99% confidence interval (unrounded) =", conf_int_99)
print("99% confidence interval (rounded) = (", round(conf_int_99[0], 2), ",", round(conf_int_99[1], 2), ")")

0     2.64
1     3.02
2     1.84
3     2.52
4     2.81
5     1.94
6     2.19
7     2.90
8     1.77
9     2.52
10    2.04
11    2.66
12    2.55
13    3.18
14    1.96
15    2.30
16    1.91
17    1.64
18    2.20
19    3.14
20    1.70
21    2.02
22    2.74
23    1.61
24    2.65
25    1.53
26    3.60
27    2.53
28    3.20
29    2.23
30    3.11
31    2.76
32    3.19
33    2.59
34    2.61
35    3.10
36    2.95
37    4.01
38    2.26
39    1.94
40    3.13
41    2.33
42    1.52
43    2.16
44    3.52
45    2.52
46    1.95
47    2.80
48    1.95
49    3.13
Name: diameters, dtype: float64
90% confidence interval (unrounded) = (np.float64(2.3850912846323324), np.float64(2.6177087153676672))
90% confidence interval (rounded) = ( 2.39 , 2.62 )

99% confidence interval (unrounded) = (np.float64(2.319261363228155), np.float64(2.6835386367718446))
99% confidence interval (rounded) = ( 2.32 , 2.68 )
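
To see what the interval call is doing, the same interval can be built by hand from the critical value. The following is a minimal sketch, assuming illustrative inputs (a sample mean of 2.50 and n = 50) rather than the notebook's actual sample; the helper name z_interval is hypothetical.

```python
import math
import scipy.stats as st

# Illustrative inputs (assumptions): a sample mean of 2.50 from n = 50
# observations, with known population standard deviation 0.50.
sample_mean = 2.50
sigma = 0.50
n = 50

stderr = sigma / math.sqrt(n)

def z_interval(confidence, center, se):
    """Return (lower, upper) for a z-based confidence interval."""
    # The critical value z* leaves (1 - confidence) / 2 in each tail.
    z_star = st.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_star * se
    return center - margin, center + margin

manual_90 = z_interval(0.90, sample_mean, stderr)
scipy_90 = st.norm.interval(0.90, loc=sample_mean, scale=stderr)

print("manual:", manual_90)
print("scipy: ", scipy_90)
```

The two results should agree up to floating-point error. A higher confidence level gives a larger z* and therefore a wider interval, which is why the 99% interval is wider than the 90% interval.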


# Step 3: Performing hypothesis testing for the population mean

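Since you were given the population standard deviation in Step 1 and the sample size is sufficiently large, you can use the z-test for population means. The ztest function in the statsmodels.stats.weightstats submodule runs the z-test. Its inputs are the sample data and the value under the null hypothesis; its outputs are the test statistic and the two-tailed P-value. Note that ztest estimates the standard deviation from the sample rather than taking the population standard deviation as an input, so its test statistic can differ slightly from one computed by hand with the known value.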

In [6]:
from statsmodels.stats.weightstats import ztest

# run z-test hypothesis test for population mean. The value under the null hypothesis is 2.30.
print(diameters_df['diameters'])
test_statistic, p_value = ztest(x1 = diameters_df['diameters'],  value = 2.30) # type: ignore

print("z-test hypothesis test for population mean")
print("test-statistic =", round(test_statistic,2))
print("two tailed p-value =",round(p_value,4)) # type: ignore

# If the p-value is less than the significance level (0.05), then the null hypothesis is rejected.
sig_lvl = 0.05
if p_value < sig_lvl:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


0     2.64
1     3.02
2     1.84
3     2.52
4     2.81
5     1.94
6     2.19
7     2.90
8     1.77
9     2.52
10    2.04
11    2.66
12    2.55
13    3.18
14    1.96
15    2.30
16    1.91
17    1.64
18    2.20
19    3.14
20    1.70
21    2.02
22    2.74
23    1.61
24    2.65
25    1.53
26    3.60
27    2.53
28    3.20
29    2.23
30    3.11
31    2.76
32    3.19
33    2.59
34    2.61
35    3.10
36    2.95
37    4.01
38    2.26
39    1.94
40    3.13
41    2.33
42    1.52
43    2.16
44    3.52
45    2.52
46    1.95
47    2.80
48    1.95
49    3.13
Name: diameters, dtype: float64
z-test hypothesis test for population mean
test-statistic = 2.43
two tailed p-value = 0.015
Reject the null hypothesis
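
For comparison, the z statistic can also be computed directly from the known population standard deviation. This is a minimal sketch under stated assumptions: the inputs are illustrative values, not the notebook's sample, and the function name z_test_known_sigma is hypothetical.

```python
import math
import scipy.stats as st

def z_test_known_sigma(sample_mean, null_value, sigma, n):
    """Two-tailed z-test for a mean when the population sigma is known."""
    stderr = sigma / math.sqrt(n)
    z = (sample_mean - null_value) / stderr
    # sf(|z|) is the upper-tail area; double it for a two-tailed test.
    p_two_tailed = 2 * st.norm.sf(abs(z))
    return z, p_two_tailed

# Illustrative inputs (assumptions, not the notebook's sample).
z, p = z_test_known_sigma(sample_mean=2.50, null_value=2.30, sigma=0.50, n=50)
print("z =", round(z, 2))
print("two-tailed p-value =", round(p, 4))
```

With a sample of 50 the two approaches usually lead to the same decision, but the statistics will not match exactly because one uses the known value 0.50 and the other uses the sample estimate.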


In [16]:
# To find the one-tailed p-value, divide the two-tailed p-value by 2.
# This is valid only when the test statistic lies in the direction of the alternative hypothesis.
one_tailed_p_value = p_value/2
print("one-tailed p-value =", round(one_tailed_p_value,4)) # type: ignore

significance_level = 0.01
if one_tailed_p_value < significance_level:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


one-tailed p-value = 0.0175
Fail to reject the null hypothesis.
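
Halving the two-tailed p-value gives the correct one-tailed value only when the test statistic points in the direction of the alternative hypothesis. The sketch below makes the direction explicit using only scipy.stats; the helper name one_tailed_p is hypothetical, and z = 2.43 is the rounded test statistic shown in the z-test output earlier in this step. The ztest function in statsmodels also accepts an alternative argument ('two-sided', 'larger', or 'smaller') for the same purpose.

```python
import scipy.stats as st

def one_tailed_p(z, alternative):
    """One-tailed p-value for a z statistic.

    alternative: "larger" tests H1: mean > null value,
                 "smaller" tests H1: mean < null value.
    """
    if alternative == "larger":
        return st.norm.sf(z)   # area to the right of z
    if alternative == "smaller":
        return st.norm.cdf(z)  # area to the left of z
    raise ValueError("alternative must be 'larger' or 'smaller'")

z = 2.43  # rounded test statistic from the z-test output
two_tailed = 2 * st.norm.sf(abs(z))
print(one_tailed_p(z, "larger"))   # equals two_tailed / 2
print(one_tailed_p(z, "smaller"))  # equals 1 - two_tailed / 2
```

For a positive test statistic, the "larger" alternative matches the halved two-tailed value, while the "smaller" alternative gives a p-value close to 1.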


# Conclusion

To test whether the population mean equals a hypothesized value, you can use the z-test for population means. This test is valid when the population standard deviation is known and the sample size is sufficiently large. The ztest function in the statsmodels.stats.weightstats submodule can be used to perform this test.

To construct confidence intervals for the population mean under the same assumptions, you can use the Normal distribution. The scipy.stats submodule can be used to construct these confidence intervals.