## Step 1: Generating sample data

This block of Python code generates two samples, each of size 50, that you will use in this discussion. Because the values are drawn at random, the data sets will be unique to you, and therefore your answers will be unique as well. The numpy module lets you draw a data set from a normal distribution. The data sets are stored in pandas DataFrames and used in later calculations.

In [1]:
import pandas as pd
import numpy as np

# create 50 randomly chosen values from a normal distribution. (arbitrarily using mean=2.48 and standard deviation=0.500)
diameters_sample1 = np.random.normal(2.48,0.500,50)

# convert the array into a dataframe with the column name "diameters" using pandas library
diameters_sample1_df = pd.DataFrame(diameters_sample1, columns=['diameters'])
diameters_sample1_df = diameters_sample1_df.round(2)

# create 50 randomly chosen values from a normal distribution. (arbitrarily using mean=2.50 and standard deviation=0.750)
diameters_sample2 = np.random.normal(2.50,0.750,50)

# convert the array into a dataframe with the column name "diameters" using pandas library
diameters_sample2_df = pd.DataFrame(diameters_sample2, columns=['diameters'])
diameters_sample2_df = diameters_sample2_df.round(2)

# print the dataframe to see the first 5 observations (note that the index of dataframe starts at 0)
print("Diameters data frame of the first sample (showing only the first five observations)")
print(diameters_sample1_df.head())
print()
print("Diameters data frame of the second sample (showing only the first five observations)")
print(diameters_sample2_df.head())

Diameters data frame of the first sample (showing only the first five observations)
   diameters
0       2.55
1       2.11
2       3.19
3       1.56
4       2.39

Diameters data frame of the second sample (showing only the first five observations)
   diameters
0       2.96
1       1.60
2       2.50
3       2.86
4       3.14
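
Because np.random.normal draws new values on every run, the printed observations and every later test result change each time the notebook is executed. If a repeatable run is wanted, one option is a seeded generator. The following is a minimal sketch, not part of the original script: the seed value 42 and the helper name make_sample are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def make_sample(rng, mean, std, size=50):
    """Draw `size` normal values and return them as a rounded one-column DataFrame."""
    values = rng.normal(mean, std, size)
    return pd.DataFrame(values, columns=['diameters']).round(2)

# assumed seed; any fixed integer gives a repeatable sequence
rng = np.random.default_rng(42)
sample1_df = make_sample(rng, 2.48, 0.500)
sample2_df = make_sample(rng, 2.50, 0.750)

print(sample1_df.head())
print(sample2_df.head())
```

Re-creating the generator with the same seed and calling make_sample in the same order reproduces the same two samples.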


## Step 2: Performing hypothesis test for the difference in population proportions

The z-test for proportions can be used to test for a difference between two population proportions. The proportions_ztest function in the statsmodels.stats.proportion submodule runs this test. Its inputs are a list of counts of observations meeting a certain condition (given in the problem statement) and a list of the sample sizes of the two samples.

- counts: Python list holding the number of observations in each sample with diameter values less than 2.20.
- n: Python list holding the total number of observations in each sample.

In [2]:
from statsmodels.stats.proportion import proportions_ztest

# number of observations in the first sample with diameter values less than 2.20.
count1 = len(diameters_sample1_df[diameters_sample1_df['diameters']<2.20])

# number of observations in the second sample with diameter values less than 2.20.
count2 = len(diameters_sample2_df[diameters_sample2_df['diameters']<2.20])

# counts Python list
counts = [count1, count2]


# number of observations in the first sample
n1 = len(diameters_sample1_df)

# number of observations in the second sample
n2 = len(diameters_sample2_df)

# n Python list
n = [n1, n2]

# perform the hypothesis test. output is a Python tuple that contains test_statistic and the two-sided P_value.
test_statistic, p_value = proportions_ztest(counts, n)

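# Halving the two-sided p-value gives a one-tailed p-value. This is valid only when the
# test statistic falls in the direction of the alternative hypothesis.
# For a two-sided alternative (p1 != p2), report the unhalved value instead.
p_value = p_value/2

print("test-statistic =", round(test_statistic,2))
print("one-tailed p-value =", round(p_value,4))

test-statistic = -0.89
one-tailed p-value = 0.1865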
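
To make the calculation behind the test less of a black box, here is a rough sketch of a pooled two-proportion z-test written with the standard library only. It is meant as an illustration of the usual textbook formula, not as the statsmodels implementation; the function name two_proportion_ztest and the counts passed to it are hypothetical and are not taken from the samples above.

```python
import math

def two_proportion_ztest(count1, n1, count2, n2):
    """Pooled z-test for H0: p1 = p2. Returns (z, two-sided p-value).

    Assumes the pooled proportion is strictly between 0 and 1,
    otherwise the standard error is zero and z is undefined.
    """
    p1_hat = count1 / n1
    p2_hat = count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / std_error
    # two-sided tail area of the standard normal: 2 * P(Z > |z|)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# hypothetical counts, chosen only to exercise the function
z, p = two_proportion_ztest(12, 50, 16, 50)
print("z =", round(z, 2))
print("two-sided p-value =", round(p, 4))
```

When both samples have the same count, the difference in sample proportions is zero, so z is 0 and the two-sided p-value is 1.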


## Question 1:

You are given the following problem:
A factory claims that the proportion of ball bearings with diameter values less than 2.20 cm in the existing manufacturing process is the same as the proportion in the new process. At alpha = 0.05, is there enough evidence to support the factory's claim?

1) Define the null and alternative hypotheses in mathematical terms as well as in words.
2) Identify the level of significance.
3) Include the test statistic and the P-value. See Step 2 in the Python script. (Note that Python methods return two tailed P-values. You must report the correct P-value based on the alternative hypothesis.)
4) Provide a conclusion and interpretation of the test: Should the null hypothesis be rejected? Why or why not?

1a) The null hypothesis is that the proportion of ball bearings with diameter values less than 2.20 cm in the existing manufacturing process is the same as the proportion in the new process, mathematically represented as H0: p1 = p2. The alternative hypothesis is that the proportion of ball bearings with diameter values less than 2.20 cm in the existing manufacturing process is different from the proportion in the new process, mathematically represented as H1: p1 ≠ p2, which makes this a two-tailed test.
2a) The level of significance is 0.05.
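3a) The test statistic is -0.89. The p-value printed in Step 2, 0.1865, has already been halved, so the two-tailed P-value that matches H1: p1 ≠ p2 is about 2 × 0.1865 = 0.373. Since the P-value is greater than the level of significance, we fail to reject the null hypothesis.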
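4a) The null hypothesis should not be rejected. The factory's claim is the null hypothesis, so at the 0.05 level there is not enough evidence to reject the claim that the proportion of ball bearings with diameter values less than 2.20 cm in the existing manufacturing process is the same as the proportion in the new process. Failing to reject does not prove the proportions are equal; it only means the samples do not show a significant difference.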

If the P-value were 0.002 (less than 0.05), we would reject the null hypothesis and conclude that there is enough evidence to reject the factory's claim, that is, that the proportion of ball bearings with diameter values less than 2.20 cm differs between the existing manufacturing process and the new process.
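
The decision rule used in the answers can be written down in a few lines. This is only a sketch: the helper name decide and the example p-value of 0.40 are made up for illustration, and it follows the convention that the null hypothesis is rejected when the P-value is less than or equal to alpha.

```python
def decide(p_value, alpha=0.05):
    """Return the decision about H0 for a given p-value and significance level."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

# illustrative p-values only
print(decide(0.40))    # above alpha
print(decide(0.002))   # below alpha
```

Because the factory's claim is H0 itself, "reject H0" here means the data contradict the claim, and "fail to reject H0" means the data are compatible with it.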


## Question 2:

The difference between a paired t-test and a two-sample t-test is that the paired t-test is used to compare the means of two related groups, whereas the two-sample t-test is used to compare the means of two independent groups. The paired t-test is used when the data is paired, such as when the same group is tested at two different times (e.g., before and after a treatment). The two-sample t-test is used when the data is not paired, such as when two different groups are tested.

To perform a paired t-test with Python, you can use the ttest_rel function in the scipy.stats module. Its inputs are the two paired samples, which must have the same length and the same ordering of subjects. The output is the test statistic and the P-value.
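
As a rough illustration of a paired test, the sketch below builds hypothetical before/after measurements for the same 30 subjects and runs ttest_rel on them. The data, the seed, and the variable names are assumptions made up for this example. It also runs a one-sample t-test on the pairwise differences, since a paired t-test is equivalent to testing whether the mean difference is zero.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

rng = np.random.default_rng(0)  # assumed seed

# hypothetical paired data: the same 30 subjects measured twice
before = rng.normal(10.0, 1.0, 30)
after = before + rng.normal(0.3, 0.5, 30)

# paired t-test on the two related samples
test_statistic, p_value = ttest_rel(before, after)

# equivalent one-sample t-test on the differences against a mean of 0
t_diff, p_diff = ttest_1samp(before - after, 0)

print("paired test-statistic =", round(test_statistic, 2))
print("paired two-sided p-value =", round(p_value, 4))
print("same as one-sample test on differences:",
      np.isclose(test_statistic, t_diff) and np.isclose(p_value, p_diff))
```

Running ttest_ind on the same arrays would ignore the pairing and would generally give a different result.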

To perform a two-sample t-test with Python, you can use the ttest_ind function in the scipy.stats module. Its inputs are the two independent samples and, optionally, the alternative hypothesis, which can be 'two-sided', 'less', or 'greater'; the default is 'two-sided'. The output is the test statistic and the P-value.

Example of a two-sample t-test:
 ```python
from scipy.stats import ttest_ind
ttest_ind(sample1, sample2, alternative='two-sided')
```

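When using the ttest_ind method with a two-tailed alternative hypothesis, report the returned P-value as is; halve it only for a one-tailed alternative whose direction matches the sign of the test statistic.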

In [3]:
from scipy.stats import ttest_ind

# Generate two random samples from a normal distribution with the same mean and standard deviation.
sample1 = np.random.normal(2.48,0.500,50)
sample2 = np.random.normal(2.48,0.500,50)

# perform the hypothesis test. output is a Python tuple that contains test_statistic and the two-sided P_value.
test_statistic, p_value = ttest_ind(sample1, sample2)

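# Halving the two-sided p-value gives a one-tailed p-value. This is valid only when the
# test statistic falls in the direction of the alternative hypothesis.
# For a two-sided alternative, report the unhalved value instead.
p_value = p_value/2

print("test-statistic =", round(test_statistic,2))
print("one-tailed p-value =", round(p_value,4))

test-statistic = 1.38
one-tailed p-value = 0.0848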
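
Instead of halving by hand, the alternative argument of ttest_ind can request a one-sided P-value directly. The sketch below, which uses its own seeded samples (the seed and the names sample_a and sample_b are assumptions for illustration), compares the two approaches and shows when halving gives the same answer.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)  # assumed seed
sample_a = rng.normal(2.48, 0.500, 50)
sample_b = rng.normal(2.48, 0.500, 50)

# default two-sided test
t_stat, p_two_sided = ttest_ind(sample_a, sample_b)

# one-sided test of H1: mean of sample_a > mean of sample_b
_, p_greater = ttest_ind(sample_a, sample_b, alternative='greater')

# halving matches the one-sided result only when t_stat points in the tested direction;
# otherwise the one-sided p-value is 1 minus half of the two-sided value
expected = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print("t =", round(t_stat, 2))
print("two-sided p-value =", round(p_two_sided, 4))
print("one-sided p-value (greater) =", round(p_greater, 4))
print("matches halving rule:", np.isclose(p_greater, expected))
```

Passing alternative explicitly avoids reporting a halved value when the test statistic points away from the alternative.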
