<a href="https://colab.research.google.com/github/cloudpedagogy/data-science-programming/blob/main/statistics-scipy/02_Hypothesis_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Hypothesis Testing


##Overview


Hypothesis testing is a fundamental concept in statistics that allows us to make informed decisions about populations based on sample data. It involves formulating two competing statements, the null hypothesis (H0) and the alternative hypothesis (H1), and then using statistical methods to assess whether the sample provides enough evidence to reject the null hypothesis in favor of the alternative.

In hypothesis testing, the null hypothesis represents the status quo or the default assumption. It states that there is no difference or effect in the population being studied. The alternative hypothesis, on the other hand, is the statement we aim to support with evidence: that there is a difference or effect in the population.

To perform hypothesis testing in Python, the SciPy library provides a comprehensive set of functions that cover a wide range of statistical tests. SciPy is a powerful and user-friendly open-source library for scientific computing and statistical analysis.

The hypothesis testing process typically involves the following steps:

1. Formulating the Hypotheses: Clearly define the null hypothesis and the alternative hypothesis based on the research question or problem at hand.

2. Choosing a Test: Depending on the nature of the data and the research question, select an appropriate statistical test from SciPy's wide range of functions, such as t-tests, ANOVA, chi-square tests, correlation tests, and more.

3. Collecting and Preparing Data: Gather relevant data and ensure it is in the appropriate format for analysis. Most `scipy.stats` functions accept array-like input such as NumPy arrays, Python lists, or pandas Series (for example, a single DataFrame column).

4. Setting the Significance Level: Decide on the significance level (alpha), which represents the probability of rejecting the null hypothesis when it is true. Commonly used values are 0.05 or 0.01, indicating a 5% or 1% chance of making a Type I error, respectively.

5. Conducting the Test: Use the chosen SciPy function to perform the statistical test on the data. The output will provide a test statistic and a p-value.

6. Interpreting Results: Compare the p-value to the significance level. If the p-value is less than or equal to alpha, we reject the null hypothesis in favor of the alternative hypothesis. Otherwise, we fail to reject the null hypothesis.

7. Drawing Conclusions: Based on the test result, draw conclusions about the population and the relationship between variables. Be cautious not to over-interpret or make causal claims solely based on the test outcome.
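
Step 4 describes alpha as the probability of rejecting a true null hypothesis. The sketch below checks that claim by simulation under stated assumptions: the data are synthetic, both samples are drawn from the same normal population so the null hypothesis holds by construction, and the names `n_experiments` and `type_i_rate` are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 2000

# Each row is one simulated experiment; both samples share the same mean
# and spread, so every rejection is a false positive (Type I error).
a = rng.normal(loc=100, scale=15, size=(n_experiments, 30))
b = rng.normal(loc=100, scale=15, size=(n_experiments, 30))

# Run one Welch t-test per row
_, p_values = stats.ttest_ind(a, b, axis=1, equal_var=False)

# Share of experiments in which H0 was (wrongly) rejected
type_i_rate = np.mean(p_values <= alpha)
print(f"Share of false rejections: {type_i_rate:.3f} (alpha = {alpha})")
```

The printed share should land near alpha; it will not match exactly because it is itself an estimate from a finite number of simulated experiments.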

By using the SciPy library in Python, hypothesis testing becomes a straightforward and accessible process. It allows researchers, data scientists, and analysts to make evidence-based decisions and draw meaningful insights from their data, enhancing the rigor and validity of statistical analyses.
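
The seven steps above can be sketched end to end as follows. This is a minimal illustration, not a template for every analysis: the two groups are synthetic (drawn with NumPy's random generator), and the names `group_a`, `group_b`, and `reject_null` are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Step 1: H0: the two population means are equal; H1: they differ.
# Step 2: Welch's t-test (two independent groups, variances not assumed equal).

# Step 3: synthetic data standing in for two groups of measurements
rng = np.random.default_rng(42)
group_a = rng.normal(loc=110, scale=25, size=200)
group_b = rng.normal(loc=140, scale=30, size=120)

# Step 4: choose the significance level before looking at the result
alpha = 0.05

# Step 5: run the test
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)

# Step 6: compare the p-value with alpha
reject_null = p_val <= alpha

# Step 7: state the conclusion in terms of the hypotheses
if reject_null:
    print(f"t = {t_stat:.2f}, p = {p_val:.3g}: reject H0 at alpha = {alpha}")
else:
    print(f"t = {t_stat:.2f}, p = {p_val:.3g}: fail to reject H0 at alpha = {alpha}")
```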

#Implementing t-tests, chi-square tests, ANOVA


##t-tests



T-tests are a type of hypothesis test that allows you to compare means to determine if they are significantly different from each other. In Python, we can use the `scipy` library to perform T-tests.

The Pima Indians Diabetes dataset contains information about female patients at least 21 years old of Pima Indian heritage. The dataset consists of several medical predictor variables and one binary target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, etc.

Suppose you want to compare the mean glucose level between people with diabetes and without diabetes. Here's how you can perform an independent t-test:


In [None]:
# Import required libraries
import pandas as pd
from scipy import stats

# Load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataframe = pd.read_csv(url, names=names)

# Split the data into two groups
group1 = dataframe[dataframe['Outcome'] == 0]['Glucose']
group2 = dataframe[dataframe['Outcome'] == 1]['Glucose']

# Perform the t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var = False, nan_policy='omit')

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")


The `ttest_ind` function from `scipy.stats` is used to perform the independent two-sample t-test.

The `equal_var = False` argument is used because we do not want to assume that the two populations have the same variance (this is known as Welch's T-test).

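`nan_policy='omit'` tells the function to exclude NaN values before computing the test. Note that in this CSV file missing measurements are typically recorded as 0 rather than NaN (for example, some `Glucose` and `Insulin` entries), so `nan_policy='omit'` has no effect on them unless those zeros are first converted to NaN.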

The `t_stat` variable represents the calculated t-statistic, while `p_val` is the two-tailed p-value. If the p-value is less than your significance level (often 0.05), you can reject the null hypothesis and conclude that the means of glucose levels between people with diabetes and without diabetes are significantly different.
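
To make the Welch variant concrete, the sketch below recomputes the t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-tailed p-value by hand and compares them with `ttest_ind(..., equal_var=False)`. The samples are synthetic and the variable names (`x`, `y`, `t_manual`, and so on) are illustrative assumptions, not part of the dataset above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=110, scale=25, size=80)   # synthetic group 1
y = rng.normal(loc=125, scale=35, size=50)   # synthetic group 2

# Sample means, unbiased sample variances, and sample sizes
m1, m2 = x.mean(), y.mean()
v1, v2 = x.var(ddof=1), y.var(ddof=1)
n1, n2 = len(x), len(y)

# Welch's t-statistic: difference in means over its standard error
se = np.sqrt(v1 / n1 + v2 / n2)
t_manual = (m1 - m2) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# Two-tailed p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# The same test through SciPy
t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```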


##chi-square tests



Chi-Square tests are often used in hypothesis testing to examine the independence of two categorical variables. In the context of the Pima Indian Diabetes dataset, one might want to investigate if there's a relationship between having diabetes (a binary categorical variable) and the number of pregnancies (which we can convert to a categorical variable for the sake of this example).

Let's see how to perform a chi-square test in Python using the scipy library.

First, you need to import necessary libraries and load the data:


In [None]:
import pandas as pd
import scipy.stats as stats

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataframe = pd.read_csv(url, names=names)


We can make the 'Pregnancies' variable categorical by dividing it into three groups: '0-3', '4-6', and '7+'. The 'Outcome' variable is already categorical (0 for non-diabetes, 1 for diabetes).
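
As a quick check of where the bin edges fall, here is a small sketch that applies the same three-group binning to a handful of made-up pregnancy counts. The list `counts` is hypothetical; `pd.cut` treats each bin as open on the left and closed on the right by default, which is why the first edge is set to -1 so that 0 is included.

```python
import pandas as pd

# Hypothetical pregnancy counts chosen to sit on and around the bin edges
counts = [0, 3, 4, 6, 7, 17]

# Bins are (-1, 3], (3, 6], (6, inf), labelled '0-3', '4-6', '7+'
binned = pd.cut(counts, bins=[-1, 3, 6, float('inf')], labels=['0-3', '4-6', '7+'])

for value, label in zip(counts, binned):
    print(value, "->", label)
```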

In [None]:
# Create a categorical 'Pregnancies' variable
dataframe['Pregnancies_cat'] = pd.cut(dataframe['Pregnancies'], bins=[-1,3,6,float('inf')], labels=['0-3','4-6','7+'])


Then you can use the `pd.crosstab` function to create a contingency table:


In [None]:
contingency_table = pd.crosstab(dataframe['Outcome'], dataframe['Pregnancies_cat'])


Now, you can perform the Chi-Square test of independence:


In [None]:
chi2, p, dof, ex = stats.chi2_contingency(contingency_table)

print(f"Chi-square statistic = {chi2}")
print(f"p-value = {p}")


The p-value tells you whether or not the differences between the groups are statistically significant. If the p-value is less than or equal to 0.05, you would reject the null hypothesis that the variables are independent, and conclude that there is a significant relationship between them. If the p-value is greater than 0.05, you would not reject the null hypothesis.

Please note that it's always important to examine the data and understand the assumptions behind statistical tests before applying them.
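
One assumption worth checking for the chi-square test is that the expected counts are not too small (a common rule of thumb is at least 5 per cell). The sketch below uses a small made-up 2x3 table, not the Pima data, to show where the expected counts come from and how they relate to the fourth value returned by `chi2_contingency`. The counts in `observed` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = outcome (0/1), columns = three categories
observed = np.array([[30, 20, 10],
                     [10, 20, 30]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / observed.sum()

# Chi-square statistic computed directly from observed and expected counts
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print("Expected counts:\n", expected)
print(f"chi2 = {chi2:.2f} (manual {chi2_manual:.2f}), dof = {dof}, p = {p:.3g}")
print("All expected counts >= 5:", bool((expected >= 5).all()))
```

The degrees of freedom equal (rows - 1) x (columns - 1), which is 2 for this table.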


##ANOVA



**ANOVA (Analysis of Variance)** is a statistical method used to analyze the differences between two or more groups by comparing the variances within each group to the variance between the groups. It helps determine whether the means of the groups are significantly different from each other.

In ANOVA, we test the null hypothesis that the means of all groups are equal against the alternative hypothesis that at least one group mean is different. If the p-value is below a chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant difference between at least one pair of group means.

Now let's demonstrate how to perform ANOVA using the Python `scipy` library with the Pima Indian Diabetes dataset:


In [None]:
import pandas as pd
from scipy.stats import f_oneway

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=names)

# Split the data into groups based on the 'Outcome' variable
group0 = data[data['Outcome'] == 0]['Glucose']
group1 = data[data['Outcome'] == 1]['Glucose']

# Perform ANOVA
statistic, p_value = f_oneway(group0, group1)

# Print the results
print("ANOVA results:")
print("F-statistic:", statistic)
print("p-value:", p_value)


In this example, we split the dataset into two groups based on the 'Outcome' variable: group0 (non-diabetic) and group1 (diabetic). We then use the `f_oneway()` function from `scipy.stats` to perform the ANOVA test. The function returns the F-statistic and the p-value.

The F-statistic represents the ratio of between-group variance to within-group variance. A larger F-statistic indicates a larger difference between the group means relative to the variation within each group. The p-value is the probability, assuming the null hypothesis is true, of obtaining an F-statistic at least as large as the one observed. If the p-value is below the chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant difference between the group means.

Note that in this example, we only performed ANOVA on a single variable ('Glucose') and split the data based on the 'Outcome' variable, which yields just two groups. With two groups, one-way ANOVA gives the same p-value as the equal-variance t-test; it becomes more useful with three or more groups. In a real-world scenario, you might want to perform ANOVA on multiple variables or analyze different groupings based on other variables. Additionally, ANOVA relies on assumptions such as approximately normally distributed data within each group, homogeneity of variances, and independent observations, which should be checked before interpreting the results.
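
The sketch below illustrates two points from this section on synthetic data: with exactly two groups, the F-statistic from `f_oneway` equals the square of the equal-variance t-statistic, and the homogeneity-of-variances assumption can be examined with Levene's test (`scipy.stats.levene`). The groups `g0` and `g1` are made up for illustration and do not come from the Pima dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
g0 = rng.normal(loc=110, scale=25, size=60)  # synthetic group 0
g1 = rng.normal(loc=125, scale=25, size=40)  # synthetic group 1

# One-way ANOVA on two groups
f_stat, p_anova = stats.f_oneway(g0, g1)

# Student's t-test (equal variances assumed) on the same groups
t_stat, p_t = stats.ttest_ind(g0, g1, equal_var=True)

print(f"F = {f_stat:.4f}, t^2 = {t_stat**2:.4f}")
print(f"p (ANOVA) = {p_anova:.4g}, p (t-test) = {p_t:.4g}")

# Levene's test: H0 is that the groups have equal variances
lev_stat, lev_p = stats.levene(g0, g1)
print(f"Levene: W = {lev_stat:.4f}, p = {lev_p:.4g}")
```

A small Levene p-value would suggest unequal variances, in which case Welch's t-test (for two groups) may be a safer choice than the classic ANOVA.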


#Mann-Whitney U test



The Mann-Whitney U test is a non-parametric test for comparing two independent samples. Below is an example of how to perform it using Python's SciPy library, using the Pima Indian dataset as an example. Please make sure you have the SciPy library installed before running the code.


In [None]:
import pandas as pd
from scipy.stats import mannwhitneyu

# Load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
data = pd.read_csv(url, header=None)

# Assign the two groups for comparison
group1 = data[data[8] == 0][1]  # Non-diabetic group
group2 = data[data[8] == 1][1]  # Diabetic group

# Perform the Mann-Whitney U test
statistic, p_value = mannwhitneyu(group1, group2)

# Print the results
print("Mann-Whitney U test results:")
print(f"Statistic: {statistic}")
print(f"P-value: {p_value}")


In the code above, we start by importing the necessary libraries. We then load the Pima Indian dataset from the provided URL using `pd.read_csv()`. Since the CSV file has no header row, we pass `header=None` to the function, and the columns are numbered from 0.

Next, we define `group1` and `group2` by filtering the dataset based on the outcome variable (column 8), where `0` represents non-diabetic individuals and `1` represents diabetic individuals, and selecting the glucose values (column 1) for each group.

We then use the `mannwhitneyu()` function from SciPy to perform the Mann-Whitney U test on the two groups. The function returns the test statistic and the p-value.

Finally, we print the test results, including the statistic and the p-value.

Please note that the example assumes that the dataset is in the CSV format and that the outcome variable is in column 8. You may need to adjust the code according to your specific dataset structure.
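
To show what the reported statistic measures, the sketch below computes U from rank sums by hand and compares it with `mannwhitneyu`. The two samples are synthetic, skewed (exponential) data chosen only for illustration, and the names `u1` and `u2` are assumptions of this example. Recent SciPy releases report the U value of the first sample; older releases may report the other value of the pair, so the comparison at the end is written to hold either way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=25)  # synthetic, right-skewed sample 1
y = rng.exponential(scale=2.0, size=30)  # synthetic, right-skewed sample 2
n1, n2 = len(x), len(y)

# Rank all observations together, then sum the ranks of the first sample
ranks = stats.rankdata(np.concatenate([x, y]))
r1 = ranks[:n1].sum()

# U for each sample; the two values always add up to n1 * n2
u1 = r1 - n1 * (n1 + 1) / 2
u2 = n1 * n2 - u1

u_scipy, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")

print(f"U1 = {u1}, U2 = {u2}, n1 * n2 = {n1 * n2}")
print(f"SciPy statistic = {u_scipy}, p-value = {p_value:.4g}")
```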


#Wilcoxon signed-rank test



The Wilcoxon signed-rank test is a non-parametric statistical test used to compare two related samples. In this example, we will use the Pima Indians Diabetes dataset and perform the Wilcoxon signed-rank test using the Python `scipy` library.


In [None]:
import pandas as pd
from scipy.stats import wilcoxon

# Load the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data = pd.read_csv(url, header=None)
sample1 = data[0].values
sample2 = data[1].values

# Perform the Wilcoxon signed-rank test
statistic, p_value = wilcoxon(sample1, sample2)

# Print the results
print("Wilcoxon signed-rank test")
print("Statistic:", statistic)
print("P-value:", p_value)


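In this code, we load the Pima Indians Diabetes dataset from the provided URL using the `pandas` library. We extract two columns from the dataset: `sample1` (column 0, the number of pregnancies) and `sample2` (column 1, the glucose level). The values are paired in the sense that each row belongs to one patient, but the two columns measure different quantities on different scales, so the result only illustrates the function call and has no meaningful interpretation.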

Then, we use the `wilcoxon` function from `scipy.stats` to perform the Wilcoxon signed-rank test. The `wilcoxon` function takes the two samples as input and returns the test statistic and the p-value.

Finally, we print the test results, including the calculated statistic and the p-value.

Please note that the Pima Indians Diabetes dataset is typically used for classification tasks, and the Wilcoxon signed-rank test is not commonly applied to this dataset. This example is provided solely for demonstrating how to use the test with the given dataset.
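
For a setting closer to the test's intended use, here is a minimal sketch with synthetic before/after measurements on the same subjects. The data, the size of the shift, and the names `before` and `after` are all assumptions made for illustration. Passing two samples to `wilcoxon` is equivalent to passing their element-wise differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic paired data: 40 subjects measured twice
before = rng.normal(loc=140, scale=20, size=40)
# "After" values are the same subjects shifted down by about 10 units, plus noise
after = before - 10 + rng.normal(loc=0, scale=10, size=40)

# Two-sample form: the test is computed on before - after
stat, p_value = stats.wilcoxon(before, after)

# One-sample form on the paired differences gives the same result
stat_d, p_d = stats.wilcoxon(before - after)

print(f"Two-sample call:  statistic = {stat}, p = {p_value:.4g}")
print(f"Differences call: statistic = {stat_d}, p = {p_d:.4g}")
```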


#Reflection Points

1. **T-tests**:
   - Reflection: What are t-tests used for, and in what scenarios would you choose a t-test over other statistical tests?
   - Answer: T-tests are used to compare the means of two groups and determine if they are statistically different. They are suitable when the data is normally distributed, and you want to assess the significance of the mean difference between groups.

2. **Analysis of Variance (ANOVA)**:
   - Reflection: When would you choose ANOVA instead of a t-test? What are the key assumptions of ANOVA?
   - Answer: ANOVA is used when you have three or more groups to compare. It examines whether there are statistically significant differences in the means across multiple groups. The key assumptions include normally distributed data, equal variances across groups, and independent observations.

3. **Chi-square test**:
   - Reflection: What is the purpose of the Chi-square test, and in what scenarios would you use it?
   - Answer: The Chi-square test is used to determine the independence or association between categorical variables. It is commonly applied when analyzing data in contingency tables or conducting hypothesis testing with categorical data.

4. **Mann-Whitney U test**:
   - Reflection: When would you choose the Mann-Whitney U test instead of a t-test? What does it assess?
   - Answer: The Mann-Whitney U test is a non-parametric test used to compare two independent groups when the assumptions of the t-test are not met (e.g., clearly non-normal distributions or ordinal data). It assesses whether values in one group tend to be larger than values in the other; it can be read as a comparison of medians only when the two distributions have similar shapes.

5. **Wilcoxon signed-rank test**:
   - Reflection: In what situations would you use the Wilcoxon signed-rank test, and what does it measure?
   - Answer: The Wilcoxon signed-rank test is used to compare paired samples or repeated measures when the assumptions for parametric tests are not met. It assesses whether the paired differences (for example, before and after an intervention, or between two related groups) are centered on zero, that is, whether their median differs significantly from zero.


#A quiz on Hypothesis Testing using Scipy


1. Which statistical test should be used when comparing means of two independent samples?
   <br>a) t-test  
   <br>b) Mann-Whitney U test  
   <br>c) Wilcoxon signed-rank test  
   <br>d) Chi-square test  

2. When is the Mann-Whitney U test preferred over the t-test?
   <br>a) When the sample size is large  
   <br>b) When the data is normally distributed  
   <br>c) When the sample size is small and the data is not normally distributed  
   <br>d) When the data is binary  

3. The Wilcoxon signed-rank test is used to compare:
   <br>a) Means of two independent samples  
   <br>b) Proportions of two independent samples  
   <br>c) Paired samples  
   <br>d) Variances of two independent samples  

4. In a chi-square test, the null hypothesis states that:
   <br>a) There is no difference between the means of two independent samples  
   <br>b) There is no association between two categorical variables  
   <br>c) The distribution of the data is normal  
   <br>d) The data is not normally distributed  

5. Which Scipy function can be used to perform a t-test for two independent samples with unequal variances?
   <br>a) `scipy.stats.ttest_1samp`  
   <br>b) `scipy.stats.ttest_rel`  
   <br>c) `scipy.stats.ttest_ind`  
   <br>d) `scipy.stats.ttest_ind_from_stats`  

---
**Answers:**

1. a) t-test
2. c) When the sample size is small and the data is not normally distributed
3. c) Paired samples
4. b) There is no association between two categorical variables
5. c) `scipy.stats.ttest_ind` (called with `equal_var=False`)
---