
# Nonparametric Tests


## Overview


In data science, the analysis of data often involves making statistical inferences and drawing conclusions about populations based on samples. Traditional statistical methods, known as parametric tests, rely on assumptions about the underlying data distribution, such as normality and homogeneity of variance. While these assumptions are often met in practice, there are situations where they may not hold, making parametric tests less reliable and potentially leading to erroneous conclusions.

Nonparametric tests, on the other hand, provide a valuable alternative for data analysis when the assumptions of parametric tests are violated or when dealing with data that do not have a specific known distribution. These tests are also applicable in cases where the sample size is small, and it is difficult to assess the data's distribution accurately.

Nonparametric tests are designed to make fewer assumptions about the data and are robust against violations of normality or other distributional assumptions. Instead of estimating population parameters, these tests focus on ranking and comparing observations, making them more versatile in various scenarios.
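To make the idea of "ranking observations" concrete, here is a minimal sketch using `scipy.stats.rankdata`. The sample values are hypothetical and chosen only to show how ties and an extreme value are handled.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical sample containing a tie (2.4 appears twice) and one extreme value
values = np.array([3.1, 2.4, 2.4, 150.0, 5.0])
ranks = rankdata(values)  # default method: tied values share the average rank

# Make the extreme value far more extreme; the ordering of the data is unchanged
inflated = values.copy()
inflated[3] = 1.5e6
ranks_inflated = rankdata(inflated)

print("ranks:           ", ranks)
print("ranks (inflated):", ranks_inflated)
print("means:", values.mean(), "vs", inflated.mean())
```

The ranks are identical for both samples while the means differ greatly, which is why rank-based tests are less sensitive to outliers than tests built on means.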

Key advantages of nonparametric tests include their simplicity, flexibility, and ease of interpretation, making them particularly useful in situations where data does not meet the stringent assumptions of parametric tests. Moreover, nonparametric tests are effective in analyzing ordinal or categorical data, making them suitable for a wide range of applications, including social sciences, biology, healthcare, and marketing research.

Some commonly used nonparametric tests in data science include the Mann-Whitney U test (Wilcoxon rank-sum test), Kruskal-Wallis test, Wilcoxon signed-rank test, and Spearman's rank correlation, among others. These tests enable data scientists to draw meaningful conclusions from their data, even in situations where traditional parametric methods would be inadequate.
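Two of the tests named above, the Wilcoxon signed-rank test and Spearman's rank correlation, are available in `scipy.stats` as `wilcoxon` and `spearmanr`. The sketch below runs both on synthetic paired data; the variable names, sample size, and distributions are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import spearmanr, wilcoxon

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical paired measurements on the same 30 subjects
before = rng.normal(loc=100, scale=15, size=30)
after = before + rng.normal(loc=5, scale=10, size=30)

# Wilcoxon signed-rank test: compares two related samples via their differences
w_stat, w_p = wilcoxon(before, after)

# Spearman's rank correlation: strength of a monotonic relationship
rho, rho_p = spearmanr(before, after)

# A perfectly monotonic but non-linear relationship still gives rho = 1
x = np.arange(1, 11)
rho_monotonic, _ = spearmanr(x, x ** 3)

print(f"Wilcoxon signed-rank: statistic={w_stat:.3f}, p-value={w_p:.4f}")
print(f"Spearman: rho={rho:.3f}, p-value={rho_p:.4f}")
print(f"Spearman rho for y = x^3: {rho_monotonic:.3f}")
```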

In summary, nonparametric tests provide robust and versatile techniques for hypothesis testing and correlation assessment. By relaxing the strict distributional assumptions of parametric tests, they let data scientists draw defensible conclusions from data that parametric methods would handle poorly.

# Kruskal-Wallis H-test


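The Kruskal-Wallis H-test is a nonparametric test of whether two or more independent groups come from the same distribution; it is commonly used as the rank-based alternative to one-way ANOVA. A small p-value indicates that at least one group tends to have larger or smaller values than another, but not which one. We can use the `scipy` library in Python to perform the test.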


In [None]:
import pandas as pd
from scipy.stats import kruskal

# Load the Pima Indian dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
data = pd.read_csv(url, header=None)
data.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Separate the data based on the outcome (diabetes/non-diabetes)
diabetes = data[data['Outcome'] == 1]['Glucose']
non_diabetes = data[data['Outcome'] == 0]['Glucose']

# Perform the Kruskal-Wallis H-test
statistic, p_value = kruskal(diabetes, non_diabetes)

# Print the results
print(f"Kruskal-Wallis H-test statistic: {statistic}")
print(f"p-value: {p_value}")


In the code above, we first import `pandas` and the `kruskal` function from `scipy.stats`. We then load the Pima Indian dataset from the provided URL and assign appropriate column names.

Next, we separate the data based on the outcome (diabetes or non-diabetes) by filtering the `Glucose` values. The Kruskal-Wallis H-test is performed using the `kruskal` function, which takes the groups as separate arguments.
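Because `kruskal` accepts any number of groups, the same call extends to three or more samples. The sketch below uses synthetic groups (not the Pima data; the group names, sizes, and means are assumptions) and also recomputes the H statistic by hand from the pooled ranks. The hand calculation omits the tie correction, which is acceptable here only because continuous random data contain no ties.

```python
import numpy as np
from scipy.stats import chi2, kruskal, rankdata

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Three hypothetical independent groups with shifted locations
group_a = rng.normal(loc=0.0, scale=1.0, size=20)
group_b = rng.normal(loc=0.5, scale=1.0, size=25)
group_c = rng.normal(loc=1.0, scale=1.0, size=30)
groups = [group_a, group_b, group_c]

# Library result
h_scipy, p_scipy = kruskal(*groups)

# Manual computation: rank the pooled data, then sum the ranks per group
pooled = np.concatenate(groups)
ranks = rankdata(pooled)
n_total = len(pooled)
split_points = np.cumsum([len(g) for g in groups])[:-1]
rank_groups = np.split(ranks, split_points)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)   (no tie correction)
h_manual = (12.0 / (n_total * (n_total + 1))
            * sum(r.sum() ** 2 / len(r) for r in rank_groups)
            - 3 * (n_total + 1))

# Under the null hypothesis, H is approximately chi-squared with k - 1 degrees of freedom
p_manual = chi2.sf(h_manual, df=len(groups) - 1)

print(f"scipy : H={h_scipy:.4f}, p={p_scipy:.4g}")
print(f"manual: H={h_manual:.4f}, p={p_manual:.4g}")
```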


# Friedman test



The Friedman test is a nonparametric statistical test used to compare three or more related samples, such as the same subjects measured under several treatments. It is the rank-based counterpart of repeated-measures ANOVA: values are ranked within each subject, and the test checks whether the treatments' rank sums differ more than chance would explain. In this case, we'll use the Pima Indian dataset as an example.


In [None]:
import pandas as pd
from scipy.stats import friedmanchisquare

# Load the Pima Indian dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
data = pd.read_csv(url, header=None)
data.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

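# For illustration only: treat the first three columns (Pregnancies, Glucose, BloodPressure)
# as three 'treatments'. They are different variables on different scales, so this
# demonstrates the function call rather than a meaningful repeated-measures analysis.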
treatment_1 = data.iloc[:, 0]
treatment_2 = data.iloc[:, 1]
treatment_3 = data.iloc[:, 2]

# Perform the Friedman test
statistic, p_value = friedmanchisquare(treatment_1, treatment_2, treatment_3)

print(f"Friedman Test Statistic: {statistic}")
print(f"P-value: {p_value}")


In the code above, we first load the Pima Indian dataset using `pd.read_csv()`. Then, we define three treatments based on columns 0, 1, and 2 of the dataset. Finally, we use `friedmanchisquare()` from `scipy.stats` to perform the Friedman test and obtain the test statistic and p-value.

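Note that in this example, we assumed three treatments for simplicity, and the three columns are different measurements rather than repeated measurements of one quantity, so the resulting p-value only demonstrates the mechanics. In a real analysis, each argument should hold the same subjects measured under one condition, in the same order. `friedmanchisquare()` accepts three or more such samples, so you can adjust the code according to the number of treatments in your dataset.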
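For comparison, here is a minimal sketch of the Friedman test on data with a genuine repeated-measures structure. The data are synthetic: each of 12 hypothetical subjects has a personal baseline and is measured under three conditions. The manual calculation ranks values within each subject and omits the tie correction, which is safe only because the simulated values contain no ties.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(7)  # fixed seed for reproducibility

n_subjects = 12
# Each subject has their own baseline, so the three samples are related
baseline = rng.normal(loc=50, scale=10, size=n_subjects)
cond_1 = baseline + rng.normal(0, 2, n_subjects)
cond_2 = baseline + 3 + rng.normal(0, 2, n_subjects)
cond_3 = baseline + 6 + rng.normal(0, 2, n_subjects)

# Library result: one argument per condition, subjects in the same order
stat, p_value = friedmanchisquare(cond_1, cond_2, cond_3)

# Manual computation: rank the k conditions within each subject (row)
data = np.column_stack([cond_1, cond_2, cond_3])
n, k = data.shape
within_ranks = rankdata(data, axis=1)
rank_sums = within_ranks.sum(axis=0)

# chi2_F = 12 / (n k (k + 1)) * sum(R_j^2) - 3 n (k + 1)   (no tie correction)
stat_manual = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3 * n * (k + 1)

print(f"scipy : statistic={stat:.4f}, p-value={p_value:.4g}")
print(f"manual: statistic={stat_manual:.4f}")
```

Ranking within subjects removes the between-subject baseline differences, which is how the test accounts for the correlation among measurements from the same subject.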


# Reflection Points

1. **Understanding Statistical Tests**: Reflect on the importance of statistical tests in data analysis and their role in making informed decisions. Consider how the Kruskal-Wallis H-test, Friedman test, and Rank-sum (Mann-Whitney U) test contribute to statistical analysis in different scenarios.

2. **Use Cases**: Explore real-world applications where the Kruskal-Wallis H-test, Friedman test, and Rank-sum test are commonly used. Reflect on the types of data and research questions for which these tests are suitable.

3. **Assumptions and Limitations**: Consider the assumptions underlying the Kruskal-Wallis H-test, Friedman test, and Rank-sum test. Reflect on the limitations of these tests and situations in which alternative tests may be more appropriate.

4. **Interpreting Test Results**: Reflect on how to interpret the results of these tests. Consider the statistical measures provided by these tests, such as p-values, test statistics, and effect sizes, and how they contribute to drawing meaningful conclusions from the data.

5. **Comparing Multiple Groups**: Reflect on the Kruskal-Wallis H-test and its ability to compare multiple independent groups. Consider scenarios in which this test is advantageous over other methods, such as one-way ANOVA.

6. **Repeated Measures Design**: Explore the Friedman test and its application in repeated measures designs. Reflect on situations where this test is suitable and how it accounts for correlated observations within subjects.

7. **Nonparametric Tests**: Consider the advantages and disadvantages of nonparametric tests like the Rank-sum test. Reflect on situations in which nonparametric tests are preferred over parametric tests and the implications of the underlying assumptions.

8. **Implementing Tests in Python**: Reflect on the practical aspects of implementing these tests using the scipy library in Python. Consider the required data input formats, function parameters, and how to interpret the output.

9. **Data Preprocessing**: Reflect on the importance of data preprocessing before applying these tests. Consider the steps involved in handling missing data, outliers, and ensuring data meets the assumptions of the tests.

10. **Further Learning**: Reflect on the knowledge gained from studying the Kruskal-Wallis H-test, the Friedman test, and the Mann-Whitney U test. Consider additional resources or advanced topics related to these tests that you may want to explore to deepen your understanding.


# Exercise


1. Load the dataset into a pandas DataFrame.
2. Preprocess the data by removing any missing values or irrelevant columns.
3. Split the data into two groups: diabetic and non-diabetic women based on the 'Outcome' column (1 = diabetic, 0 = non-diabetic).
4. Perform a nonparametric test (Mann-Whitney U test) to compare the glucose levels between the two groups.
5. Interpret the results of the test and draw conclusions.


In [None]:
import pandas as pd
from scipy.stats import mannwhitneyu

# Task 1: Load the dataset into a pandas DataFrame
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
df = pd.read_csv(url, header=None)

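# Task 2: Preprocess the data
df.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']
# dropna() removes rows containing NaN only; placeholder values such as 0 in
# measurement columns (if present) are not treated as missing and need separate handling.
df = df.dropna()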

# Task 3: Split the data into diabetic and non-diabetic groups
diabetic_group = df[df['Outcome'] == 1]
non_diabetic_group = df[df['Outcome'] == 0]

# Task 4: Perform the Mann-Whitney U test (two-sided: the groups may differ in either direction)
statistic, p_value = mannwhitneyu(diabetic_group['Glucose'], non_diabetic_group['Glucose'], alternative='two-sided')

# Task 5: Interpret the results
alpha = 0.05
print(f"Mann-Whitney U Statistic: {statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("The p-value is less than the significance level (alpha), so we reject the null hypothesis.")
    print("There is significant evidence to suggest that there is a difference in glucose levels between diabetic and non-diabetic women.")
else:
    print("The p-value is greater than or equal to the significance level (alpha), so we fail to reject the null hypothesis.")
    print("There is no significant evidence to suggest that there is a difference in glucose levels between diabetic and non-diabetic women.")


Note: In this exercise, we used the Mann-Whitney U test, which is a nonparametric test used to compare two independent groups' distributions when the data is not normally distributed. It's suitable for comparing the glucose levels between diabetic and non-diabetic groups, as the glucose levels might not follow a normal distribution in the dataset.
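The name "rank-sum test" comes from how the U statistic is built: it is derived from the sum of one group's ranks in the pooled sample. The sketch below shows that relationship on synthetic data (the samples `x` and `y` are hypothetical, and the code assumes a recent SciPy release, 1.7 or later). It also derives a simple effect size, the proportion of cross-group pairs in which the value from `x` exceeds the value from `y`, which is valid as written only when there are no ties.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Two hypothetical independent samples with different locations
x = rng.normal(loc=1.0, scale=1.0, size=40)
y = rng.normal(loc=0.0, scale=1.0, size=50)

# Library result
u_scipy, p_value = mannwhitneyu(x, y, alternative="two-sided")

# Manual computation from the rank sum of x in the pooled sample
ranks = rankdata(np.concatenate([x, y]))
n_x, n_y = len(x), len(y)
rank_sum_x = ranks[:n_x].sum()
u_x = rank_sum_x - n_x * (n_x + 1) / 2   # U for sample x
u_y = n_x * n_y - u_x                    # U for sample y

# Effect size: estimated probability that a random x exceeds a random y
prob_x_greater = u_x / (n_x * n_y)
# Brute-force check over all (x_i, y_j) pairs
pairwise_check = np.mean(x[:, None] > y[None, :])

print(f"scipy U={u_scipy:.1f}, manual U_x={u_x:.1f}, U_y={u_y:.1f}, p={p_value:.4g}")
print(f"P(x > y) estimate: {prob_x_greater:.3f} (pairwise check: {pairwise_check:.3f})")
```

Reporting an effect size next to the p-value helps separate "a difference is detectable" from "the difference is large", which matters for big samples where even small shifts become significant.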

# A quiz on Nonparametric Tests


1. Non-parametric tests are used when:
   <br>a) The data is normally distributed.
   <br>b) The data does not follow a specific distribution or when assumptions of parametric tests are violated.
   <br>c) The data has a large sample size.
   <br>d) The data has a small sample size.

2. Which of the following is NOT an example of a non-parametric test?
   <br>a) Mann-Whitney U test
   <br>b) Kruskal-Wallis test
   <br>c) Pearson correlation test
   <br>d) Wilcoxon signed-rank test

3. The Mann-Whitney U test is used to compare:
   <br>a) Two independent samples.
   <br>b) Two dependent samples.
   <br>c) Three or more independent samples.
   <br>d) Three or more dependent samples.

4. The Kruskal-Wallis test is the rank-based counterpart of which parametric test?
   <br>a) Independent samples t-test
   <br>b) Paired t-test
   <br>c) One-way ANOVA
   <br>d) Chi-square test

5. The Wilcoxon signed-rank test is used to compare:
   <br>a) Two independent samples.
   <br>b) Two dependent samples.
   <br>c) Three or more independent samples.
   <br>d) Three or more dependent samples.

6. When should you use the Kruskal-Wallis test?
   <br>a) To compare two independent samples.
   <br>b) To compare two dependent samples.
   <br>c) To compare three or more independent samples.
   <br>d) To compare three or more dependent samples.

7. The Kruskal-Wallis test can be used as an alternative to which parametric test?
   <br>a) Independent samples t-test
   <br>b) Paired t-test
   <br>c) One-way ANOVA
   <br>d) Two-way ANOVA

8. The Chi-square test is suitable for testing the association between:
   <br>a) Two continuous variables.
   <br>b) Two categorical variables.
   <br>c) A categorical variable and a continuous variable.
   <br>d) Two dependent samples.

---

Answers:

1. b) The data does not follow a specific distribution or when assumptions of parametric tests are violated.
2. c) Pearson correlation test
3. a) Two independent samples.
4. c) One-way ANOVA
5. b) Two dependent samples.
6. c) To compare three or more independent samples.
7. c) One-way ANOVA
8. b) Two categorical variables.
---