<a href="https://colab.research.google.com/github/etemadism/Courses/blob/main/02_Independent_Two_Samples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Statistical Analysis for Comparison of Two Independent Groups**
##Overview
Welcome to the **Statistical Analysis** notebook for your statistics class. This notebook is designed to help you perform essential statistical analyses, including normality testing, variance equality checking, and hypothesis testing using the most suitable methods. It focuses on comparing the means of two groups (e.g., Male and Female) for multiple variables.

##Steps Covered:

1. Data Import: Load a CSV file containing your dataset.

2. Group Separation: Split the dataset based on a categorical variable (e.g., Gender).

3. Normality Testing: Check if each variable is normally distributed using either the Shapiro-Wilk or Kolmogorov-Smirnov test.

4. Levene’s Test: Check if the variances of the two groups are equal.

5. Hypothesis Testing:
  * If data is normally distributed and variances are equal, use pooled t-test.

  * If data is normally distributed but variances are unequal, use Welch's t-test.
  * If data is not normally distributed, use the Mann-Whitney U test (non-parametric test).

6. Result Summary: Display and store a summary of the results, including p-values and statistical test outcomes.


**Data Availability**

For learning purposes, you can access the dataset for this analysis on GitHub. The dataset is hosted on the GitHub account with the username **etemadism**. You can download the data directly from the repository to follow along with the notebook.

GitHub Repository: https://github.com/etemadism

**Author: Ali Etemadi**

Tehran University of Medical Sciences, Tehran, Iran

##Step 1: Import Libraries and define groups
We'll use the scipy.stats library for the t-test and numpy for calculations.

In [22]:
from scipy import stats
import pandas as pd

##Step 2: Define the Data
Define the data for both groups.

In [23]:

# Step 1: Load the data from CSV
df = pd.read_csv('/content/02_inde_t_test_sample_data.csv')
df

Unnamed: 0,Total homocysteine,Methylmalonic acid,Total cysteine,Methionine,Serine,Glycine,Pyridoxal 5'-phosphate,Pyridoxal,4-Pyridoxic acid,Pyridoxine,...,N1-methylnicotinamide,Cystathionine,Trigonelline,5-Methyl-tetrahydrofolate,5-Formyl-tetrahydrofolate,Folic acid,4-Alfa-hydroxy-5-methyl-THF,Para-aminobenzoylglutamate,Acetamidobenzoylglutamate,Gender
0,11.603118,0.380013,323.407760,32.147521,94.332101,223.703485,33.6,8.21,18.36,0,...,89.2,0.207,0.450,31.70,0,0.631,9.78,0.823,1.030,Female
1,8.652174,0.216109,277.687527,30.547725,100.675469,258.799353,26.0,3.08,17.20,0,...,28.3,0.300,0.327,30.60,0,0.000,6.42,0.863,2.080,Female
2,15.733290,1.328922,318.782237,32.815304,110.969300,214.863843,37.9,10.90,22.26,0,...,100.0,0.222,0.112,10.70,0,0.000,2.71,0.507,0.593,Male
3,20.843928,0.199061,366.622622,26.772327,113.550372,523.362031,215.0,19.20,61.75,0,...,49.7,0.292,0.797,10.10,0,0.000,4.44,0.970,0.291,Female
4,17.619270,0.408416,300.245602,35.448999,139.728205,361.125581,25.8,4.38,19.45,0,...,191.0,0.295,0.909,11.30,0,0.000,3.93,0.390,0.085,Male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,71.170973,0.156765,222.129076,23.078759,70.869299,220.049433,20.5,3.09,16.56,0,...,51.5,0.286,0.849,3.23,0,0.000,4.04,1.370,0.085,Male
396,14.792811,0.356055,300.356735,23.894748,111.621847,353.777228,51.5,6.88,26.05,0,...,162.0,0.444,1.800,5.48,0,0.000,2.32,0.735,0.085,Female
397,19.300000,2.259199,263.000000,26.516857,98.528120,240.567659,24.2,9.48,55.60,0,...,62.2,0.240,1.590,26.00,0,0.000,5.03,0.820,0.376,Male
398,8.958233,0.184666,280.243136,38.371311,128.986231,277.599232,39.5,9.50,25.13,0,...,106.0,0.135,0.624,19.30,0,0.000,4.23,0.440,0.249,Male


In [24]:
# Step 2: Recode the 'Gender' column (assuming Gender is represented as strings like 'Male' and 'Female')
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})  # Replace with correct values for Gender

# Step 3: Separate the data into two groups based on Gender (0 and 1)
group1 = df[df['Gender'] == 0]  # Male group
group2 = df[df['Gender'] == 1]  # Female group



In [25]:
df

Unnamed: 0,Total homocysteine,Methylmalonic acid,Total cysteine,Methionine,Serine,Glycine,Pyridoxal 5'-phosphate,Pyridoxal,4-Pyridoxic acid,Pyridoxine,...,N1-methylnicotinamide,Cystathionine,Trigonelline,5-Methyl-tetrahydrofolate,5-Formyl-tetrahydrofolate,Folic acid,4-Alfa-hydroxy-5-methyl-THF,Para-aminobenzoylglutamate,Acetamidobenzoylglutamate,Gender
0,11.603118,0.380013,323.407760,32.147521,94.332101,223.703485,33.6,8.21,18.36,0,...,89.2,0.207,0.450,31.70,0,0.631,9.78,0.823,1.030,1
1,8.652174,0.216109,277.687527,30.547725,100.675469,258.799353,26.0,3.08,17.20,0,...,28.3,0.300,0.327,30.60,0,0.000,6.42,0.863,2.080,1
2,15.733290,1.328922,318.782237,32.815304,110.969300,214.863843,37.9,10.90,22.26,0,...,100.0,0.222,0.112,10.70,0,0.000,2.71,0.507,0.593,0
3,20.843928,0.199061,366.622622,26.772327,113.550372,523.362031,215.0,19.20,61.75,0,...,49.7,0.292,0.797,10.10,0,0.000,4.44,0.970,0.291,1
4,17.619270,0.408416,300.245602,35.448999,139.728205,361.125581,25.8,4.38,19.45,0,...,191.0,0.295,0.909,11.30,0,0.000,3.93,0.390,0.085,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,71.170973,0.156765,222.129076,23.078759,70.869299,220.049433,20.5,3.09,16.56,0,...,51.5,0.286,0.849,3.23,0,0.000,4.04,1.370,0.085,0
396,14.792811,0.356055,300.356735,23.894748,111.621847,353.777228,51.5,6.88,26.05,0,...,162.0,0.444,1.800,5.48,0,0.000,2.32,0.735,0.085,1
397,19.300000,2.259199,263.000000,26.516857,98.528120,240.567659,24.2,9.48,55.60,0,...,62.2,0.240,1.590,26.00,0,0.000,5.03,0.820,0.376,0
398,8.958233,0.184666,280.243136,38.371311,128.986231,277.599232,39.5,9.50,25.13,0,...,106.0,0.135,0.624,19.30,0,0.000,4.23,0.440,0.249,0


In [30]:
# Step 4: Define variables as columns 1 to 8 (assuming these are the first 8 columns after 'Gender')
variables = df.columns[0:37]  # Selecting the first 8 columns

In [31]:
# Function to select normality test based on sample size
def normality_test(data, threshold=50):
    """Perform Shapiro-Wilk if n < threshold, otherwise Kolmogorov-Smirnov."""
    n = len(data)
    if n < threshold:
        stat, p_value = stats.shapiro(data)
        test = "Shapiro-Wilk"
    else:
        stat, p_value = stats.kstest(data, 'norm')
        test = "Kolmogorov-Smirnov"
    return test, stat, p_value



In [32]:
# Step 5: Perform Levene’s Test for each variable (check for equal variances)
for variable in variables:
    # Extract data for the current variable
    data_group1 = group1[variable]
    data_group2 = group2[variable]

    # Perform Levene's test for equality of variances
    levene_stat, levene_p = stats.levene(data_group1, data_group2)

    print(f'--- Levene’s Test for {variable} ---')
    print(f"Levene's test statistic: {levene_stat:.4f}, p-value: {levene_p:.4f}")

    # Perform normality test (choose based on sample size)
    test1, shapiro_group1_stat, shapiro_group1_p = normality_test(data_group1)
    test2, shapiro_group2_stat, shapiro_group2_p = normality_test(data_group2)

    print(f"Normality test on Group 1 using {test1}: p-value = {shapiro_group1_p:.4f}")
    print(f"Normality test on Group 2 using {test2}: p-value = {shapiro_group2_p:.4f}")

    # Check if data is normal, if not perform Mann-Whitney U test (non-parametric)
    if shapiro_group1_p < 0.05 or shapiro_group2_p < 0.05:
        print("Data is not normal, performing Mann-Whitney U test (non-parametric).")
        # Perform Mann-Whitney U test
        stat, p_value = stats.mannwhitneyu(data_group1, data_group2)
        print(f'Mann-Whitney U test result for {variable}:')
        print(f'U-statistic: {stat:.4f}, p-value: {p_value:.4f}')
    else:
        # If data is normal, proceed with t-test
        if levene_p > 0.05:
            print("Equal variances assumed, proceed with pooled t-test.")
            stat, p_value = stats.ttest_ind(data_group1, data_group2, equal_var=True)
        else:
            print("Unequal variances assumed, proceed with Welch’s t-test.")
            stat, p_value = stats.ttest_ind(data_group1, data_group2, equal_var=False)

        print(f'Independent T-test result for {variable}:')
        print(f'T-statistic: {stat:.4f}, p-value: {p_value:.4f}')

    # Check if the p-value is significant
    if p_value < 0.05:
        print(f'Significant difference between Group 1 and Group 2 for {variable} (reject H0)')
    else:
        print(f'No significant difference between Group 1 and Group 2 for {variable} (fail to reject H0)')

    print('------------------------------------')

--- Levene’s Test for Total homocysteine ---
Levene's test statistic: 3.8371, p-value: 0.0508
Normality test on Group 1 using Kolmogorov-Smirnov: p-value = 0.0000
Normality test on Group 2 using Kolmogorov-Smirnov: p-value = 0.0000
Data is not normal, performing Mann-Whitney U test (non-parametric).
Mann-Whitney U test result for Total homocysteine:
U-statistic: 25377.0000, p-value: 0.0000
Significant difference between Group 1 and Group 2 for Total homocysteine (reject H0)
------------------------------------
--- Levene’s Test for Methylmalonic acid ---
Levene's test statistic: 0.0035, p-value: 0.9530
Normality test on Group 1 using Kolmogorov-Smirnov: p-value = 0.0000
Normality test on Group 2 using Kolmogorov-Smirnov: p-value = 0.0000
Data is not normal, performing Mann-Whitney U test (non-parametric).
Mann-Whitney U test result for Methylmalonic acid:
U-statistic: 23678.0000, p-value: 0.0011
Significant difference between Group 1 and Group 2 for Methylmalonic acid (reject H0)
-----

  W = numer / denom


Normality test on Group 1 using Kolmogorov-Smirnov: p-value = 0.0000
Normality test on Group 2 using Kolmogorov-Smirnov: p-value = 0.0000
Data is not normal, performing Mann-Whitney U test (non-parametric).
Mann-Whitney U test result for Kynurenine:
U-statistic: 20806.0000, p-value: 0.4334
No significant difference between Group 1 and Group 2 for Kynurenine (fail to reject H0)
------------------------------------
--- Levene’s Test for 3-Hydroxykynurenine ---
Levene's test statistic: 5.5401, p-value: 0.0191
Normality test on Group 1 using Kolmogorov-Smirnov: p-value = 0.0000
Normality test on Group 2 using Kolmogorov-Smirnov: p-value = 0.0000
Data is not normal, performing Mann-Whitney U test (non-parametric).
Mann-Whitney U test result for 3-Hydroxykynurenine:
U-statistic: 18093.5000, p-value: 0.1170
No significant difference between Group 1 and Group 2 for 3-Hydroxykynurenine (fail to reject H0)
------------------------------------
--- Levene’s Test for Kynurenic acid ---
Levene's tes

##Step 3: Check for Equal Variances (Levene’s Test) and Perform the Independent Two-Sample t-Test
Levene’s test checks if the variances of the two groups are equal. This will determine if we should use the pooled t-test or Welch’s t-test.

If the p-value is greater than 0.05, we assume equal variances and proceed with the pooled t-test.
If the p-value is less than 0.05, we assume unequal variances and use Welch’s t-test.

In [37]:
# Step 5: Create an empty list to store results
results = []

# Perform normality test, Levene’s Test, and t-tests for each variable (check for normality, equality of variances)
for variable in variables:
    # Extract data for the current variable
    data_group1 = group1[variable]
    data_group2 = group2[variable]

    # Step 5.1: Perform normality test for each group
    normality_group1 = normality_test(data_group1)
    normality_group2 = normality_test(data_group2)

    # Print normality test results
    print(f'--- Normality Test for {variable} ---')
    print(f'Group 1 ({variable}) - Test: {normality_group1[0]}, Stat: {normality_group1[1]:.4f}, p-value: {normality_group1[2]:.4f}')
    print(f'Group 2 ({variable}) - Test: {normality_group2[0]}, Stat: {normality_group2[1]:.4f}, p-value: {normality_group2[2]:.4f}')

    # Step 5.2: Check if data are normal (p-value < 0.05 means not normal)
    if normality_group1[2] < 0.05 or normality_group2[2] < 0.05:
        print("Data is not normally distributed, using Mann-Whitney U Test.")
        # Mann-Whitney U test (non-parametric test for independent samples)
        stat, p_value = stats.mannwhitneyu(data_group1, data_group2)
        test_type = "Mann-Whitney U Test"
    else:
        # Perform Levene's test for equality of variances
        levene_stat, levene_p = stats.levene(data_group1, data_group2)

        print(f'--- Levene’s Test for {variable} ---')
        print(f"Levene's test statistic: {levene_stat:.4f}, p-value: {levene_p:.4f}")

        # If Levene's test shows p-value > 0.05, variances are equal (use pooled t-test), else use Welch's t-test
        if levene_p > 0.05:
            print("Equal variances assumed, proceed with pooled t-test.")
            stat, p_value = stats.ttest_ind(data_group1, data_group2, equal_var=True)
            test_type = "Pooled T-test (Equal variances)"
        else:
            print("Unequal variances assumed, proceed with Welch’s t-test.")
            stat, p_value = stats.ttest_ind(data_group1, data_group2, equal_var=False)
            test_type = "Welch's T-test (Unequal variances)"

    # Print the test result
    print(f'{test_type} result for {variable}:')
    print(f'T-statistic/U-statistic: {stat:.4f}, p-value: {p_value:.4f}')

    # Check if the p-value is significant
    if p_value < 0.05:
        significance = 'Significant'
        print(f'Significant difference between Group 1 and Group 2 for {variable} (reject H0)')
    else:
        significance = 'Not Significant'
        print(f'No significant difference between Group 1 and Group 2 for {variable} (fail to reject H0)')

    print('------------------------------------')

    # Step 6: Store the results for the table
    results.append({
        'Variable': variable,
        'Normality Test Group 1': normality_group1[0],
        'Normality p-value Group 1': round(normality_group1[2], 4),
        'Normality Test Group 2': normality_group2[0],
        'Normality p-value Group 2': round(normality_group2[2], 4),
        'Levene Statistic': round(levene_stat, 4) if 'levene_stat' in locals() else None,
        'Levene p-value': round(levene_p, 4) if 'levene_p' in locals() else None,
        'Test Type': test_type,
        'Test Statistic': round(stat, 4),
        'p-value': round(p_value, 4),
        'Significance': significance
    })

# Step 7: Create a DataFrame for the results table
results_df = pd.DataFrame(results)

# Step 8: Print the results table
print("Final Report Summary:")
results_df

# Optional: Save the results table to a CSV file
#results_df.to_csv('test_results_report.csv', index=False)


--- Normality Test for Total homocysteine ---
Group 1 (Total homocysteine) - Test: Kolmogorov-Smirnov, Stat: 1.0000, p-value: 0.0000
Group 2 (Total homocysteine) - Test: Kolmogorov-Smirnov, Stat: 1.0000, p-value: 0.0000
Data is not normally distributed, using Mann-Whitney U Test.
Mann-Whitney U Test result for Total homocysteine:
T-statistic/U-statistic: 25377.0000, p-value: 0.0000
Significant difference between Group 1 and Group 2 for Total homocysteine (reject H0)
------------------------------------
--- Normality Test for Methylmalonic acid ---
Group 1 (Methylmalonic acid) - Test: Kolmogorov-Smirnov, Stat: 0.5434, p-value: 0.0000
Group 2 (Methylmalonic acid) - Test: Kolmogorov-Smirnov, Stat: 0.5409, p-value: 0.0000
Data is not normally distributed, using Mann-Whitney U Test.
Mann-Whitney U Test result for Methylmalonic acid:
T-statistic/U-statistic: 23678.0000, p-value: 0.0011
Significant difference between Group 1 and Group 2 for Methylmalonic acid (reject H0)
---------------------

Unnamed: 0,Variable,Normality Test Group 1,Normality p-value Group 1,Normality Test Group 2,Normality p-value Group 2,Levene Statistic,Levene p-value,Test Type,Test Statistic,p-value,Significance
0,Total homocysteine,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,25377.0,0.0,Significant
1,Methylmalonic acid,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,23678.0,0.0011,Significant
2,Total cysteine,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,16626.0,0.0045,Significant
3,Methionine,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,30097.0,0.0,Significant
4,Serine,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,21180.0,0.268,Not Significant
5,Glycine,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,16072.0,0.0009,Significant
6,Pyridoxal 5'-phosphate,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,22065.5,0.0607,Not Significant
7,Pyridoxal,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,24160.5,0.0002,Significant
8,4-Pyridoxic acid,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,20672.5,0.5044,Not Significant
9,Pyridoxine,Kolmogorov-Smirnov,0.0,Kolmogorov-Smirnov,0.0,1.1165,0.2913,Mann-Whitney U Test,19902.0,1.0,Not Significant
