<div style="background-color:pink; padding:20px; border-radius:8px; border: 1px solid #dcdcdc; max-width: 900px; margin: auto; text-align: center;"> 
    <h2 style="font-family: Arial, sans-serif; color: #2c3e50; font-size: 28px; margin-bottom: 10px;"> Statistical Tests </h2> 
    <p style="font-family: Arial, sans-serif; color: #555; font-size: 16px; line-height: 1.5;"> To better understand the relationships between features and heart disease, I conducted several statistical tests: </p>
    <ul style="text-align: left; font-family: Arial, sans-serif; color: #555; font-size: 16px; line-height: 1.5; padding-left: 20px;"> 
        <li><strong>Chi-Square Test for Categorical Variables:</strong> This test was used to determine if there is a significant association between categorical variables and heart disease. It helped identify features that have a statistical dependency with heart disease.</li> <li>
            <strong>Independent t-test:</strong> For numerical variables, I used the independent t-test to compare the mean differences between heart disease groups (presence vs. absence). This test helped reveal whether specific numeric features show significant differences between the two groups.</li> 
        <li><strong>ANOVA Test:</strong> One-way ANOVA was applied for multiple-group comparisons, testing whether the means of numerical features differ significantly across the levels of multi-category features such as chest pain type, slope of ST, and thallium.
        </li> <li><strong>Correlation with Target Variable:</strong> The Point-Biserial Correlation Coefficient was calculated to assess the strength and direction of association between continuous variables and the binary heart disease target variable.</li> 
    </ul> 
</div>

In [74]:
#Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import chi2_contingency
import scipy.stats as stats
from scipy.stats import pointbiserialr

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [22]:
df = pd.read_csv('../Data/Heart_Disease_Prediction.csv')
df.head()

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
1,80,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
2,55,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
3,65,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
4,45,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence


In [23]:
print(f"Dataset shape: {df.shape}")

Dataset shape: (270, 14)


In [24]:
#Creating a copy of original dataframe
df_copy = df.copy()

In [25]:
# Encode the target as 1 (Presence) / 0 (Absence)
df['Heart Disease'] = df['Heart Disease'].map({'Presence': 1, 'Absence': 0})

### Statistical Tests:

### 1. Chi-Square Test for Categorical Variables
* The Chi-square test of independence checks whether two categorical variables are associated by comparing the observed frequencies in a contingency table with the frequencies expected if the variables were independent.
* Null Hypothesis (H0): The variables are independent (no association).
* Alternative Hypothesis (H1): The variables are dependent (there’s an association).

+ A small p-value (typically ≤ 0.05) is evidence against independence: we reject H0 and conclude that the variable is associated with 'Heart Disease'.
+ A large p-value (> 0.05) means we fail to reject H0: the data show no significant association, although this does not prove that the variable and 'Heart Disease' are independent.
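
To make the mechanics concrete, here is a minimal sketch of how the statistic is computed from a contingency table. The 2×2 counts are hypothetical (invented for illustration, not taken from the dataset), and the helper name `chi_square_statistic` is my own. Note that `chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, so its statistic for such a table would be somewhat smaller than the uncorrected value computed here.

```python
import math

def chi_square_statistic(table):
    """Return (chi2, dof, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

# Hypothetical counts: rows = feature level, columns = (Absence, Presence)
observed = [[30, 10],
            [20, 40]]
chi2, dof, expected = chi_square_statistic(observed)

# With 1 degree of freedom the upper-tail p-value has a closed form
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"Chi2: {chi2:.4f}, dof: {dof}, p-value: {p_value:.2e}")
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.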

In [70]:
cat_components = ['Sex', 'Chest pain type', 'FBS over 120', 'Exercise angina', 'Thallium',
                 'Number of vessels fluro', 'EKG results']

results = {}
for col in cat_components:
    contingency_table = pd.crosstab(df[col], df['Heart Disease'])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    results[col] = {'Chi2': chi2, 'p-value': p, 'Degrees of Freedom': dof}

for variable, result in results.items():
    print(f"Variable: {variable}")
    print(f"Chi2: {result['Chi2']}, p-value: {result['p-value']}, Degrees of Freedom: {result['Degrees of Freedom']}")
    print("----" * 10)

Variable: Sex
Chi2: 22.66725551158847, p-value: 1.926225633356082e-06, Degrees of Freedom: 1
----------------------------------------
Variable: Chest pain type
Chi2: 68.58820650574037, p-value: 8.560988097108327e-15, Degrees of Freedom: 3
----------------------------------------
Variable: FBS over 120
Chi2: 0.009171195652173895, p-value: 0.9237061355849946, Degrees of Freedom: 1
----------------------------------------
Variable: Exercise angina
Chi2: 45.691872555714184, p-value: 1.3839580611547017e-11, Degrees of Freedom: 1
----------------------------------------
Variable: Thallium
Chi2: 74.56934644303065, p-value: 6.419070718857945e-17, Degrees of Freedom: 2
----------------------------------------
Variable: Number of vessels fluro
Chi2: 62.863091899026564, p-value: 1.4366195484671344e-13, Degrees of Freedom: 3
----------------------------------------
Variable: EKG results
Chi2: 8.979451997548342, p-value: 0.011223718700990279, Degrees of Freedom: 2
----------------------------------------

#### Interpretation:
1. Sex:
   * Chi2: 22.67, p-value: 1.93 × 10^-6
   * Since the p-value is much less than 0.05, we reject the null hypothesis and conclude that 
     there is a statistically significant association between 'Sex' and 'Heart Disease'.

2. Chest pain type:
   * Chi2: 68.59, p-value: 8.56 × 10^-15
   * The very small p-value provides strong evidence of an association between 'Chest pain type' and 'Heart Disease'.

3. FBS over 120:
   * Chi2: 0.009, p-value: 0.924
   * Since the p-value is greater than 0.05, we do not have enough evidence to reject the null hypothesis, suggesting no significant association between 'FBS over 120' and 'Heart Disease'.

4. Exercise angina:
   * Chi2: 45.69, p-value: 1.38 × 10^-11
   * The p-value is very small, indicating a significant association between 'Exercise angina' and 'Heart Disease'.

5. Thallium:
   * Chi2: 74.57, p-value: 6.42 × 10^-17
   * With a p-value far below 0.05, there is a significant association between 'Thallium' and 'Heart Disease'.

6. Number of vessels fluro:
   * Chi2: 62.86, p-value: 1.44 × 10^-13
   * The small p-value suggests a significant association between 'Number of vessels fluro' and 'Heart Disease'.

7. EKG results:
   * Chi2: 8.98, p-value: 0.0112
   * The p-value is below 0.05, so the association between 'EKG results' and 'Heart Disease' is statistically significant, although the evidence is weaker than for the other significant variables.

#### Summary:
+ Sex, Chest pain type, Exercise angina, Thallium, Number of vessels fluro, and EKG results all show a statistically significant association with 'Heart Disease', while FBS over 120 does not.
+ These insights could help in feature selection for further predictive modeling.
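
A p-value says whether an association is detectable, not how strong it is. As a rough complement, the sketch below computes Cramér's V (an effect size between 0 and 1) from the Chi2 values printed above. It assumes all 270 rows entered each contingency table (no missing values). It also infers the number of levels per feature from the reported degrees of freedom. The 2×2 tables are left out because their reported Chi2 includes SciPy's default continuity correction. The helper `cramers_v` is my own, not part of the original analysis.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size from a chi-square statistic."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

n = 270  # number of rows in the dataset (assumes no missing values)

# (Chi2, number of feature levels) taken from the printed output above;
# 'Heart Disease' always contributes 2 levels
reported = {
    'Chest pain type': (68.58820650574037, 4),
    'Thallium': (74.56934644303065, 3),
    'Number of vessels fluro': (62.863091899026564, 4),
    'EKG results': (8.979451997548342, 3),
}

effect_sizes = {
    name: cramers_v(chi2, n, levels, 2) for name, (chi2, levels) in reported.items()
}
for name, v in effect_sizes.items():
    print(f"{name}: V = {v:.3f}")
```

Ranking features by V rather than by p-value gives a more direct view of which associations are largest.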

### 2. Hypothesis Testing for Mean Differences

### 2.1: Independent t-test
##### Used for comparing two groups (patients with vs. without heart disease).
* An independent t-test (also known as a two-sample t-test or unpaired t-test) is a statistical test used to determine if there is a significant difference between the means of two independent groups.
##### Hypotheses for the Independent t-test:
 + Null Hypothesis (H0): The means of the two groups are equal (no difference).
 + Alternative Hypothesis (H1): The means of the two groups are not equal (there is a difference).
   
* Interpretation: If p < 0.05, you conclude that there is a significant difference between the group means.
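
As an illustration of what the test computes, here is a minimal standard-library sketch of the equal-variance (Student's) t-statistic. The two samples are hypothetical and the function name `pooled_t_statistic` is my own; it is meant to mirror the formula, not to replace `stats.ttest_ind`.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's t-statistic for two independent samples, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, n_a + n_b - 2

# Hypothetical measurements for two groups (not taken from the dataset)
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_stat, dof = pooled_t_statistic(group_a, group_b)
print(f"t-statistic: {t_stat:.4f}, degrees of freedom: {dof}")
```

The sign of the statistic follows the order of the arguments: a negative value means the first group has the lower mean.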

In [27]:
continuous_components = ['Age', 'BP', 'Cholesterol', 'Max HR', 'ST depression']

# Dictionary to store t-test results
t_test_results = {}

for column in continuous_components:
    # Separate data into two groups based on heart disease diagnosis
    group_with_disease = df[df['Heart Disease'] == 1][column]
    group_without_disease = df[df['Heart Disease'] == 0][column]

    # Perform t-test
    t_stat, p_val = stats.ttest_ind(group_with_disease, group_without_disease, equal_var=True)
    t_test_results[column] = {'t-statistic': t_stat, 'p-value': p_val}

# Converting results to DataFrame
t_test_results_df = pd.DataFrame(t_test_results).T
t_test_results_df

Unnamed: 0,t-statistic,p-value
Age,3.743457,0.0002221218
BP,2.574999,0.01056095
Cholesterol,1.945677,0.05273889
Max HR,-7.543813,7.119583e-13
ST depression,7.531875,7.677946e-13


#### Interpretation:
1. Significant: Age shows a statistically significant difference (p = 0.000222) between the heart disease and non-heart disease groups.
2. Significant: Blood Pressure (BP) also shows a significant difference (p = 0.010561), indicating a potential association.
3. Not significant at the 0.05 level: Cholesterol has a p-value of 0.052739, slightly above the threshold, so the evidence for a difference in means is weak at best.
4. Highly Significant: Maximum Heart Rate shows a highly significant difference (p ≈ 7.12 × 10^-13); the negative t-statistic means the heart disease group has the lower mean Max HR.
5. Highly Significant: ST Depression is also highly significant (p ≈ 7.68 × 10^-13), with a higher mean in the heart disease group.

#### The results suggest that Age, BP, Max HR, and ST Depression differ significantly between the two groups and may serve as valuable features for predictive modeling.
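
Statistical significance does not by itself say how large a difference is. One common way to express the size of a mean difference is Cohen's d, sketched below on hypothetical values (the numbers are invented for illustration and the helper `cohens_d` is my own).

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical values for illustration only (not taken from the dataset)
with_disease = [150, 142, 138, 160, 131]
without_disease = [165, 158, 172, 149, 161]

d = cohens_d(with_disease, without_disease)
print(f"Cohen's d: {d:.3f}")
```

Applied to the real columns, this would show how many pooled standard deviations separate the two group means for each feature.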

### 3. ANOVA Test:
* An ANOVA (Analysis of Variance) test is a statistical method used to determine whether there are any statistically significant differences between the means of three or more independent groups.
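
The F-statistic is the ratio of between-group to within-group variance. Here is a minimal standard-library sketch of that calculation on three small hypothetical groups; the helper `one_way_anova_f` is my own and only mirrors the formula behind `stats.f_oneway`.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical groups (not taken from the dataset)
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(samples)
print(f"F-statistic: {f_stat:.4f}, df: ({df_between}, {df_within})")
```

A large F means the group means are far apart relative to the scatter inside each group.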

In [39]:
# Candidate columns for the ANOVA runs below
category_cols = ['Chest pain type', 'Slope of ST', 'Thallium']
continuous_cols = ['Max HR', 'ST depression', 'BP', 'Age']

def run_anova(df, category_col, continuous_cols):
    """Run a one-way ANOVA of each continuous column across the levels of category_col."""
    unique_groups = sorted(df[category_col].unique())
    for continuous_col in continuous_cols:
        # Collect the values of the continuous column for each group
        groups = [df[df[category_col] == group][continuous_col] for group in unique_groups]
        anova_results = stats.f_oneway(*groups)

        # Print results
        print(f"ANOVA results for {category_col} and {continuous_col}")
        print("F-statistic:", anova_results.statistic)
        print("p-value:", anova_results.pvalue)
        print("----" * 10)

In [40]:
run_anova(df, 'Chest pain type', ['Max HR', 'ST depression', 'BP'])

ANOVA results for Chest pain type and Max HR
F-statistic: 13.273984195928696
p-value: 4.219911049988753e-08
----------------------------------------
ANOVA results for Chest pain type and ST depression
F-statistic: 10.509044373972815
p-value: 1.4841747556751578e-06
----------------------------------------
ANOVA results for Chest pain type and BP
F-statistic: 2.850923191691003
p-value: 0.03784582347269128
----------------------------------------


In [41]:
run_anova(df, 'Slope of ST', ['Max HR', 'ST depression', 'BP'])

ANOVA results for Slope of ST and Max HR
F-statistic: 32.786375424708915
p-value: 1.8509041289991412e-13
----------------------------------------
ANOVA results for Slope of ST and ST depression
F-statistic: 79.34776069809666
p-value: 9.00587803820291e-28
----------------------------------------
ANOVA results for Slope of ST and BP
F-statistic: 4.5144862797367775
p-value: 0.011797910465727686
----------------------------------------


In [42]:
run_anova(df, 'Thallium', ['Max HR', 'ST depression', 'BP'])

ANOVA results for Thallium and Max HR
F-statistic: 10.515395830025858
p-value: 4.019368532378967e-05
----------------------------------------
ANOVA results for Thallium and ST depression
F-statistic: 15.811504851287816
p-value: 3.2381509000998903e-07
----------------------------------------
ANOVA results for Thallium and BP
F-statistic: 2.5304623140945735
p-value: 0.08153043027708284
----------------------------------------


#### Overall Insights:
* The ANOVA results show that Max HR and ST depression differ significantly across the levels of all three categorical variables (Chest Pain Type, Slope of ST, Thallium). BP differs significantly across Chest Pain Type (p ≈ 0.038) and Slope of ST (p ≈ 0.012), but not across Thallium (p ≈ 0.082).

* Particularly strong effects were found for the slope of the ST segment on both Max HR (F ≈ 32.8) and ST depression (F ≈ 79.3), highlighting their potential importance in cardiovascular assessment.

* Significant differences in Max HR and ST depression across chest pain types suggest that chest pain characteristics carry information about cardiac function, which may be useful when predicting heart disease.

### 4. Correlation with Target Variable: Point-Biserial Correlation Coefficient:
* A special case of the Pearson correlation that measures the relationship between one binary variable (dichotomous) and one continuous variable.
* Range: -1 to 1, interpreted like Pearson's correlation.
* Assumption: The continuous variable should be normally distributed.
* Here, the binary variable is 'Heart Disease' and the continuous variables are Age, BP, Cholesterol, Max HR, and ST depression.
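
Because the point-biserial coefficient is simply Pearson's r with one variable coded 0/1, it can be computed by hand. Here is a minimal standard-library sketch on hypothetical data (the function `point_biserial` is my own and only mirrors the formula behind `pointbiserialr`).

```python
import math
from statistics import mean

def point_biserial(binary, values):
    """Pearson correlation between a 0/1 variable and a continuous variable."""
    mean_b, mean_v = mean(binary), mean(values)
    cov = sum((b - mean_b) * (v - mean_v) for b, v in zip(binary, values))
    ss_b = sum((b - mean_b) ** 2 for b in binary)
    ss_v = sum((v - mean_v) ** 2 for v in values)
    return cov / math.sqrt(ss_b * ss_v)

# Hypothetical data (not taken from the dataset)
target = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
feature = [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]

r = point_biserial(target, feature)
# Swapping the 0/1 coding flips the sign but not the magnitude
r_flipped = point_biserial([1 - b for b in target], feature)
print(f"r = {r:.4f}, r with flipped coding = {r_flipped:.4f}")
```

A positive r therefore means the feature tends to be higher in the group coded 1.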

In [52]:
target_variable = 'Heart Disease'

# Dictionary to store results
correlation_results = {}

# Calculating Point Biserial Correlation for each continuous variable
for variable in continuous_components:
    correlation, p_value = pointbiserialr(df[target_variable], df[variable])
    correlation_results[variable] = {'correlation': correlation, 'p_value': p_value}

# Converting results to a DataFrame
results_df = pd.DataFrame(correlation_results).T
results_df.columns = ['Point Biserial Correlation', 'p-value']
print(results_df)

               Point Biserial Correlation       p-value
Age                              0.222914  2.221218e-04
BP                               0.155383  1.056095e-02
Cholesterol                      0.118021  5.273889e-02
Max HR                          -0.418514  7.119583e-13
ST depression                    0.417967  7.677946e-13


#### Interpretation of Results:
* Max HR (r ≈ -0.42) and ST Depression (r ≈ 0.42) show moderate, highly significant correlations with heart disease; Age (r ≈ 0.22) shows a weaker but still significant correlation.
* BP (r ≈ 0.16) is weakly but significantly correlated, while Cholesterol (r ≈ 0.12, p ≈ 0.053) falls just short of the 0.05 significance threshold, so it may be a less reliable predictor.
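
For a binary grouping variable, the significance test of the point-biserial correlation is equivalent to the equal-variance two-sample t-test, which is why the p-values above match those in the t-test table. The sketch below checks this by converting the printed correlations back into t-statistics. It assumes n = 270 with no missing values, and the helper `t_from_r` is my own.

```python
import math

def t_from_r(r, n):
    """t-statistic implied by a correlation r computed from n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

n = 270  # rows in the dataset (assumes no missing values)

# Correlations copied from the printed output above
reported_r = {
    'Age': 0.222914,
    'BP': 0.155383,
    'Cholesterol': 0.118021,
    'Max HR': -0.418514,
    'ST depression': 0.417967,
}

implied_t = {name: t_from_r(r, n) for name, r in reported_r.items()}
for name, t in implied_t.items():
    print(f"{name}: implied t = {t:.4f}")
```

Up to rounding of the printed correlations, the implied values should reproduce the t-statistics from the independent t-test. The two sections are therefore two views of the same evidence, not independent confirmation.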

#### End of Statistical Tests