Sabrina was assigned the task of collecting random samples from male Healthy Life members who were recently hospitalized in the Greater Toronto Area (GTA) due to heart attacks. She compiled two large separate datasets: one consisting of male Toronto residents who smoked and another consisting of male Toronto residents who never smoked. The aim of this research is to provide evidence that smoking is a risk factor for early occurrence of the first heart attack. After conducting a thorough analysis of the data, Sabrina formulated a hypothesis suggesting that, on average, male smokers who experience their first heart attack are six years younger than male non-smokers who experience their first heart attack. Please calculate sample variances for both samples and decide (depending on the sample variances) what function is most appropriate: Data Analysis t-Test: Two-Sample Assuming Equal Variances or Data Analysis t-Test: Two-Sample Assuming Unequal Variances. Please help Sabrina to conduct the test at 5% significance level. As it is known that both samples come from normally distributed populations, no histograms are required.

In [43]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

In [24]:
df = pd.read_excel('Sabrina Data.xlsx', skiprows=3)
df.head()

   Unnamed: 0  Unnamed: 1  Nonsmoking Patients  Smoking Patients  Unnamed: 4  Unnamed: 5
0         NaN         1.0                 49.1              41.9         NaN         NaN
1         NaN         2.0                 88.0              74.0         NaN         NaN
2         NaN         3.0                 83.5              75.9         NaN         NaN
3         NaN         4.0                 54.7              49.2         NaN         NaN
4         NaN         5.0                 67.1              61.5         NaN         NaN


In [25]:
df.drop(columns = ['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 4', 'Unnamed: 5'], inplace=True)

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73 entries, 0 to 72
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Nonsmoking Patients  65 non-null     float64
 1   Smoking Patients     62 non-null     float64
dtypes: float64(2)
memory usage: 1.3 KB


In [28]:
df.isnull().sum()

Nonsmoking Patients     8
Smoking Patients       11
dtype: int64

In [16]:
np.round(df.isnull().sum()*100/len(df), 2)

Nonsmoking Patients    10.96
Smoking Patients       15.07
dtype: float64

In [29]:
df.dropna(how='all', inplace=True)
df.reset_index(drop=True, inplace=True)
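
The difference between `how='all'` and `how='any'` matters here, because the two columns have different numbers of observations. A minimal sketch on a made-up frame (`toy` and its values are hypothetical, not Sabrina's data) shows that `how='all'` removes only rows that are empty in every column, while `how='any'` would also discard rows where just one column is blank:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is blank in one column, row 2 is blank in both
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 3.0, np.nan]})

all_dropped = toy.dropna(how="all")  # removes only the fully empty row
any_dropped = toy.dropna(how="any")  # removes every row with at least one NaN

print(len(toy), len(all_dropped), len(any_dropped))
```

With `how='any'`, valid observations from the longer column would be lost along with the blanks.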

In [33]:
df

    Nonsmoking Patients  Smoking Patients
0                  49.1         41.900000
1                  88.0         74.000000
2                  83.5         75.900000
3                  54.7         49.200000
4                  67.1         61.500000
..                  ...               ...
62                 77.5               NaN
63                 78.4               NaN
64                 71.4               NaN
65                  NaN         87.119279


In [47]:
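# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
df['Nonsmoking Patients'] = df['Nonsmoking Patients'].fillna(df['Nonsmoking Patients'].mean())
df['Smoking Patients'] = df['Smoking Patients'].fillna(df['Smoking Patients'].mean())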

In [50]:
df.isnull().sum()

Nonsmoking Patients    0
Smoking Patients       0
dtype: int64
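
Mean imputation makes both columns the same length. If the blanks exist only because the two samples have different sizes (an assumption about the spreadsheet layout, not something the notebook confirms), an alternative is to keep each column's observed values and nothing else, since `ttest_ind` accepts samples of different lengths. The sketch below uses made-up ages (`padded` is a hypothetical series) to show what imputation does to the summary statistics: the mean is unchanged, but the sample variance shrinks and the sample size grows.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, padded with NaN the way a shorter spreadsheet column would be
padded = pd.Series([49.1, 88.0, 83.5, 54.7, 67.1, np.nan, np.nan])

observed = padded.dropna()              # keep only the real observations
imputed = padded.fillna(padded.mean())  # mean imputation, as in the cell above

print("n:       ", len(observed), len(imputed))
print("mean:    ", observed.mean(), imputed.mean())
print("variance:", observed.var(ddof=1), imputed.var(ddof=1))
```

Because the filled values sit exactly at the mean, they add nothing to the sum of squared deviations while increasing the denominator, which is why the imputed variance is smaller.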

In [51]:
# Calculate the sample variance of each group (ddof=1 divides by n - 1)
non_smokers = np.var(df['Nonsmoking Patients'], ddof=1)
smokers = np.var(df['Smoking Patients'], ddof=1)

print(f"Sample variance of non-smokers: {non_smokers}")
print(f"Sample variance of smokers: {smokers}")

Sample variance of non-smokers: 84.47930069930072
Sample variance of smokers: 113.67731245501977


In [52]:
ratio = smokers/non_smokers
ratio

1.3456232652735576

The larger sample variance is about 1.35 times the smaller, well below the common 4:1 rule-of-thumb threshold, so the variances are not drastically different and the t-test can be run assuming equal variances.
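
The rule of thumb can be written as a small check. This is only a sketch: `pooling_is_reasonable` and its `max_ratio` parameter are hypothetical names, and the 4:1 cutoff is a convention rather than a formal test. The two variances are the ones printed above.

```python
def pooling_is_reasonable(var_a, var_b, max_ratio=4.0):
    """Return True when the larger variance is under max_ratio times the smaller."""
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    return larger / smaller < max_ratio

# Sample variances reported in the notebook output
var_nonsmokers = 84.47930069930072
var_smokers = 113.67731245501977

equal_var = pooling_is_reasonable(var_nonsmokers, var_smokers)
print(f"Assume equal variances: {equal_var}")
```

The resulting flag can be passed straight to `ttest_ind(..., equal_var=equal_var)`; when it is `False`, SciPy runs Welch's unequal-variance test instead.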

In [53]:
# Independent Two-Sample t-test assuming equal variances
t_stat, p_value = ttest_ind(df['Nonsmoking Patients'], df['Smoking Patients'], equal_var=True)
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

t-statistic: 2.554218343218931
p-value: 0.011778213396749254
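
To make the calculation behind `ttest_ind(..., equal_var=True)` explicit, the sketch below computes the pooled-variance t-statistic by hand and compares it with SciPy. It uses only the first five rows shown by `df.head()`, purely for illustration, so its numbers are not the notebook's results; `pooled_t` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_ind

def pooled_t(a, b):
    """Two-sided pooled-variance t-test computed from the textbook formula."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_value = (a.mean() - b.mean()) / se
    p = 2 * t_dist.sf(abs(t_value), dof)
    return t_value, p

# First five rows of each column, as displayed earlier
nonsmoking_head = [49.1, 88.0, 83.5, 54.7, 67.1]
smoking_head = [41.9, 74.0, 75.9, 49.2, 61.5]

t_manual, p_manual = pooled_t(nonsmoking_head, smoking_head)
t_scipy, p_scipy = ttest_ind(nonsmoking_head, smoking_head, equal_var=True)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```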


Hypothesis Testing:
Null Hypothesis (H0): The mean age at first heart attack is the same for male smokers and male nonsmokers.
Alternative Hypothesis (H1): The mean age at first heart attack differs between male smokers and male nonsmokers.

In [55]:
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in means between groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in means between groups.")

Reject the null hypothesis: There is a significant difference in means between groups.


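Since the p-value (0.0118) is less than the significance level (0.05), we reject the null hypothesis.
Hence, at the 5% level there is sufficient evidence that the mean age at first heart attack differs between male smokers and male nonsmokers. The positive t-statistic indicates that the nonsmokers' sample mean is the higher of the two. This test uses a hypothesized difference of zero, so on its own it does not confirm the six-year gap in Sabrina's hypothesis.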
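
The problem statement hypothesizes a six-year gap, which corresponds to the "Hypothesized Mean Difference" field in Excel's t-Test tool. `ttest_ind` has no such argument, but one way to get the same test is to subtract the hypothesized difference from one sample first. The sketch below does this on synthetic ages generated for illustration only (the group means, spreads, and sizes are assumptions, not Sabrina's data), so its output says nothing about the real samples.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic ages for illustration only (not Sabrina's data)
nonsmokers = rng.normal(loc=70, scale=10, size=65)
smokers = rng.normal(loc=64, scale=10, size=61)

hypothesized_diff = 6.0  # H0: mean(nonsmokers) - mean(smokers) = 6

# Subtracting a constant shifts the mean but leaves the variance unchanged,
# so this tests whether the observed gap differs from six years
t_stat, p_value = ttest_ind(nonsmokers - hypothesized_diff, smokers, equal_var=True)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```

Under this setup, a small p-value would count as evidence against a six-year gap, whereas a large one would mean the data are consistent with it.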