Q1) For the three combined Marathon datasets, extract the 25 records with the lowest finish time (Official Time column). Test the null hypothesis H0: the average finishing time of the top 25 athletes is 130 minutes. Use a two-tailed test at alpha = 1%, 5%, and 10%.

In [1]:
# Marathon Performance Analysis: Hypothesis Testing

# Import necessary libraries for data analysis and statistical testing
import pandas as pd  # For loading and processing datasets
import numpy as np  # For numerical operations
from scipy import stats  # For hypothesis testing

# Objective:
# Extract the top 25 athletes with the lowest official finish times from the combined Marathon datasets.
# Perform a two-tailed hypothesis test to evaluate whether their average finishing time is significantly different from 130 minutes.

# Null Hypothesis (H0): The average finishing time of the top 25 athletes is 130 minutes.
# Alternative Hypothesis (H1): The average finishing time of the top 25 athletes is not 130 minutes.

# Significance levels (alpha): 1%, 5%, and 10%.

In [2]:
# Load marathon datasets from 2015, 2016, and 2017
df_2015 = pd.read_csv("C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/marathon_results_2015.csv")
df_2016 = pd.read_csv("C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/marathon_results_2016.csv")
df_2017 = pd.read_csv("C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/marathon_results_2017.csv")

# Display the first few rows of each dataset to verify successful loading
print(df_2015.head())
print(df_2016.head())
print(df_2017.head())

   Unnamed: 0 Bib                   Name  Age M/F         City State Country  \
0           0   3         Desisa, Lelisa   25   M         Ambo   NaN     ETH   
1           1   4  Tsegay, Yemane Adhane   30   M  Addis Ababa   NaN     ETH   
2           2   8         Chebet, Wilson   29   M     Marakwet   NaN     KEN   
3           3  11       Kipyego, Bernard   28   M      Eldoret   NaN     KEN   
4           4  10          Korir, Wesley   32   M       Kitale   NaN     KEN   

  Citizen Unnamed: 9  ...      25K      30K      35K      40K     Pace  \
0     NaN        NaN  ...  1:16:07  1:32:00  1:47:59  2:02:39  0:04:56   
1     NaN        NaN  ...  1:16:07  1:31:59  1:47:59  2:02:42  0:04:58   
2     NaN        NaN  ...  1:16:07  1:32:00  1:47:59  2:03:01  0:04:59   
3     NaN        NaN  ...  1:16:07  1:32:00  1:48:03  2:03:47  0:05:00   
4     NaN        NaN  ...  1:16:07  1:32:00  1:47:59  2:03:27  0:05:00   

  Proj Time Official Time Overall Gender Division  
0         -       2:09

In [3]:
# Combine the 2015, 2016, and 2017 marathon datasets into a single DataFrame
df = pd.concat([df_2015, df_2016, df_2017], ignore_index=True)  # Reset index for consistency

# Display the shape of the combined dataset to verify merging
print("Combined dataset shape:", df.shape)

# Show the first few rows to check the structure
print(df.head())

Combined dataset shape: (79638, 26)
   Unnamed: 0 Bib                   Name  Age M/F         City State Country  \
0         0.0   3         Desisa, Lelisa   25   M         Ambo   NaN     ETH   
1         1.0   4  Tsegay, Yemane Adhane   30   M  Addis Ababa   NaN     ETH   
2         2.0   8         Chebet, Wilson   29   M     Marakwet   NaN     KEN   
3         3.0  11       Kipyego, Bernard   28   M      Eldoret   NaN     KEN   
4         4.0  10          Korir, Wesley   32   M       Kitale   NaN     KEN   

  Citizen Unnamed: 9  ...      30K      35K      40K     Pace Proj Time  \
0     NaN        NaN  ...  1:32:00  1:47:59  2:02:39  0:04:56         -   
1     NaN        NaN  ...  1:31:59  1:47:59  2:02:42  0:04:58         -   
2     NaN        NaN  ...  1:32:00  1:47:59  2:03:01  0:04:59         -   
3     NaN        NaN  ...  1:32:00  1:48:03  2:03:47  0:05:00         -   
4     NaN        NaN  ...  1:32:00  1:47:59  2:03:27  0:05:00         -   

  Official Time Overall Gender D

In [4]:
# Display dataset information, including column names, data types, and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79638 entries, 0 to 79637
Data columns (total 26 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     53008 non-null  float64
 1   Bib            79638 non-null  object 
 2   Name           79638 non-null  object 
 3   Age            79638 non-null  int64  
 4   M/F            79638 non-null  object 
 5   City           79637 non-null  object 
 6   State          70645 non-null  object 
 7   Country        79638 non-null  object 
 8   Citizen        3440 non-null   object 
 9   Unnamed: 9     158 non-null    object 
 10  5K             79638 non-null  object 
 11  10K            79638 non-null  object 
 12  15K            79638 non-null  object 
 13  20K            79638 non-null  object 
 14  Half           79638 non-null  object 
 15  25K            79638 non-null  object 
 16  30K            79638 non-null  object 
 17  35K            79638 non-null  object 
 18  40K   

In [5]:
# Step 1: Split 'Official Time' into separate columns for hours, minutes, and seconds
df[['of_Hours', 'of_Minutes', 'of_Seconds']] = df['Official Time'].str.split(':', expand=True)

# Step 2: Convert extracted time components to integers for calculations
df[['of_Hours', 'of_Minutes', 'of_Seconds']] = df[['of_Hours', 'of_Minutes', 'of_Seconds']].astype(int)

# Step 3: Display a preview of the modified DataFrame to verify the changes
df.head()  # Shows the first few rows of the updated dataset

Unnamed: 0.1,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,...,Pace,Proj Time,Official Time,Overall,Gender,Division,Unnamed: 8,of_Hours,of_Minutes,of_Seconds
0,0.0,3,"Desisa, Lelisa",25,M,Ambo,,ETH,,,...,0:04:56,-,2:09:17,1,1,1,,2,9,17
1,1.0,4,"Tsegay, Yemane Adhane",30,M,Addis Ababa,,ETH,,,...,0:04:58,-,2:09:48,2,2,2,,2,9,48
2,2.0,8,"Chebet, Wilson",29,M,Marakwet,,KEN,,,...,0:04:59,-,2:10:22,3,3,3,,2,10,22
3,3.0,11,"Kipyego, Bernard",28,M,Eldoret,,KEN,,,...,0:05:00,-,2:10:47,4,4,4,,2,10,47
4,4.0,10,"Korir, Wesley",32,M,Kitale,,KEN,,,...,0:05:00,-,2:10:49,5,5,5,,2,10,49


In [6]:
# Display dataset information, including column names, data types, and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79638 entries, 0 to 79637
Data columns (total 29 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     53008 non-null  float64
 1   Bib            79638 non-null  object 
 2   Name           79638 non-null  object 
 3   Age            79638 non-null  int64  
 4   M/F            79638 non-null  object 
 5   City           79637 non-null  object 
 6   State          70645 non-null  object 
 7   Country        79638 non-null  object 
 8   Citizen        3440 non-null   object 
 9   Unnamed: 9     158 non-null    object 
 10  5K             79638 non-null  object 
 11  10K            79638 non-null  object 
 12  15K            79638 non-null  object 
 13  20K            79638 non-null  object 
 14  Half           79638 non-null  object 
 15  25K            79638 non-null  object 
 16  30K            79638 non-null  object 
 17  35K            79638 non-null  object 
 18  40K   

In [7]:
# Step 1: Convert 'Official Time' to total seconds for easier analysis
df["finish_time"] = df["of_Hours"] * 3600 + df["of_Minutes"] * 60 + df["of_Seconds"]

# Step 2: Display a preview of the converted finish times
print(df[['finish_time']].head())  # Shows the first few rows for verification

   finish_time
0         7757
1         7788
2         7822
3         7847
4         7849
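An alternative to splitting the string by hand is to let pandas parse the durations. The sketch below assumes the times are `H:MM:SS` strings like those in the preview above and uses `pd.to_timedelta`. Passing `errors="coerce"` is an assumption about how malformed entries (such as `-`) should be treated: they become missing values instead of raising an error.

```python
import pandas as pd

# Three Official Time values taken from the preview above, plus one malformed entry
official_time = pd.Series(["2:09:17", "2:09:48", "2:10:22", "-"])

# Parse H:MM:SS strings into timedeltas; unparseable entries become NaT
parsed = pd.to_timedelta(official_time, errors="coerce")

# Convert to whole seconds, dropping entries that could not be parsed
finish_seconds = parsed.dropna().dt.total_seconds().astype(int)
print(finish_seconds.tolist())
```

This avoids the three helper columns and does not fail on rows whose time is missing or malformed.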


In [8]:
# Sort the dataset by 'finish_time' in ascending order to get the fastest athletes
df = df.sort_values(by=['finish_time'])

# Display the top 5 rows to verify the sorting
print(df.head())

       Unnamed: 0 Bib                   Name  Age M/F         City State  \
0             0.0   3         Desisa, Lelisa   25   M         Ambo   NaN   
53228         0.0  11        Kirui, Geoffrey   24   M     Keringet   NaN   
1             1.0   4  Tsegay, Yemane Adhane   30   M  Addis Ababa   NaN   
53229         1.0  17            Rupp, Galen   30   M     Portland    OR   
2             2.0   8         Chebet, Wilson   29   M     Marakwet   NaN   

      Country Citizen Unnamed: 9  ... Proj Time Official Time Overall Gender  \
0         ETH     NaN        NaN  ...         -       2:09:17       1      1   
53228     KEN     NaN        NaN  ...         -       2:09:37       1      1   
1         ETH     NaN        NaN  ...         -       2:09:48       2      2   
53229     USA     NaN        NaN  ...         -       2:09:58       2      2   
2         KEN     NaN        NaN  ...         -       2:10:22       3      3   

      Division Unnamed: 8 of_Hours of_Minutes of_Seconds finis

In [9]:
# Select the top 25 athletes with the lowest finish times
df = df.iloc[:25]

# Display the extracted subset for verification
print(df[['finish_time']].head())  # Shows the first few rows of the filtered dataset

       finish_time
0             7757
53228         7777
1             7788
53229         7798
2             7822


In [10]:
# Display dataset information, including column names, data types, and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25 entries, 0 to 53236
Data columns (total 30 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     21 non-null     float64
 1   Bib            25 non-null     object 
 2   Name           25 non-null     object 
 3   Age            25 non-null     int64  
 4   M/F            25 non-null     object 
 5   City           25 non-null     object 
 6   State          9 non-null      object 
 7   Country        25 non-null     object 
 8   Citizen        0 non-null      object 
 9   Unnamed: 9     0 non-null      object 
 10  5K             25 non-null     object 
 11  10K            25 non-null     object 
 12  15K            25 non-null     object 
 13  20K            25 non-null     object 
 14  Half           25 non-null     object 
 15  25K            25 non-null     object 
 16  30K            25 non-null     object 
 17  35K            25 non-null     object 
 18  40K           

In [11]:
# Extract the 'finish_time' column for further analysis
finish_t = df['finish_time']

# Display the extracted finish times
print(finish_t.head())  # Shows the first few values for verification

0        7757
53228    7777
1        7788
53229    7798
2        7822
Name: finish_time, dtype: int32


In [12]:
# Define the hypothesized mean finish time in seconds.
# Note: Q1 states H0 as 130 minutes, which is 130 * 60 = 7800 seconds.
# The value 8500 seconds used here is about 141.7 minutes, so the results
# below test against 8500 seconds rather than the 130 minutes in the question.
finish_avg = 8500

# This value is used in a two-tailed one-sample t-test to assess whether the
# top 25 athletes have a mean finish time different from 8500 seconds.

In [13]:
# Perform a one-sample t-test to compare the sample mean finish time to the hypothesized population mean
t_statistic, p_value = stats.ttest_1samp(finish_t, finish_avg)

# Print the test results for interpretation
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Decision criteria:
# - If p_value ≤ 0.05: Reject H0 (significant evidence that finish time differs from 8500 seconds)
# - If p_value > 0.05: Fail to reject H0 (no significant evidence supporting a difference)
if p_value <= 0.05:
    print("Reject H0: The average finish time is significantly different from 8500 seconds.")
else:
    print("Fail to reject H0: No significant evidence that the average finish time differs from 8500 seconds.")

T-Statistic: -27.064749111936585
P-Value: 1.7005525082817027e-19
Reject H0: The average finish time is significantly different from 8500 seconds.


In [14]:
# Compute degrees of freedom for the one-sample t-test
# Degrees of freedom (df) is calculated as (sample size - 1)
degrees_of_freedom = len(finish_t) - 1

# Print the computed value for verification
print("Degrees of Freedom:", degrees_of_freedom)

Degrees of Freedom: 24


In [15]:
# Print the t-test results for hypothesis interpretation
print("T-statistic:", t_statistic)  # Indicates the test statistic for comparison
print("P-value:", p_value)  # Determines statistical significance
print("Degrees of freedom:", degrees_of_freedom)  # Critical in t-distribution evaluation

T-statistic: -27.064749111936585
P-value: 1.7005525082817027e-19
Degrees of freedom: 24


In [16]:
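# Interpretation of hypothesis test results at alpha = 2.5%
# (not one of the levels requested in Q1; the p-value from ttest_1samp is
# already two-sided, so alpha does not need to be halved before comparing)

alpha = 0.025  # Significance level for decision-making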

# Compare p-value with alpha to determine whether to reject H0
if p_value < alpha:
    print("The null hypothesis (mean finish time = 8500 seconds) is rejected.")
else:
    print("The null hypothesis (mean finish time = 8500 seconds) cannot be rejected.")

The null hypothesis (mean finish time = 8500 seconds) is rejected.


In [17]:
# Interpretation of hypothesis test results at alpha = 1%

alpha = 0.01  # Significance level for decision-making

# Compare p-value with alpha to determine whether to reject H0
if p_value < alpha:
    print("The null hypothesis (mean finish time = 8500 seconds) is rejected.")
else:
    print("The null hypothesis (mean finish time = 8500 seconds) cannot be rejected.")

The null hypothesis (mean finish time = 8500 seconds) is rejected.


In [18]:
# Interpretation of hypothesis test results at alpha = 5%

alpha = 0.05  # Significance level for decision-making

# Compare p-value with alpha to determine whether to reject H0
if p_value < alpha:
    print("The null hypothesis (mean finish time = 8500 seconds) is rejected.")
else:
    print("The null hypothesis (mean finish time = 8500 seconds) cannot be rejected.")

The null hypothesis (mean finish time = 8500 seconds) is rejected.


In [19]:
# Interpretation of hypothesis test results at alpha = 10%

alpha = 0.10  # Significance level for hypothesis evaluation

# Compare p-value with alpha to determine whether to reject H0
if p_value < alpha:
    print("The null hypothesis (mean finish time = 8500 seconds) is rejected.")
else:
    print("The null hypothesis (mean finish time = 8500 seconds) cannot be rejected.")

The null hypothesis (mean finish time = 8500 seconds) is rejected.
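The four decision cells above repeat the same comparison. A compact sketch of the same idea is given here, with H0 expressed as 130 minutes converted to seconds, as the question states. The sample is only the five fastest times printed earlier, not the full top 25, so the verdicts it prints are illustrative and do not replace the results above. The names `H0_SECONDS` and `decisions` are introduced here for illustration.

```python
from scipy import stats

# H0 from the question: mean finish time is 130 minutes, expressed in seconds
H0_SECONDS = 130 * 60

# Illustrative sample: only the five fastest times printed above, not all 25
sample = [7757, 7777, 7788, 7798, 7822]

# ttest_1samp returns a two-sided p-value by default
t_stat, p_val = stats.ttest_1samp(sample, H0_SECONDS)

# Evaluate the same p-value against each requested significance level
decisions = {}
for alpha in (0.01, 0.05, 0.10):
    decisions[alpha] = bool(p_val < alpha)
    verdict = "reject H0" if decisions[alpha] else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {verdict}")
```

Because the p-value is two-sided already, it is compared with each alpha directly.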


<!-- Q2) For the diabetes dataset, divide people into two groups: (A) age <= 40, (B) age > 40. Take 30 samples from each. Run a two-sample t-test to see whether their glucose levels differ significantly. -->

In [20]:
# Q2) For the diabetes dataset, divide people into two groups: (a) age <= 40, (b) age > 40.
# Take 30 samples each.
# Run a two sample t-test to see if their Glucose levels have a significant statistical difference.

In [21]:
# Load the diabetes dataset into a DataFrame
df = pd.read_csv("C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/diabetes.csv")

# Display basic information about the dataset to check structure and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [22]:
# Display the first few rows of the diabetes dataset to verify its structure
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [23]:
# Extract glucose levels for individuals aged 40 or younger
age_le_40 = df[df['Age'] <= 40]['Glucose']

# Extract glucose levels for individuals older than 40
age_gt_40 = df[df['Age'] > 40]['Glucose']

# Display summary statistics for verification
print("Glucose levels (Age ≤ 40):\n", age_le_40.describe())
print("Glucose levels (Age > 40):\n", age_gt_40.describe())

Glucose levels (Age ≤ 40):
 count    574.000000
mean     117.458188
std       30.516167
min        0.000000
25%       97.000000
50%      113.000000
75%      134.000000
max      199.000000
Name: Glucose, dtype: float64
Glucose levels (Age > 40):
 count    194.000000
mean     131.061856
std       34.039999
min        0.000000
25%      106.000000
50%      129.000000
75%      155.000000
max      197.000000
Name: Glucose, dtype: float64


In [24]:
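# Select 30 random observations from each group.
# Note: replace=True samples with replacement, so the same person can be drawn
# more than once, and no random_state is set, so the values change on every run.
age_le_40 = age_le_40.sample(n=30, replace=True)  # Group (Age ≤ 40)
age_gt_40 = age_gt_40.sample(n=30, replace=True)  # Group (Age > 40)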

# Display the first few samples from each group for verification
print("Sample from Age ≤ 40 group:\n", age_le_40.head())
print("Sample from Age > 40 group:\n", age_gt_40.head())

Sample from Age ≤ 40 group:
 478    126
353     90
364    147
697     99
41     133
Name: Glucose, dtype: int64
Sample from Age > 40 group:
 537     57
560    125
558    103
339    178
517    125
Name: Glucose, dtype: int64


In [25]:
# Perform an independent two-sample t-test to compare glucose levels
t_statistics, p_value = stats.ttest_ind(age_le_40, age_gt_40)

# Print the test results for interpretation
print("T-Statistic:", t_statistics)
print("P-Value:", p_value)

# Decision criteria:
# - If p_value ≤ 0.05: Reject H0 (significant difference in glucose levels between age groups)
# - If p_value > 0.05: Fail to reject H0 (no significant evidence of difference)
if p_value <= 0.05:
    print("Reject H0: There is a significant difference in glucose levels between the two age groups.")
else:
    print("Fail to reject H0: No significant evidence that glucose levels differ between the two age groups.")

T-Statistic: -2.1286449134485514
P-Value: 0.03754240822947192
Reject H0: There is a significant difference in glucose levels between the two age groups.


In [26]:
# Compute the sample sizes for each group
n_le40 = len(age_le_40)  # Sample size for age ≤ 40 group
n_gt40 = len(age_gt_40)  # Sample size for age > 40 group

# Calculate degrees of freedom for the two-sample t-test
degrees_of_freedom = n_le40 + n_gt40 - 2

# Display the computed value for verification
print("Degrees of Freedom:", degrees_of_freedom)

Degrees of Freedom: 58


In [27]:
# Interpretation of hypothesis test results at alpha = 2.5%

alpha = 0.025  # Significance level for decision-making

# Compare p-value with alpha to determine hypothesis rejection
if p_value < alpha:
    print("Reject H0: There is a significant difference in glucose levels between the two age groups.")
else:
    print("Fail to reject H0: No significant evidence that glucose levels differ between the two age groups.")

Fail to reject H0: No significant evidence that glucose levels differ between the two age groups.


In [28]:
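# Compute the critical value for the t-test.
# With alpha = 0.025, t.ppf(1 - alpha, df) is the upper 2.5% point of the
# t-distribution, i.e. the two-tailed critical value for a 5% test. The p-value
# comparisons in the neighboring cells use 2.5% directly, so the two criteria
# are not at the same significance level.
critical_value = stats.t.ppf(1 - alpha, degrees_of_freedom)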

# Display the computed critical value for verification
print("Critical Value:", critical_value)

Critical Value: 2.0017174841452356


In [29]:
# Print the results of the hypothesis test
print("T-statistic:", t_statistics)  # Standardized difference between the two group means
print("p-value:", p_value)  # Two-sided p-value

# Recompute the critical value so this cell can run on its own
critical_value = stats.t.ppf(1 - alpha, degrees_of_freedom)

print("Critical value:", critical_value)  # Compare against the absolute t-statistic

T-statistic: -2.1286449134485514
p-value: 0.03754240822947192
Critical value: 2.0017174841452356


In [30]:
# Compare the p-value with the significance level (alpha)
# Decision criteria:
# - If p_value < alpha, reject H0 (significant difference in glucose levels)
# - Otherwise, fail to reject H0 (no significant difference)

if p_value < alpha:
    print("Reject H0: There is a significant difference in glucose levels.")
else:
    print("Fail to reject H0: There is no significant difference in glucose levels.")

Fail to reject H0: There is no significant difference in glucose levels.
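The p-value rule and the critical-value rule give the same decision only when both use the same significance level. A minimal sketch of a consistent two-tailed setup follows. It uses synthetic values generated with NumPy (made up for illustration, not drawn from diabetes.csv), and the names `group_a`, `group_b`, and `critical` are introduced here.

```python
import numpy as np
from scipy import stats

# Synthetic glucose-like values (made up for illustration, not from diabetes.csv)
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
group_a = rng.normal(loc=117, scale=30, size=30)
group_b = rng.normal(loc=131, scale=34, size=30)

alpha = 0.05
t_stat, p_val = stats.ttest_ind(group_a, group_b)  # pooled-variance, two-sided
dof = len(group_a) + len(group_b) - 2

# Two-tailed critical value: alpha is split across both tails
critical = stats.t.ppf(1 - alpha / 2, dof)

# Both rules now refer to the same 5% level and therefore agree
reject_by_p = bool(p_val < alpha)
reject_by_crit = bool(abs(t_stat) > critical)
print(reject_by_p, reject_by_crit)
```

If the equal-variance assumption is in doubt, `stats.ttest_ind(..., equal_var=False)` runs Welch's test instead, in which case the degrees of freedom are no longer n1 + n2 - 2.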


Q3) Use the hypothermia dataset. Hypothermia is a medical condition that occurs when the body's core temperature drops below 95 F (35 C). It is a medical emergency in which the body loses heat faster than it can produce it, leading to dangerously low body temperatures. Patients are treated for this condition. The t.1 column represents the patient's body temperature at admission. The t.2 column represents the patient's body temperature after the initial treatment. Run a paired t-test to find out whether the treatment was effective.

In [31]:
# Step 1: Load the dataset
import pandas as pd
import numpy as np
from scipy import stats

data = pd.read_csv("C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/Hypothermia.csv")

# Step 2: Extract relevant columns for analysis
usage_data = data[["t.1", "t.2"]]

# Display the first few rows to verify the dataset
print(usage_data.head())

    t.1   t.2
0  34.0  35.1
1  29.0  33.0
2  31.0  36.8
3  32.0  34.0
4  32.2  36.6


In [32]:
# Import necessary modules for statistical analysis
from scipy.stats import t
from scipy.stats import ttest_rel

In [33]:
# Perform a paired t-test to assess temperature changes pre- and post-treatment
tstats, pval = ttest_rel(usage_data["t.1"], usage_data["t.2"])

# Print the test results for interpretation
print("Paired T-statistic:", tstats)
print("P-value:", pval)

Paired T-statistic: -22.284298214866393
P-value: 5.5720123452358365e-56


In [34]:
# Compute degrees of freedom for the paired t-test
# Degrees of freedom (df) is calculated as (sample size - 1)
degrees_of_freedom = len(usage_data["t.1"]) - 1

# Display the computed value for verification
print("Degrees of Freedom:", degrees_of_freedom)

Degrees of Freedom: 199


In [35]:
# Compute the critical value for the paired t-test
alpha = 0.025  # Upper-tail probability; as a two-tailed cutoff this corresponds to a 5% test
critical_value = t.ppf(1 - alpha, degrees_of_freedom)

# Display the computed critical value for verification
print("Critical Value:", critical_value)

Critical Value: 1.971956544249395


In [36]:
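# Display the paired t-test results in a clear format.
# The p-value prints as 0.0000 only because it is rounded to four decimals;
# the unrounded value is shown in the earlier cell.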
print(f"T-statistic: {tstats:.2f}, p-value: {pval:.4f}")
print(f"Critical value (alpha = {alpha}, two-tailed): ± {critical_value:.2f}")

T-statistic: -22.28, p-value: 0.0000
Critical value (alpha = 0.025, two-tailed): ± 1.97


In [37]:
# Compare the p-value with the significance level (alpha) to determine the hypothesis outcome
if pval < alpha:
    print("Reject null hypothesis: The treatment significantly affected body temperature.")
else:
    print("Fail to reject null hypothesis: No significant evidence that the treatment was effective.")

Reject null hypothesis: The treatment significantly affected body temperature.
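"Effective" here means the temperature rose after treatment, which is a directional claim. A one-sided paired test states that directly. The sketch below assumes SciPy 1.6 or later, where `ttest_rel` accepts the `alternative` argument, and uses only the five admission and post-treatment pairs printed above, so its numbers are illustrative rather than a replacement for the full-sample result.

```python
from scipy import stats

# First five admission (t.1) and post-treatment (t.2) temperatures shown above
before = [34.0, 29.0, 31.0, 32.0, 32.2]
after = [35.1, 33.0, 36.8, 34.0, 36.6]

# H1: mean(before - after) < 0, i.e. temperature is higher after treatment
t_stat, p_one_sided = stats.ttest_rel(before, after, alternative="less")

# Default two-sided test on the same pairs, for comparison
_, p_two_sided = stats.ttest_rel(before, after)

print(f"t = {t_stat:.3f}")
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

When the t-statistic falls on the side named by the alternative, the one-sided p-value is half the two-sided one.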


Q4) For StudentsPerformance.csv, using ANOVA find if there is a significant difference in maths marks depending on the level of parental education.

In [38]:
# Step 1: Import pandas and the one-way ANOVA function
import pandas as pd
from scipy.stats import f_oneway

In [39]:
# Load the StudentsPerformance dataset

df = pd.read_csv('C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/StudentsPerformance.csv')

# Display the first few rows to verify dataset structure
print(df.head())

   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75  


In [40]:
# Import necessary statistical module
from scipy import stats

# Display the count of unique parental education levels to understand group sizes
print(df['parental level of education'].value_counts())

parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64


In [41]:
# Separate math scores by parental level of education for ANOVA analysis
s1 = df['math score'][df['parental level of education'] == "some college"]
s2 = df['math score'][df['parental level of education'] == "associate's degree"]
s3 = df['math score'][df['parental level of education'] == "high school"]
s4 = df['math score'][df['parental level of education'] == "some high school"]
s5 = df['math score'][df['parental level of education'] == "bachelor's degree"]
s6 = df['math score'][df['parental level of education'] == "master's degree"]

# Display the size of each group for verification
print(f"Sample sizes: \nSome College: {len(s1)}, Associate's Degree: {len(s2)}, High School: {len(s3)},")
print(f"Some High School: {len(s4)}, Bachelor's Degree: {len(s5)}, Master's Degree: {len(s6)}")

Sample sizes: 
Some College: 226, Associate's Degree: 222, High School: 196,
Some High School: 179, Bachelor's Degree: 118, Master's Degree: 59


In [42]:
# Perform one-way ANOVA test to compare math scores across different parental education levels
fstat, pval = f_oneway(s1, s2, s3, s4, s5, s6)

In [43]:
# Compute degrees of freedom for ANOVA
between_df = len([s1, s2, s3, s4, s5, s6]) - 1  # Number of groups - 1
within_df = len(s1) + len(s2) + len(s3) + len(s4) + len(s5) + len(s6) - len([s1, s2, s3, s4, s5, s6])  # Total observations - number of groups

# Display the computed degrees of freedom
print(f"Between-group degrees of freedom: {between_df}")
print(f"Within-group degrees of freedom: {within_df}")

Between-group degrees of freedom: 5
Within-group degrees of freedom: 994


In [44]:
# Compute the critical value for the ANOVA test
alpha = 0.025  # Significance level
critical_value = stats.f.ppf(1 - alpha, between_df, within_df)

# Display the computed critical value for verification
print("Critical Value:", critical_value)

Critical Value: 2.579227762918727


In [45]:
# Display the ANOVA test results clearly
print(f"\nF-statistic: {fstat:.2f}, p-value: {pval:.4f}")  
print(f"Critical value (alpha = {alpha}): {critical_value:.2f}")


F-statistic: 6.52, p-value: 0.0000
Critical value (alpha = 0.025): 2.58


In [46]:
# Interpretation of ANOVA results
# Compare p-value with alpha to determine statistical significance

if pval < alpha:
    print("Reject null hypothesis: Significant difference in math scores based on parental education level.")
else:
    print("Fail to reject null hypothesis: No significant evidence that parental education impacts math scores.")

Reject null hypothesis: Significant difference in math scores based on parental education level.
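The six hand-written group selections can also be produced with `groupby`, which keeps working if the category labels change. A minimal sketch follows. The scores are made up for illustration and only the column names follow StudentsPerformance.csv; the frame name `scores` is introduced here.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up scores for illustration only; column names follow StudentsPerformance.csv
scores = pd.DataFrame({
    "parental level of education": (
        ["high school"] * 4 + ["some college"] * 4 + ["master's degree"] * 4
    ),
    "math score": [55, 60, 62, 58, 65, 70, 68, 66, 75, 80, 78, 82],
})

# One array of math scores per education level
groups = [
    g["math score"].to_numpy()
    for _, g in scores.groupby("parental level of education")
]

# One-way ANOVA across all groups
f_stat, p_val = f_oneway(*groups)

# Degrees of freedom: groups - 1 and observations - groups
between_df = len(groups) - 1
within_df = len(scores) - len(groups)
print(f"F = {f_stat:.2f}, p = {p_val:.4g}, df = ({between_df}, {within_df})")
```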


Q5) Use the dataset airline_passenger_satisfaction.csv.
    For male passengers, determine whether passenger class and the ratings given to "onboard services" are independent. Separately repeat for female passengers.

In [47]:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset and remove missing values
df = pd.read_csv('C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/airline_passenger_satisfaction.csv')
df.dropna(inplace=True)

In [48]:
# Filter dataset to include only male passengers
df_male = df[df["Gender"] == "Male"]

In [49]:
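# Select only the relevant columns for the chi-square test.
# Note: this selects from the full DataFrame df, not from df_male, so the
# contingency table and test results below cover all passengers, not only males.
df = df[['Class', 'On-board Service']]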

# Display the first few rows to verify the selection
print(df.head())

      Class  On-board Service
0  Business                 3
1  Business                 5
2  Business                 3
3  Business                 5
4  Business                 3


In [50]:
# Create a contingency table for passenger class vs. onboard service ratings
contingency_table = pd.crosstab(df['Class'], df['On-board Service'])

# Display the contingency table
print(contingency_table)

On-board Service  0     1     2      3      4      5
Class                                               
Business          5  4107  6787  11715  21426  17950
Economy           0  9043  9758  14467  14870   9979
Economy Plus      0  1588  1745   2278   2291   1478


In [51]:
# Perform the chi-square test on the contingency table
chi2, p_value, degree_of_freedom, expected_counts = chi2_contingency(contingency_table)

# Display the test results
print(f"Chi-square statistic: {chi2:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {degree_of_freedom}")

Chi-square statistic: 6725.10
P-value: 0.0000
Degrees of freedom: 10


In [52]:
# Print the chi-square test results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", degree_of_freedom)
print("Expected frequencies:\n", expected_counts)


Chi-square statistic: 6725.0997721088515
P-value: 0.0
Degrees of freedom: 10
Expected frequencies:
 [[2.39367659e+00 7.05560110e+03 8.75606895e+03 1.36248071e+04
  1.84729597e+04 1.40781695e+04]
 [2.24412489e+00 6.61478253e+03 8.20900886e+03 1.27735589e+04
  1.73188094e+04 1.31985961e+04]
 [3.62198522e-01 1.06761636e+03 1.32492219e+03 2.06163399e+03
  2.79523087e+03 2.13023439e+03]]


In [53]:
# Set significance level (alpha)
alpha = 0.05  

# Compare p-value with alpha to determine statistical significance
if p_value < alpha:
    print("Reject H0: Passenger class and onboard service ratings are NOT independent.")
else:
    print("Do not reject H0: No significant evidence that passenger class affects onboard service ratings.")

Reject H0: Passenger class and onboard service ratings are NOT independent.
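To answer the question per gender, the contingency table has to be built from the filtered subset rather than from the full DataFrame. A minimal sketch of that step follows. The helper `class_vs_service_test` is hypothetical, the rows in `toy` are made up for illustration, and only the column names follow the airline dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def class_vs_service_test(frame, gender):
    """Chi-square test of independence on rows of one gender (hypothetical helper)."""
    subset = frame[frame["Gender"] == gender]
    # Build the table from the filtered subset, not from the full frame
    table = pd.crosstab(subset["Class"], subset["On-board Service"])
    chi2, p_value, dof, _ = chi2_contingency(table)
    return {"table": table, "chi2": chi2, "p_value": p_value, "dof": dof}

# Made-up rows for illustration; column names follow the airline dataset
toy = pd.DataFrame({
    "Gender": ["Male"] * 8 + ["Female"] * 8,
    "Class": (["Business"] * 4 + ["Economy"] * 4) * 2,
    "On-board Service": [5, 4, 5, 4, 2, 3, 2, 4, 5, 5, 4, 3, 1, 2, 2, 3],
})

# Run the same test separately for each gender
results = {g: class_vs_service_test(toy, g) for g in ("Male", "Female")}
for gender, res in results.items():
    print(gender, f"chi2 = {res['chi2']:.2f}, dof = {res['dof']}")
```

On a sample this small the expected counts are too low for the chi-square approximation to be reliable; the sketch only shows the filtering and table construction.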


#### Female

In [54]:
# Load the airline passenger satisfaction dataset
import pandas as pd

df = pd.read_csv('C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/airline_passenger_satisfaction.csv')

# Remove rows with missing values to ensure clean data for analysis
df.dropna(inplace=True)

# Display the first few rows to verify dataset structure
print(df.head())

   ID  Gender  Age Customer Type Type of Travel     Class  Flight Distance  \
0   1    Male   48    First-time       Business  Business              821   
1   2  Female   35     Returning       Business  Business              821   
2   3    Male   41     Returning       Business  Business              853   
3   4    Male   50     Returning       Business  Business             1905   
4   5  Female   49     Returning       Business  Business             3470   

   Departure Delay  Arrival Delay  Departure and Arrival Time Convenience  \
0                2            5.0                                       3   
1               26           39.0                                       2   
2                0            0.0                                       4   
3                0            0.0                                       2   
4                0            1.0                                       3   

   ...  On-board Service  Seat Comfort  Leg Room Service  Cleanlines

In [55]:
# Filter dataset to include only female passengers
df_female = df[df["Gender"] == "Female"]

# Display the first few rows to verify the selection
print(df_female.head())

    ID  Gender  Age Customer Type Type of Travel     Class  Flight Distance  \
1    2  Female   35     Returning       Business  Business              821   
4    5  Female   49     Returning       Business  Business             3470   
7    8  Female   60     Returning       Business  Business              853   
9   10  Female   38     Returning       Business  Business             2822   
10  11  Female   28    First-time       Business  Business              821   

    Departure Delay  Arrival Delay  Departure and Arrival Time Convenience  \
1                26           39.0                                       2   
4                 0            1.0                                       3   
7                 0            3.0                                       3   
9                13            0.0                                       2   
10                0            5.0                                       1   

    ...  On-board Service  Seat Comfort  Leg Room Servic

In [56]:
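# Select only the relevant columns for the chi-square test.
# Take them from df_female so the test covers female passengers only; selecting from df
# repeats the all-passenger analysis (the outputs recorded below come from that version).
df = df_female[['Class', 'On-board Service']]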

# Display the first few rows to verify the selection
print(df.head())

      Class  On-board Service
0  Business                 3
1  Business                 5
2  Business                 3
3  Business                 5
4  Business                 3


In [57]:
# Create a contingency table for passenger class vs. onboard service ratings
contingency_table = pd.crosstab(df['Class'], df['On-board Service'])

# Display the contingency table
print(contingency_table)

On-board Service  0     1     2      3      4      5
Class                                               
Business          5  4107  6787  11715  21426  17950
Economy           0  9043  9758  14467  14870   9979
Economy Plus      0  1588  1745   2278   2291   1478


In [58]:
# Perform chi-square test for independence
chi2, p_value, degree_of_freedom, expected_counts = chi2_contingency(contingency_table)

# Display test results
print(f"Chi-square statistic: {chi2:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {degree_of_freedom}")

Chi-square statistic: 6725.10
P-value: 0.0000
Degrees of freedom: 10


In [59]:
# Print the chi-square test results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", degree_of_freedom)
print("Expected frequencies:\n", expected_counts)

Chi-square statistic: 6725.0997721088515
P-value: 0.0
Degrees of freedom: 10
Expected frequencies:
 [[2.39367659e+00 7.05560110e+03 8.75606895e+03 1.36248071e+04
  1.84729597e+04 1.40781695e+04]
 [2.24412489e+00 6.61478253e+03 8.20900886e+03 1.27735589e+04
  1.73188094e+04 1.31985961e+04]
 [3.62198522e-01 1.06761636e+03 1.32492219e+03 2.06163399e+03
  2.79523087e+03 2.13023439e+03]]
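
The expected frequencies for rating 0 (about 2.39, 2.24 and 0.36) fall below 5, the usual rule of thumb for the chi-square approximation. The following is a minimal sketch of how that could be checked, rebuilding the contingency table printed above by hand. Dropping the sparse rating-0 column is only one possible remedy and is an assumption here, not part of the original analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the contingency table printed above
table = pd.DataFrame(
    [[5, 4107, 6787, 11715, 21426, 17950],
     [0, 9043, 9758, 14467, 14870, 9979],
     [0, 1588, 1745, 2278, 2291, 1478]],
    index=["Business", "Economy", "Economy Plus"],
    columns=[0, 1, 2, 3, 4, 5],
)

# Full table: inspect the smallest expected count
_, _, dof_full, expected_full = chi2_contingency(table)
min_expected_full = expected_full.min()

# Hypothetical remedy: drop the sparse rating-0 column and test again
trimmed = table.drop(columns=0)
chi2_trim, p_trim, dof_trim, expected_trim = chi2_contingency(trimmed)
min_expected_trim = expected_trim.min()

print(f"Full table:    dof={dof_full}, smallest expected count={min_expected_full:.2f}")
print(f"Trimmed table: dof={dof_trim}, smallest expected count={min_expected_trim:.2f}")
```

With only 5 of the observations in the rating-0 column, removing it is unlikely to change the conclusion, but it keeps every expected count above the rule-of-thumb threshold.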


In [60]:
# Set significance level (alpha)
alpha = 0.05  

# Compare p-value with alpha to determine statistical significance
if p_value < alpha:
    print("Reject H0: Passenger class and onboard service ratings are NOT independent.")
else:
    print("Do not reject H0: No significant evidence that passenger class affects onboard service ratings.")

Reject H0: Passenger class and onboard service ratings are NOT independent.
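
Running the same test separately for each gender can be done in one pass with `groupby`, which avoids accidentally reusing the unfiltered frame. This is a sketch on synthetic data: the `demo` frame, its random values, and the helper name `chi2_by_group` are all illustrative assumptions, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the airline data (values are random, not real results)
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "Gender": rng.choice(["Female", "Male"], size=n),
    "Class": rng.choice(["Business", "Economy", "Economy Plus"], size=n),
    "On-board Service": rng.integers(1, 6, size=n),  # ratings 1..5
})

def chi2_by_group(frame, group_col, row_col, col_col):
    """Run a chi-square test of independence within each level of group_col."""
    results = {}
    for label, subset in frame.groupby(group_col):
        table = pd.crosstab(subset[row_col], subset[col_col])
        chi2, p_value, dof, _ = chi2_contingency(table)
        results[label] = {
            "n": int(table.to_numpy().sum()),
            "chi2": chi2,
            "p_value": p_value,
            "dof": dof,
        }
    return results

results = chi2_by_group(demo, "Gender", "Class", "On-board Service")
for label, res in results.items():
    print(label, res)
```

Reporting `n` per group makes it easy to see at a glance whether a subgroup test really ran on the subgroup.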


6) Use dataset Amazon Sale Report.
   Are order fulfillment and order status independent? Consider only cancelled and pending orders.

In [61]:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
df = pd.read_csv('C:/Users/dbda.STUDENTSDC/Music/LabPractice/Notebooks/Datasets/Amazon Sale Report.csv', low_memory=False)

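# Remove rows with missing values in the two columns under test only.
# A blanket dropna() discards any row with a NaN in any column, which can wipe out
# whole categories (the outputs recorded below come from the blanket version).
df.dropna(subset=['Fulfilment', 'Status'], inplace=True)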

In [62]:
# Display the first row of the dataset to check column names and sample values
print(df.head(1))

       index             Order ID      Date                        Status  \
49051  49051  408-4858463-2356347  05-31-22  Shipped - Delivered to Buyer   

      Fulfilment Sales Channel  ship-service-level  Style         SKU  \
49051   Merchant      Amazon.in           Standard  J0385  J0385-KR-M   

      Category  ... currency Amount    ship-city      ship-state  \
49051    kurta  ...      INR  888.0  RAJAHMUNDRY  ANDHRA PRADESH   

      ship-postal-code  ship-country  \
49051         533126.0            IN   

                                           promotion-ids    B2B  fulfilled-by  \
49051  Amazon PLCC Free-Financing Universal Merchant ...  False     Easy Ship   

      Unnamed: 22  
49051       False  

[1 rows x 24 columns]


In [63]:
# Display unique values in the 'Fulfilment' column
print(df["Fulfilment"].unique())

['Merchant']
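
A single category in `Fulfilment` leaves nothing to compare against. One possible cause is that `dropna()` without a `subset` removed every row with a missing value in any column. This is an assumption, since the source does not show which columns contain NaN. The toy frame below illustrates the effect; its `Amazon` category and `note` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: the hypothetical 'note' column is missing for every 'Amazon' row
toy = pd.DataFrame({
    "Fulfilment": ["Amazon", "Amazon", "Merchant", "Merchant"],
    "Status": ["Pending", "Shipped", "Pending", "Shipped"],
    "note": [np.nan, np.nan, "x", "y"],
})

# Blanket dropna: any NaN in any column removes the row
blanket = toy.dropna()

# Targeted dropna: only NaNs in the columns under test remove the row
targeted = toy.dropna(subset=["Fulfilment", "Status"])

print("Blanket :", blanket["Fulfilment"].unique())
print("Targeted:", targeted["Fulfilment"].unique())
```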


In [64]:
# Display unique values in the 'Status' column
print(df["Status"].unique())

['Shipped - Delivered to Buyer' 'Shipped - Returned to Seller'
 'Shipped - Returning to Seller' 'Shipped - Lost in Transit'
 'Shipped - Picked Up' 'Shipped - Out for Delivery' 'Pending'
 'Pending - Waiting for Pick Up' 'Shipped - Rejected by Buyer'
 'Shipped - Damaged']
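
The question asks for cancelled and pending orders only. Below is a small sketch of one way to select them with string methods, shown on a hand-made Series. The `Cancelled` label is an assumption, because it does not appear among the unique values printed above.

```python
import pandas as pd

# Hand-made sample of status labels; 'Cancelled' is an assumed label
statuses = pd.Series([
    "Shipped - Delivered to Buyer",
    "Pending",
    "Pending - Waiting for Pick Up",
    "Cancelled",
    "Shipped - Picked Up",
])

# Keep every status that starts with 'Pending' plus the exact label 'Cancelled'
mask = statuses.str.startswith("Pending") | statuses.eq("Cancelled")
selected = statuses[mask].tolist()

print(selected)
```

Matching on the `Pending` prefix picks up both pending variants without listing each one by hand.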


In [65]:
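# Filter dataset to include only relevant order statuses.
# 'Cancelled' is an assumed label for the cancelled orders named in the question; it does not
# appear among the unique values printed above, so it matches no rows there.
relevant_statuses = ["Cancelled", "Pending", "Pending - Waiting for Pick Up",
                     "Shipped - Rejected by Buyer"]
df_filtered = df[df["Status"].isin(relevant_statuses)]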

# Display the first few rows to verify filtering
print(df_filtered.head())

       index             Order ID      Date   Status Fulfilment  \
91199  91199  403-0176285-9325126  06-29-22  Pending   Merchant   
91200  91200  407-5797861-3428325  06-29-22  Pending   Merchant   
91203  91203  406-2093090-0448363  06-29-22  Pending   Merchant   
91207  91207  402-3666308-9956330  06-29-22  Pending   Merchant   
91218  91218  407-1580850-5233959  06-29-22  Pending   Merchant   

      Sales Channel  ship-service-level    Style             SKU  \
91199      Amazon.in           Standard  JNE3801   JNE3801-KR-XL   
91200      Amazon.in           Standard  JNE3797  JNE3797-KR-XXL   
91203      Amazon.in           Standard  JNE3797  JNE3797-KR-XXL   
91207      Amazon.in           Standard  JNE3797   JNE3797-KR-XL   
91218      Amazon.in           Standard    J0148    J0148-SET-XL   

            Category  ... currency Amount  ship-city      ship-state  \
91199          kurta  ...      INR  725.0     MUMBAI     MAHARASHTRA   
91200  Western Dress  ...      INR  735.0   

In [66]:
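# Select only the relevant columns for the chi-square test.
# Use df_filtered so the test is restricted to the selected statuses; selecting from df
# keeps every status (the outputs recorded below come from that unfiltered version).
df = df_filtered[['Fulfilment', 'Status']]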

# Display the first few rows to verify the selection
print(df.head())

      Fulfilment                        Status
49051   Merchant  Shipped - Delivered to Buyer
49077   Merchant  Shipped - Delivered to Buyer
49081   Merchant  Shipped - Delivered to Buyer
49082   Merchant  Shipped - Delivered to Buyer
49083   Merchant  Shipped - Delivered to Buyer


In [67]:
# Create a contingency table for order fulfillment vs. order status
contingency_table = pd.crosstab(df['Fulfilment'], df['Status'])

# Display the contingency table
print(contingency_table)

Status      Pending  Pending - Waiting for Pick Up  Shipped - Damaged  \
Fulfilment                                                              
Merchant        220                            280                  1   

Status      Shipped - Delivered to Buyer  Shipped - Lost in Transit  \
Fulfilment                                                            
Merchant                           16672                          3   

Status      Shipped - Out for Delivery  Shipped - Picked Up  \
Fulfilment                                                    
Merchant                            34                  967   

Status      Shipped - Rejected by Buyer  Shipped - Returned to Seller  \
Fulfilment                                                              
Merchant                              5                          1054   

Status      Shipped - Returning to Seller  
Fulfilment                                 
Merchant                              143  


In [68]:
# Perform chi-square test for independence
chi2, p_value, degree_of_freedom, expected_counts = chi2_contingency(contingency_table)

# Display test results
print(f"Chi-square statistic: {chi2:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {degree_of_freedom}")

Chi-square statistic: 0.00
P-value: 1.0000
Degrees of freedom: 0


In [69]:
# Print the chi-square test results
print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", degree_of_freedom)
print("Expected frequencies:\n", expected_counts)

Chi-square statistic: 0.0
P-value: 1.0
Degrees of freedom: 0
Expected frequencies:
 [[2.2000e+02 2.8000e+02 1.0000e+00 1.6672e+04 3.0000e+00 3.4000e+01
  9.6700e+02 5.0000e+00 1.0540e+03 1.4300e+02]]
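
With a single `Fulfilment` row the expected counts equal the observed counts, so the statistic is 0 with 0 degrees of freedom and the test carries no information. A minimal sketch of a guard that refuses to test such a table follows. The helper name `independence_test` and the second row of `two_rows` are made up for illustration; the `Merchant` counts are taken from the table above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def independence_test(table):
    """Run chi2_contingency only if the table has at least 2 rows and 2 columns."""
    n_rows, n_cols = table.shape
    if n_rows < 2 or n_cols < 2:
        return None  # degenerate table: independence cannot be assessed
    chi2, p_value, dof, _ = chi2_contingency(table)
    return {"chi2": chi2, "p_value": p_value, "dof": dof}

# One-row table using Merchant counts from the output above
one_row = pd.DataFrame(
    [[220, 280, 5]],
    index=["Merchant"],
    columns=["Pending", "Pending - Waiting for Pick Up", "Shipped - Rejected by Buyer"],
)

# Two-row table; the second row is made-up data for illustration only
two_rows = pd.DataFrame(
    [[220, 280], [150, 90]],
    index=["Merchant", "Amazon"],
    columns=["Pending", "Pending - Waiting for Pick Up"],
)

print(independence_test(one_row))
print(independence_test(two_rows))
```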


In [70]:
# Set significance level (alpha)
alpha = 0.05  

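# Compare p-value with alpha to determine statistical significance.
# Caveat: the table above has a single Fulfilment row, so the test has 0 degrees of
# freedom and p = 1.0 by construction; that is not evidence of independence.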
if p_value < alpha:
    print("Reject H0: Order fulfillment and order status are NOT independent.")
else:
    print("Do not reject H0: No significant evidence that order fulfillment affects order status.")

Do not reject H0: No significant evidence that order fulfillment affects order status.
