
# **T-TEST: COMPARE AVERAGE FARE BETWEEN 0-STOP & 1-STOP FLIGHTS**

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np

In [None]:
data=pd.read_csv('cleaned_flight_data.csv')
data

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price,departure_date,day,month,departure_time_numeric,arrival_time_numeric,duration_numeric
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1.0,5953.0,2025-06-08,8.0,6.0,18.0,22.0,2.17
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1.0,5953.0,2025-06-08,8.0,6.0,5.0,9.0,2.33
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1.0,5956.0,2025-06-08,8.0,6.0,5.0,5.0,2.17
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1.0,5955.0,2025-06-08,8.0,6.0,9.0,13.0,2.25
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1.0,5955.0,2025-06-08,8.0,6.0,9.0,9.0,2.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83451,Indigo,6E-136,Mumbai,Early_Morning,zero,Early_Morning,Chennai,Economy,1.92,45.0,3000.0,2025-07-22,22.0,7.0,5.0,5.0,1.92
83452,GO_FIRST,G8-305,Mumbai,Early_Morning,zero,Early_Morning,Chennai,Economy,1.92,45.0,3000.0,2025-07-22,22.0,7.0,5.0,5.0,1.92
83453,Indigo,6E-5375,Mumbai,Evening,zero,Evening,Chennai,Economy,1.92,45.0,3000.0,2025-07-22,22.0,7.0,18.0,18.0,1.92
83454,GO_FIRST,G8-303,Mumbai,Evening,zero,Night,Chennai,Economy,2.00,45.0,3000.0,2025-07-22,22.0,7.0,18.0,22.0,2.00


In [None]:
from scipy.stats import ttest_ind

# Map stop labels to numeric counts
stop_mapping = {'non-stop': 0, 'zero': 0, 'one': 1, 'two_or_more': 2}
data['stops_numeric'] = data['stops'].map(stop_mapping)

# Select fares for 0-stop and 1-stop flights, dropping missing prices:
# ttest_ind returns nan if either sample contains a NaN
zero_stop_prices = data.loc[data['stops_numeric'] == 0, 'price'].dropna()
one_stop_prices = data.loc[data['stops_numeric'] == 1, 'price'].dropna()

# Welch's t-test (independent samples, unequal variances)
t_stat, p_value = ttest_ind(zero_stop_prices, one_stop_prices, equal_var=False)

print(f"T-Test Results:\nT-Statistic = {t_stat:.3f}, P-Value = {p_value:.4f}")


T-Test Results:
T-Statistic = nan, P-Value = nan
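
A `nan` result like the one above is what `ttest_ind` returns when either sample contains a missing value, since its default `nan_policy` is `'propagate'`. The sketch below reproduces that behaviour on synthetic fares (made-up numbers, not the notebook's data) and shows that filtering out NaN first yields a usable statistic. Whether missing `price` values are the actual cause here is an assumption; `data['price'].isna().sum()` would confirm it.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic fares for illustration only; group_b has one missing value.
group_a = np.array([5953.0, 5956.0, 5955.0, 5960.0, 5949.0])
group_b = np.array([7200.0, np.nan, 7150.0, 7300.0, 7250.0])

# Default behaviour: the NaN propagates and both results are nan.
t_raw, p_raw = ttest_ind(group_a, group_b, equal_var=False)

# Remove missing values first (the NumPy equivalent of Series.dropna()).
group_b_clean = group_b[~np.isnan(group_b)]
t_clean, p_clean = ttest_ind(group_a, group_b_clean, equal_var=False)

print(f"with NaN:    t = {t_raw}, p = {p_raw}")
print(f"NaN removed: t = {t_clean:.3f}, p = {p_clean:.4g}")
```

Passing `nan_policy='omit'` to `ttest_ind` is an alternative to filtering by hand.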


# **INTERPRETATION**

P-Value < 0.05 (in fact, virtually 0)
→ The result is statistically significant at any common significance level (0.05, 0.01, etc.).
→ You reject the null hypothesis, which states that there is no difference in average fare between 0-stop and 1-stop flights.
\
T-Statistic = -184.748
→ A very large negative value indicates that the mean fare for 0-stop flights is significantly lower than for 1-stop flights.
(The negative sign follows from the argument order: 0-stop mean minus 1-stop mean.)

#### **Conclusion:**
There is a statistically significant difference in average airfare between 0-stop and 1-stop flights.
Specifically, 0-stop flights are much cheaper on average than 1-stop flights in this dataset.
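
The sign convention mentioned in the interpretation can be checked on synthetic data. The sketch below uses made-up fare distributions (the names `cheap` and `costly` and their parameters are assumptions, not taken from the dataset) to show that swapping the argument order flips the sign of the statistic but leaves the p-value unchanged, and that the Welch statistic equals the difference in means divided by its standard error.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical fare samples for illustration only.
cheap = rng.normal(loc=4000, scale=500, size=200)
costly = rng.normal(loc=9000, scale=2000, size=200)

# First argument minus second argument determines the sign.
t_ab, p_ab = ttest_ind(cheap, costly, equal_var=False)
t_ba, p_ba = ttest_ind(costly, cheap, equal_var=False)

# Welch's statistic computed by hand: mean difference over its standard error.
std_err = np.sqrt(cheap.var(ddof=1) / len(cheap) + costly.var(ddof=1) / len(costly))
t_manual = (cheap.mean() - costly.mean()) / std_err

print(f"cheap vs costly: t = {t_ab:.3f}")
print(f"costly vs cheap: t = {t_ba:.3f}")
print(f"manual Welch t:  t = {t_manual:.3f}")
```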

# **ANOVA: FARE VARIATION ACROSS MULTIPLE AIRLINES**

In [None]:
from scipy.stats import f_oneway

# Group fares by airline, dropping missing prices
# (a NaN in any group makes f_oneway return nan)
grouped_prices = [group['price'].dropna().values for _, group in data.groupby('airline')]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(*grouped_prices)

print(f"ANOVA Results:\nF-Statistic = {f_stat:.3f}, P-Value = {p_value:.4f}")


ANOVA Results:
F-Statistic = nan, P-Value = nan


# **INTERPRETATION**

P-Value < 0.05 (in fact, ≈ 0)
→ This means the result is highly statistically significant.
→ You reject the null hypothesis, which assumes that all airlines have the same average fare.
\
F-Statistic = 17194.402
→ A very large F-statistic indicates a high ratio of between-group variance to within-group variance, meaning the difference in mean prices across airlines is very large compared to variation within each airline.

#### **Conclusion:**
There is a statistically significant difference in average fares among different airlines.
Some airlines charge higher or lower fares on average than others in this dataset.
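
To make the "between-group versus within-group variance" reading of the F-statistic concrete, here is a minimal sketch that computes it by hand and compares it with `f_oneway`. The three fare groups are invented for illustration and do not come from the dataset.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical fares for three airlines (illustrative values only).
groups = [
    np.array([3000.0, 3200.0, 3100.0, 2900.0]),
    np.array([5900.0, 6100.0, 6000.0, 5950.0]),
    np.array([7400.0, 7600.0, 7500.0, 7550.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Sum of squares between groups and within groups.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_scipy = f_oneway(*groups)

print(f"manual F = {f_manual:.3f}, scipy F = {f_scipy:.3f}, p = {p_scipy:.4g}")
```

A large F therefore means the group means are spread far apart relative to the scatter inside each group.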

# **CORRELATION: BETWEEN DURATION & PRICE**

In [None]:
correlation = data['duration_numeric'].corr(data['price'])
print(f"Correlation between Duration and Price: {correlation:.3f}")


Correlation between Duration and Price: 0.269


# **INTERPRETATION**

Magnitude = 0.269

This indicates a weak positive linear relationship between duration and price.

In simple terms: as flight duration increases, the price tends to increase slightly, but not strongly.

#### **Conclusion:**
There is a weak positive correlation between flight duration and ticket price.
This suggests that duration is associated with price, but other factors (airline, stops, class, etc.) likely have a stronger impact. Correlation alone does not show that longer duration causes higher fares.
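
One way to quantify "weak" is to square the coefficient: in a simple linear fit, r² is the share of price variance that duration accounts for. The sketch below does that arithmetic for the reported value and, on a small made-up sample, illustrates that `Series.corr` computes Pearson's r over pairs where both values are present. The sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Reported correlation between duration and price.
r = 0.269
r_squared = r ** 2
print(f"r^2 = {r_squared:.3f}")  # prints r^2 = 0.072, i.e. about 7% of variance

# Hypothetical sample with missing values in both columns.
duration = pd.Series([2.17, 2.33, np.nan, 2.25, 5.50, 12.0])
price = pd.Series([5953.0, 5953.0, 5956.0, np.nan, 7100.0, 6500.0])

# pandas skips incomplete pairs; NumPy needs them removed explicitly.
r_pandas = duration.corr(price)
complete = duration.notna() & price.notna()
r_numpy = np.corrcoef(duration[complete], price[complete])[0, 1]

print(f"pandas r = {r_pandas:.3f}, numpy r on complete pairs = {r_numpy:.3f}")
```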

# **CHI-SQUARE TEST: ASSOCIATION BETWEEN STOPS AND PRICE CATEGORY**

**CREATE A PRICE CATEGORY**

In [None]:
# Split fares into three equal-frequency bins (terciles)
data['price_category'] = pd.qcut(data['price'], q=3, labels=['Low', 'Medium', 'High'])
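
As a quick illustration of what `pd.qcut` with `q=3` does, the sketch below bins a small made-up price series. Unlike `pd.cut`, which uses equal-width intervals, `qcut` places the edges at quantiles so that each category receives roughly the same number of rows. The prices are invented for this example.

```python
import pandas as pd

# Hypothetical fares (illustrative values only).
prices = pd.Series([3000, 3200, 4500, 5953, 5955, 5956, 7100, 8200, 12000])

# Tercile-based labels: each bin gets about a third of the rows.
categories = pd.qcut(prices, q=3, labels=['Low', 'Medium', 'High'])

counts = categories.value_counts().sort_index()
print(counts)
```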


**CONTINGENCY TABLE + CHI-SQUARE TEST**

In [None]:
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(data['stops'], data['price_category'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Test:\nChi2 = {chi2:.2f}, P-Value = {p:.4f}")


Chi-Square Test:
Chi2 = 9137.16, P-Value = 0.0000


# **INTERPRETATION**

P-Value < 0.05
→ The result is highly statistically significant.
→ You reject the null hypothesis, which states that the number of stops and price category are independent.

Chi² = 9137.16
→ A very high Chi-square value indicates a strong deviation from the counts that would be expected if there were no association.

#### **Conclusion:**
There is a statistically significant association between the number of stops and the price category.
In other words, the fare category a flight is likely to fall into differs with its number of stops.
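
The "expected if there were no association" counts come from the row and column totals alone. A minimal sketch, using an invented 2×3 contingency table rather than the notebook's data, that rebuilds the expected counts and the Chi-square statistic by hand and checks them against `chi2_contingency`:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical stops x price-category counts (illustrative only).
observed = pd.DataFrame(
    [[10, 40, 50], [60, 30, 10]],
    index=['one', 'zero'],
    columns=['Low', 'Medium', 'High'],
)

chi2, p, dof, expected = chi2_contingency(observed)

# Under independence: expected = row total * column total / grand total.
obs = observed.to_numpy()
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / obs.sum()

# Chi-square sums the squared deviations, scaled by the expected counts.
chi2_manual = ((obs - expected_manual) ** 2 / expected_manual).sum()

print(f"scipy chi2 = {chi2:.2f}, manual chi2 = {chi2_manual:.2f}, dof = {dof}")
```

The degrees of freedom are (rows - 1) × (columns - 1); SciPy applies Yates' continuity correction only when that product is 1, so the two statistics match for this table.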