# Research Question
Is there a statistically significant difference in the average medical insurance charges among the four geographical regions (Northeast, Southeast, Northwest, and Southwest)?

In [118]:
# Importing essential libraries
import pandas as pd
import numpy as np
from scipy import stats

In [119]:
# Loading the dataset
df = pd.read_csv('insurance.csv')

In [120]:
df.shape

(1337, 7)

In [121]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB


In [122]:
df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1337.0 | 1337.0 | 1337.0 | 1337.0 |
| mean | 39.222139 | 30.663452 | 1.095737 | 13279.121487 |
| std | 14.044333 | 6.100468 | 1.205571 | 12110.359656 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29 | 0.0 | 4746.344 |
| 50% | 39.0 | 30.4 | 1.0 | 9386.1613 |
| 75% | 51.0 | 34.7 | 2.0 | 16657.71745 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [123]:
# Filtering data for each region
north_east = df[df['region'] == 'northeast']['charges']
north_west = df[df['region'] == 'northwest']['charges']
south_east = df[df['region'] == 'southeast']['charges']
south_west = df[df['region'] == 'southwest']['charges']
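
The four filters above can also be expressed with a single `groupby`, which additionally makes it easy to compare group sizes and spreads before testing. The sketch below is a minimal illustration on a made-up `toy` frame (the values are hypothetical and are not taken from `insurance.csv`); with the real data, `toy` would simply be replaced by `df`.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data; the numbers are invented.
toy = pd.DataFrame({
    "region": ["northeast", "northeast", "southeast",
               "southeast", "southwest", "southwest"],
    "charges": [1000.0, 3000.0, 2000.0, 6000.0, 1500.0, 2500.0],
})

# Per-region sample size, mean and standard deviation in one table
summary = toy.groupby("region")["charges"].agg(["count", "mean", "std"])
print(summary)

# One Series of charges per region, keyed by region name
groups = {name: charges for name, charges in toy.groupby("region")["charges"]}
```

The `groups` dictionary can then be unpacked into a test function with `*groups.values()` instead of naming each region by hand.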

#### **Hypothesis testing**
**Hypothesis :** Insurance charges in certain regions are higher compared to the rest of the regions

**Null Hypothesis (H0) :** There is no significant difference in the mean insurance charge across the four regions

**Alternate Hypothesis (H1) :** At least one region has a mean insurance charge that is significantly different from the others
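
In symbols, writing $\mu_r$ for the population mean charge in region $r$ (this notation is introduced here for illustration and does not appear in the notebook), the two hypotheses can be stated as:

$$
H_0:\ \mu_{\text{NE}} = \mu_{\text{NW}} = \mu_{\text{SE}} = \mu_{\text{SW}}
\qquad\text{versus}\qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair of regions } i \neq j
$$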

In [135]:
# Hypothesis testing using ANOVA
print('Mean Insurance charges across the regions')
print('----------------------------------------')
print(f'Northeast : {north_east.mean():.2f}')
print(f'Northwest : {north_west.mean():.2f}')
print(f'Southeast : {south_east.mean():.2f}')
print(f'Southwest : {south_west.mean():.2f}')
print('')
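# equal_var=False requests Welch's ANOVA, which does not assume equal variances across groups.
# This needs a recent SciPy release; older versions of f_oneway do not accept equal_var.
f_statistic, p_value = stats.f_oneway(north_east, north_west, south_east, south_west, equal_var=False)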

print(f'F-Statistic : {f_statistic}')
print(f'P-Value : {p_value}')
print('')
if p_value < 0.05:
    print('Reject the Null Hypothesis')
else:
    print('Fail to reject the Null Hypothesis')

Mean Insurance charges across the regions
----------------------------------------
Northeast : 13406.38
Northwest : 12450.84
Southeast : 14735.41
Southwest : 12346.94

F-Statistic : 2.566222713601441
P-Value : 0.05348899434656477

Fail to reject the Null Hypothesis


- No Statistically Significant Difference: Since the $p$-value ($0.0535$) is slightly greater than the standard alpha level of $0.05$, we fail to reject the null hypothesis. This means we do not have enough evidence to conclude that insurance charges are significantly different across the four regions.

- Marginal Results: The result is "marginal" because the $p$-value is very close to the $0.05$ threshold. The Southeast has the highest average charges ($14,735$) and the Southwest has the lowest ($12,347$), but the variation within each region is large relative to these gaps, so at the $0.05$ level we cannot rule out that the differences are due to random chance.

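- Impact of Unequal Variance: By using `equal_var=False` (Welch's ANOVA), the test does not assume that charges vary by the same amount in every region. This is the more robust choice when group variances or group sizes differ, because a standard ANOVA can give misleading $p$-values in that situation. The analysis above does not compare the regional variances or run a standard ANOVA, so it does not establish whether the two tests would disagree on this data.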
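
The variance assumption discussed in the last bullet can be checked directly. The sketch below is one possible way to do it, not the notebook's own method: it runs Levene's test for equal variances, then compares a standard ANOVA with a hand-written Welch's ANOVA. The helper `welch_anova` and the `samples` list are hypothetical names introduced here, and the data is synthetic (drawn from normal distributions with invented parameters) so that the block runs without `insurance.csv`.

```python
import numpy as np
from scipy import stats


def welch_anova(*groups):
    """Welch's one-way ANOVA written out by hand. Returns (F, p_value)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])  # sample variances

    w = n / variances                                  # precision weights
    grand_mean = np.sum(w * means) / np.sum(w)         # weighted grand mean
    between = np.sum(w * (means - grand_mean) ** 2) / (k - 1)

    # Correction term shared by the F denominator and the second df
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    f_stat = between / (1 + 2 * (k - 2) / (k ** 2 - 1) * tmp)
    df2 = (k ** 2 - 1) / (3 * tmp)
    return f_stat, stats.f.sf(f_stat, k - 1, df2)


# Synthetic stand-in data: equal means, different spreads and group sizes.
rng = np.random.default_rng(0)
samples = [
    rng.normal(loc=13000, scale=4000, size=300),
    rng.normal(loc=13000, scale=12000, size=60),
    rng.normal(loc=13000, scale=8000, size=150),
]

# Levene's test: a small p-value suggests the group variances differ
levene_stat, levene_p = stats.levene(*samples, center="median")
classic_f, classic_p = stats.f_oneway(*samples)   # assumes equal variances
welch_f, welch_p = welch_anova(*samples)          # does not

print(f"Levene p-value        : {levene_p:.4g}")
print(f"Standard ANOVA p-value: {classic_p:.4g}")
print(f"Welch ANOVA p-value   : {welch_p:.4g}")
```

On the real data, the same three calls would take `north_east, north_west, south_east, south_west` in place of `*samples`. If Levene's test indicates unequal variances and the two ANOVA variants disagree, the conclusion depends on the variance assumption and the Welch result is the one to report.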