Modeling to compare single vs family plans and cost of health insurance

In [8]:
import pandas as pd
import statsmodels.api as sm

In [56]:
## read in csv 

df = pd.read_csv("insurance.csv")

In [57]:
## convert F = 1 and M = 0 and Smoker to numeric

df['sex'] = df['sex'].map({'female': 1, 'male': 0})

df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

In [11]:
## model 1 

X = df[['age', 'sex', 'smoker', 'children']]  # Include all columns you're interested in
y = df['charges'] 

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.724
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     873.1
Date:                Fri, 31 Jan 2025   Prob (F-statistic):               0.00
Time:                        14:35:19   Log-Likelihood:                -13617.
No. Observations:                1338   AIC:                         2.724e+04
Df Residuals:                    1333   BIC:                         2.727e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2821.6712    570.788     -4.943      0.0

Logistic modeling for OR 

Based off 2024 (https://www.kff.org/report-section/ehbs-2024-section-1-cost-of-health-insurance/)

Average annual premiums for single coverage is $8,951 per year.

Average annual premiums for family coverage is $25,572.

In [58]:
## adjust region variable by hot encoding
region_encoded = pd.get_dummies(df['region'], prefix='region')

df = df.drop('region', axis=1).join(region_encoded)
df = df.astype(int)
# Display the updated DataFrame
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,charges,region_northeast,region_northwest,region_southeast,region_southwest
0,19,1,27,0,1,16884,0,0,0,1
1,18,0,33,1,0,1725,0,0,1,0
2,28,0,33,3,0,4449,0,0,1,0
3,33,0,22,0,0,21984,0,1,0,0
4,32,0,28,0,0,3866,0,1,0,0


Family Coverage

In [59]:
## create a family variable 

df_f = df
df_f['family'] = df_f['children'].apply(lambda x:1 if x > 0 else 0)

## create a over/under average variable
df_f['average_cost'] = df_f['charges'].apply(lambda x:1 if x > 25572 else 0)

df_f = df_f[df['family'] == 1]

# print(df.head())

In [69]:
## model 1

X = df_f[['smoker', 'sex', 'age']] 
y = df_f['average_cost'] 

X = sm.add_constant(X)

# Fit the logistic regression model
model = sm.Logit(y, X).fit()

# Display the model summary
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.264028
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:           average_cost   No. Observations:                  764
Model:                          Logit   Df Residuals:                      760
Method:                           MLE   Df Model:                            3
Date:                Fri, 31 Jan 2025   Pseudo R-squ.:                  0.3800
Time:                        14:58:13   Log-Likelihood:                -201.72
converged:                       True   LL-Null:                       -325.37
Covariance Type:            nonrobust   LLR p-value:                 2.509e-53
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -5.0630      0.569     -8.900      0.000      -6.178      -3.948
smoker         3.5784      0.

Single Coverage

In [64]:
## create a family variable 

df_s = df
df_s['family'] = df_s['children'].apply(lambda x:1 if x == 0 else 0)

## create a over/under average variable
df_s['average_cost'] = df_s['charges'].apply(lambda x:1 if x > 8951 else 0)

df_s = df_s[df_s['family'] == 0]

df_s.head()

Unnamed: 0,age,sex,bmi,children,smoker,charges,region_northeast,region_northwest,region_southeast,region_southwest,family,average_cost
1,18,0,33,1,0,1725,0,0,1,0,0,0
2,28,0,33,3,0,4449,0,0,1,0,0,0
6,46,1,33,1,0,8240,0,0,1,0,0,0
7,37,1,27,3,0,7281,0,1,0,0,0,0
8,37,0,29,2,0,6406,1,0,0,0,0,0


In [68]:
## model 1

X = df_s[['smoker', 'sex', 'age']] 
y = df_s['average_cost'] 

X = sm.add_constant(X)

# Fit the logistic regression model
model = sm.Logit(y, X).fit()

# Display the model summary
print(model.summary())

         Current function value: 0.329304
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:           average_cost   No. Observations:                  764
Model:                          Logit   Df Residuals:                      760
Method:                           MLE   Df Model:                            3
Date:                Fri, 31 Jan 2025   Pseudo R-squ.:                  0.5246
Time:                        14:58:06   Log-Likelihood:                -251.59
converged:                      False   LL-Null:                       -529.25
Covariance Type:            nonrobust   LLR p-value:                4.886e-120
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.9273      0.632    -12.547      0.000      -9.166      -6.689
smoker        30.3679   8.39e+04      0.000      1.000   -1.64e+0

