In [3]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [4]:
df = pd.read_csv('../data/insurance_premiums.csv')

In [22]:
df.head()

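| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |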


In [5]:
df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


### Initial Observations and Predictions

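Beginning with the summary statistics, there is already a lot I can tell about the customers in the insurance dataset. The customers range in age from 18 to 64, with a mean age of 39. Premium charges range from about \\$1,100 to nearly \\$64,000. The mean charge (about \\$13,270) sits well above the median (about \\$9,380), which suggests the distribution is skewed toward a smaller number of very high charges.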
<br>
<br>
The first big question I have is whether the charges paid simply track with the customer's age. I don't have information on how long customers have paid premiums, but it seems reasonable that, on average, older people will have paid more over their lifetime.
<br>
<br>
I would expect higher BMI and smoking to track with higher charges, because insurance companies probably assume that customers with these characteristics have more chronic health issues on average and are therefore riskier to cover. [Smoking](https://www.cdc.gov/chronicdisease/resources/publications/factsheets/tobacco.htm) in particular is considered a major risk factor for a number of chronic conditions, such as heart disease, cancer, and stroke.
<br>
<br>
I think the children column could be interesting as well. It could be that having children affects health outcomes more for women than for men (due to pregnancy or related complications). It's unclear whether the children are also insured and, if so, whether their premiums are included in charges. Under the [Affordable Care Act (ACA)](https://www.dol.gov/agencies/ebsa/about-ebsa/our-activities/resource-center/faqs/young-adult-and-aca), plans that cover dependents must make that coverage available until a child turns 26. If the premiums paid for children are factored into the charges, I would therefore expect the number of children to matter more for younger customers than for older ones, since older customers whose children are over 26 are unlikely to be paying premiums for them.
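
The age question can also be checked numerically before any plotting. The sketch below is a minimal example, not part of the original analysis: `age_charge_correlation` is a hypothetical helper, and `sample` is a stand-in built from only the five rows shown by `df.head()`, so the printed value says nothing about the full dataset.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head(), not the full dataset.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def age_charge_correlation(frame):
    """Pearson correlation between age and charges (hypothetical helper)."""
    return frame["age"].corr(frame["charges"])


r = age_charge_correlation(sample)
print(f"Pearson r between age and charges: {r:.3f}")
```

On the real data the call would be `age_charge_correlation(df)`. A value near 1 would support the idea that charges simply track age; a weak value would suggest other factors dominate.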

To confirm or refute some of my initial predictions, I can plot charges by various categories and use the resulting boxplots to get a rough idea of which factors could be impactful:

In [40]:
# Box per sex, split by smoking status, so both factors can be compared at once
fig1 = px.box(df, x="sex", y="charges", title="Charges by Sex and Smoking Status",
              color="smoker", labels={'charges':'Charges',
                                      'sex':'Sex', 'smoker':'Smoker'})

fig1.show()

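I can see from the above boxplot that sex doesn't seem to have a large effect on what people typically pay. Smoking, however, is associated with a *substantial* increase in both the median charges and the spread of charges within each group. Because a boxplot shows medians and quartiles rather than means, and no statistical test has been run yet, this is a visual impression rather than a formal result.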
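
The boxplot comparison can be put into numbers with a grouped summary. This is a minimal sketch under stated assumptions: `charges_summary` is a hypothetical helper, and `sample` again holds only the five rows from `df.head()`, which include a single smoker, so the standard deviation for that group is undefined (NaN).

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head(), not the full dataset.
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def charges_summary(frame, by):
    """Count, median, mean, and std of charges per group (hypothetical helper)."""
    return frame.groupby(by)["charges"].agg(["count", "median", "mean", "std"])


summary = charges_summary(sample, "smoker")
print(summary)
```

On the real data, `charges_summary(df, "smoker")` and `charges_summary(df, "sex")` would give the per-group medians and spreads that the boxplot shows graphically.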

In [45]:
fig2 = px.box(df, x="region", y="charges", title="Charges for Customers by Region",
              labels={'charges':'Charges', 'region':'Region'})

fig2.show()

In [50]:
fig3 = px.histogram(df, x='bmi', y='charges', histfunc='avg',
                    title="Average Charges by BMI")

fig3.show()
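
The histogram averages charges within automatically chosen BMI bins, and bins that contain only a few customers can produce noisy averages. One way to see how many customers sit behind each average is to bin BMI explicitly. The sketch below is an illustration, not part of the original notebook: the bin edges follow the commonly used adult BMI categories (an assumption, since the notebook does not define any), `charges_by_bmi_band` is a hypothetical helper, and `sample` is the five rows from `df.head()`.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head(), not the full dataset.
sample = pd.DataFrame({
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Assumed bin edges: standard adult BMI categories, left-closed intervals.
BMI_EDGES = [0, 18.5, 25, 30, float("inf")]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def charges_by_bmi_band(frame):
    """Customer count and mean charges per BMI band (hypothetical helper)."""
    bands = pd.cut(frame["bmi"], bins=BMI_EDGES, labels=BMI_LABELS, right=False)
    # observed=True keeps only bands that actually contain customers.
    return frame.groupby(bands, observed=True)["charges"].agg(["count", "mean"])


bmi_summary = charges_by_bmi_band(sample)
print(bmi_summary)
```

Reading the `count` column alongside `mean` shows which averages rest on enough customers to be trusted.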

### Data Prep and Cleanup

In [13]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
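
No column has missing values, so nothing needs to be imputed or dropped on that account. Two other quick checks are often worth running at this stage: duplicated rows and unexpected labels in the categorical columns. The sketch below is one possible way to do that, assuming a hypothetical helper `cleanup_report` and the five-row stand-in `sample` from `df.head()`. It is not a step the notebook performs.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head(), not the full dataset.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def cleanup_report(frame, categorical_cols):
    """Summarize missing values, duplicate rows, and category labels."""
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # Sorted unique labels make typos or stray capitalization easy to spot.
        "categories": {col: sorted(frame[col].unique()) for col in categorical_cols},
    }


report = cleanup_report(sample, ["sex", "smoker", "region"])
print(report)
```

On the real data the call would be `cleanup_report(df, ["sex", "smoker", "region"])`.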