# U.S. Medical Insurance Costs

In [4]:
import csv
import numpy as np
import pandas as pd


# Importing the data

Using pandas to read the .csv file and create a DataFrame to work with.

In [5]:
ins_data = pd.read_csv('insurance.csv')


# Separating the Data by region

Taking the master DataFrame and creating 4 new DataFrames organized by region.


In [6]:
se_data=ins_data[ins_data['region']=='southeast']
ne_data=ins_data[ins_data['region']=='northeast']
nw_data=ins_data[ins_data['region']=='northwest']
sw_data=ins_data[ins_data['region']=='southwest']

se_percentage=len(se_data)/len(ins_data)*100
ne_percentage=len(ne_data)/len(ins_data)*100
nw_percentage=len(nw_data)/len(ins_data)*100
sw_percentage=len(sw_data)/len(ins_data)*100
print("Percentage from Southeast: " +str(round(se_percentage, 2))+"%")
print("Percentage from Northeast: " +str(round(ne_percentage, 2))+"%")
print("Percentage from Northwest: " +str(round(nw_percentage, 2))+"%")
print("Percentage from Southwest: " +str(round(sw_percentage, 2))+"%")

Percentage from Southeast: 27.2%
Percentage from Northeast: 24.22%
Percentage from Northwest: 24.29%
Percentage from Southwest: 24.29%


The data looks fairly evenly distributed among the 4 regions, with the Southeast slightly overrepresented at 27.2%.
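
The regional shares can also be computed without building four separate DataFrames. The following is a minimal sketch, assuming a small made-up `sample` DataFrame as a stand-in for `ins_data` (the rows below are hypothetical, not taken from insurance.csv):

```python
import pandas as pd

# Hypothetical stand-in for ins_data; the notebook itself reads insurance.csv.
sample = pd.DataFrame({
    "region": ["southeast", "southeast", "northeast", "northwest", "southwest"],
})

# normalize=True returns each region's fraction of all rows;
# multiplying by 100 turns the fractions into percentages.
region_percent = (sample["region"].value_counts(normalize=True) * 100).round(2)
print(region_percent)
```

Applied to `ins_data['region']`, the same call should reproduce the four percentages printed above.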


# Average Insurance Costs by Region

Here we will use pandas to calculate the average of the "charges" column for each region to see which had the highest and lowest costs.

In [7]:
se_mean_cost=se_data['charges'].mean()
ne_mean_cost=ne_data['charges'].mean()
nw_mean_cost=nw_data['charges'].mean()
sw_mean_cost=sw_data['charges'].mean()

print("The Average Insurance Cost for the Southeast is: $"+str(round(se_mean_cost, 2)))
print("The Average Insurance Cost for the Northeast is: $"+str(round(ne_mean_cost, 2)))
print("The Average Insurance Cost for the Northwest is: $"+str(round(nw_mean_cost, 2)))
print("The Average Insurance Cost for the Southwest is: $"+str(round(sw_mean_cost, 2)))

The Average Insurance Cost for the Southeast is: $14735.41
The Average Insurance Cost for the Northeast is: $13406.38
The Average Insurance Cost for the Northwest is: $12417.58
The Average Insurance Cost for the Southwest is: $12346.94
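
The per-region averages can also be computed in one pass with `groupby`. This is a sketch only, using a hypothetical `sample` DataFrame with made-up charges rather than the real data:

```python
import pandas as pd

# Hypothetical rows standing in for ins_data (values are invented).
sample = pd.DataFrame({
    "region": ["southeast", "southeast", "northeast", "northeast"],
    "charges": [1000.0, 3000.0, 1500.0, 2000.0],
})

# Group rows by region, average the charges in each group,
# and sort so the most expensive region comes first.
mean_cost_by_region = (
    sample.groupby("region")["charges"].mean().sort_values(ascending=False)
)
print(mean_cost_by_region.round(2))
```

Sorting the result makes the highest and lowest regions easy to read off without comparing four print statements by eye.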


In our data set the respondents from the Southeast had the highest average insurance costs.  
Now let us see what may be driving that cost by looking at 2 of the variables, BMI and smoking.

# Percentage of Overweight

According to the CDC, a BMI of 25 to under 30 is considered overweight, and a BMI of 30 or over is considered obese. In this next code block we will find out what percentage of people in each region are overweight and see if that is a driver for the higher costs in the Southeast.


In [15]:
se_ow=se_data[(se_data['bmi']>=25.0) & (se_data['bmi']<30)]
ne_ow=ne_data[(ne_data['bmi']>=25.0) & (ne_data['bmi']<30)]
nw_ow=nw_data[(nw_data['bmi']>=25.0) & (nw_data['bmi']<30)]
sw_ow=sw_data[(sw_data['bmi']>=25.0) & (sw_data['bmi']<30)]

se_ow_percent=len(se_ow)/len(se_data)*100
ne_ow_percent=len(ne_ow)/len(ne_data)*100
nw_ow_percent=len(nw_ow)/len(nw_data)*100
sw_ow_percent=len(sw_ow)/len(sw_data)*100
print("The Percentage of overweight in Southeast: " +str(round(se_ow_percent,2))+"%")
print("The Percentage of overweight in Northeast: " +str(round(ne_ow_percent,2))+"%")
print("The Percentage of overweight in Northwest: " +str(round(nw_ow_percent,2))+"%")
print("The Percentage of overweight in Southwest: " +str(round(sw_ow_percent,2))+"%")

The Percentage of overweight in Southeast: 21.98%
The Percentage of overweight in Northeast: 30.25%
The Percentage of overweight in Northwest: 32.92%
The Percentage of overweight in Southwest: 31.08%


First we look at those in the sample who are overweight but not obese according to CDC guidelines on BMI. The spread is rather large: the Southeast has the lowest share at about 22%, while the other three regions sit between 30% and 33%, with the Northwest reporting just under a third.


In [57]:
se_ob=se_data[se_data['bmi']>=30.0]
ne_ob=ne_data[ne_data['bmi']>=30.0]
nw_ob=nw_data[nw_data['bmi']>=30.0]
sw_ob=sw_data[sw_data['bmi']>=30.0]

se_ob_percent=len(se_ob)/len(se_data)*100
ne_ob_percent=len(ne_ob)/len(ne_data)*100
nw_ob_percent=len(nw_ob)/len(nw_data)*100
sw_ob_percent=len(sw_ob)/len(sw_data)*100


print("The Percentage of obese in Southeast: " +str(round(se_ob_percent,2))+"%")
print("The Percentage of obese in Northeast: " +str(round(ne_ob_percent,2))+"%")
print("The Percentage of obese in Northwest: " +str(round(nw_ob_percent,2))+"%")
print("The Percentage of obese in Southwest: " +str(round(sw_ob_percent,2))+"%")



The Percentage of obese in Southeast: 66.76%
The Percentage of obese in Northeast: 44.14%
The Percentage of obese in Northwest: 45.54%
The Percentage of obese in Southwest: 53.23%


Now we calculate the obesity levels, and we see that a full 2/3 of the Southeast sample is obese. Added to the roughly 22% who were overweight, that is almost 89% with a BMI of 25 or over. The data set has no fields for race or the GDP of the locale, so we cannot see whether they play a role in influencing BMI; more demographic data may be needed.
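
The overweight and obese shares can be derived together by binning BMI once. The sketch below assumes the same cutoffs used in the cells above (25 and 30) and runs on a hypothetical `sample` DataFrame with invented values; the label "under 25" is an illustrative name, not a CDC category:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for ins_data (values are invented).
sample = pd.DataFrame({
    "region": ["southeast", "southeast", "southeast", "northwest", "northwest"],
    "bmi": [31.2, 27.5, 35.0, 24.9, 30.0],
})

# right=False closes each bin on the left, so a BMI of exactly 25
# counts as overweight and exactly 30 counts as obese, matching
# the >= comparisons used in the cells above.
bmi_class = pd.cut(
    sample["bmi"],
    bins=[0, 25, 30, np.inf],
    labels=["under 25", "overweight", "obese"],
    right=False,
)

# normalize="index" turns each region's row into fractions that sum to 1.
bmi_share = pd.crosstab(sample["region"], bmi_class, normalize="index") * 100
print(bmi_share.round(2))
```

Because every row falls into exactly one bin, the percentages in each region's row sum to 100, which makes the overweight-plus-obese total easy to check.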

In [58]:
se_mean_bmi=se_data['bmi'].mean()
ne_mean_bmi=ne_data['bmi'].mean()
nw_mean_bmi=nw_data['bmi'].mean()
sw_mean_bmi=sw_data['bmi'].mean()

print("The Average BMI for the Southeast is: "+str(round(se_mean_bmi,2)))
print("The Average BMI for the Northeast is: "+str(round(ne_mean_bmi, 2)))
print("The Average BMI for the Northwest is: "+str(round(nw_mean_bmi,2)))
print("The Average BMI for the Southwest is: "+str(round(sw_mean_bmi, 2)))

The Average BMI for the Southeast is: 33.36
The Average BMI for the Northeast is: 29.17
The Average BMI for the Northwest is: 29.2
The Average BMI for the Southwest is: 30.6


Once again, the two regions with the highest combined percentage of overweight and obese respondents, the Southeast and Southwest, also have the highest average BMI.

# Percentage of Smokers

The next variable we will examine is the percentage of respondents from each region who identified as smokers, to see if there is any correlation with insurance costs.

In [59]:
se_smoker=se_data[se_data['smoker']=='yes']
ne_smoker=ne_data[ne_data['smoker']=='yes']
nw_smoker=nw_data[nw_data['smoker']=='yes']
sw_smoker=sw_data[sw_data['smoker']=='yes']

se_smoke_percent=len(se_smoker)/len(se_data)*100
ne_smoke_percent=len(ne_smoker)/len(ne_data)*100
nw_smoke_percent=len(nw_smoker)/len(nw_data)*100
sw_smoke_percent=len(sw_smoker)/len(sw_data)*100

print("The Percentage of smokers in Southeast: " +str(round(se_smoke_percent,2))+"%")
print("The Percentage of smokers in Northeast: " +str(round(ne_smoke_percent,2))+"%")
print("The Percentage of smokers in Northwest: " +str(round(nw_smoke_percent,2))+"%")
print("The Percentage of smokers in Southwest: " +str(round(sw_smoke_percent,2))+"%")

The Percentage of smokers in Southeast: 25.0%
The Percentage of smokers in Northeast: 20.68%
The Percentage of smokers in Northwest: 17.85%
The Percentage of smokers in Southwest: 17.85%


With this analysis we see that the top 2 regions for smokers, the Southeast and Northeast, are the same as the top 2 for insurance costs.
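
To compare smoking rates and costs side by side, both figures can be placed in one table. This is a rough sketch on a hypothetical `sample` DataFrame with invented values; the `is_smoker` helper column is an assumption introduced for illustration:

```python
import pandas as pd

# Hypothetical rows standing in for ins_data (values are invented).
sample = pd.DataFrame({
    "region": ["southeast", "southeast", "northwest", "northwest"],
    "smoker": ["yes", "no", "no", "no"],
    "charges": [30000.0, 5000.0, 4000.0, 6000.0],
})

# The mean of a True/False column is the fraction of True values,
# so averaging is_smoker gives the smoker share per region.
summary = (
    sample.assign(is_smoker=sample["smoker"].eq("yes"))
    .groupby("region")
    .agg(smoker_percent=("is_smoker", "mean"), mean_charges=("charges", "mean"))
)
summary["smoker_percent"] *= 100
print(summary.sort_values("mean_charges", ascending=False))
```

A table like this only shows whether the two rankings line up; it does not by itself establish that smoking drives the regional cost difference.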

# Age as a factor

The next variable to explore is age: what is the average age in the total group, and how do the different regions compare?

In [60]:
mean_age=ins_data['age'].mean()
print("The average age for the total sample is: " +str(round(mean_age, 2))+" years old.")

The average age for the total sample is: 39.21 years old.


Next let us look into how each region compares when we examine age.

In [61]:
se_mean_age=se_data['age'].mean()
ne_mean_age=ne_data['age'].mean()
nw_mean_age=nw_data['age'].mean()
sw_mean_age=sw_data['age'].mean()

print("The average age for the Southeast Sample is "+ str(round(se_mean_age, 2))+" years old.")
print("The average age for the Northeast Sample is "+ str(round(ne_mean_age, 2))+" years old.")
print("The average age for the Northwest Sample is "+ str(round(nw_mean_age, 2))+" years old.")
print("The average age for the Southwest Sample is "+ str(round(sw_mean_age, 2))+" years old.")

The average age for the Southeast Sample is 38.94 years old.
The average age for the Northeast Sample is 39.27 years old.
The average age for the Northwest Sample is 39.2 years old.
The average age for the Southwest Sample is 39.46 years old.


Here we see that all 4 regions track close to the total sample average of 39.21, with the Southwest having the highest average age (39.46) and the Southeast the lowest (38.94).
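
The gap between each region and the overall average can be computed directly. This is a small sketch on a hypothetical `sample` DataFrame with invented ages:

```python
import pandas as pd

# Hypothetical rows standing in for ins_data (values are invented).
sample = pd.DataFrame({
    "region": ["southeast", "southeast", "southwest", "southwest"],
    "age": [30, 40, 40, 50],
})

overall_mean_age = sample["age"].mean()

# Positive values mean the region is older than the sample as a whole.
age_gap = sample.groupby("region")["age"].mean() - overall_mean_age
print(age_gap.round(2))
```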



# Children as a factor

This variable has a couple of different components. First, what percentage of people in the total sample and in each region have no children? Then, among those who have children, what is the average number of children by region?

In [62]:
total_no_children=ins_data[ins_data['children']==0]
total_childless_percent=len(total_no_children)/len(ins_data)*100
print("In the total sample the percentage of sample that have no children is: "+ str(round(total_childless_percent, 2))+"%")

In the total sample the percentage of sample that have no children is: 42.9%


In [63]:
total_with_children=ins_data[ins_data['children']>0]
average_children=total_with_children['children'].mean()
print("In the total sample, those who had children had an average of "+str(round(average_children, 2))+" children.")

In the total sample, those who had children had an average of 1.92 children.


Now let us see how the regional data compares.

In [66]:
se_no_child=se_data[se_data['children']==0]
ne_no_child=ne_data[ne_data['children']==0]
nw_no_child=nw_data[nw_data['children']==0]
sw_no_child=sw_data[sw_data['children']==0]

se_none_percent=len(se_no_child)/len(se_data)*100
ne_none_percent=len(ne_no_child)/len(ne_data)*100
nw_none_percent=len(nw_no_child)/len(nw_data)*100
sw_none_percent=len(sw_no_child)/len(sw_data)*100

print("In the Southeast the percentage of sample that is childless is: "+str(round(se_none_percent, 2))+"%")
print("In the Northeast the percentage of sample that is childless is: "+str(round(ne_none_percent, 2))+"%")
print("In the Northwest the percentage of sample that is childless is: "+str(round(nw_none_percent, 2))+"%")
print("In the Southwest the percentage of sample that is childless is: "+str(round(sw_none_percent, 2))+"%")

In the Southeast the percentage of sample that is childless is: 43.13%
In the Northeast the percentage of sample that is childless is: 45.37%
In the Northwest the percentage of sample that is childless is: 40.62%
In the Southwest the percentage of sample that is childless is: 42.46%


Compared to the sample as a whole (42.9% childless), each region is within about 2.5 percentage points. The eastern regions have slightly higher childless shares (43.13% and 45.37%) than the western regions (42.46% and 40.62%). Now, for those that did have children, let us see how many they had on average.


In [68]:
se_with=se_data[se_data['children']>0]
ne_with=ne_data[ne_data['children']>0]
nw_with=nw_data[nw_data['children']>0]
sw_with=sw_data[sw_data['children']>0]

se_average_child=se_with['children'].mean()
ne_average_child=ne_with['children'].mean()
nw_average_child=nw_with['children'].mean()
sw_average_child=sw_with['children'].mean()

print("In the Southeast those in the sample that had children had an average of "+str(round(se_average_child, 2))+" children")
print("In the Northeast those in the sample that had children had an average of "+str(round(ne_average_child, 2))+" children")
print("In the Northwest those in the sample that had children had an average of "+str(round(nw_average_child, 2))+" children")
print("In the Southwest those in the sample that had children had an average of "+str(round(sw_average_child, 2))+" children")

In the Southeast those in the sample that had children had an average of 1.85 children
In the Northeast those in the sample that had children had an average of 1.92 children
In the Northwest those in the sample that had children had an average of 1.93 children
In the Southwest those in the sample that had children had an average of 1.98 children
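
Both children measures can be computed per region with `groupby` instead of eight separate filters. This is a sketch under the assumption of a hypothetical `sample` DataFrame with invented values:

```python
import pandas as pd

# Hypothetical rows standing in for ins_data (values are invented).
sample = pd.DataFrame({
    "region": ["northeast", "northeast", "northeast", "southwest", "southwest"],
    "children": [0, 0, 2, 1, 3],
})

# Share of each region's rows with zero children, as a percentage.
childless_percent = (
    sample["children"].eq(0).groupby(sample["region"]).mean() * 100
)

# Average number of children, counting only rows with at least one child.
mean_children_parents = (
    sample[sample["children"] > 0].groupby("region")["children"].mean()
)

print(childless_percent.round(2))
print(mean_children_parents.round(2))
```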
