# U.S. Medical Insurance Costs

### Project Goals

In this project I chose to focus on how smoking status relates to other variables in the medical insurance dataset. 

The specific questions I analyze are:
- Difference in average insurance cost between non-smokers and smokers
- Percentage of smokers in each region
- Proportion of smokers among individuals of different BMI levels
- Proportion of non-smokers and smokers among individuals with one or more children, and among individuals without children

### Importing the dataset and saving to Python variables

In [2]:
# Importing the insurance dataset
import csv
with open('insurance.csv') as insurance_data:
    # Read the raw text once to confirm the file opens; the contents are not used later
    raw_data = insurance_data.read()


In [3]:
# Saving the dataset to Python variables (one list per column; values are still strings)
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

with open('insurance.csv') as insurance_data:
    insurance_dict = csv.DictReader(insurance_data, delimiter = ',')
    for item in insurance_dict:
        age.append(item['age'])
        sex.append(item['sex'])
        bmi.append(item['bmi'])
        children.append(item['children'])
        smoker.append(item['smoker'])
        region.append(item['region'])
        charges.append(item['charges'])
        
def create_table(age, sex, bmi, children, smoker, region, charges):
    # Build one record per individual, keyed by row index, converting numeric fields from strings
    new_dictionary = {}
    for i in range(len(age)):
        new_dictionary[i] = {'age': int(age[i]),
                             'sex': sex[i],
                             'bmi': float(bmi[i]),
                             'children': int(children[i]),
                             'smoker': smoker[i],
                             'region': region[i],
                             'charges': float(charges[i])}
    return new_dictionary

insurance_dictionary = create_table(age, sex, bmi, children, smoker, region, charges)
#print(insurance_dictionary) 
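
The loading step above can also be done in a single pass. The following is a minimal sketch, not the code used for the results in this project: it converts each field while reading, and it reads from a small in-memory sample so that it runs without `insurance.csv`. The rows in `sample_csv` are made-up values and `load_records` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample rows with the same columns as insurance.csv (values are made up)
sample_csv = """age,sex,bmi,children,smoker,region,charges
30,female,22.5,0,yes,southwest,15000.0
45,male,31.2,2,no,northeast,7000.5
52,female,27.8,1,no,southeast,9000.25
"""

def load_records(file_obj):
    """Read CSV rows into a list of dictionaries with numeric fields converted."""
    records = []
    for row in csv.DictReader(file_obj):
        records.append({
            'age': int(row['age']),
            'sex': row['sex'],
            'bmi': float(row['bmi']),
            'children': int(row['children']),
            'smoker': row['smoker'],
            'region': row['region'],
            'charges': float(row['charges']),
        })
    return records

records = load_records(io.StringIO(sample_csv))
print(records[0])
```

With the real file, the same function could be given the file object opened by `open('insurance.csv', newline='')` inside a `with` block.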


### Analysis 

In [4]:
# Finding the difference in average insurance cost between non-smokers and smokers

# Begin by counting the non-smokers and smokers in the dataset and summing each group's insurance costs
non_smokers_sum = 0
non_smokers_len = 0
smokers_sum = 0
smokers_len = 0
for record in insurance_dictionary.values():
    cost = float(record['charges'])  # float() also accepts a value that is already a float
    if record['smoker'] == 'no':
        non_smokers_sum += cost
        non_smokers_len += 1
    elif record['smoker'] == 'yes':
        smokers_sum += cost
        smokers_len += 1

# Use the sums and number of individuals above to find the averages and the difference

non_smokers_average = round(non_smokers_sum / non_smokers_len, 2)
print("The average insurance cost for non-smokers is " + str(non_smokers_average)+ " dollars.")
smokers_average = round(smokers_sum / smokers_len, 2)
print("The average insurance cost for smokers is " + str(smokers_average)+ " dollars.")
diff_average = abs(non_smokers_average - smokers_average)
print("The difference in average insurance costs between non-smokers and smokers is " + str(diff_average) + " dollars.")


The average insurance cost for non-smokers is 8434.27 dollars.
The average insurance cost for smokers is 32050.23 dollars.
The difference in average insurance costs between non-smokers and smokers is 23615.96 dollars.
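
The same calculation can be expressed more compactly with the standard library. This is a minimal sketch under stated assumptions: the `(smoker, charges)` pairs in `sample` are made up for illustration, and `average_cost` is a hypothetical helper, not part of the analysis above. It rounds only the final difference rather than each average.

```python
from statistics import mean

# Hypothetical (smoker, charges) pairs; values are made up for illustration
sample = [('yes', 30000.0), ('no', 8000.0), ('yes', 34000.0),
          ('no', 9000.0), ('no', 7000.0)]

def average_cost(pairs, status):
    """Average charges for rows whose smoker flag equals status."""
    return mean(cost for flag, cost in pairs if flag == status)

smokers_avg = average_cost(sample, 'yes')
non_smokers_avg = average_cost(sample, 'no')
difference = round(smokers_avg - non_smokers_avg, 2)
print(difference)
```

`statistics.mean` raises `StatisticsError` on an empty group, which makes a missing category visible instead of producing a division-by-zero error.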


In [6]:
# Finding the percentage of smokers in each region

# Find the unique regions included in this dataset 
included_regions = []
for x in region:
    if x not in included_regions:
        included_regions.append(x)
#print(included_regions)
#output: southwest, southeast, northwest, northeast

prop_smokers_southwest = round((sum(1 for i in insurance_dictionary if region[i]=='southwest' and smoker[i]=='yes')/region.count('southwest'))*100, 2)
prop_smokers_southeast = round((sum(1 for i in insurance_dictionary if region[i]=='southeast' and smoker[i]=='yes')/region.count('southeast'))*100, 2)
prop_smokers_northwest = round((sum(1 for i in insurance_dictionary if region[i]=='northwest' and smoker[i]=='yes')/region.count('northwest'))*100, 2)
prop_smokers_northeast = round((sum(1 for i in insurance_dictionary if region[i]=='northeast' and smoker[i]=='yes')/region.count('northeast'))*100, 2)

print("In the Southwest region, " + str(prop_smokers_southwest) + " percent of recorded individuals are smokers.")
print("In the Southeast region, " + str(prop_smokers_southeast) + " percent of recorded individuals are smokers.")
print("In the Northwest region, " + str(prop_smokers_northwest) + " percent of recorded individuals are smokers.")
print("In the Northeast region, " + str(prop_smokers_northeast) + " percent of recorded individuals are smokers.")

In the Southwest region, 17.85 percent of recorded individuals are smokers.
In the Southeast region, 25.0 percent of recorded individuals are smokers.
In the Northwest region, 17.85 percent of recorded individuals are smokers.
In the Northeast region, 20.68 percent of recorded individuals are smokers.
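
The four near-identical lines above could be replaced by one pass over the data. The sketch below uses `collections.Counter` for this; the `(region, smoker)` pairs in `sample` are made up for illustration and do not come from the dataset.

```python
from collections import Counter

# Hypothetical (region, smoker) pairs; values are made up for illustration
sample = [('southwest', 'yes'), ('southwest', 'no'), ('southeast', 'yes'),
          ('southeast', 'no'), ('southeast', 'no'), ('southeast', 'no')]

# Count all individuals and smokers per region
totals = Counter(region_name for region_name, _ in sample)
smokers = Counter(region_name for region_name, flag in sample if flag == 'yes')

# A region with no smokers yields 0 because Counter returns 0 for missing keys
percent_smokers = {name: round(smokers[name] / totals[name] * 100, 2)
                   for name in totals}

for name, percent in percent_smokers.items():
    print("In the " + name.capitalize() + " region, " + str(percent)
          + " percent of recorded individuals are smokers.")
```

This version does not need the region names to be known in advance, so it would keep working if the dataset gained a region.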


In [35]:
# Finding the percentage of smokers in each BMI range
# bmi holds strings, so convert to float before comparing values
#print(min(float(b) for b in bmi)) output: 15.96
#print(max(float(b) for b in bmi)) output: 53.13

# Sort individuals into BMI bins
def sortby_bmi(insurance_dictionary):
    # Bin k (k = 1..4) holds BMI values greater than bmi_scale[k-1] and up to bmi_scale[k];
    # bin 0 holds only a BMI of exactly 15.0 and is empty for this dataset (minimum BMI is 15.96)
    bmi_scale = {0:15.0, 1:25.0, 2:35.0, 3:45.0, 4:55.0}
    individuals_by_bmi = {0:[], 1:[], 2:[], 3:[], 4:[]}
    for i in insurance_dictionary:
        person = insurance_dictionary[i]
        if person['bmi'] == bmi_scale[0]:
            individuals_by_bmi[0].append(person)
            continue
        for k in range(1, 5):
            if bmi_scale[k-1] < person['bmi'] <= bmi_scale[k]:
                individuals_by_bmi[k].append(person)
                break
    return individuals_by_bmi 
sorted_by_bmi = sortby_bmi(insurance_dictionary)
#print(sorted_by_bmi)

# Find percentage of smokers in each bin of the bmi scale

prop_smokers_bmi1 = round((sum(1 for i in sorted_by_bmi[1] if i['smoker']=='yes')/len(sorted_by_bmi[1]))*100, 2)
print("Of individuals with a BMI between 15.0 and 25.0, " + str(prop_smokers_bmi1) + " percent are smokers.")

prop_smokers_bmi2 = round((sum(1 for i in sorted_by_bmi[2] if i['smoker']=='yes')/len(sorted_by_bmi[2]))*100, 2)
print("Of individuals with a BMI between 25.0 and 35.0, " + str(prop_smokers_bmi2) + " percent are smokers.")

prop_smokers_bmi3 = round((sum(1 for i in sorted_by_bmi[3] if i['smoker']=='yes')/len(sorted_by_bmi[3]))*100, 2)
print("Of individuals with a BMI between 35.0 and 45.0, " + str(prop_smokers_bmi3) + " percent are smokers.")

prop_smokers_bmi4 = round((sum(1 for i in sorted_by_bmi[4] if i['smoker']=='yes')/len(sorted_by_bmi[4]))*100, 2)
print("Of individuals with a BMI between 45.0 and 55.0, " + str(prop_smokers_bmi4) + " percent are smokers.")


Of individuals with a BMI between 15.0 and 25.0, 22.27 percent are smokers.
Of individuals with a BMI between 25.0 and 35.0, 19.1 percent are smokers.
Of individuals with a BMI between 35.0 and 45.0, 22.3 percent are smokers.
Of individuals with a BMI between 45.0 and 55.0, 25.0 percent are smokers.
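
The bin assignment used above (each bin excludes its lower edge and includes its upper edge) can also be written with the standard `bisect` module. This is a sketch of that alternative, assuming the same bin edges; `bmi_bin` and `upper_edges` are hypothetical names introduced here.

```python
from bisect import bisect_left

# Upper edges of the bins (15, 25], (25, 35], (35, 45], (45, 55]
upper_edges = [25.0, 35.0, 45.0, 55.0]

def bmi_bin(bmi_value):
    """Return the bin number (1 to 4) for a BMI greater than 15.0 and at most 55.0."""
    if not 15.0 < bmi_value <= 55.0:
        raise ValueError("BMI outside the binned range")
    # bisect_left finds the first upper edge that is >= bmi_value
    return bisect_left(upper_edges, bmi_value) + 1

for value in (15.96, 25.0, 25.1, 53.13):
    print(value, bmi_bin(value))
```

Raising an error for out-of-range values is a design choice made for this sketch; it prevents individuals from being silently dropped if the data ever contains a BMI outside the bins.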


In [53]:
# Finding the proportion of smokers among individuals with children and without children
def sortby_children(insurance_dictionary):
    new_dict = {'parent': [], 'childless':[]}
    for i in insurance_dictionary:
        if insurance_dictionary[i]['children'] >= 1:
            new_dict['parent'].append(insurance_dictionary[i])
        elif insurance_dictionary[i]['children'] == 0:
            new_dict['childless'].append(insurance_dictionary[i])
    return new_dict
sorted_by_children = sortby_children(insurance_dictionary)
#print(sorted_by_children)

prop_smokers_parent = round((sum(1 for i in sorted_by_children['parent'] if i['smoker']=='yes')/len(sorted_by_children['parent']))*100, 2)
print("Among individuals with 1 or more children, " + str(prop_smokers_parent) + " percent are smokers.")

prop_smokers_childless = round((sum(1 for i in sorted_by_children['childless'] if i['smoker']=='yes')/len(sorted_by_children['childless']))*100, 2)
print("Among individuals with no children, " + str(prop_smokers_childless) + " percent are smokers.")



Among individuals with 1 or more children, 20.81 percent are smokers.
Among individuals with no children, 20.03 percent are smokers.
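
The grouping step follows the same pattern for any grouping rule, so it can be generalized into one helper that takes the rule as a function. The following is a sketch under stated assumptions: `percent_smokers_by` is a hypothetical helper, and the records in `sample` are made up for illustration.

```python
from collections import defaultdict

def percent_smokers_by(records, group_of):
    """Percentage of smokers per group; group_of maps a record to its group label."""
    counts = defaultdict(lambda: [0, 0])  # label -> [smokers, total]
    for record in records:
        label = group_of(record)
        counts[label][1] += 1
        if record['smoker'] == 'yes':
            counts[label][0] += 1
    return {label: round(smokers / total * 100, 2)
            for label, (smokers, total) in counts.items()}

# Hypothetical records; values are made up for illustration
sample = [
    {'children': 0, 'smoker': 'yes'},
    {'children': 0, 'smoker': 'no'},
    {'children': 2, 'smoker': 'no'},
    {'children': 1, 'smoker': 'no'},
    {'children': 3, 'smoker': 'yes'},
    {'children': 1, 'smoker': 'no'},
]

result = percent_smokers_by(
    sample, lambda r: 'parent' if r['children'] >= 1 else 'childless')
print(result)
```

Passing a different function for `group_of` would group by any other field of the record without changing the helper.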


### Results

The findings from the analysis performed in this project are as follows:

- Insurance costs for smokers are substantially higher than for non-smokers. The average cost is 32050.23 dollars for smokers and 8434.27 dollars for non-smokers, a difference of 23615.96 dollars, so smokers pay roughly 3.8 times as much on average.
- The percentage of smokers varies by region. The Northwest and Southwest have the lowest percentage of smokers, both at 17.85%. The Southeast has the highest at 25%, and the Northeast falls in between at 20.68%.
- The percentage of smokers shows no consistent trend across BMI ranges. It is 22.27% for a BMI between 15.0 and 25.0, 19.1% between 25.0 and 35.0, 22.3% between 35.0 and 45.0, and 25% between 45.0 and 55.0. The highest bin has the highest percentage of smokers, but the percentage does not rise steadily with BMI, so these results do not indicate a clear relationship between smoking status and BMI.
- About one-fifth of individuals in this dataset are smokers regardless of whether they have children (20.81% of those with one or more children and 20.03% of those without). Smoking status and having children therefore appear unrelated in this dataset.