# U.S. Medical Insurance Costs

In this project I will be analysing a dataset containing information on US medical insurance. This dataset is contained in a csv file (**insurance.csv**), and I will be using Python to help me organise and analyse the contents.

To start, I will import the csv file and store its contents in Python variables for later use. The file is organised into the following seven columns (the header names in the file itself are lowercase, as used in the code below):
* Age
* Sex
* BMI
* Children
* Smoker
* Region
* Charges

In [1]:
#Import library needed for working with csv files
import csv

#Create list variables for storing csv data
ages = []
sexes = []
bmis = []
number_of_children = []
smoker_status = []
regions = []
insurance_costs = []

#Create a dictionary to store all the patient records
insurance_dict = {}

#Read the csv file and save it to the appropriate variables above
with open("insurance.csv", newline="") as insurance_csv:
    csv_dict = csv.DictReader(insurance_csv)
    record_id = 0
    for row in csv_dict:
        #Populate lists (DictReader returns every value as a string)
        ages.append(row["age"])
        sexes.append(row["sex"])
        bmis.append(row["bmi"])
        number_of_children.append(row["children"])
        smoker_status.append(row["smoker"])
        regions.append(row["region"])
        insurance_costs.append(row["charges"])
        #Populate dictionary: one dictionary per patient, keyed by record id
        insurance_dict[record_id] = dict(row)
        record_id += 1

Now that I have extracted the data from the csv file, it is time to start analysing it.
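
Because `csv.DictReader` returns every field as a string, the analysis has to wrap values in `int()` or `float()` each time they are used. As a minimal sketch of an alternative, the values could be converted once while reading. The helper name `parse_record` and the two sample rows are hypothetical and only illustrate the idea; they are not taken from **insurance.csv**.

```python
import csv
import io

# Hypothetical sample in the same column layout as insurance.csv (values are made up)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.5,0,no,southwest,4000.50\n"
    "45,male,31.2,2,yes,northeast,21000.00\n"
)

def parse_record(row):
    """Convert one DictReader row (all strings) into typed values."""
    return {
        "age": int(row["age"]),
        "sex": row["sex"],
        "bmi": float(row["bmi"]),
        "children": int(row["children"]),
        "smoker": row["smoker"] == "yes",  # store smoker status as a boolean
        "region": row["region"],
        "charges": float(row["charges"]),
    }

records = [parse_record(row) for row in csv.DictReader(sample_csv)]
print(records[0]["age"] + 1)  # arithmetic now works without further conversion
```

The trade-off is that comparisons such as `children == "0"` would then need to compare against the integer `0` instead.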

The goal of my analysis is to try and see if there is any correlation between any of the six variables and insurance cost.
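
For the numeric variables (age and BMI), one common way to quantify a relationship with cost is Pearson's correlation coefficient, together with the least-squares slope, which estimates the change in cost per unit of the variable. The sketch below is shown for comparison only: it is not the method used in the next cell, which averages each patient's cost-to-variable ratio. The function names and the four data points are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def least_squares_slope(xs, ys):
    """Slope of the best-fit line y = a + b*x (change in y per unit of x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

def mean_ratio(xs, ys):
    """Average of y/x per record (the ratio-based figure, for comparison)."""
    return sum(y / x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, not taken from insurance.csv
sample_ages = [20, 30, 40, 50]
sample_costs = [3000.0, 5000.0, 7000.0, 9000.0]

print("r =", round(pearson_r(sample_ages, sample_costs), 2))
print("slope =", round(least_squares_slope(sample_ages, sample_costs), 2))
print("mean ratio =", round(mean_ratio(sample_ages, sample_costs), 2))
```

On this made-up data the slope is 200 per year while the mean ratio is about 167.92, which illustrates that the two quantities answer different questions and should not be read interchangeably.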

In [2]:
#Age correlation
#Note: this averages each patient's cost-to-age ratio (cost per year of age).
#It is a rough indicator, not a regression slope or a correlation coefficient.
age_correlation = 0
average_age = 0
for age, costs in zip(ages, insurance_costs):
    average_age += int(age)
    age_correlation += float(costs) / float(age)
age_correlation = round((age_correlation / len(insurance_costs)), 2)
average_age = round((average_age / len(ages)), 2)
print("The average insurance cost increases by ${} every year you get older".format(age_correlation))

#Sex correlation
male_patients = 0
female_patients = 0
average_male_costs = 0
average_female_costs = 0
loop_increment = 0
for sex in sexes:
    if sex == "male":
        male_patients += 1
        average_male_costs += float(insurance_costs[loop_increment])
    elif sex == "female":
        female_patients += 1
        average_female_costs += float(insurance_costs[loop_increment])
    loop_increment += 1
average_male_costs = round((average_male_costs / male_patients), 2)
average_female_costs = round((average_female_costs / female_patients), 2)
print("The average insurance cost for male patients is ${} and the average insurance cost for female patients is ${}".format(average_male_costs, average_female_costs))

#BMI correlation
#Note: like the age figure, this is the mean cost-to-BMI ratio (cost per BMI point),
#a rough indicator rather than a regression slope or a correlation coefficient.
bmi_correlation = 0
average_bmi = 0
for bmi, costs in zip(bmis, insurance_costs):
    average_bmi += float(bmi)
    bmi_correlation += float(costs) / float(bmi)
bmi_correlation = round((bmi_correlation / len(insurance_costs)), 2)
average_bmi = round((average_bmi / len(bmis)), 2)
print("The average insurance cost increases by ${} every time a patient increases their BMI by 1".format(bmi_correlation))

#Number of children correlation
no_children_tally = 0
with_children_tally = 0
average_insurance_costs_no_children = 0
average_insurance_costs_with_children = 0
loop_increment = 0
for children in number_of_children:
    if children == "0":
        no_children_tally += 1
        average_insurance_costs_no_children += float(insurance_costs[loop_increment])
    else:
        with_children_tally += 1
        average_insurance_costs_with_children += float(insurance_costs[loop_increment])
    loop_increment += 1
average_insurance_costs_no_children = round((average_insurance_costs_no_children / no_children_tally), 2)
average_insurance_costs_with_children = round((average_insurance_costs_with_children / with_children_tally), 2)
print("The average insurance cost increases by ${} when a patient has any children".format(round((average_insurance_costs_with_children - average_insurance_costs_no_children), 2)))

#Smoker correlation
smoking_patients = 0
non_smoking_patients = 0
average_smoker_costs = 0
average_non_smoker_costs = 0
loop_increment = 0
for patient in smoker_status:
    if patient == "yes":
        smoking_patients += 1
        average_smoker_costs += float(insurance_costs[loop_increment])
    elif patient == "no":
        non_smoking_patients += 1
        average_non_smoker_costs += float(insurance_costs[loop_increment])
    loop_increment += 1
average_smoker_costs = round((average_smoker_costs / smoking_patients), 2)
average_non_smoker_costs = round((average_non_smoker_costs / non_smoking_patients), 2)
print("The average insurance cost for smoking patients is ${} and for non-smoking patients it is ${}".format(average_smoker_costs, average_non_smoker_costs))

#Region correlation
unique_regions = []
average_regional_costs = 0
loop_increment = 0
for region in regions:
    if region not in unique_regions:
        unique_regions.append(region)
for unique_region in unique_regions:
    average_regional_costs = 0
    loop_increment = 0
    for region in regions:
        if region == unique_region:
            average_regional_costs += float(insurance_costs[loop_increment])
        loop_increment += 1
    average_regional_costs = round((average_regional_costs / regions.count(unique_region)), 2)
    print("The average insurance cost for people living in the {} region is ${}".format(unique_region, average_regional_costs))

The average insurance cost increases by $352.88 every year you get older
The average insurance cost for male patients is $13956.75 and the average insurance cost for female patients is $12569.58
The average insurance cost increases by $434.03 every time a patient increases their BMI by 1
The average insurance cost increases by $1583.96 when a patient has any children
The average insurance cost for smoking patients is $32050.23 and for non-smoking patients it is $8434.27
The average insurance cost for people living in the southwest region is $12346.94
The average insurance cost for people living in the southeast region is $14735.41
The average insurance cost for people living in the northwest region is $12417.58
The average insurance cost for people living in the northeast region is $13406.38


Now that the analysis is done, it is useful to summarise how the dataset is distributed, so that we have some context for how far the results can be trusted.
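
Counting how often each category occurs is the core of such a distribution summary. As a minimal sketch, the standard library's `collections.Counter` can do the tallying. The `share_by_category` helper and the `sample_regions` list are hypothetical and stand in for the lists built earlier.

```python
from collections import Counter

def share_by_category(values):
    """Return {category: (count, percentage)} with categories in sorted order."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, round(count / total * 100, 2))
        for category, count in sorted(counts.items())
    }

# Hypothetical labels, not taken from insurance.csv
sample_regions = ["southwest", "southeast", "southeast", "northwest"]

for region, (count, percentage) in share_by_category(sample_regions).items():
    print("There are {} patients ({}%) in the {} region".format(count, percentage, region))
```

The same helper would work unchanged for any of the categorical lists, such as sex, smoker status, or number of children.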

In [3]:
#Overall average insurance costs for the dataset
average_insurance_costs = 0
for costs in insurance_costs:
    average_insurance_costs += float(costs)
average_insurance_costs = round((average_insurance_costs / len(insurance_costs)), 2)
print("The average insurance costs for the entire dataset is ${}".format(average_insurance_costs))

#Dataset average age
print("The average age of the patients in the dataset is {} years".format(average_age))

#Dataset sex distribution
male_patients_percentage = round((male_patients / len(sexes)) * 100, 2)
female_patients_percentage = round((female_patients / len(sexes)) * 100, 2)
print("There are {} male patients ({}%) and {} female patients ({}%) in the dataset".format(male_patients, male_patients_percentage, female_patients, female_patients_percentage))

#Dataset average BMI
print("The average BMI of the patients in the dataset is {}".format(average_bmi))

#Dataset distribution of number of children for patients
unique_children_dict = {}
for children in number_of_children:
    if children not in unique_children_dict:
        unique_children_dict.update({children: 1})
    elif children in unique_children_dict:
        children_value = unique_children_dict.get(children) + 1
        unique_children_dict.update({children: children_value})
children_dict_ordered_keys = list(unique_children_dict.keys())
children_dict_ordered_keys.sort()
for keys in children_dict_ordered_keys:
    children_percentage = round((unique_children_dict[keys] / len(number_of_children)) * 100, 2)
    if keys == "1":
        print("There are {} patients ({}%) in the dataset with {} child".format(unique_children_dict[keys], children_percentage, keys))
    else:
        print("There are {} patients ({}%) in the dataset with {} children".format(unique_children_dict[keys], children_percentage, keys))

#Dataset smoker and non-smoker distribution
smoker_percentage = round((smoking_patients / len(smoker_status)) * 100, 2)
non_smoker_percentage = round((non_smoking_patients / len(smoker_status)) * 100, 2)
print("There are {} smokers ({}%) and {} non-smokers ({}%) in the dataset".format(smoking_patients, smoker_percentage, non_smoking_patients, non_smoker_percentage))

#Dataset regional distribution
unique_regions_dict = {}
for region in regions:
    if region not in unique_regions_dict:
        unique_regions_dict.update({region: 1})
    elif region in unique_regions_dict:
        unique_regions_value = unique_regions_dict.get(region) + 1
        unique_regions_dict.update({region: unique_regions_value})
regions_dict_ordered_keys = list(unique_regions_dict.keys())
regions_dict_ordered_keys.sort()
for keys in regions_dict_ordered_keys:
    region_percentage = round((unique_regions_dict[keys] / len(regions)) * 100, 2)
    print("There are {} patients ({}%) in the dataset that live in the {} region".format(unique_regions_dict[keys], region_percentage, keys))

The average insurance costs for the entire dataset is $13270.42
The average age of the patients in the dataset is 39.21 years
There are 676 male patients (50.52%) and 662 female patients (49.48%) in the dataset
The average BMI of the patients in the dataset is 30.66
There are 574 patients (42.9%) in the dataset with 0 children
There are 324 patients (24.22%) in the dataset with 1 child
There are 240 patients (17.94%) in the dataset with 2 children
There are 157 patients (11.73%) in the dataset with 3 children
There are 25 patients (1.87%) in the dataset with 4 children
There are 18 patients (1.35%) in the dataset with 5 children
There are 274 smokers (20.48%) and 1064 non-smokers (79.52%) in the dataset
There are 324 patients (24.22%) in the dataset that live in the northeast region
There are 325 patients (24.29%) in the dataset that live in the northwest region
There are 364 patients (27.2%) in the dataset that live in the southeast region
There are 325 patients (24.29%) in the dataset that live in the southwest region

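As the figures above show, the dataset is fairly evenly split by sex and by region, and although smokers make up only 20.48% of the patients, that is still 274 records. This gives us reasonable confidence that the earlier group comparisons are not driven by a handful of records and are likely to hold for the larger population.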

The one exception is the analysis of how insurance costs are affected by the number of children. As shown above, only 25 patients have 4 children and 18 have 5, about 3% of the dataset combined. These groups are too small to support reliable conclusions, and they may have skewed the earlier result on how having children affects insurance costs.
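
One way to make this limitation visible is to report the average cost for each number of children alongside the group size, so that averages resting on very few patients stand out. This is a minimal sketch, assuming the children counts and costs are held in parallel lists of strings as in the cells above. The function name `mean_cost_by_group` and the sample values are hypothetical.

```python
from collections import defaultdict

def mean_cost_by_group(groups, costs):
    """Return {group: (number of records, mean cost)} for two parallel lists."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for group, cost in zip(groups, costs):
        totals[group] += float(cost)
        sizes[group] += 1
    return {
        group: (sizes[group], round(totals[group] / sizes[group], 2))
        for group in sorted(sizes)
    }

# Hypothetical values, not taken from insurance.csv
sample_children = ["0", "0", "1", "5"]
sample_costs = ["1000.0", "3000.0", "2500.0", "9000.0"]

for children, (size, mean_cost) in mean_cost_by_group(sample_children, sample_costs).items():
    print("{} children: {} patients, average cost ${}".format(children, size, mean_cost))
```

Printing the group size next to each average lets the reader judge directly how much weight a figure based on only a few records deserves.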