# U.S. Medical Insurance Costs

In this project, I examine average BMI across several groups in a U.S. medical insurance dataset: different age groups, males vs. females, smokers vs. non-smokers, and the regions represented in the data.


## Setup
Before analyzing the data, I load it into convenient Python variables. To start, I import the modules I need.

In [1]:
import csv
from collections import Counter

Now I read the data from the CSV file and store each column in a separate list.

In [2]:
def create_lists(file, column):
    created_list = []
    with open(file, 'r', newline = '') as dataset:
        insurance_data = csv.DictReader(dataset)
        for entry in insurance_data:
            created_list.append(entry[column])
    return created_list

ages = create_lists('insurance.csv', 'age')
sexes = create_lists('insurance.csv', 'sex')
bmis = create_lists('insurance.csv', 'bmi')
num_children = create_lists('insurance.csv', 'children')
smoker_status = create_lists('insurance.csv', 'smoker')
regions = create_lists('insurance.csv', 'region')
insurance_charges = create_lists('insurance.csv', 'charges')
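
The cell above opens `insurance.csv` once per column. As a rough sketch of an alternative, the columns could be collected in a single pass over the file. The helper name `read_columns` and the two sample rows are hypothetical and invented purely for illustration; they are not taken from the dataset.

```python
import csv
import io


def read_columns(handle, columns):
    """Collect the requested columns from an open CSV file in one pass."""
    data = {column: [] for column in columns}
    for row in csv.DictReader(handle):
        for column in columns:
            data[column].append(row[column])
    return data


# Toy stand-in for insurance.csv (made-up values, same column names).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,northwest,4000.0\n"
    "45,male,31.5,2,yes,southeast,21000.0\n"
)

columns = read_columns(sample_csv, ["age", "bmi", "region"])
print(columns["age"])     # ['30', '45']
print(columns["region"])  # ['northwest', 'southeast']
```

With the real file, the same function could be called inside `with open('insurance.csv', newline='') as dataset:`. The values are still strings at this point, so the numeric conversion step is still needed.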

All of the lists are composed entirely of strings, and because I will be doing calculations, I convert the relevant lists to numeric data types.

In [3]:
# int() and float() parse the numbers without executing arbitrary text the way eval() would
ages = [int(i) for i in ages]
bmis = [float(i) for i in bmis]
num_children = [int(i) for i in num_children]
insurance_charges = [float(i) for i in insurance_charges]

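I check that no data is missing, assuming that a missing value would appear as "NaN" or as an empty string. This assumption may not catch every case, but it is the best I can do short of manual inspection. Note that the numeric lists have already been converted at this point, so for those columns the check can no longer match a string; it is most meaningful for the string columns.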

In [19]:
def check_missing(lst):
    missing_data_indices = []
    for i in range(len(lst)):
        if lst[i] == 'NaN' or lst[i] == '':
            missing_data_indices.append(i)
    return missing_data_indices

print(check_missing(ages))
print(check_missing(sexes))
print(check_missing(bmis))
print(check_missing(num_children))
print(check_missing(smoker_status))
print(check_missing(regions))
print(check_missing(insurance_charges))

[]
[]
[]
[]
[]
[]
[]


Great! There is no missing data!
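
As a hedged sketch, a slightly broader check could also treat `None`, whitespace-only strings, a few common placeholder spellings, and floating-point NaN as missing. The helper names `is_missing` and `find_missing` and the set of placeholder strings are my own assumptions, not something the dataset defines.

```python
import math


def is_missing(value):
    """Return True for None, blank or placeholder strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, str):
        # Assumed placeholder spellings; adjust to match the data source.
        return value.strip().lower() in {"", "nan", "na", "null", "none"}
    return False


def find_missing(values):
    """Return the indices of all entries that look missing."""
    return [index for index, value in enumerate(values) if is_missing(value)]


# Toy list mixing valid and missing-looking entries (illustration only).
sample = ["19", "", " NaN ", "27.9", None, float("nan"), 0]
print(find_missing(sample))  # [1, 2, 4, 5]
```

Running a check like this on the raw string lists, before any numeric conversion, would make the result easier to interpret.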

Most of my calculations will be easier to do by having individual records grouped together, so I create a list of dictionaries, where each dictionary corresponds to one patient.

In [23]:
patient_info = []

for i in range(len(ages)):
    patient_entry = {'age': ages[i], 
                     'sex': sexes[i], 
                     'bmi': bmis[i], 
                     'num_children': num_children[i], 
                     'smoker_status': smoker_status[i], 
                     'region': regions[i], 
                     'insurance_charges': insurance_charges[i]}
    patient_info.append(patient_entry)

Lastly, I will be calculating many averages throughout the remainder of this project, so I define a function that calculates the average of a list, rounded to two decimal places.

In [20]:
def calculate_average(lst):
    total = 0
    for entry in lst:
        total += entry
    return round(total / len(lst), 2)
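
One caveat: `calculate_average` raises a `ZeroDivisionError` if it is given an empty list. A minimal sketch of a guarded variant, using `statistics.mean` from the standard library, is shown below. The name `safe_average` and the choice to return `None` for an empty list are assumptions made for illustration.

```python
from statistics import mean


def safe_average(values, digits=2):
    """Return the rounded mean of values, or None if the list is empty."""
    if not values:
        return None
    return round(mean(values), digits)


print(safe_average([22.5, 30.0, 31.5]))  # 28.0
print(safe_average([]))                  # None
```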

## Age-BMI Comparisons
Now that everything is set up, it's time to start examining some information about BMI averages. Note that I will be using the following ranges from the [CDC](https://www.cdc.gov/healthyweight/assessing/index.html):

BMI < 18.5: Underweight \
18.5 <= BMI < 25.0: Healthy \
25.0 <= BMI < 30.0: Overweight \
BMI >= 30.0: Obese
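
To make the ranges concrete, here is a small sketch of a classifier. It treats each upper bound as exclusive, so that a value such as 24.95 is not left without a category. The function name `bmi_category` is hypothetical and is not used elsewhere in this notebook.

```python
def bmi_category(bmi):
    """Map a BMI value to one of the four categories listed above."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Healthy"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"


for value in (17.0, 18.5, 24.95, 29.9, 30.0):
    print(value, bmi_category(value))
```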

To start, I will find the average BMI for people in various age ranges. First, I find the range of ages in the data.

In [21]:
min_age = min(ages)
max_age = max(ages)

print(f'We have data on people from ages {min_age} to {max_age}.')

We have data on people from ages 18 to 64.


Based on this, I have chosen the following age ranges to examine:

18-27 \
28-36 \
37-45 \
46-54 \
55-64

So I break up the BMI data into lists according to those ranges.

In [25]:
bmi_18_27 = []
bmi_28_36 = []
bmi_37_45 = []
bmi_46_54 = []
bmi_55_64 = []

for patient in patient_info:
    if patient['age'] <= 27:
        bmi_18_27.append(patient['bmi'])
    elif patient['age'] <= 36:
        bmi_28_36.append(patient['bmi'])
    elif patient['age'] <= 45:
        bmi_37_45.append(patient['bmi'])
    elif patient['age'] <= 54:
        bmi_46_54.append(patient['bmi'])
    else:
        bmi_55_64.append(patient['bmi'])

Now that I have all of the BMIs sorted, it's time to calculate averages.

In [26]:
average_18_27 = calculate_average(bmi_18_27)
average_28_36 = calculate_average(bmi_28_36)
average_37_45 = calculate_average(bmi_37_45)
average_46_54 = calculate_average(bmi_46_54)
average_55_64 = calculate_average(bmi_55_64)

print(f'The average BMI for patients ages 18 to 27 is {average_18_27}.')
print(f'The average BMI for patients ages 28 to 36 is {average_28_36}.')
print(f'The average BMI for patients ages 37 to 45 is {average_37_45}.')
print(f'The average BMI for patients ages 46 to 54 is {average_46_54}.')
print(f'The average BMI for patients ages 55 to 64 is {average_55_64}.')

The average BMI for patients ages 18 to 27 is 29.91.
The average BMI for patients ages 28 to 36 is 30.34.
The average BMI for patients ages 37 to 45 is 30.33.
The average BMI for patients ages 46 to 54 is 31.3.
The average BMI for patients ages 55 to 64 is 31.76.


From this we see that average BMI generally increases with age (the 28-36 and 37-45 groups are essentially tied), and that every group from 28-36 onward has an average BMI above the obesity threshold of 30.0.
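
The age grouping could also be driven by a table of ranges rather than a chain of `elif` branches, which makes it easier to change the ranges later. The sketch below assumes patient dictionaries with `'age'` and `'bmi'` keys as in `patient_info`. The names `AGE_RANGES`, `age_range_label`, and `group_bmis_by_age` are hypothetical, and the toy patients are invented for illustration.

```python
# Inclusive (low, high) age ranges chosen earlier in this section.
AGE_RANGES = [(18, 27), (28, 36), (37, 45), (46, 54), (55, 64)]


def age_range_label(age):
    """Return a label such as '28-36' for the range containing age."""
    for low, high in AGE_RANGES:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the ranges examined")


def group_bmis_by_age(patients):
    """Build a dict mapping each age-range label to a list of BMIs."""
    groups = {f"{low}-{high}": [] for low, high in AGE_RANGES}
    for patient in patients:
        groups[age_range_label(patient["age"])].append(patient["bmi"])
    return groups


# Toy records (made-up values) shaped like the entries in patient_info.
toy_patients = [
    {"age": 18, "bmi": 24.0},
    {"age": 27, "bmi": 26.0},
    {"age": 28, "bmi": 31.0},
    {"age": 64, "bmi": 33.0},
]

groups = group_bmis_by_age(toy_patients)
print(groups["18-27"])  # [24.0, 26.0]
print(groups["28-36"])  # [31.0]
```

Unlike the final `else` branch, this version raises an error for an age outside every range instead of silently placing it in the oldest group.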

## Sex-BMI Comparisons
Now I will perform similar calculations for males vs females.

In [27]:
male_bmi = []
female_bmi = []

for entry in patient_info:
    if entry['sex'].upper() == "MALE":
        male_bmi.append(entry['bmi'])
    else:
        female_bmi.append(entry['bmi'])

Now for the averages.

In [29]:
average_male = calculate_average(male_bmi)
average_female = calculate_average(female_bmi)

print(f'The average male has a BMI of {average_male}.')
print(f'The average female has a BMI of {average_female}.')

The average male has a BMI of 30.94.
The average female has a BMI of 30.38.


It seems that males and females have relatively similar BMI averages, with the average female BMI slightly lower (30.38 vs. 30.94).

## Smoker-BMI Comparisons
Continuing on, I compare smokers vs non-smokers.

In [30]:
smoker_bmi = []
non_smoker_bmi = []

for entry in patient_info:
    if entry['smoker_status'].upper() == "YES":
        smoker_bmi.append(entry['bmi'])
    else:
        non_smoker_bmi.append(entry['bmi'])

And again, the averages:

In [31]:
average_smoker = calculate_average(smoker_bmi)
average_non_smoker = calculate_average(non_smoker_bmi)

print(f'The BMI of the average smoker is {average_smoker}.')
print(f'The BMI of the average non-smoker is {average_non_smoker}.')

The BMI of the average smoker is 30.71.
The BMI of the average non-smoker is 30.65.


Again, there is fairly little difference between smokers and non-smokers. At 0.06, the gap is even smaller than the 0.56 gap in the male and female comparison.
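
The sex and smoker comparisons follow the same pattern: split the BMIs by one field, then average each group. A generic helper could cover both, sketched here under the assumption that each patient is a dictionary with a `'bmi'` key, as in `patient_info`. The name `average_bmi_by` and the toy records are hypothetical.

```python
from collections import defaultdict


def average_bmi_by(patients, key):
    """Return {group value: average BMI rounded to 2 places} for one field."""
    grouped = defaultdict(list)
    for patient in patients:
        grouped[patient[key]].append(patient["bmi"])
    return {
        group: round(sum(values) / len(values), 2)
        for group, values in grouped.items()
    }


# Toy records (made-up values) shaped like the entries in patient_info.
toy_patients = [
    {"sex": "male", "smoker_status": "yes", "bmi": 30.0},
    {"sex": "male", "smoker_status": "no", "bmi": 32.0},
    {"sex": "female", "smoker_status": "no", "bmi": 28.0},
]

print(average_bmi_by(toy_patients, "sex"))            # {'male': 31.0, 'female': 28.0}
print(average_bmi_by(toy_patients, "smoker_status"))  # {'yes': 30.0, 'no': 30.0}
```

Because the groups are discovered from the data, an unexpected value shows up as its own group rather than being absorbed by an `else` branch.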

## Region-BMI Comparisons
Lastly, I will examine if there is any noticeable difference in average BMI across the regions represented in the data. To start, I need to figure out what the different regions are.

In [32]:
region_counter = Counter(regions)

print(region_counter)

Counter({'southeast': 364, 'southwest': 325, 'northwest': 325, 'northeast': 324})


It seems there are four regions, and fortunately, there are a similar number of patients in each region (with the southeast slightly more represented).
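
To put "slightly more represented" in numbers, the counts can be turned into percentage shares. This sketch re-creates the counter from the output printed above so that it runs on its own; the name `region_shares` is an assumption.

```python
from collections import Counter

# Counts copied from the Counter output printed above.
region_counter = Counter(
    {"southeast": 364, "southwest": 325, "northwest": 325, "northeast": 324}
)

total_patients = sum(region_counter.values())
region_shares = {
    region: round(100 * count / total_patients, 1)
    for region, count in region_counter.items()
}

print(total_patients)  # 1338
print(region_shares)
# {'southeast': 27.2, 'southwest': 24.3, 'northwest': 24.3, 'northeast': 24.2}
```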

From here, everything is similar to before.

In [33]:
southeast_bmi = []
southwest_bmi = []
northwest_bmi = []
northeast_bmi = []

for entry in patient_info:
    if entry['region'] == 'southeast':
        southeast_bmi.append(entry['bmi'])
    elif entry['region'] == 'southwest':
        southwest_bmi.append(entry['bmi'])
    elif entry['region'] == 'northeast':
        northeast_bmi.append(entry['bmi'])
    else:
        northwest_bmi.append(entry['bmi'])

In [35]:
average_southeast = calculate_average(southeast_bmi)
average_southwest = calculate_average(southwest_bmi)
average_northeast = calculate_average(northeast_bmi)
average_northwest = calculate_average(northwest_bmi)

print(f'The average BMI of a patient in the Southeastern United States is {average_southeast}.')
print(f'The average BMI of a patient in the Southwestern United States is {average_southwest}.')
print(f'The average BMI of a patient in the Northeastern United States is {average_northeast}.')
print(f'The average BMI of a patient in the Northwestern United States is {average_northwest}.')

The average BMI of a patient in the Southeastern United States is 33.36.
The average BMI of a patient in the Southwestern United States is 30.6.
The average BMI of a patient in the Northeastern United States is 29.17.
The average BMI of a patient in the Northwestern United States is 29.2.


Interestingly, a much different picture is shown here. The average BMI in the northeast (29.17) and northwest (29.2) falls within the overweight range, while the southwest (30.6) sits just above the obesity threshold. The southeast (33.36), however, is noticeably higher than any other region.
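
The size of the gap can be made explicit with a little arithmetic on the averages printed above. This is only a sketch using those four reported values; the names `regional_average_bmi` and `gaps` are hypothetical.

```python
# Regional averages copied from the output printed above.
regional_average_bmi = {
    "southeast": 33.36,
    "southwest": 30.6,
    "northeast": 29.17,
    "northwest": 29.2,
}

# Find the region with the highest average, then its lead over each other region.
highest_region = max(regional_average_bmi, key=regional_average_bmi.get)
gaps = {
    region: round(regional_average_bmi[highest_region] - average, 2)
    for region, average in regional_average_bmi.items()
    if region != highest_region
}

print(highest_region)  # southeast
print(gaps)            # {'southwest': 2.76, 'northeast': 4.19, 'northwest': 4.16}
```

These are differences between group averages only; they do not by themselves say whether the gap is statistically significant.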

## Conclusions
I believe the most interesting finding is that the average BMI of patients from the southeast sits noticeably higher than that of the other regions. It would be interesting to pursue further analysis of the southeast region based on this information, including how its average insurance costs compare to regional or national averages.