# U.S. Medical Insurance Costs

## Project info

**Data file: insurance.csv**

**Headers:**
- age: int
- sex: str > female / male
- bmi: float
- children: int
- smoker: str > yes / no
- region: str > southwest / southeast / northwest / northeast
- charges: float

## Scope

**Possible questions:**
- Average price?
- Average price for males vs females?
- Average price for each region?
- Average price and difference for smokers vs non-smokers?
- Difference in price for youngest vs oldest?
- Average age?
- How many patients are female & male?
- How many patients for each region?
- Average age for people with at least 1 child?
- Average age for smokers vs non-smokers?

## Work
### Importing Data

First, we'll import the `csv` module so that we can work with the CSV file:

In [6]:
import csv

Next, let's create empty lists for each of the file's columns:

In [8]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

Now, we need a function that will let us extract each column from the csv file into a list.

In [10]:
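def csv_to_list(column_values, file, column):
    # Append every value found under `column` in the CSV `file` to `column_values`.
    with open(file, newline="") as csv_file:
        csv_data = csv.DictReader(csv_file)
        for patient in csv_data:
            column_values.append(patient[column])
    return column_values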

Function Test:

In [12]:
# age = csv_to_list(age, "insurance.csv", "age")
# print (age)

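After testing, everything works correctly. Note that leaving the test uncommented would append the ages to `age` a second time once we populate the lists below.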
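The function above opens the file once for every column we extract. As a minimal sketch of an alternative, the whole file could be read in a single pass into one dictionary of lists. The name `read_columns` and the two sample rows are made up for illustration, and an in-memory `io.StringIO` stands in for `insurance.csv`:

```python
import csv
import io


def read_columns(csv_file):
    """Read every column of an open CSV file in one pass.

    Returns a dict that maps each header name to a list of string values.
    """
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list.
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


# Made-up sample rows using the same headers as insurance.csv.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,0,no,southwest,4000.0\n"
    "45,male,31.5,2,yes,northeast,21000.0\n"
)
columns = read_columns(sample)
print(columns["age"])
print(columns["region"])
```

With a real file, the same function could be called inside `with open("insurance.csv", newline="") as f:`.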

Let's now populate our lists:

In [14]:
age = csv_to_list(age, "insurance.csv", "age")
sex = csv_to_list(sex, "insurance.csv", "sex")
bmi = csv_to_list(bmi, "insurance.csv", "bmi")
children = csv_to_list(children, "insurance.csv", "children")
smoker = csv_to_list(smoker, "insurance.csv", "smoker")
region = csv_to_list(region, "insurance.csv", "region")
charges = csv_to_list(charges, "insurance.csv", "charges")

We can also make a dictionary and assign an ID no. to each patient:

In [16]:
patient_dict = {}
for i in range(len(age)):
    patient_dict[i] = {"age": age[i], "sex": sex[i], "bmi": bmi[i], "children": children[i], "smoker": smoker[i], "region": region[i], "charges": charges[i]}
# print(patient_dict)


We now have orderly access to everything in the CSV file, so let's start answering some questions.
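One thing to keep in mind: `csv.DictReader` returns every value as a string, which is why the analysis functions call `int()` or `float()` each time they read a number. A possible alternative, sketched here under the assumption that every row is complete and well formed, is to convert each record once. The helper name `to_typed_record` and the sample row are hypothetical:

```python
def to_typed_record(row):
    """Convert one row of strings into a record with numeric fields."""
    return {
        "age": int(row["age"]),
        "sex": row["sex"],
        "bmi": float(row["bmi"]),
        "children": int(row["children"]),
        "smoker": row["smoker"],
        "region": row["region"],
        "charges": float(row["charges"]),
    }


# Made-up row, shaped like the values csv.DictReader would produce.
raw_row = {
    "age": "30", "sex": "female", "bmi": "25.0", "children": "0",
    "smoker": "no", "region": "southwest", "charges": "4000.0",
}
record = to_typed_record(raw_row)
print(record["age"] + 1)       # arithmetic works without an extra int() call
print(record["charges"] / 12)  # same for the float columns
```

A malformed value would raise `ValueError` at conversion time, which surfaces bad data early instead of in the middle of an analysis function.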

### Analysis

After manipulating the data and some thinking, these are the questions we'll be answering:

1. Average insurance cost?
2. How many patients in our dataset are male and how many female?
3. What are the unique regions in our dataset?
4. How many of our patients are smokers vs non-smokers?
5. What's the average age?
6. Average cost for males and females? Difference?
7. Average cost for each region?
8. Average cost for smokers and non-smokers? Difference?
9. Average age for people with at least 1 child?
10. Average age for smokers and non-smokers?

#### 1. Average insurance cost?

In [20]:
def avg_insurance_cost (dict):
    total_cost = 0
    for patient in dict:
        total_cost += float(dict[patient]["charges"])
    average = total_cost / len(dict)
    return round(average, 2)

In [21]:
avg_cost = avg_insurance_cost(patient_dict)
print ("The average insurance cost for our dataset is: ${} USD per year.".format(avg_cost))

The average insurance cost for our dataset is: $13270.42 USD per year.


After calculating the average, I realized I missed an important piece of data: our dataset size. I already knew this because I manually inspected the CSV file. But I believe I should include it here as well.

In [23]:
dataset_size = len(patient_dict)
print("Our dataset contains the info for {} patients.".format(dataset_size))

Our dataset contains the info for 1338 patients.
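The average above is a running total divided by the dataset size. As a sketch of a more compact way to write the same calculation, the standard library's `statistics.fmean` could be used. The function name `average_charges` and the three-patient dictionary are made up for illustration and only mirror the shape of `patient_dict`:

```python
from statistics import fmean


def average_charges(patients):
    """Return the mean of the 'charges' field, rounded to 2 decimals."""
    charges = [float(record["charges"]) for record in patients.values()]
    return round(fmean(charges), 2)


# Made-up patients with the same structure as patient_dict.
sample_patients = {
    0: {"charges": "100.50"},
    1: {"charges": "200.25"},
    2: {"charges": "300.00"},
}
print(average_charges(sample_patients))  # 200.25
print(len(sample_patients))              # dataset size: 3
```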


#### 2. How many patients in our dataset are male and how many female?

In [25]:
def gender_totals(dict):
    male_count = 0
    female_count = 0
    missing_data = False
    for patient in dict:
        if dict[patient]["sex"] == "male":
            male_count += 1
        elif dict[patient]["sex"] == "female":
            female_count += 1
        else:
            missing_data = True
    if missing_data:
        print("Some data is missing. Please verify.")
    return male_count, female_count

In [26]:
male_patients, female_patients = gender_totals(patient_dict)
print ("Our dataset has {} males and {} females.".format(male_patients, female_patients))

Our dataset has 676 males and 662 females.


This shows that our dataset is well balanced between males and females: 676 of 1338 patients (about 50.5%) are male and 662 (about 49.5%) are female.
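The counting above can also be sketched with `collections.Counter`, which tallies every distinct value it sees. Any unexpected or missing value then shows up as its own key instead of needing a separate flag. The sample data below is made up:

```python
from collections import Counter

# Made-up patients with the same structure as patient_dict.
sample_patients = {
    0: {"sex": "male"},
    1: {"sex": "female"},
    2: {"sex": "male"},
    3: {"sex": ""},  # simulated missing value
}

sex_counts = Counter(record["sex"] for record in sample_patients.values())
print(sex_counts["male"], sex_counts["female"])

# Anything other than the two expected labels signals data to verify.
unexpected = set(sex_counts) - {"male", "female"}
if unexpected:
    print("Some data is missing or unexpected:", unexpected)
```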

#### 3. What are the unique regions in our dataset?

In [29]:
def unique_regions(dict):
    # Collect each region once, in order of first appearance.
    region_list = []
    for patient in dict:
        region = dict[patient]["region"]
        if region not in region_list:
            region_list.append(region)
    return region_list

In [30]:
dataset_regions = unique_regions(patient_dict)
print ("The unique regions in our dataset are:", dataset_regions)

The unique regions in our dataset are: ['southwest', 'southeast', 'northwest', 'northeast']
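For reference, the same "unique values in order of first appearance" result can be sketched with `dict.fromkeys`, since dictionaries keep insertion order and drop duplicate keys. The list below is a made-up sample:

```python
# Made-up region values with duplicates.
regions = ["southwest", "southeast", "southeast", "northwest", "southwest"]

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance.
unique = list(dict.fromkeys(regions))
print(unique)  # ['southwest', 'southeast', 'northwest']
```

A plain `set(regions)` would also remove duplicates, but it would not guarantee the order.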


#### 4. How many of our patients are smokers vs non-smokers?

In [32]:
def total_smokers(dict):
    smokers = 0
    non_smokers = 0
    for patient in dict:
        if dict[patient]["smoker"] == "yes":
            smokers += 1
        elif dict[patient]["smoker"] == "no":
            non_smokers += 1
    return smokers, non_smokers


In [33]:
smoker_patients, non_smoker_patients = total_smokers(patient_dict)
print("We have {} smokers and {} non smokers in our dataset.".format(smoker_patients, non_smoker_patients))

We have 274 smokers and 1064 non smokers in our dataset.
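Raw counts are easier to compare as percentages. A small sketch, using the two counts printed above and a hypothetical helper named `share`:

```python
def share(count, total):
    """Return count as a percentage of total, rounded to 2 decimals."""
    return round(count / total * 100, 2)


smokers, non_smokers = 274, 1064  # counts printed above
total = smokers + non_smokers     # 1338 patients

print(share(smokers, total))      # 20.48
print(share(non_smokers, total))  # 79.52
```

So roughly one in five patients in our dataset is a smoker.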


#### 5. What's the average age for our patients?

In [35]:
def calculate_avg_age (dict):
    total_age = 0
    for patient in dict:
        total_age += int(dict[patient]["age"])
    average = total_age / len(dict)
    return round(average, 2)

In [36]:
avg_age = calculate_avg_age(patient_dict)
print ("The average patient age in our dataset is: {} years.".format(avg_age))

The average patient age in our dataset is: 39.21 years.


#### 6. Average cost for males and females? Difference?

In [53]:
def avg_cost_male_vs_female(dict):
    male_count = 0
    male_total_cost = 0
    female_count = 0
    female_total_cost = 0
    for patient in dict:
        if dict[patient]["sex"] == "male":
            male_count += 1
            male_total_cost += float(dict[patient]["charges"])
        elif dict[patient]["sex"] == "female":
            female_count += 1
            female_total_cost += float(dict[patient]["charges"])
    male_avg = male_total_cost / male_count
    female_avg = female_total_cost / female_count
    return round(male_avg, 2), round(female_avg, 2)

def m_vs_f_cost_difference(male_cost, female_cost):
    difference = abs(male_cost - female_cost)
    if male_cost > female_cost:
        print ("Our female patients' yearly insurance cost is lower by ${} USD.".format(difference))
    elif female_cost > male_cost:
        print ("Our male patients' yearly insurance cost is lower by ${} USD.".format(difference))
    else:
        print("There's no difference in the average insurance cost between males and females")

In [55]:
male_avg_cost, female_avg_cost = avg_cost_male_vs_female(patient_dict)
print('''The average insurance cost for the male patients in our dataset is: ${}.
The average insurance cost for the female patients in our dataset is: ${}.'''.format(male_avg_cost, female_avg_cost))
m_vs_f_cost_difference(male_avg_cost, female_avg_cost)

The average insurance cost for the male patients in our dataset is: $13956.75.
The average insurance cost for the female patients in our dataset is: $12569.58.
Our female patients' yearly insurance cost is lower by $1387.17 USD.
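The pattern used here (add up the charges per group, count the members, then divide) is the same for any categorical column. As a rough sketch, it could be written once as a generic helper. The name `average_by` and the sample patients are made up, and the helper assumes every record has both keys:

```python
from collections import defaultdict


def average_by(patients, group_key, value_key="charges"):
    """Return {group: average of value_key}, rounded to 2 decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in patients.values():
        group = record[group_key]
        totals[group] += float(record[value_key])
        counts[group] += 1
    # Every group in totals has at least one member, so no division by zero.
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# Made-up patients with the same structure as patient_dict.
sample_patients = {
    0: {"sex": "male", "smoker": "yes", "charges": "100.0"},
    1: {"sex": "male", "smoker": "no", "charges": "300.0"},
    2: {"sex": "female", "smoker": "no", "charges": "150.0"},
}
print(average_by(sample_patients, "sex"))     # {'male': 200.0, 'female': 150.0}
print(average_by(sample_patients, "smoker"))  # {'yes': 100.0, 'no': 225.0}
```

Because the groups are discovered from the data, the helper needs no hard-coded labels.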


#### 7. Average cost for each region?

In [48]:
def cost_per_region(dict):
    se_cost = 0
    se_total = 0
    sw_cost = 0
    sw_total = 0
    ne_cost = 0
    ne_total = 0
    nw_cost = 0
    nw_total = 0
    
    for patient in dict:
        if dict[patient]["region"] == "southeast":
            se_total += 1
            se_cost += float(dict[patient]["charges"])
        elif dict[patient]["region"] == "southwest":
            sw_total += 1
            sw_cost += float(dict[patient]["charges"])
        elif dict[patient]["region"] == "northeast":
            ne_total += 1
            ne_cost += float(dict[patient]["charges"])
        elif dict[patient]["region"] == "northwest":
            nw_total += 1
            nw_cost += float(dict[patient]["charges"])
    se_avg = round(se_cost / se_total, 2)
    sw_avg = round(sw_cost / sw_total, 2)
    ne_avg = round(ne_cost / ne_total, 2)
    nw_avg = round(nw_cost / nw_total, 2)

    return se_avg, sw_avg, ne_avg, nw_avg

In [50]:
se_avg_cost, sw_avg_cost, ne_avg_cost, nw_avg_cost = cost_per_region(patient_dict)
print('''
The average yearly insurance cost for each region in our dataset is:
- South East: ${} dollars.
- South West: ${} dollars.
- North East: ${} dollars.
- North West: ${} dollars.
'''.format(se_avg_cost, sw_avg_cost, ne_avg_cost, nw_avg_cost))


The average yearly insurance cost for each region in our dataset is:
- South East: $14735.41 dollars.
- South West: $12346.94 dollars.
- North East: $13406.38 dollars.
- North West: $12417.58 dollars.



The Southeast has the highest average insurance cost, about $1,329 above the next-highest region (Northeast, at $13406.38). These averages alone don't tell us why.

#### 8. Average cost for smokers and non-smokers? Difference?

In [57]:
def avg_cost_smokers_vs_non_smokers(dict):
    smoker_count = 0
    smoker_total_cost = 0
    non_smoker_count = 0
    non_smoker_total_cost = 0
    for patient in dict:
        if dict[patient]["smoker"] == "yes":
            smoker_count += 1
            smoker_total_cost += float(dict[patient]["charges"])
        elif dict[patient]["smoker"] == "no":
            non_smoker_count += 1
            non_smoker_total_cost += float(dict[patient]["charges"])
    smoker_avg = smoker_total_cost / smoker_count
    non_smoker_avg = non_smoker_total_cost / non_smoker_count
    return round(smoker_avg, 2), round(non_smoker_avg, 2)

def sm_vs_nsm_cost_difference(smoker_cost, non_smoker_cost):
    difference = abs(smoker_cost - non_smoker_cost)
    if smoker_cost > non_smoker_cost:
        print ("Our non smoker patients' yearly insurance cost is lower by ${} USD.".format(difference))
    elif non_smoker_cost > smoker_cost:
        print ("Our smoker patients' yearly insurance cost is lower by ${} USD.".format(difference))
    else:
        print("There's no difference in the average insurance cost between smokers and non smokers")

In [61]:
smoker_avg_cost, non_smoker_avg_cost = avg_cost_smokers_vs_non_smokers(patient_dict)
print('''
The average insurance cost for smoker patients is: ${}.
The average insurance cost for non smoker patients is: ${}.
'''.format(smoker_avg_cost, non_smoker_avg_cost))
sm_vs_nsm_cost_difference(smoker_avg_cost, non_smoker_avg_cost)


The average insurance cost for smoker patients is: $32050.23.
The average insurance cost for non smoker patients is: $8434.27.

Our non smoker patients' yearly insurance cost is lower by $23615.96 USD.


#### 9. Average age for people with at least 1 child?

In [67]:
def patient_with_child_age(dict):
    age_total = 0
    patient_sum = 0
    for patient in dict:
        child = int(dict[patient]["children"])
        if child > 0:
            age_total += int(dict[patient]["age"])
            patient_sum += 1
    avg_age = age_total / patient_sum
    return round(avg_age, 2)

In [69]:
avg_age_with_child = patient_with_child_age(patient_dict)

print("The average age for patients in our dataset that have 1 or more children is {} years.".format(avg_age_with_child))

The average age for patients in our dataset that have 1 or more children is 39.78 years.
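Averaging the age of a filtered group is a pattern we can reuse. A minimal sketch, assuming the filter is passed in as a function: the name `average_age`, its `predicate` parameter, and the sample patients are all hypothetical. It returns `None` when no patient matches, which avoids a division by zero:

```python
def average_age(patients, predicate):
    """Average age of the patients for which predicate(record) is true."""
    ages = [int(record["age"]) for record in patients.values() if predicate(record)]
    if not ages:
        return None  # no matching patients, so there is nothing to average
    return round(sum(ages) / len(ages), 2)


# Made-up patients with the same structure as patient_dict.
sample_patients = {
    0: {"age": "30", "children": "0", "smoker": "no"},
    1: {"age": "40", "children": "2", "smoker": "yes"},
    2: {"age": "50", "children": "1", "smoker": "no"},
}

with_children = average_age(sample_patients, lambda r: int(r["children"]) > 0)
smokers_only = average_age(sample_patients, lambda r: r["smoker"] == "yes")
print(with_children)  # 45.0
print(smokers_only)   # 40.0
```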


#### 10. Average age for smokers and non-smokers?

In [85]:
def avg_age_smokers_vs_non_smokers(dict):
    smoker_age = 0
    smoker_count = 0
    non_smoker_age = 0
    non_smoker_count = 0
    for patient in dict:
        if dict[patient]["smoker"] == "yes":
            smoker_age += int(dict[patient]["age"])
            smoker_count += 1
        elif dict[patient]["smoker"] == "no":
            non_smoker_age += int(dict[patient]["age"])
            non_smoker_count += 1
    smoker_avg_age = smoker_age / smoker_count
    non_smoker_avg_age = non_smoker_age / non_smoker_count
    return round(smoker_avg_age, 2), round(non_smoker_avg_age, 2)

In [87]:
smoker_avg_age, non_smoker_avg_age = avg_age_smokers_vs_non_smokers(patient_dict)

print('''
The average age of our smoker patients is: {}.
The average age of our non smoker patients is: {}.
'''.format(smoker_avg_age, non_smoker_avg_age))


The average age of our smoker patients is: 38.51.
The average age of our non smoker patients is: 39.39.



From this result we can conclude that *for our specific dataset*, there's no substantial difference in age between our smoker and non-smoker patients: the two averages (38.51 and 39.39 years) differ by less than one year.