# U.S. Medical Insurance Costs

In this project, we analyze a CSV file containing U.S. medical insurance data. The goal is to compute averages by **region** and see how age, number of children, and BMI vary between regions. Does location play a role in your insurance cost?


In [1]:
#Import CSV library
import csv

We'll need the `csv` library to access the functions necessary to open the **insurance.csv** file that contains our data.

Upon examining this file, we see that there are seven columns:
- Age
- Sex
- BMI
- Number of children
- Smoker status
- Region
- Charges (Cost)

By analyzing the data in this file, we will be able to calculate averages by region.
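As a quick check on the column names, the header can be read back through `csv.DictReader`. The sketch below uses an in-memory stand-in for **insurance.csv**: the header matches the keys used later in this notebook, but the single data row is made up for illustration.

```python
import csv
import io

# Stand-in for insurance.csv. The header matches the keys used in this
# notebook; the data row is invented purely for illustration.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "40,female,25.0,1,no,northeast,5000.00\n"
)

reader = csv.DictReader(sample_csv)
# DictReader takes the column names from the first row of the file
print(reader.fieldnames)

first_row = next(reader)
# Every value arrives as a string, so numbers must be converted before use
print(first_row["age"], type(first_row["age"]))
```

With the real file, the same two checks would work after replacing `sample_csv` with the object returned by `open('insurance.csv', newline='')`.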

In [2]:
#Create lists that will eventually contain all the data from our insurance.csv file
ages = []
sex = []
bmis = []
num_children = []
smoker = []
regions = []
insurance_costs = []

With our lists created, we can move on to opening the file in order to populate these lists.

In [3]:
#As the file already contains headers, it's not necessary to indicate a fieldnames argument for DictReader
with open('insurance.csv', newline='') as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        ages.append(row['age'])
        sex.append(row['sex'])
        bmis.append(row['bmi'])
        num_children.append(row['children'])
        smoker.append(row['smoker'])
        regions.append(row['region'])
        insurance_costs.append(row['charges'])

The code here opens **insurance.csv**. `csv.DictReader` then yields each row as a dictionary keyed by column name, which lets us append every value to the appropriate list. Note that every value is read as a string, so numbers have to be converted before we do any arithmetic. From here we can start manipulating our data. Since our goal is to find averages by **region**, the first thing we will do is build a dictionary that groups the records by region. This dictionary uses the region as the **key**, and each value is a **list** of **tuples**, one tuple per person.

In [12]:
insurance_by_region = {}
for insurance_data in zip(ages, sex, bmis, num_children, smoker, regions, insurance_costs):
    region = insurance_data[5]
    # Keep every field except the region, which becomes the dictionary key
    record = insurance_data[:5] + insurance_data[6:]
    if region not in insurance_by_region:
        insurance_by_region[region] = []
    insurance_by_region[region].append(record)

If printed at this point, the data would be hard to read by eye, but it is still easy to see that it has been divided into regions. Now that we have these values by region, we can start defining the functions that will manipulate this data so we can analyze it.
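One way to confirm the split without printing every record is to count the records under each key. The sketch below is only an illustration: it rebuilds the same region-keyed structure from a few made-up rows using `collections.defaultdict` (a swapped-in alternative to the explicit `if`/`else`), and the name `grouped` is hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for the lists populated from insurance.csv (values are made up)
ages = ["19", "33", "45"]
sex = ["female", "male", "male"]
bmis = ["27.9", "22.7", "30.1"]
num_children = ["0", "1", "2"]
smoker = ["yes", "no", "no"]
regions = ["southwest", "northwest", "southwest"]
insurance_costs = ["16000.00", "4000.00", "8000.00"]

grouped = defaultdict(list)
for age, s, bmi, children, smokes, region, cost in zip(
        ages, sex, bmis, num_children, smoker, regions, insurance_costs):
    # A region seen for the first time starts out as an empty list
    grouped[region].append((age, s, bmi, children, smokes, cost))

# Number of records per region, as a quick sanity check on the split
for region, records in grouped.items():
    print(region, len(records))
```

The same loop over `insurance_by_region.items()` would report the record count for each real region.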

## Functions

We can begin with something simple, such as calculating the average insurance cost by region. This will serve as a baseline that we can compare the other averages to. To follow the functions, it helps to know the index of each field within a record:
- Age (0)
- Sex (1)
- BMI (2)
- Number of Children (3)
- Smoker (4)
- Insurance cost (5)

In [5]:
def average_cost_by_region(insurance_data, region):
    total_cost = 0.0
    for data in insurance_data[region]:
        total_cost += float(data[5])
    average_cost = round(total_cost / len(insurance_data[region]), 2)
    return average_cost

### Average cost by region
This function takes two parameters: `insurance_data`, the dictionary created previously and keyed by **region**, and `region`, the name of the region passed as a string. We can then iterate through the dictionary keys to compare the regional averages.

In [6]:
for region in insurance_by_region.keys():
    average_cost = average_cost_by_region(insurance_by_region, region)
    print('The average insurance cost for the {r} region is: ${c}.'.format(r=region, c=average_cost))

The average insurance cost for the southwest region is: $12346.94.
The average insurance cost for the southeast region is: $14735.41.
The average insurance cost for the northwest region is: $12417.58.
The average insurance cost for the northeast region is: $13406.38.


We now know that on average, the **southeast** region pays the most for insurance. We can average other data in this region and the others to try to find out why that may be. Does that region have more smokers? Do people there have more children on average? Are they older? Is their BMI higher? Similarly, we can try to figure out why the **southwest** region pays the least on average, with the **northwest** close behind. We'll create several similar functions here to find the averages of the other fields.

In [7]:
def average_age_by_region(insurance_data, region):
    total_age = 0
    for data in insurance_data[region]:
        total_age += int(data[0])
    average_age = round(total_age / len(insurance_data[region]), 2)
    return average_age

### Average age by region
Similar to the previous function, this will return the average **age** per region.

In [8]:
def average_bmi_by_region(insurance_data, region):
    total_bmi = 0
    for data in insurance_data[region]:
        total_bmi += float(data[2])
    average_bmi = round(total_bmi / len(insurance_data[region]), 2)
    return average_bmi

### Average BMI by region
Once more, a very similar function that will return the average **BMI** by region.

In [9]:
def average_children_by_region(insurance_data, region):
    total_children = 0
    for data in insurance_data[region]:
        total_children += int(data[3])
    average_children = round(total_children / len(insurance_data[region]), 2)
    return average_children

### Average number of children by region
This function follows the same pattern as the previous ones and returns the average **number of children** per region.
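The averaging functions above differ only in the index they read and the conversion they apply. As a minimal sketch of how they could be folded into one, the hypothetical helper `average_by_region` below takes both as parameters; it assumes the same tuple layout as `insurance_by_region`, and the toy data is made up.

```python
def average_by_region(insurance_data, region, index, convert=float):
    """Average the field at `index` for one region, rounded to 2 places."""
    records = insurance_data[region]
    total = sum(convert(record[index]) for record in records)
    return round(total / len(records), 2)


# Made-up records in the layout (age, sex, bmi, children, smoker, cost)
toy_data = {
    "southwest": [
        ("19", "female", "27.9", "0", "yes", "16000.00"),
        ("45", "male", "30.1", "2", "no", "8000.00"),
    ]
}

print(average_by_region(toy_data, "southwest", 0, int))  # average age
print(average_by_region(toy_data, "southwest", 3, int))  # average children
print(average_by_region(toy_data, "southwest", 5))       # average cost
```

Keeping the separate named functions, as this notebook does, is more explicit; the generic version mainly saves repetition.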

In [10]:
def smokers_by_region(insurance_data, region):
    total_smokers = 0
    for data in insurance_data[region]:
        if data[4] == 'yes':
            total_smokers += 1
    return total_smokers

### Total number of smokers by region
This function differs slightly from the others in that it calculates the total number of **smokers**, not an average. We simply want to know how many people in that region smoke, as being a smoker can be a significant factor in insurance cost.
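A raw total is only directly comparable if every region contains the same number of records, which this notebook does not check. As a hedged sketch, the hypothetical helper `smoker_rate_by_region` below expresses smokers as a percentage of the region's records instead; it assumes smoker status sits at index 4, and the toy data is made up.

```python
def smoker_rate_by_region(insurance_data, region):
    """Percentage of records in a region whose smoker field is 'yes'."""
    records = insurance_data[region]
    smokers = sum(1 for record in records if record[4] == "yes")
    return round(100 * smokers / len(records), 2)


# Made-up records in the layout (age, sex, bmi, children, smoker, cost)
toy_data = {
    "northeast": [
        ("25", "male", "24.0", "0", "yes", "15000.00"),
        ("31", "female", "26.5", "1", "no", "5000.00"),
        ("47", "male", "29.0", "2", "no", "9000.00"),
        ("52", "female", "31.2", "3", "no", "11000.00"),
    ]
}

print(smoker_rate_by_region(toy_data, "northeast"))
```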

## Analysis
We have all the functions we need to analyze the data. To do this, we'll create one more dictionary. It will once again use the regions as keys, but this time each value will be a simple list containing the **averages** and **totals** computed by the functions defined above. With this dictionary, we can compare the regional figures and see if any factor stands out as a potential reason for higher costs in a region.

In [11]:
averages_by_region = {}
for region in insurance_by_region.keys():
    average_age = average_age_by_region(insurance_by_region, region)
    average_bmi = average_bmi_by_region(insurance_by_region, region)
    average_children = average_children_by_region(insurance_by_region, region)
    total_smokers = smokers_by_region(insurance_by_region, region)
    average_cost = average_cost_by_region(insurance_by_region, region)
    averages_by_region[region] = [average_age, average_bmi, average_children, total_smokers, average_cost]

With our **averages** and **totals** saved to a new dictionary, it's much easier to read the data. Let's print it out so we can see which factors (if any) play a role in increased prices for a particular region.

In [18]:
for region, data in averages_by_region.items():
    print('The {r} region is on average {a} years old, has an average BMI of {bmi}, an average of {c} children, a total of {s} smokers and an average insurance cost of ${cost}.'\
         .format(r=region, a=data[0], bmi=data[1], c=data[2], s=data[3], cost=data[4]))

The southwest region is on average 39.46 years old, has an average BMI of 30.6, an average of 1.14 children, a total of 58 smokers and an average insurance cost of $12346.94.
The southeast region is on average 38.94 years old, has an average BMI of 33.36, an average of 1.05 children, a total of 91 smokers and an average insurance cost of $14735.41.
The northwest region is on average 39.2 years old, has an average BMI of 29.2, an average of 1.15 children, a total of 58 smokers and an average insurance cost of $12417.58.
The northeast region is on average 39.27 years old, has an average BMI of 29.17, an average of 1.05 children, a total of 67 smokers and an average insurance cost of $13406.38.


## Conclusion
Though other factors could be at play, one observation stands out: regions with more **smokers** tend to have a higher average insurance cost. The **southeast** region leads all other regions both in total smokers and in the average amount paid for insurance. We can also observe that the **southwest**, despite having a higher average BMI, being slightly older on average, and having slightly more children on average, *still pays less on average than the **northeast** region, which has more smokers*. Note that the southeast also has the highest average BMI (33.36), and that smokers are counted here as totals and not as a share of each region, so this is an association across four regions and not proof of cause.
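A more direct way to examine the smoking observation would be to compare the average cost of smokers and non-smokers across all regions. The sketch below is one possible approach, not part of the original analysis: `average_cost_by_smoker_status` is a hypothetical helper that assumes the same tuple layout (smoker at index 4, cost at index 5), and the toy data is made up.

```python
def average_cost_by_smoker_status(insurance_data):
    """Average cost for smokers ('yes') and non-smokers ('no'), all regions."""
    totals = {"yes": 0.0, "no": 0.0}
    counts = {"yes": 0, "no": 0}
    for records in insurance_data.values():
        for record in records:
            status = record[4]
            totals[status] += float(record[5])
            counts[status] += 1
    # Skip a status with no records to avoid dividing by zero
    return {
        status: round(totals[status] / counts[status], 2)
        for status in totals
        if counts[status] > 0
    }


# Made-up records in the layout (age, sex, bmi, children, smoker, cost)
toy_data = {
    "southeast": [
        ("30", "male", "33.0", "1", "yes", "16000.00"),
        ("41", "female", "28.0", "0", "no", "4000.00"),
    ],
    "northwest": [
        ("55", "male", "29.5", "2", "yes", "20000.00"),
        ("23", "female", "22.0", "0", "no", "8000.00"),
    ],
}

print(average_cost_by_smoker_status(toy_data))
```

Calling it on `insurance_by_region` would give the two averages for the real data.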

## Limitations
Sex was not considered at all during these observations. Because women often pay more for insurance on average, some of these observations could change if the data were further divided by sex.
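As a starting point for that follow-up, the records could be grouped by region and sex together. The sketch below is an assumption about how this might be done: `average_cost_by_region_and_sex` is a hypothetical helper that reads sex from index 1 and cost from index 5, and the toy data is made up.

```python
from collections import defaultdict
from statistics import mean


def average_cost_by_region_and_sex(insurance_data):
    """Map each (region, sex) pair to its average cost, rounded to 2 places."""
    costs = defaultdict(list)
    for region, records in insurance_data.items():
        for record in records:
            costs[(region, record[1])].append(float(record[5]))
    return {key: round(mean(values), 2) for key, values in costs.items()}


# Made-up records in the layout (age, sex, bmi, children, smoker, cost)
toy_data = {
    "southeast": [
        ("34", "female", "30.0", "1", "no", "10000.00"),
        ("48", "male", "32.5", "2", "yes", "14000.00"),
        ("29", "female", "27.0", "0", "no", "12000.00"),
    ]
}

for (region, sex_label), cost in average_cost_by_region_and_sex(toy_data).items():
    print(region, sex_label, cost)
```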