# U.S. Medical Insurance Costs
This project examines how insurance costs vary by region, and whether smoking habits are also associated with region.

First, we will import our CSV file into the program, and then store the data in a list.

Each CSV row is stored as a dictionary of key-value pairs whose keys match the CSV column names. The `smoker` field is either `yes` or `no`, so the `smoker_binary()` function converts it to 1 or 0, respectively.

In addition, regions are encoded as integers:
- northwest = 1
- northeast = 2
- southwest = 3
- southeast = 4

The `region_conversion()` function performs this transformation.

In [23]:
import csv

insur_list = []


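def smoker_binary(value):
    """Convert a 'yes'/'no' smoker flag to 1/0 (None for any other value)."""
    if value == "yes":
        return 1
    elif value == "no":
        return 0

def region_conversion(value):
    """Convert a region name to its integer code (None for any other value)."""
    if value == "northwest":
        return 1
    elif value == "northeast":
        return 2
    elif value == "southwest":
        return 3
    elif value == "southeast":
        return 4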

with open('insurance.csv') as insurance_csv:
    reader = csv.DictReader(insurance_csv)
    for row in reader:
        insur_list.append({
            'age': int(row['age']),  # csv yields strings; convert numeric fields
            'sex': row['sex'],
            'bmi': float(row['bmi']),
            'children': int(row['children']),
            'smoker': smoker_binary(row['smoker']),
            'region': region_conversion(row['region']),
            'charges': float(row['charges'])
        })

list_len = len(insur_list) # For use later
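
The two conversion helpers above could also be written as dictionary lookups. The following is a minimal sketch, not the notebook's own code: `SMOKER_CODES`, `REGION_CODES`, and `encode()` are hypothetical names, and raising `ValueError` on an unrecognized value is an assumed design choice (the original functions return `None` in that case).

```python
# Hypothetical lookup tables mirroring smoker_binary() and region_conversion().
SMOKER_CODES = {"yes": 1, "no": 0}
REGION_CODES = {"northwest": 1, "northeast": 2, "southwest": 3, "southeast": 4}


def encode(value, codes):
    """Return the integer code for value, or raise ValueError if it is unknown."""
    try:
        return codes[value]
    except KeyError:
        # Assumed behavior: fail loudly instead of silently storing None.
        raise ValueError(f"unrecognized value: {value!r}") from None


print(encode("yes", SMOKER_CODES))
print(encode("southeast", REGION_CODES))
```

Failing on unknown values would surface typos or unexpected categories in the CSV at load time rather than later, when a `None` region is silently skipped by the counting loops.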

## Region and Insurance Costs
This section computes the average insurance cost for each region.
The `avg_by_region_cost()` function takes no arguments, since it reports all four regions at once.
- However, the function could be modified to accept a region (either read from user input or passed as an argument) and show the results for that region only.
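
One way the single-region variant mentioned above might look is sketched here. `avg_cost_for_region()` is a hypothetical name, the rows are passed in explicitly rather than read from the global list, and the sample rows are made up for illustration.

```python
def avg_cost_for_region(rows, region_code):
    """Return the mean of 'charges' for rows with the given region code.

    Returns None when no row matches, which avoids dividing by zero.
    """
    charges = [row["charges"] for row in rows if row["region"] == region_code]
    if not charges:
        return None
    return sum(charges) / len(charges)


# Made-up rows for illustration only; not taken from insurance.csv.
sample_rows = [
    {"region": 1, "charges": 100.0},
    {"region": 1, "charges": 300.0},
    {"region": 4, "charges": 500.0},
]
print(avg_cost_for_region(sample_rows, 1))  # 200.0
```

In the notebook, the same call would take `insur_list` as its first argument and one of the region codes 1 through 4 as its second.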

In [24]:
def avg_by_region_cost():
    # Marked as 1
    nw_total_cost = 0
    nw_count = 0

    # Marked as 2
    ne_total_cost = 0
    ne_count = 0

    # Marked as 3
    sw_total_cost = 0
    sw_count = 0

    # Marked as 4
    se_total_cost = 0
    se_count = 0

    for row in insur_list:
        if row['region'] == 1: # Northwest
            nw_count += 1
            nw_total_cost += row['charges']
        elif row['region'] == 2: # Northeast
            ne_count += 1
            ne_total_cost += row['charges']
        elif row['region'] == 3: # Southwest
            sw_count += 1
            sw_total_cost += row['charges']
        elif row['region'] == 4: # Southeast
            se_count += 1
            se_total_cost += row['charges']

    nw_avg = nw_total_cost / nw_count
    ne_avg = ne_total_cost / ne_count
    sw_avg = sw_total_cost / sw_count
    se_avg = se_total_cost / se_count

    return nw_avg, ne_avg, sw_avg, se_avg

nw_cost, ne_cost, sw_cost, se_cost = avg_by_region_cost()
print('The northwest region had an average insurance cost of ${:.2f}.'.format(nw_cost))
print('The northeast region had an average insurance cost of ${:.2f}.'.format(ne_cost))
print('The southwest region had an average insurance cost of ${:.2f}.'.format(sw_cost))
print('The southeast region had an average insurance cost of ${:.2f}.'.format(se_cost))

The northwest region had an average insurance cost of $12417.58.
The northeast region had an average insurance cost of $13406.38.
The southwest region had an average insurance cost of $12346.94.
The southeast region had an average insurance cost of $14735.41.


## Smoker and Region Correlations
This section counts the smokers in each region, using the same counting logic as the `avg_by_region_cost()` function. It also reports what percentage of the whole dataset are smokers.

In [45]:
def smoker_by_region():
    # Marked as 1
    nw_count = 0

    # Marked as 2
    ne_count = 0

    # Marked as 3
    sw_count = 0

    # Marked as 4
    se_count = 0

    for row in insur_list:
        if row['smoker'] == 1:
            if row['region'] == 1: # Northwest
                nw_count += 1
            elif row['region'] == 2: # Northeast
                ne_count += 1
            elif row['region'] == 3: # Southwest
                sw_count += 1
            elif row['region'] == 4: # Southeast
                se_count += 1

    total_smoker = nw_count + ne_count + sw_count + se_count

    return nw_count, ne_count, sw_count, se_count, total_smoker

nw_smoke, ne_smoke, sw_smoke, se_smoke, total_smoke = smoker_by_region()
print('In the dataset of {} people, in total, there were {} that are smokers, or about {:.3f} percent of the total sampled.'.format(list_len, total_smoke, (total_smoke/list_len) * 100))
print('The northwest had {} smokers, and made up {:.3f} percent of smokers.'.format(nw_smoke, (nw_smoke / total_smoke) * 100))
print('The northeast had {} smokers, and made up {:.3f} percent of smokers.'.format(ne_smoke, (ne_smoke / total_smoke) * 100))
print('The southwest had {} smokers, and made up {:.3f} percent of smokers.'.format(sw_smoke, (sw_smoke / total_smoke) * 100))
print('The southeast had {} smokers, and made up {:.3f} percent of smokers.'.format(se_smoke, (se_smoke / total_smoke) * 100))

In the dataset of 1338 people, in total, there were 274 that are smokers, or about 20.478 percent of the total sampled.
The northwest had 58 smokers, and made up 21.168 percent of smokers.
The northeast had 67 smokers, and made up 24.453 percent of smokers.
The southwest had 58 smokers, and made up 21.168 percent of smokers.
The southeast had 91 smokers, and made up 33.212 percent of smokers.
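
The percentages above are each region's share of all smokers, which also depends on how many people were sampled in each region. A complementary figure is the smoking rate within each region. The following is a sketch only: `smoking_rate_by_region()` is a hypothetical helper, and the sample rows are made up rather than drawn from `insurance.csv`.

```python
from collections import Counter


def smoking_rate_by_region(rows):
    """Map each region code to the fraction of its rows marked as smokers."""
    totals = Counter(row["region"] for row in rows)
    smokers = Counter(row["region"] for row in rows if row["smoker"] == 1)
    # Counter returns 0 for regions with no smokers, so the lookup is safe.
    return {region: smokers[region] / totals[region] for region in totals}


# Made-up rows for illustration only.
sample_rows = [
    {"region": 1, "smoker": 1},
    {"region": 1, "smoker": 0},
    {"region": 4, "smoker": 1},
    {"region": 4, "smoker": 1},
]
print(smoking_rate_by_region(sample_rows))  # {1: 0.5, 4: 1.0}
```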


## Findings
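From these results, the number of smokers per region and the average cost of insurance per region appear to move together. The southeast had the highest average insurance cost and the largest share of the dataset's smokers (about 33 percent), while the northwest and southwest had the lowest average costs and the fewest smokers. With only four regions, this is a suggestive pattern rather than strong evidence of a correlation.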
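
The relationship described above can be put into a number with a Pearson correlation coefficient over the four regional values. This is a minimal sketch and not part of the original analysis: `pearson_r()` is a hypothetical helper written out by hand, and the inputs are the smoker counts and average costs copied from the printed output earlier in this notebook.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from the notebook's printed results, ordered NW, NE, SW, SE.
smoker_counts = [58, 67, 58, 91]
avg_costs = [12417.58, 13406.38, 12346.94, 14735.41]

r = pearson_r(smoker_counts, avg_costs)
print(f"Pearson r = {r:.3f}")
```

A coefficient computed from four points is easy to inflate by chance, so the value is best read as a summary of the pattern rather than as proof of it.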

However, this analysis does not account for the many other factors that can play a role in determining insurance cost, such as a person's age, sex, BMI, or number of children.