# U.S. Medical Insurance Costs

## Project Goals
Determine which factor in the insurance data file provided ('insurance.csv') contributes the most to insurance cost.  Factors included are age, sex, BMI, number of children, smoker, and region.  

To estimate how much each factor contributes to the overall cost, the data will be grouped by each factor and the average insurance cost in each group will be calculated.  Where data are collected into more than two ordered groups (i.e. age, BMI, and number of children), linear regression will be used to etimate the cost contribution.  Where data is grouped by unordered value (i.e. sex, smoker, and region), average costs will be compared.

## Import Data

In [98]:
import csv
from decimal import Decimal

In [72]:
insurance_data = []
with open('/Users/djfkahn/Downloads/python-portfolio-project-starter-files/insurance.csv') as data_file:
    data_dict = csv.DictReader(data_file)
    for row in data_dict:
        insurance_data.append(row)

In [73]:
print(insurance_data[0])

OrderedDict([('age', '19'), ('sex', 'female'), ('bmi', '27.9'), ('children', '0'), ('smoker', 'yes'), ('region', 'southwest'), ('charges', '16884.924')])


## Organize Data
Group the insurance costs by each of the factors.

In [74]:
def group_costs(insurance_data, factor):
    result = {}
    for record in insurance_data:
        if result.get(record[factor], None) is None:
            result[record[factor]] = [float(record['charges'])]
        else:
            result[record[factor]].append(float(record['charges']))

    sorted_result = {}
    for key in sorted(result.keys()):
        sorted_result[key] = result[key]
    return sorted_result

In [75]:
cost_by_age = group_costs(insurance_data, 'age')

In [76]:
cost_by_sex = group_costs(insurance_data, 'sex')

In [77]:
cost_by_children = group_costs(insurance_data, 'children')

In [78]:
cost_by_smoker = group_costs(insurance_data, 'smoker')

In [79]:
cost_by_region = group_costs(insurance_data, 'region')

BMI data is continuous, so set up tiers based on data from https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmi_dis.htm.

|Clasification|BMI Range |
|---|---|
|Underweight|< 18.5|
|Normal|18.5–24.9|
|Overweight|25.0–29.9|
|Obesity I|30.0–34.9|
|Obesity II |35.0–39.9|
|Extreme Obesity|40.0 +|

In [80]:
# Define the BMI tiers
bmi_tier_upper_limits = {1 : 18.5,
                         2 : 25.0,
                         3 : 30.0,
                         4 : 35.0,
                         5 : 40.0,
                         6 : 100.}

In [81]:
# Add tiers to the insurance data
for record in insurance_data:
    tier = 1
    while float(record['bmi']) > bmi_tier_upper_limits[tier] and tier < 5:
        tier += 1
    record['bmi_tier'] =  tier

In [82]:
cost_by_bmi = group_costs(insurance_data, 'bmi_tier')

## Analyze Data
For each cohort of data, compute the average.

In [83]:
def compute_averages(costs):
    result = {}
    for key, values in costs.items():
        total = 0.
        for value in values:
            total += value
        average = total / len(values)
        result[key] = average
    return result

In [84]:
avg_by_age = compute_averages(cost_by_age)

In [85]:
avg_by_sex = compute_averages(cost_by_sex)

In [86]:
avg_by_children = compute_averages(cost_by_children)

In [87]:
avg_by_smoker = compute_averages(cost_by_smoker)

In [88]:
avg_by_region = compute_averages(cost_by_region)

In [89]:
avg_by_bmi = compute_averages(cost_by_bmi)

### Ordered Categories
Perform a linear regression to find the slope of the best fit line through the groups to indicate the affect of the factor on the average insurance cost.

In [114]:
def compute_slope(data):
    sum_x = 0.
    sum_y = 0.
    for x, y in data.items():
        sum_x += float(x)
        sum_y += y
    mean_x = sum_x / len(data)
    mean_y = sum_y / len(data)
    
    sum_num = 0.
    sum_den = 0.
    for x, y in data.items():
        sum_num += ((float(x) - mean_x) * (y - mean_y))
        sum_den += ((float(x) - mean_x) ** 2)

    return sum_num / sum_den
# B1 = sum((x(i) - mean(x)) * (y(i) - mean(y))) / sum( (x(i) - mean(x))^2 )

In [115]:
slope_age = compute_slope(avg_by_age)

In [117]:
slope_children = compute_slope(avg_by_children)

In [118]:
slope_bmi = compute_slope(avg_by_bmi)

### Non-Ordered Categories
Compare average insurance cost for the factor's categories.

In [104]:
contribution_of_sex = avg_by_sex['male'] - avg_by_sex['female']
if contribution_of_sex > 0. :
    comparison_for_sex = "more"
else:
    comparison_for_sex = "less"

In [106]:
contribution_of_smoker = avg_by_smoker['yes'] - avg_by_smoker['no']
if contribution_of_smoker > 0. :
    comparison_for_smoker = "more"
else:
    comparison_for_smoker = "less"

## Results

In [123]:
print("Insurance cost {Change} ${Slope:.2f} for every year of age.".format(Change=('rises' if slope_age >= 0 else 'drops'), Slope=slope_age))
print("Insurance cost {Change} ${Slope:.2f} for every additional child.".format(Change=('rises' if slope_children >= 0 else 'drops'), Slope=abs(slope_children)))
print("Insurance cost {Change} ${Slope:.2f} for successive BMI tier.".format(Change=('rises' if slope_bmi >= 0 else 'drops'), Slope=slope_bmi))

print("Insurance for males costs ${Diff:.2f} {Comp} than for females.".format(Diff=contribution_of_sex, Comp=comparison_for_sex))
print("Insurance for a smoker costs ${Diff:.2f} {Comp} than for non-smoker.".format(Diff=contribution_of_smoker, Comp=comparison_for_smoker))


Insurance cost rises $264.83 for every year of age.
Insurance cost drops $407.41 for every additional child.
Insurance cost rises $2058.64 for successive BMI tier.
Insurance for males costs $1387.17 more than for females.
Insurance for a smoker costs $23615.96 more than for non-smoker.
