# U.S. Medical Insurance Costs

## Scope of Project

Determine which variable to calculate insurance rates, has the most dramatic impact on cost. 

## Part 1 - Convert Database

In [666]:
import csv

ages = []
sexes = []
bmis = []
num_of_children = []
smokers = []
regions = []
all_charges = []

with open('insurance.csv') as insurance_data:
    convert = csv.DictReader(insurance_data)
    for row in convert:
        ages.append(row["age"])
        sexes.append(row["sex"])
        bmis.append(row["bmi"])
        num_of_children.append(row["children"])
        smokers.append(row["smoker"])
        regions.append(row["region"])
        all_charges.append(row['charges'])

In [668]:
def create_dictionary(ages, sexes, bmis, num_of_children, smokers, regions, all_charges):
    new_dict = {}
    for index in range(len(ages)):
        new_dict[index] = {
            "age": ages[index],
            "sex": sexes[index],
            "bmi": bmis[index],
            "children": num_of_children[index],
            "smoker": smokers[index],
            "region": regions[index],
            "charges": all_charges[index]
        }
    return new_dict

insurance_data_dict = create_dictionary(ages, sexes, bmis, num_of_children, smokers, regions, all_charges)

## Part 2 - Is this dataset a good representation of the population?

### Age Representation

##### Determine if the Age Range and Average Age of this dataset are representative of the U.S. Population

In [428]:
age_total = 0
for age in ages:
    age_total = age_total + int(age)
print("Age total: " + str(age_total))

age_average = age_total / len(ages)
print("Average age: " + str(round(age_average, 1)))

highest_age = 0
lowest_age = age_average
for age in ages:
    if int(age) > highest_age:
        highest_age = int(age)
    elif int(age) < lowest_age:
        lowest_age = int(age)
    else:
        continue
print("Highest age: " + str(highest_age))
print("Lowest age: " + str(lowest_age))

Age total: 52459
Average age: 39.2
Highest age: 64
Lowest age: 18


#### Conclusion

The age range is 18 - 64. This data set excludes children and seniors who qualify for medicare.
The Average Age here is 39.2 compared to the U.S. average of 38.9. 
This dataset is comfortably representative of age in the U.S. population. 

### Sex Representation

##### Determine if percentage of sexes in population is representative of the U.S. Population

In [423]:
male_total = 0
female_total = 0
for sex in sexes:
    if sex == 'male':
        male_total += 1
    elif sex == 'female':
        female_total += 1
    else:
        continue
print("Male Total: " + str(male_total))
print("Female Total: " + str(female_total))
sex_percentages_per_group = {
    'male': round((male_total / len(sexes)) * 100 ,2),
    'female': round((female_total / len(sexes)) * 100 ,2)
}
print("Percent of males in dataset: " + str(sex_percentages_per_group['male']))
print("Percent of females in dataset: " + str(sex_percentages_per_group['female']))

Male Total: 676
Female Total: 662
Percent of males in dataset: 50.52
Percent of females in dataset: 49.48


#### Conclusion

Our dataset is 50.52% male. The U.S. population is 49.5% male.

Our dataset is 49.48% female. The U.S. population is 50.47% female.

Our dataset is slightly unrepresentative of the U.S. population and we should keep in mind we have more datapoints on men. While we can analyze the men and women in this database we should be mindful about extrapolating this data to the full population. 

### BMI Representation

##### Determine if BMI range and average is representative of the U.S. Population

In [347]:
bmi_total = 0
for bmi in bmis:
    bmi_total = bmi_total + float(bmi)
bmi_average = bmi_total / len(bmis)
print("Average BMI: " + str(round(bmi_average)))
print("The National Average BMI is 29")

highest_bmi = 0
lowest_bmi = bmi_average
for bmi in bmis:
    if float(bmi) > highest_bmi:
        highest_bmi = float(bmi)
    elif float(bmi) < lowest_bmi:
        lowest_bmi = float(bmi)
    else:
        continue
print("The highest bmi is: " + str(highest_bmi))
print("The lowest bmi is: " + str(lowest_bmi))

Average BMI: 31
The National Average BMI is 29
The highest bmi is: 53.13
The lowest bmi is: 15.96


#### Conclusion

The Average BMI of our dataset is 31. The National Average is 29. 
This is fairly representative of the U.S. population. 

While the BMI range isn't entirely representative of all potential extremes it has datapoints for the lowest category possible 'underweight' and the highest 'obesity class 3'. Too much variation from the average could actually skew our data. I am content with the bmi range of this dataset.

### Children Per Household Representation

##### Determine if Average Number of Children, Range of Family Size and Percent of Childlessness is indicative of the U.S. Population

In [452]:
children_total = 0
for children in num_of_children:
    children_total += int(children)
children_average = children_total / len(num_of_children)

print("The average number of children per adult in this data set is: " + str(round(children_average, 2)))

highest_children = 0
lowest_children = children_average
for children in num_of_children:
    if int(children) > highest_children:
        highest_children = int(children)
    elif int(children) < lowest_children:
        lowest_children = int(children)
    else:
        continue
print("The highest amount of children is " + str(highest_children))
print("The lowest amount of children is " + str(lowest_children))

no_children_total = 0
for children in num_of_children:
    if int(children) == 0:
        no_children_total += 1
no_children_percentage = (no_children_total / len(num_of_children)) * 100
print("Childless Percentage: " + str(round(no_children_percentage)) + "%")

The average number of children per adult in this data set is: 1.09
The highest amount of children is 5
The lowest amount of children is 0
Childless Percentage: 43%


#### Conclusion

The average number of children per adult in this data set is 1.09. Compared to the u.s. population at 1.94. 

The number of children per adult in this data set ranges from 0 to 5. 5% of the U.S. population has more than 5 children. 
This would lead us to think that our bias skews down by a signficant amount.

However, on the other side of the spectrum we see that the percentage of childless individuals in our data set is 43% compared to the national average of 47%. Our data skews up 4%. Leaving just a 1% skew downwards.

Our dataset is slightly unrepresentative of the U.S. population by 1%. We should keep in mind our conclusion slightly skew towards adults with fewer children. 

Also keep in mind this dataset excludes data on extreme outliers. As one of 12 kids, I am curious to see how to the rate of change for insurance costs continues for those outlier families. But we cannot extrapolate our findings here to those outlier cases. 

### Region Representation

##### Determine if there is even distribution between categories available

In [355]:
region_count_library = {}
for region in regions:
    if region_count_library.get(region, "none")  == "none": 
        region_count_library[region] = 1
    else:
        region_count_library[region] += 1
print(region_count_library)

for region in region_count_library:
    region_percentage = ( region_count_library[region] / len(regions)) * 100
    print(region_percentage)
    print("Percentage of " + region + " is: " + str(round(region_percentage)) + "%")

region_percentages = {
    'southwest': 24,
    'southeast': 27,
    'northwest': 24,
    'northeast': 24
}



{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
325
1338
24.28998505231689
Percentage of southwest is: 24%
364
1338
27.204783258594915
Percentage of southeast is: 27%
325
1338
24.28998505231689
Percentage of northwest is: 24%
324
1338
24.2152466367713
Percentage of northeast is: 24%
An acceptable difference would be 5%, the highest difference here is 4% so we should be ok, we should keep in mind a slight skew towards the southest


#### Conclusion

Considering we don't have the exact demarcations for these categories we can't really compare them to the actual population distribution of the U.S. population. So we would prefer that data is evenly distributed across the four categories. We have 3% more individuals from the southeast so we should take the data extrapolated here as slightly more  representative of people in the southeast than the rest of the population. 

### Smoker Representation

##### Determine the percent of people in this dataset who are smokers, is this percentage indicative of the population?

In [468]:
total_smokers = 0
for smoker in smokers:
    if smoker == "yes":
        total_smokers += 1
percentage_smokers = (total_smokers / len(smokers) ) * 100
total_non_smokers = 0
for smoker in smokers:
    if smoker == "no":
        total_non_smokers += 1
print(total_smokers)
print(total_non_smokers)
print("The percentage of individuals in this data set who are smokers are: " + str(round(percentage_smokers)) + "%")
smoker_percentages_per_group = {
    'smoker': round(percentage_smokers, 2),
    'non-smoker': round((total_non_smokers / len(smokers)) * 100, 2)
}

#20% of the dataset are smokers, we really need to take into account where and when smokers are driving up the cost. This is the variable most likely to impact other conclusions.

274
1064
The percentage of individuals in this data set who are smokers are: 20.48%


#### Conclusion

20% of individuals in this dataset are smokers. 19.8% of inidividuals in the U.S. population are smokers. This dataset is indicative of the U.S. population when considering this variable. 

However considering we have 1 smoker for every 5 non smokers we really need to pay attention to where those smokers fall in our data set and how they can be skewing other results. Are these smokers more likely to be in certain demographics than others?

## Part 3 - Charges per Demographic - Single Variable Analysis

### Age Bias based on Age Only

##### Determine how much each age demographic is charged on average and the rate of change between those averages

In [479]:
def age_bias(dictionary):
    totals = {
            "young adult" : 0,
            "20s" : 0,
            "30s" : 0,
            "40s" : 0,
            "50s" : 0,
            "60s" : 0
        }
    total_charges = {
        "young adult" : 0,
        "20s" : 0,
        "30s" : 0,
        "40s" : 0,
        "50s" : 0,
        "60s" : 0
        }
    for user in dictionary:
        age = int(dictionary[user]["age"])
        charge = round(float(dictionary[user]["charges"]), 2)
        if age < 18:
            continue;
        elif age >= 18 and age < 20:
            totals["young adult"] += 1
            total_charges["young adult"] += charge
        elif age >= 20 and age < 30:
            totals["20s"] += 1
            total_charges["20s"] += charge
        elif age >= 30 and age < 40:
            totals["30s"] += 1
            total_charges["30s"] += charge
        elif age >= 40 and age < 50:
            totals["40s"] += 1
            total_charges["40s"] += charge
        elif age >= 50 and age < 60:
            totals["50s"] += 1
            total_charges["50s"] += charge
        elif age >= 60 and age < 65:
            totals["60s"] += 1
            total_charges["60s"] += charge
        else:
            continue
    averages = {
        "young adult" : round(total_charges["young adult"] / totals["young adult"], 2),
        "20s": round(total_charges["20s"] / totals["20s"], 2),
        "30s": round(total_charges["30s"] / totals["30s"], 2),
        "40s": round(total_charges["40s"] / totals["40s"], 2),
        "50s": round(total_charges["50s"] / totals["50s"], 2),
        "60s": round(total_charges["60s"] / totals["60s"], 2)
    }
    difference_in_averages = {
        "young adult": round(averages["young adult"]),
        "20s": round(averages["20s"] - averages["young adult"]),
        "30s": round(averages["30s"] - averages["20s"]),
        "40s": round(averages["40s"] - averages["30s"]),
        "50s": round(averages["50s"] - averages["40s"]),
        "60s": round(averages["60s"] - averages["50s"]),
    }
    percentages = {
        "young adult": round((totals['young adult'] / len(ages)) * 100 ,2),
        '20s': round((totals['20s'] / len(ages)) * 100 ,2),
        '30s': round((totals['30s'] / len(ages)) * 100 ,2),
        '40s': round((totals['40s'] / len(ages)) * 100 ,2),
        '50s': round((totals['50s'] / len(ages)) * 100 ,2),
        '60s': round((totals['60s'] / len(ages)) * 100 ,2),
    }
    return [totals, total_charges, averages, difference_in_averages, percentages]

age_totals = age_bias(insurance_data_dict)[0]
age_total_charges = age_bias(insurance_data_dict)[1]
age_averages = age_bias(insurance_data_dict)[2]
age_difference_in_averages = age_bias(insurance_data_dict)[3]
age_percentages_per_group = age_bias(insurance_data_dict)[4]
print("Age Totals: " + str(age_totals))
print("Age Total Charges: " + str(age_total_charges))
print("Age Average Charge: " + str(age_averages))
print("Differences in Charges: " + str(age_difference_in_averages))
print("Age Percentage " + str(age_percentages_per_group))

Age Totals: {'young adult': 137, '20s': 280, '30s': 257, '40s': 279, '50s': 271, '60s': 114}
Age Total Charges: {'young adult': 1151806.88, '20s': 2677290.2900000005, '30s': 3016867.5100000007, '40s': 4017377.79, '50s': 4470208.05, '60s': 2422274.490000001}
Age Average Charge: {'young adult': 8407.35, '20s': 9561.75, '30s': 11738.78, '40s': 14399.2, '50s': 16495.23, '60s': 21248.02}
Differences in Charges: {'young adult': 8407, '20s': 1154, '30s': 2177, '40s': 2660, '50s': 2096, '60s': 4753}
Age Percentage {'young adult': 10.24, '20s': 20.93, '30s': 19.21, '40s': 20.85, '50s': 20.25, '60s': 8.52}


#### Conclusion

We can see that charges increase the older you get, the only inconsistency is that the amount increased between 40 to 50 years old is less than the increase form 30 to 40, and even the increase between 20 and 30! To determine whether or not this inherently how the calculations work we must consult what other demographics 50 year olds are in.

### Sex Bias based on Sex Only

##### Determine if Males or Females are charged more on average

In [344]:
def sex_bias(dictionary):
    male_charges_total = 0
    female_charges_total = 0
    for user in dictionary:
        if dictionary[user]["sex"] == "male":
            male_charges_total += float(dictionary[user]["charges"])
        elif dictionary[user]["sex"] == "female":
            female_charges_total += float(dictionary[user]["charges"])
        else: 
            continue
    average_male_charge = male_charges_total / male_total
    average_female_charge = female_charges_total / female_total
    print("Males on average are charged $" + str(round(average_male_charge, 2)))
    print("Females on average are charged $" + str(round(average_female_charge, 2)))
    difference_in_charge = average_male_charge - average_female_charge
    print("Men pay " + str(round(difference_in_charge, 2)) + " dollars more than women" )
    percentage = round((difference_in_charge / average_female_charge) * 100 ,2)
    print("Men pay " + str(percentage) + "% more than women")
    return [male_charges_total, female_charges_total, average_male_charge, average_female_charge, difference_in_charge, percentage]

sex_bias(insurance_data_dict)

Males on average are charged $13956.75
Females on average are charged $12569.58
Men pay 1387.17 dollars more than women
Men pay 11.04% more than women


[9434763.796139995,
 8321061.194618994,
 13956.751177721886,
 12569.57884383534,
 1387.1723338865468,
 11.04]

#### Conclusion

Based on sex only we see males pay on average 11% more than females. We will need to cross reference with our other demographics to determine if this is true.

### BMI Bias based on BMI Only

##### Determine how much each BMI category is charged on average and the rate of change between those averages

In [497]:
def bmi_bias(dictionary):
    totals = {
            'underweight': 0,
            'healthy': 0,
            'overweight': 0,
            'obesity1': 0,
            'obesity2': 0,
            'obesity3': 0
        }
    total_charges = {
            'underweight': 0,
            'healthy': 0,
            'overweight': 0,
            'obesity1': 0,
            'obesity2': 0,
            'obesity3': 0
        }
    for user in dictionary:
        bmi = float(dictionary[user]['bmi'])
        charge = round(float(dictionary[user]['charges']), 2)
        if bmi < 18.5:
            totals['underweight'] += 1
            total_charges['underweight'] += charge
        elif bmi >= 18.5 and bmi < 25:
            totals['healthy'] += 1
            total_charges['healthy'] += charge
        elif bmi >= 25 and bmi < 30:
            totals['overweight'] += 1
            total_charges['overweight'] += charge
        elif bmi >= 30 and bmi < 35:
            totals['obesity1'] += 1
            total_charges['obesity1'] += charge
        elif bmi >= 35 and bmi < 40:
            totals['obesity2'] += 1
            total_charges['obesity2'] += charge
        elif bmi >= 40: 
            totals['obesity3'] += 1
            total_charges['obesity3'] += charge
        else: 
            continue
    averages = {
        'underweight': round(total_charges["underweight"] / totals["underweight"], 2),
        'healthy': round(total_charges["healthy"] / totals["healthy"], 2),
        'overweight': round(total_charges["overweight"] / totals["overweight"], 2),
        'obesity1': round(total_charges["obesity1"] / totals["obesity1"], 2),
        'obesity2': round(total_charges["obesity2"] / totals["obesity2"], 2),
        'obesity3': round(total_charges["obesity3"] / totals["obesity3"], 2),
    }
    percentages = {
        'underweight': round((totals['underweight'] / len(bmis)) * 100, 2),
        'healthy': round((totals['healthy'] / len(bmis)) * 100, 2),
        'overweight': round((totals['overweight'] / len(bmis)) * 100, 2),
        'obesity1': round((totals['obesity1'] / len(bmis)) * 100, 2),
        'obesity2': round((totals['obesity2'] / len(bmis)) * 100, 2),
        'obesity3': round((totals['obesity3'] / len(bmis)) * 100, 2)
    }
    return [totals, total_charges, averages, percentages]

bmi_totals = bmi_bias(insurance_data_dict)[0]
print('BMI totals: ' + str(bmi_totals))
bmi_total_charges = bmi_bias(insurance_data_dict)[1]
print('BMI total charges: ' + str(bmi_total_charges))
bmi_average_charge = bmi_bias(insurance_data_dict)[2]
print('BMI average charge: ' + str(bmi_average_charge))
bmi_percentages_per_group = bmi_bias(insurance_data_dict)[3]
print('Percent of Dataset: ' + str(bmi_percentages_per_group) )

BMI totals: {'underweight': 20, 'healthy': 225, 'overweight': 386, 'obesity1': 391, 'obesity2': 225, 'obesity3': 91}
BMI total charges: {'underweight': 177044.03000000003, 'healthy': 2342100.9699999997, 'overweight': 4241178.740000005, 'obesity1': 5638092.950000008, 'obesity2': 3830008.259999999, 'obesity3': 1527400.0599999998}
BMI average charge: {'underweight': 8852.2, 'healthy': 10409.34, 'overweight': 10987.51, 'obesity1': 14419.68, 'obesity2': 17022.26, 'obesity3': 16784.62}
Percent of Dataset: {'underweight': 1.49, 'healthy': 16.82, 'overweight': 28.85, 'obesity1': 29.22, 'obesity2': 16.82, 'obesity3': 6.8}


#### Conclusion

For BMIS we see an increase in charges based on size, with underweight paying the smallest amount. Obesity Class 1 to 2 is a more expensive jump than obesity 2 to 3. Are underweight individuals being rewarded for being an unhealhty weight? What other demographics could be informing these numbers. 

### Number of Children Bias Based on Number of Children Only

##### Determine average charge per child size and average cost increase per child added

In [506]:
def children_bias(dictionary):
    totals = {'childless': 0, 'one_child': 0, 'two_children': 0, 'three_children': 0, 'four_children': 0, 'five_children': 0}
    total_charges = {'childless': 0, 'one_child': 0, 'two_children': 0, 'three_children': 0, 'four_children': 0, 'five_children': 0}
    for user in dictionary:
        children = int(dictionary[user]["children"])
        charges = float(dictionary[user]["charges"])
        if children == 0:
            totals['childless'] += 1
            total_charges['childless'] += charges
        elif children == 1:
            totals['one_child'] += 1
            total_charges['one_child'] += charges
        elif children == 2:
            totals['two_children'] += 1
            total_charges['two_children'] += charges
        elif children == 3:
            totals['three_children'] += 1
            total_charges['three_children'] += charges
        elif children == 4:
            totals['four_children'] += 1
            total_charges['four_children'] += charges
        elif children == 5:
            totals['five_children'] += 1
            total_charges['five_children'] += charges
        else:
            continue
    averages = {
        "childless" : round(total_charges['childless'] / totals['childless'], 2),
        "one_child": round(total_charges['one_child'] / totals['one_child'], 2),
        "two_children": round(total_charges['two_children'] / totals['two_children'], 2),
        "three_children": round(total_charges['three_children'] / totals['three_children'], 2),
        "four_children": round(total_charges['four_children'] / totals['four_children'], 2),
        "five_children": round(total_charges['five_children'] / totals['five_children'], 2)
    }
    per_child = {
        "childless": averages["childless"],
        "one_child": round(averages["one_child"] / 2, 2),
        "two_children": round(averages["two_children"] / 3, 2),
        "three_children": round(averages["three_children"] / 4, 2),
        "four_children": round(averages["four_children"] / 5, 2),
        "five_children": round(averages["five_children"] /6, 2)
    }
    difference_per_child = {
        'one_child': round(per_child["childless"] - per_child['one_child'], 2),
        'two_children' : round(per_child['one_child'] - per_child['two_children'], 2),
        'three_children' : round(per_child['two_children'] - per_child['three_children'], 2),
        'four_children' : round(per_child['three_children'] - per_child['four_children'], 2),
        'five_children' : round(per_child['four_children'] - per_child['five_children'], 2)
    }
    percentages = {
        'childless': round((totals['childless'] / len(num_of_children)) * 100, 2),
        'one_child': round((totals['one_child'] / len(num_of_children)) * 100, 2),
        'two_children': round((totals['two_children'] / len(num_of_children)) * 100, 2),
        'three_children': round((totals['three_children'] / len(num_of_children)) * 100, 2),
        'four_children': round((totals['four_children'] / len(num_of_children)) * 100, 2),
        'five_children': round((totals['five_children'] / len(num_of_children)) * 100, 2),
    }
    return [totals, total_charges, averages, per_child, difference_per_child, percentages]
    
children_totals = children_bias(insurance_data_dict)[0]
print('Total Families with x num of children : ' + str(children_totals))

children_total_charges = children_bias(insurance_data_dict)[1]
print('Total Charges for families with x num of children: ' + str(children_total_charges))

children_average_charges = children_bias(insurance_data_dict)[2]
print('Average Charge for families with x num of children: ' + str(children_average_charges))

average_cost_per_child = children_bias(insurance_data_dict)[3]
print('Average Cost Per Child: ' + str(average_cost_per_child))

cost_differential_per_added_child = children_bias(insurance_data_dict)[4]
print('Cost Differential Per Added Child: ' + str(cost_differential_per_added_child))

children_percentages_per_group = children_bias(insurance_data_dict)[5]
print('Percentage of dataset: ' + str(children_percentages_per_group)) 

Total Families with x num of children : {'childless': 574, 'one_child': 324, 'two_children': 240, 'three_children': 157, 'four_children': 25, 'five_children': 18}
Total Charges for families with x num of children: {'childless': 7098069.995338997, 'one_child': 4124899.673449997, 'two_children': 3617655.296149999, 'three_children': 2410784.983589999, 'four_children': 346266.40777999995, 'five_children': 158148.63445}
Average Charge for families with x num of children: {'childless': 12365.98, 'one_child': 12731.17, 'two_children': 15073.56, 'three_children': 15355.32, 'four_children': 13850.66, 'five_children': 8786.04}
Average Cost Per Child: {'childless': 12365.98, 'one_child': 6365.59, 'two_children': 5024.52, 'three_children': 3838.83, 'four_children': 2770.13, 'five_children': 1464.34}
Cost Differential Per Added Child: {'one_child': 6000.39, 'two_children': 1341.07, 'three_children': 1185.69, 'four_children': 1068.7, 'five_children': 1305.79}
Percentage of dataset: {'childless': 42.

##### Conclusion

Here we see price per child decreasing with addition of each child except the jump from 4 to 5. There are only 18 peopls in the dataset with 5 children do we have an outlier there that is skewing the data?

### Region Bias Based on Region Only

##### Determine average charge per region

In [511]:
def region_bias(dictionary):
    totals = {
        "southwest": 0,
        "southeast": 0,
        "northwest": 0,
        "northeast": 0
    }
    total_charges = {
        "southwest": 0,
        "southeast": 0,
        "northwest": 0,
        "northeast": 0
    }
    for user in dictionary:
        region = dictionary[user]["region"]
        charge = round(float(dictionary[user]['charges']), 2)
        if region == 'southwest':
            totals['southwest'] += 1
            total_charges['southwest'] += charge
        elif region == 'southeast':
            totals['southeast'] += 1
            total_charges['southeast'] += charge
        elif region == 'northwest':
            totals['northwest'] += 1
            total_charges['northwest'] += charge
        elif region == 'northeast':
            totals['northeast'] += 1
            total_charges['northeast'] += charge
        else:
            continue
    averages = {
        'southwest': round(total_charges['southwest'] / totals['southwest'], 2),
        'southeast': round(total_charges['southeast'] / totals['southeast'], 2),
        'northwest': round(total_charges['northwest'] / totals['northwest'], 2),
        'northeast': round(total_charges['northeast'] / totals['northeast'], 2)
    }

    return [totals, total_charges, averages]

region_totals = region_bias(insurance_data_dict)[0]
print('Region Totals: ' + str(region_totals))
region_total_charges = region_bias(insurance_data_dict)[1]
print('Region Total Charges: ' + str(region_total_charges))
region_average_charge = region_bias(insurance_data_dict)[2]
print('Region Average Charge: ' + str(region_average_charge))

Region Totals: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
Region Total Charges: {'southwest': 4012754.69, 'southeast': 5363689.780000005, 'northwest': 4035711.93, 'northeast': 4343668.610000001}
Region Average Charge: {'southwest': 12346.94, 'southeast': 14735.41, 'northwest': 12417.58, 'northeast': 13406.38}


#### Conclusion

The highest average cost is in the southeast and we have extra data for this set, does that set have higher percentages of anything that could be driving those prices up?

### Smoker Bias Based on Smoking Status Only

20% of the dataset are smokers, that imbalance makes this variable most likely to impact other conclusions. 

##### Determine average smoker and non-smoker charges and the differential between the two.

In [523]:
def smoker_bias(dictionary):
    total_smoker_charges = 0
    total_non_smoker_charges = 0
    for user in dictionary:
        if dictionary[user]["smoker"] == "yes":
            total_smoker_charges += float(dictionary[user]["charges"])
        elif dictionary[user]["smoker"] == "no":
            total_non_smoker_charges += float(dictionary[user]["charges"])
    average_smoker_charge = total_smoker_charges / total_smokers
    average_non_smoker_charge = total_non_smoker_charges / total_non_smokers
    difference = round(average_smoker_charge - average_non_smoker_charge)
    mod = average_smoker_charge / average_non_smoker_charge
    return [total_smoker_charges, total_non_smoker_charges, average_smoker_charge, average_non_smoker_charge, difference, mod]


total_smoker_charges = smoker_bias(insurance_data_dict)[0]
print('Total Smoker Charges: ' + str(total_smoker_charges))
total_non_smoker_charges = smoker_bias(insurance_data_dict)[1]
print('Total Non-Smoker Charges: ' + str(total_non_smoker_charges))
average_smoker_charge = smoker_bias(insurance_data_dict)[2]
print('Average Smoker Charge: ' + str(average_smoker_charge))
average_non_smoker_charge = smoker_bias(insurance_data_dict)[3]
print('Average Non-Smoker Charge: ' + str(average_non_smoker_charge))

difference_between_smoker_and_non_charge = smoker_bias(insurance_data_dict)[4]
non_smokers_for_the_cost_of_one_smoker = smoker_bias(insurance_data_dict)[5]
print("On average smokers pay $" + str(difference_between_smoker_and_non_charge) + " more than non-smokers.")
print("For the price of one smoker you could pay for insurance for " + str(round(non_smokers_for_the_cost_of_one_smoker,1)) + " non-smokers.")

Total Smoker Charges: 8781763.52184
Total Non-Smoker Charges: 8974061.468918996
Average Smoker Charge: 32050.23183153285
Average Non-Smoker Charge: 8434.268297856199
On average smokers pay $23616 more than non-smokers.
For the price of one smoker you could pay for insurance for 3.8 non-smokers.


#### Conclusion

This is the biggest bias we've seen so far! 

For the price of one smoker you could pay for insurance for 3.8 non-smokers. Smokers, on average, pay $23616 more than nonsmokers. 

How does this difference affect our other categories?

## Part 4 - Bias Determined by Multiple Variables


### Smoker Bias effect on other Demographics

##### Determine Amount of Smokers in Each Demographic

In [532]:
def sort_by_smoker(dictionary):
    smoker_dict_totals = {
        'age': {
            'young adult' : 0,
            '20s': 0,
            '30s': 0,
            '40s': 0,
            '50s': 0,
            '60s': 0
        },
        'sex': {
            'male': 0,
            'female': 0
        },
        'bmi': {
            'underweight': 0,
            'healthy': 0,
            'overweight': 0,
            'obesity1': 0,
            'obesity2': 0,
            'obesity3': 0
        },
        'children': {
            'childless': 0,
            'one_child': 0,
            'two_children': 0,
            'three_children': 0,
            'four_children': 0,
            'five_children': 0
        },
        'region': {
            "southwest": 0,
            "southeast": 0,
            "northwest": 0,
            "northeast": 0
        }
    }
    #age sort
    def by_age(user):
        age = int(dictionary[user]['age'])
        if age >= 18 and age < 20:
            smoker_dict_totals['age']['young adult'] += 1
        elif age >= 20 and age < 30:
            smoker_dict_totals['age']['20s'] += 1
        elif age >= 30 and age < 40:
            smoker_dict_totals['age']['30s'] += 1
        elif age >= 40 and age < 50:
            smoker_dict_totals['age']['40s'] += 1
        elif age >= 50 and age < 60:
            smoker_dict_totals['age']['50s'] += 1
        elif age >= 60 and age < 65:
            smoker_dict_totals['age']['60s'] += 1
    #sex sort
    def by_sex(user):
        sex = dictionary[user]['sex']
        if sex == 'male': 
            smoker_dict_totals['sex']['male'] += 1
        elif sex == 'female':
            smoker_dict_totals['sex']['female'] += 1
    #bmi sort
    def by_bmi(user):
        bmi = float(dictionary[user]['bmi'])
        if bmi < 18.5:
            smoker_dict_totals['bmi']['underweight'] += 1
        elif bmi >= 18.5 and bmi < 25:
            smoker_dict_totals['bmi']['healthy'] += 1
        elif bmi >= 25 and bmi < 30:
            smoker_dict_totals['bmi']['overweight'] += 1
        elif bmi >= 30 and bmi < 35:
            smoker_dict_totals['bmi']['obesity1'] += 1
        elif bmi >= 35 and bmi < 40:
            smoker_dict_totals['bmi']['obesity2'] += 1
        elif bmi > 40: 
            smoker_dict_totals['bmi']['obesity3'] += 1
    #children sort
    def by_children(user):
        children = int(dictionary[user]['children'])
        if children == 0:
            smoker_dict_totals['children']['childless'] += 1
        elif children == 1:
            smoker_dict_totals['children']['one_child'] += 1
        elif children == 2:
            smoker_dict_totals['children']['two_children'] += 1
        elif children == 3:
            smoker_dict_totals['children']['three_children'] += 1
        elif children == 4: 
            smoker_dict_totals['children']['four_children'] += 1
        elif children == 5:
            smoker_dict_totals['children']['five_children'] += 1
    #region sort
    def by_region(user):
        region = dictionary[user]['region']
        if region == 'southwest':
            smoker_dict_totals['region']['southwest'] += 1
        elif region == 'southeast':
            smoker_dict_totals['region']['southeast'] += 1
        elif region == 'northwest':
            smoker_dict_totals['region']['northwest'] += 1
        elif region == 'northeast':
            smoker_dict_totals['region']['northeast'] += 1
    for person in dictionary:
        if dictionary[person]['smoker'] == 'yes':
            by_age(person)
            by_sex(person)
            by_bmi(person)
            by_children(person)
            by_region(person)
    return smoker_dict_totals

smokers_only = sort_by_smoker(insurance_data_dict)
print('Smokers Only Dictionary: ' + str(smokers_only))

Smokers Only Dictionary: {'age': {'young adult': 30, '20s': 56, '30s': 58, '40s': 62, '50s': 41, '60s': 27}, 'sex': {'male': 159, 'female': 115}, 'bmi': {'underweight': 5, 'healthy': 50, 'overweight': 74, 'obesity1': 74, 'obesity2': 50, 'obesity3': 21}, 'children': {'childless': 115, 'one_child': 61, 'two_children': 55, 'three_children': 39, 'four_children': 3, 'five_children': 1}, 'region': {'southwest': 58, 'southeast': 91, 'northwest': 58, 'northeast': 67}}


### Confirm Age Bias

##### Determine if Smokers Account for the inconsistencies in charges increasing by age

In [365]:
## find percentages of total count so we can properly compare to other statistics
smokers_age = smokers_only['age']
age_smoker_percentages = {
    'young adult': round((smokers_age['young adult'] / age_totals['young adult']) * 100, 2),
    '20s': round((smokers_age['20s'] / age_totals['20s']) * 100, 2),
    '30s': round((smokers_age['30s'] / age_totals['30s']) * 100, 2),
    '40s': round((smokers_age['40s'] / age_totals['40s']) * 100, 2),
    '50s': round((smokers_age['50s'] / age_totals['50s']) * 100, 2),
    '60s': round((smokers_age['60s'] / age_totals['60s']) * 100, 2),
}
print(age_smoker_percentages)


{'young adult': 21.9, '20s': 20.0, '30s': 22.57, '40s': 22.22, '50s': 15.13, '60s': 23.68}


#### Conclusion

50 year olds pay less than expected and now we see that is probably due too the fact that only 15% of 50 year olds smoke compared to the average 20%, we can then comfortably conclude that insurance rates get consistentley more expensive with age.

### Confirm Sex Bias

##### Determine if smoking is impacting our current conclusion that males are charged more for insurance than females

#### Part 1: Percentages of Men and Women who are Smokers

In [540]:
smokers_sex = smokers_only['sex']
sex_smoker_percentages = {
    'male': round((smokers_sex['male'] / male_total) * 100, 2),
    'female': round((smokers_sex['female'] / female_total) * 100, 2)
}
print('Smoker Percentages: ' + str(sex_smoker_percentages))
difference_in_sexes = sex_smoker_percentages['male'] - sex_smoker_percentages['female']
print('Difference: ' + str(difference_in_sexes))

Smoker Percentages: {'male': 23.52, 'female': 17.37}
Difference: 6.149999999999999


There is a 6% difference in men smoking compared to women smoking. Men paid 11% more on average, lets test the average cost of non-smokers between women and men to see if this is enough to even it out or if another factor could be affecting the gender difference besides smoking.

#### Part 2: Average Charges for Non-Smokers grouped by Sex

In [543]:
def sex_non_smoker(dictionary):
    totals = {
        'male': 0,
        'female': 0
    }
    total_charges = {
        'male': 0,
        'female': 0
    }
    for user in dictionary:
        if dictionary[user]['smoker'] == 'no':
            if dictionary[user]['sex'] == 'male':
                totals['male'] += 1
                total_charges['male'] += round(float(dictionary[user]['charges']), 2)
            elif dictionary[user]['sex'] == 'female':
                totals['female'] += 1
            total_charges['female'] += round(float(dictionary[user]['charges']), 2)
        else: continue
    averages = {
        'male': round(total_charges['male'] /  totals['male'], 2),
        'female': round(total_charges['female'] /  totals['female'], 2)
    }
    return [totals, total_charges, averages]

non_smoker_sex_totals = sex_non_smoker(insurance_data_dict)[0]
print('Non_smoker Totals by Sex: ' + str(non_smoker_sex_totals))
non_smoker_charges_sex = sex_non_smoker(insurance_data_dict)[1]
print('Non_smoker Total Charges by Sex: ' + str(non_smoker_charges_sex))
non_smoker_sex_averages = sex_non_smoker(insurance_data_dict)[2]
print('Non_smoker Average Charges by Sex: ' + str(non_smoker_sex_averages))

Non_smoker Totals by Sex: {'male': 517, 'female': 547}
Non_smoker Total Charges by Sex: {'male': 4181084.9200000004, 'female': 8974061.470000003}
Non_smoker Average Charges by Sex: {'male': 8087.2, 'female': 16405.96}


When we run average cost between the sexes on just non-smokers women pay nearly double what men pay!! Now the question is is there another factor that could be affecting what women pay? More children? age? higher bmi?

#### Part 3: Determine Percentages of Each Category that are Non-smoking Females 

In [550]:
def female_non_smoker_stats(dictionary):
    totals = {
        'age': {
            'young adult' : 0,
            '20s': 0,
            '30s': 0,
            '40s': 0,
            '50s': 0,
            '60s': 0
        },
        'bmi': {
            'underweight': 0,
            'healthy': 0,
            'overweight': 0,
            'obesity1': 0,
            'obesity2': 0,
            'obesity3': 0
        },
        'children': {
            'childless': 0,
            'one_child': 0,
            'two_children': 0,
            'three_children': 0,
            'four_children': 0,
            'five_children': 0
        },
        'region': {
            "southwest": 0,
            "southeast": 0,
            "northwest": 0,
            "northeast": 0
        }
    }
      #age sort
    def by_age(user):
        age = int(dictionary[user]['age'])
        if age >= 18 and age < 20:
            totals['age']['young adult'] += 1
        elif age >= 20 and age < 30:
            totals['age']['20s'] += 1
        elif age >= 30 and age < 40:
            totals['age']['30s'] += 1
        elif age >= 40 and age < 50:
            totals['age']['40s'] += 1
        elif age >= 50 and age < 60:
            totals['age']['50s'] += 1
        elif age >= 60 and age < 65:
            totals['age']['60s'] += 1
    #bmi sort
    def by_bmi(user):
        bmi = float(dictionary[user]['bmi'])
        if bmi < 18.5:
            totals['bmi']['underweight'] += 1
        elif bmi >= 18.5 and bmi < 25:
            totals['bmi']['healthy'] += 1
        elif bmi >= 25 and bmi < 30:
            totals['bmi']['overweight'] += 1
        elif bmi >= 30 and bmi < 35:
            totals['bmi']['obesity1'] += 1
        elif bmi >= 35 and bmi < 40:
            totals['bmi']['obesity2'] += 1
        elif bmi > 40: 
            totals['bmi']['obesity3'] += 1
    #children sort
    def by_children(user):
        children = int(dictionary[user]['children'])
        if children == 0:
            totals['children']['childless'] += 1
        elif children == 1:
            totals['children']['one_child'] += 1
        elif children == 2:
            totals['children']['two_children'] += 1
        elif children == 3:
            totals['children']['three_children'] += 1
        elif children == 4: 
            totals['children']['four_children'] += 1
        elif children == 5:
            totals['children']['five_children'] += 1
    #region sort
    def by_region(user):
        region = dictionary[user]['region']
        if region == 'southwest':
            totals['region']['southwest'] += 1
        elif region == 'southeast':
            totals['region']['southeast'] += 1
        elif region == 'northwest':
            totals['region']['northwest'] += 1
        elif region == 'northeast':
            totals['region']['northeast'] += 1
    for person in dictionary:
        if dictionary[person]['smoker'] == 'no' and dictionary[person]['sex'] == 'female':
            by_age(person)
            by_bmi(person)
            by_children(person)
            by_region(person)
    return totals

female_non_smoker_dict = female_non_smoker_stats(insurance_data_dict)
print('Female Non-Smoker Dictionary: ' + str(female_non_smoker_dict))

Female Non-Smoker Dictionary: {'age': {'young adult': 53, '20s': 111, '30s': 105, '40s': 111, '50s': 122, '60s': 45}, 'bmi': {'underweight': 8, 'healthy': 89, 'overweight': 168, 'obesity1': 160, 'obesity2': 89, 'obesity3': 33}, 'children': {'childless': 236, 'one_child': 133, 'two_children': 97, 'three_children': 63, 'four_children': 11, 'five_children': 7}, 'region': {'southwest': 141, 'southeast': 139, 'northwest': 135, 'northeast': 132}}


Now we known the total number of non-smoking females in each category, what percentage of the categories are these individuals

In [553]:
female_non_smoker_percentages = {
        'age': {
            'young adult' : round(female_non_smoker_dict['age']['young adult'] /  age_totals['young adult'] * 100, 2),
            '20s': round(female_non_smoker_dict['age']['20s'] /  age_totals['20s'] * 100, 2),
            '30s': round(female_non_smoker_dict['age']['30s'] /  age_totals['30s'] * 100, 2),
            '40s': round(female_non_smoker_dict['age']['40s'] /  age_totals['40s'] * 100, 2),
            '50s': round(female_non_smoker_dict['age']['50s'] /  age_totals['50s'] * 100, 2),
            '60s': round(female_non_smoker_dict['age']['60s'] /  age_totals['60s'] * 100, 2),
        },
        'bmi': {
            'underweight': round(female_non_smoker_dict['bmi']['underweight'] /  bmi_totals['underweight'] * 100, 2),
            'healthy': round(female_non_smoker_dict['bmi']['healthy'] /  bmi_totals['healthy'] * 100, 2),
            'overweight': round(female_non_smoker_dict['bmi']['overweight'] /  bmi_totals['overweight'] * 100, 2),
            'obesity1': round(female_non_smoker_dict['bmi']['obesity1'] /  bmi_totals['obesity1'] * 100, 2),
            'obesity2': round(female_non_smoker_dict['bmi']['obesity2'] /  bmi_totals['obesity2'] * 100, 2),
            'obesity3': round(female_non_smoker_dict['bmi']['obesity3'] /  bmi_totals['obesity3'] * 100, 2),
        },
        'children': {
            'childless': round(female_non_smoker_dict['children']['childless'] /  children_totals['childless'] * 100, 2),
            'one_child': round(female_non_smoker_dict['children']['one_child'] /  children_totals['one_child'] * 100, 2),
            'two_children': round(female_non_smoker_dict['children']['two_children'] /  children_totals['two_children'] * 100, 2),
            'three_children': round(female_non_smoker_dict['children']['three_children'] /  children_totals['three_children'] * 100, 2),
            'four_children': round(female_non_smoker_dict['children']['four_children'] /  children_totals['four_children'] * 100, 2),
            'five_children': round(female_non_smoker_dict['children']['five_children'] /  children_totals['five_children'] * 100, 2),
        },
        'region': {
            "southwest": round(female_non_smoker_dict['region']['southwest'] /  region_totals['southwest'] * 100, 2),
            "southeast": round(female_non_smoker_dict['region']['southeast'] /  region_totals['southeast'] * 100, 2),
            "northwest": round(female_non_smoker_dict['region']['northwest'] /  region_totals['northwest'] * 100, 2),
            "northeast": round(female_non_smoker_dict['region']['northeast'] /  region_totals['northeast'] * 100, 2),
        }
}
print('Female non-smoker percentages: '+ str(female_non_smoker_percentages))

Female non-smoker percentages: {'age': {'young adult': 38.69, '20s': 39.64, '30s': 40.86, '40s': 39.78, '50s': 45.02, '60s': 39.47}, 'bmi': {'underweight': 40.0, 'healthy': 39.56, 'overweight': 43.52, 'obesity1': 40.92, 'obesity2': 39.56, 'obesity3': 36.26}, 'children': {'childless': 41.11, 'one_child': 41.05, 'two_children': 40.42, 'three_children': 40.13, 'four_children': 44.0, 'five_children': 38.89}, 'region': {'southwest': 43.38, 'southeast': 38.19, 'northwest': 41.54, 'northeast': 40.74}}


#### Conclusion

I did a quick little check and ran women smoker stats for just women stats to double check that there was an even distribution between men and women in the general population per category and the only category significatly above 50% was for underweight bmi, which wouldn't account for higher insurance costs. 

If the percentage is more than 10% away from 50, that is an atypical amount of non-smoking women in that category and could be wny we see woman's prices double compared to non smoking men. None of these are much more than 10% away, the one that are would mean that women are actually being charges less. So we can conclude there is a bias against women when calculating insurance costs. 

### Confirm Underweight BMI bias

##### Determine other factors that make the underweight average charge so low

We just learned that there were more women underweight than men and considering men get charged more that means underweight is accounting for even less of a cost. Is there another category that could be warping the underweight to be so cheap. Is there a higher percentage of non-smokers? higher percentage of young people?

#### Part 1: Sort Dictionary for Underweight Only

In [561]:
def underweight_stats(dictionary): 
    total_underweight = 0
    totals = { 
    'age': {
            'young adult' : 0,
            '20s': 0,
            '30s': 0,
            '40s': 0,
            '50s': 0,
            '60s': 0
        },
        'sex': {
            'male': 0,
            'female': 0
        },
        'children': {
            'childless': 0,
            'one_child': 0,
            'two_children': 0,
            'three_children': 0,
            'four_children': 0,
            'five_children': 0
        },
        'smoker': {
            'smoker': 0,
            'non-smoker': 0
        },
        'region': {
            "southwest": 0,
            "southeast": 0,
            "northwest": 0,
            "northeast": 0
        }
    }
    #age sort
    def by_age(user):
        age = int(dictionary[user]['age'])
        if age >= 18 and age < 20:
            totals['age']['young adult'] += 1
        elif age >= 20 and age < 30:
            totals['age']['20s'] += 1
        elif age >= 30 and age < 40:
            totals['age']['30s'] += 1
        elif age >= 40 and age < 50:
            totals['age']['40s'] += 1
        elif age >= 50 and age < 60:
            totals['age']['50s'] += 1
        elif age >= 60 and age < 65:
            totals['age']['60s'] += 1
    #sex sort
    def by_sex(user):
         sex = dictionary[user]['sex']
         if sex == 'male':
             totals['sex']['male'] += 1
         elif sex == 'female':
                totals['sex']['female'] += 1
    #children sort
    def by_children(user):
        children = int(dictionary[user]['children'])
        if children == 0:
            totals['children']['childless'] += 1
        elif children == 1:
            totals['children']['one_child'] += 1
        elif children == 2:
            totals['children']['two_children'] += 1
        elif children == 3:
            totals['children']['three_children'] += 1
        elif children == 4: 
            totals['children']['four_children'] += 1
        elif children == 5:
            totals['children']['five_children'] += 1
    #smoker sort
    def by_smoker(user):
        smoker = dictionary[user]['smoker']
        if smoker == 'yes':
            totals['smoker']['smoker'] += 1
        elif smoker == 'no':
            totals['smoker']['non-smoker'] += 1
    #region sort
    def by_region(user):
        region = dictionary[user]['region']
        if region == 'southwest':
            totals['region']['southwest'] += 1
        elif region == 'southeast':
            totals['region']['southeast'] += 1
        elif region == 'northwest':
            totals['region']['northwest'] += 1
        elif region == 'northeast':
            totals['region']['northeast'] += 1
    for person in dictionary:
        if float(dictionary[person]['bmi']) < 18.5:
            total_underweight += 1
            by_age(person)
            by_sex(person)
            by_children(person)
            by_smoker(person)
            by_region(person)
    return [totals, total_underweight]

underweight_totals = underweight_stats(insurance_data_dict)[0]
all_underweight = underweight_stats(insurance_data_dict)[1]
print('Underweight Only: ' + str(underweight_totals))

Underweight Only: {'age': {'young adult': 4, '20s': 7, '30s': 5, '40s': 0, '50s': 3, '60s': 1}, 'sex': {'male': 8, 'female': 12}, 'children': {'childless': 9, 'one_child': 4, 'two_children': 6, 'three_children': 0, 'four_children': 0, 'five_children': 1}, 'smoker': {'smoker': 5, 'non-smoker': 15}, 'region': {'southwest': 3, 'southeast': 0, 'northwest': 7, 'northeast': 10}}


#### Part 2: Compare Underweight Percentage of Population in a Given Category with the Percentage of People in that Category in the General Population

In [378]:
young_adult_full_population_percentage = age_percentages_per_group['young adult']
young_adult_underweight_percentage = round((underweight_totals['age']['young adult'] / all_underweight) * 100, 2)

print("Young adult people comprise " + str(young_adult_full_population_percentage) + "% of the general population")
print("Young adult people comprise " + str(young_adult_underweight_percentage) + "% of the underweight population")

Young adult people compromise 10.24% of the general population
Young adult people compromise 20.0% of the underweight population


This could account for the low cost in the underweight population, since we know young adults are the age category that pays the lowest amount.

In [380]:
childless_full_population_percentage = children_percentages_per_group['childless']
childless_underweight_percentage = round((underweight_totals['children']['childless'] / all_underweight) * 100, 2)

print("Childless people comprise " + str(childless_full_population_percentage) + "% of the general population")
print("Childless people comprise " + str(childless_underweight_percentage) + "% of the underweight population")

Childless people comprimse 42.9% of the general population
Childless people comprise45.0% of the underweight population


There is just a 2% difference between general childless, and underweight childless, this isn't very statistically significant, age is a bigger factor.

#### Conclusion

Age seems to be the biggest reason underweight costs are so low. A higher percentage of young people are underweight. The underweight percentages also showed a big region discrepency, I need to confirm regional bias before I can conclude whether or not that is also impacting underweight average charges.

### Confirm Obese BMI bias

##### Determine why there is less of a jump between Obesity Classes 2 and 3 than 1 & 2

#### Sort Dictionary by Obesity Class

In [578]:
def obesity_stats(dictionary, class_num): 
    total_obesity = 0
    totals = { 
    'age': {
            'young adult' : 0,
            '20s': 0,
            '30s': 0,
            '40s': 0,
            '50s': 0,
            '60s': 0
        },
        'sex': {
            'male': 0,
            'female': 0
        },
        'children': {
            'childless': 0,
            'one_child': 0,
            'two_children': 0,
            'three_children': 0,
            'four_children': 0,
            'five_children': 0
        },
        'smoker': {
            'smoker': 0,
            'non-smoker': 0
        },
        'region': {
            "southwest": 0,
            "southeast": 0,
            "northwest": 0,
            "northeast": 0
        }
    }
    #age sort
    def by_age(user):
        age = int(dictionary[user]['age'])
        if age >= 18 and age < 20:
            totals['age']['young adult'] += 1
        elif age >= 20 and age < 30:
            totals['age']['20s'] += 1
        elif age >= 30 and age < 40:
            totals['age']['30s'] += 1
        elif age >= 40 and age < 50:
            totals['age']['40s'] += 1
        elif age >= 50 and age < 60:
            totals['age']['50s'] += 1
        elif age >= 60 and age < 65:
            totals['age']['60s'] += 1
    #sex sort
    def by_sex(user):
         sex = dictionary[user]['sex']
         if sex == 'male':
             totals['sex']['male'] += 1
         elif sex == 'female':
                totals['sex']['female'] += 1
    #children sort
    def by_children(user):
        children = int(dictionary[user]['children'])
        if children == 0:
            totals['children']['childless'] += 1
        elif children == 1:
            totals['children']['one_child'] += 1
        elif children == 2:
            totals['children']['two_children'] += 1
        elif children == 3:
            totals['children']['three_children'] += 1
        elif children == 4: 
            totals['children']['four_children'] += 1
        elif children == 5:
            totals['children']['five_children'] += 1
    #smoker sort
    def by_smoker(user):
        smoker = dictionary[user]['smoker']
        if smoker == 'yes':
            totals['smoker']['smoker'] += 1
        elif smoker == 'no':
            totals['smoker']['non-smoker'] += 1
    #region sort
    def by_region(user):
        region = dictionary[user]['region']
        if region == 'southwest':
            totals['region']['southwest'] += 1
        elif region == 'southeast':
            totals['region']['southeast'] += 1
        elif region == 'northwest':
            totals['region']['northwest'] += 1
        elif region == 'northeast':
            totals['region']['northeast'] += 1
    for person in dictionary:
        bmi = float(dictionary[person]['bmi'])
        if class_num == 1:
            if bmi >= 30 and bmi < 35 :
                total_obesity += 1
                by_age(person)
                by_sex(person)
                by_children(person)
                by_smoker(person)
                by_region(person)
        elif class_num == 2: 
            if bmi >= 35 and bmi < 40:
                total_obesity += 1
                by_age(person)
                by_sex(person)
                by_children(person)
                by_smoker(person)
                by_region(person)
        elif class_num == 3: 
            if bmi >= 40:
                total_obesity += 1
                by_age(person)
                by_sex(person)
                by_children(person)
                by_smoker(person)
                by_region(person)
        else: print('Invalid class number')
    return [totals, total_obesity]

obesity1_totals = obesity_stats(insurance_data_dict, 1)[0]
all_obesity1 = obesity_stats(insurance_data_dict, 1)[1]
print("Obesity Class 1 Dictionary: " + str(obesity1_totals))
print("Obesity Class 1 Total: " + str(all_obesity1))

obesity2_totals = obesity_stats(insurance_data_dict, 2)[0]
all_obesity2 = obesity_stats(insurance_data_dict, 2)[1]
print("Obesity Class 2 Dictionary: " + str(obesity2_totals))
print("Obesity Class 2 Total: " + str(all_obesity2))

obesity3_totals = obesity_stats(insurance_data_dict, 3)[0]
all_obesity3 = obesity_stats(insurance_data_dict, 3)[1]
print("Obesity Class 2 Dictionary: " + str(obesity2_totals))
print("Obesity Class 2 Total: " + str(all_obesity2))

Obesity Class 1 Dictionary: {'age': {'young adult': 40, '20s': 87, '30s': 65, '40s': 78, '50s': 87, '60s': 34}, 'sex': {'male': 204, 'female': 187}, 'children': {'childless': 163, 'one_child': 89, 'two_children': 73, 'three_children': 54, 'four_children': 8, 'five_children': 4}, 'smoker': {'smoker': 74, 'non-smoker': 317}, 'region': {'southwest': 102, 'southeast': 94, 'northwest': 105, 'northeast': 90}}
Obesity Class 1 Total: 391
Obesity Class 2 Dictionary: {'age': {'young adult': 21, '20s': 34, '30s': 39, '40s': 48, '50s': 49, '60s': 34}, 'sex': {'male': 118, 'female': 107}, 'children': {'childless': 98, 'one_child': 54, 'two_children': 45, 'three_children': 25, 'four_children': 2, 'five_children': 1}, 'smoker': {'smoker': 50, 'non-smoker': 175}, 'region': {'southwest': 58, 'southeast': 94, 'northwest': 35, 'northeast': 38}}
Obesity Class 2 Total: 225
Obesity Class 2 Dictionary: {'age': {'young adult': 21, '20s': 34, '30s': 39, '40s': 48, '50s': 49, '60s': 34}, 'sex': {'male': 118, 'f

#### Part 2: Percentages of Subcategories of the population of each Obesity Class and Comparison

In [581]:
obesity1_percentages = {
    'age': {
        'young adult' : round((obesity1_totals['age']['young adult'] / all_obesity1) * 100 ,2),
        '20s' : round((obesity1_totals['age']['20s'] / all_obesity1) * 100 ,2),
        '30s' : round((obesity1_totals['age']['30s'] / all_obesity1) * 100 ,2),
        '40s' : round((obesity1_totals['age']['40s'] / all_obesity1) * 100 ,2),
        '50s' : round((obesity1_totals['age']['50s'] / all_obesity1) * 100 ,2),
        '60s' : round((obesity1_totals['age']['60s'] / all_obesity1) * 100 ,2)
    },
    'sex': {
        'male': round((obesity1_totals['sex']['male'] / all_obesity1) * 100 ,2),
        'female': round((obesity1_totals['sex']['female'] / all_obesity1) * 100 ,2)
    }, 
    'children': {
        'childless': round((obesity1_totals['children']['childless'] / all_obesity1) * 100 ,2),
        'one_child': round((obesity1_totals['children']['one_child'] / all_obesity1) * 100 ,2),
        'two_children': round((obesity1_totals['children']['two_children'] / all_obesity1) * 100 ,2),
        'three_children': round((obesity1_totals['children']['three_children'] / all_obesity1) * 100 ,2),
        'four_children': round((obesity1_totals['children']['four_children'] / all_obesity1) * 100 ,2),
        'five_children': round((obesity1_totals['children']['five_children'] / all_obesity1) * 100 ,2)
    },
    'smoker': {
        'smoker': round((obesity1_totals['smoker']['smoker'] / all_obesity1) * 100 ,2),
        'non-smoker': round((obesity1_totals['smoker']['non-smoker'] / all_obesity1) * 100 ,2)
    },
    'region': {
        'southwest': round((obesity1_totals['region']['southwest'] / all_obesity1) * 100 ,2),
        'southeast': round((obesity1_totals['region']['southeast'] / all_obesity1) * 100 ,2),
        'northwest': round((obesity1_totals['region']['northwest'] / all_obesity1) * 100 ,2),
        'northeast': round((obesity1_totals['region']['northeast'] / all_obesity1) * 100 ,2)
    }
}
print("Obesity Class 1 Percentages: "  + str(obesity1_percentages))

Obesity Class 1 Percentages: {'age': {'young adult': 10.23, '20s': 22.25, '30s': 16.62, '40s': 19.95, '50s': 22.25, '60s': 8.7}, 'sex': {'male': 52.17, 'female': 47.83}, 'children': {'childless': 41.69, 'one_child': 22.76, 'two_children': 18.67, 'three_children': 13.81, 'four_children': 2.05, 'five_children': 1.02}, 'smoker': {'smoker': 18.93, 'non-smoker': 81.07}, 'region': {'southwest': 26.09, 'southeast': 24.04, 'northwest': 26.85, 'northeast': 23.02}}


In [583]:
obesity2_percentages = {
    'age': {
        'young adult' : round((obesity2_totals['age']['young adult'] / all_obesity2) * 100 ,2),
        '20s' : round((obesity2_totals['age']['20s'] / all_obesity2) * 100 ,2),
        '30s' : round((obesity2_totals['age']['30s'] / all_obesity2) * 100 ,2),
        '40s' : round((obesity2_totals['age']['40s'] / all_obesity2) * 100 ,2),
        '50s' : round((obesity2_totals['age']['50s'] / all_obesity2) * 100 ,2),
        '60s' : round((obesity2_totals['age']['60s'] / all_obesity2) * 100 ,2)
    },
    'sex': {
        'male': round((obesity2_totals['sex']['male'] / all_obesity2) * 100 ,2),
        'female': round((obesity2_totals['sex']['female'] / all_obesity2) * 100 ,2)
    }, 
    'children': {
        'childless': round((obesity2_totals['children']['childless'] / all_obesity2) * 100 ,2),
        'one_child': round((obesity2_totals['children']['one_child'] / all_obesity2) * 100 ,2),
        'two_children': round((obesity2_totals['children']['two_children'] / all_obesity2) * 100 ,2),
        'three_children': round((obesity2_totals['children']['three_children'] / all_obesity2) * 100 ,2),
        'four_children': round((obesity2_totals['children']['four_children'] / all_obesity2) * 100 ,2),
        'five_children': round((obesity2_totals['children']['five_children'] / all_obesity2) * 100 ,2)
    },
    'smoker': {
        'smoker': round((obesity2_totals['smoker']['smoker'] / all_obesity2) * 100 ,2),
        'non-smoker': round((obesity2_totals['smoker']['non-smoker'] / all_obesity2) * 100 ,2)
    },
    'region': {
        'southwest': round((obesity2_totals['region']['southwest'] / all_obesity2) * 100 ,2),
        'southeast': round((obesity2_totals['region']['southeast'] / all_obesity2) * 100 ,2),
        'northwest': round((obesity2_totals['region']['northwest'] / all_obesity2) * 100 ,2),
        'northeast': round((obesity2_totals['region']['northeast'] / all_obesity2) * 100 ,2)
    }
}
print("Obesity Class 2 Percentages: "  + str(obesity2_percentages))

Obesity Class 2 Percentages: {'age': {'young adult': 9.33, '20s': 15.11, '30s': 17.33, '40s': 21.33, '50s': 21.78, '60s': 15.11}, 'sex': {'male': 52.44, 'female': 47.56}, 'children': {'childless': 43.56, 'one_child': 24.0, 'two_children': 20.0, 'three_children': 11.11, 'four_children': 0.89, 'five_children': 0.44}, 'smoker': {'smoker': 22.22, 'non-smoker': 77.78}, 'region': {'southwest': 25.78, 'southeast': 41.78, 'northwest': 15.56, 'northeast': 16.89}}


In [388]:
obesity3_percentages = {
    'age': {
        'young adult' : round((obesity3_totals['age']['young adult'] / all_obesity3) * 100 ,2),
        '20s' : round((obesity3_totals['age']['20s'] / all_obesity3) * 100 ,2),
        '30s' : round((obesity3_totals['age']['30s'] / all_obesity3) * 100 ,2),
        '40s' : round((obesity3_totals['age']['40s'] / all_obesity3) * 100 ,2),
        '50s' : round((obesity3_totals['age']['50s'] / all_obesity3) * 100 ,2),
        '60s' : round((obesity3_totals['age']['60s'] / all_obesity3) * 100 ,2)
    },
    'sex': {
        'male': round((obesity3_totals['sex']['male'] / all_obesity3) * 100 ,2),
        'female': round((obesity3_totals['sex']['female'] / all_obesity3) * 100 ,2)
    }, 
    'children': {
        'childless': round((obesity3_totals['children']['childless'] / all_obesity3) * 100 ,2),
        'one_child': round((obesity3_totals['children']['one_child'] / all_obesity3) * 100 ,2),
        'two_children': round((obesity3_totals['children']['two_children'] / all_obesity3) * 100 ,2),
        'three_children': round((obesity3_totals['children']['three_children'] / all_obesity3) * 100 ,2),
        'four_children': round((obesity3_totals['children']['four_children'] / all_obesity3) * 100 ,2),
        'five_children': round((obesity3_totals['children']['five_children'] / all_obesity3) * 100 ,2)
    },
    'smoker': {
        'smoker': round((obesity3_totals['smoker']['smoker'] / all_obesity3) * 100 ,2),
        'non-smoker': round((obesity3_totals['smoker']['non-smoker'] / all_obesity3) * 100 ,2)
    },
    'region': {
        'southwest': round((obesity3_totals['region']['southwest'] / all_obesity3) * 100 ,2),
        'southeast': round((obesity3_totals['region']['southeast'] / all_obesity3) * 100 ,2),
        'northwest': round((obesity3_totals['region']['northwest'] / all_obesity3) * 100 ,2),
        'northeast': round((obesity3_totals['region']['northeast'] / all_obesity3) * 100 ,2)
    }
}
print("Obesity Class 3 Percentages: "  + str(obesity3_percentages))

{'age': {'young adult': 9.89, '20s': 15.38, '30s': 20.88, '40s': 20.88, '50s': 26.37, '60s': 6.59}, 'sex': {'male': 56.04, 'female': 43.96}, 'children': {'childless': 43.96, 'one_child': 21.98, 'two_children': 19.78, 'three_children': 9.89, 'four_children': 2.2, 'five_children': 2.2}, 'smoker': {'smoker': 23.08, 'non-smoker': 76.92}, 'region': {'southwest': 14.29, 'southeast': 60.44, 'northwest': 8.79, 'northeast': 16.48}}


#### Comparison

In [591]:
obesity_percentage_comparison = {
    'age': {
        'young adult' : [age_percentages_per_group['young adult'],obesity1_percentages['age']['young adult'], obesity2_percentages['age']['young adult'], obesity3_percentages['age']['young adult']],
        '20s' : [age_percentages_per_group['20s'], obesity1_percentages['age']['20s'], obesity2_percentages['age']['20s'], obesity3_percentages['age']['20s']],
        '30s' : [age_percentages_per_group['30s'], obesity1_percentages['age']['30s'], obesity2_percentages['age']['30s'], obesity3_percentages['age']['30s']],
        '40s' : [age_percentages_per_group['40s'], obesity1_percentages['age']['40s'], obesity2_percentages['age']['40s'], obesity3_percentages['age']['40s']],
        '50s' : [age_percentages_per_group['50s'], obesity1_percentages['age']['50s'], obesity2_percentages['age']['50s'], obesity3_percentages['age']['50s']],
        '60s' : [age_percentages_per_group['60s'], obesity1_percentages['age']['60s'], obesity2_percentages['age']['60s'], obesity3_percentages['age']['60s']]
    },
    'sex': {
        'male': [sex_percentages_per_group['male'], obesity1_percentages['sex']['male'], obesity2_percentages['sex']['male'], obesity3_percentages['sex']['male']],
        'female': [sex_percentages_per_group['female'], obesity1_percentages['sex']['female'], obesity2_percentages['sex']['female'], obesity3_percentages['sex']['female']]
    },
    'children': {
        'childless': [children_percentages_per_group['childless'], obesity1_percentages['children']['childless'], obesity2_percentages['children']['childless'], obesity3_percentages['children']['childless']],
        'one_child': [children_percentages_per_group['one_child'], obesity1_percentages['children']['one_child'], obesity2_percentages['children']['one_child'], obesity3_percentages['children']['one_child']],
        'two_children': [children_percentages_per_group['two_children'], obesity1_percentages['children']['two_children'], obesity2_percentages['children']['two_children'], obesity3_percentages['children']['two_children']],
        'three_children': [children_percentages_per_group['three_children'], obesity1_percentages['children']['three_children'], obesity2_percentages['children']['three_children'], obesity3_percentages['children']['three_children']],
        'four_children': [children_percentages_per_group['four_children'], obesity1_percentages['children']['four_children'], obesity2_percentages['children']['four_children'], obesity3_percentages['children']['four_children']],
        'five_children': [children_percentages_per_group['five_children'], obesity1_percentages['children']['five_children'], obesity2_percentages['children']['five_children'], obesity3_percentages['children']['five_children']]
    },
    'smoker': {
        'smoker': [smoker_percentages_per_group['smoker'], obesity1_percentages['smoker']['smoker'], obesity2_percentages['smoker']['smoker'], obesity3_percentages['smoker']['smoker']],
        'non-smoker': [smoker_percentages_per_group['non-smoker'], obesity1_percentages['smoker']['non-smoker'], obesity2_percentages['smoker']['non-smoker'], obesity3_percentages['smoker']['non-smoker']]
    },
    'region': {
        'southwest': [region_percentages['southwest'], obesity1_percentages['region']['southwest'], obesity2_percentages['region']['southwest'], obesity3_percentages['region']['southwest']],
        'southeast': [region_percentages['southeast'], obesity1_percentages['region']['southeast'], obesity2_percentages['region']['southeast'], obesity3_percentages['region']['southeast']],
        'northwest': [region_percentages['northwest'], obesity1_percentages['region']['northwest'], obesity2_percentages['region']['northwest'], obesity3_percentages['region']['northwest']],
        'northeast': [region_percentages['northeast'], obesity1_percentages['region']['northeast'], obesity2_percentages['region']['northeast'], obesity3_percentages['region']['northeast']]
    }
}
print("The following dictionary is formated like [Percentage of General Population, Percentage of Obesity Class 1, 2, and 3 ]")
print("Obesity Percentage Comparison: " + str(obesity_percentage_comparison))


The following dictionary is formated like [Percentage of General Population, Percentage of Obesity Class 1, 2, and 3 ]
Obesity Percentage Comparison: {'age': {'young adult': [10.24, 10.23, 9.33, 9.89], '20s': [20.93, 22.25, 15.11, 15.38], '30s': [19.21, 16.62, 17.33, 20.88], '40s': [20.85, 19.95, 21.33, 20.88], '50s': [20.25, 22.25, 21.78, 26.37], '60s': [8.52, 8.7, 15.11, 6.59]}, 'sex': {'male': [50.52, 52.17, 52.44, 56.04], 'female': [49.48, 47.83, 47.56, 43.96]}, 'children': {'childless': [42.9, 41.69, 43.56, 43.96], 'one_child': [24.22, 22.76, 24.0, 21.98], 'two_children': [17.94, 18.67, 20.0, 19.78], 'three_children': [11.73, 13.81, 11.11, 9.89], 'four_children': [1.87, 2.05, 0.89, 2.2], 'five_children': [1.35, 1.02, 0.44, 2.2]}, 'smoker': {'smoker': [20.48, 18.93, 22.22, 23.08], 'non-smoker': [79.52, 81.07, 77.78, 76.92]}, 'region': {'southwest': [24, 26.09, 25.78, 14.29], 'southeast': [27, 24.04, 41.78, 60.44], 'northwest': [24, 26.85, 15.56, 8.79], 'northeast': [24, 23.02, 16.8

#### Conclusion

The jump between class 1 and 2 for the percent of 60 year olds is drastic, where there are less than average 60 year olds in class 3. This is a high cost category and could make the difference between the jump. 

There are also 4% more smokers in class 2 than class 1, compared to a 1% jump between 2 and 3. These two factors should be why we see an unsteady increase in BMI based pricing.

### Confirm Number of Children Bias

##### Determine why rates per child go down between 4 and 5 children

#### Part 1: Sort Dictionary by Households with 5 children

In [598]:
def five_children_stats(dictionary): 
    total_five_children = 0
    totals = { 
    'age': {
            'young adult' : 0,
            '20s': 0,
            '30s': 0,
            '40s': 0,
            '50s': 0,
            '60s': 0
        },
        'sex': {
            'male': 0,
            'female': 0
        },

        'bmi': {
            'underweight': 0,
            'healthy': 0,
            'overweight': 0,
            'obesity1': 0,
            'obesity2': 0,
            'obesity3': 0
        },
        
        'smoker': {
            'smoker': 0,
            'non-smoker': 0
        },
        'region': {
            "southwest": 0,
            "southeast": 0,
            "northwest": 0,
            "northeast": 0
        }
    }
    #age sort
    def by_age(user):
        age = int(dictionary[user]['age'])
        if age >= 18 and age < 20:
            totals['age']['young adult'] += 1
        elif age >= 20 and age < 30:
            totals['age']['20s'] += 1
        elif age >= 30 and age < 40:
            totals['age']['30s'] += 1
        elif age >= 40 and age < 50:
            totals['age']['40s'] += 1
        elif age >= 50 and age < 60:
            totals['age']['50s'] += 1
        elif age >= 60 and age < 65:
            totals['age']['60s'] += 1
    #sex sort
    def by_sex(user):
         sex = dictionary[user]['sex']
         if sex == 'male':
             totals['sex']['male'] += 1
         elif sex == 'female':
                totals['sex']['female'] += 1
    #bmi sort
    def by_bmi(user):
        bmi = float(dictionary[user]['bmi'])
        if bmi < 18.5:
            totals['bmi']['underweight'] += 1
        elif bmi >= 18.5 and bmi < 25:
            totals['bmi']['healthy'] += 1
        elif bmi >= 25 and bmi < 30:
            totals['bmi']['overweight'] += 1
        elif bmi >= 30 and bmi < 35:
            totals['bmi']['obesity1'] += 1
        elif bmi >= 35 and bmi < 40:
            totals['bmi']['obesity2'] += 1
        elif bmi > 40: 
            totals['bmi']['obesity3'] += 1
    #smoker sort
    def by_smoker(user):
        smoker = dictionary[user]['smoker']
        if smoker == 'yes':
            totals['smoker']['smoker'] += 1
        elif smoker == 'no':
            totals['smoker']['non-smoker'] += 1
    #region sort
    def by_region(user):
        region = dictionary[user]['region']
        if region == 'southwest':
            totals['region']['southwest'] += 1
        elif region == 'southeast':
            totals['region']['southeast'] += 1
        elif region == 'northwest':
            totals['region']['northwest'] += 1
        elif region == 'northeast':
            totals['region']['northeast'] += 1
    for person in dictionary:
        if int(dictionary[person]['children']) == 5:
            total_five_children += 1
            by_age(person)
            by_sex(person)
            by_bmi(person)
            by_smoker(person)
            by_region(person)
    return [totals, total_five_children]

five_children_totals = five_children_stats(insurance_data_dict)[0]
all_five_children = five_children_stats(insurance_data_dict)[1]
print("5 Children Household Sub Category Totals: " + str(five_children_totals))
print("5 Children Household Total: " + str(all_five_children))

5 Children Household Sub Category Totals: {'age': {'young adult': 1, '20s': 4, '30s': 7, '40s': 5, '50s': 1, '60s': 0}, 'sex': {'male': 10, 'female': 8}, 'bmi': {'underweight': 1, 'healthy': 5, 'overweight': 5, 'obesity1': 4, 'obesity2': 1, 'obesity3': 2}, 'smoker': {'smoker': 1, 'non-smoker': 17}, 'region': {'southwest': 8, 'southeast': 6, 'northwest': 1, 'northeast': 3}}
5 Children Household Total: 18


#### Part 2: Convert Totals to Percentages for Easier Comparison

In [601]:
five_children_percentages = {
    'age': {
        'young adult': [age_percentages_per_group['young adult'], round((five_children_totals['age']['young adult'] / all_five_children) * 100 ,2)],
        '20s': [age_percentages_per_group['20s'], round((five_children_totals['age']['20s'] / all_five_children) * 100 ,2)],
        '30s': [age_percentages_per_group['30s'], round((five_children_totals['age']['30s'] / all_five_children) * 100 ,2)],
        '40s': [age_percentages_per_group['40s'], round((five_children_totals['age']['40s'] / all_five_children) * 100 ,2)],
        '50s': [age_percentages_per_group['50s'], round((five_children_totals['age']['50s'] / all_five_children) * 100 ,2)],
        '60s': [age_percentages_per_group['60s'], round((five_children_totals['age']['60s'] / all_five_children) * 100 ,2)]
    },
    'sex': {
        'male': [sex_percentages_per_group['male'], round((five_children_totals['sex']['male'] / all_five_children) * 100 ,2)],
        'female': [sex_percentages_per_group['female'], round((five_children_totals['sex']['female'] / all_five_children) * 100 ,2)],
    }, 
    'bmi': {
        'underweight': [bmi_percentages_per_group['underweight'], round((five_children_totals['bmi']['underweight'] / all_five_children) * 100 ,2)],
        'healthy': [bmi_percentages_per_group['healthy'], round((five_children_totals['bmi']['healthy'] / all_five_children) * 100 ,2)],
        'overweight': [bmi_percentages_per_group['overweight'], round((five_children_totals['bmi']['overweight'] / all_five_children) * 100 ,2)],
        'obesity1': [bmi_percentages_per_group['obesity1'], round((five_children_totals['bmi']['obesity1'] / all_five_children) * 100 ,2)],
        'obesity2': [bmi_percentages_per_group['obesity2'], round((five_children_totals['bmi']['obesity2'] / all_five_children) * 100 ,2)],
        'obesity3': [bmi_percentages_per_group['obesity3'], round((five_children_totals['bmi']['obesity3'] / all_five_children) * 100 ,2)]
    },
    'smoker': {
        'smoker': [smoker_percentages_per_group['smoker'], round((five_children_totals['smoker']['smoker'] / all_five_children) * 100 ,2)],
        'non-smoker': [smoker_percentages_per_group['non-smoker'], round((five_children_totals['smoker']['non-smoker'] / all_five_children) * 100 ,2)]
    },
    'region': {
        'southwest': [region_percentages['southwest'], round((five_children_totals['region']['southwest'] / all_five_children) * 100 ,2)],
        'southeast': [region_percentages['southeast'], round((five_children_totals['region']['southeast'] / all_five_children) * 100 ,2)],
        'northwest': [region_percentages['northwest'], round((five_children_totals['region']['northwest'] / all_five_children) * 100 ,2)],
        'northeast': [region_percentages['northeast'], round((five_children_totals['region']['northeast'] / all_five_children) * 100 ,2)]
    }
}

print("5 Children Household Subcategory Percentages: " + str(five_children_percentages))

5 Children Household Subcategory Percentages: {'age': {'young adult': [10.24, 5.56], '20s': [20.93, 22.22], '30s': [19.21, 38.89], '40s': [20.85, 27.78], '50s': [20.25, 5.56], '60s': [8.52, 0.0]}, 'sex': {'male': [50.52, 55.56], 'female': [49.48, 44.44]}, 'bmi': {'underweight': [1.49, 5.56], 'healthy': [16.82, 27.78], 'overweight': [28.85, 27.78], 'obesity1': [29.22, 22.22], 'obesity2': [16.82, 5.56], 'obesity3': [6.8, 11.11]}, 'smoker': {'smoker': [20.48, 5.56], 'non-smoker': [79.52, 94.44]}, 'region': {'southwest': [24, 44.44], 'southeast': [27, 33.33], 'northwest': [24, 5.56], 'northeast': [24, 16.67]}}


#### Conclusion

The percentages of people with 5 children who are in obesity class 3 is nearly double the percentage in the general population. Especially considering there are only 18 people who have 5 kids in this dataset the 2 class 3 obese people are warping that rate of change. We can assume that cost per child is steadily decreasing if this obesity factor wasn't skewing our results. 

### Confirm Region Bias

##### Determine what other factors could be causing variation in average price between region

#### Part 1: Sort Dictionary by Region

In [650]:
def region_stats(dictionary, region_input): 
    total_region = 0
    total_charges = 0
    totals = { 
    'age': {
            'young adult' : 0,
            '20s': 0,
            '30s': 0,
            '40s': 0,
            '50s': 0,
            '60s': 0
        },
        'sex': {
            'male': 0,
            'female': 0
        },
        'bmi': {
            'underweight': 0,
            'healthy': 0,
            'overweight': 0,
            'obesity1': 0,
            'obesity2': 0,
            'obesity3': 0
        },
        'children': {
            'childless': 0,
            'one_child': 0,
            'two_children': 0,
            'three_children': 0,
            'four_children': 0,
            'five_children': 0
        },
        'smoker': {
            'smoker': 0,
            'non-smoker': 0
        }
    }
    #age sort
    def by_age(user):
        age = int(dictionary[user]['age'])
        if age >= 18 and age < 20:
            totals['age']['young adult'] += 1
        elif age >= 20 and age < 30:
            totals['age']['20s'] += 1
        elif age >= 30 and age < 40:
            totals['age']['30s'] += 1
        elif age >= 40 and age < 50:
            totals['age']['40s'] += 1
        elif age >= 50 and age < 60:
            totals['age']['50s'] += 1
        elif age >= 60 and age < 65:
            totals['age']['60s'] += 1
    #sex sort
    def by_sex(user):
         sex = dictionary[user]['sex']
         if sex == 'male':
             totals['sex']['male'] += 1
         elif sex == 'female':
                totals['sex']['female'] += 1
    #bmi sort
    def by_bmi(user):
        bmi = float(dictionary[user]['bmi'])
        if bmi < 18.5:
            totals['bmi']['underweight'] += 1
        elif bmi >= 18.5 and bmi < 25:
            totals['bmi']['healthy'] += 1
        elif bmi >= 25 and bmi < 30:
            totals['bmi']['overweight'] += 1
        elif bmi >= 30 and bmi < 35:
            totals['bmi']['obesity1'] += 1
        elif bmi >= 35 and bmi < 40:
            totals['bmi']['obesity2'] += 1
        elif bmi > 40: 
            totals['bmi']['obesity3'] += 1
    #children sort
    def by_children(user):
        children = int(dictionary[user]['children'])
        if children == 0:
            totals['children']['childless'] += 1
        elif children == 1:
            totals['children']['one_child'] += 1
        elif children == 2:
            totals['children']['two_children'] += 1
        elif children == 3:
            totals['children']['three_children'] += 1
        elif children == 4: 
            totals['children']['four_children'] += 1
        elif children == 5:
            totals['children']['five_children'] += 1
    #smoker sort
    def by_smoker(user):
        smoker = dictionary[user]['smoker']
        if smoker == 'yes':
            totals['smoker']['smoker'] += 1
        elif smoker == 'no':
            totals['smoker']['non-smoker'] += 1
    #region sort
    for person in dictionary:
        region = dictionary[person]['region']
        charge = round(float(dictionary[person]['charges']), 2)
        if region_input == 'southwest' and region == 'southwest':
                total_region += 1
                total_charges += charge
                by_age(person)
                by_sex(person)
                by_bmi(person)
                by_children(person)
                by_smoker(person)
        elif region_input == 'southeast' and region == 'southeast':
                total_region += 1
                total_charges += charge
                by_age(person)
                by_sex(person)
                by_bmi(person)
                by_children(person)
                by_smoker(person)
        elif region_input == 'northwest' and region == 'northwest':
                total_region += 1
                total_charges += charge
                by_age(person)
                by_sex(person)
                by_bmi(person)
                by_children(person)
                by_smoker(person)
        elif region_input == 'northeast' and region == 'northeast':
                total_region += 1
                total_charges += charge
                by_age(person)
                by_sex(person)
                by_bmi(person)
                by_children(person)
                by_smoker(person)
        else: continue
    return [totals, total_region, total_charges]

southwest_totals = region_stats(insurance_data_dict, 'southwest')[0]
all_southwest = region_stats(insurance_data_dict, 'southwest')[1]
southwest_charges = region_stats(insurance_data_dict, 'southwest')[2]
print("Southwest Dictionary: " + str(southwest_totals))
print("Southwest Total: " + str(all_southwest))
print('Southwest Total Charges: ' + str(southwest_charges))

southeast_totals = region_stats(insurance_data_dict, 'southeast')[0]
all_southeast = region_stats(insurance_data_dict, 'southeast')[1]
southeast_charges = region_stats(insurance_data_dict, 'southeast')[2]
print("Southeast Dictionary: " + str(southeast_totals))
print("Southeast Total: " + str(all_southeast))
print('Southeast Total Charges: ' + str(southeast_charges))

northwest_totals = region_stats(insurance_data_dict, 'northwest')[0]
all_northwest = region_stats(insurance_data_dict, 'northwest')[1]
northwest_charges = region_stats(insurance_data_dict, 'northwest')[2]
print("Northwest Dictionary: " + str(northwest_totals))
print("Northwest Total: " + str(all_northwest))
print('Northwest Total Charges: ' + str(northwest_charges))

northeast_totals = region_stats(insurance_data_dict, 'northeast')[0]
all_northeast = region_stats(insurance_data_dict, 'northeast')[1]
northeast_charges = region_stats(insurance_data_dict, 'northeast')[2]
print("Northeast Dictionary: " + str(northeast_totals))
print("Northeast Total: " + str(all_northeast))
print('Northeast Total Charges: ' + str(northeast_charges))

Southwest Dictionary: {'age': {'young adult': 31, '20s': 68, '30s': 64, '40s': 66, '50s': 68, '60s': 28}, 'sex': {'male': 163, 'female': 162}, 'bmi': {'underweight': 3, 'healthy': 48, 'overweight': 101, 'obesity1': 102, 'obesity2': 58, 'obesity3': 13}, 'children': {'childless': 138, 'one_child': 78, 'two_children': 57, 'three_children': 37, 'four_children': 7, 'five_children': 8}, 'smoker': {'smoker': 58, 'non-smoker': 267}}
Southwest Total: 325
Southwest Total Charges: 4012754.69
Southeast Dictionary: {'age': {'young adult': 40, '20s': 75, '30s': 69, '40s': 79, '50s': 70, '60s': 31}, 'sex': {'male': 189, 'female': 175}, 'bmi': {'underweight': 0, 'healthy': 41, 'overweight': 80, 'obesity1': 94, 'obesity2': 94, 'obesity3': 55}, 'children': {'childless': 157, 'one_child': 95, 'two_children': 66, 'three_children': 35, 'four_children': 5, 'five_children': 6}, 'smoker': {'smoker': 91, 'non-smoker': 273}}
Southeast Total: 364
Southeast Total Charges: 5363689.780000005
Northwest Dictionary: {

#### Part 2: Compare Percentages and Average Cost

In [652]:
southeast_percentages = {
    'age': {
        'young adult': round((southeast_totals['age']['young adult'] / all_southeast) * 100 , 2),
        '20s': round((southeast_totals['age']['20s'] / all_southeast) * 100 , 2),
        '30s': round((southeast_totals['age']['30s'] / all_southeast) * 100 , 2),
        '40s': round((southeast_totals['age']['40s'] / all_southeast) * 100 , 2),
        '50s': round((southeast_totals['age']['50s'] / all_southeast) * 100 , 2),
        '60s': round((southeast_totals['age']['60s'] / all_southeast) * 100 , 2) 
    },
    'sex': {
        'male': round((southeast_totals['sex']['male'] / all_southeast) * 100 , 2),
        'female': round((southeast_totals['sex']['female'] / all_southeast) * 100 , 2)
    },
    'bmi': {
        'underweight': round((southeast_totals['bmi']['underweight'] / all_southeast) * 100 , 2),
        'healthy': round((southeast_totals['bmi']['healthy'] / all_southeast) * 100 , 2),
        'overweight': round((southeast_totals['bmi']['overweight'] / all_southeast) * 100 , 2),
        'obesity1': round((southeast_totals['bmi']['obesity1'] / all_southeast) * 100 , 2),
        'obesity2': round((southeast_totals['bmi']['obesity2'] / all_southeast) * 100 , 2),
        'obesity3': round((southeast_totals['bmi']['obesity3'] / all_southeast) * 100 , 2)
    },
    'children': {
        'childless': round((southeast_totals['children']['childless'] / all_southeast) * 100 , 2),
        'one_child': round((southeast_totals['children']['one_child'] / all_southeast) * 100 , 2),
        'two_children': round((southeast_totals['children']['two_children'] / all_southeast) * 100 , 2),
        'three_children': round((southeast_totals['children']['three_children'] / all_southeast) * 100 , 2),
        'four_children': round((southeast_totals['children']['four_children'] / all_southeast) * 100 , 2),
        'five_children': round((southeast_totals['children']['five_children'] / all_southeast) * 100 , 2)     
    },
    'smoker': {
        'smoker':  round((southeast_totals['smoker']['smoker'] / all_southeast) * 100 , 2),
        'non-smoker':  round((southeast_totals['smoker']['non-smoker'] / all_southeast) * 100 , 2),
    }
}

In [654]:
southwest_percentages = {
    'age': {
        'young adult': round((southwest_totals['age']['young adult'] / all_southwest) * 100 , 2),
        '20s': round((southwest_totals['age']['20s'] / all_southwest) * 100 , 2),
        '30s': round((southwest_totals['age']['30s'] / all_southwest) * 100 , 2),
        '40s': round((southwest_totals['age']['40s'] / all_southwest) * 100 , 2),
        '50s': round((southwest_totals['age']['50s'] / all_southwest) * 100 , 2),
        '60s': round((southwest_totals['age']['60s'] / all_southwest) * 100 , 2) 
    },
    'sex': {
        'male': round((southwest_totals['sex']['male'] / all_southwest) * 100 , 2),
        'female': round((southwest_totals['sex']['female'] / all_southwest) * 100 , 2)
    },
    'bmi': {
        'underweight': round((southwest_totals['bmi']['underweight'] / all_southwest) * 100 , 2),
        'healthy': round((southwest_totals['bmi']['healthy'] / all_southwest) * 100 , 2),
        'overweight': round((southwest_totals['bmi']['overweight'] / all_southwest) * 100 , 2),
        'obesity1': round((southwest_totals['bmi']['obesity1'] / all_southwest) * 100 , 2),
        'obesity2': round((southwest_totals['bmi']['obesity2'] / all_southwest) * 100 , 2),
        'obesity3': round((southwest_totals['bmi']['obesity3'] / all_southwest) * 100 , 2)
    },
    'children': {
        'childless': round((southwest_totals['children']['childless'] / all_southwest) * 100 , 2),
        'one_child': round((southwest_totals['children']['one_child'] / all_southwest) * 100 , 2),
        'two_children': round((southwest_totals['children']['two_children'] / all_southwest) * 100 , 2),
        'three_children': round((southwest_totals['children']['three_children'] / all_southwest) * 100 , 2),
        'four_children': round((southwest_totals['children']['four_children'] / all_southwest) * 100 , 2),
        'five_children': round((southwest_totals['children']['five_children'] / all_southwest) * 100 , 2)     
    },
    'smoker': {
        'smoker':  round((southwest_totals['smoker']['smoker'] / all_southwest) * 100 , 2),
        'non-smoker':  round((southwest_totals['smoker']['non-smoker'] / all_southwest) * 100 , 2),
    }
}

In [656]:
northwest_percentages = {
    'age': {
        'young adult': round((northwest_totals['age']['young adult'] / all_northwest) * 100 , 2),
        '20s': round((northwest_totals['age']['20s'] / all_northwest) * 100 , 2),
        '30s': round((northwest_totals['age']['30s'] / all_northwest) * 100 , 2),
        '40s': round((northwest_totals['age']['40s'] / all_northwest) * 100 , 2),
        '50s': round((northwest_totals['age']['50s'] / all_northwest) * 100 , 2),
        '60s': round((northwest_totals['age']['60s'] / all_northwest) * 100 , 2) 
    },
    'sex': {
        'male': round((northwest_totals['sex']['male'] / all_northwest) * 100 , 2),
        'female': round((northwest_totals['sex']['female'] / all_northwest) * 100 , 2)
    },
    'bmi': {
        'underweight': round((northwest_totals['bmi']['underweight'] / all_northwest) * 100 , 2),
        'healthy': round((northwest_totals['bmi']['healthy'] / all_northwest) * 100 , 2),
        'overweight': round((northwest_totals['bmi']['overweight'] / all_northwest) * 100 , 2),
        'obesity1': round((northwest_totals['bmi']['obesity1'] / all_northwest) * 100 , 2),
        'obesity2': round((northwest_totals['bmi']['obesity2'] / all_northwest) * 100 , 2),
        'obesity3': round((northwest_totals['bmi']['obesity3'] / all_northwest) * 100 , 2)
    },
    'children': {
        'childless': round((northwest_totals['children']['childless'] / all_northwest) * 100 , 2),
        'one_child': round((northwest_totals['children']['one_child'] / all_northwest) * 100 , 2),
        'two_children': round((northwest_totals['children']['two_children'] / all_northwest) * 100 , 2),
        'three_children': round((northwest_totals['children']['three_children'] / all_northwest) * 100 , 2),
        'four_children': round((northwest_totals['children']['four_children'] / all_northwest) * 100 , 2),
        'five_children': round((northwest_totals['children']['five_children'] / all_northwest) * 100 , 2)     
    },
    'smoker': {
        'smoker':  round((northwest_totals['smoker']['smoker'] / all_northwest) * 100 , 2),
        'non-smoker':  round((northwest_totals['smoker']['non-smoker'] / all_northwest) * 100 , 2),
    }
}

In [658]:
northeast_percentages = {
    'age': {
        'young adult': round((northeast_totals['age']['young adult'] / all_northeast) * 100 , 2),
        '20s': round((northeast_totals['age']['20s'] / all_northeast) * 100 , 2),
        '30s': round((northeast_totals['age']['30s'] / all_northeast) * 100 , 2),
        '40s': round((northeast_totals['age']['40s'] / all_northeast) * 100 , 2),
        '50s': round((northeast_totals['age']['50s'] / all_northeast) * 100 , 2),
        '60s': round((northeast_totals['age']['60s'] / all_northeast) * 100 , 2) 
    },
    'sex': {
        'male': round((northeast_totals['sex']['male'] / all_northeast) * 100 , 2),
        'female': round((northeast_totals['sex']['female'] / all_northeast) * 100 , 2)
    },
    'bmi': {
        'underweight': round((northeast_totals['bmi']['underweight'] / all_northeast) * 100 , 2),
        'healthy': round((northeast_totals['bmi']['healthy'] / all_northeast) * 100 , 2),
        'overweight': round((northeast_totals['bmi']['overweight'] / all_northeast) * 100 , 2),
        'obesity1': round((northeast_totals['bmi']['obesity1'] / all_northeast) * 100 , 2),
        'obesity2': round((northeast_totals['bmi']['obesity2'] / all_northeast) * 100 , 2),
        'obesity3': round((northeast_totals['bmi']['obesity3'] / all_northeast) * 100 , 2)
    },
    'children': {
        'childless': round((northeast_totals['children']['childless'] / all_northeast) * 100 , 2),
        'one_child': round((northeast_totals['children']['one_child'] / all_northeast) * 100 , 2),
        'two_children': round((northeast_totals['children']['two_children'] / all_northeast) * 100 , 2),
        'three_children': round((northeast_totals['children']['three_children'] / all_northeast) * 100 , 2),
        'four_children': round((northeast_totals['children']['four_children'] / all_northeast) * 100 , 2),
        'five_children': round((northeast_totals['children']['five_children'] / all_northeast) * 100 , 2)     
    },
    'smoker': {
        'smoker': round((northeast_totals['smoker']['smoker'] / all_northeast) * 100 , 2),
        'non-smoker': round((northeast_totals['smoker']['non-smoker'] / all_northeast) * 100 , 2),
    }
}

In [662]:
region_percentage_comparisons = {
    'age': {
        'young adult': [age_percentages_per_group['young adult'], southwest_percentages['age']['young adult'], southeast_percentages['age']['young adult'], northwest_percentages['age']['young adult'], northeast_percentages['age']['young adult']],
        '20s': [age_percentages_per_group['20s'], southwest_percentages['age']['20s'], southeast_percentages['age']['20s'], northwest_percentages['age']['20s'], northeast_percentages['age']['20s']],
        '30s': [age_percentages_per_group['30s'], southwest_percentages['age']['30s'], southeast_percentages['age']['30s'], northwest_percentages['age']['30s'], northeast_percentages['age']['30s']],
        '40s': [age_percentages_per_group['40s'], southwest_percentages['age']['40s'], southeast_percentages['age']['40s'], northwest_percentages['age']['40s'], northeast_percentages['age']['40s']],
        '50s': [age_percentages_per_group['50s'], southwest_percentages['age']['50s'], southeast_percentages['age']['50s'], northwest_percentages['age']['50s'], northeast_percentages['age']['50s']],
        '60s': [age_percentages_per_group['60s'], southwest_percentages['age']['60s'], southeast_percentages['age']['60s'], northwest_percentages['age']['60s'], northeast_percentages['age']['60s']]
    },
    'sex': {
        'male':  [sex_percentages_per_group['male'], southwest_percentages['sex']['male'], southeast_percentages['sex']['male'], northwest_percentages['sex']['male'], northeast_percentages['sex']['male']],
        'female':  [sex_percentages_per_group['female'], southwest_percentages['sex']['female'], southeast_percentages['sex']['female'], northwest_percentages['sex']['female'], northeast_percentages['sex']['female']]
    },
    'bmi': {
        'underweight':  [bmi_percentages_per_group['underweight'], southwest_percentages['bmi']['underweight'], southeast_percentages['bmi']['underweight'], northwest_percentages['bmi']['underweight'], northeast_percentages['bmi']['underweight']],
        'healthy':  [bmi_percentages_per_group['healthy'], southwest_percentages['bmi']['healthy'], southeast_percentages['bmi']['healthy'], northwest_percentages['bmi']['healthy'], northeast_percentages['bmi']['healthy']],
        'overweight':  [bmi_percentages_per_group['overweight'], southwest_percentages['bmi']['overweight'], southeast_percentages['bmi']['overweight'], northwest_percentages['bmi']['overweight'], northeast_percentages['bmi']['overweight']],
        'obesity1':  [bmi_percentages_per_group['obesity1'], southwest_percentages['bmi']['obesity1'], southeast_percentages['bmi']['obesity1'], northwest_percentages['bmi']['obesity1'], northeast_percentages['bmi']['obesity1']],
        'obesity2':  [bmi_percentages_per_group['obesity2'], southwest_percentages['bmi']['obesity2'], southeast_percentages['bmi']['obesity2'], northwest_percentages['bmi']['obesity2'], northeast_percentages['bmi']['obesity2']],
        'obesity3':  [bmi_percentages_per_group['obesity3'], southwest_percentages['bmi']['obesity3'], southeast_percentages['bmi']['obesity3'], northwest_percentages['bmi']['obesity3'], northeast_percentages['bmi']['obesity3']]      
    },
    'children': {
        'childless':  [children_percentages_per_group['childless'], southwest_percentages['children']['childless'], southeast_percentages['children']['childless'], northwest_percentages['children']['childless'], northeast_percentages['children']['childless']],
        'one_child': [children_percentages_per_group['one_child'], southwest_percentages['children']['one_child'], southeast_percentages['children']['one_child'], northwest_percentages['children']['one_child'], northeast_percentages['children']['one_child']],
        'two_children': [children_percentages_per_group['two_children'], southwest_percentages['children']['two_children'], southeast_percentages['children']['two_children'], northwest_percentages['children']['two_children'], northeast_percentages['children']['two_children']],
        'three_children': [children_percentages_per_group['three_children'], southwest_percentages['children']['three_children'], southeast_percentages['children']['three_children'], northwest_percentages['children']['three_children'], northeast_percentages['children']['three_children']],
        'four_children': [children_percentages_per_group['four_children'], southwest_percentages['children']['four_children'], southeast_percentages['children']['four_children'], northwest_percentages['children']['four_children'], northeast_percentages['children']['four_children']],
        'five_children': [children_percentages_per_group['five_children'], southwest_percentages['children']['five_children'], southeast_percentages['children']['five_children'], northwest_percentages['children']['five_children'], northeast_percentages['children']['five_children']]
    },
    'smoker': {
        'smoker': [smoker_percentages_per_group['smoker'], southwest_percentages['smoker']['smoker'], southeast_percentages['smoker']['smoker'], northwest_percentages['smoker']['smoker'], northeast_percentages['smoker']['smoker']],
        'non-smoker': [smoker_percentages_per_group['non-smoker'], southwest_percentages['smoker']['non-smoker'], southeast_percentages['smoker']['non-smoker'], northwest_percentages['smoker']['non-smoker'], northeast_percentages['smoker']['non-smoker']]
    }
}

print('The following dictionary can be read as follows: [General Population Percentage of subcategory, Southwest Percentage, Southeast, Northwest, Northeast]')
print("Region Percentages Comparison: " + str(region_percentage_comparisons))
print("Region Average Charge: " + str(region_average_charge))

The following dictionary can be read as follows: [General Population Percentage of subcategory, Southwest Percentage, Southeast, Northwest, Northeast]
Region Percentages Comparison: {'age': {'young adult': [10.24, 9.54, 10.99, 10.46, 9.88], '20s': [20.93, 20.92, 20.6, 20.92, 21.3], '30s': [19.21, 19.69, 18.96, 19.38, 18.83], '40s': [20.85, 20.31, 21.7, 20.31, 20.99], '50s': [20.25, 20.92, 19.23, 20.31, 20.68], '60s': [8.52, 8.62, 8.52, 8.62, 8.33]}, 'sex': {'male': [50.52, 50.15, 51.92, 49.54, 50.31], 'female': [49.48, 49.85, 48.08, 50.46, 49.69]}, 'bmi': {'underweight': [1.49, 0.92, 0.0, 2.15, 3.09], 'healthy': [16.82, 14.77, 11.26, 19.38, 22.53], 'overweight': [28.85, 31.08, 21.98, 32.92, 30.25], 'obesity1': [29.22, 31.38, 25.82, 32.31, 27.78], 'obesity2': [16.82, 17.85, 25.82, 10.77, 11.73], 'obesity3': [6.8, 4.0, 15.11, 2.46, 4.63]}, 'children': {'childless': [42.9, 42.46, 43.13, 40.62, 45.37], 'one_child': [24.22, 24.0, 26.1, 22.77, 23.77], 'two_children': [17.94, 17.54, 18.13, 20

#### Conclusion

We can see that Southwest and northwest have lower rates. They also have the lowest rates of class 3 obesity and smokers the two categories with the most effect on price. On the other hand the southeast and northeast had higher rates of both class 3 obesity and smoking. With the southeast being the highest in both categories. This is why the southeast has the highest average charge rate of any region, not due to a bias against the region itself. 

# Final Conclusion

The cost of insurance increases with age. 
Females are charged more for insurance than males.
The cost of insurance increases with bmi. Insurance doesn't reward people for being underweight but often more young people are underweight so their rates tend to be lower. 
The cost of insurance decreases per child added.
Smoking is the biggest cost driver for insurance prices. 
There is no region bias. 

Future work: determine rate of change in cost between nonbinary variables, and we can possibly insert some charts to help give visual representation of the sharpness of the increase