# U.S. Medical Insurance Costs

Source: insurance.csv

'insurance.csv' provides insurance data in table format. There are 1338 datapoints.

Each datapoint includes: Age, Sex, BMI, Children, Smoker, Region, Charges

What could be interesting to analyze:
1. What are the statistics about the dataset? Averages, mins, maxs, etc.
2. What has the greatest impact on insurance cost? Smoker? BMI?
3. Which age group pays most?
4. Are there gender differences in smoker characteristics?
5. Are the regions represented equally?

- read in the data as a dictionary
- calculate statistical metrics and class counts for the numeric data to get an idea of the data distribution
- calculate average charges for smokers/non-smokers
- calculate average charges for normal/high BMI (<25, >=25)
- calculate average charges for different age groups (10-19, 20-29, 30-39, ...)
- count the number of female/male persons in the smoker classes
- count the entries per region

In [132]:
# Reading in the data with csv.DictReader
import csv

mic_dict = {"Age": [],"Sex": [], "BMI": [], "Children": [], "Smoker": [], "Region": [], "Charges": []}
with open('insurance.csv', newline='') as datafile:
    data_dict = csv.DictReader(datafile)
    print(data_dict.fieldnames)
    for row in data_dict:
        mic_dict["Age"].append(float(row["age"]))
        mic_dict["Sex"].append(row["sex"])
        mic_dict["BMI"].append(float(row["bmi"]))
        mic_dict["Children"].append(int(row["children"]))
        mic_dict["Smoker"].append(row["smoker"])
        mic_dict["Region"].append(row["region"])
        mic_dict["Charges"].append(float(row["charges"]))

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
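
The read-in step can also be written more compactly. The following is only a sketch: the `COLUMN_TYPES` mapping and the `read_columns` helper are hypothetical names that are not part of the notebook, and the inline sample rows are made up to stand in for `insurance.csv`.

```python
import csv
import io

# Hypothetical mapping: CSV column name -> (key in the result dict, converter).
COLUMN_TYPES = {
    "age": ("Age", float),
    "sex": ("Sex", str),
    "bmi": ("BMI", float),
    "children": ("Children", int),
    "smoker": ("Smoker", str),
    "region": ("Region", str),
    "charges": ("Charges", float),
}

def read_columns(fileobj):
    """Read a CSV file object into a dict of column lists."""
    columns = {key: [] for key, _ in COLUMN_TYPES.values()}
    for row in csv.DictReader(fileobj):
        for name, (key, convert) in COLUMN_TYPES.items():
            columns[key].append(convert(row[name]))
    return columns

# Made-up sample rows in the same layout as insurance.csv.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,southwest,4000.0\n"
    "45,male,31.0,2,yes,northeast,25000.0\n"
)
sample_dict = read_columns(sample)
print(sample_dict["Age"], sample_dict["Smoker"])
```

With the real file, the same helper would be called inside `with open('insurance.csv', newline='') as datafile:`.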


### Investigation #1: Data characteristics

- calculate statistical metrics for float data BMI, Age, Children, Charges
- assign data into classes and check distribution

In [118]:
# The aim is to write a generic function that can be used for several parameters.
# Input: dict, parameter, class definitions (dict: class name -> lower bound)
# Output: average, median, min, max, class distribution
def calculate_average(mic_dict, parameter, classes):
    # statistical metrics
    data = mic_dict[parameter]
    average = sum(data)/len(data)
    # middle element of the sorted data; for an even number of values
    # this picks the upper of the two middle values
    median = sorted(data)[len(data)//2]
    vmin = min(data)
    vmax = max(data)
    print("Statistical metrics:")
    print("Average: {}, Median: {}, Min: {}, Max: {}".format(average, median, vmin, vmax))

    # class distribution: each class spans from its lower bound to the next one
    print("Classes:")
    cnames = list(classes.keys())
    cvalues = list(classes.values())
    class_dict = {}  # separate dict, so the class definitions are not overwritten
    for i in range(len(cvalues)):
        lower = cvalues[i]
        if i < len(cvalues)-1:
            upper = cvalues[i+1]
            data_filtered = [d for d in data if d >= lower and d < upper]
            print("Class {}: {} <= x < {} -> {}".format(cnames[i], lower, upper, len(data_filtered)))
        else:
            # the last class is open-ended
            data_filtered = [d for d in data if d >= lower]
            print("Class {}: {} <= x -> {}".format(cnames[i], lower, len(data_filtered)))
        class_dict[cnames[i]] = len(data_filtered)

    return average, median, vmin, vmax, class_dict
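
The median in `calculate_average` is taken as a single element from the middle of the sorted data. For an even number of values, the conventional median is instead the mean of the two middle values. A short sketch of the difference on made-up numbers, using `statistics.median` from the standard library:

```python
import statistics

values = [1, 2, 3, 10]  # made-up data with an even number of values

# Single middle element: the upper of the two middle values.
upper_middle = sorted(values)[len(values) // 2]
# Conventional median: mean of the two middle values.
true_median = statistics.median(values)

print(upper_middle)
print(true_median)
```

For large datasets the two usually lie close together, but they are not guaranteed to be equal.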

In [119]:
# BMI
classes = {0: 0, 1: 25}
bmi_ave, bmi_median, bmi_min, bmi_max, bmi_classes = calculate_average(mic_dict, "BMI", classes)

Statistical metrics:
Average: 30.663396860986538, Median: 30.4, Min: 15.96, Max: 53.13
Classes:
Class 0: 0 <= x < 25 -> 245
Class 1: 25 <= x -> 1093


According to Wikipedia, a person with a BMI of 25 or higher is considered overweight. 
In this dataset, 1093 of the 1338 persons (about 82%) fall into that class, so there is a bias towards overweight persons. This might affect the analysis of insurance charges. 

In [120]:
# Age
classes = {0: 0, 1: 10, 2: 20, 3: 30, 4: 40, 5: 50, 6:60, 7:70, 8:80, 9:90, 10: 100}
age_ave, age_median, age_min, age_max, age_classes = calculate_average(mic_dict, "Age", classes)

Statistical metrics:
Average: 39.20702541106129, Median: 39.0, Min: 18.0, Max: 64.0
Classes:
Class 0: 0 <= x < 10 -> 0
Class 1: 10 <= x < 20 -> 137
Class 2: 20 <= x < 30 -> 280
Class 3: 30 <= x < 40 -> 257
Class 4: 40 <= x < 50 -> 279
Class 5: 50 <= x < 60 -> 271
Class 6: 60 <= x < 70 -> 114
Class 7: 70 <= x < 80 -> 0
Class 8: 80 <= x < 90 -> 0
Class 9: 90 <= x < 100 -> 0
Class 10: 100 <= x -> 0


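The ages range from 18 to 64 and are spread fairly evenly over the decades from 20 to 59, so the data is closer to a uniform than to a normal distribution. The classes 10-19 and 60-69 are smaller because they only contain the ages 18-19 and 60-64. There are no persons of age 65 or older.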
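
To judge the shape of the distribution, the class counts printed above can be converted into shares of the total. A small sketch using those counts (the variable names are illustrative):

```python
# Class counts copied from the output above (non-empty classes only).
age_counts = {"10-19": 137, "20-29": 280, "30-39": 257,
              "40-49": 279, "50-59": 271, "60-69": 114}

total = sum(age_counts.values())
for label, count in age_counts.items():
    print("{}: {:.1%}".format(label, count / total))
```

Each of the four full decades holds roughly a fifth of the datapoints.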

In [127]:
# Children
classes = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
child_ave, child_median, child_min, child_max, child_classes = calculate_average(mic_dict, "Children", classes)

Statistical metrics:
Average: 1.0949177877429, Median: 1, Min: 0, Max: 5
Classes:
Class 0: 0 <= x < 1 -> 574
Class 1: 1 <= x < 2 -> 324
Class 2: 2 <= x < 3 -> 240
Class 3: 3 <= x < 4 -> 157
Class 4: 4 <= x < 5 -> 25
Class 5: 5 <= x -> 18


In [139]:
# Charges
# Note: the lowest class starts at 5000, so charges below 5000 are not counted in any class
classes = {0: 5000, 1: 10000, 2: 20000, 3: 30000, 4: 40000, 5: 50000, 6: 60000}
charges_ave, charges_median, charges_min, charges_max, charges_classes = calculate_average(mic_dict, "Charges", classes)

Statistical metrics:
Average: 13270.422265141257, Median: 9386.1613, Min: 1121.8739, Max: 63770.42801
Classes:
Class 0: 5000 <= x < 10000 -> 353
Class 1: 10000 <= x < 20000 -> 353
Class 2: 20000 <= x < 30000 -> 111
Class 3: 30000 <= x < 40000 -> 83
Class 4: 40000 <= x < 50000 -> 72
Class 5: 50000 <= x < 60000 -> 4
Class 6: 60000 <= x -> 3
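
A quick sanity check for any class definition is to compare the sum of the class counts with the number of datapoints. A sketch using the counts printed above and the 1338 datapoints of the dataset:

```python
# Class counts copied from the output above.
charges_counts = [353, 353, 111, 83, 72, 4, 3]
n_datapoints = 1338

covered = sum(charges_counts)
uncovered = n_datapoints - covered
print("Covered: {}, not covered: {}".format(covered, uncovered))
```

A non-zero remainder means that some values fall outside all classes.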


### Investigation #2: What affects the insurance costs more, BMI or Smoker status?

- calculate average charges for smokers/non-smokers, normal/high BMI

In [164]:
# The aim is to write a generic function that can be used for both smoker status and BMI.
# Input: charges, data (list of "Class 1"/"Class 2" labels, one per charge)
# Output: prints the average charges for both categories (e.g. smokers/non-smokers)

def calculate_average_charges(charges, data):
    total_1 = 0
    count_1 = 0
    total_2 = 0
    count_2 = 0
    for i in range(len(data)):
        if data[i] == "Class 1":
            total_1 += charges[i]
            count_1 += 1
        else:
            total_2 += charges[i]
            count_2 += 1
    print ("Average insurance cost Class 1 = {}".format(total_1/count_1))
    print ("Average insurance cost Class 2 = {}".format(total_2/count_2))

In [165]:
# Categorize smoking status data in Class 1 and 2
# Class 1 = Smoker, Class 2 = Non-Smoker
data = mic_dict["Smoker"]
data = ["Class 1" if d == "yes" else "Class 2" for d in data]
calculate_average_charges(mic_dict["Charges"], data)

Average insurance cost Class 1 = 32050.23183153285
Average insurance cost Class 2 = 8434.268297856199


Class 1 = Smoker, Class 2 = Non-Smoker.
The average insurance cost is 3.8x higher for smokers.

In [166]:
# For BMI, we need to create a new list, with 2 categories
# Class 1 = normal BMI (<25), Class 2 = overweight (>=25)
data = mic_dict["BMI"]
data = ["Class 1" if d < 25 else "Class 2" for d in data]
calculate_average_charges(mic_dict["Charges"], data)

Average insurance cost Class 1 = 10282.224474367351
Average insurance cost Class 2 = 13940.237872405301


Class 1 = Normal BMI, Class 2 = Overweight.
The average insurance cost is only about 1.36x higher for people with a BMI >= 25.

--> Smoking has a greater effect on insurance costs than BMI.
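
The two-class comparison generalizes to any number of labels. As a sketch (the function name `average_by_label` and the sample values are made up for illustration), dictionaries can accumulate the totals and counts per label:

```python
from collections import defaultdict

def average_by_label(values, labels):
    """Return {label: average of the values carrying that label}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

# Made-up sample data, not taken from insurance.csv.
sample_charges = [1000.0, 3000.0, 20000.0, 30000.0]
sample_smoker = ["no", "no", "yes", "yes"]

averages = average_by_label(sample_charges, sample_smoker)
ratio = averages["yes"] / averages["no"]
print(averages, ratio)
```

Since only labels that actually occur get an entry, no division by zero can happen for an empty class.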

### Investigation #3: Which age group has the highest insurance costs?

- define age classes
- calculate average for each age class

In [173]:
# age classes
classes = [1, 2, 3, 4, 5, 6, 7, 8]
# categorize age data according to classes (Class 1: 0-9, Class 2: 10-19, ...)
data = mic_dict["Age"]
data_updated = []
for d in data:
    if 0 <= d < 10:
        data_updated.append("Class 1")
    elif 10 <= d < 20:
        data_updated.append("Class 2")
    elif 20 <= d < 30:
        data_updated.append("Class 3")
    elif 30 <= d < 40:
        data_updated.append("Class 4")
    elif 40 <= d < 50:
        data_updated.append("Class 5")
    elif 50 <= d < 60:
        data_updated.append("Class 6")
    elif 60 <= d < 70:
        data_updated.append("Class 7")
    elif 70 <= d < 80:
        data_updated.append("Class 8")
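
Because every class spans exactly ten years, the class number can also be computed directly with integer division. A sketch, assuming the same numbering as above (Class 1 for ages 0-9, Class 2 for 10-19, and so on); `age_class` is a hypothetical helper and the sample ages are made up:

```python
def age_class(age):
    """Map an age to its decade class label, e.g. 18 -> 'Class 2'."""
    return "Class {}".format(int(age // 10) + 1)

sample_ages = [18.0, 25.0, 39.0, 64.0]  # made-up ages
sample_labels = [age_class(a) for a in sample_ages]
print(sample_labels)
```

Unlike the chain of conditions, this assigns a label to every non-negative age, so the label list always has the same length as the charges list.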


In [217]:
def calculate_average_charges_perclass(charges, data, classes):
    # average of the previous non-empty class; None until the first one is found
    old_average = None
    for c in classes:
        total = 0
        count = 0
        for i in range(len(data)):
            if data[i] == "Class {}".format(c):
                total += charges[i]
                count += 1
        # skip empty classes
        if count == 0:
            continue
        average = total/count
        if old_average is None:
            print("Average insurance cost Class {} = {}".format(c, average))
        else:
            # factor relative to the previous non-empty class
            print("Average insurance cost Class {} = {} -> {}x".format(c, average, average/old_average))
        old_average = average

In [218]:
classes = [1, 2, 3, 4, 5, 6, 7, 8]
calculate_average_charges_perclass(mic_dict["Charges"], data_updated, classes)

Average insurance cost Class 2 = 8407.34924189051
Average insurance cost Class 3 = 9561.75101803571 -> 1.1373086501977663x
Average insurance cost Class 4 = 11738.784117354091 -> 1.2276814252130164x
Average insurance cost Class 5 = 14399.203563870966 -> 1.2266350092071145x
Average insurance cost Class 6 = 16495.232664981537 -> 1.1455656274191246x
Average insurance cost Class 7 = 21248.021884912272 -> 1.288131081049899x


From one age group to the next, the average insurance cost increases by a factor of roughly 1.1 to 1.3. Thus the oldest age group (Class 7, ages 60-64) pays most.
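
The step-by-step factors can be recomputed from the printed averages, together with the overall factor between the youngest and the oldest group. A sketch using the values from the output above:

```python
# Average charges per age class (Class 2 ... Class 7), copied from the output above.
averages = [8407.34924189051, 9561.75101803571, 11738.784117354091,
            14399.203563870966, 16495.232664981537, 21248.021884912272]

# Factor between each class and the one before it.
ratios = [later / earlier for earlier, later in zip(averages, averages[1:])]
overall = averages[-1] / averages[0]

print(["{:.2f}".format(r) for r in ratios])
print("Oldest vs. youngest: {:.2f}x".format(overall))
```

Overall, the oldest group pays about 2.5 times as much on average as the youngest.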

### Investigation #4: Are there gender differences in smoker characteristics?

- count male/female smokers

In [231]:
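# input: mic_dict
# output: count of female persons, count of male persons, male/female ratio
# Note: the Smoker column is read but not used, so all persons are counted, not only smokers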
def count_smokers(mic_dict):
    smokers = mic_dict["Smoker"]
    sex = mic_dict["Sex"]
    count_f = 0
    count_m = 0
    for i in range(len(sex)):
        if sex[i] == "female":
            count_f += 1
        else:
            count_m += 1
    return count_f, count_m, count_m/count_f

In [232]:
count_smokers(mic_dict)

(662, 676, 1.0211480362537764)

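The function counts all persons regardless of smoker status, so this result shows that the dataset is balanced between female (662) and male (676) persons. It does not yet show how the smokers are distributed between female and male persons.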
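
To count smokers per sex, the rows have to be filtered on the smoker status before counting. A minimal sketch, assuming the `'yes'`/`'no'` coding of the Smoker column; `count_smokers_by_sex` is a hypothetical helper and the sample data is made up:

```python
def count_smokers_by_sex(sex, smoker):
    """Count the smokers per sex; `smoker` holds 'yes' or 'no' per person."""
    counts = {}
    for s, sm in zip(sex, smoker):
        if sm == "yes":
            counts[s] = counts.get(s, 0) + 1
    return counts

# Made-up sample data, not taken from insurance.csv.
sample_sex = ["female", "male", "male", "female", "male"]
sample_smoker = ["yes", "yes", "no", "no", "yes"]

smoker_counts = count_smokers_by_sex(sample_sex, sample_smoker)
print(smoker_counts)
```

With the notebook's data, this would be called as `count_smokers_by_sex(mic_dict["Sex"], mic_dict["Smoker"])`.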

### Investigation #5: Are the regions represented equally?

In [242]:
def analyze_regions(mic_dict):
    regions = mic_dict["Region"]
    regions_u = list(set(regions))
    regions_dict = {}
    for region in regions_u:
        regions_dict[region] = len([r for r in regions if r == region])
    return regions_dict

In [243]:
analyze_regions(mic_dict)

{'southeast': 364, 'southwest': 325, 'northwest': 325, 'northeast': 324}

The "southeast" region has the most entries (364), while the other three regions have 324 or 325 entries each. Overall, the regions are represented roughly equally.
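
The same count can be obtained with `collections.Counter` from the standard library, which also makes it easy to print shares. A sketch on a made-up region list:

```python
from collections import Counter

# Made-up region list; in the notebook this would be mic_dict["Region"].
sample_regions = ["southeast", "southwest", "southeast", "northwest", "northeast"]

region_counts = Counter(sample_regions)
total = sum(region_counts.values())
# most_common() yields the regions sorted by descending count.
for region, count in region_counts.most_common():
    print("{}: {} ({:.0%})".format(region, count, count / total))
```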