In [10]:
import csv

## Step 1 - Importing the data

In [82]:
with open('insurance.csv', newline='') as insurance_data:
    insurance_dict = csv.DictReader(insurance_data)
    # unpacking the insurance dictionary
    insurance_list = []
    count = 0
    for item in insurance_dict:  # copy each row into a plain dict so the data remains usable after the file is closed
        insurance_list.append({key:value for key, value in item.items()})
        count +=1
    print(count)

1338
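
As a hedged aside, `csv.DictReader` already yields one dict per row, so the import step can be condensed into a list comprehension. The sketch below reads a small in-memory sample instead of `insurance.csv` so that it runs on its own; `load_records` and `SAMPLE_CSV` are illustrative names, not part of the notebook.

```python
import csv
import io

# Hypothetical two-row sample standing in for insurance.csv (same column names).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
"""

def load_records(file_obj):
    """Return every CSV row as a plain dict so the data outlives the file handle."""
    return [dict(row) for row in csv.DictReader(file_obj)]

records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records))
```

With the real file, the same helper could be called as `load_records(insurance_data)` inside the `with` block.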


## Step 2 - Let's explore the data and ensure we have the correct datatypes.
* Age should be int
* BMI should be float
* Number of children should be int
* Charges should be float

<em>Looks like the data types need to be converted, so let's do that first!</em>
* {'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}

In [83]:
def conversion_func(dictionary):
    #in this function, we are going to convert datatypes
    for item in dictionary:
        # reset the datatypes here:
        item['age'] = int(item['age'])
        item['bmi'] = float(item['bmi'])
        item['children'] = int(item['children'])
        item['charges'] = float(item['charges'])
    return dictionary
new_insurance_dict = conversion_func(insurance_list)
#print(new_insurance_dict)
# now we have the correct data types to work with! 
# {'age': 19, 'sex': 'female', 'bmi': 27.9, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 16884.924}
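
One possible variation, sketched under the assumption that the raw strings should stay untouched: build a new record from a column-to-type mapping instead of converting in place. `CONVERTERS` and `convert_record` are hypothetical names introduced only for this example.

```python
# Column name -> type to apply; columns not listed here stay as strings.
CONVERTERS = {'age': int, 'bmi': float, 'children': int, 'charges': float}

def convert_record(record, converters=CONVERTERS):
    """Return a new dict with the listed columns converted; the input is not modified."""
    return {key: converters.get(key, str)(value) for key, value in record.items()}

raw = {'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0',
       'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
converted = convert_record(raw)
print(converted)
```

Keeping the mapping in one place means a new numeric column only needs one extra entry.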

## Step 3 - Defining a bunch of functions to slice the data into different categories
From these categories, one could then run additional analyses to determine average costs, etc. Each function can be modified easily to return the counts from each category and sub-category.

### Step 3.a - Break the data down into Males and Females

In [84]:
def gender_breakdown(dictionary):
    
    # initiate the new dictionary with two lists, one per gender
    gender_dict = {'males':[], 'females':[]}
    
    # while we're at it, let's get a count of how many men and women are included in the dataset.
    male_count = 0
    female_count = 0
    
    # iterate through the records to separate men and women into the two lists.
    for item in dictionary:
        if item['sex'] == 'male':
            gender_dict['males'].append(item)
            male_count += 1
        elif item['sex'] == 'female':
            gender_dict['females'].append(item)
            female_count += 1
    print('There are ' + str(female_count) +' women in the dataset and ' + str(male_count) + ' men.')
    return gender_dict

by_gender_dict = gender_breakdown(new_insurance_dict)
# print(by_gender_dict['females']) # double check that the separation worked as expected. 

There are 662 women in the dataset and 676 men.
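
The same split can be generalized to any column. The following is a minimal sketch, not the notebook's own function: `group_by` is a hypothetical helper that groups records by the value found under a given key, and the three sample records are made up for illustration.

```python
from collections import defaultdict

def group_by(records, key):
    """Group a list of record dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)

# Made-up sample records with only the field needed here.
sample = [{'sex': 'female'}, {'sex': 'male'}, {'sex': 'female'}]
by_sex = group_by(sample, 'sex')
counts = {name: len(rows) for name, rows in by_sex.items()}
print(counts)
```

Counts fall out of the grouped lists via `len`, so no separate counter variables are needed.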


### Step 3.b - Let's look at representation by region

In [96]:
def region_breakdown(dictionary):
    
    # here is a dictionary of the regions. Not sure a dictionary is necessary, but digging working w/ these things.
    regions = {1: 'northeast', 2: 'southeast', 3: 'northwest', 4: 'southwest'}
    
    # creating a new dictionary for people in each region. 
    region_dict = {'Northeast': [], 'Southeast':[], 'Northwest':[], 'Southwest':[]}
    
    # Let's get a count of each. 
    region1_count = 0
    region2_count = 0
    region3_count = 0
    region4_count = 0
    
    # let's check to see which dictionary is input. If the high-level dictionary, we need to handle it differently than a broken down dict.
    for item in dictionary:
        
        # If a dictionary of lists (such as by_gender_dict), each item is a string key
        if (type(item) is str):
            for line in dictionary[item]:
                if regions[1] in line['region']:
                    region_dict['Northeast'].append(line)
                    region1_count += 1
                elif regions[2] in line['region']:
                    region_dict['Southeast'].append(line)
                    region2_count += 1
                elif regions[3] in line['region']:
                    region_dict['Northwest'].append(line)
                    region3_count += 1
                elif regions[4] in line['region']:
                    region_dict['Southwest'].append(line)
                    region4_count += 1
        
        # if not, it's the main dictionary
        elif type(item) is dict:
            if regions[1] in item['region']:
                region_dict['Northeast'].append(item)
                region1_count += 1
            elif regions[2] in item['region']:
                region_dict['Southeast'].append(item)
                region2_count += 1
            elif regions[3] in item['region']:
                region_dict['Northwest'].append(item)
                region3_count += 1
            elif regions[4] in item['region']:
                region_dict['Southwest'].append(item)
                region4_count += 1
    print('Broken down by region, there are ' + str(region1_count) + ' in the NE, ' + str(region2_count) + ' in the SE, ' + str(region3_count) +' in the NW, and ' + str(region4_count) +' in the SW.')
    return region_dict

by_region_dict = region_breakdown(by_gender_dict)


Broken down by region, there are 324 in the NE, 364 in the SE, 325 in the NW, and 325 in the SW.
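
The two branches above repeat the same region checks for two input shapes. One way to avoid the duplication, sketched here under the assumption that inputs are either a list of records or a dict of lists of records, is to flatten the input first. `iter_records` is a hypothetical helper and the sample data is invented.

```python
from collections import Counter

def iter_records(data):
    """Yield individual records from a list of dicts or a dict of lists of dicts."""
    if isinstance(data, dict):
        for rows in data.values():
            yield from rows
    else:
        yield from data

# Invented sample: the same three records in flat and nested form.
flat = [{'region': 'northeast'}, {'region': 'southwest'}, {'region': 'northeast'}]
nested = {'males': [flat[0]], 'females': flat[1:]}

flat_counts = Counter(record['region'] for record in iter_records(flat))
nested_counts = Counter(record['region'] for record in iter_records(nested))
print(flat_counts == nested_counts)
```

With the input flattened, a single loop over `iter_records(dictionary)` could replace both branches.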


### Step 3.c - Looking at Representation by Age Group

In [86]:
def agegroup_breakdown(dictionary):
    
    # here is a dictionary of the upper age limit for each age group.
    age_groups = {1: 24, 2: 29, 3: 34, 4: 39, 5: 44, 6: 49, 7: 54, 8: 59, 9: 64}
    
    # creating a new dictionary for people in each age group. 
    agegroup_dict = {'19 to 24': [], '25 to 29':[], '30 to 34':[], '35 to 39':[], '40 to 44':[], '45 to 49':[], '50 to 54':[], '55 to 59':[], '60 to 64':[]}
    
    # Let's get a count of each. 
    group1_count = 0
    group2_count = 0
    group3_count = 0
    group4_count = 0
    group5_count = 0
    group6_count = 0
    group7_count = 0
    group8_count = 0
    group9_count = 0
    
    # let's check to see which dictionary is input. If the high-level dictionary, we need to handle it differently than a broken down dict.
    for item in dictionary:
        
        # If a dictionary of lists (such as by_gender_dict), each item is a string key
        if (type(item) is str):
            for line in dictionary[item]:
                if line['age'] <= age_groups[1]:
                    agegroup_dict['19 to 24'].append(line)
                    group1_count += 1
                elif line['age'] <= age_groups [2] and line['age'] > age_groups[1]:
                    agegroup_dict['25 to 29'].append(line)
                    group2_count += 1
                elif line['age'] <= age_groups [3] and line['age']> age_groups[2]:
                    agegroup_dict['30 to 34'].append(line)
                    group3_count += 1
                elif line['age'] <= age_groups [4] and line['age']> age_groups[3]:
                    agegroup_dict['35 to 39'].append(line)
                    group4_count += 1
                elif line['age'] <= age_groups [5] and line['age']> age_groups[4]:
                    agegroup_dict['40 to 44'].append(line)
                    group5_count += 1
                elif line['age'] <= age_groups [6] and line['age']> age_groups[5]:
                    agegroup_dict['45 to 49'].append(line)
                    group6_count += 1
                elif line['age'] <= age_groups [7] and line['age']> age_groups[6]:
                    agegroup_dict['50 to 54'].append(line)
                    group7_count += 1
                elif line['age'] <= age_groups [8] and line['age']> age_groups[7]:
                    agegroup_dict['55 to 59'].append(line)
                    group8_count += 1
                elif line['age'] <= age_groups [9] and line['age']> age_groups[8]:
                    agegroup_dict['60 to 64'].append(line)
                    group9_count += 1
        
        # if not, it's the main list of records, so each item is already a record
        elif type(item) is dict:
            if item['age'] <= age_groups[1]:
                agegroup_dict['19 to 24'].append(item)
                group1_count += 1
            elif item['age'] <= age_groups [2] and item['age'] > age_groups[1]:
                agegroup_dict['25 to 29'].append(item)
                group2_count += 1
            elif item['age'] <= age_groups [3] and item['age'] > age_groups[2]:
                agegroup_dict['30 to 34'].append(item)
                group3_count += 1
            elif item['age'] <= age_groups [4] and item['age']> age_groups[3]:
                agegroup_dict['35 to 39'].append(item)
                group4_count += 1
            elif item['age'] <= age_groups [5] and item['age']> age_groups[4]:
                agegroup_dict['40 to 44'].append(item)
                group5_count += 1
            elif item['age'] <= age_groups [6] and item['age']> age_groups[5]:
                agegroup_dict['45 to 49'].append(item)
                group6_count += 1
            elif item['age'] <= age_groups [7] and item['age']> age_groups[6]:
                agegroup_dict['50 to 54'].append(item)
                group7_count += 1
            elif item['age'] <= age_groups [8] and item['age']> age_groups[7]:
                agegroup_dict['55 to 59'].append(item)
                group8_count += 1
            elif item['age'] <= age_groups [9] and item['age']> age_groups[8]:
                agegroup_dict['60 to 64'].append(item)
                group9_count += 1
    print("Broken down by age group, the counts are in order of {'19 to 24': [], '25 to 29':[], '30 to 34':[], '35 to 39':[], '40 to 44':[], '45 to 49':[], '50 to 54':[], '55 to 59':[], '60 to 64':[]} " + str(group1_count) + ','+str(group2_count) +','+  str(group3_count)+','+ str(group4_count) +','+str(group5_count) +','+str(group6_count)+','+str(group7_count)+','+str(group8_count)+','+str(group9_count))
    return agegroup_dict

by_age_dict = agegroup_breakdown(by_gender_dict)

Broken down by age group, the counts are in order of {'19 to 24': [], '25 to 29':[], '30 to 34':[], '35 to 39':[], '40 to 44':[], '45 to 49':[], '50 to 54':[], '55 to 59':[], '60 to 64':[]} 278,139,132,125,135,144,143,128,114
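
The long `elif` chain is a lookup of the first upper bound that is greater than or equal to the age, which the standard library's `bisect` module can do directly. This is only a sketch of an alternative: `age_group`, `UPPER_BOUNDS`, and `LABELS` are hypothetical names, and returning `None` for ages above 64 is an assumption made here.

```python
import bisect

# Upper age limit of each group, and the matching labels from the notebook.
UPPER_BOUNDS = [24, 29, 34, 39, 44, 49, 54, 59, 64]
LABELS = ['19 to 24', '25 to 29', '30 to 34', '35 to 39', '40 to 44',
          '45 to 49', '50 to 54', '55 to 59', '60 to 64']

def age_group(age):
    """Return the label of the first group whose upper bound is >= age, else None."""
    index = bisect.bisect_left(UPPER_BOUNDS, age)
    return LABELS[index] if index < len(LABELS) else None

print(age_group(24), age_group(25), age_group(64), age_group(65))
```

As with the `<=` comparison in the notebook, any age of 24 or below lands in the first group.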


## Step 4 - Analyzing the Data to determine which group(s) have highest insurance costs

### Step 4.a - Average cost of insurance by BMI group

In [112]:
def get_BMI_cost(dictionary):
    """Pass a list of records: either the main list created when importing the data, or one sub-list from a
    breakdown (for example by_region_dict['Northeast']). Passing a dictionary of lists directly will throw an error."""
        
    # initialize the counting variables:
    underweight = 0
    un_ct = 0
    healthy = 0
    he_ct = 0
    overweight = 0
    ov_ct = 0
    obese = 0
    ob_ct = 0
    bmi_groups = {'underweight': 18.5, 'Healthy': 24.9, 'Overweight': 29.9} # Obese = 30.0 and above
    
    # run through the data to find the totals for charges and counts for people in each group.
    for item in dictionary:
        if item['bmi'] <= bmi_groups['underweight']:
            underweight += item['charges']
            un_ct += 1
        elif item['bmi'] > bmi_groups['underweight'] and item['bmi'] <= bmi_groups['Healthy']:
            healthy += item['charges']
            he_ct += 1
        elif item['bmi'] > bmi_groups['Healthy'] and item['bmi'] <= bmi_groups['Overweight']:
            overweight += item['charges']
            ov_ct += 1
        elif item['bmi'] > bmi_groups['Overweight']:
            obese += item['charges']
            ob_ct += 1
    # Averaging the data
    if un_ct > 0:
        av_cost_un = underweight / un_ct
    else:
        av_cost_un = 'n/a'
    av_cost_he = healthy / he_ct
    av_cost_ov = overweight / ov_ct
    av_cost_ob = obese / ob_ct
    tot = (underweight + healthy + overweight + obese)
    av = (tot)/(un_ct + he_ct + ov_ct + ob_ct)
    
    # Create the report text
    report = """The overall average is ${average}.
    The average cost for an underweight person in the selected population is: ${under}.
    The average cost for a healthy person is: ${healthy}.
    The average cost for an overweight person is: ${overweight}.
    The average cost for an obese person is: ${obese}.\n"""
    
    report_2 = """The total insurance charges are ${total}.
    For the population provided, there are {under} underweight individuals with total insurance charges of ${under_total}.
    There are {healthy} healthy individuals with total charges of ${healthy_total}.
    There are {overweight} overweight individuals with total charges of ${over_total}.
    There are {obese} obese individuals with total charges of ${obese_total}.\n"""
    
    final_report = report.format(average = av, under = av_cost_un, healthy = av_cost_he, overweight = av_cost_ov, obese = av_cost_ob)
    final_report2 = report_2.format(total = tot, under = un_ct, under_total = underweight, healthy = he_ct, healthy_total = healthy, overweight = ov_ct, over_total = overweight, obese = ob_ct, obese_total = obese)
    
    # Print using string methods
    print(final_report)
    print(final_report2)
    return final_report, final_report2

get_BMI_cost(new_insurance_dict)

The overall average is $13270.422265141253.
    The average cost for an underweight person in the selected population is: $8657.620652380954.
    The average cost for a healthy person is: $10404.900083891405.
    The average cost for an overweight person is: $11006.80998941842.
    The average cost for an obese person is: $15491.542238184353.

The total insurance charges are $17755824.990758996.
    For the population provided, there are $21 underweight individuals with total insurance charges of 181810.03370000003.
    There are 221 healthy individuals with total charges of $2299482.9185400004.
    There are 380 overweight individuals with total charges of $4182587.7959789997.
    There are 716 obese individuals with total charges of $11091944.242539996.



('The overall average is $13270.422265141253.\n    The average cost for an underweight person in the selected population is: $8657.620652380954.\n    The average cost for a healthy person is: $10404.900083891405.\n    The average cost for an overweight person is: $11006.80998941842.\n    The average cost for an obese person is: $15491.542238184353.\n',
 'The total insurance charges are $17755824.990758996.\n    For the population provided, there are $21 underweight individuals with total insurance charges of 181810.03370000003.\n    There are 221 healthy individuals with total charges of $2299482.9185400004.\n    There are 380 overweight individuals with total charges of $4182587.7959789997.\n    There are 716 obese individuals with total charges of $11091944.242539996.\n')
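
The averages above print with full floating-point precision, and only the underweight group is protected against an empty population. A small helper could handle both concerns. This is a sketch with a hypothetical name, `format_average`, using the underweight total and count reported above as sample input.

```python
def format_average(total, count):
    """Format total / count as dollars with two decimals, or 'n/a' for an empty group."""
    if count == 0:
        return 'n/a'
    return f'${total / count:,.2f}'

print(format_average(181810.0337, 21))
print(format_average(0, 0))
```

The `,.2f` format specification adds thousands separators and rounds to cents, which keeps the report readable without changing the underlying totals.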

In [None]:
def get_children(dictionary):
    """Suggestions for calling this function effectively:
    * Use the main-level dictionary created when importing the data.
    * Use a dictionary created by the region, age group or gender functions by iterating over the sub-dicts in a for loop."""
    
    

## BONUS!

### B.1 For this bonus round, I want to explore one of the following:
* Organize your findings into dictionaries, lists, or another convenient datatype.
* Make predictions about what features are the most influential for an individual’s medical insurance charges based on your analysis.
* Explore areas where the data may include bias and how that would impact potential use cases.

### B.2 For this second bonus round, I want to write a function that will iterate through a given dictionary and provide average values for the insurance costs based on the keys of the dictionary and sub-dicts.
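
A minimal sketch of what such a function might look like, assuming the input is a dictionary of lists of records like the ones returned by the breakdown functions. `average_charges` is a hypothetical name, and the sample groups below are invented rather than taken from the dataset.

```python
def average_charges(groups):
    """Map each group name to the mean of its 'charges', or None for an empty group."""
    averages = {}
    for name, rows in groups.items():
        if rows:
            averages[name] = sum(row['charges'] for row in rows) / len(rows)
        else:
            averages[name] = None
    return averages

# Invented sample shaped like by_region_dict.
sample_groups = {
    'Northeast': [{'charges': 100.0}, {'charges': 300.0}],
    'Southeast': [{'charges': 50.0}],
    'Northwest': [],
}
result = average_charges(sample_groups)
print(result)
```

Returning `None` for empty groups avoids a division by zero and leaves the formatting decision to the caller.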

### B.3 Write results to a new file!

In [111]:
regions = {1: 'Northeast', 2: 'Southeast', 3: 'Northwest', 4: 'Southwest'}

# open the results file once, then append one report per region
with open('Insurance_results.txt', 'a') as f:
    for region in regions.values():
        report, report_2 = get_BMI_cost(by_region_dict[region])
        header = '\nThe following is for the {region} population in the insurance dataset.\n'

        f.write(header.format(region = region))
        f.write(report)
        f.write(report_2)

The overall average is $13406.384516385804.
    The average cost for an underweight person in the selected population is: $8914.42392.
    The average cost for a healthy person is: $11151.782012222224.
    The average cost for an overweight person is: $10818.593626928576.
    The average cost for an obese person is: $16606.762942986108.

The total insurance charges are $4343668.583309.
    For the population provided, there are 10 underweight individuals with total insurance charges of 89144.2392.
    There are 72 healthy individuals with total charges of 802928.3048800001.
    There are 98 overweight individuals with total charges of 1060222.1754390004.
    There are 144 obese individuals with total charges of 2391373.8637899994.

The overall average is $14735.411437609893.
    The average cost for an underweight person in the selected population is: $n/a.
    The average cost for a healthy person is: $13286.808262250002.
    The average cost for an overweight person is: $10846.202292