# U.S. Medical Insurance Costs

Project Goal:
A public health campaign has funds available to improve the health of the population.
The first wave of funding is to target the demographic that is most in need; however, it is unclear who this demographic is.
The goal of this project is to identify the most 'in need' demographic so that a health programme can be tailored to it.
A dataset of medical insurance costs (insurance.csv) has been made available to support this.

In [10]:
import csv

with open('insurance.csv') as temp_file:
    temp_data = csv.DictReader(temp_file)
    #create empty lists to populate with the csv file data
    master_list = []
    age_list = []
    sex_list = []
    bmi_list = []
    children_list = []
    smoker_list = []
    region_list = []
    cost_list = []
    #loop through the csv data and append each value to its list
    for row in temp_data:
        master_list.append(row)
        age_list.append(int(row['age']))
        sex_list.append(row['sex'])
        bmi_list.append(float(row['bmi']))
        children_list.append(int(row['children']))
        smoker_list.append(row['smoker'])
        region_list.append(row['region'])
        cost_list.append(float(row['charges']))

#print(age_list)
#print(master_list)


Before we dive into the detail, let's look at some average metrics to get a feel for what the overall dataset looks like.

In [11]:
#looking at the whole dataset we are able to gain some insights into the population
#firstly we can look at the average values of the whole dataset 

#function to calculate the average value of a list
def calc_average(values):
    #the parameter is named 'values' so it does not shadow the built-in list type
    total = 0
    for item in values:
        total += item
    average = total / len(values)
    return average

#function to count the number of values in a list
def calc_count(values, option):
    return values.count(option)
    
average_age = calc_average(age_list)
average_bmi = calc_average(bmi_list)
average_cost = calc_average(cost_list)
average_children = calc_average(children_list)
print("The average age of the whole dataset is: {:.1f} ".format(average_age))
print("The average bmi of the whole dataset is: {:.1f} ".format(average_bmi))
print("The average insurance cost of the whole dataset is: {:.1f} ".format(average_cost))
print("The average no. of children of the whole dataset is: {:.1f} ".format(average_children))
data_size = len(sex_list)
num_females = calc_count(sex_list,'female')
num_males = calc_count(sex_list,'male')
print("Out of {} datapoints, there are {} or {:.0f}% females and {} or {:.0f}% males.\
".format(data_size, num_females, (num_females/data_size)*100, num_males, (num_males/data_size)*100))
num_smokers = calc_count(smoker_list,'yes')
num_nonsmokers = calc_count(smoker_list,'no')
print("Out of {} datapoints, there are {} smokers and {} non-smokers.".format(data_size, num_smokers, num_nonsmokers))
num_sw_region = calc_count(region_list,'southwest')
num_se_region = calc_count(region_list,'southeast')
num_nw_region = calc_count(region_list,'northwest')
num_ne_region = calc_count(region_list,'northeast')
print("Out of {} datapoints, {} are from the Southwest, {} are from the Southeast, {} are from the Northwest and {} are \
from the Northeast.".format(data_size, num_sw_region, num_se_region, num_nw_region, num_ne_region))

The average age of the whole dataset is: 39.2 
The average bmi of the whole dataset is: 30.7 
The average insurance cost of the whole dataset is: 13270.4 
The average no. of children of the whole dataset is: 1.1 
Out of 1338 datapoints, there are 662 or 49% females and 676 or 51% males.
Out of 1338 datapoints, there are 274 smokers and 1064 non-smokers.
Out of 1338 datapoints, 325 are from the Southwest, 364 are from the Southeast, 325 are from the Northwest and 324 are from the Northeast.
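
The averages and counts above can also be produced with the standard library. The following is a minimal sketch, not the approach used in this notebook: it assumes `statistics.mean` and `collections.Counter` are acceptable substitutes for `calc_average` and `calc_count`, and the helper names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values standing in for the lists built from insurance.csv
sample_ages = [19, 18, 28, 33]
sample_sexes = ['female', 'male', 'male', 'male']

def summarise_numeric(values):
    """Return the mean of a list of numbers, or None if the list is empty."""
    return mean(values) if values else None

def summarise_categories(values):
    """Return {label: (count, percent of total)} for a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {label: (n, 100 * n / total) for label, n in counts.items()}

print("Average age: {:.1f}".format(summarise_numeric(sample_ages)))
for label, (n, pct) in summarise_categories(sample_sexes).items():
    print("{}: {} ({:.0f}%)".format(label, n, pct))
```

Because `Counter` tallies every label in one pass, the same helper would cover sex, smoker status and region without a separate `count` call per option.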


Across the whole dataset the split between females and males is close to even, at approximately 49% to 51%.
The next objective is to create a class whose methods build a dictionary from the csv data, which we can then analyse.
I have added age group to the data, and have given each datapoint a reference ID.
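
As a small illustration of the reference ID idea, the sketch below builds an ID-keyed dictionary with `enumerate`. It is an alternative to the `zip` approach used in the class, and the sample rows are hypothetical stand-ins for `master_list`.

```python
# Hypothetical rows standing in for master_list (csv.DictReader yields string values)
sample_rows = [
    {'age': '19', 'sex': 'female'},
    {'age': '18', 'sex': 'male'},
]

# enumerate(..., start=1) pairs each row with a 1-based reference ID
rows_by_id = dict(enumerate(sample_rows, start=1))

print(rows_by_id[1])
print(sorted(rows_by_id))
```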

In [165]:
#the male / female split by region is:

#class that takes in a list of data and has methods to create a dictionary and do analysis on the data
class Datadict:
    def __init__(self, list):
        self.list = list
        self.count = len(list)
        self.id_list = [i for i in range(1,self.count +1)]

#method to create a dictionary from the csv data (stored as row dictionaries) adding a unique ID as the key
    def createDict(self):
        self.dict = {key:value for key, value in zip(self.id_list, self.list)}
        return self.dict
    
#method to add age group to the data so when looking at the regional data this can be split by region and agegroup
    def age_group(self):
        for key, value in self.dict.items():
            if int(value['age']) <= 25:
                value['age_group'] = "under25"
            elif 25 < int(value['age']) <= 35:
                value['age_group'] = "under35"
            elif 35 < int(value['age']) <= 45:
                value['age_group'] = "under45"
            elif 45 < int(value['age']) <= 55:
                value['age_group'] = "under55"
            else:  #any age above 55 (55 itself falls in "under55")
                value['age_group'] = "over55"
        return self.dict
            
#method to create regional dictionaries from the total dataset so analysis can be done at the regional level    
    def createRegion(self, region):
        region_dict = {}
        female_count = 0
        male_count = 0
        region_charge = 0
        for key, value in self.dict.items():
            if value.get('region') == region:
                region_dict[key] = value
                if value.get('sex') == 'female':
                    female_count += 1
                else:
                    male_count += 1
                #calculate the total insurance cost for this region
                region_charge += float(value.get('charges'))
        print("There are {} people in this region, {} female and {} male.".format(female_count + male_count, female_count, male_count))
        print("The total insurance cost for this region is: {:.0f}.".format(region_charge))
        return region_dict
    
    
#method to create age group dictionaries from the total dataset so analysis can be done at the age group level    
    def createRegionAgeGroup(self, region, ageGroup):
        region_age_dict = {}
        female_count = 0
        male_count = 0
        ageGroup_charge = 0
        for key, value in self.dict.items():
            if value.get('region') == region and value.get('age_group') == ageGroup:
                #print(key, value)
                region_age_dict[key] = value
                if value.get('sex') == 'female':
                    female_count += 1
                else:
                    male_count += 1
                #calculate the total insurance cost for this region and age group
                ageGroup_charge += float(value.get('charges'))
        print("The total insurance cost for the {} region, age group {} is: {:.0f}.".format(region, ageGroup, ageGroup_charge))
        print("There are {} people, {} female and {} male.".format(female_count + male_count, female_count, male_count))
        return region_age_dict
    
    
#method to calculate the average values of the input measure separating female and male
    def ave_measure(self, data, measure):
        count_female = 0
        count_male = 0
        total_female = 0
        total_male = 0
        for key, value in data.items():
            if value.get('sex') == 'female':
                count_female += 1
                total_female += float(value.get(measure))
            else:
                count_male += 1
                total_male += float(value.get(measure))
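        #guard against an empty group (no females or no males) to avoid dividing by zero
        average_female = total_female / count_female if count_female else float('nan')
        average_male = total_male / count_male if count_male else float('nan')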
        print("Average female {} is : {:.1f}, and average male {} is: {:.1f}.".format(measure, average_female, measure, average_male))
        return average_female, average_male
    
#method to calculate the % of the dataset that smokes also separating female and male  
    def smoker_percent(self, data):
        count_female = 0
        count_female_smoker = 0
        count_male = 0
        count_male_smoker = 0
        for key, value in data.items():
            if value.get('sex') == 'female':
                count_female += 1
                if value.get('smoker') == 'yes':
                    count_female_smoker += 1
            else:
                count_male += 1
                if value.get('smoker') == 'yes':
                    count_male_smoker += 1
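        #guard against an empty group (no females or no males) to avoid dividing by zero
        female_smoker_percent = (count_female_smoker/count_female) * 100 if count_female else float('nan')
        male_smoker_percent = (count_male_smoker/count_male) * 100 if count_male else float('nan')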
        print("{:.1f}% of females are smokers, and {:.1f}% of males are smokers.".format(female_smoker_percent, male_smoker_percent))
        return female_smoker_percent, male_smoker_percent
                


In [169]:
#create master dictionary object (Datadict) using whole dataset from csv file
master_data = Datadict(master_list)
master_dict = master_data.createDict()

#using the object Datadict, we can create sub-dictionaries for each region and calculate the average measures
region = ['southwest', 'southeast', 'northwest', 'northeast']
measures = ['age', 'bmi', 'children', 'charges']

for local in region:
    print()
    print("Region: {}".format(local))
    region_dict = master_data.createRegion(local)
    #print(region_dict)
    for item in measures:
        master_data.ave_measure(region_dict, item)
    master_data.smoker_percent(region_dict)
 


Region: southwest
There are 325 people in this region, 162 female and 163 male.
The total insurance cost for this region is: 4012755.
Average female age is :39.7, and average male age is: 39.2.
Average female bmi is :30.1, and average male bmi is: 31.1.
Average female children is :1.1, and average male children is: 1.2.
Average female charges is :11274.4, and average male charges is: 13412.9.
13.0% of females are smokers, and 22.7% of males are smokers.

Region: southeast
There are 364 people in this region, 175 female and 189 male.
The total insurance cost for this region is: 5363690.
Average female age is :39.1, and average male age is: 38.8.
Average female bmi is :32.7, and average male bmi is: 34.0.
Average female children is :1.1, and average male children is: 1.0.
Average female charges is :13499.7, and average male charges is: 15879.6.
20.6% of females are smokers, and 29.1% of males are smokers.

Region: northwest
There are 325 people in this region, 164 female and 161 male.
T

Looking at the data by region, broken down by female / male, shows that the Southeast has the unhealthiest population. Despite having the lowest average ages for both females and males, the Southeast has the highest BMIs and the highest proportion of smokers, which contributes to this region having the highest insurance costs.
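
The regional comparison above was read off the printed output. One way to make it explicit is to rank the regions on each measure, as in the rough sketch below. The function names and sample rows are hypothetical, and the sample values are made up for illustration rather than taken from insurance.csv.

```python
from collections import defaultdict

# Hypothetical rows in the same shape as csv.DictReader output (all values are strings)
sample_rows = [
    {'region': 'southeast', 'smoker': 'yes', 'bmi': '34.0', 'charges': '20000.0'},
    {'region': 'southeast', 'smoker': 'no', 'bmi': '32.0', 'charges': '10000.0'},
    {'region': 'southwest', 'smoker': 'no', 'bmi': '30.0', 'charges': '8000.0'},
    {'region': 'southwest', 'smoker': 'no', 'bmi': '31.0', 'charges': '12000.0'},
]

def region_summary(rows):
    """Return {region: {'mean_bmi', 'mean_charges', 'smoker_pct'}}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row['region']].append(row)
    summary = {}
    for region_name, members in grouped.items():
        n = len(members)
        summary[region_name] = {
            'mean_bmi': sum(float(m['bmi']) for m in members) / n,
            'mean_charges': sum(float(m['charges']) for m in members) / n,
            'smoker_pct': 100 * sum(m['smoker'] == 'yes' for m in members) / n,
        }
    return summary

def rank_regions(summary, measure):
    """Return region names sorted from highest to lowest on the given measure."""
    return sorted(summary, key=lambda name: summary[name][measure], reverse=True)

summary = region_summary(sample_rows)
for measure in ('mean_bmi', 'mean_charges', 'smoker_pct'):
    print(measure, rank_regions(summary, measure))
```

A region that comes first on BMI, smoker share and charges at once would support the kind of conclusion drawn above.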

The next step is to add age group to the dataset so we can pinpoint the demographic that can be targeted with the health programme.
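
The age bucketing can be sketched as a standalone function. This is a hypothetical helper (`age_group_for` is not part of the notebook) that assumes the same five labels and inclusive upper bounds as the `age_group` method in the `Datadict` class.

```python
def age_group_for(age):
    """Map an age in years to one of the five age-group labels.

    Each upper bound is inclusive, so age 25 falls in "under25"
    and age 55 falls in "under55".
    """
    upper_bounds = [(25, "under25"), (35, "under35"), (45, "under45"), (55, "under55")]
    for upper, label in upper_bounds:
        if age <= upper:
            return label
    return "over55"

for age in (18, 25, 26, 55, 56, 64):
    print(age, age_group_for(age))
```

Keeping the boundaries in one list makes it easy to check where each edge case lands.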



In [170]:
#Then, create a new list that is a count of ages that appear in age bucket, to break down the average measures even further. 
#Do we need to amend the region dictionary and then pass this into the existing methods?
#We want to calculate the average age, bmi, no.children, smoker%, insurance cost for the different age groups
#Firstly what is the range in age for the dataset?


#let's add Age_Group info to the dataset
master_dict = master_data.age_group()
#master_dict = master_data.createRegion('southwest')
#master_dict = master_data.createRegionAgeGroup('southwest', "under25")
#print(master_dict)

#Let's run the regional analysis but this time by region AND age group
age_group = ['under25','under35','under45','under55','over55']

data_list = []

for local in region:
    print()
    print("Region: {}".format(local.upper()))
    for age in age_group:
        print("Region {}, age group {}:".format(local, age))
        region_age_dict = master_data.createRegionAgeGroup(local, age)
        #print(region_age_dict)
        for item in measures:
            data = master_data.ave_measure(region_age_dict, item)
            data_list.append(data)
        master_data.smoker_percent(region_age_dict)    
        print()


    


Region: SOUTHWEST
Region southwest, age group under25:
The total insurance cost for the southwest region, age group under25 is: 607697.
There are 73 people, 36 female and 37 male.
Average female age is :21.0, and average male age is: 21.0.
Average female bmi is :27.7, and average male bmi is: 29.7.
Average female children is :0.7, and average male children is: 0.9.
Average female charges is :7788.0, and average male charges is: 8846.7.
19.4% of females are smokers, and 24.3% of males are smokers.

Region southwest, age group under35:
The total insurance cost for the southwest region, age group under35 is: 725927.
There are 66 people, 31 female and 35 male.
Average female age is :30.6, and average male age is: 30.5.
Average female bmi is :30.0, and average male bmi is: 31.4.
Average female children is :1.3, and average male children is: 1.7.
Average female charges is :7479.8, and average male charges is: 14115.8.
9.7% of females are smokers, and 31.4% of males are smokers.

Region sout

Summary:
For this project I created a master dictionary from the csv data, then created sub-dictionaries for each region / age group and calculated the average measures for each.
This provides the data that I need; however, I think it might have been better to create methods more in line with the 'hurricane' project, which created dictionaries based on the different measures. This would have allowed for greater analysis of the data.
So whilst my class and methods work for calculating measures such as average age, no. of children etc., it is not possible to calculate the maximum or minimum values easily, unless I store this data in a list and then run some calculations from that. This, however, starts to get more complicated, as the list will be by region and include all measures (run print(data_list) to see how unruly the list gets), and I would have to start indexing the items in the list and creating more methods.
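
One possible shape for that alternative is sketched below: the raw values for a single measure are collected into lists keyed by group, so minimum, maximum and mean all come from the same structure. This is only a sketch under assumptions. The function names, the choice of `(region, sex)` as the grouping key and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the same shape as csv.DictReader output (all values are strings)
sample_rows = [
    {'region': 'southwest', 'sex': 'female', 'charges': '1000.0'},
    {'region': 'southwest', 'sex': 'female', 'charges': '3000.0'},
    {'region': 'southeast', 'sex': 'male', 'charges': '2000.0'},
]

def values_by_group(rows, measure, group_keys=('region', 'sex')):
    """Collect the numeric values of one measure into lists keyed by group."""
    grouped = defaultdict(list)
    for row in rows:
        group = tuple(row[key] for key in group_keys)
        grouped[group].append(float(row[measure]))
    return dict(grouped)

def describe(values):
    """Return minimum, maximum and mean for a non-empty list of numbers."""
    return {'min': min(values), 'max': max(values), 'mean': sum(values) / len(values)}

charges = values_by_group(sample_rows, 'charges')
for group, values in charges.items():
    print(group, describe(values))
```

Because the raw values are kept rather than only the averages, further statistics could be added to `describe` without touching the grouping step.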

This has been a great learning project for creating classes and understanding better how they work. Next time I will create dictionaries keyed by the different measures, so that the approach I take allows me to run more data analysis.


In [172]:
#print(data_list)