# U.S. Medical Insurance Costs

In this project, a CSV file with medical insurance costs is investigated using Python fundamentals. The goal is to analyze the data with a focus on the regions available in the dataset: the regions are compared on their medical insurance costs and on the demographic information available.

## 1. Preparation of data and collection of basic information about the sample

The first step is to import the csv library and load the data from the **insurance.csv** file for further analysis.

The data will be stored in a list called `insurance_data`, which contains each observation as a dictionary with the keys extracted from the first row of the csv file. 

Since the focus of the project is regions, a list called `list_regions` will also be created. This list contains only the names of the regions in the dataset and will be needed for one of the next steps.

In [1]:
# import csv library
import csv

# open the csv file and load the data into two lists
insurance_data = []
list_regions = []

with open('insurance.csv') as csv_file:
    csv_dict = csv.DictReader(csv_file)
    for row in csv_dict:
        insurance_data.append(row)
        list_regions.append(row['region'])

# print the result. For better readability, the lists are separated by three empty lines
print(insurance_data, list_regions, sep='\n\n\n')

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}, {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}, {'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}, {'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges'

The numeric values in the `insurance_data` list were loaded as strings. For further calculations, it's necessary to update the list and to convert the values of `'age'` and `'children'` into **int** format, and the values of `'bmi'` and `'charges'` into **float** format.

In [2]:
# convert the values in the insurance_data list
for row in insurance_data:
    row['age'] = int(row['age'])
    row['bmi'] = float(row['bmi'])
    row['children'] = int(row['children'])
    row['charges'] = float(row['charges'])

# check the updated list    
print(insurance_data)

[{'age': 19, 'sex': 'female', 'bmi': 27.9, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 16884.924}, {'age': 18, 'sex': 'male', 'bmi': 33.77, 'children': 1, 'smoker': 'no', 'region': 'southeast', 'charges': 1725.5523}, {'age': 28, 'sex': 'male', 'bmi': 33.0, 'children': 3, 'smoker': 'no', 'region': 'southeast', 'charges': 4449.462}, {'age': 33, 'sex': 'male', 'bmi': 22.705, 'children': 0, 'smoker': 'no', 'region': 'northwest', 'charges': 21984.47061}, {'age': 32, 'sex': 'male', 'bmi': 28.88, 'children': 0, 'smoker': 'no', 'region': 'northwest', 'charges': 3866.8552}, {'age': 31, 'sex': 'female', 'bmi': 25.74, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 3756.6216}, {'age': 46, 'sex': 'female', 'bmi': 33.44, 'children': 1, 'smoker': 'no', 'region': 'southeast', 'charges': 8240.5896}, {'age': 37, 'sex': 'female', 'bmi': 27.74, 'children': 3, 'smoker': 'no', 'region': 'northwest', 'charges': 7281.5056}, {'age': 37, 'sex': 'male', 'bmi': 29.83, 'chil

Now the data is ready for analysis.
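
The loading and conversion steps above can be tried without the file. The following is a minimal sketch that reads the first two records shown in the output above from an in-memory string. The names `sample_csv`, `converters` and `sample_data` are assumptions made for this illustration and are not used elsewhere in the project.

```python
import csv
import io

# Two-row sample copied from the first records shown above; the in-memory
# "file" stands in for insurance.csv (an assumption for illustration).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

# Map each numeric column to the type it should be converted to.
converters = {'age': int, 'bmi': float, 'children': int, 'charges': float}

sample_data = []
for row in csv.DictReader(sample_csv):
    # DictReader yields every value as a string, so convert the numeric ones.
    for key, convert in converters.items():
        row[key] = convert(row[key])
    sample_data.append(row)

print(sample_data[0])
```

Keeping the column-to-type mapping in one dictionary means a new numeric column only needs one extra entry.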

In order to collect information about regions it is first necessary to find out which regions are represented in the dataset.

In [3]:
# create a list with unique region names
regions_available = []
for region in list_regions:
    if region not in regions_available:
        regions_available.append(region)

# print the result in alphabetical order
print(sorted(regions_available))

['northeast', 'northwest', 'southeast', 'southwest']


There are 4 regions: ***Northeast***, ***Northwest***, ***Southeast*** and ***Southwest***. It is also important to know how many observations there are for each region.

In [4]:
# calculate the sample sizes for each region
northeast_count = list_regions.count('northeast')
northwest_count = list_regions.count('northwest')
southeast_count = list_regions.count('southeast')
southwest_count = list_regions.count('southwest')

# print the result
print('Northeast sample: %s' %northeast_count)
print('Northwest sample: %s' %northwest_count)
print('Southeast sample: %s' %southeast_count)
print('Southwest sample: %s' %southwest_count)

Northeast sample: 324
Northwest sample: 325
Southeast sample: 364
Southwest sample: 325


Each region has a reasonably large sample. There are somewhat more observations for Southeast (***n=364 vs. n=324-325***), but overall the regions are represented about equally in the dataset, which makes comparisons between regions reasonable.
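
As an alternative to the manual loop and the four separate `count` calls, the unique names and the sample sizes can be obtained with `set` and `collections.Counter`. This is only a sketch: `sample_regions` is an assumed stand-in for `list_regions` and holds the regions of the first eight records shown earlier.

```python
from collections import Counter

# Regions of the first eight records shown earlier; stands in for list_regions.
sample_regions = ['southwest', 'southeast', 'southeast', 'northwest',
                  'northwest', 'southeast', 'southeast', 'northwest']

# Unique region names in alphabetical order.
unique_regions = sorted(set(sample_regions))

# Number of observations per region, counted in a single pass.
region_counts = Counter(sample_regions)

for region in unique_regions:
    print('%s sample: %s' % (region.capitalize(), region_counts[region]))
```

With this approach the code does not need to know the region names in advance.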

## 2. Exploring differences in the insurance costs between the regions

The next task is to compare the insurance costs between the regions. This requires splitting the data set into four parts containing observations for each of the regions separately.

In [5]:
# create a function that extracts observations from the insurance_data list only for the region given in the function argument
def insurance_data_region(region):
    final_list = []
    for row in insurance_data:
        if row['region'] == region:
            final_list.append(row)
    return final_list

# create a list for each region using this function
insurance_data_northeast = insurance_data_region('northeast')
insurance_data_northwest = insurance_data_region('northwest')
insurance_data_southeast = insurance_data_region('southeast')
insurance_data_southwest = insurance_data_region('southwest')

Based on the created lists, it is possible to calculate the maximum value of insurance costs for each of the regions.

In [6]:
# create a function that calculates the highest value of charges for the region. The result will be rounded to 2 decimal places.
def max_charges(data_region):
    list_charges = []
    for row in data_region:
        list_charges.append(row['charges'])
    return round(max(list_charges), 2)

# print the result for each region
print('Northeast highest cost: %s' %max_charges(insurance_data_northeast))
print('Northwest highest cost: %s' %max_charges(insurance_data_northwest))
print('Southeast highest cost: %s' %max_charges(insurance_data_southeast))
print('Southwest highest cost: %s' %max_charges(insurance_data_southwest))

Northeast highest cost: 58571.07
Northwest highest cost: 60021.4
Southeast highest cost: 63770.43
Southwest highest cost: 52590.83


Southeast has the highest value of insurance costs: ***63770.43***. Let's look at the minimum values.

In [7]:
# create a function that calculates the lowest value of charges for the region. The result will be rounded to 2 decimal places.
def min_charges(data_region):
    list_charges = []
    for row in data_region:
        list_charges.append(row['charges'])
    return round(min(list_charges), 2)

# print the result for each region
print('Northeast lowest cost: %s' %min_charges(insurance_data_northeast))
print('Northwest lowest cost: %s' %min_charges(insurance_data_northwest))
print('Southeast lowest cost: %s' %min_charges(insurance_data_southeast))
print('Southwest lowest cost: %s' %min_charges(insurance_data_southwest))

Northeast lowest cost: 1694.8
Northwest lowest cost: 1621.34
Southeast lowest cost: 1121.87
Southwest lowest cost: 1241.57


Southeast also has the lowest value of insurance costs: ***1121.87***. Since the extreme values alone do not give the whole picture, it is worth looking at the mean values and standard deviations.

In [8]:
# for the next functions the statistics library needs to be imported
import statistics

# create a function that calculates the mean value of charges for the region. The result will be rounded to 2 decimal places.
def mean_charges(data_region):
    list_charges = []
    for row in data_region:
        list_charges.append(row['charges'])
    return round(statistics.mean(list_charges), 2)

# create a function that calculates the standard deviation of charges for the region. The result will be rounded to 3 decimal places.
def stdev_charges(data_region):
    list_charges = []
    for row in data_region:
        list_charges.append(row['charges'])
    return round(statistics.stdev(list_charges), 3)

# print the result for each region
print('Northeast average cost is %s with standard deviation %s' % (mean_charges(insurance_data_northeast), stdev_charges(insurance_data_northeast)))
print('Northwest average cost is %s with standard deviation %s' % (mean_charges(insurance_data_northwest), stdev_charges(insurance_data_northwest)))
print('Southeast average cost is %s with standard deviation %s' % (mean_charges(insurance_data_southeast), stdev_charges(insurance_data_southeast)))
print('Southwest average cost is %s with standard deviation %s' % (mean_charges(insurance_data_southwest), stdev_charges(insurance_data_southwest)))

Northeast average cost is 13406.38 with standard deviation 11255.803
Northwest average cost is 12417.58 with standard deviation 11072.277
Southeast average cost is 14735.41 with standard deviation 13971.099
Southwest average cost is 12346.94 with standard deviation 11557.179


Southeast has the highest mean value of insurance costs (***14735.41***) followed by Northeast (***13406.38***), Northwest (***12417.58***) and Southwest (***12346.94***).

The standard deviation for each of the regions is very high: it is nearly as large as the mean, which indicates a wide spread of charges. Because charges cannot be negative, a spread of this size suggests that the distribution is skewed toward high values rather than normally distributed. Southeast also has the highest standard deviation (***13971.099***), which is consistent with the fact that its minimum and maximum values are the furthest apart of all regions.
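
The spread can be put into numbers using only the figures printed above. The sketch below computes the range (highest minus lowest) and the ratio of standard deviation to mean for each region. The names `reported` and `spread` are assumptions made for this illustration, and the ratio is an added descriptive measure, not part of the original analysis.

```python
# Figures copied from the outputs printed above.
reported = {
    'northeast': {'highest': 58571.07, 'lowest': 1694.8, 'mean': 13406.38, 'stdev': 11255.803},
    'northwest': {'highest': 60021.4, 'lowest': 1621.34, 'mean': 12417.58, 'stdev': 11072.277},
    'southeast': {'highest': 63770.43, 'lowest': 1121.87, 'mean': 14735.41, 'stdev': 13971.099},
    'southwest': {'highest': 52590.83, 'lowest': 1241.57, 'mean': 12346.94, 'stdev': 11557.179},
}

spread = {}
for region, stats in reported.items():
    spread[region] = {
        # distance between the extreme values
        'range': round(stats['highest'] - stats['lowest'], 2),
        # standard deviation relative to the mean
        'ratio': round(stats['stdev'] / stats['mean'], 2),
    }

for region, values in spread.items():
    print(region, values)
```

In every region the ratio falls between about 0.8 and 1, and Southeast has both the largest range and the largest ratio.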

Let's save the collected information in a dictionary called `regions_charges`.

In [9]:
# create a function that returns the dictionary with the descriptive statistics about the insurance costs for the analyzed region
def charges_stats(data_region):
    return {'highest': max_charges(data_region), 'lowest' : min_charges(data_region), 'mean' : mean_charges(data_region), 'stdev' : stdev_charges(data_region)}

# save the data in a dictionary called regions_charges
regions_charges = {'northeast' : charges_stats(insurance_data_northeast), 'northwest' : charges_stats(insurance_data_northwest), 'southeast' : charges_stats(insurance_data_southeast), 'southwest' : charges_stats(insurance_data_southwest)}

# print the created dictionary
print(regions_charges)

{'northeast': {'highest': 58571.07, 'lowest': 1694.8, 'mean': 13406.38, 'stdev': 11255.803}, 'northwest': {'highest': 60021.4, 'lowest': 1621.34, 'mean': 12417.58, 'stdev': 11072.277}, 'southeast': {'highest': 63770.43, 'lowest': 1121.87, 'mean': 14735.41, 'stdev': 13971.099}, 'southwest': {'highest': 52590.83, 'lowest': 1241.57, 'mean': 12346.94, 'stdev': 11557.179}}
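
The four helper functions each rebuild the same list of charges. A more compact variant could extract the list once and compute all four statistics together. This is a sketch under the assumption that every row is a dictionary with a numeric `'charges'` value. `charges_summary` and `sample_rows` are hypothetical names, and the three rows are made up for illustration.

```python
import statistics

def charges_summary(data_region):
    """Return descriptive statistics of 'charges' for a list of row dicts."""
    charges = [row['charges'] for row in data_region]
    return {
        'highest': round(max(charges), 2),
        'lowest': round(min(charges), 2),
        'mean': round(statistics.mean(charges), 2),
        'stdev': round(statistics.stdev(charges), 3),
    }

# Made-up rows, for illustration only.
sample_rows = [{'charges': 1000.0}, {'charges': 2000.0}, {'charges': 3000.0}]
print(charges_summary(sample_rows))
```

Note that `statistics.stdev` computes the sample standard deviation and needs at least two values.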


## 3. Analysis of demographic characteristics of regions

Regional affiliation alone is unlikely to determine the difference in insurance costs. To get a fuller picture, it's necessary to look at the demographic variables: `'age'`, `'sex'`, `'bmi'`, `'children'` and `'smoker'`.

Let's start with dichotomous variables: `'sex'` and `'smoker'`.

In [10]:
# create a function that returns percentages of males and females in the region, rounded to 2 decimal places.
def sexes_stats(data_region):
    list_sexes = []
    for row in data_region:
        list_sexes.append(row['sex'])
    percentage_female = (list_sexes.count('female') / len(data_region)) * 100
    percentage_male = (list_sexes.count('male') / len(data_region)) * 100
    return {'female' : round(percentage_female, 2), 'male' : round(percentage_male, 2)}

# print the result for each region
print('Northeast:', sexes_stats(insurance_data_northeast))
print('Northwest:', sexes_stats(insurance_data_northwest))
print('Southeast:', sexes_stats(insurance_data_southeast))
print('Southwest:', sexes_stats(insurance_data_southwest))

Northeast: {'female': 49.69, 'male': 50.31}
Northwest: {'female': 50.46, 'male': 49.54}
Southeast: {'female': 48.08, 'male': 51.92}
Southwest: {'female': 49.85, 'male': 50.15}


In every region, men and women are represented almost equally (***~ 50% / 50%*** ratio).

In [11]:
# create a function that returns percentages of smokers and non-smokers in the region, rounded to 2 decimal places.
def smokers_stats(data_region):
    list_smokers = []
    for row in data_region:
        list_smokers.append(row['smoker'])
    percentage_yes = (list_smokers.count('yes') / len(data_region)) * 100
    percentage_no = (list_smokers.count('no') / len(data_region)) * 100
    return {'yes' : round(percentage_yes, 2), 'no' : round(percentage_no, 2)}

# print the result for each region
print('Northeast:', smokers_stats(insurance_data_northeast))
print('Northwest:', smokers_stats(insurance_data_northwest))
print('Southeast:', smokers_stats(insurance_data_southeast))
print('Southwest:', smokers_stats(insurance_data_southwest))

Northeast: {'yes': 20.68, 'no': 79.32}
Northwest: {'yes': 17.85, 'no': 82.15}
Southeast: {'yes': 25.0, 'no': 75.0}
Southwest: {'yes': 17.85, 'no': 82.15}


Here is the first clue that may explain why charges for Southeast are higher on average: it has a higher share of smokers than any other region (***25%***). Next is Northeast (***~21%*** smokers), which is also the region with the second highest average insurance costs. The share of smokers in the western regions is no more than ***18%***.
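
One way to follow up on this clue would be to compare the mean charges of smokers and non-smokers directly. The sketch below is not part of the original analysis. `mean_charges_by_smoker` is a hypothetical helper, and `sample_rows` holds only the first eight records shown earlier, so the printed numbers illustrate the mechanics rather than a result for the full dataset.

```python
import statistics

# First eight records from the output shown earlier, reduced to two keys.
sample_rows = [
    {'smoker': 'yes', 'charges': 16884.924},
    {'smoker': 'no', 'charges': 1725.5523},
    {'smoker': 'no', 'charges': 4449.462},
    {'smoker': 'no', 'charges': 21984.47061},
    {'smoker': 'no', 'charges': 3866.8552},
    {'smoker': 'no', 'charges': 3756.6216},
    {'smoker': 'no', 'charges': 8240.5896},
    {'smoker': 'no', 'charges': 7281.5056},
]

def mean_charges_by_smoker(rows):
    """Hypothetical helper: mean charges per smoker status, rounded to 2 places."""
    groups = {}
    for row in rows:
        # collect the charges of each status in its own list
        groups.setdefault(row['smoker'], []).append(row['charges'])
    return {status: round(statistics.mean(values), 2)
            for status, values in groups.items()}

print(mean_charges_by_smoker(sample_rows))
```

Applied to `insurance_data` or to one of the regional lists, the same function would show whether smokers are charged more on average.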

Let's analyse the numeric variables: `'age'`, `'children'` and `'bmi'`.

In [12]:
# create a function that returns the percentages of two groups: people under 40 years old and people aged 40 and older, rounded to 2 decimal places.
def ages_stats(data_region):
    list_ages = []
    for row in data_region:
        list_ages.append(row['age'])
    list_ages_recode = []
    for age in list_ages:
        if age < 40:
            list_ages_recode.append('y')
        else:
            list_ages_recode.append('o')
    percentage_younger = (list_ages_recode.count('y') / len(data_region)) * 100
    percentage_older = (list_ages_recode.count('o') / len(data_region)) * 100
    return {'under 40' : round(percentage_younger, 2), '40 or more' : round(percentage_older, 2)}

# print the result for each region
print('Northeast:', ages_stats(insurance_data_northeast))
print('Northwest:', ages_stats(insurance_data_northwest))
print('Southeast:', ages_stats(insurance_data_southeast))
print('Southwest:', ages_stats(insurance_data_southwest))

Northeast: {'under 40': 50.0, '40 or more': 50.0}
Northwest: {'under 40': 50.77, '40 or more': 49.23}
Southeast: {'under 40': 50.55, '40 or more': 49.45}
Southwest: {'under 40': 50.15, '40 or more': 49.85}


Similar to the sex distribution, regions have an equal age distribution in the sense that the proportion of young people (<40 years old) is nearly equal to the proportion of older people (>=40 years old).

In [13]:
# create a function that returns the percentages of two groups: people with children and people without children, rounded to 2 decimal places.
def children_stats(data_region):
    list_children = []
    for row in data_region:
        list_children.append(row['children'])
    list_children_recode = []
    for children in list_children:
        if children > 0:
            list_children_recode.append('yes')
        else:
            list_children_recode.append('no')
    percentage_yes = (list_children_recode.count('yes') / len(data_region)) * 100
    percentage_no = (list_children_recode.count('no') / len(data_region)) * 100
    return {'yes' : round(percentage_yes, 2), 'no' : round(percentage_no, 2)}

# print the result for each region
print('Northeast:', children_stats(insurance_data_northeast))
print('Northwest:', children_stats(insurance_data_northwest))
print('Southeast:', children_stats(insurance_data_southeast))
print('Southwest:', children_stats(insurance_data_southwest))

Northeast: {'yes': 54.63, 'no': 45.37}
Northwest: {'yes': 59.38, 'no': 40.62}
Southeast: {'yes': 56.87, 'no': 43.13}
Southwest: {'yes': 57.54, 'no': 42.46}


Overall, people with children make up slightly more than half (***>50%***) of each of the four regional samples. However, the differences between the regions are small and do not appear to explain the disparity in insurance costs.
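
The percentage functions for sex, smoking, age and children all follow the same pattern: assign each row to a group, count the groups, and divide by the sample size. A generic helper could capture this pattern once. The sketch below assumes rows are dictionaries as in `insurance_data`. `share_stats`, `label_for` and the four sample rows are hypothetical and used only for illustration.

```python
def share_stats(data_region, label_for):
    """Return the percentage of rows per label, rounded to 2 decimal places.

    label_for is a function that maps one row to a group label.
    """
    counts = {}
    for row in data_region:
        label = label_for(row)
        counts[label] = counts.get(label, 0) + 1
    return {label: round(count / len(data_region) * 100, 2)
            for label, count in counts.items()}

# Made-up rows, for illustration only.
sample_rows = [
    {'sex': 'female', 'age': 19, 'children': 0},
    {'sex': 'male', 'age': 18, 'children': 1},
    {'sex': 'male', 'age': 28, 'children': 3},
    {'sex': 'male', 'age': 33, 'children': 0},
]

print(share_stats(sample_rows, lambda row: row['sex']))
print(share_stats(sample_rows, lambda row: 'under 40' if row['age'] < 40 else '40 or more'))
print(share_stats(sample_rows, lambda row: 'yes' if row['children'] > 0 else 'no'))
```

Unlike the dedicated functions, this helper omits a label that has no rows instead of reporting it as 0.0.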

In [14]:
# create a function that returns the percentages of four groups: 'underweight', 'healthy weight', 'overweight' and 'obese', rounded to 2 decimal places.
def bmis_stats(data_region):
    list_bmis = []
    for row in data_region:
        list_bmis.append(row['bmi'])
    list_bmis_recode = []
    for bmi in list_bmis:
        if bmi < 18.5:
            list_bmis_recode.append('underweight')
        elif 18.5 <= bmi < 25:
            list_bmis_recode.append('healthy weight')
        elif 25 <= bmi < 30:
            list_bmis_recode.append('overweight')
        elif bmi >= 30:
            list_bmis_recode.append('obese')
    percentage_underweight = (list_bmis_recode.count('underweight') / len(data_region)) * 100
    percentage_healthy = (list_bmis_recode.count('healthy weight') / len(data_region)) * 100
    percentage_overweight = (list_bmis_recode.count('overweight') / len(data_region)) * 100
    percentage_obese = (list_bmis_recode.count('obese') / len(data_region)) * 100
    return {'underweight' : round(percentage_underweight, 2), 'healthy weight' : round(percentage_healthy, 2), 'overweight' : round(percentage_overweight, 2), 'obese' : round(percentage_obese, 2)}

# print the result for each region
print('Northeast:', bmis_stats(insurance_data_northeast))
print('Northwest:', bmis_stats(insurance_data_northwest))
print('Southeast:', bmis_stats(insurance_data_southeast))
print('Southwest:', bmis_stats(insurance_data_southwest))

Northeast: {'underweight': 3.09, 'healthy weight': 22.53, 'overweight': 30.25, 'obese': 44.14}
Northwest: {'underweight': 2.15, 'healthy weight': 19.38, 'overweight': 32.92, 'obese': 45.54}
Southeast: {'underweight': 0.0, 'healthy weight': 11.26, 'overweight': 21.98, 'obese': 66.76}
Southwest: {'underweight': 0.92, 'healthy weight': 14.77, 'overweight': 31.08, 'obese': 53.23}


Based on the categorization of the [Centers for Disease Control and Prevention](https://www.cdc.gov/healthyweight/assessing/index.html#:~:text=If%20your%20BMI%20is%20less,falls%20within%20the%20obese%20range), a large majority of insured persons in all regions are overweight or obese (***>74%*** in total). Southeast stands out again: ***~67%*** of people in this region are classified as obese, and overall ***~89%*** are above the healthy weight range. At the same time, Southeast also has the lowest proportion of people in the healthy weight range on the BMI scale: ***~11%***. Together with the previous findings, this may help to explain why the insurance costs in Southeast are higher on average.

All variables available in the csv file have been analyzed. Let's save the collected statistics for each region in a dictionary named `region_stats`.

In [15]:
# create a function that collects the statistics for each region
def combine_stats(data_region):
    return {'age' : ages_stats(data_region), 'sex' : sexes_stats(data_region), 'bmi' : bmis_stats(data_region), 'children' : children_stats(data_region), 'smoker' : smokers_stats(data_region), 'charges' : charges_stats(data_region)}


# create a dictionary with all statistics collected
region_stats = {'northeast' : combine_stats(insurance_data_northeast), 'northwest' : combine_stats(insurance_data_northwest), 'southeast' : combine_stats(insurance_data_southeast), 'southwest' : combine_stats(insurance_data_southwest)}

# print the result
print(region_stats)

{'northeast': {'age': {'under 40': 50.0, '40 or more': 50.0}, 'sex': {'female': 49.69, 'male': 50.31}, 'bmi': {'underweight': 3.09, 'healthy weight': 22.53, 'overweight': 30.25, 'obese': 44.14}, 'children': {'yes': 54.63, 'no': 45.37}, 'smoker': {'yes': 20.68, 'no': 79.32}, 'charges': {'highest': 58571.07, 'lowest': 1694.8, 'mean': 13406.38, 'stdev': 11255.803}}, 'northwest': {'age': {'under 40': 50.77, '40 or more': 49.23}, 'sex': {'female': 50.46, 'male': 49.54}, 'bmi': {'underweight': 2.15, 'healthy weight': 19.38, 'overweight': 32.92, 'obese': 45.54}, 'children': {'yes': 59.38, 'no': 40.62}, 'smoker': {'yes': 17.85, 'no': 82.15}, 'charges': {'highest': 60021.4, 'lowest': 1621.34, 'mean': 12417.58, 'stdev': 11072.277}}, 'southeast': {'age': {'under 40': 50.55, '40 or more': 49.45}, 'sex': {'female': 48.08, 'male': 51.92}, 'bmi': {'underweight': 0.0, 'healthy weight': 11.26, 'overweight': 21.98, 'obese': 66.76}, 'children': {'yes': 56.87, 'no': 43.13}, 'smoker': {'yes': 25.0, 'no': 7

## 4. Conclusion

The analysis has been successfully completed. The four regions available in the data set were studied: ***Northeast***, ***Northwest***, ***Southeast*** and ***Southwest***. The number of observations for each of them is the following: ***324*** for Northeast, ***325*** for Northwest, ***325*** for Southwest and ***364*** for Southeast. Each region has a ***~ 50% / 50%*** ratio of men to women and a ***~ 50% / 50%*** ratio of younger (under 40 y.o.) to older (40 y.o. or more) people. More than half of the insured persons in each region have children: ***~55%-59%***.

Among the four regions available for analysis, Southeast has the highest average value of medical insurance costs. Its data spread is also the largest: the standard deviation is the highest, and the minimum and maximum values are further apart than in any other region. The higher charges in Southeast may be partly explained by the fact that this region has the highest percentage of smokers (***25%***), the highest percentage of overweight people in general (***~89%***) and of obese people in particular (***~67%***).
However, it is not known which factors determine the medical insurance costs, nor how much each factor contributes: it is likely that there are also other determinants that are not present in the dataset. Moreover, little is known about the origin of the data provided for the analysis, as the overly generic names of the regions suggest.

The collected statistics were organized into a dictionary, which is available for possible projects in the future.
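
To make the dictionary reusable outside this notebook, it could be serialized with the standard `json` module. The sketch below only demonstrates a round trip on a small excerpt of the statistics collected above. `stats_excerpt`, `serialized` and `restored` are names assumed for this illustration.

```python
import json

# Excerpt of the collected statistics, copied from the outputs above.
stats_excerpt = {
    'northeast': {'smoker': {'yes': 20.68, 'no': 79.32}},
    'southeast': {'smoker': {'yes': 25.0, 'no': 75.0}},
}

# Serialize to a JSON string; json.dump(obj, file) would write to a file instead.
serialized = json.dumps(stats_excerpt, indent=2)

# Load the string back and confirm that nothing was lost.
restored = json.loads(serialized)
print(restored == stats_excerpt)
```

JSON suits this structure because it contains only dictionaries, strings and numbers.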