# U.S. Medical Insurance Costs

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal with this project will be to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

**insurance.csv** contains the following columns:
* age: Patient Age
* sex: Patient Sex 
* bmi: Patient BMI
* children: Patient Number of Children
* smoker: Patient Smoking Status
* region: Patient U.S. Geographical Region
* charges: Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, each row of **insurance.csv** will be read into a dictionary, and the dictionaries will be collected in a single list.

We will look into the following aspects:
- Create a list that contains the dictionary of patient information
- Average age of patients
- Average bmi of patients
- Average age of the patients in each region
- Average bmi of the patients in each region
- Change in average age and bmi for each region when sex is taken into account
- Do some regions have more smokers? In those regions, what proportion of males and of females smoke?
- Average yearly medical charges
- Average yearly medical charges in each region
- Average yearly medical charges for each region if we take sex into account

First, we import the `csv` module.

In [215]:
# Import the csv module
import csv

Using `DictReader` from the `csv` module, we will read the data into a list that holds one dictionary of patient information per row (i.e. age, sex, bmi, children, smoker, region, charges).

In [217]:
# Open the insurance CSV file
with open("insurance.csv", newline="") as ins_file:
    # Read each row as a dictionary keyed by the column headers
    Ins_read = csv.DictReader(ins_file, delimiter=",")
    # Create the list of patient info
    Patient_Info = []
    for record in Ins_read:
        Patient_Info.append(record)
# Show the first 5 list items
Patient_Info[:5]

[{'age': '19',
  'sex': 'female',
  'bmi': '27.9',
  'children': '0',
  'smoker': 'yes',
  'region': 'southwest',
  'charges': '16884.924'},
 {'age': '18',
  'sex': 'male',
  'bmi': '33.77',
  'children': '1',
  'smoker': 'no',
  'region': 'southeast',
  'charges': '1725.5523'},
 {'age': '28',
  'sex': 'male',
  'bmi': '33',
  'children': '3',
  'smoker': 'no',
  'region': 'southeast',
  'charges': '4449.462'},
 {'age': '33',
  'sex': 'male',
  'bmi': '22.705',
  'children': '0',
  'smoker': 'no',
  'region': 'northwest',
  'charges': '21984.47061'},
 {'age': '32',
  'sex': 'male',
  'bmi': '28.88',
  'children': '0',
  'smoker': 'no',
  'region': 'northwest',
  'charges': '3866.8552'}]
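
The earlier statement that there are no signs of missing data can be checked directly on a list of records shaped like `Patient_Info`. The sketch below is one possible check, not part of the original analysis: `count_missing` and `sample_records` are hypothetical names, and the second sample row has its bmi blanked on purpose to show what a missing value would look like.

```python
# Hypothetical sample rows standing in for Patient_Info.
# The bmi of the second row is deliberately blank for illustration.
sample_records = [
    {"age": "19", "sex": "female", "bmi": "27.9", "children": "0",
     "smoker": "yes", "region": "southwest", "charges": "16884.924"},
    {"age": "18", "sex": "male", "bmi": "", "children": "1",
     "smoker": "no", "region": "southeast", "charges": "1725.5523"},
]

def count_missing(records):
    """Return a dict mapping each column name to its number of blank or absent values."""
    missing = {}
    for record in records:
        for column, value in record.items():
            # DictReader stores None for fields that are absent from a short row
            if value is None or value.strip() == "":
                missing[column] = missing.get(column, 0) + 1
            else:
                missing.setdefault(column, 0)
    return missing

missing_counts = count_missing(sample_records)
print(missing_counts)
```

Calling `count_missing(Patient_Info)` on the real data should report zero for every column if the file is complete.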

We will define a function called `find_average()` that will calculate and print the average of any numeric column.

In [219]:
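def find_average(column):
    # Sum the column over all patients, converting each string value to a float
    total_value = 0
    for record in Patient_Info:
        total_value += float(record[column])
    # Print the mean rounded to two decimal places
    print("Average " + column + " is " + str(round(total_value / len(Patient_Info), 2)))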
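
Because `find_average()` only prints its result, the value cannot be reused in later calculations. As a possible variation, the sketch below returns the mean instead and leaves the printing to the caller. The name `column_average` and the three sample rows are hypothetical; `statistics.mean` is from the standard library.

```python
from statistics import mean

def column_average(records, column):
    """Return the mean of a numeric column whose values are stored as strings."""
    return mean(float(record[column]) for record in records)

# Hypothetical sample rows standing in for Patient_Info
sample_records = [
    {"age": "19", "bmi": "27.9"},
    {"age": "18", "bmi": "33.77"},
    {"age": "28", "bmi": "33"},
]

avg_age = column_average(sample_records, "age")
print("Average age is " + str(round(avg_age, 2)))
```

Passing the records in as an argument also lets the same function run on a filtered subset of patients.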

We will define a function called `find_avg_per_region()` that will find the average of the requested column for each region.

In [221]:
def find_avg_per_region(column):
    Avg_per_region = {}
    for record in Patient_Info:
        region = record['region']
        if region in Avg_per_region:
            Avg_per_region[region]['Total']+=float(record[column])
            Avg_per_region[region]['Count']+=1
        else:
            Avg_per_region[region]={}
            Avg_per_region[region]['Total']=float(record[column])
            Avg_per_region[region]['Count']=1
    for region in Avg_per_region:
        print("Average " + column + " in the " + region + " region is " + 
              str(round(Avg_per_region[region]['Total']/Avg_per_region[region]['Count'],2)))

We will define a function called `find_avg_per_region_per_sex()` that will find the average of the requested column for each region and sex.

In [223]:
def find_avg_per_region_per_sex(column):
    Avg_per_region_per_sex = {}
    for record in Patient_Info:
        region = record['region']
        sex = record['sex']
        if region in Avg_per_region_per_sex:
            if sex in Avg_per_region_per_sex[region]:             
                Avg_per_region_per_sex[region][sex]['Total']+=float(record[column])
                Avg_per_region_per_sex[region][sex]['Count']+=1
            else:
                Avg_per_region_per_sex[region][sex]={}
                Avg_per_region_per_sex[region][sex]['Total']=float(record[column])
                Avg_per_region_per_sex[region][sex]['Count']=1
        else:
            Avg_per_region_per_sex[region]={}
            Avg_per_region_per_sex[region][sex]={}
            Avg_per_region_per_sex[region][sex]['Total']=float(record[column])
            Avg_per_region_per_sex[region][sex]['Count']=1
    for region in Avg_per_region_per_sex:
        for sex in Avg_per_region_per_sex[region]:
            print("Average " + column + " of " + sex + " in the " + region + " region is " + 
                  str(round(Avg_per_region_per_sex[region][sex]['Total']/Avg_per_region_per_sex[region][sex]['Count'],2))) 
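
The two grouping functions above repeat the same total-and-count bookkeeping at different nesting depths. One way to express both with a single helper is to group on a tuple of column values. This is a sketch under that assumption, not the approach used in the rest of this notebook; `average_by` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by(records, column, keys):
    """Return {group: mean of column}, where group is a tuple of the values of `keys`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += float(record[column])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical sample rows standing in for Patient_Info
sample = [
    {"sex": "female", "region": "southwest", "bmi": "28.0"},
    {"sex": "male", "region": "southwest", "bmi": "32.0"},
    {"sex": "male", "region": "southeast", "bmi": "34.0"},
]

by_region = average_by(sample, "bmi", ["region"])
by_region_sex = average_by(sample, "bmi", ["region", "sex"])
for group, value in by_region_sex.items():
    print(group, round(value, 2))
```

With this shape, grouping by region, by sex, or by both only changes the `keys` argument.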
 

We will define a function called `smokers_in_region()` that will determine the proportion of smokers in each region.

In [225]:
def smokers_in_region():
    smokers_per_region = {}
    for record in Patient_Info:
        region = record['region']
        if region in smokers_per_region:
            smokers_per_region[region]['Total']+=1
            if record['smoker'] == 'yes':
                smokers_per_region[region]['Count of Smokers']+=1
        else:
            smokers_per_region[region]={}
            smokers_per_region[region]['Total']=1
            if record['smoker'] == 'yes':
                smokers_per_region[region]['Count of Smokers']=1
            else:
                smokers_per_region[region]['Count of Smokers']=0
    for region in smokers_per_region:
        print("Proportion of smokers in the " + region + " region is " + 
              str(round(smokers_per_region[region]['Count of Smokers']/smokers_per_region[region]['Total'],2)))

We will define a function called `smokers_in_region_per_sex()` that will determine the proportion of smokers for each sex in each region.

In [227]:
def smokers_in_region_per_sex():
    smokers_per_region_per_sex = {}
    for record in Patient_Info:
        region = record['region']
        sex = record['sex']
        if region in smokers_per_region_per_sex:
            if sex in smokers_per_region_per_sex[region]:             
                smokers_per_region_per_sex[region][sex]['Total']+=1
                if record['smoker'] == 'yes':
                    smokers_per_region_per_sex[region][sex]['Count of Smokers']+=1
            else:
                smokers_per_region_per_sex[region][sex]={}
                smokers_per_region_per_sex[region][sex]['Total']=1
                if record['smoker'] == 'yes':
                    smokers_per_region_per_sex[region][sex]['Count of Smokers']=1
                else:
                    smokers_per_region_per_sex[region][sex]['Count of Smokers']=0
        else:
            smokers_per_region_per_sex[region]={}
            smokers_per_region_per_sex[region][sex]={}
            smokers_per_region_per_sex[region][sex]['Total']=1
            if record['smoker'] == 'yes':
                smokers_per_region_per_sex[region][sex]['Count of Smokers']=1
            else:
                smokers_per_region_per_sex[region][sex]['Count of Smokers']=0
    for region in smokers_per_region_per_sex:
        for sex in smokers_per_region_per_sex[region]:
            print("Proportion of " + sex + " smokers in the " + region + " region is " + 
              str(round(smokers_per_region_per_sex[region][sex]['Count of Smokers']/smokers_per_region_per_sex[region][sex]['Total'],2)))
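
Note that `smokers_in_region_per_sex()` reports the share of each sex that smokes (for example, smokers among all females in a region). A related but different quantity is the share of a region's smokers who are male or female. The sketch below computes that second quantity so the two are not confused; `smoker_sex_split` and the sample rows are hypothetical.

```python
from collections import Counter

def smoker_sex_split(records, region):
    """Return {sex: share of the region's smokers who have that sex}."""
    counts = Counter(
        record["sex"]
        for record in records
        if record["region"] == region and record["smoker"] == "yes"
    )
    total = sum(counts.values())
    if total == 0:
        # No smokers in this region, so there is nothing to split
        return {}
    return {sex: count / total for sex, count in counts.items()}

# Hypothetical sample rows standing in for Patient_Info
sample = [
    {"sex": "male", "region": "southeast", "smoker": "yes"},
    {"sex": "male", "region": "southeast", "smoker": "yes"},
    {"sex": "male", "region": "southeast", "smoker": "yes"},
    {"sex": "female", "region": "southeast", "smoker": "yes"},
    {"sex": "female", "region": "southeast", "smoker": "no"},
    {"sex": "male", "region": "northwest", "smoker": "no"},
]

split = smoker_sex_split(sample, "southeast")
print(split)
```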

Let's perform the analysis on the data. We will find the average age and bmi.

In [229]:
find_average('age')

Average age is 39.21


In [230]:
find_average('bmi')

Average bmi is 30.66


The average age of the patients in **insurance.csv** is about 39 years, and the average bmi is about 30.7.

Next, we will find the average age and bmi per geographical region. It is important to note that all the patients come from the United States.

In [232]:
find_avg_per_region('age')

Average age in the southwest region is 39.46
Average age in the southeast region is 38.94
Average age in the northwest region is 39.2
Average age in the northeast region is 39.27


In [233]:
find_avg_per_region('bmi')

Average bmi in the southwest region is 30.6
Average bmi in the southeast region is 33.36
Average bmi in the northwest region is 29.2
Average bmi in the northeast region is 29.17


The geographical regions are classified as southwest, southeast, northwest and northeast. What defines each region is unknown.

The average age in each region is approximately 39 years, which is similar to the population average.

The average bmi in the southeast is 33.4, which is higher than the population average of 30.7. The other regions have an average bmi closer to the population average. The overall average is probably pulled upward by the patients in the southeast region.
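
One way to test whether a single region is pulling the overall average up is to recompute the average with that region left out and compare the two numbers. The sketch below shows the idea on toy rows that are not taken from **insurance.csv**; `column_mean` and its `exclude_region` parameter are hypothetical.

```python
def column_mean(records, column, exclude_region=None):
    """Return the mean of a column, optionally leaving one region out."""
    values = [
        float(record[column])
        for record in records
        if record["region"] != exclude_region
    ]
    return sum(values) / len(values)

# Hypothetical toy rows, not taken from insurance.csv
sample = [
    {"region": "southwest", "bmi": "30.0"},
    {"region": "northwest", "bmi": "28.0"},
    {"region": "southeast", "bmi": "36.0"},
    {"region": "southeast", "bmi": "34.0"},
]

overall = column_mean(sample, "bmi")
without_southeast = column_mean(sample, "bmi", exclude_region="southeast")
print(round(overall, 2), round(without_southeast, 2))
```

Running `column_mean(Patient_Info, 'bmi', exclude_region='southeast')` on the real data would show how far the overall average drops without the southeast patients.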

Next, we will find the average age and bmi for each geographical region for each sex.

In [235]:
find_avg_per_region_per_sex('age')

Average age of female in the southwest region is 39.7
Average age of male in the southwest region is 39.21
Average age of male in the southeast region is 38.78
Average age of female in the southeast region is 39.11
Average age of male in the northwest region is 38.8
Average age of female in the northwest region is 39.59
Average age of male in the northeast region is 38.9
Average age of female in the northeast region is 39.64


In [236]:
find_avg_per_region_per_sex('bmi')

Average bmi of female in the southwest region is 30.06
Average bmi of male in the southwest region is 31.13
Average bmi of male in the southeast region is 33.99
Average bmi of female in the southeast region is 32.67
Average bmi of male in the northwest region is 29.12
Average bmi of female in the northwest region is 29.28
Average bmi of male in the northeast region is 29.02
Average bmi of female in the northeast region is 29.32


The dataset records only two values for sex: male and female.

The average age in each region for both sexes is within one year of the population average.

The average bmi for males and females in each region differs by less than 1.5 points. In the southeast region, where the average bmi is nearly 3 points higher than the population average, the average bmi for males is higher than for females.

Next, we will find the proportion of smokers in each geographical region.

In [238]:
smokers_in_region()

Proportion of smokers in the southwest region is 0.18
Proportion of smokers in the southeast region is 0.25
Proportion of smokers in the northwest region is 0.18
Proportion of smokers in the northeast region is 0.21


25% of the patients in southeast region are smokers. 21% of the patients in northeast region are smokers. In the other regions, 18% of the patients are smokers.

In [240]:
smokers_in_region_per_sex()

Proportion of female smokers in the southwest region is 0.13
Proportion of male smokers in the southwest region is 0.23
Proportion of male smokers in the southeast region is 0.29
Proportion of female smokers in the southeast region is 0.21
Proportion of male smokers in the northwest region is 0.18
Proportion of female smokers in the northwest region is 0.18
Proportion of male smokers in the northeast region is 0.23
Proportion of female smokers in the northeast region is 0.18


In most regions, a higher proportion of males than females smoke. The northwest is the exception: males and females smoke at the same rate (18%), and the proportion of male smokers there is lower than in any other region.

In the southwest region, 13% of females are smokers and 23% of males are smokers. The proportion of female smokers is much smaller than that of male smokers, and it is the lowest of any region.
In the southeast region, 21% of females are smokers and 29% of males are smokers. Both proportions are the highest of any region.
In the northeast region, 18% of females are smokers and 23% of males are smokers.

Let us now analyze the insurance costs of the patients.

In [242]:
find_average('charges')

Average charges is 13270.42


The average charge across all patients is $13270.42. Now, let's take a look at the average charges for each geographical region.

Because the southeast has the highest average bmi and the highest proportion of smokers, we might expect its average insurance cost to be high as well.

In [244]:
find_avg_per_region('charges')

Average charges in the southwest region is 12346.94
Average charges in the southeast region is 14735.41
Average charges in the northwest region is 12417.58
Average charges in the northeast region is 13406.38


As expected, the average charge in the southeast region is \$14735.41, which is more than \$1000 higher than the population average. Average charges for the southwest and northwest are between \$12000 and \$12500. The average charge for the northeast is approximately \$1000 more than the average charges of the southwest and northwest.

Let's take a look at the yearly cost for each sex in each region.

In [246]:
find_avg_per_region_per_sex('charges')

Average charges of female in the southwest region is 11274.41
Average charges of male in the southwest region is 13412.88
Average charges of male in the southeast region is 15879.62
Average charges of female in the southeast region is 13499.67
Average charges of male in the northwest region is 12354.12
Average charges of female in the northwest region is 12479.87
Average charges of male in the northeast region is 13854.01
Average charges of female in the northeast region is 12953.2


The average yearly cost appears to be lower in regions with fewer smokers than in regions with more smokers, and the differences between the sexes follow the same pattern as the proportion of smokers.

For instance, in the southwest, where the proportion of female smokers is 0.13 and the proportion of male smokers is 0.23, the average cost for females is about $2000 less than the average cost for males. In the northwest, where the proportions of male and female smokers are equal, the average cost is approximately the same for both sexes.

It would be worth exploring further whether smoking has a significant impact on the insurance cost.
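
A first step in that direction could be to compare average charges for smokers and non-smokers directly, rather than inferring the effect from regional averages. The sketch below assumes the `smoker` column only contains `yes` and `no`, as in the rows shown earlier. `average_charges_by_smoker` is a hypothetical name, and the charge values are made up for illustration, so they say nothing about the real data.

```python
from statistics import mean

def average_charges_by_smoker(records):
    """Return {smoker status: mean charges}, skipping any status with no records."""
    groups = {"yes": [], "no": []}
    for record in records:
        groups[record["smoker"]].append(float(record["charges"]))
    return {status: mean(values) for status, values in groups.items() if values}

# Hypothetical rows with made-up charges, not taken from insurance.csv
sample = [
    {"smoker": "yes", "charges": "30000"},
    {"smoker": "yes", "charges": "34000"},
    {"smoker": "no", "charges": "8000"},
    {"smoker": "no", "charges": "10000"},
]

by_smoker = average_charges_by_smoker(sample)
print(by_smoker)
```

A difference in these two averages would still not prove that smoking causes higher charges, since age and bmi may differ between the groups, but it would indicate whether a closer look is warranted.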