# Portfolio Project 1: Analysis of Medical Insurance Costs

# About the Project

The following project was completed by Anna McCowan using patient data stored in `insurance.csv`.



## Project Objectives:
- Analyze the `insurance.csv` dataset by building out Python functions 
- Use Python libraries to assist in analysis 
- Document and organize your findings for easy reporting  


## Questions Answered:
- What are the demographics of the patients in this dataset?
- What are the factors that lead to higher medical costs?

---
---

# About the data:
- Source: [Medical Cost Personal Datasets](https://www.kaggle.com/datasets/mirichoi0218/insurance)
- File name: insurance.csv
- Number of rows: 1300
- Number of fields: 7
    - **age**: age of primary beneficiary
    - **sex**: insurance contractor gender, female, male
    - **bmi**: body mass index, providing an understanding of body
    - **children**: number of children covered by health insurance
    - **smoker**: whether or not the primary beneficiary smokes
    - **region**: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
    - **charges**: individual medical costs billed by health insurance

---
---
# Diving into the data:
`insurance.csv` can be read by first importing `import csv`. 

The file is opened using `with open` and then `csv.DictReader` to iterate over the file. This produces a dictionary for each row of data, where the keys are the names of the columns, and the values are the data from the row being read.

For `insurance.csv`, the keys: `age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`.

In [1]:
# import csv library
import csv

# open the file as a list of dictionaries
with open ('insurance.csv') as insurance:
    insurance_reader = csv.DictReader(insurance)
    insurance_list = list(insurance_reader)

I have assinged each dictionary to be an item in a list called `insurance_list` here:
```python
insurance_list = list(insurance_reader)
```

To see what this looks like, I have printed out the first 5 items in `insurance_list` below. This will be the main list used in my analysis.


In [2]:
print(insurance_list[:4], "...")    # looking at the first 5 items in the dataset 

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}] ...


## Using a Pandas DataFrame to View the Data
While it's nice to see the data in a list/dictionary format, it's much easier to read as a DataFrame.

Below, I've imported the Pandas library and created a DataFrame `insurance_dataframe` to view the file as a table.   

In [3]:
# make the data easier to read by using pandas dataframe

import pandas as pd
insurance_dataframe = pd.DataFrame.from_records(insurance_list)
print(insurance_dataframe.head())

  age     sex     bmi children smoker     region      charges
0  19  female    27.9        0    yes  southwest    16884.924
1  18    male   33.77        1     no  southeast    1725.5523
2  28    male      33        3     no  southeast     4449.462
3  33    male  22.705        0     no  northwest  21984.47061
4  32    male   28.88        0     no  northwest    3866.8552


In [4]:
print(insurance_dataframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   age       1338 non-null   object
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   object
 3   children  1338 non-null   object
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   object
dtypes: object(7)
memory usage: 73.3+ KB
None


---
---
# Data Clean-Up
## Create unique lists for each column, assign a variable

To analyze and manipulate each column of data, I've created a unique list for each field and assigned it to a variable, for example `age_list`.

I've printed what `age_list` looks like below.

In [5]:
# created a list for each unique field in the dataset
age_list = [x['age'] for x in insurance_list]
sex_list = [x['sex'] for x in insurance_list]
bmi_list = [x['bmi'] for x in insurance_list]
children_list = [x['children'] for x in insurance_list]
smoker_list = [x['smoker'] for x in insurance_list]
region_list = [x['region'] for x in insurance_list]
charges_list = [x['charges'] for x in insurance_list]

# example list
print("age_list =", age_list[:12], "...")

age_list = ['19', '18', '28', '33', '32', '31', '46', '37', '37', '60', '25', '62'] ...


## Numeric values stored as strings
Upon evaluation, the numeric values such as `age` and `bmi` are stored as strings. Below I've created new lists such as `age_list_int` to convert the values to integers (or floats!) rather than strings.

<font color = gray>*Note: I intentionally created a new list with suffix "_int" rather than override the existing list in case I need to look at these values as strings later on.*</font>

In [6]:
age_list_int = [eval(i) for i in age_list]
bmi_list_foat = [float(i) for i in bmi_list]
children_list_int = [eval(i) for i in children_list]
charges_list_foat = [float(i) for i in charges_list]

---
---

# Begin Analysis
## Patient Demographics

The function `print_patient_demographics()` prints an overview of the data values such as:
- total number of patients
- total men vs. total women
- average age of patients


In [7]:
total_patients = len(insurance_list)  # total num of patients is the length insurance_list
total_sex = lambda sex: sex_list.count(sex)  # lambda function returns num of times `input` occurs in sex_list, `input` is either "male" or "female" 
avg_age_patients = round(sum(age_list_int)/total_patients)   # average = sum / total
percent_sex = lambda sex: round(total_sex(sex)/total_patients*100)  # lambda function returns percent of `input`

def print_patient_demographics():
    # print this function to print pre-defined patient demographics
    return """
    - There are {num} patients in the dataset.
    - There are {f} females ({pf}%) and {m} males ({pm}%).
    - The average age of the patients is {age} years old.
    """.format(num = total_patients, f = total_sex("female"), m = total_sex("male"), pf = percent_sex("female"), pm = percent_sex("male"), age = avg_age_patients)

print(print_patient_demographics())


    - There are 1338 patients in the dataset.
    - There are 662 females (49%) and 676 males (51%).
    - The average age of the patients is 39 years old.
    


---
---
# Futher Analysis of Age
## Range: Oldest Patient and Youngest Patient

In [8]:
print("The oldest patient is", max(age_list_int), "years old.")
print("The youngest patient is", min(age_list_int), "years old.")

The oldest patient is 64 years old.
The youngest patient is 18 years old.


## Average Age of Men vs. Women
Below are two methods of finding the average age of men and women.  

#### <font color = gray> Method 1

In [9]:
def find_avg_age(sex):
    """
    input is string "male" or "female"
    """
    sum_age = 0
    if type(sex) is not str or sex_list.count(sex) == 0:
        return "Please enter either 'male' or 'female'."
    else:
        for each_patient in insurance_list:
            if each_patient["sex"] == sex:
                sum_age += float(each_patient["age"])
        return round(sum_age / sex_list.count(sex))

avg_male_age = find_avg_age("male")
avg_female_age = find_avg_age("female")

print("The average age of the males is {avg_male} years old.".format(avg_male = avg_male_age))
print("The average age of the females is {avg_female} years old.".format(avg_female = avg_female_age))

The average age of the males is 39 years old.
The average age of the females is 40 years old.


#### <font color = gray>  Method 2

In [10]:
# alternative way to find average age of each sex

male_age_list = [eval(x["age"]) for x in insurance_list if x["sex"] == "male"]

avg_male_age = round(sum(male_age_list) / len(male_age_list))

print(avg_male_age) # prints 39

39


# 🌍 Regional Analysis
## How many patients are from each region?

In [11]:
unique_regions = []
for region in region_list:
    if region not in unique_regions:
        unique_regions.append(region)     # created a list of each unique region
        
num_regions = len(unique_regions)   # number of regions is the length of the list

print("There are {num} unique regions in this dataset. They are: ".format(num = num_regions))
print(unique_regions)

There are 4 unique regions in this dataset. They are: 
['southwest', 'southeast', 'northwest', 'northeast']


In [12]:
def count_region(region):
    sum_regions = 0
    for each_region in region_list:
        if each_region == region:
            sum_regions += 1
    return sum_regions

region_dict = {}
for each_region in unique_regions:
    region_dict[each_region] = count_region(each_region)

print(region_dict)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


<font color= gray>*Observation: Southeast has more patients than the other three regions.*

### The Cost of Smoking

In [13]:
sum_charges = 0
for charge in charges_list:
    sum_charges += float(charge)
avg_charges = round(sum_charges / len(charges_list), 2)
print("Average cost of medicine: $", avg_charges)

Average cost of medicine: $ 13270.42


In [14]:
smoker_list_yes = [x for x in insurance_list if x["smoker"] == "yes"]
smoker_list_no = [x for x in insurance_list if x["smoker"] == "no"]

total_smokers = len(smoker_list_yes)
percent_smokers = round(total_smokers / total_patients * 100)

print("{num} of the patients are smokers ({p}%).".format(num = total_smokers, p = percent_smokers))

274 of the patients are smokers (20%).


In [15]:
def find_avg_smoker_cost(answer):
    total_smokers_or_nonsmokers = smoker_list.count(answer)
    sum_smoker_cost = 0
    for each_patient in insurance_list:
        if each_patient["smoker"] == answer:
            sum_smoker_cost += float(each_patient["charges"])
    return round(sum_smoker_cost / total_smokers_or_nonsmokers, 2)


avg_smoker_cost = find_avg_smoker_cost("yes")
avg_nonsmoker_cost = find_avg_smoker_cost("no")
diff_smoker_cost = avg_smoker_cost - avg_nonsmoker_cost

print("The average medical cost for smokers is $", avg_smoker_cost)
print("The average medical cost for non-smokers is $", avg_nonsmoker_cost)
print("The annual difference is $", diff_smoker_cost)

The average medical cost for smokers is $ 32050.23
The average medical cost for non-smokers is $ 8434.27
The annual difference is $ 23615.96


### More Cost Analysis

In [16]:
def find_avg_sex_cost(sex):
    total_sex = sex_list.count(sex)
    sum_sex_cost = 0
    for each_patient in insurance_list:
        if each_patient["sex"] == sex:
            sum_sex_cost += float(each_patient["charges"])
    return round(sum_sex_cost / total_sex, 2)

avg_male_cost = find_avg_sex_cost("male")
avg_female_cost = find_avg_sex_cost("female")

print("The average medical cost for males is ${avg_male_cost}".format(avg_male_cost = avg_male_cost))
print("The average medical cost for females is ${avg_female_cost}".format(avg_female_cost = avg_female_cost))

The average medical cost for males is $13956.75
The average medical cost for females is $12569.58


In [17]:
def find_avg_region_cost(region):
    sum_region_cost = 0
    for each_patient in insurance_list:
        if each_patient["region"] == region:
            sum_region_cost += float(each_patient["charges"])
    return round(sum_region_cost / count_region(region), 2)

avg_sw_cost = find_avg_region_cost("southwest")
avg_nw_cost = find_avg_region_cost("northwest")
avg_se_cost = find_avg_region_cost("southeast")
avg_ne_cost = find_avg_region_cost("northeast")

print("Southwest: ${cost}".format(cost = avg_sw_cost))
print("Northwest: ${cost}".format(cost = avg_nw_cost))
print("Southeast: ${cost}".format(cost = avg_se_cost))
print("Northeast: ${cost}".format(cost = avg_ne_cost))

Southwest: $12346.94
Northwest: $12417.58
Southeast: $14735.41
Northeast: $13406.38


### Parental Findings

In [18]:
def how_many_parents():
    sum_parents = 0
    for each_patient in insurance_list:
        if each_patient["children"] != "0":
            sum_parents += 1
    return sum_parents

print(how_many_parents(), "patients have one or more child.") 

764 patients have one or more child.


#### Added a yes / no field called "parent"

In [19]:
def are_you_a_parent():
    for each_patient in insurance_list:
        if each_patient["children"] != "0":
            each_patient["parent"] = "yes"
        else:
            each_patient["parent"] = "no"

are_you_a_parent()
parent_list = [x['parent'] for x in insurance_list] 
# print(insurance_list[:4]) # checking my work :)

In [20]:
def avg_cost_of_children(num_children):
    sum_cost = 0
    if children_list.count(str(num_children)) == 0:
        return "There are no patients with that number of children in this dataset."
    else:
        for each_patient in insurance_list:
            if each_patient["children"] == str(num_children):
                sum_cost += float(each_patient["charges"])
        return round(sum_cost / children_list.count(str(num_children)), 2)

In [21]:
# child_cost_dict = {}
# for num in range(6):
#     child_cost_dict[num] = avg_cost_of_children(num)
# print(child_cost_dict)

In [22]:
for num in range(6):
    print("The average medical cost of individuals with {num} children is ${cost}".format(num = num, cost = avg_cost_of_children(num)))

The average medical cost of individuals with 0 children is $12365.98
The average medical cost of individuals with 1 children is $12731.17
The average medical cost of individuals with 2 children is $15073.56
The average medical cost of individuals with 3 children is $15355.32
The average medical cost of individuals with 4 children is $13850.66
The average medical cost of individuals with 5 children is $8786.04


In [23]:
def find_avg_parent_cost(answer):
    sum_cost = 0
    for each_patient in insurance_list:
        if each_patient["parent"] == answer:
            sum_cost += float(each_patient["charges"])
    return round(sum_cost / parent_list.count(answer), 2)

avg_parent_cost = find_avg_parent_cost("yes")
avg_nonparent_cost = find_avg_parent_cost("no")

print("The average medical cost of parents is :", avg_parent_cost)
print("The average medical cost of non-parents is :", avg_nonparent_cost)

The average medical cost of parents is : 13949.94
The average medical cost of non-parents is : 12365.98


# Question: Which variable is most costly to patients?

### Age
- Group 1 (Ages 18-29)
- Group 2 (Ages 30-44)
- Group 3 (Ages 45-64)

In [24]:
age_group1 = [x for x in insurance_list if int(x["age"]) < 30]
age_group2 = [x for x in insurance_list if int(x["age"]) >= 30 or int(x["age"]) < 45]
age_group3 = [x for x in insurance_list if int(x["age"]) >= 45]

def find_cost(group):
    sum_cost = 0
    for each_patient in group:
        sum_cost += float(each_patient["charges"])
    return round(sum_cost / len(group), 2)

age_group1_cost = find_cost(age_group1)
age_group2_cost = find_cost(age_group2)
age_group3_cost = find_cost(age_group3)

print("Group 1 (Ages 18-29): $", age_group1_cost)
print("Group 2 (Ages 30-44): $", age_group2_cost)
print("Group 3 (Ages 45-64): $", age_group3_cost)

Group 1 (Ages 18-29): $ 9182.49
Group 2 (Ages 30-44): $ 13270.42
Group 3 (Ages 45-64): $ 17070.49


### Sex
- Male
- Female

In [25]:
print("The average medical cost for males is ${avg_male_cost}".format(avg_male_cost = avg_male_cost))
print("The average medical cost for females is ${avg_female_cost}".format(avg_female_cost = avg_female_cost))

The average medical cost for males is $13956.75
The average medical cost for females is $12569.58


### BMI 
- If your BMI is less than 18.5, it falls within the underweight range.
- If your BMI is 18.5 to <25, it falls within the healthy weight range.
- If your BMI is 25.0 to <30, it falls within the overweight range.
- If your BMI is 30.0 or higher, it falls within the obesity range.

In [26]:
bmi_list_int = [eval(i) for i in bmi_list]
bmi_group1 = [x for x in insurance_list if float(x["bmi"]) < 18.5]
bmi_group2 = [x for x in insurance_list if float(x["bmi"]) >= 18.5 or float(x["bmi"]) < 25]
bmi_group3 = [x for x in insurance_list if float(x["bmi"]) >= 25 or float(x["bmi"]) < 30]
bmi_group4 = [x for x in insurance_list if float(x["bmi"]) >= 30]

print("BMI Group 1 (< 18.5): $", find_cost(bmi_group1))
print("BMI Group 2 (18.5 - 24): $", find_cost(bmi_group2))
print("BMI Group 3 (25 - 29): $", find_cost(bmi_group3))
print("BMI Group 4 (> 29): $", find_cost(bmi_group4))

BMI Group 1 (< 18.5): $ 8852.2
BMI Group 2 (18.5 - 24): $ 13270.42
BMI Group 3 (25 - 29): $ 13270.42
BMI Group 4 (> 29): $ 15552.34


### Children
- Parents
- Non-parents

In [27]:
print("The average medical cost of parents is: $", avg_parent_cost)
print("The average medical cost of non-parents is: $", avg_nonparent_cost)

The average medical cost of parents is: $ 13949.94
The average medical cost of non-parents is: $ 12365.98


### Smoker
- Yes
- No

In [28]:
# I did this a different way earlier. Showing a new way of doing it by creating a new list of smokers
smoker_group = [x for x in insurance_list if x["smoker"] == "yes"]
nonsmoker_group = [x for x in insurance_list if x["smoker"] == "no"]

print("The average cost of being a smoker is $", find_cost(smoker_group))
print("The average cost of being a non-smoker is $", find_cost(nonsmoker_group))

The average cost of being a smoker is $ 32050.23
The average cost of being a non-smoker is $ 8434.27


### Region
- Southwest
- Northwest
- Southeast
- Northeast

In [29]:
southwest_group = [x for x in insurance_list if x["region"] == "southwest"]
northwest_group = [x for x in insurance_list if x["region"] == "northwest"]
southeast_group = [x for x in insurance_list if x["region"] == "southeast"]
northeast_group = [x for x in insurance_list if x["region"] == "northeast"]

print("The average cost for patients in the southwest is $", find_cost(southwest_group))
print("The average cost for patients in the northwest is $", find_cost(northwest_group))
print("The average cost for patients in the southeast is $", find_cost(southeast_group))
print("The average cost for patients in the northeast is $", find_cost(northeast_group))


The average cost for patients in the southwest is $ 12346.94
The average cost for patients in the northwest is $ 12417.58
The average cost for patients in the southeast is $ 14735.41
The average cost for patients in the northeast is $ 13406.38


In [30]:
def find_avg_bmi(group):
    sum_bmi = 0
    for each_patient in group:
        sum_bmi += float(each_patient["bmi"])
    return round(sum_bmi / len(group), 1)

print("The average BMI of patients in the southwest is", find_avg_bmi(southwest_group))
print("The average BMI of patients in the northwest is", find_avg_bmi(northwest_group))
print("The average BMI of patients in the southeast is", find_avg_bmi(southeast_group))
print("The average BMI of patients in the northeast is", find_avg_bmi(northeast_group))


The average BMI of patients in the southwest is 30.6
The average BMI of patients in the northwest is 29.2
The average BMI of patients in the southeast is 33.4
The average BMI of patients in the northeast is 29.2


In [31]:
# series1 = pd.Series(age_list)
# print(series1)