# U.S. Medical Insurance Costs

## NB Purpose
This portfolio project is the culmination of the first part of Data Science Fundamentals in Codecademy's Data Science "Analytics" track.

This notebook serves as a development, experimentation, and analysis environment for the data set provided for the project.
I was given a .csv file and asked to analyze it in a Jupyter notebook, so here we are.

## Scope:
The data set has 7 descriptive fields: Age, Sex (M, F), Body Mass Index (BMI), Number of Children (Children), Smoker (text binary: yes/no), Region (assumed to represent U.S. regions), and Charges (presumably the yearly premium; the data set does not say).

Possible topics of investigation:
- Avg age of individuals within the data set
- Avg cost of coverage for smokers
- Relationship of Region to Smoker status
- Relationship of Region to BMI
- Relationship of Age to BMI
- Avg age of individuals with:
  - 1 child
  - 2 children
  - 3 children
  - 4 or more children
- Percentage of smokers with children vs no children

## Data Presentation
#### Keys:
| Key       | age | sex    | bmi   | children | smoker | region | charges |
| --------- | --- | ------ | ----- | -------- | ------ | ------ | ------- |
| Data Type | int | string | float | int      | string | string | float   |


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Import block for required modules
import functions.insurance_functions as my_functions
import csv
import json

# Load the csv data into a list of dictionaries (one dictionary per record)
data_list = []
# newline='' is what the csv module documentation recommends when opening csv files
with open('./data/insurance.csv', 'r', newline='') as insurance_file_obj:
    csv_reader = csv.DictReader(insurance_file_obj, delimiter=',')
    for record in csv_reader:
        data_list.append(record)
# A dictionary keyed by record would be ideal for this data set, but no field
# uniquely identifies a record, so an indexed list will have to suffice
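
`csv.DictReader` returns every field as a string, so the data types listed in the Keys table are not applied automatically when the file is loaded. Below is a minimal sketch of one way to cast a record, assuming the column names from that table. `FIELD_TYPES`, `convert_record`, and the sample row are hypothetical and are not part of the project's `insurance_functions` module.

```python
# Hypothetical mapping from column name to the type listed in the Keys table.
FIELD_TYPES = {
    "age": int,
    "sex": str,
    "bmi": float,
    "children": int,
    "smoker": str,
    "region": str,
    "charges": float,
}


def convert_record(record):
    """Return a copy of one DictReader row with each field cast to its type."""
    return {key: FIELD_TYPES[key](value) for key, value in record.items()}


# Made-up row for illustration only; it is not taken from insurance.csv.
sample = {"age": "30", "sex": "female", "bmi": "25.5", "children": "1",
          "smoker": "no", "region": "northeast", "charges": "4000.0"}
typed_sample = convert_record(sample)
print(type(typed_sample["age"]).__name__, type(typed_sample["bmi"]).__name__)
```

Casting once at load time would keep later helpers from having to convert strings on every call, though converting inside each helper works just as well for a data set this small.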


### Avg Age of Individuals within the Data Set
Calls `find_average` on the `age` field and prints the result so it appears in the rendered notebook.

In [3]:
average_age_key = "age"
average_age_of_dataset = round(my_functions.find_average(data_list, average_age_key), 2)
print(f"The average age across the entire health insurance data set is: {average_age_of_dataset}")

The average age across the entire health insurance data set is: 39.21
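
The body of `find_average` lives in `functions/insurance_functions.py` and is not shown in this notebook. As a rough idea of what such a helper has to do, here is a minimal sketch that assumes the records are dictionaries of strings, as produced by `csv.DictReader`. The name `average_of_field` and the sample records are hypothetical.

```python
def average_of_field(records, key):
    """Mean of the values stored under `key`, casting each one to float."""
    if not records:
        raise ValueError("cannot average an empty list of records")
    return sum(float(record[key]) for record in records) / len(records)


# Made-up records for illustration only.
people = [{"age": "20"}, {"age": "30"}, {"age": "43"}]
print(round(average_of_field(people, "age"), 2))  # 31.0
```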


### Avg Premium Paid By Smokers vs Non-Smokers

In [9]:
# Split the list of records into two separate lists based on smoking status
smokers = my_functions.extract_records_by_key(data_list, "smoker", "yes")
non_smokers = my_functions.extract_records_by_key(data_list, "smoker", "no")
avg_smoker_cost = my_functions.find_average(smokers, "charges")
avg_non_smoker_cost = my_functions.find_average(non_smokers, "charges")
print("Number of smokers in data set: ", len(smokers), "Avg cost: $", round(avg_smoker_cost, 2))
print("Number of non-smokers in data set: ", len(non_smokers), "Avg cost: $", round(avg_non_smoker_cost, 2))
smokers_pay_percent_more = round(((avg_smoker_cost - avg_non_smoker_cost) / avg_non_smoker_cost) * 100, 2)
print()

print(f"\nSmokers pay an average of {smokers_pay_percent_more}% more for health insurance premiums than non-smokers")



Number of smokers in data set:  274 Avg cost: $ 32050.23
Number of non-smokers in data set:  1064 Avg cost: $ 8434.27


Smokers pay an average of 280.0% more for health insurance premiums than non-smokers
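
The 280% figure can be checked by hand from the two averages printed above. The sketch below repeats that arithmetic and shows one plausible shape for the record filter. `filter_records` and `percent_more` are hypothetical names, not the project's actual `extract_records_by_key` implementation, which is not shown here.

```python
def filter_records(records, key, value):
    """Keep only the records whose `key` field equals `value`."""
    return [record for record in records if record[key] == value]


def percent_more(higher, lower):
    """How much larger `higher` is than `lower`, as a percentage of `lower`."""
    return (higher - lower) / lower * 100


# Averages copied from the output above.
avg_smoker_cost = 32050.23
avg_non_smoker_cost = 8434.27

print(round(percent_more(avg_smoker_cost, avg_non_smoker_cost), 2))  # 280.0
print(round(avg_smoker_cost / avg_non_smoker_cost, 2))               # 3.8
```

Note that "280% more" means smokers are charged about 3.8 times what non-smokers are charged, not 2.8 times.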


### Smoker Status and Region Relationship
This analysis investigates whether a person's region of residence is related to their smoking status.

In [10]:
# Find each unique region and store into a separate list so we can put records into buckets programmatically
region_list = my_functions.get_unique_values(data_list, "region")
print(region_list)
my_functions.bucket_regional_values_by_key(region_list, "smoker", smokers)

['southwest', 'southeast', 'northwest', 'northeast']
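
The printed list preserves the order in which regions first appear, and the notebook does not show what `bucket_regional_values_by_key` returns. One plausible reading is a per-region tally, so here is a minimal sketch of that idea, with the smoker share per region added because raw counts are hard to compare when regions differ in size. All function names and sample rows below are hypothetical.

```python
from collections import Counter


def unique_in_order(records, key):
    """Distinct values of `key`, in order of first appearance."""
    return list(dict.fromkeys(record[key] for record in records))


def count_by_field(records, key):
    """Number of records for each distinct value of `key`."""
    return Counter(record[key] for record in records)


# Made-up rows for illustration only.
rows = [
    {"region": "southwest", "smoker": "yes"},
    {"region": "southeast", "smoker": "yes"},
    {"region": "southwest", "smoker": "no"},
    {"region": "northeast", "smoker": "yes"},
    {"region": "southeast", "smoker": "yes"},
]

regions = unique_in_order(rows, "region")
totals = count_by_field(rows, "region")
smoker_counts = count_by_field(
    [row for row in rows if row["smoker"] == "yes"], "region"
)

# Share of smokers within each region (Counter returns 0 for missing keys).
smoker_share = {region: smoker_counts[region] / totals[region] for region in regions}
print(regions)
print(smoker_share)
```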
