# U.S. Medical Insurance Costs

## NB Purpose
This portfolio project concludes the first part of Data Science Fundamentals in the Data Science "Analytics" track on Codecademy.

The purpose of this notebook is to serve as a development, experimentation, and analysis environment for the data set provided for the project.
I was given a .csv file and told to analyze it in a Jupyter notebook, so here we are.

## Scope:
The data set has 7 descriptive fields: Age, Sex (M, F), Body Mass Index (BMI), Number of Children (Children), Smoker (text binary: yes/no), Region (assumed to represent US regions), and Charges (not documented; assumed here to be the yearly premium).

Possible topics of investigation:
- Avg age of individuals within the data set
- Avg cost of coverage for smokers
- Relationship of Region to Smoker status
- Relationship of Region to BMI
- Relationship of Age to BMI
- Avg age of individuals with:
  - 1 child
  - 2 children
  - 3 children
  - 4 or more children
- Percentage of smokers with children vs no children

## Data Presentation
#### Keys:
| Key       | age | sex    | bmi   | children | smoker | region | charges |
| --------- | --- | ------ | ----- | -------- | ------ | ------ | ------- |
| Data Type | int | string | float | int      | string | string | float   |
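
`csv.DictReader` returns every field as a string, so the types in the table above have to be applied after loading. The sketch below shows one way to do that. `FIELD_TYPES`, `coerce_record`, and the two sample rows are hypothetical and not part of the project's `insurance_functions` module.

```python
import csv
import io

# Assumed column types, copied from the Keys table above
FIELD_TYPES = {
    "age": int,
    "sex": str,
    "bmi": float,
    "children": int,
    "smoker": str,
    "region": str,
    "charges": float,
}


def coerce_record(record):
    """Return a copy of a DictReader row with each field cast to its type."""
    return {key: FIELD_TYPES[key](value) for key, value in record.items()}


# Two made-up rows in the same column layout, used only for illustration
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "40,female,25.5,2,no,northwest,7000.0\n"
    "52,male,31.0,0,yes,southeast,30000.0\n"
)
typed_rows = [coerce_record(row) for row in csv.DictReader(sample_csv)]
print(typed_rows[0])
```

Casting once at load time would keep the later averaging and comparison steps free of repeated `float()` calls.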


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Import block for required mods
import functions.insurance_functions as my_functions
import csv
import json

# Load csv data into a list of dictionaries (one dictionary per record)
data_list = []
with open('./data/insurance.csv', 'r') as insurance_file_obj:
    csv_reader = csv.DictReader(insurance_file_obj, delimiter=',')
    for record in csv_reader:
        data_list.append(record)
# A dictionary keyed by record would be ideal for this data set, but the records have no unique identifier, so an indexed list will have to suffice
total_records = len(data_list)


### Avg Age of Individuals within the Data Set
Calls the averaging function and prints the result so it appears in the notebook output.

In [3]:
average_age_key = "age"
average_age_of_dataset = round(my_functions.find_average(data_list, average_age_key), 2)
print(f"The average age across the entire health insurance data set is: {average_age_of_dataset}")

The average age across the entire health insurance data set is: 39.21
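
The body of `my_functions.find_average` is not shown in this notebook. A minimal sketch of what such a helper might look like follows, assuming the records are the string-valued dictionaries produced by `csv.DictReader`. This is a hypothetical stand-in, not the project's actual implementation.

```python
def find_average(records, key):
    """Hypothetical stand-in for my_functions.find_average.

    Averages the numeric value stored under `key` across all records.
    """
    if not records:
        raise ValueError("cannot average an empty list of records")
    # float() accepts both raw CSV strings and already-numeric values
    total = sum(float(record[key]) for record in records)
    return total / len(records)


# Made-up records, used only to exercise the function
sample_records = [{"age": "20"}, {"age": "30"}, {"age": "43"}]
print(round(find_average(sample_records, "age"), 2))  # prints 31.0
```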


### Avg Premium Paid By Smokers vs Non-Smokers

In [9]:
# Split the list of records into two separate lists based on smoking status
smokers = my_functions.extract_records_by_key(data_list, "smoker", "yes")
non_smokers = my_functions.extract_records_by_key(data_list, "smoker", "no")
avg_smoker_cost = my_functions.find_average(smokers, "charges")
avg_non_smoker_cost = my_functions.find_average(non_smokers, "charges")
print("Number of smokers in data set: ", len(smokers), "Avg cost: $", round(avg_smoker_cost, 2))
print("Number of non-smokers in data set: ", len(non_smokers), "Avg cost: $", round(avg_non_smoker_cost, 2))
smokers_pay_percent_more = round(((avg_smoker_cost - avg_non_smoker_cost) / avg_non_smoker_cost) * 100, 2)

print(f"\nSmokers pay an average of {smokers_pay_percent_more}% more for health insurance premiums than non-smokers")



Number of smokers in data set:  274 Avg cost: $ 32050.23
Number of non-smokers in data set:  1064 Avg cost: $ 8434.27


Smokers pay an average of 280.0% more for health insurance premiums than non-smokers
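
As a check on the arithmetic, the percentage can be recomputed from the two rounded averages printed above. The helper name `percent_more` is introduced here only for illustration.

```python
def percent_more(value, baseline):
    """Relative difference of `value` over `baseline`, as a percentage."""
    return (value - baseline) / baseline * 100


# Rounded averages taken from the cell output above
avg_smoker_cost = 32050.23
avg_non_smoker_cost = 8434.27

pct = percent_more(avg_smoker_cost, avg_non_smoker_cost)
ratio = avg_smoker_cost / avg_non_smoker_cost
print(round(pct, 2), round(ratio, 2))  # prints 280.0 3.8
```

Note that "280% more" means smokers are charged roughly 3.8 times what non-smokers are charged, not 2.8 times.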


### Smoker Status and Region Relationship
This analysis investigates whether the region a person lives in is related to their smoking status.

In [38]:
# Get the unique regions for both smokers and non-smokers
smoker_region_list = my_functions.get_unique_values(smokers, "region")
non_smoker_region_list = my_functions.get_unique_values(non_smokers, "region")
# Bucket smoker / non-smoker lists by region
my_functions.bucket_values_by_key(smoker_region_list, "region", smokers)
my_functions.bucket_values_by_key(non_smoker_region_list, "region", non_smokers)
# Compare keys to make sure our analysis is valid
smoker_keys = list(smoker_region_list.keys())
non_smoker_keys = list(non_smoker_region_list.keys())

smoking_info = dict()
if my_functions.check_for_matching_keys(smoker_keys, non_smoker_keys):
    # We can compare values
    for smoker_region in smoker_region_list:
        # Get number of smokers
        smk = len(smoker_region_list[smoker_region])
        non_smk = len(non_smoker_region_list[smoker_region])
        smoking_info.update(my_functions.create_smoker_data(smoker_region, smk, non_smk))
else:
    print("Cannot draw comparisons against non-matching regions")

# If the dictionary isn't empty, print the info to NB
if smoking_info:
    highest_percent = 0
    area = ''
    for region in smoking_info:
        record = smoking_info[region]
        print(
            f"""The {region.upper()} regional area reported {record['smokers']} smokers of {record['total']} participants\nThis results in a {record['percent_smokers']}% smoking population for the region.\n
            """)
        if record['percent_smokers'] > highest_percent:
            highest_percent = record['percent_smokers']
            area = region
    print(f'The {area.upper()} region had the highest rate of smokers with {highest_percent}%')

The SOUTHWEST regional area reported 58 smokers of 325 participants
This results in a 17.85% smoking population for the region.

            
The SOUTHEAST regional area reported 91 smokers of 364 participants
This results in a 25.0% smoking population for the region.

            
The NORTHEAST regional area reported 67 smokers of 324 participants
This results in a 20.68% smoking population for the region.

            
The NORTHWEST regional area reported 58 smokers of 325 participants
This results in a 17.85% smoking population for the region.

            
The SOUTHEAST region had the highest rate of smokers with 25.0%


This is an interesting development. At first glance, the southeast region has a noticeably higher share of smokers (25.0%) than any other region in the data set (17.85% to 20.68%), although no significance test has been run to confirm the difference.
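
The per-region smoking rate can also be computed in a single pass with `collections.Counter`. The sketch below is an alternative to the bucketing helpers used in the cell, not a copy of them. `smoker_rate_by_region` and the sample records are hypothetical.

```python
from collections import Counter


def smoker_rate_by_region(records):
    """Return {region: percent of records in that region marked as smokers}."""
    totals = Counter(record["region"] for record in records)
    smokers = Counter(
        record["region"] for record in records if record["smoker"] == "yes"
    )
    # Counter returns 0 for regions that have no smokers
    return {
        region: round(smokers[region] / totals[region] * 100, 2)
        for region in totals
    }


# Made-up records, used only for illustration
sample = [
    {"region": "southeast", "smoker": "yes"},
    {"region": "southeast", "smoker": "no"},
    {"region": "southeast", "smoker": "no"},
    {"region": "southeast", "smoker": "no"},
    {"region": "northwest", "smoker": "no"},
    {"region": "northwest", "smoker": "no"},
]
rates = smoker_rate_by_region(sample)
print(rates)
```

Applied to the real counts printed above, the same division gives 91 / 364 = 25.0% for the southeast and 58 / 325 = 17.85% for the southwest and northwest.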

### Relationship of Region to BMI
As with the smoking-by-region analysis, the data set is bucketed by region and the average BMI is computed for each bucket to see if there are any signs of correlation.

In [44]:
# Should have made a simple region list before
regional_record_list = my_functions.get_unique_values(data_list, "region")
my_functions.bucket_values_by_key(regional_record_list, "region", data_list)

regional_bmi_info = my_functions.get_regional_bmi_info(regional_record_list)
print(regional_bmi_info)

{'southwest': {'total_records': 325, 'avg_bmi': 30.6, 'std_dev': 5.69}, 'southeast': {'total_records': 364, 'avg_bmi': 33.36, 'std_dev': 6.48}, 'northwest': {'total_records': 325, 'avg_bmi': 29.2, 'std_dev': 5.14}, 'northeast': {'total_records': 324, 'avg_bmi': 29.17, 'std_dev': 5.94}}
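
The implementation of `get_regional_bmi_info` is not shown here. A rough sketch of how a per-region mean and standard deviation could be produced with the standard library follows. `bmi_summary_by_region` is hypothetical, and it uses the sample standard deviation (`statistics.stdev`). Whether the project's helper uses the sample or the population formula is not known from this notebook.

```python
from collections import defaultdict
from statistics import mean, stdev


def bmi_summary_by_region(records):
    """Group BMI values by region and summarize each group."""
    bmi_by_region = defaultdict(list)
    for record in records:
        # float() because csv.DictReader yields strings
        bmi_by_region[record["region"]].append(float(record["bmi"]))
    return {
        region: {
            "total_records": len(values),
            "avg_bmi": round(mean(values), 2),
            # stdev needs at least two values; fall back to 0.0 otherwise
            "std_dev": round(stdev(values), 2) if len(values) > 1 else 0.0,
        }
        for region, values in bmi_by_region.items()
    }


# Made-up records, used only for illustration
sample = [
    {"region": "southeast", "bmi": "30.0"},
    {"region": "southeast", "bmi": "34.0"},
    {"region": "northwest", "bmi": "28.0"},
    {"region": "northwest", "bmi": "30.0"},
]
summary = bmi_summary_by_region(sample)
print(summary)
```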
