# U.S. Medical Insurance Costs

## Goals
1. What is the average age of insured people?
2. Group patients by age group (the most sensible way to define the groups still needs to be worked out)
3. Is there a relationship between age and cost of insurance?
4. Is insurance more expensive if you're a female?
5. Where do most insured people live?
6. Is insurance more expensive depending on your region?
7. Look at the differences in cost between smokers vs non-smokers
8. Average age of people with at least one child

## Data

In [150]:
import csv
import statistics
from pprint import pprint as pp

The only libraries needed for the analysis are `csv` and `statistics`. The `pprint` module is used only to pretty-print the dictionaries returned by several functions, which makes large nested dictionaries easier to read.

Initial analysis of the **insurance.csv** file reveals the following information about the dataset.
There are 7 variables:
* Patient age
* Patient gender
* Patient BMI
* Patient Number of Children
* Patient Smoker Status
* Patient US Region
* Patient Yearly Medical Insurance Costs

The first step is to load the **insurance.csv** data into 7 separate `lists` that will hold the data for the respective variable.

In [229]:
# Read in the csv file and store the contents into separate
# lists for each column
# Example: BMI column -> bmi [ ... ]

# Initialise all the lists
ages, sexes, bmis, children, smokers, regions, charges = [], [], [], [], [], [], []

with open('insurance.csv') as insurance_file:
    insurance_csv = csv.DictReader(insurance_file)
    
    for row in insurance_csv:
        ages.append(row['age'])
        sexes.append(row['sex'])
        bmis.append(row['bmi'])
        children.append(row['children'])
        smokers.append(row['smoker'])
        regions.append(row['region'])
        charges.append(row['charges'])
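
Note that `csv.DictReader` returns every field as a string, which is why the numerical columns need converting before any arithmetic. A minimal sketch with a two-row in-memory sample (the rows and the names `sample_csv` and `sample_rows` are made up for illustration and are not taken from **insurance.csv**):

```python
import csv
import io

# Hypothetical two-row sample with the same header as insurance.csv
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.5,1,no,southwest,4000.25\n"
    "45,male,31.2,2,yes,northeast,21000.5\n"
)

sample_rows = list(csv.DictReader(sample_csv))

# Every value is still a str at this point, including the numerical columns
first_row = sample_rows[0]
print(type(first_row['age']).__name__, repr(first_row['bmi']))
```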

In [151]:
def convert_elements_to_numerical(lst):
    """Converts the elements in a list to a numerical data type. 
    The function can discern if an element should be converted 
    to a floating point number (float) or an integer (int).
    Example:
    >>> convert_elements_to_numerical(['1', '2', '3'])
    [1, 2, 3]
    
    >>> convert_elements_to_numerical(['22.3', '19.21', '31.4'])
    [22.3, 19.21, 31.4]
    """
    result = []
    for item in lst:
        if item.find('.') != -1:  # This means the item is a float
            result.append(float(item))
        else:  # The item is an int
            result.append(int(item))
    return result

In [152]:
def build_patient_dictionary(age_lst, sex_lst,
                             bmi_lst, children_lst,
                             smoker_lst, region_lst, charges_lst):
    """Creates a new dictionary with the patient data.
    Each key is a sequential patient number, starting at 1 and running up to
    the total number of patients. Each value is a dictionary holding the rest
    of that patient's data under descriptive keys.
    """
    result = {}
    # Initialise a counter of type int with initial value of 1
    patient_number = 1
    for age, sex, bmi, children_count, smoker_status, region, charges in zip(age_lst, sex_lst, bmi_lst, children_lst, smoker_lst, region_lst, charges_lst):
        result[patient_number] = {
                                'Age': age,
                                'Sex': sex,
                                'BMI': bmi,
                                'Children': children_count,
                                'Smoker': smoker_status,
                                'Region': region,
                                'Charges': charges
                                }
        # Increase the counter so the next patient gets the following number up
        patient_number += 1

    # Return the dictionary
    return result

In [153]:
def group_patients_by_age(px_dict):
    """Creates a new dictionary that groups patients by ages. The keys
    of this new dictionary are ages and the values are dictionaries of each
    patient that corresponds to that age
    """
    result = {}
    for patient in px_dict.values():
        #Capture the current age_value from the 'Age' value
        current_age_value = patient['Age']
        #Capture the current patient_dictionary
        current_patient = patient
        #If the current age_value is not in the new dictionary:
        if current_age_value not in result:
            #Initialise that age_value as key and append the value of current_patient_dict to the list
            result[current_age_value] = [current_patient]
        #Otherwise the age_value has been seen before:
        else:
            #Append the current_patient_dict to the list
            result[current_age_value].append(patient)
    return result

In [154]:
def count_patient_numbers_per_region(px_dict):
    regions_dict = {}
    
    for patient in px_dict.values():
        current_region = patient['Region']
        if current_region not in regions_dict:  # Check if this region has not been seen yet
            regions_dict[current_region] = 1
        else:                                   # This region has been seen before
            regions_dict[current_region] += 1
            
    return regions_dict

In [155]:
def count_patients_by_gender(px_dict):
    gender_count_dict = {}
    
    for patient in px_dict.values():
        current_gender = patient['Sex']
        
        if current_gender not in gender_count_dict:
            gender_count_dict[current_gender] = 1
        else:
            gender_count_dict[current_gender] += 1
    return gender_count_dict

In [156]:
def find_age_most_insured(age_dict):
    age_most_insured = 0   # These two values will be used to find the King of the Hill
    max_number_patients = 0

    for age, patients in age_dict.items():
        current_age = age
        total_patients = len(patients)

        if total_patients > max_number_patients:
            max_number_patients = total_patients
            age_most_insured = current_age

    return (age_most_insured, max_number_patients)

In [157]:
def find_most_expensive_insurance(px_dict):
    most_expensive = 0
    patient_who_pays_most = None   # Stays None if px_dict is empty
    for patient in px_dict.values():   # Use the argument, not the global dictionary
        current_cost = patient['Charges']

        if current_cost > most_expensive:
            most_expensive = current_cost
            patient_who_pays_most = patient   # Record the details of the patient with the highest charges

    return (patient_who_pays_most, most_expensive)

In [181]:
def find_average_cost_per_region(px_dict):
    result = {}
    sw_total, se_total, nw_total, ne_total = 0, 0, 0, 0
    sw_count, se_count, nw_count, ne_count = 0, 0, 0, 0
    
    for info in px_dict.values():  # Use the argument rather than the global dictionary
        current_charges = info['Charges']
        if info['Region'] == 'southwest':
            sw_total += current_charges
            sw_count += 1
        
        elif info['Region'] == 'southeast':
            se_total += current_charges
            se_count += 1
            
        elif info['Region'] == 'northwest':
            nw_total += current_charges
            nw_count += 1
        
        else:
            ne_total += current_charges
            ne_count += 1
    
    result.update({'southwest': sw_total / sw_count,
                    'southeast': se_total / se_count,
                    'northwest': nw_total / nw_count,
                    'northeast': ne_total / ne_count})
    
    return result

In [199]:
def find_average_cost_per_region_using_lists(px_dict):
    from statistics import median, mean
    
    southwest, southeast, northwest, northeast = [], [], [], []
    result = {}   # To store the final mean and median for each region
    
    for info in px_dict.values():  # Use the argument rather than the global dictionary
        current_charges = info['Charges']
        if info['Region'] == 'southwest':
            southwest.append(current_charges)
        
        elif info['Region'] == 'southeast':
            southeast.append(current_charges)
            
        elif info['Region'] == 'northwest':
            northwest.append(current_charges)
        
        else:
            northeast.append(current_charges)
    
    result.update({ 'Southwest': {'Mean': mean(southwest), 'Median': median(southwest)},
                    'Southeast': {'Mean': mean(southeast), 'Median': median(southeast)},
                    'Northwest': {'Mean': mean(northwest), 'Median': median(northwest)},
                    'Northeast': {'Mean': mean(northeast), 'Median': median(northeast)} })
    
    return result

The function `convert_elements_to_numerical()` converts the string values in a list to numbers, which makes numerical analysis of each variable possible.
Only the four lists that hold numerical data need converting. The converted values are stored in:
* `updated_ages`
* `updated_bmis`
* `updated_children`
* `updated_charges`
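
The helper decides between `int` and `float` by looking for a decimal point in the string. An alternative sketch, not what this notebook uses, is to try `int()` first and fall back to `float()`; the name `to_number` is hypothetical:

```python
def to_number(text):
    """Hypothetical helper: return an int if possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        # int() rejects strings such as '27.9', so fall back to float()
        return float(text)

converted = [to_number(value) for value in ['19', '27.9', '16884.924']]
print(converted)  # [19, 27.9, 16884.924]
```

This variant would also accept values written in scientific notation such as `'1e3'`, which contain no decimal point.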


In [225]:
# Change lists that should be numerical to an appropriate data type
# For example, ages should all be integers but are strings at the moment
updated_ages = convert_elements_to_numerical(ages)
updated_bmis = convert_elements_to_numerical(bmis)
updated_children = convert_elements_to_numerical(children)
updated_charges = convert_elements_to_numerical(charges)

In [160]:
# Build a dictionary that has a patient number as keys and the different variables
# built into another dictionary. The patient number starts at 1 and goes to the last patient
# in the dataset, i.e. 1338 patients
patients_dictionary = build_patient_dictionary(updated_ages, sexes, updated_bmis, updated_children, smokers, regions, updated_charges)

# Print patients dictionary to make sure data was stored properly
# pp(patients_dictionary)
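
Because every patient record is a dictionary with the same keys, the structure can be searched by field. A minimal sketch, assuming a hypothetical helper `find_patients()` and made-up toy records in the same shape as `patients_dictionary`:

```python
def find_patients(px_dict, **criteria):
    """Hypothetical helper: return the patient numbers whose record
    matches every key/value pair given in criteria."""
    return [
        number
        for number, record in px_dict.items()
        if all(record.get(key) == value for key, value in criteria.items())
    ]

# Toy records in the same shape as patients_dictionary (values are made up)
toy_patients = {
    1: {'Age': 30, 'Sex': 'female', 'Smoker': 'no', 'Region': 'southwest'},
    2: {'Age': 45, 'Sex': 'male', 'Smoker': 'yes', 'Region': 'northeast'},
    3: {'Age': 52, 'Sex': 'female', 'Smoker': 'yes', 'Region': 'northeast'},
}

print(find_patients(toy_patients, Smoker='yes', Region='northeast'))  # [2, 3]
```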

In [161]:
# Build a dictionary that groups all patients by age
# The resulting dictionary will have an age (as an int) as the keys
# and the values are lists of dictionaries of patients that have that same age
patients_by_age = group_patients_by_age(patients_dictionary)

# Print out the new dictionary of patients by age
#pp(patients_by_age)
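
Grouping by exact age gives one key per year of age. For broader age groups, one possible approach is to bucket ages into fixed-width bands. The sketch below assumes 10-year bands, which is only one of several sensible choices; the helper name `group_patients_by_age_band()` and the toy records are hypothetical:

```python
def group_patients_by_age_band(px_dict, width=10):
    """Hypothetical helper: group patient records into age bands.
    With width=10 the keys look like '30-39'."""
    result = {}
    for patient in px_dict.values():
        lower = (patient['Age'] // width) * width   # Lower edge of the band
        band = f'{lower}-{lower + width - 1}'
        result.setdefault(band, []).append(patient)
    return result

# Toy records with made-up ages
sample_patients = {
    1: {'Age': 18}, 2: {'Age': 34}, 3: {'Age': 39}, 4: {'Age': 64},
}
age_bands = group_patients_by_age_band(sample_patients)
print({band: len(members) for band, members in age_bands.items()})
# {'10-19': 1, '30-39': 2, '60-69': 1}
```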

In [246]:
# Find the age most insured
most_insured_age = find_age_most_insured(patients_by_age)
total_count = most_insured_age[1]
print(f'The patients aged {most_insured_age[0]} years old are the most insured with {total_count} patients total.') 

The patients aged 18 years old are the most insured with 69 patients total.


In [165]:
# Figure out who pays the most expensive insurance policy
# See also the additional details for that patient
dearest_insurance = find_most_expensive_insurance(patients_dictionary)
patient_who_pays_most = dearest_insurance[0]

print(f'The most expensive insurance costs ${dearest_insurance[1]:,.2f} dollars.\n')
print(f'The patient who pays most is:')
print(f"{patient_who_pays_most['Age']} years old {patient_who_pays_most['Sex']}.")
print(f"Whose BMI is {patient_who_pays_most['BMI']}.")
print(f"Has {patient_who_pays_most['Children']} children.")
print(f"Her smoker status is {patient_who_pays_most['Smoker']}.")
print(f"And is registered in the {patient_who_pays_most['Region'].title()} region.")

The most expensive insurance costs $63,770.43 dollars.

The patient who pays most is:
54 years old female.
Whose BMI is 47.41.
Has 0 children.
Her smoker status is yes.
And is registered in the Southeast region.


## Analysis

The data has been organised, and the analysis can begin. This part of the analysis will try to answer the following questions:

1. Find average age of the patients.
2. Break down the number of males vs. females in the dataset.
3. Find geographical location of the patients.
4. Count the number of patients in relation to their geographical location.
5. Calculate the average yearly medical charges as a function of geographical location. Is there a region where insurance is more expensive?
6. Create a searchable dictionary of all patient information

In [279]:
# Find the average age in the dataset using the updated list of ages
average_age = statistics.mean(updated_ages)
print(f'Average age: {average_age:,.2f} years old.')

Average age: 39.21 years old.


The average (mean) age of the patients in the **insurance.csv** dataset is about `39 years old`. The range and standard deviation of the patient ages have not been calculated.
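
The spread of the ages could be obtained with the same `statistics` module. A minimal sketch on made-up ages; in the notebook, `sample_ages` would be replaced by `updated_ages`:

```python
import statistics

# Made-up ages for illustration only
sample_ages = [18, 25, 39, 47, 64]

age_range = max(sample_ages) - min(sample_ages)
age_stdev = statistics.stdev(sample_ages)   # Sample standard deviation

print(f'Range: {age_range} years (from {min(sample_ages)} to {max(sample_ages)})')
print(f'Standard deviation: {age_stdev:.2f} years')
```

`statistics.stdev()` treats the data as a sample; `statistics.pstdev()` would be the choice if the list is regarded as the whole population.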

In [244]:
# Build a dictionary to break down total count of patients insured
# by gender. Two keys exist, 'female' and 'male', each with its corresponding patient count
patients_by_gender = count_patients_by_gender(patients_dictionary)
print(f'The number of female patients in the dataset is: {patients_by_gender["female"]}')
print(f'The number of male patients in the dataset is: {patients_by_gender["male"]}')

The number of female patients in the dataset is: 662
The number of male patients in the dataset is: 676


A helper function `count_patients_by_gender()` calculates the total number of patients by gender. The result is **662 females** and **676 males**, so the dataset is nearly balanced by gender. Balance on this one variable does not by itself show that the dataset is representative of a broader population.
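
The balance can be expressed as percentages. A short sketch using the two counts printed above (the names `gender_counts` and `gender_shares` are illustrative):

```python
# Counts as printed above
gender_counts = {'female': 662, 'male': 676}
total_patients = sum(gender_counts.values())

# Share of each gender as a percentage of all patients
gender_shares = {
    gender: count / total_patients * 100
    for gender, count in gender_counts.items()
}

for gender, share in gender_shares.items():
    print(f'{gender}: {gender_counts[gender]} patients ({share:.1f}%)')
```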

In [257]:
# Build a dictionary where all patients are grouped by region
# The resulting dictionary will have a region (out of 4 possible regions) as keys
# and the count as values
regions_dictionary_count = count_patient_numbers_per_region(patients_dictionary)

for region, count in regions_dictionary_count.items():
    print(f'The {region.title()} has {count} patients.')

The Southwest has 325 patients.
The Southeast has 364 patients.
The Northwest has 325 patients.
The Northeast has 324 patients.


Another helper function `count_patient_numbers_per_region()` was used to calculate the total number of patients in each region of the United States.

The dataset contains four unique geographical regions in the United States. The **Southeast** has noticeably more patients (364) than the other three regions, which have 324 or 325 each. The regional counts add up to **1338 patients**, which confirms the total number of patients in the dataset.
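
The same tally can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the helper used in this notebook, and the toy records are made up:

```python
from collections import Counter

# Toy records in the same shape as patients_dictionary (values are made up)
sample_patients = {
    1: {'Region': 'southwest'},
    2: {'Region': 'southeast'},
    3: {'Region': 'southeast'},
    4: {'Region': 'northwest'},
}

region_counts = Counter(patient['Region'] for patient in sample_patients.values())
print(region_counts.most_common(1))  # [('southeast', 2)]
print(sum(region_counts.values()))   # 4
```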

In [277]:
# Analyse the regional costs to obtain statistical measures of central tendency
average_regional_costs_second = find_average_cost_per_region_using_lists(patients_dictionary)

total_all_regions = 0   # Running sum of the four regional means
for region, stats in average_regional_costs_second.items():
    print(f'{region}: ${stats["Mean"]:,.2f}')

    total_all_regions += stats["Mean"]

# Calculate the unweighted mean of the regional means
# (each region counts equally, regardless of how many patients it has)
total_average_all_regions = total_all_regions / len(average_regional_costs_second)

print(f'Average Yearly Insurance Charges for all Regions: {total_average_all_regions:,.2f} dollars.')

Southwest: $12,346.94
Southeast: $14,735.41
Northwest: $12,417.58
Northeast: $13,406.38
Average Yearly Insurance Charges for all Regions: 13,226.58 dollars.


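The function `find_average_cost_per_region_using_lists()` calculates the mean and median charges for each of the four regions. The **Southeast** has the highest mean yearly charge (14,735.41 US dollars) and the **Southwest** the lowest (12,346.94 US dollars). The figure of **13,226.58 US dollars** is the unweighted mean of the four regional means; because the regions do not all have the same number of patients, it is not exactly the average charge per individual. Additional analysis is needed to determine which patient variables contribute most strongly to low or high medical insurance charges. For example, does age correlate with yearly insurance costs?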
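
Weighting each regional mean by its patient count gives a per-patient average instead. A minimal sketch using the rounded regional means and the patient counts printed earlier in this notebook, so the result is approximate; the variable names are illustrative:

```python
# Regional means and patient counts as printed earlier (means rounded to cents)
regional_means = {
    'southwest': 12346.94,
    'southeast': 14735.41,
    'northwest': 12417.58,
    'northeast': 13406.38,
}
regional_counts = {
    'southwest': 325,
    'southeast': 364,
    'northwest': 325,
    'northeast': 324,
}

# Every region counts equally
unweighted_mean = sum(regional_means.values()) / len(regional_means)

# Every patient counts equally
total_patients = sum(regional_counts.values())
weighted_mean = sum(
    regional_means[region] * regional_counts[region] for region in regional_means
) / total_patients

print(f'Unweighted mean of regional means: ${unweighted_mean:,.2f}')
print(f'Patient-weighted mean: ${weighted_mean:,.2f}')
```

The weighted figure comes out at roughly 13,270 dollars, slightly higher than the unweighted one because the most expensive region, the Southeast, also has the most patients. Calling `statistics.mean(updated_charges)` directly would give the per-patient mean without the rounding.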