# U.S. Medical Insurance Costs

## Goals
Observations:
1. The dataset contains both categorical and numerical data
2. No data is missing
3. There are 7 columns (variables)

Possible goals / questions:
1. What is the average age of insured people?
2. Group patients by age group (the most sensible way to define the groups still needs to be worked out)
3. Is there a relationship between age and cost of insurance?
4. Is insurance more expensive for female patients?
5. Where do most insured people live?
6. Is insurance more expensive depending on your region?
7. How do costs differ between smokers and non-smokers?
8. What is the average age of people with at least one child?
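
One possible way to approach goal 2 is to bucket ages into fixed-width bands. The sketch below assumes 10-year bands. The helper name `age_group` and the band width are illustrative choices, not something the dataset prescribes, and the ages used are made up for the example.

```python
def age_group(age, width=10):
    """Return a label such as '30-39' for the fixed-width band containing age.

    Hypothetical helper: the 10-year default width is an assumption.
    """
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Illustrative ages only, not taken from the dataset
sample_ages = [18, 19, 27, 33, 39, 64]

# Collect the ages that fall into each band
groups = {}
for age in sample_ages:
    groups.setdefault(age_group(age), []).append(age)

print(groups)
```

Fixed-width bands are easy to reason about, but they may leave some bands sparsely populated. Bands based on quantiles would be an alternative worth comparing.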

## Data

In [2]:
import csv
import statistics
from pprint import pprint as pp

In [41]:
def convert_elements_to_numerical(lst):
    """Converts the elements in a list to a numerical data type. 
    The function can discern if an element should be converted 
    to a floating point number (float) or an integer (int).
    Example:
    >>> convert_elements_to_numerical(['1', '2', '3'])
    [1, 2, 3]
    
    >>> convert_elements_to_numerical(['22.3', '19.21', '31.4'])
    [22.3, 19.21, 31.4]
    """
    result = []
    for item in lst:
        if '.' in item:  # A decimal point means the item is a float
            result.append(float(item))
        else:  # The item is an int
            result.append(int(item))
    return result


def build_patient_dictionary(age_lst, sex_lst, 
                             bmi_lst, children_lst, 
                             smoker_lst, region_lst, charges_lst):
    """Creates a new dictionary with the patient data
    Patients are numbered sequentially, starting from 1 up to the total number of patients.
    Each key of the dict is that unique patient number
    and each value is a dictionary holding the rest of the patient's data
    """
    result = {}
    # Initialise a counter of type int with initial value of 1
    patient_number = 1
    for age, sex, bmi, children_count, smoker_status, region, charges in zip(
            age_lst, sex_lst, bmi_lst, children_lst,
            smoker_lst, region_lst, charges_lst):
        result[patient_number] = {
                                'Age': age,
                                'Sex': sex,
                                'BMI': bmi,
                                'Children': children_count,
                                'Smoker': smoker_status,
                                'Region': region,
                                'Charges': charges
                                }
        # Increase the counter so the next patient gets the following number up
        patient_number += 1

    # Return the dictionary
    return result

def group_patients_by_age(patient_d):
    """Creates a new dictionary that groups patients by ages. The keys
    of this new dictionary are ages and the values are lists of the patient
    dictionaries that have that age
    """
    result = {}
    for patient in patient_d.values():
        #Capture the current age_value from the 'Age' value
        current_age_value = patient['Age']
        #Capture the current patient_dictionary
        current_patient_dict = patient
        #If the current age_value is not in the new dictionary:
        if current_age_value not in result:
            #Initialise that age_value as key and append the value of current_patient_dict to the list
            result[current_age_value] = [current_patient_dict]
        #Otherwise the age_value has been seen before:
        else:
            #Append the current_patient_dict to the list
            result[current_age_value].append(patient)
    return result



def count_patient_numbers_per_region():
    pass

def find_most_popular_region():
    pass

def count_patients_by_gender():
    pass

def find_age_most_insured():
    pass

def find_most_expensive_insurance():
    pass

# group_patients_by_age is implemented above, so no stub is needed here

def rate_patients_by_children_count():
    pass
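
As a rough idea of how two of the placeholders above might be filled in, the sketch below counts patients per region with `collections.Counter`. The `patient_d` parameter is an assumption (the stubs take no arguments yet), and the sample records are made up and only carry the `'Region'` key.

```python
from collections import Counter


def count_patient_numbers_per_region(patient_d):
    """Return a Counter mapping each region to its number of patients."""
    return Counter(patient['Region'] for patient in patient_d.values())


def find_most_popular_region(patient_d):
    """Return the region with the most patients.

    If several regions tie, the one encountered first is returned.
    """
    return count_patient_numbers_per_region(patient_d).most_common(1)[0][0]


# Made-up records in the same shape as the patient dictionary
sample_patients = {
    1: {'Region': 'southeast'},
    2: {'Region': 'southwest'},
    3: {'Region': 'southeast'},
}

print(count_patient_numbers_per_region(sample_patients))
print(find_most_popular_region(sample_patients))
```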

In [26]:
# Read in the csv file and store the contents into separate
# lists for each column
# Example: BMI column -> bmi [ ... ]

# Initialise all the lists
ages, sexes, bmis, children, smokers, regions, charges = [], [], [], [], [], [], []

with open('insurance.csv') as insurance_file:
    insurance_csv = csv.DictReader(insurance_file)
    
    for item in insurance_csv:
        ages.append(item['age'])
        sexes.append(item['sex'])
        bmis.append(item['bmi'])
        children.append(item['children'])
        smokers.append(item['smoker'])
        regions.append(item['region'])
        charges.append(item['charges'])

In [27]:
# Change lists that should be numerical to an appropriate data type
# For example, ages should all be integers but are strings at the moment
updated_ages = convert_elements_to_numerical(ages)
updated_bmis = convert_elements_to_numerical(bmis)
updated_children = convert_elements_to_numerical(children)
updated_charges = convert_elements_to_numerical(charges)


In [38]:
# Build a dictionary that has a patient number as keys and the different variables
# built into another dictionary. The patient number starts at 1 and goes to the last patient
# in the dataset, i.e. 1338 patients
patients_dictionary = build_patient_dictionary(updated_ages, sexes, updated_bmis, updated_children, smokers, regions, updated_charges)

# Print patients dictionary to make sure data was stored properly
#pp(patients_dictionary)
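
With the data in this shape, goal 8 (average age of people with at least one child) could be approached roughly as follows. The function name `average_age_of_parents` is hypothetical, and the sample records are made up and contain only the keys the function reads.

```python
import statistics


def average_age_of_parents(patient_d):
    """Return the mean age of patients with at least one child.

    Raises statistics.StatisticsError if no patient has children.
    """
    parent_ages = [
        patient['Age']
        for patient in patient_d.values()
        if patient['Children'] >= 1
    ]
    return statistics.mean(parent_ages)


# Made-up records in the same shape as the patient dictionary
sample_patients = {
    1: {'Age': 30, 'Children': 0},
    2: {'Age': 40, 'Children': 2},
    3: {'Age': 50, 'Children': 1},
}

print(average_age_of_parents(sample_patients))
```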

In [None]:
# Build a dictionary that groups all patients by age
# The resulting dictionary will have an age (as an int) as the keys
# and the values are lists of dictionaries of patients that have that same age
patients_by_age = group_patients_by_age(patients_dictionary)

# Print out the new dictionary of patients by age
#pp(patients_by_age)

"""Sample output

>>> pp(patients_by_age[18])
[{'Age': 18,
  'BMI': 33.77,
  'Charges': 1725.5523,
  'Children': 1,
  'Region': 'southeast',
  'Sex': 'male',
  'Smoker': 'no'},
 {'Age': 18,
  'BMI': 34.1,
  'Charges': 1137.011,
  'Children': 0,
  'Region': 'southeast',
  'Sex': 'male',
  'Smoker': 'no'},
 ...]
"""

## Analysis

In [36]:
average_age = statistics.mean(updated_ages)
median_age = statistics.median(updated_ages)
print(average_age)
print(median_age)

39.20702541106129
39.0
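
The same `statistics` module could be used for goal 7, comparing costs between smokers and non-smokers. The sketch below is one way to do it, assuming the patient dictionary shape built earlier. The function name is hypothetical and the charges are made-up values for illustration, not results from `insurance.csv`.

```python
import statistics


def average_charges_by_smoker_status(patient_d):
    """Map each smoker status found in the data to its mean charges."""
    charges_by_status = {}
    for patient in patient_d.values():
        # Collect the charges under the patient's smoker status ('yes' or 'no')
        charges_by_status.setdefault(patient['Smoker'], []).append(patient['Charges'])
    return {
        status: statistics.mean(values)
        for status, values in charges_by_status.items()
    }


# Made-up values for illustration only
sample_patients = {
    1: {'Smoker': 'yes', 'Charges': 300.0},
    2: {'Smoker': 'no', 'Charges': 100.0},
    3: {'Smoker': 'no', 'Charges': 200.0},
}

print(average_charges_by_smoker_status(sample_patients))
```

Passing `patients_dictionary` instead of the sample would give the averages for the real dataset.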
