# U.S. Medical Insurance Costs

Initial question:
Is there a correlation between region and the other attributes, especially age and BMI?

### Scope:
#### Goal: Show how region affects other variables, especially age, bmi and total costs
#### Action: Provide a basis for examining spatial patterns that could show disparities in how medical hardship is experienced. That geographic patterns exist is just a conjecture. This is meant to be an exercise in exploring the potential of a dataset in preparation for learning more advanced quantitative analysis techniques.
#### Data: Codecademy-provided dataset
#### Analysis: Simple averages and counts of data, along with one basic correlation between two continuous variables

In [178]:
import csv
import os
import numpy as np

In [11]:
# was trying to figure out how much of the file path I needed to use to open the file
print(os.getcwd())

C:\Users\User1\Desktop\python_projects\python-portfolio-project-starter-files\python-portfolio-project-starter-files


The basic structure of unpacking the CSV into lists is borrowed from Codecademy's sample solution code. 

In [28]:
# lists for data organization purposes
ages = []
sexes = []
bmis = []
num_children = []
smoker_status = []
regions = []
insurance_charges = []

In [26]:
# populates a list with data from its corresponding column
def load_list_data(lst, csv_file, column_name):
    with open(csv_file) as csv_info:
        csv_reader = csv.DictReader(csv_info)
        for row in csv_reader:
            lst.append(row[column_name])
        return lst

In [None]:
# load each column from the CSV into a list
load_list_data(ages, 'insurance.csv', 'age')
load_list_data(sexes, 'insurance.csv', 'sex')
load_list_data(bmis, 'insurance.csv', 'bmi')
load_list_data(num_children, 'insurance.csv', 'children')
load_list_data(smoker_status, 'insurance.csv', 'smoker')
load_list_data(regions, 'insurance.csv', 'region')
load_list_data(insurance_charges, 'insurance.csv', 'charges')

My first attempt at implementing the PatientsInfo class, further down, follows Codecademy's approach: initializing the object with the lists created above. Directly below, I use an alternative method that unpacks the CSV into a single dictionary of lists instead of multiple separate lists, along with an alternative implementation of PatientsInfo that takes that dictionary for initialization.

In [196]:
# a function that automates loading the data columns into lists without having to call load_list_data() multiple times
def load_csv_to_dict(csv_file):
    with open(csv_file, newline='') as csv_info:
        csv_reader = csv.DictReader(csv_info)
        # one empty list per column, keyed by the header name
        master_dict = {column: [] for column in csv_reader.fieldnames}
        # read the rows in a single pass instead of rewinding the file once per column
        for row in csv_reader:
            for column in master_dict:
                master_dict[column].append(row[column])
    return master_dict

In [None]:
# test load_csv_to_dict()
load_csv_to_dict('insurance.csv')

In [190]:
# an alternative implementation of PatientsInfo to accept the dictionary of lists instead of individual lists:
class PatientsInfoDict:
    
    def __init__(self, patient_dict):
        self.patient_ages = patient_dict['age']
        self.patient_sexes = patient_dict['sex']
        self.patient_bmis = patient_dict['bmi']
        self.patient_children = patient_dict['children']
        self.patient_smoker = patient_dict['smoker']
        self.patient_region = patient_dict['region']
        self.patient_charges = patient_dict['charges']
        
# the remaining methods are identical to those of PatientsInfo (defined further down)

In [198]:
# test PatientsInfoDict class:
patient_dict = load_csv_to_dict('insurance.csv')
patient_info_dict = PatientsInfoDict(patient_dict)
print(patient_info_dict.patient_ages[:5])

['19', '18', '28', '33', '32']
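The values printed above are strings because csv.DictReader does not convert types. Below is a minimal sketch of one way to convert the numeric columns once, up front, instead of inside each method. The helper name `convert_columns` and the `NUMERIC_COLUMNS` mapping are assumptions for illustration and are not part of the notebook; the sample values are made up.

```python
# Hypothetical mapping from column name to the type it should be converted to.
NUMERIC_COLUMNS = {'age': int, 'bmi': float, 'children': int, 'charges': float}


def convert_columns(patient_dict, converters=NUMERIC_COLUMNS):
    """Return a new dict of lists with the numeric columns converted from str."""
    converted = {}
    for column, values in patient_dict.items():
        convert = converters.get(column)
        # columns without a converter (sex, smoker, region) stay as strings
        converted[column] = [convert(v) for v in values] if convert else list(values)
    return converted


# Illustrative stand-in for the output of load_csv_to_dict(); values are made up.
sample = {'age': ['19', '18'], 'bmi': ['27.9', '33.77'], 'region': ['southwest', 'southeast']}
typed = convert_columns(sample)
print(typed['age'])  # [19, 18]
print(typed['bmi'])  # [27.9, 33.77]
```

With the columns converted this way, the int() and float() calls inside the class methods would no longer be needed.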


I'm just learning basic statistics, including correlations. Below I calculate a Pearson correlation coefficient with NumPy's numpy.corrcoef() for ages and insurance charges across the entire dataset, as a test before implementing it for each geographic region in the PatientsInfo class. numpy.corrcoef() returns a 2x2 correlation matrix; the off-diagonal entries are the coefficient between the two variables.

In [141]:
# test space for calculating Pearson Correlation
# convert numerical values to int and float
age_list = [int(age) for age in ages]
charge_list = [float(charge) for charge in insurance_charges]
# make numpy arrays
age_array = np.array(age_list)
charges_array = np.array(charge_list)
# calculate correlation coefficient
rho = np.corrcoef(age_array, charges_array)
print(rho)

[[1.         0.29900819]
 [0.29900819 1.        ]]
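To see what the off-diagonal value in that matrix represents, here is a rough sketch that computes the Pearson coefficient directly from its definition (covariance divided by the product of the standard deviations) and checks it against numpy.corrcoef(). The function name `pearson_r` is an assumption for illustration, and the sample values are made up rather than taken from insurance.csv.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation from its definition: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # deviations of each value from its mean
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())


# Made-up illustrative values, not taken from insurance.csv.
sample_ages = [19, 25, 33, 41, 52, 60]
sample_charges = [1800.0, 3200.0, 2900.0, 7100.0, 6500.0, 12400.0]

manual = pearson_r(sample_ages, sample_charges)
from_numpy = np.corrcoef(sample_ages, sample_charges)[0, 1]
print(np.isclose(manual, from_numpy))  # True
```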


The PatientsInfo class contains methods that calculate averages for each region in the dataset, which makes it possible to compare regions and see whether any of them differ noticeably. This implementation uses the "conventional" Codecademy method of initializing the object with individual lists for each attribute.

In [175]:
class PatientsInfo:
    def __init__(self, patient_ages, patient_sexes, patient_bmis, patient_children, patient_smoker, patient_regions, patient_charges):
        self.patient_ages = patient_ages
        self.patient_sexes = patient_sexes
        self.patient_bmis = patient_bmis
        self.patient_children = patient_children
        self.patient_smoker = patient_smoker
        self.patient_regions = patient_regions
        self.patient_charges = patient_charges

    def region_average_ages(self, region):
        count = 0
        total = 0
        for patient_age, patient_region in zip(self.patient_ages, self.patient_regions):
            if patient_region == region:
                total += int(patient_age)
                count += 1
        return round(total / count, 2)
    
    def region_average_bmis(self, region):
        count = 0
        total = 0
        for patient_bmi, patient_region in zip(self.patient_bmis, self.patient_regions):
            if patient_region == region:
                total += float(patient_bmi)
                count += 1
        return round(total / count, 2)
    
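    def percent_smoker_status(self, region):
        # NOTE: the denominator is the whole dataset, so this returns the region's
        # smokers as a percentage of all patients, not the smoking rate within the region
        count = 0
        for smoker, patient_region in zip(self.patient_smoker, self.patient_regions):
            if patient_region == region and smoker == "yes":
                count += 1
        return round(count / len(self.patient_smoker) * 100, 2)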
    
    def region_average_children(self, region):
        count = 0
        total = 0
        for children, patient_region in zip(self.patient_children, self.patient_regions):
            if patient_region == region:
                total += int(children)
                count += 1
        return round(total / count, 2)
    
    def region_average_charges(self, region):
        count = 0
        total = 0
        for costs, patient_region in zip(self.patient_charges, self.patient_regions):
            if patient_region == region:
                total += float(costs)
                count += 1
        return round(total / count, 2)
    
    def age_charges_correlation(self, region):
        # convert numerical values to int and float
        age_list = [int(age) for age, patient_region in zip(self.patient_ages, self.patient_regions) if patient_region == region]
        charge_list = [float(charge) for charge, patient_region in zip(self.patient_charges, self.patient_regions) if patient_region == region]
        # make numpy arrays
        age_array = np.array(age_list)
        charges_array = np.array(charge_list)
        # calculate correlation coefficient
        rho = np.corrcoef(age_array, charges_array)
        print("Correlation coefficient between age and insurance charges for region {} = ".format(region.title()) + str(round(rho.tolist()[0][1], 3)))
    
    def generate_regional_statistics(self):
        regional_statistics = {}
        for region in self.patient_regions:
            # keys are stored title-cased, so compare the title-cased name;
            # comparing the raw lowercase name never matches and recomputes every row
            if region.title() not in regional_statistics:
                regional_statistics[region.title()] = {
                    "Average age": self.region_average_ages(region),
                    "Average BMI": self.region_average_bmis(region),
                    "Average Number of Children": self.region_average_children(region),
                    "Percent smokers": self.percent_smoker_status(region),
                    "Average Charges": self.region_average_charges(region),
                }
        return regional_statistics
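
The averaging methods in the class above all share the same loop: filter rows by region, convert, sum, and divide. One way to express that pattern once is sketched below as a standalone function. The name `region_average` and its `cast` parameter are assumptions for illustration, not part of the notebook, and the sample rows are made up.

```python
def region_average(values, regions, region, cast=float):
    """Average of `values` for rows whose region matches, rounded to 2 decimals."""
    selected = [cast(value) for value, row_region in zip(values, regions) if row_region == region]
    if not selected:
        # avoid a ZeroDivisionError for a region with no rows
        return None
    return round(sum(selected) / len(selected), 2)


# Made-up illustrative rows, not taken from insurance.csv.
sample_ages = ['20', '30', '40', '50']
sample_regions = ['southwest', 'southeast', 'southwest', 'northeast']
print(region_average(sample_ages, sample_regions, 'southwest', cast=int))  # 30.0
print(region_average(sample_ages, sample_regions, 'northwest'))            # None
```

Each of the class's averaging methods could then delegate to a helper like this, passing the relevant list and int or float as the cast.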

In [176]:
patient_info = PatientsInfo(ages, sexes, bmis, num_children, smoker_status, regions, insurance_charges)

In [136]:
patient_info.generate_regional_statistics()

{'Southwest': {'Average age': 39.46,
  'Average BMI': 30.6,
  'Average Number of Children': 1.14,
  'Percent smokers': 4.33,
  'Average Charges': 12346.94},
 'Southeast': {'Average age': 38.94,
  'Average BMI': 33.36,
  'Average Number of Children': 1.05,
  'Percent smokers': 6.8,
  'Average Charges': 14735.41},
 'Northwest': {'Average age': 39.2,
  'Average BMI': 29.2,
  'Average Number of Children': 1.15,
  'Percent smokers': 4.33,
  'Average Charges': 12417.58},
 'Northeast': {'Average age': 39.27,
  'Average BMI': 29.17,
  'Average Number of Children': 1.05,
  'Percent smokers': 5.01,
  'Average Charges': 13406.38}}
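
The "Percent smokers" figures above divide each region's smoker count by the size of the whole dataset, so they are shares of all patients rather than smoking rates within each region. If the within-region rate is what is wanted, a sketch along these lines would compute it. The function name `region_smoker_rate` is hypothetical and the sample rows are made up.

```python
def region_smoker_rate(smoker_flags, regions, region):
    """Percent of patients within `region` whose smoker flag is 'yes'."""
    in_region = [flag for flag, row_region in zip(smoker_flags, regions) if row_region == region]
    if not in_region:
        # no rows for this region, so a rate is undefined
        return None
    smokers = sum(1 for flag in in_region if flag == 'yes')
    return round(smokers / len(in_region) * 100, 2)


# Made-up illustrative rows, not taken from insurance.csv.
sample_smokers = ['yes', 'no', 'no', 'yes', 'no']
sample_regions = ['southwest', 'southwest', 'southeast', 'southeast', 'southwest']
print(region_smoker_rate(sample_smokers, sample_regions, 'southwest'))  # 33.33
print(region_smoker_rate(sample_smokers, sample_regions, 'southeast'))  # 50.0
```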

In [177]:
patient_info.age_charges_correlation('northeast')
patient_info.age_charges_correlation('southeast')
patient_info.age_charges_correlation('northwest')
patient_info.age_charges_correlation('southwest')

Correlation coefficient between age and insurance charges for region Northeast = 0.301
Correlation coefficient between age and insurance charges for region Southeast = 0.311
Correlation coefficient between age and insurance charges for region Northwest = 0.338
Correlation coefficient between age and insurance charges for region Southwest = 0.258
