# U.S. Medical Insurance Costs

In this project I apply my knowledge of Python functions and classes to analyze real data on US medical insurance costs. The project is meant as a stepping stone in my development as a professional data scientist.

For this project I will be working with a CSV file containing data on a number of US patients. First I define a function that gets the column names of the CSV file and returns them as a Python list.

In [6]:
def column_names_list(csv_reader):
    # csv.DictReader reads the header row into `fieldnames`, so there is
    # no need to loop over every row; fall back to an empty list if the
    # file has no header.
    return list(csv_reader.fieldnames or [])

I then use my `column_names_list` function to print out the column names. Now I know what kind of variables are contained in the data set.

In [7]:
import csv

with open(r'C:\Users\dvale\Desktop\insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    insurance_column_names_list = column_names_list(insurance_reader)
    for name in insurance_column_names_list:
        print(name)

age
sex
bmi
children
smoker
region
charges


I store the values of each column as individual lists.

In [8]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

with open(r'C:\Users\dvale\Desktop\insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])

I also create a dictionary containing all the information in the data set.

In [9]:
insurance_dict = {}

with open(r'C:\Users\dvale\Desktop\insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    insurance_column_names_list = column_names_list(insurance_reader)
    for name in insurance_column_names_list:
        insurance_dict[name] = []

insurance_dict['age'] = age
insurance_dict['sex'] = sex
insurance_dict['bmi'] = bmi
insurance_dict['children'] = children
insurance_dict['smoker'] = smoker
insurance_dict['region'] = region
insurance_dict['charges'] = charges
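
The three cells above open the file three times. As a minimal sketch of an alternative, the same dictionary of column lists could be built in a single pass. The helper name `read_columns` and the in-memory sample are assumptions made for illustration; the sample rows are hypothetical and only reuse the header of `insurance.csv`.

```python
import csv
import io

# Hypothetical three-row sample that reuses the header of insurance.csv.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
)

def read_columns(csv_file):
    """Return {column name: list of string values} from one pass over the file."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns

sample_dict = read_columns(sample_csv)
print(sample_dict["age"])     # ['19', '18', '28']
print(sample_dict["smoker"])  # ['yes', 'no', 'no']
```

With the real file, the same function could be called on the object returned by `open(...)`.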

Once I have my dictionary I can make sure there are no missing values by comparing the number of values in each column. I can also check which of my variables are categorical and which are continuous.

In [10]:
for key, value in insurance_dict.items():
    print('Name of value: {}, Num of values: {}, Value example: {}, Python class type: {}'.format(key, len(value), value[0], type(value[0])))

Name of value: age, Num of values: 1338, Value example: 19, Python class type: <class 'str'>
Name of value: sex, Num of values: 1338, Value example: female, Python class type: <class 'str'>
Name of value: bmi, Num of values: 1338, Value example: 27.9, Python class type: <class 'str'>
Name of value: children, Num of values: 1338, Value example: 0, Python class type: <class 'str'>
Name of value: smoker, Num of values: 1338, Value example: yes, Python class type: <class 'str'>
Name of value: region, Num of values: 1338, Value example: southwest, Python class type: <class 'str'>
Name of value: charges, Num of values: 1338, Value example: 16884.924, Python class type: <class 'str'>
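
Equal column lengths show that every row contributed an entry to every column, but a blank cell would still be counted as a value. A small sketch of an explicit check follows; the function name `count_missing` and the toy data are hypothetical, and "missing" is assumed here to mean `None` or an empty or whitespace-only string.

```python
def count_missing(columns):
    """Count None, empty, or whitespace-only entries in each column of a dict of lists."""
    return {
        name: sum(1 for value in values if value is None or str(value).strip() == "")
        for name, values in columns.items()
    }

# Hypothetical data: one BMI entry is blank.
toy_columns = {"age": ["19", "18", "28"], "bmi": ["27.9", "", "33.0"]}
print(count_missing(toy_columns))  # {'age': 0, 'bmi': 1}
```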


This data set gives information about 1338 patients in the US: their age, their sex, their Body Mass Index (BMI), their number of children (if any), whether they smoke, the region where they live and, finally, their yearly medical insurance cost.

Since my numeric variables are stored as `class 'str'`, I need to convert them to `int` and `float`.

In [11]:
def convert_column(values):
    # Try int first, then float. Each attempt builds a fresh list, so a
    # failure part-way through a column never leaves partial results behind.
    for cast in (int, float):
        try:
            return [cast(item) for item in values]
        except ValueError:
            continue
    # Neither conversion worked: keep the original strings.
    return list(values)

updated_insurance_dict = {}
for key, value in insurance_dict.items():
    updated_insurance_dict[key] = convert_column(value)

insurance_dict = updated_insurance_dict

age = insurance_dict['age']
bmi = insurance_dict['bmi']
children = insurance_dict['children']
charges = insurance_dict['charges']

Now I am ready to define a special class with specific methods to aid me in my analysis.

My first goal is to get an idea of the shape of the data. For my quantitative variables I will compute key parameters such as the median, the mean and the standard deviation. For my categorical variables I will compute the frequency of each category, expressed as a percentage of the total.

My second goal is to anticipate potential relationships between the dependent variable (charges) and each of the independent variables (age, sex, bmi, children, smoker and region).
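
As a rough sketch of the first goal, the same summaries can be computed with the standard library. The function names and toy values below are hypothetical. `statistics.pstdev` is used because it divides by the number of values, which matches the formula in the class that follows; `statistics.stdev` would divide by n - 1 instead.

```python
import statistics
from collections import Counter

def numeric_summary(values):
    """Return (median, mean, population standard deviation), rounded to 2 decimals."""
    return (
        round(statistics.median(values), 2),
        round(statistics.mean(values), 2),
        round(statistics.pstdev(values), 2),
    )

def frequency_percentages(values):
    """Return {category: percentage of the total}, rounded to 2 decimals."""
    counts = Counter(values)
    return {key: round(count / len(values) * 100, 2) for key, count in counts.items()}

# Hypothetical toy data, not taken from insurance.csv.
toy_values = [2, 4, 4, 4, 5, 5, 7, 9]
toy_smoker = ["yes", "no", "no", "no"]
print(numeric_summary(toy_values))        # median 4.5, mean 5, standard deviation 2
print(frequency_percentages(toy_smoker))  # {'yes': 25.0, 'no': 75.0}
```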

In [18]:
class InsuranceData:
    def __init__(self, age, sex, bmi, children, smoker, region, charges):
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region
        self.charges = charges
    
    def stat_age(self):
        mean = sum(self.age) / len(self.age)
        mean_round = round(mean, 2)
                
        sum_squares = 0
        for item in self.age:
            sum_squares += (mean - item) ** 2
        stdv = (sum_squares / len(self.age)) ** 0.5
        stdv_round = round(stdv, 2)
    
        sorted_list = sorted(self.age)
        median = 0
        if len(self.age) % 2 == 0:
            median = (sorted_list[int(len(self.age)/2)] + sorted_list[int((len(self.age)/2)-1)]) / 2
        else:
            median = sorted_list[int(len(self.age)/2)]
        median_round = round(median, 2)       
        
        print('The median age is {}. \nThe average age is {}. \nThe age standard deviation is {}. \n'.format(median_round, mean_round, stdv_round))
    
    def stat_bmi(self):
        mean = sum(self.bmi) / len(self.bmi)
        mean_round = round(mean, 2)
                
        sum_squares = 0
        for item in self.bmi:
            sum_squares += (mean - item) ** 2
        stdv = (sum_squares / len(self.bmi)) ** 0.5
        stdv_round = round(stdv, 2)
    
        sorted_list = sorted(self.bmi)
        median = 0
        if len(self.bmi) % 2 == 0:
            median = (sorted_list[int(len(self.bmi)/2)] + sorted_list[int((len(self.bmi)/2)-1)]) / 2
        else:
            median = sorted_list[int(len(self.bmi)/2)]
        median_round = round(median, 2)
        
        print('The median BMI is {}. \nThe average BMI is {}. \nThe BMI standard deviation is {}.\n'.format(median_round, mean_round, stdv_round))
    
    def stat_charges(self):
        mean = sum(self.charges) / len(self.charges)
        mean_round = round(mean, 2)
                
        sum_squares = 0
        for item in self.charges:
            sum_squares += (mean - item) ** 2
        stdv = (sum_squares / len(self.charges)) ** 0.5
        stdv_round = round(stdv, 2)
    
        sorted_list = sorted(self.charges)
        median = 0
        if len(self.charges) % 2 == 0:
            median = (sorted_list[int(len(self.charges)/2)] + sorted_list[int((len(self.charges)/2)-1)]) / 2
        else:
            median = sorted_list[int(len(self.charges)/2)]
        median_round = round(median, 2)

        print('The median charge is {}$. \nThe average charge is {}$. \nThe charge standard deviation is {}$.'.format(median_round, mean_round, stdv_round))
        
    def stat_sex(self):
        nom_attribute = {}
        for item in self.sex:
            if item in nom_attribute:
                continue
            else:
                nom_attribute[item] = 0
        for item in self.sex:
            if item in nom_attribute:
                nom_attribute[item] += 1
        
        text = ''
        for key, value in list(nom_attribute.items()):
            text += '{}% are {}. \n'.format(round((value/len(self.sex)*100), 2), key)
        print(text) 
    
    def stat_smoker(self):
        nom_attribute = {}
        for item in self.smoker:
            if item in nom_attribute:
                continue
            else:
                nom_attribute[item] = 0
        for item in self.smoker:
            if item in nom_attribute:
                nom_attribute[item] += 1
        
        text = ''
        for key, value in list(nom_attribute.items()):
            text += '{}% said {}. \n'.format(round((value/len(self.smoker)*100), 2), key)
        print(text) 
        
    def stat_region(self):
        nom_attribute = {}
        for item in self.region:
            if item in nom_attribute:
                continue
            else:
                nom_attribute[item] = 0
        for item in self.region:
            if item in nom_attribute:
                nom_attribute[item] += 1
        
        text = ''
        for key, value in list(nom_attribute.items()):
            text += '{}% live in {}. \n'.format(round((value/len(self.region)*100), 2), key)
        print(text)
        
    def stat_children(self):
        nom_attribute = {}
        for item in sorted(self.children):
            if item in nom_attribute:
                continue
            else:
                nom_attribute[item] = 0
        for item in self.children:
            if item in nom_attribute:
                nom_attribute[item] += 1
        
        text = ''
        for key, value in list(nom_attribute.items()):
            text += '{}% have {} children. \n'.format(round((value/len(self.children)*100), 2), key)
        print(text)
        
    def charge_age(self):
        ordinal_values = {'Young Adults (less than 30)': [], 'Adults (from 30 to 60)': [], 'Seniors (older than 60)': []}
        for i in range(len(self.charges)):
            if self.age[i] < 30:
                ordinal_values['Young Adults (less than 30)'].append(self.charges[i])
            elif self.age[i] < 60:
                ordinal_values['Adults (from 30 to 60)'].append(self.charges[i])
            else:
                ordinal_values['Seniors (older than 60)'].append(self.charges[i])
        
        text = ''
        for key, value in list(ordinal_values.items()):
            text += '{} pay {}$ on average.\n'.format(key, round(sum(value)/len(value)))
        print(text)
        
    def charge_bmi(self):
        ordinal_values = {'Underweight (less than 18.4)': [], 'Normal (from 18.5 to 24.9)': [], 'Overweight (from 25 to 29.9)': [], 'Obese (from 30 to 34.9)': [], 'Extremely Obese (more than 35)': []}
        for i in range(len(self.charges)):
            if self.bmi[i] < 18.5:
                ordinal_values['Underweight (less than 18.4)'].append(self.charges[i])
            elif self.bmi[i] < 25:
                ordinal_values['Normal (from 18.5 to 24.9)'].append(self.charges[i])
            elif self.bmi[i] < 30:
                ordinal_values['Overweight (from 25 to 29.9)'].append(self.charges[i])
            elif self.bmi[i] < 35:
                ordinal_values['Obese (from 30 to 34.9)'].append(self.charges[i])
            else:
                ordinal_values['Extremely Obese (more than 35)'].append(self.charges[i])
        
        text = ''
        for key, value in list(ordinal_values.items()):
            text += '{} people pay {}$ on average.\n'.format(key, round(sum(value)/len(value)))
        print(text)   

    def charge_sex(self):
        nom_variables = {}
        for item in self.sex:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = []
        for i in range(len(self.charges)): 
            nom_variables[self.sex[i]].append(self.charges[i])
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += '{}s pay {}$ on average.\n'.format(key.title(), round(sum(value)/len(value)))
        print(text)
    
    def charge_children(self):
        nom_variables = {}
        for item in sorted(self.children):
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = []
        for i in range(len(self.charges)): 
            nom_variables[self.children[i]].append(self.charges[i])
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += 'People with {} children pay {}$ on average.\n'.format(key, round(sum(value)/len(value)))
        print(text)
    
    def charge_region(self):
        nom_variables = {}
        for item in self.region:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = []
        for i in range(len(self.charges)): 
            nom_variables[self.region[i]].append(self.charges[i])
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += 'People from the {} pay {}$ on average.\n'.format(key.title(), round(sum(value)/len(value)))
        print(text)
    
    def charge_smoker(self):
        nom_variables = {}
        for item in self.smoker:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = []
        for i in range(len(self.charges)): 
            nom_variables[self.smoker[i]].append(self.charges[i])
        
        text = ''
        for key, value in list(nom_variables.items()):
            if key == 'yes':
                text += 'Smokers pay {}$ on average.\n'.format(round(sum(value)/len(value)))
            else:
                text += 'Non smokers pay {}$ on average.\n'.format(round(sum(value)/len(value)))
        print(text)
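
The `charge_*` methods all follow the same pattern: group the charges by a label, then average each group. A minimal sketch of that pattern as one reusable function is shown below. The name `average_by_group` and the toy data are hypothetical. Because a group is only created when a label is observed, this version never divides by the length of an empty group, which the fixed buckets in `charge_age` and `charge_bmi` would do if a bucket received no patients.

```python
from collections import defaultdict

def average_by_group(groups, values):
    """Return {group label: mean of the values that share that label}."""
    buckets = defaultdict(list)
    for group, value in zip(groups, values):
        buckets[group].append(value)
    return {group: sum(items) / len(items) for group, items in buckets.items()}

# Hypothetical toy data, not taken from insurance.csv.
toy_smoker = ["yes", "no", "no", "yes"]
toy_charges = [30000.0, 8000.0, 9000.0, 34000.0]
print(average_by_group(toy_smoker, toy_charges))  # {'yes': 32000.0, 'no': 8500.0}
```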

First I run the first batch of methods. The sample appears to be evenly distributed in terms of sex and region. Most of the patients seem to be adults between 25 and 55 years old (68% rule, assuming normality). Most of them also seem to be overweight or obese, with a Body Mass Index between 24 and 36. Close to half of them do not have any children, while most of those who do have one or two. One in five report being smokers. Lastly, the amounts paid for insurance seem to be skewed to the right: although half of the population pays less than 9400\$, the other half pays increasingly larger amounts, reaching well past the 25000\$ mark.

In [19]:
insurance_data = InsuranceData(age, sex, bmi, children, smoker, region, charges)
insurance_data.stat_age()
insurance_data.stat_sex()
insurance_data.stat_bmi()
insurance_data.stat_children()
insurance_data.stat_smoker()
insurance_data.stat_region()
insurance_data.stat_charges()

The median age is 39.0. 
The average age is 39.21. 
The age standard deviation is 14.04. 

49.48% are female. 
50.52% are male. 

The median BMI is 30.4. 
The average BMI is 30.66. 
The BMI standard deviation is 6.1.

42.9% have 0 children. 
24.22% have 1 children. 
17.94% have 2 children. 
11.73% have 3 children. 
1.87% have 4 children. 
1.35% have 5 children. 

20.48% said yes. 
79.52% said no. 

24.29% live in southwest. 
27.2% live in southeast. 
24.29% live in northwest. 
24.22% live in northeast. 

The median charge is 9382.03$. 
The average charge is 13270.42$. 
The charge standard deviation is 12105.48$.
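
The ranges quoted for age and BMI come from taking one standard deviation on either side of the mean, and the right skew of the charges shows up as a mean well above the median. A short sketch of that arithmetic, using the figures printed above (the helper name `one_sigma_interval` is hypothetical):

```python
def one_sigma_interval(mean, stdv):
    """Return (mean - stdv, mean + stdv), rounded to 2 decimals."""
    return (round(mean - stdv, 2), round(mean + stdv, 2))

# Means and standard deviations copied from the printed summary.
print(one_sigma_interval(39.21, 14.04))  # age: (25.17, 53.25)
print(one_sigma_interval(30.66, 6.1))    # BMI: (24.56, 36.76)

# Mean minus median of the charges; a large positive gap suggests right skew.
print(round(13270.42 - 9382.03, 2))      # 3888.39
```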


Next I run the second batch. We can anticipate linear relationships with several of these variables. Insurance costs seem to be affected the most by smoking, with smokers paying almost four times as much as non smokers. Other variables that significantly affect insurance fees are age and weight. Senior patients pay more than twice the amount that young patients pay. Similarly, obese individuals pay roughly 40% to 60% more than individuals with normal weight. Sex and children seem to affect costs only slightly: males and people with children tend to pay a bit more. I am ignoring the data on people with 4 or 5 children, since these groups cover very few people in the sample and therefore have a large margin of error. We do not have enough context to draw inferences about location, and the differences between regions are not substantial in any case.
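
The comparisons in this paragraph can be checked with simple ratios of the group averages that the following cell prints. This is only a quick arithmetic sketch; the helper name `ratio` is hypothetical.

```python
def ratio(group_average, baseline_average):
    """Return how many times larger one group average is than another."""
    return round(group_average / baseline_average, 2)

# Group averages copied from the printed output of the charge_* methods.
print(ratio(32050, 8434))   # smokers vs non smokers: 3.8
print(ratio(21248, 9182))   # seniors vs young adults: 2.31
print(ratio(14420, 10409))  # obese vs normal weight: 1.39
print(ratio(16954, 10409))  # extremely obese vs normal weight: 1.63
```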

In [17]:
insurance_data.charge_age()
insurance_data.charge_sex()
insurance_data.charge_bmi()
insurance_data.charge_children()
insurance_data.charge_smoker()
insurance_data.charge_region()

Young Adults (less than 30) pay 9182$ on average.
Adults (from 30 to 60) pay 14256$ on average.
Seniors (older than 60) pay 21248$ on average.

Females pay 12570$ on average.
Males pay 13957$ on average.

Underweight (less than 18.4) people pay 8852$ on average.
Normal (from 18.5 to 24.9) people pay 10409$ on average.
Overweight (from 25 to 29.9) people pay 10988$ on average.
Obese (from 30 to 34.9) people pay 14420$ on average.
Extremely Obese (more than 35) people pay 16954$ on average.

People with 0 children pay 12366$ on average.
People with 1 children pay 12731$ on average.
People with 2 children pay 15074$ on average.
People with 3 children pay 15355$ on average.
People with 4 children pay 13851$ on average.
People with 5 children pay 8786$ on average.

Smokers pay 32050$ on average.
Non smokers pay 8434$ on average.

People from the Southwest pay 12347$ on average.
People from the Southeast pay 14735$ on average.
People from the Northwest pay 12418$ on average.
People from the N