# U.S. Medical Insurance Costs

In this project I apply my knowledge of Python functions and classes to analyze real data on US medical insurance costs. It is meant as a stepping stone in my development as a professional data scientist.

For this project I will be working with a CSV file containing data on a number of US patients. First I define a function which gets the column names of the CSV file and turns them into a Python list.

In [3]:
def column_names_list(csv_reader):
    # Only the first row is needed: its keys are the column names.
    # An empty reader yields an empty list.
    first_row = next(iter(csv_reader), {})
    return list(first_row.keys())
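
As a cross-check on the function above, `csv.DictReader` also exposes the header row through its `fieldnames` attribute. The sketch below is only an illustration: `sample_csv` is a made-up in-memory stand-in for the real file, with one row built from the example values shown later in this notebook.

```python
import csv
import io

# Hypothetical in-memory sample standing in for insurance.csv.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
)

reader = csv.DictReader(sample_csv)
# fieldnames reads the header row on first access.
header = list(reader.fieldnames)
print(header)
```

If the two approaches agree on the real file, the function is doing what I expect.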

I use my `column_names_list` function to print out the column names. Now I know what kind of variables are contained in the data set.

In [4]:
import csv

with open(r'C:\Users\dvale\Desktop\insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    insurance_column_names_list = column_names_list(insurance_reader)
    for name in insurance_column_names_list:
        print(name)

age
sex
bmi
children
smoker
region
charges


I store the values of each column as individual lists.

In [5]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

with open(r'C:\Users\dvale\Desktop\insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])

I also create a dictionary containing all the information in the data set.

In [6]:
insurance_dict = {}

with open(r'C:\Users\dvale\Desktop\insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    insurance_column_names_list = column_names_list(insurance_reader)
    for name in insurance_column_names_list:
        insurance_dict[name] = []

insurance_dict['age'] = age
insurance_dict['sex'] = sex
insurance_dict['bmi'] = bmi
insurance_dict['children'] = children
insurance_dict['smoker'] = smoker
insurance_dict['region'] = region
insurance_dict['charges'] = charges
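
The same dictionary of columns could probably be built in a single pass over the file. The following is a minimal sketch under that assumption, using a hypothetical in-memory sample (`sample_csv`, with illustrative values) instead of the real file; the name `columns` is also just for this example.

```python
import csv
import io

# Hypothetical sample standing in for insurance.csv (values are illustrative).
sample_csv = io.StringIO(
    "age,sex,smoker\n"
    "19,female,yes\n"
    "40,male,no\n"
)

reader = csv.DictReader(sample_csv)
# One empty list per column, keyed by the header names.
columns = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        columns[name].append(value)

print(columns)
```

This keeps the column names and the stored values in sync without listing each column by hand.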

Once I have my dictionary, I can check that every column holds the same number of values, a first sign that nothing is missing. I can also check which of my variables are categorical and which are continuous.

In [7]:
for key, value in insurance_dict.items():
    print('Name of value: {}, Num of values: {}, Value example: {}, Python class type: {}'.format(key, len(value), value[0], type(value[0])))

Name of value: age, Num of values: 1338, Value example: 19, Python class type: <class 'str'>
Name of value: sex, Num of values: 1338, Value example: female, Python class type: <class 'str'>
Name of value: bmi, Num of values: 1338, Value example: 27.9, Python class type: <class 'str'>
Name of value: children, Num of values: 1338, Value example: 0, Python class type: <class 'str'>
Name of value: smoker, Num of values: 1338, Value example: yes, Python class type: <class 'str'>
Name of value: region, Num of values: 1338, Value example: southwest, Python class type: <class 'str'>
Name of value: charges, Num of values: 1338, Value example: 16884.924, Python class type: <class 'str'>
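
Equal lengths alone do not rule out blank cells, because `csv.DictReader` stores an empty cell as an empty string and still counts it. A stricter check might look like the sketch below; `count_missing` and the `example` data are hypothetical names and values introduced only for illustration.

```python
def count_missing(columns):
    """Return, per column, how many entries are None, empty or whitespace only."""
    return {
        name: sum(1 for value in values if value is None or str(value).strip() == "")
        for name, values in columns.items()
    }

# Hypothetical data with one blank entry in 'bmi'.
example = {"age": ["19", "40"], "bmi": ["27.9", ""]}
print(count_missing(example))
```

Calling `count_missing(insurance_dict)` should return zero for every column if the data set is complete.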


Since my continuous variables have `class 'str'`, I need to convert them to `int` and `float`.

In [9]:
updated_insurance_dict = {}
for key in insurance_dict.keys():
    updated_insurance_dict[key] = []

for key, value in insurance_dict.items():
    # Convert the whole column at once so a failure partway through
    # cannot leave partially converted values behind.
    try:
        updated_insurance_dict[key] = [int(item) for item in value]
    except ValueError:
        try:
            updated_insurance_dict[key] = [float(item) for item in value]
        except ValueError:
            updated_insurance_dict[key] = list(value)

insurance_dict = updated_insurance_dict

age = insurance_dict['age']
bmi = insurance_dict['bmi']
children = insurance_dict['children']
charges = insurance_dict['charges']
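
After the conversion, the split between numeric and categorical columns can be read off the element types. The sketch below assumes data in the same shape as `insurance_dict`; `split_by_kind` and the `example` values are hypothetical. A numeric type does not by itself make a variable continuous: `children` is stored as `int` but is better summarized as counts.

```python
def split_by_kind(columns):
    """Group column names into numeric and categorical by element type."""
    numeric, categorical = [], []
    for name, values in columns.items():
        if all(isinstance(value, (int, float)) for value in values):
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical

# Hypothetical converted data in the same shape as insurance_dict.
example = {"age": [19, 40], "bmi": [27.9, 30.0], "smoker": ["yes", "no"]}
numeric, categorical = split_by_kind(example)
print(numeric, categorical)
```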

Now I am ready to define a special class with specific methods to aid me in my analysis.

In [72]:
class InsuranceData:
    def __init__(self, age, sex, bmi, children, smoker, region, charges):
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region
        self.charges = charges
    
    def stat_age(self):
        mean = sum(self.age) / len(self.age)
        mean_round = round(mean, 2)
                
        sum_squares = 0
        for item in self.age:
            sum_squares += (mean - item) ** 2
        stdv = (sum_squares / len(self.age)) ** 0.5
        stdv_round = round(stdv, 2)
    
        sorted_list = sorted(self.age)
        median = 0
        if len(self.age) % 2 == 0:
            median = (sorted_list[int(len(self.age)/2)] + sorted_list[int((len(self.age)/2)-1)]) / 2
        else:
            median = sorted_list[int(len(self.age)/2)]
        median_round = round(median, 2)
        
        print('The median age is {}, the average age is {} and the age standard deviation is {}.'.format(median_round, mean_round, stdv_round))
    
    def stat_bmi(self):
        mean = sum(self.bmi) / len(self.bmi)
        mean_round = round(mean, 2)
                
        sum_squares = 0
        for item in self.bmi:
            sum_squares += (mean - item) ** 2
        stdv = (sum_squares / len(self.bmi)) ** 0.5
        stdv_round = round(stdv, 2)
    
        sorted_list = sorted(self.bmi)
        median = 0
        if len(self.bmi) % 2 == 0:
            median = (sorted_list[int(len(self.bmi)/2)] + sorted_list[int((len(self.bmi)/2)-1)]) / 2
        else:
            median = sorted_list[int(len(self.bmi)/2)]
        median_round = round(median, 2)
        
        print('The median BMI is {}, the average BMI is {} and the BMI standard deviation is {}.'.format(median_round, mean_round, stdv_round))
    
    def stat_charges(self):
        mean = sum(self.charges) / len(self.charges)
        mean_round = round(mean, 2)
                
        sum_squares = 0
        for item in self.charges:
            sum_squares += (mean - item) ** 2
        stdv = (sum_squares / len(self.charges)) ** 0.5
        stdv_round = round(stdv, 2)
    
        sorted_list = sorted(self.charges)
        median = 0
        if len(self.charges) % 2 == 0:
            median = (sorted_list[int(len(self.charges)/2)] + sorted_list[int((len(self.charges)/2)-1)]) / 2
        else:
            median = sorted_list[int(len(self.charges)/2)]
        median_round = round(median, 2)

        print('The median charge is {}$, the average charge is {}$ and the charge standard deviation is {}$.'.format(median_round, mean_round, stdv_round))
        
    def stat_sex(self):
        nom_variables = {}
        for item in self.sex:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = 0
        for item in self.sex:
            if item in nom_variables:
                nom_variables[item] += 1
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += '{}% are {} '.format(round((value/len(self.sex)*100), 2), key)
        print(text) 
    
    def stat_smoker(self):
        nom_variables = {}
        for item in self.smoker:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = 0
        for item in self.smoker:
            if item in nom_variables:
                nom_variables[item] += 1
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += '{}% said {} '.format(round((value/len(self.smoker)*100), 2), key)
        print(text) 
        
    def stat_region(self):
        nom_variables = {}
        for item in self.region:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = 0
        for item in self.region:
            if item in nom_variables:
                nom_variables[item] += 1
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += '{}% live in {} '.format(round((value/len(self.region)*100), 2), key)
        print(text)
        
    def stat_children(self):
        nom_variables = {}
        for item in self.children:
            if item in nom_variables:
                continue
            else:
                nom_variables[item] = 0
        for item in self.children:
            if item in nom_variables:
                nom_variables[item] += 1
        
        text = ''
        for key, value in list(nom_variables.items()):
            text += '{}% have {} '.format(round((value/len(self.children)*100), 2), key)
        print(text)
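
The four categorical methods above share the same counting logic. As a sketch of how that logic might be factored out, `collections.Counter` can do the tallying; `category_percentages` and the `example` list are hypothetical and not part of the class.

```python
from collections import Counter

def category_percentages(values):
    """Return the share of each distinct value as a percentage, rounded to 2 places."""
    counts = Counter(values)
    total = len(values)
    return {key: round(count / total * 100, 2) for key, count in counts.items()}

# Hypothetical sample, not taken from the data set.
example = ["yes", "no", "no", "no"]
print(category_percentages(example))
```

Each method could then format the returned dictionary with its own wording ("said", "live in", and so on).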
        
    

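Finally, I create an `InsuranceData` instance from my column lists and call each method to summarize the data set.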

In [73]:
insurance_data = InsuranceData(age, sex, bmi, children, smoker, region, charges)
insurance_data.stat_age()
insurance_data.stat_sex()
insurance_data.stat_bmi()
insurance_data.stat_children()
insurance_data.stat_smoker()
insurance_data.stat_region()
insurance_data.stat_charges()

The median age is 39.0, the average age is 39.21 and the age standard deviation is 14.04.
49.48% are female 50.52% are male 
The median BMI is 30.0, the average BMI is 30.17 and the BMI standard deviation is 6.12.
42.9% have 0 24.22% have 1 11.73% have 3 17.94% have 2 1.35% have 5 1.87% have 4 
20.48% said yes 79.52% said no 
24.29% live in southwest 27.2% live in southeast 24.29% live in northwest 24.22% live in northeast 
The median charge is 9381.5$, the average charge is 13269.93$ and the charge standard deviation is 12105.49$.
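
The hand-written statistics can be cross-checked against Python's `statistics` module. The sketch below uses a small hypothetical `sample` rather than the real columns. The class divides the sum of squares by `n`, which corresponds to `statistics.pstdev` (population standard deviation); `statistics.stdev` divides by `n - 1` and would give slightly larger values.

```python
import statistics

# Hypothetical small sample; the real columns would be passed in the same way.
sample = [19, 40, 33, 28]

mean = statistics.mean(sample)
median = statistics.median(sample)
# pstdev divides by n, matching the formula used in the class.
stdv = statistics.pstdev(sample)

print(round(median, 2), round(mean, 2), round(stdv, 2))
```

Running the same three calls on `age`, `bmi` and `charges` should reproduce the figures printed above, up to rounding.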
