# U.S. Medical Insurance Costs

In this project, I analyse data on U.S. medical insurance costs. The project is also a way for me to put my Python skills into practice: I will be showcasing my knowledge of lists, loops, functions, dictionaries and so on. We are told that the data contains no missing values, so no cleaning is needed. Let's begin by looking at the dimensions of the dataset and the variables it contains.

In [1]:
import pandas as pd

In [2]:
medical_data = pd.read_csv('insurance.csv')
print(medical_data)

      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]


#### The data consists of 7 variables:
- Age
- Sex
- BMI
- Children
- Smoker
- Region
- Medical charges

Sex, Smoker and Region are categorical variables. Whilst we are not building models in this project, it's worth noting that the dataset would need to be modified further to accommodate them (i.e. (k-1) columns of 0's and 1's assigning each observation to one of the k categories). Children and Age are count variables. Finally, charges and bmi are continuous variables.
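
As a rough illustration of the (k-1) dummy-column idea mentioned above, here is a minimal sketch in plain Python. The helper name `dummy_encode` and the three sample rows are assumptions made for the example; they are not part of the analysis that follows.

```python
def dummy_encode(rows, column):
    """Return copies of rows with `column` replaced by k-1 indicator columns.

    The first category in sorted order acts as the baseline and gets no column.
    """
    categories = sorted({row[column] for row in rows})
    baseline, *others = categories
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in others:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows, one dictionary per observation.
sample = [
    {"age": "19", "region": "southwest"},
    {"age": "18", "region": "southeast"},
    {"age": "33", "region": "northwest"},
]
encoded = dummy_encode(sample, "region")
print(encoded)
```

With three regions in the sample, only two indicator columns are produced; a row with zeros in both belongs to the baseline category.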

Furthermore, we can see that the data contains 1338 observations. This is a relatively small sample size, so any conclusions drawn here should be treated as tentative. We now need each observation as a dictionary so that basic analysis can be conducted with plain Python. To do this, we re-read the file with csv.DictReader, which replaces the DataFrame stored in medical_data with a list of dictionaries.

In [3]:
import csv as c

In [4]:
with open('insurance.csv', newline='') as insurance_csv:
    reader = c.DictReader(insurance_csv)
    medical_data = []
    for i in reader:
        medical_data.append(i)
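
One detail worth noting: `csv.DictReader` returns every field as a string, which is why the calculations in this project wrap values in `int()` or `float()`. The sketch below shows this on a two-row in-memory CSV (the first two rows printed earlier) and adds one simple way to check the "no missing elements" claim. Using `io.StringIO` in place of the file is an assumption made so the example runs without `insurance.csv`.

```python
import csv
import io

# In-memory stand-in for insurance.csv (first two rows of the dataset).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

rows = list(csv.DictReader(sample_csv))

# Every value is a string until it is converted explicitly.
value_types = {type(value) for row in rows for value in row.values()}

# Count empty fields as a basic missing-data check.
missing = sum(1 for row in rows for value in row.values() if value == "")

print(value_types, missing)
```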

Let's begin by seeing the average values for age, bmi, number of children, and medical charges. 

In [7]:
age_total = 0 
for i in medical_data:
    age_total += int(i['age'])
average_age = age_total / len(medical_data)
average_age

39.20702541106129

In [8]:
bmi_total = 0 
for i in medical_data:
    bmi_total += float(i['bmi'])
average_bmi = bmi_total / len(medical_data)
average_bmi

30.663396860986538

In [9]:
children_total = 0 
for i in medical_data:
    children_total += int(i['children'])
average_children = children_total / len(medical_data)
average_children

1.0949177877429

In [10]:
charges_total = 0 
for i in medical_data:
    charges_total += float(i['charges'])
average_charges = charges_total / len(medical_data)
average_charges

13270.422265141257
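
The four loops above share the same pattern, so their results can be cross-checked with the standard library. This is only a sketch: `column_mean` is a hypothetical helper, and the three rows are sample values rather than the full dataset.

```python
from statistics import fmean


def column_mean(rows, column):
    """Average a numeric column stored as strings in a list of dictionaries."""
    return fmean(float(row[column]) for row in rows)


# Hypothetical sample in the same shape as medical_data.
sample = [
    {"age": "19", "bmi": "27.9"},
    {"age": "18", "bmi": "33.77"},
    {"age": "28", "bmi": "33.0"},
]
mean_age = column_mean(sample, "age")
mean_bmi = column_mean(sample, "bmi")
print(mean_age, mean_bmi)
```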

There are a few key conclusions we can draw from these averages:
- On average, our sample has a BMI of 30.66 (to 2 d.p.). Considering that the healthy BMI range is 18.5 to 24.9 and that 30 or above is classed as obese, this suggests that the average person in the sample is well above a healthy weight. This could have an effect on medical costs, as excess weight puts stress on your health. This is something that could be explored with linear regression (but won't be a part of this project).
- The average person has 1 child (the statistic showing a value slightly greater than 1). Having children could have an impact on medical charges, as some people become more cautious when they're a parent, which might lead to lower medical charges for accidents. This is something we can look at by comparing the average medical charges for people with and without children.
- The average age is 39 (to 2 s.f.), which doesn't tell us a lot without looking at skewness and variance to know how the data is distributed (not to be done in this project but something to think about).
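
For the skewness and variance point in the last bullet, a minimal sketch of how these could be computed with the standard library follows. The `skewness` helper is an assumption for illustration, and `ages` holds only the ten ages visible in the printed preview, not the full dataset, so the numbers it produces say nothing about the real distribution.

```python
from statistics import fmean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean(((value - mu) / sigma) ** 3 for value in values)


# Ages from the ten rows shown in the DataFrame preview (illustrative only).
ages = [19, 18, 28, 33, 32, 50, 18, 18, 21, 61]

spread = pstdev(ages)      # population standard deviation
age_skew = skewness(ages)  # positive means a longer tail towards older ages
print(spread, age_skew)
```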

Now, let's compare how the medical charges vary across different categorical variables. To begin with, let's compare the overall average medical charges with the averages for smokers and non-smokers.

In [11]:
count = 0 
total = 0 
for i in medical_data:
    if i['smoker'] == 'yes' : 
        count += 1 
        total += float(i['charges'])
average_charges_for_smokers = total / count
average_charges_for_smokers

32050.23183153285

In [12]:
count_non = 0 
total_non = 0 
for i in medical_data:
    if i['smoker'] == 'no' : 
        count_non += 1 
        total_non += float(i['charges'])
average_charges_for_non_smokers = total_non / count_non
average_charges_for_non_smokers

8434.268297856199

It is clear that smokers in this sample have dramatically higher medical charges on average: about 32,050 dollars compared with about 8,434 dollars for non-smokers, almost four times as much. This is unsurprising, as smokers often require more medical services for the consequences of smoking.
Now, let's conduct a similar analysis of females' and males' medical charges.

In [13]:
count_male = 0 
total_male = 0 
for i in medical_data:
    if i['sex'] == 'male' : 
        count_male += 1 
        total_male += float(i['charges'])
average_charges_for_male = total_male / count_male
average_charges_for_male

13956.751177721886

In [14]:
count_female = 0 
total_female = 0 
for i in medical_data:
    if i['sex'] == 'female' : 
        count_female += 1 
        total_female += float(i['charges'])
average_charges_for_female = total_female / count_female
average_charges_for_female

12569.57884383534

In [28]:
diff_sex = abs(average_charges_for_male - average_charges_for_female)
diff_sex

1387.1723338865468

On average, men have larger medical costs than women, with a difference of 1387.17 dollars (to 2 d.p.). I don't think this difference is enough to make any inferences from. Medical costs in America are very expensive, so a gap of this size, roughly 10% of the overall average charge, is small, and we have not tested whether it is statistically significant.
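
To put the gap in context, it can be expressed as a fraction of the overall average charge. The sketch below simply reuses the three averages printed earlier; `relative_diff` is a name introduced here for illustration.

```python
# Averages copied from the outputs above.
average_charges = 13270.422265141257
average_charges_for_male = 13956.751177721886
average_charges_for_female = 12569.57884383534

diff_sex = abs(average_charges_for_male - average_charges_for_female)

# Gap between the sexes as a fraction of the overall mean charge.
relative_diff = diff_sex / average_charges

print(f"{relative_diff:.1%}")
```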

Finally, let's see how medical charges vary depending on which region of the country the individuals live in.

In [16]:
count_ne = 0 
total_ne = 0 
for i in medical_data:
    if i['region'] == 'northeast' : 
        count_ne += 1 
        total_ne += float(i['charges'])
average_charges_for_ne = total_ne / count_ne
average_charges_for_ne

13406.3845163858

In [17]:
count_nw = 0 
total_nw = 0 
for i in medical_data:
    if i['region'] == 'northwest' : 
        count_nw += 1 
        total_nw += float(i['charges'])
average_charges_for_nw = total_nw / count_nw
average_charges_for_nw

12417.575373969228

In [18]:
count_se = 0 
total_se = 0 
for i in medical_data:
    if i['region'] == 'southeast' : 
        count_se += 1 
        total_se += float(i['charges'])
average_charges_for_se = total_se / count_se
average_charges_for_se

14735.411437609895

In [19]:
count_sw = 0 
total_sw = 0 
for i in medical_data:
    if i['region'] == 'southwest' : 
        count_sw += 1 
        total_sw += float(i['charges'])
average_charges_for_sw = total_sw / count_sw
average_charges_for_sw

12346.93737729231

Whilst there are variations in average medical charges across the regions, none of them stands out as drastic in comparison to the others. The southeast has the highest average (about 14,735 dollars) and the southwest the lowest (about 12,347 dollars). The small variations seen could be due to some areas of the country charging slightly more for the same services than other areas, but we have no information to support that.

All of the previous calculations have been done by constructing separate loops for each average. This can be simplified by creating a function. Furthermore, I'm going to extend the function so that we can also choose which numerical variable we want the average of (so not just medical charges). This will allow for greater depth in our analysis. This is done as follows:

In [20]:
def average(cat_variable_being_split, chosen_sub_section, variable_to_average):
    count = 0 
    total = 0 
    for i in medical_data:
        if i[cat_variable_being_split] == chosen_sub_section : 
            count += 1 
            total += float(i[variable_to_average])
    return total / count

This function takes 3 inputs:
- The categorical variable we want to split on (cat_variable_being_split)
- The category within that variable we want to isolate (chosen_sub_section)
- The variable we want to average over the observations that meet the criterion (variable_to_average)
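
One edge case the description above does not cover: if no row matches the chosen category (for example, a misspelled value), the count stays at 0 and the division raises `ZeroDivisionError`. A possible variant is sketched below. The name `average_from`, the explicit `rows` argument and the `ValueError` are assumptions for illustration, not part of the function used in this project.

```python
def average_from(rows, cat_variable, category, variable_to_average):
    """Mean of variable_to_average over rows where cat_variable equals category."""
    values = [
        float(row[variable_to_average])
        for row in rows
        if row[cat_variable] == category
    ]
    if not values:
        # Fail with a descriptive message instead of dividing by zero.
        raise ValueError(f"no rows where {cat_variable!r} == {category!r}")
    return sum(values) / len(values)


# Hypothetical sample rows with round numbers.
sample = [
    {"sex": "female", "charges": "10.0"},
    {"sex": "male", "charges": "20.0"},
    {"sex": "female", "charges": "30.0"},
]
female_mean = average_from(sample, "sex", "female", "charges")
print(female_mean)
```

Passing the rows in explicitly also means the function no longer depends on a global variable, which makes it easier to try out on small samples.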

Now let's test it out!

In [21]:
average('sex', 'female', 'charges')

12569.57884383534

We got the same result as before! Now we can use this function to see not only how medical charges vary between groups, but also bmi, age, and so on.
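
The `average` function still scans the data once per category. As an alternative sketch, the averages for every category of a variable can be collected in a single pass with `collections.defaultdict`. The name `group_averages` and the sample rows are hypothetical and use round numbers so the result is easy to verify by hand.

```python
from collections import defaultdict


def group_averages(rows, group_column, value_column):
    """Return {category: mean of value_column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[group_column]
        totals[key] += float(row[value_column])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"region": "northeast", "charges": "100.0"},
    {"region": "northeast", "charges": "300.0"},
    {"region": "southwest", "charges": "50.0"},
]
by_region = group_averages(sample, "region", "charges")
print(by_region)
```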

I now want to see whether the sex of a person affects their bmi. So let's calculate the average bmi for males and females. But before conducting this analysis, we should see what the mean age is for both female and male data. This is because if the mean age is significantly higher for one sex, that could give misleading results. 

In [22]:
average('sex','male','age')

38.917159763313606

In [23]:
average('sex','female','age')

39.503021148036254

So the average age for males is 39 (to 2 s.f.) whereas for females it is 40 (to 2 s.f.); unrounded, the two differ by under 0.6 years. Since the average ages are so similar, age shouldn't skew the results when comparing the two sexes. Thus, we get the mean bmi values for females and males as follows:

In [24]:
average('sex','male','bmi')

30.943128698224832

In [25]:
average('sex','female','bmi')

30.377749244713023

So we can see that the difference between the average bmi for males and females is small (about 0.57). This suggests that, in this sample, sex has little bearing on average bmi.

Now let's see whether being a smoker affects someone's bmi.

In [26]:
average('smoker','yes','bmi')

30.708448905109503

In [27]:
average('smoker','no','bmi')

30.651795112781922

As with sex, being a smoker makes almost no difference to average bmi (30.71 against 30.65). This suggests that the factors affecting bmi can't be identified from simple group averages like these.

For my next portion of analysis, I will see how the number of children relates to medical costs. To do this, we will create a loop that goes from 0 to the maximum number of children, outputting the average medical costs for each number of children.

In [43]:
children = []
for i in medical_data:
    children.append(int(i['children']))

In [45]:
max(children)

5

In [47]:
average_dict = {}
for i in range(0, max(children)+1):
    average_dict[i] = average('children',str(i),'charges')
    
average_dict
    

{0: 12365.975601635882,
 1: 12731.171831635793,
 2: 15073.563733958328,
 3: 15355.31836681528,
 4: 13850.656311199999,
 5: 8786.035247222222}

Before I infer anything from these averages, we need to see how many observations there are for each number of children so that we aren't making unrealistic conclusions.

In [51]:
count_dict = {}
for i in range(0, max(children)+1):
    count = 0 
    for j in children:
        if i ==  j:
            count+= 1
    count_dict[i] = count
        
count_dict

{0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18}
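
The nested loop above can also be written with `collections.Counter`. The sketch below uses only the children values from the ten rows shown in the DataFrame preview, so its counts differ from the full-dataset counts printed above.

```python
from collections import Counter

# Children column of the ten previewed rows (illustrative subset only).
children_sample = [0, 1, 3, 0, 0, 3, 0, 0, 0, 0]

# Counter tallies each distinct value; sorting keeps the keys in numeric order.
counts = dict(sorted(Counter(children_sample).items()))
print(counts)
```

Unlike the explicit loop over `range(0, max(children)+1)`, `Counter` only reports values that actually occur, so a number of children with no observations would be absent rather than listed with a count of 0.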

We can clearly see there are very few observations for individuals with 4 or 5 children. We therefore should not draw too much from these averages, as they don't have enough data backing them to truly represent the population average.

Ignoring the averages for individuals with 4 or 5 children, we can see a clear upward trend in the average medical costs: as the number of children increases, so do the average medical costs. Furthermore, there is a clear jump of over 2000 dollars from 1 to 2 children. I don't know what the reasoning behind this is, but it's an interesting observation.
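
One way to check how even the increase is would be to look at the change in average charges for each additional child. The sketch below reuses the averages for 0 to 3 children printed earlier; `steps` is a name introduced here for illustration.

```python
# Averages for 0 to 3 children, copied from average_dict above.
averages = [
    12365.975601635882,
    12731.171831635793,
    15073.563733958328,
    15355.31836681528,
]

# Change in average charges for each additional child.
steps = [later - earlier for earlier, later in zip(averages, averages[1:])]
print([round(step, 2) for step in steps])
```

The three steps are far from equal: the increase from 1 to 2 children is several times larger than the steps on either side of it.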

To conclude this project, I have showcased my understanding of the fundamentals of Python 3. I have used lists, loops, functions and dictionaries to carry out basic calculations, which I have then analysed to come to conclusions. With further Python learning, this analysis could lead into data modelling as well as some hypothesis testing to get statistically backed inferences.