# U.S. Medical Insurance Costs

## Introduction

Using data from "Machine Learning With R" by Brett Lantz [via Kaggle](https://www.kaggle.com/datasets/mirichoi0218/insurance), we have been tasked with using Python to organize the data and perform some descriptive analysis. 

We have a .csv file containing the insurance costs and associated anonymized demographic data for 1338 people. The seven columns contain no missing values and hold a mix of numerical and categorical data.

## Set up:

The first step is importing Python's built-in `csv` module.

In [260]:
import csv

Now we can read the .csv file. It is organized into these columns: age, sex, bmi, children, smoker, region, and charges.

Since we're obviously going to need to interact with these columns in our program, let's save those values into lists with the same names.

In [261]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []
with open("insurance.csv", newline= '') as ins_csv:
    dataset=csv.DictReader(ins_csv, fieldnames = ("age", "sex", "bmi", "children", "smoker", "region", "charges"))
    for row in dataset:
        age.append(row["age"])
        sex.append(row["sex"])
        bmi.append(row["bmi"])
        children.append(row["children"])
        smoker.append(row["smoker"])
        region.append(row["region"])
        charges.append(row["charges"])
#Let's print out one of the lists just to see what it looks like. 
print(children)

['children', '0', '1', '3', '0', '0', '0', '1', '3', '2', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '1', '0', '1', '2', '3', '0', '2', '1', '2', '0', '0', '5', '0', '1', '0', '3', '0', '1', '0', '0', '2', '1', '2', '1', '0', '2', '0', '0', '1', '0', '2', '1', '0', '3', '2', '2', '2', '1', '2', '3', '4', '1', '1', '0', '0', '2', '1', '0', '3', '0', '5', '3', '1', '2', '0', '1', '0', '0', '0', '1', '0', '1', '4', '2', '2', '0', '0', '0', '0', '0', '1', '3', '2', '2', '1', '3', '0', '0', '0', '0', '0', '0', '3', '1', '1', '1', '2', '0', '0', '1', '2', '0', '0', '3', '0', '0', '1', '0', '2', '2', '0', '0', '1', '3', '0', '0', '0', '2', '2', '0', '0', '2', '0', '0', '0', '0', '0', '3', '0', '2', '1', '2', '2', '3', '3', '3', '1', '1', '1', '1', '0', '3', '0', '1', '0', '0', '0', '0', '3', '0', '0', '1', '2', '0', '4', '5', '3', '1', '3', '0', '0', '0', '1', '0', '0', '2', '1', '2', '3', '0', '0', '3', '0', '2', '3', '2', '3', '1', '2', '0', '0', '0', '1', '0', '0', '0', '2

Two problems are apparent here. First, the column names are included in the lists at index 0, because passing `fieldnames` to `DictReader` makes it treat the header row as ordinary data. Second, the numerical values are being saved as strings.

Let's fix those issues by cleaning the lists of their column names at index 0 and converting them to their proper data types (integers or floats).

In [262]:
#shaving off the 0 index is the easier part:
age1 = age[1:]
sex1 = sex[1:]
bmi1 = bmi[1:]
children1 = children[1:]
smoker1 = smoker[1:]
region1 = region[1:]
charges1 = charges[1:]

#Now let's make two functions that convert the lists to their proper data types.
#Both functions convert the list in place and return that same list.
#bmi1 and charges1 need to be floats; children1 and age1 need to be integers.

#smoker1, sex1, and region1 won't get converted because they store categorical data as strings, which is good!

def make_float(lst):
    for i in range(0, len(lst)):
        lst[i] = float(lst[i])
    return lst

def make_int(lst):
    for i in range(0, len(lst)):
        lst[i] = int(lst[i])
    return lst


In [263]:
bmi2 = make_float(bmi1)
charges2 = make_float(charges1)

children2 = make_int(children1)
age2 = make_int(age1)
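
As an alternative to the two-step cleanup above, the header handling and type conversion can happen while the file is read. The following is a minimal sketch, not the approach used in this notebook: `load_columns`, `CONVERTERS`, and the sample rows are hypothetical names and made-up values for illustration. It relies on the fact that `csv.DictReader` uses the first row as keys when `fieldnames` is omitted.

```python
import csv
import io

# Made-up sample rows in the same column layout as insurance.csv (illustration only).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northwest,3200.50
47,male,31.2,2,yes,southeast,41000.75
"""

# Converter to apply to each numeric column; every other column stays a string.
CONVERTERS = {"age": int, "bmi": float, "children": int, "charges": float}

def load_columns(file_obj):
    """Return a dict mapping each column name to a list of typed values."""
    reader = csv.DictReader(file_obj)  # no fieldnames: the header row becomes the keys
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            convert = CONVERTERS.get(name, str)
            columns[name].append(convert(value))
    return columns

columns = load_columns(io.StringIO(SAMPLE_CSV))
print(columns["children"])  # [0, 2]
print(columns["charges"])   # [3200.5, 41000.75]
```

With a real file, the same function could be called inside `with open("insurance.csv", newline="") as ins_csv:`.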


## Analysis

Now we can begin analyzing the data. To start, let's get percentages for men/women, smokers/non-smokers, people with/without children, and people over/under 40. This will be important in seeing how representative our sample is. 

In [264]:
#This will help us create a series of percentages regardless of the datatype in the list. 
#We'll use *args to represent the variables we're creating percents of within the dataset.
def percent_maker(lst, *args):
    #the lists passed in have already had their column names removed, so len(lst) is the number of people
    for x in args:
        if isinstance(x, str):
            #checking if arg is categorical
            counter = lst.count(x)
            percent = round((counter/len(lst)) * 100, 2)
            print(f"{x}s in the dataset: {percent}%")
        elif isinstance(x, (int, float)):
            #checking if arg is numerical
            over = 0
            under = 0
            at = 0
            for i in lst:
                if i > x:
                    over += 1
                elif i == x:
                    at +=1
                elif i < x:
                    under +=1       
            over_percent = round((over/len(lst)) * 100, 2)
            under_percent = round((under/len(lst)) * 100, 2)
            at_percent = round((at/len(lst)) * 100, 2)
            print(f"Percent of people over {x}: {over_percent}%")
            print(f"Percent of people at {x}: {at_percent}%")
            print(f"Percent of people under {x}: {under_percent}%")
percent_maker(sex1, "male", "female")
percent_maker(smoker1, "yes", "no")
percent_maker(children2, 0)
percent_maker(age2, 40)

males in the dataset: 50.52%
females in the dataset: 49.48%
yess in the dataset: 20.48%
nos in the dataset: 79.52%
Percent of people over 0: 57.1%
Percent of people at 0: 42.9%
Percent of people under 0: 0.0%
Percent of people over 40: 47.61%
Percent of people at 40: 2.02%
Percent of people under 40: 50.37%
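
Because `percent_maker` only prints, its results can't be reused later. Here is a hedged sketch of a variant for the categorical case that returns the percentages instead. `category_percentages` and the sample list are hypothetical and made up for illustration.

```python
from collections import Counter

def category_percentages(values):
    """Return {category: percent of values}, rounded to 2 decimals."""
    counts = Counter(values)
    total = len(values)
    return {category: round(count / total * 100, 2)
            for category, count in counts.items()}

# Made-up sample, not taken from the dataset.
sample_smoker = ["yes", "no", "no", "no", "yes", "no", "no", "no"]
shares = category_percentages(sample_smoker)
print(shares)  # {'yes': 25.0, 'no': 75.0}
```

Returning a dictionary leaves the formatting of the printed message to the caller, which avoids labels like "yess" and "nos".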


The data seems pretty evenly distributed, except for the share of smokers. The CDC's figure for smokers in the U.S. in 2020 was 12.5%. Smokers make up 20.48% of our dataset, which is about 8 percentage points higher than the national figure.
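
The comparison with the CDC figure can be expressed two ways, and the arithmetic below shows both. The two inputs are the numbers quoted in the text; the variable names are only illustrative.

```python
dataset_smoker_pct = 20.48  # share of smokers printed by percent_maker above
cdc_smoker_pct = 12.5       # CDC figure quoted in the text

# Absolute gap, measured in percentage points.
point_gap = dataset_smoker_pct - cdc_smoker_pct

# Relative gap: how much larger the dataset's share is than the CDC's share.
relative_gap = (dataset_smoker_pct / cdc_smoker_pct - 1) * 100

print(f"Gap: {point_gap:.2f} percentage points")            # Gap: 7.98 percentage points
print(f"Relative difference: {relative_gap:.1f}% higher")   # Relative difference: 63.8% higher
```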

Now let's find the average insurance cost for our dataset. 

Since we're going to be finding a lot of averages today, we're going to write a general-purpose average calculator function that we can use and modify for the rest of our questions.

In [265]:
def avg_maker(lst):
    return round(sum(lst)/len(lst), 2)
print("The average insurance cost in our dataset is $" + str(avg_maker(charges2)))

The average insurance cost in our dataset is $13270.42
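
One limitation of a bare `sum(lst)/len(lst)` is that it raises `ZeroDivisionError` on an empty list. The sketch below shows one possible guard, using `statistics.fmean` from the standard library (Python 3.8+). The helper name `average` and the sample values are hypothetical.

```python
from statistics import fmean

def average(values, digits=2):
    """Return the rounded mean of values, or None if there are no values."""
    if not values:
        return None
    return round(fmean(values), digits)

# Made-up sample charges, not taken from the dataset.
sample_charges = [1000.0, 2500.5, 4000.25]
print(average(sample_charges))  # 2500.25
print(average([]))              # None
```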


Since we have data from different regions, let's see if there's a large difference in average insurance cost by region.

In [266]:
cost_by_region = list(zip(region1, charges2))
#print(cost_by_region)

#compare the region field directly; charges2 already holds floats, so no conversion is needed
southwest_cost = [cost for reg, cost in cost_by_region if reg == "southwest"]
southeast_cost = [cost for reg, cost in cost_by_region if reg == "southeast"]
northwest_cost = [cost for reg, cost in cost_by_region if reg == "northwest"]
northeast_cost = [cost for reg, cost in cost_by_region if reg == "northeast"]
#print(len(southwest_cost) + len(southeast_cost) + len(northwest_cost) + len(northeast_cost))
#^came out to 1338 so we know no records got left behind

avg_sw_cost = avg_maker(southwest_cost)
avg_se_cost = avg_maker(southeast_cost)
avg_nw_cost = avg_maker(northwest_cost)
avg_ne_cost = avg_maker(northeast_cost)

print("The average insurance cost in the southwest is $" + str(avg_sw_cost))
print("The average insurance cost in the southeast is $" + str(avg_se_cost))
print("The average insurance cost in the northwest is $" + str(avg_nw_cost))
print("The average insurance cost in the northeast is $" + str(avg_ne_cost))

The average insurance cost in the southwest is $12346.94
The average insurance cost in the southeast is $14735.41
The average insurance cost in the northwest is $12417.58
The average insurance cost in the northeast is $13406.38
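
The per-region filtering above repeats the same pattern four times. As a sketch of a more general approach, the averages for every group can be collected in a single pass. `average_by_group` and the sample lists are hypothetical and made up for illustration.

```python
from collections import defaultdict

def average_by_group(groups, values):
    """Return {group: rounded mean of the values paired with that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in zip(groups, values):
        totals[group] += value
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Made-up sample data, not taken from the dataset.
sample_region = ["southwest", "southeast", "southwest", "northeast"]
sample_charges = [1000.0, 3000.0, 2000.0, 5000.0]
print(average_by_group(sample_region, sample_charges))
# {'southwest': 1500.0, 'southeast': 3000.0, 'northeast': 5000.0}
```

The same helper would work for any categorical list paired with a numerical one, such as `region1` with `charges2`.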


Do smokers have higher BMIs? Even though we can't show causation with our data, we can still look for an association. Here we'll calculate the average BMI for smokers vs. the average BMI for non-smokers.

In [267]:
#our solution here will be similar to our last question
bmi_by_smoking = list(zip(bmi2, smoker1))

bmi_by_yes = [bmi for bmi, smokes in bmi_by_smoking if smokes == "yes"]
bmi_by_no = [bmi for bmi, smokes in bmi_by_smoking if smokes == "no"]

avg_smokers_bmi = avg_maker(bmi_by_yes)
avg_nonsmokers_bmi = avg_maker(bmi_by_no)

print(f"The average BMI for smokers is {avg_smokers_bmi}.")
print(f"The average BMI for non-smokers is {avg_nonsmokers_bmi}.")

The average BMI for smokers is 30.71.
The average BMI for non-smokers is 30.65.


Are there more female or male smokers in our dataset? And which sex of smoker has higher associated medical costs?
We'll start by counting the men and women who smoke, then find the average insurance cost for each group.

In [268]:
#zip sex, smoker, charges
sex_smoking_costs = list(zip(sex1, smoker1, charges2))
#print(sex_smoking_costs)
#compare the sex (index 0) and smoker (index 1) fields explicitly
female_smokers = [x for x in sex_smoking_costs if x[0] == "female" and x[1] == "yes"]
num_female_smokers = len(female_smokers)

male_smokers = [x for x in sex_smoking_costs if x[0] == "male" and x[1] == "yes"]
num_male_smokers = len(male_smokers)

print(f"There are {num_female_smokers} female smokers and {num_male_smokers} male smokers in the dataset.")

#now to find their average insurance costs...
avg_m_smoker_cost = avg_maker([x[2] for x in male_smokers])
avg_f_smoker_cost = avg_maker([x[2] for x in female_smokers])

print(f"The average insurance cost for female smokers is ${avg_f_smoker_cost}.")
print(f"The average insurance cost for male smokers is ${avg_m_smoker_cost}.")

There are 115 female smokers and 159 male smokers in the dataset.
The average insurance cost for female smokers is $30679.0.
The average insurance cost for male smokers is $33042.01.
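
Raw counts depend on how many men and women are in the sample, so a smoking rate within each sex can be a fairer comparison. The sketch below shows one way to compute it with `collections.Counter`. The sample lists and variable names are hypothetical and made up for illustration.

```python
from collections import Counter

# Made-up sample data, not taken from the dataset.
sample_sex = ["female", "male", "male", "female", "male", "female"]
sample_smoker = ["yes", "no", "yes", "no", "yes", "no"]

pair_counts = Counter(zip(sample_sex, sample_smoker))  # counts of (sex, smoker) pairs
sex_counts = Counter(sample_sex)                       # people of each sex

# Percent of each sex that smokes.
rates = {
    sex: round(pair_counts[(sex, "yes")] / total * 100, 2)
    for sex, total in sex_counts.items()
}

for sex in sorted(rates):
    print(f"{sex}: {pair_counts[(sex, 'yes')]} of {sex_counts[sex]} smoke ({rates[sex]}%)")
# female: 1 of 3 smoke (33.33%)
# male: 2 of 3 smoke (66.67%)
```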


### Top 134

Now I want to know some information about the people in our dataset with the highest medical costs. 
The top 10% of charges are the records above the 90th percentile. Ten percent of our 1338 records is 133.8, and since we can't select a fraction of a person, we round up to 134.
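
The rounding step can be sketched as follows. This is an illustration under assumptions rather than the method used in this notebook: `top_percent` and the sample rows are hypothetical, and `heapq.nlargest` is swapped in as an alternative to sorting the whole list.

```python
import heapq
import math

def top_percent(rows, percent, key):
    """Return the top `percent` of rows by `key`, rounding the row count up."""
    n = math.ceil(len(rows) * percent / 100)
    return heapq.nlargest(n, rows, key=key)

print(math.ceil(1338 * 10 / 100))  # 134

# Made-up rows in the notebook's column order; charges are at index 6.
sample_rows = [
    (30, "male", 25.0, 0, "no", "northwest", 4000.0),
    (52, "female", 33.1, 2, "yes", "southeast", 42000.0),
    (41, "male", 29.4, 1, "no", "southwest", 7000.0),
    (60, "female", 36.0, 0, "yes", "northeast", 48000.0),
    (23, "female", 21.7, 0, "no", "southeast", 2500.0),
]

# 30% of 5 rows is 1.5, which rounds up to 2 rows.
top_rows = top_percent(sample_rows, 30, key=lambda row: row[6])
print([row[6] for row in top_rows])  # [48000.0, 42000.0]
```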

We'll zip all the lists together into tuples that resemble the original rows in the .csv.

In [269]:
all_data = list(zip(age2, sex1, bmi2, children2, smoker1, region1, charges2))
#print(all_data)
#To sort the list from highest to lowest medical costs, we'll use charges2 (index 6) as the key.
all_data_by_cost = sorted(all_data, key=lambda info: info[6], reverse=True)
#print(all_data_by_cost)
top_134 = all_data_by_cost[0:134]
print(top_134)

[(54, 'female', 47.41, 0, 'yes', 'southeast', 63770.42801), (45, 'male', 30.36, 0, 'yes', 'southeast', 62592.87309), (52, 'male', 34.485, 3, 'yes', 'northwest', 60021.39897), (31, 'female', 38.095, 1, 'yes', 'northeast', 58571.07448), (33, 'female', 35.53, 0, 'yes', 'northwest', 55135.40209), (60, 'male', 32.8, 0, 'yes', 'southwest', 52590.82939), (28, 'male', 36.4, 1, 'yes', 'southwest', 51194.55914), (64, 'male', 36.96, 2, 'yes', 'southeast', 49577.6624), (59, 'male', 41.14, 1, 'yes', 'southeast', 48970.2476), (44, 'female', 38.06, 0, 'yes', 'southeast', 48885.13561), (63, 'female', 37.7, 0, 'yes', 'southwest', 48824.45), (57, 'male', 42.13, 1, 'yes', 'southeast', 48675.5177), (60, 'male', 40.92, 0, 'yes', 'southeast', 48673.5588), (54, 'male', 40.565, 3, 'yes', 'northeast', 48549.17835), (61, 'female', 36.385, 1, 'yes', 'northeast', 48517.56315), (60, 'male', 39.9, 0, 'yes', 'southwest', 48173.361), (64, 'female', 33.8, 1, 'yes', 'southwest', 47928.03), (59, 'female', 36.765, 1, 'ye

Okay, interesting. Now let's get the percentages of men and women in the top 10%.

In [270]:
top_134_sex = [i[1] for i in top_134]

#percent_maker prints its results and returns None, so we call it directly instead of wrapping it in print()
percent_maker(top_134_sex, "male", "female")

males in the dataset: 62.69%
females in the dataset: 37.31%


And of those men and women, how many have no children? How many have 1, 2, 3, or more?

In [271]:
no_children = []
one_child = []
two_children = []
three_children = []
more_kids = []

for i in top_134:
    if i[3] == 0:
        no_children.append(i)
    elif i[3] == 1:
        one_child.append(i)
    elif i[3] == 2:
        two_children.append(i)
    elif i[3] == 3:
        three_children.append(i)
    elif i[3] > 3:
        more_kids.append(i)

print(f"In the 90th percentile of medical costs, {len(no_children)} people had no children, {len(one_child)} had one child, {len(two_children)} had two children, {len(three_children)} had three and {len(more_kids)} had more than three children.")


In the 90th percentile of medical costs, 48 people had no children, 31 had one child, 34 had two children, 19 had three and 2 had more than three children.


What is the average BMI in the top 134?

In [272]:
avg_134_bmi = avg_maker([x[2] for x in top_134])
print(avg_134_bmi)

35.65


What is the average age in the top 134?

In [273]:
avg_134_age = avg_maker([x[0] for x in top_134])
print(avg_134_age)

41.78
