# U.S. Medical Insurance Costs

### Normalize "insurance.csv" data

The first step is to read the data from the ".csv" file **(there are 1,338 total records in this dataset)**. I store each row as a dictionary and append it to an empty list; since every value in those dictionaries is read as a string, I then convert the ones I need to an *integer* or a *float*.

In [46]:
import csv

insurance_data_raw = []

# Open the ".csv" file, read each row as a dictionary and append it to the list
with open("insurance.csv", newline = "") as insurance_csv:
	reader = csv.DictReader(insurance_csv)

	for item in reader:
		insurance_data_raw.append(item)

# Parse string values as integer or float number
for data in insurance_data_raw:
	data["age"] = int(data["age"])
	data["bmi"] = float(data["bmi"])
	data["children"] = int(data["children"])
	data["charges"] = float(data["charges"])
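
The read-then-convert step above can also be written as a single reusable function. The following is only a sketch: `SAMPLE_CSV`, `CONVERTERS` and `load_records` are hypothetical names, and the two sample rows are made up so that the example runs without the real file.

```python
import csv
import io

# Made-up sample with the same columns as "insurance.csv" (not real data)
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,no,southwest,1000.0
40,male,30.5,2,yes,southeast,2000.5
"""

# Column name -> type to apply; columns not listed here stay as strings
CONVERTERS = {"age": int, "bmi": float, "children": int, "charges": float}

def load_records(csv_file):
    """Read every row as a dictionary and convert the numeric columns."""
    records = []
    for row in csv.DictReader(csv_file):
        for column, convert in CONVERTERS.items():
            row[column] = convert(row[column])
        records.append(row)
    return records

# io.StringIO stands in for the file object returned by open("insurance.csv", newline="")
sample_records = load_records(io.StringIO(SAMPLE_CSV))
print(sample_records[0])
```

Keeping the column-to-type mapping in one dictionary means a new numeric column only needs one extra entry.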

### Dataset Insights

Before I start analyzing this dataset, I want to answer a few basic counting questions:

1. How many total records are in this dataset?
2. How many females and males are there?
3. How many smokers vs. non-smokers are there?
4. How many records are there for each region (southwest, southeast, northeast and northwest)?
5. What is the range of children in this dataset?
6. What is the average BMI?
7. What is the average charge?

In [2]:
# Total records and total females vs. total males

num_of_males = 0
num_of_females = 0

for data in insurance_data_raw:
	if data["sex"] == "female":
		num_of_females += 1
	elif data["sex"] == "male":
		num_of_males += 1

total_medical_records = num_of_females + num_of_males

print("In this dataset, there's a total of {num_of_females} female records, against {num_of_males} male records. Total records in this dataset are {total_medical_records}.".format(num_of_females = num_of_females, num_of_males = num_of_males, total_medical_records = total_medical_records))

In this dataset, there's a total of 662 female records, against 676 male records. Total records in this dataset are 1338.


In [3]:
# Total smokers vs. total non-smokers

num_of_smokers = 0
num_of_non_smokers = 0

for data in insurance_data_raw:
	if data["smoker"] == "yes":
		num_of_smokers += 1
	elif data["smoker"] == "no":
		num_of_non_smokers += 1

print("There's a total of {num_of_smokers} smokers against {num_of_non_smokers} non-smokers.".format(num_of_smokers = num_of_smokers, num_of_non_smokers = num_of_non_smokers))

There's a total of 274 smokers against 1064 non-smokers.


In [4]:
# Records per region: Southwest, Southeast, Northeast and Northwest

num_of_southwest = 0
num_of_southeast = 0
num_of_northeast = 0
num_of_northwest = 0

for data in insurance_data_raw:
    if data["region"] == "southwest":
        num_of_southwest += 1
    elif data["region"] == "southeast":
        num_of_southeast += 1
    elif data["region"] == "northeast":
        num_of_northeast += 1
    elif data["region"] == "northwest":
        num_of_northwest += 1
        
print("Records per region: {num_of_southwest} Southwest, {num_of_southeast} Southeast, {num_of_northeast} Northeast and {num_of_northwest} Northwest.".format(num_of_southeast = num_of_southeast, num_of_southwest = num_of_southwest, num_of_northwest = num_of_northwest, num_of_northeast = num_of_northeast))

Records per region: 325 Southwest, 364 Southeast, 324 Northeast and 325 Northwest.
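
The three counting cells above follow the same pattern: loop over the records and tally one key. As a sketch of a more compact alternative, `collections.Counter` from the standard library can do the tallying. The helper name `count_by` and the sample records are assumptions made for illustration; in the notebook the input would be `insurance_data_raw`.

```python
from collections import Counter

# Made-up sample records (not real data); the notebook would use insurance_data_raw
sample_records = [
    {"sex": "female", "smoker": "no", "region": "southwest"},
    {"sex": "male", "smoker": "yes", "region": "southeast"},
    {"sex": "male", "smoker": "no", "region": "southeast"},
]

def count_by(records, key):
    """Count how many records share each distinct value of the given key."""
    return Counter(record[key] for record in records)

sex_counts = count_by(sample_records, "sex")
smoker_counts = count_by(sample_records, "smoker")
region_counts = count_by(sample_records, "region")

for region, count in region_counts.items():
    print("{region}: {count} records".format(region = region, count = count))
```

A `Counter` returns 0 for a value that never appears, so a region with no records does not raise a `KeyError`.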


In [5]:
# Range of children

# Distinct numbers of children found in the dataset, in ascending order
children_range = sorted({data["children"] for data in insurance_data_raw})

# One counter per distinct value, starting at zero
total_children = {children: 0 for children in children_range}

# Tally every record under its number of children
for data in insurance_data_raw:
    total_children[data["children"]] += 1
        
for key, value in total_children.items():
    noun = "child" if key == 1 else "children"
    print("There are {value} records that have {key} {noun}.".format(key = key, value = value, noun = noun))

There are 574 records that have 0 children.
There are 324 records that have 1 child.
There are 240 records that have 2 children.
There are 157 records that have 3 children.
There are 25 records that have 4 children.
There are 18 records that have 5 children.


In [19]:
# Average BMI

total_bmi = 0

for data in insurance_data_raw:
    total_bmi += data["bmi"]
    
average_bmi = round((total_bmi / total_medical_records), 2)

print("The average BMI of this dataset is {average_bmi}.".format(average_bmi = average_bmi))

The average BMI of this dataset is 30.66.


In [18]:
# Average Cost

total_cost = 0

for data in insurance_data_raw:
    total_cost += data["charges"]
    
average_cost = round((total_cost / total_medical_records), 2)

print("The average cost in this dataset is ${average_cost}.".format(average_cost = average_cost))

The average cost in this dataset is $13270.42.
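
Both averages above are a sum divided by the number of records. A possible shortcut, sketched here with made-up sample values, is `statistics.mean` from the standard library; `average_of` is a hypothetical helper name, not part of the original notebook.

```python
from statistics import mean

# Made-up sample records (not real data); the notebook would use insurance_data_raw
sample_records = [
    {"bmi": 25.0, "charges": 1000.0},
    {"bmi": 30.0, "charges": 2000.0},
    {"bmi": 35.0, "charges": 6000.0},
]

def average_of(records, key):
    """Return the mean of one numeric key, rounded to two decimals."""
    return round(mean(record[key] for record in records), 2)

sample_average_bmi = average_of(sample_records, "bmi")
sample_average_cost = average_of(sample_records, "charges")

print("Average BMI: {bmi}. Average cost: ${cost}.".format(bmi = sample_average_bmi, cost = sample_average_cost))
```

Because `mean` counts the values it receives, the result does not depend on a record total computed in an earlier cell.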


### Analyzing the dataset

Having answered questions about how many records there are, how many males and females, etc., I now want to answer questions like:

1. What is the average cost for females vs. males, and what is the difference?
2. What is the average cost for smokers vs. non-smokers, and what is the difference?
3. What is the average cost in each region, and which region is the priciest and which is the cheapest?

In [24]:
# Female vs. Male average cost

female_total_cost = 0
male_total_cost = 0

for data in insurance_data_raw:
    if data["sex"] == "female":
        female_total_cost += data["charges"]
    elif data["sex"] == "male":
        male_total_cost += data["charges"]
        
female_average_cost = round((female_total_cost / num_of_females), 2)
male_average_cost = round((male_total_cost / num_of_males), 2)
female_male_cost_diff = round((male_average_cost - female_average_cost), 2)

print("The female average cost is ${female_average_cost}, vs. a male average cost of ${male_average_cost}. Based on this dataset, males are charged on average ${female_male_cost_diff} more than females.".format(female_average_cost = female_average_cost, male_average_cost = male_average_cost, female_male_cost_diff = female_male_cost_diff))

The female average cost is $12569.58, vs. a male average cost of $13956.75. Based on this dataset, males are charged on average $1387.17 more than females.


In [27]:
# Smokers vs. Non-smokers average cost

smokers_total_cost = 0
non_smokers_total_cost = 0

for data in insurance_data_raw:
    if data["smoker"] == "yes":
        smokers_total_cost += data["charges"]
    elif data["smoker"] == "no":
        non_smokers_total_cost += data["charges"]
        
smokers_average_cost = round((smokers_total_cost / num_of_smokers), 2)
non_smokers_average_cost = round((non_smokers_total_cost / num_of_non_smokers), 2)
smokers_average_diff = round((smokers_average_cost - non_smokers_average_cost), 2)

print("Smokers' average cost is ${smokers_average_cost} vs. ${non_smokers_average_cost} for non-smokers. Based on this dataset, smokers are charged on average ${smokers_average_diff} more than non-smokers.".format(smokers_average_cost = smokers_average_cost, non_smokers_average_cost = non_smokers_average_cost, smokers_average_diff = smokers_average_diff))

Smokers' average cost is $32050.23 vs. $8434.27 for non-smokers. Based on this dataset, smokers are charged on average $23615.96 more than non-smokers.
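
The female/male and smoker/non-smoker comparisons both compute an average charge per group. As a sketch under that reading, one helper could handle any grouping key. The name `average_charges_by` and the sample records are assumptions made for illustration, not part of the original notebook.

```python
from collections import defaultdict

# Made-up sample records (not real data); the notebook would use insurance_data_raw
sample_records = [
    {"smoker": "yes", "charges": 3000.0},
    {"smoker": "yes", "charges": 5000.0},
    {"smoker": "no", "charges": 1000.0},
    {"smoker": "no", "charges": 2000.0},
]

def average_charges_by(records, key):
    """Return {group value: average charges} for the given grouping key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += record["charges"]
        counts[record[key]] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

smoker_averages = average_charges_by(sample_records, "smoker")
smoker_diff = round(smoker_averages["yes"] - smoker_averages["no"], 2)

print(smoker_averages, smoker_diff)
```

Calling the same helper with `"sex"` or `"region"` as the key would give the other comparisons without a separate loop for each.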


In [45]:
# Average cost between regions

def region_average_charge(region, num_of_records):

	# Sum the charges of every record in the given region
	total_charges = 0

	for data in insurance_data_raw:
		if data["region"] == region:
			total_charges += data["charges"]

	# Divide by the number of records in that region to get the average
	return round((total_charges / num_of_records), 2)

southwest_average_charge = region_average_charge("southwest", num_of_southwest)
southeast_average_charge = region_average_charge("southeast", num_of_southeast)
northeast_average_charge = region_average_charge("northeast", num_of_northeast)
northwest_average_charge = region_average_charge("northwest", num_of_northwest)

priciest_region = max(southeast_average_charge, southwest_average_charge, northeast_average_charge, northwest_average_charge)
cheapest_region = min(southeast_average_charge, southwest_average_charge, northeast_average_charge, northwest_average_charge)

print("Southwest average cost: ${southwest_average_charge}. Southeast average cost: ${southeast_average_charge}. Northeast average cost: ${northeast_average_charge}. Northwest average cost: ${northwest_average_charge}.".format(southwest_average_charge = southwest_average_charge, southeast_average_charge = southeast_average_charge, northeast_average_charge = northeast_average_charge, northwest_average_charge = northwest_average_charge))
print("The priciest region has an average cost of ${priciest_region}.".format(priciest_region = priciest_region))
print("The cheapest region has an average cost of ${cheapest_region}.".format(cheapest_region = cheapest_region))

Southwest average cost: $12346.94. Southeast average cost: $14735.41. Northeast average cost: $13406.38. Northwest average cost: $12417.58.
The priciest region has an average cost of $14735.41.
The cheapest region has an average cost of $12346.94.
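
The cell above reports the highest and lowest average costs but not the names of the regions they belong to. One way to recover the names, sketched here, is to keep the averages in a dictionary and pass a `key` function to `max` and `min`. The dictionary reuses the four averages printed above; `region_averages` and the two `*_name` variables are hypothetical names.

```python
# Average cost per region, copied from the output printed above
region_averages = {
    "southwest": 12346.94,
    "southeast": 14735.41,
    "northeast": 13406.38,
    "northwest": 12417.58,
}

# max/min iterate over the region names and compare them by their average cost
priciest_region_name = max(region_averages, key = region_averages.get)
cheapest_region_name = min(region_averages, key = region_averages.get)

print("The priciest region is {name} (${cost}).".format(name = priciest_region_name, cost = region_averages[priciest_region_name]))
print("The cheapest region is {name} (${cost}).".format(name = cheapest_region_name, cost = region_averages[cheapest_region_name]))
```

With these values the priciest region is the southeast and the cheapest is the southwest.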
