# U.S. Medical Insurance Costs

### Project Author: Carlos Paiva

The first steps in this project are to inspect the raw data in the file "insurance.csv" (our main data source), create the lists that will hold each column in our Python program, and import the required Python modules. The code follows below:

In [3]:
# Import csv library
import csv

In [4]:
# Create empty lists to populate with raw data from imported file:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

In [5]:
# Populating lists (csv.DictReader returns every value as a string):
with open("insurance.csv", newline="") as insurance_data:
    insurance_dict = csv.DictReader(insurance_data)
    for row in insurance_dict:
        age.append(row["age"])
        sex.append(row["sex"])
        bmi.append(row["bmi"])
        children.append(row["children"])
        smoker.append(row["smoker"])
        region.append(row["region"])
        charges.append(row["charges"])
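
Every value read by `csv.DictReader` is a string, which is why the later cells call `int()` and `float()` before doing arithmetic. As a minimal sketch of an alternative, the numeric columns could be converted once while loading. The helper name `load_patients` and the two sample rows are hypothetical and made up for illustration; they are not taken from insurance.csv.

```python
import csv
import io

def load_patients(file_obj):
    """Read insurance rows from an open file and convert numeric columns."""
    patients = []
    for row in csv.DictReader(file_obj):
        patients.append({
            "age": int(row["age"]),
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"],
            "region": row["region"],
            "charges": float(row["charges"]),
        })
    return patients

# Made-up sample rows (illustrative only, not taken from insurance.csv)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,southwest,4000.0\n"
    "45,male,31.5,0,yes,northeast,21000.5\n"
)
sample_patients = load_patients(sample_csv)
print(sample_patients[0]["age"] + 1)  # arithmetic works because age is an int
```

With the real file, the same helper could be called inside `with open("insurance.csv", newline="") as f:`.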

For the analysis, we will perform seven tasks that give us more insight into the dataset:
1. Find the average age of the patients
2. Return the number of males vs. females counted in the dataset and compare the average cost for males vs. females
3. Find the geographical location of the patients and analyze the total number of patients from each region
4. Return the average yearly medical charges of the patients
5. Look at the difference between the average cost for smokers vs non smokers
6. Figure out what the average age is for someone who has at least one child in the dataset
7. Create a dictionary that contains all patient information

#### 1. Find the average age of the patients

In [6]:
# Calculating total number of patients:
number_of_patients = len(age)
print("The total number of patients is " + str(number_of_patients) + ".")

The total number of patients is 1338.


In [7]:
# Calculating average age of patients:
total_age = 0
for patient in age:
    total_age += int(patient)

average_age = round(total_age/number_of_patients, 1)
print("The average age of the patients in the dataset is " + str(average_age) + ".")

The average age of the patients in the dataset is 39.2.
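
The same average can also be computed with the standard library's `statistics.mean`, which avoids the manual running total. This is only a sketch: the list `sample_age` holds made-up ages stored as strings, the way `csv.DictReader` returns them, and is not data from the file.

```python
from statistics import mean

# Made-up ages as strings (illustrative only)
sample_age = ["19", "33", "46", "60"]

# Convert each value to int, then average and round to one decimal place
sample_average_age = round(mean(int(a) for a in sample_age), 1)
print("The average age of the sample patients is " + str(sample_average_age) + ".")
```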


#### 2. Return the number of males vs. females counted in the dataset and compare the average cost for males vs. females

In [32]:
# Counting number of males and females:
number_of_males = 0
number_of_females = 0
for patient in sex:
    if patient=="male":
        number_of_males += 1
    if patient=="female":
        number_of_females += 1
        
percentage_males = number_of_males/number_of_patients*100
percentage_females = number_of_females/number_of_patients*100
        
print("The total number of male patients in the dataset is " + str(number_of_males) + ".\nThe percentage that the male patients represent of the total is " + str(round(percentage_males,1)) + "%.\n")
print("The total number of female patients in the dataset is " + str(number_of_females) + ".\nThe percentage that the female patients represent of the total is " + str(round(percentage_females,1)) + "%.\n")

# Comparing average costs for males vs females:
cost_by_sex = list(zip(sex, charges))

def aver_cost_sex(pairs, sex_obj):
    # pairs holds (sex, charge) tuples; the parameter is not named "list" to avoid shadowing the built-in
    tot_cost_sex = 0
    num_patients = 0
    for sex_ind, charge in pairs:
        if sex_ind == sex_obj:
            num_patients += 1
            tot_cost_sex += float(charge)
    aver_cost = tot_cost_sex / num_patients
    print("The average cost for " + sex_obj + " patients is " + str(round(aver_cost, 1)) + ".")
    return aver_cost

# Each call prints the average and returns it, so one call per group is enough:
male_cost = aver_cost_sex(cost_by_sex, "male")
female_cost = aver_cost_sex(cost_by_sex, "female")
diff_sex = male_cost - female_cost

print("")
print("Conclusions:\nIn the dataset, we can see that the breakdown between males and females is almost 50-50. Therefore, no biases expected here when analyzing the average cost for each group.\nThe average cost for male patients is " + str(round(diff_sex,1)) + " dollars more than for female patients.")


The total number of male patients in the dataset is 676.
The percentage that the male patients represent of the total is 50.5%.

The total number of female patients in the dataset is 662.
The percentage that the female patients represent of the total is 49.5%.

The average cost for male patients is 13956.8.
The average cost for female patients is 12569.6.

Conclusions:
In the dataset, we can see that the breakdown between males and females is almost 50-50. Therefore, no biases expected here when analyzing the average cost for each group.
The average cost for male patients is 1387.2 dollars more than for female patients.
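
The per-group average above can be generalized so that one pass over the data yields the average for every group at once. The following is a minimal sketch, not the method used in this notebook: the helper `average_by` and the sample lists are hypothetical, and the sample values are made up.

```python
from collections import defaultdict

def average_by(keys, values):
    """Return {key: average of the values that share that key}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += float(value)  # values may be strings, as read from the CSV
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Made-up sample data (illustrative only)
sample_sex = ["male", "female", "male", "female"]
sample_charges = ["1000.0", "2000.0", "3000.0", "4000.0"]

sample_averages = average_by(sample_sex, sample_charges)
print(sample_averages)
```

The same helper would work for any categorical column, for example smoking status or region.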


#### 3. Find the geographical location of the patients and analyze the total number of patients from each region

In [39]:
# Creating a list of the unique geographical regions in the dataset:
unique_regions = []

for location in region:
    if location not in unique_regions:
        unique_regions.append(location)

print("The list of possible regions where the patients from the dataset come from is: " + str(unique_regions) + "\n")

# Counting the number of patients from each region using a function:
def count_region(regions, reg_to_count):
    # The parameter is named "regions" rather than "list" to avoid shadowing the built-in
    counter = 0
    for location in regions:
        if location == reg_to_count:
            counter += 1
    print("The number of patients from " + reg_to_count + " is " + str(counter) + ".")

count_region(region, "southwest")
count_region(region, "southeast")
count_region(region, "northwest")
count_region(region, "northeast")

print("")
print("Conclusion:\nThe patients in the dataset are spread fairly evenly across the four regions, with the southeast slightly larger than the others.")


The list of possible regions where the patients from the dataset come from is: ['southwest', 'southeast', 'northwest', 'northeast']

The number of patients from southwest is 325.
The number of patients from southeast is 364.
The number of patients from northwest is 325.
The number of patients from northeast is 324.

Conclusion:
The patients in the dataset are spread fairly evenly across the four regions, with the southeast slightly larger than the others.
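
Counting occurrences per region is also what `collections.Counter` from the standard library does, so the unique-region loop and the counting function could be combined into one step. A short sketch, assuming a made-up list `sample_region` rather than the real column:

```python
from collections import Counter

# Made-up sample data (illustrative only)
sample_region = ["southwest", "southeast", "southeast", "northwest"]

region_counts = Counter(sample_region)

# most_common() lists the regions from most to least frequent
for name, count in region_counts.most_common():
    print("The number of patients from " + name + " is " + str(count) + ".")
```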


#### 4. Return the average yearly medical charges of the patients

In [10]:
# Calculating average yearly charges:
total_charges = 0
for patient in charges:
    total_charges += float(patient)

average_charges = round(total_charges/number_of_patients,2)
print("The average yearly charge across all patients in the dataset is USD " + str(average_charges) + ".")

The average yearly charge across all patients in the dataset is USD 13270.42.


#### 5. Look at the difference between the average cost for smokers vs non smokers

In [36]:
# Creating a list that pairs each patient's smoking status with their charges:
smokers = list(zip(smoker, charges))

# Creating a function that returns the number of patients with a given smoking status and their average cost:
def aver_cost(pairs, smo_status):
    num_smo_status = 0
    tot_cost_smo_status = 0
    for status, cost in pairs:
        if status == smo_status:
            num_smo_status += 1
            tot_cost_smo_status += float(cost)
    average = tot_cost_smo_status/num_smo_status
    return num_smo_status, average

yes_smokers = aver_cost(smokers, "yes")
no_smokers = aver_cost(smokers, "no")
difference = yes_smokers[1] - no_smokers[1]

print("The average insurance cost for non smokers is " + str(round(no_smokers[1],1)) + " dollars.\nThis cost is " + str(round(difference,1)) + " dollars less than the average cost for smokers.\n")
print("The average insurance cost for smokers is " + str(round(yes_smokers[1],1)) + " dollars.\nThis cost is " + str(round(difference,1)) + " dollars more than the average cost for non smokers.\n")

print("Conclusion: Being a smoker dramatically increases the insurance cost.")

The average insurance cost for non smokers is 8434.3 dollars.
This cost is 23616.0 dollars less than the average cost for smokers.

The average insurance cost for smokers is 32050.2 dollars.
This cost is 23616.0 dollars more than the average cost for non smokers.

Conclusion: Being a smoker dramatically increases the insurance cost.
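
To put "dramatically" into numbers, the two averages can also be compared as a ratio. The sketch below uses the rounded averages printed above as inputs, so its difference (23615.9) differs slightly from the 23616.0 that the notebook computes from the unrounded values.

```python
# Rounded averages reported in the output above (USD)
smoker_avg = 32050.2
non_smoker_avg = 8434.3

difference = smoker_avg - non_smoker_avg
ratio = smoker_avg / non_smoker_avg

print("Difference: " + str(round(difference, 1)) + " dollars")
print("Smokers pay about " + str(round(ratio, 1)) + " times as much on average.")
```

On these figures, the average cost for smokers is roughly 3.8 times the average for non-smokers.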


#### 6. Figure out what the average age is for someone who has at least one child in the dataset

In [40]:
# Creating a list that combines age and number of children:
age_and_children = list(zip(age, children))

# Creating a function that returns the average age of patients with at least a given number of children:
def aver_age(pairs, min_children):
    tot_age_min_children = 0
    num_min_children = 0
    for age_ind, child_ind in pairs:
        if int(child_ind) >= min_children:
            num_min_children += 1
            tot_age_min_children += float(age_ind)
    return tot_age_min_children / num_min_children

print("The average age for patients who have at least one child is " + str(round(aver_age(age_and_children, 1),1)) + ".\n")

# Creating a function that prints the average age of patients with exactly a given number of children:
def aver_age1(pairs, obj_child):
    tot_age_children = 0
    num_patients = 0
    for age_ind, child_ind in pairs:
        if int(child_ind) == obj_child:
            num_patients += 1
            tot_age_children += float(age_ind)
    average = tot_age_children / num_patients
    if obj_child == 1:
        print("The average age for patients with " + str(obj_child) + " child is " + str(round(average, 1)) + ".")
    else:
        print("The average age for patients with " + str(obj_child) + " children is " + str(round(average, 1)) + ".")

aver_age1(age_and_children, 0)
aver_age1(age_and_children, 1)
aver_age1(age_and_children, 2)
aver_age1(age_and_children, 3)
aver_age1(age_and_children, 4)
aver_age1(age_and_children, 5)

print("\nConclusion:\nThe average age of the patients does not change much for the different groups of patients according to the number of children that they have.")

The average age for patients who have at least one child is 39.8.

The average age for patients with 0 children is 38.4.
The average age for patients with 1 child is 39.5.
The average age for patients with 2 children is 39.4.
The average age for patients with 3 children is 41.6.
The average age for patients with 4 children is 39.0.
The average age for patients with 5 children is 35.6.

Conclusion:
The average age of the patients does not change much for the different groups of patients according to the number of children that they have.
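
When comparing group averages like these, it can help to see how many patients are in each group, since an average over a handful of patients is less stable. The sketch below groups ages by number of children and reports both the group size and the average. The helper `average_age_by_children` and the sample lists are hypothetical, and the sample values are made up.

```python
from collections import defaultdict

def average_age_by_children(ages, child_counts):
    """Return {number of children: (group size, average age)}, sorted by key."""
    groups = defaultdict(list)
    for age_value, child_value in zip(ages, child_counts):
        groups[int(child_value)].append(int(age_value))
    return {
        n_children: (len(group), sum(group) / len(group))
        for n_children, group in sorted(groups.items())
    }

# Made-up sample data (illustrative only)
sample_age = ["20", "40", "30", "50", "35"]
sample_children = ["0", "0", "1", "1", "3"]

summary = average_age_by_children(sample_age, sample_children)
for n_children, (group_size, avg_age) in summary.items():
    print(str(n_children) + " children: " + str(group_size) + " patients, average age " + str(round(avg_age, 1)))
```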


#### 7. Create a dictionary that contains all patient information

In [13]:
# Creating a function to generate the dictionary:

def create_dict(age, sex, bmi, children, smoker, region, charges):
    # Numeric columns are converted from the strings read by csv.DictReader
    dictionary = {}
    dictionary["age"] = [int(item) for item in age]
    dictionary["sex"] = sex
    dictionary["bmi"] = [float(item) for item in bmi]
    dictionary["children"] = [int(item) for item in children]
    dictionary["smoker"] = smoker
    dictionary["region"] = region
    dictionary["charges"] = [float(item) for item in charges]
    return dictionary

# Using function for creating the dictionary:
patient_dictionary = create_dict(age, sex, bmi, children, smoker, region, charges)
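
The dictionary built here stores one list per column, so the information for a single patient is spread across several lists at the same index. As a sketch of how one patient's record could be read back out, the hypothetical helper `get_patient` below collects the value at a given index from every column. The sample dictionary is made up and much smaller than the real one.

```python
def get_patient(patient_dict, index):
    """Collect the values stored at `index` in every column into one record."""
    return {column: values[index] for column, values in patient_dict.items()}

# Made-up column-oriented dictionary (illustrative only)
sample_dictionary = {
    "age": [25, 52],
    "sex": ["female", "male"],
    "smoker": ["no", "yes"],
    "charges": [3000.0, 27000.0],
}

second_patient = get_patient(sample_dictionary, 1)
print(second_patient)
```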