# U.S. Medical Insurance Costs

### Project Goals for analysis:

1. Find out the average age of the patients in the dataset.
2. Analyze where a majority of the individuals are from.
3. Look at the different costs between smokers vs. non-smokers.
4. Figure out what the average age is for someone who has at least one child in this dataset.

**insurance.csv** contains the following columns:

* Patient Age
* Patient Sex
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven empty lists will be created to hold each individual column of data from **insurance.csv**.

In [1]:
# raw data provided in insurance.csv file

import csv

# creating separate lists for each column 
ages = []
sexes = []
bmis = []
children = []
smokers = []
regions = []
actual_charges = []

with open('insurance.csv', newline='') as insurance_csv:
    patient_reader = csv.DictReader(insurance_csv)
    for row in patient_reader:
        ages.append(row['age'])
        sexes.append(row['sex'])
        bmis.append(row['bmi'])
        children.append(row['children'])
        smokers.append(row['smoker'])
        regions.append(row['region'])
        actual_charges.append(row['charges'])
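
The lists above keep every value as a string, which is why later cells call `int()` and `float()` before doing arithmetic. As a minimal sketch, assuming the same column names as **insurance.csv**, the numeric columns could instead be converted while the file is read. The two sample rows and the `load_columns` helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as insurance.csv
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,yes,southwest,1000.50
40,male,30.5,2,no,southeast,2000.25
"""

def load_columns(csv_file):
    """Return a dict of column lists, converting numeric columns on the way in."""
    converters = {"age": int, "bmi": float, "children": int, "charges": float}
    columns = {}
    for row in csv.DictReader(csv_file):
        for name, value in row.items():
            # Columns without a converter (sex, smoker, region) stay as strings
            convert = converters.get(name, str)
            columns.setdefault(name, []).append(convert(value))
    return columns

columns = load_columns(io.StringIO(SAMPLE_CSV))
print(columns["age"])
print(columns["charges"])
```

With the real file, `load_columns(open('insurance.csv', newline=''))` would presumably work the same way, provided every numeric cell parses cleanly.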

In [2]:
# finding out the number of patients in the dataset
num_records = len(ages)
print("There are {} patients in the dataset.".format(num_records))

# finding out the average age of patients      
def avg_age(ages):
    total_age = 0
    for age in ages:
        total_age += int(age)
    average_age =  round(total_age / len(ages), 2)
    return average_age

average_age = avg_age(ages)
print("The average age of patients in the given dataset is " + str(average_age) + " years.")

There are 1338 patients in the dataset.
The average age of patients in the given dataset is 39.21 years.
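
The same average can be computed with the standard library instead of a manual loop. The sketch below assumes ages arrive as strings, as `csv.DictReader` returns them. The sample values and the `avg_age_with_mean` name are hypothetical.

```python
from statistics import mean

# Hypothetical ages stored as strings, mirroring what csv.DictReader yields
sample_ages = ["19", "18", "28", "33"]

def avg_age_with_mean(ages):
    """Return the mean age rounded to two decimals."""
    return round(mean(int(age) for age in ages), 2)

print(avg_age_with_mean(sample_ages))
```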


In [3]:
# analyzing where the majority of patients live
sw_count = regions.count('southwest')
se_count = regions.count('southeast')
nw_count = regions.count('northwest')
ne_count = regions.count('northeast')
print("Patients are spread fairly evenly over 4 regions.\n" + "\n" +
      str(sw_count) + " patients live in the southwest region.\n" +
      str(se_count) + " patients live in the southeast region.\n" +
      str(nw_count) + " patients live in the northwest region.\n" +
      str(ne_count) + " patients live in the northeast region.\n" +
      "\nThe largest group of patients lives in the southeast region.")


# calculating each region's share
def region_share(sw, se, nw, ne):
    sw_share = round(sw * 100 / len(regions), 1)
    se_share = round(se * 100 / len(regions), 1)
    nw_share = round(nw * 100 / len(regions), 1)
    ne_share = round(ne * 100 / len(regions), 1)
    print("\nSouthwest patients share is " + str(sw_share) + "%\n" +
        "Southeast patients share is " + str(se_share) + "%\n" +
        "Northwest patients share is " + str(nw_share) + "%\n" +
        "Northeast patients share is " + str(ne_share) + "%\n")

region_share(sw_count, se_count, nw_count, ne_count)

Patients are spread fairly evenly over 4 regions.

325 patients live in the southwest region.
364 patients live in the southeast region.
325 patients live in the northwest region.
324 patients live in the northeast region.

The largest group of patients lives in the southeast region.

Southwest patients share is 24.3%
Southeast patients share is 27.2%
Northwest patients share is 24.3%
Northeast patients share is 24.2%
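
The counts and shares above can also be derived without naming each region by hand. This is a rough sketch using `collections.Counter`. The sample list and the `region_shares` helper are hypothetical, and the region with the largest share is simply the biggest group, not necessarily more than half of all patients.

```python
from collections import Counter

# Hypothetical region column with five patients
sample_regions = ["southwest", "southeast", "southeast", "northwest", "northeast"]

def region_shares(regions):
    """Return each region's share of patients as a percentage rounded to one decimal."""
    counts = Counter(regions)
    total = len(regions)
    return {region: round(count * 100 / total, 1) for region, count in counts.items()}

shares = region_shares(sample_regions)
# Pick the region with the largest share
largest_region = max(shares, key=shares.get)
print(shares)
print(largest_region)
```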



In [4]:
# calculating total/average charge
total_charge = 0
for charge in actual_charges:
    total_charge += float(charge)
    
print("Total actual charge is: $" + str(round(total_charge, 2)))

average_charge = total_charge / len(actual_charges)
print("Average actual charge is: $" + str(round(average_charge, 2)))

Total actual charge is: $17755824.99
Average actual charge is: $13270.42


In [5]:
# looking at different charges in smoker vs non-smoker

# counting smokers/non-smokers
def counting_smokers(smokers):
    smokers_count = 0
    non_smokers_count = 0
    for smoker in smokers:
        if smoker == "yes":
            smokers_count += 1
        else:
            non_smokers_count += 1
    return smokers_count, non_smokers_count

smokers_count, non_smokers_count = counting_smokers(smokers)

print("The dataset contains {} smokers and {} non-smokers.".format(smokers_count, non_smokers_count))

The dataset contains 274 smokers and 1064 non-smokers.


In [6]:
# creating a zipped list of smokers and their charges
zipped_smokers = list(zip(smokers, actual_charges))

# calculating smokers charges and share in total charges
def smokers_charges(zipped_smokers, smokers_count = 274, non_smokers_count = 1064):
    smoker_charges = 0
    non_smoker_charges = 0
    for item in zipped_smokers:
        if item[0] == 'yes':
            smoker_charges += float(item[1])
        else:
            non_smoker_charges += float(item[1])
    smokers_percent_in_charges = round(smoker_charges * 100 / total_charge, 1)
    smokers_share = smokers_count * 100 / len(smokers)
    non_smokers_share = non_smokers_count * 100 / len(smokers)
    print("\nSmoker charges are: ${}\n"
          "\nNon-smoker charges are: ${}\n"
          "\nPercent of smokers in the dataset: {}%\n"
          "\nPercent of non-smokers in the dataset: {}%\n"
          "\nAlthough smokers make up only {}% of the patients,"
          "\nthey account for almost half of all insurance costs in the given dataset, namely {}%.\n".format(
              round(smoker_charges, 2),
              round(non_smoker_charges, 2),
              round(smokers_share, 1),
              round(non_smokers_share, 1),
              round(smokers_share, 1),
              round(smokers_percent_in_charges, 1)))

smokers_charges(zipped_smokers)

# calculating average cost for smokers/non-smokers
def calculate_average_cost_smoker(zipped_smokers):
    smoker_charges = 0
    non_smoker_charges = 0
    smoker_list = []
    non_smoker_list = []
    for item in zipped_smokers:
        if item[0] == 'yes':
            smoker_charges += float(item[1])
            smoker_list.append(item[0])
        else:
            non_smoker_charges += float(item[1])
            non_smoker_list.append(item[0])
    average_cost_smoker = smoker_charges / len(smoker_list)
    average_cost_non_smoker = non_smoker_charges / len(non_smoker_list)
    print("A smoker pays ${} in insurance costs on average, while a non-smoker pays ${} on average.".format(
        round(average_cost_smoker, 2),
        round(average_cost_non_smoker, 2)))

calculate_average_cost_smoker(zipped_smokers)


Smoker charges are: $8781763.52

Non-smoker charges are: $8974061.47

Percent of smokers in the dataset: 20.5%

Percent of non-smokers in the dataset: 79.5%

Although smokers make up only 20.5% of the patients,
they account for almost half of all insurance costs in the given dataset, namely 49.5%.

A smoker pays $32050.23 in insurance costs on average, while a non-smoker pays $8434.27 on average.
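
The smoker and non-smoker averages can be produced in a single pass that groups charges by smoking status. The following is a sketch under assumed inputs: the sample lists and the `average_charge_by_status` helper are hypothetical, and charges are taken to be strings as read from the CSV.

```python
from collections import defaultdict

# Hypothetical smoker flags and charges, stored as strings like the CSV values
sample_smokers = ["yes", "no", "no", "yes"]
sample_charges = ["30000.0", "8000.0", "9000.0", "34000.0"]

def average_charge_by_status(smokers, charges):
    """Return the average charge per smoking status, rounded to two decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for status, charge in zip(smokers, charges):
        totals[status] += float(charge)
        counts[status] += 1
    return {status: round(totals[status] / counts[status], 2) for status in totals}

averages = average_charge_by_status(sample_smokers, sample_charges)
print(averages)
```

Grouping by the status value avoids hard-coding the smoker and non-smoker counts, so the same function would keep working if the dataset changed.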


In [7]:
# figuring out what the average age is for someone who has at least one child in this dataset.

# creating a common list for ages and children
zipped_parents = list(zip(ages, children))

# finding out the average age for patient with at least one child
def calculate_avg_age_of_parents(zipped_parents):
    parents_with_children = []
    for parent in zipped_parents:
        if int(parent[1]) > 0:
            parents_with_children.append(parent[0])
    length = len(parents_with_children)
    total_age = 0
    for age in parents_with_children:
        total_age += int(age)
    average_age = round(total_age / length, 2)
    print("The average age for a patient who has at least one child in this dataset is " + str(average_age) + " years.")
    
calculate_avg_age_of_parents(zipped_parents)

The average age for a patient who has at least one child in this dataset is 39.78 years.
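
The filter-then-average step above can be condensed with a list comprehension. This sketch also guards against a dataset with no parents, which would otherwise cause a division by zero. The sample data and the `avg_age_of_parents` name are hypothetical.

```python
# Hypothetical ages and child counts, stored as strings like the CSV values
sample_ages = ["19", "18", "28", "33"]
sample_children = ["0", "1", "3", "0"]

def avg_age_of_parents(ages, children):
    """Return the average age of patients with at least one child, or None if there are none."""
    parent_ages = [int(age) for age, kids in zip(ages, children) if int(kids) > 0]
    if not parent_ages:
        return None
    return round(sum(parent_ages) / len(parent_ages), 2)

print(avg_age_of_parents(sample_ages, sample_children))
```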


In [8]:
# creating a dictionary with a number as key and all other data as value for simplicity
medical_records = dict()
for i in range(len(ages)):
    medical_records[i] = {"Age": ages[i],
                          "Sex": sexes[i],
                          "BMI": bmis[i],
                          "Children": children[i],
                          "Smoker": smokers[i],
                          "Region": regions[i],
                          "Charges": actual_charges[i]}

# creating a dictionary with number of patients by regions
number_of_patients_by_regions = {"Southwest": sw_count,
                                 "Southeast": se_count,
                                 "Northwest": nw_count,
                                 "Northeast": ne_count}

# creating a list with data analysis results
analysis_results = []
analysis_results.append(["Total number of records", num_records])
analysis_results.append(["Average age of patients", average_age])
analysis_results.append(number_of_patients_by_regions)
analysis_results.append(["Total charge, $", round(total_charge, 2)])
analysis_results.append(["Average charge, $", round(average_charge, 2)])
analysis_results.append(["Smokers", 274])
analysis_results.append(["Non-smokers", 1064])
analysis_results.append(["Percent of smokers, %", 20.5])
analysis_results.append(["Smokers share in total charges, %", 49.5])
analysis_results.append(["Smoker's average insurance payment, $", 32050.23])
analysis_results.append(["Non-smoker's average insurance payment, $", 8434.27])
analysis_results.append(["Average age of patient who is a parent", 39.78])

In [9]:
print(analysis_results)

[['Total number of records', 1338], ['Average age of patients', 39.21], {'Southwest': 325, 'Southeast': 364, 'Northwest': 325, 'Northeast': 324}, ['Total charge, $', 17755824.99], ['Average charge, $', 13270.42], ['Smokers', 274], ['Non-smokers', 1064], ['Percent of smokers, %', 20.5], ['Smokers share in total charges, %', 49.5], ["Smoker's average insurance payment, $", 32050.23], ["Non-smoker's average insurance payment, $", 8434.27], ['Average age of patient who is a parent', 39.78]]
