# U.S. Medical Insurance Costs

To make the work more interesting, let's add a story to this project. 
Imagine that a small space insurance company, SpaceINS, has insured a crew of settlers on a lunar agricultural base and needs help evaluating the contract. The base is under the jurisdiction of the United States, so health insurance is administered under the same procedures as in the USA on Earth.
In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal of this project is to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

First, import all the libraries needed for the project. Here it looks like we will only need the **csv** library.



In [97]:
#import csv
import csv

Second, let's take a look at **insurance.csv** and plan out how to import the data into a Python file:
* Column names: age, sex, bmi, children, smoker, region, charges
* Any noticeable missing data: no missing data
* Types of values: some of the data is numerical (age, bmi, children, charges) and some is categorical (sex, smoker, region)
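
The checks in the list above can also be done in code rather than by eye. The snippet below is a minimal sketch: the helper name `inspect_csv` is hypothetical, and the inline sample rows are made up (they are not taken from **insurance.csv**) so that the example runs on its own. To inspect the real file, one could pass `open("insurance.csv", newline="")` instead of the sample.

```python
import csv
import io

# Hypothetical three-row sample in the insurance.csv layout (values are made up).
SAMPLE = """age,sex,bmi,children,smoker,region,charges
30,female,22.5,0,no,southwest,4000.0
45,male,31.2,2,yes,southeast,
52,female,27.8,1,no,northwest,9800.5
"""

def inspect_csv(file_obj):
    """Return (column names, row count, number of empty cells per column)."""
    reader = csv.DictReader(file_obj)
    columns = reader.fieldnames
    missing = {name: 0 for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            # A cell counts as missing if it is absent or blank.
            if row[name] is None or row[name].strip() == "":
                missing[name] += 1
    return columns, row_count, missing

columns, row_count, missing = inspect_csv(io.StringIO(SAMPLE))
print(columns)
print(row_count, missing)
```

In this made-up sample, the second row has an empty `charges` cell, so that column is reported with one missing value.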



## Scope of the project

* Check the sample for balance across categories (e.g. we may find that this dataset is mainly composed of individuals who have children or that it is imbalanced in terms of representation of males vs. females).
* Review the geography of clients' origins
* Analyze costs and work out recommendations on how to lower insurance costs for clients:
    * Determine how the cost of insurance depends on smoker status
    * Determine how the cost of insurance depends on BMI
    * Identify recommendations for optimizing health insurance costs

Next, let's prepare a skeleton for our project:

In [149]:
# add empty lists for the various columns in insurance.csv
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_charges = []

In [150]:
# helper function to load csv data
def load_list_data(lst, csv_file, column_name):
    with open(csv_file) as csv_info:
        csv_dict = csv.DictReader(csv_info)
        for row in csv_dict:
            lst.append(row[column_name])
        return lst

In [151]:
# Populate the created columns with data 
load_list_data(ages, 'insurance.csv', "age")
load_list_data(sexes, 'insurance.csv', "sex")
load_list_data(bmis, 'insurance.csv', "bmi")
load_list_data(num_children, 'insurance.csv', "children")
load_list_data(smoker_statuses, 'insurance.csv', "smoker")
load_list_data(regions, 'insurance.csv', "region")
load_list_data(insurance_charges, 'insurance.csv', "charges");

In [152]:
# Convert numerical values from str to int/float

ages = [int(age) for age in ages]
bmis = [float(bmi) for bmi in bmis]
num_children = [int(n) for n in num_children]
insurance_charges = [float(charge) for charge in insurance_charges]
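
The loading and conversion steps above open and read the file once per column. As an alternative, here is a minimal sketch that reads every column in a single pass and converts types on the way. The names `load_columns` and `CONVERTERS` are hypothetical, and the inline rows are made up so that the example runs without **insurance.csv**.

```python
import csv
import io

# Made-up sample rows in the insurance.csv layout, used so the sketch runs on its own.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
30,female,22.5,0,no,southwest,4000.0
45,male,31.2,2,yes,southeast,21000.75
"""

# Assumed target type for each numeric column; all other columns stay as strings.
CONVERTERS = {"age": int, "bmi": float, "children": int, "charges": float}

def load_columns(file_obj):
    """Read every column in a single pass and return {column name: list of values}."""
    reader = csv.DictReader(file_obj)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            convert = CONVERTERS.get(name, str)
            columns[name].append(convert(value))
    return columns

data = load_columns(io.StringIO(SAMPLE))
print(data["age"], data["charges"])
```

The separate lists used in this project make each step easy to follow; the single-pass version mainly pays off when the file is large or has many columns.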

In [153]:
avg_charges = round(sum(insurance_charges) / len(insurance_charges), 2)
print(avg_charges)

13270.42


In [154]:
# Check average age
avg_age = sum(ages) / len(ages)
print("Average age of a client in this contract is " + str(round(avg_age, 2)) + " years.")

# Group clients by age and review age balance
age_under_30 = []
age_30_to_45 = []
age_45_to_60 = []
age_over_60 = []

for age in ages:
    if age < 30:
        age_under_30.append(age)
    elif age < 45:
        age_30_to_45.append(age)
    elif age < 60:
        age_45_to_60.append(age)
    else:
        age_over_60.append(age)
print("Under 30: " + str(len(age_under_30)) + 
      "\n30 to 45: " + str(len(age_30_to_45)) +
      "\n45 to 60: " + str(len(age_45_to_60)) +
      "\nOver 60: " + str(len(age_over_60)))

Average age of a client in this contract is 39.21 years.
Under 30: 417
30 to 45: 392
45 to 60: 415
Over 60: 114


The age structure of the sample corresponds to the structure of the population in this lunar county, so it does not raise any concerns.

Next, let's analyze the gender structure.

In [155]:
m = sexes.count("male")
f = sexes.count("female")
gender_ratio = round(m / f, 2)
male_percent = round((m / len(sexes)) * 100, 2)
female_percent = round((f / len(sexes)) * 100, 2)
print("Gender ratio of the insured settlement is " + str(gender_ratio) + " with " + str(male_percent) + "% of male and " + str(female_percent) + "% of female population")

Gender ratio of the insured settlement is 1.02 with 50.52% of male and 49.48% of female population


It seems that the organizers of the settlement have coped well with the challenges of gender equality!
However, the settlement's authorities use old software that does not include non-binary persons. Time to update!

Now let's take a look at where the settlers come from.

In [156]:
for region in set(regions):
    print(region, regions.count(region))

northeast 324
southwest 325
northwest 325
southeast 364


Almost equal here too! Maybe we'll manage to find something interesting if we dig deeper... 

In [157]:
region_smoker = list(zip(regions, smoker_statuses))
smokers = []
for region, status in region_smoker:
    if status == "yes":
        smokers.append(region)

total_smokers = len(smokers)    
smokers_ne = smokers.count("northeast")
smokers_sw = smokers.count("southwest")
smokers_nw = smokers.count("northwest")
smokers_se = smokers.count("southeast")
    
print("Total smokers:", total_smokers, "(",round(total_smokers / len(region_smoker) * 100, 2), "%).",
    "\nSmoker count per origin:",
    "\nNortheast:", smokers_ne,
     "\nSouthwest:", smokers_sw,
     "\nNorthwest:", smokers_nw,
     "\nSoutheast:", smokers_se)


Total smokers: 274 ( 20.48 %). 
Smoker count per origin: 
Northeast: 67 
Southwest: 58 
Northwest: 58 
Southeast: 91


The largest group of lunar smokers comes from the Southeast; well, it's the heaviest-smoking US region on Earth too... But a 20.5% share of smokers is almost twice as high as on Earth! Maybe the administration should launch a campaign to help people quit...
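
Raw counts can mislead here, because the Southeast also has the most settlers. A quick sketch of the smoker share within each region, using the counts printed in the two outputs above (the variable names are only illustrative):

```python
# Counts copied from the two outputs above: settlers and smokers per region.
region_totals = {"northeast": 324, "southwest": 325, "northwest": 325, "southeast": 364}
region_smokers = {"northeast": 67, "southwest": 58, "northwest": 58, "southeast": 91}

# Share of smokers within each region, in percent.
smoker_rates = {
    region: round(region_smokers[region] / region_totals[region] * 100, 2)
    for region in region_totals
}

# Print regions from the highest to the lowest smoker share.
for region, rate in sorted(smoker_rates.items(), key=lambda item: item[1], reverse=True):
    print(region, rate)
```

By this measure the Southeast still leads, with a quarter of its settlers smoking, so the ranking is not just an effect of group size.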

While it's the lunar doc's job to advise people on how to quit smoking, we can help the authorities with an informational campaign. Life expectancy, QALYs, and the Lee-Carter model are hard to explain to anyone but nerds, but there is something everyone understands: money!

In [158]:
smoker_status_charges = list(zip(smoker_statuses, insurance_charges))
smoker_charges = []
non_smoker_charges = []
for i in smoker_status_charges:
    if i[0] == 'yes':
        smoker_charges.append(i[1])
    else:
        non_smoker_charges.append(i[1])
avg_smoker_charges = sum(smoker_charges) / len(smoker_charges)
avg_non_smoker_charges = sum(non_smoker_charges) / len(non_smoker_charges)

print("A smoker pays an average of", round(avg_smoker_charges,2), "dollars for their insurance, while a non-smoker pays only", round(avg_non_smoker_charges, 2), "dollars."
     "\nQuitting smoking can save you", round(avg_smoker_charges - avg_non_smoker_charges, 2), "dollars!")
        

A smoker pays an average of 32050.23 dollars for their insurance, while a non-smoker pays only 8434.27 dollars.
Quitting smoking can save you 23615.96 dollars!
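
The same split-and-average pattern works for any categorical column (sex, region, smoker status). Below is a minimal sketch of a reusable helper; the name `average_by` is hypothetical and the sample values are made up, not taken from **insurance.csv**.

```python
from collections import defaultdict

def average_by(labels, values):
    """Group values by their label and return {label: mean of its values}."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return {label: round(sum(vals) / len(vals), 2) for label, vals in groups.items()}

# Made-up example values, not taken from insurance.csv.
sample_statuses = ["yes", "no", "no", "yes"]
sample_charges = [30000.0, 8000.0, 9000.0, 34000.0]

sample_averages = average_by(sample_statuses, sample_charges)
print(sample_averages)
```

With the project's own lists, a call such as `average_by(smoker_statuses, insurance_charges)` should give the same two averages as the cell above.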


Body weight is a sensitive topic to discuss, but it is a well-established risk factor for serious conditions and diseases. Let's help the lunar doc investigate the settlers' BMIs and come up with arguments in favor of normalizing their weight.

In [159]:
bmi_charges = list(zip(bmis, insurance_charges))
# Upper BMI bound (exclusive) of each group; the last group has no upper bound
bmi_groups = { "underweight" : 18.5,
                "normal" : 25,
                "overweight" : 30,
                "obese" : 35,
                "extremely obese" : float("inf")}
underweight = []
normal = []
overweight = []
obese = []
extremely_obese = []

for bmi in bmi_charges: 
    if bmi[0] < bmi_groups["underweight"]:
        underweight.append(bmi[1])
    elif bmi[0] < bmi_groups["normal"]:
        normal.append(bmi[1])
    elif bmi[0] < bmi_groups["overweight"]:
        overweight.append(bmi[1])
    elif bmi[0] < bmi_groups["obese"]:
        obese.append(bmi[1])
    else:
        extremely_obese.append(bmi[1])
        
print("BMI distribution:",
     "\nUnderweight:", len(underweight), "ppl."
     "\nNormal:", len(normal), "ppl."
     "\nOverweight:", len(overweight), "ppl."
     "\nObese:", len(obese), "ppl."
     "\nExtremely obese:", len(extremely_obese), "ppl.")

BMI distribution: 
Underweight: 20 ppl.
Normal: 225 ppl.
Overweight: 386 ppl.
Obese: 391 ppl.
Extremely obese: 316 ppl.
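
The chain of `elif` comparisons above can also be written as a lookup on sorted thresholds. This is a minimal sketch using `bisect_right` from the standard library; the names `THRESHOLDS`, `LABELS` and `bmi_category` are hypothetical, and the cut-offs are the ones used in the cell above.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of the first four groups, as used in the cell above.
THRESHOLDS = [18.5, 25, 30, 35]
LABELS = ["underweight", "normal", "overweight", "obese", "extremely obese"]

def bmi_category(bmi):
    """Return the group label for a BMI value."""
    # bisect_right counts how many thresholds are <= bmi, which is the label index.
    return LABELS[bisect_right(THRESHOLDS, bmi)]

for value in (17.0, 18.5, 27.3, 35.0):
    print(value, bmi_category(value))
```

A value that sits exactly on a boundary (for example 18.5 or 35) falls into the higher group, which matches the strict `<` comparisons in the original loop.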


Wow, it looks like limited physical activity in the colony has consequences! We'll send the data to the lunar doc and inform the administration that public policies and activities must be adopted to reduce settlers' body weight.
Let's play with the data a bit to see what we could present to the Committee.


In [160]:
overweight_percent = (len(overweight) + len(obese) + len(extremely_obese)) / len(bmi_charges) * 100
print(str(round(overweight_percent, 2)) + "% of the settlement have increased body weight.")

81.69% of the settlement have increased body weight.


In [161]:
def bmi_costs(group):
    average_costs_bmi = round(sum(group) / len(group), 2)
    return average_costs_bmi 
    
print("Average insurance expenses for settlers from Normal BMI group are " + str(bmi_costs(normal)) + "$.")
print("Average insurance expenses for settlers from Overweight BMI group are " + str(bmi_costs(overweight)) + "$.")
print("Average insurance expenses for settlers from Obese BMI group are " + str(bmi_costs(obese)) + "$.")
print("Average insurance expenses for settlers from Extremely Obese BMI group are " + str(bmi_costs(extremely_obese)) + "$.")



Average insurance expenses for settlers from Normal BMI group are 10409.34$.
Average insurance expenses for settlers from Overweight BMI group are 10987.51$.
Average insurance expenses for settlers from Obese BMI group are 14419.67$.
Average insurance expenses for settlers from Extremely Obese BMI group are 16953.82$.


The authorities also respond best to the language of money, so we'll go the extra mile to find even more convincing arguments for the program.

In [162]:
def bmi_savings(group):
    
    savings = bmi_costs(group) - bmi_costs(normal)
    total_savings = savings * len(group)
    return round(total_savings, 2)

print("Normalizing body weight in Overweight group would save up to " + str(bmi_savings(overweight)) + "$.")
print("Normalizing body weight in Obese group would save up to " + str(bmi_savings(obese)) + "$.")
print("Normalizing body weight in Extremely Obese group would save up to " + str(bmi_savings(extremely_obese)) + "$.")

print("Idea for leaflet: person with normal BMI pays average of $" + str(bmi_costs(normal)) + ", while person with high BMI pays average of $" + str(bmi_costs(extremely_obese)) + "."
      "\nYou can save up to $" + str(round(bmi_costs(extremely_obese) - bmi_costs(normal), 2)) + " by joining the weight control program!")

Normalizing body weight in Overweight group would save up to 223173.62$.
Normalizing body weight in Obese group would save up to 1568039.03$.
Normalizing body weight in Extremely Obese group would save up to 2068055.68$.
Idea for leaflet: person with normal BMI pays average of $10409.34, while person with high BMI pays average of $16953.82.
You can save up to $6544.48 by joining the weight control program!
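
Each total above is the gap between a group's average and the Normal group's average, multiplied by the group's size. The sketch below reproduces that arithmetic from the figures printed in the outputs above; the variable names are only illustrative.

```python
# Figures copied from the outputs above: (group size, average charges).
normal_avg = 10409.34
groups = {
    "overweight": (386, 10987.51),
    "obese": (391, 14419.67),
    "extremely obese": (316, 16953.82),
}

savings = {}
for name, (size, avg) in groups.items():
    # Per-person gap to the Normal group, scaled by the number of people in the group.
    savings[name] = round((avg - normal_avg) * size, 2)
    print(name, savings[name])

print("total", round(sum(savings.values()), 2))
```

These figures are upper bounds: they assume the whole difference in charges comes from BMI, while age and smoker status also differ between the groups.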


Finally, let's add all patient data to a nice and convenient dictionary to make it easier to access in the future.

In [163]:
patients_dictionary = dict()
patients_dictionary["age"] = ages
patients_dictionary["sex"] = sexes
patients_dictionary["bmi"] = bmis
patients_dictionary["children"] = num_children
patients_dictionary["smoker"] = smoker_statuses
patients_dictionary["region"] = regions
patients_dictionary["charges"] = insurance_charges
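
One way such a column-oriented dictionary could be used later is to rebuild per-settler records from it. This is a minimal sketch: the helper name `to_records` is hypothetical, and the two-person sample is made up, with the same shape as `patients_dictionary` but fewer columns.

```python
def to_records(columns):
    """Turn {column: [values]} into a list of per-person dictionaries."""
    names = list(columns)
    # zip(*...) walks all columns in parallel, yielding one tuple per person.
    return [dict(zip(names, row)) for row in zip(*columns.values())]

# Made-up two-person example in the same shape as patients_dictionary.
sample = {
    "age": [30, 45],
    "smoker": ["no", "yes"],
    "charges": [4000.0, 21000.75],
}

records = to_records(sample)
smoker_records = [r for r in records if r["smoker"] == "yes"]
print(smoker_records)
```

Keeping the data as columns suits the averages computed in this project, while per-person records are handier for filtering on several attributes at once.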