# U.S. Medical Insurance Costs

This is a Codecademy Portfolio Project that comes in the curriculum after the Python Fundamentals module. We are looking at medical insurance information and costs for patients in the United States and looking for any insights in that data.

That module includes functions, lists, loops, strings, dictionaries, classes, modules, and files.

After completing the Pandas module, I revisited this project to play with the data some more.

## Import the data set

Use the csv library.

In [1]:
import csv

## Look over your dataset
Information given about the dataset:

- There is no missing data.
- There are seven columns.
- Some columns are numerical while some are categorical.

This dataset contains medical insurance information about patients in the United States, including:
- their age
- sex (male or female)
- body mass index (BMI)
- number of children they have
- whether they are a smoker (yes or no)
- what region they live in (southwest, southeast, northwest, northeast)
- how much they are charged for insurance per year

In [2]:
with open('insurance.csv') as insurance_file:
    print(insurance_file.read())

age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
31,female,25.74,0,no,southeast,3756.6216
46,female,33.44,1,no,southeast,8240.5896
37,female,27.74,3,no,northwest,7281.5056
... deleted some rows to make it more readable...
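
The claims listed above (seven columns, no missing data) can be checked directly rather than taken on trust. The following is a minimal sketch, assuming the same column layout as `insurance.csv`. The helper name `check_dataset` and the in-memory `sample` text are illustrative only; the sample reuses the first rows shown above so the code runs on its own.

```python
import csv
import io

# First rows of insurance.csv, copied from the output above (illustrative subset).
sample = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def check_dataset(csv_file):
    """Return (column names, row count, number of rows with a missing field)."""
    reader = csv.DictReader(csv_file)
    row_count = 0
    rows_with_gaps = 0
    for row in reader:
        row_count += 1
        # A short row yields None for the missing fields; an empty cell yields ''.
        if any(value is None or value == '' for value in row.values()):
            rows_with_gaps += 1
    return reader.fieldnames, row_count, rows_with_gaps

columns, row_count, rows_with_gaps = check_dataset(io.StringIO(sample))
print(len(columns), row_count, rows_with_gaps)
```

With the real file, `open('insurance.csv', newline='')` would take the place of the `io.StringIO` object.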



## Scoping the Project

What do we want to analyze?  Average age? Where the majority of individuals are from? Difference in costs between smokers and non-smokers? How many smokers are in each region?

We might first want to see if the data is skewed in any particular direction. What's the breakdown of males vs. females and smokers vs. non-smokers, and how evenly are the different regions covered?

We could also look at this from a social justice standpoint. Is there a difference in what males and females are charged for insurance? Do charges differ by region?

## Save your dataset via Python variables

Save the features of your dataset (the columns) from insurance.csv by storing them in variables that can be used for analysis. 

As you consider what types of variables to use and how many you plan to create, think ahead about the parameters you wish to investigate and how your organization will impact this analysis.

In [3]:
# lists of all the columns, so we can do some counting.
age = []
sex = []
bmi = []
num_children = []
smoker_status = []
region = []
charges = []

def make_lists(lst, csv_file, col_name):
    # Append every value from one column of the CSV file to lst.
    with open(csv_file, newline = '') as list_file:
        records_reader = csv.DictReader(list_file)
        for row in records_reader:
            lst.append(row[col_name])
    return lst

make_lists(age, 'insurance.csv', 'age')
make_lists(sex, 'insurance.csv', 'sex')
make_lists(bmi, 'insurance.csv', 'bmi')
make_lists(num_children, 'insurance.csv', 'children')
make_lists(smoker_status, 'insurance.csv', 'smoker')
make_lists(region, 'insurance.csv', 'region')
make_lists(charges, 'insurance.csv', 'charges')

# And a list of all the records, one dictionary per row
list_of_records = []
with open('insurance.csv', newline = '') as insurance_file:
    records_reader = csv.DictReader(insurance_file)
    for row in records_reader:
        list_of_records.append(row)
        
num_records = len(list_of_records)
num_records

1338

In [8]:
list_of_records[0:5]

[{'age': '19',
  'sex': 'female',
  'bmi': '27.9',
  'children': '0',
  'smoker': 'yes',
  'region': 'southwest',
  'charges': '16884.924'},
 {'age': '18',
  'sex': 'male',
  'bmi': '33.77',
  'children': '1',
  'smoker': 'no',
  'region': 'southeast',
  'charges': '1725.5523'},
 {'age': '28',
  'sex': 'male',
  'bmi': '33',
  'children': '3',
  'smoker': 'no',
  'region': 'southeast',
  'charges': '4449.462'},
 {'age': '33',
  'sex': 'male',
  'bmi': '22.705',
  'children': '0',
  'smoker': 'no',
  'region': 'northwest',
  'charges': '21984.47061'},
 {'age': '32',
  'sex': 'male',
  'bmi': '28.88',
  'children': '0',
  'smoker': 'no',
  'region': 'northwest',
  'charges': '3866.8552'}]
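
Every value above is still a string, and the column lists were built by opening the file once per column. As an alternative, here is a minimal sketch that reads the file once and converts each column to a suitable type. The names `COLUMN_TYPES` and `load_columns` are hypothetical, and the `sample` text stands in for `insurance.csv` so the code runs on its own.

```python
import csv
import io

# First rows of insurance.csv, copied from the output above (illustrative subset).
sample = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

# Hypothetical mapping from column name to the type its values should have.
COLUMN_TYPES = {
    'age': int, 'sex': str, 'bmi': float, 'children': int,
    'smoker': str, 'region': str, 'charges': float,
}

def load_columns(csv_file):
    """Read the file once and return {column name: list of converted values}."""
    columns = {name: [] for name in COLUMN_TYPES}
    for row in csv.DictReader(csv_file):
        for name, convert in COLUMN_TYPES.items():
            columns[name].append(convert(row[name]))
    return columns

columns = load_columns(io.StringIO(sample))
print(columns['age'])
print(columns['charges'])
```

Keeping the columns in one dictionary means a later function can take a column name instead of a separate list variable.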

## Build out analysis functions or class methods

In [11]:
# What's the breakdown of males vs. females, smokers vs. non-smokers, 
# how are the different areas covered?

print('Number of patients who are: Males: ' + str(sex.count('male')), 'Females: ' + str(sex.count('female')) + "\n")

print('Number of patients who are a: Smoker: ' + str(smoker_status.count('yes')), 'Non-smoker: ' + str(smoker_status.count('no')) + "\n")

print('Number of patients in each region of the US:\n'
      'Northeast: ' + str(region.count('northeast')),
      'Southeast: ' + str(region.count('southeast')), 
      'Northwest: ' + str(region.count('northwest')),
      'Southwest: ' + str(region.count('southwest'))
     )

Number of patients who are: Males: 676 Females: 662

Number of patients who are a: Smoker: 274 Non-smoker: 1064

Number of patients in each region of the US:
Northeast: 324 Southeast: 364 Northwest: 325 Southwest: 325


The spread is fairly even, except for smoker status, where non-smokers outnumber smokers roughly four to one (which I think we'd hope to see). The southeastern states also have slightly more coverage than the other regions.
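
One scoping question, how many smokers are in each region, needs two columns counted together. Here is a minimal sketch using `collections.Counter` from the standard library. The two lists are illustrative values taken from the first eight rows of the file; in the notebook, the full `region` and `smoker_status` lists could be used instead.

```python
from collections import Counter

# Illustrative values from the first eight rows of insurance.csv shown earlier.
region = ['southwest', 'southeast', 'southeast', 'northwest',
          'northwest', 'southeast', 'southeast', 'northwest']
smoker_status = ['yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

# Pair the two columns row by row and count each (region, smoker) combination.
counts = Counter(zip(region, smoker_status))

for (area, smokes), n in sorted(counts.items()):
    print(area, smokes, n)
```

A `Counter` returns 0 for combinations it never saw, so `counts[('northeast', 'yes')]` is safe to look up even in this small sample.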

What is the average age?  What is the average insurance cost?  What is average bmi?

It would be interesting to look at range and standard deviation, but that is outside the scope of this project (and beyond what was taught in Python Fundamentals).


In [12]:
def avg_lst(lst):
    total_lst = 0
    for value in lst:
        total_lst += float(value)
    return round(total_lst / len(lst), 2)

avg_age = avg_lst(age)

avg_charge = avg_lst(charges)

avg_bmi = avg_lst(bmi)

print('Average age: ' + str(avg_age))
print('Average charges: ' + str(avg_charge))
print('Average bmi: ' +  str(avg_bmi))

Average age: 39.21
Average charges: 13270.42
Average bmi: 30.66
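
Range and standard deviation were set aside above as out of scope. For readers who want them anyway, here is a minimal sketch using the standard library's `statistics` module. The helper name `spread` is hypothetical, and `sample_charges` holds only the charges from the first eight rows of the file, so the printed numbers describe that subset, not the full dataset.

```python
import statistics

# Charges from the first eight rows of insurance.csv shown earlier (illustrative subset).
sample_charges = ['16884.924', '1725.5523', '4449.462', '21984.47061',
                  '3866.8552', '3756.6216', '8240.5896', '7281.5056']

def spread(lst):
    """Return (minimum, maximum, sample standard deviation) of a list of numeric strings."""
    values = [float(v) for v in lst]
    return min(values), max(values), round(statistics.stdev(values), 2)

low, high, std_dev = spread(sample_charges)
print('Range: ' + str(low) + ' to ' + str(high))
print('Standard deviation: ' + str(std_dev))
```

`statistics.stdev` computes the sample standard deviation; `statistics.pstdev` would be the choice if the list were treated as the whole population.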


## With Pandas 

With pandas, we can load `insurance.csv` into a DataFrame instead of building dictionaries and lists by hand.

In [13]:
import pandas as pd

Once the data is loaded, we can compare groups of patients across age, bmi, children, smoker, region, and charges.

In [14]:
patient_records = pd.read_csv('insurance.csv')
patient_records.head()

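|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |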


In [15]:
patient_count = len(patient_records)
patient_count

1338

In [17]:
females = patient_records[patient_records.sex == 'female']
males = patient_records[patient_records.sex == 'male']
print(len(females), len(males))

662 676


In [19]:
female_avg_age = females.age.mean()
male_avg_age = males.age.mean()
print(round(female_avg_age), round(male_avg_age))

40 39


The average ages of males and females are very close. If they were not, that might indicate the data was not equitably collected.

In [24]:
female_avg_cost = round(females.charges.mean(), 2)
male_avg_cost = round(males.charges.mean(), 2)
print('Female average cost: ' + str(female_avg_cost) + "\n"
      'Male average cost: ' + str(male_avg_cost) + "\n")

def most_cost(female, male):
    if female > male:
        diff_f_m_cost = female - male
        print('Females are charged ' + str(diff_f_m_cost) + ' more.')
    else:
        diff_f_m_cost = male - female
        print('Males are charged ' + str(diff_f_m_cost) + ' more.')
        
most_cost(female_avg_cost, male_avg_cost)

Female average cost: 12569.58
Male average cost: 13956.75

Males are charged 1387.17 more.
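
The same comparison can be made without filtering each group by hand, and it extends to the charges-per-region question from the scoping section. Here is a minimal sketch using `DataFrame.groupby`. The `sample` text and the name `patient_sample` are illustrative stand-ins for `insurance.csv` and `patient_records`, so the printed averages describe only these six rows.

```python
import io
import pandas as pd

# First rows of insurance.csv, copied from the output near the top (illustrative subset).
sample = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
31,female,25.74,0,no,southeast,3756.6216
"""
patient_sample = pd.read_csv(io.StringIO(sample))

# Mean charges for each value of a categorical column, rounded to cents.
avg_cost_by_sex = patient_sample.groupby('sex')['charges'].mean().round(2)
avg_cost_by_region = patient_sample.groupby('region')['charges'].mean().round(2)

print(avg_cost_by_sex)
print(avg_cost_by_region)
```

In the notebook itself, the same two `groupby` lines would apply to `patient_records`.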


How do average charges compare for patients whose BMI is below or above the average?

In [26]:
avg_bmi = round(patient_records.bmi.mean(), 1)
avg_bmi

30.7

In [36]:
# Split the patients at the average BMI computed above (30.7).
high_bmi = patient_records[patient_records.bmi >= avg_bmi]
low_bmi = patient_records[patient_records.bmi < avg_bmi]

high_bmi_avg_cost = round(high_bmi.charges.mean(), 2)
low_bmi_avg_cost = round(low_bmi.charges.mean(), 2)

print('Patients with BMI above the average of 30.7 pay ' + str(high_bmi_avg_cost) + ' in medical insurance costs. \n')
print('Patients with BMI below the average pay ' + str(low_bmi_avg_cost) + '.\n')

def most_cost(high, low):
    if high > low:
        diff_bmi_cost = high - low
        print('Patients with above average BMI are charged ' + str(diff_bmi_cost) + ' more.')
    else:
        diff_bmi_cost = low - high
        print('Patients with below average BMI are charged ' + str(diff_bmi_cost) + ' more.')
        
most_cost(high_bmi_avg_cost, low_bmi_avg_cost)

Patients with BMI above the average of 30.7 pay 15772.48 in medical insurance costs. 

Patients with BMI below the average pay 10969.39.

Patients with above average BMI are charged 4803.09 more.
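
Splitting patients at the average BMI compares two group means, which is not the same as measuring how closely BMI and charges move together. For that, a correlation coefficient is one option. Here is a minimal sketch using `Series.corr`, which computes the Pearson coefficient by default. The `sample` text and the name `patient_sample` are illustrative stand-ins for the full data, so the result says nothing about the real dataset.

```python
import io
import pandas as pd

# First rows of insurance.csv, copied from the output near the top (illustrative subset).
sample = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
31,female,25.74,0,no,southeast,3756.6216
"""
patient_sample = pd.read_csv(io.StringIO(sample))

# Pearson correlation between the two numeric columns; the value lies between -1 and 1.
bmi_charge_corr = patient_sample['bmi'].corr(patient_sample['charges'])
print(round(bmi_charge_corr, 2))
```

In the notebook, `patient_records['bmi'].corr(patient_records['charges'])` would give the coefficient for all 1338 patients.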
