# U.S. Medical Insurance Costs

By Paul P.<br>
This project was created with the <strong>U.S. Medical Insurance Costs</strong> [dataset](https://www.kaggle.com/mirichoi0218/insurance) <em>(insurance.csv)</em>.

## Project Scope
Questions to be answered:
1. Find out the <b>average age</b> of the patients in the dataset.
2. Analyze where a <b>majority of the individuals are from</b>.
3. Compare the <b>costs for smokers vs. non-smokers</b>.
4. Figure out what the <b>average age is for someone who has at least one child</b> in this dataset.
5. Provide insight on <b>how Sex influences insurance costs.</b>

In [23]:
import csv

## Reading Dataset

First, I load the dataset with a helper function. Because the insurance data is tabular, I only have to call `load_list()` once per column to store that column's values in a list.

In [24]:
# creating variable lists
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

# creating a helper function to load the .csv dataset
def load_list(var_list, csv_file, column_title):
    with open(csv_file) as insurance_data:
        data = csv.DictReader(insurance_data)
        for row in data:
            var_list.append(row[column_title])


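Each column is stored in the appropriate list so we can make our desired calculations. Note that `csv.DictReader` returns every value as a string, so numeric columns have to be converted before doing arithmetic.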

In [25]:
load_list(age, 'insurance.csv', 'age')
load_list(sex, 'insurance.csv', 'sex')
load_list(bmi, 'insurance.csv', 'bmi')
load_list(children, 'insurance.csv', 'children')
load_list(smoker, 'insurance.csv', 'smoker')
load_list(region, 'insurance.csv', 'region')
load_list(charges, 'insurance.csv', 'charges')
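
The seven calls above reopen and reread the file once per column. As a rough alternative, the sketch below reads the file a single time and converts each value while loading. The function name `load_columns`, its `converters` argument, and the temporary sample file are assumptions made for this illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile


def load_columns(csv_file, converters):
    """Read csv_file once and return {column name: list of converted values}."""
    columns = {name: [] for name in converters}
    with open(csv_file, newline="") as handle:
        for row in csv.DictReader(handle):
            for name, convert in converters.items():
                columns[name].append(convert(row[name]))
    return columns


# Small stand-in file with the same header as insurance.csv
# (the two data rows are the first two rows of the dataset).
sample = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Map each wanted column to the type its values should be converted to.
columns = load_columns(sample_path, {"age": int, "region": str, "charges": float})
os.remove(sample_path)

print(columns["age"])     # [19, 18]
print(columns["region"])  # ['southwest', 'southeast']
```

With the real file, the call would presumably look like `load_columns('insurance.csv', {...})` with one entry per column of interest.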

## Analyzing the Data

### 1. Average Age
The average patient age is calculated using `average_age()`. The function adds up all the ages and then divides the total by the number of patients in the list.

In [26]:
def average_age():
    # running total of all patient ages
    # (named 'total' so the built-in sum() is not shadowed)
    total = 0
    # number of patients in the list
    length_age = len(age)

    for element in age:
        # ages were loaded as strings, so convert before adding
        total += int(element)

    # divide the total by the number of patients to get the average
    average = total / length_age
    return "The average patient age is: " + str(round(average, 1)) + " years."

    

In [27]:
average_age()

'The average patient age is: 39.2 years.'
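
The same calculation can be written more compactly with the built-in `sum()` and `len()`. This is only a sketch: the helper name `average_of` and the five sample ages (the first five ages in the dataset) are illustrative assumptions, so the printed value differs from the full-dataset result above.

```python
def average_of(values):
    """Average a list of numeric strings, the form in which load_list() stores them."""
    return sum(int(value) for value in values) / len(values)


# First five ages in the dataset, kept as strings as in the 'age' list.
sample_ages = ["19", "18", "28", "33", "32"]

sample_average = average_of(sample_ages)
print(round(sample_average, 1))  # 26.0
```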

### 2. Most popular regions
The three most frequent regions in our dataset can be calculated with our `popular_regions()` function. 

In [28]:
def popular_regions():
    # create an empty list to store our regions
    unique_regions = []

    # find every distinct region and append it to our list
    for element in region:
        if element not in unique_regions:
            unique_regions.append(element)

    # create an empty dict to store each region with its number of occurrences
    region_occurrences = {}

    # for each region in our list, store the region as key and its count as value
    for unique_region in unique_regions:
        region_occurrences[unique_region] = region.count(unique_region)

    # list the region names from most to least frequent
    sorted_regions = sorted(region_occurrences, key=region_occurrences.get, reverse=True)

    return "The most popular region is {first}. The second most popular region is {second}. The third most popular region is {third}." \
    .format(first=sorted_regions[0], second=sorted_regions[1], third=sorted_regions[2])


In [29]:
popular_regions()

'The most popular region is southeast. The second most popular region is southwest. The third most popular region is northwest.'

In [44]:
# similar to the function above, but returns the count for every region
def popular_regions_values():
    # keep the list local so repeated calls start from a clean state
    unique_regions = []
    for element in region:
        if element not in unique_regions:
            unique_regions.append(element)

    region_occurrences = {}

    for unique_region in unique_regions:
        region_occurrences[unique_region] = region.count(unique_region)
    return region_occurrences

print(popular_regions_values())




{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
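
The counting done by hand above can also be sketched with `collections.Counter` from the standard library. The list `sample_regions` is rebuilt here from the counts printed above so that the example runs on its own; in the notebook, the `region` list would take its place.

```python
from collections import Counter

# Rebuild a region list from the counts printed above (stand-in for the 'region' list).
counts_from_output = {"southwest": 325, "southeast": 364, "northwest": 325, "northeast": 324}
sample_regions = [name for name, n in counts_from_output.items() for _ in range(n)]

# Counter tallies every distinct value; most_common(3) returns the three
# most frequent (region, count) pairs in descending order of count.
region_counts = Counter(sample_regions)
top_three = region_counts.most_common(3)

print(top_three[0])  # ('southeast', 364)
```

Note that southwest and northwest are tied at 325 rows each, so their relative ranking as "second" and "third" reflects ordering rather than a real difference.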


### 3. Smokers versus non-smokers cost

To calculate the average cost for smokers and non-smokers, and the difference between both groups, I will use the **pandas** library.

First, I separate the smoker data from the DataFrame into `df_smokers`. Then I group it by smoker status and calculate each group's average charge with `mean()`. Finally, I select each *average cost* with `iloc` and print the results.

In [31]:
# import the pandas library to view the dataset
import pandas as pd

# import and read the CSV file as a DataFrame
df = pd.read_csv('insurance.csv')

# print first 5 rows
df.head()


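|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |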


Here I separate the smoker and charges columns from the DataFrame.

In [32]:
df_smokers = df[['smoker', 'charges']]

# print first 5 rows
df_smokers.head()

|   | smoker | charges |
|---|--------|---------|
| 0 | yes | 16884.924 |
| 1 | no | 1725.5523 |
| 2 | no | 4449.462 |
| 3 | no | 21984.47061 |
| 4 | no | 3866.8552 |


In [33]:
# group by smoker type, then calculate average cost 
average_smoker = df_smokers.groupby('smoker').charges.mean().reset_index()

# rename column title to represent average cost
average_smoker.rename(columns={'smoker': 'Smoker', 'charges': 'average Cost'}, inplace=True)

# print first 5 rows
average_smoker.head()


|   | Smoker | average Cost |
|---|--------|--------------|
| 0 | no | 8434.268298 |
| 1 | yes | 32050.231832 |


In [34]:
# assign the non-smoker average (row 0, column 1 of the table) to a variable
non_smoker_avg = average_smoker.iloc[0, 1]

# assign the smoker average (row 1, column 1 of the table) to a variable
smoker_avg = average_smoker.iloc[1, 1]

# print our results
print('The average cost for a smoker is ${smoker}'.format(smoker=smoker_avg.round(1)))
print('The average cost for non-smokers is ${non_smoker}'.format(non_smoker=non_smoker_avg.round(1)))

The average cost for a smoker is $32050.2
The average cost for non-smokers is $8434.3
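
The difference between the two groups follows directly from these two averages. A minimal sketch, with the values copied by hand from the grouped table above rather than read from the DataFrame; the names `difference` and `ratio` are introduced only for this illustration.

```python
# Averages copied from the grouped table above.
non_smoker_avg = 8434.268298
smoker_avg = 32050.231832

# Absolute gap between the group averages, and how many times larger
# the smoker average is than the non-smoker average.
difference = smoker_avg - non_smoker_avg
ratio = smoker_avg / non_smoker_avg

print('On average, smokers are charged ${:.1f} more than non-smokers.'.format(difference))
print('That is about {:.1f} times the non-smoker average.'.format(ratio))
```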


### 4. Average age of someone with at least one child

To calculate the average age of someone with at least one child, I will again use the **pandas** library. We keep only the rows where `children` is greater than 0 and then call `mean()` on the `age` column.

In [35]:
# select our desired subset
df_children = df[['age', 'children']]

# only include people who have at least 1 child
df_children = df_children[df_children.children > 0]

# print first 5 rows
df_children.head()

|   | age | children |
|---|-----|----------|
| 1 | 18 | 1 |
| 2 | 28 | 3 |
| 6 | 46 | 1 |
| 7 | 37 | 3 |
| 8 | 37 | 2 |


In [36]:
# assign the average age to a variable
# (named 'average_parent_age' so it does not overwrite the average_age() function defined earlier)
average_parent_age = df_children.age.mean()

# print results
print('The average age for someone with at least one child is {age}'.format(age = round(average_parent_age, 1)))

The average age for someone with at least one child is 39.8
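
For comparison, the same filter-then-average step can be sketched in plain Python with a list comprehension. The `sample_rows` pairs are the `(age, children)` values of the first five rows of the dataset, so the result here differs from the full-dataset figure above.

```python
# (age, children) pairs for the first five rows of the dataset.
sample_rows = [(19, 0), (18, 1), (28, 3), (33, 0), (32, 0)]

# Keep only the ages of people with at least one child.
parent_ages = [age for age, children in sample_rows if children > 0]

# Average the remaining ages.
sample_parent_average = sum(parent_ages) / len(parent_ages)
print(round(sample_parent_average, 1))  # 23.0
```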


### 5. Influence of Sex on insurance costs

To provide insight on the influence of Sex on insurance costs, we will again use **pandas**. This time we will use `agg()`, which computes several summary statistics in one call. Here it gives us the mean, minimum, and maximum insurance cost grouped by Sex.

In [47]:
# using aggregate to calculate the mean, minimum, and maximum costs
cost_per_sex = df.groupby('sex').agg({'charges': ['mean', 'min', 'max']})

# rename our columns to represent our table
cost_per_sex.columns = ['cost_mean', 'cost_min', 'cost_max']

# reset index to default indexing
cost_per_sex = cost_per_sex.reset_index()

# print first 5 rows
cost_per_sex.head()

|   | sex | cost_mean | cost_min | cost_max |
|---|-----|-----------|----------|----------|
| 0 | female | 12569.578844 | 1607.5101 | 63770.42801 |
| 1 | male | 13956.751178 | 1121.8739 | 62592.87309 |
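
To make the grouping step more concrete, here is a rough plain-Python sketch of what a group-by followed by mean, min, and max does. It uses only the `(sex, charges)` values of the first five rows of the dataset, so its numbers do not match the full table above; the names `charges_by_sex` and `summary` are assumptions made for this example.

```python
from collections import defaultdict

# (sex, charges) pairs for the first five rows of the dataset.
sample_rows = [
    ("female", 16884.924),
    ("male", 1725.5523),
    ("male", 4449.462),
    ("male", 21984.47061),
    ("male", 3866.8552),
]

# Step 1: split the charges into one list per group.
charges_by_sex = defaultdict(list)
for sex, charge in sample_rows:
    charges_by_sex[sex].append(charge)

# Step 2: apply each aggregate function to every group.
summary = {
    sex: {
        "cost_mean": sum(values) / len(values),
        "cost_min": min(values),
        "cost_max": max(values),
    }
    for sex, values in charges_by_sex.items()
}

for sex, stats in summary.items():
    print(sex, stats)
```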


In [38]:
# round dataset
sex_agg = cost_per_sex.round(2)

# print our results (row 0 is female, row 1 is male; columns 1 to 3 are mean, min, max)
print('The average insurance cost for females is ${}. The average insurance cost for males is ${}. \
The minimum cost for females is ${}. The max cost for females is ${}. \
The minimum cost for males is ${}. The max cost for males is ${}.'.\
format(sex_agg.iloc[0,1], sex_agg.iloc[1,1], sex_agg.iloc[0,2], sex_agg.iloc[0,3], sex_agg.iloc[1,2], sex_agg.iloc[1,3]))

The average insurance cost for females is $12569.58. The average insurance cost for males is $13956.75. The minimum cost for females is $1607.51. The max cost for females is $63770.43. The minimum cost for males is $1121.87. The max cost for males is $62592.87.


Finally, I show which Sex has the higher average medical insurance cost, and by how much.

In [39]:
# only our average male cost
mean_m = sex_agg.iloc[1,1]

# only our average female cost
mean_f = sex_agg.iloc[0,1]

# difference between the male and female averages
# (rounded so floating-point error does not show up in the printed amount)
sex_diff = round(mean_m - mean_f, 2)

# print results
print('The average cost of medical insurance is ${} higher for males than females.'.format(sex_diff))

The average cost of medical insurance is $1387.17 higher for males than females.
