# U.S. Medical Insurance Costs

by Paul P.
This project was created with the <strong>U.S. Medical Insurance Costs</strong> [dataset](https://www.kaggle.com/mirichoi0218/insurance) <em>(insurance.csv)</em>.

## Project Scope
Questions to be answered:
1. Find out the <b>average age</b> of the patients in the dataset.
2. Analyze where a <b>majority of the individuals are from</b>.
3. Look at the different <b>costs between smokers vs. non-smokers.</b>
4. Figure out what the <b>average age is for someone who has at least one child</b> in this dataset.
5. Provide insight on the <b>cheapest and most expensive</b> insurance costs.

In [114]:
import csv

## Reading Dataset

First, I load the dataset with a helper function. Because the insurance data is tabular, I can call `load_list()` once per column to store that column's values in a list.

In [115]:
# creating variable lists
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

# creating a helper function to load the .csv dataset
def load_list(var_list, csv_file, column_title):
    with open(csv_file) as insurance_data:
        data = csv.DictReader(insurance_data)
        for row in data:
            var_list.append(row[column_title])


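Each column is stored in the appropriate list so we can make our desired calculations. Note that `csv.DictReader` returns every value as a string, so numeric columns must be converted before doing arithmetic on them.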

In [116]:
load_list(age, 'insurance.csv', 'age')
load_list(sex, 'insurance.csv', 'sex')
load_list(bmi, 'insurance.csv', 'bmi')
load_list(children, 'insurance.csv', 'children')
load_list(smoker, 'insurance.csv', 'smoker')
load_list(region, 'insurance.csv', 'region')
load_list(charges, 'insurance.csv', 'charges')
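
Calling `load_list()` seven times reopens and rereads the file for every column. As a rough sketch of an alternative, the hypothetical `load_columns()` helper below (not part of the original notebook) reads the rows once and fills a dictionary of lists. It uses a small in-memory sample built from the first two rows of the dataset instead of opening insurance.csv, so it runs on its own.

```python
import csv
import io

# Hypothetical helper, not part of the original notebook: reads the data once
# and returns a dictionary mapping each column title to its list of values.
def load_columns(file_obj):
    columns = {}
    for row in csv.DictReader(file_obj):
        for title, value in row.items():
            columns.setdefault(title, []).append(value)
    return columns

# The first two rows of the dataset, used here in place of insurance.csv.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

columns = load_columns(sample)
print(columns["age"])     # ['19', '18']
print(columns["region"])  # ['southwest', 'southeast']
```

With the real file, the same helper could be called inside `with open('insurance.csv') as insurance_data:`. The values are still strings, exactly as with `load_list()`.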

## Analyzing the Data

### 1. Average Age
The average patient age is calculated using `average_age()`. The function adds up all the ages and then divides the total by the number of patients in the list.

In [117]:
def average_age():
    # running total of all patient ages
    # (named total_age so it does not shadow the built-in sum())
    total_age = 0
    # number of patients in the list
    length_age = len(age)

    for element in age:
        # the CSV values are strings, so convert each one before adding it
        total_age += int(element)

    # average age = total of all ages divided by the number of patients
    mean_age = total_age / length_age
    return "The average patient age is: " + str(round(mean_age, 1)) + " years."

    

In [118]:
average_age()

'The average patient age is: 39.2 years.'
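
For comparison, the same calculation can be sketched with `statistics.mean()` from the standard library. The `sample_ages` list is an assumption made for illustration: it holds only the first five ages in the dataset rather than the full column, so its result differs from the 39.2 years reported above.

```python
from statistics import mean

# First five ages from the dataset, kept as strings the way csv.DictReader returns them.
sample_ages = ['19', '18', '28', '33', '32']

# Convert each value to int before averaging.
sample_average = mean(int(a) for a in sample_ages)

print("The average age in the sample is: " + str(round(sample_average, 1)) + " years.")
```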

### 2. Most popular regions
The three most frequent regions in our dataset can be calculated with our `popular_regions()` function. 

In [119]:
def popular_regions():
    unique_regions = []
    for element in region:
        if element not in unique_regions:
            unique_regions.append(element)

    region_occurences = {}

    for unique_region in unique_regions: 
        region_occurences[unique_region] = region.count(unique_region)

    sorted_regions = sorted(region_occurences, key=region_occurences.get, reverse=True)
    
    return "The most popular region is {first}. The second most popular region is {second}. The third most popular region is {third}.".format(first=sorted_regions[0], second=sorted_regions[1], third=sorted_regions[2])


In [120]:
popular_regions()

'The most popular region is southeast. The second most popular region is southwest. The third most popular region is northwest.'

In [121]:
unique_regions = []
def popular_regions():
    for element in region:
        if element not in unique_regions:
            unique_regions.append(element)

    region_occurences = {}

    for unique_region in unique_regions: 
        region_occurences[unique_region] = region.count(unique_region)
    return region_occurences

print(popular_regions())




{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
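
The counting step can also be sketched with `collections.Counter` from the standard library, whose `most_common()` method returns the entries ordered by frequency. The short `sample_regions` list below is made up for illustration and is not the real column.

```python
from collections import Counter

# Illustrative sample only; the real `region` list has 1,338 entries.
sample_regions = ['southwest', 'southeast', 'southeast',
                  'northwest', 'southeast', 'northwest']

region_counts = Counter(sample_regions)

# most_common(3) returns the three most frequent regions with their counts.
top_three = region_counts.most_common(3)
print(top_three)  # [('southeast', 3), ('northwest', 2), ('southwest', 1)]
```

`most_common()` lists entries with equal counts in the order they were first encountered. In the full counts printed above, southwest and northwest both appear 325 times, so their ranking as second and third reflects their order of appearance rather than a difference in frequency.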


### 3. Smokers versus non-smokers cost

To calculate the average cost for smokers and non-smokers, and the difference between the two groups, I will use the pandas library.

First, I separate the smoker data from the DataFrame into *smokers_only*. Then I group it by smoker status and calculate the mean charges of each group with `mean()`. Finally, I select each *Mean Cost* value with `iloc` and print the results.

In [122]:
# import the pandas library to view the dataset
import pandas as pd

# import and read the CSV file as a DataFrame
df = pd.read_csv('insurance.csv')

# print first 5 rows
df.head()


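|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |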


Here I separate the smoker data from the DataFrame by keeping only the `smoker` and `charges` columns.

In [123]:
smokers_only = df[['smoker', 'charges']]

smokers_only.head()

|   | smoker | charges |
|---|--------|---------|
| 0 | yes | 16884.924 |
| 1 | no | 1725.5523 |
| 2 | no | 4449.462 |
| 3 | no | 21984.47061 |
| 4 | no | 3866.8552 |


In [124]:
average_smoker = smokers_only.groupby('smoker').charges.mean().reset_index()

average_smoker.rename(columns={'smoker': 'Smoker', 'charges': 'Mean Cost'}, inplace=True)

average_smoker.head()


|   | Smoker | Mean Cost |
|---|--------|-----------|
| 0 | no | 8434.268298 |
| 1 | yes | 32050.231832 |


In [125]:
non_smoker_avg = average_smoker.iloc[0, 1]
smoker_avg = average_smoker.iloc[1,1]

print('The average cost for a smoker is ${smoker}'.format(smoker=smoker_avg.round(1)))
print('The average cost for non-smokers is ${non_smoker}'.format(non_smoker=non_smoker_avg.round(1)))

The average cost for a smoker is $32050.2
The average cost for non-smokers is $8434.3
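
The introduction to this section also mentions the difference between the two groups, which the cells above do not print. A minimal sketch of that last step, assuming the two mean costs are copied by hand from the *Mean Cost* output above (in the notebook they are already held in `smoker_avg` and `non_smoker_avg`):

```python
# Mean costs copied from the Mean Cost output above.
smoker_avg = 32050.231832
non_smoker_avg = 8434.268298

# Absolute and relative gap between the two groups.
difference = smoker_avg - non_smoker_avg
ratio = smoker_avg / non_smoker_avg

print('On average, smokers pay ${diff} more than non-smokers'.format(diff=round(difference, 1)))
print('That is about {ratio} times the non-smoker average'.format(ratio=round(ratio, 1)))
```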


### 4. Average age of someone with at least one child

In [126]:
df_children = df[['age', 'children']]
df_children.head()

|   | age | children |
|---|-----|----------|
| 0 | 19 | 0 |
| 1 | 18 | 1 |
| 2 | 28 | 3 |
| 3 | 33 | 0 |
| 4 | 32 | 0 |
