# U.S. Medical Insurance Cost Analysis
---

## The Objective

This project aims to develop a __supervised predictive model__ that estimates an individual's total insurance cost (the __target variable__) from __features__ such as BMI, age, sex, number of children, smoking status, and region. 

We will first explore the data by analyzing summary statistics and linear regressions, and by comparing each feature's relationship with the __target variable__ against our predictions. 

#####  We will then __develop the model__ by following this framework: 
1. Define the performance metric ($R^2$)
2. Analyze goodness of fit
3. Shuffle and split the data (subsampling and testing) 
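
As a minimal sketch of the performance metric named in step 1, the coefficient of determination compares the model's squared error with the error of always predicting the mean. The function name `r_squared` and the sample values are hypothetical, chosen only for illustration; the project itself may rely on a library implementation instead.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical charges and predictions, for illustration only.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]
score = r_squared(actual, predicted)
print(round(score, 4))
```

A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than predicting the mean charge for everyone.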

##### After creating the model, we __test its performance__ by conducting a bias-variance tradeoff analysis using the following methods: 
1. Learning curves (Training vs Testing MSE)
2. Complexity curves (Training vs Validation) 

##### We will then finish off by __evaluating the model performance__ using the following techniques: 
1. Grid Search 
2. Cross validation

We will then wrap things up by using our model, comparing it to an optimal model, and then applying it to make some predictions.

## The Data

This project analyzes data provided by [Codecademy Pro's Data Science course](https://github.com/dannyinpyoung/Data-Science-Portfolio/tree/main/Portfolio%20Project). The data is a CSV file with records for 1338 unique individuals and the following variables: age (integer), sex (string, "female" or "male"), bmi (float), children (integer, the number of children), smoker (string, "yes" or "no"), region (string: "southeast", "southwest", "northeast", or "northwest"), and charges (float). 

The project experiments with multiple data types and data structures, and it will be extended as I progress through further courses.

During the data-cleaning process, we transformed the data so that: 
* `charges` is rounded to 2 decimal places. 
* Any `charges` values that were entered as strings are converted to floats. 
* `children` and `age` are all converted to integers. 
* The string variables `region`, `smoker`, and `sex` contain no values outside their allowed sets.  

## Key Insights



---

### Importing Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

# 1. Cleaning The Data (Using CSV Module) 

This section explores multiple ways to import the data and creates functions that convert the data to different datatypes. We look for inconsistent datatypes and abnormal values using minimums, maximums, counts, and the number of observations. We then clean the data to ensure the expected datatypes and values while removing abnormalities. 

## Importing and Cleaning The Data

### (a) Method 1: Lists

In [2]:
#TODO: Initialize List
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

#TODO: Import Files and Assign To Respective List
def import_data(file, variable, lst):
    with open(file, newline = '') as insurance_csv: 
        insurance_information = csv.DictReader(insurance_csv)
        for information in insurance_information: 
            lst.append(information[variable])
        return lst

#TODO: Assign imported data to respective variable. 
age = import_data("insurance.csv", "age", age)
sex = import_data("insurance.csv", "sex", sex)
bmi = import_data("insurance.csv", "bmi", bmi)
children = import_data("insurance.csv", "children", children)
smoker = import_data("insurance.csv", "smoker", smoker) 
region = import_data("insurance.csv", "region", region)
charges = import_data("insurance.csv", "charges", charges)

#TODO: Clean the data of type mismatches, inconsistencies, and abnormal values. 
charges = [round(float(charge),2) for charge in charges]
bmi = [round(float(bmi_),2) for bmi_ in bmi]
age = [int(age_) for age_ in age]
children = [int(children_) for children_ in children]

def clean_region(region):
    regions_ = ["northeast", "northwest", "southeast", "southwest"]
    # Replace any value outside the allowed set with "n/a".
    for i, value in enumerate(region):
        if value not in regions_:
            region[i] = "n/a"
    return region

def clean_smoker(smoker):
    smoker_ = ["yes", "no"]
    # Replace any value outside the allowed set with "n/a".
    for i, value in enumerate(smoker):
        if value not in smoker_:
            smoker[i] = "n/a"
    return smoker

def clean_sex(sex):
    sex_ = ["female", "male"]
    # Replace any value outside the allowed set with "n/a".
    for i, value in enumerate(sex):
        if value not in sex_:
            sex[i] = "n/a"
    return sex

# TODO: create cleaned list. 
region = clean_region(region)
smoker = clean_smoker(smoker)
sex = clean_sex(sex)

#TODO: Summary Statistics: Mean, Min, Max, and Observations to check for abnormalities. 
def data_check(lst,name):
    total_ = 0.00 
    print("Summary Statistics: {}".format(name))
    # Sum only the numeric entries; a list of strings reports a mean of 0.0.
    for i in lst: 
        if not isinstance(i, str): 
            total_+= i
    observations, min_, max_, mean_= len(lst), min(lst), max(lst), round(total_/len(lst),2)
    return ("Observations: {}\nMin: {}\nMax: {}\nMean: {}\n".format(observations, min_, max_,mean_))

# Display summary statistics. 
#print(data_check(age,"age"))
#print(data_check(sex, "sex"))
#print(data_check(bmi, "bmi"))
#print(data_check(children, "children"))
#print(data_check(smoker, "smoker"))
#print(data_check(region, "region"))
#print(data_check(charges, "charges"))

### (b) Method 2: Generate Dictionary

In [3]:
import csv 

# Initialize dictionary
empty_dict = {}
# Function takes in a csv file and returns a dictionary that indexes each row, starting from 0.
def import_data(file, insurance_dict): 
    with open(file, newline = '') as insurance_csv: 
        insurance_information = csv.DictReader(insurance_csv)
        unique_id = 0
        for row in insurance_information: 
            insurance_dict[unique_id] = row
            unique_id += 1 
    return insurance_dict

insurance_dict = import_data("insurance.csv", empty_dict)
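
Because `csv.DictReader` returns every field as a string, a row pulled from this dictionary still needs type conversion before it can be used in calculations. A rough sketch, using a two-row inline sample in place of `insurance.csv`; the helper `convert_row` and the names `rows_by_id` and `typed_rows` are assumptions for illustration:

```python
import csv
import io

# Hypothetical two-row sample standing in for insurance.csv.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

# Index each row from 0, as in the function above.
rows_by_id = {i: row for i, row in enumerate(csv.DictReader(sample_csv))}

def convert_row(row):
    """Return a copy of the row with numeric fields cast to numbers."""
    return {
        **row,
        "age": int(row["age"]),
        "bmi": float(row["bmi"]),
        "children": int(row["children"]),
        "charges": round(float(row["charges"]), 2),
    }

typed_rows = {i: convert_row(row) for i, row in rows_by_id.items()}
print(typed_rows[0]["charges"])
```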


### (c) Method 3: NumPy and Pandas

In [8]:
insurance_data = pd.read_csv("insurance.csv")

age = insurance_data['age']
sex = insurance_data['sex']
bmi = insurance_data['bmi']
children = insurance_data['children']
smoker = insurance_data['smoker']
region = insurance_data['region']
charges = insurance_data['charges']

print("The insurance dataset has {} data points with {} variables each.".format(*insurance_data.shape))
# TODO: Observe the data.
insurance_data.head(10)

The insurance dataset has 1338 data points with 7 variables each.


|   |age|sex   |bmi   |children|smoker|region   |charges    |
|---|---|------|------|--------|------|---------|-----------|
|0  |19 |female|27.9  |0       |yes   |southwest|16884.924  |
|1  |18 |male  |33.77 |1       |no    |southeast|1725.5523  |
|2  |28 |male  |33.0  |3       |no    |southeast|4449.462   |
|3  |33 |male  |22.705|0       |no    |northwest|21984.47061|
|4  |32 |male  |28.88 |0       |no    |northwest|3866.8552  |
|5  |31 |female|25.74 |0       |no    |southeast|3756.6216  |
|6  |46 |female|33.44 |1       |no    |southeast|8240.5896  |
|7  |37 |female|27.74 |3       |no    |northwest|7281.5056  |
|8  |37 |male  |29.83 |2       |no    |northeast|6406.4107  |
|9  |60 |female|25.84 |0       |no    |northwest|28923.13692|


# 2. Exploring The Data
## Summary Statistics
We begin by observing some summary statistics for the numerical variables. The code that calculates them appears further down; for convenience, we present the results here: 

|Variable Name|Mean    |Median  |Std Dev  |Min      |Max       |
|-------------|--------|------  |-------  |---------|----------|
|Age          |39.207  |39.0    |14.0447  |18       |64        |
|Charges      |13270.42|9382.033|12105.485|1121.8739|63770.4280|
|BMI          |30.6634 |30.4    |6.0959   |15.96    |53.13     |
|Children     |1.0949  |1.0     |1.205    |0        |5         | 

Now, to look at some summary statistics of the string variables: 

|Region    |Northwest|Southeast|Northeast|Southwest| 
|----------|---------|---------|---------|---------| 
|Frequency |324      |325      |364      |325      |
|% of Total|24.22%   |24.29%   |27.20%   |24.29%   |

|Sex       |Male |Female |
|----------|-----|-------|
|Frequency |676  |662    |
|% of Total|50.5%|49.5%  |

|Smoker    |Yes      |No       |
|----------|---------|---------|
|Frequency |274      |1064     |
|% of Total|20.48%   |79.52%   |

Some of the __Key Insights__ from these statistics are: 

* The average insured individual in this dataset is 39 years old and has insurance costs of \$13270.42. 
* Insured individuals have 1 child on average and tend to be overweight to obese (the mean BMI is 30.66, and a BMI above 25.0 counts as overweight). 
* The average share of the U.S. population that smoked in 2016 was 15.5\%. At 20.48\%, smokers are overrepresented among the insured individuals in this dataset. 
* Although the sexes are almost equally represented and the female smoking rate was low in 2016 (around 13.5\%), the aggregate smoking rate in the data is 20.5\%. This suggests looking more closely at smoking rates by sex and at the impact each variable has for each sex. 
* Individuals located in the northeast have roughly 3 percentage points higher insurance take-up than each of the other regions. This may be due to accessibility, population, or other unobserved factors. We may want to explore whether costs differ among regions, particularly for the northeast. 
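
One way to follow up on the smoking-rate observation is to compute the share of smokers within each sex. The sketch below does this in plain Python on a small made-up sample; the function name `smoking_rate_by_sex` and the sample values are hypothetical, and with the real data the `sex` and `smoker` lists loaded earlier would be passed instead.

```python
from collections import defaultdict

def smoking_rate_by_sex(sex, smoker):
    """Return {sex: share of smokers} for two parallel lists."""
    totals = defaultdict(int)
    smokers = defaultdict(int)
    for s, smokes in zip(sex, smoker):
        totals[s] += 1
        if smokes == "yes":
            smokers[s] += 1
    return {s: smokers[s] / totals[s] for s in totals}

# Made-up sample, for illustration only.
sample_sex = ["male", "female", "male", "female", "male"]
sample_smoker = ["yes", "no", "no", "yes", "yes"]
rates = smoking_rate_by_sex(sample_sex, sample_smoker)
print(rates)
```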

Below is the code that was used to generate summary statistics. We explore two methods: Lists, and NumPy functions using Pandas dataframe.

### (a) Generating Summary Statistics Using LIST Data
We look at the mean, median, standard deviation, maximum, and minimum of each numerical variable. We then observe the frequencies of the string variables. This section computes the summary statistics using classes and functions native to Python 3. We will use NumPy in the next section. 

In [None]:
# Note that we don't use NumPy functions for summary stats for the sake of exercise. We will use them in later projects

# TODO: Create class for summary stats for numeric data. 
class Summarystats:     

    #TODO: Calculate Mean
    def mean(self,lst): 
        return round(sum(lst)/len(lst),4)

    #TODO: Calculate Median. Sort a copy so the caller's list keeps its original row order.
    def median(self, lst):
        ordered = sorted(lst)
        mid = len(ordered) // 2
        if len(ordered) % 2 == 0:
            return (ordered[mid] + ordered[mid - 1]) / 2
        return ordered[mid]

    #TODO: Calculate Standard Deviation (sample version, dividing by n - 1)
    def std_dev(self, lst): 
        mean_ = float(sum(lst)/len(lst))
        total_diff = 0.00
        for i in lst: 
            total_diff += (i - mean_)**2 
        return round((total_diff/(len(lst)-1))**(1/2),4)
    
    #TODO: Calculate Maximum 
    def maximum(self, lst):
        return max(lst)
    
    #TODO: Calculate Minimum 
    def minimum(self, lst):
        return min(lst)

# TODO: Instantiate sum_stats variable. 
sum_stats = Summarystats()

# TODO: Create functions that count frequencies for string data. 

def frequency_region(region, name):
    ne, se, sw, nw = 0, 0, 0, 0
    for place in region: 
        if place == "northeast":
            ne += 1
        elif place == "northwest": 
            nw += 1
        elif place == "southeast":
            se += 1 
        elif place == "southwest": 
            sw += 1
    return print("""
    {} Frequencies: 
        Northeast: {}
        Northwest: {}
        Southeast: {}
        Southwest: {}""".format(name, ne, nw, se, sw))

def frequency_sex(sex,name): 
    f,m = 0,0
    for gender in sex: 
        if gender == "male":
            m += 1 
        else: 
            f += 1
    return print("""
    {} Frequencies: 
        Male: {}
        Female: {}""".format(name, m, f))

def frequency_smoker(smoker, name): 
    y,n = 0,0
    for person in smoker: 
        if person == "yes":
            y += 1 
        else: 
            n += 1
    return print("""
    {} Frequencies: 
        Yes: {}
        No: {}""".format(name, y, n))


# TODO: Create function that generates summary stat upon input of list and name of variable. 
def data_check(lst,name):
    obs, mean, median, std, min_, max_ = len(lst), sum_stats.mean(lst), sum_stats.median(lst), sum_stats.std_dev(lst), min(lst), max(lst)
    return ("""
{} Summary Stats
    Observations: {}
    Mean: {}
    Median: {}
    Standard Deviation: {}
    Minimum: {}
    Maximum: {}""".format(name,obs,mean,median,std,min_,max_))

# Print Summary Statistics. 
#print(data_check(age, "Age"))
#print(data_check(bmi, "BMI"))
#print(data_check(children, "Children"))
#print(data_check(charges, "Charges"))
frequency_smoker(smoker, "Smokers")
frequency_sex(sex, "Gender")
frequency_region(region, "Regions")
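
The three frequency functions above follow the same pattern, so a single helper could cover all of them. A minimal sketch using `collections.Counter` from the standard library; the name `frequency_table` and the sample list are assumptions for illustration:

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent of total)} for a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {k: (n, round(100 * n / total, 2)) for k, n in counts.items()}

# Made-up sample, for illustration only.
sample_region = ["southwest", "southeast", "southeast", "northwest"]
table = frequency_table(sample_region)
for category, (count, percent) in sorted(table.items()):
    print("{}: {} ({}%)".format(category, count, percent))
```

Unlike the hand-written counters, this version also reports unexpected labels such as "n/a" as their own category instead of folding them into an `else` branch.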



### (b) Generating Summary Statistics (Using NumPy and Pandas)

* __We note that NumPy and Pandas allow for more efficient code__. Below, we show how to retrieve the same summary statistics using NumPy and Pandas. Recall that the modules were imported near the beginning of the file.

In [5]:
def summary_stat(lst, name): 
    # np.std defaults to the population standard deviation (ddof=0).
    print("""
{} Summary Stats
    Observations: {}
    Mean: {}
    Median: {}
    Standard Deviation: {}
    Minimum: {}
    Maximum: {}""".format(name, len(lst), round(np.mean(lst),4), round(np.median(lst), 4), round(np.std(lst), 4),min(lst),max(lst)))    

summary_stat(age, "Age")
summary_stat(charges, "Charges")
summary_stat(bmi, "BMI")
summary_stat(children, "Children")


Age Summary Stats
    Observations: 1338
    Mean: 39.207
    Median: 39.0
    Standard Deviation: 14.0447
    Minimum: 18
    Maximum: 64

Charges Summary Stats
    Observations: 1338
    Mean: 13270.4223
    Median: 9382.03
    Standard Deviation: 12105.485
    Minimum: 1121.87
    Maximum: 63770.43

BMI Summary Stats
    Observations: 1338
    Mean: 30.6635
    Median: 30.4
    Standard Deviation: 6.0957
    Minimum: 15.96
    Maximum: 53.13

Children Summary Stats
    Observations: 1338
    Mean: 1.0949
    Median: 1.0
    Standard Deviation: 1.205
    Minimum: 0
    Maximum: 5
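
Note that the two methods use different standard-deviation formulas: the `std_dev` method in (a) divides by n - 1 (sample standard deviation), whereas `np.std` divides by n by default (population standard deviation). The sketch below illustrates the difference with the standard library's `statistics` module on a made-up list; the values are hypothetical.

```python
import statistics

values = [18, 25, 39, 51, 64]  # made-up ages, for illustration only

population_sd = statistics.pstdev(values)  # divides by n, like np.std's default
sample_sd = statistics.stdev(values)       # divides by n - 1, like std_dev in (a)

print(round(population_sd, 4), round(sample_sd, 4))
```

With 1338 observations the gap between the two is small, but it is enough to change the last printed digits.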
