# U.S. Medical Insurance Cost Analysis
---

### The Objective

This project aims to build a _predictive model_ that estimates an individual's total insurance charges from inputs such as BMI, age, sex, number of children, smoker status, and region. The model will be a multiple linear regression on these variables, with the dollar value of charges as the response.
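
One way to write such a model is shown below. This is a sketch under assumptions: the categorical inputs are dummy (0/1) coded, and `female`, non-smoker, and `northeast` are chosen here as hypothetical reference levels; the coefficient symbols are introduced only for illustration.

$$
\text{charges}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{bmi}_i + \beta_3\,\text{children}_i + \beta_4\,\text{male}_i + \beta_5\,\text{smoker}_i + \sum_{r \in \{\text{NW},\,\text{SE},\,\text{SW}\}} \gamma_r\,\text{region}_{r,i} + \varepsilon_i
$$

Under this coding, each coefficient is the expected change in charges for a one-unit change in its variable (or a switch away from the reference level) with the other inputs held fixed.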

_Some of the suggested extensions from the original objective provided by Codecademy:_

* Organize your findings into dictionaries, lists, or another convenient datatype.
* Make predictions about what features are the most influential for an individual’s medical insurance charges based on your analysis.
* Explore areas where the data may include bias and how that would impact potential use cases.

_We look for further insights using:_ pandas, NumPy, scikit-learn, Matplotlib, and R

### The Data

This project analyzes data provided by the [Codecademy Pro Data Science course](https://github.com/dannyinpyoung/Data-Science-Portfolio/tree/main/Portfolio%20Project). The data is a CSV file containing 1338 observations of the following variables: age, sex, bmi, children (integer indicating number of children), smoker ('yes' or 'no'), region (southeast, southwest, northeast, northwest), and charges (float).

The project experiments with several data types and dataframes, and will grow as I progress through further courses.

During data cleaning, we applied the following steps:
* `charges` is rounded to 2 decimal places.
* Charges that were read in as strings are converted to floats.
* `children` and `age` are converted to integers.
* The string variables `region`, `smoker`, and `sex` are checked so that they contain only the expected values.
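
The cleaning rules above can be sketched for a single row as follows. This is a minimal illustration, not the notebook's own code: `clean_row` and the `VALID_*` constants are hypothetical names, and the sample row is made up rather than taken from the dataset.

```python
VALID_REGIONS = {"northeast", "northwest", "southeast", "southwest"}
VALID_SMOKER = {"yes", "no"}
VALID_SEX = {"female", "male"}


def clean_row(row):
    """Return a cleaned copy of one raw CSV row (all values arrive as strings)."""
    def checked(value, allowed):
        # Values outside the expected set are flagged rather than silently kept.
        return value if value in allowed else "n/a"

    return {
        "age": int(row["age"]),
        "sex": checked(row["sex"], VALID_SEX),
        "bmi": float(row["bmi"]),
        "children": int(row["children"]),
        "smoker": checked(row["smoker"], VALID_SMOKER),
        "region": checked(row["region"], VALID_REGIONS),
        "charges": round(float(row["charges"]), 2),
    }


# Illustrative row, not taken from the dataset; "midwest" is deliberately invalid.
raw = {"age": "30", "sex": "male", "bmi": "25.5", "children": "2",
       "smoker": "no", "region": "midwest", "charges": "1234.5678"}
cleaned = clean_row(raw)
print(cleaned)
```

Marking unexpected categories as `"n/a"` keeps the row count at 1338 while making bad values easy to count afterwards.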

### Key Insights



---

_Importing Packages_

In [1]:
import numpy as np
import pandas as pd
import csv

# 1. Importing and Cleaning The Data

This section explores several ways to import the data and defines functions that convert it to different datatypes. We look for inconsistent datatypes and abnormal values using the minimum, maximum, mean, and number of observations of each variable. We then clean the data so that every variable has the expected datatype and values.

### (a) Method 1: Generate Lists 

In [2]:
#TODO: Initialize List
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

#TODO: Import Files and Assign To Respective List
def import_data(file, variable, lst):
    with open(file, newline = '') as insurance_csv: 
        insurance_information = csv.DictReader(insurance_csv)
        for information in insurance_information: 
            lst.append(information[variable])
        return lst

#TODO: Assign imported data to respective variable. 
age = import_data("insurance.csv", "age", age)
sex = import_data("insurance.csv", "sex", sex)
bmi = import_data("insurance.csv", "bmi", bmi)
children = import_data("insurance.csv", "children", children)
smoker = import_data("insurance.csv", "smoker", smoker) 
region = import_data("insurance.csv", "region", region)
charges = import_data("insurance.csv", "charges", charges)


# Summary statistics (observations, min, max, mean) to check for abnormalities.
def data_check(lst, name):
    total_ = 0.00
    print("Summary Statistics: {}".format(name))
    for i in lst:
        # Only numeric entries contribute to the mean; strings are skipped,
        # so string columns report a mean of 0.0.
        if not isinstance(i, str):
            total_ += i
    observations, min_, max_, mean_ = len(lst), min(lst), max(lst), round(total_/len(lst), 2)
    return "Observations: {}\nMin: {}\nMax: {}\nMean {}\n".format(observations, min_, max_, mean_)

# Display summary statistics. 
#print(data_check(age,"age"))
#print(data_check(sex, "sex"))
#print(data_check(bmi, "bmi"))
#print(data_check(children, "children"))
#print(data_check(smoker, "smoker"))
#print(data_check(region, "region"))
#print(data_check(charges, "charges"))

# The code below cleans the data of type mismatches, inconsistencies, and abnormal values.

charges = [round(float(charge),2) for charge in charges]
bmi = [round(float(bmi_),2) for bmi_ in bmi]
age = [int(age_) for age_ in age]
children = [int(children_) for children_ in children]

def clean_region(region):
    regions_ = ["northeast", "northwest", "southeast", "southwest"]
    # Walk the list by index and replace any value outside the allowed set.
    for i in range(len(region)):
        if region[i] not in regions_:
            region[i] = "n/a"
    return region

def clean_smoker(smoker):
    smoker_ = ["yes", "no"]
    # Walk the list by index and replace any value outside the allowed set.
    for i in range(len(smoker)):
        if smoker[i] not in smoker_:
            smoker[i] = "n/a"
    return smoker

def clean_sex(sex):
    sex_ = ["female", "male"]
    # Walk the list by index and replace any value outside the allowed set.
    for i in range(len(sex)):
        if sex[i] not in sex_:
            sex[i] = "n/a"
    return sex

region = clean_region(region)
smoker = clean_smoker(smoker)
sex = clean_sex(sex)

# Display summary statistics after cleaning the data. 
print(data_check(age,"age"))
print(data_check(sex, "sex"))
print(data_check(bmi, "bmi"))
print(data_check(children, "children"))
print(data_check(smoker, "smoker"))
print(data_check(region, "region"))
print(data_check(charges, "charges"))

Summary Statistics: age
Observations: 1338
Min: 18
Max: 64
Mean 39.21

Summary Statistics: sex
Observations: 1338
Min: female
Max: male
Mean 0.0

Summary Statistics: bmi
Observations: 1338
Min: 15.96
Max: 53.13
Mean 30.66

Summary Statistics: children
Observations: 1338
Min: 0
Max: 5
Mean 1.09

Summary Statistics: smoker
Observations: 1338
Min: no
Max: yes
Mean 0.0

Summary Statistics: region
Observations: 1338
Min: northeast
Max: southwest
Mean 0.0

Summary Statistics: charges
Observations: 1338
Min: 1121.87
Max: 63770.43
Mean 13270.42
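
Method 1 reopens the file once for each of the seven columns. A single-pass variant is sketched below; it is an alternative under assumptions, not the notebook's method. `import_columns` is a hypothetical helper, and the in-memory sample stands in for `insurance.csv` with made-up values.

```python
import csv
import io

# Illustrative stand-in for insurance.csv (values are made up, not from the dataset).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,male,25.5,2,no,southeast,1234.5678
45,female,31.2,0,yes,northwest,20000.1
"""


def import_columns(csv_file):
    """Read an open CSV file once and return {column name: list of string values}."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = import_columns(io.StringIO(SAMPLE_CSV))
age_sample = [int(a) for a in columns["age"]]  # values are strings until converted
print(age_sample)
print(columns["region"])
```

With the real file, the same helper could be called as `import_columns(f)` inside `with open("insurance.csv", newline="") as f:`.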



### (b) Method 2: Generate Dictionary

In [3]:
import csv 

# Initialize dictionary
empty_dict = {}
# Function takes in a csv file and returns a dictionary that indexes each row, starting from 0.
def import_data(file, insurance_dict): 
    with open(file, newline = '') as insurance_csv: 
        insurance_information = csv.DictReader(insurance_csv)
        unique_id = 0
        for row in insurance_information: 
            insurance_dict[unique_id] = row
            unique_id += 1 
    return insurance_dict

insurance_dict = import_data("insurance.csv", empty_dict)
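
Each value in the resulting dictionary is itself a dictionary keyed by column name, and every field is still a string at this point. The sketch below shows how a row can be read back and a column converted, using `enumerate` in place of a manual counter. The names `rows_by_id` and `charges_sample` are hypothetical, and the sample text stands in for `insurance.csv` with made-up values.

```python
import csv
import io

# Illustrative stand-in for insurance.csv (values are made up, not from the dataset).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,male,25.5,2,no,southeast,1234.5678
45,female,31.2,0,yes,northwest,20000.1
"""

# enumerate() supplies the 0-based row index that a manual counter would track.
rows_by_id = {i: row for i, row in enumerate(csv.DictReader(io.StringIO(SAMPLE_CSV)))}

first = rows_by_id[0]
print(first["age"], type(first["age"]))  # values are strings until converted

# Pull one column out of the nested dictionary and convert it.
charges_sample = [round(float(row["charges"]), 2) for row in rows_by_id.values()]
print(charges_sample)
```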


### (c) Method 3: Class
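
One possible shape for a class-based import is sketched here, purely as an assumption about how this method could look. The `InsuranceRecord` name, its fields, and the `from_row` constructor are hypothetical, and the sample row is made up.

```python
from dataclasses import dataclass


@dataclass
class InsuranceRecord:
    age: int
    sex: str
    bmi: float
    children: int
    smoker: str
    region: str
    charges: float

    @classmethod
    def from_row(cls, row):
        """Build a record from one csv.DictReader row (all values are strings)."""
        return cls(
            age=int(row["age"]),
            sex=row["sex"],
            bmi=float(row["bmi"]),
            children=int(row["children"]),
            smoker=row["smoker"],
            region=row["region"],
            charges=round(float(row["charges"]), 2),
        )


# Illustrative row, not taken from the dataset.
record = InsuranceRecord.from_row(
    {"age": "45", "sex": "female", "bmi": "31.2", "children": "0",
     "smoker": "yes", "region": "northwest", "charges": "20000.1"}
)
print(record)
```

Keeping each observation in one object preserves the link between a person's age, BMI, and charges, which separate per-column lists only keep as long as none of them is reordered.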

# 2. Exploring The Data
### (a) More Summary Statistics 
We look at the mean, median, standard deviation, maximum, and minimum of each numerical variable. We then count the frequency of each value in the string variables. This section computes the summary statistics with classes and functions native to Python 3. We will use NumPy in the next section.

In [18]:
# Note that we don't use NumPy functions for summary stats for the sake of exercise. We will use them in later projects

# Class of summary statistics for numeric data.
class Summarystats:

    # Calculate the mean.
    def mean(self, lst):
        return round(sum(lst)/len(lst), 4)

    # Calculate the median on a sorted copy, so the caller's list keeps its
    # original row order. Handles both even and odd lengths.
    def median(self, lst):
        ordered = sorted(lst)
        mid = len(ordered)//2
        if len(ordered) % 2 == 0:
            return (ordered[mid] + ordered[mid - 1])/2
        return ordered[mid]

    # Calculate the sample standard deviation (n - 1 in the denominator).
    def std_dev(self, lst):
        mean_ = float(sum(lst)/len(lst))  # computed once, outside the loop
        total_diff = 0.00
        for i in lst:
            total_diff += (i - mean_)**2
        return round((total_diff/(len(lst)-1))**(1/2), 4)

    # Calculate the maximum.
    def maximum(self, lst):
        return max(lst)

    # Calculate the minimum.
    def minimum(self, lst):
        return min(lst)

# Instantiate the summary-stats object.
sum_stats = Summarystats()

# Frequency counts for string data.

def frequency_region(region):
    ne, se, sw, nw = 0, 0, 0, 0
    for place in region: 
        if place == "northeast":
            ne += 1
        elif place == "northwest": 
            nw += 1
        elif place == "southeast":
            se += 1 
        elif place == "southwest": 
            sw += 1
    return ne, se, sw, nw

def frequency_sex(sex): 
    f,m = 0,0
    for gender in sex: 
        if gender == "male":
            m += 1 
        else: 
            f += 1
    return f,m
    
# Print the frequency of sex as (female, male), then the summary stats.
print(frequency_sex(sex))
print("Age Summary Stats\nObservations: 1338\nMean: {}\nMedian: {}\nStandard Deviation: {}\nMinimum: {}\nMaximum: {}\n".format(sum_stats.mean(age), sum_stats.median(age), sum_stats.std_dev(age), sum_stats.minimum(age), sum_stats.maximum(age)))
print("Charges Summary Stats\nObservations: 1338\nMean: {}\nMedian: {}\nStandard Deviation: {}\nMinimum: {}\nMaximum: {}\n".format(sum_stats.mean(charges), sum_stats.median(charges), sum_stats.std_dev(charges), sum_stats.minimum(charges), sum_stats.maximum(charges)))
print("Children Summary Stats\nObservations: 1338\nMean: {}\nMedian: {}\nStandard Deviation: {}\nMinimum: {}\nMaximum: {}\n".format(sum_stats.mean(children), sum_stats.median(children), sum_stats.std_dev(children), sum_stats.minimum(children), sum_stats.maximum(children)))
print("BMI Summary Stats\nObservations: 1338\nMean: {}\nMedian: {}\nStandard Deviation: {}\nMinimum: {}\nMaximum: {}\n".format(sum_stats.mean(bmi), sum_stats.median(bmi), sum_stats.std_dev(bmi), sum_stats.minimum(bmi), sum_stats.maximum(bmi)))

(662, 676)
Age Summary Stats
Observations: 1338
Mean: 39.207
Median: 39.0
Standard Deviation: 14.05
Minimum: 18
Maximum: 64

Charges Summary Stats
Observations: 1338
Mean: 13270.4223
Median: 9382.029999999999
Standard Deviation: 12110.0113
Minimum: 1121.87
Maximum: 63770.43

Children Summary Stats
Observations: 1338
Mean: 1.0949
Median: 1.0
Standard Deviation: 1.2055
Minimum: 0
Maximum: 5

BMI Summary Stats
Observations: 1338
Mean: 30.6635
Median: 30.4
Standard Deviation: 6.098
Minimum: 15.96
Maximum: 53.13
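
The hand-written statistics can be cross-checked against the standard library. The sketch below uses `statistics` for the numeric summaries and `collections.Counter` for string frequencies; the sample values are made up for illustration and are not taken from the dataset.

```python
import statistics
from collections import Counter

# Illustrative values, not taken from the dataset.
ages = [19, 23, 31, 46, 52, 64]
regions = ["southeast", "northwest", "southeast", "northeast", "southwest", "southeast"]

mean_age = round(statistics.mean(ages), 4)
median_age = statistics.median(ages)          # averages the two middle values for even n
stdev_age = round(statistics.stdev(ages), 4)  # sample standard deviation (n - 1)

region_counts = Counter(regions)              # frequency of each distinct string

print(mean_age, median_age, stdev_age)
print(region_counts)
```

`Counter` counts whatever values appear, so an unexpected category shows up as its own key instead of being folded into another count.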



* __We note that NumPy and Pandas allow for more efficient code__. Below, we show how to retrieve the same summary statistics using NumPy and Pandas. Recall that both modules were imported near the beginning of the file.

### (b) More Summary Statistics (NumPy and Pandas) 
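
A minimal sketch of the same statistics with pandas and NumPy follows. It assumes a small made-up DataFrame in place of the real data; in the notebook the frame would more likely come from `pd.read_csv("insurance.csv")`.

```python
import numpy as np
import pandas as pd

# Illustrative values, not taken from the dataset.
df = pd.DataFrame({
    "age": [19, 23, 31, 46, 52, 64],
    "sex": ["female", "male", "male", "female", "male", "female"],
})

# pandas: describe() reports count, mean, std (sample, ddof=1), min, quartiles, max.
print(df["age"].describe())
print(df["sex"].value_counts())  # frequency of each string value

# NumPy: np.std defaults to the population formula (ddof=0), so pass ddof=1
# to match the n - 1 denominator used in Summarystats.std_dev.
ages = df["age"].to_numpy()
np_mean = np.mean(ages)
np_median = np.median(ages)
np_std = np.std(ages, ddof=1)
print(np_mean, np_median, np_std)
```

The `ddof` difference is the main thing to watch when comparing results: pandas' `Series.std()` uses `ddof=1` by default, while `np.std` uses `ddof=0`.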