# U.S. Medical Insurance Costs

## Scope

In this notebook we analyze data gathered from individuals across the United States, compiled in insurance.csv. The data includes each individual's age, sex, BMI, number of children, smoking status, region in the United States, and insurance cost. This notebook will mostly consist of goals suggested by Codecademy, but I will add more analysis as I move through the Data Scientist path.

## Goals 

* Find out the average age of the patients in the dataset
* Analyze where a majority of the individuals are from
* Compare the costs for smokers and non-smokers
* Figure out what the average age is for someone who has at least one child in this dataset

----------------------

### Importing and Organizing Data 
I start by importing the csv library, then create an empty list for each of the 7 columns, and finally read the file and append each row's values to the matching list. This will help me with some of the goals listed above.

In [1]:
import csv

In [2]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

In [3]:
with open('insurance.csv', newline = '') as insurance_csv:
    insurance_data = csv.DictReader(insurance_csv)
    for row in insurance_data:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])

In the line below I checked that all my lists have the same length, since Codecademy stated that there was no missing data. I can't assume this check will carry over to other projects.

In [4]:
print(len(age) == len(sex) == len(bmi) == len(children) == len(smoker) == len(region) == len(charges))

True
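Because every row appends exactly one value to each list, the length comparison above holds for any file, so on its own it cannot reveal missing entries. Below is a minimal sketch of a stricter check, assuming missing cells show up as empty strings or `None` (the default `csv.DictReader` fill for short rows). The `count_missing` helper and the inline sample data are hypothetical and only stand in for insurance.csv.

```python
import csv
import io

# Hypothetical sample standing in for insurance.csv; the second row has an empty bmi cell.
sample_csv = io.StringIO(
    "age,sex,bmi\n"
    "19,female,27.9\n"
    "18,male,\n"
)

def count_missing(rows):
    """Return a dict mapping each column name to its number of missing cells."""
    missing = {}
    for row in rows:
        for key, value in row.items():
            # DictReader uses None for absent trailing fields and '' for empty ones.
            if value is None or value.strip() == '':
                missing[key] = missing.get(key, 0) + 1
    return missing

missing_counts = count_missing(csv.DictReader(sample_csv))
print(missing_counts)
```

An empty dictionary would mean no missing cells were found under these assumptions.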


---------
### Average Amount of Numerical Categories

Even though we only wanted to find the average age of individuals, the function created below can also find the average number of children and the average charge per individual.

In [5]:
def average_category(factor):
    total_factor = 0
    for x in factor:
        total_factor += float(x)
    avg_factor = total_factor / len(factor)
    return avg_factor

print(average_category(age))
print(average_category(children))
print(average_category(charges))

39.20702541106129
1.0949177877429
13270.422265141257


- The average age of our individuals is 39
- The average number of children per individual is 1
- The average charge for insurance per individual is $13,270.42
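The values read by `csv.DictReader` are strings, which is why the function converts each one with `float`. As an alternative sketch, the list can be converted once and passed to `statistics.mean` from the standard library. The sample values below are hypothetical and are not taken from insurance.csv.

```python
from statistics import mean

# Hypothetical sample values, stored as strings the way csv.DictReader returns them.
sample_ages = ['19', '18', '28', '33']

# Convert once, then reuse the numeric list for any further statistics.
numeric_ages = [float(value) for value in sample_ages]
average_age = mean(numeric_ages)
print(average_age)
```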

----------
### Region by Number of Individuals

Below we determine how many individuals in our data live in each region of the United States, using two methods.

#### Method 1

The first method could be used if we already knew what regions are included in the data.

In [6]:
def majority_region(location):
    northeast = 0
    northwest = 0
    southeast = 0
    southwest = 0
    for x in location: 
        if x == 'northeast':
            northeast += 1
        if x == 'northwest':
            northwest += 1
        if x == 'southeast':
            southeast += 1
        if x == 'southwest':
            southwest += 1
    return('There are {ne} individuals from the Northeast, {nw} from the Northwest, {se} from the Southeast, and {sw} from the Southwest'.format(ne=northeast, nw=northwest, se=southeast, sw=southwest))

print(majority_region(region))

There are 324 individuals from the Northeast, 325 from the Northwest, 364 from the Southeast, and 325 from the Southwest


#### Method 2
This second method can be used if we don't know the regions ahead of time: it finds them in the data and counts each one without repeating any.

In [7]:
region_count = {}
for place in region:
    if place not in region_count:
        region_count[place] = 1
    else:
        region_count[place] += 1

print(region_count)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
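The counting loop in Method 2 can also be written with `collections.Counter` from the standard library, which builds the same kind of tally and can report the most common entry directly. This is a sketch on a small hypothetical list, not the notebook's `region` data.

```python
from collections import Counter

# Hypothetical sample list standing in for the region column.
sample_regions = ['southwest', 'southeast', 'southeast',
                  'northwest', 'northeast', 'southeast']

# Counter tallies each distinct value, like the manual dictionary loop.
region_counter = Counter(sample_regions)

# most_common(1) returns a list holding the single (value, count) pair with the highest count.
top_region, top_count = region_counter.most_common(1)[0]
print(dict(region_counter))
print(top_region, top_count)
```

Applied to the notebook's `region` list, `Counter(region)` should give the same counts as `region_count`.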


-------
### Difference in Cost by Sex and Smoking Status

This is a simple comparison of average insurance costs, first between males and females, then between smokers and non-smokers.

In [8]:
male = 0
female = 0
for x in sex: 
    if x == 'male':
        male += 1 
    else:
        female +=1 
print('There are {male} males and {female} females in our data.'.format(male = male, female = female))

There are 676 males and 662 females in our data.


In [9]:
smoker_count = 0
non_smoker_count = 0
for x in smoker: 
    if x == 'yes':
        smoker_count += 1 
    else:
        non_smoker_count +=1 
print('There are {smoker_count} smokers and {non_smoker_count} non-smokers in our data.'.format(smoker_count = smoker_count, non_smoker_count = non_smoker_count))
    
    

There are 274 smokers and 1064 non-smokers in our data.


In [10]:
def comparing_avg_costs(group_cost, factor1, factor2):
    with open('insurance.csv') as file: 
        f_dict = csv.DictReader(file)
        factor1_total_charge = 0
        factor1_total = 0
        factor2_total_charge = 0
        factor2_total = 0 
        for row in f_dict:
            if row[group_cost] == factor1:
                factor1_total_charge += float(row['charges'])
                factor1_total += 1 
            elif row[group_cost] == factor2:
                factor2_total_charge += float(row['charges'])
                factor2_total += 1 
                
        avg_factor1_cost = round(factor1_total_charge / factor1_total, 2)   
        avg_factor2_cost = round(factor2_total_charge / factor2_total, 2)
        difference_cost = avg_factor1_cost - avg_factor2_cost
        
        print ('if {group_cost}: {factor1}, individual pays on average ${avg_factor1_cost}.'.format(group_cost=group_cost, factor1=factor1, avg_factor1_cost=avg_factor1_cost))
        print ('if {group_cost}: {factor2}, individual pays on average ${avg_factor2_cost}.'.format(group_cost=group_cost, factor2=factor2, avg_factor2_cost=avg_factor2_cost))
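        print ('The difference depending on {group_cost} is ${difference_cost}.'.format(group_cost=group_cost, difference_cost=difference_cost))
        # Return the difference so the assignments below hold a value rather than None
        return difference_cost
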
diff_sex = comparing_avg_costs('sex', 'male', 'female')
diff_smoker = comparing_avg_costs('smoker', 'yes', 'no')



if sex: male, individual pays on average $13956.75.
if sex: female, individual pays on average $12569.58.
The difference depending on sex is $1387.17.
if smoker: yes, individual pays on average $32050.23.
if smoker: no, individual pays on average $8434.27.
The difference depending on smoker is $23615.96.
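The comparison above handles exactly two values of a column at a time. A more general sketch groups the charges by every value found in a column, which would also work for `region`. The `average_charges_by` helper and the sample rows below are hypothetical; the rows mimic the dictionaries `csv.DictReader` yields but are not real data.

```python
from collections import defaultdict

# Hypothetical rows shaped like csv.DictReader output (all values are strings).
sample_rows = [
    {'smoker': 'yes', 'charges': '30000.0'},
    {'smoker': 'no', 'charges': '8000.0'},
    {'smoker': 'yes', 'charges': '34000.0'},
    {'smoker': 'no', 'charges': '9000.0'},
]

def average_charges_by(rows, column):
    """Return a dict mapping each value of `column` to its average charge."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[column]] += float(row['charges'])
        counts[row[column]] += 1
    return {group: totals[group] / counts[group] for group in totals}

averages = average_charges_by(sample_rows, 'smoker')
print(averages)
```

Returning a dictionary instead of printing keeps the averages available for later steps, such as computing differences between groups.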
