# U.S. Medical Insurance Costs

In this project, a **csv** file of medical insurance costs is analysed using Python fundamentals. The goal is to analyse the attributes in **insurance.csv** to learn more about the patient information in the file and gain insights into potential use cases for the dataset.

### Importing Modules

I started by importing the **csv** module so that I could use **csv.DictReader()** to read each row of the csv file as a Python dictionary keyed by column name. I also imported the **pearsonr()** function from **scipy.stats** to calculate correlation coefficients later on.

In [1]:
import csv
from scipy.stats import pearsonr

### Opening the CSV File and Appending the Data into Lists

Before opening the csv file, I created an empty list for each column in the file. After opening the file, I passed it to **csv.DictReader()**, which yields each row as a dictionary keyed by column name. I then used a for-loop to iterate through the rows, converting the numeric columns to **int** or **float** and appending each value to the corresponding list.

In [2]:
ages = []
sexes = []
bmis = []
num_children = []
smokers = []
regions = []
total_charges = []

In [3]:
with open("insurance.csv", newline="") as insurance_file:
    insurance_data = csv.DictReader(insurance_file)
    for row in insurance_data:
        ages.append(int(row["age"]))
        sexes.append(row["sex"])
        bmis.append(float(row["bmi"]))
        num_children.append(int(row["children"]))
        smokers.append(row["smoker"])
        regions.append(row["region"])
        total_charges.append(float(row["charges"]))
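
To make the row-by-row behaviour of **csv.DictReader()** concrete, here is a minimal sketch that reads from an in-memory string instead of the real file. The two sample rows and the `sample_` names are made up for illustration and are not taken from **insurance.csv**.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as insurance.csv;
# the values are invented for illustration only.
sample_text = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,northwest,4000.50\n"
    "45,male,31.2,0,yes,southeast,21000.00\n"
)

# DictReader treats the first line as the header and yields one dict per row.
sample_rows = list(csv.DictReader(io.StringIO(sample_text)))

# Every value arrives as a string, so numeric columns need explicit conversion.
first_row = sample_rows[0]
sample_age = int(first_row["age"])
sample_charge = float(first_row["charges"])

print(first_row["sex"], sample_age, sample_charge)
```

This is why the loop above wraps the age, BMI, children and charges columns in **int()** or **float()** before appending them.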

### Calculating the Average, Minimum and Maximum Ages

The first calculation I carried out was the average age of patients in the dataset. I also found the maximum and minimum ages.

I set the initial value of the min_age variable to **float("inf")** and max_age to **float("-inf")**, so that the first age compared against them would always replace the initial value. In my for-loop I added the age of each patient to the sum_of_ages variable, then used two if statements to track the minimum and maximum ages in their respective variables. I also created a **num_patients** variable that I could use throughout my analysis.

In [4]:
num_patients = len(ages)

sum_of_ages = 0
min_age = float("inf")
max_age = float("-inf")
    
for age in ages:
    sum_of_ages += age
    if age < min_age:
        min_age = age
    if age > max_age:
        max_age = age
    
average_age = sum_of_ages / num_patients

In [5]:
print("The average age of patients in the dataset is {}. The oldest patient is {} years old, whilst the youngest is {} years old".format(
    "%.f" % average_age, max_age, min_age))

The average age of patients in the dataset is 39. The oldest patient is 64 years old, whilst the youngest is 18 years old
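
As a cross-check on the manual loop, the same three statistics can be obtained with Python's built-in **sum()**, **min()** and **max()**. The sketch below uses a short, made-up list (`sample_ages`) so that it runs on its own; in the notebook the **ages** list would take its place.

```python
# Hypothetical ages, used only to keep the sketch self-contained.
sample_ages = [19, 18, 28, 33, 64]

# Built-ins replace the running total and the two comparisons in the loop.
check_average = sum(sample_ages) / len(sample_ages)
check_min = min(sample_ages)
check_max = max(sample_ages)

print("Average: {:.0f}, youngest: {}, oldest: {}".format(
    check_average, check_min, check_max))
```

The loop version does everything in a single pass, whereas the built-ins each scan the list once; for a dataset of this size the difference is negligible.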


### Calculating the Difference in Costs between Patients With and Without Children

In [6]:
cost_children = zip(ages, num_children, total_charges)

count_with_children = 0
count_without_children = 0
sum_cost_with_children = 0
sum_cost_without_children = 0
sum_age_with_children = 0
sum_age_without_children = 0

for patient in cost_children:
    if patient[1] > 0:
        count_with_children += 1
        sum_age_with_children += patient[0]
        sum_cost_with_children += patient[2]
    else:
        count_without_children += 1
        sum_age_without_children += patient[0]
        sum_cost_without_children += patient[2]
        
avg_age_with_children = sum_age_with_children / count_with_children
avg_age_without_children = sum_age_without_children / count_without_children
percent_with_children = (count_with_children / num_patients) * 100
percent_without_children = (count_without_children / num_patients) * 100
avg_cost_with_children = sum_cost_with_children / count_with_children
avg_cost_without_children = sum_cost_without_children / count_without_children

In [7]:
print("""{}% of patients in the dataset are parents. {}% have no children. 
That's {} patients in the dataset with children and {} patients without children.

The average age of patients with children is {}. The average age of patients without children is {}.

The average cost of insurance for patients with children is ${:,.2f} and for those without children is ${:,.2f}.
That's an extra ${:,.2f} on average for patients with children.""".format(
    "%.1f" % percent_with_children, "%.1f" % percent_without_children, 
    count_with_children, count_without_children, 
    "%.f" % avg_age_with_children, "%.f" % avg_age_without_children, 
    avg_cost_with_children, avg_cost_without_children, avg_cost_with_children - avg_cost_without_children))

57.1% of patients in the dataset are parents. 42.9% have no children. 
That's 764 patients in the dataset with children and 574 patients without children.

The average age of patients with children is 40. The average age of patients without children is 38.

The average cost of insurance for patients with children is $13,949.94 and for those without children is $12,365.98.
That's an extra $1,583.97 on average for patients with children.


### Calculating the Difference in Costs between Smokers and Non-Smokers

In [8]:
cost_smokers = zip(smokers, total_charges)

count_of_smokers = smokers.count("yes")
count_of_non_smokers = smokers.count("no")
sum_cost_smokers = 0
sum_cost_non_smokers = 0

for patient in cost_smokers:
    if patient[0] == "yes":
        sum_cost_smokers += patient[1]
    else:
        sum_cost_non_smokers += patient[1]

percent_smokers = (count_of_smokers / num_patients) * 100
avg_cost_smokers = sum_cost_smokers / count_of_smokers
avg_cost_non_smokers = sum_cost_non_smokers / count_of_non_smokers

In [9]:
print("""{}% of patients in the dataset are smokers.
The average cost of insurance for smokers is ${:,.2f} vs ${:,.2f} for non-smokers.""".format(
    "%.1f" % percent_smokers, avg_cost_smokers, avg_cost_non_smokers))

20.5% of patients in the dataset are smokers.
The average cost of insurance for smokers is $32,050.23 vs $8,434.27 for non-smokers.


### Calculating the Difference in Costs between Patients' Regions

In [10]:
cost_region = zip(regions, total_charges)

count_southwest = regions.count("southwest")
count_southeast = regions.count("southeast")
count_northwest = regions.count("northwest")
count_northeast = regions.count("northeast")

sum_cost_southwest = 0
sum_cost_southeast = 0
sum_cost_northwest = 0
sum_cost_northeast = 0


for patient in cost_region:
    if patient[0] == "southwest":
        sum_cost_southwest += patient[1]
    elif patient[0] == "southeast":
        sum_cost_southeast += patient[1]
    elif patient[0] == "northwest":
        sum_cost_northwest += patient[1]
    else:
        sum_cost_northeast += patient[1]

avg_cost_southwest = sum_cost_southwest / count_southwest
avg_cost_southeast = sum_cost_southeast / count_southeast
avg_cost_northwest = sum_cost_northwest / count_northwest
avg_cost_northeast = sum_cost_northeast / count_northeast

In [11]:
print("""Count of patients in different regions and average insurance costs:
Southwest: {} patients. Average Insurance Cost: ${:,.2f}
Southeast: {} patients. Average Insurance Cost: ${:,.2f}
Northwest: {} patients. Average Insurance Cost: ${:,.2f}
Northeast: {} patients. Average Insurance Cost: ${:,.2f}""".format(
     count_southwest, avg_cost_southwest, 
     count_southeast, avg_cost_southeast, 
     count_northwest, avg_cost_northwest, 
     count_northeast, avg_cost_northeast))

Count of patients in different regions and average insurance costs:
Southwest: 325 patients. Average Insurance Cost: $12,346.94
Southeast: 364 patients. Average Insurance Cost: $14,735.41
Northwest: 325 patients. Average Insurance Cost: $12,417.58
Northeast: 324 patients. Average Insurance Cost: $13,406.38
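
The per-region sums and counts above follow the same pattern as the earlier children and smoker comparisons, so they could be folded into one helper. The following is a sketch only: `average_by_group` is a hypothetical function name, and the sample lists are invented so the example runs without the csv file.

```python
from collections import defaultdict


def average_by_group(labels, values):
    """Return {label: mean of the values that share that label}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Hypothetical sample data in the same shape as the regions and total_charges lists.
sample_regions = ["southwest", "southeast", "southwest", "northeast"]
sample_charges = [1000.0, 3000.0, 2000.0, 4000.0]

sample_averages = average_by_group(sample_regions, sample_charges)
for region, average in sorted(sample_averages.items()):
    print("{}: ${:,.2f}".format(region.title(), average))
```

One design difference from the if/elif chain: an unexpected label would get its own group here, rather than being added to the northeast total by the final else branch.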


### Calculating the Difference in Costs between the Sexes

In [12]:
sex_cost = zip(ages, sexes, bmis, total_charges)

count_males = sexes.count("male")
count_females = sexes.count("female")
sum_male_bmis = 0
sum_female_bmis = 0
sum_male_costs = 0
sum_female_costs = 0
sum_male_ages = 0
sum_female_ages = 0

for patient in sex_cost:
    if patient[1] == "male":
        sum_male_bmis += patient[2]
        sum_male_costs += patient[3]
        sum_male_ages += patient[0]
    else:
        sum_female_bmis += patient[2]
        sum_female_costs += patient[3]
        sum_female_ages += patient[0]

avg_male_cost = sum_male_costs / count_males
avg_female_cost = sum_female_costs / count_females
avg_male_bmi = sum_male_bmis / count_males
avg_female_bmi = sum_female_bmis / count_females
avg_male_age = sum_male_ages / count_males
avg_female_age = sum_female_ages / count_females

In [13]:
print("""There are {} males in the dataset and {} females.

The average insurance cost for males is ${:,.2f}
The average insurance cost for females is ${:,.2f}

The average BMI for males is {}
The average BMI for females is {}

The average age of male patients is {}
The average age of female patients is {}""".format(
    count_males, count_females,
    avg_male_cost, avg_female_cost, 
    "%.1f" % avg_male_bmi, "%.1f" % avg_female_bmi,
    "%.f" % avg_male_age, "%.f" % avg_female_age
    ))

There are 676 males in the dataset and 662 females.

The average insurance cost for males is $13,956.75
The average insurance cost for females is $12,569.58

The average BMI for males is 30.9
The average BMI for females is 30.4

The average age of male patients is 39
The average age of female patients is 40


### Correlation between Insurance Costs and Age, Sex, BMI, Number of Children and Smoker Status 

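The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship.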
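
For intuition about what the coefficient computes, here is a minimal pure-Python sketch of the textbook formula: the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the sample inputs are assumptions for illustration; the analysis itself relies on **scipy.stats.pearsonr**.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the two means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: summed squared deviations of each variable.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either input is constant (zero spread).
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical inputs: one perfectly increasing pair, one perfectly decreasing pair.
perfect = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(perfect, inverse)
```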

To carry out the correlation on smoker status and sex, I first had to change these to binary numerical values. For smoker status, 1 = smoker, 0 = non-smoker. For sex, 1 = male, 0 = female.

I used the **scipy.stats.pearsonr** function to calculate the correlation coefficient between insurance cost and each of the following: the patient's age, sex, BMI, number of children and smoker status.

The results show a strong positive correlation between smoker status and insurance cost (0.787), i.e. smokers in the dataset tend to have higher insurance costs. There is a weak positive correlation between insurance cost and a patient's age (0.299) and BMI (0.198). The correlations between insurance cost and the number of children a patient has (0.068) or their sex (0.057) are close to zero, indicating almost no linear relationship. Note that correlation alone does not establish that any of these attributes causes higher costs.

In [14]:
smokers_numeric = []
sexes_numeric = []

for smoker in smokers:
    if smoker == "yes":
        smokers_numeric.append(1)
    else:
        smokers_numeric.append(0)
        
for sex in sexes:
    if sex == "male":
        sexes_numeric.append(1)
    else:
        sexes_numeric.append(0)
        
age_cost_correlation, _ = pearsonr(ages, total_charges)
sex_cost_correlation, _ = pearsonr(sexes_numeric, total_charges)
bmis_cost_correlation, _ = pearsonr(bmis, total_charges)
num_children_cost_correlation, _ = pearsonr(num_children, total_charges)
smoker_cost_correlation, _ = pearsonr(smokers_numeric, total_charges)
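
The two encoding loops can also be written as list comprehensions. This is an equivalent sketch rather than a change to the analysis; the `sample_` lists are made-up values in the same format as the **smokers** and **sexes** lists.

```python
# Hypothetical sample values in the same format as the smoker and sex columns.
sample_smokers = ["yes", "no", "no", "yes"]
sample_sexes = ["female", "male", "male", "female"]

# 1 = smoker / male, 0 = non-smoker / female, matching the encoding described earlier.
sample_smokers_numeric = [1 if smoker == "yes" else 0 for smoker in sample_smokers]
sample_sexes_numeric = [1 if sex == "male" else 0 for sex in sample_sexes]

print(sample_smokers_numeric, sample_sexes_numeric)
```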

In [15]:
print("""The Pearson correlation coefficients for Insurance Cost vs:
Age is {}
Sex is {}
BMI is {}
Number of Children is {}
Smoker Status is {}""".format(
    "%.3f" % age_cost_correlation, "%.3f" % sex_cost_correlation, "%.3f" % bmis_cost_correlation,
    "%.3f" % num_children_cost_correlation, "%.3f" % smoker_cost_correlation))

The Pearson correlation coefficients for Insurance Cost vs:
Age is 0.299
Sex is 0.057
BMI is 0.198
Number of Children is 0.068
Smoker Status is 0.787
