# U.S. Medical Insurance Costs

This project involves analyzing a CSV file containing medical insurance costs using fundamental Python concepts. The objective is to investigate various attributes within the "insurance.csv" file to gain insight into patient information and identify potential use cases for the dataset.

# Data Scoping

Problem: The insurance company wants to estimate the expected number of claims by assessing the risk profile of its customer base.

Goal: Reduce the number of claims made.

Action: Identify high-risk customers and recommend ways to lead a healthier lifestyle.

Analysis: 
1. Identify the proportion of customers who smoke.
2. Identify the proportion of customers whose BMI falls outside the healthy range (18.5 – 24.9).
3. Identify the region with the most smokers.
4. Identify which sex contains more smokers.

Constraints:
BMI is not an accurate measure of a healthy lifestyle*

\* BMI does not take into account factors such as muscle mass, bone density and body composition.

In [55]:
# Importing required modules

# open, read, write csv file
import csv 

# statistical functions - mean, median, mode, standard deviation
import statistics

# count number of elements in a list
from collections import Counter

# Opening Dataset

Opening the dataset and saving each row as a dictionary keyed by column heading makes the data easier to use for further analysis.

In [82]:
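# Opening the insurance dataset and saving each row as a dictionary in a list
insurance_dict = []

# newline='' is the setting the csv module documentation recommends for file objects
with open('insurance.csv', newline='') as insurance_csv: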
    csv_reader = csv.DictReader(insurance_csv)
    
    for row in csv_reader:
        insurance_dict.append(row)
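
One detail worth keeping in mind is that `csv.DictReader` returns every field as a string, so numeric columns need converting before any arithmetic. The sketch below illustrates this with a two-row in-memory sample. The sample values are made up for illustration and are not taken from `insurance.csv`, and the names `sample_csv` and `sample_rows` are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample with the same column headings as insurance.csv
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,northwest,4000.0\n"
    "45,male,31.2,2,yes,southeast,21000.0\n"
)

sample_rows = list(csv.DictReader(sample_csv))

# Each row is a dictionary keyed by column heading
first_row = sample_rows[0]
print(list(first_row.keys()))

# Every value is read as a string, including the numeric columns
print(type(first_row["age"]).__name__)

# Converting is required before doing arithmetic on a value
print(int(first_row["age"]) + 1)
```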

Separating the data into one list per column heading helps with category-wise analysis.

In [60]:
# Segregating data from insurance dictionary into lists based on column headings
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

for row in insurance_dict:
    age.append(int(row["age"]))
    sex.append(str(row["sex"]))
    bmi.append(float(row["bmi"]))
    children.append(int(row["children"]))
    smoker.append(str(row["smoker"]))
    region.append(str(row["region"]))
    charges.append(float(row["charges"]))

# Data Exploration

Calculating the mean, median and standard deviation of different variables helps identify patterns and trends, which provide insight into the characteristics of a dataset.

In [83]:
# Function to find the mean, median and standard deviation of age, bmi and charges
def stats(lst):
    mean = statistics.mean(lst)
    median = statistics.median(lst)
    std_dev = statistics.stdev(lst)
    print(f"Mean: {mean}, Median: {median}, stdev: {std_dev}")

stats(age)

Mean: 39.20702541106129, Median: 39.0, stdev: 14.049960379216156
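
The cell above only prints the statistics for `age`. To cover several columns in one pass, one option is a variant that returns the values instead of printing them. The following is a minimal sketch under that assumption: `summarize` and `columns` are hypothetical names, and the sample numbers are invented rather than taken from the dataset.

```python
import statistics

def summarize(values):
    """Return the mean, median and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

# Hypothetical sample values; in the notebook these would be the
# age, bmi and charges lists built earlier
columns = {
    "age": [19, 28, 33, 45, 60],
    "bmi": [22.0, 27.5, 30.0, 33.5, 37.0],
}

for name, values in columns.items():
    summary = summarize(values)
    print(
        f"{name}: mean={summary['mean']:.2f}, "
        f"median={summary['median']:.2f}, stdev={summary['stdev']:.2f}"
    )
```

Returning a dictionary keeps the calculation separate from the formatting, so the same results can be printed, compared or stored.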


# Data Analysis

The code below provides a short, preliminary analysis of the dataset that gives the insurance company an overview of the risk profile of its customers. Smoking is the main factor in this analysis because it is a risk factor for illnesses such as lung cancer and cardiovascular disease. It is also treated here as a more reliable indicator of lifestyle than BMI, given the constraint noted above.

In [84]:
# Function to count the smokers in the dataset
def num_of_smokers(smoker):
    count_smokers = 0
    for item in smoker:
        if item == "yes":
            count_smokers += 1
    return count_smokers

# Proportion of smokers = number of smokers / total number of customers
proportion_of_smokers = num_of_smokers(smoker) / len(smoker)
print(proportion_of_smokers)

0.20478325859491778
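
As an aside, the same proportion can be computed without an explicit loop by using the built-in `list.count` method. The sketch below uses a small made-up list (`sample_smoker` is a hypothetical name); in the notebook the `smoker` list would take its place.

```python
# Hypothetical sample; in the notebook this would be the smoker list
sample_smoker = ["yes", "no", "no", "yes", "no"]

# list.count returns how many elements are equal to the given value
sample_proportion = sample_smoker.count("yes") / len(sample_smoker)
print(sample_proportion)
```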


In [85]:
# Function to count individuals with a healthy bmi (above 18.5 and below 24.9) in the dataset
def num_of_healthy_bmi(bmi):
    count_bmi = 0
    for item in bmi:
        if 18.5 < item < 24.9:
            count_bmi += 1
    return count_bmi

# Proportion with a healthy bmi = healthy count / total number of customers
proportion_of_healthy_bmi = num_of_healthy_bmi(bmi) / len(bmi)
print(proportion_of_healthy_bmi)

0.16517189835575485
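
The printed value is the share of customers inside the healthy range, so the share outside it is the complement, roughly 1 - 0.165 ≈ 0.835. The sketch below shows one way to compute that share directly. It assumes the same strict comparison as the cell above; `proportion_outside_range` and `sample_bmi` are hypothetical names, and the sample values are invented.

```python
def proportion_outside_range(values, low=18.5, high=24.9):
    """Share of values that are not strictly between low and high."""
    outside = sum(1 for value in values if not (low < value < high))
    return outside / len(values)

# Hypothetical sample; in the notebook this would be the bmi list
sample_bmi = [17.0, 22.0, 24.0, 27.5, 31.0]
print(proportion_outside_range(sample_bmi))
```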


In [78]:
# Number of Smokers by region
smoker_by_region = []

for row in insurance_dict:
    if row["smoker"] == "yes":
        smoker_by_region.append(row["region"])

smoker_by_region_counts = Counter(smoker_by_region)

for key, value in smoker_by_region_counts.items():
    print(f"{key}: {value}")

southwest: 58
southeast: 91
northeast: 67
northwest: 58
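
To pick out the region with the most smokers without reading the list by eye, `Counter.most_common` can be used. The sketch below rebuilds the counter from the counts printed above so that it runs on its own; in the notebook, `smoker_by_region_counts` would already exist.

```python
from collections import Counter

# Counts copied from the output above
smoker_by_region_counts = Counter(
    {"southwest": 58, "southeast": 91, "northeast": 67, "northwest": 58}
)

# most_common(1) returns a list holding the (key, count) pair with the highest count
top_region, top_count = smoker_by_region_counts.most_common(1)[0]
print(f"{top_region}: {top_count}")
```

With these counts, the southeast region has the most smokers (91).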


In [81]:
# Number of Smokers by sex
smoker_by_sex = []

for row in insurance_dict:
    if row["smoker"] == "yes":
        smoker_by_sex.append(row["sex"])

smoker_by_sex_counts = Counter(smoker_by_sex)

for key, value in smoker_by_sex_counts.items():
    print(f"{key}: {value}")

female: 115
male: 159
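
Raw counts depend on how many customers each group contains, so a smoking rate per group can be easier to compare than a count. The following is a minimal sketch of that idea, not part of the original analysis: `smoker_rate_by` and `sample_records` are hypothetical names, and the sample records are invented. In the notebook, `insurance_dict` would be passed in instead.

```python
from collections import Counter

def smoker_rate_by(rows, key):
    """Return the share of smokers within each group defined by row[key]."""
    totals = Counter(row[key] for row in rows)
    smokers = Counter(row[key] for row in rows if row["smoker"] == "yes")
    # A Counter returns 0 for missing keys, so groups without smokers get 0.0
    return {group: smokers[group] / total for group, total in totals.items()}

# Hypothetical sample records with the same keys as the dataset rows
sample_records = [
    {"sex": "female", "region": "northwest", "smoker": "no"},
    {"sex": "female", "region": "southeast", "smoker": "yes"},
    {"sex": "male", "region": "southeast", "smoker": "yes"},
    {"sex": "male", "region": "southeast", "smoker": "no"},
]

print(smoker_rate_by(sample_records, "sex"))
print(smoker_rate_by(sample_records, "region"))
```

The same function works for any categorical column, so it covers both the region and the sex breakdown.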
