# U.S. Medical Insurance Costs

The goal of this project is to explore the **Medical Insurance Costs Dataset** using only Python's built-in features, without the help of popular data science libraries.

---

**Project structure:**

1. Data import and initial exploration.
2. Assign the data columns to appropriate variable types.
3. Build analysis functions to better understand the data.

## 1. Data import and initial exploration.



We begin by importing each dataset feature (column) into its own Python list.

In [31]:
import csv 
# One list per CSV column
# Column: age
age_col = []
# Column: sex
sex_col = []
# Column: bmi
bmi_col = []
# Column: children
children_col = []
# Column: smoker
smoker_col = []
# Column: region
region_col = []
# Column: charges
charges_col = []
# Column: index (row position, not part of the CSV)
index_col = []

with open("insurance.csv", newline='') as medical_insurance_data:
    medical_insurance_data_dict = csv.DictReader(medical_insurance_data)
    # enumerate supplies the 0-based row index alongside each row
    for idx_counter, row in enumerate(medical_insurance_data_dict):
        index_col.append(idx_counter)
        age_col.append(row["age"])
        sex_col.append(row["sex"])
        bmi_col.append(row["bmi"])
        children_col.append(row["children"])
        smoker_col.append(row["smoker"])
        region_col.append(row["region"])
        charges_col.append(row["charges"])




We proceed to inspect the first 5 values of each column in our python lists and confirm that all columns are of equal length.

In [33]:
print(f"Age column (first 5 values): {age_col[0:5]}. Length: {len(age_col)} ")
print(f"Sex column (first 5 values): {sex_col[0:5]}. Length: {len(sex_col)} ")
print(f"BMI column (first 5 values): {bmi_col[0:5]}. Length: {len(bmi_col)} ")
print(f"Smoker column (first 5 values): {smoker_col[0:5]}. Length: {len(smoker_col)} ")
print(f"Region column (first 5 values): {region_col[0:5]}. Length: {len(region_col)} ")
print(f"Charges column (first 5 values): {charges_col[0:5]}. Length: {len(charges_col)} ")
print(f"Index column (first 5 values): {index_col[0:5]}. Length: {len(index_col)} ")

Age column (first 5 values): ['19', '18', '28', '33', '32']. Length: 1338 
Sex column (first 5 values): ['female', 'male', 'male', 'male', 'male']. Length: 1338 
BMI column (first 5 values): ['27.9', '33.77', '33', '22.705', '28.88']. Length: 1338 
Smoker column (first 5 values): ['yes', 'no', 'no', 'no', 'no']. Length: 1338 
Region column (first 5 values): ['southwest', 'southeast', 'southeast', 'northwest', 'northwest']. Length: 1338 
Charges column (first 5 values): ['16884.924', '1725.5523', '4449.462', '21984.47061', '3866.8552']. Length: 1338 
Index column (first 5 values): [0, 1, 2, 3, 4]. Length: 1338 
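
The equal-length check above is done by reading the printed lengths. As a minimal sketch, the same check could be made explicit with a small helper. The name `columns_have_equal_length` is hypothetical and not part of the original notebook, and the sample lists are the first values printed above.

```python
def columns_have_equal_length(*columns):
    """Return True when every column list has the same number of entries."""
    # A set of lengths collapses to one element when all lengths match.
    return len({len(column) for column in columns}) <= 1


# Small illustrative inputs taken from the first values printed above.
print(columns_have_equal_length(["19", "18", "28"], ["female", "male", "male"]))
print(columns_have_equal_length(["19", "18", "28"], ["female", "male"]))
```

In the notebook itself, the call would presumably take all eight lists, for example `columns_have_equal_length(age_col, sex_col, bmi_col, children_col, smoker_col, region_col, charges_col, index_col)`.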


Our imported data seems consistent. However, the `age_col`, `bmi_col` and `charges_col` columns hold strings, so their values must be converted to numeric types (integer or float).


In [27]:
# age values to int
for index in range(len(age_col)):
    age_col[index] = int(age_col[index])
# BMI values to float
for index in range(len(bmi_col)):
    bmi_col[index] = float(bmi_col[index])
# Charges values to float
for index in range(len(charges_col)):
    charges_col[index] = float(charges_col[index])

Inspect parsed columns:

In [29]:
print(f"Age column parsed to integer data type (first 5 values): {age_col[0:5]}. Length: {len(age_col)} ")
print(f"BMI column parsed to float data type (first 5 values): {bmi_col[0:5]}. Length: {len(bmi_col)} ")
print(f"Charges column parsed to float data type (first 5 values): {charges_col[0:5]}. Length: {len(charges_col)} ")


Age column parsed to integer data type (first 5 values): [19, 18, 28, 33, 32]. Length: 1338 
BMI column parsed to float data type (first 5 values): [27.9, 33.77, 33.0, 22.705, 28.88]. Length: 1338 
Charges column parsed to float data type (first 5 values): [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]. Length: 1338 


Finally, we create a Python dictionary to serve as a database/dataframe for our medical records. The structure is a dictionary within a dictionary: the key of the outermost dictionary is the `index_col` value of each row, and each inner dictionary holds that row's values.
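
The nested structure described above could be built along the following lines. This is only a sketch under stated assumptions: the names `build_records` and `medical_records` are hypothetical, and the sample uses the first three rows from the inspection output (the `children` values are not printed in this notebook, so that column is left out of the sample).

```python
def build_records(index_values, columns):
    """Return {index: {column_name: value}} from parallel column lists.

    `columns` maps each column name to its list of values; every list is
    assumed to have the same length as `index_values`.
    """
    records = {}
    for position, idx in enumerate(index_values):
        records[idx] = {name: values[position] for name, values in columns.items()}
    return records


# Illustrative sample: first three rows shown in the inspection output.
sample_index = [0, 1, 2]
sample_columns = {
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
}

medical_records = build_records(sample_index, sample_columns)
print(medical_records[0])
```

With the full dataset, the same helper would take `index_col` and a dictionary such as `{"age": age_col, "sex": sex_col, ...}`. Keying the outer dictionary by row index keeps lookups of a single record cheap, at the cost of repeating the column names in every inner dictionary.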


In [34]:
# Dictionary that stores medical records:


## 2. Assign the data columns to appropriate variable types.

In [None]:
# List of the smoker column. It was already populated from the CSV in section 1;
# `medical_insurance_data` is a closed file handle and is not subscriptable.
smoker_col = list(smoker_col)




## 3. Build analysis functions to better understand the data.

### **Goals of the analysis**
<br><br>
- Explore the relationship between age and the other features by:
    - Calculating the average age of the people in the dataset.
    - Calculating the proportion of samples in each 5-year age group (starting at the minimum age in the dataset).
    - Calculating the average charges for each age group.
<br><br>
- Explore the general characteristics of the smoker columns:
    - What is the number of smokers and non-smokers?
    - What is the proportion of smokers vs. non-smokers?
    - Do the insurance charges of smokers differ significantly from those of non-smokers?
<br><br>        
- Analyze the possibility of a regional bias in our dataset by:
    - Analyzing where the majority of individuals live.
    - Answering whether there is a significant difference between the region with the most samples and the other regions.
<br><br>   
- Analyze the average age of the persons with one or more children.
<br><br>
- Analyze the average age and the average charges grouped by BMI, and determine whether smokers or non-smokers make up the majority of samples in each BMI group.
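
The age-related goals in the list above are not implemented in this notebook yet. One possible sketch, using only built-ins, is shown here. The function names `mean`, `age_group_label` and `summarize_by_age_group` are hypothetical, the group labels use inclusive bounds (for example `18-22`) as an assumption about what "5-year groups" means, and the sample values are the first five rows printed in section 1.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def age_group_label(age, min_age, width=5):
    """Label of the `width`-year group containing `age`, counted from `min_age`."""
    start = min_age + ((age - min_age) // width) * width
    return f"{start}-{start + width - 1}"


def summarize_by_age_group(ages, charges, width=5):
    """Return {group_label: {"proportion": percent, "average_charges": value}}."""
    min_age = min(ages)
    grouped = {}
    for age, charge in zip(ages, charges):
        grouped.setdefault(age_group_label(age, min_age, width), []).append(charge)
    total = len(ages)
    return {
        label: {
            "proportion": round(100 * len(group) / total, 2),
            "average_charges": round(mean(group), 2),
        }
        for label, group in grouped.items()
    }


# Illustrative sample: first five rows printed in section 1.
sample_ages = [19, 18, 28, 33, 32]
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

print(f"Average age: {mean(sample_ages)}")
for label, stats in summarize_by_age_group(sample_ages, sample_charges).items():
    print(label, stats)
```

On the full dataset the same functions would be called with the parsed `age_col` and `charges_col` lists.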



Total number of smokers and non-smokers:


In [None]:
def calculate_number_of_smokers(smoker_column):
    are_smokers = 0
    non_smokers = 0
    for i in smoker_column:
        if i == 'yes':
            are_smokers += 1
        else:
            non_smokers += 1
    return are_smokers, non_smokers

smokers_positive, smokers_negative = calculate_number_of_smokers(smoker_col)
print(f'Smokers: {smokers_positive}\nNon-smokers: {smokers_negative}')
            

Proportions of smokers vs non-smokers:


In [None]:
def proportion_of_smokers(smoker_column):
    size = len(smoker_column)
    smokers, non_smokers = calculate_number_of_smokers(smoker_column)
    prop_smokers = round((smokers * 100 ) / size, 2 )
    prop_non_smokers = round((non_smokers *100) / size, 2)
    return prop_smokers, prop_non_smokers

smokers_proportion, non_smokers_proportion = proportion_of_smokers(smoker_col)
print(f'Proportion of smokers: {smokers_proportion}%\nProportion of non-smokers: {non_smokers_proportion}%')
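
The remaining smoker question from the goals, whether charges differ between smokers and non-smokers, could start from a comparison of average charges per group. The sketch below is an assumption about how that might look, not the notebook's own method: `average_charges_by_smoker_status` is a hypothetical name, and the sample values are the first five rows printed in section 1.

```python
def average_charges_by_smoker_status(smoker_column, charges_column):
    """Return {status: average charge} for each status found in the column."""
    totals = {}
    counts = {}
    for status, charge in zip(smoker_column, charges_column):
        totals[status] = totals.get(status, 0.0) + charge
        counts[status] = counts.get(status, 0) + 1
    return {status: totals[status] / counts[status] for status in totals}


# Illustrative sample: first five rows printed in section 1.
sample_smokers = ["yes", "no", "no", "no", "no"]
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

averages = average_charges_by_smoker_status(sample_smokers, sample_charges)
print(f"Average charges (smokers): {round(averages['yes'], 2)}")
print(f"Average charges (non-smokers): {round(averages['no'], 2)}")
```

A gap between the two averages is only a first indication. Judging whether the difference is statistically significant would need a measure of spread and a test on top of it.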