# Portfolio Project: Analysing Medical Insurance Cost

## Project Objectives:
- Work locally on your own computer
- Import a dataset into your program
- Analyze a dataset by building out functions or class methods
- Use libraries to assist in your analysis
- Optional: Document and organize your findings
- Optional: Make predictions about a dataset’s features based on your findings

## Data Source:
https://www.kaggle.com/datasets/mirichoi0218/insurance


## Step 1: Understanding the data

### Kaggle gives basic information for each field:

*age:* age of primary beneficiary

*sex:* insurance contractor gender (female, male)

*bmi:* Body mass index, an objective index of body weight (kg / m ^ 2) computed as weight divided by height squared. It shows whether weight is relatively high or low for a given height; ideally 18.5 to 24.9

*children:* Number of children covered by health insurance / Number of dependents

*smoker:* whether the beneficiary smokes (yes, no)

*region:* the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

*charges:* Individual medical costs billed by health insurance

## Step 2: Scoping the project

Defining my project goals:

### Goal 1: initial understanding of how data is distributed
1.1 What is the average age in the dataset?
- we need to sum(age) for all rows and divide by count(rows)
  
1.2 How is BMI distributed?
- find the range of BMI values
- learn how to create a visual chart
- create visual chart
- learn how to calculate statistics like Q1, Q2, Q3, IQR etc
- calculate statistics for BMI

### Goal 2: advanced understanding

2.1 What is the relationship between each variable and the insurance cost?
- learn how to run correlation between variables in 2 columns
- run correlation
- repeat for all variables
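
A minimal sketch of the correlation step, assuming Pearson's coefficient is the measure wanted (the plan above does not name one). The `pearson` helper is a hypothetical name, and the sample values are only the first five ages and charges of the dataset, so the printed figure says nothing about the full data.

```python
from math import sqrt

def pearson(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

# First five rows of the dataset (age and charges columns)
sample_ages = [19, 18, 28, 33, 32]
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

r = pearson(sample_ages, sample_charges)
print(f"age vs charges (5 rows only): r = {r:.3f}")
```

The result always lies between -1 and 1. Text columns such as `sex` or `region` would need to be encoded as numbers before this helper can be used on them.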
  
2.2 Can we get the function to calculate cost based on the other variables? with which level of confidence?
- learn :)
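
One possible starting point, sketched under two assumptions that the plan does not state: a single numeric predictor fitted by ordinary least squares, and R² as a rough stand-in for "level of confidence". The names `fit_line` and `r_squared` are hypothetical, and the numbers are made up to keep the arithmetic easy to check.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    if ss_total == 0:
        raise ValueError("R squared is undefined when all y values are equal")
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total

# Illustrative values only (not taken from the dataset): y = 2x + 1 exactly
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
fit_quality = r_squared(xs, ys, slope, intercept)
print(f"cost = {slope:.2f} * x + {intercept:.2f} (R squared = {fit_quality:.2f})")
```

Predicting charges from several variables at once would need multiple regression, which this sketch does not cover.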
  

## Step 3: Import the dataset


In [40]:
import csv

path_to_datasources = '/Users/egraciani/Python/CodeAcademy/Datasources/'
insurance_data_csv_file = 'insurance.csv'
file_path = path_to_datasources + insurance_data_csv_file

with open(file_path, newline='') as csvfile:
    csvreader = csv.DictReader(csvfile)
    data = list(csvreader)
# Now `data` is a list of dictionaries, where each dictionary represents a row from the CSV file

print('First 5 results:')
print(data[:5])
print('---- ')
# Example: Access the first row (as a dictionary)
print('First row:')
print(data[0])
print('---- ')

# Example: Access a specific value from the first row (here, the 'sex' column)
print('First value for \'sex\':')
print(data[0]['sex'])
print('---- ')


First 5 results:
[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}]
---- 
First row:
{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
---- 
First value for 'sex':
female
---- 


## Step 4: Analysis

### 4.1: Clean and prepare data

Note: ideally we might want to generate a report with inconsistencies/blanks/errors.
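
Such a report could look roughly like the sketch below. It assumes rows shaped like the `csv.DictReader` output from Step 3, where every value is a string; the `find_bad_values` helper and the two sample rows are hypothetical and only illustrate the idea.

```python
def find_bad_values(rows, numeric_keys):
    """Return (row_index, key, value) for every blank or non-numeric field."""
    problems = []
    for index, row in enumerate(rows):
        for key in numeric_keys:
            value = row.get(key)
            # Missing or blank fields are reported as-is
            if value is None or value.strip() == "":
                problems.append((index, key, value))
                continue
            try:
                float(value)
            except ValueError:
                problems.append((index, key, value))
    return problems

# Hypothetical rows: the second one has a blank age and a non-numeric bmi
sample_rows = [
    {'age': '19', 'bmi': '27.9', 'children': '0', 'charges': '16884.924'},
    {'age': '', 'bmi': 'n/a', 'children': '1', 'charges': '1725.5523'},
]

report = find_bad_values(sample_rows, ['age', 'bmi', 'children', 'charges'])
for index, key, value in report:
    print(f"row {index}: {key!r} has unusable value {value!r}")
```

Running a check like this before conversion would show whether any values need fixing by hand.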


In [53]:
#Transform all numbers stored as strings to float or integer
def convert_to_numeric(s):
    try:
        return float(s) if '.' in s else int(s)
    except ValueError:
        return s

# Only these columns hold numbers; every other column is left as a string
numeric_keys = ['age', 'bmi', 'children', 'charges']
converted_data = [{k: convert_to_numeric(v) if k in numeric_keys else v for k, v in item.items()} for item in data]
print("First 2 rows:")
converted_data[:2]

[{'age': 19,
  'sex': 'female',
  'bmi': 27.9,
  'children': 0,
  'smoker': 'yes',
  'region': 'southwest',
  'charges': 16884.924},
 {'age': 18,
  'sex': 'male',
  'bmi': 33.77,
  'children': 1,
  'smoker': 'no',
  'region': 'southeast',
  'charges': 1725.5523}]

In [None]:
def convert_smoker_to_int(smoker_status):
    return 1 if smoker_status == 'yes' else 0
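
The cell above only defines the function. A small sketch of how it might be applied to every row follows, using two hypothetical rows rather than the full dataset and overwriting the `smoker` key in a copy of each row (one choice among several).

```python
def convert_smoker_to_int(smoker_status):
    return 1 if smoker_status == 'yes' else 0

# Hypothetical rows, trimmed to two columns for brevity
sample_rows = [
    {'age': 19, 'smoker': 'yes'},
    {'age': 18, 'smoker': 'no'},
]

# Build new dictionaries so the original rows stay unchanged
encoded_rows = [
    {**row, 'smoker': convert_smoker_to_int(row['smoker'])}
    for row in sample_rows
]
print(encoded_rows)
```

Note that any value other than exactly `'yes'` (including a blank or `'Yes'`) is encoded as 0.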


### 4.2: Goal 1: initial understanding of how data is distributed

1.1 What is the average age in the dataset?

- we need to sum(age) for all rows and divide by count(rows)


In [44]:
# We start by creating a function to get any variable (column) from the dataset
# Parameters:
#   variable_to_get: the column name: 'age', 'sex', 'bmi', 'children', 'smoker', 'region' or 'charges'
#   data: the dataset (a list of row dictionaries) to extract the column from
# Returns: a list with the values of that column, or an error message if the column is not found
def get_variable(variable_to_get, data):
    if variable_to_get in data[0]:
        single_variable = []
        for values in data:
            single_variable.append(values[variable_to_get])
        return single_variable
    else:
        return "Variable not found in dataset"

ages = get_variable('age', data)
print("The first 10 ages in the dataset are: {age}".format(age=ages[0:10]))

sexes = get_variable('sex', data)
print("The first 10 sexes in the dataset are: {sex}".format(sex=sexes[0:10]))

The first 10 ages in the dataset are: ['19', '18', '28', '33', '32', '31', '46', '37', '37', '60']
The first 10 sexes in the dataset are: ['female', 'male', 'male', 'male', 'male', 'female', 'female', 'female', 'male', 'female']


In [49]:
# Let's create a function to calculate the average of a column
# Parameter: values is the list with numerical values
# Returns: the average, or None if any value is not a number
def get_average(values):
    if all(isinstance(value, (int, float)) for value in values):
        print("I'm in the if")
        average = sum(values)/len(values)
        print("The average of the elements is: {average}".format(average=average))
        return average
    else: 
        print("I'm in the else")
        print('Not all values are numbers.')
        return None
    
print('Test 1: get_average(sexes):')
get_average(sexes)       
print('--------------')    
print('Test 2: get_average(ages):')
get_average(ages)       
        

Test 1: get_average(sexes):
I'm in the else
Not all values are numbers.
--------------
Test 2: get_average(ages):
I'm in the else
Not all values are numbers.
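
Test 2 ends up in the `else` branch because `ages` was extracted from the raw `data` list, where every value is still a string. Extracting the column from `converted_data` (built in 4.1) should give integers instead. The sketch below shows the difference with three hypothetical rows and trimmed-down versions of the two helpers.

```python
def get_variable(variable_to_get, data):
    # Trimmed version: no check that the column exists
    return [row[variable_to_get] for row in data]

def get_average(values):
    # Trimmed version: returns None instead of printing messages
    if values and all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)
    return None

# Hypothetical rows: the same ages before and after numeric conversion
raw_rows = [{'age': '19'}, {'age': '18'}, {'age': '28'}]
converted_rows = [{'age': 19}, {'age': 18}, {'age': 28}]

average_raw = get_average(get_variable('age', raw_rows))
average_converted = get_average(get_variable('age', converted_rows))

print(average_raw)        # None, because the values are strings
print(average_converted)  # about 21.67
```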


1.2 How is BMI distributed?
- find the range of BMI values
- learn how to create a visual chart
- create visual chart
- learn how to calculate statistics like Q1, Q2, Q3, IQR etc
- calculate statistics for BMI
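
The steps above could be sketched with the standard library alone. This is one possible approach, not a finished analysis: `describe` and `bin_counts` are hypothetical helper names, the quartiles come from `statistics.quantiles` with its default method, and a text histogram stands in for a real chart. The sample values are the first five BMI values of the dataset.

```python
import math
import statistics
from collections import Counter

def describe(values):
    """Return range and quartile statistics for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        'min': min(values),
        'max': max(values),
        'range': max(values) - min(values),
        'q1': q1,
        'median': q2,
        'q3': q3,
        'iqr': q3 - q1,
    }

def bin_counts(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# First five BMI values of the dataset, already converted to numbers
sample_bmi = [27.9, 33.77, 33.0, 22.705, 28.88]

for name, value in describe(sample_bmi).items():
    print(f"{name}: {value:.2f}")

# Text histogram: one '#' per value in each 5-point BMI bin
for lower_edge, count in bin_counts(sample_bmi, 5).items():
    print(f"{lower_edge}-{lower_edge + 5}: {'#' * count}")
```

With the full dataset, the same two helpers would be called on the `bmi` column of `converted_data`.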