# Python Fundamentals - US Medical Insurance Project
## Introduction



## Datasets

The dataset comes from [Kaggle Dataset - Medical Cost Personal Datasets](https://www.kaggle.com/mirichoi0218/insurance?select=insurance.csv). It is in the public domain, but it was reformatted to match the format used in the book 'Machine Learning with R'.

It can be downloaded directly [here](https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv).

The dataset is also provided by Codecademy [here](https://content.codecademy.com/PRO/paths/data-science/python-portfolio-project-starter-files.zip). I still have to check whether the 'insurance.csv' in that archive is identical to the GitHub copy. For now, I am using the one from GitHub.
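
One way to run that check is to compare a hash of each file's bytes. The following is a minimal sketch, not part of the original project: the helper names `file_digest` and `same_content` are hypothetical, and the paths in the usage comment are placeholders for wherever the two copies end up being saved.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_content(path_a, path_b):
    """Return True when both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)


# Hypothetical usage; both paths are placeholders, not confirmed locations:
# same_content('./materials/insurance.csv', './codecademy/insurance.csv')
```

This is a byte-for-byte comparison, so two files that differ only in line endings or a trailing newline would be reported as different even if every row matches.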

### Columns

+ age: age of the primary beneficiary

+ sex: gender of the insurance contractor (female, male)

+ bmi: body mass index, an objective index of body weight computed as weight divided by height squared (kg / m ^ 2). It indicates whether a weight is relatively high or low for a given height; ideally 18.5 to 24.9

+ children: number of children covered by health insurance / number of dependents

+ smoker: whether the beneficiary smokes (yes, no)

+ region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

+ charges: individual medical costs billed by health insurance

#### Inspect the dataset

In [1]:
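import csv
# newline='' lets the csv module handle line endings itself
with open('./materials/insurance.csv', newline='') as insurance_file:
    insurance_data = csv.DictReader(insurance_file)
    # Each row becomes a dict keyed by column name; all values are strings
    insurance_list = list(insurance_data)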

#### Check imported data

In [2]:
print(insurance_list[0:4])

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}]


In [3]:
{row['region'] for row in insurance_list}

{'northeast', 'northwest', 'southeast', 'southwest'}

In [4]:
{row['sex'] for row in insurance_list}

{'female', 'male'}

In [5]:
{row['children'] for row in insurance_list}

{'0', '1', '2', '3', '4', '5'}
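
The outputs above show that `csv.DictReader` returns every value as a string, including `age`, `bmi`, `children` and `charges`. Before computing any statistics, those fields need to be converted to numbers. The sketch below shows one possible conversion. The function name `convert_row` is hypothetical, and treating `smoker` as a boolean assumes the column only ever contains 'yes' or 'no', which has not been verified for the whole file.

```python
def convert_row(row):
    """Return a copy of a csv.DictReader row with numeric fields converted."""
    return {
        'age': int(row['age']),
        'sex': row['sex'],
        'bmi': float(row['bmi']),
        'children': int(row['children']),
        # Assumption: 'smoker' is always 'yes' or 'no'
        'smoker': row['smoker'] == 'yes',
        'region': row['region'],
        'charges': float(row['charges']),
    }


# First row of the printed output above, used here as sample input
sample_row = {'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0',
              'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
typed_row = convert_row(sample_row)
print(typed_row)
```

Applied to the full data, this would be something like `[convert_row(row) for row in insurance_list]`. A row with a missing or malformed number would raise a `ValueError`, which is a useful early warning about data quality.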

## Objectives - Scope

Project objectives defined by Codecademy (the sub-items are mine):

+ Work locally on your own computer: 
    + Work with Jupyter Notebooks
     
    + Use Git/GitHub for version control


+ Import a dataset into your program:
    + Source of the dataset
    
    + Description of the dataset

+ Analyze a dataset by building out functions or class methods:
    + [Explore the dataset](#explore_dataset):
        + Types of variables and categories/distributions of values

        + What functions and classes can we use to organize and work with the data?

        + Relationships/correlations between variables

+ Use libraries to assist in your analysis

+ Optional: Document and organize your findings
    + Keep track of references used in the project
       
    + Explain the analysis/graphs done
    
    + Record ideas for next steps or further analysis to do.


+ Optional: Make predictions about a dataset’s features based on your findings
    + Waiting for: further progress in the Data Science course
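
As a starting point for the question of which functions and classes could organize the data, here is a minimal sketch of a wrapper class around the list of row dictionaries. The class name `InsuranceData` and its methods are hypothetical, not something prescribed by Codecademy, and the sample rows are trimmed copies of the first rows of the dataset.

```python
from collections import defaultdict


class InsuranceData:
    """Hypothetical wrapper around a list of row dictionaries."""

    def __init__(self, rows):
        self.rows = list(rows)

    def values(self, column):
        """Return every value of one column, in row order."""
        return [row[column] for row in self.rows]

    def group_by(self, column):
        """Map each distinct value of `column` to the rows that have it."""
        groups = defaultdict(list)
        for row in self.rows:
            groups[row[column]].append(row)
        return dict(groups)


# Three sample rows in the csv.DictReader format (all values are strings)
rows = [
    {'age': '19', 'sex': 'female', 'region': 'southwest', 'smoker': 'yes'},
    {'age': '18', 'sex': 'male', 'region': 'southeast', 'smoker': 'no'},
    {'age': '28', 'sex': 'male', 'region': 'southeast', 'smoker': 'no'},
]
data = InsuranceData(rows)
print(data.values('age'))
print({region: len(group) for region, group in data.group_by('region').items()})
```

Keeping the rows as plain dictionaries means the class works directly on the output of `csv.DictReader`, and analysis methods can be added one at a time as the project grows.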

<a id='explore_dataset'></a>
### Explore the dataset
Look for relationships between the different variables. With the data available, we could:

1. Look at distributions of users by the categories provided:
    + mean, median, variance
    + histograms/bar plots
    
    
2. Look at relationships/correlations between two categorical variables. For example:
    + smoker/non-smoker by sex 
    + smoker/non-smoker by region

    
3. Look at relationships/correlations that involve a numerical variable, paired with either a categorical or another numerical variable:
    + age by region
    + number of children vs age
    + age vs bmi


4. Multivariable analysis
    + Waiting for: further progress in the Data Science course
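
The first three ideas in the list above can be sketched with the standard library alone. The example below is a rough illustration rather than the final analysis. The function names `summarize`, `cross_count` and `mean_by_group` are hypothetical, and the rows are the first four rows of the dataset with their numeric fields already converted from strings.

```python
import statistics
from collections import Counter, defaultdict

# First four rows of the dataset, numeric fields converted, unused columns dropped
rows = [
    {'age': 19, 'sex': 'female', 'smoker': 'yes', 'region': 'southwest'},
    {'age': 18, 'sex': 'male', 'smoker': 'no', 'region': 'southeast'},
    {'age': 28, 'sex': 'male', 'smoker': 'no', 'region': 'southeast'},
    {'age': 33, 'sex': 'male', 'smoker': 'no', 'region': 'northwest'},
]


def summarize(values):
    """Mean, median and sample variance of a list of numbers."""
    return {
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'variance': statistics.variance(values),  # sample variance (n - 1)
    }


def cross_count(rows, first, second):
    """Count rows for each pair of values of two categorical columns."""
    return Counter((row[first], row[second]) for row in rows)


def mean_by_group(rows, category, numeric):
    """Mean of a numerical column within each value of a categorical column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[numeric])
    return {key: statistics.mean(vals) for key, vals in groups.items()}


age_summary = summarize([row['age'] for row in rows])
smoker_by_sex = cross_count(rows, 'sex', 'smoker')
age_by_region = mean_by_group(rows, 'region', 'age')
print(age_summary)
print(smoker_by_sex)
print(age_by_region)
```

Four rows are far too few to draw any conclusion; the point is only the shape of the functions. On the full dataset, the counts from `cross_count` could feed a bar plot and the output of `mean_by_group` a comparison across regions.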
      

## References
* [Data Science Project Scoping Guide](http://www.datasciencepublicpolicy.org/home/resources/data-science-project-scoping-guide/)