# US Medical Insurance Costs Project

In this project, I used the **pandas** Python library to explore a **CSV** file of US medical insurance records. The goal was to analyze the attributes in `insurance.csv`, learn more about the patients it describes, and identify potential use cases for the dataset. The **CSV** dataset file is available [on Kaggle](https://www.kaggle.com/mirichoi0218/insurance).

## Preparation

First, I imported **Pandas**.

In [57]:
import pandas as pd

Then, I loaded the data.

In [56]:
insurance = pd.read_csv("insurance.csv")

Once the data was loaded, I looked at the column names and the number of columns.

In [46]:
insurance.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

The dataset contains seven columns with the following information:
* Patient age
* Patient sex
* Patient BMI
* Number of children of each patient
* Patient cigarette smoking status
* US geographical region
* Annual medical insurance cost
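The row and column counts can also be read directly from `DataFrame.shape`. The sketch below is illustrative only: `sample` is a hypothetical stand-in built from the first five records of the dataset, not the full file.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: its first five records.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the full frame, `insurance.shape` would report the same column count alongside the total number of rows.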

Then, I inspected the first five rows of data.

In [39]:
insurance.head()

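|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |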

Next, I searched for missing data and found that the dataset is complete.

In [62]:
insurance.isna().any().any()

False
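`isna().any().any()` gives only a yes-or-no answer. When a dataset does have gaps, a per-column count is usually more informative. A minimal sketch, assuming a small stand-in frame (`sample`, a hypothetical name) holding three of the dataset's columns, plus a copy in which one value is blanked on purpose to show what a gap looks like:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: three columns from the first three records.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["yes", "no", "no"],
})

# Count missing values per column; every count is zero for complete data.
missing_per_column = sample.isna().sum()

# Artificial gap, introduced only for illustration (not present in the dataset).
with_gap = sample.copy()
with_gap.loc[0, "bmi"] = np.nan
missing_with_gap = with_gap.isna().sum()
print(missing_with_gap)
```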

Finally, I inspected data types. Some columns are numerical, while others are categorical.

In [41]:
insurance.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
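The split between numerical and categorical columns can be made explicit with `select_dtypes`. This is a sketch rather than part of the original analysis; `sample` is a hypothetical stand-in holding the first two records.

```python
import pandas as pd

# Hypothetical stand-in: the first two records of the dataset.
sample = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [16884.924, 1725.5523],
})

# Columns with integer or float dtypes.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (here, the text columns).
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```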

## Analysis

I started my analysis by computing summary statistics for patient age.

In [37]:
insurance['age'].describe()

count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

The next step of my analysis was to determine the breakdown of patients by sex.

In [32]:
insurance.groupby('sex')['sex'].count()

sex
female    662
male      676
Name: sex, dtype: int64
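To put these counts in proportion, they can be converted to percentages. A small sketch, with the two counts copied from the output above into a Series (the names `counts` and `shares` are illustrative):

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series({"female": 662, "male": 676}, name="sex")

# Share of each group as a percentage of all patients, to one decimal place.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

The split is close to even, at roughly 49.5% female and 50.5% male. On the full frame, `insurance['sex'].value_counts(normalize=True)` would give the same proportions as fractions in a single call.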

Then, I found that the dataset contains four unique geographical regions, all within the United States.

In [50]:
insurance['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

Finally, I determined that the average annual medical insurance charge per individual is approximately $13,270.42.

In [51]:
insurance['charges'].mean()

13270.422265141257
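For reporting, the raw mean can be rounded and formatted as a dollar amount with an f-string. A minimal sketch, with the value copied from the output above (`mean_charge` and `formatted` are illustrative names):

```python
# Mean charge copied from the output above.
mean_charge = 13270.422265141257

# Thousands separator and two decimal places.
formatted = f"${mean_charge:,.2f}"
print(formatted)
```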

## Organize Patient Data in a Dictionary

After completing my analysis, I converted the patient information into a dictionary keyed by row index, with one inner dictionary per patient, for possible future research. The first entry is shown below.

In [54]:
patients = insurance.to_dict(orient="index")
patients[0]

{'age': 19, 'sex': 'female', 'bmi': 27.9, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 16884.924}
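Once the records are in plain Python dictionaries, they can be queried without pandas. The sketch below assumes a dictionary keyed by row index with one inner dictionary per patient (the shape that `DataFrame.to_dict(orient="index")` produces). Only the first three records are included, and all variable names are illustrative.

```python
# Hypothetical stand-in: the first three records, keyed by row index.
patients = {
    0: {"age": 19, "sex": "female", "bmi": 27.9, "children": 0,
        "smoker": "yes", "region": "southwest", "charges": 16884.924},
    1: {"age": 18, "sex": "male", "bmi": 33.77, "children": 1,
        "smoker": "no", "region": "southeast", "charges": 1725.5523},
    2: {"age": 28, "sex": "male", "bmi": 33.0, "children": 3,
        "smoker": "no", "region": "southeast", "charges": 4449.462},
}

# Look up a single patient by row index.
first = patients[0]

# Collect the row indices of all smokers.
smoker_ids = [idx for idx, record in patients.items() if record["smoker"] == "yes"]

# Average charge among the non-smokers in this three-record sample only.
non_smoker_charges = [r["charges"] for r in patients.values() if r["smoker"] == "no"]
average_non_smoker_charge = sum(non_smoker_charges) / len(non_smoker_charges)

print(first["region"], smoker_ids, average_non_smoker_charge)
```

With only three records, the average says nothing about the full dataset; the point is the access pattern.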