# U.S. Medical Insurance Data

This analysis uses a dataset of medical insurance costs provided by codecademy.com as part of its Python learning courses. The dataset **insurance.csv** contains information on age, sex, BMI, number of children, smoker status, region, and charges. This is a univariate analysis using pandas DataFrame commands.


## Prep Work

First, I will import the libraries I need to complete my analysis. For this analysis, I know I will need the **pandas** library.

In [1]:
#import pandas
import pandas as pd

Next, I will read the CSV into a pandas DataFrame for analysis and print the DataFrame.

In [2]:
#import csv to dataframe
df = pd.read_csv("insurance.csv") 

#output dataframe
print(df)

      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]


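I want to investigate whether there is any missing (blank) information in my DataFrame by using the count function, which reports the number of non-null values in each column.<br>
- There is no missing information in this dataset: every column has 1,338 non-null values, matching the 1,338 rows.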

In [3]:
#check for any blank cells.
df.count()

age         1338
sex         1338
bmi         1338
children    1338
smoker      1338
region      1338
charges     1338
dtype: int64
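
`count()` reports non-null cells, so missing data shows up only indirectly, as a count below the row total. As an alternative sketch, `isna().sum()` counts the missing cells directly. The frame `toy` and the helper `missing_per_column` below are made up for illustration; they are not part of insurance.csv or of the cells above.

```python
import pandas as pd


def missing_per_column(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing (NaN/None) cells in each column."""
    return frame.isna().sum()


# Toy frame for illustration only; these are not rows from insurance.csv.
toy = pd.DataFrame({
    "age": [19, 18, None],
    "smoker": ["yes", None, "no"],
    "charges": [100.0, 200.0, 300.0],
})

missing = missing_per_column(toy)
print(missing)
```

On a frame with no gaps, such as the one loaded above, every entry of the result would be 0.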

## Univariate analysis

**Age:**<br>
- Identify the basic descriptive statistics for age, then count how many individuals fall in each of the following age groups: 18-24, 25-34, 35-44, 45-54, 55-64.<br>
    - **Findings:** ages range from 18 to 64 years old. All age groups are represented roughly equally. The largest group is 45-54 (287 individuals), which has 45 more individuals than the smallest group, 55-64 (242). The average age (rounded down) is 39.

In [4]:
#use describe to get descriptive statistic information on age column
df["age"].describe()

count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [5]:
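#create 'age_range' column and assign a value for each individual.
#bins are right-inclusive: (0, 24], (24, 34], (34, 44], (44, 54], (54, 64]
df['age_range'] = pd.cut(df['age'], bins=[0, 24, 34, 44, 54, 64], labels=['18-24', '25-34', '35-44', '45-54', '55-64'])

# group data by 'age_range' and count the number of individuals per range.
# observed=False keeps every age group in the result, including any empty ones.
age_range_counts = df.groupby('age_range', observed=False)['age'].count()
print(age_range_counts)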

age_range
18-24    278
25-34    271
35-44    260
45-54    287
55-64    242
Name: age, dtype: int64
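
The grouping depends on how `pd.cut` treats bin edges: by default each interval includes its right edge, so an age of 24 lands in 18-24 and an age of 25 lands in 25-34. The sketch below checks this on a handful of boundary ages. `toy_ages` is hypothetical data chosen for illustration, and the lower edge of 17 is my own choice to make the first interval read as (17, 24].

```python
import pandas as pd

# Hypothetical ages sitting on the group boundaries; not rows from insurance.csv.
toy_ages = pd.Series([18, 24, 25, 34, 35, 44, 45, 54, 55, 64], name="age")

labels = ["18-24", "25-34", "35-44", "45-54", "55-64"]
# pd.cut uses right-inclusive intervals by default: (17, 24], (24, 34], ...
toy_ranges = pd.cut(toy_ages, bins=[17, 24, 34, 44, 54, 64], labels=labels)

# Count per group, kept in age order rather than sorted by frequency.
toy_counts = toy_ranges.value_counts().sort_index()
print(toy_counts)

# Gap between the largest and smallest group.
print(toy_counts.max() - toy_counts.min())
```

Applied to the counts printed above, the same subtraction gives 287 - 242 = 45.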


**Sex:**<br>
- Identify the number of male vs. female individuals<br>
    - **Findings:** there are 14 more male individuals (676) than female individuals (662) in this dataset

In [6]:
#find how many people are listed as male or female
df['sex'].value_counts()

male      676
female    662
Name: sex, dtype: int64
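
To put the two counts in proportion, here is a small sketch that rebuilds them as a Series and computes the difference and the percentage shares. The numbers are copied from the output above; the variable names are my own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
sex_counts = pd.Series({"male": 676, "female": 662}, name="sex")

# Absolute difference between the two groups.
difference = sex_counts["male"] - sex_counts["female"]

# Share of each group as a percentage of all individuals, to one decimal place.
shares = (sex_counts / sex_counts.sum() * 100).round(1)

print(difference)
print(shares)
```

On the full DataFrame, `df['sex'].value_counts(normalize=True)` should return the same shares as fractions rather than percentages.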

**BMI:**<br>
- Identify the basic descriptive statistics for bmi<br>
    -  **Findings:** BMI in the dataset ranges from 15.96 to 53.13, with an average of 30.66 <br>


In [7]:
#use describe to get descriptive statistic information on bmi column
df["bmi"].describe()

count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64

**Children:**<br>
- Identify the basic descriptive statistics for number of children and identify how many people have 0, 1, 2, 3, 4, or 5 children<br>
    -  **Findings:** the number of children an individual has ranges from 0 to 5, with an average of 1 (rounded down). The most common value is 0 children (574 of 1,338 individuals, about 43%)<br>


In [8]:
#use describe to get descriptive statistic information on children column
df["children"].describe()

count    1338.000000
mean        1.094918
std         1.205493
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max         5.000000
Name: children, dtype: float64

In [9]:
#count how many people have 0-5 children
df['children'].value_counts()

0    574
1    324
2    240
3    157
4     25
5     18
Name: children, dtype: int64
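
As a cross-check, the mean reported by `describe()` can be recovered from these counts as a weighted average. The sketch below uses the counts copied from the output above; the variable names are my own.

```python
import pandas as pd

# Number of individuals per child count, copied from the value_counts() output above.
child_counts = pd.Series({0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18})

# Total individuals; should match the 1338 rows in the dataset.
total = child_counts.sum()

# Weighted mean: sum(children * individuals) / total individuals.
weighted_mean = (child_counts.index.to_numpy() * child_counts.to_numpy()).sum() / total

# Share of individuals with no children, as a percentage.
share_no_children = round(child_counts[0] / total * 100, 1)

print(total, weighted_mean, share_no_children)
```

The weighted mean works out to 1465 / 1338, which agrees with the 1.094918 shown by `describe()`.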

**Smoker:**<br>
- Identify the number of smokers vs. non-smokers<br>
    - **Findings:** most individuals in this dataset are listed as non-smokers (1,064 of 1,338, about 80%)

In [10]:
#find how many people are listed as smoker or non-smoker
df['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

**Region:**<br>
- Identify unique region names and how many people are representing each region<br>
    -  **Findings:** there are 4 different regions, all within the United States. They are represented roughly equally, with the exception of "southeast" (364), which has 40 more individuals than the least represented region, "northeast" (324). <br>


In [11]:
#find how many people are representing each uniquely named region.
df['region'].value_counts()

southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64
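
The gap between the most and least represented regions can be computed rather than read off by eye. This sketch rebuilds the counts from the output above and uses `idxmax` and `idxmin` to name the two regions; the variable names are my own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
region_counts = pd.Series(
    {"southeast": 364, "southwest": 325, "northwest": 325, "northeast": 324},
    name="region",
)

largest = region_counts.idxmax()   # region with the most individuals
smallest = region_counts.idxmin()  # region with the fewest individuals
gap = region_counts.max() - region_counts.min()

print(largest, smallest, gap)
```

With these counts the gap is 364 - 324 = 40.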

**Charges:**<br>
- Identify the basic descriptive statistics for charges<br>
    -  **Findings:** the charges in the dataset range from 1,121.87 to 63,770.43, with an average of 13,270.42 <br>


In [12]:
#use describe to get descriptive statistic information on charges column
df["charges"].describe()

count     1338.000000
mean     13270.422265
std      12110.011237
min       1121.873900
25%       4740.287150
50%       9382.033000
75%      16639.912515
max      63770.428010
Name: charges, dtype: float64
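
When reporting these figures, it helps to round them all the same way. A minimal sketch, assuming two decimal places and thousands separators are the desired style; the dictionary below holds values copied from the `describe()` output above.

```python
# Values copied from the describe() output above.
charges_stats = {"min": 1121.873900, "mean": 13270.422265, "max": 63770.428010}

# Format each value with a thousands separator and two decimal places.
formatted = {name: f"{value:,.2f}" for name, value in charges_stats.items()}

print(formatted)
```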