# U.S. Medical Insurance Costs

# Exploring the Data:

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("insurance.csv")
data.head(3)
# preview a few rows to see what sort of data is present
# and how many columns there are

|   |age|sex   |bmi  |children|smoker|region   |charges   |
|---|---|------|-----|--------|------|---------|----------|
|0  |19 |female|27.9 |0       |yes   |southwest|16884.924 |
|1  |18 |male  |33.77|1       |no    |southeast|1725.5523 |
|2  |28 |male  |33.0 |3       |no    |southeast|4449.462  |


The dataset has 7 columns, with mixed data types (likely: float, string, integer).

In [3]:
data.shape[0] #row count

1338

There are 1338 rows (observations) in this dataset!

In [4]:
data['region'].unique() #all unique regions

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

We see that there are four unique regions: southwest, southeast, northwest, and northeast.

In [5]:
data.dtypes #data types for each variable

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

We can see that the three data types are indeed the following: integer, float, string/object.
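The same split between numeric and text columns can be obtained programmatically with `select_dtypes`. The following is a minimal sketch, assuming a tiny frame rebuilt from the three rows previewed earlier; the name `sample` is illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative frame rebuilt from the three rows printed by data.head(3)
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Integer and float columns can go into a regression as they are
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else is text and needs encoding before modelling
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Running the same two calls on the full `data` frame should give the same column lists, since the column types do not depend on the number of rows here.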

In [6]:
data.describe() #describe the data using summary statistics

|      |age      |bmi      |children|charges     |
|------|---------|---------|--------|------------|
|count |1338.0   |1338.0   |1338.0  |1338.0      |
|mean  |39.207025|30.663397|1.094918|13270.422265|
|std   |14.04996 |6.098187 |1.205493|12110.011237|
|min   |18.0     |15.96    |0.0     |1121.8739   |
|25%   |27.0     |26.29625 |0.0     |4740.28715  |
|50%   |39.0     |30.4     |1.0     |9382.033    |
|75%   |51.0     |34.69375 |2.0     |16639.912515|
|max   |64.0     |53.13    |5.0     |63770.42801 |


We can generate summary statistics for the integer and float variables. From the table above, we observe the following:
1. The data covers people aged 18 to 64, with the middle 50% aged between 27 and 51.
2. BMI ranges from 15.96 to 53.13. The 25th percentile (26.3) already sits above the usual overweight cutoff of 25, and the 75th percentile (34.7) is above the obesity cutoff of 30, so the middle 50% falls in the overweight to obese range. That is, fewer than 25% of the people in the sample fall in the healthy or underweight categories!
3. The median number of children is 1, and the range runs from 0 to 5.
4. The mean and median are close for age (39.2 vs. 39), bmi (30.7 vs. 30.4), and children (1.09 vs. 1).
5. The mean charge (about 13,270) is well above the median (about 9,382), suggesting that a few very high charges skew the average upwards.
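The comparison of mean and median can be made concrete by computing the relative gap for each variable. This is a small sketch that only reuses the figures from the `data.describe()` output; the dictionary names are illustrative.

```python
# Mean and median copied from the data.describe() output
summary = {
    "age": (39.207025, 39.0),
    "bmi": (30.663397, 30.4),
    "children": (1.094918, 1.0),
    "charges": (13270.422265, 9382.033),
}

# Relative gap between mean and median; a large positive gap hints at right skew
relative_gap = {
    name: (mean - median) / median
    for name, (mean, median) in summary.items()
}

for name, gap in relative_gap.items():
    print(f"{name}: mean exceeds median by {gap:.1%}")
```

On the full frame, `data['charges'].skew()` would give a more direct measure of the same asymmetry.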

# Relabel data (Convert strings to numeric)

In [7]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
label = le.fit_transform(data['region'])
# drop and re-add the column so the encoded version moves to the end of the frame
data.drop("region", axis=1, inplace=True)
data["region"] = label

# LabelEncoder assigns codes in alphabetical order:
# "northeast" = 0, "northwest" = 1, "southeast" = 2, "southwest" = 3

le2 = LabelEncoder()
label2 = le2.fit_transform(data['sex'])
data.drop("sex", axis=1, inplace=True)
data["sex"] = label2

# recoded as follows: "female" = 0, "male" = 1

le3 = LabelEncoder()
label3 = le3.fit_transform(data['smoker'])
data.drop("smoker", axis=1, inplace=True)
data["smoker"] = label3

# recoded as follows: smoker "no" = 0, "yes" = 1
data.head(3)

|   |age|bmi  |children|charges   |region|sex|smoker|
|---|---|-----|--------|----------|------|---|------|
|0  |19 |27.9 |0       |16884.924 |3     |0  |1     |
|1  |18 |33.77|1       |1725.5523 |2     |1  |0     |
|2  |28 |33.0 |3       |4449.462  |2     |1  |0     |
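Rather than tracking the recoding by hand in comments, the mapping can be read back from a fitted encoder. This is a minimal sketch, assuming the four region names reported by `data['region'].unique()`; the variable `region_codes` is illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# The four region names reported by data['region'].unique()
regions = ["southwest", "southeast", "northwest", "northeast"]

le = LabelEncoder()
le.fit(regions)

# classes_ holds the labels in sorted order; each position is the integer code
region_codes = {str(name): code for code, name in enumerate(le.classes_)}
print(region_codes)

# inverse_transform maps integer codes back to the original labels
decoded = [str(name) for name in le.inverse_transform([0, 3])]
print(decoded)
```

The same check works for the `sex` and `smoker` encoders, as long as it runs on the encoder object that was fitted to that column.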


SCOPE - Potential Questions to Answer:
1. What are the factors that have the biggest impact on insurance costs?
2. What are the regional variations for insurance cost?
3. How much does smoking status change the insurance cost?
4. How much does an additional child add to insurance cost?
5. Is there a gender difference for insurance cost?
6. What would the insurance cost be for me given my specifications?
7. Are there specific traits for which this dataset isn't sufficiently comprehensive to give strong estimates for new users?

# Multiple Linear Regression

In [8]:
from sklearn import linear_model

X = data[['age', 'bmi','children','region','sex','smoker']]
y = data['charges']

regr = linear_model.LinearRegression()
regr.fit(X, y)

print(regr.coef_)

[  257.28807486   332.57013224   479.36939355  -353.64001656
  -131.11057962 23820.43412267]


From the regression above, we see that the coefficients for the independent variables, rounded to 1 decimal place, are as follows:

|Variable |Coefficient|
|-------- |-----------|
|age      |257.3      |
|bmi      |332.6      |
|children |479.4      |
|region   |-353.6     |
|sex      |-131.1     |
|smoker   |23820.4    |
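In a linear model, each coefficient is the change in predicted charges for a one-unit increase in that variable while the others stay fixed, so the difference between two predictions does not depend on the intercept. The sketch below illustrates this with the coefficients printed by `regr.coef_`; the helper `linear_part` and the example profiles are hypothetical.

```python
# Coefficients as printed by regr.coef_, in the column order of X
coef = {
    "age": 257.28807486,
    "bmi": 332.57013224,
    "children": 479.36939355,
    "region": -353.64001656,
    "sex": -131.11057962,
    "smoker": 23820.43412267,
}

def linear_part(profile):
    """Sum of coefficient * value; the intercept is omitted because it cancels in differences."""
    return sum(coef[name] * value for name, value in profile.items())

# Two hypothetical profiles that differ only in smoking status
base = {"age": 40, "bmi": 30.0, "children": 1, "region": 0, "sex": 0, "smoker": 0}
smoker = dict(base, smoker=1)

smoker_gap = linear_part(smoker) - linear_part(base)
print(round(smoker_gap, 2))

# Region is an integer code, so moving from code 0 to code 3 adds three times the coefficient
region_gap = linear_part(dict(base, region=3)) - linear_part(base)
print(round(region_gap, 2))
```

Note that the region gap follows entirely from treating the region code as a number, which is worth keeping in mind when reading the regional comparison.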

# Answers to SCOPE questions 1-5:
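1. The biggest factor for insurance cost in this data is whether an individual smokes. Holding the other variables constant, the model estimates that a smoker pays about $23,820 more than a nonsmoker.
2. Region entered the model as a single integer code (northeast = 0 through southwest = 3), so the coefficient of -353.6 means each step up in code lowers the predicted cost by about $354. Under that coding, the model puts the Southwest as the cheapest region and the Northeast as the most expensive, with a gap of about $1,061 (3 × 353.6). This ordering partly reflects the alphabetical coding rather than a direct comparison of regional averages, so it should be read with caution.
3. As explained in the answer to Q1, smokers are estimated to pay about $23,820 more than nonsmokers.
4. Each additional child adds about $479 to the predicted insurance cost.
5. Holding the other variables constant, males are estimated to pay about $131 less than females for the same insurance!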
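Because `region` is a category without a natural order, one alternative to a single integer code is one-hot encoding, which gives each region its own coefficient relative to a baseline. The following is only a sketch on made-up toy numbers (not the insurance data), using `pd.get_dummies`; the frame `toy` and its values are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up toy data (NOT from insurance.csv); region means are deliberately not ordered by code
toy = pd.DataFrame({
    "region": ["northeast", "northeast", "northwest", "northwest",
               "southeast", "southeast", "southwest", "southwest"],
    "charges": [100.0, 120.0, 80.0, 90.0, 150.0, 170.0, 95.0, 105.0],
})

# One indicator column per region; drop_first makes the first category (northeast) the baseline
X_toy = pd.get_dummies(toy["region"], drop_first=True, dtype=float)
model = LinearRegression().fit(X_toy, toy["charges"])

# Each coefficient is that region's mean charge minus the baseline region's mean
effects = dict(zip(X_toy.columns, model.coef_))
print(model.intercept_)
print(effects)
```

With this encoding the model can represent any pattern of regional differences, whereas a single integer code forces equal steps between consecutive regions.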

# Predicting Abhishek's Insurance Cost!

In [9]:
AAage = 26
AAbmi = 24.9
AAchildren = 0
AAregion = 0 #northeast
AAsex = 1 #male
AAsmoker = 0 #nonsmoker
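# pass a DataFrame with the training column names so scikit-learn can match the features
AAprofile = pd.DataFrame([[AAage, AAbmi, AAchildren, AAregion, AAsex, AAsmoker]], columns=X.columns)
predictedInsurance = regr.predict(AAprofile)
predictedInsurance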

array([3023.92333829])

6. Abhishek's predicted insurance cost is about $3,024.
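A prediction is only supported by the data when the inputs fall inside the ranges the model was fitted on. The sketch below checks the numeric inputs against the minimum and maximum values from the `data.describe()` output; the helper `out_of_range` is hypothetical.

```python
# Observed min/max taken from the data.describe() output
observed_range = {
    "age": (18, 64),
    "bmi": (15.96, 53.13),
    "children": (0, 5),
}

# Numeric inputs used for the prediction above
new_person = {"age": 26, "bmi": 24.9, "children": 0}

def out_of_range(person, ranges):
    """Return the features whose value lies outside the observed [min, max]."""
    return [
        name for name, (low, high) in ranges.items()
        if not low <= person[name] <= high
    ]

flagged = out_of_range(new_person, observed_range)
print(flagged)

# A hypothetical 70-year-old would fall outside the observed age range
flagged_older = out_of_range({"age": 70, "bmi": 24.9, "children": 0}, observed_range)
print(flagged_older)
```

An empty list means all numeric inputs are within the observed ranges; any flagged feature means the model is extrapolating.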

# Checking data for caveats regarding use of regression coefficients

In [10]:
data['sex'].value_counts()

1    676
0    662
Name: sex, dtype: int64

In [12]:
data['smoker'].value_counts()

0    1064
1     274
Name: smoker, dtype: int64

In [13]:
data['region'].value_counts()

2    364
3    325
1    325
0    324
Name: region, dtype: int64

From the counts above, we see that the dataset is relatively well spread across the categorical values: the split between males and females is nearly even (676 vs. 662), and the four regions are roughly even as well (324 to 364 each). About a fifth of all individuals are smokers (274 of 1,338, or 20.5%), which gives the model a reasonable number of smokers to learn from. Predictions should still be most reliable for new individuals who resemble the sample, that is, people aged 18 to 64 with up to 5 children.
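The shares quoted for each category can be recomputed directly from the counts. This is a minimal sketch that only reuses the `value_counts()` outputs shown above; the dictionary names are illustrative.

```python
# Counts copied from the value_counts() outputs
counts = {
    "sex": {1: 676, 0: 662},
    "smoker": {0: 1064, 1: 274},
    "region": {2: 364, 3: 325, 1: 325, 0: 324},
}

# Convert each count to a share of its column total
shares = {
    column: {code: n / sum(by_code.values()) for code, n in by_code.items()}
    for column, by_code in counts.items()
}

for column, by_code in shares.items():
    print(column, {code: round(share, 3) for code, share in by_code.items()})
```

On the full frame, `data['smoker'].value_counts(normalize=True)` should return the same proportions without the manual division.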

One caveat with the regression is that variables outside the dataset, such as pre-existing conditions or prior insurance claims, may affect the cost of insurance for any given individual. These have not been factored into the model above.