# Dataset

Task: Predict health insurance charges from features of the insured person (age, sex, bmi, children, smoker, region).

Dataset link on Kaggle: <a href="https://www.kaggle.com/teertha/ushealthinsurancedataset">US Health Insurance Dataset</a>

Dataset description on Kaggle: 
<br> "*This dataset contains **1338 rows** of insured data, where the Insurance charges are given against the following **attributes** of the insured: **Age, Sex, BMI, Number of Children, Smoker and Region**. There are no missing or undefined values in the dataset.*"

In [58]:
import pandas as pd

data = pd.read_csv('insurance.csv')
data.head()

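| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |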


# Check categorical data types and values

In [59]:
data.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [60]:
print('sex',data['sex'].unique())
print('smoker',data['smoker'].unique())
print('region',data['region'].unique())

sex ['female' 'male']
smoker ['yes' 'no']
region ['southwest' 'southeast' 'northwest' 'northeast']


**Answer the following questions:**
1. Which features are categorical?
2. What unique values appear in each feature?
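
One way to answer both questions programmatically is to select the non-numeric columns and list their unique values. The sketch below uses a small hand-made DataFrame (`sample`, a hypothetical stand-in built from the first rows shown earlier) instead of `insurance.csv`, so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for `data`: the first three rows shown earlier.
sample = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0],
    'children': [0, 1, 3],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast'],
    'charges': [16884.924, 1725.5523, 4449.462],
})

# Columns that are not numeric are the categorical candidates.
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

# Unique values per categorical column, in order of first appearance.
unique_values = {col: sample[col].unique().tolist() for col in categorical_cols}

print(categorical_cols)
print(unique_values)
```

Running the same two statements on the full `data` DataFrame should reproduce the lists printed by the cell above.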

# Encoding binary categorical features using numerical encoding

In [98]:
data_encoded_1 = data.replace({
    'sex': {'male': 0, 'female': 1},
    'smoker': {'yes': 1, 'no': 0},
})

data_encoded_1.head()

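| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.9 | 0 | 1 | southwest | 16884.924 |
| 1 | 18 | 0 | 33.77 | 1 | 0 | southeast | 1725.5523 |
| 2 | 28 | 0 | 33.0 | 3 | 0 | southeast | 4449.462 |
| 3 | 33 | 0 | 22.705 | 0 | 0 | northwest | 21984.47061 |
| 4 | 32 | 0 | 28.88 | 0 | 0 | northwest | 3866.8552 |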


**Note:** The previous output must show numeric values for the `sex` and `smoker` features.

# Encoding nominal features using one-hot encoding

In [99]:
# One-hot encode the remaining categorical column (`region`);
# `dtype=int` keeps the indicator columns as 0/1 instead of booleans
data_encoded_2 = pd.get_dummies(data_encoded_1, columns=['region'], prefix='is', dtype=int)
data_encoded_2.head()

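| | age | sex | bmi | children | smoker | charges | is_northeast | is_northwest | is_southeast | is_southwest |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.9 | 0 | 1 | 16884.924 | 0 | 0 | 0 | 1 |
| 1 | 18 | 0 | 33.77 | 1 | 0 | 1725.5523 | 0 | 0 | 1 | 0 |
| 2 | 28 | 0 | 33.0 | 3 | 0 | 4449.462 | 0 | 0 | 1 | 0 |
| 3 | 33 | 0 | 22.705 | 0 | 0 | 21984.47061 | 0 | 1 | 0 | 0 |
| 4 | 32 | 0 | 28.88 | 0 | 0 | 3866.8552 | 0 | 1 | 0 | 0 |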


**Note:** The new dataset `data_encoded_2` must have numeric values for **all features**. 

**Note:** The output column `charges` now sits in the middle of the DataFrame. This does not matter, because the next code cell separates it from the inputs.
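
To make the one-hot step concrete, here is a minimal pure-Python sketch of what it does to a single column. The helper `one_hot` is hypothetical (it is not a pandas function). It assumes categories are ordered alphabetically and prefixed with `is_`, which matches the column names in the output above.

```python
def one_hot(values, prefix='is'):
    """Return {column_name: [0/1, ...]} with one indicator column per category."""
    categories = sorted(set(values))  # alphabetical order, as in the output above
    return {
        f'{prefix}_{cat}': [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Region values of the first five rows shown earlier.
regions = ['southwest', 'southeast', 'southeast', 'northwest', 'northwest']
encoded = one_hot(regions)

for name, column in encoded.items():
    print(name, column)
```

Only three indicator columns appear here because `northeast` does not occur in these five rows; on the full dataset there are four. Each row has exactly one `1` across the indicator columns.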

# Splitting data into input and output

In [100]:
# Inputs: every column except the target `charges`
data_input = data_encoded_2.drop(columns=['charges'])

# Output: the target column `charges`
data_output = data_encoded_2['charges']

In [101]:
data_input.head()

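| | age | sex | bmi | children | smoker | is_northeast | is_northwest | is_southeast | is_southwest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 1 | 27.9 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 18 | 0 | 33.77 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | 0 | 33.0 | 3 | 0 | 0 | 0 | 1 | 0 |
| 3 | 33 | 0 | 22.705 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32 | 0 | 28.88 | 0 | 0 | 0 | 1 | 0 | 0 |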


**Note:** `data_input` must contain the same columns as `data_encoded_2`, except the `charges` column.

In [102]:
data_output.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

**Note:** `data_output` must contain only the `charges` column.

# Splitting data into (train - validation - test)

In [104]:


from sklearn.model_selection import train_test_split

# First split: hold out 20% of the rows as the test set
X, X_test, y, y_test = train_test_split(
    data_input, data_output, test_size=0.20, random_state=0
)

# Second split: hold out 25% of the remainder as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

In [105]:
print(X_train.shape)
print(y_train.shape)
print()
print(X_val.shape)
print(y_val.shape)
print()
print(X_test.shape)
print(y_test.shape)

(802, 9)
(802,)

(268, 9)
(268,)

(268, 9)
(268,)


**Note:** The previous output should match the following:
```
(802, 9)
(802,)

(268, 9)
(268,)

(268, 9)
(268,)
```
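
The expected shapes follow from simple arithmetic on the 1338 rows. The sketch below reproduces the row counts; the helper `split_sizes` is hypothetical, and it assumes the held-out count is rounded up, which is consistent with the shapes shown above.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), assuming the held-out count is rounded up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 1338  # number of rows stated in the dataset description

n_rest, n_test = split_sizes(n_total, 0.20)  # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.25)   # second split: hold out the validation set

print(n_train, n_val, n_test)  # prints: 802 268 268
```

Because 25% of the remaining 80% is 20% of the whole, the result is roughly a 60/20/20 split into train, validation, and test.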

# Feature scaling using StandardScaler

In [106]:
from sklearn.preprocessing import StandardScaler

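scaler = StandardScaler()
# Fit on the training set only, so no validation or test statistics leak into the scaler
scaler.fit(X_train)

# Apply the same training-set mean and standard deviation to every split
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
X_test_scaled[:5]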

array([[ 0.9011451 , -0.99750934, -0.0581528 , -0.06942463, -0.50894665,
        -0.56295024, -0.55142093, -0.60327902,  1.68990214],
       [ 0.54638446,  1.00249688, -0.19454444, -0.06942463, -0.50894665,
        -0.56295024, -0.55142093,  1.65760778, -0.59175024],
       [ 0.61733659, -0.99750934,  1.64509941,  0.74937764,  1.9648425 ,
        -0.56295024,  1.81349664, -0.60327902, -0.59175024],
       [ 1.53971424, -0.99750934,  1.28604431, -0.8882269 , -0.50894665,
        -0.56295024,  1.81349664, -0.60327902, -0.59175024],
       [ 0.83019297,  1.00249688, -2.05472919, -0.8882269 , -0.50894665,
        -0.56295024,  1.81349664, -0.60327902, -0.59175024]])
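
Standardization replaces each value x with z = (x - mean) / std, where the mean and standard deviation are computed per feature on the training set and then reused for the other splits. The sketch below illustrates this for a single column in pure Python; the helper `standardize` is hypothetical and the numbers are toy values, not taken from the dataset.

```python
import statistics

def standardize(train_column, other_column):
    """Scale both columns using the mean and std of the training column only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population std (divides by n)

    def scale(values):
        return [(x - mean) / std for x in values]

    return scale(train_column), scale(other_column)

# Hypothetical toy values, not taken from the dataset.
train_ages = [20, 30, 40, 50]
test_ages = [35, 60]

train_scaled, test_scaled = standardize(train_ages, test_ages)
print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics, which is the intended behaviour.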

# Linear Regression

## Training and Validation

In [107]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [108]:
# Creating a linear regression model
linear_reg = LinearRegression()

# Training our model
linear_reg.fit(X_train_scaled, y_train)

# Making predictions
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)

# Evaluating our model using R2 score
print('train', r2_score(y_train, y_pred_train))
print('val', r2_score(y_val, y_pred_val))

train 0.7410080381461
val 0.7242096157656368
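
The R2 score compares the model's squared errors with those of a baseline that always predicts the mean of the targets: R2 = 1 - SS_res / SS_tot. The following is a minimal pure-Python sketch of that formula; the function `r2` is a hypothetical stand-in for illustration, and the values are toy numbers, not taken from the dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model errors
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean-baseline errors
    return 1 - ss_res / ss_tot

# Hypothetical toy values, not taken from the dataset.
y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, y_true))                # perfect predictions -> 1.0
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.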


## Testing

In [109]:
y_pred_test = linear_reg.predict(X_test_scaled)
print('test', r2_score(y_test, y_pred_test))

test 0.7998286945744545
