In [1]:
import pandas as pd
import numpy as np

## README FIRST

In this homework you will build a linear regression model on an Insurance Company dataset. Given the features, predict the insurance cost (charges). 
You need to complete 16 Questions (**Q1** -  **Q16**). **Please answer the question within the same section where the question is asked and indicate your solution for each question in order**. 

Add your name on the following cell:

## YOUR NAME: Julian Cabrera 

## 1. Load and explore the Insurance dataset

Load the `insurance.csv` file as a dataframe, and then answer the following questions: 

- **Q1**. How many data points in the dataset? 1338
- **Q2**. How many columns in the dataset? 7
- **Q3**. How many categorical variables in the dataset? List the categorical variables: 3 (`sex`, `smoker`, `region`)
- **Q4**. How many numerical variables in the dataset? List the numerical variables: 4 (`age`, `bmi`, `children`, `charges`)
- **Q5**. Show the descriptive statistics of the numerical columns (e.g.: mean, standard deviation, minimum value, maximum value): see the `df.describe()` cell below.
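
The categorical/numerical split in Q3 and Q4 can also be read off the column dtypes instead of by eye. The following is a minimal sketch, assuming a small hypothetical frame named `sample` that has the same column names as `insurance.csv`; the three rows are copied from the first rows of the dataset.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as insurance.csv
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Columns with a numeric dtype (int or float) are the numerical variables;
# every remaining column holds strings and is treated as categorical
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in sample.columns if c not in numerical_cols]

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Running the same two lines on `df` right after loading should give the counts used in Q3 and Q4.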

In [2]:
df = pd.read_csv("insurance.csv")
df.shape

(1338, 7)

In [3]:
df.describe() # Describing Statistics of Data 

               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010


In [4]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [5]:
df.head()  # call the method; a bare df.head only prints the bound method

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

## 2. Data preprocessing I
In this section, you will need to preprocess the dataset so that you can use it to train a linear regression model

### 2.1 Handling Missing Data
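- **Q6**. Show if the dataset has any missing data? If so, what is your approach on handling the missing data? The dataset has no missing values (every column count in the output below is 0). If it did, I would use imputation, replacing each missing numerical value with the mean of its column.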

In [6]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
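
Since the real dataset has no missing values, the mean-imputation approach described in Q6 is never exercised. The sketch below shows it on a hypothetical frame `toy` with one missing `bmi` entry; the frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing bmi value (the real dataset has none)
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [27.9, np.nan, 33.0, 22.7],
})

# Replace the missing entry with the mean of the observed bmi values
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].mean())

print(toy)
print("Missing after imputation:", int(toy["bmi"].isnull().sum()))
```

In a modeling workflow the mean would normally be computed from the training set only and then reused for the test set, so that no test information leaks into preprocessing.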

### 2.2 Categorical columns encoding
- **Q7**. Linear Regression model requires all features and target to be numerical. Show how you transform each categorical columns into numerical columns. 

In [9]:
from sklearn.preprocessing import LabelEncoder

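label_encoder_sex = LabelEncoder()
label_encoder_smoker = LabelEncoder()

# Binary columns: LabelEncoder assigns codes in alphabetical order
df['sex'] = label_encoder_sex.fit_transform(df['sex'])           # female -> 0, male -> 1
df['smoker'] = label_encoder_smoker.fit_transform(df['smoker'])  # no -> 0, yes -> 1

# One-hot encode region, dropping the first category (northeast) as the baseline.
# Note: .astype(int) casts every column, so bmi and charges are truncated to whole numbers.
df = pd.get_dummies(df, columns=['region'], drop_first=True).astype(int)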


print(df.head())



   age  sex  bmi  children  smoker  charges  region_northwest  \
0   19    0   27         0       1    16884                 0   
1   18    1   33         1       0     1725                 0   
2   28    1   33         3       0     4449                 0   
3   33    1   22         0       0    21984                 1   
4   32    1   28         0       0     3866                 1   

   region_southeast  region_southwest  
0                 0                 1  
1                 1                 0  
2                 1                 0  
3                 0                 0  
4                 0                 0  
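
The output above shows `bmi` and `charges` as whole numbers because `.astype(int)` is applied to the entire dataframe. One way to avoid that, sketched here on a hypothetical three-row frame `sample`, is to pass `dtype=int` to `pd.get_dummies` so that only the generated dummy columns are converted. This is an alternative to the cell above, not the code that produced the results in this notebook; rerunning the notebook with it would change the MSE values reported later.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the first rows of the dataset
sample = pd.DataFrame({
    "bmi": [27.9, 33.77, 22.705],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [16884.924, 1725.5523, 21984.47061],
})

# dtype=int applies only to the new dummy columns, so the float columns keep their decimals
encoded = pd.get_dummies(sample, columns=["region"], drop_first=True, dtype=int)

print(encoded)
print(encoded.dtypes)
```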


- **Q8**. After transforming the categorical data into numerical columns, show the new or updated dataframe: see the `print(df.head())` output above.

### 2.3 Get Predictor Variables (X) and Target Variable (Y)

- **Q9**. The modified dataframe from the previous section still contains all the predictor variables and the target variable together. Show how to get the predictor variables, `X`, and the target variable, `Y`. Set `X` and `Y` as numpy arrays.

In [44]:
X = df.drop("charges", axis=1).values
Y = df["charges"].values

### 2.4 Splitting dataset into training and testing sets.

- **Q10**. Split `X` and `Y` into training and testing sets. The training set should have 25% of the dataset. Show the size (shape) of `X_train`, `X_test`, `Y_train`, and `Y_test`.


In [43]:
from sklearn.model_selection import train_test_split


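X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.25, random_state=1)

# Show the shape of each split, as Q10 asks
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)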


## 3. Data preprocessing II
Now, you continue preprocessing the preprocessed dataset from the previous section

### 3.1 Standardize X_train

- **Q11**. Many learning algorithms including linear regression require input features on the same scale for optimal performance.  Show how you standardize `X_train` and save the standardized input features into a variable named `X_train_std`.


In [42]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
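
Standardization replaces each feature value x with z = (x - mean) / std, where the mean and standard deviation are computed per column. The sketch below reproduces that by hand with NumPy on a hypothetical matrix `X_toy`; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on different scales
X_toy = np.array([
    [19.0, 27.9],
    [18.0, 33.8],
    [28.0, 33.0],
    [33.0, 22.7],
])

mu = X_toy.mean(axis=0)    # per-column mean
sigma = X_toy.std(axis=0)  # per-column population standard deviation (ddof=0)
X_toy_std = (X_toy - mu) / sigma

print(X_toy_std.mean(axis=0))  # approximately 0 for each column
print(X_toy_std.std(axis=0))   # approximately 1 for each column
```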


## 4. Train a Linear Regression model

Now you are ready to train a linear regression model. You can use the LinearRegressionGD class that we created from scratch, meaning you have to copy that class into this notebook, or use the class from scikit-learn.

You can import the LinearRegression class from scikit-learn as follows:

```from sklearn.linear_model import LinearRegression ```

Please refer to this documentation for more details [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

- **Q12**. Train a LinearRegression model and then show the mean squared error of the model on the standardized training data.

In [41]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


model = LinearRegression()
model.fit(X_train_std, Y_train)

Y_train_pred = model.predict(X_train_std)

train_mse = mean_squared_error(Y_train, Y_train_pred)


print("Mean Squared Error on Training Data:", train_mse)

Mean Squared Error on Training Data: 36628489.991965264
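
For reference, the mean squared error is the average of the squared differences between the true and predicted targets. A minimal sketch with hypothetical arrays `y_true` and `y_pred` (made-up values, unrelated to the insurance data):

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors are 0.25, 0.0, 4.0, 1.0; their mean is the MSE
mse_manual = np.mean((y_true - y_pred) ** 2)

print(mse_manual)  # 1.3125
```

Because the errors are squared, the MSE is expressed in squared units of the target, which is why the value above is so large for charges in the tens of thousands.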


## 5. Evaluate the linear regression model

After fitting (training) the linear regression model in the previous section, the next task is to evaluate its performance on the testing set. Since the model is trained using the standardized training data (`X_train_std`), you need to evaluate it using the standardized testing set.

- **Q13**. Show how you standardize the testing set based on the standardized training set (Hint: if you use `StandardScaler()`, pass `X_test` as an argument into the `transform` method of the `StandardScaler()` instance that you have fitted in Section 3.1). Save the standardized `X_test` as `X_test_std`

In [40]:
X_test_std = scaler.transform(X_test)

- **Q14**. Predict the target values of the `X_test_std` and show the mean squared error (MSE) of the testing set. Is the MSE of the testing set better (lower) than the MSE of the training set? No. The MSE of the testing set (37,442,767.90) is higher than the MSE of the training set (36,628,489.99), so the model performs slightly worse on unseen data.

In [36]:
Y_test_pred = model.predict(X_test_std)

test_mse = mean_squared_error(Y_test, Y_test_pred)

print("Mean Squared Error on Testing Data:", test_mse)

Mean Squared Error on Testing Data: 37442767.89971797
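
To put the two MSE values on a more readable scale, one option is to take their square roots (RMSE), which are in the same units as `charges`, and to compute the relative gap between them. The sketch below only reuses the two MSE values printed in this notebook; the names `train_rmse`, `test_rmse`, and `relative_gap` are introduced here for illustration.

```python
import math

# MSE values copied from the outputs printed in this notebook
train_mse = 36628489.991965264
test_mse = 37442767.89971797

# RMSE is the square root of the MSE, in the same units as charges
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# Relative increase of the test MSE over the training MSE
relative_gap = (test_mse - train_mse) / train_mse

print(f"Train RMSE: {train_rmse:.2f}")
print(f"Test RMSE:  {test_rmse:.2f}")
print(f"Test MSE exceeds train MSE by {relative_gap:.1%}")
```

The RMSE comes out at roughly 6,052 on the training set and 6,119 on the testing set, and the test MSE is about 2% higher, so the gap between the two sets is small.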


## 6. Build another linear regression model using the `make_pipeline` function
Now that you have fitted and evaluated the first model, it's time to create another model. Basically you do not need to repeat Sections 1-2. However, use the `make_pipeline` function to streamline Sections 4 and 5 into a single workflow.

- **Q15**. Show how you create a pipeline model that streamlines input feature standardization (e.g. using StandardScaler) and modeling (LinearRegression) into a single workflow. Train (fit) the pipeline model using `X_train` and show the mean squared error (MSE) achieved by the pipeline model on `X_train`

In [38]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# make_pipeline chains the scaler and the model and names the steps automatically
pipeline = make_pipeline(StandardScaler(), LinearRegression())

pipeline.fit(X_train, Y_train)

y_pred = pipeline.predict(X_train)

mse = mean_squared_error(Y_train, y_pred)
print("Mean Squared Error on Training Data:", mse)


Mean Squared Error on Training Data: 36628489.991965264


- **Q16**. Evaluate the fitted pipeline using `X_test`. Show the mean squared error (MSE) achieved by the pipeline model on `X_test`.

In [37]:
# The pipeline was already fitted on the training set in Q15, so it is not refitted here

y_test_pred = pipeline.predict(X_test)

mse_test = mean_squared_error(Y_test, y_test_pred)


print("Mean Squared Error on Test Data:", mse_test)


Mean Squared Error on Test Data: 37442767.89971797
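
The pipeline MSE values match those from Sections 4 and 5 because the pipeline performs the same two steps internally: it fits the scaler on the training data, then reuses those statistics when predicting on new data. A minimal sketch of that equivalence on synthetic data; the arrays `X_tr`, `X_te`, and `y_tr` are hypothetical stand-ins for the insurance splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical), standing in for the training and testing splits
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(50, 3))
X_te = rng.normal(size=(20, 3))
y_tr = X_tr @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Manual workflow: fit the scaler on training data, reuse it for the test data
sc = StandardScaler().fit(X_tr)
manual = LinearRegression().fit(sc.transform(X_tr), y_tr)
manual_pred = manual.predict(sc.transform(X_te))

# Pipeline workflow: the same two steps behind a single fit/predict interface
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe_pred = pipe.fit(X_tr, y_tr).predict(X_te)

print(np.allclose(manual_pred, pipe_pred))  # True
```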
