# Lab 4 Exercises

Now that you've seen how linear regression was applied in the main lab, implement it yourself with slightly different steps to see how they affect the results.

## Step 1 - Import Libraries

Since we'll be using these libraries frequently, let's load them as the first step.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Step 2 - Load the Dataset

In [2]:
try:
    df = pd.read_csv('../data/insurance.csv')
except FileNotFoundError:
    df = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2024/refs/heads/main/data/insurance.csv')
df.head()

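| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |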


## Step 3 - Data Understanding

Call `.describe()` for summary statistics and `.describe(include='object')` for a summary of the non-numerical attributes.

How many smokers and non-smokers are there in the dataset?

Hint: `value_counts()`
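As a rough sketch of how `value_counts()` behaves, the snippet below uses only the five `smoker` values visible in the Step 2 preview, so its counts are not the answer for the full dataset. The variable name `sample` is an assumption made for illustration.

```python
import pandas as pd

# Toy column built from the five preview rows in Step 2 (not the full dataset).
sample = pd.DataFrame({"smoker": ["yes", "no", "no", "no", "no"]})

# Number of rows per category, most frequent first.
smoker_counts = sample["smoker"].value_counts()

# The same breakdown expressed as fractions of the total.
smoker_share = sample["smoker"].value_counts(normalize=True)

print(smoker_counts)
print(smoker_share)
```

Running the same two calls on `df['smoker']` gives the counts for the real data.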

## Step 4 - Exploratory Data Analysis

Try calling seaborn's `pairplot` function on the data ([documentation page](https://seaborn.pydata.org/generated/seaborn.pairplot.html)). How is this function useful?

Plot the data's correlation matrix as a heatmap using seaborn.

Calculate the mean and standard deviation of insurance charges for smokers and non-smokers.

Hint: `groupby`
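One possible shape for this calculation is sketched below on the five preview rows from Step 2. The names `sample` and `charges_by_smoker` are assumptions, and the numbers will differ on the full dataset.

```python
import pandas as pd

# Toy frame built from the five preview rows in Step 2 (not the full dataset).
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Split the rows by smoker status, then aggregate the charges of each group.
charges_by_smoker = sample.groupby("smoker")["charges"].agg(["mean", "std"])

# The 'yes' group has a single row here, so its sample standard deviation is NaN.
print(charges_by_smoker)
```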

Create a [box plot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) and a [violin plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html) of insurance charges against the number of children a person has.

## Step 5 - Data Transformation / Pre-Processing

Given that linear regression only accepts numerical attributes, we need to transform our categorical values into numerical ones. Instead of one-hot encoding as in the main lab, apply label encoding.

The label encoder is implemented for 'sex' and 'region' as an example. Label encode the 'smoker' column in the dataframe.
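To see how `LabelEncoder` assigns its integer codes, here is a small sketch on a hypothetical list of yes/no answers (the `answers` list is made up for illustration and is not taken from the dataset).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values, not rows from insurance.csv.
answers = ["yes", "no", "no", "yes", "no"]

le = LabelEncoder()
codes = le.fit_transform(answers)       # fit and transform in a single call

print(le.classes_)                      # sorted unique labels; position = integer code
print(codes)                            # integer code for each input value
print(le.inverse_transform(codes))      # map the codes back to the original labels
```

The codes follow the sorted order of the labels, so they carry no real meaning. For a column with more than two categories, such as 'region', a linear model will still treat the codes as ordered numbers, which is worth keeping in mind when comparing against one-hot encoding.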

In [3]:
from sklearn.preprocessing import LabelEncoder

df_encoded = df.copy()

# label encode 'sex' feature
sex_le = LabelEncoder()

sex_le.fit(df['sex'])
df_encoded['sex'] = sex_le.transform(df['sex'])


# label encode 'region' feature
region_le = LabelEncoder()

region_le.fit(df['region'])
df_encoded['region'] = region_le.transform(df['region'])


# label encode 'smoker' feature



Now that some other features have been converted to a numerical type, plot the correlation matrix again. What deductions can you now make?

Since the insurance charges span a wide range of magnitudes, log scaling the values (a non-linear transformation) is a potential improvement: it compresses a wide range of values into a narrower one, which can improve linear model performance. Apply log scaling to the insurance charges column of `df_encoded`.

Hint: consider using numpy's `log10` function.
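A minimal sketch of the transformation, applied to the five charges from the Step 2 preview rather than to `df_encoded` itself (the array name `charges` is an assumption):

```python
import numpy as np

# The five charges shown in the Step 2 preview.
charges = np.array([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552])

log_charges = np.log10(charges)   # base-10 logarithm, element-wise
restored = 10 ** log_charges      # the antilog undoes the transformation

print("original range:  ", charges.min(), "to", charges.max())
print("log-scaled range:", log_charges.min(), "to", log_charges.max())
```

Values that differ by more than a factor of ten in USD end up roughly one unit apart on the log scale.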

Plot a [histogram plot](https://seaborn.pydata.org/generated/seaborn.histplot.html) of the original insurance charges data with the `kde` parameter set to `True`.

Plot a [histogram plot](https://seaborn.pydata.org/generated/seaborn.histplot.html) of the log-scaled charges data with the `kde` parameter set to `True`. How does it compare to the distribution before log-scaling?

## Step 6 - Modelling

Split your data into training and testing sets and fit the linear regression model from the sklearn library.
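The general pattern can be sketched as follows. The features and target are synthetic stand-ins generated with numpy, not the insurance data, and the 80/20 split and `random_state` value are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(18, 64, size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1.0, size=200)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)              # learn coefficients on the training rows only

print("coefficients:", model.coef_)
print("R^2 on test data:", model.score(X_test, y_test))
```

For the exercise, `X` would be the feature columns of `df_encoded` and `y` its (log-scaled) charges column.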

## Step 7 - Model Evaluation

Evaluate your model using the model's `.score` function. How did your solution with label encoding and log scaling fare compared to the lab's one-hot encoding approach?

Note: if you used log scaling, you'll need to apply the antilog (i.e. the inverse of the logarithm) to the predictions to get their actual values in USD. For `log10`, the antilog of a prediction `p` is `10 ** p`.
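The sketch below illustrates the conversion on synthetic data. The single feature `x` and the target `y_usd` are made up, and the model is scored on its own training rows purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic positive target that grows exponentially with a single feature.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=(300, 1))
y_usd = 1000 * 10 ** (0.5 * x[:, 0] + rng.normal(0, 0.05, size=300))

# Train on the log-scaled target, as in Step 5.
y_log = np.log10(y_usd)
model = LinearRegression().fit(x, y_log)

pred_log = model.predict(x)    # predictions are on the log10 scale
pred_usd = 10 ** pred_log      # antilog brings them back to USD

print("R^2 on the log scale:", model.score(x, y_log))
print("R^2 in USD:          ", r2_score(y_usd, pred_usd))
```

The two scores generally differ, so state which scale a reported score refers to when comparing against the main lab.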

## Step 8 - Model Validation (Optional)

Repeat the model validation routine for linear regression as done in the main lab notebook. How does your model fare compared to the lab's baseline model?

Data preprocessing and transformation make a noticeable impact on model performance. What can you do to further improve the model's accuracy? Possible changes: applying z-score or min/max normalization to the insurance charges column; creating a 'has_children' binary feature.
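As a starting point, the suggested changes could look roughly like this on the five preview rows from Step 2. The new column names `charges_z` and `charges_minmax` are assumptions; `has_children` follows the name suggested above.

```python
import pandas as pd

# Toy frame built from the five preview rows in Step 2 (not the full dataset).
sample = pd.DataFrame({
    "children": [0, 1, 3, 0, 0],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Binary feature: 1 if the person has at least one child, otherwise 0.
sample["has_children"] = (sample["children"] > 0).astype(int)

charges = sample["charges"]

# z-score normalization: subtract the mean, divide by the standard deviation.
sample["charges_z"] = (charges - charges.mean()) / charges.std(ddof=0)

# min/max normalization: rescale the values to the interval [0, 1].
sample["charges_minmax"] = (charges - charges.min()) / (charges.max() - charges.min())

print(sample)
```

In a real workflow the mean, standard deviation, minimum and maximum would be computed on the training split only and then reused for the test split. Both normalizations are linear rescalings, so it is worth checking whether they change the `.score` of a linear regression at all.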