# Linear Regression



We'll start with the usual imports, all of which we have seen before.

In [None]:
#Dataframe and array manipulation
import pandas as pd
import numpy as np

#For visualization
import plotly
import plotly.express as px

## Importing the Income Data

To start, let's import the income data that we were looking at earlier. For simplicity, let's look *only* at the data ranging from \$15k to \$70k.

We'll begin by loading it into a dataframe so that we can visualize it.

In [None]:
# Importing the data into a pd dataframe
URL = "https://raw.githubusercontent.com/dt3zjy/node/master/week-8/workshop/lin-reg/income.csv"
data = pd.read_csv(URL)
data.head()


Let's go ahead and drop the unnecessary column and multiply the income by 10,000 so that it is expressed in dollars. (Throwback to data cleaning/manipulation.)
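Here is a minimal sketch of that cleanup on a small made-up dataframe, not the real income file. The column name `Unnamed: 0` for the leftover index column is an assumption, as are the toy values.

```python
import pandas as pd

# Made-up rows standing in for the income data; the real file is not loaded here.
# "Unnamed: 0" is an assumed name for the leftover index column.
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "income": [1.5, 4.2, 7.0],       # income in units of 10,000
    "happiness": [2.1, 4.0, 6.3],
})

# Drop the leftover index column and rescale income.
toy = toy.drop("Unnamed: 0", axis=1)
toy["income"] = toy["income"] * 10000

print(toy.shape)
print(toy.head())
```

The same two steps should carry over to the real dataframe once the actual name of the extra column is known.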

In [None]:
# Dropping the first column


That's much better.

In [None]:
# Finding the shape

## Let's visualize the data

In [None]:
# Simple scatter plot
px.scatter(data, x='income', y='happiness',
    labels = {"income" : "Income (in dollars)",
              'happiness' : 'Happiness Score (0 to 10)'
              }
)

Linear regression works well when the variables are linearly correlated with each other. Since both of the columns are already in numerical form, we don't have to do much in terms of cleaning/modifying the data.

Let's get it ready for the model now.

## Implementing Linear Regression

In [None]:
# Split into X and y and do train_test_split (this should be familiar)
from sklearn.model_selection import train_test_split


In [None]:
# Fit a LinearRegression to the data


In [None]:
# Predict on the testing data and compare it with the actual data

print('Look at first 5 predictions:')
print('Predicted: ',predicted[:5].round(2))
print('Actual:    ',actual[:5].round(2))
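The split, fit, and predict steps can be sketched on synthetic data as follows. The data below is a made-up noisy line, not the income dataset, and the `test_size` and `random_state` values are just illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the income data: a noisy straight line.
rng = np.random.default_rng(0)
income = rng.uniform(15000, 70000, size=200)
happiness = 0.0001 * income + rng.normal(0, 0.5, size=200)

# scikit-learn expects a 2-D feature matrix, hence the reshape.
X = income.reshape(-1, 1)
y = happiness

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only.
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict on the held-out rows and compare with the true values.
predicted = reg.predict(X_test)
actual = np.array(y_test)

print("Predicted:", predicted[:5].round(2))
print("Actual:   ", actual[:5].round(2))
```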

As the predicted and actual values show, none of the predictions are exactly correct. That's expected: it's unrealistic for a model to predict a continuous value exactly. Let's check out what the model looks like on a scatter plot. With a single feature, linear regression fits a straight line:

$$ y = \beta_{1} x + \beta_{0} $$

Here $\beta_{1}$ is the slope and $\beta_{0}$ is the y-intercept. The fitted model exposes both values.

In [None]:
# Get the coefficients and y-intercept

print("beta_1 = ", coef)
print("beta_0 = ", intercept)

# Find the first and second point
# The first point will just be (0, intercept)

print("Point 1: [", x_0, ",", y_0, "]")
print("Point 2: [", x_1, ",", y_1, "]")
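As a sketch of how the coefficients and the two endpoints can be obtained, the example below fits a model to noise-free toy data on the line $y = 2x + 1$, so the recovered slope and intercept are easy to check. The toy data and the name `model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data on the line y = 2x + 1, so the fit is easy to check.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

model = LinearRegression().fit(x.reshape(-1, 1), y)

# coef_ holds one slope per feature; intercept_ is a scalar for a 1-D target.
coef = model.coef_
intercept = model.intercept_

# Two points are enough to draw the fitted line.
x_0, y_0 = 0.0, intercept
x_1 = x.max()
y_1 = coef[0] * x_1 + intercept

print("beta_1 =", coef[0])
print("beta_0 =", intercept)
print("Point 1: [", x_0, ",", y_0, "]")
print("Point 2: [", x_1, ",", y_1, "]")
```

Note that `coef_` is an array even with a single feature, which is why the slope is read as `coef[0]`.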

In [None]:
# Graphs the data and the line on plotly
fig = px.scatter(data, x="income", y="happiness")
fig.add_shape(type='line', xref="x", yref="y",
    x0 = x_0, y0 = y_0, x1 = x_1, y1 = y_1,
    line = dict(
        color = "red",
        width = 4,
    )   
)
fig.show()

## Metrics

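Accuracy, as used for classification models, doesn't apply to a regression model, because the predictions are continuous values rather than class labels. A common way to measure how good a regression model is instead is the **Mean Squared Error** (MSE), the average squared difference between each actual value $y_{a,i}$ and its predicted value $y_{p,i}$:

$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}{\Big(y_{a,i} - y_{p,i}\Big)^2} $$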

In [None]:
# Get the mean squared error of the linear regression model

print(mse.round(4))
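To make the formula concrete, here is a small sketch that computes the MSE by hand with NumPy and compares it with scikit-learn's `mean_squared_error`. The four actual/predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values.
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Mean of the squared differences, written out from the formula.
mse_manual = np.mean((actual - predicted) ** 2)

# The same quantity from scikit-learn.
mse_sklearn = mean_squared_error(y_true=actual, y_pred=predicted)

print(mse_manual, mse_sklearn)
```

Because the errors are squared, a single large miss (here the 1.5 error on the third value) contributes far more than several small ones.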

# Trying it out on complex data

Let's check out a different dataset. This one records the medical charges patients received for a visit to the hospital.

In [None]:
# Importing data into a dataframe
URL = "https://raw.githubusercontent.com/dt3zjy/node/master/week-8/workshop/lin-reg/med_charges.csv"
med = pd.read_csv(URL)

# Clean the dataset
med = med.drop("Unnamed: 0", axis=1)
med.head()

Let's try to predict the medical charge someone would have, given certain characteristics (age, bmi, etc.).

In [None]:
# Correlation matrix
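One possible way to build a correlation matrix is `DataFrame.corr`. The sketch below uses a made-up dataframe rather than the medical data, and assumes pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Made-up rows; the column names mirror the ones mentioned in this workshop.
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "bmi": [27.9, 22.0, 30.1, 28.5, 31.2, 25.8],
    "charges": [1700.0, 2700.0, 4400.0, 6100.0, 9800.0, 12500.0],
    "sex": ["female", "male", "male", "female", "male", "female"],
})

# numeric_only=True skips text columns such as "sex".
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Values close to 1 or -1 in the `charges` row point to features that are likely to be useful in a linear model.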


In [None]:
# Let's look at age vs. charges


It looks like there is a clear separation between smokers and non-smokers. It would be a good idea to split the dataset on smoking status and fit a separate model to each group, so that each model is more accurate for its own group.

In [None]:
# Get only the non-smokers

# Go ahead and drop the smoker column (redundancy)


There's some categorical data in there. Let's change it to numerical with the `pd.get_dummies` function.

In [None]:
# Change the categorical data to numerical
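As a rough sketch of the filtering and encoding steps, the example below works on a tiny made-up dataframe. The `"yes"`/`"no"` labels in the `smoker` column are an assumption about how the real data is coded, and the names `toy`, `non_smoker`, and `num_ns` are illustrative.

```python
import pandas as pd

# Made-up rows; the "yes"/"no" smoker labels are an assumption about the data.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "charges": [16800.0, 4400.0, 7300.0, 9100.0],
})

# Keep only the non-smokers, then drop the now-constant smoker column.
non_smoker = toy[toy["smoker"] == "no"].drop("smoker", axis=1)

# One indicator column per category; numeric columns pass through unchanged.
num_ns = pd.get_dummies(non_smoker)
print(num_ns)
```

After encoding, the single `sex` column is replaced by `sex_female` and `sex_male` indicator columns, which a linear model can use directly.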


Now, let's go ahead and put this into a model, starting with only one variable: age. The code here should look quite familiar.

In [None]:
# Divide into X and y


# Split the data

# Create and train the model



In [None]:
# Generate Predictions

# Get MSE

# Get RMSE
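The RMSE is simply the square root of the MSE, which puts the error back into the units of the target (here, the units of the charges). A minimal sketch on made-up values follows. It takes the square root with NumPy rather than passing `squared=False`, since that argument is deprecated in recent scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted charges.
actual = np.array([1000.0, 2000.0, 3000.0])
predicted = np.array([1100.0, 1800.0, 3200.0])

mse = mean_squared_error(y_true=actual, y_pred=predicted)

# Taking the square root puts the error back in the units of the target.
rmse = np.sqrt(mse)

print("MSE: ", mse)
print("RMSE:", rmse)
```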


In [None]:
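# Get only the smokers
# NOTE: assumes the smoker column is coded as "yes"/"no"; adjust if the data differs
smoker = med[med["smoker"] == "yes"]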


# Go ahead and drop the smoker column (redundancy)
smoker = smoker.drop("smoker", axis=1)
smoker.head()

# Change the categorical data to numerical
num_s = pd.get_dummies(smoker)
num_s.head()

# Divide into X and y
X = num_s[['age']]
y = num_s['charges']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Create and train the model
from sklearn.linear_model import LinearRegression
reg_s = LinearRegression()
reg_s.fit(X_train, y_train)

# Generate Predictions
predicted = reg_s.predict(X_test)
actual = np.array(y_test)

# Get MSE
from sklearn.metrics import mean_squared_error
mse_s_age = mean_squared_error(y_pred=predicted, y_true=actual)

# Get RMSE (square root of the MSE; squared=False is deprecated in recent scikit-learn)
rmse_s_age = np.sqrt(mse_s_age)


## Let's visualize the trends

In [None]:
# Getting coefficients and intercept of the non smokers


# Creating the line


# Getting coefficients and intercept of the smokers
coef_s = reg_s.coef_
intercept_s = reg_s.intercept_

# Creating the line
x_0_s = 18
y_0_s = coef_s[0]*x_0_s + intercept_s
x_1_s = 64
y_1_s = coef_s[0]*x_1_s + intercept_s

# Graphs the data and the line on plotly
fig = px.scatter(med, x="age", y="charges", color = 'smoker')
fig.add_shape(type='line', xref="x", yref="y",
    x0 = x_0_ns, y0 = y_0_ns, x1 = x_1_ns, y1 = y_1_ns,
    line = dict(
        color = "purple",
        width = 4,
    )   
)
fig.add_shape(type='line', xref="x", yref="y",
    x0 = x_0_s, y0 = y_0_s, x1 = x_1_s, y1 = y_1_s,
    line = dict(
        color = "forestgreen",
        width = 4,
    )   
)
fig.show()

In [None]:
# Metrics!
print("MSE S:",mse_s_age)
print("RMSE S:",rmse_s_age)
print("MSE NS:",mse_ns_age)
print("RMSE NS:",rmse_ns_age)

# Using Multiple Features for Linear Regression

What if we wanted to look at ALL of the different columns within the dataset (age, bmi, children, sex)?

We just add coefficients!

$$ \mathrm{charges} = \beta_{1} \cdot \mathrm{age} + \beta_{2} \cdot \mathrm{bmi} + \beta_{3} \cdot \mathrm{children} + \beta_{4} \cdot \mathrm{sex\_male} + \beta_{5} \cdot \mathrm{sex\_female} + \beta_{0} $$

In [None]:
# Divide into X and y


# Split the data

# Create and train the model


In [None]:
# Generate Predictions


# Get MSE for both
print("MSE for model with just age:", mse_ns_age)
print("MSE for model with all features:", mse_mult)

# Get RMSE for both
print("RMSE for model with just age:", rmse_ns_age)
print("RMSE for model with all features:", rmse_mult)
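The comparison between a single-feature and a multi-feature model can be sketched on synthetic data like this. Everything below is made up: the data, the coefficients used to generate the charges, and the helper `fit_and_score` are illustrative, not taken from the medical dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the non-smoker data; the coefficients are invented.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 5, size=n),
    "children": rng.integers(0, 4, size=n),
    "sex": rng.choice(["female", "male"], size=n),
})
toy["charges"] = (250 * toy["age"] + 100 * toy["bmi"]
                  + 1500 * toy["children"] + rng.normal(0, 800, size=n))

# dtype=float keeps the indicator columns numeric.
num = pd.get_dummies(toy, dtype=float)
y = num["charges"]

def fit_and_score(features):
    """Fit on the given columns and return the test-set MSE (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        num[features], y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_true=y_test, y_pred=model.predict(X_test))

mse_age = fit_and_score(["age"])
mse_all = fit_and_score(["age", "bmi", "children", "sex_female", "sex_male"])

print("MSE with just age:    ", round(mse_age, 1))
print("MSE with all features:", round(mse_all, 1))
```

In this synthetic setup the charges really do depend on `bmi` and `children`, so the model that sees those columns should score a lower MSE. On real data the gain depends on how informative the extra features are.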

It's going to be a little hard to graph this... Why do you think that is?