# Supervised Learning — How to do a linear regression in Python

## When can linear regression be used?

- When the response variable (the one being predicted) is numeric and continuous.
- When the observations are independent.

## Which packages can be used for performing linear regression?

- scikit-learn (used here)
- statsmodels
- PyCaret, TensorFlow, Keras, PyTorch

## Case study: predicting brain weights

Here we'll explore a classic dataset (Gladstone 1905) to predict people's brain weights based on the volume of their head. ([Data source](https://users.stat.ufl.edu/~winner/data/brainhead.dat) and its [description](https://users.stat.ufl.edu/~winner/data/brainhead.txt).)

We'll need **pandas** for importing and manipulating the data. **scikit-learn** is used for modeling, and **plotly.express** for plotting.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import plotly.express as px

The dataset is imported from a CSV file.

In [None]:
brainhead = pd.read_csv("brainhead.csv")
brainhead

## Data dictionary

Each row in the dataset corresponds to one adult human.

- **gender**: Gender of the person. Either **male** or **female**.
- **age_range**: Age range of the person. Either **20-46** or **46+**.
- **head_size_cm3**: Volume of the person's head, in cm^3.
- **brain_weight_g**: Mass of the person's brain, in grams.

## Converting categorical columns to dummy variables

Scikit-learn can't deal with categorical columns directly. They must be converted to dummy columns of ones and zeroes. The pandas function [`get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) can be used for this.

In [None]:
brainhead_dum = pd.get_dummies(brainhead)
brainhead_dum
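
As a small illustration of what `get_dummies()` does, the sketch below uses a hypothetical three-row dataframe (the values are made up, not taken from the brain dataset). It also shows the `drop_first=True` option, which drops one level per categorical column. This notebook doesn't use that option, but it is a common way to avoid redundant columns.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "head_size_cm3": [4000, 3500, 3700],
})

# Each category level becomes its own indicator column
toy_dum = pd.get_dummies(toy)
print(list(toy_dum.columns))

# drop_first=True drops the first level of each categorical column
toy_dum_dropped = pd.get_dummies(toy, drop_first=True)
print(list(toy_dum_dropped.columns))
```

Numeric columns such as `head_size_cm3` pass through unchanged; only the text columns are expanded.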

## Splitting into response and explanatory columns

The dataset needs to be split into the response variable and the explanatory variables (all the other columns).

In [None]:
response = brainhead_dum["brain_weight_g"]
explanatory = brainhead_dum.drop(columns="brain_weight_g")

## Splitting into training and testing sets

The explanatory and response datasets need to be split into training and testing sets. 

Here we'll use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with the default arguments.

In [None]:
explanatory_train, explanatory_test, response_train, response_test = train_test_split(explanatory, response)
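
With the default arguments, `train_test_split()` shuffles the rows and holds out 25% of them for testing, so the split changes on every run. Here is a minimal sketch on made-up data, showing the default split sizes and how `random_state` makes the split reproducible (the seed value 42 is an arbitrary choice).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 explanatory columns
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Default arguments: 75% training, 25% testing, rows shuffled
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)

# Fixing random_state gives the same split every time
_, _, _, y_test_a = train_test_split(X, y, random_state=42)
_, _, _, y_test_b = train_test_split(X, y, random_state=42)
print((y_test_a == y_test_b).all())
```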

## Fitting the model to the training set

The data is now ready to model. The first modeling step is to create a [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object.

In [None]:
mdl = LinearRegression()

Use the [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) method to fit the model to the training set.

In [None]:
mdl.fit(explanatory_train, response_train)

## Making predictions on the testing set

Now use the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) method to make predictions on the testing set.

It can be helpful to combine the actual and predicted responses in a dataframe.

In [None]:
responses = pd.DataFrame({
    "actual": response_test,
    "predicted": mdl.predict(explanatory_test)
})

One way to visualize the results is to draw a scatter plot of predicted responses versus actual responses.

The plot is easier to read when the x and y axes use the same scale, so that perfect predictions lie on a 45-degree line.

In [None]:
fig = px.scatter(responses, x="actual", y="predicted", width=800, height=800)
fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
)
fig

## Understanding the model predictions

Model predictions are calculated as an intercept, plus each coefficient multiplied by the value of its explanatory variable.
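
Written as a formula, using generic symbols that are not part of the original notebook ($b_0$ for the intercept, $b_j$ for the coefficient of the $j$-th explanatory variable $x_j$, and $p$ for the number of explanatory variables), the prediction $\hat{y}$ is:

$$
\hat{y} = b_0 + \sum_{j=1}^{p} b_j x_j
$$

For a dummy variable, $x_j$ is either 0 or 1, so its coefficient is either left out or added as-is.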

You can see the intercept with the `.intercept_` attribute.

In [None]:
mdl.intercept_

The coefficients and their names are found with the `.coef_` and `.feature_names_in_` attributes.

In [None]:
mdl.coef_

In [None]:
mdl.feature_names_in_

To see how the prediction calculations work, let's consider an example of a female aged 46+ with a head size of 4000 cm^3.

In [None]:
gender = "female"
age_range = "46+"
head_size_cm3 = 4000
# Look up each coefficient by feature name so the result is a single number, not an array
coefs = dict(zip(mdl.feature_names_in_, mdl.coef_))
prediction = (
    mdl.intercept_
    + coefs["gender_female"]
    + coefs["age_range_46+"]
    + head_size_cm3 * coefs["head_size_cm3"]
)
f"A {gender} aged {age_range} with a head size of {head_size_cm3} cm^3 would have a predicted brain weight of {prediction:.0f} g."
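
A quick sanity check is that a manual calculation like this agrees with `.predict()`. The sketch below does that on a tiny made-up dataset (the values are invented, and the fitted numbers are not those of the brain model). The dummy column names mirror the style of the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data, for illustration only
demo = pd.DataFrame({
    "head_size_cm3": [3400, 3600, 3800, 4000, 4200, 4400],
    "gender_female": [1, 0, 1, 0, 1, 0],
    "gender_male": [0, 1, 0, 1, 0, 1],
    "brain_weight_g": [1200, 1260, 1290, 1350, 1380, 1440],
})
X = demo.drop(columns="brain_weight_g")
y = demo["brain_weight_g"]
mdl_demo = LinearRegression().fit(X, y)

# One new observation: a female with a head size of 4000 cm^3
new_row = pd.DataFrame(
    {"head_size_cm3": [4000], "gender_female": [1], "gender_male": [0]}
)

# Manual calculation: intercept plus coefficient times value, summed over columns
manual = mdl_demo.intercept_ + (mdl_demo.coef_ * new_row.iloc[0].to_numpy()).sum()
from_predict = mdl_demo.predict(new_row)[0]
print(manual, from_predict)
```

The new row must have the same columns, in the same order, as the data the model was fitted on.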

## Checking model fit

To assess whether the model is a good fit, it is useful to analyze the residuals, which are the actual responses minus the predicted responses.

In [None]:
responses["residual"] = responses["actual"] - responses["predicted"]
responses

Drawing a scatter plot of the residuals versus the predicted values can show whether or not there is a good fit. The residuals should be centered around zero across the whole range of predictions.

That is, if you fit a LOWESS trendline, the line should stay close to zero. (Plotly Express needs the statsmodels package to be installed in order to draw trendlines.)

In [None]:
px.scatter(responses, x="predicted", y="residual", trendline="lowess")
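
Alongside the residual plot, a numeric summary can help. This goes slightly beyond the original notebook: a minimal sketch using scikit-learn's `mean_squared_error()` and `r2_score()` on made-up actual and predicted values. With the brain data you would pass `responses["actual"]` and `responses["predicted"]` instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values standing in for responses["actual"] and responses["predicted"]
actual = np.array([1200.0, 1300.0, 1250.0, 1400.0])
predicted = np.array([1210.0, 1290.0, 1260.0, 1390.0])

# Root mean squared error, in the same units as the response (grams here)
rmse = np.sqrt(mean_squared_error(actual, predicted))

# R-squared: 1 minus the residual sum of squares over the total sum of squares
r2 = r2_score(actual, predicted)
print(rmse, r2)
```

A lower RMSE and an R-squared closer to 1 indicate predictions that sit closer to the actual values.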

## Want to learn more?

[This scikit-learn tutorial](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) covers linear regression.

These DataCamp courses cover linear regression in Python.

- [Machine Learning with scikit-learn](https://app.datacamp.com/learn/courses/machine-learning-with-scikit-learn) provides an introduction to modeling with scikit-learn, including linear regression.
- [Introduction to Regression with statsmodels in Python](https://app.datacamp.com/learn/courses/introduction-to-regression-with-statsmodels-in-python) and [Intermediate Regression with statsmodels in Python](https://app.datacamp.com/learn/courses/intermediate-regression-with-statsmodels-in-python) provide a deep dive into linear and logistic regression, using the statsmodels package.