# Supervised Learning — How to do a linear regression in Python

## What is linear regression?

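Linear regression models a numeric response variable as an intercept plus a weighted sum of the explanatory variables. The intercept and weights (coefficients) are chosen to minimize the sum of squared differences between the predicted and observed responses.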
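As a sketch of the model's form, with notation that is assumed here rather than taken from the case study: for $p$ explanatory variables $x_1, \dots, x_p$, the predicted response $\hat{y}$ is

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
$$

where $\beta_0$ is the intercept and $\beta_1, \dots, \beta_p$ are the coefficients estimated from the data.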

## When can linear regression be used?

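Linear regression can be used when the response variable (the one being predicted) is numeric and is expected to change roughly linearly with the explanatory variables.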

## Which packages can be used for performing linear regression?

- scikit-learn (used here)
- statsmodels
- PyCaret

## Case study: predicting brain weights

Here we'll explore a classic dataset (Gladstone 1905) to predict people's brain weights based on the volume of their head. ([Data source](https://users.stat.ufl.edu/~winner/data/brainhead.dat) and its [description](https://users.stat.ufl.edu/~winner/data/brainhead.txt).)

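We'll need **pandas** for importing and manipulating the data, **scikit-learn** for splitting the data and fitting the model, and **Plotly Express** for plotting the results.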

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import plotly.express as px

In [3]:
brainhead = pd.read_csv("brainhead.csv")
brainhead

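| Unnamed: 0 | gender | age_range | head_size_cm3 | brain_weight_g |
|---|---|---|---|---|
| 0 | male | 20-46 | 4512 | 1530 |
| 1 | male | 20-46 | 3738 | 1297 |
| 2 | male | 20-46 | 4261 | 1335 |
| 3 | male | 20-46 | 3777 | 1282 |
| 4 | male | 20-46 | 4177 | 1590 |
| ... | ... | ... | ... | ... |
| 232 | female | 46+ | 3214 | 1110 |
| 233 | female | 46+ | 3394 | 1215 |
| 234 | female | 46+ | 3233 | 1104 |
| 235 | female | 46+ | 3352 | 1170 |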


## Data dictionary

Each row in the dataset corresponds to one adult human.

- **gender**: Gender of the person. Either **male** or **female**.
- **age_range**: Age range of the person. Either **20-46** or **46+**.
- **head_size_cm3**: Volume of the person's head, in cm³.
- **brain_weight_g**: Mass of the person's brain, in grams.

## Converting categorical columns to dummy variables

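Scikit-learn's `LinearRegression` needs numeric inputs, so categorical columns must first be converted to dummy columns of ones and zeroes. The pandas function [`get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) can be used for this. Passing `drop_first=True` drops one level of each categorical column, which avoids perfectly redundant dummy columns and makes the coefficients easier to interpret; `dtype=int` stores the dummies as ones and zeroes rather than booleans.

In [None]:
brainhead_dum = pd.get_dummies(brainhead, drop_first=True, dtype=int)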
brainhead_dum
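To make the conversion concrete, here is a minimal sketch on a made-up three-row table (the `toy` data frame and its values are hypothetical, not taken from the brain weight dataset). It compares the default behavior of `get_dummies()` with the `drop_first=True` option, which keeps one fewer dummy column per categorical variable.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "head_size_cm3": [4000, 3500, 3600],
})

# Default: one indicator column per category level; numeric columns pass through
toy_dum = pd.get_dummies(toy, dtype=int)
print(toy_dum.columns.tolist())

# drop_first=True drops the first level ("female"), leaving a single indicator
toy_dum_dropped = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_dum_dropped.columns.tolist())
```

With `drop_first=True`, the dropped level becomes the baseline: a row with `gender_male` equal to 0 is a female.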

## Splitting into response and explanatory columns

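The dataset needs to be split into the response variable (`brain_weight_g`) and the explanatory variables (the other columns). The `Unnamed: 0` column shown in the data preview is just a row counter rather than a measurement, so it is dropped as well.

In [None]:
response = brainhead_dum["brain_weight_g"]
# Drop the response and the leftover row-counter column from the explanatory variables
explanatory = brainhead_dum.drop(columns=["brain_weight_g", "Unnamed: 0"])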

## Splitting into training and testing sets

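The model is fitted on one part of the data (the training set) and then evaluated on rows it has not seen (the testing set). This gives a more honest picture of how well the model predicts new data than checking it against the rows it was fitted on.

In [None]:
# random_state is an arbitrary seed that makes the random split reproducible
explanatory_train, explanatory_test, response_train, response_test = train_test_split(explanatory, response, random_state=0)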
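As a rough illustration of what `train_test_split()` returns, the sketch below splits a made-up 20-row array (the names `X` and `y` and the seed are assumptions for this example). With the default settings, a quarter of the rows are held out for testing, and each explanatory row stays paired with its response.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 explanatory columns, row i is [2i, 2i + 1]
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Default test_size is 0.25; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
```

A different proportion can be requested with the `test_size` argument, for example `test_size=0.2`.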

## Fitting the model to the training set

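Create a `LinearRegression` object, then call its `fit()` method with the training explanatory variables and the training response. Fitting estimates the intercept and one coefficient per explanatory column by ordinary least squares.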

In [None]:
mdl = LinearRegression()

In [None]:
mdl.fit(explanatory_train, response_train)

## Making predictions on the testing set

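Call the model's `predict()` method on the testing explanatory variables to get predicted brain weights, then compare them with the actual values. In the scatter plot, both axes use the same scale, so points where the prediction equals the actual value lie on the 45-degree diagonal; the further a point is from that diagonal, the worse the prediction.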

In [None]:
response_predicted = mdl.predict(explanatory_test)

In [None]:
# Reuse the predictions computed above rather than calling predict() again
responses = pd.DataFrame({
    "actual": response_test,
    "predicted": response_predicted
})

In [None]:
fig = px.scatter(responses, x="actual", y="predicted", width=800, height=800)
fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
)
fig
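The plot gives a visual check; a numeric summary can complement it. The sketch below is one possible approach, using made-up `actual` and `predicted` values rather than results from the brain weight model, and it assumes that mean absolute error and R² are the metrics of interest.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted values, for illustration only
actual = np.array([1200.0, 1300.0, 1400.0, 1500.0])
predicted = np.array([1250.0, 1280.0, 1390.0, 1520.0])

# Average size of the prediction errors, in the same units as the response
mae = mean_absolute_error(actual, predicted)

# 1 means perfect predictions; 0 means no better than predicting the mean
r2 = r2_score(actual, predicted)

print(mae, r2)
```

For the case study, the same functions could be called with `response_test` and `response_predicted`.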

## Understanding the model fit

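The fitted model stores its estimates as attributes. `intercept_` is the predicted response when every explanatory variable is zero, `coef_` holds one coefficient per explanatory column, and `feature_names_in_` gives the column names in the same order as `coef_`.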

In [None]:
mdl.intercept_

In [None]:
mdl.coef_

In [None]:
mdl.feature_names_in_