# Logistic Regression Introduction

In the lesson slides for this learning unit, we looked at:

* The form and function of the logistic regression model
* Use of binary logistic regression to classify a single variable
* Evaluating a trained model with the accuracy score
* Interpreting model predictions via probability scores

In this practical we will use `sklearn` to demonstrate the above points. We will stick to the simple univariate case, using a single input feature to predict a binary output class.
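As a reminder of the model's form, here is a minimal sketch of the logistic (sigmoid) function applied to a single feature. The weight `w` and intercept `b` are made-up values chosen for illustration, not anything fitted to the practical's data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and intercept, for illustration only
w, b = 4.0, -2.0

x = np.array([0.0, 0.5, 1.0])

# Probability of class 1 for each value of x
p_class_1 = sigmoid(w * x + b)

# Classify as 1 when the probability reaches 0.5
predicted_class = (p_class_1 >= 0.5).astype(int)

print(p_class_1)
print(predicted_class)
```

With these values the probability crosses 0.5 at `x = 0.5`, which is where `w * x + b` equals zero. Training a model amounts to finding the `w` and `b` that best fit the data.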

# The data

`data.csv` contains 100 observations of some mystery process. Each observation belongs to one of two possible classes, 0 or 1.

There are six features recorded for each observation. These have been normalised so that each column ranges between 0 and 1.
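The practical does not say how the normalisation was done. One common way to get every column into the 0 to 1 range is min-max scaling, sketched below on a small made-up dataframe (the column names `a` and `b` are hypothetical).

```python
import pandas as pd

# Made-up raw values, for illustration only
raw = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 15.0, 30.0]})

# Min-max scaling: subtract each column's minimum, divide by its range
scaled = (raw - raw.min()) / (raw.max() - raw.min())

print(scaled)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1.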

In [None]:
import pandas as pd

data = pd.read_csv('data/data.csv')

data.head()

# Visualising the data

The plot below is a `seaborn.displot`. Each subplot shows the distribution of one feature's values, split by class.

In [None]:
import seaborn as sns

data_long = data.melt(value_vars=['f1', 'f2', 'f3', 'f4', 'f5', 'f6'], id_vars='class', var_name='feature', value_name='value')

sns.displot(data=data_long, hue='class', x='value', kind='kde', col='feature', col_wrap=3, fill=True);

# Logistic regression: model assumptions

Compared to linear regression, the logistic model has few restrictions/assumptions. However, for the univariate classification task it helps if the predictive feature is **linearly separable**, or close to it: a single threshold on that feature's value should distinguish the two classes well.

Looking at the six features above, which would you choose to predict the `class` variable? What do you think are the strengths/weaknesses of each feature?

In [None]:
# Your code here


# Training a logistic regression model: toy example

This is easy to do with `sklearn`. The steps are:

1. Instantiate a model from `sklearn.linear_model.LogisticRegression`
2. Use the `.fit()` method to train the model
3. Access model accuracy score through `.score()`
4. Use the `.predict()` method of the trained model to get new predictions
5. Use the `.predict_proba()` method of the trained model to get prediction probabilities

An example is given below.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.random.random(50)
# sklearn expects a 2D array of shape (n_samples, n_features), so a single
# feature must be reshaped from the form [1,2,3,4] to [[1],[2],[3],[4]]
x = x.reshape(-1, 1)

y = [0] * 25 + [1] * 25

model = LogisticRegression()

model.fit(x, y)

print(f"Weight for single feature: {model.coef_}")
print(f"Accuracy score for model: {model.score(x, y):.5f}")

unseen_x = [[0.2], [0.11], [0.8], [0.45]]

print("Predictions:", [i for i in model.predict(unseen_x)])
print("Prob of class 1:", [i[1] for i in model.predict_proba(unseen_x)])

The model was trained on random data and unsurprisingly does not perform very well!

Note how all the probabilities are around 0.5, no matter what the input.

Also, the accuracy of the model is close to 50%. It is basically at chance, because the input data carries no information about the two classes we want to predict. (The exact figure will vary between runs, since `x` is random.)
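For contrast, here is a minimal sketch comparing the random feature with a synthetic feature that does carry information about the class. The informative feature is an assumption invented for this illustration (class 0 drawn from low values, class 1 from high values), and a fixed seed is used so the result is repeatable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
y = np.array([0] * 25 + [1] * 25)

# Uninformative feature: same distribution for both classes
x_random = rng.random(50).reshape(-1, 1)

# Informative feature (synthetic assumption): class 0 low, class 1 high
x_informative = np.concatenate([
    rng.uniform(0.0, 0.4, 25),
    rng.uniform(0.6, 1.0, 25),
]).reshape(-1, 1)

random_model = LogisticRegression().fit(x_random, y)
informative_model = LogisticRegression().fit(x_informative, y)

random_score = random_model.score(x_random, y)
informative_score = informative_model.score(x_informative, y)
print(f"Accuracy, random feature:      {random_score:.2f}")
print(f"Accuracy, informative feature: {informative_score:.2f}")

# Probability of class 1 at a low and a high input value
low_high = informative_model.predict_proba([[0.1], [0.9]])[:, 1]
print("Prob of class 1 at x=0.1 and x=0.9:", low_high)
```

When the feature separates the classes, the accuracy rises well above chance and the predicted probabilities move away from 0.5 towards 0 or 1.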

# Training a logistic regression model: more realistic

Now, train six individual logistic models, one for each feature in `['f1', 'f2', 'f3', 'f4', 'f5', 'f6']` from `data`. Remember you will need to convert the dataframe column (which is a `pandas.Series` object) to a numpy array and reshape it, because it only contains a single feature.
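As a hint, the conversion step might look like the sketch below. It uses a tiny made-up dataframe (`toy`, with a hypothetical column called `feature`) rather than the practical's data.

```python
import pandas as pd

# Tiny made-up dataframe, for illustration only
toy = pd.DataFrame({"feature": [0.1, 0.4, 0.9], "class": [0, 0, 1]})

series = toy["feature"]             # a pandas.Series, shape (3,)
X = series.to_numpy().reshape(-1, 1)  # a 2D numpy array, shape (3, 1)

print(series.shape, X.shape)

# Selecting with a list of column names keeps the data two-dimensional,
# which sklearn also accepts
X_alt = toy[["feature"]]
print(X_alt.shape)
```

Either form gives `sklearn` the `(n_samples, n_features)` shape it expects.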

In [None]:
# Your code here


# Examining logistic regression models

For each model, get the probability of class 1 for a range of values of `x` provided in `new_x`.

Append each of these results to the list `probs`.

(Note: `.predict_proba()` returns an array with one row per input in `new_x`, and each row holds the probabilities for both classes. You want the value at position `[1]` in each row!)
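To make the shape of that output concrete, here is a small sketch on made-up data (the arrays `X`, `y` and `grid` are hypothetical and not part of the exercise).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: low values are class 0, high values are class 1
X = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

grid = np.linspace(0, 1, 5).reshape(-1, 1)
proba = model.predict_proba(grid)

print(model.classes_)  # column order of proba follows this
print(proba.shape)     # one row per input, one column per class

# Column 1 holds the probability of class 1 for every input
p1 = proba[:, 1]
print(p1)
```

Each row of `proba` sums to 1, so the column at index 1 is all you need to describe the model's belief in class 1.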

In [None]:
new_x = [[i] for i in np.linspace(0, 1, 100)]

probs = []

# Your code here


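The code below plots each model's predicted probability of class 1 across the values in `new_x`. The green line represents the learned model. Note that the x-axis shows the position in `new_x` (0 to 99) rather than the raw feature value.

Also shown is the distribution of the classes for that feature, with each observation plotted as a point at its class label.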

What do you observe?

In [None]:
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(12, 8)})

df = pd.DataFrame(probs).T
df.columns = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']

fig, axes = plt.subplots(2,3, sharex=False)

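for a, f in enumerate(['f1', 'f2', 'f3', 'f4', 'f5', 'f6']):

    # The line is plotted against the row index of df (0 to 99),
    # so scale the 0-1 feature values onto the same axis
    x = data[f] * (len(df) - 1)
    y = data['class']

    sns.lineplot(data=df[f], lw=4, ax=axes.flatten()[a], color='g')
    sns.scatterplot(x=x, y=y, ax=axes.flatten()[a], hue=y, palette=['b', 'r'])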

# Your code here


# Summary

In this practical, we explored logistic regression and its implementation in `sklearn`: how to fit a model from data, get predictions, and access prediction probabilities. We also examined how the relationship between the input features and the classes we want to predict affects the model and its performance.