<img src="data/images/div/lecture-notebook-header.png" />

# Classification & Regression: Logistic Regression

Logistic Regression is a statistical modeling technique used for binary classification tasks, where the goal is to predict the probability of an instance belonging to a certain class. Despite its name, logistic regression is a classification algorithm, not a regression algorithm.

In Logistic Regression, the algorithm models the relationship between the independent variables and the binary outcome using a logistic function, also known as the sigmoid function. The logistic function maps the linear combination of the independent variables to a value between 0 and 1, which represents the probability of belonging to the positive class. This mapping allows logistic regression to estimate the likelihood of an instance belonging to a class and make predictions accordingly.

Although Logistic Regression produces probabilities through a nonlinear function, it is considered a linear model due to its underlying mathematical formulation. The linearity in Logistic Regression refers to the relationship between the independent variables and the log-odds (also known as logit) of the positive class. The logistic function then transforms these log-odds into a probability, which introduces the nonlinearity needed to keep the output between 0 and 1.

The linear part of Logistic Regression comes from the fact that the log-odds of the positive class are expressed as a linear combination of the independent variables. The algorithm determines the coefficients (weights) associated with each independent variable, similar to linear regression. However, instead of predicting the actual continuous value, logistic regression predicts the probability of belonging to the positive class.
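
To make the link between the linear part and the probability concrete, the following minimal sketch applies the sigmoid to a linear score and checks that the logit recovers that score. The parameter values `theta_0` and `theta_1` are made up for illustration and are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Hypothetical parameters, chosen only for illustration
theta_0, theta_1 = -3.0, 1.5

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
z = theta_0 + theta_1 * x   # linear part: the log-odds
p = sigmoid(z)              # nonlinear part: the probability

print(p)
print(np.allclose(logit(p), z))  # the log-odds are linear in x
```

A score of 0 corresponds to a probability of exactly 0.5, which is why the sign of the linear score decides the predicted class.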

## Setting up the Notebook

### Specify How Plots Get Rendered

In [None]:
%matplotlib inline

### Import Required Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import f1_score, roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

import warnings
warnings.filterwarnings('ignore')

---

## Working with Toy Data (CSI Example)

As we did in the lecture, we adapt the simple CSI example we used for Linear Regression to a classification task. While the input is still the shoe print size of a person, the output is now a binary class label representing the sex of the person (woman: 0, man: 1).

In [None]:
data = np.array([
    (31.3, 1), (29.7, 1), (31.3, 0), (31.8, 0),
    (31.4, 1), (31.9, 1), (31.8, 1), (31.0, 1),
    (29.7, 0), (31.4, 1), (32.4, 1), (33.6, 1),
    (30.2, 0), (30.4, 0), (27.6, 0), (31.8, 1),
    (31.3, 1), (34.5, 1), (28.9, 0), (28.2, 0)
])

# Convert input and outputs to numpy arrays; makes some calculations easier
X = data[:,0].reshape(-1,1)
y = data[:,1]

We can still plot the data by using the class label as y coordinate in the scatter plot.

In [None]:
plt.figure()
plt.tick_params(labelsize=14)
plt.scatter(X, y)
plt.xlabel('Shoe print size (cm)', fontsize=16)
plt.ylabel('P(male)', fontsize=16)
plt.tight_layout()
plt.show()

We can see the expected trend that men generally have a larger shoe print size. But of course, there is no clear separation, as there are tall women and small men with corresponding shoe print sizes. That means there will never be a perfect classifier to predict the sex of a person merely based on the size of a shoe print.

### Apply Logistic Regression

scikit-learn provides [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) as an implementation of Logistic Regression. Similar to the Linear Regression implementation, the model considers $\theta_0$ (`intercept_`) and $\theta_{i\neq 0}$ (`coef_`) separately. It also features the parameter `fit_intercept`, which controls whether the intercept $\theta_0$ is calculated or not.

Below, as we use the original data without adding the constant term ourselves, we should set `fit_intercept=True`. Since this is the default value, we can simply omit it.

In [None]:
clf = LogisticRegression().fit(X, y)

print('Intercept: {}, Coefficients: {}'.format(clf.intercept_, clf.coef_))

We can visualize this result in two ways:

* directly plotting the probabilities (see orange line in the plot below)
* plotting the decision boundary as defined by values for $\theta$

In [None]:
# Specify series of shoe print size in the range of the input data
x_range = np.arange(27, 35, 0.1).reshape(-1, 1)

# Calculate the probability for all shoe print size
# The method predict_proba() does this for us
y_best = clf.predict_proba(x_range)[:,1]

# Calculate the linear score theta_0 + theta_1 * x (the log-odds);
# the decision boundary is the x value where this line crosses 0
decision_boundary = clf.intercept_ +  clf.coef_[0] * x_range

We can now plot the probability values and the decision boundary together with the data sample in one figure.

In [None]:
plt.figure()
plt.ylim(-0.05, 1.05)
plt.tick_params(labelsize=14)
plt.scatter(X, y, c='C0', s=100)
plt.plot(x_range, y_best, color='orange', lw=3)
plt.plot(x_range, decision_boundary, '--', color='black', lw=2)
plt.xlabel('Shoe print size (cm)', fontsize=16)
plt.ylabel('P(male)', fontsize=16)
plt.tight_layout()
plt.show()

As expected, the classification is not perfect.
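
The shoe print size at which the predicted class flips can also be computed directly from the parameters, since the linear score $\theta_0 + \theta_1 x$ is zero at that point. The following is a minimal, self-contained sketch that refits the model on the same toy data; the name `x_threshold` is introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as above: (shoe print size in cm, class label)
data = np.array([
    (31.3, 1), (29.7, 1), (31.3, 0), (31.8, 0),
    (31.4, 1), (31.9, 1), (31.8, 1), (31.0, 1),
    (29.7, 0), (31.4, 1), (32.4, 1), (33.6, 1),
    (30.2, 0), (30.4, 0), (27.6, 0), (31.8, 1),
    (31.3, 1), (34.5, 1), (28.9, 0), (28.2, 0)
])
X = data[:, 0].reshape(-1, 1)
y = data[:, 1]

clf = LogisticRegression().fit(X, y)

theta_0 = clf.intercept_[0]
theta_1 = clf.coef_[0, 0]

# The linear score theta_0 + theta_1 * x is zero at the threshold
x_threshold = -theta_0 / theta_1

# At the threshold, both classes should be equally likely
p_at_threshold = clf.predict_proba([[x_threshold]])[0, 1]

print('Threshold: {:.2f} cm, P(male) at threshold: {:.3f}'.format(x_threshold, p_at_threshold))
```

Shoe print sizes above this threshold are classified as Class 1, sizes below it as Class 0.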

### Predict Sex of Suspect

In our CSI example, the suspect's shoe print size we found was 32.2 cm. So we take this value as input for our model and look at the prediction. The method `predict()` directly returns the predicted class label (instead of the probabilities).

In [None]:
y_pred = clf.predict([[32.2]])

print('The predicted class label is: {}'.format(y_pred.squeeze()))

A class label of 1 means the suspect is predicted to be a man. This output can already be seen when looking at the plot above. Also recall that the predicted height for the suspect was 185.7 cm (see notebook for Linear Regression), which arguably makes it more likely that the suspect is a man.

Apart from directly getting the class label, we can also look at the estimated probabilities. This gives us an indication of how "sure" the classifier is about the returned label. Again, we use the method `predict_proba()` for that.

In [None]:
y_pred = clf.predict_proba([[32.2]])

print('The estimated probabilities are: {}'.format(y_pred.squeeze()))

In the case of our suspect, the difference between the two probabilities is quite large, so we can be reasonably confident that the suspect is indeed a man -- although there will never be a 100% guarantee. Of course, both probabilities add up to 1.
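
For the binary case, the values returned by `predict_proba()` can be reproduced by hand from the fitted parameters. The sketch below refits the model on the toy data so that it is self-contained; the variable names `manual` and `library` are chosen here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as above: (shoe print size in cm, class label)
data = np.array([
    (31.3, 1), (29.7, 1), (31.3, 0), (31.8, 0),
    (31.4, 1), (31.9, 1), (31.8, 1), (31.0, 1),
    (29.7, 0), (31.4, 1), (32.4, 1), (33.6, 1),
    (30.2, 0), (30.4, 0), (27.6, 0), (31.8, 1),
    (31.3, 1), (34.5, 1), (28.9, 0), (28.2, 0)
])
X = data[:, 0].reshape(-1, 1)
y = data[:, 1]

clf = LogisticRegression().fit(X, y)

x_suspect = 32.2

# Linear score (log-odds) for the suspect's shoe print size
z = clf.intercept_[0] + clf.coef_[0, 0] * x_suspect

# Sigmoid of the score gives P(class 1); the complement gives P(class 0)
p_man = 1.0 / (1.0 + np.exp(-z))
manual = np.array([1.0 - p_man, p_man])

library = clf.predict_proba([[x_suspect]])[0]

print('Manual:  {}'.format(manual))
print('Library: {}'.format(library))
```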

Let's assume the suspect's shoe print size had been 30.6 cm. We can estimate the probabilities for this value as well.

In [None]:
y_pred = clf.predict_proba([[30.6]])

print('The estimated probabilities are: {}'.format(y_pred.squeeze()))

As the probability for Class 1 is still higher than for Class 0, we still would predict the suspect to be a man. However, here the two probabilities are much closer, so we can say the level of confidence of the classifier is much lower. This kind of interpretation is pretty straightforward for binary classification but gets less obvious for multiple classes.

---

## Logistic Regression using Vessel Details Dataset

For a more practical example, let's see if we can predict the type of a vessel based on some of its features. The underlying assumption is that a vessel's width, length, and tonnage are good indicators of the vessel's type. This may not be obvious, but in the context of this notebook we will check how well this assumption holds.

### Prepare Training & Test Data

#### Load Dataset from File

As usual, we use `pandas` to load the `csv` file with the details about all vessels.

In [None]:
df = pd.read_csv('data/datasets/vessels/vessel-details.csv')

# Shuffle dataset (often a good practice)
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Show the first 5 rows
df.head()

#### Data Selection

To skip any more sophisticated data preprocessing steps, we consider only the convenient features -- that is, we consider only a subset of numerical features for our model. This particularly means that we do not have to consider any encoding strategies for categorical features. To keep it even simpler, we also remove all rows containing any missing value.

In [None]:
# Keep only the numerical attributes to keep it simple here + Type as our class label
df = df[['Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage', 'Type']]

# Remove all rows with any NaN values; again, just to keep it simple
df = df.dropna()

df.head()

Let's see how many class labels we have -- which is the number of unique labels of column `Type`.

In [None]:
print('#Classes: {}'.format(len(set(df.Type.tolist()))))

#### Convert Class Labels

Most classification algorithms assume that the class labels are integers in the range 0..C-1, where C is the number of classes. Using `pandas`, this conversion is easy to do. After the conversion, all rows with the same class label, say, "Oil Tanker", will have the same numerical (integer) class label in the range 0..C-1. For our dataset, the number of classes is `C=15`.

In [None]:
df['Type'] = pd.factorize(df['Type'])[0]

df.head()
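
As a small illustration of what `pd.factorize()` returns, the sketch below uses a handful of made-up vessel types (not taken from the dataset). Besides the integer codes, the function also returns the unique labels, which allows mapping predictions back to readable class names.

```python
import pandas as pd

# Hypothetical vessel types, used only for illustration
types = pd.Series(['Oil Tanker', 'Cargo', 'Oil Tanker', 'Tug', 'Cargo'])

codes, uniques = pd.factorize(types)

print(codes)    # integer labels, assigned in order of first appearance
print(uniques)  # original label for each integer code

# Map the integer codes back to the original labels
restored = uniques[codes]
print(list(restored))
```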

#### Generate Training & Test Data

As usual, we convert the dataframe into numpy arrays for further processing, including splitting the dataset into training and test data.

In [None]:
# Convert data to numpy arrays
X = df[['Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage']].to_numpy()
y = df[['Type']].to_numpy().squeeze()

# Split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Size of training set: {}".format(len(X_train)))
print("Size of test set: {}".format(len(X_test)))

#### Normalize Data via Standardization

Since we want to consider different polynomial degrees, it is strongly recommended – and almost required – to normalize/standardize the data. As the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) implementation also applies regularization by default, we do normalize the data via standardization.

In [None]:
# We fit the scaler based on the training data only
scaler = StandardScaler().fit(X_train)

# Of course, we need to convert both training and test data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
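
What the scaler does can be sketched in plain NumPy: the mean and standard deviation are estimated on the training data only and then reused for the test data. The small matrices below are made up for illustration and are unrelated to the vessel dataset.

```python
import numpy as np

# Hypothetical feature matrices (2 features), for illustration only
X_train_demo = np.array([[100.0, 10.0], [200.0, 30.0], [300.0, 20.0], [400.0, 40.0]])
X_test_demo = np.array([[250.0, 25.0], [500.0, 50.0]])

# Statistics are estimated on the training data only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Both sets are transformed with the training statistics
X_train_std = (X_train_demo - mean) / std
X_test_std = (X_test_demo - mean) / std

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

After the transformation, the training features have zero mean and unit variance, while the test features generally do not, since they are scaled with the training statistics.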

### Train and Evaluate Logistic Regression Classifier

We directly look into Polynomial Logistic Regression and try different maximum polynomial degrees $p$ (similar to the Linear Regression notebook). Recall from the lecture that the number of terms, given a polynomial degree of $p$ and a number of input features $d$, is

$$
\#terms = \binom{p+d}{p}
$$

Since our dataset has 4 input features (`Length`, `Width`, `Gross Tonnage`, and `Deadweight Tonnage`), this equation simplifies to

$$
\#terms = \binom{p+4}{p} = \binom{p+4}{4}
$$
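
The growth of the number of terms can be checked quickly with the general formula $\binom{p+d}{p}$. The sketch below assumes $d=4$, matching the four input features selected above (`Length`, `Width`, `Gross Tonnage`, `Deadweight Tonnage`); the helper name `n_terms` is introduced here for illustration.

```python
from math import comb

def n_terms(p, d):
    """Number of polynomial terms up to degree p for d input features (incl. the constant term)."""
    return comb(p + d, p)

d = 4  # number of input features used in this notebook
for p in range(1, 9):
    print('p={} => #terms: {}'.format(p, n_terms(p, d)))
```

These counts should match the `#terms` values printed by the training loop, which reports the number of columns of the transformed data matrix.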

Below we consider $p$ as our hyperparameter, i.e., we transform the dataset using different polynomial degrees, apply Logistic Regression, and check the f1 score for each setup. Note that we now have to set `fit_intercept=False` as [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) adds the constant term to the data matrix, even for $p=1$.

In [None]:
%%time

for p in range(1, 9):
    
    # Transform data w.r.t. degree of polynomial p
    poly = PolynomialFeatures(p)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    # Train Logistic Regression classifier on transformed data
    # fit_intercept=False since for p=1, transformation adds constant term to data
    poly_reg = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X_train_poly, y_train)

    # Predict values for training and test set
    y_train_pred = poly_reg.predict(X_train_poly)
    y_test_pred = poly_reg.predict(X_test_poly)
    
    # Calculate micro-averaged f1 scores
    f1_train = f1_score(y_train, y_train_pred, average='micro')
    f1_test = f1_score(y_test, y_test_pred, average='micro')
    
    
    print('Degree of polynomial: {} => f1 (train/test): {:.2f}/{:.2f} (#terms: {})'.format(p, f1_train, f1_test, X_train_poly.shape[1]))

The results show that setting `p=5` yields the highest average f1 score.
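
When reading these scores, it may help to keep in mind that for single-label multiclass problems the micro-averaged f1 score coincides with the accuracy. The following sketch illustrates this with a few made-up labels (not taken from the vessel dataset).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions for 3 classes
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_hat = [0, 2, 2, 2, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_hat)
f1_micro = f1_score(y_true, y_hat, average='micro')
f1_macro = f1_score(y_true, y_hat, average='macro')

print('Accuracy: {:.3f}'.format(acc))
print('f1 (micro): {:.3f}'.format(f1_micro))
print('f1 (macro): {:.3f}'.format(f1_macro))
```

If per-class performance matters, for example with imbalanced vessel types, `average='macro'` would weight all classes equally instead.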

Lastly, we can also perform a more proper evaluation using k-fold cross-validation to find the best value of $p$. To simplify things, we use scikit-learn's function `cross_val_score()` to perform the cross-validation for us.

In [None]:
%%time

# Initialize the best f1-score and respective p value
best_p, best_f1 = None, 0.0

# Loop over a range of values for the polynomial degree p
for p in range(1, 9):
    
    # Transform training data w.r.t. degree of polynomial p
    # (the test data is not needed for cross-validation)
    poly = PolynomialFeatures(p)
    X_train_poly = poly.fit_transform(X_train)
    
    # Specify type of classifier
    clf = LogisticRegression(fit_intercept=False, max_iter=1000)
    
    # Perform cross-validation (here with 5 folds)
    # f1_scores is an array containing the 5 micro-averaged f1-scores
    # (without the scoring argument, cross_val_score() would report accuracy)
    f1_scores = cross_val_score(clf, X_train_poly, y_train, cv=5, scoring='f1_micro')
    
    # Calculate the f1-score for the current p value as the mean over all 5 f1-scores
    f1_fold_mean = np.mean(f1_scores)

    print('p={}, f1 score (mean/std): {:.3f}/{:.3f}'.format(p, f1_fold_mean, np.std(f1_scores)))
    
    # Keep track of the best f1-score and the respective p value
    if f1_fold_mean > best_f1:
        best_p, best_f1 = p, f1_fold_mean
  

print('The best f1-score was {:.3f} for p={}'.format(best_f1, best_p))

Again, $p=5$ yields the best average f1 score. Having found the best value for $p$, we can now fit a Logistic Regression model on the whole training data with $p=5$ and evaluate the f1 score on the test data.

In [None]:
# Transform data
poly = PolynomialFeatures(5)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit Logistic Regression model on complete training data
clf = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X_train_poly, y_train)

# Predict class labels for test data
y_pred = clf.predict(X_test_poly)

# Calculate f1 scores based on ground truth of test set
f1 = f1_score(y_test, y_pred, average='micro')

print('F1 score of Logistic Regression model on the test data: {:.3f}'.format(f1))
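
In the cross-validation above, the data is standardized once before it is split into folds. As an alternative sketch, and not the approach used in this notebook, the preprocessing steps and the classifier can be combined into a scikit-learn pipeline so that the scaler is refitted on the training part of each fold. The example uses synthetic stand-in data from `make_classification` instead of the vessel dataset and keeps the default intercept by setting `include_bias=False`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the vessel dataset): 4 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

for p in range(1, 4):
    # Scaling, polynomial expansion, and classifier are refitted in every fold
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(p, include_bias=False),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1_micro')
    print('p={}, f1 score (mean/std): {:.3f}/{:.3f}'.format(p, np.mean(scores), np.std(scores)))
```

Whether this changes the selected value of $p$ for the vessel data is not something this sketch can tell; it only shows how the steps could be wired together.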

---

## Summary

Logistic Regression is a popular machine learning algorithm used for binary classification tasks, where the goal is to predict the probability of an instance belonging to a certain class. Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It is called "Logistic Regression" because it is based on the logistic function, also known as the sigmoid function.

In Logistic Regression, the algorithm models the relationship between the independent variables and the binary outcome using a logistic function. The logistic function maps the linear combination of the independent variables to a value between 0 and 1, which represents the probability of belonging to the positive class. This mapping allows logistic regression to estimate the likelihood of an instance belonging to a class and make predictions accordingly.

One of the key advantages of Logistic Regression is its simplicity and interpretability. The algorithm provides coefficients for each independent variable, allowing us to understand the impact and direction of each variable on the probability of the positive class. However, logistic regression has some limitations. It assumes a linear relationship between the independent variables and the log-odds of the positive class, which may not always hold true. It may struggle with nonlinear relationships or complex interactions between variables. Additionally, logistic regression is sensitive to outliers and can be affected by overfitting when the number of independent variables is large compared to the number of instances.

In summary, logistic regression is a straightforward and interpretable algorithm for binary classification tasks. Its pros include simplicity, interpretability, and handling of categorical and continuous variables. However, it has limitations in capturing nonlinear relationships and can be sensitive to outliers and overfitting. Therefore, it is important to assess the assumptions and limitations of logistic regression before applying it to a given problem.