# Logistic Regression

We use logistic regression when we want to *classify* things, i.e. when our label (dependent variable) is qualitative. Maybe we want to:
* say whether each data point is "in" or "out" (*binary classification*)
* assign each data point to exactly one of several classes (*multiclass classification*)
* assign each data point to one or more classes (*multi-label classification*)



## Step one: reconfigure the labels

Consider the iris data. The last column is the species; each iris is assigned to one of three species (setosa, versicolor, or virginica).

**However**, a neural network only processes numbers. So we need our labels to become numeric somehow. We typically do this using a *one hot encoding*:
1. First, we walk over the data and make a list $L$ of the unique labels. Let $q = |L|$ be the number of unique labels.
2. Then, we walk over the data again and replace each label with a vector of length $q$ that is 0 everywhere except at the position of that label in $L$, where it is 1.
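
The two passes above can be sketched in plain Python. This is only an illustration: the function name `one_hot_encode` and the short `species` list are made up for the example and are not part of the notebook's code.

```python
def one_hot_encode(labels):
    """Return (unique_labels, encoded) using the two passes described above."""
    # Pass 1: collect the unique labels in order of first appearance.
    unique_labels = []
    for label in labels:
        if label not in unique_labels:
            unique_labels.append(label)
    q = len(unique_labels)

    # Pass 2: replace each label with a length-q vector holding a single 1.
    encoded = []
    for label in labels:
        vec = [0] * q
        vec[unique_labels.index(label)] = 1
        encoded.append(vec)
    return unique_labels, encoded


# Hypothetical toy labels, not the full iris data.
species = ["setosa", "versicolor", "setosa", "virginica"]
unique_labels, encoded = one_hot_encode(species)
print(unique_labels)  # ['setosa', 'versicolor', 'virginica']
print(encoded)        # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```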

## Step two: reconfigure the network

Instead of one node in the output layer, we have one for each label in the label list. 

We connect *all* input layer nodes to *each* output layer node.
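
As a rough sketch of what this wiring computes, each output node takes a weighted sum of every input feature plus a bias. The function name `output_layer` and the weights, biases, and feature values below are invented for illustration; in the real model they are learned during training.

```python
def output_layer(x, W, b):
    """Compute one raw output per class: o_i = sum_j W[i][j] * x[j] + b[i]."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


# Four input features (as in the iris data) and three output nodes.
x = [5.1, 3.5, 1.4, 0.2]          # one made-up data point
W = [[0.1, 0.0, 0.0, 0.0],        # one row of weights per output node
     [0.0, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.1, 0.1]]
b = [0.0, 0.0, 0.0]               # one bias per output node

o = output_layer(x, W, b)
print(len(o))  # 3, one raw output per class
```

With $d$ input features and $q$ classes, the weights form a $q \times d$ matrix, so this layer has $q \cdot d + q$ parameters.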

**Question**: What do we call a network that is connected this way?

## Step three: set the activation function

Now that our labels are numeric, we could treat this as just a bunch of linear regressions (one per output node). However, if we do that we can't ensure that two important criteria for classification are met:
* all the outputs are in [0, 1] 
* (for regular classification) their sum is 1

Instead, we use the *softmax* function:
$\hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$, where $o_i$ is the output of the $i$-th output node.

We often interpret the output of the network, a vector $\hat{y}$, as containing the conditional probability of each class (label) given the features $x$. 
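
Here is a minimal sketch of the softmax in plain Python, showing that the results meet both criteria above. Subtracting the largest output before exponentiating is an addition of mine, not something the formula requires: it leaves the result unchanged (the common factor cancels in the ratio) but avoids overflow for large outputs. The example outputs are made up.

```python
import math


def softmax(o):
    """Map raw outputs o to values in [0, 1] that sum to 1."""
    m = max(o)                             # shift for numerical safety
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


y_hat = softmax([2.0, 1.0, 0.1])           # hypothetical raw outputs
print(y_hat)
print(sum(y_hat))                          # 1 up to floating-point rounding
```

The largest raw output always receives the largest probability, so softmax preserves the ranking of the classes.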

For more on the softmax: https://web.stanford.edu/~nanbhas/blog/sigmoid-softmax/.

## Step four: Set the loss function

For a loss function, we use $l(y, \hat{y}) = -\sum_{j=1}^q y_j \log \hat{y}_j$, where $y$ is the one-hot label vector and $\hat{y}$ is the softmax output. This is called the *cross-entropy loss*.
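
To make the formula concrete, here is a small sketch for a single data point. The function name `cross_entropy` and the probability vectors are illustrative only; terms with $y_j = 0$ are skipped, which matches the usual convention that they contribute nothing to the sum.

```python
import math


def cross_entropy(y, y_hat):
    """-sum_j y_j * log(y_hat_j) for one data point."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)


y = [0, 1, 0]                     # one-hot label: the true class is index 1

confident = [0.1, 0.7, 0.2]       # made-up prediction favoring the true class
mistaken = [0.7, 0.1, 0.2]        # made-up prediction favoring a wrong class

loss_good = cross_entropy(y, confident)
loss_bad = cross_entropy(y, mistaken)
print(loss_good, loss_bad)        # the mistaken prediction has the larger loss
```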

The entropy of a distribution $P$ is defined as
$H(P) = \sum_j -P(j) \log P(j)$.

It measures:
* the expected amount of space (in bits, if the logarithm is base 2) needed to encode an event drawn at random from $P$.
* the amount of surprise one should expect at seeing an event $j$ drawn from $P$.
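
A quick sketch of the entropy computation follows. It assumes a base-2 logarithm so that the result is in bits; the distributions are made up for illustration.

```python
import math


def entropy(p):
    """sum_j -P(j) * log2 P(j), skipping events with zero probability."""
    return sum(-p_j * math.log2(p_j) for p_j in p if p_j > 0)


fair_coin = [0.5, 0.5]
uniform = [1 / 3, 1 / 3, 1 / 3]   # e.g. three equally likely classes
skewed = [0.9, 0.05, 0.05]        # one class dominates

print(entropy(fair_coin))         # 1.0 bit
print(entropy(uniform))
print(entropy(skewed))            # lower than uniform: outcomes are less surprising
```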

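You can think of this loss function as comparing the model's estimate of the probability of each class given the features, $\hat{y}$, with the actual probability of each class, $y$. With one-hot labels, $y_j$ is 1 for the true class and 0 otherwise, so the loss reduces to $-\log \hat{y}_c$, where $c$ is the true class.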

## Let's Take a Look

In [1]:
!pip install torch torchvision
!pip install d2l==1.0.0b0
!pip install wandb


Now import required packages.

In [None]:
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

I will use the "compact" implementation of logistic regression as a neural network in PyTorch.

In [None]:
class LogisticRegression(d2l.Classifier):  #@save
    """The logistic regression model."""
    def __init__(self, in_features, out_features, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_features, out_features))
       
    # Why not compute softmax here? Per the textbook: "By combining softmax and cross-entropy, we can escape the numerical stability issues altogether."
    # F.cross_entropy (used in loss below) therefore takes the raw outputs (logits) directly.
    def forward(self, X):
        return self.net(X)

    # cross-entropy loss
    def loss(self, Y_hat, Y, averaged=True):
        Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
        Y = Y.reshape((-1,))
        return F.cross_entropy(
            Y_hat, Y, reduction='mean' if averaged else 'none')
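
The numerical stability issue mentioned in the comment can be illustrated with a plain-Python sketch. This is not PyTorch's implementation, only the underlying idea: writing the loss as $\log \sum_j \exp(o_j) - o_c$ for true class $c$ and shifting by the largest output avoids ever exponentiating a large number. The function names and example outputs are invented for illustration.

```python
import math


def naive_cross_entropy(o, label):
    """Softmax first, then -log of the true class probability."""
    exps = [math.exp(v) for v in o]        # overflows for large outputs
    return -math.log(exps[label] / sum(exps))


def cross_entropy_from_logits(o, label):
    """Same quantity, computed as logsumexp(o) - o[label]."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - o[label]


small = [2.0, 1.0, 0.1]                    # both versions agree here
print(naive_cross_entropy(small, 0), cross_entropy_from_logits(small, 0))

big = [1000.0, 0.0]                        # math.exp(1000.0) is out of range
try:
    naive_cross_entropy(big, 0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(naive_overflowed)                    # True
print(cross_entropy_from_logits(big, 0))   # 0.0
```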

Use our reader for CSV data from last week.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

class CsvData(d2l.DataModule):  #@save
    def __init__(self, labelColIndex, path, batch_size=32, shuffle=True, split=0.2):
        super().__init__()
        self.save_hyperparameters()
        # read the data
        df = pd.read_csv(path)
        self.labels = df.iloc[:, labelColIndex].unique()
        colIndices = list(range(df.shape[1]))
        colIndices.pop(labelColIndex)
        features = df.iloc[:, colIndices]
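        # encode each label as its integer index in self.labels
        # (F.cross_entropy takes class indices, so no explicit one-hot vectors are needed;
        # pd.get_dummies(df.iloc[:, labelColIndex]) would give the one-hot version)
        labels = df.iloc[:, labelColIndex]
        labels = labels.apply(lambda x: np.where(self.labels==x)[0][0]).astype(float)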
        # split the dataset
        self.train, self.val, self.train_y, self.val_y = train_test_split(features, labels, test_size=split, shuffle=shuffle)
        print("shuffle", shuffle, "batch_size", batch_size, "split", split)
        print(self.get_feature_count(), self.get_train_data_size(), self.get_test_data_size())
         
    def get_feature_count(self):
        return self.train.shape[1]

    def get_label_count(self):
        return len(self.labels)

    def get_train_data_size(self):
        return self.train.shape[0]

    def get_test_data_size(self):
        return self.val.shape[0]
                
    def text_labels(self, indices):
        """Return text labels."""
        return [self.labels[int(i)] for i in indices]

    def get_dataloader(self, train):
        features = self.train if train else self.val
        labels = self.train_y if train else self.val_y
        get_tensor = lambda x: torch.tensor(x.values, dtype=torch.float32)
        tensors = (get_tensor(features), get_tensor(labels).long())
        return self.get_tensorloader(tensors, train)

Set our hyperparameters, fit a model.

In [None]:
lr = 0.01
epochs = 10
split = 0.2
batch_size = 12
shuffle = True

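# You can get the iris data from https://archive-beta.ics.uci.edu/dataset/53/iris
# Note: the raw iris.data file has no header row, and pd.read_csv (as called in CsvData)
# treats the first line as a header by default, so add a header line or the first sample is lost.
data = CsvData(4, "data/iris.data", batch_size=batch_size, shuffle=shuffle, split=split)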
model = LogisticRegression(data.get_feature_count(), data.get_label_count(), lr)

# textbook has this: 
#data = d2l.FashionMNIST(batch_size=256)
#model = LogisticRegression(784, 10, 0.1)

trainer = d2l.Trainer(max_epochs=epochs)
trainer.fit(model, data)
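
Once a model is trained, its raw outputs can be turned into predicted labels by taking the index of the largest output and looking it up in the label list, much as `text_labels` does above. Softmax preserves the ordering of the outputs, so it can be skipped for this step. The sketch below uses made-up outputs rather than results from the model, and the helper names `predict` and `accuracy` are hypothetical.

```python
def predict(logits, labels):
    """Return the label whose output node has the largest raw output."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]


def accuracy(batch_logits, true_labels, labels):
    """Fraction of data points whose predicted label matches the true one."""
    hits = sum(
        predict(logits, labels) == truth
        for logits, truth in zip(batch_logits, true_labels)
    )
    return hits / len(true_labels)


labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
batch_logits = [[2.0, 0.5, -1.0],   # made-up outputs, not from a trained model
                [0.1, 1.2, 0.3],
                [0.0, 0.2, 0.1]]
true_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

acc = accuracy(batch_logits, true_labels, labels)
print(acc)   # two of the three made-up points are classified correctly
```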