
Changing from regression to classification:

- change last activation to Sigmoid to get a value between 0 and 1
- use as many output nodes as you have choices (2 choices = 2 nodes)
- use CrossEntropyLoss to better measure error for classification

We use "logistic regression" for classification tasks because the logistic function (also called the sigmoid) is an S-shaped curve whose output always lies between 0 and 1. That output can be read as an estimated likelihood that an example belongs to class A or B, etc.
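To make the shape of the logistic function concrete, here is a minimal pure-Python sketch. The function name `sigmoid` and the sample inputs are chosen for illustration only and are not part of the notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, zero maps to exactly 0.5,
# and large positive inputs approach 1.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

The curve is symmetric around 0.5: `sigmoid(z) + sigmoid(-z)` equals 1 for any `z`.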

In [2]:
import torch
import pandas as pd
import matplotlib.pyplot as plt
import random
import sklearn.metrics

In [3]:
adultdf = pd.read_csv("/content/drive/MyDrive/DataScience/LogisticRegression/adult.data", header=None)
adultdf.columns = ["age", "workclass", "fnlwgt", "education", "educationNum",
                  "maritalStatus", "occupation", "relationship", "race",
                  "sex", "capitalGain", "capitalLoss", "hoursPerWeek",
                  "nativeCountry", "income"]
adultdf

,age,workclass,fnlwgt,education,educationNum,maritalStatus,occupation,relationship,race,sex,capitalGain,capitalLoss,hoursPerWeek,nativeCountry,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


We need to turn some of these categorical columns into "one-hot" columns. For workclass, for example: Private = {1,0,0}, Self-emp-not-inc = {0,1,0}, etc. (shown with only three categories for brevity). This way, the neural net does not see integer labels like Private = 1, Self-emp-not-inc = 2, etc., which may lead it to believe Private is "close to" Self-emp-not-inc.
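As a rough illustration of what one-hot encoding does, the sketch below builds the columns by hand in plain Python. The helper `one_hot` and the four-row sample column are hypothetical; the notebook itself relies on pandas' `get_dummies`, and the real workclass column has more categories.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical mini-column with three categories.
workclass = ["Private", "Self-emp-not-inc", "State-gov", "Private"]
categories, encoded = one_hot(workclass)
print(categories)
for value, row in zip(workclass, encoded):
    print(value, row)
```

Every pair of distinct categories differs in exactly two positions, so no category ends up numerically "closer" to another.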

In [4]:
# example of pandas' get_dummies on a single column:
pd.get_dummies(adultdf[["workclass"]], prefix=["workclass"], columns=["workclass"])

,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay
0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,0,0
4,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,1,0,0,0,0
32557,0,0,0,0,1,0,0,0,0
32558,0,0,0,0,1,0,0,0,0
32559,0,0,0,0,1,0,0,0,0


In [5]:
onehot_columns = ["relationship"]
x = pd.get_dummies(adultdf[onehot_columns],
                    prefix=onehot_columns,
                    columns=onehot_columns)
x

,relationship_ Husband,relationship_ Not-in-family,relationship_ Other-relative,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife
0,0,1,0,0,0,0
1,1,0,0,0,0,0
2,0,1,0,0,0,0
3,1,0,0,0,0,0
4,0,0,0,0,0,1
...,...,...,...,...,...,...
32556,0,0,0,0,0,1
32557,1,0,0,0,0,0
32558,0,0,0,0,1,0
32559,0,0,0,1,0,0


Because we are doing classification, we will use CrossEntropyLoss as our criterion function. The documentation shows that the "input" (the output of the last layer of the network) should have one value per class, so we need 2 neurons because our target has two classes (<=50K, >50K). The docs also show that each y target value must be a single class index, which is 0 or 1 here since we have two classes. We do not use one-hot encoding on these y values.
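To show how two output values and a single class index combine into one loss number, here is a hand-written sketch of cross-entropy for one example. It follows the formula given in the PyTorch documentation for CrossEntropyLoss (log-softmax followed by negative log-likelihood); the function name `cross_entropy` and the sample scores are illustrative assumptions.

```python
import math

def cross_entropy(scores, target):
    """Cross-entropy for one example: -log(softmax(scores)[target]).

    `scores` holds one raw score per class; `target` is the class index.
    """
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

print(cross_entropy([2.0, 0.0], target=0))  # higher score on the true class: small loss
print(cross_entropy([2.0, 0.0], target=1))  # higher score on the wrong class: larger loss
print(cross_entropy([0.0, 0.0], target=0))  # undecided: equals log(2)
```

With its default settings, the PyTorch criterion averages this per-example value over all rows in the batch.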

In [6]:
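# the raw values carry a leading space (' <=50K'), so the comparison string includes it
# True (1 after conversion to long) means income <=50K; False (0) means >50K
y = pd.Series([label == ' <=50K' for label in adultdf["income"]])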
y

0         True
1         True
2         True
3         True
4         True
         ...  
32556     True
32557    False
32558     True
32559     True
32560    False
Length: 32561, dtype: bool

**NOTE!!!**

Only about 24% of the examples belong to one class (>50K), so our model has to at least get its error below 24%! It can simply learn to always say "1" (<=50K), ignoring the input data, and still be wrong only 24% of the time!

In [7]:
# count each class; the last number is the share of False (>50K) examples
y_true = int(y.sum())
y_false = len(y) - y_true
print(y_true, y_false, 1 - y_true / len(y))

24720 7841 0.2408095574460244
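The baseline argument can be checked with a few lines of arithmetic. This sketch uses the two class counts printed above; the variable names are illustrative.

```python
# Class counts printed by the cell above (True = <=50K, False = >50K).
n_true, n_false = 24720, 7841
total = n_true + n_false

# A constant classifier is wrong on every example of the class it never predicts.
error_always_true = n_false / total
error_always_false = n_true / total
baseline_error = min(error_always_true, error_always_false)

print(f"always predict True : error {error_always_true:.4f}")
print(f"always predict False: error {error_always_false:.4f}")
print(f"baseline to beat    : error {baseline_error:.4f}")
```

Any trained model whose validation error stays above this baseline has not learned anything useful from the inputs.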


In [8]:
# shuffle the row positions once (fixed seed), then cut them 60% / 20% / 20%
indexes = pd.Series(y.sample(frac=1.0, random_state=0).index)
train_idxs = indexes.iloc[range(0, int(len(indexes)*0.6))]
val_idxs = indexes.iloc[range(int(len(indexes)*0.6), int(len(indexes)*0.8))]
test_idxs = indexes.iloc[range(int(len(indexes)*0.8), len(indexes))]
# select the same rows from the features (x) and the targets (y)
train_x = x.iloc[train_idxs]
val_x = x.iloc[val_idxs]
test_x = x.iloc[test_idxs]
train_y = y.iloc[train_idxs]
val_y = y.iloc[val_idxs]
test_y = y.iloc[test_idxs]
train_y

22278     True
8950      True
7838      True
16505     True
19140    False
         ...  
14525     True
26826    False
18552     True
17957    False
27290    False
Length: 19536, dtype: bool
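The split above shuffles the row positions once and cuts them at 60% and 80%. The same idea in plain Python might look like the sketch below. The helper `split_indices` is hypothetical, and its shuffle order will not match the one produced by pandas' `sample(random_state=0)`; only the sizes and the disjointness carry over.

```python
import random

def split_indices(n, train_end_frac=0.6, val_end_frac=0.8, seed=0):
    """Shuffle 0..n-1 and cut into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    train_end = int(n * train_end_frac)
    val_end = int(n * val_end_frac)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train, val, test = split_indices(32561)  # row count of the adult dataset above
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled list, no row can appear in more than one set.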

In [9]:
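model = torch.nn.Sequential(
    torch.nn.Linear(train_x.shape[1], 100), # compute number of columns from shape
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5), # in training mode, each activation is zeroed with probability 0.5 on every forward pass
    torch.nn.Linear(100, 2), # one output per class
    torch.nn.Sigmoid() # squashes both outputs into (0, 1); CrossEntropyLoss applies log-softmax on top of this
)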
model.cuda()

Sequential(
  (0): Linear(in_features=6, out_features=100, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.5, inplace=False)
  (3): Linear(in_features=100, out_features=2, bias=True)
  (4): Sigmoid()
)
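The Dropout layer in this model can be pictured with a small pure-Python sketch. It assumes the "inverted dropout" behavior described in the PyTorch documentation: in training mode each activation is zeroed with probability `p` and the survivors are scaled by `1/(1-p)`, while in evaluation mode the layer passes values through unchanged. The function name `dropout` and the sample values are illustrative.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(activations)  # evaluation mode: identity
    scale = 1.0 / (1.0 - p)       # keeps the expected value unchanged
    return [0.0 if rng.random() < p else a * scale for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, p=0.5, rng=rng))   # a random subset is zeroed
print(dropout(hidden, p=0.5, rng=rng))   # a different subset on the next pass
print(dropout(hidden, training=False))   # unchanged in evaluation mode
```

A fresh random mask is drawn on every forward pass, not once per epoch, and it applies to activations rather than to weights.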

In [10]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
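One consequence of feeding Sigmoid outputs into CrossEntropyLoss is worth seeing in numbers. It is offered here as an observation, not as a change to the notebook. Because both outputs lie between 0 and 1, they can differ by at most 1, so the per-example loss is confined to a narrow band around log(2). The sketch below uses hand-written `sigmoid` and `cross_entropy` helpers (hypothetical names) rather than PyTorch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(scores, target):
    """-log(softmax(scores)[target]), computed with a stable log-sum-exp."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[target]

# Extreme pre-sigmoid values: the network is as confident as it can be.
best = cross_entropy([sigmoid(50.0), sigmoid(-50.0)], target=0)   # confident and right
worst = cross_entropy([sigmoid(-50.0), sigmoid(50.0)], target=0)  # confident and wrong
unsquashed = cross_entropy([50.0, -50.0], target=0)               # same scores, no sigmoid

print(f"best loss with sigmoid : {best:.4f}")
print(f"worst loss with sigmoid: {worst:.4f}")
print(f"best loss on raw scores: {unsquashed:.4f}")
```

With the sigmoid in place the loss cannot fall below log(1 + e^-1), roughly 0.313, whereas raw scores can drive it toward 0. This is one reason PyTorch models often pass the last Linear layer's output directly to CrossEntropyLoss.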

In [11]:
train_x_gpu = torch.tensor(train_x.to_numpy()).float().cuda()
val_x_gpu = torch.tensor(val_x.to_numpy()).float().cuda()
train_y_gpu = torch.tensor(train_y.to_numpy()).long().cuda()
val_y_gpu = torch.tensor(val_y.to_numpy()).long().cuda()
train_x_gpu

tensor([[0., 0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        ...,
        [0., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0.]], device='cuda:0')
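The training loop that follows turns each pair of outputs into a single predicted label and reports the error rate (1 - accuracy). A minimal sketch of that step, using made-up output values and the hypothetical helpers `to_labels` and `error_rate` in place of the notebook's list comprehension and `sklearn.metrics.accuracy_score`:

```python
def to_labels(outputs):
    """Pick class 0 when the first output is strictly larger, otherwise class 1."""
    return [0 if out[0] > out[1] else 1 for out in outputs]

def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Made-up model outputs (one pair per example) and true labels.
outputs = [[0.7, 0.4], [0.3, 0.6], [0.5, 0.5], [0.2, 0.9]]
targets = [0, 1, 0, 0]

labels = to_labels(outputs)
print(labels, error_rate(targets, labels))
```

Note that a tie between the two outputs is resolved in favor of class 1 under this rule.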

In [12]:
for epoch in range(100):
    optimizer.zero_grad()
    train_pred = model(train_x_gpu)
    train_loss = criterion(train_pred, train_y_gpu)
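    # note: the model is still in training mode here, so dropout is also active for the validation pass
    val_pred = model(val_x_gpu)
    val_loss = criterion(val_pred, val_y_gpu)

    # compute the error rate (1 - accuracy)
    # first, convert the two outputs per example into a single predicted label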
    train_pred_label = [0 if pred[0] > pred[1] else 1 for pred in train_pred.cpu().detach().tolist()]
    val_pred_label = [0 if pred[0] > pred[1] else 1 for pred in val_pred.cpu().detach().tolist()]
    train_error = 1.0 - sklearn.metrics.accuracy_score(train_y_gpu.cpu().tolist(), train_pred_label)
    val_error = 1.0 - sklearn.metrics.accuracy_score(val_y_gpu.cpu().tolist(), val_pred_label)
    print(train_loss.item(), val_loss.item(), train_error, val_error)
    train_loss.backward()
    optimizer.step()

0.6983808279037476 0.6978382468223572 0.5700757575757576 0.5703316953316953
0.6981967687606812 0.6976204514503479 0.5681818181818181 0.5612714987714988
0.6972208023071289 0.697152853012085 0.5570741195741196 0.5578931203931203
0.6969624757766724 0.696898877620697 0.5545659295659295 0.5543611793611793
0.6965380907058716 0.6957377195358276 0.5527231777231778 0.5419226044226044
0.6962052583694458 0.6954973936080933 0.546990171990172 0.5336302211302211
0.6956281065940857 0.6955417990684509 0.5405917280917281 0.5290233415233415
0.6951631307601929 0.6949793100357056 0.5340909090909092 0.5313267813267813
0.6945670247077942 0.6943801641464233 0.5256449631449631 0.5187346437346437
0.6943113207817078 0.6935708522796631 0.5261568386568387 0.5076781326781327
0.6937029957771301 0.6941842436790466 0.5099303849303849 0.5216523341523341
0.6936092972755432 0.6933209300041199 0.5148443898443898 0.5042997542997543
0.6929850578308105 0.6929388046264648 0.5032248157248157 0.4993857493857494
0.6927643418312