# Binary classification

Classification is about predicting an outcome from a fixed list of classes. The prediction is a probability distribution that assigns a probability to each possible outcome.
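
As a rough illustration, a binary prediction can be represented as a dictionary that maps each class to its probability, which is also the shape river classifiers return from `predict_proba_one`. The probabilities below are made up, and the `to_label` helper and its 0.5 threshold are assumptions for this sketch rather than part of river.

```python
# A predicted distribution over the two classes of a binary problem.
# The values are invented for illustration.
y_proba = {False: 0.3, True: 0.7}

# A valid distribution is non-negative and sums to 1.
assert all(p >= 0 for p in y_proba.values())
assert abs(sum(y_proba.values()) - 1.0) < 1e-9


def to_label(proba, threshold=0.5):
    """Turn a binary distribution into a hard label (hypothetical helper)."""
    return proba[True] >= threshold


y_pred = to_label(y_proba)
print(y_pred)  # True
```

Keeping the full distribution rather than only the hard label lets the threshold be chosen later, for instance to trade precision against recall.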

A labeled classification sample consists of a set of features and a class. In binary classification, the class is a boolean. We'll use the phishing dataset as an example.

In [30]:
%matplotlib inline
%load_ext autoreload
%autoreload 2


In [32]:
import collections
from river import datasets

from source.river_utils import evaluate_binary_model

In [33]:
dataset = datasets.Phishing()
dataset

Phishing websites.

This dataset contains features from web pages that are classified as phishing or not.

    Name  Phishing
    Task  Binary classification
 Samples  1,250
Features  9
  Sparse  False
    Path  /home/denys_herasymuk/UCU/Studying_abroad/NYU_Internship/Code/RAI-summer-stabi

Let's take a look at the first sample.

In [34]:
x, y = next(iter(dataset))
x

{'empty_server_form_handler': 0.0,
 'popup_window': 0.0,
 'https': 0.0,
 'request_from_other_domain': 0.0,
 'anchor_from_other_domain': 0.0,
 'is_popular': 0.5,
 'long_url': 1.0,
 'age_of_domain': 1,
 'ip_in_url': 1}

In [35]:
y

True

In [36]:
counts = collections.Counter(y for _, y in dataset)

for c, count in counts.items():
    print(f'{c}: {count} ({count / sum(counts.values()):.5%})')

True: 548 (43.84000%)
False: 702 (56.16000%)


We'll fit a logistic regression to this data. A common way to improve the performance of a logistic regression is to scale the features, which can be done with a `preprocessing.StandardScaler`. We can define a pipeline to organise the model into a sequence of steps:

In [37]:
from river import compose
from river import metrics
from river import evaluate
from river import preprocessing
from river import linear_model

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

model
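
To make the scaling step more concrete, here is a minimal sketch of how features might be standardized one sample at a time, using Welford's running mean and variance. The `RunningScaler` class is hypothetical and is not river's `StandardScaler` implementation; the feature values are invented, and only the feature name `age_of_domain` comes from the dataset above.

```python
import collections
import math


class RunningScaler:
    """Standardize features incrementally (illustrative sketch only)."""

    def __init__(self):
        self.n = collections.Counter()              # samples seen per feature
        self.mean = collections.defaultdict(float)  # running mean per feature
        self.m2 = collections.defaultdict(float)    # sum of squared deviations

    def learn_one(self, x):
        # Welford's update: one pass, no need to store past samples.
        for name, value in x.items():
            self.n[name] += 1
            delta = value - self.mean[name]
            self.mean[name] += delta / self.n[name]
            self.m2[name] += delta * (value - self.mean[name])

    def transform_one(self, x):
        scaled = {}
        for name, value in x.items():
            # Population variance; 0.0 if the feature was never seen.
            var = self.m2[name] / self.n[name] if self.n[name] else 0.0
            # Constant or unseen features are mapped to 0 to avoid dividing by 0.
            scaled[name] = (value - self.mean[name]) / math.sqrt(var) if var > 0 else 0.0
        return scaled


scaler = RunningScaler()
for x in [{'age_of_domain': 1.0}, {'age_of_domain': 3.0}]:
    scaler.learn_one(x)

scaled = scaler.transform_one({'age_of_domain': 3.0})
print(scaled)  # {'age_of_domain': 1.0}
```

In a pipeline, the output of such a transformer would be passed on as the input of the next step, here the logistic regression.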

In [39]:
target_mapping = {
    False: 0,
    True: 1,
}
evaluate_binary_model(dataset, model, target_mapping, measure_every=100)

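TypeError: evaluate_binary_model() got an unexpected keyword argument 'measure_every'

`evaluate_binary_model` is a project-specific helper, and its signature does not include `measure_every`. River's built-in `evaluate.progressive_val_score` can be used instead; it reports the metric periodically via `print_every`.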

In [10]:
metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric, print_every=100)

[100] Accuracy: 83.00%
[200] Accuracy: 83.50%
[300] Accuracy: 84.33%
[400] Accuracy: 86.00%
[500] Accuracy: 86.60%
[600] Accuracy: 87.33%
[700] Accuracy: 88.14%
[800] Accuracy: 88.38%
[900] Accuracy: 88.67%
[1,000] Accuracy: 89.00%
[1,100] Accuracy: 89.18%
[1,200] Accuracy: 89.25%


Accuracy: 89.20%
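
The idea behind progressive validation is that each sample is used for testing before it is used for training, so every prediction is made on data the model has not yet seen. The sketch below illustrates this loop under stated assumptions: `progressive_accuracy` and `MajorityClassifier` are hypothetical names, the toy stream is invented, and the code is not river's implementation of `progressive_val_score`.

```python
class MajorityClassifier:
    """Toy model that predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def progressive_accuracy(stream, model):
    """Predict on each sample before learning from it, then report accuracy."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # 1. predict while the label is unseen
        correct += int(y_pred == y)    # 2. score the prediction
        total += 1
        model.learn_one(x, y)          # 3. only now update the model
    return correct / total if total else 0.0


# Invented stream of (features, label) pairs; the features are left empty
# because the toy model ignores them.
stream = [({}, True), ({}, True), ({}, False), ({}, True)]
accuracy = progressive_accuracy(stream, MajorityClassifier())
print(f'Accuracy: {accuracy:.2%}')  # Accuracy: 50.00%
```

Because the first predictions are made by a model that has learned almost nothing, the running score typically starts low and improves as more samples arrive, which matches the trend in the output above.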