<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title"><b>Linear Binary Classification by Linear Programming</b></span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://mate.unipv.it/gualandi" property="cc:attributionName" rel="cc:attributionURL">Stefano Gualandi</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br />Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/mathcoding/opt4ds" rel="dct:source">https://github.com/mathcoding/opt4ds</a>.

**NOTE:** Run the following cell first whenever you run this notebook on Google Colab.

In [None]:
import shutil
import sys
import os.path

if not shutil.which("pyomo"):
    !pip install -q pyomo
    assert(shutil.which("pyomo"))

# The GLPK command-line solver executable is named "glpsol"
if not (shutil.which("glpsol") or os.path.isfile("glpsol")):
    if "google.colab" in sys.modules:
        !apt-get install -y -qq glpk-utils
    else:
        try:
            !conda install -c conda-forge glpk 
        except:
            pass

# Linear Classification
In this lab session, you have to experiment with **(Integer) Linear Programming** to train a **binary linear classifier** for solving two different problems:

1. The classification of two classes of random points drawn from two different 2D Gaussian distributions.
2. The classification of euro banknotes as *regular* or *fake*, given 4 attributes of banknote images:
    * Variance of Wavelet Transformed image (continuous)
    * Skewness of Wavelet Transformed image (continuous)
    * Kurtosis of Wavelet Transformed image (continuous)
    * Entropy of image (continuous)

Your model will be evaluated in terms of overall accuracy, given as the percentage of objects classified correctly as either positive or negative.

For the banknote classification problem, you have only a subset of all the data. The missing data will be used to nominate the best classifier proposed by the different student groups.

To design your classifier, you can only use **(Integer) Linear Programming**. If you want to try other approaches after the lecture, I will be curious to hear about them.

## 1. Gaussian samples in 2D
The first dataset is generated randomly with the following code, which produces a list of $3n$ points in the plane. The first $2n$ points have both coordinates drawn with mean 2.0 and standard deviation $d$ (0.5 by default), and they belong to the first class; the remaining $n$ points have both coordinates drawn with mean 4.0 and the same standard deviation $d$, and they belong to the second class.

In [None]:
import numpy as np

def Gaussian(n, mu, sigma):
    return np.random.normal(mu, sigma, (n, 2))

def RandomData(n, d=0.5):
    # To experiment with different random datasets, comment out the following line
    np.random.seed(17)
    
    # Generate points
    As = Gaussian(2*n, 2, d)
    Bs = Gaussian(n, 4, d)       
    Xs = []
    Ys = []    
    for a in As:
        Xs.append(a)
        Ys.append(0) # First class
    for a in Bs:
        Xs.append(a)
        Ys.append(1) # Second class
        
    return Xs, Ys

Xs, Ys = RandomData(50, 0.75)

print(Xs[0], Ys[0])

In practice, you have generated $m$ points $x_i$ with labels $y_i$, such that the points with $y_i=0$ belong to the first class, and the points with $y_i=1$ belong to the second class.

You have to find a Linear Classifier in 2D, that is, you have to find the hyperplane defined by the vector $(a_0, a_1, a_2)$, such that:

$$
\begin{aligned}
a_0 x_{i0} + a_1 x_{i1} &> a_2 \quad \mbox{if $x_i$ belongs to the first class } (y_i=0)\\
a_0 x_{i0} + a_1 x_{i1} &< a_2 \quad \mbox{if $x_i$ belongs to the second class } (y_i=1)
\end{aligned}
$$
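
As a minimal sketch, the decision rule can be written as a small function. It follows the convention used by the plotting and evaluation code in this notebook, where the last entry of the vector acts as a threshold: a point is assigned to the first class when $a_0 x_{i0} + a_1 x_{i1} \geq a_2$. The name `Predict` is hypothetical and not part of the original notebook, and sending ties to the first class is an assumption.

```python
def Predict(A, x):
    # Weighted sum of the features (all entries of A except the last one)
    ax = sum(xj * aj for xj, aj in zip(x, A[:-1]))
    # The last entry of A is the threshold; ties go to the first class (assumption)
    return 0 if ax >= A[-1] else 1

# Two hand-picked points on opposite sides of the line x0 - x1 = 0.5
print(Predict([1, -1, 0.5], [3.0, 1.0]))
print(Predict([1, -1, 0.5], [1.0, 3.0]))
```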

For instance, the (very bad!) linear classifier specified by $(a_0, a_1, a_2)=(1,-1,0.5)$, classifies each point of the dataset as shown in the following graphical representation.

In [None]:
import matplotlib.pyplot as plt

def PlotSolution(Xs, Ys, A):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    # Left plot
    ax1.scatter([x[0] for x in Xs], [x[1] for x in Xs], 
                color=['green' if y == 1 else 'blue' for y in Ys],
                alpha=0.35)
    # Right plot
    ax2.scatter([x[0] for x in Xs], [x[1] for x in Xs], 
                color=['green' if y == 1 else 'blue' for y in Ys],
                alpha=0.35)
    
    xmin = min(x[0] for x in Xs)
    xmax = max(x[0] for x in Xs)  # range of the first coordinate
    x = np.linspace(xmin, xmax, 10)
    
    y = -A[0]/A[1]*x + A[2]/A[1]
    
    ax2.plot(x, y, color='red')
    
    # Misclassifications
    Vs = []
    for i,x in enumerate(Xs):
        if A[0]*x[0] + A[1]*x[1] < A[2] and Ys[i] == 0:
            Vs.append(x)                        
        else:
            if A[0]*x[0] + A[1]*x[1] > A[2] and Ys[i] == 1:
                Vs.append(x)
    
    ax2.scatter([x[0] for x in Vs], [x[1] for x in Vs], color='red', alpha=0.5, marker='x')
    
    # Final plot
    ax1.axis([xmin-0.5, xmax+0.5, 1, 5.5])
    ax2.axis([xmin-0.5, xmax+0.5, 1, 5.5])
    ax1.axis('equal')
    ax2.axis('equal')
    # plt.savefig('lin_classifier.pdf') 
    plt.show() 
    
# HOW-TO PLOT
PlotSolution(Xs, Ys, [1,-1,0.5])  # <<<<============

We can evaluate the classifier in terms of *accuracy* and/or by using the *confusion matrix*.

In [None]:
def Accuracy(A, Bx, Bl):
    # Count overall misclassifications
    v = 0
    for xs, y in zip(Bx, Bl):
        ax = sum(x*a for x,a in zip(xs, A[:-1]))
        if ax < A[-1] and y == 0:
            v += 1
        if ax > A[-1] and y == 1:
            v += 1    
    return round((len(Bx)-v)/len(Bx)*100, 3), v, len(Bx)

def Confusion(A, Bx, Bl):
    # Compute in order:
    # True Positive, False Positive, True Negative, False Negative
    tp, fp, tn, fn = 0, 0, 0, 0    
    for xs, y in zip(Bx, Bl):
        ax = sum(x*a for x,a in zip(xs, A[:-1]))
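        # Class 1 is the positive class; ties count as correct, as in Accuracy
        if ax >= A[-1] and y == 0:
            tn += 1
        if ax <= A[-1] and y == 1:
            tp += 1
            
        # A class-0 point predicted as class 1 is a false positive
        if ax < A[-1] and y == 0:
            fp += 1
        # A class-1 point predicted as class 0 is a false negative
        if ax > A[-1] and y == 1:
            fn += 1    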
    return tp, fp, tn, fn

# HOW-TO EVALUATE  # <<<<============
print("Accuracy:", Accuracy([1,-1,0.5], Xs, Ys))
print("Confusion Matrix:", Confusion([1,-1,0.5], Xs, Ys))

**Note:** This classifier is indeed very poor: flipping a coin would likely give better results.
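
The two measures are related: accuracy is the share of true positives and true negatives among all classified points. A minimal sketch, assuming the four counts are passed in the order returned by `Confusion` (the helper name `AccuracyFromConfusion` and the counts in the example call are made up for illustration):

```python
def AccuracyFromConfusion(tp, fp, tn, fn):
    # Total number of points counted in the confusion matrix
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    # Correctly classified points are the true positives and true negatives
    return round((tp + tn) / total * 100, 3)

# Illustrative counts only, not results from the dataset above
print(AccuracyFromConfusion(40, 10, 45, 5))
```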

**EXERCISE 1:** Design your best possible Linear Classifier using Integer Linear Programming and Pyomo. Use the visual representation in 2D to get intuition on your classifier. Try to change the initial random distributions (mean and deviation).
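
As a starting point, and only as one possible sketch rather than the intended solution, a classifier can be trained with a pure linear program that penalizes violations of a unit margin. It uses the same convention as the evaluation code (first class when $a_0 x_{i0} + a_1 x_{i1} \geq a_2$); the slack variables $s_i$ and the margin of 1 are assumptions introduced here:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{m} s_i \\
\mbox{s.t.} \quad & a_0 x_{i0} + a_1 x_{i1} - a_2 \geq 1 - s_i && \mbox{if } y_i = 0 \\
& a_0 x_{i0} + a_1 x_{i1} - a_2 \leq -1 + s_i && \mbox{if } y_i = 1 \\
& s_i \geq 0 && i = 1, \dots, m
\end{aligned}
$$

At an optimal solution, a point with $s_i = 0$ lies on the correct side of the line with margin at least 1, while $s_i > 1$ means that the point is misclassified. Replacing $s_i$ by $M z_i$, with binary variables $z_i$ and a sufficiently large constant $M$ (a further assumption), turns the objective $\sum_i z_i$ into a count of margin violations, which gives an Integer Linear Programming variant.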

Implement your solution in the following function:

In [None]:
from pyomo.environ import ConcreteModel, Var, Objective, Constraint, SolverFactory
from pyomo.environ import Binary, RangeSet, NonNegativeReals

def LinearClassifier(Xs, Ys):
    # TO COMPLETE WITH YOUR MODEL
    return [1, -1, 0.5]


# HOW-TO TEST YOUR SOLUTION
A = LinearClassifier(Xs,Ys)  # <<<<============

print("Accuracy:", Accuracy(A, Xs, Ys))
print("Confusion Matrix:", Confusion(A, Xs, Ys))
PlotSolution(Xs, Ys, A)  # plot the classifier returned by LinearClassifier

## 2. Banknote fake classification
In this second exercise, you have to design a Linear Classifier to distinguish between genuine and fake banknotes.

Each banknote is first digitized, and then 4 features of each image are reported in the dataset. The 4 features that you can use are:

* Variance of Wavelet Transformed image (continuous)
* Skewness of Wavelet Transformed image (continuous)
* Kurtosis of Wavelet Transformed image (continuous)
* Entropy of image (continuous)

You are given a subset of 992 banknotes that you can use to train your model.

**GROUP CHALLENGE:** Design the Linear Classifier that achieves the best **accuracy** on a held-out subset of 380 banknotes that you do not have access to.

**DATA:** The banknote data is given as a CSV file with one row per banknote. The first 4 fields of each row give the features of the banknote, while the last field gives a binary label: 0 or 1. You can parse the data with the following snippet.

In [None]:
# Run this command to import the dataset
!wget http://www-dimat.unipv.it/gualandi/opt4ds/banknote_train.csv

In [None]:
# Parse the training set
def ParseData(filename):
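    Xs = []
    Ys = []
    # The context manager closes the file even if parsing fails
    with open(filename, 'r', encoding="utf-8") as fh:
        for line in fh:
            row = line.strip().split(',')
            if len(row) < 2:
                continue  # skip blank lines
            Xs.append(list(map(float, row[:-1])))  # features
            Ys.append(int(row[-1]))                # binary label
    return Xs, Ys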

# HOW-TO PARSE
Xs, Ys = ParseData('banknote_train.csv')
for i in range(5):
    print(Xs[i], Ys[i])

**Note:** You may want to split your data into training and test subsets to verify how well your model generalizes to *unseen* data (this is optional). If you do, you can use the following code:

In [None]:
def SplitTrainTestSet(Xs, Ys, t=0.3):
    Ax, Al = [], []  # Train sets
    Bx, Bl = [], []  # Test sets
    
    np.random.seed(13)
    
    for x, y in zip(Xs, Ys):
        if np.random.uniform(0, 1) > t:
            Ax.append(x)
            Al.append(y)
        else:
            Bx.append(x)
            Bl.append(y)
            
    return Ax, Al, Bx, Bl

# HOW-TO USE A DATA SPLITTING
Ax, Al, Bx, Bl = SplitTrainTestSet(Xs, Ys)  
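
Since the split is random, the two parts may end up with sizes and class proportions that differ from what you expect. A minimal sketch for checking this, assuming the labels are lists of 0/1 values as returned by `ParseData` (the helper `ClassBalance` is hypothetical, and the toy labels below are made up for illustration):

```python
from collections import Counter

def ClassBalance(labels):
    # Return the number of samples and the fraction of each label
    counts = Counter(labels)
    n = len(labels)
    return n, {label: counts[label] / n for label in sorted(counts)}

# Toy labels standing in for Al (train) and Bl (test)
train_labels = [0, 0, 1, 1, 0, 1, 0, 0]
test_labels = [1, 0, 0, 1]
print("Train:", ClassBalance(train_labels))
print("Test: ", ClassBalance(test_labels))
```

In the notebook, `ClassBalance(Al)` and `ClassBalance(Bl)` would give the corresponding figures for the actual split.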

In [None]:
def LinearClassifierBank(Xs, Ys):
    # TO COMPLETE WITH YOUR MODEL
    return []

**SUBMISSION:** Each group must submit a solution by Monday night at the latest. Next Tuesday, each group will receive and correct the solution of another group. All the submitted classifiers will be evaluated during the lecture on Wednesday, 31st, using the **unseen** validation dataset.