# Logistic Regression for classification on Banking Note dataset

In this notebook, we use simple logistic regression on the banking note dataset to classify, based on its attributes, whether a note is genuine or forged.

Data Set Information:

Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.

Attribute Information:

1. variance of Wavelet Transformed image (continuous) 
2. skewness of Wavelet Transformed image (continuous) 
3. curtosis of Wavelet Transformed image (continuous) 
4. entropy of image (continuous) 
5. class (integer) 

Source: UCI Machine Learning Repository

We have split the dataset into 'Training data' and 'Test data'.

In [22]:
import numpy as np
import pandas as pd
import csv

In [23]:
#---- Logistic function ----#

def sigmoid(scores):
    return 1 / (1 + np.exp(-scores))

In [24]:
#---- Log Likelihood ----#

def log_likelihood(features, target, weights):
    scores = np.dot(features, weights)
    logl = np.sum( target*scores - np.log(1 + np.exp(scores)) )
    return logl
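
The expression `np.log(1 + np.exp(scores))` can overflow when a score is large. A minimal sketch of an equivalent formulation, assuming we are free to swap in NumPy's `np.logaddexp` (the function name `log_likelihood_stable` is hypothetical, not part of the original notebook):

```python
import numpy as np

def log_likelihood_stable(features, target, weights):
    # Same quantity as log_likelihood: sum(y * s - log(1 + exp(s)))
    scores = np.dot(features, weights)
    # np.logaddexp(0, s) evaluates log(exp(0) + exp(s)) = log(1 + exp(s))
    # without overflowing for large s
    return np.sum(target * scores - np.logaddexp(0, scores))
```

For moderate scores this should agree with `log_likelihood` up to floating-point error; the difference only matters once `np.exp(scores)` would overflow.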

In [25]:
#---- Logistic Regression Function ----#

# We give our definition to Logistic Regression as follows:

def logistic_regression(features, target, num_steps, learning_rate, add_intercept = False):
    if add_intercept:
        # Prepend a column of ones so the first weight acts as the intercept
        intercept = np.ones((features.shape[0], 1))
        features = np.hstack((intercept, features))
        
    weights = np.zeros(features.shape[1]) #placeholder for the initial weights

    for step in range(num_steps):
        scores = np.dot(features, weights) #linear function
        predictions = sigmoid(scores) 

        # Update weights with log likelihood gradient
        output_error_signal = target - predictions #compute the error
        
        gradient = np.dot(features.T, output_error_signal) #find the contribution
                                                           #of weights towards the total error
        weights += learning_rate * gradient

      
        if step % 10000 == 0:
            print (log_likelihood(features, target, weights))
        
    return weights
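
The weight update in the loop above is gradient ascent on the log likelihood. Writing $X$ for the feature matrix (rows $x_i$), $y$ for the target vector, $w$ for the weights, $\sigma$ for the sigmoid and $\eta$ for `learning_rate` (symbols chosen here for illustration), the quantities computed by `log_likelihood` and by the loop can be written as:

```latex
\begin{aligned}
\ell(w) &= \sum_i \left[ y_i \, x_i^\top w - \log\left(1 + e^{x_i^\top w}\right) \right] \\
\nabla_w \ell(w) &= \sum_i \left( y_i - \sigma(x_i^\top w) \right) x_i = X^\top \left( y - \sigma(Xw) \right) \\
w &\leftarrow w + \eta \, X^\top \left( y - \sigma(Xw) \right)
\end{aligned}
```

The term $y - \sigma(Xw)$ is `output_error_signal` and $X^\top (y - \sigma(Xw))$ is `gradient`. Because the gradient is summed over all rows rather than averaged, the effective step size grows with the number of training rows.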


In [26]:
series = pd.read_csv('banknote-traindata.csv')
series.columns = ['X1','X2','X3','X4','Y'] #for the sake of ease
series.head()

         X1       X2       X3       X4  Y
0    4.5459   8.1674  -2.4586  -1.4621  0
1     3.866  -2.6383   1.9242  0.10645  0
2    3.4566   9.5228  -4.0112  -3.5944  0
3   0.32924  -4.4552   4.5718  -0.9888  0
4    4.3684   9.6718  -3.9606  -3.1625  0


In [27]:
X = series.iloc[:,0:3].values #feature values X1..X3 (the slice end is exclusive, so X4 is not used)

Y = series.iloc[:,4].values #separating the label vector


In [28]:
weights = logistic_regression(X, Y, num_steps = 100000, learning_rate = 5e-5, add_intercept=True)

-616.3051504547756
-26.482079668068927
-24.387972336440313
-23.594066189699703
-23.18343578806925
-22.939691754448447
-22.78312199742724
-22.677246705037064
-22.603021091832588
-22.549576756187278


Notice how the values increase toward zero: the updates are maximizing the log likelihood. A value is printed every 10,000 steps, starting at step 0. After about the 50,000th step the improvement between printed values is very small.

In [29]:
print(weights)

[ 5.80687242 -5.40882618 -2.83507258 -3.57599376]


Let us store these weights

In [30]:
np.savetxt('optimum_weights.txt', weights,  delimiter=' ')

Now let us load the test dataset

In [31]:
series_test = pd.read_csv('banknote-testdata.csv')
series_test.columns = ['X1','X2','X3','X4','Y']
series_test.head()

         X1       X2       X3       X4  Y
0    4.6765  -3.3895   3.4896   1.4771  0
1    2.6719   3.0646  0.37158  0.58619  0
2    5.7867   7.8902  -2.6196 -0.48708  0
3    0.3292  -4.4552   4.5718  -0.9888  0
4    3.9362  10.1622  -3.8235  -4.0172  0


In [32]:
X_test = series_test.iloc[:,0:3].values #same three feature columns (X1..X3) as in training
Y_test = series_test.iloc[:,4].values

We can re-read the weights from the saved file with the following instruction:

In [33]:
weights = np.loadtxt('optimum_weights.txt')

In [34]:
score = np.dot(np.hstack((np.ones((X_test.shape[0], 1)), X_test)), weights)

In [35]:
#making predictions
preds = np.round(sigmoid(score))
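
The scoring and rounding steps above can be wrapped into one helper. This is a minimal sketch, assuming the weights were trained with `add_intercept=True`; the names `predict` and `threshold` are hypothetical and not part of the original notebook:

```python
import numpy as np

def sigmoid(scores):
    return 1 / (1 + np.exp(-scores))

def predict(features, weights, threshold=0.5):
    # Prepend the column of ones that add_intercept=True added during training
    intercept = np.ones((features.shape[0], 1))
    scores = np.dot(np.hstack((intercept, features)), weights)
    # Label 1 when the predicted probability reaches the threshold
    return (sigmoid(scores) >= threshold).astype(int)
```

A call such as `predict(X_test, weights)` should give the same labels as `np.round(sigmoid(score))`, except for a probability of exactly 0.5, which `np.round` sends to 0 while this sketch sends it to 1.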

Let's look at the accuracy calculated in the cell below:

In [36]:
print ('Accuracy: {0}%'.format((preds == Y_test).sum().astype(float) / len(preds)*100))

Accuracy: 98.9010989010989%


That's cool. It's a good accuracy: about 98.9% of the test notes are classified correctly.
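
Accuracy alone does not show which kind of mistake the classifier makes. As a rough sketch, the four confusion counts can be tallied with NumPy. The function name `confusion_counts` is hypothetical and the labels below are toy values made up for illustration, not the notebook's data:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    # Cast to int so float predictions such as 1.0 compare cleanly with labels
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, truly 1
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # predicted 0, truly 0
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, truly 0
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, truly 1
    return tp, tn, fp, fn

# Toy labels, made up for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
```

With the notebook's variables, the call would be `confusion_counts(Y_test, preds)`.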