# Training a Classifier on the *Salammbô* Dataset with scikit-learn
Author: Pierre Nugues

We first import the modules we need.

In [1]:
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score, LeaveOneOut

## Reading the Dataset
We can read the data from a file in the svmlight format. `load_svmlight_file` returns $X$ as a sparse matrix, so we convert it to a dense array to make it easier to inspect.

In [2]:
X, y = load_svmlight_file('../salammbo/salammbo_a_binary.libsvm')

FileNotFoundError: [Errno 2] No such file or directory: '../salammbo/salammbo_a_binary.libsvm'

In [3]:
X = X.toarray()  # convert the sparse matrix to a dense NumPy array
print(type(X))
X[:4]

NameError: name 'X' is not defined

In [4]:
print(type(y))
y[:4]

<class 'numpy.ndarray'>


array([0., 0., 0., 0.])
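
To see what the svmlight format looks like, a small round trip can help. The sketch below is an illustration rather than part of the original notebook: it writes three rows to an in-memory buffer with `dump_svmlight_file` and reads them back with `load_svmlight_file`. The names `buffer`, `X_small`, and `y_small` and the choice of rows are assumptions made for the example.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Illustrative toy data: three rows with two features each (assumed for this sketch)
X_small = np.array([[35680, 2217], [42514, 2761], [36961, 2503]], dtype=float)
y_small = np.array([0, 0, 1])

# Write the data in svmlight format to an in-memory binary buffer
buffer = BytesIO()
dump_svmlight_file(X_small, y_small, buffer, zero_based=True)

# Each line has the form "<label> <index>:<value> <index>:<value>"
print(buffer.getvalue().decode())

# Read the buffer back; the features come back as a sparse matrix
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(buffer, zero_based=True)
X_loaded = X_loaded.toarray()
print(X_loaded, y_loaded)
```

Reading from a file path works the same way; only the first argument of `load_svmlight_file` changes.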

We can also create the NumPy arrays directly.

In [5]:
X = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215]
     ])

y = np.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

## Fitting the Data
We create a logistic regression classifier and fit a model to the data.

In [7]:
classifier = LogisticRegression()
classifier

In [8]:
classifier.fit(X, y)

## Predicting Classes
We now apply the model to the training set and predict the classes for the whole dataset.

In [9]:
y_hat = classifier.predict(X)
y_hat

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])

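We then predict the class of two individual observations: the last one in the dataset and the first one. `predict` expects a two-dimensional array, so each observation is wrapped in a list or an array.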

In [10]:
classifier.predict([X[-1]])

array([1])

In [11]:
classifier.predict(np.array([[35680, 2217]]))

array([0])

We can also predict the class probabilities for the training set. Each row gives the probability of class 0, then of class 1.

In [12]:
y_predicted = classifier.predict_proba(X)
y_predicted[:4]

array([[1.00000000e+00, 1.28980319e-30],
       [9.99999999e-01, 8.16295157e-10],
       [9.91302434e-01, 8.69756611e-03],
       [1.00000000e+00, 2.35657080e-12]])

The predictions are all correct, but evaluating a model on its own training set is not good practice: it tells us nothing about how the model generalizes to unseen data.

In [13]:
classifier.predict_proba([X[-1]])

array([[0.0180183, 0.9819817]])
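
The link between `predict_proba` and `predict` can be sketched with the probabilities printed above: for this binary model, the predicted class is the column with the highest probability. The variable names `probs` and `classes` are introduced here for illustration.

```python
import numpy as np

# Probabilities copied from the outputs above: the first four
# observations and the last one (rows are [P(class 0), P(class 1)])
probs = np.array([
    [1.00000000e+00, 1.28980319e-30],
    [9.99999999e-01, 8.16295157e-10],
    [9.91302434e-01, 8.69756611e-03],
    [1.00000000e+00, 2.35657080e-12],
    [0.0180183, 0.9819817],
])

# The predicted class is the index of the most probable column
classes = np.argmax(probs, axis=1)
print(classes)  # [0 0 0 0 1]
```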

## The Model
We print the model weights: the intercept, followed by the coefficients of the two features.

In [14]:
'Model weights: {}, {}'.format(classifier.intercept_, classifier.coef_)

'Model weights: [-4.51879339e-05], [[-0.03372363  0.51169867]]'

Using these weights, we can recompute a prediction by hand with the logistic function.

We first build the weight vector, with the intercept as its first component.

In [15]:
w = np.append(classifier.intercept_, classifier.coef_)
w

array([-4.51879339e-05, -3.37236260e-02,  5.11698674e-01])

The feature vector of one observation, with a leading 1 that multiplies the intercept.

In [16]:
x = np.append([1.0], X[-1])
x

array([1.0000e+00, 1.8317e+04, 1.2150e+03])

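The prediction is the logistic function applied to the dot product of the two vectors. It matches the probability of class 1 that `predict_proba` returned for this observation.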

In [17]:
1/(1 + np.exp(-w @ x))

0.9819817031873619
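
The same computation can be applied to several observations at once. The following is a minimal sketch, assuming the rounded weights printed above are precise enough; the helper `logistic` and the names `X_two`, `X_b`, `probs`, and `classes` are introduced here for illustration.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

# Weights copied from the rounded values printed above, intercept first
w = np.array([-4.51879339e-05, -3.37236260e-02, 5.11698674e-01])

# First and last observations of the dataset
X_two = np.array([[35680, 2217], [18317, 1215]], dtype=float)

# Prepend a column of ones so the intercept is part of the dot product
X_b = np.hstack([np.ones((X_two.shape[0], 1)), X_two])

# Probability of class 1 for each row, then a 0.5 decision threshold
probs = logistic(X_b @ w)
classes = (probs >= 0.5).astype(int)
print(probs, classes)
```

Thresholding the probability at 0.5 is equivalent to checking the sign of the dot product, which is how the class decision is made for a binary logistic model.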

## Evaluation

We first evaluate the model on the training set.

In [18]:
print("Classification report for classifier on the training set:\n",
      metrics.classification_report(y, y_hat))

Classification report for classifier on the training set:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      1.00      1.00        15

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



Since an evaluation on the training set is overly optimistic, we use five-fold cross-validation instead.

In [19]:
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
scores

array([1., 1., 1., 1., 1.])

In [20]:
scores.mean()

1.0
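
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds, so each fold keeps the class proportions of the whole dataset. The sketch below only inspects the splits and does not train anything; it assumes an unshuffled `StratifiedKFold`, and the names `X_placeholder` and `fold_counts` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same labels as in the notebook: 15 observations per class
y = np.array([0] * 15 + [1] * 15)
# Placeholder features: only the number of rows matters for splitting
X_placeholder = np.zeros((len(y), 2))

skf = StratifiedKFold(n_splits=5)
fold_counts = []
for train_index, test_index in skf.split(X_placeholder, y):
    # Count the observations of each class in the test fold
    counts = np.bincount(y[test_index], minlength=2)
    fold_counts.append(tuple(int(c) for c in counts))
    print(len(train_index), len(test_index), counts)
```

Each of the five test folds holds six observations, three from each class, and the model is trained on the remaining 24.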

### Leave one out

We train on all the observations except one, which serves as the test set. We repeat this evaluation as many times as there are observations, holding out a different observation each time.

In [21]:
loo = LeaveOneOut()
predictions = 0
correct_predictions = 0
for train_index, test_index in loo.split(X):
    predictions += 1
    # Hold out one observation and train on the remaining ones
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    classifier.fit(X_train, y_train)
    # Compare the predicted class with the true class of the held-out observation
    if classifier.predict(X_test)[0] == y_test[0]:
        correct_predictions += 1
'Leave-one-out cross-validation accuracy: {}'.format(correct_predictions / predictions)

'Leave-one-out cross-validation accuracy: 0.9666666666666667'
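
The explicit loop can also be written more compactly by passing a `LeaveOneOut` object as the `cv` argument of `cross_val_score`. This is a sketch of an alternative formulation, not the notebook's own method; it should give the same kind of accuracy figure, though the exact value may depend on the scikit-learn version and solver settings, and the solver may emit convergence warnings on these unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

# Same data as in the notebook
X = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215]])
y = np.array([0] * 15 + [1] * 15)

classifier = LogisticRegression()

# One score per held-out observation: 1.0 if it is classified correctly, else 0.0
scores = cross_val_score(classifier, X, y, cv=LeaveOneOut(), scoring='accuracy')
print('Leave-one-out cross-validation accuracy: {}'.format(scores.mean()))
```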