<h1>
Unsupervised Machine Learning with One-Class Support Vector Machines (SVMs)
</h1>

In [2]:
%matplotlib inline

import numpy as np
import pandas as pd
from sklearn import utils
import matplotlib

data = pd.read_csv('common_sites_aqua_distance.csv', low_memory=False)
print('Number of rows in initial data set: {}'.format(len(data)))

# data.set_index('id')

Number of rows in initial data set: 1522


1. Extracting the relevant features in our data (which seems to be all of them so far except the id and coordinate fields)
<br>
2. Replace the data with a subset containing only the relevant features (again, which seems to be all of them until further changes)
<br>
3. Next we normalize the data with a log transform (adding 0.1 so that zero values stay finite). Compressing the feature ranges this way tends to improve accuracy and reduce numerical instability in the SVM

In [3]:
list(data)

['Country',
 'Latitude',
 'Longitude',
 'SiteID',
 'Elevation',
 'SiteName',
 'Distance matrix lakes_Distance',
 'Distance matrix rivers_Distance']

In [4]:
# extract features
relevant_features = [
    'Elevation',
    'Distance matrix lakes_Distance',
    'Distance matrix rivers_Distance'
]

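# replace data (copy so the assignments below don't write into a view of the original frame)
data = data[relevant_features].copy()

# normalize the data with a log transform; the 0.1 offset avoids log(0)
for feature in relevant_features:
    data[feature] = np.log((data[feature] + 0.1).astype(float))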

# print data
data

Unnamed: 0,Elevation,Distance matrix lakes_Distance,Distance matrix rivers_Distance
0,5.318610,0.649659,0.566866
1,4.821088,1.492799,-0.939658
2,-2.302585,1.360694,0.058422
3,-2.302585,0.644151,0.708386
4,4.852811,-0.299526,-1.425983
5,-2.302585,1.716673,-1.049898
6,5.075799,-0.281390,0.620001
7,-2.302585,0.012866,-1.195684
8,4.821088,1.479244,-0.553765
9,-2.302585,1.525927,-0.251569
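The repeated value -2.302585 in the output above is simply log(0.1), produced wherever the raw feature was 0. The following is a minimal sketch of that transform on made-up values, followed by an optional standardization step. The toy numbers and the use of `StandardScaler` are assumptions for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for two of the real columns
toy = pd.DataFrame({
    "Elevation": [0.0, 124.0, 204.0, 128.0],
    "Distance matrix rivers_Distance": [1.66, 0.29, 0.96, 1.93],
})

# Same transform as the notebook: the 0.1 offset keeps log() finite for zeros
logged = np.log(toy + 0.1)

# Optional extra step (an assumption, not done in the notebook):
# rescale each column to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(logged),
    columns=logged.columns,
)

print(logged.round(6))
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Standardizing after the log transform puts all features on a comparable scale, which matters for an RBF kernel because it works on distances between samples.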


<h2>Making our Data one-class</h2>
<br>
We're using a one-class SVM, so we need a single class.
<br>
All the data we have is positive, so every row belongs to
<br>
class 1 (site)

In [5]:
data['site'] = 1.0
data.head(5)

Unnamed: 0,Elevation,Distance matrix lakes_Distance,Distance matrix rivers_Distance,site
0,5.31861,0.649659,0.566866,1.0
1,4.821088,1.492799,-0.939658,1.0
2,-2.302585,1.360694,0.058422,1.0
3,-2.302585,0.644151,0.708386,1.0
4,4.852811,-0.299526,-1.425983,1.0


Extract the site column as the target for training and testing.
<br>
Since we're selecting only a single column from the 'data' dataframe, we get a Series, not a new DataFrame

In [6]:
target = data['site']
target

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
5       1.0
6       1.0
7       1.0
8       1.0
9       1.0
10      1.0
11      1.0
12      1.0
13      1.0
14      1.0
15      1.0
16      1.0
17      1.0
18      1.0
19      1.0
20      1.0
21      1.0
22      1.0
23      1.0
24      1.0
25      1.0
26      1.0
27      1.0
28      1.0
29      1.0
       ... 
1492    1.0
1493    1.0
1494    1.0
1495    1.0
1496    1.0
1497    1.0
1498    1.0
1499    1.0
1500    1.0
1501    1.0
1502    1.0
1503    1.0
1504    1.0
1505    1.0
1506    1.0
1507    1.0
1508    1.0
1509    1.0
1510    1.0
1511    1.0
1512    1.0
1513    1.0
1514    1.0
1515    1.0
1516    1.0
1517    1.0
1518    1.0
1519    1.0
1520    1.0
1521    1.0
Name: site, Length: 1522, dtype: float64
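The Series-versus-DataFrame distinction mentioned above can be checked directly. This is a small sketch on a hypothetical two-row frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the selection types
frame = pd.DataFrame({"Elevation": [5.3, 4.8], "site": [1.0, 1.0]})

as_series = frame["site"]     # single brackets: one column as a Series
as_frame = frame[["site"]]    # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

A Series is the usual shape for a target vector, which is why single-bracket selection is used for `target`.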

Find proportion? 
<br>
maybe. skip this for now

We drop the site column from the data frame so we can do unsupervised training with unlabelled data. We've already copied the site column into the target series, so we can compare against it later

In [8]:
data.drop(['site'], axis=1, inplace=True)
data.head(5)

Unnamed: 0,Elevation,Distance matrix lakes_Distance,Distance matrix rivers_Distance
0,5.31861,0.649659,0.566866
1,4.821088,1.492799,-0.939658
2,-2.302585,1.360694,0.058422
3,-2.302585,0.644151,0.708386
4,4.852811,-0.299526,-1.425983


Check the shape as a sanity check

In [9]:
data.shape

(1522, 3)

<h2>
Splitting the Data into Training and Test Sets
</h2>

In [10]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(data, target, train_size=0.8)

train_data.shape



(1217, 3)
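The split above is random, so the rows that land in each set change from run to run. Below is a minimal sketch of a repeatable split on synthetic data with the same shape as this notebook's frame. The synthetic values, the column names and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's data (1522 rows, 3 features)
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(1522, 3)), columns=["a", "b", "c"])
labels = pd.Series(np.ones(len(features)), name="site")

# random_state=42 is an arbitrary choice; any fixed integer makes the split repeatable
split_one = train_test_split(features, labels, train_size=0.8, random_state=42)
split_two = train_test_split(features, labels, train_size=0.8, random_state=42)

train_x, test_x, train_y, test_y = split_one
print(train_x.shape, test_x.shape)

# The same seed selects the same rows both times
print(train_x.index.equals(split_two[0].index))
```

With 1522 rows and `train_size=0.8`, the training set holds 1217 rows, which matches the shape printed above.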

<strong>nu</strong> - 'An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors'. It must be between 0 and 1. Basically this means the proportion of outliers we expect in our data.
<br>
<strong>kernel</strong> - the kernel type to be used. Setting kernel to something other than linear gives a non-linear decision boundary. The default is rbf (radial basis function).
<br>
<strong>gamma</strong> - a parameter of the RBF kernel that controls the influence of individual training samples, which affects the 'smoothness' of the model. A low value improves the smoothness and "generalizability" of the model, while a high value reduces it but makes the model "tighter-fitted" to the training data. Some experimentation is often required to find the best value.
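To see the effect of nu in practice, here is a minimal sketch that fits a one-class SVM to synthetic Gaussian points with two different nu values and measures the fraction of training points flagged as outliers. The synthetic data, the chosen nu values and `gamma="scale"` are assumptions for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in data: 500 points drawn from a 3-D standard normal
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))

flagged_fraction = {}
for nu in (0.05, 0.5):
    # gamma="scale" is an illustrative choice, not the notebook's setting
    model = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(points)
    # predict() returns -1 for points the model treats as outliers
    flagged_fraction[nu] = float(np.mean(model.predict(points) == -1))

for nu, fraction in flagged_fraction.items():
    print("nu={:.2f} -> fraction flagged as outliers: {:.3f}".format(nu, fraction))
```

The flagged fraction should track nu roughly, so a larger nu rejects a larger share of the training data.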


In [11]:
from sklearn import svm

# nu should be the proportion of outliers in our data, but we don't know it exactly
# example attributes (not applied here): nu=0.01, kernel='rbf', gamma=0.00005
# the model below is created with the default settings
model = svm.OneClassSVM()
model.fit(train_data)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, nu=0.5, random_state=None, shrinking=True, tol=0.001,
      verbose=False)

In [13]:
from sklearn import metrics

# predict() returns +1 for inliers and -1 for outliers
preds = model.predict(train_data)
targs = train_target

args = (targs, preds)

print("accuracy:  ", metrics.accuracy_score(*args))
print("precision: ", metrics.precision_score(*args))
print("recall:    ", metrics.recall_score(*args))
print("f1:        ", metrics.f1_score(*args))
# roc_auc_score needs both classes in the targets, so it stays disabled here
# print("area under curve (auc): ", metrics.roc_auc_score(*args))

accuracy:   0.49958915365653245
precision:  1.0
recall:     0.49958915365653245
f1:         0.6663013698630137
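
These numbers follow from the setup rather than from model quality. Every target is 1, so there are no false positives and precision is trivially 1.0, while accuracy and recall both equal the fraction of rows predicted as inliers. That fraction is close to 0.5, which is consistent with the default nu=0.5 shown in the model output above. The sketch below reproduces the arithmetic on ten made-up predictions; the arrays are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical example: all targets positive, half the predictions rejected
targets = np.ones(10, dtype=int)
preds = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

accuracy = metrics.accuracy_score(targets, preds)
precision = metrics.precision_score(targets, preds)
recall = metrics.recall_score(targets, preds)
f1 = metrics.f1_score(targets, preds)

# With precision fixed at 1, f1 reduces to 2 * recall / (1 + recall)
f1_by_hand = 2 * recall / (1 + recall)

print(accuracy, precision, recall, round(f1, 4), round(f1_by_hand, 4))
```

Applying the same formula to the recall reported above, 2 * 0.4996 / 1.4996, gives about 0.666, matching the printed f1.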
