# Ordinal Regression - Baseline

Two approaches to modelling categorical data are considered here:
- multinomial logistic regression
- ordinal regression

### Multinomial logistic regression
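If the dependent variable $Y$ takes one of several discrete values, it can be modelled with an extension of the well-known logistic regression model. With categories $c_1, \dots, c_C$ and $c_C$ taken as the reference category, each category $c_j$, $j = 1, \dots, C-1$, is modelled via

$$ \mathbb{P}(Y_i = c_j) = \frac{\exp(\beta_j X_i)}{1 + \sum_{k=1}^{C-1} \exp(\beta_k X_i)} $$

while the reference category receives the remaining probability mass, $\mathbb{P}(Y_i = c_C) = 1 / (1 + \sum_{k=1}^{C-1} \exp(\beta_k X_i))$.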
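
The reference-category form of the model can be sketched in a few lines of NumPy. This is only an illustration, not the sklearn implementation: the function name `multinomial_probs`, the coefficient matrix `B` and the demo arrays are hypothetical, and the last category is assumed to be the reference with all coefficients fixed to zero.

```python
import numpy as np

def multinomial_probs(X, B):
    """Class probabilities for a reference-category multinomial logit.

    X: (n_samples, n_features) design matrix.
    B: (n_classes - 1, n_features) coefficients; the last class is the
       reference and implicitly has all-zero coefficients (assumption).
    """
    scores = X.dot(B.T)                                  # (n, C-1)
    # The reference class has score 0, i.e. exp(0) = 1 in the denominator.
    scores = np.hstack([scores, np.zeros((X.shape[0], 1))])
    # Subtracting the row maximum does not change the ratios but avoids overflow.
    scores = scores - scores.max(axis=1, keepdims=True)
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical toy data: 2 samples, 2 features, 3 classes.
X_demo = np.array([[1.0, 0.5], [1.0, -2.0]])
B_demo = np.array([[0.2, 1.0], [-0.3, 0.4]])
P_demo = multinomial_probs(X_demo, B_demo)
print(P_demo.sum(axis=1))  # each row sums to 1
```

Each row of `P_demo` holds the probabilities of the three categories for one sample; the last column is the reference category.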

Both sklearn and MLlib implement this approach to some extent. For example, sklearn (in the version used here) can handle multinomial logistic regression, but only with the lbfgs or newton-cg solvers (no SGD) and only with L2 regularization.

However, multinomial logistic regression is not always the best model to choose. Consider, for example, the case of ratings. Here, the different categories represent ordinal values with a natural order. A multinomial logistic regression model ignores this order: a rating of '5' is treated as just another category, no closer to '4' than to '1'.
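
A small toy example may make the point concrete. The ratings below are made up for illustration. Plain accuracy, which treats categories as unordered, cannot distinguish a predictor that misses by one star from one that misses by four, whereas an order-aware measure such as the mean absolute error can.

```python
import numpy as np

# Hypothetical ratings on a 1-5 scale.
y_true = np.array([5, 5, 5, 5])
pred_near = np.array([4, 4, 5, 5])   # wrong predictions are off by one star
pred_far = np.array([1, 1, 5, 5])    # wrong predictions are off by four stars

def accuracy(y, p):
    # Fraction of exact matches; ignores how far off a wrong label is.
    return np.mean(y == p)

def mean_abs_error(y, p):
    # Average distance between labels; uses the ordering of the categories.
    return np.mean(np.abs(y - p))

print(accuracy(y_true, pred_near), accuracy(y_true, pred_far))              # 0.5 0.5
print(mean_abs_error(y_true, pred_near), mean_abs_error(y_true, pred_far))  # 0.5 2.0
```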

Links:
- <http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>
- <http://spark.apache.org/docs/latest/mllib-linear-methods.html>
- <http://de.slideshare.net/dbtsai/2014-0620-mlor-36132297>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy as sc
from sklearn.linear_model import LogisticRegression
import os
import pandas as pd
import itertools
from utils import loadCovertypeData

For a description of the dataset see <https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info>

In [2]:
# Loading the dataset
cache_path = os.path.join('..', 'cache')

df = loadCovertypeData(os.path.join(cache_path, 'covtype.csv'))

In [3]:
df.head()

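| Unnamed: 0 | Elevation | Aspect | Slope | HDHyrdo | VDHydro | HDRoadways | 9amHills | NoonHills | 3pmHills | HDFirePoints | ... | Soil_Type_32 | Soil_Type_33 | Soil_Type_34 | Soil_Type_35 | Soil_Type_36 | Soil_Type_37 | Soil_Type_38 | Soil_Type_39 | Soil_Type_40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |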


In [4]:
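# split into response and explanatory variables
respvar = ['Cover_Type']
# keep the original column order (a set difference would scramble it)
expvar = [col for col in df.columns if col not in respvar]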
X = df[expvar]
Y = df[respvar]

In [5]:
# Splitting into test and training sets
# (sklearn.cross_validation was replaced by sklearn.model_selection)
from sklearn.model_selection import train_test_split
itrain, itest = train_test_split(np.arange(X.shape[0]), train_size=0.7)

# boolean mask: True for training rows, False for test rows
mask = np.zeros(X.shape[0], dtype=bool)
mask[itrain] = True

X_train = np.array(X[mask])
X_test = np.array(X[~mask])
Y_train = np.array(Y[mask]).flatten()
Y_test = np.array(Y[~mask]).flatten()

In [6]:
# sklearn multinomial logistic regression
# a higher max_iter can improve accuracy; the default of 100 already needs ~1 min
clf = LogisticRegression(penalty='l2', multi_class='multinomial', solver='lbfgs', max_iter=200)

In [7]:
# fit the model on the training set
%time clf.fit(X_train, Y_train)

CPU times: user 1min 13s, sys: 21.7 s, total: 1min 35s
Wall time: 1min 5s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='multinomial',
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0)

For the sklearn implementation of the newton-cg / lbfgs solvers, see <https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/optimize.py>

In [8]:
# predict class labels for the test set
%time Y_pred = clf.predict(X_test)

CPU times: user 82.4 ms, sys: 101 ms, total: 184 ms
Wall time: 110 ms


In [9]:
# compute accuracy as the fraction of correctly predicted labels
accuracy = np.mean(Y_pred == Y_test)
print('accuracy for multinomial logistic regression: {}'.format(accuracy * 100))

accuracy for multinomial logistic regression: 64.4827427942
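
A single accuracy figure can hide large differences between classes, so it may be worth breaking it down per class. The sketch below uses small made-up arrays (`y_test_demo`, `y_pred_demo`) in place of `Y_test` and `Y_pred`; the overall value is the same quantity that `sklearn.metrics.accuracy_score` reports.

```python
import numpy as np

# Hypothetical toy labels standing in for Y_test and Y_pred.
y_test_demo = np.array([1, 2, 2, 5, 5, 5])
y_pred_demo = np.array([1, 2, 1, 5, 2, 5])

# Overall accuracy: fraction of exact matches.
overall = np.mean(y_test_demo == y_pred_demo)

# Per-class accuracy: among samples whose true label is `label`,
# the fraction that was predicted correctly.
per_class = {}
for label in np.unique(y_test_demo):
    in_class = (y_test_demo == label)
    per_class[int(label)] = float(np.mean(y_pred_demo[in_class] == label))

print(overall)
print(per_class)
```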


### Ordinal regression
To account for this missing order, one can use ordinal regression, a model that is not yet implemented in either sklearn or MLlib. In an ordinal regression model the categories satisfy (after reindexing if necessary) $c_1 \leq c_2 \leq \dots \leq c_C$.
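
One common way to build such a model is the cumulative-logit (proportional odds) formulation, $\mathbb{P}(Y_i \leq c_j) = \sigma(\theta_j - w X_i)$ with increasing thresholds $\theta_1 < \dots < \theta_{C-1}$. The sketch below assumes this formulation; it is not necessarily the variant pursued later, and the names `ordinal_probs`, `w_demo` and `theta_demo` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_probs(X, w, thresholds):
    """Category probabilities under a cumulative-logit model (assumed form).

    P(Y <= c_j | x) = sigmoid(theta_j - w.x) for j = 1, ..., C-1,
    with strictly increasing thresholds theta_j.
    """
    eta = X.dot(w)                                        # (n,)
    cum = sigmoid(thresholds[None, :] - eta[:, None])     # (n, C-1)
    # Pad with P(Y <= c_0) = 0 and P(Y <= c_C) = 1, then difference
    # adjacent cumulative probabilities to get P(Y = c_j).
    zeros = np.zeros((len(eta), 1))
    ones = np.ones((len(eta), 1))
    return np.diff(np.hstack([zeros, cum, ones]), axis=1)  # (n, C)

# Hypothetical toy data: one feature, four ordered categories.
X_ord = np.array([[0.0], [1.0], [3.0]])
w_demo = np.array([1.0])
theta_demo = np.array([0.5, 1.5, 2.5])
P_ord = ordinal_probs(X_ord, w_demo, theta_demo)
print(P_ord.argmax(axis=1))  # most likely category index per sample
```

Unlike the multinomial model, all categories share the single weight vector `w`; only the thresholds differ, which is how the ordering enters the model.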

Links:
- <http://arxiv.org/pdf/1408.2327v6.pdf>
- <http://www.stat.ufl.edu/~aa/ordinal/agresti_ordinal_tutorial.pdf>
- <http://onlinelibrary.wiley.com/book/10.1002/9780470594001>