# Measurement Scales for Machine Learning
## By Jeff Hale

Many machine learning guides and tutorials would benefit from clearer terminology around measurement scales. Data are most commonly divided into numeric and categorical, and time series data often gets its own treatment when present. But "categorical" is used to mean several different things. Often it means what statisticians call "nominal" data, but sometimes ordinal data is lumped in with nominal data and the whole lot is called "categorical". Sometimes all string data is assumed to be "categorical". I see upvoted Stack Overflow and Kaggle code snippets that impute missing values according to whether the data type is numeric or string.

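Choosing a treatment by storage type alone isn't correct, at least from a theoretical perspective. What matters is the measurement scale of the variable, not whether it happens to be stored as a number or a string.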

This kernel is my attempt to help create a more coherent lexicon for discussing measurement scales of data in machine learning. I hope it helps you encode ordinal and nominal data more quickly. But I must warn you that your head might spin when you see how many types of encoding are possible for nominal and ordinal data.

I won't go into the treatment of time series data, as Python's pandas library was well designed for its treatment and there are numerous online tutorials and courses on dealing with time series data in pandas for machine learning.

## Stevens' typology of measurement scales

Stevens proposed four [measurement scales](https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/) for data. His typology became extremely popular, especially in the social sciences.

* Ratio: equal spacing between values and a meaningful zero value (the mean makes sense)
* Interval: equal spacing between values, but no meaningful zero value (the mean makes sense)
* Ordinal: first, second, third values, but not necessarily equal spacing between first and second and between second and third (the median makes sense)
* Nominal: no numerical relationship between the different categories (mean and median are meaningless)
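
To make the typology concrete, here is a minimal sketch of which summary statistic is reasonable for each scale. The column names and values are made up for illustration, and using an ordered pandas categorical for the ordinal column is just one way to do it.

```python
import pandas as pd

# Hypothetical toy data: one column per measurement scale.
toy = pd.DataFrame({
    "weight_kg": [50.0, 60.0, 80.0],        # ratio: zero means "no weight"
    "temp_celsius": [10.0, 20.0, 30.0],     # interval: zero is arbitrary
    "size": ["small", "large", "medium"],   # ordinal: ordered labels
    "color": ["red", "blue", "red"],        # nominal: unordered labels
})

# Tell pandas about the order of the ordinal column.
toy["size"] = pd.Categorical(
    toy["size"], categories=["small", "medium", "large"], ordered=True
)

ratio_mean = toy["weight_kg"].mean()        # mean is meaningful
interval_mean = toy["temp_celsius"].mean()  # mean is meaningful

# For ordinal data, sort by the declared order and take the middle value.
sorted_sizes = toy["size"].sort_values()
median_size = sorted_sizes.iloc[len(sorted_sizes) // 2]

# For nominal data, only counts and the most frequent value make sense.
most_common_color = toy["color"].mode()[0]

print(ratio_mean, interval_mean, median_size, most_common_color)
```

Note that nothing stops you from calling `.mean()` on integer-coded colors. The scale is something you have to know about your data, not something pandas can infer.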

Some researchers have further expanded the number of measurement scales (see [Wikipedia](https://en.wikipedia.org/wiki/Level_of_measurement#Comparison)), but that isn't generally necessary in machine learning problems.

## Getting the data into machine learning format

Interval and ratio numeric data are the nice easy stuff to use in machine learning. I group ratio data under the interval label here as it generally doesn't really matter for machine learning whether it's true interval or ratio data. 

Ordinal data might come to you as an integer data type or as a string (object) data type. How you treat it should not depend on whether it comes to you as a number or a string. What matters is whether the ordinal data is close enough to interval data to treat it as interval data. Social scientists make this assumption all the time with Likert scales (e.g. On a scale from 1 to 7, 1 being extremely unlikely, 4 being neither likely nor unlikely and 7 being extremely likely, how likely are you to recommend this movie to a friend?). Here the difference between 3 and 4 and the difference between 6 and 7 can reasonably be assumed to be similar.
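
As a rough sketch of what "treat it as interval data" can look like in practice, the snippet below maps Likert-style string responses to evenly spaced integer scores. The five labels are hypothetical (the example above uses a 7-point scale), and the equal spacing is exactly the assumption you are choosing to make.

```python
import pandas as pd

# Hypothetical 5-point Likert labels, listed from lowest to highest.
likert_levels = [
    "extremely unlikely",
    "unlikely",
    "neither",
    "likely",
    "extremely likely",
]

responses = pd.Series(["likely", "neither", "extremely likely", "unlikely"])

# An ordered categorical stores each response as its position in likert_levels.
as_ordered = pd.Categorical(responses, categories=likert_levels, ordered=True)

# Shift the 0-based codes so the scores run from 1 to 5.
likert_scores = pd.Series(as_ordered.codes + 1, index=responses.index)

print(likert_scores.tolist())
print(likert_scores.mean())
```

If the same survey had arrived as the integers 1 to 5, you would end up in the same place, which is the point: the storage type shouldn't drive the decision.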

Nominal data might also come to you in numeric form, with numbers that stand for categories that have no numeric relationship to each other. That's fine. You just saved a step by not needing to transform it into numeric data, unless you want to play with some of the more advanced and less common encodings of nominal data we'll discuss below.

## One Hot Encoding

Ordinal data and nominal data are most commonly one-hot encoded. One-hot encoding consists of making a new binary variable for each distinct value in a column: if an observation has that value it gets a 1 in the new column, and otherwise it gets a 0. See the example below.

In [6]:
import numpy as np 
import pandas as pd  
import category_encoders as ce 
from sklearn.preprocessing import OneHotEncoder 

pd.options.display.max_columns = 50

In [7]:
# show package versions for reproducibility
!pip list

Package                            Version                   Location                           
---------------------------------- ------------------------- -----------------------------------
absl-py                            0.2.2                     
alabaster                          0.7.10                    
algopy                             0.5.7                     
altair                             2.1.0                     
anaconda-client                    1.6.5                     
anaconda-navigator                 1.6.9                     
anaconda-project                   0.8.0                     
annoy                              1.12.0                    
appdirs                            1.4.3                     
arrow                              0.12.1                    
asn1crypto                         0.22.0                    
astor                              0.7.1                     
astroid                            1.5.3        

In [8]:
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "texture": ['rough', "rough", "smooth", "slimy"]
})  
print(df)
print("")

# make a copy for use later
df2 = df.copy()
print(df2)

   color texture
0    red   rough
1   blue   rough
2    red  smooth
3  green   slimy

   color texture
0    red   rough
1   blue   rough
2    red  smooth
3  green   slimy


Let's one-hot encode it with sklearn's default OneHotEncoder().

In [9]:
enc = OneHotEncoder()
df = enc.fit_transform(df)

print(df)

  (0, 2)	1.0
  (0, 3)	1.0
  (1, 0)	1.0
  (1, 3)	1.0
  (2, 2)	1.0
  (2, 5)	1.0
  (3, 1)	1.0
  (3, 4)	1.0


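Note that you did not get a data frame back. By default, sklearn's OneHotEncoder returns a SciPy sparse matrix, which prints as (row, column) positions followed by the stored value. Compare that with the OneHotEncoder from category_encoders: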
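
If you want to stay with sklearn's encoder and still end up with a labeled data frame, one possible approach is to densify the sparse result and build column names from the fitted `categories_` attribute. This is only a sketch: the `column_level` naming pattern is my own choice, and `toarray()` is only sensible when the result is small enough to fit in memory as a dense array.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

raw = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "texture": ["rough", "rough", "smooth", "slimy"],
})

enc = OneHotEncoder()
sparse_result = enc.fit_transform(raw)  # SciPy sparse matrix by default

# categories_ holds the sorted levels found in each input column.
column_names = [
    f"{col}_{level}"
    for col, levels in zip(raw.columns, enc.categories_)
    for level in levels
]

# Densify and wrap in a data frame (fine for small data only).
encoded_df = pd.DataFrame(
    sparse_result.toarray(), columns=column_names, index=raw.index
)
print(encoded_df)
```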

In [10]:
enc = ce.OneHotEncoder()
df2 = enc.fit_transform(df2)

print(df2)

   color_1  color_2  color_3  color_-1  texture_1  texture_2  texture_3  \
0        1        0        0         0          1          0          0   
1        0        1        0         0          1          0          0   
2        1        0        0         0          0          1          0   
3        0        0        1         0          0          0          1   

   texture_-1  
0           0  
1           0  
2           0  
3           0  


Note how the category-encoders version returns a data frame. Nice! 

See how each observation that had the value *red* in the *color* column now has a 1 in the *color_1* column, and all other observations have a 0 in that column. The same pattern repeats for the other colors in the other new color columns. The *color_-1* column has no 1s in it because no values were missing in the original *color* column.

One-hot encoding is also called dummy encoding, and pandas has the `get_dummies` function you can use as an alternative to sklearn's OneHotEncoder. There are some differences between the two. Most notably, sklearn's fits into pipelines easily, so I prefer to use sklearn's (or rather the sklearn-style version provided by category_encoders, so I can get a data frame back easily).
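
For comparison, here is a small sketch of the pandas route on the same toy data. I pass `dtype=int` so the output is 0/1 integers regardless of pandas version, and I show `drop_first=True` as well, since dropping one level per variable is what statisticians often mean by dummy coding in the strict sense.

```python
import pandas as pd

raw = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "texture": ["rough", "rough", "smooth", "slimy"],
})

# One column per level, named <column>_<level>.
dummies = pd.get_dummies(raw, columns=["color", "texture"], dtype=int)

# Drop the first level of each variable to avoid redundant columns.
reduced = pd.get_dummies(
    raw, columns=["color", "texture"], dtype=int, drop_first=True
)

print(dummies)
print(reduced)
```

One thing to keep in mind with this route: `get_dummies` has no fit step, so it doesn't remember the levels it saw. If your test set is missing a level, or has a new one, the columns won't line up with the training set unless you handle that yourself.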

One-hot encoding is the most popular method for encoding categorical data, but it can have some serious drawbacks with high-cardinality columns:
1. Memory use can skyrocket.
2. Curse of dimensionality: models overfit too easily in some cases, resulting in poor performance.
3. Lots of sparse data doesn't work well with decision-tree-based algorithms.

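You can use PCA to reduce the one-hot encoded columns to a limited number of components, which helps avoid the curse of dimensionality on small data sets with large numbers of nominal variables. Strictly speaking, PCA builds new features as combinations of the original columns rather than selecting a subset of them. Regardless, one-hot encoding might not be the best way to treat your ordinal or nominal data.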
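
Here is a minimal sketch of that idea on made-up data: two high-cardinality nominal columns are one-hot encoded into 80 columns and then compressed to 10 principal components. The column names, cardinalities, and the choice of 10 components are all arbitrary assumptions for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: 50 distinct cities and 30 distinct products.
n_rows = 200
toy = pd.DataFrame({
    "city": [f"city_{i % 50}" for i in range(n_rows)],
    "product": [f"product_{(i * 7) % 30}" for i in range(n_rows)],
})

# One-hot encoding produces one column per level: 50 + 30 = 80 columns.
one_hot = pd.get_dummies(toy, columns=["city", "product"], dtype=float)

# Project the 80 sparse columns onto 10 dense components.
pca = PCA(n_components=10, random_state=0)
components = pca.fit_transform(one_hot)

print(one_hot.shape, components.shape)
print(pca.explained_variance_ratio_.sum())
```

The printed explained variance gives a rough idea of how much information the 10 components keep. The components are no longer interpretable as individual categories, which is the price of the compression.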

## Helmert Contrast anyone? Options for encoding ordinal and nominal data

It seems most folks in the sklearn community stop at one-hot encoding. But there are a lot of different ways to encode ordinal and nominal data, and they perform very differently across different data sets and algorithms.

This [article](https://stats.idre.ucla.edu/stata/webbooks/reg/chapter5/regression-with-statachapter-5-additional-coding-systems-for-categorical-variables-in-regressionanalysis/) explains some of the many options, in addition to one-hot encoding, for encoding nominal and ordinal data. The code samples are for the Stata statistical package.

Many of these methods for handling ordinal and nominal data are available in Python through [statsmodels](http://www.statsmodels.org/devel/contrasts.html) and the [patsy](http://patsy.readthedocs.io/en/latest/API-reference.html#handling-categorical-data) package. Will McGinnis brought lots of these methods of encoding ordinal and nominal data to sklearn with his excellent package [Category Encoders](http://contrib.scikit-learn.org/categorical-encoding/).

So what's possible with Category Encoders? From the docs:

> A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques. Currently implemented are:
> 
> * Ordinal
> * One-Hot
> * Binary
> * Helmert Contrast
> * Sum Contrast
> * Polynomial Contrast
> * Backward Difference Contrast
> * Hashing
> * BaseN
> * LeaveOneOut
> * Target Encoding
> 
> The ordinal, one-hot, and hashing encoders have similar equivalents in the existing scikit-learn version, but the transformers in this library all share a few useful properties:
> 
> * First-class support for pandas dataframes as an input (and optionally as output)
> * Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type
> * Can drop any columns with very low variance based on training set optionally
> * Portability: train a transformer on data, pickle it, reuse it later and get the same thing out.
> * Full compatibility with sklearn pipelines, input an array-like dataset like any other transformera

If you want more explanation on the different encoding schemes, read [this UCLA article](https://stats.idre.ucla.edu/stata/webbooks/reg/chapter5/regression-with-statachapter-5-additional-coding-systems-for-categorical-variables-in-regressionanalysis/) also referenced above.
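
To give a feel for what a contrast coding scheme actually does, here is a hand-rolled sketch of Helmert coding in plain NumPy and pandas. It is not the Category Encoders implementation. It follows one common convention (the one R's `contr.helmert` uses, which some references call reverse Helmert), and the helper name `helmert_matrix` and the output column names are my own.

```python
import numpy as np
import pandas as pd

def helmert_matrix(n_levels):
    """Build an n_levels x (n_levels - 1) Helmert contrast matrix."""
    contrast = np.zeros((n_levels, n_levels - 1))
    for j in range(1, n_levels):
        contrast[:j, j - 1] = -1.0      # all earlier levels get -1
        contrast[j, j - 1] = float(j)   # level j gets +j
    return contrast

levels = ["blue", "green", "red"]  # the level order is a choice you make
colors = pd.Series(["red", "blue", "red", "green"])

contrast = helmert_matrix(len(levels))

# Look up each observation's row of the contrast matrix by its level code.
codes = pd.Categorical(colors, categories=levels).codes
encoded = pd.DataFrame(
    contrast[codes], columns=["color_h1", "color_h2"], index=colors.index
)

print(contrast)
print(encoded)
```

With k levels you get k - 1 columns instead of k, and each column of the contrast matrix sums to zero. In a regression, each column then compares one level with the average of the levels before it.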

There are even more alternatives, including using Neural Networks to encode categorical data as discussed in this study: [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737).

If a column is ordinal but can't reasonably be assumed to be approximately interval, then it might make sense to make a new binary variable and split the data with sklearn's Binarizer so that high values receive a 1 and low values receive a 0.  You could also bin the values into a few numeric categories and try treating those as interval data.
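
Both options can be sketched in a few lines. The 1-to-7 rating column, the threshold of 4, and the bin edges below are hypothetical choices; in practice you would pick them based on what the values mean.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

# Hypothetical ordinal ratings on a 1-to-7 scale.
ratings = pd.DataFrame({"rating": [1, 2, 3, 4, 5, 6, 7]})

# Binarizer maps values strictly greater than the threshold to 1, others to 0.
binarizer = Binarizer(threshold=4)
high_flags = np.asarray(binarizer.fit_transform(ratings[["rating"]]))
ratings["high"] = high_flags.ravel().astype(int)

# Alternatively, bin into low / middle / high groups coded 0, 1, 2.
# Bins are right-inclusive: (0, 2], (2, 5], (5, 7].
ratings["bin"] = pd.cut(
    ratings["rating"], bins=[0, 2, 5, 7], labels=[0, 1, 2]
).astype(int)

print(ratings)
```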

Although it's beyond the scope of this article, it would be valuable for future research to benchmark the many options for treating ordinal and nominal data across many different datasets and algorithms to create better guidelines.

## Action plan

First make a list of each variable's dtype and measurement scale. I recommend doing this in a Google Sheet where you can include more relevant information to help you understand the variable and generate feature creation ideas. This takes some time, but is well worth it.
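
If you'd like a head start on that list, something like the sketch below can generate the dtype, unique count, and missing count for each column. The measurement scale still has to be filled in by hand. The data frame and the scale labels here are made up for illustration.

```python
import pandas as pd

# Hypothetical data set with one missing price.
data = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["small", "large", "medium", "small"],
    "price": [1.5, None, 3.0, 2.25],
})

# One row per variable, with facts pandas can compute for you.
inventory = pd.DataFrame({
    "dtype": data.dtypes.astype(str),
    "n_unique": data.nunique(),
    "n_missing": data.isna().sum(),
})

# The measurement scale is your judgment call, entered manually.
inventory["scale"] = pd.Series(
    {"color": "nominal", "size": "ordinal", "price": "ratio"}
)

print(inventory)
```

From there, `inventory.to_csv(...)` gives you a file you can import into a spreadsheet and annotate further.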

Once you've figured out which variables are ordinal and nominal, I suggest imputing missing data. You can read my brief guide on options for imputation [here](https://www.kaggle.com/discdiver/imputing-values-for-machine-learning/).

Then you need to decide how to encode your ordinal and nominal data. 

If the ordinal data is close to interval data, I'd just code it as interval data and be done with it. If it isn't interval enough, then traditionally you would one-hot encode it. That seems like it should be one of the schemes you try.

Is it worth the effort, and is it reasonable, to try a bunch of different encoding options for ordinal and nominal data? If you have a lot of ordinal and nominal data and the time available, it well might be. McGinnis's [benchmark](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/) of one-hot, Helmert, binary, sum, ordinal, backward-difference, and polynomial encoding on three different classification problems showed wide variation in encoding method performance across data sets.
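
If you want to run a small comparison of your own, one way to set it up is to put each candidate encoder in a pipeline and cross-validate. The sketch below is not McGinnis's benchmark. It uses synthetic data, only the two encoders that ship with sklearn, and a logistic regression chosen arbitrarily as the model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic data: the target depends on both categorical columns.
rng = np.random.default_rng(0)
n_rows = 300
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n_rows),
    "size": rng.choice(["small", "medium", "large"], size=n_rows),
})
y = ((X["size"] == "large") | (X["color"] == "red")).astype(int)

# Candidate encoders to compare; add others with the same interface.
candidates = {
    "one_hot": OneHotEncoder(handle_unknown="ignore"),
    "ordinal": OrdinalEncoder(),
}

scores = {}
for name, encoder in candidates.items():
    preprocess = ColumnTransformer([("cat", encoder, ["color", "size"])])
    model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```

Because the encoder lives inside the pipeline, it is refit on each training fold, so nothing leaks from the validation fold. Encoders from Category Encoders follow the same fit/transform interface, so they should be able to slot into the `candidates` dictionary in the same way.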

Hopefully this post will help you consider the measurement scales of your data and the resulting encoding options for ordinal and nominal features. I'd love to hear what encoding methods you've found most valuable.