# Module 8 Lab 3 - Scikit-learn preprocessing for itemization

Sometimes the data you have are not in a format conducive to machine learning, or to a particular algorithm that you wish to use.  Common preprocessing steps include itemization of continuous features (e.g. age, a continuous value, can be bucketed by age range).  You may also need to represent categorical values as dummy variables (common for regressions), or convert continuous values into a binary variable, such as above or below a threshold.  Finally, some data may be textual, and to process this kind of data you may need to convert it into a numeric representation of some kind.

We will use the breast cancer dataset to illustrate these concepts.

In [1]:
import sklearn.datasets as d
from sklearn import preprocessing
import numpy as np
import pandas as pd

In [2]:
bc = d.load_breast_cancer()
data = pd.DataFrame(bc.data, columns = bc.feature_names)

## Creating a binary feature from continuous data

Often you will have data that are continuous in nature, like some measurement, that can take on an infinite or large number of possible values.  Think about numbers with decimal points, such as temperature, or even integer values with a large range like annual salary.  It can be useful to treat this data in a binary fashion, i.e. True or False, Above or Below, High or Low, etc.

Transforming a continuous feature into a binary feature is straightforward.  You first need to know the cutoff point that separates one value from the other.  This can be obtained from subject matter experts or similar sources.  You could also split on the mean, median, mode, or any other central measure, although the value of doing that will depend on your goals.

Below, we will binarize a column by splitting on the mean of the feature.  The [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) class expects an array of values, so we perform some transformation of the pandas Series into an array using `data['mean radius'].values.reshape(-1,1)`.  `.values` gives us a one dimensional array and `.reshape(-1,1)` turns it into the 2d array form that the Binarizer class expects.

In [3]:
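# threshold is passed by keyword; recent scikit-learn releases do not accept it positionally
binarized = preprocessing.Binarizer(threshold=data['mean radius'].mean()).fit_transform(data['mean radius'].values.reshape(-1,1))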
data['mean radius bin'] = binarized
print(data['mean radius'].mean())
display(data[['mean radius bin', 'mean radius']])

14.127291739894552


Unnamed: 0,mean radius bin,mean radius
0,1.0,17.99
1,1.0,20.57
2,1.0,19.69
3,0.0,11.42
4,1.0,20.29
...,...,...
564,1.0,21.56
565,1.0,20.13
566,1.0,16.60
567,1.0,20.60
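
The thresholding rule itself is simple enough to write by hand. The sketch below uses a small made-up sample of radius values (an assumption for illustration, not the full dataset) and compares `Binarizer` against a plain NumPy comparison. Values strictly greater than the threshold map to 1, and everything else maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical sample of radius measurements (not the full dataset).
radius = np.array([17.99, 20.57, 11.42, 12.45, 13.0])
threshold = radius.mean()

# Binarizer expects a 2-D array: one row per sample, one column per feature.
binarizer = Binarizer(threshold=threshold)
by_sklearn = binarizer.fit_transform(radius.reshape(-1, 1)).ravel()

# The same rule written by hand: strictly greater than the threshold maps to 1.
by_hand = (radius > threshold).astype(float)

print(by_sklearn)
print(by_hand)
```

Both arrays should agree, which is a quick way to convince yourself what the transformer is doing before applying it to a real column.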


## Create a categorical feature from continuous data (discretization)
This process is similar to binarizing; however, instead of just two possible values, we will bucket the continuous variable into multiple categories.  This technique is useful to reduce the number of possible values your learning models have to deal with.  Some algorithms, like association rule mining, will only work with discretized data.

There are a couple of approaches that can be taken.  The first approach is to dump the values into a set number of bins.  This is a straightforward approach that requires you to choose how many bins you want, the desired composition of the bins, and the encoding method.  We will use the [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).

Below, we will create 3 bins of equal range using the `strategy='uniform'` (based on the range of data: min to max).  We also specify ordinal output, so that the new bins are numbered and encoded in a single column.  With these parameters, we are essentially creating a histogram.

We do the same reshaping process as above.  Then we add the data back to our pandas data frame, group by the new column, and get a count of each of the bins. 

In [4]:
num_bins = 3

discrete = preprocessing.KBinsDiscretizer(n_bins=num_bins, encode='ordinal', strategy='uniform').fit_transform(data['mean radius'].values.reshape(-1,1))

data['mean radius ordinal'] = discrete
data.groupby('mean radius ordinal')['mean radius ordinal'].count()

mean radius ordinal
0.0    338
1.0    209
2.0     22
Name: mean radius ordinal, dtype: int64
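
To make the "equal range" idea concrete, here is a minimal NumPy-only sketch of uniform binning on a few made-up values (the values and variable names are assumptions for illustration). It is not the library's implementation, only the same idea: split the span from min to max into equal-width intervals and count what lands in each.

```python
import numpy as np

# Hypothetical values, chosen so the bin edges come out as round numbers.
x = np.array([6.0, 8.0, 11.0, 13.0, 14.0, 20.0, 27.0])
n_bins = 3

# Equal-width edges from the minimum to the maximum of the data.
edges = np.linspace(x.min(), x.max(), n_bins + 1)

# Assign each value to a bin using only the interior edges,
# so the minimum lands in bin 0 and the maximum in the last bin.
codes = np.digitize(x, edges[1:-1])

# Count how many values fell into each bin (the histogram).
counts = np.bincount(codes, minlength=n_bins)

print(edges)   # bin edges
print(codes)   # ordinal bin number for each value
print(counts)  # number of values per bin
```

Because the edges depend only on the minimum and maximum, a few extreme values can leave the upper bins nearly empty, which is what the counts above show for `mean radius`.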

Now let's try the other strategies to see the difference in bin sizes.

Quantile will create bins that contain the same number of data points (as close as possible).

KMeans will use 1-dimensional clustering to produce the bins, so values that are closer together will be in one bin, and values that are farther apart will be in separate bins.

In [5]:
discrete = preprocessing.KBinsDiscretizer(n_bins=num_bins, encode='ordinal', strategy='quantile').fit_transform(data['mean radius'].values.reshape(-1,1))

data['mean radius quantile'] = discrete
display(data.groupby('mean radius quantile')['mean radius quantile'].count())

discrete = preprocessing.KBinsDiscretizer(n_bins=num_bins, encode='ordinal', strategy='kmeans').fit_transform(data['mean radius'].values.reshape(-1,1))

data['mean radius kmeans'] = discrete
display(data.groupby('mean radius kmeans')['mean radius kmeans'].count())

mean radius quantile
0.0    189
1.0    190
2.0    190
Name: mean radius quantile, dtype: int64

mean radius kmeans
0.0    249
1.0    212
2.0    108
Name: mean radius kmeans, dtype: int64

As you can see, each strategy produces very different results.  With uniform, every bin spans the same range, so the counts vary widely.  With quantile, each bin holds as close to the same number of points as possible, so it is the range of each bin that changes.  With KMeans, the data are clustered around 3 centers on a 1-d line, so both the counts and the bin ranges vary.
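
One way to see why the counts differ is to look at the bin edges each strategy learns. The sketch below fits all three strategies on a synthetic, right-skewed sample (the random data is an assumption for illustration) and reads the fitted `bin_edges_` attribute. Depending on your scikit-learn version, the quantile strategy may emit a warning about its default settings.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic right-skewed data standing in for a real feature (assumption).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.5, sigma=0.4, size=300).reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes = est.fit_transform(x).ravel().astype(int)
    results[strategy] = {
        "edges": est.bin_edges_[0],                  # learned edges for the single feature
        "counts": np.bincount(codes, minlength=3),   # points per bin
    }
    print(strategy, np.round(results[strategy]["edges"], 2), results[strategy]["counts"])
```

The uniform edges should be evenly spaced, while the quantile edges move to wherever they need to be to balance the counts.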


## Create a categorical feature from continuous data (discretization) with custom intervals
KBinsDiscretizer is powerful, but it will not accept custom intervals for creating bins.  If you have some domain knowledge about your continuous data and want to discretize it using that knowledge, then you can use the Pandas cut method.

Suppose we have some knowledge that a `mean radius` <= 11 is "small", <= 14 is "medium", and everything else is "large".  We want to discretize on these specific values.  Pandas [cut](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html) will allow us to specify these cut points.

`bins=[0, 11, 14, 99999999]` defines the edges of our data.  We start with zero since a growth radius cannot be negative.  Next is the top range for our first category.  We've defined the first category as (> 0 and <= 11).  The second category is (> 11 and <= 14), and the last category is (> 14 and <= 99999999).  We use a suitably large value to cover the "everything else" case.  We then give a label to each of these ranges using the `labels` parameter.

In [6]:
data['mean radius custom'] = pd.cut(data['mean radius'], bins=[0, 11, 14, 99999999], labels=[0, 1, 2])
data[['mean radius custom', 'mean radius']].head(30)

Unnamed: 0,mean radius custom,mean radius
0,2,17.99
1,2,20.57
2,2,19.69
3,1,11.42
4,2,20.29
5,1,12.45
6,2,18.25
7,1,13.71
8,1,13.0
9,1,12.46
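
As a variation on the same idea, the sketch below uses `np.inf` as the open-ended upper edge and text labels instead of integers. The sample values are made up to sit on and around the cut points, so you can see that each interval includes its right edge.

```python
import numpy as np
import pandas as pd

# Hypothetical radius values placed on and around the cut points.
radius = pd.Series([10.5, 11.0, 11.42, 14.0, 17.99])

# np.inf covers the "everything else" case without picking an arbitrary large number.
sizes = pd.cut(radius, bins=[0, 11, 14, np.inf], labels=["small", "medium", "large"])

print(sizes.tolist())
print(sizes.cat.codes.tolist())  # the underlying integer codes, in label order
```

Note that 11.0 falls in "small" and 14.0 in "medium", because `pd.cut` treats intervals as right-inclusive by default.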


## Create dummy variables from categorical data
Continuing with the KBinsDiscretizer class, we will now investigate the other options for the `encode` parameter.  Previously we looked at `ordinal`, which output a single column.  Now we will examine the "one hot" encoding options.

The term "one hot" has its roots in digital circuitry, in which only one of a set of digital lines has voltage applied in order to indicate some input or output.  Translated to data, we will get multiple binary columns to represent ordinal values, as in the below image.

<img src="../resources/one-hot-example.jpg" alt="Example of one-hot encoding" title="Example of one-hot encoding" />

One-hot encoding is a necessary format for many types of machine learning algorithms.  Categorical data often can't be used directly in machine learning.  Primarily this is because numeric category codes do not have mathematical properties like less than or greater than (although you can create categorical data where order is important: ordinal).  Also, it does not make sense to add, subtract, multiply, or divide categorical data.  There are a few algorithms that can use some types of categorical data directly (decision trees can use ordinal categoricals directly); however, most machine learning algorithms will require one-hot encoding.

Let's see what that will look like with our sample data.  We will create a one-hot dense encoding.  Dense encoding is used in machine learning and data mining techniques such as artificial neural networks and association rule mining.

For comparison we will append our previously created ordinal representation and the original data.



In [7]:
discrete = preprocessing.KBinsDiscretizer(n_bins=num_bins, encode='onehot-dense', strategy='uniform').fit_transform(data['mean radius'].values.reshape(-1,1))

mean_radius_onehot = pd.DataFrame(discrete)
mean_radius_onehot['mean radius ordinal'] = data['mean radius ordinal']
mean_radius_onehot['mean radius'] = data['mean radius']
display(mean_radius_onehot.head(10))

Unnamed: 0,0,1,2,mean radius ordinal,mean radius
0,0.0,1.0,0.0,1.0,17.99
1,0.0,1.0,0.0,1.0,20.57
2,0.0,1.0,0.0,1.0,19.69
3,1.0,0.0,0.0,0.0,11.42
4,0.0,1.0,0.0,1.0,20.29
5,1.0,0.0,0.0,0.0,12.45
6,0.0,1.0,0.0,1.0,18.25
7,1.0,0.0,0.0,0.0,13.71
8,1.0,0.0,0.0,0.0,13.0
9,1.0,0.0,0.0,0.0,12.46
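
The link between an ordinal code and its one-hot row can be sketched with plain NumPy: each code simply selects one row of an identity matrix. The codes below are made up for illustration.

```python
import numpy as np

# Hypothetical ordinal bin numbers for six rows.
ordinal = np.array([1, 1, 1, 0, 1, 0])
n_categories = 3

# Row i of the identity matrix has a single 1 in column i,
# so indexing by the codes yields the one-hot rows.
one_hot = np.eye(n_categories)[ordinal]

print(one_hot)
```

Every row sums to 1, and the position of the 1 is the ordinal code.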


When using one-hot encoded variables in statistical regression, you must drop one of the dummy variables.  This is commonly the first one.  The above table illustrates why k-1 encoding is necessary in regression.  Recall that one of the assumptions for regression is that there is no multicollinearity among the independent variables.  From the table above we can see that columns 1 and 2 together determine the value of column 0 (when columns 1 and 2 are both zero, column 0 must be one; otherwise it is zero), which violates this assumption.

We can modify our data frame easily enough with the drop method.  

In [8]:
mean_radius_onehot.drop(0, axis='columns').head(10)

Unnamed: 0,1,2,mean radius ordinal,mean radius
0,1.0,0.0,1.0,17.99
1,1.0,0.0,1.0,20.57
2,1.0,0.0,1.0,19.69
3,0.0,0.0,0.0,11.42
4,1.0,0.0,1.0,20.29
5,0.0,0.0,0.0,12.45
6,1.0,0.0,1.0,18.25
7,0.0,0.0,0.0,13.71
8,0.0,0.0,0.0,13.0
9,0.0,0.0,0.0,12.46


We can see that when `mean radius ordinal` is zero, all of the remaining one-hot encoded columns are also zero.  This condition implies that the categorical value is `0`.  We don't need the column we dropped to make this inference, so we've resolved the multicollinearity violation without losing any information.

Comparing the two tables, the ordinal value tells you which of the columns `0`, `1`, and `2` holds the "on" value of 1.  These columns are mutually exclusive, so each row only ever represents one category.
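
The redundancy can also be checked numerically. The sketch below assumes a regression with an intercept column (an assumption; the lab does not fit a model) and compares the matrix rank of the design matrix with all k dummy columns against the k-1 version. The codes are made up for illustration.

```python
import numpy as np

# Hypothetical category codes for seven rows, three categories.
codes = np.array([0, 1, 2, 1, 0, 2, 1])
one_hot = np.eye(3)[codes]
intercept = np.ones((len(codes), 1))

# Design matrix with the intercept and all three dummy columns.
full = np.hstack([intercept, one_hot])
# Design matrix with the first dummy column dropped (k-1 encoding).
reduced = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # more columns than rank: redundant
print(reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```

With all three dummies the columns sum to the intercept, so the rank is one less than the number of columns. Dropping one dummy removes that exact linear dependence.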

## Create a categorical feature from labeled data
Sometimes you will get data that has been labeled for you.  The labels may be text or numeric.  To perform machine learning, it is necessary to convert the text data into categorical data, and then possibly into dummy variables as we already discussed.

This will not be a full-fledged introduction to text mining, but rather a simple text to numeric category conversion.

For this lab we will be using a dataset that contains text-based categorical data, the [NCHS - Leading Causes of Death: United States](https://catalog.data.gov/dataset/age-adjusted-death-rates-for-the-top-10-leading-causes-of-death-united-states-2013), which contains the top 10 leading causes of death in the US from 1999 to 2016.  There are three text-based categorical columns in the data:
* 113 Cause Name - The cause of death from the [CDC's list of 113 causes of death](https://www.cdc.gov/nchs/data/dvs/im9_2002.pdf.pdf)
* Cause Name - a shortened cause of death
* State - The state to which the data apply


In [9]:
data = pd.read_csv('../resources/NCHS_-_Leading_Causes_of_Death__United_States.csv', delimiter=',')

data.head()

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2755,55.5
1,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,439,63.1
2,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4010,54.2
3,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1604,51.8
4,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13213,32.0


We will utilize the [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) class from sklearn.  We are going to save the instantiated OrdinalEncoder class so that we can extract the categories from it after transforming our data.  We will also cast the numpy array results back into a Pandas DataFrame.

First, we'll just pass our whole data set into the encoder.  This is probably a terrible idea, because some of our data is not text based.  Let's see what we get.

In [10]:
encoder = preprocessing.OrdinalEncoder()
new_data = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)
display(new_data.head())

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,17.0,0.0,10.0,0.0,2318.0,512.0
1,17.0,0.0,10.0,1.0,410.0,583.0
2,17.0,0.0,10.0,2.0,2895.0,499.0
3,17.0,0.0,10.0,3.0,1472.0,475.0
4,17.0,0.0,10.0,4.0,4336.0,277.0


Wow, looks like we got every column encoded.  Not terribly useful...and potentially misleading.  The `Deaths` column data actually looks legitimate at first glance, and one might mistakenly think it is a count of deaths rather than a categorical representation of death count.  

We can see the categories assigned by examining a property on the encoder.  The position in the category array equals the category number we find in our transformed dataset.

In [11]:
display(encoder.categories_)

[array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
        2010, 2011, 2012, 2013, 2014, 2015, 2016], dtype=object),
 array(['Accidents (unintentional injuries) (V01-X59,Y85-Y86)',
        'All Causes', "Alzheimer's disease (G30)",
        'Cerebrovascular diseases (I60-I69)',
        'Chronic lower respiratory diseases (J40-J47)',
        'Diabetes mellitus (E10-E14)',
        'Diseases of heart (I00-I09,I11,I13,I20-I51)',
        'Influenza and pneumonia (J09-J18)',
        'Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)',
        'Malignant neoplasms (C00-C97)',
        'Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27)'],
       dtype=object),
 array(['All causes', "Alzheimer's disease", 'CLRD', 'Cancer', 'Diabetes',
        'Heart disease', 'Influenza and pneumonia', 'Kidney disease',
        'Stroke', 'Suicide', 'Unintentional injuries'], dtype=object),
 array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
        'Colo

OK, so let's just be specific, and encode only the text-based columns that we care about.  At the same time, we can control the resulting data type.  The default is float64, but we don't really need decimal places.

In [12]:
# These are the columns we want to encode.
text_columns = ['113 Cause Name', 'Cause Name']
# Create a new encoder: the existing one is already fitted, and we want to refit it with an integer output type.
encoder = preprocessing.OrdinalEncoder(dtype=np.int32)
new_data = pd.DataFrame(encoder.fit_transform(data[text_columns]), columns=text_columns)
# Join the encoded columns back to the original.  rsuffix appends the given text to the column names from new_data to avoid duplicate column names.
data = data.join(new_data, rsuffix=' Enc')
display(data.head(30))


Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate,113 Cause Name Enc,Cause Name Enc
0,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2755,55.5,0,10
1,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,439,63.1,0,10
2,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4010,54.2,0,10
3,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1604,51.8,0,10
4,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13213,32.0,0,10
5,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Colorado,2880,51.2,0,10
6,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Connecticut,1978,50.3,0,10
7,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Delaware,516,52.4,0,10
8,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,District of Columbia,401,58.3,0,10
9,2016,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Florida,12561,54.9,0,10
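
Instead of listing the text columns by hand, one option is to let pandas pick out the non-numeric columns. This is a sketch on a tiny made-up frame (column names borrowed from the dataset, values invented), and it would also select `State`, which the cell above deliberately leaves out.

```python
import pandas as pd

# Tiny made-up frame with a mix of numeric and text columns.
frame = pd.DataFrame({
    "Year": [2016, 2016, 2015],
    "Cause Name": ["Stroke", "Cancer", "Stroke"],
    "State": ["Alabama", "Alaska", "Alabama"],
    "Deaths": [10, 20, 30],
})

# Keep every column that is not numeric; these are the candidates for encoding.
text_columns = frame.select_dtypes(exclude="number").columns.tolist()

print(text_columns)
```

Selecting by dtype avoids accidentally encoding numeric columns such as `Deaths`, but you should still review the result, since numeric-looking identifiers can be stored as text and vice versa.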


In [13]:
display(encoder.categories_)

[array(['Accidents (unintentional injuries) (V01-X59,Y85-Y86)',
        'All Causes', "Alzheimer's disease (G30)",
        'Cerebrovascular diseases (I60-I69)',
        'Chronic lower respiratory diseases (J40-J47)',
        'Diabetes mellitus (E10-E14)',
        'Diseases of heart (I00-I09,I11,I13,I20-I51)',
        'Influenza and pneumonia (J09-J18)',
        'Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)',
        'Malignant neoplasms (C00-C97)',
        'Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27)'],
       dtype=object),
 array(['All causes', "Alzheimer's disease", 'CLRD', 'Cancer', 'Diabetes',
        'Heart disease', 'Influenza and pneumonia', 'Kidney disease',
        'Stroke', 'Suicide', 'Unintentional injuries'], dtype=object)]
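
Because the position in each `categories_` array equals the encoded number, the codes can be mapped back to their labels. The sketch below, on a small made-up column, shows both a manual lookup and the encoder's own `inverse_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small made-up text column.
frame = pd.DataFrame({"Cause Name": ["Stroke", "Cancer", "Suicide", "Cancer"]})

encoder = OrdinalEncoder(dtype=np.int32)
codes = encoder.fit_transform(frame[["Cause Name"]])

# categories_ holds one array per encoded column, sorted; index = code.
categories = encoder.categories_[0]
decoded_by_lookup = categories[codes[:, 0]]

# The encoder can also reverse the mapping itself.
decoded_by_sklearn = encoder.inverse_transform(codes)[:, 0]

print(codes[:, 0])
print(list(decoded_by_lookup))
```

Keeping the fitted encoder around is what makes this reversal possible, which is why the lab saves it in a variable rather than discarding it.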

From here, you can one-hot encode these categorical features as needed.

There are other sklearn encoders that can create categorical data from text columns:
* [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) - Takes string data straight to a one-hot encoding.  It has a lot of parameters to configure.
* [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) - Works on one column at a time, and is intended for encoding target labels rather than input features
* [LabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) - Similar to one-hot encoder but designed to be used with multi-class classification approaches
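
As a brief illustration of the first option, the sketch below sends a small made-up text column straight to a one-hot matrix with `OneHotEncoder`. It relies on the default sparse output and converts it with `.toarray()`, since the parameter that requests dense output has been renamed across scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small made-up text column.
frame = pd.DataFrame({"Cause Name": ["Stroke", "Cancer", "Suicide", "Cancer"]})

encoder = OneHotEncoder()
# The default result is a sparse matrix; convert it to a dense array for display.
one_hot = encoder.fit_transform(frame[["Cause Name"]]).toarray()

print(list(encoder.categories_[0]))  # column order of the one-hot matrix
print(one_hot)
```

The columns follow the sorted order of the categories, so the first column here corresponds to "Cancer".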