# Convert Numeric Data to Binary Categories using a Binarizer

For some kinds of ML algorithms, you may want to binarize continuous data. This notebook focuses on scikit-learn's preprocessing tools for binarizing data using a threshold: the `Binarizer` transformer and the `binarize` function.

First, we start with a toy dataset to get a grasp of how the binarizer works.

In [1]:
import pandas as pd
import numpy as np


from sklearn.preprocessing import Binarizer, binarize

In [2]:
num_list = [[ -1000, 0],
            [  500, -3000],
            [  100, 650]]

In [3]:
binarizer = Binarizer()
binarizer

Binarizer(copy=True, threshold=0.0)

In [4]:
binarizer.fit(num_list)

Binarizer(copy=True, threshold=0.0)

The `fit` method doesn't do anything for us here: the binarizer has nothing to learn from the data, because the threshold is already set when the object is created (the default is 0).
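To make the point about `fit` concrete, here is a minimal sketch that fits the binarizer on one array and transforms another. The two arrays are hypothetical and exist only for this illustration; the result should depend only on the threshold, not on what was passed to `fit`.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical arrays for illustration; not part of the notebook's data.
fit_data = np.array([[10_000.0, -5.0], [20_000.0, 7.0]])
new_data = np.array([[-1000.0, 0.0], [500.0, -3000.0], [100.0, 650.0]])

# Fit on one array, then transform a different one.
fitted = Binarizer(threshold=0.0).fit(fit_data)
result = fitted.transform(new_data)

# A direct comparison against the threshold gives the same answer,
# so nothing was learned from fit_data.
expected = (new_data > 0.0).astype(new_data.dtype)
print(np.array_equal(result, expected))
```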

In [5]:
binarized_list = binarizer.transform(num_list)

binarized_list

array([[0, 0],
       [1, 0],
       [1, 1]])

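When we call `transform`, every value in the dataset is compared individually with the threshold: values greater than the threshold become 1, and all other values (including values equal to the threshold, such as the 0 in the first row) become 0.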
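The elementwise rule can also be written directly in NumPy. The sketch below reuses the toy list from this notebook and checks that a strict greater-than comparison reproduces the binarizer's output; the variable names `manual` and `from_sklearn` are just for this example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

num_list = [[-1000, 0],
            [500, -3000],
            [100, 650]]

threshold = 0
X = np.asarray(num_list)

# 1 where the value is strictly greater than the threshold, else 0.
manual = (X > threshold).astype(int)
from_sklearn = Binarizer(threshold=threshold).fit_transform(num_list)

print(manual)
print(np.array_equal(manual, from_sklearn))
```

Note that the value 0 in the first row maps to 0, because the comparison is strict.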

In [6]:
binarizer = Binarizer(threshold=500)

binarizer.fit_transform(num_list)

array([[0, 0],
       [0, 0],
       [0, 1]])

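`fit_transform` does the same thing in a single call. With the threshold set to 500, only 650 is strictly greater than the threshold, so it is the only value mapped to 1 (500 itself maps to 0).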

In [7]:
binarizer = Binarizer(threshold=[0, 100])

binarizer.fit_transform(num_list)

array([[0, 0],
       [1, 0],
       [1, 1]])
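The cell above passes a list so that each column gets its own threshold (0 for the first column, 100 for the second). The scikit-learn documentation describes `threshold` as a single float, and more recent releases may reject a list, so treat that cell as version-dependent. A minimal sketch of the same per-column result using plain NumPy broadcasting, which does not rely on that behavior:

```python
import numpy as np

num_list = [[-1000, 0],
            [500, -3000],
            [100, 650]]

X = np.asarray(num_list)
col_thresholds = np.array([0, 100])  # one threshold per column

# Broadcasting compares column 0 against 0 and column 1 against 100.
per_column = (X > col_thresholds).astype(int)
print(per_column)
```

This reproduces the output shown above: 650 is the only value in the second column that exceeds 100.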

Binarizers are useful when the actual value is not important, but rather the presence or absence of a value is. You choose the threshold yourself, which makes binarizers useful for assigning categories to your data.
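As a small illustration of the presence/absence idea, the sketch below binarizes a hypothetical matrix of counts (made up for this example, not taken from the diet dataset). With the default threshold of 0, any positive count becomes 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical counts: rows are days, columns are (walks, runs, glasses of wine).
counts = np.array([[2, 0, 1],
                   [0, 0, 3],
                   [1, 1, 0]])

# Default threshold is 0, so "how many" collapses to "any at all".
presence = Binarizer().fit_transform(counts)
print(presence)
```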

### Using the binarizer to make numeric values categorical

Here we will use the diet dataset. We will drop the rows with missing values and a few columns we don't need, so that we can focus on the relevant features.

In [8]:
diet_data = pd.read_csv('Datasets/diet_data.csv')

In [9]:
diet_data.head()

Unnamed: 0,Date,Stone,Pounds,Ounces,weight_oz,calories,cals_per_oz,five_donuts,walk,run,wine,prot,weight,change
0,7/30/2018,12.0,2.0,6.0,2726.0,1950.0,0.72,1.0,1.0,0.0,0.0,0.0,0.0,-30.0
1,7/31/2018,12.0,0.0,8.0,2696.0,2600.0,0.96,1.0,0.0,0.0,0.0,0.0,0.0,8.0
2,08/01/18,12.0,1.0,0.0,2704.0,2500.0,0.92,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,08/02/18,12.0,1.0,0.0,2704.0,1850.0,0.68,1.0,1.0,0.0,1.0,0.0,0.0,-40.0
4,08/03/18,11.0,12.0,8.0,2664.0,2900.0,1.09,1.0,1.0,0.0,0.0,0.0,0.0,14.0


In [10]:
diet_data = diet_data.dropna()

diet_data = diet_data.drop(['Date', 'Stone', 'Pounds', 'Ounces'], axis=1)

In [11]:
diet_data = diet_data.astype(np.float64)

In [12]:
diet_data.head()

Unnamed: 0,weight_oz,calories,cals_per_oz,five_donuts,walk,run,wine,prot,weight,change
0,2726.0,1950.0,0.72,1.0,1.0,0.0,0.0,0.0,0.0,-30.0
1,2696.0,2600.0,0.96,1.0,0.0,0.0,0.0,0.0,0.0,8.0
2,2704.0,2500.0,0.92,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,2704.0,1850.0,0.68,1.0,1.0,0.0,1.0,0.0,0.0,-40.0
4,2664.0,2900.0,1.09,1.0,1.0,0.0,0.0,0.0,0.0,14.0


In [13]:
diet_data.describe()

Unnamed: 0,weight_oz,calories,cals_per_oz,five_donuts,walk,run,wine,prot,weight,change
count,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0
mean,2687.7,3180.714286,1.183643,0.371429,0.678571,0.25,0.307143,0.178571,0.442857,-1.485714
std,28.663189,1478.753433,0.5517,0.484922,0.468702,0.434568,0.462966,0.384368,0.498508,25.098793
min,2628.0,1400.0,0.51,0.0,0.0,0.0,0.0,0.0,0.0,-58.0
25%,2670.0,2187.5,0.8075,0.0,0.0,0.0,0.0,0.0,0.0,-18.0
50%,2689.0,2575.0,0.955,0.0,1.0,0.0,0.0,0.0,0.0,-3.0
75%,2704.0,3850.0,1.45,1.0,1.0,0.25,1.0,0.0,1.0,16.0
max,2768.0,9150.0,3.45,1.0,1.0,1.0,1.0,1.0,1.0,102.0


Calories consumed is a continuous variable that can be categorized in different ways, such as whether the person overate or underate. First, we calculate the median number of calories eaten.

In [14]:
median_calories = diet_data['calories'].median()

median_calories

2575.0

In [15]:
binarizer = Binarizer(threshold=median_calories)

diet_data['calories_above_median'] = binarizer.fit_transform(diet_data[['calories']])

In [16]:
diet_data.head()

Unnamed: 0,weight_oz,calories,cals_per_oz,five_donuts,walk,run,wine,prot,weight,change,calories_above_median
0,2726.0,1950.0,0.72,1.0,1.0,0.0,0.0,0.0,0.0,-30.0,0.0
1,2696.0,2600.0,0.96,1.0,0.0,0.0,0.0,0.0,0.0,8.0,1.0
2,2704.0,2500.0,0.92,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2704.0,1850.0,0.68,1.0,1.0,0.0,1.0,0.0,0.0,-40.0,0.0
4,2664.0,2900.0,1.09,1.0,1.0,0.0,0.0,0.0,0.0,14.0,1.0


In [17]:
mean_calories_per_oz = diet_data['cals_per_oz'].mean()

mean_calories_per_oz

1.1836428571428572

In [18]:
diet_data['cals_per_oz_above_mean'] = binarize(diet_data[['cals_per_oz']], 
                                               threshold=mean_calories_per_oz)
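The cell above uses the `binarize` function rather than the `Binarizer` class. The sketch below, on a small stand-in frame built from the first five `cals_per_oz` values shown earlier, suggests that the function, the class, and a plain pandas comparison all give the same flags. The frame `df` and the `via_*` names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer, binarize

# Stand-in frame: the five cals_per_oz values from the head() output.
df = pd.DataFrame({'cals_per_oz': [0.72, 0.96, 0.92, 0.68, 1.09]})
threshold = df['cals_per_oz'].mean()

via_function = binarize(df[['cals_per_oz']], threshold=threshold)
via_class = Binarizer(threshold=threshold).fit_transform(df[['cals_per_oz']])
via_pandas = (df['cals_per_oz'] > threshold).astype(float)

print(np.array_equal(via_function, via_class))
print(np.array_equal(via_function.ravel(), via_pandas.to_numpy()))
```

The class form is convenient inside a scikit-learn pipeline, while the function or the pandas comparison is often enough for one-off feature engineering.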

In [19]:
diet_data.sample(10)

Unnamed: 0,weight_oz,calories,cals_per_oz,five_donuts,walk,run,wine,prot,weight,change,calories_above_median,cals_per_oz_above_mean
59,2680.0,2550.0,0.95,1.0,1.0,0.0,0.0,1.0,1.0,-8.0,0.0,0.0
109,2680.0,5050.0,1.88,0.0,1.0,0.0,0.0,1.0,1.0,12.0,1.0,1.0
2,2704.0,2500.0,0.92,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
85,2674.0,3750.0,1.4,0.0,1.0,0.0,1.0,0.0,0.0,-10.0,1.0,1.0
67,2744.0,2900.0,1.06,1.0,1.0,0.0,1.0,0.0,0.0,-2.0,1.0,0.0
15,2662.0,2400.0,0.9,1.0,1.0,0.0,0.0,0.0,0.0,-4.0,0.0,0.0
142,2628.0,1500.0,0.57,0.0,1.0,0.0,1.0,0.0,0.0,-30.0,0.0,0.0
11,2630.0,3000.0,1.14,1.0,1.0,0.0,1.0,0.0,0.0,12.0,1.0,0.0
79,2676.0,1800.0,0.67,0.0,0.0,0.0,1.0,0.0,0.0,-14.0,0.0,0.0
38,2656.0,3800.0,1.43,0.0,1.0,0.0,1.0,1.0,1.0,32.0,1.0,1.0


Here we set the threshold for `calories` to the median calories consumed: when the individual consumes more than the median we say they have overeaten, and when they consume the median amount or less we say they have undereaten. The `cals_per_oz` column is binarized the same way, using its mean as the threshold.
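If readable categories are preferred over 0/1 flags, the flags can be mapped to labels afterwards. This is a minimal sketch on the first five `calories` values shown earlier; the label names `above_median` and `at_or_below_median` are hypothetical choices for this example.

```python
import pandas as pd
from sklearn.preprocessing import binarize

# Stand-in frame: the five calories values from the head() output.
df = pd.DataFrame({'calories': [1950.0, 2600.0, 2500.0, 1850.0, 2900.0]})
median_calories = df['calories'].median()  # 2500.0 for these five rows

# Flag rows strictly above the median, then map flags to labels.
flags = binarize(df[['calories']], threshold=median_calories).ravel()
labels = {0.0: 'at_or_below_median', 1.0: 'above_median'}
df['calories_label'] = pd.Series(flags, index=df.index).map(labels)

print(df)
```

The row with exactly 2500 calories falls in the `at_or_below_median` group, since only values strictly greater than the threshold are flagged.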