# Data Discretization


# Binning: 
Binning discretizes continuous values by assigning each one to one of a fixed set of intervals (bins). The sections below show several ways to implement binning in Python.

## A. Using the digitize() function:

In [None]:
import numpy as np
np.random.seed(1234) # make it reproducible

n = 100 # how much data
data = np.random.random(n) # n random numbers on 0..1

bins = np.linspace(0, 1, 11) # equally spaced bin edges, from 0 to 1.0
# 11 bin 'edges' or boundaries give us 10 bins

digitized = np.digitize(data, bins) # assign each of the n values to one of the bins
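
To make the indexing convention concrete, here is a minimal sketch on a few hand-picked values (the `edges` and `samples` arrays are illustrative and not part of the data above). With the default `right=False`, `np.digitize` returns the index `i` such that `edges[i-1] <= x < edges[i]`; values below the first edge get 0 and values at or above the last edge get `len(edges)`.

```python
import numpy as np

# Illustrative edges and samples, chosen to hit every case
edges = np.array([0, 10, 20, 30])
samples = np.array([-5, 0, 9.9, 10, 29, 30, 35])

idx = np.digitize(samples, edges)  # default right=False: edges[i-1] <= x < edges[i]

for x, i in zip(samples, idx):
    if i == 0:
        where = "below the first edge"
    elif i == len(edges):
        where = "at or above the last edge"
    else:
        where = f"in [{edges[i - 1]}, {edges[i]})"
    print(f"{x:>5} -> bin {i} ({where})")
```

This is why the random data above, which lies in [0, 1), only ever lands in bins 1 to 10.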


In [None]:
len(bins), bins # there are 11 bin boundaries or 'edges', i.e. 10 bins
# 0.0, 0.1, ..., 0.9, 1.0: round-number edges are tidier to read

In [None]:
data

In [None]:
data.min(), data.max(), data.mean(), data.std()

In [None]:
digitized # so the 100 values are now grouped into 10 bins

In [None]:
# not so easy to look at so put them side by side in a DataFrame
import pandas as pd
df = pd.DataFrame({"Data" : data, "DigBin" : digitized})

In [None]:
df.DigBin.value_counts().sort_index() 
# and there are the 10 bins; change the seed above from 1234 to something else to see different counts

In [None]:
df.head()

In [None]:
df.sort_values("Data")
# df.sort_values("DigBin").head()

In [None]:
# so all the little numbers ended up in bin 1, all the big ones in bin 10:
df.sort_values("DigBin").tail()

In [None]:
%matplotlib inline
df.hist() # now we can see before and after (left to right), should have the same shape
# note the x scale 0..12 vs 0..1.0

## B. Histograms...
You can also let `np.histogram` do the binning for you:

In [None]:
binH = (np.histogram(data, bins, weights = data)[0] / np.histogram(data, bins)[0])

In [None]:
len(binH), binH # where binH is the mean of the values in each bin
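
The division above works because `np.histogram` with `weights = data` returns the sum of the values in each bin, while the unweighted call returns the count. A minimal sketch on a small hand-made array (the `values` array is illustrative, not the data above) shows the two pieces separately. It also shows that an empty bin gives 0/0, which comes out as NaN.

```python
import numpy as np

# Illustrative values: bin 0 gets one value, bin 1 two, bin 9 three, the rest none
values = np.array([0.05, 0.15, 0.12, 0.95, 0.91, 0.99])
edges = np.linspace(0, 1, 11)

sums, _ = np.histogram(values, edges, weights=values)  # per-bin sum of the values
counts, _ = np.histogram(values, edges)                # per-bin number of values

# Empty bins produce 0/0; silence the warning and keep the resulting NaN
with np.errstate(invalid="ignore", divide="ignore"):
    means = sums / counts

print(counts)
print(means)
```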

## C. Using scipy:

In [None]:
# import numpy as np
from scipy.stats import binned_statistic
# we use the same data
binS = binned_statistic(data, data, bins = 10, range = (0, 1))[0]

### What is binned_statistic doing?
####  bin_count, bin_edges, bin_number

In [None]:
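# 'count' ignores the values argument; passing data instead of None avoids errors in some SciPy versions
bc, be, bn = binned_statistic(data, data, statistic = 'count', bins = 10)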


In [None]:
bc #, bc.sum() # =100

In [None]:
be # the edges, not so tidy...

In [None]:
bn # the bins

In [None]:
df["SciBin"] = bn # put side by side with the previous df

In [None]:
df.sort_values("Data")

In [None]:
df.hist() # before and after again

In [None]:
df.DigBin.hist(), df.SciBin.hist() # couple of blue peekers

In [None]:
# do it the other way
df.SciBin.hist() ,df.DigBin.hist()

Check out http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html and http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html
to find out why the return values are different. 
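
One likely source of the difference is where the bin edges come from. This minimal sketch assumes uniform random data like the data used above, regenerated here so the block is self-contained. With `bins = 10` and no `range`, `binned_statistic` spans the edges from the smallest to the largest data value, whereas the `digitize` example used fixed edges from 0 to 1. Passing `range = (0, 1)` should make the two agree for data strictly inside that range.

```python
import numpy as np
from scipy.stats import binned_statistic

rng = np.random.RandomState(1234)
x = rng.random_sample(100)  # 100 uniform values in [0, 1)

fixed_edges = np.linspace(0, 1, 11)
dig = np.digitize(x, fixed_edges)

# Default: 10 equal-width bins between x.min() and x.max()
_, auto_edges, auto_bins = binned_statistic(x, x, statistic="count", bins=10)

# Explicit range: 10 equal-width bins between 0 and 1
_, same_edges, same_bins = binned_statistic(x, x, statistic="count", bins=10, range=(0, 1))

print("auto edges span the data:", auto_edges[0], auto_edges[-1], "vs", x.min(), x.max())
print("edges match with range=(0, 1):", np.allclose(same_edges, fixed_edges))
print("bin numbers match with range=(0, 1):", np.array_equal(dig, same_bins))
print("values binned differently by default:", int((dig != auto_bins).sum()))
```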

## D. Using Pandas Cut

## Example: ages dataset


In [None]:
ages = [20, 22, 25, 26, 21, 23, 37, 31, 61, 45, 41, 32]

In [None]:
# Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.
# To do so, you can use 'cut', a function in pandas:
bins = [18, 25, 35, 60, 100] # as above, 5 edges give us 4 bins
cats = pd.cut(ages, bins)
cats

In [None]:
# cats is a structure showing which bin each value was placed in (the first three are 18-25)
# its output also shows the length of the data (12), which is just:
len(ages)
# followed by the categories or bins, as specified

In [None]:
cats.codes # 0 is the first bin, 3 the last

In [None]:
cats.categories # the names, note the use of '(' and ']'
# '(' means the edge is open (exclusive), ']' means it is closed (inclusive)

In [None]:
cats.value_counts() # and we can see there's only one value in 60-100

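Which side of each interval is closed can be changed with the `right` argument: `right = True` (the default) closes the right edge, `right = False` closes the left. Here we keep the right edge closed and shift the bin edges instead: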

In [None]:
cats2 = pd.cut(ages, [18, 26, 36, 61, 100], right = True)

In [None]:
cats2.value_counts() # now 26 is in (18, 26] and 61 is in (36, 61]
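
For comparison, here is a minimal sketch of the left-closed variant, using the same ages and the same shifted edges but with `right=False`. The intervals become `[18, 26)`, `[26, 36)` and so on, so 26 moves up into the second bin and 61 into the last one.

```python
import pandas as pd

ages = [20, 22, 25, 26, 21, 23, 37, 31, 61, 45, 41, 32]
edges = [18, 26, 36, 61, 100]

# right=False closes the left edge of each interval and opens the right one
left_closed = pd.cut(ages, edges, right=False)

print(left_closed.categories)
print(left_closed.value_counts())
```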

In [None]:
# want your own bin names?
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
cats = pd.cut(ages, bins, labels = group_names)
cats

If you pass `cut` an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into tenths, using the same data as above (100 random numbers):

In [None]:
pd.cut(data, 10, precision = 5)

In [None]:
bincut = pd.cut(data, 10, precision = 5)
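
To see which edges `cut` actually computes from an integer bin count, `retbins=True` returns them alongside the categories. This minimal sketch uses a small hand-made array (the `values` array is illustrative, not the data above). The edges are equally spaced between the minimum and maximum, except that pandas lowers the first edge slightly so that the minimum value falls inside the first right-closed interval.

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 2.5, 5.0, 7.5, 10.0])

# retbins=True also returns the computed bin edges
cats, edges = pd.cut(values, 4, retbins=True)

print(edges)       # first edge sits slightly below values.min()
print(cats.codes)  # 2.5 belongs to the first bin because intervals are right-closed
```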

In [None]:
bincut.codes # notice 0 to 9 (not 1 to 10) for bins, so they're out of synch by 1

In [None]:
df.head()

In [None]:
bincut.codes + 1 # hack them into order

In [None]:
df['BinCut'] = bincut.codes + 1

In [None]:
df.head() # are they the same? 

In [None]:
df.hist()

A closely related function, `qcut`, bins the data based on sample quantiles. Depending on the distribution of the data, using `cut` will not usually result in each bin having the same number of data points. Since `qcut` uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [None]:
catsq = pd.qcut(data, 4) # Cut into quartiles

In [None]:
catsq.value_counts() # 25 in each, which means the edges/bins are not likely to be round or 'tidy' numbers
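
The contrast between `cut` and `qcut` is easier to see on skewed data than on uniform data. This minimal sketch assumes exponentially distributed values (an illustrative choice, not the data above). Equal-width bins pile most points into the first bin, while quantile bins each hold a quarter of them.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
skewed = rng.exponential(size=1000)  # illustrative right-skewed data

by_width = pd.cut(skewed, 4).value_counts()      # equal-width bins
by_quantile = pd.qcut(skewed, 4).value_counts()  # equal-count bins (quartiles)

print(by_width)
print(by_quantile)
```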

In [None]:
catsq

In [None]:
# the first interval is 0.00621 to 0.313; its left edge should match (or sit just below) the min value
data.min()

In [None]:
# Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

In [None]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]).value_counts()