### All Pandas cut() you should know for transforming numerical data into categorical data
All examples are taken from the following article

https://towardsdatascience.com/all-pandas-cut-you-should-know-for-transforming-numerical-data-into-categorical-data-1370cf7f4c4f

Numerical data is common in data analysis. Often it is continuous, spans a very large scale, or is highly skewed. In those cases it can be easier to bin the values into discrete intervals, which makes descriptive statistics simpler because the values are divided into meaningful categories. For example, we can divide age into Toddler, Child, Adult, and Elder.

Pandas’ built-in [cut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function is a great way to transform numerical data into categorical data. In this article, you’ll learn how to use it to deal with the following common tasks.

1. [Discretizing into equal-sized bins](#equal)
2. [Adding custom bins](#custom)
3. [Adding labels to bins](#labels)
4. [Configuring leftmost edge with `right=False`](#edge)
5. [Include the lowest value with `include_lowest=True`](#lowest)
6. [Passing an `IntervalIndex` to `bins`](#interval)
7. [Returning bins with `retbins=True`](#retbins)
8. [Creating unordered categories](#unordered)

## 1. Discretizing into equal-sized bins
<a id='equal'></a>

The simplest usage of `cut()` takes a column and an integer as input. It discretizes the values into that many equal-sized bins.

<img src="https://miro.medium.com/max/700/1*j_ZpX8hK__RVMMDotr9HVA.png">

In [4]:
import pandas as pd

In [11]:
df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
df['age_group'] = pd.cut(df['age'], 3) # add a new column
df

Unnamed: 0,age,age_group
0,2,"(1.903, 34.333]"
1,67,"(66.667, 99.0]"
2,40,"(34.333, 66.667]"
3,32,"(1.903, 34.333]"
4,4,"(1.903, 34.333]"
5,15,"(1.903, 34.333]"
6,82,"(66.667, 99.0]"
7,99,"(66.667, 99.0]"
8,26,"(1.903, 34.333]"
9,30,"(1.903, 34.333]"


Observe the intervals in the age_group column. Each interval has a round bracket at the start and a square bracket at the end, for example (1.903, 34.333]. A value on the side of the round bracket is not included in the interval, while a value on the side of the square bracket is included (these are known as the open and closed ends of an interval in math).
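To make the bracket notation concrete, here is a minimal sketch that builds one of the intervals by hand with `pd.Interval` and tests membership. The variable name `interval` is only illustrative.

```python
import pandas as pd

# pd.Interval is closed on the right by default, matching the (a, b] notation
interval = pd.Interval(1.903, 34.333)

print(1.903 in interval)   # left edge (round bracket): excluded
print(34.333 in interval)  # right edge (square bracket): included
print(2 in interval)       # a value strictly inside the interval
```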

Now, let's take a look at the new column age_group.

In [12]:
df['age_group']

0     (1.903, 34.333]
1      (66.667, 99.0]
2    (34.333, 66.667]
3     (1.903, 34.333]
4     (1.903, 34.333]
5     (1.903, 34.333]
6      (66.667, 99.0]
7      (66.667, 99.0]
8     (1.903, 34.333]
9     (1.903, 34.333]
Name: age_group, dtype: category
Categories (3, interval[float64]): [(1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]]

<img src="https://miro.medium.com/max/1000/1*5uT5BzEFG7-JW3E6qBbPrA.png">

It shows dtype: category with 3 label values:
- (1.903, 34.333] , 
- (34.333, 66.667] , and 
- (66.667, 99.0]. 

Those label values are ordered as indicated with the symbol <. 
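The category information can also be inspected programmatically through the `.cat` accessor. The sketch below rebuilds the same series (the names `ages` and `groups` are illustrative) and prints the categories, the ordering flag, and the integer code of each value's bin.

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
groups = pd.cut(ages, 3)

print(groups.cat.categories)      # the three intervals, stored as an IntervalIndex
print(groups.cat.ordered)         # whether the categories carry an order
print(groups.cat.codes.tolist())  # position of each value's bin among the categories
```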

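Behind the scenes, the width of each bin is calculated as follows in order to generate the equal-sized bins:

$$\text{interval} = \frac{\text{max\_value} - \text{min\_value}}{\text{num\_of\_bins}} = \frac{99 - 2}{3} = 32.3333$$

The leftmost edge is then lowered slightly, from 2 to 1.903 (0.1% of the range), so that the minimum value falls inside the first bin.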
         
        (<--32.3333-->] < (<--32.3333-->] < (<--32.3333-->]
        (1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]
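The calculation can be reproduced by hand. This is only a sketch that mirrors the output shown above, not pandas' actual source code: it assumes the edges are evenly spaced between the minimum and maximum, with the leftmost edge lowered by 0.1% of the range. The names `edges` and `pandas_edges` are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
n_bins = 3

value_range = ages.max() - ages.min()  # 99 - 2 = 97
width = value_range / n_bins           # width of each bin

# Evenly spaced edges from min to max, then nudge the first edge down
# so that the minimum value is not excluded by the open left bracket.
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
edges[0] -= 0.001 * value_range

# Compare with the edges pandas itself reports
_, pandas_edges = pd.cut(ages, n_bins, retbins=True)
print(width)
print(np.round(edges, 3))
print(np.round(pandas_edges, 3))
```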

## 2. Adding custom bins
<a id='custom'></a>

Let’s divide the above age values into 4 custom groups, i.e. 0–12, 12–19, 19–61, and 61–100. To do that, we can simply pass the bin edges in a list (`[0, 12, 19, 61, 100]`) to the argument `bins`.

In [14]:
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 61, 100])
df

Unnamed: 0,age,age_group
0,2,"(0, 12]"
1,67,"(61, 100]"
2,40,"(19, 61]"
3,32,"(19, 61]"
4,4,"(0, 12]"
5,15,"(12, 19]"
6,82,"(61, 100]"
7,99,"(61, 100]"
8,26,"(19, 61]"
9,30,"(19, 61]"


<img src = https://miro.medium.com/max/700/1*P4cgZsvCM5o4LqaJ5-MgYg.png>

We can see dtype: category with 4 ordered label values: (0, 12] < (12, 19] < (19, 61] < (61, 100].

Let’s sort the DataFrame by the column age_group:

In [15]:
df.sort_values('age_group')

Unnamed: 0,age,age_group
0,2,"(0, 12]"
4,4,"(0, 12]"
5,15,"(12, 19]"
2,40,"(19, 61]"
3,32,"(19, 61]"
8,26,"(19, 61]"
9,30,"(19, 61]"
1,67,"(61, 100]"
6,82,"(61, 100]"
7,99,"(61, 100]"


<img src = https://miro.medium.com/max/1000/1*0LP51LTjIZfXhXPphOUuXQ.png>

Let’s count how many values fall into each bin.

In [16]:
df['age_group'].value_counts().sort_index()

(0, 12]      2
(12, 19]     1
(19, 61]     4
(61, 100]    3
Name: age_group, dtype: int64
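Beyond counting, the binned column can serve as a grouping key for descriptive statistics. The following is a small sketch, assuming the same data as above; the choice of aggregations and the name `summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 61, 100])

# observed=False keeps every bin in the result, even one that has no rows
summary = (
    df.groupby('age_group', observed=False)['age']
      .agg(['count', 'mean', 'min', 'max'])
)
print(summary)
```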

## 3. Adding labels to bins
<a id='labels'></a>

It is more descriptive to label these age_group values as `<12`, `Teen`, `Adult`, `Older`.
To do that, we can simply pass those values in a list to the argument `labels`.

In [18]:
bins = [0, 12, 19, 61, 100]
labels = ['<12', 'Teen', 'Adult', 'Older']
df['age_group'] = pd.cut(df['age'], bins, labels=labels)
df

Unnamed: 0,age,age_group
0,2,<12
1,67,Older
2,40,Adult
3,32,Adult
4,4,<12
5,15,Teen
6,82,Older
7,99,Older
8,26,Adult
9,30,Adult


<img src = 'https://miro.medium.com/max/700/1*hD7KJ5aNHW9_4FYpFIriTg.png'>

Similarly, the labels are shown when sorting and counting:

In [20]:
df['age_group'].value_counts().sort_index()

<12      2
Teen     1
Adult    4
Older    3
Name: age_group, dtype: int64
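If only the position of each bin is needed rather than a descriptive name, `labels` also accepts `False`, in which case `cut()` returns integer bin indicators. A minimal sketch, with the illustrative name `codes`:

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
bins = [0, 12, 19, 61, 100]

# labels=False returns the integer position of each value's bin
codes = pd.cut(ages, bins, labels=False)
print(codes.tolist())  # 0 = (0, 12], 1 = (12, 19], 2 = (19, 61], 3 = (61, 100]
```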

## 4. Configuring leftmost edge with `right=False`
<a id='edge'></a>

There is an argument `right` in Pandas `cut()` to configure whether the bins include the rightmost edge or not. `right` defaults to `True`, which means bins like `[0, 12, 19, 61, 100]` indicate `(0,12], (12,19], (19,61], (61,100]`. To include the leftmost edge of each bin instead, we can set `right=False`:

In [21]:
pd.cut(df['age'], bins=[0, 12, 19, 61, 100], right=False)

0      [0, 12)
1    [61, 100)
2     [19, 61)
3     [19, 61)
4      [0, 12)
5     [12, 19)
6    [61, 100)
7    [61, 100)
8     [19, 61)
9     [19, 61)
Name: age, dtype: category
Categories (4, interval[int64]): [[0, 12) < [12, 19) < [19, 61) < [61, 100)]
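The difference only matters for values that sit exactly on a bin edge. As a small sketch, the series `edge_values` below (made up for illustration, not part of the original data) holds the three inner edges and shows which bin each one lands in under both settings.

```python
import pandas as pd

edge_values = pd.Series([12, 19, 61])
bins = [0, 12, 19, 61, 100]

right_closed = pd.cut(edge_values, bins)              # (0, 12], (12, 19], ...
left_closed = pd.cut(edge_values, bins, right=False)  # [0, 12), [12, 19), ...

# With right=True an edge value belongs to the bin that ends at it;
# with right=False it belongs to the bin that starts at it.
print(right_closed.cat.codes.tolist())  # -> [0, 1, 2]
print(left_closed.cat.codes.tolist())   # -> [1, 2, 3]
```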

## 5. Include the lowest value with `include_lowest=True`
<a id='lowest'></a>

Suppose you would like to divide the above age values into `2–12`, `12–19`, `19–61`, `61–100` instead. You will get a result that contains `NaN` when setting the bins to `[2, 12, 19, 61, 100]`.

In [23]:
df['age_group'] = pd.cut(df['age'], bins=[2, 12, 19, 61, 100])
df

Unnamed: 0,age,age_group
0,2,
1,67,"(61.0, 100.0]"
2,40,"(19.0, 61.0]"
3,32,"(19.0, 61.0]"
4,4,"(2.0, 12.0]"
5,15,"(12.0, 19.0]"
6,82,"(61.0, 100.0]"
7,99,"(61.0, 100.0]"
8,26,"(19.0, 61.0]"
9,30,"(19.0, 61.0]"


We get a `NaN` because the value 2 is the leftmost edge of the first bin `(2.0, 12.0]` and is not included. To include the lowest value, we can set `include_lowest=True`. Alternatively, you can set `right` to `False` to include the leftmost edge of every bin.

In [25]:
df['age_group'] = pd.cut(
    df['age'], 
    bins=[2, 12, 19, 61, 100], 
    include_lowest=True)
df

Unnamed: 0,age,age_group
0,2,"(1.999, 12.0]"
1,67,"(61.0, 100.0]"
2,40,"(19.0, 61.0]"
3,32,"(19.0, 61.0]"
4,4,"(1.999, 12.0]"
5,15,"(12.0, 19.0]"
6,82,"(61.0, 100.0]"
7,99,"(61.0, 100.0]"
8,26,"(19.0, 61.0]"
9,30,"(19.0, 61.0]"
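The options mentioned above can be compared side by side by counting missing values. This is a sketch with illustrative variable names; it also shows one caveat of the `right=False` alternative, namely that a value equal to the last edge (100 here, which is not in the original data) would then be excluded.

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
bins = [2, 12, 19, 61, 100]

default = pd.cut(ages, bins)                         # 2 falls outside (2, 12]
with_lowest = pd.cut(ages, bins, include_lowest=True)
left_closed = pd.cut(ages, bins, right=False)        # first bin becomes [2, 12)

print(default.isna().sum(), with_lowest.isna().sum(), left_closed.isna().sum())

# Caveat: with right=False the last bin is [61, 100), so 100 itself is dropped
print(pd.cut(pd.Series([100]), bins, right=False).isna().tolist())
```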


## 6. Passing an `IntervalIndex` to `bins`
<a id='interval'></a>

So far we have been passing an array to `bins`. Instead of an array, we can also pass an `IntervalIndex`.
Let’s create an IntervalIndex with 3 bins `(0, 12], (19, 61], (61, 100]`:

In [26]:
bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])
bins

IntervalIndex([(0, 12], (19, 61], (61, 100]],
              closed='right',
              dtype='interval[int64]')

Next, let’s pass it to the argument `bins`. Values not covered by any interval, such as 15 here, become `NaN`:

In [28]:
df['age_group'] = pd.cut(df['age'], bins)
df

Unnamed: 0,age,age_group
0,2,"(0.0, 12.0]"
1,67,"(61.0, 100.0]"
2,40,"(19.0, 61.0]"
3,32,"(19.0, 61.0]"
4,4,"(0.0, 12.0]"
5,15,
6,82,"(61.0, 100.0]"
7,99,"(61.0, 100.0]"
8,26,"(19.0, 61.0]"
9,30,"(19.0, 61.0]"
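Because an `IntervalIndex` may leave gaps, it can be worth checking which values were not assigned to any bin. The sketch below does that, and, as one possible alternative, builds a gap-free index from consecutive edges with `pd.IntervalIndex.from_breaks`. The names `uncovered` and `full_bins` are illustrative.

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])

# Bins with a gap between 12 and 19
bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])
groups = pd.cut(ages, bins)

# Values that fall into no interval come back as NaN
uncovered = ages[groups.isna()]
print(uncovered.tolist())

# from_breaks builds adjacent intervals from a list of edges, leaving no gap
full_bins = pd.IntervalIndex.from_breaks([0, 12, 19, 61, 100])
print(pd.cut(ages, full_bins).isna().sum())
```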


## 7. Returning bins with `retbins=True`
<a id='retbins'></a>

There is an argument called `retbins` to return the bins. If it is set to `True`, `cut()` returns the bin edges along with the result, which is useful when `bins` is passed as a single number:

In [29]:
result, bins = pd.cut(
    df['age'], 
    bins=4,            # A single number value
    retbins=True)

In [32]:
# Print out result
result

0    (1.903, 26.25]
1     (50.5, 74.75]
2     (26.25, 50.5]
3     (26.25, 50.5]
4    (1.903, 26.25]
5    (1.903, 26.25]
6     (74.75, 99.0]
7     (74.75, 99.0]
8    (1.903, 26.25]
9     (26.25, 50.5]
Name: age, dtype: category
Categories (4, interval[float64]): [(1.903, 26.25] < (26.25, 50.5] < (50.5, 74.75] < (74.75, 99.0]]

In [31]:
# Print out bins value
bins 

array([ 1.903, 26.25 , 50.5  , 74.75 , 99.   ])
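One practical use of the returned edges is to apply the same binning to other data later on. The sketch below assumes a hypothetical second series, `new_ages`, which is not part of the original article; values outside the learned range come back as `NaN`.

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])

# Learn four equal-width bins from the original data and keep the edges
_, edges = pd.cut(ages, bins=4, retbins=True)

# Hypothetical new data, binned with exactly the same edges
new_ages = pd.Series([10, 45, 90, 150])
new_groups = pd.cut(new_ages, bins=edges)

print(new_groups)
print(new_groups.isna().tolist())  # 150 lies beyond the last edge (99)
```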

## 8. Creating unordered categories
<a id='unordered'></a>

Setting `ordered=False` results in unordered categories when labels are passed. This parameter can also be used to allow non-unique labels:

In [35]:
pd.cut(
    df['age'], 
    bins=[0, 12, 19, 61, 100], 
    labels=['<12', 'Teen', 'Adult', 'Older'], 
    ordered=False)

0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age, dtype: category
Categories (4, object): ['<12', 'Teen', 'Adult', 'Older']
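The example above uses unique labels. As a sketch of the non-unique case, the labels below (chosen only for illustration) map the first two bins to the same category, which relies on `ordered=False`.

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])

# 'Minor' is used for both (0, 12] and (12, 19]; this needs ordered=False
result = pd.cut(
    ages,
    bins=[0, 12, 19, 61, 100],
    labels=['Minor', 'Minor', 'Adult', 'Senior'],
    ordered=False)

print(result.tolist())
print(result.cat.categories.tolist())  # only three distinct categories remain
```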