Link to Medium blog post: https://towardsdatascience.com/how-to-do-a-custom-sort-on-pandas-dataframe-ac18e7ea5320

# How to do a Custom Sort on Pandas DataFrame

Pandas DataFrame has a built-in method sort_values() to sort values by the given variable(s). The method itself is fairly straightforward to use; however, by default it doesn’t produce the order we want for values that follow a custom order, for example:

- the t-shirt size: XS, S, M, L, and XL

- the month: Jan, Feb, Mar, Apr, etc.

- the day of the week: Mon, Tue, Wed, Thu, Fri, Sat, and Sun.

## Take a look at the problem

In [2]:
# import necessary libraries
import pandas as pd

In [4]:
# Suppose we have a dataset about a clothing store:

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

df

,cloth_id,size
0,1001,S
1,1002,XL
2,1003,M
3,1004,XS
4,1005,L
5,1006,S


We can see that each item of clothing has a size value, and the data should be sorted in the following order:

- XS for extra small
- S for small
- M for medium
- L for large
- XL for extra large

However, you will get the following output when calling sort_values('size').

In [5]:
# sort the values using sort_values('size')

df.sort_values('size')

,cloth_id,size
4,1005,L
2,1003,M
0,1001,S
5,1006,S
1,1002,XL
3,1004,XS


The output is not what we want, but it is technically correct. Under the hood, sort_values() sorts numeric data in numerical order and object (string) data in lexicographic order, character by character.
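
The same ordering can be reproduced with plain Python strings, which also compare character by character. A minimal sketch (the list simply repeats the size values from the example above):

```python
# Strings are compared character by character, so 'XL' sorts before 'XS':
# both start with 'X', and 'L' comes before 'S'.
sizes = ['S', 'XL', 'M', 'XS', 'L', 'S']
sorted_sizes = sorted(sizes)
print(sorted_sizes)  # ['L', 'M', 'S', 'S', 'XL', 'XS']
```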

Here are two common solutions:

1. Create a new column for custom sorting
2. Cast data to category type with orderedness using CategoricalDtype

## Create a new column for custom sorting

In this solution, a mapping DataFrame is needed to represent a custom sort, then a new column will be created according to the mapping, and finally we can sort the data by the new column. Let’s see how this works with the help of an example.

Firstly, let’s create a mapping DataFrame to represent a custom sort.

In [6]:
df_mapping = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
})
sort_mapping = df_mapping.reset_index().set_index('size')

After that, create a new column size_num with the mapped values from sort_mapping.

In [7]:
df['size_num'] = df['size'].map(sort_mapping['index'])
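
To make the mapping step less opaque, the sketch below rebuilds sort_mapping on its own and inspects the lookup Series that map() receives. The names lookup and sample are illustrative and not part of the original notebook.

```python
import pandas as pd

# The row position of each size in df_mapping defines the custom order.
df_mapping = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
})

# reset_index() turns the positions 0..4 into a column named 'index';
# set_index('size') then makes the size labels the lookup keys.
sort_mapping = df_mapping.reset_index().set_index('size')
lookup = sort_mapping['index']
print(lookup.to_dict())  # {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

# Series.map replaces each size label with its position in the custom order.
sample = pd.Series(['S', 'XL', 'M'])
mapped = sample.map(lookup)
print(mapped.tolist())  # [1, 4, 2]
```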

Finally, sort values by the new column size_num.

In [8]:
df.sort_values('size_num')

,cloth_id,size,size_num
3,1004,XS,0
0,1001,S,1
5,1006,S,1
2,1003,M,2
4,1005,L,3
1,1002,XL,4


This certainly does the job. But it creates a spare column and can be less efficient when dealing with a large dataset.
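
If the spare column is the main concern, one alternative is to apply the mapping only while sorting. This is not the approach used in this post, and it assumes pandas 1.1 or later, where sort_values() accepts a key argument. The names df_demo and size_rank below are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Hypothetical helper dict: size label -> rank in the custom order.
size_rank = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

# key receives the whole column as a Series and must return a Series
# of the same length; the ranks are used for sorting only and are not stored.
sorted_demo = df_demo.sort_values('size', key=lambda col: col.map(size_rank))
print(sorted_demo['size'].tolist())  # ['XS', 'S', 'S', 'M', 'L', 'XL']
```

The result has the same two columns as the input, so no cleanup step is needed afterwards.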

We can solve this more efficiently using CategoricalDtype.

## Cast data to category type with orderedness using CategoricalDtype

CategoricalDtype is a type for categorical data with the categories and orderedness [1]. It is very useful for creating a custom sort [2]. Let’s see how this works with the help of an example.

Firstly, let’s import CategoricalDtype.

In [9]:
from pandas.api.types import CategoricalDtype

Then, create a custom category type cat_size_order with

- the 1st argument set to ['XS', 'S', 'M', 'L', 'XL'] for the unique values of the clothing size.
- and the 2nd argument ordered=True for this variable to be treated as an ordered categorical.

In [10]:
cat_size_order = CategoricalDtype(
    ['XS', 'S', 'M', 'L', 'XL'], 
    ordered=True
)


After that, call astype(cat_size_order) to cast the size data to the custom category type. By running df['size'], we can see that the size column has been cast to a category type with the order [XS < S < M < L < XL].

In [12]:
df['size'] = df['size'].astype(cat_size_order)
df['size']

0     S
1    XL
2     M
3    XS
4     L
5     S
Name: size, dtype: category
Categories (5, object): ['XS' < 'S' < 'M' < 'L' < 'XL']

And finally, we can call the same method to sort values.

In [13]:
df.sort_values('size')

,cloth_id,size,size_num
3,1004,XS,0
0,1001,S,1
5,1006,S,1
2,1003,M,2
4,1005,L,3
1,1002,XL,4
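
Beyond sorting, an ordered categorical also supports order-aware operations such as min(), max(), and comparisons against a category label. The sketch below rebuilds the size data as a standalone Series; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)
sizes = pd.Series(['S', 'XL', 'M', 'XS', 'L', 'S']).astype(cat_size_order)

# min() and max() follow the custom order, not the alphabet.
smallest = sizes.min()  # 'XS'
largest = sizes.max()   # 'XL'

# Comparisons against a category label also use the custom order.
larger_than_s = sizes[sizes > 'S']
print(larger_than_s.tolist())  # ['XL', 'M', 'L']

# Descending sort reverses the custom order.
descending = sizes.sort_values(ascending=False)
print(descending.tolist())  # ['XL', 'L', 'M', 'S', 'S', 'XS']
```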


### View category codes property with the Series.cat accessor

Now that the size column has been cast to a category type, we can use the Series.cat accessor to view its categorical properties. Under the hood, pandas uses category codes to represent the position of each value in the ordered categorical.

Let’s create a new column codes, so we can compare size and codes values side by side.

In [14]:
df['codes'] = df['size'].cat.codes
df

,cloth_id,size,size_num,codes
0,1001,S,1,1
1,1002,XL,4,4
2,1003,M,2,2
3,1004,XS,0,0
4,1005,L,3,3
5,1006,S,1,1


We can see that XS, S, M, L, and XL have the codes 0, 1, 2, 3, and 4 respectively. Codes are the positions of the actual values in the category type. By running df.info(), we can see that codes are int8.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   cloth_id  6 non-null      int64   
 1   size      6 non-null      category
 2   size_num  6 non-null      int64   
 3   codes     6 non-null      int8    
dtypes: category(1), int64(2), int8(1)
memory usage: 452.0 bytes
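
One thing to watch for with category codes: a value that is not listed in the category type becomes NaN when cast, and its code is -1. The sketch below uses a made-up 'XXL' value to show this, along with a simple check to run before casting.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)

# 'XXL' is a made-up value that is not part of cat_size_order.
raw = pd.Series(['S', 'XXL', 'M'])

# Check for unknown labels before casting.
unknown = raw[~raw.isin(cat_size_order.categories)]
print(unknown.tolist())  # ['XXL']

# Casting silently turns unknown labels into NaN with code -1.
cast_sizes = raw.astype(cat_size_order)
print(cast_sizes.isna().tolist())     # [False, True, False]
print(cast_sizes.cat.codes.tolist())  # [1, -1, 2]
```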


## Sort by multiple variables

Next, let’s make things a little more complicated. Here, we’re going to sort our DataFrame by multiple variables.

In [16]:
df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})

Similarly, let’s create two custom category types, cat_day_of_week and cat_month, and pass them to astype().

In [17]:
cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 
    ordered=True
)
cat_month = CategoricalDtype(
    ['Jan', 'Feb', 'Mar', 'Apr'], 
    ordered=True,
)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)

To sort by multiple variables, we just need to pass a list to sort_values() instead. For example, sort by month and day_of_week.

In [18]:
df.sort_values(['month', 'day_of_week'])

,order_id,customer_id,month,day_of_week
5,1006,10,Jan,Mon
1,1002,12,Jan,Wed
2,1003,12,Jan,Sun
0,1001,10,Feb,Mon
3,1004,12,Feb,Tue
6,1007,10,Feb,Thu
4,1005,10,Feb,Sat


We can also sort by customer_id, month, and day_of_week.

In [19]:
df.sort_values(['customer_id', 'month', 'day_of_week'])

,order_id,customer_id,month,day_of_week
5,1006,10,Jan,Mon
0,1001,10,Feb,Mon
6,1007,10,Feb,Thu
4,1005,10,Feb,Sat
1,1002,12,Jan,Wed
2,1003,12,Jan,Sun
3,1004,12,Feb,Tue
