#### Pandas Tutorial - Part 63: DataFrame Methods (nsmallest, rank)

This notebook covers two important DataFrame methods:
- `nsmallest()` - Return the first n rows ordered by columns in ascending order
- `rank()` - Compute numerical data ranks along the specified axis

In [1]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

##### 1. DataFrame.nsmallest()

The `nsmallest()` method returns the `n` rows with the smallest values in the given column or columns, ordered ascending. The result matches `df.sort_values(columns, ascending=True).head(n)`, but `nsmallest()` is usually faster because it does not have to sort the whole frame.

In [2]:
# Create a DataFrame for countries
df = pd.DataFrame({
    'population': [59000000, 65000000, 434000, 434000, 434000, 337000, 11300, 11300, 11300],
    'GDP': [1937894, 2583560, 12011, 4520, 12128, 17036, 182, 38, 311],
    'alpha-2': ["IT", "FR", "MT", "MV", "BN", "IS", "NR", "TV", "AI"]
}, index=["Italy", "France", "Malta", "Maldives", "Brunei", "Iceland", "Nauru", "Tuvalu", "Anguilla"])

print("Countries DataFrame:")
df

Countries DataFrame:


,population,GDP,alpha-2
Italy,59000000,1937894,IT
France,65000000,2583560,FR
Malta,434000,12011,MT
Maldives,434000,4520,MV
Brunei,434000,12128,BN
Iceland,337000,17036,IS
Nauru,11300,182,NR
Tuvalu,11300,38,TV
Anguilla,11300,311,AI


In [3]:
# Get the 3 smallest countries by population
print("3 smallest countries by population:")
df.nsmallest(3, 'population')

3 smallest countries by population:


,population,GDP,alpha-2
Nauru,11300,182,NR
Tuvalu,11300,38,TV
Anguilla,11300,311,AI


In [4]:
# Get the 3 smallest countries by GDP
print("3 smallest countries by GDP:")
df.nsmallest(3, 'GDP')

3 smallest countries by GDP:


,population,GDP,alpha-2
Tuvalu,11300,38,TV
Nauru,11300,182,NR
Anguilla,11300,311,AI


In [5]:
# Get the 3 smallest countries by population, using GDP to break ties
print("3 smallest countries by population and GDP:")
df.nsmallest(3, ['population', 'GDP'])

3 smallest countries by population and GDP:


,population,GDP,alpha-2
Tuvalu,11300,38,TV
Nauru,11300,182,NR
Anguilla,11300,311,AI


In [6]:
# Get the 3 smallest countries by GDP, using population to break ties
print("3 smallest countries by GDP and population:")
df.nsmallest(3, ['GDP', 'population'])

3 smallest countries by GDP and population:


,population,GDP,alpha-2
Tuvalu,11300,38,TV
Nauru,11300,182,NR
Anguilla,11300,311,AI
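
With this data both column orders return the same three rows, because the three least-populous countries also have the three lowest GDPs. The sketch below rebuilds the same frame under the assumed name `countries` and asks for four rows instead, which is enough to show that the order of the column list matters:

```python
import pandas as pd

# Same data as the countries frame above; the name `countries` is only used in this sketch.
countries = pd.DataFrame({
    'population': [59000000, 65000000, 434000, 434000, 434000, 337000, 11300, 11300, 11300],
    'GDP': [1937894, 2583560, 12011, 4520, 12128, 17036, 182, 38, 311],
    'alpha-2': ["IT", "FR", "MT", "MV", "BN", "IS", "NR", "TV", "AI"]
}, index=["Italy", "France", "Malta", "Maldives", "Brunei", "Iceland", "Nauru", "Tuvalu", "Anguilla"])

# Population decides first; GDP only orders rows that share a population.
by_pop_then_gdp = countries.nsmallest(4, ['population', 'GDP'])

# GDP decides first; population would only break GDP ties (there are none here).
by_gdp_then_pop = countries.nsmallest(4, ['GDP', 'population'])

print(list(by_pop_then_gdp.index))  # fourth row: Iceland (next-smallest population)
print(list(by_gdp_then_pop.index))  # fourth row: Maldives (fourth-smallest GDP)
```

The first three rows agree in both results; only the fourth row differs, because it is chosen by whichever column comes first in the list.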


In [7]:
# Demonstrate the 'keep' parameter with duplicate values
# Note that there are duplicate population values (434000 and 11300)

# Keep='first' (default)
print("nsmallest with keep='first' (default):")
df.nsmallest(3, 'population', keep='first')

nsmallest with keep='first' (default):


,population,GDP,alpha-2
Nauru,11300,182,NR
Tuvalu,11300,38,TV
Anguilla,11300,311,AI


In [8]:
# Keep='last'
print("nsmallest with keep='last':")
df.nsmallest(3, 'population', keep='last')

nsmallest with keep='last':


,population,GDP,alpha-2
Anguilla,11300,311,AI
Tuvalu,11300,38,TV
Nauru,11300,182,NR


In [9]:
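# keep='all': keep every row tied with the last selected value, so the result
# can contain more than n rows (here exactly three rows tie, so 3 are returned)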
print("nsmallest with keep='all':")
df.nsmallest(3, 'population', keep='all')

nsmallest with keep='all':


,population,GDP,alpha-2
Nauru,11300,182,NR
Tuvalu,11300,38,TV
Anguilla,11300,311,AI
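
With `n=3` exactly three rows are tied at 11300, so all three `keep` options return three rows. As a small sketch, the population column is rebuilt below under the assumed name `pop_frame` and queried with `n=2`, which cuts through the tied group and makes the difference visible:

```python
import pandas as pd

# Population column of the countries frame, rebuilt so the sketch runs on its own.
# `pop_frame` is a name chosen for this example.
pop_frame = pd.DataFrame(
    {'population': [59000000, 65000000, 434000, 434000, 434000, 337000, 11300, 11300, 11300]},
    index=["Italy", "France", "Malta", "Maldives", "Brunei", "Iceland", "Nauru", "Tuvalu", "Anguilla"],
)

first_two = pop_frame.nsmallest(2, 'population', keep='first')  # earliest tied rows win
last_two = pop_frame.nsmallest(2, 'population', keep='last')    # latest tied rows win
all_tied = pop_frame.nsmallest(2, 'population', keep='all')     # no tied row is dropped

print(list(first_two.index))
print(list(last_two.index))
print(list(all_tied.index))  # three rows, even though n=2
```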


In [10]:
# Create a Series
s = pd.Series([3, 2, 1, 5, 4])
print("Series:")
print(s)

# Get the 3 smallest values
print("\n3 smallest values:")
print(s.nsmallest(3))

Series:
0    3
1    2
2    1
3    5
4    4
dtype: int64

3 smallest values:
2    1
1    2
0    3
dtype: int64


In [11]:
# Compare nsmallest with sort_values().head()
print("Using nsmallest(3, 'GDP'):")
print(df.nsmallest(3, 'GDP'))

print("\nUsing sort_values('GDP').head(3):")
print(df.sort_values('GDP').head(3))

Using nsmallest(3, 'GDP'):
          population  GDP alpha-2
Tuvalu         11300   38      TV
Nauru          11300  182      NR
Anguilla       11300  311      AI

Using sort_values('GDP').head(3):
          population  GDP alpha-2
Tuvalu         11300   38      TV
Nauru          11300  182      NR
Anguilla       11300  311      AI
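
To get a rough feel for the performance difference, the sketch below times both approaches on a hypothetical frame of 100,000 unique integers (the frame `big` and the column `x` are made up for this example). The timings depend on the machine, the pandas version, and the data, so treat them as indicative only:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark frame: the integers 0..99999 in random order.
rng = np.random.default_rng(0)
big = pd.DataFrame({'x': rng.permutation(100_000)})

via_nsmallest = big.nsmallest(5, 'x')
via_sort = big.sort_values('x').head(5)

# With no ties, both approaches select the same rows in the same order.
print(list(via_nsmallest.index) == list(via_sort.index))

# Rough timings over 20 repetitions each.
t_nsmallest = timeit.timeit(lambda: big.nsmallest(5, 'x'), number=20)
t_sort = timeit.timeit(lambda: big.sort_values('x').head(5), number=20)
print(f"nsmallest: {t_nsmallest:.4f}s, sort_values().head(): {t_sort:.4f}s")
```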


##### 2. DataFrame.rank()

The `rank()` method computes numerical data ranks along the specified axis. By default, equal values are assigned a rank that is the average of the ranks of those values.

In [12]:
# Create a DataFrame with some duplicate values and NaN
df = pd.DataFrame(data={
    'Animal': ['cat', 'penguin', 'dog', 'spider', 'snake'],
    'Number_legs': [4, 2, 4, 8, np.nan]
})

print("Animals DataFrame:")
df

Animals DataFrame:


,Animal,Number_legs
0,cat,4.0
1,penguin,2.0
2,dog,4.0
3,spider,8.0
4,snake,


In [13]:
# Default rank (method='average')
df['default_rank'] = df['Number_legs'].rank()
print("Default rank (method='average'):")
df

Default rank (method='average'):


,Animal,Number_legs,default_rank
0,cat,4.0,2.5
1,penguin,2.0,1.0
2,dog,4.0,2.5
3,spider,8.0,4.0
4,snake,,
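
The 2.5 shared by cat and dog comes from averaging the two positions (2 and 3) that the tied values occupy in sorted order. One way to see this, sketched here on the non-missing values only with illustrative variable names, is to take the `method='first'` ranks and average them within each group of equal values:

```python
import pandas as pd

legs = pd.Series([4, 2, 4, 8], index=['cat', 'penguin', 'dog', 'spider'])

# method='first' gives every value its own position in sorted order:
# penguin=1, cat=2, dog=3, spider=4.
positions = legs.rank(method='first')

# Averaging the positions within each group of equal values
# reproduces the default (method='average') ranks.
manual_average = positions.groupby(legs).transform('mean')

print(manual_average.tolist())
print(legs.rank().tolist())
```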


In [14]:
# Rank with method='max'
df['max_rank'] = df['Number_legs'].rank(method='max')
print("Rank with method='max':")
df

Rank with method='max':


,Animal,Number_legs,default_rank,max_rank
0,cat,4.0,2.5,3.0
1,penguin,2.0,1.0,1.0
2,dog,4.0,2.5,3.0
3,spider,8.0,4.0,4.0
4,snake,,,


In [15]:
# Rank with na_option='bottom'
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
print("Rank with na_option='bottom':")
df

Rank with na_option='bottom':


,Animal,Number_legs,default_rank,max_rank,NA_bottom
0,cat,4.0,2.5,3.0,2.5
1,penguin,2.0,1.0,1.0,1.0
2,dog,4.0,2.5,3.0,2.5
3,spider,8.0,4.0,4.0,4.0
4,snake,,,,5.0
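
For comparison, here is a short sketch that puts the three `na_option` values side by side on the same leg counts (the frame name `ranks` is chosen for this example). With `'top'` the missing value takes rank 1 and every other rank shifts up by one:

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan],
                 index=['cat', 'penguin', 'dog', 'spider', 'snake'])

ranks = pd.DataFrame({
    'keep': legs.rank(na_option='keep'),      # default: NaN stays NaN
    'top': legs.rank(na_option='top'),        # NaN gets the lowest rank
    'bottom': legs.rank(na_option='bottom'),  # NaN gets the highest rank
})
print(ranks)
```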


In [16]:
# Rank with pct=True (percentile rank)
df['pct_rank'] = df['Number_legs'].rank(pct=True)
print("Rank with pct=True (percentile rank):")
df

Rank with pct=True (percentile rank):


,Animal,Number_legs,default_rank,max_rank,NA_bottom,pct_rank
0,cat,4.0,2.5,3.0,2.5,0.625
1,penguin,2.0,1.0,1.0,1.0,0.25
2,dog,4.0,2.5,3.0,2.5,0.625
3,spider,8.0,4.0,4.0,4.0,1.0
4,snake,,,,5.0,
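
With the default settings, the `pct=True` values are the ordinary ranks divided by the number of non-missing values (4 here), which is why cat and dog get 2.5 / 4 = 0.625. A minimal check of that relationship, for the default `method='average'` and `na_option='keep'` only:

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan])

pct_rank = legs.rank(pct=True)
manual = legs.rank() / legs.count()  # count() ignores NaN, so the divisor is 4

print(pct_rank.tolist())
print(manual.tolist())
```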


In [17]:
# Create a DataFrame to demonstrate different ranking methods
df_methods = pd.DataFrame({
    'values': [1, 2, 2, 3, 3, 3, 4, 5]
})

print("DataFrame with duplicate values:")
df_methods

DataFrame with duplicate values:


,values
0,1
1,2
2,2
3,3
4,3
5,3
6,4
7,5


In [18]:
# Demonstrate all ranking methods
methods = ['average', 'min', 'max', 'first', 'dense']

for method in methods:
    df_methods[f'rank_{method}'] = df_methods['values'].rank(method=method)

print("Comparison of different ranking methods:")
df_methods

Comparison of different ranking methods:


,values,rank_average,rank_min,rank_max,rank_first,rank_dense
0,1,1.0,1.0,1.0,1.0,1.0
1,2,2.5,2.0,3.0,2.0,2.0
2,2,2.5,2.0,3.0,3.0,2.0
3,3,5.0,4.0,6.0,4.0,3.0
4,3,5.0,4.0,6.0,5.0,3.0
5,3,5.0,4.0,6.0,6.0,3.0
6,4,7.0,7.0,7.0,7.0,4.0
7,5,8.0,8.0,8.0,8.0,5.0


In [19]:
# Explanation of each method
print("Explanation of ranking methods:")
print("- average: average rank of the group (default)")
print("- min: lowest rank in the group")
print("- max: highest rank in the group")
print("- first: ranks assigned in order they appear in the array")
print("- dense: like 'min', but rank always increases by 1 between groups")

Explanation of ranking methods:
- average: average rank of the group (default)
- min: lowest rank in the group
- max: highest rank in the group
- first: ranks assigned in order they appear in the array
- dense: like 'min', but rank always increases by 1 between groups
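
The `dense` method can also be read as "the position of each value among the sorted distinct values". The sketch below reproduces it with `np.unique` as an illustration of that idea; it is not a description of how pandas computes ranks internally:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5])

# np.unique sorts the distinct values; `inverse` maps each element to the
# 0-based position of its value in that sorted list.
_, inverse = np.unique(values.to_numpy(), return_inverse=True)
manual_dense = inverse + 1

print(manual_dense.tolist())
print(values.rank(method='dense').astype(int).tolist())
```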


In [20]:
# Demonstrate ranking with ascending=False
df_methods['rank_desc'] = df_methods['values'].rank(ascending=False)
print("Ranking with ascending=False:")
df_methods[['values', 'rank_desc']]

Ranking with ascending=False:


,values,rank_desc
0,1,8.0
1,2,6.5
2,2,6.5
3,3,4.0
4,3,4.0
5,3,4.0
6,4,2.0
7,5,1.0
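
Descending ranks are often combined with `method='min'` to produce competition-style places, where tied entries share the best place in their group and the next place is skipped. A small sketch, treating the same values as hypothetical scores:

```python
import pandas as pd

scores = pd.Series([1, 2, 2, 3, 3, 3, 4, 5])  # hypothetical scores, same values as above

# Highest score ranks first; tied scores share the best (lowest) place in their group.
place = scores.rank(ascending=False, method='min').astype(int)

print(place.tolist())  # [8, 6, 6, 3, 3, 3, 2, 1]
```

The three scores of 3 occupy places 3 to 5 and all receive place 3, so the next distinct score (2) starts at place 6.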


In [21]:
# Demonstrate ranking along different axes in a DataFrame
df_axis = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 4, 3, 2],
    'C': [3, 3, 2, 1]
})

print("DataFrame for axis demonstration:")
df_axis

DataFrame for axis demonstration:


,A,B,C
0,1,5,3
1,2,4,3
2,3,3,2
3,4,2,1


In [22]:
# Rank along axis=0 (default, rank within each column)
print("Rank along axis=0 (within each column):")
df_axis.rank(axis=0)

Rank along axis=0 (within each column):


,A,B,C
0,1.0,4.0,3.5
1,2.0,3.0,3.5
2,3.0,2.0,2.0
3,4.0,1.0,1.0


In [23]:
# Rank along axis=1 (rank within each row)
print("Rank along axis=1 (within each row):")
df_axis.rank(axis=1)

Rank along axis=1 (within each row):


,A,B,C
0,1.0,3.0,2.0
1,1.0,3.0,2.0
2,2.5,2.5,1.0
3,3.0,2.0,1.0


In [24]:
# Create a DataFrame with mixed data types
df_mixed = pd.DataFrame({
    'numeric': [1, 2, 3, 4],
    'string': ['a', 'b', 'c', 'd']
})

print("DataFrame with mixed data types:")
df_mixed

DataFrame with mixed data types:


,numeric,string
0,1,a
1,2,b
2,3,c
3,4,d


In [25]:
# Rank with numeric_only=True (the non-numeric 'string' column is dropped)
print("Rank with numeric_only=True:")
print(df_mixed.rank(numeric_only=True))

Rank with numeric_only=True:
   numeric
0      1.0
1      2.0
2      3.0
3      4.0


##### Summary

In this notebook, we've explored two important DataFrame methods:

1. **nsmallest()**: Returns the first n rows ordered by columns in ascending order. This method is useful for quickly finding the smallest values in a DataFrame. It's equivalent to `df.sort_values(columns, ascending=True).head(n)`, but more performant. The `keep` parameter controls how to handle duplicate values.

2. **rank()**: Computes numerical data ranks along the specified axis. This method offers several options for handling ties (equal values) through the `method` parameter:
   - 'average': average rank of the group (default)
   - 'min': lowest rank in the group
   - 'max': highest rank in the group
   - 'first': ranks assigned in order they appear in the array
   - 'dense': like 'min', but rank always increases by 1 between groups
   
   Additional parameters include:
   - `na_option`: How to handle NaN values ('keep', 'top', or 'bottom')
   - `ascending`: Whether to rank in ascending order (default) or descending order
   - `pct`: Whether to return ranks as fractions between 0 and 1 (percentile form) instead of absolute ranks

These methods are valuable for data analysis, particularly when you need to identify the smallest values in a dataset or assign ranks to values for statistical analysis.