Link to Medium blog post: https://towardsdatascience.com/how-to-get-the-row-count-of-a-pandas-dataframe-be67232ad5de

Let’s create an example DataFrame that we’ll reference throughout this guide in order to demonstrate a few concepts.

In [1]:
import pandas as pd
df = pd.DataFrame({
    'colA': [1, 2, None, 4, 5],
    'colB': [None, 'b', 'c', 'd', 'e'],
    'colC': [True, False, False, True, None],
})
print(df)

   colA  colB   colC
0   1.0  None   True
1   2.0     b  False
2   NaN     c  False
3   4.0     d   True
4   5.0     e   None


### Using len()

The simplest and clearest way to compute the row count of a DataFrame is to use Python's built-in len() function:

In [2]:
len(df)

5

You can also pass df.index instead, which is slightly faster (more on this in the final section of the article):

In [3]:
len(df.index)

5
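As a small illustrative sketch (the variable names below are our own, not part of the original example), len() counts every row regardless of missing values, and it works the same way on a filtered or empty DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'colA': [1, 2, None, 4, 5],
    'colB': [None, 'b', 'c', 'd', 'e'],
    'colC': [True, False, False, True, None],
})

# Rows that contain missing values are still rows, so they are counted.
total_rows = len(df)

# len() on a filtered DataFrame gives the number of matching rows.
# The comparison NaN > 2 evaluates to False, so the row with the
# missing colA value is not selected.
rows_above_two = len(df[df['colA'] > 2])

# An empty slice has zero rows.
empty_rows = len(df.iloc[0:0])

print(total_rows, rows_above_two, empty_rows)  # 5 2 0
```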

### Using shape

Alternatively, you can use the pandas.DataFrame.shape attribute, which returns a tuple representing the dimensionality of the DataFrame. The first element of the tuple is the number of rows and the second is the number of columns.

In [4]:
df.shape[0]

5

You can also unpack the result of df.shape and infer the row count as shown below:

In [7]:
n_rows, _ = df.shape
n_rows

5
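To make the tuple structure concrete, here is a minimal sketch using the same example DataFrame (the variable names are illustrative). Note that a single column is a Series, whose shape is a one-element tuple, so the two-value unpacking shown above applies only to DataFrames:

```python
import pandas as pd

df = pd.DataFrame({
    'colA': [1, 2, None, 4, 5],
    'colB': [None, 'b', 'c', 'd', 'e'],
    'colC': [True, False, False, True, None],
})

# For a DataFrame, shape is (number of rows, number of columns).
frame_shape = df.shape
n_rows, n_cols = frame_shape

# For a Series, shape has a single element: (number of rows,).
series_shape = df['colA'].shape

print(frame_shape)   # (5, 3)
print(series_shape)  # (5,)
```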

### Using count()

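The third option for computing row counts in pandas is the count() method, which returns the number of non-NA entries. Called on a DataFrame (pandas.DataFrame.count()), it returns one count per column; called on a single column (a Series), it returns a single number.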

Let’s assume that we want to count the rows that have a non-null value in a particular column. The following should do the trick for us:

In [8]:
df[df.columns[1]].count()

4

This method should only be used when you want to ignore null values. Otherwise, use either len() or shape.
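For comparison, the sketch below shows what count() returns when called on the whole example DataFrame rather than on one column, along with one possible way (an assumption on our part, not something the article prescribes) to count the rows that have no missing values at all, by combining dropna() with len():

```python
import pandas as pd

df = pd.DataFrame({
    'colA': [1, 2, None, 4, 5],
    'colB': [None, 'b', 'c', 'd', 'e'],
    'colC': [True, False, False, True, None],
})

# Non-null entries per column (a Series indexed by column name).
per_column = df.count()

# Non-null entries per row (a Series indexed by row label).
per_row = df.count(axis=1)

# Rows with no missing value in any column.
complete_rows = len(df.dropna())

print(per_column.tolist())  # [4, 4, 4]
print(per_row.tolist())     # [2, 3, 2, 3, 2]
print(complete_rows)        # 2
```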



### Performance

Now that we know a few different ways for computing the count of rows in DataFrames, it would be interesting to discuss the performance implications around them. To do so, we are going to create a larger DataFrame than the one we used so far in this guide.

In [9]:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.random.randint(0, 100, size=(10000, 4)),
    columns=list('ABCD')
)
print(df)

       A   B   C   D
0     69  21  60  61
1     84  86   5  15
2     82  53  42  58
3     17  48  10  68
4     37  40  11  54
...   ..  ..  ..  ..
9995  47  24  52  61
9996   2  37  92  17
9997  71  65  42   8
9998  58  39  57   9
9999   5  83   6  86

[10000 rows x 4 columns]


In order to evaluate performance, we are going to use IPython's %timeit magic, which is useful for timing small bits of Python code.

In [None]:
'''>>> %timeit len(df)
548 ns ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit len(df.index)
358 ns ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit df.shape[0]
904 ns ± 48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit df[df.columns[1]].count()
81.9 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)'''

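We can see from the results above that the most efficient way to count rows in pandas is the len() function, and that passing just the index (len(df.index)) is even faster. df.shape[0] is somewhat slower than both, although all three run in under a microsecond here.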

The least efficient way is count(), which took roughly 82 µs per call here, so you should use it only if you need to exclude null entries from the count.
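The %timeit magic is only available inside IPython or Jupyter. As a rough sketch of how a similar comparison might be run in a plain Python script, the standard-library timeit module can be used instead. The `candidates` dictionary and the loop count are illustrative choices, and wrapping each expression in a lambda adds a small call overhead, so the absolute numbers will differ from those reported above and will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(0, 100, size=(10000, 4)),
    columns=list('ABCD')
)

# Hypothetical mapping from a label to the expression being timed.
candidates = {
    'len(df)': lambda: len(df),
    'len(df.index)': lambda: len(df.index),
    'df.shape[0]': lambda: df.shape[0],
    'df[df.columns[1]].count()': lambda: df[df.columns[1]].count(),
}

n_loops = 1000  # illustrative; increase for more stable measurements
timings = {}
for label, func in candidates.items():
    # timeit.timeit returns the total time in seconds for n_loops calls.
    total_seconds = timeit.timeit(func, number=n_loops)
    timings[label] = total_seconds / n_loops
    print(f'{label}: {timings[label] * 1e9:.0f} ns per loop')
```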

