https://realpython.com/python-pandas-tricks/#1-configure-options-settings-at-interpreter-startup

# Python Pandas: Tricks & Features You May Not Know

Table of Contents

1. Configure Options & Settings at Interpreter Startup
2. Make Toy Data Structures With Pandas’ Testing Module
3. Take Advantage of Accessor Methods
4. Create a DatetimeIndex From Component Columns
5. Use Categorical Data to Save on Time and Space
6. Introspect Groupby Objects via Iteration
7. Use This Mapping Trick for Membership Binning
8. Understand How Pandas Uses Boolean Operators
9. Load Data From the Clipboard
10. Write Pandas Objects Directly to Compressed Format
Want to Add to This List? Let Us Know

## 1. Configure Options & Settings at Interpreter Startup

In [9]:
import pandas as pd

In [15]:
pd.__version__

'1.3.0'

In [10]:
def start():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            'expand_frame_repr': False,  # Don't wrap to multiple pages
            'max_rows': 14,
            'max_seq_items': 50,         # Max length of printed sequence
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None   # Controls SettingWithCopyWarning
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+

In [11]:
"""
if __name__ == '__main__':
    start()
    del start  # Clean up namespace in the interpreter
"""


In [12]:
start()

In [13]:
pd.__name__

pd.get_option('display.max_rows')


14

Let’s use some data on abalone hosted by the UCI Machine Learning Repository to demonstrate the formatting that was set in the startup file. The data will truncate at 14 rows with 4 digits of precision for floats:

In [14]:
url = ('https://archive.ics.uci.edu/ml/'
       'machine-learning-databases/abalone/abalone.data')
cols = ['sex', 'length', 'diam', 'height', 'weight', 'rings']
abalone = pd.read_csv(url, usecols=[0, 1, 2, 3, 4, 8], names=cols)

abalone

Unnamed: 0,sex,length,diam,height,weight,rings
0,M,0.455,0.365,0.095,0.5140,15
1,M,0.350,0.265,0.090,0.2255,7
2,F,0.530,0.420,0.135,0.6770,9
3,M,0.440,0.365,0.125,0.5160,10
4,I,0.330,0.255,0.080,0.2050,7
...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,11
4173,M,0.590,0.440,0.135,0.9660,10
4174,M,0.600,0.475,0.205,1.1760,9
4175,F,0.625,0.485,0.150,1.0945,10
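
If you would rather not change the global configuration, one option is pd.option_context(), which applies settings only inside a with block. Below is a minimal sketch, assuming a hypothetical 30-row frame named demo in place of the abalone data:

```python
import pandas as pd

# Hypothetical data: 30 rows of repeating decimals.
demo = pd.DataFrame({'x': [i / 7 for i in range(30)]})

before = pd.get_option('display.max_rows')

# Options passed here apply only inside the with block.
with pd.option_context('display.max_rows', 14, 'display.precision', 4):
    text = repr(demo)  # Truncated repr, floats shown with 4 decimals

print(text)

# The previous setting is restored once the block exits.
after = pd.get_option('display.max_rows')
```

This is handy for checking what a startup option does before committing it to the startup file.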


## 2. Make Toy Data Structures With Pandas’ Testing Module

Note: The pandas.util.testing module was deprecated in Pandas 1.0. The “public testing API” from pandas.testing is now limited to assert_extension_array_equal(), assert_frame_equal(), assert_series_equal(), and assert_index_equal(). The author admits that he gets a taste of his own medicine for relying on undocumented portions of the Pandas library.
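
The public assertion helpers named in the note can still be useful in small checks. Here is a short sketch, assuming two hypothetical frames left and right; the helpers return None when the objects match and raise AssertionError otherwise:

```python
import pandas as pd
from pandas.testing import assert_frame_equal, assert_series_equal

# Hypothetical frames used only for illustration.
left = pd.DataFrame({'a': [1, 2], 'b': [0.1, 0.2]})
right = left.copy()

# Passes silently because the frames are identical.
assert_frame_equal(left, right)

# A mismatch raises AssertionError with a description of the difference.
try:
    assert_series_equal(left['a'], left['a'] + 1)
    differs = False
except AssertionError:
    differs = True

print(differs)
```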

## 3. Take Advantage of Accessor Methods

Perhaps you’ve heard of the term accessor, which is somewhat like a getter (although getters and setters are used infrequently in Python). For our purposes here, you can think of a Pandas accessor as a property that serves as an interface to additional methods.

Pandas Series have four of them in the version used here:

In [17]:
pd.Series._accessors

{'cat', 'dt', 'sparse', 'str'}

Yes, that definition above is a mouthful, so let’s take a look at a few examples before discussing the internals.

.cat is for categorical data, .str is for string (object) data, .dt is for datetime-like data, and .sparse is for sparse data. Let’s start off with .str: imagine that you have some raw city/state/ZIP data as a single field within a Pandas Series.

**Pandas string methods are vectorized, meaning that they operate on the entire array without an explicit for-loop:**

In [18]:
addr = pd.Series([
    'Washington, D.C. 20003',
    'Brooklyn, NY 11211-1755',
    'Omaha, NE 68154',
    'Pittsburgh, PA 15211'
])

addr.str.upper()

0     WASHINGTON, D.C. 20003
1    BROOKLYN, NY 11211-1755
2            OMAHA, NE 68154
3       PITTSBURGH, PA 15211
dtype: object

In [19]:
addr.str.count(r'\d')  # 5 or 9-digit zip?

0    5
1    9
2    5
3    5
dtype: int64

You can pass a regular expression to .str.extract() to “extract” parts of each cell in the Series. In .str.extract(), .str is the accessor, and .str.extract() is an accessor method:

In [20]:
regex = (r'(?P<city>[A-Za-z ]+), '      # One or more letters
         r'(?P<state>[A-Z]{2}) '        # 2 capital letters
         r'(?P<zip>\d{5}(?:-\d{4})?)')  # Optional 4-digit extension

addr.str.replace('.', '', regex=False).str.extract(regex)


Unnamed: 0,city,state,zip
0,Washington,DC,20003
1,Brooklyn,NY,11211-1755
2,Omaha,NE,68154
3,Pittsburgh,PA,15211


This also illustrates what is known as method-chaining, where .str.extract(regex) is called on the result of addr.str.replace('.', ''), which cleans up use of periods to get a nice 2-character state abbreviation.

It’s helpful to know a tiny bit about how these accessor methods work as a motivating reason for why you should use them in the first place, rather than something like addr.apply(re.findall, ...).

Each accessor is itself a bona fide Python class:

.str maps to StringMethods.
.dt maps to CombinedDatetimelikeProperties.
.cat routes to CategoricalAccessor.

These standalone classes are then “attached” to the Series class using a CachedAccessor. It is when the classes are wrapped in CachedAccessor that a bit of magic happens.
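
One way to confirm that each accessor is an ordinary class is to inspect the type of the object you get back. This is only a sketch with three hypothetical Series. The exact class returned for .dt depends on the underlying data, so no output is asserted here:

```python
import pandas as pd

# Hypothetical Series, one per accessor.
text = pd.Series(['a', 'b'])
dates = pd.Series(pd.to_datetime(['2017-03-31', '2017-12-31']))
labels = pd.Series(['x', 'y', 'x'], dtype='category')

# Look up the class name behind each accessor attribute.
accessor_types = {
    'str': type(text.str).__name__,
    'dt': type(dates.dt).__name__,
    'cat': type(labels.cat).__name__,
}
print(accessor_types)
```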

The second accessor, .dt, is for datetime-like data. It technically belongs to Pandas’ DatetimeIndex, and if called on a Series, it is converted to a DatetimeIndex first:

In [21]:
daterng = pd.Series(pd.date_range('2017', periods=9, freq='Q'))
daterng

0   2017-03-31
1   2017-06-30
2   2017-09-30
3   2017-12-31
4   2018-03-31
5   2018-06-30
6   2018-09-30
7   2018-12-31
8   2019-03-31
dtype: datetime64[ns]

In [22]:
daterng.dt.day_name()

0      Friday
1      Friday
2    Saturday
3      Sunday
4    Saturday
5    Saturday
6      Sunday
7      Monday
8      Sunday
dtype: object

In [23]:
# Second-half of year only
daterng[daterng.dt.quarter > 2]

2   2017-09-30
3   2017-12-31
6   2018-09-30
7   2018-12-31
dtype: datetime64[ns]

In [24]:
daterng[daterng.dt.is_year_end]

3   2017-12-31
7   2018-12-31
dtype: datetime64[ns]

## 4. Create a DatetimeIndex From Component Columns

In [29]:
# Speaking of datetime-like data, as in daterng above, it’s possible to create a Pandas DatetimeIndex from multiple component columns that together form a date or datetime:

from itertools import product

import numpy as np  # Needed for np.random.randn() below
datecols = ['year', 'month', 'day']

df = pd.DataFrame(list(product([2017, 2016], [1, 2], [1, 2, 3])),
                  columns=datecols)
df['data'] = np.random.randn(len(df))
df

Unnamed: 0,year,month,day,data
0,2017,1,1,-1.6734
1,2017,1,2,0.3704
2,2017,1,3,-1.2323
3,2017,2,1,0.6782
4,2017,2,2,-1.215
5,2017,2,3,-1.5603
6,2016,1,1,0.4666
7,2016,1,2,0.9685
8,2016,1,3,0.955
9,2016,2,1,0.5436


In [30]:
df.index = pd.to_datetime(df[datecols])
df.head()

Unnamed: 0,year,month,day,data
2017-01-01,2017,1,1,-1.6734
2017-01-02,2017,1,2,0.3704
2017-01-03,2017,1,3,-1.2323
2017-02-01,2017,2,1,0.6782
2017-02-02,2017,2,2,-1.215


In [31]:
# Finally, you can drop the old individual columns and convert to a Series:

df = df.drop(datecols, axis=1).squeeze()
df.head()

2017-01-01   -1.6734
2017-01-02    0.3704
2017-01-03   -1.2323
2017-02-01    0.6782
2017-02-02   -1.2150
Name: data, dtype: float64

In [33]:
df.index

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-02-01',
               '2017-02-02', '2017-02-03', '2016-01-01', '2016-01-02',
               '2016-01-03', '2016-02-01', '2016-02-02', '2016-02-03'],
              dtype='datetime64[ns]', freq=None)
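
The same assembly works when the component columns include a time part. The sketch below uses a hypothetical frame named parts with an extra hour column. The column names must be ones that pd.to_datetime() recognizes, such as year, month, day, and hour:

```python
import pandas as pd

# Hypothetical component columns plus one data column.
parts = pd.DataFrame({
    'year': [2017, 2017, 2016],
    'month': [1, 2, 12],
    'day': [3, 4, 31],
    'hour': [9, 18, 23],
    'reading': [0.5, 1.5, 2.5],
})

# Assemble one timestamp per row from the component columns.
stamps = pd.to_datetime(parts[['year', 'month', 'day', 'hour']])

# Use the timestamps as the index of a Series holding the data.
readings = pd.Series(parts['reading'].to_numpy(),
                     index=pd.DatetimeIndex(stamps), name='reading')
print(readings)
```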

## 5. Use Categorical Data to Save on Time and Space

In [34]:
colors = pd.Series([
    'periwinkle',
    'mint green',
    'burnt orange',
    'periwinkle',
    'burnt orange',
    'rose',
    'rose',
    'mint green',
    'rose',
    'navy'
])

import sys
colors.apply(sys.getsizeof)

0    59
1    59
2    61
3    59
4    61
5    53
6    53
7    59
8    53
9    53
dtype: int64

In [35]:
mapper = {v: k for k, v in enumerate(colors.unique())}
mapper

{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}

In [36]:
as_int = colors.map(mapper)
as_int

0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int64

In [37]:
as_int.apply(sys.getsizeof)

0    24
1    28
2    28
3    24
4    28
5    28
6    28
7    28
8    28
9    28
dtype: int64

In [39]:
# Another way to do this same thing is with Pandas’ pd.factorize(colors):

pd.factorize(colors)[0]


array([0, 1, 2, 0, 2, 3, 3, 1, 3, 4])

In [41]:
# Not a huge space-saver to encode as Categorical
colors.memory_usage(index=False, deep=True)

650

In [42]:
colors.astype('category').memory_usage(index=False, deep=True)

507

However, if you blow out the proportion above, with a lot of data and few unique values (think about data on demographics or alphabetic test scores), the reduction in memory required is over 10 times:

In [43]:
manycolors = colors.repeat(10)
len(manycolors) / manycolors.nunique()  # Much greater than 2.0x

20.0

In [44]:
manycolors.memory_usage(index=False, deep=True)

6500

In [45]:
manycolors.astype('category').memory_usage(index=False, deep=True)

597
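
Under the hood, a categorical Series stores small integer codes plus a lookup of the unique categories, much like the mapper built by hand above. Here is a small sketch on a shortened, hypothetical version of colors. One difference from the hand-built mapper is that the inferred categories are sorted rather than kept in order of appearance:

```python
import pandas as pd

# Shortened, hypothetical version of the colors Series.
colors = pd.Series(['periwinkle', 'mint green', 'burnt orange',
                    'periwinkle', 'rose'])
ccolors = colors.astype('category')

# The unique values (sorted) and the integer code stored for each row.
categories = list(ccolors.cat.categories)
codes = ccolors.cat.codes.tolist()
print(categories)
print(codes)

# With many repeats, the categorical version needs far less memory.
many = colors.repeat(100)
object_bytes = many.memory_usage(index=False, deep=True)
category_bytes = many.astype('category').memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```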

## 6. Introspect Groupby Objects via Iteration

When you call df.groupby('x'), the resulting Pandas groupby object can be a bit opaque. This object is lazily instantiated and doesn’t have any meaningful representation on its own.

You can demonstrate with the abalone dataset from example 1:

In [46]:
abalone['ring_quartile'] = pd.qcut(abalone.rings, q=4, labels=range(1, 5))
grouped = abalone.groupby('ring_quartile')

grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f70f01fd790>

Alright, now you have a groupby object, but what is this thing, and how do you see it?

Before you call something like grouped.apply(func), you can take advantage of the fact that groupby objects are iterable:

In [47]:
help(grouped.__iter__)

Help on method __iter__ in module pandas.core.groupby.groupby:

__iter__() -> 'Iterator[tuple[Hashable, FrameOrSeries]]' method of pandas.core.groupby.generic.DataFrameGroupBy instance
    Groupby iterator.
    
    Returns
    -------
    Generator yielding sequence of (name, subsetted object)
    for each group



Each “thing” yielded by grouped.__iter__() is a tuple of (name, subsetted object), where name is the value of the column on which you’re grouping, and subsetted object is a DataFrame that is a subset of the original DataFrame based on whatever grouping condition you specify. That is, the data gets chunked by group:

In [48]:
for idx, frame in grouped:
    print(f'Ring quartile: {idx}')
    print('-' * 16)
    print(frame.nlargest(3, 'weight'), end='\n\n')

Ring quartile: 1
----------------
     sex  length   diam  height  weight  rings ring_quartile
2619   M   0.690  0.540   0.185  1.7100      8             1
1044   M   0.690  0.525   0.175  1.7005      8             1
1026   M   0.645  0.520   0.175  1.5610      8             1

Ring quartile: 2
----------------
     sex  length  diam  height  weight  rings ring_quartile
2811   M   0.725  0.57   0.190  2.3305      9             2
1426   F   0.745  0.57   0.215  2.2500      9             2
1821   F   0.720  0.55   0.195  2.0730      9             2

Ring quartile: 3
----------------
     sex  length  diam  height  weight  rings ring_quartile
1209   F   0.780  0.63   0.215   2.657     11             3
1051   F   0.735  0.60   0.220   2.555     11             3
3715   M   0.780  0.60   0.210   2.548     11             3

Ring quartile: 4
----------------
     sex  length   diam  height  weight  rings ring_quartile
891    M   0.730  0.595    0.23  2.8255     17             4
1763   M   0.775  0.630    0.25  2.7795     12             4
165    M   0.725  0.570    0.19  2.5500     14             4

In [57]:
#BK: Comprehension solution
[print(idx, i.nlargest(3, 'weight'), "\n") for idx,i in grouped];

1      sex  length   diam  height  weight  rings ring_quartile
2619   M   0.690  0.540   0.185  1.7100      8             1
1044   M   0.690  0.525   0.175  1.7005      8             1
1026   M   0.645  0.520   0.175  1.5610      8             1 

2      sex  length  diam  height  weight  rings ring_quartile
2811   M   0.725  0.57   0.190  2.3305      9             2
1426   F   0.745  0.57   0.215  2.2500      9             2
1821   F   0.720  0.55   0.195  2.0730      9             2 

3      sex  length  diam  height  weight  rings ring_quartile
1209   F   0.780  0.63   0.215   2.657     11             3
1051   F   0.735  0.60   0.220   2.555     11             3
3715   M   0.780  0.60   0.210   2.548     11             3 

4      sex  length   diam  height  weight  rings ring_quartile
891    M   0.730  0.595    0.23  2.8255     17             4
1763   M   0.775  0.630    0.25  2.7795     12             4
165    M   0.725  0.570    0.19  2.5500     14             4 



In [58]:
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

In [59]:
# Relatedly, a groupby object also has .groups and a group-getter, .get_group():
grouped.groups.keys()

dict_keys([1, 2, 3, 4])

In [60]:
grouped.get_group(2).head()

Unnamed: 0,sex,length,diam,height,weight,rings,ring_quartile
2,F,0.53,0.42,0.135,0.677,9,2
8,M,0.475,0.37,0.125,0.5095,9,2
19,M,0.45,0.32,0.1,0.381,9,2
23,F,0.55,0.415,0.135,0.7635,9,2
39,M,0.355,0.29,0.09,0.3275,9,2


In [61]:
dir(grouped)

['__annotations__',
 '__class__',
 '__class_getitem__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__orig_bases__',
 '__parameters__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_agg_examples_doc',
 '_agg_general',
 '_agg_py_fallback',
 '_aggregate_frame',
 '_aggregate_item_by_item',
 '_aggregate_with_numba',
 '_apply_allowlist',
 '_apply_filter',
 '_apply_to_column_groupbys',
 '_bool_agg',
 '_cache',
 '_can_use_transform_fast',
 '_choose_path',
 '_concat_objects',
 '_constructor',
 '_cumcount_array',
 '_cython_agg_general',
 '_cython_transform',
 '_define_paths',
 '_dir_additions',
 '_dir_deletions',
 '_fill',
 '_get_cythonized_res

In [62]:
# This can help you be a little more confident that the operation you’re performing is the one you want:

grouped[['height', 'weight']].agg(['mean', 'median'])


,height,height,weight,weight
ring_quartile,mean,median,mean,median
1,0.1066,0.105,0.4324,0.3685
2,0.1427,0.145,0.852,0.844
3,0.1572,0.155,1.0669,1.0645
4,0.1648,0.165,1.1149,1.0655


No matter what calculation you perform on grouped, be it a single Pandas method or custom-built function, each of these “sub-frames” is passed one-by-one as an argument to that callable. This is where the term “split-apply-combine” comes from: break the data up by groups, perform a per-group calculation, and recombine in some aggregated fashion.

If you’re having trouble visualizing exactly what the groups will actually look like, simply iterating over them and printing a few can be tremendously useful.
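
To make split-apply-combine concrete, the sketch below performs the three steps by hand on a tiny hypothetical frame and compares the result with the built-in aggregation. The names scores, team, and score are made up for the example:

```python
import pandas as pd

# Hypothetical data: two teams with a handful of scores each.
scores = pd.DataFrame({
    'team': ['a', 'b', 'a', 'b', 'a'],
    'score': [3, 8, 5, 1, 9],
})
grouped = scores.groupby('team')

# Split: iterate over (name, sub-frame) pairs.
# Apply: compute the maximum score in each sub-frame.
# Combine: collect the per-group results in a dict.
manual = {name: int(frame['score'].max()) for name, frame in grouped}

# The built-in aggregation performs the same steps internally.
builtin = grouped['score'].max().to_dict()
print(manual, builtin)
```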

## 7. Use This Mapping Trick for Membership Binning

Let’s say that you have a Series and a corresponding “mapping table” where each value belongs to a multi-member group, or to no groups at all:

In [64]:
countries = pd.Series([
    'United States',
    'Canada',
    'Mexico',
    'Belgium',
    'United Kingdom',
    'Thailand'
])

groups = {
    'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
    'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium')}

In other words, you need to map countries to the following result:

0    North America
1    North America
2    North America
3           Europe
4           Europe
5            other

In [65]:
# What you need here is a function similar to Pandas’ pd.cut(), but for binning based on categorical membership. You can use pd.Series.map(), which you already saw in example #5, to mimic this:

from typing import Any

def membership_map(s: pd.Series, groups: dict,
                   fillvalue: Any = -1) -> pd.Series:
    # Reverse & expand the dictionary key-value pairs
    groups = {x: k for k, v in groups.items() for x in v}
    return s.map(groups).fillna(fillvalue)

This should be significantly faster than a nested Python loop through groups for each country in countries.

Here’s a test drive:

In [66]:
membership_map(countries, groups, fillvalue='other')

0    North America
1    North America
2    North America
3           Europe
4           Europe
5            other
dtype: object

Let’s break down what’s going on here. (Sidenote: this is a great place to step into a function’s scope with Python’s debugger, pdb, to inspect what variables are local to the function.)

To see how this works, take a toy example in which each group in groups is keyed by an integer. Series.map() will not recognize a multi-member value such as 'ab'. It needs the broken-out version, with each character from each group mapped to that group’s integer. This is what the dictionary comprehension is doing:

In [67]:
groups = dict(enumerate(('ab', 'cd', 'xyz')))
{x: k for k, v in groups.items() for x in v}

{'a': 0, 'b': 0, 'c': 1, 'd': 1, 'x': 2, 'y': 2, 'z': 2}

This dictionary can be passed to s.map() to map or “translate” its values to their corresponding group indices.
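
For comparison, here is a sketch of the nested-loop approach that the mapping trick replaces, run next to the dictionary-based version on the same countries and groups. The helper name membership_loop is hypothetical and exists only for this comparison:

```python
import pandas as pd

countries = pd.Series(['United States', 'Canada', 'Mexico',
                       'Belgium', 'United Kingdom', 'Thailand'])
groups = {
    'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
    'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium'),
}

def membership_loop(values, groups, fillvalue='other'):
    """Hypothetical nested-loop equivalent of the mapping trick."""
    out = []
    for value in values:
        for name, members in groups.items():
            if value in members:
                out.append(name)
                break
        else:
            # No group contained this value.
            out.append(fillvalue)
    return pd.Series(out, index=values.index)

# Mapping trick: invert the dictionary once, then do a single .map().
lookup = {member: name for name, members in groups.items()
          for member in members}
mapped = countries.map(lookup).fillna('other')
looped = membership_loop(countries, groups)
print(mapped.tolist())
```

Both versions give the same labels. The mapped version leaves the per-element lookup to Pandas instead of a Python-level double loop.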

## 9. Load Data From the Clipboard

It’s a common situation to need to transfer data from a place like Excel or Sublime Text to a Pandas data structure. Ideally, you want to do this without going through the intermediate step of saving the data to a file and afterwards reading in the file to Pandas.

You can load in DataFrames from your computer’s clipboard data buffer with pd.read_clipboard(). Its keyword arguments are passed on to pd.read_csv().

This allows you to copy structured text directly to a DataFrame or Series. In Excel, the data would look something like this:

[Image: Excel Clipboard Data]

Its plain-text representation (for example, in a text editor) would look like this:



a   b           c       d
0   1           inf     1/1/00
2   7.389056099 N/A     5-Jan-13
4   54.59815003 nan     7/24/18
6   403.4287935 None    NaT

In [69]:
df = pd.read_clipboard(na_values=[None], parse_dates=['d'])
df

Unnamed: 0,a,b,c,d
0,0,1.0,inf,2000-01-01
1,2,7.3891,,2013-01-05
2,4,54.5982,,2018-07-24
3,6,403.4288,,NaT


In [70]:
df.dtypes

a             int64
b           float64
c           float64
d    datetime64[ns]
dtype: object
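
Clipboard access is not always available, for example on a headless server. As a stand-in, the sketch below parses similar whitespace-separated text from a string with pd.read_csv(). It is an approximation rather than a call to pd.read_clipboard() itself: the date column is left out, and the string 'None' is passed to na_values instead of the Python None used above:

```python
import io

import pandas as pd

# Text as it might look after copying from a spreadsheet.
raw = """
a   b           c
0   1           inf
2   7.389056099 N/A
4   54.59815003 nan
6   403.4287935 None
""".strip()

# Split on runs of whitespace and treat the string 'None' as missing.
table = pd.read_csv(io.StringIO(raw), sep=r'\s+', na_values=['None'])
print(table)
print(table.dtypes)
```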

## 10. Write Pandas Objects Directly to Compressed Format

This one’s short and sweet to round out the list. As of Pandas version 0.21.0, you can write Pandas objects directly to gzip, bz2, zip, or xz compression, rather than stashing the uncompressed file in memory and converting it. Here’s an example using the abalone data from trick #1:

In [71]:
abalone.to_json('df.json.gz', orient='records',
                lines=True, compression='gzip')

In [72]:
import os.path
abalone.to_json('df.json', orient='records', lines=True)
os.path.getsize('df.json') / os.path.getsize('df.json.gz')


11.603064345539263
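
The compressed file can be read back directly as well. Here is a self-contained sketch that uses a temporary directory and a hypothetical frame instead of the abalone data, so the ratio it prints will differ from the one above:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with repetitive content that compresses well.
frame = pd.DataFrame({'sex': ['M', 'F', 'I'] * 500,
                      'rings': list(range(1500))})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'frame.json')
    packed = os.path.join(tmp, 'frame.json.gz')

    frame.to_json(plain, orient='records', lines=True)
    frame.to_json(packed, orient='records', lines=True, compression='gzip')

    # How many times larger the uncompressed file is.
    ratio = os.path.getsize(plain) / os.path.getsize(packed)

    # Read the gzip file back without unpacking it first.
    restored = pd.read_json(packed, orient='records', lines=True,
                            compression='gzip')

print(ratio)
print(restored.head())
```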

In [74]:
ll

total 682
-rwxrwxrwx 1 bk    653 Jan 25 15:53 CI_with_python.py*
-rwxrwxrwx 1 bk  60993 Jul 13 14:50 Untitled.ipynb*
drwxrwxrwx 1 bk    512 Feb  1 08:50 __pycache__/
-rwxrwxrwx 1 bk    505 May  4 12:12 arg_unpacking.py*
-rwxrwxrwx 1 bk    645 Dec 11  2020 best_practices.py*
-rwxrwxrwx 1 bk   2244 Dec  1  2020 compr.py*
-rwxrwxrwx 1 bk     35 Dec  1  2020 data.txt*
-rwxrwxrwx 1 bk   1612 Jan  6  2021 decorators.py*
-rwxrwxrwx 1 bk  24705 Jan  7  2021 decorators_tutorial.py*
-rwxrwxrwx 1 bk   6393 Dec 22  2020 default_dict.py*
-rwxrwxrwx 1 bk   7633 Dec 14  2020 descriptors.py*
-rwxrwxrwx 1 bk 405910 Jul 13 14:50 df.json*
-rwxrwxrwx 1 bk  34983 Jul 13 14:50 df.json.gz*
-rwxrwxrwx 1 bk   8647 Dec 22  2020 dicts.py*
-rwxrwxrwx 1 bk   4134 May  3 06:47 enumerate.py*
-rwxrwxrwx 1 bk   2237 Dec  