# Data Cleaning and Preparation

In [1]:
import numpy as np
import pandas as pd

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

In [2]:
float_data = pd.Series([1.2, -3, 5, np.nan, 0])
float_data

0    1.2
1   -3.0
2    5.0
3    NaN
4    0.0
dtype: float64

In [3]:
# the isna method gives back a Boolean Series with True where values are null
float_data.isna()

0    False
1    False
2    False
3     True
4    False
dtype: bool
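
To make the claim about descriptive statistics concrete, here is a small sketch that reuses the `float_data` values from above. The variable names `mean_default`, `mean_strict`, `n_observed`, and `n_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

float_data = pd.Series([1.2, -3, 5, np.nan, 0])

# By default the NaN is skipped, so the mean is taken over the 4 observed values
mean_default = float_data.mean()

# Passing skipna=False propagates the missing value instead
mean_strict = float_data.mean(skipna=False)

# count() reports non-NA values; isna().sum() reports the missing ones
n_observed = float_data.count()
n_missing = float_data.isna().sum()

print(mean_default, mean_strict, n_observed, n_missing)
```

The `skipna=False` form can be useful when a missing input should invalidate the summary rather than be silently ignored.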

The pandas community has adopted the convention used in R by referring to missing data as NA, which stands for *not available*. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed.
When cleaning up data for analysis, it is important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
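
One simple way to start analyzing the missing data itself is to count NA values per column and per row. The sketch below uses a small made-up `survey` table (the column names and values are hypothetical, not from any dataset in this chapter).

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented purely for illustration
survey = pd.DataFrame(
    {
        "age": [34, np.nan, 51, 29, np.nan],
        "income": [52000, 61000, np.nan, 48000, 75000],
        "city": ["Oslo", "Lima", None, "Pune", "Lima"],
    }
)

# Number of missing values in each column
na_counts = survey.isna().sum()

# Fraction of missing values in each column (mean of a Boolean column)
na_share = survey.isna().mean()

# Number of rows that have at least one missing value
rows_with_na = survey.isna().any(axis="columns").sum()

print(na_counts)
print(na_share)
print(rows_with_na)
```

A column with an unusually high share of NA values is often a hint of a collection problem worth investigating before dropping or filling anything.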

In [4]:
# the built-in Python None value is also treated as NA
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data

0    aardvark
1         NaN
2        None
3     avocado
dtype: object

In [5]:
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

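There are a few ways to filter out missing data. You always have the option to do it by hand using pandas.isna and Boolean indexing, but dropna is often more convenient. On a Series, it returns the Series with only the nonnull data and index values.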

In [6]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [7]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects there are different ways to remove missing data. You may want to drop rows or columns that are all NA values, or only those containing any NAs at all. dropna by default drops any row containing a missing value.

In [8]:
data = pd.DataFrame(
    [
        [1.0, 6.5, 3.0],
        [1.0, np.nan, np.nan],
        [np.nan, np.nan, np.nan],
        [np.nan, 6.5, 3.0],
    ]
)
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [9]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [10]:
# how="all" will drop only rows that are all NA
data.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Keep in mind that these functions return new objects by default and do not modify the contents of the original object.

In [11]:
# drop columns with all NA values
data[4] = np.nan
data.dropna(axis="columns", how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Suppose you want to keep only rows containing at least a certain number of non-missing observations. To do this, use the thresh argument, which gives the minimum number of non-NA values a row must have to be kept.

In [12]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,1.00096,,
1,0.099213,,
2,0.225933,,-0.221745
3,0.234974,,0.623302
4,0.72915,1.152723,0.433806
5,0.889571,0.831591,-0.099884
6,1.726128,-0.247744,-0.164434


In [13]:
df.dropna()

Unnamed: 0,0,1,2
4,0.72915,1.152723,0.433806
5,0.889571,0.831591,-0.099884
6,1.726128,-0.247744,-0.164434


In [14]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.225933,,-0.221745
3,0.234974,,0.623302
4,0.72915,1.152723,0.433806
5,0.889571,0.831591,-0.099884
6,1.726128,-0.247744,-0.164434
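
Because the random values above make the rule hard to see, here is a minimal deterministic sketch of how `thresh` behaves. The `frame` values are made up so that each row has a known number of non-NA entries.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [
        [1.0, np.nan, np.nan],  # 1 non-NA value
        [2.0, np.nan, 3.0],     # 2 non-NA values
        [4.0, 5.0, 6.0],        # 3 non-NA values
    ]
)

# Keep rows with at least 2 non-NA values
at_least_two = frame.dropna(thresh=2)

# Requiring all 3 values to be present matches the default dropna()
complete = frame.dropna(thresh=3)

print(at_least_two)
print(complete)
# frame itself is unchanged, since dropna returns a new object
print(len(frame))
```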


Sometimes you may want to fill in missing data instead of filtering it out. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value.

In [15]:
df.fillna(0)

Unnamed: 0,0,1,2
0,1.00096,0.0,0.0
1,0.099213,0.0,0.0
2,0.225933,0.0,-0.221745
3,0.234974,0.0,0.623302
4,0.72915,1.152723,0.433806
5,0.889571,0.831591,-0.099884
6,1.726128,-0.247744,-0.164434


In [16]:
# fillna with a dictionary uses a different fill value for each column
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,1.00096,0.5,0.0
1,0.099213,0.5,0.0
2,0.225933,0.5,-0.221745
3,0.234974,0.5,0.623302
4,0.72915,1.152723,0.433806
5,0.889571,0.831591,-0.099884
6,1.726128,-0.247744,-0.164434


The same interpolation methods available for reindexing can be used with fillna.

In [17]:
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.14644,-1.041718,0.465934
1,0.186127,0.023918,-0.886531
2,1.582528,,-0.158878
3,-0.03944,,-0.094606
4,-0.360787,,
5,1.301278,,


In [18]:
df.fillna(method="ffill")

Unnamed: 0,0,1,2
0,0.14644,-1.041718,0.465934
1,0.186127,0.023918,-0.886531
2,1.582528,0.023918,-0.158878
3,-0.03944,0.023918,-0.094606
4,-0.360787,0.023918,-0.094606
5,1.301278,0.023918,-0.094606


In [19]:
df.fillna(method="ffill", limit=2)

Unnamed: 0,0,1,2
0,0.14644,-1.041718,0.465934
1,0.186127,0.023918,-0.886531
2,1.582528,0.023918,-0.158878
3,-0.03944,0.023918,-0.094606
4,-0.360787,,-0.094606
5,1.301278,,-0.094606
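
In more recent pandas releases, passing `method=` to `fillna` emits a deprecation warning (to my knowledge starting around pandas 2.1), and the dedicated `ffill` and `bfill` methods are the suggested spelling. The following sketch, with made-up values, shows the equivalent calls.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],
        "b": [np.nan, 2.0, np.nan, 4.0],
    }
)

# Forward fill: carry the last observed value downward
filled = frame.ffill()

# Forward fill at most 2 consecutive missing values
limited = frame.ffill(limit=2)

# Backward fill: pull the next observed value upward
back = frame.bfill()

print(filled)
print(limited)
print(back)
```

Note that a leading NA (as in column `b`) has nothing to carry forward, so it stays missing after `ffill`.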


With fillna you can do lots of other things such as simple data imputation using the median or mean statistics.

In [20]:
data = pd.Series([1.0, np.nan, 3.5, np.nan, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
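
The same idea extends to a DataFrame: passing a Series of per-column statistics to `fillna` fills each column with its own value. This is a sketch with made-up data, using the median rather than the mean.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "x": [1.0, np.nan, 3.0, 10.0],
        "y": [np.nan, 2.0, 4.0, np.nan],
    }
)

# Column medians are computed over the non-missing values only
medians = frame.median()

# Each column's NA values are replaced by that column's median
imputed = frame.fillna(medians)

print(medians)
print(imputed)
```

The median is less sensitive than the mean to extreme values such as the `10.0` in column `x`, which is one reason it is a common choice for simple imputation.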

In [21]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"], "k2": [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [22]:
# returns a Boolean Series indicating whether each row is a duplicate or not
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [23]:
# drop duplicates
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [24]:
data["v1"] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [25]:
# drop duplicates based on a single column
data.drop_duplicates(subset=["k1"], keep="last")

Unnamed: 0,k1,k2,v1
4,one,3,4
6,two,4,6
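
Both `duplicated` and `drop_duplicates` keep the first occurrence by default. As a small sketch on the same `k1`/`k2` data, passing `keep=False` instead treats every member of a duplicate group as a duplicate, which is handy for inspecting them.

```python
import pandas as pd

data = pd.DataFrame(
    {"k1": ["one", "two"] * 3 + ["two"], "k2": [1, 1, 2, 3, 3, 4, 4]}
)

# Number of rows that repeat an earlier row
n_repeats = data.duplicated().sum()

# All rows that belong to a duplicate group, including the first occurrence
all_copies = data[data.duplicated(keep=False)]

# Drop every row that has a duplicate anywhere in the data
only_unique = data.drop_duplicates(keep=False)

print(n_repeats)
print(all_copies)
print(only_unique)
```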


In [26]:
data = pd.DataFrame(
    {
        "food": [
            "bacon",
            "pulled pork",
            "bacon",
            "pastrami",
            "corned beef",
            "bacon",
            "pastrami",
            "honey ham",
            "nova lox",
        ],
        "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6],
    }
)
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [27]:
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon",
}

The map method on a Series accepts a function or dictionary-like containing a mapping to do the transformation of values.

In [28]:
data["animal"] = data["food"].map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
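
To illustrate the difference between passing a dictionary and passing a function, here is a sketch using a few made-up food names. The helper `get_animal` and the `"unknown"` fallback are illustrative choices, not part of pandas.

```python
import pandas as pd

meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon",
}

food = pd.Series(["bacon", "Pastrami", "nova lox", "tofu"])

# With a dictionary, values without a matching key become NaN
by_dict = food.map(meat_to_animal)


def get_animal(name):
    """Look up a food case-insensitively, with a fallback label."""
    return meat_to_animal.get(name.lower(), "unknown")


# With a function, each value is passed through the function
by_func = food.map(get_animal)

print(by_dict)
print(by_func)
```

The dictionary form is concise, while the function form leaves room for normalization such as lowercasing before the lookup.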


In [29]:
# replace values
data = pd.Series([1.0, -999, 2.0, -999, -1000, 3.0])
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [30]:
# replace multiple values at once
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [31]:
# replace different values for different targets
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
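
The `replace` method also accepts a dictionary that pairs each target with its substitute, which some readers find easier to scan than two parallel lists. A short sketch on the same values:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, -999, 2.0, -999, -1000, 3.0])

# Keys are the values to look for, values are their replacements
result = data.replace({-999: np.nan, -1000: 0})

print(result)
```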

Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people and you want to divide them into discrete age buckets.

In [32]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older by using the pandas.cut function.

In [33]:
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages, bins)
age_categories

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object returned is a special Categorical object. The output describes the bins computed by pandas.cut. Each bin is identified by a special interval value type containing the lower and upper limit of each bin.

In [34]:
age_categories.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [35]:
age_categories.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [36]:
age_categories.categories[0]

Interval(18, 25, closed='right')

In [37]:
age_categories.value_counts()

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

In the string representation of an interval, a parenthesis means that the side is open (exclusive) while the square bracket means it is closed (inclusive).
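
Which side of each bin is closed can be changed with `right=False`, and the bins can be given names with the `labels` argument. The sketch below reuses the `ages` and `bins` from above; the label names in `group_names` are made up for illustration.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# right=False makes the bins closed on the left: [18, 25), [25, 35), ...
left_closed = pd.cut(ages, bins, right=False)

# Custom bin names instead of interval labels
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
named = pd.cut(ages, bins, labels=group_names)

print(left_closed.categories)
print(left_closed.codes)
print(named)
```

With `right=False`, the age 25 moves from the first bin to the second, because 25 is now the inclusive lower edge of `[25, 35)`.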

A closely related function, pandas.qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins.

In [38]:
data = np.random.standard_normal(1000)
quartiles = pd.qcut(data, 4, precision=2)
quartiles

[(0.69, 3.99], (-3.32, -0.74], (-0.74, -0.0048], (-0.0048, 0.69], (0.69, 3.99], ..., (-0.0048, 0.69], (-3.32, -0.74], (-3.32, -0.74], (-3.32, -0.74], (0.69, 3.99]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.32, -0.74] < (-0.74, -0.0048] < (-0.0048, 0.69] < (0.69, 3.99]]

In [39]:
quartiles.value_counts()

(-3.32, -0.74]      250
(-0.74, -0.0048]    250
(-0.0048, 0.69]     250
(0.69, 3.99]        250
dtype: int64
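
Instead of a number of bins, `pandas.qcut` also accepts a list of quantiles between 0 and 1. The sketch below uses a seeded generator (the seed is an arbitrary choice) and counts the points per bin through the category codes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
data = rng.standard_normal(1000)

# Custom quantile edges: bottom 10%, next 40%, next 40%, top 10%
groups = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])

# Count how many points landed in each of the 4 bins
counts = np.bincount(groups.codes, minlength=4)

print(groups.categories)
print(counts)
```

The bin sizes follow the quantile spacing rather than being equal, so the two outer bins each hold about a tenth of the data.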

Filtering or transforming outliers is largely a matter of applying array operations.

In [40]:
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.058634,-0.002881,0.045835,-0.038504
std,0.982504,1.023025,1.002959,0.992858
min,-3.11827,-3.091501,-3.365934,-2.98797
25%,-0.78324,-0.709136,-0.624377,-0.726604
50%,-0.03373,-0.022512,0.041454,-0.032927
75%,0.638049,0.660383,0.727476,0.620111
max,2.992993,3.511558,3.132372,3.270636


In [41]:
# values in column 2 exceeding 3 in absolute value
col = data[2]
col[np.abs(col) > 3]

55     3.132372
184   -3.365934
233    3.057691
Name: 2, dtype: float64

In [42]:
# select all rows having a value exceeding 3 or -3
data[(data.abs() > 3).any(axis="columns")]

Unnamed: 0,0,1,2,3
55,-0.556277,0.428025,3.132372,-1.259748
145,0.473338,0.875659,0.502633,3.270636
150,-0.905916,3.120727,0.358988,0.699507
184,-0.540844,-0.434283,-3.365934,-1.626957
233,0.081579,0.102357,3.057691,0.988547
252,-0.636344,3.511558,0.689506,-0.638148
389,-1.069771,-3.091501,0.198449,0.968126
425,-3.11827,0.502086,0.521797,0.798297
526,-1.334398,0.022303,-0.853871,3.063907
651,-3.06703,0.89067,-1.531716,-0.444486


In [43]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,-1.0,1.0,-1.0
1,1.0,1.0,1.0,1.0
2,-1.0,1.0,1.0,-1.0
3,1.0,1.0,-1.0,-1.0
4,-1.0,1.0,1.0,-1.0
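
Since `np.sign(data)` produces 1 and -1 values based on the sign of each entry, it can be combined with a Boolean mask to cap values outside the interval -3 to 3. The following is a sketch on freshly generated data (so the numbers differ from the output above); `DataFrame.clip` is shown as an alternative that should give the same result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Work on a copy so the original values stay available
capped = data.copy()

# Where |value| > 3, replace it with +3 or -3 depending on its sign
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip every value into the range [-3, 3]
clipped = data.clip(lower=-3, upper=3)

print(capped.describe())
```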


Permuting a Series or the rows in a DataFrame is possible using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [44]:
df = pd.DataFrame(np.arange(5 * 7).reshape(5, 7))
df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [45]:
sampler = np.random.permutation(5)
sampler

array([0, 1, 4, 2, 3])

The sampler array can be used in iloc-based indexing or the equivalent take function.

In [46]:
df.take(sampler)

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
4,28,29,30,31,32,33,34
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27


In [47]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
4,28,29,30,31,32,33,34
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27


In [48]:
column_sampler = np.random.permutation(7)
column_sampler

array([0, 6, 5, 4, 2, 3, 1])

In [49]:
# permutation of the columns
df.take(column_sampler, axis="columns")

Unnamed: 0,0,6,5,4,2,3,1
0,0,6,5,4,2,3,1
1,7,13,12,11,9,10,8
2,14,20,19,18,16,17,15
3,21,27,26,25,23,24,22
4,28,34,33,32,30,31,29


In [50]:
# select a random subset without replacement
df.sample(n=3)

Unnamed: 0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6


In [51]:
# generate a sample with replacement
choices = pd.Series([5, 7, -1, 6, 4])
choices.sample(n=10, replace=True)

0    5
4    4
0    5
3    6
0    5
1    7
4    4
2   -1
2   -1
1    7
dtype: int64
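
Random samples differ from run to run unless a seed is fixed. As a small sketch, the `random_state` argument of `sample` makes the draw reproducible, and `frac=1` returns all rows in shuffled order. The seed values here are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 7).reshape(5, 7))

# The same random_state yields the same rows each time
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# frac=1 samples every row once, which amounts to a row permutation
shuffled = df.sample(frac=1, random_state=0)

print(first)
print(shuffled)
```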

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a *dummy* or *indicator matrix*.  
If a column in a DataFrame has `k` distinct values, you would derive a matrix or DataFrame with `k` columns containing all 1s and 0s. pandas has a `pandas.get_dummies` function for doing this, though you could also devise one yourself.

In [52]:
df = pd.DataFrame(
    {
        "key": ["b", "b", "a", "c", "a", "b"],
        "data1": range(6),
    }
)
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [53]:
pd.get_dummies(df["key"])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with other data.

In [54]:
dummies = pd.get_dummies(df["key"], prefix="key")
df_with_dummy = df[["data1"]].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DataFrame belongs to multiple categories, we have to use a different approach to create the dummy variables.

In [55]:
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table(
    "datasets/movielens/movies.dat",
    sep="::",
    header=None,
    names=mnames,
    engine="python",
)
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [56]:
dummies = movies["genres"].str.get_dummies("|")
dummies.iloc[:10, :6]

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime
0,0,0,1,1,1,0
1,0,1,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,0,1,0
4,0,0,0,0,1,0
5,1,0,0,0,0,1
6,0,0,0,0,1,0
7,0,1,0,1,0,0
8,1,0,0,0,0,0
9,1,1,0,0,0,0


In [57]:
movies_windic = movies.join(dummies.add_prefix("Genre_"))
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0, dtype: object

For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wraps the result in a DataFrame.
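
One possible shape for such a lower-level function is sketched below. The name `multi_dummies` is hypothetical, and the three genre strings are taken from the sample rows shown earlier rather than read from the full file. No performance claim is made here; the point is only to show the structure of writing into a preallocated array.

```python
import numpy as np
import pandas as pd


def multi_dummies(series, sep="|"):
    """Build a 0/1 indicator DataFrame for separator-delimited labels."""
    # Split each string into its list of labels using plain Python
    split = [value.split(sep) for value in series]

    # Sorted distinct labels define the columns
    categories = sorted({label for labels in split for label in labels})
    col_index = {label: j for j, label in enumerate(categories)}

    # Preallocate the indicator matrix and fill it in place
    out = np.zeros((len(split), len(categories)), dtype=np.int8)
    for i, labels in enumerate(split):
        for label in labels:
            out[i, col_index[label]] = 1

    return pd.DataFrame(out, index=series.index, columns=categories)


genres = pd.Series(
    [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ]
)
indicators = multi_dummies(genres)
print(indicators)
```

On this small input the result matches what `genres.str.get_dummies("|")` produces, apart from the smaller integer dtype.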

A useful recipe for statistical applications is to combine `pandas.get_dummies` with a discretization function like `pandas.cut`.

In [58]:
np.random.seed(12345)
values = np.random.uniform(size=10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [59]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


Frequently, a column in a table may contain repeated instances of a smaller set of distinct values.

In [61]:
values = pd.Series(["apple", "orange", "apple", "apple"] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [62]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [63]:
pd.value_counts(values)

apple     6
orange    2
dtype: int64

Many data systems have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called *dimension tables* containing the distinct values and to store the primary observations as integer keys referencing the dimension table.

In [65]:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(["apple", "orange"])

In [67]:
# the take method restores the original Series of strings.
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
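
Going in the other direction, from the original strings to integer keys plus a table of distinct values, can be done with `pandas.factorize`. This is a small sketch on the same fruit values:

```python
import pandas as pd

fruits = pd.Series(["apple", "orange", "apple", "apple"] * 2)

# codes holds one integer per element; uniques holds the distinct values
codes, uniques = pd.factorize(fruits)

# take reverses the encoding, as with dim.take(values) above
restored = uniques.take(codes)

print(codes)
print(uniques)
print(list(restored))
```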

This representation as integers is called the *categorical* or *dictionary-encoded* representation. The array of distinct values can be called the *categories*, *dictionary*, or *levels* of the data. The integer values that reference the categories are called the *category codes* or simply *codes*.  
The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:
- Renaming categories.
- Appending a new category without changing the order or position of the existing categories.
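
Both transformations can be sketched with the `rename_categories` and `add_categories` methods of a pandas `Categorical`. The new names used below (`"Apple"`, `"banana"`) are arbitrary examples.

```python
import pandas as pd

cat = pd.Categorical(["apple", "orange", "apple", "apple"])

# Renaming touches only the categories; the codes stay the same
renamed = cat.rename_categories({"apple": "Apple"})

# A new category is appended after the existing ones
extended = cat.add_categories(["banana"])

print(renamed.categories, renamed.codes)
print(extended.categories, extended.codes)
```

In both cases only the short array of categories changes, which is why these operations are cheap even when the data itself is long.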

pandas has a special `Categorical` extension type for holding data that uses the integer-based categorical representation or *encoding*. This is a popular data compression technique for data with many occurrences of similar values and can provide significantly faster performance with lower memory use, especially for string data.

In [69]:
fruits = ["apple", "orange", "apple", "apple"] * 4
N = len(fruits)
N

16

In [72]:
rng = np.random.default_rng(seed=12345)
df = pd.DataFrame(
    {
        "fruit": fruits,
        "basket_id": np.arange(N),
        "count": rng.integers(3, 15, size=N),
        "weight": rng.uniform(0, 4, size=N),
    },
    columns=["basket_id", "fruit", "count", "weight"],
)
df

Unnamed: 0,basket_id,fruit,count,weight
0,0,apple,11,2.691024
1,1,orange,5,3.767211
2,2,apple,12,0.992983
3,3,apple,6,3.795525
4,4,apple,5,2.66895
5,5,orange,12,0.383592
6,6,apple,10,1.767359
7,7,apple,11,3.54592
8,8,apple,14,2.789814
9,9,orange,7,1.305891


In [73]:
# convert df["fruit"] into a categorical
fruit_cat = df["fruit"].astype("category")
fruit_cat

0      apple
1     orange
2      apple
3      apple
4      apple
5     orange
6      apple
7      apple
8      apple
9     orange
10     apple
11     apple
12     apple
13    orange
14     apple
15     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

The `Categorical` object has `categories` and `codes` attributes.

In [74]:
c = fruit_cat.array
c.categories

Index(['apple', 'orange'], dtype='object')

In [75]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [76]:
# trick to map between codes and categories
dict(enumerate(c.categories))

{0: 'apple', 1: 'orange'}

Unless explicitly specified, categorical conversions assume no specific ordering of the categories. So the `categories` array may be in a different order depending on the ordering of the input data. When using `from_codes` or any of the other constructors, you can indicate that the categories have a meaningful ordering.

In [78]:
categories = ["foo", "bar", "baz"]
codes = [0, 1, 2, 0, 0, 1]

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
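
Once a categorical is ordered, sorting, comparisons, and `min`/`max` follow the category order rather than alphabetical order. A brief sketch using the same codes and categories:

```python
import pandas as pd

categories = ["foo", "bar", "baz"]
codes = [0, 1, 2, 0, 0, 1]
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

# Sorting follows foo < bar < baz, not alphabetical order
sorted_cat = ordered_cat.sort_values()

# Comparisons against a category are allowed for ordered categoricals
above_foo = ordered_cat > "foo"

# min and max are defined by the category order
smallest, largest = ordered_cat.min(), ordered_cat.max()

# as_unordered drops the ordering again
unordered = ordered_cat.as_unordered()

print(list(sorted_cat))
print(above_foo)
print(smallest, largest, unordered.ordered)
```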

Using `Categorical` in pandas generally behaves the same way as using the nonencoded version (such as an array of strings). Some parts of pandas, like the `groupby` function, perform better when working with categoricals. There are also some functions that can utilize the `ordered` flag.

In [79]:
N = 10_000_000
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4))
len(labels)

10000000

In [80]:
# convert labels to categorical
categories = labels.astype("category")

In [81]:
labels.memory_usage(deep=True)

600000128

In [82]:
categories.memory_usage(deep=True)

10000540

In [83]:
%time _ = labels.astype("category")

CPU times: user 383 ms, sys: 374 ms, total: 757 ms
Wall time: 751 ms


GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.

In [84]:
%timeit labels.value_counts()

385 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [85]:
%timeit categories.value_counts()

42.4 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
