# Data access and modification

## 1. Essential functionality

This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we’ll focus on the most important features, leaving the less common (i.e., more esoteric) things for you to explore on your own.

### Reindexing

An important method on pandas objects is `reindex`, which creates a new object with the data conformed to a new index. Consider an example:

In [None]:
import pandas as pd
import numpy as np

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [None]:
obj

Calling `reindex` on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [None]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [None]:
obj2

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The `method` option allows us to do this, using a method such as `ffill`, which forward-fills the values:

In [None]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [None]:
obj3

In [None]:
obj3.reindex(range(6), method='ffill')

With DataFrame, `reindex` can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                     index=['a', 'c', 'd'], 
                     columns=['Ohio', 'Texas', 'California'])

In [None]:
frame

In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [None]:
frame2

The columns can be reindexed with the `columns` keyword:

In [None]:
states = ['Texas', 'Utah', 'California']

In [None]:
frame.reindex(columns=states)

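As we’ll explore in more detail, you can also reindex by label-indexing with `loc`, and many users prefer to use it exclusively. In recent pandas versions this works only if all of the requested labels already exist in the DataFrame; otherwise `loc` raises a `KeyError`, whereas `reindex` inserts missing data for the new labels.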

*Table: reindex function arguments*

| Argument | Description |
| :--- | :--- |
| `index` | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying. |
| `method` | Interpolation (fill) method; 'ffill' fills forward, while 'bfill' fills backward. |
| `fill_value` | Substitute value to use when introducing missing data by reindexing. |
| `limit` | When forward- or backfilling, maximum size gap (in number of elements) to fill. |
| `tolerance` | When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches. |
| `level` | Match simple Index on level of MultiIndex; otherwise select subset of. |
| `copy` | If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent. |
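
The `fill_value` and `limit` arguments from the table are not exercised in the examples above, so here is a minimal sketch of how I would expect them to behave. The Series and its index positions are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three observations at index labels 0, 2, and 8.
obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 8])

# fill_value replaces the missing values that reindexing would otherwise introduce.
filled = obj.reindex(range(4), fill_value='missing')

# limit caps how many consecutive new labels a forward fill may populate,
# so the long gap between labels 2 and 8 is only partly filled.
capped = obj.reindex(range(10), method='ffill', limit=2)

print(filled)
print(capped)
```

With `limit=2`, labels 3 and 4 are filled from label 2, while labels 5 through 7 stay missing.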

### Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [None]:
obj

In [None]:
new_obj = obj.drop('c')

In [None]:
new_obj

In [None]:
obj.drop(['d', 'c'])

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])

In [None]:
data

Calling `drop` with a sequence of labels will drop values from the row labels (axis 0):

In [None]:
data.drop(['Colorado', 'Ohio'])

You can drop values from the columns by passing `axis=1` or `axis='columns'`:

In [None]:
data.drop('two', axis=1)

In [None]:
data.drop(['two', 'four'], axis='columns')

Many functions, like `drop`, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:

In [None]:
obj.drop('c', inplace=True)

In [None]:
obj

Be careful with `inplace`, as it destroys any data that is dropped.
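
To make the difference concrete, here is a small sketch contrasting the two calling styles. The variable names `trimmed` and `result` are only for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Without inplace, drop returns a new Series and leaves obj untouched.
trimmed = obj.drop('c')

# With inplace=True, the method mutates obj and returns None, so assigning
# the result back to obj would throw the data away entirely.
result = obj.drop('d', inplace=True)

print(trimmed.index.tolist())  # no 'c'
print(obj.index.tolist())      # no 'd', but 'c' is still present
print(result)                  # None
```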

### Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [None]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

In [None]:
obj['b']

In [None]:
obj[1]

In [None]:
obj[2:4]

In [None]:
obj[['b', 'a', 'd']]

In [None]:
obj[[1, 3]]

In [None]:
obj[obj < 2]

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [None]:
obj['b':'c']

Setting using these methods modifies the corresponding section of the Series:

In [None]:
obj['b':'c'] = 5
obj

Indexing into a DataFrame retrieves one or more columns, either with a single value or with a sequence:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

In [None]:
data['two']

In [None]:
data[['three', 'one']]

Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:

In [None]:
data[:2]

In [None]:
data[data['three'] > 5]

The row selection syntax `data[:2]` is provided as a convenience. Passing a single element or a list to the `[]` operator selects columns.

Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [None]:
data < 5

In [None]:
data[data < 5] = 0
data

This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.

### Selection with loc and iloc

For DataFrame label-indexing on the rows, I introduce the special indexing operators `loc` and `iloc`. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).

As a preliminary example, let’s select a single row and multiple columns by label:

In [None]:
data.loc['Colorado', ['two', 'three']]

We’ll then perform some similar selections with integers using `iloc`:

In [None]:
data.iloc[2, [3, 0, 1]]

In [None]:
data.iloc[2]

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

Both indexing functions work with slices in addition to single labels or lists of labels:

In [None]:
data.loc[:'Utah', 'two']

In [None]:
data.iloc[:, :3][data.three > 5]

So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 5-4 provides a short summary of many of them. As you’ll see later, there are a number of additional options for working with hierarchical indexes.

>**Note:** *When originally designing pandas, I felt that having to type `frame[:, col]` to select a column was too verbose (and error-prone), since column selection is one of the most common operations. I made the design trade-off to push all of the fancy indexing behavior (both labels and integers) into the `ix` operator. In practice, this led to many edge cases in data with integer axis labels, so the pandas team decided to create the `loc` and `iloc` operators to deal with strictly label-based and integer-based indexing, respectively.*

The `ix` indexing operator was deprecated and has since been removed (as of pandas 1.0); use `loc` and `iloc` instead.

*Table: Indexing options with DataFrame*

| Type | Notes |
| :--- | :--- |
| `df[val]` | Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion) |
| `df.loc[val]` | Selects single row or subset of rows from the DataFrame by label |
| `df.loc[:, val]` | Selects single column or subset of columns by label |
| `df.loc[val1, val2]` | Select both rows and columns by label |
| `df.iloc[where]` | Selects single row or subset of rows from the DataFrame by integer position |
| `df.iloc[:, where]` | Selects single column or subset of columns by integer position |
| `df.iloc[where_i, where_j]` | Select both rows and columns by integer position |
| `df.at[label_i, label_j]` | Select a single scalar value by row and column label |
| `df.iat[i, j]` | Select a single scalar value by row and column position (integers) |
| `reindex` method | Select either rows or columns by labels |
| `get_value, set_value` methods | Select single value by row and column label |
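
The scalar accessors `at` and `iat` from the table are not shown elsewhere in this section, so here is a brief sketch. It rebuilds the example DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.at['Utah', 'three']   # row label, column label
by_position = data.iat[2, 2]          # row position, column position

# Both accessors also support assigning to a single cell.
data.at['Utah', 'three'] = -1

print(by_label, by_position)
print(data.loc['Utah'])
```

Both lookups address the same cell, which holds 10 before the assignment.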

### Integer Indexes

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:

```python
ser = pd.Series(np.arange(3.))
ser
ser[-1]
```

In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:

In [None]:
ser = pd.Series(np.arange(3.))
ser

On the other hand, with a non-integer index, there is no potential for ambiguity:

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [None]:
ser2[-1]

If you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use `loc` (for labels) or `iloc` (for integers):

In [None]:
ser.iloc[-1]

On the other hand, slicing with integers is always integer-oriented:

In [None]:
ser[:2]
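
When the intent matters, being explicit removes the ambiguity. The following sketch compares `loc` and `iloc` on integer-labeled data; `ser3` is a hypothetical Series whose labels deliberately differ from its positions.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# Label-based slicing includes the endpoint; position-based slicing does not.
by_label = ser.loc[:1]      # labels 0 and 1
by_position = ser.iloc[:1]  # position 0 only

# Hypothetical Series whose integer labels are not in positional order.
ser3 = pd.Series([10., 20., 30.], index=[2, 0, 1])
label_zero = ser3.loc[0]      # the value stored under label 0
position_zero = ser3.iloc[0]  # the first value in the Series

print(len(by_label), len(by_position))
print(label_zero, position_zero)
```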

### Arithmetic and Data Alignment

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [None]:
s1

In [None]:
s2

In [None]:
s1 + s2

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:

In [None]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

### Arithmetic Methods with Fill Values

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

Using the `add` method on `df1`, I pass `df2` and an argument to `fill_value`:

In [None]:
df1.add(df2, fill_value=0)

See the table for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has arguments flipped. So these two statements are equivalent:

In [None]:
1 / df1

In [None]:
df1.rdiv(1)

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

*Table: Flexible arithmetic methods*

| Method | Description |
| :--- | :--- |
| `add, radd` | Methods for addition (+) |
| `sub, rsub` | Methods for subtraction (-) |
| `div, rdiv` | Methods for division (/) |
| `floordiv, rfloordiv` | Methods for floor division (//) |
| `mul, rmul` | Methods for multiplication (*) |
| `pow, rpow` | Methods for exponentiation (**) |

### Operations Between DataFrame and Series

As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr

In [None]:
arr[0]

In [None]:
arr - arr[0]

When we subtract `arr[0]` from `arr`, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
                     columns=list('bde'), 
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
series = frame.iloc[0]

In [None]:
frame

In [None]:
series

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:

In [None]:
frame - series

If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:

In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [None]:
series2

In [None]:
frame + series2

If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:

In [None]:
series3 = frame['d']
frame

In [None]:
series3

In [None]:
frame.sub(series3, axis='index')

The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame’s row index (`axis='index'` or `axis=0`) and broadcast across the columns.
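
A common use of matching on the row index is subtracting a per-row statistic. The sketch below demeans each row and, for contrast, shows what the plain `-` operator does with the same operands; the names `row_means`, `demeaned`, and `misaligned` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis='columns')

# Match row_means on the row index and broadcast across the columns.
demeaned = frame.sub(row_means, axis='index')

# The plain operator matches the Series index against the columns instead,
# so no labels line up and every value is missing.
misaligned = frame - row_means

print(demeaned)
print(misaligned)
```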

### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
frame

In [None]:
np.abs(frame)

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

In [None]:
f = lambda x: x.max() - x.min()

In [None]:
frame.apply(f)

Here the function `f`, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index.

If you pass `axis='columns'` to `apply`, the function will be invoked once per row instead:

In [None]:
frame.apply(f, axis='columns')

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
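
As a quick check of that point, the range computed with `apply` earlier can be reproduced with built-in reductions. This is a sketch using a seeded random generator of my choosing, not the data shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Column-wise range via apply, calling the function once per column.
via_apply = frame.apply(lambda x: x.max() - x.min())

# The same result from the built-in DataFrame reductions.
via_methods = frame.max() - frame.min()

print(via_apply)
print(via_methods)
```

The built-in methods avoid invoking a Python function once per column, which I would expect to matter more as the number of columns grows.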

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [None]:
def f(x): return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [None]:
frame.apply(f)

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in `frame`. You can do this with `applymap` (deprecated since pandas 2.1 in favor of the equivalent `DataFrame.map`):

In [None]:
format = lambda x: '%.2f' % x

In [None]:
frame.applymap(format)

The reason for the name `applymap` is that Series has a map method for applying an element-wise function:

In [None]:
frame['e'].map(format)
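
Because the element-wise DataFrame method has been renamed in newer pandas releases, code that must run on several versions can pick whichever name is available. The following is a sketch under that assumption; `two_decimals` and `elementwise` are hypothetical names, and the data is made up.

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -0.5], [2.0, 3.14159]], columns=['b', 'd'])

def two_decimals(x):
    """Format a single number with two decimal places."""
    return f'{x:.2f}'

# Assumption: DataFrame.map exists on newer pandas; otherwise fall back
# to applymap, which has the same element-wise behavior.
elementwise = getattr(frame, 'map', None) or frame.applymap
formatted = elementwise(two_decimals)

print(formatted)
```

Naming the function `two_decimals` rather than `format` also avoids shadowing Python’s built-in `format`.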

### Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the `sort_index` method, which returns a new, sorted object:

In [None]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

In [None]:
obj.sort_index()

With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
frame

In [None]:
frame.sort_index()

In [None]:
frame.sort_index(axis=1)

The data is sorted in ascending order by default, but can be sorted in descending order, too:

In [None]:
frame.sort_index(axis=1, ascending=False)

To sort a Series by its values, use its `sort_values` method:

In [None]:
obj = pd.Series([4, 7, -3, 2])
obj

In [None]:
obj.sort_values()

Any missing values are sorted to the end of the Series by default:

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

In [None]:
obj.sort_values()

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of `sort_values`:

In [None]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values(by='b')

In [None]:
frame.sort_values(by=['a', 'b'])
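
Two further options of `sort_values` are worth a short sketch: a per-key sort direction and the placement of missing values. The examples reuse the data from above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Place missing values at the start instead of the end.
nan_first = obj.sort_values(na_position='first')

print(mixed)
print(nan_first)
```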

Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4, 12])
obj

In [None]:
obj.rank()

Ranks can also be assigned according to the order in which they’re observed in the data:

In [None]:
obj.rank(method='first')

Here, instead of receiving the average rank 6.5, the entries at labels 0 and 2 have been assigned ranks 6 and 7, because label 0 precedes label 2 in the data.

You can rank in descending order, too:

In [None]:
obj.rank(ascending=False, method='max')

See the table for a list of tie-breaking methods available.

DataFrame can compute ranks over the rows or the columns:

In [None]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
frame

In [None]:
frame.rank(axis='columns')

*Table: Tie-breaking methods with rank*

| Method | Description |
| :--- | :--- |
| `average` | Default: assign the average rank to each entry in the equal group |
| `min` | Use the minimum rank for the whole group |
| `max` | Use the maximum rank for the whole group |
| `first` | Assign ranks in the order the values appear in the data |
| `dense` | Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group |
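
To see how the tie-breaking methods differ, it can help to rank the same data with each of them side by side. This sketch reuses the Series from above; the `comparison` DataFrame is only for display.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4, 12])

# One column of ranks per tie-breaking method.
methods = ['average', 'min', 'max', 'first', 'dense']
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(comparison)
```

The two 7s (labels 0 and 2) and the two 4s (labels 3 and 6) are the only rows where the columns disagree, apart from `dense`, which also compresses the ranks that follow a tie.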

### Axis Indexes with Duplicate Labels

Up until now all of the examples we’ve looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [None]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [None]:
obj

The index’s `is_unique` property can tell you whether its labels are unique or not:

In [None]:
obj.index.is_unique

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [None]:
obj['a']

In [None]:
obj['c']

This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.
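
One way to keep the output type predictable is to select with a list of labels, which returns a Series whether or not the label is repeated. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

scalar = obj['c']                # a scalar, because 'c' occurs once
always_series = obj.loc[['c']]   # a one-element Series
repeated = obj.loc[['a']]        # a two-element Series

# Labels that occur more than once (each reported once).
dupes = obj.index[obj.index.duplicated()]

print(type(always_series).__name__, len(repeated))
print(dupes.tolist())
```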

The same logic extends to indexing rows in a DataFrame:

In [None]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [None]:
df

In [None]:
df.loc['b']

## 2. Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], 
                  index=['a', 'b', 'c', 'd'], 
                  columns=['one', 'two'])
df

Calling DataFrame’s sum method returns a Series containing column sums:

In [None]:
df.sum()

Passing `axis='columns'` or `axis=1` sums across the columns instead:

In [None]:
df.sum(axis='columns')

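NA values are excluded by default. For `sum`, a row or column that is entirely NA sums to 0 in recent pandas versions, while statistics such as `mean` return NA for an all-NA slice. Skipping of NA values can be disabled with the `skipna` option, in which case any NA value in a row or column makes the corresponding result NA: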

In [None]:
df.mean(axis='columns', skipna=False)
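
The handling of an all-NA row is easy to get wrong, so here is a small sketch of how I understand it to behave in recent pandas versions, including the `min_count` option of `sum`, which is not covered in the text above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

row_sums = df.sum(axis='columns')                   # all-NA row 'c' sums to 0
strict_sums = df.sum(axis='columns', min_count=1)   # require at least one value
row_means = df.mean(axis='columns')                 # all-NA row 'c' stays NA

print(row_sums)
print(strict_sums)
print(row_means)
```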

*Table: Options for reduction methods*

| Option | Description |
| :--- | :--- |
| `axis` | Axis to reduce over; 0 for DataFrame’s rows and 1 for columns |
| `skipna` | Exclude missing values; True by default |
| `level` | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |

Some methods, like `idxmin` and `idxmax`, return indirect statistics like the index value where the minimum or maximum values are attained:

In [None]:
df.idxmax()

Other methods are accumulations:

In [None]:
df.cumsum()

Another type of method is neither a reduction nor an accumulation. `describe` is one such example, producing multiple summary statistics in one shot:

In [None]:
df.describe()

On non-numeric data, `describe` produces alternative summary statistics:

In [None]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj

In [None]:
obj.describe()

*Table: Descriptive and summary statistics*

|Method|Description|
|:---|:---|
|`count`|Number of non-NA values|
|`describe`|Compute set of summary statistics for Series or each DataFrame column|
|`min, max`|Compute minimum and maximum values|
|`argmin, argmax`|Compute index locations (integers) at which minimum or maximum value obtained, respectively|
|`idxmin, idxmax`|Compute index labels at which minimum or maximum value obtained, respectively|
|`quantile`|Compute sample quantile ranging from 0 to 1|
|`sum`|Sum of values|
|`mean`|Mean of values|
|`median`|Arithmetic median (50% quantile) of values|
|`mad`|Mean absolute deviation from mean value|
|`prod`|Product of all values|
|`var`|Sample variance of values|
|`std`|Sample standard deviation of values|
|`skew`|Sample skewness (third moment) of values|
|`kurt`|Sample kurtosis (fourth moment) of values|
|`cumsum`|Cumulative sum of values|
|`cummin, cummax`|Cumulative minimum or maximum of values, respectively|
|`cumprod`|Cumulative product of values|
|`diff`|Compute first arithmetic difference (useful for time series)|
|`pct_change`|Compute percent changes|
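
A few of the methods in the table are easiest to understand on a tiny example. The prices below are hypothetical values chosen so that the results can be checked by hand.

```python
import pandas as pd

# Hypothetical price series.
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

changes = prices.diff()            # difference from the previous element
pct = prices.pct_change()          # relative change from the previous element
median = prices.quantile(0.5)      # the 50% quantile is the median

print(changes)
print(pct)
print(median, prices.median())
```

The first element of both `changes` and `pct` is missing because it has no predecessor.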

### Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on `pandas-datareader` package. If you don’t have it installed already, it can be obtained via conda or pip:

```
conda install pandas-datareader
```

I use the `pandas_datareader` module to download some data for a few stock tickers:

In [None]:
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

>**Caution:** *It’s possible by the time you are reading this that Yahoo! Finance no longer exists since Yahoo! was acquired by Verizon in 2017. Refer to the pandas-datareader documentation online for the latest functionality.*

I now compute percent changes of the prices, a time series operation which will be explored in future chapters.

In [None]:
returns = price.pct_change()

In [None]:
returns.tail()

The `corr` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, `cov` computes the covariance:

In [None]:
returns['MSFT'].corr(returns['IBM'])

In [None]:
returns['MSFT'].cov(returns['IBM'])

Since `MSFT` is a valid Python attribute name, we can also select these columns using more concise syntax:

In [None]:
returns.MSFT.corr(returns.IBM)

DataFrame’s `corr` and `cov` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [None]:
returns.corr()

In [None]:
returns.cov()

Using DataFrame’s `corrwith` method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [None]:
returns.corrwith(returns.IBM)

Passing a DataFrame computes the correlations of matching column names. Here I compute correlations of percent changes with volume:

In [None]:
returns.corrwith(volume)

Passing `axis='columns'` does things row-by-row instead. In all cases, the data points are aligned by label before the correlation is computed.
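
If the download is not available, the same methods can be tried on synthetic data. The sketch below is not the stock data used above: the columns `X`, `Y`, and `Z` are made up so that the expected correlations are known in advance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily
base = rng.standard_normal(250)

# Hypothetical "returns": Y is a linear function of X, Z is its negation.
returns = pd.DataFrame({'X': base,
                        'Y': 2.0 * base + 1.0,
                        'Z': -base})

pair = returns['X'].corr(returns['Y'])       # close to 1
against_x = returns.corrwith(returns['X'])   # one correlation per column
matrix = returns.corr()                      # full 3 x 3 correlation matrix

print(pair)
print(against_x)
print(matrix)
```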

### Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

The first function is `unique`, which gives you an array of the unique values in a Series:

In [None]:
uniques = obj.unique()
uniques

The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (`uniques.sort()`). Relatedly, `value_counts` computes a Series containing value frequencies:

In [None]:
obj.value_counts()

The Series is sorted by value in descending order as a convenience. `value_counts` is also available as a top-level pandas method that can be used with any array or sequence:

In [None]:
pd.value_counts(obj.values, sort=False)

`isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:

In [None]:
obj

In [None]:
mask = obj.isin(['b', 'c'])
mask

In [None]:
obj[mask]

Related to `isin` is the `Index.get_indexer` method, which gives you an index array from an array of possibly non-distinct values into another array of distinct values:

In [None]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
to_match

In [None]:
unique_vals = pd.Series(['c', 'b', 'a'])
unique_vals

In [None]:
pd.Index(unique_vals).get_indexer(to_match)

*Table: Unique, value counts, and set membership methods*

|Method|Description|
|:---|:---|
|`isin`|Compute boolean array indicating whether each Series value is contained in the passed sequence of values|
|`get_indexer`|Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations|
|`unique`|Compute array of unique values in a Series, returned in the order observed|
|`value_counts`|Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order|
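
One detail of `get_indexer` that the example above does not show is what happens to values with no match. As far as I know, they are reported as -1, which this sketch illustrates with a made-up value `'z'`.

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])
to_match = ['c', 'a', 'z', 'b']   # 'z' does not occur in unique_vals

# Position of each value within unique_vals; -1 marks "not found".
positions = unique_vals.get_indexer(to_match)

# isin answers the simpler question of membership.
found = pd.Series(to_match).isin(unique_vals)

print(positions)
print(found.tolist())
```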

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:

In [None]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

Passing `pandas.value_counts` to this DataFrame’s apply function gives:

In [None]:
result = data.apply(pd.value_counts).fillna(0)

In [None]:
result

Here, the row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column.
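
The same table can be built with the `value_counts` method of each column instead of the top-level function. This is a sketch of that variant, with the counts converted back to integers after filling.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values column by column; values absent from a column become 0.
result = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)

print(result)
```

Each column of `result` sums to 5, the number of rows in `data`.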

## 3. Categorical Data

This section introduces the pandas Categorical type. I will show how you can achieve better performance and memory use in some pandas operations by using it. I also introduce some tools for using categorical data in statistics and machine learning applications.

### Background and Motivation

Frequently, a column in a table may contain repeated instances of a smaller set of distinct values. We have already seen functions like `unique` and `value_counts`, which enable us to extract the distinct values from an array and compute their frequencies, respectively:

In [None]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

In [None]:
values

In [None]:
pd.unique(values)

In [None]:
pd.value_counts(values)

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables containing the distinct values and storing the primary observations as integer keys referencing the dimension table:

In [None]:
values = pd.Series([0, 1, 0, 0] * 2)
values

In [None]:
dim = pd.Series(['apple', 'orange'])
dim

We can use the `take` method to restore the original Series of strings:

In [None]:
dim.take(values)

This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. In this book we will use the terms categorical and categories. The integer values that reference the categories are called the category codes or simply codes.
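
The codes and categories do not have to be written by hand. As a sketch, `pandas.factorize` computes both from the original strings, and `take` reverses the encoding as shown earlier.

```python
import pandas as pd

fruits = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# codes: integer position of each value within categories
# categories: the distinct values, in order of first appearance
codes, categories = pd.factorize(fruits)

# Decoding: look up each code in the categories.
restored = pd.Series(categories).take(codes)

print(codes)
print(categories)
print(restored.tolist())
```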

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:

- Renaming categories

- Appending a new category without changing the order or position of the existing categories
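
Both transformations can be sketched with the plain integer codes and dimension table from above. Only the small table of distinct values changes; the codes are never touched. The names `renamed` and `extended` are illustrative.

```python
import pandas as pd

codes = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])

# Renaming categories: rewrite the two entries of the dimension table.
renamed = dim.str.title()

# Appending a category at the end keeps existing positions valid.
extended = pd.concat([dim, pd.Series(['banana'])], ignore_index=True)

# The unmodified codes decode correctly against either table.
decoded_renamed = renamed.take(codes)
decoded_extended = extended.take(codes)

print(decoded_renamed.tolist())
print(decoded_extended.tolist())
```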

### Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding. Let’s consider the example Series from before:

In [None]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
fruits

In [None]:
N = len(fruits)
N

In [None]:
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                columns=['basket_id', 'fruit', 'count', 'weight'])
df

Here, `df['fruit']` is an array of Python string objects. We can convert it to categorical by calling:

In [None]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

The values for `fruit_cat` are not a NumPy array, but an instance of `pandas.Categorical`:

In [None]:
c = fruit_cat.values
c

In [None]:
type(c)

The `Categorical` object has `categories` and `codes` attributes:

In [None]:
c.categories

In [None]:
c.codes

You can convert a DataFrame column to categorical by assigning the converted result:

In [None]:
df['fruit'] = df['fruit'].astype('category')

In [None]:
df.fruit

You can also create `pandas.Categorical` directly from other types of Python sequences:

In [None]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

If you have obtained categorical encoded data from another source, you can use the alternative `from_codes` constructor:

In [None]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

Unless explicitly specified, categorical conversions assume no specific ordering of the categories. So the `categories` array may be in a different order depending on the ordering of the input data. When using `from_codes` or any of the other constructors, you can indicate that the categories have a meaningful ordering:

In [None]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

The output `[foo < bar < baz]` indicates that `'foo'` precedes `'bar'` in the ordering, and so on. An unordered categorical instance can be made ordered with `as_ordered`:

In [None]:
my_cats_2.as_ordered()
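To illustrate what the ordering is used for, here is a small sketch with made-up size labels (the data and the name `sizes` are assumptions for illustration). With an ordered categorical, comparisons, `min`/`max`, and sorting follow the category order rather than alphabetical order:

```python
import pandas as pd

# Declare the order explicitly: small < medium < large
sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True))

print(sizes.min(), sizes.max())      # small large
print((sizes > 'small').tolist())    # [True, False, True, False]
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
```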

As a last note, categorical data need not be strings, even though I have only shown string examples. A categorical array can consist of any immutable value types.
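For instance, a categorical built from integers works the same way; this is a small sketch with made-up values:

```python
import pandas as pd

# Integer values are encoded exactly like strings
int_cat = pd.Categorical([10, 20, 10, 30])

print(int_cat.categories.tolist())  # [10, 20, 30]
print(int_cat.codes.tolist())       # [0, 1, 0, 2]
```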

### Computations with Categoricals

A `Categorical` in pandas generally behaves the same way as the non-encoded version (like an array of strings). Some parts of pandas, like the `groupby` function, perform better when working with categoricals. There are also some functions that can utilize the `ordered` flag.

Let’s consider some random numeric data, and use the `pandas.qcut` binning function. This returns a `pandas.Categorical`; we used `pandas.cut` earlier in the book but glossed over the details of how categoricals work:

In [None]:
np.random.seed(12345)

In [None]:
draws = np.random.randn(1000)

In [None]:
draws[:5]

Let’s compute a quartile binning of this data and extract some statistics:

In [None]:
bins = pd.qcut(draws, 4)
bins

While informative, the exact sample quartiles may be less useful for producing a report than quartile names. We can assign names with the `labels` argument to `qcut`:

In [None]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

In [None]:
bins.codes[:10]
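As a quick check of what quantile binning does, the following sketch uses its own seeded random sample (the generator, seed, and variable names are assumptions, not the `draws` array above). Because `qcut` places the bin edges at sample quantiles, each of the four bins receives the same number of observations:

```python
import numpy as np
import pandas as pd

# Independent sample so the sketch is self-contained
rng = np.random.default_rng(0)
values = rng.standard_normal(1000)

quartiles = pd.qcut(values, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Count observations per bin, keeping the category order
counts = pd.Series(quartiles).value_counts(sort=False)
print(counts)
```

By contrast, `pandas.cut` with an integer number of bins splits the value range into equal-width intervals, so its bin counts are generally unequal.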

The labeled `bins` categorical does not contain information about the bin edges in the data, so we can use `groupby` to extract some summary statistics:

In [None]:
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index())
results

The `'quartile'` column in the result retains the original categorical information, including ordering, from `bins`:

In [None]:
results['quartile']

### Better Performance with Categoricals

If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too. Let’s consider some Series with 10 million elements and a small number of distinct categories:

In [None]:
N = 10000000

In [None]:
draws = pd.Series(np.random.randn(N))
draws

In [None]:
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
labels

Now we convert labels to categorical:

In [None]:
categories = labels.astype('category')

Note that `labels` uses significantly more memory than `categories`:

In [None]:
labels.memory_usage()

In [None]:
categories.memory_usage()
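One caveat: for object-dtype string data, `memory_usage()` by default counts only the array of pointers, not the Python strings they refer to. The sketch below, using a smaller Series than the one above so that it runs quickly, passes `deep=True` to include the string contents and also shows the compact integer type used for the codes:

```python
import numpy as np
import pandas as pd

n = 100_000  # smaller than the 10 million used above
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
categories = labels.astype('category')

# deep=True also accounts for the memory held by the string values
print(labels.memory_usage(deep=True))
print(categories.memory_usage(deep=True))

# With only four categories, the codes fit in 8-bit integers
print(categories.cat.codes.dtype)  # int8
```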

The conversion to category is not free, of course, but it is a one-time cost:

In [None]:
%time _ = labels.astype('category')

GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
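The sketch below, with assumed names and a reduced size, checks that grouping by the categorical version gives the same group means as grouping by the strings. To compare speed on your own machine, you could time each `groupby` line with `%timeit` in IPython; no timings are claimed here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
draws = pd.Series(rng.standard_normal(n))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
categories = labels.astype('category')

# Same grouping keys, once as strings and once as categorical codes
by_string = draws.groupby(labels).mean()
by_category = draws.groupby(categories, observed=True).mean()

print(list(by_string.index))  # ['bar', 'baz', 'foo', 'qux']
print(np.allclose(by_string.to_numpy(), by_category.to_numpy()))  # True
```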

### Categorical Methods

Series containing categorical data have several special methods, similar to the `Series.str` specialized string methods. These also provide convenient access to the categories and codes. Consider the Series:

In [None]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
s

In [None]:
cat_s = s.astype('category')
cat_s

The special accessor attribute `cat` provides access to categorical methods:

In [None]:
cat_s.cat.codes

In [None]:
cat_s.cat.categories

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the `set_categories` method to change them:

In [None]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
actual_categories

In [None]:
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

While it appears that the data is unchanged, the new categories will be reflected in operations that use them. For example, `value_counts` respects the categories, if present:

In [None]:
cat_s.value_counts()

In [None]:
cat_s2.value_counts()

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After you filter a large DataFrame or Series, many of the categories may no longer appear in the data. To help with this, we can use the `remove_unused_categories` method to trim unobserved categories:

In [None]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

In [None]:
cat_s3.cat.remove_unused_categories()

*Table: Categorical methods for Series in pandas*

| Method | Description |
| :--- | :--- |
| `add_categories` | Append new (unused) categories at end of existing categories |
| `as_ordered` | Make categories ordered |
| `as_unordered` | Make categories unordered |
| `remove_categories` | Remove categories, setting any removed values to null |
| `remove_unused_categories` | Remove any category values which do not appear in the data |
| `rename_categories` | Replace categories with indicated set of new category names; cannot change the number of categories |
| `reorder_categories` | Reorder the categories as specified; the new categories must contain exactly the same items as the old ones; can also change the result to have ordered categories |
| `set_categories` | Replace the categories with the indicated set of new categories; can add or remove categories |
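A few of the methods in the table can be sketched on the same kind of Series used above. The result names (`added`, `renamed`, and so on) are illustrative:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Append a category that does not occur in the data
added = cat_s.cat.add_categories(['e'])
print(list(added.cat.categories))    # ['a', 'b', 'c', 'd', 'e']

# Rename one category with a dict; the number of categories is unchanged
renamed = cat_s.cat.rename_categories({'a': 'A'})
print(list(renamed.cat.categories))  # ['A', 'b', 'c', 'd']

# Same categories in a new order, marked as ordered: d < c < b < a
reordered = cat_s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)
print(reordered.cat.ordered, reordered.max())  # True a

# Removing a category turns its values into missing values
removed = cat_s.cat.remove_categories(['d'])
print(int(removed.isna().sum()))     # 2
```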

### Creating Dummy Variables for Modeling

When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.

Consider the previous example:

In [None]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

The `pandas.get_dummies` function converts this one-dimensional categorical data into a DataFrame containing the dummy variables. In pandas 2.0 and later the columns are Boolean by default; pass `dtype=int` to get 1s and 0s:

In [None]:
pd.get_dummies(cat_s)
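As a hedged variation on the call above, the sketch below requests integer output explicitly and adds a column prefix (the prefix string `'category'` is an arbitrary choice for illustration). Each row then contains exactly one 1:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# dtype=int gives 1/0 columns regardless of the pandas default;
# prefix is prepended to each category name in the column labels
dummies = pd.get_dummies(cat_s, prefix='category', dtype=int)

print(list(dummies.columns))
# ['category_a', 'category_b', 'category_c', 'category_d']
print(dummies.iloc[0].tolist())                   # [1, 0, 0, 0]
print(bool((dummies.sum(axis=1) == 1).all()))     # True
```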