# Data Cleaning and Preparation
During the course of doing data analysis and modeling, a significant amount of time
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up **80%** or more of an analyst’s time. Sometimes the
way that data is stored in files or databases is not in the right format for a particular
task.

In [1]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 25
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 82
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

## Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect,
but it is sufficient for most real-world use. For data with float64 dtype, pandas uses
the floating-point value NaN (Not a Number) to represent missing data.

We call this a **sentinel value**: when present, it indicates a missing (or null) value.

In [3]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [4]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as **NA**, which stands for **not available**. In statistics applications,
NA data may either be data that does not exist or that exists but was not observed
(through problems with data collection, for example). When cleaning up data for
analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The built-in Python **None** value is also treated as **NA**:

In [5]:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data

0    aardvark
1         NaN
2        None
3     avocado
dtype: object

In [6]:
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

In [7]:
float_data = pd.Series([1, 2, None], dtype='float64')
float_data

0    1.0
1    2.0
2    NaN
dtype: float64

In [8]:
float_data.isna()

0    False
1    False
2     True
dtype: bool

The pandas project has attempted to make working with missing data consistent
across **data types**. Functions like `pandas.isna` abstract away many of the annoying
details.

<img src="Img/pd_na.png" alt="NA handling object methods" title="NA handling object methods" />

- **ffill**: propagate the last valid observation forward to fill the gap.
- **bfill**: use the next valid observation to fill the gap.
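
As a minimal sketch of the two fill directions listed above, `ffill` and `bfill` can be compared side by side. The Series `s` is made-up illustration data and does not come from the text.

```python
import numpy as np
import pandas as pd

# Illustrative data: two consecutive gaps between valid values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # gaps take the last valid value before them
backward = s.bfill()  # gaps take the next valid value after them

print(forward.tolist())
print(backward.tolist())
```

A gap at the very start of a Series has nothing to carry forward, so `ffill` leaves it missing; the same holds for `bfill` at the end.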

### Filtering Out Missing Data
There are a few ways to filter out missing data. While you always have the option to
do it by hand using `pandas.isna` and Boolean indexing, `dropna` can be helpful. On a
Series, it returns the Series with only the nonnull data and index values:

In [9]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [11]:
# same as above
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, there are different ways to remove missing data. You may
want to drop rows or columns that are all NA, or only those rows or columns
containing any NAs at all. `dropna` by default drops any row containing a missing
value:

In [12]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [13]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [14]:
# how="all" will drop only rows that are all NA
data.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Keep in mind that these functions return **new objects** by default and do not modify
the contents of the original object.

In [15]:
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [16]:
data.dropna(axis="columns", how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [17]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [18]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df

Unnamed: 0,0,1,2
0,-0.204708,0.478943,-0.519439
1,-0.55573,1.965781,1.393406
2,0.092908,0.281746,0.769023
3,1.246435,1.007189,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [19]:
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.204708,,
1,-0.55573,,
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [20]:
df.dropna()

Unnamed: 0,0,1,2
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


Suppose you want to keep only rows containing at most a certain number of missing
observations. You can indicate this with the `thresh` argument, which sets the minimum
number of non-NA values a row must have in order to be kept:

In [21]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741
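
Because `thresh` counts non-NA values rather than missing ones, it can help to check it against an explicit count. The following is a small sketch on a made-up DataFrame (not the one used above) that compares `dropna(thresh=2)` with the equivalent Boolean filter.

```python
import numpy as np
import pandas as pd

# Illustrative rows with 3, 2, 1, and 0 non-NA values respectively
df = pd.DataFrame([[1.0, 2.0, 3.0],
                   [1.0, np.nan, 3.0],
                   [np.nan, np.nan, 3.0],
                   [np.nan, np.nan, np.nan]])

# Keep rows that have at least two non-NA values
kept = df.dropna(thresh=2)

# The same selection written by hand
non_na_counts = df.notna().sum(axis="columns")
same = df[non_na_counts >= 2]

print(kept)
```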


### Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to **fill** in the “holes” in any number of ways. For most
purposes, the `fillna` method is the workhorse function to use. 

Calling `fillna` with a constant replaces missing values with that value:

In [22]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.204708,0.0,0.0
1,-0.55573,0.0,0.0
2,0.092908,0.0,0.769023
3,1.246435,0.0,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


Calling `fillna` with a dictionary, you can use a different fill value for each column:

In [23]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-0.204708,0.5,0.0
1,-0.55573,0.5,0.0
2,0.092908,0.5,0.769023
3,1.246435,0.5,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


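The same interpolation methods available for reindexing, such as forward filling, can be
used to fill missing values. Recent pandas versions deprecate the `method` argument of
`fillna` in favor of the dedicated `ffill` and `bfill` methods: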

In [24]:
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df

Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,0.00094,1.34381
3,-0.713544,-0.831154,-2.370232
4,-1.860761,-0.860757,0.560145
5,-1.265934,0.119827,-1.063512


In [25]:
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,,1.34381
3,-0.713544,,-2.370232
4,-1.860761,,
5,-1.265934,,


In [26]:
# Equivalent to the deprecated df.fillna(method="ffill")
df.ffill()


Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,0.124121,1.34381
3,-0.713544,0.124121,-2.370232
4,-1.860761,0.124121,-2.370232
5,-1.265934,0.124121,-2.370232


In [27]:
# Equivalent to the deprecated df.fillna(method="ffill", limit=2)
df.ffill(limit=2)


Unnamed: 0,0,1,2
0,0.476985,3.248944,-1.021228
1,-0.577087,0.124121,0.302614
2,0.523772,0.124121,1.34381
3,-0.713544,0.124121,-2.370232
4,-1.860761,,-2.370232
5,-1.265934,,-2.370232


With `fillna` you can do lots of other things, such as simple data imputation using the
median or mean statistics:

In [28]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [29]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

<img src="Img/pd_fillna.png" alt="fillna function arguments" title="fillna function arguments" />

## Data Transformation

### Removing Duplicates
The DataFrame method `duplicated` returns a Boolean Series indicating whether
each row is a duplicate (has been observed in a previous row) or not.

`duplicated` and `drop_duplicates` by default keep the **first** observed value combination. Passing `keep="last"` will keep the last one instead.

In [30]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [31]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [34]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [33]:
data.drop_duplicates(keep="last")

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
6,two,4


Both methods by default consider all of the columns; alternatively, you can specify
any subset of them to detect duplicates. Suppose we had an additional column of
values and wanted to filter duplicates based only on the "k1" column:

In [35]:
data["v1"] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [36]:
data.drop_duplicates(subset=["k1"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [37]:
data.drop_duplicates(["k1", "k2"], keep="last")

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6


### Transforming Data Using a Function or Mapping

In [38]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                              "pastrami", "corned beef", "bacon",
                              "pastrami", "honey ham", "nova lox"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food
came from.

In [39]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

The `map` method on a Series accepts a function or dictionary-like object containing a mapping to do the transformation of values:

In [40]:
data["animal"] = data["food"].map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [41]:
def get_animal(x):
    return meat_to_animal[x]

data["food"].map(get_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using `map` is a convenient way to perform **element-wise** transformations and other
data cleaning-related operations.
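
One detail worth knowing when mapping with a dictionary: values without a matching key become NA rather than raising an error, and the keys must match exactly, including case. The sketch below uses made-up food names (including `"tofu"`, which is deliberately absent from the mapping) and normalizes case with `str.lower` before mapping.

```python
import pandas as pd

# Illustrative data with inconsistent capitalization and one unmapped value
foods = pd.Series(["Bacon", "PASTRAMI", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Normalize case first so the dictionary keys match
animals = foods.str.lower().map(meat_to_animal)

print(animals)
```

By contrast, a function such as `get_animal` above, which indexes the dictionary directly, would raise a `KeyError` for the unmapped value.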

### Replacing Values
Filling in missing data with the `fillna` method is a special case of more general value
*replacement*. As you’ve already seen, `map` can be used to modify a subset of values
in an object, but `replace` provides a simpler and more flexible way to do so.

In [42]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [43]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [44]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [45]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [46]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The `data.replace` method is distinct from `data.str.replace`, which performs element-wise string substitution. 
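
To make the distinction concrete, here is a brief sketch on a made-up Series of strings: `replace` swaps out elements that equal the given value as a whole, while `str.replace` substitutes within each string.

```python
import pandas as pd

# Illustrative data: "-" appears both as a whole value and inside a string
s = pd.Series(["a-b", "-", "c"])

# Replaces only the element that is exactly "-"
whole = s.replace("-", "missing")

# Replaces every "-" character inside each string
sub = s.str.replace("-", "_", regex=False)

print(whole.tolist())
print(sub.tolist())
```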

### Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or
mapping of some form to produce new, differently labeled objects. You can also
modify the axes in place without creating a new data structure.

In [50]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [51]:
def transform(x):
    return x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [52]:
# You can assign to the index attribute, modifying the DataFrame in place
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [53]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [54]:
data.rename(index={"OHIO": "INDIANA"},
            columns={"three": "peekaboo"})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


`rename` saves you from the chore of copying the DataFrame manually and assigning
new values to its index and columns attributes.
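
A quick way to confirm that `rename` returns a new object is to inspect the original afterward. The sketch below rebuilds a small DataFrame similar to the one above (the exact values are illustrative) and mixes a function for the index with a dictionary for the columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=["Ohio", "Colorado"],
                    columns=["one", "two", "three"])

# A function transforms every index label; a dict renames only listed columns
renamed = data.rename(index=str.upper, columns={"three": "THREE"})

print(list(renamed.index), list(renamed.columns))
print(list(data.index), list(data.columns))  # original labels are unchanged
```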

### Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.

In [56]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [57]:
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages, bins)
age_categories

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special **Categorical** object. The output you see
describes the bins computed by `pandas.cut`. Each bin is identified by a special
(unique to pandas) interval value type containing the lower and upper limit of each
bin:

In [58]:
age_categories.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [59]:
age_categories.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [60]:
age_categories.categories[0]

Interval(18, 25, closed='right')

In [61]:
# pd.value_counts is deprecated; call value_counts on the Categorical itself
age_categories.value_counts()


(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

In the string representation of an interval, a parenthesis means that the side is open
(exclusive), while the square bracket means it is closed (inclusive). You can change
which side is closed by passing `right=False`:

In [62]:
pd.cut(ages, bins, right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64, left]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

You can override the default interval-based bin labeling by passing a list or array to
the `labels` option:

In [63]:
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [64]:
data = np.random.uniform(size=20)
pd.cut(data, 4, precision=2)

[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64, right]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]

If you pass an **integer number** of bins to `pandas.cut` instead of explicit bin edges, it
will compute **equal-length** bins based on the minimum and maximum values in the
data.
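
To see the computed edges directly, `pandas.cut` accepts `retbins=True`, which returns the bin edges alongside the Categorical. The sketch below uses a small made-up array; note that the lowest edge is nudged slightly below the minimum so that the minimum value falls inside the first right-closed bin.

```python
import numpy as np
import pandas as pd

# Illustrative data spanning 0 to 10
values = np.array([0.0, 2.0, 5.0, 10.0])

# Ask for 4 equal-length bins and also get the edges back
categories, edges = pd.cut(values, 4, retbins=True)
widths = np.diff(edges)

print(edges)             # 5 edges define 4 bins
print(widths)            # all but the first width equal (max - min) / 4
print(categories.codes)  # bin number assigned to each value
```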

A closely related function, `pandas.qcut`, bins the data based on sample quantiles.
Depending on the distribution of the data, using `pandas.cut` will not usually result in
each bin having the **same number** of data points. Since `pandas.qcut` uses sample
quantiles instead, you will obtain roughly equally sized bins:

In [65]:
# precision=2 option limits the decimal precision to two digits
data = np.random.standard_normal(1000)
quartiles = pd.qcut(data, 4, precision=2)
quartiles

[(-0.026, 0.62], (0.62, 3.93], (-0.68, -0.026], (0.62, 3.93], (-0.026, 0.62], ..., (-0.68, -0.026], (-0.68, -0.026], (-2.96, -0.68], (0.62, 3.93], (-0.68, -0.026]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.96, -0.68] < (-0.68, -0.026] < (-0.026, 0.62] < (0.62, 3.93]]

In [66]:
# pd.value_counts is deprecated; call value_counts on the Categorical itself
quartiles.value_counts()


(-2.96, -0.68]     250
(-0.68, -0.026]    250
(-0.026, 0.62]     250
(0.62, 3.93]       250
Name: count, dtype: int64
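
The difference between the two functions is easiest to see on skewed data. The following sketch uses a made-up array with a single large value and compares the bin counts from `pandas.cut` and `pandas.qcut` with two bins each.

```python
import numpy as np
import pandas as pd

# Illustrative skewed data: one value far from the rest
data = np.array([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-length bins: the outlier sits alone in the upper bin
cut_counts = pd.cut(data, 2).value_counts()

# Quantile-based bins: each bin holds half of the observations
qcut_counts = pd.qcut(data, 2).value_counts()

print(cut_counts)
print(qcut_counts)
```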

In [67]:
# you can pass your own quantiles
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]).value_counts()

(-2.9499999999999997, -1.187]    100
(-1.187, -0.0265]                400
(-0.0265, 1.286]                 400
(1.286, 3.928]                   100
Name: count, dtype: int64

These discretization functions are especially useful for quantile and group analysis.

### Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations.

In [68]:
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.049091,0.026112,-0.002544,-0.051827
std,0.996947,1.007458,0.995232,0.998311
min,-3.64586,-3.184377,-3.745356,-3.428254
25%,-0.599807,-0.612162,-0.687373,-0.747478
50%,0.047101,-0.013609,-0.022158,-0.088274
75%,0.756646,0.695298,0.699046,0.623331
max,2.653656,3.525865,2.735527,3.366626


Suppose you wanted to find values in one of the columns exceeding 3 in absolute
value:

In [69]:
col = data[2]
col[col.abs() > 3]

41    -3.399312
136   -3.745356
Name: 2, dtype: float64

In [70]:
data[(data.abs() > 3).any(axis="columns")]

Unnamed: 0,0,1,2,3
41,0.457246,-0.025907,-3.399312,-0.974657
60,1.951312,3.260383,0.963301,1.201206
136,0.508391,-0.196713,-3.745356,-1.520113
235,-0.242459,-3.05699,1.918403,-0.578828
258,0.682841,0.326045,0.425384,-3.428254
322,1.179227,-3.184377,1.369891,-1.074833
544,-3.548824,1.553205,-2.186301,1.277104
635,-0.578093,0.193299,1.397822,3.366626
782,-0.207434,3.525865,0.28307,0.544635
803,-3.64586,0.255475,-0.549574,-1.907459


The parentheses around `data.abs() > 3` are necessary in order to call the `any`
method on the result of the comparison operation.

In [71]:
data[data.abs() > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.050286,0.025567,-0.001399,-0.051765
std,0.99292,1.004214,0.991414,0.995761
min,-3.0,-3.0,-3.0,-3.0
25%,-0.599807,-0.612162,-0.687373,-0.747478
50%,0.047101,-0.013609,-0.022158,-0.088274
75%,0.756646,0.695298,0.699046,0.623331
max,2.653656,3.0,2.735527,3.0


In [72]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,1.0,-1.0,1.0
1,1.0,-1.0,1.0,-1.0
2,1.0,1.0,1.0,-1.0
3,-1.0,-1.0,1.0,-1.0
4,-1.0,1.0,-1.0,-1.0


The statement `np.sign(data)` produces 1 and –1 values based on whether the values
in `data` are positive or negative.
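
The capping step above can also be written with the `clip` method, which limits values to a given range. This is offered as an alternative sketch rather than the approach used in the text; the small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with a few values outside [-3, 3]
data = pd.DataFrame({"a": [-4.5, 0.2, 3.7],
                     "b": [1.0, -3.2, 2.9]})

# Approach from the text: overwrite outliers with +3 or -3 by sign
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip everything to the interval [-3, 3]
clipped = data.clip(lower=-3, upper=3)

print(clipped)
```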

### Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is possible
using the `numpy.random.permutation` function. Calling permutation with the length
of the axis you want to permute produces an array of integers indicating the new
ordering:

In [73]:
df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))
df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [74]:
sampler = np.random.permutation(5)
sampler

array([3, 1, 4, 2, 0])

That array can then be used in `iloc`-based indexing or the equivalent `take` function:

In [None]:
# Return the elements in the given positional indices along an axis
df.take(sampler)

Unnamed: 0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
1,7,8,9,10,11,12,13
4,28,29,30,31,32,33,34
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6


In [75]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
1,7,8,9,10,11,12,13
4,28,29,30,31,32,33,34
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6


By invoking `take` with `axis="columns"`, we could also select a permutation of the
columns:

In [76]:
column_sampler = np.random.permutation(7)
column_sampler

array([4, 6, 3, 2, 1, 0, 5])

In [77]:
df.take(column_sampler, axis="columns")

Unnamed: 0,4,6,3,2,1,0,5
0,4,6,3,2,1,0,5
1,11,13,10,9,8,7,12
2,18,20,17,16,15,14,19
3,25,27,24,23,22,21,26
4,32,34,31,30,29,28,33


To select a random subset **without** replacement (the same row cannot appear twice),
you can use the `sample` method on Series and DataFrame:

In [78]:
df.sample(n=3)

Unnamed: 0,0,1,2,3,4,5,6
2,14,15,16,17,18,19,20
4,28,29,30,31,32,33,34
0,0,1,2,3,4,5,6


To generate a sample **with** replacement (to allow repeat choices), pass `replace=True`
to `sample`:

In [79]:
choices = pd.Series([5, 7, -1, 6, 4])
choices.sample(n=10, replace=True)

2   -1
0    5
3    6
1    7
4    4
0    5
4    4
0    5
4    4
4    4
dtype: int64
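
Since `sample` draws at random, results change from run to run. If repeatability matters, the `random_state` argument can be fixed, as in this brief sketch (the seed value 42 is an arbitrary choice).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

# The same seed yields the same rows on every call
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

print(first)
```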

### Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a **dummy** or **indicator matrix**. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame
with k columns containing all 1s and 0s. pandas has a `pandas.get_dummies` function
for doing this, though you could also devise one yourself.

In [80]:
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [81]:
pd.get_dummies(df["key"], dtype=float)

Unnamed: 0,a,b,c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0
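
For intuition about what `pandas.get_dummies` computes, here is one way you might devise the indicator matrix by hand: compare the column against each distinct value in turn. This is a sketch for illustration, not how pandas implements it.

```python
import pandas as pd

keys = pd.Series(["b", "b", "a", "c", "a", "b"])

# One indicator column per distinct value, in sorted order
categories = sorted(keys.unique())
manual = pd.DataFrame({cat: (keys == cat).astype(float)
                       for cat in categories})

builtin = pd.get_dummies(keys, dtype=float)

print(manual)
```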


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. `pandas.get_dummies` has a
`prefix` argument for doing this:

In [82]:
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)
dummies

Unnamed: 0,key_a,key_b,key_c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0


In [83]:
df_with_dummy = df[["data1"]].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0.0,1.0,0.0
1,1,0.0,1.0,0.0
2,2,1.0,0.0,0.0
3,3,0.0,0.0,1.0
4,4,1.0,0.0,0.0
5,5,0.0,1.0,0.0


If a row in a DataFrame belongs to **multiple categories**, we have to use a different
approach to create the dummy variables.

In [85]:
mnames = ["movie_id", "title", "genres"]
# engine="python" is required for a multi-character separator such as "::"
movies = pd.read_table("datasets/movielens/movies.dat", sep="::",
                       header=None, names=mnames, engine="python")
movies[:10]


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


pandas has implemented a special Series method `str.get_dummies` that handles this scenario of multiple group membership encoded as a delimited string:

In [86]:
dummies = movies["genres"].str.get_dummies("|")
dummies.iloc[:10, :6]

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime
0,0,0,1,1,1,0
1,0,1,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,0,1,0
4,0,0,0,0,1,0
5,1,0,0,0,0,1
6,0,0,0,0,1,0
7,0,1,0,1,0,0
8,1,0,0,0,0,0
9,1,1,0,0,0,0


In [88]:
movies_windic = movies.join(dummies.add_prefix("Genre_"))
movies_windic

Unnamed: 0,movie_id,title,genres,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Crime,Genre_Documentary,...,Genre_Fantasy,Genre_Film-Noir,Genre_Horror,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Thriller,Genre_War,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,3948,Meet the Parents (2000),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3879,3949,Requiem for a Dream (2000),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3880,3950,Tigerland (2000),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3881,3951,Two Family House (2000),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [89]:
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0, dtype: object

For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would
be better to write a lower-level function that writes directly to a
NumPy array, and then wrap the result in a DataFrame.
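
One possible shape for such a lower-level approach is sketched below: preallocate a zero matrix, then set the relevant entries row by row. The three genre strings are a made-up stand-in for the MovieLens column, and no claim is made here about how much faster this is in practice.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for movies["genres"]
genres = pd.Series(["Animation|Children's|Comedy",
                    "Comedy|Romance",
                    "Action"])

# Collect the distinct genres and give each a column position
all_genres = sorted({g for row in genres for g in row.split("|")})
col_index = {g: i for i, g in enumerate(all_genres)}

# Write the indicators directly into a preallocated NumPy array
matrix = np.zeros((len(genres), len(all_genres)), dtype=np.int64)
for row_num, row in enumerate(genres):
    cols = [col_index[g] for g in row.split("|")]
    matrix[row_num, cols] = 1

# Wrap the result in a DataFrame
dummies = pd.DataFrame(matrix, columns=all_genres, index=genres.index)
print(dummies)
```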

A useful recipe for statistical applications is to combine `pandas.get_dummies` with a
discretization function like `pandas.cut`:

In [90]:
np.random.seed(12345) # to make the example repeatable
values = np.random.uniform(size=10)
values

array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 0.6532,
       0.7489, 0.6536])

In [92]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.cut(values, bins)

[(0.8, 1.0], (0.2, 0.4], (0.0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.4, 0.6], (0.8, 1.0], (0.6, 0.8], (0.6, 0.8], (0.6, 0.8]]
Categories (5, interval[float64, right]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]

In [93]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,False,False,False,False,True
1,False,True,False,False,False
2,True,False,False,False,False
3,False,True,False,False,False
4,False,False,True,False,False
5,False,False,True,False,False
6,False,False,False,False,True
7,False,False,False,True,False
8,False,False,False,True,False
9,False,False,False,True,False


## Extension Data Types
pandas was originally built upon the capabilities present in NumPy. Many pandas concepts,
such as missing data, were implemented using what was available in NumPy while
trying to maximize compatibility between libraries that used NumPy and pandas
together.

More recently, pandas has developed an **extension type system** allowing for new data
types to be added even if they are not supported natively by NumPy. These new data
types can be treated as first class alongside data coming from NumPy arrays.

In [94]:
s = pd.Series([1, 2, 3, None])
s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [95]:
s.dtype

dtype('float64')

Mainly for backward compatibility reasons, Series uses the legacy behavior of using
a `float64` data type and `np.nan` for the missing value. We could create this Series
instead using `pandas.Int64Dtype`:

In [151]:
s = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
s

0       1
1       2
2       3
3    <NA>
dtype: Int64

In [152]:
s.isna()

0    False
1    False
2    False
3     True
dtype: bool

In [153]:
s.dtype

Int64Dtype()

The output `<NA>` indicates that a value is missing for an extension type array. This
uses the special `pandas.NA` sentinel value:

In [154]:
s[3]

<NA>

In [157]:
s[3] is pd.NA

True

We also could have used the shorthand `"Int64"` instead of `pd.Int64Dtype()` to
specify the type. The capitalization is necessary; otherwise it will be a NumPy-based
nonextension type:

In [158]:
s = pd.Series([1, 2, 3, None], dtype="Int64")
s

0       1
1       2
2       3
3    <NA>
dtype: Int64
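
A short sketch contrasting the two representations on the same made-up input: the extension type keeps integers as integers and marks the gap with `pandas.NA`, while the default falls back to `float64` with `NaN`.

```python
import pandas as pd

# Same values, two representations of the missing entry
ext = pd.Series([1, 2, None], dtype="Int64")  # extension integer type
legacy = pd.Series([1, 2, None])              # default: float64 with NaN

print(ext.dtype, legacy.dtype)
print(ext.isna().tolist())
print(ext.sum())  # missing values are skipped by default
```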

pandas also has an extension type specialized for string data that does not use
NumPy object arrays (its pyarrow-backed storage requires the **pyarrow** library, which you may
need to install separately):

In [103]:
s = pd.Series(['one', 'two', None, 'three'], dtype=pd.StringDtype())
s

0      one
1      two
2     <NA>
3    three
dtype: string

These string arrays generally use much less memory and are frequently computationally
more efficient for doing operations on large datasets.

Extension types can be passed to the Series `astype` method, allowing you to convert
easily as part of your data cleaning process:

In [159]:
df = pd.DataFrame({"A": [1, 2, None, 4],
                   "B": ["one", "two", "three", None],
                   "C": [False, None, False, True]})
df

Unnamed: 0,A,B,C
0,1.0,one,False
1,2.0,two,
2,,three,False
3,4.0,,True


In [160]:
df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
df["C"] = df["C"].astype("boolean")
df

Unnamed: 0,A,B,C
0,1,one,False
1,2,two,<NA>
2,<NA>,three,False
3,4,<NA>,True


<img src="Img/pd_extension.png" alt="pandas extension data types" title="pandas extension data types" />

## String Manipulation

### Python Built-In String Object Methods

In [107]:
val = "a,b,  guido"
val.split(",")

['a', 'b', '  guido']

In [108]:
pieces = [x.strip() for x in val.split(",")]
pieces

['a', 'b', 'guido']

In [109]:
first, second, third = pieces
first + "::" + second + "::" + third

'a::b::guido'

In [110]:
"::".join(pieces)

'a::b::guido'

In [161]:
"guido" in val

True

In [162]:
val.index(",")

1

In [163]:
val.find(":")

-1

In [164]:
val.index(":")

ValueError: substring not found
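
The cells above show the difference between `find` and `index`: both locate a substring, but `find` returns -1 when it is absent while `index` raises an exception. A minimal sketch of handling both cases:

```python
val = "a,b,  guido"

# find signals "not found" with -1
pos_find = val.find(":")

# index raises ValueError instead, so it needs a try/except
try:
    pos_index = val.index(":")
except ValueError:
    pos_index = None

print(pos_find, pos_index)
```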

In [165]:
val.count(",")

2

In [167]:
val.replace(",", "::")

'a::b::  guido'

In [168]:
val.replace(",", "")

'ab  guido'

### Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a **regex**, is a string
formed according to the regular expression language. Python’s built-in `re` module is
responsible for applying regular expressions to strings:

In [170]:
import re
text = "foo    bar\t baz  \tqux"
# one or more whitespace
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

When you call `re.split(r"\s+", text)`, the regular expression is first **compiled**, and
then its **split** method is called on the passed text. You can **compile** the regex yourself
with `re.compile`, forming a reusable regex object:

In [171]:
regex = re.compile(r"\s+")
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [172]:
regex.findall(text)

['    ', '\t ', '  \t']

In [174]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# re.IGNORECASE makes the regex case insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [175]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [176]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [177]:
text[m.start():m.end()]

'dave@google.com'

In [None]:
# regex.match will match only if the pattern occurs at the start of the string
print(regex.match(text))

None


In [None]:
# return a new string with occurrences of the pattern replaced by a new string
print(regex.sub("REDACTED", text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED


In [179]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

In [180]:
m = regex.match("wesm@bright.net")
m.groups()

('wesm', 'bright', 'net')

In [181]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [182]:
print(regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
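
When a pattern has several groups, referring to them by position can become hard to read. One option, sketched below, is to give each group a name with the `(?P<name>...)` syntax and read the pieces back with `groupdict`. The group names `username`, `domain`, and `suffix` are chosen here for illustration, and `re.VERBOSE` is used only so that the pattern can be spread over several lines.

```python
import re

# Same three groups as above, but named; whitespace in the pattern
# is ignored because of re.VERBOSE
named_pattern = r"""
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9.-]+)
    \.
    (?P<suffix>[A-Z]{2,4})"""
named_regex = re.compile(named_pattern, flags=re.IGNORECASE | re.VERBOSE)

m = named_regex.match("wesm@bright.net")
parts = m.groupdict()  # dictionary keyed by group name
print(parts)
```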


### String Functions in pandas

In [183]:
data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
        "Rob": "rob@gmail.com", "Wes": np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [184]:
data.isna()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [185]:
data.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [186]:
data_as_string_ext = data.astype('string')
data_as_string_ext

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string

In [187]:
data_as_string_ext.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes       <NA>
dtype: boolean
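
Because the result of `str.contains` can hold missing values, it may not be usable directly as a filter. One possible approach, shown as a sketch below, is to pass `na=False` so that missing entries are treated as non-matches. The names `is_gmail` and `gmail_only` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
                  "Rob": "rob@gmail.com", "Wes": np.nan})

# na=False fills the result with False wherever the input is missing,
# so the mask contains only True/False values
is_gmail = data.str.contains("gmail", na=False)

# The mask can then be used for boolean indexing
gmail_only = data[is_gmail]
print(gmail_only)
```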

In [188]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [189]:
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object

In [190]:
matches.str.get(1)

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object

In [135]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

In [191]:
data.str.extract(pattern, flags=re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN


<img src="Img/pd_string.png" alt="Partial listing of Series string methods" title="Partial listing of Series string methods" />


## Categorical Data

### Background and Motivation
Frequently, a column in a table may contain repeated instances of a smaller set of
distinct values. We have already seen functions like `unique` and `value_counts`, which
enable us to extract the distinct values from an array and compute their frequencies,
respectively:

In [196]:
values = pd.Series(['apple', 'orange', 'apple',
                    'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [197]:
pd.unique(values)


array(['apple', 'orange'], dtype=object)

In [198]:
values.value_counts()

apple     6
orange    2
Name: count, dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have
developed specialized approaches for representing data with repeated values for more
efficient storage and computation. In data warehousing, a best practice is to use
so-called dimension tables containing the distinct values and to store the primary
observations as integer keys referencing the dimension table:

In [199]:
values = pd.Series([0, 1, 0, 0] * 2)
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [200]:
dim = pd.Series(['apple', 'orange'])
dim

0     apple
1    orange
dtype: object

In [201]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the **categorical** or **dictionary-encoded representation**.
The array of distinct values can be called the **categories**, dictionary, or levels
of the data. The integer values that reference the categories are called the **category codes** or simply codes.
The categorical representation can yield significant performance improvements when
you are doing analytics. You can also perform transformations on the categories while
leaving the codes unmodified. Some example transformations that can be made at
relatively low cost are:

- Renaming categories
- Appending a new category without changing the order or position of the existing
categories
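
Both transformations listed above can be sketched with `pandas.Categorical`, checking that the integer codes are left untouched. The variable names and the uppercase labels below are made up for illustration.

```python
import pandas as pd

fruit = pd.Categorical(["apple", "orange", "apple", "apple"])
codes_before = fruit.codes.copy()

# Renaming categories changes only the labels
renamed = fruit.rename_categories({"apple": "APPLE", "orange": "ORANGE"})

# A new category is appended after the existing ones
extended = fruit.add_categories(["banana"])

print(list(renamed.categories))
print(list(extended.categories))

# In both cases the codes are the same as before
print((renamed.codes == codes_before).all())
print((extended.codes == codes_before).all())
```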

### Categorical Extension Type in pandas
pandas has a special `Categorical` extension type for holding data that uses the
integer-based categorical representation or encoding. This is a popular data compression 
technique for data with many occurrences of similar values and can provide
significantly faster performance with lower memory use, especially for string data.

In [202]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
rng = np.random.default_rng(seed=12345)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': rng.integers(3, 15, size=N),
                   'weight': rng.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple     11  1.564438
1          1  orange      5  1.331256
2          2   apple     12  2.393235
3          3   apple      6  0.746937
4          4   apple      5  2.691024
5          5  orange     12  3.767211
6          6   apple     10  0.992983
7          7   apple     11  3.795525


In [203]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

The values for `fruit_cat` are now an instance of `pandas.Categorical`, which you
can access via the `.array` attribute:

In [204]:
c = fruit_cat.array
type(c)

pandas.core.arrays.categorical.Categorical

In [205]:
c

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

In [206]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [207]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

A useful **trick** to get a mapping between codes and categories is:

In [148]:
dict(enumerate(c.categories))

{0: 'apple', 1: 'orange'}

You can convert a DataFrame column to categorical by assigning the converted result:

In [208]:
df['fruit'] = df['fruit'].astype('category')
df["fruit"]

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

You can also create `pandas.Categorical` directly from other types of Python
sequences:

In [209]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

If you have obtained categorically encoded data from another source, you can use the
alternative `from_codes` constructor:

In [210]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

Unless explicitly specified, categorical conversions assume no specific ordering of the
categories. So the `categories` array may be in a different order depending on the
ordering of the input data. When using `from_codes` or any of the other constructors,
you can indicate that the categories have a meaningful ordering:

In [211]:
ordered_cat = pd.Categorical.from_codes(codes, categories,
                                        ordered=True)
ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

In [212]:
my_cats_2.as_ordered()

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
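
To illustrate what a meaningful ordering allows, here is a small sketch using a hypothetical `sizes` Series. With `ordered=True`, operations such as `min`, `max`, sorting, and comparisons follow the category order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical data; the order is given explicitly by `categories`
sizes = pd.Series(
    pd.Categorical(
        ["medium", "small", "large", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# min and max use the category order
print(sizes.min(), sizes.max())

# Sorting also follows the category order
print(sizes.sort_values().tolist())

# Comparisons against a category are allowed for ordered categoricals
print((sizes > "small").tolist())
```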

### Computations with Categoricals
Using `Categorical` in pandas generally behaves the same way as using the nonencoded version (like an array
of strings). Some parts of pandas, like the `groupby`
function, perform better when working with categoricals. There are also some functions that can utilize the `ordered` flag.

Let’s consider some random numeric data and use the `pandas.qcut` binning function.
This returns `pandas.Categorical`:

In [213]:
rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)
draws[:5]

array([-1.4238,  1.2637, -0.8707, -0.2592, -0.0753])

In [214]:
bins = pd.qcut(draws, 4)
bins

[(-3.121, -0.675], (0.687, 3.211], (-3.121, -0.675], (-0.675, 0.0134], (-0.675, 0.0134], ..., (0.0134, 0.687], (0.0134, 0.687], (-0.675, 0.0134], (0.0134, 0.687], (-0.675, 0.0134]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.121, -0.675] < (-0.675, 0.0134] < (0.0134, 0.687] < (0.687, 3.211]]

In [215]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

['Q1', 'Q4', 'Q1', 'Q2', 'Q2', ..., 'Q3', 'Q3', 'Q2', 'Q3', 'Q2']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [216]:
bins.codes[:10]

array([0, 3, 0, 1, 1, 0, 0, 2, 2, 0], dtype=int8)

The labeled bins categorical does not contain information about the bin edges in the
data, so we can use `groupby` to extract some summary statistics:

In [219]:
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
results

  quartile  count       min       max
0       Q1    250 -3.119609 -0.678494
1       Q2    250 -0.673305  0.008009
2       Q3    250  0.018753  0.686183
3       Q4    250  0.688282  3.211418


In [220]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
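
Since labeled bins do not carry their edges, another way to recover them (besides the `groupby` summary above) is to ask `pandas.qcut` for the edges at binning time with `retbins=True`. The sketch below assumes the same seed as the cells above; the names `labeled` and `edges` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)

# retbins=True makes qcut return the bin edges alongside the categorical
labeled, edges = pd.qcut(draws, 4, labels=["Q1", "Q2", "Q3", "Q4"],
                         retbins=True)

# Four quantile bins are delimited by five edges
print(edges)
```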

### Better Performance with Categoricals

In [221]:
N = 10_000_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

In [222]:
categories = labels.astype('category')

In [224]:
labels.memory_usage(deep=True)

520000132

In [225]:
categories.memory_usage(deep=True)

10000512

In [226]:
%time _ = labels.astype('category')

CPU times: user 242 ms, sys: 32.3 ms, total: 274 ms
Wall time: 292 ms


GroupBy operations can be significantly faster with categoricals because the underlying algorithms 
use the integer-based codes array instead of an array of strings. Here
we compare the performance of `value_counts()`, which internally uses the GroupBy
machinery:

In [227]:
%timeit labels.value_counts()

232 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [228]:
%timeit categories.value_counts()

30.8 ms ± 309 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
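
The savings shown above depend on the data having few distinct values relative to its length. The sketch below uses made-up data and a hypothetical helper, `memory_ratio`, to compare a Series with four distinct labels against one where every value is different. In the second case the categories hold every string once and the codes are stored on top, so no saving should be expected.

```python
import pandas as pd

n = 100_000
few_distinct = pd.Series(["foo", "bar", "baz", "qux"] * (n // 4))
all_distinct = pd.Series([f"id_{i}" for i in range(n)])

def memory_ratio(s):
    """Bytes used by the category version divided by bytes used by the original."""
    return s.astype("category").memory_usage(deep=True) / s.memory_usage(deep=True)

# A ratio below 1 means the categorical version is smaller
print(f"few distinct values: {memory_ratio(few_distinct):.2f}")
print(f"all distinct values: {memory_ratio(all_distinct):.2f}")
```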


### Categorical Methods
Series containing categorical data have several special methods, similar to the `Series.str` specialized string methods.
These also provide convenient access to the categories and codes. Consider the Series:

In [229]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

The special accessor attribute `cat` provides access to categorical methods:

In [230]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [231]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the
four values observed in the data. We can use the `set_categories` method to change
them:

In [232]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

While it appears that the data is unchanged, the new categories will be reflected
in operations that use them. For example, `value_counts` respects the categories, if
present:

In [233]:
cat_s.value_counts()

a    2
b    2
c    2
d    2
Name: count, dtype: int64

In [234]:
cat_s2.value_counts()

a    2
b    2
c    2
d    2
e    0
Name: count, dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. 
After you filter a large DataFrame or Series, many
of the categories may not appear in the data. To help with this, we can use the
`remove_unused_categories` method to trim unobserved categories:

In [235]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [236]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']

<img src="Img/pd_categorical.png" alt="Categorical methods for Series in pandas" title="Categorical methods for Series in pandas" />


### Creating Dummy Variables for Modeling
When you’re using statistics or machine learning tools, you’ll often transform categorical data 
into dummy variables, also known as **one-hot encoding**. This involves creating a DataFrame 
with a column for each distinct category; these columns contain 1s
for occurrences of a given category and 0s otherwise.

In [238]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

In [239]:
pd.get_dummies(cat_s, dtype=float)

     a    b    c    d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
3  0.0  0.0  0.0  1.0
4  1.0  0.0  0.0  0.0
5  0.0  1.0  0.0  0.0
6  0.0  0.0  1.0  0.0
7  0.0  0.0  0.0  1.0
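
In practice the dummy columns usually need to sit next to the other columns of a DataFrame. The following is a minimal sketch with a made-up DataFrame (the column names `key` and `data1` are hypothetical): the `prefix` argument of `pandas.get_dummies` labels the new columns, and `join` attaches them to the remaining data.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({"key": pd.Categorical(["b", "a", "c", "a"]),
                   "data1": [10, 20, 30, 40]})

# prefix="key" names the dummy columns key_a, key_b, key_c
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)

# Combine the numeric column with the indicator columns
df_with_dummy = df[["data1"]].join(dummies)
print(df_with_dummy)
```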
