#### Part 20: Advanced Merging and Data Transformation in Pandas

In this notebook, we'll explore:
- Completing our exploration of merge_asof
- Using get_dummies for one-hot encoding
- Factorizing values

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np

##### 1. Completing our Exploration of merge_asof

Let's continue with the trades and quotes example from the previous notebook:

In [2]:
# Create sample DataFrames for trades and quotes
trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                           '20160525 13:30:00.038',
                           '20160525 13:30:00.048',
                           '20160525 13:30:00.048',
                           '20160525 13:30:00.048']),
    'ticker': ['MSFT', 'MSFT', 'GOOG', 'GOOG', 'AAPL'],
    'price': [51.95, 51.95, 720.77, 720.92, 98.00],
    'quantity': [75, 155, 100, 100, 100]
}, columns=['time', 'ticker', 'price', 'quantity'])

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                           '20160525 13:30:00.023',
                           '20160525 13:30:00.030',
                           '20160525 13:30:00.041',
                           '20160525 13:30:00.048',
                           '20160525 13:30:00.049',
                           '20160525 13:30:00.072',
                           '20160525 13:30:00.075']),
    'ticker': ['GOOG', 'MSFT', 'MSFT', 'MSFT', 'GOOG', 'AAPL', 'GOOG', 'MSFT'],
    'bid': [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
    'ask': [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
}, columns=['time', 'ticker', 'bid', 'ask'])

# Display the DataFrames
print("Trades:")
display(trades)
print("\nQuotes:")
display(quotes)

Trades:


,time,ticker,price,quantity
0,2016-05-25 13:30:00.023,MSFT,51.95,75
1,2016-05-25 13:30:00.038,MSFT,51.95,155
2,2016-05-25 13:30:00.048,GOOG,720.77,100
3,2016-05-25 13:30:00.048,GOOG,720.92,100
4,2016-05-25 13:30:00.048,AAPL,98.0,100



Quotes:


,time,ticker,bid,ask
0,2016-05-25 13:30:00.023,GOOG,720.5,720.93
1,2016-05-25 13:30:00.023,MSFT,51.95,51.96
2,2016-05-25 13:30:00.030,MSFT,51.97,51.98
3,2016-05-25 13:30:00.041,MSFT,51.99,52.0
4,2016-05-25 13:30:00.048,GOOG,720.5,720.93
5,2016-05-25 13:30:00.049,AAPL,97.99,98.01
6,2016-05-25 13:30:00.072,GOOG,720.5,720.88
7,2016-05-25 13:30:00.075,MSFT,52.01,52.03


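By default, `merge_asof` looks backward: each trade is matched with the most recent quote for the same ticker whose time is at or before the trade time. A trade with no such quote gets missing values for `bid` and `ask`: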

In [3]:
# Basic asof merge
pd.merge_asof(trades, quotes, on='time', by='ticker')

,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.97,51.98
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,

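One practical detail behind the call above: `merge_asof` expects the `on` key to be sorted in both frames. The sketch below uses small hypothetical frames with integer keys (`left`, `right`, and their columns are illustrative names, not part of the trades/quotes data) to show the error raised for unsorted input and the backward match once the input is sorted.

```python
import pandas as pd

# Hypothetical toy frames; integer keys keep the matching easy to follow.
left = pd.DataFrame({'t': [10, 1, 5], 'left_val': ['c', 'a', 'b']})
right = pd.DataFrame({'t': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# The left frame is not sorted on 't', so merge_asof raises a ValueError.
try:
    pd.merge_asof(left, right, on='t')
    unsorted_ok = True
except ValueError:
    unsorted_ok = False

# Sorting the left frame first makes the merge valid.
left_sorted = left.sort_values('t').reset_index(drop=True)

# Default direction is 'backward': take the last right row with t <= left t.
backward = pd.merge_asof(left_sorted, right, on='t')
print(backward)
```

Here `t=5` picks up the right-hand row at `t=3`, and `t=10` picks up the row at `t=7`, since those are the latest keys that do not exceed the left-hand key.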

We can also pass a `tolerance`, so that a quote is matched only if it falls within a given time window (here 2 ms) before the trade. Trades with no quote inside that window get missing values:

In [4]:
# Asof merge with tolerance
pd.merge_asof(trades, quotes, on='time', by='ticker', tolerance=pd.Timedelta('2ms'))

,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,,
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,

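The same idea can be sketched with integer keys, where the tolerance is a plain integer rather than a `Timedelta`. The frames below are hypothetical toy data, chosen so that only one left-hand row has a right-hand key close enough to match.

```python
import pandas as pd

# Hypothetical toy frames, already sorted on the key 't'.
left = pd.DataFrame({'t': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'t': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# Backward match, but only accept a right row at most 1 unit behind.
limited = pd.merge_asof(left, right, on='t', tolerance=1)

# Rows whose nearest earlier key was too far away end up with NaN.
unmatched = limited[limited['right_val'].isna()]
print(limited)
print(unmatched['left_val'].tolist())
```

For `t=5` the closest earlier key is 3 (a gap of 2) and for `t=10` it is 7 (a gap of 3), so both fall outside the tolerance and are left unmatched.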

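The `direction` parameter controls where the merge searches for a match: `'backward'` (the default) uses the last quote at or before the trade, `'forward'` uses the first quote at or after it, and `'nearest'` uses whichever quote is closest in time: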

In [5]:
# Asof merge with direction='forward'
pd.merge_asof(trades, quotes, on='time', by='ticker', direction='forward')

,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.99,52.0
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,97.99,98.01


In [6]:
# Asof merge with direction='nearest'
pd.merge_asof(trades, quotes, on='time', by='ticker', direction='nearest')

,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.99,52.0
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,97.99,98.01

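To compare the three directions side by side, here is a minimal sketch on hypothetical integer-keyed frames (the frame and column names are illustrative). It collects the matched right-hand value for each direction into one table.

```python
import pandas as pd

# Hypothetical toy frames, already sorted on the key 't'.
left = pd.DataFrame({'t': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'t': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# One column of matched values per direction.
matches = pd.DataFrame({'t': left['t']})
for direction in ('backward', 'forward', 'nearest'):
    merged = pd.merge_asof(left, right, on='t', direction=direction)
    matches[direction] = merged['right_val']

print(matches)
```

In this sketch `t=10` has no later key in `right`, so the forward match is missing, while the nearest match falls back to the earlier key 7.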

##### 2. Using get_dummies for One-Hot Encoding

The `get_dummies()` function is used to convert categorical variables into dummy/indicator variables (also known as one-hot encoding).

In [7]:
# Create a sample DataFrame with categorical columns
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'c', 'b'],
    'C': [1, 2, 3]
})
df

,A,B,C
0,a,b,1
1,b,c,2
2,a,b,3


###### 2.1 Basic One-Hot Encoding

In [8]:
# Convert categorical variables to dummy/indicator variables
pd.get_dummies(df)

,C,A_a,A_b,B_b,B_c
0,1,True,False,True,False
1,2,False,True,False,True
2,3,True,False,True,False

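By default every object-typed column is encoded. If only some columns should be expanded, the `columns` argument of `get_dummies` restricts the encoding. The sketch below rebuilds the same small frame and encodes only `A`, which is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'c', 'b'],
    'C': [1, 2, 3]
})

# Encode only column 'A'; 'B' is kept as-is even though it holds strings.
encoded = pd.get_dummies(df, columns=['A'])
print(encoded.columns.tolist())
```

The untouched columns come first in their original order, followed by the new indicator columns.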

###### 2.2 Customizing Prefix

In [9]:
# Using a list to specify prefixes
from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
from_list

,C,from_A_a,from_A_b,from_B_b,from_B_c
0,1,True,False,True,False
1,2,False,True,False,True
2,3,True,False,True,False


In [10]:
# Using a dictionary to specify prefixes
from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})
from_dict

,C,from_A_a,from_A_b,from_B_b,from_B_c
0,1,True,False,True,False
1,2,False,True,False,True
2,3,True,False,True,False

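The list and dictionary forms give the same result here, because the dictionary is looked up by column name rather than by position. A short sketch to confirm this, which also shows the related `prefix_sep` argument (the `'.'` separator is only an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'c', 'b'],
    'C': [1, 2, 3]
})

from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})

# Key order in the dictionary does not matter.
same = from_list.equals(from_dict)
print(same)

# prefix_sep changes the string placed between the prefix and the level.
dotted = pd.get_dummies(df, prefix=['from_A', 'from_B'], prefix_sep='.')
print(dotted.columns.tolist())
```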

###### 2.3 Dropping First Category

It is sometimes useful to keep only k-1 levels of a categorical variable, to avoid collinearity when feeding the result to a statistical model. Pass `drop_first=True` to enable this.

In [11]:
# Create a sample Series
s = pd.Series(list('abcaa'))

# Default behavior
pd.get_dummies(s)

,a,b,c
0,True,False,False
1,False,True,False
2,False,False,True
3,True,False,False
4,True,False,False


In [12]:
# Drop first category
pd.get_dummies(s, drop_first=True)

,b,c
0,False,False
1,True,False
2,False,True
3,False,False
4,False,False


With `drop_first=True`, a column that contains only one level is omitted from the result entirely, because its single level is the one that gets dropped.

In [13]:
# Create a DataFrame with one column having only one level
df = pd.DataFrame({'A': list('aaaaa'), 'B': list('ababc')})

# Default behavior
pd.get_dummies(df)

,A_a,B_a,B_b,B_c
0,True,True,False,False
1,True,False,True,False
2,True,True,False,False
3,True,False,True,False
4,True,False,False,True


In [14]:
# Drop first category
pd.get_dummies(df, drop_first=True)

,B_b,B_c
0,False,False
1,True,False
2,False,False
3,True,False
4,False,True

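The collinearity mentioned above comes from the fact that the full set of indicator columns always sums to 1 in every row, so any one column is determined by the others. The following sketch checks this on the `abcaa` Series and shows that the dropped level can be recovered from the remaining columns. `dtype=bool` is passed explicitly so the result does not depend on the pandas version.

```python
import pandas as pd

s = pd.Series(list('abcaa'))

full = pd.get_dummies(s, dtype=bool)
reduced = pd.get_dummies(s, drop_first=True, dtype=bool)

# Each row has exactly one True across the full set of indicators.
rows_sum_to_one = bool((full.sum(axis=1) == 1).all())

# The dropped level 'a' is implied wherever all remaining indicators are False.
implied_a = ~reduced.any(axis=1)
recovered = bool((implied_a == full['a']).all())

print(rows_sum_to_one, recovered)
```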

###### 2.4 Specifying Data Type

By default, the new columns have `bool` dtype in pandas 2.0 and later (earlier versions used `np.uint8`). To choose a dtype explicitly, pass the `dtype` argument:

In [15]:
# Create a DataFrame with mixed types
df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})

# Specify boolean dtype for dummy variables
pd.get_dummies(df, dtype=bool).dtypes

B      float64
A_a       bool
A_b       bool
A_c       bool
dtype: object
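
Numeric indicator columns are often more convenient as model input. As a minimal sketch on the same kind of frame, the `dtype` argument can request floats or a compact integer type instead (which of the two to use is an assumption about the downstream consumer):

```python
import pandas as pd

df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})

# Indicators as 0.0 / 1.0 floats.
as_float = pd.get_dummies(df, dtype=float)

# Indicators as small integers to save memory.
as_int = pd.get_dummies(df, dtype='int8')

print(as_float.dtypes)
print(as_int.dtypes)
```

Only the generated columns take the requested dtype; the existing numeric column `B` keeps its own.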

##### 3. Factorizing Values

To encode 1-d values as an enumerated type, use `factorize()`:

In [16]:
# Create a Series with mixed types and missing values
x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
x

0       A
1       A
2     NaN
3       B
4    3.14
5     inf
dtype: object

In [17]:
# Factorize the Series
labels, uniques = pd.factorize(x)

print("Labels:")
print(labels)
print("\nUniques:")
print(uniques)

Labels:
[ 0  0 -1  1  2  3]

Uniques:
Index(['A', 'B', 3.14, inf], dtype='object')


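Note that `factorize` is similar to `numpy.unique`, but differs in its handling of NaN values. With `factorize`, NaN values are assigned a code of -1 by default and are left out of `uniques`. Codes are numbered in order of first appearance unless `sort=True` is passed.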
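
Because -1 is a sentinel rather than a real position, indexing `uniques` directly with the codes would map missing values to the last unique. One way to reverse the encoding safely is `pd.Categorical.from_codes`, which interprets -1 as missing. This is a sketch on a small hypothetical Series, not the mixed-type one above:

```python
import pandas as pd

# Hypothetical example with one missing value.
x = pd.Series(['b', 'a', None, 'b'])

codes, uniques = pd.factorize(x)
print(codes)      # integer codes, -1 marks the missing entry
print(uniques)    # unique values in order of first appearance

# from_codes treats -1 as missing, so the round trip restores the gap.
restored = pd.Series(pd.Categorical.from_codes(codes, categories=uniques))
print(restored)
```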

In [18]:
# Factorize with sort=True
labels, uniques = pd.factorize(x, sort=True)

print("Labels with sort=True:")
print(labels)
print("\nUniques with sort=True:")
print(uniques)

Labels with sort=True:
[ 2  2 -1  3  0  1]

Uniques with sort=True:
Index([3.14, inf, 'A', 'B'], dtype='object')


###### 3.1 Using Categorical Data Type

If you just want to handle one column as a categorical variable (like R's factor), you can use `pd.Categorical` or the `category` dtype:

In [19]:
# Create a DataFrame with a column to be treated as categorical
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'color': ['red', 'blue', 'red', 'green', 'blue']
})

# Method 1: Using pd.Categorical
df['color_cat1'] = pd.Categorical(df['color'])

# Method 2: Using astype('category')
df['color_cat2'] = df['color'].astype('category')

df

,id,color,color_cat1,color_cat2
0,1,red,red,red
1,2,blue,blue,blue
2,3,red,red,red
3,4,green,green,green
4,5,blue,blue,blue


In [20]:
# Examine the dtypes
df.dtypes

id               int64
color           object
color_cat1    category
color_cat2    category
dtype: object

In [21]:
# Get the categorical codes
print("Categorical codes:")
print(df['color_cat1'].cat.codes)

Categorical codes:
0    2
1    0
2    2
3    1
4    0
dtype: int8

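The codes above follow the sorted order of the categories (`blue`, `green`, `red`), which is why they line up with `factorize(..., sort=True)`. The sketch below checks that correspondence and then shows how an explicit category order changes the codes. The order `red < green < blue` is an arbitrary choice for illustration.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'blue'])

# Sorted factorization and default categorical codes agree for these values.
sorted_codes, sorted_uniques = pd.factorize(colors, sort=True)
cat_codes = colors.astype('category').cat.codes
same_codes = sorted_codes.tolist() == cat_codes.tolist()
print(same_codes)

# An explicit category order assigns codes by position in that list.
ordered = pd.Categorical(colors, categories=['red', 'green', 'blue'], ordered=True)
print(ordered.codes)
```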

###### 3.2 Practical Example: Pivot Table with Factorized Data

Let's build a larger example: a DataFrame whose key columns hold small integer codes, like the ones `factorize` produces, summarized with a pivot table:

In [23]:
# Set a random seed for reproducibility
np.random.seed(3141)
n = 20

# Create column names
cols = ['key', 'row', 'item', 'col']

# Create a DataFrame of random integer codes in [0, 5): n rows, 4 columns
data = np.random.randint(5, size=(n, 4))
df = pd.DataFrame(data, columns=cols)
df['value'] = np.random.randn(n)

df.head(10)  # show the first 10 of the n = 20 rows

,key,row,item,col,value
0,0,1,2,3,0.363064
1,1,1,0,2,-0.863485
2,0,3,4,2,0.131559
3,2,3,3,4,0.891781
4,3,4,0,2,-1.370387
5,1,2,0,0,-0.179278
6,3,3,3,2,0.624069
7,4,1,1,2,1.352848
8,3,2,1,4,1.185215
9,2,1,1,3,0.354881


In [24]:
# Create a pivot table that sums `value` for each (row, item) pair and each col
# (the string 'sum' avoids the FutureWarning that aggfunc=np.sum raises in recent pandas)
pivot = pd.pivot_table(df, values='value', index=['row', 'item'],
                      columns='col', aggfunc='sum')
pivot


row,item,col=0,col=2,col=3,col=4
0,0,,,1.855163,0.147537
1,0,,-0.863485,1.392675,
1,1,,1.352848,0.354881,
1,2,,,0.006578,
1,4,,-0.120159,,
2,0,-0.179278,,,
2,1,,,,1.185215
2,3,,,,-0.256172
3,1,0.822207,,,
3,2,-0.271172,,,


##### Summary

In this notebook, we've explored:

1. Advanced features of `merge_asof` for time-series data, including tolerance and direction parameters
2. Using `get_dummies` for one-hot encoding categorical variables with various options like prefix customization and dropping first categories
3. Using `factorize` to encode values as enumerated types
4. Working with the `Categorical` data type in pandas

These techniques are essential for data preprocessing and transformation in data analysis and machine learning workflows.