# Part 20: Advanced Merging and Data Transformation in Pandas

In this notebook, we'll explore:
- Completing our exploration of `merge_asof`
- Using `get_dummies` for one-hot encoding
- Factorizing values with `factorize`
- Working with the `Categorical` data type

## Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np

## 1. Completing our Exploration of merge_asof

Let's continue with the trades and quotes example from the previous notebook:

In [None]:
# Create sample DataFrames for trades and quotes
trades = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                           '20160525 13:30:00.038',
                           '20160525 13:30:00.048',
                           '20160525 13:30:00.048',
                           '20160525 13:30:00.048']),
    'ticker': ['MSFT', 'MSFT', 'GOOG', 'GOOG', 'AAPL'],
    'price': [51.95, 51.95, 720.77, 720.92, 98.00],
    'quantity': [75, 155, 100, 100, 100]
}, columns=['time', 'ticker', 'price', 'quantity'])

quotes = pd.DataFrame({
    'time': pd.to_datetime(['20160525 13:30:00.023',
                           '20160525 13:30:00.023',
                           '20160525 13:30:00.030',
                           '20160525 13:30:00.041',
                           '20160525 13:30:00.048',
                           '20160525 13:30:00.049',
                           '20160525 13:30:00.072',
                           '20160525 13:30:00.075']),
    'ticker': ['GOOG', 'MSFT', 'MSFT', 'MSFT', 'GOOG', 'AAPL', 'GOOG', 'MSFT'],
    'bid': [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
    'ask': [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
}, columns=['time', 'ticker', 'bid', 'ask'])

# Display the DataFrames
print("Trades:")
display(trades)
print("\nQuotes:")
display(quotes)

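By default, `merge_asof` searches backward: each trade is matched with the most recent quote for the same ticker whose time is at or before the trade time: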

In [None]:
# Basic asof merge
pd.merge_asof(trades, quotes, on='time', by='ticker')
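To see which quote each trade actually picked up, one option is to copy the quote timestamp into its own column before merging. The sketch below uses only the MSFT rows from the example above; the `quote_time` column and the variable names are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# MSFT-only subset of the trades and quotes shown above
msft_trades = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.038"]),
    "ticker": ["MSFT", "MSFT"],
    "price": [51.95, 51.95],
})
msft_quotes = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.030",
                            "2016-05-25 13:30:00.041",
                            "2016-05-25 13:30:00.075"]),
    "ticker": ["MSFT"] * 4,
    "bid": [51.95, 51.97, 51.99, 52.01],
})

# Hypothetical helper column: keep the quote's own timestamp,
# because the merged 'time' column comes from the left (trades) frame
msft_quotes["quote_time"] = msft_quotes["time"]

matched = pd.merge_asof(msft_trades, msft_quotes, on="time", by="ticker")
print(matched[["time", "quote_time", "bid"]])
```

The second trade (13:30:00.038) is paired with the 13:30:00.030 quote, the latest one that does not come after it.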

We can also pass `tolerance` so that a trade is matched only to a quote that falls within a given time window (here, 2ms) before it; trades with no quote inside the window get NaN:

In [None]:
# Asof merge with tolerance
pd.merge_asof(trades, quotes, on='time', by='ticker', tolerance=pd.Timedelta('2ms'))
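The effect of `tolerance` is easier to see on a couple of rows. The following is a minimal sketch on the MSFT subset of the data above (the variable names are illustrative); it also shows `allow_exact_matches=False`, which excludes quotes that share the exact timestamp of the trade.

```python
import pandas as pd

msft_trades = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.038"]),
    "ticker": ["MSFT", "MSFT"],
})
msft_quotes = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.030",
                            "2016-05-25 13:30:00.041"]),
    "ticker": ["MSFT"] * 3,
    "bid": [51.95, 51.97, 51.99],
})

# The second trade is 8ms after its latest quote, so a 2ms window drops it
within_2ms = pd.merge_asof(msft_trades, msft_quotes, on="time", by="ticker",
                           tolerance=pd.Timedelta("2ms"))

# A 10ms window is wide enough for both trades
within_10ms = pd.merge_asof(msft_trades, msft_quotes, on="time", by="ticker",
                            tolerance=pd.Timedelta("10ms"))

# Exclude quotes with exactly the same timestamp as the trade
strictly_earlier = pd.merge_asof(msft_trades, msft_quotes, on="time", by="ticker",
                                 tolerance=pd.Timedelta("10ms"),
                                 allow_exact_matches=False)

print(within_2ms["bid"].tolist())
print(within_10ms["bid"].tolist())
print(strictly_earlier["bid"].tolist())
```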

The `direction` parameter controls whether the merge searches `'backward'` (the default), `'forward'`, or for the `'nearest'` key in either direction:

In [None]:
# Asof merge with direction='forward'
pd.merge_asof(trades, quotes, on='time', by='ticker', direction='forward')

In [None]:
# Asof merge with direction='nearest'
pd.merge_asof(trades, quotes, on='time', by='ticker', direction='nearest')
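The three directions are easiest to compare on small integer keys. The sketch below uses toy frames (`left`, `right`, and their column names are assumptions made for illustration) and runs the same merge with each setting.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

# backward: last right key <= left key
backward = pd.merge_asof(left, right, on="a", direction="backward")

# forward: first right key >= left key (NaN if there is none)
forward = pd.merge_asof(left, right, on="a", direction="forward")

# nearest: closest right key on either side
nearest = pd.merge_asof(left, right, on="a", direction="nearest")

print(backward["right_val"].tolist())
print(forward["right_val"].tolist())
print(nearest["right_val"].tolist())
```

For the left key 5, the backward search finds 3, while the forward and nearest searches both find 6. For the left key 10, there is no right key at or after it, so the forward merge yields NaN.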

## 2. Using get_dummies for One-Hot Encoding

The `get_dummies()` function is used to convert categorical variables into dummy/indicator variables (also known as one-hot encoding).

In [None]:
# Create a sample DataFrame with categorical columns
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'c', 'b'],
    'C': [1, 2, 3]
})
df

### 2.1 Basic One-Hot Encoding

In [None]:
# Convert categorical variables to dummy/indicator variables
pd.get_dummies(df)
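By default, `get_dummies` encodes every string or categorical column and leaves numeric columns such as `C` untouched. To encode only some columns, the `columns` argument can be used. A short sketch, with `df_demo` as an illustrative copy of the frame above:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "A": ["a", "b", "a"],
    "B": ["b", "c", "b"],
    "C": [1, 2, 3],
})

# Encode only column A; B and C are passed through unchanged
only_a = pd.get_dummies(df_demo, columns=["A"])
print(only_a.columns.tolist())
```

The untouched columns come first, followed by the new indicator columns named `<column>_<level>`.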

### 2.2 Customizing Prefix

In [None]:
# Using a list to specify prefixes
from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
from_list

In [None]:
# Using a dictionary to specify prefixes
from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})
from_dict
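The list and dictionary forms should produce the same column names, since the dictionary is looked up per encoded column. The sketch below checks this on an illustrative copy of the frame (`df_demo`) and also shows the `prefix_sep` argument, which changes the separator between prefix and level.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "A": ["a", "b", "a"],
    "B": ["b", "c", "b"],
    "C": [1, 2, 3],
})

from_list = pd.get_dummies(df_demo, prefix=["from_A", "from_B"])
from_dict = pd.get_dummies(df_demo, prefix={"B": "from_B", "A": "from_A"})

# Both spellings yield identical column names
same_columns = from_list.columns.tolist() == from_dict.columns.tolist()
print(same_columns)

# Use a double underscore instead of the default "_" separator
custom_sep = pd.get_dummies(df_demo, prefix=["from_A", "from_B"], prefix_sep="__")
print(custom_sep.columns.tolist())
```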

### 2.3 Dropping First Category

Sometimes it will be useful to only keep k-1 levels of a categorical variable to avoid collinearity when feeding the result to statistical models. You can switch to this mode by turning on `drop_first`.

In [None]:
# Create a sample Series
s = pd.Series(list('abcaa'))

# Default behavior
pd.get_dummies(s)

In [None]:
# Drop first category
pd.get_dummies(s, drop_first=True)

When a column contains only one level, `drop_first=True` removes that single level, so the column is omitted from the result entirely.

In [None]:
# Create a DataFrame with one column having only one level
df = pd.DataFrame({'A': list('aaaaa'), 'B': list('ababc')})

# Default behavior
pd.get_dummies(df)

In [None]:
# Drop first category
pd.get_dummies(df, drop_first=True)
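The collinearity that `drop_first` avoids can be checked directly: with all k levels present, the indicator columns in every row sum to 1, so any one of them is determined by the others. A minimal sketch (variable names are illustrative, and `dtype=int` is used only to make the arithmetic explicit):

```python
import pandas as pd

s = pd.Series(list("abcaa"))

full = pd.get_dummies(s, dtype=int)                      # columns a, b, c
reduced = pd.get_dummies(s, drop_first=True, dtype=int)  # columns b, c

# Every row of the full encoding sums to exactly 1
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# The dropped level 'a' is implied wherever all remaining indicators are 0
recovered_a = reduced.sum(axis=1) == 0
print(recovered_a.tolist())
```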

### 2.4 Specifying Data Type

In pandas 2.0 and later, new columns have `bool` dtype by default (earlier versions used `np.uint8`). To choose another dtype, use the `dtype` argument:

In [None]:
# Create a DataFrame with mixed types
df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})

# Specify boolean dtype for dummy variables
pd.get_dummies(df, dtype=bool).dtypes
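Because the default dtype depends on the pandas version, passing `dtype` explicitly keeps the result stable across environments. The sketch below requests floats instead; `df_mixed` is an illustrative name for the same frame.

```python
import pandas as pd

df_mixed = pd.DataFrame({"A": list("abc"), "B": [1.1, 2.2, 3.3]})

# Request float indicators regardless of the installed pandas version
as_float = pd.get_dummies(df_mixed, dtype=float)
print(as_float.dtypes)

# The default depends on the version, so it is printed here rather than assumed
print(pd.get_dummies(df_mixed)["A_a"].dtype)
```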

## 3. Factorizing Values

To encode 1-d values as an enumerated type, use `factorize()`:

In [None]:
# Create a Series with mixed types and missing values
x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
x

In [None]:
# Factorize the Series
labels, uniques = pd.factorize(x)

print("Labels:")
print(labels)
print("\nUniques:")
print(uniques)

Note that `factorize` is similar to `numpy.unique`, but differs in its handling of missing values: by default, `factorize` assigns NaN a code of -1 and leaves it out of `uniques`.
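The difference can be seen side by side on a small numeric array. This is a sketch with made-up values; it also shows one way to rebuild the original data from the codes, masking the -1 sentinel so it is not mistaken for a valid position.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 1.0, np.nan, 2.0])

# factorize: uniques in order of appearance, NaN coded as -1
codes, uniques = pd.factorize(values)
print(codes, uniques)

# numpy.unique: sorted uniques, NaN kept as a value of its own
np_uniques, inverse = np.unique(values, return_inverse=True)
print(inverse, np_uniques)

# Rebuild the original values; -1 would otherwise index the last unique
restored = np.where(codes == -1, np.nan, uniques.take(codes))
print(restored)
```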

In [None]:
# Factorize with sort=True
labels, uniques = pd.factorize(x, sort=True)

print("Labels with sort=True:")
print(labels)
print("\nUniques with sort=True:")
print(uniques)

### 3.1 Using Categorical Data Type

If you just want to handle one column as a categorical variable (like R's factor), you can use `pd.Categorical` or the `category` dtype:

In [None]:
# Create a DataFrame with a column to be treated as categorical
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'color': ['red', 'blue', 'red', 'green', 'blue']
})

# Method 1: Using pd.Categorical
df['color_cat1'] = pd.Categorical(df['color'])

# Method 2: Using astype('category')
df['color_cat2'] = df['color'].astype('category')

df

In [None]:
# Examine the dtypes
df.dtypes

In [None]:
# Get the categorical codes
print("Categorical codes:")
print(df['color_cat1'].cat.codes)
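When the categories are inferred from string data, they are sorted, so the categorical codes line up with what `factorize(sort=True)` returns. The sketch below checks this on the same colors and then sets an explicit category order; the `custom` ordering is a hypothetical choice made for illustration.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "blue"])

# Inferred categories are sorted: blue, green, red
as_cat = colors.astype("category")
print(as_cat.cat.categories.tolist())
print(as_cat.cat.codes.tolist())

# factorize with sort=True produces the same codes and uniques
sorted_codes, sorted_uniques = pd.factorize(colors, sort=True)
print(sorted_codes.tolist(), list(sorted_uniques))

# An explicit category list changes which code each value receives
custom = pd.Categorical(colors, categories=["red", "green", "blue"], ordered=True)
print(custom.codes.tolist())
```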

### 3.2 Practical Example: Pivot Table with Factorized Data

Let's build a larger example: a DataFrame of random integer codes (like the codes `factorize` produces), summarized in a pivot table:

In [None]:
# Set a random seed for reproducibility
np.random.seed([3, 1415])
n = 20

# Create column names
cols = np.array(['key', 'row', 'item', 'col'])

# Create a DataFrame with n rows of random integer codes (0-4), one column per name
df = pd.DataFrame(np.random.randint(5, size=(n, 4)), columns=cols)
df['value'] = np.random.randn(n)

df

In [None]:
# Create a pivot table that sums 'value' for each (row, item) / col combination
pivot = pd.pivot_table(df, values='value', index=['row', 'item'],
                      columns='col', aggfunc='sum')
pivot
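To tie this back to `factorize`, the sketch below encodes a string column as integer codes, pivots on the codes, and then maps the codes back to their labels. The `sales` frame and its values are invented for illustration only.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south"],
    "product": ["x", "x", "y", "y", "x"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Encode the region names as integer codes (north=0, south=1, east=2)
region_codes, region_labels = pd.factorize(sales["region"])
sales["region_code"] = region_codes

# Pivot on the codes, summing values per (region_code, product)
by_code = pd.pivot_table(sales, values="value", index="region_code",
                         columns="product", aggfunc="sum")

# Map the integer index back to the original labels
by_code.index = region_labels.take(by_code.index.to_numpy())
print(by_code)
```

Combinations that never occur in the data, such as `east` with product `x`, appear as NaN in the pivot table.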

## Summary

In this notebook, we've explored:

1. Advanced features of `merge_asof` for time-series data, including tolerance and direction parameters
2. Using `get_dummies` for one-hot encoding categorical variables with various options like prefix customization and dropping first categories
3. Using `factorize` to encode values as enumerated types
4. Working with the `Categorical` data type in pandas

These techniques are essential for data preprocessing and transformation in data analysis and machine learning workflows.