# Pandas General Functions - Part 40

This notebook covers important general functions in pandas, including `get_dummies()`, `factorize()`, `period_range()`, `timedelta_range()`, `infer_freq()`, and `interval_range()`.

In [None]:
import pandas as pd
import numpy as np

## pd.get_dummies()

The `get_dummies()` function converts categorical variables into dummy/indicator variables (also known as one-hot encoding). This is particularly useful for preparing data for machine learning models.

In [None]:
# Basic example with a Series
s = pd.Series(list('abca'))
pd.get_dummies(s)

In [None]:
# Handling NaN values
s1 = ['a', 'b', np.nan]
print("Without dummy_na:")
print(pd.get_dummies(s1))

print("\nWith dummy_na=True:")
print(pd.get_dummies(s1, dummy_na=True))

In [None]:
# Using with a DataFrame
df = pd.DataFrame({
    'A': ['a', 'b', 'a'], 
    'B': ['b', 'a', 'c'],
    'C': [1, 2, 3]
})

# Using prefix to distinguish columns from different features
pd.get_dummies(df, prefix=['col1', 'col2'])

In [None]:
# Using drop_first to avoid the dummy variable trap
print("Original dummies:")
print(pd.get_dummies(pd.Series(list('abcaa'))))

print("\nWith drop_first=True:")
print(pd.get_dummies(pd.Series(list('abcaa')), drop_first=True))
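
The "dummy variable trap" mentioned above comes from the fact that a full set of dummy columns always sums to 1 in every row, so the columns are linearly dependent once a model includes an intercept. The following is a minimal sketch of that property; the variable names `full` and `reduced` are chosen here for illustration, and `dtype=int` is passed only so the sums are plain integers.

```python
import pandas as pd

s = pd.Series(list("abcaa"))

# Full one-hot encoding: one column per category
full = pd.get_dummies(s, dtype=int)

# drop_first=True removes the first category ('a'), leaving k - 1 columns
reduced = pd.get_dummies(s, drop_first=True, dtype=int)

# Every row of the full encoding contains exactly one 1
print(full.sum(axis=1).tolist())

# The dropped category is represented by a row of all zeros
print(list(reduced.columns))
print(reduced[s == "a"])
```

With `drop_first=True`, the dropped category acts as the baseline against which the remaining columns are interpreted.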

In [None]:
# Specifying dtype (the default is bool in pandas 2.0+, uint8 in earlier versions)
pd.get_dummies(pd.Series(list('abc')), dtype=float)

## pd.factorize()

The `factorize()` function encodes an array-like as an enumerated type: it returns an array of integer codes together with the unique values those codes refer to. It's useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

In [None]:
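# Basic example (an ndarray is used because passing a plain list is deprecated in recent pandas)
codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype=object))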
print("Codes:")
print(codes)
print("\nUniques:")
print(uniques)

In [None]:
# With sort=True (uniques are sorted and the codes are remapped to match)
codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype=object), sort=True)
print("Codes (sorted):")
print(codes)
print("\nUniques (sorted):")
print(uniques)

In [None]:
# With Series
s = pd.Series(['b', 'b', 'a', 'c', 'b'])
codes, uniques = s.factorize()
print("Codes:")
print(codes)
print("\nUniques:")
print(uniques)
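
The codes returned by `factorize()` are positions into the uniques, so indexing the uniques with the codes recovers the original values. Missing values are not added to the uniques; they receive the sentinel code `-1`. The sketch below illustrates both points, with variable names (`values`, `restored`, `with_missing`) chosen here for illustration.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "b", "a", "c", "b"], dtype=object)
codes, uniques = pd.factorize(values)

# Each code is an index into uniques, so this rebuilds the input
restored = uniques[codes]
print(restored)

# Missing values get the code -1 and are left out of uniques
with_missing = np.array(["b", None, "a", "c", "b"], dtype=object)
codes_m, uniques_m = pd.factorize(with_missing)
print(codes_m)
print(uniques_m)
```

Because of the `-1` sentinel, the `uniques[codes]` round trip is only safe when the input has no missing values.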

## pd.period_range()

The `period_range()` function returns a fixed frequency `PeriodIndex`, which is useful for representing time spans like quarters, months, etc.

In [None]:
# Basic example
pd.period_range(start='2020-01-01', end='2020-12-31', freq='M')

In [None]:
# Using periods instead of end
pd.period_range(start='2020-01-01', periods=12, freq='M')

In [None]:
# Using Period objects as endpoints
pd.period_range(
    start=pd.Period('2020Q1', freq='Q'),
    end=pd.Period('2020Q4', freq='Q'), 
    freq='M'
)
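
Each element of a `PeriodIndex` represents a whole span of time rather than a single instant. As a small sketch of what that means in practice, the example below inspects the start and end of individual monthly periods and converts the index to timestamps; the variable name `months` is an assumption made for this example.

```python
import pandas as pd

months = pd.period_range(start="2020-01", periods=3, freq="M")

# A Period covers a span, exposed through start_time and end_time
print(months[0].start_time)
print(months[1].end_time)

# to_timestamp() maps each period to a single point (its start by default)
print(months.to_timestamp())
```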

## pd.timedelta_range()

The `timedelta_range()` function returns a fixed frequency `TimedeltaIndex`, with day as the default frequency. This is useful for representing differences in time.

In [None]:
# Basic example
pd.timedelta_range(start='1 day', periods=4)

In [None]:
# Using closed parameter
pd.timedelta_range(start='1 day', periods=4, closed='right')

In [None]:
# Using a different frequency (lowercase 'h'; the uppercase 'H' alias is deprecated in pandas 2.2+)
pd.timedelta_range(start='1 day', end='2 days', freq='6h')

In [None]:
# Specifying start, end, and periods (linearly spaced)
pd.timedelta_range(start='1 day', end='5 days', periods=4)
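
Since a `TimedeltaIndex` holds differences in time, a common use is to add it to a reference timestamp to obtain concrete datetimes. The sketch below shows this, along with the step size that results from linear spacing; the names `offsets`, `stamps`, and `spaced` are illustrative.

```python
import pandas as pd

# Three offsets, 12 hours apart, starting at zero
offsets = pd.timedelta_range(start="0 days", periods=3, freq="12h")

# Adding the offsets to a Timestamp yields a DatetimeIndex
stamps = pd.Timestamp("2020-01-01") + offsets
print(stamps)

# With start, end, and periods given, the values are evenly spaced:
# (5 days - 1 day) / 3 intervals = 32 hours per step
spaced = pd.timedelta_range(start="1 day", end="5 days", periods=4)
print(spaced[1] - spaced[0])
```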

## pd.infer_freq()

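The `infer_freq()` function infers the most likely frequency of a `DatetimeIndex` or `TimedeltaIndex` and returns it as a frequency string, or `None` if no regular frequency can be determined. This is useful when you have a datetime index but don't know its frequency.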

In [None]:
# Create a DatetimeIndex with daily frequency
dates = pd.date_range('2020-01-01', periods=5, freq='D')
print("Original index:")
print(dates)

# Infer the frequency
freq = pd.infer_freq(dates)
print(f"\nInferred frequency: {freq}")

In [None]:
# Create a DatetimeIndex with business day frequency
dates = pd.date_range('2020-01-01', periods=5, freq='B')
print("Original index:")
print(dates)

# Infer the frequency
freq = pd.infer_freq(dates)
print(f"\nInferred frequency: {freq}")

In [None]:
# Create a TimedeltaIndex
deltas = pd.timedelta_range(start='1 day', periods=5, freq='2D')
print("Original index:")
print(deltas)

# Infer the frequency
freq = pd.infer_freq(deltas)
print(f"\nInferred frequency: {freq}")
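
The examples above all use regularly spaced indexes. The sketch below covers two edge cases: an irregular index, for which `infer_freq()` returns `None`, and an index with fewer than three values, for which it raises a `ValueError`. The variable names are chosen for illustration.

```python
import pandas as pd

# Regular daily spacing is detected
regular = pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-03"])
print(pd.infer_freq(regular))

# Irregular spacing: no frequency can be inferred, so the result is None
irregular = pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-07"])
print(pd.infer_freq(irregular))

# At least three values are required
too_short = pd.DatetimeIndex(["2020-01-01", "2020-01-02"])
try:
    pd.infer_freq(too_short)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```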

## pd.interval_range()

The `interval_range()` function returns a fixed frequency `IntervalIndex`. This is useful for binning data into intervals.

In [None]:
# Basic example
pd.interval_range(start=0, end=5)

In [None]:
# Using periods instead of end
pd.interval_range(start=0, periods=5)

In [None]:
# Using a different frequency
pd.interval_range(start=0, end=10, freq=2)

In [None]:
# Using closed parameter
pd.interval_range(start=0, end=5, closed='left')
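
To illustrate the binning use case mentioned above, an `IntervalIndex` can be passed as the `bins` argument of `pd.cut()`. This is a minimal sketch with made-up values; the names `bins`, `values`, and `binned` are assumptions for the example.

```python
import pandas as pd

# Two right-closed bins: (0, 5] and (5, 10]
bins = pd.interval_range(start=0, end=10, freq=5)

values = pd.Series([0, 1, 5, 6, 10])
binned = pd.cut(values, bins=bins)
print(binned)

# Values outside every interval (here 0, since the bins are open on the left) become NaN
print(binned.isna().tolist())
```

Passing `closed='left'` to `interval_range()` would instead include the left edge of each bin and exclude the right edge.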

## Practical Examples

Let's explore some practical examples of these functions.

### Example 1: One-hot encoding for machine learning

In [None]:
# Create a sample dataset
data = {
    'id': range(1, 6),
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'S'],
    'price': [10.5, 15.0, 20.0, 25.5, 12.0]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

In [None]:
# One-hot encode the categorical columns
df_encoded = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)
print("Encoded DataFrame:")
print(df_encoded)
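
One practical caveat when encoding for machine learning is that `get_dummies()` only creates columns for categories present in the data it is given, so a second dataset may end up with a different set of columns. The sketch below shows one possible way to handle this, by reindexing the new data to the training columns. The frames `train` and `new` are hypothetical and not part of the dataset above.

```python
import pandas as pd

# Hypothetical training data and new data; 'green' is absent from the new data
train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["blue", "red"]})

train_enc = pd.get_dummies(train, columns=["color"], dtype=int)
new_enc = pd.get_dummies(new, columns=["color"], dtype=int)
print(list(new_enc.columns))  # no 'color_green' column

# Align to the training columns; categories not seen in `new` become all-zero columns
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_aligned)
```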

### Example 2: Creating time-based features for time series analysis

In [None]:
# Create a sample time series dataset
# Note: 'M' means month-end here; pandas 2.2+ deprecates it in favor of 'ME' for date_range
dates = pd.date_range('2020-01-01', periods=12, freq='M')
values = np.random.randn(12).cumsum()
ts = pd.Series(values, index=dates)
print("Time Series:")
print(ts)

In [None]:
# Convert to DataFrame and add time-based features
df_ts = ts.reset_index()
df_ts.columns = ['date', 'value']

# Extract time components
df_ts['year'] = df_ts['date'].dt.year
df_ts['month'] = df_ts['date'].dt.month
df_ts['quarter'] = df_ts['date'].dt.quarter

# One-hot encode month
month_dummies = pd.get_dummies(df_ts['month'], prefix='month')
df_ts = pd.concat([df_ts, month_dummies], axis=1)

print("Time Series with Features:")
print(df_ts.head())

### Example 3: Using factorize for label encoding

In [None]:
# Create a sample dataset with categorical variables
data = {
    'id': range(1, 8),
    'category': ['A', 'B', 'C', 'A', 'B', 'A', 'D'],
    'subcategory': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

In [None]:
# Label encode the categorical columns
df['category_code'], category_uniques = pd.factorize(df['category'])
df['subcategory_code'], subcategory_uniques = pd.factorize(df['subcategory'])

print("\nEncoded DataFrame:")
print(df)

print("\nCategory Mapping:")
for i, cat in enumerate(category_uniques):
    print(f"{i}: {cat}")

print("\nSubcategory Mapping:")
for i, subcat in enumerate(subcategory_uniques):
    print(f"{i}: {subcat}")

## Summary

In this notebook, we've explored several important general functions in pandas:

1. **pd.get_dummies()**: Converts categorical variables into dummy/indicator variables (one-hot encoding).
2. **pd.factorize()**: Encodes objects as enumerated types, useful for label encoding.
3. **pd.period_range()**: Creates a fixed frequency PeriodIndex for representing time spans.
4. **pd.timedelta_range()**: Creates a fixed frequency TimedeltaIndex for representing time differences.
5. **pd.infer_freq()**: Infers the frequency of a DatetimeIndex or TimedeltaIndex.
6. **pd.interval_range()**: Creates a fixed frequency IntervalIndex for representing intervals.

These functions are essential tools for data preprocessing, time series analysis, and feature engineering in pandas.