#### Pandas General Functions - Part 40

This notebook covers important general functions in pandas: `get_dummies()`, `factorize()`, `period_range()`, `timedelta_range()`, `infer_freq()`, and `interval_range()`.

In [1]:
import pandas as pd
import numpy as np

##### pd.get_dummies()

The `get_dummies()` function converts categorical variables into dummy/indicator variables (also known as one-hot encoding). This is particularly useful for preparing data for machine learning models.

In [2]:
# Basic example with a Series
s = pd.Series(list('abca'))
pd.get_dummies(s)

       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False


In [3]:
# Handling NaN values
s1 = ['a', 'b', np.nan]
print("Without dummy_na:")
print(pd.get_dummies(s1))

print("\nWith dummy_na=True:")
print(pd.get_dummies(s1, dummy_na=True))

Without dummy_na:
       a      b
0   True  False
1  False   True
2  False  False

With dummy_na=True:
       a      b    NaN
0   True  False  False
1  False   True  False
2  False  False   True


In [4]:
# Using with a DataFrame
df = pd.DataFrame({
    'A': ['a', 'b', 'a'], 
    'B': ['b', 'a', 'c'],
    'C': [1, 2, 3]
})

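# Using prefix to distinguish columns from different features.
# The list needs one prefix per encoded column (here 'A' and 'B'); the numeric column 'C' is left as-is.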
pd.get_dummies(df, prefix=['col1', 'col2'])

   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1    True   False   False    True   False
1  2   False    True    True   False   False
2  3    True   False   False   False    True


In [5]:
# Using drop_first to avoid the dummy variable trap
print("Original dummies:")
print(pd.get_dummies(pd.Series(list('abcaa'))))

print("\nWith drop_first=True:")
print(pd.get_dummies(pd.Series(list('abcaa')), drop_first=True))

Original dummies:
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
4   True  False  False

With drop_first=True:
       b      c
0  False  False
1   True  False
2  False   True
3  False  False
4  False  False
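
The "dummy variable trap" mentioned above is perfect collinearity: in a full one-hot encoding every row sums to 1, so any one column is redundant once a model has an intercept. The short sketch below checks this on the same series; the variable names (`full`, `reduced`, `implied_a`) are illustrative only.

```python
import pandas as pd

s = pd.Series(list("abcaa"))

# Integer dtype makes the arithmetic below easier to read
full = pd.get_dummies(s, dtype=int)
reduced = pd.get_dummies(s, drop_first=True, dtype=int)

# Each row of the full encoding sums to exactly 1, so any single column
# can be reconstructed from the remaining ones.
row_sums = full.sum(axis=1)
print(row_sums.tolist())   # [1, 1, 1, 1, 1]

# With drop_first=True, the dropped category 'a' is implied by
# all remaining indicators being 0.
implied_a = (reduced.sum(axis=1) == 0).astype(int)
print(implied_a.tolist())  # [1, 0, 0, 1, 1]
```

The reduced encoding carries the same information with one fewer column, which is why `drop_first=True` is commonly used for linear models.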


In [6]:
# Specifying dtype
pd.get_dummies(pd.Series(list('abc')), dtype=float)

     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0


##### pd.factorize()

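The `factorize()` function encodes an array-like object as an enumerated type. It returns a tuple `(codes, uniques)`: `codes` is an integer array giving the position of each value in `uniques`, and `uniques` holds the distinct values in order of first appearance. It's useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.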

In [7]:
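# Basic example (an ndarray is passed because recent pandas versions deprecate plain lists here)
codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype=object))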
print("Codes:")
print(codes)
print("\nUniques:")
print(uniques)

Codes:
[0 0 1 2 0]

Uniques:
['b' 'a' 'c']



In [8]:
# With sort=True, uniques are sorted and codes refer to the sorted positions
codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype=object), sort=True)
print("Codes (sorted):")
print(codes)
print("\nUniques (sorted):")
print(uniques)

Codes (sorted):
[1 1 0 2 1]

Uniques (sorted):
['a' 'b' 'c']



In [9]:
# With Series
s = pd.Series(['b', 'b', 'a', 'c', 'b'])
codes, uniques = s.factorize()
print("Codes:")
print(codes)
print("\nUniques:")
print(uniques)

Codes:
[0 0 1 2 0]

Uniques:
Index(['b', 'a', 'c'], dtype='object')
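
One detail the examples above do not show is how `factorize()` treats missing values. By default they are not counted as a category and receive the sentinel code `-1`. The sketch below illustrates this, and also the `use_na_sentinel=False` option (available in pandas 1.5 and later, as far as I know), which keeps missing values as their own entry in `uniques`. The sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series(["b", np.nan, "a", "b", np.nan])

# Default behaviour: missing values get code -1 and are absent from uniques
codes, uniques = pd.factorize(values)
print(codes)          # [ 0 -1  1  0 -1]
print(list(uniques))  # ['b', 'a']

# Assumption: pandas >= 1.5. Missing values become a regular category,
# so no code is negative and uniques gains one extra entry.
codes_na, uniques_na = pd.factorize(values, use_na_sentinel=False)
print(codes_na)
print(len(uniques_na))
```

Checking for `-1` before using the codes as positional indices avoids silently mapping missing values to the last element of `uniques`.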


##### pd.period_range()

The `period_range()` function returns a fixed frequency `PeriodIndex`, which is useful for representing time spans like quarters, months, etc.

In [10]:
# Basic example
pd.period_range(start='2020-01-01', end='2020-12-31', freq='M')

PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12'],
            dtype='period[M]')

In [11]:
# Using periods instead of end
pd.period_range(start='2020-01-01', periods=12, freq='M')

PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12'],
            dtype='period[M]')

In [12]:
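# Using Period objects as endpoints.
# The quarterly endpoints are converted to monthly frequency using the last month
# of each quarter, so the range runs from 2020-03 (end of Q1) to 2020-12 (end of Q4).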
pd.period_range(
    start=pd.Period('2020Q1', freq='Q'),
    end=pd.Period('2020Q4', freq='Q'), 
    freq='M'
)

PeriodIndex(['2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08',
             '2020-09', '2020-10', '2020-11', '2020-12'],
            dtype='period[M]')
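
Because each `Period` represents a span rather than an instant, it is often necessary to recover the span's boundaries or convert to timestamps. The following sketch shows one way to do that with `start_time`, `end_time`, and `to_timestamp()`; the `quarters` and `stamps` names are illustrative.

```python
import pandas as pd

quarters = pd.period_range(start="2020Q1", periods=4, freq="Q")

# Each Period covers a span; start_time and end_time expose its boundaries
print(quarters.start_time[0])  # 2020-01-01 00:00:00
print(quarters.end_time[0])    # 2020-03-31 23:59:59.999999999

# to_timestamp() returns a DatetimeIndex anchored at the start of each period by default
stamps = quarters.to_timestamp()
print(stamps)
```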

##### pd.timedelta_range()

The `timedelta_range()` function returns a fixed frequency `TimedeltaIndex`, with day as the default frequency. This is useful for representing differences in time.

In [13]:
# Basic example
pd.timedelta_range(start='1 day', periods=4)

TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')

In [14]:
# Using closed parameter: closed='right' excludes the left endpoint ('1 days')
pd.timedelta_range(start='1 day', periods=4, closed='right')

TimedeltaIndex(['2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')

In [15]:
# Using a different frequency
pd.timedelta_range(start='1 day', end='2 days', freq='6h')  # lowercase 'h'; the uppercase 'H' alias is deprecated


TimedeltaIndex(['1 days 00:00:00', '1 days 06:00:00', '1 days 12:00:00',
                '1 days 18:00:00', '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq='6h')

In [16]:
# Specifying start, end, and periods (linearly spaced)
pd.timedelta_range(start='1 day', end='5 days', periods=4)

TimedeltaIndex(['1 days 00:00:00', '2 days 08:00:00', '3 days 16:00:00',
                '5 days 00:00:00'],
               dtype='timedelta64[ns]', freq=None)
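
A common use of a `TimedeltaIndex` is as a set of offsets from a reference time. As a small sketch (the `base` timestamp and variable names are chosen only for illustration), adding the index to a `Timestamp` yields a `DatetimeIndex`:

```python
import pandas as pd

# Offsets of 0, 6, 12 and 18 hours
offsets = pd.timedelta_range(start="0 hours", periods=4, freq="6h")

# Adding a TimedeltaIndex to a Timestamp produces a DatetimeIndex
base = pd.Timestamp("2020-01-01")
schedule = base + offsets
print(schedule)
```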

##### pd.infer_freq()

The `infer_freq()` function infers the most likely frequency given the input index. This is useful when you have a datetime index but don't know its frequency.

In [17]:
# Create a DatetimeIndex with daily frequency
dates = pd.date_range('2020-01-01', periods=5, freq='D')
print("Original index:")
print(dates)

# Infer the frequency
freq = pd.infer_freq(dates)
print(f"\nInferred frequency: {freq}")

Original index:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05'],
              dtype='datetime64[ns]', freq='D')

Inferred frequency: D


In [18]:
# Create a DatetimeIndex with business day frequency
dates = pd.date_range('2020-01-01', periods=5, freq='B')
print("Original index:")
print(dates)

# Infer the frequency
freq = pd.infer_freq(dates)
print(f"\nInferred frequency: {freq}")

Original index:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-06',
               '2020-01-07'],
              dtype='datetime64[ns]', freq='B')

Inferred frequency: B


In [19]:
# Create a TimedeltaIndex
deltas = pd.timedelta_range(start='1 day', periods=5, freq='2D')
print("Original index:")
print(deltas)

# Infer the frequency
freq = pd.infer_freq(deltas)
print(f"\nInferred frequency: {freq}")

Original index:
TimedeltaIndex(['1 days', '3 days', '5 days', '7 days', '9 days'], dtype='timedelta64[ns]', freq='2D')

Inferred frequency: 2D
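
The examples above all use indexes that were built with a known frequency. On real data, two edge cases are worth knowing: `infer_freq()` returns `None` when the spacing is irregular, and it raises `ValueError` when given fewer than three timestamps. The sketch below demonstrates both with made-up dates.

```python
import pandas as pd

# Irregular spacing: no single frequency fits, so the result is None
irregular = pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-09"])
irregular_freq = pd.infer_freq(irregular)
print(irregular_freq)  # None

# Regular spacing without an explicit freq attribute is still detected
every_other = pd.infer_freq(pd.DatetimeIndex(["2020-01-01", "2020-01-03", "2020-01-05"]))
print(every_other)     # 2D

# At least three timestamps are required
too_short_raised = False
try:
    pd.infer_freq(pd.DatetimeIndex(["2020-01-01", "2020-01-02"]))
except ValueError as exc:
    too_short_raised = True
    print(f"ValueError: {exc}")
```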


##### pd.interval_range()

The `interval_range()` function returns a fixed frequency `IntervalIndex`. This is useful for binning data into intervals.

In [20]:
# Basic example
pd.interval_range(start=0, end=5)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [21]:
# Using periods instead of end
pd.interval_range(start=0, periods=5)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [22]:
# Using a different frequency
pd.interval_range(start=0, end=10, freq=2)

IntervalIndex([(0, 2], (2, 4], (4, 6], (6, 8], (8, 10]], dtype='interval[int64, right]')

In [23]:
# Using closed parameter
pd.interval_range(start=0, end=5, closed='left')

IntervalIndex([[0, 1), [1, 2), [2, 3), [3, 4), [4, 5)], dtype='interval[int64, left]')
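
To make the binning use case concrete, an `IntervalIndex` can be passed directly as the `bins` argument of `pd.cut()`. This is a minimal sketch with invented values; note that values falling outside every interval become `NaN`.

```python
import pandas as pd

# Two right-closed bins: (0, 5] and (5, 10]
bins = pd.interval_range(start=0, end=10, freq=5)

values = pd.Series([1, 5, 6, 10, 12])
binned = pd.cut(values, bins=bins)
print(binned)

# 12 lies outside all intervals, so it is not assigned to any bin
print(binned.isna().tolist())  # [False, False, False, False, True]
```

Because the intervals are closed on the right by default, a value of exactly 0 would also fall outside the first bin; passing `closed='left'` to `interval_range()` changes which endpoint is included.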

##### Practical Examples

Let's explore some practical examples of these functions.

### Example 1: One-hot encoding for machine learning

In [24]:
# Create a sample dataset
data = {
    'id': range(1, 6),
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'S'],
    'price': [10.5, 15.0, 20.0, 25.5, 12.0]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
   id  color size  price
0   1    red    S   10.5
1   2   blue    M   15.0
2   3  green    L   20.0
3   4    red   XL   25.5
4   5   blue    S   12.0


In [25]:
# One-hot encode the categorical columns
df_encoded = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)
print("Encoded DataFrame:")
print(df_encoded)

Encoded DataFrame:
   id  price  color_green  color_red  size_M  size_S  size_XL
0   1   10.5        False       True   False    True    False
1   2   15.0        False      False    True   False    False
2   3   20.0         True      False   False   False    False
3   4   25.5        False       True   False   False     True
4   5   12.0        False      False   False    True    False
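
When the encoded frame feeds a model, new data may contain categories that were not seen during training, or lack some that were. One possible approach, sketched below with hypothetical `train` and `new` frames, is to reindex the new dummies to the training columns so both frames share the same layout.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["blue", "purple"]})  # 'purple' was never seen in training

train_enc = pd.get_dummies(train, columns=["color"])
new_enc = pd.get_dummies(new, columns=["color"])

# Align to the training layout: categories missing from the new data are
# filled with False, and unseen categories such as 'purple' are dropped.
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=False)
print(new_aligned)
```

Dropping unseen categories is a design choice, not the only option; depending on the model it may be preferable to raise an error or map them to an explicit "other" column.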


### Example 2: Creating time-based features for time series analysis

In [26]:
# Create a sample time series dataset
# 'ME' (month end) replaces the deprecated 'M' alias in pandas 2.2+
dates = pd.date_range('2020-01-01', periods=12, freq='ME')
values = np.random.randn(12).cumsum()
ts = pd.Series(values, index=dates)
print("Time Series:")
print(ts)

Time Series:
2020-01-31   -0.191536
2020-02-29   -0.193383
2020-03-31    1.125679
2020-04-30    1.011478
2020-05-31    0.239731
2020-06-30    0.426610
2020-07-31   -0.391490
2020-08-31   -0.254242
2020-09-30   -1.548778
2020-10-31   -0.838395
2020-11-30   -1.196775
2020-12-31   -1.267135
Freq: ME, dtype: float64



In [27]:
# Convert to DataFrame and add time-based features
df_ts = ts.reset_index()
df_ts.columns = ['date', 'value']

# Extract time components
df_ts['year'] = df_ts['date'].dt.year
df_ts['month'] = df_ts['date'].dt.month
df_ts['quarter'] = df_ts['date'].dt.quarter

# One-hot encode month
month_dummies = pd.get_dummies(df_ts['month'], prefix='month')
df_ts = pd.concat([df_ts, month_dummies], axis=1)

print("Time Series with Features:")
print(df_ts.head())

Time Series with Features:
        date     value  year  month  quarter  month_1  month_2  month_3  \
0 2020-01-31 -0.191536  2020      1        1     True    False    False   
1 2020-02-29 -0.193383  2020      2        1    False     True    False   
2 2020-03-31  1.125679  2020      3        1    False    False     True   
3 2020-04-30  1.011478  2020      4        2    False    False    False   
4 2020-05-31  0.239731  2020      5        2    False    False    False   

   month_4  month_5  month_6  month_7  month_8  month_9  month_10  month_11  \
0    False    False    False    False    False    False     False     False   
1    False    False    False    False    False    False     False     False   
2    False    False    False    False    False    False     False     False   
3     True    False    False    False    False    False     False     False   
4    False     True    False    False    False    False     False     False   

   month_12  
0     False  
1     False  
2     False  
3     False  
4     False  

### Example 3: Using factorize for label encoding

In [28]:
# Create a sample dataset with categorical variables
data = {
    'id': range(1, 8),
    'category': ['A', 'B', 'C', 'A', 'B', 'A', 'D'],
    'subcategory': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
   id category subcategory
0   1        A           X
1   2        B           Y
2   3        C           Z
3   4        A           X
4   5        B           Y
5   6        A           Z
6   7        D           X


In [29]:
# Label encode the categorical columns
df['category_code'], category_uniques = pd.factorize(df['category'])
df['subcategory_code'], subcategory_uniques = pd.factorize(df['subcategory'])

print("\nEncoded DataFrame:")
print(df)

print("\nCategory Mapping:")
for i, cat in enumerate(category_uniques):
    print(f"{i}: {cat}")

print("\nSubcategory Mapping:")
for i, subcat in enumerate(subcategory_uniques):
    print(f"{i}: {subcat}")


Encoded DataFrame:
   id category subcategory  category_code  subcategory_code
0   1        A           X              0                 0
1   2        B           Y              1                 1
2   3        C           Z              2                 2
3   4        A           X              0                 0
4   5        B           Y              1                 1
5   6        A           Z              0                 2
6   7        D           X              3                 0

Category Mapping:
0: A
1: B
2: C
3: D

Subcategory Mapping:
0: X
1: Y
2: Z
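
The `uniques` returned by `factorize()` also make the encoding reversible. As a brief sketch using the same category values as above (variable names are illustrative), the original labels can be recovered with `uniques.take(codes)`, and the code-to-label mapping can be stored as a dictionary:

```python
import pandas as pd

categories = pd.Series(["A", "B", "C", "A", "B", "A", "D"])
codes, uniques = pd.factorize(categories)

# Decode: look up each integer code in uniques
decoded = uniques.take(codes)
print(list(decoded))  # ['A', 'B', 'C', 'A', 'B', 'A', 'D']

# A plain dict of code -> label, handy for saving alongside a model
mapping = dict(enumerate(uniques))
print(mapping)        # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
```

This round trip assumes there are no missing values; codes of `-1` would need to be handled separately before calling `take()`.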


##### Summary

In this notebook, we've explored several important general functions in pandas:

1. **pd.get_dummies()**: Converts categorical variables into dummy/indicator variables (one-hot encoding).
2. **pd.factorize()**: Encodes objects as enumerated types, useful for label encoding.
3. **pd.period_range()**: Creates a fixed frequency PeriodIndex for representing time spans.
4. **pd.timedelta_range()**: Creates a fixed frequency TimedeltaIndex for representing time differences.
5. **pd.infer_freq()**: Infers the frequency of a DatetimeIndex or TimedeltaIndex.
6. **pd.interval_range()**: Creates a fixed frequency IntervalIndex for representing intervals.

These functions are essential tools for data preprocessing, time series analysis, and feature engineering in pandas.