#### Pandas Tutorial - Part 65: DataFrame Methods (select_dtypes, sem, set_axis)

This notebook covers three important DataFrame methods:
- `select_dtypes()` - Return a subset of the DataFrame's columns based on the column dtypes
- `sem()` - Return the unbiased standard error of the mean over the requested axis
- `set_axis()` - Assign the desired index to a given axis

In [None]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

##### 1. DataFrame.select_dtypes()

The `select_dtypes()` method returns a subset of the DataFrame's columns based on the column dtypes. You can include or exclude specific data types.

In [None]:
# Create a DataFrame with different data types
df = pd.DataFrame({
    'a': [1, 2] * 3,                      # int64
    'b': [True, False] * 3,               # bool
    'c': [1.0, 2.0] * 3,                  # float64
    'd': ['a', 'b'] * 3,                  # object (string)
    'e': pd.Series([1, 2] * 3).astype('category'),  # category
    'f': pd.date_range('20210101', periods=6),      # datetime64[ns]
    'g': pd.Series([pd.Timedelta(days=i) for i in range(6)])  # timedelta64[ns]
})

print("DataFrame with different data types:")
df

In [None]:
# Display the data types of each column
print("Data types of each column:")
df.dtypes

In [None]:
# Select boolean columns
print("Select boolean columns:")
df.select_dtypes(include='bool')

In [None]:
# Select float columns
print("Select float columns:")
df.select_dtypes(include=['float64'])

In [None]:
# Select integer columns
print("Select integer columns:")
df.select_dtypes(include=['int64'])

In [None]:
# Select string (object) columns
print("Select string (object) columns:")
df.select_dtypes(include=['object'])

In [None]:
# Select category columns
print("Select category columns:")
df.select_dtypes(include=['category'])

In [None]:
# Select datetime columns
print("Select datetime columns:")
df.select_dtypes(include=['datetime64'])

In [None]:
# Select timedelta columns
print("Select timedelta columns:")
df.select_dtypes(include=['timedelta64'])

In [None]:
# Select numeric columns (int, float)
print("Select numeric columns:")
df.select_dtypes(include=['number'])

In [None]:
# Exclude integer columns
print("Exclude integer columns:")
df.select_dtypes(exclude=['int64'])

In [None]:
# Include float and exclude object columns
print("Include float and exclude object columns:")
df.select_dtypes(include=['float64'], exclude=['object'])

In [None]:
# Select all columns except numeric ones
print("Select all columns except numeric ones:")
df.select_dtypes(exclude=['number'])
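
A common use of `select_dtypes()` is to collect the names of columns of one kind and then apply an operation to only those columns. The cell below is a minimal sketch of that pattern; the `orders` DataFrame and its column names are made up for illustration and are not part of the data above. It also shows that naming the same dtype in both `include` and `exclude` raises a `ValueError`.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
orders = pd.DataFrame({
    "qty": [2, 5, 1],              # integer
    "price": [9.99, 4.50, 20.00],  # float
    "paid": [True, False, True],   # bool (not matched by 'number')
    "sku": ["a1", "b2", "c3"],     # strings
})

# Collect the names of the numeric columns
numeric_cols = orders.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Apply a numeric-only operation to just those columns
print(orders[numeric_cols].mean())

# include and exclude must not name the same dtype
try:
    orders.select_dtypes(include=["float64"], exclude=["float64"])
except ValueError as err:
    print("ValueError:", err)
```

Keeping the column names in a list makes it easy to reuse the same selection later, for example when scaling or summarizing only the numeric part of a table.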

##### 2. DataFrame.sem()

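The `sem()` method returns the unbiased standard error of the mean over the requested axis: the sample standard deviation divided by the square root of the number of non-missing observations N. The standard deviation is normalized by N - `ddof`, and `ddof` defaults to 1, which gives the usual N-1 denominator.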

In [None]:
# Create a DataFrame with numeric data
df_numeric = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 50, 10, 20, 30]
})

print("DataFrame with numeric data:")
df_numeric

In [None]:
# Calculate the standard error of the mean for each column
print("Standard error of the mean for each column:")
df_numeric.sem()

In [None]:
# Calculate the standard error of the mean for each row
print("Standard error of the mean for each row:")
df_numeric.sem(axis=1)

In [None]:
# Create a DataFrame with NaN values
df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': [100, 50, 10, np.nan, 30]
})

print("DataFrame with NaN values:")
df_with_nan

In [None]:
# Calculate the standard error of the mean with skipna=True (default)
print("Standard error of the mean with skipna=True (default):")
df_with_nan.sem()

In [None]:
# Calculate the standard error of the mean with skipna=False
print("Standard error of the mean with skipna=False:")
df_with_nan.sem(skipna=False)

In [None]:
# Compare the standard error of the mean for different ddof values
print("Standard error of the mean with different ddof values:")
for ddof in [0, 1, 2]:
    print(f"\nddof={ddof}:")
    print(df_numeric.sem(ddof=ddof))
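
To make the role of `ddof` concrete, the cell below recomputes the standard error by hand as the standard deviation divided by the square root of the number of observations, and compares it with `sem()`. This is a small check on a single Series that reuses the values of column `A`; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name="A")
n = s.count()  # number of non-missing observations

for ddof in (0, 1, 2):
    # Standard deviation with N - ddof in the denominator, divided by sqrt(N)
    manual = s.std(ddof=ddof) / np.sqrt(n)
    builtin = s.sem(ddof=ddof)
    print(f"ddof={ddof}: manual={manual:.6f}, sem()={builtin:.6f}")
    assert np.isclose(manual, builtin)
```

Only the standard deviation depends on `ddof`; the division by the square root of N is the same in every case.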

In [None]:
# Create a DataFrame with a MultiIndex
index = pd.MultiIndex.from_tuples([
    ('A', 1), ('A', 2), ('A', 3),
    ('B', 1), ('B', 2), ('B', 3)
], names=['group', 'subgroup'])

df_multi = pd.DataFrame({
    'X': [1, 2, 3, 4, 5, 6],
    'Y': [10, 20, 30, 40, 50, 60]
}, index=index)

print("DataFrame with MultiIndex:")
df_multi

In [None]:
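# Calculate the standard error of the mean within each 'group' level.
# The `level` argument of sem() was removed in pandas 2.0, so group by the index level instead.
print("Standard error of the mean by level 'group':")
df_multi.groupby(level='group').sem()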

##### 3. DataFrame.set_axis()

The `set_axis()` method assigns a new set of labels to a given axis. Pass a list-like or an `Index` whose length matches that axis to replace the row labels (`axis=0`) or the column labels (`axis=1`). The method returns a new DataFrame.

In [None]:
# Create a simple DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

print("Original DataFrame:")
df

In [None]:
# Set new row labels (axis=0)
print("DataFrame with new row labels:")
df.set_axis(['X', 'Y', 'Z'], axis=0)

In [None]:
# Set new column labels (axis=1)
print("DataFrame with new column labels:")
df.set_axis(['P', 'Q', 'R'], axis=1)
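
Two details of `set_axis()` are easy to miss: the original DataFrame is left unchanged, and the number of new labels has to match the length of the axis. The cell below is a short sketch of both points on a fresh copy of the same small DataFrame; the name `renamed` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# set_axis() returns a new DataFrame; df itself keeps its labels
renamed = df.set_axis(["P", "Q", "R"], axis=1)
print("Original columns:", df.columns.tolist())
print("Renamed columns: ", renamed.columns.tolist())

# Passing too few (or too many) labels raises a ValueError
try:
    df.set_axis(["P", "Q"], axis=1)
except ValueError as err:
    print("ValueError:", err)
```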

In [None]:
# set_axis() returns a new DataFrame, so reassign the result to keep the new labels.
# The `inplace` argument was deprecated in pandas 1.5 and removed in pandas 2.0.
print("Before reassignment:")
print(df)

df = df.set_axis(['X', 'Y', 'Z'], axis=0)
print("\nAfter reassigning with new row labels:")
print(df)

df = df.set_axis(['P', 'Q', 'R'], axis=1)
print("\nAfter reassigning with new column labels:")
print(df)

In [None]:
# Create a DataFrame with MultiIndex
df_multi = pd.DataFrame(
    np.random.randn(3, 3),
    index=pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
)

print("DataFrame with MultiIndex:")
df_multi

In [None]:
# Set new MultiIndex
new_index = pd.MultiIndex.from_tuples([('X', 1), ('X', 2), ('Y', 1)])
print("DataFrame with new MultiIndex:")
df_multi.set_axis(new_index, axis=0)

In [None]:
# Reuse the index of another object: s.index is a default RangeIndex (0, 1, 2)
s = pd.Series(['a', 'b', 'c'])
print("Using Series index:")
df.set_axis(s.index, axis=0)

In [None]:
# Compare set_axis with other index setting methods
df_original = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Using set_axis
df1 = df_original.copy()
df1 = df1.set_axis(['X', 'Y', 'Z'], axis=0)

# Using index property
df2 = df_original.copy()
df2.index = ['X', 'Y', 'Z']

# Using rename
df3 = df_original.copy()
df3 = df3.rename(index={0: 'X', 1: 'Y', 2: 'Z'})

print("Using set_axis:")
print(df1)
print("\nUsing index property:")
print(df2)
print("\nUsing rename:")
print(df3)

print("\nAre df1 and df2 equal?", df1.equals(df2))
print("Are df1 and df3 equal?", df1.equals(df3))
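
One practical difference between the three approaches is that `set_axis()` returns its result, so it can sit inside a method chain, whereas assigning to `.index` or `.columns` is a separate statement. The cell below is a minimal sketch of such a chain; the column name `total` and the new labels are chosen only for illustration.

```python
import pandas as pd

result = (
    pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    .assign(total=lambda d: d["A"] + d["B"])   # add a derived column
    .set_axis(["a", "b", "total"], axis=1)     # relabel all columns at once
    .set_axis(["x", "y", "z"], axis=0)         # relabel the rows
)
print(result)
```

`rename()` also works in a chain, but it maps individual old labels to new ones, while `set_axis()` replaces every label on the axis in one step.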

##### Summary

In this notebook, we've explored three important DataFrame methods:

1. **select_dtypes()**: Returns a subset of the DataFrame's columns based on the column dtypes. This method is useful for filtering columns by their data types. You can include or exclude specific data types using the `include` and `exclude` parameters. Common data types include 'number', 'bool', 'object', 'category', 'datetime64', and 'timedelta64'.

2. **sem()**: Returns the unbiased standard error of the mean over the requested axis, that is, the standard deviation divided by the square root of the number of non-missing observations. The standard deviation uses an N-1 denominator by default, which can be changed with the `ddof` parameter. This method is useful when you want to estimate how precisely a sample mean is known.

3. **set_axis()**: Assigns a new set of labels to a given axis from a list-like or `Index` and returns a new DataFrame. It is an alternative to assigning directly to the `index` or `columns` attributes, or to using the `rename` method.

These methods are essential for data manipulation, statistical analysis, and index management in pandas.