#### Pandas Tutorial - Part 65: DataFrame Methods (select_dtypes, sem, set_axis)

This notebook covers three DataFrame methods:
- `select_dtypes()` - Return a subset of the DataFrame's columns based on the column dtypes
- `sem()` - Return the unbiased standard error of the mean over the requested axis
- `set_axis()` - Assign new labels to the given axis (row index or columns)

In [1]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

##### 1. DataFrame.select_dtypes()

The `select_dtypes()` method returns a subset of the DataFrame's columns based on the column dtypes. Pass the dtypes to keep via `include`, the dtypes to drop via `exclude`, or both; at least one of the two must be given.

In [2]:
# Create a DataFrame with different data types
df = pd.DataFrame({
    'a': [1, 2] * 3,                      # int64
    'b': [True, False] * 3,               # bool
    'c': [1.0, 2.0] * 3,                  # float64
    'd': ['a', 'b'] * 3,                  # object (string)
    'e': pd.Series([1, 2] * 3).astype('category'),  # category
    'f': pd.date_range('20210101', periods=6),      # datetime64[ns]
    'g': pd.Series([pd.Timedelta(days=i) for i in range(6)])  # timedelta64[ns]
})

print("DataFrame with different data types:")
df

DataFrame with different data types:


,a,b,c,d,e,f,g
0,1,True,1.0,a,1,2021-01-01,0 days
1,2,False,2.0,b,2,2021-01-02,1 days
2,1,True,1.0,a,1,2021-01-03,2 days
3,2,False,2.0,b,2,2021-01-04,3 days
4,1,True,1.0,a,1,2021-01-05,4 days
5,2,False,2.0,b,2,2021-01-06,5 days


In [3]:
# Display the data types of each column
print("Data types of each column:")
df.dtypes

Data types of each column:


a              int64
b               bool
c            float64
d             object
e           category
f     datetime64[ns]
g    timedelta64[ns]
dtype: object

In [4]:
# Select boolean columns
print("Select boolean columns:")
df.select_dtypes(include='bool')

Select boolean columns:


,b
0,True
1,False
2,True
3,False
4,True
5,False


In [5]:
# Select float columns
print("Select float columns:")
df.select_dtypes(include=['float64'])

Select float columns:


,c
0,1.0
1,2.0
2,1.0
3,2.0
4,1.0
5,2.0


In [6]:
# Select integer columns
print("Select integer columns:")
df.select_dtypes(include=['int64'])

Select integer columns:


,a
0,1
1,2
2,1
3,2
4,1
5,2


In [7]:
# Select string (object) columns
print("Select string (object) columns:")
df.select_dtypes(include=['object'])

Select string (object) columns:


,d
0,a
1,b
2,a
3,b
4,a
5,b


In [8]:
# Select category columns
print("Select category columns:")
df.select_dtypes(include=['category'])

Select category columns:


,e
0,1
1,2
2,1
3,2
4,1
5,2


In [9]:
# Select datetime columns
print("Select datetime columns:")
df.select_dtypes(include=['datetime64'])

Select datetime columns:


,f
0,2021-01-01
1,2021-01-02
2,2021-01-03
3,2021-01-04
4,2021-01-05
5,2021-01-06


In [10]:
# Select timedelta columns
print("Select timedelta columns:")
df.select_dtypes(include=['timedelta64'])

Select timedelta columns:


,g
0,0 days
1,1 days
2,2 days
3,3 days
4,4 days
5,5 days


In [11]:
# Select numeric columns (int and float); in this run 'number' also matched the timedelta column g
print("Select numeric columns:")
df.select_dtypes(include=['number'])

Select numeric columns:


,a,c,g
0,1,1.0,0 days
1,2,2.0,1 days
2,1,1.0,2 days
3,2,2.0,3 days
4,1,1.0,4 days
5,2,2.0,5 days
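
The output above shows that `'number'` also matched the timedelta column `g`. A minimal sketch, assuming you want only the integer and float columns, is to combine `include` with `exclude`. The frame `df_demo` and its values are made up for illustration and only mirror columns `a`, `c`, and `g`.

```python
import pandas as pd

# Hypothetical frame mirroring columns a, c, and g from the example above
df_demo = pd.DataFrame({
    'a': [1, 2, 3],
    'c': [1.0, 2.0, 3.0],
    'g': pd.Series([pd.Timedelta(days=i) for i in range(3)]),
})

# Keep numeric columns but explicitly drop timedelta columns
strict_numeric = df_demo.select_dtypes(include=['number'], exclude=['timedelta64'])
print(strict_numeric.columns.tolist())
```

Listing the concrete dtypes instead (`include=['int64', 'float64']`) is an alternative when the set of numeric dtypes in the frame is known in advance.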


In [12]:
# Exclude integer columns
print("Exclude integer columns:")
df.select_dtypes(exclude=['int64'])

Exclude integer columns:


,b,c,d,e,f,g
0,True,1.0,a,1,2021-01-01,0 days
1,False,2.0,b,2,2021-01-02,1 days
2,True,1.0,a,1,2021-01-03,2 days
3,False,2.0,b,2,2021-01-04,3 days
4,True,1.0,a,1,2021-01-05,4 days
5,False,2.0,b,2,2021-01-06,5 days


In [13]:
# Include float and exclude object columns
print("Include float and exclude object columns:")
df.select_dtypes(include=['float64'], exclude=['object'])

Include float and exclude object columns:


,c
0,1.0
1,2.0
2,1.0
3,2.0
4,1.0
5,2.0


In [14]:
# Select all columns except numeric ones (the timedelta column g is dropped too)
print("Select all columns except numeric ones:")
df.select_dtypes(exclude=['number'])

Select all columns except numeric ones:


,b,d,e,f
0,True,a,1,2021-01-01
1,False,b,2,2021-01-02
2,True,a,1,2021-01-03
3,False,b,2,2021-01-04
4,True,a,1,2021-01-05
5,False,b,2,2021-01-06
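
Two details of the `include`/`exclude` parameters are worth checking by hand. The sketch below, using a small made-up frame `df_demo`, compares the string alias `'number'` with the NumPy class `np.number`, and shows that calling `select_dtypes()` with no selection, or with the same dtype in both lists, raises a `ValueError`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with integer, float, and boolean columns
df_demo = pd.DataFrame({'a': [1, 2], 'c': [1.0, 2.0], 'b': [True, False]})

# The string alias 'number' and the NumPy class np.number select the same columns
by_string = df_demo.select_dtypes(include='number')
by_class = df_demo.select_dtypes(include=np.number)
print(by_string.equals(by_class))

# Invalid calls: no selection at all, then overlapping include/exclude
errors = []
for kwargs in ({}, {'include': ['float64'], 'exclude': ['float64']}):
    try:
        df_demo.select_dtypes(**kwargs)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```

Note that the boolean column `b` is not part of the `'number'` selection, which matches the earlier output where `b` had to be selected with `include='bool'`.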


##### 2. DataFrame.sem()

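The `sem()` method returns the unbiased standard error of the mean over the requested axis: the sample standard deviation divided by the square root of the number of observations. The standard deviation is normalized by N-1 by default, which can be changed using the `ddof` parameter.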

In [15]:
# Create a DataFrame with numeric data
df_numeric = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 50, 10, 20, 30]
})

print("DataFrame with numeric data:")
df_numeric

DataFrame with numeric data:


,A,B,C
0,1,10,100
1,2,20,50
2,3,30,10
3,4,40,20
4,5,50,30


In [16]:
# Calculate the standard error of the mean for each column
print("Standard error of the mean for each column:")
df_numeric.sem()

Standard error of the mean for each column:


A     0.707107
B     7.071068
C    15.937377
dtype: float64

In [17]:
# Calculate the standard error of the mean for each row
print("Standard error of the mean for each row:")
df_numeric.sem(axis=1)

Standard error of the mean for each row:


0    31.606961
1    14.000000
2     8.089774
3    10.413666
4    13.017083
dtype: float64

In [18]:
# Create a DataFrame with NaN values
df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': [100, 50, 10, np.nan, 30]
})

print("DataFrame with NaN values:")
df_with_nan

DataFrame with NaN values:


,A,B,C
0,1.0,10.0,100.0
1,2.0,,50.0
2,,30.0,10.0
3,4.0,40.0,
4,5.0,50.0,30.0


In [19]:
# Calculate the standard error of the mean with skipna=True (default)
print("Standard error of the mean with skipna=True (default):")
df_with_nan.sem()

Standard error of the mean with skipna=True (default):


A     0.912871
B     8.539126
C    19.311050
dtype: float64

In [20]:
# Calculate the standard error of the mean with skipna=False
print("Standard error of the mean with skipna=False:")
df_with_nan.sem(skipna=False)

Standard error of the mean with skipna=False:


A   NaN
B   NaN
C   NaN
dtype: float64
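
The two outputs above differ only in how missing values are treated. A minimal sketch, using a standalone Series `col` that copies column `A` of `df_with_nan`, reproduces the `skipna=True` result by hand: the standard deviation of the non-missing values divided by the square root of their count.

```python
import numpy as np
import pandas as pd

# Same values as column A of df_with_nan
col = pd.Series([1, 2, np.nan, 4, 5], name='A')

# With skipna=True (the default), only the 4 non-missing values contribute
manual = col.std() / np.sqrt(col.count())
print(manual, col.sem())

# With skipna=False, a single NaN makes the whole result NaN
print(col.sem(skipna=False))
```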

In [21]:
# Compare the standard error of the mean for different ddof values
print("Standard error of the mean with different ddof values:")
for ddof in [0, 1, 2]:
    print(f"\nddof={ddof}:")
    print(df_numeric.sem(ddof=ddof))

Standard error of the mean with different ddof values:

ddof=0:
A     0.632456
B     6.324555
C    14.254824
dtype: float64

ddof=1:
A     0.707107
B     7.071068
C    15.937377
dtype: float64

ddof=2:
A     0.816497
B     8.164966
C    18.402898
dtype: float64
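
The effect of `ddof` can be cross-checked against `std()`. The sketch below assumes the relationship `sem = std(ddof) / sqrt(N)` and verifies it for each `ddof` value on a frame `values` that repeats the data of `df_numeric`.

```python
import numpy as np
import pandas as pd

# Same data as df_numeric
values = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 50, 10, 20, 30],
})

n = len(values)  # number of observations per column
checks = {}
for ddof in (0, 1, 2):
    # Standard deviation with the given ddof, scaled by sqrt(N)
    manual = values.std(ddof=ddof) / np.sqrt(n)
    checks[ddof] = bool(np.allclose(manual, values.sem(ddof=ddof)))
print(checks)
```

Only the divisor inside the standard deviation changes with `ddof` (N - ddof); the final division is by the square root of N in every case.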


In [22]:
# Create a DataFrame with a MultiIndex
index = pd.MultiIndex.from_tuples([
    ('A', 1), ('A', 2), ('A', 3),
    ('B', 1), ('B', 2), ('B', 3)
], names=['group', 'subgroup'])

df_multi = pd.DataFrame({
    'X': [1, 2, 3, 4, 5, 6],
    'Y': [10, 20, 30, 40, 50, 60]
}, index=index)

print("DataFrame with MultiIndex:")
df_multi

DataFrame with MultiIndex:


group,subgroup,X,Y
A,1,1,10
A,2,2,20
A,3,3,30
B,1,4,40
B,2,5,50
B,3,6,60


In [24]:
# Calculate the standard error of the mean by level
print("Standard error of the mean by level 'group':")
print(df_multi.groupby(level='group').sem())

Standard error of the mean by level 'group':
             X         Y
group                   
A      0.57735  5.773503
B      0.57735  5.773503
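
A standard error is usually reported next to the mean it describes. As a sketch, the grouped call above can be extended with `agg()` to compute both per group in one pass; the frame `demo` rebuilds the same data as `df_multi`.

```python
import pandas as pd

# Same data and index as df_multi in the cell above
index = pd.MultiIndex.from_product(
    [['A', 'B'], [1, 2, 3]], names=['group', 'subgroup']
)
demo = pd.DataFrame(
    {'X': [1, 2, 3, 4, 5, 6], 'Y': [10, 20, 30, 40, 50, 60]}, index=index
)

# One row per group; columns become a (column, statistic) MultiIndex
summary = demo.groupby(level='group').agg(['mean', 'sem'])
print(summary)
```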


##### 3. DataFrame.set_axis()

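The `set_axis()` method assigns new labels to a given axis. Pass a list-like or Index whose length matches the axis, and choose the axis with `axis=0` (row labels) or `axis=1` (column labels). As the examples below show, the result is returned as a new object and the original DataFrame keeps its labels.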

In [25]:
# Create a simple DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

print("Original DataFrame:")
df

Original DataFrame:


,A,B,C
0,1,4,7
1,2,5,8
2,3,6,9


In [26]:
# Set new row labels (axis=0)
print("DataFrame with new row labels:")
df.set_axis(['X', 'Y', 'Z'], axis=0)

DataFrame with new row labels:


,A,B,C
X,1,4,7
Y,2,5,8
Z,3,6,9


In [27]:
# Set new column labels (axis=1)
print("DataFrame with new column labels:")
df.set_axis(['P', 'Q', 'R'], axis=1)

DataFrame with new column labels:


,P,Q,R
0,1,4,7
1,2,5,8
2,3,6,9


In [30]:
# Modify the DataFrame in place by assigning to the index and columns attributes
print("Before modification:")
print(df)

# Modify row labels
df.index = ['X', 'Y', 'Z']
print("\nAfter modification of row labels:")
print(df)

# Modify column labels
df.columns = ['P', 'Q', 'R']
print("\nAfter modification of column labels:")
print(df)

Before modification:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

After modification of row labels:
   A  B  C
X  1  4  7
Y  2  5  8
Z  3  6  9

After modification of column labels:
   P  Q  R
X  1  4  7
Y  2  5  8
Z  3  6  9


In [31]:
# Create a DataFrame with MultiIndex
df_multi = pd.DataFrame(
    np.random.randn(3, 3),
    index=pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
)

print("DataFrame with MultiIndex:")
df_multi

DataFrame with MultiIndex:


,,0,1,2
A,1,-0.657684,-2.52967,-1.204881
A,2,-1.284078,-0.082382,-0.061799
B,1,-0.217673,-0.601379,1.474827


In [32]:
# Set new MultiIndex
new_index = pd.MultiIndex.from_tuples([('X', 1), ('X', 2), ('Y', 1)])
print("DataFrame with new MultiIndex:")
df_multi.set_axis(new_index, axis=0)

DataFrame with new MultiIndex:


,,0,1,2
X,1,-0.657684,-2.52967,-1.204881
X,2,-1.284078,-0.082382,-0.061799
Y,1,-0.217673,-0.601379,1.474827


In [33]:
# Reuse another object's index: s.index is the default 0, 1, 2, so it replaces the X/Y/Z row labels
s = pd.Series(['a', 'b', 'c'])
print("Using Series index:")
df.set_axis(s.index, axis=0)

Using Series index:


,P,Q,R
0,1,4,7
1,2,5,8
2,3,6,9


In [34]:
# Compare set_axis with other index setting methods
df_original = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Using set_axis
df1 = df_original.copy()
df1 = df1.set_axis(['X', 'Y', 'Z'], axis=0)

# Using index property
df2 = df_original.copy()
df2.index = ['X', 'Y', 'Z']

# Using rename
df3 = df_original.copy()
df3 = df3.rename(index={0: 'X', 1: 'Y', 2: 'Z'})

print("Using set_axis:")
print(df1)
print("\nUsing index property:")
print(df2)
print("\nUsing rename:")
print(df3)

print("\nAre df1 and df2 equal?", df1.equals(df2))
print("Are df1 and df3 equal?", df1.equals(df3))

Using set_axis:
   A  B  C
X  1  4  7
Y  2  5  8
Z  3  6  9

Using index property:
   A  B  C
X  1  4  7
Y  2  5  8
Z  3  6  9

Using rename:
   A  B  C
X  1  4  7
Y  2  5  8
Z  3  6  9

Are df1 and df2 equal? True
Are df1 and df3 equal? True
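
The three approaches give equal results here because every label is replaced. They differ when only some labels should change: `rename()` takes a mapping and leaves unmapped labels alone, while `set_axis()` needs a complete list of labels. A minimal sketch on a made-up frame `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# rename: only the label 0 is mapped; labels 1 and 2 are kept
partial = frame.rename(index={0: 'X'})
print(partial.index.tolist())

# set_axis: every row label must be supplied
full = frame.set_axis(['X', 'Y', 'Z'], axis=0)
print(full.index.tolist())
```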


##### Summary

In this notebook, we've explored three DataFrame methods:

1. **select_dtypes()**: Returns a subset of the DataFrame's columns based on the column dtypes. Use the `include` and `exclude` parameters to keep or drop specific data types. Commonly used selectors include 'number', 'bool', 'object', 'category', 'datetime64', and 'timedelta64'. In the examples above, 'number' also matched the timedelta column.

2. **sem()**: Returns the unbiased standard error of the mean over the requested axis. The underlying standard deviation is normalized by N-1 by default, which can be changed using the `ddof` parameter, and missing values are skipped unless `skipna=False` is passed. This method is useful when you want to estimate the standard error of a sample mean.

3. **set_axis()**: Assigns new labels to a given axis from a list-like or Index and returns the relabeled object. This method is an alternative to directly assigning to the `index` or `columns` attributes, or using the `rename` method.

Together, these methods cover dtype-based column selection, a basic statistical estimate, and relabeling of row and column axes in pandas.