#### Pandas Tutorial - Part 62: DataFrame Methods (max, mean, nlargest, notna, notnull)

This notebook covers several important DataFrame methods including:
- `max()` - Return the maximum of the values for the requested axis
- `mean()` - Return the mean of the values for the requested axis
- `nlargest()` - Return the first n rows ordered by columns in descending order
- `notna()` - Detect existing (non-missing) values
- `notnull()` - Alias of notna()

In [1]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

##### 1. DataFrame.max()

The `max()` method returns the maximum of the values for the requested axis.

In [2]:
# Create a Series with MultiIndex
idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']
], names=['blooded', 'animal'])

s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
print("Series with MultiIndex:")
s

Series with MultiIndex:


blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

In [3]:
# Find the maximum value
print("Maximum value:")
s.max()

Maximum value:


np.int64(8)

In [5]:
# Find the maximum value by level
print("Maximum value by level 'blooded':")
print(s.groupby(level='blooded').max())

Maximum value by level 'blooded':
blooded
cold    8
warm    4
Name: legs, dtype: int64


In [7]:
# Find the maximum value by level index
print("Maximum value by level 0:")
print(s.groupby(level=0).max())

Maximum value by level 0:
blooded
cold    8
warm    4
Name: legs, dtype: int64
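
The two grouped calls above are interchangeable: a level can be addressed by name or by position. The sketch below rebuilds the same Series and checks this. It assumes a recent pandas release (2.x), where to our knowledge `groupby(level=...)` is the supported way to reduce over an index level; the variable names `by_name` and `by_position` are illustrative only.

```python
import pandas as pd

# Rebuild the MultiIndex Series used above
idx = pd.MultiIndex.from_arrays(
    [["warm", "warm", "cold", "cold"], ["dog", "falcon", "fish", "spider"]],
    names=["blooded", "animal"],
)
s = pd.Series([4, 2, 0, 8], name="legs", index=idx)

# Group by level name and by level position
by_name = s.groupby(level="blooded").max()
by_position = s.groupby(level=0).max()

# Both spellings select the same level, so the results match
print(by_name.equals(by_position))
```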


In [8]:
# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 50, 10, 20, 30]
})
print("DataFrame:")
df

DataFrame:


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 10 | 100 |
| 1 | 2 | 20 | 50 |
| 2 | 3 | 30 | 10 |
| 3 | 4 | 40 | 20 |
| 4 | 5 | 50 | 30 |


In [9]:
# Find the maximum value for each column
print("Maximum value for each column:")
df.max()

Maximum value for each column:


A      5
B     50
C    100
dtype: int64

In [10]:
# Find the maximum value for each row
print("Maximum value for each row:")
df.max(axis=1)

Maximum value for each row:


0    100
1     50
2     30
3     40
4     50
dtype: int64

In [11]:
# Create a DataFrame with NaN values
df_with_na = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, 50],
    'C': [100, 50, 10, np.nan, 30]
})
print("DataFrame with NaN values:")
df_with_na

DataFrame with NaN values:


|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 10.0 | 100.0 |
| 1 | 2.0 | NaN | 50.0 |
| 2 | NaN | 30.0 | 10.0 |
| 3 | 4.0 | 40.0 | NaN |
| 4 | 5.0 | 50.0 | 30.0 |


In [12]:
# Find the maximum value for each column (skipna=True by default)
print("Maximum value for each column (skipna=True):")
df_with_na.max()

Maximum value for each column (skipna=True):


A      5.0
B     50.0
C    100.0
dtype: float64

In [13]:
# Find the maximum value for each column (skipna=False)
print("Maximum value for each column (skipna=False):")
df_with_na.max(skipna=False)

Maximum value for each column (skipna=False):


A   NaN
B   NaN
C   NaN
dtype: float64
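
Every column in `df_with_na` happens to contain a missing value, so the result above is `NaN` everywhere. The sketch below uses a hypothetical frame, `df_partial`, in which only one column has a gap, to show that `skipna=False` affects only the columns (or rows) that actually contain `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only column 'A' contains a missing value
df_partial = pd.DataFrame({"A": [1, np.nan, 3], "B": [10, 20, 30]})

# With skipna=False, 'A' becomes NaN while 'B' keeps its real maximum
strict_max = df_partial.max(skipna=False)
print(strict_max)

# With the default skipna=True, the NaN in 'A' is ignored
default_max = df_partial.max()
print(default_max)
```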

##### 2. DataFrame.mean()

The `mean()` method returns the mean of the values for the requested axis.

In [14]:
# Using the same DataFrame
print("DataFrame:")
df

DataFrame:


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 10 | 100 |
| 1 | 2 | 20 | 50 |
| 2 | 3 | 30 | 10 |
| 3 | 4 | 40 | 20 |
| 4 | 5 | 50 | 30 |


In [15]:
# Calculate the mean for each column
print("Mean for each column:")
df.mean()

Mean for each column:


A     3.0
B    30.0
C    42.0
dtype: float64

In [16]:
# Calculate the mean for each row
print("Mean for each row:")
df.mean(axis=1)

Mean for each row:


0    37.000000
1    24.000000
2    14.333333
3    21.333333
4    28.333333
dtype: float64

In [17]:
# Using the DataFrame with NaN values
print("DataFrame with NaN values:")
df_with_na

DataFrame with NaN values:


|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 10.0 | 100.0 |
| 1 | 2.0 | NaN | 50.0 |
| 2 | NaN | 30.0 | 10.0 |
| 3 | 4.0 | 40.0 | NaN |
| 4 | 5.0 | 50.0 | 30.0 |


In [18]:
# Calculate the mean for each column (skipna=True by default)
print("Mean for each column (skipna=True):")
df_with_na.mean()

Mean for each column (skipna=True):


A     3.0
B    32.5
C    47.5
dtype: float64

In [19]:
# Calculate the mean for each column (skipna=False)
print("Mean for each column (skipna=False):")
df_with_na.mean(skipna=False)

Mean for each column (skipna=False):


A   NaN
B   NaN
C   NaN
dtype: float64

In [20]:
# Create a DataFrame with mixed data types
df_mixed = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [True, False, True, True, False]
})
print("DataFrame with mixed data types:")
df_mixed

DataFrame with mixed data types:


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | a | True |
| 1 | 2 | b | False |
| 2 | 3 | c | True |
| 3 | 4 | d | True |
| 4 | 5 | e | False |


In [23]:
# Calculate the mean with numeric_only=True
print("Mean (numeric_only=True):")
print(df_mixed.mean(numeric_only=True))

Mean (numeric_only=True):
A    3.0
C    0.6
dtype: float64
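
The value 0.6 for column `C` appears because booleans are averaged as 0 and 1, so the mean is the fraction of `True` values (3 out of 5). The sketch below confirms this and shows one possible way to exclude boolean columns as well, using `select_dtypes(include="number")`; the names `fraction_true` and `numbers_only_mean` are illustrative only.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [True, False, True, True, False],
})

# Booleans count as 0/1, so the mean of 'C' is the share of True values
fraction_true = df_mixed["C"].sum() / len(df_mixed)
print(fraction_true)

# Keep only integer/float columns, which drops both 'B' and 'C'
numbers_only_mean = df_mixed.select_dtypes(include="number").mean()
print(numbers_only_mean)
```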


##### 3. DataFrame.nlargest()

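The `nlargest()` method returns the `n` rows with the largest values in the given column(s), sorted in descending order. When several columns are given, later columns are used only to break ties in earlier ones.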

In [24]:
# Create a DataFrame for countries
df = pd.DataFrame({
    'population': [59000000, 65000000, 434000, 434000, 37800000],
    'GDP': [1937894, 2583560, 12128, 17036, 1493000],
    'alpha-2': ['IT', 'FR', 'BN', 'TL', 'CA']
}, index=['Italy', 'France', 'Brunei', 'Timor-Leste', 'Canada'])

print("Countries DataFrame:")
df

Countries DataFrame:


|   | population | GDP | alpha-2 |
|---|---|---|---|
| Italy | 59000000 | 1937894 | IT |
| France | 65000000 | 2583560 | FR |
| Brunei | 434000 | 12128 | BN |
| Timor-Leste | 434000 | 17036 | TL |
| Canada | 37800000 | 1493000 | CA |


In [25]:
# Get the 3 largest countries by population
print("3 largest countries by population:")
df.nlargest(3, 'population')

3 largest countries by population:


|   | population | GDP | alpha-2 |
|---|---|---|---|
| France | 65000000 | 2583560 | FR |
| Italy | 59000000 | 1937894 | IT |
| Canada | 37800000 | 1493000 | CA |


In [26]:
# Get the 3 largest countries by GDP
print("3 largest countries by GDP:")
df.nlargest(3, 'GDP')

3 largest countries by GDP:


|   | population | GDP | alpha-2 |
|---|---|---|---|
| France | 65000000 | 2583560 | FR |
| Italy | 59000000 | 1937894 | IT |
| Canada | 37800000 | 1493000 | CA |


In [27]:
# Get the 3 largest countries by multiple columns
print("3 largest countries by population and GDP:")
df.nlargest(3, ['population', 'GDP'])

3 largest countries by population and GDP:


|   | population | GDP | alpha-2 |
|---|---|---|---|
| France | 65000000 | 2583560 | FR |
| Italy | 59000000 | 1937894 | IT |
| Canada | 37800000 | 1493000 | CA |


In [28]:
# Get the 3 largest countries by GDP and population
print("3 largest countries by GDP and population:")
df.nlargest(3, ['GDP', 'population'])

3 largest countries by GDP and population:


|   | population | GDP | alpha-2 |
|---|---|---|---|
| France | 65000000 | 2583560 | FR |
| Italy | 59000000 | 1937894 | IT |
| Canada | 37800000 | 1493000 | CA |
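
All four calls above return the same rows because the top three countries are the same whichever column is used. Ties only matter once the cut-off reaches Brunei and Timor-Leste, which share a population of 434000. The sketch below asks for four rows and shows how the `keep` parameter and a second column affect which tied row is chosen; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "population": [59000000, 65000000, 434000, 434000, 37800000],
    "GDP": [1937894, 2583560, 12128, 17036, 1493000],
    "alpha-2": ["IT", "FR", "BN", "TL", "CA"],
}, index=["Italy", "France", "Brunei", "Timor-Leste", "Canada"])

# keep='first' (the default) takes the tied row that appears first
first = df.nlargest(4, "population")
# keep='last' takes the tied row that appears last
last = df.nlargest(4, "population", keep="last")
# keep='all' keeps every tied row, even if that returns more than n rows
all_ties = df.nlargest(4, "population", keep="all")
# A second column breaks the tie by its own values (higher GDP wins)
gdp_tiebreak = df.nlargest(4, ["population", "GDP"])

print(first.index.tolist())
print(last.index.tolist())
print(all_ties.index.tolist())
print(gdp_tiebreak.index.tolist())
```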


In [29]:
# Create a Series
s = pd.Series([3, 2, 1, 5, 4])
print("Series:")
print(s)

# Get the 3 largest values
print("\n3 largest values:")
print(s.nlargest(3))

Series:
0    3
1    2
2    1
3    5
4    4
dtype: int64

3 largest values:
3    5
4    4
0    3
dtype: int64
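
For a Series with distinct values, `nlargest(n)` gives the same result as sorting in descending order and taking the first `n` entries. The sketch below checks that equivalence on the Series above; the names `top3` and `top3_sorted` are illustrative only.

```python
import pandas as pd

s = pd.Series([3, 2, 1, 5, 4])

# Direct selection of the three largest values
top3 = s.nlargest(3)
# Same result via a full sort followed by head()
top3_sorted = s.sort_values(ascending=False).head(3)

print(top3.equals(top3_sorted))
```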


##### 4. DataFrame.notna() and DataFrame.notnull()

The `notna()` and `notnull()` methods detect existing (non-missing) values. `notnull()` is an alias of `notna()`. Values such as `NaN`, `None`, and `NaT` count as missing; an empty string `''` does not.

In [30]:
# Create a DataFrame with missing values
df = pd.DataFrame({
    'age': [5, 6, np.nan],
    'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
    'name': ['Alfred', 'Batman', ''],
    'toy': [None, 'Batmobile', 'Joker']
})

print("DataFrame with missing values:")
df

DataFrame with missing values:


|   | age | born | name | toy |
|---|---|---|---|---|
| 0 | 5.0 | NaT | Alfred | None |
| 1 | 6.0 | 1939-05-27 | Batman | Batmobile |
| 2 | NaN | 1940-04-25 |  | Joker |


In [31]:
# Detect non-missing values using notna()
print("Non-missing values (notna):")
df.notna()

Non-missing values (notna):


|   | age | born | name | toy |
|---|---|---|---|---|
| 0 | True | False | True | False |
| 1 | True | True | True | True |
| 2 | False | True | True | True |


In [32]:
# Detect non-missing values using notnull()
print("Non-missing values (notnull):")
df.notnull()

Non-missing values (notnull):


|   | age | born | name | toy |
|---|---|---|---|---|
| 0 | True | False | True | False |
| 1 | True | True | True | True |
| 2 | False | True | True | True |


In [33]:
# Verify that notna() and notnull() return the same result
print("Are notna() and notnull() results equal?")
print(df.notna().equals(df.notnull()))

Are notna() and notnull() results equal?
True
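
In the output above, the empty string in the `name` column is reported as present. If empty strings should count as missing in your data (an assumption that depends on the dataset), one possible approach is to mask them before calling `notna()`. The sketch below uses a standalone Series, `names`, with the same values as the `name` column.

```python
import pandas as pd

names = pd.Series(["Alfred", "Batman", ""])

# An empty string is a real value, so notna() reports True for it
print(names.notna().tolist())

# Replace empty strings with a missing value, then test again
cleaned = names.mask(names == "")
print(cleaned.notna().tolist())

# notna() is the element-wise negation of isna()
print(cleaned.notna().equals(~cleaned.isna()))
```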


In [34]:
# Create a Series with missing values
ser = pd.Series([5, 6, np.nan])
print("Series with missing values:")
print(ser)

Series with missing values:
0    5.0
1    6.0
2    NaN
dtype: float64


In [35]:
# Detect non-missing values in the Series
print("\nNon-missing values in Series:")
print(ser.notna())


Non-missing values in Series:
0     True
1     True
2    False
dtype: bool


In [36]:
# Use notna() to filter a DataFrame
print("Filtering DataFrame to keep only rows where 'age' is not NA:")
df[df['age'].notna()]

Filtering DataFrame to keep only rows where 'age' is not NA:


|   | age | born | name | toy |
|---|---|---|---|---|
| 0 | 5.0 | NaT | Alfred | None |
| 1 | 6.0 | 1939-05-27 | Batman | Batmobile |


In [37]:
# Count non-missing values in each column
print("Count of non-missing values in each column:")
df.notna().sum()

Count of non-missing values in each column:


age     2
born    2
name    3
toy     2
dtype: int64
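
Summing the boolean mask works because `True` counts as 1. The same per-column totals are available from `count()`, and passing `axis=1` to `sum()` gives the number of non-missing values per row. A short sketch on the same data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [5, 6, np.nan],
    "born": [pd.NaT, pd.Timestamp("1939-05-27"), pd.Timestamp("1940-04-25")],
    "name": ["Alfred", "Batman", ""],
    "toy": [None, "Batmobile", "Joker"],
})

# Per-column counts of non-missing values, computed two ways
per_column = df.notna().sum()
print((per_column == df.count()).all())

# Per-row counts of non-missing values
per_row = df.notna().sum(axis=1)
print(per_row.tolist())
```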

In [38]:
# Check if all values in a row are non-missing
print("Rows where all values are non-missing:")
df[df.notna().all(axis=1)]

Rows where all values are non-missing:


|   | age | born | name | toy |
|---|---|---|---|---|
| 1 | 6.0 | 1939-05-27 | Batman | Batmobile |
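
Both filters shown in this section have a `dropna()` counterpart: `dropna()` keeps rows with no missing values, and `dropna(subset=[...])` checks only the listed columns. The sketch below compares the two styles on the same data; the names `complete_rows` and `age_known` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [5, 6, np.nan],
    "born": [pd.NaT, pd.Timestamp("1939-05-27"), pd.Timestamp("1940-04-25")],
    "name": ["Alfred", "Batman", ""],
    "toy": [None, "Batmobile", "Joker"],
})

# Rows where every column is present, via a boolean mask
complete_rows = df[df.notna().all(axis=1)]
print(complete_rows.equals(df.dropna()))

# Rows where 'age' is present, via a boolean mask
age_known = df[df["age"].notna()]
print(age_known.equals(df.dropna(subset=["age"])))
```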


##### Summary

In this notebook, we've explored several important DataFrame methods:

1. **max()**: Returns the maximum of the values for the requested axis. For hierarchical indices, group by an index level first, as in `s.groupby(level='blooded').max()`.

2. **mean()**: Returns the mean of the values for the requested axis. It can be used with the `skipna` parameter to control how missing values are handled.

3. **nlargest()**: Returns the first n rows ordered by columns in descending order. It's useful for quickly finding the largest values in a DataFrame.

4. **notna()** and **notnull()**: Detect existing (non-missing) values in a DataFrame or Series. `notnull()` is an alias of `notna()`. Empty strings are treated as present, not missing.

These methods are essential for data analysis, statistical calculations, and handling missing data in pandas.