#### Part 18: Sorting MultiIndex and Concatenation in Pandas

In this notebook, we'll explore:
- Sorting MultiIndex objects
- Concatenating DataFrames
- Different join types in concatenation
- Replacing the removed `append` method with `concat`
- Ignoring indexes during concatenation

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np

##### 1. Sorting MultiIndex Objects

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use `sort_index()`.

In [2]:
# Create a Series with MultiIndex
tuples = [('foo', 'one'), ('foo', 'two'), ('bar', 'one'), ('bar', 'two'), ('qux', 'one'), ('qux', 'two')]
s = pd.Series(np.random.randn(6), index=pd.MultiIndex.from_tuples(tuples))
s

foo  one   -0.165423
     two    0.852777
bar  one    1.610141
     two   -0.712366
qux  one    0.596095
     two   -0.027566
dtype: float64

In [3]:
# Sort the index
s.sort_index()

bar  one    1.610141
     two   -0.712366
foo  one   -0.165423
     two    0.852777
qux  one    0.596095
     two   -0.027566
dtype: float64

In [4]:
# Sort by level 0
s.sort_index(level=0)

bar  one    1.610141
     two   -0.712366
foo  one   -0.165423
     two    0.852777
qux  one    0.596095
     two   -0.027566
dtype: float64

In [5]:
# Sort by level 1
s.sort_index(level=1)

bar  one    1.610141
foo  one   -0.165423
qux  one    0.596095
bar  two   -0.712366
foo  two    0.852777
qux  two   -0.027566
dtype: float64

You may also pass a level name to `sort_index` if the MultiIndex levels are named.

In [6]:
# Set names for the levels
s.index.set_names(['L1', 'L2'], inplace=True)
s

L1   L2 
foo  one   -0.165423
     two    0.852777
bar  one    1.610141
     two   -0.712366
qux  one    0.596095
     two   -0.027566
dtype: float64

In [7]:
# Sort by level name 'L1'
s.sort_index(level='L1')

L1   L2 
bar  one    1.610141
     two   -0.712366
foo  one   -0.165423
     two    0.852777
qux  one    0.596095
     two   -0.027566
dtype: float64

In [8]:
# Sort by level name 'L2'
s.sort_index(level='L2')

L1   L2 
bar  one    1.610141
foo  one   -0.165423
qux  one    0.596095
bar  two   -0.712366
foo  two    0.852777
qux  two   -0.027566
dtype: float64
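
`sort_index` also accepts a list of levels together with a list of sort directions, one per level. The following is a minimal sketch, assuming the same level names `L1` and `L2` as above and fixed values in place of the random data; the names `s_demo` and `mixed` are illustrative only.

```python
import pandas as pd

# Same index layout as above, with fixed values so the result is reproducible
tuples = [('foo', 'one'), ('foo', 'two'), ('bar', 'one'),
          ('bar', 'two'), ('qux', 'one'), ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['L1', 'L2'])
s_demo = pd.Series(range(6), index=index)

# Sort by L2 ascending, then by L1 descending within each L2 value
mixed = s_demo.sort_index(level=['L2', 'L1'], ascending=[True, False])
print(mixed)
```

The order of the `level` list sets the sort priority, so `L2` is the primary key here even though it is the inner level of the index.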

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [9]:
# Create a DataFrame with MultiIndex
arrays = [['one', 'one', 'zero', 'zero'], ['y', 'x', 'y', 'x']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.randn(4, 2), index=index)
df

               0         1
one  y -1.223566  0.002311
     x  0.548743  1.027004
zero y -2.140442  1.430706
     x -0.352799 -0.164540


In [10]:
# Sort the transposed DataFrame by level 1 on axis 1
df.T.sort_index(level=1, axis=1)

        one      zero       one      zero
          x         x         y         y
0  0.548743 -0.352799 -1.223566 -2.140442
1  1.027004 -0.164540  0.002311  1.430706


Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view.

In [11]:
# Create an unsorted MultiIndex DataFrame
dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
                    'joe': ['x', 'x', 'z', 'y'],
                    'jolie': np.random.rand(4)})
dfm = dfm.set_index(['jim', 'joe'])
dfm

            jolie
jim joe
0   x    0.671400
    x    0.056355
1   z    0.024561
    y    0.504431


In [14]:
# Check if MultiIndex is lexically sorted
is_sorted = dfm.index.is_monotonic_increasing

# Alternative approach to check sorting
# You can also manually verify if the index is sorted
index_values = list(dfm.index)
is_sorted_manual = index_values == sorted(index_values)

print(f"Is monotonically increasing: {is_sorted}")
print(f"Is sorted (manual check): {is_sorted_manual}")

Is monotonically increasing: False
Is sorted (manual check): False


In [17]:
# Sort the index
dfm = dfm.sort_index()
dfm

            jolie
jim joe
0   x    0.671400
    x    0.056355
1   y    0.504431
    z    0.024561


In [19]:
# Check whether the MultiIndex is now lexically sorted
is_sorted = dfm.index.is_monotonic_increasing

# Alternative approach to check sorting
# You can also manually verify if the index is sorted
index_values = list(dfm.index)
is_sorted_manual = index_values == sorted(index_values)

print(f"Is monotonically increasing: {is_sorted}")
print(f"Is sorted (manual check): {is_sorted_manual}")

Is monotonically increasing: True
Is sorted (manual check): True
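
Sorting matters in practice because slicing with full-depth tuple keys fails on an index that is not lexsorted. The sketch below is an illustration under assumptions: fixed values stand in for the random `jolie` column, and the names `unsorted`, `slice_failed`, and `sliced` are hypothetical.

```python
import pandas as pd

# Fixed values stand in for the random 'jolie' column used above
unsorted = pd.DataFrame({'jim': [0, 0, 1, 1],
                         'joe': ['x', 'x', 'z', 'y'],
                         'jolie': [0.1, 0.2, 0.3, 0.4]}).set_index(['jim', 'joe'])

# Slicing with two-element tuple keys needs both levels to be lexsorted;
# pandas raises UnsortedIndexError, which is a subclass of KeyError
try:
    unsorted.loc[(0, 'y'):(1, 'z')]
    slice_failed = False
except KeyError as exc:
    slice_failed = True
    print(f"Slicing the unsorted index failed: {exc}")

# The same slice works once the index has been sorted
sliced = unsorted.sort_index().loc[(0, 'y'):(1, 'z')]
print(sliced)
```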


In [21]:
# Check sorting status of each level
def check_level_sorting(idx):
    """Report, per level, whether idx already matches its fully sorted order."""
    if not isinstance(idx, pd.MultiIndex):
        return [idx.is_monotonic_increasing]

    # Sort the index that was passed in instead of relying on a global DataFrame
    sorted_idx = idx.sort_values()

    results = []
    for level in range(idx.nlevels):
        level_values = idx.get_level_values(level)

        if level == 0:
            # The first level is sorted if it is monotonically increasing
            results.append(level_values.is_monotonic_increasing)
        else:
            # Proxy check for deeper levels: compare against the same level
            # of the fully sorted index
            is_sorted = sorted_idx.get_level_values(level).equals(level_values)
            results.append(is_sorted)

    return results

# Get sorting status for each level
level_sorting = check_level_sorting(dfm.index)
print(f"Levels sorted: {level_sorting}")
print(f"All levels sorted: {all(level_sorting)}")
print(f"Number of sorted levels: {sum(level_sorting)} out of {len(level_sorting)}")

Levels sorted: [True, True]
All levels sorted: True
Number of sorted levels: 2 out of 2


##### 2. Concatenating DataFrames

pandas provides several ways to combine Series and DataFrame objects: set logic (union or intersection) for the indexes, and relational-algebra functionality for join / merge-type operations.

In [22]:
# Create sample DataFrames for concatenation
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

df3 = pd.DataFrame({
    'A': ['A8', 'A9', 'A10', 'A11'],
    'B': ['B8', 'B9', 'B10', 'B11'],
    'C': ['C8', 'C9', 'C10', 'C11'],
    'D': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])

df4 = pd.DataFrame({
    'B': ['B2', 'B3', 'B6', 'B7'],
    'D': ['D2', 'D3', 'D6', 'D7'],
    'F': ['F2', 'F3', 'F6', 'F7']
}, index=[2, 3, 6, 7])

# Display df1 and df2
print("DataFrame 1:")
display(df1)
print("\nDataFrame 2:")
display(df2)

DataFrame 1:


    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3



DataFrame 2:


    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7


###### 2.1 Concatenation with pd.concat

The `concat()` function does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes on the other axes.

In [23]:
# Basic concatenation along axis=0 (rows)
result = pd.concat([df1, df2])
result

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7


###### 2.2 Set Logic on the Other Axes

When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated):
- Take the union of them all, `join='outer'`. This is the default option as it results in zero information loss.
- Take the intersection, `join='inner'`.

In [24]:
# Outer join (default)
result = pd.concat([df1, df4], axis=1, sort=False)
result

     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7


In [25]:
# Inner join
result = pd.concat([df1, df4], axis=1, join='inner')
result

    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3
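
The `join` argument applies to whichever axis is not being concatenated. When stacking rows (`axis=0`), it therefore decides which columns survive. A small sketch, assuming two illustrative frames named `left` and `right` that share only columns `B` and `D`:

```python
import pandas as pd

# Two small frames that share only columns B and D (illustrative data)
left = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'D': ['D0', 'D1']})
right = pd.DataFrame({'B': ['B2', 'B3'], 'D': ['D2', 'D3'], 'F': ['F2', 'F3']},
                     index=[2, 3])

# Outer join keeps every column and fills the gaps with NaN
outer = pd.concat([left, right], join='outer', sort=False)

# Inner join keeps only the columns present in both frames
inner = pd.concat([left, right], join='inner')

print(list(outer.columns))
print(list(inner.columns))
```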


To keep exactly the index of the original DataFrame, reindex either after or before concatenating:

In [26]:
# Reindex after concatenation
result = pd.concat([df1, df4], axis=1).reindex(df1.index)
result

    A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3


In [27]:
# Reindex before concatenation
result = pd.concat([df1, df4.reindex(df1.index)], axis=1)
result

    A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3


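###### 2.3 Concatenating Using append

Older pandas versions offered `append()` instance methods on Series and DataFrame as a shortcut to `concat()`. These methods predated `concat` and concatenated along axis=0, namely the index. They were deprecated in pandas 1.4 and removed in pandas 2.0, so the cells below use the equivalent `pd.concat` calls.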

In [30]:
# Equivalent of the former df1.append(df2)
result = pd.concat([df1, df2], axis=0)
result

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7


In [31]:
# Equivalent of the former df1.append(df4, sort=False)
result = pd.concat([df1, df4], axis=0, sort=False)
result

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
2  NaN  B2  NaN  D2   F2
3  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7


In [33]:
# Concatenate df2 and df3 (the list passed to concat can hold any number of DataFrames)
result = pd.concat([df2, df3])
result

      A    B    C    D
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
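
Code written for older pandas often called `append` to add a single row, sometimes inside a loop. Since `DataFrame.append` was removed in pandas 2.0, one possible replacement is sketched below. The names `base`, `new_row`, `pieces`, and `combined` are hypothetical and only serve the illustration.

```python
import pandas as pd

base = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})

# Old style (removed in pandas 2.0): base.append(row, ignore_index=True)
# Replacement: wrap the new row in a one-row DataFrame and concatenate
new_row = pd.DataFrame([{'A': 'A2', 'B': 'B2'}])
extended = pd.concat([base, new_row], ignore_index=True)

# When adding many pieces, collect them in a list and call concat once
pieces = [base, new_row, pd.DataFrame([{'A': 'A3', 'B': 'B3'}])]
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Calling `concat` once on a list avoids copying the accumulated data on every iteration, which is what repeated appends did.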


###### 2.4 Ignoring Indexes on the Concatenation Axis

For DataFrame objects that don't have a meaningful index, you may wish to concatenate them and ignore the fact that their indexes may overlap. To do this, use the `ignore_index` argument; the result is renumbered from 0.

In [34]:
# Concatenate with ignore_index=True
result = pd.concat([df1, df4], ignore_index=True, sort=False)
result

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7
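
If overlapping labels should be treated as an error rather than silently renumbered, `pd.concat` also has a `verify_integrity` argument. The sketch below contrasts the two options; the frames `first` and `second` are illustrative and not part of the notebook's data.

```python
import pandas as pd

first = pd.DataFrame({'B': ['B0', 'B1']}, index=[0, 1])
second = pd.DataFrame({'B': ['B2', 'B3']}, index=[1, 2])  # label 1 overlaps

# verify_integrity=True raises ValueError when row labels would be duplicated
try:
    pd.concat([first, second], verify_integrity=True)
    overlap_detected = False
except ValueError as exc:
    overlap_detected = True
    print(f"Overlap detected: {exc}")

# ignore_index=True discards the labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered)
```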
