# Part 13: Advanced Excel and HDF5 Operations in Pandas

In this notebook, we'll explore:
- Working with MultiIndex in Excel files
- Parsing specific columns in Excel
- Parsing dates, cell converters, and dtype specifications
- HDF5 operations and iterators

## Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
from io import StringIO

## 1. Excel Operations with MultiIndex

### 1.1 MultiIndex in Columns

In [None]:
# Create a DataFrame with a two-level MultiIndex on the rows
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},
                 index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']],
                                                 names=['lvl1', 'lvl2']))

# Set MultiIndex for columns
df.columns = pd.MultiIndex.from_product([['a'], ['b', 'd']],
                                       names=['c1', 'c2'])

df

In [None]:
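# Write the frame to an Excel file and read it back (needs an Excel engine such as openpyxl);
# index_col and header say which columns and rows hold the MultiIndex levels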
'''
df.to_excel('path_to_file.xlsx')
df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])
'''
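The `index_col=[0, 1]` and `header=[0, 1]` arguments tell the reader that the first two columns hold the row levels and the first two rows hold the column levels. As a minimal sketch that runs without an Excel file, the snippet below passes the same two arguments to `pd.read_csv` on an in-memory buffer. The CSV stand-in and the names `buffer` and `restored` are assumptions made for illustration; they are not part of the Excel example above.

```python
from io import StringIO

import pandas as pd

# Rebuild the frame from section 1.1: two row levels, two column levels
df = pd.DataFrame(
    [[1, 5], [2, 6], [3, 7], [4, 8]],
    index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']],
                                     names=['lvl1', 'lvl2']),
    columns=pd.MultiIndex.from_product([['a'], ['b', 'd']],
                                       names=['c1', 'c2']),
)

# CSV stand-in for the Excel file (assumption: used only so the sketch runs anywhere)
buffer = StringIO(df.to_csv())

# First two columns -> row levels, first two rows -> column levels
restored = pd.read_csv(buffer, index_col=[0, 1], header=[0, 1])

print(restored.index.names)
print(restored.columns.names)
print(restored)
```

If either argument is left out, the level labels would be read as ordinary data or as a flat header, so it is worth checking `restored.index.nlevels` and `restored.columns.nlevels` after reading.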

## 2. Parsing Specific Columns in Excel

The `usecols` parameter allows you to specify a subset of columns to parse:

In [None]:
# These are code examples - you would need an actual Excel file to run them
'''
# Specify columns by index (list of integers)
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])

# Specify columns as a string with column letters
pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')

# Specify columns by name
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])

# Use a callable function to select columns
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
'''
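When `usecols` is a callable, pandas evaluates it against each column name and keeps the columns for which it returns `True`. The sketch below applies the same predicate to a header row in plain Python, which is a quick way to check what a callable would select before reading the file. The column names and the helper name `keep` are hypothetical.

```python
# Hypothetical header row of a worksheet (assumption for illustration)
header = ['foo', 'bar', 'col2', 'unit_price']


def keep(name):
    # Same test as the lambda above: keep purely alphabetic names
    return name.isalpha()


selected = [name for name in header if keep(name)]
print(selected)
```

Names containing digits or underscores fail `str.isalpha()`, so only `foo` and `bar` pass here.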

## 3. Parsing Dates

Datetime-like values are normally converted to the appropriate dtype automatically when an Excel file is read. If a column instead holds strings that look like dates (cells that are not formatted as dates in Excel), use the `parse_dates` keyword to parse them:

In [None]:
'''
pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])
'''
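To see the effect of `parse_dates` without an Excel file, here is a small sketch that uses `pd.read_csv` on an in-memory buffer, which accepts the same keyword. The CSV stand-in, the sample values, and the names `as_text` and `as_dates` are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the dates are stored as plain text
raw = "date_strings,value\n2024-01-15,10\n2024-02-20,20\n"

# Without parse_dates the column is read as text
as_text = pd.read_csv(StringIO(raw))

# With parse_dates the named column becomes a datetime column
as_dates = pd.read_csv(StringIO(raw), parse_dates=['date_strings'])

print(as_text['date_strings'].dtype)
print(as_dates['date_strings'].dtype)

# Datetime accessors only work on the parsed column
print(as_dates['date_strings'].dt.month.tolist())
```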

## 4. Cell Converters

The `converters` option transforms the contents of Excel cells as they are read. It takes a dict that maps column labels (or integer positions) to functions:

In [None]:
'''
# Convert a column to boolean
pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
'''

In [None]:
'''
# Custom converter function
def cfun(x):
    return int(x) if x else -1

pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
'''
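Because a converter is an ordinary function, it can be checked on a few values before it is handed to `read_excel`. The sketch below runs `cfun` and `bool` on hypothetical cell values; it assumes empty cells arrive as a falsy value, which is what the `if x` test in `cfun` expects.

```python
def cfun(x):
    # Same converter as above: falsy cells become -1
    return int(x) if x else -1


# Hypothetical cell values a converter might receive
cells = ['12', 7.0, '', 0]
converted = [cfun(cell) for cell in cells]
print(converted)

# bool() tests truthiness, so any non-empty string is True
truthiness = [bool('False'), bool(''), bool(0), bool(1)]
print(truthiness)
```

Two things are worth noticing: `cfun` maps a genuine `0` to `-1` because `0` is falsy, and `bool` turns the text `'False'` into `True`. If either matters for your data, write a converter that tests for the exact values you expect.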

## 5. Dtype Specifications

As an alternative to converters, the type for an entire column can be specified using the `dtype` keyword:

In [None]:
'''
pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})
'''
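A common reason to set `dtype` is to stop pandas from inferring numbers where you want text. The sketch below shows this with `pd.read_csv` on an in-memory buffer, which accepts the same `dtype` dict. The CSV stand-in, the sample values, and the names `inferred` and `typed` are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: MyText holds codes with leading zeros
raw = "MyInts,MyText\n1,007\n2,042\n"

# Default inference reads MyText as integers and drops the zeros
inferred = pd.read_csv(StringIO(raw))

# An explicit dtype keeps MyText as text
typed = pd.read_csv(StringIO(raw), dtype={'MyInts': 'int64', 'MyText': str})

print(inferred['MyText'].tolist())
print(typed['MyText'].tolist())
print(typed['MyInts'].dtype)
```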

## 6. HDF5 Operations

### 6.1 Using Iterator with HDF5

You can pass `iterator=True` or `chunksize=number_in_a_chunk` to `select` and `select_as_multiple` to get an iterator over the results instead of a single object. By default, each chunk holds 50,000 rows.

In [None]:
'''
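# Open the store; later cells reuse `store`, so call store.close() when you are finished
store = pd.HDFStore('store.h5')

# Iterate 3 rows at a time over the data stored under the key 'df'
# (chunked reads need the data to have been written in table format)
for chunk in store.select('df', chunksize=3):
    print(chunk)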
'''

You can also iterate with `read_hdf`, which opens the store and closes it automatically once iteration finishes:

In [None]:
'''
for df in pd.read_hdf('store.h5', 'df', chunksize=3):
    print(df)
'''

### 6.2 Creating Equal Sized Chunks with Queries

The `chunksize` keyword applies to the source rows, not to the query result. When you run a query, the table's rows are split into chunks of `chunksize` and the query is applied to each chunk, so the iterator may return chunks of unequal size. Here is a recipe that generates a query and uses it to create equal-sized return chunks:

In [None]:
'''
# Create a sample DataFrame
dfeq = pd.DataFrame({'number': np.arange(1, 11)})
dfeq
'''

In [None]:
'''
# Store the DataFrame in HDF5
store.append('dfeq', dfeq, data_columns=['number'])

# Split a sequence into consecutive pieces of at most n items
def chunks(seq, n):
    return [seq[i:i + n] for i in range(0, len(seq), n)]

# Values to match in the query
evens = [2, 4, 6, 8, 10]

# Row positions of the matches; the query string picks up `evens` from the local scope
coordinates = store.select_as_coordinates('dfeq', 'number=evens')

# Select the matching rows two at a time
for c in chunks(coordinates, 2):
    print(store.select('dfeq', where=c))
'''
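To see why chunking the coordinates evens out the chunk sizes, here is a pandas-only simulation that needs no HDF5 file. It is a sketch of the two strategies, not a call into `HDFStore`: the slicing with `iloc` stands in for reading source rows, and `isin` stands in for the query. The names `source_chunks`, `unequal`, and `equal` are made up for illustration.

```python
import numpy as np
import pandas as pd

dfeq = pd.DataFrame({'number': np.arange(1, 11)})
evens = [2, 4, 6, 8, 10]


def chunks(seq, n):
    # Split a sequence into consecutive pieces of at most n items
    return [seq[i:i + n] for i in range(0, len(seq), n)]


# Strategy 1 (simulates chunksize=3 with a query):
# split the source rows first, then filter each piece
source_chunks = [dfeq.iloc[i:i + 3] for i in range(0, len(dfeq), 3)]
unequal = [len(part[part['number'].isin(evens)]) for part in source_chunks]

# Strategy 2 (simulates the recipe above):
# find the matching row positions first, then chunk those positions
coordinates = np.flatnonzero(dfeq['number'].isin(evens).to_numpy())
equal = [len(dfeq.iloc[c]) for c in chunks(coordinates, 2)]

print(unequal)
print(equal)
```

With the first strategy the number of matches per chunk depends on where the matching rows happen to fall. With the second, every chunk holds exactly two matches except possibly the last, which holds whatever remains.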

### 6.3 Advanced Queries - Select a Single Column

To retrieve a single indexable or data column, use the `select_column` method. It returns a `Series` indexed by row number, which makes it a very quick way to get the index, for example:

In [None]:
'''
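# Select just the index of the table stored under 'df_dc'
# (this key is not created in this notebook and is assumed to exist in the store)
store.select_column('df_dc', 'index')

# Close the store once you are done with it
store.close()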
'''