# Part 14: Advanced HDF5 and Stata Operations in Pandas

In this notebook, we'll explore:
- Advanced HDF5 query operations
- Working with coordinates in HDF5
- Multiple table queries
- Working with Stata files

## Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
from io import StringIO

## 1. Advanced HDF5 Operations

### 1.1 Selecting Coordinates

Sometimes you want the coordinates (the integer row locations) of the rows that match a query. `select_as_coordinates` returns them as an integer `Index` (an `Int64Index` in pandas versions before 2.0). These coordinates can then be passed as the `where` argument of subsequent operations.

In [None]:
'''
# Create a sample DataFrame
df_coord = pd.DataFrame(np.random.randn(1000, 2),
                       index=pd.date_range('20000101', periods=1000))

# Store the DataFrame
store = pd.HDFStore('store.h5')
store.append('df_coord', df_coord)

# Select coordinates where index is greater than a specific date
c = store.select_as_coordinates('df_coord', 'index > 20020101')
c
'''

In [None]:
'''
# Use the coordinates to select data
store.select('df_coord', where=c)
'''

### 1.2 Selecting Using a Where Mask

Sometimes a query is easiest to express by building the list of rows to select yourself. Usually this mask is the index that results from an indexing operation. This example selects the rows of a `DatetimeIndex` whose month is 5 (May).

In [None]:
'''
# Create a sample DataFrame
df_mask = pd.DataFrame(np.random.randn(1000, 2),
                      index=pd.date_range('20000101', periods=1000))

# Store the DataFrame
store.append('df_mask', df_mask)

# Get the index column
c = store.select_column('df_mask', 'index')

# Create a mask for May months
where = c[pd.DatetimeIndex(c).month == 5].index

# Select using the mask
store.select('df_mask', where=where)
'''

### 1.3 Storer Object

If you want to inspect a stored object, retrieve it via `get_storer`. You can use this programmatically, for example to get the number of rows in an object.

In [None]:
'''
# Number of rows in the 'df_coord' table stored above
store.get_storer('df_coord').nrows
'''

### 1.4 Multiple Table Queries

The methods `append_to_multiple` and `select_as_multiple` append to and select from multiple tables at once. The idea is to have one table (the selector table) in which you index most or all of the columns and against which you run your queries. The other tables are data tables whose index matches the selector table's index. You can then run a very fast query on the selector table yet get lots of data back.
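
The idea above can be sketched as follows. This is a minimal illustration, assuming PyTables is installed; the frame name `df_mt`, the table names `df1_mt` and `df2_mt`, and the temporary file path are chosen here only for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample frame with a fixed seed so the result is reproducible.
rng = np.random.default_rng(0)
df_mt = pd.DataFrame(
    rng.standard_normal((8, 6)),
    index=pd.date_range("2000-01-01", periods=8),
    columns=["A", "B", "C", "D", "E", "F"],
)

path = os.path.join(tempfile.mkdtemp(), "multi.h5")
with pd.HDFStore(path) as store_mt:
    # Columns A and B go to the selector table 'df1_mt';
    # None sends all remaining columns to the data table 'df2_mt'.
    store_mt.append_to_multiple(
        {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
    )

    # The query runs against the selector table only; matching rows are
    # then pulled from both tables and combined into one frame.
    result = store_mt.select_as_multiple(
        ["df1_mt", "df2_mt"], where=["A > 0", "B > 0"], selector="df1_mt"
    )

print(result)
```

Only the selector table's columns need to be queryable, which keeps the indexed table small even when the data table is wide.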

## 2. Working with Stata Files

### 2.1 Reading Stata Files

Pandas provides the `read_stata` function to read Stata data files (.dta).

In [None]:
'''
# Read a Stata file
df = pd.read_stata('stata.dta')
df
'''

### 2.2 Reading Stata Files in Chunks

Specifying a `chunksize` yields a `StataReader` instance that reads `chunksize` rows from the file at a time. The `StataReader` object can be used as an iterator.

In [None]:
'''
# Read a Stata file in chunks
reader = pd.read_stata('stata.dta', chunksize=3)

for df in reader:
    print(df.shape)
'''

For more fine-grained control, use `iterator=True` and pass the number of rows to read to each call to `read()`.

In [None]:
'''
reader = pd.read_stata('stata.dta', iterator=True)

chunk1 = reader.read(5)
chunk2 = reader.read(5)
'''
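
The two reading styles above can be tried without an existing file. The sketch below first writes a small `.dta` file with `to_stata`; the sample data, the column names `x` and `y`, and the temporary path are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample data written to a temporary .dta file.
df_src = pd.DataFrame(
    {"x": np.arange(10, dtype="int32"), "y": np.linspace(0.0, 1.0, 10)}
)
path = os.path.join(tempfile.mkdtemp(), "chunks.dta")
df_src.to_stata(path, write_index=False)

# Iterate in fixed-size chunks; with 10 rows the last chunk is shorter.
with pd.read_stata(path, chunksize=4) as reader:
    chunk_shapes = [chunk.shape for chunk in reader]

# Choose the number of rows on each call instead.
with pd.read_stata(path, iterator=True) as reader:
    first = reader.read(3)
    second = reader.read(3)

print(chunk_shapes)
print(first["x"].tolist(), second["x"].tolist())
```

Using the reader in a `with` block ensures the underlying file handle is closed once reading is finished.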

### 2.3 Handling Categorical Data in Stata Files

Categorical data can be exported to Stata data files as value-labeled data. The exported data consists of the underlying category codes as integer data values and the categories as value labels. Stata does not have an explicit equivalent to a `Categorical`, and information about whether the variable is ordered is lost when exporting.

Labeled data can similarly be imported from Stata data files as Categorical variables using the keyword argument `convert_categoricals` (True by default). The keyword argument `order_categoricals` (True by default) determines whether imported Categorical variables are ordered.

In [None]:
'''
# Import with categorical data
df_cat = pd.read_stata('stata.dta', convert_categoricals=True, order_categoricals=True)

# Import without converting to categorical
df_no_cat = pd.read_stata('stata.dta', convert_categoricals=False)
'''
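
A round trip makes the export and import behavior concrete. This is a minimal sketch; the column names `grade` and `score`, the sample values, and the temporary file are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with one ordered categorical column.
df_cat_src = pd.DataFrame(
    {
        "grade": pd.Categorical(
            ["low", "high", "mid", "low"],
            categories=["low", "mid", "high"],
            ordered=True,
        ),
        "score": [1.0, 3.0, 2.0, 1.5],
    }
)
path = os.path.join(tempfile.mkdtemp(), "categorical.dta")
df_cat_src.to_stata(path, write_index=False)

# Default import: value labels become categories.
as_cat = pd.read_stata(path)

# Same import, but the resulting Categorical is left unordered.
as_unordered = pd.read_stata(path, order_categoricals=False)

# Skip the conversion and keep the underlying integer codes.
as_codes = pd.read_stata(path, convert_categoricals=False)

print(as_cat["grade"].dtype)
print(as_codes["grade"].tolist())
```

Because Stata does not store the ordering, whether the imported column is ordered depends on `order_categoricals` at read time, not on the original column.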

### 2.4 Handling Missing Values in Stata Files

The parameter `convert_missing` indicates whether missing value representations in Stata should be preserved. If False (the default), missing values are represented as np.nan. If True, missing values are represented using StataMissingValue objects, and columns containing missing values will have object data type.

In [None]:
'''
# Import with default missing value handling (convert to np.nan)
df_default = pd.read_stata('stata.dta')

# Import preserving Stata missing value representation
df_missing = pd.read_stata('stata.dta', convert_missing=True)
'''
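
The difference between the two settings can be seen with a small self-contained file. In this sketch the column name `v`, the sample values, and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical column with one missing value, written to a temporary file.
df_src = pd.DataFrame({"v": [1.0, np.nan, 3.0]})
path = os.path.join(tempfile.mkdtemp(), "missing.dta")
df_src.to_stata(path, write_index=False)

# Default: Stata missing values are read back as np.nan.
df_default = pd.read_stata(path)

# convert_missing=True: missing entries become StataMissingValue objects,
# so the column is stored with object dtype.
df_missing = pd.read_stata(path, convert_missing=True)

print(df_default["v"].dtype, df_missing["v"].dtype)
print(repr(df_missing["v"].iloc[1]))
```

Preserving the Stata representation is mainly useful when a file distinguishes several kinds of missing value, since `np.nan` collapses them into one.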

### 2.5 Data Type Preservation

Setting `preserve_dtypes=False` will upcast to the standard pandas data types: int64 for all integer types and float64 for floating point data. By default, the Stata data types are preserved when importing.

In [None]:
'''
# Import preserving Stata data types
df_preserve = pd.read_stata('stata.dta')

# Import with standard pandas data types
df_standard = pd.read_stata('stata.dta', preserve_dtypes=False)
'''
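
The effect of `preserve_dtypes` can be checked by comparing the dtypes after each import. This sketch writes its own file; the column names `small` and `ratio` and the temporary path are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame using narrow dtypes that Stata can store directly.
df_src = pd.DataFrame(
    {
        "small": np.array([1, 2, 3], dtype="int8"),
        "ratio": np.array([0.5, 1.5, 2.5], dtype="float32"),
    }
)
path = os.path.join(tempfile.mkdtemp(), "dtypes.dta")
df_src.to_stata(path, write_index=False)

# Default: the narrow Stata types are kept.
preserved = pd.read_stata(path)

# preserve_dtypes=False: upcast to int64 and float64.
upcast = pd.read_stata(path, preserve_dtypes=False)

print(preserved.dtypes)
print(upcast.dtypes)
```

Keeping the narrow types saves memory on large files, while upcasting avoids surprises from small integer ranges in later arithmetic.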