# Pandas Tutorial - Part 35

This notebook covers:
- Performance considerations with pandas.eval()
- Scaling to large datasets
- Handling NaN, Integer NA values, and NA type promotions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

%matplotlib inline

## Performance Considerations with pandas.eval()

Continuing from Part 34, let's explore more performance considerations with pandas.eval().

### When to Use eval()

Operations on small objects (up to roughly 15k-20k rows) are often faster in plain Python than with `eval()`, because parsing and dispatching the expression adds a fixed overhead. The performance benefits of `eval()` only show up with larger datasets.

In [None]:
# Create small DataFrames
small_df1 = pd.DataFrame(np.random.randn(1000, 3))
small_df2 = pd.DataFrame(np.random.randn(1000, 3))

# Create large DataFrames
large_df1 = pd.DataFrame(np.random.randn(50000, 3))
large_df2 = pd.DataFrame(np.random.randn(50000, 3))

In [None]:
# Compare performance for small DataFrames
# perf_counter() is a monotonic, high-resolution clock intended for timing
start = time.perf_counter()
result1 = small_df1 + small_df2
end = time.perf_counter()
print(f"Small DataFrame - Regular Python: {end - start:.6f} seconds")

start = time.perf_counter()
result2 = pd.eval('small_df1 + small_df2')
end = time.perf_counter()
print(f"Small DataFrame - eval(): {end - start:.6f} seconds")

In [None]:
# Compare performance for large DataFrames
# perf_counter() is a monotonic, high-resolution clock intended for timing
start = time.perf_counter()
result3 = large_df1 + large_df2
end = time.perf_counter()
print(f"Large DataFrame - Regular Python: {end - start:.6f} seconds")

start = time.perf_counter()
result4 = pd.eval('large_df1 + large_df2')
end = time.perf_counter()
print(f"Large DataFrame - eval(): {end - start:.6f} seconds")
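A single timed run is easily distorted by caching and other processes. The following is a minimal sketch that repeats each operation with `timeit` and keeps the best time. The names `a`, `b`, and `names` are illustrative, and the operands are passed through `local_dict` so that name lookup does not depend on the calling frame. Actual timings depend on the machine and on whether numexpr is installed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.DataFrame(rng.standard_normal((50_000, 3)))
b = pd.DataFrame(rng.standard_normal((50_000, 3)))

# Pass the operands explicitly instead of relying on frame inspection
names = {"a": a, "b": b}
calls = 10

# Best of three repeats, each repeat timing `calls` evaluations
plain_time = min(timeit.repeat(lambda: a + b, number=calls, repeat=3))
eval_time = min(
    timeit.repeat(lambda: pd.eval("a + b", local_dict=names), number=calls, repeat=3)
)

print(f"plain : {plain_time / calls:.6f} s per call")
print(f"eval(): {eval_time / calls:.6f} s per call")

# Both routes should produce the same values
same = np.allclose((a + b).to_numpy(), pd.eval("a + b", local_dict=names).to_numpy())
print(f"results match: {same}")
```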

### Technical Details Regarding Expression Evaluation

There are some technical details to be aware of when using `eval()`:

1. Expressions that would result in an object dtype or involve datetime operations (because of NaT) must be evaluated in Python space.
2. String comparisons must be evaluated in Python space.
3. The numeric part of a comparison (e.g., `nums == 1`) will be evaluated by numexpr.

In [None]:
# Create a DataFrame with strings and numbers
df = pd.DataFrame({
    'strings': np.repeat(list('cba'), 3),
    'nums': np.repeat(range(3), 3)
})
df

In [None]:
# Query with a string comparison and a numeric comparison
df.query('strings == "a" and nums == 1')

In the above example, the string comparison (`strings == "a"`) is evaluated in Python space, while the numeric comparison (`nums == 1`) is evaluated by numexpr.
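As a rough check, the same filter can be written as a boolean mask and compared with `query()` under the default engine and under `engine="python"`. This is a sketch: it rebuilds the small frame from the example, and the variable names are illustrative. In this particular frame no row has `strings == "a"` together with `nums == 1`, so the original query returns an empty result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "strings": np.repeat(list("cba"), 3),
    "nums": np.repeat(np.arange(3), 3),
})

expr = 'strings == "a" and nums == 1'

by_query = df.query(expr)                    # default engine
by_python = df.query(expr, engine="python")  # force evaluation in Python space
by_mask = df[(df["strings"] == "a") & (df["nums"] == 1)]

# All three approaches select the same (empty) set of rows
print(len(by_query), len(by_python), len(by_mask))

# A combination that does occur in the frame
matches = df.query('strings == "b" and nums == 1')
print(matches)
```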

## Scaling to Large Datasets

Pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Here are some recommendations for scaling your analysis to larger datasets.

### Load Less Data

When working with large datasets, it's often beneficial to load only the columns you need rather than the entire dataset.

In [None]:
# Example of loading specific columns from a CSV file
# This is just an example - adjust the file path and column names as needed
# df = pd.read_csv('large_file.csv', usecols=['timestamp', 'id', 'value'])
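To make the idea concrete without a real file, here is a small sketch that reads from an in-memory CSV. The data and the extra `comment` column are made up for illustration. `usecols` drops unwanted columns at parse time, and `dtype` can additionally request narrower numeric types.

```python
import io

import pandas as pd

# Hypothetical data standing in for a large file on disk
csv_text = (
    "timestamp,id,value,comment\n"
    "2024-01-01,1,0.5,first\n"
    "2024-01-02,2,1.5,second\n"
    "2024-01-03,3,2.5,third\n"
)

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["timestamp", "id", "value"],       # the comment column is never loaded
    dtype={"id": "int32", "value": "float32"},  # narrower types use less memory
    parse_dates=["timestamp"],
)

print(subset.dtypes)
print(subset.memory_usage(deep=True))
```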

### Use Chunking

For very large files, you can process the data in chunks rather than loading it all at once.

In [None]:
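# Example of processing a large CSV file in chunks
# This is just an example - adjust the file path and column name as needed
"""
chunk_size = 10000
sums = []
counts = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Keep per-group sums and counts for each chunk; averaging the
    # per-chunk means would be wrong when group sizes differ between chunks
    grouped = chunk.groupby('column_name')
    sums.append(grouped.sum(numeric_only=True))
    counts.append(grouped.count())

# Combine the partial results, then divide once to get the overall mean
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
final_result = total_sum / total_count[total_sum.columns]
"""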
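The following is a runnable sketch of chunked aggregation that uses an in-memory CSV in place of a real file. The column names `key` and `value` and the chunk size are assumptions for illustration. It accumulates per-group sums and counts and divides once at the end, so the combined mean agrees with the mean computed on the full data even when groups are spread unevenly across chunks.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data standing in for a large CSV file
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "key": rng.choice(list("abc"), size=1_000),
    "value": rng.standard_normal(1_000),
})
csv_text = frame.to_csv(index=False)

sums, counts = [], []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=150):
    grouped = chunk.groupby("key")["value"]
    sums.append(grouped.sum())
    counts.append(grouped.count())

# Combine the partial aggregates, then divide once at the end
total_sum = pd.concat(sums).groupby(level=0).sum()
total_count = pd.concat(counts).groupby(level=0).sum()
chunked_mean = total_sum / total_count

# Reference: the same statistic computed with everything in memory
full_mean = pd.read_csv(io.StringIO(csv_text)).groupby("key")["value"].mean()
print(np.allclose(chunked_mean, full_mean))
```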

## NaN, Integer NA Values, and NA Type Promotions

Pandas has a specific way of handling missing values (NA values) which is important to understand.

### Choice of NA Representation

Pandas uses the special value NaN (Not-A-Number) as the NA value for floating-point data. Datetime-like data uses NaT instead, and object columns may also hold `None`. The API functions `isna` and `notna` detect NA values across all of these dtypes.

In [None]:
# Create a Series with some NaN values
s = pd.Series([1, 2, np.nan, 4, 5])
s

In [None]:
# Detect NA values
pd.isna(s)

In [None]:
# Detect non-NA values
pd.notna(s)
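As a small illustration that `isna` works across dtypes, the sketch below builds one Series each of float, datetime, and object dtype, each with a single missing entry. The variable names are illustrative. It also shows why an equality test cannot stand in for `isna`: NaN compares unequal to itself.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, np.nan])
dates = pd.Series(pd.to_datetime(["2024-01-01", None]))  # missing entry becomes NaT
objects = pd.Series(["a", None], dtype=object)

for name, series in [("float", floats), ("datetime", dates), ("object", objects)]:
    print(name, series.dtype, series.isna().tolist())

# NaN is not equal to itself, so == cannot be used to find missing values
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)
```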

### Support for Integer NA

One limitation of using NaN as the NA value is the inability to represent NAs in integer arrays. When you introduce NAs into an integer Series, it gets converted to float64.

In [None]:
# Create an integer Series
s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
s

In [None]:
# Check the dtype
s.dtype

In [None]:
# Reindex to introduce NAs
s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
s2

In [None]:
# Check the dtype after reindexing
s2.dtype

### Nullable Integer Data Types

If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas, which store missing entries as `pd.NA` instead of NaN:
- Int8Dtype
- Int16Dtype
- Int32Dtype
- Int64Dtype

In [None]:
# Create an integer Series with a nullable integer dtype
s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'), dtype=pd.Int64Dtype())
s_int

In [None]:
# Check the dtype
s_int.dtype

In [None]:
# Reindex to introduce NAs
s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
s2_int

In [None]:
# Check the dtype after reindexing
s2_int.dtype
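The nullable dtypes can also be requested by their string alias, such as `"Int64"` (note the capital `I`). The sketch below, with illustrative variable names, shows a missing value stored as `pd.NA`, how it propagates through arithmetic, and how a float Series that was promoted because of NaN can be converted back to a nullable integer dtype.

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype="Int64")
print(s.dtype)
print(s.isna().tolist())

# Arithmetic keeps the integer dtype; the missing entry stays missing
doubled = s * 2
print(doubled)

# Reductions skip missing values by default
total = s.sum()
print(total)

# A float64 Series holding whole numbers and NaN can be converted back
as_float = pd.Series([1, None, 3])
converted = as_float.astype("Int64")
print(as_float.dtype, "->", converted.dtype)
```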

### NA Type Promotions

When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:

| Typeclass | Promotion dtype for storing NAs |
|-----------|--------------------------------|
| floating  | no change                      |
| object    | no change                      |
| integer   | cast to float64                |
| boolean   | cast to object                 |

In [None]:
# Example of boolean type promotion
bool_series = pd.Series([True, False, True], index=['a', 'b', 'c'])
print(f"Original dtype: {bool_series.dtype}")

# Reindex to introduce NAs
bool_series2 = bool_series.reindex(['a', 'b', 'c', 'd'])
print(f"Reindexed dtype: {bool_series2.dtype}")
bool_series2
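The rows of the promotion table can be checked in one pass. This is a sketch with made-up two-element Series: each one is reindexed with an extra label so that a missing value is introduced, and the dtype before and after is recorded.

```python
import pandas as pd

index = ["a", "b"]
examples = {
    "floating": pd.Series([1.5, 2.5], index=index),
    "object": pd.Series([1, "x"], index=index, dtype=object),
    "integer": pd.Series([1, 2], index=index, dtype="int64"),
    "boolean": pd.Series([True, False], index=index),
}

promoted = {}
for typeclass, series in examples.items():
    # The label "c" does not exist, so reindexing introduces one NA
    widened = series.reindex(["a", "b", "c"])
    promoted[typeclass] = (str(series.dtype), str(widened.dtype))
    print(f"{typeclass:<9} {series.dtype} -> {widened.dtype}")
```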

## Conclusion

In this notebook, we've explored:

1. Performance considerations with pandas.eval(), including when to use it and technical details about expression evaluation

2. Strategies for scaling to large datasets, such as loading less data and using chunking

3. Handling of NaN, Integer NA values, and NA type promotions in pandas, including:
   - The choice of NA representation
   - Support for integer NA values using nullable integer data types
   - Type promotions that occur when introducing NAs into different data types

These concepts are important for efficient data manipulation and analysis with pandas, especially when working with large datasets or dealing with missing values.