# StaticFrame from the Ground Up: Getting Started with Immutable DataFrames
### Christopher Ariza

Back in 2017 I found myself frequently asking: "is Pandas a suitable foundation for production library code?" While Pandas is well-known for its utility in data science, I consistently found its flexibility a detriment in building library code for production systems.

This led me to create StaticFrame, an alternative dataframe library built on an immutable data model. After years of development and use, I am confident that StaticFrame reduces opportunities for error and leads to more maintainable code. While not yet always more efficient than Pandas, in some areas StaticFrame offers very significant improvements in run time and memory usage. Beyond common functionality, StaticFrame offers a more explicit and consistent API, novel multi-Frame containers and processors, and support for high-performance serialization through the NPZ format.

This notebook is designed to provide a rapid, breadth-first survey of StaticFrame. Where StaticFrame relates to or differs from Pandas, the relationship is highlighted.

# What is a DataFrame?
* A 2D table with labelled rows and columns
    * Labels stay with data after selection
    * Operations align on labels
    * Can reindex based on labels
* Distinct from a simple 2D array
    * Labels can be any (hashable) type
    * Support for heterogeneous column types
* Just like a 2D array, supports binary operators and broadcasting
    * Can multiply a dataframe by constant, 1D, or 2D container
    * All operations align on labels, not order
* A high-level language (Python) can be used to implement dataframe functionality over a high-performance, low-level array library (NumPy)

# A Brief History of DataFrames

* 1991: earliest implementation of a dataframe in the S language
* 2009: Pandas 0.1 released
* 2018: StaticFrame 0.1 released
* There are presently a number of dataframe libraries in Python and other languages


# Why Not Just Use Pandas?

* Pandas prioritizes ease of use over explicit, strict interfaces
* Pandas API has many inconsistencies
* Many Pandas interfaces have non-orthogonal parameters
* Pandas supports in-place mutation
* Pandas only optionally supports unique indices (`verify_integrity` defaults to `False`)
* Pandas does not support all NumPy types (Unicode, `datetime64`)
* Pandas abandoned multi-frame containers (i.e., removing the `pd.Panel`)

* See also: https://dev.to/flexatone/ten-reasons-to-use-staticframe-instead-of-pandas-4aad

# Learning StaticFrame from Pandas

* Nearly everything you can do with Pandas you can do with StaticFrame
* Things Pandas does that StaticFrame does not
    * No internal graphing / plotting support
    * Few internal implementations of calculations available elsewhere (NumPy, SciPy)
* Much of what you already know will directly translate
    * Many interfaces and methods are identical
    * StaticFrame has more numerous, narrower interfaces with keyword-only arguments
    * StaticFrame follows hierarchical naming
* You can go back and forth
    * `Frame.to_pandas()`
    * `Frame.from_pandas()`

# Learning StaticFrame from Examples
* Examples used here are intentionally compact
* Examples mostly on `sf.Frame`
* Interfaces on `sf.Series` are often identical

# StaticFrame Development

* Development
    * Code contributions from a small pool of developers
    * Feature and design contributions from multiple internal teams
    * New contributors are welcome!
* Releases
    * Regular releases via PIP
    * API is stable within a minor release series (e.g., 0.9 will introduce backward incompatibilities relative to 0.8)
* Quality & Test
    * 100% test coverage
    * Robust CI/CD with MyPy, Pylint, and multiplatform test
* Documentation
    * Fully code-generated API documentation (https://static-frame.readthedocs.io)
    * Every object exposes API via `interface` attribute
* Core Dependencies
    * NumPy
    * Team-maintained CPython extension libraries: `automap`, `arraykit`
* When will there be a 1.0?
    * Pending `arraykit` implementation of delimited file readers to fix known issues
    * Maybe by end of 2022


# Installing & Importing

* Available via pip, conda-forge
* `import static_frame as sf`


In [2]:
import static_frame as sf
import numpy as np

# The Frame & the Series
* A `Series` is a 1D array (of a single dtype) with labels 
* A `Frame` is a 2D container (of one or more columnar dtypes) with row and column labels
* When extracting a row or column from a `Frame`, we get a `Series`.
* Support for higher-dimensional data
    * Use hierarchical indices on a 2D container
    * Use multi-`Frame` containers (i.e., the `Bus`)

# Anatomy of a Frame

* A `sf.Frame` wraps 1D and 2D NumPy arrays
* NumPy dtypes are unified by column
* Each axis is labelled with an `sf.Index` (or subclass)
    * Row labels via `sf.Frame.index`
    * Column labels via `sf.Frame.columns`
* Hashable metadata via `name` attributes on all containers
    * `sf.Frame.name` (StaticFrame only)
    * `sf.Frame.index.name`
    * `sf.Frame.columns.name`

# Getting Data In & Out: Constructors & Exporters

* Constructors always live on containers (i.e., `sf.Frame`)
    * `pd.read_csv()`, `pd.DataFrame.from_records()`
    * `sf.Frame.from_csv()`, `sf.Frame.from_records()`
* Explicit constructors with narrow functionality
    * `pd.DataFrame()` supports a single element, or a column of elements
    * `sf.Frame.from_element()`, `sf.Frame.from_elements()`
* Support for common serialization formats
    * `pd.read_excel()`, `pd.read_csv()`, `pd.read_parquet()`
    * `sf.Frame.from_xlsx()`, `sf.Frame.from_csv()`, `sf.Frame.from_parquet()`
* Serialization methods exclusive to StaticFrame
    * NPZ and NPY formats faster than parquet with comparable file sizes
    * Encodes all `sf.Frame` characteristics
    * NPY supports memory mapping out-of-core data
    * `sf.Frame.to_npz()`, `sf.Frame.from_npz()`

In [9]:
# Creating a Frame from row iterables
f = sf.Frame.from_records(((True, 20, '1954-11-02'), (False, 30, '2020-04-28')))
# Force a string representation 
print(str(f))

<Frame>
<Index> 0      1       2          <int64>
<Index>
0       True   20      1954-11-02
1       False  30      2020-04-28
<int64> <bool> <int64> <<U10>


# String Representations

* `sf.Frame.__repr__()` provides more information than `pd.DataFrame.__repr__()`
* Shows types of `Frame`, `.index`, and `.columns`
* Shows NumPy dtypes of each column, `.index`, and `.columns`
* In terminal environments can use colors for types, dtypes

In [4]:
# Creating a Frame with Frame subclass, Index subclasses, name attributes
f = sf.FrameGO.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')), 
        index=sf.IndexYear(('1954', '2020'), name='year'),
        columns=('A', 'B', 'C'),
        name='records', 
        )
print(str(f))

<FrameGO: records>
<IndexGO>          A      B       C       <<U1>
<IndexYear: year>
1954               True   20      1954-11
2020               False  30      2020-04
<datetime64[Y]>    <bool> <int64> <<U7>


# Representation in Jupyter Notebooks

* An HTML table representation
* Name attributes, types, and dtype information are hidden by default

In [4]:
f1 = sf.Frame.from_records(((True, 20, '1954-11-02'), (False, 30, '2020-04-28')), 
                            index=tuple('xy'), columns=tuple('ABC'))
f1

Unnamed: 0,A,B,C
x,True,20,1954-11-02
y,False,30,2020-04-28


# Finding All Constructors

* Every SF container has an `.interface` attribute
* `.interface` returns a `sf.Frame` of the complete interface
* The same representation is used to populate API overview: https://static-frame.readthedocs.io/en/latest/api_overview/frame.html


In [6]:
# Using the interface attribute to show the signature of all constructors
f = sf.Frame.interface
f.loc[f['group'] == 'Constructor'].head()

Unnamed: 0,cls_name,group,doc
"__init__(data, *, index, columns, ...)",Frame,Constructor,Initializer. Args: data: Default Frame initialization requires typed data such a...
"from_arrow(value, *, index_depth, index_name_depth_level, ...)",Frame,Constructor,Realize a Frame from an Arrow Table. Args: value: A pyarrow.Table instance. inde...
"from_clipboard(*, delimiter, index_depth, index_column_first, ...)",Frame,Constructor,Create a Frame from the contents of the clipboard (assuming a table is stored as...
"from_concat(frames, *, axis, union, ...)",Frame,Constructor,Concatenate multiple Frames into a new Frame. If index or columns are provided a...
"from_concat_items(items, *, axis, union, ...)",Frame,Constructor,"Produce a Frame with a hierarchical index from an iterable of pairs of labels, F..."


# Constructors Are Class Methods
* Pandas places some constructors on the `pd` name space
* All StaticFrame constructors are class methods on classes
* Creating a Frame from concatenation 
    * Pandas: `pd.concat()`
    * StaticFrame: `sf.Frame.from_concat()`, `sf.Frame.from_concat_items()`
* Creating a Frame from other Frames by overlaying on missing values
    * Pandas: `pd.DataFrame.combine_first()` # instance method for combining with one other Frame
    * StaticFrame: `sf.Frame.from_overlay()` # class method for combining one or more Frames

# Selection
* StaticFrame exposes all types of NumPy and Pandas-style selection routines
* StaticFrame interfaces are more narrow than Pandas
* Selection interfaces
    * `loc[]`: use labels
    * `iloc[]`: use integer position (from zero)
    * `bloc[]`: use Boolean indicator (StaticFrame only)
* NumPy-style selection values 
    * A single label (a tuple is a single label)
    * A list of labels (must be a list to distinguish from a tuple label)
    * A slice of labels
    * A 1D Boolean array selecting labels


# Selection Interfaces on `sf.Frame`
    
* `[]`: root `__getitem__` selection 
    * `pd.DataFrame[]` selects by column labels, or row and column labels, or by 2D Boolean array
    * `sf.Frame[]` is exclusively column selection
* `loc[]`: select rows, optionally columns, by label (same as Pandas)
* `iloc[]`: select rows, optionally columns, by integer position (same as Pandas)
* `bloc[]`: select with a 2D Boolean array (StaticFrame only)

In [7]:
f1 = sf.Frame.from_records(((True, 20, '1954-11-02'), (False, 30, '2020-04-28')), 
                            index=tuple('xy'), columns=tuple('ABC'))
display(f1)
f1['B'] # Select a column with a single label

Unnamed: 0,A,B,C
x,True,20,1954-11-02
y,False,30,2020-04-28


0,1
x,20
y,30


In [8]:
display(f1.columns == 'C')
# Select columns with a Boolean indicator
f1[f1.columns == 'C'] 

array([False, False,  True])

Unnamed: 0,C
x,1954-11-02
y,2020-04-28


In [9]:
f1.loc['y':, ['A', 'C']] # Select a row with a slice and list of labels

Unnamed: 0,A,C
y,False,2020-04-28


In [10]:
f1.iloc[-1, -1] # Select an element with integer positions

'2020-04-28'

In [11]:
f1.bloc[f1.isin([30, '2020-04-28'])] # Selecting non-contiguous values

0,1
"('y', 'B')",30
"('y', 'C')",2020-04-28


# Mixing `loc` and `iloc` Selection

* `sf.ILoc` (StaticFrame only) permits embedding `iloc` selection in a `loc` selection
* `sf.HLoc` (similar to `pd.IndexSlice`) permits embedding hierarchical selection in `loc` selection

In [12]:
display(f1)
f1.loc[sf.ILoc[-1], ['A', 'C']] # Get the last row, columns A and C

Unnamed: 0,A,B,C
x,True,20,1954-11-02
y,False,30,2020-04-28


0,1
A,False
C,2020-04-28


# Handling Missing Values
* Missing values are `None` and `np.nan` (same as Pandas)
* Boolean indicators (same as Pandas)
    * `sf.Frame.isna()`
    * `sf.Frame.notna()`
* Replacing missing values with new containers (same as Pandas)
    * `sf.Frame.dropna()`
    * `sf.Frame.fillna()`

# Handling Falsy Values
* Sometimes we want to treat `0` or `''` or `()` as missing
* Functions corresponding to `*na` functions (StaticFrame only)
    * `sf.Frame.isfalsy()`
    * `sf.Frame.notfalsy()`
    * `sf.Frame.dropfalsy()`
    * `sf.Frame.fillfalsy()`

# Fill Missing Values Along an Axis
* Fill missing values with the nearest preceding or following non-missing observation, up to the `limit` parameter.
    * Related functionality provided in `pd.DataFrame.fillna()`
    * `sf.Frame.fillna_forward()`
    * `sf.Frame.fillna_backward()`
* Fill the leading or trailing missing values with a provided value
    * StaticFrame only
    * `sf.Frame.fillna_leading()`
    * `sf.Frame.fillna_trailing()`

# Fill Falsy Values Along an Axis
* StaticFrame only
* Fill falsy values with the nearest preceding or following non-falsy observation, up to the `limit` parameter.
    * `sf.Frame.fillfalsy_forward()`
    * `sf.Frame.fillfalsy_backward()`
* Fill the leading or trailing falsy values with a provided value
    * `sf.Frame.fillfalsy_leading()`
    * `sf.Frame.fillfalsy_trailing()`

# Immutability and "No-Copy" Operations
* Immutability reduces opportunities for errors 
* NumPy provides no-copy "views" of array data when possible
* With immutable arrays, we can pass around views without defensive copies
* Examples:
    * Renaming an `sf.Frame` is no-copy
    * Relabelling `index` or `columns` does not copy underlying arrays
    * Horizontal concatenation of same-index components is no-copy
* Pandas support for mutation, combined with NumPy views, leads to the commonly observed Pandas `SettingWithCopyWarning`

# Assignment with Immutable Frames
* Pandas permits in-place assignment and mutation for all types of selections
    * `pd.DataFrame.loc['x', 'B':] = 1.0`
* StaticFrame offers an `assign` interface that defines a selection that is then called with a value to assign
* The value to assign can be an element or labelled data (`sf.Series`, `sf.Frame`)
* `sf.Frame.assign.loc['x', 'B':](1.0)`
    * Returns a new container
    * Unchanged columns will be views and re-used (no-copy)

In [13]:
# Assigning a value to a slice in a single row
f1.assign.loc['x', 'B':](-1)

Unnamed: 0,A,B,C
x,True,-1,-1
y,False,30,2020-04-28


In [14]:
# Assigning a Series to a column, matching on label
f1.assign['B'](sf.Series(('y', 'x'), index=('y', 'x')))

Unnamed: 0,A,B,C
x,True,x,1954-11-02
y,False,y,2020-04-28


# Grow-Only Mutation
* Pandas permits growing a DataFrame by columns (efficient) and rows (very inefficient)
* The `sf.FrameGO` permits grow-only column addition or whole-frame extension
* While the container is mutable, underlying array data always remains immutable
    * Going from an `sf.Frame` to an `sf.FrameGO` is a no-copy operation
    * Often used within a narrow scope
* Growing rows is never permitted (use `sf.Frame.from_concat()` with collected rows)

In [13]:
# Adding a column to a FrameGO
f2 = f1.to_frame_go()
f2['D'] = (34, 87)
f2

Unnamed: 0,A,B,C,D
x,True,20,1954-11-02,34
y,False,30,2020-04-28,87


In [14]:
# Extending a FrameGO with another Frame
# On aligned indices this is a no-copy operation
f3 = (f1[['A', 'B']] * 100).relabel(columns=lambda l: l.lower())
f2.extend(f3)
f2

Unnamed: 0,A,B,C,D,a,b
x,True,20,1954-11-02,34,100,2000
y,False,30,2020-04-28,87,0,3000


# A Family of `sf.Frame`

* Pandas has only one `DataFrame` class
* StaticFrame has a family
    * `sf.Frame`
    * `sf.FrameGO`: a grow-only `sf.Frame`
    * `sf.FrameHE`: a hashable `sf.Frame`
        * HE for `__hash__` and `__eq__`, the methods implemented to support hashability
        * Some hashing scenarios may require a full value comparison for lookup
* Methods exist to easily convert between all three (always a no-copy operation)
    * `sf.Frame.to_frame_go()`
    * `sf.Frame.to_frame_he()`
    * `sf.FrameGO.to_frame()`
    * `sf.FrameGO.to_frame_he()`
    * `sf.FrameHE.to_frame()`
    * `sf.FrameHE.to_frame_go()`


In [15]:
# A Frame as a key in a dictionary
f = sf.Frame(np.arange(4).reshape(2, 2)).to_frame_he()
d = {f: True} 
f in d

True

# Changing Columnar dtypes

* `sf.Frame.astype()` can be used to retype an entire Frame (same as Pandas)
* Can use column selection to isolate targets
    * Similar to `sf.Frame.assign` interface
    * For a Frame `f`: `f.astype[f.columns.via_str.startswith('--')](int)`
* Changing types will be no-copy for unaffected columns

In [18]:
f1.astype[['A', 'B']](float)

Unnamed: 0,A,B,C
x,1.0,20.0,1954-11-02
y,0.0,30.0,2020-04-28


# Full Support for All NumPy dtypes
* NumPy is the foundation of StaticFrame and Pandas
* Pandas only uses a subset of NumPy dtypes; StaticFrame supports all
* NumPy's fixed-size Unicode arrays
    * Optimal when elements are diverse and of similar size
    * Pandas always converts these to object arrays of Python strings
* NumPy's `datetime64` type
    * Fast datetime representation with units for resolution (from year to attosecond)
    * Pandas coerces any `datetime64` to nanosecond units
    * StaticFrame permits using year, date, or any `datetime64` unit
    * See also: https://www.youtube.com/watch?v=jdnr7sgxCQI

In [16]:
# By default, StaticFrame always shows all types and dtypes
print(str(f1))
# Can get a Series of dtypes labelled by column
f1.dtypes

<Frame>
<Index> A      B       C          <<U1>
<Index>
x       True   20      1954-11-02
y       False  30      2020-04-28
<<U1>   <bool> <int64> <<U10>


0,1
A,bool
B,int64
C,<U10


In [20]:
# Can convert Unicode dtypes to Python string object
print(str(f1.astype['C'](object)))

<Frame>
<Index> A      B       C          <<U1>
<Index>
x       True   20      1954-11-02
y       False  30      2020-04-28
<<U1>   <bool> <int64> <object>


In [21]:
# Can convert strings to NumPy datetime64 date objects
print(str(f1.astype['C'](np.datetime64)))

<Frame>
<Index> A      B       C               <<U1>
<Index>
x       True   20      1954-11-02
y       False  30      2020-04-28
<<U1>   <bool> <int64> <datetime64[D]>


# A Family of `sf.Index`

* To use `datetime64` as an index, use a `datetime64` `sf.Index` subclass
    * `sf.IndexDate`, `sf.IndexYearMonth`, etc.
    * Provides robust translation from Python date / datetime objects
    * Provides partial selection with less granular date units
    * Provides alternative constructor for date ranges
* Hierarchical indices with `sf.IndexHierarchy`
* Many interfaces expose `index_constructor` arguments to specify what kind of index to make.
    

In [49]:
# Transfer a column to an index
f4 = f1.set_index('C', drop=True, index_constructor=sf.IndexDate)
f4

Unnamed: 0,A,B
1954-11-02,True,20
2020-04-28,False,30


In [50]:
# Selection with a less granular unit (year)
f4.loc['2020']

Unnamed: 0,A,B
2020-04-28,False,30


In [51]:
# sf.IndexDate understands Python datetime objects
import datetime
f4.loc[datetime.date(1954, 11, 2)]

0,1
A,True
B,20


In [54]:
# Removing an index (like pd.DataFrame.reset_index())
print(str(f4.unset_index()))

<Frame>
<Index> C               A      B       <<U1>
<Index>
0       1954-11-02      True   20
1       2020-04-28      False  30
<int64> <datetime64[D]> <bool> <int64>


# Rename, Reindex, Relabel

* `rename()` sets the `name` attribute on all containers
    * `pd.DataFrame.rename()` relabels the axis, `pd.Series.rename()` sets the name of the container
    * `sf.Frame.rename()` and `sf.Series.rename()` both do the same thing
    * Renaming is a no-copy operation
* `reindex()` applies new index, aligning to the previous index
    * Similar to `pd.DataFrame.reindex()`
    * Matching labels will retain their data
    * New labels will introduce missing values (provided with a `fill_value`)
* `relabel()` applies a new index, regardless of alignment to previous index
    * Can map old to new with `dict`
    * Can process old to new with a function
    * Can replace with a new `sf.Index` or iterable

# Iteration
* Iterating elements: `Frame.iter_elements()`
* Iterating rows or columns:
    * Specify axis=1 for rows, axis=0 for columns
    * Choose what you want to get back
        * `Frame.iter_series()`
        * `Frame.iter_tuple()`
        * `Frame.iter_array()`

In [32]:
f5 = sf.FrameGO(np.arange(18).reshape(6,3), columns=tuple('ABC'))
f5['D'] = tuple('abbacc')
f5

Unnamed: 0,A,B,C,D
0,0,1,2,a
1,3,4,5,b
2,6,7,8,b
3,9,10,11,a
4,12,13,14,c
5,15,16,17,c


In [30]:
# Axis 1 iterates rows; next() gets the first
display(next(iter(f5.iter_series(axis=1))))
next(iter(f5.iter_array(axis=1)))

0,1
A,0
B,1
C,2
D,a


array([0, 1, 2, 'a'], dtype=object)

In [31]:
# Axis 0 iterates columns, next() gets the first
display(next(iter(f5.iter_series(axis=0))))
next(iter(f5.iter_array(axis=0)))

0,1
0,0
1,3
2,6
3,9
4,12
5,15


array([ 0,  3,  6,  9, 12, 15])

# Function & Mapping Application
* Function application implies iteration
* Choose what you want to iterate on and call `apply()`
    * Always returns an `sf.Series`
* Can multi-process / thread with `apply_pool()`
* Can iterate through results with `apply_iter()`
* Can map instead of apply
    * `map_all()`: if value not mappable, raise
    * `map_any()`: map what you can, leave the rest unchanged
    * `map_fill()`: map what you can, provide `fill_value` for others

# Grouping & Windowing

* `sf.Frame.iter_group()`
    * Group by unique values in one or more columns (axis 0) or rows (axis 1)
    * Can use `apply()` if reducing to an `sf.Series`
    * Can use an `sf.Batch` for performing operations on sub-Frames like `pd.DataFrameGroupBy`
* `sf.Frame.iter_window()`
    * Can use an `sf.Batch` for performing operations on sub-Frames like `pd.Rolling`

# Working with Collections of Frames
* Pandas deprecated the `pd.Panel` for 3D data
* Hierarchical indices incur overhead and force loading all data at once
* The `sf.Bus`
    * Offers a Series-like interface to collections of Frames
    * Can read from and write to multi-table storage formats
        * XLSX, HDF5, SQLite
            * XLSX authoring similar to Pandas `pd.ExcelWriter`
            * HDF5 authoring similar to Pandas `pd.HDFStore`
        * Zipped archives of CSV, TSV, Parquet, and NPZ
    * Reads lazily
    * Optionally unloads eagerly with `max_persist` argument
    

# Interfaces for Working with Strings
* `sf.Frame.via_str`, similar to `pd.Series.str`
* Expose Python string object interface for application on all elements
* https://static-frame.readthedocs.io/en/latest/api_overview/frame.html#frame-accessor-string

In [59]:
f1.via_str.upper()

Unnamed: 0,A,B,C
x,True,20,1954-11-02
y,False,30,2020-04-28


In [61]:
f1.via_str.replace('0', '+')

Unnamed: 0,A,B,C
x,True,2+,1954-11-+2
y,False,3+,2+2+-+4-28


# Interfaces for Working with Dates
* `sf.Frame.via_dt`, similar to `pd.Series.dt`
* Expose Python `date`, `datetime` interface for application on all elements
* https://static-frame.readthedocs.io/en/latest/api_overview/frame.html#frame-accessor-datetime

In [63]:
f1['C'].astype(np.datetime64).via_dt.month

0,1
x,11
y,4


In [64]:
f1['C'].astype(np.datetime64).via_dt.year

0,1
x,1954
y,2020


In [57]:
f1['C'].astype(np.datetime64).via_dt.weekday()

0,1
x,1
y,1


# Interfaces for Applying Regular Expressions
* `sf.Frame.via_re` 
* Similar to `pd.Series.str.extract()`, but provides full interface from `re` module
* https://static-frame.readthedocs.io/en/latest/api_overview/frame.html#frame-accessor-regular-expression

In [79]:
display(f1)
f1.via_re('[2a]').search()

Unnamed: 0,A,B,C
x,True,20,1954-11-02
y,False,30,2020-04-28


Unnamed: 0,A,B,C
x,False,True,True
y,True,False,True


# Configuring `fill_value` in Operator Application

* Operations on labelled containers force reindexing
* `sf.Frame.via_fill_value()` permits providing a fill value
* Pandas offers related functionality with `pd.DataFrame.add()`, `pd.DataFrame.sub()`, `pd.DataFrame.mul()`, etc., methods.

In [5]:
display(f1)
# Default binary operator application takes the union index and uses `nan` as a fill value
f1['B'] * sf.Series((1000, 1, .001), index=tuple('zyx'))

Unnamed: 0,A,B,C
x,True,20,1954-11-02
y,False,30,2020-04-28


0,1
x,0.02
y,30.0
z,


In [6]:
# Using `via_fill_value` a fill value can be specified
f1['B'].via_fill_value(0) * sf.Series((1000, 1, .001), index=tuple('zyx'))

0,1
x,0.02
y,30.0
z,0.0


# Virtual Transposition in Operator Application
* Applying a 1D container on a 2D container applies to rows
* `sf.Frame.via_T` presents 2D containers "virtually" transposed
* Useful for applying a 1D container to the columns of a 2D container
* Pandas offers related functionality with `pd.DataFrame.add()`, `pd.DataFrame.sub()`, `pd.DataFrame.mul()`, etc., methods.

In [109]:
# 2D to 1D assumes row-wise application
display(sf.Frame(np.arange(8).reshape(2, 4), index=tuple('xy')))
display(f1['B'])
sf.Frame(np.arange(8).reshape(2, 4), index=tuple('xy')) * f1['B']

Unnamed: 0,0,1,2,3
x,0,1,2,3
y,4,5,6,7


0,1
x,20
y,30


Unnamed: 0,0,1,2,3,x,y
x,,,,,,
y,,,,,,


In [104]:
# Using via_T, the Series is applied column-wise
display(sf.Frame(np.arange(8).reshape(2, 4), index=tuple('xy')))
display(f1['B'])
sf.Frame(np.arange(8).reshape(2, 4), index=tuple('xy')).via_T * f1['B']


Unnamed: 0,0,1,2,3
x,0,1,2,3
y,4,5,6,7


0,1
x,20
y,30


Unnamed: 0,0,1,2,3
x,0,20,40,60
y,120,150,180,210


# All the Rest

* Complete API best viewed through docs: https://static-frame.readthedocs.io/en/latest/api_overview/frame.html
        

# All the Rest: NumPy-Style Interfaces

* StaticFrame supports common NumPy interfaces and methods (Same as Pandas)
* Attributes:
    * `sf.Frame.shape`
    * `sf.Frame.ndim`
    * `sf.Frame.size`
    * `sf.Frame.nbytes`
    * `sf.Frame.T`
* Logical operations (by axis):
    * `sf.Frame.all()`
    * `sf.Frame.any()`
* Mathematical operations (by axis):
    * `sf.Frame.sum()`
    * `sf.Frame.min()`
    * `sf.Frame.max()`
    * `sf.Frame.mean()`
    * `sf.Frame.median()`
    * `sf.Frame.std()`
    * `sf.Frame.var()`
    * `sf.Frame.prod()`
    * `sf.Frame.cumsum()`
    * `sf.Frame.cumprod()`
    

# All the Rest: Joins
* Pandas: `pd.DataFrame.join()` with a `how` parameter (‘left’, ‘right’, ‘outer’, ‘inner’)
* StaticFrame:
    * `sf.Frame.join_left()`
    * `sf.Frame.join_right()`
    * `sf.Frame.join_outer()`
    * `sf.Frame.join_inner()`    

# All the Rest: Ranking
* Pandas: `pd.DataFrame.rank()` with a `method` parameter of (‘average’, ‘min’, ‘max’, ‘first’, ‘dense’)
* StaticFrame:
    * `sf.Frame.rank_mean()`, `sf.Frame.rank_min()`, `sf.Frame.rank_max()`, `sf.Frame.rank_ordinal()`, `sf.Frame.rank_dense()`

# All the Rest: Pivot
* Pivoting
    * Pandas: `pd.DataFrame.pivot()`, `pd.DataFrame.pivot_table()`
    * StaticFrame: `sf.Frame.pivot()`
* Stacking & unstacking
    * Pandas: `pd.DataFrame.stack()`, `pd.DataFrame.unstack()`
    * StaticFrame: `sf.Frame.pivot_stack()`, `sf.Frame.pivot_unstack()`


# Performance
* In many situations StaticFrame can lead to more efficient systems
* Code can be more efficient with memory
    * Can reuse immutable views
    * No need for defensive copies
* Focus of current development is performance
    * Profiling with `cProfile`, `pyinstrument`, `line-profiler` and `gprof2dot` (for call graph analysis)
    * C-extensions in ArrayKit
    

# Performance: Sample Measures

* Current metrics under study
* Native is StaticFrame, Reference is Pandas
* When StaticFrame is faster, it tends to be a lot faster
* Out of 50 tests, StaticFrame outperforms in 32

### python:3.8.12|numpy:1.17.4|pandas:1.3.5|static_frame:0.8.34


|name                                                             |iterations |Native |Reference |n/r    |r/n     |win                 |
|-----------------------------------------------------------------|-----------|-------|----------|-------|--------|--------------------|
|IndexIterLabelApply.index_int                     |200.0      |0.0228 |0.049     |0.466  |2.146   |True   |
|IndexIterLabelApply.index_int_dtype               |200.0      |0.0108 |0.0469    |0.2306 |4.3371  |True   |
|SeriesIsNa.bool_index_auto                        |10000.0    |0.0386 |0.4357    |0.0885 |11.2971 |True   |
|SeriesIsNa.float_index_auto                       |10000.0    |0.0304 |0.4442    |0.0685 |14.5922 |True   |
|SeriesIsNa.object_index_auto                      |10000.0    |0.7061 |0.849     |0.8317 |1.2023  |True   |
|SeriesDropNa.bool_index_auto                      |200.0      |0.0003 |0.0052    |0.0663 |15.0853 |True   |
|SeriesDropNa.bool_index_str                       |200.0      |0.0003 |0.0125    |0.0246 |40.5973 |True   |
|SeriesDropNa.float_index_auto                     |200.0      |0.5844 |0.3402    |1.7177 |0.5822  |False|
|SeriesDropNa.float_index_str                      |200.0      |2.0477 |1.0093    |2.0288 |0.4929  |False|
|SeriesDropNa.object_index_auto                    |200.0      |2.2168 |1.2578    |1.7624 |0.5674  |False|
|SeriesDropNa.object_index_str                     |200.0      |3.8207 |2.0625    |1.8524 |0.5398  |False|
|SeriesFillNa.float_index_str                      |100.0      |0.02   |0.0344    |0.5814 |1.7199  |True   |
|SeriesFillNa.object_index_str                     |100.0      |0.7479 |0.4065    |1.8397 |0.5436  |False|
|SeriesDropDuplicated.bool_index_str               |500.0      |0.0193 |0.03      |0.6427 |1.5559  |True   |
|SeriesDropDuplicated.float_index_str              |500.0      |0.075  |0.0493    |1.5191 |0.6583  |False|
|SeriesDropDuplicated.object_index_str             |500.0      |0.1217 |0.4774    |0.2549 |3.9226  |True   |
|SeriesIterElementApply.bool_index_str             |500.0      |0.3462 |0.1436    |2.4118 |0.4146  |False|
|SeriesIterElementApply.float_index_str            |500.0      |0.3526 |0.2661    |1.3253 |0.7546  |False|
|SeriesIterElementApply.object_index_str           |500.0      |0.312  |0.2341    |1.333  |0.7502  |False|
|FrameDropNa.float_index_auto_column               |100.0      |0.0134 |0.1052    |0.1273 |7.8532  |True   |
|FrameDropNa.float_index_auto_row                  |100.0      |0.0079 |0.0751    |0.1057 |9.4644  |True   |
|FrameDropNa.float_index_str_column                |100.0      |0.0158 |0.1031    |0.1533 |6.5251  |True   |
|FrameDropNa.float_index_str_row                   |100.0      |0.0081 |0.0742    |0.1086 |9.2069  |True   |
|FrameILoc.element_index_auto                      |100000.0   |0.1713 |1.9643    |0.0872 |11.4639 |True   |
|FrameILoc.element_index_str                       |100000.0   |0.172  |2.0113    |0.0855 |11.6921 |True   |
|FrameLoc.element_index_auto                       |100000.0   |0.2638 |0.5898    |0.4473 |2.2358  |True   |
|FrameLoc.element_index_str                        |100000.0   |0.3851 |0.5571    |0.6912 |1.4467  |True   |
|FrameIterSeriesApply.float_index_str_column       |50.0       |2.48   |4.3301    |0.5727 |1.746   |True   |
|FrameIterSeriesApply.float_index_str_column_dtype |50.0       |2.134  |4.2312    |0.5044 |1.9827  |True   |
|FrameIterSeriesApply.float_index_str_row          |50.0       |2.1213 |2.9716    |0.7139 |1.4008  |True   |
|FrameIterSeriesApply.float_index_str_row_dtype    |50.0       |1.9963 |2.9624    |0.6739 |1.484   |True   |
|FrameIterSeriesApply.mixed_index_str_column       |50.0       |0.1574 |1.1348    |0.1387 |7.2097  |True   |
|FrameIterSeriesApply.mixed_index_str_column_dtype |50.0       |0.1599 |1.2063    |0.1326 |7.5424  |True   |
|FrameIterSeriesApply.mixed_index_str_row          |50.0       |2.2708 |1.7064    |1.3307 |0.7515  |False|
|FrameIterSeriesApply.mixed_index_str_row_dtype    |50.0       |2.3071 |1.6826    |1.3712 |0.7293  |False|
|FrameIterGroupApply.int_index_str_double          |1000.0     |1.393  |0.8971    |1.5528 |0.644   |False|
|FrameIterGroupApply.int_index_str_single          |1000.0     |0.578  |0.5381    |1.0741 |0.931   |False|
|FrameIterGroupApply.str_index_str_double          |1000.0     |1.406  |0.9642    |1.4583 |0.6857  |False|
|FrameIterGroupApply.str_index_str_single          |1000.0     |0.5893 |0.6984    |0.8438 |1.1852  |True   |
|Pivot.index1_columns0_data2                       |150.0      |0.1941 |0.7838    |0.2477 |4.037   |True   |
|Pivot.index1_columns1_data1                       |150.0      |7.5364 |0.9452    |7.9737 |0.1254  |False|
|BusItemsZipPickle.int_index_str                   |1.0        |4.9487 |          |       |        |True   |
|FrameToParquet.write_tall_mixed_index_str         |4.0        |0.0535 |0.0394    |1.3565 |0.7372  |False|
|FrameToParquet.write_wide_mixed_index_str         |4.0        |2.0016 |2.6561    |0.7536 |1.327   |True   |
|Group.tall_group_100                              |150.0      |3.1352 |0.7401    |4.2359 |0.2361  |False|
|Group.wide_group_2                                |150.0      |2.3462 |1.6902    |1.3881 |0.7204  |False|
|FrameFromConcat.tall_mixed_20                     |50.0       |0.2972 |0.7582    |0.392  |2.5508  |True   |
|FrameFromConcat.tall_uniform_20                   |50.0       |0.1278 |0.1516    |0.843  |1.1862  |True   |
|min                                               |           |0.0003 |0.0052    |0.0246 |0.1254  |                    |
|max                                               |           |7.5364 |4.3301    |7.9737 |40.5973 |                    |
|mean                                              |           |1.0298 |0.9428    |1.0114 |4.144   |                    |
|median                                            |           |0.3356 |0.5898    |0.6912 |1.4467  |                    |
|std                                               |           |1.4853 |1.0476    |1.2932 |6.6758  |                    |
