# Torcharrow: State handling -- Scopes, Multi-targeting and Tracing


Torcharrow has **no global mutable state**. Instead, it has the concept of a scope, which is threaded implicitly through a pipeline. This allows for configuration management, multi-device targeting and tracing. This short doc explains these concepts and their use.



## Scopes
 
Each torcharrow pipeline runs within the context of a scope. A scope defines the pipeline's configuration settings and thus influences the location and behavior of columns, dataframes and their operations. Users can explicitly create a scope by calling one of the `Scope` constructors. The simplest one creates the default scope:


In [1]:
import torcharrow as T
import torcharrow.dtypes as dt

ts = T.Scope()

The default scope has the following three default settings:

- `device`: `std`, meaning that columns and dataframes are allocated as NumPy arrays (see the section on Multi-device targeting below for more details)
- `tracing`: `False`, meaning that the code is currently not traced (see the section on Tracing below for more details)
- `types_to_trace`: `[]`; if `tracing` is `True`, only these types are traced.
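
As a rough mental model, a scope's settings can be thought of as a dictionary of defaults overlaid with user-supplied values. The following plain-Python sketch illustrates that idea. `DEFAULT_SETTINGS` and `resolve_settings` are hypothetical names introduced here, not part of Torcharrow, and the merge behavior (user keys override defaults, unknown keys are rejected) is an assumption.

```python
# Hypothetical illustration of settings resolution; not Torcharrow's code.
DEFAULT_SETTINGS = {"device": "std", "tracing": False, "types_to_trace": []}


def resolve_settings(overrides=None):
    """Return a fresh settings dict: the defaults overlaid with overrides."""
    overrides = dict(overrides or {})
    unknown = set(overrides) - set(DEFAULT_SETTINGS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    # Copy the defaults (including the list) so they are never mutated.
    settings = {
        **DEFAULT_SETTINGS,
        "types_to_trace": list(DEFAULT_SETTINGS["types_to_trace"]),
    }
    settings.update(overrides)
    return settings


print(resolve_settings())
print(resolve_settings({"tracing": True}))
```

Because every call returns a new dictionary, the defaults themselves stay untouched, which fits the "no global mutable state" principle stated at the top.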

## Column and Dataframe Factories

Columns and dataframes are created with respect to a scope and inherit that scope's settings. Here we show that a column or dataframe inherits the device setting, which is accessible via the `device` property of the created column or dataframe.

In [2]:
c = ts.Column([1, 2, 3])
d = ts.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
print(f"Column c: {list(c)}, its device: {c.device}")
print(f"DataFrame d: {list(d)}, its device: {d.device}")

Column c: [1, 2, 3], its device: std
DataFrame d: [(1, 'a'), (2, 'b'), (3, 'c')], its device: std


Most programs don't have to worry about scopes at all. They can just use the public factories `Column` and `Frame`, which implicitly pick up the default scope's settings. So TorchArrow users who are not power users can remain completely unaware of configs, scopes, multi-device targeting, tracing, etc.


In [3]:
d = T.Column(['abc',None])
d.device

'std'

Note: The factory method for a `DataFrame` is currently called simply `Frame`, because `DataFrame` denotes the resulting class, whereas `Frame` is not its constructor but a factory method. Once we make all classes device specific, we can have `DataFrame` back.

## Multi-device targeting

Torcharrow supports multi-device targeting, i.e., columns and dataframes can reside in different kinds of memory (which we also call devices). Currently we support 3 configurations:

- `std`, which means columns and dataframes are backed by NumPy,
- `cpu`, which means columns and dataframes are backed by Velox,
- `gpu`, which means columns and dataframes are backed by CuPy (i.e. GPU memory).

The user controls the assignment in 3 ways:

- The default assignment is set via the config's `device` parameter. The current default device is `std`.
- The `device` parameter of the `Column` or `(Data)Frame` factory method. If `device` is `None`, the data is allocated on the default device; otherwise it is created on the specified device.
- The `to` instance method, defined on the base class `IColumn`, which moves the column/frame to the designated device.

Torcharrow enforces the following rules:
- A dataframe created on a particular device has all of its columns on that same device.
- Applying an operation to a column or dataframe yields a column or dataframe on the same device.
- If an operation takes several columns/frames as input, all of them have to be on the same device.
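
These rules can be sketched in a few lines of plain Python. `ToyColumn` below is a hypothetical stand-in, not Torcharrow's `IColumn`: its `to` only re-tags the data instead of converting buffers, and the way `None` propagates through addition is an assumption made for the example.

```python
# Hypothetical sketch of the device rules; not Torcharrow's implementation.
class ToyColumn:
    def __init__(self, data, device="std"):
        self.data = list(data)
        self.device = device

    def to(self, device):
        # Already on the target device: nothing to move.
        if device == self.device:
            return self
        # A real backend would convert the underlying buffers here.
        return ToyColumn(self.data, device)

    def __add__(self, other):
        # All inputs of an operation must live on the same device.
        if self.device != other.device:
            raise TypeError(
                f"device mismatch: {self.device!r} vs {other.device!r}"
            )
        summed = [
            None if a is None or b is None else a + b
            for a, b in zip(self.data, other.data)
        ]
        # The result stays on the device of its inputs.
        return ToyColumn(summed, self.device)


x = ToyColumn([1.0, None], device="std")
y = ToyColumn([2.0, 3.0], device="cpu")

z = x.to("cpu") + y
print(z.device, z.data)  # cpu [3.0, None]

try:
    x + y
except TypeError as err:
    print(f"error: {err}")
```

Raising an error instead of silently moving data keeps every transfer between devices visible in the program text.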

Let's see this in practice. First we create a dataframe and inspect the `device` of the dataframe and of its columns:


In [4]:
e = T.Frame({'a': [1.0, None], 'b': ['a', 'c']})
f = e['a'] > 12
(e.device, e['a'].device, e['b'].device, f.device)

('test', 'test', 'test', 'test')

Alternatively we could have created a column/frame on a particular device:

In [5]:
g = T.Column([1.0, None], device = 'cpu')
g.device

'cpu'

To add `e['a']` to `g` we have to bring the columns onto the same device. Let's say it is `cpu`. Then the addition returns a new column on `cpu`.

In [6]:
h = e['a'].to('cpu') + g
h.device

'cpu'

The system raises a `TypeError` if the two columns to be added reside on different devices.

In [7]:
x = T.Column([1], device='cpu')
y = T.Column([1], device='std')
try:
    z = x + y
# bind the exception to `err`, not `e`, so the dataframe `e` from In [4] is not clobbered
except TypeError as err:
    print(f"error: {err}")


## Tracing


Torcharrow programs are executed eagerly, that is, every expression is evaluated bottom-up and statements are executed one after another. While this is fast and makes programs easy to debug, it doesn't allow the executed code to be inspected for analysis, optimization or platform re-targeting.

To get the best of both worlds, fast execution and ease of analysis, torcharrow introduces tracing. To create a torcharrow trace, define a new settings dictionary in which you set `tracing` to `True` and provide the types that you want to trace. For Torcharrow, the types to trace should always include `Scope`, `IColumn` and `GroupedDataFrame`.

In [8]:
types = [T.Scope, T.IColumn, T.GroupedDataFrame]
settings = {'tracing': True, 'types_to_trace': types}


Next we run the program unchanged. To see what happens, we print the resulting dataframe. Each traced object carries an object id, here named `s`*i* (for scopes) and `c`*i* (for columns and dataframes).

In [9]:
from torcharrow import me

ts = T.Scope(settings)
d0 = ts.DataFrame(dtype=dt.Struct([dt.Field(i, dt.int64) for i in ['a', 'b', 'c']]))
d1 = d0.select('*', e=me['a'] + me['b'])
str(d1)

"self._fromdata({'a':Column([], id = c0), 'b':Column([], id = c1), 'c':Column([], id = c2), 'e':Column([], id = c4), id = c5})"

A faithful trace should have captured this execution and should be able to replay it with the same results. Let's see whether that's the case.

The generated `trace` is accessible via the scope object. The trace has two components:
- `statements` returns a list of assignments, where each
   - right-hand side is an operation on one of the types to trace
   - left-hand side is named after the object id that is created by the right-hand side
- `result` returns the name of the variable that was last assigned.

In [10]:
d1_result = ts.trace.result()
d1_stms = ts.trace.statements()
(d1_result, d1_stms)

('c5',
 ["c3 = torcharrow.scope.Scope.DataFrame(s0, dtype=Struct([Field('a', int64), Field('b', int64), Field('c', int64)]))",
  "c5 = torcharrow.dataframe.DataFrame.select(c3, '*', e=torcharrow.dataframe.me.__getitem__('a').__add__(torcharrow.dataframe.me.__getitem__('b')))"])

The right-hand side of each statement is a fully resolved and type-checked expression in normal form; see, e.g., the assignment to `c5`. Arguments to all expressions are Python values or references to variables introduced earlier.
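
To make the shape of such a trace concrete, here is a deliberately simplified tracer in plain Python. It is a sketch under stated assumptions, not Torcharrow's tracer: the `Trace` class, its `record` method and the helper functions `make_column` and `add_columns` are all hypothetical, and calls are recorded explicitly rather than by intercepting methods of the types to trace.

```python
# Hypothetical sketch of call tracing; not Torcharrow's tracer.
import itertools


class Trace:
    def __init__(self):
        self._ids = itertools.count()
        self._names = {}       # id(obj) -> variable name such as "c0"
        self._keep = []        # keep traced objects alive so id() stays unique
        self._statements = []

    def _render(self, obj):
        # Traced objects are referenced by name, other values by their repr.
        return self._names.get(id(obj), repr(obj))

    def record(self, func, *args, **kwargs):
        """Call func and log `<new id> = func(<args>)` as a string."""
        rendered = [self._render(a) for a in args]
        rendered += [f"{k}={self._render(v)}" for k, v in kwargs.items()]
        result = func(*args, **kwargs)
        name = f"c{next(self._ids)}"
        self._names[id(result)] = name
        self._keep.append(result)
        self._statements.append(
            f"{name} = {func.__name__}({', '.join(rendered)})"
        )
        return result

    def statements(self):
        return list(self._statements)

    def result(self):
        # Name of the variable that was assigned last.
        return self._statements[-1].split(" = ", 1)[0]


def make_column(values):
    return list(values)


def add_columns(a, b):
    return [x + y for x, y in zip(a, b)]


trace = Trace()
c0 = trace.record(make_column, [1, 2])
c1 = trace.record(make_column, [10, 20])
c2 = trace.record(add_columns, c0, c1)

print(trace.statements())
print(trace.result())  # c2
```

As in the real trace above, every argument is either a plain Python value or the name of an object created by an earlier statement.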

What can we do with such a trace? We can
 * analyze it for type correctness or for privacy flows
 * optimize and rewrite it
 * capture it, ship it to another machine and re-execute with or without data. 
 
Here we just replay the trace using Python's `exec` and `eval` (TODO: use fully qualified names everywhere so that the imports below can be dropped).

In [11]:
import torcharrow
from torcharrow.dtypes import Struct, Field, int64

# the trace refers to the scope as s0, so bind a scope to that name
s0 = T.Scope()
# execute the statements
for stm in d1_stms:
    exec(stm)
# evaluate the result
str(eval(d1_result))

"self._fromdata({'a':Column([], id = c0), 'b':Column([], id = c1), 'c':Column([], id = c2), 'e':Column([], id = c4), id = c5})"

We see that `d1` and `eval(d1_result)` are structurally exactly the same, including their object ids. Thus, for this program, the trace preserved the original semantics.
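
The replay above runs the statements in the notebook's own namespace. A variant that keeps the replayed variables out of the surrounding code passes an explicit namespace to `exec` and `eval`. The sketch below is self-contained and uses hypothetical helpers (`make_column`, `add_columns`, `replay`) and a hand-written statement list in place of a real Torcharrow trace.

```python
# Hypothetical sketch: replay a trace inside an isolated namespace.
def make_column(values):
    return list(values)


def add_columns(a, b):
    return [x + y for x, y in zip(a, b)]


# Stand-ins for what a trace's statements and result would contain.
statements = [
    "c0 = make_column([1, 2])",
    "c1 = make_column([10, 20])",
    "c2 = add_columns(c0, c1)",
]
result_name = "c2"


def replay(statements, result_name, env):
    """Execute the statements in a copy of env and return the named result."""
    namespace = dict(env)
    for stm in statements:
        exec(stm, namespace)
    return eval(result_name, namespace)


env = {"make_column": make_column, "add_columns": add_columns}
replayed = replay(statements, result_name, env)
print(replayed)  # [11, 22]
```

Since `exec` runs arbitrary code, a trace should only be replayed if it comes from a trusted source.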