# TorchArrow in 10 minutes

TorchArrow is a Python DataFrame library built on the Apache Arrow columnar memory format and leveraging the Velox vectorized engine for loading, filtering, mapping, joining, aggregating, and otherwise manipulating tabular data on CPUs.

TorchArrow allows mostly zero-copy interop with NumPy, Pandas, PyArrow, cuDF and, of course, PyTorch.
In fact, it is the integration with PyTorch that triggered the development of TorchArrow.
So TorchArrow understands Tensors natively.

(Remark: if the following looks familiar, that is because portions of this tutorial were gratefully borrowed and adapted from the 10 Minutes to Pandas (and cuDF) tutorials.)



In [1]:

import pandas as pd
import numpy as np
import pyarrow as pa

The TorchArrow library consists of three parts:

  * *DTypes* define *Schema*, *Fields*, primitive and composite *Types*.
  * *Columns* define sequences of strongly typed data with vectorized operations.
  * *Dataframes* are sequences of named and typed columns of the same length with relational operations.

Let's get started...

In [2]:
import torcharrow as T
import torcharrow.dtypes as dt

ta = T.Scope()


## Constructing data: Columns

### From Pandas to TorchArrow
To start, let's create a Pandas Series and a TorchArrow column and compare them:

In [3]:
pd.Series([1,2,None,4])

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In Pandas each Series has an index, here depicted as the first column. Note also that the inferred type is float and not int, since in Pandas `None` implicitly promotes an int list to a float Series.

TorchArrow has a much more precise type system:

In [4]:
s = ta.Column([1,2,None,4])
s

0  1
1  2
2  None
3  4
dtype: Int64(nullable=True), length: 4, null_count: 1

TorchArrow infers that the type is `Int64(nullable=True)`, which requires that the vector be represented internally via two arrays: its data and a validity bit mask (the current implementation uses one byte for each bit). We can make this explicit by looking at the underlying representation:


In [5]:
from tabulate import tabulate

print(tabulate([(d, m) for d, m in zip(s._data, s._mask)], headers=["data", "mask"]))

  data    mask
------  ------
     1       0
     2       0
     0       1
     4       0
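The data-plus-mask layout can be mimicked in plain Python. The following is only an illustrative sketch and not TorchArrow's implementation: `to_data_and_mask` and `from_data_and_mask` are hypothetical helper names, and the placeholder `0` stored in null slots is an assumption read off the table above.

```python
from typing import List, Optional, Tuple


def to_data_and_mask(values: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """Split values into a data list and a mask list (1 marks a null slot)."""
    data = [0 if v is None else v for v in values]  # placeholder 0 for nulls (assumption)
    mask = [1 if v is None else 0 for v in values]
    return data, mask


def from_data_and_mask(data: List[int], mask: List[int]) -> List[Optional[int]]:
    """Rebuild the nullable values: a set mask entry hides the data slot."""
    return [None if m else d for d, m in zip(data, mask)]


data, mask = to_data_and_mask([1, 2, None, 4])
print(data)  # [1, 2, 0, 4]
print(mask)  # [0, 0, 1, 0]
print(from_data_and_mask(data, mask))  # [1, 2, None, 4]
```

Keeping nulls out of the data array means the data stays a dense, uniformly typed sequence, which is what allows an integer column to remain an integer column.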


Of course, we can always get more information from a column: the length (via `len`), `count` and `null_count` give the total number of items, the number of non-null items, and the number of nulls, respectively.

In [6]:
len(s), s.count(), s.null_count()


(4, 3, 1)

TorchArrow supports almost all Arrow types, including arbitrarily nested structs, maps, lists, and fixed-size lists. Here is a non-nullable column of lists of non-nullable strings of arbitrary length.

In [7]:
sf = ta.Column([ ["hello", "world"], ["how", "are", "you"] ], dtype =dt.List(dt.string))
sf.dtype

List(item_dtype=String(nullable=False), nullable=False, fixed_size=-1)

And here is a column of average climate data, one map per continent, with the city as key and the yearly average min and max temperature as value:


In [8]:
mf = ta.Column([ 
    {'helsinki': [-1.3, 21.5], 'moscow': [-4.0,24.3]}, 
    {'algiers':[11.2, 25.2], 'kinshasa':[22.2,26.8]}
    ])
mf

0  {'helsinki': [-1.3, 21.5], 'moscow': [-4.0, 24.3]}
1  {'algiers': [11.2, 25.2], 'kinshasa': [22.2, 26.8]}
dtype: Map(string, List(float64)), length: 2, null_count: 0

### Append and concat

Columns are immutable (or in more detail: the public API defines columns as being immutable). Use `append` to add a list of values or `concat` to combine a list of columns.

In [9]:
sf = sf.append([["I", "am", "fine"]])
sf

0  ['hello', 'world']
1  ['how', 'are', 'you']
2  ['I', 'am', 'fine']
dtype: List(string), length: 3, null_count: 0


## Constructing data: Dataframes

A Dataframe is just a set of named and strongly typed columns of equal length:

In [10]:
df = ta.DataFrame({'a': list(range(7)),
                     'b': list(reversed(range(7))),
                     'c': list(range(7))
                    })
df

  index    a    b    c
-------  ---  ---  ---
      0    0    6    0
      1    1    5    1
      2    2    4    2
      3    3    3    3
      4    4    2    4
      5    5    1    5
      6    6    0    6
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64)]), count: 7, null_count: 0

To access a dataframe's columns, write:

In [11]:
df.columns

['a', 'b', 'c']

Dataframes are also immutable, except you can always add a new column, provided its name hasn't been used. The column is appended to the set of existing columns at the end.

In [12]:
df['d'] = ta.Column(list(range(99, 99+7)))
df

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    0    6    0   99
      1    1    5    1  100
      2    2    4    2  101
      3    3    3    3  102
      4    4    2    4  103
      5    5    1    5  104
      6    6    0    6  105
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 7, null_count: 0

Dataframes can be nested. Here is a Dataframe having sub-dataframes. 


In [13]:

df_inner = ta.DataFrame({'b1': [11, 22, 33], 'b2':[111,222,333]})
df_outer = ta.DataFrame({'a': [1, 2, 3], 'b':df_inner})
df_outer

  index    a  b
-------  ---  ---------
      0    1  (11, 111)
      1    2  (22, 222)
      2    3  (33, 333)
dtype: Struct([Field('a', int64), Field('b', Struct([Field('b1', int64), Field('b2', int64)]))]), count: 3, null_count: 0

We can not only add columns to dataframes, we can append rows, too. A row of a dataframe is expressed as a tuple. So a row of a nested dataframe is represented by a nested tuple. 


In [14]:
# TODO: Currently failing; check why...
# df_outer = df_outer.append([(4,(44,444))])
# df_outer

## Interop

Take a Pandas dataframe and move it zero copy (if possible) to TorchArrow.

In [15]:

pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
gdf = T.from_pandas_dataframe(pdf, scope= ta)
gdf

  index    a    b
-------  ---  ---
      0    0  0.1
      1    1  0.2
      2    2
      3    3  0.3
dtype: Struct([Field('a', int64), Field('b', Float64(nullable=True))]), count: 4, null_count: 0

And bring it back to Pandas:

In [16]:
gdf.to_pandas()

   a    b
0  0  0.1
1  1  0.2
2  2
3  3  0.3


The same works for Arrow, too.

In [17]:
T.from_arrow_table(pa.table({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})).to_arrow()

pyarrow.Table
a: int64
b: double

## Viewing (sorted) data

Take the top n rows (the head):

In [18]:
df.head(2)

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    0    6    0   99
      1    1    5    1  100
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 2, null_count: 0

Or return the last n rows:

In [19]:
df.tail(1)


  index    a    b    c    d
-------  ---  ---  ---  ---
      0    6    0    6  105
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 1, null_count: 0

Or sort the values beforehand:

In [20]:
df.sort(by=['c', 'b']).head(2)

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    0    6    0   99
      1    1    5    1  100
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 2, null_count: 0

Sorting can be controlled not only by which columns to sort on, but also by whether to sort ascending or descending and by whether nulls are listed first or last.

## Selection using Indices

TorchArrow supports two kinds of indices:
 - Integer indices select rows
 - String indices select columns

So projecting a single column of a dataframe is simply:

In [21]:
df['a']

0  0
1  1
2  2
3  3
4  4
5  5
6  6
dtype: int64, length: 7, null_count: 0

Selecting a single row uses an integer index. (In TorchArrow everything is zero-based.)

In [22]:
df[1]

(1, 5, 1, 100)

Selecting a slice preserves the type: slicing a dataframe returns a dataframe. Here we slice rows:


In [23]:
df[2:6:2]

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    2    4    2  101
      1    4    2    4  103
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 2, null_count: 0

But you can also slice columns. The expression below returns all columns from 'c' onward, including 'c' itself.

In [24]:
df['c':]

  index    c    d
-------  ---  ---
      0    0   99
      1    1  100
      2    2  101
      3    3  102
      4    4  103
      5    5  104
      6    6  105
dtype: Struct([Field('c', int64), Field('d', int64)]), count: 7, null_count: 0

You can even access columns by position. Simply pass the column's index as a string. So `df['c':]` above is the same as `df['2':]`.

TorchArrow follows the normal Python semantics for slices: a slice interval is closed on the left and open on the right.
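Since the slice rules are those of plain Python, they can be checked on ordinary lists. This is a small sketch in which `rows` and `columns` are stand-ins (assumptions for illustration) for the row positions and column names of `df`; no TorchArrow call is involved.

```python
rows = list(range(7))            # stand-in for row positions 0..6
columns = ['a', 'b', 'c', 'd']   # stand-in for the column names of df

# start=2 is included, stop=6 is excluded, step=2 skips every other row
picked_rows = rows[2:6:2]
print(picked_rows)  # [2, 4]

# slicing from the position of 'c' to the end keeps 'c' itself
by_name = columns[columns.index('c'):]
by_position = columns[2:]
print(by_name)      # ['c', 'd']
print(by_position)  # ['c', 'd']
```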

## Selection by Condition

Selection of a column or dataframe *c* by a condition takes a boolean column *b* of the same length as *c*. If the *i*th row in *b* is true, *c*'s *i*th row is included in the result; otherwise it is dropped. The expression below selects the first row, since its entry is true, and drops all remaining rows, since theirs are false.



In [25]:
df[[True] + [False] * (len(df)-1)]

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    0    6    0   99
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 1, null_count: 0
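The row-by-row rule behind boolean selection is the same one that `itertools.compress` implements for plain iterables. The sketch below uses a list of tuples as a stand-in for dataframe rows (an assumption for illustration); it does not call TorchArrow.

```python
from itertools import compress

# three rows of (a, b, c, d), standing in for the first rows of df
rows = [(0, 6, 0, 99), (1, 5, 1, 100), (2, 4, 2, 101)]

# one boolean per row: keep the first row, drop the rest
mask = [True] + [False] * (len(rows) - 1)

selected = list(compress(rows, mask))
print(selected)  # [(0, 6, 0, 99)]
```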

Conditional expressions over vectors return boolean vectors. Conditionals are thus the usual way to write filters. 

In [26]:
b = df['a'] > 4
df[b]

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    5    1    5  104
      1    6    0    6  105
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 2, null_count: 0

TorchArrow supports all the usual predicates, like `<`, `==`, `!=`, `>`, `>=`, `<=`, as well as _in_. The latter is denoted by `isin`.


In [27]:
df[df['a'].isin([5])]

  index    a    b    c    d
-------  ---  ---  ---  ---
      0    5    1    5  104
dtype: Struct([Field('a', int64), Field('b', int64), Field('c', int64), Field('d', int64)]), count: 1, null_count: 0

## Missing data
Missing data can be filled in via the `fillna` method:

In [28]:
t = s.fillna(999)
t

0    1
1    2
2  999
3    4
dtype: int64, length: 4, null_count: 0

Alternatively, null values can be dropped:

In [29]:
s.dropna()

0  1
1  2
2  4
dtype: int64, length: 3, null_count: 0

## Operators
Columns and dataframes support all of Python's usual operators: `==`, `!=`, `<=`, `<`, `>`, `>=` for equality and comparison; `+`, `-`, `*`, `/`, `//`, `**` for arithmetic; and `&`, `|`, `~` for conjunction, disjunction and negation.

The semantics of each operator is given by lifting their scalar operation to vectors and dataframes. So given for instance a scalar comparison operator, in TorchArrow a scalar can be compared to each item in a column, two columns can be compared pointwise, a column can be compared to each column of a dataframe, and two dataframes can be compared by comparing each of their respective columns. 

Here are some example expressions:

In [30]:
u = ta.Column(list(range(5)))
v = -u
w = v+1
v*w

0   0
1   0
2   2
3   6
4  12
dtype: int64, length: 5, null_count: 0

In [31]:
uv = ta.DataFrame({'a': u, 'b': v})
uu = ta.DataFrame({'a': u, 'b': u})
(uv==uu)

  index    a    b
-------  ---  ---
      0    1    1
      1    1    0
      2    1    0
      3    1    0
      4    1    0
dtype: Struct([Field('a', boolean), Field('b', boolean)]), count: 5, null_count: 0
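The idea of lifting a scalar operator can be sketched in a few lines of plain Python. Here `lift` is a hypothetical helper (not part of TorchArrow) that treats a Python list as a column and anything else as a scalar; it reproduces the column results of the two cells above.

```python
import operator


def lift(op, left, right):
    """Apply a binary scalar operator pointwise; lists act as columns."""
    left_is_col = isinstance(left, list)
    right_is_col = isinstance(right, list)
    if left_is_col and right_is_col:
        if len(left) != len(right):
            raise ValueError("columns must have the same length")
        return [op(a, b) for a, b in zip(left, right)]
    if left_is_col:
        return [op(a, right) for a in left]   # column op scalar
    if right_is_col:
        return [op(left, b) for b in right]   # scalar op column
    return op(left, right)                    # plain scalar case


u = list(range(5))
v = [-x for x in u]
w = lift(operator.add, v, 1)
product = lift(operator.mul, v, w)
equal = lift(operator.eq, u, v)
print(product)  # [0, 0, 2, 6, 12]
print(equal)    # [True, False, False, False, False]
```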

## Null strictness

The default behavior of TorchArrow operators and functions is that *if any argument is null then the result is null*. For instance:

In [32]:
u = ta.Column([1,None,3])
v = ta.Column([11,None, None])
u+v

0  12
1  None
2  None
dtype: Int64(nullable=True), length: 3, null_count: 2

If null strictness does not work for your code, you can first call `fillna` to provide a value that is used instead of null.

Note: the following is currently disabled. Since you might need different values for different operators, all operators and functions in TorchArrow provide an optional parameter for a fill value. So if we wanted to use 7 instead of null, we could write the following (where `add` is the function name for the operator `+`):



In [33]:
#u.add(v, fill_value=7)
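Because the call above is disabled, here is a plain-Python sketch of null strictness with an optional fill value. `null_strict` is a hypothetical helper, and replacing every null operand by the fill value is an assumption about the intended semantics, not documented TorchArrow behavior.

```python
import operator


def null_strict(op, left, right, fill_value=None):
    """Pointwise op over two lists; None in, None out unless a fill value is given."""
    result = []
    for a, b in zip(left, right):
        if fill_value is not None:
            # assumption: each null operand is replaced by fill_value
            a = fill_value if a is None else a
            b = fill_value if b is None else b
        result.append(None if a is None or b is None else op(a, b))
    return result


u = [1, None, 3]
v = [11, None, None]
strict = null_strict(operator.add, u, v)
filled = null_strict(operator.add, u, v, fill_value=7)
print(strict)  # [12, None, None]
print(filled)  # [12, 14, 10]
```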

## Numerical columns and descriptive statistics
Numerical columns also support lifted operations such as `abs`, `ceil`, `floor` and `round`. Even more exciting are their aggregation operators, like `count`, `sum`, `prod`, `min`, `max`, and descriptive statistics like `std`, `mean`, `median`, and `mode`. Here is an example ensemble:


In [34]:
(t.min(), t.max(), t.sum(), t.mean())

(1, 999, 1006, 251.5)

The `describe` method puts this nicely together: 

In [35]:
t.describe()

  index  statistic      value
-------  -----------  -------
      0  count          4
      1  mean         251.5
      2  std          498.335
      3  min            1
      4  25%            1.5
      5  50%            3
      6  75%          501.5
      7  max          999
dtype: Struct([Field('statistic', string), Field('value', float64)]), count: 8, null_count: 0
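Most of these figures can be cross-checked with Python's standard `statistics` module. The sketch below works on a plain list holding the values of `t`; the quartile rows are left out on purpose, since quantile conventions differ between libraries and the one used above is not stated.

```python
import statistics

values = [1, 2, 999, 4]  # the contents of t after fillna(999)

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "std": statistics.stdev(values),   # sample standard deviation (n - 1)
    "min": min(values),
    "50%": statistics.median(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name:>5}  {value:.3f}")
```

The sample standard deviation (dividing by n - 1) is the variant that agrees with the 498.335 shown above.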

Sum, prod, min and max are also available as accumulating operators called `cumsum`, `cumprod`, etc. 

Boolean vectors are very similar to numerical vectors. They offer the aggregation operators `any` and `all`.
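What the accumulating operators compute can be illustrated with `itertools.accumulate` on a plain list. This is a sketch of the semantics only; the variable names are local to the example and the boolean list stands in for a boolean column.

```python
from itertools import accumulate
import operator

values = [1, 2, 999, 4]

running_sum = list(accumulate(values))                 # running total
running_prod = list(accumulate(values, operator.mul))  # running product
running_min = list(accumulate(values, min))            # smallest value so far
running_max = list(accumulate(values, max))            # largest value so far

print(running_sum)   # [1, 3, 1002, 1006]
print(running_prod)  # [1, 2, 1998, 7992]
print(running_min)   # [1, 1, 1, 1]
print(running_max)   # [1, 2, 999, 999]

# boolean aggregation: is any / is every value greater than 3?
flags = [v > 3 for v in values]
print(any(flags), all(flags))  # True False
```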

## String, list and map methods
TorchArrow provides all of Python's string, list and map processing methods, lifted to work over columns. As in Pandas, they are accessible via the `str`, `list` and `map` properties, respectively.

### Strings
Let's capitalize a column of strings.


In [36]:
s = ta.Column(['what a wonderful world!', 'really?'])
s.str.capitalize()

0  'What a wonderful world!'
1  'Really?'
dtype: string, length: 2, null_count: 0

Split is more involved. We have to decide whether a string gets split into a list of strings or spread over a set of columns. So `split` takes an extra parameter called `expand`: by default `expand=False`, in which case `split` returns a list column; if `expand=True`, `split` returns the pieces as columns of a dataframe:

In [37]:
ss= s.str.split(sep=' ')
ss

0  ['what', 'a', 'wonderful', 'world!']
1  ['really?']
dtype: List(string), length: 2, null_count: 0

In [38]:
cs =  s.str.split(sep=' ', expand = True, maxsplit = 2)
cs

  index  0        1
-------  -------  ---
      0  what     a
      1  really?
dtype: Struct([Field('0', String(nullable=True)), Field('1', String(nullable=True))]), count: 2, null_count: 0

### Lists

To operate on a list column, use the usual pure list operations, such as `len(gth)`, `slice`, `index` and `count`. There are also a couple of additional operations.

For instance, a column of lists of strings offers a `join` operation that inverts the result of a string split.


In [39]:
ss.list.join(sep='-')

0  'what-a-wonderful-world!'
1  'really?'
dtype: string, length: 2, null_count: 0

In addition, lists provide `filter`, `map`, `flatmap` and `reduce` operators, which we discuss in more detail under functional tools.

### Maps

Columns of type map provide the usual map operations, such as `len(gth)`, `[.]`, `keys` and `values`. `keys` and `values` each return a list column. Key and value columns can be reassembled by calling `mapsto`.

In [40]:
mf.map.keys()

0  ['helsinki', 'moscow']
1  ['algiers', 'kinshasa']
dtype: List(string), length: 2, null_count: 0

## Relational tools: Where, select, groupby, join, etc.
 
TorchArrow will soon support all relational operators on dataframes. The following sections discuss what exists today.

### Where
The simplest operator is `df.where(p)`, which is just another way of writing `df[p]`. (Note: TorchArrow's `where` differs from Pandas' `where`; the latter is a vectorized if-then-else, which TorchArrow calls `ite`.)

In [41]:
xf = ta.DataFrame({
    'A':['a', 'b', 'a', 'b'], 
    'B': [1, 2, 3, 4], 
    'C': [10,11,12,13]})

xf.where(xf['B']>2)

  index  A      B    C
-------  ---  ---  ---
      0  a      3   12
      1  b      4   13
dtype: Struct([Field('A', string), Field('B', int64), Field('C', int64)]), count: 2, null_count: 0

Note that in `xf.where` the predicate `xf['B']>2` refers to self, i.e. `xf`. To access self in an expression TorchArrow introduces the special name `me`. That is, we can also write:


In [42]:
from torcharrow import me
xf.where(me['B']>2)


  index  A      B    C
-------  ---  ---  ---
      0  a      3   12
      1  b      4   13
dtype: Struct([Field('A', string), Field('B', int64), Field('C', int64)]), count: 2, null_count: 0

### Select

`select` is SQL's standard way to define a new set of columns. We use positional arguments (`*args`) to keep columns and keyword arguments (`**kwargs`) to give new bindings. Here is a typical example that keeps all of `xf`'s columns but adds column 'D'.


In [43]:

xf.select(*xf.columns, D=me['B']+me['C'])

  index  A      B    C    D
-------  ---  ---  ---  ---
      0  a      1   10   11
      1  b      2   11   13
      2  a      3   12   15
      3  b      4   13   17
dtype: Struct([Field('A', string), Field('B', int64), Field('C', int64), Field('D', int64)]), count: 4, null_count: 0

The short form of `*xf.columns` is `'*'`, so `xf.select('*', D=me['B']+me['C'])` does the same.

### Grouping

Like pandas, torcharrow supports the Split-Apply-Combine groupby paradigm. Let's see a couple of examples: 

In [44]:
df = ta.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': [1, 2, 3, 4]})

#group by A
grouped = df.groupby(['A'])

# apply sum on each of B's grouped column to create a new column
grouped_sum = grouped['B'].sum()

#combine a new dataframe from old and new columns
res = ta.DataFrame()
res['A']= grouped['A']
res['B.sum']= grouped_sum

res

  index  A      B.sum
-------  ---  -------
      0  a          4
      1  b          6
dtype: Struct([Field('A', string), Field('B.sum', int64)]), count: 2, null_count: 0

The same can be written as a one liner:

In [45]:
df.groupby(['A']).sum()


  index  A      B.sum
-------  ---  -------
      0  a          4
      1  b          6
dtype: Struct([Field('A', string), Field('B.sum', int64)]), count: 2, null_count: 0

Of course, you can group by more than one column, e.g. `df.groupby(['A','B'])`.
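The three steps of split-apply-combine can be spelled out in plain Python. The sketch below uses a list of `(A, B)` tuples as a stand-in for the dataframe (an assumption for illustration) and reproduces the sums shown above without calling TorchArrow.

```python
from collections import defaultdict

# rows of (A, B), matching the dataframe used in the grouping example
rows = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]

# split: collect the B values of each distinct key in A
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# apply + combine: one output row per group, holding the key and the sum
result = [(key, sum(values)) for key, values in groups.items()]
print(result)  # [('a', 4), ('b', 6)]
```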

If you want to apply a whole set of functions to different parts of the dataframe, use `groupby` followed by `select` on the grouped data. (TorchArrow also supports Pandas' `agg` and `aggregate` functions, which are not shown here.)

In [46]:
df = ta.DataFrame({
    'A':['a', 'b', 'a', 'b'], 
    'B': [1, 2, 3, 4], 
    'C': [10,11,12,13]})
df.groupby(['A']).select(b_sum=me['B'].sum(), c_count=me['C'].count())

  index  A      b_sum    c_count
-------  ---  -------  ---------
      0  a          4          2
      1  b          6          2
dtype: Struct([Field('A', string), Field('b_sum', int64), Field('c_count', int64)]), count: 2, null_count: 0

All aggregation functions (`min`, `max`, `any`, `all`, `sum`, `prod`, `count`, etc.) are available. If none of these work, one can use `reduce`.

Finally, to see what the groups contain, iterate over them:

In [47]:
# use fresh loop names so that the dataframe `df` is not overwritten
for key, group in grouped:
    print(key)
    print("  ", group)

('a',)
   self._fromdata({'B':Column([1, 3], id = c156), id = c157})
('b',)
   self._fromdata({'B':Column([2, 4], id = c158), id = c159})


### Join -- TODO


In [48]:
# df_a = ta.DataFrame()
# df_a['key'] = ['a', 'b', 'c', 'd', 'e']
# df_a['vals_a'] = [float(i + 10) for i in range(5)]

# df_b = ta.DataFrame()
# df_b['key'] = ['a', 'c', 'e']
# df_b['vals_b'] = [float(i+100) for i in range(3)]

# merged = df_a.join(df_b, on=['key'], how='left')
# merged

### Transpose -- TODO



In [49]:
# sample = ta.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# sample

In [50]:
# sample.transpose() -TODO

## User defined functions and functional tools:  map, filter, reduce

Column and dataframe pipelines support map/reduce style programming as well. We first explore column oriented operations.

###  Map and its variations

`map` maps the values of a column according to an input correspondence. The correspondence can be given as a mapping or as a user-defined function (UDF). If the mapping is a dict, then unmapped values become null.




In [51]:
ta.Column([1,2,None,4]).map({1:111})

0  111
1  None
2  None
3  None
dtype: Int64(nullable=True), length: 4, null_count: 3

If the mapping is a defaultdict, all values will be mapped as described by the default dict.

In [52]:
from collections import defaultdict
ta.Column([1,2,None,4]).map(defaultdict(lambda: -1, {1:111}))

0  111
1   -1
2   -1
3   -1
dtype: Int64(nullable=True), length: 4, null_count: 0
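The difference between the two kinds of mapping mirrors how the dictionaries themselves behave in Python. In the sketch below, `map_values` is a hypothetical helper (not TorchArrow's implementation) operating on a plain list: a regular dict is queried with `get`, which yields `None` for missing keys, while a `defaultdict` is indexed directly so that its default factory fires.

```python
from collections import defaultdict


def map_values(values, mapping):
    """Look every value up in mapping; unmapped values follow the mapping's own rules."""
    if isinstance(mapping, defaultdict):
        # indexing triggers the default factory for missing keys (including None)
        return [mapping[v] for v in values]
    # dict.get returns None for missing keys, which plays the role of null
    return [mapping.get(v) for v in values]


with_dict = map_values([1, 2, None, 4], {1: 111})
with_default = map_values([1, 2, None, 4], defaultdict(lambda: -1, {1: 111}))
print(with_dict)     # [111, None, None, None]
print(with_default)  # [111, -1, -1, -1]
```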

If the mapping is a function, then it will be applied on all values (including null), unless na_action is `'ignore'`, in which case, null values are passed through.

In [53]:
def add_ten(num):
    return num + 10

ta.Column([1,2,None,4]).map(add_ten, na_action='ignore')

0  11
1  12
2  None
3  14
dtype: Int64(nullable=True), length: 4, null_count: 1

Note that `.map(add_ten, na_action=None)` would fail with a type error, since `add_ten` is not defined for `None`/null. So if we wanted to pass null to `add_ten`, we would have to prepare for it, maybe like so:

In [54]:
def add_ten_or_0(num):
    return 0 if num is None else num + 10
    
ta.Column([1,2,None,4]).map(add_ten_or_0, na_action= None)

0  11
1  12
2   0
3  14
dtype: Int64(nullable=True), length: 4, null_count: 0
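The effect of `na_action` can be sketched on a plain list. `map_function` below is a hypothetical stand-in for mapping a function over a column, not TorchArrow's implementation; it shows both the pass-through of nulls and the type error that a null-unaware function raises.

```python
def map_function(values, func, na_action=None):
    """Apply func to every value; with na_action='ignore', None passes through untouched."""
    if na_action == 'ignore':
        return [None if v is None else func(v) for v in values]
    return [func(v) for v in values]


def add_ten(num):
    return num + 10


ignored = map_function([1, 2, None, 4], add_ten, na_action='ignore')
print(ignored)  # [11, 12, None, 14]

# without 'ignore', None reaches add_ten and None + 10 raises a TypeError
try:
    map_function([1, 2, None, 4], add_ten)
except TypeError as err:
    print("TypeError:", err)
```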

**Mapping to different types.** If `map` returns a column type that is different from the input column type, then `map` has to specify the returned column type. 

In [55]:
ta.Column([1,2,3,4]).map(str, dtype=dt.string)

0  '1'
1  '2'
2  '3'
3  '4'
dtype: string, length: 4, null_count: 0

**Map over Dataframes** Of course, `map` works over Dataframes, too. In this case the callable gets the whole row as a tuple. 

In [56]:
def add_unary(tup): 
    return tup[0]+tup[1]

ta.DataFrame({'a': [1,2,3], 'b': [1,2,3]}).map(add_unary , dtype = dt.int64)

0  2
1  4
2  6
dtype: int64, length: 3, null_count: 0

**Multi-parameter UDFs**. So far all our user-defined functions were unary functions. But `map` can be used for n-ary functions, too: simply specify the set of `columns` you want to pass to the n-ary function.


In [57]:
def add_binary(a,b):
    return a + b

ta.DataFrame({'a': [1,2,3], 'b': ['a', 'b', 'c'], 'c':[1,2,3]}).map(add_binary, columns = ['a','c'], dtype = dt.int64)

0  2
1  4
2  6
dtype: int64, length: 3, null_count: 0

**Multi-return UDFs.** Functions that return more than one column can be specified by returning a dataframe (also known as a struct column); providing the return type is mandatory.

In [58]:
ta.DataFrame({'a': [17, 29, 30], 'b': [3,5,11]}).map(divmod, columns= ['a','b'], dtype = dt.Struct([dt.Field('quotient', dt.int64), dt.Field('remainder', dt.int64)])) 

  index    quotient    remainder
-------  ----------  -----------
      0           5            2
      1           5            4
      2           2            8
dtype: Struct([Field('quotient', int64), Field('remainder', int64)]), count: 3, null_count: 0

**UDFs with state**. UDFs sometimes need additional precomputed state. We capture the state in a (data)class and use a method as a delegate:
 

In [59]:
def fib(n):
    if n == 0:
        return 0
    elif n == 1 or n == 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
    
from dataclasses import dataclass
@dataclass
class State:
    state: int
    def __post_init__(self):
        self.state = fib(self.state) 
    def add_fib(self, x):
        return self.state+x

m = State(10)
ta.Column([1,2,3]).map(m.add_fib)

0  56
1  57
2  58
dtype: int64, length: 3, null_count: 0

TorchArrow requires that only global functions or methods on class instances be used as user-defined functions. Lambdas, which can capture arbitrary state and are not inspectable, are not supported.

### Filter

`filter` takes a predicate and returns all those rows for which the predicate holds. Instead of the predicate, you can pass an iterable of booleans of the same length as the column. Here are both versions:

In [60]:

ta.Column([1,2,3,4]).filter([True, False, True, False]) == ta.Column([1,2,3,4]).filter(lambda x: x%2==1)




0  1
1  1
dtype: boolean, length: 2, null_count: 0

If the predicate is an n-ary function, use the  `columns` argument as we have seen for `map`.  

### Flatmap

`flatmap` combines `map` with `filter`. The callable returns a list of elements for each input row. If that list is empty, `flatmap` acts like a filter; if it is a singleton, `flatmap` acts like `map`; if it holds several elements, `flatmap` 'explodes' the input. Here is an example:

In [61]:
def selfish(words):
    return [words, words] if len(words)>=1 and words[0] == "I" else []

sf.flatmap(selfish)

0  ['I', 'am', 'fine']
1  ['I', 'am', 'fine']
dtype: List(string), length: 2, null_count: 0

`flatmap` has all the flexibility of `map`, i.e. it can take the `na_action`, `dtype` and `columns` arguments.
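On plain lists, the same behavior can be sketched with `itertools.chain`. The `flatmap` function below is a hypothetical pure-Python stand-in (not the TorchArrow method); `repeat` is an extra example function added here to show the 'explode' case.

```python
from itertools import chain


def flatmap(values, func):
    """Apply func to each value and concatenate the lists it returns."""
    return list(chain.from_iterable(func(v) for v in values))


def selfish(words):
    return [words, words] if len(words) >= 1 and words[0] == "I" else []


def repeat(n):
    # returns n copies of n: one element acts like map, several 'explode'
    return [n] * n


rows = [['hello', 'world'], ['how', 'are', 'you'], ['I', 'am', 'fine']]
doubled = flatmap(rows, selfish)
exploded = flatmap([1, 2, 3], repeat)
print(doubled)   # [['I', 'am', 'fine'], ['I', 'am', 'fine']]
print(exploded)  # [1, 2, 2, 3, 3, 3]
```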

### Reduce
`reduce` is just like Python's `reduce`. Here we compute the product of a column.

In [62]:
import operator
ta.Column([1,2,3,4]).reduce(operator.mul)

24

Even transforms can be expressed with `reduce`.  Here we define `accusum` using the internal(!) column API. The initializer creates an `_EmptyColumn` Column, at each iteration we `_append` the new value, once we are done we `_finalize` the column.
 

In [64]:
def accusum(col, val):
    if len(col) == 0:
        col._append(val)
    else:
        col._append(col[-1] + val)
    return col
    
ta.Column([1,2,3]).reduce(accusum, ta._EmptyColumn(dt.int64), lambda x: x._finalize())

0  1
1  3
2  6
dtype: int64, length: 3, null_count: 0

`reduce` can take the `na_action` and `columns` arguments as well.
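For comparison, here are both computations with Python's own `functools.reduce` on plain lists. This is a sketch of the underlying idea only: the accumulator is an ordinary list rather than a TorchArrow column, so no internal column API and no finalizer is needed.

```python
from functools import reduce
import operator

# product of all values, as in the first reduce example
product = reduce(operator.mul, [1, 2, 3, 4])
print(product)  # 24


def accusum(acc, val):
    """Extend the running-sum list acc by one element and return the new list."""
    return acc + [val if not acc else acc[-1] + val]


# the empty list is the initializer; each step appends the next partial sum
running = reduce(accusum, [1, 2, 3], [])
print(running)  # [1, 3, 6]
```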

### UDFs in C++

TorchArrow allows writing any UDF in C++. All you need to do is write the function in C++ and bind it to a name, let's call it `f`, in Python using PyBind. The foreign function `f` behaves now like a normal Python function, in particular it can be passed to any of the functional tools. A function application like `map(f)` now performs the whole computation in C++ and no longer in Python.


## Vectorized UDFs (TODO)

Vectorized functions leak TorchArrow's representation boundaries! So read the following with the big caveat that it can change quickly!

Vectorized functions get *n* strongly typed vectors as input and return *m* vectors as output. Validity handling is optional. The following assumes that all data is valid!

In [None]:

# def conditional_add(x, y, out):
#     for i, (a, e) in enumerate(zip(x, y)):
#         if a > 0:
#             out[i] = a + e
#         else:
#             out[i] = a

This code is perfect for vectorization via Numba. Leveraging Numba will require us to only add some custom attributes. (TODO)
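To make the shape of such a function concrete, here is the commented-out `conditional_add` run on plain Python lists with a preallocated output buffer. This is only a sketch of the calling convention described above; it uses neither Numba nor any TorchArrow API, and the sample inputs are made up for illustration.

```python
def conditional_add(x, y, out):
    """Write a + e into out where a > 0, otherwise copy a (all data assumed valid)."""
    for i, (a, e) in enumerate(zip(x, y)):
        if a > 0:
            out[i] = a + e
        else:
            out[i] = a


x = [1, -2, 3]
y = [10, 20, 30]
out = [0] * len(x)  # the caller allocates the output buffer
conditional_add(x, y, out)
print(out)  # [11, -2, 33]
```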

Vectorized functions can be applied using `transform`. We pass a list of data columns and return a typed list of data columns. 

In [None]:
# df = ta.transform(conditional_add, incols= ['a','b'], dtypes = [int64]) -- TODO
# df.head()

If you want to pass the underlying validity map in and/or out as well, you have to provide it in `incols` and in the output dtypes, respectively. The input and output names are called name.data and name.validity respectively. The dtype for a validity map is called nullable. So for the following transform, we pass all data and validity masks and return a validity vector as well.

In [None]:
# ta.transform(conditional_add_with_mask, incols = ['a.data','a.mask', 'b.data', 'b.mask'], dtype = [int64, nullable]) -- TODO

Assuming that nulls are handled as bitarrays of 64 bytes each, and that the return must be null if row a's value is > 0, then we can define it like so.

In [None]:
"End of tutorial"

## User defined types (TODO)
The notebook *torcharrow_user_defined_types* describes how we can modularly extend TorchArrow with new types. As an example we will use Tensors. In fact, all concrete columns follow the same paradigm. That is, we have to define a file called X_column with two classes, and add the class to the int and the two factories.