# Overview

In [1]:
from utilz import mapcat, pmap, randdf
from time import sleep
import numpy as np
import pandas as pd

## 1. Functional tools

### `mapcat`: easy concatenation of loop results

In [None]:
from utilz import mapcat

`mapcat` works just like `map`, but returns a list directly, so there is no need to call `list` afterwards:

In [16]:
def myfunc(x):
    return x * 2

mapcat(myfunc, range(10))

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
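For comparison, the same result can be produced with the built-in `map` wrapped in `list`. This is a minimal sketch in plain Python that does not use `utilz` at all:

```python
def myfunc(x):
    return x * 2

# The built-in map is lazy, so it has to be materialized with list()
result = list(map(myfunc, range(10)))
print(result)
```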

If `None` is passed in place of a function, `mapcat` flattens nested lists (at most 2 levels deep):

In [19]:
mapcat(None, [[1,2,3], [4,5,6]])

[1, 2, 3, 4, 5, 6]
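A standard-library analog of this kind of flattening is `itertools.chain.from_iterable`. The sketch below only illustrates what "flattening one level" means; how `mapcat` does it internally is not shown in this overview:

```python
from itertools import chain

nested = [[1, 2, 3], [4, 5, 6]]

# chain.from_iterable removes exactly one level of nesting
flat = list(chain.from_iterable(nested))
print(flat)
```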

If `myfunc` returns a dataframe, `mapcat` will try to concatenate the results by default:

In [21]:
def myfunc(f):
    """simulate loading a 2x3 dataframe from file"""
    return randdf(size=(2,3))

mapcat(myfunc, range(4))

| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.850319 | 0.251434 | 0.080166 |
| 1 | 0.679445 | 0.961425 | 0.800978 |
| 2 | 0.098042 | 0.111205 | 0.00452 |
| 3 | 0.939616 | 0.088765 | 0.642473 |
| 4 | 0.4543 | 0.652232 | 0.121049 |
| 5 | 0.356389 | 0.280754 | 0.193288 |
| 6 | 0.055517 | 0.394354 | 0.611836 |
| 7 | 0.298881 | 0.610756 | 0.956311 |
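In plain pandas, a comparable result can be obtained by collecting the frames in a list and calling `pd.concat`. The sketch below is an approximation under two assumptions: `load_fake` is a hypothetical stand-in for `randdf(size=(2,3))`, and `ignore_index=True` is inferred from the continuous 0 to 7 index in the output above rather than from the `utilz` source:

```python
import numpy as np
import pandas as pd

def load_fake(f):
    """Hypothetical stand-in for randdf(size=(2, 3))."""
    return pd.DataFrame(np.random.rand(2, 3), columns=["A1", "B1", "C1"])

# One 2x3 frame per input item
frames = [load_fake(f) for f in range(4)]

# ignore_index=True renumbers the rows 0..7 instead of repeating 0, 1
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```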


If `myfunc` returns an array, `mapcat` will also try to concatenate the results by default, while preserving the output shape. Because `myfunc` returns a 1d array, the final result is 2d:

In [22]:
def myfunc(f):
    """Function that returns 1d array"""
    return np.arange(3)

mapcat(myfunc, range(4))

array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])

This is equivalent to passing `axis=1`:

In [23]:
mapcat(myfunc, range(4), axis=1)

array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])

You can instead flatten the array by passing `axis=0`:

In [24]:
mapcat(myfunc, range(4), axis=0)

array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])

Or stack the results so that each one becomes a column by passing `axis=2`:

In [25]:
mapcat(myfunc, range(4), axis=2)

array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [2, 2, 2, 2]])
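The three output shapes above can be reproduced with plain NumPy. Note that the mapping between `mapcat`'s `axis` values and these NumPy calls is inferred only from the outputs shown here, so treat it as an assumption rather than a description of the implementation:

```python
import numpy as np

# Four identical 1d results, as returned by myfunc above
results = [np.arange(3) for _ in range(4)]

rows = np.stack(results, axis=0)  # shape (4, 3): one row per result
flat = np.concatenate(results)    # shape (12,): results joined end to end
cols = np.stack(results, axis=1)  # shape (3, 4): one column per result

print(rows.shape, flat.shape, cols.shape)
```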

### `pmap`: easy parallel looping

In [None]:
from utilz import pmap
from time import sleep

In [27]:
def myfunc(x):
    """Simulate expensive function"""
    sleep(1)
    return x * 2

pmap(myfunc, range(10), n_jobs=2)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
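For readers unfamiliar with parallel maps, the idea can be sketched with the standard library. This uses a thread pool purely for illustration; the backend `pmap` actually uses is not stated in this overview, and the sleep is shortened to keep the example quick:

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep

def myfunc(x):
    """Simulate an expensive function (shortened sleep for the sketch)."""
    sleep(0.05)
    return x * 2

# Two workers run concurrently; executor.map returns results in input order
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(myfunc, range(10)))

print(results)
```

With two workers, the ten calls take roughly half as long as a sequential loop would.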

You can easily pass the loop index to `myfunc` by setting `enum=True`:

In [2]:
# myfunc needs to accept an 'idx' argument
def myfunc(x, idx):
    """Simulate expensive function"""
    sleep(1)
    return x * idx

pmap(myfunc, range(10), n_jobs=2, enum=True)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
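The output above is consistent with each item being paired with its position, as the built-in `enumerate` does. A sequential sketch of the same pairing, without `utilz`:

```python
def myfunc(x, idx):
    return x * idx

items = range(10)

# enumerate yields (position, item) pairs; the position is passed as idx
results = [myfunc(x, idx=i) for i, x in enumerate(items)]
print(results)
```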

Likewise if your function uses randomization, you can set the `random_state` to reproduce parallel runs:

In [4]:
# myfunc needs to accept a 'random_state' argument
def myfunc(x, random_state=None):
    """Simulate expensive function"""
    from utilz import check_random_state

    rng = check_random_state(random_state)
    sleep(1)
    return x * rng.random()


pmap(myfunc, range(10), n_jobs=2, random_state=1)

[0.0,
 0.7026449924443589,
 1.3671148797485828,
 2.491197036621863,
 2.8137601674196255,
 4.388425659603134,
 5.805394778570049,
 4.548025497425891,
 2.529163623883595,
 4.026815575448365]

Now this second run reproduces the same values:

In [5]:
pmap(myfunc, range(10), n_jobs=2, random_state=1)

[0.0,
 0.7026449924443589,
 1.3671148797485828,
 2.491197036621863,
 2.8137601674196255,
 4.388425659603134,
 5.805394778570049,
 4.548025497425891,
 2.529163623883595,
 4.026815575448365]
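One common way to make parallel runs reproducible is to derive a separate seed for every task from a single master seed before any work starts. The sketch below shows that general idea with the standard library only; it is an assumption about the technique, not a description of how `pmap` handles `random_state`, and the helper `run` is hypothetical:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def myfunc(x, seed):
    # Each task gets a private generator, so no global state is shared
    rng = random.Random(seed)
    return x * rng.random()

def run(master_seed, n=10, workers=2):
    """Hypothetical helper: run myfunc over range(n) reproducibly."""
    # Derive one seed per task up front, so scheduling order cannot matter
    seeder = random.Random(master_seed)
    seeds = [seeder.randrange(2**32) for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(myfunc, range(n), seeds))

first = run(1)
second = run(1)
print(first == second)
```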

`pmap` also behaves like `mapcat` and will try to smartly concatenate results:

In [6]:
def myfunc(f):
    """simulate loading a 2x3 dataframe from file"""
    sleep(1)
    return randdf(size=(2,3))

pmap(myfunc, range(10))

| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.826214 | 0.744462 | 0.962846 |
| 1 | 0.614326 | 0.70131 | 0.907144 |
| 2 | 0.292599 | 0.17896 | 0.715301 |
| 3 | 0.510024 | 0.374922 | 0.788807 |
| 4 | 0.379082 | 0.949833 | 0.697136 |
| 5 | 0.461286 | 0.692534 | 0.478875 |
| 6 | 0.498815 | 0.983345 | 0.245962 |
| 7 | 0.923638 | 0.476808 | 0.599911 |
| 8 | 0.46508 | 0.821859 | 0.836529 |
| 9 | 0.627359 | 0.181799 | 0.646754 |


## 2. Decorators

Utilz decorators can be added to any function to provide convenient information or checks before or after execution.
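As a rough illustration of the general pattern, a decorator that reports something after execution can be written with `functools.wraps`. The name `timed` below is hypothetical, and this sketch is not the `utilz` implementation of any of the decorators listed in this section:

```python
import functools
import time

def timed(func):
    """Hypothetical decorator: report elapsed time after each call."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result  # the original return value passes through unchanged
    return wrapper

@timed
def double(x):
    return x * 2

value = double(21)
```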

### `show`: print the result of a function call in addition to returning it

In [None]:
# Coming soon

### `log`: print information before and after function execution

In [None]:
# Coming soon

### `timeit`: print how long a function took after it finishes

In [None]:
# Coming soon

### `maybe`: run a function only if a file doesn't exist or a directory isn't empty

In [None]:
# Coming soon

### `expensive`: cache function results to disk and return them on subsequent function calls

In [None]:
# Coming soon

## 3. Dataframe tools

Utilz makes working with dataframes a bit easier by offering **decorators** or **extra methods** without altering core pandas functionality.

### `.norm_by_group`: center, scale, or z-score separately by group

In [9]:
# Add a group col
df = randdf()
df['group'] = ['A'] * 5 + ['B'] * 5

# This is a new method!
new_df = df.norm_by_group('group', 'A1')
new_df

| | A1 | B1 | C1 | group | A1_normed_by_group |
|---|---|---|---|---|---|
| 0 | 0.750879 | 0.553232 | 0.530478 | A | 0.043798 |
| 1 | 0.570674 | 0.739102 | 0.07365 | A | -1.020684 |
| 2 | 0.594186 | 0.16877 | 0.022596 | A | -0.881793 |
| 3 | 0.820368 | 0.938592 | 0.895036 | A | 0.45427 |
| 4 | 0.981217 | 0.07717 | 0.313783 | A | 1.40441 |
| 5 | 0.950427 | 0.774752 | 0.15863 | B | 1.326601 |
| 6 | 0.414215 | 0.705356 | 0.025005 | B | -0.08282 |
| 7 | 0.172715 | 0.647416 | 0.042338 | B | -0.717596 |
| 8 | 0.683768 | 0.019522 | 0.480834 | B | 0.625696 |
| 9 | 0.007492 | 0.406587 | 0.824741 | B | -1.151881 |


You can use the `scale` and `center` args to control whether mean-centering and dividing by the standard deviation are performed (both default to `True`). The generated column name changes accordingly:

In [10]:
new_df.norm_by_group('group', 'A1', scale=False)

| | A1 | B1 | C1 | group | A1_normed_by_group | A1_centered_by_group |
|---|---|---|---|---|---|---|
| 0 | 0.750879 | 0.553232 | 0.530478 | A | 0.043798 | 0.007415 |
| 1 | 0.570674 | 0.739102 | 0.07365 | A | -1.020684 | -0.172791 |
| 2 | 0.594186 | 0.16877 | 0.022596 | A | -0.881793 | -0.149278 |
| 3 | 0.820368 | 0.938592 | 0.895036 | A | 0.45427 | 0.076903 |
| 4 | 0.981217 | 0.07717 | 0.313783 | A | 1.40441 | 0.237752 |
| 5 | 0.950427 | 0.774752 | 0.15863 | B | 1.326601 | 0.504703 |
| 6 | 0.414215 | 0.705356 | 0.025005 | B | -0.08282 | -0.031509 |
| 7 | 0.172715 | 0.647416 | 0.042338 | B | -0.717596 | -0.273008 |
| 8 | 0.683768 | 0.019522 | 0.480834 | B | 0.625696 | 0.238045 |
| 9 | 0.007492 | 0.406587 | 0.824741 | B | -1.151881 | -0.438231 |
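The same kind of per-group normalization can be sketched in plain pandas with `groupby` and `transform`. The example below reuses the `A1` and `group` values from the tables above; the output column names are illustrative, and the use of the sample standard deviation (pandas' default, `ddof=1`) is an assumption that happens to agree with the values shown to within rounding:

```python
import pandas as pd

df = pd.DataFrame({
    "A1": [0.750879, 0.570674, 0.594186, 0.820368, 0.981217,
           0.950427, 0.414215, 0.172715, 0.683768, 0.007492],
    "group": ["A"] * 5 + ["B"] * 5,
})

grouped = df.groupby("group")["A1"]
# transform broadcasts each group's statistic back to the original rows
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")  # sample std, ddof=1

# Center: subtract the group mean
df["A1_centered"] = df["A1"] - group_mean
# Z-score: center, then divide by the group standard deviation
df["A1_zscored"] = df["A1_centered"] / group_std

print(df.round(6))
```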


## 4. Miscellaneous tools

### `randdf`: quickly generate random data

In [None]:
from utilz import randdf

In [None]:
randdf()

| | A1 | B1 | C1 |
|---|---|---|---|
| 0 | 0.82323 | 0.435518 | 0.759073 |
| 1 | 0.202369 | 0.192461 | 0.67063 |
| 2 | 0.082511 | 0.985877 | 0.512224 |
| 3 | 0.36007 | 0.860255 | 0.115435 |
| 4 | 0.882193 | 0.536968 | 0.587085 |
| 5 | 0.71445 | 0.339533 | 0.050096 |
| 6 | 0.520111 | 0.030953 | 0.688294 |
| 7 | 0.385747 | 0.140701 | 0.842956 |
| 8 | 0.781974 | 0.435973 | 0.921362 |
| 9 | 0.603653 | 0.24525 | 0.075831 |
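A similar frame can be built by hand with NumPy and pandas. The helper `fake_randdf` below is hypothetical; the 10x3 default shape and the `A1`, `B1`, `C1` column names mirror the output above, while the uniform distribution on [0, 1) is an assumption based on the range of the displayed values:

```python
import numpy as np
import pandas as pd

def fake_randdf(size=(10, 3)):
    """Hypothetical stand-in for randdf: uniform random values in [0, 1)."""
    n_rows, n_cols = size
    # Column names A1, B1, C1, ... mirror the output shown above
    columns = [f"{chr(ord('A') + i)}1" for i in range(n_cols)]
    return pd.DataFrame(np.random.rand(n_rows, n_cols), columns=columns)

demo = fake_randdf()
print(demo.shape)
```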
