# Model description features

In [1]:
import formulae as fm
import numpy as np
import pandas as pd

## Operator precedence

All binary operators are left-associative, but some of them bind tighter than others. This is a list of binary operators sorted from lowest to highest precedence.

* `~` separates the response term from the explanatory variables. There can be at most one per formula.
* `|` is the random effect operator, as in the `lme4` R package. It does not strictly require parentheses, but it is almost always used with them.
* `+` and `-` add and remove terms. `+` can be thought of as a set union operator, while `-` removes a term from the set if it is present and leaves the set unchanged otherwise.
* `*` and `/`. `a*b` is shorthand for `a + b + a:b` and `a/b` is shorthand for `a + a:b`. However, `/` behaves differently with grouped terms (i.e. more than one term within parentheses).
* `:` indicates interaction between operands.
* `**`, as in Patsy, takes a set of terms on the left and a positive integer `n` on the right, and computes all interactions between the terms in the set up to order `n`.
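
The set-like behaviour of these operators can be sketched in plain Python. This is only an illustration and not how `formulae` is implemented: terms are modelled as `frozenset`s of variable names, a formula is an ordered list of terms, and the helper names (`term`, `add`, `remove`, `interact`, `star`, `slash`) are hypothetical.

```python
def term(*names):
    """A term is the set of variables it involves, so `a:b` is {a, b}."""
    return frozenset(names)

def add(terms, new):
    """`+`: set union that keeps the order in which terms first appear."""
    out = []
    for t in list(terms) + list(new):
        if t not in out:
            out.append(t)
    return out

def remove(terms, old):
    """`-`: drop the terms that are present and ignore the ones that are not."""
    return [t for t in terms if t not in old]

def interact(left, right):
    """`:` between two single terms; `a:a` collapses to `a`."""
    return left | right

def star(a, b):
    """`a*b` for single terms: `a + b + a:b`."""
    return add([a, b], [interact(a, b)])

def slash(a, b):
    """`a/b` for single terms: `a + a:b`."""
    return add([a], [interact(a, b)])

a, b, c = term("a"), term("b"), term("c")
print(star(a, b))           # a, b and the interaction a:b
print(slash(a, b))          # a and the interaction a:b
print(remove([a, b], [c]))  # unchanged, because c was never there
```

The sketch only covers single terms on each side; grouped terms need the extra rules discussed later in this document.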


## Implicit intercept

As in R, Patsy, and most other formula implementations, `formulae` assumes there is an implicit intercept. It can be removed with `0` or `-1`.

In [2]:
fm.model_description('y ~ x')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= x
    variable= x
    kind= None
  )
)

In [3]:
fm.model_description('y ~ 0 + x') # same with -1

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  Term(
    name= x
    variable= x
    kind= None
  )
)

## Group specific terms (a.k.a random terms)

One of the main differences between Patsy and this implementation is that you can pass group-specific terms to the model formula using the same syntax as in the `lme4` R package. As in `lme4`, parentheses are optional and there is an implicit intercept too. Thus, `(x | g)` is shorthand for `(1 | g) + (x | g)`.

In [4]:
fm.model_description('1|x')

RandomTerm(
  expr= InterceptTerm(),
  factor= Term(
    name= x
    variable= x
    kind= None
  )
)

Note that without the parentheses in the next example, formulae would take the LHS of the `|` operator to be `a + 1`, because `|` binds more loosely than `+`. That's why you will almost always see parentheses around random terms.

In [5]:
fm.model_description('a + (1|x)')

ModelTerms(
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  RandomTerm(
    expr= InterceptTerm(),
    factor= Term(
      name= x
      variable= x
      kind= None
    )
  )
)

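The operator distributes over `+` on its right-hand side, so `(x | g1 + g2)` produces one set of random terms for `g1` and another for `g2`.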

In [6]:
fm.model_description('(x | g1 + g2)')

ModelTerms(
  InterceptTerm(),
  RandomTerm(
    expr= InterceptTerm(),
    factor= Term(
      name= g1
      variable= g1
      kind= None
    )
  ),
  RandomTerm(
    expr= InterceptTerm(),
    factor= Term(
      name= g2
      variable= g2
      kind= None
    )
  ),
  RandomTerm(
    expr= Term(
      name= x
      variable= x
      kind= None
    ),
    factor= Term(
      name= g1
      variable= g1
      kind= None
    )
  ),
  RandomTerm(
    expr= Term(
      name= x
      variable= x
      kind= None
    ),
    factor= Term(
      name= g2
      variable= g2
      kind= None
    )
  )
)
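
The expansion printed above can be mimicked with a few lines of plain Python. This is a rough sketch, not the library's code: `expand_group_term` is a hypothetical helper, each random term is represented as an `(expr, factor)` tuple, and the string `"1"` stands for the intercept.

```python
def expand_group_term(exprs, groups, intercept=True):
    """Expand `(exprs | groups)` into one (expr, factor) pair per combination.

    With `intercept=True` the implicit `1` is placed first, mirroring how
    `(x | g)` stands for `(1 | g) + (x | g)`.
    """
    lhs = (["1"] if intercept else []) + [e for e in exprs if e != "1"]
    return [(e, g) for e in lhs for g in groups]

# (x | g1 + g2)
print(expand_group_term(["x"], ["g1", "g2"]))

# (0 + a | g): the implicit intercept is removed
print(expand_group_term(["a"], ["g"], intercept=False))
```

The outer loop runs over the left-hand side, so all intercept terms come before the slope terms, which is the order shown in the output of the cell above.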

## Add and remove terms

Just a simple demonstration, not much fun going on here. Nothing happens with `c` because it is not in the model specification.

In [7]:
fm.model_description('y ~ a + b - c')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  )
)

Here `c` is added first and then removed, so we don't see it either.

In [8]:
fm.model_description('y ~ a + c + b - c')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  )
)

But since the operators are left-associative, the formula is evaluated from left to right: there is no `c` yet when we try to remove it, and it ends up in the model because we add it at the end.

In [9]:
fm.model_description('y ~ a - c + b + c')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  Term(
    name= c
    variable= c
    kind= None
  )
)

Below we're going to see better usages for the `-` operator.
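
The three examples above come down to a left-to-right fold over the `+` and `-` operators. A minimal sketch, assuming a formula right-hand side that is already split into whitespace-separated tokens (`apply_additive` is a hypothetical helper, not part of `formulae`, and the intercept is ignored):

```python
def apply_additive(tokens):
    """Evaluate alternating terms and '+'/'-' operators from left to right."""
    terms = [tokens[0]]
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "+":
            # Set union: only add the term if it is not there yet.
            if name not in terms:
                terms.append(name)
        elif op == "-":
            # Remove the term if present, otherwise do nothing.
            if name in terms:
                terms.remove(name)
        else:
            raise ValueError(f"unexpected operator: {op!r}")
    return terms

print(apply_additive("a + b - c".split()))      # ['a', 'b']
print(apply_additive("a + c + b - c".split()))  # ['a', 'b']
print(apply_additive("a - c + b + c".split()))  # ['a', 'b', 'c']
```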

## Interactions
Interactions of a term with itself return the term unchanged, i.e. `a:a` equals `a`.

### `:` operator

In [10]:
fm.model_description('y ~ a:b + c:d')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  InteractionTerm(
    name= a:b
    variables= {'b', 'a'}
  ),
  InteractionTerm(
    name= c:d
    variables= {'d', 'c'}
  )
)

### `*` operator


In [11]:
fm.model_description('y ~ 0 + a*b + c*d')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  InteractionTerm(
    name= a:b
    variables= {'b', 'a'}
  ),
  Term(
    name= c
    variable= c
    kind= None
  ),
  Term(
    name= d
    variable= d
    kind= None
  ),
  InteractionTerm(
    name= c:d
    variables= {'d', 'c'}
  )
)

### `/` operator

The behavior of this operator has a special case we'll see below when using grouped terms.

In [12]:
fm.model_description('y ~ 0 + a/b + c/d')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  Term(
    name= a
    variable= a
    kind= None
  ),
  InteractionTerm(
    name= a:b
    variables= {'b', 'a'}
  ),
  Term(
    name= c
    variable= c
    kind= None
  ),
  InteractionTerm(
    name= c:d
    variables= {'d', 'c'}
  )
)

## Power operator

This operator can be used with a single term, in which case it returns the term unchanged, or with a set of terms, in which case it computes the interactions between the terms in the set up to order `n`. In both cases `n` must be a positive integer.

In [13]:
fm.model_description('a**3')

ModelTerms(
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  )
)

In [14]:
fm.model_description('(a + b + c)**3')

ModelTerms(
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  Term(
    name= c
    variable= c
    kind= None
  ),
  InteractionTerm(
    name= a:b
    variables= {'b', 'a'}
  ),
  InteractionTerm(
    name= a:c
    variables= {'a', 'c'}
  ),
  InteractionTerm(
    name= b:c
    variables= {'b', 'c'}
  ),
  InteractionTerm(
    name= a:b:c
    variables= {'b', 'a', 'c'}
  )
)
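
One way to picture what `**` does is with `itertools.combinations`. The function below is a sketch under that assumption, not the library's implementation; `power` is a hypothetical name and terms are plain strings.

```python
from itertools import combinations

def power(names, n):
    """Return all interactions among `names` up to order `n`."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    terms = []
    # An interaction cannot involve more variables than there are in the set.
    for order in range(1, min(n, len(names)) + 1):
        for combo in combinations(names, order):
            terms.append(":".join(combo))
    return terms

print(power(["a"], 3))            # ['a']
print(power(["a", "b", "c"], 2))  # ['a', 'b', 'c', 'a:b', 'a:c', 'b:c']
print(power(["a", "b", "c"], 3))  # adds 'a:b:c' to the previous list
```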

## Function calls

`formulae` detects when you want to transform one of the terms using a function. Currently, it just returns a `CallTerm` object that holds the name of the function to be called and the arguments passed. It does not check whether the arguments are valid Python code; that check happens later, when the code is sent to the Python interpreter. The `special` attribute is not used yet.

In [15]:
fm.model_description('y ~ center(x) + d')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  CallTerm(
    call=center(x),
    args=[Variable(x)],
    special=False
  ),
  Term(
    name= d
    variable= d
    kind= None
  )
)

You can also have calls on the left side of the formula. Of course, function names will have to be bound to a value for calls to work.
Note: the response is a `CallTerm` wrapped in a `ResponseTerm`.

In [16]:
fm.model_description('np.log(y) ~ center(x) + d')

ModelTerms(
  ResponseTerm(
    CallTerm(
      call=np.log(y),
      args=[Variable(y)],
      special=False
    )
  ),
  InterceptTerm(),
  CallTerm(
    call=center(x),
    args=[Variable(x)],
    special=False
  ),
  Term(
    name= d
    variable= d
    kind= None
  )
)

## Some examples to see associativity rules

In [17]:
fm.model_description('y ~ a * (b + c)')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  Term(
    name= c
    variable= c
    kind= None
  ),
  InteractionTerm(
    name= a:b
    variables= {'b', 'a'}
  ),
  InteractionTerm(
    name= a:c
    variables= {'a', 'c'}
  )
)

In [18]:
fm.model_description('y ~ (a+b)*(c+d)')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  Term(
    name= c
    variable= c
    kind= None
  ),
  Term(
    name= d
    variable= d
    kind= None
  ),
  InteractionTerm(
    name= a:c
    variables= {'a', 'c'}
  ),
  InteractionTerm(
    name= a:d
    variables= {'a', 'd'}
  ),
  InteractionTerm(
    name= b:c
    variables= {'b', 'c'}
  ),
  InteractionTerm(
    name= b:d
    variables= {'b', 'd'}
  )
)

See the following about the `/` operator  

`a / (b + c)` is equivalent to `a + a:b + a:c`

In [19]:
fm.model_description('y ~ a / (b+c)')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  InteractionTerm(
    name= a:b
    variables= {'b', 'a'}
  ),
  InteractionTerm(
    name= a:c
    variables= {'a', 'c'}
  )
)

but `(a + b) / c` is not equivalent to `a + a:c + b + b:c`, i.e. `/` is not leftward distributive over `+`. Instead, it expands to `a + b + a:b:c`. The [Patsy documentation](https://patsy.readthedocs.io/en/latest/formulas.html) explains this behavior (nested terms and S/R conventions).

In [20]:
fm.model_description('y ~ (a + b) / c')

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  InteractionTerm(
    name= a:b:c
    variables= {'b', 'a', 'c'}
  )
)
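
The asymmetry of `/` is easier to see in code. In this sketch (hypothetical helpers `slash` and `labels`, with terms as `frozenset`s of variable names, not the library's implementation), the left operand is kept as is and every term on the right is interacted with all the variables on the left at once.

```python
def slash(left, right):
    """Expand `left / right`, where both sides are lists of terms."""
    # All variables on the left act together as a single nesting factor.
    nest = frozenset().union(*left)
    out = list(left)
    for t in right:
        nested = nest | t
        if nested not in out:
            out.append(nested)
    return out

def labels(terms):
    """Readable names, with the variables of each term sorted alphabetically."""
    return [":".join(sorted(t)) for t in terms]

a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
print(labels(slash([a], [b, c])))  # ['a', 'a:b', 'a:c']
print(labels(slash([a, b], [c])))  # ['a', 'b', 'a:b:c']
```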

In [21]:
fm.model_description("(x + y) * u * v")

ModelTerms(
  InterceptTerm(),
  Term(
    name= x
    variable= x
    kind= None
  ),
  Term(
    name= y
    variable= y
    kind= None
  ),
  Term(
    name= u
    variable= u
    kind= None
  ),
  InteractionTerm(
    name= x:u
    variables= {'u', 'x'}
  ),
  InteractionTerm(
    name= y:u
    variables= {'y', 'u'}
  ),
  Term(
    name= v
    variable= v
    kind= None
  ),
  InteractionTerm(
    name= x:v
    variables= {'x', 'v'}
  ),
  InteractionTerm(
    name= y:v
    variables= {'y', 'v'}
  ),
  InteractionTerm(
    name= u:v
    variables= {'u', 'v'}
  ),
  InteractionTerm(
    name= x:u:v
    variables= {'u', 'x', 'v'}
  ),
  InteractionTerm(
    name= y:u:v
    variables= {'y', 'u', 'v'}
  )
)
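
Because `*` is left-associative, `(x + y) * u * v` is evaluated as `((x + y) * u) * v`. The sketch below reproduces the term order printed above; `star` and `interact` are hypothetical helpers, terms are tuples of variable names, and none of this is taken from the `formulae` source.

```python
def interact(left, right):
    """`:` between two terms, keeping variables in order of appearance."""
    return left + tuple(name for name in right if name not in left)

def star(left, right):
    """Expand `left * right` for two lists of terms: left + right + left:right."""
    candidates = left + right + [interact(l, r) for l in left for r in right]
    out, seen = [], set()
    for t in candidates:
        key = frozenset(t)  # a:b and b:a are the same term
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out

x, y, u, v = ("x",), ("y",), ("u",), ("v",)
first = star([x, y], [u])   # (x + y) * u
second = star(first, [v])   # ... * v
print([":".join(t) for t in second])
```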

## Some group level effects specifications

In [22]:
fm.model_description("y ~ a + b + (0 + a | g) + (b | g)")

ModelTerms(
  ResponseTerm(
    Term(
      name= y
      variable= y
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= a
    variable= a
    kind= None
  ),
  Term(
    name= b
    variable= b
    kind= None
  ),
  RandomTerm(
    expr= ModelTerms(
      Term(
        name= a
        variable= a
        kind= None
      )
    ),
    factor= Term(
      name= g
      variable= g
      kind= None
    )
  ),
  RandomTerm(
    expr= InterceptTerm(),
    factor= Term(
      name= g
      variable= g
      kind= None
    )
  ),
  RandomTerm(
    expr= Term(
      name= b
      variable= b
      kind= None
    ),
    factor= Term(
      name= g
      variable= g
      kind= None
    )
  )
)

`(x1 + x2 + x3) ** 2` computes the main effects and all the pairwise interactions between the terms within the parentheses. We then remove `x2:x3` and `x1`.

In [23]:
fm.model_description("np.sqrt(y) ~ -1 + (x1 + x2 + x3) ** 2 - x2:x3 - x1")

ModelTerms(
  ResponseTerm(
    CallTerm(
      call=np.sqrt(y),
      args=[Variable(y)],
      special=False
    )
  ),
  Term(
    name= x2
    variable= x2
    kind= None
  ),
  Term(
    name= x3
    variable= x3
    kind= None
  ),
  InteractionTerm(
    name= x1:x2
    variables= {'x1', 'x2'}
  ),
  InteractionTerm(
    name= x1:x3
    variables= {'x1', 'x3'}
  )
)

# Design matrices generation

In [24]:
SIZE = 20
CNT = 20
data = pd.DataFrame(
    {
        'x': np.random.normal(size=SIZE), 
        'y': np.random.normal(size=SIZE),
        '$2#abc': np.random.normal(size=SIZE),
        'CNT': np.random.normal(size=SIZE),
        'g1': np.random.choice(['a', 'b', 'c'], SIZE),
        'g2': pd.Categorical(np.random.choice(['a', 'b', 'c'], SIZE), ordered=True, categories=['c', 'b', 'a'])
    }
)
design = fm.DesignMatrices(fm.model_description('y ~ x + np.exp(`$2#abc` + CNT) + g2'), data)

In [25]:
print(str(design.response) + "\n\n" + str(design.response.data))

ResponseVector(name=y, type=numeric, length=20)

[[ 0.89100042]
 [-0.01813171]
 [ 1.11452042]
 [-1.37771093]
 [-0.91751957]
 [ 1.8082413 ]
 [-1.020362  ]
 [-1.12268083]
 [ 0.93222868]
 [ 0.88494862]
 [-0.4728115 ]
 [-0.12212859]
 [-1.39852753]
 [-1.03228612]
 [ 1.02218545]
 [-1.25255783]
 [ 0.0759756 ]
 [ 0.06015534]
 [-1.45300427]
 [-0.50088228]]


In [26]:
print(str(design.common) + "\n\n" + str(design.common.data))

CommonEffectsMatrix(
  shape=(20, 5),
  terms={
    'Intercept': {type=Intercept, cols=slice(0, 1, None)},
    'x': {type=numeric, cols=slice(1, 2, None)},
    'np.exp(`$2#abc` + CNT)': {type=call, cols=slice(2, 3, None)},
    'g2': {type=categoric, cols=slice(3, 5, None)}
  }
)

[[ 1.          0.0762068   1.09531007  1.          0.        ]
 [ 1.         -0.80419833  3.11490019  0.          1.        ]
 [ 1.          2.47725592  0.0240628   0.          0.        ]
 [ 1.          0.22365899  5.18640881  0.          1.        ]
 [ 1.         -0.65244315  2.9937765   0.          0.        ]
 [ 1.         -0.90242039  0.87364354  0.          1.        ]
 [ 1.         -0.52465383  1.16439738  1.          0.        ]
 [ 1.          0.44967259  0.64547803  0.          0.        ]
 [ 1.         -0.36922561  1.83175899  0.          0.        ]
 [ 1.          1.26657769  2.28181355  0.          1.        ]
 [ 1.          0.40110413  1.50714683  1.          0.        ]
 [ 1.          0.52384051 