# Formula Notation
This document is a brief guide to writing formulas. Formula support is provided by the [Formulaic Python package](https://matthewwardrop.github.io/formulaic/latest/).

For the curious, the notation used in formulas is based on Wilkinson notation, which was introduced by Wilkinson and Rogers in their paper [Symbolic Description of Factorial Models for Analysis of Variance](https://www.jstor.org/stable/2346786).

## Basic Notation
Formulas have the general form `formula = 'y ~ x + a + b'`, where `y` is the response (or dependent) variable and `x`, `a`, and `b` are the regressors (or independent variables).

Note that a tilde `~` is used instead of an equals sign `=`. Regressors are separated by a plus sign `+`.
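To make the structure of a formula concrete, the sketch below splits a simple formula string into its response and regressors. It is an illustration only: `split_formula` is a hypothetical helper, not part of Formulaic, and it handles only `~` and `+`, not the full notation.

```python
def split_formula(formula: str) -> tuple[str, list[str]]:
    """Split a simple 'response ~ term + term' formula into its parts.

    Illustration only: handles just `~` and `+`, not the full notation.
    """
    # Everything left of the tilde is the response; the rest are regressors.
    response, _, rhs = formula.partition("~")
    regressors = [term.strip() for term in rhs.split("+")]
    return response.strip(), regressors


response, regressors = split_formula("y ~ x + a + b")
print(response)    # y
print(regressors)  # ['x', 'a', 'b']
```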

## Transformations
Formulas can also transform variables before they enter the model.

To take the log of a variable, write `log(x)` as in `formula = 'y ~ log(x) + a + b'`.

The following functions are available inside a formula:
- log
- exp
- floor
- ceil
- trunc
- absolute
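As a rough illustration of what these transformations do to a column of data, the sketch below applies standard-library stand-ins element by element. The mapping is an assumption made for this example: it treats `log` as the natural logarithm and `absolute` as the absolute value, and `transform` is a hypothetical helper rather than the functions the formula engine actually calls.

```python
import math

# Assumed standard-library stand-ins for the function names listed above.
FUNCTIONS = {
    "log": math.log,      # natural logarithm (assumption)
    "exp": math.exp,
    "floor": math.floor,  # round toward negative infinity
    "ceil": math.ceil,    # round toward positive infinity
    "trunc": math.trunc,  # drop the fractional part (round toward zero)
    "absolute": abs,
}


def transform(name, column):
    """Apply the named function to every value in a column."""
    func = FUNCTIONS[name]
    return [func(value) for value in column]


print(transform("log", [1.0, math.e]))        # [0.0, 1.0]
print(transform("floor", [-1.5, 2.5]))        # [-2, 2]
print(transform("trunc", [-1.5, 2.5]))        # [-1, 2]
print(transform("absolute", [-3.0, 4.0]))     # [3.0, 4.0]
```

The `floor` and `trunc` lines show where the two differ: they agree for positive values but not for negative ones.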

## Indicator Variables for Categorical Variables
To use categorical variables in a regression, the variables must be recoded into separate indicator variables for each category. Formulas make recoding categorical variables easy.

To use a categorical variable in a regression, write `C(X)` as in `formula = 'y ~ C(X) + a + b'`. This automatically creates the indicator variables for the categories in `X`, measured against a reference category.

To specify the reference category, write `C(X, contr.treatment("x"))`, where `x` is the name of the desired reference category in `X`.
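The sketch below illustrates the idea behind treatment (indicator) coding with a reference category. It is a conceptual example, not Formulaic's implementation: `indicator_columns` is a hypothetical helper, the column labels are made up, and using the first category in sorted order as the default reference is an assumption of this example.

```python
def indicator_columns(values, reference=None):
    """Build 0/1 indicator columns for every category except the reference."""
    categories = sorted(set(values))
    if reference is None:
        # Assumption for this sketch: default to the first sorted category.
        reference = categories[0]
    return {
        f"X[{category}]": [1 if value == category else 0 for value in values]
        for category in categories
        if category != reference
    }


X = ["red", "green", "blue", "green"]

print(indicator_columns(X))
# {'X[green]': [0, 1, 0, 1], 'X[red]': [1, 0, 0, 0]}

print(indicator_columns(X, reference="green"))
# {'X[blue]': [0, 0, 1, 0], 'X[red]': [1, 0, 0, 0]}
```

The reference category gets no column of its own. A row belonging to it has zeros in every indicator column, so the other categories are interpreted relative to it.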

For more details on categorical encoding, see the Formulaic documentation on [Categorical Encoding](https://matthewwardrop.github.io/formulaic/latest/guides/contrasts/).
