# Low-Level API [under construction]

<img src="../_static/images/banner/pipes.png" class="banner-photo"/>

## Object-Relational Model (ORM)

The Low-Level API is an *object-relational model* for machine learning. Each class in the [ORM](http://docs.peewee-orm.com/en/latest/peewee/models.html) maps to a table in a SQLite database that serves as a machine learning *metastore*. 

The real power lies in the relationships between these objects (e.g. `Label`→`Splitset`←`Feature` and `Queue`→`Job`→`Predictor`→`Prediction`), which enable us to construct rule-based protocols for various types of data and analysis.

Goodbye, *X_train, y_test*. Hello, object-oriented machine learning.

---

## 1. Dataset

![Datasets](../_static/images/api/dimensions.png)

The `Dataset` class provides the following subclasses for working with different types of data:

Type | Dimensionality | Supported Formats | Format (if ingested)
--- | --- | --- | ---
**Tabular** | 2D | Files (Parquet, CSV, TSV) / Pandas DataFrame (in-memory) | Parquet
**Sequence** | 3D | NumPy (in-memory ndarray, npy file) | npy 
**Image** | 4D | NumPy (in-memory ndarray, npy file) / Pillow-supported formats | npy

> The names are merely suggestive, as the primary purpose of these subclasses is to provide a way to register data of known dimensionality.  For example, a practitioner could ingest many uni-channel/ grayscale images as a 3D Sequence Dataset instead of a multi-channel 4D Image Dataset.

> *Why not 2D NumPy?* The `Dataset.Tabular` class is intended for strict, column-specific dtypes and Parquet persistence upon ingestion. In practice, this conflicted too often with NumPy's array-wide dtyping. We use the best tools for the job (df/pq for 2D) and (array/npy for ND).

---

### 1a. Registration

*Most of the Dataset registration methods share these arguments:*

Argument | Description
--- | ---
**ingest** | Determines whether raw data is stored directly inside the metastore or remains on disk to be accessed via path/url. *In-memory* data like DataFrames and ndarrays must be ingested, whereas *file-based* data like Parquet, NPY, and image folders/urls may remain remote. Regardless of whether or not the raw data is ingested, metadata is always derived from it by parsing: 2D via DataFrame and N-D via ndarray.
**rename_columns** | Useful for assigning column names to arrays or delimited files that would otherwise be unnamed. `len(rename_columns)` must match the number of columns in the raw data. Normally, an int-based range is assigned to unnamed columns. In this case, AIQC converts each column name to a string e.g. '1' during the registration process.
**retype** | Change the dtype of data using [np.types](https://numpy.org/doc/stable/user/basics.types.html). All Dataset subclasses support mass typing via `np.type`/ `str(np.type)`. Only the Tabular subclass supports individual column retyping via `dict(column=str(np.type))`. If `rename_columns` is used in conjunction with `retype=dict()`, then each `dict['column']` key must match its counterpart in `rename_columns`.
**description** | What information does this dataset contain? What is unique about this dataset/ version -- did you edit the raw data, add rows, or change column names/ dtypes?
**name** | Triggers dataset *versioning*. Datasets that share a name will be assigned an auto-incrementing `version:int` number provided that they are not duplicates of each other based on a `sha256_hexdigest:str` hash. If you try to create an exact duplicate, it will warn you and `return` the matching duplicate instead of creating a new entity. This behavior makes it easy to rerun pipelines where Datasets are created inline.
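
The versioning rule for `name` can be sketched with the standard library alone. This is a minimal illustration, not AIQC's implementation: the in-memory `registry` dict and the `register()` helper are hypothetical stand-ins for the metastore, version numbering from 1 is an assumption, and the hash is taken over raw bytes rather than the compressed data that AIQC hashes.

```python
import hashlib

# Hypothetical in-memory stand-in for the metastore: name -> list of records.
registry = {}

def register(name: str, raw_bytes: bytes) -> dict:
    """Return the matching record for an exact duplicate, else create the next version."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    versions = registry.setdefault(name, [])
    for record in versions:
        if record["sha256_hexdigest"] == digest:
            print(f"Warning: duplicate of version {record['version']}; returning it.")
            return record
    # Assumption: versions auto-increment starting from 1.
    record = {"name": name, "version": len(versions) + 1, "sha256_hexdigest": digest}
    versions.append(record)
    return record

first = register("iris", b"sepal,petal\n5.1,1.4\n")
edited = register("iris", b"sepal,petal\n5.1,1.5\n")  # content changed, so a new version
rerun = register("iris", b"sepal,petal\n5.1,1.4\n")   # exact duplicate, so the first record
```

Because the duplicate is returned rather than recreated, rerunning a pipeline that registers the same data inline does not inflate the version count.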

*Ingestion provides the following benefits, especially for entry-level users:*

- Persist in-memory datasets (Pandas DataFrames, NumPy ndarrays).
- Keeps data coupled with the experiment in the portable SQLite file.
- Provides a more immutable and out-of-the-way storage location in comparison to a laptop file system.
- Encourages preserving tabular dtypes with the ecosystem-friendly Parquet format.

*Why would I avoid ingestion?*

- Happy with where the original data lives: e.g. S3 bucket.
- Don't want to duplicate the data.

> *sha256?* -- It's the one-way hash algorithm that GitHub aspires to upgrade to. AIQC runs it on compressed data because it's easier and probably less error-prone than intercepting the bytes of the *fastparquet* intermediary tables before appending the Parquet magic bytes.

> *Is SQLite a legitimate datastore?* -- In many cases, SQLite queries are faster than accessing data via a filesystem. It's a stable, 22-year-old technology that serves as the default database for iOS e.g. Apple Photos. AIQC uses it to store raw data in byte format as a BlobField. I've stored tens-of-thousands of files in it over several years and never experienced corruption. Keep in mind that AWS S3 is a blob store, and the Microsoft equivalent service is literally called Azure *Blob* Storage. The max size of a BlobField is 2GB, so ~20GB after compression. Either way, the goal of machine learning isn't to record the entire population within the weights of a neural network, it's to find subsets that are representative of the broader population.

---

### 1ai. `Dataset.Tabular`

Here are some of the ways practitioners can use this 2D structure:

||||
|---|---|---|
Multiple subjects (1 row per sample) | * | Multi-variate 1D (1 col per attribute)
Single subject (1 row per timestamp) | * | Multi-variate 1D (1 col per attribute)
Multiple subjects (1 row per timestamp) | * | Uni-variate 0D (1 col per sample)

> Tabular datasets may contain both features and labels

---

#### └── `Dataset.Tabular.from_df()`

```python
dataset = Dataset.Tabular.from_df(
    dataframe
    , rename_columns
    , retype
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**dataframe** | DataFrame | Required | [pd.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas-dataframe) with int-based single index. DataFrames are always ingested.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration)
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration)
**description** | str | None | See [Registration](#1a.-Registration)
**name** | str | None | See [Registration](#1a.-Registration)

---

#### └── `Dataset.Tabular.from_path()`

```python
Dataset.Tabular.from_path(
    file_path
    , ingest
    , rename_columns
    , retype
    , header
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**file_path** | str | Required | Parsed based on how the file name ends (.parquet, .tsv, .csv)
**ingest** | bool | True | See [Registration](#1a.-Registration). Defaults to True because I don't want to rely on CSV files as a source of truth for dtypes, and compression works great in Parquet.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration)
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration)
**header** | object | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration)
**name** | str | None | See [Registration](#1a.-Registration)

---

### 1aii. `Dataset.Sequence`

Here are some of the ways practitioners can use this 3D structure:


||||
|---|---|---|
Single subject (1 patient) | * | Multiple 2D sequences
Multiple subjects | * | Single 2D sequence

> Sequence datasets are somewhat multi-modal in that, in order to perform supervised learning on them, they must eventually be paired with a `Dataset.Tabular` that acts as its `Label`.

---

#### └── `Dataset.Sequence.from_numpy()`

```python
Dataset.Sequence.from_numpy(
    arr3D_or_npyPath
    , ingest
    , rename_columns
    , retype        
    , description
    , name           
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**arr3D_or_npyPath** | object / str | Required | 3D array in the form of either an [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or [npy](https://numpy.org/doc/stable/reference/generated/numpy.save.html) file path
**ingest** | bool | None | See [Registration](#1a.-Registration). If left blank, ndarrays will be ingested and npy will not. Errors if ndarray and False.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration)
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration)
**description** | str | None | See [Registration](#1a.-Registration)
**name** | str | None | See [Registration](#1a.-Registration) 

---

### 1aiii. `Dataset.Image`

Here are some of the ways practitioners can use this 4D structure:

||||
|---|---|---|
Single subject (1 patient) | * | Multiple 3D images
Multiple subjects | * | Single 3D image


Users can ingest 4D data using either:
- [The Pillow library, which supports various formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html)
- Or NumPy arrays as a simple alternative


> Image datasets are somewhat multi-modal in that, in order to perform supervised learning on them, they must eventually be paired with a `Dataset.Tabular` that acts as its `Label`.

---

#### └── `Dataset.Image.from_numpy()`

```python
Dataset.Image.from_numpy(
    arr4D_or_npyPath
    , ingest
    , rename_columns
    , retype        
    , description
    , name       
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**arr4D_or_npyPath** | object / str | Required | 4D array in the form of either an [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) or [npy](https://numpy.org/doc/stable/reference/generated/numpy.save.html) file path
**ingest** | bool | None | See [Registration](#1a.-Registration). If left blank, ndarrays will be ingested and npy will not. Errors if ndarray and False.
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration) 
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration) 
**name** | str | None | See [Registration](#1a.-Registration) 

---

#### └── `Dataset.Image.from_folder()`

```python
Dataset.Image.from_folder(
    folder_path
    , ingest
    , rename_columns
    , retype
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**folder_path** | str | Required | Folder of images to be ingested via Pillow. All images must be cropped to the same dimensions ahead of time.
**ingest** | bool | False | See [Registration](#1a.-Registration) 
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration) 
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration) 
**name** | str | None | See [Registration](#1a.-Registration) 

---

#### └── `Dataset.Image.from_urls()`

```python
Dataset.Image.from_urls(
    urls
    , source_path
    , ingest
    , rename_columns
    , retype
    , description
    , name
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**urls** | list(str) | Required | URLs that point to an image to be ingested via Pillow. All images must be cropped to the same dimensions ahead of time.
**source_path** | str  | None | Optionally record a shared directory, bucket, or FTP site where images are stored. The backend won't use this information for anything.
**ingest** | bool | False | See [Registration](#1a.-Registration)
**rename_columns** | list[str]  | None | See [Registration](#1a.-Registration) 
**retype** | np.type / dict(column:np.type) | None | See [Registration](#1a.-Registration) 
**description** | str | None | See [Registration](#1a.-Registration) 
**name** | str | None | See [Registration](#1a.-Registration) 

---

### 1b. Fetch

The following methods are exposed to end-users in case they want to inspect the data that they have ingested.

---

#### └── `Dataset.to_arr()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**columns** | list(str) | None | If left blank, includes all columns
**samples** | list(int) | None | If left blank, includes all samples

Subclass | Returns
--- | ---
Tabular | ndarray.ndim==2
Sequence | ndarray.ndim==3
Image | ndarray.ndim==4

---

#### └── `Dataset.to_df()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**columns** | list(str) | None | If left blank, includes all columns
**samples** | list(int) | None | If left blank, includes all samples

Subclass | Returns
--- | ---
Tabular | DataFrame
Sequence | list(DataFrame)
Image | list(list(DataFrame))

---

#### └── `Dataset.to_pillow()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**samples** | list(int) | None | If left blank, includes all samples

Subclass | Returns
--- | ---
Image | list(PIL.Image)


---

#### └── `Dataset.get_dtypes()`

Argument | Type | Default | Description
--- | --- | --- | ---
**id** | int | None | The id of the Dataset
**columns** | list(str) | None | If left blank, includes all columns

Regardless of how the initial `Dataset.dtype` was formatted [e.g. single np.type / str(np.type) / dict(column=np.type)], this function intentionally returns the dtype of each column in a `dict(column=str(np.type))` format.

---

## 2. Feature

Determines the columns that will be used as predictive features during training. Columns are always the last dimension `shape[-1]` of a dataset.

---

### 2a. Create

#### └── `Feature.from_dataset()`

```python
Feature.from_dataset(
    dataset_id
    , include_columns
    , exclude_columns
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**dataset_id** | int | Required | `Dataset.id` from which you want to derive `Dataset.columns`.
**include_columns**  | list(str) | None | Specify columns that *will* be included in the Feature. All columns that are not specified will *not* be included.
**exclude_columns** | list(str) | None | Specify columns that will *not* be included in the Feature. All columns that are not specified *will* be included.

> If neither `include_columns` nor `exclude_columns` is defined, then all columns will be used.

> Both `include_columns` and `exclude_columns` cannot be used at the same time
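
The column selection rules above can be expressed as a small function. This is a sketch of the described behavior only: `resolve_columns()` and the example column names are hypothetical and are not part of the AIQC API.

```python
def resolve_columns(dataset_columns, include_columns=None, exclude_columns=None):
    """Hypothetical helper mirroring the include/exclude rules described above."""
    if include_columns is not None and exclude_columns is not None:
        raise ValueError("include_columns and exclude_columns are mutually exclusive")
    if include_columns is not None:
        # Keep only the named columns, preserving the dataset's column order.
        return [c for c in dataset_columns if c in include_columns]
    if exclude_columns is not None:
        # Keep everything except the named columns.
        return [c for c in dataset_columns if c not in exclude_columns]
    # Neither argument defined: use all columns.
    return list(dataset_columns)

all_columns = ["age", "weight", "height", "diagnosis"]
features_a = resolve_columns(all_columns, exclude_columns=["diagnosis"])
features_b = resolve_columns(all_columns, include_columns=["age", "height"])
features_c = resolve_columns(all_columns)
```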

---

### 2b. Fetch

These methods wrap Dataset's [fetch](#1b.-Fetch) methods:

Method | Arguments | Returns
--- | --- | ---
**to_arr()** | columns:list(str)=Feature.columns, samples:list(int)=None | ndarray 2D / 3D / 4D
**to_df()** | columns:list(str)=Feature.columns, samples:list(int)=None | df / list(df) / list(list(df))
**get_dtypes()** | columns:list(str)=Feature.columns | dict(column=str(np.type))

---

## 3. Label

Determines the column(s) that will be used as a target during supervised analysis. Do not create a Label if you intend to conduct unsupervised/ self-supervised analysis.

---

### 3a. Create

#### └── `Label.from_dataset()`

```python
Label.from_dataset(
    dataset_id
    , columns
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**dataset_id** | int | Required | `Dataset.id` from which you want to derive `Dataset.columns`. Only Tabular Datasets may be used as a Label.
**columns**  | list(str) | None | Specify columns that *will* be included in the Label. If left blank, defaults to all columns. If more than 1 column is provided, then the data in those columns must be in One-Hot Encoded (OHE) format.
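
To make the multi-column requirement concrete, here is one way to check that rows are One-Hot Encoded before registering them as a Label. The `is_one_hot()` helper is hypothetical, and AIQC's own validation may apply different rules.

```python
def is_one_hot(rows):
    """True if every row holds exactly one 1 and otherwise only 0s."""
    return all(set(row) <= {0, 1} and sum(row) == 1 for row in rows)

valid_label = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
multi_hot = [[1, 1, 0], [0, 0, 1]]      # first row flags two classes at once
probabilities = [[0.5, 0.5], [1, 0]]    # first row is not binary

valid_check = is_one_hot(valid_label)
multi_hot_check = is_one_hot(multi_hot)
probabilities_check = is_one_hot(probabilities)
```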

---

### 3b. Fetch

These methods wrap Dataset's [fetch](#1b.-Fetch) methods:

Method | Arguments | Returns
--- | --- | ---
**to_arr()** | columns:list(str)=Label.columns, samples:list(int)=None | ndarray 2D / 3D / 4D
**to_df()** | columns:list(str)=Label.columns, samples:list(int)=None | df / list(df) / list(list(df))
**get_dtypes()** | columns:list(str)=Label.columns | dict(column=str(np.type))

---

## 4. Interpolate

If you don't have time series data then you do not need interpolation.

If you have continuous columns with missing data in a time series, then interpolation allows you to fill in those blanks mathematically. It does so by fitting a curve to each column. Therefore each column passed to an interpolater must satisfy: `np.issubdtype(dtype, np.floating)`.

Interpolation is the first preprocessor because you need to fill in blanks prior to encoding.

> `pandas.DataFrame.interpolate`
> 
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
> 
> Is utilized due to its ease of use, variety of methods, and **support of sparse indices**. However, it does not follow the `fit/transform` pattern like many of the class-based sklearn preprocessors, so the interpolated training data is concatenated with the evaluation split during the interpolation of evaluation splits.

Below are the default settings if `interpolate_kwargs=None` that get passed to `df.interpolate()`. In my experience, `method=spline` produces the best results. However, if either (a) spline fails to fit to your data, or (b) you know that your pattern is linear - then try `method=linear`.

```python
interpolate_kwargs = dict(
    method            = 'spline'
    , limit_direction = 'both'
    , limit_area      = None
    , axis            = 0
    , order           = 1
)
```

Because the sample dimension is different for each Dataset Type, they approach interpolation differently.

Dataset Type | Approach
--- | ---
**Tabular** | Unlike encoders, there is no `fit` object. So first the training data rows are interpolated independently. Then, when it comes time to interpolate other splits like validation, the training data is included in the sequence to be interpolated.
**Sequence** | Interpolation is run on each 2D sequence separately
**Image** | Interpolation is run on each 2D channel separately

---

### 4a. `LabelInterpolater`


Label is intended for a single column, so only 1 Interpolater will be used during `Label.preprocess()`

---

#### └── `LabelInterpolater.from_label()`

```python
LabelInterpolater.from_label(
    label_id
    , process_separately
    , interpolate_kwargs
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**label_id** | int | Required | Points to the `Label.columns` to use
**process_separately** | bool | True | When `True`, the fit is restricted to the training data. This may be flipped to `False`, but doing so causes data leakage.
**interpolate_kwargs** | dict | None | Gets passed to `df.interpolate()`. See [Interpolate](#4.-Interpolate) section for defaults.

---

### 4b. `FeatureInterpolater`


For *multivariate* datasets, columns/dtypes may need to be handled differently. So we use column/dtype filters to apply separate transformations. If the first transformation's filter includes a certain column/dtype, then subsequent filters may not include that column/dtype.

---

#### └── `FeatureInterpolater.from_feature()`

```python
FeatureInterpolater.from_feature(
    feature_id
    , process_separately
    , interpolate_kwargs
    , dtypes
    , columns
    , verbose
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_id** | int | Required | Points to the `Feature.columns` to use
**process_separately** | bool | True | When `True`, the fit is restricted to the training data. This may be flipped to `False`, but doing so causes data leakage.
**interpolate_kwargs** | dict | None | The `interpolate_kwargs:dict=None` object is what gets passed to Pandas interpolation. In my experience, `method=spline` produces the best results. However, if either (a) spline fails to fit to your data, or (b) you know that your pattern is linear - then try `method=linear`.
**dtypes** | list(str) | None | The dtypes to include
**columns** | list(str) | None | The columns to include. Errors if any of the columns were already included by dtypes.
**verbose** | bool | True | If True, messages will be printed about the status of the interpolaters as they attempt to fit on the filtered columns

---

## 5. Encode

Transform data into numerical format that is close to zero. Reference [Encoding](https://aiqc.readthedocs.io/en/latest/pages/explainer.html) for more information.

There are two phases of encoding:
1. `fit` on train - where the encoder learns about the values of the samples made available to it. Ideally, you only want to `fit` aka learn from your training split so that you are not [leaking](https://towardsdatascience.com/data-leakage-5dfc2e0127d4) information from your validation and test splits into your model! However, categorical encoders are always fit on the entire dataset because they are not prone to leakage and any weights tied to empty OHE inputs will zero-out.
2. `transform` each split/fold
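
The two phases can be illustrated without scikit-learn. The sketch below hand-rolls standard scaling with the `statistics` module as a stand-in for a real encoder such as `StandardScaler()`; the split names and values are made up for the example.

```python
import statistics

# Made-up numeric column, already divided into splits.
splits = {
    "train": [2.0, 4.0, 6.0, 8.0],
    "validation": [5.0, 7.0],
    "test": [1.0, 9.0],
}

# Phase 1: fit on the training split only.
# The population standard deviation is used, as in standard scaling.
mean = statistics.fmean(splits["train"])
std = statistics.pstdev(splits["train"])

# Phase 2: transform every split with the statistics learned from train.
encoded = {
    name: [(value - mean) / std for value in values]
    for name, values in splits.items()
}
```

The validation and test values never influence `mean` or `std`, which is what keeps information about those splits from leaking into the model.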


> Only [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) methods are officially supported, but we have experimented with `sklearn.feature_extraction.text.CountVectorizer`

---

### 5a. `LabelCoder`


Label is intended for a single column, so only 1 LabelCoder will be used during `Label.preprocess()`

> Unfortunately, the name "LabelEncoder" is occupied by `sklearn.preprocessing.LabelEncoder`

---

#### └── `LabelCoder.from_label()`

```python
LabelCoder.from_label(
    label_id
    , sklearn_preprocess
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**label_id** | int | Required | Points to the `Label.columns` to use
**sklearn_preprocess** | object | Required | An instantiated `sklearn.preprocessing` class-based encoder, e.g. `StandardScaler()`, as opposed to the uninstantiated class `StandardScaler` or the function `scale()`. AIQC will automatically correct the attributes of your encoder to prevent common errors. For example, it disables sparse SciPy matrix output (which errors during tensor conversion) and data `copy()`.

---

### 5b. `FeatureCoder`

For *multivariate* datasets, columns/dtypes may need to be handled differently. So we use column/dtype filters to apply separate transformations. If the first transformation's filter includes a certain column/dtype, then subsequent filters may not include that column/dtype.

---

#### └── `FeatureCoder.from_feature()`

```python
FeatureCoder.from_feature(
    feature_id
    , sklearn_preprocess
    , include
    , dtypes
    , columns
    , verbose
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_id** | int | Required | Points to the `Feature.columns` to use
**sklearn_preprocess** | object | Required | An instantiated `sklearn.preprocessing` class-based encoder, e.g. `StandardScaler()`, as opposed to the uninstantiated class `StandardScaler` or the function `scale()`. AIQC will automatically correct the attributes of your encoder to prevent common errors. For example, it disables sparse SciPy matrix output (which errors during tensor conversion) and data `copy()`.
**include** | bool | True | Whether to include or exclude the dtypes/columns that match the filter. You can create a filter for all columns by setting `include=False` and then setting both `dtypes` and `columns` to `None`.
**dtypes** | list(str) | None | The dtypes to filter
**columns** | list(str) | None | The columns to filter. Errors if any of the columns were already used by dtypes.
**verbose** | bool | True | If True, messages will be printed about the status of the encoders as they attempt to fit on the filtered columns
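
The way successive filters claim columns can be sketched as follows. This is an interpretation of the `include`/`dtypes`/`columns` arguments described above, not AIQC's source: `match_columns()` is a hypothetical helper, and the dtypes are plain strings for simplicity.

```python
def match_columns(remaining, dtypes=None, columns=None, include=True):
    """Return the columns claimed by one filter from `remaining` (column -> dtype)."""
    dtypes = dtypes or []
    columns = columns or []
    unavailable = [c for c in columns if c not in remaining]
    if unavailable:
        raise ValueError(f"Columns unknown or already claimed: {unavailable}")
    by_dtype = [c for c, dtype in remaining.items() if dtype in dtypes]
    overlap = [c for c in columns if c in by_dtype]
    if overlap:
        raise ValueError(f"Columns already matched by dtypes: {overlap}")
    matched = by_dtype + columns
    if include:
        return matched
    # include=False inverts the filter: claim everything that did not match.
    return [c for c in remaining if c not in matched]

feature_dtypes = {"age": "float64", "weight": "float64", "sex": "object", "visits": "int64"}
remaining = dict(feature_dtypes)
claimed = []
for filter_kwargs in [
    dict(dtypes=["float64"]),  # first coder: all float columns
    dict(include=False),       # second coder: every column that is left
]:
    matched = match_columns(remaining, **filter_kwargs)
    claimed.append(matched)
    for column in matched:
        del remaining[column]  # later filters can no longer claim these columns
```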

---

## 6. Shape

Changes the shape of data. Only supports Features, not Labels.

Reshaping is applied at the end of `Feature.preprocess()`. So if the feature data has been altered via time series windowing or One Hot Encoder, then those changes will be reflected in the shape that is fed to `

When working with architectures that are highly dimensional such as convolutional and recurrent networks (Conv1D, Conv2D, Conv3D / ConvLSTM1D, ConvLSTM2D, ConvLSTM3D), you'll often find yourself needing to reshape data to fit a layer's required input shape.

- *Reducing unused dimensions* - When working with grayscale images (1 channel, 25 rows, 25 columns) it's better to use Conv1D instead of Conv2D.
- *Adding wrapper dimensions* - Perhaps your data is a fit for ConvLSTM1D, but that layer is only supported in the nightly TensorFlow build so you want to add a wrapper dimension in order to use the production-ready ConvLSTM2D.

AIQC favors a *"channels_first"* (samples, channels, rows, columns) approach as opposed to *"channels_last"* (samples, rows, columns, channels).

> *Can't I just reshape the tensors during the training loop?* You could. However, AIQC systematically provides the shape of features and labels to `Algorithm.fn_build` to make designing the topology easier, so it's best to get the shape right beforehand. Additionally, if you reshape your data within the training loop, then you may also need to reshape the output of `Algorithm.fn_predict` so that it is correctly formatted for automatic post-processing. It's also more computationally efficient to do the reshaping once up front.

---

### 6a. Strategies

The `reshape_indices` argument is ultimately fed to [np.reshape(newshape)](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html). We use *index n* to point to the value at `ndarray.shape[n]`.

#### Reshaping by Index 

Let's say we have a 4D feature consisting of 3D images `(samples * channels * rows * columns)`. Our problem is that the images are B&W, so we don't want a color channel because it would add unnecessary dimensionality to our model. Thus, we want to drop the dimension at the shape index `1`.

```python
reshape_indices = (0,2,3)
```

Thus we have wrangled ourselves a 3D feature consisting of 2D images `(samples * rows * columns)`. 

#### Reshaping Explicitly

But what if the dimensions we want cannot be expressed by rearranging the existing indices? If you define a number as a `str`, then that number will be used directly as the value at that position.

So if I wanted to add an extra wrapper dimension to my data to serve as a single color channel, I would simply do:

```python
reshape_indices = (0,'1',1,2)
```

> *Then couldn't I just hardcode my shapes with strings?* Yes, but `FeatureShaper` is applied to all of the splits, which are assumed to have different shapes, which is why we use the indices.

#### Multiplicative Reshaping

Sometimes you need to stack/nest dimensions. This requires multiplying one shape index by another. 

For example, if I have 3 separate hours worth of data and I want to treat it as 180 minutes, then I need to go from a shape of (3 hours * 60 minutes) to (180 minutes). Just provide the shape indices that you want to multiply in a `tuple` like so:

```python
reshape_indices = ((0,1), 2)
```
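
The three strategies can be checked without any array data by resolving `reshape_indices` against a shape tuple. The helper name `resolve_shape()` and the example shapes are hypothetical; the branching follows the excerpt from `feature.preprocess()` shown in the Create section.

```python
import math

def resolve_shape(current_shape, reshape_indices):
    """Translate reshape_indices into a concrete shape tuple."""
    new_shape = []
    for item in reshape_indices:
        if isinstance(item, int):
            # An int reuses the size of an existing dimension.
            new_shape.append(current_shape[item])
        elif isinstance(item, str):
            # A str is taken literally as the size of the new dimension.
            new_shape.append(int(item))
        elif isinstance(item, tuple):
            # A tuple merges dimensions by multiplying their sizes.
            new_shape.append(math.prod(current_shape[i] for i in item))
    return tuple(new_shape)

# By index: drop the channel dimension of 100 B&W images.
by_index = resolve_shape((100, 1, 25, 25), (0, 2, 3))
# Explicit: add a wrapper dimension of size 1 as a single channel.
explicit = resolve_shape((100, 25, 25), (0, '1', 1, 2))
# Multiplicative: merge 3 hours of 60 minutes into 180 minutes (4 columns assumed).
multiplied = resolve_shape((3, 60, 4), ((0, 1), 2))
```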

---

### 6b. Create

#### └── `FeatureShaper.from_feature()`

```python
FeatureShaper.from_feature(
    feature_id
    , reshape_indices
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_id** | int | Required | The `Feature.id` to use
**reshape_indices** | tuple(int/str/tuple) | Required | See [Strategies](#6a.-Strategies).

The `reshape_indices` argument accepts a tuple for rearranging indices in the order of your choosing. Behind the scenes, it will use `np.reshape()` to rearrange the data at the end of your preprocessing pipeline. How each element in that tuple is handled is determined by its type.

`feature.make_featureshaper(reshape_indices:tuple)`

```python
# source code from the end of `feature.preprocess()`
import math

current_shape = feature_array.shape

new_shape = []
for i in featureshaper.reshape_indices:
    if (type(i) == int):
        # An int points at an existing dimension of the current shape.
        new_shape.append(current_shape[i])
    elif (type(i) == str):
        # A str is used directly as the size of the new dimension.
        new_shape.append(int(i))
    elif (type(i) == tuple):
        # A tuple multiplies the sizes of the referenced dimensions.
        indices = [current_shape[idx] for idx in i]
        new_shape.append(math.prod(indices))
new_shape = tuple(new_shape)

feature_array = feature_array.reshape(new_shape)
```

*Warning:* if your model is unsupervised (aka generative or self-supervised), then it must output data in *"column (aka width) last"* shape. Otherwise, automated column decoding will be applied along the wrong dimension.

---

## 7. Window

![window_dimensions](../_static/images/api/window_dimensions.png)

Window facilitates sliding windows for a time series Feature. It does not apply to Labels. This is used for unsupervised (aka self-supervised) time series walk-forward forecasting.

As seen above, no matter what dimensionality the original data has, it will be windowed along the samples (first) dimension. `size_window` determines how many timepoints are included in a window, and `size_shift` determines how many timepoints to slide over before defining a new window.

> For example, if we want to be able to *predict the next 7 days* worth of weather *using the past 21 days* of weather, then our `size_window=21` and our `size_shift=7`.

![window_samples](../_static/images/api/stratify_window.png)

After data is windowed, its dimensionality increases by 1. Why? Well, originally we had a *single time series*. However, if we window that data, then we have *many time series subsets*.

This means that the windows now act as the sample dimension, which is important for stratification. Therefore, Window must be created prior to Splitset.

![Windows](../_static/images/api/sliding_windows.png)

In a walk-forward analysis, we learn about the future by looking at the past. So we need 2 sets of windows:

- *Unshifted windows* (orange in diagram above): represent the past and serve as the features we learn from
- *Shifted windows* (green in diagram above): represent the future and serve as the target we predict

However, when conducting inference, we are trying to predict the shifted windows not learn from them. So we don't need to record any shifted windows.
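
The pairing of unshifted and shifted windows can be sketched with plain index lists. This is one plausible reading of `size_window` and `size_shift`, assuming windows are anchored at the first timepoint and that each shifted window has the same length as its unshifted partner; AIQC's actual alignment (for example, which leading or trailing samples are pruned) may differ. `make_windows()` is a hypothetical helper.

```python
def make_windows(sample_count, size_window, size_shift):
    """Return paired lists of timepoint indices for unshifted and shifted windows."""
    unshifted, shifted = [], []
    start = 0
    # Stop once the shifted window would run past the last timepoint.
    while start + size_shift + size_window <= sample_count:
        unshifted.append(list(range(start, start + size_window)))
        shifted.append(list(range(start + size_shift, start + size_shift + size_window)))
        start += size_shift  # slide forward before defining the next window
    return unshifted, shifted

unshifted, shifted = make_windows(sample_count=10, size_window=4, size_shift=2)
# unshifted[0] covers timepoints 0-3 (the past we learn from)
# shifted[0] covers timepoints 2-5 (the same window moved 2 steps into the future)
```

Each pair of windows becomes one sample, which is why the window count, not the timepoint count, is what gets stratified.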

---

### 7a. Create

#### └── `Window.from_feature()`

```python
Window.from_feature(
    feature_id
    , size_window
    , size_shift
    , record_shifted
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_id** | int | Required | `Feature.id` from which you want to derive windows.
**size_window**  | int | Required | The number of timesteps to include in a window.
**size_shift**  | int | Required | The number of timesteps to shift forward.
**record_shifted**  | bool | True | Whether or not we want to keep a shifted set of windows around. During pure inference, this is False.

---

## 8. Splitset

Used for sample stratification. Reference [Stratification](https://aiqc.readthedocs.io/en/latest/pages/explainer.html) section.

Split | Description
--- | ---
**train** | The samples that the model will be trained upon. Later, we’ll see how we can make *cross-folds from our training split*. Unsupervised learning will only have a training split.
**validation** (optional) | The samples used for training evaluation. Ensures that the test set is not revealed to the model during training.
**test** (optional) | The samples the model has never seen during training. Used to assess how well the model will perform on unobserved, natural data when it is applied in the real world aka how generalizable it is.

> Because Splitset groups together all of the data wrangling entities (Features, Label, Folds) it essentially represents a *Pipeline*, which is why it bears the name Pipeline in the High-Level API.

---

### 8a. Create

#### └── `Splitset.make()`

```python
Splitset.make(
    feature_ids
    , label_id
    , size_test
    , size_validation
    , bin_count
    , fold_count
    , unsupervised_stratify_col
    , name
    , description
    , predictor_id
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**feature_ids** | list(int) | Required | Multiple `Feature.id`'s may be included to enable multi-modal (aka mixed data-type) analysis. All of these Features must have the same number of samples.
**label_id** | int | None | The Label to be used as a target for supervised analysis. Must have the same number of samples as the Features.
**size_test** | float | None | Percent of samples to be placed into the test split. Must be `> 0.0` and `< 1.0`.
**size_validation** | float | None | Percent of samples to be placed into the validation split. Must be `> 0.0` and `< 1.0`. If this is not None and used in combination with `fold_count`, then there will be 4 splits.
**bin_count** | int | None | For continuous stratification columns, how many bins (aka quantiles) should be used?
**fold_count** | int | None | The number of cross-validation folds to generate. See [Cross-Validation](#8b.-Cross-Validation).
**unsupervised_stratify_col** | str | None | Used during unsupervised analysis. Specify a column from the first Feature in feature_ids to use for stratification. For example, when forecasting, it may make sense to stratify by the day of the year.
**name** | str | None | Used for versioning a pipeline (collection of inputs, label, and stratification). Two versions cannot have identical attributes.
**description** | str | None | What is unique about this pipeline?

> `size_train = 1.00 - (size_test + size_validation)`. The backend ensures that the sizes sum to 1.00.

> *How does continuous binning work?* Reference the handy `pandas.qcut()` and the source code `pd.qcut(x=array_to_bin, q=bin_count, labels=False, duplicates='drop')` for more detail.
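
As a rough illustration of how quantile bins can drive stratification, the sketch below bins a toy continuous column with the `pd.qcut` call quoted above and then carves each bin into train/validation/test portions. The per-bin shuffling and rounding are assumptions for illustration; the backend may split the bins differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
array_to_bin = rng.normal(loc=50.0, scale=10.0, size=100)  # toy continuous column

bin_count = 4
size_test, size_validation = 0.20, 0.12
size_train = 1.00 - (size_test + size_validation)

# Integer bin id (0..bin_count-1) for every sample.
bin_ids = pd.qcut(x=array_to_bin, q=bin_count, labels=False, duplicates='drop')

# Hand-rolled stratification: take the same fraction from every bin.
splits = {"train": [], "validation": [], "test": []}
for bin_id in np.unique(bin_ids):
    idx = rng.permutation(np.flatnonzero(bin_ids == bin_id))
    n_test = int(round(len(idx) * size_test))
    n_val = int(round(len(idx) * size_validation))
    splits["test"].extend(idx[:n_test].tolist())
    splits["validation"].extend(idx[n_test : n_test + n_val].tolist())
    splits["train"].extend(idx[n_test + n_val :].tolist())
```

Because each split draws proportionally from every quantile bin, low, middle, and high values of the column are represented in all three splits.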

---

### 8b. Cross-Validation

Cross-validation is triggered by `fold_count:int` during Splitset creation. Reference the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html) to learn more about cross-validation.

![cross fold objects](../_static/images/api/cross_fold_objects.png)

Each row in the diagram above is a `Fold` object.

Each green/blue box represents a bin of stratified samples. During preprocessing and training, we rotate which blue bin serves as the validation samples (`fold_validation`). The remaining green bins in the row serve as the training samples (`folds_train_combined`).

Let's say we defined `fold_count=5`. What are the implications?
- Creates 5 `Folds` related to a `Splitset`.
- 5x more models will be trained for each experiment.
- 5x more preprocessing and caching; the backend must preprocess each Fold separately to prevent data leakage by excluding `fold_validation` from the `fit`. Fits are saved to the Fold object as opposed to the Splitset object.
- 5x more evaluation.
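
The rotation described above can be sketched with plain NumPy. This is only an illustration of the idea: it splits sample indices into equal bins without any stratification, whereas the backend stratifies the bins. The variable names mirror `fold_validation` and `folds_train_combined` from the text.

```python
import numpy as np

fold_count = 5
sample_indices = np.arange(20)  # toy stand-in for the training samples

# One bin of samples per fold.
bins = np.array_split(sample_indices, fold_count)

folds = []
for i in range(fold_count):
    # Rotate which bin is held out for validation.
    fold_validation = bins[i]
    folds_train_combined = np.concatenate(
        [b for j, b in enumerate(bins) if j != i]
    )
    folds.append((folds_train_combined, fold_validation))
```

Each of the 5 entries in `folds` needs its own preprocessing fit, because the held-out bin differs every time.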

*Disclaimer*

> DO NOT use cross-validation unless your *(total sample count / fold_count)* still gives you an accurate representation of your entire sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that the total sample count is evenly divisible by `fold_count`. Folds naturally have fewer samples, so a handful of incorrect predictions have the potential to skew your aggregate metrics. Both of these tips help avoid poorly stratified/ undersized folds that seem to perform either too well (only most common label class present) or poorly (handful of samples and a few inaccurate predictions on an otherwise good model).
>
> Candidly, if you've ever performed cross-validation manually, let alone systematically, you'll know that, barring stratification of continuous labels, it's easy enough to construct the folds, but then it's a pain to generate performance metrics (e.g. `zero_division`, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class-citizens alongside splits. That being said, if you try to do something undersized like "150 samples in your dataset and a `fold_count` > 3 with `unique_classes` > 4," then you may run into edge cases.
>
> Cross validation is only included in AIQC to allow best practices to be upheld and to show off the power of systematic preprocessing.

---

## 7. Algorithm

Now that our data has been prepared, we transition to the 2nd half of the ORM where the focus is the logic that will be applied to that data.

The Algorithm contains all of the components needed to construct, train, and use our model. 

Reference the [tutorials](../pages/gallery.html) for examples of how Algorithms are defined.

---

### 7a. Create

Assemble an architecture consisting of components defined in functions.

The `**hp` kwargs are common to every Algorithm function except `fn_predict`. They are used to systematically pass a dictionary of *hyperparameters* into these functions. See Hyperparameters.

---

#### └── `Algorithm.make()`

```python
Algorithm.make(
    library
    , analysis_type
    , fn_build
    , fn_train
    , fn_predict
    , fn_lose
    , fn_optimize
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**library** | str | Required | 'keras' or 'pytorch' depending on the type of model defined in `fn_build`
**analysis_type** | str | Required | 'classification_binary', 'classification_multi', or 'regression'. Unsupervised/ self-supervised falls under regression. Used to determine which performance metrics are run. Errors if it is incompatible with the Label provided: e.g. classification_binary is incompatible with an np.floating Label.column.
**fn_build** | func | Required | See below 
**fn_train** | func | Required | See below
**fn_lose** | func | None | See below 
**fn_optimize** | func | None | See below
**fn_predict** | func | None | See below

**Required Functions**

```python
def fn_build(
    features_shape:tuple
    , label_shape:tuple
    , **hp:dict
):
    # Define tf/torch model
    return model
```

```python
def fn_train(
    model:object
    , loser:object
    , optimizer:object
    , train_features:ndarray
    , train_label:ndarray
    , eval_features:ndarray
    , eval_label:ndarray
    , **hp:dict
):
    # Define training/ eval loop. 
    # See `utils.pytorch.fit`
    
    # if tensorflow
    return model 
    # if torch
    # See `utils.pytorch.fit` and history metrics below
    return history, model  # `history` is a dict
```

**Optional Functions**

> *Where are the defaults for optional functions defined?* See [utils.tensorflow](https://github.com/aiqc/AIQC/blob/main/aiqc/utils/tensorflow.py) and [utils.pytorch](https://github.com/aiqc/AIQC/blob/main/aiqc/utils/pytorch.py) for examples of loss, optimization, and prediction.

```python
def fn_lose(**hp:dict):
    # Define tf/torch loss function
    return loser
```

```python
def fn_optimize(**hp:dict):
    # Define tf/torch optimizer 
    return optimizer
```

```python
def fn_predict(model:object, features:ndarray):
    # if classification: predictions as ordinal, not OHE.
    return prediction, probabilities  # both ndarray
    # if regression
    return prediction  # ndarray
```
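
To illustrate what "ordinal, not OHE" means for classification, here is a small NumPy sketch that turns probabilities into ordinal class predictions. The probability arrays are toy stand-ins for model output, and the 0.5 threshold for the binary case is an assumption.

```python
import numpy as np

# Multi-class: one probability per class for each sample.
probabilities_multi = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
])
# Ordinal class index per sample instead of a one-hot row.
prediction_multi = np.argmax(probabilities_multi, axis=1)

# Binary: a single probability per sample.
probabilities_binary = np.array([0.2, 0.9, 0.6])
prediction_binary = (probabilities_binary > 0.5).astype(int)
```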

---

### 7b. PyTorch `fit`

Provides an abstraction that eliminates the boilerplate code normally required to train and evaluate a PyTorch model.

- Before training - it shuffles samples, batches samples, and then shuffles batches.
- During training - it calculates batch loss, epoch loss, and epoch history metrics.
- After training - it calculates metrics for each split.

```python
model, history = utils.pytorch.fit(
    # These arguments come directly from `fn_train`
    model
    , loser
    , optimizer
    
    , train_features
    , train_label
    , eval_features
    , eval_label
    
    # These arguments are user-defined
    , epochs
    , batch_size
    , enforce_sameSize
    , allow_singleSample
    , metrics
)
```

User-Defined Arguments | Type | Default | Description
--- | --- | --- | ---
**epochs** | int | 30 | The number of times to loop over the features
**batch_size** | int | 5 | Divides features and labels into chunks to be trained upon
**enforce_sameSize** | bool | True | If `True`, drops any batch where `len(batch) != batch_size`
**allow_singleSample** | bool | False | If `False`, drops any batch where `len(batch) == 1`
**metrics** | list(torchmetrics.metric()) | None | List of instantiated `torchmetrics` classes e.g. `Accuracy`
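
The two batch-dropping arguments can be pictured with the sketch below. `make_batches` is a hypothetical helper written for this illustration, not the code inside `utils.pytorch.fit`, and it skips the shuffling steps.

```python
import numpy as np

def make_batches(samples, batch_size=5, enforce_sameSize=True, allow_singleSample=False):
    """Hypothetical helper that mirrors the two batch-dropping rules."""
    batches = [
        samples[i : i + batch_size]
        for i in range(0, len(samples), batch_size)
    ]
    if enforce_sameSize:
        # Drop the trailing batch if it is smaller than batch_size.
        batches = [b for b in batches if len(b) == batch_size]
    if not allow_singleSample:
        # Drop any batch that contains exactly one sample.
        batches = [b for b in batches if len(b) != 1]
    return batches

samples = np.arange(11)  # 11 samples -> raw batches of 5, 5, and 1
strict = make_batches(samples)
loose = make_batches(samples, enforce_sameSize=False, allow_singleSample=True)
```

One common reason to drop single-sample batches is that layers such as batch normalization cannot compute batch statistics from one sample during training.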

---

### 7b. History Metrics

The goal of the `Predictor.history` object is to record the training and evaluation metrics at the end of each epoch so that they can be interpreted in the learning curve plots. Reference the [visualization](visualization.html) section.

- *Keras*: any `metrics=[]` specified are automatically added to the `History` callback object.

- *PyTorch*: if you use `fit` seen above, then you don't need to worry about this. Otherwise, users are responsible for calculating their own metrics (we recommend the `torchmetrics` package) and placing them into a `history` dictionary that mirrors the schema of the Keras history object. Reference the torch [examples](gallery/pytorch/multi_class.html).

> The schema of the `history` dictionary is as follows: `dict(*=ndarray, val_*=ndarray)`. For example, if you wanted to record the history of the 'loss' and 'accuracy' metrics manually for PyTorch, you would construct it like so:

```python
history = dict(
    loss           = ndarray
    , val_loss     = ndarray
    
    , accuracy     = ndarray
    , val_accuracy = ndarray
)
```
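
A filled-in version of that schema might be assembled as follows. The per-epoch numbers are made up for illustration; in a real training loop they would come from your loss function and metric calculations at the end of each epoch.

```python
import numpy as np

history = dict(loss=[], val_loss=[], accuracy=[], val_accuracy=[])

# Toy per-epoch values standing in for metrics computed during training.
toy_epochs = [
    dict(loss=0.9, val_loss=1.0, accuracy=0.60, val_accuracy=0.55),
    dict(loss=0.5, val_loss=0.6, accuracy=0.80, val_accuracy=0.75),
    dict(loss=0.3, val_loss=0.4, accuracy=0.90, val_accuracy=0.85),
]
for epoch_metrics in toy_epochs:
    for name, value in epoch_metrics.items():
        history[name].append(value)

# Convert each list into an ndarray with one entry per epoch.
history = {name: np.array(values) for name, values in history.items()}
```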

---

### 7c. TensorFlow Early Stopping

*Early stopping* isn't just about efficiency in reducing the number of `epochs`. If you've specified 300 epochs, there's a chance your model catches on to the underlying patterns early, say around 75-125 epochs. At this point, there's also a good chance that what it learns in the remaining epochs will cause it to overfit on patterns that are specific to the training data, and thereby lose its simplicity/ generalizability.

> The `metric=val_*` prefix refers to the evaluation samples.
>
> Remember, regression does not have accuracy metrics.
>
> `TrainingCallback.MetricCutoff` is a custom class we wrote to make *early stopping* easier, so you won't find information about it in the official Keras documentation.

Placed within `fn_train`:

```python
    from aiqc.utils.tensorflow import TrainingCallback
    
    #Define one or more metrics to monitor.
    metrics_cutoffs = [
        dict(metric='val_accuracy', cutoff=0.96, above_or_below='above'),
        dict(metric='val_loss', cutoff=0.1, above_or_below='below')
    ]
    cutoffs = TrainingCallback.MetricCutoff(metrics_cutoffs)

    # Pass it into keras callbacks
    model.fit(
        # other fit args
        callbacks = [cutoffs]
    )
```
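
Conceptually, a metric cutoff is a check that runs at the end of each epoch. The pure-Python sketch below shows one plausible reading of that check; it is not the source of `TrainingCallback.MetricCutoff`. The helper `cutoffs_satisfied`, the inclusive comparisons, and the rule that *every* cutoff must be met before stopping are all assumptions.

```python
def cutoffs_satisfied(epoch_logs, metrics_cutoffs):
    """Hypothetical helper: True when every monitored metric has passed its cutoff."""
    for rule in metrics_cutoffs:
        value = epoch_logs[rule["metric"]]
        if rule["above_or_below"] == "above":
            passed = value >= rule["cutoff"]
        else:
            passed = value <= rule["cutoff"]
        if not passed:
            return False
    return True

metrics_cutoffs = [
    dict(metric="val_accuracy", cutoff=0.96, above_or_below="above"),
    dict(metric="val_loss", cutoff=0.1, above_or_below="below"),
]

# Accuracy is high enough here, but the loss has not dropped below its cutoff yet.
satisfied_early = cutoffs_satisfied({"val_accuracy": 0.97, "val_loss": 0.25}, metrics_cutoffs)
# Both conditions hold, so training could stop at this epoch.
satisfied_late = cutoffs_satisfied({"val_accuracy": 0.97, "val_loss": 0.08}, metrics_cutoffs)
```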

---

## 8. Hyperparameters

As mentioned in [Algorithm](#7a.-Create), the `**hp` argument is used to systematically pass hyperparameters into the Algorithm functions.

For example, given the following set of hyperparameters:

```python
hyperparameters = dict(
    epoch_count     = [30]
    , learning_rate = [0.01]
    , neuron_count  = [24, 48]
)
```

A grid search would produce 2 unique hyperparameter combinations:

```python
[
    dict(
        epoch_count     = 30
        , learning_rate = 0.01
        , neuron_count  = 24 #<-- varies
    )
    
    , dict(
        epoch_count     = 30
        , learning_rate = 0.01
        , neuron_count  = 48 #<-- varies
    )
]
```

We access the current value in our model functions like so: `hp['neuron_count']`. 
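
The expansion from a dictionary of lists into individual combinations can be reproduced with `itertools.product`. This is a sketch of the idea rather than AIQC's internal code, and `fn_build_stub` is a hypothetical stand-in for an Algorithm function that receives `**hp`.

```python
from itertools import product

hyperparameters = dict(
    epoch_count     = [30]
    , learning_rate = [0.01]
    , neuron_count  = [24, 48]
)

# Cartesian product of every list of candidate values.
names = list(hyperparameters.keys())
combinations = [
    dict(zip(names, values))
    for values in product(*hyperparameters.values())
]

def fn_build_stub(**hp):
    # Stand-in for an Algorithm function reading the current combination.
    return hp['neuron_count']

neuron_counts = [fn_build_stub(**combo) for combo in combinations]
```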

---

### 8a. Philosophy

There are essentially 2 phases to hyperparameter tuning:

1. Figure out the right architecture with high-level concepts such as topologies (types of layers, # of layers, # neurons per layer) and batch size.
2. Once you find architectures that perform reasonably well, home in on their nuances like weight initialization, activation methods, and learning rate.

Because it doesn't make sense to waste time tuning the nuances of architectures that perform poorly to begin with, I don't recommend fancy hyperparameter search strategies. Just random search step #1 and then grid search step #2.

- Test high/medium/low values for each parameter. Given enough practice with an architecture, you'll intuitively get a feel for the right balance between "model complexity and data complexity." 

- If you limit your experiments to tweaking just 1-2 parameters at a time, then it's easy to see their effect as an *independent variable*. 

- Never forget, the goal of training is to neurally encode representative information using the simplest topology possible to ensure generalizability.

---

### 8b. Search Strategies

#### Grid search

By default AIQC will test all possible combinations.

#### Random selection

Testing many different combinations in your initial runs can be a good way to get a feel for the parameter space. However, if you do this, you'll find that many of your combinations are too similar. Randomly sampling them is a less computationally expensive way to test a wide variety of architectures.

* `search_count:int` the fixed # of combinations to sample.

* `search_percent:float` a % of combinations to sample.
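
The two sampling options can be sketched as follows. `sample_combinations` is a hypothetical helper: the rounding rule for `search_percent` and the fixed seed are assumptions, while the fallback to the full grid and the "one or the other" rule follow the argument descriptions in this section.

```python
import random
from itertools import product

hyperparameters = dict(
    learning_rate  = [0.001, 0.01, 0.1]
    , neuron_count = [12, 24, 48, 96]
)
names = list(hyperparameters.keys())
combinations = [dict(zip(names, v)) for v in product(*hyperparameters.values())]

def sample_combinations(combinations, search_count=None, search_percent=None, seed=0):
    """Hypothetical helper; the rounding rule for search_percent is an assumption."""
    if search_count is not None and search_percent is not None:
        raise ValueError("Use either search_count or search_percent, not both.")
    if search_percent is not None:
        search_count = max(1, round(len(combinations) * search_percent))
    if search_count is None or search_count >= len(combinations):
        return list(combinations)  # fall back to the full grid
    return random.Random(seed).sample(combinations, search_count)

by_count = sample_combinations(combinations, search_count=5)
by_percent = sample_combinations(combinations, search_percent=0.25)
```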

#### Bayesian

"TPE (Tree-structured Parzen Estimator)" via `hyperopt` has been suggested as a future area to explore, but it does not exist right now.

---

### 8c. Create

#### └── `Hyperparamset.from_algorithm()`

```python
Hyperparamset.from_algorithm(
    algorithm_id
    , hyperparameters
    , search_count
    , search_percent
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**algorithm_id** | int | Required | The `Algorithm.id` whose functions these hyperparameters will be used with
**hyperparameters** | dict(str:list) | Required | See example in [Hyperparameters](#8.-Hyperparameters). Must be JSON compatible. Open an issue if you really want Pickle support.
**search_count** | int | None | Randomly select *n* hyperparameter combinations to test. Must be greater than 1. There is no upper limit; if *n* exceeds the number of combinations, then all combinations are tested.
**search_percent** | float | None | Cannot be used if `search_count` is used. Between `0.0:1.0`

---

## 9. Queue

The Queue is the central object of the "logic side" of the ORM. It ties together everything we need to run training `Job`'s for hyperparameter tuning. That's why it is referred to as an Experiment in the High-Level API.

---

### 9a. Create

#### └── `Queue.from_algorithm()`

```python
Queue.from_algorithm(
    algorithm_id
    , splitset_id
    , repeat_count
    , permute_count
    , hyperparamset_id
    , description
)
```

Argument | Type | Default | Description
--- | --- | --- | ---
**algorithm_id** | int | Required | The `Algorithm.id` whose functions will be used during training and evaluation
**splitset_id** | int | Required | The `Splitset.id` whose samples will be used during training and evaluation
**repeat_count** | int | 1 | Each job will be repeated *n* times. Designed for use with random weight initialization (aka non-deterministic). This is why 1 `Job` has many `Predictors`
**permute_count** | int | 3 | Triggers a shuffled permutation of each training data column to determine which columns have the most impact on loss in comparison to the baseline training loss: `[training loss - (median loss of <n> permutations)]`. The count determines how many times the shuffled permutation is run before taking the median loss. Permutation does *not* get run on `Feature.dataset.typ=='image'`. Set this to 0 if you do not care about feature importance.
**hyperparamset_id** | int | None | The `Hyperparamset.id` whose hyperparameter combinations will be used during training and evaluation. This needs to be specified because an Algorithm can have many Hyperparamsets.
**description** | str | None | What is unique about this experiment?
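
The `permute_count` formula can be illustrated with a toy model in NumPy. This is a sketch of the general permutation technique using the sign convention from the table, not AIQC's implementation; `permutation_importance`, `mse`, and the toy `predict` function are hypothetical names.

```python
import numpy as np

def permutation_importance(predict, loss_fn, features, label, permute_count=3, seed=0):
    """Hypothetical helper: baseline loss minus median loss after shuffling a column."""
    rng = np.random.default_rng(seed)
    baseline = loss_fn(label, predict(features))
    importance = {}
    for col in range(features.shape[1]):
        losses = []
        for _ in range(permute_count):
            shuffled = features.copy()
            shuffled[:, col] = rng.permutation(shuffled[:, col])
            losses.append(loss_fn(label, predict(shuffled)))
        importance[col] = baseline - float(np.median(losses))
    return importance

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 2))
label = 3.0 * features[:, 0]        # only column 0 carries signal
predict = lambda x: 3.0 * x[:, 0]   # toy "model" that fits the label exactly

importance = permutation_importance(predict, mse, features, label)
```

With this sign convention, shuffling an influential column raises the loss, so a more negative value means the column mattered more. Shuffling the irrelevant column leaves the loss unchanged.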

---

### 9b. Run 

#### └── `Queue.run_jobs()`

Jobs are simply run in a loop on the main process.

Stop the queue with a keyboard interrupt e.g. `ctrl+Z/D/C` in Python shell or `i,i` in Jupyter. It is listening for interrupts so it will usually stop gracefully. Even if it errors during an interrupt, it's not a problem. You can rerun the queue and it will resume on the same job it was running previously.

---

### 10. Job

The Queue spawns `Job`'s.

`# jobs = Hyperparamset.hyperparamcombos.count() * Queue.repeat_count * splitset.folds.count()`
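
As a quick worked example of that formula, the sketch below multiplies the three factors. `count_jobs` is a hypothetical helper, and treating a Splitset without folds as a factor of 1 is an assumption.

```python
def count_jobs(combo_count, repeat_count=1, fold_count=None):
    """Hypothetical helper; treats a Splitset without folds as a factor of 1."""
    fold_factor = fold_count if fold_count else 1
    return combo_count * repeat_count * fold_factor

# 2 hyperparameter combinations, each repeated 3 times, across 5 folds.
jobs_with_folds = count_jobs(combo_count=2, repeat_count=3, fold_count=5)
# The same experiment without cross-validation.
jobs_without_folds = count_jobs(combo_count=2, repeat_count=3)
```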

---

### 11. Predictor

As the Jobs finish, they save the `model` and `history` metrics within a `Predictor` object.

---

#### └── `Predictor.get_model()`

Handles fetching and initializing the model (and PyTorch optimizer) from `Predictor.model_file` and `Predictor.input_shapes`

---

#### └── `Predictor.get_hyperparameters(as_pandas:bool=True)`

This is a shortcut to fetch the hyperparameters used to train this specific model. `as_pandas` toggles between `dict()` and `DataFrame`.

---

### 12. Prediction

When data is fed through a Predictor, you get a `Prediction`. During training, Predictions are automatically generated for every split/fold in the `Queue.splitset`.

#### Fetching metrics.

| Attribute | Description |
| --- | --- | 
| *predictions* | decoded predictions ndarray for each split/ fold/ inference |
| *feature_importance* | importance of each column. only for training split/fold |
| *probabilities* | prediction probabilities per split/ fold. `None` for regression. |
| *metrics* | statistics for each split/fold that vary based on the analysis_type.  |
| *metrics_aggregate* | average for each statistic across all splits/folds. |
| *plot_data* | metrics reformatted for plot functions. |

```python
queue.jobs[0].predictors[0].predictions[0].metrics
```

```python
{'train': {'accuracy': 0.9607843137254902,
  'f1': 0.9607503607503607,
  'loss': 0.08033698052167892,
  'precision': 0.9618055555555555,
  'recall': 0.9607843137254902,
  'roc_auc': 0.9985582468281432},
 'validation': {'accuracy': 0.9444444444444444,
  'f1': 0.9440559440559441,
  'loss': 0.13840313255786896,
  'precision': 0.9523809523809523,
  'recall': 0.9444444444444444,
  'roc_auc': 1.0},
 'test': {'accuracy': 0.9333333333333333,
  'f1': 0.9333333333333333,
  'loss': 0.1355634480714798,
  'precision': 0.9333333333333333,
  'recall': 0.9333333333333333,
  'roc_auc': 0.9900000000000001}}
```
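
To show how an aggregate relates to per-split metrics, the sketch below averages each statistic across splits, using a subset of the values shown above. It only illustrates the "average across all splits/folds" description of `metrics_aggregate`; the actual attribute may be computed or structured differently.

```python
from statistics import mean

# Subset of the per-split metrics copied from the output above.
metrics = {
    'train': {'accuracy': 0.9607843137254902, 'loss': 0.08033698052167892},
    'validation': {'accuracy': 0.9444444444444444, 'loss': 0.13840313255786896},
    'test': {'accuracy': 0.9333333333333333, 'loss': 0.1355634480714798},
}

# Average every statistic across the splits.
statistic_names = list(metrics['train'].keys())
aggregate = {
    name: mean(split[name] for split in metrics.values())
    for name in statistic_names
}
```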

---

## 9. Metrics & Visualization

For more information on visualization of performance metrics, reference the [Visualization & Metrics](visualization.html) documentation.