# Data Generation

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
load("datamining.mac")$
set_draw_defaults(file_name="images/maxplot.svg", terminal=svg, point_type=square, point_size=2)$

## Introduction

This notebook demonstrates the data mining package written in Maxima, which is useful for

- computing some mathematical criteria exactly, without numerical error or instability, and
- creating randomized Moodle STACK questions.

To achieve these goals, some of the implementations are simplified and may not scale to large data sets.

To load the package:

```
load("datamining.mac")$
```

To learn Maxima, you may use the `describe` function or refer to the [documentation](https://maxima.sourceforge.io/documentation.html) for more details:

In [None]:
describe(block)$

As an example, the following defines a function that finds the maximum of its arguments together with the smallest index at which it occurs:

In [None]:
maxima([lst]):=
if length(lst)>1 
/* recurse on tail maxima (tm) */
then block(
    [tm :apply('maxima,rest(lst))],
    if lst[1]>=tm[2] 
    then maxima(lst[1]) 
    else [tm[1]+1,tm[2]]
)
/* base cases */
else if length(lst)>0 
then [1, lst[1]]
else [0, -inf]$  /* trailing $ gives no output. */

maxima(1,2,3,2,1); /* trailing ; ends an expression and outputs its value. */

In the above example, `maxima([lst])` is a recursive function that
- takes a variable number of arguments, which are stored in `lst` as a list, and
- returns a list `[i,m]` as follows:
  - If `lst` is non-empty, `m=lst[i]` is the maximum element of `lst` and `i` is the smallest index attaining it.
  - If `lst` is empty, the result is `[0,-inf]`, following the conventions that
    - the maximum element of an empty list of numbers is `-inf`, and
    - Maxima uses 1-based indexing, so `0` is the index of an imaginary item before the first item in a list.
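
The base cases and the tie-breaking rule described above can be checked with a few extra calls. This is a minimal sketch that assumes the definition of `maxima([lst])` from the cell above has already been evaluated; the expected values in the comments are traced by hand from that definition.

```maxima
/* Assumes maxima([lst]) from the cell above is already defined. */
maxima();         /* empty argument list: [0, -inf] */
maxima(5);        /* single argument: [1, 5] */
maxima(2, 2, 1);  /* tie between positions 1 and 2: [1, 2] */
maxima(1, 3, 3);  /* tie between positions 2 and 3: [2, 3] */
```

Ties resolve to the smallest index because the comparison `lst[1]>=tm[2]` is non-strict, so an earlier element that equals the maximum of the tail takes precedence.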

## Generate data from lists

Data is a matrix of feature values associated with feature names. Data can be created by `build_data_from_list(fns, lst)` where
- `fns` is a list of feature names, and 
- `lst` is a list of instances, which are lists of feature values corresponding to the feature names.

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],           /* feature names */
        lst: [[1, 0, 0, 0], [2, 1, 1, 1]],   /* instances */
        target: 'Y,
        xy: ['X_1, 'X_2],
        data
    ],
    data: build_data_from_list(fns, lst),
    plot_labeled_data(data,xy,target),
    [data, feature_names(data), size(data), feature_index(fns, target), get_data(data, 1), feature_values(data, target)]
);

```{note}
There are other functions that help extract information from a data set:
- `feature_names(data)` returns the feature names of `data`,
- `size(data)` returns the number of instances of `data`,
- `feature_index(fns, fn)` returns the index of a feature named `fn` in the list `fns` of feature names,
- `get_data(data, i)` returns the `i`-th instance of `data`, and
- `feature_values(data, fn)` returns the list of feature values of the feature `fn`.
```

## Generate data with rules

Data can also be generated (randomly) according to some specified rules using `build_data(fns, gen, n)` where
- `fns` is a list of feature names,
- `gen` is a function that takes a unique index and generates an instance (a list of feature values) associated with the index, and
- `n` is the number of instances to generate.

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        gen: lambda([i],
            [
                i,
                random(3),
                random(3),
                if 'X_1<1 and 'X_2>0 then 1 else 0
            ]),
        n: 10
    ],
    build_data(fns, gen, n)
);

In the above example,
- $i$ is the unique index,
- $X_1$ and $X_2$ are generated uniformly at random from $\Set{0,1,2}$, and
- $Y$ is a deterministic function of $X_1$ and $X_2$, namely,  
$$
Y=\begin{cases}
1 & X_1<1, X_2>0\\
0 & \text{otherwise.}
\end{cases}
$$
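
To see what the rule for $Y$ does on every possible pair of feature values, it can be tabulated with core Maxima alone. The sketch below does not use the data mining package; the names `y_rule` and `y_table` are hypothetical helpers introduced only for this illustration.

```maxima
/* y_rule is a hypothetical helper that mirrors the rule for Y above. */
y_rule(x1, x2) := if x1 < 1 and x2 > 0 then 1 else 0$

/* Outer list runs over x1 = 0,1,2; each inner list runs over x2 = 0,1,2. */
y_table: makelist(makelist(y_rule(x1, x2), x2, 0, 2), x1, 0, 2);
/* [[0, 1, 1], [0, 0, 0], [0, 0, 0]] */

/* Fraction of the 9 equally likely pairs for which Y = 1. */
apply("+", flatten(y_table)) / 9;
/* 2/9 */
```

Only the pairs $(0,1)$ and $(0,2)$ give $Y=1$, so about two in nine generated instances are expected to have $Y=1$.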

```{note}
The value of a feature 
- can depend on the index and the values of all the previously generated features of the same instance, but
- cannot depend on the feature values of other instances.
```
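
As an illustration of the note above, the following sketch makes `Y` depend on both the index and an earlier feature of the same instance. It assumes that `build_data` resolves the quoted name `'X_1` in the same way as in the previous cell, which is inferred from that example rather than from the package documentation.

```maxima
load("datamining.mac")$  /* the package used throughout this notebook */

block(
    [
        fns: ['i, 'X_1, 'Y],
        gen: lambda([i],
            [
                i,
                random(3),
                /* Y uses the index i and the earlier feature X_1 of the same instance */
                if evenp(i) and 'X_1>0 then 1 else 0
            ]),
        n: 6
    ],
    build_data(fns, gen, n)
);
```

Under that assumption, instances with an odd index always get `Y` equal to `0`, while instances with an even index get `1` exactly when `X_1` is positive.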

## Transform features

New features can be created by transforming existing ones using `transform_features(data, nfns, ngen)` where
- `data` is a data set,
- `nfns` is the list of new feature names, and
- `ngen` is a function that takes a unique index and returns an instance (list of new feature values).

In [None]:
block(
    [
        fns: ['X_1, 'X_2],
        gen: lambda([i], 
            [
                random(3), 
                random(3)
            ]),
        n: 10,
        nfns: ['i, 'X_1, 'X_2, 'Y],
        ngen: lambda([i],
            [
                i,
                'X_1,
                'X_2,
                if 'X_1<1 and 'X_2>0 then 1 else 0
            ]
        ),
        data
    ],
    data: build_data(fns, gen, n),
    [data, transform_features(data, nfns, ngen)]
);

In the above example,
- the features $X_1$ and $X_2$ in `data` are transformed to create the feature $Y$, and
- the row index is used to create the feature $i$.

```{note}
A new feature 
- can depend on the index, all previously generated features and the features in `data` of the same instance, but
- cannot depend on the feature values of other instances. 
```

## Subsample data

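To subsample data based on a condition, use `subsample_data(data, cond)`, where `cond` is a function that takes the index and returns a condition on the feature values of an instance: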

In [None]:
block(
    [
        fns: ['X_1, 'X_2],
        gen: lambda([i],
            [
                random(3),
                random(3)
            ]),
        n: 10,
        cond: lambda([i],
            'X_1<1 and 'X_2>0
        ),
        data
    ],
    data: build_data(fns, gen, n),
    [data, subsample_data(data, cond)]
);

Data sets with the same list of features can be stacked together (vertically):

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2]
    ],
    data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),4),
    data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),4),
    data: transform_features(stack_data(data_1, data_2), fns, lambda([i], [i, 'X_1, 'X_2])),
    [data_1, data_2, data]
);

In the above example, `data` consists of the instances from the two clusters `data_1` and `data_2`. The index column is regenerated so that every instance has a unique index.

## Combine data

Data can be stacked (vertically) by `stack_data(data_1, data_2, ...)`, where each `data_i` is a data set with the same list of features.

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2]
    ],
    data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),4),
    data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),4),
    data: transform_features(stack_data(data_1, data_2), fns, lambda([i], [i, 'X_1, 'X_2])) 
);

In the above example, `data` consists of instances from `data_1` and `data_2`. 

```{note}
The index column is regenerated using `transform_features` for `data` so that every instance has a unique index.
```