# Classification

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
load("datamining.mac")$

## Decision tree induction

### Information gain

One impurity measure for decision tree induction is entropy, computed as `entropy(ps)` for a distribution `ps`:

In [None]:
entropy(ps);
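
For reference, here is a sketch of the quantity involved, assuming `entropy` implements the standard Shannon entropy with base-2 logarithms (the logarithm base used by `datamining.mac` is an assumption here). For a distribution $(p_1, \dots, p_m)$:

$$
h(p_1, \dots, p_m) = -\sum_{k=1}^{m} p_k \log_2 p_k, \qquad \text{with the convention } 0 \log_2 0 = 0.
$$

For example, $h(\tfrac{1}{2}, \tfrac{1}{2}) = 1$ and $h(1, 0) = 0$: a uniform binary distribution is maximally impure, and a degenerate one is pure.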

The information gain ratio and the related information quantities can be computed from data as follows:

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        conds: ['X_1, 'X_2],
        target: 'Y,
        data
    ],
    data: build_data(fns, gen, n),
    [
        data,
        Info(data, target),
        build_data_from_list(
            ['X, 'Info[X], 'Gain[X], 'SplitInfo[X], 'GainRatio[X]],
            makelist(
                map('simplify,
                    [X,
                     InfoX(data, target, X), 
                     Gain(data, target, X), 
                     SplitInfo(data, X), 
                     GainRatio(data, target, X)]
                ), 
                X, conds
            )
        )
    ]
);

- `Info(data, target)` computes the information content (entropy) of `target` in `data`.
- `InfoX(data, target, X)` computes the conditional entropy of `target` given `X`.
- `Gain(data, target, X)` computes the information gain on `target` from splitting on `X`.
- `SplitInfo(data, X)` computes the split information (entropy) of `X`.
- `GainRatio(data, target, X)` computes the information gain ratio on `target` from splitting on `X`.
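
How these quantities relate can be sketched as follows, assuming the functions follow the usual C4.5-style definitions with base-2 logarithms (an assumption, since the source of `datamining.mac` is not shown here). Let $D$ be the data, $p_k$ the fraction of tuples in $D$ whose target is class $k$, and $D_j$ the subset of $D$ where $X$ takes its $j$-th value:

$$
\begin{aligned}
\text{Info}(D) &= -\sum_{k} p_k \log_2 p_k, \\
\text{Info}_X(D) &= \sum_{j} \frac{|D_j|}{|D|}\,\text{Info}(D_j), \\
\text{Gain}(X) &= \text{Info}(D) - \text{Info}_X(D), \\
\text{SplitInfo}_X(D) &= -\sum_{j} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}, \\
\text{GainRatio}(X) &= \frac{\text{Gain}(X)}{\text{SplitInfo}_X(D)}.
\end{aligned}
$$

Under these definitions, the gain ratio is undefined when $X$ takes a single value in $D$, because $\text{SplitInfo}_X(D) = 0$. With only 6 randomly generated tuples, this can occasionally happen.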

### Gini impurity

Another impurity measure is the Gini impurity:

In [None]:
gini(ps);
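
As a point of reference, assuming `gini` implements the standard Gini impurity, a distribution $(p_1, \dots, p_m)$ has

$$
g(p_1, \dots, p_m) = 1 - \sum_{k=1}^{m} p_k^2.
$$

For example, $g(\tfrac{1}{2}, \tfrac{1}{2}) = \tfrac{1}{2}$ and $g(1, 0) = 0$.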

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        conds: ['X_1, 'X_2],
        target: 'Y,
        data
    ],
    data: build_data(fns, gen, n),
    [
        data, Gini(data, target),
        build_data_from_list(
            ['X, 'Gini[X], 'GiniDrop[X]],
            makelist(
                [X, GiniX(data, target, X), GiniDrop(data, target, X)],
                X, conds
            )
        )
    ]
);
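
The columns in the output can be read as follows, assuming `Gini`, `GiniX` and `GiniDrop` follow the usual CART-style definitions (an assumption, since the source of `datamining.mac` is not shown here). With $p_k$ the fraction of tuples in $D$ whose target is class $k$, and $D_j$ the subset of $D$ where $X$ takes its $j$-th value:

$$
\begin{aligned}
\text{Gini}(D) &= 1 - \sum_{k} p_k^2, \\
\text{Gini}_X(D) &= \sum_{j} \frac{|D_j|}{|D|}\,\text{Gini}(D_j), \\
\text{GiniDrop}(X) &= \text{Gini}(D) - \text{Gini}_X(D).
\end{aligned}
$$

As a hypothetical example (not the random data generated above), take 6 tuples with 3 positives and 3 negatives, where $X = 0$ covers 2 negatives and $X = 1$ covers 3 positives and 1 negative. Then

$$
\text{Gini}(D) = \tfrac{1}{2}, \qquad
\text{Gini}_X(D) = \tfrac{2}{6} \cdot 0 + \tfrac{4}{6} \cdot \tfrac{3}{8} = \tfrac{1}{4}, \qquad
\text{GiniDrop}(X) = \tfrac{1}{4}.
$$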

## Rule-based classifier

### FOIL gain

The following formula computes the FOIL gain
- from a rule covering `p_0` positives and `n_0` negatives
- to a rule covering `p_1` positives and `n_1` negatives. 

In [None]:
foilgain(p_0,n_0,p_1,n_1);
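
For comparison, the textbook form of FOIL gain is sketched below, assuming `foilgain` uses base-2 logarithms (an assumption about `datamining.mac`):

$$
\text{FOILGain} = p_1 \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right).
$$

As a hypothetical example, going from a rule covering $p_0 = 3$, $n_0 = 3$ to one covering $p_1 = 2$, $n_1 = 0$ gives $2\,(\log_2 1 - \log_2 \tfrac{1}{2}) = 2$. The gain rewards a refined rule that is more precise while still covering many positives.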

To compute FOIL gain from data:

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        cjts: ['X_1=1, 'X_2=1],
        target: 'Y,
        data
    ],
    data: build_data(fns, gen, n),
    [data, FOILGain(data, target, cjts)]
);

It returns the FOIL gain from rule $R_0$ to rule $R_1$ where
- $R_0$: `rest(cjts,-1) => Y=1`
- $R_1$: `cjts => Y=1`

and `rest(cjts,-1)` is the list of conjuncts in `cjts` except the last one.

### FOIL prune

FOIL prune can be computed from the number `p` of positives and the number `n` of negatives covered by a rule.

In [None]:
foilprune(p,n);
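
For comparison, the textbook form of this measure is sketched below, assuming `foilprune` follows the standard definition:

$$
\text{FOILPrune} = \frac{p - n}{p + n}.
$$

As a hypothetical example, a rule covering $p = 2$, $n = 0$ scores $1$, while a rule covering $p = 3$, $n = 1$ scores $\tfrac{1}{2}$. The value ranges from $-1$ (only negatives covered) to $1$ (only positives covered), and it is undefined for a rule that covers no tuples.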

To compute FOIL prune from data:

In [None]:
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        cjts: ['X_1=1, 'X_2=1],
        target: 'Y,
        data
    ],
    data: build_data(fns, gen, n),
    [data, FOILPrune(data, target, cjts)]
);

It returns a pair of FOIL prune values, one for each of the rules
- $R_0$: `rest(cjts,-1) => Y=1`
- $R_1$: `cjts => Y=1`.
