# Classification workflow

Anton Antonov  
RakuForPrediction at WordPress   
October 2025

----

## Setup

In [4]:
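use Data::Reshapers;
use Data::Importers;
use Data::Summarizers;
use ML::ROCFunctions;

use ML::SparseMatrixRecommender;
# Also used below; load explicitly if the notebook kernel does not pre-load them
use Data::TypeSystem;   # deduce-type
use Data::Translators;  # to-html
use Hash::Merge;        # merge-hash
use Text::Plot;         # text-list-plot
use JavaScript::D3;     # js-d3-list-line-plot, js-d3-list-plot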

In [5]:
#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

In [6]:
#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

In [7]:
my $title-color = 'Silver';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $tick-labels-font-size = 10;
my $tick-labels-color = 'Silver';
my $tick-labels-font-family = 'Helvetica';
my $background = '#1F1F1F';
my $color-scheme = 'schemeTableau10';
my $color-palette = 'Inferno';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10},link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

my %opts = :$background, :$title-color, :$edge-thickness, :$vertex-size;

{background => #1F1F1F, edge-thickness => 3, title-color => Silver, vertex-size => 6}

----

## Ingestion

In [None]:
my $url = 'https://raw.githubusercontent.com/antononcube/MathematicaVsR/refs/heads/master/Data/MathematicaVsR-Data-Mushroom.csv';

my @dsData = data-import($url, headers => 'auto');

@dsData.&dimensions

(8124 24)

In [9]:
deduce-type(@dsData);

Vector(Assoc(Atom((Str)), Atom((Str)), 24), 8124)

---

## Preliminary data analysis tabulation

Here is a summary of the mushroom data:

In [10]:
my @field-names = <cap-Shape cap-Surface cap-Color bruises? odor gill-Attachment gill-Spacing gill-Size gill-Color edibility>;
sink records-summary(@dsData, :@field-names)

+-----------------+-----------------+-----------------+---------------+-----------------+------------------+-----------------+----------------+-------------------+-------------------+
| cap-Shape       | cap-Surface     | cap-Color       | bruises?      | odor            | gill-Attachment  | gill-Spacing    | gill-Size      | gill-Color        | edibility         |
+-----------------+-----------------+-----------------+---------------+-----------------+------------------+-----------------+----------------+-------------------+-------------------+
| convex  => 3656 | scaly   => 3244 | brown   => 2284 | False => 4748 | none    => 3528 | free     => 7914 | close   => 6812 | broad  => 5612 | buff      => 1728 | edible    => 4208 |
| flat    => 3152 | smooth  => 2556 | gray    => 1840 | True  => 3376 | foul    => 2160 | attached => 210  | crowded => 1312 | narrow => 2512 | pink      => 1492 | poisonous => 3916 |
| knobbed => 828  | fibrous => 2320 | red     => 1500 |               | spicy   

Before classifying for edibility, consider the relationship between edibility and odor:

In [11]:
cross-tabulate(@dsData, 'odor', 'edibility')
==> to-pretty-table()

+----------+--------+-----------+
|          | edible | poisonous |
+----------+--------+-----------+
| almond   |  400   |           |
| anise    |  400   |           |
| creosote |        |    192    |
| fishy    |        |    576    |
| foul     |        |    2160   |
| musty    |        |     36    |
| none     |  3408  |    120    |
| pungent  |        |    256    |
| spicy    |        |    576    |
+----------+--------+-----------+

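The table shows that odor is a strong indicator of edibility: almond- and anise-scented mushrooms are all edible, odorless ones are mostly edible (3408 of 3528), and every other odor occurs only in poisonous mushrooms. Similarly, mushrooms without bruises are much more likely to be poisonous: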

In [12]:
cross-tabulate(@dsData, 'bruises?', 'edibility')
==> to-pretty-table()

+-------+-----------+--------+
|       | poisonous | edible |
+-------+-----------+--------+
| False |    3292   |  1456  |
| True  |    624    |  2752  |
+-------+-----------+--------+
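The two cross-tabulations can be turned into per-row poisonous shares, which makes the comparison explicit. The following is a minimal sketch in core Raku; the counts are copied from the tables above (empty cells are assumed to mean zero), and `poisonous-share` is a hypothetical helper, not part of any package used in this notebook.

```raku
# Counts copied from the two cross-tabulations above, as [edible, poisonous];
# empty table cells are assumed to be 0
my %by-odor =
    almond => [400, 0],    anise   => [400, 0], creosote => [0, 192],
    fishy  => [0, 576],    foul    => [0, 2160], musty   => [0, 36],
    none   => [3408, 120], pungent => [0, 256], spicy    => [0, 576];
my %by-bruises = 'False' => [1456, 3292], 'True' => [2752, 624];

# Fraction of poisonous records within each row of a contingency table
sub poisonous-share(%table) {
    %table.map(-> $row { $row.key => $row.value[1] / $row.value.sum }).Hash
}

my %odor-share    = poisonous-share(%by-odor);
my %bruises-share = poisonous-share(%by-bruises);

for %odor-share.sort(*.key)    -> $p { say 'odor ',     $p.key, ' => ', $p.value.round(0.001) }
for %bruises-share.sort(*.key) -> $p { say 'bruises? ', $p.key, ' => ', $p.value.round(0.001) }
```

The shares are exact rationals, so they can be compared directly before rounding for display.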

---

## Procedure outline

Let us make a full-blown classification workflow with the following steps:

- Split the data into training and testing sets
- Make an SMR object over the training set
- Classify all records of the testing set with the SMR object
- Derive (and display) classifier metrics:
    - Confusion matrix
    - ROC plots

---

## SMR object creation

Split the data, using the first 75% of the rows for training and the rest for testing:

In [13]:
my (@dsTraining, @dsTesting);
with take-drop(@dsData, floor(0.75 * @dsData.elems)) {
    @dsTraining = select-columns($_.head, ['id', |@field-names]); 
    @dsTesting =  select-columns($_.tail, ['id', |@field-names]); 
}

say deduce-type(@dsTraining);
say deduce-type(@dsTesting);

Vector(Assoc(Atom((Str)), Atom((Str)), 11), 6093)
Vector(Assoc(Atom((Str)), Atom((Str)), 11), 2031)
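The split above keeps the rows in file order. If the rows happen to be ordered in a way that correlates with the label, a shuffled split may be preferable. Here is a minimal sketch in core Raku of that alternative; it is not what the notebook does, and the toy `@records` data is hypothetical.

```raku
# Toy stand-in for @dsData: 20 records with an id and a label (hypothetical values)
my @records = (1..20).map({ %( id => $_, label => $_ %% 2 ?? 'a' !! 'b' ) });

# Shuffle first, then cut at 75%; .pick(*) returns all elements in random order
my @shuffled = @records.pick(*);
my $n-train  = floor(0.75 * @shuffled.elems);
my @training = @shuffled.head($n-train);
my @testing  = @shuffled.tail(@shuffled.elems - $n-train);

say @training.elems, ' training, ', @testing.elems, ' testing';
# prints: 15 training, 5 testing
```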


Create a Sparse Matrix Recommender (SMR) object with the training data:

In [14]:
my $smrObj = 
    ML::SparseMatrixRecommender.new(:native)
    .create-from-wide-form(
        @dsTraining,
        item-column-name => "id",
        tag-types => Whatever,
        :add-tag-types-to-column-names,
        tag-value-separator => ":")
    .apply-term-weight-functions("IDF", "None", "Cosine")

ML::SparseMatrixRecommender(:matrix-dimensions((6093, 50)), :density(0.2), :tag-types(("gill-Spacing", "bruises?", "odor", "gill-Size", "gill-Attachment", "cap-Color", "cap-Surface", "cap-Shape", "gill-Color", "edibility")))
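To give an idea of what the `"IDF"` global weight and the `"Cosine"` normalizer do, here is a minimal sketch in core Raku on a toy item-by-tag data set. It assumes the common definition IDF(tag) = log(N / df(tag)) and unit Euclidean length per row; the package's exact formulas may differ, and the toy `%items` data is hypothetical.

```raku
# Toy item-by-tag data (hypothetical): item id => list of tags
my %items =
    1 => <odor:none cap-Color:brown>,
    2 => <odor:none cap-Color:gray>,
    3 => <odor:foul cap-Color:brown>,
    4 => <odor:none cap-Color:red>;

# Document frequency: in how many items each tag occurs
my %df;
for %items.values -> @tags { %df{$_}++ for @tags.unique }

# Global weight, assuming IDF(tag) = log(N / df(tag)): rare tags weigh more
my $n   = %items.elems;
my %idf = %df.map(-> $p { $p.key => log($n / $p.value) });

# Weighted rows, then cosine normalization: divide each row by its Euclidean norm
my %rows;
for %items.kv -> $id, @tags {
    my %w    = @tags.map(-> $t { $t => %idf{$t} });
    my $norm = %w.values.map(* ** 2).sum.sqrt;
    %rows{$id} = $norm > 0 ?? %w.map(-> $p { $p.key => $p.value / $norm }).Hash !! %w;
}

for %rows.keys.sort -> $id {
    say $id, ': ', %rows{$id}.sort(*.key).map(-> $p { $p.key ~ ' = ' ~ $p.value.round(0.01) }).join(', ');
}
```

With such weights a frequent tag like `odor:none` contributes less to similarity scores than a rare one like `odor:foul`.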

Here is an example classification:

In [15]:
my $prof = @dsTesting.pick.grep(*.key ∉ <id edibility>).List;
my @prof = |(($prof».key X~ ':') Z~ $prof».value);
$smrObj.classify-by-profile('edibility', @prof, n-top-nearest-neighbors => 4).take-value

{edibility:poisonous => 1}
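The example above first turns a record into a profile of `column:value` tags and then classifies that profile by its nearest neighbors. The following core-Raku sketch illustrates the idea with a deliberately simplified stand-in: items are scored by the number of shared tags, the top `n` are kept, and label scores are scaled so that the best label gets 1. The `classify` sub and the toy data are hypothetical, and the actual `classify-by-profile` works on the weighted sparse matrix rather than on raw tag overlap.

```raku
# A record shaped like those in @dsTesting (hypothetical values)
my %record = id => 'x1', edibility => 'poisonous',
             odor => 'foul', 'bruises?' => 'False', 'gill-Size' => 'narrow';

# Drop the identifier and the label, then join column name and value with ':'
my @prof = %record.grep(-> $p { $p.key ∉ <id edibility> })
                  .map(-> $p { $p.key ~ ':' ~ $p.value }).sort;

# Toy training items (hypothetical): tags per item plus a known label
my @train =
    %( tags => <odor:none bruises?:True  gill-Size:broad>,  label => 'edible' ),
    %( tags => <odor:none bruises?:False gill-Size:broad>,  label => 'edible' ),
    %( tags => <odor:foul bruises?:False gill-Size:narrow>, label => 'poisonous' ),
    %( tags => <odor:foul bruises?:False gill-Size:broad>,  label => 'poisonous' );

sub classify(@profile, @items, :$n = 3) {
    # Score each item by the number of tags it shares with the profile
    my @scored = @items.map(-> %it {
        %( score => (%it<tags>.Set ∩ @profile.Set).elems, label => %it<label> )
    });
    # Keep the n best-scoring items among those with a non-zero score
    my @top = @scored.grep(-> %s { %s<score> > 0 }).sort(-> %s { -%s<score> }).head($n);
    # Sum the scores per label and scale so that the best label gets 1
    my %votes;
    for @top -> %t { %votes{%t<label>} += %t<score> }
    my $max = %votes.values.max;
    %votes.map(-> $p { $p.key => $p.value / $max }).Hash
}

say @prof;
say classify(@prof, @train);
say classify(@prof, @train, n => 2);
```

Note that a label absent from the top neighbors does not appear in the result at all, and a profile with no overlapping items yields an empty result.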

---

## Batch classification

In [16]:
@dsTesting.elems

2031

In [36]:
my $n-top-nearest-neighbors = 20;

my @noNNs;

# ≈2.5 times speed-up using .race(:4degree, :500batch)
my @dsResults = @dsTesting.map( -> %record {
    my @prof = %record.grep(*.key ∉ <id edibility>).List;
    @prof = |((@prof».key X~ ':') Z~ @prof».value);
    my %class = $smrObj.classify-by-profile('edibility', @prof, :$n-top-nearest-neighbors).take-value;
    %class .= map({ $_.key.subst('edibility:') => $_.value.Num});

    @noNNs.push(@prof) if %class.elems == 0;

    %( id => %record<id>, actual => %record<edibility>, predicted => %class.sort(-*.value).head.key, |%class)
});

deduce-type(@dsResults)

Vector((Any), 2031)

In [37]:
tally(@dsResults».elems)

{4 => 1903, 5 => 128}

In [38]:
my @prof = <gill-Size:broad odor:none gill-Attachment:free cap-Color:white gill-Spacing:crowded gill-Color:pink cap-Shape:convex cap-Surface:fibrous bruises?:False>;
$smrObj.classify-by-profile('edibility', @prof, :$n-top-nearest-neighbors).take-value

{edibility:edible => 1}

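Some records have only one class score because only one label occurred among their nearest neighbors. Make sure every record has the complete set of columns: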

In [39]:
my %empty = :0poisonous, :0edible;
@dsResults = @dsResults.map({ merge-hash(%empty, $_) });
tally(@dsResults».elems)

{5 => 2031}
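The same defaults-then-record merge can be written without `merge-hash`. Here is a minimal sketch in core Raku, assuming flat (non-nested) records; the sample `@results` values are hypothetical.

```raku
# Two result records (hypothetical values); the second lacks the 'edible' score
my @results =
    %( id => 'a', actual => 'edible',    predicted => 'edible',    edible => 1, poisonous => 0.25 ),
    %( id => 'b', actual => 'poisonous', predicted => 'poisonous', poisonous => 1 );

# Defaults first, record second: later keys win, so existing scores are kept
my %defaults = poisonous => 0, edible => 0;
my @filled = @results.map(-> %r { %( |%defaults, |%r ) });

say @filled.map(*.elems);
```

Putting the defaults first matters: with the order reversed, the zeros would overwrite the computed scores.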

In [40]:
sink records-summary(@dsResults)

+-----------------+-------------------------------+-------------------+-----------------------------+-------------------+
| id              | poisonous                     | predicted         | edible                      | actual            |
+-----------------+-------------------------------+-------------------+-----------------------------+-------------------+
| 6648    => 1    | Min    => 0                   | poisonous => 1478 | Min    => 0                 | poisonous => 1506 |
| 7882    => 1    | 1st-Qu => 0.16483516483516483 | edible    => 553  | 1st-Qu => 0                 | edible    => 525  |
| 7360    => 1    | Mean   => 0.7384550117920903  |                   | Mean   => 0.275103068186255 |                   |
| 7253    => 1    | Median => 1                   |                   | Median => 0                 |                   |
| 7118    => 1    | 3rd-Qu => 1                   |                   | 3rd-Qu => 1                 |                   |
| 6818    => 1    | Max 

In [41]:
#% html
@dsResults.pick(10)
==> to-html(field-names => <id actual predicted edible poisonous>)

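| id | actual | predicted | edible | poisonous |
|------|-----------|-----------|--------|--------------------|
| 6332 | edible | edible | 1 | 0.0 |
| 6821 | poisonous | poisonous | 0 | 1.0 |
| 6744 | poisonous | poisonous | 0 | 1.0 |
| 6165 | poisonous | poisonous | 0 | 1.0 |
| 7074 | edible | edible | 1 | 0.4285714285714285 |
| 6277 | poisonous | poisonous | 0 | 1.0 |
| 6507 | poisonous | poisonous | 0 | 1.0 |
| 6219 | poisonous | poisonous | 0 | 1.0 |
| 7520 | poisonous | poisonous | 0 | 1.0 |
| 6792 | poisonous | poisonous | 0 | 1.0 |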


---

## Classifier metrics

Confusion matrix:

In [42]:
my $ct = cross-tabulate(@dsResults.grep(*<predicted>), "actual", "predicted");
to-pretty-table($ct)

+-----------+--------+-----------+
|           | edible | poisonous |
+-----------+--------+-----------+
| edible    |  525   |           |
| poisonous |   28   |    1478   |
+-----------+--------+-----------+
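Standard summary metrics follow directly from the four cells of this confusion matrix. A minimal sketch in core Raku, taking 'poisonous' as the positive class and assuming the empty cell means zero:

```raku
# Counts from the confusion matrix above, with 'poisonous' as the positive class;
# the empty cell (actual edible, predicted poisonous) is assumed to be 0
my ($tp, $fn) = 1478, 28;   # actual poisonous: predicted poisonous / edible
my ($fp, $tn) = 0, 525;     # actual edible:    predicted poisonous / edible

my $accuracy  = ($tp + $tn) / ($tp + $fn + $fp + $tn);
my $recall    = $tp / ($tp + $fn);   # share of poisonous mushrooms that were caught
my $precision = $tp / ($tp + $fp);   # share of 'poisonous' predictions that were right

say "accuracy  = { $accuracy.round(0.0001) }";
say "recall    = { $recall.round(0.0001) }";
say "precision = { $precision.round(0.0001) }";
```

For this task recall on the poisonous class is the metric to watch, since the 28 misclassified records are poisonous mushrooms predicted as edible.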

Prettier version using HTML rendering:

In [31]:
#%html
$ct.map({ (actual => $_.key, |merge-hash(%empty, $_.value)) })».Hash.sort(*<actual>) 
==> to-html(field-names => <actual edible poisonous>)

| actual | edible | poisonous |
|-----------|--------|-----------|
| edible | 525 | 0 |
| poisonous | 28 | 1478 |


Receiver Operating Characteristic (ROC) metrics computation:

In [32]:
my @thRange = [|(0, 0.01 ... 0.4), |(0.4, 0.45 ... 1)].unique.sort;

my @rocs = @thRange.map(-> $th { to-roc-hash('poisonous', 'edible', 
                                                select-columns(@dsResults, 'actual')>>.values.flat, 
                                                select-columns(@dsResults, 'poisonous')>>.values.flat.map({ $_ >= $th ?? 'poisonous' !! 'edible' })) });

deduce-type(@rocs)                                        

Vector(Assoc(Atom((Str)), Atom((Int)), 4), 53)
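Each ROC record holds the four outcome counts for one threshold, and the rates plotted later are ratios of those counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The following core-Raku sketch shows the counting on toy data; `roc-counts` is a hypothetical stand-in for illustration, not the `to-roc-hash` implementation, and the labels and scores are made up.

```raku
# Count ROC outcomes for one threshold; 'poisonous' is the positive label
sub roc-counts(@actual, @scores, $th) {
    my %roc = TruePositive => 0, FalsePositive => 0, TrueNegative => 0, FalseNegative => 0;
    for @actual Z @scores -> ($label, $score) {
        my $pos-pred = $score >= $th;            # predicted positive?
        my $pos-act  = $label eq 'poisonous';    # actually positive?
        my $key = ($pos-pred == $pos-act ?? 'True' !! 'False')
                ~ ($pos-pred ?? 'Positive' !! 'Negative');
        %roc{$key}++;
    }
    %roc
}

# Toy labels and 'poisonous' scores (hypothetical values)
my @actual = <poisonous poisonous edible edible poisonous>;
my @scores = 0.9, 0.4, 0.2, 0.6, 0.05;

for 0.1, 0.5 -> $th {
    my %r   = roc-counts(@actual, @scores, $th);
    my $tpr = %r<TruePositive>  / (%r<TruePositive>  + %r<FalseNegative>);
    my $fpr = %r<FalsePositive> / (%r<FalsePositive> + %r<TrueNegative>);
    say "threshold $th: TPR = $tpr, FPR = $fpr";
}
```

Raising the threshold moves records from the positive to the negative prediction, so both rates can only stay the same or decrease.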

Tabulate ROC records:

In [None]:
#%html
@rocs
==> to-html(field-names => <FalsePositive FalseNegative TrueNegative TruePositive>)

Plot ROC functions (False Positive Rate vs True Positive Rate):

In [34]:
text-list-plot(roc-functions('FPR')(@rocs), roc-functions('TPR')(@rocs),
                width => 70, height => 25, 
                x-label => 'FPR', y-label => 'TPR', 
                x-limit => (0, 1))

++------------+-------------+------------+-------------+------------++        
|                                                                    |        
+  **   **  *  *    **             *                                *+  1.00  
|                                                                    |        
|  *                                                                 |        
| **                                                                 |        
| *                                                                  |        
|**                                                                  |        
|*                                                                   |        
+*                                                                   +        
|                                                                    |        
|*                                                                   |       T
|*                                                  

In [44]:
#% js
js-d3-list-plot(
    (roc-functions('FPR')(@rocs) Z roc-functions('TPR')(@rocs)).List,
    width => 400, height => 350, 
    x-label => 'FPR', y-label => 'TPR',
    :$title-color,
    :$background,
    :!grid-lines,
    title => 'ROC curve'
)