# Usage

Last revision: 2021-05-13

Based on https://github.com/efurlanm/ml/blob/master/docs/Usage.md

The list of all supported options can be obtained by running:

In [3]:
! ./parf -h

PARF (C) 2005 Rudjer Boskovic Institute
Goran Topic, Tomislav Smuc; algorithm by Leo Breiman and Adele Cutler
Licensed under GNU GPL 2.0
 
Usage: rf [OPTION...]
-h | --help   show this message
-t file       file to use as training set
-a file       file to analyse and classify
-tv [file]    training set votes output file
-tc [file]    training set confusion matrix output file
-av [file]    test set votes output file
-ac [file]    test set confusion matrix output file
-ar [file]    test set classification results output file
-aa [file]    test set ARFF output file
-ta [file]    train + test set ARFF output file
-c class      the class attribute, or NEW, or LAST (default)
-cq [n[%]]    quantity of generated class instances (only with -c NEW)
-cp category  positive category
-n trees      the number of trees to grow
-f n          the fill method: 0=none, 1=rough, 2+=# of passes
-v n          redo the forest with n most important variables
-vs n         redo the forest with variables more s

Weka's ARFF datasets can be used with PARF. To train a forest on an example dataset (glass.arff), pass it with `-t file`, which names the file to use as training set:

In [9]:
%%bash
./parf --verbose -t datasets/glass.arff | head -15

Seed:   -125114485
Loading training set
Number of training cases:    214
Number of attributes:         10
Counting classes
Number of used attributes:     9
Attributes to split on:        3
Sorting and ranking
Growing forest
        Tree #     1
        Tree #     2
        Tree #     3
        Tree #     4
        Tree #     5
        Tree #     6


By default, the program takes the last attribute as the class. This run is not useful in itself, since the forest is grown but never used; the optional `--verbose` flag simply lets us see that the program is working.

We can then use the generated forest to classify another data set (or the same one, if no other data set is available yet). Here `-t file` names the training set and `-a file` names the file to analyse and classify (the "supplied test set"):

In [8]:
%%bash
./parf --verbose -t datasets/segment-challenge.arff -a datasets/segment-test.arff | head -15

Seed:   -136617806
Loading training set
Loading test set
Number of training cases:   1500
Number of attributes:         20
Counting classes
Number of used attributes:    19
Attributes to split on:        4
Sorting and ranking
Growing forest
        Tree #     1
        Tree #     2
        Tree #     3
        Tree #     4
        Tree #     5


The dataset to be classified can have the same attribute specification as the one used to train the forest, or it can lack the class attribute and have all the other attributes the same. If it is unlabeled, the output will show the classes for all the instances; if it is labeled, only the erroneously classified ones will be shown.

In arguments where there is a file name, a dash `-` can be used to mean standard input or output. When the argument is optional, the dash can also be used. If an existing file name is specified for an output option, the file will be overwritten without warning.

In arguments that take an attribute, either the attribute name or its number can be used. If an argument contains spaces, it must be enclosed in quotation marks. In contrast to ARFF files, the only valid separator between items within an option is a comma; surrounding spaces are ignored. In addition, the entire list must be quoted, not just the individual names. (Note that the quotes used here are interpreted by your shell, not by the parf code, and follow the shell's quoting semantics.) For example:

In [2]:
! ./parf -t datasets/glass.arff -fd "glass forest.txt" -c 10 -u 'RI, Type'

Trainset classification error is  52.80% of     214 (kappa: 0.1812 )


where:

    -t file       file to use as training set
    -fd [file]    dump the forest as a text
    -c class      the class attribute
    -u(u) var,... comma-separated list of used or unused attributes
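
To see how the shell hands the quoted arguments to parf, the command line above can be tokenized with Python's standard `shlex` module, which follows POSIX shell quoting rules. This is only an illustration of the quoting; it does not run parf.

```python
import shlex

# The command line from the example above, exactly as typed in the shell.
command = "./parf -t datasets/glass.arff -fd \"glass forest.txt\" -c 10 -u 'RI, Type'"

# shlex.split applies POSIX shell quoting rules: each quoted argument
# becomes a single token and the quotes themselves are removed.
argv = shlex.split(command)

for token in argv:
    print(repr(token))
```

The file name `glass forest.txt` and the attribute list `RI, Type` each arrive as one argument, which is why the whole list has to be quoted rather than the individual names.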

In [64]:
! head -n 15 "glass forest.txt"

R (214): RI is 1.517335 or less?
  Y (87): RI is 1.516785 or less?
    Y (66): RI is 1.51549 or less?
      Y (12): RI is 1.513075 or less?
        Y (4): RI is 1.51123 or less?
          Y (2): Type is [tableware].
          N (2): RI is 1.51215 or less?
            Y (1): Type is [headlamps].
            N (1): Type is [tableware].
        N (8): RI is 1.515225 or less?
          Y (6): RI is 1.513625 or less?
            Y (3): Type is [containers].
            N (3): RI is 1.514615 or less?
              Y (1): Type is [build wind non-float].
              N (2): Type is [containers].
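
The dump is plain text, so it can be post-processed with a few lines of Python. The sketch below is a minimal parser for single lines of the format shown above; the two-spaces-per-level indentation and the meaning of the `R`/`Y`/`N` markers (root, yes-branch, no-branch) are assumptions read off this sample, not a documented format.

```python
import re

# One dump line: indentation, branch marker (R, Y or N), the instance
# count in parentheses, then the node text.
NODE = re.compile(r"^(?P<indent> *)(?P<branch>[RYN]) \((?P<count>\d+)\): (?P<text>.*)$")


def parse_dump_line(line):
    """Return a dict describing one node, or None if the line is not a node."""
    match = NODE.match(line.rstrip())
    if match is None:
        return None
    text = match.group("text")
    return {
        "depth": len(match.group("indent")) // 2,  # assumes two spaces per level
        "branch": match.group("branch"),
        "count": int(match.group("count")),
        "text": text,
        "is_leaf": not text.endswith("?"),  # split nodes end with a question
    }


sample = [
    "R (214): RI is 1.517335 or less?",
    "  Y (87): RI is 1.516785 or less?",
    "          Y (2): Type is [tableware].",
]
for line in sample:
    print(parse_dump_line(line))
```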


# Forest Growth Options

    -t file
Determines which ARFF file to train from.

In [10]:
! ./parf -t datasets/glass.arff

Trainset classification error is  20.56% of     214 (kappa: 0.6812 )


    -tv [file]
Outputs the training set out-of-bag votes.

In [16]:
%%bash
./parf -t datasets/glass.arff -tv | head

Trainset classification error is  21.03% of     214 (kappa: 0.6739 )
         1     0.8293    0.1220    0.0488    0.0000    0.0000    0.0000    0.0000
         2     0.3429    0.2857    0.3714    0.0000    0.0000    0.0000    0.0000
         3     0.4595    0.4054    0.1351    0.0000    0.0000    0.0000    0.0000
         4     0.0976    0.1951    0.0000    0.0000    0.0488    0.5610    0.0976
         5     0.0714    0.8571    0.0238    0.0000    0.0476    0.0000    0.0000
         6     0.1944    0.6111    0.1944    0.0000    0.0000    0.0000    0.0000
         7     0.5676    0.0541    0.3514    0.0000    0.0000    0.0000    0.0270
         8     0.6923    0.2308    0.0256    0.0000    0.0256    0.0256    0.0000
         9     0.0286    0.0000    0.0571    0.0000    0.0000    0.0000    0.9143
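
A short Python sketch for reading this output: each row is parsed into a list of vote fractions and the class with the largest share is picked. It assumes that the first column is the instance number and that the remaining columns follow the order in which the classes are declared in glass.arff; the output itself does not label the columns, so treat that mapping as an assumption.

```python
# Class order as declared in the glass.arff header (assumed column order).
CLASSES = [
    "build wind float", "build wind non-float", "vehic wind float",
    "vehic wind non-float", "containers", "tableware", "headlamps",
]


def parse_votes(text):
    """Map instance number -> list of vote fractions."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the summary line and anything that is not "index v1 v2 ...".
        if not fields or not fields[0].isdigit():
            continue
        rows[int(fields[0])] = [float(value) for value in fields[1:]]
    return rows


def top_class(votes):
    """Return (class name, vote share) of the most voted class."""
    best = max(range(len(votes)), key=votes.__getitem__)
    return CLASSES[best], votes[best]


sample = """\
Trainset classification error is  21.03% of     214 (kappa: 0.6739 )
         1     0.8293    0.1220    0.0488    0.0000    0.0000    0.0000    0.0000
         2     0.3429    0.2857    0.3714    0.0000    0.0000    0.0000    0.0000
         4     0.0976    0.1951    0.0000    0.0000    0.0488    0.5610    0.0976
"""
for number, votes in parse_votes(sample).items():
    print(number, top_class(votes))
```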


    -tc [file]
Outputs the training set confusion matrix. Each row is one class as labeled in the file; each column is one class as classified by the program. The "NoTag" row contains the instances with an unknown class label (?); the "NotCl" column contains the instances that could not be classified because they were taken into the bootstrap sample for every single tree (and were thus never out-of-bag).

In [31]:
! ./parf -t datasets/glass.arff -tc

Trainset classification error is  21.50% of     214 (kappa: 0.6667 )
Tag\Cl NotCl build build vehic vehic conta table headl
 NoTag     0     0     0     0     0     0     0     0
 build     0    63     6     1     0     0     0     0
 build     0    10    59     2     0     2     2     1
 vehic     0    10     2     5     0     0     0     0
 vehic     0     0     0     0     0     0     0     0
 conta     0     0     4     0     0     8     0     1
 table     0     0     2     0     0     0     7     0
 headl     0     1     2     0     0     0     0    26
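
The summary line can be recomputed from the matrix. The sketch below, in Python, derives the error rate from the diagonal. The kappa formula is not given in this document; for this matrix the printed value is reproduced by comparing accuracy against always predicting the largest class, so that is what `kappa_vs_majority` (a hypothetical name) does. Whether this is PARF's actual formula is an assumption.

```python
# Confusion matrix from the output above, without the NoTag row and the
# NotCl column (both are all zero here). Rows: labeled, columns: classified.
matrix = [
    [63, 6, 1, 0, 0, 0, 0],
    [10, 59, 2, 0, 2, 2, 1],
    [10, 2, 5, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 8, 0, 1],
    [0, 2, 0, 0, 0, 7, 0],
    [1, 2, 0, 0, 0, 0, 26],
]


def error_rate(m):
    """Fraction of instances that lie off the diagonal."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return 1.0 - correct / total


def kappa_vs_majority(m):
    """Agreement beyond the majority-class baseline (assumed definition)."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    majority = max(sum(row) for row in m)  # size of the largest labeled class
    return (correct - majority) / (total - majority)


print(f"error: {100 * error_rate(matrix):.2f}%")
print(f"kappa: {kappa_vs_majority(matrix):.4f}")
```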


    -c attribute
Determines the attribute to be used as the class attribute. Instead of the attribute number or name, the special values last or new can be used. The former is the default and takes the last attribute as the class attribute. The latter performs unsupervised learning by creating a new attribute with values { original, constructed }. The number of instances doubles: half are the original data, and the other half are constructed by randomly permuting each column of the original data.

In [33]:
! ./parf -t datasets/glass.arff -c 10

Trainset classification error is  22.90% of     214 (kappa: 0.6449 )
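
The construction described for the new value can be sketched in a few lines of Python. This is an illustration of the idea only, not PARF's code; the function name and the `seed` parameter are hypothetical.

```python
import random


def add_constructed_class(rows, seed=0):
    """Double the data: originals plus a column-wise permuted copy."""
    rng = random.Random(seed)
    # Work column by column so that each attribute is permuted on its own,
    # which keeps every marginal distribution but breaks the dependencies
    # between attributes.
    columns = [list(column) for column in zip(*rows)]
    for column in columns:
        rng.shuffle(column)
    constructed = [list(row) for row in zip(*columns)]
    labelled = [list(row) + ["original"] for row in rows]
    labelled += [row + ["constructed"] for row in constructed]
    return labelled


data = [[1.52, 13.2, "a"], [1.51, 14.4, "b"], [1.53, 12.9, "c"]]
for row in add_constructed_class(data):
    print(row)
```

A forest trained to separate the two labels can only succeed where the original data has structure between attributes, which is what makes the setup usable without class labels.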


    -cp category
Determines the category of the class attribute that will be considered positive. All other categories are considered negative. If not specified, no category shall be singled out as positive.

In [34]:
! ./parf -t datasets/glass.arff -cp 2

Trainset classification error is  23.36% of     214 (kappa: 0.6377 )
Trainset Precision:  0.7600; Recall:  0.6404; Specificity:  0.8560
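
For reference, the three measures on the extra output line follow the usual one-versus-rest definitions. In the Python sketch below, the four counts are hypothetical: they were chosen only because they reproduce the figures printed above, and the actual counts of that run are not shown in this document.

```python
def precision(tp, fp):
    """Share of instances classified as positive that really are positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of positive instances that were classified as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of negative instances that were classified as negative."""
    return tn / (tn + fp)


# Hypothetical counts (true/false positives and negatives), for illustration.
tp, fp, fn, tn = 57, 18, 32, 107

print(f"Precision: {precision(tp, fp):.4f}")
print(f"Recall: {recall(tp, fn):.4f}")
print(f"Specificity: {specificity(tn, fp):.4f}")
```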


    -n num
Determines the number of trees to grow. Currently the default is the very low value of 10, which is good for testing but does not give statistically good results. For serious analyses, significantly more trees should be used, since accuracy generally improves and stabilizes as the number of trees grows.

In [40]:
! ./parf -t datasets/glass.arff -n 20

Trainset classification error is  24.30% of     214 (kappa: 0.6232 )


    -m num
Determines the number of attributes to choose randomly at each node. The best node split is determined only on this subset of attributes. Some experimentation is required for each dataset to determine the best value of this parameter. The default is usually "good enough", and is defined as the square root of the number of used attributes.

In [41]:
! ./parf -t datasets/glass.arff -m 4

Trainset classification error is  22.43% of     214 (kappa: 0.6522 )
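
The default can be checked against the verbose runs earlier in this document, which report 3 attributes to split on for 9 used attributes (glass) and 4 for 19 used attributes (segment). The Python sketch below assumes the square root is truncated to an integer; rounding to nearest would give the same result for these two cases, so the exact rule remains an assumption.

```python
import math


def default_split_attributes(used_attributes):
    """Integer square root of the number of used attributes (assumed rule)."""
    return math.isqrt(used_attributes)


for used in (9, 19):
    print(used, "used attributes ->", default_split_attributes(used), "to split on")
```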


    -xs num[%]
Determines the minimum number of instances that a node should have before being considered for splitting. A percentage is taken relative to all instances. The default is 2, i.e. any node that can be split at all should be analysed for splitting.

In [44]:
! ./parf -t datasets/glass.arff -xs 66%

Trainset classification error is  64.95% of     214 (kappa: -.0072 )
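
To make the `num[%]` form concrete, the hypothetical Python helper below turns such an argument into an instance count. How PARF rounds a fractional result is not stated here, so the value is left as a float.

```python
def resolve_quantity(argument, total_instances):
    """Interpret 'num' as a count and 'num%' as a share of all instances."""
    argument = argument.strip()
    if argument.endswith("%"):
        return float(argument[:-1]) / 100.0 * total_instances
    return float(argument)


# The run above: 66% of the 214 glass instances.
print(resolve_quantity("66%", 214))
# The documented default of 2 instances.
print(resolve_quantity("2", 214))
```

With a threshold of roughly 141 instances, very few nodes of a 214-instance training set qualify for splitting, which fits the high error in the run above.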


    -xr ratio[%]
Determines the minimum split ratio that is worth splitting for. If the node split is that unbalanced or more, the split will not happen. A percentage specifies a percentage of the instances in a node; if the percent sign is not used, then the number specifies a ratio (a number between 0 and 1). The default is 0, meaning that any split that is not degenerate (0 : all) is worth it.

In [45]:
! ./parf -t datasets/glass.arff -xr 66%

Trainset classification error is  64.49% of     214 (kappa: 0.0000 )


    -b num
Determines the number of categories that form a cutoff point between exhaustive and random category split search ("big" attributes). If the number of categories of an attribute is this or greater, then the random search will be performed, since the exhaustive search space is exponentially dependent on the number of categories. The default is 12, signifying that the exhaustive search will be performed only where 2048 or fewer combinations of categories are possible.

In [46]:
! ./parf -t datasets/glass.arff -b 6

Trainset classification error is  23.36% of     214 (kappa: 0.6377 )
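
The cutoff rule and the exponential growth behind it can be sketched in Python as follows. The count of combinations is taken as 2 to the power of the number of categories, which is the figure the description above uses (2048 for 11 categories); the function names are hypothetical.

```python
def split_search(categories, cutoff=12):
    """Which search is used for an attribute with this many categories."""
    return "random" if categories >= cutoff else "exhaustive"


def combinations(categories):
    """Search space size, counted as 2**k as in the description above."""
    return 2 ** categories


for k in (4, 8, 11, 12, 20):
    print(f"{k:2d} categories: {combinations(k):8d} combinations, {split_search(k)} search")
```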


    -bi num
Determines the number of iterations for the random split search, where used.

In [47]:
! ./parf -t datasets/glass.arff -bi 2

Trainset classification error is  23.83% of     214 (kappa: 0.6304 )


    -u(u) attribute,...
Lists the attributes that will be used (or unused) in this run. Overrides the @ignore specifier in the ARFF file. String attributes cannot be marked as used. By default, all attributes except the @ignored ones are used before these options take effect; however, when -u is given without -uu, the starting point is that no attributes are used. Unless the argument is quoted, it must not contain spaces.

In [13]:
! ./parf -t datasets/glass.arff -u RI,Na,Mg,Al,Type

Trainset classification error is  24.77% of     214 (kappa: 0.6159 )


    -w num,...
Specifies the weight override. If this option is used, the program ignores the weights listed in the ARFF file, and uses these values instead. The number of weights must match the number of categories of the class attribute.

In [48]:
! ./parf -t datasets/glass.arff -w 60,80,20,10,20,5,30

Trainset classification error is  26.64% of     214 (kappa: 0.5870 )


    -r num
Seeds the random number generator. If you run the program with the same input files and the same command line options, specifying the same seed, the results should be the same in each run. If this option is not specified, the random number generator is seeded randomly (by the system time), and the results are not repeatable.

In [49]:
! ./parf -t datasets/glass.arff -r 1

Trainset classification error is  21.03% of     214 (kappa: 0.6739 )


    -f num
Specifies the number of passes for filling of missing values. If this value is 0, no missing values will be filled. Otherwise, in the first pass rough fills are calculated. Each subsequent pass obtains fills from proximity values. The default is 1, giving only the one rough fill pass.

In [50]:
! ./parf -t datasets/glass.arff -f 0

Trainset classification error is  22.43% of     214 (kappa: 0.6522 )


    -v num
If this option is specified, after the forest has been generated (with all its missing value filling passes), all but num most important attributes will be ignored, and the forest will be generated again.

In [51]:
! ./parf -t datasets/glass.arff -v 1

Trainset classification error is  53.27% of     214 (kappa: 0.1739 )


    -mv num
This option only makes sense if the -v option is used as well. It works like the -m option, but applies to the second pass, in which only the most important attributes are retained. The default is the square root of the number of variables actually used in the second pass.

In [52]:
! ./parf -t datasets/glass.arff -v 1 -mv 1

Trainset classification error is  56.07% of     214 (kappa: 0.1304 )


# Additional Training Analysis Options

    -p [num[%]]
When calculating proximities (which is a prerequisite for some of the following calculations), specifies the number of closest instances to retain for each instance. Depending on the number of instances, the proximity matrix can become so large that the calculation cannot be completed. In such cases the value of this parameter should be reduced. The default is 100% of the instances. As many of the following calculations depend on the proximity matrix, this option does somewhat impact their accuracy; however, as the least proximate cases are dropped, the error thus introduced is not very large.

In [53]:
! ./parf -t datasets/glass.arff -p 50%

Trainset classification error is  19.16% of     214 (kappa: 0.7029 )


    -if [file]

Requests the fast calculation of variable importances.

In [1]:
! ./parf -t datasets/glass.arff -if

Trainset classification error is  22.43% of     214 (kappa: 0.6522 )
 dGINI Tag
  2.05 Al                            
  2.02 Mg                            
  1.69 Ba                            
  1.07 RI                            
  1.01 Ca                            
  0.83 Na                            
  0.72 K                             
  0.41 Si                            
  0.21 Fe                            


    -i [file]

Requests the full calculation of variable importances.

In [2]:
! ./parf -t datasets/glass.arff -i

Trainset classification error is  21.96% of     214 (kappa: 0.6594 )
  Imp   Z-Sc Significan Tag
 11.66  16.80   0.000000 RI                            
 15.57  16.48   0.000000 Mg                            
 11.78  15.30   0.000000 Al                            
  7.59  11.23   0.000000 Ca                            
  6.56  10.81   0.000000 Ba                            
  4.88   9.65   0.000000 K                             
  4.97   8.90   0.000000 Na                            
  3.55   8.60   0.000000 Si                            
  0.61   3.66   0.000124 Fe                            
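
The "Significan" column is not explained in this document. The value printed for Fe (z-score 3.66, significance 0.000124) is close to the upper-tail probability of a standard normal distribution at that z-score, so the Python sketch below assumes that interpretation; the small difference would be consistent with the z-score being rounded in the output.

```python
import math


def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for tag, z in (("RI", 16.80), ("Si", 8.60), ("Fe", 3.66)):
    print(f"{tag:2s} z={z:5.2f} p={upper_tail_probability(z):.6f}")
```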


    -ic [file]

Requests the calculation of case-by-case variable importances.

In [15]:
%%bash
./parf -t datasets/glass.arff -ic | head

Trainset classification error is  23.83% of     214 (kappa: 0.6304 )
         1      0.14      0.09      0.17      0.13      0.08     -0.00      0.04     -0.09     -0.00     -0.71
         2      0.14     -0.08      0.05      0.09     -0.05     -0.01      0.04     -0.17     -0.02     -0.72
         3      0.09     -0.01      0.03     -0.05     -0.03     -0.01     -0.00     -0.17     -0.02     -0.87
         4     -0.05      0.10      0.08      0.00      0.00      0.15     -0.06     -0.10     -0.01     -0.74
         5      0.10      0.01      0.13      0.01      0.05      0.04      0.10     -0.17     -0.04     -0.88
         6      0.05      0.01      0.04      0.04     -0.01     -0.00      0.04     -0.08      0.03     -0.65
         7      0.01      0.01      0.02     -0.02     -0.00      0.04      0.03     -0.13     -0.05     -0.75
         8     -0.00      0.02      0.06      0.05      0.01      0.05      0.05     -0.08     -0.02     -0.72
         9      0.08      0.03      0.11   

    -ii [file]

Requests the calculation of variable interactions. Experimental.

In [5]:
! ./parf -t datasets/glass.arff -ii

Trainset classification error is  20.09% of     214 (kappa: 0.6884 )
       0     -37     -81       2      -2     -36     -83       4     -70       0
     -37       0     -71     -76     -81     -14     -21      53      -2       0
     -81     -71       0     -88     -39     -72     -92      -7     -19       0
       2     -76     -88       0     -49     -76      -7     -12     -21       0
      -2     -81     -39     -49       0     -58     -81     -13     -42       0
     -36     -14     -72     -76     -58       0     -14     -42     -75       0
     -83     -21     -92      -7     -81     -14       0       3     -39       0
       4      53      -7     -12     -13     -42       3       0      35       0
     -70      -2     -19     -21     -42     -75     -39      35       0       0
       0       0       0       0       0       0       0       0       0       0


    -y [file]

Requests the calculation of class prototypes, and displays them in ARFF format (containing only the prototype centre).

In [6]:
! ./parf -t datasets/glass.arff -y

Trainset classification error is  19.16% of     214 (kappa: 0.7029 )
@relation Glass-proto
 
@attribute RI numeric
@attribute Na numeric
@attribute Mg numeric
@attribute Al numeric
@attribute Si numeric
@attribute K numeric
@attribute Ca numeric
@attribute Ba numeric
@attribute Fe numeric
@attribute Type { "build wind float", "build wind non-float", "vehic wind float", "vehic wind non-float", containers, tableware, headlamps }
 
@data
1.51784, 13.2, 3.57, 1.21, 72.79, 0.56, 8.7, 0, 0, "build wind float"
1.5169, 13.2, 3.55, 1.48, 72.75, 0.6, 8.21, 0, 0, "build wind non-float"
1.51776, 13.5, 3.54, 1.28, 72.65, 0.56, 8.79, 0, 0, "vehic wind float"
1.51994, 12.97, 0, 1.76, 72.69, 0.58, 11.27, 0, 0, containers
1.51888, 14.4, 1.74, 1.56, 72.74, 0, 9.57, 0, 0, tableware
1.51651, 14.39, 0, 2.06, 73.11, 0, 8.67, 0.81, 0, headlamps
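
Since the prototypes come out as ARFF, the data section can be read back with Python's standard `csv` module. The sketch below handles only the simple subset of ARFF seen in this output (comma-separated values, double-quoted strings, no comments or sparse rows); it is not a general ARFF reader.

```python
import csv


def read_arff_data(text):
    """Return the rows that follow the @data line as lists of strings."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip().lower() == "@data") + 1
    # skipinitialspace lets the reader see the quote that follows ", ".
    reader = csv.reader(lines[start:], skipinitialspace=True)
    return [row for row in reader if row]


sample = """\
@relation Glass-proto

@attribute RI numeric
@attribute Type { "build wind float", "build wind non-float" }

@data
1.51784, 13.2, 3.57, 1.21, 72.79, 0.56, 8.7, 0, 0, "build wind float"
1.5169, 13.2, 3.55, 1.48, 72.75, 0.6, 8.21, 0, 0, "build wind non-float"
"""
for row in read_arff_data(sample):
    print(row[-1], [float(value) for value in row[:-1]])
```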


    -ya [file]

Requests the calculation of class prototypes, and displays full prototype information.

In [18]:
%%bash
./parf -t datasets/glass.arff -ya | head -35

Trainset classification error is  23.36% of     214 (kappa: 0.6377 )
 Class: build wind float
     Prototype #1 (65)
         RI
                     1.52      1.52      1.52
                     0.26      0.29      0.53
         Na
                    12.84     13.20     13.58
                     0.16      0.31      0.47
         Mg
                     3.48      3.58      3.66
                     0.90      0.93      0.95
         Al
                     1.10      1.21      1.31
                     0.23      0.30      0.36
         Si
                    72.02     72.79     73.01
                     0.31      0.67      0.77
         K
                     0.19      0.56      0.59
                     0.25      0.74      0.78
         Ca
                     8.44      8.70      9.07
                     0.16      0.24      0.34
         Ba
                     0.00      0.00      0.00
                     0.00      0.00      0.00
         Fe
                     0.00      0.00     

    -yn num

Specifies the maximum number of prototypes to calculate for each class. The number of calculated prototypes may be smaller, if the calculated prototypes cover all the instances of the class in the dataset.

In [19]:
%%bash
./parf -t datasets/glass.arff -yn 2 | head -35

Trainset classification error is  20.09% of     214 (kappa: 0.6884 )
 Class: build wind float
     Prototype #1 (66)
         RI
                     1.52      1.52      1.52
                     0.26      0.29      0.53
         Na
                    12.84     13.21     13.58
                     0.16      0.32      0.47
         Mg
                     3.48      3.58      3.66
                     0.90      0.93      0.95
         Al
                     1.10      1.23      1.32
                     0.23      0.31      0.37
         Si
                    72.02     72.79     73.01
                     0.31      0.67      0.77
         K
                     0.19      0.56      0.59
                     0.25      0.74      0.78
         Ca
                     8.44      8.70      9.07
                     0.16      0.24      0.34
         Ba
                     0.00      0.00      0.00
                     0.00      0.00      0.00
         Fe
                     0.00      0.00     

    -yp num[%]

Specifies the number of instances to consider "close" for the purpose of deciding whether they are covered by a prototype. Must be less than or equal to the value of the -p option. The default is 50% or the value of -p, whichever is smaller.

In [14]:
%%bash
./parf -t datasets/glass.arff -yp 10%

Trainset classification error is  21.50% of     214 (kappa: 0.6667 )


 stances to be considered (-p).


    -s [num]

Specifies the number of coordinates to scale to. The default is 2.

In [20]:
%%bash
./parf -t datasets/glass.arff -s 4 | head

Trainset classification error is  21.50% of     214 (kappa: 0.6667 )
  Row# Coordinates...
     1      0.221    -0.076     0.262     0.284
     2      0.255     0.106     0.035     0.467
     3      0.275    -0.011     0.195     0.276
     4     -0.218    -0.156    -0.223    -0.091
     5     -0.243    -0.584    -0.319     0.034
     6      0.189     0.158    -0.183     0.033
     7      0.152    -0.161     0.292    -0.038
     8      0.188    -0.180     0.256     0.184


    -sd [num]

Specifies the limit to consecutive diverging iterations in scaling computation before giving up. The default is 10. If no argument is given, divergence is wholly disallowed (same as -sd 0).

In [20]:
! ./parf -t datasets/glass.arff -sd 2

Trainset classification error is  21.50% of     214 (kappa: 0.6667 )


    -st [file]

Requests the calculation of training set scaling data. Each instance will be projected into a lower-dimensional space, suitable for plotting.

In [21]:
%%bash
./parf -t datasets/glass.arff -st | head

Trainset classification error is  24.77% of     214 (kappa: 0.6159 )
  Row# Coordinates...
     1      0.130     0.253
     2      0.267     0.000
     3      0.170     0.169
     4     -0.154    -0.229
     5     -0.768    -0.154
     6      0.213    -0.078
     7     -0.109     0.353
     8      0.020     0.189


    -sa [file]

Requests the calculation of classified set scaling data. Each instance will be projected into a lower-dimensional space, suitable for plotting.

In [22]:
%%bash
./parf -t datasets/glass.arff -st | head

Trainset classification error is  18.69% of     214 (kappa: 0.7101 )
  Row# Coordinates...
     1      0.240    -0.071
     2      0.449     0.485
     3      0.244     0.011
     4     -0.241     0.022
     5     -0.181    -0.174
     6      0.145     0.338
     7      0.176    -0.259
     8      0.172    -0.162


    -o num

Specifies the outlier cutoff value; only the instances with an outlier measure greater than or equal to this value will be displayed in the outlier output. By default, all instances are displayed.

In [27]:
%%bash
./parf -t datasets/glass.arff -o 2

Trainset classification error is  19.63% of     214 (kappa: 0.6957 )


    -ot [file]

Requests the calculation of outlier measure in the training set.

In [24]:
%%bash
./parf -t datasets/glass.arff -ot | head -15

Trainset classification error is  23.36% of     214 (kappa: 0.6377 )
    Median  Deviation Tag
 0.3720E+01  0.2495E+01 build wind float              
 0.1510E+01  0.7857E+00 build wind non-float          
 0.1200E+01  0.1513E+01 vehic wind float              
        NaN    Infinity vehic wind non-float          
 0.1640E+00    Infinity containers                    
 0.9303E-01    Infinity tableware                     
 0.8135E+00    Infinity headlamps                     
  Row#    Outlier Tagged as
   208    62.5443 build wind float              
   171    10.8472 build wind non-float          
    14     3.9072 vehic wind float              
   172     3.8390 build wind non-float          
   111     2.9186 build wind non-float          
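
The per-instance part of this output is easy to post-process. The Python sketch below parses the rows after the `Row#` header and applies a cutoff in the same "greater than or equal" sense as the -o option; it works on captured text and makes no claim about how parf applies the cutoff internally.

```python
def parse_outliers(text):
    """Return (row number, outlier measure, tag) for each listed instance."""
    rows = []
    in_table = False
    for line in text.splitlines():
        if line.split()[:2] == ["Row#", "Outlier"]:
            in_table = True  # the per-instance table starts after this header
            continue
        if not in_table or not line.strip():
            continue
        row, measure, tag = line.split(None, 2)
        rows.append((int(row), float(measure), tag.strip()))
    return rows


def above_cutoff(rows, cutoff):
    """Keep instances whose outlier measure is at least the cutoff."""
    return [entry for entry in rows if entry[1] >= cutoff]


sample = """\
 0.8135E+00    Infinity headlamps
  Row#    Outlier Tagged as
   208    62.5443 build wind float
   171    10.8472 build wind non-float
    14     3.9072 vehic wind float
   172     3.8390 build wind non-float
   111     2.9186 build wind non-float
"""
print(above_cutoff(parse_outliers(sample), 4.0))
```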


    -oa [file]

Requests the calculation of outlier measure in the classified set.

In [30]:
%%bash
./parf -t datasets/glass.arff -oa

Trainset classification error is  20.09% of     214 (kappa: 0.6884 )


# Dataset Classification Options

    -a file

Determines which ARFF file to classify.

In [3]:
! ./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff

Trainset classification error is   2.13% of    1500 (kappa: 0.9747 )
 Testset classification error is   2.59% of     810 (kappa: 0.9692 )


In [7]:
! time ./parf -t datasets/volcanoes-on-venus-c1.arff

Trainset classification error is   2.48% of   28626 (kappa: 0.0301 )

real	0m5,495s
user	0m3,209s
sys	0m0,095s


In [6]:
! time ./parf -t datasets/higgs-training.arff -a datasets/higgs-test.arff

Error accessing file datasets/huggs-training.arff
 Error reading datasets/huggs-training.arff: cannot open file

real	0m1,829s
user	0m0,013s
sys	0m0,037s


    -av [file]

Outputs the classification set votes.

In [25]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -av | head

Trainset classification error is   2.67% of    1500 (kappa: 0.9684 )
 Testset classification error is   2.59% of     810 (kappa: 0.9692 )
         1     1.0000    0.0000    0.0000   92.0000    1.0000    6.0000    0.0000
         2     0.0000    0.0000    0.0000    2.0000    0.0000   98.0000    0.0000
         3     0.0000    0.0000    0.0000    0.0000    0.0000    1.0000   99.0000
         4     0.0000    0.0000    0.0000    0.0000    0.0000    0.0000  100.0000
         5     0.0000    0.0000    0.0000    0.0000  100.0000    0.0000    0.0000
         6     0.0000    0.0000   62.0000    0.0000   38.0000    0.0000    0.0000
         7    95.0000    0.0000    3.0000    2.0000    0.0000    0.0000    0.0000
         8     0.0000    0.0000    0.0000    2.0000    0.0000   98.0000    0.0000


    -ar [file]

Depending on whether the classification set is labeled or not, this option outputs either the misclassified instances, or the results of the classification for all instances.

In [41]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -ar

Trainset classification error is   2.47% of    1500 (kappa: 0.9707 )
       Row Classified as                 Tagged as                     Certain
       120 foliage                       window                         76.00%
       128 brickface                     cement                         66.00%
       136 brickface                     window                         48.00%
       176 cement                        window                         62.00%
       179 window                        cement                         51.00%
       216 cement                        grass                          39.00%
       266 cement                        window                         68.00%
       316 foliage                       window                         73.00%
       322 foliage                       window                         69.00%
       433 brickface                     foliage                        44.00%
       442 cement                        grass                

    -ac [file]

Outputs the classification set confusion matrix. Each row is one class as labeled in the file; each column is one class as classified by the program. The "NoTag" row contains the instances with an unknown class label (?). The "NotCl" column is always zero, and is kept for compatibility with the -tc option output.

In [43]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -ac

Trainset classification error is   2.47% of    1500 (kappa: 0.9707 )
 Testset classification error is   3.09% of     810 (kappa: 0.9634 )
Tag\Cl NotCl brick   sky folia cemen windo  path grass
 NoTag     0     0     0     0     0     0     0     0
 brick     0   124     0     0     0     1     0     0
 sky       0     0   110     0     0     0     0     0
 folia     0     2     0   117     1     2     0     0
 cemen     0     1     0     0   105     4     0     0
 windo     0     1     0     8     2   115     0     0
 path      0     0     0     0     0     0    94     0
 grass     0     0     0     1     2     0     0   120


In [4]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -ac -r 1

Trainset classification error is   2.13% of    1500 (kappa: 0.9747 )
 Testset classification error is   2.72% of     810 (kappa: 0.9678 )
Tag\Cl NotCl brick   sky folia cemen windo  path grass
 NoTag     0     0     0     0     0     0     0     0
 brick     0   124     0     0     0     1     0     0
 sky       0     0   110     0     0     0     0     0
 folia     0     3     0   117     0     2     0     0
 cemen     0     1     0     0   107     2     0     0
 windo     0     1     0     7     2   116     0     0
 path      0     0     0     0     0     0    94     0
 grass     0     1     0     1     1     0     0   120


    -aa [file]

Outputs the classified set in the ARFF format.

In [27]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -aa | head -30

Trainset classification error is   2.27% of    1500 (kappa: 0.9731 )
 Testset classification error is   2.72% of     810 (kappa: 0.9678 )
@relation segment
 
@attribute region-centroid-col numeric
@attribute region-centroid-row numeric
@attribute region-pixel-count numeric
@attribute short-line-density-5 numeric
@attribute short-line-density-2 numeric
@attribute vedge-mean numeric
@attribute vegde-sd numeric
@attribute hedge-mean numeric
@attribute hedge-sd numeric
@attribute intensity-mean numeric
@attribute rawred-mean numeric
@attribute rawblue-mean numeric
@attribute rawgreen-mean numeric
@attribute exred-mean numeric
@attribute exblue-mean numeric
@attribute exgreen-mean numeric
@attribute value-mean numeric
@attribute saturation-mean numeric
@attribute hue-mean numeric
@attribute class { brickface, sky, foliage, cement, window, path, grass }
 
@data
144, 35, 9, 0, 0, 2.33333, 2.03306, 2.05556, 1.73098, 37.5926, 32.3333, 47.4444, 33, -15.7778, 29.5556, -13.7778, 47.4444, 0.319714,

    -ta [file]

Outputs the combined training set and classified set in the ARFF format.

In [28]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -ta | head -30

Trainset classification error is   2.73% of    1500 (kappa: 0.9676 )
 Testset classification error is   2.59% of     810 (kappa: 0.9692 )
@relation segment
 
@attribute region-centroid-col numeric
@attribute region-centroid-row numeric
@attribute region-pixel-count numeric
@attribute short-line-density-5 numeric
@attribute short-line-density-2 numeric
@attribute vedge-mean numeric
@attribute vegde-sd numeric
@attribute hedge-mean numeric
@attribute hedge-sd numeric
@attribute intensity-mean numeric
@attribute rawred-mean numeric
@attribute rawblue-mean numeric
@attribute rawgreen-mean numeric
@attribute exred-mean numeric
@attribute exblue-mean numeric
@attribute exgreen-mean numeric
@attribute value-mean numeric
@attribute saturation-mean numeric
@attribute hue-mean numeric
@attribute class { brickface, sky, foliage, cement, window, path, grass }
 
@data
38, 189, 9, 0, 0, 1, 0.222222, 6.22222, 33.3185, 29.0741, 26.3333, 35.2222, 25.6667, -8.22222, 18.4444, -10.2222, 35.2222, 0.271208,

# Forest Handling Options

    -fs prefix

Saves the forest for a future classification run. A forest is saved in multiple files, with a common prefix.

In [51]:
%%bash
./parf -t datasets/segment-challenge.arff -a datasets/segment-test.arff -fs test

Trainset classification error is   2.27% of    1500 (kappa: 0.9731 )
 Testset classification error is   2.96% of     810 (kappa: 0.9648 )


In [52]:
! ls test*

test.001.tree  test.022.tree  test.043.tree  test.064.tree  test.085.tree
test.002.tree  test.023.tree  test.044.tree  test.065.tree  test.086.tree
test.003.tree  test.024.tree  test.045.tree  test.066.tree  test.087.tree
test.004.tree  test.025.tree  test.046.tree  test.067.tree  test.088.tree
test.005.tree  test.026.tree  test.047.tree  test.068.tree  test.089.tree
test.006.tree  test.027.tree  test.048.tree  test.069.tree  test.090.tree
test.007.tree  test.028.tree  test.049.tree  test.070.tree  test.091.tree
test.008.tree  test.029.tree  test.050.tree  test.071.tree  test.092.tree
test.009.tree  test.030.tree  test.051.tree  test.072.tree  test.093.tree
test.010.tree  test.031.tree  test.052.tree  test.073.tree  test.094.tree
test.011.tree  test.032.tree  test.053.tree  test.074.tree  test.095.tree
test.012.tree  test.033.tree  test.054.tree  test.075.tree  test.096.tree
test.013.tree  test.034.tree  test.055.tree  test.076.tree  test.097.tree
test.014.tree  test.035.tree  test.056

    -fl prefix

Loads a forest for a classification run. When a forest is loaded, the forest growth options are unavailable, as are the training analysis options other than the proximity, outlier and scaling ones. Note that all the files generated by the `-fs` option must be present at the same location, denoted by the prefix.

In [53]:
%%bash
./parf -a datasets/segment-test.arff -fl test

 Testset classification error is   2.96% of     810 (kappa: 0.9648 )
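
Since `-fl` needs every file written by `-fs`, a quick pre-flight check can catch an incomplete copy before loading. The sketch below assumes the `prefix.NNN.tree` naming seen in the `ls test*` listing above. `check_forest_files` is a hypothetical helper, not part of PARF. It only looks for gaps in the numbering of the `.tree` files, so it cannot notice files missing from the end of the sequence or any other files that `-fs` may write.

```bash
# Hypothetical helper (not part of PARF): report gaps in the numbered
# <prefix>.NNN.tree files, assuming the naming shown by `ls test*` above.
check_forest_files() {
    local prefix=$1 expected=1 missing=0 f num
    # Globs expand in sorted order, so zero-padded numbers come out ascending.
    for f in "$prefix".[0-9][0-9][0-9].tree; do
        [ -e "$f" ] || { echo "no tree files for prefix '$prefix'"; return 1; }
        num=${f%.tree}        # strip the extension
        num=${num##*.}        # keep only the NNN part
        num=$((10#$num))      # force base 10 so 008 is not read as octal
        while [ "$expected" -lt "$num" ]; do
            printf 'missing: %s.%03d.tree\n' "$prefix" "$expected"
            missing=$((missing + 1))
            expected=$((expected + 1))
        done
        expected=$((num + 1))
    done
    echo "found $((expected - 1 - missing)) tree files, $missing missing"
    [ "$missing" -eq 0 ]
}

# Demo on dummy files in a scratch directory (no PARF run needed).
demo_dir=$(mktemp -d)
touch "$demo_dir/test.001.tree" "$demo_dir/test.002.tree" "$demo_dir/test.004.tree"
check_forest_files "$demo_dir/test" || echo "forest is incomplete"
```

The function returns a non-zero status when a gap is found, so it can guard a later call such as `check_forest_files test && ./parf -a datasets/segment-test.arff -fl test`.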


    -fd [file]

Dumps the forest in a human-readable format.

In [56]:
%%bash
./parf -a datasets/segment-test.arff -fl test -fd output.txt
head -n 15 output.txt

 Testset classification error is   2.96% of     810 (kappa: 0.9648 )
R (1500): hue-mean is 0.6435 or less?
  Y (1293): region-centroid-row is 160.5 or less?
    Y (1047): hedge-mean is 0.4444 or less?
      Y (174): exred-mean is 1.0556 or less?
        Y (161): intensity-mean is 80.7409 or less?
          Y (151): exred-mean is -5.1667 or less?
            Y (38): value-mean is 50.4445 or less?
              Y (36): class is [window].
              N (2): class is [cement].
            N (113): exgreen-mean is -3.5 or less?
              Y (25): exred-mean is -3.9444 or less?
                Y (5): hue-mean is -2.0958 or less?
                  Y (1): class is [foliage].
                  N (4): class is [window].
                N (20): class is [window].
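
The dump is plain text, so standard tools can summarise it. The sketch below totals, per class, the number of leaves and the counts shown in parentheses, which appear to be the training cases reaching each node (the root shows 1500, the training set size). It assumes leaf lines look like `Y (36): class is [window].` as in the output above and that class labels contain no spaces. `summarise_leaves` is a hypothetical helper, not part of PARF.

```bash
# Hypothetical helper (not part of PARF): per class, count the leaves of a
# dumped tree and add up the numbers shown in parentheses on those leaves.
summarise_leaves() {
    awk '/class is \[/ {
             n = $2; gsub(/[():]/, "", n)      # "(36):" -> 36
             label = $NF                       # "[window]."
             sub(/^\[/, "", label)             # drop the opening bracket
             sub(/\]\.$/, "", label)           # drop the closing "]."
             leaves[label]++; cases[label] += n
         }
         END { for (c in leaves)
                   printf "%s %d leaves %d cases\n", c, leaves[c], cases[c] }' "$@" | sort
}

# Demo on a fragment copied from the dump above.
summarise_leaves <<'EOF'
            Y (38): value-mean is 50.4445 or less?
              Y (36): class is [window].
              N (2): class is [cement].
            N (113): exgreen-mean is -3.5 or less?
              Y (25): exred-mean is -3.9444 or less?
                Y (5): hue-mean is -2.0958 or less?
                  Y (1): class is [foliage].
                  N (4): class is [window].
                N (20): class is [window].
EOF
```

For the fragment above this reports one `cement` leaf with 2 cases, one `foliage` leaf with 1 case, and three `window` leaves with 60 cases. The function also accepts file names, for example `summarise_leaves output.txt`.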


# Options for gnuplot

    -g [file]

Generates a gnuplot script. The output of several other options is modified for compatibility with the gnuplot data file format (notably `-i`, `-if`, `-ic`, `-ii` and `-s`). The resulting script can be given as an argument to gnuplot, or loaded from gnuplot's command line with its `load` command. If the terminal (see `-gt`) is set to generate files (e.g. `-gt png`), the easiest way is to pipe the script into gnuplot. The image files will have the same names as the respective data files, with an appropriate extension; if the data was not saved into files, the images will have default names.

Note: if any 3D plots are included, do not use the piping method. gnuplot 4.0 (the current version at the time of development) has a bug that prevents it from correctly parsing 3D data from standard input.
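
The choice between piping the script and passing it as an argument can be sketched as a small dry-run helper. `plot_command` is hypothetical and not part of PARF or gnuplot. It only prints the command it would run, and it assumes that 3D plots show up in the generated script as `splot` commands (gnuplot's 3D plotting command); the script and data file names are made up for the demo.

```bash
# Hypothetical dry-run helper (not part of PARF or gnuplot): print how a
# generated script should be handed to gnuplot.
plot_command() {
    local script=$1
    if grep -Eq '^[[:space:]]*splot[[:space:]]' "$script"; then
        # 3D plot found: pass the file as an argument instead of piping.
        echo "gnuplot $script"
    else
        # No 3D plot found: piping through standard input is fine.
        echo "gnuplot < $script"
    fi
}

# Demo on two made-up scripts in a scratch directory.
demo_dir=$(mktemp -d)
printf "plot 'importance.dat' with lines\n" > "$demo_dir/flat.gp"
printf "splot 'scaling.dat'\n" > "$demo_dir/cube.gp"
plot_command "$demo_dir/flat.gp"
plot_command "$demo_dir/cube.gp"
```

Passing the script as an argument works in both cases, so the piping branch is only a convenience for file-generating terminals.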

    -gt terminal

Specifies a terminal for gnuplot output. See gnuplot's `set terminal` command for details. If no terminal is selected, multi-window x11 is used, with a pause at the end.

More examples of gnuplot use can be found in [GnuPlotUsage.md](https://github.com/efurlanm/ml/blob/master/docs/GnuPlotUsage.md).

# MPI

According to https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html , it is necessary to include `export UCX_TLS=rc,ud,sm,self` before calling `mpirun`:

In [7]:
%%bash
#export UCX_TLS=ud,sm,self
export UCX_TLS=rc,ud,sm,self
mpirun -n 6 ./parf --verbose -t datasets/glass.arff

Seed:   -132228645
Loading and distributing training set
Number of training cases:    214
Number of attributes:         10
Counting classes
Number of used attributes:     9
Attributes to split on:        3
Sorting and ranking
Growing forest
        Tree #     1 on     0
        Tree #    18 on     1
        Tree #    34 on     2
        Tree #    84 on     5
        Tree #    51 on     3
        Tree #    68 on     4
        Tree #     2 on     0
        Tree #    35 on     2
        Tree #    19 on     1
        Tree #    52 on     3
        Tree #    85 on     5
        Tree #    69 on     4
        Tree #     3 on     0
        Tree #    36 on     2
        Tree #    20 on     1
        Tree #    53 on     3
        Tree #    86 on     5
        Tree #    70 on     4
        Tree #     4 on     0
        Tree #    21 on     1
        Tree #    37 on     2
        Tree #    87 on     5
        Tree #    54 on     3
        Tree #    71 on     4
        Tree #     5 on     0
        T