# Frequent Subtree Counting in Random Forests

Similar to the notebook 

    Initial Rooted Frequent Subtree Mining (without embedding computation).ipynb
    
I will run the same mining and evaluation process on the data, but include the split value in the labels,
i.e., graph vertices are now labeled
    
    'ID<NUM'
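
As a minimal sketch of this labeling scheme, assume that `ID` stands for the index of the split feature and `NUM` for the split value. The helper name `make_label` and the sample values are hypothetical; the actual conversion is done by `json2graphNoLeafEdgesWithSplitValues.py`, whose internals are not shown in this notebook.

```shell
# Hypothetical helper: build a vertex label of the form ID<NUM
# from a feature index and a split value (both assumed inputs).
make_label() {
    printf '%s<%s\n' "$1" "$2"
}

make_label 12 0.5
```

With labels of this form, two vertices match only if both the feature and the split value agree, so the mined patterns are more specific than with feature-only labels.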
    

## Datasets
Several datasets are available.
At the moment, however, the active cells below process only 'spambase' and 'satlog'; the run on 'adult' and 'wine-quality' is kept as a commented-out cell.

## Find Frequent Rooted Trees

Let's see how many rooted frequent trees we can find in the random forests.

In [1]:
%%bash
# -p makes the cell idempotent: no error if the directory already exists
for dataset in spambase satlog; do
    for variant in NoLeafEdgesWithSplitValues; do
        mkdir -p "forests/${dataset}/${variant}/"
    done
done

In [2]:
%%bash
# convert each JSON forest into a .graph file for the mining step
for dataset in spambase satlog; do
    for f in forests/${dataset}/text/*.json; do
        name=$(basename "${f}" .json)
        echo "${f} -> ${name}.graph"
        python json2graphNoLeafEdgesWithSplitValues.py "${f}" > "forests/${dataset}/NoLeafEdgesWithSplitValues/${name}.graph"
    done
done

forests/spambase/text/RF_10.json -> RF_10.graph
forests/spambase/text/RF_10_pruned_with_sigma_0_0.json -> RF_10_pruned_with_sigma_0_0.graph
forests/spambase/text/RF_10_pruned_with_sigma_0_1.json -> RF_10_pruned_with_sigma_0_1.graph
forests/spambase/text/RF_10_pruned_with_sigma_0_2.json -> RF_10_pruned_with_sigma_0_2.graph
forests/spambase/text/RF_10_pruned_with_sigma_0_3.json -> RF_10_pruned_with_sigma_0_3.graph
forests/spambase/text/RF_15.json -> RF_15.graph
forests/spambase/text/RF_15_pruned_with_sigma_0_0.json -> RF_15_pruned_with_sigma_0_0.graph
forests/spambase/text/RF_15_pruned_with_sigma_0_1.json -> RF_15_pruned_with_sigma_0_1.graph
forests/spambase/text/RF_15_pruned_with_sigma_0_2.json -> RF_15_pruned_with_sigma_0_2.graph
forests/spambase/text/RF_15_pruned_with_sigma_0_3.json -> RF_15_pruned_with_sigma_0_3.graph
forests/spambase/text/RF_20.json -> RF_20.graph
forests/spambase/text/RF_20_pruned_with_sigma_0_0.json -> RF_20_pruned_with_sigma_0_0.graph
forests/spambase/text/RF_20_

In [3]:
%%bash
# create output directories
# (-p creates missing parents and does not fail if a directory already exists)
for dataset in spambase satlog; do
    for variant in NoLeafEdgesWithSplitValues; do
        mkdir -p "forests/rootedFrequentTrees/${dataset}/${variant}/"
    done
done


In [2]:
%%bash
./lwgr -h

This is a frequent rooted subtree mining tool.
Implemented by Pascal Welke starting in 2018.

This program computes and outputs frequent *rooted* subtrees and feature
representations of the mined graphs. The database is expected to contain
tree transactions that are interpreted as being rooted at the first
vertex.

usage: ./lwg [options] [FILE]

If no FILE argument is given or FILE is - the program reads from stdin.
It always prints to stdout (unless specified by parameters) and 
stderr (statistics).


Options:
-h:           print this possibly helpful information.

-t THRESHOLD: Minimum absolute support threshold in the graph database

-p SIZE:      Maximum size (number of vertices) of patterns returned

-o FILE:      output the frequent subtrees in this file

-f FILE:      output the feature information in this file

-i VALUE:     Some embedding operators require a parameter that might be
              a float between 0.0 and 1.0 or an integer >=1, depending 
              on the ope

In [4]:
%%bash
# write one mining command per input graph file to todolist.txt
rm -f todolist.txt   # -f: no error if the list does not exist yet
for dataset in spambase satlog; do
    for variant in NoLeafEdgesWithSplitValues; do
        for f in forests/${dataset}/${variant}/*.graph; do
            for threshold in 2; do
                out="forests/rootedFrequentTrees/${dataset}/${variant}/$(basename "${f}" .graph)_t${threshold}"
                echo "./lwgr -e rootedTrees -m bfs -t ${threshold} -p 10 -o ${out}.patterns < ${f} > ${out}.features 2> ${out}.logs" >> todolist.txt
            done
        done
    done
done

In [1]:
%%bash
cat todolist.txt | parallel -j 24
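
The older run recorded further down shows some mining jobs ending with `Killed`, so it can help to check afterwards which jobs left a missing or empty `.patterns` file. The following is only a sketch: the helper name `list_incomplete` is hypothetical, and it assumes that a successful run writes a non-empty patterns file, which this notebook does not confirm (a run that finds no frequent pattern might legitimately produce an empty file).

```shell
# Hypothetical helper: print every expected .patterns file that is missing or empty.
# Arguments: input directory with .graph files, output directory, threshold.
list_incomplete() {
    in_dir=$1
    out_dir=$2
    threshold=$3
    for f in "${in_dir}"/*.graph; do
        [ -e "${f}" ] || continue            # glob matched nothing
        out="${out_dir}/$(basename "${f}" .graph)_t${threshold}.patterns"
        # -s is true only if the file exists and has a size greater than zero
        [ -s "${out}" ] || echo "${out}"
    done
}

# usage with the directory layout of this notebook
list_incomplete forests/spambase/NoLeafEdgesWithSplitValues \
    forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues 2
```

Any path printed by the call points to a job that may need to be rerun, possibly with fewer parallel jobs than `-j 24`.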

In [1]:
%%bash
#for dataset in adult wine-quality; do
#    for variant in NoLeafEdgesWithSplitValues; do
#        for f in forests/${dataset}/${variant}/*_20.graph; do
#            threshold=2
#            echo "processing threshold ${threshold} for ${f}"
#            ./lwgr -e rootedTrees -m bfs -t ${threshold} -p 10 \
#              -o forests/rootedFrequentTrees/${dataset}/${variant}/`basename ${f} .graph`_t${threshold}.patterns \
#              < ${f} \
#              > forests/rootedFrequentTrees/${dataset}/${variant}/`basename ${f} .graph`_t${threshold}.features \
#              2> forests/rootedFrequentTrees/${dataset}/${variant}/`basename ${f} .graph`_t${threshold}.logs       
#        done
#    done
#done

processing threshold 2 for forests/adult/WithLeafEdgesWithSplitValues/DT_20.graph
processing threshold 2 for forests/adult/WithLeafEdgesWithSplitValues/ET_20.graph
Killed
processing threshold 2 for forests/adult/WithLeafEdgesWithSplitValues/RF_20.graph
Killed
processing threshold 2 for forests/adult/NoLeafEdgesWithSplitValues/DT_20.graph
processing threshold 2 for forests/adult/NoLeafEdgesWithSplitValues/ET_20.graph
processing threshold 2 for forests/adult/NoLeafEdgesWithSplitValues/RF_20.graph
processing threshold 2 for forests/wine-quality/WithLeafEdgesWithSplitValues/DT_20.graph
processing threshold 2 for forests/wine-quality/WithLeafEdgesWithSplitValues/ET_20.graph
processing threshold 2 for forests/wine-quality/WithLeafEdgesWithSplitValues/RF_20.graph
Killed
processing threshold 2 for forests/wine-quality/NoLeafEdgesWithSplitValues/DT_20.graph
processing threshold 2 for forests/wine-quality/NoLeafEdgesWithSplitValues/ET_20.graph
processing threshold 2 for forests/wine-quality/NoLe

In [2]:
%%bash
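# Derive the pattern files for thresholds 3..25 from the threshold-2 result.
# Assumes each pattern line starts with its support count followed by a tab:
# dropping the lines with support t-1 from the file for t-1 yields the file for t.
for dataset in spambase satlog; do
    for variant in NoLeafEdgesWithSplitValues; do
        for f in forests/rootedFrequentTrees/${dataset}/${variant}/*_t2.patterns; do
            prefix="${f%_t2.patterns}_t"
            echo "Processing $f"
            for ((t=3; t<=25; t++)); do
                prev=$(( t - 1 ))
                grep -P -v "^${prev}\t" "${prefix}${prev}.patterns" > "${prefix}${t}.patterns"
            done
        done
    done
done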


Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/DT_10_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/DT_15_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/DT_1_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/DT_20_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/DT_5_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/RF_10_pruned_with_sigma_0_0_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/RF_10_pruned_with_sigma_0_1_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/RF_10_pruned_with_sigma_0_2_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/RF_10_pruned_with_sigma_0_3_t2.patterns
Processing forests/rootedFrequentTrees/spambase/NoLeafEdgesWithSplitValues/RF
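
To make the filtering step concrete, here is a small sketch on a toy file. It assumes, as the `grep` pattern in the cell above suggests, that every line of a `.patterns` file starts with the support count of the pattern followed by a tab. The rest of each toy line is a made-up placeholder, not real `lwgr` output.

```shell
# Toy demonstration in a temporary directory; file contents are invented.
tmp=$(mktemp -d)
tab=$(printf '\t')

# assumed line format: SUPPORT<TAB>PATTERN
printf '2\tpatternA\n3\tpatternB\n2\tpatternC\n5\tpatternD\n' > "${tmp}/toy_t2.patterns"

# the file for threshold t is the file for t-1 without the lines of support t-1
t=3
while [ "${t}" -le 5 ]; do
    prev=$(( t - 1 ))
    # grep exits with status 1 when nothing is left; that is not an error here
    grep -v "^${prev}${tab}" "${tmp}/toy_t${prev}.patterns" > "${tmp}/toy_t${t}.patterns" || true
    t=$(( t + 1 ))
done

cat "${tmp}/toy_t3.patterns"       # patternB and patternD remain
wc -l < "${tmp}/toy_t5.patterns"   # only patternD is left
```

The chain works because the file for threshold t-1 contains only patterns with support at least t-1, so removing the lines with support exactly t-1 leaves the patterns with support at least t. Anchoring the match on the trailing tab keeps a support of 2 from also matching 20 or 25.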

### Next Steps

The results of this mining process are plotted in the Python 3 notebook 'Results for Frequent Rooted Subtrees - With Split Values in Labels.ipynb'.