Sequences of Sets

This code and data repository accompanies the paper

  • Sequences of Sets. Austin R. Benson, Ravi Kumar, and Andrew Tomkins. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '18), 2018.

All of the code is written in Julia 1.0.

For questions, please email Austin at arb@cs.cornell.edu.

Data

The datasets are in the data/ directory. The file data/dataset-seqs.txt is a list of sequences. Each line of the file has the following form (a parsing sketch is shown below):

size1,size2,…,sizeN;elmt1,elmt2,…,elmtM

  • size1,size2,…,sizeN are the number of elements in the N sets in the sequence.
  • elmt1,elmt2,…,elmtM are the M elements (given as integer identifiers) in the N sets in order. The first size1 elements are in the first set, the next size2 elements are in the second set, and so on. For each sequence, size1 + size2 + … + sizeN = M.
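The following minimal Julia sketch (our illustration, not code from the repository) parses one such line into a vector of sets:

# Sketch: parse one line of data/dataset-seqs.txt into a vector of integer sets.
function parse_sequence(line::AbstractString)
    sizes_str, elmts_str = split(strip(line), ';')
    sizes = parse.(Int, split(sizes_str, ','))
    elmts = parse.(Int, split(elmts_str, ','))
    @assert sum(sizes) == length(elmts)  # size1 + size2 + … + sizeN = M
    sets, start = Vector{Set{Int}}(), 1
    for sz in sizes
        push!(sets, Set(elmts[start:(start + sz - 1)]))
        start += sz
    end
    return sets
end

For example, parse_sequence("2,1;5,7,5") returns [Set([5, 7]), Set([5])].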

The files email-Enron-core-element-labels.txt, tags-math-sx-element-labels.txt, and tags-mathoverflow-element-labels.txt contain labels for the elements.

bash-3.2$ head -5 email-Enron-core-element-labels.txt 
1 phillip.allen@enron.com
2 john.arnold@enron.com
3 harry.arora@enron.com
4 robert.badeer@enron.com
5 susan.bailey@enron.com
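
One simple way to read a label file into a lookup table in Julia (our sketch; the path assumes the labels sit in the data/ directory):

# Sketch: read an element-label file into a Dict mapping ID -> label.
labels = Dict{Int,String}()
for line in eachline("data/email-Enron-core-element-labels.txt")
    id_str, label = split(line, limit=2)
    labels[parse(Int, id_str)] = String(label)
end
labels[1]  # "phillip.allen@enron.com"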

Setup

First, download the repository:

git clone https://github.com/arbenson/FGDnPVC
cd FGDnPVC

The code uses Julia 1.0. To re-run all of the experiments, you need the following packages:

using Pkg
Pkg.add("Combinatorics")
Pkg.add("DataStructures")
Pkg.add("FileIO")
Pkg.add("JLD2")
Pkg.add("PyPlot")

Learning models

Correlated Repeated Unions (CRU) model

We first show how to learn the Correlated Repeated Unions (CRU) model. The learning has some built-in parallelism, which you can use by setting the JULIA_NUM_THREADS environment variable.
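For example, to use four threads (an illustrative count; the variable must be set before Julia starts):

# In a shell, before launching Julia:
#   export JULIA_NUM_THREADS=4
# Inside Julia, verify the available thread count:
using Base.Threads
println("running with $(nthreads()) threads")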

include("learn_CRU_model.jl")
dataset = "email-Enron-core"  # data file at data/$dataset.txt
p = 0.9                       # correlation probability
# Learning takes several minutes
learn(dataset, p)  # --> model saved to models/$dataset-CRU-$p.jld2

All of the learned CRU models used in the paper are pre-computed and saved in the models/ directory.
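
To inspect a saved model, you can load the .jld2 file with FileIO, which returns a dictionary of the stored variables (the exact keys depend on what learn saves):

using FileIO  # JLD2 provides the .jld2 backend
model = load("models/email-Enron-core-CRU-0.9.jld2")
println(keys(model))  # names of the stored variables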

Flattened model

We use a "flattened" model as the baseline.

include("learn_flattened_model.jl")
dataset = "email-Enron-core"  # data file at data/$dataset.txt
# Learning takes several minutes
learn(dataset)  # --> model saved to models/$dataset-flattened.jld2

Reproduce the figures and tables in the paper

Figure 1: Distribution of set sizes.

include("paper_figures.jl")
set_size_dist_fig()  # --> set_size_dist.pdf

Figure 2: Repeat behavior in the datasets.

include("paper_figures.jl")
# The following takes a minute or so
repeat_behavior_fig()  # --> repeat_behavior.pdf

Figure 3: Distribution of the number of repeats in sets containing at least one repeat.

include("paper_figures.jl")
num_repeats_dist_fig()  # --> num_repeats_dist.pdf

Figure 4: Evidence of recency bias in set selection.

include("paper_figures.jl")
# The following takes a minute or so
recency_bias_fig()  # --> recency_bias.pdf

Figure 5: Likelihoods.

include("paper_figures.jl")
for row in dataset_info()
    dataset = row[1]
    likelihoods_fig(dataset)  # --> $dataset-rel-likelihoods.pdf
end

Figure 6: Recency weights.

include("paper_figures.jl")
for row in dataset_info()
    dataset = row[1]
    recency_weights_fig(dataset)  # --> weights-$dataset.pdf
end

Table 1: Summary statistics of datasets.

include("paper_tables.jl")
for row in dataset_info()
    summary_stats(row[1])
end

Table 2: Subset correlations.

include("paper_tables.jl")
# This takes several minutes
for row in dataset_info()
    correlation_behavior(row[1])
end