This project includes all the data used to test the performance of the closed contiguous sequential pattern mining algorithm

chuanchuan526/CCSMP-algorithm

This project describes three datasets used for closed contiguous sequential pattern mining and explains how to obtain them.

dataset #1: Kosarak
Kosarak is a very large dataset containing 990 000 sequences of click-stream data from a Hungarian news portal. The original data comes from http://fimi.ua.ac.be/data/. It has been preprocessed for mining closed contiguous sequential patterns. Here is an example from the processed dataset:

69 -1 11 -1 1 -1 6 -1 -2  
11 -1 70 -1 6 -1 -2  
6 -1 3 -1 71 -1 -2  

This example contains three sequences. '-1' is the separator between items, and '-2' marks the end of each sequence. Each distinct item ID represents a distinct clicked item.
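
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the original project: the function name `parse_spmf_sequences` is hypothetical, and it assumes that every entry between separators is a single item, as in the example.

```python
from typing import Iterable, List


def parse_spmf_sequences(lines: Iterable[str]) -> List[List[int]]:
    """Turn lines such as '69 -1 11 -1 1 -1 6 -1 -2' into lists of item IDs.

    Assumption: each entry between '-1' separators holds exactly one item,
    as in the example shown in this README.
    """
    sequences = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        items = []
        for token in tokens:
            value = int(token)
            if value == -2:  # end-of-sequence marker
                break
            if value == -1:  # separator between items
                continue
            items.append(value)
        sequences.append(items)
    return sequences


example = [
    "69 -1 11 -1 1 -1 6 -1 -2",
    "11 -1 70 -1 6 -1 -2",
    "6 -1 3 -1 71 -1 -2",
]
sequences = parse_spmf_sequences(example)
print(sequences)  # [[69, 11, 1, 6], [11, 70, 6], [6, 3, 71]]
```

For the full file, the same function can be applied to an open file handle, which is iterated line by line without loading everything into memory first.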

The Kosarak dataset is currently available from the following locations:
https://www.philippe-fournier-viger.com/spmf/datasets/kosarak_sequences.txt
https://mypikpak.com/s/VNVEdqo2R4Npnse32eziL911o1
password (extraction code): dj4k

dataset #2: Protein
The biological dataset Protein is derived from data published by the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/). It was extracted using the conjunction of the following filters: (1) search category = "Protein", (2) species = "Bacteria", (3) release date = 2022/3/1 to 2022/5/15, and (4) organism = "Escherichia coli". The dataset file is in FASTA format. Because the dataset is very large, it is currently distributed through network disks. The links are as follows:
link #1: https://mypikpak.com/s/VNVEe4fDgd0QHay2oFQfmBDCo1
password (extraction code): nu8e
link #2: https://pan.baidu.com/s/1KPEVsKXKvF8x9X1EkFVFnA
password (extraction code): lszn
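
Since the Protein file is in FASTA format, each record is a header line starting with '>' followed by one or more lines of residues. The sketch below shows one way to read such a file using only the standard library. The function name `parse_fasta` and the two sample records are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
sample = [
    ">example_1 hypothetical protein",
    "MKTAYIAK",
    "QRQISFVK",
    ">example_2 hypothetical protein",
    "MSLNE",
]
proteins = parse_fasta(sample)
print(proteins)
```

Each protein sequence can then be treated as one input sequence whose items are the individual amino-acid letters.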

dataset #3: CDtaxi
CDtaxi contains trajectory data for more than 10,000 taxis in Chengdu, covering a period of at most 25 days. The trajectories contain more than 1 billion GPS records. The data comes from a data analysis competition called the "Smart China Cup" (智慧中国杯, https://www.pkbigdata.com/common/zhzgbCmptDetails.html). Here is an example from the dataset:

1,30.593305,104.067773,1,2014/8/30 19:59:40
1,30.598553,104.067605,1,2014/8/30 19:59:10
1,30.601324,104.124101,1,2014/8/30 13:53:36
1,30.601326,104.124119,1,2014/8/30 13:54:06
1,30.601589,104.123913,1,2014/8/30 13:53:06

The columns are, in order: taxi ID, latitude, longitude, passenger status, and timestamp.

This dataset must first be preprocessed with a road network matching algorithm to obtain trajectory sequences before it can be used for closed contiguous sequential pattern mining. During road network matching, only four columns (1, 3, 4, and 5) are used.
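
Before any road network matching, the raw rows have to be parsed and put in time order for each taxi (the sample rows above are not sorted by time). The following sketch covers only that preparatory step, not the matching algorithm itself. The names `GpsRecord`, `parse_cdtaxi`, and `group_by_taxi` are hypothetical, and the timestamp format is inferred from the sample rows.

```python
import csv
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, NamedTuple


class GpsRecord(NamedTuple):
    taxi_id: str
    latitude: float
    longitude: float
    passenger_status: int
    timestamp: datetime


def parse_cdtaxi(lines: Iterable[str]) -> List[GpsRecord]:
    """Parse rows of the form 'id,lat,lon,status,YYYY/M/D HH:MM:SS'."""
    records = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        taxi_id, lat, lon, status, ts = row
        records.append(GpsRecord(
            taxi_id,
            float(lat),
            float(lon),
            int(status),
            datetime.strptime(ts, "%Y/%m/%d %H:%M:%S"),
        ))
    return records


def group_by_taxi(records: Iterable[GpsRecord]) -> Dict[str, List[GpsRecord]]:
    """Group records by taxi ID and sort each group by timestamp."""
    tracks = defaultdict(list)
    for record in records:
        tracks[record.taxi_id].append(record)
    for track in tracks.values():
        track.sort(key=lambda r: r.timestamp)
    return dict(tracks)


rows = [
    "1,30.593305,104.067773,1,2014/8/30 19:59:40",
    "1,30.598553,104.067605,1,2014/8/30 19:59:10",
    "1,30.601324,104.124101,1,2014/8/30 13:53:36",
    "1,30.601326,104.124119,1,2014/8/30 13:54:06",
    "1,30.601589,104.123913,1,2014/8/30 13:53:06",
]
tracks = group_by_taxi(parse_cdtaxi(rows))
for record in tracks["1"]:
    print(record.timestamp, record.latitude, record.longitude)
```

The time-ordered track for each taxi is what a road network matching step would take as input. For the full dataset, rows would need to be processed in a streaming or chunked fashion rather than held in memory at once.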

Since this dataset is also very large, it is currently distributed through network disks. The links are as follows:
link #1: https://pan.baidu.com/s/1phfGPIn6uul4jPt8BfBRwg
password (extraction code): a2dh
link #2: https://mypikpak.com/s/VNVFb39Sgd0QHay2oFQfzFcyo1
password (extraction code): 6gnn
link #3: https://pan.baidu.com/share/init?surl=o84gtPS
password (extraction code): meq5
