Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: Apache-2.0


# Instructions for dataset download 

In [None]:
import pathlib
import requests

## Use case 1: User Behavior 

The Reddit dataset is sourced from [Pushshift](https://files.pushshift.io/reddit/comments/). The downloaded raw data file should be placed under `../data/01_raw/user_behavior`.

The following cells use the `requests` library to download one month of Reddit comments (May 2008) and place the file in `../data/01_raw/user_behavior`.

In [None]:
URL = "https://files.pushshift.io/reddit/comments/RC_2008-05.zst"
reddit_raw_data_file_path = '../data/01_raw/user_behavior/RC_2008-05.zst'

In [None]:
reddit_raw_data_file_path_abs = pathlib.Path(reddit_raw_data_file_path).resolve()
# Create the target directory (and any missing parents) before downloading.
reddit_raw_data_file_dir = pathlib.Path(reddit_raw_data_file_path).parent
reddit_raw_data_file_dir.mkdir(parents=True, exist_ok=True)

In [None]:
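# Stream to disk in chunks and fail early on an HTTP error.
with requests.get(URL, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open(reddit_raw_data_file_path, "wb") as fp:
        for chunk in response.iter_content(chunk_size=1 << 20):
            fp.write(chunk)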
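
Before decompressing, it can help to confirm that the downloaded file really is a Zstandard archive and not, for example, an HTML error page. The sketch below is not part of the original notebook; the helper name `looks_like_zstd` is hypothetical, and it only checks the magic number of a regular Zstandard frame, not the integrity of the whole archive.

```python
# Zstandard frames start with the magic number 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = bytes([0x28, 0xB5, 0x2F, 0xFD])


def looks_like_zstd(path):
    """Return True if the file at `path` starts with the Zstandard magic bytes."""
    with open(path, "rb") as fp:
        return fp.read(len(ZSTD_MAGIC)) == ZSTD_MAGIC
```

Calling `looks_like_zstd(reddit_raw_data_file_path)` after the download should return `True` if the archive was saved correctly.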

In [None]:
!unzstd ../data/01_raw/user_behavior/RC_2008-05.zst --memory=2048MB

In [None]:
!ls ../data/01_raw/user_behavior/

## Use case 2:  Telecom Network 

The WiFi dataset can be downloaded from the [SPAMHMM repository](https://github.com/dpernes/spamhmm/blob/master/README.md#datasets). 

The following cells use the `gdown` package to download the archive from Google Drive.

Alternatively, you can download `wifi_data.tar.gz` manually, place it under `../data/01_raw/wifi/`, and extract it with `tar -xzvf ../data/01_raw/wifi/wifi_data.tar.gz --directory ../data/01_raw/wifi/`.

In [None]:
!pip install gdown 

In [None]:
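import gdown

# Make sure the target directory exists before downloading the archive.
pathlib.Path("../data/01_raw/wifi").mkdir(parents=True, exist_ok=True)

url = "https://drive.google.com/uc?id=1IyK8lWvV9bDQ43ZT6a51lB9iPT9EtXt8"
output = "../data/01_raw/wifi/wifi_data.tar.gz"
gdown.download(url, output, quiet=False)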

In [None]:
!tar -xzf ../data/01_raw/wifi/wifi_data.tar.gz --directory ../data/01_raw/wifi/
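
If the `tar` command is not available (for example on some Windows setups), the same extraction can be sketched with Python's standard `tarfile` module. The helper name `extract_tar_gz` is hypothetical and not part of the original notebook; like any archive extraction, it should only be used on archives from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into `target_dir`.

    Returns the list of member names found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target)
    return names
```

For this use case the call would be `extract_tar_gz("../data/01_raw/wifi/wifi_data.tar.gz", "../data/01_raw/wifi/")`.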

In [None]:
!ls ../data/01_raw/wifi

## Use case 3:  Financial Fraud

The financial fraud dataset can be downloaded from [Kaggle](https://www.kaggle.com/datasets/ealaxi/banksim1). The following cells use the Kaggle API to download the dataset.

To download the dataset, please follow the instructions in [API Credentials](https://github.com/Kaggle/kaggle-api#api-credentials) from the Kaggle API.
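
A quick way to confirm that credentials are in place before calling the API is sketched below. It assumes the two mechanisms described in the Kaggle API README: a `kaggle.json` token file under `~/.kaggle/`, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper name `kaggle_credentials_available` and its parameters are hypothetical, and the check does not validate the credentials themselves.

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None, environ=None):
    """Return True if Kaggle API credentials appear to be configured."""
    environ = os.environ if environ is None else environ
    # Option 1: credentials supplied through environment variables.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the configuration directory
    # (assumed to be ~/.kaggle unless another directory is passed in).
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    return (config_dir / "kaggle.json").is_file()
```

If `kaggle_credentials_available()` returns `False`, the download cell is likely to fail with an authentication error.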

As an alternative to the API, you can download the two CSV files (`bs140513_032310.csv` and `bsNET140513_032310.csv`) manually and place them under `anomaly-detection-spatial-temporal-data/data/01_raw/financial_fraud`.

In [None]:
pathlib.Path("../data/01_raw/financial_fraud").mkdir(parents=True, exist_ok=True)

In [None]:
!pip install kaggle --user

In [None]:
# make sure you have kaggle.json under ~/.kaggle/
!kaggle datasets download -d ealaxi/banksim1

In [None]:
!mv banksim1.zip ../data/01_raw/financial_fraud/banksim1.zip

In [None]:
!unzip ../data/01_raw/financial_fraud/banksim1.zip  -d ../data/01_raw/financial_fraud/

In [None]:
!ls ../data/01_raw/financial_fraud/
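
Instead of reading the `ls` output by eye, the presence of the two CSV files can be checked programmatically. This is a minimal sketch, assuming the archive contains exactly the file names given in the manual-download instructions; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the manual-download instructions for this use case.
EXPECTED_FRAUD_FILES = ["bs140513_032310.csv", "bsNET140513_032310.csv"]


def missing_files(directory, expected_names):
    """Return the expected file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in expected_names if not (directory / name).is_file()]
```

An empty list from `missing_files("../data/01_raw/financial_fraud", EXPECTED_FRAUD_FILES)` means both files are in place. The same helper can be reused for the other use cases by passing a different list of names.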

## Use case 4:  IoT Network 

The dataset is available on the BATADAL [website](http://www.batadal.net/data.html). The following cells download training dataset 1, training dataset 2, and the test dataset as `BATADAL_dataset03_train_no_anomaly.csv`, `BATADAL_dataset04_train_some_anomaly.csv`, and `BATADAL_test_dataset_some_anomaly.csv`, respectively, and save them in `../data/01_raw/iot`.


In [None]:
pathlib.Path("../data/01_raw/iot").mkdir(parents=True, exist_ok=True)

In [None]:
!wget http://www.batadal.net/data/BATADAL_dataset03.csv -O ../data/01_raw/iot/BATADAL_dataset03_train_no_anomaly.csv
!wget http://www.batadal.net/data/BATADAL_dataset04.csv -O ../data/01_raw/iot/BATADAL_dataset04_train_some_anomaly.csv
!wget http://www.batadal.net/data/BATADAL_test_dataset.zip -O ../data/01_raw/iot/BATADAL_test_dataset.zip
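
On systems where `wget` is not installed, the same downloads can be sketched with Python's standard library. The helper name `download_file` is hypothetical and not part of the original notebook; it performs no retries or checksum verification.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Download `url` to `destination`, creating parent directories as needed."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response body to disk in chunks rather than reading it all at once.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as fp:
        shutil.copyfileobj(response, fp)
    return destination
```

For example, `download_file("http://www.batadal.net/data/BATADAL_dataset03.csv", "../data/01_raw/iot/BATADAL_dataset03_train_no_anomaly.csv")` mirrors the first `wget` command above.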


In [None]:
!unzip ../data/01_raw/iot/BATADAL_test_dataset.zip  -d ../data/01_raw/iot/

In [None]:
!mv ../data/01_raw/iot/BATADAL_test_dataset.csv ../data/01_raw/iot/BATADAL_test_dataset_some_anomaly.csv

In [None]:
!ls ../data/01_raw/iot

# References

Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BANKSIM: A BANK PAYMENTS SIMULATOR FOR FRAUD DETECTION RESEARCH.

Riccardo Taormina, Stefano Galelli, Nils Ole Tippenhauer, Elad Salomons, Avi Ostfeld, Demetrios G. Eliades, Mohsen Aghashahi, Raanju Sundararajan, Mohsen Pourahmadi, M. Katherine Banks, B. M. Brentan, Enrique Campbell, G. Lima, D. Manzi, D. Ayala-Cabrera, M. Herrera, I. Montalvo, J. Izquierdo, E. Luvizotto, Sarin E. Chandy, Amin Rasekh, Zachary A. Barker, Bruce Campbell, M. Ehsan Shafiee, Marcio Giacomoni, Nikolaos Gatsis, Ahmad Taha, Ahmed A. Abokifa, Kelsey Haddad, Cynthia S. Lo, Pratim Biswas, M. Fayzul K. Pasha, Bijay Kc, Saravanakumar Lakshmanan Somasundaram, Mashor Housh, and Ziv Ohar. 2018. The Battle of the Attack Detection Algorithms: Disclosing Cyber Attacks on Water Distribution Networks. Journal of Water Resources Planning and Management 144, 8 (August 2018).

Anisa Allahdadi and Ricardo Morla. 2017. 802.11 Wireless Access Point Usage Simulation and Anomaly Detection. CoRR abs/1707.02933, (2017). Retrieved from http://arxiv.org/abs/1707.02933 

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit Dataset.