
# CSC 786 – Data Ethics & Reproducibility Workshop  

This notebook demonstrates a complete ethical, reproducible data-collection workflow:

- Ethical handling of APIs and environment variables  
- Data collection using both key-based and public APIs  
- Provenance logging and metadata documentation  
- Responsible data storage and reproducible version control  
- Pushing results to a GitHub repository  

All steps run directly in Google Colab.


# Setup Cell
Run once per session

In [1]:
%env GITHUB_TOKEN=

!git config --global user.name "Collin Brueggeman" ## Display name not necessarily your username
!git config --global user.email "clbruegg2002@gmail.com"

env: GITHUB_TOKEN=


# When you reopen Colab next time
You’ll simply clone your GitHub repo back into /content, instead of re-initializing a new one.

So, the reconnect workflow will look like this:

In [None]:
# 1. Clone your existing repo from GitHub
!git clone https://github.com/clbruegg/csc786-GNNresearch
%cd csc786-GNNresearch


# 2. Optional: verify remote
!git remote -v


# 3. If you make changes and want to push again
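# Replace URL with your repository's HTTPS URL before running this line.
# Left as the literal placeholder, `git push` fails with
# "fatal: 'URL' does not appear to be a git repository".
!git remote set-url origin URL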

!git add .
!git commit -m "Update from Colab session"
!git push


Cloning into 'csc786-GNNresearch'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 31 (delta 4), reused 31 (delta 4), pack-reused 0 (from 0)
Receiving objects: 100% (31/31), 8.42 MiB | 20.82 MiB/s, done.
Resolving deltas: 100% (4/4), done.
/content/csc786-GNNresearch
origin	https://github.com/clbruegg/csc786-GNNresearch (fetch)
origin	https://github.com/clbruegg/csc786-GNNresearch (push)
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
fatal: 'URL' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
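
The `fatal: 'URL' does not appear to be a git repository` message above appears because the remote was set to the literal placeholder `URL`. The following is a minimal sketch of how a real remote URL could be assembled from the `GITHUB_TOKEN` environment variable instead. The helper names `build_remote_url` and `set_origin` are hypothetical, and the token-in-URL form is one commonly used way to authenticate over HTTPS, not the only one. Note that the resulting URL is stored in the clone's `.git/config`, so it should never be printed or committed.

```python
import os
import subprocess


def build_remote_url(owner: str, repo: str, token: str) -> str:
    """Return an HTTPS remote URL that embeds a personal access token.

    Hypothetical helper for illustration; raises if the token is empty so
    that a push is never attempted with a blank credential.
    """
    if not token:
        raise ValueError("GITHUB_TOKEN is empty; set it before pushing")
    return f"https://{token}@github.com/{owner}/{repo}.git"


def set_origin(repo_dir: str, url: str) -> None:
    """Point the 'origin' remote of the clone in repo_dir at url."""
    subprocess.run(
        ["git", "-C", repo_dir, "remote", "set-url", "origin", url],
        check=True,  # raise CalledProcessError if git reports a failure
    )


# Example usage inside Colab (not executed here):
# url = build_remote_url("clbruegg", "csc786-GNNresearch",
#                        os.environ.get("GITHUB_TOKEN", ""))
# set_origin("/content/csc786-GNNresearch", url)
```

Failing early on an empty token gives a clearer message than the access-rights error that `git push` would otherwise report.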


In [None]:
# You can always check what's currently configured by:

!git config --global --list

## Colab-specific access details
Note: Everything under /content/ lives on a temporary Colab VM and is deleted when the session ends, so the cloned repository there is only a working copy until you push.
As you run the notebook:
1. It creates the folder /content/csc786-GNNresearch/data/ for your CSVs.
2. It appends provenance info to DATA_README.md in the repository root (/content/csc786-GNNresearch/).
3. You can add extra markdown files manually.
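
A provenance entry of the kind described in item 2 could be appended roughly as follows. This is a minimal sketch under stated assumptions: the helper names `sha256_of` and `log_provenance` are hypothetical, and the entry layout (source, UTC timestamp, size, SHA-256) is only one reasonable choice, not a format prescribed by this notebook.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_provenance(data_file: Path, source_url: str, readme: Path) -> str:
    """Append one markdown provenance entry for data_file to readme."""
    retrieved = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = (
        f"\n### {data_file.name}\n"
        f"- Source: {source_url}\n"
        f"- Retrieved (UTC): {retrieved}\n"
        f"- Size (bytes): {data_file.stat().st_size}\n"
        f"- SHA-256: {sha256_of(data_file)}\n"
    )
    # Mode "a" appends, so earlier entries are preserved across sessions.
    with open(readme, "a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


# Example usage (paths follow this notebook's layout; not executed here):
# root = Path("/content/csc786-GNNresearch")
# log_provenance(root / "data" / "unsw-sample.csv",
#                "https://research.unsw.edu.au/projects/unsw-nb15-dataset",
#                root / "DATA_README.md")
```

Recording a checksum alongside the source URL and date lets a later reader confirm that they are working from the same bytes.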

## Step 1 – Setup Environment

In [None]:
!pip install python-dotenv --quiet
import hashlib
import json
import os
import sys
import time

import pandas as pd
import requests
from datetime import datetime, timezone
from pathlib import Path

ROOT = Path("/content/csc786-GNNresearch") ## todo: may update repo name if needed
DATA = ROOT / "data"
DATA.mkdir(parents=True, exist_ok=True)
print("Environment ready. Files will be stored in:", DATA)


Environment ready. Files will be stored in: /content/csc786-GNNresearch/data



# Ethical Reminder

Before collecting any data:

- Check Terms of Service and rate limits.  
- Avoid collecting or storing personally identifiable information (PII).  
- Document every endpoint, parameter, and date of collection.  
- Keep secrets (API keys) out of public repositories.  
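
One way to keep a key out of the notebook file itself is to read it from the environment and, if it is missing, prompt for it at run time so that it never appears in a saved cell or its output. The sketch below assumes this approach; `load_secret` is a hypothetical helper name, and the `prompt` parameter exists only so the function can be exercised without an interactive terminal.

```python
import os
from getpass import getpass
from typing import Callable


def load_secret(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return the secret stored in environment variable `name`.

    If the variable is unset or blank, ask for it with a hidden prompt
    and cache it in os.environ for the rest of the session.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} was not provided")
        os.environ[name] = value
    return value


# Example usage in a notebook cell (not executed here):
# token = load_secret("GITHUB_TOKEN")
```

Because a blank value is treated the same as a missing one, this also handles the case where the variable was set with an empty `%env GITHUB_TOKEN=` line.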


## Downloading Preprocessed CIC-IDS2017 Dataset via Kaggle

The **CIC-IDS2017** dataset is hosted by the University of New Brunswick. It captures network traffic with both benign (normal) activity and various attack scenarios, making it suitable for developing and testing intrusion detection systems.

Official link: https://www.unb.ca/cic/datasets/ids-2017.html

This notebook uses a cleaned, preprocessed, and aggregated version distributed via Kaggle by Eric Anacleto Ribeiro.

Kaggle Link: https://www.kaggle.com/datasets/ericanacletoribeiro/cicids2017-cleaned-and-preprocessed

Download the zip file from the Kaggle page.

## Downloading UNSW Datasets via Official OneDrive

The **UNSW-NB15** and **TON_IoT** datasets are hosted by the
University of New South Wales (UNSW Canberra Cyber).

Due to data governance and file size limitations, these datasets must be
manually downloaded from their **official OneDrive links**.

### Official Download Links

- **UNSW-NB15 Dataset (Network Traffic + Features)**
  - Main site: https://research.unsw.edu.au/projects/unsw-nb15-dataset  
  - Direct OneDrive link: https://unsw-my.sharepoint.com/:f:/g/personal/z5025758_ad_unsw_edu_au/EnuQZZn3XuNBjgfcUu4DIVMBLCHyoLHqOswirpOQifr1ag?e=gKWkLS
  - Check the **CSV Files, Reports,** and **ReadMe.pdf** boxes and click download.

- **TON_IoT Dataset (IoT Telemetry + Network Logs)**
  - Main site: https://research.unsw.edu.au/projects/toniot-datasets  
  - Direct OneDrive link: https://unsw-my.sharepoint.com/:f:/g/personal/z5025758_ad_unsw_edu_au/EvBTaetotpdGnW7rJQ8fCvYBh8063CNeY9W33MpRsarJaQ?e=yZlnxW
  - Click all boxes **except Raw_datasets** and click download.

### Instructions for Reproducibility

1. Visit the URLs above and download the `.zip` or `.csv` files.
2. Extract the files into their respective folders: `CIC-IDS2017`, `TON_IoT`, `UNSW-NB15`.
3. After extracting the UNSW-NB15 download, rename its `CSV Files` folder to `CSV_Files`.  
4. Upload the three folders to your Google Drive under `/MyDrive/datasets/`.  
5. Mount Google Drive in Colab using the cell below.  
6. Verify that the files were acquired by saving a small sample of each dataset. Only the samples are pushed, never the full datasets.


In [None]:
# Mount Google Drive to access the three datasets
import os

import numpy as np
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

# Set dataset paths (modify according to your Drive folder structure)
cicids_path = "/content/drive/MyDrive/datasets/CIC-IDS2017/cicids2017_cleaned.csv"
unsw_path = "/content/drive/MyDrive/datasets/UNSW-NB15/CSV_Files/UNSW-NB15_1.csv"
toniot_path = "/content/drive/MyDrive/datasets/TON_IoT/Processed_datasets/Processed_Network_dataset/Network_dataset_1.csv"

# Verify that each file exists before trying to load it
assert os.path.exists(cicids_path), "CIC-IDS2017 file not found in Drive!"
assert os.path.exists(unsw_path), "UNSW-NB15 file not found in Drive!"
assert os.path.exists(toniot_path), "TON_IoT file not found in Drive!"
print("Datasets found in Google Drive.")

# Load into pandas; low_memory=False parses each file in one pass so that
# columns with mixed values get a single consistent dtype
cicids = pd.read_csv(cicids_path, low_memory=False)
unsw = pd.read_csv(unsw_path, low_memory=False)
toniot = pd.read_csv(toniot_path, low_memory=False)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Datasets found in Google Drive.


In [None]:
# Save a 100-row sample of each dataset (fixed seed so the sample is reproducible)
cicids.sample(n=100, random_state=42).to_csv("/content/csc786-GNNresearch/data/cic-ids-sample.csv", index=False)
unsw.sample(n=100, random_state=42).to_csv("/content/csc786-GNNresearch/data/unsw-sample.csv", index=False)
toniot.sample(n=100, random_state=42).to_csv("/content/csc786-GNNresearch/data/TON_IoT-sample.csv", index=False)

# Read a sample back and display it to confirm the data was acquired
sample_cic = pd.read_csv("/content/csc786-GNNresearch/data/cic-ids-sample.csv")
sample_cic.head()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Length of Fwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,...,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Max,Active Min,Idle Mean,Idle Max,Idle Min,Attack Type
0,53,30777,4,124,31,31,31.0,0.0,122,122,...,-1,3,20,0.0,0,0,0.0,0,0,Normal Traffic
1,443,2884,4,53,53,0,13.25,26.5,341,341,...,14472,1,32,0.0,0,0,0.0,0,0,Normal Traffic
2,80,73141122,8,56,20,0,7.0,5.656854,4380,0,...,229,6,20,7010.0,7010,7010,35800000.0,61500000,10100000,DDoS
3,443,115417660,32,845,330,0,26.40625,67.87363,4380,0,...,980,31,20,39035.818182,197822,22919,10004570.0,10024256,9999700,Normal Traffic
4,443,551378,13,1204,725,0,92.615385,204.750881,1756,0,...,123,6,32,0.0,0,0,0.0,0,0,Normal Traffic


In [None]:
sample_unsw = pd.read_csv("/content/csc786-GNNresearch/data/unsw-sample.csv")
sample_unsw.head()

Unnamed: 0,59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,...,0.17,3,7,1,3.1,1.1,1.2,1.3,Unnamed: 47,0.18
0,59.166.0.0,62613,149.171.126.6,15816,udp,CON,0.00178,520,304,31,...,0,5,5,4,4,1,1,2,,0
1,59.166.0.8,28607,149.171.126.6,53,udp,CON,0.000987,146,178,31,...,0,4,3,4,2,1,1,1,,0
2,59.166.0.5,44569,149.171.126.9,53,udp,CON,0.000993,146,178,31,...,0,1,4,2,1,1,1,1,,0
3,59.166.0.0,18652,149.171.126.3,39373,udp,CON,0.001792,528,304,31,...,0,3,8,5,3,1,1,3,,0
4,59.166.0.9,20935,149.171.126.7,53,udp,CON,0.000995,130,162,31,...,0,1,1,5,3,1,1,1,,0
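
In the `sample_unsw.head()` output above, the header row (`59.166.0.0`, `1390`, ...) is itself a flow record, and labels such as `3.1`, `1.1`, and `Unnamed: 47` are what pandas generates when it has to de-duplicate or fill in column names. This suggests that `UNSW-NB15_1.csv` has no header line, so `pd.read_csv` consumed its first record as column names. The sketch below reproduces the effect on two in-memory rows and shows the `header=None` fix. The five column names are an assumption used for illustration; the full list should be taken from the feature-description file that ships with the download.

```python
import io

import pandas as pd

# Two headerless records shaped like the first fields of UNSW-NB15_1.csv
raw = io.StringIO(
    "59.166.0.0,1390,149.171.126.6,53,udp\n"
    "59.166.0.0,62613,149.171.126.6,15816,udp\n"
)

# Assumed names for the first five fields (illustrative only)
names = ["srcip", "sport", "dstip", "dsport", "proto"]

# Default behaviour: the first record is consumed as the header row
default_df = pd.read_csv(raw)

# With header=None every line is treated as data and names are supplied
raw.seek(0)
fixed_df = pd.read_csv(raw, header=None, names=names)

print(len(default_df), list(default_df.columns))
print(len(fixed_df), list(fixed_df.columns))
```

Applied to the real file, the same two arguments (`header=None` and a full `names` list) would keep the first record in the data and give the columns meaningful labels.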


In [None]:
sample_TONiot = pd.read_csv("/content/csc786-GNNresearch/data/TON_IoT-sample.csv")
sample_TONiot.head()

Unnamed: 0,ts,src_ip,src_port,dst_ip,dst_port,proto,service,duration,src_bytes,dst_bytes,...,http_response_body_len,http_status_code,http_user_agent,http_orig_mime_types,http_resp_mime_types,weird_name,weird_addl,weird_notice,label,type
0,1556025777,192.168.1.32,50266,192.168.1.186,27228,tcp,-,0.0,0,0,...,0,0,-,-,-,-,-,-,1,scanning
1,1554273808,127.0.0.1,42100,127.0.0.1,7878,tcp,-,0.0,0,0,...,0,0,-,-,-,-,-,-,0,normal
2,1556025649,192.168.1.152,34422,192.168.1.32,10644,tcp,-,0.0,0,0,...,0,0,-,-,-,-,-,-,1,scanning
3,1556025616,192.168.1.30,22069,192.168.1.180,33883,tcp,-,0.0,0,0,...,0,0,-,-,-,-,-,-,1,scanning
4,1554258881,192.168.1.79,39062,192.168.1.255,15600,udp,-,0.0,0,0,...,0,0,-,-,-,-,-,-,0,normal


You can verify everything before pushing.

In [None]:
!ls -lh /content
!ls -lh /content/csc786-GNNresearch/data
!head -n 5 README.md
!tail -n 5 DATA_README.md