# Experiment 3: Data Wrangling

The next step toward building Faster R-CNN is an intersection over union (IoU) implementation for comparing proposed bounding boxes against the ground truth.
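For reference, IoU is the area of overlap between two boxes divided by the area of their union. Below is a minimal sketch, assuming boxes are given as `[ymin, xmin, ymax, xmax]` (the experiment 2 format). The function name `iou` is only illustrative and is not the implementation built later in this project.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [ymin, xmin, ymax, xmax] (assumed order)."""
    # Corners of the overlap rectangle
    inter_ymin = max(box_a[0], box_b[0])
    inter_xmin = max(box_a[1], box_b[1])
    inter_ymax = min(box_a[2], box_b[2])
    inter_xmax = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an empty intersection
    inter_h = max(0.0, inter_ymax - inter_ymin)
    inter_w = max(0.0, inter_xmax - inter_xmin)
    inter_area = inter_h * inter_w

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area

    # Guard against degenerate (zero-area) boxes
    return inter_area / union_area if union_area > 0 else 0.0
```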

However, the current layout of the dataset does not make this easy, because the dataset was originally created for tracking: each data "point" is a sequence consisting of 600+ frames.

The ground truth bounding boxes for every frame of a sequence are stored in a single file, `soccernet_data/tracking/train/SNMOT-XXX/gt/gt.txt`, in the following format (we don't need the last 3 entries):

`[frame ID, track ID, top-left x coordinate of the bounding box, top-left y coordinate, width, height, confidence score for the detection (1 for ground truth), -1, -1, -1]`
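To make the field order concrete, here is a small sketch that splits one comma-separated row of `gt.txt` into named values. The helper name `parse_gt_line` and the sample row are made up for illustration; they are not taken from the dataset.

```python
def parse_gt_line(line):
    """Parse one comma-separated gt.txt row into named fields."""
    fields = line.strip().split(",")
    return {
        "frame_id": int(fields[0]),
        "track_id": int(fields[1]),
        "x": float(fields[2]),           # top-left x of the box
        "y": float(fields[3]),           # top-left y of the box
        "w": float(fields[4]),           # box width
        "h": float(fields[5]),           # box height
        "confidence": float(fields[6]),  # 1 for ground truth
        # fields[7:] are the trailing -1 placeholders, which we ignore
    }


# Hypothetical row, only to show the layout
sample_row = "1,3,914,855,55,172,1,-1,-1,-1"
record = parse_gt_line(sample_row)
```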

The frames for each sequence "XXX" are located in `soccernet_data/tracking/train/SNMOT-XXX/img1/`.

The directory structure looks like this:
 
 ```
 soccernet_data/
 └─ tracking/
    └─ train/
       ├─ SNMOT-060/
       │  ├─ seqinfo.ini       # sequence metadata
       │  ├─ gameinfo.ini      # game metadata
       │  ├─ img1/             # all image frames
       │  ├─ gt/               # ground‐truth annotations
       │  ├─ det/              # detection results
       ├─ …
       ├─ SNMOT-XXX/           # …
       │  └─ (same as above)
       ├─ …                    
       └─ SNMOT-170/
          ├─ seqinfo.ini
          ├─ gameinfo.ini
          ├─ img1/
          ├─ gt/
          ├─ det/
 ```

So, we need to:
1. Load the frames for a specified sequence.
2. Match each image with its bounding boxes in gt.txt, keeping only the necessary values.
3. Transform the ground truth bounding boxes to match the format from experiment 2: `[ymin, xmin, ymax, xmax]`.
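The conversion in step 3 is simple arithmetic: `ymin = y`, `xmin = x`, `ymax = y + h`, `xmax = x + w`. As a rough sketch, it can be applied to many boxes at once with NumPy. The function name `xywh_to_yxyx` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np


def xywh_to_yxyx(boxes_xywh):
    """Convert rows of [x, y, w, h] into rows of [ymin, xmin, ymax, xmax]."""
    # Force a 2-D (N, 4) array so a single box or an empty list also works
    boxes_xywh = np.asarray(boxes_xywh, dtype=float).reshape(-1, 4)
    x, y, w, h = boxes_xywh.T
    return np.stack([y, x, y + h, x + w], axis=1)


# Example: one box with top-left corner (x=10, y=20), width 30, height 40
converted = xywh_to_yxyx([[10, 20, 30, 40]])
```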


In [3]:
import os
import numpy as np

"""
Create a soccernet_data/tracking/train/SNMOT-XXX/gt-frame/ directory with one txt file per frame,
e.g.
soccernet_data/tracking/train/SNMOT-XXX/gt-frame/000001.txt
where each line has the form: track ID, ymin, xmin, ymax, xmax
"""

def generate_per_frame_gt_files(sequence_id):
    """Split a sequence's gt.txt into one annotation file per frame."""
    base_dir   = os.path.join("..", "soccernet_data", "tracking", "train", sequence_id)
    gt_txt     = os.path.join(base_dir, "gt", "gt.txt")
    output_dir = os.path.join(base_dir, "gt-frame")

    os.makedirs(output_dir, exist_ok=True)

    # ndmin=2 keeps the array 2-D even if gt.txt contains a single row
    gt_data = np.loadtxt(gt_txt, delimiter=",", ndmin=2)
    for frame_id in sorted(set(gt_data[:, 0].astype(int))):
        # All annotations that belong to this frame
        rows = gt_data[gt_data[:, 0] == frame_id]
        frame_name = f"{frame_id:06d}.txt"
        out_path = os.path.join(output_dir, frame_name)
        with open(out_path, "w") as fout:
            for row in rows:
                # Drop the frame ID, the confidence score and the trailing -1 entries
                _, track_id, x, y, w, h, *_ = row
                # Convert (x, y, w, h) to corner coordinates
                xmin, ymin = x, y
                xmax, ymax = x + w, y + h
                fout.write(f"{int(track_id)},{int(ymin)},{int(xmin)},{int(ymax)},{int(xmax)}\n")

    print(f"Generated per-frame ground truth files in: {output_dir}")


In [None]:
# Run on a single sequence first to verify the output
generate_per_frame_gt_files("SNMOT-060")

Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-060/gt-frame
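To check one of the generated files, or to load a frame's boxes later on, the lines can be read back into track IDs and `[ymin, xmin, ymax, xmax]` boxes. This is a minimal sketch assuming the `track ID, ymin, xmin, ymax, xmax` line format written above; `load_frame_boxes` is a hypothetical helper name.

```python
def load_frame_boxes(lines):
    """Read per-frame annotation lines into (track_ids, boxes).

    `lines` can be any iterable of strings, including an open file.
    Each box is returned as [ymin, xmin, ymax, xmax].
    """
    track_ids, boxes = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        track_id, ymin, xmin, ymax, xmax = (int(v) for v in line.split(","))
        track_ids.append(track_id)
        boxes.append([ymin, xmin, ymax, xmax])
    return track_ids, boxes


# Usage with a generated file (the path follows the layout described earlier):
# with open("../soccernet_data/tracking/train/SNMOT-060/gt-frame/000001.txt") as f:
#     track_ids, boxes = load_frame_boxes(f)
```

Accepting an iterable of lines rather than a path keeps the helper easy to test without touching the file system.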


In [5]:
train_root = os.path.join("..", "soccernet_data", "tracking", "train")

# Every subdirectory of train/ is one sequence (SNMOT-XXX)
sequence_ids = [
    d for d in os.listdir(train_root)
    if os.path.isdir(os.path.join(train_root, d))
]

for seq in sorted(sequence_ids):
    print(f"Processing sequence {seq}...")
    generate_per_frame_gt_files(seq)


Processing sequence SNMOT-060...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-060/gt-frame
Processing sequence SNMOT-061...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-061/gt-frame
Processing sequence SNMOT-062...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-062/gt-frame
Processing sequence SNMOT-063...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-063/gt-frame
Processing sequence SNMOT-064...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-064/gt-frame
Processing sequence SNMOT-065...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-065/gt-frame
Processing sequence SNMOT-066...
Generated per-frame ground truth files in: ../soccernet_data/tracking/train/SNMOT-066/gt-frame
Processing sequence SNMOT-067...
Generated per-frame ground truth files in: ../soccernet_data/tracking/t