# BSM Parquet Analysis

This notebook processes and analyzes Basic Safety Message (BSM) data stored in Parquet files. It shows how to load, filter, decode, and visualize BSMs, focusing on a fixed set of geohashes and on vehicle movement over time.

The workflow includes:
- Loading and concatenating Parquet files
- Time and geohash-based grouping
- Focusing on specific geohashes
- Decoding BSM messages using a C binary
- Extracting BSM IDs and timestamps
- Identifying repeated BSMs
- Visualizing vehicle movement with Plotly

## Import Required Libraries
This cell imports essential Python libraries for data manipulation, visualization, and file handling, including pandas, matplotlib, seaborn, pyarrow, and os.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow.parquet as pq
import os

## Load and Concatenate Parquet Files
This cell locates all Parquet files in the specified directory, reads them into pandas DataFrames, concatenates them into a single DataFrame, and displays basic information about the combined data.

In [None]:
import glob

# Set the directory containing your parquet files
parquet_dir = "./parquet"  # Update if needed

# Find all .parquet files in the directory
parquet_files = sorted(glob.glob(os.path.join(parquet_dir, "*.parquet")))

print(f"Found {len(parquet_files)} Parquet files.")

# Read and concatenate all parquet files into a single DataFrame
if parquet_files:
    df_all = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)
    # Show basic info and the first rows (display() is required inside a block)
    df_all.info()
    display(df_all.head())
else:
    print("No Parquet files found in the directory.")
    df_all = pd.DataFrame()


## Group by Time and Geohash
This cell converts timestamps to datetime, buckets them by minute, groups the data by time bucket and geohash, and displays the largest groups.

In [None]:
import pandas as pd

# Convert float epoch timestamps (seconds) to datetime
df_all['Time'] = pd.to_datetime(df_all['TimeStamp'], unit='s')

# Floor each timestamp to the start of its 1-minute bucket
df_all['TimeBucket'] = df_all['Time'].dt.floor('1min')  # Use '30s', '5min', etc. for other bucket widths

grouped = df_all.groupby(['TimeBucket', 'Geohash'])

group_sizes = grouped.size().reset_index(name='Count')
group_sizes.sort_values('Count', ascending=False).head(10)
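
Flooring and rounding assign messages to buckets differently: flooring keeps a message at 50.9 s in the bucket that started at 0 s, whereas rounding would move it to the next minute. The plain-Python sketch below illustrates the flooring behavior and the per-(bucket, geohash) count. The timestamps are made up, and `floor_to_bucket` is a hypothetical helper that is not part of the notebook.

```python
from collections import Counter
from datetime import datetime, timezone

def floor_to_bucket(ts: float, width_s: int = 60) -> datetime:
    """Return the start of the width_s-second bucket that contains epoch time ts."""
    start = int(ts // width_s) * width_s
    return datetime.fromtimestamp(start, tz=timezone.utc)

# An arbitrary epoch second that falls exactly on a minute boundary.
base = 1_700_000_040

# Hypothetical (timestamp, geohash) pairs, for illustration only.
rows = [
    (base + 5.2, "9tbq2v6h"),
    (base + 50.9, "9tbq2v6h"),  # still in the first bucket when floored
    (base + 61.0, "9tbq2v6h"),  # falls into the next bucket
    (base + 10.0, "9tbq8b1c"),
]

# Equivalent in spirit to groupby(['TimeBucket', 'Geohash']).size()
counts = Counter((floor_to_bucket(ts), gh) for ts, gh in rows)

for (bucket, gh), n in sorted(counts.items()):
    print(bucket.isoformat(), gh, n)
```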



## Filter for Focus Geohashes
This cell filters the DataFrame to include only rows with geohashes of interest, sorts the results for readability, and displays message counts per geohash.

In [None]:
focus_geohashes = ['9tbq2v6h', '9tbq8b1c', '9tbq8c1g', '9tbq8ccy']
df_focus = df_all[df_all['Geohash'].isin(focus_geohashes)]

print("Message count per geohash:")
print(df_focus['Geohash'].value_counts())

# Sort by geohash and time for better readability
df_focus_sorted = df_focus.sort_values(by=["Geohash", "Time"])
df_focus_sorted.reset_index(drop=True, inplace=True)
df_focus_sorted
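
To check where the focus cells lie before filtering on them, a geohash can be decoded back to its bounding box. The sketch below implements the standard geohash decoding (base-32 characters, bits alternating between longitude and latitude) in plain Python. `geohash_bounds` and `geohash_center` are hypothetical helper names and are not used elsewhere in this notebook.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_bounds(geohash: str):
    """Return (lat_min, lat_max, lon_min, lon_max) of the cell named by geohash."""
    lat = [-90.0, 90.0]
    lon = [-180.0, 180.0]
    is_lon = True  # bits alternate, starting with longitude
    for ch in geohash.lower():
        value = _BASE32.index(ch)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            rng = lon if is_lon else lat
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half
            else:
                rng[1] = mid  # keep the lower half
            is_lon = not is_lon
    return lat[0], lat[1], lon[0], lon[1]

def geohash_center(geohash: str):
    """Return the (lat, lon) center of the geohash cell."""
    lat_min, lat_max, lon_min, lon_max = geohash_bounds(geohash)
    return (lat_min + lat_max) / 2, (lon_min + lon_max) / 2

for gh in ['9tbq2v6h', '9tbq8b1c', '9tbq8c1g', '9tbq8ccy']:
    print(gh, geohash_center(gh))
```

An 8-character geohash carries 40 bits, 20 for each axis, so every focus cell spans 180/2^20 degrees of latitude and 360/2^20 degrees of longitude.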



## Decode BSM Messages Using C Binary
This cell locates the C binary decoder, prepares the focus DataFrame, converts message bytes to hex, and decodes each BSM message using the external decoder. It also prints a few decoded outputs for inspection.

In [None]:
import os
import subprocess
import pandas as pd
from pathlib import Path

# 1. Detect repo root and set decoder path RELATIVE to repo root
notebook_dir = Path.cwd()
repo_root = None

# Traverse up to find .git as marker for root
for parent in notebook_dir.parents:
    if (parent / ".git").exists():
        repo_root = parent
        break
if repo_root is None:
    repo_root = notebook_dir  # fallback if not using git

DECODER_PATH = repo_root / "libsm/b2v-libsm/build/bin/decodeToJER"
print("Detected repo root:", repo_root)
print("Decoder Path:", DECODER_PATH)
if not DECODER_PATH.exists():
    raise FileNotFoundError(f"decodeToJER not found at {DECODER_PATH}")

# 2. Focused geohashes
focus_geohashes = ['9tbq2v6h', '9tbq8b1c', '9tbq8c1g', '9tbq8ccy']
df_focus = df_all[df_all['Geohash'].isin(focus_geohashes)].copy()
print("BSMs in focus:", len(df_focus))

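# 3. Convert mf_bytes to hex
import ast

def mf_bytes_to_hex(val):
    if isinstance(val, (bytes, bytearray)):
        return val.hex()
    if isinstance(val, str) and val.startswith("b'"):  # bytes stored as their string repr
        try:
            # literal_eval parses the literal without executing arbitrary code
            return ast.literal_eval(val).hex()
        except (ValueError, SyntaxError):
            return None
    return None

df_focus["mf_hex"] = df_focus["mf_bytes"].apply(mf_bytes_to_hex)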

# 4. Decode each BSM using the C binary
def decode_bsm_hex(hex_str):
    if not hex_str:
        return None  # nothing to decode (hex conversion failed or value missing)
    try:
        result = subprocess.run(
            [str(DECODER_PATH), "-i", hex_str],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
            text=True,
            timeout=3,
        )
        return result.stdout
    except Exception as e:
        print(f"[DecodeError] {e}")
        return None

df_focus["jer"] = df_focus["mf_hex"].apply(decode_bsm_hex)
print("Decoded BSMs:", df_focus['jer'].notnull().sum())

# 5. (Optional) Show a few decoded outputs for inspection
for jer in df_focus["jer"].dropna().head(3):
    print(jer)
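
The decode step above treats every failure the same way. When many messages fail, it can help to distinguish timeouts, non-zero exit codes, and a missing binary. The sketch below shows one way to do that with `subprocess`. `run_decoder` and `DecodeResult` are hypothetical names, and a Python one-liner stands in for `decodeToJER` so that the example runs without the real decoder. Only the `-i` flag is taken from the cell above.

```python
import subprocess
import sys
from typing import NamedTuple, Optional, Sequence

class DecodeResult(NamedTuple):
    output: Optional[str]  # decoder stdout on success, else None
    error: Optional[str]   # short failure reason, else None

def run_decoder(cmd: Sequence[str], hex_str: str, timeout: float = 3.0) -> DecodeResult:
    """Run `cmd -i hex_str` and classify the outcome instead of hiding it."""
    try:
        proc = subprocess.run(
            [*cmd, "-i", hex_str],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return DecodeResult(None, "timeout")
    except subprocess.CalledProcessError as exc:
        return DecodeResult(None, f"exit {exc.returncode}: {exc.stderr.strip()}")
    except FileNotFoundError:
        return DecodeResult(None, "decoder not found")
    return DecodeResult(proc.stdout, None)

# Stand-in commands (assumptions for illustration, not the real decoder):
fake_ok = [sys.executable, "-c", "import sys; print(sys.argv[-1])"]  # echoes the hex
fake_fail = [sys.executable, "-c", "import sys; sys.exit(2)"]        # always fails

print(run_decoder(fake_ok, "00ff"))
print(run_decoder(fake_fail, "00ff"))
```

Storing the `error` field in its own column would make it possible to count failures by cause with `value_counts()`.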


## Extract BSM ID and Identify Repeated Messages
This cell parses the decoded JER output to extract BSM IDs and secMark values, sorts by ID and timestamp, computes time differences between consecutive messages with the same ID, and identifies BSMs that are repeated within a short time window.

In [None]:
import json
import numpy as np

# 1. Extract id and secMark from the decoded JER string (for each row)
def extract_id_secmark(jer_str):
    try:
        jer = json.loads(jer_str)
        bsm = jer["value"]["BasicSafetyMessage"]["coreData"]
        return bsm.get("id"), bsm.get("secMark")
    except Exception:
        # Covers failed decodes (None), malformed JSON, and unexpected structure
        return None, None

df_focus[["bsm_id", "bsm_secMark"]] = df_focus["jer"].apply(lambda x: pd.Series(extract_id_secmark(x)))

# 2. Find BSMs that share an id and have nearby TimeStamp values
# Sort for easier comparison
df_focus_sorted = df_focus.sort_values(["bsm_id", "TimeStamp"])

# Compute time difference (in seconds) to previous message with same id
df_focus_sorted["prev_TimeStamp"] = df_focus_sorted.groupby("bsm_id")["TimeStamp"].shift(1)
df_focus_sorted["dt_sec"] = df_focus_sorted["TimeStamp"] - df_focus_sorted["prev_TimeStamp"]

# Show BSMs with dt_sec < threshold (e.g., 2 seconds)
threshold = 2
nearby = df_focus_sorted[df_focus_sorted["dt_sec"].notnull() & (df_focus_sorted["dt_sec"] < threshold)]

print(f"BSMs with repeated id within {threshold} seconds:")
display(nearby[["bsm_id", "TimeStamp", "dt_sec", "Geohash", "Latitude", "Longitude"]].head(10))
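
The comparison above uses `TimeStamp`. If `bsm_secMark` were used instead, the difference would have to account for wraparound. The sketch below assumes that `secMark` follows the SAE J2735 `DSecond` definition: milliseconds within the current minute, with 65535 meaning "unavailable". `secmark_delta_ms` is a hypothetical helper, and it assumes less than one minute passes between consecutive messages.

```python
from typing import Optional

SECMARK_UNAVAILABLE = 65535  # assumed J2735 DSecond value for "unavailable"
MS_PER_MINUTE = 60_000

def secmark_delta_ms(prev: Optional[int], curr: Optional[int]) -> Optional[int]:
    """Milliseconds from prev to curr, assuming less than one minute elapsed.

    Leap-second values (60000-60999) are not handled specially in this sketch.
    """
    if prev is None or curr is None:
        return None
    if SECMARK_UNAVAILABLE in (prev, curr):
        return None
    # The modulo handles the wrap from 59999 back to 0 at the minute boundary.
    return (curr - prev) % MS_PER_MINUTE

print(secmark_delta_ms(1000, 1100))   # same minute
print(secmark_delta_ms(59900, 100))   # crosses a minute boundary
```

Because `secMark` is set by the sender, comparing its delta against the `TimeStamp` delta is one way to tell a message that was received twice from one that was sent twice.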


## Visualize BSM Movement by Vehicle ID
This cell prepares the data for animation, then uses Plotly to create an animated map showing the movement of vehicles (by BSM ID) over time, with one frame per second. `px.scatter_map` requires Plotly 5.24 or later.

In [None]:
import plotly.express as px

# Ensure Time is datetime and floor to whole seconds
df_focus_sorted["Time"] = pd.to_datetime(df_focus_sorted["Time"])
df_focus_sorted["Time_sec"] = df_focus_sorted["Time"].dt.floor("s")

# (Optional) Convert bsm_id to string for display
df_focus_sorted["bsm_id"] = df_focus_sorted["bsm_id"].astype(str)

# Sort by Time for animation
df_anim = df_focus_sorted.sort_values("Time_sec")

fig = px.scatter_map(
    df_anim,
    lat="Latitude",
    lon="Longitude",
    color="bsm_id",      # Color by vehicle
    animation_frame=df_anim["Time_sec"].dt.strftime('%Y-%m-%d %H:%M:%S'),
    hover_name="bsm_id",
    zoom=12,
    height=600,
    map_style="open-street-map"  # OpenStreetMap tiles; no access token needed
)

fig.update_layout(
    title="BSM Movement by Vehicle ID (Per Second)",
    margin={"r":0, "t":30, "l":0, "b":0},
)
fig.show()
