When gamers, YouTubers, or streamers conduct speedrunning sessions, their data is saved into an .lss file, which stores split times for pre-determined markers in an XML format.

Pre-determined markers may be "Start," "Level 3," "Defeat Final Boss," etc. depending on the type of speedrun.

The file can prove difficult to work with without prior cleaning, because the data is organized by pre-determined marker, not by attempt.

That is to say, each "Attempt" (a.k.a. "run") is a single node holding an id, a start time, and a stop time, while the "Segment" nodes hold the times for all attempts.

For example, a file covering 100 speedrunning attempts of Super Mario Bros. would have an "AttemptHistory" node holding 100 child nodes, each with the three pieces of information mentioned above. The "Segments" node would hold a series of child nodes titled "Segment," listed in chronological marker order. Each "Segment" in turn contains a "SegmentHistory," which lists each attempt's time data for that segment.
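To make that layout concrete, here is a minimal sketch using a made-up, heavily simplified file. The `started` and `ended` attribute names, the timestamps, and the segment times are assumptions for illustration only, and a real .lss file carries more fields than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified .lss content (illustrative values only).
SAMPLE_LSS = """
<Run>
  <AttemptHistory>
    <Attempt id="1" started="01/01/2024 10:00:00" ended="01/01/2024 10:05:00" />
    <Attempt id="2" started="01/01/2024 10:06:00" ended="01/01/2024 10:10:30" />
  </AttemptHistory>
  <Segments>
    <Segment>
      <Name>Start</Name>
      <SegmentHistory>
        <Time id="1"><GameTime>00:00:21.930</GameTime></Time>
        <Time id="2"><GameTime>00:00:20.100</GameTime></Time>
      </SegmentHistory>
    </Segment>
    <Segment>
      <Name>Level 3</Name>
      <SegmentHistory>
        <Time id="1"><GameTime>00:01:02.500</GameTime></Time>
      </SegmentHistory>
    </Segment>
  </Segments>
</Run>
"""

sample_root = ET.fromstring(SAMPLE_LSS)

# the attempts themselves are listed in one place...
attempt_ids = [a.get("id") for a in sample_root.findall("./AttemptHistory/Attempt")]

# ...while each attempt's times are scattered across the Segment nodes
times_by_segment = {
    seg.findtext("Name"): {
        t.get("id"): t.findtext("GameTime")
        for t in seg.findall("./SegmentHistory/Time")
    }
    for seg in sample_root.findall("./Segments/Segment")
}

print(attempt_ids)
print(times_by_segment)
```

Note that attempt 2 has no entry under "Level 3" in this sample: an attempt that never reaches a segment simply has no time recorded there, which is why the tabular version ends up with empty cells.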

Data held in nested trees like this is difficult to analyze, so it is best converted to a tabular or dataframe format.

The following notebook parses an .lss file with XPath, then converts it into a tabular csv file.

Thereafter, some exploratory data analysis is performed and some visualizations are displayed.

In [9]:
#the .lss file was visually inspected to check the segment names, some of which start with a hyphen.
#let's see the full list of segments

#import the relevant package/library
import xml.etree.ElementTree as ET

#import the file and initialize
tree = ET.parse('/content/any6bCP.lss')
root = tree.getroot()

#print split names
for seg in root.findall('.//Segment'):
    print(seg.find('Name').text)


Prologue
-Start
-Crossing
{Forsaken City} Chasm
-Start
-Intervention
{Old Site} Awake
-Start
-Huge Mess
-Elevator Shaft
{Celestial Resort} Presidential Suite
-Start
-Shrine
-Old Trail
{Golden Ridge} Cliff Face
-Start
-Cassette
-5B Start
-Central Chamber
-Through the Mirror
{Mirror Temple} Mix Master
-Start
-Lake
-Cassette
-6B Start
-Reflection
-Rock Bottom
{Reflection} Reprieve
-0m
-500m
-1000m
-1500m
-2000m
-2500m
{The Summit} 3000m


The leading hyphens on some segment names would be awkward as column names in tabular form, so we'll trim those now.

In [5]:
#let's run a for loop to remove the hyphen from affected segment names

for seg in root.findall('.//Segment'):
  title_hyphen = seg.find('Name')
  if title_hyphen is not None:
    title_text = title_hyphen.text or ""
    if title_text.startswith("-"):
      title_hyphen.text = title_text[1:]

#we'll save as a new file for preservation
tree.write("splits_updated.xml", encoding="utf-8", xml_declaration=True)

Now we'll move on to converting the .lss file into a tabular format (csv).

The original stakeholder of this project wanted the data converted into a tabular format with one run per row (observation) and one segment name per column (feature), i.e. the markers in ascending order from start to finish of the speedrun. Each cell value is the elapsed time of the run up to that segment, in milliseconds (e.g. 5 minutes 24 seconds of game time is stored as 324000 milliseconds).

Thus, we'll extract the segment names to be column names, and we'll order them appropriately.
The time splits in the .lss file are formatted as hh:mm:ss.fraction, so we'll parse them and convert them to milliseconds.
Then we'll add up the segment times across each attempt to get a running total.

To finish, we'll write the data to a csv.
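As a quick worked example of those two steps, the sketch below converts a few time strings to milliseconds and then accumulates them. The `hms_to_ms` helper and the three segment times are hypothetical and much simpler than the parser used in the next cell (for instance, the helper does not handle a leading day count).

```python
from itertools import accumulate

def hms_to_ms(text):
    """Convert 'hh:mm:ss' or 'hh:mm:ss.fff' to integer milliseconds (simplified)."""
    hms, _, frac = text.partition(".")
    h, m, s = (int(part) for part in hms.split(":"))
    # pad or truncate the fraction to exactly three digits (milliseconds)
    ms = int((frac + "000")[:3])
    return ((h * 60 + m) * 60 + s) * 1000 + ms

# hypothetical segment times for one attempt, in segment order
segment_times = ["00:00:21.930", "00:00:24.208", "00:04:37.862"]

segment_ms = [hms_to_ms(t) for t in segment_times]
cumulative_ms = list(accumulate(segment_ms))

print(segment_ms)      # [21930, 24208, 277862]
print(cumulative_ms)   # [21930, 46138, 324000]
```

The last running total, 324000 ms, corresponds to 5 minutes 24 seconds of game time.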

In [8]:
#import relevant packages / libraries

import re
from collections import defaultdict, OrderedDict
import csv

#use the new file with updated segment names and prep the output file

LSS_PATH = "/content/splits_updated.xml"
CSV_OUT = "attempts_cumulative_ms.csv"

#helper functions

#some xml files declare namespaces, which ElementTree folds into the tag as '{namespace}localname'.
#this function splits on the final } of the namespace and extracts just the localname

def local(tag):
    """Return tag localname without namespace."""
    return tag.rsplit('}', 1)[-1]

#in order to parse the formatting of the time splits in the .lss, we'll use a regular expression
#to break each string into its numerical parts

_time_re = re.compile(
    r'^(?:(?P<days>\d+)\.)?(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2})(?:\.(?P<f>\d{1,9}))?$'
)

#we'll convert the time values to a millisecond integer

def parse_time_to_ms(text):
    """
    Parse LiveSplit time like:
    '00:12:34.567', '1.02:03:04.789' (d.hh:mm:ss.fff) → milliseconds (int).
    Returns None if text is empty/invalid.
    """
    if not text:
        return None
    m = _time_re.match(text.strip())
    if not m:
        return None
    days = int(m.group("days") or 0)
    h = int(m.group("h"))
    mi = int(m.group("m"))
    s = int(m.group("s"))
    frac = m.group("f") or "0"
    #normalize fractional seconds to exactly 3 digits: pad with zeros if shorter, truncate if longer
    #e.g., "5"->500ms, "56"->560ms, "5678"->567ms
    if len(frac) < 3:
        ms = int(frac.ljust(3, "0"))
    else:
        ms = int(frac[:3])
    total_ms = (((days*24 + h)*60 + mi)*60 + s)*1000 + ms
    return total_ms

#load the XML and initialize the root

tree = ET.parse(LSS_PATH)
root = tree.getroot()

#get ordered segment names

segments_node = next((n for n in root.iter() if local(n.tag) == "Segments"), None)
if segments_node is None:
    raise RuntimeError("No <Segments> found in LSS.")

segment_names = []
for seg in segments_node:
    if local(seg.tag) == "Segment":
        name_node = next((c for c in seg if local(c.tag) == "Name"), None)
        segment_names.append(name_node.text if name_node is not None else "")

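#collect each attempt's time for each segment
#(SegmentHistory/Time @id is the attempt id; the GameTime text is the segment time)
#note: times are keyed by segment name, so segments that share a name share one slot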
attempt_segment_ms = defaultdict(dict)  # {attempt_id: {segment_name: ms}}

for seg in segments_node:
    if local(seg.tag) != "Segment":
        continue
    seg_name = next((c.text for c in seg if local(c.tag) == "Name"), "")
    hist = next((c for c in seg if local(c.tag) == "SegmentHistory"), None)
    if hist is None:
        continue
    for t in hist:
        if local(t.tag) != "Time":
            continue
        att_id = t.get("id")
        rt = next((c.text for c in t if local(c.tag) == "GameTime"), None)
        ms = parse_time_to_ms(rt)
        if att_id and ms is not None:
            attempt_segment_ms[att_id][seg_name] = ms

#compute cumulative per attempt in segment order
attempt_ids_sorted = sorted(attempt_segment_ms.keys(), key=lambda x: int(x))
rows = OrderedDict()  # {attempt_id: [cumulative_ms per segment order]}
for att_id in attempt_ids_sorted:
    cum = 0
    row = []
    for seg_name in segment_names:
        seg_ms = attempt_segment_ms[att_id].get(seg_name)
        if seg_ms is None:
            row.append(None)  # missing
        else:
            cum += seg_ms
            row.append(cum)
    rows[att_id] = row

#write output csv
with open(CSV_OUT, "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["attempt_id"] + segment_names)
    for att_id, vals in rows.items():
        w.writerow([att_id] + vals)

print(f"Wrote {CSV_OUT} with {len(rows)} attempts and {len(segment_names)} segments.")

#print a row sample to verify
for att_id, vals in rows.items():
    numbers = ", ".join("" if v is None else str(v) for v in vals)
    print(f"{att_id}: {numbers}")
    break


Wrote attempts_cumulative_ms.csv with 3796 attempts and 35 segments.
-1: , 21930, 46138, , 68068, , , 89998, , , , 111928, 128639, , 171427, 193357, , , , 235975, , 257905, , , , , 312644, 384741, , , , , , , 
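One caveat is worth flagging here. Once the hyphens are trimmed, several segments share a name ("Start" appears in multiple chapters, and "Cassette" appears twice), and the cell above stores times in a dictionary keyed by segment name. The sample row is consistent with that causing collisions: the running total grows by the same 21930 ms at each "Start" column. A possible fix, sketched below on a made-up miniature file, is to key the times by segment position and to build unique column labels. The `by_index` and `columns` names and the `01_Start` label format are assumptions for illustration, not part of the original notebook.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature file with two segments that share the name "Start".
SAMPLE = """
<Run>
  <Segments>
    <Segment>
      <Name>Start</Name>
      <SegmentHistory>
        <Time id="1"><GameTime>00:00:10.000</GameTime></Time>
      </SegmentHistory>
    </Segment>
    <Segment>
      <Name>Start</Name>
      <SegmentHistory>
        <Time id="1"><GameTime>00:00:30.000</GameTime></Time>
      </SegmentHistory>
    </Segment>
  </Segments>
</Run>
"""

segments = ET.fromstring(SAMPLE).findall("./Segments/Segment")

by_name = defaultdict(dict)   # {attempt_id: {segment_name: time text}}
by_index = defaultdict(dict)  # {attempt_id: {segment_position: time text}}

for idx, seg in enumerate(segments):
    name = seg.findtext("Name", default="")
    for t in seg.findall("./SegmentHistory/Time"):
        game_time = t.findtext("GameTime")
        by_name[t.get("id")][name] = game_time   # the second "Start" overwrites the first
        by_index[t.get("id")][idx] = game_time   # each position keeps its own value

# unique column labels for the csv header, e.g. "01_Start", "02_Start"
columns = [
    f"{i + 1:02d}_{seg.findtext('Name', default='')}"
    for i, seg in enumerate(segments)
]

print(dict(by_name["1"]))
print(dict(by_index["1"]))
print(columns)
```

Keying by position keeps both "Start" times, whereas keying by name keeps only the last one. Prefixing each label with its position is just one way to keep the csv header unique; prefixing with the chapter name would work as well.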
