finnish-parliament-scripts

Scripts for retrieving and aligning speech and meeting transcripts from the web portal of the Parliament of Finland (https://www.eduskunta.fi)

Dependencies:

  • sox
  • avconv
  • sclite
  • python3
  • python3-lxml
  • wget

An ASR system is also required to produce the first-pass hypotheses.

Download videos and meeting transcripts and save into DATA-FOLDER:

retrieve/retrieve_sessions.py DATA-FOLDER

Four different files will be saved for each session:

  • *.mp4 - video of the session
  • *.wav - audio file stored in WAV format (16 kHz, mono)
  • *.transcript - meeting transcript with speaker information for each paragraph
  • *.metadata - metadata file containing date information and links to the original video and meeting transcript

EDIT: Currently the retrieval of the meeting transcripts fails because the publishing format has changed.

Produce first-pass recognition output with an ASR system, preferably one whose language model (LM) has been biased by training on the meeting transcripts.

Store recognition output in the following format:

  • start-time-in-seconds end-time-in-seconds word
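As an illustration of the format above, here is a small sketch for reading and writing such lines. It assumes whitespace-separated fields with one word per line; the `TimedWord` type, the function names, and the two-decimal precision are hypothetical choices, not something the alignment script prescribes.

```python
from typing import Iterable, List, NamedTuple


class TimedWord(NamedTuple):
    start: float  # start time in seconds
    end: float    # end time in seconds
    word: str


def parse_asr_line(line: str) -> TimedWord:
    """Parse one 'start end word' line (whitespace-separated, assumed)."""
    start, end, word = line.split(maxsplit=2)
    return TimedWord(float(start), float(end), word.strip())


def read_asr_output(lines: Iterable[str]) -> List[TimedWord]:
    """Parse all non-empty lines of a recognition output."""
    return [parse_asr_line(line) for line in lines if line.strip()]


def format_asr_line(tw: TimedWord) -> str:
    """Format a word back into the 'start end word' layout."""
    return f"{tw.start:.2f} {tw.end:.2f} {tw.word}"
```

For example, `read_asr_output(open("asr-output", encoding="utf-8"))` would return one `TimedWord` per recognized word.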

Align the first-pass output with the meeting transcript using sclite:

align/asr_align_2_elan.py asr-output transcript-file metadata-filename elan-filename

The output is in the ELAN EAF format.
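The idea behind the alignment step is that transcript words matching the recognizer output can inherit its timestamps. The sketch below illustrates that idea only: it swaps in Python's `difflib.SequenceMatcher` for sclite, so it is not the method used by `align/asr_align_2_elan.py`, and the function name `transfer_timestamps` is hypothetical. It also assumes the transcript has already been split into words with punctuation removed.

```python
from difflib import SequenceMatcher


def transfer_timestamps(asr_words, transcript_words):
    """Copy ASR timestamps onto matching transcript words.

    asr_words: list of (start, end, word) tuples from the recognizer.
    transcript_words: list of words from the meeting transcript.
    Returns one (start, end, word) tuple per transcript word; words
    with no match in the ASR output get None for both times.
    """
    hyp = [word.lower() for _, _, word in asr_words]
    ref = [word.lower() for word in transcript_words]
    aligned = [(None, None, word) for word in transcript_words]

    # difflib stands in for sclite here (an assumption of this sketch).
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            start, end, _ = asr_words[block.b + offset]
            ref_index = block.a + offset
            aligned[ref_index] = (start, end, transcript_words[ref_index])
    return aligned
```

Unmatched transcript words are left without times; a real pipeline would typically interpolate them from their matched neighbours.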

Test the alignment script with example files:

align/asr_align_2_elan.py test/session_79_2008.asr test/session_79_2008.transcript test/session_79_2008.metadata test/session_79_2008.eaf
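To check the resulting EAF file without opening ELAN, the annotations can be read with the standard library. This is a minimal sketch based on the general EAF layout (`TIME_SLOT` elements with millisecond values, `TIER` elements containing `ALIGNABLE_ANNOTATION` elements); which tiers the alignment script writes is not documented here, so the function simply returns all of them. The name `read_eaf_segments` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_eaf_segments(eaf_source):
    """Return (tier_id, start_s, end_s, text) tuples from an EAF file.

    eaf_source may be a file name or an open file object.
    """
    root = ET.parse(eaf_source).getroot()

    # Map time-slot ids to seconds; EAF stores TIME_VALUE in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE")) / 1000.0
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    segments = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((tier.get("TIER_ID"), start, end, text))
    return segments
```

Calling `read_eaf_segments("test/session_79_2008.eaf")` after the test command above should list the aligned segments with their start and end times in seconds.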

Extract individual speech segments from a list of EAF-files:

extract/elan_wav_extractor.py eaf-list wav-segment-dir

This stores both an audio file (.wav) and a transcript (.trn) for each segment.
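Cutting a segment out of a session recording comes down to copying the frames between two time stamps. The sketch below shows this with Python's standard `wave` module, assuming uncompressed PCM input; `extract_segment` is a hypothetical helper and does not reproduce what `extract/elan_wav_extractor.py` does internally (for instance, it writes no .trn file).

```python
import wave


def extract_segment(src_wav, dst_wav, start_s, end_s):
    """Copy the audio between start_s and end_s (seconds) into dst_wav.

    Returns the number of frames written.
    """
    with wave.open(src_wav, "rb") as src:
        params = src.getparams()
        first = int(round(start_s * params.framerate))
        count = max(0, int(round(end_s * params.framerate)) - first)
        # Clamp the start position so setpos never runs past the end.
        src.setpos(min(first, params.nframes))
        frames = src.readframes(count)

    with wave.open(dst_wav, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)

    return len(frames) // (params.sampwidth * params.nchannels)
```

The start and end times would come from the annotations in an EAF file, converted to seconds.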

Extract individual speech segments from a list of metadata files:

extract/corpus_extractor.py metadata-file-list 

This stores an audio file (.wav) for each segment.

André Mansikkaniemi, andre.mansikkaniemi@aalto.fi
