Adapt SHAWI data management workflows to include audio files #10

Closed
Tracked by #56
simar0at opened this issue May 5, 2023 · 6 comments
Labels: data-processing, enhancement (New feature or request)

Comments

@simar0at
Contributor

simar0at commented May 5, 2023

We have wav files and we have timestamps in TEI XML.

We need a way to cut the wav files and also probably to export them as mp4.

One way to do this is to transform the TEI files into Audacity 2 project files, which also happen to be XML files.

@simar0at
Contributor Author

simar0at commented Jul 25, 2023

We can generate the labels in text format:

0.000000	0.439679	H
0.346413	0.912666	IG
0.925990	2.394558	IG
2.394558	5.009671	IG
2.711361	2.851247	H
5.006803	5.558277	ID
5.558277	8.576871	IG
8.576871	11.116553	IG
11.116553	23.942498	H

We can use this with Audacity 2 or 3. Audacity 3 moved away from any (visible) XML.
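
To sanity-check a generated label file before importing it, the text format above can be read back with a few lines of Python. This is only a sketch: the `Label` type and `parse_labels` function are names made up for illustration, and it assumes exactly three tab-separated columns (start, end, label text) with times in seconds, as in the sample above.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def parse_labels(text: str) -> list[Label]:
    """Parse tab-separated 'start<TAB>end<TAB>label' lines into Label tuples."""
    labels = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        start, end, name = line.split("\t", 2)
        labels.append(Label(float(start), float(end), name))
    return labels


# The first two rows of the sample above.
sample = "0.000000\t0.439679\tH\n0.346413\t0.912666\tIG\n"
labels = parse_labels(sample)
```

Note that regions may overlap (the first two rows above do), so any downstream cutting step should treat each row independently.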

@dasch124
Member

IMO we should go for the CSV/TSV format, which seems to take less effort to create.

@dasch124
Member

dasch124 commented Dec 19, 2023

General Workflow:

  1. For each TEI transcription document, we generate "region labels" (i.e. named time spans in Audacity) in the format mentioned by Omar in the comment above and add them to the SHAWI data repository.
  2. Team members at the university with access to the original audio files open the files in Audacity, import the region label list, and export the audio snippets both as uncompressed WAV (for archiving) and as MP3 (for publishing in the application).
  3. They upload the resulting audio files to https://oeawcloud.oeaw.ac.at/index.php/apps/files/?dir=/R_Shawi_19367&fileid=36169042
  4. We add references to both versions of the audio files to the TEI documents.
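
If step 2 ever needs to be scripted instead of done by hand in Audacity, the WAV half can be sketched with Python's standard `wave` module. This is an assumption-laden illustration, not part of the agreed workflow: `cut_wav` and `export_snippets` are hypothetical helper names, the input is assumed to be uncompressed PCM WAV, and MP3 export is left out because it needs an external encoder.

```python
import os
import wave


def cut_wav(src_path, dst_path, start_s, end_s):
    """Copy the frames between start_s and end_s (seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert seconds to frame indices and clamp them to the file length.
        first = min(int(round(start_s * rate)), total)
        last = min(int(round(end_s * rate)), total)
        src.setpos(first)
        frames = src.readframes(max(last - first, 0))
    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate; the frame count in the
        # header is corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)


def export_snippets(src_path, out_dir, regions):
    """regions: iterable of (start_s, end_s, name); writes <out_dir>/<name>.wav."""
    os.makedirs(out_dir, exist_ok=True)
    for start_s, end_s, name in regions:
        cut_wav(src_path, os.path.join(out_dir, name + ".wav"), start_s, end_s)
```

Naming each snippet after its region label only works if the labels are unique and valid as file names.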

TEI > Audacity labels conversion

The TSV format is described here: https://manual.audacityteam.org/man/importing_and_exporting_labels.html

This should be generated by taking all <u> elements in the transcription documents, and re-calculating the absolute timestamps from the @interval attribute on the <when> elements inside of the <timeline>:

    <timeline unit="ms">
       <when xml:id="T0"/>
       …
       <when interval="197124" since="#T0" xml:id="T19"/>
       <when interval="197256" since="#T0" xml:id="T20"/>
       …
    </timeline>
    …
    <annotationBlock>
       <u xml:lang="ar-acm-x-shawi-vicav" xml:id="URFA-034_a20" who="#default" end="#T20" start="#T0">
          …
       </u>
    </annotationBlock>

Instead of having the speaker name as the label, we should use the utterance's xml:id, so the exported audio snippet can be named after the utterance id.
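
A minimal sketch of this conversion in Python, using only the standard library. The function names are hypothetical, and several things are assumed rather than confirmed: the documents use the standard TEI namespace, every `@interval` is in milliseconds (per `unit="ms"` on the `<timeline>`), a `<when>` without `@interval` is the origin at 0, and `@since` may point at any other `<when>`.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def absolute_seconds(when_by_id, when_id, cache):
    """Resolve a <when> to seconds from the timeline origin, following @since."""
    if when_id not in cache:
        when = when_by_id[when_id]
        interval = when.get("interval")
        if interval is None:
            cache[when_id] = 0.0  # assumption: no @interval means the origin
        else:
            since = when.get("since")
            base = absolute_seconds(when_by_id, since.lstrip("#"), cache) if since else 0.0
            cache[when_id] = base + float(interval) / 1000.0  # ms -> s
    return cache[when_id]


def tei_to_labels(tei_xml: str) -> str:
    """Return one 'start<TAB>end<TAB>xml:id' line per <u> element."""
    root = ET.fromstring(tei_xml)
    when_by_id = {w.get(XML_ID): w for w in root.iter(f"{TEI}when")}
    cache = {}
    lines = []
    for u in root.iter(f"{TEI}u"):
        start = absolute_seconds(when_by_id, u.get("start").lstrip("#"), cache)
        end = absolute_seconds(when_by_id, u.get("end").lstrip("#"), cache)
        # Use the utterance's xml:id as the label text.
        lines.append(f"{start:.6f}\t{end:.6f}\t{u.get(XML_ID)}")
    return "\n".join(lines) + "\n"
```

The six-decimal formatting mirrors the sample label file earlier in this thread. With the `<u>` shown above (start `#T0`, end `#T20`), the resulting row would run from 0 to 197.256 seconds and be labelled `URFA-034_a20`.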

@dasch124
Member

dasch124 commented Feb 1, 2024

For some reason, the xml:id is missing from the @url attribute on <media>, e.g. https://github.com/acdh-oeaw/shawi-data/blob/main/010_manannot/Urfa-097_Three_Daughters-Harran-2010.xml#L210

@rausch-supola
Collaborator

As I said, I only inserted the two lines with the media tag. I guess the linking to some data is missing.

@rausch-supola
Collaborator

I guess this issue can be closed, @dasch124?
