# Forced Alignment Notes for South Park TTS Data

The purpose of this notebook is to record some of the exploration around forced alignment for these data.

For a full description of the steps leading up to generating wav files and srt files, see the main notebook.

## CCAligner

One tool specifically designed to work with subtitle data is [CCAligner](https://github.com/saurabhshri/CCAligner/blob/master/README.adoc). 

### CCAligner: Single file example

See the main notebook for running large batches.

In [None]:
cd "/vm/CCAligner/install"
echo "In $PWD"
#-audioWindow 500 -searchWindow 6 -useBatchMode yes
./ccaligner -wav /y/south-park-1-to-20/1-1.wav -srt /y/south-park-1-to-20/1-1.srt -oFormat json >> ccalign.log 2>&1

### CCAligner: Evaluations


#### Comparison to SRT

We load up SRT and CCAligner outputs for the same disc and evaluate the times:

- Subtitle times (when it appears and disappears); compare across SRT and CCAligner
- Number of words recognized (only in CCAligner)
- Word alignment times (only in CCAligner)

Comparison between SRT and CCAligner times is approximate because CCAligner alters some subtitle strings.
Some strings will not be matched at all (missing at random?); some repeated strings will be matched at the wrong location, which shows up as a false difference.
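The matching strategy described above can be sketched in a few lines of Python. This is only an illustration of the idea (exact text match, then nearest start time), not the notebook's F# code; the function name and the sample tuples are made up for the example.

```python
from collections import defaultdict

def match_by_text(srt_subs, aligned_subs):
    """Pair each aligned subtitle with the SRT subtitle that has identical text
    and the closest start time. Each subtitle is a (text, start_ms, end_ms) tuple."""
    by_text = defaultdict(list)
    for sub in srt_subs:
        by_text[sub[0]].append(sub)

    rows = []
    for text, start, end in aligned_subs:
        candidates = by_text.get(text)
        if not candidates:
            continue  # text was altered by the aligner, so no comparison is possible
        nearest = min(candidates, key=lambda s: abs(s[1] - start))
        same = nearest[1] == start and nearest[2] == end
        rows.append((text, start, end, nearest[1], nearest[2], "SAME" if same else "DIFF"))
    return rows

# Hypothetical data: one repeated string and one string the aligner changed.
srt = [("WHAT?", 1000, 1500), ("WHAT?", 9000, 9500), ("DUDE, VISITORS!", 2000, 3000)]
aligned = [("WHAT?", 1000, 1500), ("WHAT?", 9100, 9500), ("DUDE VISITORS", 2000, 3000)]
for row in match_by_text(srt, aligned):
    print(row)
```

The altered string drops out of the comparison entirely. A repeated string is paired with whichever copy is nearest in time, and if the times have shifted a lot that may be the wrong copy.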

In [None]:
#r "/z/aolney/repos/Newtonsoft.Json.9.0.1/lib/net40/Newtonsoft.Json.dll"

type Word =
    {
        word : string
        recognised : int //logical 1/0
        start : int
        ``end`` : int
        duration : int
    }
    
type Subtitle =
    {
        subtitle : string
        edited_text : string
        start : int
        ``end`` : int
        words: Word[]
    }
    override x.ToString() = x.edited_text + "-" + x.start.ToString()  + ":" + x.``end``.ToString()

type CCAligned =
    {
        subtitles : Subtitle[]
    }
    
///The CCAligned json is invalid; we must fix unescaped quotes and put commas b/w objects
let FixCCAlignedJSON filePath =
    let json = 
        filePath
        |> System.IO.File.ReadAllLines
        |> Seq.map( fun line -> line.Replace("} {","},  {"))
        |> Seq.map( fun line -> line.Replace("}\t{","},  {")) //new version of json.net requires
        |> Seq.map( fun line -> line.Replace("\\","")) //bad escape sequences - bad OCR?
        |> Seq.map( fun line ->
                   let fieldIndex = line.IndexOf(":")
                   if fieldIndex > 0 then 
                       let propertyString = line.Substring(fieldIndex+1)
                       if propertyString.Contains("\"") then
                           line.Substring(0,fieldIndex + 1) + "\"" + propertyString.Trim([|' ';'"';','|]).Replace("\"","\\\"") + "\","
                       else
                           line
                   else 
                       line
                  )
    //System.IO.File.WriteAllLines( filePath + ".corrected",json )
    //let correctedJson = System.IO.File.ReadAllText( filePath + ".corrected" )
    let ccAligned = Newtonsoft.Json.JsonConvert.DeserializeObject<CCAligned>(json |> String.concat "\n" ) //correctedJson)
    //
    ccAligned

let SrtStringToTime( timeString ) =  
    System.TimeSpan.ParseExact(timeString, @"hh\:mm\:ss\,fff", null).TotalMilliseconds
let blankLinesRegex = new System.Text.RegularExpressions.Regex("\n\n+")
let RegexSplit (regex : System.Text.RegularExpressions.Regex)  (input:string) = regex.Split( input )
let SubtitlesFromSRT filePath = 
    ( filePath |> System.IO.File.ReadAllText).Trim() 
    |> RegexSplit blankLinesRegex 
    //Split([|"\n\n"|],System.StringSplitOptions.None)
    |> Array.map(fun block ->
        let blockLines = block.Split('\n')
        if blockLines.[1].Contains( " --> ") |> not then
            eprintfn "Malformed SRT block (no timing line): %s" block
        let startEnd = blockLines.[1].Replace( " --> ", " ").Trim().Split(' ')
        let start = startEnd.[0] |> SrtStringToTime |> int
        let stop = startEnd.[1] |> SrtStringToTime |> int
        let subtitle = blockLines |> Array.skip 2 |> String.concat " "
        { subtitle = subtitle ; edited_text = ""; start = start;  ``end`` = stop ; words = [||]}
        )
    
let CompareSrtAndCCAlign srtDirectory alignedDirectory =
    let fileTuples = 
        Seq.zip 
            ( System.IO.Directory.GetFiles(srtDirectory, "*.srt") |> Seq.sort )
            ( System.IO.Directory.GetFiles(alignedDirectory, "*.json") |> Seq.sort )

    //this is approximate b/c ccAligner changes the text slightly for some subtitles
    let srtCCAlignSubtitleCorrespondence =
        fileTuples
        |> Seq.collect( fun (srtFile,alignedFile) ->
            let srtMap = srtFile |> SubtitlesFromSRT |> Seq.groupBy( fun subtitle -> subtitle.subtitle ) |> Map.ofSeq
            let ccAligned = alignedFile |> FixCCAlignedJSON
            ccAligned.subtitles
            |> Seq.choose( fun aSubtitle -> 
                match srtMap.TryFind( aSubtitle.subtitle ) with
                | Some( subSequence ) -> 
                    let closestSubtitleToSRT =
                        subSequence 
                        |> Seq.sortBy( fun s -> System.Math.Abs( aSubtitle.start - s.start ) ) 
                        |> Seq.head 
                    let isMatch,matchString =
                        if closestSubtitleToSRT.start = aSubtitle.start && closestSubtitleToSRT.``end`` = aSubtitle.``end`` then 
                            true,"SAME"
                        else 
                            false,"DIFF"
                    Some(
                        (isMatch, srtFile + "\t" + aSubtitle.edited_text + "\t" + aSubtitle.start.ToString() + "\t" + aSubtitle.``end``.ToString() + "\t" + 
                            closestSubtitleToSRT.subtitle  + "\t" + closestSubtitleToSRT.start.ToString() + "\t" +  closestSubtitleToSRT.``end``.ToString()  + "\t" + matchString)
                        )
                | None -> None
            )   
        )
    System.IO.File.WriteAllLines( "srt-aligned-correspondence-" + System.DateTime.Now.ToString("s").Replace(" ","-").Replace(":","_") + ".tsv", srtCCAlignSubtitleCorrespondence |> Seq.map snd )

//Compare with default CCAlign
//CompareSrtAndCCAlign "/y/south-park-1-to-20/" "/y/south-park-1-to-20/ccalign-json-default"
//CompareSrtAndCCAlign "/y/south-park-1-to-20/" "/y/south-park-1-to-20/ccalign-json-audioWindow500"

//Count percentage of recognized words
let PercentRecognized alignedDirectory =
    let words = 
        System.IO.Directory.GetFiles(alignedDirectory, "*.json") 
        |> Seq.collect( fun alignedFile -> 
            (alignedFile |> FixCCAlignedJSON).subtitles 
            |> Seq.collect( fun s -> s.words ) )
        |> Seq.toArray //materialize once; otherwise each pass below re-reads and re-parses every file
    let totalWords = words |> Seq.length |> float
    let recognizedWords = words |> Seq.sumBy( fun w -> w.recognised ) |> float
    //
    (recognizedWords/totalWords).ToString()
    
//Percent recognized; notebook output (crashes mono)
[
    "default" ; PercentRecognized "/y/south-park-1-to-20/ccalign-json-default";
    "audioWindow500" ; PercentRecognized "/y/south-park-1-to-20/ccalign-json-audioWindow500";
]
    

### CCAligner: Results

1. CCAligner's start/end at the subtitle level is not changed by audio parameters. So unless word timings are going to be used for alignment, CCAligner adds no value over SRT.
2. Listening to the wav in Audacity at the start/end points of the word alignments indicates they are usually OK **when the word is recognized**; even then, there is some clipping around words.
3. If the word is not recognized, the alignments are not good at all.
4. Percentage of words recognized
    - Using default settings on CCAligner (default + useBatchMode) gives 31% recognized words
    - Using audioWindow = 500 with useBatchMode gives 33% recognized words
    - It is not clear whether the 2-point difference between these settings is meaningful

**Overall, CCAligner seems marginally viable for South Park. It probably isn't better than using the SRT.**

## aeneas

[aeneas](https://www.readbeyond.it/aeneas/) was investigated to see if it improved performance relative to CCAligner.

Using aeneas required reformatting the srt file into a suitable plain text file.
The code below writes the text of a whole disc, one subtitle per line.
This file was also used in later evaluations of whole disc text.

In [None]:
//create plain text file from srt https://www.readbeyond.it/aeneas/docs/textfile.html#aeneas.textfile.TextFileFormat.PLAIN
let whiteSpaceRegex = System.Text.RegularExpressions.Regex("\s+")
let nonApostrophePunctRegex = System.Text.RegularExpressions.Regex("[^\w\s']")
let RemovePunctuation inputString =
    whiteSpaceRegex.Replace( nonApostrophePunctRegex.Replace( inputString, " "), " " ).Trim()
let tagRegex = System.Text.RegularExpressions.Regex("<.*?>") //only acceptable b/c our subtitle markup is so simplistic it does not require CFG parser
let RemoveTags inputString =
    whiteSpaceRegex.Replace( tagRegex.Replace( inputString, " "), " " ).Trim()

let lines =
    ("/y/south-park-1-to-20/1-1.srt" |> System.IO.File.ReadAllText).Trim().Split("\n\n")
    |> Seq.map(fun block ->
        let text =
            block.Split("\n")
            |> Seq.skip 2
            |> String.concat " "
            |> RemoveTags
        text
              )
System.IO.File.WriteAllLines("1-1.foraeneas", lines)

### aeneas: Whole Disc Example

In [None]:
python -m aeneas.tools.execute_task \
   /y/south-park-1-to-20/1-1.wav \
   1-1.foraeneas \
   "task_language=eng|os_task_file_format=json|is_text_type=plain" \
   1-1.aeneas.json

### aeneas: Whole Disc Results

aeneas lost alignment within a minute or two of the start of the whole disc.
The first minute (real speech, not character voices) had good alignment.
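One way to locate where alignment is lost is to compare aeneas fragment start times against the SRT start times. The sketch below makes two assumptions: the JSON sync map has a top-level `fragments` list with `begin`/`end` stored as strings of seconds, and fragments correspond one-to-one, in order, with the subtitles (true if the text file has one subtitle per line). The function name, tolerance, and sample values are invented for illustration.

```python
import json

def first_drift(aeneas_json, srt_starts_s, tolerance_s=2.0):
    """Return (index, aeneas_begin, srt_start) for the first fragment whose begin
    time differs from the SRT start by more than tolerance_s, or None."""
    fragments = json.loads(aeneas_json)["fragments"]
    for i, (frag, srt_start) in enumerate(zip(fragments, srt_starts_s)):
        begin = float(frag["begin"])  # assumed: times are strings of seconds
        if abs(begin - srt_start) > tolerance_s:
            return i, begin, srt_start
    return None

# Made-up sync map with three fragments; the third has drifted.
sample = json.dumps({"fragments": [
    {"id": "f000001", "begin": "5.600", "end": "6.640", "lines": ["AND NOW A FIRESIDE CHAT"]},
    {"id": "f000002", "begin": "6.640", "end": "8.700", "lines": ["WITH THE CREATORS"]},
    {"id": "f000003", "begin": "15.200", "end": "17.000", "lines": ["MATT STONE AND TREY PARKER"]},
]})
print(first_drift(sample, [5.61, 6.61, 8.61]))
```

Since the SRT times are themselves imperfect, the tolerance has to be loose enough to flag gross drift and not ordinary boundary disagreement.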

### aeneas: Clip Evaluation, Strict SRT

The first couple of utterances looked plausible, though, so the evaluation was repeated on shorter clips using the following methodology:

- Use the SRT to choose a clip for alignment, where a clip consists of multiple subtitles
- Extract a WAV file using the SRT clip boundaries
- Use the corresponding text of the clip
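The clip extraction step can be done with Python's standard `wave` module. The following is a minimal sketch, not the method actually used to produce the clips in this notebook; `extract_clip` is a hypothetical helper, and the demo runs on a synthetic silent file so that it does not depend on the real data.

```python
import os
import tempfile
import wave

def extract_clip(src_path, dst_path, start_ms, end_ms):
    """Copy the audio between start_ms and end_ms from src_path into dst_path."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        first = min(int(start_ms * rate / 1000), total)
        last = min(int(end_ms * rate / 1000), total)
        src.setpos(first)
        frames = src.readframes(max(last - first, 0))
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)

# Demo on a synthetic 2-second silent WAV (16 kHz, mono, 16-bit).
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "source.wav")
with wave.open(src_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 32000)

clip_path = os.path.join(tmp, "clip.wav")
extract_clip(src_path, clip_path, 500, 1500)
with wave.open(clip_path, "rb") as clip:
    print(clip.getnframes() / clip.getframerate())  # clip length in seconds
```

For a multi-subtitle clip, `start_ms` would be the start of the first subtitle and `end_ms` the end of the last one.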

The first text block evaluated was
```
AND NOW A FIRESIDE CHAT
WITH THE CREATORS OF COMEDY CENTRAL'S SOUTH PARK
MATT STONE AND TREY PARKER
```
The second block was
```
THEN I WAS LYING ON A TABLE
AND THESE SCARY ALIENS WANTED TO OPERATE ON ME.
AND THEY HAD BIG HEADS AND BIG BLACK EYES.
DUDE, VISITORS!
TOTALLY!
WHAT?
THAT WASN'T A DREAM, CARTMAN.
THOSE WERE VISITORS!
NO, IT WAS JUST A DREAM.
MY MOM SAID SO.
```

The third block was the same as the second but with one word per line.

In [7]:
python -m aeneas.tools.execute_task \
   1-1-clip1.wav \
   1-1-clip1.txt \
   "task_language=eng|os_task_file_format=json|is_text_type=plain" \
   1-1-clip1.json

python -m aeneas.tools.execute_task \
   1-1-clip2.wav \
   1-1-clip2.txt \
   "task_language=eng|os_task_file_format=json|is_text_type=plain" \
   1-1-clip2.json

python -m aeneas.tools.execute_task \
   1-1-clip2.wav \
   1-1-clip2words.txt \
   "task_language=eng|os_task_file_format=json|is_text_type=plain" \
   1-1-clip2words.json

[INFO] Validating config string (specify --skip-validator to bypass)...
[INFO] Validating config string... done
[INFO] Creating task...
[INFO] Creating task... done
[INFO] Executing task...
[INFO] Executing task... done
[INFO] Creating output sync map file...
[INFO] Creating output sync map file... done
[INFO] Created file '1-1-clip1.json'
[INFO] Validating config string (specify --skip-validator to bypass)...
[INFO] Validating config string... done
[INFO] Creating task...
[INFO] Creating task... done
[INFO] Executing task...
[INFO] Executing task... done
[INFO] Creating output sync map file...
[INFO] Creating output sync map file... done
[INFO] Created file '1-1-clip2.json'
[INFO] Validating config string (specify --skip-validator to bypass)...
[INFO] Validating config string... done
[INFO] Creating task...
[INFO] Creating task... done
[INFO] Executing task...
[INFO] Executing task... done
[INFO] Creating output sync map file...
[INFO] Creating output sync map file..

### aeneas: Clip Results, Strict SRT

- Clip 1: Reasonable for narrator speech. Not much different from the SRT boundaries, but a little tighter on the ends.
- Clip 2: Given the longer stretch 'THEN - EYES', aeneas does a better job of finding the end boundary than the SRT; however, the next turn is already off by one word.
- Clip 2words: Had different errors than Clip 2, but was similarly off.

Based on these results, a reasonable question is whether we can "pad" the SRT boundaries and find some words within them. 
We call this "loose" SRT.
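Padding the boundaries amounts to widening each SRT interval and clamping it to the extent of the audio. A small sketch follows; the helper name and the times are hypothetical.

```python
def pad_boundaries(start_ms, end_ms, pad_ms, audio_length_ms):
    """Widen an SRT interval by pad_ms on each side, clamped to the audio extent."""
    return max(0, start_ms - pad_ms), min(audio_length_ms, end_ms + pad_ms)

# Hypothetical subtitle times on a one-hour recording.
print(pad_boundaries(12000, 14500, 1000, 3600000))  # interval in the middle of the file
print(pad_boundaries(400, 2000, 1000, 3600000))     # interval near the start is clamped at 0
```

The padded interval then defines the WAV clip to extract, while the text stays restricted to the subtitles inside the original interval.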

### aeneas: Clip Evaluation, Loose SRT

The loose evaluation used just the following text, with the SRT times padded by +/- 1s:

```
DUDE, VISITORS!
TOTALLY!
WHAT?
```

And again, with SRT +/- 500ms

In [9]:
python -m aeneas.tools.execute_task \
   1-1-clip3.wav \
   1-1-clip3.txt \
   "task_language=eng|os_task_file_format=json|is_text_type=plain" \
   1-1-clip3.json

python -m aeneas.tools.execute_task \
   1-1-clip4.wav \
   1-1-clip3.txt \
   "task_language=eng|os_task_file_format=json|is_text_type=plain" \
   1-1-clip4.json

[INFO] Validating config string (specify --skip-validator to bypass)...
[INFO] Validating config string... done
[INFO] Creating task...
[INFO] Creating task... done
[INFO] Executing task...
[INFO] Executing task... done
[INFO] Creating output sync map file...
[INFO] Creating output sync map file... done
[INFO] Created file '1-1-clip3.json'
[INFO] Validating config string (specify --skip-validator to bypass)...
[INFO] Validating config string... done
[INFO] Creating task...
[INFO] Creating task... done
[INFO] Executing task...
[INFO] Executing task... done
[INFO] Creating output sync map file...
[INFO] Creating output sync map file... done
[INFO] Created file '1-1-clip4.json'


### aeneas: Clip Results, Loose SRT

- Clip 3: The alignment was essentially garbage
- Clip 4: Same
    
Looks like aeneas needs pretty clean audio to do the alignment.

**Since SRT does not give perfect boundaries, and since we can't align a whole file with aeneas, it doesn't seem that aeneas can be used.**

## eesen-transcriber

The [vagrant installation method](https://github.com/srvk/eesen-transcriber/blob/master/INSTALL.md) was used to simplify the installation.
However, even with vagrant, the installation was fairly complex and required tweaking of various scripts.
As a result, it's not clear if eesen was installed correctly, though spot checking suggests that the ASR was working correctly.

Example usage for alignment is `vagrant ssh -c "align.sh /vagrant/1-1.wav"` with the corresponding STM file in the same directory as the wav (i.e. the vagrant directory). The STM was created [using the SRT file](https://git.capio.ai/pub/srt-to-stm-converter).
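For reference, the SRT-to-STM conversion is simple enough to sketch directly. The version below is not the linked converter. It assumes the common STM line layout `recording channel speaker start end transcript` (with the optional label field omitted), and the `recording`, `channel`, and `speaker` defaults are placeholders. The sample subtitles are illustrative.

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+)")

def srt_time_to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:05,610 to seconds."""
    h, m, s, ms = (int(g) for g in TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def srt_to_stm(srt_text, recording="1-1", channel="1", speaker="A"):
    """Return one STM line per subtitle block; malformed blocks are skipped."""
    lines = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = block.splitlines()
        if len(rows) < 3 or "-->" not in rows[1]:
            continue
        start, end = (srt_time_to_seconds(t) for t in rows[1].split("-->"))
        text = " ".join(rows[2:]).strip()
        lines.append(f"{recording} {channel} {speaker} {start:.2f} {end:.2f} {text}")
    return lines

sample = """1
00:00:05,610 --> 00:00:06,610
AND NOW A FIRESIDE CHAT

2
00:00:06,610 --> 00:00:08,610
WITH THE CREATORS OF
COMEDY CENTRAL'S SOUTH PARK
"""
print("\n".join(srt_to_stm(sample)))
```

Whether eesen's `align.sh` needs additional fields or a particular speaker naming scheme was not verified here.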

### eesen-transcriber: Whole disc results

Even the beginning was significantly shifted in time:

```
1-1-A---0005.610-0006.610 1 5.61 0.06 now
1-1-A---0005.610-0006.610 1 5.67 0.00 a
1-1-A---0005.610-0006.610 1 5.67 0.00 fireside
1-1-A---0005.610-0006.610 1 5.67 0.93 chat
1-1-A---0006.610-0008.610 1 6.61 0.03 the
1-1-A---0006.610-0008.610 1 6.64 0.06 creators
```

When the kids start speaking, the alignments get very spotty:

```
1-1-A---0178.280-0179.780 1 178.31 0.00 my
1-1-A---0178.280-0179.780 1 178.31 0.21 little
1-1-A---0178.280-0179.780 1 178.52 1.26 brother's
1-1-A---0180.780-0183.280 1 180.78 0.06 <unk>
1-1-A---0180.780-0183.280 1 180.84 0.00 <unk>
1-1-A---0184.280-0186.780 1 184.28 0.09 he
1-1-A---0184.280-0186.780 1 184.37 0.33 <unk>
1-1-A---0184.280-0186.780 1 184.70 0.57 he
1-1-A---0184.280-0186.780 1 185.27 0.72 <unk>
1-1-A---0186.790-0188.790 1 186.79 0.84 <unk>
1-1-A---0188.790-0189.790 1 188.79 0.03 don't
1-1-A---0188.790-0189.790 1 188.82 0.15 call
1-1-A---0188.790-0189.790 1 188.97 0.09 my
1-1-A---0188.790-0189.790 1 189.06 0.60 brother
```

Overall, eesen-transcriber does not seem viable for this project.
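The spottiness can be quantified by counting `<unk>` tokens per utterance in the alignment output. The sketch below assumes each line has the five whitespace-separated fields seen above (utterance id, channel, start, duration, word); the function name is made up, and the sample is copied from the output shown in this section.

```python
from collections import Counter

def unknown_rate(ctm_text):
    """Return {utterance_id: fraction of words aligned as <unk>}."""
    total, unknown = Counter(), Counter()
    for line in ctm_text.strip().splitlines():
        utt, _channel, _start, _duration, word = line.split()
        total[utt] += 1
        if word == "<unk>":
            unknown[utt] += 1
    return {utt: unknown[utt] / total[utt] for utt in total}

sample = """
1-1-A---0178.280-0179.780 1 178.31 0.00 my
1-1-A---0178.280-0179.780 1 178.31 0.21 little
1-1-A---0178.280-0179.780 1 178.52 1.26 brother's
1-1-A---0180.780-0183.280 1 180.78 0.06 <unk>
1-1-A---0180.780-0183.280 1 180.84 0.00 <unk>
1-1-A---0184.280-0186.780 1 184.28 0.09 he
1-1-A---0184.280-0186.780 1 184.37 0.33 <unk>
1-1-A---0184.280-0186.780 1 184.70 0.57 he
1-1-A---0184.280-0186.780 1 185.27 0.72 <unk>
"""
for utt, rate in unknown_rate(sample).items():
    print(utt, rate)
```

A similar pass could also count zero-duration words, which appear even in the narrator section at the start of the disc.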

## Gentle

The [docker installation method](https://github.com/lowerquality/gentle) was used to reduce the effort of installation, but the run instructions had to be adapted to: `docker run -p 8765:8765  lowerquality/gentle`

Example usage is `curl -F "audio=@audio.mp3" -F "transcript=@words.txt" "http://0.0.0.0:8765/transcriptions?async=false"`

### Gentle: parameters

Additional parameters are `?async=false&disfluency=true&conservative=true`

The meanings of these parameters appear to be

> Use the given token sequence to make a bigram language model
>    in OpenFST plain text format.
>    When the "conservative" flag is set, an [oov] is interleaved
>    between successive words.
>    When the "disfluency" flag is set, a small set of disfluencies is
>    interleaved between successive words
>    `Word sequence` is a list of lists, each valid as a start

### Gentle: Whole disc example

In [None]:
date
curl -F "audio=@/y/south-park-1-to-20/1-1.wav" -F "transcript=@1-1.foraeneas" "http://0.0.0.0:8765/transcriptions?async=false&disfluency=true&conservative=true" -o 1-1.gentle.json
date

### Gentle: Whole Disc Results

Gentle performed very well. Spot checking showed alignment was still good at 0:23, 0:59, 1:14, 1:30, and 1:33 (h:mm) into the disc.
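Spot checks like these can be partly automated from the JSON output. The sketch below assumes Gentle's output has a top-level `words` list in which successfully aligned words carry `"case": "success"` plus `start`/`end` in seconds; the helper names and the sample values are invented for illustration.

```python
import json

def summarize_gentle(gentle_json):
    """Return (number of aligned words, total number of words)."""
    words = json.loads(gentle_json)["words"]
    aligned = [w for w in words if w.get("case") == "success"]
    return len(aligned), len(words)

def words_near(gentle_json, time_s, window_s=2.0):
    """Return the aligned words whose start time is within window_s of time_s."""
    words = json.loads(gentle_json)["words"]
    return [w["word"] for w in words
            if w.get("case") == "success" and abs(w["start"] - time_s) <= window_s]

# Made-up output fragment around the 23-minute mark (1380 s).
sample = json.dumps({"transcript": "DUDE, VISITORS! TOTALLY! WHAT?", "words": [
    {"word": "DUDE", "case": "success", "start": 1380.12, "end": 1380.45},
    {"word": "VISITORS", "case": "success", "start": 1380.50, "end": 1381.10},
    {"word": "TOTALLY", "case": "not-found-in-audio"},
    {"word": "WHAT", "case": "success", "start": 1383.00, "end": 1383.20},
]})
print(summarize_gentle(sample))
print(words_near(sample, 1380.0))
```

`words_near` gives the words to listen for when jumping to a given time in Audacity, and the aligned/total ratio gives a rough overall figure comparable to the CCAligner recognition percentages.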

## Overall Results

**Based on these results, it seems plausible to use Gentle for alignment at the whole disc level.**

It does not appear necessary to manually check alignments if Gentle is used.
However, tools that facilitate manual checking, such as [finetuneas](https://github.com/ozdefir/finetuneas), do exist.