# Transcription using OpenAI Whisper
Documentation on how to install Whisper is on the [OpenAI GitHub Page](https://github.com/openai/whisper) | [Colab example](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb#scrollTo=-YcRU5jqNqo2)

In [None]:
%load_ext autoreload
%autoreload 2

## Installing Whisper

In [None]:
#!sudo apt install ffmpeg mediainfo sox libsox-fmt-mp3
!brew install ffmpeg
!brew install sox
!brew install mediainfo
!pip install audiofile
!pip install setuptools-rust

In [12]:
!pip install git+https://github.com/openai/whisper.git 

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /private/var/folders/kr/m6yb8cr154dcdkxnm75b4clw0000gn/T/pip-req-build-v8ztzydv
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /private/var/folders/kr/m6yb8cr154dcdkxnm75b4clw0000gn/T/pip-req-build-v8ztzydv
  Resolved https://github.com/openai/whisper.git to commit 9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d
  Preparing metadata (setup.py) ... [?25ldone
Collecting more-itertools
  Using cached more_itertools-9.0.0-py3-none-any.whl (52 kB)
Collecting transformers>=4.19.0
  Using cached transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting ffmpeg-python==0.2.0
  Using cached ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2.tar.gz (359 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m359.1/359.1 kB[0m [31m971.5 kB/s[0m eta [36m0:00:00[0ma [36m0:00

`ffmpeg` can be used to [extract audio from the video](https://stackoverflow.com/questions/9913032/how-can-i-extract-audio-from-video-with-ffmpeg): 
```
ffmpeg -i ~/Downloads/tmp/bkt-pilot-221103.mov -ss 00:00:00 -t 00:00:45.0 -q:a 0 -map a ~/Downloads/tmp/1minaudio.mp3
```

In our case, `whisper` can be called from the command line, with the output:
```bash
(base) eliot@Eliots-MBP aclpubcheck % whisper ~/Downloads/tmp/1minaudio.mp3 --language French
100%|████████████████████████████████████████| 461M/461M [09:57<00:00, 809kiB/s]
/Users/eliot/miniconda3/lib/python3.9/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:13.320]  Donc qu'est ce que tu as comme module comme petit carré? Il y a une horloge à côté de l'horloge
[00:13.320 --> 00:24.840]  et il y a une espèce de balise avec un chronomètre et il y a un bouton annulé il y a quatre
[00:24.840 --> 00:30.840]  boutons rouge bleu vert et jeune rouge bleu vert jeune on va aller là dessus je clique sur cela
[00:30.840 --> 00:41.400]  alors ça s'appelle un saimon saimon 16 donc un des quatre boutons va s'allumer oui voilà quel
[00:41.400 --> 01:11.400]  bouton s'allumer et jeune dans ce cas
```

Downsides:
* Loose audio boundaries
* No distinctions between channels

Steps need to be taken to split between channels and align the transcription to IPUs/Words.

## Libraries

In [None]:
!pip install audiofile
!pip install textgrid
!pip install ffmpeg

In [13]:
import os,sys
import shutil
import re
import json
import numpy as np
import pandas as pd
import audiofile
from tqdm import tqdm
import time
import datetime
from glob import glob
import textgrid
import shutil

import seaborn as sns
import matplotlib.pyplot as plt
import IPython

# google timeout
import requests
from requests import ConnectionError
import xml.etree.ElementTree as ET
import ffmpeg

import logging
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

# SPPAS
SPPAS_PATH = "/Users/eliot/Documents/tools/SPPAS"
sys.path.append(SPPAS_PATH)
# reading / writing textgrids
import sppas.src.anndata.aio.readwrite as spp
import sppas.src.anndata as sad
# searching for IPUs
import sppas
from sppas.src.annotations import sppasParam, sppasAnnotationsManager
from sppas.src.plugins import sppasPluginsManager

## Calling the model (Python)
* [Audio resampling in Python](https://github.com/jonashaag/audio-resampling-in-python)
* Paths with SPPAS: counts relatively from where the software is stored, no absolute paths
* `RuntimeError: "slow_conv2d_cpu" not implemented for 'Half' #92` for [whisper](https://github.com/openai/whisper/discussions/92)

In [14]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

#from tqdm.notebook import tqdm

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [42]:
def get_mel(audio, device=DEVICE):
    audio = whisper.pad_or_trim(audio.flatten()).to(device) 
    # docstring: whisper.pad_or_trim(array, length: int = 480000, *, axis: int = -1)
    mel = whisper.log_mel_spectrogram(audio)
    return mel

def resample_audio(audio, orig_fs:int, target_fq:int=16000, device=DEVICE):
    if not isinstance(audio, torch.Tensor):
        audio = torch.Tensor(audio, device=device)
    # resample
    rs = torchaudio.transforms.Resample(orig_freq=orig_fs, new_freq=target_fq)
    return rs(audio)

class AudioFileDataset(torch.utils.data.Dataset):
    def __init__(self, filepath, device=DEVICE, fs_th:int=16000, cleanup_after_splitting:bool=False):
        self.filepath = filepath if filepath[0] != '~' else os.path.expanduser(filepath)
        self.audio_duration = audiofile.duration(filepath)
        self.audio, self.fs = audiofile.read(filepath, always_2d=True)
        self.nb_channels = audiofile.channels(filepath)
        self.save_folder = os.path.realpath(os.path.dirname(self.filepath))
        self.filename = self.filepath.split('/')[-1]
        self.device = device
        self.cleanup_onechannel = cleanup_after_splitting
        print(f"File {filepath}: {self.nb_channels} channels, sampling frequency = {self.fs}Hz, {self.audio_duration}s")
        print('Creating dataset...', end=' ')
        if self.fs != fs_th:
            self._resample(target_fq=fs_th)
        self.ipus_bounds = self._searchipus()
        self._create_dataset()
        self.datasets = {'base':self.dataset}
        print('done.')

    def _resample(self, target_fq:int=16000):
        # to tensor + resample
        self.or_audio = self.audio
        self.audio = resample_audio(self.audio, self.fs, target_fq, device=self.device)

    def _searchipus(self):
        """For each channel, locate IPUs using SPPAS. 
        1. SPPAS creates new files 
        2. read from those files then delete them
        """
        # SPPAS - activate
        spass_log=f"searchipus-{datetime.datetime.now().strftime('%y%m%d-%H:%M:%S')}-{self.filepath}"
        actions = ['searchipus']
        parameters = sppasParam([f"{x}.json" for x in actions])
        for x in actions:
            ann_step_idx = parameters.activate_annotation(x)
            ann_options = parameters.get_options(ann_step_idx)

        # Files - for each channel
        for c in range(self.nb_channels):
            # extract sound and create a new file
            ns = self.audio[c,:]
            ns_path = os.path.join(self.save_folder, f"{self.filename[:-4]}_mono_{c}.wav")
            audiofile.write(ns_path, ns, self.fs)
            parameters.add_to_workspace(ns_path)
    
        # SPPAS - Fix the output file extension and others
        parameters.set_lang("fra")
        parameters.set_output_extension('.TextGrid', "ANNOT")
        parameters.set_report_filename(spass_log)
        # SPPAS - Execute pipeline
        process = sppasAnnotationsManager()
        process.annotate(parameters)

        # Files - read each file to dataset
        ipus_bounds = []
        for c in range(self.nb_channels):
            # use relpath to get the relative path from SPPAS to target folder
            # uses os.getcwd() to compute path if given in relative - os.path.expanduser() necessited
            #self.save_folder = os.path.relpath(os.path.dirname(self.filepath), start=SPPAS_PATH)
            ns_path = os.path.join(self.save_folder, f"{self.filename[:-4]}_mono_{c}.wav")
            tg_path = f'{ns_path[:-4]}-ipus.TextGrid'
            tg = textgrid.TextGrid.fromFile(tg_path)
            for t in tg[0]:
                if t.mark not in ["#",""]: # check that
                    ipus_bounds.append({
                        'channel': c, 'start': t.minTime, 'stop': t.maxTime,
                    })
            # Files - cleanup sound and textgrid
            if self.cleanup_onechannel:
                os.remove(ns_path)
                os.remove(tg_path)
        
        return pd.DataFrame(ipus_bounds)

    def _create_dataset(self, fs:int=16000): # fs set by whisper
        # from bounds + signal + fs, extract signal for each ipu
        dataset = {}
        for idx, row in tqdm(self.ipus_bounds.iterrows()):
            dataset[idx] = [row.start, row.stop, int(row.channel), self.audio[int(row.channel), int(row.start*fs):int(row.stop*fs)]]
        self.dataset = dataset

    def _create_long_dataset(self, maximum_pause_duration:float=3, fs:int=16000, **kwargs):
        self.ipus_bounds['duration'] = self.ipus_bounds.stop - self.ipus_bounds.start
        self.ipus_bounds['pause'] = (self.ipus_bounds.start.shift(-1) - self.ipus_bounds.stop).fillna(0.) # last value
        # aggregate - need to keep it shorter than 300s
        self.ipus_bounds['new_line'] = self.ipus_bounds['pause'] > maximum_pause_duration
        self.ipus_bounds['new_line_idx'] = self.ipus_bounds['new_line'].cumsum()
        ipu_concat = self.ipus_bounds.groupby(['channel','new_line_idx']).agg({ 'start': 'min', 'stop':'max' }).reset_index(drop=False)
        # call cataset - in other variable
        dataset = {}
        for idx, row in tqdm(ipu_concat.iterrows()):
            dataset[idx] = [row.start, row.stop, int(row.channel), self.audio[int(row.channel), int(row.start*fs):int(row.stop*fs)]]
        self.datasets[f'long_{maximum_pause_duration}'] = dataset

    def set_dataset(self, mode:str, **kwargs):
        if (mode not in self.datasets) and (mode != 'long'):
            raise ValueError('`mode` should be in ["base","long"]')
        if mode in self.datasets:
            print(f'Switching to {mode}')
            self.dataset = self.datasets[mode]
            self.mode = mode
        elif mode == "long":
            self._create_long_dataset(**kwargs)
            mode = f"long_{kwargs.get('maximum_pause_duration',3) }"
            print(f'Switching to concatenated IPUs - {mode}')
            self.dataset = self.datasets[mode]
            self.mode = mode

    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, item):
        start, stop, channel, audio = self.dataset[item]
        if not isinstance(audio, torch.Tensor):
            audio = torch.Tensor(audio, device=self.device)
        mel = get_mel(audio, device=self.device)
        return start, stop, channel, mel

In [43]:
filepath = "~/Downloads/tmp/bkt-pilot-221103-audio.wav"
dataset = AudioFileDataset(filepath)
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

File ~/Downloads/tmp/bkt-pilot-221103-audio.wav: 2 channels, sampling frequency = 16000Hz, 2703.68s
Creating dataset... 

1237it [00:00, 23680.95it/s]

done.





In [42]:
model = whisper.load_model("base")
options = whisper.DecodingOptions(language="fr", without_timestamps=False, fp16 = False)

In [43]:
start, stop, channel, mels = dataset[0]
results = model.decode(mels, options)

In [60]:
transcr = {'start':[], 'stop':[], 'channel':[], 'text': []}
for start, stop, channel, mels in tqdm(loader):
    results = model.decode(mels, options)
    ### nope batch
    # if 'text' in dir(results):
    #     text = results.text
    # else:
    #     text = ' '.join([result.text for result in results])
    # transcr.append({ 'channel': channel, 'start':start, 'stop':stop, 'text': text })
    ### batch >> extend
    for k,v in zip(['start','stop','channel','text'],[start, stop, channel, [result.text for result in results]]):
        if not isinstance(v,list):
            v = v.tolist()
        transcr[k].extend(v)

transcr = pd.DataFrame(transcr)
#transcr['start'] = transcr.start.apply(lambda x: x.item())

100%|██████████| 40/40 [12:22<00:00, 18.57s/it]


In [73]:
transcr.head()

Unnamed: 0,start,stop,channel,text
0,1.375,2.215,0,ok
1,2.715,5.06,0,"donc, qui se tracent à mouge?"
2,6.19,10.355,0,carré. Il y a une horloge.
3,10.975,11.985,0,à cause de la tronte.
4,12.205,13.065,0,...


Results aren't very good 
* option 1: try with model other than `'base'` 
* option 2: try with longer audios
* option 3: issue with resampling that needs to be done before

### Option 1: with heavier model

In [74]:
# available models: tiny base small medium large
# default setting from command line: using 'small' model
model = whisper.load_model("medium")


100%|█████████████████████████████████████| 1.42G/1.42G [24:33<00:00, 1.04MiB/s]


In [75]:
t2 = {'start':[], 'stop':[], 'channel':[], 'text': []}
for start, stop, channel, mels in tqdm(loader):
    results = model.decode(mels, options)
    for k,v in zip(['start','stop','channel','text'],[start, stop, channel, [result.text for result in results]]):
        if not isinstance(v,list):
            v = v.tolist()
        t2[k].extend(v)

t2 = pd.DataFrame(transcr)
t2.head()

100%|██████████| 40/40 [1:59:45<00:00, 179.64s/it]  


Unnamed: 0,start,stop,channel,text
0,1.375,2.215,0,ok
1,2.715,5.06,0,"donc, qui se tracent à mouge?"
2,6.19,10.355,0,carré. Il y a une horloge.
3,10.975,11.985,0,à cause de la tronte.
4,12.205,13.065,0,...


### Option 2: with longer extracts
* Either configuring SPPAS differently for the IPUs
* Or using longer audio sequences in the model (max: 480000 / 16000 = 300 seconds)

Whisper can be used directly to transcribe. [From Doc] Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

Configuring SPPAS differently:

In [76]:
#spass_log=f"searchipus-{datetime.datetime.now().strftime('%y%m%d-%H:%M:%S')}-{self.filepath}"
actions = ['searchipus']
parameters = sppasParam([f"{x}.json" for x in actions])
for x in actions:
    ann_step_idx = parameters.activate_annotation(x)
    ann_options = parameters.get_options(ann_step_idx)

In [78]:
ann_options[0]

<sppas.src.structs.baseoption.sppasOption at 0x13a0c6d00>

Using longer sequences of audio (using window):

In [92]:
audio_duration = audiofile.duration(filepath)
audio, fs = audiofile.read(filepath, always_2d=True)
nb_channels = audiofile.channels(filepath)
audio = resample_audio(audio, fs)
audio.shape

torch.Size([2, 43259200])

In [106]:
def predict_extract(model, audio, s_start:float=0., s_stop:float=300., channel:int=0, 
            device=DEVICE, model_options=whisper.DecodingOptions()):
    fs = 16000
    audio = audio[channel, int(s_start*fs):int(s_stop*fs)]
    mel = get_mel(audio, device)
    results = model.decode(mel, model_options)
    return results

In [None]:
res = predict_extract(model, audio, model_options=options)

In [110]:
res

DecodingResult(audio_features=tensor([[-1.2868e-01,  7.4106e-01, -1.8666e-01,  ...,  3.0302e-01,
          3.1401e-01,  1.4705e+00],
        [ 2.8794e-02,  4.5873e-01,  3.7895e-01,  ...,  3.5652e-03,
         -1.7153e-01,  8.3642e-01],
        [ 1.8124e-01,  2.4811e-02,  2.5020e-01,  ..., -3.3304e-01,
         -4.4838e-01,  2.3060e-01],
        ...,
        [ 3.0571e-01,  3.7052e-01,  3.0373e-01,  ...,  3.4993e-01,
         -3.4388e-01,  1.5461e+00],
        [-4.0124e-01,  8.9956e-01,  8.0569e-01,  ...,  2.0286e+00,
         -1.8074e-03,  1.3847e+00],
        [-1.1076e+00,  5.8126e-01,  1.3918e+00,  ...,  1.7550e+00,
         -9.9986e-01,  1.0190e+00]]), language='fr', language_probs=None, tokens=[50364, 7477, 421, 6, 377, 1769, 631, 2604, 382, 5173, 10088, 11, 5173, 9686, 1032, 10521, 2506, 50688, 50688, 4416, 288, 257, 2251, 2569, 752, 432, 1531, 18437, 368, 287, 6, 2335, 752, 432, 1030, 1930, 288, 257, 2251, 7089, 30236, 368, 3119, 908, 4163, 517, 19393, 298, 1462, 3599, 51508, 5150

In [115]:
result = model.transcribe(audio[1, 0:(60*16000)], fp16=False)
print(result["text"])

 Donc qu'est ce que tu as comme module, comme petit carré? Il y a une horloge à côté de l'horloge et il y a une espèce de balise avec un chronomètre et il y a un bouton annulé et il y a quatre boutons rouge, bleu, vert et jaune. Rouges, bleu, vert, jaune, on va aller là-dessus. Tu cliques sur cela. Ça s'appelle un Simon, Simon Says. Donc un des quatre boutons va s'allumer. Oui. Voilà. Quel bouton s'allumer? Jaune. Dans ce cas, déjà est-ce que tu vois le numéro de série ou pas? Un numéro. Non. Non. Clique sur vert. Est-ce que ça t'a mis quelque chose?


Using longer sequences of audio (using IPU):

In [147]:
dataset = AudioFileDataset(filepath)
dataset.set_dataset("long")
print(len(dataset.dataset))
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

File ~/Downloads/tmp/bkt-pilot-221103-audio.wav: 2 channels, sampling frequency = 16000Hz, 2703.68s
Creating dataset... 

1237it [00:00, 32883.89it/s]


done.


284it [00:00, 33185.19it/s]

Switching to concatenated IPUs - long_3
284





In [149]:
t3 = {'start':[], 'stop':[], 'channel':[], 'text': []}
for start, stop, channel, mels in tqdm(loader):
    results = model.decode(mels, options)
    for k,v in zip(['start','stop','channel','text'],[start, stop, channel, [result.text for result in results]]):
        if not isinstance(v,list):
            v = v.tolist()
        t3[k].extend(v)

t3 = pd.DataFrame(t3)
t3.head()

100%|██████████| 18/18 [1:17:21<00:00, 257.86s/it]


Unnamed: 0,start,stop,channel,text
0,8.185,12.365,0,"Il y a une horloge, il est 10h31."
1,13.9,22.67,0,et il a une espèce de balise avec un chronomètre.
2,22.895,27.51,0,Il y a un bouton annulé. Il y a quatre boutons...
3,42.555,42.985,0,Et je...
4,50.195,96.0,0,Euh... Non. Non. Euh... Clique sur vert. Est-c...


Conclusion: longer sequences required.

## Writing to TextGrid

In [150]:
transcr.to_csv("~/Downloads/tmp/bkt-pilot-221103-ipu-base.csv", index=False)
t2.to_csv("~/Downloads/tmp/bkt-pilot-221103-ipu-medium.csv", index=False)
t3.to_csv("~/Downloads/tmp/bkt-pilot-221103-agg-medium.csv", index=False)

In [10]:
t3 = pd.read_csv("~/Downloads/tmp/bkt-pilot-221103-agg-medium.csv")
t3.head()

Unnamed: 0,start,stop,channel,text
0,8.185,12.365,0,"Il y a une horloge, il est 10h31."
1,13.9,22.67,0,et il a une espèce de balise avec un chronomètre.
2,22.895,27.51,0,Il y a un bouton annulé. Il y a quatre boutons...
3,42.555,42.985,0,Et je...
4,50.195,96.0,0,Euh... Non. Non. Euh... Clique sur vert. Est-c...


#### Functions

In [139]:
#%% ---------- READING ----------
def read_tier(tg_path:str, tier_name=None) -> pd.DataFrame:
    """Read tier and return as DataFrame.
    If no tier is given, returns the first tier.
    DataFrame includes a column for the source tier.
    """
    tg = textgrid.TextGrid.fromFile(tg_path)
    # get tier names
    tg_tiers = {tg[i].name : i for i in range(len(tg))}
    if tier_name is None: 
        print(f"Reading first tier from file: {tg[0].name}")
        tier_name = [tg[0].name]
    elif isinstance(tier_name,str):
        tier_name = [tier_name]
    # read from tier
    dialogs = []
    for tname in tier_name:
        tn = tg_tiers[tname]
        tier = tg[tn]
        for t in tier:
            if t.mark not in ["#",""]: # check that
                dialogs.append({
                    'file': tg_path.split('/')[-1],
                    'tier': tname,
                    'start': t.minTime,
                    'stop': t.maxTime,
                    'text': t.mark
                })
    # return
    return pd.DataFrame(dialogs)

#%% ---------- WRITING -----------
def write_tier(df:pd.DataFrame, file_name:str, 
        example_file:str='/Users/eliot/Documents/projects/multimodal-grounding/multimodalgrounding/utils/example.TextGrid', 
        annot_tier:str='TRS', 
        text_col:str='text', timestart_col:str='start', timestop_col:str='stop',
        file_duration:float=None, overwrite:bool=True, **kwargs):
    """Write data from a dataframe into a TextGrid file in intervals
    """ 
    if os.path.exists(file_name):
        tg = textgrid.TextGrid.fromFile(file_name)
        tg_tiers = {tg[i].name : i for i in range(len(tg))} 
        if annot_tier in tg_tiers and overwrite:
            print('Tier exists. Overwriting.')
        elif (annot_tier in tg_tiers):
            print('Tier exists. Changing tier name.')
    # Check if there are no ipus overlapping themselves
    overlaps = ((df[timestart_col] - df[timestop_col].shift()).dropna() < 0).sum()
    if overlaps > 0:
        raise IndexError("Overlaps between several speakers exist in this DataFrame.")
    # Add silence rows in DataFrame
    stops = df[timestart_col].iloc[1:].tolist()
    starts = df[timestop_col].iloc[:-1].tolist()
    if df[timestart_col].iloc[0] > 0:
        stops = [df[timestart_col].iloc[0]] + stops
        starts = [0.0] + starts 
    if file_duration is not None:
        stops.append(file_duration)
        starts.append(df[timestop_col].iloc[-1])
        print(df[timestop_col].iloc[-1])
    df_sil = pd.DataFrame({timestart_col: starts, timestop_col: stops})
    df_sil[text_col] = "#"
    df = pd.concat([df, df_sil], axis=0).sort_values(by=[timestart_col]).reset_index(drop=True)
    # Create / Read file
    if os.path.exists(file_name):
        parser = spp.sppasTrsRW(file_name)
        tier_list = parser.read()
    elif not os.path.exists(example_file):
        raise ValueError(f'`example_file` argument must point to an existing file.')
    else:
        parser = spp.sppasTrsRW(example_file)
        tier_list = parser.create_trs_from_extension(example_file)
    # Create annotations
    tier = tier_list.create_tier(annot_tier)
    # Sequentially add rows
    for _,row in tqdm(df.iterrows()):
        if row[timestart_col] < row[timestop_col]: # rows throwing errors
            # Add row
            interval = sad.sppasInterval(sad.sppasPoint(row[timestart_col],0.0), sad.sppasPoint(row[timestop_col], 0.0))
            tier.create_annotation(sad.sppasLocation(interval), sad.sppasLabel(sad.sppasTag(row[text_col])))
        else:
            print("\nError printing the following row:")
            print(row)

    parser.set_filename(file_name)
    parser.write(tier_list)


def write_tiers(df:pd.DataFrame, file_name:str, tier_col:str, tier_values:list=None, **kwargs) ->None:
    """Loop over tiers to write in the same file
    """
    tier_values = tier_values if tier_values is not None else df[tier_col].unique()
    for tier_name in tier_values:
        write_tier(df[df[tier_col] == tier_name], file_name, annot_tier=tier_name, **kwargs)

#### Calling

In [46]:
for c in [0,1]:
    # saving in different files to test alignment
    write_tier(t3[t3.channel == c], file_name=os.path.join(dataset.save_folder, f"bkt-pilot-221103-{c}.TextGrid"), 
            annot_tier=f'spk{c}', text_column='text', timestart_col='start', timestop_col='stop',
            file_duration=dataset.audio_duration)

155it [00:00, 7323.78it/s]
129it [00:00, 9413.92it/s]


In [45]:
dataset.save_folder

'/Users/eliot/Downloads/tmp'

## Aligning

### Using SPPAS
* Pros: offline
* Cons: need parametrizing to use 'Fill in IPUs'; other pipelines can't be used without using Tokenisation, which requires IPUs / orthographic correction

### Using BAS
From the [BAS Documentation Page](https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/help), several pipelines are available (with -- those that interest us the most):
```
-- runPipelineWithASR       runCOALA                runGetVersion
runCOALAGetTemplates        -- runASR               -- runPipeline
runTTS                      runSubtitle             runTextEnhance
runDoReCo                   runSpeakDiar            runVoiceActivityDetection
runChunkPreparation         runMAUSGetInventar      runASRGetQuota
runMINNI                    runMAUS                 getLoadIndicatorXML
runTextAlign                runMAUSBasic            -- runPho2Syl
getLoadIndicator            runAnnotConv            runFormantAnalysis
runChunker                  runMAUSGetHelp          runChannelSeparator
runTTSFile                  runAudioEnhance         runEMUMagic
runG2P                      runGetVersion           runAnonymizer
```
Calling those pipelines can be done using CURL:
```bash
curl -v -X POST -H 'content-type: multipart/form-data' -F com=yes -F INSKANTEXTGRID=true -F USETEXTENHANCE=true -F TARGETRATE=100000 -F TEXT=@/Users/neako/Documents/Cours-MasCo/PhD/multimodalgrounding/data/EPSN/extract-transcribe/080101-000-freeConv_mono_1.TextGrid -F NOISE=0 -F PIPE=G2P_CHUNKER_MAUS_PHO2SYL -F aligner=hirschberg -F NOISEPROFILE=0 -F speakNumber=0 -F ASIGNAL=brownNoise -F NORM=true -F mauschunking=false -F INSORTTEXTGRID=true -F WEIGHT=default -F minanchorlength=3 -F LANGUAGE=eng-US -F USEAUDIOENHANCE=true -F maxlength=0 -F KEEP=false -F preference=-2.97 -F nrm=no -F LOWF=0 -F WHITESPACE_REPLACEMENT=_ -F marker=punct -F USEREMAIL=maesneako@gmail.com -F boost=true -F MINPAUSLEN=5 -F forcechunking=false -F NOINITIALFINALSILENCE=false -F minVoicedLength=200 -F InputTierName=google-annot -F OUTFORMAT=TextGrid -F syl=no -F ENDWORD=999999 -F minSilenceLength=200 -F wsync=yes -F UTTERANCELEVEL=false -F featset=standard -F INSPROB=0.0 -F OUTSYMBOL=x-sampa -F minchunkduration=15 -F SIGNAL=@/Users/neako/Documents/Cours-MasCo/PhD/multimodalgrounding/data/EPSN/extract-transcribe/080101-000-freeConv_mono_1.wav -F stress=no -F MODUS=default -F RELAXMINDUR=false -F RELAXMINDURTHREE=false -F STARTWORD=0 -F INSYMBOL=sampa -F PRESEG=false -F AWORD=ANONYMIZED -F USETRN=false -F MAUSSHIFT=default -F HIGHF=0 -F silenceonly=0 -F boost_minanchorlength=4 -F ADDSEGPROB=false 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runPipeline'
```

In [88]:
with open('bas-pipeline-config.json','r') as f: 
    gen_config = json.loads(f.read())

# transform to dict and add additional parameters
spe_config = {
    "InputTierName":'spk1',
    "USEREMAIL":"maesneako@gmail.com",
#    "SIGNAL":@/Users/neako/Documents/Cours-MasCo/PhD/multimodalgrounding/data/EPSN/extract-transcribe/080101-000-freeConv_mono_1.wav,
#    "TEXT":@/Users/neako/Documents/Cours-MasCo/PhD/multimodalgrounding/data/EPSN/extract-transcribe/080101-000-freeConv_mono_1.TextGrid,
    "LANGUAGE":"fra-FR", # accepted languages: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, nld-NL-GN, nld-NL, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, swe-SE, tha-TH, guf-AU]
}
config = dict(gen_config, **spe_config)

files = {
    "SIGNAL": open("/Users/eliot/Downloads/tmp/bkt-pilot-221103-audio_mono_1_compressed.wav", "rb"),
    "TEXT": open("/Users/eliot/Downloads/tmp/bkt-pilot-221103-1.TextGrid", "rb"),
}

In [89]:
SOURCE_URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runPipeline"
HEADER = {}
response = requests.post(SOURCE_URL, files=files, data=config, headers = HEADER)

In [98]:
response.status_code

200

In [94]:
response.text



In [104]:
root = ET.fromstring(response.text)
root[0].text

'true'

In [96]:
file_url = root[1].text #['WebServiceResponseLink']['downloadLink']
dl_file = requests.get(file_url)
open("test.TextGrid", "wb").write(dl_file.content)

3600401

Sending compressed audio:

In [None]:
ffmpeg.output(ffmpeg.input("/Users/eliot/Downloads/tmp/bkt-pilot-221103-audio_mono_1.wav").audio, 
                "/Users/eliot/Downloads/tmp/bkt-pilot-221103-audio_mono_1_compressed.wav",
                #*['sameq'],
                ).overwrite_output().run()

As a class running the operations:

In [136]:
sys.path.append('../src')
from bas_pipeline_analysis import AlignTranscription
from textgrid_utils import read_tier, write_tier

In [111]:
at = AlignTranscription(audio_path="/Users/eliot/Downloads/tmp/bkt-pilot-221103-audio_mono_1.wav",
    transcription_path="/Users/eliot/Downloads/tmp/bkt-pilot-221103-1.TextGrid", 
    transcription_tier="spk1", 
    lg='fra-FR', bas_option_path='bas-pipeline-config.json')

In [None]:
at.run_pipeline(compress=True) # error writing final df

In [115]:
at.aligned_transcription

'/Users/eliot/Downloads/tmp/bkt-pilot-221103-1-bas.TextGrid'

In [140]:
outfile = at.aligned_transcription.replace('bas','bas-ipus')
# read file
words = read_tier(at.aligned_transcription, tier_name="ORT-MAU")
# create ipus - columns start, stop
words['pause_duration'] = (words['start'] - words['stop'].shift()).fillna(0.) > 0.3
words['ipu_id'] = words['pause_duration'].cumsum()
ipus = words.groupby('ipu_id').agg({
    'start': 'min', 'stop': 'max', 'text': lambda x: ' '.join(list(x))
})
# write ipus
write_tier(ipus, outfile)

2086it [00:00, 17188.12it/s]


In [120]:
words

Unnamed: 0,file,tier,start,stop,text,pause_duration,ipu_id
0,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,1.910,2.190,OK,False,0
1,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,2.910,3.170,qu'est,True,1
2,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,4.210,4.280,ce,True,2
3,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,4.280,4.400,que,False,2
4,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,4.400,4.540,tu,False,2
...,...,...,...,...,...,...,...
3392,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,2692.425,2693.425,génial,True,1039
3393,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,2693.605,2694.225,Nous,False,1039
3394,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,2697.545,2697.905,avons,True,1040
3395,bkt-pilot-221103-1-bas.TextGrid,ORT-MAU,2698.475,2698.865,terminé,True,1041


Note: BAS seems to be messing up the alignment, being too (more than whisper) sensitive to echo.