# Test Notebook for various data files

This notebook contains tests for:
- pyspch.timit.read_seg_file()
- pyspch.core.read_txt() ... or read_data_file()

*read_seg_file()* is a routine for reading segmentation files; it is included with many other TIMIT-related definitions   

*read_txt()* is suitable for importing corpus-style files

TIMIT alphabets:   TIMIT48, TIMIT41
TIMIT mappings:    timit61_48, timit48_41, timit61_41

01/04/2022: verified with v0.6.3

In [1]:
# optional install of the pyspch package
#!pip install git+https://github.com/compi1234/pyspch.git

In [2]:
%matplotlib inline
import os,sys,io 
import scipy.signal

from urllib.request import urlopen
from IPython.display import display, Audio, HTML, clear_output
from ipywidgets import interact

import math,time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 

import librosa
    
import pyspch.sp as Sps
import pyspch.display as Spd
import pyspch.core as Spch

In [3]:
### Predefined symbol sets and mappings (translations).
Spch.timit.TIMIT48, Spch.timit.timit61_41_diff

(['aa',
  'ae',
  'ah',
  'ao',
  'aw',
  'ax',
  'er',
  'ay',
  'b',
  'ch',
  'd',
  'dh',
  'dx',
  'eh',
  'el',
  'm',
  'en',
  'ng',
  'ey',
  'f',
  'g',
  'hh',
  'ih',
  'ix',
  'iy',
  'jh',
  'k',
  'l',
  'n',
  'ow',
  'oy',
  'p',
  'r',
  's',
  'sh',
  't',
  'th',
  'uh',
  'uw',
  'v',
  'w',
  'y',
  'z',
  'zh',
  'sil',
  'epi',
  'vcl',
  'cl'],
 {'axr': 'er',
  'em': 'm',
  'eng': 'ng',
  'nx': 'n',
  'hv': 'hh',
  'ux': 'uw',
  'kcl': 'cl',
  'pcl': 'cl',
  'tcl': 'cl',
  'h#': 'sil',
  'pau': 'sil',
  'q': 'sil',
  'bcl': 'cl',
  'dcl': 'cl',
  'gcl': 'cl',
  'epi': 'sil',
  'dx': 't',
  'ax-h': 'ah',
  'ix': 'ih',
  'ax': 'ah',
  'el': 'l',
  'en': 'n'})

## File Objects

pyspch provides generic read access to files (local, mounted, or via URL) through:
- open_fobj(): opens any file-like object
- read_fobj(): reads the data of any file-like object, in raw format, into a file IO object
- read_data_file(): reads text data from any file-like object (default: a list of lines)
- read_tsv_file(): reads a TAB-separated data file
- load(): loads audio from any file-like object
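
The idea of a single entry point for local paths and URLs can be sketched with the standard library alone. The helper names `open_text` and `read_lines` below are hypothetical and are not part of pyspch; they only illustrate dispatching on the URL scheme, and pyspch's own open_fobj() may work differently.

```python
import io
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen


def open_text(resource, encoding="utf-8"):
    """Hypothetical helper (not pyspch): text file object for a local path or an http(s) URL."""
    scheme = urlparse(resource).scheme.lower()
    if scheme in ("http", "https"):
        # Download the raw bytes, then wrap the decoded text in an in-memory stream.
        with urlopen(resource) as response:
            raw = response.read()
        return io.StringIO(raw.decode(encoding))
    # Anything else is treated as a local (or mounted) path.
    return open(resource, encoding=encoding)


def read_lines(resource):
    """Read a resource into a list of lines without trailing newlines."""
    with open_text(resource) as f:
        return f.read().splitlines()


# Local demonstration with a temporary file; no network access is needed.
with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False) as tmp:
    tmp.write("f1 0.1 1.30 misinterpret\nf1 1.3 2.7 expansionist\n")
lines = read_lines(tmp.name)
os.remove(tmp.name)
print(lines)
```

Returning a file-like object in both branches lets the caller treat remote and local data identically.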

### Example 1: read local file into list of lines using native Python

In [4]:
fname = 'mini_corpus.lst'
dir = '../pyspch/data/'
# use a context manager so that the file is closed after reading
with open(dir+fname) as f:
    lines = f.read().splitlines()
lines

['friendly      0.0   -1.  friendly computers',
 'beed          0.0   -1.  beed bad booed',
 'f1            0.1  1.30  misinterpret ',
 'f1            1.3  2.7  expansionist ',
 'f1            2.80 3.7  circumspect']

### Example 2: read_data_file() to read local or remote files

- reads local or remote files into a list of lines
- optionally splits each line on whitespace into at most maxcols+1 fields
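
The splitting behaviour described above can be approximated in plain Python with `str.split` and its `maxsplit` argument. This is a minimal sketch, assuming that maxcols acts like `maxsplit`; the helper `split_lines` is hypothetical and is not the pyspch implementation.

```python
def split_lines(lines, maxcols=None):
    """Hypothetical helper: split each line on whitespace into at most maxcols+1 fields."""
    if maxcols is None:
        # No splitting requested: return the lines unchanged.
        return list(lines)
    # strip() first so that trailing blanks do not end up in the last field.
    return [line.strip().split(None, maxcols) for line in lines]


lines = [
    'friendly      0.0   -1.  friendly computers',
    'f1            0.1  1.30  misinterpret ',
]
fields = split_lines(lines, maxcols=3)
print(fields)
```

Limiting the number of splits keeps a free-text last column, such as 'friendly computers', together as one field.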

In [5]:
fname = 'mini_corpus.lst'
dir = '../pyspch/data/'
dir ='HTTPS://homes.esat.kuleuven.be/~spchlab/data/misc/'
#dir = 'V:public_html/data/misc/'
# read_fobj() would just read the raw data, unencoded
# data = Spch.read_fobj(dir+fname)
#
data_split = Spch.read_data_file(dir+fname,maxcols=3)
data = Spch.read_data_file(dir+fname)
data, data_split

(['friendly      0.0   -1.  friendly computers',
  'beed          0.0   -1.  beed bad booed',
  'f1            0.1  1.30  misinterpret',
  'f1            1.3  2.7  expansionist',
  'f1            2.80 3.7  circumspect'],
 [['friendly', '0.0', '-1.', 'friendly computers'],
  ['beed', '0.0', '-1.', 'beed bad booed'],
  ['f1', '0.1', '1.30', 'misinterpret'],
  ['f1', '1.3', '2.7', 'expansionist'],
  ['f1', '2.80', '3.7', 'circumspect']])

In [14]:
fname = 'timit_train.corpus'
root ='HTTPS://homes.esat.kuleuven.be/~spchlab/data/timit/'

# read_fobj() would just read the raw, undecoded data
# data = Spch.read_fobj(root+fname)
#
#data_split = Spch.read_data_file(root+fname,maxcols=3)
data = Spch.read_data_file(root+fname)
data  #, data_split

['train/dr1/fcjf0/sa1',
 'train/dr1/fcjf0/sa2',
 'train/dr1/fcjf0/si1027',
 'train/dr1/fcjf0/si1657',
 'train/dr1/fcjf0/si648',
 'train/dr1/fcjf0/sx127',
 'train/dr1/fcjf0/sx217',
 'train/dr1/fcjf0/sx307',
 'train/dr1/fcjf0/sx37',
 'train/dr1/fcjf0/sx397',
 'train/dr1/fdaw0/sa1',
 'train/dr1/fdaw0/sa2',
 'train/dr1/fdaw0/si1271',
 'train/dr1/fdaw0/si1406',
 'train/dr1/fdaw0/si2036',
 'train/dr1/fdaw0/sx146',
 'train/dr1/fdaw0/sx236',
 'train/dr1/fdaw0/sx326',
 'train/dr1/fdaw0/sx416',
 'train/dr1/fdaw0/sx56',
 'train/dr1/fdml0/sa1',
 'train/dr1/fdml0/sa2',
 'train/dr1/fdml0/si1149',
 'train/dr1/fdml0/si1779',
 'train/dr1/fdml0/si2075',
 'train/dr1/fdml0/sx159',
 'train/dr1/fdml0/sx249',
 'train/dr1/fdml0/sx339',
 'train/dr1/fdml0/sx429',
 'train/dr1/fdml0/sx69',
 'train/dr1/fecd0/sa1',
 'train/dr1/fecd0/sa2',
 'train/dr1/fecd0/si1418',
 'train/dr1/fecd0/si2048',
 'train/dr1/fecd0/si788',
 'train/dr1/fecd0/sx158',
 'train/dr1/fecd0/sx248',
 'train/dr1/fecd0/sx338',
 'train/dr1/fecd0/sx4

### Example 3: read_dataframe() to read column oriented files

- reads local or remote files into a DataFrame
- is a wrapper around pd.read_csv() with useful presets and options
    + assumes no header
    + tab-delimited by default, but can handle any sep
    + strip=True strips leading and trailing white space from string fields
    + does reasonable automatic data type casting
    + ...
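
The presets listed above can be imitated directly with pandas. The function `read_table` below is a hypothetical stand-in, not the pyspch code, and the tab-separated sample text is typed in by hand for the sketch.

```python
import io

import pandas as pd


def read_table(source, names=None, sep="\t", strip=False):
    """Hypothetical stand-in (not pyspch): read headerless delimited text into a DataFrame."""
    df = pd.read_csv(source, sep=sep, header=None, names=names)
    if strip:
        # Strip leading and trailing white space, but only in string columns.
        for col in df.columns:
            if pd.api.types.is_string_dtype(df[col]):
                df[col] = df[col].str.strip()
    return df


# Hand-made, tab-separated sample with stray blanks around some string fields.
sample = "friendly \t0.0\t-1.\t friendly computers\nf1\t0.1\t1.30\tmisinterpret \n"
corpus = read_table(io.StringIO(sample), names=["file", "t0", "t1", "text"], strip=True)
print(corpus.dtypes)
print(corpus)
```

pd.read_csv() infers the numeric columns by itself, so only the string clean-up needs extra code.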

In [6]:
fname = 'mini_corpus.txt'
dir = '../pyspch/data/'
dir ='HTTPS://homes.esat.kuleuven.be/~spchlab/data/misc/'
corpus = Spch.read_dataframe(dir+fname,names=['file','t0','t1','text'],strip=True)
corpus.dtypes, corpus

(file     object
 t0      float64
 t1      float64
 text     object
 dtype: object,
        file   t0   t1                text
 0  friendly  0.0 -1.0  friendly computers
 1      beed  0.0 -1.0      beed bad booed
 2        f1  0.1  1.3        misinterpret
 3        f1  1.3  2.7        expansionist
 4        f1  2.8  3.7         circumspect)

In [7]:
# read a segmentation datafile
dir='https://homes.esat.kuleuven.be/~spchlab/data/'
file = "timit/audio/train/si1027"
# get segmentations
segwrd = Spch.read_dataframe(dir+file+ ".wrd",names=['t0','t1','seg'],sep=r'\s+')
print(segwrd.dtypes)
segwrd

HTTPError: HTTP Error 404: Not Found

### Reading TIMIT transcription files

In [9]:
fn1 = 'timit/sa1.txt'
#fn1="timit/phones-61-48-39-41.txt"
fname='https://homes.esat.kuleuven.be/~spchlab/data/'
transcript = Spch.read_data_file(fname+fn1)[0].strip().split(None,2)
print('Samples: ',int(transcript[0]),int(transcript[1]))
print('Transcript: ',transcript[2].strip('.,!?:;'))

Samples:  0 46797
Transcript:  She had your dark suit in greasy wash water all year


### Reading Mapping files and converting to dictionary

In [8]:
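fn1="timit/phones-61-48-39-41.txt"
fname='https://homes.esat.kuleuven.be/~spchlab/data/'
transcript = Spch.read_data_file(fname+fn1)
cols = Spch.read_data_file(fname+fn1, maxcols = 4)
# NOTE: cols is a list of rows (one list of fields per line), so this zips
# row 0 with row 3 and the repeated key collapses to a single entry;
# a column-wise mapping would be {row[0]: row[3] for row in cols}
dict(zip(cols[0],cols[3]))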

{'aa': 'ao'}
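
The single-entry result above is what one gets when two rows are zipped instead of two columns. The sketch below contrasts both on a few illustrative rows, assuming the file holds one phone per line with columns in the order 61, 48, 39, 41 (as the file name suggests); the rows are typed in by hand and were not read from the mapping file.

```python
# Illustrative rows in the assumed column order 61, 48, 39, 41.
rows = [
    ['aa', 'aa', 'aa', 'aa'],
    ['axr', 'er', 'er', 'er'],
    ['em', 'm', 'm', 'm'],
    ['ux', 'uw', 'uw', 'uw'],
]

# Zipping row 0 with row 3 repeats the key 'aa', so only one entry survives.
row_pairing = dict(zip(rows[0], rows[3]))

# Taking field 0 and field 3 of every row gives the 61 -> 41 mapping.
col_mapping = {row[0]: row[3] for row in rows}

print(row_pairing)
print(col_mapping)
```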

## (TIMIT) Segmentation Files

Segmentation Files are assumed to be in the format

t0  t1   seg    
....


#### time units
t0 and t1 are the begin and end times of segment 'seg'.
The time unit can be specified in read_seg_file() with the argument 'dt'; by default, segmentation times are taken to be in seconds.
In TIMIT, times are often given in samples (with a sample rate of 16000 Hz), so use dt=1/16000.
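
The role of 'dt' can be illustrated with a small stand-alone parser. Both helpers below are hypothetical and do not reproduce read_seg_file(); the two segment lines are made up for the sketch, while the sample count 46797 is the one printed for sa1.txt earlier in this notebook.

```python
def samples_to_seconds(n_samples, sample_rate=16000):
    """Convert a sample index to seconds for the given sample rate."""
    dt = 1.0 / sample_rate
    return n_samples * dt


def parse_seg_lines(lines, dt=1.0):
    """Hypothetical parser: 't0 t1 seg' lines -> list of (t0, t1, seg) with times scaled by dt."""
    segs = []
    for line in lines:
        t0, t1, seg = line.split(None, 2)
        segs.append((float(t0) * dt, float(t1) * dt, seg.strip()))
    return segs


# Duration in seconds of an utterance of 46797 samples at 16 kHz.
duration = samples_to_seconds(46797)

# Made-up segmentation in samples, converted to seconds with dt=1/16000.
demo = parse_seg_lines(["0 16000 first", "16000 24000 second"], dt=1 / 16000)
print(duration)
print(demo)
```

With the default dt=1.0 the times are passed through unchanged, which matches files that are already in seconds.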

#### phonetic symbols
Phonetic transcriptions come in a variety of phonetic symbol sets.
These utilities include default definitions (and orderings) of **TIMIT48** and **TIMIT41**.   TIMIT48 is the default used in experiments with TIMIT.   TIMIT41 is our own, more compact version inspired by the alphabet of the CMU dictionaries, with one additional closure symbol ('cl').

A number of mappings between the different alphabets are provided.  To apply one, pass the argument 'xlat' when reading and specify the desired translation:  timit61_48, timit61_41, ...
These are simple dictionary-based mappings.   
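
A dictionary-based translation of this kind can be sketched in a few lines. The entries below are a subset of the 61-to-41 differences displayed near the top of this notebook; the helper `translate` is hypothetical, and passing unmapped symbols through unchanged is an assumption rather than documented pyspch behaviour.

```python
# Subset of the 61 -> 41 differences shown earlier in this notebook.
mapping_subset = {'h#': 'sil', 'kcl': 'cl', 'ix': 'ih', 'hv': 'hh', 'ux': 'uw'}


def translate(labels, xlat):
    """Hypothetical helper: map each label through xlat, keeping unmapped labels as they are."""
    return [xlat.get(label, label) for label in labels]


phones61 = ['h#', 'sh', 'ix', 'hv', 'kcl', 'k']
phones41 = translate(phones61, mapping_subset)
print(phones41)
```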

In [9]:
# read a datafile
dir='https://homes.esat.kuleuven.be/~spchlab/data/'
file = "timit/si1027" #@param {type:"string"}
wavfile = dir+file+".wav" 
wavdata, sr = Spch.audio.load(wavfile)
spg1 = Sps.spectrogram(wavdata,sample_rate=sr,n_mels=None)

# get segmentations
segwrd = Spch.timit.read_seg_file(dir+file+ ".wrd",dt=1/sr,fmt='float32')
segphn61 = Spch.timit.read_seg_file(dir+file+ ".phn",dt=1/sr,fmt='float32')
segphn = Spch.timit.read_seg_file(dir+file+ ".phn",dt=1/sr,fmt='float32',xlat='timit61_41')

HTTPError: HTTP Error 404: Not Found

In [10]:
fig = Spd.PlotSpg(spg1,wavdata=wavdata,segwav=segwrd,segspg=segphn,sample_rate=sr)
display(fig)
display(Audio(data=wavdata,rate=sr))

NameError: name 'spg1' is not defined