# Test Notebook for various data files

This notebook contains tests for:   
    - pyspch.timit.read_seg_file()   
    - pyspch.core.read_txt()  ... or read_data_file()   

*read_seg_file()* reads segmentation files; it is bundled with a number of other TIMIT-related definitions

*read_txt()* is suitable for importing corpus style files

TIMIT alphabets:   TIMIT48, TIMIT41
TIMIT mappings:    timit61_48, timit48_41, timit61_41

01/04/2022: verified with v0.6.3

In [None]:
# optional install of the pyspch package
#!pip install git+https://github.com/compi1234/pyspch.git

In [1]:
%matplotlib inline
import os,sys,io, pkg_resources
import scipy.signal

from urllib.request import urlopen
from IPython.display import display, Audio, HTML, clear_output
from ipywidgets import interact
import urllib.request
import scipy.io as sio

import math,time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 

import librosa

import pyspch.sp as Sps
import pyspch.display as Spd
import pyspch.core as Spch

In [None]:
def load_data(name,dir='pkg_resources',**kwargs):
    '''
    A generic data loading function.
    Data can be loaded from the example data included in the package as well as from directories on disk or URL based resources.

    The precise data loading and processing depends on the filename extension; extra **kwargs are passed 
    to the underlying functions being called.   Data stored in files with deviating naming conventions should
    be processed with a lower level routine.
    
    Recognized extensions:
        Waveform data:                       .wav       
        (TIMIT like) Segmentation data:      .gra, .phn, .syl, .wrd, .seg  
        Database (comma separated):          .csv
        MATLAB data:                         .mat
        Line data:                           .lst, .txt
        
    Line data is read into a list of lines,
        or into a DataFrame if the 'sep' keyword is present.
    A name without extension yields an empty list.
    
    '''
    if dir == "pkg_resources":
        filename = pkg_resources.resource_filename('pyspch','data/' + name)
    else:
        filename = dir + name
    
    _, ext = os.path.splitext(filename)
    
    if ext == '.wav':        
        wavdata,sample_rate= Spch.audio.load(filename,**kwargs)
        return wavdata, sample_rate
    elif ext in ['.gra','.phn','.syl','.wrd','.seg']:
        data = Spch.timit.read_seg_file(filename,**kwargs)
        return data
    elif ext in ['.csv']: 
        data = pd.read_csv(filename,**kwargs)
        return data
    elif ext == '.mat':
        raw_data = Spch.read_fobj(filename)
        data = sio.loadmat(raw_data,squeeze_me=True)
        # remove the metadata keys of the form __xxxxx__ that loadmat() adds
        remove_keys = [ k for k in data.keys() if (k[0]=='_' and k[-1]=='_')]
        for k in remove_keys:
            del data[k]
        return data
    elif ext in ['.lst','.txt']: 
        if 'sep' in kwargs.keys():
            print('read dataframe')
            data = Spch.read_dataframe(filename,**kwargs)
        else:
            print('read txt')
            data = Spch.read_txt(filename,**kwargs)
        return data
    elif ext == '':
        data = []
        return data

In [None]:
wavdata, sample_rate = Spch.load_data('friendly.wav')
#wavdata, sample_rate = load_data('si1027.wav',dir = 'https://homes.esat.kuleuven.be/~spchlab/data/timit/')
plt.plot(wavdata)
# if you prefer the plot above the audio widget, then flush the plot by adding the line below
# plt.show()
display(Audio(data=wavdata,rate=sample_rate,autoplay=False))

In [None]:
help(Spch.read_fobj)

In [None]:
seg = Spch.load_data('friendly.gra')
#seg=load_data('si1027.phn',dir = 'https://homes.esat.kuleuven.be/~spchlab/data/timit/',xlat='timit61_41')
seg

In [None]:
name="phones-61-48-39-41.txt"
dir='https://homes.esat.kuleuven.be/~spchlab/data/timit/'
Spch.load_data(name,dir=dir,sep='\t')

In [None]:
hildata = Spch.load_data('hildata.csv')  
hildata

In [None]:
corpus = Spch.load_data('mini_corpus.lst')
corpus

In [None]:
tinytimit = 'https://homes.esat.kuleuven.be/~spchlab/datasets/tinytimit/'
#data_mf = load_data('male-female',ext='mat',dir=tinytimit)
data = Spch.load_data('a-i-uw-800.mat',dir=tinytimit)
data['ALLtrain']

In [None]:
df = pd.DataFrame(data['ALLtrain'].T)
df

## File Objects

pyspch provides generic read access to files (local, mounted, or via URL) through:
- open_fobj(): opens any file-like object
- read_fobj(): reads the raw data from any file-like object into a file IO object
- read_data_file(): reads text data from any file-like object (default: list of lines)
- read_tsv_file(): reads a TAB separated data file
- load(): loads audio from any file-like object
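
The idea behind these helpers can be sketched with the standard library alone. The function `read_raw()` below is a hypothetical stand-in, not the pyspch implementation: it assumes that any name starting with `http://` or `https://` is a URL and that everything else is a local path, and it returns the raw bytes wrapped in an `io.BytesIO` object.

```python
import io
import os
import tempfile
from urllib.request import urlopen

def read_raw(resource):
    """Hypothetical helper: return the raw bytes of a local file or URL as BytesIO."""
    if resource.lower().startswith(('http://', 'https://')):
        with urlopen(resource) as f:      # remote resource
            return io.BytesIO(f.read())
    with open(resource, 'rb') as f:       # local or mounted file
        return io.BytesIO(f.read())

# usage on a small temporary local file with made-up content
with tempfile.NamedTemporaryFile('w', suffix='.lst', delete=False) as tmp:
    tmp.write('file1 hello world\nfile2 good morning\n')
fobj = read_raw(tmp.name)
lines = fobj.read().decode('utf-8').splitlines()
os.remove(tmp.name)
print(lines)
```

Returning an in-memory object lets the caller hand the data to any reader that accepts a file-like object, regardless of where the data came from.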

### Example 1: read local file into list of lines using native Python

In [None]:
fname = 'mini_corpus.lst'
dir = '../pyspch/data/'
# use a context manager so that the file is closed after reading
with open(dir+fname) as f:
    lines = f.read().splitlines()
lines

### Example 2: read_data_file() to read local or remote files

- Reads local or remote files into a list of lines
- optionally splits each line on whitespace into at most maxcols+1 fields
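
What the optional splitting amounts to can be illustrated in plain Python. This sketch assumes that `maxcols` behaves like the `maxsplit` argument of `str.split()`; the sample lines are made up, and the actual return layout of read_data_file() may differ.

```python
# made-up corpus lines: a file name, two times and free text
lines = ['f1 0.00 1.25 hello world',
         'f2 1.25 2.50 good morning to you']

maxcols = 3  # assumption: acts like maxsplit, i.e. at most maxcols splits per line
rows = [line.split(None, maxcols) for line in lines]
print(rows)
```

Limiting the number of splits keeps the trailing free text together in the last field, even when it contains spaces.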

In [None]:
fname = 'mini_corpus.lst'
dir = '../pyspch/data/'
dir ='HTTPS://homes.esat.kuleuven.be/~spchlab/data/misc/'
#dir = 'V:public_html/data/misc/'
# read_fobj() would just read the raw data, without decoding it
# data = Spch.read_fobj(dir+fname)
#
data_split = Spch.read_data_file(dir+fname,maxcols=3)
data = Spch.read_data_file(dir+fname)
data, data_split

### Example 3: read_dataframe() to read column oriented files

- Reads local or remote files into a DataFrame
- is a wrapper around pd.read_csv() with useful presets and options
    + assumes no header
    + tab delimited by default, but can handle any sep
    + strip=True strips white space from the edges of string fields
    + performs reasonable automatic data type casting
    + ...
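
For comparison, a rough equivalent of these presets in plain pandas might look as follows. This is only a sketch under the assumptions listed above (no header, tab delimited, stripped string fields); the data is made up and read_dataframe() itself may do more.

```python
import io
import pandas as pd

# made-up tab separated data without a header line
text = "f1\t0.0\t1.25\t hello world \nf2\t1.25\t2.5\tgood morning\n"

df = pd.read_csv(io.StringIO(text), sep='\t', header=None,
                 names=['file', 't0', 't1', 'text'])
# strip white space at the edges of the string field
df['text'] = df['text'].str.strip()
print(df.dtypes)
print(df)
```

Here pandas infers floating point types for the two time columns, while the file name and text columns remain strings.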

In [None]:
fname = 'mini_corpus.txt'
dir = '../pyspch/data/'
dir ='HTTPS://homes.esat.kuleuven.be/~spchlab/data/misc/'
corpus = Spch.read_dataframe(dir+fname,names=['file','t0','t1','text'],strip=True)
corpus.dtypes, corpus

In [None]:
# read a segmentation datafile
dir='https://homes.esat.kuleuven.be/~spchlab/data/'
file = "timit/si1027"
# get segmentations
# use a raw string for the regular expression to avoid an invalid escape sequence warning
segwrd = Spch.read_dataframe(dir+file+ ".wrd",names=['t0','t1','seg'],sep=r'\s+')
print(segwrd.dtypes)
segwrd

### Reading TIMIT transcription files

In [None]:
fn1 = 'timit/sa1.txt'
#fn1="timit/phones-61-48-39-41.txt"
fname='https://homes.esat.kuleuven.be/~spchlab/data/'
transcript = Spch.read_data_file(fname+fn1)[0].strip().split(None,2)
print('Samples: ',int(transcript[0]),int(transcript[1]))
print('Transcript: ',transcript[2].strip('.,!?:;'))

### Reading Mapping files and converting to dictionary

In [None]:
fn1="timit/phones-61-48-39-41.txt"
fname='https://homes.esat.kuleuven.be/~spchlab/data/'
transcript = Spch.read_data_file(fname+fn1)
cols = Spch.read_data_file(fname+fn1, maxcols = 4)
dict(zip(cols[0],cols[3]))

## (TIMIT) Segmentation Files

Segmentation Files are assumed to be in the format

t0  t1   seg    
....


#### time units
t0 and t1 are the begin and end times of segment 'seg'.
The time unit can be specified in read_seg_file() with the 'dt' argument; by default segmentation times are given in seconds.
In TIMIT, times are usually given in samples (with a sample rate of 16000), so use dt=1/16000.
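
The conversion from samples to seconds can be sketched with plain pandas. The segmentation below is made up for illustration and the code does not use read_seg_file(); it only shows the effect of scaling the times with dt.

```python
import io
import pandas as pd

# made-up segmentation in the 't0 t1 seg' format, times in samples
text = "0 1600 sil\n1600 8000 hello\n8000 16000 world\n"

sample_rate = 16000
dt = 1 / sample_rate              # duration of one sample in seconds

seg = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None,
                  names=['t0', 't1', 'seg'])
seg['t0'] = seg['t0'] * dt        # samples -> seconds
seg['t1'] = seg['t1'] * dt
print(seg)
```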

#### phonetic symbols
Phonetic transcriptions come in a variety of phonetic symbol sets.
These utilities include default definitions (and orderings) of **TIMIT48** and **TIMIT41**.   TIMIT48 is the default used in experiments with TIMIT.   TIMIT41 is our own more compact version inspired by the alphabet in the CMU dictionaries, with 1 additional closure ('cl') symbol.

A number of mappings between the different alphabets are provided.  To apply one, pass the 'xlat' argument at time of reading and specify the desired translation:  timit61_48, timit61_41, ...
These are simple dictionary-based mappings.
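
A dictionary-based translation of segment labels can be sketched as follows. The mapping `example_map` is a small illustrative excerpt in the spirit of a 61 to 48 folding, not the full table shipped with pyspch, and the segmentation is made up. Keeping unmapped symbols unchanged is a choice made in this sketch only.

```python
import pandas as pd

# illustrative excerpt of a 61 -> 48 style folding; not the full pyspch table
example_map = {'h#': 'sil', 'pau': 'sil', 'pcl': 'cl', 'tcl': 'cl',
               'kcl': 'cl', 'ux': 'uw', 'axr': 'er'}

# made-up phone segmentation with times in seconds
seg = pd.DataFrame({'t0': [0.00, 0.15, 0.22, 0.30],
                    't1': [0.15, 0.22, 0.30, 0.45],
                    'seg': ['h#', 'tcl', 't', 'ux']})

# symbols without an entry in the mapping are kept unchanged
seg['seg'] = seg['seg'].map(lambda s: example_map.get(s, s))
print(seg['seg'].tolist())
```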

In [None]:
# read a datafile
dir='https://homes.esat.kuleuven.be/~spchlab/data/'
file = "timit/si1027" #@param {type:"string"}
wavfile = dir+file+".wav" 
wavdata, sr = Spch.audio.load(wavfile)
spg1 = Sps.spectrogram(wavdata,sample_rate=sr,n_mels=None)

# get segmentations
segwrd = Spch.timit.read_seg_file(dir+file+ ".wrd",dt=1/sr,fmt='float32')
segphn61 = Spch.timit.read_seg_file(dir+file+ ".phn",dt=1/sr,fmt='float32')
segphn = Spch.timit.read_seg_file(dir+file+ ".phn",dt=1/sr,fmt='float32',xlat='timit61_41')

In [None]:
fig = Spd.PlotSpg(spg1,wavdata=wavdata,segwav=segwrd,segspg=segphn,sample_rate=sr)
display(fig)
display(Audio(data=wavdata,rate=sr))