# Analysis of raw dataset fisher/fe_03_p1

## Where does the data come from?

The corpus is the first half of a collection of conversational telephone speech (CTS) created at LDC during 2003.

It contains **5850** audio files, each with one full conversation of **up to 10 minutes** between **2 participants**.

**Origin Location on disk**

- AUDIO
    + `/nm-raid/audio/data/corpora/LDC/fisher_eng_tr_sp_LDC2004S13/fisher_eng_tr_sp_LDC2004S13.zip`
- LABELS
    + `/nm-raid/audio/data/corpora/LDC/Other/LDC2004T19.tgz`
    
The raw files were copied into `$RENNET_ROOT/data/raw/fisher/fe_03_p1/`

The audio files are NIST Sphere files (.sph), with two channels, one per speaker, (0: A, 1: B). 
The files are grouped into directories of 100 files each, and the groups are spread across 7 different discs.
The `filetable.txt` has a complete listing of all the files in this part of the dataset, including the gender of the speakers.
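
To confirm the two-channel layout described above without extra dependencies, the plain-text header of a Sphere file can be inspected directly. The following is a minimal sketch, assuming the standard NIST SPHERE header layout (a `NIST_1A` line, a header-size line, `name -type value` fields, then `end_head`). The function name `parse_sphere_header` is hypothetical, and the example header is synthetic, not taken from the corpus.

```python
def parse_sphere_header(raw):
    """Parse the ASCII header at the start of a NIST SPHERE file.

    `raw` is the leading bytes of the file (the header itself is ASCII).
    Returns (header_size_in_bytes, dict_of_fields).
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())

    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        # Field lines look like: <name> -<type> <value>
        if len(parts) < 3 or not parts[1].startswith("-"):
            continue
        name, ftype, value = parts
        if ftype == "-i":      # integer field
            value = int(value)
        elif ftype == "-r":    # real-valued field
            value = float(value)
        fields[name] = value   # "-sN" string fields are kept as text
    return header_size, fields


# Synthetic header for illustration only; the values are assumptions.
example = (b"NIST_1A\n   1024\n"
           b"channel_count -i 2\n"
           b"sample_rate -i 8000\n"
           b"sample_coding -s4 ulaw\n"
           b"end_head\n")
header_size, fields = parse_sphere_header(example)
print(header_size, fields["channel_count"])
```

For a real file, the bytes could be obtained with `open(path, 'rb').read(1024)`, assuming the header fits in the first 1024 bytes.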

The labels come in two forms.
One, in `data/bbn_orig/`, was extracted in an automated way and marks the speech parts.
The other, in `data/trans/`, holds the relevant transcriptions as txt files, in groups of 100 files as above.
Unlike the audio, however, the labels are not divided by disc.
The `doc` folder has useful readmes and metadata for the recordings, with more information about the conversation and the speakers involved.

## Copying files to `working` folder

The copy was done **manually**, to `$RENNET_ROOT/data/working/fisher/fe_03_p1/raw` with the following modifications.

- `audio`
    + has the readme file and `filetable.txt`, which lists all audio files and the corresponding speaker genders
    + `data/disc1` to `data/disc7` with grouped audio sph files, each group having roughly a hundred of them.
        * the groups are named based on the first 3 digits of the conversation IDs of the files in them.
- `labels`
    + has readmes and doc files with more information about the transcriptions, and the metadata in the same folder
        * `fe_03_p1_calldata.tbl` has most of the relevant speaker and annotation metadata
        * `fe_03_pindata.tbl` has more detailed information about the speakers themselves.
        * `fe_03_topics.sgm` is an XML-like file with information about the topics of conversation, which are referred to in the `calldata` file
    + Same as above, `data/disc1` to `data/disc7` with grouped transcription txt files, each group having roughly a hundred of them.
        * the groups are named based on the first 3 digits of the conversation IDs of the files in them.
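
The group naming rule above can be sketched as a small helper. This is only an illustration, assuming file names of the form `fe_03_<5-digit ID>.<ext>` as seen in this dataset; the function name `group_for` is hypothetical.

```python
import os


def group_for(filepath):
    """Return the group directory name for a conversation file.

    Assumes names like fe_03_00001.sph, where the part after the last
    underscore is the conversation ID.
    """
    stem = os.path.splitext(os.path.basename(filepath))[0]  # fe_03_00001
    conversation_id = stem.rsplit("_", 1)[-1]               # 00001
    return conversation_id[:3]                              # first 3 digits


print(group_for("fe_03_00001.sph"))  # prints 000
```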

In [1]:
from __future__ import print_function, division

import os
import sys

rennet_root = os.environ['RENNET_ROOT']
sys.path.append(rennet_root)

%load_ext autoreload
%autoreload 1

## Gather all media filepaths and labels

> **NOTE:**
>
> From here on, we work exclusively with the working directory created earlier. Keep that in mind for all the instructions that follow.



In [10]:
# Finding audio files
import glob

# UPDATE HERE ###############################################################

rennet_workingdir = os.path.join(rennet_root, 'data', 'working')

provider = 'fisher'
dataset = 'fe_03_p1'
working_rawaudio_dir = os.path.join(rennet_workingdir, provider, dataset, 
                                    'raw', 'audio', 'data')

glob_str = str(os.path.join(working_rawaudio_dir, "**", "*.*"))

# ###########################################################################

print("Query: ", glob_str)
audio_fp = sorted(glob.glob(glob_str, recursive=True))

print("Found audio files: {}".format(len(audio_fp)), *audio_fp[:10], "...", sep="\n")

Query:  /home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/**/*.*
Found audio files: 5850
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00001.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00002.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00003.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00004.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00005.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00006.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00007.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00008.sph
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/audio/data/disc1/000/fe_03_00009.s

In [13]:
# Finding label files

# UPDATE HERE ###############################################################

working_rawlabel_dir = os.path.join(rennet_workingdir, provider, dataset, 
                                    'raw', 'labels', 'data')

glob_str = str(os.path.join(working_rawlabel_dir, "**", "*.*"))

# ###########################################################################

print("Query: ", glob_str)
label_fp = sorted(glob.glob(glob_str, recursive=True))

print("Found transcription files: {}".format(len(label_fp)), *label_fp[:10], "...", sep="\n")

Query:  /home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/**/*.*
Found transcription files: 5850
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00001.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00002.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00003.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00004.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00005.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00006.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00007.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1/000/fe_03_00008.txt
/home/aabdullah/delve/rennet/data/working/fisher/fe_03_p1/raw/labels/data/disc1
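
Both queries report 5850 files, but matching counts alone do not show that every audio file has a transcription with the same conversation ID. One possible check, sketched below, pairs the two lists by file stem. The helper names `file_stem` and `pair_by_stem` are hypothetical, and the toy paths are made up for illustration.

```python
import os


def file_stem(path):
    """File name without directory or extension, e.g. fe_03_00001."""
    return os.path.splitext(os.path.basename(path))[0]


def pair_by_stem(audio_paths, label_paths):
    """Pair audio and label files that share the same file stem.

    Returns (pairs, stems_without_label, stems_without_audio).
    """
    audio = {file_stem(p): p for p in audio_paths}
    labels = {file_stem(p): p for p in label_paths}
    common = sorted(set(audio) & set(labels))
    pairs = [(audio[s], labels[s]) for s in common]
    no_label = sorted(set(audio) - set(labels))
    no_audio = sorted(set(labels) - set(audio))
    return pairs, no_label, no_audio


# Toy example with made-up paths
toy_audio = ["a/000/fe_03_00001.sph", "a/000/fe_03_00002.sph"]
toy_labels = ["l/000/fe_03_00001.txt", "l/000/fe_03_00003.txt"]
pairs, no_label, no_audio = pair_by_stem(toy_audio, toy_labels)
print(len(pairs), no_label, no_audio)
```

In this notebook the call would be `pair_by_stem(audio_fp, label_fp)`; both `no_label` and `no_audio` should come back empty if the working copy is complete.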

In [None]:
# Reading the calldata.tbl file for metadata