## ECE M214A Project: Speaker Region Identification



In this project, we'll train a machine learning algorithm to classify speakers by regional dialect.  We will use speech samples from the Corpus of Regional African American Language (CORAAL - https://oraal.uoregon.edu/coraal) with speakers each belonging to one of five different US cities: 1) Rochester, NY (ROC), 2) Lower East Side, Manhattan, NY (LES), 3) Washington DC (DCB), 4) Princeville, NC (PRV), or 5) Valdosta, GA (VLD).

The project files can be downloaded from [this link](https://ucla.box.com/s/332ewjf1fjmod77c4r2b7c1zq8j1a9pp)

To do this, we will first extract features from the audio files and then train a classifier to predict the city of origin of the utterance's speaker.  The goal is to extract a feature that contains useful information about regional dialect characteristics.

## 1. Setting up the data directories and Google Colab

Find the data for this project here: https://drive.google.com/drive/folders/1DRiIxfj5G6VzfHr1ojXxeE1YdLbae5xH?usp=sharing and store a copy in your Google Drive.

Make sure that the 'project_data' folder is stored in the top level of your Google Drive.  Otherwise, you will need to change the corresponding paths in the remainder of the notebook.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Getting familiar with the data

In [2]:
!pip install opensmile

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opensmile
  Downloading opensmile-2.4.2-py3-none-any.whl (4.5 MB)
Collecting audinterface>=0.7.0
  Downloading audinterface-1.0.0-py3-none-any.whl (31 kB)
Collecting audobject>=0.6.1
  Downloading audobject-0.7.9-py3-none-any.whl (24 kB)
Collecting audformat<2.0.0,>=0.15.3
  Downloading audformat-0.16.0-py3-none-any.whl (63 kB)
Collecting audresample<2.0.0,>=1.1.0
  Downloading audresample-1.2.1-py3-none-any.whl (494 kB)
Collecting audmath>=1.2.1
  Downloading audmath-1.2.1-py3-none-any.whl (10 kB)
Collecti

In [3]:
!pip install spafe

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spafe
  Downloading spafe-0.3.2-py3-none-any.whl (93 kB)
Installing collected packages: spafe
Successfully installed spafe-0.3.2


In [4]:
from IPython.display import Audio
import librosa
import torchaudio
import opensmile

import numpy as np
from glob import glob
from tqdm import tqdm
from pathlib import Path
import soundfile as sf
import re

import spafe.features
from spafe.features import mfcc
from spafe.features import rplp, pncc

sr = 44100  # sampling rate in Hz, used for playback and when writing augmented files

In [5]:
import pandas as pd


Let's take a moment to understand the data.  The original CORAAL dataset consists of ~85 different speakers, each from one of five cities.  The audio files are named with the convention: DCB_se1_ag1_f_03.  Here, DCB is the city code, se1 denotes the socioeconomic group of the speaker, ag1 denotes the age group of the speaker, f denotes female, and 03 denotes the participant number.  These unique combinations of identifiers mark the speaker.

The dataset has been preprocessed to include only audio segments of at least 10 seconds in length, and each speaker contributes a number of such segments.  The segments are numbered by appending the tag _seg_number to the file name.
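
The naming convention can be unpacked with a few lines of Python. This is only a sketch: `parse_segment_name` is a hypothetical helper, and the extra numeric field that appears between the participant number and `seg` in the example paths is not described in the text, so it is ignored here.

```python
from pathlib import Path

def parse_segment_name(path):
    """Split a CORAAL segment file name into its underscore-separated fields."""
    fields = Path(path).stem.split("_")
    city, socio_economic, age, sex, participant = fields[:5]
    return {
        "city": city,                      # e.g. DCB
        "socio_economic": socio_economic,  # e.g. se1
        "age": age,                        # e.g. ag1
        "sex": sex,                        # f or m
        "participant": participant,        # e.g. 03
        "segment": fields[-1],             # number after the _seg_ tag
    }

info = parse_segment_name(
    "drive/MyDrive/project_data/train_clean/DCB_se1_ag1_f_03_1_seg_3.wav"
)
print(info["city"], info["sex"], info["participant"], info["segment"])
```

The city code returned here is the label the classifier is ultimately asked to predict.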

You can also try listening to any segment like this:

In [6]:
#Audio(filename= "drive/MyDrive/project_data/train_clean/DCB_se1_ag1_f_03_1_seg_3.wav", rate=sr)

The original dataset has also been split into a train and test set. The test set has been further split, with a portion corrupted by the addition of babble noise at 10 dB:

In [7]:
#Audio(filename= "drive/MyDrive/project_data/test_noisy/LES_se0_ag3_f_01_1_seg_57.wav", rate=sr)
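
To get a feel for what a 10 dB noise level means, the sketch below mixes a noise signal into a clean one at a chosen signal-to-noise ratio. It is an illustration under assumptions: the notebook does not say how the noisy test set was generated, `mix_at_snr` is a hypothetical helper, and white noise and a sine tone stand in for babble and speech.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR (dB) = 10 * log10(clean_power / noise_power)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)  # 1 s stand-in tone
noise = rng.standard_normal(44100)                          # stand-in for babble
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=10)

# Check the achieved SNR of the mixture components
measured = 10 * np.log10(np.mean(clean ** 2) / np.mean(scaled_noise ** 2))
print(round(float(measured), 2))
```

At 10 dB the speech carries ten times the power of the noise, so the noise is clearly audible but the speech remains intelligible.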

# Explore Dataset

In [8]:
#First we obtain the list of all files in the train_clean directory
train_files = glob('drive/MyDrive/project_data/train_clean/*.wav')

In [9]:
# explore data by counts

# Get just the file names without paths or extension.
file_names = [Path(x).stem for x in train_files]

# For each file name, split it by "_" and save the first five 
# fields (city, socio-economic group, age group, sex, participant number).
# The participant number is stored in the 'Clip' column below.
file_rows = [i.split("_")[0:5] for i in file_names]

# Now append a speaker ID (sex + participant number) to each, along with the full file path
for i in range(len(file_rows)):
  r = file_rows[i]
  r.append(r[3] + r[4])
  r.append(train_files[i])

# Construct the data frame from these rows and our column names
fcdf = pd.DataFrame(file_rows, columns=['City', 'Socio_Economic', 'Age', 'Sex', 'Clip', 'Speaker', 'File'])

In [10]:
# Quick dataset overview
print(fcdf.describe())

        City Socio_Economic   Age   Sex  Clip Speaker  \
count   4372           4372  4372  4372  4372    4372   
unique     5              4     4     2     5       8   
top      DCB            se0   ag2     f    01     f01   
freq    2457           1915  1311  2766  2184    1160   

                                                     File  
count                                                4372  
unique                                               4372  
top     drive/MyDrive/project_data/train_clean/ROC_se0...  
freq                                                    1  


In [11]:
#Check for skew in dataset - DCB and females are overrepresented

print(fcdf['City'].value_counts())
print(fcdf.groupby(['City', 'Sex'])['Sex'].count())
print(fcdf.groupby(['City', 'Speaker'])['Speaker'].count())

DCB    2457
ROC     647
VLD     567
LES     459
PRV     242
Name: City, dtype: int64
City  Sex
DCB   f      1569
      m       888
LES   f       265
      m       194
PRV   f       170
      m        72
ROC   f       451
      m       196
VLD   f       311
      m       256
Name: Sex, dtype: int64
City  Speaker
DCB   f01        655
      f02        412
      f03        288
      f04        139
      f05         75
      m01        606
      m02        230
      m03         52
LES   f01        144
      f02        121
      m01        149
      m02         45
PRV   f01         67
      f02        103
      m01         26
      m02         46
ROC   f01        176
      f02         93
      f03        149
      f04         33
      m01         82
      m02         50
      m03         64
VLD   f01        118
      f02        193
      m01        161
      m02         37
      m03         58
Name: Speaker, dtype: int64
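
The counts above show how strongly DCB dominates the training set. This notebook addresses the skew by augmenting the other cities; a complementary option, shown here only as a sketch and not used elsewhere in the notebook, is to weight each class inversely to its frequency. The formula `total / (n_classes * count)` is the common "balanced" heuristic, and the counts are copied from the output above.

```python
# File counts per city, copied from the value_counts() output
counts = {"DCB": 2457, "ROC": 647, "VLD": 567, "LES": 459, "PRV": 242}

total = sum(counts.values())
n_classes = len(counts)

# Rare classes get weights above 1, frequent classes below 1
weights = {city: total / (n_classes * n) for city, n in counts.items()}

for city, w in weights.items():
    print(f"{city}: {w:.2f}")
```

With these weights every city contributes the same total weight to a loss function, regardless of how many files it has.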


# Augment Dataset

**Functions for Data Augmentation**

In [12]:
#!pip install pyplnoise

In [13]:
#import pyplnoise

In [14]:
#inject noise into files (this technique abandoned due to degradation in clean data performance)

def add_noise(audio_file, noise_factor):
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)
  noise = np.random.randn(len(audio))
  noisy_audio = audio + noise_factor * noise
  noisy_audio = noisy_audio.astype(type(audio[0]))
  return noisy_audio

# Requires pyplnoise: uncomment the install and import cells above before calling this
def add_pink_noise(audio_file, noise_factor):
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)
  pknoise = pyplnoise.PinkNoise(fs, 1e-2, 50)
  x_pk = pknoise.get_series(len(audio))
  noisy_audio = audio + x_pk*noise_factor
  noisy_audio = noisy_audio.astype(type(audio[0]))
  return noisy_audio 

In [15]:
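#shift time

def time_shift(audio_file, shift_max, shift_direction):
  # Roll the waveform by a random number of samples (up to shift_max seconds)
  # and silence the samples that wrap around.
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)
  shift = np.random.randint(fs * shift_max)
  # Note: a negative shift makes np.roll move the audio earlier in time,
  # so 'right' leaves the silence at the end of the clip.
  if shift_direction == 'right':
      shift = -shift
  elif shift_direction == 'both':
      direction = np.random.randint(0, 2)
      if direction == 1:
          shift = -shift
  shifted_audio = np.roll(audio, shift)
  # Set to silence for heading/tailing
  if shift > 0:
      shifted_audio[:shift] = 0
  elif shift < 0:
      # skip shift == 0, where shifted_audio[0:] = 0 would silence the whole clip
      shifted_audio[shift:] = 0
  return shifted_audio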
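
The roll-and-silence idea is easier to see on a tiny array than on a waveform. The sketch below uses a hypothetical helper, `shift_with_silence`, with a fixed shift instead of a random one; a zero shift is left untouched.

```python
import numpy as np

def shift_with_silence(audio, shift):
    """Roll `audio` by `shift` samples and zero the samples that wrapped around."""
    shifted = np.roll(audio, shift)  # np.roll returns a copy
    if shift > 0:
        shifted[:shift] = 0          # content moved later, silence at the start
    elif shift < 0:
        shifted[shift:] = 0          # content moved earlier, silence at the end
    return shifted

audio = np.arange(1, 9, dtype=float)     # samples 1..8
later = shift_with_silence(audio, 3)
earlier = shift_with_silence(audio, -3)
print(later)
print(earlier)
```

A positive shift pushes the samples toward the end of the array and a negative shift pulls them toward the start; in both cases the clip keeps its original length.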

In [16]:
#change voice speed

def alter_speed(audio_file, speed_factor):
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)
  return librosa.effects.time_stretch(y=audio, rate=speed_factor)

In [17]:
#change pitch

def alter_pitch(audio_file, pitch_factor):
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)
  return librosa.effects.pitch_shift(y=audio, sr=fs, n_steps=pitch_factor)
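
To judge how strong the speed and pitch changes are, it helps to convert the parameters into physical terms. The two helpers below are hypothetical and only do the arithmetic: a stretch rate below 1 lengthens the clip by roughly `1 / rate`, and `n_steps` is measured in semitones, assuming librosa's default of 12 steps per octave.

```python
def stretched_duration(duration_s, rate):
    """Approximate clip length after time stretching; rate < 1 slows speech down."""
    return duration_s / rate

def semitone_ratio(n_steps):
    """Frequency ratio for a shift of n_steps semitones (12 semitones per octave)."""
    return 2.0 ** (n_steps / 12.0)

# Values used later in this notebook: rate=0.95 and n_steps=-5
print(round(stretched_duration(10.0, 0.95), 2))
print(round(semitone_ratio(-5), 3))
```

A 10 second clip stretched with rate 0.95 becomes about 10.5 seconds long, and a shift of -5 semitones scales every frequency to about 75% of its original value.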

**Augmenting the Dataset and Creating New Audio Files**

In [19]:
#time shifting for data augmentation (all cities except for DCB)

for f in tqdm(train_files):
  if 'DCB' not in f:
    ts_data = time_shift(f, 5, 'right')
    ts_file = f.replace("project_data", "speech").replace("train_clean", "aug/shift")
    sf.write(ts_file, ts_data, sr, 'PCM_24')

  1%|          | 35/4372 [00:19<39:48,  1.82it/s]


KeyboardInterrupt: ignored

In [20]:
#Audio(filename= "drive/MyDrive/project_data/train_clean/ROC_se0_ag1_m_03_1_seg_0.wav", rate=sr)

In [21]:
#Audio(filename= "drive/MyDrive/speech/aug/shift/ROC_se0_ag1_m_03_1_seg_0.wav", rate=sr)

In [22]:
#change speed for data augmentation (all cities except DCB)

#not_dcb_count = 0
for f in tqdm(train_files):
  if 'DCB' not in f:
  # could also do: if ('ROC' in f) or ('VLD' in f) or ('LES' in f) or ('PRV' in f):
    #not_dcb_count += 1
    as_data = alter_speed(f, 0.95)
    as_file = f.replace("project_data", "speech").replace("train_clean", "aug/speed")
    sf.write(as_file, as_data, sr, 'PCM_24')
#print(str(not_dcb_count) + " out of " + str(len(train_files)) + " included")

  0%|          | 7/4372 [00:13<2:23:36,  1.97s/it]


KeyboardInterrupt: ignored

In [23]:
#change pitch for data augmentation (all cities except DCB) 
#lower the pitch of roughly half of the female clips toward a "male" range
#(select flips on every file, so only files at even positions are eligible)

select = True
for f in tqdm(train_files):
  if ('DCB' not in f) and ('_f_' in f):
    if select:
      ap_data = alter_pitch(f, -5)
      ap_file = f.replace("project_data", "speech").replace("train_clean", "aug/pitch")
      sf.write(ap_file, ap_data, sr, 'PCM_24')
  select = not select

  2%|▏         | 82/4372 [00:10<08:49,  8.10it/s]


KeyboardInterrupt: ignored

In [24]:
#Audio(filename= "drive/MyDrive/project_data/train_clean/ROC_se0_ag2_f_02_1_seg_0.wav", rate=sr)

In [25]:
#Audio(filename= "drive/MyDrive/speech/aug/pitch/ROC_se0_ag2_f_02_1_seg_0.wav", rate=sr)

In [26]:
#add white noise to every other file for data augmentation (abandoned)

#select = True
#for f in tqdm(train_files):
  #if select:
      #an_data = add_noise(f, 0.005)
      #an_file = f.replace("project_data", "speech").replace("train_clean", "aug/noise_VLD_LES")
      #sf.write(an_file, an_data, sr, 'PCM_24')
  #select = not select

In [27]:
#add pink noise to every other file for data augmentation (VLD and LES problematic for noise) (abandoned)

#select = True
#for f in tqdm(train_files):
  #if ('VLD' in f) or ('LES' in f):
    #if select:
        #an_data = add_pink_noise(f, 0.0001)
        #an_file = f.replace("project_data", "speech").replace("train_clean", "aug/pk_noise")
        #sf.write(an_file, an_data, sr, 'PCM_24')
    #select = not select

**Listen to Noise**

White Noise

In [28]:
#Audio(filename= "drive/MyDrive/project_data/train_clean/DCB_se1_ag1_f_01_1_seg_2.wav", rate=sr)

In [29]:
#Audio(filename= "drive/MyDrive/speech/aug/noise/DCB_se1_ag1_f_01_1_seg_2.wav", rate=sr)

Pink Noise

In [30]:
#Audio(filename= "drive/MyDrive/project_data/train_clean/VLD_se0_ag4_m_01_1_seg_81.wav", rate=sr)

In [31]:
#Audio(filename= "drive/MyDrive/speech/aug/pk_noise/VLD_se0_ag4_m_01_1_seg_81.wav", rate=sr)

## 3. Feature Extraction

As a baseline, we will be using the average MFCC value over time from the Librosa Python library. Your job will be to choose better features to improve performance on both the clean and noisy data.

We first define a set of functions that each turn an audio file into a fixed-length feature vector:


**Functions to Extract Features**

In [32]:
#Librosa Features (MFCC=20, spectrogram, rolloff)

def extract_feature(audio_file, n_mfcc=20):

  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)

  # get mfcc feature
  mfccs = librosa.feature.mfcc(y=audio, sr=fs, n_mfcc=n_mfcc)
  mfccs_mean = np.mean(mfccs, axis=1)
  
  # get spectrogram feature
  spectros = librosa.feature.melspectrogram(y=audio, sr=fs)
  spectro_mean = np.mean(spectros, axis=1)
  
  # get spectral rolloffs feature (pass sr so the frequencies match this file's rate)
  spectral_rolloffs = librosa.feature.spectral_rolloff(y=audio, sr=fs)
  spectral_rolloffs_mean = np.mean(spectral_rolloffs, axis=1)

  # concatenate feature arrays
  feat_out = np.concatenate([mfccs_mean, spectro_mean, spectral_rolloffs_mean])

  return feat_out
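
The key step in this function is averaging over time, which turns frame-level features of any duration into one fixed-length vector. The sketch below reproduces only the shapes, using random arrays in place of real librosa output; it assumes the librosa layout of (features, frames) and the default of 128 mel bands.

```python
import numpy as np

def pool_and_concatenate(frame_features):
    """Average each (n_features, n_frames) array over time, then join the results."""
    return np.concatenate([f.mean(axis=1) for f in frame_features])

rng = np.random.default_rng(0)

def fake_features(n_frames):
    # Stand-ins for MFCCs, mel spectrogram, and spectral rolloff
    return [
        rng.standard_normal((20, n_frames)),
        rng.random((128, n_frames)),
        rng.random((1, n_frames)),
    ]

short_clip = pool_and_concatenate(fake_features(50))
long_clip = pool_and_concatenate(fake_features(400))
print(short_clip.shape, long_clip.shape)
```

Both clips yield a vector of 20 + 128 + 1 = 149 values, so utterances of different lengths can be fed to the same classifier. The price is that all information about how the features change over time is discarded.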

In [33]:
#Open Smile ComParE Features

# Build the extractor once and reuse it for every file
# (other feature sets, e.g. opensmile.FeatureSet.eGeMAPSv02, can be swapped in)
compare_smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_smile(wav):

  audio,sample_fs = torchaudio.load(wav)
  sample_audio = audio.numpy().reshape(-1)

  # process_signal returns a one-row DataFrame of functionals
  y = compare_smile.process_signal(
      sample_audio,
      sample_fs
  )
  return np.array(y.iloc[0])

In [34]:
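#Spafe PNCC Feature

def get_pncc(audio_file):
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)

  # use the file's own sampling rate rather than the global sr
  pncc_feat = spafe.features.pncc.pncc(audio, fs)
  # average over axis 0 (frames) to get one value per coefficient
  pncc_mean = np.mean(pncc_feat, axis=0)

  feat_out = pncc_mean
  return feat_out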

In [35]:
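#Spafe PLP Feature

def get_plp(audio_file):
  audio,fs = torchaudio.load(audio_file)
  audio = audio.numpy().reshape(-1)

  # use the file's own sampling rate rather than the global sr
  plp_feat = spafe.features.rplp.plp(audio, fs)
  # average over axis 0 (frames) to get one value per coefficient
  plp_mean = np.mean(plp_feat, axis=0)

  feat_out = plp_mean
  return feat_out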

# Extract Features from Original and Augmented Files

In [36]:
#First we obtain the list of all files in the train_clean directory and the 
#augmented directories
train_files = glob('drive/MyDrive/project_data/train_clean/*.wav')
shift_files = glob('drive/MyDrive/speech/aug/shift/*.wav')
speed_files = glob('drive/MyDrive/speech/aug/speed/*.wav')
pitch_files = glob('drive/MyDrive/speech/aug/pitch/*.wav')

train_files.extend(shift_files)
train_files.extend(speed_files)
train_files.extend(pitch_files)

#Let's sort it so that we're all using the same file list order
train_files.sort()

lib_train_feat=[]
com_train_feat=[]
pncc_train_feat=[]


for wav in tqdm(train_files):
  #Librosa features
  lib_train_feat.append(extract_feature(wav))
  #ComParE features
  com_train_feat.append(extract_smile(wav))
  #Spafe PNCC feature
  pncc_train_feat.append(get_pncc(wav))

  0%|          | 17/4428 [00:35<2:33:51,  2.09s/it]


KeyboardInterrupt: ignored

In [37]:
# PLP extraction did not work well on the time-shifted files, so they are left out here
#First we obtain the list of all files in the train_clean directory and the 
#augmented directories
train_files = glob('drive/MyDrive/project_data/train_clean/*.wav')
speed_files = glob('drive/MyDrive/speech/aug/speed/*.wav')
pitch_files = glob('drive/MyDrive/speech/aug/pitch/*.wav')

train_files.extend(speed_files)
train_files.extend(pitch_files)

#Let's sort it so that we're all using the same file list order
train_files.sort()


plp_train_feat=[]


for wav in tqdm(train_files):
  plp_train_feat.append(get_plp(wav))

  1%|          | 26/4392 [00:17<49:03,  1.48it/s]  


KeyboardInterrupt: ignored

Let us now call the same functions to extract the features from the test_clean directory

In [39]:
#Now we obtain the list of all files in the test_clean directory
test_clean_files = glob('drive/MyDrive/project_data/test_clean/*.wav')

#Similar to above, we sort the files
test_clean_files.sort() 

lib_test_clean_feat=[]
com_test_clean_feat=[]
pncc_test_clean_feat=[]
plp_test_clean_feat=[]


for wav in tqdm(test_clean_files):
  lib_test_clean_feat.append(extract_feature(wav))
  com_test_clean_feat.append(extract_smile(wav))
  pncc_test_clean_feat.append(get_pncc(wav))
  plp_test_clean_feat.append(get_plp(wav))

  4%|▍         | 19/447 [00:50<13:14,  1.86s/it]Exception ignored on calling ctypes callback function: <function OpenSMILE.external_sink_set_callback_ex.<locals>.internal_callback_ex at 0x7fc0eab6ddc0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/opensmile/core/SMILEapi.py", line 362, in internal_callback_ex
    def internal_callback_ex(data, nt, n, meta: POINTER(FrameMetaData), _):
KeyboardInterrupt: 
  9%|▉         | 40/447 [01:46<18:03,  2.66s/it]


KeyboardInterrupt: ignored

In [40]:
#Finally we obtain the list of all files in the test_noisy directory
test_noisy_files = glob('drive/MyDrive/project_data/test_noisy/*.wav')

#Similar to above, we sort the files
test_noisy_files.sort() 

lib_test_noisy_feat=[]
com_test_noisy_feat=[]
pncc_test_noisy_feat=[]
plp_test_noisy_feat=[]

for wav in tqdm(test_noisy_files):
  lib_test_noisy_feat.append(extract_feature(wav))
  com_test_noisy_feat.append(extract_smile(wav))
  pncc_test_noisy_feat.append(get_pncc(wav))
  plp_test_noisy_feat.append(get_plp(wav))

  3%|▎         | 10/347 [00:30<12:52,  2.29s/it]Exception ignored on calling ctypes callback function: <function OpenSMILE.external_sink_set_callback_ex.<locals>.internal_callback_ex at 0x7fc0f44c58b0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/opensmile/core/SMILEapi.py", line 362, in internal_callback_ex
    def internal_callback_ex(data, nt, n, meta: POINTER(FrameMetaData), _):
KeyboardInterrupt: 
  3%|▎         | 11/347 [00:36<18:46,  3.35s/it]


KeyboardInterrupt: ignored

Add column headers and write the features to Drive as CSV files for subsequent model training

In [41]:
#Create Librosa feature names for list
feat_names_mfcc=['mfcc_' +str(n) for n in range(20)]
feat_names_spectro = ['spectro_' +str(n) for n in range(20, 148)]
feat_names_rolloff = ['rolloff_' +str(n) for n in range(148, 149)]
lib_feat_names = feat_names_mfcc+feat_names_spectro+feat_names_rolloff

In [42]:
#Create ComParE feature names for list
smile = opensmile.Smile(
      feature_set=opensmile.FeatureSet.ComParE_2016,
      feature_level=opensmile.FeatureLevel.Functionals,
  )
#print(smile.feature_names)

#scrub feature name list to remove special characters for a dataframe

com_feat_names=smile.feature_names
com_feat_names=[re.sub(r"[\[\]]", "_", s) for s in com_feat_names]
#print(com_feat_names)
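
The substitution above replaces every square bracket with an underscore, which matters because some downstream tools reject brackets in column names. The sketch below applies the same pattern to two illustrative strings written in the bracketed style that openSMILE uses; treat them as examples rather than a verified excerpt of the ComParE_2016 name list.

```python
import re

# Illustrative names in the bracketed openSMILE style (assumed, not verified)
examples = ["audSpec_Rfilt_sma[0]_range", "mfcc_sma[1]_stddev"]

# Same pattern as the notebook: match either "[" or "]" and replace with "_"
cleaned = [re.sub(r"[\[\]]", "_", s) for s in examples]

for before, after in zip(examples, cleaned):
    print(before, "->", after)
```

Each bracket becomes its own underscore, so an index such as `[0]` turns into `_0_` and the names stay distinct from one another.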

In [43]:
#Create Spafe PNCC feature names for list

pncc_feat_names=['pncc_' +str(n) for n in range(len(pncc_train_feat[0]))]

In [44]:
#Create Spafe PLP feature names for list

plp_feat_names=['plp_' +str(n) for n in range(len(plp_train_feat[0]))]

In [48]:
#Make dataframes to write Librosa to csv

stack = np.stack(lib_train_feat)
lib_train_feat_df = pd.DataFrame(data=stack, columns=lib_feat_names)
lib_test_clean_feat_df = pd.DataFrame(data=np.stack(lib_test_clean_feat), columns=lib_feat_names)
lib_test_noisy_feat_df = pd.DataFrame(data=np.stack(lib_test_noisy_feat), columns=lib_feat_names)


#Make dataframes to write ComParE to csv

stack = np.stack(com_train_feat)
com_train_feat_df = pd.DataFrame(data=stack, columns=com_feat_names)
com_test_clean_feat_df = pd.DataFrame(data=np.stack(com_test_clean_feat), columns=com_feat_names)
com_test_noisy_feat_df = pd.DataFrame(data=np.stack(com_test_noisy_feat), columns=com_feat_names)

#Make dataframes to write PNCC feature to csv

stack = np.stack(pncc_train_feat)
pncc_train_feat_df = pd.DataFrame(data=stack, columns=pncc_feat_names)
pncc_test_clean_feat_df = pd.DataFrame(data=np.stack(pncc_test_clean_feat), columns=pncc_feat_names)
pncc_test_noisy_feat_df = pd.DataFrame(data=np.stack(pncc_test_noisy_feat), columns=pncc_feat_names)

#Make dataframes to write PLP feature to csv

stack = np.stack(plp_train_feat)
plp_train_feat_df = pd.DataFrame(data=stack, columns=plp_feat_names)
plp_test_clean_feat_df = pd.DataFrame(data=np.stack(plp_test_clean_feat), columns=plp_feat_names)
plp_test_noisy_feat_df = pd.DataFrame(data=np.stack(plp_test_noisy_feat), columns=plp_feat_names)

In [49]:
#Save feature set to drive as csv

lib_train_feat_df.to_csv('drive/MyDrive/train_feat_librosa_aug.csv')
lib_test_clean_feat_df.to_csv('drive/MyDrive/test_clean_feat_librosa_aug.csv')
lib_test_noisy_feat_df.to_csv('drive/MyDrive/test_noise_feat_librosa_aug.csv')

com_train_feat_df.to_csv('drive/MyDrive/train_feat_ComParE_aug2.csv')
com_test_clean_feat_df.to_csv('drive/MyDrive/test_clean_feat_ComParE_aug.csv')
com_test_noisy_feat_df.to_csv('drive/MyDrive/test_noise_feat_ComParE_aug.csv')

pncc_train_feat_df.to_csv('drive/MyDrive/train_feat_pncc_aug2.csv')
pncc_test_clean_feat_df.to_csv('drive/MyDrive/test_clean_feat_pncc_aug.csv')
pncc_test_noisy_feat_df.to_csv('drive/MyDrive/test_noise_feat_pncc_aug.csv')

plp_train_feat_df.to_csv('drive/MyDrive/train_feat_plp_aug2.csv')
plp_test_clean_feat_df.to_csv('drive/MyDrive/test_clean_feat_plp_aug.csv')
plp_test_noisy_feat_df.to_csv('drive/MyDrive/test_noise_feat_plp_aug.csv')
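
When these files are loaded again for model training, note that `to_csv` writes the row index as an unnamed first column by default. The sketch below, which uses an in-memory buffer and made-up values instead of the Drive paths, shows one way to read such a file back without picking up an extra feature column.

```python
import io

import numpy as np
import pandas as pd

# Small stand-in for one of the feature tables
df = pd.DataFrame(
    np.arange(6.0).reshape(2, 3),
    columns=["pncc_0", "pncc_1", "pncc_2"],
)

buffer = io.StringIO()
df.to_csv(buffer)   # the row index is written as an unnamed first column
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored.shape)
```

Without `index_col=0` the restored table would contain a fourth column holding the row numbers, which a classifier would then treat as a feature.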