# Hackathon 1: Northeastern vs. Western Accents

## Overview of the task
For this first hackathon, your team will develop a model for distinguishing the speech of people who grew up in the **Northeastern US** (specifically, New England and New York City) from the speech of people who grew up in the **Western US** (the states on the US west coast and in the southwest). 

### Data
The `wav` and `TextGrid` files can be found in the correspondingly named directories (`Northeast` and `West`). This data is a subset of the TIMIT database, which you can learn more about [here](https://github.com/philipperemy/timit). 

### Code
Below I have given you code that extracts some basic speech features and runs a classifier with cross-validation.

### Baselines
I've provided two baselines: the majority class baseline, and a baseline using the sorts of features we've extracted before (mean pitch, mean intensity, duration, and mean F1 in vowel segments). I've used parselmouth here, but you are free to use Praat directly, save the features out to files, and then read those files in with Python to do classification.
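
If you go the Praat route, reading the saved features back into Python might look roughly like the sketch below. It assumes a hypothetical comma-separated layout with a header row and the columns `region`, `meanpitch`, `meanintensity`, `duration`, and `meanf1`. The file contents are made up for illustration, and your own Praat script may well write something different.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for a file written by a Praat script.
# In practice you would use open("features.csv") instead of io.StringIO.
praat_output = io.StringIO(
    "region,meanpitch,meanintensity,duration,meanf1\n"
    "Northeast,122.5,58.8,3.58,620.8\n"
    "West,180.2,61.3,2.91,655.4\n"
)

features = []
labels = []
for row in csv.DictReader(praat_output):
    # Label convention used here: northeast = 0, west = 1
    labels.append(0 if row["region"] == "Northeast" else 1)
    features.append([
        float(row["meanpitch"]),
        float(row["meanintensity"]),
        float(row["duration"]),
        float(row["meanf1"]),
    ])

csv_data = np.array(features)
csv_target = np.array(labels, dtype=int)
print(csv_data.shape, csv_target)
```

The resulting arrays have one row per file and one label per row, which is the shape sklearn's classifiers expect.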

---

# Starter code

Here's the code that will give you the baselines. Please, I beg of you, **READ THE COMMENTS IN THE CODE**! Also, remember that you can't proceed to the next code block until the asterisk to the left of the current code block has turned into a number.

```
[ ] = code has not been executed
[*] = code is currently executing
[2] = code has been executed

```

Post your questions on the [Slack channel for Hackathon 1](https://csci3398.slack.com/archives/C019MUSM0BY).

In [1]:
# Import statements
# I am importing several classifiers here to give you some ideas!

import glob
import re
import parselmouth
from parselmouth.praat import call
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
### GET NORTHEAST DATA

# Extracting the formant information takes quite a while!
# Don't proceed to the next step until the asterisk in brackets to the
# left has a number inside it, meaning execution is complete.
# [*] means "this code is still executing"
# [2] (or any number) means "this code has been executed"

# This variable here will keep track of how many files have
# been processed so far. It will print out for every 100 files.

counter = 0 


# Store northeast data in this array:
northeast = []
northeast_mfccs = []

# You will obviously need to change this path below to make it work.
for wav_file in glob.glob("/Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/*.wav"):
    
    # Print out a status every time you've processed 100 files.
    counter += 1
    if counter % 100 == 0:
        print(counter, wav_file)

    # This code should give you an idea of how to use parselmouth
    # to carry out most of Praat's functionality within python.
    # Basically, you can go run a command in Praat, paste the command
    # history into a Praat script, and then insert the command into
    # the parselmouth call() method, adjusting commas, quotes, etc.

    # Get duration, mean pitch, mean intensity
    sound = parselmouth.Sound(wav_file)
    pitch = call(sound, "To Pitch", 0, 75, 600) 
    meanpitch = call(pitch, "Get mean", 0, 0, "Hertz")
    intensity = call(sound, "To Intensity", 75, 0, "yes")
    meanintensity = call(intensity, "Get mean", 0, 0, "energy")
    duration = call(sound, "Get total duration")
    mfccs = call(sound, "To MFCC", 12, 0.015, 0.005, 100, 100, 0)
    northeast_mfccs.append(np.array(mfccs))

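    # Get summed F1 and F2 at vowel midpoints (mean F1 is computed when appending)
    formant = call(sound, "To Formant (burg)", 0, 5, 5500, 0.025, 50)
    # Swap only the file extension, so a "wav" elsewhere in the path is left alone.
    tg_file = re.sub(r"\.wav$", ".TextGrid", wav_file)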
    textgrid = call("Read from file", tg_file)
    intv = call(textgrid, "Get number of intervals", 1)
    vowels = 0
    f_one = 0
    f_two = 0
    # Praat numbers intervals from 1 to intv inclusive.
    for i in range(1, intv + 1):
        phone = call(textgrid, "Get label of interval", 1, i)
        if re.match('[AEIOU]', phone):
            vowels += 1
            vowel_onset = call(textgrid, "Get starting point", 1, i)
            vowel_offset = call(textgrid, "Get end point", 1, i)
            midpoint = vowel_onset + ((vowel_offset - vowel_onset) / 2)
            f_one += call(formant, "Get value at time", 1, midpoint, "Hertz", "Linear")
            f_two += call(formant, "Get value at time", 2, midpoint, "Hertz", "Linear")


    # Append all the features to the northeast data list.
    northeast.append([meanpitch, meanintensity, duration, (f_one/vowels), f_one, f_two])


100 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NE_MKLS0_SI1437.wav
200 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NYC_MTJU0_SI2020.wav
300 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NE_MWAD0_SI1749.wav
400 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NYC_MKLN0_SA2.wav
500 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NE_FMEM0_SX297.wav
600 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NYC_MKLN0_SX248.wav
700 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/Northeast/NE_FVFB0_SI1510.wav


In [4]:
### GET WEST DATA

# **Remember**
# Don't proceed to the next code block until the asterisk in brackets 
# to the left has a number inside it, meaning execution is complete.
# [*] means "this code is still executing"
# [2] (or any number) means "this code has been executed"

# Exactly as above for northeast.
west = []
counter = 0 
west_mfccs = []

# Change this path to make it work on your machine!
for wav_file in glob.glob("/Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/*.wav"):    
    counter += 1
    if counter % 100 == 0:
        print(counter, wav_file)

    # Get duration, mean pitch, mean intensity
    sound = parselmouth.Sound(wav_file)
    pitch = call(sound, "To Pitch", 0, 75, 600) 
    meanpitch = call(pitch, "Get mean", 0, 0, "Hertz")
    intensity = call(sound, "To Intensity", 75, 0, "yes")
    meanintensity = call(intensity, "Get mean", 0, 0, "energy")
    duration = call(sound, "Get total duration")
    mfccs = call(sound, "To MFCC", 12, 0.015, 0.005, 100, 100, 0)
    west_mfccs.append(np.array(mfccs))

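    # Get summed F1 and F2 at vowel midpoints (mean F1 is computed when appending)
    formant = call(sound, "To Formant (burg)", 0, 5, 5500, 0.025, 50)
    # Swap only the file extension, so a "wav" elsewhere in the path is left alone.
    tg_file = re.sub(r"\.wav$", ".TextGrid", wav_file)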
    textgrid = call("Read from file", tg_file)
    intv = call(textgrid, "Get number of intervals", 1)
    vowels = 0
    f_one = 0
    f_two = 0
    # Praat numbers intervals from 1 to intv inclusive.
    for i in range(1, intv + 1):
        phone = call(textgrid, "Get label of interval", 1, i)
        if re.match('[AEIOU]', phone):
            vowels += 1
            vowel_onset = call(textgrid, "Get starting point", 1, i)
            vowel_offset = call(textgrid, "Get end point", 1, i)
            midpoint = vowel_onset + ((vowel_offset - vowel_onset) / 2)
            f_one += call(formant, "Get value at time", 1, midpoint, "Hertz", "Linear")
            f_two += call(formant, "Get value at time", 2, midpoint, "Hertz", "Linear")


    west.append([meanpitch, meanintensity, duration, (f_one/vowels), f_one, f_two])


100 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_MTPR0_SX430.wav
200 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_FCJS0_SI977.wav
300 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_MTAB0_SA1.wav
400 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_MWRP0_SI1525.wav
500 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_MTLC0_SX307.wav
600 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_MWRP0_SX273.wav
700 /Users/amaliariegelhuth/Desktop/NLPCode/hackathon1-team-2/West/W_FCJS0_SA2.wav
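
Both loops divide by `vowels`, which fails for a file with no matching vowel labels, and Praat can report a formant as undefined at a given time. A more defensive version of the averaging step might look like the sketch below. `safe_mean` is a hypothetical helper, not part of the starter code, and it assumes that undefined values arrive in Python as NaN.

```python
import math


def safe_mean(values, default=float("nan")):
    """Average the defined values; return `default` if none are usable."""
    usable = [v for v in values if v is not None and not math.isnan(v)]
    if not usable:
        return default
    return sum(usable) / len(usable)


# Made-up F1 midpoint values (Hz) for one file; NaN marks an undefined value.
f1_values = [610.0, float("nan"), 650.0]
print(safe_mean(f1_values))        # averages 610.0 and 650.0 only
print(safe_mean([], default=0.0))  # a file with no vowels
```

To use something like this, you would collect the midpoint values in a list inside the loop instead of adding them to a running sum.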


In [5]:
### PUT EVERYTHING IN NUMPY ARRAYS

# put everything in a single numpy array
npdata = np.array(northeast + west)

# create the class labels: northeast = 0, west = 1
northeast_labels = np.zeros(len(northeast), dtype=int)
west_labels = np.ones(len(west), dtype=int)
nptarget = np.concatenate([northeast_labels, west_labels])

# Average every MFCC value in each sound into one number per file,
# then append that number to the feature matrix as an extra column.
mfccs_total = northeast_mfccs + west_mfccs

# np.mean over a whole array is the same as the running total / count.
mfcc_mean = np.array([[np.mean(m)] for m in mfccs_total])
npdata_mfcc = np.concatenate((npdata, mfcc_mean), 1)


1500
711
[1.22484088e+02 5.87883217e+01 3.58400000e+00 6.20822287e+02
 7.44986744e+03 2.23269915e+04 5.21984096e+01]
(1500, 7)
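
Averaging every MFCC value into a single number throws away the difference between coefficients. One alternative, sketched here on random stand-in arrays, keeps one mean per coefficient. It assumes each array has shape (coefficients, frames); that orientation is an assumption, so check the shape of your own arrays first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two files' MFCC arrays: 13 rows, different frame counts.
fake_mfccs = [rng.normal(size=(13, 120)), rng.normal(size=(13, 95))]

# Average over the time axis so each file yields a fixed-length vector,
# regardless of how long the recording is.
per_coeff_means = np.array([m.mean(axis=1) for m in fake_mfccs])
print(per_coeff_means.shape)
```

The result has one row per file and one column per coefficient, so it can be concatenated onto the other features column-wise.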


In [7]:
### GET THE BASELINES

# Remember, your goal is to beat these two baselines:
# 1. Majority baseline
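# Use whichever class is larger, so this stays correct if the class balance changes.
print("maj_baseline:\t", np.round(max(len(west), len(northeast)) / (len(west) + len(northeast)), 4))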


# 2. Prof. Prud'hommeaux's "state of the art" (lol) classification:
#    Gaussian Naive Bayes classifier with default parameterizations
#    using the features in npdata: mean pitch, mean intensity, duration,
#    mean vowel F1, plus the summed vowel F1 and F2 appended above

# Create a Naive Bayes classifier
gnb = GaussianNB()

# Select some scoring metrics
scoring_metrics = ['accuracy', 'precision', 'recall', 'f1']

# Train a Naive Bayes model with 5-fold cross-validation on all the features in npdata
scores = cross_validate(gnb, npdata, nptarget, cv=5, scoring=scoring_metrics)

# Print the mean of each metric across the 5 folds.
for score_name, score_value in scores.items():
    if "test" in score_name:
        print(score_name, "\t", np.round(np.mean(score_value),4))

# Repeat the 5-fold cross-validation for Naive Bayes with the mean MFCC added as a feature.
scores = cross_validate(gnb, npdata_mfcc, nptarget, cv=5, scoring=scoring_metrics)

# Print the mean of each metric across the 5 folds.
for score_name, score_value in scores.items():
    if "test" in score_name:
        print(score_name, "\t", np.round(np.mean(score_value),4))

maj_baseline:	 0.5133
test_accuracy 	 0.562
test_precision 	 0.5599
test_recall 	 0.6883
test_f1 	 0.6168
test_accuracy 	 0.5667
test_precision 	 0.5636
test_recall 	 0.6922
test_f1 	 0.6209
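
The majority baseline can also be run through the same cross-validation as the classifiers, which keeps the comparison like for like. This sketch uses scikit-learn's `DummyClassifier` on made-up labels. The class balance (730 vs. 770) is an assumption chosen to be consistent with the 0.5133 figure above, not a count taken from the data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Made-up data: the features are irrelevant to a majority-class predictor.
X = np.zeros((1500, 1))
y = np.concatenate([np.zeros(730, dtype=int), np.ones(770, dtype=int)])

# "most_frequent" always predicts the larger class seen in training.
dummy = DummyClassifier(strategy="most_frequent")
dummy_scores = cross_validate(dummy, X, y, cv=5, scoring=["accuracy"])
mean_acc = np.mean(dummy_scores["test_accuracy"])
print("dummy accuracy:\t", np.round(mean_acc, 4))
```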


## Next steps

Now your work begins. You need to improve on my state-of-the-art baseline. Here are some suggestions:

### Features

* pitch and intensity range, min, max and other dynamic measures
* duration of different kinds of segments (e.g., vowels vs. consonants, voiced vs. unvoiced stops)
* formants (F1, F2, F3) of specific vowels rather than vowels in general
* presence or absence of specific phonemes
* MFCCs, log-mel filterbank features (could be easier to extract with other libraries)
* voice quality features like HNR, jitter, and shimmer (remember to extract these from vowels only!)
* features that can be extracted with [OpenSMILE](https://www.audeering.com/opensmile/)
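
As one example of the first suggestion, dynamic pitch measures can be computed from a frame-by-frame F0 contour. The sketch below works on a made-up array and assumes unvoiced frames are coded as 0 Hz (parselmouth's `pitch.selected_array['frequency']` appears to follow this convention, but verify it on your data). `pitch_stats` is a hypothetical helper, not part of the starter code.

```python
import numpy as np


def pitch_stats(f0):
    """Return [min, max, range, std] over voiced frames only."""
    voiced = f0[f0 > 0]
    if voiced.size == 0:
        # No voiced frames at all: nothing to measure.
        return [np.nan] * 4
    lo, hi = voiced.min(), voiced.max()
    return [lo, hi, hi - lo, voiced.std()]


# Made-up contour in Hz; zeros stand for unvoiced frames.
f0 = np.array([0.0, 110.0, 120.0, 0.0, 130.0, 140.0])
print(pitch_stats(f0))
```

The same pattern applies to intensity, minus the voicing filter.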

### Classifiers

* Take a deep dive into sklearn's documentation to find classifiers beyond the ones imported above.
* Explore the various parameterizations for these classifiers beyond the defaults.
* Consider deep learning classification with libraries like Keras, PyTorch, etc.
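
One way to explore parameterizations systematically is a grid search wrapped around a pipeline. The sketch below runs on made-up data standing in for `npdata` and `nptarget`; the choice of `LogisticRegression`, the scaling step, and the grid of `C` values are illustrative assumptions, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up stand-in for npdata / nptarget: 200 samples, 6 features.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, np.round(search.best_score_, 4))
```

Swapping in a different classifier only requires changing the pipeline's last step and the matching prefix in `param_grid`.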

---