# Vectorization of IPA Symbols Applied to Flapping in English
###### Ariel Haberman

The corpus used was the Carnegie Mellon University (CMU) Pronouncing Dictionary

Credit/Thanks are given to Dr Danis and Zach Glabman

Please excuse any typos/weird formatting because I wrote this all in markdown

### Goals
* See how well a classification model functions when given different amounts and types of data in the vectorization process
* Personal Goal: actually learn how to vectorize textual data instead of relying on prebuilt methods for NLP/computational linguistics work

### Steps
* Create a working corpus of words by pulling data from CMU dictionary
* Shorten each word to four letters for easier vectorization down the line
* Vectorize the data
* Fit a model with the data
* Examine and theorize about why some types of vectorization work better than others
* Try to figure out what could have been done better

### Introduction
The basic idea for this project comes from having read *Rumelhart & McClelland 1985 - On Learning the Past Tenses of English Verbs*: I did not like some of the decisions they made, but I was incredibly interested in the process. Here, I have decided to focus on my grievances with their method of vectorization and to explore other methods. To simplify the project, I have changed which process the vectorization is tested on. I wanted to focus on the smallest level of vectorization and of language, which made flapping in English a more apt test case. This change is a purely phonological one, not a morphological or morphophonological one. By looking only at a phonological change, I can try methods of vectorization that would otherwise have been useless. It is also important to note that, since I worked on the phonological level, stress had to be taken into account, something which I believe was ignored in the original paper.

I had three main grievances with the Rumelhart and McClelland paper that I wanted to address here:
1. Instead of using the regular system of features from phonology, they came up with their own system called "Wickelfeatures".
2. They were unable to translate their vectorized data, and therefore the output of the model, back into English.
3. They called their features "Wickelfeatures".

All three of these grievances were addressed in this project. The first by using Bruce Hayes' phonological features. The second by creating at least two methods of vectorization that can be translated back into their original IPA. And the third by actually giving my variables somewhat decent names for the first time in my life.

### Background and Set Up
* A flap will be represented by a 'dx' both in my notes and when the words are still in ARPABET form. This is modeled after *Gildea & Jurafsky 1996 - Learning bias and phonological-rule induction*

* In English, flapping occurs after a stressed vowel and an optional 'r', and before an unstressed vowel, or t -> dx / ˈV r* __ V. This means that, to create our dataset, the input data must include the unflapped form of words that contain the aforementioned environment, along with all the words that have a 't' but lack that environment. The output data has the 't' changed into a flap wherever the environment is met, and the same unchanged 't's for words that lack the environment.

* All the words used here were pulled from the CMU dictionary. The CMU Dictionary distinguishes between unstressed, primary stressed, and secondary stressed vowels. I made the decision to treat all stressed vowels as stressed without distinguishing between their levels of stress. They are all given the symbol for primary stress. This was done to simplify the vectorization process later on. Without this change, not all IPA characters would have a unique set of features, which would require a more complex vectorization system or an additional feature. However, our environment is a boolean one that checks whether the vowel is stressed or unstressed and doesn't care about the level of stress. Since the environment treats the two characters (primary and secondary stressed vowels) the same, I decided that the simplest solution to my problem would be to treat them as the same character (like the environment does).

* When shortening the words to four letters, the shortened word will always have the 't' or 'dx' in the third position of the word. For words that contain the environment, an unstressed vowel will always come last and the stressed vowel will come first or second, depending on the presence of an 'r' before the 't'. By keeping the 't' in the third position, the environment for flapping is always found even in the shorter version of the word. This also made checking the accuracy of the vectorization process simpler, as I always knew which line to check for the change done by flapping.

* The features used in this project are an augmented version of Bruce Hayes' features. These were taken from the features spreadsheet found on the Ling 313 page (I am unsure if Dr Danis put together the Excel page or if it's from elsewhere). The spreadsheet was augmented by adding diphthongs following the guidelines set by Gildea & Jurafsky (1996). A diphthong's features match those of its first vowel, with an added feature for either a w or y offglide. Hayes failed to account for an r-colored vowel (or I just couldn't find it), so one was added using the features found in the Pheatures app English inventory. The rhotacized vowel added was 'ɚ': the Wikipedia page on ARPABET translates 'ER' as 'ɝ', but the Pheatures spreadsheet only accounted for 'ɚ'. All vowels (including diphthongs) were then copied, and a stressed version of each vowel was added to the dataframe. The stressed version contains a stress marker adjoining the IPA symbol and is +stress in the features instead of -stress.

* The features were brought into this project from the spreadsheet into a dataframe that will be referenced below. Some of the augmentations were made to the features spreadsheet using Excel; others were made using pandas. I will be uploading my edited version of the features spreadsheet with my submission. If you have any problems loading it, please let me know.

* Sonority as referenced in this project is also based on Bruce Hayes' features. The level of sonority of a sound is between 1 and 5 and is determined by the features of the sound. The features used to decide sonority are ±syllabic, ±consonantal, ±approximant, and ±sonorant.

* I added/created a version of sonority that differentiates between stressed and unstressed vowels. The version of sonority that includes stress breaks up the category of syllabic into stressed syllabic and unstressed syllabic. My understanding of sonority made me place stressed vowels as more sonorant than unstressed vowels. This is necessary because our environment differentiates between stressed and unstressed vowels and the model should be able to see that.
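
The sonority scale described above can be sketched as a small decision function. This is only a minimal sketch under stated assumptions: `sonority_slot` and its `split_stress` flag are hypothetical names, the feature dictionaries are written by hand rather than read from the Hayes spreadsheet, and slots are numbered from 0 (most sonorous) instead of 1 to 5.

```python
def sonority_slot(feats, split_stress=False):
    """Return a sonority slot (0 = most sonorous) for one segment.

    feats maps feature names to 0/1. The names mirror the Hayes columns
    (syllabic, consonantal, sonorant, approximant, stress), but this
    helper itself is hypothetical.
    """
    if feats["syllabic"]:
        if not split_stress:
            return 0                      # all vowels share one slot
        return 0 if feats.get("stress", 0) else 1
    offset = 1 if split_stress else 0     # vowels take two slots when split
    if not feats["consonantal"]:
        return 1 + offset                 # glide
    if not feats["sonorant"]:
        return 4 + offset                 # obstruent
    if not feats["approximant"]:
        return 3 + offset                 # nasal
    return 2 + offset                     # liquid


# Hand-written feature bundles, for illustration only.
t_feats = {"syllabic": 0, "consonantal": 1, "sonorant": 0, "approximant": 0}
l_feats = {"syllabic": 0, "consonantal": 1, "sonorant": 1, "approximant": 1}
stressed_v = {"syllabic": 1, "consonantal": 0, "sonorant": 1,
              "approximant": 1, "stress": 1}
unstressed_v = dict(stressed_v, stress=0)

print(sonority_slot(t_feats), sonority_slot(t_feats, split_stress=True))
print(sonority_slot(stressed_v, split_stress=True),
      sonority_slot(unstressed_v, split_stress=True))
```

With `split_stress=False` the two vowels land in the same slot, which is why the plain sonority scale cannot tell the flapping environment apart from its stressed-vowel lookalikes.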

### Types of Vectorization Examined
>I vectorized the same set of data in six different ways, and for each used the same model and same training and testing data. The goal was to keep everything the same except for the method of vectorization. 

1. One Hot IPA Encoding - in this method of vectorization, each sound or IPA symbol is represented by a vector whose length is the number of IPA symbols. The vector contains a single "1" at the index where the symbol appears in the features dataframe. This type of vectorization gives the model no background knowledge: the model does not know what a vowel is or how it might differ from a consonant. This is the most basic type of vectorization done in this project, where basic means 'lacking any sense of a universal grammar or inherent biases'.

2. IPA Feature Vectorization - in this method, each symbol is represented (vectorized) as a list of its phonological features. All sounds are either + or - for a particular feature, which here is changed to '1' or '0'. The vector's length is the number of features, and it is guaranteed to be unique for each sound. The model here is offered much more data about the sound itself and how it might be connected to other sounds: it is given the features used to separate sounds into natural classes and create rules.

3. One Hot Sonority Encoding - in this method of vectorization each sound is represented by a vector of length 5. Like with the one hot IPA encoding the vector contains a single 1 and the rest are zeros. The position of the 1 on the vector is determined by the sonority of the sound, where the first spot on the vector is the most sonorant and the last spot the least. In this version of vectorization the model is given some background knowledge.

4. Sonority Feature Vectorization - in this method of vectorization each sound is represented by a shortened list of phonological features, only the ones used to determine the sonority level of a sound. Here, we are giving the model less data than we would be by giving it all the features, but also less noise to deal with as many features have been deemed unimportant and therefore cut.

>It's important to realize that neither sonority vectorization model could truly learn the pattern. Although they scored better, sonority doesn't differentiate between stressed and unstressed vowels. This led me to create a version of sonority that differentiates between stressed and unstressed vowels.

5. Stressed One Hot Sonority Encoding - same as above but with the change made to the definition of sonority as mentioned in the background section.

6. Stressed Sonority Feature Vectorization - same as above but with the change made to the definition of sonority as mentioned in the background section. This does give the model more information than the previous version.
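
To make the difference between the one hot and feature-based methods concrete, here is a minimal sketch on a toy inventory. Everything in it is an assumption for illustration: the four symbols, the four simplified features, and the helper names `one_hot`, `feature_vector`, and `decode` are not taken from the Hayes spreadsheet or from the cells below.

```python
import numpy as np

# Toy inventory with simplified, hand-picked binary features (illustration only;
# the notebook itself uses the full augmented Hayes table).
symbols = ["t", "ɾ", "ɪ", "ˈɪ"]
feature_names = ["syllabic", "sonorant", "voice", "stress"]
features = np.array([
    [0, 0, 0, 0],  # t
    [0, 1, 1, 0],  # ɾ
    [1, 1, 1, 0],  # ɪ
    [1, 1, 1, 1],  # ˈɪ
])


def one_hot(word):
    """One row per segment, a single 1 at the segment's inventory index."""
    vec = np.zeros((len(word), len(symbols)))
    for pos, seg in enumerate(word):
        vec[pos, symbols.index(seg)] = 1
    return vec.ravel()


def feature_vector(word):
    """One row per segment, holding that segment's feature values."""
    rows = [features[symbols.index(seg)] for seg in word]
    return np.concatenate(rows).astype(float)


def decode(one_hot_vec):
    """Translate a flattened one hot vector back into symbols."""
    idx = one_hot_vec.reshape(-1, len(symbols)).argmax(axis=1)
    return [symbols[i] for i in idx]


word = ["ˈɪ", "t", "ɪ"]
one_hot_vec = one_hot(word)
feat_vec = feature_vector(word)
print(one_hot_vec.shape, feat_vec.shape)
print(decode(one_hot_vec))
```

In the one hot vector, 'ɪ' and 'ˈɪ' are as unrelated as 'ɪ' and 't'; in the feature vector they differ in exactly one position, which is the kind of background knowledge the feature-based methods hand to the model.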

**Comments should be read, they explain code and offer other insight. There are occasional larger blocks of texts that also offer explanations**

# Creating a Dataset

In [59]:
#Importing CMU dict, splitting things up and subbing the ts for dxs
import re
import nltk
import urllib.request
cmu = nltk.corpus.cmudict.entries()

#flapping happens when t comes between a stressed vowel followed by r* and an unstressed vowel
joinedWords = []
originalWords = []
flappedWords = []
otherWords = []

#join the segments to make strings
for word, pron in cmu:
    joinedWords.append(" ".join(pron))

#make two word lists: one with the unflapped words and the words that lack the environment,
#the other with the flapped words and the words that lack the environment
#lookbehind can only take patterns with a fixed number of characters, so the versions
#of flapping with and without the r must be separated into two separate searches
for w in joinedWords:
    if(re.search(r"(1|2) (R )*T ..0", w)):
        originalWords.append(w)
        if(re.search(r"(1|2) T ..0", w)):
            new = re.sub(r"(?<=((1|2) ))T(?!..0)",'dx', w)
            flappedWords.append(new)
        elif(re.search(r"(1|2) R* T ..0", w)):
            newR = re.sub(r"(?<=((1|2) R ))T(?!..0)",'dx', w)
            flappedWords.append(newR)
    elif(re.search(r" T ", w)):
        otherWords.append(w)
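
For comparison, the flapping rule itself (stressed vowel, optional 'R', 'T', unstressed vowel) can be written as one substitution with a positive lookahead. This is a hedged sketch rather than the code used to build the dataset in this notebook: `FLAP_ENV` and `flap` are hypothetical names, and the pronunciations are CMU-style strings typed by hand for illustration.

```python
import re

# (?<=[12] )      the previous segment ends in a stress digit (1 or 2)
# ((?:R )?)       an optional R, captured so it can be put back
# T               the segment to rewrite
# (?= [A-Z]{2}0)  the next segment is a two-letter vowel marked unstressed
FLAP_ENV = re.compile(r"(?<=[12] )((?:R )?)T(?= [A-Z]{2}0)")


def flap(pron):
    """Rewrite T as dx wherever the flapping environment is met."""
    return FLAP_ENV.sub(r"\1dx", pron)


print(flap("W AO1 T ER0"))      # T between stressed and unstressed vowel
print(flap("P AA1 R T IY0"))    # same, with an intervening R
print(flap("AH0 T AE1 K"))      # T before a stressed vowel: unchanged
print(flap("M EH1 TH AH0 D"))   # TH is a different segment: unchanged
```

Because the lookahead requires a space right after the 'T', the 'T' inside 'TH' is never touched, and the optional group handles the 'R' case without a second search.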

In [60]:
#split the strings back into lists and add it to a list of lists
#figure out where in the word the t occurs and keep track of how many spaces come before it
# I feel like there should be a more clever way of doing this but I dont know what it was
original = []
ogSpaces = []
for s in originalWords:
    original.append(s.split())
    if(re.search(r"(1|2) T ..0", s)):
        t = re.search(r"(?<=((1|2) ))T(?!..0)", s)
    else:
        t = re.search(r"(?<=((1|2) R ))T(?!..0)", s)
    span = list(t.span())
    ind = span[1]
    ogSpaces.append(s.count(" ", 0, ind))
    spaces = s.count(" ", 0, ind)

#list of the two segments before t, the t itself, and one segment after
ogFour = []
i = 0
for lst in original:
    count = ogSpaces[i]
    four = lst[count-2:count+2]
    ogFour.append(four)
    i+=1

#same as above just for dx
flapped = []
flapSpaces = []
for s in flappedWords:
    flapped.append(s.split())
    if(re.search(r"(1|2) dx ..0", s)):
        t = re.search(r"(?<=((1|2) ))dx(?!..0)", s)
    else:
        t = re.search(r"(?<=((1|2) R ))dx(?!..0)", s)
    span = list(t.span())
    ind = span[1]
    flapSpaces.append(s.count(" ", 0, ind))
    spaces = s.count(" ", 0, ind)

#same as above just for dx
flapFour = []
j = 0
for lst in flapped:
    count = flapSpaces[j]
    four = lst[count-2:count+2]
    flapFour.append(four)
    j+=1
    
#same as above just for words without the environment (other words)
other = []
otherSpaces = []
for s in otherWords:
    other.append(s.split())
    if(re.search(r"(?<=(.... ))T(...)", s)):
        t = re.search(r"(?<=( ))T( )", s)
    span = list(t.span())
    ind = span[1]
    otherSpaces.append(s.count(" ", 0, ind))
    spaces = s.count(" ", 0, ind)

#same as above just for other words
otherFour = []
j = 0
for lst in other:
    count = otherSpaces[j]
    four = lst[count-3:count+1]
    otherFour.append(four)
    j+=1
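
One possibly simpler way to get the four-segment window is to work on the split list directly instead of counting spaces in the joined string. The sketch below is an assumption-laden alternative, not the method used above: `find_flap_site` and `window` are hypothetical helpers, and the example pronunciations are typed by hand.

```python
def find_flap_site(segments):
    """Index of the first T in the flapping environment, or None."""
    for i, seg in enumerate(segments):
        if seg != "T" or i == 0 or i + 1 >= len(segments):
            continue
        if not segments[i + 1].endswith("0"):
            continue                      # next segment is not an unstressed vowel
        prev = segments[i - 1]
        if prev[-1] in "12":
            return i                      # stressed vowel directly before T
        if prev == "R" and i >= 2 and segments[i - 2][-1] in "12":
            return i                      # stressed vowel, then R, then T
    return None


def window(segments, i):
    """Two segments before index i, the segment at i, and one after."""
    if i is None or i < 2 or i + 1 >= len(segments):
        return None                       # not enough context for four segments
    return segments[i - 2 : i + 2]


party = "P AA1 R T IY0".split()
water = "W AO1 T ER0".split()
attack = "AH0 T AE1 K".split()

print(window(party, find_flap_site(party)))
print(window(water, find_flap_site(water)))
print(find_flap_site(attack))
```

Returning `None` for words with fewer than two segments before the 'T' avoids the negative slice indices that a plain `lst[count-2:count+2]` would silently accept.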


In [61]:
#Modified from Dr Danis's code
def getIPA(entry):
    arpa_dict = {'AY' : 'aɪ',
'D' : 'd',
'IY' : 'i',
'V' : 'v',
'AE' : 'æ',
'JH' : 'd͡ʒ',
'UH' : 'ʊ',
'T' : 't',
'Y' : 'j',
'AH' : 'ʌ',
'G' : 'ɡ',
'Z' : 'z',
'P' : 'p',
'TH' : 'θ',
'M' : 'm',
'R' : 'ɹ',
'K' : 'k',
'EH' : 'ɛ',
'EY' : 'eɪ',
'NG' : 'ŋ',
'ZH' : 'ʒ',
'HH' : 'h',
'SH' : 'ʃ',
'OY' : 'ɔɪ',
'S' : 's',
'AO' : 'ɔ',
'F' : 'f',
'W' : 'w',
'IH' : 'ɪ',
'DH' : 'ð',
'L' : 'l',
'N' : 'n',
'CH' : 't͡ʃ',
'AA' : 'ɑ',
'B' : 'b',
'OW' : 'oʊ',
'UW' : 'u',
'AW' : 'aʊ',
'ER' : 'ɚ',
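'dx' : 'ɾ',
#a 'TH' right after a stressed vowel has its T rewritten to dx by the substitution above,
#so 'dxH' is mapped back to θ here
'dxH' : 'θ',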
            ' '  :  ' '}
    ipaList = []
    for let in entry:
        ipa = ''
        if re.search(r"0$",let):
            ipa = ipa + arpa_dict[re.sub(r"\d","",let)]

        elif re.search(r"(1$|2$)",let):
            ipa = ipa + 'ˈ' + arpa_dict[re.sub(r"\d","",let)]

        else:
            ipa = ipa + arpa_dict[re.sub(r"\d","",let)]

        ipaList.append((ipa))
    return ipaList

In [62]:
#would be useful for going back and checking individual words against what the model outputted
#decided against doing so mostly due to the confusion of making a test word in the first place
#also making a test word would rely on all of my other work working and who knows if it
#actually does
ogIPA = []
for word in original:
    ogIPA.append(getIPA(word))

flapIPA = []
for word in flapped:
    flapIPA.append(getIPA(word))

#converting the lists of four letters (words) into IPA symbols
ogFourIPA = []
for word in ogFour:
    ogFourIPA.append(getIPA(word))

flapFourIPA = []
for word in flapFour:
    flapFourIPA.append(getIPA(word))  
    
otherIPA = []
for word in other:
    otherIPA.append(getIPA(word))

otherFourIPA = []
for word in otherFour:
    otherFourIPA.append(getIPA(word))  
    
ogFourIPA.extend(otherFourIPA)
flapFourIPA.extend(otherFourIPA)

# Hayes Phonological Features

In [63]:
import pandas as pd
import numpy as np

#import Hayes features
df = pd.read_csv('hayes-features.csv', header=[0])

#add in diphthong columns
df = df.rename(columns={"Unnamed: 0": "symbols"})
df['y offglide'] = 0
df['w offglide'] = 0

#change all values from +,-,0 to 0 for - and 1 for +
df = df.where(df != '+', 1)
df = df.where(df != '-', 0)

#add diphthong offglides (.loc accepts the list of matching row labels)
ai = df.index[df['symbols'] == 'aɪ'].tolist()
df.loc[ai, 'y offglide'] = 1
ei = df.index[df['symbols'] == 'eɪ'].tolist()
df.loc[ei, 'y offglide'] = 1
ui = df.index[df['symbols'] == 'ɔɪ'].tolist()
df.loc[ui, 'y offglide'] = 1
au = df.index[df['symbols'] == 'aʊ'].tolist()
df.loc[au, 'w offglide'] = 1
ou = df.index[df['symbols'] == 'oʊ'].tolist()
df.loc[ou, 'w offglide'] = 1

#add stressed vowels
vowels = df.where(df['syllabic']>0)
vowels = vowels.dropna()
vowels['stress'] = 1

vowelsOne = vowels.copy()
vowelsOne['symbols'] = 'ˈ' + vowelsOne['symbols']

df = pd.concat([df, vowelsOne], ignore_index=True)



In [64]:
#edited to add changes suggested by Dr Danis
#make output binary classification instead of a vector

#it's a df of inputs and outputs where the words are tuples
#tuples b/c they're immutable and I want to delete duplicates later
flap_df = pd.DataFrame({'input': [tuple(x) for x in ogFourIPA], 'output': [tuple(x) for x in flapFourIPA]})

#add a column to track changes
#this should ideally be done at time of creating the output
#but it's easier to do it this way for now
flap_df['flapped'] = flap_df['input'] != flap_df['output']

In [65]:
#remove duplicate values from the input column to deal with words that become the same when reduced
#to four segments
#don't want the same data point in train and test sets
unique_df = flap_df.loc[~flap_df.duplicated(subset=['input'])]
unique_df

| | input | output | flapped |
|---|---|---|---|
| 0 | (b, ˈeɪ, t, ɪ) | (b, ˈeɪ, ɾ, ɪ) | True |
| 2 | (i, ˈeɪ, t, ʌ) | (i, ˈeɪ, ɾ, ʌ) | True |
| 3 | (i, ˈeɪ, t, ɪ) | (i, ˈeɪ, ɾ, ɪ) | True |
| 5 | (k, ˈeɪ, t, ʌ) | (k, ˈeɪ, ɾ, ʌ) | True |
| 6 | (k, ˈeɪ, t, ɪ) | (k, ˈeɪ, ɾ, ɪ) | True |
| ... | ... | ... | ... |
| 31661 | (z, ˈɪ, ɡ, ɚ) | (z, ˈɪ, ɡ, ɚ) | False |
| 31664 | (z, ˈaɪ, ʌ, n) | (z, ˈaɪ, ʌ, n) | False |
| 31665 | (l, ˈɑ, t, ˈʌ) | (l, ˈɑ, t, ˈʌ) | False |
| 31669 | (z, oʊ, ˈɑ, l) | (z, oʊ, ˈɑ, l) | False |


In [66]:
#y data / model output
y = unique_df['flapped'].astype(int).to_numpy()
print(y.shape)
y

(5867,)


array([1, 1, 1, ..., 0, 0, 0])

In [67]:
#CELLS MUST BE RUN IN ORDER
#this cell overwrites previous data
#every for-loop that does actual vectorization needs to iterate through the column of unique_df
ogFourIPA = [list(word) for word in unique_df['input'].tolist()]

# One Hot IPA Encoding

In [68]:
#This is the vectorization for the one hot encoding vectorization
ogFourVec = []
for word in ogFourIPA:
    ogFourVec.append(word.copy())
    
print(ogFourIPA[:5])
#https://stackoverflow.com/questions/21800169/python-pandas-get-index-of-rows-which-column-matches-certain-value
#basically takes the column of symbols and makes it a 1 at that symbol and a 0 elsewhere
i=0
for word in ogFourVec:
    j=0
    fourVec = np.zeros((4, len(df)))
    for letter in word: 
        ind = df.index[df['symbols'] == letter].tolist()
        fourVec[j][ind] = 1
        j+=1
    ogFourVec[i] = fourVec
    i+=1
    
ogFourVec[3]

[['b', 'ˈeɪ', 't', 'ɪ'], ['i', 'ˈeɪ', 't', 'ʌ'], ['i', 'ˈeɪ', 't', 'ɪ'], ['k', 'ˈeɪ', 't', 'ʌ'], ['k', 'ˈeɪ', 't', 'ɪ']]


array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.,

In [69]:
#this was very much done with Dr Danis's help
#Turning the 4 letter vectors into a single array per word
i = 0
for word in ogFourVec:
    ogFourVec[i] = word.ravel(order='C')
    i+=1
print(ogFourVec[3])

#turning the word arrays into a giant array of all the words
ogFourArray = np.zeros((len(ogFourIPA), 4*len(df)))
for index, word in enumerate(ogFourVec):
    ogFourArray[index,:] = word
    
print(ogFourArray.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
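
The ravel-then-copy step can also be expressed as a single stack and reshape. This is a small sketch on made-up data, assuming every word has already been turned into a matrix of the same shape; `per_word`, `stacked`, and `flat` are illustrative names, not variables from this notebook.

```python
import numpy as np

# Three made-up "words", each a 4 x 6 matrix (4 segments, 6 columns per segment).
rng = np.random.default_rng(0)
per_word = [rng.integers(0, 2, size=(4, 6)).astype(float) for _ in range(3)]

stacked = np.stack(per_word)                # shape (3, 4, 6)
flat = stacked.reshape(len(per_word), -1)   # shape (3, 24), one row per word

print(stacked.shape, flat.shape)
```

`reshape` uses row-major (C) order by default, so each row of `flat` matches what `ravel(order='C')` produces for the corresponding word.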

In [71]:
#split the data set
from sklearn.model_selection import train_test_split as tts

#split the IPA data and not just vectorized for easier checking of correct vectorization
#random state = 10 for all split so will always have the same words in the same groups/orders
trainInIPA, testInIPA, trainOutIPA, testOutIPA = tts(ogFourIPA,y, test_size=.2, random_state=10)
print(trainInIPA[:5])
print(trainOutIPA[:5])

trainIn, testIn, trainOut, testOut = tts(ogFourArray,y, test_size=.2, random_state=10)
print(trainIn[:5])
print(trainOut[:5])

[['b', 'ˈɪ', 't', 's'], ['ɑ', 'ɹ', 'k', 'ɪ'], ['p', 'ˈʌ', 't', 's'], ['s', 'ˈɑ', 't', 'h'], ['h', 'ɪ', 't', 'ʌ']]
[0 0 0 0 0]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[0 0 0 0 0]


In [74]:
from sklearn.neighbors import KNeighborsClassifier

# warning this will take a bit of time
# this model actually takes the longest
# the best fit is when k=14 so you can just run that if you prefer (commented out below)
oldScore = 0
ind=0
for i in range(1,15):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(trainIn,trainOut)
    score = new_model.score(testIn, testOut)
    print('score: ' + str(score))
    print('n: ' + str(i))
    if(score>oldScore):
        oldScore=score
        ind = i
        
oneHotIPAWinner = oldScore
print('winner: ' + str(oldScore))
print('winner index: ' + str(ind))

# new_model = KNeighborsClassifier(n_neighbors=14)
# new_model.fit(trainIn,trainOut)
# score = new_model.score(testIn, testOut)
# oneHotIPAWinner = score
# print(score)

score: 0.8867120954003407
n: 1
score: 0.8986371379897785
n: 2
score: 0.919931856899489
n: 3
score: 0.9258943781942078
n: 4
score: 0.9361158432708688
n: 5
score: 0.9437819420783645
n: 6
score: 0.9437819420783645
n: 7
score: 0.9471890971039182
n: 8
score: 0.9565587734241908
n: 9
score: 0.9531516183986372
n: 10
score: 0.9625212947189097
n: 11
score: 0.9599659284497445
n: 12
score: 0.9633730834752982
n: 13
score: 0.9642248722316865
n: 14
winner: 0.9642248722316865
winner index: 14
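
Since the same search over k is repeated for every vectorization, it could be factored into one helper. The sketch below is only an illustration: `sweep_k` is a hypothetical function, and the scores are invented stand-ins, not results from any model in this notebook.

```python
def sweep_k(fit_and_score, ks):
    """Return (best_k, best_score); on ties the smallest k wins."""
    best_k, best_score = None, float("-inf")
    for k in ks:
        score = fit_and_score(k)
        if score > best_score:      # strict comparison keeps the earliest best k
            best_k, best_score = k, score
    return best_k, best_score


# Invented scores standing in for a real model, to show the tie-breaking.
fake_scores = {1: 0.82, 2: 0.83, 3: 0.81, 4: 0.83}
best_k, best_score = sweep_k(fake_scores.get, range(1, 5))
print(best_k, best_score)
```

In this notebook the callable could be something like `lambda k: KNeighborsClassifier(n_neighbors=k).fit(trainIn, trainOut).score(testIn, testOut)`, with the train and test arrays swapped out for each vectorization.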


# Feature Vector Encoding

In [75]:
print(ogFourIPA[:5])
ogFourFeatVec = []
for word in ogFourIPA:
    ogFourFeatVec.append(word.copy())

#this basically takes a row from the dataframe instead of taking a column
i=0
for word in ogFourFeatVec:
    j=0
    fourVec = np.zeros((4, df.shape[1]-1))
    for letter in word:
        ind = df.index[df['symbols'] == letter].tolist()
        fourVec[j] = df.iloc[ind, 1:].values[0]
        j+=1
    ogFourFeatVec[i] = fourVec
    i+=1
    
ogFourFeatVec[1]

[['b', 'ˈeɪ', 't', 'ɪ'], ['i', 'ˈeɪ', 't', 'ʌ'], ['i', 'ˈeɪ', 't', 'ɪ'], ['k', 'ˈeɪ', 't', 'ʌ'], ['k', 'ˈeɪ', 't', 'ɪ']]


array([[1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.],
       [1., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.]])

In [76]:
#same as above (also you helped me with this stuff)
#Turning the 4 letter vectors into a single array per word
i = 0
for word in ogFourFeatVec:
    ogFourFeatVec[i] = word.ravel(order='C')
    i+=1
print(ogFourFeatVec[1])

#turning the word arrays into a giant array of all the words
ogFourFeatArray = np.zeros((len(ogFourIPA), 4*(df.shape[1]-1)))
for index, word in enumerate(ogFourFeatVec):
    ogFourFeatArray[index,:] = word


[1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1.
 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0.]


In [77]:
flapFourFeatVec = []
for word in flapFourIPA:
    flapFourFeatVec.append(word.copy())

i=0
for word in flapFourFeatVec:
    j=0
    fourVec = np.zeros((4, df.shape[1]-1))
    for letter in word:
        ind = df.index[df['symbols'] == letter].tolist()
        fourVec[j] = df.iloc[ind, 1:].values[0]
        j+=1
    flapFourFeatVec[i] = fourVec
    i+=1
    
flapFourFeatVec[1]

array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 1., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0.,
        0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0.]])

In [78]:
#Turning the 4 letter vectors into a single array per word
i = 0
for word in flapFourFeatVec:
    flapFourFeatVec[i] = word.ravel(order='C')
    i+=1
print(flapFourFeatVec[3])

#turning the word arrays into a giant array of all the words
flapFourFeatArray = np.zeros((len(flapFourIPA), 4*(df.shape[1]-1)))
for index, word in enumerate(flapFourFeatVec):
    flapFourFeatArray[index,:] = word
    
flapFourFeatArrayFlat = flapFourFeatArray.flatten()

[1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 0. 0. 1.
 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1.
 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0.]


In [80]:
trainInFeat, testInFeat, trainOutFeat, testOutFeat = tts(ogFourFeatArray,y, test_size=.2, random_state=10)

oldScore = 0
ind=0
for i in range(1,15):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(trainInFeat,trainOutFeat)
    score = new_model.score(testInFeat, testOutFeat)
    print('n: ' + str(i))
    print('score: ' + str(score))
    if(score>oldScore):
        oldScore=score
        ind = i

featuresIPAWinner = oldScore
print('winner index: ' + str(ind))
print('winner: ' + str(oldScore))

#Just runs the best k
# new_model = KNeighborsClassifier(n_neighbors=8)
# new_model.fit(trainInFeat,trainOutFeat)
# score = new_model.score(testInFeat, testOutFeat)
# featuresIPAWinner = score
# print(score)

n: 1
score: 0.9156729131175468
n: 2
score: 0.9190800681431005
n: 3
score: 0.9395229982964225
n: 4
score: 0.944633730834753
n: 5
score: 0.9403747870528109
n: 6
score: 0.948892674616695
n: 7
score: 0.9429301533219762
n: 8
score: 0.9531516183986372
n: 9
score: 0.9497444633730835
n: 10
score: 0.9531516183986372
n: 11
score: 0.9471890971039182
n: 12
score: 0.9531516183986372
n: 13
score: 0.9471890971039182
n: 14
score: 0.9522998296422487
winner index: 8
winner: 0.9531516183986372


# Sonority Data

In [81]:
#It's all just hard coded
v = 0
g = 1
l = 2
n = 3
o = 4

#I thought about pulling this data from the features dataframe each time and realized that this
# was much much easier
# here the columns are syllabic, consonantal, approximant, sonorant
vow = np.array([1, 0, 1, 1])
gld = np.array([0, 0, 1, 1])
liq = np.array([0, 1, 1, 1])
nas = np.array([0, 1, 0, 1])
obs = np.array([0, 1, 0, 0])

# One Hot Sonority Encoding
## This was somehow the most successful model 
##### Which is only a tiny bit suspicious

In [82]:
ogFourSonVec = []
for word in ogFourIPA:
    ogFourSonVec.append(word.copy())

#it works like the IPA one hot encoding but it checks the features of a given symbol to figure out
#what the sonority score is
i=0
for word in ogFourSonVec:
    j=0
    fourVec = np.zeros((4, 5))
    for letter in word:
        ind = df.index[df['symbols'] == letter].tolist()
        #vowels
        if(df.loc[ind[0]].at['syllabic'] > 0):
            fourVec[j][v] = 1
        #glides
        elif(df.loc[ind[0]].at['consonantal'] == 0):
            fourVec[j][g] = 1
        #obstruents
        elif (df.loc[ind[0]].at['sonorant'] == 0):
            fourVec[j][o] = 1
        #nasals
        elif(df.loc[ind[0]].at['approximant'] == 0):
            fourVec[j][n] = 1
        #liquids
        else:
            fourVec[j][l] = 1
        j+=1
    ogFourSonVec[i] = fourVec
    i+=1
    
print(ogFourIPA[10])  
print(ogFourSonVec[10])

['l', 'ˈu', 't', 'ɪ']
[[0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]]


In [83]:
#Turning the 4 letter vectors into a single array per word
i = 0
for word in ogFourSonVec:
    ogFourSonVec[i] = word.ravel(order='C')
    i+=1
print(ogFourSonVec[10])

#turning the word arrays into a giant array of all the words
ogFourSonArray = np.zeros((len(ogFourIPA), 4*5))
for index, word in enumerate(ogFourSonVec):
    ogFourSonArray[index,:] = word

[0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]


In [84]:
trainInSon, testInSon, trainOutSon, testOutSon = tts(ogFourSonArray,y, test_size=.2, random_state=10)

#this is where things get really sketchy
#how are there perfect scores all around? 
oldScore = 0
ind=0
for i in range(1,15):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(trainInSon,trainOutSon)
    score = new_model.score(testInSon, testOutSon)
    print('n: ' + str(i))
    print('score: ' + str(score))
    if(score>oldScore):
        oldScore=score
        ind = i

oneHotSonWinner = oldScore
print('winner index: ' + str(ind))
print('winner: ' + str(oldScore))

n: 1
score: 0.8219761499148212
n: 2
score: 0.82793867120954
n: 3
score: 0.8160136286201022
n: 4
score: 0.82793867120954
n: 5
score: 0.823679727427598
n: 6
score: 0.82793867120954
n: 7
score: 0.82793867120954
n: 8
score: 0.82793867120954
n: 9
score: 0.823679727427598
n: 10
score: 0.82793867120954
n: 11
score: 0.82793867120954
n: 12
score: 0.82793867120954
n: 13
score: 0.82793867120954
n: 14
score: 0.82793867120954
winner index: 2
winner: 0.82793867120954


# Sonority Feature Encoding

In [85]:
ogFourSonFeatVec = []
for word in ogFourIPA:
    ogFourSonFeatVec.append(word.copy())

#instead of taking slices of the row from the main dataframe like above, we use the hardcoded 
#arrays for the vectorization
i=0
for word in ogFourSonFeatVec:
    j=0
    fourVec = np.zeros((4, 4))
    for letter in word:
        ind = df.index[df['symbols'] == letter].tolist()
        #vowels
        if(df.loc[ind[0]].at['syllabic'] == 1):
            fourVec[j] = vow
        #glides
        elif(df.loc[ind[0]].at['consonantal'] == 0):
            fourVec[j] = gld
        #obstruents
        elif (df.loc[ind[0]].at['sonorant'] == 0):
            fourVec[j] = obs
        #nasals
        elif(df.loc[ind[0]].at['approximant'] == 0):
            fourVec[j] = nas
        #liquids
        else:
            fourVec[j] = liq
        j+=1
    ogFourSonFeatVec[i] = fourVec
    i+=1
    
print(ogFourIPA[10])  
print(ogFourSonFeatVec[10])

['l', 'ˈu', 't', 'ɪ']
[[0. 1. 1. 1.]
 [1. 0. 1. 1.]
 [0. 1. 0. 0.]
 [1. 0. 1. 1.]]


In [86]:
#Turning the 4 letter vectors into a single array per word
i = 0
for word in ogFourSonFeatVec:
    ogFourSonFeatVec[i] = word.ravel(order='C')
    i+=1

#turning the word arrays into a giant array of all the words
ogFourSonFeatArray = np.zeros((len(ogFourIPA), 4*4))
for index, word in enumerate(ogFourSonFeatVec):
    ogFourSonFeatArray[index,:] = word

In [87]:
trainInSonFeat, testInSonFeat, trainOutSonFeat, testOutSonFeat = tts(ogFourSonFeatArray,y, test_size=.2, random_state=10)

oldScore = 0
ind=0
for i in range(1,15):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(trainInSonFeat,trainOutSonFeat)
    score = new_model.score(testInSonFeat, testOutSonFeat)
    print('n: ' + str(i))
    print('score: ' + str(score))
    if(score>oldScore):
        oldScore=score
        ind = i

featSonWinner = oldScore
print('winner index: ' + str(ind))
print('winner: ' + str(oldScore))

n: 1
score: 0.8160136286201022
n: 2
score: 0.82793867120954
n: 3
score: 0.7921635434412265
n: 4
score: 0.8015332197614992
n: 5
score: 0.8015332197614992
n: 6
score: 0.82793867120954
n: 7
score: 0.8015332197614992
n: 8
score: 0.8015332197614992
n: 9
score: 0.7947189097103918
n: 10
score: 0.8015332197614992
n: 11
score: 0.7947189097103918
n: 12
score: 0.8015332197614992
n: 13
score: 0.8015332197614992
n: 14
score: 0.8015332197614992
winner index: 2
winner: 0.82793867120954


# Sonority with Stress

In [88]:
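#It's all just hardcoded
#syllabic is being replaced by two categories - syllabic stressed and syllabic unstressed
#each name below is the one-hot column index for one sonority class
vs = 0 #stressed vowel
vu = 1 #unstressed vowel
g = 2 #glide
l = 3 #liquid
n = 4 #nasal
o = 5 #obstruent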

# here the columns are syllabic, stress, consonantal, approximant, sonorant
vows = np.array([1, 1, 0, 1, 1])
vowu = np.array([1, 0, 0, 1, 1])
gld = np.array([0, 0, 0, 1, 1])
liq = np.array([0, 0, 1, 1, 1])
nas = np.array([0, 0, 1, 0, 1])
obs = np.array([0, 0, 1, 0, 0])
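
The cells that follow assign these classes with the same if/elif chain each time. As a rough sketch, that chain could live in one helper. The names `CLASS_VECTORS` and `sonority_class`, and the feature values in the example word, are assumptions made for illustration and are not part of the notebook. Plain tuples and dicts are used here so the sketch has no dependencies.

```python
# Feature vector per sonority class, in the column order used above:
# (syllabic, stress, consonantal, approximant, sonorant)
CLASS_VECTORS = {
    "vowel_stressed":   (1, 1, 0, 1, 1),
    "vowel_unstressed": (1, 0, 0, 1, 1),
    "glide":            (0, 0, 0, 1, 1),
    "liquid":           (0, 0, 1, 1, 1),
    "nasal":            (0, 0, 1, 0, 1),
    "obstruent":        (0, 0, 1, 0, 0),
}


def sonority_class(feats):
    """Return the sonority class name for one symbol.

    `feats` is a hypothetical dict of feature values for one symbol,
    for example one row of the feature table turned into a dict.
    """
    if feats["syllabic"] > 0:
        return "vowel_stressed" if feats["stress"] > 0 else "vowel_unstressed"
    if feats["consonantal"] == 0:
        return "glide"
    if feats["sonorant"] == 0:
        return "obstruent"
    if feats["approximant"] == 0:
        return "nasal"
    return "liquid"


# Illustrative (assumed) feature values for the symbols l, ˈu, t, ɪ.
word = [
    {"syllabic": 0, "stress": 0, "consonantal": 1, "approximant": 1, "sonorant": 1},
    {"syllabic": 1, "stress": 1, "consonantal": 0, "approximant": 1, "sonorant": 1},
    {"syllabic": 0, "stress": 0, "consonantal": 1, "approximant": 0, "sonorant": 0},
    {"syllabic": 1, "stress": 0, "consonantal": 0, "approximant": 1, "sonorant": 1},
]
classes = [sonority_class(f) for f in word]
vectors = [CLASS_VECTORS[c] for c in classes]
```

With a helper like this, the one-hot and feature versions would differ only in what they look up for each class name.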

## One Hot Sonority Encoding with Stress
#### Which still produces a suspicious model

In [89]:
ogFourSonStressVec = []
for word in ogFourIPA:
    ogFourSonStressVec.append(word.copy())

#it works like the IPA one hot encoding but it checks the features of a given symbol to figure out
#what the sonority class is
i=0
for word in ogFourSonStressVec:
    j=0
    fourVec = np.zeros((4, 6))
    for letter in word:
        ind = df.index[df['symbols'] == letter].tolist()
        #vowels
        if(df.loc[ind[0]].at['syllabic'] > 0):
            if(df.loc[ind[0]].at['stress'] > 0):
                fourVec[j][vs] = 1
            else:
                fourVec[j][vu] = 1
        #glides
        elif(df.loc[ind[0]].at['consonantal'] == 0):
            fourVec[j][g] = 1
        #obstruents
        elif (df.loc[ind[0]].at['sonorant'] == 0):
            fourVec[j][o] = 1
        #nasals
        elif(df.loc[ind[0]].at['approximant'] == 0):
            fourVec[j][n] = 1
        #liquids
        else:
            fourVec[j][l] = 1
        j+=1
    ogFourSonStressVec[i] = fourVec
    i+=1
    
print(ogFourIPA[10])  
print(ogFourSonStressVec[10])

#Turning the 4 letter vectors into a single array per word
i = 0
for word in ogFourSonStressVec:
    ogFourSonStressVec[i] = word.ravel(order='C')
    i+=1
print(ogFourSonStressVec[10])

#turning the word arrays into a giant array of all the words
ogFourSonStressArray = np.zeros((len(ogFourIPA), 4*6))
for index, word in enumerate(ogFourSonStressVec):
    ogFourSonStressArray[index,:] = word

['l', 'ˈu', 't', 'ɪ']
[[0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0.]]
[0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]


In [90]:
trainInSonStress, testInSonStress, trainOutSonStress, testOutSonStress= tts(ogFourSonStressArray,y, test_size=.2, random_state=10)

oldScore = 0
ind=0
for i in range(1,15):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(trainInSonStress,trainOutSonStress)
    score = new_model.score(testInSonStress, testOutSonStress)
    print('n: ' + str(i))
    print('score: ' + str(score))
    if(score>oldScore):
        oldScore=score
        ind = i

oneHotSonStressWinner = oldScore
print('winner index: ' + str(ind))
print('winner: ' + str(oldScore))

n: 1
score: 0.9480408858603067
n: 2
score: 0.9540034071550255
n: 3
score: 0.9778534923339012
n: 4
score: 0.9540034071550255
n: 5
score: 0.9514480408858603
n: 6
score: 0.9514480408858603
n: 7
score: 0.975298126064736
n: 8
score: 0.975298126064736
n: 9
score: 0.975298126064736
n: 10
score: 0.975298126064736
n: 11
score: 0.975298126064736
n: 12
score: 0.975298126064736
n: 13
score: 0.975298126064736
n: 14
score: 0.975298126064736
winner index: 3
winner: 0.9778534923339012


## Stressed Sonority Feature Vectorization

In [91]:
ogFourSonFeatStressVec = []
for word in ogFourIPA:
    ogFourSonFeatStressVec.append(word.copy())

#instead of taking slices of the row from the main dataframe like above, we use the hardcoded 
#arrays for the vectorization
i=0
for word in ogFourSonFeatStressVec:
    j=0
    fourVec = np.zeros((4, 5))
    for letter in word:
        ind = df.index[df['symbols'] == letter].tolist()
        #vowels
        if(df.loc[ind[0]].at['syllabic'] == 1):
            if(df.loc[ind[0]].at['stress'] > 0):
                fourVec[j] = vows
            else:
                fourVec[j] = vowu
        #glides
        elif(df.loc[ind[0]].at['consonantal'] == 0):
            fourVec[j] = gld
        #obstruents
        elif (df.loc[ind[0]].at['sonorant'] == 0):
            fourVec[j] = obs
        #nasals
        elif(df.loc[ind[0]].at['approximant'] == 0):
            fourVec[j] = nas
        #liquids
        else:
            fourVec[j] = liq
        j+=1
    ogFourSonFeatStressVec[i] = fourVec
    i+=1
    
print(ogFourIPA[10])  
print(ogFourSonFeatStressVec[10])

#Turning the 4 letter vectors into a single array per word
i = 0
for word in ogFourSonFeatStressVec:
    ogFourSonFeatStressVec[i] = word.ravel(order='C')
    i+=1

#turning the word arrays into a giant array of all the words
ogFourSonFeatStressArray = np.zeros((len(ogFourIPA), 4*5))
for index, word in enumerate(ogFourSonFeatStressVec):
    ogFourSonFeatStressArray[index,:] = word

['l', 'ˈu', 't', 'ɪ']
[[0. 0. 1. 1. 1.]
 [1. 1. 0. 1. 1.]
 [0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 1.]]


In [92]:
trainInSonFeatStress, testInSonFeatStress, trainOutSonFeatStress, testOutSonFeatStress = tts(ogFourSonFeatStressArray,y, test_size=.2, random_state=10)

oldScore = 0
ind=0
for i in range(1,15):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(trainInSonFeatStress,trainOutSonFeatStress)
    score = new_model.score(testInSonFeatStress, testOutSonFeatStress)
    print('n: ' + str(i))
    print('score: ' + str(score))
    if(score>oldScore):
        oldScore=score
        ind = i

featSonStressWinner = oldScore
print('winner index: ' + str(ind))
print('winner: ' + str(oldScore))

n: 1
score: 0.9727427597955707
n: 2
score: 0.944633730834753
n: 3
score: 0.9787052810902896
n: 4
score: 0.9684838160136287
n: 5
score: 0.9787052810902896
n: 6
score: 0.9787052810902896
n: 7
score: 0.9761499148211243
n: 8
score: 0.9761499148211243
n: 9
score: 0.9761499148211243
n: 10
score: 0.9761499148211243
n: 11
score: 0.9761499148211243
n: 12
score: 0.9761499148211243
n: 13
score: 0.9744463373083475
n: 14
score: 0.9744463373083475
winner index: 3
winner: 0.9787052810902896


# Final Results

In [93]:
print("One Hot IPA Encoding Vectorization: " + str(round(oneHotIPAWinner*100, 2)) + '%')
print("Hayes Phonetics Features Vectorization: {:0.2f}%\n".format(featuresIPAWinner*100))
print("One Hot Sonority Encoding Vectorization: " + str(round(oneHotSonWinner*100, 2)) + '%')
print("Sonority Phonetics Features Vectorization: {:0.2f}%\n".format(featSonWinner*100))
print("Stressed One Hot Sonority Encoding Vectorization: " + str(round(oneHotSonStressWinner*100, 2)) + '%')
print("Stressed Sonority Phonetics Features Vectorization: {:0.2f}%\n".format(featSonStressWinner*100))

One Hot IPA Encoding Vectorization: 96.42%
Hayes Phonetics Features Vectorization: 95.32%

One Hot Sonority Encoding Vectorization: 82.79%
Sonority Phonetics Features Vectorization: 82.79%

Stressed One Hot Sonority Encoding Vectorization: 97.79%
Stressed Sonority Phonetics Features Vectorization: 97.87%



>Coming into this project, I was sure the phonological feature vectorization would perform better than the one-hot IPA encoding. Even though this was a very simple test case, I am very surprised at how well the one-hot IPA encoding did. The best accuracy was pretty much the same for the two models that did not use sonority (96.42% and 95.32%). However, for values of k other than the best one, the feature model was typically more accurate than the one-hot model at the same k.

>As seen by the model scores, sonority with stress was the better way to vectorize the text for this task. However, I'm still really suspicious of the very high-scoring sonority one-hot encodings. As much as I looked over my code, I keep feeling like I must have done something wrong to get those scores. I did find it interesting that the sonority features vectorization was more accurate when stress was added (82.79% without stress, 97.87% with it).

>The downside to vectorizing the data using sonority scores is that you run into a problem similar to the one Rumelhart and McClelland had. You can undo the flattening of the arrays that just use IPA symbols and features. You can then work backwards, using the index of the one in the vector or checking the features, to figure out what your original inputs were. That can't be done with the more accurate sonority models, because many symbols share one sonority class. The higher-accuracy models are actually less useful.
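
>A minimal sketch of that difference, using plain lists rather than the notebook's arrays. The five-symbol inventory, the `SONORITY` table, and the second word are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical mini-inventory; the real encoding has one column per
# IPA symbol in the feature table.
SYMBOLS = ["l", "ˈu", "t", "d", "ɪ"]

# Hypothetical symbol-to-class table standing in for the feature checks.
SONORITY = {
    "l": "liquid",
    "ˈu": "vowel_stressed",
    "t": "obstruent",
    "d": "obstruent",
    "ɪ": "vowel_unstressed",
}


def encode_word(word):
    """Flatten a word into concatenated one-hot symbol vectors."""
    flat = []
    for symbol in word:
        vec = [0] * len(SYMBOLS)
        vec[SYMBOLS.index(symbol)] = 1
        flat.extend(vec)
    return flat


def decode_word(flat):
    """Undo encode_word: cut the vector into chunks and find each 1."""
    width = len(SYMBOLS)
    chunks = [flat[i:i + width] for i in range(0, len(flat), width)]
    return [SYMBOLS[chunk.index(1)] for chunk in chunks]


word_a = ["l", "ˈu", "t", "ɪ"]
word_b = ["l", "ˈu", "d", "ɪ"]  # made-up contrast word

# One-hot symbol vectors round-trip exactly.
recovered = decode_word(encode_word(word_a))

# Sonority classes do not: both words give the same class sequence,
# so the original symbols cannot be recovered from it.
classes_a = [SONORITY[s] for s in word_a]
classes_b = [SONORITY[s] for s in word_b]
same_classes = classes_a == classes_b
```

>Here `recovered` equals `word_a`, while `same_classes` is true even though the two words differ in their third symbol.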

>This raises the question of which method was actually better: the one that produced a more accurate model, or the one that has usable outputs?

# Mistakes and Where to go From Here
>There's a problem with how the training/test split was originally done. Because only a short window of each word is preserved (four symbols in the code above), there is a large amount of neutralization: words with the same flapping environment end up as identical data points. So when the training/test split happens, you essentially end up with the same data point in the test data as in the training data, making the accuracy artificially high. This might be fixed now, but it's hard to know.
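
>One way to rule this out, sketched under the assumption that each row of `X` is one vectorized word, is to split by unique row so that identical rows always land on the same side. The function `group_split` is a hypothetical helper, not something from the notebook, and here `test_size` is a fraction of the unique rows rather than of all samples.

```python
import random


def group_split(X, test_size=0.2, seed=10):
    """Return (train_idx, test_idx) with identical rows kept on one side."""
    # Group sample indices by their (hashable) feature row.
    groups = {}
    for i, row in enumerate(X):
        groups.setdefault(tuple(row), []).append(i)

    # Shuffle the unique rows, then hand whole groups to one side.
    keys = list(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(test_size * len(keys)))
    test_idx = sorted(i for key in keys[:n_test] for i in groups[key])
    train_idx = sorted(i for key in keys[n_test:] for i in groups[key])
    return train_idx, test_idx


# Toy data with repeated rows, standing in for the vectorized words.
X = [[0, 1, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0],
     [1, 1, 0], [0, 0, 1], [1, 0, 1]]
train_idx, test_idx = group_split(X)

# No feature row appears on both sides of the split.
train_rows = {tuple(X[i]) for i in train_idx}
test_rows = {tuple(X[i]) for i in test_idx}
overlap = train_rows & test_rows
```

>If `overlap` is empty, the test score can no longer be inflated by exact copies of training rows. scikit-learn's `GroupShuffleSplit` offers a similar group-aware split.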

>I tried using some other models, namely SGDClassifier and logistic regression; however, both required my data to have a very different shape. When I vectorized my data I used arrays and didn't label anything, which gave me issues with both of these models. I think most of these models are used in cases where y is a single data point per example, not a whole vector the size of the input data.
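
>Both of those scikit-learn classifiers expect `y` to be a one-dimensional array with one class label per sample. A possible workaround, assuming each target row is the vectorized output form of a word and that a flapped word is one whose output differs from its input, is to collapse each target vector into a single binary label. The name `flap_labels` and the toy vectors are hypothetical.

```python
def flap_labels(inputs, outputs):
    """Return 1 where the output vector differs from the input, else 0."""
    return [int(list(x) != list(t)) for x, t in zip(inputs, outputs)]


# Toy vectors: only the first word changes between input and output.
inputs = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
outputs = [[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
labels = flap_labels(inputs, outputs)
```

>A label list like this has the one-label-per-sample shape those models expect, though it only says whether a word changes, not what it changes into.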

>So although I wasn't that happy with KNN for either of the first two methods of vectorization, I still stuck with it. I also wanted to use the same model across all the methods of vectorization, because this was really about comparing the vectors, not the models.

>The takeaway from all this is that, while I am very happy with the ideas behind the vectorization of the data, I am missing a fundamental part: properly preparing it for a model.

>If I had used a different model, I would have struggled with the process of reshaping my arrays and unvectorizing my data to see what the model output. This can only be done for the IPA/phonological methods of vectorization and not the sonority ones, but it is still a theoretical step that I am not doing. This might actually require a prediction model, though, and not a classification model like the one I used.