<h1><center>Project: Predict Stress in English Words</center></h1>


In this project, you need to build a classifier to predict the stresses for a list of English words.

<h1><center>1. Represent Pronunciations of English Words</center></h1>


### 1.1 Phonemes

We use the following symbols to represent phonemes of English words.

* Vowel phonemes: AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW
* Consonant phonemes: P, B, CH, D, DH, F, G, HH, JH, K, L, M, N, NG, R, S, SH, T, TH, V, W, Y, Z, ZH

| Vowel |  IPA  | Consonant | IPA | Consonant | IPA |
|:-----:|:-----:|:---------:|:---:|:---------:|:---:|
|   AA  |   ɑ   |     P     |  p  |     S     |  s  |
|   AE  |   æ   |     B     |  b  |     SH    |  ʃ  |
|   AH  | ə / ʌ |     CH    |  tʃ |     T     |  t  |
|   AO  |   ɔ   |     D     |  d  |     TH    |  θ  |
|   AW  |   aʊ  |     DH    |  ð  |     V     |  v  |
|   AY  |   aɪ  |     F     |  f  |     W     |  w  |
|   EH  |   ɛ   |     G     |  g  |     Y     |  j  |
|   ER  |   ɜːr |     HH    |  h  |     Z     |  z  |
|   EY  |   eɪ  |     JH    |  dʒ |     ZH    |  ʒ  |
|   IH  |   ɪ   |     K     |  k  |           |     |
|   IY  |   i   |     L     |  l  |           |     |
|   OW  |   oʊ  |     M     |  m  |           |     |
|   OY  |   ɔɪ  |     N     |  n  |           |     |
|   UH  |   ʊ   |     NG    |  ŋ  |           |     |
|   UW  |   u   |     R     |  r  |           |     |

Note: in this project, we follow the pronunciation of **American English**.


### 1.2 Stress

We append 0/1/2 to a *vowel* phoneme to indicate its stress:
* 0: No stress
* 1: Primary stress
* 2: Secondary stress

### 1.3 Word Stress Rules

We make the following assumptions in this project (they hold in most cases):

* A word **has only one** pronunciation (we do not consider words with multiple pronunciations)
* A word **must have one and only one** primary stress
* Only vowels are stressed

In addition, we **only consider words with fewer than 5 vowels** (i.e., words with 5 or more vowels have been removed from the training and testing datasets).
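
The rules above can be checked mechanically. The following is a minimal sketch, assuming a pronunciation is given as a space-separated string with stress digits attached to vowels; `satisfies_assumptions` is a hypothetical helper name, not part of the provided code.

```python
def satisfies_assumptions(pronunciation):
    """Check the word stress rules on a pronunciation string such as 'K OW1 EH2 D'."""
    phonemes = pronunciation.split()
    # Only vowels carry a trailing stress digit, so the digits identify the vowels.
    stress_marks = [p[-1] for p in phonemes if p[-1] in "012"]
    has_one_primary = stress_marks.count("1") == 1
    has_fewer_than_five_vowels = len(stress_marks) < 5
    return has_one_primary and has_fewer_than_five_vowels


print(satisfies_assumptions("K OW1 EH2 D"))                    # two vowels, one primary stress
print(satisfies_assumptions("Y UW2 N AH0 V ER1 S AH0 T IY0"))  # five vowels, so it is rejected
```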

### 1.4 Example
We take the word **university** (pronounced [ˌjunəˈvɜrsəti]) as an example. Its pronunciation is written as:
**<center>Y UW2 N AH0 V ER1 S AH0 T IY0</center>**
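
As an illustration of this notation, the sketch below separates each phoneme from its stress digit. The function name `split_stress` is an assumption made for this example only.

```python
def split_stress(phoneme):
    """Split 'ER1' into ('ER', 1); consonants carry no digit and give (phoneme, None)."""
    if phoneme[-1] in "012":
        return phoneme[:-1], int(phoneme[-1])
    return phoneme, None


pronunciation = "Y UW2 N AH0 V ER1 S AH0 T IY0"
pairs = [split_stress(p) for p in pronunciation.split()]
print(pairs)
```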

<h1><center>2. File Format</center></h1>

### 2.1 Training Data
The training data contains 50,000 words. Each word (in uppercase) is followed by its pronunciation, in the form

```
Word:Pronunciation
```

For example, the line corresponding to word *university* should be:

```
UNIVERSITY:Y UW2 N AH0 V ER1 S AH0 T IY0
```
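
A line in this format can be split at the first colon. The following is a small sketch; `parse_line` is a hypothetical name and is not the provided helper function.

```python
def parse_line(line):
    """Split 'WORD:PH1 PH2 ...' into the word and its list of phonemes."""
    word, _, pronunciation = line.partition(":")
    return word, pronunciation.split()


word, phonemes = parse_line("UNIVERSITY:Y UW2 N AH0 V ER1 S AH0 T IY0")
print(word, phonemes)
```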

### 2.2 Testing Data

The testing data contains several lines, where each line corresponds to a word. 

The only difference from the training data is that all stress symbols (i.e., 0/1/2) have been removed from the testing data. You need to predict them.

For example, the line corresponding to the word *university* in the testing data should be:

```
UNIVERSITY:Y UW N AH V ER S AH T IY
```
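
Because the two formats differ only in the stress digits, a training line can be turned into a test-style line for local validation. This is a sketch under that assumption; `to_test_format` is a name introduced here for illustration.

```python
def to_test_format(training_line):
    """Remove the stress digits (0/1/2) from every phoneme of a training line."""
    word, _, pronunciation = training_line.partition(":")
    stripped = [p.rstrip("012") for p in pronunciation.split()]
    return word + ":" + " ".join(stripped)


print(to_test_format("UNIVERSITY:Y UW2 N AH0 V ER1 S AH0 T IY0"))
```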

### 2.3 Helper Functions

To make your life easier and avoid unnecessary bugs, we provide a helper function that reads the training/testing data from a file and converts it into a list of strings. This list is passed as an argument to the training and testing functions. See Section 4.1.2 for an execution example.

<h1><center>3. Your Task</center></h1>

### 3.1 Output the position of the primary stress

For each word in the testing data, you need to output the position of the primary stress. Since only vowels are stressed, we only count vowels. For example, you should output **3** for the word **university**, as the primary stress of the word university is on the 3rd vowel (i.e., **ER**).

Assume the testing data contains 5 words, whose primary stresses are on the 1st, 2nd, 1st, 3rd and 2nd vowel. Then your `test()` function should return a list of 5 integers: `[1, 2, 1, 3, 2]`. To do that, you need to train a classifier using the `train()` function.
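
To make the counting convention concrete, the sketch below derives the expected label from a stressed pronunciation by counting vowels only, starting from 1. The name `primary_stress_position` is hypothetical.

```python
def primary_stress_position(pronunciation):
    """Return the 1-based index, among vowels only, of the vowel marked with '1'."""
    vowels = [p for p in pronunciation.split() if p[-1] in "012"]
    for position, vowel in enumerate(vowels, start=1):
        if vowel.endswith("1"):
            return position
    raise ValueError("no primary stress found")


print(primary_stress_position("Y UW2 N AH0 V ER1 S AH0 T IY0"))
```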

### 3.2 train()

To predict the stress successfully, you need to train a classifier. You are required to implement a function named `train()`. Its two arguments are the training data (stored as a list of strings) and the output file path.

You need to dump the classifier and any relevant data/tools into one single file. Hint: an easy (but not the only) way of doing this is to use `pickle`.

In [1]:
#def train(data, classifier_file):
#    pass;

### 3.3 test()

You also need to implement a function named test(), which takes the test data as input and returns a list of integers which indicate the positions of the primary stress.

In [2]:
#def test(data, classifier_file):
#    pass;
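
The sketch below illustrates only the file handshake between the two functions: one writes a single pickle file and the other reads it back. It is not a real classifier. It is a majority-position baseline chosen so the example runs with the standard library alone, and the names `baseline_train`, `baseline_test` and `primary_position` are assumptions that merely mirror the `train(data, classifier_file)` and `test(data, classifier_file)` signatures.

```python
import os
import pickle
import tempfile
from collections import Counter


def primary_position(line):
    """Return the 1-based vowel position of the primary stress in a training line."""
    phonemes = line.split(":", 1)[1].split()
    marks = [p[-1] for p in phonemes if p[-1] in "012"]
    return marks.index("1") + 1


def baseline_train(data, classifier_file):
    """Store the most frequent primary-stress position in a single pickle file."""
    counts = Counter(primary_position(line) for line in data)
    model = {"most_common_position": counts.most_common(1)[0][0]}
    with open(classifier_file, "wb") as f:
        pickle.dump(model, f)


def baseline_test(data, classifier_file):
    """Load the pickled model and predict the same position for every word."""
    with open(classifier_file, "rb") as f:
        model = pickle.load(f)
    return [model["most_common_position"] for _ in data]


training = [
    "COED:K OW1 EH2 D",
    "PURVIEW:P ER1 V Y UW2",
    "HEHIR:HH EH1 HH IH0 R",
    "NONPOISONOUS:N AA0 N P OY1 Z AH0 N AH0 S",
]
testing = ["COED:K OW EH D", "PURVIEW:P ER V Y UW"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier.dat")
    baseline_train(training, path)
    predictions = baseline_test(testing, path)

print(predictions)
```

A dictionary is pickled rather than a bare value so that extra tools (for example an encoder) could be stored in the same file alongside the model.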

### 3.4 Restrictions

* The **total** running time of training and testing **should not exceed 10 minutes** in the submission system. The system will force-stop your program if it takes more than 10 minutes, and you will receive 0 points for the programming part.

* You are encouraged to use **any** classifier from sklearn, but you **cannot use** any other machine learning package.

### 3.5 Report

You need to submit a report (in PDF format) which answers at least the following two questions:
* What features do you use in your classifier? Why are they important and what information do you expect them to capture?
* How do you experiment and improve your classifier?

<h1><center>4. Evaluation</center></h1>

### 4.1 Execution of the submitted code

Your submission will be tested automatically. To avoid unnecessary exceptions/errors, please make sure
* you have followed the instructions strictly
* you have tested your code before submission

#### 4.1.1 Execution environment

We have pre-installed the following modules. You can only use these modules and the built-in modules/functions.
* python: 3.5.2 (it should be okay if you use python 3.6.x on your side)
* pandas: 0.19.2
* numpy: 1.12.1
* scikit-learn: 0.18.1

#### 4.1.2 Execution example

**Note**: we will only test `train()` and `test()`; **none** of the other functions in your `submission.py` will be called by us. So make sure you call them, if there are any, within your `train()` or `test()` function.

You can imagine that our test code looks similar to the following. 

In [46]:
import helper
import submission

training_data = helper.read_data('./asset/training_data.txt')
classifier_path = './asset/classifier.dat'
submission.train(training_data, classifier_path)


In [17]:
test_data = helper.read_data('./asset/tiny_test.txt')
prediction = submission.test(test_data, classifier_path)
print(prediction)



### 4.2 Evaluation

**Your code contributes 70% of your final mark for the project, and your report contributes 20%. The remaining 10% comes from the top-performance bonus.**

After execution, the output of your submitted code will be compared with the ground truth, and the **micro-averaged $F_1$ score** will be used to determine the score for the programming part.


In [3]:
from sklearn.metrics import f1_score
ground_truth = [1, 1, 2, 1]
print(f1_score(ground_truth, prediction, average='micro'))

0.75
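
For single-label problems like this one, where every word gets exactly one predicted position, the micro-averaged $F_1$ score coincides with plain accuracy. The sketch below computes it by hand; the `prediction` list is a hypothetical example chosen for illustration, not output from the notebook.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


ground_truth = [1, 1, 2, 1]
prediction = [1, 2, 2, 1]  # hypothetical predictions with one mistake
accuracy = sum(t == p for t, p in zip(ground_truth, prediction)) / len(ground_truth)
print(micro_f1(ground_truth, prediction), accuracy)
```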


### 4.3 Bonus

**The 10 best-performing classifiers will be rewarded with up to 10 bonus points.** More specifically, 1st place gets 10 points, 2nd place gets 9 points, and so on.

<h1><center>5. Submission</center></h1>

Similar to the labs, **you need to submit both `submission.py` and report to the online submission system.** You cannot submit other files.

The online system will try to execute your submitted code using data *sampled* from the test dataset that will be used for the final evaluation.

For example, *assume* the test dataset $\mathcal{D}$ for the final evaluation has 1000 words. Then each time when you submit your code, the system will randomly sample $n$ words from $\mathcal{D}$, and use these $n$ words to test your code. 
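
The same idea can be imitated locally on a held-out set. This is a sketch only: how the submission system actually samples is not specified here, and `sample_words` and the `seed` parameter are assumptions.

```python
import random


def sample_words(dataset, n, seed=None):
    """Draw n distinct lines from the dataset without replacement."""
    rng = random.Random(seed)
    return rng.sample(dataset, n)


held_out = [
    "COED:K OW EH D",
    "PURVIEW:P ER V Y UW",
    "HEHIR:HH EH HH IH R",
    "MUSCLING:M AH S AH L IH NG",
    "NONPOISONOUS:N AA N P OY Z AH N AH S",
    "LAVECCHIA:L AA V EH K IY AH",
]
subset = sample_words(held_out, 3, seed=0)
print(subset)
```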

You will be able to see the precision of your code on the $n$ words. If your code cannot be correctly executed, then you will receive an error message.

### 5.1 Submission restrictions

**Due to obvious reasons, we have the following restrictions. Please strictly follow them.**

* Each student has only **10** chances to submit their *code* (whether or not it executes correctly). However, you can submit the report as many times as you want.
* The `submission.py` file should **not exceed 30KB**.
* The `report.pdf` file should **not exceed 10MB**.

### 5.2 Late penalty

* -30% for each day after the due date. 

**NOTE**

* We will take the time of your last submission (whether it is the code or the report) as your submission time.
* We will only store your last submitted code and last submitted report. We cannot use your earlier versions, and we do not accept any files after the deadline.

In [22]:
# Read the training data
file_dir = 'C:/Users/Administrator/Dropbox/jupyter/spec/asset/training_data.txt'
def read_data(file_path):
    with open(file_path) as f:
        lines = f.read().splitlines()
    return lines

data = read_data(file_dir)
#print(data)

In [23]:
# Vowel list
Vowels = ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'EH', 'ER',
          'EY', 'IH', 'IY', 'OW', 'OY', 'UH', 'UW']
# Consonant list
Consonants = ['P', 'B', 'CH', 'D', 'DH', 'F', 'G', 'HH', 'JH', 'K', 'L', 'M', 'N',
              'NG', 'R', 'S', 'SH', 'T', 'TH', 'V', 'W', 'Y', 'Z', 'ZH']

Syllables = Vowels + Consonants

In [24]:
# Extract the word part of a single data line
#word = data[0][0:data[0].find(':')]
#print(word)

In [25]:
# Extract the phoneme part of each data line
syllables = [x[x.find(':')+1:].split() for x in data]
print(syllables[0:10])

[['K', 'OW1', 'EH2', 'D'], ['P', 'ER1', 'V', 'Y', 'UW2'], ['HH', 'EH1', 'HH', 'IH0', 'R'], ['M', 'AH1', 'S', 'AH0', 'L', 'IH0', 'NG'], ['N', 'AA0', 'N', 'P', 'OY1', 'Z', 'AH0', 'N', 'AH0', 'S'], ['L', 'AA0', 'V', 'EH1', 'K', 'IY0', 'AH0'], ['B', 'AH1', 'K', 'AH0', 'L', 'D'], ['IY1', 'T', 'AH0', 'N'], ['S', 'AY1', 'M', 'EH2', 'D'], ['Z', 'AA0', 'M', 'B', 'UW1']]


In [26]:
print('what')

what


In [27]:
# Label each consonant with its position, i.e. the number of vowels seen before it
positions = []
for word in syllables:
    vowel_count = 0
    pos = []
    for syl in word:
        if syl in Consonants:
            pos.append(vowel_count)
        else:
            vowel_count += 1
    positions.append(pos)
print(positions[0:10])

[[0, 2], [0, 1, 1], [0, 1, 2], [0, 1, 2, 3], [0, 1, 1, 2, 3, 4], [0, 1, 2], [0, 1, 2, 2], [1, 2], [0, 1, 2], [0, 1, 1]]


In [28]:
# Extract the vowels of each word
vowels = [[v for v in l if v not in Consonants] for l in syllables]
print(vowels[0:5])

[['OW1', 'EH2'], ['ER1', 'UW2'], ['EH1', 'IH0'], ['AH1', 'AH0', 'IH0'], ['AA0', 'OY1', 'AH0', 'AH0']]


In [29]:
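# Find the primary stress position of each word (0~4 stand for positions 1~5)
def find_stress(vowels):
    stress = []
    for l in vowels:
        for v in l: 
            if '1' in v:
                stress.append(l.index(v))
    return stress
stress = find_stress(vowels)
# Count how many training words have the primary stress on the first vowel
#stress.count(0)
# Count how many training words have the primary stress on the second vowel
#stress.count(1)
#print(stress)
# With one sample per word, stress is the target y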

In [30]:
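# Find the secondary stress position of one word (-1 means no secondary stress, 0~4 stand for positions 1~5)
def find_secondary_stress(word_vowels):
    for i, v in enumerate(word_vowels):
        if '2' in v:
            return i
    return -1
# vowels is a list of per-word vowel lists, so apply the function word by word
secondary_stress = [find_secondary_stress(l) for l in vowels]
#print(secondary_stress)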

In [31]:
# Remove the stress marks (digits) from the vowels
def strip_stress_mark(vowels):
    raw_vowels = []
    for l in vowels:
        raw_l = []
        for v in l:
            raw_l.append(v.replace('0', '').replace('1', '').replace('2', ''))
        raw_vowels.append(raw_l)
    return raw_vowels
raw_vowels = strip_stress_mark(vowels)
#print(raw_vowels[0:5])

In [32]:
# Convert vowels to vowel codes (0~14); resize() pads empty slots with -1
vowels_in_code = [[Vowels.index(x) for x in l] for l in raw_vowels]
def resize(l, n):
    while len(l) < n:
        l.append(-1)
    while len(l) > n:
        l.pop()
    return l
vowels_in_code = [resize(l, 4) for l in vowels_in_code]
print(vowels_in_code[0:5])

[[11, 6, -1, -1], [7, 14, -1, -1], [6, 9, -1, -1], [2, 2, 9, -1], [0, 12, 2, 2]]


In [33]:
# Extract the consonants of each word
consonants = [[v for v in l if v in Consonants] for l in syllables]
print(consonants[0:10])

# Convert consonants to consonant codes
consonants_in_code = [[Consonants.index(x) for x in l] for l in consonants]
print(consonants_in_code[0:10])

consonants_in_code = [resize(l, 10) for l in consonants_in_code]
print(consonants_in_code[0])

[['K', 'D'], ['P', 'V', 'Y'], ['HH', 'HH', 'R'], ['M', 'S', 'L', 'NG'], ['N', 'N', 'P', 'Z', 'N', 'S'], ['L', 'V', 'K'], ['B', 'K', 'L', 'D'], ['T', 'N'], ['S', 'M', 'D'], ['Z', 'M', 'B']]
[[9, 3], [0, 19, 21], [7, 7, 14], [11, 15, 10, 13], [12, 12, 0, 22, 12, 15], [10, 19, 9], [1, 9, 10, 3], [17, 12], [15, 11, 3], [22, 11, 1]]
[9, 3, -1, -1, -1, -1, -1, -1, -1, -1]


In [42]:
raw_syllables = strip_stress_mark(syllables)
print(raw_syllables[0:4])
raw_syllables_in_code = [[Syllables.index(x) for x in l] for l in raw_syllables ]
raw_syllables_in_code = [resize(l, 14) for l in raw_syllables_in_code]
print(raw_syllables_in_code[0:4])
X = raw_syllables_in_code
X = [l + [len(vowel)] for l, vowel in zip(X, vowels)]
print(X[0:4])

[['K', 'OW', 'EH', 'D'], ['P', 'ER', 'V', 'Y', 'UW'], ['HH', 'EH', 'HH', 'IH', 'R'], ['M', 'AH', 'S', 'AH', 'L', 'IH', 'NG']]
[[24, 11, 6, 18, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], [15, 7, 34, 36, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1], [22, 6, 22, 9, 29, -1, -1, -1, -1, -1, -1, -1, -1, -1], [26, 2, 30, 2, 25, 9, 28, -1, -1, -1, -1, -1, -1, -1]]
[[24, 11, 6, 18, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2], [15, 7, 34, 36, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2], [22, 6, 22, 9, 29, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2], [26, 2, 30, 2, 25, 9, 28, -1, -1, -1, -1, -1, -1, -1, 3]]


In [13]:
X = []
for i in range(len(vowels_in_code)):
    x = [-1] * 14
    x[0:10] = consonants_in_code[i]
    x[10:14] = vowels_in_code[i]
    X.append(x)
print(X[0:5])

[[9, 3, -1, -1, -1, -1, -1, -1, -1, -1, 11, 6, -1, -1], [0, 19, 21, -1, -1, -1, -1, -1, -1, -1, 7, 14, -1, -1], [7, 7, 14, -1, -1, -1, -1, -1, -1, -1, 6, 9, -1, -1], [11, 15, 10, 13, -1, -1, -1, -1, -1, -1, 2, 2, 9, -1], [12, 12, 0, 22, 12, 15, -1, -1, -1, -1, 0, 12, 2, 2]]


In [43]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

clfs = [
    DecisionTreeClassifier(),
    KNeighborsClassifier(8),
    #SVC(gamma=2, C=1),
    AdaBoostClassifier(),
    RandomForestClassifier(n_estimators=3000, criterion='entropy', max_features='log2', max_depth=130)
]

names = [
    "Decision Tree",
    "Nearest Neighbors", 
    #"RBF SVM", 
    "AdaBoost",
    "RF"
]

# X = raw_syllables_in_code
y = stress

# preprocess dataset, split into training and test part
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

for name, clf in zip(names, clfs):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print(name + ': ' + '%.4f' % score)

Decision Tree: 0.8520
Nearest Neighbors: 0.7951
AdaBoost: 0.7196
RF: 0.9089


In [80]:
def test(data, classifier):# do not change the heading of the function
    # Extract the phoneme part of each data line
    syllables = [x[x.find(':')+1:].split() for x in data]

    # Extract the vowels of each word
    vowels = [[v for v in l if v not in Consonants] for l in syllables]

    # Convert vowels to vowel codes (1~15), then pad each word to length 5
    vowels_in_code = [[Vowels.index(x)+1 for x in l] for l in vowels]
    vowels_in_code = [resize(l, 5) for l in vowels_in_code]
    
    X = vowels_in_code
    return X
data1 = ['COED:K OW EH D',
'PURVIEW:P ER V Y UW',
'HEHIR:HH EH HH IH R',
'MUSCLING:M AH S AH L IH NG',
'NONPOISONOUS:N AA N P OY Z AH N AH S',
'LAVECCHIA:L AA V EH K IY AH']
xx = test(data1, clfs[1])
print(xx)
print(clfs[4].predict(xx).tolist())

[[12, 7, 0, 0, 0], [8, 15, 0, 0, 0], [7, 10, 0, 0, 0], [3, 3, 10, 0, 0], [1, 13, 3, 3, 0], [1, 7, 11, 3, 0]]
[0, 0, 0, 0, 2, 2]
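
Section 3.1 asks for 1-based vowel positions, while the labels built by `find_stress` (and therefore the predictions printed above) are 0-based indices. A minimal sketch of the conversion, with `to_one_based` as a hypothetical helper name:

```python
def to_one_based(zero_based_positions):
    """Shift 0-based vowel indices to the 1-based positions required by the task."""
    return [p + 1 for p in zero_based_positions]


print(to_one_based([0, 0, 0, 0, 2, 2]))
```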



In [None]:
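# Custom 0/1 metric, kept for reference but disabled
# def my_metric(x, y):
#     if x == y:
#         return 0
#     else:
#         return 1

# Compare distance metrics for k = 8 (the default metric is minkowski with p=2)
knn_clfs = [
    KNeighborsClassifier(8, metric='hamming'),
    KNeighborsClassifier(8, metric='euclidean'),
    KNeighborsClassifier(8)
]

knn_names = [
    "8 Nearest Neighbors (hamming)", 
    "8 Nearest Neighbors (euclidean)", 
    "8 Nearest Neighbors (default)"
]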

for name, clf in zip(knn_names, knn_clfs):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print(name + ': ' + '%.4f' % score)

In [16]:
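tree_clfs = [
    DecisionTreeClassifier(max_depth=12),
    DecisionTreeClassifier(),
    DecisionTreeClassifier(min_samples_split=20),
    DecisionTreeClassifier(min_samples_leaf=100),
    DecisionTreeClassifier(min_samples_leaf=200),
    # Sixth entry matching the "leaf300" name and the later tree_clfs[5] lookup
    DecisionTreeClassifier(min_samples_leaf=300),
]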

tree_names = [
    "12 Max_depth_tree",
    "no Max_depth_tree",
    "split",
    "leaf100",
    "leaf200",
    "leaf300",
]
for name, clf in zip(tree_names, tree_clfs):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print(name + ': ' + '%.4f' % score)



12 Max_depth_tree: 0.8734
no Max_depth_tree: 0.8784
split: 0.8778
leaf100: 0.8578
leaf200: 0.8355


In [None]:
# Cross-validation
from sklearn.model_selection import cross_val_score
clf = tree_clfs[3]
scores = cross_val_score(clf, X, y, cv=5)
print(scores)

In [36]:
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('hamming')
A = [[3, 4, 7, 8, 9], [3, 4, 4, 4, 4]]
dist.pairwise(A)

array([[ 0. ,  0.6],
       [ 0.6,  0. ]])
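
The value 0.6 is the fraction of positions at which the two rows differ (3 of 5). A hand-rolled sketch of the same computation, with `hamming` as a name introduced here for illustration:

```python
def hamming(u, v):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


A = [[3, 4, 7, 8, 9], [3, 4, 4, 4, 4]]
print(hamming(A[0], A[1]))
```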

In [60]:
clf = tree_clfs[5]
import pydotplus
from IPython.display import Image
from sklearn import tree
feature_names = ['position 1', 'position 2', 'position 3', 'position 4', 'position 5']
target_names = ['position 1', 'position 2', 'position 3', 'position 4', 'position 5']
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=feature_names,  
                         class_names=target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("tree(min_leaf_300).pdf") 
#Image(graph.create_png()) 

True

In [76]:
import pickle
# Use a context manager so the file is flushed and closed after dumping
with open('clf.dat', 'wb') as output:
    pickle.dump(clf, output)

In [79]:
with open('clf.dat', 'rb') as f:
    gg = pickle.load(f)

In [80]:
scores = cross_val_score(gg, X, y, cv=5)

In [82]:
y_predict = gg.predict(X_test)
print(y_predict)

[0 0 0 ..., 1 2 0]


In [85]:
yy = y_predict.tolist()
print(yy)

[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 