# Machine Learning Engineer Nanodegree
## Capstone Proposal
Daniel Lameyer  
April 27, 2019

## Proposal

### Domain Background

Deep learning has greatly expanded the capabilities of machine learning in recent years. One of its most popular applications is natural language processing (NLP) and language translation, owing to the technology's ability to model the complicated structure of human language. Thanks to advances in this field, Google announced in 2016 that its popular Google Translate service would transition to artificial neural network based algorithms as the foundation of its translation software. Among the various deep neural network architectures, the recurrent neural network (RNN) is commonly used for NLP tasks because of the temporal and sequential structure of language. [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/pdf/1609.08144.pdf)

In this project, I propose to create a machine learning model that can translate Japanese sentences to English. Translating between Japanese and English is notoriously difficult due to the vast linguistic differences in the structure, grammar, and vocabulary of the two languages. Languages with similar roots or linguistic history are often easier to translate between, both for humans and for machines, whereas dissimilar languages pose an added challenge. I am bilingual in these two languages, and I hope to pursue a career as an NLP engineer working with Japanese and English. I hope to use this project as a foundation for developing my multilingual NLP skills.


### Problem Statement

The objective of this project is to build a model, trained on Japanese sentences paired with their English translations, that automatically returns an English translation when a new Japanese sentence is passed to it. The model will take in a Japanese sentence (with spaces between words), tokenize it, and return an English sentence that preserves the general meaning of the original sentence.




### Datasets and Inputs
The dataset for this project is the corpus provided by [Yusuke Oda](https://github.com/odashi/small_parallel_enja), a sample of the [Tanaka Corpus](http://www.edrdg.org/wiki/index.php/Tanaka_Corpus) filtered to sentences of 4 to 16 words and pre-tokenized. The corpus contains a training dataset of 50,000 Japanese/English sentence pairs; each English line has been human translated from the original source. This corpus suits the proposed project for two reasons. First, the human-translated sentences provide natural labels against which the model's output can be compared for training and testing purposes. Second, the data is already tokenized, which is especially helpful for the Japanese side because, unlike English, Japanese does not use spaces to delimit words within a sentence.


### Solution Statement

At its core, this model will be a supervised machine learning algorithm. Therefore, the output of the model can be compared against a label value which the model should have produced. However, since a translation is not a simple binary or multi-label classification but a collection of words that produce meaning, standard metrics such as accuracy, precision, or recall may not suffice. Instead, to measure the performance of the translation, I propose using the Levenshtein distance to calculate a [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate) that indicates how closely the output matches the label sentence. The Levenshtein distance between two sequences is the minimum number of insertions, deletions, and substitutions needed to convert one into the other; for the Word Error Rate, these edits are counted over words rather than characters. The validation dataset from the corpus provided by [Yusuke Oda](https://github.com/odashi/small_parallel_enja) will be used to measure how closely each translated output matches its reference translation.


### Benchmark Model

Much like [Google’s Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf), the basis of this model will be a deep neural network with an RNN architecture. Recurrent neural networks are a very common architecture for NLP machine learning tasks due to the sequential nature of human languages. The RNN may contain additional hidden layers and bidirectional layers due to the complexity of the linguistic differences between Japanese and English. As a benchmark RNN model, I will aim for the validation score of 97.5% achieved by Thomas Tracey in his English-to-French [translation model](https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571).



### Evaluation Metrics

The basis of the evaluation will be to compare the English sentence the model produces against the label sentence which the model was intended to produce. Since translation quality is difficult to score and can be evaluated in several ways, there are numerous options to choose from. The proposed evaluation metric is the [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate) (WER), computed between the model output and the target sentence.

The Word Error Rate is calculated as follows:

`WER = (S + D + I) / N = (S + D + I) / (S + D + C)`

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N = S + D + C). This methodology provides a simple yet effective way to determine how closely one sentence resembles another.
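
As a rough illustration of the formula above, the following sketch computes WER with a word-level Levenshtein distance in plain Python. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the final evaluation pipeline may differ.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (S + D + I) / N for two whitespace-tokenized sentences."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    # previous[-1] is the minimum S + D + I; divide by N, the reference length.
    return previous[-1] / len(ref)


# Sample sentence from the corpus, with one word substituted in the hypothesis.
reference = "i 'm in the tennis club ."
hypothesis = "i am in the tennis club ."
print(word_error_rate(reference, hypothesis))  # 1 substitution over 7 reference words
```

Because insertions are counted against the reference length, a hypothesis much longer than the reference can yield a WER above 1.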

One important shortfall of this method is that the semantics of the original sentence are not taken into consideration. Although the word selection may closely resemble the target sentence, words out of order or small differences in diction can vastly alter the intended meaning of the original sentence.



The following notebook cells load the pre-tokenized training data and print the first ten sentence pairs.

```python
def load_data(path):
    """Load a text file and return its lines as a list of sentences."""
    with open(path, "r", encoding="utf-8") as f:
        # Reading line by line avoids the empty trailing entry that
        # split('\n') produces when the file ends with a newline.
        return [line.rstrip("\n") for line in f]


# Load English data
en_sent_token_train = load_data('data/odashi/train.en')

# Load Japanese data
#ja_sent_raw = load_data('data/raw/ja')
ja_sent_token_train = load_data('data/odashi/train.ja')
```

```python
# Print the first ten aligned sentence pairs for a quick visual check
print('First 10 Japanese - English Sentence Pairs')
print("----------------------")
for sample_i in range(10):
    print('Line {}:'.format(sample_i + 1))
    print('JP: {}'.format(ja_sent_token_train[sample_i]))
    print('EN: {}'.format(en_sent_token_train[sample_i]))
    print("----------------------")
```

```text
First 10 Japanese - English Sentence Pairs
----------------------
Line 1:
JP: 誰 が 一番 に 着 く か 私 に は 分か り ま せ ん 。
EN: i can 't tell who will arrive first .
----------------------
Line 2:
JP: 多く の 動物 が 人間 に よ っ て 滅ぼ さ れ た 。
EN: many animals have been destroyed by men .
----------------------
Line 3:
JP: 私 は テニス 部員 で す 。
EN: i 'm in the tennis club .
----------------------
Line 4:
JP: エミ は 幸せ そう に 見え ま す 。
EN: emi looks happy .
----------------------
Line 5:
JP: この 事実 を 心 に 留め て お い て 下さ い 。
EN: please bear this fact in mind .
----------------------
Line 6:
JP: 彼女 は 私 たち の 世話 を し て くれ る 。
EN: she takes care of my children .
----------------------
Line 7:
JP: 私 達 は 国際 人 に な り た い と 思 い ま す 。
EN: we want to be international .
----------------------
Line 8:
JP: 約束 を 破 る べ き で は あ り ま せ ん 。
EN: you ought not to break your promise .
----------------------
Line 9:
JP: 道路 を 横切 る とき は 車 に 注意 し なさ い 。
EN: when you cross the street , watch out for cars .
----------------------
Line 10:
JP: 私 に は 生き
```


### Project Design


[//]: # (
This project will require GPU computation in order for the deep neural net to train on the data. Methods of acquiring a runtime environment to perform these steps may vary from user to user. The author will utilize an AWS EC2 p2.xlarge instance to run the code for the project. Instructions on how to set up such an environment can be found here https://medium.com/@alexjsanchez/python-3-notebooks-on-aws-ec2-in-15-mostly-easy-steps-2ec5e662c6c6
)

At a high level, the project will proceed through the following steps to create the machine translation model.

1. Import and load the training datasets
2. Build the preprocessing pipeline for the Japanese and English sentences
3. Architect the deep learning model and train it on the data
4. Create the evaluation pipeline
5. Evaluate on a validation dataset

First, the training data for the model will be imported from [Yusuke Oda's corpus](https://github.com/odashi/small_parallel_enja). The 50,000 Japanese sentences and their English translations will each be imported into a list variable. Then there will be some preliminary exploration of the data, viewing sample sentences and vocabulary, to determine whether the dataset contains adequate translations and samples. Note that readers who do not understand Japanese may not be able to judge whether the translations are adequate. Here is a sample of the translated sentences provided in the corpus:

Line 1:
JP: 誰 が 一番 に 着 く か 私 に は 分か り ま せ ん 。
EN: i can 't tell who will arrive first .

Line 2:
JP: 多く の 動物 が 人間 に よ っ て 滅ぼ さ れ た 。
EN: many animals have been destroyed by men .

Line 3:
JP: 私 は テニス 部員 で す 。
EN: i 'm in the tennis club .
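
The preliminary exploration described above could be sketched roughly as follows. This is a minimal example under stated assumptions: the helper name `summarize_corpus` and the chosen statistics are hypothetical, and the three English sample sentences stand in for the full training list.

```python
from collections import Counter


def summarize_corpus(sentences):
    """Return basic length and vocabulary statistics for tokenized sentences."""
    # Sentences are assumed to be pre-tokenized with spaces, as in the corpus.
    token_lists = [s.split() for s in sentences if s.strip()]
    if not token_lists:
        raise ValueError("no non-empty sentences to summarize")

    lengths = [len(tokens) for tokens in token_lists]
    vocab = Counter(token for tokens in token_lists for token in tokens)
    return {
        "sentences": len(token_lists),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "vocab_size": len(vocab),
        "most_common": vocab.most_common(2),
    }


sample_en = [
    "i can 't tell who will arrive first .",
    "many animals have been destroyed by men .",
    "i 'm in the tennis club .",
]
print(summarize_corpus(sample_en))
```

Running the same summary on the Japanese and English lists separately would give a quick check that both sides have the same number of sentences and that sentence lengths fall in the expected range.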


After the data has been imported, the corpus sentences will need to be preprocessed in preparation for the deep learning algorithm. Preprocessing steps include tokenizing the dataset, which breaks each sentence down into its tokens (words) and maps them to integer IDs, and padding the data, which fills out shorter sentences with empty values so that every input has the fixed length the model expects. These preprocessing steps will prepare the Japanese and English datasets to be passed to the neural network.
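
A minimal pure-Python sketch of these two preprocessing steps is shown below. The helper names `build_vocab` and `encode_and_pad`, the `<pad>` and `<unk>` tokens, and the choice to pad at the end of each sequence are all assumptions for illustration; the project may rely on equivalent Keras utilities instead.

```python
def build_vocab(sentences, pad_token="<pad>", unk_token="<unk>"):
    """Map each distinct token to an integer ID; 0 is reserved for padding."""
    vocab = {pad_token: 0, unk_token: 1}
    for sentence in sentences:
        for token in sentence.split():
            # Assign the next free ID the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(sentences, vocab, max_len=None, unk_token="<unk>"):
    """Convert sentences to ID sequences of equal length (padded with 0)."""
    encoded = [
        [vocab.get(token, vocab[unk_token]) for token in sentence.split()]
        for sentence in sentences
    ]
    if max_len is None:
        max_len = max(len(seq) for seq in encoded)
    # Truncate sequences longer than max_len, pad shorter ones at the end.
    return [seq[:max_len] + [0] * (max_len - len(seq)) for seq in encoded]


sample_en = ["emi looks happy .", "i 'm in the tennis club ."]
vocab = build_vocab(sample_en)
for row in encode_and_pad(sample_en, vocab):
    print(row)
```

Reserving ID 0 for padding keeps padded positions distinguishable from real words, so they can be ignored when computing the loss or scoring the output.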

The deep learning model will at its core be a recurrent neural network. To build it, the Keras package will be used to quickly develop a neural network architecture with the appropriate parameters and layers. Once an architecture is defined, the model will train on the dataset, learning the English translations of the Japanese inputs. The progress of the model can be observed during training, as it will print out validation loss and accuracy scores. Several attempts will most likely be required during this step to fine-tune the model for better performance.

As outlined earlier, the [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate) method will be used to evaluate the output of the model. To do that, a pipeline is needed that takes the English sentence produced by the model and compares it against the target translation. In this step, a series of functions will be created to score each output sentence against its label sentence using the WER formula.

Once the evaluation pipeline has been created, the validation dataset will be passed to the model to evaluate its performance. This validation dataset is provided by [Yusuke Oda's corpus](https://github.com/odashi/small_parallel_enja) and does not overlap with any sentences in the training dataset. Each Japanese sentence will be passed to the model, yielding an English sentence. The English sentence will then be passed through the evaluation pipeline and compared against the label sentence, producing a score for how closely the output resembles the target translation. This process will be repeated across the rest of the validation dataset, and an average score will be calculated to determine the overall performance of the model.
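
The validation loop described above might look roughly like the sketch below. The function `average_score` and both stand-ins (a lookup-table "model" and an exact-match scorer) are hypothetical placeholders; in the project, the trained network and a WER function would take their places.

```python
def average_score(translate, score, sources, references):
    """Translate each source sentence, score it, and return the mean score."""
    if len(sources) != len(references):
        raise ValueError("sources and references must have the same length")
    if not sources:
        raise ValueError("validation set is empty")
    scores = [
        score(reference, translate(source))
        for source, reference in zip(sources, references)
    ]
    return sum(scores) / len(scores)


# Stand-ins for illustration only: a lookup-table "model" that knows one sentence
# and a scorer that returns 0.0 for an exact match and 1.0 otherwise.
toy_translations = {"私 は テニス 部員 で す 。": "i 'm in the tennis club ."}


def toy_translate(sentence):
    return toy_translations.get(sentence, "")


def exact_match_error(reference, hypothesis):
    return 0.0 if reference == hypothesis else 1.0


sources = ["私 は テニス 部員 で す 。", "エミ は 幸せ そう に 見え ま す 。"]
references = ["i 'm in the tennis club .", "emi looks happy ."]
print(average_score(toy_translate, exact_match_error, sources, references))
```

Passing the translator and scorer in as arguments keeps the loop independent of any particular model or metric, so the same code can score the benchmark and the final model.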



-----------


### References

1) [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/pdf/1609.08144.pdf)

2) [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate)

3) [Language Translation with RNNs: Build a recurrent neural network that translates English to French](https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571)