# Notes on Machine Translation Using RNNs

## Defining the Problem

- input = other-language (French, Japanese, and/or Arabic) characters or words, with audio, i.e. a sequence

- output = English characters or words, with audio, i.e. a sequence

- method = neural network (NN) such as an RNN

- RNN setup = many-to-many; the input is a sequence and the output is a sequence

- metrics = ?

- Pipeline 
    1. Preprocessing
        - load data
            - What format/file type is the dataset?
            - What packages are best for loading this type of data?
        - examine the data to determine complexity
            - view the first couple of lines of each file
            - How many words total?
            - How many unique words?
            - What are the ten most commonly used words?
        - clean/tokenization
            - _should I use NLTK or tokenize manually?_
            - remove unwanted characters such as punctuation, non-alphabetic characters, etc.
            - be careful with apostrophes?
            - convert text to UTF-8 encoding?
            - keep stop words for translation
            - be aware of any attached words that need to be broken down 
            - lemmatizing/stemming (might not be needed)
                - reduces derived forms of a word to its root term
            - split sentence into words
            - lowercase words
        - padding 
            - ensures each sequence is the same length by padding any sequence shorter than the longest sequence (the max length)
        
        - **consider reducing the dataset for simplicity and faster runtime**
        
    2. Modeling
    
        - build
        
            - **inputs**
                - integer encoded characters or words
                
            - **embedding layers**
                - converts the word to a vector
                    - size of vector depends on complexity of the vocab
                - each word is projected into n-dimensional space
                    - words with similar meanings occupy similar regions of space
                    - the vectors between words represent relationships
                    
                - may need to use pre-trained embeddings or a Keras `Embedding` layer
                - hidden layers
                
            - **recurrent layers (encoder)**
                - carry context from the word vectors of previous time steps to the current word vector
                - this context variable is also called the hidden state
                - for each time step after the first word there are two inputs
                    - the hidden state and the NEXT word vector from the sequence
                   
                - Bidirectional layer
                    - provides future and historical context
                        - two RNN layers:
                            - one gets input sequence as is
                            - the other gets a reversed copy
                            
            - **dense layers (decoder)**
                - fully connected layers used to decode the encoded input into the correct output sequence
                - for each time step after the first word there are two inputs
                    - the hidden state and the PREV word vector from the sequence
                    
            - **hidden layer with Gated Recurrent Unit**   
                - selects which information is relevant and which should be discarded
                    - update gate
                        - helps the model determine how much information from the previous time step to carry forward
                    - reset gate
                        - decides how much of the previous state to forget when computing the new candidate state
            - **outputs**
                - a sequence of integers that needs to be mapped back to words in the output-language (English) vocabulary
        
            - **Model Params**
                - input_shape
                - length of output sequence
                - number of unique English words
                - number of unique Arabic words
        - train
        
        - test model
        
    3. Prediction
        - generate translations and compare the output to the actual translations
    4. Iteration
        - iterate on the model; refine architecture
        
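A minimal sketch of the preprocessing steps above (examine, clean/tokenize, integer-encode, pad), using only the Python standard library. The sample sentences, the `<pad>`/`<unk>` tokens, and all function names are made up for illustration; a real project might use NLTK or Keras utilities instead.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(sentence):
    """Lowercase, drop punctuation and digits, keep in-word apostrophes."""
    sentence = sentence.lower()
    # [^\W\d_] matches any Unicode letter, so accented and non-Latin
    # letters survive; apostrophes are kept only between letters.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence)


def build_vocab(tokenized_sentences):
    """Count words and assign integer ids, most frequent first."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    word_to_id = {PAD: 0, UNK: 1}  # ids 0 and 1 are reserved
    for word, _ in counts.most_common():
        word_to_id[word] = len(word_to_id)
    return counts, word_to_id


def encode(tokens, word_to_id):
    """Map words to ids; unseen words fall back to the UNK id."""
    return [word_to_id.get(w, word_to_id[UNK]) for w in tokens]


def pad(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


# Toy corpus standing in for one side of a parallel dataset
corpus = ["The cat sat.", "The dog didn't sit!", "A cat."]
tokens = [tokenize(s) for s in corpus]
counts, word_to_id = build_vocab(tokens)

print("total words:", sum(counts.values()))
print("unique words:", len(counts))
print("most common:", counts.most_common(10))

padded = pad([encode(t, word_to_id) for t in tokens])
print(padded)
```

Reserving id 0 for padding keeps the padded positions easy to spot later, and the same functions would be run separately on the source and target sides.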
- Framework/Technologies
    - TensorFlow
    - Keras? LSTM?
    - Google Colab

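To make the update and reset gates from the GRU notes concrete, here is a rough single-unit (scalar) GRU step in plain Python. The parameter names (`wz`, `uz`, ...) and their values are invented, and the blend follows the convention where the update gate `z` is the fraction of the previous state that is kept; some references swap `z` and `1 - z`. Treat it as an illustration of the gating idea, not as what Keras does internally.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One time step of a single-unit GRU (all values are scalars)."""
    # Update gate: how much of the previous state to keep
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])
    # Reset gate: how much of the previous state feeds the candidate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])
    # Candidate state built from the input and the (reset) previous state
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend old state and candidate according to the update gate
    return z * h_prev + (1.0 - z) * h_cand


def run(sequence, p, h0=0.0):
    """Feed a whole sequence through the cell and return the final state."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


# Hypothetical parameters, chosen only for illustration
params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 0.5, "uh": 0.5, "bh": 0.0}

seq = [1.0, -0.5, 0.25]
forward = run(seq, params)
# A bidirectional layer would use a second set of weights for the
# reversed copy; the same weights are reused here for brevity.
backward = run(list(reversed(seq)), params)
print(forward, backward)
```

Pushing the update-gate bias `bz` strongly positive makes the cell copy its previous state, while strongly negative `bz` and `br` make it ignore the past and depend on the current input alone.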

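For the prediction step, the model output is typically one row of probabilities per time step, which has to be mapped back to words. A small sketch of that mapping follows; the toy vocabulary and probabilities are invented, and `decode` is a hypothetical helper, not a library function.

```python
def decode(prob_rows, id_to_word, pad_id=0):
    """Pick the most likely id at each time step and join the words."""
    ids = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    # Padding positions carry no word, so they are skipped
    return " ".join(id_to_word[i] for i in ids if i != pad_id)


# Toy vocabulary and model output (one probability row per time step)
id_to_word = {0: "<pad>", 1: "the", 2: "cat", 3: "sat"}
rows = [
    [0.10, 0.70, 0.10, 0.10],
    [0.00, 0.20, 0.60, 0.20],
    [0.10, 0.10, 0.20, 0.60],
    [0.90, 0.05, 0.03, 0.02],
]

print(decode(rows, id_to_word))
```

The decoded string can then be compared side by side with the reference translation for the same input sentence.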
## References

- [Language Translations with RNN](https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571)

- [Effectively Preprocessing Text Data Part 1: Text Cleaning](https://towardsdatascience.com/effectively-pre-processing-the-text-data-part-1-text-cleaning-9ecae119cb3e)

- [Arabic-English Cross-Lingual Word Embedding Model](https://www.aclweb.org/anthology/W19-4605.pdf)

- [ManyThings.org](http://www.manythings.org/anki/)

- [Input Output](https://docs.python.org/3/tutorial/inputoutput.html)

- [Using Regex for Text Manipulation](https://stackabuse.com/using-regex-for-text-manipulation-in-python/)

- [Regex One](https://regexone.com/)

- [Removing characters before, after, and in the middle of strings](https://medium.com/@Alexander_H/removing-characters-before-after-and-in-the-middle-of-strings-fb4930cce76a)

- [Best way to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)