
All-Lang-Translator

The goal of this project is to learn about seq-2-seq models with attention, and to experiment with different attention mechanisms and model architectures.

Finally, create an all-language translator. The idea is to provide users with a GUI-based web application where they can simply download a dataset (e.g. Spanish-to-English translation pairs), click the train button, and obtain an NMT model. If a model is already trained, users can run inference with it.

About the problem:


NMT, or neural machine translation, is the problem of translating text from one language to another. For this we use the seq-2-seq model architecture: an encoder takes the input-language sentence and outputs an encoded vector for each input word. This encoded output is then fed into the decoder along with the decoder's previous hidden state and the previously predicted word. A minimal code sketch follows the notes below.

Note:

  1. The encoder takes in a whole sentence, but the decoder does not output a whole sentence; it outputs one word at a time until it reaches the end-of-sentence tag.

  2. The input to the encoder is the padded input-language sentences. The decoder takes three inputs: the previously predicted word vector, the previous hidden state,
    and the encoder's output.

  3. The input and output languages can be interchanged. If you have an English-to-Spanish dataset, you can turn it into Spanish-to-English; no need to find new
    data.
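
Below is a minimal sketch of this encoder/decoder loop, assuming TensorFlow/Keras with GRU layers. Class and variable names are illustrative, not the repo's exact code, and attention is omitted here (see the experiments section below for an attention sketch).

```python
# Minimal encoder/decoder sketch (illustrative, not the repo's exact code).
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=True -> one encoded vector per input word,
        # return_state=True -> final hidden state handed to the decoder.
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, padded_sentences):
        x = self.embedding(padded_sentences)
        enc_output, enc_state = self.gru(x)
        return enc_output, enc_state

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)  # scores over the target vocabulary

    def call(self, prev_word, prev_state, enc_output):
        # A real implementation would compute attention over enc_output here;
        # this sketch just uses a mean-pooled context vector.
        context = tf.reduce_mean(enc_output, axis=1, keepdims=True)  # (batch, 1, units)
        x = self.embedding(prev_word)                                # (batch, 1, embedding_dim)
        x = tf.concat([context, x], axis=-1)
        output, state = self.gru(x, initial_state=prev_state)
        logits = self.fc(tf.squeeze(output, axis=1))                 # one word at a time
        return logits, state
```

At inference time the decoder is called in a loop, feeding its own prediction and hidden state back in until the end-of-sentence tag is produced.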

Outputs

(sample translation outputs)

More about the dataset:

The dataset comes from http://www.manythings.org/anki/. The site contains about 100 foreign-language-to-English translation datasets, each tab-separated. In this repo I have used 2 datasets:

  1. Spanish to English

  2. Hindi to English

    For the Hindi-to-English dataset I have not applied preprocessing to the Hindi text, since the steps used for English do not work on Hindi.

Note:

The constants.py file has two boolean parameters, PREPROCESS_INPUT and PREPROCESS_TARGET. If set to True, preprocessing is applied; if set to False, it is not.

You can apply preprocessing to one language while leaving the other unchanged, as in the Hindi-English case: set PREPROCESS_INPUT = False (the input is Hindi) and PREPROCESS_TARGET = True. For Spanish to English, set both PREPROCESS_INPUT and PREPROCESS_TARGET to True, as in the snippet below.
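For example, a hypothetical constants.py excerpt for the Hindi-to-English case might look like this (only the two flags are shown):

```python
# Hindi -> English: only the English side gets the English-style cleaning.
PREPROCESS_INPUT = False   # input language is Hindi, skip preprocessing
PREPROCESS_TARGET = True   # target language is English, apply preprocessing

# For Spanish -> English, set both flags to True.
```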

Preprocessing:

For the preprocessing part I have followed Google's seq-2-seq attention tutorial.

Problem: the tutorial removes all punctuation except ".", "?", "!", and ",". The issue is with words like he's, which becomes [he, s] when split.

A quick fix is to change the regex from r"[^a-zA-Z?.!,¿]+" to r"[^a-zA-Z?.!,\'¿]+"; with this change he's and it's are treated as single words.

Another way is contraction expansion, where we replace it's with it is, or he's with he is.
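Here is a rough sketch of both fixes, assuming a preprocessing function along the lines of the TensorFlow NMT tutorial (the function name and the contraction map are illustrative):

```python
import re
import unicodedata

# Small illustrative contraction map; extend as needed.
CONTRACTIONS = {"he's": "he is", "it's": "it is", "don't": "do not"}

def preprocess_sentence(sentence):
    # Lower-case and strip accents, as in the tutorial this repo follows.
    sentence = ''.join(c for c in unicodedata.normalize('NFD', sentence.lower().strip())
                       if unicodedata.category(c) != 'Mn')
    # Fix 1: keep the apostrophe so "he's" survives as a single token.
    sentence = re.sub(r"[^a-zA-Z?.!,'¿]+", " ", sentence)
    # Fix 2: expand contractions instead of (or in addition to) keeping the apostrophe.
    for short, full in CONTRACTIONS.items():
        sentence = sentence.replace(short, full)
    # Put spaces around punctuation so each mark becomes its own token.
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    return " ".join(sentence.split())

print(preprocess_sentence("He's reading, isn't he?"))  # -> "he is reading , isn't he ?"
```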

Training on other datasets.

Configurations:

It's quite simple to train on a different dataset; just follow these steps.

  1. Move the dataset (.txt) file to the data folder.

  2. All model parameters are stored in the constants.py file.

  3. Remember to change the PATH parameter. It should point to your dataset file name, e.g. change spa.txt (Spanish-to-English) to hin.txt if you are doing Hindi-English translation.

  4. The SENT_LIMIT parameter limits the number of data samples used. On my local machine, with no GPU, I used 300 samples (see the loading sketch after this list).
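
As a sketch of how PATH and SENT_LIMIT might be used to load a manythings.org file (the loader function here is illustrative, not necessarily the repo's code):

```python
import io

PATH = "data/hin.txt"   # e.g. change from data/spa.txt for Hindi -> English
SENT_LIMIT = 300        # number of samples to use (300 worked on a CPU-only machine)

def load_pairs(path, limit):
    lines = io.open(path, encoding="utf-8").read().strip().split("\n")
    pairs = []
    for line in lines[:limit]:
        # Each line is "<english>\t<foreign>" (newer dumps add an attribution column).
        cols = line.split("\t")
        pairs.append((cols[0], cols[1]))
    return pairs

pairs = load_pairs(PATH, SENT_LIMIT)
print(len(pairs), "translation pairs loaded")
```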


Training the models

Two options are provided. You can use the main.py file by running `python main.py`.

Or

You can use the Streamlit web app with `streamlit run app.py`.

Experiments

I like performing experiments when I am learning new things. A few of the experiments I have performed are:

  1. Experiments on different attention mechanisms
  2. Changing the units (output dimension of the encoder)
  3. Changing the embedding dimension
  4. Changing the number of layers
  5. Changing the type of layers, e.g. [GRU, LSTM]

In these experiments, while changing one parameter (e.g. the number of layers), all other parameters are kept constant. All experiments were performed in Google Colab.

1. Experiments on different Attention Mechanisms (trained on only 30,000 samples)

| Attention Type | BLEU-1 Score | Loss (SCE) | Training Time |
| --- | --- | --- | --- |
| Simple Attention | 64% | 0.009 | 270s |
| Attention With Context | 66% | 0.006 | 300s |
| Additive Attention | 70% | 0.003 | 300s |

Observation: additive attention performs better than the other attention types.
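
For reference, here is a minimal sketch of additive (Bahdanau-style) attention, the variant that scored best above. Layer and variable names are illustrative; the repo's own implementation may differ.

```python
import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W_query = tf.keras.layers.Dense(units)
        self.W_values = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state (batch, units)
        # values: encoder outputs (batch, seq_len, units)
        query = tf.expand_dims(query, 1)                                          # (batch, 1, units)
        score = self.V(tf.nn.tanh(self.W_query(query) + self.W_values(values)))  # (batch, seq_len, 1)
        weights = tf.nn.softmax(score, axis=1)           # attention weights over the input words
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, units) context vector
        return context, weights
```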

2. Experiment on different Units


3. Experiment on different Embedding Dimensions


4. Experiment on changing the number of layers


Things to do:

  1. Need to implement a smarter way of preprocessing the data.

  2. Create a Flask app (Streamlit has limited functionality)

  3. Implement Transformer Architecture from scratch
