All-Lang-Translator

The goal of this project is to learn about seq-2-seq models with attention and to experiment with different attention mechanisms and model architectures.

The end goal is to create an all-language translator. The idea is to provide users with a GUI-based web application where they can download a dataset (e.g. Spanish to English), click the train button, and get an NMT model. If a model is already trained, users can run inference with it.

About the problem:

NMT, or neural machine translation, is the problem of translating text from one language to another. For this we use the seq-2-seq model architecture. The encoder takes a sentence in the input language and outputs an encoded vector for each input word. This encoded output is then fed into the decoder along with the decoder's previous hidden state and the previous word.

Note:

The encoder takes in a whole sentence, but the decoder does not output a whole sentence; it outputs one word at a time until it reaches the end-of-sentence tag.
The input to the encoder is the padded input-language sentences. The decoder takes 3 inputs: the previously predicted word vector, the previous hidden state, and the encoder's output.
The input and output languages can be interchanged. If you have an English-to-Spanish dataset, you can also use it for Spanish-to-English; there is no need to find new data.
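The word-by-word decoding described in the notes above can be sketched as a simple greedy loop. This is only an illustration: the tag names `<start>` and `<end>`, the `greedy_decode` helper, and the `toy_step` stand-in for a trained decoder are all assumptions, not code from this repo.

```python
# Assumed tag names; the actual tokens used by the repo may differ.
START, END = "<start>", "<end>"

def greedy_decode(step, encoder_outputs, initial_state, max_len=20):
    """Call the decoder one word at a time until it emits END or max_len is hit.

    `step` receives the three decoder inputs from the notes above:
    the previous word, the previous hidden state, and the encoder outputs.
    It returns the next word and the new hidden state.
    """
    word, state = START, initial_state
    output = []
    for _ in range(max_len):
        word, state = step(word, state, encoder_outputs)
        if word == END:
            break
        output.append(word)
    return output

def toy_step(prev_word, state, encoder_outputs):
    """Hypothetical stand-in for a trained decoder.

    It upper-cases the encoder output at position `state`, then emits END
    once every position has been consumed.
    """
    if state >= len(encoder_outputs):
        return END, state
    return encoder_outputs[state].upper(), state + 1

print(greedy_decode(toy_step, ["hola", "mundo"], 0))
```

A real decoder step would pick the most probable word from the model's output distribution instead of reading it from the input, but the stopping logic stays the same.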

Outputs

More about the dataset:

The datasets come from http://www.manythings.org/anki/. The site contains about 100 foreign-language to English translation datasets. The datasets are tab-separated. In this repo I have used 2 datasets:

Spanish to English
Hindi to English

For the Hindi-to-English dataset I have not applied preprocessing to the Hindi text, as the steps used for English don't work on Hindi.
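As a rough sketch of reading such a tab-separated file, the snippet below keeps the first two columns of each line and can swap them to reverse the translation direction. The `load_pairs` function and its `swap` and `sent_limit` parameters are hypothetical names for illustration; the repo's own loading code may look different. Some files may carry extra columns after the sentence pair, which this sketch ignores.

```python
def load_pairs(lines, sent_limit=None, swap=False):
    """Parse tab-separated lines into (first_column, second_column) pairs.

    lines      -- any iterable of strings, e.g. an open file
    sent_limit -- stop after this many pairs (None means no limit)
    swap       -- if True, return (second_column, first_column) instead,
                  which reverses the translation direction
    """
    pairs = []
    for line in lines:
        if sent_limit is not None and len(pairs) >= sent_limit:
            break
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        first, second = fields[0], fields[1]
        pairs.append((second, first) if swap else (first, second))
    return pairs

sample = ["Go.\tVe.\textra column\n", "Hi.\tHola.\n"]
print(load_pairs(sample))
print(load_pairs(sample, swap=True))
```

With a real file, the same function can be called as `load_pairs(open(path, encoding="utf-8"), sent_limit=300)`.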

Note:

The constants.py file has two boolean parameters, PREPROCESS_INPUT and PREPROCESS_TARGET. If a parameter is set to True, preprocessing is applied to that side; if it is set to False, preprocessing is not applied.

You can preprocess one language while keeping the other unchanged. For Hindi-to-English translation, set PREPROCESS_INPUT = False (the input is Hindi) and PREPROCESS_TARGET = True. For Spanish to English, set both PREPROCESS_INPUT and PREPROCESS_TARGET to True.

Preprocessing:

For the preprocessing I followed Google's seq-2-seq attention example.

Problem: it removes all punctuation except ".", "?", "!" and ",". This breaks words like he's, which becomes [he, s] when split.

A quick fix is to change the regex from r"[^a-zA-Z?.!,¿]+" to r"[^a-zA-Z?.!,\'¿]+". With this change, he's and it's are each kept as a single word.

Another way is contraction expansion, where we replace it's with it is and he's with he is.
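The difference between the two regexes, and the contraction-expansion alternative, can be seen in a small sketch. The two patterns are the ones quoted above; the `clean` and `expand_contractions` helpers and the tiny `CONTRACTIONS` table are assumptions made for illustration, not the repo's preprocessing code.

```python
import re

ORIGINAL = r"[^a-zA-Z?.!,¿]+"            # apostrophes are replaced by a space
WITH_APOSTROPHE = r"[^a-zA-Z?.!,\'¿]+"   # apostrophes are kept

def clean(sentence, pattern):
    """Replace every run of disallowed characters with a space, then split."""
    return re.sub(pattern, " ", sentence).strip().split()

# Illustrative table only. Note that "he's" can also mean "he has",
# so a real table needs more care.
CONTRACTIONS = {"he's": "he is", "it's": "it is"}
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(sentence):
    """Replace known contractions with their (lower-case) expanded form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], sentence)

print(clean("he's here", ORIGINAL))                       # ['he', 's', 'here']
print(clean("he's here", WITH_APOSTROPHE))                # ["he's", 'here']
print(clean(expand_contractions("he's here"), ORIGINAL))  # ['he', 'is', 'here']
```

Keeping the apostrophe enlarges the vocabulary slightly, while expansion keeps the vocabulary small but depends on how complete the contraction table is.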

Training on other datasets.

Configurations:

It's quite simple to train on a different dataset. Just follow these steps:

Move the dataset (.txt) file to the data folder.
All model parameters are stored in the constants.py file.
Remember to change the PATH parameter. It should point to your dataset file, e.g. change it from spa.txt (Spanish to English) to hin.txt if you are doing Hindi-to-English translation.
The SENT_LIMIT parameter limits the number of data samples used. On my local machine, with no GPU, I used 300 samples.
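For a Hindi-to-English run on a machine without a GPU, the relevant part of constants.py might look roughly like the fragment below. Only the four parameter names come from this README; the exact form of the PATH value (with or without the data folder prefix) is an assumption, and the real file holds further model parameters that are not shown here.

```python
# Hypothetical fragment of constants.py for Hindi-to-English translation.
# Whether PATH needs the "data/" prefix depends on how the repo reads it.
PATH = "data/hin.txt"

# Number of sentence pairs to use; 300 is the value used on a CPU-only machine.
SENT_LIMIT = 300

# Hindi is the input language, so the English-oriented preprocessing is skipped for it.
PREPROCESS_INPUT = False
PREPROCESS_TARGET = True
```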

Training the models

There are 2 options. You can use the main.py file by running python main.py

Or

You can use the Streamlit web app by running streamlit run app.py

Experiments

I like running experiments when I am learning new things. A few of the experiments I have performed are:

Experiments on different attention mechanisms
Experiments changing the units (output dim of the encoder)
Experiments changing the embedding dimension
Changing the number of layers
Changing the type of layers, e.g. [gru, lstm]

In these experiments, while one parameter is changed (e.g. the number of layers), all other parameters are kept constant. All experiments were performed in Google Colab.
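Of the attention mechanisms compared, additive attention has a standard textbook form, sketched below in plain Python. This is not the repo's implementation (see the attention files in the repo for that); the function name, the weight names W1, W2 and v, and the toy numbers are assumptions for illustration. Each encoder output is scored against the decoder state with v · tanh(W1·h + W2·s), the scores are normalised with a softmax, and the context vector is the weighted sum of encoder outputs.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(query, values, W1, W2, v):
    """Textbook additive attention.

    query  -- decoder hidden state, length d_q
    values -- encoder outputs, one vector of length d_v per input word
    W1     -- units x d_v matrix, W2 -- units x d_q matrix, v -- length units
    Returns (context_vector, attention_weights).
    """
    query_proj = matvec(W2, query)
    scores = []
    for h in values:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), query_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Numerically stable softmax over the input positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder outputs.
    context = [sum(w * h[j] for w, h in zip(weights, values))
               for j in range(len(values[0]))]
    return context, weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = additive_attention(
    query=[0.5],
    values=encoder_outputs,
    W1=[[0.3, -0.2], [0.1, 0.4]],
    W2=[[0.2], [-0.1]],
    v=[1.0, 0.5],
)
print(weights, context)
```

In a trained model W1, W2 and v are learned parameters, and the context vector is passed to the decoder together with the previous word.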

1. Experiments on different Attention Mechanisms trained on only 30000 samples

Attention Type	BLEU-1 Score	Loss (SCE)	Training Time
Simple Attention	64%	0.009	270s
Attention With Context	66%	0.006	300s
Additive Attention	70%	0.003	300s
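For readers unfamiliar with the metric in the table, a sentence-level BLEU-1 score against a single reference can be sketched as clipped unigram precision times a brevity penalty. How the scores above were actually computed is not stated in this README, so the `bleu1` function below is an assumption for illustration, not the evaluation code used for the table.

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Sentence-level BLEU-1 for one reference and one candidate (token lists)."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Clipped precision: a candidate word is credited at most as many
    # times as it appears in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision

print(bleu1("the cat sat".split(), "the cat sat".split()))  # 1.0
print(bleu1("the cat".split(), "the the".split()))          # 0.5
```

Corpus-level BLEU aggregates the counts over all sentences before dividing, so it is not simply the average of these per-sentence values.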

