
---

## **Capstone Project: English-to-Hindi Language Translator using Seq2Seq LSTM**

As part of my AIML certification program, I completed a hands-on capstone project where I designed and developed an **English-to-Hindi language translator** using a **Sequence-to-Sequence (Seq2Seq) model with LSTM** architecture. The goal of the project was to apply NLP techniques to build a working machine translation pipeline and deploy it through a user-friendly web interface.

---

### **Tech Stack & Tools Used**

* **Programming & ML Libraries:** Python, TensorFlow, Keras, NumPy
* **UI Framework:** Streamlit
* **Dataset:** Tatoeba English-Hindi sentence pairs (sourced from [manythings.org](http://www.manythings.org/anki/))
* **Others:** Zip handling for the dataset archive, Keras Tokenizers, and Pickle for saving models and preprocessing objects

---

### 💡 **Problem Statement**

Translate English sentences into grammatically correct and contextually appropriate Hindi using a neural network, as a simplified, small-scale version of the neural machine translation behind services like Google Translate.

---

### **Project Workflow & Implementation**

#### 1. **Data Collection and Preprocessing**

* Downloaded a parallel corpus of English-Hindi sentence pairs.
* Cleaned and parsed the data, creating two datasets: English (source) and Hindi (target).
* Added special tokens (`<start>`, `<end>`) to target sequences to signal the start and end of decoding.
* Tokenized both source and target languages using Keras Tokenizers and padded them to the maximum sentence length.
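
The preprocessing steps above can be sketched without any dependencies. This is only an illustration of the idea, not the project's actual code: the helper names (`add_markers`, `build_vocab`, `encode_and_pad`) and the two toy sentence pairs are assumptions, and the real pipeline used Keras Tokenizers. One detail worth knowing when using the Keras `Tokenizer`: its default `filters` string includes `<` and `>`, so `<start>` and `<end>` are reduced to `start` and `end` unless `filters` is overridden.

```python
def add_markers(sentence):
    """Wrap a target sentence in the start/end tokens used for decoding."""
    return f"<start> {sentence.strip()} <end>"

def build_vocab(sentences):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode_and_pad(sentences, vocab):
    """Convert sentences to id lists and right-pad them to the longest one."""
    seqs = [[vocab[token] for token in s.split()] for s in sentences]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]

# Toy parallel data (illustrative only, not taken from the dataset)
english = ["i am happy", "thank you"]
hindi = [add_markers(s) for s in ["मैं खुश हूँ", "धन्यवाद"]]

src_vocab = build_vocab(english)
tgt_vocab = build_vocab(hindi)
encoder_input = encode_and_pad(english, src_vocab)
decoder_tokens = encode_and_pad(hindi, tgt_vocab)
```

The sketch pads at the end of each sequence, which corresponds to `padding="post"` in Keras `pad_sequences` (the Keras default is `"pre"`). Keras also assigns ids by word frequency rather than by first appearance, so the actual ids would differ.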

#### 2. **Model Architecture**

* **Encoder:**

  * Used an embedding layer followed by an LSTM layer that outputs internal states (hidden and cell).
  * These states serve as the initial context for the decoder.

* **Decoder:**

  * Takes the previous word (starting with `<start>`) and generates the next word at each time step.
  * Also uses its own embedding and LSTM layers.
  * The output is passed through a Dense layer with softmax activation, which yields a probability distribution over the Hindi vocabulary for the next word.
  * Trained using **categorical crossentropy loss** and **one-hot encoded labels**.
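
A minimal Keras sketch of the encoder-decoder described above might look like the following. The function name `build_seq2seq`, the layer sizes (`embed_dim=64`, `latent_dim=128`), and the Adam optimizer are assumptions for illustration; the write-up does not state which values the project used.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

def build_seq2seq(src_vocab_size, tgt_vocab_size, embed_dim=64, latent_dim=128):
    # Encoder: embed the source ids and keep only the final LSTM states.
    enc_in = Input(shape=(None,), name="encoder_tokens")
    enc_emb = Embedding(src_vocab_size, embed_dim)(enc_in)
    _, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

    # Decoder: its own embedding and LSTM, initialised with the encoder states.
    dec_in = Input(shape=(None,), name="decoder_tokens")
    dec_emb = Embedding(tgt_vocab_size, embed_dim)(dec_in)
    dec_seq, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c]
    )

    # Softmax over the target vocabulary at every decoder time step.
    probs = Dense(tgt_vocab_size, activation="softmax")(dec_seq)

    model = Model([enc_in, dec_in], probs)
    # Optimizer choice is an assumption; the loss matches the one-hot labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

Vocabulary sizes passed in should include one extra slot for the padding id 0.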

#### 3. **Training & Optimization**

* Trained on padded sequences using `model.fit()` with a batch size of 64 for 20 epochs.
* Used 20% of the data for validation.
* Achieved reasonable performance for short and medium-length sentences.
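
With one-hot labels, the decoder is typically trained with teacher forcing: it receives the target sequence as input and is asked to predict the same sequence shifted one step ahead. The sketch below shows that shift with NumPy. It assumes this standard setup, which the write-up does not spell out, and uses toy padded ids where 1 stands for `<start>`, 5 for `<end>`, and 0 for padding; `make_decoder_arrays` is a hypothetical helper name.

```python
import numpy as np

def make_decoder_arrays(target_ids, vocab_size):
    """Build teacher-forcing input and one-hot labels from padded target ids."""
    target_ids = np.asarray(target_ids)
    decoder_input = target_ids[:, :-1]        # drop the last step
    decoder_target_ids = target_ids[:, 1:]    # same sequence, one step ahead
    # Index an identity matrix to one-hot encode every target id.
    decoder_target = np.eye(vocab_size, dtype="float32")[decoder_target_ids]
    return decoder_input, decoder_target

padded_hindi = [[1, 2, 3, 4, 5], [1, 6, 5, 0, 0]]
decoder_input, decoder_target = make_decoder_arrays(padded_hindi, vocab_size=7)

# With a compiled Keras model (not built here), training with the settings
# listed above would then look like:
# model.fit([encoder_input, decoder_input], decoder_target,
#           batch_size=64, epochs=20, validation_split=0.2)
```

One-hot targets have shape `(samples, time steps, vocabulary size)` and grow quickly with vocabulary size; `sparse_categorical_crossentropy` with integer targets is a common alternative when memory becomes a constraint.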

#### 4. **Inference Model Setup**

* Split the trained model into separate **encoder** and **decoder** models for prediction.
* Implemented a **greedy decoding** function that predicts the Hindi translation one word at a time, always taking the most probable next word, until the `<end>` token is produced or the maximum length is reached.
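
The greedy loop itself is independent of Keras and can be sketched as below. Here `step_fn` is a hypothetical stand-in for one call to the decoder inference model (previous token and states in, next-word probabilities and new states out), and `init_state` stands in for the states returned by the encoder inference model. The toy decoder is only there to make the sketch runnable.

```python
import numpy as np

def greedy_decode(step_fn, init_state, start_id, end_id, max_len):
    """Pick the most probable token at each step until end_id or max_len.

    step_fn(token_id, state) must return (probabilities over vocab, new_state).
    """
    token, state, output = start_id, init_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = int(np.argmax(probs))  # greedy choice: highest probability
        if token == end_id:
            break
        output.append(token)
    return output

def toy_step(token_id, state):
    """Stand-in decoder that emits ids 2, 3, 4 and then the end id 5."""
    script = [2, 3, 4, 5]
    probs = np.zeros(7)
    probs[script[state]] = 1.0
    return probs, state + 1

ids = greedy_decode(toy_step, init_state=0, start_id=1, end_id=5, max_len=10)
```

The returned ids would then be mapped back to Hindi words through the target tokenizer's index-to-word lookup.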

#### 5. **Deployment with Streamlit**

* Built a clean and interactive web interface using Streamlit.
* The UI allows users to input English sentences and receive live Hindi translations from the trained model.
* Behind the scenes, the input sentence is tokenized, encoded, passed through the model, and decoded back to Hindi text.
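
A page like the one described above needs only a few Streamlit calls. The sketch below is an assumption about how such a page could be wired, not the project's actual app: `run_app` and `translate` are hypothetical names, and the Streamlit module is passed in as a parameter so the function can be exercised with a stand-in object.

```python
def run_app(st, translate):
    """Render the translator page.

    st        -- the streamlit module (or a stand-in with the same methods)
    translate -- hypothetical callable mapping an English string to Hindi text
    """
    st.title("English to Hindi Translator")
    sentence = st.text_input("Enter an English sentence")
    if st.button("Translate"):
        if sentence.strip():
            # Tokenize, encode, and decode all happen inside translate().
            st.write(translate(sentence))
        else:
            st.warning("Please enter a sentence first.")
```

In an `app.py` file, this would be called as `run_app(st, translate)` after `import streamlit as st`, and the app launched with `streamlit run app.py`.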

---

### **Outcomes**

* Developed a fully functional translation model capable of generating basic Hindi sentences from English input.
* Successfully demonstrated the encoder-decoder architecture and how LSTM networks handle sequence-to-sequence tasks.
* Learned how to bridge the gap between ML models and end-user interaction via web apps.

---

### **Key Takeaways**

* Gained practical exposure to core NLP concepts: text vectorization, sequence padding, embeddings, RNNs, and decoding strategies.
* Understood the importance of preparing separate training and inference models for real-time predictions.
* Learned to handle model saving/loading and tokenizers for production use.
* Developed UI/UX thinking while integrating ML models into an interactive front-end using Streamlit.

---

### **How to Present This in an Interview**

> "In my final AIML capstone, I built an English-to-Hindi translator using a Sequence-to-Sequence LSTM model. I started by preparing the bilingual dataset, tokenizing and padding the sequences. I trained an encoder-decoder architecture where the encoder creates context vectors, and the decoder uses them to generate the translated sentence word by word. I then set up separate encoder and decoder inference models for prediction and deployed the system using a Streamlit web app. This helped me understand the end-to-end ML pipeline—from data prep and model building to real-time inference and deployment."

---

