Efficient Urdu Caption Generation using Attention Mechanism

Abstract

Recent advancements in deep learning have created many opportunities to solve real-world problems that remained unsolved for more than a decade. Automatic caption generation is a major research field, and the research community has done a lot of work on this problem for the most common languages, such as English. Urdu is the national language of Pakistan and is also widely spoken and understood across the Pakistan-India subcontinent, yet no work has been done on caption generation for the Urdu language. Our research aims to fill this gap by developing an attention-based deep learning model that uses sequence modelling techniques specialized for Urdu. We prepared an Urdu dataset by translating a subset of the "Flickr8k" dataset containing 700 'man' images. We evaluate our proposed technique on this dataset and show that it achieves a BLEU score of 0.83 on Urdu. We improve on previously proposed techniques by using better CNN architectures and optimization techniques. Furthermore, we also tried adding a grammar loss to the model in order to make the predictions grammatically correct.

Dataset

Our main task is caption generation for images in the Urdu language, and there was no publicly available dataset for this task. We therefore decided to translate a popular image captioning dataset, Flickr8k, from scratch, because the available translators were not sufficiently accurate, especially on idioms and context understanding. The Flickr8k dataset has 8,000 images, and each image has 5 captions in English. We selected about 700 images (3,500 captions) with similar context to translate into Urdu. The selected captions are related to a "man" doing different activities such as water-boarding, snow-boarding and biking.
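The selection step can be illustrated with a minimal sketch. It assumes the English captions are stored one per line as `image.jpg#index<TAB>caption`, and that an image is kept when at least one of its captions contains the word "man". Both the file layout and the keyword rule are assumptions for illustration; the function names `load_captions` and `select_by_keyword` are hypothetical, and the procedure we actually followed may have differed.

```python
from collections import defaultdict

def load_captions(lines):
    """Group captions by image id from 'image.jpg#idx<TAB>caption' lines (assumed layout)."""
    captions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, caption = line.split("\t", 1)
        image_id = key.split("#", 1)[0]  # drop the per-image caption index
        captions[image_id].append(caption)
    return dict(captions)

def select_by_keyword(captions, keyword="man"):
    """Keep images where at least one caption contains the keyword as a whole word."""
    selected = {}
    for image_id, caps in captions.items():
        if any(keyword in cap.lower().split() for cap in caps):
            selected[image_id] = caps
    return selected

# Made-up sample lines, not taken from the real dataset.
sample = [
    "img1.jpg#0\tA man rides a bike .",
    "img1.jpg#1\tA cyclist on a trail .",
    "img2.jpg#0\tA woman walks a dog .",
]
subset = select_by_keyword(load_captions(sample))
```

Matching on whole words keeps "woman" from being counted as "man". All captions of a selected image are kept together so that each image retains its full caption set for translation.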

Original Flickr8k Dataset: Download Here
Translated Captions in Urdu: Download Here

Model

Most of our caption generation model is inspired by "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" [1]. The model consists of an encoder and a decoder. The encoder uses a CNN (we experimented with ResNet-101-V2, InceptionV3 and Xception) to extract features from images, while the decoder uses an attention mechanism and a GRU. We used Bahdanau attention in our model. The main purpose of the attention mechanism is to focus on the relevant part of the image. A GRU is similar to an LSTM but has fewer gates and parameters, which makes it faster and less computationally expensive. The GRU generates the next word of the caption at the current time step. The generated word is based on the image features, the previous hidden state of the GRU, the current input to the decoder, and the context vector generated by the attention mechanism. Finally, the model checks whether the grammar of the generated caption is good or bad.
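To make the attention step concrete, here is a minimal dependency-free sketch of additive (Bahdanau) attention over image features. It scores each image location with `v · tanh(W1 f + W2 h)`, normalizes the scores with a softmax, and returns the weighted sum of features as the context vector. The function name, the weight names `W1`, `W2`, `v`, and the toy numbers are assumptions for illustration; this is not the code in the repository, which would use a deep learning framework and learned weights.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bahdanau_attention(features, hidden, W1, W2, v):
    """features: L vectors of size D; hidden: decoder state of size H.
    W1 is (U x D), W2 is (U x H), v has size U. Returns (context, weights)."""
    h_proj = matvec(W2, hidden)  # the hidden state is projected once
    scores = []
    for f in features:
        f_proj = matvec(W1, f)
        act = [math.tanh(a + b) for a, b in zip(f_proj, h_proj)]
        scores.append(sum(vi * ai for vi, ai in zip(v, act)))
    weights = softmax(scores)  # one weight per image location
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights

# Toy inputs: three image locations with 2-d features and a 1-d hidden state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0], [-1.0]]
v = [1.0, 1.0]
context, weights = bahdanau_attention(features, hidden, W1, W2, v)
```

In a decoder, `context` would be combined with the embedding of the current input word and fed to the GRU, and `weights` can be reshaped onto the image grid to visualize where the model is looking.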

Example of the Attention Mechanism at Work

Model Diagram

Generated Captions

Results

We used the BLEU score as the evaluation metric to assess the performance of the model.
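For readers unfamiliar with the metric, the following is a minimal sketch of unsmoothed sentence-level BLEU with uniform weights over 1- to 4-grams: clipped n-gram precisions are combined by a geometric mean and multiplied by a brevity penalty. The exact BLEU variant, n-gram order, and smoothing used for the reported score are not specified here, so treat this as an illustration of the metric only. The function names and the romanized Urdu tokens are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(references, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # For each n-gram, the highest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Romanized, made-up captions used only to exercise the function.
reference = "ek aadmi cycle chala raha hai".split()
exact = sentence_bleu([reference], reference)
partial = sentence_bleu([reference], "ek aadmi cycle chala raha tha".split())
```

An exact match scores 1.0, and changing the final word lowers every n-gram precision and therefore the score. In practice a corpus-level BLEU from an established library is usually preferable to per-sentence scores, since short captions often have no matching 4-grams.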

Contribution

The members of our group are myself (Hafiz Muhammad Abdullah Zia), Inam Ilahi, Armughan Ahmed, Rauf Tabassum, and Ahtazaz Ahsan.
