# Image Captioning

<h3>Table of Contents:</h3></br>
    &emsp;* Introduction</br>
    &emsp;* Dataset</br>
    &emsp;* Model Architecture</br>
    &emsp;* Training</br>
    &emsp;* Evaluation</br>
    &emsp;* Encountered Challenges</br>
    &emsp;* How To Run the Code</br>
    &emsp;* Future Improvements</br>
    &emsp;* References</br>


===============================================================================================================

<h2>Introduction</h2>
<h4>How does the model work?</h4>
&emsp;This image captioning model follows an encoder-decoder architecture, where a CNN (Convolutional Neural Network) extracts visual</br>
&emsp;features from an image, and an RNN (Recurrent Neural Network) generates a caption based on those features.

1. <h5>Feature Extraction (Encoder - CNN):</h5>
&emsp;A ResNet model processes the image and extracts visual features.</br>
&emsp;The output is a feature vector representing the image content.
</br>
2. <h5>Caption Generation (Decoder - RNN):</h5>
&emsp;The extracted image features are passed to an RNN (LSTM) to generate a sequence of words.</br>
&emsp;The decoder predicts the next word in the caption based on the previous words and the image features.</br>
&emsp;The process continues until an end token is generated or a maximum length is reached.
</br>
3. <h5>Training Process:</h5>
&emsp;The model is trained using Cross-Entropy Loss, comparing the predicted words with actual captions.</br>
&emsp;The Adam optimizer updates the weights to minimize the loss.</br>
&emsp;The model learns to associate visual features with meaningful captions through multiple training iterations.
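
The generation loop from step 2 can be sketched in a few lines. This is only an illustration, not the project's code: `predict_next` is a hypothetical callable standing in for the LSTM decoder (the real decoder would also receive the image features), and `toy_predictor` is a scripted stand-in so the example runs without a trained model. The `#START` and `#END` tokens are the ones named in the preprocessing section.

```python
from typing import Callable, List

START, END = "#START", "#END"  # caption boundary tokens


def greedy_decode(predict_next: Callable[[List[str]], str], max_len: int = 20) -> List[str]:
    """Generate a caption word by word until END appears or max_len words exist."""
    words = [START]
    while len(words) - 1 < max_len:      # -1 because START is not a caption word
        next_word = predict_next(words)  # most likely next word given the prefix
        if next_word == END:
            break
        words.append(next_word)
    return words[1:]                     # drop START


def toy_predictor(prefix: List[str]) -> str:
    """Hypothetical stand-in for the trained decoder: replays a fixed caption."""
    script = ["a", "dog", "runs", END]
    position = len(prefix) - 1
    return script[position] if position < len(script) else END


print(greedy_decode(toy_predictor))             # ['a', 'dog', 'runs']
print(greedy_decode(toy_predictor, max_len=2))  # ['a', 'dog']
```

The second call shows the maximum-length cutoff: generation stops after two words even though no end token was produced.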

<h4>Key Features of the Approach</h4>
&emsp;* CNN-RNN Pipeline: </br>&emsp;&emsp;Uses a ResNet encoder for feature extraction and an LSTM/GRU decoder for caption generation.</br></br>
&emsp;* Cross-Entropy Loss: </br>&emsp;&emsp;The loss function penalizes predicted words that differ from the ground truth captions, pushing predictions toward them.</br></br>
&emsp;* Training with Teacher Forcing: </br>&emsp;&emsp;During training, the model learns from actual captions instead of its own predictions to improve learning speed.</br></br>
&emsp;* GPU Acceleration: </br>&emsp;&emsp;The model runs efficiently on GPUs from Kaggle for faster training and inference.</br></br>
&emsp;* Checkpoint Saving: </br>&emsp;&emsp;The model periodically saves checkpoints, allowing training to resume if interrupted.</br></br>
&emsp;* Data Augmentation (Optional): </br>&emsp;&emsp;Techniques like image flipping or color adjustments can be used to improve generalization.</br></br>
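
To make the teacher forcing item concrete: at every step the decoder is fed the ground-truth previous word and asked to predict the next one, so the input sequence is the caption without its last token and the target sequence is the caption without its first. A minimal sketch follows; the helper name `teacher_forcing_pairs` is illustrative and not taken from the project code.

```python
from typing import List, Tuple


def teacher_forcing_pairs(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Split a caption into decoder inputs and the next-word targets."""
    return tokens[:-1], tokens[1:]


caption = ["#START", "a", "girl", "climbing", "stairs", "#END"]
inputs, targets = teacher_forcing_pairs(caption)
for given, expected in zip(inputs, targets):
    print(f"input: {given:<10} -> target: {expected}")
```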

===============================================================================================================

<h2>Dataset</h2>

<h4>Dataset used for image captioning: Flickr8k</h4>
&emsp;* Download link: <a href="https://www.kaggle.com/datasets/e1cd22253a9b23b073794872bf565648ddbe4f17e7fa9e74766ad3707141adeb">Flickr8k dataset</a>
<h4>How Flickr8k Is Structured</h4>
&emsp;* The dataset consists of images paired with corresponding text captions that describe the image content.</br>&emsp;* Each image has 5 captions to provide diverse descriptions.

1. Images:
</br>Stored in a directory, typically in JPG format.
</br>Each image is associated with 5 textual descriptions.</br></br>
2. Captions:
</br>Stored in a separate annotation file (TXT).
</br>Each entry contains the image filename and a corresponding caption.
</br></br>Example from Captions file:
```text
image,caption
1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg,A girl going into a wooden building .
1000268201_693b08cb0e.jpg,A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg,A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg,A little girl in a pink dress going into a wooden cabin .
```
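
A file in this layout can be grouped into one list of captions per image with the standard `csv` module. The sketch below assumes the file follows normal CSV quoting rules; it reads from an in-memory sample so it runs without the dataset, and the function name `load_captions` is illustrative.

```python
import csv
import io
from collections import defaultdict

# Two rows in the same layout as the excerpt above. For the real file, pass
# open(path, newline="", encoding="utf-8") instead of this in-memory sample.
sample = io.StringIO(
    "image,caption\n"
    "1000268201_693b08cb0e.jpg,A girl going into a wooden building .\n"
    "1000268201_693b08cb0e.jpg,A little girl climbing into a wooden playhouse .\n"
)


def load_captions(handle):
    """Group captions by image filename: {filename: [caption, ...]}."""
    captions = defaultdict(list)
    for row in csv.DictReader(handle):
        captions[row["image"]].append(row["caption"].strip())
    return dict(captions)


captions = load_captions(sample)
print(len(captions["1000268201_693b08cb0e.jpg"]))  # 2
```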

<h4><strong>Preprocessing Steps</strong></h4>
&emsp;Before training, the dataset undergoes several preprocessing steps to ensure consistency and improve model performance.
</br>
</br>1. Image Preprocessing:
</br>* Resizing – Images are resized to a fixed dimension (300 x 300 for ResNet).
</br>* Normalization – Pixel values are scaled to [0,1] to match the CNN’s input format.
</br>
</br>2. Text Preprocessing:
</br>* Token Conversion – The caption tokens are converted into a PyTorch tensor.
</br>* Vocabulary Building – A word-to-index mapping is created for encoding text.
</br>* Padding & Truncation – Captions are padded to a fixed length for batch processing.
</br>* Start & End Tokens – Special tokens (#START, #END) are added to mark caption boundaries.
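
The text preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: `#START` and `#END` come from the list above, while the `#PAD` and `#UNK` token names, the whitespace tokenizer, and the helper names `build_vocab` and `encode` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List

# "#START" and "#END" are named in the text; "#PAD" and "#UNK" are assumed names.
PAD, START, END, UNK = "#PAD", "#START", "#END", "#UNK"


def build_vocab(captions: List[str], min_count: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent word to an integer index."""
    counts = Counter(word for c in captions for word in c.lower().split())
    vocab = {token: i for i, token in enumerate([PAD, START, END, UNK])}
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(caption: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Add boundary tokens, truncate to max_len, then pad with PAD."""
    words = caption.lower().split()[: max_len - 2]  # leave room for START and END
    tokens = [START] + words + [END]
    ids = [vocab.get(token, vocab[UNK]) for token in tokens]
    return ids + [vocab[PAD]] * (max_len - len(ids))


captions = ["A girl going into a wooden building .", "A little girl climbing the stairs ."]
vocab = build_vocab(captions)
encoded = [encode(c, vocab, max_len=10) for c in captions]
print(encoded)  # every row has length 10
```

Because every row has the same length, the result can be converted to a single tensor with `torch.tensor(encoded)` for batch processing.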

===============================================================================================================

<h2>Model Architecture</h2>

* Encoder: CNN (ResNet) extracts image features.
</br>* Decoder: RNN (LSTM) generates captions from extracted features.
</br>* Loss Function: Cross-Entropy Loss.
</br>* Optimizer: Adam.

<h4>Why did I choose ResNet over a simple CNN?</h4>
&emsp;1. It mitigates the vanishing gradient problem: skip connections give gradients a direct path through the network.</br>
&emsp;2. It enables training of very deep networks.

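<h4>Why did I choose LSTM over a simple RNN?</h4>
&emsp;1. It also mitigates the vanishing gradient problem.</br>
&emsp;2. It retains information across long sequences better.</br>
&emsp;3. Its gating makes gradient flow more stable, although exploding gradients can still occur and may need gradient clipping.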

<h4>Why did I choose Cross-Entropy as the loss function?</h4>
&emsp;1. It is designed for classification problems; here, predicting each word is a multi-class classification over the vocabulary.</br>
&emsp;2. It works well with softmax.
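
A small numeric example shows how softmax and cross-entropy fit together for a single predicted word. The four-word vocabulary and the logit values are made up for illustration; the loss is the negative log-probability that softmax assigns to the correct word.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log-probability that softmax assigns to the correct word."""
    return -math.log(softmax(logits)[target_index])


# Toy 4-word vocabulary; index 2 is the ground-truth next word.
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target_index=2)
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], target_index=2)
print(confident < uniform < wrong)  # True
```

The loss is small when the model puts most of its probability on the correct word, equals ln(4) when it is undecided between four words, and grows when it is confidently wrong.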

===============================================================================================================

<h2>Training</h2>

<h4>Model Training Loop</h4>
* Epoch-based training: Looping over multiple epochs to refine the model.
<h4>Forward Pass:</h4>
* Image is passed through ResNet to extract features.
</br>* Captions are passed through LSTM to generate the next word in the sequence.
<h4>Loss Calculation:</h4>
* Cross Entropy Loss is used to measure how well predicted captions match the ground truth.
<h4>Backpropagation:</h4>
* Using Adam Optimizer to adjust weights for both the encoder and decoder.
<h4>Checkpointing:</h4>
* Model weights are saved every few iterations to prevent losing progress.
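
The forward pass, loss calculation, backpropagation, and checkpointing steps can be put together as below. This is a sketch under assumptions, not the project's training loop: the tiny embedding-plus-linear `model` is a stand-in for the ResNet + LSTM captioner (it ignores images entirely), and the vocabulary size, learning rate, checkpoint interval, and file name are toy values chosen for the example.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)
VOCAB_SIZE, PAD_ID = 12, 0  # toy values, not the project's settings

# Stand-in for the real captioner: token ids (B, T) -> logits (B, T, V).
model = nn.Sequential(nn.Embedding(VOCAB_SIZE, 16), nn.Linear(16, VOCAB_SIZE))
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # padded positions add no loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# One toy batch of encoded captions, padded with PAD_ID.
captions = torch.tensor([[1, 4, 5, 2, 0], [1, 6, 7, 8, 2]])
inputs, targets = captions[:, :-1], captions[:, 1:]  # teacher forcing shift

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
losses = []
for step in range(1, 21):
    logits = model(inputs)  # forward pass
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()         # backpropagation
    optimizer.step()        # Adam weight update
    losses.append(loss.item())
    if step % 10 == 0:      # periodic checkpoint
        torch.save(
            {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
            checkpoint_path,
        )

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

To resume after an interruption, load the file with `torch.load` and pass the saved dictionaries to `load_state_dict` on the model and the optimizer.
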
<h4>Monitoring & Debugging</h4>
* Loss Values: Tracking loss over epochs to ensure convergence.
<h4>Prediction Checks:</h4>
* Generating captions for images every 100 iterations to evaluate performance.
<h4>Potential Issues Identified:</h4>
* Model predicting the same captions repeatedly (indicating lack of diversity).
</br>* Need for better sampling techniques or temperature scaling in the softmax function.
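
Temperature scaling, mentioned above as a possible remedy, divides the decoder's scores by a temperature before the softmax and then samples a word instead of always taking the most likely one. The sketch below uses a made-up vocabulary and made-up scores; the function names are illustrative and this technique is not part of the current model.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits / temperature; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_word(logits, vocab, temperature=1.0, rng=random):
    """Draw the next word at random in proportion to its tempered probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["dog", "puppy", "cat", "#END"]  # toy vocabulary
logits = [2.0, 1.5, 0.5, 0.1]            # made-up decoder scores
rng = random.Random(0)
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs], sample_word(logits, vocab, t, rng))
```

A temperature below 1 concentrates probability on the top word (closer to greedy decoding), while a temperature above 1 spreads it out and produces more varied captions at the cost of more mistakes.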

===============================================================================================================

<h2>Evaluation</h2>
We evaluate the quality of generated image captions using the BLEU (Bilingual Evaluation Understudy) score.
</br>BLEU measures how closely the generated caption matches human-written captions by comparing n-grams (sequences of words).
</br>In our evaluation, we calculate BLEU scores at different levels to assess the accuracy of generated captions.

<h4>Evaluation Process</h4>
1. Load the trained model and generate captions for a sample of test images.
</br>2. Compare the generated captions with ground truth captions using the BLEU score.
</br>3. Aggregate BLEU scores across multiple samples to measure overall performance.
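
For reference, the comparison in step 2 can be written out in plain Python. This is a simplified sentence-level BLEU (uniform n-gram weights, no smoothing) meant to show what the metric computes; it is not the evaluation code used in this project, and the candidate caption below is invented for the example.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped precision: an n-gram counts at most as often as in any one reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    penalty = 1.0 if c > r else math.exp(1 - r / c)
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


references = [
    "a little girl climbing into a wooden playhouse .".split(),
    "a girl going into a wooden building .".split(),
]
candidate = "a girl climbing into a wooden playhouse .".split()
for n in (1, 2, 4):
    print(f"BLEU-{n}: {bleu(candidate, references, max_n=n):.3f}")
```

For step 3, averaging these per-caption scores is the simplest way to aggregate. Corpus-level BLEU instead pools the n-gram counts over all captions before computing the score, so the two numbers are not directly comparable.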

<h4>Observations and Challenges</h4>
1. BLEU does not account for synonyms or sentence structure, so a meaningful but differently worded caption might receive a low score.
</br>2. Shorter captions may score lower because of BLEU's brevity penalty, even if they are correct.

===============================================================================================================

<h2>Encountered Challenges</h2>
1. Kaggle Notebook interruption due to inactivity during model training.
</br>Solution: Simulate user activity by running a script from PyCharm in the background that periodically types into a notebook cell.
</br></br>2. The browser shut down during training because it ran out of memory.
</br>Solution: I cut the dataset in half so that training takes less time.
</br></br>3. Choosing the right number of layers in ResNet to reduce training time.
</br>Solution: I tried ResNet50, but the GPU couldn't handle it, so I switched to ResNet34.
</br></br>4. Choosing the batch size.
</br>Solution: I used the largest batch size the GPU could handle, which was 21.

===============================================================================================================

<h2>How to Run the Code</h2>
You can run the notebook cell by cell, from top to bottom.

===============================================================================================================

<h2>Future Improvements</h2>
1. Using Transformer-based models instead of LSTMs.</br>
2. Using more advanced image encoders such as EfficientNet, ConvNeXt, or Vision Transformer (ViT), which can extract better features.</br>
3. Replacing Cross-Entropy with loss functions that correlate better with human evaluations.

===============================================================================================================

<h2>References</h2>


1. https://arxiv.org/pdf/1502.03044v2
2. https://www.digitalocean.com/community/tutorials/writing-resnet-from-scratch-in-pytorch
3. https://medium.com/@wangdk93/lstm-from-scratch-c8b4baf06a8b
4. https://youtu.be/y2BaTt1fxJU?si=10vHpteH-vsX5bFf