### Exploration of subtasks / methodology:

 - ### Caption sentiment class extraction
 - ### Caption sentiment classifier
 - ### Image feature extraction Pipeline: 
     - VGG-16
     - Inception
     - AlexNet
     - According to [Google's dataset paper](https://www.aclweb.org/anthology/P18-1238.pdf), Inception_Resnet_v2 is best for feature extraction
 - ### Image object detection/ data
 - ### Object + Sentiment sentence generation
 - ### End to End model from tutorial
 - ### Caption evaluation pipeline:
     - BLEU score
     - Perplexity?
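
To make the two candidate evaluation metrics concrete, here is a minimal sketch of what each one computes. It is a simplification written for illustration, not the evaluation pipeline itself: the BLEU variant uses a single reference, whitespace tokens, no smoothing, and n-grams up to `max_n=2` (standard BLEU usually goes up to 4-grams), and the token probabilities passed to `perplexity` are hypothetical values that a language model would have to supply.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=2):
    """Single-reference BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Hypothetical captions and token probabilities, for illustration only
reference = "a dog runs on the grass".split()
candidate = "a dog runs on grass".split()
print(sentence_bleu(reference, candidate))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Higher BLEU means more n-gram overlap with the reference caption, while lower perplexity means the model found the caption less surprising.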
 

## Initial loading and displaying

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt

In [None]:
sample_data_folder = r'data/initial_sample/'
cleaned_meta_data_file = os.path.join(sample_data_folder,'cleaned_meta_data.csv')
images_folder = os.path.join(sample_data_folder,'images')

In [None]:
image_captions = pd.read_csv(cleaned_meta_data_file, index_col='index')
image_captions.rename(columns = {'0':'caption', '1':'link', '2':'objects', '3': 'mid', '4': 'object_confidence'}, inplace=True)
image_captions.head()


In [None]:
len(image_captions)

In [None]:
#Add notebook relative file_path to image 
image_captions['image_path'] = image_captions['pos'].apply(lambda p: os.path.join(images_folder, str(p) + ".png"))
image_captions


In [None]:
from PIL import Image

# Load a few sample images and show each one next to its detected objects
def display_samples(meta_df, num_samples=5, seed=0):
    # Sample num_samples rows from the dataframe (reproducible via seed)
    samples = meta_df.sample(n=num_samples, random_state=seed)
    for idx, row in samples.iterrows():
        file_name = row['image_path']
        caption = row['caption']
        # Get objects and confidence scores; rows without detections hold NaN
        if pd.isna(row['objects']) or pd.isna(row['object_confidence']):
            objects, confidences = [], []
        else:
            objects = row['objects'].split(',')
            confidences = row['object_confidence'].split(',')

        # Pair each object with its confidence string, truncated to 4 characters
        obj_conf = [str((obj, conf[0:4])) for obj, conf in zip(objects, confidences)]
        num_obj = len(obj_conf)
        obj_str = "\n".join(obj_conf)
        image = Image.open(file_name)
        # Figure height grows with the number of objects (at least one row high)
        fig = plt.figure(figsize=(10, 0.3 * max(num_obj, 1)))

        # Left panel: the image, titled with its caption
        ax = fig.add_subplot(121)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.imshow(image)
        ax.set_title(caption)

        # Right panel: the list of (object, confidence) pairs
        ax = fig.add_subplot(122)
        ax.text(0.1, 0.5, obj_str, horizontalalignment='left', verticalalignment='center')
        ax.set_xticks([])
        ax.set_yticks([])
        plt.show()

In [None]:
display_samples(image_captions)

## Profiling
- Size distribution of images
- Aspect ratio distribution of images: w/h
- Distribution of number of objects per image
- Distribution of number of objects at various confidence thresholds
- Distribution of caption size for each image

### Size distribution of images
This cell takes a while to run, since it opens every image file.

In [None]:
image_captions['size'] = image_captions['image_path'].apply(lambda p: Image.open(p).size)
image_captions

In [None]:
#number of different sizes:
image_captions['size'].describe()

In [None]:
#profiling widths and heights
image_captions['height'] = image_captions['size'].apply(lambda x: x[1])
image_captions['width'] = image_captions['size'].apply(lambda x: x[0])

In [None]:
image_captions['height'].describe()


In [None]:
image_captions['width'].describe()

In [None]:
image_captions[['width','height']].plot.hist(bins=100, alpha=0.5)

In [None]:
#looks like most images have a height and width under 1000 

In [None]:
quant = 0.99
print(image_captions['height'].quantile(quant))
print(image_captions['width'].quantile(quant))

### Aspect ratio distribution of images

In [None]:
image_captions['aspect_ratio'] = image_captions['width'] / image_captions['height']

In [None]:
image_captions['aspect_ratio'].describe()

In [None]:
quant = 0.99
print(image_captions['aspect_ratio'].quantile(quant))

In [None]:
image_captions[['aspect_ratio']].plot.hist(bins=20, alpha=0.5)

### Number of objects per image

In [None]:
image_captions['num_obj'] = image_captions['objects'].apply(lambda o: 0 if pd.isna(o) else len(str(o).split(',')))

In [None]:
image_captions['num_obj'].describe()


In [None]:
image_captions['num_obj'].value_counts()


All but 4 images have at least 1 detected object, no image has more than 15, and around 50% of the images have exactly 15 objects detected.

In [None]:
image_captions['num_obj'].plot.hist(bins=15, alpha=0.5)

### Distribution of num_objects at various confidence thresholds


In [None]:
def get_objects_with_conf_above(meta_df, threshold):
    """
    Returns a Series holding, for each row, a list of (object, confidence) tuples
    for the objects whose confidence is at least threshold.
    """

    # Use meta_df rather than the global dataframe so the function works on any input.
    # Rows with missing objects or confidences yield an empty list.
    obj_conf = meta_df.apply(lambda x: [] if (pd.isna(x.objects) or pd.isna(x.object_confidence))
                                   else [(obj,float(conf)) for (obj,conf) in zip(x.objects.split(','), x.object_confidence.split(','))
                                        if float(conf) >= threshold], axis=1)
    return obj_conf
    


In [None]:
# Get the number of objects per image at various confidence thresholds and plot them
thresholds = [90,80,75,50,1]
for t in thresholds:
    num_obj_conf = get_objects_with_conf_above(image_captions,(t*0.01)).apply(len)
    num_obj_conf.plot.hist(alpha=0.5,title=f"Distribution of no. of objects with confidence above {t}%").set_xlabel(f"No. of objects in image with confidence higher than {t}%")
    plt.show()
    print(f"Number of images with at least 1 object with confidence higher than {t}%:",len(num_obj_conf[num_obj_conf > 0]))
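
To make the parsing and thresholding step concrete for a single row, here is a small standalone sketch. The object names and confidence values are made up for illustration, and `parse_detections` and `count_above` are hypothetical helpers that are not part of the notebook; they only mirror the logic of `get_objects_with_conf_above` without pandas.

```python
def parse_detections(objects, confidences):
    """Pair comma-separated object names with their float confidences."""
    if not objects or not confidences:
        return []
    return [(obj.strip(), float(conf))
            for obj, conf in zip(objects.split(','), confidences.split(','))]


def count_above(detections, threshold):
    """Number of detections whose confidence is at least threshold."""
    return sum(1 for _, conf in detections if conf >= threshold)


# Hypothetical row: three detected objects with decreasing confidence
row_objects = "dog,grass,ball"
row_confidences = "0.97,0.81,0.42"

detections = parse_detections(row_objects, row_confidences)
counts = {t: count_above(detections, t / 100) for t in [90, 80, 75, 50, 1]}
print(counts)
```

Lowering the threshold can only keep the count the same or increase it, which is why the histograms shift right as the threshold drops.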


## Divide captions into sentiment classes

 - NLTK's pretrained `SentimentIntensityAnalyzer` (VADER): polarity scores and a sentiment class derived from them
 - Hugging Face pretrained classifier

### NLTK's pretrained classifier and polarity scores

In [None]:
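import nltk
# SentimentIntensityAnalyzer needs the VADER lexicon; skipped if already downloaded
nltk.download('vader_lexicon')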


In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [None]:
#example on random captions

random_sample = image_captions.sample(n=50, random_state=0)
random_sample

In [None]:
random_sample['nltk_sent_int'] = random_sample['caption'].apply(lambda x: sia.polarity_scores(x))
#classify using polarity scores: class with max polarity
random_sample['nltk_sent'] = random_sample['nltk_sent_int'].apply(lambda x: max(['neu', 'pos','neg'], key=x.get))
random_sample['nltk_sent'] 

In [None]:
# Now do this for every caption and plot a histogram of the classes
image_captions['nltk_sent_polarity'] = image_captions['caption'].apply(lambda x: sia.polarity_scores(x))

In [None]:
image_captions['nltk_sent'] = image_captions['nltk_sent_polarity'].apply(lambda x: max(['neu', 'pos','neg'], key=x.get))

In [None]:
image_captions['nltk_sent'].describe()

In [None]:
image_captions['nltk_sent'].value_counts()

In [None]:
image_captions['nltk_sent'].value_counts().plot(kind='bar')

As we can see (and as expected), the overwhelming majority of captions are classified as neutral.

We can instead restrict the choice to positive and negative, picking whichever of the two has the higher polarity score.

In [None]:
image_captions['nltk_sent_pos_neg'] = image_captions['nltk_sent_polarity'].apply(lambda x: max(['pos','neg'], key=x.get))

In [None]:
image_captions['nltk_sent_pos_neg'].value_counts()

In [None]:
image_captions['nltk_sent_pos_neg'].value_counts().plot(kind='bar')

Still, classes are very unbalanced. 
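
One possible reason for the imbalance is that the `neu`, `pos` and `neg` scores are proportions of the text, so `neu` dominates for mostly descriptive captions, and `max(['pos','neg'], key=...)` returns `'pos'` whenever the two scores are tied (for example both 0). The dictionary returned by `polarity_scores` also contains a `compound` score between -1 and 1, and a commonly cited VADER convention is to label text positive at `compound >= 0.05` and negative at `compound <= -0.05`. The sketch below applies that rule; `label_from_compound` is a hypothetical helper and the example dictionaries are made-up values shaped like VADER output, not results from this dataset.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style polarity dict to 'pos', 'neg' or 'neu' via its compound score."""
    compound = scores['compound']
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


# Hypothetical polarity dicts, shaped like the output of sia.polarity_scores
examples = [
    {'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.64},
    {'neg': 0.3, 'neu': 0.7, 'pos': 0.0, 'compound': -0.48},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]

labels = [label_from_compound(scores) for scores in examples]
print(labels)
```

In the notebook this could be applied with `image_captions['nltk_sent_polarity'].apply(label_from_compound)`. Whether it yields a more useful class balance on this data would need to be checked.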

## Huggingface's pretrained classifier

In [None]:

from transformers import pipeline


In [None]:
hf_sent = pipeline('sentiment-analysis')


In [None]:
#example on random captions

random_sample = image_captions.sample(n=50, random_state=0)
random_sample

In [None]:
random_sample['hf_sent'] = random_sample['caption'].apply(lambda x: hf_sent(x))
random_sample['hf_sent']
#looks a bit more divided!

In [None]:
image_captions['hf_sent'] = image_captions['caption'].apply(lambda x: hf_sent(x))

In [None]:
#TODO: any captions with more than one class?
image_captions['num_sent'] = image_captions['hf_sent'].apply(lambda x: len(x))


In [None]:
image_captions['hf_sent_class'] = image_captions['hf_sent'].apply(lambda x: x[0]['label'])
image_captions['hf_sent_conf'] = image_captions['hf_sent'].apply(lambda x: x[0]['score'])

In [None]:
image_captions['hf_sent_class'].describe()

In [None]:
image_captions['hf_sent_class'].value_counts()

In [None]:
image_captions['hf_sent_class'].value_counts().plot(kind = 'bar')

In [None]:
image_captions['hf_sent_conf'].describe() 

Looks a bit more balanced using huggingface!

### Huggingface caption confidence thresholds

In [None]:
image_captions['hf_sent_conf'].plot.hist()

In [None]:
import numpy as np
# Deciles 0.1, 0.2, ..., 0.9 (endpoint=False excludes 1.0)
q = np.linspace(.1, 1, 9, endpoint=False)
image_captions['hf_sent_conf'].quantile(q)

As we can see, only 10% of the captions have a confidence under 80%, so we are dealing with high-confidence labels here!
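
High confidence does not rule out forced choices: assuming the default `sentiment-analysis` model only outputs `POSITIVE` and `NEGATIVE`, a neutral caption still has to land in one of the two classes. One way to use the confidence scores is to keep a label only when its score clears a cutoff. The sketch below is an illustration under that assumption; `label_with_cutoff`, the `'UNCERTAIN'` fallback label, and the example predictions are all hypothetical, with the predictions shaped like the per-caption pipeline output (a list holding one `{'label', 'score'}` dictionary).

```python
def label_with_cutoff(prediction, cutoff=0.8, fallback='UNCERTAIN'):
    """Return the predicted label if its score is at least cutoff, else the fallback."""
    top = prediction[0]  # one dict per input string
    return top['label'] if top['score'] >= cutoff else fallback


# Hypothetical predictions, shaped like the values stored in the hf_sent column
predictions = [
    [{'label': 'POSITIVE', 'score': 0.998}],
    [{'label': 'NEGATIVE', 'score': 0.62}],
    [{'label': 'NEGATIVE', 'score': 0.91}],
]

filtered = [label_with_cutoff(p) for p in predictions]
print(filtered)
```

Applied to the notebook's data, this would be `image_captions['hf_sent'].apply(label_with_cutoff)`, with the cutoff chosen from the quantiles above.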
