# Feature extraction

## Image vectorisation
Images are vectorised using the penultimate layer of the Keras Xception model <cite data-cite="chollet2017">(Chollet, 2017)</cite> pre-trained on ImageNet <cite data-cite="deng2009">(Deng et al., 2009)</cite>.
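
The `extract_all_images` helper used below is not shown in this notebook, so the following is only a minimal sketch of how such an extraction could look with TensorFlow's bundled Keras. The function names `load_xception_extractor` and `embed_image` are hypothetical. The sketch assumes that dropping the classification head and applying global average pooling corresponds to the penultimate layer, and that images are resized to Xception's default 299x299 input.

```python
def load_xception_extractor():
    """Build Xception without its classification head (hypothetical helper)."""
    # Lazy import: requires the tensorflow package
    from tensorflow.keras.applications.xception import Xception

    # include_top=False drops the final Dense layer; pooling="avg" keeps the
    # global average pooling that precedes it, giving one vector per image.
    return Xception(weights="imagenet", include_top=False, pooling="avg")


def embed_image(model, path):
    """Return the pooled feature vector for a single image file."""
    from tensorflow.keras.applications.xception import preprocess_input
    from tensorflow.keras.preprocessing import image

    img = image.load_img(path, target_size=(299, 299))
    batch = image.img_to_array(img)[None, ...]  # add a batch dimension
    return model.predict(preprocess_input(batch))[0]  # shape (2048,)
```

Building the model once and reusing it for every file avoids reloading the weights for each image.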


In [2]:
from src.preprocessing.preprocess_dataset import extract_all_images, extract_all_sentences
from src.utils.files import save_as_pickle, load_pickle_file
import pandas as pd

In [7]:
def remove_unknown_images(csv_path, embed):
    """Drop embeddings whose image name is not listed in the CSV at csv_path."""
    print("{} embeddings".format(len(embed)))
    df = pd.read_csv(csv_path, usecols=["image_name"])
    # Embedding keys that have no matching row in the cleaned dataset
    to_remove = list(set(embed) - set(df["image_name"]))
    for key in to_remove:
        embed.pop(key, None)
    if to_remove:
        print("The following embeddings were removed {}".format(to_remove))
        print("{} embeddings remaining".format(len(embed)))
    else:
        print("Nothing to remove")
    return embed
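
To illustrate the filtering step without touching the dataset, here is a small standalone sketch of the same set-difference idea on toy data. Note that, unlike `remove_unknown_images`, this hypothetical `filter_embeddings` variant does not read a CSV and does not modify its input; the file names and vectors are invented for the example.

```python
def filter_embeddings(embed, known_names):
    """Return (kept, removed): embeddings whose key is known, and the dropped keys."""
    known = set(known_names)
    kept = {name: vec for name, vec in embed.items() if name in known}
    removed = sorted(set(embed) - known)
    return kept, removed


# Toy data: hypothetical file names and 2-d vectors, not taken from the dataset
embed = {"a.jpg": [0.1, 0.2], "b.jpg": [0.3, 0.4], "orphan.jpg": [0.5, 0.6]}
kept, removed = filter_embeddings(embed, ["a.jpg", "b.jpg", "c.jpg"])
print(len(kept), removed)  # 2 ['orphan.jpg']
```

Names that appear only in the list of known images (here `c.jpg`) are ignored: the filter only removes embeddings, it never reports missing ones.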

In [5]:
img_embeddings_train = extract_all_images("xception", "data/train_images/")
img_embeddings_train = remove_unknown_images("data/train_cleaned_final.csv", img_embeddings_train)
save_as_pickle(img_embeddings_train, "data/features/xception.pkl.train")

  "Palette images with Transparency expressed in bytes should be "


7000
The following embeddings were removed ['chandler_Friday-Mood-AF.-meme-Friends-ChandlerBing.jpg']
6999


In [3]:
img_embeddings_train = load_pickle_file("data/features/xception.pkl.train")

In [6]:
img_embeddings_dev = extract_all_images("xception", "data/dev_images/")
img_embeddings_dev = remove_unknown_images("data/dev_cleaned_final.csv", img_embeddings_dev)
save_as_pickle(img_embeddings_dev, "data/features/xception.pkl.dev")

1000
Nothing to remove
1000


In [4]:
img_embeddings_dev = load_pickle_file("data/features/xception.pkl.dev")

In [8]:
img_embeddings_test = extract_all_images("xception", "data/test_images/", "data/features/xception.pkl.test")
img_embeddings_test = remove_unknown_images("data/test_cleaned_final.csv", img_embeddings_test)
save_as_pickle(img_embeddings_test, "data/features/xception.pkl.test")

2000 embeddings
The following embeddings were removed ['bethe_1231_012216115256.jpg', 'misog_99all-white-people-are-racist-all-men-are-misogynistic-and-all-cisgender-people-ar.jpg', 'obama_50_92427733_9096be2d-c723-4452-98a2-e965a5c0949d.jpg', 'decaprio_leonardo-dicaprio-at-age-19-and-age-39-25222232.png', 'sexist_so3hqchgzxylwrq4cgne.jpg', 'racis_110ynv296w1521.jpg', 'tech_no-technology-what.jpg', 'x_men_avengers-x-men-memes.jpg', 'bean_rj5dlqakfbh21.jpg', 'gene_163remembering-gene-gilda-today-would-have-been-the-great-24175552.png', 'harvey_this-is-why-I-drink.jpg', 'obama_267obama-islam-memes.jpg', 'avengers_sub-buzz-24672-1525675842-1.png', 'tech_Gerd-Leonhard-Bodies-are-no-longer-central-to-our-identity.jpg', 'hillary_573c620b8bc46.jpeg', 'feminist_51117048_622764668173027_784856690065111654_n.jpg', 'best_2018_sub-buzz-6084-1545177168-3.png', 'trump_screen-shot-2015-07-14-at-9-42-49-am.jpg', 'pepe_girls-he-has-to-be-6ft-tall-stable-job-drive-37896153.png', 'cat_U_372eXRC3as.png', 

In [5]:
img_embeddings_test = load_pickle_file("data/features/xception.pkl.test")

## Sentence vectorisation
The text of each meme is vectorised using the pretrained Universal Sentence Encoder <cite data-cite="cer2018">(Cer et al., 2018)</cite>. The authors neither specify nor open-source the training dataset.
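
The `extract_all_sentences` helper is not shown here, so the following is only a rough sketch of how sentences could be encoded with TensorFlow Hub. The names `load_use_encoder` and `embed_sentences` are hypothetical, and the default handle is an assumption: it points at the public version 4 model, whereas the outputs below suggest this project loads a local copy (`hub.load` also accepts a local SavedModel directory).

```python
def load_use_encoder(handle="https://tfhub.dev/google/universal-sentence-encoder/4"):
    """Load the encoder from a TF Hub handle or a local path (hypothetical helper)."""
    # Lazy import: requires the tensorflow and tensorflow_hub packages
    import tensorflow_hub as hub

    return hub.load(handle)


def embed_sentences(encoder, sentences, batch_size=64):
    """Encode sentences in batches and return one vector per sentence, in order."""
    vectors = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # The encoder maps a list of strings to one row per string
        vectors.extend(encoder(batch))
    return vectors
```

Batching keeps memory use bounded on large CSV files; any callable that maps a list of strings to a sequence of vectors can stand in for the encoder when testing.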

In [10]:
sent_embeddings_train = extract_all_sentences("data/train_cleaned_final.csv", "data/features/use.pkl.train")
print(len(sent_embeddings_train))

/home/lb732/Projects/memotion_analysis/data/models/use
6999


In [11]:
sent_embeddings_dev = extract_all_sentences("data/dev_cleaned_final.csv", "data/features/use.pkl.dev")
print(len(sent_embeddings_dev))

/home/lb732/Projects/memotion_analysis/data/models/use
1000


In [6]:
sent_embeddings_test = extract_all_sentences("data/test_cleaned_final.csv", "data/features/use.pkl.test")
print(len(sent_embeddings_test))

/home/lb732/Projects/memotion_analysis/data/models/use
1878
