# Wikipedia - Image/Caption Matching
## Retrieve captions based on images


<br>

### Description

A picture is worth a thousand words, yet sometimes a few will do. We all rely on online images for knowledge sharing, learning, and understanding. Even the largest websites are missing visual content and metadata to pair with their images. Captions and “alt text” increase accessibility and enable better search. The majority of images on Wikipedia articles, for example, don't have any written context connected to the image. Open models could help anyone improve accessibility and learning for all.

Current solutions rely on simple methods based on translations or page interlinks, which have limited coverage. Even the most advanced computer vision image captioning isn't suitable for images with complex semantics.


### Data

The objective of this competition is to predict the target `caption_title_and_reference_description` given information about an image. The targets for this competition are in multiple languages.

I'll focus on the English language only.

#### Files
- `train-{0000x}-of-00005.tsv` - the training data (tab delimited)
- `test.tsv` - the test data; the objective is to predict the target **caption_title_and_reference_description** for each row id
- `sample_submission.csv` - a sample submission file in the correct format; note that multiple predictions (up to 5) are allowed for each id in the test data.
- image_data_test/
    - image_pixels/`test_image_pixels_part-{0000x}.csv.gz`
        - image_url: url to the original image file, e.g. https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg
        - b64_bytes: base64 encoded bytes of the image file at a 300px resolution
        - metadata_url: url to the commons page of the image, e.g. https://commons.wikimedia.org/wiki/File:Hovden.jpg
    - resnet_embeddings/`test_resnet_embeddings_part-{0000x}.csv.gz`
        - image_url: url to the original image file, e.g. https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg
        - embedding: a comma separated list of 2048 float values
- image_data_train - Due to the size of the training image data (~275 Gb), it is hosted separately and can be found here. Note that not all of the training observations have corresponding image data.
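
As a small illustration of the `embedding` column format described above, the sketch below parses one comma-separated string into a list of floats and compares two vectors with cosine similarity, which is one common way to compare embeddings. The helper names (`parse_embedding`, `cosine_similarity`) and the toy 4-value strings are made up for illustration; real rows hold 2048 values.

```python
import math

def parse_embedding(text):
    """Parse a comma-separated string of floats into a list of floats."""
    return [float(x) for x in text.split(',')]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy strings standing in for the 2048-value 'embedding' field
e1 = parse_embedding("0.5,0.0,1.5,2.0")
e2 = parse_embedding("0.5,0.0,1.5,2.0")
e3 = parse_embedding("0.0,3.0,0.0,0.0")

print(cosine_similarity(e1, e2))  # identical vectors
print(cosine_similarity(e1, e3))  # orthogonal vectors
```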

<code> kaggle competitions download -c wikipedia-image-caption </code>

### Submission

Submissions will be evaluated using NDCG@5 ([Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)).

The submission should be a list of id,caption_title_and_reference_description pairs ranked from top to bottom according to their relevance (i.e., the top id is the most relevant caption_title_and_reference_description), with up to 5 predictions per id. Each line should be a single id,caption_title_and_reference_description pair.
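
To make the metric concrete, here is a minimal sketch of NDCG@5 for a single image. It assumes binary relevance with exactly one correct caption per id, which is my reading of the task rather than something the description above spells out. Under that assumption the ideal DCG is 1, so the score reduces to 1 / log2(rank + 1) when the true caption sits at position `rank` within the top 5, and 0 otherwise. The function name `ndcg_at_5` is illustrative.

```python
import math

def ndcg_at_5(ranked_captions, true_caption):
    """NDCG@5 for one id, assuming a single relevant caption (binary relevance)."""
    for rank, caption in enumerate(ranked_captions[:5], start=1):
        if caption == true_caption:
            # Ideal DCG is 1 / log2(2) = 1, so DCG equals NDCG here
            return 1.0 / math.log2(rank + 1)
    return 0.0

print(ndcg_at_5(['a', 'b', 'c'], 'a'))  # correct caption ranked first
print(ndcg_at_5(['b', 'a', 'c'], 'a'))  # correct caption ranked second
print(ndcg_at_5(['b', 'c', 'd'], 'a'))  # correct caption missing
```

The competition score would then be the mean of this value over all ids, again under the single-relevant-caption assumption.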


### Models that can be used: 
[Kaggle notebook with examples of these models](https://www.kaggle.com/shivamb/cnn-architectures-vgg-resnet-inception-tl)

**VGG16**
- VGG16 was published in 2014 and is one of the simplest of the CNN architectures used in the ImageNet competition. Its key characteristics are:
    - The network contains 16 layers in which weights and bias parameters are learnt.
    - 13 convolutional layers are stacked one after the other, followed by 3 dense layers for classification.
    - The number of filters in the convolutional layers follows an increasing pattern (similar to the decoder architecture of an autoencoder).
    - The informative features are obtained by max pooling layers applied at different steps in the architecture.
    - The dense layers comprise 4096, 4096, and 1000 nodes respectively.
    - The cons of this architecture are that it is slow to train and produces a very large model.
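
To put a number on "very large", the sketch below counts the trainable parameters of the 13 convolutional and 3 dense layers listed above. The per-layer filter counts, the 3x3 kernels, and the 224x224x3 input (which leaves a 7x7x512 feature map after five pooling stages) are the standard VGG16 configuration and are assumptions brought in here; they are not stated in the list itself.

```python
# Standard VGG16 configuration (assumed): 3x3 convolutions, 224x224x3 input,
# five 2x2 max-pooling stages, so the final feature map is 7x7x512.
CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
DENSE_UNITS = [4096, 4096, 1000]

def count_vgg16_params():
    """Return (conv_params, dense_params, total_params), weights plus biases."""
    conv_params = 0
    in_channels = 3
    for out_channels in CONV_CHANNELS:
        conv_params += 3 * 3 * in_channels * out_channels + out_channels
        in_channels = out_channels

    dense_params = 0
    in_features = 7 * 7 * 512
    for units in DENSE_UNITS:
        dense_params += in_features * units + units
        in_features = units

    return conv_params, dense_params, conv_params + dense_params

conv_params, dense_params, total_params = count_vgg16_params()
print(f"conv:  {conv_params:,}")
print(f"dense: {dense_params:,}")
print(f"total: {total_params:,}")
print(f"approx. size at 4 bytes per weight: {total_params * 4 / 1e6:.0f} MB")
```

In this count the three dense layers hold the large majority of the parameters, which is where most of the model size comes from.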

**ResNet 50**
- ResNet-50 is a convolutional neural network that is 50 layers deep. You can load a pretrained version of the network trained on more than a million images from the ImageNet database [1]. The pretrained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224.


### Prizes

The top three winning teams will receive Wikipedia-branded merchandise.

In [26]:
import os
import requests

# General packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px
import PIL.Image
import base64 
import io

from IPython.display import Image, display
import warnings
warnings.filterwarnings("ignore")

In [5]:
os.listdir('Data')

['.DS_Store',
 'test_image_pixels_part-00000.csv',
 'test.tsv',
 'test_caption_list.csv',
 'zips',
 'test_resnet_embeddings_part-00000.csv',
 'train-00000-of-00005.tsv',
 'sample_submission.csv']

### Submission Data

Let's start with sample submission. We have an id column and caption_title_and_reference_description column, and for each id we predict 5 captions. We need to select them from a predefined set of captions we'll see next.

In [11]:
sub = pd.read_csv('Data/sample_submission.csv')

In [21]:
sub.shape


(461830, 2)

In [23]:
sub.head(10)


Unnamed: 0,id,caption_title_and_reference_description
0,0,diam in ve ligula a sapien mattis a a
1,0,nisi ut eu
2,0,urna ad a duis a ligula dis ve
3,0,eros eu a vel congue ac et justo sapien
4,0,diam mi a mi nec a leo sed ornare magna mus se...
5,1,eros ad a magnis elit ad a in a
6,1,amet ut a a
7,1,pede ad
8,1,nisi eu ut est sapien felis id felis
9,1,nisl ac a eu a primis integer nih a


### Test Data

There are two test files:
- `test.tsv` - the test data; the objective is to predict the target `caption_title_and_reference_description` for each row id
- `test_caption_list.csv` - a list of captions to retrieve for test predictions

In [77]:
test      = pd.read_csv('Data/test.tsv', sep='\t')
test_cap  = pd.read_csv('Data/test_caption_list.csv')

In [79]:
print('test_file', test.shape)
print('test_cap', test_cap.shape)

test_file (92366, 2)
test_cap (92366, 1)


### Image and Train Data

In [44]:
test0 = pd.read_csv('Data/test_image_pixels_part-00000.csv', sep='\t', names=['image_url', 'b64_bytes', 'metadata_url'])

In [45]:
test0.shape

(8952, 3)

In [46]:
test0.head(2)

Unnamed: 0,image_url,b64_bytes,metadata_url
0,http://upload.wikimedia.org/wikipedia/commons/...,/9j/4AAQSkZJRgABAQEASABIAAD//gBbRmlsZSBzb3VyY2...,http://commons.wikimedia.org/wiki/File:Dodge_C...
1,https://upload.wikimedia.org/wikipedia/commons...,/9j/4QB+RXhpZgAATU0AKgAAAAgABgEOAAIAAAAKAAAAVg...,http://commons.wikimedia.org/wiki/File:Granada...


In [43]:
# Decode the first five base64-encoded images and open each one in the default viewer
for i in range(5):
    image_bytes = base64.b64decode(test0['b64_bytes'].loc[i])
    img = PIL.Image.open(io.BytesIO(image_bytes))
    img.show()

In [48]:
import datatable as dt
train0 = dt.fread('Data/train-00000-of-00005.tsv')
train0.head()

Unnamed: 0_level_0,language,page_url,image_url,page_title,section_title,hierarchical_section_title,caption_reference_description,caption_attribution_description,caption_alt_text_description,mime_type,…,attribution_passes_lang_id,page_changed_recently,context_page_description,context_section_description,caption_title_and_reference_description
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,Unnamed: 11_level_1,▪,▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪
0,fr,https://fr.wikipedia.org/wiki/Pariser_Kanonen,https://upload.wikimedia.org/wikipedia/commons/8/8…,Pariser Kanonen,Bilan,Pariser Kanonen / Bilan,,Français&#160;: Plaque apposée au n° 81 du bouleva…,,image/jpeg,…,1,1,Les Pariser Kanonen sont sept pièces d’artillerie …,Le ou les canons restants sont démontés devant la …,Pariser Kanonen [SEP]
1,tg,https://tg.wikipedia.org/wiki/Republic_P-43_Lancer,http://upload.wikimedia.org/wikipedia/commons/c/c4…,Republic P-43 Lancer,,Republic P-43 Lancer,,"A Republic P-43 Lancer in flight over Esler Field,…",,image/jpeg,…,0,0,Republic P-43 Lancer — як ҳавогарди сохтаи Republi…,Republic P-43 Lancer (англ. Republic P-43 Lancer) …,Republic P-43 Lancer [SEP]
2,en,"https://en.wikipedia.org/wiki/Deer_Park,_Wisconsin",https://upload.wikimedia.org/wikipedia/commons/2/2…,"Deer Park, Wisconsin",,"Deer Park, Wisconsin",Downtown Deer Park,"English: Downtown Deer Park, Wisconsin on WIS46.",Downtown Deer Park,image/jpeg,…,1,1,"Deer Park is a village in St. Croix County, Wiscon…","Deer Park is a village in St. Croix County, Wiscon…","Deer Park, Wisconsin [SEP] Downtown Deer Park"
3,pt,https://pt.wikipedia.org/wiki/Vaux-l%C3%A8s-Mouzon,https://upload.wikimedia.org/wikipedia/commons/6/6…,Vaux-lès-Mouzon,,Vaux-lès-Mouzon,,Français&#160;: Vaux-lès-Mouzon&#160;: le village,,image/jpeg,…,0,0,Vaux-lès-Mouzon é uma comuna francesa na região ad…,Vaux-lès-Mouzon é uma comuna francesa na região ad…,Vaux-lès-Mouzon [SEP]
4,ne,https://ne.wikipedia.org/wiki/%E0%A4%95%E0%A5%8D%E…,https://upload.wikimedia.org/wikipedia/commons/b/b…,क्रिसमस टापु,,क्रिसमस टापु,,This is a locator map for Christmas Island I made.,,image/png,…,0,0,क्रिसमस टापु हिन्द महासागरमा अवस्थित अष्ट्रेलियाको…,क्रिसमस टापु हिन्द महासागरमा अवस्थित अष्ट्रेलियाको…,क्रिसमस टापु [SEP]
5,en,https://en.wikipedia.org/wiki/J%C3%BCrgen_Ovens,https://upload.wikimedia.org/wikipedia/commons/0/0…,Jürgen Ovens,Works,Jürgen Ovens / Works,"Jürgen Ovens's Justitia, 1663-1665, Museumsberg Fl…",,,image/jpeg,…,0,0,"Jürgen Ovens, also known as Georg, or Jurriaen Ove…",Ovens was appreciated for his portraits and painte…,"Jürgen Ovens [SEP] Jürgen Ovens's Justitia, 1663-1…"
6,hu,https://hu.wikipedia.org/wiki/Jablonn%C3%A1,https://upload.wikimedia.org/wikipedia/commons/3/3…,Jablonná,,Jablonná,,Čeština: Kříž v ohrádce u komunikace z Kamýku na P…,,image/jpeg,…,0,0,"Jablonná település Csehországban, a Příbrami járás…","Jablonná település Csehországban, a Příbrami járás…",Jablonná [SEP]
7,cs,https://cs.wikipedia.org/wiki/V%C3%A1clav_Vladivoj…,http://upload.wikimedia.org/wikipedia/commons/7/73…,Václav Vladivoj Tomek,"Rodina, studium a profesní dráha","Václav Vladivoj Tomek / Biografie / Rodina, studiu…",Václav Vladivoj Tomek,Čeština: Václav Vladivoj Tomek English: Václav Vla…,,image/jpeg,…,1,0,"Václav Vladivoj rytíř Tomek, starším způsobem zápi…",Narodil se v rodině obuvnického mistra. Na obecní …,Václav Vladivoj Tomek [SEP] Václav Vladivoj Tomek
8,ml,https://ml.wikipedia.org/wiki/%E0%B4%AE%E0%B4%BE%E…,https://upload.wikimedia.org/wikipedia/commons/a/a…,മാർപ്പാപ്പാമാരുടെ പട്ടിക,മൂന്നാം നൂറ്റാണ്ട്,മാർപ്പാപ്പാമാരുടെ പട്ടിക / കാലക്രമമനുസരിച്ചുള്ള പട…,,English: Illustration of Pope Anterus,,image/jpeg,…,0,1,റോമൻ കത്തോലിക്കാ സഭയുടെ പരമാചാര്യനുംപരമ മേലദ്ധ്യക്…,,മാർപ്പാപ്പാമാരുടെ പട്ടിക [SEP]
9,ko,https://ko.wikipedia.org/wiki/%EC%B9%B4%EC%9D%B8%E…,http://upload.wikimedia.org/wikipedia/commons/6/60…,카인주,,카인주,,English: Kayin state in Myanmar,,image/png,…,0,0,카인주 또는 카렌주는 미얀마의 주이다. 주도는 파안이다.,"카인주(버마어: ကရင်ပြည်နယ်, Kayin State) 또는 카렌주(Karen St…",카인주 [SEP]


In [55]:
train0_lan = train0['language'].to_pandas()
train0_lan.value_counts(normalize=True)


language
en          0.146194
de          0.090268
fr          0.068888
es          0.046811
ru          0.041328
              ...   
ia          0.000440
ckb         0.000410
si          0.000397
xmf         0.000397
cv          0.000388
Length: 108, dtype: float64

In [57]:
0.14 * 7000000 * 5

4900000.000000001

### "Dirty" prediction: 
Using only the URL to create a caption, since the URL already includes information about the image. 

In [80]:
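from urllib.parse import unquote

def convert(url):
    """Turn a URL into a caption-like string ending in ' [SEP]'."""
    name = url.rsplit('/', 1)[1]   # keep only the last path segment
    name = unquote(name)           # decode percent-escapes such as %C3%A8
    name = name.replace('_', ' ')  # underscores stand in for spaces
    return name + ' [SEP]'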

In [81]:
# Column 1 of train0 is page_url; the last column is the target caption
for i in range(5):
    print(f'target: {train0[i,-1]}')
    print(f'prediction: {convert(train0[i,1])}')
    print()

target: Pariser Kanonen [SEP]
prediction: Pariser Kanonen [SEP]

target: Republic P-43 Lancer [SEP]
prediction: Republic P-43 Lancer [SEP]

target: Deer Park, Wisconsin [SEP] Downtown Deer Park
prediction: Deer Park, Wisconsin [SEP]

target: Vaux-lès-Mouzon [SEP]
prediction: Vaux-lès-Mouzon [SEP]

target: क्रिसमस टापु [SEP]
prediction: क्रिसमस टापु [SEP]



In [82]:
test['prediction'] = test['image_url'].apply(convert)


In [83]:
test.head()


Unnamed: 0,id,image_url,prediction
0,0,https://upload.wikimedia.org/wikipedia/commons...,Scots Gaelic speakers in the 2011 census.png [...
1,1,https://upload.wikimedia.org/wikipedia/commons...,Thermopylae ancient coastline large.jpg [SEP]
2,2,https://upload.wikimedia.org/wikipedia/commons...,Map of New York highlighting Sullivan County.s...
3,3,https://upload.wikimedia.org/wikipedia/commons...,Kykkos Watermill.jpg [SEP]
4,4,https://upload.wikimedia.org/wikipedia/commons...,Kansai International Airport03s3s4410.jpg [SEP]
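
The test predictions above still carry the file extension (`.jpg`, `.png`, `.svg`), which the captions themselves will not contain. One possible refinement, sketched here under the assumption that every image file name ends in a real extension, is to strip it with `os.path.splitext`. The function name `convert_image_url` is hypothetical and this is not the variant used for the submission in this notebook.

```python
import os
from urllib.parse import unquote

def convert_image_url(url):
    """Hypothetical variant of convert() that also drops the file extension."""
    name = url.rsplit('/', 1)[1]         # keep only the file name
    name = unquote(name)                 # decode percent-escapes
    name, _ext = os.path.splitext(name)  # drop '.jpg', '.png', '.svg', ...
    name = name.replace('_', ' ')
    return name + ' [SEP]'

print(convert_image_url('https://upload.wikimedia.org/wikipedia/commons/e/ec/Hovden.jpg'))
```

Note that `os.path.splitext` cuts at the last dot, so a name that contains a dot but no real extension would be truncated.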


In [64]:
# test_cap is the caption list loaded from Data/test_caption_list.csv above
CAPTIONS = test_cap.caption_title_and_reference_description.values.tolist()
len(CAPTIONS)

92366

In [65]:
from rapidfuzz import process, fuzz


In [84]:
%%time

for i in range(5):
    s = test.prediction.loc[i]
    print(f'image_url: {s}')
    res = process.extract(s, CAPTIONS, scorer=fuzz.ratio, processor=None, limit=5)
    print(f'closest captions:')
    for c in res:
        print(c[0])
    print('*'*60)
    print()  

image_url: Scots Gaelic speakers in the 2011 census.png [SEP]
closest captions:
São Vicente do Penso [SEP] Escudo
Jocs Panamericans de 2011 [SEP] Guadalajara.
Bois de pins et chênes de Madrean [SEP] carte
Standard time in the United States [SEP] 1913
Hebrides [SEP] Geographic distribution of Gaelic speakers in Scotland (2011)
************************************************************

image_url: Thermopylae ancient coastline large.jpg [SEP]
closest captions:
Oberlin College [SEP] Logo
Departamento de Flores [SEP] Escudo
Sekolah Menengah Kebangsaan Selandar [SEP] CNY 4
Academia de Ciencias de Cuba [SEP] Sede.
Geothermal areas in New Zealand [SEP] Geyser Flat
************************************************************

image_url: Map of New York highlighting Sullivan County.svg [SEP]
closest captions:
Breaking Clean [SEP] Map of Montana highlighting Phillips County
New York City Subway [SEP] Ein Token
Deogyang-gu [SEP] Map of Gyeonggi highlighting Deogyang-gu.
Rome, New York [SEP] Loc

In [68]:
def find_closest_match(s):
    res = process.extract(s, CAPTIONS, scorer=fuzz.ratio, processor=None, limit=5)
    res = [x[0] for x in res]
    return res
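
To see what the scorer is measuring: as far as I understand the rapidfuzz documentation, `fuzz.ratio` is a normalized Indel similarity, i.e. 200 * LCS(a, b) / (len(a) + len(b)), where LCS is the length of the longest common subsequence. The pure-Python sketch below reproduces that formula and a top-k lookup. It is an illustration only, not rapidfuzz's implementation, and it is far too slow for 92,366 queries against 92,366 captions. The names `lcs_length`, `indel_ratio`, and `top_k` are made up here.

```python
import heapq

def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_ratio(a, b):
    """Similarity in [0, 100]: 200 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 200.0 * lcs_length(a, b) / total

def top_k(query, choices, k=5):
    """Return the k choices most similar to the query, best first."""
    return heapq.nlargest(k, choices, key=lambda c: indel_ratio(query, c))

# Small candidate list taken from captions shown in this notebook
candidates = [
    'Oberlin College [SEP] Logo',
    'Kykkos watermill [SEP] Kykkos Watermill',
    'Rome, New York [SEP] Loc',
]
print(top_k('Kykkos Watermill.jpg [SEP]', candidates, k=1))
```

Because the score is normalized by the combined length, a query padded with an extension and ` [SEP]` can score similarly against many short, unrelated captions, which may explain some of the odd matches above.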

In [69]:
from tqdm.auto import tqdm
tqdm.pandas()

In [85]:
test['caption_title_and_reference_description'] = test['prediction'].progress_apply(find_closest_match)

  0%|          | 0/92366 [00:00<?, ?it/s]

In [86]:
test

Unnamed: 0,id,image_url,prediction,caption_title_and_reference_description
0,0,https://upload.wikimedia.org/wikipedia/commons...,Scots Gaelic speakers in the 2011 census.png [...,"[São Vicente do Penso [SEP] Escudo, Jocs Panam..."
1,1,https://upload.wikimedia.org/wikipedia/commons...,Thermopylae ancient coastline large.jpg [SEP],"[Oberlin College [SEP] Logo, Departamento de F..."
2,2,https://upload.wikimedia.org/wikipedia/commons...,Map of New York highlighting Sullivan County.s...,[Breaking Clean [SEP] Map of Montana highlight...
3,3,https://upload.wikimedia.org/wikipedia/commons...,Kykkos Watermill.jpg [SEP],"[Kykkos watermill [SEP] Kykkos Watermill, Kays..."
4,4,https://upload.wikimedia.org/wikipedia/commons...,Kansai International Airport03s3s4410.jpg [SEP],[Hong Kong International Airport [SEP] Skypier...
...,...,...,...,...
92361,92361,https://upload.wikimedia.org/wikipedia/commons...,Alsterquelle 001.jpg [SEP],"[Elsterheide [SEP] Nardt, Elsterheide vald [SE..."
92362,92362,https://upload.wikimedia.org/wikipedia/commons...,Stavoren 1649 Blaeu.jpg [SEP],"[Stavoren [SEP] Mapo el 1649, Nora Stanton Bar..."
92363,92363,https://upload.wikimedia.org/wikipedia/commons...,Karlovy Vary smírčí kříž Na Milíři 6.jpg [SEP],"[Karolina Lanckorońska [SEP] 1945, Klostret Ga..."
92364,92364,https://upload.wikimedia.org/wikipedia/commons...,Winchester Repeating Arms Company advertisemen...,[Winchester Repeating Arms Company [SEP] Magaz...



In [87]:
sub = test[['id', 'caption_title_and_reference_description']]
sub = sub.explode('caption_title_and_reference_description')
sub.head()


Unnamed: 0,id,caption_title_and_reference_description
0,0,São Vicente do Penso [SEP] Escudo
0,0,Jocs Panamericans de 2011 [SEP] Guadalajara.
0,0,Bois de pins et chênes de Madrean [SEP] carte
0,0,Standard time in the United States [SEP] 1913
0,0,Hebrides [SEP] Geographic distribution of Gael...


In [88]:
sub.tail()


Unnamed: 0,id,caption_title_and_reference_description
92365,92365,Provincia de San José [SEP] Escudo
92365,92365,Marília [SEP] Marília (década de 1960).
92365,92365,Marília [SEP] Marília (década de 1960).
92365,92365,Angliai csata [SEP] Bf 109
92365,92365,Eleccións xerais de España de 1931 [SEP] Votac...
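
The tail above lists the same caption twice for id 92365, presumably because the caption list itself contains duplicate strings. I have not verified how the scorer treats repeated predictions, so the following is only a sketch of one way to drop them while keeping the rank order, assuming each row's predictions are held in a Python list (as they are before `explode`). The function name `dedupe_keep_order` is hypothetical.

```python
def dedupe_keep_order(captions, limit=5):
    """Remove repeated captions, keep the original ranking, return at most `limit`."""
    seen = set()
    unique = []
    for caption in captions:
        if caption not in seen:
            seen.add(caption)
            unique.append(caption)
        if len(unique) == limit:
            break
    return unique

ranked = [
    'Provincia de San José [SEP] Escudo',
    'Marília [SEP] Marília (década de 1960).',
    'Marília [SEP] Marília (década de 1960).',
    'Angliai csata [SEP] Bf 109',
]
print(dedupe_keep_order(ranked))
```

To still end up with five distinct captions per id, one would need to request more than five candidates from the matcher before deduplicating.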


In [76]:
sub.shape

(44760, 2)

In [89]:
sub.to_csv('submission.csv', index=False)


We've now been able to score above 0.0000 on the leaderboard. To be honest, I'm not sure if using page_url is expected by the host so I asked that question on the forum. For sure, the more challenging and interesting aspect is matching the captions directly with images, and we'll try to tackle that next :)