## Tutorial 29: Wikipedia image data

This tutorial introduces the `wikiimage.py` module, which we can use to grab
and process image data from Wikipedia pages. Start by loading the module,
as well as numpy and matplotlib (for plotting the images).

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import wiki
import wikiimage
import wikitext

In [None]:
plt.rcParams["figure.figsize"] = (12, 16)

### Reading image data from Wikipedia

The `image_data_frame` function takes a list of Wikipedia pages and returns a data frame
object listing all of the images from those pages. You can also supply the minimum and
maximum allowed sizes of images. By default the function will download a local copy of any
images you do not yet have locally.

In [None]:
df = wikiimage.image_data_frame(['Paris', 'London'], min_size=300)
df

Note that the returned results include the page name, the path of the image, as well as
a column called "max_size". The latter column gives the size of the largest dimension of
the image (either the height or width).
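
As a rough illustration of what a size filter on this column could look like, here is a
minimal sketch. It assumes that `min_size` and `max_size` are both compared against the
largest dimension, which the module may or may not do; the helper names
`largest_dimension` and `keep_image` are hypothetical and are not part of `wikiimage`.

```python
def largest_dimension(width, height):
    """Size of the largest side of an image, in pixels."""
    return max(width, height)


def keep_image(width, height, min_size=0, max_size=float("inf")):
    """Assumed rule: keep an image if its largest side lies in [min_size, max_size]."""
    size = largest_dimension(width, height)
    return min_size <= size <= max_size


# Made-up (width, height) pairs standing in for real image sizes.
sizes = [(640, 480), (200, 150), (1200, 900)]
kept = [s for s in sizes if keep_image(*s, min_size=224, max_size=750)]
print(kept)
```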

### Displaying the images in Python

The `load_image` function takes the name of an image and returns a `PIL` object,
a special image type that can be plotted in Python.

In [None]:
img = wikiimage.load_image(df.img.values[4])
type(img)

In [None]:
plt.imshow(img)

Here is some Python code that displays all of the images in the data frame. Note
that you may need to modify the line `plt.subplot(4, 3, ind + 1)` if you change
the data. The 4 gives the number of rows in the plot and the 3 gives the number
of columns. If you have more than 12 images, only the first 12 will be shown. You can
also adjust the `plt.rcParams["figure.figsize"] = (12, 16)` above to change the overall
size of the output (I find that I need to adjust this depending on my screen and
the images in question).

In [None]:
for ind in range(df.shape[0]):
    try:
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(4, 3, ind + 1)

        img = wikiimage.load_image(df.iloc[ind]['img'])
        plt.imshow(img)
        plt.axis("off")

    except Exception:
        # Skip images that fail to load, and any index beyond the 12 grid cells.
        pass
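
Since `plt.subplot` takes the number of rows first and the number of columns second, the
grid can be sized from the number of images rather than set by hand. The sketch below is
one way to do that; `grid_shape` is a hypothetical helper name, not part of `wikiimage`
or matplotlib.

```python
import math


def grid_shape(n_images, n_cols=3):
    """Return (n_rows, n_cols) so that every image gets its own grid cell."""
    n_rows = max(1, math.ceil(n_images / n_cols))
    return n_rows, n_cols


n_rows, n_cols = grid_shape(14)
print(n_rows, n_cols)
```

The result could then be passed along as `plt.subplot(n_rows, n_cols, ind + 1)`, with the
figure height scaled up to match the extra rows.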

### Image embedding

Last time we saw how the VGG19 model takes a 224-by-224 pixel image and
returns a list of 1000 probabilities giving predictions of what objects are
located in the image. Here's the model once again:

In [None]:
from keras.applications.vgg19 import VGG19
vgg19_full = VGG19(weights='imagenet')
vgg19_full.summary()

The VGG19 model as described here is really only useful if we care about the 1000
categories described in the ILSVRC competition. Why would this be important enough
to include in the **keras** module? In and of itself, it really is not. The reason
the model is so important is due to something called *transfer learning*.

It turns out that if we apply only a subset of the layers, say all but the final layer
of the model, the neural network serves as a form of dimensionality reduction. Look at
the model above; if we look at the output of the layer `fc2`, this serves to project a
`224 * 224 * 3`, or `150,528` dimensional object, into `4096` dimensional space. To
produce such an embedding, I'll use keras to strip off the final layer and keep the
output of the second-to-last layer:

In [None]:
from keras.models import Model

vgg_fc2 = Model(inputs=vgg19_full.input, outputs=vgg19_full.get_layer('fc2').output)
vgg_fc2.summary()

And we can apply the model just as we did before, but the output now contains 4096
dimensions. These dimensions, just like with PCA and t-SNE, do not have an explicit
meaning. The relationships between images in the embedding space, however, describe
semantic relationships, which we will be able to explore shortly.

In [None]:
from keras.preprocessing import image
from keras.applications.vgg19 import preprocess_input

img = wikiimage.load_image(df.img.values[1], target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

y = vgg_fc2.predict(x)
y.shape
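
To make the array shapes in the cell above concrete, here is a numpy-only sketch that does
not need keras or a real image. The preprocessing step is a hand-written stand-in for
`preprocess_input`, based on my understanding of its default behavior for the VGG models
(reorder RGB to BGR, then subtract per-channel ImageNet means); treat the exact constants
as an assumption and check the keras documentation before relying on them.

```python
import numpy as np

# Stand-in for an RGB image already resized to 224 x 224 (made-up pixel values).
img_array = np.zeros((224, 224, 3), dtype=np.float32)
img_array[..., 0] = 10.0  # red channel
img_array[..., 1] = 20.0  # green channel
img_array[..., 2] = 30.0  # blue channel

# Number of raw input dimensions: 224 * 224 * 3.
n_inputs = img_array.size

# The model expects a batch of images, so add a leading axis of length one.
batch = np.expand_dims(img_array, axis=0)

# Assumed VGG-style preprocessing: RGB -> BGR, then subtract the channel means.
imagenet_bgr_mean = np.array([103.939, 116.779, 123.68], dtype=np.float32)
prepped = batch[..., ::-1] - imagenet_bgr_mean

print(n_inputs, batch.shape, prepped.shape)
```

The embedding model then maps each `(224, 224, 3)` slice of the batch to one row of 4096
numbers, which is why `y.shape` has a leading dimension equal to the batch size.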

### Embeddings in wikiimage

The wikiimage module contains the function `vgg19_embed` that performs
embedding into the `fc2` layer. Conveniently, the embeddings are cached
so that you only need to construct them once (it can take a while to
create the embeddings).

In [None]:
df_fc2 = wikiimage.vgg19_embed(df.img.values)
df_fc2.shape

The output is a numpy array with one row for each image and 4096 columns. Again, we will see how
to use these in just a moment.
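
For intuition about the caching, here is a minimal sketch of one way such a cache could
work: save each embedding to disk the first time it is computed and read it back on later
calls. This is not how `vgg19_embed` is actually implemented (the tutorial does not say);
`cached_embed` and `fake_embed` are hypothetical names, and the fake embedding simply
stands in for a slow model call.

```python
import os
import tempfile

import numpy as np


def cached_embed(key, compute_fn, cache_dir):
    """Return the embedding for `key`, computing it only if no cached copy exists."""
    path = os.path.join(cache_dir, key + ".npy")
    if os.path.exists(path):
        return np.load(path)
    vec = compute_fn(key)
    np.save(path, vec)
    return vec


calls = []  # records how many times the "slow" function really runs


def fake_embed(key):
    """Stand-in for an expensive model call; returns a 4096-dimensional vector."""
    calls.append(key)
    return np.full(4096, float(len(key)), dtype=np.float32)


cache_dir = tempfile.mkdtemp()
first = cached_embed("paris_01", fake_embed, cache_dir)
second = cached_embed("paris_01", fake_embed, cache_dir)
print(len(calls), first.shape)
```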

### Bulk download

As with the Wikipedia pages at the start of the semester, I do not want you all to have
to wait a long time to download the images for today's class. Conveniently, we should be
able to use the same bulk download function if we are clever about calling the "language"
of the images "img" and the "language" of the embeddings "embed". Grab both of these here:

In [None]:
wiki.bulk_download('impressionists-text', lang='en')

In [None]:
wiki.bulk_download('impressionists-image', lang='img')

In [None]:
wiki.bulk_download('impressionists-embed', lang='embed')

### Exploring impressionists

For today's tutorial, let's create a dataset of all the pages linked from the
Impressionism page and extract from these all of the images. Note: you should have almost
all of these from the bulk download above. If it starts downloading a lot of stuff,
something is wrong!

In [None]:
page_links = wikitext.get_internal_links("Impressionism")['ilinks'] + ["Impressionism"]
df = wikiimage.image_data_frame(page_links, download=True, min_size=224, max_size=750)
df

Next, let's grab the VGG19 embeddings for these images. This may take a minute or two
because there is a lot to load, but it should not take much longer, as almost all of the
embeddings should already have been downloaded.

In [None]:
wikiart_fc2 = wikiimage.vgg19_embed(df.img.values)
wikiart_fc2.shape

Now, finally, let's see why these embeddings are so useful. Let's start with image 700:

In [None]:
start_img = 700
img = wikiimage.load_image(df.iloc[start_img]['img'])
plt.imshow(img)

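We can compute the distance in the 4096-dimensional embedding space from this image to all
of the other images in our corpus. Here the distance is the sum of absolute differences
across the dimensions (the L1, or Manhattan, distance).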

In [None]:
dists = np.sum(np.abs(wikiart_fc2 - wikiart_fc2[start_img, :]), 1)
dists.shape
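
The line above relies on numpy broadcasting: subtracting one row from the whole matrix
subtracts it from every row at once. A tiny made-up example with three "images" in two
dimensions shows the same computation on numbers small enough to check by hand.

```python
import numpy as np

# Made-up embeddings: 3 images, 2 dimensions each.
emb = np.array([[0.0, 0.0],
                [1.0, 2.0],
                [3.0, -1.0]])

start = 1

# emb[start, :] has shape (2,), so it is broadcast across all 3 rows.
diffs = emb - emb[start, :]

# Sum of absolute differences along each row gives one distance per image.
dists = np.sum(np.abs(diffs), axis=1)
print(dists)  # distances of rows 0, 1, 2 from row 1: 3, 0, 5
```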

Then, we'll sort these distances and get the indices of the 24 closest images in this
space (of course, the closest image will be image 700 itself).

In [None]:
idx = np.argsort(dists.flatten())[:24]
idx
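
As a small worked example of this step, `np.argsort` returns the positions that would sort
the array, so slicing the front of the result gives the nearest images. The distances
below are made up; position 1 plays the role of the starting image, with distance zero to
itself.

```python
import numpy as np

# Made-up distances from a starting image (position 1) to five images.
dists = np.array([3.0, 0.0, 5.0, 1.0, 4.0])

# Positions ordered from the smallest distance to the largest.
order = np.argsort(dists)

# The 3 closest images, including the starting image itself.
closest = order[:3]

# Drop the first entry to leave out the starting image.
neighbors_only = order[1:3]

print(order, closest, neighbors_only)
```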

Finally, let's see all of the images in order from closest to farthest:

In [None]:
plt.figure(figsize=(14, 36))
for ind, i in enumerate(idx):
    try:
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(8, 3, ind + 1)

        img = wikiimage.load_image(df.iloc[i]['img'])
        plt.imshow(img)
        plt.axis("off")
        
    except Exception:
        # Skip any image that fails to load or display.
        pass

Fairly accurate, when you consider all of the image types in the corpus, no?

### Testing the embeddings

The code below picks a random starting point and displays the closest 24 images in the
`fc2` space. Run it multiple times, and record particularly interesting numbers. Where does
it work well and where does it run into problems? Tell me about at least one number that
worked better than you expected and one issue that it had trouble dealing with:

**Answer**:

In [None]:
start_img = np.random.randint(0, df.shape[0])

print("Grabbed image number {0:d}.".format(start_img))
print(df.iloc[start_img])

dists = np.sum(np.abs(wikiart_fc2 - wikiart_fc2[start_img, :]), 1)
idx = np.argsort(dists.flatten())[:24]
plt.figure(figsize=(14, 36))

for ind, i in enumerate(idx):
    try:
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(8, 3, ind + 1)

        img = wikiimage.load_image(df.iloc[i]['img'])
        plt.imshow(img)
        plt.axis("off")
        
    except Exception:
        # Skip any image that fails to load or display.
        pass
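
If you want to be able to return to an interesting starting point, one option is to draw
it from a seeded random generator and wrap the search in a function. The sketch below uses
random stand-in data in place of the real embeddings; `nearest_neighbors` and `fake_fc2`
are hypothetical names, and the seed value is arbitrary.

```python
import numpy as np


def nearest_neighbors(embeddings, start, k=24):
    """Return the indices of the k rows closest to row `start` in L1 distance."""
    dists = np.sum(np.abs(embeddings - embeddings[start, :]), axis=1)
    return np.argsort(dists)[:k]


# Stand-in data: 100 "images" with 4096-dimensional embeddings.
rng = np.random.default_rng(seed=29)
fake_fc2 = rng.normal(size=(100, 4096))

# With a fixed seed, the same starting image is drawn on every run.
start_img = int(rng.integers(0, fake_fc2.shape[0]))
idx = nearest_neighbors(fake_fc2, start_img, k=24)

print("Grabbed image number {0:d}.".format(start_img))
```

With the real data, `fake_fc2` would be replaced by the embedding array, and `idx` could be
passed to the same plotting loop as above.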