## Simpsons characters classification 

In this notebook we try to classify images of different Simpsons characters. The characters are 'abraham_grampa_simpson', 'apu_nahasapeemapetilon', 'bart_simpson', 'charles_montgomery_burns', 'chief_wiggum', 'homer_simpson', 'krusty_the_clown', 'lisa_simpson', 'marge_simpson', 'milhouse_van_houten', 'moe_szyslak', 'ned_flanders', 'principal_skinner' and 'sideshow_bob'.

This dataset was preprocessed in another notebook: it is split into a training, validation and test set, the images are resized to 80x80 pixels, and every character has more than 500 images in total. The whole dataset in its original size can be found here:

https://www.kaggle.com/alexattia/the-simpsons-characters-dataset

### Imports

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from tqdm.notebook import tqdm
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [None]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D,GlobalAveragePooling2D,GlobalMaxPooling2D, BatchNormalization
from tensorflow.keras.utils import to_categorical


####  Setup



In [None]:
import os,sys

if "google.colab" in sys.modules:
    %pip install wget
    
import wget,zipfile

if "labsetup_run" not in locals() or labsetup_run:

    print("running setup ...")

    # download data.zip from shared google drive
    if not(os.path.isfile("data.zip")): 
        filename=wget.download("https://drive.google.com/uc?export=download&confirm=yes&id=1dkSV2oL8Ua1SDmzVvtGkyQ0LGQ6VpUIy")
    # unpack it
    if not(os.path.isdir("./data")):
        zf = zipfile.ZipFile(os.path.join(".","data.zip"), "r")
        zf.extractall()
                          
    # allow "hot-reloading" of modules
    %load_ext autoreload
    %autoreload 2
    # needed for inline plots in some contexts
    %matplotlib inline

    print("done.")
    labsetup_run = False  # change to True to re-run setup
else:
    print("setup already run.")

#### Open Data

In [None]:
path="./data/simpson_data"
print(os.path.join(os.getcwd(),path))

In [None]:
Data = pd.read_csv(os.path.join(path,"Data.csv"))

X_train = np.load(os.path.join(path,"X_train.npy"))
Y_train = np.load(os.path.join(path,"Y_train.npy"))

X_val = np.load(os.path.join(path,"X_val.npy"))
Y_val = np.load(os.path.join(path,"Y_val.npy"))

X_test = np.load(os.path.join(path,"X_test.npy"))
Y_test = np.load(os.path.join(path,"Y_test.npy"))

labels = Data["label"].unique()

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_val.shape)
print(Y_test.shape)

### View Data

Let's use the training set to plot a random image of each character. You can see that the characters are easily recognizable and that all images have the same size.

In [None]:
plt.figure(figsize=(15,15))
for i in range(0,len(np.unique(np.argmax(Y_train,axis=1)))):
    rmd=np.random.choice(np.where(np.argmax(Y_train,axis=1)==i)[0],1)
    plt.subplot(4,4,i+1)
    img=X_train[rmd]
    plt.imshow(img[0,:,:,:])
    plt.title(labels[i])

In this cell we plot the label distribution of all sets. The label distribution is very similar in all three sets. The biggest class in the training set is homer and the smallest class is apu.

In [None]:
plt.figure(figsize=(14,4))
plt.subplot(1,3,1)
plt.bar(np.unique(np.argmax(Y_train,axis=1),return_counts=True)[0],np.unique(np.argmax(Y_train,axis=1),return_counts=True)[1]
       ,tick_label=labels )
plt.xticks(rotation=90)
plt.title("train distribution")
plt.subplot(1,3,2)
plt.bar(np.unique(np.argmax(Y_val,axis=1),return_counts=True)[0],np.unique(np.argmax(Y_val,axis=1),return_counts=True)[1]
       ,tick_label=labels )
plt.xticks(rotation=90)
plt.title("val distribution")
plt.subplot(1,3,3)
plt.bar(np.unique(np.argmax(Y_test,axis=1),return_counts=True)[0],np.unique(np.argmax(Y_test,axis=1),return_counts=True)[1]
       ,tick_label=labels )
plt.xticks(rotation=90)
plt.title("test distribution")
plt.show()
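
A numeric check can complement the bar charts above. The following is a minimal sketch, assuming one-hot encoded label arrays like `Y_train`; the helper name `class_proportions` and the toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def class_proportions(Y_onehot):
    """Return the share of each class in a one-hot encoded label array."""
    counts = Y_onehot.sum(axis=0)          # one-hot columns sum to class counts
    return counts / counts.sum()

# Toy stand-ins for two label sets with 3 classes, for illustration only
Y_a = np.eye(3)[[0, 0, 1, 2]]
Y_b = np.eye(3)[[0, 1, 0, 2, 0, 0, 1, 2]]

p_a = class_proportions(Y_a)
p_b = class_proportions(Y_b)
# Largest absolute difference in class share between the two sets
max_gap = np.abs(p_a - p_b).max()
print(p_a, p_b, max_gap)
```

With the real data one would pass `Y_train`, `Y_val` and `Y_test`; a small `max_gap` supports the impression that the splits are similarly distributed.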

### Random guessing

Let's build our first "classifier": what would the accuracy be if we just randomly guessed one of the labels for every test image? Note that here every character has the same chance of being predicted.


In [None]:
n_classes=np.argmax(Y_test,axis=1).max()+1
# draw one uniformly random class index per test image
random_pred=np.random.randint(0,n_classes,size=len(X_test))
acc=np.average(random_pred==np.argmax(Y_test,axis=1))
res1 = pd.DataFrame(
          {'Acc' : acc}, index=['random_guessing']
)
res1

### Weighted random guessing

Now let's build another classifier. Instead of guessing uniformly at random, we now use the class distribution of the training set: the chance that we predict homer is higher than the chance that we predict apu, and so on. Note that we assume here that the test set has the same class distribution as the training set, which is not always the case.


In [None]:
class_probs=np.unique(np.argmax(Y_train,axis=1),return_counts=True)[1]/len(Y_train)
# draw one class index per test image, with probabilities given by the training distribution
weighted_random_pred=np.random.choice(len(class_probs),size=len(X_test),p=class_probs)
acc= np.average(weighted_random_pred==np.argmax(Y_test,axis=1))
res2 = pd.DataFrame(
          {'Acc' : acc}, index=['weighted_random_guessing']
)
pd.concat([res1,res2])


### All max class

The next "classifier" just assigns every image to the biggest class, which in our case is homer. What is the accuracy if we predict "homer" for all test images?

In [None]:
classes,counts=np.unique(np.argmax(Y_train,axis=1),return_counts=True)
max_class=classes[np.argmax(counts)]  # most frequent class in the training set
#print(max_class)
acc=np.average(max_class==np.argmax(Y_test,axis=1))
res3 = pd.DataFrame(
          {'Acc' : acc}, index=['all_max_class']
)
pd.concat([res1,res2,res3])
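
The three baselines above can also be reasoned about without sampling. The sketch below computes their expected accuracy from the class shares alone; the function name and the 3-class example distribution are made up for illustration. With 14 equally likely guesses, for instance, uniform guessing is expected to reach about 1/14 ≈ 0.07.

```python
import numpy as np

def expected_baseline_accuracies(p_train, p_test):
    """Expected accuracy of the three naive baselines.

    p_train, p_test: class shares in the training and test set (each sums to 1).
    """
    p_train = np.asarray(p_train, dtype=float)
    p_test = np.asarray(p_test, dtype=float)
    n_classes = len(p_test)
    return {
        # uniform guess: the true class is hit with probability 1/K
        "random_guessing": 1.0 / n_classes,
        # class k is guessed with probability p_train[k] and is right with probability p_test[k]
        "weighted_random_guessing": float(np.sum(p_train * p_test)),
        # always predict the largest training class
        "all_max_class": float(p_test[np.argmax(p_train)]),
    }

# Hypothetical 3-class distribution, for illustration only
expected = expected_baseline_accuracies([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
print(expected)
```

The simulated accuracies in the table fluctuate around these expectations because each run draws new random predictions.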


### RF with HOG features

Let's use the first real classifier. In the next cells we extract the histogram of oriented gradients (HOG) of every 20x20 pixel patch (the `orientations` parameter is the number of orientation bins in each patch's histogram, and the `pixels_per_cell` parameter defines how big a patch is). Then we train a random forest model on the HOG features of the training data and use the trained model to predict the class of the test images from their HOG features. Finally we calculate the accuracy on the test set.

In [None]:
from skimage.feature import hog
from skimage import data, exposure

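# channel_axis=-1 marks the last axis as color; scikit-image releases before 0.19 used multichannel=True instead
fd, hog_image = hog(X_train[0], orientations=5, pixels_per_cell=(20, 20),
                    cells_per_block=(1, 1), visualize=True, channel_axis=-1)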

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)

ax1.imshow(X_train[0], cmap=plt.cm.gray)
ax1.set_title('Input image')

# Rescale histogram for better display
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10))

ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray)
ax2.set_title('Histogram of Oriented Gradients')
plt.show()
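
To see how many HOG features each image yields with these settings, the length of the descriptor can be worked out by hand. This is a small sketch of that arithmetic; the helper `hog_feature_length` is hypothetical and only mirrors how cells and blocks tile the image.

```python
def hog_feature_length(image_shape, orientations, pixels_per_cell, cells_per_block):
    """Length of the flattened HOG descriptor for a single image."""
    n_cells_row = image_shape[0] // pixels_per_cell[0]
    n_cells_col = image_shape[1] // pixels_per_cell[1]
    # blocks slide over the cell grid one cell at a time
    n_blocks_row = n_cells_row - cells_per_block[0] + 1
    n_blocks_col = n_cells_col - cells_per_block[1] + 1
    return (n_blocks_row * n_blocks_col
            * cells_per_block[0] * cells_per_block[1] * orientations)

n_hog = hog_feature_length((80, 80), 5, (20, 20), (1, 1))
print(n_hog)  # 4 * 4 blocks * 1 cell per block * 5 orientations = 80
```

The color channels do not multiply this number: for color input, `hog` keeps one gradient per pixel rather than one per channel.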

In [None]:
hog_params = dict(orientations=5, pixels_per_cell=(20, 20),
                  cells_per_block=(1, 1), channel_axis=-1)  # channel_axis replaces the older multichannel=True

# visualize is left at its default (False): only the feature vector is needed here
hog_features_train=np.zeros((len(X_train),fd.shape[0]))
for i in tqdm(range(0,len(X_train))):
    hog_features_train[i]=hog(X_train[i], **hog_params)

hog_features_test=np.zeros((len(X_test),fd.shape[0]))
for i in tqdm(range(0,len(X_test))):
    hog_features_test[i]=hog(X_test[i], **hog_params)

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,random_state=22)
clf.fit(hog_features_train, np.argmax(Y_train,axis=1))  

In [None]:
pred=clf.predict(hog_features_test)
acc=np.average(pred==np.argmax(Y_test,axis=1))
res4 = pd.DataFrame(
          {'Acc' : acc}, index=['RF_with_hog']
)
pd.concat([res1,res2,res3,res4])

### RF with colorhist features

In the next cells we extract the color histogram of each color channel. We use 12 bins, which gives 12 numbers per channel; with 3 channels we have 36 features per image. We again train a random forest model on the color-histogram features of the training data and use the trained model to predict the class of the test images from their color-histogram features. Finally we calculate the accuracy on the test set.

In [None]:
bin_size=12

col_hist_train=np.zeros((len(X_train),3*bin_size))
for i in tqdm(range(0,len(X_train))):
    col_hist_1=np.histogram(X_train[i,:,:,0],range=[0,255],bins=bin_size)[0]
    col_hist_2=np.histogram(X_train[i,:,:,1],range=[0,255],bins=bin_size)[0]
    col_hist_3=np.histogram(X_train[i,:,:,2],range=[0,255],bins=bin_size)[0]
    col_hist_train[i]=np.concatenate([col_hist_1,col_hist_2,col_hist_3])

col_hist_test=np.zeros((len(X_test),3*bin_size))
for i in tqdm(range(0,len(X_test))):
    col_hist_1=np.histogram(X_test[i,:,:,0],range=[0,255],bins=bin_size)[0]
    col_hist_2=np.histogram(X_test[i,:,:,1],range=[0,255],bins=bin_size)[0]
    col_hist_3=np.histogram(X_test[i,:,:,2],range=[0,255],bins=bin_size)[0]
    col_hist_test[i]=np.concatenate([col_hist_1,col_hist_2,col_hist_3])
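
The per-channel extraction above could also be wrapped in a small helper, so the same code serves the training and the test set. This is only a sketch: the function name `color_histogram` and the synthetic image are assumptions for illustration.

```python
import numpy as np

def color_histogram(img, bins=12):
    """Concatenate per-channel histograms of an (H, W, 3) image with values in [0, 255]."""
    return np.concatenate([
        np.histogram(img[:, :, c], bins=bins, range=(0, 255))[0]
        for c in range(img.shape[2])
    ])

# Synthetic 80x80 RGB image, for illustration only
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(80, 80, 3))
features = color_histogram(img)
print(features.shape, features.sum())
```

Every pixel falls into exactly one bin per channel, so the 36 counts sum to 3 * 80 * 80.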


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=False, sharey=False)

ax1.imshow(X_train[0], cmap=plt.cm.gray)
ax1.set_title('Input image')

ind = np.arange(0, bin_size)
width = 0.25
red = ax2.bar(ind, col_hist_train[0][0:12], width, color='red')
green = ax2.bar(ind+width, col_hist_train[0][12:24], width, color='green')
blue = ax2.bar(ind+2*width, col_hist_train[0][24:36], width, color='blue')

# ax2.bar(np.arange(0, bin_size*3), col_hist_train[0], width=0.8)
ax2.set_title('Colorhistogram')
plt.show()

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,random_state=22)
clf.fit(col_hist_train, np.argmax(Y_train,axis=1))  

In [None]:
pred=clf.predict(col_hist_test)
acc=np.average(pred==np.argmax(Y_test,axis=1))
res5 = pd.DataFrame(
          {'Acc' : acc}, index=['RF_with_colhist']
)
pd.concat([res1,res2,res3,res4,res5])

### Fully connected neural network

In the next cells we want to use a fully connected neural network. For this we first normalize the pixel values to the range from -1 to 1. Then we flatten the input; note that this ignores the 2D structure of the image. We use a neural network with two hidden layers of 800 and 200 nodes and the ReLU activation function. Finally we predict the probabilities for the 14 characters with a softmax activation. As loss function we use the categorical cross-entropy. We use a batch size of 64 and fit for 5 epochs. We use the training set to learn the weights and validate our performance on the validation set. To estimate the performance on new, unseen data, we predict the test set and check the performance.

In [None]:
X_train = np.array(X_train,dtype="float32")
X_train = ((X_train/255)-0.5)*2

X_val = np.array(X_val,dtype="float32")
X_val = ((X_val/255)-0.5)*2

X_test = np.array(X_test,dtype="float32")
X_test = ((X_test/255)-0.5)*2
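
The scaling used above maps 0 to -1 and 255 to 1. As a small sketch, it can be written as a function together with its inverse, which is handy when normalized images should be plotted again; both helper names are hypothetical.

```python
import numpy as np

def to_unit_range(x):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return (np.asarray(x, dtype="float32") / 255 - 0.5) * 2

def from_unit_range(x):
    """Inverse mapping, back to [0, 255] (e.g. for plotting)."""
    return (np.asarray(x, dtype="float32") / 2 + 0.5) * 255

pixels = np.array([0, 127.5, 255])
scaled = to_unit_range(pixels)
restored = from_unit_range(scaled)
print(scaled, restored)
```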

In [None]:
model  =  Sequential()

model.add(Flatten(input_shape=(80,80,3)))
model.add(Dense(800))
model.add(Activation('relu'))
model.add(Dense(200))
model.add(Activation('relu'))
model.add(Dense(14))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()
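
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below does that arithmetic for the layer sizes described above; the helper `dense_params` is illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# flattened 80x80x3 input, two hidden layers, 14 output classes
layer_sizes = [80 * 80 * 3, 800, 200, 14]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

Almost all parameters sit in the first layer, because every one of the 19200 input values is connected to each of the 800 hidden nodes.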

In [None]:
model.fit(X_train, Y_train, 
                  batch_size=64, 
                  epochs=5,
                  verbose=2,
                  shuffle=True,
                  validation_data=(X_val, Y_val))
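
For intuition about what is minimized during `fit`, here is a minimal NumPy sketch of the softmax activation and the categorical cross-entropy. It is not the Keras implementation; the function names, the clipping constant `eps` and the toy logits are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

# Toy logits for 2 images and 3 classes, for illustration only
logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
y_true = np.array([[1, 0, 0], [0, 1, 0]])
probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1), loss)
```

The second image has identical logits for all classes, so each class gets probability 1/3 and that image contributes log(3) to the loss.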

In [None]:
Y_prob = model.predict(X_test) 
Y_pred = Y_prob.argmax(axis=-1)

acc = np.average(Y_pred == np.argmax(Y_test,axis=1))
res6 = pd.DataFrame({'Acc' : acc}, index=['FCN_on_pixel'])
pd.concat([res1,res2,res3,res4,res5,res6])

### VGG feature extraction and RF 

Now we will use a neural network (VGG16) that was trained on ImageNet, and we will only use the convolutional part to extract features for our Simpsons images. With the features we will train a random forest classifier. Note that this network was trained on a very large dataset to classify everyday categories such as animals, vehicles and plants. Let's see if we can extract useful features to decide which Simpsons character is in the image.

In [None]:
base_model = tf.keras.applications.vgg16.VGG16(weights='imagenet', include_top=False,input_shape=(80,80,3))
base_model.summary()

In [None]:
#X_train_vgg_features=base_model.predict(X_train)
#X_val_vgg_features=base_model.predict(X_val)
#X_test_vgg_features=base_model.predict(X_test)

#X_train_vgg_features=X_train_vgg_features.reshape((len(X_train),512*2*2))
#X_val_vgg_features=X_val_vgg_features.reshape((len(X_val),512*2*2))
#X_test_vgg_features=X_test_vgg_features.reshape((len(X_test),512*2*2))

# takes a lot of time, therefore we load the pre-computed results from disk
path="./data/simpson_data"
X_train_vgg_features=np.load(os.path.join(path,"X_train_vgg_features.npy"))
X_val_vgg_features=np.load(os.path.join(path,"X_val_vgg_features.npy"))
X_test_vgg_features=np.load(os.path.join(path,"X_test_vgg_features.npy"))

print(X_train_vgg_features.shape)
print(X_val_vgg_features.shape)
print(X_test_vgg_features.shape)
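
The `512*2*2` in the reshape above follows from the VGG16 architecture: its convolutions keep the spatial size, and each of the five 2x2 max-pooling layers halves it (rounding down). A short sketch of that calculation, with a hypothetical helper name:

```python
def vgg16_feature_shape(height, width, n_pool=5, channels=512):
    """Spatial size after n_pool 2x2 max-poolings with stride 2 (floor division)."""
    for _ in range(n_pool):
        height, width = height // 2, width // 2
    return height, width, channels

h, w, c = vgg16_feature_shape(80, 80)
print((h, w, c), h * w * c)  # 80 -> 40 -> 20 -> 10 -> 5 -> 2, so (2, 2, 512) and 2048 features
```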

In [None]:
clf = RandomForestClassifier(n_estimators=100, random_state=22)
clf.fit(X_train_vgg_features, np.argmax(Y_train,axis=1))  

In [None]:
pred=clf.predict(X_test_vgg_features)
acc=np.average(pred==np.argmax(Y_test,axis=1))
res7 = pd.DataFrame(
          {'Acc' : acc}, index=['RF_on_vgg_features']
)
pd.concat([res1,res2,res3,res4,res5,res6,res7])

### Transfer learning after VGG extraction (only if you use Colab)

In this section we will again use a neural network (VGG16) that was trained on ImageNet, and this time we will add two fully connected layers on top of the feature extraction part. We will freeze the weights of the convolutional part and only train the fully connected part that we added. We will predict the probabilities for the 14 characters with a softmax activation. As loss function we use the categorical cross-entropy. We use a batch size of 64 and fit for 5 epochs. We use the training set to learn the weights and validate our performance on the validation set. To estimate the performance on new, unseen data, we predict the test set and check the performance.

Note that the training of the network may take a lot of time if you run it on your local machine.

In [None]:
base_model = tf.keras.applications.vgg16.VGG16(weights='imagenet', include_top=False,input_shape=(80,80,3))

In [None]:
x = base_model.output
x = Flatten()(x)
x = Dense(400, activation='relu')(x)
x = Dense(200, activation='relu')(x)

predictions = Dense(max(Data["Klasse"])+1, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)

In [None]:
model.summary()

In [None]:
for layer in base_model.layers:
    layer.trainable = False

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
for i, layer in enumerate(model.layers):
   print(i, layer.name,layer.trainable)

In [None]:
model.summary()

In [None]:
model.fit(X_train, Y_train, 
                  batch_size=64, 
                  epochs=5,
                  verbose=2,
                  shuffle=True,
                  validation_data=(X_val, Y_val))

In [None]:
acc = np.average(np.argmax(model.predict(X_test),axis=1) == np.argmax(Y_test,axis=1))
res8 = pd.DataFrame(
          {'Acc' : acc}, index=['transfer_learning_on_vgg_features']
)
pd.concat([res1,res2,res3,res4,res5,res6,res7,res8])

### Now it's your turn



*   Take the best model and check its performance for each individual class.
*   Look at some wrong predictions.
*   Try to improve the performance on the test set with a different model.
*   *Hints: You may want to use a deeper neural network, or combine the features for the random forest. Maybe data augmentation could improve the performance, or a CNN from scratch may work well.*
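
As a starting point for the data augmentation hint, here is a minimal NumPy sketch that applies a random horizontal flip and a small random shift to a single image. The function `augment`, the shift range and the synthetic image are assumptions for illustration; libraries such as Keras offer ready-made augmentation layers that could be used instead.

```python
import numpy as np

def augment(img, rng, max_shift=4):
    """Random horizontal flip plus a small random translation (edge-padded)."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                      # mirror left/right
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(img, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)),
                    mode="edge")
    h, w = img.shape[:2]
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + h, left:left + w, :]   # crop back to the original size

rng = np.random.default_rng(0)
img = rng.random((80, 80, 3)).astype("float32")    # synthetic image
aug = augment(img, rng)
print(aug.shape)
```

The label of an image does not change under these transformations, so augmented copies can be paired with the original one-hot label.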




In [None]:
### acc per class
pred = np.argmax(model.predict(X_test), axis=1)
for i in range(0,len(labels)):
  print(labels[i],np.average(pred[np.where(np.argmax(Y_test,axis=1)==i)]==i))
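
Per-class accuracy tells us which characters are missed, but not what they are mistaken for. A confusion matrix shows both. The sketch below builds one with plain NumPy on toy labels; the function name and the example arrays are illustrative, and with the notebook's data one would pass `np.argmax(Y_test,axis=1)` and `pred`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # count each (true, predicted) pair
    return cm

# Toy labels for 3 classes, for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, 3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class_acc)
```

The diagonal divided by the row sums reproduces the per-class accuracy, while large off-diagonal entries point to pairs of characters that are confused with each other.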

In [None]:
### misclassified examples
path="./data/simpson_data"

X_test_unnorm = np.load(os.path.join(path,"X_test.npy"))
idx = np.where(pred != np.argmax(Y_test,axis=1))[0]
rmd = np.random.choice(idx,1)
print("predicted:", labels[pred[rmd]])
print("true:", labels[np.argmax(Y_test,axis=1)[rmd]])
plt.imshow(np.squeeze(X_test_unnorm[rmd]))
plt.show()