# Goal

We are going to perform multi-class image classification on the first 150 Pokemon (nearly the entire first generation).

## The Data

We have a directory containing one folder of images for each individual Pokemon.


## Approach

To solve this problem, we are going to construct a Convolutional Neural Network (CNN).

## Tools

We will be using Keras for constructing our CNN.

## Dataset Structuring

The first thing we will need to do is decide how we are going to structure our data. I have decided to start with a train/test approach for training the model, so we will need to split the data into a training set and a testing set.

First, let's take a look at our data and see how many images of each Pokemon we have. This will help us determine what percentage of the data is going to be used for testing/training.

In addition to this, we will also be able to see how many images we should use for each different Pokemon that we are classifying.

In [None]:
from os import listdir
from os.path import join

PATH_TO_PROJECT = '/content/drive/MyDrive/Colab Notebooks/Generation One Pokemon Classification/'
PATH_TO_POKEMON_IMG_DIR = '/content/drive/MyDrive/Colab Notebooks/Generation One Pokemon Classification/pokemon.zip (Unzipped Files)/dataset/'


def get_dict_of_image_count_for_each_pokemon(path_to_dir_containing_pokemon_img_dirs):
    dict_of_pokemon_image_counts = {}
    inner_pokemon_dirs = listdir(path_to_dir_containing_pokemon_img_dirs)
    
    for pokemon_dir_name in inner_pokemon_dirs:
        # Skip stray files (anything with an extension in its name);
        # only the per-Pokemon folders are counted.
        if '.' not in pokemon_dir_name:
            path_to_pokemon_images = join(path_to_dir_containing_pokemon_img_dirs, pokemon_dir_name)
            dict_of_pokemon_image_counts[pokemon_dir_name] = len(listdir(path_to_pokemon_images))
    
    return dict_of_pokemon_image_counts

In [None]:

pokemon_count_dict = get_dict_of_image_count_for_each_pokemon(PATH_TO_POKEMON_IMG_DIR)

for pokemon in pokemon_count_dict:
    print(f"{pokemon}: {pokemon_count_dict[pokemon]}")

Alakazam: 49
Abra: 42
Arbok: 63
Aerodactyl: 97
Arcanine: 61
Articuno: 56
Beedrill: 53
Bellsprout: 55
Blastoise: 62
Bulbasaur: 289
Butterfree: 66
Caterpie: 50
Chansey: 58
Charizard: 52
Charmander: 296
Charmeleon: 65
Clefable: 49
Clefairy: 60
Cloyster: 60
Cubone: 59
Dewgong: 67
Diglett: 51
Ditto: 49
Dodrio: 65
Doduo: 48
Dragonair: 65
Dragonite: 62
Dratini: 109
Drowzee: 60
Dugtrio: 64
Eevee: 41
Ekans: 52
Electrode: 67
Electabuzz: 54
Exeggcute: 57
Exeggutor: 70
Farfetchd: 64
Fearow: 124
Flareon: 59
Gastly: 50
Gengar: 60
Geodude: 56
Golbat: 67
Gloom: 58
Goldeen: 58
Golduck: 61
Golem: 64
Graveler: 58
Grimer: 64
Growlithe: 69
Gyarados: 68
Haunter: 63
Hitmonchan: 61
Hitmonlee: 65
Horsea: 63
Hypno: 63
Ivysaur: 53
Jigglypuff: 65
Jolteon: 64
Jynx: 59
Kabuto: 56
Kabutops: 66
Kadabra: 61
Kakuna: 67
Kangaskhan: 63
Kingler: 69
Koffing: 65
Krabby: 64
Lapras: 71
Lickitung: 67
Machamp: 72
Machoke: 51
Machop: 53
Magikarp: 59
Magmar: 59
Magnemite: 60
Magneton: 59
Mankey: 72
Marowak: 70
Meowth: 70
Metapod:

## Constructing a Fair Dataset
Looking at the number of images associated with each distinct Pokemon, we notice that they are not all the same. Some Pokemon have more images in the dataset than others.

In order to keep a certain level of fairness in the class distribution when training our future CNN, we are going to find the Pokemon with the fewest images and call that count `N`. We will then take `N` images of each Pokemon. These images will be added to one large dataset that we will later split into a training dataset and a testing dataset.

In [None]:
import statistics

# Let's find out which Pokemon has least number of images and how few images that is.
pokemon_with_the_least_imgs = min(pokemon_count_dict.keys(), key=lambda k: pokemon_count_dict[k])
least_number_of_imgs = pokemon_count_dict[pokemon_with_the_least_imgs]

pokemon_with_the_most_imgs = max(pokemon_count_dict.keys(), key=lambda k: pokemon_count_dict[k])
most_number_of_imgs = pokemon_count_dict[pokemon_with_the_most_imgs]

mean_number_of_imgs = sum([pokemon_count_dict[pokemon] for pokemon in pokemon_count_dict]) / len(pokemon_count_dict)

median_number_of_imgs = statistics.median([pokemon_count_dict[pokemon] for pokemon in pokemon_count_dict])

In [None]:
print(f'Pokemon with the least number of images: {pokemon_with_the_least_imgs}')
print(f'Least number of images: {least_number_of_imgs}\n')

print(f'Pokemon with the most number of images: {pokemon_with_the_most_imgs}')
print(f'Most number of images: {most_number_of_imgs}\n')

print(f'Range number of images: {most_number_of_imgs - least_number_of_imgs}\n')

print(f'Mean average of images : {mean_number_of_imgs}\n')

print(f'Median average of images: {median_number_of_imgs}')

Pokemon with the least number of images: Eevee
Least number of images: 41

Pokemon with the most number of images: Mewtwo
Most number of images: 307

Range number of images: 266

Mean average of images : 71.76510067114094

Median average of images: 63
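
To get a feel for what balancing costs us, the sketch below works through the arithmetic using only the figures printed above. The class count of 149 is an assumption: it is inferred from the mean (149 times the printed mean is a whole number) and is not printed by the cell itself.

```python
# Figures taken from the printed statistics above.
least_number_of_imgs = 41
mean_number_of_imgs = 71.76510067114094

# Assumption: 149 Pokemon folders, inferred because 149 * mean is a whole number.
number_of_pokemon = 149

# Total images in the raw data, and how many survive if every class keeps only 41.
total_images = round(mean_number_of_imgs * number_of_pokemon)
images_kept = least_number_of_imgs * number_of_pokemon
fraction_kept = images_kept / total_images

print(f"Total images available: {total_images}")
print(f"Images kept after balancing: {images_kept}")
print(f"Fraction kept: {fraction_kept:.1%}")
```

Under that assumption, balancing keeps a little over half of the available images, which is the price paid for giving every Pokemon equal weight.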


Now that we know the smallest number of images any Pokemon has in the data, let's construct a dataset with 41 **random** images of each Pokemon.

Note that we take **random** images from each Pokemon so that the sample does not inherit any ordering present in the source folders (for example, files grouped by origin or art style). Randomizing the selection helps keep the model from fitting to that kind of specific pattern.

*NOTE*: For the sake of clarity when constructing this model, we will use a specific random seed. This keeps us from receiving completely different model accuracies on every run.

In [None]:
import numpy as np
import random as python_random
import tensorflow as tf

# 42 is a commonly used random seed. Ultimately it does not really matter what we choose;
# we simply fix one value for consistency.
random_seed = 42

# With the seeds set, the images we split will still be randomly picked,
# but the same images will be picked on every run.
# Set random seeds.
np.random.seed(random_seed) 
python_random.seed(random_seed)
tf.random.set_seed(random_seed)
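
As a small illustration of why the seed matters, the sketch below uses only Python's standard `random` module (not NumPy or TensorFlow) to show that the same seed yields the same selection. The helper `pick_images` and the file names are hypothetical and exist only for this example.

```python
import random


def pick_images(file_names, how_many, seed):
    """Return `how_many` file names chosen with a private, seeded generator."""
    rng = random.Random(seed)    # independent generator; global state is untouched
    shuffled = list(file_names)  # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    return shuffled[:how_many]


# Hypothetical file names, for illustration only.
files = [f"img_{i}.png" for i in range(10)]

first_run = pick_images(files, 3, seed=42)
second_run = pick_images(files, 3, seed=42)

# Same seed, same selection.
print(first_run == second_run)
```

The selection is still random with respect to the data; it is simply repeatable, which makes later accuracy figures comparable between runs.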

In [None]:
# Now we will construct the overall Pokemon dataset that we will be using.
# We will be storing this dataset in the location "<path to project location>/Sample-Dataset"
path_to_dataset = join(PATH_TO_PROJECT, 'Sample-Dataset')
print("Path to dataset:", path_to_dataset)

Path to dataset: /content/drive/MyDrive/Colab Notebooks/Generation One Pokemon Classification/Sample-Dataset


In [None]:
from os import mkdir
from shutil import rmtree
from os.path import exists

# Don't wanna re-create the dataset, it takes a long time...
if not exists(path_to_dataset):
    mkdir(path_to_dataset)

Alright, finally, let's populate our new dir with images that we will feed to our model.

In [None]:
from os.path import isdir
from shutil import copyfile

def construct_dataset_of_random_images(path_to_dataset_dir, number_of_random_images, random_seed=None):
    # Re-seed NumPy if a seed is given, so the same images are selected on every run.
    if random_seed is not None:
        np.random.seed(random_seed)

    image_src_path_mapped_to_destination_path = {}

    for pokemon_dir_name in listdir(PATH_TO_POKEMON_IMG_DIR):
        path_to_individual_pokemon_img_dir = join(PATH_TO_POKEMON_IMG_DIR, pokemon_dir_name)

        # Skip anything that is not a folder of images (e.g. a stray file).
        if not isdir(path_to_individual_pokemon_img_dir):
            continue

        # Build the absolute path of each image, as we will need it later when
        # copying images to another dir.
        pokemon_images = [join(path_to_individual_pokemon_img_dir, img_name)
                          for img_name in listdir(path_to_individual_pokemon_img_dir)]

        # Shuffle so that the first `number_of_random_images` entries are a random sample.
        np.random.shuffle(pokemon_images)

        # Create the folder that all of this Pokemon's images will live in, once per Pokemon.
        destination_folder_path = join(path_to_dataset_dir, pokemon_dir_name)
        if not exists(destination_folder_path):
            mkdir(destination_folder_path)

        # Map each source path to its destination path, tagging each image with the
        # Pokemon's name and its index. This makes splitting train/test data easier later.
        for i in range(number_of_random_images):
            # Keep the image extension in the destination name.
            img_ext = pokemon_images[i].split('.')[-1]
            image_src_path_mapped_to_destination_path[pokemon_images[i]] = join(
                destination_folder_path, f"{pokemon_dir_name}_{i}.{img_ext}")

    # Copy images to the location of our dataset folder.
    total_images = len(image_src_path_mapped_to_destination_path)
    for count_of_images, img_src_path in enumerate(image_src_path_mapped_to_destination_path, start=1):
        img_destination_path = image_src_path_mapped_to_destination_path[img_src_path]

        # Print progress so we can keep track of how far along the copy is.
        print(f"Copying image number {count_of_images} out of {total_images} total")
        print(f"Image destination: {img_destination_path}\n")

        if not exists(img_destination_path):
            copyfile(img_src_path, img_destination_path)

In [None]:
# NOTE: This might take a while if it is the first time you are running this line.

# We will only construct the dataset if it has not yet been constructed. If there is an image in the folder, then we know it was already made.
if not listdir(path_to_dataset):
    construct_dataset_of_random_images(path_to_dataset_dir=path_to_dataset, number_of_random_images=least_number_of_imgs)
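
The naming scheme used for the copied files (Pokemon name, index, original extension) can be sketched in isolation as follows. This is an illustrative variant, not the notebook's code: it uses `os.path.splitext`, which also copes with a file that has no extension, and the helper name `destination_path` and the example paths are hypothetical.

```python
from os.path import basename, join, splitext


def destination_path(dataset_dir, pokemon_name, index, src_path):
    """Build '<dataset_dir>/<pokemon>/<pokemon>_<index><ext>' for one source image."""
    _, ext = splitext(src_path)  # ext includes the leading dot, e.g. ".png", or "" if none
    return join(dataset_dir, pokemon_name, f"{pokemon_name}_{index}{ext}")


# Hypothetical source paths, for illustration only.
with_ext = destination_path("Sample-Dataset", "Eevee", 0, "/data/Eevee/abc123.png")
without_ext = destination_path("Sample-Dataset", "Eevee", 7, "/data/Eevee/no_extension")

print(basename(with_ext))
print(basename(without_ext))
```

Because every destination name starts with the class name and carries a running index, it is easy to check later that each class folder holds the expected number of files.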

In [None]:
# Let's just confirm we have the same number of images for each Pokemon.
path_to_sample_dataset = '/content/drive/MyDrive/Colab Notebooks/Generation One Pokemon Classification/Sample-Dataset'
pokemon_count_dict = get_dict_of_image_count_for_each_pokemon(path_to_sample_dataset)
for pokemon in pokemon_count_dict:
    print(f"{pokemon}: {pokemon_count_dict[pokemon]}")

Alakazam: 41
Abra: 41
Arbok: 41
Aerodactyl: 41
Arcanine: 41
Articuno: 41
Beedrill: 41
Bellsprout: 41
Blastoise: 41
Bulbasaur: 41
Butterfree: 41
Caterpie: 41
Chansey: 41
Charizard: 41
Charmander: 41
Charmeleon: 41
Clefable: 41
Clefairy: 41
Cloyster: 41
Cubone: 41
Dewgong: 41
Diglett: 41
Ditto: 41
Dodrio: 41
Doduo: 41
Dragonair: 41
Dragonite: 41
Dratini: 41
Drowzee: 41
Dugtrio: 41
Eevee: 41
Ekans: 41
Electrode: 41
Electabuzz: 41
Exeggcute: 41
Exeggutor: 41
Farfetchd: 41
Fearow: 41
Flareon: 41
Gastly: 41
Gengar: 41
Geodude: 41
Golbat: 41
Gloom: 41
Goldeen: 41
Golduck: 41
Golem: 41
Graveler: 41
Grimer: 41
Growlithe: 41
Gyarados: 41
Haunter: 41
Hitmonchan: 41
Hitmonlee: 41
Horsea: 41
Hypno: 41
Ivysaur: 41
Jigglypuff: 41
Jolteon: 41
Jynx: 41
Kabuto: 41
Kabutops: 41
Kadabra: 41
Kakuna: 41
Kangaskhan: 41
Kingler: 41
Koffing: 41
Krabby: 41
Lapras: 41
Lickitung: 41
Machamp: 41
Machoke: 41
Machop: 41
Magikarp: 41
Magmar: 41
Magnemite: 41
Magneton: 41
Mankey: 41
Marowak: 41
Meowth: 41
Metapod: 41


## Split Dataset Into Test/Train

Alright, now that we have our balanced dataset, containing the same number of images (41) for every Pokemon, it's time to split our dataset up.

For this I think that a Training set of 80% and a Testing set of 20% is a good split.


### Fair Split

We want to make sure we do a fair random split of the data.

That is, 20 percent of the images of each class should go into the test set and the remaining 80 percent into the training set. This ensures that we have a good number of records for every different Pokemon in both sets.

Since we have tagged each image with the name of the Pokemon followed by its index for that Pokemon, it is much easier to ensure that this data is split fairly.
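
A per-class split of this kind can be sketched with plain lists. The helper `split_class` and the file names are hypothetical; the rounding rule (`round(n * test_size)`) mirrors the one used in the notebook's own split function.

```python
def split_class(image_names, test_size=0.2):
    """Split one class's images into (train, test), sending the first part to test."""
    number_for_testing = round(len(image_names) * test_size)
    test_images = image_names[:number_for_testing]
    train_images = image_names[number_for_testing:]
    return train_images, test_images


# 41 hypothetical file names for one Pokemon, matching the balanced dataset size.
eevee_images = [f"Eevee_{i}.jpg" for i in range(41)]

train_images, test_images = split_class(eevee_images)
print(len(train_images), len(test_images))
```

With 41 images per class, this gives 33 training images and 8 testing images for every Pokemon, so no class is over- or under-represented in either set.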

In [None]:
from keras_preprocessing.image import ImageDataGenerator

In [None]:
# Let's set a batch size and epochs right now.
# We will use these for all of the models that we create.
batch_size = 32
epochs = 150

In [None]:
train_dir = '/content/drive/MyDrive/Colab Notebooks/Generation One Pokemon Classification/train'
test_dir = '/content/drive/MyDrive/Colab Notebooks/Generation One Pokemon Classification/test'

def populate_test_and_train_dirs(path_to_dataset, train_dir, test_dir, test_size=0.2):
    for pokemon_dir in listdir(path_to_dataset):
        full_path_to_pokemon_dir = join(path_to_dataset, pokemon_dir)
        
        number_of_imgs_for_each_pokemon = len(listdir(full_path_to_pokemon_dir))
        print(number_of_imgs_for_each_pokemon)

        dir_for_training_pokemon_imgs = join(train_dir, pokemon_dir)
        dir_for_testing_pokemon_imgs = join(test_dir, pokemon_dir)

        if not exists(dir_for_testing_pokemon_imgs):
            mkdir(dir_for_testing_pokemon_imgs)

        if not exists(dir_for_training_pokemon_imgs):
            mkdir(dir_for_training_pokemon_imgs)

        # Work out how many images we need for testing.
        # This will vary based on what our test_size is.
        number_of_imgs_used_for_testing = round(number_of_imgs_for_each_pokemon * test_size)

        # Since the images were already picked in random order, we can simply send the first
        # `number_of_imgs_used_for_testing` images to the test dir and the rest to the train dir.
        for i, pokemon_img in enumerate(listdir(full_path_to_pokemon_dir)):
            print(pokemon_img)
            path_to_pokemon_img = join(full_path_to_pokemon_dir, pokemon_img)

            if i < number_of_imgs_used_for_testing:
                destination_path_for_img = join(test_dir, pokemon_dir, pokemon_img)
            else:
                destination_path_for_img = join(train_dir, pokemon_dir, pokemon_img)

            copyfile(path_to_pokemon_img, destination_path_for_img)


            print(f"Src: {path_to_pokemon_img}")
            print(f"Destination: {destination_path_for_img}\n") 

In [None]:

# Create folders to store training and testing data. Now let's randomly split them...
if not exists(train_dir):
    mkdir(train_dir)

if not exists(test_dir):
    mkdir(test_dir)
    populate_test_and_train_dirs(path_to_dataset=path_to_sample_dataset, train_dir=train_dir, test_dir=test_dir)

In [None]:
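# Okay, now we can load our split-up data directly into Keras.
# NOTE: `validation_split=0.2` combined with `subset` makes each generator use only part of
# its directory: 'training' keeps 80% of train_dir and 'validation' keeps 20% of test_dir.
datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True,
                             validation_split=0.2)

# target_size must match the model's input shape (244 x 244); the Keras default is 256 x 256.
train_generator = datagen.flow_from_directory(train_dir, target_size=(244, 244), batch_size=batch_size,
                                              subset='training', seed=random_seed)

test_generator = datagen.flow_from_directory(test_dir, target_size=(244, 244), batch_size=batch_size,
                                             subset='validation', seed=random_seed)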

Found 4001 images belonging to 149 classes.
Found 149 images belonging to 149 classes.
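
The image counts reported above can be roughly reconstructed by hand. The sketch below is an approximation under two assumptions that the notebook does not state: that each class folder holds 33 training and 8 testing images (41 split 80/20), and that Keras cuts each class folder at `int(validation_split * n)` files.

```python
import math

validation_split = 0.2
batch_size = 32
number_of_classes = 149                    # from the generator output above
train_per_class, test_per_class = 33, 8    # assumed: 41 images split 80/20 per class


def subset_sizes(number_of_files, split):
    """Return (training, validation) counts for one class folder."""
    number_for_validation = int(split * number_of_files)
    return number_of_files - number_for_validation, number_for_validation


train_used_per_class, _ = subset_sizes(train_per_class, validation_split)
_, test_used_per_class = subset_sizes(test_per_class, validation_split)

train_images_used = train_used_per_class * number_of_classes
test_images_used = test_used_per_class * number_of_classes

# One epoch needs enough batches to cover every reported training image once.
batches_per_epoch = math.ceil(4001 / batch_size)

print("Training images used (estimate):", train_images_used)
print("Test images used:", test_images_used)
print("Batches per epoch:", batches_per_epoch)
```

Under these assumptions the test generator sees exactly one image per Pokemon, which matches the 149 reported above. The training estimate comes out at 4023 rather than the reported 4001; the difference may be files that Keras skipped, but that is a guess.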


## Build Model

Now that we have our dataset constructed and our data is loaded into Keras, it is time to construct our Convolutional Neural Network.

In [None]:
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers, optimizers, Sequential

import tensorflow as tf

import pandas as pd
import numpy as np

In [None]:
def build_model(number_of_hidden_layers=3,
                hidden_activation_function='relu',
                output_activation_function='softmax',
                dropout=None,
                input_shape=(244, 244, 3),
                pool_size=(2, 2),
                number_of_classes=150
                ):
    model = Sequential()

    # Let's do the input and first hidden layer separately.
    # Note: Keras 2 reads Conv2D(32, 3, 3) as a 3x3 kernel with a stride of 3, so we spell that out.
    model.add(Conv2D(32, kernel_size=3, strides=3, input_shape=input_shape))
    model.add(Activation(hidden_activation_function))
    model.add(MaxPooling2D(pool_size=pool_size))

    # Add all other hidden layers. Only the first layer needs input_shape.
    for i in range(number_of_hidden_layers - 1):
        model.add(Conv2D(32, kernel_size=3, strides=3))
        model.add(Activation(hidden_activation_function))
        model.add(MaxPooling2D(pool_size=pool_size))

    # Flatten the feature maps and feed them to a small dense layer.
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Activation(hidden_activation_function))

    # Apply dropout before the last layer.
    if dropout:
        model.add(Dropout(dropout))

    # Final output layer: one unit per class.
    # This must match the number of class folders found by the generators.
    model.add(Dense(number_of_classes))
    model.add(Activation(output_activation_function))

    return model

We will assign a dropout of 0.5 for this model, as our first test.

This model will only have 3 hidden layers.

We will start simple, diagnose our problems, then produce better, more effective models.

For an optimizer, we will be using Adam. It is a good one to start with.

In [None]:
# The generators above found 149 classes, so we size the output layer from the
# generator rather than hard-coding 150.
model_1 = build_model(dropout=0.5, number_of_classes=train_generator.num_classes)

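For our loss we will use Categorical Crossentropy, since the generators yield one-hot encoded labels. (Sparse Categorical Crossentropy would instead expect integer class indices.)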
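
The difference between the two losses is only in how the label is encoded. The sketch below is a plain-Python illustration for a single sample, not the Keras implementation; the function names and the three-class probabilities are hypothetical.

```python
import math


def categorical_crossentropy(one_hot_label, predicted_probs):
    """Loss for a one-hot label: only the true class contributes to the sum."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, predicted_probs) if y > 0)


def sparse_categorical_crossentropy(class_index, predicted_probs):
    """Same loss, but the label is given as an integer class index."""
    return -math.log(predicted_probs[class_index])


# Hypothetical softmax output for a 3-class problem; the true class is class 0.
probs = [0.7, 0.2, 0.1]

loss_one_hot = categorical_crossentropy([1, 0, 0], probs)
loss_sparse = sparse_categorical_crossentropy(0, probs)

print(loss_one_hot, loss_sparse)
```

Both calls give the same value, so the choice between them comes down to the label format the data pipeline produces.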

Our optimizer will be Adam, as it performs well in a wide range of conditions. It keeps running estimates of the mean and variance of each parameter's gradient and uses them to adapt the step size per parameter, which often allows faster convergence in fewer epochs.
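
For intuition, a single Adam update can be written out for one scalar parameter. This is a minimal sketch of the standard update rule, not the Keras optimizer itself; the function `adam_step`, the toy objective f(x) = x^2, and the learning rate of 0.1 are choices made purely for illustration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise f(x) = x^2 (gradient 2x), starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
history = []
for t in range(1, 201):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
    history.append(x)

print(history[0], history[-1])
```

Note that the very first step moves the parameter by roughly the learning rate regardless of how large the gradient is, which is the per-parameter scaling at work.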

In [None]:
model_1.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

NameError: ignored

Let's take a quick look at this model's specs.

In [None]:
model_1.summary()

Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_47 (Conv2D)           (None, 81, 81, 32)        896       
_________________________________________________________________
activation_67 (Activation)   (None, 81, 81, 32)        0         
_________________________________________________________________
max_pooling2d_44 (MaxPooling (None, 40, 40, 32)        0         
_________________________________________________________________
conv2d_48 (Conv2D)           (None, 13, 13, 32)        9248      
_________________________________________________________________
activation_68 (Activation)   (None, 13, 13, 32)        0         
_________________________________________________________________
max_pooling2d_45 (MaxPooling (None, 6, 6, 32)          0         
_________________________________________________________________
conv2d_49 (Conv2D)           (None, 2, 2, 32)        
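
The output shapes and parameter counts in this summary can be checked by hand. The sketch below assumes unpadded ("valid") layers and a 3x3 kernel with a stride of 3 for each convolution, which is what the output shapes in the summary imply; the helper names are hypothetical.

```python
def output_size(size, kernel, stride):
    """Output width/height of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


size = 244
sizes = []
for _ in range(3):
    size = output_size(size, kernel=3, stride=3)  # convolution: 3x3 kernel, stride 3
    sizes.append(size)
    size = output_size(size, kernel=2, stride=2)  # max pooling: 2x2 window
    sizes.append(size)

first_conv_params = conv_params(3, 3, 32)    # RGB input, 32 filters
later_conv_params = conv_params(3, 32, 32)   # 32 input channels, 32 filters

print(sizes)
print(first_conv_params, later_conv_params)
```

The computed sizes (81, 40, 13, 6, 2) and parameter counts (896 and 9248) agree with the rows shown above. By the same arithmetic the last pooling layer would leave a 1x1 feature map, so very little spatial information reaches the dense layers.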

Now, for fitting the model, we will start by running a large number of epochs and work our way down later if we can reach a good loss value at an earlier epoch.

The batch size is already set on the generator (32), so `fit` only needs the number of epochs.

In [None]:
model_1.fit(train_generator, epochs=epochs)

Epoch 1/150


InvalidArgumentError: ignored