## 1. How to load images
### 1.1 MNIST-like load_data function
### 1.2 ImageDataGenerator
### 1.3 Dataframes

## 2. MNIST
## 3. Data augmentation
## 4. Pre-trained neural networks
## 5. RGB or grayscale
## 6. Conclusion

# Classifying album covers by genre

In this project we try to classify album covers by genre. The dataset contains 5022 images, separated into five categories: Rock, Pop, Electronic, Jazz and HipHop. One of the challenges of classifying album covers is that covers follow no systematic rules. For example, when we classify a picture as dog or cat, we humans can clearly see the difference, and a machine learning algorithm can learn to detect shapes that are typical for cats and dogs. But when humans look at an album cover without any prior knowledge, it is hard for them to assign it to a genre. This first observation about the dataset leads to an early hypothesis that we probably need a pre-trained neural network to solve this task appropriately.
Our approach is to start with a simple convolutional neural network inspired by the architecture used for classifying handwritten digits from MNIST.
This gives us a baseline for comparison with other methods and possible improvements. Before we look at the different approaches, we will look at how we load the images.

## Some convenience functions

The cells below import all required libraries and check that the system is set up correctly. We don't need to duplicate the imports: we import everything once before we run the other code cells.

In [1]:
# imports
import sys
import os
import shutil  # used later to copy files into the training/validation/test folders
import random
import matplotlib.pyplot as plt
# keras
from keras.preprocessing.image import load_img, img_to_array, array_to_img, ImageDataGenerator
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout
from keras.applications import VGG16

#tensorflow
import tensorflow.keras
import tensorflow as tf
# scikit-learn
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

import numpy as np
import pandas as pd
from PIL import ImageChops

In [2]:
print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")

Tensor Flow Version: 2.2.0
Keras Version: 2.3.0-tf

Python 3.8.5 (default, Sep  5 2020, 10:50:12) 
[GCC 10.2.0]
Pandas 1.0.5
Scikit-Learn 0.23.1
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
GPU is available


# 1. How to load images

The dataset consists of 5022 images separated into the five categories Rock, Pop, Electronic, Jazz and HipHop.
For each category there is one folder with the associated images. We use three different approaches to load the data.
Our first approach is inspired by the MNIST dataset: we wrote our own `load_data` function. The second approach uses an `ImageDataGenerator`, which needs a specific directory structure. The last one uses dataframes, which are read from CSV files.

## 1.1 MNIST-like load_data function

In the MNIST example the data is loaded via the `load_data()` function, which returns two tuples with training and test data. Our own `load_data()` function imitates this behaviour.
We define a function that goes through the specified path, converts the images to arrays and appends them to the array $X$. It also appends the index of each image's class to the array $y$. Then we use the Scikit-learn function `train_test_split` to split $X$ and $y$ into $X_{train}$, $X_{test}$, $y_{train}$, $y_{test}$ and return these variables.



In [9]:


# define some useful constants
DATA_PATH = './data/covers_original/'
CATEGORIES = ['electronic', 'hiphop', 'jazz', 'pop', 'rock']
IMG_SIZE = 150
CATEGORIES_SIZE = 5
TRAIN_PERC = 0.8
TEST_PERC = 0.2

# define a function that creates the dataset
def load_data():
    X = []
    y = []
    for category in CATEGORIES:
        path = os.path.join(DATA_PATH, category) # './data/covers_original/<category>'
        cn = CATEGORIES.index(category) # get index of class name (e.g. 'electronic' => 0, 'rock' => 4)
        
        for filename in os.listdir(path):
            try:
                img = load_img(os.path.join(path, filename), target_size=(IMG_SIZE, IMG_SIZE))
                img_as_array = img_to_array(img)
                X.append(img_as_array)
                y.append(cn)
            except Exception as e:
                # skip files that cannot be read as images and report why
                print(e)
    X = np.array(X)
    y = np.array(y)
    # hold out TEST_PERC of the images for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_PERC)
    return (X_train, y_train), (X_test, y_test)

(X_train, y_train), (X_test, y_test) = load_data()

# test tuple
#print(X_train[0][0][0], "    ", y_train[0])
# reshape and normalize the data
# (-1 lets NumPy infer the number of images instead of hard-coding it)
X_train = X_train.reshape((-1, IMG_SIZE, IMG_SIZE, 3))
X_train = X_train.astype('float32') / 255
X_test = X_test.reshape((-1, IMG_SIZE, IMG_SIZE, 3))
X_test = X_test.astype('float32') / 255

# one-hot encode the class labels (0-4)
y_train = np_utils.to_categorical(y_train, CATEGORIES_SIZE)
y_test = np_utils.to_categorical(y_test, CATEGORIES_SIZE)


[1. 1. 1.]      4
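
To make the normalization and one-hot encoding steps concrete, here is a minimal NumPy-only sketch on a tiny made-up batch. The arrays `images` and `labels` are hypothetical stand-ins for `X_train` and `y_train`, and `np.eye` is used here only to illustrate the kind of result `to_categorical` is expected to produce.

```python
import numpy as np

# Hypothetical mini-batch: two 2x2 RGB "images" with raw 8-bit pixel values.
images = np.array([[[[0, 128, 255]] * 2] * 2,
                   [[[255, 255, 255]] * 2] * 2], dtype=np.uint8)
labels = [0, 4]   # class indices, e.g. 'electronic' and 'rock'
num_classes = 5

# Scale pixel values from the range [0, 255] to [0.0, 1.0].
images_scaled = images.astype('float32') / 255

# One-hot encode: row i of the identity matrix is the encoding of class i.
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(images_scaled.shape)  # (2, 2, 2, 3)
print(one_hot)
```

Each label becomes a vector of length five with a single 1 at the position of its class, which is the format a softmax output layer with five units is compared against.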


## 1.2 ImageDataGenerator

The ImageDataGenerator gives us the opportunity to manipulate the images and thereby enlarge our dataset. This plays an important role when the dataset is too small. The ImageDataGenerator can, for example, shift, flip and rotate the images. By applying data augmentation we hope to reduce or avoid overfitting, because models tend to overfit on small datasets. That means the model describes the training data well but performs badly on new data. We will see later whether this approach is as productive as expected.
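
The idea behind augmentation can be sketched in a few lines of NumPy. This is only an illustration of the principle, not what the ImageDataGenerator does internally: the helper `augment` is hypothetical, and `np.roll` wraps pixels around the border instead of filling them the way `fill_mode='nearest'` would.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and horizontally shifted copy of an (H, W, C) image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                        # horizontal flip
    max_shift = max(1, image.shape[1] // 5)          # shift by up to about 20% of the width
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(out, shift, axis=1)                # simplification: pixels wrap around
    return out.copy()

rng = np.random.default_rng(seed=0)
image = np.arange(4 * 4 * 3, dtype='float32').reshape(4, 4, 3)  # made-up 4x4 RGB image
variants = [augment(image, rng) for _ in range(3)]
print([v.shape for v in variants])
```

Every call yields a slightly different view of the same cover, so the model sees more variety than the raw dataset contains.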

The ImageDataGenerator needs a specific directory structure. The root directory contains three folders called training, validation and test. Each of them contains one folder per category, and the images are distributed across these folders: 80% of the data goes to training, 10% to validation and another 10% to test.
To use the ImageDataGenerator you just pass the path of the respective directory.

* root
    * test
        * rock
        * pop
        * jazz
        * electronic
        * hiphop
    * training
        * rock 
        * pop ...
    * validation
        * ...
        
        
For example the training data needs the path `./data/covers/training`. Then the generator will load and manipulate the data before it is processed by the model.

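``` python
from keras.preprocessing.image import ImageDataGenerator

BATCH_SIZE = 32  # assumed value; the batch size is not specified in this section

# augment only the training images; validation and test images are just rescaled
training_datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
validation_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

# read the images from the class sub-folders and yield augmented batches
train_gen = training_datagen.flow_from_directory(
    './data/covers/training',
    target_size=(IMG_SIZE, IMG_SIZE),
    class_mode='categorical',
    batch_size=BATCH_SIZE,
)
```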
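
When no explicit `classes` list is passed, `flow_from_directory` infers the classes from the sub-folder names and assigns label indices in alphanumeric order. The following sketch mimics that mapping in plain Python for our five folders; the variable names are illustrative only, and on a real generator the mapping can be inspected through its `class_indices` attribute.

```python
# Folder names as they might be listed on disk, in arbitrary order.
subdirs = ['rock', 'pop', 'jazz', 'electronic', 'hiphop']

# Alphanumeric order determines the label index of each class.
class_indices = {name: idx for idx, name in enumerate(sorted(subdirs))}

print(class_indices)
# {'electronic': 0, 'hiphop': 1, 'jazz': 2, 'pop': 3, 'rock': 4}
```

This is the same order as in the `CATEGORIES` list used by `load_data()`, so the label indices of both loading approaches should be comparable.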

In [None]:
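DATA_DEST = './data/covers'
VALIDATION_PERC = 0.1
TEST_PERC = 0.1  # note: this overrides the value 0.2 used for load_data() above

# only consider sub-directories, so stray files such as '.DS_Store' are skipped
categories = [d for d in os.listdir(DATA_PATH)
              if os.path.isdir(os.path.join(DATA_PATH, d))]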

# unfortunately this is too slow, removing the directory 
# using finder or explorer is way faster
# remove directory if it already exists (for convenience)
# if os.path.exists(DATA_DEST):
#     shutil.rmtree(DATA_DEST)
        
for category in categories:
    # set up some paths
    oldCategoryPath = os.path.join(DATA_PATH, category)
    newCategoryTrainingPath = os.path.join(DATA_DEST, 'training', category)
    newCategoryValidationPath = os.path.join(DATA_DEST, 'validation', category)
    newCategoryTestPath = os.path.join(DATA_DEST, 'test', category)

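    # get all files of each category (sorted for a reproducible split,
    # ignoring '.DS_Store' so it does not count towards the split sizes)
    files = sorted(f for f in os.listdir(os.path.join(DATA_PATH, category))
                   if f != '.DS_Store')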
    
    # make a directory for each category (exist_ok avoids an error on re-runs)
    os.makedirs(newCategoryTrainingPath, exist_ok=True)
    os.makedirs(newCategoryValidationPath, exist_ok=True)
    os.makedirs(newCategoryTestPath, exist_ok=True)
    
    # for each category, compute where the training and validation slices end
    training_end = int((1-(VALIDATION_PERC + TEST_PERC)) * len(files))
    validation_end = training_end + int(VALIDATION_PERC * len(files))
    
    # separate training, validation and test files
    # (the test set takes the remainder so that no file is dropped by rounding)
    training_files = files[:training_end]
    validation_files = files[training_end:validation_end]
    test_files = files[validation_end:]
    print('Training files:', len(training_files))
    print('Validation files:', len(validation_files))
    print('Test files:', len(test_files))
    
    # copy training files to training path
    for idx, file in enumerate(training_files):
        oldFilePath = os.path.join(oldCategoryPath, file)
        if file != '.DS_Store' and os.path.isfile(oldFilePath):
#             newFilename = category + '_image_' + str(idx) + '.jpeg'
            newFilePath = os.path.join(newCategoryTrainingPath, file)
            shutil.copy(oldFilePath, newFilePath)

    # copy validation files to validation path
    for idx, file in enumerate(validation_files):
        oldFilePath = os.path.join(oldCategoryPath, file)
        if file != '.DS_Store' and os.path.isfile(oldFilePath):
#             newFilename = category + '_image_' + str(idx) + '.jpeg'
            newFilePath = os.path.join(newCategoryValidationPath, file)
            shutil.copy(oldFilePath, newFilePath)
    
    # copy test files to test path
    for idx, file in enumerate(test_files):
        oldFilePath = os.path.join(oldCategoryPath, file)
        if file != '.DS_Store' and os.path.isfile(oldFilePath):
#             newFilename = category + '_image_' + str(idx) + '.jpeg'
            newFilePath = os.path.join(newCategoryTestPath, file)
            shutil.copy(oldFilePath, newFilePath)
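
The split above takes the files in the order in which they are listed. If the file order correlates with the content in any way, shuffling before slicing can give a more representative split. The following is a minimal sketch of such a shuffled split; `split_files` and the generated file names are hypothetical and not part of the original notebook.

```python
import random

def split_files(files, validation_perc=0.1, test_perc=0.1, seed=42):
    """Shuffle the file names reproducibly and split them into three lists."""
    files = sorted(files)                  # fixed starting order
    random.Random(seed).shuffle(files)     # reproducible shuffle
    n_val = int(validation_perc * len(files))
    n_test = int(test_perc * len(files))
    n_train = len(files) - n_val - n_test  # training takes the remainder
    training = files[:n_train]
    validation = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return training, validation, test

# Made-up file names standing in for one category folder.
names = [f'cover_{i:04d}.jpg' for i in range(1004)]
training, validation, test = split_files(names)
print(len(training), len(validation), len(test))  # 804 100 100
```

Because the seed is fixed, repeated runs produce the same split, which keeps results comparable across experiments.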