# Project F
* Gerardo Zavala A.
* Dileep Badveli
*  Alex McKinnon

---------------------------
* id: gzavala@ncsu.edu
* id: dbadvel@ncsu.edu
* id: ajmckin2@ncsu.edu



# Movie Genre Classification 

## Motivation:

Classifying a movie based only on the poster image is a difficult task to do even for humans, as sometimes poster images can be misleading of what the movie will be about. Humans can predict the movie genre based on actors and prequels if they exist, but without any prior knowledge of the movie a person would need to depend on the trailer or the movie poster. 


The poster of a movie is the first interaction a person sees about a movie. The genre can be determined by the colors used, expressions on the faces, objects in the poster, etc.; these types of characteristics have been shown to affect emotions in people, and we will take advantage of that to train the network. 
Movies can fall into more than one type of classification, so one problem to solve is building a multiclassification neural network that can categorize multiple movie genres based on the poster image.


# 0. Libraries: 

In [0]:
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
# Update keras/tensorflow for use in Google colab (needs to be done every new session)
!pip install --upgrade keras
!pip install --upgrade tensorflow
# Remember to press 'Restart Runtime' at end

Collecting keras
[?25l  Downloading https://files.pythonhosted.org/packages/ad/fd/6bfe87920d7f4fd475acd28500a42482b6b84479832bdc0fe9e589a60ceb/Keras-2.3.1-py2.py3-none-any.whl (377kB)
[K     |▉                               | 10kB 24.4MB/s eta 0:00:01[K     |█▊                              | 20kB 4.2MB/s eta 0:00:01[K     |██▋                             | 30kB 6.1MB/s eta 0:00:01[K     |███▌                            | 40kB 7.7MB/s eta 0:00:01[K     |████▍                           | 51kB 4.8MB/s eta 0:00:01[K     |█████▏                          | 61kB 5.6MB/s eta 0:00:01[K     |██████                          | 71kB 6.4MB/s eta 0:00:01[K     |███████                         | 81kB 7.2MB/s eta 0:00:01[K     |███████▉                        | 92kB 7.9MB/s eta 0:00:01[K     |████████▊                       | 102kB 6.4MB/s eta 0:00:01[K     |█████████▌                      | 112kB 6.4MB/s eta 0:00:01[K     |██████████▍                     | 122kB 6.4MB/s eta 0:

In [0]:
# Libraries for tensorflow
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import sys
print(sys.version)
print(tf.__version__)

3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
2.0.0


In [0]:
# Common Libraries
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import os

# Sklearn Libraries for cross validation:
from sklearn.model_selection import KFold
from keras.utils import np_utils, plot_model
from keras.utils.vis_utils import plot_model
import pathlib
from IPython.display import Image, display
import glob
import scipy.misc
from tqdm import tqdm
import requests  
import re
from bs4 import BeautifulSoup  
from urllib.request import urlretrieve
import ast 
from sklearn.preprocessing import OneHotEncoder

-----------------------------------------------------------------------------
# 1. Loading the Data

In [0]:
movie_csv='gdrive/My Drive/ECE 542 F3/data/MovieGenre.csv'
movies_dir='gdrive/My Drive/ECE 542 F3/data/posters/'

#### 1.1 Load the training files:
* This file contains the id (image name) and the label assigned to it. 

In [0]:
movie_data = pd.read_csv(movie_csv, encoding="ISO-8859-1")

In [0]:
movie_data.columns

Index(['imdbId', 'Imdb Link', 'Title', 'IMDB Score', 'Genre', 'Poster'], dtype='object')

In [0]:
def process_movie_data(movie_info):
    print("Initial lenght of file: ", len(movie_info))
    movie_info.drop_duplicates(keep='first', inplace=True)
    print("lenght after droping duplicates: ", len(movie_info))
    movie_info['Genre'].replace('', np.nan, inplace=True)
    movie_info.dropna(inplace=True)
    print("lenght after removing missing Genre: ", len(movie_info))
    return movie_info

In [0]:
movie_data_process = process_movie_data(movie_data)

Initial lenght of file:  40108
lenght after droping duplicates:  39515
lenght after removing missing Genre:  38654


# Generate filenames to look for available posters:

## Sample from whole data set to create a smaller to reduce computational time:

In [0]:
movie_data_sample = movie_data_process.sample(frac=.20)
print(len(movie_data_sample))
movie_data_remaining = movie_data_process[~movie_data_process['imdbId'].isin(movie_data_sample['imdbId'])]
print(len(movie_data_remaining))

7731
30923


In [0]:
genre_df = movie_data_sample['Genre'].value_counts().reset_index()
genre_df.head(15)

Unnamed: 0,index,Genre
0,Drama,781
1,Comedy,499
2,Documentary,323
3,Comedy|Drama,312
4,Drama|Romance,275
5,Comedy|Drama|Romance,233
6,Comedy|Romance,194
7,Horror,145
8,Crime|Drama,107
9,Drama|Thriller,104


In [0]:
genre_df.sort_values('Genre', ascending=False,inplace=True)
genre_top_15 = genre_df['index'][0:20].tolist()

In [0]:
genre_top_15

['Drama',
 'Comedy',
 'Documentary',
 'Comedy|Drama',
 'Drama|Romance',
 'Comedy|Drama|Romance',
 'Comedy|Romance',
 'Horror',
 'Crime|Drama',
 'Drama|Thriller',
 'Action|Crime|Drama',
 'Horror|Thriller',
 'Crime|Drama|Thriller',
 'Thriller',
 'Crime|Drama|Mystery',
 'Horror|Mystery|Thriller',
 'Western',
 'Action|Crime|Thriller',
 'Drama|War',
 'Biography|Drama']

# Stay with top 35 Genre counts:

In [0]:
movie_data_sample_2 =  movie_data_sample.loc[movie_data_sample['Genre'].isin(genre_top_15)]
print(len(movie_data_sample_2))

3670


In [0]:
movie_data_sample_2['Genre'].value_counts().reset_index()

Unnamed: 0,index,Genre
0,Drama,781
1,Comedy,499
2,Documentary,323
3,Comedy|Drama,312
4,Drama|Romance,275
5,Comedy|Drama|Romance,233
6,Comedy|Romance,194
7,Horror,145
8,Crime|Drama,107
9,Drama|Thriller,104


## Split Movies with more than one Genre into one: 

In [0]:
label_dict = {"word2id": {}, "id2word": []}
idx = 0
genre_per_movie = movie_data_sample_2["Genre"].apply(lambda x: str(x).split("|")[:-1])
for l in [g for d in genre_per_movie for g in d]:
    if l in label_dict["id2word"]:
        pass
    else:
        label_dict["id2word"].append(l)
        label_dict["word2id"][l] = idx
        idx += 1
n_classes = len(label_dict["id2word"])

In [0]:
def genre_count(df, label_dict):
    max_genre = 0
    for label in label_dict["id2word"]:
        occurrences = len((df[df['Genre'].str.contains(label)]))
        print(label, occurrences)
        if occurrences > max_genre:
            max_genre = occurrences
    return max_genre

In [0]:
genres_split = genre_count(movie_data_sample_2, label_dict)

Comedy 1238
Crime 413
Drama 2172
Horror 298
Mystery 125
Action 161
Biography 55


### Since there is imbalance on the data set, we will take from original set the ones with less counts per Genre:

In [0]:
# IMBALANCE: OVERSAMPLING SOLUTION
movie_data_copy = movie_data_remaining[~movie_data_remaining["Genre"].str.contains("Comedy|Drama")].copy()
    
for label in label_dict["id2word"]:
    if label not in ["Drama", "Comedy"]:
        len_genre = len(movie_data_sample_2[movie_data_sample_2['Genre'].str.contains(label)])
        df_genre = movie_data_copy[movie_data_copy['Genre'].str.contains(label)]
        #df_genre['genres'] = [label+"|" for i in range (0, len(df_genre))]    
        if (genres_split - len_genre) > 0:
            if len_genre > 3000:
                param = 0
            elif len_genre > 2000:
                param = 0.3
            elif len_genre > 1000:
                param = 0.5
            else:
                param = 0.9
            df_class_over = df_genre.sample(int((genres_split-len_genre)*param)+1, replace=True)
            movie_data_sample_2 = pd.concat([movie_data_sample_2, df_class_over], axis=0)

print('Random over-sampling:')
print(genre_count(movie_data_sample_2, label_dict))

Random over-sampling:
Comedy 1238
Crime 2607
Drama 2172
Horror 2627
Mystery 2059
Action 1802
Biography 1969
2627


In [0]:
print("Data Availability: ", len(movie_data_sample_2))

Data Availability:  10459


In [0]:
def get_final_movie_data(movie_data):
    # new data frame with split value columns 
    movie_data = movie_data.copy()
    split_genres = movie_data["Genre"].str.split("|", n = 2, expand = True) 
    
    # making separate first name column from new data frame 
    movie_data["Genre_1"]= split_genres[0] 
    
    # making separate last name column from new data frame 
    movie_data["Genre_2"]= split_genres[1] 
    
    movie_data_genre_1 = movie_data[["imdbId","Genre_1"]]
    
    movie_data_genre_1.drop_duplicates(inplace=True, keep='first')
    movie_data_genre_1.rename(columns={'Genre_1':'Genre'}, inplace=True)
    
    movie_data_genre_2 = movie_data[["imdbId","Genre_2"]]
    movie_data_genre_2.drop_duplicates(inplace=True, keep='first')
    movie_data_genre_2.rename(columns={'Genre_2':'Genre'}, inplace=True)
    
    
    final_movie_data = pd.concat([movie_data_genre_2, movie_data_genre_1])
    
    print(len(final_movie_data))
    final_movie_data = final_movie_data.loc[~final_movie_data['Genre'].isin([None])]
    print(len(final_movie_data))
    
    return final_movie_data

In [0]:
final_movie_data = get_final_movie_data(movie_data_sample_2)

13278
11017


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(**kwargs)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [0]:
final_movie_data["Genre"].value_counts().reset_index()

Unnamed: 0,index,Genre
0,Drama,2070
1,Horror,1490
2,Comedy,1238
3,Crime,1116
4,Action,1012
5,Documentary,780
6,Thriller,692
7,Mystery,620
8,Romance,513
9,Biography,472


In [0]:
final_movie_data_Genre_List = final_movie_data["Genre"].value_counts()[0:6].reset_index()
final_movie_data_Genre_List

Unnamed: 0,index,Genre
0,Drama,2070
1,Horror,1490
2,Comedy,1238
3,Crime,1116
4,Action,1012
5,Documentary,780


In [0]:
final_movie_data.head()

Unnamed: 0,imdbId,Genre
10177,45607,Romance
20390,1130087,Romance
9222,172543,Romance
33543,4338434,Drama
19079,882969,Romance


In [0]:
final_movie_data = final_movie_data.loc[final_movie_data['Genre'].isin(final_movie_data_Genre_List['index'])]

In [0]:
len(final_movie_data)

7706

In [0]:
final_movie_data

Unnamed: 0,imdbId,Genre
33543,4338434,Drama
2236,118636,Drama
6082,100442,Drama
1094,116581,Drama
2607,130827,Drama
...,...,...
5387,303353,Documentary
38721,3593124,Documentary
39760,3520318,Documentary
28503,3181314,Documentary


## 1.2  Extract posters into desired path (Only run once): 

#### 1.2 Load the image filepaths and thir corresponding labels into lists:
##### We need to do some preprocessing for movies without a defined genre, droping this movies of the set. 

In [0]:
final_movie_data['imdbId'] = final_movie_data['imdbId'].astype(str)
final_movie_data['imdbId_jpg'] = final_movie_data['imdbId']+'.jpg'

# 2.0 Process the labels:

### Check that all files are images in the posters directory: 

### 2.1 Labels pre-proccesing:

#### # Match final ids with labels:

### Check that all files are images in the posters directory: 

In [0]:
import imghdr
import os
def get_images_only(movie_filepaths):
    print(len(movie_filepaths))
    for image in movie_filepaths:
        if not (imghdr.what(image) == "jpeg") | (imghdr.what(image) == "png") :
            movie_filepaths.remove(image)
    print(len(movie_filepaths))
    
    return movie_filepaths

In [0]:
import glob
def get_available_posters(movies_dir, movie_sample):
  image_dir = glob.glob(movies_dir + "*.jpg")
  df_available_posters = pd.DataFrame(image_dir, columns=['posters'])
  print(df_available_posters.head())
  split_posters = df_available_posters["posters"].str.split("/", n = 6, expand = True) 
    
  # making separate last name column from new data frame 
  df_available_posters["split_1"]= split_posters[5] 
    
  final_split = df_available_posters["split_1"].str.split(".", n = 2, expand = True) 
  df_available_posters["available_id"]= final_split[0]     
  print(df_available_posters.head())
    
  movie_sample = movie_sample.copy()
    
  print(len(movie_sample))
  movie_sample = movie_sample.loc[movie_sample['imdbId'].isin(df_available_posters["available_id"])]
  print(len(movie_sample))
    
  return movie_sample

In [0]:
final_movie_dataset = get_available_posters(movies_dir, final_movie_data)

                                             posters
0  gdrive/My Drive/ECE 542 F3/data/posters/248617...
1  gdrive/My Drive/ECE 542 F3/data/posters/354364...
2  gdrive/My Drive/ECE 542 F3/data/posters/75984.jpg
3  gdrive/My Drive/ECE 542 F3/data/posters/901206...
4  gdrive/My Drive/ECE 542 F3/data/posters/142233...
                                             posters     split_1 available_id
0  gdrive/My Drive/ECE 542 F3/data/posters/248617...  248617.jpg       248617
1  gdrive/My Drive/ECE 542 F3/data/posters/354364...  354364.jpg       354364
2  gdrive/My Drive/ECE 542 F3/data/posters/75984.jpg   75984.jpg        75984
3  gdrive/My Drive/ECE 542 F3/data/posters/901206...  901206.jpg       901206
4  gdrive/My Drive/ECE 542 F3/data/posters/142233...  142233.jpg       142233
7706
7664


In [0]:
final_filenames = [movies_dir + fname for fname in final_movie_dataset['imdbId_jpg'].tolist()]

In [0]:
final_filenames[0:5]

['gdrive/My Drive/ECE 542 F3/data/posters/4338434.jpg',
 'gdrive/My Drive/ECE 542 F3/data/posters/118636.jpg',
 'gdrive/My Drive/ECE 542 F3/data/posters/100442.jpg',
 'gdrive/My Drive/ECE 542 F3/data/posters/116581.jpg',
 'gdrive/My Drive/ECE 542 F3/data/posters/130827.jpg']

In [0]:
len(final_filenames)

7664

In [0]:
#final_filenames = get_images_only(filenames)

In [0]:
#final_filenames = get_images_only(final_filenames)

In [0]:
#from sklearn.preprocessing import LabelBinarizer
#encoder = LabelBinarizer()
#labels_cat_1hot = encoder.fit_transform(labels)

In [0]:
final_labels = final_movie_dataset['Genre'].tolist()

### 2.2 Hot Encoding of Labels: 

In [0]:
from sklearn.preprocessing import LabelEncoder
le_genre = LabelEncoder()
genre_encoded = le_genre.fit_transform(final_labels)

In [0]:
set(genre_encoded)

{0, 1, 2, 3, 4, 5}

In [0]:
print("Shape of labels: ", genre_encoded.shape)

Shape of labels:  (7664,)


# 3. Split the data into training, validation and testing: 
* Since amount of data is moderate, we will use training 70%  and cross validation to select best hyperparameters 
* Finally evaluate the performance in the testing set 30% of total data

In [0]:
def stratified_split(X, y, test_size=0.2, validate_size=0.2):

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size)

    # need to do calculate new split size. 
    # let's assume we had 100 samples and we don't do this
    # then the split will be 20 + (20% of 80) + (80% of 80). 
    # But we want 20 + 20 + 60
    new_validate_size = validate_size / (1 - test_size)
    
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=new_validate_size)
    
    y_train_values = [[x, y_train.tolist().count(x)] for x in set(y_train.tolist())]
    y_val_values = [[x, y_val.tolist().count(x)] for x in set(y_val.tolist())]
    
    print("Distribution of classes on training set: ", y_train_values)
    print("\nDistribution of classes on testing set: ", y_val_values)

    return X_train, X_test, X_val, y_train.tolist(), y_test.tolist(), y_val.tolist()

In [0]:
train_X, test_X, val_X, train_Y, test_Y, val_Y = stratified_split(final_filenames, genre_encoded)

Distribution of classes on training set:  [[0, 604], [1, 736], [2, 667], [3, 467], [4, 1233], [5, 891]]

Distribution of classes on testing set:  [[0, 201], [1, 245], [2, 223], [3, 156], [4, 411], [5, 297]]


--------------------------------------------------------------------
# 3.1 Exploring the Data:


In [0]:
print("total images in training set: ", len(train_X))

total images in training set:  4598


In [0]:
print("total images in validation set: ", len(val_X))

total images in validation set:  1533


In [0]:
print("total images in testing data: ", len(test_Y))

total images in testing data:  1533


## 3.1.1 Exploration of labels: 

In [0]:
unique_labels = set(final_labels)

In [0]:
print(unique_labels)

{'Crime', 'Action', 'Comedy', 'Drama', 'Horror', 'Documentary'}


# 4. Create tf.data.Dataset Objects:

* Creates a `Dataset` whose elements are slices of the given tensors.

* Creates a constant tensor.


In [0]:
train_data = tf.data.Dataset.from_tensor_slices((tf.constant(train_X), tf.constant(train_Y)))
val_data = tf.data.Dataset.from_tensor_slices((tf.constant(val_X), tf.constant(val_Y)))
test_data = tf.data.Dataset.from_tensor_slices((tf.constant(test_X), tf.constant(test_Y)))

## 4.1. Convert file path into image with Tensorflow function:

In [0]:
def load_preprocess(filepath, label):
    """
    1.- Read Image from file path.
    
    2.- Decode a JPEG-encoded image to a uint8 tensor, 
    The attr `channels` indicates the desired number of color channels for the decoded image.
    
    3.- Convert `image` to `dtype`, scaling its values if needed.
    Images that are represented using floating point values are expected to have
    values in the range [0,1)
    
    4.- Resize `images` to `size` using the specified `method`.
    Resized images will be distorted if their original aspect ratio is not the same as `size`.
    """
    # 1
    image = tf.io.read_file(filepath)
    # 2
    image = tf.image.decode_jpeg(image, channels = 3)
    # 3
    image_normalized = (tf.cast(image, tf.float32))
    #image = tf.image.convert_image_dtype(image_normalized, tf.float32)
    # 4
    image = tf.image.resize(image_normalized, (IMG_Width, IMG_Height))
    return image, label

## This transformation applies `map_func` to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input.

## # Set `num_parallel_calls` so multiple images are loaded/processed in parallel.


In [0]:
BATCH_SIZE = 32
IMG_Width = 182
IMG_Height = 268 

In [0]:
train_data

<TensorSliceDataset shapes: ((), ()), types: (tf.string, tf.int32)>

In [0]:
training_set = (train_data.map(load_preprocess).shuffle(buffer_size=10000).batch(BATCH_SIZE))

In [0]:
val_set = (val_data.map(load_preprocess).shuffle(buffer_size=10000).batch(BATCH_SIZE))

In [0]:
test_set = (test_data.map(load_preprocess).shuffle(buffer_size=10000).batch(BATCH_SIZE))

In [0]:
#for image, label in training_set.take(1):
#    print("Image Shape: ", image.numpy().shape)
#    print("Label: ", label.numpy())

---------------------------------------------------------------------------------------
# 5.0 Model Definition 
* Build the tf.keras.Sequential model by stacking layers:

In [0]:
Image_shape = (IMG_Width, IMG_Height, 3)

In [0]:
print(Image_shape)

(182, 268, 3)


* Layer 1: Flatten: transforms the format of the images from 28,28 into one dimensional array of 784 pixels. 

* Layer 2: Convolutional 2D layer, with filter size of (3,3) and 32 number of filters.

* Layer 3: Max Pooling: after each convolution layer to reduce the spatial size (computational Complexity) it also helps with the overfitting problem.

* Layer 4: Dense layer to interpret the features, in this case with 80 nodes.

* Layer 5: DropOut: for overfitting. 

* Layer 5: Final Layer that consists of 10 nodes which are the total possibilities of predictions of the 10 class labels, each of the nodes contains a score.  

## 5.1 Train Model & Model Validation (Hyperparameter tunning)
* Train model and run k-fold cross validation to select best hyperparameters (batch size, learning rate).

In [0]:
def model_train_validation(model, train_data, train_label, n_folds=2, batch_size=32):
    # initialize list to store scores accuracy to plot:
    hist = []
    # start k-fold cross validation from training data, with k=1. (data will be splitted into half 50% - 50%)
    kfold_cross = KFold(n_folds, shuffle=True, random_state=2019)
    #save weights between validation splits
    model.save_weights('model.h5')
    # Perform the splits for training and validation:
    for train_index, val_index in kfold_cross.split(train_data):
        train_X, train_Y, val_X, val_Y = train_data[train_index], train_label[train_index], train_data[val_index], train_label[val_index]
        
        # train_test_split(filenames, labels, train_size=0.7, random_state=42)
        # Make sure we have the same weights for each validation. 
        model.load_weights('model.h5')
        # Fit the model for each of the sets:
        history = model.fit(train_X, train_Y, epochs = 10, batch_size= batch_size,
                            validation_data = (val_X, val_Y), verbose=0)
        # Store history with accuracy and loss during training:
        hist.append(history)
    return hist

In [0]:
def base_model(DropoutRate):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3,3), input_shape=Image_shape ,activation='relu'), 
        tf.keras.layers.MaxPooling2D((2,2)),
        tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2,2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(100, activation = 'relu'),
        tf.keras.layers.Dropout(DropoutRate),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    return model

def createUniqueModel(ConvL1Width, ConvL2Width, DropoutRate, Image_shape, DenseWidth = 50):
    # Model Definition:
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(ConvL1Width, (3,3), input_shape=Image_shape ,activation='relu'), 
        tf.keras.layers.MaxPooling2D((2,2)),
        tf.keras.layers.Conv2D(ConvL2Width, (3,3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2,2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(DenseWidth, activation = 'relu'),
        tf.keras.layers.Dropout(DropoutRate),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    return model
#Model with batch norm
def createUniqueModelBN(ConvL1Width, ConvL2Width, Image_shape, DenseWidth = 50):
    # Model Definition:
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(ConvL1Width, (3,3), input_shape=Image_shape ,activation='relu'), 
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D((2,2)),
        tf.keras.layers.Conv2D(ConvL2Width, (3,3), activation='relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D((2,2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(DenseWidth, activation = 'relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    return model      

In [0]:
#uniform log learning rate
def random_lmbda(): 
    lmbda = [0.0001, 0.001, 0.01]
    lmbda = lmbda[np.random.randint(0,3)]
    return lmbda

#use integer than divide by 100 to create probability
def random_dropout(dmin, dmax): 
    dropout = np.random.randint(dmin,dmax)
    dropout = float(dropout)*.01
    return dropout

#batch size
def random_batch_size():
    batch_sizes = [16, 32, 64] #smaller batch sizes since small data set
    batch_size = batch_sizes[np.random.randint(0,3)]
    return batch_size

#ConvL1Width or L2 Width 
def random_width(wmin, wmax):
    width = np.random.randint(wmin,wmax)
    return width

def random_search(training_data, Image_shape):
    train_set = (training_data.map(load_preprocess).shuffle(buffer_size=10000))
    #train_augmented = (train_data_augment.map(load_augment).shuffle(buffer_size=10000))
    #train_combined = train_set.concatenate(train_augmented)
    
    #seperate sets
    train_x = []
    train_y = []
    
    for image, label in train_set:
        train_x.append(image.numpy())
        train_y.append(label.numpy())
    
    train_x = np.asarray(train_x)
    train_y = np.asarray(train_y)
    
    #verify proper inputs for cross validation
    print(train_x.shape,train_y.shape)
    
    for i in range(0, 3):
        lmbda = random_lmbda()
        dropout = random_dropout(20,50)
        optimizer = tf.optimizers.Adam(lmbda)
        width1 = random_width(4,16)
        width2 = random_width(16,32)
        dwidth = random_width(80, 120)
        #model =  base_model(dropout)
        model =  createUniqueModel(width1, width2, dropout, Image_shape, dwidth)
        #model = createUniqueModelBN(width1, width2, Image_shape, dwidth)
        model.compile(optimizer = optimizer, 
                      loss = 'sparse_categorical_crossentropy', 
                      metrics = ['sparse_categorical_accuracy'] )
        hist = model_train_validation(model, train_x, train_y, 2, 32)
        #print(lmbda,dropout)
        print(lmbda, dropout, width1, width2, dwidth)
        #print(lmbda,dropout,width1,width2,dwidth)
        print(hist[0].history['val_sparse_categorical_accuracy'])
        print(hist[1].history['val_sparse_categorical_accuracy'])
    return model

## 5.2 Run Random Search: 

In [0]:
base_model = random_search(train_data, Image_shape = Image_shape)

(4598, 182, 268, 3) (4598,)


InvalidArgumentError: ignored

In [0]:
acc_loss_plots_1_it(base_model.history)

## 5.3 Define Final Model: 

In [0]:
num_train = len(train_filenames)
num_epoch = 15
steps_per_epoch = round(num_train)//BATCH_SIZE
print(BATCH_SIZE)
val_steps = 12

In [0]:
# Model Definition:
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(14, (3,3), input_shape=Image_shape ,activation='relu'), 
    #tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Conv2D(19, (3,3), activation='relu'),
    #tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation = 'relu'),
    #tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.32),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Compile the model:
model.compile(optimizer=tf.keras.optimizers.Adam(lr= 0.004584439568618798), 
             loss='sparse_categorical_crossentropy',
             metrics=['sparse_categorical_accuracy'])
model.summary()

In [0]:
history = model.fit(training_set.repeat(), epochs = num_epoch, steps_per_epoch = steps_per_epoch, 
                   validation_data=val_set.repeat(), validation_steps=val_steps)

### 4.2 Plot Accuracy and validation traning loss

### 4.3 Save Model
* Save model parameters to disk. Skip if loading a previously trained model.

In [0]:
#model.save_weights('Model_Leaderbord_1_CNN_lr_0.00458_BS_20.h5')

-------------------------------------------------------------------------------------

# Transfer Learning from MobileNetV2:

In [0]:
# Pre-trained model with MobileNetV2
base_model = tf.keras.applications.MobileNetV2(
    input_shape=Image_shape,
    include_top=False,
    weights='imagenet'
)
# Freeze the pre-trained model weights
base_model.trainable = False

# Trainable classification head
maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(5, activation='softmax')

# Layer classification head with feature detector
model_MOBILE_NET = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

# Compile the model
model_MOBILE_NET.compile(optimizer=tf.keras.optimizers.Adam(lr=0.0001), 
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy']
)
print("--------------------MobileNetV2---------------------------------------------")
model_MOBILE_NET.summary()

### 4.3 Save Model 
* Save model parameters to disk. Skip if loading a previously trained model.

In [0]:
#model.save_weights('Model_Leaderbord_1_CNN_lr_0.00458_BS_20.h5')

### 4.2 Plot accuracy and model loss during training for train and validation set:

In [0]:
def acc_loss_plots_many_iterations(hist):
    for i in range(len(hist.history)):
        plt.subplot(211)
        plt.plot(hist[i].history['categorical_accuracy'], color='green', label='Training')
        plt.plot(hist[i].history['val_categorical_accuracy'], color='blue', label='Validation')
        plt.title('model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(loc='upper right')
        
        plt.subplot(212)
        plt.plot(hist[i].history['loss'], color='green', label='Training')
        plt.plot(hist[i].history['val_loss'], color='blue', label='Validation')
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(loc='upper right')
    
    plt.tight_layout()
    #plt.savefig("Model_3_CNN_lr_0.0005_BS_30.png", dpi=1200)
    plt.show()

In [0]:
def acc_loss_plots_1_it(hist):
    plt.figure(figsize=(8, 8))
    plt.subplot(211)
    plt.plot(hist.history['sparse_categorical_accuracy'], color='green', label='Training')
    plt.plot(hist.history['val_sparse_categorical_accuracy'], color='blue', label='Validation')
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(loc='upper right')
    
    plt.subplot(212)
    plt.plot(hist.history['loss'], color='green', label='Training')
    plt.plot(hist.history['val_loss'], color='blue', label='Validation')
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(loc='upper right')
    plt.tight_layout()
    
    plt.tight_layout()
    #plt.savefig("Model_3_CNN_lr_0.0005_BS_30.png", dpi=1200)
    plt.show()

In [0]:
model.load_weights('Model_Leaderbord_1_CNN_lr_0.00458_BS_20.h5')

In [0]:
#model.load_weights('Model_Leaderbord_1_CNN_lr_0.00458_BS_32.h5')

In [0]:
val_set = (val_data.map(load_preprocess).shuffle(buffer_size=10000))

#seperate sets
val_x = []
val_y = []

for image, label in val_set:
    val_x.append(image.numpy())
    val_y.append(label.numpy())
    
val_x = np.asarray(val_x)
val_y = np.asarray(val_y)
    
#verify proper inputs for cross validation
print(val_x.shape,val_y.shape)

In [0]:
# evaluate loaded model on test data
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = tf.optimizers.Adam(lr = 0.004584439568618798), 
                     metrics = ['sparse_categorical_accuracy'])
score = model.evaluate(val_x, val_y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], score[1]*100))

### 4.4 Display CNN Architecture

In [0]:
#tf.keras.utils.plot_model(model, to_file='Model_3_CNN_lr_0.0005_BS_30_relu_architecture.png')

## 5.0 Model Testing (Assesment) 
* Evaluate model on testing set: 

In [0]:
#Setup testing set
import os
test_files = os.listdir('data/Project_C2_Testing/')
test_files = ['data/Project_C2_Testing/' + f  for f in test_files]
test_files_len = len(test_files)
#just use empty labels for now since only conducting predictions
test_labels = [0]*test_files_len
#validate proper files
print(test_files[0:5])
print(len(test_labels))
print(test_files_len)
test_data = tf.data.Dataset.from_tensor_slices((tf.constant(test_files), tf.constant(test_labels)))
test_set = (test_data.map(load_preprocess).batch(BATCH_SIZE))
prediction = model.predict(test_set)
print(prediction.shape)

f = open('predictions.csv', 'w')
f2 = open('predictions_one_hot.csv', 'w')

for img in prediction: 
    highVal = 0
    label = 0 
    index = 0
    for predict in img:
        if(predict > highVal):
            highVal = predict
            label = index
        index = index + 1
        
    #print integer predictions
    f.write(str(label) + '\n')
    
    #print one hot encoding predictions
    for i in range(0,5):
        if(i == 4):
            if(i == label):
                f2.write('1\n')
            else:
                f2.write('0\n')
        else:
            if(i == label):
                f2.write('1,')
            else:
                f2.write('0,')
        
f.close()
f2.close()

In [0]:
_, acc = model.evaluate(test_set)
print("Model accuracy on testing set: {percent:.3%}".format(percent=acc))

In [0]:
predictions = model.predict(test_set)

In [0]:
predictions_reshape = np.array(predictions).reshape(180, 1, 5)