# Notebook

Build and demonstrate a data science product. Reference other scripts as needed, but be sure to include those in the same repo. 

Demonstrate your technical prowess as well as visualization and narrative storytelling. It should include all stages of your process in an easy-to-read format.

- Wrangle your data. Get it into the notebook in the best form possible for your analysis and model building.
- Explore your data. Make visualizations and conduct statistical analyses to explain what’s happening with your data, why it’s interesting, and what features you intend to take advantage of for your modeling.
- Build a modeling pipeline. Build your model as a coherent pipeline of linked stages that is efficient and easy to implement.
- Evaluate your models. You should have built multiple models, which you should thoroughly evaluate and compare via a robust analysis of residuals and failures.
- Present and thoroughly explain your product. Describe your model in detail: why you chose it, why it works, what problem it solves, and how it will run in a production-like environment. What would you need to do to maintain it going forward?

## Proposed Data Science Product

- __Problem:__
    
    Grocery stores must either manually tape a barcode to each piece of produce (which may or may not stay on) or have cashiers memorize item codes so that produce is rung up properly. Self-checkout customers have to look up each item's code in the store inventory themselves, and nothing guarantees that they will choose the correct one.

- __Solution & Value:__
    
    Grocery stores could use image detection technology to identify items being scanned. This can save cashiers time, prevent customers from selecting an incorrect code, and streamline the self-checkout process – all of which would save the grocery store money.
    
- __Data Source:__

    There is a Kaggle dataset of fruit images for classification that I will use as a starting point. I can supplement this with more fruit images, as well as other produce images (vegetables, herbs, etc.), from scraping Google Image search results.

- __Technique:__

    I will use deep-learning techniques (i.e., neural networks via Keras / TensorFlow) to classify images by produce item.

- __Production Environment Deployment:__

    The model would live on a server and receive a photo taken at the register over a web protocol.
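
As a rough illustration of the deployment idea, the server-side step can be reduced to a function that accepts a request body and returns a prediction. This is a minimal sketch, not the final service: the JSON field names (`image_b64`, `label`, `confidence`), the `handle_request` helper, the class names, and the `predict_fn` callable standing in for the trained model are all assumptions made for the example.

```python
import base64
import json

def handle_request(body, predict_fn, class_names):
    """Turn a JSON request body into a JSON prediction response.

    `predict_fn` is a hypothetical stand-in for the trained model: it takes
    raw image bytes and returns one probability per class.
    """
    payload = json.loads(body)
    image_bytes = base64.b64decode(payload["image_b64"])  # assumed field name
    probabilities = list(predict_fn(image_bytes))
    # Index of the most probable class
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return json.dumps({"label": class_names[best], "confidence": probabilities[best]})

# Stub model that always favors the second class (illustration only)
def stub_predict(image_bytes):
    return [0.1, 0.8, 0.1]

request_body = json.dumps(
    {"image_b64": base64.b64encode(b"fake image bytes").decode("ascii")}
)
response = json.loads(
    handle_request(request_body, stub_predict, ["apple", "banana", "cherry"])
)
print(response)
```

Keeping the model behind a plain callable like this would let the same handler be exercised with a stub in tests and with the real Keras model in production.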

## TBD: Web-Scraping for Vegetables

In [1]:
# from google_images_download import google_images_download
# import os

# veggie_keywords = [line.rstrip('\n') for line in open('data/vegetables.txt')]

# os.system('rm -rf data/vegetables/')

# def populate_veggie_imgs(output_dir, prefix, suffix):
#     response = google_images_download.googleimagesdownload()
#     for veggie in veggie_keywords:
#         arguments = {"keywords": veggie, "prefix_keywords": prefix, "suffix_keywords": suffix, "output_directory": output_dir, "image_directory": veggie, "color_type": "full-color", "exact_size": "100,100", "limit": 1, "print_urls":True}
#         paths = response.download(arguments)

# populate_veggie_imgs('data/vegetables/train_single', '', 'vegetable')
# populate_veggie_imgs('data/vegetables/test_single', 'vegetable', '')

## Data Importing

In [2]:
import os
import numpy as np
import skimage
from skimage import io, transform
from tqdm import tqdm

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers import LSTM, Input
from keras.models import Model
from keras.optimizers import Adam, Adadelta

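img_size = 100
# Train on the training folder and evaluate on the held-out test folder.
# NOTE: the outputs recorded in this notebook were produced with these two
# paths swapped (1875 training images, 5583 test images).
train_dir = './data/fruits/train_single/'
test_dir = './data/fruits/test_single/'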

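def get_data(folder_path, max_classes=11):
    """Load images from up to `max_classes` class folders under `folder_path`."""
    imgs = []
    labels = []
    # Filter hidden entries and sort before slicing so that train and test
    # assign the same integer label to the same class folder.
    folder_names = sorted(f for f in os.listdir(folder_path) if not f.startswith('.'))[:max_classes]
    for idx, folder_name in enumerate(folder_names):
        class_dir = os.path.join(folder_path, folder_name)
        for file_name in tqdm(os.listdir(class_dir)):
            if not file_name.startswith('.'):
                img_file = io.imread(os.path.join(class_dir, file_name))
                img_file = transform.resize(img_file, (img_size, img_size))
                imgs.append(np.asarray(img_file))
                labels.append(idx)
    imgs = np.asarray(imgs)
    labels = np.asarray(labels)
    return imgs, labels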

X_train, y_train = get_data(train_dir)
X_test, y_test = get_data(test_dir)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
  warn("The default mode, 'constant', will be changed to 'reflect' in "
100%|██████████| 166/166 [00:00<00:00, 330.13it/s]
100%|██████████| 166/166 [00:00<00:00, 350.57it/s]
100%|██████████| 246/246 [00:00<00:00, 353.88it/s]
100%|██████████| 164/164 [00:00<00:00, 354.42it/s]
100%|██████████| 164/164 [00:00<00:00, 347.77it/s]
100%|██████████| 165/165 [00:00<00:00, 360.13it/s]
100%|██████████| 143/143 [00:00<00:00, 357.87it/s]
100%|██████████| 164/164 [00:00<00:00, 353.55it/s]
100%|██████████| 166/166 [00:00<00:00, 359.55it/s]
100%|██████████| 166/166 [00:00<00:00, 350.60it/s]
100%|██████████| 166/166 [00:00<00:00, 354.38it/s]
100%|██████████| 490/490 [00:01<00:00, 355.94it/s]
100%|██████████| 490/490 [00:01<00:00, 351.10it/s]
100%|██████████| 738/738 [00:02<00:00, 349.70it/s]
100%|██████████| 492/492 [00:01<00:00, 349.60it/s]
100%|██████████| 492/492 [00:01<00:00, 348.70it/s]
100%|██████████| 492/

## Data Wrangling & Cleaning

In [3]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train:', y_train)
print('y_test:', y_test)

num_categories = len(np.unique(y_train))

new_y_train = keras.utils.to_categorical(y_train, num_categories)
new_y_test = keras.utils.to_categorical(y_test, num_categories)

X_train shape: (1875, 100, 100, 3)
X_test shape: (5583, 100, 100, 3)
y_train: [ 0  0  0 ... 10 10 10]
y_test: [ 0  0  0 ... 10 10 10]
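
For reference, the one-hot encoding that `keras.utils.to_categorical` produces can be reproduced with plain NumPy. The sketch below uses a made-up label vector (`example_labels`) and a helper name (`one_hot`) that are not part of the notebook.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (n_samples, num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

example_labels = np.array([0, 2, 1, 2])  # illustrative labels, not the real data
encoded = one_hot(example_labels, num_classes=3)
print(encoded)
```

Taking `argmax(axis=1)` of the encoded matrix recovers the original integer labels, which is also how predicted class indices are read off a softmax output.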


## Exploratory Data Analysis
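
A natural first check for this section is class balance, since the progress bars from the import step suggest that the class folders do not all hold the same number of images. A minimal sketch, assuming integer labels like `y_train`; the `example_labels` array and the `class_counts` helper are placeholders for illustration.

```python
import numpy as np

def class_counts(labels, num_classes):
    """Count how many samples fall into each integer class label."""
    return np.bincount(np.asarray(labels, dtype=int), minlength=num_classes)

example_labels = np.array([0, 0, 1, 2, 2, 2])  # placeholder for y_train
counts = class_counts(example_labels, num_classes=3)
for label, count in enumerate(counts):
    print(f"class {label}: {count} images ({count / counts.sum():.0%})")
```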

## Model Selection

### Multi-Layer Perceptron (MLP)

In [4]:
mlp_model = Sequential()

mlp_model.add(Dense(100, activation='relu', input_shape=(X_train.shape[1] * X_train.shape[2] * X_train.shape[3],)))
mlp_model.add(Dropout(0.1))
mlp_model.add(Dense(100, activation='relu'))
mlp_model.add(Dropout(0.1))
mlp_model.add(Dense(num_categories, activation='softmax'))

mlp_model.summary()
mlp_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 100)               3000100   
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 11)                1111      
Total params: 3,011,311
Trainable params: 3,011,311
Non-trainable params: 0
_________________________________________________________________
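
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A small sketch of that arithmetic (the `dense_params` helper is introduced here for illustration only):

```python
def dense_params(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs

flat_pixels = 100 * 100 * 3  # a flattened 100x100 RGB image
layer_sizes = [flat_pixels, 100, 100, 11]

# Pair each layer size with the next one to get (inputs, outputs) per Dense layer
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [3000100, 10100, 1111]
print(sum(counts))  # 3011311
```

Almost all of the weights sit in the first layer, because every pixel value connects to every hidden unit.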


### Convolutional Neural Network

In [7]:
convolutional_model = Sequential()

convolutional_model.add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], X_train.shape[3],)))
convolutional_model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
convolutional_model.add(MaxPooling2D(pool_size=(2, 2)))
convolutional_model.add(Dropout(0.25))
convolutional_model.add(Flatten()) # SWITCH THIS AND ABOVE IN ANOTHER ITERATION?
convolutional_model.add(Dense(128, activation='relu'))
convolutional_model.add(Dropout(0.5))
convolutional_model.add(Dense(num_categories, activation='softmax'))

convolutional_model.summary()
convolutional_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 98, 98, 32)        896       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 96, 96, 64)        18496     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 48, 48, 64)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 48, 48, 64)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 147456)            0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               18874496  
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
__________
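
The shapes and parameter counts listed in this summary follow from the layer definitions. A sketch of the arithmetic, assuming the Keras `Conv2D` defaults of stride 1 and no padding (`'valid'`); the helper names are illustrative.

```python
def conv_output_size(size, kernel):
    """Spatial size after a stride-1 convolution with no padding."""
    return size - kernel + 1

def conv_params(kernel, in_channels, filters):
    """One kernel*kernel*in_channels weight block plus a bias, per filter."""
    return (kernel * kernel * in_channels + 1) * filters

size = 100
size = conv_output_size(size, 3)      # first Conv2D  -> 98
first_conv = conv_params(3, 3, 32)    # 896
size = conv_output_size(size, 3)      # second Conv2D -> 96
second_conv = conv_params(3, 32, 64)  # 18496
size = size // 2                      # 2x2 max pooling -> 48
flattened = size * size * 64          # 147456
dense = flattened * 128 + 128         # 18874496
print(size, first_conv, second_conv, flattened, dense)
```

The dense layer after `Flatten` holds nearly all of the parameters in this model.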

## Model Evaluation

Why you chose a model, why it works, what problem it solves, and how it will run in a production-like environment.

### Multi-Layer Perceptron (MLP)

In [4]:
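def evaluate_model(model, new_X_train, new_X_test, batch_size, epochs):
    """Fit `model`, then print its loss and accuracy on the test set.

    Uses the one-hot labels `new_y_train` / `new_y_test` defined earlier in the notebook.
    """
    # The test set is also passed as validation data so per-epoch test metrics are logged
    history = model.fit(new_X_train, new_y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(new_X_test, new_y_test))
    score = model.evaluate(new_X_test, new_y_test, verbose=0)
    print('***Loss***', score[0])
    print('***Accuracy***', score[1])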

In [7]:
# Flatten each image into a single feature vector for the dense layers
new_X_train = X_train.reshape(X_train.shape[0], -1).astype('float32')
new_X_test = X_test.reshape(X_test.shape[0], -1).astype('float32')
evaluate_model(mlp_model, new_X_train, new_X_test, 150, 10)

Train on 1875 samples, validate on 5583 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***Loss*** 0.7863650332369643
***Accuracy*** 0.8092423428691418
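
Accuracy alone does not show which produce items get confused with each other. One way to examine failures is a confusion matrix built from true and predicted labels. A minimal NumPy sketch; the label arrays are made up, and in the notebook the predictions would come from something like `mlp_model.predict(new_X_test).argmax(axis=1)`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when (true, pred) pairs repeat
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix

y_true_example = np.array([0, 0, 1, 1, 2, 2])  # illustrative labels
y_pred_example = np.array([0, 1, 1, 1, 2, 0])  # illustrative predictions

cm = confusion_matrix(y_true_example, y_pred_example, num_classes=3)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class_recall)
```

Off-diagonal cells point at specific pairs of classes that the model mixes up, which is more actionable than a single accuracy figure.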


### Convolutional Neural Network

In [8]:
# Conv2D layers take the images in their original (height, width, channels) layout
new_X_train = X_train.astype('float32')
new_X_test = X_test.astype('float32')
evaluate_model(convolutional_model, new_X_train, new_X_test, 150, 10)

Train on 1875 samples, validate on 5583 samples
Epoch 1/10
Epoch 2/10

KeyboardInterrupt: 

## Optimizing Convolutional Network

### Adding Convolutional Layers

In [5]:
convolutional_2 = Sequential()

convolutional_2.add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], X_train.shape[3],)))
convolutional_2.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
convolutional_2.add(MaxPooling2D(pool_size=(2, 2))) # keeps 1/4 of the activations (a 3x3 pool would keep 1/9)
convolutional_2.add(Dropout(0.25))

# added Convolutional Layers
convolutional_2.add(Conv2D(128, kernel_size=(3,3), activation='relu'))
convolutional_2.add(Conv2D(256, kernel_size=(3,3), activation='relu'))
convolutional_2.add(MaxPooling2D(pool_size=(2, 2)))
convolutional_2.add(Dropout(0.25))

convolutional_2.add(Flatten())
convolutional_2.add(Dense(128, activation='relu'))
convolutional_2.add(Dropout(0.5))
convolutional_2.add(Dense(num_categories, activation='softmax'))

convolutional_2.summary()
convolutional_2.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 98, 98, 32)        896       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 96, 96, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 48, 48, 64)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 48, 48, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 46, 46, 128)       73856     
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 44, 44, 256)       295168    
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 22, 22, 256)       0         
__________

In [6]:
# Conv2D layers take the images in their original (height, width, channels) layout
new_X_train = X_train.astype('float32')
new_X_test = X_test.astype('float32')
evaluate_model(convolutional_2, new_X_train, new_X_test, 150, 10)

Train on 1875 samples, validate on 5583 samples
Epoch 1/10
 150/1875 [=>............................] - ETA: 2:02 - loss: 2.4002 - acc: 0.1133

KeyboardInterrupt: 
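
Because full training runs are slow enough to be interrupted, a quick way to compare architectures is to train on a small, class-balanced subset first. A sketch of that idea, assuming integer labels as in `y_train`; `subsample_per_class` and the toy arrays are hypothetical.

```python
import numpy as np

def subsample_per_class(X, y, per_class, seed=0):
    """Keep at most `per_class` randomly chosen samples from each class."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        indices = np.flatnonzero(y == label)
        rng.shuffle(indices)
        keep.extend(indices[:per_class])
    # Sort so the subset preserves the original sample order
    keep = np.sort(np.array(keep, dtype=int))
    return X[keep], y[keep]

X_toy = np.arange(20).reshape(10, 2)  # stand-in for image data
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
X_small, y_small = subsample_per_class(X_toy, y_toy, per_class=2)
print(X_small.shape, np.bincount(y_small))
```

The one-hot labels would need to be rebuilt from the reduced label vector before fitting, since `evaluate_model` reads them from the notebook's global variables.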

In [11]:
convolutional_3 = Sequential()

# kernel, padding, stride
convolutional_3.add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(X_train.shape[1], X_train.shape[2], X_train.shape[3],)))
# convolutional_3.add(MaxPooling2D(pool_size=(2, 2)))
convolutional_3.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
convolutional_3.add(MaxPooling2D(pool_size=(2, 2)))
convolutional_3.add(Dropout(0.25))

# TODO: Run with fewer images to test (compare with and without the below)
# If there is an improvement, then add back all the other images

# TODO: Use Kaggle kernels to run more quickly

# convolutional_3.add(Conv2D(32, kernel_size=(3,3), activation='relu'))
# convolutional_3.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
# convolutional_3.add(MaxPooling2D(pool_size=(2, 2)))
# convolutional_3.add(Dropout(0.25))

convolutional_3.add(Flatten())
convolutional_3.add(Dense(128, activation='relu'))
convolutional_3.add(Dropout(0.5))
convolutional_3.add(Dense(num_categories, activation='softmax'))

convolutional_3.summary()
convolutional_3.compile(loss='categorical_crossentropy', optimizer=Adadelta(), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_7 (Conv2D)            (None, 98, 98, 32)        896       
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 96, 96, 64)        18496     
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 48, 48, 64)        0         
_________________________________________________________________
dropout_8 (Dropout)          (None, 48, 48, 64)        0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 147456)            0         
_________________________________________________________________
dense_8 (Dense)              (None, 128)               18874496  
_________________________________________________________________
dropout_9 (Dropout)          (None, 128)               0         
__________