# Notebook

Build and demonstrate a data science product. Reference other scripts as needed, but be sure to include those in the same repo. 

Demonstrate your technical prowess as well as visualization and narrative storytelling. It should include all stages of your process in an easy-to-read format.

- Wrangle your data. Get it into the notebook in the best form possible for your analysis and model building.
- Explore your data. Make visualizations and conduct statistical analyses to explain what’s happening with your data, why it’s interesting, and what features you intend to take advantage of for your modeling.
- Build a modeling pipeline. Build your model as a coherent pipeline of linked stages that is efficient and easy to implement.
- Evaluate your models. You should have built multiple models, which you should thoroughly evaluate and compare via a robust analysis of residuals and failures.
- Present and thoroughly explain your product. Describe your model in detail: why you chose it, why it works, what problem it solves, and how it will run in a production-like environment. What would you need to do to maintain it going forward?

## Proposed Data Science Product

- __Problem:__
    
    Grocery stores must either tape a barcode to each produce item (which may not stay on) or have their cashiers memorize item codes so that produce is rung up properly. Self-checkout customers have to look up each item's code in the store inventory, and nothing guarantees that they will choose the correct one.

- __Solution & Value:__
    
    Grocery stores could use image detection technology to identify items being scanned. This can save cashiers time, prevent customers from selecting an incorrect code, and streamline the self-checkout process – all of which would save the grocery store money.
    
- __Data Source:__

    I will use a Kaggle dataset of fruit images for classification as a starting point. I can supplement it with more fruit images, as well as images of other produce (vegetables, herbs, etc.), by scraping Google Image search results.

- __Technique:__

    I will use deep-learning techniques (i.e., neural networks via Keras / TensorFlow) to classify each image as a particular produce item.

- __Production Environment Deployment:__

    The model would live on a server and receive a photo taken at the register via a web protocol (for example, an HTTP request).
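
To make the deployment idea concrete, here is a minimal sketch of such a server using only Python's standard library. It assumes the photo arrives as the raw body of an HTTP POST request. `classify_image` is a hypothetical placeholder for the trained model's prediction step, and the port number and JSON response shape are illustrative choices, not part of the plan above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify_image(image_bytes):
    """Hypothetical stand-in for the trained model.

    A real implementation would decode the bytes, resize the image to the
    model's input size, and return the predicted produce label.
    """
    return {"label": "unknown", "num_bytes": len(image_bytes)}


class ProduceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the photo from the request body.
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)

        # Run the (placeholder) classifier and reply with JSON.
        body = json.dumps(classify_image(image_bytes)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging for this sketch.
        pass


def make_server(port=8000):
    """Create (but do not start) the HTTP server on localhost."""
    return HTTPServer(("127.0.0.1", port), ProduceHandler)
```

Calling `make_server().serve_forever()` would start the sketch; a register would then POST each photo to it and read the label from the JSON reply.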

## Data Importing

In [1]:
import os
import numpy as np
from skimage import io, transform
from tqdm import tqdm

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.optimizers import Adam

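img_size = 100  # images are resized to img_size x img_size pixels
# NOTE: these paths look swapped relative to the variable names: train_dir points
# at the test folder and test_dir at the training folder. The sample counts
# printed further down (12709 train, 37836 test) reflect the paths as written.
train_dir = './data/fruits/test_single/'
test_dir =  './data/fruits/train_single/'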

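def get_data(folder_path):
    """Load every image under folder_path/<class_name>/ and return (images, labels)."""
    imgs = []
    labels = []
    # Sort the class folders (skipping hidden entries) so that a class gets the
    # same integer label in the training and test directories.
    class_names = sorted(name for name in os.listdir(folder_path) if not name.startswith('.'))
    for idx, class_name in enumerate(class_names):
        class_dir = os.path.join(folder_path, class_name)
        for file_name in tqdm(os.listdir(class_dir)):
            if file_name.startswith('.'):
                continue
            img_file = io.imread(os.path.join(class_dir, file_name))
            # resize() also converts pixel values to floats in [0, 1]
            img_file = transform.resize(img_file, (img_size, img_size))
            imgs.append(img_file)
            labels.append(idx)
    return np.asarray(imgs), np.asarray(labels)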

X_train, y_train = get_data(train_dir)
X_test, y_test = get_data(test_dir)

Using TensorFlow backend.
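
Because `get_data` derives each label from a folder's position in a directory listing, the training and test labels only line up if both directories list the same classes in the same order. One way to make that explicit is to build a single name-to-index mapping up front and share it across both splits. The sketch below is an illustration under that assumption: `build_class_index` is a hypothetical helper, and the folder names are made up.

```python
import os
import tempfile


def build_class_index(*folders):
    """Map every class (sub-folder) name found in any folder to a stable integer."""
    names = set()
    for folder in folders:
        for name in os.listdir(folder):
            # Keep only visible sub-directories; each one is a class.
            if not name.startswith('.') and os.path.isdir(os.path.join(folder, name)):
                names.add(name)
    return {name: idx for idx, name in enumerate(sorted(names))}


# Tiny throwaway directory tree standing in for the real data folders.
with tempfile.TemporaryDirectory() as root:
    for split, classes in [('train', ['Banana', 'Apple']), ('test', ['Apple', 'Cherry'])]:
        for cls in classes:
            os.makedirs(os.path.join(root, split, cls))
    class_index = build_class_index(os.path.join(root, 'train'), os.path.join(root, 'test'))

print(class_index)
```

With a shared mapping like this, a class that is missing from one split still keeps the same index in the other.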

## Data Wrangling & Cleaning

In [2]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train:', y_train)
print('y_test:', y_test)

num_categories = len(np.unique(y_train))

new_y_train = keras.utils.to_categorical(y_train, num_categories)
new_y_test = keras.utils.to_categorical(y_test, num_categories)

X_train shape: (12709, 100, 100, 3)
X_test shape: (37836, 100, 100, 3)
y_train: [ 0  0  0 ... 74 74 74]
y_test: [ 0  0  0 ... 74 74 74]
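
`keras.utils.to_categorical` turns each integer label into a one-hot row, which is the target format that `categorical_crossentropy` expects. As an illustration, the same transformation can be sketched in plain NumPy. `one_hot` is a hypothetical helper written for this example, not a Keras function.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float array with a 1.0 at each label's index."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


example = one_hot([0, 2, 1], num_classes=3)
print(example)
```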


## Exploratory Data Analysis
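
A natural first check for this section is class balance, since heavily skewed classes would make raw accuracy misleading. The sketch below counts samples per class from a label array such as `y_train`. The `labels` array here is made-up illustrative data, not output from the fruit dataset, and `class_counts` is a hypothetical helper.

```python
import numpy as np


def class_counts(labels):
    """Map each class index to the number of samples carrying that label."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), counts.tolist()))


# Made-up labels for illustration; in the notebook this would be y_train.
labels = np.array([0, 0, 0, 1, 1, 2])
counts = class_counts(labels)
print(counts)
print('largest / smallest class ratio:', max(counts.values()) / min(counts.values()))
```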

## Model Selection

### Multi-Layer Perceptron (MLP)

In [4]:
mlp_model = Sequential()

mlp_model.add(Dense(100, activation='relu', input_shape=(X_train.shape[1] * X_train.shape[2] * X_train.shape[3],)))
mlp_model.add(Dropout(0.1))
mlp_model.add(Dense(100, activation='relu'))
mlp_model.add(Dropout(0.1))
mlp_model.add(Dense(num_categories, activation='softmax'))

mlp_model.summary()
mlp_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 100)               3000100   
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 75)                7575      
Total params: 3,017,775
Trainable params: 3,017,775
Non-trainable params: 0
_________________________________________________________________
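
The parameter counts in the summary can be checked by hand: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus one bias per unit. A short sketch of that arithmetic follows; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


input_size = 100 * 100 * 3  # flattened 100x100 RGB image
layer_params = [
    dense_params(input_size, 100),  # dense_1
    dense_params(100, 100),         # dense_2
    dense_params(100, 75),          # dense_3 (75 classes)
]
print(layer_params, sum(layer_params))  # [3000100, 10100, 7575] 3017775
```

Almost all of the parameters sit in the first layer, because every one of the 30,000 input values connects to every hidden unit.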


### Convolutional Neural Network

In [5]:
convolutional_model = Sequential()

# 3x3 kernels; input_shape is (height, width, channels) = (100, 100, 3)
convolutional_model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=X_train.shape[1:]))
convolutional_model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
convolutional_model.add(MaxPooling2D(pool_size=(2, 2)))
convolutional_model.add(Dropout(0.25))
convolutional_model.add(Flatten())
convolutional_model.add(Dense(128, activation='relu'))
convolutional_model.add(Dropout(0.5))
convolutional_model.add(Dense(num_categories, activation='softmax'))

convolutional_model.summary()
convolutional_model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 98, 98, 32)        896       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 96, 96, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 48, 48, 64)        0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 48, 48, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 147456)            0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               18874496  
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
__________
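
The shapes and parameter counts listed in the summary follow from two rules: an unpadded, stride-1 convolution shrinks each side by `kernel - 1`, and each filter has `kernel_h * kernel_w * in_channels` weights plus one bias. A sketch of that arithmetic, with hypothetical helper names:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def valid_conv_size(size, kernel):
    """Output width/height of an unpadded ('valid'), stride-1 convolution."""
    return size - kernel + 1


side = valid_conv_size(100, 3)      # conv2d_1 -> 98
side = valid_conv_size(side, 3)     # conv2d_2 -> 96
side = side // 2                    # 2x2 max pooling -> 48
flat = side * side * 64             # flatten_1 -> 147456

print(conv2d_params(3, 3, 3, 32))   # conv2d_1 -> 896
print(conv2d_params(3, 3, 32, 64))  # conv2d_2 -> 18496
print(flat * 128 + 128)             # dense_4 -> 18874496
```

As with the MLP, the first dense layer after flattening holds the vast majority of the parameters.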

## Model Evaluation

Why you chose a model, why it works, what problem it solves, and how it will run in a production-like environment.

### Multi-Layer Perceptron (MLP)

In [6]:
def evaluate_model(model, new_X_train, new_X_test, batch_size, epochs):
    """Train `model`, then print its loss and accuracy on the held-out set.

    Uses the one-hot labels `new_y_train` / `new_y_test` defined above.
    """
    model.fit(new_X_train, new_y_train, batch_size=batch_size, epochs=epochs,
              verbose=1, validation_data=(new_X_test, new_y_test))
    score = model.evaluate(new_X_test, new_y_test, verbose=0)
    print('***Loss***', score[0])
    print('***Accuracy***', score[1])

In [7]:
# Flatten each image into a single vector for the dense layers
new_X_train = X_train.reshape(X_train.shape[0], -1).astype('float32')
new_X_test = X_test.reshape(X_test.shape[0], -1).astype('float32')
evaluate_model(mlp_model, new_X_train, new_X_test, 150, 10)

Train on 12709 samples, validate on 37836 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***Loss*** 3.641265244347334
***Accuracy*** 0.05399619410085633


### Convolutional Neural Network

In [8]:
# The CNN takes images in their original (height, width, channels) layout
new_X_train = X_train.astype('float32')
new_X_test = X_test.astype('float32')
evaluate_model(convolutional_model, new_X_train, new_X_test, 150, 10)

Train on 12709 samples, validate on 37836 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
***Loss*** 1.4427357740878168
***Accuracy*** 0.7760862670480443
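
Overall accuracy does not show which classes the model confuses. A confusion matrix and per-class accuracy are one way to look at failures in more detail. The sketch below uses made-up `y_true` and `y_pred` arrays and hypothetical helper names. In the notebook, predicted labels could be obtained with something like `np.argmax(model.predict(new_X_test), axis=1)` and compared against `y_test`.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly when the same (true, pred) pair repeats.
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    totals = matrix.sum(axis=1)
    return np.diag(matrix) / np.maximum(totals, 1)


# Made-up labels for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(per_class_accuracy(cm))
```

Off-diagonal cells with large counts point to pairs of produce items that the model tends to mix up.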
