## <div style="text-align: center">A Comprehensive Deep Learning Workflow with Python </div>

<div style="text-align: center">This <b>tutorial</b> demonstrates the basic workflow of using <b>TensorFlow</b> for <b>Deep Learning</b>. After loading the so-called <b>MNIST</b> data-set with images of hand-written digits, we define and optimize a simple mathematical model in TensorFlow. The results are then plotted and discussed.

You should be familiar with basic [linear algebra](https://www.kaggle.com/mjbahmani/linear-algebra-in-60-minutes), [Python](https://www.kaggle.com/mjbahmani/10-steps-to-become-a-data-scientist) and the Jupyter Notebook editor. It also helps if you have a basic understanding of [Machine Learning](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python) and classification.</div>
<div style="text-align:center">last update: <b>10/09/2018</b></div>



>###### You may also be interested in: [**A Comprehensive ML Workflow for House Prices**](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-for-house-prices)


---------------------------------------------------------------------
Fork and run my kernels on **GitHub** and follow me:
> ###### [ GitHub](https://github.com/mjbahmani)
-------------------------------------------------------------------------------------------------------------
 **I hope you find this kernel helpful and some upvotes would be very much appreciated**
 
 -----------

## Notebook  Content
*   1-  [Introduction](#1)
*   2- [Deep Learning Workflow](#2)
*   3- [Problem Definition](#3)
*       3-1 [Problem feature](#4)
*       3-2 [Aim](#5)
*       3-3 [Variables](#6)
*   4- [Inputs & Outputs](#7)
*       4-1 [Inputs](#8)
*       4-2 [Outputs](#9)
*   5- [Installation](#10)
*       5-1 [Jupyter notebook](#11)
*       5-2 [Kaggle Kernel](#12)
*       5-3 [Colab notebook](#13)
*       5-4 [install python & packages](#14)
*       5-5 [Loading Packages](#15)
*   6- [Exploratory data analysis](#16)
*       6-1 [Data Collection](#17)
*       6-2 [Visualization](#18)
*       6-3 [Data Preprocessing](#30)
*       6-4 [Data Cleaning](#31)
*   7- [Model Deployment](#32)
*   8- [Conclusion](#53)
*   9- [References](#54)

 <a id="1"></a> <br>
## 1- Introduction
This is a **comprehensive deep learning workflow with Python** kernel, which took me more than two months to complete.

Most people in this community are familiar with the **MNIST dataset**, but if you need a refresher, please visit this [link](https://en.wikipedia.org/wiki/MNIST_database).

I have tried to show Kaggle users how to approach deep learning problems, and I think this kernel is a great opportunity for anyone who wants to learn the complete deep learning workflow with Python.

I am open to your feedback for improving this **kernel**.


<a id="2"></a> <br>
## 2- Deep Learning Workflow
If you have already read some [Deep Learning books](https://towardsdatascience.com/list-of-free-must-read-machine-learning-books-89576749d2ff), you have noticed that there are different ways to stream data into deep learning.

Most of these books share the following steps:
*   Define Problem
*   Specify Inputs & Outputs
*   Exploratory data analysis
*   Data Collection
*   Data Preprocessing
*   Data Cleaning
*   Visualization
*   Model Design, Training, and Offline Evaluation
*   Model Deployment, Online Evaluation, and Monitoring
*   Model Maintenance, Diagnosis, and Retraining

**You can see my workflow in the image below**:
 <img src="http://s9.picofile.com/file/8338227634/workflow.png" />



<a id="3"></a> <br>
## 3- Problem Definition
I think one of the most important things when you start a new machine learning project is defining your problem.

Problem definition has four steps, which are illustrated in the picture below:
<img src="http://s8.picofile.com/file/8338227734/ProblemDefination.png">
<a id="4"></a> <br>
### 3-1 Problem Feature
We will use the classic MNIST data set. It contains images of handwritten digits and is commonly used for training various image processing systems.
The MNIST database contains 60,000 training images and 10,000 testing images.

Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. There have been a number of scientific papers on attempts to achieve the lowest error rate.
<a id="5"></a> <br>
### 3-2 Aim
Your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.
<a id="6"></a> <br>
### 3-3 Variables
Each **pixel** column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
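
The index arithmetic above can be checked with a short sketch. The helper name `pixel_position` is hypothetical (it is not part of the competition data or any library); it only illustrates the decomposition x = i * 28 + j.

```python
# Map a flat pixel index x (0..783) to its (row, column) position
# in the 28 x 28 image, using x = i * 28 + j.
IMAGE_WIDTH = 28


def pixel_position(x):
    """Return (i, j) such that x == i * 28 + j (hypothetical helper)."""
    if not 0 <= x <= 783:
        raise ValueError("pixel index must be between 0 and 783")
    # divmod returns (quotient, remainder), i.e. (row, column)
    return divmod(x, IMAGE_WIDTH)


print(pixel_position(0))    # (0, 0)
print(pixel_position(27))   # (0, 27)
print(pixel_position(28))   # (1, 0)
print(pixel_position(783))  # (27, 27)
```

So `pixel28` is the first pixel of the second row, and `pixel783` is the bottom-right corner of the image.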


<a id="7"></a> <br>
## 4- Inputs & Outputs
<a id="8"></a> <br>
### 4-1 Inputs
The data files train.csv and test.csv contain gray-scale images of **hand-drawn digits**, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between **0 and 255**, inclusive.

The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.
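
To make this layout concrete, here is a minimal sketch that builds a tiny synthetic frame with the same column structure as train.csv and turns one row back into an image. The frame `toy` and its random pixel values are made up for illustration; they are not the real data.

```python
import numpy as np
import pandas as pd

# Build a tiny synthetic frame with the same layout as train.csv:
# one "label" column followed by pixel0 ... pixel783.
rng = np.random.RandomState(0)
pixel_columns = ['pixel{}'.format(k) for k in range(784)]
toy = pd.DataFrame(rng.randint(0, 256, size=(3, 784)), columns=pixel_columns)
toy.insert(0, 'label', [7, 2, 1])  # made-up labels

# Separate the label from the pixel values.
labels = toy['label']
pixels = toy.drop(columns=['label'])

# Reshape the 784 pixel values of the first row into a 28 x 28 image.
first_image = pixels.iloc[0].values.reshape(28, 28)

print(toy.shape)          # (3, 785)
print(first_image.shape)  # (28, 28)
```

The same reshape is what you would pass to an image-plotting function such as `plt.imshow` to look at a digit.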

<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png"></img>
<a id="9"></a> <br>
### 4-2 Outputs
Your output is the correctly identified digit for each of the tens of thousands of handwritten images in the dataset.

<a id="10"></a> <br>
## 5-Installation
#### Windows:
* Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy stack. It is also available for Linux and Mac.
* Canopy (https://www.enthought.com/products/canopy/) is available as free as well as commercial distribution with full SciPy stack for Windows, Linux and Mac.
* Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)
#### Linux
Package managers of respective Linux distributions are used to install one or more packages in SciPy stack.

For Ubuntu users:

> sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

<a id="11"></a> <br>
## 5-1 Jupyter notebook
I strongly recommend installing **Python** and **Jupyter** using the **[Anaconda Distribution](https://www.anaconda.com/download/)**, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

First, download Anaconda. We recommend downloading Anaconda’s latest Python 3 version.

Second, install the version of Anaconda which you downloaded, following the instructions on the download page.

Congratulations, you have installed Jupyter Notebook! To run the notebook, run the following command at the Terminal (Mac/Linux) or Command Prompt (Windows):

> jupyter notebook
> 

<a id="12"></a> <br>
## 5-2 Kaggle Kernel
A Kaggle kernel is an environment much like a Jupyter notebook. It is an **extension** of the notebook in which you can carry out all the functions of Jupyter notebooks, plus some added tools such as forking.

<a id="13"></a> <br>
## 5-3 Colab notebook
**Colaboratory** is a research tool for machine learning education and research. It’s a Jupyter notebook environment that requires no setup to use.
### 5-3-1 What browsers are supported?
Colaboratory works with most major browsers, and is most thoroughly tested with desktop versions of Chrome and Firefox.
### 5-3-2 Is it free to use?
Yes. Colaboratory is a research project that is free to use.
### 5-3-3 What is the difference between Jupyter and Colaboratory?
Jupyter is the open source project on which Colaboratory is based. Colaboratory allows you to use and share Jupyter notebooks with others without having to download, install, or run anything on your own computer other than a browser.

<a id="15"></a> <br>
## 5-5 Loading Packages
In this kernel we are using the following packages:

 <img src="http://s8.picofile.com/file/8338227868/packages.png">
 Now we import all of them 

In [None]:
# packages to load
# Check the versions of libraries
import warnings
warnings.filterwarnings('ignore')
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy as np # linear algebra
print('numpy: {}'.format(np.__version__))
# matplotlib
import matplotlib
import matplotlib.pyplot as plt
print('matplotlib: {}'.format(matplotlib.__version__))
%matplotlib inline
# pandas
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
print('pandas: {}'.format(pd.__version__))
# seaborn
import seaborn as sns
print('seaborn: {}'.format(sns.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# tensorflow and keras
import tensorflow as tf
print('tensorflow: {}'.format(tf.__version__))
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
# standard library helpers
import itertools
import os
# fix the random seed and set the plotting style
np.random.seed(2)
sns.set(style='white', context='notebook', palette='deep')



<a id="16"></a> <br>
## 6- Exploratory Data Analysis(EDA)
 In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. 
 
* Which variables suggest interesting relationships?
* Which observations are unusual?

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

*   6-1 Data Collection
*   6-2 Visualization
*   6-3 Data Preprocessing
*   6-4 Data Cleaning
<img src="http://s9.picofile.com/file/8338476134/EDA.png">

<a id="17"></a> <br>
## 6-1 Data Collection
**Data collection** is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypothesis and evaluate outcomes of the particular collection.[techopedia]



In [None]:
# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

**<< Note 1 >>**

* Each row is an observation (also known as : sample, example, instance, record)
* Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)

After loading the data via **pandas**, we should check its type, content, and description with the following:

In [None]:
# a notebook cell only displays its last expression, so print both types
print(type(train))
print(type(test))

## 6-1-1 Statistical Summary
1- Dimensions of the dataset.

2- Peek at the data itself.

3- Statistical summary of all attributes.

4- Breakdown of the data by the class variable.[7]

Don’t worry, each look at the data is **one command**. These are useful commands that you can use again and again on future projects.

In [None]:
# shape
print(train.shape)

In [None]:
# shape
print(test.shape)

In [None]:
#columns*rows
train.size

In [None]:
#columns*rows
test.size


We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

You should see 42000 instances and 785 attributes for train.csv
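
As a quick illustration of how `shape` and `size` relate, here is a sketch on a small made-up frame (the frame `toy` and its values are assumptions for illustration, not the competition data):

```python
import pandas as pd

# A made-up frame with 3 rows and 3 columns.
toy = pd.DataFrame({
    'label': [0, 1, 2],
    'pixel0': [0, 0, 0],
    'pixel1': [255, 128, 0],
})

n_rows, n_cols = toy.shape  # (instances, attributes)
print(n_rows, n_cols)       # 3 3
print(toy.size)             # 9, i.e. rows * columns
```

By the same rule, a frame with 42000 rows and 785 columns has a `size` of 42000 * 785 = 32,970,000 values.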

To get some information about the dataset, you can use the **info()** command:

In [None]:
# info() prints its report itself and returns None, so no print() is needed
train.info()

To check the first 5 rows of the data set, we can use head(5):

In [None]:
train.head(5) 

To check the last 5 rows of the data set, we use the tail() function:

In [None]:
train.tail() 

To draw 5 random rows from the data set, we can use the **sample(5)** function:

In [None]:
train.sample(5) 

To get a statistical summary of the dataset, we can use **describe()**:

In [None]:
train.describe() 

##  Data preparation

In [None]:
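# separate the target column from the pixel features
Y_train = train["label"]

# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1) 

# free some space
del train 

# plot how many samples each digit has
g = sns.countplot(x=Y_train)

# the same breakdown as numbers
Y_train.value_counts()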
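
The packages loaded earlier include `to_categorical`, which is commented there as converting labels to a one-hot encoding. As a rough illustration of what such an encoding looks like, here is a NumPy-only sketch on made-up labels. It does not call Keras, and it is an assumption about the idea, not necessarily how later chapters of this kernel will encode `Y_train`.

```python
import numpy as np

# Made-up digit labels standing in for a few entries of Y_train.
labels = np.array([3, 0, 9])

# Each row of the 10 x 10 identity matrix is the one-hot vector of one digit,
# so indexing it with the labels yields one vector per label.
one_hot = np.eye(10, dtype=int)[labels]

print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # [0 0 0 1 0 0 0 0 0 0]
```

Each row contains a single 1 at the position of its digit, which is the form a 10-class classifier's output layer is usually compared against.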

<a id="30"></a> <br>
## 6-3 Data Preprocessing
**Data preprocessing** refers to the transformations applied to our data before feeding it to the algorithm.
 
Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not suitable for analysis.
There are plenty of steps for data preprocessing; we list just some of them:
* Removing the target column (id)
* Sampling (without replacement)
* Making part of the dataset unbalanced and balancing it (with undersampling and SMOTE)
* Introducing missing values and treating them (replacing by average values)
* Noise filtering
* Data discretization
* Normalization and standardization
* PCA analysis
* Feature selection (filter, embedded, wrapper)
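
Of the steps listed above, normalization and standardization are easy to sketch for pixel data. The example below uses a small made-up array in place of the real pixel columns; dividing by 255 assumes the 0 to 255 value range described in the Inputs section.

```python
import numpy as np

# Made-up pixel values in the 0..255 range (not the real data).
pixels = np.array([[0, 128, 255],
                   [64, 32, 16]], dtype=np.float32)

# Normalization: rescale the values from [0, 255] to [0, 1].
normalized = pixels / 255.0

# Standardization: shift and scale to zero mean and unit variance,
# computed here over all values together.
standardized = (pixels - pixels.mean()) / pixels.std()

print(normalized.min(), normalized.max())  # 0.0 1.0
```

Which of the two is appropriate depends on the model; this sketch only shows the arithmetic.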

# New Chapter Coming Soon

 

<a id="54"></a> <br>

-----------

# 9- References
* [3] [https://skymind.ai/wiki/machine-learning-workflow](https://skymind.ai/wiki/machine-learning-workflow)
* [4] [keras](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6)
* [5] [Problem-define](https://machinelearningmastery.com/machine-learning-in-python-step-by-step/)
* [6] [Sklearn](http://scikit-learn.org/)
* [7] [machine-learning-in-python-step-by-step](https://machinelearningmastery.com/machine-learning-in-python-step-by-step/)
* [8] [Data Cleaning](http://wp.sigmod.org/?p=2288)
* [9] [Kaggle kernel that I use it](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6)



-------------
