# Machine Learning Engineer Nanodegree
## Capstone Project - Street View House Numbers
Thomas Wieczorek
January 21st, 2017

## I. Definition
_(approx. 1-2 pages)_

### <font color='blue'>  Project Overview
<font color='blue'> 
In this section, look to provide a high-level overview of the project in layman’s terms. Questions to ask yourself when writing this section:
<ol>
<li>_Has an overview of the project been provided, such as the problem domain, project origin, and related datasets or input data?_
<li>_Has enough background information been given so that an uninformed reader would understand the problem domain and following problem statement?_
</ol>
</font> 

This capstone project aims to correctly classify house numbers extracted from Street View images. A rule-based approach ("look for a certain color or pattern...") is not very promising, because house numbers vary widely in appearance: they come in all colors and fonts, and sometimes the digits are even stacked one below the other.

![image alt >](res/Street_View.png)
<center>_1. Example of Google Street View in Berlin [Source](https://maps.google.com/help/maps/streetview/index.html?hl=de)_</center>
![image alt >](res/Street_View_close.png)
<center>_2. Close up of House Number [Source](https://maps.google.com/help/maps/streetview/index.html?hl=de)_</center>

The dataset was collected by Stanford University and is described as follows: _SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images._ (from http://ufldl.stanford.edu/housenumbers/)


Machine Learning is well suited to this Computer Vision problem. Deep Learning in particular has been very successful at solving problems like this one in recent years.



### <font color='blue'> Problem Statement
<font color='blue'> In this section, you will want to clearly define the problem that you are trying to solve, including the strategy (outline of tasks) you will use to achieve the desired solution. You should also thoroughly discuss what the intended solution will be for this problem. Questions to ask yourself when writing this section:
<ol>
<li> _Is the problem statement clearly defined? Will the reader understand what you are expecting to solve?_
<li> _Have you thoroughly discussed how you will attempt to solve the problem?_
<li> _Is an anticipated solution clearly defined? Will the reader understand what results you are looking for?_
</ol>
</font> 

The task is to identify house numbers in images. The SVHN images have been resized to a fixed resolution of 32-by-32 pixels and are stored in the .mat file format. In addition to the images, the correct label of every image is provided.

If a house number contains more than one digit (for example 123), the image is cropped around a single digit. Several digits may be visible in the 32x32 image, but the wanted digit is always centered, so the classifier only has to classify one digit.

These images illustrate several examples:

![image alt >](res/Classification_Examples.png)
<center>_3. Example of correct predictions._</center>

![image alt >](res/Example_Cropping.png)
<center>_4. Example of Cropping a 4-digit house number._</center>

### <font color='blue'> Metrics
<font color='blue'> In this section, you will need to clearly define the metrics or calculations you will use to measure performance of a model or result in your project. These calculations and metrics should be justified based on the characteristics of the problem and problem domain. Questions to ask yourself when writing this section:
<ol>
<li> _Are the metrics you’ve chosen to measure the performance of your models clearly discussed and defined?_
<li> _Have you provided reasonable justification for the metrics chosen based on the problem and solution?_
</ol>
</font> 
<br><br>

The metrics used for this project were **Accuracy** and the **Confusion Matrix**. The Formula for the **Accuracy** is:

\begin{align}
\text{Accuracy} & = \frac{tp+tn}{tp+tn+fp+fn} \\\\
tp & : \text{True Positive} \\
tn & : \text{True Negative} \\
fp & : \text{False Positive} \\
fn & : \text{False Negative}
\end{align}
<br><br>
For my research, I plotted the accuracy on the **Training Data** as well as on the **Validation Data** in order to detect [Overfitting](https://www.wikiwand.com/en/Overfitting). Overfitting describes the effect that the system learns too much and "memorizes" the training data. As a result, the system does not generalize well and performs poorly on images it has not seen before.

![image alt >](res/Accuracy.png)
<center>_4. Accuracy of Train-Data and Validation-Data._</center>
<br>
Overfitting can be detected by comparing the accuracy on the training set and the validation set. If the validation accuracy declines while the training accuracy is still increasing, this is a strong sign of overfitting. The best model is usually the one at which the validation accuracy peaks. See this figure:

![image alt >](res/Overfitting.png)
<center>_5. Overfitting explanation [Source](https://medium.com/autonomous-agents/part-2-error-analysis-the-wild-west-algorithms-to-improve-neuralnetwork-accuracy-6121569e66a5)_</center>
<br><br>

For classification tasks like this, the accuracy or the error rate (which is _1 - accuracy_) are very common metrics; see this [overview of the best classification techniques for SVHN-Dataset](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#5356484e).
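For a ten-class problem such as this one, the accuracy formula reduces to the share of correctly classified images. A minimal sketch in plain Python; the labels below are made up purely for illustration and are not taken from the SVHN data:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative labels only (6 of the 8 predictions are correct).
y_true = [1, 5, 7, 8, 3, 6, 7, 0]
y_pred = [1, 5, 7, 3, 3, 6, 1, 0]

acc = accuracy(y_true, y_pred)
error_rate = 1 - acc
print("accuracy = {:.3f}, error rate = {:.3f}".format(acc, error_rate))
```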

## II. Analysis
_(approx. 2-4 pages)_



### <font color='blue'> Data Exploration
<font color='blue'> In this section, you will be expected to analyze the data you are using for the problem. This data can either be in the form of a dataset (or datasets), input data (or input files), or even an environment. The type of data should be thoroughly described and, if possible, have basic statistics and information presented (such as discussion of input features or defining characteristics about the input or environment). Any abnormalities or interesting qualities about the data that may need to be addressed have been identified (such as features that need to be transformed or the possibility of outliers). Questions to ask yourself when writing this section:
<ol>
<li>_If a dataset is present for this problem, have you thoroughly discussed certain features about the dataset? Has a data sample been provided to the reader?_
<li> _If a dataset is present for this problem, are statistics about the dataset calculated and reported? Have any relevant results from this calculation been discussed?_
<li> _If a dataset is **not** present for this problem, has discussion been made about the input space or input data for your problem?_
<li> _Are there any abnormalities or characteristics about the input space or dataset that need to be addressed? (categorical variables, missing values, outliers, etc.)_
</ol>
</font> 

In the following images, 8 random examples of every digit are presented:
![image alt >](res/Examples_Housenumbers/0.png)
![image alt >](res/Examples_Housenumbers/1.png)
![image alt >](res/Examples_Housenumbers/2.png)
![image alt >](res/Examples_Housenumbers/3.png)
![image alt >](res/Examples_Housenumbers/4.png)
![image alt >](res/Examples_Housenumbers/5.png)
![image alt >](res/Examples_Housenumbers/6.png)
![image alt >](res/Examples_Housenumbers/7.png)
![image alt >](res/Examples_Housenumbers/8.png)
![image alt >](res/Examples_Housenumbers/9.png)
<center>_Examples of the house numbers_</center>

The images of the datasets have the shape (``<width>, <height>, <channels>, <batchsize>``), for example (32, 32, 3, 73257).

For easier usage in TensorFlow, I changed the shape to (``<batchsize>, <width>, <height>, <channels>``), for example (73257, 32, 32, 3). In the rest of this document, the images are abbreviated as **X**.

The labels of the datasets have a one-dimensional shape, for example [1,5,7,8,3,6,7...], where every entry is the correct digit for the image at the same index.
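The reordering of the axes can be sketched with NumPy as follows. The arrays here are small synthetic stand-ins (5 images instead of 73257); the real arrays would be read from the .mat files, for example with `scipy.io.loadmat`. That the labels arrive as a column of shape `(N, 1)` is an assumption about the file layout.

```python
import numpy as np

# Synthetic stand-ins for the arrays stored in the .mat files.
X_raw = np.zeros((32, 32, 3, 5), dtype=np.uint8)             # (width, height, channels, batchsize)
y_raw = np.array([[1], [5], [7], [8], [3]], dtype=np.uint8)  # assumed (N, 1) column

# Move the batch axis (index 3) to the front.
X = np.transpose(X_raw, (3, 0, 1, 2))
# Flatten the labels to a one-dimensional array.
y = y_raw.ravel()

print(X.shape, y.shape)
```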


<center>

| **Dataset** | **Size (``<batchsize>, <width>, <height>, <channel>``)**  |
|------|------|
|  X_train   | <center>(73257, 32, 32, 3) </center>| 
|  y_train   | <center>(73257,) </center>| 
|  X_test   | <center>(26032, 32, 32, 3) </center>| 
|  y_test  | <center>(26032,) </center>| 

</center>
Number of training examples =	 73257 <br>
Number of testing examples =	 26032 <br>
Image data shape =		         (32, 32, 3) <br>
Number of classes =		         10 

No abnormalities were found that needed to be corrected. However, some images are very blurry, and even humans find them hard to identify.
![image alt >](res/Blurry.png)
<center>_Example of Blurry Image_</center>


### <font color='blue'>Exploratory Visualization
<font color='blue'>In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
<ol>
<li> _Have you visualized a relevant characteristic or feature about the dataset or input data?_
<li> _Is the visualization thoroughly analyzed and discussed?_
<li> _If a plot is provided, are the axes, title, and datum clearly defined?_
</ol>
</font>


There are 10 classes, one for each digit from 0 to 9. Originally, the zeros were labeled as '10'. To avoid confusion, I changed this label to '0'.


![image alt >](res/Labels_Histogram.png)
<center>_Histogram of the labels_</center>
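The relabeling of the zeros and the counts behind such a histogram can be sketched with NumPy. The label array below is illustrative only, and the variable names are my own:

```python
import numpy as np

# Illustrative labels only; in the original files the digit zero is stored as 10.
y_raw = np.array([1, 10, 2, 1, 10, 9, 3, 1])

# Map the label 10 to 0 and leave every other label unchanged.
y = np.where(y_raw == 10, 0, y_raw)

# Count how often each of the ten classes occurs.
counts = np.bincount(y, minlength=10)
for digit, count in enumerate(counts):
    print(digit, count)
```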


The classes are not distributed equally: the digit 1 is the most common, and the frequency decreases as the digit increases. The digit 0 (originally labelled as 10) is the least common.

Interestingly, the distribution seems to follow [Benford's Law](https://en.wikipedia.org/wiki/Benford's_law). Benford's Law, _"also called the **first-digit law**, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small."_

![image alt >](res/Benfords_Law.png)
<center>_Benford's Law for comparison [Source](https://en.wikipedia.org/wiki/Benford's_law)_</center>
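For reference, Benford's Law assigns the leading digit d (from 1 to 9) the probability log10(1 + 1/d). The short sketch below prints these values. Note that the law describes leading digits only and has no entry for 0, while the dataset contains digits from every position of a house number, so the comparison is only qualitative.

```python
import math


def benford_probability(d):
    """Probability of d being the leading digit according to Benford's Law."""
    if not 1 <= d <= 9:
        raise ValueError("Benford's Law is defined for leading digits 1 to 9")
    return math.log10(1 + 1 / d)


benford = {d: benford_probability(d) for d in range(1, 10)}
for d, p in benford.items():
    print(d, round(p, 3))
```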

### <font color='blue'>Algorithms and Techniques
<font color='blue'>In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:
<ol>
<li> _Are the algorithms you will use, including any default variables/parameters in the project clearly defined?_
<li> _Are the techniques to be used thoroughly discussed and justified?_
<li> _Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?_
</ol>
</font>

To solve the classification problem I will use **Deep Learning techniques**, more precisely **Convolutional Neural Networks (CNNs)**. CNNs are very common in image recognition, which is why they are used in this project. CNNs are built from layers of different types. In our architecture, three layer types are used:
- Convolutions
- Subsampling
- Fully connected Layer

Example of Convolutions:

![image alt >](res/Conv.jpg)
<center>_Example of Convolutions [Source](https://developer.apple.com/library/content/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html)_</center>

Example of pooling (here, max pooling):

![image alt >](res/pool.png)
<center>_Example of Max Pooling [Source](https://developer.apple.com/library/content/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html)_</center>
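To make the pooling step concrete, here is a small NumPy sketch of 2x2 max pooling with stride 2 on a single feature map. The function name and the input values are my own and purely illustrative:

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D array with even side lengths."""
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split the map into non-overlapping 2x2 blocks and take the maximum of each.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each output value is the largest value of one 2x2 block, so the width and height are halved while the strongest activations are kept.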

Finally, the high-level reasoning in the neural network is done via fully connected layers. 

<br><br>

As architecture I will use [LeNet](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf), which achieves good results on computer vision classification problems.

![image alt >](res/lenet.png)
<center>_LeNet Architecture [Source](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)_</center>


As optimizer, I will use the [AdamOptimizer](https://www.tensorflow.org/api_docs/python/train/optimizers#AdamOptimizer). The benefit of the AdamOptimizer is that it adapts the step size of every parameter itself, based on running estimates of the first and second moments of the gradients, which helps with both speed and finding an optimum ([see paper](https://arxiv.org/pdf/1412.6980v8.pdf)).
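The update rule from the Adam paper can be sketched in a few lines of plain Python for a single parameter. This is only an illustration of the idea, not the TensorFlow implementation; the function name, the toy objective and the larger learning rate of 0.1 are my own choices.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a one-dimensional function given its gradient, using Adam."""
    m = 0.0  # running estimate of the first moment (mean of the gradients)
    v = 0.0  # running estimate of the second moment (mean of the squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction, because m and v start at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); the minimum is at x = 3.
theta = adam_minimize(lambda x: 2 * (x - 3), theta=0.0, lr=0.1, steps=500)
print(round(theta, 2))
```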

To reduce overfitting, I will use [Dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf). The idea behind Dropout is interesting: if some neurons are randomly deactivated during training, the other neurons have to compensate for the missing ones. This results in a system that is more robust, because it cannot rely on any single neuron.
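A minimal NumPy sketch of this idea is shown below. It uses "inverted" dropout, in which the remaining activations are scaled by `1 / keep_prob` so that the expected activation stays the same; `tf.nn.dropout` in TensorFlow 1.x rescales in the same way. The function and the input array are illustrative only.

```python
import numpy as np


def dropout(activations, keep_prob, rng):
    """Randomly zero activations and rescale the rest by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((1000, 120))  # stand-in for the output of a hidden layer
dropped = dropout(activations, keep_prob=0.5, rng=rng)

print("share of active units:", (dropped > 0).mean())
print("mean activation:", dropped.mean())
```

During evaluation, `keep_prob` is set to 1.0, so that all neurons are used.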

**Default Parameters:**
- Normalizing = Off
- AdamOptimizer with Learning Rate = 0.001
- EPOCHS = 10
- BATCH_SIZE = 50
- Dropout Rate = 0.5

###  <font color='blue'>Benchmark
 <font color='blue'>In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:
<ol>
<li> _Has some result or value been provided that acts as a benchmark for measuring performance?_
<li> _Is it clear how this result or value was obtained (whether by data or by hypothesis)?_
</ol>
</font>

As described, I will use the **Accuracy** on the test set as my metric. As my benchmark, I am using the results from the [original paper from Stanford/Google by Andrew Ng et al](http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf).

![image alt >](res/Benchmark.png)
<center>Benchmark from [original paper from Stanford/Google by Andrew Ng et al](http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf)</center>

## III. Methodology
_(approx. 3-5 pages)_

###  <font color='blue'>Data Preprocessing
<font color='blue'>In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:
<ol>
<li> _If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?_
<li> _Based on the **Data Exploration** section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?_
<li> _If no preprocessing is needed, has it been made clear why?_
</ol>
</font>

For preprocessing, I used [Histogram equalization](http://docs.opencv.org/3.1.0/d5/daf/tutorial_py_histogram_equalization.html). 

_Consider an image whose pixel values are confined to some specific range of values only. For eg, brighter image will have all pixels confined to high values. But a good image will have pixels from all regions of the image. So you need to stretch this histogram to either ends (as given in below image, from wikipedia) and that is what Histogram Equalization does (in simple words). This normally improves the contrast of the image._ (from http://docs.opencv.org/3.1.0/d5/daf/tutorial_py_histogram_equalization.html)

![image alt >](res/Histogram_Equalization.png)
<center>_Example of Histogram Equalization [Source](http://docs.opencv.org/3.1.0/d5/daf/tutorial_py_histogram_equalization.html)_</center>


In addition, I used [One Hot Encoding](https://www.wikiwand.com/en/One-hot) for the labels. For every possible digit, there is one column with a boolean value, so each label becomes a vector of length 10.
<ol>
<li>The number 0 is represented as: [1,0,0,0,0,0,0,0,0,0]
<li>The number 3 is represented as: [0,0,0,1,0,0,0,0,0,0]
<li>The number 9 is represented as: [0,0,0,0,0,0,0,0,0,1]
</ol>

The following picture explains One-hot encoding with a different setting:

![image alt >](res/OneHot.jpg)
<center>_Example of One-Hot Encoding [Source](http://pt.slideshare.net/Hadoop_Summit/machine-learning-on-hadoop-data-lakes)_</center>
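For the ten digit classes, one-hot encoding can be sketched with NumPy by picking rows of an identity matrix. The helper name `one_hot` is my own; this is just one possible way to do it:

```python
import numpy as np


def one_hot(labels, n_classes=10):
    """Turn integer labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels)
    return np.eye(n_classes, dtype=np.float32)[labels]


encoded = one_hot([0, 3, 9])
print(encoded)
```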

As another way to detect and limit overfitting, I split the training set into 2 separate sets: a training set and a validation set.

<center>

| **Dataset** | **Size (``<batchsize>, <width>, <height>, <channel>``)**  |
|------|------|
|  Train size:    | <center>(49082, 32, 32, 3) </center>| 
|  Validation size:   | <center>(24175, 32, 32, 3) </center>| 
|  Test size: 	   | <center>(26032, 32, 32, 3) </center>| 

</center>
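A random split of this kind can be sketched with NumPy as shown below. The validation fraction of 0.33 is an assumption: it is consistent with the sizes in the table (rounding 0.33 * 73257 up gives 24175), but other settings could produce the same numbers. The function name and the toy data are illustrative.

```python
import math

import numpy as np


def split_train_validation(X, y, validation_fraction=0.33, seed=0):
    """Shuffle the indices and hold out a fraction of them for validation."""
    n = len(X)
    n_val = math.ceil(validation_fraction * n)
    order = np.random.default_rng(seed).permutation(n)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]


# Toy data standing in for the images and labels.
X = np.arange(100).reshape(100, 1)
y = np.arange(100)
X_tr, X_val, y_tr, y_val = split_train_validation(X, y)
print(len(X_tr), len(X_val))
```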

### <font color='blue'>Implementation
<font color='blue'>In this section, the process for which metrics, algorithms, and techniques that you implemented for the given data will need to be clearly documented. It should be abundantly clear how the implementation was carried out, and discussion should be made regarding any complications that occurred during this process. Questions to ask yourself when writing this section:
<ol>
<li> _Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?_
<li> _Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?_
<li> _Was there any part of the coding process (e.g., writing complicated functions) that should be documented?_
</ol>
</font>

The whole implementation can be found [here](https://github.com/VisionTom/Nanodegree_MachineLearning/blob/master/06_Capstone/capstone.ipynb).

The implementation was done in Python in a Jupyter notebook. The most important libraries were:
- cv2
- scipy
- sklearn
- tensorflow
- and several more

The most important parts of the implementations are the following:

### Normalizing, using Histogram equalization

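```python
import cv2

def normalize_YUV(img):
    # Convert to YUV so that brightness (Y) is separated from color (U, V).
    yuv = cv2.cvtColor(img, cv2.COLOR_RGB2YUV)
    # Equalize only the brightness channel; the colors stay untouched.
    yuv[:,:,0] = cv2.equalizeHist(yuv[:,:,0])
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB)
```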

### LeNet architecture, including dropout

```python
import tensorflow as tf
# Assumed import: flatten is not defined in this excerpt (TensorFlow 1.x).
from tensorflow.contrib.layers import flatten

n_classes = 10  # digits 0-9


def LeNet(x, keep_prob):
    # Convolution 1: 32x32x3 -> 28x28x6
    conv1_W = tf.Variable(tf.truncated_normal([5, 5, 3, 6], stddev=0.01))
    conv1_b = tf.Variable(tf.zeros(6))
    conv1 = tf.nn.conv2d(x, conv1_W, strides=[1, 1, 1, 1], padding='VALID') + conv1_b
    conv1 = tf.nn.relu(conv1)

    # Max pooling 1: 28x28x6 -> 14x14x6
    conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Convolution 2: 14x14x6 -> 10x10x16
    conv2_W = tf.Variable(tf.truncated_normal(shape=(5, 5, 6, 16), stddev=0.01))
    conv2_b = tf.Variable(tf.zeros(16))
    conv2 = tf.nn.conv2d(conv1, conv2_W, strides=[1, 1, 1, 1], padding='VALID') + conv2_b
    conv2 = tf.nn.relu(conv2)

    # Max pooling 2: 10x10x16 -> 5x5x16
    conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Flatten: 5x5x16 -> 400
    fc1 = flatten(conv2)

    # Fully connected layer: 400 -> 120
    fc1_W = tf.Variable(tf.truncated_normal([400, 120], stddev=0.01))
    fc1_b = tf.Variable(tf.zeros(120))
    fc1 = tf.matmul(fc1, fc1_W) + fc1_b
    fc1 = tf.nn.relu(fc1)

    # Dropout (keep_prob is set to 1.0 for evaluation)
    fc1 = tf.nn.dropout(fc1, keep_prob)

    # Output layer: 120 -> n_classes logits
    fc2_W = tf.Variable(tf.truncated_normal(shape=(120, n_classes), stddev=0.01))
    fc2_b = tf.Variable(tf.zeros(n_classes))
    return tf.matmul(fc1, fc2_W) + fc2_b
```
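The sizes noted in the comments of the network (28, 14, 10, 5 and finally 400 inputs for the first fully connected layer) follow from the usual formula for 'VALID' padding. A small sketch to check the arithmetic; the helper name is my own:

```python
def conv_output_size(size, kernel, stride=1):
    """Output width/height of a convolution or pooling step with 'VALID' padding."""
    return (size - kernel) // stride + 1


size = 32
size = conv_output_size(size, kernel=5)            # conv1: 32 -> 28
size = conv_output_size(size, kernel=2, stride=2)  # pool1: 28 -> 14
size = conv_output_size(size, kernel=5)            # conv2: 14 -> 10
size = conv_output_size(size, kernel=2, stride=2)  # pool2: 10 -> 5

flattened = size * size * 16  # 16 feature maps after the second convolution
print(size, flattened)
```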

### Using Tensorflow for Training:

```python
# Relies on names defined earlier in the notebook: x, y, keep_prob, train_op,
# eval_data, the data arrays, the result lists, and the constants
# LEARN_MODUS, EPOCHS and BATCH_SIZE.
if __name__ == '__main__':
    if LEARN_MODUS:
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            steps_per_epoch = len(X_train) // BATCH_SIZE

            # Train model
            for i in range(EPOCHS):
                for step in range(steps_per_epoch):
                    # Select the next batch
                    batch_start = step * BATCH_SIZE
                    batch_end = (step + 1) * BATCH_SIZE
                    batch_x = X_train[batch_start:batch_end]
                    batch_y = y_train[batch_start:batch_end]

                    # Run one training step with dropout enabled
                    sess.run(train_op, feed_dict={x: batch_x, y: batch_y, keep_prob: 0.5})

                # Calculate Training Loss and Accuracy
                train_loss, train_acc = eval_data(X_train, y_train)
                print("EPOCH {} ...".format(i+1))
                print("Training loss = {:.3f}".format(train_loss))
                print("Training accuracy = {:.3f}".format(train_acc))
                train_losses.append(train_loss)
                train_accuracies.append(train_acc)

                # Calculate Validation Loss and Accuracy
                val_loss, val_acc = eval_data(X_val, y_val)
                print("EPOCH {} ...".format(i+1))
                print("Validation loss = {:.3f}".format(val_loss))
                print("Validation accuracy = {:.3f}".format(val_acc))
                val_losses.append(val_loss)
                val_accuracies.append(val_acc)

                # Calculate Test Loss and Accuracy (should only be done once
                # at the end, because of survivorship bias)
                test_loss, test_acc = eval_data(X_test, y_test)
                print("EPOCH {} ...".format(i+1))
                print("Test loss = {:.3f}".format(test_loss))
                print("Test accuracy = {:.3f}".format(test_acc))
                test_losses.append(test_loss)
                test_accuracies.append(test_acc)

            # Reuse an existing saver if there is one, otherwise create it
            try:
                saver
            except NameError:
                saver = tf.train.Saver()
            saver.save(sess, 'foo')
            print("Model saved")
```

One difficulty during coding was a series of out-of-memory errors, which are very cryptic in TensorFlow and hard to debug. I implemented batch learning and batch validation, so that only one batch at a time is fed to the network, which improved the stability of the system.
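The idea of batch validation can be sketched independently of TensorFlow: the data is evaluated in small slices, and the per-batch results are combined as a weighted average. The function below is a hypothetical illustration and not the `eval_data` function from the notebook; `batch_fn` stands in for whatever computes loss and accuracy for one batch.

```python
def evaluate_in_batches(X, y, batch_fn, batch_size=50):
    """Average loss and accuracy over a dataset without loading it all at once."""
    n = len(X)
    total_loss = 0.0
    total_correct = 0.0
    for start in range(0, n, batch_size):
        batch_x = X[start:start + batch_size]
        batch_y = y[start:start + batch_size]
        loss, acc = batch_fn(batch_x, batch_y)
        # Weight by the batch length, because the last batch may be smaller.
        total_loss += loss * len(batch_x)
        total_correct += acc * len(batch_x)
    return total_loss / n, total_correct / n


# Toy stand-in for the model: it always predicts the label 1 and reports no loss.
def toy_batch_fn(batch_x, batch_y):
    correct = sum(1 for label in batch_y if label == 1)
    return 0.0, correct / len(batch_y)


X_demo = list(range(7))
y_demo = [1, 1, 0, 1, 0, 0, 1]
demo_loss, demo_acc = evaluate_in_batches(X_demo, y_demo, toy_batch_fn, batch_size=3)
print(round(demo_acc, 3))
```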


### <font color='blue'>Refinement
<font color='blue'>In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:
<ol>
<li> _Has an initial solution been found and clearly reported?_
<li> _Is the process of improvement clearly documented, such as what techniques were used?_
<li> _Are intermediate and final solutions clearly reported as the process is improved?_
</ol>
</font>


As already described, the initial solution used the following parameters:

- Normalizing = Off
- AdamOptimizer with Learning Rate = 0.001
- EPOCHS = 10
- BATCH_SIZE = 50
- Dropout Rate = 0.5

The resulting training accuracy is 0.890. <br>
The resulting validation accuracy is 0.862.

The process used to further refine the system and its parameters was **trial and error**: I changed one parameter at a time and checked whether the accuracy on the validation set improved.

**1) EPOCHS**

| **EPOCHS** | Learning_Rate | BATCH_SIZE | Dropout Rate | Normalisation | Training Acc | Validation Acc | Best score |
|------|------|------|------|------|------|------|------|
|   **10**  | 0.001 | 50 | 0.5 | ON | 0.857 | 0.838 | |
|   **25**  | 0.001 | 50 | 0.5 | ON | 0.884 | 0.856 | |
|   **50**  | 0.001 | 50 | 0.5 | ON | 0.890 | 0.862 | |
|   **75**  | 0.001 | 50 | 0.5 | ON | 0.908 | 0.870 | **Best** |
|   **100**  | 0.001 | 50 | 0.5 | ON | 0.904 | 0.867 | |

**Result: Overfitting is visible above 75 epochs.**
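Picking the number of epochs at which the validation accuracy peaks can be written down in a few lines. The values are the validation accuracies from the table above; the variable names are my own.

```python
# Validation accuracy per number of epochs, copied from the table above.
epochs = [10, 25, 50, 75, 100]
validation_accuracy = [0.838, 0.856, 0.862, 0.870, 0.867]

# Index of the highest validation accuracy.
best_index = max(range(len(epochs)), key=lambda i: validation_accuracy[i])
best_epochs = epochs[best_index]
print(best_epochs, validation_accuracy[best_index])
```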

**2,3) Learning Rate**

| EPOCHS | **Learning_Rate** | BATCH_SIZE | Dropout Rate | Normalisation | Training Acc | Validation Acc | Best score |
|------|------|------|------|------|------|------|------|
|   10  | **0.0005** | 50 | 0.5 | ON | 0.865 | 0.861 | |
|   25  | **0.0005** | 50 | 0.5 | ON | 0.857 | 0.838 | |
|   10  | **0.0002** | 50 | 0.5 | ON | 0.887 | 0.866 | |
|   25  | **0.0002** | 50 | 0.5 | ON | 0.919 | 0.883 | **Best** |

**Result: The lowest tested learning rate (0.0002) performed best.**


**4,5) Batch_Size**

| EPOCHS | Learning_Rate | **BATCH_SIZE** | Dropout Rate | Normalisation | Training Acc | Validation Acc | Best score |
|------|------|------|------|------|------|------|------|
|   10  | 0.0002 | **25** | 0.5 | ON | 0.913 | 0.890 | |
|   25  | 0.0002 | **25** | 0.5 | ON | 0.951 | 0.912 | **Best all time** |
|   10  | 0.0002 | **75** | 0.5 | ON | 0.904 | 0.883 | |
|   25  | 0.0002 | **75** | 0.5 | ON | 0.934 | 0.898 | |

**Result: The smallest tested batch size (25) performed best.**

**6,7) Dropout Rate**

| EPOCHS | Learning_Rate | BATCH_SIZE | **Dropout Rate** | Normalisation | Training Acc | Validation Acc | Best score |
|------|------|------|------|------|------|------|------|
|   10  | 0.0002 | 25 | **0.25** | ON | 0.850 | 0.840 | |
|   25  | 0.0002 | 25 | **0.25** | ON | 0.867 | 0.853 | |
|   10  | 0.0002 | 25 | **0.5** | ON | 0.913 | 0.890 | |
|   25  | 0.0002 | 25 | **0.5** | ON | 0.951 | 0.912 | **Best all time** |
|   10  | 0.0002 | 25 | **0.75** | ON | 0.922 | 0.888 | |
|   25  | 0.0002 | 25 | **0.75** | ON | 0.960 | 0.906 | |

**Result: A dropout rate of 0.5 performed best.**

**8) Normalisation**

| EPOCHS | Learning_Rate | BATCH_SIZE | Dropout Rate | **Normalisation** | Training Acc | Validation Acc | Best score |
|------|------|------|------|------|------|------|------|
|   10  | 0.0002 | 25 | 0.5 | **OFF** | 0.886 | 0.867 | |
|   25  | 0.0002 | 25 | 0.5 | **OFF** | 0.924 | 0.890 | |
|   10  | 0.0002 | 25 | 0.5 | **ON** | 0.913 | 0.890 | |
|   25  | 0.0002 | 25 | 0.5 | **ON** | 0.951 | 0.912 | **Best all time** |

**Result: The results are worse without normalisation.**

The **final parameters** are:

| EPOCHS | Learning_Rate | BATCH_SIZE | Dropout Rate | **Normalisation** | Training Acc | Validation Acc | Best score |
|------|------|------|------|------|------|------|------|
|   25  | 0.0002 | 25 | 0.5 | **ON** | 0.951 | 0.912 | **Best all time** |

## IV. Results
_(approx. 2-3 pages)_

### <font color='blue'>Model Evaluation and Validation
<font color='blue'>In this section, the final model and any supporting qualities should be evaluated in detail. It should be clear how the final model was derived and why this model was chosen. In addition, some type of analysis should be used to validate the robustness of this model and its solution, such as manipulating the input data or environment to see how the model’s solution is affected (this is called sensitivity analysis). Questions to ask yourself when writing this section:
<ol>
<li> _Is the final model reasonable and aligning with solution expectations? Are the final parameters of the model appropriate?_
<li> _Has the final model been tested with various inputs to evaluate whether the model generalizes well to unseen data?_
<li> _Is the model robust enough for the problem? Do small perturbations (changes) in training data or the input space greatly affect the results?_
<li> _Can results found from the model be trusted?_
</ol>
</font>

**Final Parameters**

| EPOCHS | Learning_Rate | BATCH_SIZE | Dropout Rate | **Normalisation** | Training Acc | Validation Acc |
|------|------|------|------|------|------|------|
|   25  | 0.0002 | 25 | 0.5 | **ON** | 0.951 | 0.912 |

The final parameters were found by trial and error. Another, more structured way to find the best parameters is [grid search](http://scikit-learn.org/stable/modules/grid_search.html). However, grid search takes a lot of time and computing power, which is why it was not used here. For the same reason, I only used EPOCHS=25. In addition to faster computation, limiting EPOCHS to 25 also reduces overfitting.
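For comparison, a plain grid search over the tested values could look like the sketch below. It was not run for this project. The scoring function is a toy stand-in that I made up so that the example runs on its own; in practice it would train the network and return the validation accuracy, which is exactly what makes grid search expensive (here 3 x 3 x 3 = 27 training runs).

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination of parameter values and keep the best score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train the model and return the validation accuracy".
def toy_validation_accuracy(learning_rate, batch_size, dropout):
    return (1.0 - abs(learning_rate - 0.0002) * 100
            - abs(batch_size - 25) / 1000 - abs(dropout - 0.5) / 10)


param_grid = {
    "learning_rate": [0.001, 0.0005, 0.0002],
    "batch_size": [25, 50, 75],
    "dropout": [0.25, 0.5, 0.75],
}
best_params, best_score = grid_search(param_grid, toy_validation_accuracy)
print(best_params)
```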

**Test-Set**

To evaluate the model, I used the test set for the first time. The accuracy on the test set is: **0.901**

**Further Model Evaluation and Validation**

To get a better estimation for the test accuracy, the evaluation was run **five** times with the same parameters. 


| testrun | EPOCHS | Learning_Rate | BATCH_SIZE | Dropout Rate | **Normalisation** | Training Acc | Validation Acc | Test Acc |
|------|------|------|------|------|------|------|------|------|
|#1|   25  | 0.0002 | 25 | 0.5 | ON | 0.951 | 0.912 | **0.891** |
|#2|   25  | 0.0002 | 25 | 0.5 | ON | 0.926 | 0.890 | **0.835** |
|#3|   25  | 0.0002 | 25 | 0.5 | ON | 0.944 | 0.901 | **0.854** |
|#4|   25  | 0.0002 | 25 | 0.5 | ON | 0.944 | 0.905 | **0.848** |
|#5|   25  | 0.0002 | 25 | 0.5 | ON | 0.944 | 0.905 | **0.849** |
|**Average**|   25  | 0.0002 | 25 | 0.5 | ON | 0.942 | 0.903 | **0.855** |


The notebooks with the results were saved as capstone.ipynb to capstone5.ipynb. The first test run was better than all the others. One reason for this could simply be luck, as many different parameters were tested and only the best result was kept. This bias is called [Survivorship Bias](https://www.wikiwand.com/en/Survivorship_bias). A way to improve the validation and the statistical significance is to use [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html). However, this takes a lot of time and computing power, which is why it is not used here.
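The spread between the runs can be summarized with the mean and the sample standard deviation of the five test accuracies from the table. A short sketch using only the standard library:

```python
import statistics

# Test accuracies of the five runs, copied from the table above.
test_accuracies = [0.891, 0.835, 0.854, 0.848, 0.849]

mean_acc = statistics.mean(test_accuracies)
std_acc = statistics.stdev(test_accuracies)  # sample standard deviation
print("mean = {:.3f}, standard deviation = {:.3f}".format(mean_acc, std_acc))
```

With only five runs this is a rough estimate, but it shows that the first run lies clearly above the others.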

**Second run:** <br>
EPOCH 25 ... <br>
Training loss = 0.247 <br>
Training accuracy = 0.926 <br>
EPOCH 25 ... <br>
Validation loss = 0.379 <br>
Validation accuracy = 0.890 <br>
EPOCH 25 ... <br>
Test loss = 0.573 <br>
Test accuracy = 0.835

**Third run:** <br>
EPOCH 25 ... <br>
Training loss = 0.186 <br>
Training accuracy = 0.944 <br>
EPOCH 25 ... <br>
Validation loss = 0.342 <br>
Validation accuracy = 0.901 <br>
EPOCH 25 ... <br>
Test loss = 0.545 <br>
Test accuracy = 0.854

**Fourth run:** <br>
EPOCH 25 ... <br>
Training loss = 0.195 <br>
Training accuracy = 0.944 <br>
EPOCH 25 ... <br>
Validation loss = 0.347 <br>
Validation accuracy = 0.904 <br>
EPOCH 25 ... <br>
Test loss = 0.557 <br>
Test accuracy = 0.848

**Fifth run:** <br>
EPOCH 25 ... <br>
Training loss = 0.191 <br>
Training accuracy = 0.944 <br>
EPOCH 25 ... <br>
Validation loss = 0.343 <br>
Validation accuracy = 0.905 <br>
EPOCH 25 ... <br>
Test loss = 0.546 <br>
Test accuracy = 0.849

**Sixth run:** <br>
EPOCH 25 ... <br>
Training loss = 0.235 <br>
Training accuracy = 0.932 <br>
EPOCH 25 ... <br>
Validation loss = 0.371 <br>
Validation accuracy = 0.894 <br>
EPOCH 25 ... <br>
Test loss = 0.585 <br>
Test accuracy = 0.834



### <font color='blue'>Justification
<font color='blue'>In this section, your model’s final solution and its results should be compared to the benchmark you established earlier in the project using some type of statistical analysis. You should also justify whether these results and the solution are significant enough to have solved the problem posed in the project. Questions to ask yourself when writing this section:
<ol>
<li> _Are the final results found stronger than the benchmark result reported earlier?_
<li> _Have you thoroughly analyzed and discussed the final solution?_
<li> _Is the final solution significant enough to have solved the problem?_
</ol>
</font>

![image alt >](res/Benchmark.png)
<center>Benchmark from [original paper from Stanford/Google by Andrew Ng et al](http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf)</center>

With an accuracy of 85.5%, the results are stronger than the algorithms "Binary Features (WDCH)" and "HOG"; however, they are worse than K-Means and Stacked Sparse Auto-Encoders, and well below Human Performance.

## V. Conclusion
_(approx. 1-2 pages)_

### <font color='blue'>Free-Form Visualization
<font color='blue'>In this section, you will need to provide some form of visualization that emphasizes an important quality about the project. It is much more free-form, but should reasonably support a significant result or characteristic about the problem that you want to discuss. Questions to ask yourself when writing this section:
<ol>
<li> _Have you visualized a relevant or important quality about the problem, dataset, input data, or results?_
<li> _Is the visualization thoroughly analyzed and discussed?_
<li> _If a plot is provided, are the axes, title, and datum clearly defined?_
</ol>
</font>

To get a better understanding of the model, it is worth examining how it performs on individual classes. Are there any digits that are regularly mixed up?

One way to answer this question is a **Confusion Matrix**. It helps to identify problems with certain classes (for example, the digit 3 might frequently be mistaken for 8 because the two look similar).

![image alt >](res/ConfusionMatrix.png)
<center>_6. Confusion Matrix_</center>
<br><br>

The Confusion Matrix gives a detailed overview of the misclassified house numbers. House numbers with the true label '8' are most often confused with the digits 6 and 3. The reason is that these digits look similar and differ by only a few strokes.

The chart also shows that the predictions are correct most of the time. This is visible in the diagonal, where "True label" equals "Predicted label".
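A confusion matrix like the one above can be computed with a few lines of NumPy. The following is a minimal sketch, not the code used for the figure; the function names and the toy labels are made up for illustration.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes=10):
    """Count how often each true digit (row) is predicted as each digit (column)."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when the same (row, column) pair repeats.
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return matrix

def most_confused(matrix, digit, top=2):
    """Return the wrong predictions that occur most often for a given true digit."""
    row = matrix[digit].copy()
    row[digit] = -1  # ignore the correct predictions on the diagonal
    order = np.argsort(row)[::-1][:top]
    return [int(d) for d in order]

# Hypothetical toy labels, only to show the call pattern.
y_true = [8, 8, 8, 8, 8, 3, 3, 1]
y_pred = [8, 6, 6, 3, 8, 3, 8, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm[8])                 # row of counts for the true digit 8
print(most_confused(cm, 8))  # digits most often predicted instead of 8
```

The sum of the diagonal divided by the sum of all entries gives the overall accuracy, which ties the matrix back to the single accuracy number reported earlier.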

### <font color='blue'>Reflection
<font color='blue'>In this section, you will summarize the entire end-to-end problem solution and discuss one or two particular aspects of the project you found interesting or difficult. You are expected to reflect on the project as a whole to show that you have a firm understanding of the entire process employed in your work. Questions to ask yourself when writing this section:
<ol>
<li> _Have you thoroughly summarized the entire process you used for this project?_
<li> _Were there any interesting aspects of the project?_
<li> _Were there any difficult aspects of the project?_
<li> _Does the final model and solution fit your expectations for the problem, and should it be used in a general setting to solve these types of problems?_
</ol>
</font>

First, to get a basic understanding of the problem, the data was thoroughly **analyzed**: samples of the house-number pictures, the class distribution and the labels. After that, similar tasks and papers were examined to obtain a benchmark and additional background knowledge.

Next, the data was **preprocessed**: the label 10 was changed to 0, because the pictures with this label show the digit 0. In addition, the labels were one-hot encoded. The pictures were then prepared for machine learning using histogram equalization. Finally, the data was split into training, validation and test sets.
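The preprocessing steps described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the project's actual code: the function names, the 20% validation fraction and the assumption of single-channel `uint8` images are all hypothetical.

```python
import numpy as np

def remap_labels(labels):
    """SVHN stores the digit 0 under label 10; map it back to 0."""
    labels = np.asarray(labels).copy()
    labels[labels == 10] = 0
    return labels

def one_hot(labels, num_classes=10):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[labels]

def equalize_histogram(gray_image):
    """Histogram-equalize a single-channel uint8 image (assumed grayscale)."""
    hist = np.bincount(gray_image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    total = gray_image.size
    if total == cdf_min:  # constant image: nothing to spread out
        return gray_image.copy()
    lookup = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lookup = np.clip(lookup, 0, 255).astype(np.uint8)
    return lookup[gray_image]

def split_train_validation(images, labels, validation_fraction=0.2, seed=0):
    """Shuffle once, then hold out a fraction of the data for validation."""
    order = np.random.default_rng(seed).permutation(len(images))
    n_val = int(len(images) * validation_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])

# Toy data, only to show the call pattern.
labels = remap_labels([10, 1, 9, 10, 3])
encoded = one_hot(labels)
image = np.array([[50, 50], [100, 200]], dtype=np.uint8)
equalized = equalize_histogram(image)
images = np.stack([image] * 5)
(train_x, train_y), (val_x, val_y) = split_train_validation(images, encoded)
print(labels, encoded.shape, train_x.shape, val_x.shape)
```

Remapping the labels before one-hot encoding keeps the encoding at ten columns, one per digit.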

As the classification algorithm, a **Convolutional Neural Network** (Deep Learning) based on TensorFlow was used. The **LeNet** architecture was implemented, which achieves good results on computer vision classification problems and does not take too long to train.
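To illustrate why LeNet is cheap to train, the sketch below walks through the layer output shapes of the classic LeNet-5 layout for a 32x32 input, which is the size of the cropped SVHN digits. The filter counts and layer sizes (6 and 16 filters, 120 and 84 units) are those of the textbook architecture and are an assumption here; the configuration used in this project may differ.

```python
def conv_output(size, kernel, stride=1):
    """Spatial size after an unpadded ('valid') convolution or pooling step."""
    return (size - kernel) // stride + 1

def lenet_shapes(input_size=32, num_classes=10):
    """List the output shape of every layer in a classic LeNet-5 layout."""
    shapes = []
    size = conv_output(input_size, kernel=5)       # 5x5 convolution, 6 filters
    shapes.append(("conv1", (size, size, 6)))
    size = conv_output(size, kernel=2, stride=2)   # 2x2 pooling
    shapes.append(("pool1", (size, size, 6)))
    size = conv_output(size, kernel=5)             # 5x5 convolution, 16 filters
    shapes.append(("conv2", (size, size, 16)))
    size = conv_output(size, kernel=2, stride=2)   # 2x2 pooling
    shapes.append(("pool2", (size, size, 16)))
    shapes.append(("flatten", (size * size * 16,)))
    shapes.append(("fc1", (120,)))
    shapes.append(("fc2", (84,)))
    shapes.append(("logits", (num_classes,)))
    return shapes

for name, shape in lenet_shapes():
    print(f"{name:8s} {shape}")
```

After two convolution and pooling stages, only a small feature vector is passed to the fully connected layers, which keeps the number of parameters low.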

To tune the parameters, I used trial and error against the validation set, which resulted in a system with an accuracy of **85.5% on the test set**.

One interesting aspect was the analysis of mistakes: similar to humans, the model sometimes confused similar-looking digits such as 3, 6 and 8. A difficult and very time-consuming aspect was fine-tuning the parameters.

In summary, the model met my expectations for the problem, and this approach is recommended for these types of problems.


### <font color='blue'>Improvement
<font color='blue'>In this section, you will need to provide discussion as to how one aspect of the implementation you designed could be improved. As an example, consider ways your implementation can be made more general, and what would need to be modified. You do not need to make this improvement, but the potential solutions resulting from these changes are considered and compared/contrasted to your current solution. Questions to ask yourself when writing this section:
<ol>
<li> _Are there further improvements that could be made on the algorithms or techniques you used in this project?_
<li> _Were there algorithms or techniques you researched that you did not know how to implement, but would consider using if you knew how?_
<li> _If you used your final solution as the new benchmark, do you think an even better solution exists?_
</ol>
</font>

There are many ways to improve the described system:
- Generating more data could be a good way to improve the accuracy. One option is to enlarge the dataset by rotating, stretching or distorting the images. The same approach could also be used to compensate for the imbalance in how often each digit occurs.
- Using grid search and cross-validation to tune the parameters.
- A more complex CNN architecture would probably improve the accuracy. Examples are [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) or [GoogLeNet](https://arxiv.org/abs/1409.4842).
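As a rough illustration of the first point, the sketch below oversamples the rarer classes with randomly shifted copies until every class is as frequent as the largest one. Small translations are used here only as a simple stand-in for rotation or stretching, which would need an image library; the function names and the `max_shift` parameter are hypothetical, and none of this was part of the project.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift an image by a few pixels, filling the exposed border by edge replication."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    pad = [(max_shift, max_shift), (max_shift, max_shift)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="edge")
    height, width = image.shape[:2]
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + height, left:left + width]

def balance_by_augmentation(images, labels, max_shift=2, seed=0):
    """Add shifted copies of rarer classes until all classes match the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    new_images, new_labels = [images], [labels]
    for cls, count in zip(classes, counts):
        missing = int(target - count)
        if missing == 0:
            continue
        source = np.flatnonzero(labels == cls)
        picks = rng.choice(source, size=missing, replace=True)
        augmented = np.stack([random_shift(images[i], max_shift, rng) for i in picks])
        new_images.append(augmented)
        new_labels.append(np.full(missing, cls, dtype=labels.dtype))
    return np.concatenate(new_images), np.concatenate(new_labels)

# Random toy images with integer labels (not one-hot), only to show the call pattern.
images = np.random.default_rng(1).integers(0, 256, size=(6, 32, 32), dtype=np.uint8)
labels = np.array([1, 1, 1, 1, 2, 7])
balanced_images, balanced_labels = balance_by_augmentation(images, labels)
print(np.bincount(balanced_labels))
```

Augmentation of this kind should only be applied to the training set, so that validation and test accuracy are still measured on unmodified images.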

One technique that would be very interesting is detecting the position of house numbers in high-resolution images. I am currently enrolled in the [Self Driving Car-Nanodegree](https://de.udacity.com/course/self-driving-car-engineer-nanodegree--nd013/), where these techniques will be taught. With them, one could detect and classify house numbers in arbitrary Google Street View images.

The problem is often used in machine learning papers, and there are many systems that outperform mine. Using the above-mentioned AlexNet and GoogLeNet, the highest result on the SVHN dataset was achieved by [Chen-Yu Lee, Patrick W. Gallagher and Zhuowen Tu](https://arxiv.org/pdf/1509.08985v2.pdf) with an accuracy of **98.31%**.
