# **Writeup | Self-Driving Car project - Deep Learning**
**x** min read

**Abstract — This notebook is the writeup of the Traffic Sign Recognition project.** We apply Deep Learning and Convolutional Networks (ConvNets) to the task of traffic sign classification as part of the SELF-DRIVING CAR nanodegree program. The project is broken down into three steps, which are:   
- Step 1: Data Set Summary and Exploration
- Step 2: Design and Test a Model Architecture
- Step 3: Test a Model on New Images

The model yielded an accuracy of **9x.xx%** with a loss of **xx.xx**, to be compared with the human performance of 98.81%, using 32x32 preprocessed input images.

Beyond the initial requirements, I also implemented the TensorBoard features (embedding visualizer and summaries of images, loss, accuracy, weights and biases), a comparison of different architectures, and a comparison of the model results with a Google Images search.

Here is the link to the [PROJECT SPECIFICATION](https://review.udacity.com/#!/rubrics/481/view) and here to my [PROJECT CODE](https://github.com/chatmoon/Traffic-Sign-Classifier-Project/blob/master/_2_WIP/_JNBK_/_TSC-step2.1_170309-1557_WIP.ipynb).

---
### Step 1: Data Set Summary & Exploration

The goals here are the following:
* Load the [data set](http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset) and summarize it
* Explore and visualize the data set

#### 1.1. Load and summarize the data set  

*The code for the loading of the data set is contained in the code cell [2] of the IPython notebook.   
And the one for the summary is contained in the code cell [4].*   

I used the numpy library to calculate summary statistics of the traffic signs data set:   
* The size of training set is 34799
* The size of validation set is 4410
* The size of test set is 12630
* The shape of a traffic sign image is (32, 32, 3)
* The number of unique classes/labels in the data set is 43
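
As a rough illustration of how such figures can be obtained with numpy, here is a minimal sketch. It uses small random arrays in place of the real data, and the names `X_train`, `y_train` and `summarize` are hypothetical (they are not taken from the notebook).

```python
import numpy as np

def summarize(features, labels):
    """Return basic statistics for an image data set stored as NumPy arrays."""
    return {
        "n_examples": features.shape[0],      # number of images
        "image_shape": features.shape[1:],    # (height, width, channels)
        "n_classes": np.unique(labels).size,  # number of distinct labels
    }

# Stand-in data: 100 random 32x32 RGB images with labels cycling over 43 classes.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
y_train = np.arange(100) % 43

stats = summarize(X_train, y_train)
print(stats)
```

The same helper would be called once per split (training, validation, test) on the real arrays.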

#### 1.2. An exploratory visualization of the dataset
##### 1.2.1. Explore the data set

*The code for this step is contained in the code cells from [6] to [8] of the IPython notebook.*

In this part, we have three representations of the data set:   
- fig.1: a list showing the number of occurrences per traffic sign name
- fig.2: a bar chart showing the number of occurrences per class id
- fig.3: a bar chart showing the distribution of these traffic sign images within the data set

>![fig.1](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataExplo_1_showList.PNG)
>![fig.2](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataExplo_2_showChart.png)
>![fig.3](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataExplo_3_showDist.png)

**Observations:**    
- Fig.1 and fig.2 show that there is a large disparity between the numbers of occurrences of the traffic signs.   
Additional data should be created in order to rebalance the underrepresented classes.
- Fig.3 shows that the classes are piled on top of each other, i.e. the images are stored class by class.   
The data set should be shuffled before training the model.
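
Both observations can be checked and acted upon with a few lines of numpy. The sketch below is only an illustration on toy labels; `class_counts` and `shuffle_together` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np

def class_counts(labels, n_classes):
    """Number of examples per class id."""
    return np.bincount(labels, minlength=n_classes)

def shuffle_together(features, labels, seed=0):
    """Shuffle features and labels with the same random permutation."""
    order = np.random.default_rng(seed).permutation(len(labels))
    return features[order], labels[order]

# Toy labels sorted by class, mimicking the "piled" layout seen in fig.3.
y = np.repeat(np.arange(3), [5, 2, 8])
X = np.arange(len(y)).reshape(-1, 1)  # one dummy feature identifying each row

counts = class_counts(y, n_classes=3)
X_shuf, y_shuf = shuffle_together(X, y)
print(counts, counts.max() / counts.min())  # per-class counts and imbalance ratio
```

Using one permutation for both arrays keeps every image paired with its own label.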

##### 1.2.2. Visualize the data set
*The code for this step is contained in the code cell [9] of the IPython notebook.*

In this part, we also have four types of visualization of the data set:   
- fig.4: 43 random images, one per class and with their black-boxes
- fig.5: 5 x 10, 5 random traffic signs, 10 images each 
- fig.6: a sprite image showing all the traffic signs in a single image
- fig.7: a sampling of challenging images to classify

> *fig.4: 43 random images, one per class and with their black-box*
![fig.4](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataVisu_4_show43TS.png)
> *fig.5: 5 x 10, 5 random traffic signs, 10 images each*
![fig.5](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataVisu_5_show5TS.png)
> *fig.6: a sprite image showing all the traffic signs in a single image*
![fig.6](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_sp_xx_5984x5984.png)   
> *fig.7: a sampling of challenging images to classify*
![fig.7](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataExplo_4_challenges.PNG)

**Observations:**   
Fig.4 to fig.7 show that the images are of varying quality. Many images are shaky, too dark or not at the same scale. There are variabilities such as viewpoint variations, lighting conditions (saturation, low contrast), motion blur, occlusions, sun glare, physical damage and color fading. Classifying the traffic signs can be challenging and complex in this context. We should preprocess the images to reduce the impact of these quality differences.

**Note:** the sprite image will be useful for the embedding visualization in TensorBoard.
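
For reference, a sprite of this kind can be assembled by tiling the images row by row into a square grid. The function below is a minimal sketch under that assumption; `make_sprite` is a hypothetical name and the notebook may build its sprite differently.

```python
import math
import numpy as np

def make_sprite(images):
    """Tile N images of shape (h, w, c) into one square grid, row by row."""
    n, h, w, c = images.shape
    grid = math.ceil(math.sqrt(n))  # images per row and per column
    sprite = np.zeros((grid * h, grid * w, c), dtype=images.dtype)
    for idx, img in enumerate(images):
        row, col = divmod(idx, grid)
        sprite[row * h:(row + 1) * h, col * w:(col + 1) * w] = img
    return sprite  # unused cells at the end stay black

images = np.full((10, 32, 32, 3), 7, dtype=np.uint8)  # 10 dummy images
sprite = make_sprite(images)
print(sprite.shape)
```

With 34799 images of 32x32 pixels, this layout gives a 187 x 187 grid, i.e. 5984 x 5984 pixels, which appears consistent with the size in the file name of fig.6.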

---
### Step 2: Design and Test a Model Architecture

#### 2.1. Preprocess the data set and generate additional data

##### 2.1.1. Preprocess the data set
*The code for this step is contained in the code cell **[XX]** of the IPython notebook.* 

> **Definition:** Pre-processing refers to techniques such as converting to grayscale, normalization, etc.

In this part, I describe how I preprocessed the image data, which techniques were chosen and why I chose them. You will also find below an overview of the preprocessing workflow in figure *fig.8*, with images showing the output of each preprocessing technique.

In summary, the preprocessing workflow generates five different types of images: 0RGB, 1GRAY, 2SHP, 3HST, 4CLAHE.

> *fig.8: the preprocessing workflow*
![fig.8](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataPPro%20flow_170610-1221.png)


First of all, **all preprocessed images are centered and normalized**. During training, the network multiplies the inputs by weights and adds biases to produce activations, then backpropagates the gradients to update the model. In this process, we do not want the gradients to get out of control. All preprocessed images are therefore centered around zero by subtracting the mean, and normalized by dividing by the standard deviation. This technique does not change the content of the image. It keeps the values of the weights and biases from becoming too big or too small. It also mitigates the numerical stability issue that occurs when very small values are added to very large ones (which introduces rounding errors) during the optimization of the loss function.
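
The centering and normalization step can be sketched as follows. This is a minimal sketch that assumes a single mean and standard deviation computed over the whole array; the notebook may compute them differently (per image or per channel), and `standardize` is a hypothetical name.

```python
import numpy as np

def standardize(images, eps=1e-8):
    """Center on zero and scale to unit standard deviation."""
    images = images.astype(np.float32)
    mean = images.mean()
    std = images.std()
    # eps guards against a division by zero on a constant input.
    return (images - mean) / (std + eps)

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(50, 32, 32, 3), dtype=np.uint8)
normalized = standardize(batch)
print(normalized.mean(), normalized.std())
```

If the statistics are taken from the training set, the same values would normally be reused for the validation and test sets.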

The two starting points of this approach are:
- the following comments: "the ConvNet was trained with full supervision on the color images of the GTSRB dataset and reached 98.97% accuracy on the phase 1 test set. After the end of phase 1, additional experiments with grayscale images established a new record accuracy of 99.17%", Traffic Sign Recognition with Multi-Scale Convolutional Networks, Pierre Sermanet and Yann LeCun
- the observation of the training set samples, which shows variabilities such as color fading, lighting conditions (saturation, low contrast), sun glare and motion blur. To overcome part of these variabilities and make the classification easier, I converted the RGB images to grayscale, then sharpened the grayscale images, then equalized the histograms of the sharpened images, and finally applied contrast-limited adaptive histogram equalization (CLAHE) to the equalized images

The following functions from OpenCV have been used:
- 1GRAY: [cvtColor](http://docs.opencv.org/2.4/modules/imgproc/doc/miscellaneous_transformations.html#cvtcolor)   
- 2SHP: [filter2D](http://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#filter2d)   
- 3HST: [equalizeHist](http://docs.opencv.org/2.4/modules/imgproc/doc/histograms.html#equalizehist) to improve the contrast   
- 4CLAHE: [createCLAHE](http://docs.opencv.org/3.1.0/d5/daf/tutorial_py_histogram_equalization.html)   
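
To make the first three steps more concrete, here is a NumPy-only sketch of what grayscale conversion, sharpening and global histogram equalization do. These are simplified stand-ins written for illustration, not the OpenCV calls used in the notebook. The 3x3 sharpening kernel is an assumption (the kernel actually used is not stated here), and CLAHE is left out.

```python
import numpy as np

def to_gray(rgb):
    """Weighted sum of R, G, B (the usual 0.299/0.587/0.114 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

def sharpen(gray):
    """Apply an assumed 3x3 sharpening kernel with replicated borders."""
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float64)
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.clip(out, 0, 255).astype(np.uint8)

def equalize_hist(gray):
    """Spread the gray levels over [0, 255] using the cumulative histogram."""
    cdf = np.bincount(gray.ravel(), minlength=256).cumsum()
    cdf_min = cdf[cdf > 0][0]
    if cdf_min == gray.size:  # constant image: nothing to equalize
        return gray.copy()
    lut = np.rint((cdf - cdf_min) / (gray.size - cdf_min) * 255)
    return np.clip(lut, 0, 255).astype(np.uint8)[gray]

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
equalized = equalize_hist(sharpen(to_gray(rgb)))
print(equalized.shape, equalized.min(), equalized.max())
```

After equalization the darkest gray level present maps to 0 and the brightest to 255, which is how the contrast gets stretched.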

**Note:**
- In addition, I also tried to blur the images, but I got a better accuracy after removing this technique   
- other realistic perturbations, such as other affine transformations, brightness changes or artificial occlusions, would probably also increase the robustness of the model. They would be implemented in a future sprint

##### 2.1.2. Generate additional data
*The code for this step is contained in the code cell **[XX]** of the IPython notebook.* 

In this part, I describe how and why I generated additional data, what techniques were chosen and why I chose these techniques. You will also find below a workflow of the generation of additional data in figure fig.10 with a visualization of the jittered image in fig.9.

I decided to generate additional data for two reasons:
- first, fig.2 shows that the data set is imbalanced, i.e. the classes are not represented equally. In this case, the accuracy might look excellent on paper while only reflecting the underlying class distribution. For example, if 90% of the instances belonged to Class-3 (Speed limit 50km/h), a model could reach 90% accuracy simply by always predicting "Class-3", without learning anything useful about the other classes
- secondly, the amount of data might not be sufficient for the model to generalise well in production with new data   
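
The first point can be illustrated with a small calculation: the accuracy of a model that always predicts the most frequent class equals the share of that class in the data. The labels below are made up for the example and `majority_baseline_accuracy` is a hypothetical helper.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(labels)
    return counts.max() / counts.sum()

# Made-up labels: 90 images of class 3 and 10 images of class 7.
labels = np.array([3] * 90 + [7] * 10)
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any reported accuracy is best read against this baseline rather than against zero.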


To add more data to the data set, I combined the following two techniques:
- images are randomly picked and perturbed in position ([-2,2] pixels)
- then they are perturbed in rotation ([-15,+15] degrees)   
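
The two perturbations above can be sketched in plain NumPy as follows. This is an illustrative version, not the notebook's implementation: nearest-neighbour sampling and border replication are assumptions made to keep the example short, and `jitter` is a hypothetical name.

```python
import numpy as np

def jitter(image, rng, max_shift=2, max_angle=15.0):
    """Translate by up to max_shift pixels, then rotate by up to max_angle degrees."""
    h, w = image.shape[:2]
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    theta = np.deg2rad(rng.uniform(-max_angle, max_angle))
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: undo the rotation about the centre, then undo the shift,
    # to find which source pixel lands on each output pixel.
    x_rel, y_rel = xs - cx, ys - cy
    src_x = np.cos(theta) * x_rel + np.sin(theta) * y_rel + cx - dx
    src_y = -np.sin(theta) * x_rel + np.cos(theta) * y_rel + cy - dy
    # Nearest-neighbour sampling; coordinates outside the image reuse the border.
    src_x = np.clip(np.rint(src_x), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(src_y), 0, h - 1).astype(int)
    return image[src_y, src_x]

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = jitter(original, rng)
print(augmented.shape, augmented.dtype)
```

Setting `max_shift=0` and `max_angle=0.0` returns the image unchanged, which is a convenient sanity check.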


**Notes:**
- in addition, I also tried to use the bounding box to crop the images and then perturb them in scale ([.9,1.1] ratio). I got an accuracy of around 93%, far below the human performance of 98.81%. I removed this step and got a better result in the end      
- the observation of the training set in fig.6 shows that the data set is a stack of several series of 30 similar images **with usually increasing scale**. For that reason, I did not implement the simple perturbation in scale
- there are 21 traffic signs that have a horizontal or a vertical axis of symmetry. Consequently, they are invariant to horizontal or vertical flipping. This technique would be implemented in a future sprint to add more data to the data set

Here is an example of an original image and an augmented image:
> *fig.9: augmented data example*
![fig.9](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataJit_test4.png)

I did not know how much data would have to be added to improve the accuracy or to overcome the overfitting problem. I therefore created several sets of augmented data such that each class has at least the following quantity (qty) of occurrences: 500, 1000, 1500, 2000, 2500 and 3000 for each preprocessed type of image (see fig.10).
> *fig.10: workflow of the generation of additional data*
![fig.10](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_dataJIT_flow_170617-2336.png)   
> *tab.11: the size of augmented training set*

| Minimum quantity per class | 0 | 500 | 1000 | 1500 | 2000 | 2500 | 3000 |
|:--------------------------:|:-:|:---:|:----:|:----:|:----:|:----:|:----:|
| Quantity added (relative to the previous column) | 0 | 4440 | 12451 | 15690 | 18630 | 21490 | 21500 |
| Total size | 34799 | 39239 | 51690 | 67380 | 86010 | 107500 | 129000 |
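
The number of images to generate for a given minimum can be derived from the per-class counts. The sketch below shows one possible way to do it on toy labels; the helper names are hypothetical, and it assumes every class has at least one image to start from.

```python
import numpy as np

def samples_to_add(labels, n_classes, min_qty):
    """How many images each class is missing to reach min_qty."""
    counts = np.bincount(labels, minlength=n_classes)
    return np.maximum(min_qty - counts, 0)

def pick_sources(labels, n_classes, min_qty, seed=0):
    """Indices of existing images to jitter so that every class reaches min_qty."""
    rng = np.random.default_rng(seed)
    missing = samples_to_add(labels, n_classes, min_qty)
    picks = [rng.choice(np.flatnonzero(labels == c), size=missing[c], replace=True)
             for c in range(n_classes) if missing[c] > 0]
    return np.concatenate(picks) if picks else np.array([], dtype=int)

y = np.repeat(np.arange(3), [5, 2, 8])  # toy labels: 5, 2 and 8 images per class
missing = samples_to_add(y, n_classes=3, min_qty=6)
sources = pick_sources(y, n_classes=3, min_qty=6)
print(missing, len(sources))
```

Each selected index would then be passed through the jittering step. When no class exceeds the minimum, the total is simply the number of classes times the minimum, e.g. 43 x 3000 = 129000, which is consistent with the last column of tab.11.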

#### 2.2. Model Architecture   
*The code for this step is contained in the code cell **[XX]** of the IPython notebook.*   

In this part, I describe what the final model architecture used looks like, how the model has been trained and the approach taken for finding a solution.

##### 2.2.1. The final model architecture

*The code for this step is contained in the code cell **[XX]** of the IPython notebook.*   

Even though I built and experimented with [several models](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_architecture_170711-1428.png), I got the best test and validation accuracies with the LeNet5 architecture.

> *fig.12: the original LeNet5 architecture*
![fig.12](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/lenet5.png)

Here is a diagram describing the final model:
> *fig.13: the final model architecture used*
![fig.13](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_architecture_170711-1428_mod1.png)

The model consisted of the following layers:   

| Layer         		|     Description	        					 | 
|:---------------------:|:----------------------------------------------:| 
| Input         		| 32x32xch, ch=3 for RGB image or 1 for grayscale| 
| Convolution        	| 1x1 stride, valid padding, outputs 28x28x6   	 |
| RELU					|												 |
| Max pooling	      	| 2x2 stride,  outputs 14x14x6  				 |
| Convolution   	    | 1x1 stride, valid padding, outputs 10x10x16	 |
| RELU					|												 |
| Max pooling	      	| 2x2 stride,  outputs 5x5x16  				     |
| Flatten	      		| outputs 400  				    			     |
| Fully connected		| outputs 200  				    			     |
| RELU					|												 |
| Dropout				| keep_prob = 0.67								 |
| Fully connected		| outputs 84  				    			     |
| RELU					|												 |
| Dropout				| keep_prob = 0.67								 |
| Fully connected		| outputs 43  				    			     |
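
The output sizes in the table can be verified with a little arithmetic. In the sketch below, the 5x5 convolution kernel is inferred from the sizes in the table (32 to 28 and 14 to 10 with valid padding and stride 1), and the 2x2 pooling window is assumed; neither is stated explicitly in the table. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of a convolution or pooling with valid padding."""
    return (size - kernel) // stride + 1

def lenet_shapes(channels):
    """Layer output shapes and trainable parameter count for the table above."""
    shapes, params = [], 0
    size, depth = 32, channels
    for filters in (6, 16):
        params += 5 * 5 * depth * filters + filters  # conv weights + biases
        size = conv_out(size, kernel=5)
        shapes.append((size, size, filters))
        size = conv_out(size, kernel=2, stride=2)    # max pooling
        shapes.append((size, size, filters))
        depth = filters
    units = size * size * depth                      # flatten
    shapes.append((units,))
    for out in (200, 84, 43):
        params += units * out + out                  # dense weights + biases
        units = out
        shapes.append((out,))
    return shapes, params

shapes, params = lenet_shapes(channels=3)
print(shapes)
print(params)
```

Under these assumptions, most of the parameters sit in the first fully connected layer (400 x 200 weights), which is also where dropout is applied first.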


**Note:** The graph of the model can be found using Tensorboard: [graph](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_architecture_tensorboard.png) 


##### 2.2.2. How the model has been trained

In this part, the discussion includes the type of optimizer, the batch size, number of epochs and any hyperparameters such as learning rate.

To train the model, I used the Adam optimizer. Adam stands for Adaptive Moment Estimation. It often converges faster than other optimizers such as plain gradient descent, stochastic gradient descent, mini-batch gradient descent, momentum or Nesterov accelerated gradient. For more details, see this great short [video](https://www.youtube.com/watch?v=nhqo0u1a6fw).   
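
For readers who want to see what Adam does at each step, here is a minimal NumPy sketch of the update rule applied to a toy function, f(w) = w². The learning rate is the one from this writeup; the values of `beta1`, `beta2` and `eps` are the commonly used defaults and are assumptions here, and the actual training relies on the framework's optimizer, not on this function.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=8.5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step number, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
history = []
for t in range(1, 201):
    grad = 2 * w                             # gradient of f(w) = w**2
    w, m, v = adam_step(w, grad, m, v, t)
    history.append(float(w[0]))
print(history[0], history[-1])
```

On the first step the update has a magnitude close to the learning rate whatever the size of the gradient, which is part of what makes Adam less sensitive to the scale of the loss.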

I used ReLU as the activation function for the hidden layers. I also implemented and tested leaky ReLU, but ReLU gave me the best validation and test accuracies. ReLU is also generally preferred over other activation functions such as sigmoid and hyperbolic tangent (tanh). I did not try the Maxout function. For more details, see this short [video](https://www.youtube.com/watch?v=-7scQpJT7uo).

I used the dropout technique as a regularization method, with a keep probability of 0.67 (`keep_prob` in the table above).
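
As a reminder of what dropout does during training, here is a minimal NumPy sketch of "inverted" dropout, where the kept activations are rescaled by `1 / keep_prob` so that their expected value is unchanged. It is an illustration only; in the model, the framework's dropout layer is used.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 200))
dropped = dropout(activations, keep_prob=0.67, rng=rng)
print(np.unique(dropped), dropped.mean())
```

At evaluation time dropout is switched off, i.e. the keep probability is set to 1.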

Regarding the other hyperparameters:   

| Hyperparameter    	| Value  |   
|:---------------------:|:------:|   
| LEARNING RATE       	| 8.5e-4 |   
| EPOCHS   	            | 100	 |   
| BATCH SIZE            | 100	 |

##### 2.2.3. The approach taken for finding a solution

###### 2.2.3.0. Description of the general approach

It was an iterative approach.   

As a prelude, I used the Lenet5 architecture with the initial RGB images as input data and arbitrary hyperparameter values.   
I chose this starting point for two reasons:   
- Firstly, the initial measure gave me a benchmark value of the validation accuracy that would serve as a point of comparison when optimizing the architecture performance. At this stage, I got a validation accuracy of 91.7%, a test accuracy of 90.3% and a loss of 0.456 with a learning rate of 1E-3.  
- Secondly, the LeNet5 architecture is a building block of the Multi-Scale Convolutional Network. ![ MultiScaleConvNet](https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-08/9ab0de951cc9cdf16887b1f841f8da6affc9c0de/1-Figure2-1.png)   
The latter is reported to reach an accuracy of 99.17%, above the human performance of 98.81%. In this way, I was able to get a sense of the impact of each parameter on a simpler system before playing with more complex architectures.   

Next, I played with the following parameters to tune the Lenet5 model:   
- learning rate: [1E-3, 2E-3, 9E-4, 1E-4, 1E-5, 8E-4, 9.5E-4, 8.5E-4]
- dropout rate: [0.5, 0.75, 0.25, 0.85, 0.6, 0.8, 0.67]
- preprocessed data types: (centered & normalized) AND [RGB, grayscale, sharpen, histogram, CLAHE]
- the amount of jittered data: [500, 1000, 1500, 2000, 2500, 3000]   
- activation function: [Relu, leakyRelu]

As a result, I got a better performance: a validation accuracy of 97.7%, a test accuracy of 95.8% with a loss of 0.141.

Finally, I played with various changes in the initial architecture and tried other ones.   

###### 2.2.3.1. Initial benchmark values with the Lenet5 architecture

At this stage, I got a validation accuracy of 91.7%, a test accuracy of 90.3% and a loss of 0.456 with a learning rate of 1E-3.   

The fig.14.a shows the model does not overfit or underfit.     

> *fig.14.a: model1 - cost and accuracies measures before, in between and tuning*
![fig.14.a](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/4521055bf8aaa1f74f97d2eb83f0853ac51567f3/images/_TensorBoard_Mod1_variousLearningRate.png)


###### 2.2.3.2. Tuning of the Lenet5 architecture

By varying the learning rate, I was able to get a validation accuracy of 94.4% with a learning rate of 8.5E-4 *(see fig. 15.a)*.   

A learning rate of 8.5E-4 was kept for the subsequent measurements.

> *fig.15.a: model1's cost and accuracies measures with various learning rates*
![fig.15.a]( https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_scatterMod1_variousLearningRate.png)
*Find [here](https://github.com/chatmoon/Traffic_Sign_Classifier/blob/master/jpbk/_SimulationResults_170725-1237.ipynb) the details of the model1's cost and accuracies measure.*   

By varying the dropout rate, I got better performance: a validation accuracy of 94.7% and a test accuracy of 93.5% with a dropout rate of 0.67 *(see fig. 15.b)*.   

> *fig.15.b: model1's cost and accuracies measures with various dropout rates*
![fig.15.b]( https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_scatterMod1_variousDropoutRates.png)
*Find [here](https://github.com/chatmoon/Traffic_Sign_Classifier/blob/master/jpbk/_SimulationResults_170725-1237.ipynb) the details of the model1's cost and accuracies measure.*   

By varying the preprocessed data types used as input, I got an improvement using centered and normalized data, with both grayscale and RGB images *(see fig. 16)*:
- grayscale & centered & normalized: validation accuracy of 96.3%, a test accuracy of 94.5%, a cost of 0.279   
- RGB & centered & normalized: validation accuracy of 94.9%, a test accuracy of 94.7%, a cost of 0.492   

> *fig.16: model1's cost and accuracies measures with various preprocessed data types*
![fig.16](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_scatterMod1_variousPproDataType.png)
*Find [here](https://github.com/chatmoon/Traffic_Sign_Classifier/blob/master/jpbk/_SimulationResults_170725-1237.ipynb) the details of the model1's cost and accuracies measure.*   

By varying the amount of jittered data used as input, I got an improvement using at least 3000 centered and normalized RGB images per class: a validation accuracy of 97.7%, a test accuracy of 95.8%, a cost of 0.141 *(see fig. 17)*.   

> *fig.17: model1's cost and accuracies measures with various amounts of jittered grayscale & RGB data*
![fig.17](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_scatterMod1_variousJitData.png)
*Find [here](https://github.com/chatmoon/Traffic_Sign_Classifier/blob/master/jpbk/_SimulationResults_170725-1237.ipynb) the details of the model1's cost and accuracies measure.*   

###### 2.2.3.3. Other architectures

Finally, after varying the architecture, the LeNet5 architecture still achieved the best performance results *(see fig.18, .19)*.   

> *fig.18: cost and accuracies measures with various architectures*
![fig.18](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_scatterModXNew_result.png)
*Find [here](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_architecture_170711-1428.png) the details of each model.*   

> *fig.19: description of the different architectures that have been tested*
![fig.19](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_architecture_170711-1429.PNG)
*Find [here](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_architecture_170711-1428.png) the details of each model.*  

###### 2.2.3.4. In conclusion

I must say that I am a bit disappointed. Even though I implemented the Multi-Scale Convolutional Network and used various data preprocessing techniques, the performance was not as good as the record of 99.17% reported by the paper. I expected to improve the performance by playing with various architectures, but I was not successful despite several attempts, partly because of some hardware limitations. Only the standardization, normalization and jittering techniques had a significant impact on the validation and test accuracies. 

In conclusion, after building and experimenting seven models, the best validation and test accuracies have been achieved with the Lenet5 architecture, using 32x32x3, jittered, centered and normalized RGB images. **The model yielded a validation accuracy of 97.7%, a test accuracy of 95.8% with a loss of 0.141.**   

   <-- YOU ARE HERE

Use tips from https://machinelearningmastery.com/improve-deep-learning-performance/



*Describe the approach taken for finding a solution and getting the validation set accuracy to be at least 0.93. Include in the discussion the results on the training, validation and test sets and where in the code these were calculated. Your approach may have been an iterative process, in which case, outline the steps you took to get to the final solution and **why you chose those steps**. Perhaps your solution involved an already well known implementation or architecture. In this case, discuss why you think the architecture is suitable for the current problem.*

My final model results were:
* training set accuracy of ?
* validation set accuracy of ? 
* test set accuracy of ?

If an iterative approach was chosen:
* What was the first architecture that was tried and why was it chosen?
* What were some problems with the initial architecture?
* How was the architecture adjusted and why was it adjusted? Typical adjustments could include choosing a different model architecture, adding or taking away layers (pooling, dropout, convolution, etc), using an activation function or changing the activation function. One common justification for adjusting an architecture would be due to overfitting or underfitting. A high accuracy on the training set but low accuracy on the validation set indicates over fitting; a low accuracy on both sets indicates under fitting.
* Which parameters were tuned? How were they adjusted and why?
* What are some of the important design choices and why were they chosen? For example, why might a convolution layer work well with this problem? How might a dropout layer help with creating a successful model?

If a well known architecture was chosen:
* What architecture was chosen?
* Why did you believe it would be relevant to the traffic sign application?
* How does the final model's accuracy on the training, validation and test set provide evidence that the model is working well?

##### 2.2.x. xxxx

---

> *fig.1x: model1's cost and accuracies measures with various hyperparameters and activations*
![fig.1x](https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/master/images/_scatterMod1_result.png)
Find [here](https://github.com/chatmoon/Traffic_Sign_Classifier/blob/master/jpbk/_SimulationResults_170725-1237.ipynb) the details of the model1's cost and accuracies measure.   


> *fig.1y: cost and accuracies measures comparison of different models*
![fig.1y]( https://raw.githubusercontent.com/chatmoon/Traffic_Sign_Classifier/0d70f968fea8dc52d5d942e805d9560ebb89f488/images/_scatterModX_result.png)
Find [here](https://github.com/chatmoon/Traffic_Sign_Classifier/blob/master/jpbk/_SimulationResults_170725-1237.ipynb) the details of the model1's cost and accuracies measure.   

---

![LeNet5](http://eblearn.sourceforge.net/lib/exe/lenet5.png)

![ MultiScaleConvNet](https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-08/9ab0de951cc9cdf16887b1f841f8da6affc9c0de/1-Figure2-1.png)

---
**Notes:**
- Neural network architecture (is the network over or underfitting?)
- Play around preprocessing techniques (normalization, rgb to grayscale, etc)
- Number of examples per label (some have more than others).
- Generate fake data.


Why would we need the zero-padding thing?


**Notes (ResourceExhaustedError):**
"images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32\*32\*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200\*200\*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting." + RAM memory consumption calculation

**Notes:**
Why don't we train on images larger than 32x32?  
> Memory usage. Since the discriminator has fully-connected layers after the convolutions, the output of the last convolution must be flattened to connect to the first fully-connected layer. The size of this output is dependent on the input image size, and blows up really quickly (e.g. For an input size of 64x64, going from 128 feature maps to a fully connected layer with 512 nodes, you need a connection with 64x64x128x512 = 268,435,456 weights). Because of this, training on images larger than 32x32 causes an out-of-memory error (at least on my machine).  
  
> Source: github.com/dyelax/Adversarial_Video_Generation
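
The figures quoted in these notes, and a first estimate of the memory they imply, can be reproduced with the short calculation below. It assumes that each weight is stored as a 4-byte float32, which is a common default but an assumption here; the helper names are hypothetical.

```python
def dense_weights(n_inputs, n_outputs):
    """Number of weights of a fully connected layer (biases not counted)."""
    return n_inputs * n_outputs

def mebibytes(n_values, bytes_per_value=4):
    """Memory needed to store n_values numbers, float32 by default, in MiB."""
    return n_values * bytes_per_value / 2 ** 20

one_neuron_32 = dense_weights(32 * 32 * 3, 1)       # one neuron on a 32x32x3 image
one_neuron_200 = dense_weights(200 * 200 * 3, 1)    # one neuron on a 200x200x3 image
big_layer = dense_weights(64 * 64 * 128, 512)       # example from the second note

print(one_neuron_32, one_neuron_200, big_layer)
print(mebibytes(big_layer))                         # weights only
```

This counts the weights alone. During training, the gradients and the optimizer state (Adam keeps two extra values per weight) take additional memory, so the real footprint is several times larger.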

# ANNEX

[writeup_template.md](https://github.com/udacity/CarND-Traffic-Sign-Classifier-Project/blob/master/writeup_template.md)

#DELETE

| Layer         		|     Description	        					 | Comment |
|:---------------------:|:----------------------------------------------:|:----------------------------------------------:|  
| Input         		| 32x32xch (ch=3 for RGB image, 1 for grayscale) | [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B |
| Convolution        	| 1x1 stride, valid padding, outputs 28x28x6   	 | convolutional layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This results in volume such as [28x28x6] |
| RELU					| 												 | relu layer will apply an elementwise activation function. This leaves the size of the volume unchanged ([28x28x6]) |
| Max pooling	      	| 2x2 stride,  outputs 14x14x6  				 | max pooling layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [14x14x6] |
| Convolution   	    | 1x1 stride, valid padding, outputs 10x10x16	 | |
| RELU					|												 | |
| Max pooling	      	| 2x2 stride,  outputs 5x5x16  				     | |
| Flatten	      		| outputs 400  				    			     | |
| Fully connected		| outputs 200  				    			     | |
| RELU					|												 | |
| Dropout				| keep_prob = 0.67								 | |
| Fully connected		| outputs 84  				    			     | |
| RELU					|												 | |
| Dropout				| keep_prob = 0.67								 | |
| Fully connected		| outputs 43  				    			     | fully-connected layer will compute the class scores, resulting in volume of size [1x1x43], where each of the 43 numbers correspond to a class score, such as among the 43 categories of the Traffic Sign dataset. Each neuron in this layer will be connected to all the numbers in the previous volume |