![open-cv inbuilt dataset](img/06_digits-1024x512.jpeg)

We will use the image above, which ships with the OpenCV samples, as our dataset.

It contains 5000 images in all — 500 images of each digit. 

Each image is 20×20 grayscale with a black background. 

4500 of these digits will be used for training and the remaining 500 will be used for testing the performance of the algorithm.
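
As a rough sketch of how such a sheet could be cut into individual digits and split 4500/500, the snippet below uses NumPy only. The helper names (`split_into_cells`, `shuffled_split`), the zero-filled stand-in sheet, the fixed seed, and the choice to shuffle before splitting are assumptions made for illustration. It also assumes the sheet is laid out as 50 rows by 100 columns of cells, with the digits in order and 500 consecutive cells per class.

```python
import numpy as np

SZ = 20         # side length of one digit image, in pixels
N_CLASSES = 10  # digits 0 through 9


def split_into_cells(sheet, size=SZ):
    """Cut a 2-D sheet of tiled digits into an array of shape (n, size, size)."""
    rows, cols = sheet.shape[0] // size, sheet.shape[1] // size
    grid = sheet[:rows * size, :cols * size].reshape(rows, size, cols, size)
    # Reorder to (row, col, y, x), then flatten the grid in row-major order.
    return grid.swapaxes(1, 2).reshape(-1, size, size)


def shuffled_split(samples, labels, n_train, seed=0):
    """Shuffle samples and labels together, then split off the first n_train."""
    order = np.random.default_rng(seed).permutation(len(samples))
    samples, labels = samples[order], labels[order]
    return (samples[:n_train], labels[:n_train]), (samples[n_train:], labels[n_train:])


# Stand-in for the real sheet: 50 rows x 100 columns of 20x20 cells.
sheet = np.zeros((50 * SZ, 100 * SZ), dtype=np.uint8)
digits = split_into_cells(sheet)
# Assumes the digits appear in order, 500 consecutive cells per class.
labels = np.repeat(np.arange(N_CLASSES), len(digits) // N_CLASSES)
(train_x, train_y), (test_x, test_y) = shuffled_split(digits, labels, n_train=4500)
print(digits.shape, len(train_x), len(test_x))  # (5000, 20, 20) 4500 500
```

With OpenCV installed, the stand-in sheet would be replaced by the real image loaded in grayscale, for example with `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`, where `path` points to your copy of the digits image.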

### Step 1: Deskewing (Pre-Processing)

People often think of a learning algorithm as a black box.

In reality, you can assist the algorithm a bit and notice huge gains in performance. For example, if you are building a face recognition system, aligning the images to a reference face often leads to a quite substantial improvement in performance.

Aligning digits before building a classifier similarly produces superior results. In the case of faces, alignment is rather obvious: you can apply a similarity transformation to an image of a face to align the two corners of the eyes to the corresponding corners of a reference face.

In the case of handwritten digits, an obvious variation in writing among people is the slant of their writing. Some writers have a right or forward slant where the digits are slanted forward, some have a backward or left slant, and some have no slant at all. We can help the algorithm quite a bit by fixing this vertical slant so it does not have to learn this variation of the digits. The image on the left shows the original digit in the first column and its deskewed (fixed) version.

This deskewing of simple grayscale images can be achieved using image moments. An image moment is a particular weighted average of the image pixels' intensities, or a function of such moments, usually chosen to have some attractive property or interpretation. The common kinds are raw moments, central moments, and moment invariants.
OpenCV has an implementation of moments, which comes in handy for calculating useful information such as the centroid, area, and skewness of simple images with black backgrounds.


It turns out that a measure of the skewness is given by the ratio of two central moments, `mu11 / mu02`. The skewness thus calculated can be used to compute an affine transform that deskews the image.
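
To make the ratio concrete, here is a minimal NumPy sketch that computes the two central moments directly from their definitions, with `mu11` as the intensity-weighted sum of `(x - x_bar) * (y - y_bar)` and `mu02` as the weighted sum of `(y - y_bar) ** 2`. The function name `skew_from_moments` and the two synthetic strokes are made up for illustration, and the sketch assumes the image is not blank. The values should agree with the `mu11` and `mu02` entries returned by `cv2.moments`.

```python
import numpy as np


def skew_from_moments(img):
    """Return mu11 / mu02 for a grayscale image with a black background."""
    img = np.asarray(img, dtype=np.float64)
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]  # row (y) and column (x) indices
    m00 = img.sum()                                   # total intensity ("area")
    x_bar = (xs * img).sum() / m00                    # centroid x
    y_bar = (ys * img).sum() / m00                    # centroid y
    mu11 = ((xs - x_bar) * (ys - y_bar) * img).sum()
    mu02 = ((ys - y_bar) ** 2 * img).sum()
    return mu11 / mu02


# An upright vertical stroke: no slant, so the ratio is 0.
upright = np.zeros((20, 20))
upright[4:16, 10] = 255

# A diagonal stroke that moves one pixel in x for every row: the ratio is 1.
slanted = np.zeros((20, 20))
for y in range(4, 16):
    slanted[y, y] = 255

print(skew_from_moments(upright), skew_from_moments(slanted))
```

For a thin stroke, the ratio is the number of pixels the stroke drifts horizontally per row, which is why it can be plugged straight into a shear.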

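An affine transformation has the following properties:

1. The origin does not necessarily map to the origin.
2. Lines map to lines.
3. Parallel lines remain parallel.
4. Ratios of distances along a line are preserved.

The `deskew` function below applies one such transformation, a horizontal shear: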

    import cv2
    import numpy as np

    SZ = 20  # side length of each digit image, in pixels

    def deskew(img):
        m = cv2.moments(img)
        if abs(m['mu02']) < 1e-2:
            # No deskewing needed.
            return img.copy()
        # Calculate the skew from the central moments.
        skew = m['mu11'] / m['mu02']
        # Build the affine transform that corrects the skew.
        M = np.float32([[1, skew, -0.5 * SZ * skew], [0, 1, 0]])
        # Apply the affine transform.
        img = cv2.warpAffine(img, M, (SZ, SZ), flags=cv2.WARP_INVERSE_MAP | cv2.INTER_LINEAR)
        return img
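
The following NumPy-only sketch traces what that matrix does to pixel coordinates. Because `cv2.WARP_INVERSE_MAP` is set, the matrix maps each output pixel back to its source: `x_src = x_dst + skew * (y - SZ / 2)`, with `y` unchanged. The helper names (`inverse_shear_matrix`, `deskewed_positions`) and the synthetic stroke are assumptions for illustration; no image is warped here, only point positions are followed.

```python
import numpy as np

SZ = 20  # side length of a digit image, in pixels


def inverse_shear_matrix(skew, size=SZ):
    """The same 2x3 matrix as in deskew(): maps output (x, y) to source (x, y)."""
    return np.float32([[1, skew, -0.5 * size * skew], [0, 1, 0]])


def deskewed_positions(points_xy, skew, size=SZ):
    """Where source points end up after deskewing (the forward direction)."""
    pts = np.asarray(points_xy, dtype=np.float64)
    x, y = pts[:, 0], pts[:, 1]
    # Inverting x_src = x_dst + skew * (y - size / 2) for x_dst:
    return np.column_stack([x - skew * (y - size / 2.0), y])


# A thin stroke that drifts 0.5 pixel in x per row, so mu11 / mu02 is 0.5.
ys = np.arange(4, 16)
stroke = np.column_stack([6 + 0.5 * ys, ys])

straight = deskewed_positions(stroke, skew=0.5)
print(straight[:, 0])  # every point now has the same x, so the stroke is vertical
```

Rows above the middle of the image shift one way and rows below shift the other, so the shear pivots around `y = SZ / 2` and the digit stays roughly in place.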

### Step 2: Calculate the Histogram of Oriented Gradients (HOG) descriptor

Convert the grayscale image to a feature vector using the HOG feature descriptor.
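
To give a feel for what goes into such a descriptor, here is a deliberately simplified sketch of the core idea for a single cell: compute the gradient at each pixel, then accumulate gradient magnitudes into orientation bins. It is not OpenCV's implementation; it leaves out bin interpolation and block normalization, and the function name `orientation_histogram` is made up for this example.

```python
import numpy as np


def orientation_histogram(cell, nbins=9, signed=True):
    """Magnitude-weighted histogram of gradient directions for one cell."""
    cell = np.asarray(cell, dtype=np.float64)
    gy, gx = np.gradient(cell)               # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    full_circle = 360.0 if signed else 180.0  # signed gradients use the full 0-360 range
    angle = np.degrees(np.arctan2(gy, gx)) % full_circle
    # Hard-assign each pixel to one bin; the modulo wraps an angle of exactly
    # full_circle back to bin 0.
    bins = (angle / full_circle * nbins).astype(int) % nbins
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=nbins)


# A 10x10 cell whose intensity increases from left to right:
# every gradient points along +x, so all the weight lands in bin 0.
ramp = np.tile(np.arange(10), (10, 1))
print(orientation_histogram(ramp))
```

The full descriptor concatenates histograms like this one over all cells in all blocks of the window, after normalizing within each block.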

Gathering information is easy; the difficult part is putting that knowledge into practice.

Part of the difficulty is that many of these algorithms work only after tedious hand-tuning, and it is not obvious how to set the right parameters. For example, in the Harris corner detector, why is the free parameter k set to 0.04? Why not 1 or 2 or 0.34212 instead? Why is 42 the answer to life, the universe, and everything?

As I got more real-world experience, I realized that in some cases you can make an educated guess, but in other cases nobody knows why. People often do a parameter sweep: they change different parameters in a principled way to see what produces the best result. Sometimes the best parameters have an intuitive explanation, and sometimes they don’t.


    import cv2

    # Set to 20×20 (the size of the digit images): one descriptor is calculated for the entire image.
    winSize = (20, 20)
    blockSize = (10, 10)
    blockStride = (5, 5)

    # The cellSize is chosen based on the scale of the features that matter for classification.
    # A very small value would blow up the size of the feature vector, and a very large one
    # may not capture relevant information. A value of 8 could also have been used.
    cellSize = (10, 10)

    nbins = 9
    derivAperture = 1
    winSigma = -1.
    histogramNormType = 0
    L2HysThreshold = 0.2
    gammaCorrection = 1
    nlevels = 64
    signedGradients = True

    hog = cv2.HOGDescriptor(winSize, blockSize, blockStride, cellSize, nbins, derivAperture, winSigma,
                            histogramNormType, L2HysThreshold, gammaCorrection, nlevels, signedGradients)
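
To see how these settings determine the size of the feature vector, the sketch below counts the values in one descriptor: the number of block positions in the window, times the cells per block, times the bins per cell. The function name `hog_descriptor_length` is made up for this example and assumes square windows, blocks, and cells. It mirrors the usual HOG layout rather than calling OpenCV.

```python
def hog_descriptor_length(win, block, stride, cell, nbins):
    """Number of values in a HOG descriptor for one square window (sizes in pixels)."""
    if block % cell != 0 or (win - block) % stride != 0:
        raise ValueError("block must be a multiple of cell, and stride must tile the window")
    blocks_per_side = (win - block) // stride + 1
    cells_per_block_side = block // cell
    return blocks_per_side ** 2 * cells_per_block_side ** 2 * nbins


# Settings used above: 3 x 3 block positions, 1 cell per block, 9 bins.
print(hog_descriptor_length(20, 10, 5, 10, 9))  # 81

# Shrinking the cell size quickly inflates the descriptor.
for cell in (10, 5, 2):
    print(cell, hog_descriptor_length(20, 10, 5, cell, 9))  # 81, 324, 2025
```

With a cell size of 2 the descriptor has 2025 values, far more than the 400 pixels of the image itself, which illustrates the warning about very small cells. With OpenCV available, `hog.getDescriptorSize()` should report the same count for the configured descriptor.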
        