# **Singular Value Decomposition**

Singular Value Decomposition (SVD) is one of the key concepts of linear algebra and machine learning. SVD allows us to extract and untangle information. It goes the whole nine yards of diagonalizing a matrix into special matrices that are easy to manipulate and to analyze.

<br>

## **Singular vectors and Singular values**

The matrices AAᵀ and AᵀA are very special in linear algebra. Given any m × n matrix A, we can multiply it with Aᵀ to form AAᵀ (m × m) and AᵀA (n × n). These matrices have the following properties:

* symmetrical,

* square,

* at least positive semidefinite (eigenvalues are zero or positive),

* both matrices have the same positive eigenvalues,

* both have the same rank r as A.

In addition, the covariance matrices that we often use in ML are in this form. Since they are symmetric, we can choose their eigenvectors to be orthonormal (perpendicular to each other with unit length). This is a fundamental property of symmetric matrices.

![vector-image](https://miro.medium.com/max/1050/1*fDMLG40hhRi4gkQBiPPk5w.jpeg)
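The properties listed above are easy to check numerically. Here is a minimal sketch with NumPy; the random 4 × 3 matrix, the tolerance and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # hypothetical m x n matrix (m=4, n=3)

AAt = A @ A.T                     # m x m
AtA = A.T @ A                     # n x n

# Both products are symmetric.
sym_ok = np.allclose(AAt, AAt.T) and np.allclose(AtA, AtA.T)

# eigvalsh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eig_AAt = np.linalg.eigvalsh(AAt)
eig_AtA = np.linalg.eigvalsh(AtA)

# Positive semidefinite: no eigenvalue is negative (up to rounding error).
psd_ok = eig_AAt.min() > -1e-10 and eig_AtA.min() > -1e-10

# The positive eigenvalues coincide; the larger matrix only has extra (near-)zero ones.
tol = 1e-10
same_pos = np.allclose(eig_AAt[eig_AAt > tol], eig_AtA[eig_AtA > tol])

# A, AAt and AtA share the same rank.
ranks = (np.linalg.matrix_rank(A),
         np.linalg.matrix_rank(AAt),
         np.linalg.matrix_rank(AtA))

# Eigenvectors of a symmetric matrix can be chosen orthonormal (eigh does so).
_, vecs = np.linalg.eigh(AtA)
orthonormal_ok = np.allclose(vecs.T @ vecs, np.eye(3))

print(sym_ok, psd_ok, same_pos, ranks, orthonormal_ok)
```

The check on positive eigenvalues uses a small threshold because the zero eigenvalue of AAᵀ comes out as a tiny floating-point number rather than an exact zero.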

Let’s introduce some terms that are frequently used in SVD. We name the eigenvectors of AAᵀ as uᵢ and those of AᵀA as vᵢ, and call these sets of eigenvectors u and v the singular vectors of A. Both matrices have the same positive eigenvalues. The square roots of these eigenvalues are called singular values.  
We have not explained much so far, but let’s put everything together first; the explanations will come next. We concatenate the vectors uᵢ into U and vᵢ into V to form orthogonal matrices.  
![g](https://miro.medium.com/max/1050/1*WNk8KMCbWeEg8rvNBM2gpg.gif)

Since these vectors are orthonormal, it is easy to prove that U and V obey
![r](https://miro.medium.com/max/1050/1*OoMBe1LoSziciLoWpzGAdQ.jpeg)

## **SVD**

Let’s start with the hard part first. SVD states that any matrix A can be factorized as:
![e](https://miro.medium.com/max/1050/1*WIAHFWAg03FFvNJdun9W0w.jpeg)

where U and V are orthogonal matrices whose columns are orthonormal eigenvectors chosen from AAᵀ and AᵀA respectively. S is a diagonal matrix whose r nonzero diagonal elements are the square roots of the positive eigenvalues of AAᵀ or AᵀA (both matrices have the same positive eigenvalues anyway). These diagonal elements are the singular values.

![a](https://miro.medium.com/max/1050/1*MJNokn4E9dBLrncETGIhHg.jpeg)

i.e. an m × n matrix can be factorized as:
![f](https://miro.medium.com/max/1050/1*tmVzY_1k9_JpxyKDkXwAQA.jpeg)

![z](https://miro.medium.com/max/1050/1*n5pHiLBbkM2kza1OFUsLNg.jpeg)

We can arrange eigenvectors in different orders to produce U and V. To standardize the solution, we order the eigenvectors such that vectors with higher eigenvalues come before those with smaller values.
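To see the factorization in action, here is a small sketch using `numpy.linalg.svd`. The 5 × 3 random matrix is a made-up input; the point is only to check A = USVᵀ, the orthogonality of U and V, and the descending order of the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # hypothetical 5 x 3 matrix

# full_matrices=True returns U (5 x 5) and Vt (3 x 3); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the 5 x 3 matrix S with the singular values on its diagonal.
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

reconstructs = np.allclose(U @ S @ Vt, A)          # A = U S V^T
u_orthogonal = np.allclose(U.T @ U, np.eye(5))     # U^T U = I
v_orthogonal = np.allclose(Vt @ Vt.T, np.eye(3))   # V^T V = I
descending = bool(np.all(np.diff(s) <= 0))         # largest singular value first

# Singular values are the square roots of the eigenvalues of A^T A.
eig = np.linalg.eigvalsh(A.T @ A)[::-1]            # reorder to descending
roots_match = np.allclose(np.sqrt(eig), s)

print(reconstructs, u_orthogonal, v_orthogonal, descending, roots_match)
```

Note that NumPy returns Vᵀ rather than V, and the singular values as a 1-D array rather than a full matrix.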


## **Example**

Before moving forward, let's demonstrate what we have learned so far with a simple example.

![a](https://miro.medium.com/max/1050/1*4DncmDEnF9SIYTTrDR0Adw.png)

These matrices are at least positive semidefinite (all eigenvalues are positive or zero). As shown, they share the same positive eigenvalues (25 and 9). The figure below also shows their corresponding eigenvectors.

![b](https://miro.medium.com/max/1050/1*RG2fiVNyPDr77eS4qsAjLg.jpeg)

The singular values are the square roots of the positive eigenvalues, i.e. 5 and 3. Therefore, the SVD of A is

![c](https://miro.medium.com/max/1050/1*xnlAa8E-c63HnMcTRcb5HA.jpeg)
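We can reproduce these numbers with NumPy. The matrix below is an assumption: it is a 2 × 3 matrix chosen because its AAᵀ has eigenvalues 25 and 9 as stated in the text, and it may not be the exact matrix shown in the figure.

```python
import numpy as np

# Assumed example matrix (chosen to match the eigenvalues 25 and 9 quoted in the text).
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

eig_AAt = np.linalg.eigvalsh(A @ A.T)    # ascending order
eig_AtA = np.linalg.eigvalsh(A.T @ A)    # ascending order, includes one zero
U, s, Vt = np.linalg.svd(A)

print(np.round(eig_AAt, 6))   # eigenvalues of A A^T
print(np.round(eig_AtA, 6))   # eigenvalues of A^T A
print(np.round(s, 6))         # singular values of A
```

AᵀA is 3 × 3, so it carries one extra eigenvalue, which is zero; the positive eigenvalues are shared.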


## **Proof of SVD**

To prove SVD, we want to solve for U, S, and V with:

![d](https://miro.medium.com/max/1050/1*ZrydTbVPycTfNGULVTPSeg.jpeg)

We have 3 unknowns. Hopefully, we can solve for them with the 3 equations above. The transpose of A is

![e](https://miro.medium.com/max/1050/1*CB2zCTb7mZwSiy7Q49mcUA.jpeg)

Knowing

![f](https://miro.medium.com/max/1050/1*otGyHsvtFLtjW42iKKaTkQ.jpeg)

We compute AᵀA,

![g](https://miro.medium.com/max/1050/1*qCSYDYEB1CEVSXmZfQ4BkA.jpeg)

The last equation is equivalent to the eigenvector definition for the matrix AᵀA. We just put all the eigenvectors in a matrix.

![h](https://miro.medium.com/max/1050/1*tTHpTWQtjb6uxqremadCZQ.jpeg)

where VS² equals

![i](https://miro.medium.com/max/1050/1*Yq7bXDClX1Qwe42M-t4djw.jpeg)

V holds all the eigenvectors vᵢ of AᵀA and S holds the square roots of all eigenvalues of AᵀA. We can repeat the same process for AAᵀ and come back with a similar equation.

![j](https://miro.medium.com/max/1050/1*vpWeR_dAqd4gnI9IxnJMOQ.gif)

Hence, proved.
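As a sanity check rather than a proof, the two eigenvector equations from the derivation can be verified numerically. This is a sketch on a made-up random matrix, using the reduced SVD returned by NumPy.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))                     # hypothetical matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # U: 4 x 3, Vt: 3 x 3
V = Vt.T
S2 = np.diag(s**2)                                  # squared singular values

# (A^T A) V = V S^2: each column v_i is an eigenvector with eigenvalue sigma_i^2.
v_equation_holds = np.allclose(A.T @ A @ V, V @ S2)

# (A A^T) U = U S^2: the same holds for the columns u_i.
u_equation_holds = np.allclose(A @ A.T @ U, U @ S2)

print(v_equation_holds, u_equation_holds)
```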

## **Reformulate SVD**

Since matrix V is orthogonal, VᵀV equals I. We can rewrite the SVD equation as:

![k](https://miro.medium.com/max/1050/1*0LEG-KOZYYYsaXnQxMCkXA.gif)

This equation establishes an important relationship between uᵢ and vᵢ.  
<br>

Recall

![l](https://miro.medium.com/max/1050/1*3xIyvsDwrkGa_YL87eBtBg.gif)

Apply AV = US,

![m](https://miro.medium.com/max/1050/1*KGcqnL20ihPN4RDyLhQzXA.jpeg)

This can be generalized as,

![n](https://miro.medium.com/max/1050/1*77EJGbPLWXtUtYx6-a4Gdg.gif)

Recall,

![o](https://miro.medium.com/max/1050/1*fWBhde6AExVN42MgkJIG9A.jpeg)

and

![p](https://miro.medium.com/max/1050/1*CTcIbBs6SVx_Zv_RhnL0xQ.gif)

The SVD decomposition can be recognized as a series of outer products of uᵢ and vᵢ.

![q](https://miro.medium.com/max/1050/1*0n2-o06c_j42d0MJo7igYQ.gif)

This formulation of SVD is the key to understanding the components of A. It provides an important way to break down an m × n array of entangled data into r components. Since uᵢ and vᵢ are unit vectors, we can even ignore terms (σᵢuᵢvᵢᵀ) with a very small singular value σᵢ.  
<br>

Let’s first reuse the earlier example and show how it works.

![r](https://miro.medium.com/max/1050/1*4DncmDEnF9SIYTTrDR0Adw.png)

The matrix A above can be decomposed as

![s](https://miro.medium.com/max/1050/1*5Kzrmdw7c7X7hFuvRXTMOg.gif)
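The sum of outer products can be written out directly in code. The sketch below uses a made-up random matrix and rebuilds it term by term from σᵢ, uᵢ and vᵢ.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))                     # hypothetical matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each term sigma_i * u_i * v_i^T is a rank-1 matrix with the same shape as A.
terms = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]
A_rebuilt = sum(terms)

term_ranks = [int(np.linalg.matrix_rank(t)) for t in terms]

print(np.allclose(A_rebuilt, A))
print(term_ranks)
```

Every term has rank 1, so a rank-r matrix is literally a weighted sum of r rank-1 pieces.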

## **Moore-Penrose Pseudoinverse**

For a system of linear equations, we can compute the inverse of a square matrix A to solve for x.

![t](https://miro.medium.com/max/1050/1*a1inq-_XL9WHTCsxzHpamQ.jpeg)

But not all matrices are invertible. Also, in ML, an exact solution is unlikely to exist when the data contains noise. Our objective is to find the model that best fits the data. To find the best-fit solution, we compute a pseudoinverse

![u](https://miro.medium.com/max/1050/1*PbhfkyF7h7_e6bbWp6uPIw.jpeg)

which minimizes the least-squares error below.

![v](https://miro.medium.com/max/1050/1*aQ2MqySUZnIbCIrxfH1G8Q.png)

And the solution for x can be estimated as,

![w](https://miro.medium.com/max/1050/1*lF1z-LodZHA3834kseswYw.jpeg)

In a linear regression problem, x is our linear model, A contains the training data and b contains the corresponding labels. We can solve for x by

![x](https://miro.medium.com/max/1050/1*ClzObIIjZyQDb9svX4FjjQ.jpeg)

![y](https://miro.medium.com/max/1050/1*mu9Z_NVYd_3CXwhbERVB8Q.jpeg)

Here is an example,
![z](https://miro.medium.com/max/1050/1*xxatolWVNPjMCUEEWLfvyg.jpeg)
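Here is a rough sketch of the pseudoinverse built from the SVD, assuming the usual construction: invert the nonzero singular values and swap the roles of U and V. The training data, the "true" model and the noise level are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 2))          # hypothetical training data: 6 samples, 2 features
x_true = np.array([2.0, -1.0])           # hypothetical model used to generate the labels
b = A @ x_true + 0.01 * rng.standard_normal(6)   # noisy labels

# Pseudoinverse from the SVD: invert only the nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

x = A_pinv @ b                           # least-squares estimate of x

matches_pinv = np.allclose(A_pinv, np.linalg.pinv(A))
matches_lstsq = np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])

print(matches_pinv, matches_lstsq)
print(x)
```

In practice `numpy.linalg.pinv` or `numpy.linalg.lstsq` does this for you; the manual version is only there to show where the SVD enters.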

## **Variance & covariance**

In ML, we identify patterns and relationships. How do we identify the correlation between properties in data? Let’s start the discussion with an example. We sample the height and weight of 12 people and compute their means. We zero-center the original values by subtracting the mean from each of them. For example, matrix A below holds the zero-centered height and weight.

![aa](https://miro.medium.com/max/1050/1*QPWbq3WpMHSHluFLK2yXyg.gif)

As we plot the data points, we can recognize height and weight are positively related. But how can we quantify such a relationship?

![ab](https://miro.medium.com/max/1050/1*dQnsbaMC-WtRc6LOvIToZg.gif)

First, how does a property vary? We probably learned about variance in high school. Let’s introduce its cousin. Sample variance is defined as:

![ac](https://miro.medium.com/max/1050/1*2QGDPgiLjHTtERCIe2l1IQ.gif)

Note that it is divided by n-1 instead of n as in the variance. With a limited number of samples, the sample mean is itself computed from the samples and is therefore correlated with them. The average squared distance from this mean will be smaller than that from the true population mean. The sample variance S², divided by n-1, compensates for the smaller value and can be proven to be an unbiased estimate of the variance σ².
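The difference between the two divisors is easy to see on a tiny example. The five heights below are made-up numbers; in NumPy the `ddof` argument of `np.var` selects the divisor n - ddof.

```python
import numpy as np

heights = np.array([170.0, 165.0, 180.0, 175.0, 160.0])   # hypothetical heights in cm
n = len(heights)
centered = heights - heights.mean()                        # zero-centered values

divided_by_n = np.sum(centered**2) / n                     # divisor n
sample_variance = np.sum(centered**2) / (n - 1)            # divisor n - 1

print(divided_by_n, np.var(heights))             # np.var uses ddof=0 by default
print(sample_variance, np.var(heights, ddof=1))  # ddof=1 gives the n - 1 divisor
```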

## **Covariance matrices**

Variance measures how a single variable varies, while covariance measures how two variables (a and b) vary together.

![ad](https://miro.medium.com/max/1050/1*ekRCblqz5Q0ItHb6bsMe8g.gif)

We can hold all these possible combinations of covariance in a matrix called the covariance matrix Σ.

![ae](https://miro.medium.com/max/1050/1*6121tc3mQ3d-uDy9DQEBXA.jpeg)

We can rewrite this in a simple matrix form.

![af](https://miro.medium.com/max/1050/1*9lVXIfqEHQ_C-U4ZEzexrw.jpeg)

The diagonal elements hold the variances of individual variables (like height) and the non-diagonal elements hold the covariance between two variables. Let’s compute the sample covariance now.

![ag](https://miro.medium.com/max/1050/1*i6JeWI_q60qQBt2S4zssBA.gif)

![ah](https://miro.medium.com/max/1050/1*ivMbXad2byovZTyg7HgtKQ.gif)

The positive sample covariance indicates that weight and height are positively correlated. It will be negative if they are negatively correlated and close to zero if they are independent.

![ai](https://miro.medium.com/max/1050/1*BTh8aHR7z7F_QLv44wDmEQ.gif)
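The sample covariance matrix can be computed in a couple of lines. This sketch assumes the data matrix holds one variable per row and one person per column; the six measurements are made up and are not the values from the figures.

```python
import numpy as np

# Hypothetical measurements for 6 people (not the data from the figures).
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])   # cm
weight = np.array([55.0, 62.0, 64.0, 72.0, 75.0, 83.0])         # kg

X = np.vstack([height, weight])              # 2 x n: one row per variable
A = X - X.mean(axis=1, keepdims=True)        # zero-center each row
n = A.shape[1]

cov = A @ A.T / (n - 1)                      # sample covariance matrix

print(cov)
print(np.allclose(cov, np.cov(X)))           # np.cov also divides by n - 1 by default
```

The diagonal holds the two sample variances, and the off-diagonal entry is positive because taller people in this toy data are also heavier.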

### **Covariance matrix & SVD**

We can use SVD to decompose the sample covariance matrix. Since σ₂ is relatively small compared with σ₁, we can even ignore the σ₂ term. When we train an ML model, we can perform a linear regression on the weight and height to form a new property rather than treating them as two separate and correlated properties (entangled data usually makes model training harder).

![aj](https://miro.medium.com/max/1050/1*SCI5suKmi_Zc9ue639UwYw.gif)

u₁ is of particular importance: it is the principal component of S.

![ak](https://miro.medium.com/max/1050/1*rHqx1h2H6qbNULrMKAMe5Q.gif)

A sample covariance matrix has a few properties in the context of SVD:

* The total variance of the data equals the trace of the sample covariance matrix S, which equals the sum of the eigenvalues of S, i.e. the sum of the squared singular values of the zero-centered data matrix (scaled by 1/√(n-1)). Equipped with this, we can calculate the ratio of variance lost if we drop the smaller σᵢ terms. This reflects the amount of information lost if we eliminate them.

![al](https://miro.medium.com/max/1050/1*IwKErbFqWgpjsjqT7NSLMg.gif)

* The first eigenvector u₁ of S points to the most important direction of the data. In our example, it quantifies the typical ratio between weight and height.

![am](https://miro.medium.com/max/659/1*J8jP6pTszPxGCPpE7Fbs1A.jpeg)

* The error, calculated as the sum of the squared perpendicular distances from the sample points to u₁, is at its minimum when SVD is used.
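The first two properties can be illustrated with a small simulation. Everything about the data is an assumption: 200 synthetic people whose weight offset is roughly 0.8 times their height offset plus noise. The singular values here are those of the zero-centered data matrix divided by √(n-1), whose squares are the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
# Hypothetical correlated data: weight offsets roughly proportional to height offsets.
h = rng.standard_normal(n) * 10.0
w = 0.8 * h + rng.standard_normal(n) * 2.0
A = np.vstack([h, w])
A = A - A.mean(axis=1, keepdims=True)        # zero-centered, 2 x n

cov = A @ A.T / (n - 1)                      # sample covariance matrix

# Singular values of the scaled data; their squares are variances.
U, s, Vt = np.linalg.svd(A / np.sqrt(n - 1), full_matrices=False)

total_variance = np.trace(cov)
trace_matches = np.isclose(total_variance, np.sum(s**2))

# Fraction of the total variance kept by the first component alone.
kept = s[0]**2 / np.sum(s**2)

# u_1 is the principal direction; its slope approximates the weight/height ratio.
u1 = U[:, 0]
slope = u1[1] / u1[0]

print(trace_matches, kept, slope)
```

Dropping the second component loses only the fraction 1 - `kept` of the variance, which is small here because the two variables are strongly correlated.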

### **Property**

Covariance matrices are not only symmetric but also positive semidefinite. Because variance is positive or zero, uᵀVu below is always greater than or equal to zero. By the energy test, V is positive semidefinite.

![an](https://miro.medium.com/max/1050/1*ngrrIUIre0An1l6LXZZ1wQ.gif)

Therefore,

![ao](https://miro.medium.com/max/1050/1*huxsM0xk-4IjATzVH-AcdA.jpeg)

Often, after some linear transformation A, we want to know the covariance of the transformed data. This can be calculated with the transformation matrix A and the covariance of the original data.

![ap](https://miro.medium.com/max/1050/1*zfcHEnrUVkKqSg2y2MZpAA.gif)
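A quick numerical check of this rule, assuming it takes the standard form: the covariance of the transformed data is A times the original covariance times Aᵀ. The data and the 2 × 2 transformation below are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((2, 500))            # hypothetical data: 2 variables, 500 samples
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])                   # hypothetical linear transformation

cov_X = np.cov(X)                            # covariance of the original data
cov_AX = np.cov(A @ X)                       # covariance measured after transforming

rule_holds = np.allclose(cov_AX, A @ cov_X @ A.T)
print(rule_holds)
```

This means we never need to transform the samples themselves just to learn the new covariance.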

### **Correlation matrix**

A correlation matrix is a scaled version of the covariance matrix. It standardizes (scales) the variables to have a standard deviation of 1.

![aq](https://miro.medium.com/max/1050/1*20um0wwFPjycjwnZr51Fjg.gif)

The correlation matrix is used when variables are on scales of very different magnitudes. Bad scaling may hurt ML algorithms like gradient descent.
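To illustrate the scaling, here is a sketch with two deliberately mismatched units: heights in metres and weights in grams. The synthetic data and the variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data on very different scales: metres versus grams.
height_m = 1.7 + 0.1 * rng.standard_normal(100)
weight_g = 70000 + 60000 * (height_m - 1.7) + 3000 * rng.standard_normal(100)
X = np.vstack([height_m, weight_g])

cov = np.cov(X)
std = np.sqrt(np.diag(cov))                  # standard deviation of each variable
corr = cov / np.outer(std, std)              # divide entry (i, j) by std_i * std_j

print(cov)     # entries differ by many orders of magnitude
print(corr)    # unit diagonal, off-diagonal entries within [-1, 1]
print(np.allclose(corr, np.corrcoef(X)))
```

The covariance entries depend heavily on the units, while the correlation matrix stays the same if we switch to centimetres or kilograms.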

## **Visualization**

So far, we have a lot of equations. Let’s visualize what SVD does and develop the insight gradually. SVD factorizes a matrix A into USVᵀ. Applying A to a vector x (Ax) can be visualized as performing a rotation (Vᵀ), a scaling (S) and another rotation (U) on x.

![ar](https://miro.medium.com/max/1050/1*LwmAwpNTGQ_a7n--n3LQpA.jpeg)

As shown above, the singular vector vᵢ (a column of V) is transformed into:

![as](https://miro.medium.com/max/1050/1*d8c3E0OVl4zKbChXo2GSMA.gif)

Or in the full matrix form

![at](https://miro.medium.com/max/1050/1*q6aiiqcIx9ES2xaxcnxlyw.gif)
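The three steps can be traced one at a time in code. The 2 × 2 matrix and the input vector below are arbitrary choices for illustration. Strictly speaking, an orthogonal matrix can be a rotation or a reflection.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                   # hypothetical 2 x 2 matrix
U, s, Vt = np.linalg.svd(A)

x = np.array([1.0, -2.0])                    # hypothetical input vector

step1 = Vt @ x            # rotate (or reflect) into the v-basis
step2 = s * step1         # scale each coordinate by its singular value
step3 = U @ step2         # rotate (or reflect) into the u-basis
same_as_Ax = np.allclose(step3, A @ x)

# Each right singular vector v_i is mapped to sigma_i * u_i.
maps_v_to_u = all(np.allclose(A @ Vt[i], s[i] * U[:, i]) for i in range(len(s)))

# The same statement in matrix form: A V = U S.
matrix_form = np.allclose(A @ Vt.T, U @ np.diag(s))

print(same_as_Ax, maps_v_to_u, matrix_form)
```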

## **Insight of SVD**

As described before, the SVD can be formulated as

![au](https://miro.medium.com/max/1050/1*0sHvvODvpT9536FwBZfO2A.gif)

Since uᵢ and vᵢ have unit length, the most dominant factor in determining the significance of each term is the singular value σᵢ. We purposely sort σᵢ in descending order. If the singular values become too small, we can ignore the remaining terms (+ σᵢuᵢvᵢᵀ + …).

![av](https://miro.medium.com/max/1050/1*7Q9pXwIeseO0xLwp7CyESA.gif)
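Dropping the small terms gives a low-rank approximation. The sketch below builds a made-up 50 × 40 matrix that is nearly rank 3 and keeps only the first k = 3 terms; the sizes and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical 50 x 40 matrix that is close to rank 3, plus a little noise.
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
A += 0.01 * rng.standard_normal(A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]         # keep only the k largest terms

# The spectral-norm error equals the first dropped singular value.
spectral_error = np.linalg.norm(A - A_k, 2)
error_is_next_sigma = np.isclose(spectral_error, s[k])

# Relative error in the Frobenius norm.
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print(error_is_next_sigma, rel_error)
```

Storing `U[:, :k]`, `s[:k]` and `Vt[:k, :]` takes k(50 + 40 + 1) numbers instead of 50 × 40, which is the idea behind SVD-based compression.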

This formulation has some interesting implications. For example, suppose we have a matrix containing the returns of stocks traded by different investors.

![aw](https://miro.medium.com/max/1050/1*FJXUrl22HERjCUe2dR42mA.gif)

As a fund manager, what information can we get out of it? Finding patterns and structures will be the first step. Maybe we can identify the combination of stocks and investors that have the largest yields. SVD decomposes an m × n matrix into r components, with the singular value σᵢ indicating the significance of each component. Consider this as a way to extract entangled and related properties into fewer principal directions with no correlations.

![ax](https://miro.medium.com/max/1050/1*8CRT2ATi3eivfWLTM9Nfbg.jpeg)

If the data is highly correlated, we should expect many σᵢ values to be small enough to ignore.

![ay](https://miro.medium.com/max/1050/1*tPQPCx5mcp4CA6aHZWx9Ig.jpeg)

In our previous example, weight and height are highly related. If we have a matrix containing the weight and height of 1000 people, the first component in the SVD decomposition will dominate. The u₁ vector indeed demonstrates the ratio between weight and height among these 1000 people as we discussed before.

![az](https://miro.medium.com/max/1050/1*jQNJUMyjard3toQo3sSGrg.jpeg)


## **Applications of SVD**

### **SVD for Image Compression**

How many times have we faced this issue? We love clicking images with our smartphone cameras and saving random photos off the web. And then one day – no space! Image compression helps deal with that headache.

It reduces the size of an image in bytes while keeping an acceptable level of quality. This means that you can store more images in the same disk space than before.

Image compression takes advantage of the fact that only a few of the singular values obtained after SVD are large. You can trim the three matrices based on the first few singular values and obtain a compressed approximation of the original image. Some of the compressed images are nearly indistinguishable from the original by the human eye.

##### **CODE**





```python
# Colab-only: upload the image file from the local machine
from google.colab import files
import cv2
uploaded = files.upload()
```

```
Saving 1_9wAFFiwG9kV3RWnXIo-Rng.jpeg to 1_9wAFFiwG9kV3RWnXIo-Rng (1).jpeg
```

```python
import cv2
import numpy
from PIL import Image

# open the image and split it into 3 matrices, one per channel
# (OpenCV loads channels in B, G, R order, so index accordingly)
imOrig = cv2.imread("1_9wAFFiwG9kV3RWnXIo-Rng.jpeg")
im = numpy.array(imOrig)

aBlue = im[:, :, 0]
aGreen = im[:, :, 1]
aRed = im[:, :, 2]


# FUNCTION DEFINITIONS:

# compress the matrix of a single channel by keeping the k largest singular values
def compressSingleChannel(channelDataMatrix, singularValuesLimit):
    uChannel, sChannel, vhChannel = numpy.linalg.svd(channelDataMatrix, full_matrices=False)
    k = singularValuesLimit

    # rank-k reconstruction: U[:, :k] @ diag(s[:k]) @ Vh[:k, :]
    leftSide = numpy.matmul(uChannel[:, 0:k], numpy.diag(sChannel[0:k]))
    aChannelCompressedInner = numpy.matmul(leftSide, vhChannel[0:k, :])
    # clip to the valid pixel range before casting so values do not wrap around
    aChannelCompressed = numpy.clip(aChannelCompressedInner, 0, 255).astype('uint8')
    return aChannelCompressed


# MAIN PROGRAM:
print('*** Image Compression using SVD - a demo')

# image width and height (hard-coded; they must match the uploaded image
# for the ratio below to be meaningful)
imageWidth = 512
imageHeight = 512

# number of singular values to use for reconstructing the compressed image
singularValuesLimit = 160

aRedCompressed = compressSingleChannel(aRed, singularValuesLimit)
aGreenCompressed = compressSingleChannel(aGreen, singularValuesLimit)
aBlueCompressed = compressSingleChannel(aBlue, singularValuesLimit)

imr = Image.fromarray(aRedCompressed)
img = Image.fromarray(aGreenCompressed)
imb = Image.fromarray(aBlueCompressed)

newImage = Image.merge("RGB", (imr, img, imb))

# original image converted from BGR to RGB, kept for comparison
originalImage = cv2.cvtColor(imOrig, cv2.COLOR_BGR2RGB)
newImage.show()

# CALCULATE AND DISPLAY THE COMPRESSION RATIO
mr = imageHeight
mc = imageWidth

# per channel we store k singular values, k columns of U and k rows of Vh
originalSize = mr * mc * 3
compressedSize = singularValuesLimit * (1 + mr + mc) * 3

print('original size:')
print(originalSize)

print('compressed size:')
print(compressedSize)

print('Ratio compressed size / original size:')
ratio = compressedSize * 1.0 / originalSize
print(ratio)

print('Compressed image size is ' + str(round(ratio * 100, 2)) + '% of the original image ')
print('DONE - Compressed the image! Over and out!')
```

```
*** Image Compression using SVD - a demo
original size:
786432
compressed size:
492000
Ratio compressed size / original size:
0.6256103515625
Compressed image size is 62.56% of the original image 
DONE - Compressed the image! Over and out!
```


> As you can see, we have compressed our image using SVD. There are many more applications like this one where SVD is used.