# Saving Memory


Running operations can cause new memory to be allocated to host results. This can be undesirable for two reasons:

1. First, we do not want to allocate memory unnecessarily all the time. In machine learning, we might have hundreds of megabytes of parameters and update all of them multiple times per second. Typically, we will want to perform these updates **in place**.
2. Second, we might point at the same parameters from multiple variables. If we do not update in place, other references will still point to the old memory location, making it possible for parts of our code to inadvertently reference stale parameters.

In [22]:
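import torch
y = torch.tensor([3.5])
x = 1.2
# id() identifies the Python object that the name y currently refers to.
print(id(y))
# y = 1.2 + y allocates a new tensor for the result and rebinds y to it.
y = 1.2 + y
print(id(y))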

2232877748616
2232877751896


In [23]:
# Fortunately, performing in-place operations is easy. We can assign the result of an operation 
# to a previously allocated array with slice notation
Z = torch.zeros_like(y)
print('id(Z):', id(Z))
Z[:] = x + y
print('id(Z):', id(Z))

id(Z): 2232877750936
id(Z): 2232877750936


In [24]:
# Slice assignment writes the result into the tensor y already refers to.
print(id(y))
y[:] = 1.2 + y
print(id(y))

2232877751896
2232877751896


In [25]:
# The += operator also updates y in place.
print(id(y))
y += x
print(id(y))

2232877751896
2232877751896
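
The `id` checks above compare Python object identity. As a complementary sketch, the snippet below compares the address of the underlying storage with `Tensor.data_ptr()` and reproduces the stale-reference problem from the second point. The names `params` and `alias` are illustrative assumptions, not part of the original notebook.

```python
import torch

params = torch.tensor([1.0, 2.0, 3.0])
alias = params                      # a second name for the same tensor

# In-place update: the storage address is unchanged and the alias sees the new values.
ptr_before = params.data_ptr()
params += 1.0
same_storage = params.data_ptr() == ptr_before
alias_is_current = torch.equal(alias, params)

# Out-of-place update: `params` is rebound to a new tensor, `alias` keeps the old one.
params = params + 1.0
alias_is_stale = not torch.equal(alias, params)

print(same_storage, alias_is_current, alias_is_stale)
```

After the out-of-place update, `alias` still holds the values from before the second update, which is exactly how stale parameters can sneak into a program.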


# Linear Algebra

## Scalar, Vector, Matrix and Tensor
1. Formally, we call values consisting of just one numerical quantity **scalars**. We write $x \in \mathbb{R}$ to say that $x$ is a real-valued scalar, for example:
\begin{align}
x = 2.5
\end{align}
2. You can think of a **vector** as simply a list of scalar values. We call these values the elements (entries or components) of the vector. We work with vectors via one-dimensional tensors. In general, tensors can have arbitrary lengths, subject to the memory limits of your machine. Much of the literature treats column vectors as the default orientation, and so does this book. We write $X \in \mathbb{R}^n$ to say that the vector $X$ consists of $n$ real-valued scalars:
\begin{equation}
  X = \begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
\end{equation}
3. Just as vectors generalize scalars from order zero to order one, **matrices** generalize vectors from order one to order two. We write $A \in \mathbb{R}^{m \times n}$ to express that the matrix $A$ consists of $m$ rows and $n$ columns of real-valued scalars:
\begin{equation}
A = \begin{bmatrix}
a_{11} & a_{12} & \ldots & a_{1n} \\
a_{21} & a_{22} & \ldots & a_{2n} \\
\vdots & \vdots & \ddots &\vdots \\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{bmatrix}
\end{equation}
- When a matrix has the same number of rows and columns ($m = n$), it is called a **square matrix**.
- Sometimes we want to flip the axes. When we exchange a matrix's rows and columns, the result is called the **transpose** of the matrix, denoted by $A^T$.
- A **symmetric matrix** $A$ is a square matrix that is equal to its transpose: $A = A^T$.

Matrices are useful data structures: they allow us to organize data that have **different modalities of variation**. For example, rows in our matrix might correspond to different houses (data examples), while columns might correspond to different attributes. 

Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, **it is more conventional to treat each data example as a row vector in the matrix.** As we will see in later chapters, this convention enables common deep learning practices.

4. Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. **Tensors** (“tensors” in this subsection refer to algebraic objects) give us a generic way of describing **n-dimensional arrays** with an arbitrary number of axes.
 - Vectors, for example, are **first-order** tensors, and matrices are **second-order** tensors.
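
As a small sketch of the objects described in this section, the snippet below builds tensors of order zero through three and checks the transpose and symmetry definitions. The variable names and the example values are assumptions chosen for illustration.

```python
import torch

scalar = torch.tensor(2.5)                     # order 0: no axes
vector = torch.arange(4.0)                     # order 1: one axis of length 4
matrix = torch.arange(6.0).reshape(3, 2)       # order 2: 3 rows, 2 columns
tensor3 = torch.arange(24.0).reshape(2, 3, 4)  # order 3: three axes

orders = [t.ndim for t in (scalar, vector, matrix, tensor3)]

# Transposing exchanges rows and columns, so the shape (3, 2) becomes (2, 3).
transposed_shape = tuple(matrix.T.shape)

# A symmetric matrix equals its own transpose.
S = torch.tensor([[1.0, 2.0, 3.0],
                  [2.0, 0.0, 4.0],
                  [3.0, 4.0, 5.0]])
is_symmetric = torch.equal(S, S.T)

print(orders, transposed_shape, is_symmetric)
```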

## Basic Properties of Tensor Arithmetic

### Hadamard product $A*B$

<!-- $\odot$ \quad $\oplus$ \quad $\otimes$ \quad $\ominus$ \quad $\oslash$ -->
The elementwise multiplication of two matrices is called their **Hadamard product** (math notation $\odot$). Consider the matrix $A \in \mathbb{R}^{m \times n}$ defined above and a matrix $B \in \mathbb{R}^{m \times n}$ of the same shape:
\begin{equation}
B = \begin{bmatrix}
b_{11} & b_{12} & \ldots & b_{1n} \\
b_{21} & b_{22} & \ldots & b_{2n} \\
\vdots & \vdots & \ddots &\vdots \\
b_{m1} & b_{m2} & \ldots & b_{mn}
\end{bmatrix}
\end{equation}


\begin{equation}
A \odot B = A * B= \begin{bmatrix}
a_{11}b_{11} &  a_{12}b_{12} & \ldots & a_{1n}b_{1n} \\
a_{21}b_{21} & a_{22}b_{22} & \ldots & a_{2n}b_{2n} \\
\vdots & \vdots & \ddots &\vdots \\
a_{m1}b_{m1} & a_{m2}b_{m2} & \ldots & a_{mn}b_{mn}
\end{bmatrix}
\end{equation}
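
In PyTorch the `*` operator on two tensors of the same shape computes this elementwise product. The following is a minimal sketch with made-up example matrices.

```python
import torch

A = torch.arange(6.0).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
B = torch.full((2, 3), 2.0)           # every entry equals 2

hadamard = A * B                      # elementwise; same as torch.mul(A, B)
print(hadamard)
```

Each entry of the result is $a_{ij}b_{ij}$, so here every entry of `A` is simply doubled.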

### Dot Products $torch.dot(x, y)$

Given two vectors $x, y \in \mathbb{R}^d$, their **dot product** $x^Ty$ (or $\langle x, y \rangle$) is a sum over the products of the elements at the same position:
\begin{equation}
x^Ty = \sum_{i = 1}^d x_iy_i
\end{equation}

In [5]:
import torch
y = torch.ones(4, dtype=torch.float32)
y

tensor([1., 1., 1., 1.])

In [7]:
x = torch.tensor([0., 1., 2., 3.])
x

tensor([0., 1., 2., 3.])

In [8]:
torch.dot(x,y)

tensor(6.)

In [9]:
x*y

tensor([0., 1., 2., 3.])

In [10]:
torch.sum(x * y)

tensor(6.)

**Usage: weighted average** Dot products are useful in a wide range of contexts. For example, given a set of values denoted by a vector $x \in \mathbb{R}^d$ and a set of weights denoted by $w \in \mathbb{R}^d$, the weighted sum of the values in $x$ according to the weights $w$ can be expressed as the dot product $x^Tw$. When the weights are non-negative and sum to one (i.e., $\sum_{i=1}^{d}w_i = 1$), the dot product expresses a **weighted average**. After normalizing two vectors to have unit length, their dot product expresses the cosine of the angle between them.
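
Both uses can be sketched in a few lines. The values, weights, and the two vectors `a` and `b` below are assumptions picked so the results are easy to verify by hand.

```python
import torch

values = torch.tensor([2.0, 4.0, 6.0])
weights = torch.tensor([0.5, 0.25, 0.25])   # non-negative and summing to one

# Weighted average: 0.5*2 + 0.25*4 + 0.25*6 = 3.5
weighted_avg = torch.dot(values, weights)

a = torch.tensor([1.0, 0.0])
b = torch.tensor([1.0, 1.0])
# Normalize both vectors to unit length, then take the dot product to get cos(angle).
cos_angle = torch.dot(a / torch.norm(a), b / torch.norm(b))

print(weighted_avg, cos_angle)
```

The two vectors are 45 degrees apart, so `cos_angle` should be close to $1/\sqrt{2}$.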

### Matrix-Vector Products $torch.mv(A, x)$

Recall the matrix $A \in \mathbb{R}^{m \times n}$ and the vector $x \in \mathbb{R}^n$. Writing $a_i^T \in \mathbb{R}^n$ for the $i^{th}$ row of $A$, the matrix-vector product $Ax$ is a column vector of length $m$ whose $i^{th}$ element is the dot product $a_i^Tx$.
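
A short sketch of this definition: `torch.mv` computes the product directly, and stacking the dot product of each row of `A` with `x` should give the same vector. The example matrix and vector are assumptions for illustration.

```python
import torch

A = torch.arange(6.0).reshape(2, 3)    # m = 2 rows, n = 3 columns
x = torch.tensor([1.0, 2.0, 3.0])      # length n = 3

Ax = torch.mv(A, x)                    # result has length m = 2

# The same result, built row by row from dot products.
row_dots = torch.stack([torch.dot(A[i], x) for i in range(A.shape[0])])

print(Ax, row_dots)
```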

### Matrix-Matrix Multiplication $torch.mm(A, B)$

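Say that we have two matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$. Their product $C = AB \in \mathbb{R}^{m \times n}$ has entries $c_{ij} = \sum_{l=1}^{k} a_{il}b_{lj}$, that is, the dot product of the $i^{th}$ row of $A$ with the $j^{th}$ column of $B$.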
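
One way to read a matrix-matrix product is as $n$ matrix-vector products, one per column of $B$, placed side by side. The sketch below checks that reading against `torch.mm` on small example matrices (the specific values are assumptions).

```python
import torch

A = torch.arange(6.0).reshape(2, 3)    # m = 2, k = 3
B = torch.ones(3, 4)                   # k = 3, n = 4

C = torch.mm(A, B)                     # shape (m, n) = (2, 4)

# Column j of C is the matrix-vector product of A with column j of B.
columns = torch.stack(
    [torch.mv(A, B[:, j]) for j in range(B.shape[1])], dim=1
)

print(C.shape, torch.allclose(C, columns))
```

Note that the inner dimensions must agree: the number of columns of `A` has to match the number of rows of `B`.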

### Norms

Informally, the norm of a vector tells us how big a vector is. The notion of **size** under consideration here concerns not dimensionality but rather the magnitude of the components. 


In linear algebra, a vector norm is a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that maps a vector to a scalar and, for any vectors $x, y \in \mathbb{R}^n$, satisfies a handful of properties:
1. If we scale all the elements of a vector by a constant factor $\alpha$, its norm also scales by the absolute value of the same constant factor: $f(\alpha x) = |\alpha| f(x)$.
2. The triangle inequality holds: $f(x + y) \leq f(x) + f(y)$.
3. The norm must be non-negative: $f(x) \geq 0$.
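
The three properties can be spot-checked numerically. The sketch below uses `torch.norm`, which computes the $L_2$ norm of a vector by default, as the function $f$; the vectors and the factor `alpha` are arbitrary assumptions, and a check on one example illustrates the properties rather than proving them.

```python
import torch

f = torch.norm                       # L2 norm by default for a vector
x = torch.tensor([3.0, -4.0])
y = torch.tensor([1.0, 2.0])
alpha = -2.0

scales = torch.isclose(f(alpha * x), abs(alpha) * f(x))   # property 1
triangle = f(x + y) <= f(x) + f(y)                        # property 2
non_negative = f(x) >= 0                                  # property 3

print(scales.item(), triangle.item(), non_negative.item())
```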


You might notice that norms sound a lot like measures of distance:

1. The $L_2$ norm of $x$ (the Euclidean distance of $x$ from the origin) is the square root of the sum of the squares of the vector elements:
\begin{equation}
L_2(x) = \parallel x \parallel_2 = \sqrt{\sum_{i=1}^{n}x_i^2}
\end{equation} 
In deep learning, we work more often with the squared $L_2$ norm.

In [18]:
# In code, we can calculate the L2 norm of a vector as follows.
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

2. The $L_1$ norm is expressed as the sum of the absolute values of the vector elements:
\begin{equation}
L_1(x) = \parallel x \parallel_1 = \sum_{i=1}^{n}|x_i|
\end{equation}

In [19]:
# To calculate the L1 norm, we compose the absolute value function with a sum over the elements.
torch.abs(u).sum()

tensor(7.)

Compared with the $L_2$ norm, the $L_1$ norm is less influenced by outliers. Both the $L_2$ norm and the $L_1$ norm are special cases of the more general $L_p$ norm:
\begin{equation}
L_p(x) = \parallel x \parallel_p = \left(\sum_{i=1}^{n}|x_i|^p\right)^{1/p}
\end{equation}
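
The general formula translates almost directly into code. The helper `lp_norm` below is a hypothetical function written for this sketch, not a PyTorch API; it is compared against `torch.linalg.vector_norm`, whose `ord` argument selects $p$.

```python
import torch

def lp_norm(x: torch.Tensor, p: float) -> torch.Tensor:
    """Compute (sum_i |x_i|^p)^(1/p) directly from the definition."""
    return torch.abs(x).pow(p).sum().pow(1.0 / p)

u = torch.tensor([3.0, -4.0])
l1 = lp_norm(u, 1)   # |3| + |-4| = 7
l2 = lp_norm(u, 2)   # sqrt(9 + 16) = 5
l3 = lp_norm(u, 3)   # (27 + 64)^(1/3)

# The library routine should agree with the hand-written definition.
matches = all(
    bool(torch.isclose(lp_norm(u, p), torch.linalg.vector_norm(u, ord=p)))
    for p in (1, 2, 3)
)
print(l1, l2, l3, matches)
```

Setting $p = 1$ and $p = 2$ recovers the two norms computed earlier for the same vector `u`.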

Analogous to $L_2$ norms of vectors, the Frobenius norm of a matrix $X \in \mathbb{R}^{m \times n}$ is the square root of the sum of the squares of the matrix elements:
\begin{equation}
\parallel X \parallel_F = \sqrt{\sum_{i=1}^m\sum_{j=1}^nx_{ij}^2}
\end{equation}

In [13]:
# Invoking the following function will calculate the Frobenius norm of a matrix.
torch.norm(torch.ones((4, 9)))

tensor(6.)
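
The result above can be cross-checked in three ways, sketched below: from the definition, as the $L_2$ norm of the flattened matrix, and with `torch.linalg.matrix_norm` using `ord='fro'`. The variable names are assumptions for illustration.

```python
import torch

X = torch.ones((4, 9))

from_definition = torch.sqrt((X ** 2).sum())          # sqrt of 36 ones = 6
as_flat_l2 = torch.linalg.vector_norm(X.flatten())    # L2 norm of the 36 entries
builtin = torch.linalg.matrix_norm(X, ord='fro')      # library Frobenius norm

print(from_definition, as_flat_l2, builtin)
```

All three agree because the Frobenius norm treats the matrix as one long vector of its entries.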