# Fundamentals of Machine Learning Theory - Explained

Note: I tend to highlight in **bold** the mathematical or statistical terms that usually have a precise definition. Definitions are everything in mathematics. When you read a term **xyz**, you need to quickly recall its definition so that you can follow the content. I will try to be as clear as I can in turning definitions into easy-to-read, easy-to-interpret content.

-> _Work in Progress_

## Statistical Learning Theory

## 1. How Is the Data Structured?
Let's start with the basics of the basics.

Mathematical concepts are absolutely necessary to understand what Machine Learning really is.<br>
Data is interpreted as a **Collection of Vectors**. Thus, it is a must to introduce the concept of **Vectors**.<br>
However, to introduce **Vectors**, we need to introduce the world in which they live and the properties that allow us to do operations with them (such as addition and multiplication by a number). This is why the concepts of **Field** and **Vector Spaces** are important. 

### 1.1. Field
A **Field** is a fundamental algebraic structure which is widely used in many areas in mathematics.
* A **Field** $\textbf{F}$ is a set containing at least two distinct elements called 0 and 1, along with operations of addition and multiplication (and other well-known properties such as commutativity, associativity, etc.).
    * The elements 0 and 1 are required because they play a special role in the two operations:
        * They are the additive and multiplicative identities, which means $a + 0 = a$ and $a \cdot 1 = a$.
    * It is represented by a bold letter **$\textbf{F}$**.
    * Example: $F = \mathbb{R}$ is the field of real numbers and these "numbers" receive a fancy name: **scalars**.
* A higher-dimensional analogue of a **Field** is represented by $\textbf{F}^{n}$, where **n** is the number of dimensions. An element of $\textbf{F}^{n}$ is a **list of length n**, which is an ordered collection of n elements denoted by $(x_1, ..., x_n)$, where each element lies along its respective dimension. (With component-wise operations, $\textbf{F}^{n}$ for $n > 1$ is a vector space over $\textbf{F}$ rather than a field.)
    *  Set Definition for $\textbf{F}^n$:<br>
    $\textbf{F}^{n} = \{(x_1, ..., x_n) : x_j \in \textbf{F} \text{ for } j = 1, ..., n\}$, i.e., the set of n-dimensional lists.
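
As a minimal sketch of the definitions above, the snippet below checks the identity properties with exact arithmetic and then writes down one element of $\textbf{F}^3$. It uses the rational numbers (via Python's `fractions.Fraction`) instead of $\mathbb{R}$, because floating-point numbers only approximate the reals; that choice and the names `a`, `zero`, `one`, and `x` are assumptions made for illustration.

```python
from fractions import Fraction

# Scalars taken from the field of rational numbers (exact arithmetic).
a = Fraction(3, 4)
zero = Fraction(0)
one = Fraction(1)

# Additive and multiplicative identities: a + 0 = a and a * 1 = a.
assert a + zero == a
assert a * one == a

# Every element has an additive inverse, and every nonzero element
# has a multiplicative inverse.
assert a + (-a) == zero
assert a * (1 / a) == one

# An element of F^3: an ordered list (here a tuple) of three scalars.
x = (Fraction(1, 2), Fraction(-2), Fraction(5, 3))
n = len(x)
```

Here `n` is the length of the list, so `x` is one point of $\textbf{F}^3$ with $\textbf{F}$ taken to be the rationals.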

### 1.2. Vector Space, Dimension, Span, Linear Independence, Vectors

A **Vector Space** is a set of objects called **vectors**, which may be added together and multiplied ("scaled") by numbers called scalars. <br>
Vector spaces are the subject of Linear Algebra and are well-characterized by their **dimensions**, which, roughly speaking, specify the number of **independent directions** in the space.

Quick digression on _**Dimension**_:
* A (real) line has 1 dimension because a single vector is enough to form its **basis**: every point on the line lies in the **span** of that vector.
* A plane has 2 dimensions because exactly 2 vectors are needed to form its **basis**: every point on the plane lies in the **span** of those 2 vectors.

By **Span**, we mean: 
* In 1 dimension: take a nonzero vector $v$ and multiply it by a scalar (real number) $a_1$. Doing this with every possible scalar reaches every point on the real line, i.e., you _span the real line_.
* In 2 dimensions: take 2 vectors $v_1$ and $v_2$ (that form a **basis** of $\textbf{V}$), multiply them by two scalars $a_1$ and $a_2$, and add the results to get $a_1 v_1 + a_2 v_2$. Doing this with every possible pair of scalars reaches every point in the space, i.e., you _span the space_.
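
The 2-dimensional case can be sketched in a few lines of Python. Given two vectors that form a basis of the plane, the code finds the scalars $a_1$ and $a_2$ that reach a chosen target point and then rebuilds that point from them. The basis `v1`, `v2`, the target, and the helper names are hypothetical, and Cramer's rule is just one convenient way to solve the small system.

```python
def linear_combination(a1, v1, a2, v2):
    """Return a1*v1 + a2*v2 for 2-D vectors given as tuples."""
    return (a1 * v1[0] + a2 * v2[0], a1 * v1[1] + a2 * v2[1])


def coefficients(target, v1, v2):
    """Solve a1*v1 + a2*v2 = target with Cramer's rule.

    Only works when v1 and v2 form a basis of the plane.
    """
    det = v1[0] * v2[1] - v1[1] * v2[0]
    if det == 0:
        raise ValueError("v1 and v2 are not linearly independent")
    a1 = (target[0] * v2[1] - target[1] * v2[0]) / det
    a2 = (v1[0] * target[1] - v1[1] * target[0]) / det
    return a1, a2


v1, v2 = (1.0, 1.0), (1.0, -1.0)  # an illustrative basis of the plane
target = (3.0, 5.0)               # any point we want to reach

a1, a2 = coefficients(target, v1, v2)
reached = linear_combination(a1, v1, a2, v2)
```

For this particular basis and target the scalars are $a_1 = 4$ and $a_2 = -1$, and `reached` equals `target`. Changing `target` to any other point of the plane still yields a valid pair of scalars, which is what "spanning the space" means.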

By **Basis**, we mean: 

A list of vectors in $\textbf{V}$ that are **Linearly Independent** and that **span** $\textbf{V}$.

By **Linearly Independent**, we mean:
* Take a list of vectors $x_1, x_2, ..., x_p$.
* If no vector in this list can be generated by a **linear combination** of the other vectors in this list (a linear combination builds a new vector by multiplying vectors by scalars and adding the results), then we say that these vectors are **linearly independent**.
* An example of a **Basis** is the one that forms the basis of the 3-D space:<br> <img src="/img/3d.png" width="120">
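
A minimal sketch of a linear-independence check for three vectors in 3-D space follows. It relies on the standard fact that n vectors in $\mathbb{R}^n$ are linearly independent exactly when the determinant of the matrix they form is nonzero. The standard basis `e1`, `e2`, `e3` and the helper names are assumptions for illustration. Integer entries keep the arithmetic exact; with floating-point entries a small tolerance would be needed instead of `!= 0`.

```python
def det3(u, v, w):
    """Determinant of the 3x3 matrix whose rows are u, v, w."""
    return (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )


def linearly_independent(u, v, w):
    """Three vectors in R^3 are independent iff their determinant is nonzero."""
    return det3(u, v, w) != 0


e1, e2, e3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)  # standard basis of 3-D space
dependent = (1, 1, 0)                          # equals e1 + e2

standard_ok = linearly_independent(e1, e2, e3)     # True
mixed_ok = linearly_independent(e1, e2, dependent)  # False
```

The second check fails because `dependent` is a linear combination of `e1` and `e2`, so the list `e1, e2, dependent` cannot be a basis of the 3-D space.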

Thus, the concept of **Dimension** is connected to the concept of a **Basis** of a **Vector Space**.

* As we mentioned, a **Vector Space** is a set $\textbf{V}$ where you can do two important things among its elements: addition and scalar multiplication.

Because of such properties, we say:
* $\textbf{V}$ is a Vector Space over a Field $\textbf{F}$. 
* Elements are now called **vectors** or **points**.

Although in mathematics a vector is defined as an _Ordered List of <ins>Objects</ins>_, in Machine Learning Theory, a vector is an _Ordered List of <ins>Numbers</ins>_.<br>

Thus, it is assumed (or taken for granted, if you prefer) that the Vector Space $\textbf{V}$ is a **Real Vector Space**, i.e., $\textbf{V} = \mathbb{R}^n$.<br>

A good reason for this is that computers do operations with numbers and Machine Learning is done by computers.

The easiest way to define a vector is to follow what is done in **Statistics** (and also in **Econometrics**):
* A **vector** is defined as a **column-vector**. It is written as a bold character:
$\textbf{x} = \begin{bmatrix}
     x_{1} \\
     x_{2} \\
     \vdots \\
     x_{n}
    \end{bmatrix}$

In addition to this definition, we can simplify the notation by writing vectors horizontally as $\textbf{x} = (x_1, ..., x_n)$, i.e., as an _Ordered List of Objects_ (as mentioned above).<br>

Vectors can also be interpreted as:
* an array of numbers (a computer science view), hence the horizontal writing format,
* an arrow with a direction and magnitude (a physics view),
* and an object that obeys addition and scaling/multiplication (a mathematical view), in the form of a column-vector.
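
The three views can be put side by side in a short Python sketch: a vector stored as a list of numbers (computer science view), its Euclidean length (physics view), and the two operations of addition and scaling (mathematical view). The function names and sample values are hypothetical.

```python
import math


def add(x, y):
    """Component-wise addition of two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [x_k + y_k for x_k, y_k in zip(x, y)]


def scale(a, x):
    """Multiply every component of x by the scalar a."""
    return [a * x_k for x_k in x]


def magnitude(x):
    """Euclidean length of x (the 'arrow' view)."""
    return math.sqrt(sum(x_k * x_k for x_k in x))


x = [3.0, 4.0]
y = [1.0, -2.0]

total = add(x, y)        # [4.0, 2.0]
doubled = scale(2.0, x)  # [6.0, 8.0]
length = magnitude(x)    # 5.0
```

Plain lists are used here only to keep the sketch dependency-free. In practice an array library would normally provide these operations.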

### 1.3. Attributes or Entries, Features or Inputs, Feature Vector 

Now, let's see the most common notation and terminology used in the Statistical Learning / Machine Learning field:

An **element** or **entry** of a vector is called an **example** or an **attribute**.

Each entry is now written as $x_{ij}$, where:
* **i** denotes a specific row of the vector (i.e., the _i_ th recorded value of a particular feature),
* and **j** denotes a particular Feature.<br>

Vectors are now called **Features**.<br>
* If $\textbf{x} \in \mathbb{R}^1$, then $\textbf{x}$ is a **Feature**. It can, for example, represent the height of a person.<br>
Example: n people were surveyed, hence we have n heights.
$\textbf{x} = \begin{bmatrix}
   x_{11} \\
   x_{21} \\
   \vdots \\
   x_{n1}
  \end{bmatrix}_{n\times1}$ 
where the second index denotes a particular feature (in this case, the height). Since there is only one feature, the second index can be omitted, which gives:
$\textbf{x} = \begin{bmatrix}
   x_{1} \\
   x_{2} \\
   \vdots \\
   x_{n}
  \end{bmatrix}_{n\times1}$

* If $\textbf{x} \in \mathbb{R}^2$, then $\textbf{x}$ is a **Feature Vector** representing two possible Features (e.g.: one representing the height of a person and the other for the weight of that person).<br>
  
Example: n people were surveyed, hence we have n heights and n weights.
$\textbf{X} = \begin{bmatrix}
     x_{11}   & x_{12} \\
     x_{21}   & x_{22} \\
     \vdots & \vdots    \\
     x_{n1}   & x_{n2}
    \end{bmatrix}_{n\times2}$    
where the second subscript denotes the feature (in this case, 1 for the height and 2 for the weight).

By bringing the math concepts together, we can say that:

* $x_{ij}$ represents the distance of a point (in the Real Vector Space) from the origin along one of the directions that form the **basis** of that vector space.

* Example: 
Suppose two people were interviewed and asked about their heights in _cm_:<br>
$\textbf{x} = \begin{bmatrix}
     170 \\
     164
\end{bmatrix}$ implies that $x_{11} = 170$ cm and $x_{21} = 164$ cm.

In the most general form, $\textbf{x}$ is a D-dimensional vector, i.e., $\textbf{x} \in \mathbb{R}^D$, with features indexed by $j \in \{1, ..., D\}$. It means that $\textbf{x}$ is a **Feature Vector** with D **Features**. Collecting n such observations gives the matrix:
$\textbf{X} = \begin{bmatrix}
   x_{11} & \cdots & x_{1D}  \\
   \vdots & \ddots & \vdots  \\
   x_{n1} & \cdots & x_{nD}
\end{bmatrix}_{n\times D}$

Sometimes we may be interested in a single row of the matrix $\textbf{X}$. <br>
We say:

A row of $\textbf{X}$ is denoted by $\textbf{x}_i$. It contains the D feature measurements for the _i_ th **observation**:
$\textbf{X} = \begin{bmatrix}
\vdots & \vdots & \vdots \\
x_{i1} & \cdots & x_{iD} \\
\vdots & \vdots & \vdots
\end{bmatrix}_{n\times D}$ 
where $\textbf{x}_i$ is written as a Column-Vector (because vectors by default are represented this way, as mentioned above).

Thus:
$\textbf{x}_i = \begin{bmatrix}
   x_{i1} \\
   x_{i2} \\
   \vdots \\
   x_{iD}
  \end{bmatrix}_{D\times 1}$

Similarly, a column of $\textbf{X}$ is denoted by $\textbf{x}_j$. It contains the n **realizations** (in the statistical sense of realizations of a random variable, which we will see later) of the _j_ th **Feature**:
$\textbf{X} = \begin{bmatrix}
\cdots & x_{1j} & \cdots \\
\cdots & \vdots & \cdots \\
\cdots & x_{nj} & \cdots
\end{bmatrix}_{n\times D}$ 
where $\textbf{x}_j$ is written as a Column-Vector (because vectors by default are represented this way, as mentioned above).

Thus:
$\textbf{x}_j = \begin{bmatrix}
   x_{1j} \\
   x_{2j} \\
   \vdots \\
   x_{nj}
  \end{bmatrix}_{n\times 1}$

We can rewrite $\textbf{X}$ in two ways:

* Using the _j_ th **Feature** representation $\textbf{x}_j$:
$\textbf{X} = \begin{bmatrix} 
   \textbf{x}_{1} & \textbf{x}_{2} & \cdots & \textbf{x}_{D} \end{bmatrix}_{n\times D}$

* Using the _i_ th **Observation** representation $\textbf{x}_i$:
$\textbf{X} = \begin{bmatrix}
   \textbf{x}^T_{1} \\
   \textbf{x}^T_{2} \\
   \vdots \\
   \textbf{x}^T_{n}
  \end{bmatrix}_{n\times D}$
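
To make the notation concrete, here is a small sketch of an $n \times D$ data matrix with $n = 3$ people and $D = 2$ features (height in cm, weight in kg). The numbers are made up for illustration, and the helpers `row` and `column` are hypothetical names. Note that the math uses 1-based indices while Python lists are 0-based, so the helpers shift the index by one.

```python
# Hypothetical survey: each row is one person (observation);
# the columns are height in cm and weight in kg. Values are made up.
X = [
    [170.0, 68.0],
    [164.0, 59.0],
    [181.0, 80.0],
]
n, D = len(X), len(X[0])


def row(X, i):
    """Observation x_i (1-based i): the D measurements of one observation."""
    return list(X[i - 1])


def column(X, j):
    """Feature x_j (1-based j): the n realizations of one feature."""
    return [r[j - 1] for r in X]


# Rebuild X by stacking the observations as rows ...
from_rows = [row(X, i) for i in range(1, n + 1)]

# ... and by placing the feature columns side by side.
columns = [column(X, j) for j in range(1, D + 1)]
from_columns = [list(values) for values in zip(*columns)]

x_21 = X[2 - 1][1 - 1]  # entry x_{21}: second person, first feature
```

Both `from_rows` and `from_columns` reproduce `X`, which mirrors the two ways of writing the matrix: as a stack of transposed observations or as a sequence of feature columns.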


## 2. What Is Statistical Learning and What Is the Critical Difference from Statistical Inference?