Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 6 changed files with 164 additions and 39 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
$\def \P#1{{ \mathbb{P} \left(#1\right) }}$ | ||
$\def \E#1{{ \mathbb{E} \left(#1\right) }}$ | ||
|
||
# Impurity | ||
|
||
Impurity, in a Machine Learning context, is a numerical value that measures how mixed the labels in a set of data points are: the lower the impurity, the more a single label dominates the set. It is mostly used by decision-making algorithms such as [decision trees](AI%20and%20ML/Unit%202/Supervised%20Learning/Decision%20Trees.md), where it is analogous to a loss or cost function. | ||
|
||
> [!tip] Classification | ||
> | ||
> A good classification model should aim to lower the impurity value. | ||
|
||
An impurity function is denoted by $H(S)$, where $S$ is a dataset. Let $S_k \subseteq S$ be subsets divided by label, | ||
|
||
$$\large | ||
S_k = \set{(x,y) \in S \ | \ y = k}, \quad | ||
\bigcup_k S_k = S | ||
$$ | ||
|
||
then, define $p_k$ to be the probability of picking a point of label $k$ from $S$. | ||
|
||
$$\large | ||
p_k = \frac{|S_k|}{|S|} | ||
$$ | ||
|
||
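As a minimal sketch of the definition above, the proportions $p_k$ can be computed by counting labels. The function name `class_proportions` and the sample labels are illustrative assumptions, not part of the original notes.

```python
from collections import Counter


def class_proportions(labels):
    """Return a dict mapping each label k to p_k = |S_k| / |S|."""
    counts = Counter(labels)  # |S_k| for every label k
    total = len(labels)       # |S|
    return {label: count / total for label, count in counts.items()}


# Hypothetical dataset: only the labels matter for p_k.
proportions = class_proportions(["a", "a", "a", "b"])
print(proportions)
```

The proportions always sum to 1 because the subsets $S_k$ partition $S$.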
## Misclassification Impurity | ||
|
||
The misclassification impurity measures the probability of mislabelling a point of $S$ when every point is assigned the label with the highest cardinality. | ||
|
||
$$\large | ||
H(S) = 1 - \max_k(p_k) | ||
$$ | ||
|
||
> [!example] Example | ||
> | ||
> Given a set $S$ with 4 points, of which 3 have label $y_1$ and 1 has label $y_2$, then the misclassification impurity of $S$ is | ||
> $$\large | ||
> H(S) = 1 - \max_k(p_k)
> = 1 - \max\set{ \frac{3}{4}, \frac{1}{4} } | ||
> = \frac{1}{4} = 25\% | ||
> $$ | ||
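
The example can be reproduced with a short sketch. The function name `misclassification_impurity` and the label strings are assumptions chosen for illustration.

```python
from collections import Counter


def misclassification_impurity(labels):
    """H(S) = 1 - max_k p_k, computed from a non-empty list of labels."""
    counts = Counter(labels)
    # The largest p_k is the share of the most frequent label.
    return 1 - max(counts.values()) / len(labels)


# 4 points: 3 with label y1 and 1 with label y2, as in the example.
impurity = misclassification_impurity(["y1", "y1", "y1", "y2"])
print(impurity)  # 0.25
```
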
## Gini Impurity | ||
|
||
The Gini impurity refines the misclassification impurity by taking all classes into account instead of considering just the one with the highest cardinality. | ||
|
||
$$\large | ||
G(S) = \sum_{k=1}^K p_k (1 - p_k) | ||
$$ | ||
|
||
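The Gini impurity is the probability of mislabelling a point drawn at random from $S$ when the predicted label $\hat Y$ is drawn independently from the same class distribution as the true label $Y$, i.e. $\P{Y = k} = \P{\hat Y = k} = p_k$. | ||
|
||
$$\large | ||
\begin{aligned} | ||
\P{\hat Y \ne Y} | ||
&= \sum_{k=1}^K \P{Y = k} \, \P{\hat Y \ne k} \\ | ||
&= \sum_{k=1}^K p_k (1 - p_k) \\ | ||
&= G(S) | ||
\end{aligned} | ||
$$ | ||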
|
||
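A minimal sketch of the Gini formula, assuming labels are given as a plain list. The name `gini_impurity` is illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """G(S) = sum_k p_k * (1 - p_k), computed from a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * (1 - c / total) for c in counts.values())


# Same 3-versus-1 split used in the misclassification example.
gini = gini_impurity(["y1", "y1", "y1", "y2"])
print(gini)
```

For this split the Gini impurity is $\frac{3}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{3}{4} = \frac{3}{8}$, which is higher than the misclassification impurity of $\frac{1}{4}$ because the minority class also contributes to the sum.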
## Entropy | ||
|
||
Entropy is defined as the amount of *randomness*, or *chaos*, that is present in a set. The formula that defines entropy is the following. | ||
|
||
$$\large | ||
H(X) \doteq - \sum_{x \in X} p_x \log_2 p_x | ||
$$ | ||
|
||
> [!abstract] Problem generalization | ||
> | ||
> The entropy function is denoted as $H(X)$, and not $H(S)$, because we generalize the problem and suppose that $X$ is a random variable instead of a set of points and labels, but the only thing that changes is the notation. | ||
Like Gini impurity, entropy computes an expectation, this time of a function $h(x)$ that measures the *impact* of a single class. For convenience, $h(x)$ must satisfy a few properties. | ||
|
||
The entropy contribution $h(x)$ of an event should decrease as the event becomes more likely: | ||
- if an event is almost certain to happen, then its entropy should be very small, $p_x \rightarrow 1 \Longrightarrow h(x) \rightarrow 0$; | ||
- if an event is almost impossible, then its entropy should be very big, $p_x \rightarrow 0 \Longrightarrow h(x) \rightarrow \infty$. | ||
|
||
If two events are independent, then the entropies of the two events should add up to the entropy of the joint event: $h(x,y) = h(x) + h(y)$. | ||
|
||
A function that satisfies all the previous properties is the logarithm of the inverse probability. | ||
|
||
$$\large | ||
h(x) = \log_2\frac{1}{p_x} = - \log_2 p_x | ||
$$ | ||
|
||
Hence, the entropy function computes the expectation of $h(x)$ over all events. | ||
|
||
$$\large | ||
\begin{aligned} | ||
\E{h(x)} &= \sum_{x \in X} p_x \, h(x) \\ | ||
&= - \sum_{x \in X} p_x \log_2 p_x \\ | ||
&= H(X) | ||
\end{aligned} | ||
$$ |
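
The expectation above can be sketched in a few lines, assuming the distribution is estimated from label frequencies as in the impurity sections. The name `entropy` and the sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) = -sum_x p_x * log2(p_x), with p_x estimated from label counts."""
    counts = Counter(labels)
    total = len(labels)
    # Only labels that occur are counted, so p_x > 0 and log2 is defined.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


balanced = entropy(["a", "b"])       # two equally likely labels: 1 bit
pure = entropy(["a", "a", "a"])      # a single label: no randomness
print(balanced, pure)
```
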
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# Normalization | ||
|
||
Normalizing the data is an important step before applying any learning algorithm. If the data isn't normalized, its features may live on very different scales, so that the Euclidean distance (or any other [distance metric](/AI%20and%20ML/Unit%202/Distance%20Metrics.md)) is dominated by the features with the largest range and might not be a good metric. | ||
|
||
Moreover, algorithms such as [k-NN](/AI%20and%20ML/Unit%202/Supervised%20Learning/Nearest%20Neighbour.md) can have irregular and non-linear decision boundaries, which can be a sign of overfitting. Normalization can help to smooth the decision boundaries and reduce the risk of overfitting the data. | ||
|
||
## Min-Max Normalization | ||
|
||
*Min-Max* normalization aims to scale all the axes to fit the data in the range $[0,1]^{|\mathcal D|}$, so that all features have equal importance. | ||
|
||
The formula to scale a single axis to the range $[0,1]$ is the following. | ||
|
||
$$\large | ||
x' = \frac{x - x_\min}{x_\max - x_\min} | ||
$$ | ||
|
||
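A minimal sketch of the formula applied to one axis, using plain Python lists. The name `min_max_scale` and the handling of a constant feature (mapped to 0.0) are assumptions, since the formula is undefined when $x_\max = x_\min$.

```python
def min_max_scale(values):
    """Scale one feature (a list of numbers) to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 10.0])
print(scaled)  # [0.0, 0.25, 1.0]
```

To normalize a whole dataset, the same function would be applied to each axis separately.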
## Standard Normalization | ||
|
||
A.k.a. *normal normalization*, the standard normalization assumes that the data is generated by a single Gaussian and rescales it to have zero mean and unit variance on all axes. | ||
|
||
The formula to standardize a single axis is the following. | ||
|
||
$$\large | ||
x' = \frac{x - \mu}{\sigma} | ||
$$ | ||
|
||
This type of normalization centres and rescales each feature independently. Unlike whitening with [PCA](/AI%20and%20ML/Unit%202/Preprocessing/Principal%20Component%20Analysis.md), it does not decorrelate the features. | ||
|
||
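A minimal sketch of standardizing one axis with the standard library. The name `standardize`, the use of the population standard deviation, and the handling of a constant feature are assumptions made for illustration.

```python
import statistics


def standardize(values):
    """Rescale one feature to zero mean and unit variance: (x - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


standardized = standardize([1.0, 2.0, 3.0, 4.0])
print(standardized)
```
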
## Feature Normalization | ||
|
||
In reality, features carry different weights, meaning that some features are more important than others. Some features could even be categorised as *irrelevant features*. | ||
|
||
There are two options to normalize the features. | ||
|
||
1. **Classify good and irrelevant features** | ||
|
||
Assume that the distribution is composed of good features (useful to classification) and irrelevant features (useless to classification). | ||
|
||
Let $\mathcal S_{gt}, \mathcal S_{ir}$ be two sets containing the indices of good features (ground truth) and irrelevant features. Then, the Euclidean metric can be defined in the following way. | ||
|
||
$$\large | ||
d(x,v) = \sqrt{ | ||
\sum_{i \in \mathcal S_{gt} } (x_i - v_i)^2 + | ||
\sum_{j \in \mathcal S_{ir} } (x_j - v_j)^2 | ||
} | ||
$$ | ||
|
||
Learning to recognize which are the irrelevant features and removing them could help increase the accuracy of the algorithm. | ||
|
||
2. **Feature weighting** | ||
|
||
The Euclidean distance treats all features equally, with the same importance, but each axis could carry a different meaning with a different importance (with respect to the others). | ||
|
||
The Euclidean distance can be redefined as a weighted sum of the squared differences on all axes: let $w = [\seq w D]$ be a vector containing the weights for each axis, then the weighted Euclidean distance $d(\cdot, \cdot)$ is defined as follows. | ||
|
||
$$\large | ||
d(x, v) = \sqrt{ \sum_{i=1}^D w_i (v_i - x_i)^2 } | ||
$$ | ||
|
||
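The two options can be sketched with a single function: setting a weight to zero removes an irrelevant feature from the distance, while any other non-negative weight rescales its importance. The name `weighted_euclidean` and the sample vectors are illustrative assumptions.

```python
import math


def weighted_euclidean(x, v, w):
    """d(x, v) = sqrt(sum_i w_i * (v_i - x_i)^2) for non-negative weights w."""
    if not (len(x) == len(v) == len(w)):
        raise ValueError("x, v and w must have the same length")
    return math.sqrt(sum(wi * (vi - xi) ** 2 for xi, vi, wi in zip(x, v, w)))


x, v = [0.0, 0.0], [3.0, 4.0]
plain = weighted_euclidean(x, v, [1.0, 1.0])    # ordinary Euclidean distance
dropped = weighted_euclidean(x, v, [1.0, 0.0])  # second feature treated as irrelevant
print(plain, dropped)  # 5.0 3.0
```
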
## Manifold | ||
|
||
*TK* |
File renamed without changes.