[AI ML U2] Impurity
devExcale committed May 6, 2023
1 parent 647480e commit ef6386c
Showing 6 changed files with 164 additions and 39 deletions.
2 changes: 1 addition & 1 deletion AI and ML/README.md
@@ -16,7 +16,7 @@

[**Machine Learning**](/AI%20and%20ML/Unit%202/Machine%20Learning.md)

- [Principal Component Analysis](/AI%20and%20ML/Unit%202/Principal%20Component%20Analysis.md)
- [Principal Component Analysis](AI%20and%20ML/Unit%202/Preprocessing/Principal%20Component%20Analysis.md)

[**Unsupervised Learning**](/AI%20and%20ML/Unit%202/Unsupervised%20Learning/Unsupervised%20Learning.md)

97 changes: 97 additions & 0 deletions AI and ML/Unit 2/Impurity.md
@@ -0,0 +1,97 @@
$\def \P#1{{ \mathbb{P} \left(#1\right) }}$
$\def \E#1{{ \mathbb{E} \left(#1\right) }}$

# Impurity

Impurity, in a Machine Learning context, is a numerical value that measures how mixed the labels in a set are, and hence how accurate a prediction model can be on that set. It usually refers to models that use decision-making algorithms such as [decision trees](AI%20and%20ML/Unit%202/Supervised%20Learning/Decision%20Trees.md), and it plays a role analogous to a loss or cost function.

> [!tip] Classification
>
> A good classification model should aim to lower the impurity value.

An impurity function is denoted by $H(S)$, where $S$ is a dataset. Let $S_k \subseteq S$ be the subsets of $S$ split by label,

$$\large
S_k = \set{(x,y) \in S \ | \ y = k}, \quad
\bigcup_k S_k = S
$$

then, define $p_k$ to be the probability of picking a point of label $k$ from $S$.

$$\large
p_k = \frac{|S_k|}{|S|}
$$

## Misclassification Impurity

The misclassification impurity measures the probability of misclassifying a point when every point is assigned the most frequent label, i.e. the one with the highest cardinality.

$$\large
H(S) = 1 - \max_k(p_k)
$$

> [!example] Example
>
> Given a set $S$ with 4 points, of which 3 have label $y_1$ and 1 has label $y_2$, the misclassification impurity of $S$ is
> $$\large
> H(S) = 1 - \max_k(p_k)
> = 1 - \max\set{ \frac{3}{4}, \frac{1}{4} }
> = \frac{1}{4} = 25\%
> $$
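
As a minimal sketch (not part of the original notes), the misclassification impurity could be computed from a plain list of labels; the function name and the label encoding are illustrative assumptions.

```python
from collections import Counter

def misclassification_impurity(labels: list) -> float:
    """H(S) = 1 - max_k(p_k), with p_k = |S_k| / |S|."""
    counts = Counter(labels)   # |S_k| for every label k
    total = len(labels)        # |S|
    return 1 - max(c / total for c in counts.values())

# Example from above: 3 points with label y1, 1 point with label y2.
print(misclassification_impurity(["y1", "y1", "y1", "y2"]))  # 0.25
```
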
## Gini Impurity

The Gini impurity is a refinement of the misclassification impurity: it involves all classes, instead of considering only the one with the highest cardinality.

$$\large
G(S) = \sum_{k=1}^K p_k (1 - p_k)
$$

The Gini impurity represents the expected probability of classifying a randomly picked point with the wrong label, when the label is also drawn at random from the label distribution of $S$.

$$\large
\begin{aligned}
\E{K \ne k}
&= \E{1 - \P{K = k}} \\
&= \E{1 - p_k} \\
&= \sum_{k=1}^K (1 - p_k) p_k \\
&= G(S)
\end{aligned}
$$
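
Under the same assumptions as the sketch above, a possible implementation of the Gini impurity:

```python
from collections import Counter

def gini_impurity(labels: list) -> float:
    """G(S) = sum_k p_k * (1 - p_k)."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * (1 - c / total) for c in counts.values())

# Same example set as above: 3/4 * 1/4 + 1/4 * 3/4 = 0.375
print(gini_impurity(["y1", "y1", "y1", "y2"]))
```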

## Entropy

Entropy is defined as the amount of *randomness*, or *chaos*, that is present in a set. The formula that defines entropy is the following.

$$\large
H(X) \doteq - \sum_{x \in X} p_x \log_2 p_x
$$

> [!abstract] Problem generalization
>
> The entropy function is denoted as $H(X)$, and not $H(S)$, because the problem is generalized: $X$ is treated as a random variable instead of a set of points and labels. Only the notation changes.

Like the Gini impurity, entropy computes a kind of expectation, but it uses a function $h(x)$ that measures the *impact* of a single class. For convenience, $h(x)$ must satisfy some properties.

The entropy of an event should be inversely proportional to the likelihood of it happening:
- if an event is almost certain to happen, then its entropy should be very small, $p_x \rightarrow 1 \Longrightarrow h(x) \rightarrow 0$;
- if an event is almost impossible to happen, then its entropy should be very large, $p_x \rightarrow 0 \Longrightarrow h(x) \rightarrow \infty$.

If two events are independent, then we should be able to sum the entropy of both events to get the entropy of the joint events: $h(x,y) = h(x) + h(y)$.

A function that satisfies all the previous properties is the $\log$ function.

$$\large
h(x) = \log\frac{1}{p_x} = - \log p_x
$$

Hence, the entropy function computes the expectation of the impact $h(x)$ of each event.

$$\large
\begin{aligned}
\E{h(x)} &= \sum_{x \in X} p_x \, h(x) \\
&= - \sum_{x \in X} p_x \log_2 p_x \\
&= H(X)
\end{aligned}
$$
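
A comparable sketch for the entropy, using base-2 logarithms as in the formula above (function name assumed for illustration):

```python
import math
from collections import Counter

def entropy(labels: list) -> float:
    """H(X) = - sum_x p_x * log2(p_x)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Same example set as above: -(3/4 * log2(3/4) + 1/4 * log2(1/4)) ≈ 0.811
print(entropy(["y1", "y1", "y1", "y2"]))
```
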
62 changes: 62 additions & 0 deletions AI and ML/Unit 2/Preprocessing/Data Normalization.md
@@ -0,0 +1,62 @@
# Normalization

Normalizing the data is an important step before applying any learning algorithm. If the data isn't normalized, it could be positioned in the high-dimensional space in such a way that the Euclidean distance (or any other [distance metric](/AI%20and%20ML/Unit%202/Distance%20Metrics.md)) might not be a good metric.

Moreover, algorithms such as [k-NN](/AI%20and%20ML/Unit%202/Supervised%20Learning/Nearest%20Neighbour.md) can have irregular and non-linear decision boundaries, which is a sign of overfitting. Normalization is applied to ensure smooth decision boundaries and to reduce the risk of overfitting the data.

## Min-Max Normalization

*Min-Max* normalization aims to scale all the axes to fit the data in the range $[0,1]^{|\mathcal D|}$, so that all features have equal importance.

The formula to scale a single axis to the range $[0,1]$ is the following.

$$\large
x' = \frac{x - x_\min}{x_\max - x_\min}
$$
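
As an illustrative sketch (not part of the original notes), min-max scaling of a data matrix with one feature per column could look like the following NumPy snippet; the helper name is an assumption.

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale every feature (column) of X to the range [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Note: a constant feature (x_max == x_min) would cause a division by zero.
    return (X - x_min) / (x_max - x_min)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(min_max_normalize(X))  # every column now spans [0, 1]
```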

## Standard Normalization

A.k.a. *normal normalization*, standard normalization supposes that the data is generated by a single Gaussian and reshapes it to have zero mean and unit variance on all axes.

The formula to standardize a single axis is the following.

$$\large
x' = \frac{x - \mu}{\sigma}
$$

This type of normalization is closely related to [PCA](/AI%20and%20ML/Unit%202/Preprocessing/Principal%20Component%20Analysis.md), which also centres the data and additionally decorrelates the features.
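
A comparable sketch for per-feature standardization, under the same column-per-feature assumption:

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Shift and scale every feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))  # ≈ [0, 0] and [1, 1]
```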

## Feature Normalization

In reality, features carry different weights, meaning that some features are more important than others; some could even be categorised as *irrelevant features*.

There are two options to normalize the features.

1. **Classify good and irrelevant features**

Assume that the distribution is composed of good features (useful for classification) and irrelevant features (useless for classification).

Let $\mathcal S_{gt}, \mathcal S_{ir}$ be two sets containing the indices of the good features (ground truth) and of the irrelevant features. Then, the Euclidean metric can be decomposed in the following way.

$$\large
d(x,v) = \sqrt{
\sum_{i \in \mathcal S_{gt} } (x_i - v_i)^2 +
\sum_{j \in \mathcal S_{ir} } (x_j - v_j)^2
}
$$

Learning to recognize which features are irrelevant and removing them could help increase the accuracy of the algorithm.

2. **Feature weighting**

The Euclidean distance treats all features as equally important, but each axis could carry a different meaning and a different importance w.r.t. the others.

The Euclidean distance can be redefined as a weighted sum of the squared differences over all axes: let $w = [w_1, w_2, \ldots, w_D]$ be a vector containing the weight of each axis, then the weighted Euclidean distance $d(\cdot, \cdot)$ is defined as follows.

$$\large
d(x, v) = \sqrt{ \sum_{i=1}^D w_i (v_i - x_i)^2 }
$$
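
As a rough sketch (names and data are illustrative), the weighted Euclidean distance could be implemented as follows; setting a weight to zero effectively discards an irrelevant feature, which also covers option 1.

```python
import numpy as np

def weighted_euclidean(x: np.ndarray, v: np.ndarray, w: np.ndarray) -> float:
    """d(x, v) = sqrt( sum_i w_i * (v_i - x_i)^2 )."""
    return float(np.sqrt(np.sum(w * (v - x) ** 2)))

x = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 7.0])
w = np.array([1.0, 1.0, 0.0])  # zero weight: the third feature is ignored
print(weighted_euclidean(x, v, w))  # 1.0
```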

## Manifold

*TK*
40 changes: 3 additions & 37 deletions AI and ML/Unit 2/Supervised Learning/Nearest Neighbour.md
@@ -1,3 +1,5 @@
$\def \seq#1#2{{ {#1}_1, {#1}_2, \ldots, {#1}_{#2} }}$

# Nearest Neighbour

Nearest Neighbour, a.k.a. *k-NN*, is a [supervised](/AI%20and%20ML/Unit%202/Supervised%20Learning/Supervised%20Learning.md) machine learning algorithm used to predict the class of data points.
@@ -57,40 +59,4 @@ $k$ is chosen in an empirical way:
>
> ![KNN - Best Constant Predictor](/assets/knn_best_constant.png)
## Normalization

Normalizing the data is an important step before applying k-NN. If the data isn't normalized, it could be positioned in the high-dimensional space in such a way that the Euclidean distance (or any other distance) might not be a good metric.

Moreover, k-NN usually has irregular and non-linear decision boundaries, which is a sign of overfitting. Normalization is applied to ensure smooth decision boundaries and to reduce the risk of overfitting the data.

### Min-Max Normalization

*Min-Max* normalization aims to scale all the axes to fit the data in the range $[0,1]^{|\mathcal D|}$, so that all features have equal importance.

The formula to scale a single axis to the range $[0,1]$ is the following.

$$\large
x' = \frac{x - x_\min}{x_\max - x_\min}
$$

### Standard Normalization

A.k.a. *normal normalization*, standard normalization supposes that the data is generated by a single Gaussian and reshapes it to have zero mean and unit variance on all axes.

The formula to standardize a single axis is the following.

$$\large
x' = \frac{x - \mu}{\sigma}
$$

This type of normalization is closely related to [PCA](/AI%20and%20ML/Unit%202/Principal%20Component%20Analysis.md), which also centres the data and additionally decorrelates the features.

### Feature Normalization

*TK*

The distance metric is important: the Euclidean distance treats all features as equally important, while some axes could be more important than others (e.g. points on two parallel lines).

In high-dimensional spaces, there can be more irrelevant features.

Solving the problem means learning which features are the good ones and removing the bad ones.
[Data Normalization](AI%20and%20ML/Unit%202/Preprocessing/Data%20Normalization.md)
2 changes: 1 addition & 1 deletion AI and ML/Unit 2/Unsupervised Learning/Clustering.md
@@ -6,7 +6,7 @@ Clustering algorithms attempt to find patterns and similarities in the data base

> [!tip] Principal Component Analysis
>
> [PCA](/AI%20and%20ML/Unit%202/Principal%20Component%20Analysis.md) is a data analysis algorithm that may be run before applying any clustering algorithm: it helps by reducing the dimensionality of the input space while retaining most of the variance in the data (i.e. compressing the data) and by trying to reduce the noise in the input data.
> [PCA](AI%20and%20ML/Unit%202/Preprocessing/Principal%20Component%20Analysis.md) is a data analysis algorithm that may be run before applying any clustering algorithm: it helps by reducing the dimensionality of the input space while retaining most of the variance in the data (i.e. compressing the data) and by trying to reduce the noise in the input data.
## Clustering Algorithms

