# **Pre-processing and dissimilarities**
Real-world data are dirty, noisy, huge and partly useless.

## Pre-processing

### Aggregation
Combining two or more attributes (or objects) into a single one for:
* Data reduction;
* Change of scale (e.g. cities aggregated into regions, states, countries, etc.);
* More stable data (aggregated values have reduced variability).


### Sampling
Processing or obtaining the entire dataset could be too expensive or time-consuming. Using a sample will work almost as well as using the entire dataset if the sample is **representative** (has approximately the same properties as the original dataset).
It comes in different flavours:
* **Simple random**: a single random choice given a probability distribution;
* **With replacement**: repeated independent simple random extractions. Easier to implement and to interpret;
* **Without replacement**: repeated simple random extractions, but each extracted element is removed from the population. Nearly equivalent to sampling with replacement if the sample size is a small fraction of the dataset size;
* **Stratified**: split the data according to some criterion, then draw random samples from each partition. It's used to maintain the proportions between the classes.
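
The sampling flavours above can be sketched with Python's standard `random` module. This is a minimal illustration, not a reference implementation: the toy dataset, the `stratified_sample` helper and the 20% fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, rng):
    """Draw the same fraction, without replacement, from each class (hypothetical helper)."""
    strata = defaultdict(list)
    for rec in records:
        strata[label_of(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Keep at least one element per class so that no class disappears.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed, only to make the sketch reproducible

# Toy dataset (assumption): 90 records of class "a" and 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

with_repl = rng.choices(data, k=20)    # with replacement: duplicates are possible
without_repl = rng.sample(data, 20)    # without replacement: each element at most once
strat = stratified_sample(data, lambda rec: rec[0], 0.2, rng)
```

With the stratified draw the 90/10 class proportion of the toy dataset is preserved in the sample, which a plain random draw of 20 elements does not guarantee.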

The choice of the sample size is a tradeoff between data reduction and precision. We need to assess the optimal sample size and the statistical significance of the sample.

The probability of sampling at least one element for each class is independent of the size of the dataset: for given class proportions, it depends only on the sample size.
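
To see why only the sample size matters, the probability can be computed directly. The sketch below assumes equally frequent classes and sampling with replacement (or a dataset much larger than the sample); under those assumptions inclusion-exclusion gives the exact value, and the dataset size never appears. The function name is hypothetical.

```python
from math import comb

def prob_all_classes(n_classes, sample_size):
    """P(the sample contains at least one element of every class).

    Assumptions: classes are equally frequent and elements are drawn
    with replacement. Computed by inclusion-exclusion over the set of
    classes that are missing from the sample.
    """
    return sum(
        (-1) ** j * comb(n_classes, j) * (1 - j / n_classes) ** sample_size
        for j in range(n_classes + 1)
    )

p_small = prob_all_classes(10, 20)   # 10 classes, sample of 20 elements
p_large = prob_all_classes(10, 60)   # same classes, sample of 60 elements
```

Increasing the sample size raises the probability of covering all classes, regardless of how many records the dataset holds.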

### Dimensionality and dimensionality reduction
**Curse of dimensionality**: when dimensionality is very high (20 or more) the data occupy the space very sparsely.
Discrimination on the basis of distance becomes ineffective (the concept of neighbourhood loses meaning).

There are some techniques for dimensionality reduction, used to: avoid the curse of dimensionality, reduce noise, reduce complexity and help visualization.

**Principal component analysis (PCA)**: find projections that capture most of the data variation.
The eigenvectors of the covariance matrix define a new space.
The new dataset keeps only the attributes of this space that capture most of the data variation.
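
A minimal sketch of the idea, restricted to the first principal component only. Instead of a full eigendecomposition it uses power iteration on the covariance matrix, which is a swapped-in technique chosen to keep the example dependency-free; the function names and the toy points are assumptions.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_principal_component(rows, iterations=200):
    """Approximate the dominant eigenvector of the covariance matrix."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    # Arbitrary start vector; assumed not orthogonal to the dominant
    # eigenvector, and the data are assumed not to be constant.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered row onto the direction found: one new attribute.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected

# Toy 2-D points lying roughly along the diagonal (assumption).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction, scores = first_principal_component(points)
```

For these points the direction found is close to the diagonal, so the two original attributes are replaced by a single one (`scores`) that retains most of the variation.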

**Singular value decomposition (SVD)**

**Feature subset selection**: it's a local way to reduce dimensionality by removing redundant or irrelevant attributes.
Different kinds:
* **Brute force**: try all possible subsets as input to the data mining algorithm and measure the effectiveness of the algorithm with the reduced dataset;
* **Embedded approach**: feature selection occurs naturally as part of the mining algorithm (e.g. decision trees);
* **Filter approach**: features are selected before the mining algorithm is run;
* **Wrapper approach**: the mining algorithm is used to choose the best set of attributes (like brute force, but following some heuristic instead of an exhaustive search).

**Feature creation**: new features can capture data characteristics more effectively (e.g. extraction, mapping).

### Discretization
Some algorithms work better with categorical data, and a smaller number of distinct values can let patterns emerge more clearly (there is less noise and randomness too).
* **Continuous to discrete**: thresholding (with a single threshold it's binarization);
* **Discrete with many values to discrete with fewer values**: guided by domain knowledge.

Example (unsupervised):
![](https://i.ibb.co/nMZMtQX/IAP-l.jpg)
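
As a small sketch of thresholding, the snippet below shows binarization with one threshold and an unsupervised equal-width binning. Equal-width binning is just one possible unsupervised strategy, chosen here for illustration; the function names and the `ages` values are assumptions.

```python
def binarize(values, threshold):
    """Continuous to binary: 1 above the threshold, 0 otherwise."""
    return [1 if v > threshold else 0 for v in values]

def equal_width_bins(values, n_bins):
    """Continuous to discrete: split [min, max] into n_bins equal intervals.

    Assumes the values are not all identical (non-zero range).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin index n_bins, so it is clamped
    # into the last valid bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 17, 25, 38, 41, 59, 63]   # toy attribute (assumption)
is_over_40 = binarize(ages, 40)
age_bin = equal_width_bins(ages, 3)
```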

### Attribute transformation
Map the entire set of values of an attribute to a new set according to a function. In general, these transformations change the distribution of values.
Some types:
* **Standardization**: translation with shrinking or stretching, doesn't change the shape of the distribution: $x \to \frac{x-\mu}{\sigma}$;
* **Normalization (min/max)**: domains are mapped into standard ranges:
  * $x \to \frac{x-x_{\min}}{x_{\max} - x_{\min}}$ maps to $[0,1]$;
  * $x \to \frac{x-\frac{x_{\max} + x_{\min}}{2}}{\frac{x_{\max} - x_{\min}}{2}}$ maps to $[-1,1]$.
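
A minimal sketch of these transformations in plain Python. The function names are hypothetical; standardization uses the population standard deviation (`statistics.pstdev`), which is an assumption since the notes do not say which estimator $\sigma$ refers to, and the $[-1,1]$ variant subtracts the midpoint of the range and divides by the half-range $(x_{\max}-x_{\min})/2$.

```python
from statistics import mean, pstdev

def standardize(values):
    """x -> (x - mu) / sigma, with sigma the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Map the values onto [0, 1]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def min_max_symmetric(values):
    """Map the values onto [-1, 1]: subtract the midpoint, divide by the half-range."""
    lo, hi = min(values), max(values)
    mid, half_range = (hi + lo) / 2, (hi - lo) / 2
    return [(v - mid) / half_range for v in values]

heights = [2.0, 4.0, 6.0]   # toy attribute (assumption)
z = standardize(heights)
unit = min_max(heights)
symmetric = min_max_symmetric(heights)
```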

## Similarity and dissimilarity
**Similarity**: measure of how alike two data objects are.
Higher when objects are more alike.
Usually in range $[0,1]$.

**Dissimilarity**: measure of how different two data objects are.
Lower when objects are more alike.
No standard range.

Proximity refers to a similarity or dissimilarity.

![](https://i.ibb.co/yVxBWZr/photo-2020-12-31-10-19-45.jpg)

### Distances
**Euclidean distance $L_2$**: $D$ is the number of attributes, $p_d$ and $q_d$ are the $d$-th attributes of the data objects $p$ and $q$. Normalization is necessary if the scales differ a lot.
$$\text{dist}=\sqrt{\sum_{d=1}^{D}(p_d - q_d)^2}$$

**Minkowski distance $L_r$**: $r$ is a parameter that depends on the dataset or application.
$$\text{dist}=\Bigl({\sum_{d=1}^{D}|p_d - q_d|^r}\Bigr)^{\frac{1}{r}}$$
It's the most general:
* If $r=1$ it's the $L_1$ norm, the so-called *Manhattan distance*. Works better than Euclidean in very high-dimensional spaces;
* If $r=2$ it's the $L_2$ norm, i.e. the Euclidean distance;
* If $r=\infty$ it's the $L_\infty$ norm, the so-called *Chebyshev distance* or *supremum distance*. Considers only the dimension where the difference is maximum. Provides a simplified evaluation, disregarding the dimensions with lower differences.
$$\text{dist}_\infty=\max_{d}|p_d-q_d|$$
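
The three special cases can be checked with one small function. This is a sketch with a hypothetical `minkowski` helper; the points are chosen so that the results are easy to verify by hand.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance L_r between two equal-length sequences."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        # Limit case r = infinity: only the largest difference counts.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (0, 0), (3, 4)
manhattan = minkowski(p, q, 1)          # |3| + |4|
euclidean = minkowski(p, q, 2)          # sqrt(3^2 + 4^2)
chebyshev = minkowski(p, q, math.inf)   # max(|3|, |4|)
```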

**Mahalanobis distance**: more sophisticated, considers the data distribution.
Decreases if, keeping the same Euclidean distance, the segment connecting two points is stretched along a direction of greater variation of the data.
It is described by the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix) $\Sigma$ of the dataset, where $e_{ki}$ is the $i$-th attribute of the $k$-th of the $N$ data objects and $\overline{e_i}$ is the mean of the $i$-th attribute:
$$\Sigma_{ij}=\frac{1}{N-1}\sum_{k=1}^{N}(e_{ki}-\overline{e_i})(e_{kj}-\overline{e_j})$$
$$\text{dist}_m = \sqrt{(p-q)\,\Sigma^{-1}(p-q)^T}$$

The Mahalanobis distance between two points is higher if the data are less spread in the direction connecting them.
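
A minimal sketch of this behaviour, restricted to two attributes so that the inverse of the covariance matrix can be written in closed form. The function names and the toy dataset (spread much more along the first attribute than along the second) are assumptions.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (s_xx, s_xy, s_yy) of a 2-D dataset."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between p and q for a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy   # assumed non-zero (non-degenerate data)
    dx, dy = p[0] - q[0], p[1] - q[1]
    # (p - q) * inverse(cov) * (p - q)^T, using the closed-form 2x2 inverse.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

# Toy dataset: large variation along x, small variation along y.
data = [(-4, 0), (4, 0), (0, -1), (0, 1)]
cov = covariance_2d(data)

along_x = mahalanobis_2d((0, 0), (2, 0), cov)   # Euclidean distance 2
along_y = mahalanobis_2d((0, 0), (0, 2), cov)   # Euclidean distance 2
```

Both pairs of points are at Euclidean distance 2, but `along_y` is larger than `along_x` because the data vary less along the second attribute.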

**Properties of distance (metric)**
* **Positive definiteness**: $\text{dist}(p,q) \ge 0$ $\forall p,q$ and $\text{dist}(p,q) = 0$ iff $p=q$;
* **Symmetry**: $\text{dist}(p,q) = \text{dist}(q,p)$;
* **Triangle inequality**: $\text{dist}(p,q) \le \text{dist}(p,r) + \text{dist}(r,q)$ $\forall p,q,r$.


### Similarities
**Similarity in binary spaces**: given two binary objects $p$ and $q$, consider:
* $M_{00}$ the number of attributes where $p=0$ and $q=0$;
* $M_{01}$ the number of attributes where $p=0$ and $q=1$;
* $M_{10}$ the number of attributes where $p=1$ and $q=0$;
* $M_{11}$ the number of attributes where $p=1$ and $q=1$.

We can define:
* **Simple matching coefficient**: $\text{SMC} = \frac{M_{00}+M_{11}}{M_{00}+M_{01}+M_{10}+M_{11}}$ (number of matches over number of attributes);
* **Jaccard coefficient**: $\text{JC} = \frac{M_{11}}{M_{01}+M_{10}+M_{11}}$ (disregards negative matches).
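
The difference between the two coefficients shows up clearly on sparse binary vectors. The sketch below uses a hypothetical `binary_similarities` helper and two toy vectors that share no 1s; it assumes the vectors have the same length and are not both all zeros (otherwise the Jaccard denominator is zero).

```python
def binary_similarities(p, q):
    """Return (SMC, Jaccard) for two equal-length binary sequences."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    mismatches = sum(1 for a, b in zip(p, q) if a != b)   # M01 + M10
    smc = (m00 + m11) / len(p)
    jaccard = m11 / (mismatches + m11)
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jaccard = binary_similarities(p, q)
```

Here $M_{00}=7$, $M_{01}=2$, $M_{10}=1$ and $M_{11}=0$, so the SMC is high only because of the shared zeros, while the Jaccard coefficient, which disregards them, is zero.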

**Cosine similarity**: $\cos(p,q) = \frac{p \cdot q}{\|p\| \, \|q\|}$ (useful for positive values).

**Extended Jaccard coefficient (Tanimoto)**: $T(p,q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
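
Both measures only need dot products, as the following sketch shows. The helper names and the toy vectors are assumptions; the vectors are assumed to be non-zero.

```python
import math

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cosine(p, q):
    """Cosine of the angle between p and q."""
    return dot(p, q) / (math.sqrt(dot(p, p)) * math.sqrt(dot(q, q)))

def tanimoto(p, q):
    """Extended Jaccard coefficient; reduces to Jaccard on binary vectors."""
    pq = dot(p, q)
    return pq / (dot(p, p) + dot(q, q) - pq)

p = (3, 2, 0, 5)
q = (1, 0, 0, 0)
cos_pq = cosine(p, q)
tan_pq = tanimoto(p, q)
```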

**Properties of similarity**
* $\text{sim}(p,q)=1$ if $p=q$;
* $\text{sim}(p,q)=\text{sim}(q,p)$.

### How to choose the right proximity measure?
* If data are dense and continuous use a **metric measure**;
* If data are sparse and asymmetric use a **similarity measure**.

## Correlation
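Measures the linear relationship between a pair of attributes.
Start from $p=[p_1,\dots,p_n]$ and $q=[q_1,\dots,q_n]$ and standardize them: subtract the mean from each vector and divide by the norm of the centered vector, $p'_k = \frac{p_k-\overline{p}}{\|p-\overline{p}\|}$ (likewise for $q$), to obtain $p'$ and $q'$.
Compute the dot product: $$\text{corr}(p,q) = p' \cdot q' $$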

Independent variables have zero correlation, but the converse is not valid in general: zero correlation only means the absence of a linear relationship between the variables.
Positive values imply a positive linear relationship.
![](https://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png)
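
A minimal sketch of the computation, assuming the Pearson correlation is obtained by centering each vector and scaling it to unit length before taking the dot product. The helper names and the toy vectors are assumptions; the last pair shows a perfectly dependent but non-linear relationship ($q = p^2$) whose correlation is zero.

```python
import math

def unit_centered(v):
    """Subtract the mean and scale to unit length (assumes v is not constant)."""
    m = sum(v) / len(v)
    centered = [x - m for x in v]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def correlation(p, q):
    """Dot product of the standardized vectors."""
    return sum(a * b for a, b in zip(unit_centered(p), unit_centered(q)))

increasing = correlation([1, 2, 3], [2, 4, 6])               # perfect positive linear relation
decreasing = correlation([1, 2, 3], [3, 2, 1])               # perfect negative linear relation
parabola = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # q = p^2, no linear relation
```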

For nominal attributes the dot product cannot be computed, so we use the **symmetric uncertainty**, which exploits the entropy: $$U(p,q) = 2 \frac{H(p) + H(q) - H(p,q)}{H(p) + H(q)}$$

It ranges from $U(p,q)=0$ (complete independence) to $U(p,q)=1$ (complete one-to-one correspondence).
When there is independence the joint entropy is the sum of the individual entropies.
When there is complete correspondence the individual entropies and the joint one are equal.
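
The two extreme cases can be reproduced with a short sketch. Entropies are estimated from observed frequencies, with the joint entropy taken over the pairs of values; the function names and the toy nominal attributes are assumptions, and at least one attribute is assumed to be non-constant (otherwise the denominator is zero).

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(p, q):
    h_p, h_q = entropy(p), entropy(q)
    h_pq = entropy(list(zip(p, q)))   # joint entropy over value pairs
    return 2 * (h_p + h_q - h_pq) / (h_p + h_q)

colour = ["red", "red", "blue", "blue"]
code = ["R", "R", "B", "B"]             # one-to-one relabelling of colour
size = ["big", "small", "big", "small"] # independent of colour in this sample

u_same = symmetric_uncertainty(colour, code)
u_indep = symmetric_uncertainty(colour, size)
```

For `colour` and `code` the joint entropy equals each individual entropy, giving the maximum value; for `colour` and `size` the joint entropy is the sum of the individual ones, giving zero.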