# Novelty and Outlier Detection

Many applications need to decide whether a new observation belongs to the same distribution as the existing observations (an inlier) or should be considered different (an outlier). This ability is often used to clean real data sets. Two cases must be distinguished.

| Detection Type | Definition |
|:---:|:---|
| Outlier | The training data contains outliers, which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations. |
| Novelty | The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier or not. In this context an outlier is also called a novelty. |

Outlier detection and novelty detection are both used for anomaly detection, where one is interested in detecting abnormal or unusual observations. Outlier detection is then also known as unsupervised anomaly detection and novelty detection as semi-supervised anomaly detection. In the context of outlier detection, the outliers/anomalies cannot form a dense cluster as available estimators assume that the outliers/anomalies are located in low density regions. On the contrary, in the context of novelty detection, novelties/anomalies can form a dense cluster as long as they are in a low density region of the training data, considered as normal in this context.

Refer to [scikit-learn](https://scikit-learn.org/stable/index.html) for documentation and examples of this application. [Novelty and Outlier Detection](https://scikit-learn.org/stable/modules/outlier_detection.html) is a good place to start.
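To make the distinction between the two modes concrete, here is a minimal sketch using scikit-learn's `LocalOutlierFactor`. The synthetic data, the planted far-away points, and the `n_neighbors=20` setting are assumptions chosen for illustration only; they are not taken from this project.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic 1-D "normal" observations (illustrative assumption)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 1))

# Outlier detection: the training set itself is polluted by a few distant points
polluted = np.vstack([inliers, [[15.0], [-12.0], [20.0]]])
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(polluted)  # +1 = inlier, -1 = outlier
outlier_idx = np.flatnonzero(labels == -1)
print("Outlier indices in the training set:", outlier_idx)

# Novelty detection: fit on clean data only, then judge unseen observations
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
novelty_lof.fit(inliers)
new_points = np.array([[0.1], [25.0]])
new_labels = novelty_lof.predict(new_points)  # +1 = inlier, -1 = novelty
print("Labels for the new observations:", new_labels)
```

In outlier-detection mode the estimator labels the same data it was fitted on through `fit_predict`, whereas with `novelty=True` it is fitted on clean data and `predict` is meant to be called only on new observations.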

The goal of this project is to detect outliers in a one-dimensional residual vector, which reveals possible anomalies in the residual. The residual vector is extracted at each iteration of the CFD solver, and outlier detection is applied to it. The following graph plots the absolute value of the residual in each cell versus the solution iteration for a 2D cell-centered Burgers problem with 502 cells. After iteration 11, some cells behave differently from the others and their residual values grow substantially. These cells contribute to the instability and eventual divergence of the solution.

<img src="./images/burgers-residual.jpg" alt="The absolute value of the residual in each cell is plotted versus solution iteration in a 2D cell-centered Burgers problem with 502 cells"
	title="The absolute value of the residual in each cell is plotted versus solution iteration in a 2D cell-centered Burgers problem with 502 cells" height="700" />

Here, we would like to find the solution iterations in which some cells have outlier residual values. At those iterations, we will apply an optimization procedure to the solution and the numerical grid. The goal is to eliminate the unstable solution modes and obtain a stable result.
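As a rough sketch of what "finding the iterations with outlier cells" could look like, the snippet below scans a residual matrix one iteration at a time. It uses a simple median/MAD rule on the log of the absolute residual rather than a scikit-learn estimator, purely to keep the example short. The function name `outlier_cells`, the threshold of 5.0, and the synthetic residuals (including which cells become unstable and when) are all hypothetical and are not the project's data or method.

```python
import numpy as np

def outlier_cells(residual, threshold=5.0):
    """Return indices of cells whose |residual| is unusually large.

    Uses a one-sided modified z-score (median/MAD) on log10(|residual|).
    The threshold value is an arbitrary illustrative choice.
    """
    log_res = np.log10(np.abs(residual) + 1e-300)  # guard against log10(0)
    median = np.median(log_res)
    mad = np.median(np.abs(log_res - median))
    if mad == 0.0:
        return np.array([], dtype=int)
    score = 0.6745 * (log_res - median) / mad
    return np.flatnonzero(score > threshold)

# Synthetic residuals (assumption): 20 iterations, 502 cells, values near 1e-3
rng = np.random.default_rng(0)
n_iter, n_cells = 20, 502
residuals = 1e-3 * 10.0 ** rng.normal(0.0, 0.1, size=(n_iter, n_cells))

# Hypothetical unstable cells whose residual grows from iteration 12 onward
unstable_cells = [40, 41, 42]
for it in range(12, n_iter):
    residuals[it, unstable_cells] *= 10.0 ** (it - 10)

# Apply the detector to each iteration's residual vector
flagged = {it: outlier_cells(residuals[it]) for it in range(n_iter)}
unstable_iterations = [it for it, cells in flagged.items() if cells.size > 0]
print("Iterations with outlier cells:", unstable_iterations)
```

Working on the log scale keeps a residual that grows by orders of magnitude from inflating the scale estimate, and the median/MAD pair is not pulled toward the few diverging cells the way a mean and standard deviation would be.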


In [4]:
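import pandas as pd

N = 20  # number of saved solution iterations

# Hold one DataFrame per iteration, keyed by iteration index
df = {}
for i in range(N):
    # Skip the first two rows of each file; columns are left unnamed
    df[i] = pd.read_csv(f'./data/sol{i}.csv', skiprows=2, header=None)

# Summarize the DataFrame of the last iteration
df[N - 1].info()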

In [None]:
df[19].describe()