# Anomaly Detection in a CSTR Stream

## Introduction
The goal of this project is to build a simple machine learning pipeline that detects anomalies in a theoretical Continuous Stirred Tank Reactor (CSTR) and automatically logs the reactor data.

The pipeline follows three main steps:

1. It **generates synthetic sensor data** from the reactor using theoretical chemical formulas.
2. It **detects anomalies** using both an `Isolation Forest` model and a rule-based decision threshold.
3. It **stores summary statistics** in a PostgreSQL database for later inspection and reporting.

## Dataset Generation
Synthetic data is generated from the theoretical equations of a CSTR. To make the sensor data more realistic, Gaussian noise is added to both the input and output variables.

$$
\begin{align*}
T_i &\sim \mathcal{N}(350,\,3^2) + \mathcal{N}(0,\,(\text{input\_noise\_scale} \times 350)^2) \\
\\
C_{A0,i} &\sim \mathcal{N}(1,\,0.02^2) + \mathcal{N}(0,\,(\text{input\_noise\_scale})^2) \\
\\
Q_i &\sim \mathcal{U}(1,\,1.5) + \mathcal{N}(0,\,(\text{input\_noise\_scale})^2) \\
\\
k_i &= A \cdot \exp\left(-\frac{E_a}{R\,T_i}\right) \\
\\
\tau_i &= \frac{V}{Q_i} \\
\\
X_{A,i} &= \frac{k_i \tau_i}{1 + k_i \tau_i} \\
\\
C_{A,i} &= C_{A0,i} \cdot (1 - X_{A,i}) \\
\\
r_{A,i} &= -k_i \cdot C_{A,i} \\
\\
X_{A,\text{measured},i} &= X_{A,i} + \mathcal{N}(0,\,(\text{output\_noise\_scale})^2) \\
\\
C_{A,\text{measured},i} &= C_{A,i} + \mathcal{N}(0,\,(\text{output\_noise\_scale})^2) \\
\\
r_{A,\text{measured},i} &= r_{A,i} + \mathcal{N}(0,\,(0.2 \times \text{output\_noise\_scale})^2)
\end{align*}
$$

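where $\text{input\_noise\_scale} = 0.03$ and $\text{output\_noise\_scale} = 0.07$. Here $T$ is the temperature, $C_{A0}$ the inlet concentration, $Q$ the flow rate, $k$ the rate constant, $\tau$ the residence time, $X_A$ the conversion, $C_A$ the outlet concentration, and $r_A$ the reaction rate.

Anomalies in the sensor data are generated by selectively modifying one of the input variables (significantly increasing or decreasing its value) while the other variables are still sampled from their normal distributions.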
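The generation procedure described in this section can be sketched with NumPy as follows. This is a minimal illustration, not the project's actual code: the constants `A`, `EA`, and `V`, the `shift` fraction used to perturb an input, and all function names are assumptions, since the text above does not state them.

```python
import numpy as np

# Hypothetical constants: the README does not give A, E_a, or V.
A = 1.0e7    # pre-exponential factor (assumed)
EA = 5.0e4   # activation energy in J/mol (assumed)
R = 8.314    # gas constant in J/(mol K)
V = 1.0      # reactor volume (assumed)

INPUT_NOISE_SCALE = 0.03
OUTPUT_NOISE_SCALE = 0.07


def sample_inputs(n, rng):
    """Draw noisy input variables T, C_A0, and Q for n samples."""
    T = rng.normal(350.0, 3.0, n) + rng.normal(0.0, INPUT_NOISE_SCALE * 350.0, n)
    C_A0 = rng.normal(1.0, 0.02, n) + rng.normal(0.0, INPUT_NOISE_SCALE, n)
    Q = rng.uniform(1.0, 1.5, n) + rng.normal(0.0, INPUT_NOISE_SCALE, n)
    return T, C_A0, Q


def reactor_outputs(T, C_A0, Q, rng):
    """Apply the CSTR equations, then add measurement noise to the outputs."""
    k = A * np.exp(-EA / (R * T))
    tau = V / Q
    X_A = k * tau / (1.0 + k * tau)
    C_A = C_A0 * (1.0 - X_A)
    r_A = -k * C_A
    n = T.shape[0]
    return {
        "T": T,
        "C_A0": C_A0,
        "Q": Q,
        "X_A": X_A + rng.normal(0.0, OUTPUT_NOISE_SCALE, n),
        "C_A": C_A + rng.normal(0.0, OUTPUT_NOISE_SCALE, n),
        "r_A": r_A + rng.normal(0.0, 0.2 * OUTPUT_NOISE_SCALE, n),
    }


def generate(n_normal, n_anomalous, rng, shift=0.5):
    """Generate normal rows followed by anomalous rows.

    Each anomalous row has exactly one randomly chosen input scaled up or
    down by the fraction `shift` (an assumed magnitude).
    """
    n = n_normal + n_anomalous
    T, C_A0, Q = sample_inputs(n, rng)
    inputs = [T, C_A0, Q]
    labels = np.zeros(n, dtype=int)
    for i in range(n_normal, n):
        variable = inputs[rng.integers(3)]   # pick one input to perturb
        sign = rng.choice([-1.0, 1.0])       # increase or decrease
        variable[i] *= 1.0 + sign * shift
        labels[i] = 1
    data = reactor_outputs(T, C_A0, Q, rng)
    data["is_anomaly"] = labels
    return data
```

Calling `generate(1000, 50, np.random.default_rng(0))` would return a dictionary of arrays with 1050 rows, the last 50 of which are labeled as anomalies. Perturbing the inputs before applying the reactor equations means the outputs of an anomalous row shift consistently with the modified input.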

## Anomaly Detection

<img src="images/parameter_distribution.png" alt="Parameter Distributions" width="600">

Anomalies are detected with an Isolation Forest (IF). In our experiments, IF separated anomalies well for the parameters with a Gaussian distribution, but performed worse on the uniformly distributed parameter. Because anomalous and normal values drawn around a uniform distribution are hard to separate from each other, a decision rule is used instead: when the flow rate (Q) falls below the lower bound or above the upper bound of its normal range by more than a margin, the sensor reading is classified as an anomaly; otherwise it is classified as normal.
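One way to combine the two detectors is sketched below using scikit-learn's `IsolationForest`. Several details are assumptions rather than the project's confirmed settings: the margin value `Q_MARGIN`, the `contamination` rate, the choice of features passed to the model, and combining the two flags with a logical OR.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

Q_LOW, Q_HIGH = 1.0, 1.5   # bounds of the uniform flow-rate distribution
Q_MARGIN = 0.1             # hypothetical margin; not given in the text


def flow_rule(Q, low=Q_LOW, high=Q_HIGH, margin=Q_MARGIN):
    """Flag samples whose flow rate lies outside [low - margin, high + margin]."""
    Q = np.asarray(Q, dtype=float)
    return (Q < low - margin) | (Q > high + margin)


def detect(features, Q, contamination=0.05, seed=0):
    """Return a boolean array that is True where a sample is flagged.

    `features` holds the Gaussian-distributed parameters (assumed here to be
    columns such as T and C_A0); Q is handled by the rule only.
    """
    model = IsolationForest(contamination=contamination, random_state=seed)
    if_flags = model.fit_predict(features) == -1   # -1 marks outliers
    return if_flags | flow_rule(Q)
```

For example, `detect(np.column_stack([T, C_A0]), Q)` would flag a row if either the model or the flow-rate rule marks it. Keeping Q out of the model's feature matrix reflects the observation above that IF handles the uniformly distributed parameter poorly.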




