<a href="https://colab.research.google.com/github/YenLinWu/Trend_Detection/blob/main/Reading_Note.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Acronyms


* OOT : out-of-trend 
* OOS : out-of-specification 
* SL : Shewhart limits 
* UPL : upper prediction limit 
* LPL : lower prediction limit 
* SI : Shewhart interval  
* CI : confidence interval  
  * The confidence interval is the range that contains a parameter (e.g., the expected value) with a specified probability.  
* PI : prediction interval  
  * The prediction interval is the range in which a future observation is found with a specified probability.  
* TI : tolerance interval
  * The tolerance interval contains a specified proportion of future observations (content) with a specified probability (confidence).



# <font color="#00dd00">**Control Chart**</font>

* Control charts can be separated into two phases:  
  **Phase I** : parameters of the distribution (typically mean and standard deviation) of the reference data are estimated.  
  **Phase II** : certain statistics (e.g., sample mean, range, etc.) of the observed sample are plotted and examined.  
  * The distribution parameters obtained in **Phase I** are used to construct lower and upper control limits (i.e., lower and upper quantiles) for the statistic (i.e., the individual value) and added to the chart.

* When the process is in control, the interval between the control limits of the individuals control chart contains a new observation of the process with a certain probability (typically 0.9973). This approach is referred to as the <font color="#00dd00">**Shewhart method**</font> in this paper.     
    * According to the authors, the Shewhart method is valid only if the reference set of **Phase I** is large enough: the 0.9973 probability for the $\pm 3\sigma$ range is reached only in the limit of a large sample.    

* The typical application requires <font color="#dddd00">at least 25 samples</font> in the reference set.    
  * In the current situation, the Shewhart method is not considered reliable <font color="#dd0000">because the reference data set is **small**.</font>

* The observed sample contains only a single datapoint, thus the statistic to be plotted is the individual value (individuals control chart).  
  * <font color="#dddd00">In the case where the sample size is small, one should consider the uncertainty of expected value and variance.</font>

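* The valid control limits in the authors’ typical case (**small reference set**) lie at $$\pm ts\sqrt{1+\frac{1}{n}}$$ around the estimated mean, where $s$ is the sample standard deviation, $n$ is the size of the reference set, and $t$ is the critical $t$ value. This is the prediction interval.  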

* If a point is found to be OOT, one may choose from two routes, depending on whether data analysis is **in real time or not**: 
  * in real time:   
    If the non-OOT nature is justified (by remeasurement), the stability study is continued, neglecting the point. If the OOT nature is justified, a corrective action follows the detection, and thus, the original process is ended.
  * not in real time:  
    The analysis of data is performed not in real time but upon collecting more data. The original process is continued, and the point that is later found to be **OOT cannot be re-measured and should be removed from the data set**. If a point is found to be non-OOT, the point is added to the historical data set, new control limits are calculated, and a new control chart is made for the next observed data.
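
The difference between the Shewhart limits and the prediction interval for a small reference set can be sketched as follows. This is a minimal illustration, not the authors' code: the reference values are hypothetical, the function names are invented for this note, and the critical $t$ value (with $n-1$ degrees of freedom) is assumed to be read from a $t$-table, using the 95% level from the assumptions of this note.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def shewhart_limits(reference, coverage=0.9973):
    """Mean +/- z*s, treating the estimated mean and s as if they were known."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    m, s = mean(reference), stdev(reference)
    return m - z * s, m + z * s


def prediction_limits(reference, t_crit):
    """Mean +/- t*s*sqrt(1 + 1/n); t_crit has n - 1 degrees of freedom."""
    n = len(reference)
    m, s = mean(reference), stdev(reference)
    half_width = t_crit * s * sqrt(1 + 1 / n)
    return m - half_width, m + half_width


# Hypothetical reference set of five results (not data from the paper).
reference = [99.8, 100.4, 100.1, 99.6, 100.3]
# Two-sided 95% critical t value for 4 degrees of freedom, from a t-table.
T_CRIT_95_DF4 = 2.776

sl = shewhart_limits(reference, coverage=0.95)
pl = prediction_limits(reference, T_CRIT_95_DF4)
print("Shewhart limits:  ", sl)
print("Prediction limits:", pl)
```

At the same nominal level the prediction limits are wider, because they account for the uncertainty of the estimated mean and variance.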

# Assumptions in this paper

* The 95% confidence level.  
* The reference data are non-OOT. 
* For the authors’ data, the hypothesis of homogeneity of variances is accepted.


# <font color="#00dd00">**Regression Control Chart**</font>

* In the regression control chart method, data are compared within a batch.  

* Three kinds of statistical intervals are used: <font color="#00dd00">confidence</font>, <font color="#00dd00">prediction</font>, and <font color="#00dd00">tolerance</font>.    

* <font color="#dddd00">The “parameters” of the chart are a **regression line** (i.e., an expected value changing with time) constructed by the previous data from **Phase I** and the **variance of residuals**, and both are considered as known in the Shewhart method.</font>

* The suggested approach is to fit a regression line to the historical data of the observed batch and to extrapolate the line to the time point of the new data.  
  **Control limits** are calculated around the expected value as $\text{expected value} \pm ks$ (called the <font color="#00dd00">Shewhart limits (SL)</font>) at the new time point.  
  * The factor $k$ is taken from a table of the standard normal distribution at the desired significance level. 
  * The residual standard error ($s$) is calculated either from the regression line of the observed batch alone, or from the regression lines of historical batches. In the latter case, a common slope is assumed, and different intercepts are allowed for the historical batches.

* The prediction interval with $t$-distribution is to be used if one asks about a single new observation.  

* The method requires points in the observed batch as reference data (assume that these reference data are non-OOT) to construct the first regression line.  
<font color="#dd0000">Question :</font> How many points from the beginning of the observed batch should be considered as the reference data set ?

* The prediction interval <br>  
  $$\hat{Y}-t_{\alpha/2}s_{y^*-\hat{Y}}<y^*<\hat{Y}+t_{\alpha/2}s_{y^*-\hat{Y}}  \quad\quad \text{(Eq. 1)}$$ <br>
  where $y^*$ is the new measured value at the new time point $x^*$,    
  &emsp;&emsp;&ensp; $\hat{Y}$ is the predicted value of the measured variable $y$ at the new time point $x^*$,   
  &emsp;&emsp;&ensp; $t_{\alpha/2}$ is the upper critical $t$ value at the one-sided $\alpha/2$ level (with $n-2$ degrees of freedom),     
  &emsp;&emsp;&ensp; $n$ is the number of points used to construct the regression line, and    
  &emsp;&emsp;&ensp; $s_{y^*-\hat{Y}}$ is the estimated standard deviation of the difference $y^*-\hat{Y}$, calculated by <br>  
  $$s_{y^*-\hat{Y}}=s_r\sqrt{1+\frac{1}{n}+\frac{(x^*-\bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}} \quad\quad \text{(Eq. 3)}$$    
  &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp; where $s_r$ is the [residual standard deviation](https://www.investopedia.com/terms/r/residual-standard-deviation.asp) and   
  &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp; $\bar{x}$ is the mean of the $x_j$ values (time points) of the data (without $x^*$) used to estimate the regression line. <br>        
The right side of (Eq. 1) is the <font color="#00dd00">upper prediction limit (UPL)</font>, while the left side is the <font color="#00dd00">lower prediction limit (LPL)</font>. <font color="#dddd00">**If the inequality (Eq. 1) is satisfied, that is, $y^*$ lies within the limits, the data point is accepted; otherwise it is OOT.**</font>

  * The residual standard error may also be calculated using earlier batches, not only the batch under study. The earlier batches may be used in several different ways. Some authors used parallel lines (i.e., a common slope for all historical batches). **The authors of this article are not convinced about the usefulness of this assumption, however.**  <br>
<font color="#dddd00">**If a common slope is forced on the historical batches, the fit deteriorates. This leads to a greater residual error than would be achieved with the best linear fits (allowing different slopes).**</font>   
On the other hand, <font color="#dddd00">**if separate lines are fitted to the historical batches, the residual standard errors obtained may be pooled.**</font> The <font color="#00dd00">pooled error</font> has more degrees of freedom than the error obtained from a single batch; thus, it is a better estimate of the variance, leading to better estimated control limits.  

  * <font color="#dddd00">If every group has the **same number of data points**, the pooled sample variance $s_{p(res)}^{2}$ is the arithmetic average of the individual sample variances.</font> Hence, a regression line was fitted to each historical batch, and the residual standard error ($s_i$) of the $i$-th historical batch was calculated. Pooling is justified only if the data of the different batches have the same variance. For the authors’ data, the hypothesis of homogeneity of variances is accepted.      

  * To calculate the prediction interval with the pooled residual standard error, $s_{p(res)}$ is substituted for $s_r$ in (Eq. 3). The pooled sample variance $s_{p(res)}^{2}$ has $p(n-2)$ degrees of freedom (for $p$ historical batches of $n$ points each), thus a different $t_{\alpha/2}$ should be taken from the $t$-table when calculating the control limits.  

* Shewhart limits: 
  $$\text{SL} = \hat{y^*} \pm z_{\alpha/2}\sigma_{r} \quad\quad \text{(Eq. 5)}$$ <br>
  where $z_{\alpha/2}$ is the critical value of the standard normal distribution at the one-sided $\alpha/2$ level (i.e., two-sided $\alpha$), and     
  &emsp;&emsp;&ensp; $\sigma_{r}$ is the square root of the variance of residuals, taken to be equal to $s_{r}$.   
  Note : One may use the pooled standard deviation $s_{p(res)}$ (the square root of the pooled variance) as a substitute for $\sigma_{r}$.   

* Confidence limits:   
  $$\text{CL} = \hat{y^*} \pm t_{\alpha/2} s_{r}/\sqrt{n} \quad\quad \text{(Eq. 6)}$$ <br>    
  where $t_{\alpha/2}$ is the upper critical $t$ value at the one-sided $\alpha/2$ level (with $n-2$ degrees of freedom).    
    * For the calculations with the pooled standard deviation, $s_{r}$ is replaced by $s_{p(res)}$ in (Eq. 6), and the degrees of freedom of $s_{p(res)}^{2}$ are used to obtain the critical $t$ value.

* Tolerance limits:   
  $$\text{TL} = \hat{y^*} \pm k_1 s_{r} \quad\quad \text{(Eq. 7)}$$ <br> 
  with 
  $$k_1 =  \sqrt{\frac{\nu \chi_{P,1}^{2}(\frac{1}{n^{'}})}{\chi_{1-\gamma, \nu}^{2}}} \quad\quad \text{(Eq. 8)}$$ <br> 
  where $\nu$ is the degrees of freedom of $s_{r}^{2}$ (here $n-2$),    
  &emsp;&emsp;&ensp; $\chi_{1-\gamma, \nu}^{2}$ is the critical value of the chi-square distribution at the one-sided $1-\gamma$ level with $\nu$ degrees of freedom,    
  &emsp;&emsp;&ensp; $\chi_{P,1}^{2}(\delta)$ is the critical value of the noncentral chi-square distribution at the one-sided $P$ level with 1 degree of freedom and noncentrality parameter $\delta$, and      
  &emsp;&emsp;&ensp; $\delta$ is the noncentrality parameter, here $\delta = 1/n^{'}$ with       

  \begin{equation}
  \begin{split}
    n^{'} &= \frac{n\sum(x_i-\bar{x})^2}{\sum(x_i-\bar{x})^2+n(x^{*}-\bar{x})^2} \quad\quad \text{(Eq. 9)} \\ \\ & = \frac{s_{r}^2}{s_{\hat{Y}}^{2}} \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad \text{(Eq. 10)}
     \\ \\ & = \frac{s_{r}^2}{s_{r}^{2}\left(\frac{1}{n}+\frac{(x^*-\bar{x})^2}{\sum_{j=1}^{n} (x_j-\bar{x})^2}\right)} \quad\quad\quad\quad \text{(Eq. 11)}
  \end{split}
  \end{equation}
    <br>       

    * For the calculations using pooled standard deviation, $s_{r}$ is substituted with $s_{p(res)}$ in (Eq. 10) and (Eq. 11) whenever a new effective number ($n^{'}$) of observations is to be obtained. The proper substitution is performed for degrees of freedom as well in (Eq. 8).
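
The OOT check based on the prediction interval (Eq. 1 with Eq. 3) can be sketched in a few lines. This is only an illustration under stated assumptions: the stability data are hypothetical, the function names are invented for this note, and the critical $t$ value (with $n-2$ degrees of freedom) is assumed to be taken from a $t$-table.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares line; returns intercept, slope, s_r, x_bar and Sxx."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s_r = sqrt(sse / (n - 2))  # residual standard deviation, n - 2 dof
    return intercept, slope, s_r, x_bar, sxx


def prediction_limits(x, y, x_new, t_crit):
    """Lower and upper prediction limits (LPL, UPL) at x_new, per Eq. 1 and 3."""
    n = len(x)
    intercept, slope, s_r, x_bar, sxx = fit_line(x, y)
    y_hat = intercept + slope * x_new
    s_pred = s_r * sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return y_hat - t_crit * s_pred, y_hat + t_crit * s_pred


def is_oot(y_new, limits):
    """A point is OOT when it falls outside the prediction limits."""
    lpl, upl = limits
    return not (lpl < y_new < upl)


# Hypothetical stability data for one batch (time in months, assay in %).
months = [0, 3, 6, 9, 12]
assay = [100.0, 99.5, 99.1, 98.4, 98.0]
# Two-sided 95% critical t value for n - 2 = 3 degrees of freedom.
T_CRIT_95_DF3 = 3.182

limits = prediction_limits(months, assay, x_new=18, t_crit=T_CRIT_95_DF3)
print("LPL, UPL at 18 months:", limits)
print("Is 97.0 OOT?", is_oot(97.0, limits))
print("Is 95.0 OOT?", is_oot(95.0, limits))
```

When a pooled residual standard deviation is used, it would replace `s_r` inside `prediction_limits`, and `t_crit` would be taken at the pooled degrees of freedom.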


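The pooling of residual variances described above can be illustrated with a small sketch. It assumes that a separate line has already been fitted to each historical batch; the variances below are hypothetical and the function name is invented for this note.

```python
from math import sqrt


def pooled_residual_variance(variances, dofs):
    """Degrees-of-freedom-weighted mean of the per-batch residual variances."""
    total_dof = sum(dofs)
    pooled = sum(v * d for v, d in zip(variances, dofs)) / total_dof
    return pooled, total_dof


# Hypothetical residual variances s_i^2 of p = 3 historical batches,
# each fitted with n = 6 points, i.e. n - 2 = 4 degrees of freedom.
batch_variances = [0.04, 0.09, 0.05]
batch_dofs = [4, 4, 4]

s2_pooled, dof_pooled = pooled_residual_variance(batch_variances, batch_dofs)
s_pooled = sqrt(s2_pooled)
print("pooled variance:", s2_pooled)
print("pooled standard deviation:", s_pooled)
print("pooled degrees of freedom:", dof_pooled)
```

With equal batch sizes the weighted mean reduces to the arithmetic average of the variances, and the pooled degrees of freedom equal $p(n-2)$.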


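The effective number of observations $n^{'}$ can be checked numerically: (Eq. 9) and the ratio $s_r^2/s_{\hat{Y}}^2$ of (Eq. 10) should give the same value. The sketch below uses hypothetical time points and invented function names. The tolerance factor of (Eq. 8) is written so that the two chi-square critical values are supplied by the caller (for example from tables or a statistics library), since the Python standard library does not provide them.

```python
from math import sqrt


def effective_n(x, x_new):
    """Effective number of observations n' according to Eq. 9."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return n * sxx / (sxx + n * (x_new - x_bar) ** 2)


def effective_n_from_ratio(x, x_new, s_r):
    """n' as the ratio s_r^2 / s_Yhat^2 (Eq. 10), for comparison with Eq. 9."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    var_y_hat = s_r ** 2 * (1 / n + (x_new - x_bar) ** 2 / sxx)
    return s_r ** 2 / var_y_hat


def tolerance_factor(nu, chi2_noncentral, chi2_central):
    """k_1 of Eq. 8; both critical values must be supplied by the caller."""
    return sqrt(nu * chi2_noncentral / chi2_central)


# Hypothetical time points (months) and a new time point at 18 months.
months = [0, 3, 6, 9, 12]
n_eff = effective_n(months, x_new=18)
print("n' from Eq. 9: ", n_eff)
print("n' from Eq. 10:", effective_n_from_ratio(months, 18, s_r=0.08))
```

Because $x^*$ lies outside the range of the reference time points, $n^{'}$ is much smaller than $n$, which widens the tolerance limits at the extrapolated time point.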




# References  

  [1] [Pooled Variance](https://www.statisticshowto.com/pooled-variance/), Stephanie, February 25, 2020.     
  [2] [10.5: Standard Error and Pooled Variance](https://stats.libretexts.org/Bookshelves/Applied_Statistics/Book%3A_An_Introduction_to_Psychological_Statistics_(Foster_et_al.)/10%3A__Independent_Samples/10.05%3A_Standard_Error_and_Pooled_Variance), Garett C. Foster, May 2, 2021.    
  [3] [What is a pooled variance?](https://blogs.sas.com/content/iml/2020/06/29/pooled-variance.html), Rick Wicklin, June 29, 2020.