Marco Riani¹, Anthony C. Atkinson², Luca Greco³, and Aldo Corbellini¹
¹ Department of Economics and Management and Interdepartmental Research Centre for Robust Statistics
² London School of Economics
³ Università Telematica Giustino Fortunato
We develop a general framework for multivariate analysis with missing observations, with particular emphasis on the computation and use of Mahalanobis distances. When some entries are missing, the usual Mahalanobis distance can only be computed on the observed coordinates, yielding partial distances that are not directly comparable across units with different missingness patterns. To overcome this difficulty, we study a class of adjustments that rescale partial Mahalanobis distances to a common reference scale.
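
To make the notion of a partial distance concrete, here is a minimal, dependency-free Python sketch of a squared Mahalanobis distance computed on the observed coordinates only. It is not the FSDA routine `mdPartialMD`: the function names `solve` and `partial_md2`, the use of `NaN` to mark missing entries, and the toy values of `mu` and `sigma` are all assumptions made for illustration.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def partial_md2(x, mu, sigma):
    """Squared Mahalanobis distance of x on its observed (non-NaN) entries.

    Returns the pair (distance, number of observed coordinates).
    """
    obs = [j for j, v in enumerate(x) if not math.isnan(v)]
    if not obs:
        return 0.0, 0
    resid = [x[j] - mu[j] for j in obs]
    sigma_oo = [[sigma[i][j] for j in obs] for i in obs]
    z = solve(sigma_oo, resid)  # Sigma_oo^{-1} (x_o - mu_o)
    return sum(r * zi for r, zi in zip(resid, z)), len(obs)


# Hypothetical toy parameters (illustration only, not estimated from data).
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 4.0]]
print(partial_md2([1.0, float("nan"), 2.0], mu, sigma))  # two observed coordinates
print(partial_md2([1.0, 1.0, 2.0], mu, sigma))           # all three observed
```

In this toy example the unit with one missing entry has a distance based on two coordinates, while the complete unit has a distance based on three, so the two values sit on different reference scales. This is the comparability problem the adjustments address.
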
The proposed methodology is based on the EM algorithm for estimating multivariate normal location and scatter in the presence of missing values. We show that this framework allows the computation of adjusted distances without explicit imputation. Seven adjustment methods are considered, including moment-based, determinant-based, and distributional transformations, as well as a model-based correction derived from the conditional expectation of the complete-data Mahalanobis distance. This principled adjustment is shown to be optimal under a mean squared error criterion.
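
The following derivation sketches why a conditional-expectation correction takes a simple form. It assumes that the mean $\mu$ and covariance $\Sigma$ are known and that the data are multivariate normal. The partition notation (subscripts $o$ and $m$ for the observed and missing coordinates of a unit, $p_o$ for the number of observed coordinates out of $p$) is introduced here for illustration and is not taken from the paper. The complete-data squared distance splits into the partial distance plus a term that depends on the missing part:

$$
d^2 = d_o^2 + (x_m - \mu_{m\mid o})^\top \Sigma_{mm\cdot o}^{-1} (x_m - \mu_{m\mid o}),
\qquad
d_o^2 = (x_o - \mu_o)^\top \Sigma_{oo}^{-1} (x_o - \mu_o),
$$

where

$$
\mu_{m\mid o} = \mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(x_o - \mu_o),
\qquad
\Sigma_{mm\cdot o} = \Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om}.
$$

Conditionally on $x_o$, the second term is a $\chi^2$ variable with $p - p_o$ degrees of freedom, so

$$
\mathrm{E}\left[d^2 \mid x_o\right] = d_o^2 + (p - p_o).
$$

Because a conditional expectation minimizes mean squared error among functions of $x_o$, this is the sense in which such a correction is optimal under that criterion. The correction used in the paper relies on EM estimates rather than known parameters and may differ in detail.
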
We further extend the methodology to a robust context through a trimmed EM algorithm, thereby combining missing-data estimation with outlier detection. A simulation study compares the proposed adjustments in terms of their ability to reconstruct the complete-data Mahalanobis distances. Across a wide range of settings, the principled EM correction consistently provides the best performance, while chi-square, Beta, and standardization mappings provide useful alternatives.
Finally, we introduce a new graphical diagnostic for assessing whether data are Missing Completely at Random (MCAR), based on the comparison of Mahalanobis distances computed from complete rows with those obtained when all rows are used. This graphical procedure is formalized through a Monte-Carlo test. The methods are illustrated on a dataset of cows with missing measurements, where the analysis reveals both multivariate outliers and clear evidence against MCAR.
The proposed framework provides a flexible and robust approach to multivariate analysis with missing data, combining statistical interpretability, computational efficiency, and practical diagnostic tools.
The table below links to the original source (MATLAB live script, .mlx file) and to the corresponding Jupyter notebook (.ipynb file).
MATLAB live script files
The .mlx files contain both the code and the output that the code produces.
👀 To view the .mlx files, click on the "File Exchange" button.
The Jupyter notebook version of each file is also given in the "Jupyter notebook" column of the table below. Like the .mlx files, the Jupyter notebook files contain both the code and the output produced by the code.
Jupyter notebook files
To view the .ipynb files, click on the corresponding link.
To run the .ipynb files inside the language-agnostic Jupyter Notebook environment, follow the instructions in the file ipynbRunInstructions.md.
Note: to run the files below, you need to have the FSDA toolbox installed.
| Description | Routine name (link to HTML doc file) |
|---|---|
| EM algorithm for data with missing values (no trimming). | mdEM |
| EM algorithm with trimming (TEM) for data with missing values. | mdTEM |
| Compute squared Mahalanobis distances using only observed entries. | mdPartialMD |
| Rescale partial squared Mahalanobis distances to the full-dimensional scale. | mdPartialMD2full |
| Bootstrap test for change in Mahalanobis distances under MCAR. | mdMCARtest |
| Replace NaNs with conditional mean or random draw from conditional distribution. | mdImputeCondMean |
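
The conditional-mean imputation described in the last row of the table can be sketched as follows. This is a minimal, dependency-free Python illustration under a multivariate normal model with given parameters. It is not the FSDA routine `mdImputeCondMean`, it covers only the conditional-mean option (not the random draw), and the names `solve` and `impute_cond_mean` as well as the toy values of `mu` and `sigma` are assumptions.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def impute_cond_mean(x, mu, sigma):
    """Return a copy of x with NaN entries replaced by their conditional mean
    given the observed entries: mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)."""
    obs = [j for j, v in enumerate(x) if not math.isnan(v)]
    mis = [j for j, v in enumerate(x) if math.isnan(v)]
    filled = list(x)
    if not obs:
        # Nothing observed: the conditional mean reduces to the marginal mean.
        for j in mis:
            filled[j] = mu[j]
        return filled
    resid = [x[j] - mu[j] for j in obs]
    sigma_oo = [[sigma[i][j] for j in obs] for i in obs]
    z = solve(sigma_oo, resid)  # Sigma_oo^{-1} (x_o - mu_o)
    for m in mis:
        filled[m] = mu[m] + sum(sigma[m][o] * zo for o, zo in zip(obs, z))
    return filled


# Hypothetical toy parameters (illustration only, not estimated from data).
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 4.0]]
print(impute_cond_mean([1.0, float("nan"), 2.0], mu, sigma))
```

In the toy example the second variable is correlated only with the first, so its imputed value depends on the first coordinate and not on the third.
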
The table below lists the source files that reproduce the figures of the paper and the simulation study.
| FileName | View 👀 | Run | Jupyter notebook | m format |
|---|---|---|---|---|
| RobMultMissingFigures.mlx: This code generates Figures 4 to 9 and Table 1 of the paper. | | | RobMultMissingFigures.ipynb | RobMultMissingFigures.m |
| RobMultMissingSimStudies.mlx: This code generates Figures 1 to 3 and performs the simulation study described in Section 3 of the paper. | | | RobMultMissingSimStudies.ipynb | RobMultMissingSimStudies.m |