Connection from the identified outliers to the row or unique identifier #15

benkertp · 2017-09-07T11:28:09Z

Thanks a lot for this very useful package!

It would be very helpful to have the possibility to specify a "key column" in identifyOutliers(). When the outliers are reported in the pdf/html file, the key would then be reported together with the value of the outlier.

Background: I work with patient data, and each table typically has a patient-ID column or another identifier. If I could see the patient ID together with each outlier, I could just go to the CRF and correct the entry for the given patient. To go further, one could also generate a column with links to the location where the data can be changed.

Best wishes,
Pascal

annennenne · 2017-09-08T08:42:49Z

Thanks for the input! I think we already have tools for doing something very similar to what you suggest, but let me know if they don't cover everything you had in mind. However, there is one thing that we generally try to avoid: printing ID numbers in the report, as this can be problematic in terms of data security.

Anyway, here is how I would suggest doing what you propose:

So first we load some data and make a report:

library(dataMaid)
data(toyData)

clean(toyData)

Here, we might notice that var4 has some outliers and decide to replace them with NA. This can be done in the following way:

toyData[toyData$var4 %in% identifyOutliers(toyData$var4)$problemValues, "var4"] <- NA

and now they're gone:

> identifyOutliers(toyData$var4)
No problems found.

The key here is that identifyOutliers(), like all other check functions, returns an object with a list structure that has three entries:

> str(identifyOutliers(toyData$var4))
List of 3
 $ problem      : logi TRUE
 $ message      : chr "Note that the following possible outlier values were detected: \\\"1.12\\\", \\\"1.51\\\", \\\"1.6\\\"."
 $ problemValues: num [1:3] 1.6 1.51 1.12
 - attr(*, "class")= chr "checkResult"

$problem is a logical, indicating whether a problem was found
$message is the printed message you see in e.g. the clean()-report
$problemValues are the values in the variable that were identified as problematic
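
As a minimal sketch of how these three entries can be used in code: the block below builds a stand-in list by hand with the same entries as the str() output above, so it runs without dataMaid. The vector x and its values are made up for illustration; with dataMaid loaded, res would instead come from a call such as identifyOutliers(x).

```r
# Hand-built stand-in with the same three entries as the str() output above
# (an assumption for illustration, not produced by dataMaid here)
res <- structure(
  list(problem = TRUE,
       message = "Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.",
       problemValues = c(1.6, 1.51, 1.12)),
  class = "checkResult")

# Made-up data vector
x <- c(0.2, 0.3, 1.6, 0.25, 1.51, 1.12, 0.28)

# Only touch the data when the check actually reported a problem
if (res$problem) {
  x[x %in% res$problemValues] <- NA
}
print(x)
```

Checking $problem first avoids subsetting on $problemValues when no problem was reported.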

benkertp · 2017-09-12T07:42:42Z

Thanks for your answer.
That's not really what I had in mind. I don't want to clean the data in R directly but rather in the external database. Therefore I would like to display additional information for each individual outlier in the generated report.
I guess one way would be to modify the "message" variable in the list such that it does not only show the outlier value but additional meta information (e.g. the "patient id") allowing me to change the value in the original database for that patient.
For that I probably need to write my own identifyOutliers() function, right?

annennenne · 2017-09-12T08:29:04Z

I see. However, this is not possible with the current structure of clean(). The function only looks at a single variable at a time, so it cannot combine information from an ID variable with another one. If you are willing to do it in the console, however, you can get the IDs by typing e.g.

library(dataMaid)
data(testData)

#testData contains a variable, numOutlierVar, which has an outlier:
> identifyOutliers(testData$numOutlierVar)
Note that the following possible outlier values were detected: 100.

#testData also contains an ID-variable, cprKeyVar, and we can get the IDs of 
#the outliers like this:
> testData$cprKeyVar[testData$numOutlierVar %in% identifyOutliers(testData$numOutlierVar)$problemValues]
[1] "010274-2648"
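
The same lookup could be wrapped in a small helper that returns each key next to its flagged value. This is only a sketch: outlierKeys() is a hypothetical function, not part of dataMaid, the example data frame is made up, and base R's boxplot.stats() is used as a stand-in for the outlier check so that the block runs without the package. It is not claimed to flag the same values as identifyOutliers().

```r
# Hypothetical helper (not part of dataMaid): list the key for each row whose
# value is among the flagged problem values.
outlierKeys <- function(data, valueVar, keyVar, problemValues) {
  hit <- data[[valueVar]] %in% problemValues
  data.frame(key = data[[keyVar]][hit],
             value = data[[valueVar]][hit],
             stringsAsFactors = FALSE)
}

# Made-up example data with one obviously extreme weight
d <- data.frame(id = c("p01", "p02", "p03", "p04", "p05", "p06"),
                weight = c(70, 72, 68, 71, 69, 700),
                stringsAsFactors = FALSE)

# Base R stand-in for the outlier check; with dataMaid loaded this could
# instead be identifyOutliers(d$weight)$problemValues
flagged <- boxplot.stats(d$weight)$out

res <- outlierKeys(d, "weight", "id", flagged)
print(res)
```

Because the result is an ordinary data frame, it can be kept in the console or written to a file outside the report, which keeps the IDs out of the clean() output.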

annennenne self-assigned this Sep 8, 2017

annennenne added enhancement and removed enhancement labels Sep 8, 2017

annennenne removed their assignment Sep 5, 2018

annennenne mentioned this issue Mar 5, 2019

Add "compare" step #43

Open
