Connection from the identified outliers to the row or unique identifier #15

Open
benkertp opened this issue Sep 7, 2017 · 3 comments

@benkertp

benkertp commented Sep 7, 2017

Thanks a lot for this very useful package!

It would be very helpful to be able to specify a "key column" in identifyOutliers(). When the outliers are reported in the pdf/html file, the key would then be reported together with the value of each outlier.

Background: I work with patient data and in each table we typically have a patient-ID column or another identifier. If I could see the patient ID together with each outlier, I could just go to the CRF and correct the entry for the given patient. To go further one could also generate a column with links to the location where the data can be changed.

Best wishes,
Pascal

@annennenne annennenne self-assigned this Sep 8, 2017
@annennenne
Collaborator

Thanks for the input! I think we already have tools for doing something very similar to what you suggest, but let me know if it doesn't cover everything you had in mind. However, there is one thing that we generally try to avoid: Printing id numbers in the report, as this can be problematic in terms of data security.

Anyway, here is how I would suggest doing what you propose:

So first we load some data and make a report:

library(dataMaid)
data(toyData)

clean(toyData) 

Here, we might notice that var4 has some outliers and decide to replace them by NA. This can then be done in the following way:

toyData[toyData$var4 %in% identifyOutliers(toyData$var4)$problemValues, "var4"] <- NA

and now they're gone:

> identifyOutliers(toyData$var4)
No problems found.

The key here is that identifyOutliers(), like all other check functions, returns an object with a list structure that has three entries:

> str(identifyOutliers(toyData$var4))
List of 3
 $ problem      : logi TRUE
 $ message      : chr "Note that the following possible outlier values were detected: \\\"1.12\\\", \\\"1.51\\\", \\\"1.6\\\"."
 $ problemValues: num [1:3] 1.6 1.51 1.12
 - attr(*, "class")= chr "checkResult"

$problem is a logical, indicating whether a problem was found
$message is the printed message you see in e.g. the clean()-report
$problemValues are the values in the variable that were identified as problematic
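As a rough sketch of how these three entries can be used together, the snippet below finds the row positions of the flagged values. To keep it runnable without dataMaid, `res` is built by hand to mirror the str() output above, and the vector `var4` is made-up illustration data, not the real toyData column.

```r
# Hand-built stand-in with the same three entries as the str() output above.
# With dataMaid loaded, this object would come from identifyOutliers() instead.
res <- structure(
  list(
    problem = TRUE,
    message = "Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.",
    problemValues = c(1.6, 1.51, 1.12)
  ),
  class = "checkResult"
)

# Hypothetical variable, for illustration only.
var4 <- c(0.2, 1.6, -0.1, 1.51, 0.3, 1.12)

# Only look up rows when the check actually reported a problem.
flagged_rows <- if (res$problem) which(var4 %in% res$problemValues) else integer(0)
flagged_rows
```

The row positions in `flagged_rows` can then be used to index any other column of the same data frame.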

@benkertp
Author

Thanks for your answer.
That's not really what I had in mind. I don't want to clean the data in R directly but rather in the external database. Therefore I would like to display additional information for each individual outlier in the generated report.
I guess one way would be to modify the "message" variable in the list such that it does not only show the outlier value but additional meta information (e.g. the "patient id") allowing me to change the value in the original database for that patient.
For that I probably need to write my own identifyOutliers() function, right?

@annennenne
Collaborator

I see. However, this is not possible with the current structure of clean(). The function only looks at a single variable at a time, so it cannot combine information from an ID variable with another variable. If you are willing to do it in the console, however, you can get the IDs by typing e.g.

library(dataMaid)
data(testData)

# testData contains a variable, numOutlierVar, which has an outlier:
> identifyOutliers(testData$numOutlierVar)
Note that the following possible outlier values were detected: 100.

#testData also contains an ID-variable, cprKeyVar, and we can get the IDs of 
#the outliers like this:
> testData$cprKeyVar[testData$numOutlierVar %in% identifyOutliers(testData$numOutlierVar)$problemValues]
[1] "010274-2648"
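The console lookup above could be wrapped in a small helper that returns each flagged value next to its identifier. This is only a sketch: `outlierKeys()` and `boxplotCheck()` are hypothetical names, not dataMaid functions, and `boxplotCheck()` is a base R stand-in (built on boxplot.stats()) that returns the same three entries as a check result so the example runs without dataMaid. The `patients` data are made up.

```r
# Stand-in check function (an assumption, NOT part of dataMaid): flags values
# outside the boxplot whiskers and returns the three entries
# problem / message / problemValues.
boxplotCheck <- function(v) {
  out <- unique(boxplot.stats(v)$out)
  found <- length(out) > 0
  list(
    problem = found,
    message = if (found) {
      paste0("Possible outlier values: ", paste(out, collapse = ", "), ".")
    } else {
      "No problems found."
    },
    problemValues = out
  )
}

# Hypothetical helper: pair every flagged value with its identifier.
# `check` can be any function that returns a list with a problemValues entry.
outlierKeys <- function(data, var, key, check = boxplotCheck) {
  res <- check(data[[var]])
  hit <- data[[var]] %in% res$problemValues
  out <- data.frame(data[[key]][hit], data[[var]][hit], stringsAsFactors = FALSE)
  names(out) <- c(key, var)
  out
}

# Made-up data for illustration only.
patients <- data.frame(
  patientId = c("P01", "P02", "P03", "P04", "P05", "P06"),
  weight = c(71, 68, 74, 700, 69, 72),
  stringsAsFactors = FALSE
)

flagged <- outlierKeys(patients, "weight", "patientId")
print(flagged)
```

With dataMaid loaded, passing `check = identifyOutliers` should give the same kind of table based on the package's own outlier rule, since the helper only relies on the `$problemValues` entry. The result stays in the console rather than the report, which is in line with the data security concern mentioned earlier in the thread.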

@annennenne annennenne removed their assignment Sep 5, 2018