New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Establish script to remove deleted and inactive accounts. #15210
Conversation
b3ffd60
to
83c3223
Compare
Looking for a high-level review, confirming that it is sensical to perform this process semi-manually rather than create triggers and dependent destroys within our models. Assuming so, please let me know whether you would prefer a bunch of incremental PRs or a long PR (expecting some back and forth). |
bin/cron/delete_accounts
Outdated
|
||
user.reload.update!( | ||
name: '', | ||
username: '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better or for worse, there are unique indexes on username
, email
, etc. How do we want to manipulate these fields while maintaining the DB constraints? Do we want to relax the DB constraints?
720eb9b
to
7b510ae
Compare
bin/cron/delete_accounts
Outdated
def delete_open_ended_answers(user_id) | ||
UserLevels.where(user_id: user_id).find_each do |user_level| | ||
if OPEN_ENDED_LEVEL_TYPES.include? user_level.level.type | ||
user_level.level_source.destroy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we want do delete the level_source? Does it make sense to wipe its data (also md5?), keeping the row otherwise intact? Do we want to purge the whole row?
Note that, in both cases, there are implications for other users that happen to have the same level_source_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we destroying the level source at all and not just the user level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we are fine with maintaining a record that the student did work, we want to eliminate the work that was done. This is motivated by student privacy law, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we then update the user_level to no longer point to a level_source and delete the level_source only if there are no longer any user_levels pointing to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't at all clear to me what makes the most sense, from a product or from a engineering perpsective (thus my starting this thread). If we could delete the level source only if there were no user_levels pointing to it, that'd be awesome. Unfortunately, we are lacking the index on user_levels.level_source_id
to allow us to carry this out efficiently (practically).
Note that we do have an index on activities.level_source_id
. That said, there would be some complexity in trying to make use of it, as we'd have to consult both activities
and activities_old
(the backup table from before our overrunning the ID counter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm misunderstanding something, deleting the level source indiscriminately seems like it isn't even an option, given how significant the impact would be on other users.
How bad would it be to simply remove the pointer from deleted user to level source and potentially have orphaned level sources floating around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My belief is that this is not considered acceptable by product. Otherwise, I agree, this would be the easy answer.
For now, I'm going to suggest we not focus too much on this, as product still needs to finalize their requirements.
What's the thinking behind doing this manually as opposed to relying on dependent destroys? |
My concern with dependent destroys is that this pipeline only happens on "hard"-delete, not "soft"-delete. So we'd need to invoke this pipeline in a manner distinct from Also, it seems less error prone to keep data deletion to a script. Though I could be very mistaken on that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like the overall idea of doing this manually and carefully, but it does seem pretty error-prone; I'm not sure whether I'm more worried about things that should be deleted getting skipped and being orphaned or about the act of deleting something causing other things that we don't want to delete to break.
I'm in favor of this all being in a single large PR, in the hopes that that will make it easier to notice something that got missed.
bin/cron/delete_accounts
Outdated
def delete_open_ended_answers(user_id) | ||
UserLevels.where(user_id: user_id).find_each do |user_level| | ||
if OPEN_ENDED_LEVEL_TYPES.include? user_level.level.type | ||
user_level.level_source.destroy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we destroying the level source at all and not just the user level?
bin/cron/delete_accounts
Outdated
FreeResponse | ||
Gamelab | ||
Weblab | ||
).freeze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should Artist and Playlab be included in this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most definitely. Amongst many other things...the exact list is still getting sorted out.
bin/cron/delete_accounts
Outdated
next unless Follower.where(student_user: student_user).count == 1 | ||
|
||
# TODO(asher): Determine whether this should be delete_user_and_dependencies(student_user). | ||
student_user.destroy! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a soft delete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the TODO, I need to get a definite answer. Note that, after thirty days, the two are equivalent (as a soft-deleted user then has delete_user_and_dependencies
run on it).
b7a5670
to
a56d0bc
Compare
c57cfce
to
4f3951a
Compare
4f3951a
to
25dd94c
Compare
25dd94c
to
5ff14cb
Compare
5ff14cb
to
54e483c
Compare
54e483c
to
323e9f8
Compare
Ready for review... PTAL. |
Actually, just kidding. There are a few more changes that I want to refactor from the script into models. |
ef5af3f
to
7ff8f19
Compare
Those refactor changes have been made, PTAL. |
lib/cdo/delete_accounts_helper.rb
Outdated
# Cleans all LevelSource-backed progress associated with the user. | ||
# WARNING: Since the relationship between user_levels and activities and level_sources is | ||
# many-to-one, we cannot be certain that this is not messing with the integrity of the data for | ||
# other users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still quite concerned about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. As am I. But I'm pretty sure @jeremydstone is not... (at least, I'm doing what I'm told).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per offline discussion with @jeremydstone, this has been changed to severe the link between the user's level-backed progress and the user (see commit d39f371) rather than remove the actual student work (image). PTAL.
lib/cdo/delete_accounts_helper.rb
Outdated
where(dashboard_user_id: user_id). | ||
map {|contact_rollup| contact_rollup[:pardot_id]} | ||
failed_ids = Pardot.delete_prospects(pardot_ids) | ||
raise "Pardot.delete_prospects failed for Pardot IDs #{failed_ids.join(', ')}." unless failed_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be unless failed_ids.empty?
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Done (using if failed_ids.any?
).
DeleteAccountsHelper.remove_from_solr(user.id) | ||
|
||
user.purged_at = Time.zone.now | ||
user.save(validate: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this imply that purged users are expected to fail validation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, purged users are expected to pass validation, I've gone to some trouble to make sure of that. That said, I want the purged_at
update to get persisted, whether or not the user is valid.
# NOTE: This method assumes the sections and followers associated with user_id have already been | ||
# destroyed. | ||
# @param [Integer] user_id for which to delete orphaned students. | ||
def self.purge_orphaned_students(user_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe the scariest part to me. An area where we could be doing the right thing under the letter of our TOS/Privacy policy and yet have kids come in the in morning to discover all their stuff nuked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel welcome to take the issue up in the spec or with @poorvasingal directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Teachers will have a super clear warning that their student accounts (that don't have a personal login) will get deleted if they delete their own account. I also don't think it's a common scenario for teachers to delete their accounts so I am not super worried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT user_type, COUNT(0)
FROM users
WHERE deleted_at IS NOT NULL
GROUP BY 1;
+-----------+----------+
| user_type | COUNT(0) |
+-----------+----------+
| student | 71937 |
| teacher | 6710 |
+-----------+----------+
The code looks good. Would like to have an interactive conversation with @ashercodeorg around testing, risks & mitigation prior to merge. |
f9988f7
to
d39f371
Compare
Per offline conversation, merging so that this can be tested in isolation on |
The aim of the script is to remove all PII, PPII, and information owned by thirty day deleted users and seven year inactive users. Spec (probably somewhat out-of-date already).
VOCABULARY: Throughout, we use the term "purge" to refer to this removal of PII, PPII, and information owned. Throughout, we use the term "clean" to refer to the removal of the information while leaving the skeleton holding that information intact.
Given this involves data retention, adding explicit lists to this PR of the data we explicitly want to destroy and the data we explicitly want to keep.
DATA TO DESTROY
DATA TO KEEP