Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Establish script to remove deleted and inactive accounts. #15210

Merged
merged 4 commits into from Aug 2, 2017

Conversation

ashercodeorg
Copy link
Contributor

@ashercodeorg ashercodeorg commented May 19, 2017

The aim of the script is to remove all PII, PPII, and information owned by thirty day deleted users and seven year inactive users. Spec (probably somewhat out-of-date already).

VOCABULARY: Throughout, we use the term "purge" to refer to this removal of PII, PPII, and information owned. Throughout, we use the term "clean" to refer to the removal of the information while leaving the skeleton holding that information intact.

Given this involves data retention, adding explicit lists to this PR of the data we explicitly want to destroy and the data we explicitly want to keep.

DATA TO DESTROY

  • USER
    • display name
    • username
    • fine geolocation data
      • address, city
      • latitude, longitude
    • IP addresses
    • associated email addresses (including hashed emails)
    • passwords (even if encrypted)
    • oauth provider UID
    • all SOLR data
    • all Pardot data
      • Pardot data
      • contact_rollups
      • contact_rollups_daily
  • TEACHING
    • section names
    • all Pardot data
  • PROFESSIONAL DEVELOPMENT
    • all channel-backed projects (standalone or in lessons)
    • all data in data tables (even if used in lessons)
    • all uploaded assets (even if used in lessons)
    • free response survey question responses
    • open ended text responses
    • all Pardot data
    • peer review content
  • CONTENT
    • all channel-backed projects (standalone or in lessons)
    • all data in data tables (even if used in lessons)
    • all uploaded assets (even if used in lessons)
    • free response survey question responses
    • open ended text responses
    • all Pardot data

DATA TO KEEP

  • USER
    • gender
    • locale
    • coarse geolocation data
      • state
      • country
      • postal code
    • birthday
    • user_type
    • oauth provider
    • urm, races
  • TEACHING
    • section information other than section name
    • follower information
  • PROFESSIONAL DEVELOPMENT
  • CONTENT
    • school_info_id (TODO(asher): REALLY?)
    • paired programming information
    • puzzle ratings
    • non-free response survey question responses
    • professional development attendance information
    • peer reviews other than the content

@ashercodeorg
Copy link
Contributor Author

Looking for a high-level review, confirming that it is sensical to perform this process semi-manually rather than create triggers and dependent destroys within our models. Assuming so, please let me know whether you would prefer a bunch of incremental PRs or a long PR (expecting some back and forth).


user.reload.update!(
name: '',
username: '',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or for worse, there are unique indexes on username, email, etc. How do we want to manipulate these fields while maintaining the DB constraints? Do we want to relax the DB constraints?

@ashercodeorg ashercodeorg force-pushed the deleteScript branch 2 times, most recently from 720eb9b to 7b510ae Compare May 19, 2017 16:37
def delete_open_ended_answers(user_id)
UserLevels.where(user_id: user_id).find_each do |user_level|
if OPEN_ENDED_LEVEL_TYPES.include? user_level.level.type
user_level.level_source.destroy
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we want do delete the level_source? Does it make sense to wipe its data (also md5?), keeping the row otherwise intact? Do we want to purge the whole row?

Note that, in both cases, there are implications for other users that happen to have the same level_source_id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we destroying the level source at all and not just the user level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we are fine with maintaining a record that the student did work, we want to eliminate the work that was done. This is motivated by student privacy law, I believe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we then update the user_level to no longer point to a level_source and delete the level_source only if there are no longer any user_levels pointing to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't at all clear to me what makes the most sense, from a product or from a engineering perpsective (thus my starting this thread). If we could delete the level source only if there were no user_levels pointing to it, that'd be awesome. Unfortunately, we are lacking the index on user_levels.level_source_id to allow us to carry this out efficiently (practically).

Note that we do have an index on activities.level_source_id. That said, there would be some complexity in trying to make use of it, as we'd have to consult both activities and activities_old (the backup table from before our overrunning the ID counter).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misunderstanding something, deleting the level source indiscriminately seems like it isn't even an option, given how significant the impact would be on other users.

How bad would it be to simply remove the pointer from deleted user to level source and potentially have orphaned level sources floating around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My belief is that this is not considered acceptable by product. Otherwise, I agree, this would be the easy answer.

For now, I'm going to suggest we not focus too much on this, as product still needs to finalize their requirements.

@Hamms
Copy link
Contributor

Hamms commented May 19, 2017

What's the thinking behind doing this manually as opposed to relying on dependent destroys?

@ashercodeorg
Copy link
Contributor Author

My concern with dependent destroys is that this pipeline only happens on "hard"-delete, not "soft"-delete. So we'd need to invoke this pipeline in a manner distinct from user.destroy, where we want the soft-delete dependent destroys but not the hard-delete dependent destroys.

Also, it seems less error prone to keep data deletion to a script. Though I could be very mistaken on that.

Copy link
Contributor

@Hamms Hamms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like the overall idea of doing this manually and carefully, but it does seem pretty error-prone; I'm not sure whether I'm more worried about things that should be deleted getting skipped and being orphaned or about the act of deleting something causing other things that we don't want to delete to break.

I'm in favor of this all being in a single large PR, in the hopes that that will make it easier to notice something that got missed.

def delete_open_ended_answers(user_id)
UserLevels.where(user_id: user_id).find_each do |user_level|
if OPEN_ENDED_LEVEL_TYPES.include? user_level.level.type
user_level.level_source.destroy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we destroying the level source at all and not just the user level?

FreeResponse
Gamelab
Weblab
).freeze
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Artist and Playlab be included in this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most definitely. Amongst many other things...the exact list is still getting sorted out.

next unless Follower.where(student_user: student_user).count == 1

# TODO(asher): Determine whether this should be delete_user_and_dependencies(student_user).
student_user.destroy!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a soft delete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the TODO, I need to get a definite answer. Note that, after thirty days, the two are equivalent (as a soft-deleted user then has delete_user_and_dependencies run on it).

@ashercodeorg
Copy link
Contributor Author

Ready for review... PTAL.

@ashercodeorg
Copy link
Contributor Author

Actually, just kidding. There are a few more changes that I want to refactor from the script into models.

@ashercodeorg
Copy link
Contributor Author

Those refactor changes have been made, PTAL.

# Cleans all LevelSource-backed progress associated with the user.
# WARNING: Since the relationship between user_levels and activities and level_sources is
# many-to-one, we cannot be certain that this is not messing with the integrity of the data for
# other users.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still quite concerned about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. As am I. But I'm pretty sure @jeremydstone is not... (at least, I'm doing what I'm told).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion with @jeremydstone, this has been changed to severe the link between the user's level-backed progress and the user (see commit d39f371) rather than remove the actual student work (image). PTAL.

FYI @poorvasingal @Hamms @jeremydstone.

where(dashboard_user_id: user_id).
map {|contact_rollup| contact_rollup[:pardot_id]}
failed_ids = Pardot.delete_prospects(pardot_ids)
raise "Pardot.delete_prospects failed for Pardot IDs #{failed_ids.join(', ')}." unless failed_ids
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be unless failed_ids.empty??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done (using if failed_ids.any?).

DeleteAccountsHelper.remove_from_solr(user.id)

user.purged_at = Time.zone.now
user.save(validate: false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this imply that purged users are expected to fail validation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, purged users are expected to pass validation, I've gone to some trouble to make sure of that. That said, I want the purged_at update to get persisted, whether or not the user is valid.

# NOTE: This method assumes the sections and followers associated with user_id have already been
# destroyed.
# @param [Integer] user_id for which to delete orphaned students.
def self.purge_orphaned_students(user_id)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe the scariest part to me. An area where we could be doing the right thing under the letter of our TOS/Privacy policy and yet have kids come in the in morning to discover all their stuff nuked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel welcome to take the issue up in the spec or with @poorvasingal directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Teachers will have a super clear warning that their student accounts (that don't have a personal login) will get deleted if they delete their own account. I also don't think it's a common scenario for teachers to delete their accounts so I am not super worried.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SELECT user_type, COUNT(0)
FROM users
WHERE deleted_at IS NOT NULL
GROUP BY 1;
+-----------+----------+
| user_type | COUNT(0) |
+-----------+----------+
| student   |    71937 |
| teacher   |     6710 |
+-----------+----------+

@jeremydstone
Copy link

The code looks good. Would like to have an interactive conversation with @ashercodeorg around testing, risks & mitigation prior to merge.

@ashercodeorg
Copy link
Contributor Author

Per offline conversation, merging so that this can be tested in isolation on adhoc and test environments.

@ashercodeorg ashercodeorg merged commit 7a50109 into staging Aug 2, 2017
@ashercodeorg ashercodeorg deleted the deleteScript branch August 2, 2017 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants