Create backups flowchart #23

Closed
ACharbonneau opened this issue Jul 6, 2015 · 11 comments
Labels
help wanted Looking for Contributors type:enhancement Propose enhancement to the lesson

Comments

@ACharbonneau
Contributor

as in http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf

@mkuzak
Member

mkuzak commented Aug 30, 2017

I like this idea a lot. I'm afraid there are too many solutions and approaches, but maybe we can come up with something simple. I'm curious whether anyone has seen something like that.

@Tantoluwa

Please simplify the chart; it is cumbersome.

@fmichonneau fmichonneau added type:enhancement Propose enhancement to the lesson and removed enhancement labels Apr 17, 2018
@hoytpr
Contributor

hoytpr commented Jun 19, 2018

@mkuzak @ACharbonneau @Tantoluwa Can you clarify this? Are you asking for a flowchart describing a sequencing data backup system? The chart referenced describes how best to plot your data. Is the request that we develop a data --> backup data --> archives chart?

We use a fairly simple but effective backup process for archiving data. The data are available for a few weeks to months on an internal cloud server, during which the data owner can make their own local backups. After that, the data are all compressed and moved to duplicate tape archives at separate sites. Is this the kind of chart requested?

@ACharbonneau
Contributor Author

@hoytpr I was originally thinking of making something like a decision-tree/flow chart type of thing. I wanted a way to give people an idea of what type of backup system fit their needs, which usually just involves me asking them a bunch of questions that help me narrow down what might work.

So, at the top, your first decision might be "How much data do you have to back up?"

Other questions I had in my head were:

  • Is access to your data restricted? (e.g. medical data has different security requirements than my plant stuff)
  • How long do you need to maintain the backup? (difference between tape/hard drive/etc.)
  • How often do you need to access the data? (can't keep it on tape if you need it every day)
  • How many people need access to the data? (stuff on a university HPC might not be accessible by collaborators at other universities)
  • How much TIME do you have to spend on this? (are you actually going to check your cold storage tapes every year)
  • How much MONEY do you have to spend on this? (can you just pay Amazon or similar?)

There were probably others, but my 2 year old recollection of a brainstorming meeting is failing me :)
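The questions above could be turned into a first-pass decision helper. The following is only a minimal sketch: the `BackupNeeds` fields, the `suggest_backup` function, and every threshold (such as "5 years" or "10 TB") are hypothetical placeholders chosen for illustration, not recommendations from this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BackupNeeds:
    """Answers to the questions listed above (all fields are illustrative)."""
    size_tb: float               # How much data do you have to back up?
    restricted: bool             # Is access to the data restricted?
    keep_years: float            # How long must the backup be maintained?
    reads_per_month: float       # How often do you need to access the data?
    outside_collaborators: bool  # Do people at other institutions need access?
    can_maintain: bool           # Do you have time to check the backups regularly?
    can_pay: bool                # Is there money for a paid service?


def suggest_backup(needs: BackupNeeds) -> List[str]:
    """Return rough suggestions; every threshold here is a made-up placeholder."""
    notes = []
    if needs.restricted:
        notes.append("check institutional security requirements before choosing storage")
    if needs.reads_per_month >= 1:
        notes.append("keep a copy on disk or a server; tape is too slow for frequent access")
    elif needs.keep_years >= 5 and needs.can_maintain:
        notes.append("cold storage such as tape may fit long-term, rarely read data")
    if needs.outside_collaborators:
        notes.append("a single university HPC may not be reachable by collaborators")
    if needs.can_pay and not needs.restricted:
        notes.append("a commercial cloud provider is an option")
    if needs.size_tb > 10 and not needs.can_pay:
        notes.append("large data on a small budget: ask your institution about shared storage")
    if not notes:
        notes.append("an external hard drive plus one off-site copy may be enough")
    return notes


if __name__ == "__main__":
    example = BackupNeeds(size_tb=2, restricted=True, keep_years=10,
                          reads_per_month=30, outside_collaborators=False,
                          can_maintain=True, can_pay=False)
    for note in suggest_backup(example):
        print("-", note)
```

A flat list of notes is used instead of a strict tree because several of the questions are independent of each other; a real flowchart would need to decide which question comes first.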

@hoytpr
Contributor

hoytpr commented Jun 19, 2018

Thanks @ACharbonneau, your 2-year-old memory is better than mine, and your topics are good! This would be part of the original data planning. When submitting samples, the size of the resulting data files can be estimated (length of reads, number of reads, etc.). There are already some "Guidelines for storing data" and the specifics would probably be different at different institutions. For my NSF-funded instrument, it's pretty simple: save everything, forever. It would also depend on whether you were using a core facility or an external service, or whether your lab has its own sequencer. The common result of all those options is that you need your OWN PERSONAL copy of the data. Then you can work on it with whatever computer services are available, and archive it wherever you have those options. Accessibility restrictions will also vary between projects and places.
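As a rough illustration of estimating file size from read length and number of reads, here is a minimal sketch for uncompressed FASTQ. The function name `estimate_fastq_bytes` and the default `header_bytes=60` are assumptions made for this example; real header lengths vary by instrument, and compressed sizes depend on the data.

```python
def estimate_fastq_bytes(n_reads: int, read_length: int, header_bytes: int = 60) -> int:
    """Rough uncompressed FASTQ size in bytes.

    Each FASTQ record has four lines, each ending in a newline:
    a header, the sequence, a '+' separator, and a quality string
    that is as long as the sequence. header_bytes is an assumed
    average header length, not a measured value.
    """
    per_read = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    return n_reads * per_read


if __name__ == "__main__":
    # One million single-end reads of 150 bases with the assumed header length.
    size = estimate_fastq_bytes(1_000_000, 150)
    print(f"{size / 1e9:.3f} GB uncompressed")  # prints "0.365 GB uncompressed"
```

For paired-end runs the estimate would be applied once per read file, and the number of backup copies multiplies the total again.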

My opinion is that a more detailed workshop lesson here about planning would complement the genomics wrangling, cloud genomics, and genomics workshops. Your storage topic would be good to include, and then reference more storage details in the Cloud Genomics, Data Wrangling, and HPC lessons. There is so much that goes into planning that many take for granted.

But if we go into great detail about storage options, it would be unbalanced relative to other planning details like "What's the project?" "Will you make your own libraries?" "How much coverage do you need?" "What questions do you want resolved?" "Will you have a lot of samples, or a few?" "How many milligrams of each sample can be produced?" "Are these metagenomic samples?" ... These are all parts of planning and organizing sequencing projects. So maybe that level of detail will have to wait until this lesson is expanded. @mkuzak @Roselynlemusinmegen @analeighgui @raynamharris

@ACharbonneau
Contributor Author

ACharbonneau commented Jun 19, 2018

Right. I'm honestly not sure where "organization" falls in the lessons right now, but originally this was going to be one of the first things we talked about in genomics, as part of a big "actually plan your bioinformatics like a wet lab experiment" talk/soapbox. I never intended it to be a thing we spent a lot of time on in class, but rather a thing you could reference and let people go back to. We have a similar sort of thing in the cloud lesson: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html

I still think it would be a nice thing to have this 'decision tree/things to think about/whatever it is' as a link in one of the planning lessons. Even if it didn't get covered explicitly in most workshops, it would be useful for people coming back and reviewing. But obviously it's not a super high priority :)

@JasonJWilliamsNY
Contributor

Arizona BugBBQ: This information is useful but should not take up major real estate in the lesson. A link or two in the lesson would probably be enough.

@Tantoluwa

That would be great

@hoytpr
Contributor

hoytpr commented Jun 12, 2023

@JCSzamosi and @ACharbonneau et al. This is a stale issue, but it still has important points. I've recovered data that was 3-5 years old for people. But the most important part of the issue is to make multiple backups of your raw data. The learners don't need to know how to operate an archival system. The need for backups is emphasized sufficiently in the genomics lessons. Please reopen if you disagree.

Peter

@hoytpr hoytpr closed this as completed Jun 12, 2023
@JCSzamosi
Contributor

Is this something that should be referred to the curriculum committee?

@hoytpr
Contributor

hoytpr commented Jun 13, 2023

Speaking for myself, I don't think so. There is enough emphasis on data protection (keeping raw data raw, protecting data through permissions, making backups) that it should be clear.
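For readers coming back to this thread, the three practices mentioned (keep raw data raw, protect it through permissions, make backups) can be sketched together in a few lines. This is a minimal illustration using only the Python standard library; the function names `sha256sum` and `protect_and_backup` and the idea of passing a list of backup directories are assumptions for this example, not part of any lesson.

```python
import hashlib
import shutil
import stat
from pathlib import Path
from typing import List


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_and_backup(raw_file: Path, backup_dirs: List[Path]) -> str:
    """Make raw_file read-only, copy it to each backup dir, verify each copy."""
    # Remove write permission for user, group, and others.
    mode = raw_file.stat().st_mode
    raw_file.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

    checksum = sha256sum(raw_file)
    for backup_dir in backup_dirs:
        backup_dir.mkdir(parents=True, exist_ok=True)
        copy_path = backup_dir / raw_file.name
        # copy2 also copies permission bits, so the copy is read-only too.
        shutil.copy2(raw_file, copy_path)
        if sha256sum(copy_path) != checksum:
            raise IOError(f"checksum mismatch for {copy_path}")
    return checksum
```

In practice the backup directories would sit on separate physical devices or sites; copies in two folders on the same disk do not protect against losing that disk.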


8 participants