Create backups flowchart #23

ACharbonneau · 2015-07-06T21:42:10Z

as in http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf

mkuzak · 2017-08-30T18:22:42Z

I like this idea a lot. I'm afraid there are too many solutions and approaches, but maybe we can come up with something simple. I'm curious whether anyone has seen something like that.

Tantoluwa · 2018-02-23T09:20:19Z

Please simplify the chart; it is cumbersome.

hoytpr · 2018-06-19T13:10:53Z

@mkuzak @ACharbonneau @Tantoluwa Can you clarify this? Are you asking for a flowchart describing a sequencing data backup system? The chart referenced describes how best to plot your data. Is the request that we develop a data --> backup data --> archives chart?

We use a fairly simple but effective backup process for archiving data. The data are available for a few weeks to months on an internal cloud server, during which the data owner can make their own local backups. After that, the data are all compressed and moved to duplicate tape archives at separate sites. Is this the kind of chart requested?

ACharbonneau · 2018-06-19T13:25:59Z

@hoytpr I was originally thinking of making something like a decision-tree/flow chart type of thing. I wanted a way to give people an idea of what type of backup system fit their needs, which usually just involves me asking them a bunch of questions that help me narrow down what might work.

So, at the top maybe your first decision might be "How much data do you have to back up"?

Other questions I had in my head were:

Is your data access restricted? (e.g. medical data has different security requirements than my plant stuff)
How long do you need to maintain the backup? (difference between tape/hard drive/etc.)
How often do you need to access the data? (can't keep it on tape if you need it every day)
How many people need access to the data? (stuff on a university HPC might not be accessible by collaborators at other universities)
How much TIME do you have to spend on this? (are you actually going to check your cold storage tapes every year)
How much MONEY do you have to spend on this? (can you just pay Amazon or similar?)

There were probably others, but my 2 year old recollection of a brainstorming meeting is failing me :)
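The questions above could be encoded as a small decision helper. What follows is only a minimal Python sketch: the `BackupNeeds` fields, the threshold constants, and the wording of each suggestion are hypothetical placeholders chosen for illustration, not recommendations made in this thread.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; each lab or institution would set its own.
LARGE_DATA_TB = 10.0
FREQUENT_ACCESS_PER_MONTH = 1.0
LONG_RETENTION_YEARS = 5.0


@dataclass
class BackupNeeds:
    size_tb: float                # how much data has to be backed up
    restricted: bool              # access-restricted data (e.g. medical)
    retention_years: float        # how long the backup must be kept
    accesses_per_month: float     # how often the data is needed
    external_collaborators: bool  # people outside the institution need access
    has_budget: bool              # money for a paid hosted service
    has_time: bool                # time to check cold storage regularly


def suggest_backup(needs: BackupNeeds) -> List[str]:
    """Walk the questions in order and collect points to consider."""
    suggestions = []
    if needs.size_tb > LARGE_DATA_TB:
        suggestions.append("a single external drive is unlikely to be enough")
    if needs.restricted:
        suggestions.append("confirm security requirements before choosing storage")
    if needs.accesses_per_month >= FREQUENT_ACCESS_PER_MONTH:
        suggestions.append("keep a copy on disk or other online storage, not tape")
    elif needs.retention_years >= LONG_RETENTION_YEARS:
        suggestions.append("consider tape or other cold storage")
    else:
        suggestions.append("disk storage such as an external hard drive may do")
    if needs.external_collaborators:
        suggestions.append("check that outside collaborators can reach the storage")
    if needs.has_budget:
        suggestions.append("a paid hosted service is an option")
    if not needs.has_time:
        suggestions.append("avoid options that need regular manual checks")
    return suggestions
```

A real flowchart would branch rather than accumulate notes, but a flat list like this is easier to review and extend while the set of questions is still being settled.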

hoytpr · 2018-06-19T20:03:12Z

Thanks @ACharbonneau, your 2-year-old memory is better than mine, and your topics are good! This would be part of the original data planning. When submitting samples, you can estimate the size of the resulting data files (length of reads, number of reads, etc.). There are already some "Guidelines for storing data" and the specifics would probably be different at different institutions. For my NSF-funded instrument, it's pretty simple: save everything, forever. It would also depend on whether you are using a core facility or an external service, or your lab has its own sequencer. The common result of all those options is that you need your OWN PERSONAL copy of the data. Then you can work on it with whatever computer services are available, and archive it wherever you have those options. Accessibility restrictions will also vary between projects and places.
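As a rough illustration of estimating file size from read length and number of reads, here is a minimal Python sketch for uncompressed FASTQ output. It assumes four lines per read (header, sequence, `+`, quality string) and single-byte newlines. The function name and the default header length of 60 bytes are assumptions for illustration; real headers vary by instrument, and compressed files will be considerably smaller.

```python
def estimate_fastq_bytes(num_reads: int, read_length: int,
                         header_bytes: int = 60) -> int:
    """Estimate uncompressed FASTQ size in bytes.

    Each record has four newline-terminated lines:
    header, sequence, a lone '+', and a quality string
    the same length as the sequence.
    """
    per_read = (
        header_bytes + 1    # header line (assumed length) + newline
        + read_length + 1   # sequence line + newline
        + 1 + 1             # '+' separator line + newline
        + read_length + 1   # quality line + newline
    )
    return num_reads * per_read


# Example: one million 150 bp reads with the assumed 60-byte header.
size = estimate_fastq_bytes(1_000_000, 150)
print(f"{size / 1e9:.3f} GB uncompressed")
```

For paired-end runs, the estimate would be doubled, since each read pair produces two records.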

My opinion is that a more detailed workshop lesson here about planning would complement the genomics wrangling, cloud genomics, and genomics workshops. Your storage topic would be good to include, and then reference more storage details in the Cloud Genomics, Data Wrangling, and HPC lessons. There is so much that goes into planning that many take for granted.

But if we go into great detail about storage options, it would be unbalanced relative to other planning details like "What's the project?" "Will you make your own libraries?" "How much coverage do you need?" "What questions do you want resolved?" "Will you have a lot of samples, or a few?" "How many milligrams of each sample can be produced?" "Are these metagenomic samples?" ... These are all parts of planning and organizing sequencing projects. So maybe that level of detail will have to wait until this lesson is expanded. @mkuzak @Roselynlemusinmegen @analeighgui @raynamharris

ACharbonneau · 2018-06-19T20:12:26Z

Right. I'm honestly not sure where "organization" falls in the lessons right now, but originally this was going to be one of the first things we talked about in genomics, as part of a big "actually plan your bioinformatics like a wet lab experiment" talk/soapbox. I never intended it to be a thing we spent a lot of time on in class, but rather a thing you could reference and let people go back to. We have a similar sort of thing in the cloud lesson: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html

I still think it would be a nice thing to have this 'decision tree/things to think about/whatever it is' as a link in one of the planning lessons. Even if it didn't get covered explicitly in most workshops, it would be useful for people coming back and reviewing. But obviously it's not a super high priority :)

JasonJWilliamsNY · 2019-06-01T17:06:48Z

Arizona BugBBQ: This information is useful but should not take up major real estate in the lesson. Probably a link or two in the lesson would be enough

Tantoluwa · 2019-06-05T00:50:47Z

That would be great

hoytpr · 2023-06-12T21:19:13Z

@JCSzamosi and @ACharbonneau et al. This is a stale issue, but it still has important points. I've recovered data for people that was 3-5 years old. But, the most important part of the issue is to make multiple backups of your raw data. The learners don't need to know how to operate an archival system. This is emphasized sufficiently in the genomics lessons. Please reopen if you disagree.

Peter
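The "multiple backups of your raw data" point can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not a description of any archival system mentioned in this thread: the function names and the choice of SHA-256 for verification are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, destinations):
    """Copy one raw data file to several directories and verify each copy."""
    source = Path(source)
    expected = sha256sum(source)
    copies = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)  # copies contents and file metadata
        if sha256sum(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Copies made this way only protect the data if the destinations sit on different devices or sites, which the code cannot check.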

JCSzamosi · 2023-06-12T22:16:54Z

Is this something that should be referred to the curriculum committee?

hoytpr · 2023-06-13T16:24:30Z

Speaking for myself, I don't think so. There is enough emphasis on data protection (keep raw data raw, protecting data through permissions, making backups) that it should be clear.

mkuzak added enhancement help wanted Looking for Contributors labels Aug 30, 2017

ErinBecker added the after-lesson-release label Nov 21, 2017

fmichonneau added type:enhancement Propose enhancement to the lesson and removed enhancement labels Apr 17, 2018

hoytpr closed this as completed Jun 12, 2023
