Workshop overhaul #53

Closed
taylorreiter opened this issue Jul 1, 2018 · 12 comments

@taylorreiter
Contributor

commented Jul 1, 2018

We propose a change to the current lesson. The changes were born out of a recent DC Genomics workshop at UC Davis, and conversations and brainstorming sessions that occurred at CarpentryConnect West. These changes reflect conversations with @fpsom @crazyhottommy @ryanpeek @shannonekj @raynamharris @AstrobioMike @abostroem @perisateesh @tomsing1 @jthmiller @reedacartwright @tracykteal @Joiry and @adamjorr. They also reflect some changes suggested by @bluegenes (#41) and @standage (#42).

We welcome more community input as we move forward! We have forked DC Genomics repos to github.com/data-lessons, and will be developing there.

https://github.com/data-lessons/shell-genomics
https://github.com/data-lessons/cloud-genomics
https://github.com/data-lessons/organization-genomics
https://github.com/data-lessons/wrangling-genomics

We propose:

Day 1: Introduction to command line for bioinformatics

  • Why shell? (use tools, automate)

  • Why cloud computing? (more space. also note you need shell to cloud compute)

    • remove "choosing a cloud section"
    • NB this section will be quite short.
  • Cloud Genomics, Episode 2: Logging onto cloud

    • Talk about command structure when sshing
  • Shell Genomics - on cloud, written around a text file. This could be the metadata file, that we reveal later. It could include all 2,443 Lenski samples. Meta-data here. Include fasta file as well.

    • Episode 3 needs a rewrite. We think we need to cover cd, rm, head, tail, cat, print, mv, cp, grep, wc, less, man, scp (teach with cp), curl
      • Show grep by grepping for our 6 samples.
    • We think this could be named "Exploring the Shell"
    • @tomsing1 pirate treasure hunt to demonstrate folder structure in a rewarding way
    • Add optional episode that includes cut, paste, sort, uniq, awk
  • Shell Genomics, Episode 4: Pipes & Redirection

    • right now includes >, |, sort, wc, and uniq, cut, paste
    • We think it should only include > and |
  • Shell Genomics, Episode 05: Writing Scripts.

    • Change name to "Writing For loops & scripts"
    • Don't write a script using history.
    • Write the script in nano
    • Modify to include for loop, addressing variables ($) and arguments ($1)
    • Also use print in the for loop, like @ctb's Beginner Unix lesson.
    • For non-novice learners optional: Introduce tmux/screen, perhaps with for loops.
    • Consider making two episodes
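
As a rough illustration of the Day 1 commands proposed above (grep, >, |, and a for loop in a script that uses a variable and $1), a minimal sketch might look like the following. It is not the lesson's actual text. The file names, the toy metadata rows, and the script name list_files.sh are all made up for illustration, and echo stands in for the "print" mentioned in the list.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the metadata file (made-up rows, not the real Lenski table).
printf 'sample\tpopulation\n' > metadata.tsv
printf 'sampleA\tpop1\nsampleB\tpop2\nsampleC\tpop1\n' >> metadata.tsv

# grep selects the matching lines; > redirects them into a new file.
grep "pop1" metadata.tsv > pop1.tsv

# | passes the output of one command to the next; wc -l counts the lines.
n_pop1=$(grep "pop1" metadata.tsv | wc -l)
echo "pop1 samples:" $n_pop1

# A small script (in a workshop this would be typed in nano rather than written with cat).
cat > list_files.sh << 'EOF'
#!/bin/bash
# $1 is the first argument given to the script (a file extension here).
for filename in *."$1"
do
    # $filename is the loop variable; echo prints its value.
    echo "Found: $filename"
done
EOF

# Run the script with "tsv" as its first argument.
bash list_files.sh tsv
```

Keeping the script to a single loop with one argument matches the proposal to introduce variables ($) and arguments ($1) together without relying on history.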

Day 02: Genomics Workflow

Additional suggestions

  • use Git Bash instead of PuTTY. Include pasting instructions in Git Bash, and note that open and man don't work in Git Bash. Relates to #41
  • Change the dataset to longer reads (~150bp) from Lenski lab as suggested in #42. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/)
  • We would like to add a project narrative that includes details of the Lenski experiment.
@JasonJWilliamsNY

Contributor

commented Jul 3, 2018

How will these changes fit with the forthcoming R lessons? The R maintainers have been condensing everything into a one-day workshop that is paired with a one-day Unix workshop. I have taught 10+ Genomics workshops, always in the one-day R / one-day Unix format.

@taylorreiter

Contributor Author

commented Jul 3, 2018

@JasonJWilliamsNY I don't think there will be any direct conflicts with the forthcoming R lessons. One potential conflict is that we are proposing to move the discussion of data tidiness to the beginning of the second day of this workshop instead of at the beginning of the first day.

The re-write of the unix lesson will use a metadata file derived from this supplement to demonstrate some of the commands, as well as the REL606 fasta file. Although the metadata file I am proposing we use is different than the one that is used in the R lessons, they are both relevant to REL606 and the E. coli story. Do you foresee the use of this other metadata file creating issues or confusion?

Instead of telling the cit+/- phenotype story, I am planning to interweave the story of hypermutability into the variant calling workflow. This is an especially rewarding biological story for variant calling: the hypermutable strains accumulate mutations much more quickly than the other strains, which can be observed in the number of variants called in the vcf file using the commands learned during the shell lesson.
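
To make the idea of counting called variants concrete, here is a minimal sketch using only grep, a pipe, and wc. The two VCF files are tiny made-up examples (not output from the real pipeline), and the function name count_variants is hypothetical.

```shell
# Count variant records in a VCF: drop header lines (which start with #), count the rest.
count_variants() {
    grep -v "^#" "$1" | wc -l
}

# Scratch directory for the toy files.
vcfdir=$(mktemp -d)

# Hypothetical toy VCFs, invented for illustration only.
header='##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
printf "$header" > "$vcfdir/non_mutator.vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$vcfdir/non_mutator.vcf"

printf "$header" > "$vcfdir/mutator.vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$vcfdir/mutator.vcf"
printf 'chr1\t250\t.\tC\tT\t50\tPASS\t.\n' >> "$vcfdir/mutator.vcf"
printf 'chr1\t900\t.\tG\tA\t50\tPASS\t.\n' >> "$vcfdir/mutator.vcf"

# Compare the number of variant records in each file.
for vcf in "$vcfdir"/non_mutator.vcf "$vcfdir"/mutator.vcf
do
    echo "$vcf:" $(count_variants "$vcf") "variants"
done
```

In this toy setup the mutator file simply has more non-header lines, which is the kind of difference the narrative would ask learners to notice.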

@naupaka

Member

commented Jul 4, 2018

@taylorreiter @JasonJWilliamsNY One of the starting points for our lessons is the set of VCF files from the Lenski data. Is the pipeline script that produces those files going to change substantially, or just the interpretation of the calls?

@taylorreiter

Contributor Author

commented Jul 4, 2018

@JasonJWilliamsNY @naupaka
The pipeline is being updated to more recent versions of the tools, and the input files are changing to longer reads (~150bp).

So far we have selected these SRR files. The number of propagations is given in parentheses:
ARA+3 (non mutator)
SRR2588658 (500)
SRR2584668 (500)
SRR2584669 (1000)
SRR2591034 (1000)

ARA-3 (mutator)
SRR2584683 (20000)
SRR2584684 (20000)
SRR2584685 (30000)
SRR2588848 (30000)
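
For illustration only, the selection above could be kept in a small tab-separated file and queried with commands from the shell lesson. The accessions and numbers are copied from the list; the three-column layout and the file name samples.tsv are assumptions, not part of the actual lesson.

```shell
# Scratch location for the sample table.
sampledir=$(mktemp -d)
samples="$sampledir/samples.tsv"

# Columns (assumed layout): accession, population, number of propagations.
printf '%s\t%s\t%s\n' \
    SRR2588658 ARA+3 500 \
    SRR2584668 ARA+3 500 \
    SRR2584669 ARA+3 1000 \
    SRR2591034 ARA+3 1000 \
    SRR2584683 ARA-3 20000 \
    SRR2584684 ARA-3 20000 \
    SRR2584685 ARA-3 30000 \
    SRR2588848 ARA-3 30000 > "$samples"

# List the accessions for the mutator population (ARA-3).
grep "ARA-3" "$samples" | cut -f 1

# Count how many samples belong to the mutator population.
n_mutator=$(grep -c "ARA-3" "$samples")
echo "mutator samples: $n_mutator"

# Count samples per population.
cut -f 2 "$samples" | sort | uniq -c
```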

I see that your lessons rely on the population designated Ara-3, so hopefully, even though the calls from the vcf file will likely be different, this will not impact the narrative and code for your lessons. I have not produced the new vcf file yet, but it is on my list of things to do in the next day or two. I can attach it here if that would be helpful!

We had talked about subsampling our trimmed reads to one gene instead of to 3x coverage, as an alternate way of making the pipeline run faster that would also render better in samtools tview. Would this impact your lesson? We have not implemented this yet, so it would be easy not to.

@naupaka

Member

commented Jul 4, 2018

It does sound like some of these changes will alter the lessons we are developing in terms of specifics, even if not in overall structure and flow. When will those VCF decisions get finalized? We should be able to work with whatever you all decide on, but we can't move forward until then on the parts of our lessons that are based on analyzing and visualizing those VCF results.

@JasonJWilliamsNY

Contributor

commented Jul 8, 2018

@naupaka @taylorreiter my overall concern is this: if we are working towards a two-day genomics workshop with one full day of R, are you working the Unix lessons into a one-day format? This is actually a big decision, and maybe we need to check with the curriculum committee.

@ErinBecker

Contributor

commented Sep 12, 2018

@taylorreiter @naupaka @JasonJWilliamsNY - I'm working on organizing the agenda for the CAC meetings on the 24th and 25th and wanted to try to get some clarification on this proposed reorganization of the workshop.

The currently published Genomics workshop includes project organization and management, intro to the command line, data wrangling and processing, and intro to cloud computing. It is two days long and includes NO R.

From my understanding, the curriculum that @taylorreiter proposed above rearranges and makes significant changes to the existing workshop materials, but does NOT add any R content. It would stay a two-day long workshop.

The Genomics R Maintainers (including @JasonJWilliamsNY and @naupaka) have been working on putting together a curriculum for Genomics work with R. I was under the impression that this was meant to be a two day workshop, which included no (or very little) Unix, and was completely independent of the existing curriculum (in the sense of being able to be offered as a stand-alone workshop).

If I'm misunderstanding anything, please let me know. I'd like to make sure the agenda I put in front of the CAC is accurate.

@naupaka

Member

commented Sep 12, 2018

My understanding was that our plan was to have the R materials be the second day of a two day workshop - the idea was that we would start with the VCF file that was produced at the end of the first day. That way the learners attending the workshop get to go through the whole process from raw data to report, instead of stopping part way through the process.

@taylorreiter

Contributor Author

commented Sep 13, 2018

@ErinBecker your understanding of the proposed changes is correct. We do not propose to add any R material, and propose to make significant changes to existing material. As @naupaka points out, the conflict arises in that we have proposed to change the dataset we are using to one with longer, paired-end reads, which would change the VCF file that is the output of the genomics lesson and acts as input to any subsequent R lessons. We have a beta version of the lesson that would create this new vcf file here: https://github.com/data-lessons/wrangling-genomics

We plan to update the shell/cloud/organization lessons soon.

@mdehollander


commented Nov 28, 2018

I noticed the Curriculum Advisory Committee discussed this topic (https://github.com/datacarpentry/curriculum-advisors/blob/master/genomics/september-2018-genomics-minutes.md). Since we are planning to organise a Genomics Carpentry event in the Netherlands, I am interested to know the current status. Especially about this part in the minutes:

Consensus to move forward with new proposed dataset and tools, provided support from community members who proposed and/or Maintainers for Shell/Wrangling lessons.

Is there any current activity? Where can I follow the progress? Here? If I want to contribute, what is the best way?

@naupaka

Member

commented Nov 28, 2018

Current R lessons are in progress at https://github.com/carpentrieslab/genomics-r-intro, but are not yet even ready for an alpha release. I believe our target is to have the parts at least drafted by ~January or so. There is the current release version here, but that does not include R at the moment.

@ErinBecker

Contributor

commented Mar 18, 2019

These changes have been implemented and are now live!

@ErinBecker ErinBecker closed this Mar 18, 2019
