Extract chromosome M data
Background
See DCL tutorial mitolinplan-v0 for an outline of the mitolin project plan. The plan outlines steps 1-4, which are referred to below.
Since the mitolin project focuses on mitochondrial DNA, there are a number of places in our pipeline where we can get rid of excess data. (See slides for the difference between mitochondrial and nuclear DNA.)
The first, and more complicated, place to remove excess data is the .vcf file that is generated after step 2. (See the Google Drive link at the bottom of mitolinplan-v0 to download an example .vcf file.)
The second, and more straightforward, place is the .fasta files that are generated after step 3. I suggest you tackle this issue by starting with the .fasta files.
Aim
Create a script that makes two new eg.fasta files from the two fasta example files called "chrMchr1-1.fasta.eg" & "chrMchr1-2.fasta.eg". These can be found in mitolin/data/gen/nguyen_nc_2018/20190613-fastas. Please put the new files in the same directory as the old files.
The new files should contain only chrM, not chr1.
Document your work
Please fork & clone this repo. Check out a branch for your work, then push and make a PR for us to merge your note and files.
Add a note (can be .md or .ipynb) with your solution to mitolin/nb.
Your note should be named as follows:
- DATE-issue#-shortdescription.ext
e.g.:
- 20190701-i02-extract-chrM-fa.md
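The naming pattern above can be sketched as a small Python helper. The function name `note_filename` and its parameters are hypothetical, and the zero-padded `i02` issue format is inferred from the single example rather than stated as a rule.

```python
from datetime import date


def note_filename(day: date, issue: int, description: str, ext: str) -> str:
    """Build a note name in the DATE-issue#-shortdescription.ext pattern.

    Assumption: the issue number is written as "i" plus two digits,
    as in the example "i02".
    """
    return f"{day:%Y%m%d}-i{issue:02d}-{description}.{ext}"


print(note_filename(date(2019, 7, 1), 2, "extract-chrM-fa", "md"))
# prints: 20190701-i02-extract-chrM-fa.md
```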
Method
This data extraction from a text file can probably be done using the command-line tool sed (e.g., from a bash shell) or Python. Either is fine. If you want to try both, or find another way, that's great too. Please add notes to your documentation that give us some insight into your thinking.
See these Biostars discussion chains:
a. "Question: how to convert a long fasta-file into many separate single fasta sequences" link
b. "Question: Splitting A Fasta File" link
c. "Question: How To Split A Multiple Fasta" link
d. "Question: How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File" link
e. "Question: Split Large Fasta Into Mulitple Files, Can'T Name Them With Gi Number" link
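As one possible starting point for the Python route mentioned in Method, here is a minimal sketch that keeps only the records for one chromosome. It assumes each record starts with a header line such as `>chrM` or `>chr1`, where the first whitespace-delimited token after `>` is the chromosome name; the real headers in the example files may differ, so check them first. The function names `keep_chrom` and `extract_chrom` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Iterator


def keep_chrom(lines: Iterable[str], chrom: str = "chrM") -> Iterator[str]:
    """Yield only the lines of FASTA records whose header names `chrom`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumption: the first token after ">" is the chromosome name.
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] == chrom
        if keep:
            yield line


def extract_chrom(src: Path, dst: Path, chrom: str = "chrM") -> None:
    """Copy the records for `chrom` from the file `src` into the file `dst`."""
    with src.open() as fin, dst.open("w") as fout:
        fout.writelines(keep_chrom(fin, chrom))


# Small in-memory example (made-up sequences, not project data).
example = [">chr1\n", "ACGT\n", ">chrM\n", "GATC\n", "ACAC\n"]
print("".join(keep_chrom(example)), end="")
# prints:
# >chrM
# GATC
# ACAC
```

To apply this to the example files, call `extract_chrom` once per input file with a source path inside mitolin/data/gen/nguyen_nc_2018/20190613-fastas and an output path of your choosing in the same directory. Reading line by line keeps memory use low, which matters if the same approach is later applied to larger .fasta files.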
Questions?
Please put questions related to this issue in this issue thread. If you want a quick response, post a link to your comment in this thread to Slack #deepcelllineage or DM @deena. To join Slack, enter your email address here. For questions NOT specifically related to this issue, get in touch through any of the communication methods listed in DCL's overview README.