# Extract chromosome M data

## Background

See DCL tutorial [mitolinplan-v0](https://github.com/deepcelllineage/overview/wiki/mitolinplan-v0) for an outline of the mitolin project plan. The plan outlines steps 1-4, which are referred to below. 

Since the mitolin project focuses on mitochondrial DNA, there are several places in our pipeline where we can discard excess data. (See [slides](https://github.com/deepcelllineage/overview/wiki/slides) for the difference between mitochondrial and nuclear DNA.)

The first place to remove excess data, and the more complicated one, is the .vcf file that is generated after step 2. (See the Google Drive link at the bottom of [mitolinplan-v0](https://github.com/deepcelllineage/overview/wiki/mitolinplan-v0) to download an example .vcf file.)

The second, more straightforward place is the .fasta files that are generated after step 3. I suggest you tackle this issue by starting with the .fasta files.

## Aim

Create a script that makes two new `eg.fasta` files from the two fasta example files called "chrMchr1-1.fasta.eg" & "chrMchr1-2.fasta.eg". These can be found in [mitolin/data/gen/nguyen_nc_2018/20190613-fastas](https://github.com/deepcelllineage/mitolin/tree/master/data/gen/nguyen_nc_2018/20190613-fastas). Please put the new files in the same directory as the old files.

The new files should contain only the chrM sequence, not chr1.

## Document your work

Please fork & clone this repo. Check out a branch for your work, then push and make a PR for us to merge your note and files. 

Add a note (can be .md or .ipynb) with your solution to [mitolin/nb](https://github.com/deepcelllineage/mitolin/tree/master/nb). 

Your note should be named as follows: 

- DATE-issue#-shortdescription.ext

e.g.:

- 20190701-i02-extract-chrM-fa.md

## Method

Extracting these records from a text file can probably be done with the command-line tool `sed` or with Python. Either is fine. If you want to try both, or find another way, that's great too. Please add notes to your documentation that give us some insight into your thinking.
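To make the Python route concrete, here is a minimal sketch of one possible approach, not a reference solution. It assumes each record header starts with `>` and that its first whitespace-delimited token is exactly `chrM` or `chr1`; please check the real headers in the example files before relying on this. The function names `extract_records` and `extract_file` are hypothetical, and the sequences in the demo are made up.

```python
from pathlib import Path


def extract_records(fasta_text: str, wanted: str = "chrM") -> str:
    """Return only the FASTA records whose header ID equals `wanted`.

    Assumption: the record ID is the first whitespace-delimited token
    after ">" on the header line (e.g. ">chrM" or ">chrM some comment").
    """
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] == wanted
        if keep:
            kept.append(line)
    return "\n".join(kept) + "\n" if kept else ""


def extract_file(src: Path, dst: Path, wanted: str = "chrM") -> None:
    """Read `src`, keep only the `wanted` records, and write them to `dst`."""
    dst.write_text(extract_records(src.read_text(), wanted))


# In-memory demo with made-up sequences (not taken from the real data).
example = ">chrM\nGATCACAGGT\nCTATCACCCT\n>chr1\nNNNNACGTAC\n"
print(extract_records(example), end="")
```

For the real data, `extract_file` could be called once per input file, with the source path pointing at `chrMchr1-1.fasta.eg` or `chrMchr1-2.fasta.eg` and an output path of your choosing in the same directory. Matching on the whole ID rather than a prefix avoids accidentally keeping a record whose name merely begins with `chrM`.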

See these Biostars discussion threads:

- ["Question: how to convert a long fasta-file into many separate single fasta sequences"](https://www.biostars.org/p/105388/)
- ["Question: Splitting A Fasta File"](https://www.biostars.org/p/76716/#76719)
- ["Question: How To Split A Multiple Fasta"](https://www.biostars.org/p/2226/)
- ["Question: How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File"](https://www.biostars.org/p/13270/)
- ["Question: Split Large Fasta Into Mulitple Files, Can'T Name Them With Gi Number"](https://www.biostars.org/p/45973/)

## Questions? 

Please put questions related to this issue in this issue thread. If you want a quick response, post a link to your comment to Slack #deepcelllineage or DM @Deena. To join Slack, enter your email address [here](http://bit.ly/JoinSlackFastaiSFbay). For questions NOT specifically related to this issue, get in touch through any of the communication methods listed in [DCL's overview README](https://github.com/deepcelllineage/overview/blob/master/README.md).




 
