# task_One
## Understanding the files in the Genome Assembly folder
_Date: 03/22/2017_

### *_protein.faa.gz file

> - FASTA format of the accessioned protein products annotated on the genome assembly
> - The FASTA title is formatted as sequence accession.version plus description.
       
- We can find Oskar in this file!
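
A minimal sketch of how such a lookup might be done: the snippet below scans the FASTA title lines of a gzipped protein file for a keyword. The function name `find_proteins` and the file name in the usage comment are hypothetical. The only assumption about the data is the title layout quoted above (`>` followed by accession.version, then a description).

```python
import gzip


def find_proteins(faa_gz_path, keyword):
    """Return (accession.version, description) for FASTA titles containing keyword."""
    hits = []
    with gzip.open(faa_gz_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # skip sequence lines
            # Title layout: ">accession.version description"
            accession, _, description = line[1:].strip().partition(" ")
            if keyword.lower() in description.lower():
                hits.append((accession, description))
    return hits


# Hypothetical usage (file name is a placeholder):
# find_proteins("some_assembly_protein.faa.gz", "oskar")
```

The match is case-insensitive and only looks at the description, so it depends on how the protein was named by the annotators.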

### *_genomic.gbff.gz file

> - GenBank flat file format of the genomic sequence(s) in the assembly 
> - This file includes both the genomic sequence and the CONTIG description

- Can be directly used by __Augustus__

### *_genomic.gff.gz file

> - Annotation of the genomic sequence(s) in Generic Feature Format Version 3 (GFF3).
> - Sequence identifiers are provided as accession.version.

- Can be used by __Augustus__ after converting this file to a GenBank file
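
To make the GFF3 layout concrete, here is a small sketch that splits one feature line into its nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The function name `parse_gff3_line` and the example line are made up for illustration; this is not the conversion to GenBank itself, only a way to inspect the annotations.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; return None for comment or blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    # A feature line has exactly 9 tab-separated columns;
    # a malformed line raises ValueError here.
    seqid, source, ftype, start, end, score, strand, phase, attributes = (
        line.rstrip("\n").split("\t")
    )
    # The attributes column holds semicolon-separated key=value pairs
    attrs = dict(
        pair.split("=", 1) for pair in attributes.split(";") if "=" in pair
    )
    return {
        "seqid": seqid,  # accession.version of the sequence
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attrs,
    }


# Made-up example line, not taken from a real assembly
example = "ABCD01000001.1\tGenbank\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=example\n"
feature = parse_gff3_line(example)
print(feature["seqid"], feature["type"], feature["start"], feature["end"])
```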

### *_genomic.fna.gz file

> - FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case (see below).
> - The FASTA title is formatted as sequence accession.version plus description.

- Used for __gene prediction (GeneMark)__ and __protein alignments (Exonerate)__ when we don't have gene annotations (genomic.gff)
- Used to convert annotations.gff into annotations.gb
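
Since repeats are soft-masked to lower-case in this file, a quick check of how much of an assembly is masked might look like the sketch below. The function name `softmasked_fraction` and the file name in the usage comment are hypothetical, and every character on a sequence line is counted as a base (including `N`).

```python
import gzip


def softmasked_fraction(fna_gz_path):
    """Return the fraction of lower-case (soft-masked) characters in a gzipped FASTA."""
    lower = 0
    total = 0
    with gzip.open(fna_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA title lines
            seq = line.strip()
            total += len(seq)
            lower += sum(1 for base in seq if base.islower())
    # Avoid dividing by zero on an empty file
    return lower / total if total else 0.0


# Hypothetical usage (file name is a placeholder):
# softmasked_fraction("some_assembly_genomic.fna.gz")
```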

### *_rna.fna.gz file

> - FASTA format of accessioned RNA products annotated on the genome assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA products are not instantiated as a separate accessioned record in GenBank but are provided for some RefSeq genomes, most notably the eukaryotes.)
> - The FASTA title is provided as sequence accession.version plus description.

- _Not used here, but can be referenced in our CSV file_

In [3]:
import pandas as pd
import re

In [4]:
inputpath = "/home/savandara/Documents/Internship/2017_master_internship/genomeDownloader/curated_insectGenomesInfos.csv"
df = pd.read_csv(inputpath)
dfi = df.sort_values(["Order id", "Family id"], ascending=[False, True])
dfi = dfi[['Order name', 'Family name', 'Species name', 'Accession number']]
dfi

Unnamed: 0,Order name,Family name,Species name,Accession number
229,Diplura,Japygidae,Catajapyx aquilonaris,GCA_000934665.1
160,Collembola,Entomobryidae,Orchesella cincta,GCA_001718145.1
42,Blattodea,Termopsidae,Zootermopsis nevadensis,GCA_000696155.1
62,Blattodea,Termitidae,Nasutitermes exitiosus,GCA_001404035.1
215,Blattodea,Ectobiidae,Blattella germanica,GCA_000762945.1
204,Phthiraptera,Pediculidae,Pediculus humanus corporis,GCA_000006295.1
180,Plecoptera,Perlodidae,Isoperla grammatica,GCA_001676475.1
48,Plecoptera,Nemouridae,Amphinemura sulcicollis,GCA_001676325.1
77,Trichoptera,Limnephilidae,Limnephilus lunatus,GCA_000648945.1
165,Thysanoptera,Thripidae,Frankliniella occidentalis,GCA_000697945.1


In [5]:
nameFolder_list = []

for genome in dfi['Accession number']:
    # Capture the 3-letter prefix and the 9-digit identifier,
    # e.g. 'GCA_000934665.1' -> ('GCA', '000934665')
    prefix, digits = re.findall(r'^([A-Z]{3})_([0-9]{9})', genome)[0]
    nameFolder_list.append(prefix + digits)
    
# nameFolder_list
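
As a standalone illustration of what the loop above computes, the sketch below applies the same pattern to one accession number from the table. The helper name `folder_name` is hypothetical; unlike the loop, it raises a clear error when an accession does not match instead of failing on an empty `findall` result.

```python
import re

# Same pattern as in the loop above: 3 capital letters, underscore, 9 digits
ACCESSION_RE = re.compile(r'^([A-Z]{3})_([0-9]{9})')


def folder_name(accession):
    """Turn 'GCA_000934665.1' into 'GCA000934665' (prefix + digits, version dropped)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError("Unexpected accession format: %r" % accession)
    return match.group(1) + match.group(2)


print(folder_name("GCA_000934665.1"))  # GCA000934665
```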

In [6]:
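# Use of $ in a regular expression: it anchors the match at the end of the string.
# The dots are escaped so that they match a literal '.' and not any character.
ex = 'blablabla*_genomic.gbff.gz'

rep = re.findall(r'(_genomic\.gbff\.gz)$', ex)
rep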

['_genomic.gbff.gz']
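
Building on the `$` anchor, the sketch below shows one way the Yes/No fields used in the next cells could be derived from the file names found in an assembly folder. This is an assumption about how such records might be built, not the method actually used here: the mapping `SUFFIX_TO_FIELD`, the function `summarize_files`, and the example file names are all hypothetical.

```python
import re

# Hypothetical mapping from file-name suffix to the column it fills
SUFFIX_TO_FIELD = {
    r'_protein\.faa\.gz$': 'Protein annotations',
    r'_genomic\.gbff\.gz$': 'GeneBank annotations',
    r'_genomic\.gff\.gz$': 'GFF3 annotations',
    r'_genomic\.fna\.gz$': 'Genomic fasta file',
    r'_rna\.fna\.gz$': 'RNA fasta file',
}


def summarize_files(filenames):
    """Return a dict of 'Yes'/'No' flags, one per expected file type."""
    summary = {field: 'No' for field in SUFFIX_TO_FIELD.values()}
    for name in filenames:
        # Files such as *_rna_from_genomic.fna.gz also end in _genomic.fna.gz,
        # so they are skipped to avoid a false 'Yes' for the genomic FASTA.
        if '_from_genomic' in name:
            continue
        for pattern, field in SUFFIX_TO_FIELD.items():
            if re.search(pattern, name):
                summary[field] = 'Yes'
    return summary


# Placeholder file names for illustration
print(summarize_files(['x_genomic.gbff.gz', 'x_genomic.gff.gz']))
```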

In [7]:
infos = [
    {'GCA_000472065.2_Anop_sine_SINENSIS_V1': {
        'Protein annotations': 'No', 'TO DO': 'Use Augustus',
        'Accession number': 'GCA_000472065.2', 'RNA fasta file': 'No',
        'GeneBank annotations': 'Yes', 'Genomic fasta file': 'No',
        'GFF3 annotations': 'Yes'}},
    {'GCA_001014415.1_ASM101441v1': {
        'Protein annotations': 'No', 'TO DO': 'Use Augustus',
        'Accession number': 'GCA_001014415.1', 'RNA fasta file': 'No',
        'GeneBank annotations': 'Yes', 'Genomic fasta file': 'No',
        'GFF3 annotations': 'Yes'}},
    {'GCA_001676475.1_iso_kmer61.contig': {
        'Protein annotations': 'No', 'TO DO': 'Use Augustus',
        'Accession number': 'GCA_001676475.1', 'RNA fasta file': 'No',
        'GeneBank annotations': 'Yes', 'Genomic fasta file': 'No',
        'GFF3 annotations': 'Yes'}},
]


In [8]:
# Flatten the list of single-key dicts into a list of records
infos_df = [record for Dict in infos for record in Dict.values()]
infos_df

[{'Accession number': 'GCA_000472065.2',
  'GFF3 annotations': 'Yes',
  'GeneBank annotations': 'Yes',
  'Genomic fasta file': 'No',
  'Protein annotations': 'No',
  'RNA fasta file': 'No',
  'TO DO': 'Use Augustus'},
 {'Accession number': 'GCA_001014415.1',
  'GFF3 annotations': 'Yes',
  'GeneBank annotations': 'Yes',
  'Genomic fasta file': 'No',
  'Protein annotations': 'No',
  'RNA fasta file': 'No',
  'TO DO': 'Use Augustus'},
 {'Accession number': 'GCA_001676475.1',
  'GFF3 annotations': 'Yes',
  'GeneBank annotations': 'Yes',
  'Genomic fasta file': 'No',
  'Protein annotations': 'No',
  'RNA fasta file': 'No',
  'TO DO': 'Use Augustus'}]

In [9]:
dfm = pd.DataFrame(infos_df)
dfm = dfm[['Accession number','Protein annotations','GeneBank annotations','GFF3 annotations','Genomic fasta file','RNA fasta file','TO DO']]
dfm

Unnamed: 0,Accession number,Protein annotations,GeneBank annotations,GFF3 annotations,Genomic fasta file,RNA fasta file,TO DO
0,GCA_000472065.2,No,Yes,Yes,No,No,Use Augustus
1,GCA_001014415.1,No,Yes,Yes,No,No,Use Augustus
2,GCA_001676475.1,No,Yes,Yes,No,No,Use Augustus


In [10]:
# Inner join on the accession number; the remaining arguments were pandas defaults
dff = pd.merge(dfi, dfm, how='inner', on='Accession number')

In [11]:
dff = dff.set_index('Species name')

In [12]:
dff

Species name,Order name,Family name,Accession number,Protein annotations,GeneBank annotations,GFF3 annotations,Genomic fasta file,RNA fasta file,TO DO
Isoperla grammatica,Plecoptera,Perlodidae,GCA_001676475.1,No,Yes,Yes,No,No,Use Augustus
Anopheles sinensis,Diptera,Culicidae,GCA_000472065.2,No,Yes,Yes,No,No,Use Augustus
Phortica variegata,Diptera,Drosophilidae,GCA_001014415.1,No,Yes,Yes,No,No,Use Augustus


In [25]:
print("Species name,")
for column in dff.columns:
    if column == "TO DO":
        print(column,'\n')
    else :
        print(column,',')

Species name,
Order name ,
Family name ,
Accession number ,
Protein annotations ,
GeneBank annotations ,
GFF3 annotations ,
Genomic fasta file ,
RNA fasta file ,
TO DO 



In [26]:
for species in dff.index:
    print(species, ',')
    for column in dff.columns:
        # .loc avoids chained indexing (dff[column][species])
        if column == 'TO DO':
            print(dff.loc[species, column],'\n')
        else:
            print(dff.loc[species, column],',')

Isoperla grammatica ,
Plecoptera ,
Perlodidae ,
GCA_001676475.1 ,
No ,
Yes ,
Yes ,
No ,
No ,
Use Augustus 

Anopheles sinensis ,
Diptera ,
Culicidae ,
GCA_000472065.2 ,
No ,
Yes ,
Yes ,
No ,
No ,
Use Augustus 

Phortica variegata ,
Diptera ,
Drosophilidae ,
GCA_001014415.1 ,
No ,
Yes ,
Yes ,
No ,
No ,
Use Augustus 
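
The two loops above print one field per line. If the goal is a regular CSV with one line per species, a sketch using only the standard library could look like this. The helper `rows_to_csv` is hypothetical, and the record below copies only a few columns from the table for brevity. With pandas, calling `dff.to_csv()` on the DataFrame should give a comparable result.

```python
import csv
import io


def rows_to_csv(records, columns):
    """Serialize a list of dicts to CSV text, one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# One row copied from the table above (subset of columns)
records = [
    {
        "Species name": "Isoperla grammatica",
        "Order name": "Plecoptera",
        "Accession number": "GCA_001676475.1",
        "TO DO": "Use Augustus",
    }
]
print(rows_to_csv(records, ["Species name", "Order name", "Accession number", "TO DO"]))
```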

