How to download genomes using the accession number

Asad Prodhan PhD

https://asadprodhan.github.io/

Step 1: Collect the accession numbers of your interest

collect the assembly summary report for your organism of interest from the NCBI RefSeq Index. For example, the assembly summary report for Bacteria can be obtained as follows:
```
wget ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
```
For other organisms, navigate to the assembly summary report starting from the ‘Index of /genomes/refseq’ as shown in Fig. 1:

Figure 1. Index of the genomes in the RefSeq

Filter out your targeted genomes from the assembly report. For example, all species of Pseudomonas can be extracted from the bacterial assembly report as follows:
```
#!/bin/bash
awk -F '\t' '{if($8 ~ /Pseudomonas/) print $1","$2","$3","$5","$8","$11","$12","$14","$15","$16","$20}' assembly_summary.txt > assembly_summary_complete_genomes_Pseudomonas.txt
```
Column 8 ($8) in the assembly report contains the name of the species. ‘~ /Pseudomonas/’ will extract only the Pseudomonas species Here, we are extracting Pseudomonas species along with other metadata in different columns of the assembly report.

Column 1 ($1): # assembly_accession

Column 2 ($2): bioproject ID

Column 3 ($3): biosample ID

Column 5 ($5): refseq_category, is it a representative genome? representative genome are quality-checked by RefSeq team

Column 8 ($8): organism_name

Column 11 ($11): version_status, is it latest?

Column 12 ($12): assembly_level, complete genome, scaffold or contig

Column 14 ($14): genome_rep, full? or partial?

Column 15 ($15): seq_rel_date, release date

Column 16 ($16): asm_name, assembly name

Column 20 ($20): ftp_path, the download link (however, the links, as they appear here, do not download the files, the links need to be amended in the following step to get them download-ready)
Make a csv file with the accession numbers only and name it as ‘accession_list.csv’ (Fig. 2)

Figure 2. Accession list

Step 2: Install ‘NCBI Datasets’ tool

Create a conda environment
```
conda create -n ncbi_datasets
```
Activate ncbi_datasets
```
conda activate ncbi_datasets
```

Install ncbi-datasets package

conda install -c conda-forge ncbi-datasets-cli

Ref: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

Step 3: Download the genomes

Make a directory and name it “Downloads”
cd to ‘Downloads’ directory
Keep the following script in the ‘Downloads’ directory

The script

    ```
    #!/bin/bash

    #metadata
    metadata=./*.csv
    #
    Red="$(tput setaf 1)"
    Green="$(tput setaf 2)"
    Bold=$(tput bold)
    reset=`tput sgr0` # turns off all atribute
    while IFS=, read -r field1  

    do 
        echo ""
        echo "${Red}${Bold}Downloading ${reset}: "${field1}"" 
        datasets download genome accession "${field1}" --filename "${field1}".zip
        echo "${Bold}Extracting "${field1}.zip" ${reset}"
        unzip "${field1}.zip" 
        cd "ncbi_dataset/data/${field1}" 
        echo "${Bold}Moving "${field1}" fasta file into home directory${reset}"
        mv *.fna ../../../
        cd "../../../"
        rm -r "${field1}".zip ncbi_dataset *.md  
        echo "${Green}${Bold}Download_completed ${reset}: ${field1}" 
        echo ""
    done < ${metadata}

    ```

Keep the ‘accession_list.csv’ file (Fig. 2) in the ‘Downloads’ directory
Check the file type of ‘accession_list.csv’
```
file accession_list.csv
```
- If it is ‘ASCII text, with CRLF line terminators’ i.e., Windows file type; then convert it to ‘Unix’ format as follows:
```
dos2unix accession_list.csv
```
Run the following script as follows:
```
./name_of_the_script.sh
```

The script will be automatically downloading the genomic sequences (Fig. 3).

Figure 3. The script is automatically downloading the genomic sequences

Output files

Figure 4. An image showing the output files with their file names and headers.

Note: The script can download the entire BioProject by replacing the accession number by the BioProject number. Downloading using the BioProject number will automatically download all the associated data and metadata.

However, if the file names and headers (Fig. 4) are too big to deal with, they can be shortented (Fig. 5) by replacing the above script by the following script:

The script that will retrieve the genomic data and shorten names and headers

#!/bin/bash

#metadata
metadata=./*.csv
#
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
Bold=$(tput bold)
reset=`tput sgr0` # turns off all atribute
while IFS=, read -r field1  

do 
    echo ""
    echo "${Red}${Bold}Downloading ${reset}: "${field1}"" 
    datasets download genome accession "${field1}" --filename "${field1}".zip
    echo "${Bold}Extracting "${field1}.zip" ${reset}"
    unzip "${field1}.zip" 
    cd "ncbi_dataset/data/${field1}"
    echo "${Bold}Renaming "${field1}" fasta file${reset}"
    mv *.fna "${field1}.fasta"
    echo "${Bold}Shortening the "${field1}" fasta file header${reset}"
    for fasta in *.fasta; 
    do
        cut -f 1 -d " " $fasta > ${fasta%.*}.temp;
        mv ${fasta%.*}.temp $fasta
    done
    echo "${Bold}Moving "${field1}.fasta" into home directory${reset}"
    mv "${field1}.fasta" ../../../
    cd "../../../"
    rm -r "${field1}".zip ncbi_dataset *.md  
    echo "${Green}${Bold}Download_completed ${reset}: ${field1}" 
    echo ""
done < ${metadata}

Output files with SHORTENED names and headers

Figure 5. An image showing the output files with their **SHORTENED** file names and headers

Name	Name	Last commit message	Last commit date
Latest commit asadprodhan Add files via upload Apr 20, 2023 1932bdd · Apr 20, 2023 History 22 Commits
FileName_and_Header_After.PNG	FileName_and_Header_After.PNG	Add files via upload	Apr 20, 2023
FileName_and_Header_Before.PNG	FileName_and_Header_Before.PNG	Add files via upload	Apr 20, 2023
Index_NCBI.PNG	Index_NCBI.PNG	Add files via upload	Apr 19, 2023
Output.PNG	Output.PNG	Add files via upload	Apr 19, 2023
README.md	README.md	Update README.md	Apr 20, 2023
accession_list.PNG	accession_list.PNG	Add files via upload	Apr 20, 2023
accession_list.csv	accession_list.csv	Add files via upload	Apr 19, 2023
ncbi_download.sh	ncbi_download.sh	Add files via upload	Apr 19, 2023
ncbi_download_fileName_header_shortened.sh	ncbi_download_fileName_header_shortened.sh	Add files via upload	Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How to download genomes using the accession number

Asad Prodhan PhD

Step 1: Collect the accession numbers of your interest

Step 2: Install ‘NCBI Datasets’ tool

Step 3: Download the genomes

The script

Output files

The script that will retrieve the genomic data and shorten names and headers

Output files with SHORTENED names and headers

Now, you have downloaded the genomic sequences of all the accessions in the list.

About

Releases

Packages

Languages

asadprodhan/How-to-download-genomes-using-the-accession-number

Folders and files

Latest commit

History

Repository files navigation

How to download genomes using the accession number

Asad Prodhan PhD

Step 1: Collect the accession numbers of your interest

Step 2: Install ‘NCBI Datasets’ tool

Step 3: Download the genomes

The script

Output files

The script that will retrieve the genomic data and shorten names and headers

Output files with SHORTENED names and headers

Now, you have downloaded the genomic sequences of all the accessions in the list.

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages