<h1 style="font-family: 'Times New Roman'; text-align: center; font-weight: bold;">
    Data Preprocessing
</h1>

<div style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
  <div>
    <span style="display: inline-block; width: 100px;"><b>Date</b></span>: 4<sup>th</sup> October 2024
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Author</b></span>: Deepan Kanagarajan Babu
  </div>
  <div>
    <span style="display: inline-block; width: 100px;"><b>Description</b></span>: In this document, the metagenome-assembled genomes (MAGs) from the rice leaf phyllosphere are preprocessed for BLAST, RAST, and CoverM analysis. The provided data are <a href="https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-44335-3/MediaObjects/41467_2023_44335_MOESM4_ESM.xlsx">Supplementary Data 1</a>, <a href="https://static-content.springer.com/esm/art%3A10.1038%2Fs41597-022-01320-7/MediaObjects/41597_2022_1320_MOESM2_ESM.xlsx">Supplementary Data 2</a> (containing the MAG details), the MAGs, and the CoxL protein sequences. Supplementary Data 1 describes the rice varieties used for metagenome sequencing. Supplementary Data 2 describes the MAGs and their assembly, including taxonomic classification, BioSample ID, SRA accession, Sample ID, and scientific name. Supplementary Data 1 is from the article <a href="https://doi.org/10.1038/s41467-023-44335-3">"Microbiome homeostasis on rice leaves is regulated by a precursor molecule of lignin biosynthesis"</a>, and Supplementary Data 2 is from the article <a href="https://doi.org/10.1038/s41597-022-01320-7">"Recovery of metagenome-assembled genomes from the phyllosphere of 110 rice genotypes"</a>.
  </div>
</div>


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Libraries
</h2>

In [3]:
# importing necessary libraries
import os
import shutil
import pandas as pd
import numpy as np
from Bio import SeqIO
import subprocess

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Functions
</h2>

In [6]:
# Defining single function to open multiple file formats (.csv, .txt, .fasta or .fastq (to display only 1st 100 lines), and .xlsx)
def open_file(file_path):
    if file_path.endswith('.csv'):
        # Read CSV file
        data = pd.read_csv(file_path)
        return data
    elif file_path.endswith('.txt'):
        # Read TXT file
        try:
            with open(file_path, 'r') as file:
                data = file.read()
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith(('.fasta', '.fastq')):
        # Read FASTA or FASTQ file
        try:
            with open(file_path, 'r') as file:
                data = []
                for _ in range(100):
                    line = file.readline()
                    if not line:
                        break
                    data.append(line.strip())
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    elif file_path.endswith('.xlsx'):
        # Read XLSX file
        try:
            data = pd.read_excel(file_path)
            return data
        except FileNotFoundError:
            raise FileNotFoundError(f"The file {file_path} does not exist.")
    else:
        raise ValueError("Unsupported file format")

In [8]:
# Defining function to list files from the folder
def list_file(folder_path):
    # List all files in the folder
    files = os.listdir(folder_path)
    
    # Check if there are any files in the folder
    if not files:
        return None
    
    # Return the list of all files in the folder
    return files

# Defining function to count the number of files in a folder by extension
def count_files(folder_path, extensions):
    # List all files in the given folder
    files = os.listdir(folder_path)
    # Initialize a dictionary to hold counts for each extension
    counts = {ext: 0 for ext in extensions}
    # Count files for each extension
    for file in files:
        for ext in extensions:
            if file.endswith(ext):
                counts[ext] += 1
    return counts

# List of possible extensions to count
extensions = ['.docx', '.csv', '.fasta', '.txt', '.fastq']
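
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    As a quick check of the counting logic, the sketch below runs an equivalent counter on a temporary folder. The helper name <code>count_by_extension</code> and the file names are made up for illustration and are not part of the project data.
</p>

```python
import os
import tempfile


def count_by_extension(folder_path, extensions):
    """Count files in folder_path whose names end with each given extension."""
    counts = {ext: 0 for ext in extensions}
    for name in os.listdir(folder_path):
        for ext in extensions:
            if name.endswith(ext):
                counts[ext] += 1
    return counts


# Build a throwaway folder with a few empty files (hypothetical names)
demo_dir = tempfile.mkdtemp()
for name in ["a.fasta", "b.fasta", "notes.txt", "table.csv"]:
    open(os.path.join(demo_dir, name), "w").close()

demo_counts = count_by_extension(demo_dir, [".fasta", ".txt", ".csv", ".fastq"])
print(demo_counts)
```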

In [10]:
# Defining function to create database
def create_db(input_file, output_folder, db_name):
    # Create the new folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    
    # Define the output database path
    output_db = os.path.join(output_folder, db_name)
    
    # Run the makeblastdb command; check=True raises an error if the command fails
    subprocess.run([
        "makeblastdb",
        "-in", input_file,
        "-dbtype", "prot",
        "-out", output_db
    ], check=True)
    
    print(f"Database created at {output_db}")
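
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    If makeblastdb is not installed, the call above fails with an error that can be hard to read in a notebook. The sketch below is one possible variant, not the function used in this analysis: it checks that the tool is on PATH before running and separates command construction so the arguments can be inspected. The names <code>build_makeblastdb_command</code> and <code>create_db_checked</code> and the placeholder paths are assumptions.
</p>

```python
import os
import shutil
import subprocess


def build_makeblastdb_command(input_file, output_folder, db_name, db_type="prot"):
    """Return the makeblastdb argument list for a database in output_folder."""
    output_db = os.path.join(output_folder, db_name)
    return ["makeblastdb", "-in", input_file, "-dbtype", db_type, "-out", output_db]


def create_db_checked(input_file, output_folder, db_name):
    """Run makeblastdb and raise if the tool is missing or exits with an error."""
    if shutil.which("makeblastdb") is None:
        raise RuntimeError("makeblastdb was not found on PATH")
    os.makedirs(output_folder, exist_ok=True)
    command = build_makeblastdb_command(input_file, output_folder, db_name)
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(command, check=True)
    return command[-1]


# Inspect the command without running BLAST (paths are placeholders)
demo_command = build_makeblastdb_command("coxl1.fasta", "db_out", "coxl1_db")
print(demo_command)
```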

In [12]:
def remove_gaps_from_fasta(input_file, output_file):
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            if line.startswith('>'):
                outfile.write(line)
            else:
                outfile.write(line.replace('-', ''))
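
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The gap removal only touches sequence lines, so hyphens in header lines are preserved. A small demonstration on two made-up aligned records is sketched below; <code>strip_gaps</code> mirrors the function above under a different name so the example runs on its own.
</p>

```python
import os
import tempfile


def strip_gaps(input_file, output_file):
    """Copy a FASTA file, removing '-' gap characters from sequence lines only."""
    with open(input_file) as infile, open(output_file, "w") as outfile:
        for line in infile:
            if line.startswith(">"):
                outfile.write(line)  # header lines are kept unchanged
            else:
                outfile.write(line.replace("-", ""))


# Two made-up aligned protein records
demo_dir = tempfile.mkdtemp()
aligned = os.path.join(demo_dir, "aligned.fasta")
ungapped = os.path.join(demo_dir, "ungapped.fasta")
with open(aligned, "w") as handle:
    handle.write(">seq-1 example\nMK--TAY\n>seq-2 example\nM-KLT-Y\n")

strip_gaps(aligned, ungapped)
with open(ungapped) as handle:
    result = handle.read()
print(result)
```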

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Required Directory Paths
</h2>

In [15]:
# Specify the folder path containing unprocessed MAG files
## Raw_Data folder where all the files are located
raw_data = "C:/Users/dkb1g24/Documents/mres/Raw_Data"

## Path to the unprocessed MAGs files
unprocessed_mag = "C:/Users/dkb1g24/Documents/mres/Raw_Data/MAG_unprocessed"

## Path to Rice cultivars details excel file
rice_path = "C:/Users/dkb1g24/Documents/mres/Raw_Data/rice_cultivar_data.xlsx"

## Path to the metagenome-assembled genomes (MAGs) details CSV file
mag_file_path = "C:/Users/dkb1g24/Documents/mres/Raw_Data/mag_data.csv"

## Path to CoxL protein sequences FASTA file
coxl_fasta_path = "C:/Users/dkb1g24/Documents/mres/Raw_Data/CoxL_sequences.fa"

## Path to the new SRA accession file
sra_path = "C:/Users/dkb1g24/Documents/mres/Raw_Data/SRA.xlsx"

# Create output directories
## Directory to save the processed MAG files
mags = "C:/Users/dkb1g24/Documents/mres/BLAST/MAG"
## Ensure the output directory exists
os.makedirs(mags, exist_ok=True)

## Define coxl db output folders
coxl1_db_folder = "C:/Users/dkb1g24/Documents/mres/BLAST/CoxL1" # for Form 1 proteins
coxl2_db_folder = "C:/Users/dkb1g24/Documents/mres/BLAST/CoxL2" # for Form 2 proteins

## Ensure the output directories exist
os.makedirs(coxl1_db_folder, exist_ok=True)
os.makedirs(coxl2_db_folder, exist_ok=True)

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
    Analysis and Preprocessing
</h2>

In [18]:
# List the files in the raw unprocessed MAG files
mag_files_list = list_file(unprocessed_mag)
print(mag_files_list)

['maxbin_SP106_1.016_cleaned.fasta', 'maxbin_SP106_1.023_cleaned.fasta', 'maxbin_SP106_2.001.fasta', 'maxbin_SP106_3.024.fasta', 'maxbin_SP108_3.007_cleaned.fasta', 'maxbin_SP112_1.001_cleaned.fasta', 'maxbin_SP118_2.002.fasta', 'maxbin_SP118_2.009.fasta', 'maxbin_SP118_3.007_cleaned.fasta', 'maxbin_SP119_2.004.fasta', 'maxbin_SP122_1.006_cleaned.fasta', 'maxbin_SP122_2.006_cleaned.fasta', 'maxbin_SP123_3.025_sub_cleaned.fasta', 'maxbin_SP128_2.014.fasta', 'maxbin_SP128_2.029_sub_cleaned.fasta', 'maxbin_SP128_3.008.fasta', 'maxbin_SP133_1.009.fasta', 'maxbin_SP133_2.005.fasta', 'maxbin_SP133_3.007_cleaned.fasta', 'maxbin_SP135_3.009_sub_cleaned.fasta', 'maxbin_SP138_3.003.fasta', 'maxbin_SP139_2.003_cleaned.fasta', 'maxbin_SP139_3.013.fasta', 'maxbin_SP143_1.006.fasta', 'maxbin_SP149_1.012_cleaned.fasta', 'maxbin_SP149_2.004_cleaned.fasta', 'maxbin_SP149_3.011_sub_cleaned.fasta', 'maxbin_SP151_2.002_cleaned.fasta', 'maxbin_SP154_3.006_sub_cleaned.fasta', 'maxbin_SP166_2.026_cleaned.fas

In [20]:
# Check opening one of the fasta files
## Specify the fasta file path
fasta_files_path = os.path.join(unprocessed_mag, "maxbin_SP8_2.014.fasta")

## Open the first 100 lines of the fasta file
open_file(fasta_files_path)

['>Genome437_k127_93520 flag=1 multi=4.0000 len=1848',
 'TTTATGGATCAGGGGCGTATAGTTGAGCAGGGGCTGGCAAAAGCGTTATTCGCCGACCCGCAACAGCCGC',
 'GTACCCGACAGTTCCTGGAAAAATTCCTTCTTCAGTAGAATCTTTGTTCTTTTTTGCTCCAGAAACATTA',
 'TCTGGAGCAAATAATTCCCCCTGAAAGTCTTTACATATATTGTTTTATTCTGTCTGTATTGCTGTAACCA',
 'CATAAATTCTGTACAAATTCCCCCTTGTGGTCGATATTTTATACATATAAATAAGATTTATGTGTTAGTT',
 'TAATCAATCCATAATAATGATTATCTTTTACAGGGGCTTCATTCCGTAAGTATGAAGGATTCAGACTTTT',
 'TCACATGGCGACGGGATTGCTATCTTCGGTTTCAGGAAATGACTTCGGCCGATGAGGTATATTCCGAACT',
 'TCTGCGACAAACACAGGCACTAGAGTTCGATTATTTTTCGCTTTGTGTCCGTCATCCCGTGCCGTTTACC',
 'CGGCCTAAACTCTCATTGCAGAGTAATTATCCGGCACAATGGCTATCGCATTATCAGGCGGAGAATTATT',
 'TCGCCATCGATCCGGTACTCAAACCAGAGAATTTTGTGCACGGCCACCTACCCTGGAATGACGCGTTGTT',
 'TGCCGATGCGCAACCCTTATGGGATGGCGCGCGGGAGCATGGGCTGCGCAAGGGAATTACCCAGTGCCTG',
 'ATGTTACCGAATCATGCGCTCGGTTTTCTGTCCGTCTCCAGCACGGTGCAAAGCATCAATAAGTTCAGCA',
 'AAGAAGAGCTGGAGTTGCGTCTGCAAATGCTGGTGCAGATGGCCCTGACCACGCTGTTGCGTCTGGAACA',
 'TGAAATGGTGATGCCGCCTGAGATGAAATTCAGCAAGCGCGAG

In [22]:
# Load MAGs CSV file
mag_df = open_file(mag_file_path)
mag_df.head()

Unnamed: 0,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
0,S1C35295_SP282_cleaned,ERS6626560,SAMEA8943724,114707,uncultured Pseudomonas sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,0,vamb,metagenome-assembled genome,default
1,S1C41259_SP48_cleaned,ERS6626561,SAMEA8943725,158751,uncultured Acidovorax sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,0,vamb,metagenome-assembled genome,default
2,S1C20416_SP29_cleaned,ERS6626562,SAMEA8943726,331964,uncultured Curtobacterium sp.,Illumina NovaSeq 6000,Megahit,98,CheckM,0,vamb,metagenome-assembled genome,default
3,S1C62909_SP162_cleaned,ERS6626563,SAMEA8943727,114707,uncultured Pseudomonas sp.,Illumina NovaSeq 6000,Megahit,91,CheckM,1,vamb,metagenome-assembled genome,default
4,S2C18454_SP60_cleaned,ERS6626564,SAMEA8943728,165433,uncultured Acinetobacter sp.,Illumina NovaSeq 6000,Megahit,91,CheckM,1,vamb,metagenome-assembled genome,default


In [24]:
# Create a .csv file listing the MAG fasta file names
## List to store file names
fasta_names = []

## Append file names
for file_name in os.listdir(unprocessed_mag):
    if file_name.endswith('.fasta'):
        # Keep the full file name; the extension is removed in a later cell
        fasta_names.append(file_name)

## Create csv file for file names
fasta_list = pd.DataFrame({'file_name': fasta_names})

## Save the DataFrame to a csv file
fasta_out_list = 'fasta_list.csv'
fasta_list.to_csv(fasta_out_list, index=False)

print("All done")

All done


In [26]:
# Open the fasta list .csv file
fasta_files = open_file('fasta_list.csv')
fasta_files.head()

Unnamed: 0,file_name
0,maxbin_SP106_1.016_cleaned.fasta
1,maxbin_SP106_1.023_cleaned.fasta
2,maxbin_SP106_2.001.fasta
3,maxbin_SP106_3.024.fasta
4,maxbin_SP108_3.007_cleaned.fasta


In [28]:
# Remove the extension and save the IDs to the ID column
fasta_files['ID'] = fasta_files.iloc[:,0].str[:-6] # the last 6 characters of the file name are the '.fasta' extension
fasta_files.head()

Unnamed: 0,file_name,ID
0,maxbin_SP106_1.016_cleaned.fasta,maxbin_SP106_1.016_cleaned
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024
4,maxbin_SP108_3.007_cleaned.fasta,maxbin_SP108_3.007_cleaned
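
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    Slicing off a fixed six characters works because every file ends in '.fasta'. As an alternative sketch that does not depend on the extension length, <code>os.path.splitext</code> can be used; it splits at the last dot only, so the dots inside the MAG IDs are kept.
</p>

```python
import os

# File names copied from the listing above
file_names = [
    "maxbin_SP106_1.016_cleaned.fasta",
    "maxbin_SP106_2.001.fasta",
]

# splitext splits at the last dot only, so dots inside the ID are preserved
ids = [os.path.splitext(name)[0] for name in file_names]
print(ids)
```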


In [30]:
# Saving the modification to fasta_list.csv
fasta_files.to_csv('fasta_list.csv', index=False)

In [32]:
# Opening fasta file name list
fasta_files = open_file('fasta_list.csv')
fasta_files.head()

Unnamed: 0,file_name,ID
0,maxbin_SP106_1.016_cleaned.fasta,maxbin_SP106_1.016_cleaned
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024
4,maxbin_SP108_3.007_cleaned.fasta,maxbin_SP108_3.007_cleaned


In [34]:
# Checking the MAGs CSV file
mag_df.head()

Unnamed: 0,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
0,S1C35295_SP282_cleaned,ERS6626560,SAMEA8943724,114707,uncultured Pseudomonas sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,0,vamb,metagenome-assembled genome,default
1,S1C41259_SP48_cleaned,ERS6626561,SAMEA8943725,158751,uncultured Acidovorax sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,0,vamb,metagenome-assembled genome,default
2,S1C20416_SP29_cleaned,ERS6626562,SAMEA8943726,331964,uncultured Curtobacterium sp.,Illumina NovaSeq 6000,Megahit,98,CheckM,0,vamb,metagenome-assembled genome,default
3,S1C62909_SP162_cleaned,ERS6626563,SAMEA8943727,114707,uncultured Pseudomonas sp.,Illumina NovaSeq 6000,Megahit,91,CheckM,1,vamb,metagenome-assembled genome,default
4,S2C18454_SP60_cleaned,ERS6626564,SAMEA8943728,165433,uncultured Acinetobacter sp.,Illumina NovaSeq 6000,Megahit,91,CheckM,1,vamb,metagenome-assembled genome,default


In [36]:
# Merging the data frames with matching values of ID from fasta file name and Sample_ID from MAGs CSV file
merged_df = pd.merge(fasta_files, mag_df, left_on='ID', right_on='Sample_ID')

# Saving the merged file as 'mag_data_v2.csv' (version 2) file
output_file = os.path.join(raw_data, 'mag_data_v2.csv')
merged_df.to_csv(output_file, index=False)

# Deleting the unwanted variable fasta_files and .csv file fasta_list.csv
del fasta_files
os.remove('fasta_list.csv')

# To verify that the steps are done
print("Merged and Removed")

Merged and Removed
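
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The default inner merge silently drops rows whose keys have no partner in the other table. One way to see which keys would be lost is an outer merge with <code>indicator=True</code>, sketched below on toy tables; the values are made up and only the column names follow the notebook.
</p>

```python
import pandas as pd

# Toy tables standing in for fasta_files and mag_df (values are made up)
files = pd.DataFrame({"ID": ["bin_A", "bin_B", "bin_C"]})
meta = pd.DataFrame({"Sample_ID": ["bin_A", "bin_B", "bin_D"],
                     "SRA_accession": ["ACC1", "ACC2", "ACC4"]})

# An outer merge with indicator=True labels each row by where its key was found
check = pd.merge(files, meta, left_on="ID", right_on="Sample_ID",
                 how="outer", indicator=True)

files_without_metadata = check.loc[check["_merge"] == "left_only", "ID"].tolist()
metadata_without_file = check.loc[check["_merge"] == "right_only", "Sample_ID"].tolist()
print(files_without_metadata, metadata_without_file)
```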


In [38]:
# Opening the MAGs CSV file version 2
mag_df2 = open_file(output_file)
print(mag_df2.shape) # Checking the data frame shape to verify the number of MAGs
mag_df2.head()

(569, 15)


Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
0,maxbin_SP106_1.016_cleaned.fasta,maxbin_SP106_1.016_cleaned,maxbin_SP106_1.016_cleaned,ERS6626984,SAMEA8944148,157277,uncultured Agrobacterium sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,6,maxbin,metagenome-assembled genome,default
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned,maxbin_SP106_1.023_cleaned,ERS6626983,SAMEA8944147,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,1,maxbin,metagenome-assembled genome,default
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001,maxbin_SP106_2.001,ERS6626844,SAMEA8944008,227322,uncultured Paenibacillus sp.,Illumina NovaSeq 6000,Megahit,100,CheckM,1,maxbin,metagenome-assembled genome,default
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024,maxbin_SP106_3.024,ERS6626982,SAMEA8944146,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,71,CheckM,2,maxbin,metagenome-assembled genome,default
4,maxbin_SP108_3.007_cleaned.fasta,maxbin_SP108_3.007_cleaned,maxbin_SP108_3.007_cleaned,ERS6626962,SAMEA8944126,797541,uncultured Mucilaginibacter sp.,Illumina NovaSeq 6000,Megahit,63,CheckM,7,maxbin,metagenome-assembled genome,default


In [40]:
# Comparing the ID column with Sample_ID to verify the file names are assigned to the right organism
compare = mag_df2['ID'] == mag_df2['Sample_ID']

# Verifying all the matches are true and no mismatch is found
compare.unique()

array([ True])

In [42]:
# Cross-checking that the length of the comparison equals the number of MAG files
len(compare)

569

In [44]:
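# Dropping the extra sample ID column
mag_df2 = mag_df2.drop(columns=["ID"])

# Saving the updated data frame as 'mag_data_v2.csv'
# Note: this relative path writes to the current working directory, not to output_file in raw_data
mag_df2.to_csv('mag_data_v2.csv', index = False)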
print("Updated") # For confirmation

Updated


In [46]:
# Checking the updated .csv file
mag_df2 = open_file(output_file)
mag_df2.head()

Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
0,maxbin_SP106_1.016_cleaned.fasta,maxbin_SP106_1.016_cleaned,maxbin_SP106_1.016_cleaned,ERS6626984,SAMEA8944148,157277,uncultured Agrobacterium sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,6,maxbin,metagenome-assembled genome,default
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned,maxbin_SP106_1.023_cleaned,ERS6626983,SAMEA8944147,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,1,maxbin,metagenome-assembled genome,default
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001,maxbin_SP106_2.001,ERS6626844,SAMEA8944008,227322,uncultured Paenibacillus sp.,Illumina NovaSeq 6000,Megahit,100,CheckM,1,maxbin,metagenome-assembled genome,default
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024,maxbin_SP106_3.024,ERS6626982,SAMEA8944146,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,71,CheckM,2,maxbin,metagenome-assembled genome,default
4,maxbin_SP108_3.007_cleaned.fasta,maxbin_SP108_3.007_cleaned,maxbin_SP108_3.007_cleaned,ERS6626962,SAMEA8944126,797541,uncultured Mucilaginibacter sp.,Illumina NovaSeq 6000,Megahit,63,CheckM,7,maxbin,metagenome-assembled genome,default
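
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The preview above still lists the ID column because the previous cell saved to the relative name 'mag_data_v2.csv', while this cell reloads the file at <code>output_file</code>. A minimal sketch of dropping a column and writing back to the same full path is shown below, using a toy table in a temporary folder; the values are made up.
</p>

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged table (values are made up)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mag_data_v2.csv")
pd.DataFrame({"ID": ["bin_A"], "Sample_ID": ["bin_A"]}).to_csv(demo_path, index=False)

# Drop the duplicate column and overwrite the file at the same full path
demo_df = pd.read_csv(demo_path).drop(columns=["ID"])
demo_df.to_csv(demo_path, index=False)

# Reload from that path to confirm the change was persisted
reloaded_columns = pd.read_csv(demo_path).columns.tolist()
print(reloaded_columns)
```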


In [48]:
# Renaming the MAG files and copying the renamed files to the mags folder
for index, row in mag_df2.iterrows():
    new_name = f"{row['Scientific_name']}_{row['SRA_accession']}.fasta".replace(' ', '_') # renaming with organism name & accession number and replacing spaces in the file name with '_'
    old_name = row['file_name']
    old_path = os.path.join(unprocessed_mag, old_name)
    new_path = os.path.join(mags, new_name)
    
    if os.path.exists(old_path):
        shutil.copy2(old_path, new_path) # Copying to mags folder
    else:
        print(f"File {old_path} does not exist") # Checking for non available file

print("All matching files have been renamed") # Confirmation for all the matched files renaming

All matching files have been renamed
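
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    Because the new names are built from the organism name and the SRA accession, two rows with the same pair would overwrite each other during the copy. The sketch below shows one way to check for such collisions beforehand. The helper <code>build_mag_name</code> is hypothetical, and the third row is a deliberate duplicate added for the demonstration.
</p>

```python
from collections import Counter


def build_mag_name(scientific_name, accession):
    """Join organism name and accession, replacing spaces with underscores."""
    return f"{scientific_name}_{accession}.fasta".replace(" ", "_")


# Rows taken from the metadata preview above, plus one deliberate duplicate
rows = [
    ("uncultured Spirosoma sp.", "ERS6626983"),
    ("uncultured Paenibacillus sp.", "ERS6626844"),
    ("uncultured Spirosoma sp.", "ERS6626983"),
]

new_names = [build_mag_name(name, acc) for name, acc in rows]
# Any name that occurs more than once would overwrite an earlier copy
duplicates = [name for name, n in Counter(new_names).items() if n > 1]
print(duplicates)
```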


In [49]:
# List the mags folder files to cross check
list_file(mags)

['Bdellovibrionaceae_bacterium_ERS6626579.fasta',
 'Bdellovibrionaceae_bacterium_ERS6626826.fasta',
 'Burkholderiaceae_bacterium_ERS6626828.fasta',
 'Burkholderiaceae_bacterium_ERS6626861.fasta',
 'Candidatus_Sericytochromatia_bacterium_ERS6626968.fasta',
 'Chitinophagaceae_bacterium_ERS6626975.fasta',
 'Mycoplasmatales_bacterium_ERS6626571.fasta',
 'Patescibacteria_group_bacterium_ERS6626570.fasta',
 'Patescibacteria_group_bacterium_ERS6626572.fasta',
 'Patescibacteria_group_bacterium_ERS6626573.fasta',
 'Patescibacteria_group_bacterium_ERS6626790.fasta',
 'Rhizobiaceae_bacterium_ERS6626796.fasta',
 'uncultured_Achromobacter_sp._ERS6626913.fasta',
 'uncultured_Acidovorax_sp._ERS6626561.fasta',
 'uncultured_Acidovorax_sp._ERS6626565.fasta',
 'uncultured_Acidovorax_sp._ERS6626576.fasta',
 'uncultured_Acidovorax_sp._ERS6626591.fasta',
 'uncultured_Acidovorax_sp._ERS6626592.fasta',
 'uncultured_Acidovorax_sp._ERS6626596.fasta',
 'uncultured_Acidovorax_sp._ERS6626599.fasta',
 'uncultured_A

In [50]:
# Cross checking the number of files to verify the 569 MAGs renamed and copied to mags folder
len(list_file(mags))

569

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The MAG files are renamed and saved to a new folder. Each MAG is named after the organism and its SRA accession number so that it is easier to identify in further analysis.
</p>

In [54]:
# Listing the MAGs with a contamination score above 5
mag_df2[mag_df2['Contamination_score']>5]

Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
0,maxbin_SP106_1.016_cleaned.fasta,maxbin_SP106_1.016_cleaned,maxbin_SP106_1.016_cleaned,ERS6626984,SAMEA8944148,157277,uncultured Agrobacterium sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,6,maxbin,metagenome-assembled genome,default
4,maxbin_SP108_3.007_cleaned.fasta,maxbin_SP108_3.007_cleaned,maxbin_SP108_3.007_cleaned,ERS6626962,SAMEA8944126,797541,uncultured Mucilaginibacter sp.,Illumina NovaSeq 6000,Megahit,63,CheckM,7,maxbin,metagenome-assembled genome,default
19,maxbin_SP135_3.009_sub_cleaned.fasta,maxbin_SP135_3.009_sub_cleaned,maxbin_SP135_3.009_sub_cleaned,ERS6626874,SAMEA8944038,259322,uncultured Epilithonimonas sp.,Illumina NovaSeq 6000,Megahit,65,CheckM,7,maxbin,metagenome-assembled genome,default
49,maxbin_SP264_3.001.fasta,maxbin_SP264_3.001,maxbin_SP264_3.001,ERS6627327,SAMEA8944550,202669,uncultured Exiguobacterium sp.,Illumina NovaSeq 6000,Megahit,89,CheckM,6,maxbin,metagenome-assembled genome,default
51,maxbin_SP266_1.010.fasta,maxbin_SP266_1.010,maxbin_SP266_1.010,ERS6626972,SAMEA8944136,797541,uncultured Mucilaginibacter sp.,Illumina NovaSeq 6000,Megahit,66,CheckM,7,maxbin,metagenome-assembled genome,default
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
546,SP77_3_metabat2_genome_mining.18_sub.fasta,SP77_3_metabat2_genome_mining.18_sub,SP77_3_metabat2_genome_mining.18_sub,ERS6626870,SAMEA8944034,158754,uncultured Sphingomonas sp.,Illumina NovaSeq 6000,Megahit,51,CheckM,10,metabat2,metagenome-assembled genome,default
547,SP77_3_metabat2_genome_mining.28.fasta,SP77_3_metabat2_genome_mining.28,SP77_3_metabat2_genome_mining.28,ERS6626938,SAMEA8944102,158754,uncultured Sphingomonas sp.,Illumina NovaSeq 6000,Megahit,58,CheckM,9,metabat2,metagenome-assembled genome,default
553,SP80_2_metabat2_genome_mining.9.fasta,SP80_2_metabat2_genome_mining.9,SP80_2_metabat2_genome_mining.9,ERS6626981,SAMEA8944145,797541,uncultured Mucilaginibacter sp.,Illumina NovaSeq 6000,Megahit,71,CheckM,8,metabat2,metagenome-assembled genome,default
556,SP86_3_metabat2_genome_mining.5.fasta,SP86_3_metabat2_genome_mining.5,SP86_3_metabat2_genome_mining.5,ERS6627282,SAMEA8944505,158789,uncultured Deinococcus sp.,Illumina NovaSeq 6000,Megahit,80,CheckM,9,metabat2,metagenome-assembled genome,default


In [78]:
# Keeping only the MAGs with a contamination score of 5 or below
filtered_mag = mag_df2[mag_df2['Contamination_score']<=5]
filtered_mag

Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned,maxbin_SP106_1.023_cleaned,ERS6626983,SAMEA8944147,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,1,maxbin,metagenome-assembled genome,default
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001,maxbin_SP106_2.001,ERS6626844,SAMEA8944008,227322,uncultured Paenibacillus sp.,Illumina NovaSeq 6000,Megahit,100,CheckM,1,maxbin,metagenome-assembled genome,default
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024,maxbin_SP106_3.024,ERS6626982,SAMEA8944146,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,71,CheckM,2,maxbin,metagenome-assembled genome,default
5,maxbin_SP112_1.001_cleaned.fasta,maxbin_SP112_1.001_cleaned,maxbin_SP112_1.001_cleaned,ERS6626773,SAMEA8943937,202669,uncultured Exiguobacterium sp.,Illumina NovaSeq 6000,Megahit,94,CheckM,1,maxbin,metagenome-assembled genome,default
6,maxbin_SP118_2.002.fasta,maxbin_SP118_2.002,maxbin_SP118_2.002,ERS6626928,SAMEA8944092,259322,uncultured Chryseobacterium sp.,Illumina NovaSeq 6000,Megahit,54,CheckM,2,maxbin,metagenome-assembled genome,default
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,SP92_3_metabat2_genome_mining.5_sub.fasta,SP92_3_metabat2_genome_mining.5_sub,SP92_3_metabat2_genome_mining.5_sub,ERS6626796,SAMEA8943960,1913961,Rhizobiaceae bacterium,Illumina NovaSeq 6000,Megahit,96,CheckM,0,metabat2,metagenome-assembled genome,default
565,SP92_3_metabat2_genome_mining.9.fasta,SP92_3_metabat2_genome_mining.9,SP92_3_metabat2_genome_mining.9,ERS6626829,SAMEA8943993,292277,uncultured Novosphingobium sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,0,metabat2,metagenome-assembled genome,default
566,SP93_2_metabat2_genome_mining.11.fasta,SP93_2_metabat2_genome_mining.11,SP93_2_metabat2_genome_mining.11,ERS6627002,SAMEA8944166,1805779,uncultured Naasia sp.,Illumina NovaSeq 6000,Megahit,78,CheckM,0,metabat2,metagenome-assembled genome,default
567,SP93_2_metabat2_genome_mining.8.fasta,SP93_2_metabat2_genome_mining.8,SP93_2_metabat2_genome_mining.8,ERS6626884,SAMEA8944048,331964,uncultured Curtobacterium sp.,Illumina NovaSeq 6000,Megahit,58,CheckM,1,metabat2,metagenome-assembled genome,default


In [80]:
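# Adding a Genome column built from the organism name and SRA accession, with spaces replaced by '_'
filtered_mag['Genome'] = filtered_mag.apply(lambda row: f"{row['Scientific_name']}_{row['SRA_accession']}".replace(' ', '_'), axis=1)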
filtered_mag

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_mag['Genome'] = filtered_mag.apply(lambda row: f"{row['Scientific_name']}_{row['SRA_accession']}".replace(' ', '_'), axis=1)


Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters,Genome
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned,maxbin_SP106_1.023_cleaned,ERS6626983,SAMEA8944147,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,1,maxbin,metagenome-assembled genome,default,uncultured_Spirosoma_sp._ERS6626983
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001,maxbin_SP106_2.001,ERS6626844,SAMEA8944008,227322,uncultured Paenibacillus sp.,Illumina NovaSeq 6000,Megahit,100,CheckM,1,maxbin,metagenome-assembled genome,default,uncultured_Paenibacillus_sp._ERS6626844
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024,maxbin_SP106_3.024,ERS6626982,SAMEA8944146,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,71,CheckM,2,maxbin,metagenome-assembled genome,default,uncultured_Spirosoma_sp._ERS6626982
5,maxbin_SP112_1.001_cleaned.fasta,maxbin_SP112_1.001_cleaned,maxbin_SP112_1.001_cleaned,ERS6626773,SAMEA8943937,202669,uncultured Exiguobacterium sp.,Illumina NovaSeq 6000,Megahit,94,CheckM,1,maxbin,metagenome-assembled genome,default,uncultured_Exiguobacterium_sp._ERS6626773
6,maxbin_SP118_2.002.fasta,maxbin_SP118_2.002,maxbin_SP118_2.002,ERS6626928,SAMEA8944092,259322,uncultured Chryseobacterium sp.,Illumina NovaSeq 6000,Megahit,54,CheckM,2,maxbin,metagenome-assembled genome,default,uncultured_Chryseobacterium_sp._ERS6626928
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,SP92_3_metabat2_genome_mining.5_sub.fasta,SP92_3_metabat2_genome_mining.5_sub,SP92_3_metabat2_genome_mining.5_sub,ERS6626796,SAMEA8943960,1913961,Rhizobiaceae bacterium,Illumina NovaSeq 6000,Megahit,96,CheckM,0,metabat2,metagenome-assembled genome,default,Rhizobiaceae_bacterium_ERS6626796
565,SP92_3_metabat2_genome_mining.9.fasta,SP92_3_metabat2_genome_mining.9,SP92_3_metabat2_genome_mining.9,ERS6626829,SAMEA8943993,292277,uncultured Novosphingobium sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,0,metabat2,metagenome-assembled genome,default,uncultured_Novosphingobium_sp._ERS6626829
566,SP93_2_metabat2_genome_mining.11.fasta,SP93_2_metabat2_genome_mining.11,SP93_2_metabat2_genome_mining.11,ERS6627002,SAMEA8944166,1805779,uncultured Naasia sp.,Illumina NovaSeq 6000,Megahit,78,CheckM,0,metabat2,metagenome-assembled genome,default,uncultured_Naasia_sp._ERS6627002
567,SP93_2_metabat2_genome_mining.8.fasta,SP93_2_metabat2_genome_mining.8,SP93_2_metabat2_genome_mining.8,ERS6626884,SAMEA8944048,331964,uncultured Curtobacterium sp.,Illumina NovaSeq 6000,Megahit,58,CheckM,1,metabat2,metagenome-assembled genome,default,uncultured_Curtobacterium_sp._ERS6626884


In [82]:
# Removing the leading 'uncultured_' prefix from the Genome column
filtered_mag['Genome'] = filtered_mag['Genome'].str.replace('^uncultured_', '', regex=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_mag['Genome'] = filtered_mag['Genome'].str.replace('^uncultured_', '', regex=True)
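
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The warnings above appear because the filtered data frame is derived from a slice of another data frame. A common way to avoid them is to take an explicit copy after filtering. The sketch below shows this on a toy table and also builds the label with vectorised string operations instead of a row-wise apply; the values are made up and only the column names follow the notebook.
</p>

```python
import pandas as pd

# Toy stand-in for mag_df2 (values are made up)
demo = pd.DataFrame({
    "Scientific_name": ["uncultured Spirosoma sp.", "Rhizobiaceae bacterium"],
    "SRA_accession": ["ACC1", "ACC2"],
    "Contamination_score": [1, 0],
})

# .copy() makes the filtered frame independent of the original
kept = demo[demo["Contamination_score"] <= 5].copy()

# Vectorised string operations build the label without a row-wise apply
kept["Genome"] = (
    (kept["Scientific_name"] + "_" + kept["SRA_accession"])
    .str.replace(" ", "_", regex=False)
    .str.replace("^uncultured_", "", regex=True)
)
print(kept["Genome"].tolist())
```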


In [84]:
# Saving the filtered MAG table to the Raw_Data folder
fil_mag_file = os.path.join(raw_data, 'filtered_mag.csv')
filtered_mag.to_csv(fil_mag_file)

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
   Create CoxL Protein Databases
</h2>

In [51]:
# Separating form 2 proteins and form 1 proteins
## Output file names
raw_coxl1_file = os.path.join(raw_data, "raw_coxl1.fasta") # form 1 proteins
raw_coxl2_file = os.path.join(raw_data, "raw_coxl2.fasta") # form 2 proteins

## Read all sequences from the input file
sequences = list(SeqIO.parse(coxl_fasta_path, "fasta"))

## Split the sequences
coxl1 = sequences[:-2]
coxl2 = sequences[-2:] ## the last two sequences in the file are known to be form 2 CoxL proteins

## Write form 1 and form 2 CoxL proteins to separate files
SeqIO.write(coxl1, raw_coxl1_file, "fasta") # Form 1 CoxL proteins
SeqIO.write(coxl2, raw_coxl2_file, "fasta") # Form 2 CoxL proteins

## For cross checking
print(f" Form 1 proteins saved to {raw_coxl1_file}")
print(f" Form 2 proteins saved to {raw_coxl2_file}")

 Form 1 proteins saved to C:/Users/dkb1g24/Documents/mres/Raw_Data\raw_coxl1.fasta
 Form 2 proteins saved to C:/Users/dkb1g24/Documents/mres/Raw_Data\raw_coxl2.fasta
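
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
    The split relies on the position of the records in the input file. A small guard, sketched below, can make that assumption explicit by refusing to split when the file has too few records. The function name <code>split_tail</code> and the record names are placeholders; in the notebook the input would be the list returned by SeqIO.parse.
</p>

```python
def split_tail(records, n_tail):
    """Split records into (all but the last n_tail, the last n_tail)."""
    if not 0 < n_tail < len(records):
        raise ValueError("n_tail must be between 1 and len(records) - 1")
    return records[:-n_tail], records[-n_tail:]


# Placeholder record names; the real input is the list returned by SeqIO.parse
demo_records = ["form1_a", "form1_b", "form1_c", "form2_a", "form2_b"]
form1, form2 = split_tail(demo_records, 2)
print(len(form1), len(form2))
```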


In [52]:
# Removing gap characters ('-') from the CoxL sequences before building the databases
coxl1_file = os.path.join(raw_data, "coxl1.fasta") # form 1 proteins
coxl2_file = os.path.join(raw_data, "coxl2.fasta") # form 2 proteins
remove_gaps_from_fasta(raw_coxl1_file, coxl1_file)
remove_gaps_from_fasta(raw_coxl2_file, coxl2_file)

In [53]:
# Create databases
create_db(coxl1_file, coxl1_db_folder, "coxl1_db")
create_db(coxl2_file, coxl2_db_folder, "coxl2_db")

Database created at C:/Users/dkb1g24/Documents/mres/BLAST/CoxL1\coxl1_db
Database created at C:/Users/dkb1g24/Documents/mres/BLAST/CoxL2\coxl2_db


<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
   Prepare List to Upload MAGs to RAST Servers
</h2>

In [55]:
# Creating a column with mag file names
mag_df2['mag_file'] = mag_df2['Scientific_name'].str.replace(' ', '_') + '_' + mag_df2['SRA_accession'].str.replace(' ', '_') + '.fasta'
mag_df2.head()

Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters,mag_file
0,maxbin_SP106_1.016_cleaned.fasta,maxbin_SP106_1.016_cleaned,maxbin_SP106_1.016_cleaned,ERS6626984,SAMEA8944148,157277,uncultured Agrobacterium sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,6,maxbin,metagenome-assembled genome,default,uncultured_Agrobacterium_sp._ERS6626984.fasta
1,maxbin_SP106_1.023_cleaned.fasta,maxbin_SP106_1.023_cleaned,maxbin_SP106_1.023_cleaned,ERS6626983,SAMEA8944147,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,72,CheckM,1,maxbin,metagenome-assembled genome,default,uncultured_Spirosoma_sp._ERS6626983.fasta
2,maxbin_SP106_2.001.fasta,maxbin_SP106_2.001,maxbin_SP106_2.001,ERS6626844,SAMEA8944008,227322,uncultured Paenibacillus sp.,Illumina NovaSeq 6000,Megahit,100,CheckM,1,maxbin,metagenome-assembled genome,default,uncultured_Paenibacillus_sp._ERS6626844.fasta
3,maxbin_SP106_3.024.fasta,maxbin_SP106_3.024,maxbin_SP106_3.024,ERS6626982,SAMEA8944146,278208,uncultured Spirosoma sp.,Illumina NovaSeq 6000,Megahit,71,CheckM,2,maxbin,metagenome-assembled genome,default,uncultured_Spirosoma_sp._ERS6626982.fasta
4,maxbin_SP108_3.007_cleaned.fasta,maxbin_SP108_3.007_cleaned,maxbin_SP108_3.007_cleaned,ERS6626962,SAMEA8944126,797541,uncultured Mucilaginibacter sp.,Illumina NovaSeq 6000,Megahit,63,CheckM,7,maxbin,metagenome-assembled genome,default,uncultured_Mucilaginibacter_sp._ERS6626962.fasta


In [56]:
# Looking up the MAG with SRA accession ERS6626837
mag_df2[mag_df2['SRA_accession'] == 'ERS6626837']

Unnamed: 0,file_name,ID,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters,mag_file
537,SP67_2_metabat2_genome_mining.16.fasta,SP67_2_metabat2_genome_mining.16,SP67_2_metabat2_genome_mining.16,ERS6626837,SAMEA8944001,158789,uncultured Deinococcus sp.,Illumina NovaSeq 6000,Megahit,99,CheckM,1,metabat2,metagenome-assembled genome,default,uncultured_Deinococcus_sp._ERS6626837.fasta


In [57]:
mag_df[mag_df['SRA_accession'] == 'ERS6626861']

Unnamed: 0,Sample_ID,SRA_accession,BioSample,Tax_ID,Scientific_name,Sequencing_method,Assembly_Software,Completeness_score,Completeness_software,Contamination_score,Binning_software,Investigation_type,Binning_parameters
301,SP213_2_metabat2_genome_mining.4,ERS6626861,SAMEA8944025,2030806,Burkholderiaceae bacterium,Illumina NovaSeq 6000,Megahit,72,CheckM,2,metabat2,metagenome-assembled genome,default


In [58]:
# Reordering the columns
mag_df2 = mag_df2[['mag_file', 'Scientific_name', 'SRA_accession', 'BioSample', 'Tax_ID', 'Sample_ID', 'Completeness_score', 'Contamination_score', 'Sequencing_method', 'Assembly_Software', 'Completeness_software', 'Binning_software', 'Investigation_type', 'Binning_parameters', 'file_name']]
mag_df2.head()

Unnamed: 0,mag_file,Scientific_name,SRA_accession,BioSample,Tax_ID,Sample_ID,Completeness_score,Contamination_score,Sequencing_method,Assembly_Software,Completeness_software,Binning_software,Investigation_type,Binning_parameters,file_name
0,uncultured_Agrobacterium_sp._ERS6626984.fasta,uncultured Agrobacterium sp.,ERS6626984,SAMEA8944148,157277,maxbin_SP106_1.016_cleaned,72,6,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_1.016_cleaned.fasta
1,uncultured_Spirosoma_sp._ERS6626983.fasta,uncultured Spirosoma sp.,ERS6626983,SAMEA8944147,278208,maxbin_SP106_1.023_cleaned,72,1,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_1.023_cleaned.fasta
2,uncultured_Paenibacillus_sp._ERS6626844.fasta,uncultured Paenibacillus sp.,ERS6626844,SAMEA8944008,227322,maxbin_SP106_2.001,100,1,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_2.001.fasta
3,uncultured_Spirosoma_sp._ERS6626982.fasta,uncultured Spirosoma sp.,ERS6626982,SAMEA8944146,278208,maxbin_SP106_3.024,71,2,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_3.024.fasta
4,uncultured_Mucilaginibacter_sp._ERS6626962.fasta,uncultured Mucilaginibacter sp.,ERS6626962,SAMEA8944126,797541,maxbin_SP108_3.007_cleaned,63,7,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP108_3.007_cleaned.fasta


In [59]:
# Convert all column headings to lower case
mag_df2.columns = [col.lower() for col in mag_df2.columns]
mag_df2.head()

Unnamed: 0,mag_file,scientific_name,sra_accession,biosample,tax_id,sample_id,completeness_score,contamination_score,sequencing_method,assembly_software,completeness_software,binning_software,investigation_type,binning_parameters,file_name
0,uncultured_Agrobacterium_sp._ERS6626984.fasta,uncultured Agrobacterium sp.,ERS6626984,SAMEA8944148,157277,maxbin_SP106_1.016_cleaned,72,6,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_1.016_cleaned.fasta
1,uncultured_Spirosoma_sp._ERS6626983.fasta,uncultured Spirosoma sp.,ERS6626983,SAMEA8944147,278208,maxbin_SP106_1.023_cleaned,72,1,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_1.023_cleaned.fasta
2,uncultured_Paenibacillus_sp._ERS6626844.fasta,uncultured Paenibacillus sp.,ERS6626844,SAMEA8944008,227322,maxbin_SP106_2.001,100,1,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_2.001.fasta
3,uncultured_Spirosoma_sp._ERS6626982.fasta,uncultured Spirosoma sp.,ERS6626982,SAMEA8944146,278208,maxbin_SP106_3.024,71,2,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP106_3.024.fasta
4,uncultured_Mucilaginibacter_sp._ERS6626962.fasta,uncultured Mucilaginibacter sp.,ERS6626962,SAMEA8944126,797541,maxbin_SP108_3.007_cleaned,63,7,Illumina NovaSeq 6000,Megahit,CheckM,maxbin,metagenome-assembled genome,default,maxbin_SP108_3.007_cleaned.fasta


In [60]:
# Preparing the RAST details file
rast_list = mag_df2[['scientific_name', 'sra_accession','mag_file', 'tax_id']]
rast_list.head()

Unnamed: 0,scientific_name,sra_accession,mag_file,tax_id
0,uncultured Agrobacterium sp.,ERS6626984,uncultured_Agrobacterium_sp._ERS6626984.fasta,157277
1,uncultured Spirosoma sp.,ERS6626983,uncultured_Spirosoma_sp._ERS6626983.fasta,278208
2,uncultured Paenibacillus sp.,ERS6626844,uncultured_Paenibacillus_sp._ERS6626844.fasta,227322
3,uncultured Spirosoma sp.,ERS6626982,uncultured_Spirosoma_sp._ERS6626982.fasta,278208
4,uncultured Mucilaginibacter sp.,ERS6626962,uncultured_Mucilaginibacter_sp._ERS6626962.fasta,797541


In [62]:
# Copying dataframe to avoid SettingWithCopyWarning
rast_list = rast_list.copy()
# Create a new column 'mag_name' by removing the last 6 characters from 'mag_file'
rast_list['mag_name'] = rast_list['mag_file'].str[:-6]
rast_list.head()

Unnamed: 0,scientific_name,sra_accession,mag_file,tax_id,mag_name
0,uncultured Agrobacterium sp.,ERS6626984,uncultured_Agrobacterium_sp._ERS6626984.fasta,157277,uncultured_Agrobacterium_sp._ERS6626984
1,uncultured Spirosoma sp.,ERS6626983,uncultured_Spirosoma_sp._ERS6626983.fasta,278208,uncultured_Spirosoma_sp._ERS6626983
2,uncultured Paenibacillus sp.,ERS6626844,uncultured_Paenibacillus_sp._ERS6626844.fasta,227322,uncultured_Paenibacillus_sp._ERS6626844
3,uncultured Spirosoma sp.,ERS6626982,uncultured_Spirosoma_sp._ERS6626982.fasta,278208,uncultured_Spirosoma_sp._ERS6626982
4,uncultured Mucilaginibacter sp.,ERS6626962,uncultured_Mucilaginibacter_sp._ERS6626962.fasta,797541,uncultured_Mucilaginibacter_sp._ERS6626962
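
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   Slicing off the last six characters works because every file name here ends in '.fasta'. As a hedged alternative, the sketch below strips the suffix only when it is present, so a file with a different extension is left untouched instead of being truncated. The frame rast_demo and the file name 'example_bin.fa' are made up for illustration and are not part of the dataset.
</p>

```python
import pandas as pd

# Toy stand-in for rast_list; the first two names follow the pattern shown above,
# the third is a hypothetical file with a different extension.
rast_demo = pd.DataFrame({
    'mag_file': [
        'uncultured_Agrobacterium_sp._ERS6626984.fasta',
        'uncultured_Spirosoma_sp._ERS6626983.fasta',
        'example_bin.fa',
    ]
})

# Remove a trailing '.fasta' only; anything else is returned unchanged.
rast_demo['mag_name'] = rast_demo['mag_file'].str.replace(r'\.fasta$', '', regex=True)

print(rast_demo['mag_name'].tolist())
```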


In [67]:
# Save to csv files
output = os.path.join(raw_data, 'rast_list.csv')
rast_list.to_csv(output, index=False)

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   The MAG files are uploaded to the RAST server, and the results are downloaded as amino acid FASTA files and Excel files named after the organism and SRA accession.
</p>
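
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   Because the downloads are handled manually, it can help to confirm that every MAG has both result files. The following is a minimal sketch under stated assumptions: the helper missing_rast_results is hypothetical, and the extensions '.faa' and '.xls' are guesses at how the downloaded files were saved, so adjust them to the actual names.
</p>

```python
from pathlib import Path
import tempfile

def missing_rast_results(mag_names, results_dir, extensions=('.faa', '.xls')):
    """Return (mag_name, extension) pairs that have no file in results_dir.

    The default extensions are assumptions, not taken from the RAST output.
    """
    results_dir = Path(results_dir)
    return [
        (name, ext)
        for name in mag_names
        for ext in extensions
        # Concatenate instead of using Path.with_suffix, because names such as
        # 'uncultured_Agrobacterium_sp._ERS6626984' already contain dots.
        if not (results_dir / f'{name}{ext}').exists()
    ]

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    names = ['uncultured_Agrobacterium_sp._ERS6626984',
             'uncultured_Spirosoma_sp._ERS6626983']
    (Path(tmp) / f'{names[0]}.faa').touch()
    (Path(tmp) / f'{names[0]}.xls').touch()
    (Path(tmp) / f'{names[1]}.faa').touch()  # the Excel file is deliberately absent
    missing = missing_rast_results(names, tmp)

print(missing)
```

<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   In the notebook, the mag_name column of rast_list could be passed as mag_names, with results_dir pointing at the folder that holds the downloads.
</p>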

<h2 style="font-family: 'Times New Roman'; font-weight: bold;">
   Prepare List for CoverM Analysis
</h2>

In [70]:
rice_cultivars = open_file(rice_path)
rice_cultivars

Unnamed: 0,Supplementary Data 1. Information on rice varieties used for metagenome sequencing.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,Bioproject,SRA accession,BioSample,Sample ID,Host species,Sub-population,Group,Country of origin,IRRI seed bank ID,GSOR ID,Sample origin,Sample coordinates,Sampling date
1,PRJEB45634,ERS6595519,SAMEA8912636,SP71_1,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
2,PRJEB45634,ERS6595579,SAMEA8912697,SP285_1,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
3,PRJEB45634,ERS6595630,SAMEA8912748,SP71_2,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
4,PRJEB45634,ERS6595690,SAMEA8912808,SP285_2,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...
326,PRJEB45634,ERS6595820,SAMEA8912938,SP391_3,Oryza sativa,Tropical japonica,japonica,Cote d'Ivoire,IRGC 121843,GSOR302541,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
327,PRJEB45634,ERS6595822,SAMEA8912940,SP397_3,Oryza sativa,Tropical japonica,japonica,Spain,IRGC 121697,GSOR302400,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
328,PRJEB45634,ERS6595823,SAMEA8912941,SP398_3,Oryza sativa,Tropical japonica,japonica,Brazil,IRGC 121809,GSOR302512,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
329,PRJEB45634,ERS6595824,SAMEA8912942,SP400_3,Oryza sativa,Tropical japonica,japonica,Madagascar,IRGC 121293,GSOR301999,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020


In [71]:
# The real column names sit in the first data row; promote them to the header
rice_cultivars.columns = rice_cultivars.iloc[0]

In [72]:
rice_cultivars

Unnamed: 0,Bioproject,SRA accession,BioSample,Sample ID,Host species,Sub-population,Group,Country of origin,IRRI seed bank ID,GSOR ID,Sample origin,Sample coordinates,Sampling date
0,Bioproject,SRA accession,BioSample,Sample ID,Host species,Sub-population,Group,Country of origin,IRRI seed bank ID,GSOR ID,Sample origin,Sample coordinates,Sampling date
1,PRJEB45634,ERS6595519,SAMEA8912636,SP71_1,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
2,PRJEB45634,ERS6595579,SAMEA8912697,SP285_1,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
3,PRJEB45634,ERS6595630,SAMEA8912748,SP71_2,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
4,PRJEB45634,ERS6595690,SAMEA8912808,SP285_2,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...
326,PRJEB45634,ERS6595820,SAMEA8912938,SP391_3,Oryza sativa,Tropical japonica,japonica,Cote d'Ivoire,IRGC 121843,GSOR302541,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
327,PRJEB45634,ERS6595822,SAMEA8912940,SP397_3,Oryza sativa,Tropical japonica,japonica,Spain,IRGC 121697,GSOR302400,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
328,PRJEB45634,ERS6595823,SAMEA8912941,SP398_3,Oryza sativa,Tropical japonica,japonica,Brazil,IRGC 121809,GSOR302512,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
329,PRJEB45634,ERS6595824,SAMEA8912942,SP400_3,Oryza sativa,Tropical japonica,japonica,Madagascar,IRGC 121293,GSOR301999,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020


In [81]:
# Drop the first row, which now duplicates the header
rice_cultivars = rice_cultivars[1:]

rice_cultivars.head()

Unnamed: 0,Bioproject,SRA accession,BioSample,Sample ID,Host species,Sub-population,Group,Country of origin,IRRI seed bank ID,GSOR ID,Sample origin,Sample coordinates,Sampling date
1,PRJEB45634,ERS6595519,SAMEA8912636,SP71_1,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
2,PRJEB45634,ERS6595579,SAMEA8912697,SP285_1,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
3,PRJEB45634,ERS6595630,SAMEA8912748,SP71_2,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
4,PRJEB45634,ERS6595690,SAMEA8912808,SP285_2,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
5,PRJEB45634,ERS6595740,SAMEA8912858,SP71_3,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020


In [83]:
rice_cultivars.reset_index(drop=True, inplace=True)
rice_cultivars.head()

Unnamed: 0,Bioproject,SRA accession,BioSample,Sample ID,Host species,Sub-population,Group,Country of origin,IRRI seed bank ID,GSOR ID,Sample origin,Sample coordinates,Sampling date
0,PRJEB45634,ERS6595519,SAMEA8912636,SP71_1,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
1,PRJEB45634,ERS6595579,SAMEA8912697,SP285_1,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
2,PRJEB45634,ERS6595630,SAMEA8912748,SP71_2,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
3,PRJEB45634,ERS6595690,SAMEA8912808,SP285_2,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
4,PRJEB45634,ERS6595740,SAMEA8912858,SP71_3,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
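
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   The header clean-up in the preceding cells (promote the first row, drop it, reset the index) can be collected into one helper. This is only a sketch: promote_first_row is a hypothetical name, and the three-column frame is a cut-down stand-in for the spreadsheet.
</p>

```python
import pandas as pd

def promote_first_row(df):
    """Use the first data row as column labels and drop it from the body."""
    out = df.iloc[1:].copy()            # copy so the input frame is not modified
    out.columns = df.iloc[0].tolist()   # a plain list keeps the columns axis unnamed
    return out.reset_index(drop=True)   # renumber rows from 0

# Cut-down stand-in for the raw sheet: headers arrive as the first data row.
raw = pd.DataFrame([
    ['Bioproject', 'SRA accession', 'Group'],
    ['PRJEB45634', 'ERS6595519', 'indica'],
    ['PRJEB45634', 'ERS6595579', 'indica'],
])

tidy = promote_first_row(raw)
print(tidy)
```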


In [84]:
# Keep only samples assigned to the indica or japonica group
refining = rice_cultivars[rice_cultivars['Group'].isin(['indica', 'japonica'])]
refining

Unnamed: 0,Bioproject,SRA accession,BioSample,Sample ID,Host species,Sub-population,Group,Country of origin,IRRI seed bank ID,GSOR ID,Sample origin,Sample coordinates,Sampling date
0,PRJEB45634,ERS6595519,SAMEA8912636,SP71_1,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
1,PRJEB45634,ERS6595579,SAMEA8912697,SP285_1,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
2,PRJEB45634,ERS6595630,SAMEA8912748,SP71_2,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
3,PRJEB45634,ERS6595690,SAMEA8912808,SP285_2,Oryza sativa,Admixed-Indica,indica,India,IRGC 121277,GSOR301983,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
4,PRJEB45634,ERS6595740,SAMEA8912858,SP71_3,Oryza sativa,Admixed-Indica,indica,Madagascar,IRGC 121291,GSOR301997,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,PRJEB45634,ERS6595820,SAMEA8912938,SP391_3,Oryza sativa,Tropical japonica,japonica,Cote d'Ivoire,IRGC 121843,GSOR302541,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
326,PRJEB45634,ERS6595822,SAMEA8912940,SP397_3,Oryza sativa,Tropical japonica,japonica,Spain,IRGC 121697,GSOR302400,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
327,PRJEB45634,ERS6595823,SAMEA8912941,SP398_3,Oryza sativa,Tropical japonica,japonica,Brazil,IRGC 121809,GSOR302512,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020
328,PRJEB45634,ERS6595824,SAMEA8912942,SP400_3,Oryza sativa,Tropical japonica,japonica,Madagascar,IRGC 121293,GSOR301999,"China, Hunan Province, Hunan Identificaiton Ce...","N28.8, E112.2",September 5. 2020


In [88]:
selection = refining[['IRRI seed bank ID', 'SRA accession', 'Group']]
selection

Unnamed: 0,IRRI seed bank ID,SRA accession,Group
0,IRGC 121291,ERS6595519,indica
1,IRGC 121277,ERS6595579,indica
2,IRGC 121291,ERS6595630,indica
3,IRGC 121277,ERS6595690,indica
4,IRGC 121291,ERS6595740,indica
...,...,...,...
325,IRGC 121843,ERS6595820,japonica
326,IRGC 121697,ERS6595822,japonica
327,IRGC 121809,ERS6595823,japonica
328,IRGC 121293,ERS6595824,japonica


In [89]:
sra = open_file(sra_path)
sra

Unnamed: 0,SRA accession,New SRA
0,ERS6595506,ERR6050324
1,ERS6595507,ERR6050325
2,ERS6595508,ERR6050326
3,ERS6595509,ERR6050327
4,ERS6595513,ERR6050331
...,...,...
325,ERS6595785,ERR6050602
326,ERS6595797,ERR6050614
327,ERS6595808,ERR6050625
328,ERS6595813,ERR6050630


In [94]:
# Inner join on 'SRA accession'; rows without a match in sra are dropped
group = selection.merge(sra, on='SRA accession').reset_index(drop=True)
group

Unnamed: 0,IRRI seed bank ID,SRA accession,Group,New SRA
0,IRGC 121291,ERS6595519,indica,ERR6050337
1,IRGC 121277,ERS6595579,indica,ERR6050397
2,IRGC 121291,ERS6595630,indica,ERR6050447
3,IRGC 121277,ERS6595690,indica,ERR6050507
4,IRGC 121291,ERS6595740,indica,ERR6050557
...,...,...,...,...
271,IRGC 121843,ERS6595820,japonica,ERR6050637
272,IRGC 121697,ERS6595822,japonica,ERR6050639
273,IRGC 121809,ERS6595823,japonica,ERR6050640
274,IRGC 121293,ERS6595824,japonica,ERR6050641
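
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   merge performs an inner join by default, so any accession that is missing from the lookup table is dropped without a message. A possible way to see which rows fail to match is a left join with indicator=True, sketched below on toy data. The frames selection_demo and sra_demo are illustrative, and the accession 'ERS0000000' is invented to force a mismatch.
</p>

```python
import pandas as pd

selection_demo = pd.DataFrame({
    'SRA accession': ['ERS6595519', 'ERS6595579', 'ERS0000000'],  # last one is made up
    'Group': ['indica', 'indica', 'japonica'],
})
sra_demo = pd.DataFrame({
    'SRA accession': ['ERS6595519', 'ERS6595579'],
    'New SRA': ['ERR6050337', 'ERR6050397'],
})

# how='left' keeps every row of selection_demo; indicator=True adds a '_merge'
# column that records whether a match was found in sra_demo.
audit = selection_demo.merge(sra_demo, on='SRA accession', how='left', indicator=True)

# Accessions present in the selection but absent from the lookup table.
unmatched = audit.loc[audit['_merge'] == 'left_only', 'SRA accession'].tolist()
print(unmatched)
```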


In [97]:
group['IRRI seed bank ID'].nunique()

92
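
<p style="font-family: 'Times New Roman'; font-size: 15px; text-align: justify; width: 100%;">
   With 92 distinct seed bank IDs across the merged rows, each cultivar appears to be covered by roughly three sequencing runs. A quick way to check this per cultivar is to count rows per ID, as sketched here on the first five rows of the merged table (group_demo is an illustrative copy, not the full data).
</p>

```python
import pandas as pd

# First five rows of the merged table shown above.
group_demo = pd.DataFrame({
    'IRRI seed bank ID': ['IRGC 121291', 'IRGC 121277', 'IRGC 121291',
                          'IRGC 121277', 'IRGC 121291'],
    'New SRA': ['ERR6050337', 'ERR6050397', 'ERR6050447',
                'ERR6050507', 'ERR6050557'],
})

# Number of sequencing runs recorded for each seed bank ID.
runs_per_id = group_demo.groupby('IRRI seed bank ID').size()
print(runs_per_id)
```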

In [99]:
# Sort by seed bank ID and keep the renumbered index
group = group.sort_values(by='IRRI seed bank ID').reset_index(drop=True)
group

Unnamed: 0,IRRI seed bank ID,SRA accession,Group,New SRA
0,IRGC 117429,ERS6595826,indica,ERR6050643
1,IRGC 117429,ERS6595606,indica,ERR6050423
2,IRGC 117429,ERS6595716,indica,ERR6050533
3,IRGC 117459,ERS6595767,indica,ERR6050584
4,IRGC 117459,ERS6595657,indica,ERR6050474
...,...,...,...,...
271,IRGC 121872,ERS6595648,indica,ERR6050465
272,IRGC 121872,ERS6595758,indica,ERR6050575
273,IRGC 121884,ERS6595701,indica,ERR6050518
274,IRGC 121884,ERS6595591,indica,ERR6050408


In [101]:
out_put = os.path.join(raw_data, 'group.csv')
group.to_csv(out_put, index=False)