# Microbiome Relative Abundance Data Cleanup
## Anna Lapteva
### April 10, 2024

All data used to generate the final CSV file come from the [GMRepo](https://gmrepo.humangut.info/home) database. For each phenotype (Autism Spectrum Disorder (ASD), Alzheimer's Disease (AD), Parkinson's Disease (PD), Multiple Sclerosis (MS), and epilepsy), the run IDs linked to each project were saved as a spreadsheet, and the file for each run was opened and downloaded individually. Each downloaded .txt file is then cleaned and appended to a DataFrame in this notebook. The following projects, which performed 16S sequencing of fecal samples on Illumina platforms, were used for data collection:
- ASD: PRJNA578223
- AD: PRJNA633959
- PD: PRJEB17784
- MS: PRJNA450340

Let's start by importing relevant packages.

In [1]:
# Relevant string operations
import string

# Data handling
import glob
import os
import pandas as pd

We will use the `glob` module to collect every file with a given extension from each data directory in a single call.
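As a minimal sketch of how `glob` pattern matching behaves, the snippet below builds a throwaway directory and selects only its `.txt` files. The file names are hypothetical placeholders, not files from the actual dataset.

```python
import glob
import os
import tempfile

# Hypothetical directory contents, created on the fly for illustration
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_a.txt", "run_b.txt", "notes.csv"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder\n")

    # '*.txt' matches any file name in tmp that ends in .txt
    txt_files = sorted(glob.glob(os.path.join(tmp, "*.txt")))
    txt_names = [os.path.basename(path) for path in txt_files]

print(txt_names)  # ['run_a.txt', 'run_b.txt']
```

`glob.glob` returns matches in arbitrary order, so the sketch sorts them to get a reproducible listing.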

In [2]:
# Specify directories holding data
data_dir = ['../Data/LargeData/ASD', '../Data/LargeData/AD', '../Data/LargeData/PD', '../Data/LargeData/MS']
            
# Initialize DataFrame for concatenation steps
df = pd.DataFrame()

# Iterate through each of the directories
for direc in data_dir:
    
    # Glob string for all .txt files, which have relative abundance data
    file_glob = os.path.join(direc, '*.txt')
    
    # Get list of files in direc
    file_list = glob.glob(file_glob)

    # Iterate through all files in list
    for file in file_list:

        # Initialize skip_row_id to infinity so that no line passes the
        # check below until the header row has been found
        skip_row_id = float('inf')

        # List to store modified lines
        modified_lines = []
        
        # Iterate through all lines in file
        with open(file, 'r') as f:
            # Find where relative abundance data starts
            for row_number, line in enumerate(f, start=1):
                if "ncbi_taxon_id" in line:
                    skip_row_id = row_number - 1
        
                # Ensure column entries are one "word"
                if row_number > skip_row_id:  # Ensure lines before are not edited
                    modified_line = ''
                    prev_char = None
                    for char in line:
                        if char == ' ' and prev_char in string.ascii_lowercase + '.':
                            modified_line += '_'
                        else:
                            modified_line += char
                        prev_char = char
                    modified_lines.append(modified_line)
        
        # Write modified lines back to file
        with open(file, 'w') as f:
            f.writelines(modified_lines)

        # Create DataFrame from the cleaned file (filler lines were dropped above);
        # a raw string avoids an invalid escape sequence warning for '\s'
        data = pd.read_csv(file, sep=r'\s+')

        # Add individual index for reshaping
        data["individual"] = 0

        # Reshape data
        data = (
            data[["relative_abundance", "scientific_name", "individual"]]
            .pivot_table(
                columns="scientific_name",
                values="relative_abundance",
                index="individual",
                aggfunc="sum",
            )
            .reset_index()
        )
        data = data.drop("individual", axis=1)

        # Add label (the directory name, e.g. 'ASD')
        data["Diagnosis"] = os.path.basename(direc)

        # Update DataFrame
        df = pd.concat([df, data], ignore_index=True).fillna(0)

# Check out df
df.head()

| scientific_name | Akkermansia | Bacteroides | Bifidobacterium | Butyricicoccus | Clostridium | Collinsella | Coprococcus | Enterococcus | Faecalibacterium | Gemmiger | ... | Endozoicomonas | Hydrocarboniphaga | Brucella | Jannaschia | Labrys | Parvibaculum | Rudaea | Sphingopyxis | Serratia | Actinobaculum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.84373 | 18.3876 | 4.52788 | 0.193263 | 4.94202 | 2.09829 | 26.339 | 0.193263 | 10.2154 | 4.22419 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 25.0332 | 2.25198 | 0.0 | 3.22114 | 0.0 | 1.5603 | 0.0 | 1.11795 | 3.31363 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.096056 | 8.59078 | 0.28505 | 0.0 | 4.49125 | 2.17374 | 1.91114 | 0.003119 | 4.92038 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.00071 | 5.34432 | 0.460558 | 0.0 | 0.048256 | 0.042579 | 0.001419 | 0.0 | 0.386755 | 0.009225 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.00189 | 6.94195 | 0.059074 | 0.0 | 0.034736 | 0.000236 | 0.107043 | 0.008979 | 0.252839 | 0.014414 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
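
The cleanup loop above rewrites each downloaded file in place. As a hedged alternative, the sketch below performs the same three steps (find the header row, join multi-word names with underscores, reshape to one row per sample) entirely in memory, leaving the original files untouched. The helper names `clean_lines` and `to_wide_row`, the sample text, and its tab-separated three-column layout are assumptions made for illustration and may not match the real GMRepo files.

```python
import io
import re

import pandas as pd

# Hypothetical stand-ins for two downloaded files (made-up values)
RAW_TEXT = (
    "Run ID: EXAMPLE\n"
    "Some filler text\n"
    "ncbi_taxon_id\trelative_abundance\tscientific_name\n"
    "816\t18.5\tBacteroides\n"
    "1485\t4.9\tClostridium\n"
    "-1\t1.2\tUnknown genus sp. X\n"
)
OTHER_TEXT = (
    "ncbi_taxon_id\trelative_abundance\tscientific_name\n"
    "816\t25.0\tBacteroides\n"
    "239934\t0.5\tAkkermansia\n"
)


def clean_lines(text):
    """Drop filler lines before the header and join multi-word entries."""
    lines = text.splitlines(keepends=True)
    header_idx = next(
        (i for i, line in enumerate(lines) if "ncbi_taxon_id" in line), None
    )
    if header_idx is None:
        raise ValueError("no header row containing 'ncbi_taxon_id' found")
    # Replace a space that follows a lowercase letter or '.' with '_',
    # the same rule as the character loop in the cell above
    return "".join(
        re.sub(r"(?<=[a-z.]) ", "_", line) for line in lines[header_idx:]
    )


def to_wide_row(table_text, label):
    """Turn one cleaned long-format table into a single labeled wide row."""
    data = pd.read_csv(io.StringIO(table_text), sep=r"\s+")
    totals = data.groupby("scientific_name")["relative_abundance"].sum()
    row = totals.to_frame().T.reset_index(drop=True)
    row["Diagnosis"] = label
    return row


# Collect rows in a list and concatenate once at the end
rows = [
    to_wide_row(clean_lines(RAW_TEXT), "ASD"),
    to_wide_row(clean_lines(OTHER_TEXT), "AD"),
]
df_demo = pd.concat(rows, ignore_index=True).fillna(0)
print(df_demo)
```

Taxa missing from a sample become `NaN` during concatenation and are then filled with 0, which is how the zero-heavy columns in the output above arise. Concatenating once from a list also avoids copying the growing DataFrame on every iteration.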


Finally, let's export the data as a CSV!

In [3]:
# Export to CSV
df.to_csv("../Data/20440_cleaned_data_large.csv", index=False)
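
After exporting, it can be worth reading the file back to confirm that the shape survived the round trip, that no missing values remain, and that each diagnosis label has the expected number of rows. The sketch below does this on a hypothetical two-row miniature of the table written to a temporary file; the values and the file name `demo_cleaned.csv` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the wide table (made-up values)
demo = pd.DataFrame({
    "Bacteroides": [18.5, 25.0],
    "Akkermansia": [0.0, 0.5],
    "Diagnosis": ["ASD", "AD"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv")
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Basic round-trip checks
same_shape = reloaded.shape == demo.shape
no_missing = not reloaded.isna().any().any()
counts = reloaded["Diagnosis"].value_counts().to_dict()

print(same_shape, no_missing, counts)
```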