# Generating a centralized count dataset

One of the initial challenges researchers face when analyzing RNA-seq data is **the consolidation of count data from multiple samples into a single comprehensive dataset**. This centralized dataset forms the foundation for downstream analyses, including the application of dimensionality reduction techniques, differential expression analysis, and pathway enrichment, which are critical for uncovering biological insights.

However, aggregating count data has pitfalls, chief among them inconsistent gene identifiers (Geneid) across samples. If gene order or identifiers differ between files, counts end up assigned to the wrong genes and downstream results become erroneous, which makes the initial data preparation phase both crucial and error-prone.

This notebook presents an approach to centralizing RNA-seq count data that addresses these challenges. The pivotal step is verifying the **consistency of the Geneid column across all samples**. Once every file is known to list the same genes in the same order, each sample's counts can be appended as a new column to a single file on disk, with no key-based merge. Traditional merging techniques often require loading many datasets into memory at once, which significantly increases memory usage; this approach keeps only the accumulated table and one new sample column in memory at a time. Automating the process minimizes manual intervention and the likelihood of errors, and establishes a solid foundation for subsequent analyses with lower memory consumption.

# 1 - Import libraries

In [2]:
import pandas as pd
import re
import os
from tqdm import tqdm
from functools import reduce

# 2 - Setting Directories and Listing Files

This section sets up the working environment by defining the directory containing the **RNA-seq count data files (working_dir)** and dynamically obtaining the current working directory (home_dir). It then lists all .txt files within the working_dir, ensuring that only relevant files are considered for processing.

In [24]:
# Set the working directory and home directory
working_dir = 'PD_RNAseq_CountData'
home_dir = os.getcwd()

# List all .txt files in the specified directory
filenames_lst = [file for file in os.listdir(os.path.join(home_dir, working_dir)) if file.endswith('.txt')]

# 3 - Determining whether the order of Geneid is consistent across all the files

The first major analytical step is verifying that the gene identifiers (Geneid) appear in the same order in every file. This is a crucial prerequisite for accurate data consolidation, because the later steps append columns by position: any inconsistency in gene order would misalign counts and invalidate the analyses.

In [17]:
# Function to read the 'Geneid' column from a file
def read_geneid_column(file_path):
    return pd.read_csv(file_path, delimiter='\t', comment='#', usecols=['Geneid'])

# Reference 'Geneid' column, set from the first file that is read
reference_geneid = None
# Flag to indicate if all .txt files have the same 'Geneid' order
same_order = True

# Iterate over the filenames with a progress bar
for file in tqdm(filenames_lst, desc='Checking Geneid order'):
    if file.endswith('.txt'):  # Process only .txt files
        file_path = os.path.join(home_dir, working_dir, file)
        current_geneid = read_geneid_column(file_path)
        
        if reference_geneid is None:
            reference_geneid = current_geneid  # Set the reference 'Geneid' from the first .txt file
        else:
            if not current_geneid.equals(reference_geneid):
                same_order = False
                break  # Exit the loop early if the order doesn't match

# Print the result

if same_order:
    print("All .txt files have the same 'Geneid' order.")
else:
    print("Not all .txt files have the same 'Geneid' order.")

Checking Geneid order: 100%|██████████| 1049/1049 [21:37<00:00,  1.24s/it]

All .txt files have the same 'Geneid' order.
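The check above relies on `DataFrame.equals`, which compares values position by position. The following minimal sketch, using toy data rather than the real count files, illustrates why that matters: a reordered Geneid column fails `equals` even though it contains exactly the same genes. The frames and variable names here are made up for illustration.

```python
import pandas as pd

# Toy Geneid columns (hypothetical data, not read from the real count files)
reference = pd.DataFrame(
    {"Geneid": ["ENSG00000223972.5", "ENSG00000227232.5", "ENSG00000278267.1"]}
)
same_order_df = reference.copy()
shuffled_df = reference.iloc[[1, 0, 2]].reset_index(drop=True)

# equals() compares values position by position, so a reordering is detected
same_result = same_order_df.equals(reference)
shuffled_result = shuffled_df.equals(reference)

# A set comparison ignores order and only confirms that the same genes are present
same_genes = set(shuffled_df["Geneid"]) == set(reference["Geneid"])

print(same_result, shuffled_result, same_genes)  # True False True
```

A set-based comparison alone would therefore not be enough to justify appending columns by position.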





# 4 - Defining functions 

To ensure the efficient handling and centralization of RNA-seq count data, this notebook employs two key functions: **read_and_process_file** and **concatenate_columns**. These functions are meticulously designed to address specific challenges in RNA-seq data analysis, emphasizing memory efficiency and data integrity.

## read_and_process_file
### Purpose and Design:
This function is pivotal for preprocessing individual RNA-seq count files. Its primary role is to:

- **Read each file:** Leveraging pandas to handle tab-delimited count data, ensuring compatibility with common RNA-seq output formats.
- **Pattern matching:** Utilize regular expressions to identify and rename the last column based on a specified pattern (e.g., matching specific sample identifiers). This is crucial for maintaining consistency and clarity in the dataset, especially when dealing with multiple samples or experimental conditions.
- **Selective column inclusion:** Conditionally include the Geneid column based on the function call. This flexibility is vital for the initial file processing (to retain Geneid) and subsequent files (to exclude Geneid and prevent duplication).

### Rationale:
This function encapsulates the preprocessing logic, making the script adaptable to varying file formats and naming conventions. By abstracting this logic, we ensure that each file is processed consistently, laying a solid foundation for accurate data consolidation.
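The renaming step can be illustrated in isolation. The sketch below assumes a featureCounts-like layout in which the last column is named after an alignment file; the `Length` column, the BAM path, and the count values are hypothetical and only serve to show how the sample ID is extracted.

```python
import re
import pandas as pd

pattern = r"5104-SL-\d{4}"  # sample-ID pattern used later in this notebook

# Hypothetical count table; the last column name is a made-up BAM path
df = pd.DataFrame({
    "Geneid": ["ENSG00000223972.5", "ENSG00000227232.5"],
    "Length": [1735, 1351],
    "aligned/5104-SL-3174.sorted.bam": [24, 55],
})

last_col = df.columns[-1]
match = re.search(pattern, last_col)
# Same fallback as the notebook: use "Counts" when the pattern does not match
new_name = match.group(0) if match else "Counts"

renamed = df[["Geneid", last_col]].rename(columns={last_col: new_name})
print(renamed.columns.tolist())  # ['Geneid', '5104-SL-3174']
```

Note that the "Counts" fallback gives every non-matching file the same column name, so the pattern should be checked against the actual column names before a long run.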

## concatenate_columns

### Purpose and Design:
The concatenate_columns function orchestrates the centralization process by:

- **Initiating with a base file:** It starts by processing the first file completely, including the Geneid column, to establish the baseline dataset.
- **Progressive concatenation:** For each additional file, it processes and appends the data column-wise to the existing dataset. This builds up the centralized table without resorting to memory-intensive merge operations.
- **Efficient disk operations:** After each addition, the function writes the updated table back to disk, so only the accumulated table and one new column are held in memory at a time, rather than every input file at once. The trade-off is that the output file is re-read and rewritten for every sample.

### Rationale:
Because Geneid consistency has already been verified, columns can be appended by position, which circumvents the need for a key-based merge. This keeps memory consumption low and lets the process scale to more samples, at the cost of repeated disk I/O.
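For comparison, here is a sketch of a different technique from the one used in this notebook: a single-pass variant that reads only the count column of each file and concatenates once at the end. It assumes the Geneid order has already been verified and that one column per sample fits in memory. The function name `collect_counts`, the toy files, and the file-name fallback are assumptions made for illustration.

```python
import os
import re
import tempfile
import pandas as pd

def collect_counts(file_paths, pattern):
    """Read Geneid once, then only the last column of each file, and concatenate in one step."""
    geneid = pd.read_csv(file_paths[0], sep="\t", comment="#", usecols=["Geneid"])["Geneid"]
    columns = {}
    for path in file_paths:
        # Read just the header to find the name of the last (count) column
        header = pd.read_csv(path, sep="\t", comment="#", nrows=0).columns
        last_col = header[-1]
        match = re.search(pattern, last_col)
        # Assumption: fall back to the file name so unmatched files keep distinct names
        name = match.group(0) if match else os.path.basename(path)
        counts = pd.read_csv(path, sep="\t", comment="#", usecols=[last_col])[last_col]
        columns[name] = counts.to_numpy()
    return pd.concat([geneid, pd.DataFrame(columns)], axis=1)

# Toy input files with made-up counts, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for sample, values in [("5104-SL-3174", [24, 55]), ("5104-SL-3180", [15, 50])]:
        path = os.path.join(tmp, f"{sample}.txt")
        toy = pd.DataFrame({
            "Geneid": ["ENSG00000223972.5", "ENSG00000227232.5"],
            f"{sample}.bam": values,
        })
        toy.to_csv(path, sep="\t", index=False)
        paths.append(path)
    combined = collect_counts(paths, r"5104-SL-\d{4}")

print(combined)
```

This variant reads each input file a fixed number of times and writes nothing until the end, whereas the notebook's function re-reads and rewrites the growing output for every sample. Peak memory is roughly the size of the final table in both cases.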


In [36]:
# Helper functions for preprocessing and column-wise concatenation
def read_and_process_file(file_path, pattern, include_geneid=False):
    # Read the tab-delimited count file, skipping '#' comment lines
    df = pd.read_csv(file_path, delimiter='\t', comment='#')
    # The counts are in the last column; rename it to the sample ID matched by `pattern`
    last_col_name = df.columns[-1]
    match = re.search(pattern, last_col_name)
    # Fall back to "Counts" if the pattern does not match
    new_col_name = match.group(0) if match else "Counts"
    if include_geneid:
        return df.loc[:, ['Geneid', last_col_name]].rename(columns={last_col_name: new_col_name})
    else:
        return df.loc[:, [last_col_name]].rename(columns={last_col_name: new_col_name})

def concatenate_columns(base_file_path, additional_files_paths, output_file_path, pattern):
    # Process the first file and write its content (including Geneid)
    base_df = read_and_process_file(base_file_path, pattern, include_geneid=True)
    base_df.to_csv(output_file_path, sep='\t', index=False, mode='w')
    
    # Process each of the additional files with a progress bar
    for file_path in tqdm(additional_files_paths, desc='Processing files'):
        processed_df = read_and_process_file(file_path, pattern, include_geneid=False)
        # Since we're only adding data columns, extract the column as a series to avoid alignment issues
        data_series = processed_df.iloc[:, 0]
        
        # Read the current output and combine column-wise
        current_df = pd.read_csv(output_file_path, sep='\t')
        current_df[data_series.name] = data_series.values
        
        # Write back to the disk
        current_df.to_csv(output_file_path, sep='\t', index=False, mode='w')

# 5 - Writing data on disk

In [38]:
# Regex that extracts the sample ID from the count column name
pattern = r'5104-SL-\d{4}'
home_dir = os.getcwd()  # Expected to be /home/jovyan/work in this environment
processed_dir = 'Processed data'  # Directory name for processed data
working_dir = 'PD_RNAseq_CountData'

In [40]:
# Ensure the output directory exists
os.makedirs(os.path.join(home_dir, processed_dir), exist_ok=True)

In [41]:
# Build the sorted list of input files and the input/output paths
filenames_lst = sorted([file for file in os.listdir(os.path.join(home_dir, working_dir)) if file.endswith('.txt')])

base_file_path = os.path.join(home_dir, working_dir, filenames_lst[0])
additional_files_paths = [os.path.join(home_dir, working_dir, fname) for fname in filenames_lst[1:]]
output_file_path = os.path.join(home_dir, processed_dir, 'final_output.tsv')

# Concatenate columns with preprocessing
concatenate_columns(base_file_path, additional_files_paths, output_file_path, pattern)

Processing files: 100%|██████████| 1046/1046 [6:15:20<00:00, 21.53s/it] 


# 6 - Minimal exploratory data analysis

In [56]:
# Get the file size in bytes
file_size_bytes = os.path.getsize(output_file_path)

def convert_size(size_bytes):
    # Divide by 1024 until the value fits the current unit; YB is the final fallback
    for unit in ['bytes', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB']:
        if size_bytes < 1024:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.2f} YB"

# Use the convert_size function to print the file size in a more readable format
print(f"The size of the file is {convert_size(file_size_bytes)}.")

The size of the file is 157.66 MB.


In [43]:
centralized_df = pd.read_csv(output_file_path, sep = '\t')

In [54]:
centralized_df.head()

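|  | Geneid | 5104-SL-3174 | 5104-SL-3180 | 5104-SL-3171 | 5104-SL-3175 | 5104-SL-4006 | 5104-SL-3990 | 5104-SL-3902 | 5104-SL-3885 | 5104-SL-3854 | ... | 5104-SL-4679 | 5104-SL-3953 | 5104-SL-4761 | 5104-SL-4749 | 5104-SL-4580 | 5104-SL-4592 | 5104-SL-4611 | 5104-SL-2827 | 5104-SL-2875 | 5104-SL-2868 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000223972.5 | 24 | 15 | 15 | 20 | 18 | 11 | 9 | 6 | 7 | ... | 6 | 8 | 4 | 7 | 1 | 3 | 20 | 6 | 17 | 4 |
| 1 | ENSG00000227232.5 | 55 | 50 | 80 | 66 | 198 | 154 | 84 | 48 | 39 | ... | 63 | 63 | 48 | 85 | 147 | 176 | 163 | 130 | 75 | 101 |
| 2 | ENSG00000278267.1 | 10 | 3 | 13 | 16 | 11 | 4 | 13 | 3 | 1 | ... | 7 | 16 | 8 | 11 | 16 | 54 | 68 | 17 | 29 | 20 |
| 3 | ENSG00000243485.5 | 7 | 5 | 8 | 4 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | ENSG00000284332.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |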


In [53]:
rows, columns = centralized_df.shape
# One column is Geneid, so the number of samples is columns - 1
print(f"The DataFrame has {rows} genes and {columns - 1} samples.")

The DataFrame has 58780 genes and 1047 samples.


In [48]:
# Count missing values in each column of the DataFrame
missing_values_per_column = centralized_df.isnull().sum()

# Print the number of missing values for each column
print(missing_values_per_column)

Geneid          0
5104-SL-3174    0
5104-SL-3180    0
5104-SL-3171    0
5104-SL-3175    0
               ..
5104-SL-4592    0
5104-SL-4611    0
5104-SL-2827    0
5104-SL-2875    0
5104-SL-2868    0
Length: 1048, dtype: int64


In [50]:
total_missing_values = missing_values_per_column.sum()
print(f"Total number of missing values in the centralized dataframe: {total_missing_values}")

Total number of missing values in the centralized dataframe: 0
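Beyond missing values, a few structural checks can catch problems specific to this pipeline, such as a leftover "Counts" column (which would indicate that the regex fallback was triggered for at least one file). The sketch below runs on a toy stand-in for `centralized_df` with made-up values; the variable names are illustrative.

```python
import re
import pandas as pd

pattern = r"5104-SL-\d{4}"  # sample-ID pattern used in this notebook

# Toy stand-in for centralized_df (hypothetical values)
toy_df = pd.DataFrame({
    "Geneid": ["ENSG00000223972.5", "ENSG00000227232.5"],
    "5104-SL-3174": [24, 55],
    "5104-SL-3180": [15, 50],
})

sample_cols = toy_df.columns.drop("Geneid")

# Column names that are not sample IDs, e.g. a "Counts" column from the fallback
unmatched = [col for col in sample_cols if re.fullmatch(pattern, col) is None]

# Each gene identifier should appear exactly once
duplicate_genes = int(toy_df["Geneid"].duplicated().sum())

# Raw counts should be non-negative integers
all_integer = all(pd.api.types.is_integer_dtype(toy_df[col]) for col in sample_cols)
any_negative = bool((toy_df[sample_cols] < 0).any().any())

print(unmatched, duplicate_genes, all_integer, any_negative)
```

The same checks should apply to the real table by substituting `centralized_df` for `toy_df`.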
