## Data preprocessing

This file contains the code that preprocesses the raw data, downloaded from the GDC portal, into one large data frame, which is then saved as a large '.csv' file of about 188 MB. Follow the instructions in the './Dataset/Settings/README.md' file to download the dataset. The dataset is also available in the repository itself: due to a bug in the GDC command-line tool, downloading became a long and tedious process, so we decided to upload the dataset to the GitHub repository.

The dataset contains two types of data. The first is the clinical data, stored as '.xml' files; these files contain important information such as disease_code, histological_type, vital_status and more. The second is the RNA sequencing data of the case, stored in a file with a '.tsv' extension; this file uses GENCODE v36 and has 60665 genes in total.

The added 'Dataset/Settings/gdc_manifest.2025-05-22.214548.txt' file is used to merge all processed files from each case into their own folder. Note that some cases have multiple RNA files (for example, case '0a45f302-5748-48f3-9dc9-66c01843a68e').
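
Cases with more than one RNA file can be spotted before any parsing takes place. The block below is a minimal sketch, not part of the original pipeline: the helper name `findCasesWithMultipleRnaFiles` and the toy entries are hypothetical, and it assumes metadata entries shaped like the ones read later in this notebook ('data_format' and 'associated_entities' with a 'case_id').

```python
from collections import Counter


def findCasesWithMultipleRnaFiles(metaData: list[dict]) -> dict[str, int]:
    # Counting the TSV (RNA sequencing) files per case
    counts = Counter(
        entry['associated_entities'][0]['case_id']
        for entry in metaData
        if entry['data_format'] == 'TSV'
    )
    # Keeping only the cases that have more than one RNA file
    return {caseId: count for caseId, count in counts.items() if count > 1}


# Toy metadata with made-up ids, for illustration only
exampleMetaData = [
    {'file_name': 'a.tsv', 'data_format': 'TSV', 'associated_entities': [{'case_id': 'case-1'}]},
    {'file_name': 'b.tsv', 'data_format': 'TSV', 'associated_entities': [{'case_id': 'case-1'}]},
    {'file_name': 'c.xml', 'data_format': 'BCR XML', 'associated_entities': [{'case_id': 'case-1'}]},
    {'file_name': 'd.tsv', 'data_format': 'TSV', 'associated_entities': [{'case_id': 'case-2'}]},
]

duplicates = findCasesWithMultipleRnaFiles(exampleMetaData)
print(duplicates)
```

Any case reported here contributes more than one gene row, so it will also appear more than once after the final merge on 'case_id'.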

#### 1. Defining global information

Information such as paths, the gene-model type and the minimal clinical columns is stored in variables for use throughout the code.
Type annotations are used for better clarity.

In [None]:
import pandas as pd

# NOTE: Make sure you run the project from the DatesetParser folder

# Path to the dataset folder
basePath: str = './Dataset/' 
# Path to settings folder
settingsPath: str = basePath + 'Settings/'
# Path to GDC data files
inputPath: str = basePath + 'OriginalFiles/'
# Path to the parsed files
outputPath: str = basePath + 'ProcessedFiles/'
# Path to the metadata file
metadataPath: str = settingsPath + 'metadata.cart.2025-05-22.json'

# Gene-model:
geneModel: str = "# gene-model: GENCODE v36\n"

# The clinical data must include these columns
columnsToKeep: set[str] = {'histological_type', 'icd_o_3_histology'}

#### 2. Helper functions

Simple generalized functions used by the main code, such as reading a JSON file into a dictionary or changing the file extension of a given path.
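
As a point of comparison, the extension swap can also be sketched with the standard library's `pathlib`. This is an alternative, not the notebook's own helper: the name `updateFileExtensionWithPathlib` is hypothetical. One difference is that a path without any extension keeps its name, whereas a split-on-dot approach would return only the new extension.

```python
from pathlib import Path


def updateFileExtensionWithPathlib(path: str, newExtension: str) -> str:
    # with_suffix replaces only the last suffix (or adds one if none exists)
    return str(Path(path).with_suffix('.' + newExtension))


print(updateFileExtensionWithPathlib('file.txt', 'csv'))
print(updateFileExtensionWithPathlib('README', 'csv'))
```

Note that converting through `Path` normalizes the string (for example, a leading './' is dropped), so the result may not be character-for-character identical to the input prefix.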

In [None]:
# Swaps the file extension of a given path with a new extension
# Example: updateFileExtension('file.txt', 'csv') -> 'file.csv'
def updateFileExtension(path: str, newExtension: str) -> str:
    # Splitting the path into segments
    segments: list[str] = str.split(path, '.')
    # Removing the last segment (the current extension)
    segments: list[str] = segments[0:-1]
    # Adding the new extension
    segments.append(newExtension)
    # Joining the remaining segments back together
    return '.'.join(segments)

# Reading a given json file and parsing it into a dictionary
def readJsonFile(path: str) -> dict:
    import json
    with open(path) as f:
        dictionary: dict = json.load(f)
        return dictionary

#### 3. Processing gene data

The gene data is processed in multiple steps. 
First the header of the file is verified to make sure it uses the right gene-model version. Then the data is loaded into a dataframe and filtered to the rows whose 'gene_type' is 'lncRNA'. After this the data is flattened into a single row, with one column per gene for each of the 'unstranded' and 'tpm_unstranded' values.
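
To make the flattening step concrete, here is a small sketch on a toy table. The gene ids, names and values are invented for illustration; only the column names ('gene_name', 'gene_type', 'unstranded', 'tpm_unstranded') follow the ones used in this notebook.

```python
import io

import pandas as pd

# Toy file contents: a gene-model line, a column header and three genes
exampleTsv = (
    "# gene-model: GENCODE v36\n"
    "gene_id\tgene_name\tgene_type\tunstranded\ttpm_unstranded\n"
    "ENSG_A\tGENE_A\tlncRNA\t10\t1.5\n"
    "ENSG_B\tGENE_B\tprotein_coding\t20\t2.5\n"
    "ENSG_C\tGENE_C\tlncRNA\t30\t3.5\n"
)

# Skipping the gene-model line and keeping only the lncRNA rows
genes = pd.read_csv(io.StringIO(exampleTsv), delimiter="\t", skiprows=1)
lnc = genes[genes['gene_type'] == 'lncRNA']

# One column per gene and measurement, all collected in a single row
unstranded = dict(zip(lnc['gene_name'] + '_unstranded', lnc['unstranded']))
tpm = dict(zip(lnc['gene_name'] + '_tpm_unstranded', lnc['tpm_unstranded']))
flat = pd.DataFrame([{**unstranded, **tpm}])

print(flat.columns.tolist())
```

The two lncRNA genes become four columns in a single row, and the protein-coding gene is dropped. If two rows shared the same 'gene_name', the later one would overwrite the earlier one in the dictionary.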

In [None]:
def processGeneData(originalPath: str) -> tuple[pd.DataFrame | None, str]:
    with open(originalPath) as file:
        header: str = file.readline()

        # Verifying if the first line has the correct gene model
        if header != geneModel:
            # Unexpected gene model: nothing is parsed, the first line is returned for inspection
            return None, header

        # Updating the header variable to the column header line
        header = file.readline()

    # Reading the file into a pandas dataframe (skipping the gene-model line)
    dataframe: pd.DataFrame = pd.read_csv(originalPath, delimiter="\t", skiprows=1)

    # Filtering the dataframe to only include lncRNA (copy, so new columns can be added safely)
    dataframe = dataframe[dataframe['gene_type'] == 'lncRNA'].copy()

    # Creating the column names for unstranded and tpm values
    dataframe['unstranded_col'] = dataframe['gene_name'] + '_unstranded'
    dataframe['tpm_col'] = dataframe['gene_name'] + '_tpm_unstranded'

    # Build a dictionary with new column names and their values
    unstranded_vals: dict = dict(zip(dataframe['unstranded_col'], dataframe['unstranded']))
    tpm_vals: dict = dict(zip(dataframe['tpm_col'], dataframe['tpm_unstranded']))

    # Merge the two into one row using a dictionary
    combined: dict = {**unstranded_vals, **tpm_vals}

    # Create a new single-row DataFrame from the combined dict
    result_df: pd.DataFrame = pd.DataFrame([combined])

    return result_df, header

#### 4. Processing clinical data

The clinical data is processed in multiple steps. First the xml file is loaded into a dataframe. Then the data is flattened, because the nested xml data is parsed into two rows instead of one. Lastly the data is checked for the required columns.
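
The two-row effect can be reproduced on a toy document. This is a sketch under assumptions: the xml below is invented and far simpler than a real clinical file, `parser='etree'` is chosen only to avoid the optional lxml dependency, and `bfill` is used as a compact stand-in for copying missing first-row values from the second row.

```python
import io

import pandas as pd

# Toy xml: two child elements under the root, each with its own fields
exampleXml = """<root>
  <admin>
    <disease_code>EXAMPLE</disease_code>
  </admin>
  <patient>
    <histological_type>Example type</histological_type>
    <vital_status>Alive</vital_status>
  </patient>
</root>"""

# Each child of the root becomes one row, so this yields two rows
raw = pd.read_xml(io.StringIO(exampleXml), parser='etree')

# Filling the gaps in the first row from the second row, then keeping only the first row
flat = raw.bfill().iloc[[0]]

# Checking for required columns (a hypothetical set for this toy example)
required = {'histological_type', 'vital_status'}
hasRequired = required.issubset(flat.columns)
print(flat)
```

Each row of `raw` holds only the fields of its own element, with missing values elsewhere, which is why the rows have to be combined before use.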

In [None]:
def processClinicalData(originalPath: str) -> pd.DataFrame | None:
    # Loading the clinical data from the original file
    dataFrame: pd.DataFrame = pd.read_xml(originalPath)

    # Flattening the XML data
    firstRow = dataFrame.loc[0]
    for item in dataFrame.columns:
        # Checking if the first row has no value for this column (None, NaN or empty string)
        if pd.isna(firstRow[item]) or firstRow[item] == '':
            dataFrame.loc[0, item] = dataFrame.loc[1, item]  # Copying the value from the second row

    # Checking if the columns exist in the dataframe
    if not columnsToKeep.issubset(dataFrame.columns):
        print("Some columns are missing in the clinical data. Please check the file.")
        return None
    
    # The xml file is parsed into two rows; the second row has been copied into the first, so it is dropped
    dataFrame.drop(index=1, inplace=True)
  
    return dataFrame

#### 5. Merging data (Using Metadata)

The function starts by defining two dataframes, one for each type of data. Then the metadata is loaded, and every element in the metadata file is evaluated and processed. First the 'case_id' is retrieved, then the extra information needed to form the input and output paths is loaded. A simple log line is printed to the screen to indicate progress. Next the file is checked for existence. After that the file format is checked, so that each type is parsed by its respective function. The result is verified to be neither None nor empty. It is then concatenated to its respective dataframe and optionally saved to the output folder of its case. Lastly the function merges all the data into one large dataframe.
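
The behaviour of the final merge can be checked on toy frames. The sketch below uses invented case ids and values; it assumes an inner merge on 'case_id', as used at the end of the function.

```python
import pandas as pd

# Toy clinical data: one row per case
clinical = pd.DataFrame({
    'case_id': ['case-1', 'case-2'],
    'histological_type': ['type A', 'type B'],
})

# Toy gene data: case-1 has two RNA files, case-3 has no clinical data
gene = pd.DataFrame({
    'case_id': ['case-1', 'case-1', 'case-3'],
    'GENE_A_unstranded': [10, 12, 7],
})

# An inner merge keeps only the cases that are present in both frames
merged = pd.merge(clinical, gene, on='case_id', how='inner')
print(merged)
```

Here 'case-1' appears twice in the result (once per RNA file), while 'case-2' and 'case-3' are dropped because they are missing from one side.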

In [None]:
def mergeCaseData(metadataPath: str, inputPath: str, outputPath: str, storeSubfiles: bool = True) -> pd.DataFrame:
    import os

    # Initializing the two main data frames with the case_id column
    dataFrameColumns: list[str] = ['case_id']
    clinicalDataFrame: pd.DataFrame = pd.DataFrame(columns=dataFrameColumns)
    geneDataFrame: pd.DataFrame = pd.DataFrame(columns=dataFrameColumns)

    # Initializing the file count
    fileCount: int = 0

    # Reading the metadata file, which contains information about merging case files
    metaData: dict = readJsonFile(metadataPath)
    for file in metaData:
        # Retrieving the case id of the file
        caseId = file['associated_entities'][0]['case_id']

        # Retrieving the file name and folder name
        fileName: str = file['file_name']
        folderName: str = file['file_id']
        fileFormat: str = file['data_format']

        # Increasing the file count and printing an update to indicate progress
        fileCount += 1
        print(f"Processing file {fileCount}/{len(metaData)}: {fileName} for case {caseId}")

        # Creating the storage and output paths
        dataFile = inputPath + folderName + '/' + fileName
        outputFile = updateFileExtension(outputPath + caseId + '/' + fileName, "csv")
  
        # Checking if the file exists (Data set may be corrupted or incomplete)
        if not os.path.isfile(dataFile):
            print("File not found: " + dataFile)
            continue

        # Handling the different file types
        dataFrame = None
        if fileFormat == 'TSV':
            (dataFrame, _) = processGeneData(dataFile) 
        else:  
            dataFrame = processClinicalData(dataFile)

        # Adding the data to the main dataframe and storing the file
        if dataFrame is not None and not dataFrame.empty:
            dataFrame['case_id'] = caseId
         
            # Adding the loaded data frame to the main data frame
            if fileFormat == 'TSV':
                geneDataFrame = pd.concat([geneDataFrame, dataFrame])
            else:  
                clinicalDataFrame = pd.concat([clinicalDataFrame, dataFrame])
             
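            # Only storing the subfiles if the flag is set
            if storeSubfiles:
                # Making sure the output folder of the case exists before writing
                os.makedirs(os.path.dirname(outputFile), exist_ok=True)
                dataFrame.to_csv(outputFile, index=False, header=True)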

    # Merging the clinical and gene data frames on the case_id column
    return pd.merge(clinicalDataFrame, geneDataFrame, on="case_id", how="inner")      

#### 6. Run processing

The main processing function is called, after which the result is printed and stored in the 'merged_data.csv' output file.

In [None]:
# Running the main processing function
mainDataFrame: pd.DataFrame = mergeCaseData(metadataPath, inputPath, outputPath, True)

# Displaying the main data frame for debugging purposes
print(mainDataFrame)

# Storing the main data frame to a file
mainDataFrame.to_csv(outputPath + 'merged_data.csv', index=False, header=True)