# DATA NORMALIZATION

## Theory

Several factors affect transcript quantification in RNA-seq data, including sequencing depth, transcript length, and sample-to-sample and batch-to-batch variability. Sequencing technologies also introduce technical variability, so raw transcriptomic data must be adjusted to account for these factors. Normalization methods exist to minimize their influence and make transcriptomic data reliable.

It is essential to choose the correct RNA-seq normalization method for the dataset. There are three main RNA-seq normalization stages to consider:
 1. **Within sample**: Within sample normalization is required to compare the expression of genes within an individual sample. It can adjust data for two primary technical variables: transcript length and sequencing depth. Longer genes often have more mapped reads than shorter genes at the same expression level. Therefore, their expression levels can only be accurately compared within a sample after normalization. Furthermore, the number of sequencing reads per sample may vary. This can also be corrected by within sample normalization. Within sample normalization is not sufficient to compare gene expression between samples. For this, between sample RNA-seq normalization methods are required. The most common within sample normalization techniques are:
    - **CPM**: Counts per million (CPM) mapped reads are the number of raw reads mapped to a transcript, scaled by the number of sequencing reads in your sample, multiplied by a million. It normalizes RNA-seq data for sequencing depth but not gene length. Therefore, although it is a within sample normalization approach, CPM normalization is unsuitable for within sample comparisons of gene expression. Between sample comparisons can be made when CPM is used alongside ‘within a dataset’ normalization methods.
    - **FPKM/RPKM**: FPKM (fragments per kilobase of transcript per million fragments mapped) for paired-end data and RPKM (reads per kilobase of transcript per million reads mapped) for single-end data correct for variations in library size and gene length. One issue with FPKM/RPKM units is that the expression of a gene in one sample will appear different from its expression in another sample, even when its true expression level is the same. This is because it depends on the relative abundance of a transcript among a population of sequenced transcripts. FPKM/RPKM units are best used to compare gene expression within a single sample.
    - **TPM**: Transcripts per million (TPM) represents the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts. It is calculated by dividing the number of reads mapped to a transcript by the transcript length. This value is then divided by the sum of mapped reads to all transcripts after normalization for transcript length. It is then multiplied by one million to make further analyses easier. It normalizes RNA-seq data for sequencing depth and transcript length. TPM and FPKM/RPKM are closely related. However, in contrast to FPKM/RPKM, there is limited variation in values between samples because the sum of all TPMs in each sample is the same. TPM can be used for within sample comparisons but requires ‘within a dataset’ normalization for between sample comparisons.

 2. **Within a dataset (between samples)**: Samples within a dataset can be simultaneously normalized as a complete set to adjust for different technical variations such as sequencing depth. RNA-seq is a relative, not an absolute, measure of transcript abundance. This means that the transcript population as a whole affects relative levels of transcripts. This creates biases for gene expression analyses, and these are minimized by between sample RNA-seq normalization methods. The most common within a dataset normalization techniques are:
    - **Quantile**: The quantile method aims to make the distribution of gene expression levels the same for each sample in a dataset. It assumes that the global differences in distributions between samples are all due to technical variation. Any remaining differences are likely actual biological effects. For each sample, genes are ranked based on their expression level. An average value is calculated across all samples for genes of the same rank. This average value then replaces the original value of all genes in that rank. These genes are then placed in their original order.
    - **TMM**: TMM (trimmed mean of M-values) also assumes that most genes are not differentially expressed between samples. If many genes are uniquely or highly expressed in one experimental condition, it will affect the accurate quantification of the remaining genes. To adjust for this possibility, TMM calculates scaling factors to adjust library sizes for the normalization of samples within a dataset. To do this, one sample is chosen as a reference sample. The fold changes and absolute expression levels of other samples within the dataset are then calculated relative to the reference sample. Next, the genes in the data set are ‘trimmed’ to remove differentially expressed genes using these two values. The trimmed mean of the fold changes is then found for each sample. Finally, read counts are scaled by this trimmed mean and the total count of their sample.

 3. **Across datasets**: Researchers often integrate RNA-seq data from multiple independent studies. These datasets are usually sequenced at different times, with varying methods across multiple facilities, and contain other experimental variables. This results in a batch effect. The batch effect is often responsible for the greatest source of differential expression when data is combined. It can mask any true biological differences and lead to incorrect conclusions. RNA-seq normalization across datasets can correct for known variables across batches, such as the sequencing center and date of sequencing, as well as unknown variables.
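
The within sample measures described in stage 1 (CPM, FPKM/RPKM and TPM) can be sketched with NumPy as follows. This is a minimal illustration, assuming a samples x genes count matrix; the function names and the toy counts and lengths are hypothetical and are not part of this notebook's pipeline.

```python
import numpy as np

def cpm(counts):
    """Counts per million: scale each sample (row) by its total count."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True) * 1e6

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: CPM divided by the transcript length in kb."""
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    return cpm(counts) / lengths_kb

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale each sample to 1e6."""
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    rate = np.asarray(counts, dtype=float) / lengths_kb
    return rate / rate.sum(axis=1, keepdims=True) * 1e6

# Hypothetical toy data: 2 samples x 3 genes, gene lengths in base pairs
toy_counts = np.array([[10, 20, 70], [30, 30, 40]])
toy_lengths = np.array([1000, 2000, 500])

toy_cpm = cpm(toy_counts)
toy_rpkm = rpkm(toy_counts, toy_lengths)
toy_tpm = tpm(toy_counts, toy_lengths)
```

In this sketch every row of `toy_tpm` sums to one million, whereas the rows of `toy_rpkm` generally do not, which is the difference between the two units noted above.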

Source: https://bigomics.ch/blog/why-how-normalize-rna-seq-data/
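
The rank-averaging steps described for the quantile method can be sketched in a few lines. This is a minimal illustration, assuming a samples x genes layout (one row per sample); the function name and the toy values are hypothetical, and ties are not averaged in this simplified version.

```python
import numpy as np
import pandas as pd

def quantile_normalize_samples(df):
    """Quantile-normalize a samples x genes frame so every row shares one distribution."""
    values = df.to_numpy(dtype=float)
    # Sort each sample's values, then average across samples at each rank
    rank_means = np.sort(values, axis=1).mean(axis=0)
    # Rank of each gene within its sample (0 = lowest value)
    ranks = values.argsort(axis=1).argsort(axis=1)
    # Replace every value by the mean of its rank, keeping the original gene order
    return pd.DataFrame(rank_means[ranks], index=df.index, columns=df.columns)

# Hypothetical toy data: 2 samples x 3 genes
toy = pd.DataFrame(
    [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]],
    index=["Sample_a", "Sample_b"],
    columns=["gene_1", "gene_2", "gene_3"],
)
toy_qn = quantile_normalize_samples(toy)
```

After normalization both toy samples contain the same set of values (1.5, 3.5 and 5.5), assigned according to each gene's rank within its own sample.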


**PERSONAL CONSIDERATIONS**

Since we want a model that can infer a patient's survivability from a feature vector, we reasoned that the model would perform best on data normalized across the dataset, so that samples can be compared within the dataset and the model can learn which feature vectors correspond to higher or lower survivability. Given this premise, we first normalize the miRNA read values with the TMM method to allow comparisons of expression between samples within the dataset, and then normalize all values of the feature vectors.
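
The TMM steps described in the theory section can be sketched as follows. This is a simplified, unweighted variant for illustration only: the function name, the `ref`, `trim_m` and `trim_a` parameters and the toy counts are hypothetical, and reference implementations such as edgeR's `calcNormFactors` additionally use precision weights when averaging.

```python
import numpy as np

def tmm_factors(counts, ref=0, trim_m=0.3, trim_a=0.05):
    """Simplified, unweighted TMM scaling factors for a samples x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=1)
    props = counts / lib_sizes[:, None]              # per-sample relative abundances
    factors = np.ones(counts.shape[0])
    for k in range(counts.shape[0]):
        if k == ref:
            continue
        keep = (counts[k] > 0) & (counts[ref] > 0)   # log ratios need non-zero counts
        m = np.log2(props[k, keep] / props[ref, keep])        # log fold changes (M)
        a = 0.5 * np.log2(props[k, keep] * props[ref, keep])  # mean log abundance (A)
        m_lo, m_hi = np.quantile(m, [trim_m, 1 - trim_m])
        a_lo, a_hi = np.quantile(a, [trim_a, 1 - trim_a])
        # Trim genes with extreme fold changes or extreme abundances
        trimmed = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
        factors[k] = 2 ** m[trimmed].mean()
    return factors

# Hypothetical toy data: the second sample is sequenced twice as deeply
# and has one strongly up-regulated gene
ref_counts = np.arange(10, 101, 10)
sample_counts = 2 * ref_counts
sample_counts[-1] = 2000
toy_counts = np.vstack([ref_counts, sample_counts])

factors = tmm_factors(toy_counts)
effective_lib_sizes = toy_counts.sum(axis=1) * factors
```

In the toy data the raw library size of the second sample is inflated by the single up-regulated gene. The effective library sizes (library size times scaling factor) recover the 2:1 depth ratio that holds for the remaining genes.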

## Init

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
# The project root is the parent of the current working directory
ROOT = os.path.dirname(os.getcwd())
print(ROOT)
# Build the path from its components so it works on any operating system
DATA_PATH = os.path.join(ROOT, 'datasets', 'preprocessed')

d:\Universita\2 anno magistrale\Progetto BioInf\miRNA_to_age


In [3]:
raw_data = pd.read_csv(os.path.join(DATA_PATH, 'clinical_miRNA(RC_RPM).csv'))
print(raw_data.columns)

Index(['days_to_death', 'pathologic_stage',
       'age_at_initial_pathologic_diagnosis', 'days_to_last_followup', 'Death',
       'case_id', 'miRNA_ID', 'read_count', 'reads_per_million_miRNA_mapped'],
      dtype='object')


In [None]:
# from sklearn.preprocessing import OneHotEncoder

# print(data.shape)

# tmp = data.drop(columns=['radiation_therapy', 'case_id'])
# print(tmp.shape)
# tmp = tmp.dropna(subset=['pathologic_stage'])
# print(tmp.shape)

# tmp.to_csv(os.path.join(DATA_PATH, 'clinical_miRNA.csv'), index=False)

## Data preparation

Remove columns that are not needed

In [24]:
print(raw_data.shape)
raw_data.drop(columns=['case_id'], inplace=True)
print(raw_data.shape)

(767, 9)
(767, 8)


In [25]:
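def parse_array(x):
    """Convert a stringified list (e.g. "[1.0, 2.0]") or an array-like into a NumPy array."""
    if isinstance(x, str):
        # Strip the brackets and evaluate the comma-separated values
        x = x.strip("[]")
        return np.array(eval(x, {'np':np}))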
    
    return np.array(x)
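
Because `eval` executes whatever the string contains, a safer variant can be sketched with `ast.literal_eval`. This assumes the stored strings are plain Python list literals such as `"[1.0, 2.5]"`; the function name `parse_array_safe` is hypothetical, and the sketch would not work if the strings contain wrapped values such as `np.float64(...)`.

```python
import ast
import numpy as np

def parse_array_safe(x):
    """Parse a stringified list such as "[1.0, 2.5]" without calling eval."""
    if isinstance(x, str):
        # literal_eval only accepts Python literals, so arbitrary code cannot run
        return np.array(ast.literal_eval(x))
    return np.array(x)

example = parse_array_safe("[4818.53, 0.0, 12.5]")
```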

One-hot encoding of the pathologic stage and deletion of the encoded columns with fewer than 20 samples

In [26]:
from sklearn.preprocessing import OneHotEncoder

print(raw_data.shape)

pathologic_stage = raw_data[['pathologic_stage']]

encoder = OneHotEncoder()
encoded_stages = encoder.fit_transform(pathologic_stage).toarray()
encoded_columns = encoder.get_feature_names_out(['pathologic_stage'])
print(f"Passing from one column to {len(encoded_columns)}")
encoded_df = pd.DataFrame(encoded_stages, columns=encoded_columns, index=pathologic_stage.index)

print(encoded_df.shape)
# Count the samples in each stage once, then drop the stages with fewer than 20
stage_counts = encoded_df.sum()
encoded_df.drop(columns=stage_counts[stage_counts < 20].index, inplace=True)
print(encoded_df.shape)

raw_data = pd.concat([raw_data, encoded_df], axis=1)
raw_data = raw_data.drop(columns=['pathologic_stage'])
print(raw_data.shape)

(767, 8)
Passing from one column to 12
(767, 12)
(767, 6)
(767, 13)
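
The same idea (one-hot encode, then drop rare categories) can be sketched on toy data with `pandas.get_dummies`. This is an illustrative alternative, not the cell above: the function name `one_hot_drop_rare`, the `min_count` parameter and the toy stages are hypothetical.

```python
import pandas as pd

def one_hot_drop_rare(series, min_count=20):
    """One-hot encode a categorical Series and drop categories with too few samples."""
    dummies = pd.get_dummies(series, prefix=series.name, dtype=float)
    counts = dummies.sum()
    return dummies.loc[:, counts >= min_count]

# Hypothetical toy data with a lower threshold so the effect is visible
toy_stage = pd.Series(
    ["Stage I", "Stage I", "Stage II", "Stage I", "Stage III", "Stage II"],
    name="pathologic_stage",
)
toy_encoded = one_hot_drop_rare(toy_stage, min_count=2)
```

Here `Stage III` occurs only once, so its column is dropped while the other two stages are kept.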


Deleting age outliers (ages shared by fewer than 5 patients)

In [27]:
print(raw_data.shape)

ages_distrib = raw_data['age_at_initial_pathologic_diagnosis'].value_counts()
ages_to_del = ages_distrib[ages_distrib < 5].index.tolist()
mask = raw_data['age_at_initial_pathologic_diagnosis'].isin(ages_to_del)
indexes = raw_data[mask].index.to_list()
raw_data.drop(indexes, inplace=True)
print(raw_data.shape)

(767, 13)
(746, 13)


Keeping only the deceased patients and the living patients whose `days_to_last_followup` is greater than the 25th percentile of `days_to_death` among the deceased patients

In [30]:
print(raw_data.shape)

dead = raw_data[raw_data['Death'] == 1]
print(f"\nNumber of dead patients: {dead.shape[0]}")
describe = dead.describe()['days_to_death']
print(describe)
alive = raw_data[(raw_data['days_to_last_followup']>describe['25%']) & (raw_data['Death']==0)]
print(f"\nNumber of alive patients with days_to_last_followup > {describe['25%']}: {alive.shape[0]}")

raw_data = pd.concat([dead, alive], axis=0)
print(raw_data.shape)

(746, 13)

Number of dead patients: 68
count      68.000000
mean     1521.117647
std      1083.264132
min         1.000000
25%       602.250000
50%      1269.500000
75%      2384.000000
max      4456.000000
Name: days_to_death, dtype: float64

Number of alive patients with days_to_last_followup > 602.25: 259
(327, 13)
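
The filtering rule above can be wrapped in a small function and checked on toy data. This is a minimal sketch: the function name `filter_by_followup`, its `quantile` parameter and the toy values are hypothetical, while the column names are the ones used in this notebook.

```python
import pandas as pd

def filter_by_followup(df, quantile=0.25):
    """Keep deceased patients plus living patients followed longer than a days_to_death quantile."""
    dead = df[df["Death"] == 1]
    threshold = dead["days_to_death"].quantile(quantile)
    alive = df[(df["Death"] == 0) & (df["days_to_last_followup"] > threshold)]
    return pd.concat([dead, alive], axis=0), threshold

# Hypothetical toy data: three deceased and two living patients
toy = pd.DataFrame({
    "Death": [1, 1, 1, 0, 0],
    "days_to_death": [100.0, 200.0, 300.0, None, None],
    "days_to_last_followup": [100.0, 200.0, 300.0, 120.0, 900.0],
})
kept, threshold = filter_by_followup(toy)
```

With these values the threshold is 150 days, so the living patient followed for 120 days is dropped and the one followed for 900 days is kept.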


Extracting the miRNA reads into a dataframe in which each column is a gene (miRNA)

In [46]:
if os.path.exists(os.path.join(DATA_PATH, 'genes_reads(RPM).csv')):
    print('Used existing file')
    genes_reads = pd.read_csv(os.path.join(DATA_PATH, 'genes_reads(RPM).csv'), index_col=0)
    print(genes_reads.head())

else:
    print('Creating new file')
    reads = raw_data["reads_per_million_miRNA_mapped"].apply(parse_array)

    genes_reads = pd.DataFrame(
        np.stack(reads.values),  # Convert the Series of vectors into a 2D array
        index=[f'Sample_{i}' for i in raw_data.index],
        # Column names come from the first row's miRNA_ID list
        columns = np.array(eval(raw_data['miRNA_ID'].iloc[0].strip('[]'), {'np':str}))
    )

    genes_reads.to_csv(os.path.join(DATA_PATH, 'genes_reads(RPM).csv'))
    print(genes_reads.head())

print(genes_reads.shape)

Used existing file
           hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3    hsa-let-7b  \
Sample_4    4818.534597   4785.049036   4915.228857   9827.447776   
Sample_25   4282.757713   4320.487207   4417.571638  12548.737764   
Sample_37  14109.535620  14138.877137  14181.022226  29818.450693   
Sample_74   7677.699119   7596.189759   7757.599742  21011.718847   
Sample_76   7889.027302   7703.268940   7950.132027   6782.624423   

            hsa-let-7c  hsa-let-7d   hsa-let-7e  hsa-let-7f-1  hsa-let-7f-2  \
Sample_4   1689.327738  667.830008  4157.853192   1991.450271   2011.014868   
Sample_25  1737.857314  884.802652  1044.462829   1526.664170   1525.743938   
Sample_37  2817.319178  378.238837  1458.540169   7162.531202   7317.774505   
Sample_74  2165.092383  841.905893  1783.553502   3637.355204   3632.797114   
Sample_76  1008.635320  215.088630   615.935623    808.211823    791.102500   

           hsa-let-7g  ...  hsa-mir-941-5  hsa-mir-942  hsa-mir-943  \
Sample_4   780.70268

### Quantile and Log2 normalization

In [50]:
import qnorm

print(genes_reads.iloc[:5, :5])

# Log2 transform; the 1e-3 pseudocount avoids log2(0) for miRNAs with zero reads
reads_logged = np.log2(genes_reads.astype(float) + 1e-3)
print(reads_logged.iloc[:5, :5])

# Quantile normalization (as called here, it equalizes the distributions of the columns, i.e. the miRNAs)
quant_norm = qnorm.quantile_normalize(reads_logged)
print(quant_norm.iloc[:5, :5])

           hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3    hsa-let-7b   hsa-let-7c
Sample_4    4818.534597   4785.049036   4915.228857   9827.447776  1689.327738
Sample_25   4282.757713   4320.487207   4417.571638  12548.737764  1737.857314
Sample_37  14109.535620  14138.877137  14181.022226  29818.450693  2817.319178
Sample_74   7677.699119   7596.189759   7757.599742  21011.718847  2165.092383
Sample_76   7889.027302   7703.268940   7950.132027   6782.624423  1008.635320


           hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c
Sample_4      12.234379     12.224318     12.263043   13.262601   10.722234
Sample_25     12.064325     12.076979     12.109038   13.615255   10.763095
Sample_37     13.784383     13.787380     13.791674   14.863918   11.460108
Sample_74     12.906458     12.891060     12.921395   14.358907   11.080214
Sample_76     12.945632     12.911255     12.956763   12.727628    9.978190
           hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c
Sample_4      -7.651968     -7.651968     -7.651968   -7.687666   -6.768681
Sample_25     -7.887958     -7.887958     -7.857447   -7.371142   -6.740548
Sample_37     -4.561636     -4.561636     -4.583317   -5.892996   -5.720627
Sample_74     -6.864840     -6.875526     -6.850876   -6.753876   -6.340340
Sample_76     -6.768681     -6.850876     -6.768681   -7.967716   -7.325029


Delete the read columns whose variance is below the median (50th percentile) of all column variances

In [51]:
print(reads_logged.shape)
variances = reads_logged.var()
low_var_cols = variances[variances < variances.median()].index
print(f"Columns to drop: {len(low_var_cols)}")
reads_logged.drop(columns=low_var_cols, inplace=True)
print(reads_logged.shape)

(327, 1881)
Columns to drop: 940
(327, 941)


In [52]:
print(quant_norm.shape)
quant_variances = quant_norm.var()
quant_norm_cols = quant_variances[quant_variances < quant_variances.median()].index
print(f"Columns to drop:{len(quant_norm_cols)}")
quant_norm.drop(columns=quant_norm_cols, inplace=True)
print(quant_norm.shape)

(327, 1881)
Columns to drop:939
(327, 942)
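
The variance filter used in the two cells above can be sketched as a reusable function. This is a minimal illustration on toy data; the function name `drop_low_variance`, its `quantile` parameter and the toy columns are hypothetical.

```python
import pandas as pd

def drop_low_variance(df, quantile=0.5):
    """Drop columns whose variance is below the given quantile of all column variances."""
    variances = df.var()
    threshold = variances.quantile(quantile)
    return df.loc[:, variances >= threshold]

# Hypothetical toy data: four columns with increasing variance
toy = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],
    "low": [1.0, 1.1, 0.9, 1.0],
    "mid": [1.0, 2.0, 3.0, 4.0],
    "high": [0.0, 10.0, 0.0, 10.0],
})
filtered = drop_low_variance(toy)
```

Unlike the cells above, this version returns a new frame instead of modifying its input in place, so the same function can be applied to both `reads_logged` and `quant_norm`.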


### Creation of the normalized file with log2 only

In [53]:
if os.path.exists(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_log.csv')):
    print("Loading existing dataset")
    dataset = pd.read_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_log.csv'))    

else:
    print("Creating new dataset")
    dataset = raw_data.copy()
    dataset.drop(columns=['read_count', 'reads_per_million_miRNA_mapped', 'miRNA_ID'], inplace=True)
    
    reads_logged.index = dataset.index
    dataset = pd.concat([dataset, reads_logged], axis=1)
    dataset.to_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_log.csv'), index=False)

print(dataset.head())
print(dataset.shape)
    

Creating new dataset
    days_to_death  age_at_initial_pathologic_diagnosis  days_to_last_followup  \
4          2763.0                                   46                 2763.0   
25         4456.0                                   50                 4456.0   
37         2520.0                                   55                 2520.0   
74          538.0                                   79                  538.0   
76         2551.0                                   47                 2551.0   

    Death  pathologic_stage_Stage I  pathologic_stage_Stage IA  \
4       1                       1.0                        0.0   
25      1                       0.0                        0.0   
37      1                       1.0                        0.0   
74      1                       0.0                        0.0   
76      1                       0.0                        0.0   

    pathologic_stage_Stage IIA  pathologic_stage_Stage IIB  \
4                          0.0   

### Creation of the normalized file with log2 and quantile normalization

In [54]:
if os.path.exists(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_quant.csv')):
    print("Loading existing dataset")
    dataset = pd.read_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_quant.csv'))

else:
    print("Creating new dataset")
    dataset = raw_data.copy()
    dataset.drop(columns=['read_count', 'reads_per_million_miRNA_mapped', 'miRNA_ID'], inplace=True)
    
    quant_norm.index = dataset.index
    dataset = pd.concat([dataset, quant_norm], axis=1)
    dataset.to_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_quant.csv'), index=False)

print(dataset.head())
print(dataset.shape)

Creating new dataset
    days_to_death  age_at_initial_pathologic_diagnosis  days_to_last_followup  \
4          2763.0                                   46                 2763.0   
25         4456.0                                   50                 4456.0   
37         2520.0                                   55                 2520.0   
74          538.0                                   79                  538.0   
76         2551.0                                   47                 2551.0   

    Death  pathologic_stage_Stage I  pathologic_stage_Stage IA  \
4       1                       1.0                        0.0   
25      1                       0.0                        0.0   
37      1                       1.0                        0.0   
74      1                       0.0                        0.0   
76      1                       0.0                        0.0   

    pathologic_stage_Stage IIA  pathologic_stage_Stage IIB  \
4                          0.0   