# DATA NORMALIZATION

Several factors affect transcript quantification in RNA-seq data, including sequencing depth, transcript length, and sample-to-sample and batch-to-batch variability. Sequencing technologies also introduce technical variability of their own. Normalization methods adjust the raw transcriptomic data for these technical factors so that the data can be compared reliably.

It is essential to choose the correct RNA-seq normalization method for the dataset and there are three main RNA-seq normalization stages to consider:
 1. **Within sample**: Within sample normalization is required to compare the expression of genes within an individual sample. It can adjust data for two primary technical variables: transcript length and sequencing depth. Longer genes often have more mapped reads than shorter genes at the same expression level. Therefore, their expression levels can only be accurately compared within a sample after normalization. Furthermore, the number of sequencing reads per sample may vary. This can also be corrected by within sample normalization. Within sample normalization is not sufficient to compare gene expression between samples. For this, between sample RNA-seq normalization methods are required. The most common within sample normalization techniques are:
    - **CPM**: Counts per million (CPM) mapped reads are the number of raw reads mapped to a transcript, scaled by the number of sequencing reads in your sample, multiplied by a million. It normalizes RNA-seq data for sequencing depth but not gene length. Therefore, although it is a within sample normalization approach, CPM normalization is unsuitable for within sample comparisons of gene expression. Between sample comparisons can be made when CPM is used alongside ‘within a dataset’ normalization methods.
    - **FPKM/RPKM**: FPKM (fragments per kilobase of transcript per million fragments mapped) for paired-end data and RPKM (reads per kilobase of transcript per million reads mapped) for single-end data correct for variations in library size and gene length. One issue with FPKM/RPKM units is that the expression of a gene in one sample will appear different from its expression in another sample, even when its true expression level is the same. This is because the value depends on the relative abundance of a transcript among a population of sequenced transcripts. FPKM/RPKM units are best suited to comparing gene expression within a single sample.
    - **TPM**: Transcripts per million (TPM) represents the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts. It is calculated by dividing the number of reads mapped to a transcript by the transcript length. This value is then divided by the sum of the length-normalized read counts of all transcripts and multiplied by one million to make further analyses easier. It normalizes RNA-seq data for sequencing depth and transcript length. TPM and FPKM/RPKM are closely related. However, in contrast to FPKM/RPKM, the sum of all TPMs is the same in each sample, which limits the variation in values between samples. TPM can be used for within sample comparisons but requires ‘within a dataset’ normalization for between sample comparisons.

 2. **Within a dataset (between samples)**: Samples within a dataset can be simultaneously normalized as a complete set to adjust for different technical variations such as sequencing depth. RNA-seq is a relative, not an absolute, measure of transcript abundance. This means that the transcript population as a whole affects relative levels of transcripts. This creates biases for gene expression analyses, and these are minimized by between sample RNA-seq normalization methods. The most common within a dataset normalization techniques are:
    - **Quantile**: The quantile method aims to make the distribution of gene expression levels the same for each sample in a dataset. It assumes that the global differences in distributions between samples are all due to technical variation. Any remaining differences are likely actual biological effects. For each sample, genes are ranked based on their expression level. An average value is calculated across all samples for genes of the same rank. This average value then replaces the original value of all genes in that rank. These genes are then placed in their original order.
    - **TMM**: TMM (trimmed mean of M-values) also assumes that most genes are not differentially expressed between samples. If many genes are uniquely or highly expressed in one experimental condition, it will affect the accurate quantification of the remaining genes. To adjust for this possibility, TMM calculates scaling factors to adjust library sizes for the normalization of samples within a dataset. To do this, one sample is chosen as a reference sample. The fold changes and absolute expression levels of other samples within the dataset are then calculated relative to the reference sample. Next, the genes in the data set are ‘trimmed’ to remove differentially expressed genes using these two values. The trimmed mean of the fold changes is then found for each sample. Finally, read counts are scaled by this trimmed mean and the total count of their sample.

 3. **Across datasets**: Researchers often integrate RNA-seq data from multiple independent studies. These datasets are usually sequenced at different times, with varying methods across multiple facilities, and contain other experimental variables. This results in a batch effect. The batch effect is often responsible for the greatest source of differential expression when data is combined. It can mask any true biological differences and lead to incorrect conclusions. RNA-seq normalization across datasets can correct for known variables across batches, such as the sequencing center and date of sequencing, as well as unknown variables.

Source: https://bigomics.ch/blog/why-how-normalize-rna-seq-data/
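
The within sample measures described above can be sketched in a few lines. This is a minimal illustration on a made-up count table (genes as rows, samples as columns); the gene names, counts, and lengths are hypothetical and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): rows are genes, columns are samples.
counts = pd.DataFrame(
    {"sample_a": [100, 200, 700], "sample_b": [300, 300, 1400]},
    index=["gene_1", "gene_2", "gene_3"],
)
# Hypothetical gene lengths in base pairs.
lengths_bp = pd.Series([1000, 2000, 500], index=counts.index)


def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    return counts / counts.sum(axis=0) * 1e6


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: depth first, then gene length."""
    return cpm(counts).div(lengths_bp / 1e3, axis=0)


def tpm(counts, lengths_bp):
    """Transcripts per million: gene length first, then depth."""
    rate = counts.div(lengths_bp / 1e3, axis=0)  # reads per kilobase
    return rate / rate.sum(axis=0) * 1e6


cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print(tpm_values.sum(axis=0))   # every sample sums to one million
print(rpkm_values.sum(axis=0))  # the per-sample sums differ
```

The only difference between RPKM and TPM is the order of the two corrections, which is why TPM columns always share the same total while RPKM columns do not.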


**PERSONAL CONSIDERATIONS**

Since we want a model capable of inferring the survivability of a patient from a feature vector, we reasoned that it would perform best on data normalized across the dataset, which allows intra-dataset comparisons and lets the model learn when a feature vector corresponds to higher or lower survivability. Given this premise, we first normalize the miRNA read values to allow intra-dataset comparisons (quantile normalization is the method applied in this notebook; TMM was also tried but is not used), and then normalize all values of the feature vectors.

In [None]:
%pip install rnanorm

In [1]:
import pandas as pd
import os
import numpy as np

In [6]:
# The project root is the parent of the current working directory
ROOT = os.path.dirname(os.getcwd())
print(ROOT)
DATA_PATH = os.path.join(ROOT, 'datasets', 'preprocessed')
miRNA_file = 'clinical_miRNA.csv'

d:\Universita\2 anno magistrale\Progetto BioInf\miRNA_to_age


In [7]:
miRNA_dataset = os.path.join(DATA_PATH, miRNA_file)
miRNA_dataset

'd:\\Universita\\2 anno magistrale\\Progetto BioInf\\miRNA_to_age\\datasets\\preprocessed\\clinical_miRNA.csv'

In [8]:
data = pd.read_csv(miRNA_dataset)
data.head()

Unnamed: 0,age_at_initial_pathologic_diagnosis,days_to_death,days_to_last_followup,case_id,read_count
0,71.0,-1.0,1918.0,378778d2-b331-4867-a93b-c64028c8b4c7,"[13756, 13807, 13949, 55698, 5797, 518, 3747, ..."
1,53.0,-1.0,1309.0,b343bfe0-7c23-4c6a-8c84-9ee39db2ecda,"[37711, 37303, 37662, 44231, 14405, 1889, 3169..."
2,55.0,-1.0,0.0,abdc76db-f85e-4337-a57e-6d098789da03,"[10731, 10926, 10792, 14125, 2622, 462, 698, 3..."
3,64.0,-1.0,212.0,fbee40f1-d6d8-4156-8d42-36e09bb9f095,"[70280, 70637, 70972, 67833, 5167, 1629, 18954..."
4,46.0,2763.0,2763.0,50619f8c-10aa-464a-a227-90a7aa6ffd43,"[12807, 12718, 13064, 26120, 4490, 1775, 11051..."


## Data preparation

Creation of a file with all miRNA base names.

In [9]:
if os.path.exists(os.path.join(DATA_PATH, 'genes_names.txt')):
    with open(os.path.join(DATA_PATH, 'genes_names.txt'), 'r') as f: 
        lines = f.readlines()
        genes_names = [line.strip() for line in lines]
else:
    genes_names = []
    #path = 'D:\\Universita\\2 anno magistrale\\Progetto BioInf\\miRNA_to_age\\datasets\\miRNA_seq\\0b64742b-b99d-4fc1-bfbf-bb7074f21a67\\9293a93f-b3d1-41ab-8c18-3deed4cc776d.mirbase21.mirnas.quantification.txt'
    path = '../datasets/miRNA_seq/0b64742b-b99d-4fc1-bfbf-bb7074f21a67/9293a93f-b3d1-41ab-8c18-3deed4cc776d.mirbase21.mirnas.quantification.txt'
    with open(path, 'r') as f: 
        lines = f.readlines()[1:]
        for l in lines:
            genes_names.append(l.split('\t')[0])

    with open(os.path.join(DATA_PATH, 'genes_names.txt'), 'w') as f:
        for name in genes_names:
            f.write(name + '\n') 

Extraction of the miRNA reads and reshaping into a dataframe in which each column is a gene.

In [10]:
def parse_array(x):
    if isinstance(x, str):
        x = x.strip("[]")
        return np.array([float(i) for i in x.split(",")])
    
    return np.array(x)

genes_reads_path = os.path.join(DATA_PATH, 'genes_reads.csv')

if os.path.exists(genes_reads_path):
    # Reuse the cached table if it was already built
    genes_reads = pd.read_csv(genes_reads_path, index_col=0)

else:
    reads = data["read_count"].apply(parse_array)

    genes_reads = pd.DataFrame(
        np.stack(reads.values),  # Convert the Series of vectors into a 2D array
        index=[f'Sample_{i}' for i in data.index]
    )

    genes_reads.columns = genes_names

    genes_reads.to_csv(genes_reads_path)
    print(genes_reads.head())

          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c  \
Sample_0       13756.0       13807.0       13949.0     55698.0      5797.0   
Sample_1       37711.0       37303.0       37662.0     44231.0     14405.0   
Sample_2       10731.0       10926.0       10792.0     14125.0      2622.0   
Sample_3       70280.0       70637.0       70972.0     67833.0      5167.0   
Sample_4       12807.0       12718.0       13064.0     26120.0      4490.0   

          hsa-let-7d  hsa-let-7e  hsa-let-7f-1  hsa-let-7f-2  hsa-let-7g  ...  \
Sample_0       518.0      3747.0        4192.0        4212.0       318.0  ...   
Sample_1      1889.0      3169.0       27465.0       27650.0      1935.0  ...   
Sample_2       462.0       698.0        3780.0        3783.0       438.0  ...   
Sample_3      1629.0     18954.0       31984.0       32414.0      2036.0  ...   
Sample_4      1775.0     11051.0        5293.0        5345.0      2075.0  ...   

          hsa-mir-941-5  hsa-mir-942  hsa-mi

### Extraction from the web, plus a manual dictionary, of the correspondence between miRNA base names and gene_id and gene_name

In [11]:
if os.path.exists(os.path.join(DATA_PATH, 'gene_lengths.csv')):
    gene_lengths_df = pd.read_csv(os.path.join(DATA_PATH, 'gene_lengths.csv'), sep="\t", index_col=0)
    print(gene_lengths_df.head())

              gene_length
gene_id                  
hsa-let-7a-1           80
hsa-let-7a-2           72
hsa-let-7a-3           74
hsa-let-7b             83
hsa-let-7c             84


In [12]:
from pybiomart import Server

server = Server(host='http://www.ensembl.org')
mart = server['ENSEMBL_MART_ENSEMBL']
dataset = mart['hsapiens_gene_ensembl']

# Query filtered on miRNA genes
miRNA_mapping = dataset.query(
    attributes=['ensembl_gene_id', 'external_gene_name', 'mirbase_id'],
    filters={'biotype': 'miRNA'}  # keep only miRNAs
)

print(miRNA_mapping.head())
miRNA_mapping.shape

    Gene stable ID  Gene name      miRBase ID
0  ENSG00000283344  MIR1244-4  hsa-mir-1244-1
1  ENSG00000283344  MIR1244-4  hsa-mir-1244-2
2  ENSG00000283344  MIR1244-4  hsa-mir-1244-3
3  ENSG00000283344  MIR1244-4  hsa-mir-1244-4
4  ENSG00000292346    MIR6089  hsa-mir-6089-1


(2170, 3)

In [13]:
server = Server(host='http://grch37.ensembl.org')
mart = server['ENSEMBL_MART_ENSEMBL']
dataset = mart['hsapiens_gene_ensembl']

mapping_grch37 = dataset.query(
    attributes=['ensembl_gene_id', 'external_gene_name', 'mirbase_id'],
    filters={'biotype': 'miRNA'}
)
miRNA_mapping_grch37 = mapping_grch37[mapping_grch37['miRBase ID'].notnull()].drop_duplicates()

print(miRNA_mapping_grch37.head())
miRNA_mapping_grch37.shape

     Gene stable ID Gene name    miRBase ID
2   ENSG00000252695   MIR2276  hsa-mir-2276
5   ENSG00000263399   MIR3170  hsa-mir-3170
7   ENSG00000207719    MIR623   hsa-mir-623
9   ENSG00000263615   MIR4306  hsa-mir-4306
12  ENSG00000265164   MIR2681  hsa-mir-2681


(1601, 3)

In [14]:
gencode_miRNA = pd.read_csv(os.path.join(DATA_PATH, 'miRNA_gtf.csv'))
gencode_miRNA.head()

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,gene_id,gene_type,...,exon_number,exon_id,transcript_support_level,havana_transcript,hgnc_id,havana_gene,ont,protein_id,ccdsid,artif_dupl
0,chr1,ENSEMBL,gene,17369,17436,.,-,.,ENSG00000278267.1,miRNA,...,,,,,HGNC:50039,,,,,
1,chr1,ENSEMBL,transcript,17369,17436,.,-,.,ENSG00000278267.1,miRNA,...,,,,,HGNC:50039,,,,,
2,chr1,ENSEMBL,exon,17369,17436,.,-,.,ENSG00000278267.1,miRNA,...,1.0,ENSE00003746039.1,,,HGNC:50039,,,,,
3,chr1,ENSEMBL,gene,30366,30503,.,+,.,ENSG00000284332.1,miRNA,...,,,,,HGNC:35294,,,,,
4,chr1,ENSEMBL,transcript,30366,30503,.,+,.,ENSG00000284332.1,miRNA,...,,,,,HGNC:35294,,,,,


In [None]:
manual_mapping = {
    "hsa-mir-3607": "MIR3607",
    "hsa-mir-3653": "MIR3653",
    "hsa-mir-3687-1": "MIR3687-1",
    "hsa-mir-3687-2": "MIR3687-2",
    "hsa-mir-6087": "MIR6087",
    "hsa-mir-6723": "MIR6723",
    "hsa-mir-6827": "MIR6827",
    "hsa-mir-7641-1": "MIR7641-1",
    "hsa-mir-7641-2": "MIR7641-2",
    "hsa-mir-3656": "MIR3656",
    "hsa-mir-4788": "MIR4788"
}

def rows_containing(df, text):
    """Rows of df in which any column contains text (case-insensitive substring match)."""
    mask = df.astype(str).apply(lambda col: col.str.contains(text, case=False, na=False))
    return df[mask.any(axis=1)]

gene_lengths = {}
for name in genes_names:
    gene_id_value = None
    mapping = None  # reset so that a match from a previous iteration is never reused
    if name in manual_mapping:
        gene_id_value = manual_mapping[name]
    else:
        mapping = rows_containing(miRNA_mapping, name)
        if mapping.empty:
            mapping = rows_containing(miRNA_mapping_grch37, name)
        if mapping.empty:
            print(f"Gene not found for miRBase ID: {name}")
            continue
        gene_id_value = mapping['Gene name'].iloc[0]

    result = pd.DataFrame()
    # Skip missing gene names (NaN), which cannot be used as a search string
    if isinstance(gene_id_value, str) and gene_id_value:
        result = rows_containing(gencode_miRNA, gene_id_value)

    if result.empty and mapping is not None:
        # No match on the gene name, so try the Gene stable ID
        gene_stable_id = mapping['Gene stable ID'].iloc[0]
        result = rows_containing(gencode_miRNA, gene_stable_id)

    if result.empty:
        print(f"Gene not found for miRBase ID: {name} using both Gene name and Gene stable ID")
        continue

    # Gene length from the GTF coordinates (start and end are both inclusive)
    gene_lengths[name] = int(result['end'].values[0]) - int(result['start'].values[0]) + 1

gene_lengths_df = pd.DataFrame([
    {'gene_id': name, 'gene_length': length}
    for name, length in gene_lengths.items()
])

print(gene_lengths_df.head())
print(gene_lengths_df.shape)
print(gene_lengths_df.isna().sum())
# Save next to the other preprocessed files, where the loading cell above looks for it
gene_lengths_df.to_csv(os.path.join(DATA_PATH, 'gene_lengths.csv'), sep="\t", index=False)


The manual mapping was made by consulting the miRBase database at https://www.mirbase.org/.
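
The lookup above relies on substring matching, so a short identifier can also match longer ones that merely start with it. The sketch below illustrates the difference on a made-up table that reuses the column names returned by the BioMart query; the rows, the `ENSG_*` placeholders, and the helper names `substring_hits` and `exact_lookup` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy mapping table (hypothetical rows, placeholder stable IDs).
toy_mapping = pd.DataFrame({
    "Gene stable ID": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "Gene name": ["MIR10A", "MIR1-1", "MIR100"],
    "miRBase ID": ["hsa-mir-10a", "hsa-mir-1-1", "hsa-mir-100"],
})


def substring_hits(df, name):
    """Gene names whose miRBase ID contains `name` as a substring."""
    mask = df["miRBase ID"].str.contains(name, case=False, na=False, regex=False)
    return df.loc[mask, "Gene name"].tolist()


def exact_lookup(df):
    """Dictionary from lower-cased miRBase ID to gene name (exact matches only)."""
    clean = df.dropna(subset=["miRBase ID"]).drop_duplicates("miRBase ID")
    return dict(zip(clean["miRBase ID"].str.lower(), clean["Gene name"]))


lookup = exact_lookup(toy_mapping)

print(substring_hits(toy_mapping, "hsa-mir-10"))  # matches hsa-mir-10a and hsa-mir-100
print(lookup.get("hsa-mir-100"))                  # exact match
print(lookup.get("hsa-mir-10"))                   # no exact match, so None
```

A dictionary lookup is also much faster than scanning every column of the table for every name, which may matter when the loop runs over all miRNA base names.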

### Quantile normalization

In [None]:
import qnorm

reads = genes_reads
print(reads.iloc[:5, :5])

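# Log2(x + 1) transform before quantile normalization; the +1 keeps zero counts defined
reads_logged = np.log2(reads + 1)
print(reads_logged.iloc[:5, :5])

# Quantile normalization
# Note: qnorm's default (axis=1) appears to align the distributions of the columns,
# which here are the miRNAs; pass axis=0 (or transpose the table) if the rows
# (samples) should be aligned instead.
quant_norm = qnorm.quantile_normalize(reads_logged)
print(quant_norm.iloc[:5, :5])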

          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c
Sample_0       13756.0       13807.0       13949.0     55698.0      5797.0
Sample_1       37711.0       37303.0       37662.0     44231.0     14405.0
Sample_2       10731.0       10926.0       10792.0     14125.0      2622.0
Sample_3       70280.0       70637.0       70972.0     67833.0      5167.0
Sample_4       12807.0       12718.0       13064.0     26120.0      4490.0




          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c
Sample_0     13.747878     13.753217     13.767978   15.765364   12.501340
Sample_1     15.202736     15.187043     15.200860   15.432803   13.814382
Sample_2     13.389631     13.415610     13.397808   13.786065   11.357002
Sample_3     16.100847     16.108157     16.114983   16.049721   12.335390
Sample_4     13.644758     13.634698     13.673420   14.672923   12.132821
          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c
Sample_0      1.152513      1.153766      1.150609    1.331246    1.488567
Sample_1      1.660901      1.659254      1.656288    1.242474    2.112232
Sample_2      1.056369      1.064106      1.055602    0.837121    1.153766
Sample_3      2.123158      2.117345      2.123158    1.451740    1.433827
Sample_4      1.134205      1.134205      1.134205    1.030723    1.381485
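
For reference, the rank-averaging procedure described in the introduction can be written from scratch. The sketch below is a minimal illustration, assuming a samples x genes table like `genes_reads` and aligning the distributions of the rows (samples). The function name `quantile_normalize_samples` and the toy values are hypothetical, and ties are broken by position rather than averaged, so results can differ slightly from those of a dedicated library.

```python
import numpy as np
import pandas as pd


def quantile_normalize_samples(df):
    """Give every row (sample) of a samples x genes table the same distribution."""
    values = df.to_numpy(dtype=float)
    # Rank of each gene within its sample (0 = lowest); ties broken by position.
    order = np.argsort(values, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable")
    # Average, across samples, of the values that share the same rank.
    rank_means = np.sort(values, axis=1).mean(axis=0)
    # Replace each value by the mean of its rank, keeping the original gene order.
    return pd.DataFrame(rank_means[ranks], index=df.index, columns=df.columns)


# Toy table (hypothetical values): 3 samples, 4 genes.
toy = pd.DataFrame(
    [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]],
    index=["Sample_0", "Sample_1", "Sample_2"],
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
toy_normalized = quantile_normalize_samples(toy)
print(toy_normalized)
```

After normalization, sorting any row of `toy_normalized` gives the same four values, which is the defining property of the method.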


### TMM normalization (NOT USED because not recommended for miRNA-seq)

In [48]:
import conorm

reads = genes_reads.T

conorm_reads = conorm.tmm(reads)
# print(conorm_reads.T.head())

print(genes_reads.iloc[:1, :5])
print(conorm_reads.T.iloc[:1, :5])
# print(conorm.tmm_norm_factors(conorm_reads))

          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  hsa-let-7c
Sample_0       13756.0       13807.0       13949.0     55698.0      5797.0
          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3    hsa-let-7b   hsa-let-7c
Sample_0  14437.364984  14490.891127    14639.9247  58456.844643  6084.138181
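
To make the TMM steps from the introduction concrete, here is a simplified sketch of the scaling factor for one sample against a reference sample. It is an assumption-laden illustration, not the algorithm run by `conorm.tmm` above: it uses an unweighted trimmed mean and a reference chosen by hand, whereas the full method weights genes by their precision and selects the reference automatically. The function name `tmm_factor_simplified` and the toy counts are hypothetical; the trimming fractions of 30% (fold changes) and 5% (abundances) follow the commonly used defaults.

```python
import numpy as np


def tmm_factor_simplified(sample, reference, m_trim=0.3, a_trim=0.05):
    """Unweighted trimmed mean of M-values of `sample` relative to `reference`."""
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    keep = (sample > 0) & (reference > 0)   # logarithms need positive counts
    p_s = sample[keep] / sample.sum()       # proportions in the sample
    p_r = reference[keep] / reference.sum() # proportions in the reference
    m = np.log2(p_s / p_r)                  # M: log fold change
    a = 0.5 * np.log2(p_s * p_r)            # A: mean log abundance
    # Trim the genes with extreme fold changes or extreme abundances.
    m_lo, m_hi = np.quantile(m, [m_trim, 1 - m_trim])
    a_lo, a_hi = np.quantile(a, [a_trim, 1 - a_trim])
    central = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
    return 2 ** m[central].mean()


# Toy counts (hypothetical): the sample is sequenced twice as deep as the
# reference, and its first five genes are strongly over-expressed.
reference = np.arange(10, 110, dtype=float)
sample = 2.0 * reference
sample[:5] *= 50

factor = tmm_factor_simplified(sample, reference)
effective_library_size = sample.sum() * factor
normalized = sample / effective_library_size
print(factor, effective_library_size)
```

In this toy case the five over-expressed genes inflate the raw library size, and the factor shrinks it back so that the unchanged genes end up with the same proportions as in the reference.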


### Creation of normalized file

In [57]:
if os.path.exists(os.path.join(DATA_PATH, 'clinical_miRNA_normalized.csv')):
    normalized = pd.read_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized.csv'))
    print(normalized.head())

else:
    final_merged_miRNA = pd.DataFrame(
        columns=['read_count'],
        data=quant_norm.apply(lambda row: row.values, axis=1)
    )
    # print(final_merged_miRNA.head())
    # print(final_merged_miRNA.shape)

    normalized = pd.read_csv(miRNA_dataset)
    print(normalized.head())
    normalized['read_count'] = final_merged_miRNA['read_count'].values
    print(normalized.head())

    normalized.to_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized.csv'), index=False)

   days_to_death pathologic_stage  age_at_initial_pathologic_diagnosis  \
0           -1.0        Stage IIA                                   71   
1           -1.0          Stage I                                   53   
2           -1.0        Stage IIB                                   55   
3           -1.0        Stage IIA                                   64   
4         2763.0          Stage I                                   46   

   days_to_last_followup  Death                               case_id  \
0                 1918.0      0  378778d2-b331-4867-a93b-c64028c8b4c7   
1                 1309.0      0  b343bfe0-7c23-4c6a-8c84-9ee39db2ecda   
2                    0.0      0  abdc76db-f85e-4337-a57e-6d098789da03   
3                  212.0      0  fbee40f1-d6d8-4156-8d42-36e09bb9f095   
4                 2763.0      1  50619f8c-10aa-464a-a227-90a7aa6ffd43   

                                          read_count  
0  [13756, 13807, 13949, 55698, 5797, 518, 3747, ...  
1  [37