### DATA NORMALIZATION

Transcript quantification in RNA-seq data is affected by several factors, such as sequencing depth, transcript length, and sample-to-sample and batch-to-batch variability. Sequencing technologies also introduce technical variability of their own. Normalization methods adjust the raw transcriptomic data for these technical factors so that the remaining differences can be interpreted reliably.

It is essential to choose the correct RNA-seq normalization method for the dataset and there are three main RNA-seq normalization stages to consider:
 1. **Within sample**: Within sample normalization is required to compare the expression of genes within an individual sample. It can adjust data for two primary technical variables: transcript length and sequencing depth. Longer genes often have more mapped reads than shorter genes at the same expression level. Therefore, their expression level can only be accurately compared within a sample after normalization. Furthermore, the number of sequencing reads per sample may vary. This can also be corrected by within sample normalization. Within sample normalization is not sufficient to compare gene expression between samples. For this, between sample RNA-seq normalization methods are required. The most common within sample normalization techniques are:
    - **CPM**: Counts per million (CPM) mapped reads are the number of raw reads mapped to a transcript, scaled by the number of sequencing reads in your sample, multiplied by a million. It normalizes RNA-seq data for sequencing depth but not gene length. Therefore, although it is a within sample normalization approach, CPM normalization is unsuitable for within sample comparisons of gene expression. Between sample comparisons can be made when CPM is used alongside ‘within a dataset’ normalization methods.
    - **FPKM/RPKM**: FPKM (fragments per kilobase of transcript per million fragments mapped) for paired-end data and RPKM (reads per kilobase of transcript per million reads mapped) for single-end data correct for variations in library size and gene length. One issue with FPKM/RPKM units is that the expression of a gene in one sample will appear different from its expression in another sample, even when its true expression level is the same. This is because the value depends on the relative abundance of a transcript among the population of sequenced transcripts. FPKM/RPKM units are therefore best suited to comparing gene expression within a single sample.
    - **TPM**: Transcripts per million (TPM) represents the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts. It is calculated by dividing the number of reads mapped to a transcript by the transcript length. This value is then divided by the sum of the length-normalized read counts of all transcripts and multiplied by one million to make further analyses easier. It normalizes RNA-seq data for sequencing depth and transcript length. TPM and FPKM/RPKM are closely related. In contrast to FPKM/RPKM, however, TPM values vary less between samples, because the sum of all TPMs in each sample is the same. TPM can be used for within sample comparisons but requires ‘within a dataset’ normalization for between sample comparisons.

 2. **Within a dataset (between samples)**: Samples within a dataset can be simultaneously normalized as a complete set to adjust for different technical variations such as sequencing depth. RNA-seq is a relative, not an absolute, measure of transcript abundance. This means that the transcript population as a whole affects relative levels of transcripts. This creates biases for gene expression analyses, and these are minimized by between sample RNA-seq normalization methods. The most common within a dataset normalization techniques are:
    - **Quantile**: The quantile method aims to make the distribution of gene expression levels the same for each sample in a dataset. It assumes that the global differences in distributions between samples are all due to technical variation. Any remaining differences are likely actual biological effects. For each sample, genes are ranked based on their expression level. An average value is calculated across all samples for genes of the same rank. This average value then replaces the original value of all genes in that rank. These genes are then placed in their original order.
    - **TMM**: TMM (trimmed mean of M-values) also assumes that most genes are not differentially expressed between samples. If many genes are uniquely or highly expressed in one experimental condition, it will affect the accurate quantification of the remaining genes. To adjust for this possibility, TMM calculates scaling factors to adjust library sizes for the normalization of samples within a dataset. To do this, one sample is chosen as a reference sample. The fold changes and absolute expression levels of other samples within the dataset are then calculated relative to the reference sample. Next, the genes in the data set are ‘trimmed’ to remove differentially expressed genes using these two values. The trimmed mean of the fold changes is then found for each sample. Finally, read counts are scaled by this trimmed mean and the total count of their sample.

 3. **Across datasets**: Researchers often integrate RNA-seq data from multiple independent studies. These datasets are usually sequenced at different times, with varying methods across multiple facilities, and contain other experimental variables. This results in a batch effect. The batch effect is often responsible for the greatest source of differential expression when data is combined. It can mask any true biological differences and lead to incorrect conclusions. RNA-seq normalization across datasets can correct for known variables across batches, such as the sequencing center and date of sequencing, as well as unknown variables.
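
The within sample formulas described in point 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used later in this notebook: the counts and gene lengths are made-up toy values, and the function names `cpm`, `rpkm` and `tpm` are hypothetical helpers.

```python
import numpy as np

# Hypothetical toy data for a single sample: raw read counts for three genes
# and the corresponding gene lengths in base pairs (values are made up).
counts = np.array([100.0, 200.0, 700.0])
lengths_bp = np.array([1000.0, 2000.0, 7000.0])


def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    return counts / counts.sum() * 1e6


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: depth first, then gene length."""
    return cpm(counts) / (lengths_bp / 1e3)


def tpm(counts, lengths_bp):
    """Transcripts per million: gene length first, then rescale to one million."""
    reads_per_kb = counts / (lengths_bp / 1e3)
    return reads_per_kb / reads_per_kb.sum() * 1e6


print("CPM :", cpm(counts))
print("RPKM:", rpkm(counts, lengths_bp))
print("TPM :", tpm(counts, lengths_bp))
```

In this toy sample the three genes have the same number of reads per kilobase, so RPKM and TPM assign them equal values while CPM does not. The TPM values always sum to one million, whereas the RPKM total depends on the sample.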

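The quantile procedure described in point 2 (rank the genes, average across samples at each rank, put the averages back in the original order) can be sketched as follows. This is an illustrative version under simplifying assumptions: the matrix is genes x samples (the transpose of `genes_reads`), the toy values are made up, tied values are not averaged, and `quantile_normalize` is a hypothetical helper name.

```python
import numpy as np


def quantile_normalize(expr):
    """Quantile-normalize a (genes x samples) array; ties are not averaged."""
    # Rank of every gene within its own sample (0 = lowest expression)
    ranks = np.argsort(np.argsort(expr, axis=0, kind="stable"), axis=0)
    # Mean expression across samples at each rank
    rank_means = np.sort(expr, axis=0).mean(axis=1)
    # Replace each value by the mean of its rank, keeping the original gene order
    return rank_means[ranks]


# Toy matrix: 4 genes (rows) x 3 samples (columns), made-up values without ties
expr = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 6.0, 6.0],
    [4.0, 2.0, 8.0],
])

normalized = quantile_normalize(expr)
print(normalized)
```

After normalization every sample contains the same set of values, so the distributions are identical, while the ranking of the genes inside each sample is unchanged.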


**PERSONAL CONSIDERATIONS**

We did not consider across-dataset normalization necessary in our case, since all samples came from the same laboratory and were comparable in terms of disease and general characteristics. We therefore planned to apply only the first two types of normalization.

In [None]:
%pip install rnanorm

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
# The project root is the parent of the current working directory
ROOT = os.path.dirname(os.getcwd())
print(ROOT)
DATA_PATH = os.path.join(ROOT, 'datasets', 'preprocessed')
miRNA_file = 'final_merged.csv'

d:\Universita\2 anno magistrale\Progetto BioInf\miRNA_to_age


In [5]:
miRNA_dataset = os.path.join(DATA_PATH, miRNA_file)
miRNA_dataset

'd:\\Universita\\2 anno magistrale\\Progetto BioInf\\miRNA_to_age\\datasets\\preprocessed\\final_merged.csv'

In [6]:
data = pd.read_csv(miRNA_dataset)
data.head()

Unnamed: 0,case_id,age_at_initial_pathologic_diagnosis,reads_per_million_miRNA_mapped
0,378778d2-b331-4867-a93b-c64028c8b4c7,71.0,"[12120.990742, 12041.557881, 12141.51573, 1768..."
1,b343bfe0-7c23-4c6a-8c84-9ee39db2ecda,53.0,"[20904.21904, 21106.26856, 20907.030164, 45794..."
2,b343bfe0-7c23-4c6a-8c84-9ee39db2ecda,53.0,"[4742.042712, 4748.951756, 4919.239371, 4809.9..."
3,3e775c99-ceda-4246-8d6f-0f58ca5097c8,59.0,"[8992.901265, 8985.700823, 9044.654437, 29490...."
4,abdc76db-f85e-4337-a57e-6d098789da03,55.0,"[6540.401419, 6668.261153, 6475.719437, 19448...."


In [7]:
# Read the miRNA IDs (first column) from one quantification file, skipping the header row.
# This assumes that every sample's file lists the same miRNAs in the same order.
quant_file = os.path.join(
    ROOT, 'datasets', 'miRNA_seq',
    '0b64742b-b99d-4fc1-bfbf-bb7074f21a67',
    '9293a93f-b3d1-41ab-8c18-3deed4cc776d.mirbase21.mirnas.quantification.txt',
)
with open(quant_file, 'r') as f:
    genes_names = [line.split('\t')[0] for line in f.readlines()[1:]]

with open(os.path.join(DATA_PATH, 'genes_names.txt'), 'w') as f:
    for name in genes_names:
        f.write(name + '\n') 

In [8]:
def parse_array(x):
    """Turn a stringified list such as "[1.0, 2.0]" into a float array."""
    if isinstance(x, str):
        return np.array([float(i) for i in x.strip("[]").split(",")])
    return np.array(x)

reads = data["reads_per_million_miRNA_mapped"].apply(parse_array)
genes_reads = pd.DataFrame(
    np.stack(reads.values),  # turn the Series of vectors into a 2D array
    index=[f'Sample_{i}' for i in data.index]
)
# Label the columns with the miRNA IDs saved in the previous cell
with open(os.path.join(DATA_PATH, 'genes_names.txt'), 'r') as f:
    genes_reads.columns = [line.strip() for line in f]
print(genes_reads.head())
print(type(genes_reads))

          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3    hsa-let-7b   hsa-let-7c  \
Sample_0  12120.990742  12041.557881  12141.515730  17683.610218  2483.291559   
Sample_1  20904.219040  21106.268560  20907.030164  45794.611568  6957.179909   
Sample_2   4742.042712   4748.951756   4919.239371   4809.913909   546.220895   
Sample_3   8992.901265   8985.700823   9044.654437  29490.757783  3047.586817   
Sample_4   6540.401419   6668.261153   6475.719437  19448.969676  2930.996358   

          hsa-let-7d   hsa-let-7e  hsa-let-7f-1  hsa-let-7f-2  hsa-let-7g  \
Sample_0  276.913390   860.889868   3728.010288   3782.511667  375.827595   
Sample_1  471.917401  2279.821367   7539.785307   7636.417687  759.354805   
Sample_2  298.714552   869.320309   1083.094260   1083.907089  495.419101   
Sample_3  825.575608   577.835421   2360.394692   2396.171885  974.759753   
Sample_4  310.623940   563.334943   1192.103985   1270.324057  559.574362   

          ...  hsa-mir-941-5  hsa-mir-942  hsa-mir

In [24]:
from pybiomart import Server

server = Server(host='http://www.ensembl.org')
mart = server['ENSEMBL_MART_ENSEMBL']
dataset = mart['hsapiens_gene_ensembl']

# Query filtered to miRNA genes
miRNA_mapping = dataset.query(
    attributes=['ensembl_gene_id', 'external_gene_name', 'mirbase_id'],
    filters={'biotype': 'miRNA'}  # keep only miRNAs
)

print(miRNA_mapping.head())
miRNA_mapping.shape

    Gene stable ID  Gene name      miRBase ID
0  ENSG00000283344  MIR1244-4  hsa-mir-1244-1
1  ENSG00000283344  MIR1244-4  hsa-mir-1244-2
2  ENSG00000283344  MIR1244-4  hsa-mir-1244-3
3  ENSG00000283344  MIR1244-4  hsa-mir-1244-4
4  ENSG00000292346    MIR6089  hsa-mir-6089-1


(2170, 3)

In [26]:
gencode_miRNA = pd.read_csv(os.path.join(DATA_PATH, 'miRNA_gtf.csv'))
gencode_miRNA.head()

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,gene_id,gene_type,...,exon_number,exon_id,transcript_support_level,havana_transcript,hgnc_id,havana_gene,ont,protein_id,ccdsid,artif_dupl
0,chr1,ENSEMBL,gene,17369,17436,.,-,.,ENSG00000278267.1,miRNA,...,,,,,HGNC:50039,,,,,
1,chr1,ENSEMBL,transcript,17369,17436,.,-,.,ENSG00000278267.1,miRNA,...,,,,,HGNC:50039,,,,,
2,chr1,ENSEMBL,exon,17369,17436,.,-,.,ENSG00000278267.1,miRNA,...,1.0,ENSE00003746039.1,,,HGNC:50039,,,,,
3,chr1,ENSEMBL,gene,30366,30503,.,+,.,ENSG00000284332.1,miRNA,...,,,,,HGNC:35294,,,,,
4,chr1,ENSEMBL,transcript,30366,30503,.,+,.,ENSG00000284332.1,miRNA,...,,,,,HGNC:35294,,,,,


In [49]:
# gene_id = miRNA_mapping.loc[miRNA_mapping['miRBase ID'] == 'hsa-mir-1254-1', 'Gene name']
start = gencode_miRNA.loc[gencode_miRNA['gene_name'] == 'MIR1254-1', 'start']
# print(gene_id.values[0])
# .values[0] raises IndexError when no row has this gene_name
print(start.values[0])

IndexError: index 0 is out of bounds for axis 0 with size 0

In [45]:
gene_lengths = {}
for name in genes_names:
    print(name)
    # miRBase ID -> gene name; .values[0] below raises IndexError if nothing matches
    gene_id = miRNA_mapping.loc[miRNA_mapping['miRBase ID'] == name, 'Gene name']
    start = gencode_miRNA.loc[gencode_miRNA['gene_name'] == gene_id.values[0], 'start']
    end = gencode_miRNA.loc[gencode_miRNA['gene_name'] == gene_id.values[0], 'end']
    # Length from the first matching annotation row (GTF coordinates are inclusive)
    length = int(end.values[0]) - int(start.values[0]) + 1
    gene_lengths[name] = length

# One scalar per miRNA, so the resulting frame has a single 'gene_length' column
gene_lengths = pd.DataFrame.from_dict(gene_lengths, orient='index', columns=['gene_length'])
gene_lengths.head()

hsa-let-7a-1
hsa-let-7a-2
hsa-let-7a-3
hsa-let-7b
hsa-let-7c
hsa-let-7d
hsa-let-7e
hsa-let-7f-1
hsa-let-7f-2
hsa-let-7g
hsa-let-7i
hsa-mir-1-1
hsa-mir-1-2
hsa-mir-100
hsa-mir-101-1
hsa-mir-101-2
hsa-mir-103a-1
hsa-mir-103a-2
hsa-mir-103b-1
hsa-mir-103b-2
hsa-mir-105-1
hsa-mir-105-2
hsa-mir-106a
hsa-mir-106b
hsa-mir-107
hsa-mir-10a
hsa-mir-10b
hsa-mir-1178
hsa-mir-1179
hsa-mir-1180
hsa-mir-1181
hsa-mir-1182
hsa-mir-1183
hsa-mir-1184-1
hsa-mir-1184-2
hsa-mir-1184-3
hsa-mir-1185-1
hsa-mir-1185-2
hsa-mir-1193
hsa-mir-1197
hsa-mir-1199
hsa-mir-1200
hsa-mir-1202
hsa-mir-1203
hsa-mir-1204
hsa-mir-1205
hsa-mir-1206
hsa-mir-1207
hsa-mir-1208
hsa-mir-122
hsa-mir-1224
hsa-mir-1225
hsa-mir-1226
hsa-mir-1227
hsa-mir-1228
hsa-mir-1229
hsa-mir-1231
hsa-mir-1233-1
hsa-mir-1233-2
hsa-mir-1234
hsa-mir-1236
hsa-mir-1237
hsa-mir-1238
hsa-mir-124-1
hsa-mir-124-2
hsa-mir-124-3
hsa-mir-1243
hsa-mir-1244-1
hsa-mir-1244-2
hsa-mir-1244-3
hsa-mir-1244-4
hsa-mir-1245a
hsa-mir-1245b
hsa-mir-1246
hsa-mir-1247
hsa-m

IndexError: index 0 is out of bounds for axis 0 with size 0
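
The loop above stops at the first miRNA that has no match in either table. One possible way around this, sketched here on made-up stand-ins for `miRNA_mapping` and `gencode_miRNA`, is to skip unmatched names and collect them for later inspection. The toy frames, the `feature == 'gene'` filter and the variable names `name_to_gene`, `gene_rows` and `missing` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for miRNA_mapping and gencode_miRNA (values are made up)
mapping = pd.DataFrame({
    'miRBase ID': ['hsa-mir-a', 'hsa-mir-b'],
    'Gene name': ['MIRA', 'MIRB'],
})
gtf = pd.DataFrame({
    'gene_name': ['MIRA', 'MIRA'],
    'feature': ['gene', 'transcript'],
    'start': [100, 100],
    'end': [171, 171],
})
names = ['hsa-mir-a', 'hsa-mir-b', 'hsa-mir-c']

# Build both lookups once instead of filtering the frames inside the loop
name_to_gene = mapping.drop_duplicates('miRBase ID').set_index('miRBase ID')['Gene name']
gene_rows = gtf[gtf['feature'] == 'gene'].drop_duplicates('gene_name').set_index('gene_name')

lengths, missing = {}, []
for name in names:
    gene = name_to_gene.get(name)  # None when the miRBase ID has no mapping
    if gene is None or gene not in gene_rows.index:
        missing.append(name)       # remember what could not be resolved
        continue
    row = gene_rows.loc[gene]
    lengths[name] = int(row['end']) - int(row['start']) + 1

gene_lengths_sketch = pd.Series(lengths, name='gene_length')
print(gene_lengths_sketch)
print('unmapped:', missing)
```

Keeping the list of unmapped names makes it possible to decide afterwards whether to drop those miRNAs or to look up their lengths from another source.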

In [None]:
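from rnanorm import TPM
# genes_lengths.txt is expected to already exist in DATA_PATH; it is not written by this notebook
genes_length_file = os.path.join(DATA_PATH, 'genes_lengths.txt')
# Assumption: the lengths file is passed through the `gene_lengths` keyword,
# since rnanorm appears to read the first positional argument as a GTF path
tpm = TPM(gene_lengths=genes_length_file)
normalized_reads = tpm.set_output(transform="pandas").fit_transform(genes_reads)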


<class 'pandas.core.frame.DataFrame'>
(4, 5)
<class 'pandas.core.frame.DataFrame'>
(758, 1881)


In [None]:
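df = pd.read_csv(miRNA_dataset)

# Collapse each normalized row into a plain list so that it is written as "[a, b, ...]".
# The Series is built on df.index because normalized_reads is indexed by 'Sample_i',
# and assigning by label would otherwise fill the column with NaN.
df['reads_per_million_miRNA_mapped'] = pd.Series(
    [row.tolist() for row in normalized_reads.to_numpy()],
    index=df.index,
)
df.to_csv(os.path.join(DATA_PATH, 'final_merged_miRNA_normalized.csv'), index=False)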
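
When vectors are stored inside a CSV cell, it helps to check that they can be read back. A NumPy array written directly is stored as its printed form, without commas, which a comma-based parser such as `parse_array` cannot split. The sketch below shows one possible round trip using JSON strings on made-up data. The in-memory buffer and the column name `vector` are assumptions for illustration.

```python
import io
import json

import numpy as np
import pandas as pd

# Made-up vectors standing in for the normalized rows
vectors = [np.array([1.5, 2.0, 3.25]), np.array([0.0, 4.5, 6.0])]

# Serialize each vector as a JSON list, e.g. "[1.5, 2.0, 3.25]"
frame = pd.DataFrame({'vector': [json.dumps(v.tolist()) for v in vectors]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)['vector'].apply(lambda s: np.array(json.loads(s)))

print(np.stack(restored.values))
```

The JSON form keeps the commas, so the same cells could also be read with a parser that strips the brackets and splits on commas.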