Data Cleaning and Normalization for Gene Expression Analysis

This notebook outlines the steps for cleaning and normalizing a gene expression dataset. The key steps are:

1. Data Reading:
Read the raw count matrix into a pandas DataFrame.

2. Column Renaming:
I rename the columns to short, consistent names (Gene_ID, Sample_1, Sample_2, ...) so the data is easier to read and manipulate.

3. String Splitting:
I split Gene_ID on the '|' delimiter into two separate identifiers: the Ensembl gene ID (ENSG) and the HGNC symbol (HGNC). Keeping them in separate columns makes the data easier to work with.

4. Data Normalization:
Normalization makes expression values comparable across samples that were sequenced to different depths. I use Counts Per Million (CPM) normalization, which is commonly used in RNA-seq data analysis: each raw count is divided by the total count of its sample (the sum of its column) and then multiplied by 1,000,000.

5. Data Filtering:
After normalization, I keep only genes whose mean CPM across all samples is greater than 1. This removes genes that are lowly expressed on average, as they are less likely to be of interest in the following analyses.
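
To make the arithmetic in steps 4 and 5 concrete, here is a minimal sketch on a made-up table of three genes and two samples. The gene names and counts are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy counts (hypothetical values, not from the real matrix)
toy = pd.DataFrame(
    {"Sample_1": [10, 0, 999_990], "Sample_2": [0, 1, 1_999_999]},
    index=["geneA", "geneB", "geneC"],
)

# CPM: divide each count by its column total, then scale to one million
toy_cpm = toy / toy.sum(axis=0) * 1e6

# Keep genes whose mean CPM across samples is greater than 1
toy_kept = toy_cpm[toy_cpm.mean(axis=1) > 1]

print(toy_cpm)
print(toy_kept.index.tolist())
```

In this toy table geneB has a mean of about 0.25 CPM, so it is the only gene removed by the filter.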

In [29]:
import pandas as pd

In [48]:
# Step 1: Reading the Data
# Read the raw matrix into a DataFrame
df = pd.read_csv('../data/GSE162285_gene_raw_counts_matrix.txt', delimiter='\t')
print("Initial Data:\n")
print(df.head(2))

Initial Data:

          ENSG|HGNC_symbol  \
0  ENSG00000223972|DDX11L1   
1   ENSG00000227232|WASH7P   

   20170417_MDAMB231Org_Veh1_ED3996-2_S1_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                                  0                                    
1                                                  1                                    

   20170417_MDAMB231Org_Veh2_ED3996-2_S2_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                                  0                                    
1                                                  2                                    

   20170417_MDAMB231Org_Veh3_ED3996-2_S3_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                                  0                                    
1                                                  0                                    

   20170417_MDAMB231Org_Doc1_ED3996-2_S10_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \


In [50]:
# Step 2: Renaming Columns
# Rename the columns for better readability
df.columns = ['Gene_ID'] + [f'Sample_{i}' for i in range(1, df.shape[1])]
print("\nColumns after renaming:\n")
#print(df.columns)
print(df.head())


Columns after renaming:

                     Gene_ID  Sample_1  Sample_2  Sample_3  Sample_4  \
0    ENSG00000223972|DDX11L1         0         0         0         3   
1     ENSG00000227232|WASH7P         1         2         0         2   
2  ENSG00000278267|MIR6859-1         1         0         1         0   
3           ENSG00000243485|         0         0         0         3   
4  ENSG00000274890|MIR1302-2         0         0         0         0   

   Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  ...  Sample_64  \
0         0         1         0         0         0  ...          0   
1         1         0         1         0         0  ...          0   
2         0         0         0         0         0  ...          1   
3         1         0         0         0         0  ...          0   
4         0         0         0         0         0  ...          0   

   Sample_65  Sample_66  Sample_67  Sample_68  Sample_69  Sample_70  \
0          1          0          0         
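
Renaming to Sample_1, Sample_2, ... discards the labels (for example Veh1, Doc1) embedded in the original file names. One way to keep them is to record a lookup table before renaming. The sketch below uses a small stand-in DataFrame built from two of the original column names shown in the Step 1 output. The names `raw`, `sample_map`, and `short_label` are hypothetical, and taking the third underscore-separated field as the label is an assumption based only on the column names visible in that output.

```python
import pandas as pd

# Stand-in for the raw matrix, using the first two rows and samples printed in Step 1
raw = pd.DataFrame({
    "ENSG|HGNC_symbol": ["ENSG00000223972|DDX11L1", "ENSG00000227232|WASH7P"],
    "20170417_MDAMB231Org_Veh1_ED3996-2_S1_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab": [0, 1],
    "20170417_MDAMB231Org_Veh2_ED3996-2_S2_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab": [0, 2],
})

original_names = list(raw.columns[1:])
new_names = [f"Sample_{i}" for i in range(1, raw.shape[1])]

# Lookup table: new column name -> original file name and a short label
sample_map = pd.DataFrame(
    {
        "original": original_names,
        # Assumption: the third '_'-separated field is the condition/replicate label
        "short_label": [name.split("_")[2] for name in original_names],
    },
    index=new_names,
)

# Rename only after the mapping has been recorded
raw.columns = ["Gene_ID"] + new_names
print(sample_map["short_label"].to_dict())
```

With a table like this, a column such as Sample_1 can still be traced back to its original file name in later steps.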

In [55]:
# Step 3: Splitting Gene_ID
# Split the 'Gene_ID' into 'ENSG' and 'HGNC' based on the delimiter '|'
df[['ENSG', 'HGNC']] = df['Gene_ID'].str.split('|', expand=True)
print("Data after splitting Gene_ID:\n")
print(df.head())

Data after splitting Gene_ID:

                     Gene_ID  Sample_1  Sample_2  Sample_3  Sample_4  \
0    ENSG00000223972|DDX11L1         0         0         0         3   
1     ENSG00000227232|WASH7P         1         2         0         2   
2  ENSG00000278267|MIR6859-1         1         0         1         0   
3           ENSG00000243485|         0         0         0         3   
4  ENSG00000274890|MIR1302-2         0         0         0         0   

   Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  ...  Sample_66  \
0         0         1         0         0         0  ...          0   
1         1         0         1         0         0  ...          4   
2         0         0         0         0         0  ...          1   
3         1         0         0         0         0  ...          0   
4         0         0         0         0         0  ...          0   

   Sample_67  Sample_68  Sample_69  Sample_70  Sample_71  Sample_72  \
0          0          0          0    
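
Row 3 in the output above (ENSG00000243485|) has no HGNC symbol, so the split leaves an empty string in the HGNC column. A slightly more defensive variant is sketched below, as one possible approach: `n=1` limits the split to the first '|' so the result always has exactly two columns, and empty symbols are turned into missing values. The stand-in DataFrame `genes` is hypothetical and reuses two identifiers from the output above.

```python
import pandas as pd

genes = pd.DataFrame({
    "Gene_ID": ["ENSG00000223972|DDX11L1", "ENSG00000243485|"],
})

# n=1 splits on the first '|' only, so there are never more than two parts
parts = genes["Gene_ID"].str.split("|", n=1, expand=True)
genes[["ENSG", "HGNC"]] = parts

# Mark empty symbols as missing so they are easy to count or drop later
genes["HGNC"] = genes["HGNC"].mask(genes["HGNC"] == "")

print(genes)
print("Genes without an HGNC symbol:", genes["HGNC"].isna().sum())
```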

In [45]:
# Step 4: Normalizing the Data
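# Select the sample columns by name so the identifier columns (Gene_ID, ENSG, HGNC) are never normalized
sample_cols = [c for c in df.columns if c.startswith('Sample_')]
# CPM: divide each count by its column (sample) total, then multiply by 1e6
df_normalized = df[sample_cols].apply(lambda x: (x / x.sum()) * 1e6, axis=0)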
print("\nNormalized Data:\n")
print(df_normalized.head(2))


Normalized Data:

   20170417_MDAMB231Org_Veh1_ED3996-2_S1_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                           0.000000                                    
1                                           0.224773                                    

   20170417_MDAMB231Org_Veh2_ED3996-2_S2_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                           0.000000                                    
1                                           0.474113                                    

   20170417_MDAMB231Org_Veh3_ED3996-2_S3_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                                0.0                                    
1                                                0.0                                    

   20170417_MDAMB231Org_Doc1_ED3996-2_S10_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
0                                           1.930194                                   
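
A quick sanity check for CPM is that every normalized sample column sums to one million. The sketch below shows that check on a small hypothetical table; `counts` and its values are made up. The second sample has exactly twice the counts of the first, so after normalization the two columns should match, which illustrates how CPM removes differences in sequencing depth.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: Sample_2 is Sample_1 sequenced twice as deeply
counts = pd.DataFrame({
    "Sample_1": [0, 1, 3],
    "Sample_2": [0, 2, 6],
})

# Same CPM formula as in Step 4
cpm = counts.apply(lambda x: (x / x.sum()) * 1e6, axis=0)

# Each column should sum to 1e6 (up to floating-point error)
column_sums = cpm.sum(axis=0)
assert np.allclose(column_sums, 1e6)

# Proportional samples give the same CPM values
print(np.allclose(cpm["Sample_1"], cpm["Sample_2"]))
```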

In [47]:
# Step 5: Filtering the Data
# Keep only rows (genes) whose mean CPM across all samples is greater than 1
df_filtered = df_normalized[df_normalized.mean(axis=1) > 1]
print("\nFiltered Data:")
print(df_filtered.head(2))


Filtered Data:
    20170417_MDAMB231Org_Veh1_ED3996-2_S1_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
19                                           4.270685                                    
36                                           3.821139                                    

    20170417_MDAMB231Org_Veh2_ED3996-2_S2_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
19                                           5.689356                                    
36                                           5.215243                                    

    20170417_MDAMB231Org_Veh3_ED3996-2_S3_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
19                                           4.332186                                    
36                                           4.982014                                    

    20170417_MDAMB231Org_Doc1_ED3996-2_S10_R1_001.trimmed.fastq.gz.ReadsPerGene.out.tab  \
19                                           5.790582                           
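
Because df_normalized contains only the sample columns, df_filtered no longer carries the gene identifiers. One possible way to keep them, sketched here on a hypothetical stand-in table, is to compute the boolean mask from the normalized values and apply it to both the identifier columns and the normalized values; the rows line up because both frames share the same index. The names `annot`, `cpm`, `keep`, and `filtered`, the placeholder identifiers, and the numbers are all illustrative.

```python
import pandas as pd

# Placeholder identifiers standing in for the ENSG and HGNC columns
annot = pd.DataFrame({
    "ENSG": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "HGNC": ["GENE_A", "GENE_B", "GENE_C"],
})

# Made-up CPM values on the same row index as annot
cpm = pd.DataFrame({
    "Sample_1": [4.2, 0.0, 3.8],
    "Sample_2": [5.6, 0.4, 5.2],
})

# Same rule as Step 5: keep genes with mean CPM greater than 1
keep = cpm.mean(axis=1) > 1

# Apply the mask to both frames and put identifiers next to the values
filtered = pd.concat([annot[keep], cpm[keep]], axis=1)
print(filtered)
```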