# <div style="background-color: #000000; color:white;padding: 20px; border-radius: 10px; text-align: center;">Chapter 1. Data Loading and Preprocessing</div>


In this section, we will load the data from provided URLs and perform some basic preprocessing steps. The datasets include gene expression data, mutation data, copy number variation data, and subtype covariates. 



In this chapter, you will explore the different types of multi-omics datasets and learn the following 🐼pandas functions:

- 🐼How to read `.csv` data: `pd.read_csv(...)`
- 🐼How to merge multiple data frames: `pd.merge(...)`, `pd.concat(...)`
- 🐼How to check for missing values: `[].isna()`, `[].isna().sum()`



##### ⚡⚡⚡ If you have already learnt these, you can start practicing in the [Data Loading and Preprocessing Exercise](#Data-Loading-and-Preprocessing-Exercise:)

## Table of Contents

- [1.1Data Loading](#1.1Data-Loading)
- [1.2 Data preprocessing](#1.2-Data-preprocessing)
- [🐼1.2.1 Missing Values](#🐼1.2.1-Missing-Values)
- [🐼1.2.2 Data Normalization](#🐼1.2.2-Data-Normalization)
- [🐼1.2.3 Combine the data into a dataframe](#🐼1.2.3-Combine-the-data-into-a-dataframe)

##### The first step in using pandas is to import it 📋:

In [1]:
import pandas as pd

## 1.1Data Loading
Loading the data is the simplest step, but there are a lot of **details** that need to be taken care of. We need to read the data and then look for problems. For example, we first inspect the data to determine what the **rows and columns** of the data frame are. We then fix the row and column names and arrange the data with the **features as columns** and the **samples as rows**. Through this section, you will gain a better understanding of the data structure, of the difference between a **data frame** and a **matrix**, and of the difference between a **feature** and a **variable**.

🐼`pandas` (https://pandas.pydata.org/docs/reference/arrays.html): Pandas is a powerful Python library used for data loading and preprocessing. It helps with reading datasets, handling missing values, and combining data.


We will download the following datasets from [https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/]:
1. **Gene Expression Data (COREAD_gex.csv)**
2. **Mutation Data (COREAD_mu.csv)**
3. **Copy Number Variation Data (COREAD_cn.csv)**
4. **Subtype Covariates (COREAD_subtypes.csv)**

## 🐼1.1.1 Gene expression

In [2]:
import pandas as pd
# Base URL for data download
base_url = "https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/"
# Load Gene Expression Data
x1 = pd.read_csv(base_url + "COREAD_gex.csv")
# Display the DataFrame
x1

Unnamed: 0.1,Unnamed: 0,Case1,Case2,Case3,Case4,Case5,Case6,Case7,Case8,Case9,...,Case112,Case113,Case114,Case115,Case116,Case117,Case118,Case119,Case120,Case121
0,RNF113A,21.195670,21.508658,20.080716,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,15.070030,0.000000,0.000000,11.644673,11.673601,11.717598,0.000000,13.003646,11.363485
1,S100A13,19.726005,18.657292,18.970339,11.883356,12.077533,12.991279,0.000000,0.000000,11.874508,...,14.686042,13.154416,0.000000,11.981534,11.728916,0.000000,0.000000,13.146131,0.000000,0.000000
2,AP3D1,11.530217,12.988300,10.837587,10.242480,0.000000,0.000000,0.000000,0.000000,0.000000,...,15.131805,19.188532,0.000000,0.000000,0.000000,0.000000,0.000000,14.633627,19.870922,18.007650
3,ATP6V1G1,0.000000,14.126750,15.313250,19.792995,0.000000,19.969765,16.120017,13.997117,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,20.580360,18.139138,0.000000,0.000000
4,UBQLN4,15.356366,19.622082,0.000000,0.000000,0.000000,13.176176,0.000000,0.000000,0.000000,...,0.000000,13.339314,13.727886,0.000000,0.000000,13.527430,11.986711,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,PLOD1,18.373326,17.017828,16.332485,18.558028,18.953197,17.295380,18.475193,18.882841,17.756759,...,16.147915,15.755994,14.682897,17.465695,18.363699,19.990074,17.893169,16.104134,18.700134,19.552769
496,KIFC1,16.549827,15.357553,19.223665,16.222916,16.699535,16.304812,16.282002,16.178144,16.453070,...,14.623869,17.859642,16.480703,16.549068,10.275163,18.081018,16.506832,15.757595,14.663050,18.294214
497,GNG4,18.765078,15.658792,17.712799,19.256511,19.018383,14.849142,17.136288,16.644723,17.855843,...,15.908297,17.826622,17.166550,18.716933,16.702172,16.589956,17.882752,18.764311,18.086688,15.956000
498,SMIM30,16.932423,15.977293,17.595020,13.663508,16.277123,13.771628,15.025089,11.941106,16.479619,...,13.507506,13.952473,13.375888,14.710285,17.626794,12.700718,15.825745,14.789536,18.072638,15.934925


**`Unnamed: 0` appears in the header row, which means that `pd.read_csv` on its own treats the first column (the gene names) as ordinary data instead of using it as the index. Therefore, we need to set the first column as the index using `index_col=0`.**

⚠️**After loading the data, it is important to check the format of the data, including the row and column names.**


**⚠️Each row should refer to a "sample" and each column should refer to a "feature", which is important for data normalization and other processes. Here the genes are loaded as rows, so we also transpose the table with `.T`.**
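
To see the effect of `index_col=0` and `.T` without downloading anything, here is a minimal sketch on a tiny in-memory CSV. The gene names, case names, and values are made up for illustration; only `pd.read_csv`, `index_col`, and `.T` come from the text above.

```python
import io
import pandas as pd

# Hypothetical 2-gene x 3-case table. The leading comma leaves the first header
# cell empty, which is why pandas labels that column "Unnamed: 0".
toy_csv = ",Case1,Case2,Case3\nGeneA,1.5,0.0,2.0\nGeneB,0.0,3.0,1.0\n"

# Without index_col, the gene names end up in an ordinary data column
raw = pd.read_csv(io.StringIO(toy_csv))

# With index_col=0 the gene names become the index; .T then puts samples in rows
fixed = pd.read_csv(io.StringIO(toy_csv), index_col=0).T

print(raw.columns.tolist())  # the first column name is 'Unnamed: 0'
print(fixed.shape)           # (3, 2): 3 samples as rows, 2 genes as columns
```

Checking `.shape` and the first few index and column labels right after loading is a cheap way to confirm the orientation before any normalization.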

![image-6.png](attachment:image-6.png)
    
#####  Now let us reload `COREAD_gex.csv` with `index_col=0` and `.T`


In [3]:
x1 = pd.read_csv(base_url + "COREAD_gex.csv",index_col=0).T
x1

Unnamed: 0,RNF113A,S100A13,AP3D1,ATP6V1G1,UBQLN4,TPPP3,TSSC4,FOS,ERBB3,CHRAC1,...,CLSTN1,TMEM98,ENOPH1,NOTCH3,HIST1H4C,PLOD1,KIFC1,GNG4,SMIM30,PLPP2
Case1,21.195670,19.726005,11.530217,0.000000,15.356366,12.767472,0.000000,16.489747,17.708855,16.056793,...,21.808384,16.919395,16.402046,16.036894,16.549966,18.373326,16.549827,18.765078,16.932423,15.615291
Case2,21.508658,18.657292,12.988300,14.126750,19.622082,0.000000,0.000000,21.274977,18.115900,12.230063,...,22.529812,19.056564,18.207457,0.000000,19.552257,17.017828,15.357553,15.658792,15.977293,15.185182
Case3,20.080716,18.970339,10.837587,15.313250,0.000000,0.000000,22.374816,0.000000,19.785306,11.779289,...,21.642433,18.559744,15.417365,11.759395,17.047553,16.332485,19.223665,17.712799,17.595020,15.639959
Case4,0.000000,11.883356,10.242480,19.792995,0.000000,0.000000,0.000000,13.031538,0.000000,0.000000,...,11.739006,18.352290,16.249263,12.971004,16.414379,18.558028,16.222916,19.256511,13.663508,16.101776
Case5,0.000000,12.077533,0.000000,0.000000,0.000000,0.000000,11.077866,0.000000,0.000000,0.000000,...,0.000000,18.930303,16.912966,12.679842,17.950233,18.953197,16.699535,19.018383,16.277123,15.351156
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Case117,11.673601,0.000000,0.000000,0.000000,13.527430,0.000000,16.734721,11.906196,11.835825,10.059176,...,11.613499,14.717214,14.195925,16.497371,17.007356,19.990074,18.081018,16.589956,12.700718,12.879539
Case118,11.717598,0.000000,0.000000,20.580360,11.986711,0.000000,14.445359,11.950195,11.879824,11.102489,...,11.657495,19.082116,16.783186,16.491340,18.708697,17.893169,16.506832,17.882752,15.825745,16.221057
Case119,0.000000,13.146131,14.633627,18.139138,0.000000,14.063835,18.105979,12.294651,0.000000,12.768502,...,13.001758,17.433537,16.981739,13.596509,17.932108,16.104134,15.757595,18.764311,14.789536,13.852952
Case120,13.003646,0.000000,19.870922,0.000000,0.000000,0.000000,15.360792,12.651393,12.581015,10.803979,...,12.358662,15.309447,17.466296,13.483838,20.450992,18.700134,14.663050,18.086688,18.072638,13.914297



##### 🩺 Data Structure

- **Rows**: Each row represents a different **patient**
   `Case1`
  `Case2`
  `Case3`


- **Columns**: Each column represents  a different **gene**
 `RNF113A`
 `S100A13`
 `AP3D1`

- **Values**: The values in the matrix represent the expression level of each gene in each patient.

## 🐼1.1.2 Mutation

##### ▶️ Load **COREAD_mu.csv** as `x2` (it has the same data structure as COREAD_gex.csv)


#### Mutation Data Description

In the mutation matrix, each row represents a different gene, and each column represents a different case or sample. The values in the matrix indicate the presence (`1`) or absence (`0`) of a mutation for each gene in each case.
##### 🩺Data Structure

- **Rows**: Each row corresponds to a **gene**:
   `TTN`
   `TP53`
   `APC`


- **Columns**: Each column represents a different **patient**
  `Case1`  `Case2`  `Case3`

 


## 🐼1.1.3 Copy Number Variation
##### ▶️ Load **COREAD_cn.csv** as `x3` (it has the same data structure as COREAD_gex.csv)

In the CNV matrix, each row represents a different copy number variation (CNV) region, and each column represents a different case or sample. The values in the matrix indicate the presence (`1`) or absence (`0`) of a CNV for each region in each case.

##### 🩺Data Structure

- **Rows**: Each row corresponds to a **CNV region**:
  `8p23.2`
  `8p23.3`
  `8p23.1`
- **Columns**: Each column represents a different **patient**:
  `Case1`
  `Case2`
  `Case3`


## 🐼1.1.4 Cancer subtypes
These tumors belong to two molecular subtypes, CMS1 and CMS3, as defined by the Colorectal Cancer Subtyping Consortium.



In [4]:
covariates = pd.read_csv(base_url + "COREAD_subtypes.csv")
covariates

Unnamed: 0.1,Unnamed: 0,dataset,age,gender,stage,pt,pn,pm,tnm,grade,msi,cimp,kras_mut,braf_mut,subtypes,osMo,osStat,rfsMo,rfsStat
0,Case1,tcga,82,female,3.0,3.0,1,0.0,IIIB,,msi,CIMP.High,0.0,0.0,CMS1,16.536986,0.0,,
1,Case2,tcga,71,female,2.0,4.0,0,0.0,IIB/IIC,,msi,CIMP.High,,,CMS1,10.290411,0.0,,
2,Case3,tcga,80,female,2.0,3.0,0,0.0,IIA,,msi,CIMP.High,0.0,1.0,CMS1,10.060274,0.0,,
3,Case4,tcga,84,female,2.0,3.0,0,0.0,IIA,,msi,CIMP.High,0.0,1.0,CMS1,9.402740,0.0,,
4,Case5,tcga,82,male,1.0,2.0,0,0.0,I,,msi,CIMP.High,1.0,1.0,CMS1,2.958904,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,Case117,tcga,77,female,4.0,2.0,0,,I,,mss,,1.0,0.0,CMS3,0.460274,0.0,,
117,Case118,tcga,64,female,3.0,3.0,1,0.0,IIIB,,mss,,1.0,0.0,CMS3,0.657534,0.0,,
118,Case119,tcga,59,male,2.0,3.0,0,0.0,IIA,,msi,,1.0,1.0,CMS3,3.550685,0.0,,
119,Case120,tcga,33,male,3.0,3.0,1,0.0,IIIB,,mss,,0.0,1.0,CMS1,7.791781,0.0,,


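**⚠️`Unnamed: 0` appears in the header row and `0, 1, 2, ...` appears as the row labels, which means that the case IDs were read as an ordinary column and pandas created a default integer index. We fix this with `index_col=0`.**

**⚠️`header=0` tells pandas that the first row of the file holds the column names. This is already the default behavior, so writing it out simply makes the choice explicit.**

#####  Now let us reload `COREAD_subtypes.csv` with `index_col=0`, `header=0`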
##### Example Code📋 

In [5]:
covariates = pd.read_csv(base_url + "COREAD_subtypes.csv",index_col=0, header=0)
covariates

Unnamed: 0,dataset,age,gender,stage,pt,pn,pm,tnm,grade,msi,cimp,kras_mut,braf_mut,subtypes,osMo,osStat,rfsMo,rfsStat
Case1,tcga,82,female,3.0,3.0,1,0.0,IIIB,,msi,CIMP.High,0.0,0.0,CMS1,16.536986,0.0,,
Case2,tcga,71,female,2.0,4.0,0,0.0,IIB/IIC,,msi,CIMP.High,,,CMS1,10.290411,0.0,,
Case3,tcga,80,female,2.0,3.0,0,0.0,IIA,,msi,CIMP.High,0.0,1.0,CMS1,10.060274,0.0,,
Case4,tcga,84,female,2.0,3.0,0,0.0,IIA,,msi,CIMP.High,0.0,1.0,CMS1,9.402740,0.0,,
Case5,tcga,82,male,1.0,2.0,0,0.0,I,,msi,CIMP.High,1.0,1.0,CMS1,2.958904,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Case117,tcga,77,female,4.0,2.0,0,,I,,mss,,1.0,0.0,CMS3,0.460274,0.0,,
Case118,tcga,64,female,3.0,3.0,1,0.0,IIIB,,mss,,1.0,0.0,CMS3,0.657534,0.0,,
Case119,tcga,59,male,2.0,3.0,0,0.0,IIA,,msi,,1.0,1.0,CMS3,3.550685,0.0,,
Case120,tcga,33,male,3.0,3.0,1,0.0,IIIB,,mss,,0.0,1.0,CMS1,7.791781,0.0,,


The covariates matrix contains clinical and molecular information for each patient. Each row represents a different case, with columns providing details on demographics, clinical stage, mutation status, and molecular subtypes.

##### 🩺Data Structure

- **Rows**: Each row corresponds to a **patient**:
   `Case1`
   `Case2`
   `Case3`

- **Columns**: Each column holds a different type of information about the patients. In this project, we only use the case IDs (the index) and the <mark>**subtypes**</mark> column:
  - `dataset`: Data source ( `tcga: the cancer genome atlas`)
  - `age`: Patient's age
  - `gender`: Patient's gender
  - `stage`: Clinical stage of the disease
  - `pt`: Tumor size
  - `pn`: Regional lymph node involvement
  - `pm`: Distant metastasis
  - `tnm`: TNM classification
  - `grade`: Tumor grade
  - `msi`: Microsatellite instability status
  - `cimp`: CpG island methylator phenotype
  - `kras_mut`: KRAS mutation status
  - `braf_mut`: BRAF mutation status
  - <mark>**subtypes**</mark>: Molecular subtype (e.g., `CMS1`, `CMS3`)
  - `osMo`: Overall survival months
  - `osStat`: Overall survival status
  - `rfsMo`: Recurrence-free survival months
  - `rfsStat`: Recurrence-free survival status

##### 🩺CMS1 and CMS3
CMS1 and CMS3 are two of the four consensus molecular subtypes (CMS) of colorectal cancer (CRC) identified by the Colorectal Cancer Subtyping Consortium.

 

##### Since we only use the `subtypes` column of `covariates` in this project, we keep just that column using:
  ```python
data[['column']]
```
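
The double brackets matter here. As a small sketch on a made-up table (the values are hypothetical), single brackets return a `Series`, while double brackets return a one-column `DataFrame`, which is the form that is convenient for merging later:

```python
import pandas as pd

# Hypothetical covariates table with made-up values
toy_cov = pd.DataFrame(
    {"age": [82, 71, 80], "subtypes": ["CMS1", "CMS1", "CMS3"]},
    index=["Case1", "Case2", "Case3"],
)

as_series = toy_cov["subtypes"]    # single brackets -> pandas Series
as_frame = toy_cov[["subtypes"]]   # double brackets -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```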



##### Example Code📋 




In [6]:
covariates_s = covariates[['subtypes']]
covariates_s

Unnamed: 0,subtypes
Case1,CMS1
Case2,CMS1
Case3,CMS1
Case4,CMS1
Case5,CMS1
...,...
Case117,CMS3
Case118,CMS3
Case119,CMS3
Case120,CMS1


## 1.2 Data preprocessing
From 1.1 Data Loading, we have read the datasets as
1. `x1`: **Gene Expression Data (COREAD_gex.csv)**
2. `x2`: **Mutation Data (COREAD_mu.csv)**
3. `x3`: **Copy Number Variation Data (COREAD_cn.csv)**
4. `covariates`: **Subtype Covariates (COREAD_subtypes.csv)**

Now we need to check whether there are missing values and merge the data.

## 🐼1.2.1 Missing Values


Here is an example that checks for missing values in `x1` and shows the number of missing values in each column.

##### ▶️ Check whether `x2`, `x3`, and `covariates_s` have missing values


In [7]:
#x1
x1.isna().sum()

RNF113A     0
S100A13     0
AP3D1       0
ATP6V1G1    0
UBQLN4      0
           ..
PLOD1       0
KIFC1       0
GNG4        0
SMIM30      0
PLPP2       0
Length: 500, dtype: int64

In [114]:
#x2


In [115]:
#x3


In [None]:
#covariates_s


If there are missing values, you can remove or fill them with one of the following options:
```python
df_cleaned_rows = df.dropna(axis=0)  # Remove rows with missing values
df_cleaned_cols = df.dropna(axis=1)  # Remove columns with missing values
df_filled_constant = df.fillna(0)  # Fill missing values with 0
df_filled_median = df.fillna(df.median())  # Fill missing values with the median of each column
df_filled_mode = df.fillna(df.mode().iloc[0])  # Fill missing values with the mode of each column
df_filled_mean = df.fillna(df.mean())  # Fill missing values with the mean of each column
```
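
The options above can be tried on a toy table before applying them to the real data. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a single missing value in GeneA for Case2
toy = pd.DataFrame(
    {"GeneA": [1.0, np.nan, 3.0], "GeneB": [4.0, 5.0, 6.0]},
    index=["Case1", "Case2", "Case3"],
)

n_missing = toy.isna().sum()     # number of missing values per column
rows_kept = toy.dropna(axis=0)   # drops Case2, the only row with a NaN
cols_kept = toy.dropna(axis=1)   # drops GeneA, the only column with a NaN
filled = toy.fillna(toy.mean())  # GeneA's NaN becomes the mean of 1 and 3

print(n_missing.to_dict())
print(filled.loc["Case2", "GeneA"])
```

Dropping rows loses a patient and dropping columns loses a feature, so which option is appropriate depends on how many values are missing and where.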



## 🐼1.2.2 Data Normalization



- [Z-score normalization](#Z-score-normalization)
- [Feature normalization with frobenius norm 🦅](#Feature-normalization-with-frobenius-norm-🦅)

## 


<div style="border-left: 5px solid #1E90FF; padding: 10px; background-color: white; font-size: 16px;">
  <strong>Note</strong>
  <p>-The <strong>z-score normalization</strong> is suitable for <strong>continuous data</strong>, while <strong>one-hot encoding</strong> is for <strong>categorical data</strong>.</p>
      <p>-Normalizing the features using the <strong>Frobenius norm</strong> is suitable for <strong>both</strong> continuous and categorical data.</p>
</div>

##  <div style="background-color: #000000; color: white;padding: 10px;">1. Z-score normalization</div>

## Basics

**Z-score normalization** rescales **each feature** of a dataset so that it has a mean of zero and a standard deviation of one. This puts all features on a comparable scale, so that no feature dominates simply because its values are larger.


## Step1 **Calculate the Mean $\mu $**:
   The mean is the average of all the values of a **feature**.

  
  $$ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

   Where $N$ is the number of values (samples) and $x_i$ is the value of the feature in the $i$-th sample.

## Step2 **Calculate the Standard Deviation $ \sigma $**:
   The standard deviation measures the **dispersion** of the values from the mean.

   $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

## Step3  Calculate $ z $ :
$$ z = \frac{x - \mu}{\sigma}$$
Where:
- $x$ is the original value.
- $ \mu $ is the mean of the feature.
- $\sigma $ is the standard deviation of the feature.
- $ z $ is the standardized value.

   Subtract the mean from each value and then divide by the standard deviation to get the standardized value.


## Example 🎬 
## Here is an example of **Z-score normalization** using 🐼pandas

In [2]:
import pandas as pd
df1 = pd.DataFrame({
    'GeneA': [2, 4, 7],
    'GeneB': [2, 4, 8],
    'GeneC': [3, 6, 6]
}, index=['Sample1', 'Sample2', 'Sample3'])
df1

Unnamed: 0,GeneA,GeneB,GeneC
Sample1,2,2,3
Sample2,4,4,6
Sample3,7,8,6


In [50]:
mean = df1.mean()  # Calculate the mean of each gene
std = df1.std()  # Calculate the sample standard deviation (ddof=1) of each gene
df1_standardized = (df1 - mean) / std  # Calculate z

In [51]:
df1_standardized

Unnamed: 0,GeneA,GeneB,GeneC
Sample1,-0.927173,-0.872872,-1.154701
Sample2,-0.132453,-0.218218,0.57735
Sample3,1.059626,1.091089,0.57735


<div style="border-left: 5px solid #1E90FF; padding: 10px; background-color: white; font-size: 16px;">
  <strong>Note</strong>
  <p>The standard deviation in <code>pandas</code> is calculated using the sample standard deviation:</p>
  <p>
  $$
  \sigma_{\text{sample}} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2} 
  $$
  </p>
  <p>We can add <code>ddof=0</code> to calculate the population standard deviation using:<code>___= [ ].std(ddof=0)</code></p>
  <p>
  $$
  \sigma_{\text{population}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
  $$
  </p>


</div>

In [52]:
mean = df1.mean() # Calculate the Mean
std = df1.std(ddof=0)
df1_standardized = (df1-mean )/ std # Apply the Formula
df1_standardized

Unnamed: 0,GeneA,GeneB,GeneC
Sample1,-1.13555,-1.069045,-1.414214
Sample2,-0.162221,-0.267261,0.707107
Sample3,1.297771,1.336306,0.707107


## We can also use the `sklearn`🐍 to perform z-score normalization 

In [53]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # initialize StandardScaler
df1_standardized_sklearn = scaler.fit_transform(df1)

In [55]:
df1_standardized_sklearn = pd.DataFrame(df1_standardized_sklearn, index=df1.index, columns=df1.columns)
df1_standardized_sklearn

Unnamed: 0,GeneA,GeneB,GeneC
Sample1,-1.13555,-1.069045,-1.414214
Sample2,-0.162221,-0.267261,0.707107
Sample3,1.297771,1.336306,0.707107


The standard deviation in `sklearn`'s `StandardScaler` is the population standard deviation, $\sigma_{\text{population}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$, which is why these results match the pandas results computed with `ddof=0`.
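
As a quick sanity check on the example, the standardized columns should have mean 0 and population standard deviation 1. The sketch below rebuilds the same small `df1` and verifies this with NumPy; it uses only pandas and NumPy, so it does not depend on `sklearn`.

```python
import numpy as np
import pandas as pd

# Same toy data as df1 in the example
df1 = pd.DataFrame(
    {"GeneA": [2, 4, 7], "GeneB": [2, 4, 8], "GeneC": [3, 6, 6]},
    index=["Sample1", "Sample2", "Sample3"],
)

# Z-score with the population standard deviation (ddof=0)
z = (df1 - df1.mean()) / df1.std(ddof=0)

print(np.allclose(z.mean(), 0.0))       # True: every gene has mean 0
print(np.allclose(z.std(ddof=0), 1.0))  # True: every gene has population std 1
```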


As you have already imported the gene expression data `x1` and checked it for missing values, we will use it as an example for normalization.

![image-3.png](attachment:image-3.png)




##### ▶️  Now try to apply z-score normalization to `x1` using `pandas` and `sklearn`


In [None]:
#pandas

In [None]:
#sklearn

##  <div style="background-color: #000000; color: white;padding: 10px;">2. Min-Max normalization</div>
Tutorial: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

 Min-max normalization rescales each feature to the range $[0, 1]$, so the normalized data is non-negative, which makes it suitable for non-negative matrix factorization (NMF) analysis.


In [3]:
df1

Unnamed: 0,GeneA,GeneB,GeneC
Sample1,2,2,3
Sample2,4,4,6
Sample3,7,8,6


In [8]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df1_minmax = pd.DataFrame(scaler.fit_transform(df1), columns=df1.columns, index=df1.index)
df1_minmax

Unnamed: 0,GeneA,GeneB,GeneC
Sample1,0.0,0.0,0.0
Sample2,0.4,0.333333,1.0
Sample3,1.0,1.0,1.0
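
The same result can be obtained by hand in pandas, which makes the formula explicit: each value $x$ in a column becomes $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below is an illustration on the same toy data, not a replacement for `MinMaxScaler`. A column whose values are all identical would lead to a division by zero here and needs separate handling.

```python
import pandas as pd

# Same toy data as df1 in the example
df1 = pd.DataFrame(
    {"GeneA": [2, 4, 7], "GeneB": [2, 4, 8], "GeneC": [3, 6, 6]},
    index=["Sample1", "Sample2", "Sample3"],
)

col_min = df1.min()  # minimum of each gene
col_max = df1.max()  # maximum of each gene

# (x - min) / (max - min), applied column by column through index alignment
df1_minmax_manual = (df1 - col_min) / (col_max - col_min)
print(df1_minmax_manual)
```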


##### ▶️  Now try to conduct the Min-Max normalization on `x1`


In [None]:
#Min-max

##  <div style="background-color: #000000; color: white;padding: 10px;">3. One-Hot Encoding</div>

Tutorial: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [11]:
import pandas as pd
data= {
    'GeneA': [1, -1, 1],
    'GeneB': [0, 1, 1],
    'GeneC': [1, 0, 1],
    'GeneD': [0, -1, 0],
    'GeneE': [1, 0, 1]
}
samples = ['Case1', 'Case2', 'Case3']

df = pd.DataFrame(data, index=samples)
df.columns.name = 'CopyNumberVariation'
df

Copy Number Variation,GeneA,GeneB,GeneC,GeneD,GeneE
Case1,1,0,1,0,1
Case2,-1,1,0,-1,0
Case3,1,1,1,0,1


In [13]:
from sklearn.preprocessing import OneHotEncoder
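# One-Hot Encoding
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse=False` on older versions
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df)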

# Creating a DataFrame for the one-hot encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(df.columns), index=df.index)
encoded_df.columns.name = 'CopyNumberVariation'
encoded_df



CopyNumberVariation,GeneA_-1,GeneA_1,GeneB_0,GeneB_1,GeneC_0,GeneC_1,GeneD_-1,GeneD_0,GeneE_0,GeneE_1
Case1,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
Case2,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
Case3,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
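
For comparison, pandas offers `pd.get_dummies`, which produces a similar table without `sklearn`. This is a sketch on a smaller made-up table (`toy_cnv` is a hypothetical name). Passing `columns=...` tells pandas to treat the numeric columns as categories, and `dtype=int` gives 0/1 values instead of booleans.

```python
import pandas as pd

# Hypothetical copy number table: -1 = loss, 0 = neutral, 1 = gain
toy_cnv = pd.DataFrame(
    {"GeneA": [1, -1, 1], "GeneB": [0, 1, 1]},
    index=["Case1", "Case2", "Case3"],
)

# One indicator column per (gene, observed value) pair
dummies = pd.get_dummies(toy_cnv, columns=list(toy_cnv.columns), dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Unlike a fitted `OneHotEncoder`, `pd.get_dummies` does not remember the categories it saw, so it is less convenient when the same encoding must be applied to new data later.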



##### ▶️  Now try to apply one-hot encoding to `x3`

## Feature normalization with frobenius norm 🦅
## Basics
## 🦅Step 1. Feature normalization
For a matrix $A $, the mean of each column is calculated as:

$$
\mu_j = \frac{1}{m} \sum_{i=1}^m a_{ij}
$$

where $ \mu_j $ is the mean of the $ j $-th column, $ m $ is the number of rows in the matrix, and $a_{ij} $ is the element in the $ i $-th row and $ j $-th column of the matrix.



To normalize the matrix, divide each element by the mean of its respective column to obtain the normalized matrix $A_{norm} $:

$$
a_{ij}^{norm} = \frac{a_{ij}}{\mu_j}
$$

where $a_{ij}^{norm}$ is the element in the normalized matrix, $a_{ij}$ is the element in the original matrix, and $\mu_j$ is the mean of the $j$-th column. After this step, every column of $A_{norm}$ has a mean of 1 (provided that $\mu_j \neq 0$).
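
A minimal NumPy sketch of this step, on a made-up $2 \times 2$ matrix, shows that dividing each column by its own mean leaves every column with a mean of 1. It assumes that no column mean is zero.

```python
import numpy as np

# Hypothetical matrix: rows are samples, columns are features
A = np.array([[1.0, 2.0],
              [3.0, 6.0]])

col_means = A.mean(axis=0)  # mu_j, one mean per column: [2., 4.]
A_norm = A / col_means      # broadcasting divides column j by mu_j

print(A_norm)               # [[0.5 0.5], [1.5 1.5]]
print(A_norm.mean(axis=0))  # [1. 1.]
```
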
## 🦅Step2. Frobenius norm
Frobenius normalization is a process used to standardize the scale of a matrix by normalizing it using its Frobenius norm. This technique ensures that the matrix's values are scaled in a way that is independent of its original magnitude.


The Frobenius norm of a matrix $A$ is defined as:

$$ \|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2} $$

where $a_{ij}$ represents the elements of matrix $A$. In simpler terms, it is the square root of the sum of the squares of all elements in the matrix.


To perform Frobenius norm normalization on a matrix $ A $:

1. **Compute the Frobenius Norm**: Calculate the Frobenius norm $ \|A\|_F $ of the matrix $A $.

2. **Normalize the Matrix**: Divide each element of the matrix by the Frobenius norm. The resulting matrix will have a Frobenius norm of 1.

Mathematically, if $A$ is the original matrix and $\|A\|_F$ is its Frobenius norm, the normalized matrix $A_{norm}$ is:

$$ A_{norm} = \frac{A}{\|A\|_F} $$



As a worked example, consider the matrix

$$
A = \begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{bmatrix}
$$


1. **Compute the Frobenius Norm**:

   The Frobenius norm $ \|A\|_F $is calculated as:

   $$
   \|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}
   $$

   For matrix $A$:

   $$
   \|A\|_F = \sqrt{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2 + 7^2 + 8^2 + 9^2 + 10^2 + 11^2 + 12^2 + 13^2 + 14^2 + 15^2 + 16^2} = \sqrt{1496} \approx 38.68
   $$


2. **Normalize the Matrix**:


   $$
   A_{norm} =\frac{A}{\|A\|_F}= \frac{1}{\sqrt{1496}} \begin{bmatrix}
   1 & 2 & 3 & 4 \\
   5 & 6 & 7 & 8 \\
   9 & 10 & 11 & 12 \\
   13 & 14 & 15 & 16
   \end{bmatrix}= \begin{bmatrix}
   \frac{1}{\sqrt{1496}} & \frac{2}{\sqrt{1496}} & \frac{3}{\sqrt{1496}} & \frac{4}{\sqrt{1496}} \\
   \frac{5}{\sqrt{1496}} & \frac{6}{\sqrt{1496}} & \frac{7}{\sqrt{1496}} & \frac{8}{\sqrt{1496}} \\
   \frac{9}{\sqrt{1496}} & \frac{10}{\sqrt{1496}} & \frac{11}{\sqrt{1496}} & \frac{12}{\sqrt{1496}} \\
   \frac{13}{\sqrt{1496}} & \frac{14}{\sqrt{1496}} & \frac{15}{\sqrt{1496}} & \frac{16}{\sqrt{1496}}
   \end{bmatrix}
   $$
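
The arithmetic for the $4 \times 4$ matrix with entries 1 to 16 can be checked numerically. The sketch below computes the Frobenius norm both from its definition and with `np.linalg.norm`, then confirms that the normalized matrix has a Frobenius norm of 1.

```python
import numpy as np

# The 4x4 matrix with entries 1..16, filled row by row
A = np.arange(1, 17, dtype=float).reshape(4, 4)

fro_manual = np.sqrt((A ** 2).sum())  # square root of the sum of squared entries
fro_numpy = np.linalg.norm(A, "fro")  # NumPy's built-in Frobenius norm

A_norm = A / fro_numpy

print((A ** 2).sum())                 # 1496.0
print(fro_manual)                     # sqrt(1496), roughly 38.68
print(np.linalg.norm(A_norm, "fro"))  # 1 up to floating-point rounding
```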


## Example 1

Here's how you can perform Frobenius norm normalization in Python:



In [30]:
import numpy as np
# `df` here is a small mutation matrix with genes as rows and cases as columns
df

mutation,Case1,Case2,Case3
GeneA,1,0,1
GeneB,0,1,1
GeneC,1,0,1
GeneD,0,1,0
GeneE,1,0,1


In [40]:
df_mean_normalized = df.div(df.mean(axis=0), axis=1)
df_mean_normalized

mutation,Case1,Case2,Case3
GeneA,1.666667,0.0,1.25
GeneB,0.0,2.5,1.25
GeneC,1.666667,0.0,1.25
GeneD,0.0,2.5,0.0
GeneE,1.666667,0.0,1.25


In [41]:
# Calculate the Frobenius norm of the mean-normalized DataFrame
frobenius_norm = np.linalg.norm(df_mean_normalized.values, 'fro')
# Normalize the matrix using the Frobenius norm
A_normalized = df_mean_normalized / frobenius_norm
A_normalized

mutation,Case1,Case2,Case3
GeneA,0.320256,0.0,0.240192
GeneB,0.0,0.480384,0.240192
GeneC,0.320256,0.0,0.240192
GeneD,0.0,0.480384,0.0
GeneE,0.320256,0.0,0.240192


▶️ Now try to normalize the gene expression data `x1` using the Frobenius norm
```python
x1 = pd.read_csv(base_url + "COREAD_gex.csv",index_col=0).T
```


## 🐼1.2.3 Combine the data into a dataframe

## Example1 🎬 
Here are two datasets with fixed row and column names.

In [5]:
data2 = {
    'GeneA': [1, 0, 1],
    'GeneB': [0, 1, 1],
    'GeneC': [1, 1, 0]
}
df2 = pd.DataFrame(data2, index=['Case1', 'Case2', 'Case3'])
df2

Unnamed: 0,GeneA,GeneB,GeneC
Case1,1,0,1
Case2,0,1,1
Case3,1,1,0


In [6]:
# Create the covariates subtypes DataFrame
data_covariates = {
    'subtype': ['Y1','Y2', 'Y1'],
}
df_c = pd.DataFrame(data_covariates, index=['Case1', 'Case2', 'Case3'])
df_c

Unnamed: 0,subtype
Case1,Y1
Case2,Y2
Case3,Y1


##### 🔗Option 1: Merge 2 dataframes

`pd.merge()`: Combines two DataFrames based on a common key. Here, instead of using columns as keys, we use the indexes.

`left_index=True`: Indicates that the index of the left DataFrame (`df2`) should be used for merging.

`right_index=True`: Indicates that the index of the right DataFrame (`df_c`) should be used for merging.

In [7]:
merged_data = pd.merge(df2, df_c, left_index=True,right_index=True)
merged_data

Unnamed: 0,GeneA,GeneB,GeneC,subtype
Case1,1,0,1,Y1
Case2,0,1,1,Y2
Case3,1,1,0,Y1
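
In the example, both tables contain exactly the same cases. When they do not, the default `how="inner"` silently keeps only the cases present in both. The sketch below uses made-up tables (`left` and `right` are hypothetical names) to show the difference between an inner and an outer merge on the index.

```python
import pandas as pd

# Hypothetical tables that share only Case1
left = pd.DataFrame({"GeneA": [1, 0, 1]}, index=["Case1", "Case2", "Case3"])
right = pd.DataFrame({"subtype": ["Y1", "Y2"]}, index=["Case1", "Case4"])

# Default how="inner": keep only cases found in both tables
inner = pd.merge(left, right, left_index=True, right_index=True)

# how="outer": keep every case and fill the gaps with NaN
outer = pd.merge(left, right, left_index=True, right_index=True, how="outer")

print(inner.index.tolist())  # ['Case1']
print(outer.shape)           # (4, 2)
```

Comparing the number of rows before and after a merge is a simple way to notice cases that were dropped.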


In [10]:
group_mean = merged_data.groupby('subtype').mean()
group_mean

Unnamed: 0_level_0,GeneA,GeneB,GeneC
subtype,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Y1,1.0,0.5,0.5
Y2,0.0,1.0,1.0


##### Please use `.T` and `pd.merge` to combine the dataframes below. We will use the merged data to select the important features.
1. `x1`**Gene Expression Data (COREAD_gex.csv)**
2. `x2`**Mutation Data (COREAD_mu.csv)**
3. `x3`**Copy Number Variation Data (COREAD_cn.csv)**
4. `covariates`**Subtype Covariates (COREAD_subtypes.csv)**

##### ▶️  Try to merge the dataframes `x1` and `covariates_s` as `m1`; we will use the merged data to select the features

##### 🔗Option 2: Merge multiple dataframes

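`pd.concat()` is a function that combines two or more DataFrames.

`pd.concat([df1,df2,df3],axis=0)`: Stacks the DataFrames on top of each other (adds **rows**), aligning them by **column** names.

`pd.concat([df1,df2,df3],axis=1)`: Places the DataFrames side by side (adds **columns**), aligning them by **row** names (the index).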
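
A toy illustration of the two `axis` settings, using made-up tables (`a`, `b`, and `c` are hypothetical names): `axis=1` adds features for the same cases, while `axis=0` adds cases for the same features.

```python
import pandas as pd

a = pd.DataFrame({"GeneA": [1, 0]}, index=["Case1", "Case2"])
b = pd.DataFrame({"GeneB": [5, 6]}, index=["Case1", "Case2"])  # same cases, new feature
c = pd.DataFrame({"GeneA": [7]}, index=["Case3"])              # same feature, new case

side_by_side = pd.concat([a, b], axis=1)  # aligned on the row index
stacked = pd.concat([a, c], axis=0)       # aligned on the column names

print(side_by_side.shape)  # (2, 2)
print(stacked.shape)       # (3, 1)
```
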
##### Example Code📋 
```python
combined_data = pd.concat([x1,x2,x3,covariates_s], axis=1)
combined_data
```
##### ▶️ Run the example code to see all the datasets combined

## Exercises:
## 1. **Load** the following datasets from
[https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/], **normalize** the data and **merge** them into a dataframe.
1. **Gene Expression Data (BRCA_gex.csv)**
2. **Mutation Data (BRCA_mu.csv)**
3. **Copy Number Variation Data (BRCA_cn.csv)**
4. **Subtype Covariates (BRCA_subtypes.csv)**

If you have problems, revisit:
- [1.1Data Loading](#1.1Data-Loading)
- [1.2 Data preprocessing](#1.2-Data-preprocessing)
- [🐼1.2.1 Missing Values](#🐼1.2.1-Missing-Values)
- [🐼1.2.2 Data Normalization](#🐼1.2.2-Data-Normalization)
- [🐼1.2.3 Combine the data into a dataframe](#🐼1.2.3-Combine-the-data-into-a-dataframe)


## 2. Data processing
- check missing values
- normalize the data 
- combine one molecular data with clinical data into a dataframe

## 3. Feature selection: choose a method to select features from the gene expression data

🔑 ...we recommend looking through the entire process

In [15]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
base_url = "https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/"
#Data loading
x1 = pd.read_csv(base_url + "COREAD_gex.csv", index_col=0).T
scaler = StandardScaler()
x1= pd.DataFrame(scaler.fit_transform(x1), columns=x1.columns, index=x1.index)
x2 = pd.read_csv(base_url + "COREAD_mu.csv", index_col=0).T
x2[x2 > 0] = 1
x3 = pd.read_csv(base_url + "COREAD_cn.csv", index_col=0).T
covariates = pd.read_csv(base_url+"COREAD_subtypes.csv", index_col=0,header=0)
covariates_s=covariates[['subtypes']]
# Data processing
combined_data = pd.concat([x1,x2,x3,covariates_s], axis=1)
combined_data.T

Unnamed: 0,Case1,Case2,Case3,Case4,Case5,Case6,Case7,Case8,Case9,Case10,...,Case112,Case113,Case114,Case115,Case116,Case117,Case118,Case119,Case120,Case121
RNF113A,2.244998,2.292435,2.076011,-0.967498,-0.967498,-0.967498,-0.967498,-0.967498,-0.967498,0.865736,...,-0.967498,1.316572,-0.967498,-0.967498,0.797412,0.801797,0.808465,-0.967498,1.003383,0.754794
S100A13,2.049263,1.887056,1.934569,0.858921,0.888392,1.027079,-0.944713,-0.944713,0.857578,0.903901,...,1.284307,1.05184,-0.944713,0.873822,0.83548,-0.944713,-0.944713,1.050582,-0.944713,-0.944713
AP3D1,0.477131,0.671197,0.384944,0.305737,-1.057504,-1.057504,-1.057504,-1.057504,-1.057504,-1.057504,...,0.956491,1.496428,-1.057504,-1.057504,-1.057504,-1.057504,-1.057504,0.890185,1.587252,1.339256
ATP6V1G1,-0.963856,0.588405,0.718779,1.211018,-0.963856,1.230441,0.807427,0.574161,-0.963856,-0.963856,...,-0.963856,-0.963856,-0.963856,-0.963856,-0.963856,-0.963856,1.297534,1.02929,-0.963856,-0.963856
UBQLN4,1.476407,2.121418,-0.845599,-0.845599,-0.845599,1.146745,-0.845599,-0.845599,-0.845599,-0.845599,...,-0.845599,1.171413,1.230168,-0.845599,-0.845599,1.199857,0.966888,-0.845599,-0.845599,-0.845599
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17p11.2,0,0,0,0,0,0,0,0,1,0,...,0,0,1,1,0,1,1,0,0,1
15q11.1,0,0,0,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
15q11.2,0,0,0,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
18q12.1,0,0,0,0,1,0,0,0,1,0,...,1,0,1,1,0,1,1,0,0,0
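
Because `pd.concat(..., axis=1)` aligns the tables on their index, a case that is missing from one table does not raise an error; it shows up as `NaN` in the combined data. A minimal sketch of a check that can be run before combining, using made-up tables (`gex` and `sub` are hypothetical names):

```python
import pandas as pd

# Hypothetical tables: Case3 has expression data but no subtype label
gex = pd.DataFrame({"GeneA": [0.1, 0.2, 0.3]}, index=["Case1", "Case2", "Case3"])
sub = pd.DataFrame({"subtypes": ["CMS1", "CMS3"]}, index=["Case1", "Case2"])

# Cases present in the expression table but absent from the subtype table
only_in_gex = gex.index.difference(sub.index)

combined = pd.concat([gex, sub], axis=1)

print(only_in_gex.tolist())               # ['Case3']
print(combined["subtypes"].isna().sum())  # 1 case without a subtype
```

Running `isna().sum()` on the combined table is a simple way to catch such mismatches after the fact.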


In [16]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
base_url = "https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/"
#Data loading
x1 = pd.read_csv(base_url + "BRCA_gex.csv", index_col=0).T
scaler = StandardScaler()
x1= pd.DataFrame(scaler.fit_transform(x1), columns=x1.columns, index=x1.index)
x2 = pd.read_csv(base_url + "BRCA_mu.csv", index_col=0).T
x3 = pd.read_csv(base_url + "BRCA_cn.csv", index_col=0).T
covariates = pd.read_csv(base_url+"BRCA_subtypes.csv", index_col=0,header=0)
covariates_s=covariates[['subtypes']]
# Data processing
combined_data = pd.concat([x1,x2,x3,covariates_s], axis=1)
combined_data.T

Unnamed: 0,Case1,Case2,Case3,Case4,Case5,Case6,Case7,Case8,Case9,Case10,...,Case248,Case249,Case250,Case251,Case252,Case253,Case254,Case255,Case256,Case257
rs_CLEC3A,-0.757209,-0.938169,-0.17848,-0.938169,-0.938169,-0.815983,0.207869,-0.457617,-0.938169,-0.938169,...,-0.938169,0.032302,-0.938169,2.085883,-0.327936,-0.938169,-0.483772,-0.277274,1.446766,-0.596165
rs_CPB1,0.082011,-0.56617,-0.412708,2.697229,-0.365932,-1.164059,-0.75672,-1.297824,-1.258985,0.442628,...,-0.46371,0.266304,-0.874091,-0.26952,0.138615,-0.271569,-0.302352,0.10906,0.099218,-1.27546
rs_SCGB2A2,0.886513,1.510301,-0.044359,1.806181,1.217581,-0.129686,-0.341652,-2.057654,-1.596314,-0.9631,...,0.34358,1.86011,-0.55455,0.641552,0.954708,0.614312,-0.207443,0.923076,0.785303,1.256187
rs_SCGB1D2,0.678601,1.812907,-0.540889,1.585699,1.186311,-0.594309,-0.140016,-1.162469,-1.533906,-1.064777,...,0.083686,1.622835,-0.790942,0.519635,0.149974,0.735778,-0.468318,0.179089,0.681362,1.004827
rs_TFF1,0.950685,0.163327,0.694753,1.161593,-0.070382,-2.175696,-0.44869,-1.995139,-1.737864,0.264389,...,1.345766,1.395001,1.663935,1.054626,0.160557,0.501283,0.654795,0.742825,1.089712,1.120641
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
cn_WNT7B,0,0,1,0,1,1,1,0,0,1,...,1,0,1,0,0,0,1,1,1,1
cn_KLHDC7B,0,0,1,0,1,1,1,0,0,1,...,1,0,1,0,0,0,1,1,1,1
cn_MAPK8IP2,0,0,1,0,1,1,1,0,0,1,...,1,0,1,0,0,0,1,1,1,1
cn_MLC1,0,0,1,0,1,1,1,0,0,1,...,1,0,1,0,0,0,1,1,1,1
