# 1. Data Loading and Preprocessing

In this section, we will load the data from provided URLs and perform some basic preprocessing steps. The datasets include gene expression data, mutation data, copy number variation data, and subtype covariates.

## 🐼1.1 Data Loading

🐼`pandas` is a Python library widely used for data loading and preprocessing.

It helps with reading datasets, handling missing values, and combining data.

We will load the following datasets from <https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/>:
1. **Gene Expression Data (COREAD_gex.csv)**
2. **Mutation Data (COREAD_mu.csv)**
3. **Copy Number Variation Data (COREAD_cn.csv)**
4. **Subtype Covariates (COREAD_subtypes.csv)**
## 🐼1.1.1 Gene expression

##### Example Code📋 

```python
import pandas as pd
# Base URL for data download
base_url = "https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/"
# Load Gene Expression Data
x1 = pd.read_csv(base_url + "COREAD_gex.csv")
# Display the DataFrame
x1

```
##### ▶️ Please run the code and see what `x1` looks like

⚠️**We can see that the first column appears as `Unnamed: 0`, which indicates that `pd.read_csv` does not use the first column as the index by default. Therefore, we need to set the first column as the index using `index_col=0`.**

⚠️**After loading the data, it is important to check the format of the data, including the row and column names.**
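
A minimal sketch of this check, using a tiny in-memory CSV instead of the real file (the gene names, case names, and values are made up for illustration). It compares the default load with `index_col=0` and then prints the shape, row names, and column names:

```python
import io

import pandas as pd

# Toy CSV with the same layout as the course files: the first column holds
# the gene names and its header cell is empty (all values are made up).
toy_csv = ",Case1,Case2\nGeneA,1.5,0.0\nGeneB,2.0,3.5\n"

default_load = pd.read_csv(io.StringIO(toy_csv))
indexed_load = pd.read_csv(io.StringIO(toy_csv), index_col=0)

# Without index_col, the gene names land in a regular column called 'Unnamed: 0'.
print(list(default_load.columns))

# With index_col=0, the gene names become the row index.
print(indexed_load.shape)          # (number of genes, number of cases)
print(list(indexed_load.index))    # row names
print(list(indexed_load.columns))  # column names
```

The same three checks (`.shape`, `.index`, `.columns`) can be run on `x1` after loading it.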
#####  Now let us reload `COREAD_gex.csv` with `index_col=0`
##### Example Code📋 
```python

x1 = pd.read_csv(base_url + "COREAD_gex.csv",index_col=0)
x1
```
##### ▶️ Reload COREAD_gex.csv as `x1` 





The dataset contains gene expression levels across multiple cases. Each row represents a different gene, and each column corresponds to a different case or sample. The values in the matrix are the expression levels of each gene in each case; higher values indicate higher gene expression.

##### Data Structure

- **Rows**: Each row represents a different **gene**, e.g. `RNF113A`, `S100A13`, `AP3D1`
- **Columns**: Each column represents a different **patient**, e.g. `Case1`, `Case2`, `Case3`


## 🐼1.1.2 Mutation
##### ▶️ Load **COREAD_mu.csv** as `x2`

#### Mutation Data Description

In the mutation matrix, each row represents a different gene, and each column represents a different case or sample. The values in the matrix indicate the presence (`1`) or absence (`0`) of a mutation for each gene in each case.
##### Data Structure

- **Rows**: Each row corresponds to a **gene**, e.g. `TTN`, `TP53`, `APC`
- **Columns**: Each column represents a different **patient**, e.g. `Case1`, `Case2`, `Case3`
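
As a quick sanity check on a binary matrix of this shape, one could confirm that it only contains `0` and `1` and count mutations per gene and per case. The sketch below uses a made-up toy matrix (`toy_mu` is a hypothetical name, not the course data):

```python
import pandas as pd

# Toy binary mutation matrix (genes x cases); values are illustrative only.
toy_mu = pd.DataFrame(
    {"Case1": [1, 0, 1], "Case2": [0, 0, 1], "Case3": [1, 0, 1]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Confirm that every entry is either 0 or 1.
is_binary = bool(toy_mu.isin([0, 1]).all().all())

# Row sums: number of mutated cases per gene.
cases_per_gene = toy_mu.sum(axis=1)
# Column sums: number of mutated genes per case.
genes_per_case = toy_mu.sum(axis=0)

print(is_binary)
print(cases_per_gene)
print(genes_per_case)
```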

 


## 🐼1.1.3 Copy Number Variation
##### ▶️ Load **COREAD_cn.csv** as `x3`

In the CNV matrix, each row represents a different copy number variation (CNV) region, and each column represents a different case or sample. The values in the matrix indicate the presence (`1`) or absence (`0`) of a CNV for each region in each case.

##### Data Structure

- **Rows**: Each row corresponds to a **CNV region**, e.g. `8p23.2`, `8p23.3`, `8p23.1`
- **Columns**: Each column represents a different **patient**, e.g. `Case1`, `Case2`, `Case3`


## 🐼1.1.4 Cancer subtypes
These tumors belong to two molecular subtypes, CMS1 and CMS3, as defined by the Colorectal Cancer Subtyping Consortium.



##### Example Code📋 
```python
covariates = pd.read_csv(base_url + "COREAD_subtypes.csv")
covariates
```
##### ▶️ Please run the code and see what the subtype covariates look like

⚠️**We can see that the first column appears as `Unnamed: 0` and the rows are labelled `0`, `1`, `2`, ..., which indicates that `pd.read_csv` does not use the first column as the index by default. Therefore, we need to set the first column as the index using `index_col=0`.**

⚠️**`header=0` tells `pd.read_csv` to take the column names from the first row of the file. This is already the default behaviour, so passing it explicitly only makes the intent visible.**
##### Now let us reload `COREAD_subtypes.csv` with `index_col=0` and `header=0`
##### Example Code📋 
```python
covariates = pd.read_csv(base_url + "COREAD_subtypes.csv",index_col=0, header=0)
covariates
```
##### ▶️ Reload COREAD_subtypes.csv as `covariates`



The covariates matrix contains clinical and molecular information for each patient. Each row represents a different case, with columns providing details on demographics, clinical stage, mutation status, and molecular subtypes.

##### 🩺Data Structure

- **Rows**: Each row corresponds to a **patient**:
   `Case1`
   `Case2`
   `Case3`

- **Columns**: Each column holds a different type of information about the patients. In this project, we only use the <mark>**subtypes**</mark> column (together with the case IDs in the index):
  - `dataset`: Data source ( `tcga: the cancer genome atlas`)
  - `age`: Patient's age
  - `gender`: Patient's gender
  - `stage`: Clinical stage of the disease
  - `pt`: Tumor size
  - `pn`: Regional lymph node involvement
  - `pm`: Distant metastasis
  - `tnm`: TNM classification
  - `grade`: Tumor grade
  - `msi`: Microsatellite instability status
  - `cimp`: CpG island methylator phenotype
  - `kras_mut`: KRAS mutation status
  - `braf_mut`: BRAF mutation status
  - <mark>**subtypes**</mark>: Molecular subtype (e.g., `CMS1`, `CMS3`)
  - `osMo`: Overall survival months
  - `osStat`: Overall survival status
  - `rfsMo`: Recurrence-free survival months
  - `rfsStat`: Recurrence-free survival status

##### 🩺CMS1 and CMS3
CMS1 and CMS3 are two of the four consensus molecular subtypes (CMS) of colorectal cancer (CRC) defined by the Colorectal Cancer Subtyping Consortium.

 

##### For 🩺`covariates`, since we only use the `subtypes` column in this project, we keep just that column using:
  **`data[['column']]`**: Selects only `column` from the `data` DataFrame and returns a one-column DataFrame.
##### Example Code📋 



```python
covariates_s = covariates[['subtypes']]
covariates_s

```
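
The double brackets matter: `data[['column']]` returns a DataFrame, while `data['column']` returns a Series. A small sketch on a made-up table (`toy_cov` and its values are hypothetical) shows the difference and also counts the cases per subtype:

```python
import pandas as pd

# Toy covariates table; ages and subtypes are made up for illustration.
toy_cov = pd.DataFrame(
    {"age": [60, 72, 55], "subtypes": ["CMS1", "CMS3", "CMS1"]},
    index=["Case1", "Case2", "Case3"],
)

as_frame = toy_cov[["subtypes"]]  # double brackets: one-column DataFrame
as_series = toy_cov["subtypes"]   # single brackets: Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# Number of cases in each subtype.
subtype_counts = as_series.value_counts()
print(subtype_counts)
```

Keeping a DataFrame is convenient here because `pd.merge` is later applied to DataFrames.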



##### ▶️ Remove all columns except `subtypes` from the `covariates` DataFrame.

Congratulations! You have loaded all your datasets!🎉

## 1.2 Data preprocessing
From 1.1 Data Loading, we have read the datasets as:
1. `x1`: **Gene Expression Data (COREAD_gex.csv)**
2. `x2`: **Mutation Data (COREAD_mu.csv)**
3. `x3`: **Copy Number Variation Data (COREAD_cn.csv)**
4. `covariates`: **Subtype Covariates (COREAD_subtypes.csv)**

Now we need to check whether there are missing values and merge the data for analysis.

## 🐼1.2.1 Missing Values


We will still use the 🐼`pandas` library. Here is an example to check for missing values in `x1`, which shows the number of missing values in each column. 

##### Example Code📋 

```python
print(x1.isna().sum())  

```
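
For wide tables, the per-column printout is truncated, so a single total can be easier to read. The sketch below shows both on a made-up table with one missing value (`toy` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Toy table with exactly one missing value (illustrative only).
toy = pd.DataFrame(
    {"Case1": [1.0, np.nan], "Case2": [0.5, 2.0]},
    index=["GeneA", "GeneB"],
)

per_column = toy.isna().sum()                # missing values in each column
total_missing = int(per_column.sum())        # missing values in the whole table
has_missing = bool(toy.isna().any().any())   # True if anything is missing

print(per_column)
print(total_missing, has_missing)
```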
##### ▶️ Check if `x1`, `x2`, `x3`, `covariates_s` have missing values



If there are no missing values, let's move on.

## 🐼1.2.2 Merge the data

##### Example 🎬 
Here are two small example datasets with labelled rows and columns.

```python
# Create the first DataFrame (features x cases)
data1 = {
    'Case1': [1, 0, 1],
    'Case2': [0, 1, 1],
    'Case3': [1, 1, 0]
}
d1 = pd.DataFrame(data1, index=['GeneA', 'GeneB', 'GeneC'])
d1
```

| | Case1 | Case2 | Case3 |
|---|---|---|---|
| GeneA | 1 | 0 | 1 |
| GeneB | 0 | 1 | 1 |
| GeneC | 1 | 1 | 0 |


```python
# Create the second DataFrame (covariates)
data_covariates = {
    'Category': ['Y1', 'Y2', 'Y1'],
}
c = pd.DataFrame(data_covariates, index=['Case1', 'Case2', 'Case3'])
c
```

| | Category |
|---|---|
| Case1 | Y1 |
| Case2 | Y2 |
| Case3 | Y1 |


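##### 🔗Step1 Transpose d1 : `.T`
⚠️Normally, we set features (e.g. `genes`) as row names and samples (e.g. `patients`) as column names. To merge with `c`, whose rows are cases, we first transpose `d1` so that the cases become the rows.

```python
# Transpose d1 so that its row names (cases) match the row names of c
d1_T = d1.T
d1_T
```

| | GeneA | GeneB | GeneC |
|---|---|---|---|
| Case1 | 1 | 0 | 1 |
| Case2 | 0 | 1 | 1 |
| Case3 | 1 | 1 | 0 |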


##### 🔗Step2 Merge the DataFrames on the case index

`pd.merge()`: Combines two DataFrames based on a common key. Here, instead of using columns as keys, we use the indexes.

`left_index=True`: Indicates that the index of the left DataFrame (`d1_T`) should be used for merging.

`right_index=True`: Indicates that the index of the right DataFrame (`c`) should be used for merging.

```python
merged_data = pd.merge(d1_T, c, left_index=True, right_index=True)
merged_data
```

| | GeneA | GeneB | GeneC | Category |
|---|---|---|---|---|
| Case1 | 1 | 0 | 1 | Y1 |
| Case2 | 0 | 1 | 1 | Y2 |
| Case3 | 1 | 1 | 0 | Y1 |
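
`pd.merge` performs an inner join by default, so any case that is missing from one of the two tables is silently dropped. The sketch below illustrates this on made-up data in which the case lists deliberately do not match (`features` and `labels` are hypothetical names); `how="outer"` with `indicator=True` is one way to see which side each row came from:

```python
import pandas as pd

features = pd.DataFrame(
    {"GeneA": [1, 0, 1], "GeneB": [0, 1, 1]},
    index=["Case1", "Case2", "Case3"],
)
# Hypothetical labels: Case3 has no label and Case4 has no features.
labels = pd.DataFrame(
    {"Category": ["Y1", "Y2", "Y1"]},
    index=["Case1", "Case2", "Case4"],
)

# Default inner join: only cases present in both tables are kept.
inner = pd.merge(features, labels, left_index=True, right_index=True)

# Outer join with an indicator column to audit where each row came from.
outer = pd.merge(
    features, labels, left_index=True, right_index=True,
    how="outer", indicator=True,
)

print(inner.index.tolist())
print(outer["_merge"].astype(str).to_dict())
```

Comparing the number of rows before and after the merge is a quick way to notice dropped cases.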


##### Please use `.T` and `pd.merge` to combine the datasets below. We will use the combined data to select the important features.
1. `x1`: **Gene Expression Data (COREAD_gex.csv)**
2. `x2`: **Mutation Data (COREAD_mu.csv)**
3. `x3`: **Copy Number Variation Data (COREAD_cn.csv)**
4. `covariates`: **Subtype Covariates (COREAD_subtypes.csv)**

##### ▶️ Now, try to combine the DataFrames `x1` and `covariates_s` as `m1`. We will use the combined data to select the features.

##### ▶️ Combine the DataFrames `x2` and `covariates_s` as `m2`.

##### ▶️ Combine the DataFrames `x3` and `covariates_s` as `m3`. We will use the combined data to select the features.

###### ❓ How to combine all the datasets?
  ANSWER: We can use `pd.concat` to stack `x1`, `x2`, and `x3` first, then run the same process.
##### Example Code📋 
```python
combined_data = pd.concat([x1,x2,x3], axis=0)
combined_data_T=combined_data.T
combined_all=pd.merge(combined_data_T, covariates_s,left_index=True, right_index= True)
```
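
When tables are stacked with `axis=0`, a feature name that occurs in more than one table (for example, a gene that appears in both the expression and the mutation data) ends up as a duplicate row name, and later as a duplicate column name after `.T`. Whether the course files actually share names is not stated here, so the following is only a precautionary sketch on made-up data; the `gex_` and `mu_` prefixes are hypothetical labels:

```python
import pandas as pd

# Toy tables (features x cases); "GeneA" appears in both on purpose.
toy_gex = pd.DataFrame(
    {"Case1": [5.2, 0.0], "Case2": [3.1, 7.4]}, index=["GeneA", "GeneB"]
)
toy_mu = pd.DataFrame(
    {"Case1": [1, 0], "Case2": [0, 0]}, index=["GeneA", "GeneC"]
)

# Plain stacking keeps both "GeneA" rows under the same name.
plain = pd.concat([toy_gex, toy_mu], axis=0)
has_duplicates = bool(plain.index.duplicated().any())
print(has_duplicates)

# One option: prefix each feature name with its source before stacking.
labelled = pd.concat(
    [
        toy_gex.rename(index=lambda name: f"gex_{name}"),
        toy_mu.rename(index=lambda name: f"mu_{name}"),
    ],
    axis=0,
)
print(labelled.index.tolist())
```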

🔑 ...recommended to go through the entire process

```python
import pandas as pd

base_url = "https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/"

# Data loading: use the first column of each file as the row index
x1 = pd.read_csv(base_url + "COREAD_gex.csv", index_col=0)
x2 = pd.read_csv(base_url + "COREAD_mu.csv", index_col=0)
x3 = pd.read_csv(base_url + "COREAD_cn.csv", index_col=0)
covariates = pd.read_csv(base_url + "COREAD_subtypes.csv", index_col=0, header=0)
covariates_s = covariates[['subtypes']]

# Data processing: transpose so that cases are rows, then merge on the index
x1_T = x1.T
x2_T = x2.T
x3_T = x3.T
m1 = pd.merge(x1_T, covariates_s, left_index=True, right_index=True)
m2 = pd.merge(x2_T, covariates_s, left_index=True, right_index=True)
m3 = pd.merge(x3_T, covariates_s, left_index=True, right_index=True)

# Stack all three feature tables, transpose, and merge with the subtypes
combined_data = pd.concat([x1, x2, x3], axis=0)
combined_data_T = combined_data.T
combined_all = pd.merge(combined_data_T, covariates_s, left_index=True, right_index=True)
combined_all
```

| | RNF113A | S100A13 | AP3D1 | ATP6V1G1 | UBQLN4 | TPPP3 | TSSC4 | FOS | ERBB3 | CHRAC1 | ... | 15q13.2 | 7q21.13 | 13q31.1 | 4q35.2 | 7q32.1 | 17p11.2 | 15q11.1 | 15q11.2 | 18q12.1 | subtypes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Case1 | 21.195670 | 19.726005 | 11.530217 | 0.000000 | 15.356366 | 12.767472 | 0.000000 | 16.489747 | 17.708855 | 16.056793 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | CMS1 |
| Case2 | 21.508658 | 18.657292 | 12.988300 | 14.126750 | 19.622082 | 0.000000 | 0.000000 | 21.274977 | 18.115900 | 12.230063 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | CMS1 |
| Case3 | 20.080716 | 18.970339 | 10.837587 | 15.313250 | 0.000000 | 0.000000 | 22.374816 | 0.000000 | 19.785306 | 11.779289 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | CMS1 |
| Case4 | 0.000000 | 11.883356 | 10.242480 | 19.792995 | 0.000000 | 0.000000 | 0.000000 | 13.031538 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | CMS1 |
| Case5 | 0.000000 | 12.077533 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 11.077866 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | CMS1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Case117 | 11.673601 | 0.000000 | 0.000000 | 0.000000 | 13.527430 | 0.000000 | 16.734721 | 11.906196 | 11.835825 | 10.059176 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | CMS3 |
| Case118 | 11.717598 | 0.000000 | 0.000000 | 20.580360 | 11.986711 | 0.000000 | 14.445359 | 11.950195 | 11.879824 | 11.102489 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | CMS3 |
| Case119 | 0.000000 | 13.146131 | 14.633627 | 18.139138 | 0.000000 | 14.063835 | 18.105979 | 12.294651 | 0.000000 | 12.768502 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | CMS3 |
| Case120 | 13.003646 | 0.000000 | 19.870922 | 0.000000 | 0.000000 | 0.000000 | 15.360792 | 12.651393 | 12.581015 | 10.803979 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | CMS1 |
