# 1. Data Loading and Preprocessing

In this section, we will load the data from the provided URLs and perform some basic preprocessing steps. The datasets include gene expression data, mutation data, copy number variation data, and subtype covariates.

## 🐼1.1 Data Loading

🐼`pandas` is a Python library for loading, cleaning, analyzing, and visualizing structured data, built around data structures such as `DataFrame` and `Series`.

We will load the following datasets from <https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/>:
1. **Gene Expression Data (COREAD_gex.csv)**
2. **Mutation Data (COREAD_mu.csv)**
3. **Copy Number Variation Data (COREAD_cn.csv)**
4. **Subtype Covariates (COREAD_subtypes.csv)**
## 🐼1.1.1 Gene expression

##### Example Code📋 

```python
import pandas as pd
# Base URL for data download
base_url = "https://raw.githubusercontent.com/WanbingZeng/OMINEX/main/data/"
# Load Gene Expression Data
x1 = pd.read_csv(base_url + "COREAD_gex.csv")
# Display the DataFrame
x1

```
##### ▶️ Please run the code below and see what `x1` looks like

⚠️**We can see that the first column appears as `Unnamed: 0`, which indicates that `pd.read_csv` by default does not use the first column as the index. Therefore, we need to set the first column as the index using `index_col=0`.**

⚠️**After loading the data, it is important to check the format of the data, including the row and column names.**
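
The effect of `index_col=0` can also be seen without downloading anything. The sketch below uses a tiny in-memory CSV; the gene and case names follow the examples in this section, but the values are made up and this is not the real `COREAD_gex.csv`.

```python
import io
import pandas as pd

# Toy CSV (invented values): the header row starts with an empty field,
# which is what produces the "Unnamed: 0" column name.
toy_csv = (
    ",Case1,Case2,Case3\n"
    "RNF113A,1.5,2.0,0.3\n"
    "S100A13,4.1,3.9,5.2\n"
    "AP3D1,7.0,6.8,7.3\n"
)

# Without index_col, the gene names end up in an ordinary column
raw = pd.read_csv(io.StringIO(toy_csv))
print(raw.columns.tolist())

# With index_col=0, the first column becomes the row index
fixed = pd.read_csv(io.StringIO(toy_csv), index_col=0)
print(fixed.shape)             # (number of genes, number of cases)
print(fixed.index.tolist())    # row names
print(fixed.columns.tolist())  # column names
```

Printing `shape`, `index`, and `columns` right after loading is a quick way to check the row and column names.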
#####  Now let us reload `COREAD_gex.csv` with `index_col=0`
##### Example Code📋 
```python
# Load Gene Expression Data with index_col=0
x1 = pd.read_csv(base_url + "COREAD_gex.csv", index_col=0)
# Display the DataFrame
x1
```
##### ▶️ Reload COREAD_gex.csv as `x1` 





The dataset contains gene expression levels across multiple cases. Each row represents a different gene, and each column corresponds to a different case or sample. The values in the matrix represent the expression level of each gene in each case; higher values indicate higher expression.
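
As a minimal sketch of how this layout can be explored, the example below builds a toy stand-in for `x1` (the name `toy_gex` and all values are invented for illustration) and reads it by row and by column.

```python
import pandas as pd

# Toy stand-in for x1: genes as rows, cases as columns (values are invented)
toy_gex = pd.DataFrame(
    {
        "Case1": [1.5, 4.1, 7.0],
        "Case2": [2.0, 3.9, 6.8],
        "Case3": [0.3, 5.2, 7.3],
    },
    index=["RNF113A", "S100A13", "AP3D1"],
)

n_genes, n_cases = toy_gex.shape
print(f"{n_genes} genes x {n_cases} cases")

# Expression of one gene across all cases (one row)
print(toy_gex.loc["AP3D1"])

# Expression of all genes in one case (one column)
print(toy_gex["Case2"])

# For each gene, the case with the highest expression value
top_case = toy_gex.idxmax(axis=1)
print(top_case)
```

The same calls should work on the real `x1` once it has been loaded with `index_col=0`.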

##### Data Structure

- **Rows**: Each row represents a different **gene**
 `RNF113A`
 `S100A13`
 `AP3D1`

- **Columns**: Each column represents a different **patient**
   `Case1`
  `Case2`
  `Case3`


## 🐼1.1.2 Mutation
##### ▶️ Load **COREAD_mu.csv** as `x2`

#### Mutation Data Description

In the mutation matrix, each row represents a different gene, and each column represents a different case or sample. The values in the matrix indicate the presence (`1`) or absence (`0`) of a mutation for each gene in each case.
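
Because the matrix is binary, row sums and row means have a direct interpretation. The sketch below illustrates this on a toy matrix (`toy_mu` and its values are invented, not taken from `COREAD_mu.csv`).

```python
import pandas as pd

# Toy stand-in for the mutation matrix: 1 = mutated, 0 = not mutated
toy_mu = pd.DataFrame(
    {"Case1": [1, 0, 1], "Case2": [0, 1, 1], "Case3": [1, 0, 1]},
    index=["TTN", "TP53", "APC"],
)

# Confirm that the matrix contains only 0 and 1
is_binary = toy_mu.isin([0, 1]).all().all()
print(is_binary)

# Number of cases with a mutation in each gene (row sums)
mutated_cases = toy_mu.sum(axis=1)
print(mutated_cases)

# Fraction of cases with a mutation in each gene (row means)
mutation_freq = toy_mu.mean(axis=1)
print(mutation_freq)
```
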
##### Data Structure

- **Rows**: Each row corresponds to a **gene**:
   `TTN`
   `TP53`
   `APC`


- **Columns**: Each column represents a different **patient**
  `Case1`  `Case2`  `Case3`

 


## 🐼1.1.3 Copy Number Variation
##### ▶️ Load **COREAD_cn.csv** as `x3`

In the CNV matrix, each row represents a different copy number variation (CNV) region, and each column represents a different case or sample. The values in the matrix indicate the presence (`1`) or absence (`0`) of a CNV for each region in each case.

##### Data Structure

- **Rows**: Each row corresponds to a **CNV region**:
  `8p23.2`
  `8p23.3`
  `8p23.1`
- **Columns**: Each column represents a different **patient**:
  `Case1`
  `Case2`
  `Case3`
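
All three matrices use cases as columns. Assuming they are meant to describe the same set of cases (the text does not state this explicitly), a quick consistency check might look like the sketch below. The toy frames and their values are invented, and the CNV columns are deliberately shuffled to show the reordering step.

```python
import pandas as pd

cases = ["Case1", "Case2", "Case3"]

# Toy stand-ins for x1, x2, and x3 (one row each, invented values)
toy_gex = pd.DataFrame([[1.5, 2.0, 0.3]], index=["AP3D1"], columns=cases)
toy_mu = pd.DataFrame([[1, 0, 1]], index=["TP53"], columns=cases)
toy_cn = pd.DataFrame(
    [[0, 1, 1]], index=["8p23.2"], columns=["Case3", "Case1", "Case2"]
)

# Do the three matrices cover the same set of cases, ignoring order?
same_cases = set(toy_gex.columns) == set(toy_mu.columns) == set(toy_cn.columns)
print(same_cases)

# Are the columns also in the same order?
same_order = toy_gex.columns.equals(toy_cn.columns)
print(same_order)

# Reorder the CNV columns to match the expression matrix
toy_cn_aligned = toy_cn.loc[:, toy_gex.columns]
print(toy_cn_aligned)
```
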


## 🐼1.1.4 Cancer subtypes
These tumors belong to two molecular subtypes, CMS1 and CMS3, as defined by the Colorectal Cancer Subtyping Consortium.

##### ▶️ Load **COREAD_subtypes.csv** as `covariates`


The covariates matrix contains clinical and molecular information for each patient. Each row represents a different case, with columns providing details on demographics, clinical stage, mutation status, and molecular subtypes.

##### Data Structure

- **Rows**: Each row corresponds to a **patient**:
   `Case1`
   `Case2`
   `Case3`

- **Columns**: Each column holds a different type of patient information. In this project, we only use `subtypes`:
  - `dataset`: Data source (`tcga`: The Cancer Genome Atlas)
  - `age`: Patient's age
  - `gender`: Patient's gender
  - `stage`: Clinical stage of the disease
  - `pt`: Pathological T stage (extent of the primary tumor)
  - `pn`: Regional lymph node involvement
  - `pm`: Distant metastasis
  - `tnm`: TNM classification
  - `grade`: Tumor grade
  - `msi`: Microsatellite instability status
  - `cimp`: CpG island methylator phenotype
  - `kras_mut`: KRAS mutation status
  - `braf_mut`: BRAF mutation status
  - <mark>**subtypes**</mark>: Molecular subtype (e.g., `CMS1`, `CMS3`)
  - `osMo`: Overall survival months
  - `osStat`: Overall survival status
  - `rfsMo`: Recurrence-free survival months
  - `rfsStat`: Recurrence-free survival status
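
Since only the `subtypes` column is used, it can be pulled out and summarized directly. The sketch below does this on a toy table (`toy_cov` and its values are invented, and only two of the columns listed above are included).

```python
import pandas as pd

# Toy stand-in for the covariates table: patients as rows (invented values)
toy_cov = pd.DataFrame(
    {
        "age": [65, 58, 71, 49],
        "subtypes": ["CMS1", "CMS3", "CMS1", "CMS3"],
    },
    index=["Case1", "Case2", "Case3", "Case4"],
)

# Keep only the column used in this project
subtypes = toy_cov["subtypes"]

# Number of patients per subtype
subtype_counts = subtypes.value_counts()
print(subtype_counts)

# Names of the cases that belong to CMS1
cms1_cases = subtypes.index[subtypes == "CMS1"].tolist()
print(cms1_cases)
```
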

##### 🩺CMS1 and CMS3
CMS1 and CMS3 are two of the four consensus molecular subtypes (CMS) of colorectal cancer (CRC) defined by the Colorectal Cancer Subtyping Consortium. Each is characterized by distinct molecular and clinical features.
 


 

## 1.2 Data preprocessing