# Lab 7: Extended Gene Expression Analysis Including Co-expression Networks on Brain Cancer Gene Expression (CuMiDa GSE50161 Microarray Experiment)
- Name: AbdelRahman Adel AbdelFattah
- ID: 17012296

## Objective

- Understand and perform the preprocessing of microarray data.
- Conduct principal component analysis (PCA) to explore data.
- Apply regression analysis to investigate specific gene expressions.
- Perform clustering to identify patterns in gene expression data.
- Utilize classification techniques to distinguish between cancer and non-cancer samples.
- Use WGCNA to generate a co-expression network.
- Use Cytoscape to visualize the co-expression network.

## Prerequisites

- Software: R and RStudio.

In [1]:
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("caret")
install.packages("e1071")
install.packages("MASS")

In [2]:
library(ggplot2)
library(gridExtra)
library(caret)
library(e1071)
library(MASS)


## Part 1: Data Acquisition

Access and download the dataset from Kaggle.
https://www.kaggle.com/datasets/brunogrisci/brain-cancer-gene-expression-cumida


In [3]:
file <- "Brain_GSE50161.csv"
data <- read.csv(file)
samples <- 130

In [None]:
head(data)

## Part 2: Preprocessing and Quality Control

Perform initial data checks and normalization.

Hints: remove genes whose expression values are present in fewer than 50% of the samples, remove outliers, and perform normalization.

In [None]:
# Print genes (columns) with missing values in more than 50% of the samples
for(i in colnames(data)){
    n_missing <- sum(is.na(data[,i]))
    if(n_missing > samples * 0.5){
        print(i)
    }
}

In [None]:
# Print samples (rows) that contain at least one missing value
for (i in row.names(data)){
    n_missing <- sum(is.na(data[i,]))
    if(n_missing > 0){
        print(i)
    }
}
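
The hint about dropping genes observed in fewer than 50% of samples can also be expressed without loops. The following is a minimal sketch on a made-up toy table (the names `toy`, `missing_fraction`, and `keep_genes` are illustrative and not part of the lab data):

```r
# Toy expression table: 6 samples (rows) x 4 genes (columns); values are made up
toy <- data.frame(
  gene_a = c(1.2, 2.3, NA, 4.1, 5.0, 3.3),
  gene_b = c(NA, NA, NA, NA, 2.1, 1.9),   # 4 of 6 values missing
  gene_c = c(0.5, 0.7, 0.6, 0.9, 0.8, 0.4),
  gene_d = c(NA, 1.1, 1.3, NA, 1.2, NA)   # exactly 3 of 6 values missing
)

# Fraction of missing values per gene
missing_fraction <- colMeans(is.na(toy))

# Keep genes whose missing fraction does not exceed 50%
keep_genes <- missing_fraction <= 0.5
filtered <- toy[, keep_genes, drop = FALSE]

print(round(missing_fraction, 2))
print(colnames(filtered))
```

Here `gene_b` is dropped because two thirds of its values are missing, while `gene_d` sits exactly at the 50% boundary and is kept.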

In [None]:
numerical_data <- data[, 3:ncol(data)]
t_numerical_data <- as.data.frame(t(data[, 3:ncol(data)]))

In [None]:
boxplot(numerical_data[,1:25])

In [None]:
boxplot(t_numerical_data)

In [None]:
for (x in colnames(numerical_data)) {
    value = numerical_data[,x][numerical_data[,x] %in% boxplot.stats(numerical_data[,x])$out]
    numerical_data[,x][numerical_data[,x] %in% value] = NA
}
head(as.data.frame(colSums(is.na(numerical_data))))

In [None]:
for (x in colnames(t_numerical_data)) {
    value = t_numerical_data[,x][t_numerical_data[,x] %in% boxplot.stats(t_numerical_data[,x])$out]
    t_numerical_data[,x][t_numerical_data[,x] %in% value] = NA
}
head(as.data.frame(colSums(is.na(t_numerical_data))))

In [None]:
no_row_numerical_data<-na.omit(numerical_data)
no_col_numerical_data<-numerical_data[ , colSums(is.na(numerical_data))==0]
dim(no_row_numerical_data)
dim(no_col_numerical_data)

In [None]:
no_t_row_numerical_data<-na.omit(t_numerical_data)
no_t_col_numerical_data<-t_numerical_data[ , colSums(is.na(t_numerical_data))==0]
dim(no_t_row_numerical_data)
dim(no_t_col_numerical_data)

In [None]:
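# Keep samples in rows and genes in columns. Transposing here would make the
# later cbind() with the per-sample label columns fail on mismatched row counts.
numerical_data <- no_col_numerical_data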

In [None]:
scaled_numerical_data <- scale(numerical_data)

In [None]:
scaled_data <- cbind(data[,1:2], scaled_numerical_data)

In [None]:
head(scaled_data)

## Part 3: Principal Component Analysis

### Task 3.1: Compute PCA and View Eigenvalues and Eigenvectors

Load the PCA results into R and create a 2D plot of the first two principal components.

Hints: You can use the ggplot2 package for this task. Example: ggplot(data, aes(x=PC1, y=PC2)) + geom_point().

In [None]:
pca_result <- prcomp(scaled_numerical_data, center = TRUE, scale. = TRUE)
summary(pca_result)
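
The task title mentions eigenvalues and eigenvectors, which `summary()` does not print directly. A minimal sketch of how they could be read from a `prcomp` object, using a simulated matrix as a stand-in for the scaled expression data (`toy_matrix` and `toy_pca` are illustrative names):

```r
set.seed(1)
# Simulated matrix: 20 samples x 5 features (stand-in for the scaled expression data)
toy_matrix <- matrix(rnorm(100), nrow = 20, ncol = 5)
toy_pca <- prcomp(toy_matrix, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- toy_pca$sdev^2
# Eigenvectors (loadings) are the columns of the rotation matrix
eigenvectors <- toy_pca$rotation

# Proportion of variance explained by each component
variance_explained <- eigenvalues / sum(eigenvalues)
print(round(variance_explained, 3))
print(dim(eigenvectors))
```

Because the data are scaled to unit variance, the eigenvalues here are those of the correlation matrix and sum to the number of features.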

In [None]:
pc <- pca_result$x[, 1:3]
pc_df <- as.data.frame(pc)
colnames(pc_df) <- c('PC1', 'PC2', 'PC3')
head(pc_df)

In [None]:
data_with_pca <- cbind(scaled_numerical_data, pc_df)
data_with_pca_with_label <- cbind(data[,1:2], data_with_pca)

In [None]:
ggplot(data_with_pca, aes(x=PC1, y=PC2)) + geom_point()

### Task 3.2: Plotting PC1 vs PC2, PC1 vs PC3, and PC2 vs PC3


Create plots comparing PC1 vs PC2, PC1 vs PC3, and PC2 vs PC3.

Hints: You can use the gridExtra package to arrange multiple plots on a single page.

In [None]:
plot_pc1_pc2 <- ggplot(pc_df, aes(x = PC1, y = PC2)) + geom_point() + labs(title = "PC1 vs PC2")
plot_pc1_pc3 <- ggplot(pc_df, aes(x = PC1, y = PC3)) + geom_point() + labs(title = "PC1 vs PC3")
plot_pc2_pc3 <- ggplot(pc_df, aes(x = PC2, y = PC3)) + geom_point() + labs(title = "PC2 vs PC3")
grid.arrange(plot_pc1_pc2, plot_pc1_pc3, plot_pc2_pc3, ncol = 2)

## Part 4: Regression Analysis

Investigate relationships between genes and cancer existence.

Hints: You can create a logistic regression model using the glm() function. Make sure to set family = binomial for logistic regression.

In [None]:
cancer_types <- scaled_data[, 'type']
# Binary label: 0 for normal tissue, 1 for any tumor type
cancer <- ifelse(cancer_types == 'normal', 0, 1)

In [None]:
data_with_cancer <- cbind(data_with_pca_with_label, cancer)
head(data_with_cancer)

In [None]:
colNames <- colnames(scaled_data)[3:ncol(scaled_data)]
total_size <- length(colNames)
step_size <- floor(total_size / 5)
start_1 <- 1
start_2 <- start_1 + step_size
start_3 <- start_2 + step_size
start_4 <- start_3 + step_size
start_5 <- start_4 + step_size
start_6 <- total_size

In [None]:
reg_formula <- as.formula(paste("cancer ~ ", paste(colNames, collapse = "+")))
reg_model <- glm(reg_formula, data = data_with_cancer, family = binomial)
summary(reg_model)

In [None]:
# Stop one gene before start_2 so consecutive chunks do not share a boundary gene
reg_formula <- as.formula(paste("cancer ~ ", paste(colNames[start_1:(start_2 - 1)], collapse = "+")))
reg_model1 <- glm(reg_formula, data = data_with_cancer, family = binomial)
summary(reg_model1)

In [None]:
# Stop one gene before start_3 so consecutive chunks do not share a boundary gene
reg_formula <- as.formula(paste("cancer ~ ", paste(colNames[start_2:(start_3 - 1)], collapse = "+")))
reg_model2 <- glm(reg_formula, data = data_with_cancer, family = binomial)
summary(reg_model2)

In [None]:
# Stop one gene before start_4 so consecutive chunks do not share a boundary gene
reg_formula <- as.formula(paste("cancer ~ ", paste(colNames[start_3:(start_4 - 1)], collapse = "+")))
reg_model3 <- glm(reg_formula, data = data_with_cancer, family = binomial)
summary(reg_model3)

In [None]:
# Stop one gene before start_5 so consecutive chunks do not share a boundary gene
reg_formula <- as.formula(paste("cancer ~ ", paste(colNames[start_4:(start_5 - 1)], collapse = "+")))
reg_model4 <- glm(reg_formula, data = data_with_cancer, family = binomial)
summary(reg_model4)

In [None]:
reg_formula <- as.formula(paste("cancer ~ ", paste(colNames[start_5:start_6], collapse = "+")))
reg_model5 <- glm(reg_formula, data = data_with_cancer, family = binomial)
summary(reg_model5)
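
Later parts refer to the "significant genes" from the regression analysis (relaxed threshold p < 0.05). One possible way to pull them out of a fitted model is sketched below on simulated data; the data frame `toy`, the gene names `g1` to `g3`, and the effect size are assumptions made only for illustration:

```r
set.seed(42)
n <- 100
# Simulated predictors standing in for scaled gene expression columns
toy <- data.frame(g1 = rnorm(n), g2 = rnorm(n), g3 = rnorm(n))
# Simulated binary outcome that depends on g1 only
toy$cancer <- rbinom(n, size = 1, prob = plogis(1.5 * toy$g1))

toy_model <- glm(cancer ~ g1 + g2 + g3, data = toy, family = binomial)

# Coefficient table with columns: Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- coef(summary(toy_model))
p_values <- coef_table[, "Pr(>|z|)"]

# Drop the intercept and keep predictors below the chosen threshold
p_values <- p_values[names(p_values) != "(Intercept)"]
significant_genes <- names(p_values)[p_values < 0.05]
print(significant_genes)
```

The same extraction could be applied to each of `reg_model1` through `reg_model5` and the resulting names combined with `unique()`.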

## Part 5: Clustering Analysis

### Task 5.1: Clustering using K-means

Explore data grouping based on gene expression data using K-means.

In [None]:
set.seed(5)
kmeans_result <- kmeans(scaled_numerical_data, centers = 5)
kmeans_result

### Task 5.2: Visualize clusters

Visualize cluster output in PC1 vs PC2.

In [None]:
data_with_cancer_with_kmeans <- cbind(data_with_cancer, kmeans_result$cluster)

In [None]:
ggplot(data_with_cancer_with_kmeans, aes(x=PC1, y=PC2, color = factor(kmeans_result$cluster))) + geom_point()
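
Besides the scatter plot, a cross-tabulation of cluster assignments against known labels can show how well the clusters line up with the sample types. This is a minimal sketch that uses the built-in `iris` dataset purely as a stand-in; with the lab data, the label column would be `type`:

```r
set.seed(5)
# iris ships with base R and is used here only as a stand-in dataset
features <- scale(iris[, 1:4])
toy_kmeans <- kmeans(features, centers = 3, nstart = 10)

# Rows: cluster assignments, columns: known labels
cluster_vs_label <- table(cluster = toy_kmeans$cluster, label = iris$Species)
print(cluster_vs_label)
```

A cluster dominated by a single label suggests the grouping recovers that class; clusters that mix labels suggest overlap in expression space.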

## Part 6: Classification Techniques

Note: if you perform feature selection and classify again after fulfilling the Part 6 requirements, you earn a bonus of 5 points.

Split your dataset into 80% training and 20% testing, and make sure the testing set has nearly the same class distribution as the full dataset. Although we are interested here in cancer versus non-cancer, the real number of classes is 5, so for consistent results the test split needs to preserve the distribution of all 5 classes.


In [None]:
cancer_types <- unique(cancer_types)
class_1 <- data_with_cancer_with_kmeans[data_with_cancer_with_kmeans$type == cancer_types[1],]
class_2 <- data_with_cancer_with_kmeans[data_with_cancer_with_kmeans$type == cancer_types[2],]
class_3 <- data_with_cancer_with_kmeans[data_with_cancer_with_kmeans$type == cancer_types[3],]
class_4 <- data_with_cancer_with_kmeans[data_with_cancer_with_kmeans$type == cancer_types[4],]
class_5 <- data_with_cancer_with_kmeans[data_with_cancer_with_kmeans$type == cancer_types[5],]
dim(class_1)
dim(class_2)
dim(class_3)
dim(class_4)
dim(class_5)

In [None]:
classes_tables <- list(class_1, class_2, class_3, class_4, class_5)

In [None]:
train_data <- data.frame()
test_data <- data.frame()
set.seed(130)  # seed once so the whole split is reproducible
for (class_table in classes_tables){
    # Draw a TRUE/FALSE mask per class so each class is split roughly 80/20
    in_train <- sample(c(TRUE, FALSE), nrow(class_table), replace = TRUE, prob = c(0.8, 0.2))
    train_data <- rbind(train_data, class_table[in_train, ])
    test_data <- rbind(test_data, class_table[!in_train, ])
}
dim(train_data)
dim(test_data)
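
Drawing a random TRUE/FALSE mask gives only an approximate 80/20 split within each class. A sketch of an exact per-class split in base R follows; the class names and counts are made up for illustration and are not the counts in the dataset:

```r
set.seed(130)
# Hypothetical labels: 5 classes with unequal sizes (made-up counts)
labels <- rep(c("A", "B", "C", "D", "E"), times = c(40, 30, 25, 20, 15))

# Within each class, sample 80% of the row indices for training
rows_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions should match in both partitions
print(table(labels[train_idx]))
print(table(labels[test_idx]))
```

The caret package also offers `createDataPartition()` for stratified splits, which may be more convenient when caret is already loaded.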

### Task 6.1: Classification using LDA and SVM


- Classify samples into cancer and non-cancer groups using LDA and SVM with k-fold cross-validation, following the steps below.
- Import the necessary libraries, including MASS for LDA, e1071 for SVM, and caret for cross-validation.
- Implement k-fold cross-validation with a specified number of folds (e.g., 5 or 10). Split your dataset into these folds.
- For each fold, do the following:
  - Use the remaining (k-1) folds as the training data.
  - For LDA:
    - Fit an LDA model on this training data.
    - Predict the class labels for the fold left out (validation data) using the LDA model.
  - For SVM:
    - Train an SVM model on this training data with the specified kernel (e.g., RBF).
  - Evaluate the model's performance on the held-out fold (validation data) using metrics like accuracy, precision, recall, and F-score.
- Average the performance metrics (e.g., accuracy) across all folds to obtain an overall estimate of the models' performance for both LDA and SVM.


In [None]:
train <- train_data[, 2:(ncol(train_data)-5)]
test <- test_data[, 2:(ncol(test_data)-5)]

In [None]:
set.seed(123)
num_folds <- 10
folds <- createFolds(train$type, k = num_folds, list = TRUE)

In [None]:
lda_metrics <- matrix(0, nrow = num_folds, ncol = 4)
svm_metrics <- matrix(0, nrow = num_folds, ncol = 4)

In [None]:
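metric_names <- c("Precision", "Recall", "F1")
# svm() and confusionMatrix() need factor labels with identical levels
type_levels <- sort(unique(train$type))
for (fold in 1:num_folds) {
  # Hold out one fold for validation and train on the remaining k-1 folds
  train_indices <- unlist(folds[-fold])
  validation_indices <- folds[[fold]]
  cv_train_data <- train[train_indices, ]
  validation_data <- train[validation_indices, ]
  cv_train_data$type <- factor(cv_train_data$type, levels = type_levels)
  validation_features <- validation_data[, -1]
  validation_labels <- factor(validation_data$type, levels = type_levels)
  # Use type as the response; "cv_train_labels ~ ." would leave type among the predictors
  lda_model <- lda(type ~ ., data = cv_train_data)
  lda_predictions <- predict(lda_model, newdata = validation_features)
  lda_cm <- confusionMatrix(lda_predictions$class, validation_labels)
  svm_model <- svm(type ~ ., data = cv_train_data, kernel = "radial")
  svm_predictions <- predict(svm_model, newdata = validation_features)
  svm_cm <- confusionMatrix(svm_predictions, validation_labels)
  # Store 4 values per fold: accuracy plus macro-averaged precision, recall, and F1
  # ($overall has 7 entries and would not fit the 4-column metric matrices)
  lda_metrics[fold, ] <- c(lda_cm$overall["Accuracy"], colMeans(lda_cm$byClass[, metric_names], na.rm = TRUE))
  svm_metrics[fold, ] <- c(svm_cm$overall["Accuracy"], colMeans(svm_cm$byClass[, metric_names], na.rm = TRUE))
}
lda_avg_metrics <- colMeans(lda_metrics, na.rm = TRUE)
svm_avg_metrics <- colMeans(svm_metrics, na.rm = TRUE)
names(lda_avg_metrics) <- c("Accuracy", metric_names)
names(svm_avg_metrics) <- c("Accuracy", metric_names)
print("LDA Average Metrics:")
print(lda_avg_metrics)
print("SVM Average Metrics:")
print(svm_avg_metrics)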

### Task 6.2: Metrics reporting for both methods

- Report overall accuracy, precision, recall, and F-score, and plot the confusion matrix for each method's output.
- Compare the classification accuracy obtained with the significant genes from regression against the accuracy obtained with the whole set of genes.
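
As a reference for how these metrics relate to a confusion matrix, here is a small base-R sketch for the binary cancer versus normal case. The label vectors are made up for illustration and "cancer" is treated as the positive class:

```r
# Made-up true labels and predictions for a binary problem
truth <- factor(c(rep("cancer", 6), rep("normal", 4)), levels = c("cancer", "normal"))
predicted <- factor(c("cancer", "cancer", "cancer", "cancer", "cancer", "normal",
                      "cancer", "normal", "normal", "normal"),
                    levels = c("cancer", "normal"))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(predicted = predicted, truth = truth)
print(cm)

tp <- cm["cancer", "cancer"]  # cancer samples predicted as cancer
fp <- cm["cancer", "normal"]  # normal samples predicted as cancer
fn <- cm["normal", "cancer"]  # cancer samples predicted as normal

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_score   <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f_score = f_score))
```

With these made-up labels the accuracy is 0.8 and precision, recall, and F-score all equal 5/6. For the 5-class case, caret's `confusionMatrix()` reports the same quantities per class in its `byClass` element.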

### Task 6.3: Choice of the best method based on metrics

- If you are asked to choose the best classifier from your output metrics, which would you choose? Provide a reason for your choice.
- In other contexts, would your choice depend only on the method with the best accuracy? Why or why not?

Hints: Implement LDA using the lda() function from the MASS package and create the SVM model using svm() from the e1071 package.
To calculate accuracy, you can use the caret package.
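
When thinking about whether accuracy alone is enough, it can help to look at an imbalanced toy case. The sketch below uses hypothetical counts (90 cancer, 10 normal) and a trivial classifier that always predicts "cancer"; none of these numbers come from the lab dataset:

```r
# Hypothetical imbalanced labels: 90 cancer samples and 10 normal samples
truth <- factor(rep(c("cancer", "normal"), times = c(90, 10)),
                levels = c("cancer", "normal"))
# Trivial classifier that ignores the data and always answers "cancer"
always_cancer <- factor(rep("cancer", length(truth)), levels = levels(truth))

accuracy <- mean(always_cancer == truth)

# Recall per class: fraction of each true class that is predicted correctly
recall_cancer <- sum(always_cancer == "cancer" & truth == "cancer") / sum(truth == "cancer")
recall_normal <- sum(always_cancer == "normal" & truth == "normal") / sum(truth == "normal")
balanced_accuracy <- mean(c(recall_cancer, recall_normal))

print(c(accuracy = accuracy, recall_normal = recall_normal,
        balanced_accuracy = balanced_accuracy))
```

The accuracy is 0.9 even though no normal sample is ever recognized, while the balanced accuracy is 0.5.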

## Part 7: WGCNA Analysis and Cytoscape Visualization

### Task 7.1: WGCNA Analysis

1. Import the necessary R libraries for WGCNA analysis, including WGCNA and stats.
2. Construct the network:
3. Use the blockwiseModules function from the WGCNA package to construct the gene co-expression network.
4. Parameters to consider include the soft thresholding power, network type (signed or unsigned), and module detection settings.
5. Obtain module assignments for each gene, for example by converting the module labels returned by blockwiseModules to colors with labels2colors (commonly stored in a variable named moduleColors).
6. Optionally, use cutreeDynamic to refine module assignments.
7. Create visualizations such as plots, dendrograms, and heatmaps to understand the network's structure and module relationships.
8. Extract the network of the significant genes (those obtained from the regression analysis, using a relaxed significance threshold of p < 0.05).
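
The steps above can be sketched roughly as follows. This is only an outline under stated assumptions: the expression matrix is simulated (three groups of co-expressed genes plus noise) rather than taken from the lab data, and the parameter values (`minModuleSize = 10`, `mergeCutHeight = 0.25`, fallback power 6) are illustrative choices, not recommendations:

```r
library(WGCNA)

set.seed(7)
n_samples <- 60
# Simulated expression: three groups of 30 co-expressed genes plus 30 noise genes.
# Rows are samples and columns are genes, which is the layout WGCNA expects.
make_module <- function(n_genes) {
  driver <- rnorm(n_samples)
  sapply(seq_len(n_genes), function(i) driver + rnorm(n_samples, sd = 0.5))
}
expr <- cbind(make_module(30), make_module(30), make_module(30),
              matrix(rnorm(n_samples * 30), nrow = n_samples))
colnames(expr) <- paste0("gene", seq_len(ncol(expr)))

# Scan candidate powers; fall back to 6 if no power meets the scale-free criterion
sft <- pickSoftThreshold(expr, powerVector = 1:12, networkType = "signed", verbose = 0)
power <- if (is.na(sft$powerEstimate)) 6 else sft$powerEstimate

# Build the network and detect modules in one call
net <- blockwiseModules(expr, power = power, networkType = "signed", TOMType = "signed",
                        minModuleSize = 10, mergeCutHeight = 0.25,
                        numericLabels = TRUE, verbose = 0)

# Convert numeric module labels to color names, one per gene
module_colors <- labels2colors(net$colors)
print(table(module_colors))

# Dendrogram of the first block with module colors underneath
plotDendroAndColors(net$dendrograms[[1]], module_colors[net$blockGenes[[1]]],
                    "Module colors", dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
```

With the lab data, `expr` would be replaced by the samples-by-genes expression matrix, and the subnetwork of significant genes could be obtained by subsetting the columns before or after network construction.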

### Task 7.2: Export Network Data for Cytoscape

1. Export the gene co-expression network data from R in a format compatible with Cytoscape, such as CSV or TXT. Ensure that your exported file includes information about nodes (genes) and their connections (edges). This file should contain details about which genes are connected to each other in the network.
2. Download and install Cytoscape from the official website (https://cytoscape.org/download.html) if you haven't already done so.
3. Launch Cytoscape and use its import functionality to bring in the network data exported from R. This will create the base network visualization.
4. Simplify the visualization by focusing on the most important parts of the network:
  - Utilize layout algorithms within Cytoscape to arrange nodes (genes) in a visually informative way. Experiment with different layouts to find the one that best represents the network's structure.
  - Adjust node properties, such as size, color, and shape, to convey biological significance. Highlight the significant genes (those obtained from the regression analysis) by assigning distinctive colors or larger sizes to them.
  - Modify edge properties, including width and color, to represent co-expression strength effectively. Stronger co-expression can be indicated by thicker or differently colored edges.
5. Extract subnetworks containing the significant genes (those obtained from the regression analysis):
  - Use the significant genes as seeds or starting points.
  - Visualize these subnetworks separately or in conjunction with the base network.
6. Install and activate the gProfiler or enricher plugin within Cytoscape, if it's not already installed. You can install it via the Cytoscape App Manager. Use the gProfiler or enricher plugin to perform gene enrichment analysis on the subnetworks extracted in the previous step. This analysis will provide insights into the biological functions, pathways, and processes associated with the selected genes.
7. Visualize the results of the gene enrichment analysis within Cytoscape, which may include enriched pathways, Gene Ontology (GO) terms, and other relevant annotations.
8. Customize the visualization of enrichment results to highlight the most significant terms or pathways associated with the genes in your network or subnetworks.
9. Export the enriched gene sets, pathways, or GO terms as part of your final network visualization or as separate reports for inclusion in publications or presentations.
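
For step 1, one possible way to write node and edge tables that Cytoscape can import is sketched below in base R. The expression matrix is simulated, the power of 6 and the list of significant genes are assumed values, and the file names are arbitrary:

```r
set.seed(3)
# Simulated expression matrix: 50 samples x 6 genes (stand-in for real data)
expr <- matrix(rnorm(300), nrow = 50, dimnames = list(NULL, paste0("gene", 1:6)))
# Hypothetical list of regression-significant genes
significant_genes <- c("gene2", "gene5")

# Unsigned adjacency with soft-thresholding power 6 (an assumed value)
adjacency_matrix <- abs(cor(expr))^6

# One row per gene pair, taken from the upper triangle to avoid duplicates
pairs <- which(upper.tri(adjacency_matrix), arr.ind = TRUE)
edges <- data.frame(
  source = rownames(adjacency_matrix)[pairs[, "row"]],
  target = colnames(adjacency_matrix)[pairs[, "col"]],
  weight = adjacency_matrix[pairs]
)

# Node table with a flag that can drive node color or size in Cytoscape
nodes <- data.frame(id = colnames(expr),
                    significant = colnames(expr) %in% significant_genes)

edge_file <- file.path(tempdir(), "edges.csv")
node_file <- file.path(tempdir(), "nodes.csv")
write.csv(edges, edge_file, row.names = FALSE)
write.csv(nodes, node_file, row.names = FALSE)
print(head(edges))
```

In practice, weak edges would usually be filtered with a weight threshold before export to keep the network readable. The WGCNA package also provides `exportNetworkToCytoscape()`, which writes comparable edge and node files from an adjacency or TOM matrix.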