[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/giordamaug/EG-identification---Data-Science-in-App-Springer/blob/main/notebook/EssentialGenes_Regression.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/giordamaug/EG-identification---Data-Science-in-App-Springer/main?filepath=notebook%2FEssentialGenes_Regression.ipynb)

# Loading required libraries

In [None]:
!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip install -q torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if not IN_COLAB:
    !pip install -q pandas
    !pip install -q scikit-learn
    !pip install -q imbalanced-learn
    !pip install -q xgboost
    !pip install -q tqdm

In [1]:
import warnings
warnings.filterwarnings('ignore')
import random
import numpy as np
import pandas as pd
def set_seed(seed=1):
    random.seed(seed)
    np.random.seed(seed)

# Download dataset from GitHub

In [36]:
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/ppi.csv
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/labels.csv
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/bio_attributes.csv
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/net_attributes.csv
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/gtex_attributes.csv
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/ppi+met.csv
!wget https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/scores.csv

--2022-05-03 22:06:40--  https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/ppi.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3966521 (3.8M) [text/plain]
Saving to: ‘ppi.csv.1’


2022-05-03 22:06:40 (48.6 MB/s) - ‘ppi.csv.1’ saved [3966521/3966521]

--2022-05-03 22:06:40--  https://raw.githubusercontent.com/giordamaug/EG-identification---Data-Science-in-App-Springer/main/data/labels.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71739 (70K) [text/plain]
Saving to: ‘la

# Load the labels
Only a subset of genes is selected for classification:
+ genes belonging to the CS0 group, which are labeled as Essential (E);
+ genes belonging to the CS6, CS7, CS8, and CS9 groups, which are labeled as Not-Essential (NE).

All remaining genes belong to the intermediate groups (CS1-CS5) and are considered undetermined (label ND).
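
The labeling rule above can be illustrated with a minimal sketch. It assumes that each gene carries a group name of the form `CS<k>`; the helper `label_from_group`, the toy gene names, and the series layout are hypothetical and are not taken from the dataset files.

```python
import pandas as pd

def label_from_group(group: str) -> str:
    """Map a group name such as 'CS3' (assumed format) to E, NE, or ND."""
    k = int(group[2:])          # numeric part of the group name
    if k == 0:
        return "E"              # CS0 -> Essential
    if 6 <= k <= 9:
        return "NE"             # CS6..CS9 -> Not-Essential
    return "ND"                 # CS1..CS5 -> undetermined

# Toy data: four hypothetical genes and their groups
groups = pd.Series(["CS0", "CS3", "CS7", "CS9"],
                   index=["g1", "g2", "g3", "g4"], name="group")
labels_demo = groups.map(label_from_group)

# Keep only the genes with a defined label (E or NE)
selected = labels_demo[labels_demo != "ND"]
print(selected)
```

Only the genes in `selected` would then be passed on to the later steps.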

In [12]:
datapath='../data'
scores = pd.read_csv(f"{datapath}/scores.csv", index_col='name')
genes = scores.index.values                                             # get genes with defined labels (E or NE)
print(f'Selected {len(genes)} genes')

Selected 11696 genes


# Load attributes to be used
We identified three sets of attributes:
1. bio attributes, related to gene information (such as expression);
2. net attributes, derived from the role of the gene/node in the network (such as degree and centrality);
3. GTEX-* attributes, providing additional biological information about genes.

Based on the user's selection, the node attributes are concatenated into a single attribute matrix (`x`).

The attribute matrix `x` can contain NaN or infinite values. They are corrected as follows:
+ a NaN is replaced by the mean of the attribute's values;
+ an infinite value is replaced by the maximum of the attribute's values.

After the NaN and infinite values are fixed, the attributes are normalized with the Z-score or MinMax normalization function.

Finally, only the nodes (genes) with E or NE labels are selected for classification.
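
As an illustration of the rules above, here is a minimal sketch on a toy matrix. The function names `clean` and `zscore` are hypothetical. Two details are assumptions rather than something stated in this notebook: the mean and maximum are computed over the finite values only, and a negative infinity is replaced by the finite minimum.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Replace NaN by the column mean and +inf by the column maximum."""
    out = df.astype(float).copy()
    for col in list(out.columns):
        vals = out[col].to_numpy()
        finite = vals[np.isfinite(vals)]
        if finite.size == 0:                 # nothing usable: drop the column
            out = out.drop(columns=col)
            continue
        vals = np.where(np.isposinf(vals), finite.max(), vals)
        vals = np.where(np.isneginf(vals), finite.min(), vals)  # assumption
        vals = np.where(np.isnan(vals), finite.mean(), vals)
        out[col] = vals
    return out

def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Column-wise Z-score normalization."""
    return (df - df.mean()) / df.std()

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [np.nan] * 4,                       # all-NaN column
    "c": [1.0, 2.0, 3.0, 4.0],
})
cleaned = clean(toy)
normalized = zscore(cleaned)
print(cleaned)
print(normalized)
```

Computing the statistics on finite values only keeps a single infinite entry from turning the column mean into `inf` or NaN.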

In [65]:
#@title Choose attributes { form-width: "20%" }
normalize_node = "zscore" #@param ["", "zscore", "minmax"]
bio = True #@param {type:"boolean"}
gtex = True #@param {type:"boolean"}
net = False #@param {type:"boolean"}
variable_name = "bio"
bio_df = pd.read_csv(f"{datapath}/bio_attributes.csv", index_col='name') if bio else pd.DataFrame()
gtex_df = pd.read_csv(f"{datapath}/gtex_attributes.csv", index_col='name') if gtex else pd.DataFrame()
net_df = pd.read_csv(f"{datapath}/net_attributes.csv", index_col='name') if net else pd.DataFrame()
x = pd.concat([bio_df, gtex_df, net_df], axis=1)
print(f'Found {x.isnull().sum().sum()} NaN values and {np.isinf(x).values.sum()} Infinite values')
for col in x.columns[x.isna().any()].tolist():
  mean_value = x[col].mean()        # Replace NaNs in column with the mean of values in the same column
  if not np.isnan(mean_value):
    x[col] = x[col].fillna(mean_value)
  else:                             # otherwise, if the mean is NaN, remove the column
    x = x.drop(columns=col)
if normalize_node == 'minmax':
  print("X attributes normalization (minmax)...")
  x = (x-x.min())/(x.max()-x.min())
elif normalize_node == 'zscore':
  print("X attributes normalization (zscore)...")
  x = (x-x.mean())/x.std()
x = x.loc[genes]
print(f'New attribute matrix x{x.shape}')

Found 15705 NaN values and 0 Infinite values
X attributes normalization (zscore)...
New attribute matrix x(11696, 105)


# Load the PPI+MET network
The PPI network is loaded from a CSV file, where
*   `A` is the column name for the edge source (gene name)
*   `B` is the column name for the edge target (gene name)
*   `weight` is the column name for the edge weight

Only some methods use the PPI network, for example all GCN methods and Node2Vec.

The PPI+MET network is reduced by removing genes with undetermined labels.
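
The reduction step can be sketched on a toy edge list as follows. The gene names, the weights, and the variable names `edges`, `kept_genes`, and `edge_index` are made up for illustration; only the column names `A`, `B`, and `weight` follow the description above.

```python
import numpy as np
import pandas as pd

# Toy edge list using the column layout described above
edges = pd.DataFrame({
    "A": ["g1", "g1", "g2", "g4"],
    "B": ["g2", "g3", "g4", "g1"],
    "weight": [0.9, 0.4, 0.7, 0.2],
})
kept_genes = ["g1", "g2", "g4"]              # genes with a defined label

# Keep an edge only if both endpoints are among the kept genes
kept = edges[edges["A"].isin(kept_genes) & edges["B"].isin(kept_genes)]

# Map gene names to consecutive integer node indices
gene_to_idx = {g: i for i, g in enumerate(kept_genes)}
edge_index = np.vstack([
    kept["A"].map(gene_to_idx).to_numpy(),
    kept["B"].map(gene_to_idx).to_numpy(),
])
print(edge_index)
```

The result is a 2 x E integer array (sources in the first row, targets in the second), which is the layout that graph libraries such as PyTorch Geometric expect for an edge index.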

In [88]:
ppi = pd.read_csv(f"{datapath}/ppi+met.csv")                                               # read PPI+MET network from CSV file
ppi = ppi.loc[((ppi['A'].isin(genes)) & (ppi['B'].isin(genes)))]           # reduce network only to selected nodes/genes
idxlbl = scores.reset_index(drop=True)
idxlbl['name'] = scores.index
map_gene_to_idx = { v['name']: i  for i,v in idxlbl.to_dict('index').items() }   # gene name -> row position
vfunc = np.vectorize(lambda t: map_gene_to_idx[t])
import torch
edges_index = torch.from_numpy(vfunc(ppi[['A','B']].to_numpy().T)) 

## Add node2vec embedding

In [90]:
from torch_geometric.nn.models import Node2Vec
import torch.optim as optim
import torch_cluster
PARAMS = {
    'embedding_dim': 128,
    'walk_length': 64,
    'context_size': 64,
    'walks_per_node': 64,
    'num_negative_samples': 1,
}
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
LR = 1e-2
WEIGHT_DECAY = 5e-4
EPOCHS = 50

def n2vembed(edge_index, epochs=100, log=False):
    """Train Node2Vec on `edge_index` and return the learned node embeddings."""
    n2v = Node2Vec(edge_index, **PARAMS).to(DEVICE)
    n2v_loader = n2v.loader(batch_size=128, shuffle=True, num_workers=0)
    n2v_optimizer = optim.Adam(n2v.parameters(), lr=LR)

    n2v.train()
    for i in range(epochs):
        n2v_train_loss = 0

        # one pass over the batches of positive and negative random walks
        for (pos_rw, neg_rw) in n2v_loader:
            n2v_optimizer.zero_grad()
            loss = n2v.loss(pos_rw.to(DEVICE), neg_rw.to(DEVICE))
            loss.backward()
            n2v_optimizer.step()
            n2v_train_loss += loss.item()
        if log: print(f'Epoch {i}. N2V Train_Loss:', n2v_train_loss)
    n2v.eval()
    Z = n2v().detach()              # embedding matrix, one row per node
    return Z
Z = n2vembed(edges_index, epochs=EPOCHS,log=True)

Epoch 0. N2V Train_Loss: 302.2271947860718
Epoch 1. N2V Train_Loss: 220.84252607822418


In [89]:
df = pd.DataFrame(Z.cpu().numpy()) #convert to a dataframe
df = df.set_index(x.index)
x = pd.concat([x, df], axis=1)

tensor([[  132,   132,   132,  ...,  1866, 11569,  8938],
        [  136,  7818,  2408,  ...,  7404, 11341,  3158]])

# Regression


In [78]:
#@title Choose classifier { run: "auto", form-width: "20%" }
method = "XGB" #@param ["SVM", "XGB", "RF", "MLP", "RUS", "DUMMY"]
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tqdm import tqdm
from sklearn.metrics import *
import sys
sys.path.append('.')
from multiscorer import MultiScorer
from numpy import absolute,average
seed=1
set_seed(seed)

scorer = MultiScorer({
    'mean_absolute_error'  : (mean_absolute_error , {}),
    'R2' : (r2_score, {}),
})


X = x.to_numpy()
y = scores.values[:,-1]
columns_names = ["mean_absolute_error"]
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
# define model
model = LinearRegression()

df_results = pd.DataFrame(columns=columns_names)
# evaluate model
cross_val_score(model, X, y, scoring=scorer, cv=cv)
results = scorer.get_results()
# report the mean of each metric over all folds and repeats
for metric in results.keys():
  print("%s: %.3f" % (metric, average(results[metric])))


ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
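
The error message above is the one scikit-learn's stratified splitters raise when they receive a continuous target; `RepeatedKFold` does not stratify and accepts continuous values. The `MultiScorer` helper comes from a local `multiscorer` module that is not shown here, so as an alternative the following minimal sketch collects the same two metrics with scikit-learn's own `cross_validate`. The synthetic data (`X_demo`, `y_demo`) and the metric keys `mae` and `r2` are assumptions made for illustration; they are not the notebook's data or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic regression problem standing in for (X, y)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

cv_demo = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
cv_results = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    scoring={"mae": "neg_mean_absolute_error", "r2": "r2"},
)

# scikit-learn reports errors as negative scores, so flip the sign
mae = -cv_results["test_mae"]
r2 = cv_results["test_r2"]
print("mean_absolute_error: %.3f" % mae.mean())
print("R2: %.3f" % r2.mean())
```

Each array holds one value per fold and repeat (10 x 5 = 50 here), so the means are comparable to the averages printed by the loop above.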

# Print predictions

In [46]:
p = np.zeros(len(y))
p[mm] = predictions
labels['predictions'] = ['NE' if x>0 else 'E' for x in p]
labels

| name | CS0_vs_CS6-9 | predictions |
|---|---|---|
| ENSG00000001036 | NE | NE |
| ENSG00000001461 | NE | NE |
| ENSG00000001561 | NE | NE |
| ENSG00000001630 | NE | NE |
| ENSG00000001631 | NE | NE |
| ... | ... | ... |
| ENSG00000288283 | NE | NE |
| ENSG00000288359 | NE | NE |
| ENSG00000288407 | NE | NE |
| ENSG00000288478 | NE | NE |
