# 7.2. DATA PREPARATION FOR MMSBM (discretizations)

Models we'll create (always using an SVC). D = discretized, ND = non-discretized:
- a) Genes D (T=5)
- b) Genes D (T=6)
- c) Genes ND


- d) Cells D (T=6)
- e) Cells D (T=7)
- f) Cells ND


- g) Genes D (5) + Cells D (6)
- h) Genes ND + Cells ND


- i) Genes D (5) + Cells D (6) + dosage (type, time, dose)
- j) Genes ND + Cells ND + dosage (same as the original dataset, we'll use the one already created)
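The dosage block in models i) and j) is made of the `cp_type`, `cp_time` and `cp_dose` columns. The following is a minimal sketch of one way to turn them into numeric features, assuming the category labels used in this notebook (`trt_cp`/`ctl_vehicle`, `D1`/`D2`); the toy rows and the names `toy` and `encoded` are invented for illustration.

```python
import pandas as pd

# Toy rows mimicking the three dosage columns (values invented for illustration)
toy = pd.DataFrame(
    {
        "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
        "cp_time": [24, 72, 48],
        "cp_dose": ["D1", "D1", "D2"],
    },
    index=["id_a", "id_b", "id_c"],
)

encoded = toy.copy()
# Coding used in this notebook: treated = 0, control = 1; D1 = 0, D2 = 1
encoded["cp_type"] = toy["cp_type"].map({"trt_cp": 0, "ctl_vehicle": 1})
encoded["cp_dose"] = toy["cp_dose"].map({"D1": 0, "D2": 1})
print(encoded.dtypes)
```

Using `.map` yields integer columns, whereas assigning through `.loc` into a string column may leave it with `object` dtype.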


In [1]:
import pandas as pd

features = pd.read_csv("../lish-moa/train_features.csv", index_col=0)
targets = pd.read_csv("../lish-moa/train_targets_scored.csv", index_col=0)
#cp_type:
features.loc[features.cp_type == "trt_cp", "cp_type"] = 0
features.loc[features.cp_type == "ctl_vehicle", "cp_type"] = 1 #CONTROL = 1
#cp_dose:
features.loc[features.cp_dose == "D1", 'cp_dose'] = 0
features.loc[features.cp_dose == "D2", 'cp_dose'] = 1


# Data to be used for the model:
xtrain = pd.read_csv("xtrain.csv", index_col=0)
ytrain = pd.read_csv("ytrain.csv", index_col=0)

# Data to be used for the predictions:
xtest = pd.read_csv("xtest.csv", index_col=0)
ytest = pd.read_csv("ytest.csv", index_col=0)

In [2]:
#how many genes do we have?
g_columns = [g for g in features.columns.tolist() if g.startswith('g-')]
len(g_columns)

772

In [3]:
#how many cell lines?
c_columns = [c for c in features.columns.tolist() if c.startswith('c-')]
len(c_columns)

100

In [4]:
z_scores = pd.read_csv("Z-scores_features_over_control.csv", index_col=0)

## 7.1. GENE discretization

Although we expected these distributions to differ more from the ones obtained by computing the z-scores with the mean and standard deviation of all the samples (treated and control), let's now try to find a threshold (T) so that we can discretize the gene expression.

Instead of having a number which represents this expression, we want to have 3 different "classes":
- Underexpressed (z-score < -T) &rarr; -1
- Normal (-T ≤ z-score ≤ +T)    &rarr; 0
- Overexpressed (z-score > +T)  &rarr; 1
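The three-class rule can be checked on a tiny table before applying it to the real z-scores. This is only a sketch: the values and the names `toy_z` and `toy_classes` are invented, and it assumes a z-score exactly equal to ±T counts as "Normal".

```python
import pandas as pd

# Toy z-scores for two hypothetical genes (values invented for illustration)
toy_z = pd.DataFrame(
    {"g-0": [-7.2, 0.3, 5.0, 9.1], "g-1": [1.5, -5.1, 6.4, -0.2]},
    index=["s1", "s2", "s3", "s4"],
)

T = 5
# (z > T) contributes +1, (z < -T) contributes -1, everything else stays 0
toy_classes = (toy_z > T).astype(int) - (toy_z < -T).astype(int)
print(toy_classes)
```

Note that the value 5.0 in `g-0` maps to 0, because only values strictly beyond the threshold change class.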

### a) T=5

We must discretize both the training set and the test set:

In [5]:
def gene_discretization(names_cols, z_scores, threshold):
    """Map z-scores to -1 (under), 0 (normal) or 1 (over) using +/- threshold."""
    z = z_scores[names_cols]
    # Boolean masks cast to int: +1 where z > threshold, -1 where z < -threshold, else 0
    new_df = (z > threshold).astype(int) - (z < -threshold).astype(int)
    return new_df

In [None]:
genes_discretized = gene_discretization(g_columns, z_scores, 5)
genes_discretized.head(5)

### New data -> New model
Let's create a model (an SVC, for example) with this new data. We only take the genes into consideration, leaving out the cell viability and dosage data.

In [8]:
def MultipleColumns_model(xtrain, ytrain, c, estimator):
    """Return a list with one fitted model per target column.
        Inputs: xtrain, ytrain, the number of target columns to use (c) and the estimator class.
    """
    models = [] # list of models, one per column
    for i in range(c): #from 0 to number of columns
        models.append(estimator(probability = True)) #Prob = True bc we want to predict_proba
        models[i].fit(xtrain, ytrain.iloc[:,i]) #Fitting the model with all xtrain and 1 column of ytrain
        print("model", i, "done")
    return models

from sklearn.svm import SVC

In [None]:
# Data used for the model:
xtrain = genes_discretized.loc[xtrain.index.tolist(), :].copy()
#xtrain = xtrain.drop('cp_type', axis=1)
ytrain = targets.loc[xtrain.index.tolist(), :].copy()

# Data used for the predictions:
xtest = genes_discretized.loc[xtest.index.tolist(), :].copy()

In [None]:
svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC) 

In [None]:
from joblib import dump, load
dump(svc_model, 'output/7a_model.joblib') 

In [None]:
#new dataframe for saving the predictions
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)

name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]): 
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
    
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7a_probas.csv')
proba_pred_SVC.head(5)

### b) T=6

In [None]:
genes_discretized = gene_discretization(g_columns, z_scores, 6)

# Data used for the model:
xtrain = genes_discretized.loc[xtrain.index.tolist(), :].copy()
ytrain = targets.loc[xtrain.index.tolist(), :].copy()
# Data used for the predictions:
xtest = genes_discretized.loc[xtest.index.tolist(), :].copy()

svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC) 
dump(svc_model, 'output/7b_model.joblib') 

#new dataframe for saving the predictions
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)

name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]): 
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
    
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7b_probas.csv')
proba_pred_SVC.head(5)
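The fit / predict / save steps are repeated for every variant in this notebook. One possible way to factor them out is sketched below. The helper name `fit_and_predict` and the toy data are hypothetical, not part of the original notebook; the sketch only assumes that each target column contains both classes in the training rows.

```python
import pandas as pd
from sklearn.svm import SVC


def fit_and_predict(x_train, y_train, x_test, estimator=SVC):
    """Fit one probabilistic classifier per target column; return test probabilities."""
    probas = {}
    for col in y_train.columns:
        model = estimator(probability=True)
        model.fit(x_train, y_train[col])
        # Column 1 of predict_proba is P(label == 1), since classes_ is sorted as [0, 1]
        probas[col] = model.predict_proba(x_test)[:, 1]
    out = pd.DataFrame(probas, index=x_test.index)
    out.index.name = "sig_id"
    return out


# Deterministic toy data standing in for the discretized genes and the targets
n = 60
ids = [f"id_{i}" for i in range(n)]
x = pd.DataFrame(
    {f"g-{j}": [((i + j) % 3) - 1 for i in range(n)] for j in range(3)},
    index=ids,
)
y = pd.DataFrame(
    {"target_a": (x["g-0"] > 0).astype(int), "target_b": (x["g-1"] < 0).astype(int)},
    index=ids,
)

x_train, x_test = x.iloc[:45], x.iloc[45:]
y_train = y.iloc[:45]
probas = fit_and_predict(x_train, y_train, x_test)
print(probas.head())
```

With such a helper, each variant would reduce to building the feature table, calling the function, and writing the result with `to_csv`.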

__________
## 7.2. CELL discretization

Instead of having a number which represents this viability, we want to have 2 different "classes":
- Non-viable (z-score < -T) &rarr; 0
- Viable (z-score ≥ -T)    &rarr; 1

We'll only flag samples in which a surprisingly large number of cells have died, because we expect most treatments to reduce the population of some cell types (this choice is also supported by the plots obtained in the previous notebook).

Potential thresholds: 6 or 7. (Why? Imagine the slope of the control distributions: where does it reach the x axis?)
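A rough way to compare candidate thresholds is to count what fraction of entries each one would label as non-viable. The sketch below uses synthetic numbers in place of the real z-score table; `toy_cells`, `candidates` and `frac_nonviable` are hypothetical names, and the random values say nothing about the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cell-viability z-scores (not the real data)
rng = np.random.default_rng(0)
toy_cells = pd.DataFrame(
    rng.normal(0, 3, size=(200, 4)),
    columns=[f"c-{i}" for i in range(4)],
)

candidates = [5, 6, 7]
# Fraction of all entries that fall below -T, i.e. would be coded as 0 (non-viable)
frac_nonviable = {T: float((toy_cells < -T).to_numpy().mean()) for T in candidates}
for T, frac in frac_nonviable.items():
    print(f"T={T}: {frac:.4f} of entries below -T")
```

On the real table, the same loop could be run on `z_scores[c_columns]` to see how many entries each threshold removes.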

In [None]:
def cell_discretization(names_cols, z_scores, threshold):
    """Map z-scores to 0 (non-viable, z < -threshold) or 1 (viable)."""
    z = z_scores[names_cols]
    # Anything not below -threshold (including missing values) counts as viable
    new_df = (~(z < -threshold)).astype(int)
    return new_df

### d) T = 6


In [None]:
cells_discretized = cell_discretization(c_columns, z_scores, 6)
cells_discretized.head(5)

In [None]:
# Data used for the model:
xtrain = cells_discretized.loc[xtrain.index.tolist(), :].copy()
ytrain = targets.loc[xtrain.index.tolist(), :].copy()
# Data used for the predictions:
xtest = cells_discretized.loc[xtest.index.tolist(), :].copy()

svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC) 
dump(svc_model, 'output/7d_model.joblib') 

#new dataframe for saving the predictions
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)

name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]): 
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)

proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7d_probas.csv')
proba_pred_SVC.head(5)

### e) T = 7

In [None]:
cells_discretized = cell_discretization(c_columns, z_scores, 7)

# Data used for the model:
xtrain = cells_discretized.loc[xtrain.index.tolist(), :].copy()
ytrain = targets.loc[xtrain.index.tolist(), :].copy()
# Data used for the predictions:
xtest = cells_discretized.loc[xtest.index.tolist(), :].copy()

svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC) 
dump(svc_model, 'output/7e_model.joblib') 

#new dataframe for saving the predictions
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)

name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]): 
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)

proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7e_probas.csv')
proba_pred_SVC.head(5)

___
To understand the reasons for an improvement (or a deterioration) in the log loss score, we need to create different models for comparison (only genes without discretization, only cells without discretization, and so on).
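To compare the variants, the saved probability tables can be scored against the true labels. The sketch below assumes the score meant here is the binary log loss averaged over samples and over target columns; the function name `mean_columnwise_log_loss` and the toy tables are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_columnwise_log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss per target column, averaged over all columns."""
    # Clip probabilities so that log(0) never occurs
    p = np.clip(y_prob.to_numpy(dtype=float), eps, 1 - eps)
    # Align the labels with the rows and columns of the predictions
    y = y_true.loc[y_prob.index, y_prob.columns].to_numpy(dtype=float)
    per_entry = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(per_entry.mean(axis=0).mean())


# Toy labels and an uninformative prediction of 0.5 everywhere
y_true = pd.DataFrame(
    {"target_a": [0, 1, 0, 1], "target_b": [1, 0, 0, 0]},
    index=["id_0", "id_1", "id_2", "id_3"],
)
y_prob = pd.DataFrame(0.5, index=y_true.index, columns=y_true.columns)

score = mean_columnwise_log_loss(y_true, y_prob)
print(score)
```

Predicting 0.5 everywhere gives a loss of ln 2 regardless of the labels, which is a handy reference point when reading the scores of the different models.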


### c) Genes ND

In [None]:
what = 'genes'                                                                                                            

features = pd.read_csv("../lish-moa/train_features.csv", index_col=0)
targets = pd.read_csv("../lish-moa/train_targets_scored.csv", index_col=0)
#cp_type:                                                                                                                                        
features.loc[features.cp_type == "trt_cp", "cp_type"] = 0
features.loc[features.cp_type == "ctl_vehicle", "cp_type"] = 1 #CONTROL = 1                                                                      
#cp_dose:                                                                                                                                        
features.loc[features.cp_dose == "D1", 'cp_dose'] = 0
features.loc[features.cp_dose == "D2", 'cp_dose'] = 1
# Data to be used for the model:                                                                                                                 
xtrain = pd.read_csv("xtrain.csv", index_col=0)
ytrain = pd.read_csv("ytrain.csv", index_col=0)
# Data to be used for the predictions:                                                                                                           
xtest = pd.read_csv("xtest.csv", index_col=0)
ytest = pd.read_csv("ytest.csv", index_col=0)
g_columns = [g for g in features.columns.tolist() if g.startswith('g-')]
c_columns = [c for c in features.columns.tolist() if c.startswith('c-')]

if what=='genes':
    discretized = features.loc[:,g_columns].copy()
if what=='cells':
    discretized = features.loc[:,c_columns].copy()
if what=='both': 
    discretized = features.loc[:,g_columns+c_columns].copy()
    
# Data used for the model:                                                                                                                       
xtrain = discretized.loc[xtrain.index.tolist(), :].copy()
#xtrain = xtrain.drop('cp_type', axis=1)                                                                                                         
ytrain = targets.loc[xtrain.index.tolist(), :].copy() #remains the same                                                                          
# Data used for the predictions:                                                                                                                 
xtest = discretized.loc[xtest.index.tolist(), :].copy()
svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC)
from joblib import dump
dump(svc_model, 'output/7c_model.joblib')
#new dataframe for saving the predictions                                                                                                        
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)
name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]):
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7c_probas.csv')                  

### f) Cells ND

In [None]:
what = 'cells'                                                                                                            

features = pd.read_csv("../lish-moa/train_features.csv", index_col=0)
targets = pd.read_csv("../lish-moa/train_targets_scored.csv", index_col=0)
#cp_type:                                                                                                                                        
features.loc[features.cp_type == "trt_cp", "cp_type"] = 0
features.loc[features.cp_type == "ctl_vehicle", "cp_type"] = 1 #CONTROL = 1                                                                      
#cp_dose:                                                                                                                                        
features.loc[features.cp_dose == "D1", 'cp_dose'] = 0
features.loc[features.cp_dose == "D2", 'cp_dose'] = 1
# Data to be used for the model:                                                                                                                 
xtrain = pd.read_csv("xtrain.csv", index_col=0)
ytrain = pd.read_csv("ytrain.csv", index_col=0)
# Data to be used for the predictions:                                                                                                           
xtest = pd.read_csv("xtest.csv", index_col=0)
ytest = pd.read_csv("ytest.csv", index_col=0)
g_columns = [g for g in features.columns.tolist() if g.startswith('g-')]
c_columns = [c for c in features.columns.tolist() if c.startswith('c-')]

if what=='genes':
    discretized = features.loc[:,g_columns].copy()
if what=='cells':
    discretized = features.loc[:,c_columns].copy()
if what=='both': 
    discretized = features.loc[:,g_columns+c_columns].copy()
    
# Data used for the model:                                                                                                                       
xtrain = discretized.loc[xtrain.index.tolist(), :].copy()
#xtrain = xtrain.drop('cp_type', axis=1)                                                                                                         
ytrain = targets.loc[xtrain.index.tolist(), :].copy() #remains the same                                                                          
# Data used for the predictions:                                                                                                                 
xtest = discretized.loc[xtest.index.tolist(), :].copy()
svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC)
from joblib import dump
dump(svc_model, 'output/7f_model.joblib')
#new dataframe for saving the predictions                                                                                                        
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)
name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]):
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7f_probas.csv')  

### h) Genes ND + Cells ND

In [None]:
what = 'both'                                                                                                            

features = pd.read_csv("../lish-moa/train_features.csv", index_col=0)
targets = pd.read_csv("../lish-moa/train_targets_scored.csv", index_col=0)
#cp_type:                                                                                                                                        
features.loc[features.cp_type == "trt_cp", "cp_type"] = 0
features.loc[features.cp_type == "ctl_vehicle", "cp_type"] = 1 #CONTROL = 1                                                                      
#cp_dose:                                                                                                                                        
features.loc[features.cp_dose == "D1", 'cp_dose'] = 0
features.loc[features.cp_dose == "D2", 'cp_dose'] = 1
# Data to be used for the model:                                                                                                                 
xtrain = pd.read_csv("xtrain.csv", index_col=0)
ytrain = pd.read_csv("ytrain.csv", index_col=0)
# Data to be used for the predictions:                                                                                                           
xtest = pd.read_csv("xtest.csv", index_col=0)
ytest = pd.read_csv("ytest.csv", index_col=0)
g_columns = [g for g in features.columns.tolist() if g.startswith('g-')]
c_columns = [c for c in features.columns.tolist() if c.startswith('c-')]

if what=='genes':
    discretized = features.loc[:,g_columns].copy()
if what=='cells':
    discretized = features.loc[:,c_columns].copy()
if what=='both': 
    discretized = features.loc[:,g_columns+c_columns].copy()
    
# Data used for the model:                                                                                                                       
xtrain = discretized.loc[xtrain.index.tolist(), :].copy()
#xtrain = xtrain.drop('cp_type', axis=1)                                                                                                         
ytrain = targets.loc[xtrain.index.tolist(), :].copy() #remains the same                                                                          
# Data used for the predictions:                                                                                                                 
xtest = discretized.loc[xtest.index.tolist(), :].copy()
svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC)
from joblib import dump
dump(svc_model, 'output/7h_model.joblib')
#new dataframe for saving the predictions                                                                                                        
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)
name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]):
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7h_probas.csv')  

### g) Genes T=5 + Cells T=6

In [5]:
import pandas as pd
what = 'both' #cells or genes or both                                                                                                             
#threshold = int(sys.argv[2])
features = pd.read_csv("../lish-moa/train_features.csv", index_col=0)
targets = pd.read_csv("../lish-moa/train_targets_scored.csv", index_col=0)
#cp_type:                                                                                                                                        
features.loc[features.cp_type == "trt_cp", "cp_type"] = 0
features.loc[features.cp_type == "ctl_vehicle", "cp_type"] = 1 #CONTROL = 1                                                                      
#cp_dose:                                                                                                                                        
features.loc[features.cp_dose == "D1", 'cp_dose'] = 0
features.loc[features.cp_dose == "D2", 'cp_dose'] = 1
# Data to be used for the model:                                                                                                                 
xtrain = pd.read_csv("xtrain.csv", index_col=0)
ytrain = pd.read_csv("ytrain.csv", index_col=0)
# Data to be used for the predictions:                                                                                                           
xtest = pd.read_csv("xtest.csv", index_col=0)
ytest = pd.read_csv("ytest.csv", index_col=0)
z_scores = pd.read_csv("Z-scores_features_over_control.csv", index_col = 0)
g_columns = [g for g in features.columns.tolist() if g.startswith('g-')]
c_columns = [c for c in features.columns.tolist() if c.startswith('c-')]
def gene_discretization(names_cols, z_scores, threshold):
    """Map z-scores to -1 (under), 0 (normal) or 1 (over) using +/- threshold."""
    z = z_scores[names_cols]
    # Boolean masks cast to int: +1 where z > threshold, -1 where z < -threshold, else 0
    new_df = (z > threshold).astype(int) - (z < -threshold).astype(int)
    return new_df
def cell_discretization(names_cols, z_scores, threshold):
    """Map z-scores to 0 (non-viable, z < -threshold) or 1 (viable)."""
    z = z_scores[names_cols]
    # Anything not below -threshold (including missing values) counts as viable
    new_df = (~(z < -threshold)).astype(int)
    return new_df
def MultipleColumns_model(xtrain, ytrain, c, estimator):
    """Return a list with one fitted model per target column.
        Inputs: xtrain, ytrain, the number of target columns to use (c) and the estimator class.
    """
    models = [] # list of models, one per column
    for i in range(c): #from 0 to number of columns                                                                                              
        models.append(estimator(probability = True)) #Prob = True bc we want to predict_proba                                                    
        models[i].fit(xtrain, ytrain.iloc[:,i]) #Fitting the model with all xtrain and 1 column of ytrain                                        
        print("model", i, "done")
    return models
from sklearn.svm import SVC
# Note: `threshold` is only used (and must be defined) when what is 'genes' or 'cells'
if what=='genes':
    discretized = gene_discretization(g_columns, z_scores, threshold)
if what=='cells':
    discretized = cell_discretization(c_columns, z_scores, threshold)
if what=='both':
    discretized = pd.concat([gene_discretization(g_columns, z_scores, 5),
                             cell_discretization(c_columns, z_scores, 6)], axis=1)
# Data used for the model:                                                                                                                       
xtrain = discretized.loc[xtrain.index.tolist(), :].copy()
#xtrain = xtrain.drop('cp_type', axis=1)                                                                                                         
ytrain = targets.loc[xtrain.index.tolist(), :].copy() #remains the same                                                                          
# Data used for the predictions:                                                                                                                 
xtest = discretized.loc[xtest.index.tolist(), :].copy()
svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC)
from joblib import dump
dump(svc_model, 'output/7g_model.joblib')
#new dataframe for saving the predictions                                                                                                        
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)
name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]):
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7g_probas.csv')

model 0 done
model 1 done
model 2 done
model 3 done
model 4 done
model 5 done
model 6 done
model 7 done
model 8 done
model 9 done
model 10 done
model 11 done
model 12 done
model 13 done
model 14 done
model 15 done
model 16 done
model 17 done
model 18 done
model 19 done
model 20 done
model 21 done
model 22 done
model 23 done
model 24 done
model 25 done
model 26 done
model 27 done
model 28 done
model 29 done
model 30 done
model 31 done
model 32 done
model 33 done
model 34 done
model 35 done
model 36 done
model 37 done
model 38 done
model 39 done
model 40 done
model 41 done
model 42 done
model 43 done
model 44 done
model 45 done
model 46 done
model 47 done
model 48 done
model 49 done
model 50 done
model 51 done
model 52 done
model 53 done
model 54 done
model 55 done
model 56 done
model 57 done
model 58 done
model 59 done
model 60 done
model 61 done
model 62 done
model 63 done
model 64 done
model 65 done
model 66 done
model 67 done
model 68 done
model 69 done
model 70 done
model 71 done
mo

### i) Genes T=5 + Cells T=6 + dosage

In [6]:
import pandas as pd

features = pd.read_csv("../lish-moa/train_features.csv", index_col=0)
targets = pd.read_csv("../lish-moa/train_targets_scored.csv", index_col=0)
#cp_type:                                                                                                                                        
features.loc[features.cp_type == "trt_cp", "cp_type"] = 0
features.loc[features.cp_type == "ctl_vehicle", "cp_type"] = 1 #CONTROL = 1                                                                      
#cp_dose:                                                                                                                                        
features.loc[features.cp_dose == "D1", 'cp_dose'] = 0
features.loc[features.cp_dose == "D2", 'cp_dose'] = 1
# Data to be used for the model:                                                                                                                 
xtrain = pd.read_csv("xtrain.csv", index_col=0)
ytrain = pd.read_csv("ytrain.csv", index_col=0)
# Data to be used for the predictions:                                                                                                           
xtest = pd.read_csv("xtest.csv", index_col=0)
ytest = pd.read_csv("ytest.csv", index_col=0)
z_scores = pd.read_csv("Z-scores_features_over_control.csv", index_col = 0)
g_columns = [g for g in features.columns.tolist() if g.startswith('g-')]
c_columns = [c for c in features.columns.tolist() if c.startswith('c-')]
def gene_discretization(names_cols, z_scores, threshold):
    """Map z-scores to -1 (under), 0 (normal) or 1 (over) using +/- threshold."""
    z = z_scores[names_cols]
    # Boolean masks cast to int: +1 where z > threshold, -1 where z < -threshold, else 0
    new_df = (z > threshold).astype(int) - (z < -threshold).astype(int)
    return new_df
def cell_discretization(names_cols, z_scores, threshold):
    """Map z-scores to 0 (non-viable, z < -threshold) or 1 (viable)."""
    z = z_scores[names_cols]
    # Anything not below -threshold (including missing values) counts as viable
    new_df = (~(z < -threshold)).astype(int)
    return new_df
def MultipleColumns_model(xtrain, ytrain, c, estimator):
    """Return a list with one fitted model per target column.
        Inputs: xtrain, ytrain, the number of target columns to use (c) and the estimator class.
    """
    models = [] # list of models, one per column
    for i in range(c): #from 0 to number of columns                                                                                              
        models.append(estimator(probability = True)) #Prob = True bc we want to predict_proba                                                    
        models[i].fit(xtrain, ytrain.iloc[:,i]) #Fitting the model with all xtrain and 1 column of ytrain                                        
        print("model", i, "done")
    return models
from sklearn.svm import SVC

discretized=pd.concat([gene_discretization(g_columns,z_scores,5),cell_discretization(c_columns,z_scores,6)],axis=1)
discretized=pd.concat([features.iloc[:,:3], discretized],axis=1)

# Data used for the model:                                                                                                                       
xtrain = discretized.loc[xtrain.index.tolist(), :].copy()
#xtrain = xtrain.drop('cp_type', axis=1)                                                                                                         
ytrain = targets.loc[xtrain.index.tolist(), :].copy() #remains the same                                                                          
# Data used for the predictions:                                                                                                                 
xtest = discretized.loc[xtest.index.tolist(), :].copy()
svc_model = MultipleColumns_model(xtrain, ytrain, ytrain.shape[1], SVC)
from joblib import dump
dump(svc_model, 'output/7i_model.joblib')
#new dataframe for saving the predictions                                                                                                        
proba_pred_SVC = pd.DataFrame(columns=ytest.columns)
name_col = ytest.columns.tolist()
for i in range(ytest.shape[1]):
    proba_pred_SVC[name_col[i]] = svc_model[i].predict_proba(xtest)[:, 1]
    print(i, end=' ', flush=True)
proba_pred_SVC["sig_id"]= xtest.index.tolist()
proba_pred_SVC = proba_pred_SVC.set_index('sig_id')
proba_pred_SVC.to_csv('output/7i_probas.csv')

model 0 done
model 1 done
model 2 done
model 3 done
model 4 done
model 5 done
model 6 done
model 7 done
model 8 done
model 9 done
model 10 done
model 11 done
model 12 done
model 13 done
model 14 done
model 15 done
model 16 done
model 17 done
model 18 done
model 19 done
model 20 done
model 21 done
model 22 done
model 23 done
model 24 done
model 25 done
model 26 done
model 27 done
model 28 done
model 29 done
model 30 done
model 31 done
model 32 done
model 33 done
model 34 done
model 35 done
model 36 done
model 37 done
model 38 done
model 39 done
model 40 done
model 41 done
model 42 done
model 43 done
model 44 done
model 45 done
model 46 done
model 47 done
model 48 done
model 49 done
model 50 done
model 51 done
model 52 done
model 53 done
model 54 done
model 55 done
model 56 done
model 57 done
model 58 done
model 59 done
model 60 done
model 61 done
model 62 done
model 63 done
model 64 done
model 65 done
model 66 done
model 67 done
model 68 done
model 69 done
model 70 done
model 71 done
mo