<h1>Notebook for computing the GMHI index for taxonomy and pathways, and for training a Random Forest classifier</h1>

In [300]:
# Libraries used in this notebook
import pandas as pd
import numpy as np

<h3>The following function computes the prevalence of each species $m$, that is, the proportion of samples in which it is present, in the healthy and in the non-healthy samples, denoted by $p_{H,m}$ and $p_{N,m}$, respectively. It then computes the metrics $f_H=\frac{p_{H,m}}{p_{N,m}}$, $f_N=\frac{p_{N,m}}{p_{H,m}}$, $d_H=p_{H,m}-p_{N,m}$ and $d_N=p_{N,m}-p_{H,m}$.</h3>

In [301]:
def get_fH_fN_dH_dN(meta,tax):
    ######### Receives the metadata and taxonomy data frames.
    
    # Get the IDs of the healthy samples listed in the metadata, then
    # select the taxonomy columns of those samples
    healthy_id = meta[meta['Diagnosis']=='Healthy']['SampleID']
    tax_healthy = tax[healthy_id]
    
    # Get the IDs of the non-healthy samples, then select their taxonomy columns
    no_healthy_id = meta[meta['Diagnosis']!='Healthy']['SampleID']
    tax_no_healthy = tax[no_healthy_id]
    
    # All species found across all samples
    species = tax.index
    
    # lower is a floor that replaces a prevalence of 0 to avoid division by zero
    lower=1e-05
    
    # Data frame with the metrics as columns and the species as index
    metrics=pd.DataFrame(index=species,columns=['f_H','f_N','d_H','d_N'],dtype=float)
    
    # For each species m, compute its prevalence in the healthy samples (PH) and in the
    # non-healthy samples (PN), then store f_H, f_N, d_H and d_N in the metrics data frame
    for specie in species:
        
        # Locate the species in all healthy samples and count the samples where it is present
        specie_in_H=tax_healthy.loc[specie,:]
        abs_pres_H=len(specie_in_H[specie_in_H!=0])
        
        # Locate the species in all non-healthy samples and count the samples where it is present
        specie_in_N=tax_no_healthy.loc[specie,:]
        abs_pres_N=len(specie_in_N[specie_in_N!=0])
        
        # Compute PH and PN; a prevalence of 0 is replaced by the floor 1e-05
        PH=abs_pres_H/len(specie_in_H) if abs_pres_H!=0 else lower
        PN=abs_pres_N/len(specie_in_N) if abs_pres_N!=0 else lower
        metrics.loc[specie,:]=[PH/PN,PN/PH,PH-PN,PN-PH]
    return metrics

######### Returns a DataFrame with the metrics f_H, f_N, d_H and d_N of each species
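
The same four metrics can be checked by hand on a small table. The block below is a minimal vectorized sketch of the computation, not a replacement for the function above; the names toy_tax, toy_meta and toy_metrics and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: 3 species (rows) x 4 samples (columns)
toy_tax = pd.DataFrame(
    [[0.5, 0.2, 0.0, 0.0],
     [0.0, 0.0, 0.3, 0.1],
     [0.1, 0.0, 0.2, 0.0]],
    index=['sp_A', 'sp_B', 'sp_C'],
    columns=['s1', 's2', 's3', 's4'],
)
toy_meta = pd.DataFrame({
    'SampleID': ['s1', 's2', 's3', 's4'],
    'Diagnosis': ['Healthy', 'Healthy', 'IBD', 'IBD'],
})

lower = 1e-05  # same floor as in the notebook, replaces a prevalence of 0

is_healthy = toy_meta['Diagnosis'] == 'Healthy'
healthy_ids = list(toy_meta.loc[is_healthy, 'SampleID'])
other_ids = list(toy_meta.loc[~is_healthy, 'SampleID'])

# Prevalence = fraction of samples in which the species has non-zero abundance
present = toy_tax != 0
p_H = present[healthy_ids].mean(axis=1).replace(0, lower)
p_N = present[other_ids].mean(axis=1).replace(0, lower)

toy_metrics = pd.DataFrame({
    'f_H': p_H / p_N, 'f_N': p_N / p_H,
    'd_H': p_H - p_N, 'd_N': p_N - p_H,
})
print(toy_metrics)
```

Here sp_A appears only in healthy samples, so its f_H is very large because p_N is replaced by the floor, while sp_C is equally prevalent in both groups and gets f_H = f_N = 1 and d_H = d_N = 0.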

<h3>The following function computes the sets $M_H$, the species more prevalent in the healthy samples, and $M_N$, the species more prevalent in the non-healthy samples, using the thresholds $\theta_f$ and $\theta_d$.</h3>

In [302]:
def get_MH_MN(metrics,theta_f,theta_d):
    ######### Receives the metrics of each species and the comparison thresholds
    
    
    # Beneficial species whose f_H and d_H are greater than or equal to theta_f and theta_d
    health_species_pass_theta_f=set(metrics[metrics['f_H']>=theta_f].index)
    health_species_pass_theta_d=set(metrics[metrics['d_H']>=theta_d].index)
    
    # Harmful species whose f_N and d_N are greater than or equal to theta_f and theta_d
    no_health_species_pass_theta_f=set(metrics[metrics['f_N']>=theta_f].index)
    no_health_species_pass_theta_d=set(metrics[metrics['d_N']>=theta_d].index)
    
    # The sets of beneficial and harmful species are those that pass both thresholds
    MH=health_species_pass_theta_f & health_species_pass_theta_d
    MN=no_health_species_pass_theta_f & no_health_species_pass_theta_d
    
    print('|MH|=', len(MH) )
    print('|MN|=', len(MN))        
    return MH,MN

######### Returns the sets of species identified as beneficial (MH) and harmful (MN) for the given thresholds
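
To see how the two thresholds interact, the sketch below applies them to a small hypothetical metrics table. The table, the species names and the variables MH_toy and MN_toy are assumptions made for illustration. A species must pass both the ratio threshold and the difference threshold to enter a set.

```python
import pandas as pd

# Hypothetical metrics for four species
toy_metrics = pd.DataFrame(
    {'f_H': [3.0, 0.2, 1.0, 4.0],
     'f_N': [1 / 3, 5.0, 1.0, 0.25],
     'd_H': [0.4, -0.4, 0.0, 0.05],
     'd_N': [-0.4, 0.4, 0.0, -0.05]},
    index=['sp_A', 'sp_B', 'sp_C', 'sp_D'],
)
theta_f, theta_d = 2.5, 0.1

# Both conditions combined in a single boolean mask per set
healthy_mask = (toy_metrics['f_H'] >= theta_f) & (toy_metrics['d_H'] >= theta_d)
unhealthy_mask = (toy_metrics['f_N'] >= theta_f) & (toy_metrics['d_N'] >= theta_d)

MH_toy = set(toy_metrics.index[healthy_mask])
MN_toy = set(toy_metrics.index[unhealthy_mask])
print(MH_toy, MN_toy)
```

In this toy table sp_D has a high ratio (f_H = 4.0) but a small difference (d_H = 0.05), so it is excluded from MH_toy. This is the kind of rare species the difference threshold is meant to filter out.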

<h3>The function get_Psi computes $\Psi_{{M_H},i}=\frac{R_{{M_H},i}}{|M_H|} \left|\sum_{j\in I_{M_H}}n_{j,i}\ln(n_{j,i})\right|$ or $\Psi_{{M_N},i}$ for sample $i$.
Here: 
<ul>
  <li>$R_{{M_H},i}$ is the richness of the $M_H$ species in sample $i$, computed in the code as the fraction of $M_H$ species present in the sample.</li>
  <li>$|M_H|$ is the size of the set $M_H$.</li>
  <li>$I_{M_H}$ is the set of indices of $M_H$; only species with non-zero abundance in the sample contribute to the sum.</li>
  <li>$n_{j,i}$ is the relative abundance of species $j$ in sample $i$.</li>
</ul>
    $\Psi_{{M_N},i}$ is computed analogously for sample $i$.
</h3>

In [303]:
def get_Psi(set_M,sample):
    ######### Receives the set M_H or M_N and the sample with the relative abundance of each species
    
    
    # M_in_sample is the intersection of M_H (or M_N) with the species present in sample i
    M_in_sample=set(sample[sample!=0].index) & set_M
    
    # R_M is computed as the fraction of the species of M that are present in the sample
    R_M_sample=np.divide(len(M_in_sample),len(set_M))
    
    # n holds the relative abundances of the species of M present in sample i;
    # the logarithm and the sum are computed next
    n=sample[sample!=0][list(M_in_sample)]
    log_n=np.log(n)
    sum_nlnn=np.sum(n*log_n)
    
    # Finally, Psi is obtained for sample i and the set M
    Psi=np.divide(R_M_sample,len(set_M))*np.absolute(sum_nlnn)
    
    # A Psi of exactly 0 is replaced by 1e-05 to avoid division by zero in the next function
    if Psi==0:
        Psi=1e-05
    return Psi

######### Returns the number Psi associated with sample i and the set M_H or M_N
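
A worked example helps to check the arithmetic. The sketch below follows the same steps as get_Psi on one hypothetical sample; the set, the species names and the abundances are assumptions chosen so the result can be verified by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical set M and one sample of relative abundances (in percent)
set_M = {'sp_A', 'sp_B', 'sp_C', 'sp_D'}
sample = pd.Series({'sp_A': 10.0, 'sp_B': 1.0, 'sp_C': 0.0, 'sp_E': 89.0})

# Species of M with non-zero abundance in the sample: sp_A and sp_B
present = set(sample[sample != 0].index) & set_M

# Fraction of M that is present: 2 / 4 = 0.5
richness = len(present) / len(set_M)

# Sum of n * ln(n) over the present species: 10*ln(10) + 1*ln(1)
n = sample[sorted(present)]
sum_nlnn = float(np.sum(n * np.log(n)))

# Same final step as get_Psi: divide by |M| and take the absolute value of the sum
psi = richness / len(set_M) * abs(sum_nlnn)
print(round(psi, 4))  # 2.8782
```

Note that sp_B contributes nothing to the sum because ln(1) = 0, and sp_E is ignored because it does not belong to the set.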

<h3>get_all_GMHI uses get_Psi to compute $\text{GMHI}_i=\log_{10}\frac{\Psi_{{M_H},i}}{\Psi_{{M_N},i}}$ for each sample $i$.
</h3>

In [304]:
def get_all_GMHI(tax,MH,MN):
    ######### Receives the taxonomy and the sets of species MH and MN.
    

    # GMHI is a pandas Series indexed by sample name that stores the GMHI index of each sample.
    # It is filled by a for loop that iterates over all samples
    samples=tax.columns 
    GMHI=pd.Series(index=samples,name='GMHI',dtype='float64')
    for sample in samples:
        
        # Psi_MH and Psi_MN are obtained with the function get_Psi
        Psi_MH=get_Psi(MH,tax[sample])
        Psi_MN=get_Psi(MN,tax[sample])
        
        # Take the base-10 logarithm of the ratio and store the result in the GMHI series
        GMHI_sample=np.log10(np.divide(Psi_MH,Psi_MN))
        GMHI[sample]=GMHI_sample
        
    return GMHI 

######### Returns the series with the GMHI index of each sample
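
The sign of the index and the effect of the 1e-05 floor can be seen with a few numbers. This is a small illustrative sketch; the Psi values and case names below are hypothetical.

```python
import numpy as np

floor = 1e-05  # value that get_Psi returns in place of an exact zero

# Hypothetical (Psi_MH, Psi_MN) pairs for three samples
cases = {
    'more_MH_than_MN': (2.0, 0.02),
    'no_MN_species': (2.0, floor),
    'neither_set_present': (floor, floor),
}
gmhi = {name: float(np.log10(psi_h / psi_n)) for name, (psi_h, psi_n) in cases.items()}
print(gmhi)
```

A sample with no species from either set ends up with a GMHI of exactly 0, and a sample missing only the M_N species gets a large positive value driven by the floor rather than by the data.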

In [305]:
from sklearn.metrics import balanced_accuracy_score

def get_accuracy(GMHI,meta):
    # Receives the computed GMHI and the metadata to assess how well the index predicts the diagnosis.
    # The rows of meta are assumed to be in the same order as the entries of GMHI.
    
    # True labels: every diagnosis other than 'Healthy' counts as 'Unhealthy'
    y_true=['Healthy' if x == 'Healthy' else 'Unhealthy' for x in meta['Diagnosis']]
    
    # Predicted labels: a negative GMHI is classified as 'Unhealthy'
    y_pred=['Unhealthy' if x < 0 else 'Healthy' for x in GMHI]
    
    return balanced_accuracy_score(y_true,y_pred)

######### Returns the balanced accuracy
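
Balanced accuracy is the mean of the per-class recalls, so it differs from plain accuracy when the classes are unbalanced. The labels below are hypothetical and serve only to show the difference.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 4 healthy and 2 unhealthy samples
y_true = ['Healthy'] * 4 + ['Unhealthy'] * 2
y_pred = ['Healthy', 'Healthy', 'Healthy', 'Unhealthy', 'Unhealthy', 'Healthy']

# Recall per class, computed by hand
recall_healthy = 3 / 4    # 3 of the 4 healthy samples are predicted healthy
recall_unhealthy = 1 / 2  # 1 of the 2 unhealthy samples is predicted unhealthy

balanced = balanced_accuracy_score(y_true, y_pred)
plain = accuracy_score(y_true, y_pred)
print(balanced, (recall_healthy + recall_unhealthy) / 2)
print(plain)
```

Here the balanced accuracy is 0.625 while the plain accuracy is 4/6, because the majority class weighs more in the plain score.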

<h3>The CAMDA data are loaded</h3>

In [306]:
meta = pd.read_csv('../../DataSets/CAMDA/metadata.txt', sep="\t")
tax=pd.read_csv('../../DataSets/CAMDA/taxonomy.txt',sep='\t',index_col=0)
metrics=get_fH_fN_dH_dN(meta,tax)

<h2>The sets of beneficial and harmful species are computed and merged with those reported in the GMHI article, so the article's species are kept. These sets will later be used to compute the taxonomy GMHI of any other metagenome.</h2>

In [307]:
theta_f=2.5
theta_d=0.1
MH,MN=get_MH_MN(metrics,theta_f,theta_d)

|MH|= 143
|MN|= 19


<h3>The beneficial and harmful species from the GMHI article are loaded and cleaned to match the format used here (the first three characters of each name are dropped)</h3>

In [308]:
MH_test=pd.read_csv('../../DataSets/INDEX/GMHI/MH_species.txt', sep="\t", index_col=0)
MN_test=pd.read_csv('../../DataSets/INDEX/GMHI/MN_species.txt', sep='\t',index_col=0)
MH_estandar=set()
for i in MH_test.index:
    MH_estandar.add(i[3:])
MN_estandar=set()    
for i in MN_test.index:
    MN_estandar.add(i[3:])

In [309]:
H=MH_estandar.union(MH)
N=MN_estandar.union(MN)

In [310]:
# Compute the GMHI of each sample and its balanced accuracy against the metadata
GMHI=get_all_GMHI(tax,H,N)

accuracy=get_accuracy(GMHI,meta)

print('The balanced accuracy obtained is', accuracy*100 ,'%')
GMHI

The balanced accuracy obtained is 76.83972310969116 %


SRR5946989    3.040186
SRR5983265    1.512590
SRR5946777    3.030951
SRR5946822   -2.020200
SRR5946857    1.311124
                ...   
SRR5946648   -0.665628
SRR5946925    1.504066
ERR209694     0.916405
SRR5946668   -3.895397
ERR209312     2.355977
Name: GMHI, Length: 613, dtype: float64

<h3>Note that the parameters were chosen to maximize the accuracy on the CAMDA data. The index could be improved by using a larger set of samples.</h3>
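
The notebook does not show how the thresholds were searched. One possible way, given here only as a hedged sketch, is a plain grid search. The helper grid_search and the stand-in toy_score are hypothetical names; in this notebook the scoring function could wrap get_MH_MN, get_all_GMHI and get_accuracy.

```python
import itertools

def grid_search(score, theta_f_grid, theta_d_grid):
    """Return (best_score, best_theta_f, best_theta_d) over all grid pairs."""
    best = None
    for theta_f, theta_d in itertools.product(theta_f_grid, theta_d_grid):
        s = score(theta_f, theta_d)
        # Keep the first pair that reaches the highest score
        if best is None or s > best[0]:
            best = (s, theta_f, theta_d)
    return best

# Toy stand-in for the real scoring function so that the sketch runs on its own
def toy_score(theta_f, theta_d):
    return 1.0 - abs(theta_f - 2.5) - abs(theta_d - 0.1)

best = grid_search(toy_score, [2.0, 2.5, 3.0], [0.05, 0.1, 0.2])
print(best)  # (1.0, 2.5, 0.1)
```

Because the thresholds are tuned on the same samples used to measure the accuracy, the resulting score is likely optimistic; a held-out set would give a fairer estimate.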

<h1>As an experiment, GMHI was applied to pathways to see whether it can predict the health status of a sample</h1>

In [311]:
# Load the pathways data
pathways=pd.read_csv('../../DataSets/CAMDA/pathways.txt', sep="\t", index_col=0)
pathways

# Pathway,SRR5946989,SRR5983265,SRR5946777,SRR5946822,SRR5946857,SRR5946780,ERR209782,SRR5936213,SRR5947045,SRR5947812,...,SRR5947830,ERR209403,SRR5947050,ERR209519,SRR5935745,SRR5946648,SRR5946925,ERR209694,SRR5946668,ERR209312
UNINTEGRATED|g__Absiella.s__Absiella_dolichum,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
UNINTEGRATED|g__Acetobacter.s__Acetobacter_sp_CAG_267,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
UNINTEGRATED|g__Acetobacter.s__Acetobacter_sp_CAG_977,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
UNINTEGRATED|g__Acholeplasma.s__Acholeplasma_sp_CAG_878,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.009803,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
UNINTEGRATED|g__Acidaminococcus.s__Acidaminococcus_intestini,0.0,0.0,0.0,0.02683,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VALSYN-PWY: L-valine biosynthesis|g__Veillonella.s__Veillonella_rogosae,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000004,...,0.0,0.000000,0.000002,0.0,0.0,0.0,0.0,0.0,0.000002,0.0
VALSYN-PWY: L-valine biosynthesis|g__Veillonella.s__Veillonella_seminalis,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
VALSYN-PWY: L-valine biosynthesis|g__Veillonella.s__Veillonella_tobetsuensis,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
VALSYN-PWY: L-valine biosynthesis|g__Victivallales_unclassified.s__Victivallales_bacterium_CCUG_44730,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


<h3>The pathways table is rescaled by a factor of 100</h3>

In [312]:
pathways=pathways*100

<h3>GMHI is computed for pathways. The metrics are computed first, in their own cell, because they take a long time. The index is then computed with different parameters to choose the ones with the highest accuracy.</h3>


In [313]:
meta = pd.read_csv('../../DataSets/CAMDA/metadata.txt', sep="\t")
metrics_pathways=get_fH_fN_dH_dN(meta,pathways)

In [314]:
theta_f=2.4
theta_d=0.1
MH_pathways,MN_pathways=get_MH_MN(metrics_pathways,theta_f,theta_d)

|MH|= 879
|MN|= 1053


In [315]:
GMHI_path=get_all_GMHI(pathways,MH_pathways,MN_pathways)
accuracy=get_accuracy(GMHI_path,meta)
print('The balanced accuracy obtained is', accuracy*100 ,'%')
GMHI_path

The balanced accuracy obtained is 73.81895633652822 %


SRR5946989    2.256705
SRR5983265    2.755124
SRR5946777    2.450560
SRR5946822   -2.440611
SRR5946857    3.603390
                ...   
SRR5946648   -1.687041
SRR5946925    1.936091
ERR209694     1.778456
SRR5946668   -2.180028
ERR209312     0.132102
Name: GMHI, Length: 613, dtype: float64

<h1>Since the experiment was successful, GMHI is computed separately for integrated and unintegrated pathways.</h1>

In [316]:
pathways=pd.read_csv('../../DataSets/CAMDA/pathways.txt', sep="\t", index_col=0)
pathways=pathways*100
pathways


<h3>The following function splits the table into integrated and unintegrated pathways.</h3>

In [None]:
def get_integrated_unintegrated(pathways):
    # Rows whose name contains 'UNINTEGRATED' form the unintegrated table; the rest form the integrated one
    is_unintegrated=pathways.index.str.contains('UNINTEGRATED',regex=False)
    return pathways.loc[~is_unintegrated,:], pathways.loc[is_unintegrated,:]

In [None]:
integrated_pathways,unintegrated_pathways=get_integrated_unintegrated(pathways)

<h3>Both pathway tables are normalized so that each sample sums to 100</h3>

In [None]:
def normalize_tax(tax):
    return (tax/tax.sum())*100
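
A quick check on a toy table shows what this normalization does, including one edge case. The helper normalize_columns is a hypothetical copy of the function above, and the table is made up for illustration.

```python
import pandas as pd

def normalize_columns(table):
    # Same operation as normalize_tax: divide each column by its sum, times 100
    return (table / table.sum()) * 100

# Hypothetical table: sample s2 has no counts at all
toy = pd.DataFrame({'s1': [1.0, 3.0], 's2': [0.0, 0.0]}, index=['f1', 'f2'])
normalized = normalize_columns(toy)
print(normalized)
```

Column s1 becomes 25 and 75, so it sums to 100. Column s2 becomes NaN because its sum is zero, so samples with an all-zero column may need to be dropped or handled before computing the index.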

In [None]:
integrated_pathways=normalize_tax(integrated_pathways)
unintegrated_pathways=normalize_tax(unintegrated_pathways)

In [None]:
meta = pd.read_csv('../../DataSets/CAMDA/metadata.txt', sep="\t")
metrics_integrated=get_fH_fN_dH_dN(meta,integrated_pathways)
metrics_unintegrated=get_fH_fN_dH_dN(meta,unintegrated_pathways)

<h2>The sets for GMHI_integrated and GMHI_unintegrated are computed. The parameters are chosen as before.</h2>

In [None]:
theta_f=2.5
theta_d=0.1
MH_integrated,MN_integrated=get_MH_MN(metrics_integrated,theta_f,theta_d)

In [None]:
theta_f=2.7
theta_d=0.1
MH_unintegrated,MN_unintegrated=get_MH_MN(metrics_unintegrated,theta_f,theta_d)

In [None]:
GMHI_integrated=get_all_GMHI(integrated_pathways,MH_integrated,MN_integrated)
accuracy=get_accuracy(GMHI_integrated,meta)
print('The balanced accuracy obtained is', accuracy*100 ,'%')
GMHI_integrated

In [None]:
GMHI_unintegrated=get_all_GMHI(unintegrated_pathways,MH_unintegrated,MN_unintegrated)
accuracy=get_accuracy(GMHI_unintegrated,meta)
print('The balanced accuracy obtained is', accuracy*100 ,'%')
GMHI_unintegrated

<h1>Once the GMHI values are computed, a Random Forest classifier is built.</h1>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import classification_report

<h3>The GMHI values are collected in a data frame</h3>

In [None]:
Indices=pd.DataFrame(index=GMHI.index)
Indices['GMHI_tax']=GMHI
Indices['GMHI_integrated_pathways']=GMHI_integrated
Indices['GMHI_unintegrated_pathways']=GMHI_unintegrated

In [None]:
meta = pd.read_csv('../../DataSets/CAMDA/metadata.txt', sep="\t")
classificator=meta['Diagnosis'].copy(deep=True)

In [None]:
Indices

<h2>Healthy samples are labeled 1 and Unhealthy samples 0</h2>

In [None]:

classificator[classificator!='Healthy']=0
classificator[classificator=='Healthy']=1

In [None]:
X=Indices.values
Y=classificator.values.astype(int)
X

<h2>The Random Forest classifier is built</h2>

In [None]:
clf = RandomForestClassifier(n_estimators=500, max_depth=6,min_samples_split=3, random_state=0)
clf = clf.fit(X, Y)
scores = cross_val_score(clf, X, Y, cv=5)
scores.mean()


<h3>Note that the mean accuracy over the 5 cross-validation folds is acceptable</h3>

In [None]:
Y_pred = clf.predict(X)


In [None]:
accuracy_score(Y,Y_pred)

In [None]:
clf.feature_importances_

In [None]:
print(classification_report(Y, Y_pred))
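
The report above is computed on the same samples that were used to fit the model, so it tends to look better than performance on unseen data. As a hedged sketch of an alternative, out-of-fold predictions can be obtained with cross_val_predict. The data below are synthetic stand-ins for the three GMHI features, and the names X_demo, y_demo and model are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the three GMHI features (not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
model = RandomForestClassifier(
    n_estimators=50, max_depth=6, min_samples_split=3, random_state=0
)

# Out-of-fold predictions: each sample is predicted by a model that never saw it
y_oof = cross_val_predict(model, X_demo, y_demo, cv=5)
oof_score = balanced_accuracy_score(y_demo, y_oof)

# In-sample predictions, as done in the cells above
y_in = model.fit(X_demo, y_demo).predict(X_demo)
in_score = balanced_accuracy_score(y_demo, y_in)
print(oof_score, in_score)
```

Comparing the two scores gives a rough idea of how much the in-sample figure overstates the performance.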

<h1>The q2-dysbiosis data are used to test the model</h1>

<h3>The q2-dysbiosis data are loaded</h3>

In [None]:
meta_test = pd.read_csv('../../DataSets/q2-dysbiosis_test/metadata_test.txt', sep="\t",)
species_test=pd.read_csv('../../DataSets/q2-dysbiosis_test/species.txt',sep="\t", index_col=0)
pathways_test=pd.read_csv('../../DataSets/q2-dysbiosis_test/pathways_stratified.txt',sep="\t", index_col=0)
pathways_unstratified_test=pd.read_csv('../../DataSets/q2-dysbiosis_test/pathways_unstratified.txt',sep="\t", index_col=0)

<h3>In meta_test, samples labeled control are relabeled as Healthy, and the id column is renamed to SampleID</h3>

In [None]:
for i in meta_test.index:
    if meta_test.loc[i,'Diagnosis']=='control':
        meta_test.loc[i,'Diagnosis']='Healthy'
meta_test.rename(columns={'id':'SampleID'},inplace=True)

In [None]:
pathways_test=pathways_test*100

<h3>The integrated and unintegrated tables are obtained from pathways_test</h3>

In [None]:
integrated_pathways_test,unintegrated_pathways_test=get_integrated_unintegrated(pathways_test)

<h2>In the pathway sets, spaces in the pathway names are replaced with underscores, since that is how they usually appear.</h2>

In [None]:
MH_integrated_test=set()
for i in MH_integrated:
    MH_integrated_test.add(i.replace(' ','_'))

In [None]:
MN_integrated_test=set()
for i in MN_integrated:
    MN_integrated_test.add(i.replace(' ','_'))

In [None]:
MH_unintegrated_test=set()
for i in MH_unintegrated:
    MH_unintegrated_test.add(i.replace(' ','_'))

In [None]:
MN_unintegrated_test=set()
for i in MN_unintegrated:
    MN_unintegrated_test.add(i.replace(' ','_'))

In [None]:
meta_test

<h2>The GMHI values are computed for the test data</h2>

In [None]:
GMHI_tax_test=get_all_GMHI(species_test,H,N)
accuracy=get_accuracy(GMHI_tax_test,meta_test)
print('The balanced accuracy obtained is', accuracy*100 ,'%')

In [None]:
GMHI_integrated_test=get_all_GMHI(integrated_pathways_test,MH_integrated_test,MN_integrated_test)
accuracy=get_accuracy(GMHI_integrated_test,meta_test)
print('The balanced accuracy obtained is', accuracy*100 ,'%')

In [None]:
# Use the sets with underscores in the names, as is done for the integrated pathways
GMHI_unintegrated_test=get_all_GMHI(unintegrated_pathways_test,MH_unintegrated_test,MN_unintegrated_test)
accuracy=get_accuracy(GMHI_unintegrated_test,meta_test)
print('The balanced accuracy obtained is', accuracy*100 ,'%')

In [None]:
Indices_test=pd.DataFrame(index=GMHI_tax_test.index)
Indices_test['GMHI_tax']=GMHI_tax_test
Indices_test['GMHI_integrated_pathways']=GMHI_integrated_test
Indices_test['GMHI_unintegrated_pathways']=GMHI_unintegrated_test
Indices_test

In [None]:
classificator_test=meta_test['Diagnosis'].copy(deep=True)

<h3>Healthy samples are labeled 1 and unhealthy samples 0</h3>

In [None]:
classificator_test[classificator_test!='Healthy']=0
classificator_test[classificator_test=='Healthy']=1

In [None]:
X_test=Indices_test.values
Y_test=classificator_test.values.astype(int)


<h3>Prediction and accuracy for the q2-dysbiosis data:</h3>

In [None]:
Y_pred_test = clf.predict(X_test)
Y_pred_test

In [None]:
accuracy_score(Y_test,Y_pred_test)

<h1>Test on COVID data</h1>

<h3>The COVID data are loaded</h3>

In [None]:
meta_covid=pd.read_csv('../../DataSets/COVID/CAMDA_metadata.txt', sep="\t",index_col=0)
tax_covid=pd.read_csv('../../DataSets/COVID/CAMDA_taxa.txt', sep='\t',index_col=0)
pathways_covid=pd.read_csv('../../DataSets/COVID/CAMDA_pathways.txt', sep='\t',index_col=0)

In [None]:
pathways_covid=pathways_covid*100

<h2>The GMHI values are computed for taxonomy and pathways</h2>

In [None]:
GMHI_tax_covid=get_all_GMHI(tax_covid,H,N)

In [None]:
integrated_covid,unintegrated_covid=get_integrated_unintegrated(pathways_covid)

In [None]:
GMHI_integrated_covid=get_all_GMHI(integrated_covid,MH_integrated_test,MN_integrated_test)

In [None]:
GMHI_unintegrated_covid=get_all_GMHI(unintegrated_covid,MH_unintegrated_test,MN_unintegrated_test)

<h3>They are collected in a data frame</h3>

In [None]:
predicts=pd.DataFrame(index=GMHI_tax_covid.index)
predicts['GMHI_tax']=GMHI_tax_covid
predicts['GMHI_integrated_pathways']=GMHI_integrated_covid
predicts['GMHI_unintegrated_pathways']=GMHI_unintegrated_covid
predicts

<h2>The prediction is made</h2>

In [None]:
clf.predict(predicts.values)

<h3>The probability of being healthy is computed; the closer to 1, the healthier the sample is predicted to be</h3>

In [None]:
prob=clf.predict_proba(predicts.values)
prob[:,1]
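
Taking column 1 as the probability of being healthy relies on the label 1 being the second entry of the classifier's classes_ attribute. The sketch below, on synthetic data with hypothetical names, looks the column up explicitly instead of hard-coding it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: label 1 when the first feature is positive
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
proba = model.predict_proba(X_demo[:5])

# Columns of predict_proba follow the order of model.classes_
healthy_col = list(model.classes_).index(1)
p_healthy = proba[:, healthy_col]
print(model.classes_, p_healthy)
```

With integer labels 0 and 1 the classes are sorted, so the column for label 1 is indeed index 1, and each row of probabilities sums to 1.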

In [None]:
predicts['RFT']=clf.predict(predicts.values)
predicts['Prob_RFT']=prob[:,1]

In [None]:
for i in predicts.index:
    if predicts.loc[i,'RFT']==0:
        predicts.loc[i,'RFT']='Unhealthy'
    else:
        predicts.loc[i,'RFT']='Healthy'


In [None]:
predicts

In [None]:
predicts.to_csv('pred_COVID_RFT_GMHI.csv', sep=',')


<h1>This index could be improved by obtaining a better set for the pathways. When the pathways GMHI is 0, we suspect that not enough beneficial or harmful functions are recorded in our sets.</h1>
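
One way to probe that suspicion, sketched here under stated assumptions, is to measure how many features of a set actually appear in the table of a new dataset. The helper coverage and the example names are hypothetical and not part of the notebook.

```python
def coverage(feature_set, available_features):
    """Fraction of feature_set that is found among available_features."""
    if not feature_set:
        return 0.0
    return len(set(feature_set) & set(available_features)) / len(feature_set)

# Hypothetical pathway set and the row names of a new dataset
MH_demo = {'PWY-1:_a', 'PWY-2:_b', 'PWY-3:_c', 'PWY-4:_d'}
index_demo = ['PWY-1:_a', 'PWY-3:_c', 'PWY-9:_z']

print(coverage(MH_demo, index_demo))  # 0.5
```

A low coverage for both sets would be consistent with GMHI values of 0, since get_Psi falls back to the same floor for both sets when none of their features are present in a sample.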