# Extracción y limpieza de datos

## Lectura e importación de los Datasets sobre datos genómicos benignos y patogénicos

En este Notebook se procede a realizar una limpieza de los dos datasets con el fin de dejar los datos preparados para poder trabajar con ellos.
Para ello, se seguirán los siguientes pasos:

- Primero, importamos los dos datasets.
- Comprobamos qué columnas son numéricas inicialmente y cuáles no.
- Comprobamos qué columnas numéricas no aparecen como tal (debido a que contengan caracteres indicando el nulo mal codificados, aparezcan como un rango de valores,...) y las formateamos.
- Comprobamos el número de nulos de cada columna. De esta forma, en caso de que haya columnas con un alto porcentaje de nulos serán eliminadas. La eliminación se hará en ambos datasets aunque solo sea bajo en uno de ellos, ya que la información que nos dan las columnas de estos datos es su comparación.
- Comprobamos que las columnas no numéricas están codificadas con un rango de valores categóricos determinado, para eliminar signos que puedan ser ruido
- Comprobamos si las variables numéricas pueden expresarse mediante una variable binaria, indicando si es benigna o patógena; para ello, nos basaremos en los datos que tenemos de cada uno de los atributos.
- Comprobamos el rango de valores con el que trabajan las variables numéricas en los dos datasets; si es posible, se realizará una normalización a valores entre 0 y 1 para hacerlos más útiles.
- Exportaremos los datasets en formato .csv para utilizarlos en los siguientes pasos


In [1]:
import pandas as pd
import numpy as np

#### Para mostrar todas las columnas
pd.set_option('display.max_columns', None, "display.max_columns", None)

df1 = pd.read_csv(r'datasets/benign.out',low_memory=False,error_bad_lines=False,sep='\t')
df2 = pd.read_csv(r'datasets/pathogenic.out',low_memory=False,error_bad_lines=False,sep='\t')

### Lectura de las columnas que queremos utilizar en el Dataset

Las columnas que vamos a utilizar en el dataset se encuentran listadas en un fichero columnas.txt

In [2]:
f = open("files/columnas.txt", "r")
columns = f.read().split("\n")
    
df1 = df1[columns]
df2 = df2[columns]

In [3]:
df1.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10896 entries, 0 to 10895
Data columns (total 144 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   #chr                                                             10896 non-null  object 
 1   pos(1-based)                                                     10896 non-null  int64  
 2   ref                                                              10896 non-null  object 
 3   alt                                                              10896 non-null  object 
 4   aaref                                                            10896 non-null  object 
 5   aaalt                                                            10896 non-null  object 
 6   hg19_chr                                                         10896 non-null  object 
 7   hg19_pos(1-based)                      

In [4]:
df2.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5515 entries, 0 to 5514
Data columns (total 144 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   #chr                                                             5515 non-null   object 
 1   pos(1-based)                                                     5515 non-null   int64  
 2   ref                                                              5515 non-null   object 
 3   alt                                                              5515 non-null   object 
 4   aaref                                                            5515 non-null   object 
 5   aaalt                                                            5515 non-null   object 
 6   hg19_chr                                                         5515 non-null   object 
 7   hg19_pos(1-based)                        

Antes de comenzar, tenemos que ver qué significan las columnas para saber qué tipo de operaciones tenemos que aplicar sobre cada una de ellas.
Para ello, nos vamos a basar en la documentación de los siguientes enlaces:

- https://drive.google.com/file/d/1xck02Xa2f4y5RUf0Sb6GZQNqQScURy8B/view
- https://vatlab.github.io/vat-docs/applications/annotation/variants/dbnsfp/

El valor de las columnas se recogerá en un archivo adicional.

Primero, vamos a comprobar qué columnas están multivaluadas (contienen varios valores)

In [4]:
multival = []
for col in columns:
    for val in df1[col].unique():
        if ';' in str(val):
            multival.append(col)
            break
    for val in df2[col].unique():
        if ';' in str(val):
            if col not in multival:
                multival.append(col)
            break     
df1[multival]

Unnamed: 0,aapos,SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,MutationTaster_score,MutationAssessor_score,FATHMM_score,PROVEAN_score,VEST4_score,MVP_score,MPC_score,DEOGEN2_score,LIST-S2_score,Aloft_prob_Tolerant,Aloft_prob_Recessive,Aloft_prob_Dominant
0,4,0.258,0.304,.,.,1,-1.1,-1.02,-0.27,0.071,0.553347326629,0.626244545349,.,0.29767,.,.,.
1,258;120,0.138;.,0.1;0.1,.;.,.;.,0.999476,2.43;.,-0.88;.,-2.99;.,0.25;0.234,0.834527163076;0.834527163076,0.872226974682;.,.;.,0.923708;0.927707,.;.,.;.,.;.
2,353;215,0.649;.,0.605;0.531,.;.,.;.,1,0.105;.,-0.84;.,-0.05;.,0.056;0.07,0.546866791787;0.546866791787,0.631211528713;.,.;.,0.147585;0.117888,.;.,.;.,.;.
3,375;237,0.022;.,0.418;0.417,.;.,.;.,1,1.58;.,-0.89;.,-1.15;.,0.455;0.463,0.710180199758;0.710180199758,0.942051258392;.,.;.,0.888711;0.891911,.;.,.;.,.;.
4,510;372,0.236;.,0.275;0.293,.;.,.;.,0.999792,0.325;.,3.85;.,-2.72;.,0.36;0.352,0.695986974221;0.695986974221,0.550584074293;.,.;.,0.756924;0.776522,.;.,.;.,.;.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10891,122;2257;168,0.081;0.252;.,0.361;0.966;.,.;0.245;.,.;0.155;.,0.995901;0.945543,.;0.485;.,-4.72;-4.72;.,-1.48;-1.19;.,0.052;0.014;.,.;.;.,.;0.849239045082;.,.;0.775965;.,0.689031;0.705629;0.689031,.;.;.,.;.;.,.;.;.
10892,1260,1.0,1.0,0.0,0.0,1,-1.585,-5.13,0.94,0.007,.,0.0443853036192,0.233462,0.148285,.,.,.
10893,998,0.05,0.578,0.008,0.002,1,1.79,-5.47,-1.7,0.647,0.820936177805,0.0469799550025,0.394953,0.29957,.,.,.
10894,503,0.031,0.038,0.992,0.594,0.999515,0.855,-6.04,-0.99,0.667,0.783423680238,1.47171924055,0.790767,0.688631,.,.,.


Mientras que muchas columnas son datos provenientes de modelos de los cuales hay que escoger el valor más alto (ya que es el que indica el valor más patógeno), no todas cumplen esta característica. Antes de hacer una modificación, vamos a ver cuales son:

 - FATHMM_score
 - PROVEAN_score
 - SIFT_score   
 - SIFT4G_score
 - Polyphen2_HDIV_score
 - Polyphen2_HVAR_score                            

En estas columnas, el significado de los valores está invertido, es decir: el mínimo es el valor más patógeno. Aplicaremos una elección del valor mínimo; en el resto, cogeremos el valor mayor de la colección de datos.

(ni aapos ni 'Aloft_prob_Tolerant', 'Aloft_prob_Recessive', 'Aloft_prob_Dominant' son scores, por lo que, por el momento, las dejaremos tal cual. Lo que si que haremos será sustituir los puntos por valores nulos, para comprobar si estas columnas contienen un número de datos aceptable)

### Conseguir los valores máximos y mínimos de celdas con valores múltiples

##### GetMaxOfRangeValues(values)

Dada una cadena (string) values con una serie de valores numéricos separados por ';' el formato 'a;b;c;...;n', devuelve el valor máximo de la lista de valores (ignora los puntos, los nulos, las variables numéricas únicas )

##### GetMinOfRangeValues(values)

Dada una cadena (string) values con una serie de valores numéricos separados por ';' el formato 'a;b;c;...;n', devuelve el valor mínimo de la lista de valores (ignora los puntos, los nulos, las variables numéricas únicas )

In [5]:
def GetMaxOfRangeValues(values):
    ### detectar si es un valor nulo
    if values != None:
        
        ### si es un único valor numérico lo devolvemos
        if type(values) == int or type(values) == float: return values
        else:
            
            ### Si la cadena contiene ; (colecciones) o '.' (posibles valores nulos) la tratamos
            if ";" in values or "." in values:
                try:

                    tmp = values.split(";")
                    tmp2 =[value for value in tmp if value != '.']

                    ### Si hay valores numéricos (que no sean colecciones solo de puntos) devolvemos el máximo
                    ### Si no, devolvemos un nulo
                    if len(tmp2) > 0:

                        return float(max(tmp2))
                    else:

                        return np.nan
                except:
                    print(values)
                    return values
                
            ### Si no, devolvemos la original
            else:
                
                return values
    else:
        
        return np.nan 

In [6]:
def GetMinOfRangeValues(values):
    ### detectar si es un valor nulo
    if values != None:
        
        ### si es un único valor numérico lo devolvemos
        if type(values) == int or type(values) == float: return values
        else:
            
            ### Si la cadena contiene ; (colecciones) o '.' (posibles valores nulos) la tratamos
            if ";" in values or "." in values:
                
                try:
                    
                    tmp = values.split(";")
                    tmp2 =[value for value in tmp if value != '.']

                    ### Si hay valores numéricos (que no sean colecciones solo de puntos) devolvemos el máximo
                    ### Si no, devolvemos un nulo
                    if len(tmp2) > 0:

                        return float(min(tmp2))

                    else:
                        return np.nan
                    
                except:
                    print(values)
                    return values
                
            ### Si no, devolvemos la original
            else:
                
                return values
    else:
        
        return np.nan 

In [7]:
reverse = ['FATHMM_score','PROVEAN_score','SIFT_score','SIFT4G_score','Polyphen2_HDIV_score','Polyphen2_HVAR_score']
for col in reverse:
    df1[col] = df1[col].apply(lambda x: GetMinOfRangeValues(x))
    df2[col] = df2[col].apply(lambda x: GetMinOfRangeValues(x))
    
to_remove = ['aapos','Aloft_prob_Tolerant','Aloft_prob_Recessive','Aloft_prob_Dominant']
not_reversed = [x for x in columns if x not in to_remove]
not_reversed = [x for x in not_reversed if x not in reverse]
for col in not_reversed:
    df1[col] = df1[col].apply(lambda x: GetMaxOfRangeValues(x))
    df2[col] = df2[col].apply(lambda x: GetMaxOfRangeValues(x))

In [8]:
def SetNanValues(x):
    if x != np.nan and x != None:
        if ';' in str(x) or '.' in str(x):
            tmp = str(x).split(";")
            tmp2 =[value for value in tmp if value != '.']
            if len(tmp2) > 0:
                return x
            else :
                return np.nan
        else:
            return np.nan
    else:
        return np.nan
    
for col in ['aapos','Aloft_prob_Tolerant', 'Aloft_prob_Recessive', 'Aloft_prob_Dominant']:
    df1[col] = df1[col].apply(lambda x: SetNanValues(x))
    df2[col] = df2[col].apply(lambda x: SetNanValues(x)) 

### Variables numéricas del dataset de 'benign' y valores

In [9]:
df1.select_dtypes(include=np.number).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10896 entries, 0 to 10895
Data columns (total 70 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   pos(1-based)                10896 non-null  int64  
 1   hg19_pos(1-based)           10896 non-null  int64  
 2   SIFT_score                  10545 non-null  float64
 3   SIFT4G_score                10437 non-null  float64
 4   Polyphen2_HDIV_score        9816 non-null   float64
 5   Polyphen2_HVAR_score        9816 non-null   float64
 6   LRT_score                   9625 non-null   float64
 7   LRT_Omega                   9625 non-null   float64
 8   FATHMM_score                10528 non-null  float64
 9   PROVEAN_score               10564 non-null  float64
 10  VEST4_score                 10836 non-null  float64
 11  MetaSVM_score               10809 non-null  float64
 12  MetaLR_score                10809 non-null  float64
 13  M-CAP_score                 280

In [10]:
df1.describe()

Unnamed: 0,pos(1-based),hg19_pos(1-based),SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,LRT_score,LRT_Omega,FATHMM_score,PROVEAN_score,VEST4_score,MetaSVM_score,MetaLR_score,M-CAP_score,REVEL_score,MVP_score,MPC_score,PrimateAI_score,DEOGEN2_score,BayesDel_addAF_score,BayesDel_noAF_score,ClinPred_score,LIST-S2_score,CADD_raw_hg19,CADD_phred_hg19,DANN_score,fathmm-MKL_coding_score,fathmm-XF_coding_score,Eigen-raw_coding,Eigen-phred_coding,Eigen-PC-raw_coding,Eigen-PC-phred_coding,GenoCanyon_score,integrated_fitCons_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score,LINSIGHT,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian,phyloP17way_primate,phastCons100way_vertebrate,phastCons30way_mammalian,phastCons17way_primate,SiPhy_29way_logOdds,1000Gp3_AF,1000Gp3_AFR_AF,1000Gp3_EUR_AF,1000Gp3_AMR_AF,1000Gp3_EAS_AF,1000Gp3_SAS_AF,TWINSUK_AF,ALSPAC_AF,UK10K_AF,ESP6500_AA_AF,ESP6500_EA_AF,ExAC_AF,ExAC_nonTCGA_AF,ExAC_nonpsych_AF,gnomAD_exomes_AF,gnomAD_genomes_AF,gnomAD_genomes_POPMAX_AF,P(HI),HIPred_score,GHIS,P(rec),GDI,GDI-Phred,LoFtool_score
count,10896.0,10896.0,10545.0,10437.0,9816.0,9816.0,9625.0,9625.0,10528.0,10564.0,10836.0,10809.0,10809.0,2809.0,10809.0,7790.0,9681.0,10561.0,9969.0,10886.0,10886.0,10837.0,10513.0,10896.0,10896.0,10896.0,10896.0,10246.0,10315.0,10315.0,10315.0,10315.0,10896.0,10367.0,10367.0,10367.0,10367.0,23.0,10890.0,10896.0,10896.0,10896.0,10896.0,10896.0,10896.0,10874.0,10050.0,10050.0,10050.0,10050.0,10050.0,10050.0,5930.0,5930.0,5930.0,9259.0,9259.0,10775.0,10750.0,10766.0,10786.0,10727.0,10727.0,2462.0,2498.0,2228.0,2090.0,2502.0,2502.0,2218.0
mean,77809550.0,78034850.0,0.250889,0.284394,0.35161,0.247714,0.15118,8.121377,-0.415346,-1.341702,0.304747,-0.691645,0.193481,0.136305,0.225933,0.573413,0.375759,0.444461,0.249658,-0.427538,-0.368516,0.041661,0.697971,1.765902,16.682364,0.854079,0.540429,0.340044,-0.387152,2.312673,-0.350304,2.471486,0.772017,0.629029,0.622385,0.639234,0.624179,0.407182,2.319803,2.401601,0.412908,0.215375,0.606661,0.602073,0.600295,10.065975,0.067581,0.073991,0.064869,0.065695,0.065757,0.064917,0.106759,0.106877,0.10682,0.075279,0.06716,0.06081,0.060965,0.061176,0.061653,0.063109,0.06132,0.32452,0.499413,0.509718,0.292036,3940.656848,9.67283,0.479371
std,59548300.0,59748460.0,0.300925,0.301975,0.421211,0.358276,0.257095,226.519277,2.180888,1.807624,0.245622,0.515675,0.248972,0.202026,0.197819,0.271821,0.384867,0.17016,0.230272,0.220369,0.272038,0.104624,0.23758,1.344879,8.188233,0.221494,0.388482,0.300672,0.721707,2.535736,0.724608,2.680154,0.381573,0.102779,0.087718,0.088828,0.085339,0.376656,3.578101,2.816671,0.898194,0.711293,0.456973,0.426086,0.407798,5.173938,0.171184,0.177084,0.178458,0.175764,0.187365,0.177575,0.219548,0.219616,0.219568,0.176229,0.181523,0.168677,0.168784,0.168126,0.170672,0.167036,0.171954,0.258143,0.190784,0.07363,0.22277,8681.126403,7.692522,0.345802
min,88756.0,138755.0,0.0,0.0,0.0,0.0,0.0,0.0,-12.4,-12.58,0.002,-2.0058,0.0,0.00095,0.0,0.013882,0.000382,0.154256,0.000139,-1.08361,-1.19798,1e-05,0.0024,-2.771251,0.001,0.036852,0.0,0.0034,-2.793459,0.000141,-2.869872,0.000139,1e-06,0.001892,0.0,0.0,0.0,0.052183,-12.3,-8.218,-6.245,-6.613,0.0,0.0,0.0,0.0022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000132,0.0,0.0,8e-06,9e-06,1.1e-05,4e-06,7e-06,7e-06,0.00084,0.111908,0.373524,0.00102,1.96708,0.0642,0.000207
25%,32333360.0,32542010.0,0.017,0.03,0.001,0.002,4.7e-05,0.105282,-1.92,-2.17,0.106,-1.026,0.0108,0.017411,0.075,0.349429,0.115791,0.307796,0.068268,-0.581574,-0.560784,0.005606,0.568943,0.763498,11.66,0.800333,0.103938,0.079038,-0.917134,0.498959,-0.881734,0.64489,0.698893,0.554377,0.573888,0.576033,0.564101,0.098374,0.9065,0.283,-0.0955,-0.147,0.006,0.061,0.108,6.01905,0.002396,0.0,0.0,0.0,0.0,0.0,0.000539,0.000519,0.000529,0.001362,0.000116,0.001046,0.001066,0.001245,0.00097,0.001473,0.000201,0.12195,0.369629,0.444529,0.12295,367.466,4.05229,0.116
50%,63439130.0,62049960.0,0.121,0.172,0.06,0.027,0.007943,0.316417,-0.33,-0.98,0.229,-0.9327,0.0782,0.049554,0.165,0.600466,0.249517,0.410379,0.175935,-0.453456,-0.396784,0.016653,0.762924,1.790119,18.125,0.969928,0.65414,0.221524,-0.371368,1.446389,-0.28775,1.589632,0.999668,0.660377,0.59043,0.65145,0.631631,0.179181,3.46,1.643,0.92,0.599,0.969,0.851,0.797,9.57175,0.00619,0.009077,0.000994,0.002882,0.0,0.0,0.008091,0.008173,0.008199,0.010903,0.000743,0.003083,0.003142,0.003537,0.002893,0.005594,0.001007,0.22721,0.519197,0.505031,0.20606,1498.31247,7.20198,0.501
75%,119342600.0,119168400.0,0.385,0.462,0.883,0.42,0.182388,0.686095,1.01,-0.23,0.45,-0.5327,0.2911,0.155468,0.316,0.812035,0.499609,0.55541,0.366939,-0.308558,-0.207607,0.037037,0.885311,2.768021,23.1,0.995457,0.931705,0.582879,0.185116,3.238,0.23436,3.356185,1.0,0.706548,0.702456,0.709663,0.711,0.852675,4.89,4.11825,1.138,0.665,1.0,0.995,0.985,14.043675,0.024161,0.046142,0.014911,0.017291,0.008929,0.014315,0.072209,0.071873,0.072302,0.044959,0.017018,0.01336,0.013618,0.0142,0.013224,0.021771,0.012795,0.45035,0.636561,0.562703,0.3844,4900.23196,14.2622,0.818
max,247419000.0,247582300.0,1.0,1.0,1.0,1.0,1.0,7779.79,6.49,13.57,0.997,1.3377,0.9999,0.994657,0.992,1.0,3.043661,0.973357,0.988043,0.733491,0.815833,0.999961,0.999865,7.532029,38.0,0.99963,0.99907,0.987401,1.131697,19.01063,1.048397,22.18314,1.0,0.839682,0.958517,0.858454,0.836244,0.978928,6.17,10.003,1.312,0.756,1.0,1.0,1.0,26.5331,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999965,1.0,0.99999,0.87535,0.718477,0.99968,74772.86558,42.91324,0.997


### Variables numéricas del dataset de 'pathogenic' y valores

In [11]:
df2.select_dtypes(include=np.number).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5515 entries, 0 to 5514
Data columns (total 69 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   pos(1-based)                5515 non-null   int64  
 1   hg19_pos(1-based)           5515 non-null   int64  
 2   SIFT_score                  5327 non-null   float64
 3   SIFT4G_score                5364 non-null   float64
 4   Polyphen2_HDIV_score        5055 non-null   float64
 5   Polyphen2_HVAR_score        5055 non-null   float64
 6   LRT_score                   5219 non-null   float64
 7   LRT_Omega                   5219 non-null   float64
 8   FATHMM_score                5379 non-null   float64
 9   PROVEAN_score               5378 non-null   float64
 10  MetaSVM_score               5447 non-null   float64
 11  MetaLR_score                5447 non-null   float64
 12  M-CAP_score                 5430 non-null   float64
 13  REVEL_score                 5447 

In [12]:
df2.describe()

Unnamed: 0,pos(1-based),hg19_pos(1-based),SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,LRT_score,LRT_Omega,FATHMM_score,PROVEAN_score,MetaSVM_score,MetaLR_score,M-CAP_score,REVEL_score,MVP_score,MPC_score,PrimateAI_score,DEOGEN2_score,BayesDel_addAF_score,BayesDel_noAF_score,ClinPred_score,LIST-S2_score,CADD_raw_hg19,CADD_phred_hg19,DANN_score,fathmm-MKL_coding_score,fathmm-XF_coding_score,Eigen-raw_coding,Eigen-phred_coding,Eigen-PC-raw_coding,Eigen-PC-phred_coding,GenoCanyon_score,integrated_fitCons_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score,LINSIGHT,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian,phyloP17way_primate,phastCons100way_vertebrate,phastCons30way_mammalian,phastCons17way_primate,SiPhy_29way_logOdds,1000Gp3_AF,1000Gp3_AFR_AF,1000Gp3_EUR_AF,1000Gp3_AMR_AF,1000Gp3_EAS_AF,1000Gp3_SAS_AF,TWINSUK_AF,ALSPAC_AF,UK10K_AF,ESP6500_AA_AF,ESP6500_EA_AF,ExAC_AF,ExAC_nonTCGA_AF,ExAC_nonpsych_AF,gnomAD_exomes_AF,gnomAD_genomes_AF,gnomAD_genomes_POPMAX_AF,P(HI),HIPred_score,GHIS,P(rec),GDI,GDI-Phred,LoFtool_score
count,5515.0,5515.0,5327.0,5364.0,5055.0,5055.0,5219.0,5219.0,5379.0,5378.0,5447.0,5447.0,5430.0,5447.0,5463.0,5023.0,5243.0,5177.0,5515.0,5515.0,5487.0,5333.0,5515.0,5515.0,5515.0,5515.0,5081.0,5165.0,5165.0,5165.0,5165.0,5515.0,5179.0,5179.0,5179.0,5179.0,20.0,5515.0,5515.0,5515.0,5515.0,5515.0,5515.0,5515.0,5515.0,405.0,405.0,405.0,405.0,405.0,405.0,413.0,413.0,413.0,771.0,771.0,2107.0,1970.0,1972.0,2493.0,2179.0,2179.0,1071.0,1057.0,995.0,970.0,1066.0,1066.0,966.0
mean,71903920.0,72186500.0,0.021593,0.032175,0.893316,0.827701,0.010699,0.060644,-3.52549,-4.869182,0.712253,0.809497,0.555218,0.822627,0.947211,1.148306,0.721079,0.791051,0.385344,0.39527,0.910965,0.930674,3.80732,27.156349,0.990399,0.928688,0.822104,0.664201,8.502959,0.603396,8.472333,0.941553,0.642712,0.630761,0.649706,0.637083,0.897749,4.707782,6.475314,1.007663,0.581716,0.964042,0.867314,0.820749,15.375034,0.003771,0.003874,0.004813,0.004374,0.002385,0.003558,0.004582,0.004688,0.004636,0.001969,0.002516,0.000806,0.000854,0.000836,0.000719,0.000852,0.001034,0.422047,0.600293,0.540527,0.417126,1162.701048,4.99487,0.16352
std,56342120.0,56273030.0,0.09115,0.105586,0.25817,0.300346,0.07764,0.181625,2.281173,2.535656,0.529573,0.23023,0.291229,0.172093,0.090037,0.815311,0.156388,0.231637,0.183825,0.218021,0.204748,0.104249,0.82911,3.910514,0.045217,0.138139,0.208275,0.367713,4.387142,0.349912,4.859548,0.209383,0.099387,0.088966,0.092004,0.087353,0.090374,1.533089,2.760329,0.403253,0.262563,0.165461,0.26969,0.293776,3.55079,0.024247,0.028942,0.03145,0.031624,0.015613,0.028948,0.032182,0.032159,0.03216,0.018705,0.022732,0.012103,0.012452,0.012171,0.011771,0.012437,0.012527,0.281691,0.189356,0.08297,0.302922,2082.322879,4.713407,0.235468
min,177012.0,218471.0,0.0,0.0,0.0,0.0,0.0,0.0,-12.4,-13.96,-1.2162,0.0,0.004008,0.011,0.067524,0.000372,0.178245,0.002016,-0.618487,-0.720656,0.00022,0.042096,-1.01677,0.002,0.132257,0.00206,0.008185,-1.83375,0.020124,-1.913915,0.020338,2e-06,0.006267,0.0,0.0,0.055017,0.694796,-12.2,-3.591,-4.103,-2.973,0.0,0.0,0.0,0.5551,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000132,0.0,0.0,8e-06,9e-06,1.1e-05,4e-06,7e-06,7e-06,0.03037,0.111908,0.373524,0.06846,1.23583,0.04285,0.000482
25%,31227400.0,29621080.0,0.0,0.0,0.981,0.828,0.0,0.0,-5.13,-6.6475,0.56435,0.72875,0.296715,0.755,0.941616,0.506211,0.619717,0.711488,0.28462,0.287217,0.954204,0.911709,3.474959,24.9,0.995597,0.93398,0.785817,0.522392,5.214213,0.467427,4.87534,0.999991,0.569682,0.573888,0.602189,0.564101,0.829837,4.42,4.676,1.026,0.599,1.0,0.922,0.7985,13.2376,0.0002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000132,0.0,0.000116,8e-06,9e-06,1.1e-05,8e-06,7e-06,1.6e-05,0.16989,0.476971,0.470743,0.15588,63.06397,1.623683,0.0188
50%,51807100.0,51889730.0,0.0,0.001,1.0,0.99,0.0,0.0,-3.55,-4.69,0.9551,0.9098,0.588175,0.885,0.979032,0.939675,0.749429,0.885532,0.419927,0.449974,0.993387,0.965003,3.918179,27.0,0.998385,0.9752,0.910104,0.752815,7.926615,0.688375,7.540222,1.0,0.693126,0.627178,0.658983,0.6365,0.922125,5.1,7.497,1.124,0.618,1.0,0.995,0.978,15.87,0.0002,0.0,0.0,0.0,0.0,0.0,0.00027,0.000259,0.000264,0.0,0.000116,2.5e-05,2.8e-05,3.3e-05,2e-05,2.1e-05,6.2e-05,0.36677,0.631864,0.545682,0.289425,335.72865,3.89213,0.0794
75%,109547300.0,109998900.0,0.003,0.012,1.0,0.999,1.2e-05,0.057257,-2.06,-3.19,1.0594,0.9741,0.82031,0.945,0.993468,1.73078,0.849673,0.958441,0.526209,0.556411,0.998603,0.988851,4.212508,29.6,0.999161,0.98837,0.947031,0.910378,11.32162,0.834481,11.224,1.0,0.706548,0.702456,0.709663,0.714379,0.978329,5.56,8.27,1.176,0.676,1.0,1.0,0.996,18.31755,0.000399,0.0,0.000994,0.001441,0.0,0.0,0.000539,0.000778,0.000529,0.000227,0.000349,7.4e-05,7.6e-05,8.8e-05,6e-05,7e-05,0.000232,0.71276,0.773793,0.615169,0.708207,1168.79164,6.50003,0.169
max,247424800.0,247588100.0,1.0,1.0,1.0,1.0,0.978301,4.79077,5.67,4.15,1.9566,0.9999,0.999342,1.0,0.999996,4.999644,0.968954,0.998937,0.743354,0.83,0.999986,0.9999,9.391629,49.0,0.999628,0.99982,0.993513,1.24247,28.28369,1.153886,35.05594,1.0,0.839682,0.958517,0.854111,0.830532,0.984146,6.17,10.003,1.312,0.756,1.0,1.0,1.0,22.6116,0.271565,0.352496,0.425447,0.40634,0.183532,0.432515,0.440399,0.423197,0.431632,0.341353,0.406461,0.3439,0.3437,0.3252,0.379106,0.333396,0.426707,0.99999,0.87535,0.710682,0.99087,17941.81819,28.59961,0.997


## Reconvertimos a numéricas aquellas columnas en las que las conversiones previas fallaron

In [13]:
result = []

def ConvertToNumericValue(col,value):
    if value != None:
        ### si es un valor numérico lo devolvemos
        if type(value) == int or type(value) == float: 
            return value
        ### si no
        else:
            ### comprobamos si es un valor nulo expresado con un punto
            if "." == str(value):
                return np.nan
            ### Si no, intentamos hacer casting a float
            else:
                try:
                    ### si funciona, devolvemos el valor
                    return float(value)
                ### si no, devolvemos el elemento original
                except:
                    if col not in result:
                        result.append(col)
                    if value not in result:
                        result.append(value)
                    return value
    ### si el elemento es nulo
    else:
        return np.nan     

Con la impresión de los valores, comprobamos si existen símbolos o patrones en los datos incorrectos.

In [14]:
for col in columns:
    df1[col] = df1[col].apply(lambda x: ConvertToNumericValue(col,x))
    df2[col] = df2[col].apply(lambda x: ConvertToNumericValue(col,x))    
print(result)    

['#chr', 'X', 'ref', 'G', 'C', 'A', 'T', 'alt', 'aaref', 'R', 'Q', 'V', 'E', 'L', 'P', 'S', 'H', 'D', 'M', 'F', 'I', 'W', 'N', 'Y', 'K', 'aaalt', 'hg19_chr', 'aapos', '258;120', '353;215', '375;237', '510;372', '554;416', '728;590', '852;714', '1026;888', '1088;950', '1135;997', '1289;1151', '1322;1184', '1429;1291', '1451;1313', '1514;1376', '1565;1427', '1666;1528', '1883;1749', '1909;1794', '2025;1910', '174;174', '220;248;220;220;200;46', '294;274;274', '249;229;229', '145;145;145;1', '111;111;111', '34;34;34;34', '477;476;477;476;285', '523;522;523;522;331', '534;533;534;533;342', '563;562;563;562;371', '571;570;571;570;379', '634;633;634;633;442', '765;764;765;764;573', '819;818;819;818;627', '824;823;824;823;632', '837;836;837;836;645', '1045;1044;1044;1044;852', '1102;1101;1101;1101;909', '765;764', '753;752', '740;739', '656;655', '618;617', '569;568', '544;543', '541;540', '497;497', '315;315', '171;171', '29;29', '84;84;84', '236;236;236', '357;357;357;142', '368;368;368;153

 - aapos está sin formatear
 - MutPred_score contiene un guión entre sus símbolos que hay que convertir a nulo.

In [15]:
def MutPred_score_Format(value):
    if value != None:
        ### si es un valor numérico lo devolvemos
        if type(value) == int or type(value) == float: 
            return value
        ### si no
        else:
            ### comprobamos si es un valor nulo expresado con un punto
            if "." == str(value) or '-' == str(value):
                return np.nan
            ### Si no, intentamos hacer casting a float
            else:
                try:
                    ### si funciona, devolvemos el valor
                    return float(value)
                ### si no, devolvemos el elemento original
                except:
                    return value
    ### si el elemento es nulo
    else:
        return np.nan     
    
df1['MutPred_score'] = df1['MutPred_score'].apply(lambda x: MutPred_score_Format(x))
df2['MutPred_score'] = df2['MutPred_score'].apply(lambda x: MutPred_score_Format(x))   

Para 'aapos', una opción interesante es dividir su valor en tantas filas como valores tenga, modificando únicamente ese parámetro.

In [16]:
def toList(x):
    if type(x)==str: 
        return x.split(";") 
    else: 
        return x

df1['aapos'] = df1['aapos'].apply(lambda x: toList(x))
df1 = df1.apply( pd.Series.explode )

df2['aapos'] = df2['aapos'].apply(lambda x: toList(x))
df2 = df2.apply( pd.Series.explode )

### Comprobamos los valores distintos de los datos no numéricos

In [17]:
df1.select_dtypes(exclude=np.number).info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39229 entries, 0 to 10895
Data columns (total 21 columns):
 #   Column                                                           Non-Null Count  Dtype 
---  ------                                                           --------------  ----- 
 0   #chr                                                             39229 non-null  object
 1   ref                                                              39229 non-null  object
 2   alt                                                              39229 non-null  object
 3   aaref                                                            39225 non-null  object
 4   aaalt                                                            39225 non-null  object
 5   hg19_chr                                                         39229 non-null  object
 6   aapos                                                            36675 non-null  object
 7   Aloft_prob_Tolerant                              

In [18]:
df2.select_dtypes(exclude=np.number).info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22834 entries, 0 to 5514
Data columns (total 21 columns):
 #   Column                                                           Non-Null Count  Dtype 
---  ------                                                           --------------  ----- 
 0   #chr                                                             22834 non-null  object
 1   ref                                                              22834 non-null  object
 2   alt                                                              22834 non-null  object
 3   aaref                                                            22796 non-null  object
 4   aaalt                                                            22796 non-null  object
 5   hg19_chr                                                         22834 non-null  object
 6   aapos                                                            21733 non-null  object
 7   Aloft_prob_Tolerant                               

### Comprobamos los valores distintos de los scores

In [19]:
result = [x for x in columns if 'score' in x]
df1[result].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39229 entries, 0 to 10895
Data columns (total 33 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   SIFT_score                38461 non-null  float64
 1   SIFT4G_score              37568 non-null  float64
 2   Polyphen2_HDIV_score      36266 non-null  float64
 3   Polyphen2_HVAR_score      36266 non-null  float64
 4   LRT_score                 34713 non-null  float64
 5   MutationTaster_score      39070 non-null  float64
 6   MutationAssessor_score    34539 non-null  float64
 7   FATHMM_score              38326 non-null  float64
 8   PROVEAN_score             38500 non-null  float64
 9   VEST4_score               39112 non-null  float64
 10  MetaSVM_score             39079 non-null  float64
 11  MetaLR_score              39079 non-null  float64
 12  M-CAP_score               12486 non-null  float64
 13  REVEL_score               39079 non-null  float64
 14  MutPre

MutPred_score apenas tiene valores; M-CAP_score, HIPred_score y LoFtool_score no tienen una cantidad muy alta de datos

In [20]:
df2[result].info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22834 entries, 0 to 5514
Data columns (total 33 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   SIFT_score                22282 non-null  float64
 1   SIFT4G_score              22371 non-null  float64
 2   Polyphen2_HDIV_score      21676 non-null  float64
 3   Polyphen2_HVAR_score      21676 non-null  float64
 4   LRT_score                 21998 non-null  float64
 5   MutationTaster_score      22777 non-null  float64
 6   MutationAssessor_score    20706 non-null  float64
 7   FATHMM_score              22375 non-null  float64
 8   PROVEAN_score             22404 non-null  float64
 9   VEST4_score               22577 non-null  float64
 10  MetaSVM_score             22652 non-null  float64
 11  MetaLR_score              22652 non-null  float64
 12  M-CAP_score               22531 non-null  float64
 13  REVEL_score               22652 non-null  float64
 14  MutPred

In [21]:
matching = [s for s in columns if "Gene damage prediction" in s]
for col in matching:
    print(df1[col].unique())
    print(df2[col].unique())

['Medium' nan 'High' 'Low']
[nan 'Medium' 'High' 'Low']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'High' 'Low']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'Low' 'High']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'Low' 'High']
['Medium' nan 'High']
[nan 'Medium' 'High' 'Low']
['Medium' nan 'High']
[nan 'Medium' 'High' 'Low']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'Low' 'High']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'High' 'Low']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'High' 'Low']
['Medium' nan 'High' 'Low']
[nan 'Medium' 'High' 'Low']


Las columnas Gene damage prediction están bien formateadas al mismo tipo de valores: 'Low', 'Medium' y 'High'.

In [22]:
matching = [s for s in columns if "ref" == s or "alt" == s]
for col in matching:
    print(df1[col].unique())
    print(df2[col].unique())

['G' 'C' 'A' 'T']
['C' 'G' 'T' 'A']
['C' 'T' 'G' 'A']
['T' 'A' 'C' 'G']


In [23]:
matching = [s for s in columns if "aaref" == s or "aaalt" == s]
for col in matching:
    print(df1[col].unique())
    print(df2[col].unique())

['R' 'T' 'Q' 'A' 'G' 'V' 'E' 'L' 'P' 'S' 'H' 'D' 'M' 'F' 'I' 'W' 'N' 'Y'
 'C' 'K' nan 'X']
['R' 'P' 'M' 'I' 'K' 'L' 'G' 'H' 'E' 'N' 'F' 'T' 'A' 'V' 'S' 'W' 'C' 'D'
 'Y' 'Q' nan 'X']
['P' 'I' 'R' 'S' 'M' 'V' 'N' 'F' 'L' 'W' 'C' 'T' 'H' 'K' 'D' 'G' 'E' 'A'
 'Q' 'Y' nan 'X']
['W' 'L' 'V' 'T' 'R' 'S' 'C' 'E' 'K' 'Q' 'P' 'G' 'M' 'Y' 'F' 'H' 'I' 'A'
 'D' 'N' 'X' nan]


Las columnas 'ref' y 'alt' también están bien formateadas (Valores adenina (A), citosina (C), guanina (G) y timina (T)).

Las columnas 'aaref' y 'aaalt' también están igualmente codificadas con las siguientes posibilidades:
['R' 'T' 'Q' 'A' 'G' 'V' 'E' 'L' 'P' 'S' 'H' 'D' 'M' 'F' 'I' 'W' 'N' 'Y' 'C' 'K' 'X'].

Vamos a comprobar el número de valores existentes para aapos y hg19_chr

In [24]:
print(len(df1['aapos'].unique()))
print(len(df2['aapos'].unique()))
print(len(df1['hg19_chr'].unique()))
print(len(df2['hg19_chr'].unique()))

4132
1729
23
23


In [25]:
print(df1['HIPred'].unique())
print(df2['HIPred'].unique())

['Y' nan 'N']
[nan 'Y' 'N']


presupondremos que 'Y' significa Si y 'N' significa No.

## Sustituímos en las columnas AF los nulos por el valor 0

In [26]:
result = [x for x in columns if x.endswith('_AF')]
for col in result:
    df1[col] = df1[col].fillna(0)
    df2[col] = df2[col].fillna(0)

## Comprobamos el porcentaje de nulos de cada columna

### Dataset 'Benign'

In [27]:
for col in columns:   
    print(col + " -> " + str(df1[col].isnull().sum()) + "/" + str(len(df1.index))+" (" + str(round((df1[col].isnull().sum()/len(df1.index)) * 100,2)) + ")%")

#chr -> 0/39229 (0.0)%
pos(1-based) -> 0/39229 (0.0)%
ref -> 0/39229 (0.0)%
alt -> 0/39229 (0.0)%
aaref -> 4/39229 (0.01)%
aaalt -> 4/39229 (0.01)%
hg19_chr -> 0/39229 (0.0)%
hg19_pos(1-based) -> 0/39229 (0.0)%
aapos -> 2554/39229 (6.51)%
SIFT_score -> 768/39229 (1.96)%
SIFT4G_score -> 1661/39229 (4.23)%
Polyphen2_HDIV_score -> 2963/39229 (7.55)%
Polyphen2_HVAR_score -> 2963/39229 (7.55)%
LRT_score -> 4516/39229 (11.51)%
LRT_Omega -> 4516/39229 (11.51)%
MutationTaster_score -> 159/39229 (0.41)%
MutationAssessor_score -> 4690/39229 (11.96)%
FATHMM_score -> 903/39229 (2.3)%
PROVEAN_score -> 729/39229 (1.86)%
VEST4_score -> 117/39229 (0.3)%
MetaSVM_score -> 150/39229 (0.38)%
MetaLR_score -> 150/39229 (0.38)%
M-CAP_score -> 26743/39229 (68.17)%
REVEL_score -> 150/39229 (0.38)%
MutPred_score -> 36213/39229 (92.31)%
MVP_score -> 10298/39229 (26.25)%
MPC_score -> 4670/39229 (11.9)%
PrimateAI_score -> 798/39229 (2.03)%
DEOGEN2_score -> 2841/39229 (7.24)%
BayesDel_addAF_score -> 17/39229 (0.04)

En los datos del dataset 'Benign', tenemos que las columnas que tienen un mayor número de nulos son:

- (aproximadamente 45%):

    - TWINSUK_AF
    - ALSPAC_AF 
    - UK10K_AF
    
    
- ( > 90 %):

    - Columnas relacionadas con el term. 'Interactions'
    - Columnas relacionadas con el term. 'P(HI)'
    - Columnas relacionadas con el term. 'HIPred'
    - Columnas relacionadas con el term. 'GHIS'
    - Columnas relacionadas con el term. 'P(rec)'
    - Columnas relacionadas con el term. 'GDI'
    - Aloft_prob_Tolerant -> 10894/10896 (99.98)%
    - Aloft_prob_Recessive -> 10894/10896 (99.98)%
    - Aloft_prob_Dominant -> 10894/10896 (99.98)%
    - Todas las predicciones de 'Gene damage prediction' 
    - LINSIGHT (99.86%)

Estas últimas directamente se descartan, ya que no podemos extraer nada de información de ellas.

### Dataset Pathogenic

In [28]:
for col in columns:   
    print(col + " -> " + str(df2[col].isnull().sum()) + "/"+str(len(df2.index))+" (" + str(round((df2[col].isnull().sum()/len(df2.index)) * 100,2)) + ")%")

#chr -> 0/22834 (0.0)%
pos(1-based) -> 0/22834 (0.0)%
ref -> 0/22834 (0.0)%
alt -> 0/22834 (0.0)%
aaref -> 38/22834 (0.17)%
aaalt -> 38/22834 (0.17)%
hg19_chr -> 0/22834 (0.0)%
hg19_pos(1-based) -> 0/22834 (0.0)%
aapos -> 1101/22834 (4.82)%
SIFT_score -> 552/22834 (2.42)%
SIFT4G_score -> 463/22834 (2.03)%
Polyphen2_HDIV_score -> 1158/22834 (5.07)%
Polyphen2_HVAR_score -> 1158/22834 (5.07)%
LRT_score -> 836/22834 (3.66)%
LRT_Omega -> 836/22834 (3.66)%
MutationTaster_score -> 57/22834 (0.25)%
MutationAssessor_score -> 2128/22834 (9.32)%
FATHMM_score -> 459/22834 (2.01)%
PROVEAN_score -> 430/22834 (1.88)%
VEST4_score -> 257/22834 (1.13)%
MetaSVM_score -> 182/22834 (0.8)%
MetaLR_score -> 182/22834 (0.8)%
M-CAP_score -> 303/22834 (1.33)%
REVEL_score -> 182/22834 (0.8)%
MutPred_score -> 5320/22834 (23.3)%
MVP_score -> 207/22834 (0.91)%
MPC_score -> 2115/22834 (9.26)%
PrimateAI_score -> 1064/22834 (4.66)%
DEOGEN2_score -> 579/22834 (2.54)%
BayesDel_addAF_score -> 0/22834 (0.0)%
BayesDel_noAF_

En los datos del dataset 'Pathogenic', tenemos que las columnas que tienen un mayor número de nulos son:

- ( > 90 %):

    - Columnas relacionadas con el term. 'Interactions'
    - Columnas relacionadas con el term. 'P(HI)'
    - Columnas relacionadas con el term. 'HIPred'
    - Columnas relacionadas con el term. 'GHIS'
    - Columnas relacionadas con el term. 'P(rec)'
    - Columnas relacionadas con el term. 'GDI'
    - Aloft_prob_Tolerant -> 10894/10896 (99.98)%
    - Aloft_prob_Recessive -> 10894/10896 (99.98)%
    - Aloft_prob_Dominant -> 10894/10896 (99.98)%
    - Todas las predicciones de 'Gene damage prediction' 
    - LoFtool_score (92.06)%

In [29]:
columns_to_remove = [x for x in columns if round((df2[x].isnull().sum()/len(df2.index)) * 100,2) > 85]
try:
    df1 = df1.drop(columns=columns_to_remove)
    df2 = df2.drop(columns=columns_to_remove)
    columns = [x for x in columns if x not in columns_to_remove]
except:
    # En caso de que ya hayan sido eliminadas
    pass

Vemos que el número de elementos en cada dataset es razonable excepto para la columna de  LINSIGHT, que apenas tiene valores; esta columna no nos aportará nada, por lo que tendremos que eliminarla también.

In [30]:
try:
    columns.remove("LINSIGHT")
    df1.pop("LINSIGHT")
    df2.pop("LINSIGHT")
    print("LINSIGHT Eliminado en ambos datasets")
except:
    # En caso de que ya hayan sido eliminadas
    pass

### Eliminamos los duplicados en ambos datasets (en caso de que haya)


In [31]:
df1 = df1.drop_duplicates()
df2 = df2.drop_duplicates()

In [32]:
df1.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22631 entries, 0 to 10895
Data columns (total 119 columns):
 #   Column                            Dtype  
---  ------                            -----  
 0   #chr                              object 
 1   pos(1-based)                      int64  
 2   ref                               object 
 3   alt                               object 
 4   aaref                             object 
 5   aaalt                             object 
 6   hg19_chr                          object 
 7   hg19_pos(1-based)                 int64  
 8   aapos                             object 
 9   SIFT_score                        float64
 10  SIFT4G_score                      float64
 11  Polyphen2_HDIV_score              float64
 12  Polyphen2_HVAR_score              float64
 13  LRT_score                         float64
 14  LRT_Omega                         float64
 15  MutationTaster_score              float64
 16  MutationAssessor_score            float

In [33]:
df2.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12167 entries, 0 to 5514
Data columns (total 119 columns):
 #   Column                            Dtype  
---  ------                            -----  
 0   #chr                              object 
 1   pos(1-based)                      int64  
 2   ref                               object 
 3   alt                               object 
 4   aaref                             object 
 5   aaalt                             object 
 6   hg19_chr                          object 
 7   hg19_pos(1-based)                 int64  
 8   aapos                             object 
 9   SIFT_score                        float64
 10  SIFT4G_score                      float64
 11  Polyphen2_HDIV_score              float64
 12  Polyphen2_HVAR_score              float64
 13  LRT_score                         float64
 14  LRT_Omega                         float64
 15  MutationTaster_score              float64
 16  MutationAssessor_score            float6

## Comparación del rango de valores entre los dos datasets

In [34]:
def GetColumnsWithMinAndMaxValuesInRange(df, a=0, b=1):
    tmp = df.describe().loc[['min','max']].T
    tmp = tmp[tmp['min'] >= a]
    tmp = tmp[tmp['max'] <= b]
    tmp = tmp.T
    return tmp

def GetColumnsWithMinAndMaxValuesOutOfRange(df, a=0, b=1):
    tmp = df.describe().loc[['min','max']].T
    tmp = tmp[tmp['min'] < a]
    tmp = tmp[tmp['max'] > b]
    tmp = tmp.T
    return tmp

In [35]:
GetColumnsWithMinAndMaxValuesInRange(df1,0,1)

Unnamed: 0,SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,LRT_score,MutationTaster_score,VEST4_score,MetaLR_score,M-CAP_score,REVEL_score,MutPred_score,MVP_score,PrimateAI_score,DEOGEN2_score,ClinPred_score,LIST-S2_score,DANN_score,fathmm-MKL_coding_score,fathmm-XF_coding_score,GenoCanyon_score,integrated_fitCons_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score,phastCons100way_vertebrate,phastCons30way_mammalian,phastCons17way_primate,1000Gp3_AF,1000Gp3_AFR_AF,1000Gp3_EUR_AF,1000Gp3_AMR_AF,1000Gp3_EAS_AF,1000Gp3_SAS_AF,TWINSUK_AF,ALSPAC_AF,UK10K_AF,ESP6500_AA_AF,ESP6500_EA_AF,ExAC_AF,ExAC_Adj_AF,ExAC_AFR_AF,ExAC_AMR_AF,ExAC_EAS_AF,ExAC_FIN_AF,ExAC_NFE_AF,ExAC_SAS_AF,ExAC_nonTCGA_AF,ExAC_nonTCGA_Adj_AF,ExAC_nonTCGA_AFR_AF,ExAC_nonTCGA_AMR_AF,ExAC_nonTCGA_EAS_AF,ExAC_nonTCGA_FIN_AF,ExAC_nonTCGA_NFE_AF,ExAC_nonTCGA_SAS_AF,ExAC_nonpsych_AF,ExAC_nonpsych_Adj_AF,ExAC_nonpsych_AFR_AF,ExAC_nonpsych_AMR_AF,ExAC_nonpsych_EAS_AF,ExAC_nonpsych_FIN_AF,ExAC_nonpsych_NFE_AF,ExAC_nonpsych_SAS_AF,gnomAD_exomes_AF,gnomAD_exomes_AFR_AF,gnomAD_exomes_AMR_AF,gnomAD_exomes_ASJ_AF,gnomAD_exomes_EAS_AF,gnomAD_exomes_FIN_AF,gnomAD_exomes_NFE_AF,gnomAD_exomes_SAS_AF,gnomAD_exomes_POPMAX_AF,gnomAD_exomes_controls_AF,gnomAD_exomes_controls_AFR_AF,gnomAD_exomes_controls_AMR_AF,gnomAD_exomes_controls_ASJ_AF,gnomAD_exomes_controls_EAS_AF,gnomAD_exomes_controls_FIN_AF,gnomAD_exomes_controls_NFE_AF,gnomAD_exomes_controls_SAS_AF,gnomAD_exomes_controls_POPMAX_AF,gnomAD_genomes_AF,gnomAD_genomes_AFR_AF,gnomAD_genomes_AMR_AF,gnomAD_genomes_ASJ_AF,gnomAD_genomes_EAS_AF,gnomAD_genomes_FIN_AF,gnomAD_genomes_NFE_AF,gnomAD_genomes_AMI_AF,gnomAD_genomes_SAS_AF,gnomAD_genomes_POPMAX_AF
min,0.0,0.0,0.0,0.0,0.0,1e-37,0.002,0.0,0.00095,0.0,0.049,0.013882,0.154256,0.000139,1e-05,0.0024,0.036852,0.0,0.0034,1e-06,0.001892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,0.997,0.9999,0.994657,0.992,1.0,1.0,0.973357,0.988043,0.999961,0.999865,0.99963,0.99907,0.987401,1.0,0.839682,0.958517,0.858454,0.836244,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999965,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [36]:
GetColumnsWithMinAndMaxValuesInRange(df2,0,1)

Unnamed: 0,SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,LRT_score,MutationTaster_score,VEST4_score,MetaLR_score,M-CAP_score,REVEL_score,MutPred_score,MVP_score,PrimateAI_score,DEOGEN2_score,ClinPred_score,LIST-S2_score,DANN_score,fathmm-MKL_coding_score,fathmm-XF_coding_score,GenoCanyon_score,integrated_fitCons_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score,phastCons100way_vertebrate,phastCons30way_mammalian,phastCons17way_primate,1000Gp3_AF,1000Gp3_AFR_AF,1000Gp3_EUR_AF,1000Gp3_AMR_AF,1000Gp3_EAS_AF,1000Gp3_SAS_AF,TWINSUK_AF,ALSPAC_AF,UK10K_AF,ESP6500_AA_AF,ESP6500_EA_AF,ExAC_AF,ExAC_Adj_AF,ExAC_AFR_AF,ExAC_AMR_AF,ExAC_EAS_AF,ExAC_FIN_AF,ExAC_NFE_AF,ExAC_SAS_AF,ExAC_nonTCGA_AF,ExAC_nonTCGA_Adj_AF,ExAC_nonTCGA_AFR_AF,ExAC_nonTCGA_AMR_AF,ExAC_nonTCGA_EAS_AF,ExAC_nonTCGA_FIN_AF,ExAC_nonTCGA_NFE_AF,ExAC_nonTCGA_SAS_AF,ExAC_nonpsych_AF,ExAC_nonpsych_Adj_AF,ExAC_nonpsych_AFR_AF,ExAC_nonpsych_AMR_AF,ExAC_nonpsych_EAS_AF,ExAC_nonpsych_FIN_AF,ExAC_nonpsych_NFE_AF,ExAC_nonpsych_SAS_AF,gnomAD_exomes_AF,gnomAD_exomes_AFR_AF,gnomAD_exomes_AMR_AF,gnomAD_exomes_ASJ_AF,gnomAD_exomes_EAS_AF,gnomAD_exomes_FIN_AF,gnomAD_exomes_NFE_AF,gnomAD_exomes_SAS_AF,gnomAD_exomes_POPMAX_AF,gnomAD_exomes_controls_AF,gnomAD_exomes_controls_AFR_AF,gnomAD_exomes_controls_AMR_AF,gnomAD_exomes_controls_ASJ_AF,gnomAD_exomes_controls_EAS_AF,gnomAD_exomes_controls_FIN_AF,gnomAD_exomes_controls_NFE_AF,gnomAD_exomes_controls_SAS_AF,gnomAD_exomes_controls_POPMAX_AF,gnomAD_genomes_AF,gnomAD_genomes_AFR_AF,gnomAD_genomes_AMR_AF,gnomAD_genomes_ASJ_AF,gnomAD_genomes_EAS_AF,gnomAD_genomes_FIN_AF,gnomAD_genomes_NFE_AF,gnomAD_genomes_AMI_AF,gnomAD_genomes_SAS_AF,gnomAD_genomes_POPMAX_AF
min,0.0,0.0,0.0,0.0,0.0,1.30379e-19,0.015,0.0,0.004008,0.011,0.096,0.067524,0.178245,0.002016,0.00022,0.042096,0.132257,0.00206,0.008185,2e-06,0.006267,0.0,0.0,0.055017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,0.978301,1.0,1.0,0.9999,0.999342,1.0,1.0,0.999996,0.968954,0.998937,0.999986,0.9999,0.999628,0.99982,0.993513,1.0,0.839682,0.958517,0.854111,0.830532,1.0,1.0,1.0,0.271565,0.352496,0.425447,0.40634,0.183532,0.432515,0.440399,0.423197,0.431632,0.341353,0.406461,0.3439,0.453,0.3435,0.5108,0.2504,0.5628,0.5049,0.4614,0.3437,0.4561,0.34,0.5152,0.2491,0.5628,0.5106,0.4615,0.3252,0.4285,0.3435,0.511,0.2465,0.5639,0.4748,0.4613,0.379106,0.340296,0.413197,0.28876,0.181041,0.475201,0.424837,0.431612,0.431612,0.384973,0.341671,0.414884,0.29817,0.17578,0.471754,0.426679,0.432674,0.432674,0.333396,0.339757,0.396493,0.29529,0.192702,0.481365,0.434049,0.378619,0.426707,0.426707


In [37]:
GetColumnsWithMinAndMaxValuesOutOfRange(df1,0,1)

Unnamed: 0,MutationAssessor_score,FATHMM_score,PROVEAN_score,MetaSVM_score,CADD_raw_hg19,Eigen-raw_coding,Eigen-PC-raw_coding,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian
min,-4.235,-12.4,-12.58,-2.0058,-2.771251,-2.793459,-2.869872,-12.3,-8.218,-6.245
max,4.925,6.49,13.57,1.3377,7.532029,1.131697,1.048397,6.17,10.003,1.312


In [38]:
GetColumnsWithMinAndMaxValuesOutOfRange(df2,0,1)

Unnamed: 0,MutationAssessor_score,FATHMM_score,PROVEAN_score,MetaSVM_score,CADD_raw_hg19,Eigen-raw_coding,Eigen-PC-raw_coding,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian
min,-2.58,-12.4,-13.96,-1.2162,-1.01677,-1.83375,-1.913915,-12.2,-3.591,-4.103
max,5.72,5.67,4.15,1.9566,9.391629,1.24247,1.153886,6.17,10.003,1.312


Algunas columnas pueden ser convertidas a valores de patogénico o benigno en función de su valor numérico.
Por tanto, procedo a realizar la creación de una columna numérica que indicará con un 1 si es patógena y un 0 si no lo es según los resultados de estos modelos. Los valores nulos seguirán siendo nulos.

### Creación de las columnas FLAG

In [39]:
def isPatho(x,umbral,sign):
    if x != np.nan and x != None:
        if sign == ">":
            if x > umbral:
                return 1
            else:
                return 0
        else: 
            if sign == ">=":
                if x >= umbral:
                    return 1
                else:
                    return 0
            else: 
                if sign == "<":
                    if x < umbral:
                        return 1
                    else:
                        return 0    
                else: 
                    if sign == "<=":
                        if x <= umbral:
                            return 1
                        else:
                            return 0  
                    else:
                        raise Exception
                
    else:
        return -1

In [40]:
df1['FATHMM_score_FLAG'] = df1['FATHMM_score'].apply(lambda x: isPatho(x,1.5,"<="))
df2['FATHMM_score_FLAG'] = df2['FATHMM_score'].apply(lambda x: isPatho(x,1.5,"<="))

df1['fathmm-MKL_coding_score_FLAG'] = df1['fathmm-MKL_coding_score'].apply(lambda x: isPatho(x,0.5,"<="))
df2['fathmm-MKL_coding_score_FLAG'] = df2['fathmm-MKL_coding_score'].apply(lambda x: isPatho(x,0.5,"<="))

df1['fathmm-XF_coding_score_FLAG'] = df1['fathmm-XF_coding_score'].apply(lambda x: isPatho(x,0.5,"<="))
df2['fathmm-XF_coding_score_FLAG'] = df2['fathmm-XF_coding_score'].apply(lambda x: isPatho(x,0.5,"<="))

df1['BayesDel_addAF_FLAG'] = df1['BayesDel_addAF_score'].apply(lambda x: isPatho(x,0.0692655,"<"))
df2['BayesDel_addAF_FLAG'] = df2['BayesDel_addAF_score'].apply(lambda x: isPatho(x,0.0692655,"<"))
   
df1['BayesDel_noAF_FLAG'] = df1['BayesDel_noAF_score'].apply(lambda x: isPatho(x,-0.0570105,"<"))
df2['BayesDel_noAF_FLAG'] = df2['BayesDel_noAF_score'].apply(lambda x: isPatho(x,-0.0570105,"<"))  

df1['integrated_fitCons_score_FLAG'] = df1['integrated_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))
df2['integrated_fitCons_score_FLAG'] = df2['integrated_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))

df1['GM12878_fitCons_score_score_FLAG'] = df1['GM12878_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))
df2['GM12878_fitCons_score_score_FLAG'] = df2['GM12878_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))

df1['H1-hESC_fitCons_score_FLAG'] = df1['H1-hESC_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))
df2['H1-hESC_fitCons_score_FLAG'] = df2['H1-hESC_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))

df1['HUVEC_fitCons_score_FLAG'] = df1['HUVEC_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))
df2['HUVEC_fitCons_score_FLAG'] = df2['HUVEC_fitCons_score'].apply(lambda x: isPatho(x,0.7,">"))
    
df1['LRT_score_FLAG'] = df1['LRT_score'].apply(lambda x: isPatho(x, 0.001,"<="))
df2['LRT_score_FLAG'] = df2['LRT_score'].apply(lambda x: isPatho(x, 0.001,"<=")) 
  
df1['MutationAssessor_score_FLAG'] = df1['MutationAssessor_score'].apply(lambda x: isPatho(x, 1.9,">"))
df2['MutationAssessor_score_FLAG'] = df2['MutationAssessor_score'].apply(lambda x: isPatho(x, 1.9,">"))  

df1['MutationTaster_score_FLAG'] = df1['MutationTaster_score'].apply(lambda x: isPatho(x,  0.5,">"))
df2['MutationTaster_score_FLAG'] = df2['MutationTaster_score'].apply(lambda x: isPatho(x,  0.5,">"))

df1['Polyphen2_HDIV_score_FLAG'] = df1['Polyphen2_HDIV_score'].apply(lambda x: isPatho(x,0.453,">="))
df2['Polyphen2_HDIV_score_FLAG'] = df2['Polyphen2_HDIV_score'].apply(lambda x: isPatho(x,0.453,">="))

df1['Polyphen2_HVAR_score_FLAG'] = df1['Polyphen2_HVAR_score'].apply(lambda x: isPatho(x,0.447,">="))
df2['Polyphen2_HVAR_score_FLAG'] = df2['Polyphen2_HVAR_score'].apply(lambda x: isPatho(x,0.447,">="))    
   
df1['PROVEAN_score_FLAG'] = df1['PROVEAN_score'].apply(lambda x: isPatho(x,-2.5,"<="))
df2['PROVEAN_score_FLAG'] = df2['PROVEAN_score'].apply(lambda x: isPatho(x,-2.5,"<="))

df1['SIFT_score_FLAG'] = df1['SIFT_score'].apply(lambda x: isPatho(x, 0.05,"<="))
df2['SIFT_score_FLAG'] = df2['SIFT_score'].apply(lambda x: isPatho(x, 0.05,"<="))
df1['SIFT4G_score_FLAG'] = df1['SIFT4G_score'].apply(lambda x: isPatho(x, 0.05,"<="))
df2['SIFT4G_score_FLAG'] = df2['SIFT4G_score'].apply(lambda x: isPatho(x, 0.05,"<="))    

df1['VEST4_score_FLAG'] = df1['VEST4_score'].apply(lambda x: isPatho(x, 0.05,">="))
df2['VEST4_score_FLAG'] = df2['VEST4_score'].apply(lambda x: isPatho(x, 0.05,">=")) 

df1['GERP++_RS_FLAG'] = df1['GERP++_RS'].apply(lambda x: isPatho(x, 2,">="))
df2['GERP++_RS_FLAG'] = df2['GERP++_RS'].apply(lambda x: isPatho(x, 2,">="))

df1['DEOGEN2_score_FLAG'] = df1['DEOGEN2_score'].apply(lambda x: isPatho(x, 0.5,">="))
df2['DEOGEN2_score_FLAG'] = df2['DEOGEN2_score'].apply(lambda x: isPatho(x, 0.5,">="))

df1['phastCons100way_vertebrate_FLAG'] = df1['phastCons100way_vertebrate'].apply(lambda x: isPatho(x, 0.999,">"))
df2['phastCons100way_vertebrate_FLAG'] = df2['phastCons100way_vertebrate'].apply(lambda x: isPatho(x, 0.999,">"))

df1['phastCons30way_mammalian_FLAG'] = df1['phastCons30way_mammalian'].apply(lambda x: isPatho(x, 0.999,">"))
df2['phastCons30way_mammalian_FLAG'] = df2['phastCons30way_mammalian'].apply(lambda x: isPatho(x, 0.999,">"))

df1['phastCons17way_primate_FLAG'] = df1['phastCons17way_primate'].apply(lambda x: isPatho(x, 0.999,">"))
df2['phastCons17way_primate_FLAG'] = df2['phastCons17way_primate'].apply(lambda x: isPatho(x, 0.999,">")) 

df1['phyloP100way_vertebrate_FLAG'] = df1['phyloP100way_vertebrate'].apply(lambda x: isPatho(x,2,">"))
df2['phyloP100way_vertebrate_FLAG'] = df2['phyloP100way_vertebrate'].apply(lambda x: isPatho(x,2,">"))

df1['phyloP30way_mammalian_FLAG'] = df1['phyloP30way_mammalian'].apply(lambda x: isPatho(x,2,">"))
df2['phyloP30way_mammalian_FLAG'] = df2['phyloP30way_mammalian'].apply(lambda x: isPatho(x,2,">"))

df1['phyloP17way_primate_FLAG'] = df1['phyloP17way_primate'].apply(lambda x: isPatho(x,2,">"))
df2['phyloP17way_primate_FLAG'] = df2['phyloP17way_primate'].apply(lambda x:isPatho(x,2,">"))

df1['SiPhy_29way_logOdds_FLAG'] = df1['SiPhy_29way_logOdds'].apply(lambda x: isPatho(x,12,">="))
df2['SiPhy_29way_logOdds_FLAG'] = df2['SiPhy_29way_logOdds'].apply(lambda x: isPatho(x,12,">="))    

df1['CADD_raw_hg19_FLAG'] = df1['CADD_raw_hg19'].apply(lambda x: isPatho(x,20,">"))
df2['CADD_raw_hg19_FLAG'] = df2['CADD_raw_hg19'].apply(lambda x: isPatho(x,20,">"))
df1['CADD_phred_hg19_FLAG'] = df1['CADD_phred_hg19'].apply(lambda x: isPatho(x,20,">"))
df2['CADD_phred_hg19_FLAG'] = df2['CADD_phred_hg19'].apply(lambda x: isPatho(x,20,">"))

df1['DANN_score_FLAG'] = df1['DANN_score'].apply(lambda x: isPatho(x,0.99,">="))
df2['DANN_score_FLAG'] = df2['DANN_score'].apply(lambda x: isPatho(x,0.99,">="))   

df1['GenoCanyon_score_FLAG'] = df1['GenoCanyon_score'].apply(lambda x: isPatho(x,0.999,">"))
df2['GenoCanyon_score_FLAG'] = df2['GenoCanyon_score'].apply(lambda x: isPatho(x,0.999,">"))
    
df1['MetaLR_score_FLAG'] = df1['MetaLR_score'].apply(lambda x: isPatho(x,0.5,">"))
df2['MetaLR_score_FLAG'] = df2['MetaLR_score'].apply(lambda x: isPatho(x,0.5,">"))    

df1['M-CAP_score_FLAG'] = df1['M-CAP_score'].apply(lambda x: isPatho(x,0.025,">"))
df2['M-CAP_score_FLAG'] = df2['M-CAP_score'].apply(lambda x: isPatho(x,0.025,">"))

df1['MetaSVM_score_FLAG'] = df1['MetaSVM_score'].apply(lambda x: isPatho(x,0,">"))
df2['MetaSVM_score_FLAG'] = df2['MetaSVM_score'].apply(lambda x: isPatho(x,0,">"))

df1['LIST-S2_score_FLAG'] = df1['LIST-S2_score'].apply(lambda x: isPatho(x,0.85,">="))
df2['LIST-S2_score_FLAG'] = df2['LIST-S2_score'].apply(lambda x: isPatho(x,0.85,">="))

Para algunas columnas, como vimos antes, podemos invertir sus valores; es decir, si su escala trabaja desde -10 a 10, los valores más altos pasarán a ser los más bajos y a la inversa.

Para ello, usaremos la ecuación de una recta entre dos puntos: 
Sea C una columna de datos con valores mínimos xMin, y máximo xMax, a la cúal queremos cambiar de escala a los valores definidos entre xMin' y xMax'.

Definimos a la recta A como la recta definida por los valores mínimo xMin y máximo xMax de una columna C para los valores X de sus puntos, y por los valores mínimo xMin' y máximo xMax' de los valores a los que queremos convertirlos, para los valores Y de sus puntos. De tal forma que, cada punto indica con su coordenada X el valor que toma en la escala inicial, y con su coordenada Y su valor en la nueva escala.
De esta forma, podemos utilizar la ecuación de una recta entre dos puntos para hallar el valor de dicha Y a partir de un valor X, sabiendo que dos puntos de la recta son (xMin, xMin') y (xMax, xMax').

In [41]:
def equationBetweenTwoPoints(x1, x2, y1, y2, x):
    if (x2 - x1 != 0):
        return (((x - x1)/(x2-x1))*(y2-y1)) + y1

def reverseScale(x, xMin, xMax):
    return equationBetweenTwoPoints(xMin, xMax, xMax, xMin, x)

def normalizeValues(x, xMin, xMax):
    return equationBetweenTwoPoints(xMin, xMax, 0, 1, x)

### Normalización de los rangos en los datasets

In [42]:
min_f_1, max_f_1 = df1['FATHMM_score'].describe()[['min','max']].tolist()
min_p_1, max_p_1 = df1['PROVEAN_score'].describe()[['min','max']].tolist()
min_s_1, max_s_1 = df1['SIFT_score'].describe()[['min','max']].tolist()
min_s4_1, max_s4_1 = df1['SIFT4G_score'].describe()[['min','max']].tolist()
min_lrt_1, max_lrt_1 = df1['LRT_score'].describe()[['min','max']].tolist()

df1['FATHMM_score_reverted'] = df1['FATHMM_score'].apply(lambda x: reverseScale(x, min_f_1, max_f_1))
df1['PROVEAN_score_reverted'] = df1['PROVEAN_score'].apply(lambda x: reverseScale(x, min_p_1, max_p_1))
df1['SIFT_score_reverted'] = df1['SIFT_score'].apply(lambda x: reverseScale(x, min_s_1, max_s_1))
df1['SIFT4G_score_reverted'] = df1['SIFT4G_score'].apply(lambda x: reverseScale(x, min_s4_1, max_s4_1))
df1['LRT_score_reverted'] = df1['LRT_score'].apply(lambda x: reverseScale(x, min_lrt_1, max_lrt_1))


In [43]:
min_f_2, max_f_2 = df2['FATHMM_score'].describe()[['min','max']].tolist()
min_p_2, max_p_2 = df2['PROVEAN_score'].describe()[['min','max']].tolist()
min_s_2, max_s_2 = df2['SIFT_score'].describe()[['min','max']].tolist()
min_s4_2, max_s4_2 = df2['SIFT4G_score'].describe()[['min','max']].tolist()
min_lrt_2, max_lrt_2 = df1['LRT_score'].describe()[['min','max']].tolist()

df2['FATHMM_score_reverted'] = df2['FATHMM_score'].apply(lambda x: reverseScale(x, min_f_2, max_f_2))
df2['PROVEAN_score_reverted'] = df2['PROVEAN_score'].apply(lambda x: reverseScale(x, min_p_2, max_p_2))
df2['SIFT_score_reverted'] = df2['SIFT_score'].apply(lambda x: reverseScale(x, min_s_2, max_s_2))
df2['SIFT4G_score_reverted'] = df2['SIFT4G_score'].apply(lambda x: reverseScale(x, min_s4_2, max_s4_2))
df2['LRT_score_reverted'] = df2['LRT_score'].apply(lambda x: reverseScale(x, min_lrt_2, max_lrt_2))

In [44]:
### PROVEAN score = -14 to 14.

df1['PROVEAN_score_normalized'] = df1['PROVEAN_score_reverted'].apply(lambda x: normalizeValues(x, -14, 14))
df2['PROVEAN_score_normalized'] = df2['PROVEAN_score_reverted'].apply(lambda x: normalizeValues(x, -14, 14))

In [45]:
### FATHMM_score = -16.13 to 10.64

df1['FATHMM_score_normalized'] = df1['FATHMM_score_reverted'].apply(lambda x: normalizeValues(x, -16.13, 10.64))
df2['FATHMM_score_normalized'] = df2['FATHMM_score_reverted'].apply(lambda x: normalizeValues(x, -16.13, 10.64))

In [46]:
### SIFT_score  = 0 to 1.

df1['SIFT_score_normalized'] = df1['SIFT_score_reverted'].apply(lambda x: normalizeValues(x, 0, 1))
df2['SIFT_score_normalized'] = df2['SIFT_score_reverted'].apply(lambda x: normalizeValues(x, 0, 1))

In [47]:
### SIFT_score 4G =  0 to 1.

df1['SIFT4G_score_normalized'] = df1['SIFT4G_score_reverted'].apply(lambda x: normalizeValues(x, 0, 1))
df2['SIFT4G_score_normalized'] = df2['SIFT4G_score_reverted'].apply(lambda x: normalizeValues(x, 0, 1))

In [48]:
del df1['FATHMM_score_reverted']
del df2['FATHMM_score_reverted']
del df1['PROVEAN_score_reverted']
del df2['PROVEAN_score_reverted']
del df1['SIFT_score_reverted']
del df2['SIFT_score_reverted']
del df1['SIFT4G_score_reverted']
del df2['SIFT4G_score_reverted']

In [49]:
### MutationAssessor_score = -5.2 to 6.5.

df1['MutationAssessor_score_normalized'] = df1['MutationAssessor_score'].apply(lambda x: normalizeValues(x, -5.2, 6.5))
df2['MutationAssessor_score_normalized'] = df2['MutationAssessor_score'].apply(lambda x: normalizeValues(x, -5.2, 6.5))

In [50]:
### LRT_Omega: 0 to 7780.54

df1['LRT_Omega_normalized'] = df1['LRT_Omega'].apply(lambda x: normalizeValues(x, 0, 7780.54))
df2['LRT_Omega_normalized'] = df2['LRT_Omega'].apply(lambda x: normalizeValues(x, 0, 7780.54))

In [51]:
### MetaSVM_score: -2 to 3.

df1['MetaSVM_score_normalized'] = df1['MetaSVM_score'].apply(lambda x: normalizeValues(x, -2, 3))
df2['MetaSVM_score_normalized'] = df2['MetaSVM_score'].apply(lambda x: normalizeValues(x, -2, 3))

In [52]:
### MPC_score: 0 to 5.

df1['MPC_score_normalized'] = df1['MPC_score'].apply(lambda x: normalizeValues(x, 0, 5))
df2['MPC_score_normalized'] = df2['MPC_score'].apply(lambda x: normalizeValues(x, 0, 5))

In [53]:
### GERP++_RS:  -12.3 to 6.17.

df1['GERP++_RS_normalized'] = df1['GERP++_RS'].apply(lambda x: normalizeValues(x, -12.3, 6.17))
df2['GERP++_RS_normalized'] = df2['GERP++_RS'].apply(lambda x: normalizeValues(x, -12.3, 6.17))

In [54]:
### phyloP100way_vertebrate: -20.0 to 10.003 

df1['phyloP100way_vertebrate_normalized'] = df1['phyloP100way_vertebrate'].apply(lambda x: normalizeValues(x, -20.0, 10.003))
df2['phyloP100way_vertebrate_normalized'] = df2['phyloP100way_vertebrate'].apply(lambda x: normalizeValues(x, -20.0, 10.003))

In [55]:
### phyloP30way_mammalian: -20 to 1.312

df1['phyloP30way_mammalian_normalized'] = df1['phyloP30way_mammalian'].apply(lambda x: normalizeValues(x, -20.0, 1.312))
df2['phyloP30way_mammalian_normalized'] = df2['phyloP30way_mammalian'].apply(lambda x: normalizeValues(x, -20.0, 1.312))

In [56]:
### phyloP17way_primate: -13.362 to 0.756

df1['phyloP17way_primate_normalized'] = df1['phyloP17way_primate'].apply(lambda x: normalizeValues(x, -13.362, 0.756))
df2['phyloP17way_primate_normalized'] = df2['phyloP17way_primate'].apply(lambda x: normalizeValues(x, -13.362, 0.756))

In [57]:
### SiPhy_29way_logOdds: 0 to 37.9718

df1['SiPhy_29way_logOdds_normalized'] = df1['SiPhy_29way_logOdds'].apply(lambda x: normalizeValues(x, 0, 37.9718))
df2['SiPhy_29way_logOdds_normalized'] = df2['SiPhy_29way_logOdds'].apply(lambda x: normalizeValues(x, 0, 37.9718))

In [58]:
### bStatistic: 0 to 1000

df1['bStatistic_normalized'] = df1['bStatistic'].apply(lambda x: normalizeValues(x, 0, 1000))
df2['bStatistic_normalized'] = df2['bStatistic'].apply(lambda x: normalizeValues(x, 0, 1000))

In [59]:
### BayesDel_addAF: -1.11707 to 0.750927

df1['BayesDel_addAF_score_normalized'] = df1['BayesDel_addAF_score'].apply(lambda x: normalizeValues(x, -1.11707, 0.750927))
df2['BayesDel_addAF_score_normalized'] = df2['BayesDel_addAF_score'].apply(lambda x: normalizeValues(x, -1.11707, 0.750927))

In [60]:
### BayesDel_noAF_score: -1.31914 to 0.840878.

df1['BayesDel_noAF_score_normalized'] = df1['BayesDel_noAF_score'].apply(lambda x: normalizeValues(x, -1.31914, 0.840878))
df2['BayesDel_noAF_score_normalized'] = df2['BayesDel_noAF_score'].apply(lambda x: normalizeValues(x, -1.31914, 0.840878))

In [61]:
GetColumnsWithMinAndMaxValuesInRange(df1,0,1)

Unnamed: 0,SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,LRT_score,MutationTaster_score,VEST4_score,MetaLR_score,M-CAP_score,REVEL_score,MutPred_score,MVP_score,PrimateAI_score,DEOGEN2_score,ClinPred_score,LIST-S2_score,DANN_score,fathmm-MKL_coding_score,fathmm-XF_coding_score,GenoCanyon_score,integrated_fitCons_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score,phastCons100way_vertebrate,phastCons30way_mammalian,phastCons17way_primate,1000Gp3_AF,1000Gp3_AFR_AF,1000Gp3_EUR_AF,1000Gp3_AMR_AF,1000Gp3_EAS_AF,1000Gp3_SAS_AF,TWINSUK_AF,ALSPAC_AF,UK10K_AF,ESP6500_AA_AF,ESP6500_EA_AF,ExAC_AF,ExAC_Adj_AF,ExAC_AFR_AF,ExAC_AMR_AF,ExAC_EAS_AF,ExAC_FIN_AF,ExAC_NFE_AF,ExAC_SAS_AF,ExAC_nonTCGA_AF,ExAC_nonTCGA_Adj_AF,ExAC_nonTCGA_AFR_AF,ExAC_nonTCGA_AMR_AF,ExAC_nonTCGA_EAS_AF,ExAC_nonTCGA_FIN_AF,ExAC_nonTCGA_NFE_AF,ExAC_nonTCGA_SAS_AF,ExAC_nonpsych_AF,ExAC_nonpsych_Adj_AF,ExAC_nonpsych_AFR_AF,ExAC_nonpsych_AMR_AF,ExAC_nonpsych_EAS_AF,ExAC_nonpsych_FIN_AF,ExAC_nonpsych_NFE_AF,ExAC_nonpsych_SAS_AF,gnomAD_exomes_AF,gnomAD_exomes_AFR_AF,gnomAD_exomes_AMR_AF,gnomAD_exomes_ASJ_AF,gnomAD_exomes_EAS_AF,gnomAD_exomes_FIN_AF,gnomAD_exomes_NFE_AF,gnomAD_exomes_SAS_AF,gnomAD_exomes_POPMAX_AF,gnomAD_exomes_controls_AF,gnomAD_exomes_controls_AFR_AF,gnomAD_exomes_controls_AMR_AF,gnomAD_exomes_controls_ASJ_AF,gnomAD_exomes_controls_EAS_AF,gnomAD_exomes_controls_FIN_AF,gnomAD_exomes_controls_NFE_AF,gnomAD_exomes_controls_SAS_AF,gnomAD_exomes_controls_POPMAX_AF,gnomAD_genomes_AF,gnomAD_genomes_AFR_AF,gnomAD_genomes_AMR_AF,gnomAD_genomes_ASJ_AF,gnomAD_genomes_EAS_AF,gnomAD_genomes_FIN_AF,gnomAD_genomes_NFE_AF,gnomAD_genomes_AMI_AF,gnomAD_genomes_SAS_AF,gnomAD_genomes_POPMAX_AF,FATHMM_score_FLAG,fathmm-MKL_coding_score_FLAG,fathmm-XF_coding_score_FLAG,BayesDel_addAF_FLAG,BayesDel_noAF_FLAG,integrated_fitCons_score_FLAG,GM12878_fitCons_score_score_FLAG,H1-hESC_fitCons_score_FLAG,HUVEC_fitCons_score_FLAG,LRT_score_FLAG,MutationAssessor_score_FLAG,MutationTaster_score_FLAG,Polyphen2_HDIV_score_FLAG,Polyphen2_HVAR_score_FLAG,PROVEAN_score_FLAG,SIFT_score_FLAG,SIFT4G_score_FLAG,VEST4_score_FLAG,GERP++_RS_FLAG,DEOGEN2_score_FLAG,phastCons100way_vertebrate_FLAG,phastCons30way_mammalian_FLAG,phastCons17way_primate_FLAG,phyloP100way_vertebrate_FLAG,phyloP30way_mammalian_FLAG,phyloP17way_primate_FLAG,SiPhy_29way_logOdds_FLAG,CADD_raw_hg19_FLAG,CADD_phred_hg19_FLAG,DANN_score_FLAG,GenoCanyon_score_FLAG,MetaLR_score_FLAG,M-CAP_score_FLAG,MetaSVM_score_FLAG,LIST-S2_score_FLAG,LRT_score_reverted,PROVEAN_score_normalized,FATHMM_score_normalized,SIFT_score_normalized,SIFT4G_score_normalized,MutationAssessor_score_normalized,LRT_Omega_normalized,MPC_score_normalized,GERP++_RS_normalized,phyloP100way_vertebrate_normalized,phyloP30way_mammalian_normalized,phyloP17way_primate_normalized,SiPhy_29way_logOdds_normalized,bStatistic_normalized,BayesDel_addAF_score_normalized,BayesDel_noAF_score_normalized
min,0.0,0.0,0.0,0.0,0.0,1e-37,0.002,0.0,0.00095,0.0,0.049,0.013882,0.154256,0.000139,1e-05,0.0024,0.036852,0.0,0.0034,1e-06,0.001892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050714,0.139335,0.0,0.0,0.082479,0.0,7.6e-05,0.0,0.392694,0.645411,0.478042,5.8e-05,0.0,0.017912,0.056092
max,1.0,1.0,1.0,1.0,1.0,1.0,0.997,0.9999,0.994657,0.992,1.0,1.0,0.973357,0.988043,0.999961,0.999865,0.99963,0.99907,0.987401,1.0,0.839682,0.958517,0.858454,0.836244,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999965,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.984643,0.844976,1.0,1.0,0.865385,0.999904,0.608732,1.0,1.0,1.0,1.0,0.698758,1.0,0.990666,0.988405


In [62]:
GetColumnsWithMinAndMaxValuesInRange(df1,0,1)

Unnamed: 0,SIFT_score,SIFT4G_score,Polyphen2_HDIV_score,Polyphen2_HVAR_score,LRT_score,MutationTaster_score,VEST4_score,MetaLR_score,M-CAP_score,REVEL_score,MutPred_score,MVP_score,PrimateAI_score,DEOGEN2_score,ClinPred_score,LIST-S2_score,DANN_score,fathmm-MKL_coding_score,fathmm-XF_coding_score,GenoCanyon_score,integrated_fitCons_score,GM12878_fitCons_score,H1-hESC_fitCons_score,HUVEC_fitCons_score,phastCons100way_vertebrate,phastCons30way_mammalian,phastCons17way_primate,1000Gp3_AF,1000Gp3_AFR_AF,1000Gp3_EUR_AF,1000Gp3_AMR_AF,1000Gp3_EAS_AF,1000Gp3_SAS_AF,TWINSUK_AF,ALSPAC_AF,UK10K_AF,ESP6500_AA_AF,ESP6500_EA_AF,ExAC_AF,ExAC_Adj_AF,ExAC_AFR_AF,ExAC_AMR_AF,ExAC_EAS_AF,ExAC_FIN_AF,ExAC_NFE_AF,ExAC_SAS_AF,ExAC_nonTCGA_AF,ExAC_nonTCGA_Adj_AF,ExAC_nonTCGA_AFR_AF,ExAC_nonTCGA_AMR_AF,ExAC_nonTCGA_EAS_AF,ExAC_nonTCGA_FIN_AF,ExAC_nonTCGA_NFE_AF,ExAC_nonTCGA_SAS_AF,ExAC_nonpsych_AF,ExAC_nonpsych_Adj_AF,ExAC_nonpsych_AFR_AF,ExAC_nonpsych_AMR_AF,ExAC_nonpsych_EAS_AF,ExAC_nonpsych_FIN_AF,ExAC_nonpsych_NFE_AF,ExAC_nonpsych_SAS_AF,gnomAD_exomes_AF,gnomAD_exomes_AFR_AF,gnomAD_exomes_AMR_AF,gnomAD_exomes_ASJ_AF,gnomAD_exomes_EAS_AF,gnomAD_exomes_FIN_AF,gnomAD_exomes_NFE_AF,gnomAD_exomes_SAS_AF,gnomAD_exomes_POPMAX_AF,gnomAD_exomes_controls_AF,gnomAD_exomes_controls_AFR_AF,gnomAD_exomes_controls_AMR_AF,gnomAD_exomes_controls_ASJ_AF,gnomAD_exomes_controls_EAS_AF,gnomAD_exomes_controls_FIN_AF,gnomAD_exomes_controls_NFE_AF,gnomAD_exomes_controls_SAS_AF,gnomAD_exomes_controls_POPMAX_AF,gnomAD_genomes_AF,gnomAD_genomes_AFR_AF,gnomAD_genomes_AMR_AF,gnomAD_genomes_ASJ_AF,gnomAD_genomes_EAS_AF,gnomAD_genomes_FIN_AF,gnomAD_genomes_NFE_AF,gnomAD_genomes_AMI_AF,gnomAD_genomes_SAS_AF,gnomAD_genomes_POPMAX_AF,FATHMM_score_FLAG,fathmm-MKL_coding_score_FLAG,fathmm-XF_coding_score_FLAG,BayesDel_addAF_FLAG,BayesDel_noAF_FLAG,integrated_fitCons_score_FLAG,GM12878_fitCons_score_score_FLAG,H1-hESC_fitCons_score_FLAG,HUVEC_fitCons_score_FLAG,LRT_score_FLAG,MutationAssessor_score_FLAG,MutationTaster_score_FLAG,Polyphen2_HDIV_score_FLAG,Polyphen2_HVAR_score_FLAG,PROVEAN_score_FLAG,SIFT_score_FLAG,SIFT4G_score_FLAG,VEST4_score_FLAG,GERP++_RS_FLAG,DEOGEN2_score_FLAG,phastCons100way_vertebrate_FLAG,phastCons30way_mammalian_FLAG,phastCons17way_primate_FLAG,phyloP100way_vertebrate_FLAG,phyloP30way_mammalian_FLAG,phyloP17way_primate_FLAG,SiPhy_29way_logOdds_FLAG,CADD_raw_hg19_FLAG,CADD_phred_hg19_FLAG,DANN_score_FLAG,GenoCanyon_score_FLAG,MetaLR_score_FLAG,M-CAP_score_FLAG,MetaSVM_score_FLAG,LIST-S2_score_FLAG,LRT_score_reverted,PROVEAN_score_normalized,FATHMM_score_normalized,SIFT_score_normalized,SIFT4G_score_normalized,MutationAssessor_score_normalized,LRT_Omega_normalized,MPC_score_normalized,GERP++_RS_normalized,phyloP100way_vertebrate_normalized,phyloP30way_mammalian_normalized,phyloP17way_primate_normalized,SiPhy_29way_logOdds_normalized,bStatistic_normalized,BayesDel_addAF_score_normalized,BayesDel_noAF_score_normalized
min,0.0,0.0,0.0,0.0,0.0,1e-37,0.002,0.0,0.00095,0.0,0.049,0.013882,0.154256,0.000139,1e-05,0.0024,0.036852,0.0,0.0034,1e-06,0.001892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050714,0.139335,0.0,0.0,0.082479,0.0,7.6e-05,0.0,0.392694,0.645411,0.478042,5.8e-05,0.0,0.017912,0.056092
max,1.0,1.0,1.0,1.0,1.0,1.0,0.997,0.9999,0.994657,0.992,1.0,1.0,0.973357,0.988043,0.999961,0.999865,0.99963,0.99907,0.987401,1.0,0.839682,0.958517,0.858454,0.836244,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999965,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.984643,0.844976,1.0,1.0,0.865385,0.999904,0.608732,1.0,1.0,1.0,1.0,0.698758,1.0,0.990666,0.988405


In [63]:
GetColumnsWithMinAndMaxValuesOutOfRange(df1,0,1)

Unnamed: 0,MutationAssessor_score,FATHMM_score,PROVEAN_score,MetaSVM_score,CADD_raw_hg19,Eigen-raw_coding,Eigen-PC-raw_coding,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian
min,-4.235,-12.4,-12.58,-2.0058,-2.771251,-2.793459,-2.869872,-12.3,-8.218,-6.245
max,4.925,6.49,13.57,1.3377,7.532029,1.131697,1.048397,6.17,10.003,1.312


In [64]:
GetColumnsWithMinAndMaxValuesOutOfRange(df2,0,1)

Unnamed: 0,MutationAssessor_score,FATHMM_score,PROVEAN_score,MetaSVM_score,CADD_raw_hg19,Eigen-raw_coding,Eigen-PC-raw_coding,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian
min,-2.58,-12.4,-13.96,-1.2162,-1.01677,-1.83375,-1.913915,-12.2,-3.591,-4.103
max,5.72,5.67,4.15,1.9566,9.391629,1.24247,1.153886,6.17,10.003,1.312


In [65]:
df1.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22631 entries, 0 to 10895
Data columns (total 171 columns):
 #   Column                              Dtype  
---  ------                              -----  
 0   #chr                                object 
 1   pos(1-based)                        int64  
 2   ref                                 object 
 3   alt                                 object 
 4   aaref                               object 
 5   aaalt                               object 
 6   hg19_chr                            object 
 7   hg19_pos(1-based)                   int64  
 8   aapos                               object 
 9   SIFT_score                          float64
 10  SIFT4G_score                        float64
 11  Polyphen2_HDIV_score                float64
 12  Polyphen2_HVAR_score                float64
 13  LRT_score                           float64
 14  LRT_Omega                           float64
 15  MutationTaster_score                float64
 16  Mut

In [66]:
df2.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12167 entries, 0 to 5514
Data columns (total 171 columns):
 #   Column                              Dtype  
---  ------                              -----  
 0   #chr                                object 
 1   pos(1-based)                        int64  
 2   ref                                 object 
 3   alt                                 object 
 4   aaref                               object 
 5   aaalt                               object 
 6   hg19_chr                            object 
 7   hg19_pos(1-based)                   int64  
 8   aapos                               object 
 9   SIFT_score                          float64
 10  SIFT4G_score                        float64
 11  Polyphen2_HDIV_score                float64
 12  Polyphen2_HVAR_score                float64
 13  LRT_score                           float64
 14  LRT_Omega                           float64
 15  MutationTaster_score                float64
 16  Muta

## Exportamos los Datasets en formato CSV para su posterior uso

In [67]:
df1.to_csv(path_or_buf="datasets/clean_bening.csv")
df2.to_csv(path_or_buf="datasets/clean_patho.csv")