## Exploration and preprocessing of the dataset

This notebook covers the visualization and exploration of the data. It uses the reference.csv dataset, which is saved as reference_preprocessed2.csv after preprocessing.

We attempted to use the query.csv dataset as the test set, removing the Neg.cell class from the label column. However, during testing we could not normalize this set adequately, so in the end it was not used.

In [4]:
import pandas as pd

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
df = pd.read_csv('reference.csv')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2761 entries, 0 to 2760
Columns: 17996 entries, Unnamed: 0 to label
dtypes: float64(17862), int64(133), object(1)
memory usage: 379.1+ MB


In [54]:
print(df['label'])


0       Neg.cell
1       Neg.cell
2       Neg.cell
3       Neg.cell
4       Neg.cell
          ...   
3407       T.CD8
3408       T.CD8
3409       T.CD8
3410       T.CD8
3411       T.CD8
Name: label, Length: 3412, dtype: object


The reference set has 2761 rows and 17996 columns. The first column, "Unnamed: 0", and the last, "label", are removed during experimentation.
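As a minimal sketch of that split, the snippet below separates the gene-expression features from the label on a tiny toy frame. The toy values and the names `toy`, `X`, and `y` are assumptions for illustration only; the real data would come from reference.csv.

```python
import pandas as pd

# Toy stand-in for reference.csv: an index column, two gene columns, and the label.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "A1BG": [5.54, 0.0, 81.18],
    "A2M": [0.0, 0.0, 4.09],
    "label": ["B.cell", "B.cell", "T.CD8"],
})

# Features: every column except the saved index and the label.
X = toy.drop(columns=["Unnamed: 0", "label"])

# Target: the cell-type label.
y = toy["label"]

print(X.shape, y.shape)
```

Keeping `y` as a separate Series means the labels remain available for training even though they are no longer part of the feature matrix.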

In [10]:
df.drop(columns=["Unnamed: 0"]).head(20)

,A1BG,A1CF,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,AADACL2,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,label
0,5.54,0.0,0.0,5.33,0.0,0.0,3.76,0.0,0.0,0.0,...,0.0,0.0,7.3,29.75,16.41,8.88,82.66,10.21,24.27,B.cell
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,15.88,16.71,13.4,0.0,0.0,0.0,B.cell
2,81.18,0.0,0.0,1.31,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.52,9.21,4.32,0.0,0.0,0.0,B.cell
3,0.0,2.2,4.09,15.15,0.0,0.0,0.0,2.88,0.0,0.0,...,0.0,16.43,8.41,62.27,139.67,73.72,20.85,12.43,66.03,B.cell
4,0.0,0.31,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.55,21.96,2.28,0.0,109.35,0.0,B.cell
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,32.89,12.47,0.0,0.0,0.0,0.0,B.cell
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,113.51,10.7,4.1,0.0,0.0,0.0,B.cell
7,0.0,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.4,12.92,15.0,0.0,0.0,0.0,B.cell
8,48.37,0.0,0.0,0.0,0.0,0.0,39.35,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.77,0.0,0.0,0.0,0.0,B.cell
9,0.0,0.0,0.0,0.0,0.0,0.0,160.11,0.0,0.0,0.0,...,55.71,0.0,0.0,0.0,0.0,5.15,0.0,7.17,0.0,B.cell


In [67]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the data
reference_df = pd.read_csv('reference.csv')
query_df = pd.read_csv('query.csv')

# Drop the 'Unnamed: 0' column, which holds the row index
reference_df = reference_df.loc[:, ~reference_df.columns.str.startswith('Unnamed')]
query_df = query_df.loc[:, ~query_df.columns.str.startswith('Unnamed')]

# Get the labels of each dataset
reference_labels = reference_df['label'].values
query_labels = query_df['label'].values

# Get the unique labels
unique_reference_labels = np.unique(reference_labels)
unique_query_labels = np.unique(query_labels)

# Print the unique labels for the reference and query sets
print("Unique labels in the reference set:", unique_reference_labels)
print("Unique labels in the query set:", unique_query_labels)


Unique labels in the reference set: ['B.cell' 'Macrophage' 'NK' 'T.CD4' 'T.CD8']
Unique labels in the query set: ['B.cell' 'Macrophage' 'NK' 'Neg.cell' 'T.CD4' 'T.CD8']
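The two label sets in the output above differ by a single class. A small sketch, using plain Python sets with the labels copied from that output, makes the difference explicit; the variable names are illustrative and not part of the original notebook.

```python
# Label sets copied from the output above.
reference_labels = {"B.cell", "Macrophage", "NK", "T.CD4", "T.CD8"}
query_labels = {"B.cell", "Macrophage", "NK", "Neg.cell", "T.CD4", "T.CD8"}

# Classes that appear in only one of the two sets.
only_in_query = query_labels - reference_labels
only_in_reference = reference_labels - query_labels

print("Only in query:", sorted(only_in_query))
print("Only in reference:", sorted(only_in_reference))
```

A class that is present only in the query set cannot be predicted by a model trained on the reference set, which is consistent with filtering out Neg.cell before using query.csv as a test set.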


In [34]:

# Check whether every value in each column is zero
all_zero = (df == 0).all()

# Names of the columns where all values are zero
columns_with_all_zeros = all_zero[all_zero].index.tolist()

# Report the all-zero columns, if there are any
if columns_with_all_zeros:
    print(f"There are columns where all values are zero: {columns_with_all_zeros}")
    for col in columns_with_all_zeros:
        print(f"\nColumn: {col}")
        print(df[col])
else:
    print("There are no columns where all values are zero.")



There are columns where all values are zero: ['ACTL7B', 'AMELY', 'AMTN', 'ARHGDIG', 'ATOH1', 'ATOH7', 'BARHL1', 'BHLHA9', 'BHLHE23', 'BPIFA2', 'BPIFB6', 'C1orf185', 'C2CD4B', 'C5orf46', 'CACNG4', 'CLDN17', 'CLPSL2', 'CPXCR1', 'CSHL1', 'CSN1S1', 'CXorf51A', 'CXorf51B', 'DEFB106A', 'DEFB106B', 'DEFB107A', 'DEFB107B', 'DEFB116', 'DEFB119', 'DEFB133', 'DEFB134', 'DUSP21', 'EVX2', 'F9', 'FABP1', 'FABP9', 'FAM181B', 'FAM183A', 'FGF17', 'FGF6', 'FOXA3', 'FTHL17', 'GAGE12D', 'GALR3', 'GDF7', 'GK2', 'GNG13', 'GPR27', 'GSC2', 'GSX1', 'HBM', 'HDGFL1', 'HELT', 'HES3', 'HES5', 'HFE2', 'HIST1H4G', 'IFNA10', 'IL31', 'IL9', 'INS', 'IZUMO2', 'LCE3B', 'LCE4A', 'LCN6', 'LRRC52', 'LY6G6D', 'MAFA', 'MAGEA9B', 'MAGEB4', 'MAGEB5', 'MOS', 'MPC1L', 'MTNR1B', 'NMS', 'NPS', 'NTF3', 'OR4E2', 'OR4X1', 'OR52B2', 'OR52D1', 'OR6X1', 'PAGE3', 'PCDHA8', 'PCDHB3', 'PCP4', 'PDHA2', 'PFN3', 'PHGR1', 'PLA2G2E', 'POM121L12', 'POU3F3', 'PRKACG', 'PRLH', 'PRR15L', 'PYDC2', 'RBMY1A1', 'RBMY1B', 'RIIAD1', 'RIPPLY2', 'SCGB1C1', 

In [15]:


# Identify the columns where all values are 0
all_zero_columns = (df == 0).all()

# Columns to drop: the saved index column and the all-zero columns
columns_to_drop = ['Unnamed: 0'] + all_zero_columns[all_zero_columns].index.tolist()

# Create a new DataFrame without the columns in columns_to_drop
df_filtered = df.drop(columns=columns_to_drop)

# Save the new DataFrame to a CSV file
df_filtered.to_csv('reference_preprocessed2.csv', index=False)

# Confirmation message
print(f"New dataset saved to 'reference_preprocessed2.csv'. Columns removed: {columns_to_drop}")


New dataset saved to 'reference_preprocessed2.csv'. Columns removed: ['Unnamed: 0', 'ACTL7B', 'AMELY', 'AMTN', 'ARHGDIG', 'ATOH1', 'ATOH7', 'BARHL1', 'BHLHA9', 'BHLHE23', 'BPIFA2', 'BPIFB6', 'C1orf185', 'C2CD4B', 'C5orf46', 'CACNG4', 'CLDN17', 'CLPSL2', 'CPXCR1', 'CSHL1', 'CSN1S1', 'CXorf51A', 'CXorf51B', 'DEFB106A', 'DEFB106B', 'DEFB107A', 'DEFB107B', 'DEFB116', 'DEFB119', 'DEFB133', 'DEFB134', 'DUSP21', 'EVX2', 'F9', 'FABP1', 'FABP9', 'FAM181B', 'FAM183A', 'FGF17', 'FGF6', 'FOXA3', 'FTHL17', 'GAGE12D', 'GALR3', 'GDF7', 'GK2', 'GNG13', 'GPR27', 'GSC2', 'GSX1', 'HBM', 'HDGFL1', 'HELT', 'HES3', 'HES5', 'HFE2', 'HIST1H4G', 'IFNA10', 'IL31', 'IL9', 'INS', 'IZUMO2', 'LCE3B', 'LCE4A', 'LCN6', 'LRRC52', 'LY6G6D', 'MAFA', 'MAGEA9B', 'MAGEB4', 'MAGEB5', 'MOS', 'MPC1L', 'MTNR1B', 'NMS', 'NPS', 'NTF3', 'OR4E2', 'OR4X1', 'OR52B2', 'OR52D1', 'OR6X1', 'PAGE3', 'PCDHA8', 'PCDHB3', 'PCP4', 'PDHA2', 'PFN3', 'PHGR1', 'PLA2G2E', 'POM121L12', 'POU3F3', 'PRKACG', 'PRLH', 'PRR15L', 'PYDC2
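The zero-column filter above compares every column with 0, including the text column `label`. One possible variant, sketched here on a toy frame, restricts the check to numeric columns with `select_dtypes`. The gene names `GENE_A`, `GENE_B`, and `GENE_C` are hypothetical, and this is an alternative under those assumptions, not the procedure the notebook actually ran.

```python
import pandas as pd

# Toy frame: two all-zero gene columns, one informative gene column, and the label.
toy = pd.DataFrame({
    "GENE_A": [0.0, 0.0, 0.0],
    "GENE_B": [0.0, 1.5, 0.0],
    "GENE_C": [0, 0, 0],
    "label": ["B.cell", "NK", "T.CD8"],
})

# Restrict the check to numeric columns so the label is never compared with 0.
numeric = toy.select_dtypes(include="number")
all_zero = (numeric == 0).all()
zero_columns = all_zero[all_zero].index.tolist()

# Drop the uninformative columns and keep everything else, label included.
toy_filtered = toy.drop(columns=zero_columns)

print(zero_columns)
print(toy_filtered.columns.tolist())
```

On the real data the result should match the original filter, since a column of strings never equals 0 in every row; the variant mainly makes the intent explicit.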

In [17]:
df_filtered.head()

,A1BG,A1CF,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,AADACL2,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,label
0,5.54,0.0,0.0,5.33,0.0,0.0,3.76,0.0,0.0,0.0,...,0.0,0.0,7.3,29.75,16.41,8.88,82.66,10.21,24.27,B.cell
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,15.88,16.71,13.4,0.0,0.0,0.0,B.cell
2,81.18,0.0,0.0,1.31,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.52,9.21,4.32,0.0,0.0,0.0,B.cell
3,0.0,2.2,4.09,15.15,0.0,0.0,0.0,2.88,0.0,0.0,...,0.0,16.43,8.41,62.27,139.67,73.72,20.85,12.43,66.03,B.cell
4,0.0,0.31,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.55,21.96,2.28,0.0,109.35,0.0,B.cell


In [75]:
df2 = pd.read_csv('reference_preprocessed2.csv')
# Count how many rows belong to each value of the 'label' column
label_counts = df2['label'].value_counts()

# Show the results
print("Value counts for the 'label' column:")
print(label_counts)


Value counts for the 'label' column:
label
T.CD8         1231
T.CD4          599
B.cell         573
Macrophage     294
NK              64
Name: count, dtype: int64
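To see how uneven these classes are, the counts can be turned into proportions. The sketch below uses the counts copied from the output above in a plain dictionary; on the real DataFrame, `df2['label'].value_counts(normalize=True)` would give the same shares directly.

```python
# Class counts copied from the output above.
label_counts = {
    "T.CD8": 1231,
    "T.CD4": 599,
    "B.cell": 573,
    "Macrophage": 294,
    "NK": 64,
}

# Total number of rows and the share of each class.
total = sum(label_counts.values())
proportions = {label: count / total for label, count in label_counts.items()}

for label, share in proportions.items():
    print(f"{label:<12}{share:6.1%}")
```

The counts add up to the 2761 rows of the reference set, with T.CD8 being the largest class and NK the smallest.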


### Exploration and preprocessing of query.csv, which was ultimately not used

In [19]:
df_query = pd.read_csv('query.csv')

In [21]:
df_query.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3412 entries, 0 to 3411
Columns: 17996 entries, Unnamed: 0 to label
dtypes: float64(17939), int64(56), object(1)
memory usage: 468.5+ MB
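Both files report 17996 columns running from `Unnamed: 0` to `label`, but an equal column count does not by itself guarantee that the names and their order match. A hedged sketch of such a check, on two toy frames with assumed values, could look like this:

```python
import pandas as pd

# Toy stand-ins for the reference and query sets (values are illustrative).
toy_reference = pd.DataFrame({"A1BG": [5.54], "A2M": [0.0], "label": ["B.cell"]})
toy_query = pd.DataFrame({"A1BG": [0.0], "A2M": [357.46], "label": ["Macrophage"]})

# True only if both frames have the same column names in the same order.
same_columns = toy_reference.columns.equals(toy_query.columns)

# Columns present in one frame but not in the other.
missing_in_query = toy_reference.columns.difference(toy_query.columns).tolist()
extra_in_query = toy_query.columns.difference(toy_reference.columns).tolist()

print(same_columns, missing_in_query, extra_in_query)
```

If the query set were used as a test set, a check like this would help confirm that each feature position refers to the same gene in both files.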


In [69]:
import pandas as pd

# Load the CSV file
df_query = pd.read_csv('query.csv')

# Remove the rows whose 'label' is 'Neg.cell'
df_query = df_query[df_query['label'] != 'Neg.cell']


# Save the resulting DataFrame
df_query.to_csv('query_filtered.csv', index=False)


In [70]:
df_query_filtered = pd.read_csv('query_filtered.csv')

In [71]:
df_query_filtered.head()

,Unnamed: 0,A1BG,A1CF,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,label
0,2229,0.0,0.0,357.46,0.0,0.0,0.0,0.0,36.06,0.0,...,81.51,0.0,20.98,9.78,32.15,0.0,210.38,0.0,0.0,Macrophage
1,2230,0.0,1.51,1352.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,37.73,27.2,15.61,16.25,18.91,0.0,Macrophage
2,2231,0.0,1.98,10.2,0.0,0.0,0.0,126.17,14.39,0.0,...,0.0,0.0,0.0,4.15,19.15,3.49,5.24,8.24,145.65,T.CD4
3,2232,0.0,1.14,0.0,2.42,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.79,13.81,13.75,0.0,15.76,0.0,T.CD8
4,2233,0.0,3.28,0.0,0.0,0.0,0.0,0.0,272.41,0.0,...,0.0,0.0,0.0,5.54,0.0,0.0,0.0,0.0,0.0,T.CD8
