# Rationale for Choosing This Dataset

The choice of this dataset is based on the relevance of the health problem it addresses. Although the worldwide incidence of gallbladder cancer is relatively low, in Chile the disease is significantly more frequent: it ranks among the ten most common cancers in the country and exceeds the international average.

In this context, a dataset that allows this disease to be analyzed is particularly valuable, because it makes it possible to generate knowledge about a high-priority national public health problem. Studying this dataset is also an opportunity to contribute research and analysis strategies that could eventually help in understanding and mitigating the impact of this disease on the Chilean population.

# Data Quality Report

### Profiling

<strong>General overview</strong>
* First and last rows
* Column description (name, type, and non-null count)

In [12]:
# Importaciones
import pandas as pd
import numpy as np

In [13]:
# Load the dataframe and display the first 5 rows
df = pd.read_csv('./paired_bladder_2022_clinical_data.tsv', sep="\t", encoding="utf-8")

df.head(5)

Unnamed: 0,Study ID,Patient ID,Sample ID,Age at Diagnosis,Age at Which Sequencing was Reported (Years),Cancer Type,Cancer Type Detailed,Ethnicity Category,Fraction Genome Altered,Gene Panel,...,Sample coverage,Sample Type,Sex,Smoker,Somatic Status,Specimen Stage,Systemic Treatment,TMB (nonsynonymous),Tumor Purity,Treatment between Pri-Met sample collection
0,paired_bladder_2022,P-0000034,P-0000034-T01-IM3,75.0,78,Bladder Cancer,Bladder Urothelial Carcinoma,Non-Spanish; Non-Hispanic,0.1591,IMPACT341,...,461.0,Primary,Male,Former,Matched,4.0,1,13.309864,40,
1,paired_bladder_2022,P-0000043,P-0000043-T02-IM3,50.0,58,Bladder Cancer,Bladder Urothelial Carcinoma,,0.4515,IMPACT341,...,833.0,Metastasis,,Active,Matched,4.0,1,29.947193,50,
2,paired_bladder_2022,P-0000056,P-0000056-T01-IM3,57.0,60,Bladder Cancer,Bladder Urothelial Carcinoma,Non-Spanish; Non-Hispanic,0.0689,IMPACT341,...,1004.0,Primary,Male,Active,Matched,3.0,1,5.545777,60,
3,paired_bladder_2022,P-0000063,P-0000063-T01-IM3,61.0,63,Bladder Cancer,Bladder Urothelial Carcinoma,Non-Spanish; Non-Hispanic,0.5047,IMPACT341,...,900.0,Primary,Male,Never,Matched,3.0,1,15.528174,70,
4,paired_bladder_2022,P-0000068,P-0000068-T01-IM3,77.0,80,Bladder Cancer,Bladder Urothelial Carcinoma,Non-Spanish; Non-Hispanic,0.0052,IMPACT341,...,973.0,Metastasis,Male,Former,Matched,4.0,0,3.327466,20,


In [14]:
# Display the last 5 rows
df.tail(5)

Unnamed: 0,Study ID,Patient ID,Sample ID,Age at Diagnosis,Age at Which Sequencing was Reported (Years),Cancer Type,Cancer Type Detailed,Ethnicity Category,Fraction Genome Altered,Gene Panel,...,Sample coverage,Sample Type,Sex,Smoker,Somatic Status,Specimen Stage,Systemic Treatment,TMB (nonsynonymous),Tumor Purity,Treatment between Pri-Met sample collection
1654,paired_bladder_2022,P-0015660,s_C_E4UHYD_P001_d,,,Bladder/Urinary Tract Cancer,Bladder/Urinary Tract,Non-Spanish; Non-Hispanic,0.1831,IMPACT468,...,378.0,Primary,Male,Former,,,,6.917585,,Chemo
1655,paired_bladder_2022,P-0033364,s_C_F75LKW_M001_d,,,Bladder/Urinary Tract Cancer,Bladder/Urinary Tract,Non-Spanish; Non-Hispanic,0.4761,IMPACT468,...,933.0,Metastasis,Male,Former,,,,9.511679,,Naive
1656,paired_bladder_2022,P-0015663,s_C_F8A03J_M001_d,,,Bladder/Urinary Tract Cancer,Bladder/Urinary Tract,Non-Spanish; Non-Hispanic,0.2607,IMPACT468,...,867.0,Metastasis,Female,Active,,,,6.052887,,Naive
1657,paired_bladder_2022,P-0034906,s_C_N0LN75_P001_d,,,Bladder/Urinary Tract Cancer,Bladder/Urinary Tract,Non-Spanish; Non-Hispanic,0.8778,IMPACT468,...,562.0,Primary,Female,Former,,,,6.917585,,Chemo
1658,paired_bladder_2022,P-0015920,s_C_YKYKUT_P001_d,,,Bladder/Urinary Tract Cancer,Bladder/Urinary Tract,Non-Spanish; Non-Hispanic,0.7014,IMPACT468,...,647.0,Primary,Male,Former,,,,41.505509,,Naive


In [15]:
# Column summary: names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1659 entries, 0 to 1658
Data columns (total 35 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Study ID                                      1659 non-null   object 
 1   Patient ID                                    1659 non-null   object 
 2   Sample ID                                     1659 non-null   object 
 3   Age at Diagnosis                              1272 non-null   float64
 4   Age at Which Sequencing was Reported (Years)  1506 non-null   object 
 5   Cancer Type                                   1659 non-null   object 
 6   Cancer Type Detailed                          1659 non-null   object 
 7   Ethnicity Category                            1647 non-null   object 
 8   Fraction Genome Altered                       1526 non-null   float64
 9   Gene Panel                                    1659 non-null   o

In [22]:
df['Age at Which Sequencing was Reported (Years)'].head(10)

0    78
1    58
2    60
3    63
4    80
5    67
6    38
7    58
8    75
9    64
Name: Age at Which Sequencing was Reported (Years), dtype: object

At first glance, the variables `Treatment between Pri-Met sample collection`, `Met Location`, and `Metastatic Site` have very few non-null values. In addition, the columns `Age at Which Sequencing was Reported (Years)` and `Intravesical Treatment` are typed as `object` even though they hold numeric data.
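Both observations can be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: the helper names `missing_share` and `non_numeric_values` are hypothetical, and the toy frame only imitates the shape of the real data.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of null values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


def non_numeric_values(series: pd.Series) -> list:
    """Distinct non-null values that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    failed = series.notna() & parsed.isna()
    return sorted(series[failed].astype(str).unique())


# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Age at Which Sequencing was Reported (Years)": ["78", "58", ">90", None],
    "Metastatic Site": [None, None, None, "Bone"],
})

print(missing_share(toy))
print(non_numeric_values(toy["Age at Which Sequencing was Reported (Years)"]))
```

Applied to `df`, `non_numeric_values` would list the tokens that force a numeric-looking column to be stored as `object`, which is usually the fastest way to find out why the dtype is wrong.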


<strong>Range of numeric values</strong>

In [16]:
df.describe()

Unnamed: 0,Age at Diagnosis,Fraction Genome Altered,MSI Score,Mutation Count,Overall Survival (Months),Number of Samples Per Patient,Sample coverage,Specimen Stage,TMB (nonsynonymous)
count,1272.0,1526.0,1632.0,1621.0,1582.0,1659.0,1652.0,1313.0,1659.0
mean,65.165881,0.180792,0.645938,14.474399,32.630613,1.665461,693.682203,2.892612,15.965241
std,11.2703,0.177171,2.597361,22.966564,24.624368,0.948407,321.043195,0.744829,22.565535
min,0.0,0.0,-1.0,1.0,0.0,1.0,57.0,1.0,0.0
25%,58.0,0.0279,0.0,6.0,12.6,1.0,512.0,3.0,5.872318
50%,66.0,0.12865,0.16,10.0,28.57,1.0,638.0,3.0,10.376377
75%,73.0,0.288575,0.65,16.0,44.8275,2.0,782.0,3.0,19.023358
max,92.0,0.9545,42.11,469.0,104.877,6.0,2641.0,4.0,404.678709


The minimum of the `Age at Diagnosis` column is 0, an atypical value that makes no clinical sense. Likewise, `MSI Score` has a minimum of -1, which is not biologically meaningful and may be a placeholder rather than a real measurement.
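One way to handle such values is to replace them with nulls before any statistics are computed. The sketch below is an assumption about how this could be done, not the notebook's method: the function `mask_implausible` and the lower bounds in `PLAUSIBLE_MIN` are hypothetical and should be set according to the study's clinical criteria.

```python
import pandas as pd

# Hypothetical lower bounds (assumptions, not taken from the dataset documentation)
PLAUSIBLE_MIN = {"Age at Diagnosis": 1, "MSI Score": 0}


def mask_implausible(df: pd.DataFrame, minimums: dict) -> pd.DataFrame:
    """Return a copy where values below each column's lower bound become NaN."""
    cleaned = df.copy()
    for col, lower in minimums.items():
        # Series.where keeps values that satisfy the condition and nulls the rest
        cleaned[col] = cleaned[col].where(cleaned[col] >= lower)
    return cleaned


# Toy frame with one implausible value per column (illustrative values only)
toy = pd.DataFrame({
    "Age at Diagnosis": [75.0, 0.0, 57.0],
    "MSI Score": [0.16, -1.0, 42.11],
})

cleaned = mask_implausible(toy, PLAUSIBLE_MIN)
print(cleaned)
```

Masking instead of dropping rows keeps the other columns of each sample available for analysis.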

<strong>Unique values and duplicates</strong>

In [29]:
categorias = df.select_dtypes(exclude="number").apply(lambda x: x.unique())

for col, vals in categorias.items():
    print(f"Unique values in '{col}': {vals}")

Unique values in 'Study ID': ['paired_bladder_2022']
Unique values in 'Patient ID': ['P-0000034' 'P-0000043' 'P-0000056' ... 'P-0067727' 'P-0067776'
 'P-0067920']
Unique values in 'Sample ID': ['P-0000034-T01-IM3' 'P-0000043-T02-IM3' 'P-0000056-T01-IM3' ...
 's_C_F8A03J_M001_d' 's_C_N0LN75_P001_d' 's_C_YKYKUT_P001_d']
Unique values in 'Age at Which Sequencing was Reported (Years)': ['78' '58' '60' '63' '80' '67' '38' '75' '64' '65' '50' '69' '59' '81'
 '45' '66' '61' '72' '73' '79' '54' '68' '89' '87' '83' '44' '70' '56'
 '71' '39' '74' '57' '82' '76' '53' '77' nan '52' '33' '49' '28' '86' '55'
 '51' '85' '46' '90' '62' '84' '88' '40' '48' '47' '42' '>90' '35' '43'
 '41' '29' '21' '34' '36']
Unique values in 'Cancer Type': ['Bladder Cancer' 'Cancer of Unknown Primary'
 'Bladder/Urinary Tract Cancer']
Unique values in 'Cancer Type Detailed': ['Bladder Urothelial Carcinoma' 'Urethral Squamous Cell Carcinoma'
 'Plasmacytoid/Signet Ring Cell Bladder Carcinoma'
 'Upper Tract Urothelia

Reviewing the unique values reveals several inconsistencies:
* In `Age at Which Sequencing was Reported (Years)`, the value `>90` appears although all other values are numeric.
* In `Cancer Type`, the values 'Bladder Cancer' and 'Bladder/Urinary Tract Cancer' overlap.
* In `Metastatic Site`:
  * 'Lymph Node' and 'Lymph node' are duplicate categories that differ only in capitalization.
  * 'Pubic bone' and 'Bone' have inconsistent granularity.
* In `MSI Type`, the category 'Instable' is a misspelling of 'Unstable', and the category 'Do not report' is equivalent to a null value.
* In `Oncotree Code`, 'BLADDER' and 'BLCA' mean the same thing, as do 'Kidney/Upper Tract' and 'Kidney/Upper tract', and 'Pelvis' and 'Renal Pelvis'.
* In `Race Category`, both 'PT REFUSED TO ANSWER' and 'OTHER' could be treated as null for the purposes of this study.
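Several of the inconsistencies listed above could be cleaned with small helpers. The sketch below is an assumption, not the notebook's actual cleaning step: `normalize_labels` and `parse_capped_age` are hypothetical names, coding `>90` as 90 plus a flag is one possible convention among others, and the toy series are illustrative only.

```python
import pandas as pd


def normalize_labels(series: pd.Series, missing_labels=frozenset()) -> pd.Series:
    """Merge labels that differ only in case or spacing; null out placeholders.

    `missing_labels` must contain lowercase (case-folded) strings.
    """
    text = series.str.strip()
    key = text.str.casefold()
    # The first spelling seen for each case-insensitive key becomes canonical
    canonical = {}
    for value in text.dropna():
        canonical.setdefault(value.casefold(), value)
    return key.map(canonical).mask(key.isin(missing_labels))


def parse_capped_age(series: pd.Series, cap_label: str = ">90", cap_value: int = 90):
    """Return numeric ages and a boolean flag for rows reported as `cap_label`."""
    is_capped = series == cap_label
    ages = pd.to_numeric(series.where(~is_capped), errors="coerce")
    return ages.where(~is_capped, cap_value), is_capped


# Toy series (illustrative values only)
sites = normalize_labels(pd.Series(["Lymph Node", "Lymph node", "Bone", None]))
msi = normalize_labels(
    pd.Series(["Stable", "Do not report", "Instable"]),
    missing_labels={"do not report"},
)
ages, capped = parse_capped_age(pd.Series(["78", ">90", None]))
print(sites.tolist(), msi.tolist(), ages.tolist(), capped.tolist())
```

Case folding only resolves spelling variants. Synonyms such as 'BLADDER' and 'BLCA', or granularity differences such as 'Pubic bone' and 'Bone', need an explicit mapping decided by the analyst, for example passed to `Series.replace`.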

<strong>Outliers</strong>

In [33]:
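# Keep only the numeric columns
numeric_cols = df.select_dtypes(include=np.number)

# Z-score of every value relative to its column mean and standard deviation
z_scores = (numeric_cols - numeric_cols.mean()) / numeric_cols.std()

# A value counts as an outlier when |z| > 3
is_outlier = z_scores.abs() > 3
outliers_por_col = is_outlier.sum()

# Percentage over all rows, including rows where the column is null
outliers_pct = (outliers_por_col / len(df)) * 100

outliers_summary = pd.DataFrame({
    "cantidad_outliers": outliers_por_col,
    "porcentaje_outliers": outliers_pct.round(2)
})

print(outliers_summary)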

                               cantidad_outliers  porcentaje_outliers
Age at Diagnosis                               8                 0.48
Fraction Genome Altered                        9                 0.54
MSI Score                                     20                 1.21
Mutation Count                                10                 0.60
Overall Survival (Months)                      0                 0.00
Number of Samples Per Patient                 27                 1.63
Sample coverage                               44                 2.65
Specimen Stage                                 0                 0.00
TMB (nonsynonymous)                           19                 1.15


`Sample coverage` is the column with the most outliers, both in count and in percentage. Outlier counts per column range from 0 to 44. Given that the dataset has 1,659 rows, this is at most 2.65% of any column, an acceptable proportion.
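The z-score rule depends on the mean and standard deviation, which are themselves pulled by extreme values in skewed columns such as `TMB (nonsynonymous)` (median about 10.4, maximum about 404.7). As a cross-check, the interquartile range (IQR) rule could be applied. This is a sketch of a different technique from the one used above, with a hypothetical function name and the conventional factor `k = 1.5` as an assumption. Unlike the z-score summary, it computes percentages over non-null values only.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN are False, so null cells are never counted
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    counts = outside.sum()
    return pd.DataFrame({
        "outlier_count": counts,
        "outlier_pct": (counts / numeric.notna().sum() * 100).round(2),
    })


# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Sample coverage": [500, 520, 510, 530, 5000],
    "Specimen Stage": [3, 3, 3, 3, 3],
})

summary = iqr_outlier_counts(toy)
print(summary)
```

The IQR rule typically flags more values than `|z| > 3` on long-tailed columns, so the two summaries are best read side by side rather than as replacements for each other.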