### Loading the Data

Let's load our dataset.

In [24]:
import pandas as pd

df = pd.read_excel("datasets/dataset.xlsx")

In [25]:
df.head()

Unnamed: 0,Age at diagnosis,Regional nodes positive (1988+),Total number of in situ/malignant tumors for patient,Radiation recode,Chemotherapy recode,Radiation sequence with surgery,ER Status Recode Breast Cancer (1990+),PR Status Recode Breast Cancer (1990+),CS tumor size (2004-2015),Derived HER2 Recode (2010+),Regional nodes examined (1988+),COD to site recode,Race recode,Sex,Vital status recode (study cutoff used),Diagnosis_year,Last_fu _year,interva_years,stutus_5_years
0,54,3,1,Beam radiation,Yes,Radiation after surgery,Negative,Negative,25,Positive,14,Alive,White,Female,Alive,2011,2016,5,Alive
1,59,3,1,Beam radiation,Yes,Radiation after surgery,Positive,Negative,36,Negative,19,Alive,White,Female,Alive,2011,2016,5,Alive
2,54,0,2,Beam radiation,No/Unknown,Radiation after surgery,Positive,Positive,6,Negative,5,Alive,White,Female,Alive,2010,2016,6,Alive
3,58,0,1,Beam radiation,No/Unknown,Radiation after surgery,Positive,Positive,1,Negative,1,Alive,White,Female,Alive,2010,2016,6,Alive
4,89,0,1,None/Unknown,No/Unknown,No radiation and/or cancer-directed surgery,Negative,Positive,17,Negative,1,Alive,White,Female,Alive,2011,2016,5,Alive


### Data Cleaning and Preprocessing

First, let's rename the variables to make the analysis more convenient.

In [26]:
df.columns

Index(['Age at diagnosis', 'Regional nodes positive (1988+)',
       'Total number of in situ/malignant tumors for patient',
       'Radiation recode', 'Chemotherapy recode',
       'Radiation sequence with surgery',
       'ER Status Recode Breast Cancer (1990+)',
       'PR Status Recode Breast Cancer (1990+)', 'CS tumor size (2004-2015)',
       'Derived HER2 Recode (2010+)', 'Regional nodes examined (1988+)',
       'COD to site recode', 'Race recode', 'Sex',
       'Vital status recode (study cutoff used)', 'Diagnosis_year',
       'Last_fu _year', 'interva_years', 'stutus_5_years'],
      dtype='object')

In [27]:
df = df.rename({
    "Age at diagnosis": "diagnosis_age",
    "Regional nodes positive (1988+)": "lymph_nodes",
    "Total number of in situ/malignant tumors for patient": "malignant_tumors",
    "Radiation recode": "radiation_type",
    "Chemotherapy recode": "chemotherapy_done",
    "Radiation sequence with surgery": "radiation_sequence",
    "ER Status Recode Breast Cancer (1990+)": "estrogen_info",
    "PR Status Recode Breast Cancer (1990+)": "progesterone_info",
    "CS tumor size (2004-2015)": "tumor_size",
    "Derived HER2 Recode (2010+)": "her2_info",
    "Regional nodes examined (1988+)": "nodes_examined",
    "COD to site recode": "cause_of_death",
    "Race recode": "race",
    "Sex": "sex",
    "Vital status recode (study cutoff used)": "vital_status",
    "Diagnosis_year": "diagnosis_year",
    "Last_fu _year": "treatment_year",
    "interva_years": "num_screening",
    "stutus_5_years": "vital_status_5y"
}, axis="columns")

df.columns = df.columns.str.strip()

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35349 entries, 0 to 35348
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   diagnosis_age       35349 non-null  int64 
 1   lymph_nodes         35349 non-null  int64 
 2   malignant_tumors    35349 non-null  int64 
 3   radiation_type      35349 non-null  object
 4   chemotherapy_done   35349 non-null  object
 5   radiation_sequence  35349 non-null  object
 6   estrogen_info       35349 non-null  object
 7   progesterone_info   35349 non-null  object
 8   tumor_size          35349 non-null  object
 9   her2_info           35349 non-null  object
 10  nodes_examined      35349 non-null  int64 
 11  cause_of_death      35349 non-null  object
 12  race                35349 non-null  object
 13  sex                 35349 non-null  object
 14  vital_status        35349 non-null  object
 15  diagnosis_year      35349 non-null  int64 
 16  treatment_year      35

Now we need to detect the missing data and replace it with `np.nan` so that we can handle it properly later. To do that, let's look at the values present in each column of the dataset.

In [29]:
for col in df.columns:
    values = df[col].unique()
    values.sort()
    print(f"{col}: {values}\n")

diagnosis_age: [ 2 15 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
 88 89 90 91 92 93 94 95 96 97 98 99]

lymph_nodes: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 46 48 52 65 76
 95 97 98 99]

malignant_tumors: [1 2 3 4]

radiation_type: ['Beam radiation' 'Combination of beam with implants or isotopes'
 'None/Unknown' 'Radioactive implants (includes brachytherapy) (1988+)'
 'Radioisotopes (1988+)' 'Refused (1988+)']

chemotherapy_done: ['No/Unknown' 'Yes']

radiation_sequence: ['Intraoperative rad with other rad before/after surgery'
 'Intraoperative radiation' 'No radiation and/or cancer-directed surgery'
 'Radiation after surgery' 'Radiation before and after surgery'
 'Radiation prior to surgery' 'Surgery both before and after radiati

As we can see, the variables `radiation_type`, `estrogen_info`, `progesterone_info`, `her2_info`, and `race` contain missing or unknown values. Let's analyze these values to decide what to do with them.

In [30]:
with_na = ["radiation_type", "estrogen_info", "progesterone_info", 
           "her2_info", "race"]

for col in with_na:
    print(f"{col}: {100*df[col].value_counts(normalize=True)}\n")

radiation_type: radiation_type
Beam radiation                                           49.415825
None/Unknown                                             46.253076
Radioactive implants (includes brachytherapy) (1988+)     2.701632
Refused (1988+)                                           1.372033
Combination of beam with implants or isotopes             0.183881
Radioisotopes (1988+)                                     0.073552
Name: proportion, dtype: float64

estrogen_info: estrogen_info
Positive      77.077145
Negative      18.345639
Unknown        4.483861
Borderline     0.093355
Name: proportion, dtype: float64

progesterone_info: progesterone_info
Positive      65.755750
Negative      29.316247
Unknown        4.783728
Borderline     0.144276
Name: proportion, dtype: float64

her2_info: her2_info
Negative      77.566551
Positive      13.264873
Unknown        7.018586
Borderline     2.149990
Name: proportion, dtype: float64

race: race
White      86.655917
Black      12.863164
Unk

As we can see, the missing values of `radiation_type` are the `None/Unknown` entries, which account for about $46\%$ of the sample. We will therefore leave them untouched: they make up a significant part of the dataset and may carry relevant information.

For `estrogen_info`, `progesterone_info`, `her2_info`, and `race`, the missing values are the `Unknown` entries. The first three have a small share of missing values: about $4.5\%$, $4.8\%$, and $7.0\%$, respectively. Although small, these shares may still be relevant during training, so we will keep these `Unknown` values as they are.

Finally, `race` has only about $0.5\%$ missing values, so we can impute them without much concern.

In [31]:
import numpy as np

df["race"] = df["race"].replace("Unknown", np.nan)
df["race"].unique()

array(['White', 'Black', nan], dtype=object)
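As a quick check on this kind of decision, the share of a sentinel value can be measured before and after it is converted to `np.nan`. The sketch below uses a small made-up `race` column rather than the real dataset; the frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column (hypothetical values, not the dataset).
toy = pd.DataFrame({"race": ["White", "Black", "Unknown", "White"]})

# Share of the sentinel before replacement, in percent.
unknown_pct = 100 * (toy["race"] == "Unknown").mean()

# Replace the sentinel with NaN so pandas and scikit-learn see it as missing.
toy["race"] = toy["race"].replace("Unknown", np.nan)

# After replacement, isna() reports the same share.
missing_pct = 100 * toy["race"].isna().mean()

print(unknown_pct, missing_pct)
```

Comparing the two percentages confirms that the replacement touched exactly the intended rows.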

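In addition, `lymph_nodes` and `tumor_size` use the codes `99` and `"999"`, respectively, for missing values. Let's replace them with `np.nan`. Note that the code below treats every `lymph_nodes` value above 90 as missing, and maps the `"Blank(s)"` entries of `tumor_size` to the same missing code.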

In [32]:
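# Treat "Blank(s)" like the 999 missing code, convert to integers,
# then turn 999 into NaN (the column becomes float64).
df["tumor_size"] = (df["tumor_size"]
    .replace("Blank(s)", "999")
    .astype(int)
    .replace(999, np.nan))

# Values above 90 are treated as missing codes; mask() sets them to NaN
# and upcasts the integer column to float64.
df["lymph_nodes"] = df["lymph_nodes"].mask(df["lymph_nodes"] > 90)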

In [33]:
print(f"{df['lymph_nodes'].unique()}\n")
print(f"{df['tumor_size'].unique()}\n")

[ 3.  0. nan  4.  1. 10.  2. 11.  5. 39. 13. 15.  9. 21.  6. 19. 14. 16.
 12. 52. 23.  7. 30. 34.  8. 17. 20. 26. 33. 24. 29. 46. 42. 27. 18. 22.
 31. 65. 25. 32. 28. 35. 40. 38. 76. 36. 37. 41. 48.]

[ 25.  36.   6.   1.  17.  46. 990.  11.  14.  48.  18.  15.  10.  12.
   7.  24.  90.  32.  21.   4.  13. 130.  38.  19.  52.  23.  nan 100.
  56. 120. 150.  54.  50.  33.   9.  35.   8.   2.  70.  20.  26.  72.
  45.   5.  37. 991.  28.  40.  27.  22.   0.  60.  16.   3.  29.  55.
  30. 106.  79.  42.  44. 170.  65. 998.  57. 114. 110.  75.  97.  64.
  47.  61.  31.  43.  51.  80.  34. 153.  83.  82.  66. 180.  86. 105.
 116.  78. 123.  49. 995.  39.  41.  73. 103.  53. 240. 994. 800. 140.
 127.  63. 993.  95. 108.  71.  62. 132.  85. 992.  68.  58.  89. 220.
  76.  81.  69. 119.  59. 200.  99.  77. 101. 160. 190.  88. 112. 129.
 115.  87.  92.  94.  84. 172. 137. 135. 996. 162. 117.  98.  96.  67.
 210. 109. 440. 161. 230.  91. 270. 165. 125. 989. 145.  74. 600. 185.
 320. 134. 131. 10
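
A similar cleaning step could also be expressed with `pd.to_numeric(..., errors="coerce")`, which turns non-numeric entries such as `"Blank(s)"` into `NaN` directly, combined with `Series.mask` for the sentinel codes. This is only a sketch on made-up values, not the notebook's actual code path; the frame `toy` is an assumption.

```python
import pandas as pd

# Toy stand-in with the same kinds of problem values (hypothetical data).
toy = pd.DataFrame({
    "tumor_size": [25, "Blank(s)", 999, 6],
    "lymph_nodes": [3, 99, 0, 97],
})

# Non-numeric strings become NaN; then the 999 code is masked as well.
size = pd.to_numeric(toy["tumor_size"], errors="coerce")
toy["tumor_size"] = size.mask(size == 999)

# Codes above 90 are treated as missing, mirroring the rule used above.
toy["lymph_nodes"] = toy["lymph_nodes"].mask(toy["lymph_nodes"] > 90)

print(toy)
```

Both columns end up as `float64`, since `NaN` cannot be stored in a plain integer column.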

In [34]:
df["sex"].value_counts(normalize=True)*100

sex
Female    99.117372
Male       0.882628
Name: proportion, dtype: float64

### Feature Engineering

Let's analyze our features and try to create new, more relevant ones. First, look at `chemotherapy_done`. It takes the values `"Yes"` and `"No/Unknown"`, so we can convert them to `1` and `0`, respectively.

In [35]:
def chemotherapy_done(value):
    if value == "Yes":
        return 1
    
    return 0

df["chemotherapy_done"] = df["chemotherapy_done"].apply(chemotherapy_done)
df["chemotherapy_done"].value_counts()

chemotherapy_done
0    21118
1    14231
Name: count, dtype: int64
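
The same encoding can also be written without a helper function. A minimal sketch on made-up values (the `raw` series is an assumption, not the dataset):

```python
import pandas as pd

# Hypothetical sample of the original string values.
raw = pd.Series(["Yes", "No/Unknown", "Yes", "No/Unknown"],
                name="chemotherapy_done")

# Comparing against "Yes" yields booleans; astype(int) turns them into 1/0.
encoded = (raw == "Yes").astype(int)

print(encoded.tolist())
```

Because the comparison is vectorized, it avoids a Python-level call per row, which can matter on larger frames.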

Next, consider the `radiation_type` feature, which records which type of radiotherapy the patient received. From it we will create two new features: `radiotherapy_done` and `radiotherapy_group`.

In [36]:
def radiotherapy_done(value):
    if value in ["None/Unknown", "Refused (1988+)"]:
        return 0

    return 1

def radiotherapy_group(value):
    if value == "Beam radiation":
        return "external"
    elif value == "Radioactive implants (includes brachytherapy) (1988+)":
        return "internal"
    elif value == "Radioisotopes (1988+)":
        return "internal"
    elif value == "Combination of beam with implants or isotopes":
        return "combination"

    return "None"

df["radiotherapy_done"] = df["radiation_type"].apply(radiotherapy_done)
df["radiotherapy_group"] = df["radiation_type"].apply(radiotherapy_group)

df[["radiation_type", "radiotherapy_done", "radiotherapy_group"]].head(10)

Unnamed: 0,radiation_type,radiotherapy_done,radiotherapy_group
0,Beam radiation,1,external
1,Beam radiation,1,external
2,Beam radiation,1,external
3,Beam radiation,1,external
4,None/Unknown,0,
5,None/Unknown,0,
6,Radioactive implants (includes brachytherapy) ...,1,internal
7,None/Unknown,0,
8,None/Unknown,0,
9,Beam radiation,1,external
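
The grouping logic can also be written as a lookup table passed to `Series.map`, which keeps the category assignments in one place. The sketch below runs on a few made-up rows; the names `GROUPS` and `raw` are hypothetical.

```python
import pandas as pd

# Same grouping as radiotherapy_group, written as a lookup table.
GROUPS = {
    "Beam radiation": "external",
    "Radioactive implants (includes brachytherapy) (1988+)": "internal",
    "Radioisotopes (1988+)": "internal",
    "Combination of beam with implants or isotopes": "combination",
}

# Hypothetical sample of radiation_type values.
raw = pd.Series([
    "Beam radiation",
    "None/Unknown",
    "Radioisotopes (1988+)",
    "Refused (1988+)",
])

# Values absent from the table map to NaN, then fall back to the "None" label.
grouped = raw.map(GROUPS).fillna("None")

print(grouped.tolist())
```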


Now let's create the `therapeutic_plan` feature, an ordinal numeric variable that counts how many different treatments (chemotherapy and radiotherapy) the patient received.

In [37]:
df["therapeutic_plan"] = df["chemotherapy_done"] + df["radiotherapy_done"]

Now let's drop the variables `vital_status` and `cause_of_death`: they are directly related to the target variable `vital_status_5y`, and keeping them could harm our model by leaking the outcome.

In [38]:
df = df.drop(["vital_status", "cause_of_death"], axis="columns")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35349 entries, 0 to 35348
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   diagnosis_age       35349 non-null  int64  
 1   lymph_nodes         27755 non-null  float64
 2   malignant_tumors    35349 non-null  int64  
 3   radiation_type      35349 non-null  object 
 4   chemotherapy_done   35349 non-null  int64  
 5   radiation_sequence  35349 non-null  object 
 6   estrogen_info       35349 non-null  object 
 7   progesterone_info   35349 non-null  object 
 8   tumor_size          32997 non-null  float64
 9   her2_info           35349 non-null  object 
 10  nodes_examined      35349 non-null  int64  
 11  race                35179 non-null  object 
 12  sex                 35349 non-null  object 
 13  diagnosis_year      35349 non-null  int64  
 14  treatment_year      35349 non-null  int64  
 15  num_screening       35349 non-null  int64  
 16  vita

Now let's create the features `lymph_nodes_ratio` and `tumor_load`. The first is the ratio of positive lymph nodes to examined lymph nodes; the second is an indicator of the patient's tumor burden, computed as the tumor size times the number of malignant tumors.

In [40]:
df["lymph_nodes_ratio"] = df["lymph_nodes"] / (df["nodes_examined"] + 1e-6)
df["tumor_load"] = df["tumor_size"] * df["malignant_tumors"]

df[["lymph_nodes_ratio", "tumor_load"]].head(10)

Unnamed: 0,lymph_nodes_ratio,tumor_load
0,0.214286,25.0
1,0.157895,36.0
2,0.0,12.0
3,0.0,1.0
4,0.0,17.0
5,,46.0
6,0.0,990.0
7,0.0,12.0
8,0.0,11.0
9,0.0,14.0
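
Adding `1e-6` to the denominator avoids a division by zero, but a row with positive nodes and zero examined nodes would then produce a very large ratio. One possible alternative, sketched below on made-up values, is to treat a zero denominator as missing so that the ratio becomes `NaN`. The frame `toy` and this handling are assumptions, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including a zero denominator and a missing numerator.
toy = pd.DataFrame({
    "lymph_nodes": [3.0, 0.0, 2.0, np.nan],
    "nodes_examined": [14, 5, 0, 10],
})

# Keep only positive denominators; zeros become NaN instead of a tiny number.
denominator = toy["nodes_examined"].where(toy["nodes_examined"] > 0)
toy["lymph_nodes_ratio"] = toy["lymph_nodes"] / denominator

print(toy)
```

The resulting `NaN` values can then be handled by the same imputation step as the other missing numeric values.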


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_columns = df.select_dtypes(include="number").columns.to_list()
cat_columns = df.select_dtypes(include="object").columns.to_list()

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # "ignore" is an assumed value: the original argument was left blank.
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])
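
Transformers like these two are typically combined per column type with `ColumnTransformer`. The sketch below shows one way this could look on a made-up frame; the toy data, the `preprocessor` name, `sparse_threshold=0`, and the choice of `handle_unknown="ignore"` are assumptions rather than the notebook's confirmed configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "tumor_size": [25.0, np.nan, 6.0, 17.0],
    "race": ["White", "Black", np.nan, "White"],
})

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # assumed setting
])

# Route each column list to its transformer; sparse_threshold=0 forces a
# dense array so the result is easy to inspect.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["tumor_size"]),
        ("cat", categorical_transformer, ["race"]),
    ],
    sparse_threshold=0,
)

X = preprocessor.fit_transform(toy)
print(X.shape)
```

In this toy case the output has one imputed numeric column plus one indicator column per observed `race` category.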