# Making the blocks

In this Jupyter notebook, we build two dataframes of features and the target vector.

We encode the nominal variables as numbers and convert the date features to numeric values, so that both feature blocks can be used by machine learning algorithms.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer

## The first block 

In [2]:
df1 = pd.read_csv('NNTG_NEC-retrospective register_LIVE_enc_Form 1.csv', 
                  sep=';' )

In [3]:
df1.isnull().sum() # Count the missing values in each column

DATEBRTH     0
SEX          0
PATNO        0
DATEDIAG     0
DATEMET      0
PRIMTUM      0
PRTUMRES     0
DATESURG    51
NONE         0
SURGMET      0
RADTHRPY     0
SMOKHAB      0
STRPTCYT     0
SANDOSTN     0
PROTHRCA     0
INTRFERN     0
OTHRPRTH     0
MORPH        0
KI67         1
CGA1         0
SYNAPTOF     0
OCTREO       0
TISSUEBL     0
LIVER        0
LYMPHNDS     0
LUNG         0
BONE         0
PERFSTAT     0
OTHRORGM     0
BRAIN        0
BMI          1
HORMSYMP     0
DATESYMP    74
CARSYNDR     0
dtype: int64

In [4]:
df1.columns

Index(['DATEBRTH', 'SEX', 'PATNO', 'DATEDIAG', 'DATEMET', 'PRIMTUM',
       'PRTUMRES', 'DATESURG', 'NONE', 'SURGMET', 'RADTHRPY', 'SMOKHAB',
       'STRPTCYT', 'SANDOSTN', 'PROTHRCA', 'INTRFERN', 'OTHRPRTH', 'MORPH',
       'KI67', 'CGA1', 'SYNAPTOF', 'OCTREO', 'TISSUEBL', 'LIVER', 'LYMPHNDS',
       'LUNG', 'BONE', 'PERFSTAT', 'OTHRORGM', 'BRAIN', 'BMI', 'HORMSYMP',
       'DATESYMP', 'CARSYNDR'],
      dtype='object')

In [5]:
df1.head() # Get an overview of the features and their types

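| | DATEBRTH | SEX | PATNO | DATEDIAG | DATEMET | PRIMTUM | PRTUMRES | DATESURG | NONE | SURGMET | ... | LYMPHNDS | LUNG | BONE | PERFSTAT | OTHRORGM | BRAIN | BMI | HORMSYMP | DATESYMP | CARSYNDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01mar1942 | Male | 4001 | 20sep2006 | 20sep2006 | Gastric | No | | Yes | No | ... | Yes | Yes | Yes | WHO 1 | No | No | 30.0 | No | | No |
| 1 | 03mar1940 | Female | 4003 | 29sep2006 | 29sep2006 | Other | No | | No | No | ... | Yes | Yes | Yes | WHO 1 | No | No | 32.0 | No | | No |
| 2 | 03may1953 | Male | 4004 | 01dec2006 | 01dec2006 | Colon | Yes | 01dec2006 | Yes | No | ... | No | No | No | WHO 0 | No | No | 24.0 | No | | No |
| 3 | 05oct1936 | Female | 4005 | 04may2005 | 06jun2005 | Other | Yes | 04may2005 | Yes | No | ... | No | No | No | WHO 0 | Yes | No | 24.0 | No | | No |
| 4 | 06nov1966 | Male | 4006 | 17jul2003 | 17jul2003 | Pancreas | No | | No | Yes | ... | No | No | Yes | WHO 1 | No | No | 23.0 | No | | No |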


- We now convert this block to numeric form. The transformation applied to each feature depends on its type.
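Before choosing a transformation for each feature, it can help to list all non-numeric columns together with their number of levels. The following is a minimal sketch on a toy frame; the helper name `summarize_levels`, the rule that columns starting with `DATE` are dates, and the toy values are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def summarize_levels(df: pd.DataFrame) -> pd.DataFrame:
    """List each non-numeric column with its number of levels and a suggested encoding."""
    rows = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns need no encoding
        n_levels = df[col].nunique(dropna=True)
        if col.startswith("DATE"):
            plan = "date -> number"      # assumed naming convention
        elif n_levels == 2:
            plan = "binarize (0/1)"
        else:
            plan = "dummify"
        rows.append({"feature": col, "levels": n_levels, "plan": plan})
    return pd.DataFrame(rows, columns=["feature", "levels", "plan"])

toy = pd.DataFrame({
    "DATEBRTH": ["01mar1942", "03mar1940", "03may1953"],
    "SEX": ["Male", "Female", "Male"],
    "PRIMTUM": ["Gastric", "Other", "Colon"],
    "BMI": [30.0, 32.0, 24.0],
})
summary = summarize_levels(toy)
print(summary)
```

Such a summary gives the same information as calling `np.unique` on every column, in one table.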

In [6]:
import datetime

def from_dob_to_age(born): # Convert a date of birth to the age in completed years as of today
    today = datetime.date.today()
    # Subtract one year if this year's birthday has not been reached yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

REF: https://moonbooks.org/Articles/How-to-get-the-age-from-a-date-of-birth-DOB-in-python-/
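Because the function above uses today's date, the `Age` feature changes depending on when the notebook is run. A possible variant, sketched here under the assumption that a fixed reference date is acceptable, takes that date as an argument. The name `age_on` and the cut-off date are hypothetical.

```python
import datetime

def age_on(born: datetime.date, reference: datetime.date) -> int:
    """Age in completed years on `reference`."""
    had_birthday = (reference.month, reference.day) >= (born.month, born.day)
    return reference.year - born.year - (0 if had_birthday else 1)

cutoff = datetime.date(2021, 6, 30)  # assumed cut-off date, not taken from the notebook
age_a = age_on(datetime.date(1942, 3, 1), cutoff)   # birthday already passed in 2021
age_b = age_on(datetime.date(1966, 11, 6), cutoff)  # birthday still to come in 2021
print(age_a, age_b)
```

With a fixed cut-off, rerunning the notebook at a later time yields the same feature values.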

In [7]:
# Convert the feature to datetime so that from_dob_to_age can be applied
df1['DATEBRTH'] = pd.to_datetime(df1['DATEBRTH'])
df1['Age'] = df1['DATEBRTH'].apply(from_dob_to_age)

In [8]:
df1['DATEDIAG'] = pd.to_datetime(df1['DATEDIAG'])
df1['DATEMET'] = pd.to_datetime(df1['DATEMET'])

In [9]:
# New feature: the number of days between the date of diagnosis and the date of metastasis
df1['DATEMET - DATEDIAG'] = (df1['DATEMET'] - df1['DATEDIAG']).dt.days
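The date strings in this register look like `04may2005`. As a small sketch, assuming all dates follow that day, abbreviated English month, year pattern, the format can be passed explicitly so that parsing does not rely on inference. The toy values are copied from the preview of the first form.

```python
import pandas as pd

toy = pd.DataFrame({
    "DATEDIAG": ["04may2005", "17jul2003"],
    "DATEMET": ["06jun2005", "17jul2003"],
})
fmt = "%d%b%Y"  # day, abbreviated month name, four-digit year
diag = pd.to_datetime(toy["DATEDIAG"], format=fmt)
met = pd.to_datetime(toy["DATEMET"], format=fmt)
days = (met - diag).dt.days  # difference in whole days
print(days.tolist())
```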

In [10]:
X = pd.concat([df1['Age'], df1['DATEMET - DATEDIAG']], axis=1)
# X holds the numeric representation of the first block

In [11]:
X["SEX"]=np.where(df1["SEX"] == 'Male', 0, 1) # Binarize SEX: 'Male' -> 0, any other value -> 1

In [12]:
np.unique(df1['PRIMTUM']) # Check how many levels this feature takes

array(['Colon', 'Esofagus', 'Gastric', 'Other', 'Pancreas', 'Rectum',
       'Unknown'], dtype=object)

In [13]:
# Dummify (one-hot encode) the feature "PRIMTUM"
X = pd.concat([X,pd.get_dummies(df1['PRIMTUM'], prefix='PRIMTUM')],axis=1)
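Depending on the pandas version, `pd.get_dummies` may return boolean columns instead of 0/1 integers (pandas 2.0 and later default to `bool`). A minimal sketch on a toy series, assuming integer columns are wanted, passes `dtype=int` explicitly:

```python
import pandas as pd

primtum = pd.Series(["Gastric", "Other", "Colon", "Other"], name="PRIMTUM")
dummies = pd.get_dummies(primtum, prefix="PRIMTUM", dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column of the level observed for that sample.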

In [14]:
np.unique(df1["PRTUMRES"]) # Check how many levels this feature takes

array(['No', 'Yes'], dtype=object)

In [15]:
# Since this feature has only two levels, we simply code them as 0 and 1 instead of dummifying
X["PRTUMRES"]=np.where(df1["PRTUMRES"] == 'No', 0, 1)

In [16]:
# The same applies to the following features
X["OPT-NONE"]=np.where(df1["NONE"] == 'No', 0, 1)
X["OPT-RADTHRPY"]=np.where(df1["RADTHRPY"] == 'No', 0, 1)
X["OPT-STRPTCYT"]=np.where(df1["STRPTCYT"] == 'No', 0, 1)
X["OPT-SANDOSTN"]=np.where(df1["SANDOSTN"] == 'No', 0, 1)
X["OPT-INTRFERN"]=np.where(df1["INTRFERN"] == 'No', 0, 1)
X["OPT-OTHRPRTH"]=np.where(df1["OTHRPRTH"] == 'No', 0, 1)
X["SURGMET"]=np.where(df1["SURGMET"] == 'No', 0, 1)
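Note that `np.where(col == 'No', 0, 1)` codes every value other than `'No'` as 1, including a missing or misspelled entry. A stricter variant is sketched below; the helper name `encode_yes_no` is hypothetical and the toy data are made up for illustration.

```python
import pandas as pd

def encode_yes_no(series: pd.Series) -> pd.Series:
    """Map 'No' -> 0 and 'Yes' -> 1, raising on any other level or missing value."""
    unexpected = set(series.dropna().unique()) - {"No", "Yes"}
    if unexpected or series.isna().any():
        raise ValueError(f"{series.name}: unexpected or missing levels {sorted(unexpected)}")
    return series.map({"No": 0, "Yes": 1}).astype(int)

toy = pd.DataFrame({"RADTHRPY": ["No", "Yes", "No"], "SURGMET": ["Yes", "Yes", "No"]})
encoded = pd.DataFrame({col: encode_yes_no(toy[col]) for col in toy.columns})
print(encoded)
```

For the features above, which contain only `'No'` and `'Yes'`, both approaches give the same result.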

In [17]:
np.unique(df1["SMOKHAB"])

array(['Ex-Smoker', 'Missing/NA', 'Non-Smoker', 'Smoker', 'Unknown'],
      dtype=object)

In [18]:
X = pd.concat([X,pd.get_dummies(df1['SMOKHAB'], prefix='SMOKHAB')],axis=1)

In [19]:
np.unique(df1["PROTHRCA"])

array(['Missing/NA', 'No', 'Yes'], dtype=object)

In [20]:
X = pd.concat([X,pd.get_dummies(df1['PROTHRCA'], prefix='PROTHRCA')],axis=1)

In [21]:
np.unique(df1["MORPH"])

array(['Large Cell Carcinoma', 'NA (Not Applicable)', 'Other',
       'Small Cell Carcinoma'], dtype=object)

In [22]:
X = pd.concat([X,pd.get_dummies(df1['MORPH'], prefix='MORPH')],axis=1)

In [23]:
X = pd.concat([X,df1['KI67']],axis=1)
# This feature has one missing value, which we need to impute

In [24]:
# Locate the missing item
X.loc[1, 'KI67']

nan

We used three different methods to impute the missing value:

1- KNN imputer

In [25]:
imputer = KNNImputer(n_neighbors=4, weights="uniform")
print('The first imputation method estimates the missing item as {}.'.format(pd.DataFrame(imputer.fit_transform(X))[30][1]))

The first imputation method estimates the missing item as 58.0.


2- Iterative Imputer

In [26]:
imp = IterativeImputer(max_iter=20, random_state=0)
imp.fit(X)
print('The second imputation method estimates the missing item as {}.'.format(pd.DataFrame(np.round(imp.transform(X)))[30][1]))

The second imputation method estimates the missing item as 63.0.


3- Simple imputer with the median strategy

In [27]:
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
print('The third imputation method estimates the missing item as {}.'.format(pd.DataFrame(imp.transform(X))[30][1]))

The third imputation method estimates the missing item as 65.0.


In [28]:
# Impute the missing item with the average of the three estimates obtained above
estimates = [58, 63, 65]
X.loc[1, 'KI67'] = np.average(estimates)


The result above is reported in Table 5.6 of the thesis.
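The three-imputer procedure can be wrapped in one helper that addresses the cell by row and column label rather than by position. This is only a sketch: the function name `impute_cell` and the toy data are assumptions, and unlike the notebook it does not round the iterative estimate.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

def impute_cell(df, row, column, n_neighbors=4):
    """Estimate one missing cell with three imputers; return the estimates and their mean."""
    row_pos = df.index.get_loc(row)
    col_pos = df.columns.get_loc(column)
    imputers = {
        "knn": KNNImputer(n_neighbors=n_neighbors, weights="uniform"),
        "iterative": IterativeImputer(max_iter=20, random_state=0),
        "median": SimpleImputer(strategy="median"),
    }
    estimates = {name: float(imp.fit_transform(df)[row_pos, col_pos])
                 for name, imp in imputers.items()}
    estimates["mean"] = float(np.mean(list(estimates.values())))
    return estimates

# Toy data: column b is missing in row 2
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [10.0, 20.0, np.nan, 40.0, 50.0]})
est = impute_cell(toy, row=2, column="b", n_neighbors=2)
print(est)
```

Looking up the position from the column name avoids hard-coded indices, which shift whenever features are added to the frame.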

In [29]:
np.unique(df1["CGA1"])

array(['Negative', 'Not Done', 'Partly Positive', 'Strongly Positive'],
      dtype=object)

In [30]:
X = pd.concat([X,pd.get_dummies(df1['CGA1'], prefix='CGA1')],axis=1)

In [31]:
np.unique(df1["SYNAPTOF"])

array(['Negative', 'Not Done', 'Partly Positive', 'Strongly Positive'],
      dtype=object)

In [32]:
X = pd.concat([X,pd.get_dummies(df1['SYNAPTOF'], prefix='SYNAPTOF')],axis=1)

In [33]:
np.unique(df1["OCTREO"])

array(['Missing/NA', 'Negative', 'Not Done', 'Pos. < Liver',
       'Pos. > Liver'], dtype=object)

In [34]:
X = pd.concat([X,pd.get_dummies(df1['OCTREO'], prefix='OCTREO')],axis=1)

In [35]:
np.unique(df1["LIVER"])

array(['No', 'Yes'], dtype=object)

In [36]:
X["SOM_LIVER"]=np.where(df1["LIVER"] == 'No', 0, 1)

In [37]:
np.unique(df1["LYMPHNDS"])

array(['No', 'Yes'], dtype=object)

In [38]:
X["SOM_LYMPHNDS"]=np.where(df1["LYMPHNDS"] == 'No', 0, 1)

In [39]:
np.unique(df1["LUNG"])

array(['No', 'Yes'], dtype=object)

In [40]:
X["SOM_LUNG"]=np.where(df1["LUNG"] == 'No', 0, 1)

In [41]:
np.unique(df1["BONE"])

array(['No', 'Yes'], dtype=object)

In [42]:
X["SOM_BONE"]=np.where(df1["BONE"] == 'No', 0, 1)
X["SOM_OTHRORGM"]=np.where(df1["OTHRORGM"] == 'No', 0, 1)
X["SOM_BRAIN"]=np.where(df1["BRAIN"] == 'No', 0, 1)

In [43]:
np.unique(df1["PERFSTAT"])

array(['WHO 0', 'WHO 1', 'WHO 2', 'WHO 3'], dtype=object)

In [44]:
X = pd.concat([X,pd.get_dummies(df1['PERFSTAT'], prefix='PERFSTAT')],axis=1)

In [45]:
X = pd.concat([X,df1['BMI']],axis=1)
# This feature also has one missing value, which we need to impute

In [46]:
# Locate the missing item
X.loc[30, 'BMI']

nan

In [47]:
imputer = KNNImputer(n_neighbors=4, weights="uniform")
print('The first imputation method estimates the missing item as {}.'.format(pd.DataFrame(imputer.fit_transform(X))[54][30]))

The first imputation method estimates the missing item as 24.75.


In [48]:
imp = IterativeImputer(max_iter=20, random_state=0)
imp.fit(X)
print('The second imputation method estimates the missing item as {}.'.format(pd.DataFrame(np.round(imp.transform(X)))[54][30]))

The second imputation method estimates the missing item as 25.0.


In [49]:
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
print('The third imputation method estimates the missing item as {}.'.format(pd.DataFrame(imp.transform(X))[54][30]))

The third imputation method estimates the missing item as 24.0.


In [50]:
# As for KI67, impute the missing item with the average of the three estimates
estimates = [24.75, 25, 24]
X.loc[30, 'BMI'] = np.average(estimates)


The result above is reported in Table 5.6 of the thesis.
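A note on writing a single imputed value back into the frame: an assignment through chained indexing, such as `df[col][row] = value`, triggers a `SettingWithCopyWarning` in older pandas versions and, with copy-on-write enabled (the default from pandas 3.0), does not modify the original frame. A minimal sketch of the label-based alternative on a toy frame, reusing the three BMI estimates printed earlier:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"BMI": [30.0, np.nan, 24.0]})
estimates = [24.75, 25, 24]              # the three BMI estimates from the imputers
toy.loc[1, "BMI"] = np.mean(estimates)   # row label and column label in one step
print(toy)
```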

In [51]:
np.unique(df1["HORMSYMP"])

array(['No', 'Yes'], dtype=object)

In [52]:
X["HORMSYMP"]=np.where(df1["HORMSYMP"] == 'No', 0, 1)

In [53]:
np.unique(df1["CARSYNDR"])

array(['No', 'Yes'], dtype=object)

In [54]:
X["CARSYNDR"]=np.where(df1["CARSYNDR"] == 'No', 0, 1)

In [55]:
X.shape

(80, 57)

The first block is now complete: it consists of 57 features and 80 samples. Let's take a look at its first 5 rows:

In [56]:
X.head()

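| | Age | DATEMET - DATEDIAG | SEX | PRIMTUM_Colon | PRIMTUM_Esofagus | PRIMTUM_Gastric | PRIMTUM_Other | PRIMTUM_Pancreas | PRIMTUM_Rectum | PRIMTUM_Unknown | ... | SOM_BONE | SOM_OTHRORGM | SOM_BRAIN | PERFSTAT_WHO 0 | PERFSTAT_WHO 1 | PERFSTAT_WHO 2 | PERFSTAT_WHO 3 | BMI | HORMSYMP | CARSYNDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 79 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 30.0 | 0 | 0 |
| 1 | 81 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 32.0 | 0 | 0 |
| 2 | 68 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 24.0 | 0 | 0 |
| 3 | 84 | 33 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 24.0 | 0 | 0 |
| 4 | 54 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 23.0 | 0 | 0 |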


This dataframe is ready to be used in machine learning algorithms: all of its values are numeric.

See Table 5.4 in the thesis.
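The claim that every value is numeric can be checked programmatically before the block is passed on. A minimal sketch, with a hypothetical helper `check_block` and a deliberately imperfect toy frame:

```python
import numpy as np
import pandas as pd

def check_block(block: pd.DataFrame) -> dict:
    """Report the shape, the non-numeric columns and the number of missing values."""
    non_numeric = [c for c in block.columns
                   if not pd.api.types.is_numeric_dtype(block[c])]
    return {
        "shape": block.shape,
        "non_numeric": non_numeric,
        "n_missing": int(block.isna().sum().sum()),
    }

toy = pd.DataFrame({"Age": [79, 81], "BMI": [30.0, np.nan], "SEX": ["Male", "Female"]})
report = check_block(toy)
print(report)
```

For a finished block, one would expect an empty `non_numeric` list and `n_missing` equal to 0.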

## The second block

In [57]:
df2 = pd.read_csv('NNTG_NEC-retrospective register_LIVE_enc_Form 2.csv',
                 sep=';')

In [58]:
df2.isnull().sum()

PATNO        0
HIAA         0
CGA2         1
HMGLBN       0
LACTDHDR     0
PLATELTS     0
WHITEBLD     0
CRETININ     0
ALKPHSPH     0
CRP1        19
CRP2         0
TUMMARK1     0
TUMMARK2     0
TUMMARK3     0
HORMON1      0
HORMON2      0
HORMON3      0
HORMON4      0
dtype: int64

In [59]:
df2.head()

| | PATNO | HIAA | CGA2 | HMGLBN | LACTDHDR | PLATELTS | WHITEBLD | CRETININ | ALKPHSPH | CRP1 | CRP2 | TUMMARK1 | TUMMARK2 | TUMMARK3 | HORMON1 | HORMON2 | HORMON3 | HORMON4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4001 | Not Done | Not Done | Normal | >Normal <= 2UNL | Normal | >10x10ˆ9/L | Normal | Not Done | | True | Yes | Normal | Normal | No | Missing/NA | Missing/NA | Missing/NA |
| 1 | 4003 | Not Done | Not Done | <11 g/dL | > 2UNL | Normal | Normal | Normal | >3 UNL | | True | No | Missing/NA | Missing/NA | No | Missing/NA | Missing/NA | Missing/NA |
| 2 | 4004 | Not Done | Not Done | Normal | > 2UNL | Normal | >10x10ˆ9/L | Normal | >3 UNL | | True | Yes | Elevated | Missing/NA | No | Missing/NA | Missing/NA | Missing/NA |
| 3 | 4005 | Not Done | Not Done | Normal | Normal | >400x10ˆ9/L | >10x10ˆ9/L | Normal | Normal | | True | Yes | Normal | Missing/NA | No | Missing/NA | Missing/NA | Missing/NA |
| 4 | 4006 | Normal | Not Done | Normal | Normal | Normal | Normal | Normal | Normal | 27.0 | False | Yes | Normal | Missing/NA | No | Missing/NA | Missing/NA | Missing/NA |


As with the previous block, we first check how many levels each nominal feature has. If a feature has two levels, we simply binarize it; if it has more, we dummify it.
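This rule can be expressed as one function applied to a list of columns. The sketch below is an illustration, not the notebook's code: the name `encode_nominal` is hypothetical, and coding the alphabetically first level as 0 is an assumption made here.

```python
import numpy as np
import pandas as pd

def encode_nominal(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Binarize two-level features and dummify features with more levels."""
    parts = []
    for col in columns:
        levels = sorted(df[col].dropna().unique())
        if len(levels) == 2:
            # Assumed convention: first level in sorted order -> 0, the other -> 1
            coded = np.where(df[col] == levels[0], 0, 1)
            parts.append(pd.Series(coded, index=df.index, name=col))
        else:
            parts.append(pd.get_dummies(df[col], prefix=col, dtype=int))
    return pd.concat(parts, axis=1)

toy = pd.DataFrame({
    "TUMMARK1": ["Yes", "No", "Yes"],
    "HMGLBN": ["Normal", "<11 g/dL", "Not Done"],
})
X_toy = encode_nominal(toy, ["TUMMARK1", "HMGLBN"])
print(X_toy)
```

The cells below apply the same rule feature by feature, which makes it easy to inspect the levels of each one first.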

In [60]:
np.unique(df2["HIAA"])

array(['> 2UNL', '>Normal <= 2UNL', 'Normal', 'Not Done'], dtype=object)

In [61]:
X2 =(pd.get_dummies(df2["HIAA"], prefix='HIAA'))

In [62]:
np.unique(df2["HMGLBN"])

array(['<11 g/dL', 'Normal', 'Not Done'], dtype=object)

In [63]:
X2 = pd.concat([X2,pd.get_dummies(df2["HMGLBN"], prefix='HMGLBN')],axis=1)

In [64]:
np.unique(df2["LACTDHDR"])

array(['> 2UNL', '>Normal <= 2UNL', 'Normal', 'Not Done'], dtype=object)

In [65]:
X2 = pd.concat([X2,pd.get_dummies(df2["LACTDHDR"], prefix='LACTDHDR')],axis=1)

In [66]:
np.unique(df2["PLATELTS"])

array(['>400x10ˆ9/L', 'Normal', 'Not Done'], dtype=object)

In [67]:
X2 = pd.concat([X2,pd.get_dummies(df2["PLATELTS"], prefix='PLATELTS')],axis=1)

In [68]:
np.unique(df2["WHITEBLD"])

array(['>10x10ˆ9/L', 'Normal', 'Not Done'], dtype=object)

In [69]:
X2 = pd.concat([X2,pd.get_dummies(df2["WHITEBLD"], prefix='WHITEBLD')],axis=1)

In [70]:
np.unique(df2["CRETININ"])

array(['> Normal', 'Normal'], dtype=object)

In [71]:
X2["CRETININ"]=np.where(df2["CRETININ"] == '> Normal', 0, 1)

In [72]:
np.unique(df2["ALKPHSPH"])

array(['>3 UNL', '>Normal <= 3 UNL', 'Normal', 'Not Done'], dtype=object)

In [73]:
X2 = pd.concat([X2,pd.get_dummies(df2["ALKPHSPH"], prefix='ALKPHSPH')],axis=1)

In [74]:
np.unique(df2["TUMMARK1"]) # two levels

array(['No', 'Yes'], dtype=object)

In [75]:
X2["TUMMARK1"]=np.where(df2["TUMMARK1"] == 'No', 0, 1)

In [76]:
X2 =pd.concat([X2,pd.get_dummies(df2["CGA2"], prefix='CGA2')],axis=1) # This feature has one missing value

We impute this item with a KNN imputer that uses uniform weights.

In [77]:
X2.loc[df2.CGA2.isnull(), X2.columns.str.startswith("CGA2_")] = np.nan # Restore NaN in the CGA2 dummy columns for the row where CGA2 is missing

In [78]:
X2.loc[55]
# Locating the missing item

HIAA_> 2UNL                  0.0
HIAA_>Normal <= 2UNL         0.0
HIAA_Normal                  0.0
HIAA_Not Done                1.0
HMGLBN_<11 g/dL              0.0
HMGLBN_Normal                1.0
HMGLBN_Not Done              0.0
LACTDHDR_> 2UNL              1.0
LACTDHDR_>Normal <= 2UNL     0.0
LACTDHDR_Normal              0.0
LACTDHDR_Not Done            0.0
PLATELTS_>400x10ˆ9/L         1.0
PLATELTS_Normal              0.0
PLATELTS_Not Done            0.0
WHITEBLD_>10x10ˆ9/L          1.0
WHITEBLD_Normal              0.0
WHITEBLD_Not Done            0.0
CRETININ                     1.0
ALKPHSPH_>3 UNL              0.0
ALKPHSPH_>Normal <= 3 UNL    0.0
ALKPHSPH_Normal              1.0
ALKPHSPH_Not Done            0.0
TUMMARK1                     0.0
CGA2_> 2UNL                  NaN
CGA2_>Normal <= 2UNL         NaN
CGA2_Normal                  NaN
CGA2_Not Done                NaN
Name: 55, dtype: float64
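The reason the NaN has to be restored by hand is that `pd.get_dummies` (with its default `dummy_na=False`) codes a missing value as a row of zeros, which an imputer would not recognise as missing. A small sketch on a toy series shows the effect:

```python
import numpy as np
import pandas as pd

cga2 = pd.Series(["Not Done", np.nan, "Normal"], name="CGA2")
dummies = pd.get_dummies(cga2, prefix="CGA2", dtype=float)
zeros_before = dummies.loc[1].tolist()   # the missing row is coded as all zeros
dummies.loc[cga2.isna(), :] = np.nan     # mark the dummy columns of that row as missing
missing_rows = dummies.isna().all(axis=1).tolist()
print(zeros_before, missing_rows)
```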

In [79]:
imputer = KNNImputer(n_neighbors=4, weights="uniform")
print('The first imputation method estimates the missing item as:')
print(pd.DataFrame(imputer.fit_transform(X2)).loc[55])
print('which corresponds to CGA2 "Not Done", since entry 26 of the vector above is estimated as 1')

The first imputation method estimates the missing item as:
0     0.0
1     0.0
2     0.0
3     1.0
4     0.0
5     1.0
6     0.0
7     1.0
8     0.0
9     0.0
10    0.0
11    1.0
12    0.0
13    0.0
14    1.0
15    0.0
16    0.0
17    1.0
18    0.0
19    0.0
20    1.0
21    0.0
22    0.0
23    0.0
24    0.0
25    0.0
26    1.0
Name: 55, dtype: float64
which corresponds to CGA2 "Not Done", since entry 26 of the vector above is estimated as 1
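Here all four neighbours agree, so the imputed dummy values are exactly 0 or 1. In general, a KNN imputer averages the neighbours and can return fractions for dummy columns. One possible way to turn such a row back into a valid one-hot vector, sketched with a hypothetical helper `snap_to_one_hot` and made-up fractional values, is to keep only the largest entry:

```python
import pandas as pd

def snap_to_one_hot(values: pd.Series) -> pd.Series:
    """Set the largest entry to 1 and all others to 0 (ties go to the first maximum)."""
    one_hot = pd.Series(0, index=values.index, dtype=int)
    one_hot.loc[values.idxmax()] = 1
    return one_hot

# Hypothetical imputed values for the four CGA2 dummy columns
imputed = pd.Series(
    [0.0, 0.25, 0.0, 0.75],
    index=["CGA2_> 2UNL", "CGA2_>Normal <= 2UNL", "CGA2_Normal", "CGA2_Not Done"],
)
snapped = snap_to_one_hot(imputed)
print(snapped)
```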


In [80]:
# Write the imputed one-hot row for CGA2 ("Not Done") with a single label-based assignment
X2.loc[55, ['CGA2_> 2UNL', 'CGA2_>Normal <= 2UNL', 'CGA2_Normal', 'CGA2_Not Done']] = [0, 0, 0, 1]


In [81]:
X2.shape

(80, 27)

The second block is now complete: it consists of 27 features and 80 samples. Let's take a look at its first 5 rows:

In [82]:
X2.head()

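| | HIAA_> 2UNL | HIAA_>Normal <= 2UNL | HIAA_Normal | HIAA_Not Done | HMGLBN_<11 g/dL | HMGLBN_Normal | HMGLBN_Not Done | LACTDHDR_> 2UNL | LACTDHDR_>Normal <= 2UNL | LACTDHDR_Normal | ... | CRETININ | ALKPHSPH_>3 UNL | ALKPHSPH_>Normal <= 3 UNL | ALKPHSPH_Normal | ALKPHSPH_Not Done | TUMMARK1 | CGA2_> 2UNL | CGA2_>Normal <= 2UNL | CGA2_Normal | CGA2_Not Done |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |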


See Table 5.5 in the thesis.

## Defining the target

In [83]:
df4 = pd.read_csv('NNTG_NEC-retrospective register_LIVE_enc_Form 4.csv',
                 sep=',')

In [84]:
df1['DATEDIAG'] = pd.to_datetime(df1['DATEDIAG'])
df4['DATELOBS'] = pd.to_datetime(df4['DATELOBS'])

In [85]:
# Days from diagnosis to last observation; the subtraction pairs df4 and df1 rows by their index
y = (df4['DATELOBS'] - df1['DATEDIAG']).dt.days

Our target is therefore the number of days between the cancer diagnosis and the last observation of each patient.
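The subtraction above pairs rows of the two forms by their position in the files. If the forms could be ordered differently, the rows can be matched on the patient number first. The sketch below assumes that the fourth form also contains a `PATNO` column, which the notebook does not show; the `DATELOBS` values are made up.

```python
import pandas as pd

form1 = pd.DataFrame({"PATNO": [4001, 4003],
                      "DATEDIAG": ["20sep2006", "29sep2006"]})
form4 = pd.DataFrame({"PATNO": [4003, 4001],                 # assumed column, different order
                      "DATELOBS": ["29oct2006", "30sep2006"]})  # hypothetical dates

# Match each patient's rows and fail loudly if a patient number is duplicated
merged = form1.merge(form4, on="PATNO", how="inner", validate="one_to_one")
fmt = "%d%b%Y"
y_toy = (pd.to_datetime(merged["DATELOBS"], format=fmt)
         - pd.to_datetime(merged["DATEDIAG"], format=fmt)).dt.days
y_toy.index = merged["PATNO"]
print(y_toy)
```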