# Notebook 1

Author: W.Dumindu Tharushika

Peer Reviewer: Lakindu Karunanayake

Review Date: 24 Feb 2025

Notebook Purpose: Perform data understanding and preprocessing for the SEER breast cancer dataset to prepare datasets for classification (Mortality Status) and regression (Survival Months) as per




# 1. Import Library and Load Dataset

Purpose: Import pandas, load the SEER breast cancer dataset, and display the first five rows to inspect its structure

Justification: pandas.read_csv is efficient for loading CSV files; head() confirms dataset structure (Case Study A, Task 2)

Code Reuse Reference: This cell leverages code blocks from Week 5 Code

Reuse Session:
Task 1: import pandas as pd

Task 2: pd.read_csv('/DataFilePath/DataFileName.csv')

Task 3: data.head(10)

In [2]:
# Importing the pandas library
import pandas as pd
# Loading the data into a DataFrame
data_frame = pd.read_csv("/content/5DATA002W.2 Coursework Dataset(25012025v6.0).csv")
# Displaying the first five rows to check the structure
data_frame.head()

Unnamed: 0,Patient_ID,Month_of_Birth,Age,Sex,Occupation,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months,Mortality_Status
0,A0012,12,68.0,Female,Teaching,T1,N1,IIA,Poorly differentiated,3,Regional,4.0,Positive,Positive,24.0,1,60,Alive
1,A0013,12,50.0,Female,Medical,T2,N2,IIIA,Moderately differentiated,2,Regional,35.0,Positive,Positive,14.0,5,62,Alive
2,A0014,11,58.0,Female,Engineering,T3,N3,IIIC,Moderately differentiated,2,Regional,63.0,Positive,Positive,14.0,7,75,Alive
3,A0015,3,58.0,Female,Technology,T1,N1,IIA,Poorly differentiated,3,Regional,18.0,Positive,Positive,2.0,1,84,Alive
4,A0016,1,47.0,Female,Multimedia,T2,N1,IIB,Poorly differentiated,3,Regional,41.0,Positive,Positive,3.0,1,50,Alive


# 2. Inspect Data Types and Missing Values
   
Leveraged from Week 5 Code Reuse Session: Task 5 (data.info)

Purpose: Identify data types and missing values to assess data quality (LO1)

Justification: info() reveals formatting issues and missing data for cleaning (Case Study A, Task 3)

In [3]:
# get type information to check if there are any formatting problems
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Patient_ID              4024 non-null   object 
 1   Month_of_Birth          4024 non-null   int64  
 2   Age                     4015 non-null   float64
 3   Sex                     4020 non-null   object 
 4   Occupation              43 non-null     object 
 5   T_Stage                 4024 non-null   object 
 6   N_Stage                 4024 non-null   object 
 7   6th_Stage               4024 non-null   object 
 8   Differentiated          4024 non-null   object 
 9   Grade                   4024 non-null   int64  
 10  A_Stage                 4024 non-null   object 
 11  Tumor_Size              4021 non-null   float64
 12  Estrogen_Status         4024 non-null   object 
 13  Progesterone_Status     4024 non-null   object 
 14  Regional_Node_Examined  4023 non-null   

# 3. Descriptive Statistics for Numerical Variables
Leveraged from Week 5 Code Reuse Session: Task 7 (data.describe)

Purpose: Detect value errors (e.g., outliers, negative values) in numerical columns

Justification: describe() exposes value errors through the min and max rows (e.g., negative Age and Tumor_Size, Age of 502), which is critical for data cleaning

In [4]:
# get descriptive statistics to check value errors
data_frame.describe()

Unnamed: 0,Month_of_Birth,Age,Grade,Tumor_Size,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months
count,4024.0,4015.0,4024.0,4021.0,4023.0,4024.0,4024.0
mean,6.481362,54.107098,2.150596,30.419299,14.373602,4.158052,71.472167
std,3.475442,11.715528,0.638234,21.16108,8.129293,5.109331,25.361855
min,1.0,-50.0,1.0,-75.0,1.0,1.0,1.0
25%,3.0,47.0,2.0,16.0,9.0,1.0,56.0
50%,6.0,54.0,2.0,25.0,14.0,2.0,73.0
75%,10.0,61.0,3.0,38.0,19.0,5.0,90.0
max,12.0,502.0,4.0,140.0,61.0,46.0,760.0
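
The min and max rows above already point to implausible values (Age of -50 and 502, Tumor_Size of -75, Survival_Months of 760). A range check along the following lines makes such values explicit. This is only a sketch: the bounds in `plausible_ranges`, the helper `count_out_of_range`, and the toy frame are assumptions chosen for illustration, not limits taken from the SEER documentation.

```python
import pandas as pd

# Hypothetical plausibility bounds (assumptions for illustration only)
plausible_ranges = {
    "Age": (0, 120),
    "Tumor_Size": (0, 200),
    "Survival_Months": (0, 200),
}

def count_out_of_range(frame, ranges):
    """Return the number of non-missing values outside [low, high] per column."""
    counts = {}
    for column, (low, high) in ranges.items():
        values = frame[column].dropna()
        counts[column] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts)

# Toy rows that reuse the extreme values reported by describe()
toy = pd.DataFrame({
    "Age": [68.0, -50.0, 502.0, None],
    "Tumor_Size": [4.0, -75.0, 140.0, 35.0],
    "Survival_Months": [60, 760, 75, 84],
})
out_of_range = count_out_of_range(toy, plausible_ranges)
print(out_of_range)
```

Missing values are skipped on purpose so that the check counts only values that are present but implausible.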


# 4. Descriptive Statistics for Categorical Variables

Leveraged from Week 5 Code Reuse Session: Task 8 (data.describe(include='object'))

Purpose: Check for inconsistencies in categorical data (e.g., case variations)

Justification: include='object' reveals unique value counts for categorical cleaning

In [5]:
# get description for categorical variables to check value errors in object type data
data_frame.describe(include='object')

Unnamed: 0,Patient_ID,Sex,Occupation,T_Stage,N_Stage,6th_Stage,Differentiated,A_Stage,Estrogen_Status,Progesterone_Status,Mortality_Status
count,4024,4020,43,4024,4024,4024,4024,4024,4024,4024,4024
unique,4024,2,40,4,3,5,4,2,2,2,7
top,A4035,Female,House Person,T2,N1,IIA,Moderately differentiated,Regional,Positive,Positive,Alive
freq,1,4001,2,1786,2732,1305,2351,3932,3755,3326,3399
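
A `unique` count of 7 for Mortality_Status, a field expected to hold two states, suggests that some labels differ only in casing. One way to test that suspicion is to compare the distinct-value count before and after lower-casing. The helper `case_collisions` and the toy values below are hypothetical and serve only as a sketch.

```python
import pandas as pd

def case_collisions(frame, columns):
    """Map each listed text column to (raw, folded) distinct counts when they differ."""
    flagged = {}
    for column in columns:
        raw = frame[column].nunique()
        folded = frame[column].str.strip().str.lower().nunique()
        if folded < raw:
            flagged[column] = (raw, folded)
    return flagged

# Toy frame: T_Stage is consistent, Mortality_Status mixes casings
toy = pd.DataFrame({
    "T_Stage": ["T1", "T2", "T3", "T1"],
    "Mortality_Status": ["Alive", "ALIVE", "Dead", "dead"],
})
collisions = case_collisions(toy, ["T_Stage", "Mortality_Status"])
print(collisions)
```

Only columns whose count shrinks are reported, so a clean column such as T_Stage produces no entry.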


# 5. Check Missing Values
Leveraged from Week 5 Code Reuse Session: Task 10 (data.isnull().sum)

Purpose: Quantify missing data to inform cleaning strategy

Justification: A high missing rate (e.g., Occupation) justifies dropping the whole column; low missing rates (e.g., Age) allow dropping only the affected rows

In [6]:
# Calculating the sum of missing values for each column to identify columns with null values
data_frame.isnull().sum()

Unnamed: 0,0
Patient_ID,0
Month_of_Birth,0
Age,9
Sex,4
Occupation,3981
T_Stage,0
N_Stage,0
6th_Stage,0
Differentiated,0
Grade,0


In [7]:
#  get the percentage of missing data values per variable
data_frame.isna().sum()/len(data_frame)*100

Unnamed: 0,0
Patient_ID,0.0
Month_of_Birth,0.0
Age,0.223658
Sex,0.099404
Occupation,98.931412
T_Stage,0.0
N_Stage,0.0
6th_Stage,0.0
Differentiated,0.0
Grade,0.0
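
The counts and percentages from the two cells above can be combined into one table that also records the cleaning decision. The sketch below assumes a 50% cut-off for dropping a column; that threshold, the helper `missing_summary`, and the toy data are illustrative assumptions rather than part of the coursework brief.

```python
import pandas as pd

def missing_summary(frame, drop_threshold=50.0):
    """Summarise missing values per column and suggest an action (assumed rule)."""
    n_missing = frame.isna().sum()
    pct_missing = (n_missing / len(frame) * 100).round(2)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    summary["suggested_action"] = "keep"
    summary.loc[summary["pct_missing"] > 0, "suggested_action"] = "drop rows"
    summary.loc[summary["pct_missing"] >= drop_threshold, "suggested_action"] = "drop column"
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame: Occupation is mostly missing, Age has one gap, T_Stage is complete
toy = pd.DataFrame({
    "Age": [68.0, None, 58.0, 47.0],
    "Occupation": [None, None, None, "Teaching"],
    "T_Stage": ["T1", "T2", "T3", "T1"],
})
summary = missing_summary(toy)
print(summary)
```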


# 6. Inspect Unique Values
Leveraged from Week 5 Code Reuse Session: Task 9 (data['Variable Name'].unique)

Purpose: Identify invalid or inconsistent values in key columns

Justification: unique() detects data entry errors for cleaning (Case Study A, Task 3)

In [8]:
# get unique values to find anomalies
data_frame['Age'].unique()

array([ 68.,  50.,  58.,  47.,  51.,  40.,  69.,  46.,  65.,  48.,  62.,
        61.,  56.,  43.,  60.,  57.,  55.,  63.,  66.,  53.,  59.,  54.,
        49.,  64.,  42.,  nan,  37.,  67.,  31.,  52.,  33.,  45.,  38.,
        39.,  36., 180.,  41.,  44., -50.,  32.,  34., 502.,  35.,  30.,
        89.])

In [9]:
# get unique values to find anomalies
data_frame['Month_of_Birth'].unique()

array([12, 11,  3,  1,  2,  5,  4,  6,  8,  9,  7, 10])

In [10]:
# get unique values to find anomalies
data_frame['Sex'].unique()

array(['Female', '1', nan], dtype=object)

In [11]:
# Displaying unique values in the T_Stage column to verify categorical consistency
data_frame['T_Stage'].unique()

array(['T1', 'T2', 'T3', 'T4'], dtype=object)

In [12]:
# Displaying unique values in the N_Stage column to verify categorical consistency
data_frame['N_Stage'].unique()

array(['N1', 'N2', 'N3'], dtype=object)

In [13]:
# Displaying unique values in the 6th_Stage column to verify categorical consistency
data_frame['6th_Stage'].unique()

array(['IIA', 'IIIA', 'IIIC', 'IIB', 'IIIB'], dtype=object)

In [14]:
# Displaying unique values in the Tumor_Size column to check for potential
# outliers or negative values
data_frame['Tumor_Size'].unique()

array([  4.,  35.,  63.,  18.,  41.,  20.,   8.,  30., 103.,  32.,  13.,
        59.,  15.,  19.,  46.,  24.,  25.,  29.,  40.,  70.,  22.,  50.,
        17.,  21.,  10.,  27.,  23.,   5.,  51.,   9.,  55., 120.,  77.,
         2.,  11.,  12.,  26.,  75., 130.,  34.,  80.,   3.,  60.,  14.,
        16.,  45.,  36.,  76.,  38.,  49.,   7.,  72., 100.,  43.,  62.,
        37.,  68., -75.,  52.,  85.,  57.,  39.,  28.,  48., 110.,  65.,
         6., 105., 140.,  42.,  31.,  90., 108.,  98.,  47.,  54.,  61.,
        74.,  33.,   1.,  87.,  nan,  81.,  58., 117.,  44., 123., 133.,
        95., 107.,  92.,  69.,  56.,  82.,  66.,  78.,  97.,  88.,  53.,
        83., 101.,  84., 115.,  73., 125., 104.,  94.,  86.,  64.,  96.,
        79.,  67.])

In [15]:
# Displaying unique values in the Estrogen_Status column to check for inconsistencies
data_frame['Estrogen_Status'].unique()

array(['Positive', 'Negative'], dtype=object)

In [16]:
# Displaying unique values in the Progesterone_Status column to check for inconsistencies
data_frame['Progesterone_Status'].unique()

array(['Positive', 'Negative'], dtype=object)

In [17]:
# Displaying unique values in the Regional_Node_Examined column to verify data integrity
data_frame['Regional_Node_Examined'].unique()

array([24., 14.,  2.,  3., 18., 11.,  9., 20., 21., 13., 23., 16.,  1.,
       22., 15.,  4., 26., 31., 25., 10.,  5.,  6., 19., 12.,  8., 17.,
        7., 49., 33., 30., 34., 28., 32., 27., 42., 29., nan, 41., 39.,
       46., 40., 51., 44., 38., 47., 54., 36., 61., 60., 37., 35., 43.,
       52., 45., 57.])

In [18]:
# Displaying unique values in the Reginol_Node_Positive column to verify data integrity
data_frame['Reginol_Node_Positive'].unique()

array([ 1,  5,  7,  2, 18, 12,  3, 14, 22, 17, 23,  4, 10,  6,  9,  8, 20,
       16, 13, 11, 24, 27, 21, 26, 15, 28, 19, 29, 31, 46, 33, 37, 30, 35,
       25, 32, 41, 34])

In [19]:
# Displaying unique values in the Survival_Months column to check for anomalies
data_frame['Survival_Months'].unique()

array([ 60,  62,  75,  84,  50,  89,  54,  14,  70,  92,  64,  56,  38,
        49, 105, 107,  77,  81,  78, 102,  98,  82,  86,  52,  90,  31,
        37, 103,  42,  61,  63,  39,  59,  71,  74,  73,  91, 106,  80,
        44,  85,  79, 104,  12,  95,  55, 101,  65,  72,  57,  87,  40,
        25,   8,  53,  58,  24,  66,  69,  93,  94, 100,  96,  41,  67,
        51,  13,  11,  47,  23,  45,  68,  76,  15,  16,  99,   7,  48,
        88,  34,  97, 760,  83,  17,   3,  22,  30,   6,  32,   9,   5,
        10,  19,  18,  35,  27,  36,   4,  29,  33,  26,  20,  28,  43,
         1,  46,  21,   2])

In [20]:
# Displaying unique values in the Mortality_Status column to identify different representations
data_frame['Mortality_Status'].unique()

array(['Alive', 'Dead', 'ALIVE', 'DEAD', 'ALive', 'alive', 'dead'],
      dtype=object)
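
The per-column inspection above could also be done in one pass. The following sketch loops over every column and keeps the distinct-value count plus the first few values; `summarise_unique` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def summarise_unique(frame, max_values=10):
    """Map each column to (number of distinct values, first few distinct values)."""
    report = {}
    for column in frame.columns:
        distinct = frame[column].dropna().unique().tolist()
        report[column] = (len(distinct), distinct[:max_values])
    return report

# Toy frame reusing values seen in the outputs above
toy = pd.DataFrame({
    "Sex": ["Female", "1", None, "Female"],
    "T_Stage": ["T1", "T2", "T3", "T4"],
})
report = summarise_unique(toy)
for column, (count, values) in report.items():
    print(f"{column}: {count} distinct -> {values}")
```

Capping the printed values keeps wide numeric columns such as Tumor_Size readable while still exposing odd categories such as `'1'` in Sex.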

# 7. Drop Irrelevant Features

Leveraged from Week 5 Code Reuse Session: Task 12 (data.drop(columns))

Purpose: Remove features with high missing rates or low predictive value

Justification: Occupation (98.9% missing) is unreliable; Patient_ID is a unique identifier with no predictive value; Month_of_Birth has low clinical relevance (cite: SEER documentation, no prognostic value for breast cancer)

In [21]:
# Removing the Occupation column due to high percentage of missing values (98.9%)
data_frame.drop("Occupation",axis=1,inplace=True)
# Removing the Month_of_Birth column due to low clinical relevance for breast cancer prognosis
data_frame.drop("Month_of_Birth",axis=1,inplace=True)
# Removing the Patient_ID column as it's just a unique identifier without predictive value
data_frame.drop("Patient_ID",axis=1,inplace=True)
# Displaying the first five rows to verify the columns were removed
data_frame.head()

Unnamed: 0,Age,Sex,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months,Mortality_Status
0,68.0,Female,T1,N1,IIA,Poorly differentiated,3,Regional,4.0,Positive,Positive,24.0,1,60,Alive
1,50.0,Female,T2,N2,IIIA,Moderately differentiated,2,Regional,35.0,Positive,Positive,14.0,5,62,Alive
2,58.0,Female,T3,N3,IIIC,Moderately differentiated,2,Regional,63.0,Positive,Positive,14.0,7,75,Alive
3,58.0,Female,T1,N1,IIA,Poorly differentiated,3,Regional,18.0,Positive,Positive,2.0,1,84,Alive
4,47.0,Female,T2,N1,IIB,Poorly differentiated,3,Regional,41.0,Positive,Positive,3.0,1,50,Alive


In [22]:
# Removing all rows with any missing values to ensure complete cases for analysis
data_frame.dropna(inplace=True)

In [23]:
# Verifying no missing values remain in the dataset
data_frame.isna().sum()

Unnamed: 0,0
Age,0
Sex,0
T_Stage,0
N_Stage,0
6th_Stage,0
Differentiated,0
Grade,0
A_Stage,0
Tumor_Size,0
Estrogen_Status,0
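
The order of the two steps matters: because Occupation is missing in almost every row, calling `dropna()` before dropping that column would discard nearly the whole dataset. The toy example below sketches the effect and reports how many rows are lost; the frame and variable names are made up for illustration.

```python
import pandas as pd

# Toy frame: Occupation is mostly missing, Age has a single gap
toy = pd.DataFrame({
    "Patient_ID": ["A0012", "A0013", "A0014"],
    "Occupation": ["Teaching", None, None],
    "Age": [68.0, None, 58.0],
    "Sex": ["Female", "Female", "Female"],
})

# Dropping rows first keeps only rows that are complete in every column
rows_if_dropna_first = len(toy.dropna())

# Dropping the sparse and identifier columns first preserves far more rows
rows_before = len(toy)
cleaned = toy.drop(columns=["Occupation", "Patient_ID"]).dropna()
rows_removed = rows_before - len(cleaned)

print(f"dropna() first would keep {rows_if_dropna_first} of {rows_before} rows")
print(f"Rows removed after dropping columns first: {rows_removed} of {rows_before}")
```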


# 8. Convert Data Types

Leveraged from Week 5 Code Reuse Session:
- Task 24: data.astype()

Purpose: Simplify data types for Age, Tumor_Size, Regional_Node_Examined (no decimal precision needed)

Justification: Integer types are suitable for these features in clinical contexts (e.g., Age in years)


In [24]:
# Converting Age, Tumor_Size, and Regional_Node_Examined columns to integer type for proper analysis
data_frame = data_frame.astype({"Age":"int","Tumor_Size":"int","Regional_Node_Examined":"int"})
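
Casting float columns to `int` silently truncates any fractional part, so it can be worth confirming that every value is a whole number first. The guard below is a minimal sketch; `to_int_if_whole` is a hypothetical helper, not part of the Week 5 code.

```python
import pandas as pd

def to_int_if_whole(series):
    """Cast a float series to int64 only when every value is a whole number."""
    if series.isna().any():
        raise ValueError("missing values must be handled before casting to int")
    if not (series % 1 == 0).all():
        raise ValueError("series holds fractional values; casting would truncate them")
    return series.astype("int64")

ages = pd.Series([68.0, 50.0, 58.0])
ages_int = to_int_if_whole(ages)
print(ages_int.dtype)
```

A series such as `pd.Series([4.5])` raises a `ValueError` instead of being truncated to 4.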

# 9. Save Intermediate Dataset

Leveraged from Week 5 Code Reuse Session:
- Task 27: data.to_csv()
- Task 2: pd.read_csv()
- Task 3: data.head()

Purpose: Save cleaned data as a checkpoint and reload for further preprocessing

Justification: Allows recovery without repeating initial cleaning steps

In [25]:
# Saving the cleaned dataset to a CSV file for future use
data_frame.to_csv(r'/content/clean_case.csv', index=False)
# Loading the saved cleaned dataset to verify it was saved correctly
df = pd.read_csv('/content/clean_case.csv')
# Displaying the first five rows of the cleaned dataset
df.head()

Unnamed: 0,Age,Sex,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months,Mortality_Status
0,68,Female,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,Alive
1,50,Female,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,Alive
2,58,Female,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,Alive
3,58,Female,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,Alive
4,47,Female,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,Alive
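
Saving with `index=False` and reloading gives the new frame a fresh 0-based index, so row labels in `df` no longer match those in `data_frame` once rows have been dropped. The sketch below shows the effect with an in-memory buffer standing in for the CSV file; the toy frame is an assumption for illustration.

```python
import io
import pandas as pd

# Toy frame with a gap in its index, as left behind by dropna()
frame = pd.DataFrame({"Age": [68, 50, 58, 47]}).drop(index=[1])

# Round trip through CSV text without writing the index
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(frame.index))     # labels keep the gap
print(list(reloaded.index))  # labels are renumbered from zero
```

This matters for any later step that refers to rows by number, such as dropping outliers by index.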


# 10. Exploratory Data Analysis and Visualization

Leveraged from Week 5 Code Reuse Session:
- Task 14: px.histogram()
- Task 16: px.box()

Purpose: Explore distributions of Age and Tumor_Size to identify outliers

Justification: Histograms and box plots reveal data issues (Case Study A, Task 2)


In [26]:
# Importing plotly express for interactive data visualization
import plotly.express as px
# Creating a histogram to visualize the age distribution in the dataset
age_histo_fig = px.histogram(df, x='Age')
# Displaying the age histogram
age_histo_fig.show()

In [27]:
# Creating a box plot to identify potential outliers in the Age column
# (using df, the reloaded checkpoint, for consistency with the other plots)
Age_fig = px.box(df, x='Age')
# Displaying the age box plot
Age_fig.show()

In [28]:
# Creating a histogram to visualize the tumor size distribution
Tumor_size_fig = px.histogram(df, x='Tumor_Size')
# Displaying the tumor size histogram
Tumor_size_fig.show()

# 11. Define Outlier Detection Function

Leveraged from Week 5 Code Reuse Session:
- Task 17: find_outliers_IQR()

Purpose: Identify outliers in numerical variables using IQR method

Justification: IQR is robust for detecting anomalies (Case Study A, Task 3)

In [29]:
# Defining a function to find outliers using the Interquartile Range (IQR) method
def find_outliers_IQR(series):
    # Calculating the first quartile (25th percentile)
    q1 = series.quantile(0.25)
    # Calculating the third quartile (75th percentile)
    q3 = series.quantile(0.75)
    # Calculating the interquartile range (IQR)
    IQR = q3 - q1
    # Identifying outliers as values below Q1-1.5*IQR or above Q3+1.5*IQR
    outliers = series[(series < (q1 - 1.5 * IQR)) | (series > (q3 + 1.5 * IQR))]
    return outliers
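
To make the rule concrete, the sketch below applies the same IQR logic to a small made-up series. With pandas' default linear interpolation the quartiles are 48.5 and 59.5, so the IQR is 11 and the fences sit at 32 and 76. The function is restated here so that the example runs on its own; the sample values are assumptions.

```python
import pandas as pd

def find_outliers_IQR(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < (q1 - 1.5 * iqr)) | (series > (q3 + 1.5 * iqr))]

# Toy ages: five ordinary values plus two entries of the kind seen in the data
ages = pd.Series([47, 50, 54, 58, 61, 502, -50])
outliers = find_outliers_IQR(ages)
print(outliers)
```

The returned series keeps the original index labels, which is what allows the flagged rows to be located in the full frame afterwards.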

# 12. Handle Age Outliers

Leveraged from Week 5 Code Reuse Session:
- Task 19: find_outliers_IQR call
- Task 20: data.drop

Purpose: Remove inconsistent Age values to improve data quality

Justification: Extreme Age values (e.g., 502) are unrealistic for breast cancer patients

In [30]:
# Finding outliers in the Age column using the IQR method
age_outliers = find_outliers_IQR(df['Age'])
# Printing the number of outliers found in the Age column
print("Age, number of outliers: "+ str(len(age_outliers)))
# Displaying the outliers
age_outliers

Age, number of outliers: 4


Unnamed: 0,Age
139,180
209,-50
512,502
829,89


In [31]:
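# Removing the three implausible ages (180, -50, 502) by row position;
# positions equal index labels here because df was reloaded with a fresh index.
# Age 89 (index 829) is left in place.
df.drop(df.index[[139,209,512]], inplace=True)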
# Checking if any outliers remain after removal
age_outliers = find_outliers_IQR(df['Age'])
# Printing the number of remaining outliers in the Age column
print("Age, number of outliers: "+ str(len(age_outliers)))
# Displaying any remaining outliers
age_outliers

Age, number of outliers: 1


Unnamed: 0,Age
829,89
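
Hard-coded row numbers stop being valid if the input file or an earlier cleaning step changes. An alternative is to select the rows by a plausibility condition, which removes the same kind of records while keeping a value such as 89. The bounds `MIN_AGE` and `MAX_AGE` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Toy frame reusing the flagged ages alongside ordinary ones
ages = pd.DataFrame({"Age": [68, 180, -50, 502, 89, 47]})

# Hypothetical plausibility bounds (not taken from the dataset documentation)
MIN_AGE, MAX_AGE = 0, 120

# Collect the labels of implausible rows, then drop them by label
implausible = ages[(ages["Age"] < MIN_AGE) | (ages["Age"] > MAX_AGE)].index
ages_clean = ages.drop(index=implausible)
print(ages_clean)
```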


# 13. Handle Tumor_Size Outliers

Leveraged from Week 5 Code Reuse Session:
- Task 19: find_outliers_IQR call

Purpose: Remove negative Tumor_Size values as they are logically impossible

Justification: Negative Tumor_Size is invalid; extreme values (>72) may skew models

In [32]:
# Creating a box plot to identify potential outliers in the Tumor_Size column
Tumor_size_fig = px.box(df, x='Tumor_Size')
# Displaying the tumor size box plot
Tumor_size_fig.show()

In [33]:
# Finding outliers in the Tumor_Size column using the IQR method
Tumor_size_outliers = find_outliers_IQR(df['Tumor_Size'])
# Printing the number of outliers found in the Tumor_Size column
print("Tumor size, number of outliers: "+ str(len(Tumor_size_outliers)))
# Displaying the outliers
Tumor_size_outliers

Tumor size, number of outliers: 217


Unnamed: 0,Tumor_Size
8,103
51,77
61,75
83,120
84,80
...,...
3913,120
3948,140
3974,90
3992,100


In [34]:
# Removing negative tumor size values as they are logically impossible
df = df[df['Tumor_Size'] >= 0]
# Creating a box plot to identify potential outliers in the Tumor_Size column
Tumor_size_fig = px.box(df, x='Tumor_Size')
# Displaying the tumor size box plot
Tumor_size_fig.show()

In [35]:
# Checking if negative values were successfully removed
Tumor_size_outliers = find_outliers_IQR(df['Tumor_Size'])
# Printing the number of remaining outliers in the Tumor_Size column
print("Tumor size, number of outliers: "+ str(len(Tumor_size_outliers)))
# Displaying any remaining outliers
Tumor_size_outliers

Tumor size, number of outliers: 216


Unnamed: 0,Tumor_Size
8,103
51,77
61,75
83,120
84,80
...,...
3913,120
3948,140
3974,90
3992,100
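
The drop from 217 to 216 outliers can be checked by hand. Assuming the Tumor_Size quartiles reported by describe() (25% = 16, 75% = 38) still hold at this point, the fences work out as in the short calculation below: -75 is the only value under the lower fence, and every remaining outlier lies above the upper fence.

```python
# Quartiles of Tumor_Size as reported by describe() (assumed unchanged here)
q1, q3 = 16.0, 38.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are flagged
upper_fence = q3 + 1.5 * iqr    # values above this are flagged

print(f"IQR = {iqr}, lower fence = {lower_fence}, upper fence = {upper_fence}")
```

Under these quartiles the lower fence is negative, so no valid (non-negative) tumor size can be flagged on the low side.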


In [36]:
# Creating a subset of the data containing only tumor size outliers (> 72mm)
tumor_size_outlier_df = df[df['Tumor_Size'] > 72]
# Displaying the first five rows of the tumor size outliers subset
tumor_size_outlier_df.head()

Unnamed: 0,Age,Sex,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months,Mortality_Status
8,40,Female,T4,N3,IIIC,Poorly differentiated,3,Regional,103,Positive,Positive,20,18,70,Alive
51,63,Female,T3,N1,IIIA,Well differentiated,1,Regional,77,Positive,Negative,20,2,70,Alive
61,59,Female,T3,N1,IIIA,Moderately differentiated,2,Regional,75,Positive,Positive,20,2,75,Alive
83,48,Female,T3,N2,IIIA,Moderately differentiated,2,Regional,120,Positive,Positive,7,5,82,Alive
84,52,Female,T3,N1,IIIA,Moderately differentiated,2,Regional,80,Positive,Positive,10,1,84,Alive


In [37]:
# Creating a histogram to visualize the distribution of Mortality_Status labels
mortality_status_fig = px.histogram(df, x='Mortality_Status')
# Displaying the mortality status histogram
mortality_status_fig.show()

# 14. Clean Mortality_Status

Leveraged from Week 5 Code Reuse Session:
- Task 23: data.map()
- Task 9: data.unique()

Purpose: Standardize Mortality_Status to binary (Alive: 0, Dead: 1) for classification

Justification: Case variations (e.g., ALIVE, alive) cause inconsistencies; binary encoding suits classification


In [38]:
# Removing Survival_Months column
df_mortality_status_1 = df.drop('Survival_Months', axis=1)

In [39]:
# get type information to check if there are any formatting problems
df_mortality_status_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4003 entries, 0 to 4006
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Age                     4003 non-null   int64 
 1   Sex                     4003 non-null   object
 2   T_Stage                 4003 non-null   object
 3   N_Stage                 4003 non-null   object
 4   6th_Stage               4003 non-null   object
 5   Differentiated          4003 non-null   object
 6   Grade                   4003 non-null   int64 
 7   A_Stage                 4003 non-null   object
 8   Tumor_Size              4003 non-null   int64 
 9   Estrogen_Status         4003 non-null   object
 10  Progesterone_Status     4003 non-null   object
 11  Regional_Node_Examined  4003 non-null   int64 
 12  Reginol_Node_Positive   4003 non-null   int64 
 13  Mortality_Status        4003 non-null   object
dtypes: int64(5), object(9)
memory usage: 469.1+ KB


In [40]:
# get descriptive statistics to check value errors
df_mortality_status_1.describe()

Unnamed: 0,Age,Grade,Tumor_Size,Regional_Node_Examined,Reginol_Node_Positive
count,4003.0,4003.0,4003.0,4003.0,4003.0
mean,54.004996,2.152386,30.363228,14.37272,4.149888
std,8.980543,0.637619,20.939991,8.135409,5.098972
min,30.0,1.0,1.0,1.0,1.0
25%,47.0,2.0,16.0,9.0,1.0
50%,54.0,2.0,25.0,14.0,2.0
75%,61.0,3.0,38.0,19.0,5.0
max,89.0,4.0,140.0,61.0,46.0


In [41]:
# Standardizing the Mortality_Status column values by mapping all variations to binary values (0=Alive, 1=Dead)
df['Mortality_Status'] = df['Mortality_Status'].map({'Alive': 0 , 'Dead': 1, 'ALIVE': 0, 'DEAD': 1, 'ALive': 0, 'alive': 0, 'dead': 1})

In [42]:
# Displaying the first five rows to verify the mapping
df.head()

Unnamed: 0,Age,Sex,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months,Mortality_Status
0,68,Female,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,0
1,50,Female,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,0
2,58,Female,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,0
3,58,Female,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,0
4,47,Female,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,0


In [43]:
# Checking the unique values in the Mortality_Status column to ensure successful mapping
df['Mortality_Status'].unique()

array([0, 1])
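
Listing every spelling in the mapping works for the seven variants seen here, but any unseen variant would silently become NaN. A more defensive sketch normalises the text first and fails loudly on anything left unmapped; the toy series below is an assumption for illustration.

```python
import pandas as pd

# Toy labels with mixed casing and stray whitespace
status = pd.Series(["Alive", "DEAD", "ALive", " dead", "alive"])

# Normalise whitespace and case, then map the two remaining labels
encoded = status.str.strip().str.lower().map({"alive": 0, "dead": 1})

# Any label outside the mapping shows up as NaN, so check before casting
unmapped = int(encoded.isna().sum())
if unmapped:
    raise ValueError(f"{unmapped} Mortality_Status values could not be mapped")
encoded = encoded.astype("int64")
print(encoded.tolist())
```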

# 15. Clean Sex

Leveraged from Week 5 Code Reuse Session:
- Task 23: data.map()
- Task 9: data.unique()

Purpose: Encode Sex as binary (Female: 0, Male: 1) for modeling

Justification: '1' assumed as Male based on dataset inspection; binary encoding suits ML algorithms


In [44]:
# Checking the unique values in the Sex column
df['Sex'].unique()

array(['Female', '1'], dtype=object)

In [45]:
# Converting the Sex column to binary values (0=Female, 1=Male)
# The value '1' is assumed to represent 'Male' based on dataset inspection
df['Sex'] = df['Sex'].map({'Female': 0, '1': 1})
# Verifying the conversion by displaying unique values
df['Sex'].unique()

array([0, 1])

# 16. Encode Categorical Variables

Leveraged from Week 5 Code Reuse Session:
- Task 25: import LabelEncoder
- Task 26: label_encoder.fit_transform()

Purpose: Convert categorical variables to numerical for ML modeling

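Justification: LabelEncoder is used for simplicity. It assigns integer codes in sorted label order, which matches the clinical order for T_Stage, N_Stage, and 6th_Stage; for Differentiated the alphabetical codes do not follow clinical severity
(Case Study A, Task 3). OneHotEncoder was considered but avoided to reduce dimensionality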


In [46]:
# Importing the preprocessing module from sklearn, which provides LabelEncoder
from sklearn import preprocessing
# Creating a label encoder object
label_encoder = preprocessing.LabelEncoder()
# Encoding various categorical columns to numerical values
df['T_Stage'] = label_encoder.fit_transform(df['T_Stage'])
df['N_Stage'] = label_encoder.fit_transform(df['N_Stage'])
df['6th_Stage'] = label_encoder.fit_transform(df['6th_Stage'])
df['Estrogen_Status'] = label_encoder.fit_transform(df['Estrogen_Status'])
df['Differentiated'] = label_encoder.fit_transform(df['Differentiated'])
df['Grade'] = label_encoder.fit_transform(df['Grade'])
df['A_Stage'] = label_encoder.fit_transform(df['A_Stage'])
df['Progesterone_Status'] = label_encoder.fit_transform(df['Progesterone_Status'])
# Displaying column types to verify that every column is now numeric
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4003 entries, 0 to 4006
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Age                     4003 non-null   int64
 1   Sex                     4003 non-null   int64
 2   T_Stage                 4003 non-null   int64
 3   N_Stage                 4003 non-null   int64
 4   6th_Stage               4003 non-null   int64
 5   Differentiated          4003 non-null   int64
 6   Grade                   4003 non-null   int64
 7   A_Stage                 4003 non-null   int64
 8   Tumor_Size              4003 non-null   int64
 9   Estrogen_Status         4003 non-null   int64
 10  Progesterone_Status     4003 non-null   int64
 11  Regional_Node_Examined  4003 non-null   int64
 12  Reginol_Node_Positive   4003 non-null   int64
 13  Survival_Months         4003 non-null   int64
 14  Mortality_Status        4003 non-null   int64
dtypes: int64(15)
memory usage:
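
LabelEncoder numbers labels in sorted order, so "Moderately differentiated" receives 0 and "Poorly differentiated" receives 1, regardless of clinical severity. Where a true ordering is wanted, the order can be stated explicitly. The sketch below uses pandas only and covers just the three labels visible in the outputs above; the dataset reports four distinct labels, so the remaining one would need its own place in `explicit_order`, which is itself an assumed ordering.

```python
import pandas as pd

grades = pd.Series([
    "Poorly differentiated",
    "Moderately differentiated",
    "Well differentiated",
    "Moderately differentiated",
])

# Alphabetical codes, mimicking the sorted-label numbering
alphabetical = sorted(grades.unique())
alpha_codes = grades.map({label: code for code, label in enumerate(alphabetical)})

# Explicit order from best to worst differentiation (assumed for illustration)
explicit_order = [
    "Well differentiated",
    "Moderately differentiated",
    "Poorly differentiated",
]
ordinal = pd.Categorical(grades, categories=explicit_order, ordered=True)
ordinal_codes = pd.Series(ordinal.codes, index=grades.index)

print(alpha_codes.tolist())
print(ordinal_codes.tolist())
```

Labels missing from `explicit_order` are coded as -1 by `pd.Categorical`, which makes an incomplete ordering easy to spot.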

# 17. Standardize Numerical Features

Leveraged from Week 5 Code Reuse Session:
- Task 25: import sklearn.preprocessing
- Task 29: StandardScaler.fit_transform()

Purpose: Scale numerical features to mean=0, std=1 for ML algorithms

Justification: Standardization benefits algorithms like Logistic Regression, KNN (Case Study A, Task 3)


In [47]:
# Importing StandardScaler for feature scaling
from sklearn.preprocessing import StandardScaler
# Defining the numerical columns to be standardized
numeric_columns = ['Age', 'Tumor_Size', 'Regional_Node_Examined', 'Reginol_Node_Positive']
# Creating a StandardScaler object
ss = StandardScaler()
# Standardizing the numerical columns (mean=0, std=1)
df[numeric_columns] = ss.fit_transform(df[numeric_columns])
# Displaying the first five rows to verify the standardization
df.head()

Unnamed: 0,Age,Sex,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Survival_Months,Mortality_Status
0,1.558564,0,0,0,0,1,2,1,-1.259147,1,1,1.183528,-0.617827,60,0
1,-0.446019,0,1,1,2,0,1,1,0.221459,1,1,-0.04582,0.166743,62,0
2,0.444907,0,2,2,4,0,1,1,1.558781,1,1,-0.04582,0.559028,75,0
3,0.444907,0,0,0,0,1,2,1,-0.590486,1,1,-1.521038,-0.617827,84,0
4,-0.780117,0,1,0,1,1,2,1,0.508028,1,1,-1.398103,-0.617827,50,0
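
The scaled values can be reproduced by hand. StandardScaler divides by the population standard deviation (ddof=0), whereas describe() reports the sample standard deviation (ddof=1), so the latter has to be rescaled first. The sketch below takes the Age mean, standard deviation, and row count reported earlier for the 4003-row dataset and recomputes the z-score for an age of 68, which should agree closely with the first row of the table above (1.558564).

```python
import math

n = 4003                # number of rows after cleaning
mean_age = 54.004996    # Age mean reported by describe()
sample_std = 8.980543   # Age std reported by describe() (ddof=1)

# Convert the sample standard deviation to the population form (ddof=0)
population_std = sample_std * math.sqrt((n - 1) / n)

# z-score for the first patient, aged 68
z_age_68 = (68 - mean_age) / population_std
print(round(z_age_68, 6))
```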


# 18. Prepare Final Datasets for Modeling

Leveraged from Week 5 Code Reuse Session:
- Task 27: data.to_csv()
- Task 28: data[data.Variable_Name != value]

Purpose: Create separate datasets for classification (Mortality Status) and regression (Survival Months)

Justification: Different models require differently prepared datasets (Case Study A, Task 8)


In [48]:
# Removing the Survival_Months column, which is the regression target, from the classification dataset
df_mortality_status = df.drop('Survival_Months', axis=1)
# Saving the mortality status dataset for classification analysis
df_mortality_status.to_csv(r'/content/mortality_status.csv', index=False)
# Displaying the first five rows of the mortality status dataset
df_mortality_status.head()

Unnamed: 0,Age,Sex,T_Stage,N_Stage,6th_Stage,Differentiated,Grade,A_Stage,Tumor_Size,Estrogen_Status,Progesterone_Status,Regional_Node_Examined,Reginol_Node_Positive,Mortality_Status
0,1.558564,0,0,0,0,1,2,1,-1.259147,1,1,1.183528,-0.617827,0
1,-0.446019,0,1,1,2,0,1,1,0.221459,1,1,-0.04582,0.166743,0
2,0.444907,0,2,2,4,0,1,1,1.558781,1,1,-0.04582,0.559028,0
3,0.444907,0,0,0,0,1,2,1,-0.590486,1,1,-1.521038,-0.617827,0
4,-0.780117,0,1,0,1,1,2,1,0.508028,1,1,-1.398103,-0.617827,0


In [49]:
# Creating a subset containing only deceased patients (Mortality_Status == 1) for the survival months regression task
df_survival = df[df.Mortality_Status != 0]
# Removing Mortality_Status, which is constant within this subset
df_survival = df_survival.drop('Mortality_Status', axis=1)
# Displaying column information for the survival months dataset
df_survival.info()

<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 7 to 4000
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Age                     614 non-null    float64
 1   Sex                     614 non-null    int64  
 2   T_Stage                 614 non-null    int64  
 3   N_Stage                 614 non-null    int64  
 4   6th_Stage               614 non-null    int64  
 5   Differentiated          614 non-null    int64  
 6   Grade                   614 non-null    int64  
 7   A_Stage                 614 non-null    int64  
 8   Tumor_Size              614 non-null    float64
 9   Estrogen_Status         614 non-null    int64  
 10  Progesterone_Status     614 non-null    int64  
 11  Regional_Node_Examined  614 non-null    float64
 12  Reginol_Node_Positive   614 non-null    float64
 13  Survival_Months         614 non-null    int64  
dtypes: float64(4), int64(10)
memory usage: 72.0 KB

In [50]:
# Saving the survival months dataset for regression analysis
df_survival.to_csv('/content/survival_months.csv', index=False)
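
The two exports can be wrapped in one helper so that both datasets are always derived from the same cleaned frame. The function `split_targets` and the toy rows below are hypothetical; the sketch only mirrors the logic of the cells above.

```python
import pandas as pd

def split_targets(frame):
    """Return (classification frame, regression frame) from one cleaned frame."""
    # Classification: predict Mortality_Status, so the other target is removed
    classification = frame.drop(columns="Survival_Months")
    # Regression: deceased patients only, with the constant status column removed
    deceased = frame[frame["Mortality_Status"] == 1]
    regression = deceased.drop(columns="Mortality_Status")
    return classification, regression

# Toy cleaned frame with one living and two deceased patients
toy = pd.DataFrame({
    "Age": [0.4, -0.8, 1.5],
    "Survival_Months": [60, 14, 25],
    "Mortality_Status": [0, 1, 1],
})
clf_df, reg_df = split_targets(toy)
print(clf_df.shape, reg_df.shape)
```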