# Oral Cancer Prediction Project

University of Colorado Boulder

DTSA 5509

This project aims to develop a machine learning model to predict oral cancer diagnosis based on various risk factors, symptoms, and demographic information. The dataset contains 84,922 rows and 25 columns, including features like age, gender, tobacco use, alcohol consumption, HPV infection, and various oral symptoms.

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb

Here, I load the full dataset and print a summary of the columns and data types. Most of the columns are binary Yes/No features. None of the 80K+ rows has missing values, and that volume is sufficient for this classification task.

In [None]:
df = pd.read_csv("/kaggle/input/oral-cancer-prediction-dataset/oral_cancer_prediction_dataset.csv")
df = df.drop_duplicates()
df = df.set_index('ID')
df.info()

In [None]:
df.index.names



Several features in the dataset are not suitable for predicting a diagnosis. Features like Cancer Stage and Treatment Type effectively tell us that the patient has cancer, so they should not be used to infer a patient's diagnosis. To prevent data leakage, Cancer Stage, Treatment Type, Survival Rate, Cost of Treatment, Economic Burden, Tumor Size, and Early Diagnosis are discarded. These may possibly be used in further applications or validation efforts later on.

In [None]:
columns2drop = ['Cancer Stage', 'Treatment Type', 'Survival Rate (5-Year, %)',
                     'Cost of Treatment (USD)', 'Economic Burden (Lost Workdays per Year)', 
                     'Tumor Size (cm)', 'Early Diagnosis']

df = df.drop(labels=columns2drop, axis=1)

Now the dataset contains only features that can be used for diagnosis: lifestyle, demographic, and risk factors, as well as medical history and symptoms. The following EDA will help determine which high-information features to keep and which to discard because of collinearity or low information gain.
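
One way to put a number on "information" is the mutual information between each feature and the target. The cell below is a minimal sketch on a synthetic stand-in, not the real dataset: the column names `Tobacco Use`, `Noise`, and `Diagnosis` and the 10% label noise are assumptions made purely for illustration. It uses scikit-learn's `mutual_info_classif`, which is one possible screening tool rather than the method used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for the real data: column names and values are hypothetical.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "Tobacco Use": rng.choice(["Yes", "No"], size=n),
    "Noise": rng.choice(["Yes", "No"], size=n),
})

# Make the target follow "Tobacco Use", with 10% of the labels flipped.
flip = rng.random(n) < 0.1
toy["Diagnosis"] = np.where((toy["Tobacco Use"] == "Yes") ^ flip, "Yes", "No")

# Integer-code each categorical column so it can be treated as discrete.
X = toy.drop(columns="Diagnosis").apply(lambda s: s.astype("category").cat.codes)
y = toy["Diagnosis"].astype("category").cat.codes

# Higher mutual information means the feature tells us more about the target.
mi = pd.Series(
    mutual_info_classif(X, y, discrete_features=True, random_state=0),
    index=X.columns,
).sort_values(ascending=False)
print(mi)
```

In this toy setup the informative column should rank well above the noise column. On the real data, scores close to zero across the board would suggest that no single feature carries much signal.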

In [None]:
df.info()

for col in df.columns:
    print(f'\n{col}\n{df[col].unique()}\n')

In [None]:
def eda_plots(df):

    df = df.copy()
    numeric_cols = df.select_dtypes(include='number').columns
    categorical_cols = df.select_dtypes(exclude='number').columns

    for col in numeric_cols:
        fig, ax = plt.subplots(figsize=(5, 3))
        ax.hist(df[col], bins='auto', density=True)
        ax.set_xlabel(col)
        ax.set_ylabel('Density')
        ax.set_title(f'{col} distribution')

    for col in categorical_cols:
        value_counts = df[col].value_counts()
        fig, ax = plt.subplots(figsize=(5, 3))
        ax.bar(x=value_counts.index, height=value_counts.values/len(df))
        if col == 'Country':
            ax.tick_params(axis='x', rotation=90)
        ax.set_xlabel(col)
        ax.set_ylabel('Proportion')
        ax.set_title(f'{col} distribution')
        
    return None

eda_plots(df)

Judging by the feature distributions alone, the target shows a near-perfect split between positive and negative oral cancer diagnoses, while many of the symptoms and other risk factors show large imbalances. As some have commented, this is atypical for medical data, especially for cancer diagnosis. Typically, oral cancer has a prevalence of around 3% in high-risk settings.

The feature imbalances are more realistic than the distribution of the target variable. They may pose classification challenges, though these can possibly be mitigated through up-sampling or weighting. It is expected that the majority of people do not have symptoms like oral sores or difficulty swallowing. Despite being more realistic, the prevalence of HPV, tobacco use, and alcohol consumption does not seem to match global prevalence for the age group (based on some cursory googling). More insight into the data integrity would be valuable, but because this is an exercise in supervised learning, it is not my most pressing concern (though I would like some answers).

Patient age is approximately normally distributed and centered around 56 years. Because most of the patients are middle-aged or older, the model may not generalize well to younger patients, which limits its field applications. It also limits our insights, since risk factors may be age-related.

Most of the data is from patients in eastern countries like India, Pakistan, and Taiwan. African countries are the least represented in the dataset, and western countries fall in between.

In [None]:
def eda_plots_by_diagnosis(df, target_col):

    df = df.copy()

    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    if target_col in numeric_cols:
        numeric_cols.remove(target_col)
    
    categorical_cols = df.select_dtypes(exclude='number').columns.tolist()
    if target_col in categorical_cols:
        categorical_cols.remove(target_col)
    

    target_values = df[target_col].unique()
    
    for col in numeric_cols:
        fig, ax = plt.subplots(figsize=(12, 6))
        
        for target_value in target_values:
            subset = df[df[target_col] == target_value]
            ax.hist(subset[col], bins='auto', density=True, alpha=0.5, label=f'{target_value}')
        
        ax.set_xlabel(col)
        ax.set_ylabel('Density')
        ax.set_title(f'{col} by {target_col}')
        ax.legend(title='Cancer?')
        plt.tight_layout()
        plt.show()
    
    for col in categorical_cols:
        fig, ax = plt.subplots(figsize=(12, 6))
        categories = df[col].unique()
        x = np.arange(len(categories))
        w = 0.4
        
        for i, target_value in enumerate(target_values):
            subset = df[df[target_col] == target_value]
            # Count each category within this diagnosis group (0 if absent)
            counts = subset[col].value_counts().reindex(categories, fill_value=0)

            # Offset the bars so the diagnosis groups sit side by side
            ax.bar(x + (i-0.5)*w, counts.values/len(subset), w, label=f'{target_value}')

        # Axis labelling only needs to happen once per figure
        ax.set_title(f'{col} by {target_col}')
        ax.set_xlabel(col)
        ax.set_xticks(x)
        ax.set_ylabel('Proportion')
        ax.set_xticklabels(categories)
        ax.legend(title='Cancer?')

        if col == 'Country':
            ax.tick_params(axis='x', rotation=90)
        
        plt.tight_layout()
        plt.show()
    
    return None

eda_plots_by_diagnosis(df, target_col='Oral Cancer (Diagnosis)' )

Additional EDA shows that, when features are grouped by diagnosis, every feature shows a near 50/50 split regardless of cancer diagnosis. This suggests that the data is likely artificial (although this is not sufficiently disclosed) and lacks real-world prevalence rates for these features. What this means for modeling is that these features likely do not have much predictive power. It is unlikely that, of the people with red patches in their mouth, 50% have cancer, just as it is unlikely that 50% of people have cancer overall. Feature importance will therefore probably not align with real global correlations.
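
The "near 50/50 regardless of feature value" observation can also be checked numerically instead of by eye. The cell below is a rough sketch on synthetic data where the feature and the target are independent by construction: the column names `Red Patches` and `Diagnosis` and the 30/70 feature split are assumptions for illustration only. It prints the diagnosis rate within each feature level and runs a chi-square test of independence with `scipy.stats.chi2_contingency`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in: the feature is imbalanced, the target is a coin flip,
# and the two are generated independently of each other.
rng = np.random.default_rng(42)
n = 5000
toy = pd.DataFrame({
    "Red Patches": rng.choice(["Yes", "No"], size=n, p=[0.3, 0.7]),
    "Diagnosis": rng.choice(["Yes", "No"], size=n),
})

# Share of each diagnosis within every feature level (each row sums to 1).
rates = pd.crosstab(toy["Red Patches"], toy["Diagnosis"], normalize="index")
print(rates)

# Chi-square test of independence on the raw counts.
counts = pd.crosstab(toy["Red Patches"], toy["Diagnosis"])
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

If the same pattern appears on the real columns (rates near 0.5 in every row and no evidence against independence), that would support the suspicion that the features carry little signal about the diagnosis.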

## 4. Data Preprocessing

In [None]:
class Prep():
    """Label-encode object columns and min-max scale numeric columns."""
    def __init__(self):
        
        self.categorical_features = None
        self.numeric_features = None
        self.encoders = {}
        self.scaler = MinMaxScaler()
    
    def fit(self, df):

        self.categorical_features = df.select_dtypes(include=['object']).columns.tolist()
        self.numeric_features = df.select_dtypes(include=['number']).columns.tolist()
        
        for col in self.categorical_features:
            self.encoders[col] = LabelEncoder()
            self.encoders[col].fit(df[col].astype(str))
        
        if self.numeric_features:
            self.scaler.fit(df[self.numeric_features])
        
        return self
    
    def transform(self, df):

        df_transformed = df.copy()

        for col in self.categorical_features:
            if col in df_transformed.columns:
                df_transformed[col] = self.encoders[col].transform(df_transformed[col].astype(str))
        
        if self.numeric_features:
            numeric_cols = [col for col in self.numeric_features if col in df_transformed.columns]
            if numeric_cols:
                df_transformed[numeric_cols] = self.scaler.transform(df_transformed[numeric_cols])
        
        return df_transformed
    
    def fit_transform(self, df):

        self.fit(df)
        
        return self.transform(df)

In [None]:
prepo = Prep()
df = prepo.fit_transform(df)

In [None]:
df.head()
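
One thing to keep in mind before modeling: fitting the encoders and the scaler on the full dataframe lets information from future test rows influence the transformation. The cell below is a minimal sketch of a split-first alternative, not the approach used above. It runs on a synthetic frame with hypothetical columns (`Age`, `Tobacco Use`, `Diagnosis`) and swaps in scikit-learn's `OrdinalEncoder` for the per-column `LabelEncoder`, so that categories unseen during fitting map to -1 instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Synthetic stand-in for the real data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "Age": rng.normal(55, 10, size=n).round(),
    "Tobacco Use": rng.choice(["Yes", "No"], size=n),
    "Diagnosis": rng.choice(["Yes", "No"], size=n),
})

# Keep the target out of the feature preprocessing entirely.
X = toy.drop(columns="Diagnosis")
y = (toy["Diagnosis"] == "Yes").astype(int)

# Split first, so the test rows never influence the fitted transformers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Fit on the training split only.
scaler = MinMaxScaler().fit(X_train[num_cols])
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit(X_train[cat_cols])

def apply_prep(frame):
    """Apply the already-fitted scaler and encoder to a copy of `frame`."""
    out = frame.copy()
    out[num_cols] = scaler.transform(frame[num_cols])
    out[cat_cols] = encoder.transform(frame[cat_cols])
    return out

X_train_p = apply_prep(X_train)
X_test_p = apply_prep(X_test)
print(X_train_p.head())
```

With this layout, scaled test values can fall slightly outside [0, 1] when a test row is more extreme than anything in the training split. That is expected and is the price of keeping the test set unseen.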

## 5. Feature Selection

## 6. Model Building

## 7. Hyperparameter Optimization

## 8. Model Evaluation

## 9. Feature Importance Analysis

## 10. Results & Discussion

## 11. Model Deployment Example