# Predictive Modeling of Multiple Sclerosis Risk in CIS Patients

## Overview

This project builds predictive models to estimate the risk of conversion from Clinically Isolated Syndrome (CIS) to Multiple Sclerosis (MS) in Mexican mestizo patients. The dataset comes from a prospective cohort study conducted at the National Institute of Neurology and Neurosurgery (NINN) in Mexico City, Mexico, between 2006 and 2010.

## Objectives

- **Identify Predictive Factors**: Determine which symptoms and clinical features are the strongest predictors of conversion from CIS to MS.
- **Model Classification**: Classify patients into two groups based on whether they convert to clinically definite MS (CDMS) or not (non-CDMS).
- **Evaluate Performance**: Assess the performance of various predictive models and their effectiveness in predicting MS risk.

## Dataset

- **Source**: Pineda, Benjamin; Flores Rivera, Jose De Jesus (2023), “Conversion predictors of Clinically Isolated Syndrome to Multiple Sclerosis in Mexican patients: a prospective study.”, Mendeley Data, V1, doi: 10.17632/8wk5hjx7x2.1
- **License**: CC BY 4.0
- **Description**: The dataset contains patient information such as age, schooling, gender, symptoms, and MRI findings.

### Columns

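- **ID**: Patient identifier (not present as a named column in the CSV loaded later in this notebook, whose unnamed first column is dropped after loading)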
- **Age**: Age of the patient (in years)
- **Schooling**: Time spent in school (in years)
- **Gender**: 1=male, 2=female
- **Breastfeeding**: 1=yes, 2=no, 3=unknown
- **Varicella**: 1=positive, 2=negative, 3=unknown
- **Initial_Symptom**: Coded symptom categories (e.g., visual, sensory, motor)
- **Mono_or_Polysymptomatic**: 1=monosymptomatic, 2=polysymptomatic
- **Oligoclonal_Bands**: 0=negative, 1=positive
- **LLSSEP**: 0=negative, 1=positive
- **ULSSEP**: 0=negative, 1=positive
- **VEP**: 0=negative, 1=positive
- **BAEP**: 0=negative, 1=positive
- **Periventricular_MRI**: 0=negative, 1=positive
- **Cortical_MRI**: 0=negative, 1=positive
- **Infratentorial_MRI**: 0=negative, 1=positive
- **Spinal_Cord_MRI**: 0=negative, 1=positive
- **Initial_EDSS**: Expanded Disability Status Scale (EDSS) score at initial assessment
- **Final_EDSS**: Expanded Disability Status Scale (EDSS) score at final assessment
- **group**: 1=CDMS, 2=non-CDMS
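
The integer codes listed above can be mapped to readable labels before exploration. The following is a minimal sketch, not the project's own code: `CODEBOOK` and `decode_columns` are hypothetical names, the toy rows are invented for illustration, and column names are assumed to follow the CSV header shown later in the notebook (for example `group` in lowercase). The binary target (1 = CDMS, 0 = non-CDMS) is a common modeling convention, not something the source prescribes.

```python
import pandas as pd

# Codes copied from the column list above; CODEBOOK is a hypothetical name.
CODEBOOK = {
    "Gender": {1: "male", 2: "female"},
    "Breastfeeding": {1: "yes", 2: "no", 3: "unknown"},
    "Varicella": {1: "positive", 2: "negative", 3: "unknown"},
    "Mono_or_Polysymptomatic": {1: "monosymptomatic", 2: "polysymptomatic"},
    "group": {1: "CDMS", 2: "non-CDMS"},
}


def decode_columns(df: pd.DataFrame, codebook: dict) -> pd.DataFrame:
    """Return a copy of df with coded columns replaced by their labels."""
    decoded = df.copy()
    for column, mapping in codebook.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(mapping)
    return decoded


# Toy rows for illustration only (not real patient records).
toy = pd.DataFrame({"Gender": [1, 2, 2], "Varicella": [1, 3, 2], "group": [1, 2, 1]})
labeled = decode_columns(toy, CODEBOOK)

# Binary target for modeling: 1 = converted to CDMS, 0 = did not convert.
toy["target"] = (toy["group"] == 1).astype(int)
```

Keeping the original codes in the working frame and decoding only a copy avoids accidentally feeding label strings to a model.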


## A. Business Understanding

### A.1 Business Goal Declaration

The primary business goal of this project is to develop a robust predictive model to assess the risk of conversion from Clinically Isolated Syndrome (CIS) to Multiple Sclerosis (MS) among Mexican mestizo patients. By accurately identifying which patients are at higher risk of conversion, the model aims to:

1. **Enhance Early Intervention**: Enable healthcare providers to identify high-risk patients early, leading to timely and targeted interventions.
2. **Improve Patient Outcomes**: Help in personalizing treatment plans based on the risk of conversion, potentially improving long-term patient outcomes.
3. **Optimize Resource Allocation**: Allow healthcare facilities to allocate resources more effectively by focusing on patients with a higher likelihood of conversion.
4. **Contribute to Research**: Provide insights into the risk factors associated with CIS conversion to MS, supporting ongoing research in neurology and improving understanding of disease progression.



### A.2 Expected Value Framework

The expected value framework for this project includes:

1. **Clinical Impact**: 
   - **Early Detection**: Improved identification of patients at risk for MS can lead to earlier therapeutic interventions, which may slow disease progression and improve quality of life.
   - **Personalized Treatment**: Tailoring treatment plans based on risk assessment can enhance the effectiveness of interventions and reduce unnecessary treatments for low-risk patients.

2. **Economic Impact**:
   - **Cost Savings**: Early and accurate risk prediction can reduce the long-term costs associated with advanced MS treatments and hospitalizations.
   - **Resource Efficiency**: Optimized use of medical resources and better management of healthcare services can lead to cost savings for healthcare systems.

3. **Operational Impact**:
   - **Workflow Improvement**: Integration of predictive models into clinical workflows can streamline decision-making processes and reduce the burden on healthcare providers.
   - **Training and Development**: Enhanced training for medical staff on utilizing predictive models can improve overall clinical practices and outcomes.

4. **Research Impact**:
   - **Knowledge Advancement**: The project will contribute to the understanding of CIS to MS conversion risk factors, aiding further research and development in the field of neurology.



### A.3 Business Strategy Declaration

The strategy for achieving the business goals includes:

1. **Data Preparation and Analysis**:
   - **Data Cleaning and Preprocessing**: Ensure the dataset is clean and suitable for analysis, addressing any missing values and inconsistencies.
   - **Feature Engineering**: Develop relevant features from the raw data that will enhance the predictive power of the models.

2. **Model Development and Validation**:
   - **Model Selection**: Evaluate various predictive models (e.g., logistic regression, decision trees, random forests, and gradient boosting) to identify the most effective ones for predicting conversion risk.
   - **Model Training and Tuning**: Train models using historical patient data and fine-tune them to optimize performance metrics such as accuracy, sensitivity, and specificity.
   - **Validation and Testing**: Validate the models with cross-validation and evaluate them on a held-out test set that was not used for training or tuning, to assess how well they generalize.

3. **Implementation and Integration**:
   - **Deployment**: Develop a user-friendly interface for healthcare providers to access the predictive model and integrate it into existing clinical systems.
   - **Training and Support**: Provide training for medical staff on how to use the predictive tool effectively and offer ongoing support for any issues that may arise.

4. **Monitoring and Improvement**:
   - **Performance Monitoring**: Continuously monitor the performance of the predictive model and gather feedback from users.
   - **Model Updates**: Periodically update the model based on new data and evolving clinical practices to ensure continued accuracy and relevance.
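
The performance metrics named in the strategy (accuracy, sensitivity, specificity) can be computed directly from predicted and true labels. This is a small sketch under stated assumptions: `classification_metrics` is a hypothetical helper, the labels are made up, and the positive class is assumed to be encoded as 1 (CDMS).

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, and specificity for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def safe_div(numerator, denominator):
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": safe_div(tp, tp + fn),  # share of converters correctly flagged
        "specificity": safe_div(tn, tn + fp),  # share of non-converters correctly cleared
    }


# Made-up labels for illustration: 1 = CDMS, 0 = non-CDMS.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In a screening setting, sensitivity is often weighted more heavily than accuracy because a missed converter delays treatment; which trade-off fits this project is a clinical decision, not something the code settles.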


## B. Data Understanding

### B.1 Data Preprocessing

#### B.1.1 Collection of Raw Data

The raw data is taken from a copy of the dataset hosted on Kaggle. It can be downloaded manually with the steps below, or programmatically as in the notebook cells that follow:

1. **Access Kaggle Dataset**:
   - **Dataset URL**: [Conversion predictors of Clinically Isolated Syndrome to Multiple Sclerosis in Mexican patients](https://www.kaggle.com/datasets/desalegngeb/conversion-predictors-of-cis-to-multiple-sclerosis)
   - Ensure you have a Kaggle account. If not, create one and log in.

2. **Download the Dataset**:
   - Navigate to the dataset page on Kaggle.
   - Click on the "Download" button to download the dataset as a compressed file (usually in `.zip` format).

3. **Extract the Files**:
   - After downloading, extract the contents of the compressed file to a local directory on your computer.
   - The archive contains the raw data as a CSV file (`conversion_predictors_of_clinically_isolated_syndrome_to_multiple_sclerosis.csv`).
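
The extraction step can also be scripted with the standard library. The sketch below assumes the archive has already been downloaded manually; `extract_archive` is a hypothetical helper and the `data` target directory is an arbitrary choice.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="data"):
    """Extract a .zip archive into target_dir and return the CSV files found there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(target.rglob("*.csv"))


# Example call (file name as reported by the download step in this notebook):
# csv_files = extract_archive("conversion-predictors-of-cis-to-multiple-sclerosis.zip")
```

Returning the list of extracted CSV paths makes it easy to confirm that the expected file is present before reading it.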



In [4]:
# Install opendatasets once; uncomment the next line if the package is missing
# %pip install opendatasets --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
# Download the Kaggle dataset with opendatasets (prompts for Kaggle credentials and extracts the archive)

import opendatasets as od

od.download("https://www.kaggle.com/datasets/desalegngeb/conversion-predictors-of-cis-to-multiple-sclerosis/data")


Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:
Your Kaggle Key:
Dataset URL: https://www.kaggle.com/datasets/desalegngeb/conversion-predictors-of-cis-to-multiple-sclerosis
Downloading conversion-predictors-of-cis-to-multiple-sclerosis.zip to .\conversion-predictors-of-cis-to-multiple-sclerosis

100%|██████████| 3.02k/3.02k [00:00<00:00, 779kB/s]

In [2]:
# Rename the downloaded folder and the CSV file to shorter names.
# The check keeps the cell safe to re-run after the folder has been renamed.

import os

if not os.path.exists("data"):
    os.rename("conversion-predictors-of-cis-to-multiple-sclerosis", "data")
    os.rename("data/conversion_predictors_of_clinically_isolated_syndrome_to_multiple_sclerosis.csv", "data/data.csv")


In [6]:
# Read the data from the csv file

import pandas as pd

data = pd.read_csv("data/data.csv")

# Drop the first column (an unnamed row index)
data = data.drop(columns=["Unnamed: 0"])

data.head()

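| | Gender | Age | Schooling | Breastfeeding | Varicella | Initial_Symptom | Mono_or_Polysymptomatic | Oligoclonal_Bands | LLSSEP | ULSSEP | VEP | BAEP | Periventricular_MRI | Cortical_MRI | Infratentorial_MRI | Spinal_Cord_MRI | Initial_EDSS | Final_EDSS | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 34 | 20.0 | 1 | 1 | 2.0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1.0 | 1.0 | 1 |
| 1 | 1 | 61 | 25.0 | 3 | 2 | 10.0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 2.0 | 2.0 | 1 |
| 2 | 1 | 22 | 20.0 | 3 | 1 | 3.0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1.0 | 1.0 | 1 |
| 3 | 2 | 41 | 15.0 | 1 | 1 | 7.0 | 2 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1.0 | 1.0 | 1 |
| 4 | 2 | 34 | 20.0 | 2 | 1 | 6.0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 | 1.0 | 1 |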
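
A few quick sanity checks after loading help catch a wrong file or an unexpected header early. The sketch below is illustrative: it re-enters the five rows printed by `data.head()` as inline CSV so that it runs on its own, and the variable names (`HEAD_CSV`, `sample`) are hypothetical. On the real data, the same calls would be applied to `data`.

```python
import io

import pandas as pd

# The five rows shown by data.head(), without the dropped index column.
HEAD_CSV = """Gender,Age,Schooling,Breastfeeding,Varicella,Initial_Symptom,Mono_or_Polysymptomatic,Oligoclonal_Bands,LLSSEP,ULSSEP,VEP,BAEP,Periventricular_MRI,Cortical_MRI,Infratentorial_MRI,Spinal_Cord_MRI,Initial_EDSS,Final_EDSS,group
1,34,20.0,1,1,2.0,1,0,1,1,0,0,0,1,0,1,1.0,1.0,1
1,61,25.0,3,2,10.0,2,1,1,0,1,0,0,0,0,1,2.0,2.0,1
1,22,20.0,3,1,3.0,1,1,0,0,0,0,0,1,0,0,1.0,1.0,1
2,41,15.0,1,1,7.0,2,1,0,1,1,0,1,1,0,0,1.0,1.0,1
2,34,20.0,2,1,6.0,2,0,1,0,0,0,1,0,0,0,1.0,1.0,1
"""

sample = pd.read_csv(io.StringIO(HEAD_CSV))

n_rows, n_cols = sample.shape                    # number of rows and columns
missing_per_column = sample.isnull().sum()       # missing values per column
group_counts = sample["group"].value_counts()    # class balance of the target
```

All five preview rows belong to group 1, so the preview alone says nothing about class balance; `value_counts()` on the full frame is needed for that.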


#### B.1.2 Explanation of DataFrame Columns

- **ID**: A unique identifier assigned to each patient, so that each record can be tracked individually. The CSV loaded above has no column with this name; its unnamed first column is dropped after loading.

- **Age**: The age of the patient at the time of their initial assessment, measured in years. Age is a crucial factor in many medical conditions and can influence disease progression and risk.

- **Schooling**: The total number of years the patient has spent in formal education. This can be a proxy for socioeconomic status and cognitive reserve, which might impact disease outcomes.

- **Gender**: The gender of the patient, coded as 1 for male and 2 for female. Gender differences can influence the prevalence and progression of diseases like MS.

- **Breastfeeding**: Indicates whether the patient was breastfed, with values 1 for yes, 2 for no, and 3 for unknown. Breastfeeding history can have long-term health implications, including immune system development.

- **Varicella**: Indicates the patient's history of varicella (chickenpox), with values 1 for positive, 2 for negative, and 3 for unknown. Varicella infection has been studied for its potential links to MS.

- **Initial_Symptom**: Coded category of the symptoms experienced by the patient at onset, such as visual, sensory, or motor symptoms, or combinations of them. The type of initial symptoms can provide insights into disease progression.

- **Mono_or_Polysymptomatic**: Indicates whether the patient experienced one symptom (monosymptomatic) or multiple symptoms (polysymptomatic) at the initial presentation, with 1 for monosymptomatic and 2 for polysymptomatic. This can affect the likelihood of conversion to MS.

- **Oligoclonal_Bands**: Presence of oligoclonal bands in cerebrospinal fluid, with 0 for negative and 1 for positive. Oligoclonal bands are a diagnostic marker often associated with MS.

- **LLSSEP**: Results from the lower limb somatosensory evoked potentials test, with 0 for negative and 1 for positive. This test assesses the sensory pathways and can indicate nerve damage.

- **ULSSEP**: Results from the upper limb somatosensory evoked potentials test, with 0 for negative and 1 for positive. Similar to LLSSEP, but focused on the upper limbs.

- **VEP**: Visual evoked potentials test results, with 0 for negative and 1 for positive. This test evaluates the visual pathways and can detect abnormalities related to MS.

- **BAEP**: Brainstem auditory evoked potentials test results, with 0 for negative and 1 for positive. This assesses auditory pathways and can help identify neurological dysfunction.

- **Periventricular_MRI**: MRI findings indicating the presence of lesions in the periventricular regions of the brain, with 0 for negative and 1 for positive. Periventricular lesions are often associated with MS.

- **Cortical_MRI**: MRI findings indicating the presence of lesions in the cortical areas of the brain, with 0 for negative and 1 for positive. Cortical lesions can be relevant in MS diagnosis and progression.

- **Infratentorial_MRI**: MRI findings indicating the presence of lesions in the infratentorial regions (e.g., brainstem and cerebellum), with 0 for negative and 1 for positive. Lesions in these areas can be indicative of MS.

- **Spinal_Cord_MRI**: MRI findings indicating the presence of lesions in the spinal cord, with 0 for negative and 1 for positive. Spinal cord lesions are significant for MS diagnosis and can affect mobility.

- **Initial_EDSS**: Expanded Disability Status Scale (EDSS) score at the initial assessment, measuring the patient's level of disability. This scale helps quantify disability and track changes over time.

- **Final_EDSS**: EDSS score at the final assessment. This indicates how the patient's disability progressed or changed over the study period.

- **group**: Classification of patients into two groups: 1 for those who converted to Clinically Definite Multiple Sclerosis (CDMS) and 2 for those who did not convert (non-CDMS). This column is the target variable for classification.

Each column in the dataset provides important information that can be used to understand the factors influencing the progression from CIS to MS and to develop predictive models for risk assessment.
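
Because most columns store category codes, it is worth checking that every value falls within the documented code set. The following is a minimal sketch: `ALLOWED_CODES` and `find_invalid_codes` are hypothetical names, the allowed sets are copied from the descriptions above, and the toy frame deliberately contains one invalid value.

```python
import pandas as pd

# Allowed codes per column, as documented above (ALLOWED_CODES is a hypothetical name).
BINARY = {0, 1}
ALLOWED_CODES = {
    "Gender": {1, 2},
    "Breastfeeding": {1, 2, 3},
    "Varicella": {1, 2, 3},
    "Mono_or_Polysymptomatic": {1, 2},
    "Oligoclonal_Bands": BINARY,
    "LLSSEP": BINARY,
    "ULSSEP": BINARY,
    "VEP": BINARY,
    "BAEP": BINARY,
    "Periventricular_MRI": BINARY,
    "Cortical_MRI": BINARY,
    "Infratentorial_MRI": BINARY,
    "Spinal_Cord_MRI": BINARY,
    "group": {1, 2},
}


def find_invalid_codes(df: pd.DataFrame, allowed: dict) -> dict:
    """Map each checked column to its sorted unexpected non-missing values."""
    problems = {}
    for column, codes in allowed.items():
        if column not in df.columns:
            continue
        observed = set(df[column].dropna().tolist())
        unexpected = observed - codes
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems


# Toy frame with one out-of-range Gender value, for illustration only.
toy = pd.DataFrame({"Gender": [1, 2, 3], "VEP": [0, 1, 1]})
issues = find_invalid_codes(toy, ALLOWED_CODES)
```

An empty result means every checked column contains only documented codes; missing values are ignored here and handled separately.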

### B.2 Variable type and data structure consistency

#### B.2.1 Demystifying Variable Types (Numerical/Categorical)


In [13]:
def MissingUniqueStatistics(df):
    """Summarize each column: missing values, dtype, unique values, and descriptive statistics."""
    import time

    import numpy as np
    import pandas as pd

    print("MissingUniqueStatistics process has begun:\n")
    start_time = time.time()

    # describe() covers numeric columns only; other columns get NaN statistics below
    df_statistics = df.describe()
    stat_labels = {"Mean": "mean", "STD": "std", "Min": "min",
                   "Q1": "25%", "Q2": "50%", "Q3": "75%", "Max": "max"}

    rows = []
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        row = {
            "Variable": col,
            "#_Total_Entry": len(df[col]),
            "#_Missing_Value": n_missing,
            "%_Missing_Value": (n_missing / len(df[col])) * 100,
            "Data_Type": df[col].dtype,
            "Unique_Values": df[col].unique(),
            "#_Unique_Values": len(df[col].unique()),  # NaN counts as a value here
        }
        for name, label in stat_labels.items():
            # Label-based lookup replaces positional indexing such as [1],
            # which is deprecated on a Series with a labeled index
            try:
                row[name] = df_statistics.loc[label, col]
            except KeyError:
                row[name] = np.nan
        rows.append(row)

    data_info_df = pd.DataFrame(rows).set_index("Variable")

    print("MissingUniqueStatistics process has been completed!")
    print("--- in %s minutes ---" % ((time.time() - start_time) / 60))

    return data_info_df.sort_values(by="%_Missing_Value", ascending=False)

In [14]:
data_info = MissingUniqueStatistics(data)
data_info

MissingUniqueStatistics process has begun:

MissingUniqueStatistics process has been completed!
--- in 0.0008666356404622396 minutes ---


| Variable | #_Total_Entry | #_Missing_Value | %_Missing_Value | Data_Type | Unique_Values | #_Unique_Values | Mean | STD | Min | Q1 | Q2 | Q3 | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Final_EDSS | 273 | 148 | 54.212454 | float64 | [1.0, 2.0, 3.0, nan] | 4 | 1.448 | 0.65323 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Initial_EDSS | 273 | 148 | 54.212454 | float64 | [1.0, 2.0, 3.0, nan] | 4 | 1.36 | 0.587504 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Schooling | 273 | 1 | 0.3663 | float64 | [20.0, 25.0, 15.0, 22.0, 12.0, 0.0, 9.0, 14.0,... | 13 | 15.176471 | 4.244175 | 0.0 | 12.0 | 15.0 | 20.0 | 25.0 |
| Initial_Symptom | 273 | 1 | 0.3663 | float64 | [2.0, 10.0, 3.0, 7.0, 6.0, 14.0, 8.0, 15.0, 5.... | 16 | 6.430147 | 4.222009 | 1.0 | 3.0 | 6.0 | 9.0 | 15.0 |
| Gender | 273 | 0 | 0.0 | int64 | [1, 2] | 2 | 1.615385 | 0.487398 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| BAEP | 273 | 0 | 0.0 | int64 | [0, 1] | 2 | 0.065934 | 0.248623 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Spinal_Cord_MRI | 273 | 0 | 0.0 | int64 | [1, 0] | 2 | 0.315018 | 0.465376 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Infratentorial_MRI | 273 | 0 | 0.0 | int64 | [0, 1] | 2 | 0.29304 | 0.455993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Cortical_MRI | 273 | 0 | 0.0 | int64 | [1, 0] | 2 | 0.432234 | 0.496296 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Periventricular_MRI | 273 | 0 | 0.0 | int64 | [0, 1] | 2 | 0.505495 | 0.500888 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |

#### B.2.2 Data Structure Control (Float String)

In [17]:
# Create a data dictionary for the data:
# variable_name, variable_definition, variable_structure

# Definitions are keyed by column name so they cannot drift out of step with the column order
variable_definition = {
    "Gender": "Indicates the gender of the patient",
    "Age": "Age of the patient (in years)",
    "Schooling": "Time spent in school (in years)",
    "Breastfeeding": "Indicates the patient's breastfeeding status",
    "Varicella": "Indicates the patient's Varicella(chickenpox) status",
    "Initial_Symptom": "Categories of initial symptoms (e.g., visual, sensory, motor)",
    "Mono_or_Polysymptomatic": "Indicates whether the patient is monosymptomatic or polysymptomatic",
    "Oligoclonal_Bands": "Indicates the patient's Oligoclonal Bands status",
    "LLSSEP": "Indicates the patient's lower limb somatosensory evoked potential status",
    "ULSSEP": "Indicates the patient's upper limb somatosensory evoked potential status",
    "VEP": "Indicates the patient's visual evoked potential status",
    "BAEP": "Indicates the patient's brainstem auditory evoked potential status",
    "Periventricular_MRI": "Indicates the patient's periventricular MRI status",
    "Cortical_MRI": "Indicates the patient's cortical MRI status",
    "Infratentorial_MRI": "Indicates the patient's infratentorial MRI status",
    "Spinal_Cord_MRI": "Indicates the patient's spinal cord MRI status",
    "Initial_EDSS": "Expanded Disability Status Scale at initial assessment",
    "Final_EDSS": "Expanded Disability Status Scale at final assessment",
    "group": "Indicates the group of the patient",
}


def determine_variable_structure(column):
    """Determine the structure of a variable based on its content.

    Heuristic only: numeric columns with fewer than 10 distinct values are
    labeled "Cardinal", so coded categories with 10 or more levels
    (e.g. Initial_Symptom) come out as "Continuous-Ratio" and need manual review.
    """
    if pd.api.types.is_numeric_dtype(column):
        return "Cardinal" if column.nunique() < 10 else "Continuous-Ratio"
    # isinstance check replaces the deprecated pd.api.types.is_categorical_dtype
    if pd.api.types.is_string_dtype(column) or isinstance(column.dtype, pd.CategoricalDtype):
        return "Nominal"
    return "Unknown"


def create_data_dictionary(df):
    """Create a data dictionary for the DataFrame."""
    data_dict = {
        'variable_name': df.columns,
        'variable_definition': [variable_definition[col] for col in df.columns],
        'variable_structure': [determine_variable_structure(df[col]) for col in df.columns]
    }
    return pd.DataFrame(data_dict)


data_dictionary = create_data_dictionary(data)


In [18]:
data_dictionary

| | variable_name | variable_definition | variable_structure |
|---|---|---|---|
| 0 | Gender | Indicates the gender of the patient | Cardinal |
| 1 | Age | Age of the patient (in years) | Continuous-Ratio |
| 2 | Schooling | Time spent in school (in years) | Continuous-Ratio |
| 3 | Breastfeeding | Indicates the patient's breastfeeding status | Cardinal |
| 4 | Varicella | Indicates the patient's Varicella(chickenpox) ... | Cardinal |
| 5 | Initial_Symptom | Categories of initial symptoms (e.g., visual, ... | Continuous-Ratio |
| 6 | Mono_or_Polysymptomatic | Indicates whether the patient is monosymptomat... | Cardinal |
| 7 | Oligoclonal_Bands | Indicates the patient's Oligoclonal Bands status | Cardinal |
| 8 | LLSSEP | Indicates the patient's lower limb somatosenso... | Cardinal |
| 9 | ULSSEP | Indicates the patient's upper limb somatosenso... | Cardinal |
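
The data dictionary labels `Initial_Symptom` as Continuous-Ratio because the heuristic looks only at the number of distinct values, even though the column stores category codes. One way to handle such cases is an explicit per-column override with the heuristic as a fallback. The sketch below is an assumption-laden illustration: `STRUCTURE_OVERRIDES` and `variable_structure` are hypothetical names, and few-valued numeric columns are labeled "Nominal" here instead of the "Cardinal" label used above.

```python
import pandas as pd

# Hypothetical overrides: columns whose numbers are category codes, not quantities.
STRUCTURE_OVERRIDES = {
    "Initial_Symptom": "Nominal",
    "Gender": "Nominal",
    "group": "Nominal",
}


def variable_structure(column: pd.Series, overrides=None, max_levels=10):
    """Return the measurement level, preferring explicit overrides over the heuristic."""
    overrides = overrides or {}
    if column.name in overrides:
        return overrides[column.name]
    if pd.api.types.is_numeric_dtype(column):
        # Fallback heuristic: many distinct numeric values suggest a continuous variable.
        return "Continuous-Ratio" if column.nunique() >= max_levels else "Nominal"
    return "Nominal"


# Toy frame for illustration only (values are made up).
toy = pd.DataFrame({
    "Age": range(20, 40),
    "Initial_Symptom": list(range(1, 16)) + [1, 2, 3, 4, 5],
    "VEP": [0, 1] * 10,
})
structures = {name: variable_structure(toy[name], STRUCTURE_OVERRIDES) for name in toy.columns}
```

Recording the measurement level explicitly matters later, for example when deciding which columns to one-hot encode and which to scale.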
