# Autumn 2024 ADSP 31009 ON1 Machine Learning & Predictive Analytics Final Project

## 1. Problem Definition (5 points): Clearly articulate the real-world problem the project addresses and define a measurable objective.

### The real-world problem I am addressing is building a machine learning model that performs multi-class classification of the thyroid conditions that patients present with.

### The measurable objective for this project is to apply the knowledge gained throughout the Machine Learning & Predictive Analytics course to a real-world problem. Additional objectives include the following:
1. Gain a better understanding of the data and the domain the dataset originates from
2. Gain more hands-on experience in Machine Learning Engineering (MLE) by building models that can solve real-world problems
3. Understand the ML workflow well enough to create value-added benefits for companies
4. Improve predictive accuracy over existing methods employed in the field
5. Identify the factors that are relevant to identifying thyroid problems in patients
6. Predict the class that each record belongs to based on the attributes in the dataset
7. Encourage further discussion and research on plausible methods for the domain

## 2. Assumptions & Hypotheses (5 points): Explicitly state assumptions about the data and chosen model(s), identifying potential limitations.

### Here are the assumptions that have to be made regarding the data:
1. The data collection process was carried out without bias. For example, people who have thyroid problems were not explicitly omitted, and people who do not have thyroid problems were not selectively included. This assumption may not fully hold: survivorship bias exists in this dataset, since an entry was presumably created only when particular criteria were met, such as being a patient at a hospital or having symptoms of a throat problem.
2. The dataset is a fair representation of the population. This assumption is also easily broken, since people who do not have thyroid problems are unlikely to have an entry in the dataset, which indicates sampling bias.
3. There were no coding errors during the data collection process (for example, no typing errors in the numbers).
4. A potential limitation of this project is the amount of time, energy, and direction available for exploring the variables. There may be additional higher-dimensional cross-column variables that could be explored.

### The data comes from a reputable source, the UCI Machine Learning Repository. Given that it was published for academic and research purposes, I assume that I do not need to worry much about the data having been intentionally fabricated.

## 3. Data Exploration (5 points): Conduct in-depth exploratory data analysis, visualizing key relationships and identifying potential quality issues.

### Please see the associated ThyroidDisease.ipynb for all the exploratory data analysis and visualization performed. Regarding potential quality issues, this dataset is relatively clean, with few missing values. However, some records contain impossible values, which are identified in the analysis. Besides these impossible values, which are likely entry errors (or possibly fabricated values), some data is missing at random (MAR). This can be seen because each measured metric column has an associated {metric_name}_measured column that records whether the person had that measurement performed. If the measurement was not performed, it makes sense that the value in the metric column is `?`, since no value was obtained.
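The check described above can be sketched as follows. This is a minimal illustration and not the notebook's code: the column names `TSH` and `TSH_measured`, the `t`/`f` flag values, the toy rows, and the age threshold of 120 are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows; column names and flag values are assumptions.
df = pd.DataFrame({
    "TSH_measured": ["t", "f", "t"],
    "TSH": ["1.3", "?", "4.1"],
    "age": [41, 455, 23],
})

# Convert the metric to numeric; the "?" placeholder becomes NaN.
df["TSH"] = pd.to_numeric(df["TSH"], errors="coerce")

# A value should be missing exactly when the measurement was not performed.
measured = df["TSH_measured"].eq("t")
consistent = bool((df["TSH"].isna() == ~measured).all())

# Flag implausible ages (the cutoff of 120 is an assumption).
impossible_age = df["age"] > 120

print("missingness matches the *_measured flag:", consistent)
print(df[impossible_age])
```

If `consistent` is true for every metric column, the `?` entries are explained entirely by the corresponding flag, which supports treating them as structured missingness.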

## 4. Feature Engineering (5 points): Justify feature selection, creation, and any necessary transformations to enhance model input.

For the dataset, I performed the following feature engineering and transformation steps in the hope of improving model performance:
1. Performed kNN imputation on each metric measure, using the True/False value of {metric_name}_measured to identify which values were missing
2. One-hot encoded categorical variables such as sex and referral_source
3. Created an additional variable called log_TSH_imputed, since TSH_imputed is heavily skewed with a long right tail, which a log transform compresses
4. Applied standard scaling to the features in the dataset
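The four steps above can be sketched with scikit-learn and pandas. This is an illustrative sketch, not the notebook's implementation: the toy values, `n_neighbors=2`, and the use of `np.log1p` (chosen here so that a zero value would not break the transform) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; values and category labels are assumptions.
df = pd.DataFrame({
    "age": [25.0, 40.0, 61.0, 33.0, 58.0],
    "TSH": [1.2, np.nan, 30.0, 0.8, np.nan],
    "sex": ["F", "M", "F", "F", "M"],
    "referral_source": ["SVI", "other", "SVHC", "other", "SVI"],
})

# 1. kNN imputation of the missing metric values.
imputer = KNNImputer(n_neighbors=2)
numeric = pd.DataFrame(
    imputer.fit_transform(df[["age", "TSH"]]),
    columns=["age", "TSH_imputed"],
)

# 2. One-hot encode the categorical variables.
encoded = pd.get_dummies(df[["sex", "referral_source"]], dtype=float)

# 3. Log-transform the skewed TSH values (log1p is an assumption).
numeric["log_TSH_imputed"] = np.log1p(numeric["TSH_imputed"])

# 4. Standard-scale all features to zero mean and unit variance.
features = pd.concat([numeric, encoded], axis=1)
X = StandardScaler().fit_transform(features)
print(X.shape)
```

In a real workflow the imputer and scaler would be fit on the training split only and then applied to the test split, to avoid leaking test information into the model.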

## 5. Modeling Approach (5 points): Explain the rationale for model selection and demonstrate techniques to mitigate overfitting/underfitting.

### The modeling approaches that I use for this project are as follows:
1. Support vector classification (SVC), since this is a classification problem
    - with different kernels, to experiment with how the model performs
2. Tree-based classifier algorithms such as the following
    - DecisionTreeClassifier
    - RandomForestClassifier
    - AdaBoostClassifier
    - GradientBoostingClassifier
        - These were chosen for their ease of implementation and explainability, which are essential in a hospital setting, as doctors may want to explain what is going on so that the patient can consider different treatment options
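A comparison of the candidate models listed above might look like the following sketch. The synthetic data from `make_classification`, the small `n_estimators` values, and the macro-F1 scoring are assumptions chosen to keep the example short; they are not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the thyroid features (an assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "SVC (rbf)": SVC(kernel="rbf"),
    "SVC (linear)": SVC(kernel="linear"),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated macro-F1 for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="f1_macro").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Macro-averaged F1 is used here because it weights every class equally, which matters when one class dominates the data.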

### The techniques employed to mitigate overfitting/underfitting are the following:
1. Perform RandomizedSearchCV to find well-performing hyperparameters for a particular model, using cross-validation to score each candidate
2. Perform GridSearchCV to find the optimal hyperparameters within a specified grid for a particular model (this is more time consuming, since every combination is evaluated)
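As a rough sketch of the randomized search, assuming a random forest and a synthetic imbalanced dataset (the parameter ranges, `n_iter=5`, and the macro-F1 scoring are illustrative choices, not the notebook's settings):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced three-class data (an assumption).
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.8, 0.15, 0.05],
    random_state=0,
)

# Distributions to sample hyperparameters from; depth and leaf size
# limit model complexity and so act against overfitting.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Because every candidate is scored on held-out folds, a configuration that merely memorizes the training data does not win the search. `GridSearchCV` takes a `param_grid` of explicit value lists instead of distributions and evaluates all combinations.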

## 6. Model Justification (5 points): Provide a clear justification for the final model choice, potentially including the use of regularization techniques.

### The model that I chose as the final one is AdaBoost
- AdaBoost quick review:
    - Models are trained sequentially, each one correcting the errors made by the previous models
    - Each model is trained on a reweighted version of the training data, with higher weights assigned to misclassified samples
- Reasons why it was chosen:
    - It is able to identify the alternative classes rather than concentrating only on the negative label
    - Its emphasis on difficult-to-classify samples helps improve overall model performance
    - It can be applied to both classification and regression problems
    - It is a versatile and powerful tool
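The reweighting idea in the review above can be illustrated with a single boosting round. This sketch uses the classic binary AdaBoost update on a hand-made six-point dataset, which is a simplification: the project itself is multi-class, and scikit-learn's `AdaBoostClassifier` uses a multi-class variant (SAMME) of this rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hand-made toy data (an assumption): no single threshold separates it.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])

# Start with uniform sample weights.
w = np.full(len(y), 1 / len(y))

# Weak learner: a depth-1 decision tree (a "stump").
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=w)
miss = stump.predict(X) != y

# Weighted error and the learner's vote strength (alpha).
err = w[miss].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Raise the weights of misclassified samples, lower the rest, renormalize.
w_new = w * np.exp(np.where(miss, alpha, -alpha))
w_new /= w_new.sum()

print("weighted error:", err)
print("updated weights:", np.round(w_new, 3))
```

After the update, the misclassified samples together carry half of the total weight, so the next weak learner is pushed to get them right.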

## 7. Results & Insights (5 points): Accurately report performance metrics and extract meaningful conclusions from the modeling process.

### The performance metrics applicable to this use case are accuracy, the confusion matrix, precision & recall, and AUC-ROC.
- Each metric contributes the following when evaluating the final model:
    - Accuracy
        - gives the overall share of correct predictions, but it can look deceptively high when the classes are imbalanced, so it is read alongside the other metrics
    - Confusion matrix
        - shows per-class counts of correct and incorrect predictions, which exposes the class imbalance and which classes get confused with each other
    - Precision & Recall
        - show how well each class, especially the minority classes, is identified despite the class imbalance
    - AUC-ROC
        - shows how well the model separates each class from the rest, and therefore how the model performs for each of the classes under class imbalance
- The plots and values for each of these performance metrics can be seen in the attached ThyroidDisease.ipynb
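The following sketch shows how these metrics can be computed with scikit-learn. The labels and predicted probabilities are made up for illustration and are not the project's results; the actual values are in the notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    recall_score,
    roc_auc_score,
)

# Made-up imbalanced labels: class 0 dominates (an assumption).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

# Made-up predicted class probabilities; each row sums to 1.
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.9, 0.05, 0.05],
    [0.6, 0.3, 0.1],
    [0.7, 0.1, 0.2],
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.2, 0.3],
    [0.1, 0.2, 0.7],
])
y_pred = y_proba.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
recall_per_class = recall_score(y_true, y_pred, average=None)
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")

print("accuracy:", accuracy)
print("confusion matrix:\n", cm)
print("recall per class:", recall_per_class)
print("macro one-vs-rest AUC:", auc)
```

In this toy case the accuracy is 0.8, yet recall for each minority class is only 0.5, which is why accuracy alone is not a sufficient summary on imbalanced data.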

### Some of the conclusions that can be drawn from the modeling process are the following:
- A relatively long process is needed to fully understand what the data is about and what real-world problem a model built from the data can help resolve.
- The modeling process also needs continuous iteration and monitoring, since coding errors may have occurred and there may be issues that need additional consideration.
- Identifying thyroid problems in patients is an inherently hard task with structured data alone.
- Even with a well-defined classification problem, there are data issues that need to be examined thoroughly.
- Class imbalance is quite often the case in medical settings, as seen in this particular dataset.
- Both underfitting and overfitting were observed in this problem across the various modeling tools and techniques.

## 8. Future Directions (5 points): Outline specific recommendations for model improvement and further research avenues.

### Recommendations for model improvement include the following:
1. Create additional feature variables based on the existing variables in the dataset. This dataset has only 29 attributes.
2. Collect more diverse data for the classes that are underrepresented
3. Create new training data from the testing dataset for classes that do not appear in the training dataset
4. Perform out-of-bag evaluation on the random forest model
5. Create additional plots of the overall ML process to understand how each step and component contributes to changes in the model
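The out-of-bag evaluation recommended above could be sketched as follows. The synthetic data and `n_estimators=100` are assumptions; `oob_score=True` asks scikit-learn to score each sample using only the trees that did not see it during bootstrap sampling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the thyroid features (an assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,   # OOB estimates require bootstrap sampling
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)

# Accuracy estimated from out-of-bag samples only.
oob_accuracy = forest.oob_score_
print("OOB accuracy:", round(oob_accuracy, 3))
```

This gives a validation-style estimate without holding out a separate split, which is useful when the minority classes have few rows to spare.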

### Further research avenues include the following:
1. Combine datasets from other sources to build a more coherent understanding of how some of the columns were created
2. Examine further how referral_source is defined. This is not clear from the data dictionary or the background information. With more information on how referral_source is defined, analysis could be performed to understand whether a particular referral_source indicates urgency
3. Explore incorporating images of the thyroid to obtain a more diverse dataset, and perform computer vision tasks on these images to better identify thyroid problems, since it is perhaps a routine task for a doctor to capture images of the patient during examination to see whether there may be problems
4. Combine the results from this model with another machine learning model to inspect the thyroid problems that people have. This echoes the ideas of ensemble methods, stacked ensemble models, and pipelines in which different models contribute
5. Create more advanced models based on artificial neural networks that can be applied to the different folders in the dataset
6. Apply the Synthetic Minority Oversampling Technique (SMOTE) to this dataset. The analysis shows that a very large share of rows are negative for thyroid problems, which may overpower the other classes and leave the model unable to learn the features specific to patients who do have thyroid problems
7. Create plots of model internals to derive action items that can be implemented, integrated, and executed by hospitals/doctors
8. Examine model deployment time and deployment options, since it may be essential for a model that determines whether a patient has a thyroid problem to run quickly instead of taking hours
9. Examine the interpretability of the models. The models built in this project are reasonably explainable using the attributes in the dataset, but as model building ventures into neural network territory, explainability becomes a problem and may pose issues in hospital settings
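To make the SMOTE idea from item 6 concrete, here is a simplified sketch of its core step: creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote_like_oversample`, the toy points, and `k=3` are hypothetical; this is not the imbalanced-learn implementation, which would be the more complete option in practice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest one
    # (assumes no duplicate rows).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)
    # Pick a random base sample and one of its k true neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]
    # Place the synthetic point somewhere on the segment between them.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical minority-class feature rows.
X_minority = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
)
X_synthetic = smote_like_oversample(X_minority, n_new=20)
print(X_synthetic.shape)
```

Oversampling of this kind should be applied to the training split only, so that synthetic rows never leak into the evaluation data.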