## Instructions {-}

1. Please answer the following questions as part of your project proposal.

2. Write your answers in the *Markdown* cells of the Jupyter notebook. You don't need to write any code, but if you want to, you may use the *Code* cells.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The project proposal is worth 8 points, and is due on **18th April 2023 at 11:59 pm**. 

5. You must make one submission as a group, and not individually.

6. Maintaining a GitHub repository is optional, though encouraged for the project.

7. Share the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0) (optional).

# 1) Team name
Mention your team name.

*(0 points)*

Betabots

# 2) Member names
Mention the names of your team members.

*(0 points)*

Junho Park, Alanda Zong, Luke Lilenthial

# 3) Link to the GitHub repository (optional)
Share the link of the team's project repository on GitHub.

Also, put the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0).

We believe there is no harm in having other teams view your GitHub repository. However, if you don't want anyone to see your team's work, you may make the repository *Private* and add your instructor and graduate TA as *Collaborators* in it.

*(0 points)*

https://github.com/alandaz/Springrepo

# 4) Topic
Mention the topic of your course project.

*(0 points)*

We are using cancer data to try to predict whether a tumor is benign or malignant.

# 5) Problem statement

*(4 points)*

Explain the problem statement. The problem statement must include:

## 5a) The problem

Tumor types can be difficult to classify without complex and invasive tests. We want to accurately classify tumors as malignant or benign from surface-level medical observations such as length, mass, and texture, while keeping the rate of false negatives low.

## 5b) Type of response
Is it about predicting a continuous response or a binary response or a combination of both?

We will be predicting a binary response: malignant (1) or benign (0).
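
A minimal sketch of this encoding, assuming the `diagnosis` column holds the strings `M` and `B` as in the Kaggle file. The helper name `encode_diagnosis` is hypothetical and only illustrates the mapping.

```python
# Hypothetical helper: map diagnosis labels to the binary response.
LABEL_TO_RESPONSE = {"M": 1, "B": 0}  # malignant -> 1, benign -> 0


def encode_diagnosis(labels):
    """Return a list of 0/1 responses; raise on any unexpected label."""
    encoded = []
    for label in labels:
        if label not in LABEL_TO_RESPONSE:
            raise ValueError(f"unexpected diagnosis label: {label!r}")
        encoded.append(LABEL_TO_RESPONSE[label])
    return encoded


print(encode_diagnosis(["M", "B", "B", "M"]))  # [1, 0, 0, 1]
```

On the pandas DataFrame itself, `cancer["diagnosis"].map({"M": 1, "B": 0})` should give the same encoding.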

## 5c) Performance metric
How will you assess model accuracy?

  - If it is a classification problem, then which measure(s) will you optimize for your model – precision, recall, false negative rate (FNR), accuracy, ROC-AUC etc., and why?
  - If it is a regression problem, then which measure(s) will you optimize for your model – RMSE (Root mean squared error), MAE (mean absolute error), maximum absolute error etc., and why?

We will optimize our model for recall and accuracy. Recall is the most important because false negatives (type II errors) are the most costly mistakes here: we do not want our model to miss a malignant tumor in a patient who has one. With recall held at a high level, we will then consider accuracy when tuning the model's classification threshold, to see how closely the model can classify tumor types overall.
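
The threshold tuning described above can be sketched as follows. This is only an illustration under stated assumptions: the labels and predicted probabilities are made up, and the function names (`confusion_counts`, `recall_and_accuracy`, `pick_threshold`) are hypothetical rather than part of any library.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicting 1 for probabilities >= threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1  # malignant tumor predicted as benign
    return tp, fp, tn, fn


def recall_and_accuracy(y_true, y_prob, threshold):
    """Return (recall, accuracy) at the given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return recall, accuracy


def pick_threshold(y_true, y_prob, min_recall, candidates):
    """Among candidates meeting min_recall, return (threshold, accuracy, recall)
    for the most accurate one, or None if no candidate qualifies."""
    best = None
    for t in candidates:
        recall, accuracy = recall_and_accuracy(y_true, y_prob, t)
        if recall >= min_recall and (best is None or accuracy > best[1]):
            best = (t, accuracy, recall)
    return best


# Made-up labels (1 = malignant) and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.8]
candidates = [0.3, 0.4, 0.5, 0.6]

print(pick_threshold(y_true, y_prob, min_recall=1.0, candidates=candidates))
```

Requiring perfect recall on this toy data forces the lowest threshold and costs some accuracy, which is the trade-off we expect to manage on the real data.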

## 5d) Naive model accuracy
What is the accuracy of the naive model (standard deviation of the response in case of a continuous response / proportion of the majority class in case of a classification model)?

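```python
import pandas as pd
import numpy as np

# Load the Kaggle cancer dataset
cancer = pd.read_csv("Cancer_Data.csv")

# Proportion of observations in each diagnosis class
cancer.diagnosis.value_counts() / cancer.shape[0]
```

```
B    0.627417
M    0.372583
Name: diagnosis, dtype: float64
```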

Naive model accuracy (always predicting the majority class, benign): 0.627
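
As a rough illustration of why accuracy alone is not enough for this problem, the sketch below computes the accuracy and recall of a majority-class model. The function name `majority_baseline` is hypothetical, and the labels are toy values, not the real dataset.

```python
from collections import Counter


def majority_baseline(labels, positive="M"):
    """Return (majority label, accuracy, recall on `positive`) for the naive model."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    accuracy = majority_count / len(labels)
    # The naive model predicts the majority label for every observation, so it
    # recovers either all of the positives or none of them.
    recall = 1.0 if majority_label == positive else 0.0
    return majority_label, accuracy, recall


# Toy labels only (5 benign, 3 malignant), not the real dataset.
toy_labels = ["B"] * 5 + ["M"] * 3
print(majority_baseline(toy_labels))  # ('B', 0.625, 0.0)
```

A model that always predicts benign reaches the naive accuracy but has zero recall on malignant tumors, which is why recall is our primary metric.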

# 6) Data

## 6a) Source
What data sources will you use, and how will the data help solve the problem? Explain.
If the data is open source, share the link of the data.

*(0.5 point)*

The data comes from Kaggle: https://www.kaggle.com/datasets/erdemtaha/cancer-data. It will help us solve the problem because, by analyzing the physical attributes of tumors, we can find which predictors are strongly associated with a tumor being benign or malignant. This can help develop better diagnoses and treatments.

## 6b) Response & predictors
What is the response, and mention some of the predictors.

*(0.5 point)*

The response variable would be the diagnosis column, and some of the predictors would be measurements of the tumor. A few examples include: radius, texture, perimeter, area, smoothness, compactness, concavity, and symmetry.

## 6c) Size
What is the number of continuous predictors, categorical predictors, and observations in your dataset(s). If you are using multiple datasets, please provide the information for each dataset. When counting predictors, count only those that have sufficient non-missing values, and will be useful.

*(1 point)*

```python
# Drop the completely empty "Unnamed" column
unnamed_cols = cancer.columns[cancer.columns.str.contains('unnamed', case=False)]
cancer = cancer.drop(unnamed_cols, axis=1)
print(cancer.head())

# Rows and columns that still contain missing values
rows = cancer[cancer.isnull().any(axis=1)]
cols = cancer.columns[cancer.isnull().any()]

if len(rows) == 0:
    print('no more N/As')

if len(cols) == 0:
    print('no more N/As')

print(cancer.shape)
```

```
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   ...  radius_worst  texture_worst  perimeter_worst  area_wor
```

In the dataset, the only categorical variable is the response, `diagnosis`, which records whether the tumor is benign or malignant, so there are no categorical predictors. All the predictors are numerical. One column is completely empty, so we drop it; the rest of the data is complete, as the check above confirms. There are 30 continuous variables that will act as predictors, plus one more variable holding the ID of each diagnosis. There are 569 rows, which matches the class proportions printed in 5d, so each row represents a different tumor sample with 30 features for determining its state (benign or malignant).

# 7) Existing solutions
Are there existing solutions of your problem? Almost all Kaggle datasets have existing solutions. If yes, then how do you plan to build upon those solutions? **What is the highest model accuracy / performance achieved in the existing solutions?**

*(1 point)*

Several solutions claim 97-98% accuracy, and one recent solution claims 99%. We will improve on these solutions by ensuring very high precision and by creating the most interpretable model possible based on the findings of multiple models. While completing these tasks, we seek to match the high accuracy rates already achieved in these solutions.

# 8) Stakeholders

Who are the stakeholders, and how will your project benefit them? Explain.

*(1 point)*

The stakeholders are medical professionals and people who are affected by cancer. The project would benefit them by giving a clearer idea of whether a tumor is likely to be malignant without invasive tests. While more complete tests are likely to follow, an understanding of likelihood patterns will help professionals better prioritize medical resources based on tumor characteristics.

# 9) Work-split
*(This question is answered for you)*

How do you plan to split the project work amongst individual team members?

We will learn to develop and tune the following models in the STAT303 sequence:

1. MARS

2. Decision trees with cost-complexity pruning

3. Bagging (Bagging MARS / decision trees)

4. Random Forests

5. AdaBoost

6. Gradient boosting

7. XGBoost

8. Lasso / Ridge / Stepwise selection 

Each team member is required to develop and tune at least one of the above models. In the end, the team will combine all the developed models to create a model more accurate than each of the individual models.

*(0 points)*

*If only 3 models are required in total, each member will select 1 from their group given below after learning about them.*

Alanda- Models 6-8 

Junho- Models 1-3 (MARS)

Luke- Models 4-5 (Random Forest)
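
The final step of combining the individually tuned models could take several forms. The sketch below shows one possibility, soft voting, which averages each model's predicted probability of malignancy. This is an assumption on our part, not a method prescribed by the course; the function names and probabilities are made up for illustration.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-model malignant probabilities, optionally weighted."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_obs = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_obs)
    ]


def classify(probabilities, threshold=0.5):
    """Predict malignant (1) when the averaged probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities from three hypothetical tuned models, three tumors each.
model_probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.0, 0.3],
]
averaged = soft_vote(model_probs)
print(classify(averaged, threshold=0.5))
print(classify(averaged, threshold=0.3))
```

Lowering the threshold on the averaged probabilities flags more tumors as malignant, which is how the ensemble could be tuned toward high recall.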