# Assignment 1
### Understanding Uncertainty
### Due 9/5

1. Create a new public repo on GitHub under your account. Include a readme file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [4]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Download file already exists
Extracting data files...


BadZipFile: File is not a zip file
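The output above shows the function skipping the download because `data.zip` already exists, then failing because that file is not a valid archive. One plausible cause (an assumption, not confirmed by the output) is an interrupted or failed earlier download. A minimal sketch for detecting and clearing such a file follows; the helper names `is_usable_zip` and `remove_if_corrupt` are hypothetical and not part of the course code.

```python
import os
import zipfile


def is_usable_zip(path):
    """Return True only if `path` exists and looks like a valid zip archive."""
    return os.path.exists(path) and zipfile.is_zipfile(path)


def remove_if_corrupt(path):
    """Delete `path` if it exists but is not a valid zip; return True if it was removed."""
    if os.path.exists(path) and not zipfile.is_zipfile(path):
        os.remove(path)
        return True
    return False
```

Calling `remove_if_corrupt('data.zip')` and then `download_data()` should trigger a fresh download. Alternatively, `download_data(force=True)` re-downloads and re-extracts regardless of what is already on disk.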

4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T)
    9. `va_procurement.csv`: Public spending by the state of Virginia

5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)
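The three checks above can be sketched on a small made-up table. The column names `surgery` and `age` and their values are illustrative assumptions, not taken from any course dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a course dataset; names and values are made up.
toy = pd.DataFrame({
    "surgery": ["MASTECTOMY", "BREAST CONSERVING", None, "MASTECTOMY"],
    "age": [52.0, 61.5, np.nan, 47.0],
})

# Categorical or numeric? The dtype is a first hint (text columns are not numeric).
age_is_numeric = pd.api.types.is_numeric_dtype(toy["age"])
surgery_is_numeric = pd.api.types.is_numeric_dtype(toy["surgery"])

# Missing values: isna() gives a boolean Series; summing it counts the True entries.
missing_surgery = int(np.sum(toy["surgery"].isna()))
missing_age = int(toy["age"].isna().sum())

# Categorical: tabulate the values. Numeric: get a summary.
surgery_counts = toy["surgery"].value_counts()
age_summary = toy["age"].describe()

print(age_is_numeric, surgery_is_numeric)
print(missing_surgery, missing_age)
print(surgery_counts)
print(age_summary)
```

Note that `value_counts()` leaves missing values out by default, so the tabulated counts and the missing count together account for every row.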

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want, that are not available in your data?

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to GitHub and submit the link on Canvas in the assignment tab.

In [None]:
# Importing pandas and numpy
import pandas as pd
import numpy as np

# Creating df
df = pd.read_csv("/Users/Caroline/Desktop/school/understanding_uncertainty/data/metabric.csv")

# Naming what columns I want to view and creating a df for that
cols_to_view = ['Cancer Type', 'Type of Breast Surgery', 'Chemotherapy', 'Hormone Therapy']
df_selected = df[cols_to_view]
df_selected

# QUESTION 5a
# Cancer type is categorical; type of breast surgery is categorical; I feel like chemo and hormone therapy are numeric because I'd personally replace YES/NO with 0/1
# Upon second thought, 0/1 are still categories, so all my selected columns are categorical

# QUESTION 5b
for col in cols_to_view:
    empty = df[col].isna()
    print(f"There are {np.sum(empty)} empty cells in the {col} column.")
# No columns have missing values. Great!

# QUESTION 5c
value_counts_cols = [df[col].value_counts() for col in cols_to_view]
# print(value_counts_cols) (output copied below)

"""
Output:

[
Cancer Type:
Breast Cancer    1343
Name: count, dtype: int64, 

Type of Breast Surgery
MASTECTOMY           773
BREAST CONSERVING    570
Name: count, dtype: int64, 

Chemotherapy
NO     1057
YES     286
Name: count, dtype: int64, 

Hormone Therapy
YES    811
NO     532
Name: count, dtype: int64
]
"""

# NOTES ON QUESTION 5c
# All cancers are breast cancer
# 58% had mastectomies and 42% had breast-conserving surgery
    # (773/1343 = 57.6% and 570/1343 = 42.4%, so the rounded shares add to 100%)
# 79% did not have chemo and 21% did.
# 60% had hormone therapy and 40% did not.



# Extra challenge: I want to explore how many had both chemotherapy and hormone therapy, how many had one, and how many had neither.
# A bit of help from ChatGPT on putting df before [(df['Chemotherapy'] == 'YES') & (df['Hormone Therapy'] == 'NO')] and small error-checking.

chemo = df[(df['Chemotherapy'] == 'YES') & (df['Hormone Therapy'] == 'NO')]
hormone = df[(df['Chemotherapy'] == 'NO') & (df['Hormone Therapy'] == 'YES')]
both = df[(df['Chemotherapy'] == 'YES') & (df['Hormone Therapy'] == 'YES')]
neither = df[(df['Chemotherapy'] == 'NO') & (df['Hormone Therapy'] == 'NO')]


# Print counts
print("Both:", len(both))
print("Only Chemotherapy:", len(chemo))
print("Only Hormone Therapy:", len(hormone))
print("Neither:", len(neither))



"""
6. 
What are some questions and prediction tools you could create using these data? 
Who would the stakeholder be for that prediction tool? 
What practical or ethical questions would it create? 
What other data would you want, that are not available in your data?

A big question is about the age at diagnosis. This is important for detection and screening: at what age should people be concerned or start testing? And at what ages are 
chemo and hormone therapy useful? Another question is about treatment. Which treatment (or combination of treatments) is associated with survival, or with longer overall survival in 
months? A final question is whether there is a correlation between the number of lymph nodes examined (positive) and tumor size and stage. Does checking more lymph nodes help catch
tumors at a smaller size or earlier stage? Or do more positive lymph nodes correlate with larger or later-stage tumors? I am not sure exactly what the lymph node column measures, 
so either question would have to wait until I research what the column means. I could create multivariate graphs to explore these questions and test whether the relationships 
are statistically significant (p < 0.05), or, even better, build a predictive model using our ML class teachings! 

Of course, there is a lot of uncertainty in these data. What are the margins of error (if any)? What were the error rates, and what checks were used to avoid errors when the data 
were entered? Are the relationships between variables merely correlational, or causal? Which columns depend on two or more other columns, rather than just one (really, all of them, since medicine is very
holistic)? 

Ethically, these tools raise many questions. Choosing a cut-off age for regular scanning means that younger individuals who are, in theory, low-risk may have a cancer go undetected. 
Relying on an ML or predictive model alone to choose treatment can miss nuance that a human doctor brings to the table. (Actually, some studies suggest that a combination of 
human choice AND AI-predictive decisions can beat doctors or AI working alone. Teamwork!) Practically, could such a model produce new prognostic indices like the Nottingham index? 

I would like to know whether more treatments were used than the three in the dataset. I also want to know whether there are other prognostic indices like the Nottingham one. 
Are there different ways to measure tumor size (mass vs. dimensions)? Could more physiological measures, like resting heart rate, weight, and blood pressure, be collected? 
What about the ages at which patients were tested or scanned? 

A note: I am very interested in healthcare (and the environment), so I loved this dataset and would enjoy seeing more examples along these lines in future assignments! Thank you!

"""



There are 0 empty cells in the Cancer Type column.
There are 0 empty cells in the Type of Breast Surgery column.
There are 0 empty cells in the Chemotherapy column.
There are 0 empty cells in the Hormone Therapy column.
Both: 137
Only Chemotherapy: 149
Only Hormone Therapy: 674
Neither: 383
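As a possible alternative to the four boolean filters, a single cross-tabulation can produce the same four counts. The sketch below rebuilds only the two treatment columns from the counts printed above (137, 149, 674, 383) rather than reading `metabric.csv`; the stand-in frame `df_toy` is an assumption made for illustration.

```python
import pandas as pd

# Rebuild the two treatment columns from the four printed counts.
# This stand-in reproduces the totals only; the real rows live in metabric.csv.
counts = {
    ("YES", "YES"): 137,  # both
    ("YES", "NO"): 149,   # only chemotherapy
    ("NO", "YES"): 674,   # only hormone therapy
    ("NO", "NO"): 383,    # neither
}
rows = [
    {"Chemotherapy": chemo, "Hormone Therapy": hormone}
    for (chemo, hormone), n in counts.items()
    for _ in range(n)
]
df_toy = pd.DataFrame(rows)

# One call tabulates every combination of the two columns.
table = pd.crosstab(df_toy["Chemotherapy"], df_toy["Hormone Therapy"])
print(table)

# normalize="all" converts the counts into shares of all patients.
shares = pd.crosstab(
    df_toy["Chemotherapy"], df_toy["Hormone Therapy"], normalize="all"
)
print((shares * 100).round(1))
```

On the real data, the same call would be `pd.crosstab(df['Chemotherapy'], df['Hormone Therapy'])`. Passing `margins=True` adds row and column totals, which gives a quick check against the `value_counts()` output.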

