# INF399 

A Jupyter notebook containing all the code to reproduce your work and a report of all your methodological choices and results. Please "restart and run all" before submission, so that you submit a clean version. 

Code should be documented and special tricks (e.g. to avoid division by zero, to make sure it takes finite time to run, etc.) should be reported. The rationale behind all steps in the code should be clear from the report. In particular, if you are subsampling, you should report it, and you should consider for each step how much subsampling is appropriate. 

**NOTE:** 
Model selection is an important part of the task and will be graded accordingly. Before applying machine learning algorithms, you should always consider (and report) what results you expect. When you have successfully applied machine learning algorithms, you should always comment on how well the results match your expectations. 

## Task 1 - Preprocessing 
In this task you summarize and visualize the data and prepare it for analysis. 
- Load data <b style="color:green">DONE</b>
- Describe data <b style="color:green">DONE</b>
- Check for any missing values and handle these appropriately. <b style="color:green">DONE</b>
- Find the ranges and basic statistics of the features and rescale them if appropriate. For similar data, scaling using `arcsinh(x/5)` has been used successfully. <b style="color:green">DONE</b>
- Visualize the univariate densities of all features using your favorite density estimator. <b style="color:green">DONE</b>
- Calculate basic bivariate statistics, such as correlations. <b style="color:green">DONE</b>
- Perform any other appropriate preprocessing steps. <b style="color:green">DONE</b>
- Discuss the results of your summaries and visualization efforts and explain your preprocessing choices (not doing any preprocessing is also a choice). <b style="color:red">TODO</b>

The datasets contain information on 20,000 blood cells of 20 rheumatoid arthritis patients and 20 healthy controls. The first two columns identify the patient and the patient group. The remaining columns are the cell markers measured. 
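
The `arcsinh(x/5)` scaling mentioned in the list above could look roughly like the sketch below. The function name, the `cofactor` argument, and the example intensities are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def arcsinh_transform(x, cofactor=5.0):
    """Apply the arcsinh(x / cofactor) scaling to raw marker intensities."""
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)

# Hypothetical raw intensities for three cells and two markers.
raw = np.array([[0.0, 5.0], [50.0, 500.0], [-5.0, 5000.0]])

# Count missing values before scaling (none in this toy example).
n_missing = int(np.isnan(raw).sum())

scaled = arcsinh_transform(raw)
```

The transform is close to linear near zero and logarithmic for large values, so it compresses the long right tail of the intensities while still handling zeros and negative values.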

## Task 2 - Dimensionality reduction

- Visualize the mass cytometry dataset using at least three different representation learning algorithms.  <b style="color:orange">HALF DONE</b>
- Explain your choices of algorithms. <b style="color:red">TODO</b>
- For each algorithm, explain your choice of parameters. <b style="color:red">TODO</b>
- For each dimensionality reduction, describe the main features you see and discuss if these features come from the data or the dimensionality reduction technique. <b style="color:red">TODO</b>
- Discuss the differences and similarities of your dimensionality reductions. <b style="color:red">TODO</b>
- Embed in 3D to get a possibly better density estimate for each patient <b style="color:red">TODO</b>
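
A minimal sketch of embedding the same subsample with three different algorithms, assuming scikit-learn is available. The choice of PCA, t-SNE and Isomap, the subsample size, and all parameter values are placeholders for illustration; the data is synthetic and stands in for the scaled marker matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled marker matrix (cells x markers).
X = rng.normal(0.0, 1.0, size=(600, 5))
X[:300, 0] += 2.0  # two overlapping "populations" along the first marker

# Subsample before the slower nonlinear methods; the size is an arbitrary choice.
idx = rng.choice(len(X), size=200, replace=False)
X_sub = X[idx]

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X_sub),
    "t-SNE": TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(X_sub),
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X_sub),
}
```

Using the same subsample for every method keeps the plots comparable, which helps when discussing whether a feature comes from the data or from the technique. Setting `n_components=3` gives the 3D variant.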

## Task 3 - GMM
- Use BIC and AIC to estimate the number of components for the GMM (model selection) <b style="color:green">DONE</b>
- Compare a classifier trained on a GMM with 8 components vs 36 components (two basic ways to augment the data).
- Validate the result
    - Train a discriminator that tries to distinguish between generated and original data. Use that to rate the quality of the generated data, i.e. as in a GAN <b style="color:red">TODO</b>
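
The BIC/AIC model selection step can be sketched as follows, assuming scikit-learn's `GaussianMixture`. The synthetic three-cluster data and the candidate range 1 to 6 are illustrative assumptions; the real candidate range would need to cover values such as 8 and 36.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in data: three well-separated 2D clusters.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(c, 1.0, size=(300, 2)) for c in centers])

candidates = list(range(1, 7))
bic, aic = [], []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bic.append(gmm.bic(X))  # lower is better
    aic.append(gmm.aic(X))  # lower is better

best_k_bic = candidates[int(np.argmin(bic))]
best_k_aic = candidates[int(np.argmin(aic))]

# New cells can then be drawn from a fitted model with gmm.sample(n).
```

BIC penalizes extra components more strongly than AIC, so the two criteria can disagree; when they do, it is worth reporting both choices.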

## Task 4 - Sample Patients
- **Process:**
    1. Embed the data in lower dim <b style="color:green">DONE</b>
    2. Run kernel density estimation on the embedded sampled and original data <b style="color:green">DONE</b>
    3. Compute the symmetric KL divergence between the estimated densities of each pair of sampled and original patients, evaluated on a grid over the embedded space <b style="color:green">DONE</b>

- **Notes:**
    - Create a metric for comparing patients that looks at the similarity within the same group and the difference between groups, as well as the overall KL divergence.
    - Try **other density** comparators:
        - Earth mover's distance <b style="color:red">TODO</b>
        - Wasserstein distance (the earth mover's distance is its order-1 case) <b style="color:red">TODO</b>
        - Use a **classifier** to separate generated and original patients using the density of the UMAP embedding on a meshgrid? <b style="color:red">TODO</b>
    - **Embed in 3D** to get a possibly better density estimate for each patient <b style="color:red">TODO</b>
    - Draw the histogram of each axis of the UMAP-embedded patients to see why the **sample generator does not produce evenly spaced samples**. <b style="color:red">TODO</b>
    - Clean up and create a **pipeline** <b style="color:red">TODO</b>
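
Steps 2 and 3 of the process above can be sketched as follows, assuming SciPy's `gaussian_kde` as the density estimator and 2D embedded points. The helper names, the grid resolution, the `eps` value, and the synthetic "patients" are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
from scipy.stats import gaussian_kde

def grid_density(points, grid_xy, eps=1e-12):
    """KDE of 2D points evaluated on a shared grid, normalized to sum to 1."""
    # eps avoids log(0) and division by zero where the density underflows.
    dens = gaussian_kde(points.T)(grid_xy) + eps
    return dens / dens.sum()

def symmetric_kl(p, q):
    """Symmetric KL divergence KL(p||q) + KL(q||p) of two discrete distributions."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
# Synthetic stand-ins for embedded patients (cells x 2).
original = rng.normal(0.0, 1.0, size=(500, 2))
sampled_close = rng.normal(0.0, 1.0, size=(500, 2))
sampled_far = rng.normal(2.0, 1.0, size=(500, 2))

# One shared grid covering all patients, so the densities are comparable.
all_pts = np.vstack([original, sampled_close, sampled_far])
xs = np.linspace(all_pts[:, 0].min(), all_pts[:, 0].max(), 50)
ys = np.linspace(all_pts[:, 1].min(), all_pts[:, 1].max(), 50)
gx, gy = np.meshgrid(xs, ys)
grid_xy = np.vstack([gx.ravel(), gy.ravel()])  # shape (2, n_grid_points)

p = grid_density(original, grid_xy)
kl_close = symmetric_kl(p, grid_density(sampled_close, grid_xy))
kl_far = symmetric_kl(p, grid_density(sampled_far, grid_xy))
```

The grid must be identical for every patient pair; otherwise the discretized densities are not defined on the same support and the divergences are not comparable.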

## Task 5 - VAE
1. Estimate the data-generating distribution **for all cells**. <b style="color:red">TODO</b>
2. Estimate the data-generating distribution **for each patient group**, assuming each patient group comes from a different distribution. <b style="color:red">TODO</b>

## Task 6 - GAN?
1. Estimate the data-generating distribution **for all cells**. <b style="color:red">TODO</b>
2. Estimate the data-generating distribution **for each patient group**, assuming each patient group comes from a different distribution. <b style="color:red">TODO</b>

## Task 7 - Fine tune parameters 
1. The GMM <b style="color:red">TODO</b>
2. The VAE <b style="color:red">TODO</b>
3. The GAN <b style="color:red">TODO</b>
4. Run on data with all cell markers <b style="color:red">TODO</b>
5. Look at another dataset. <b style="color:red">TODO</b>

## NOTES
- include deep learning models and probabilistic models! 

Validation:
- Use an embedding to check whether generated and original data end up similar (similar embeddings do not prove the generated data is correct, but dissimilar embeddings show it is definitely not) 
- Start by running INF367 project 2 with the new data, subsampling 
- Set up pipeline 
- Start with fitting GMM 


- We have a basic form of model selection for the GMM
- We want to compare how good the new sampling is by
    - Comparing a classifier trained on non-augmented data vs augmented data
    - We also need a measure of how well different augmentations compare.
    - I need validation and test data to compare different classifiers on patients, but there are only 20 patients and 20 controls
- TODO:
    - Train a classifier without augmenting the data. Therefore we need a train, validation and test set to select a good model! 
        - Problem: is 40 patients too few for a classifier to do well?
    - To do so, we need a simple classifier that can take 20000 parameters as a single input (perhaps a decision tree). We also need to split the data into train and test. 
    - Train a discriminator that tries to distinguish between generated and original data. Use that to rate the quality of the generated data, i.e. as in a GAN
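
One way to set up the train, validation and test sets discussed above is to split at the patient level, stratified by group, so that no patient contributes cells to more than one set. The sketch below is a hypothetical helper; the split sizes and the patient numbering are assumptions.

```python
import numpy as np

def split_patients(patient_ids, groups, n_test_per_group=4,
                   n_val_per_group=4, seed=0):
    """Split patients (not cells) into train/val/test, stratified by group."""
    rng = np.random.default_rng(seed)
    patient_ids, groups = np.asarray(patient_ids), np.asarray(groups)
    split = {"train": [], "val": [], "test": []}
    n_held_out = n_test_per_group + n_val_per_group
    for g in np.unique(groups):
        ids = rng.permutation(patient_ids[groups == g])
        split["test"].extend(ids[:n_test_per_group])
        split["val"].extend(ids[n_test_per_group:n_held_out])
        split["train"].extend(ids[n_held_out:])
    return {name: sorted(int(i) for i in ids) for name, ids in split.items()}

# Hypothetical labelling: patients 0-19 are RA, 20-39 are healthy controls.
patient_ids = np.arange(40)
groups = np.array(["RA"] * 20 + ["control"] * 20)
split = split_patients(patient_ids, groups)
```

With only 40 patients each held-out set is tiny, so repeated or cross-validated splits may give more stable estimates than a single split.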


Validation:
- We want to compare how good the new sampling is by
    1. comparing classifiers trained on generated data vs original data.
    2. we also need a measure of how well different augmentations compare
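
One candidate measure for comparing augmentations is the cross-validated accuracy of a discriminator trained to tell generated cells from original ones: accuracy near 0.5 means it cannot tell them apart. The sketch below assumes scikit-learn and uses logistic regression as a deliberately simple discriminator; the function name and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def discriminator_accuracy(original, generated):
    """Mean 5-fold CV accuracy of a classifier separating original from generated."""
    X = np.vstack([original, generated])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(generated))])
    clf = LogisticRegression(max_iter=1000)
    return float(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())

rng = np.random.default_rng(0)
# Synthetic stand-ins: cells x markers.
original = rng.normal(0.0, 1.0, size=(1000, 5))
good_generated = rng.normal(0.0, 1.0, size=(1000, 5))  # same distribution
bad_generated = rng.normal(1.5, 1.0, size=(1000, 5))   # shifted distribution

acc_good = discriminator_accuracy(original, good_generated)
acc_bad = discriminator_accuracy(original, bad_generated)
```

A linear discriminator mostly detects shifts in the mean; a nonlinear classifier (or a neural network, as in a GAN) would be needed to catch differences in shape or correlation structure.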

Notes:
- Expect it to be very good at the cell level, okay at the patient level