# CS-EJ3211 Machine Learning with Python 

## Student Project: Tissue type classification based on microarray gene expression profiles
**submission deadline 22.03.2021 23:59 Helsinki time**

### Student project instructions

In order to participate in the project, you must submit a project report by 22.03.2021. The report is submitted as a Python notebook (.ipynb format), and should follow the required outline presented in this notebook.

The submitted report should contain all Python code used in the project (early prototyping and "scrapbooking" can be excluded). The notebook should be arranged so that the reader can replicate your workflow by running the cells in the notebook in order.

**General recommendations**\
Strive to use the notation used in this course if you use mathematical formulas or symbols. If you want to use different notation, follow good scientific writing principles and clearly define the meaning of your symbols.

**Please comment your code.**\
The commenting doesn't have to be as comprehensive as it is in the exercise rounds (where it is for educational reasons), but it should give some indication of what is happening in the different sections of your code.

## Introduction

"A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene."\
text source: https://www.nature.com/scitable/definition/microarray-202/

<img src="DNA_microarray.jpg" width=800/>

image source: https://www.genome.gov/about-genomics/fact-sheets/DNA-Microarray-Technology

The microarray data for this problem consists of normalized relative expression of certain genes measured in different tissues. There are 3000 gene probes and 2000 samples. The full dataset can be found at https://www.ebi.ac.uk/arrayexpress/ (accession number E-MTAB-62).

The first two columns of the 'data_subset.csv' file (located in the 'coursedata' folder) contain the sample IDs (e.g. 'GSM23227.CEL') and the analysis info ('RMA'); the remaining columns contain the expression values for 3000 genes.

Your task is to predict the type of tissue ('disease' vs 'normal') based on the expression profile of each sample.

In addition to this task, you can solve the ML problem of predicting multiple types of tissue {'cell line', 'disease', 'neoplasm', 'normal'}. This is an optional task that allows you to earn extra points.

<a id='problem'></a>
<div class=" alert alert-info">

## Problem formulation (5 p)

In contrast to the conceptual presentation of the problem in the introduction, this section formulates the problem as a machine learning problem. You should:

- Define the type of your problem. Is it a regression or classification problem? Or perhaps something else?

- Define the **data points** in your problem and define the **features** and **labels** of the points.

- Define the **metric** that serves as the measure of quality of an ML model on your problem. For example, the mean-squared-error might be a reasonable choice for a regression problem, whereas some kind of balanced accuracy score might suit a classification problem with imbalanced classes. Note that this is not necessarily equivalent to the loss function used by your model!
    
</div>
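
To illustrate why the choice of metric matters with imbalanced classes, here is a minimal sketch on made-up labels (not the course data). It assumes a hypothetical "model" that always predicts the majority class and compares plain accuracy, balanced accuracy, and F1-score for it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels (assumption, not the course data): 90 negatives (0), 10 positives (1).
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class 0.
y_pred = np.zeros_like(y_true)

# Accuracy looks high because the majority class dominates.
acc = accuracy_score(y_true, y_pred)

# Balanced accuracy averages the recall of each class.
bal_acc = balanced_accuracy_score(y_true, y_pred)

# F1 for the positive class; zero_division=0 avoids a warning when
# no positive predictions are made.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}, F1={f1:.2f}")
```

The accuracy is high even though the positive class is never detected, which is why a class-aware metric is usually the more informative choice in such a setting.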

### YOUR TEXT HERE ###

Some text ....

<a id='methods'></a>
<div class=" alert alert-info">

# Methods

## **General instructions:**
    
This section presents the methods used to solve the machine learning problem and walks through the process of solving the problem. This section could include:

- A description of the dataset. What is the source of the dataset? How many data points does it contain? The features and labels were already presented in the previous section but can be presented once again.
    
- A description of why and how the data were split into subsets.

- A description of the pre-processing methods that you have used on your data. 

- A description of the model(s) you are using to solve your machine learning problem. Of what form are the predictor functions (include formula if applicable)? What is the loss function to be minimized or maximized (include formula if applicable)? You should also include a short description of the hyperparameters that you tune to optimize the model. 
    
- If you use some tools/methods for model selection and validation (e.g. cross-validation, grid search), explain the purpose of it and how it was performed.
    
- A description of hyperparameter tuning and model selection process. E.g. which validation methods have you used to estimate the model performance on previously unseen data?


# **Specific instructions:**

# PART 1 (mandatory, 15 p) 

    
Your task is to build logistic regression and Support Vector Machine (SVM) models for the tissue type prediction task. During this course, you have familiarized yourself with multiple ML methods from the scikit-learn library, but now you will need to independently learn the specifics of how to use the SVM classifier in scikit-learn by studying the documentation and related resources. 
    
More precisely, you need to:

1. Load the "data_subset.csv" file as a Pandas dataframe. The file contains gene expression data for tissues of different types. The first column contains the sample ID and the second column indicates how the data was analysed (Robust Multi-array Average or RMA). The remaining columns, excluding the final one, contain the relative gene expression values. Finally, the last column contains the category (label) to which the data points belong ('cell line', 'disease', 'neoplasm', 'normal'). 


2. In this part, you will only use data points belonging to two of the four categories in the dataset - 'disease' and 'normal'. Consequently, you should create a new data frame that only contains the data points with these labels. The new dataset should consist of 700 data points.


3. Create numpy arrays `X` (feature matrix) and `y` (label vector) based on the data frame. The feature matrix should contain the expression data and be of shape `(700, 3000)`.
   The label vector `y` should be of shape `(700,)` and contain integer values 1 (for data points labeled 'disease') and 0 (for data points labeled 'normal').
   
4. Split the data with `train_test_split` into training and test sets (with 80:20 ratio, random_state=42). Keep the test set aside until the final evaluation. Use the training data to choose the model. 

5. Implement PCA (using 20 components) with logistic regression:

   - Use the sklearn `Pipeline` class to chain the pre-processing steps (StandardScaler() and PCA(n_components=20, random_state=42)) and logistic regression. 
   - Use the `cross_val_score` function from sklearn.model_selection to perform 5-fold cross-validation and get the average F1-score (pass the parameters scoring='f1' and cv=5 to `cross_val_score`).
 

6. Implement PCA (using 20 components) with SVM:

  - Construct a Pipeline object with scaler and PCA for the SVM model in a similar way as for logistic regression.
  - Use the training set for choosing parameters and hyperparameters. Specifically, perform grid search combined with cross-validation on the Pipeline object by using the `GridSearchCV` class in scikit-learn. 
  
  The candidate parameter values for the SVM model in your grid search should be `'C': [0.01, 1, 100]` and `'gamma': [1e-04, 1e-03, 1e-02]`, the number of folds used for cross-validation should be `cv=5`, and the scoring parameter should be `'f1'`.
  - Report the F1-score of the SVM model with the best parameter values for `C` and `gamma`.
  

7. Choose the model with the best F1-score and perform the final evaluation:

    - Fit the model (pipeline object) on the training dataset.
    - Report the accuracy and F1-score on the training and test sets.
    - Plot a normalized confusion matrix for the test set. 
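
The workflow in steps 2 to 6 can be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: it runs on a small synthetic data frame (200 samples, 50 made-up "genes") instead of `data_subset.csv`, the column names and the pipeline step names (`scaler`, `pca`, `clf`) are hypothetical, and the SVM uses the default RBF kernel of `SVC`, which the instructions do not prescribe. Note that inside a `Pipeline` the grid-search keys need the step-name prefix (here `clf__`).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the course data (assumption): same column layout,
# i.e. sample id, analysis info, expression values, label in the last column.
rng = np.random.default_rng(0)
n_genes = 50
labels = np.repeat(["cell line", "disease", "neoplasm", "normal"], 50)
expr = rng.normal(size=(labels.size, n_genes))
expr[labels == "disease", :5] += 1.5  # shift a few genes so classes differ
df = pd.DataFrame(expr, columns=[f"gene_{i}" for i in range(n_genes)])
df.insert(0, "sample_id", [f"S{i}" for i in range(labels.size)])
df.insert(1, "analysis", "RMA")
df["label"] = labels

# Steps 2-3: keep only 'disease' and 'normal', then build X and y.
binary = df[df["label"].isin(["disease", "normal"])]
X = binary.iloc[:, 2:-1].to_numpy()
y = (binary["label"] == "disease").astype(int).to_numpy()

# Step 4: 80:20 split; the test set is not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def make_pipe(clf):
    """Chain scaling, PCA and a classifier (step names are hypothetical)."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=20, random_state=42)),
        ("clf", clf),
    ])


# Step 5: 5-fold cross-validated F1-score for logistic regression.
logreg_pipe = make_pipe(LogisticRegression(max_iter=1000))
logreg_f1 = cross_val_score(
    logreg_pipe, X_train, y_train, scoring="f1", cv=5
).mean()

# Step 6: grid search with 5-fold CV for the SVM pipeline.
param_grid = {"clf__C": [0.01, 1, 100], "clf__gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(make_pipe(SVC()), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print(f"logistic regression CV F1: {logreg_f1:.3f}")
print(f"best SVM CV F1: {search.best_score_:.3f} with {search.best_params_}")
```

Because scaling and PCA live inside the pipeline, they are re-fitted on the training part of each fold, so no information from the validation folds leaks into the pre-processing.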

Useful links:

- Learn about Support Vector Machine (SVM) methods (e.g. https://scikit-learn.org/stable/modules/svm.html#support-vector-machines) and the implementation of SVM (specifically the SVC) in the scikit-learn library.
- Pipeline example https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html
- Metrics for evaluation https://scikit-learn.org/stable/modules/model_evaluation.html
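- Function for plotting confusion matrix https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html (note: `plot_confusion_matrix` was removed in scikit-learn 1.2; in newer versions use `ConfusionMatrixDisplay.from_estimator` or `ConfusionMatrixDisplay.from_predictions` instead)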
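
As a small illustration of a normalized confusion matrix, the sketch below uses made-up test labels and predictions (assumptions, not results from the course data). Each row is normalized over the true class, so the diagonal holds the per-class recall. The helper `plot_normalized_cm` is a hypothetical name; it relies on `ConfusionMatrixDisplay` and needs matplotlib only when it is called.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Made-up true labels and predictions for a binary problem (assumption).
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# normalize="true" divides each row by the number of samples in that true class.
cm = confusion_matrix(y_test, y_pred, normalize="true")
print(cm)


def plot_normalized_cm(cm, class_names):
    """Draw a precomputed confusion matrix (requires matplotlib)."""
    display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
    display.plot(cmap="Blues")
    return display
```

In a notebook, calling `plot_normalized_cm(cm, ["normal", "disease"])` would render the matrix with class names on the axes.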

### YOUR TEXT HERE ###


Some text about methods ......

In [None]:
### PART 1 ###
### YOUR CODE HERE ###

# PART 2 (optional, extra 15 p max)

In this part, you need to predict several tissue types ('cell line', 'disease', 'neoplasm', 'normal') based on gene expression data. Use the whole dataset and implement an SVM model for predictions.

If you'd like to earn more points for the project, you can:
    
- Perform 3-fold cross-validation for the SVM model (5 points max) and/or perform grid search for the SVM parameters `'C': [0.01, 1, 100]` and `'gamma': [1e-04, 1e-03, 1e-02]` (5 points max). 
- Perform 3-fold cross-validation combined with grid search (15 points max).

You can either (1) implement CV or Grid Search and get 5 points max or (2) implement both, but separately, and get 10 points max, or (3) combine CV + Grid Search and get 15 points max. You need to choose only one option - e.g., you cannot do (1)+(3) and get 20 points. 

**NOTE!!!** In grid search, report at least the F1-score for each combination of parameters (you can report other metrics in addition). You should report the F1-score (averaged across the 3 folds, if implementing CV) for each class ('cell line', 'disease', 'neoplasm', 'normal') **SEPARATELY**. This means that you need to perform CV/grid search "manually", with for-loops, without using the sklearn `GridSearchCV` and `Pipeline` classes. 
- During CV/grid search, report evaluation metrics only on the validation set.
- Perform the final evaluation on the test set similarly as in Part 1 (report the F1-score for the training and test sets, and plot a confusion matrix for the test set).


Hints:
- You can use `StratifiedKFold(n_splits=3, shuffle=True, random_state=42)` for cross-validation.
- If using `f1_score` from sklearn.metrics, use parameter `average=None` to get f1-score for each class.
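
The hints above can be combined into a manual loop roughly as follows. This is a hedged sketch on synthetic four-class data from `make_classification`, not on the course dataset, and it makes several assumptions that the instructions leave open: features are standardized per fold with `StandardScaler`, the SVM uses the default RBF kernel, and the "best" combination is picked by the mean of the per-class F1-scores.

```python
import itertools

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the gene expression matrix (assumption).
X, y = make_classification(
    n_samples=240, n_features=30, n_informative=10,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
classes = [0, 1, 2, 3]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = {}  # (C, gamma) -> per-class F1 averaged over the 3 validation folds

for C, gamma in itertools.product([0.01, 1, 100], [1e-4, 1e-3, 1e-2]):
    fold_scores = []
    for train_idx, val_idx in skf.split(X_train, y_train):
        # Fit the scaler on the training part of the fold only.
        scaler = StandardScaler().fit(X_train[train_idx])
        model = SVC(C=C, gamma=gamma)
        model.fit(scaler.transform(X_train[train_idx]), y_train[train_idx])
        y_val_pred = model.predict(scaler.transform(X_train[val_idx]))
        # average=None returns one F1-score per class.
        fold_scores.append(f1_score(
            y_train[val_idx], y_val_pred,
            average=None, labels=classes, zero_division=0,
        ))
    results[(C, gamma)] = np.mean(fold_scores, axis=0)
    print(f"C={C}, gamma={gamma}: per-class F1 = {np.round(results[(C, gamma)], 3)}")

# Selection rule (assumption): highest mean of the per-class F1-scores.
best_C, best_gamma = max(results, key=lambda key: results[key].mean())

# Final evaluation: refit on the full training set, score once on the test set.
scaler = StandardScaler().fit(X_train)
final_model = SVC(C=best_C, gamma=best_gamma).fit(scaler.transform(X_train), y_train)
test_f1 = f1_score(
    y_test, final_model.predict(scaler.transform(X_test)),
    average=None, labels=classes, zero_division=0,
)
print(f"best (C, gamma) = ({best_C}, {best_gamma}), test per-class F1 = {np.round(test_f1, 3)}")
```

Passing `labels=classes` keeps the per-class score vector at a fixed length of four even if a class happens to be missing from the predictions in some fold.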

### YOUR TEXT HERE ###

Some text about methods ......

In [None]:
### PART 2 ###
### YOUR CODE HERE ###

<a id='result'></a>
<div class=" alert alert-info">

## Results (5 p)

This section presents the results of the experiments. In most problems, the central result is the estimated performance of the final model on new data with respect to the chosen performance metric. In addition, you can, for example, present results for different models or consider how the hyperparameters affect the model's performance.

</div>

### YOUR TEXT HERE ###

Some text about your results ...

<a id='discussion'></a>
<div class=" alert alert-info">


## Discussion/Conclusions (5 p)

In this section, you should analyze the results on a more general level and summarize the findings of your project work. If possible, you should at least answer the following questions:
- Do the results suggest satisfactory performance of your final model, or is there much room for improvement?
- How do your results compare to benchmarks or the solutions of others (if such are available)?
- Are you aware of any methodological shortcomings in the project?
- Do you have ideas for how to improve the performance (e.g. using more training data, using more features for the data points, or using a different class of predictor functions, i.e. a different hypothesis space)?

</div>

### YOUR TEXT HERE ###

Discussion ....