# <center>Machine learning from scratch - Part II</center>
## <center>WebValley ReImagined 2021</center>
### <center>Marco Chierici</center>
#### <center>FBK/DSH</center>

In this handout we will go through basic concepts of machine learning using Python and scikit-learn on a real-world dataset of biological relevance from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) (TCGA) program.

# Breast cancer dataset

The data include gene expression of **499 patients** (already split into 399/100 training/test sets) with **breast invasive carcinoma (BRCA)**, aiming at predicting the **estrogen receptor status** (positive vs negative samples).

The data were lightly preprocessed to keep the tutorial moving.

Let's start by loading a few modules that we'll be using later:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import neighbors
from pathlib import Path ## for creating paths in a neat way

Define files to read:

In [None]:
DATA_DIR = Path("data")
DATA_TR = DATA_DIR / "brca_genes_tr.tsv.gz"
DATA_TS = DATA_DIR / "brca_genes_ts.tsv.gz"
DATA_TS

_Note:_ from now on we will use the "tr" suffix to denote the training set, and "ts" for the test set.

Read the files in as _pandas dataframes_:

In [None]:
data_tr = pd.read_csv(DATA_TR, sep="\t")
data_ts = pd.read_csv(DATA_TS, sep="\t")

The function `read_csv` has a lot more input arguments to deal with different situations.

If you want to know more about this or any other Python function, use the `help(function_name)` command or, within a notebook, `function_name?`.
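As a small illustration of those extra arguments, the sketch below reads a tiny in-memory table (invented for this example, not the BRCA files) and uses `index_col` and `usecols`, two of the `read_csv` parameters. The column names are assumptions made up for the demo.

```python
import io
import pandas as pd

# A tiny tab-separated table, invented for illustration only
toy_tsv = "Sample\tER\tgene_A\tgene_B\nS1\tPositive\t1.5\t0.2\nS2\tNegative\t0.3\t2.1\n"

# index_col: use the "Sample" column as the dataframe index
toy_indexed = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col="Sample")

# usecols: load only a subset of the columns
toy_subset = pd.read_csv(io.StringIO(toy_tsv), sep="\t", usecols=["Sample", "ER"])

print(toy_indexed.shape)
print(list(toy_subset.columns))
```

With `index_col`, the sample IDs no longer count as a data column, which is why the first shape has one column less than the file.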

What do we have here? Start with getting the dimensions of what we just loaded:

In [None]:
data_tr.shape

What's inside?

A peek at the first rows reveals that the first column (`Sample`) contains the sample IDs, then we have three more columns with what seems to be clinical data, and the remaining columns are genes:

In [None]:
data_tr.head()

Note the use of **prefixing** in column names (`gene`): not only does it add a level of information on the content of the variables, but it is useful since it allows selecting or filtering out groups of variables in a simpler way.
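To make the point about prefixes concrete, here is a minimal sketch on a toy dataframe. The column names (`gene_A`, `gene_B`) are hypothetical and only mimic the prefixing idea; the real column names may differ.

```python
import pandas as pd

# Toy dataframe with hypothetical column names mimicking the "gene" prefix
toy = pd.DataFrame({
    "ER": ["Positive", "Negative"],
    "stage": ["I", "II"],
    "gene_A": [1.5, 0.3],
    "gene_B": [0.2, 2.1],
})

# Keep only the columns whose name starts with "gene"
gene_cols = [c for c in toy.columns if c.startswith("gene")]
toy_genes = toy[gene_cols]

# Equivalent selection with a regular expression
toy_genes_regex = toy.filter(regex="^gene")

print(gene_cols)
```

Either form selects a whole group of variables in one line, without listing them one by one.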

Drop the first column from the train and test expression sets, since it's just the sample IDs (we put them in to be able to check whether samples and labels match, but once we are sure of what we are doing we don't really need them anymore).

In [None]:
data_tr = data_tr.drop('Sample', axis=1)
data_ts = data_ts.drop('Sample', axis=1)

Check what happened:

In [None]:
data_tr.head()

`stage` is the tumor stage according to the American Joint Committee on Cancer (AJCC).

`ER` is the status of the Estrogen Receptor (binary label: Positive/Negative) and it has been linked to the survival of patients: ER-positive patients are more likely to have a shorter survival than ER-negatives.

`survival` is the patient's living status at the followup.

We will use `ER` as our target variable for a machine learning model able to predict ER status from gene expression.

For the remaining part of this hands-on, we need the data and labels to be stored in Numpy arrays (`.values` attribute): let's start with the labels.

In [None]:
y_train = data_tr["ER"].values.ravel()
y_test = data_ts["ER"].values.ravel()

The `.ravel()` method returns a flattened 1-D array:

In [None]:
y_test

Now, we drop the label and clinical columns from the data, convert to Numpy array, and store the result in a new variable:

In [None]:
cols_to_remove = ["ER", "survival", "stage"]
X_train = data_tr.drop(cols_to_remove, axis=1).values
X_test = data_ts.drop(cols_to_remove, axis=1).values

X_train



---


*Naming conventions: in the machine learning world, usually `x` is the data and `y` the target variable (the labels).*

---



Always double-check the dimensions!

When coding, it is a good practice to have a peek at the resulting variables, to be sure everything is OK: i.e., is that variable like it is supposed to be? Did I accidentally throw away a feature column?

This can avoid lots of problems later on! Papers were even retracted because of this kind of error...

In [None]:
X_train.shape

Let's go back to the sample labels:

In [None]:
y_train.shape

In [None]:
data_tr.shape
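The visual checks above can also be turned into assertions that fail loudly when something is off. The sketch below uses a hypothetical helper, `check_shapes`, on toy arrays; it is one possible way to automate the check, not part of the original notebook.

```python
import numpy as np

def check_shapes(X, y, n_dropped, n_original_cols):
    """Raise AssertionError if data and labels are inconsistent."""
    assert X.ndim == 2, "X should be a 2-D array (samples x features)"
    assert y.ndim == 1, "y should be a flat 1-D array"
    assert X.shape[0] == y.shape[0], "one label per sample expected"
    assert X.shape[1] == n_original_cols - n_dropped, "unexpected number of feature columns"
    return True

# Toy stand-ins: 6 samples, 5 original columns, 3 of which were dropped
X_toy = np.zeros((6, 2))
y_toy = np.array(["Positive", "Negative"] * 3)
ok = check_shapes(X_toy, y_toy, n_dropped=3, n_original_cols=5)
print(ok)
```

On the real data, the same idea would compare `X_train.shape`, `y_train.shape` and `data_tr.shape`, with three dropped columns.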

---

### Recap

- **ER = Positive** indicates **worse** outcome (survival)
- **ER = Negative** indicates **better** outcome

---

# 1. Data preprocessing

The downstream analysis can benefit from data preprocessing, i.e., rescaling or standardizing data values.

In scikit-learn you can use `MinMaxScaler` or `StandardScaler` in the `preprocessing` submodule. Here is an example using `StandardScaler`:

In [None]:
from sklearn.preprocessing import StandardScaler
## first you need to create a "scaler" object
scaler = StandardScaler()
## then you actually scale data by fitting the scaler object on the data...
scaler.fit(X_train)
## ... and using it to transform the data
x_tr = scaler.transform(X_train)
## note that we don't fit the scaler on the test set: we just transform it
x_ts = scaler.transform(X_test)

Note how we transformed the test set: we computed the scaling parameters on the training set and applied them to the test set. In this way, we did not use any information in the test set to standardize it.
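To see what "applying the training parameters to the test set" means in practice, the following sketch reproduces the transformation by hand on toy arrays (values invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: 4 training samples, 2 test samples
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0], [10.0, 100.0]])

# The scaling parameters come from the training set only
toy_scaler = StandardScaler().fit(toy_train)

# transform() subtracts the training mean and divides by the training standard deviation
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
scaled_test = toy_scaler.transform(toy_test)

print(np.allclose(scaled_test, manual_test))
# The scaled test set generally does not have zero mean: it was not used for fitting
print(scaled_test.mean(axis=0))
```

The scaled test set is not expected to have zero mean and unit variance, and that is fine: it only has to live on the same scale as the training data.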

The labels are in text form ("Positive", "Negative"). While this can be dealt with by the classifier, in general it is better to encode them into a numerical form.

This can be done using scikit-learn's `LabelEncoder`, which has a similar `fit`/`transform` API.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_tr = le.fit_transform(y_train)
y_ts = le.transform(y_test)

Check the result, as usual:

In [None]:
y_ts

In the encoded space, '1' corresponds to 'Positive': it seems intuitive, but that's a happy coincidence (always check).
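One way to check the mapping is to inspect the encoder's `classes_` attribute: `LabelEncoder` sorts the labels, so the position of each label in `classes_` is its code. A minimal sketch on toy labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration
toy_labels = np.array(["Positive", "Negative", "Positive", "Positive"])
toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the original labels, sorted; the position is the encoded value
mapping = {label: int(code) for code, label in enumerate(toy_le.classes_)}
print(mapping)

# inverse_transform goes back from codes to the original labels
decoded = toy_le.inverse_transform(toy_encoded)
print(decoded)
```

Here "Negative" comes before "Positive" alphabetically, which is why 'Positive' ends up as 1. With other label names the order could be less intuitive.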

# 2. Unsupervised data exploration

Remember the scatterplot matrix we showed on the 4-feature Iris data?

_How should we visualize feature relationships on this dataset with ~20K features?_

It is probably better to first reduce the dimensionality of our data.

Principal Component Analysis (PCA) is one example of a dimensionality reduction technique. It finds a sequence of linear combinations of the variables (called the _principal components_) that are mutually uncorrelated and that, in order, explain the maximum remaining variance, thus summarizing most of the information in the data.
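The properties just listed can be verified on a toy example. The sketch below computes the principal components with a plain singular value decomposition in Numpy; it is an illustration under simplified assumptions (random toy data), not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 50 samples, 5 features, with two features made correlated on purpose
toy_X = rng.normal(size=(50, 5))
toy_X[:, 1] = toy_X[:, 0] + 0.1 * rng.normal(size=50)

# Center the data, then take the singular value decomposition
centered = toy_X - toy_X.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Rows of Vt are the principal directions (linear combinations of the variables)
scores = centered @ Vt.T               # samples in principal component space
explained_ratio = S**2 / np.sum(S**2)  # fraction of variance per component

# Components are sorted by decreasing variance and are mutually uncorrelated
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))
print(explained_ratio)
print(np.abs(off_diagonal).max())
```

The off-diagonal covariances are zero up to rounding error, and the variance fractions decrease from the first component to the last.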

## 2.1 PCA

Let's perform an **unsupervised learning** task on our data set "as is" by decomposing it in its Principal Components.

In scikit-learn, we can use the `PCA` class from the `decomposition` submodule:

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

So far we have a PCA _object_ but no transformation yet.

To actually transform the data, we'll have to _fit_ the PCA object on the training data, and then _transform_ the data into the Principal Component space:

In [None]:
pca.fit(x_tr)
z_tr = pca.transform(x_tr)
# or:
# z_tr = pca.fit_transform(x_tr)

In [None]:
x_tr.shape

In [None]:
z_tr.shape

Let's have a look at the _variance ratio_, i.e. the fraction of the total variance explained by each component:

In [None]:
print(pca.explained_variance_ratio_)

It is always convenient to visualize the first two principal components in a scatterplot, in order to get a first assessment of the goodness of the decomposition.

We will color the points in the plot according to our sample labels.

In [None]:
f = plt.figure()
plt.scatter(z_tr[:, 0], z_tr[:, 1], c=y_tr, cmap="coolwarm")
plt.title("PCA of Train data")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
f.savefig("PCA_train.pdf")

Now we apply the transformation to the test data, plot it, and save it as PDF.

In [None]:
# fit a PCA on the train & use it to transform the test set
pca.fit(x_tr)
z_ts = pca.transform(x_ts)

## plot
f = plt.figure()
plt.scatter(z_ts[:, 0], z_ts[:, 1], c=y_ts, cmap="coolwarm")
plt.title("PCA of Test data")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
f.savefig("PCA_test.pdf")

## 2.2 UMAP

We now perform a UMAP transformation of the data, recalling what we did on the Iris dataset.

In [None]:
import umap
reducer = umap.UMAP(n_neighbors=5, random_state=91)
embedding = reducer.fit_transform(x_tr)
embedding.shape

Convert from Numpy array to Pandas dataframe for more convenient plotting with `seaborn`:

In [None]:
df_2D = pd.DataFrame(embedding, columns=['UMAP1', 'UMAP2'])
df_2D['class'] = y_tr
df_2D.head()

In [None]:
import seaborn as sns
sns.scatterplot(x="UMAP1", y="UMAP2", hue="class", data=df_2D)

# the Matplotlib way:
# plt.scatter(embedding[:, 0], embedding[:, 1], c=[sns.color_palette()[x] for x in y_tr])

Since we have a UMAP object projecting the training set into a low-dimensional space, we can transform new data (i.e., the test set) based on the existing embedding. 

We use the `transform` method on the reducer object, and plot the resulting transformation:

In [None]:
embedding_ts = reducer.transform(x_ts)

df_2D_ts = pd.DataFrame(embedding_ts, columns=['UMAP1', 'UMAP2'])
df_2D_ts['class'] = y_ts

sns.scatterplot(x="UMAP1", y="UMAP2", hue="class", data=df_2D_ts)

We can independently project the test set too:

In [None]:
reducer = umap.UMAP(n_neighbors=5, random_state=91)
reducer.fit(x_ts)
embedding_ts = reducer.transform(x_ts)

df_2D_ts = pd.DataFrame(embedding_ts, columns=['UMAP1', 'UMAP2'])
df_2D_ts['class'] = y_ts

sns.scatterplot(x="UMAP1", y="UMAP2", hue="class", data=df_2D_ts)

# 3. Supervised Learning

## 3.1 k-NN classifier

Based on the PCA we built on our data, we decide to try some supervised learning on them.

Scikit-learn provides you access to several models via a very convenient _fit_ and _predict_ interface.

For example, let's fit a **k-NN** model on the whole training data and then use it to predict the labels of the test data.

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(x_tr, y_tr)
y_pred_knn = knn.predict(x_ts) # predict labels on test data

_In general, a classifier has **parameters** that need to be tuned. Default choices are not good in all situations._

_For example, in k-NN the main parameter is the **number of neighbors** used in the nearest neighbors algorithm._

_More on this later!_
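To get a feeling for what the number of neighbors does, here is a bare-bones k-NN written with Numpy. It is a simplified sketch (Euclidean distance, plain majority vote) on invented 1-D data, with a hypothetical helper `knn_predict`; it is not how scikit-learn implements the classifier.

```python
import numpy as np
from collections import Counter

def knn_predict(X_fit, y_fit, X_new, k):
    """Predict labels by majority vote among the k nearest training points."""
    predictions = []
    for point in X_new:
        distances = np.sqrt(((X_fit - point) ** 2).sum(axis=1))  # Euclidean distance
        nearest = np.argsort(distances)[:k]                       # indices of the k closest points
        votes = Counter(y_fit[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy 1-D data: class 0 around 0, class 1 around 10, plus one class-1 outlier near 0
X_toy = np.array([[0.0], [0.5], [1.0], [0.6], [9.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])
query = np.array([[0.7]])

pred_k1 = knn_predict(X_toy, y_toy, query, k=1)
pred_k3 = knn_predict(X_toy, y_toy, query, k=3)
print(pred_k1, pred_k3)
```

With `k=1` the prediction follows the single closest point (the outlier), while with `k=3` the two class-0 neighbors outvote it: the same query gets a different label depending on the parameter.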

To evaluate the predictions we need some kind of metrics. 

### Recap: confusion matrix

In this example, the first row is class 0, so the confusion matrix will look like:

|          |       | Predicted 0 | Predicted 1 |
|----------|-------|-------------|-------------|
| **True** | **0** | TN          | FP          |
| **True** | **1** | FN          | TP          |


In [None]:
from sklearn.metrics import confusion_matrix
conf = confusion_matrix(y_ts, y_pred_knn)
conf

The total number of class 0 test samples (AN = All Negatives) should be equal to the sum of the first row of the confusion matrix, i.e., TN + FP:

In [None]:
np.sum(y_ts==0) # total number of "class 0" samples in the test set

Similarly for class 1, i.e., AP = All Positives = TP + FN:

In [None]:
np.sum(y_ts==1) # total number of "class 1" samples in the test set

Compute the Accuracy, remembering/using the formula: 

ACC = (TN + TP) / (TN + TP + FN + FP)

In [None]:
tp = conf[1, 1]
tn = conf[0, 0]
fp = conf[0, 1]
fn = conf[1, 0]

acc = (tn + tp) / (tn + tp + fn + fp)
print(acc)

Now compute the Sensitivity:

SENS = TP / (TP + FN)

In [None]:
tp / (tp + fn)

Computing metrics by hand is good, but what about a quicker option?

As seen in the lectures, scikit-learn offers a broad range of handy functions to evaluate your classifier through its submodule `metrics`.

Let's compute the accuracy using the scikit-learn built-in function `accuracy_score`, taking as input the predicted labels (`y_pred_knn`) and the true labels (`y_ts`):

In [None]:
from sklearn import metrics
metrics.accuracy_score(y_ts, y_pred_knn)

What about Sensitivity? The built-in function is called `recall_score`, as Recall is an alternate name for Sensitivity:

In [None]:
metrics.recall_score(y_ts, y_pred_knn)

Scikit-learn also provides a neat `metrics.classification_report` function that outputs a few metrics stratified by class:

In [None]:
print(metrics.classification_report(y_ts, y_pred_knn))

Let's consider the Matthews Correlation Coefficient (MCC):

![MCC formula](https://www.researchgate.net/profile/Pablo_Moscato/publication/223966631/figure/fig1/AS:305103086080001@1449753652505/Calculation-of-Matthews-Correlation-Coefficient-MCC-A-Contingency-matrix_W640.jpg)

*Q: Do you remember the main features of MCC?*

In scikit-learn it is computed by the `metrics.matthews_corrcoef` function.
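As a cross-check of the formula, the sketch below computes the MCC by hand from the four confusion matrix entries and compares it with `metrics.matthews_corrcoef`. The labels are invented for illustration and `mcc_from_counts` is a hypothetical helper.

```python
import numpy as np
from sklearn import metrics

def mcc_from_counts(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Toy labels, invented for illustration
toy_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
toy_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the entries in the order TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(toy_true, toy_pred).ravel()
mcc_manual = mcc_from_counts(tp, tn, fp, fn)
mcc_sklearn = metrics.matthews_corrcoef(toy_true, toy_pred)
print(mcc_manual, mcc_sklearn)
```

In this toy case TP = 5, TN = 3, FP = 1 and FN = 1, so the MCC is 14/24, and both computations agree.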

If we get the MCC for our kNN predictions, we can observe that it is in line with our *a priori* knowledge of the dataset (from the article):

In [None]:
print(metrics.matthews_corrcoef(y_ts, y_pred_knn))

*Compare the metrics that you computed so far. What can you say about this classification task? Does the classifier learn something?*

The metrics may look good (e.g., accuracy around 0.9, MCC above 0.6) but...

... how do you know if this model performs similarly well on unseen data?

In other words, does this model *generalize* beyond its training set?

This is why *data partitioning* techniques are used.

## Data partitioning

### Hold-out strategy

The idea behind data partitioning is to split your original data set into a **train** portion (for developing your machine learning model) and a **test** portion (for evaluating the performance of the trained model).

The simplest and most straightforward way to partition your data set is to randomly split it in two groups (*hold-out strategy*).

---

"But we already have a dataset split into train and test!", you may object.

That's perfectly fine! 

In fact, the train portion can be used to train a classifier in a cross-validation setting, as we will see. The test portion is then used only for inference.

For the sake of this tutorial, we will further split the BRCA train set into two subsets.

---


You achieve this using scikit-learn's function `train_test_split`, in the `model_selection` submodule.

For example, let's split the data (`X_train`) into 80% train and 20% test (note the argument `test_size=0.2`), preserving class label proportions:

In [None]:
from sklearn.model_selection import train_test_split
x_tr, x_ts, y_tr, y_ts = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, random_state=101)

* `stratify` is used to maintain class proportions in the splits;
* `random_state` is the seed for the pseudo-random number generator (PRNG) - more on this below!
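The effect of `stratify` is easy to verify on toy labels. The sketch below uses an invented, imbalanced label vector (80% "Positive") and checks the class proportions in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 "Positive" and 20 "Negative" samples
toy_y = np.array(["Positive"] * 80 + ["Negative"] * 20)
toy_X = np.arange(100).reshape(-1, 1)

# Same arguments as in the tutorial call, applied to the toy data
_, _, y_a, y_b = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=101
)

# Fraction of "Positive" samples in the train (y_a) and test (y_b) splits
frac_train = np.mean(y_a == "Positive")
frac_test = np.mean(y_b == "Positive")
print(len(y_a), len(y_b))
print(frac_train, frac_test)
```

Both splits keep the original 80/20 class balance. Without `stratify`, the proportions would vary with the random seed.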

In [None]:
x_tr.shape

In [None]:
X_train.shape



---

What is the random_state?

Whenever randomness is involved in a computer program, we need to rely on some sort of workaround because computers follow their instructions blindly and they are therefore completely predictable.

One approach relies on *Pseudo-Random Number Generators* (PRNGs). 

PRNGs are algorithms that use mathematical formulas or precalculated tables to produce sequences of numbers that appear random.

PRNGs are initialized by a *seed* (an integer), so that *the same seed yields the same sequence of pseudo-random numbers*. This is useful for reproducibility.
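A quick way to see this in action is to create two generators with the same seed. The sketch uses Numpy's `default_rng` purely as an illustration of the idea; scikit-learn manages its own generator internally from `random_state`.

```python
import numpy as np

# Two generators initialized with the same seed produce the same sequence
rng_a = np.random.default_rng(101)
rng_b = np.random.default_rng(101)
seq_a = rng_a.integers(0, 100, size=5)
seq_b = rng_b.integers(0, 100, size=5)

# A different seed gives, in general, a different sequence
rng_c = np.random.default_rng(102)
seq_c = rng_c.integers(0, 100, size=5)

print(seq_a, seq_b, seq_c)
print(np.array_equal(seq_a, seq_b))
```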


---



Remember to rescale the new training/test sets and to encode the labels:

In [None]:
scaler = StandardScaler()
x_tr = scaler.fit_transform(x_tr)
x_ts = scaler.transform(x_ts)

le = LabelEncoder()
y_tr = le.fit_transform(y_tr)
y_ts = le.transform(y_ts)

*Now, retrain a kNN model on the new training split (`x_tr`) and evaluate its performance on the new test split (`x_ts`). Try using different random states for data splitting.*

In [None]:
from sklearn import metrics
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(x_tr, y_tr)
y_pred_knn = knn.predict(x_ts)

acc = metrics.accuracy_score(y_ts, y_pred_knn)
mcc = metrics.matthews_corrcoef(y_ts, y_pred_knn)

print(f"Accuracy = {acc:.3f}")
print(f"MCC = {mcc:.3f}")

# More ideas

* Fit a model on the "tr" set, predict on "ts", evaluate the performance
* Find a way to determine the "optimal" K for the kNN model
* Use the other labels ("survival", "stage") to train a classifier
* Evaluate the performance for each type of label
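As a possible starting point for the "optimal" K idea, the sketch below scores a few candidate values of K on a held-out validation split and keeps the best one by MCC. It runs on synthetic data from `make_classification` (a stand-in, not the BRCA set), and the list of candidate values is an arbitrary assumption.

```python
import numpy as np
from sklearn import metrics, neighbors
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression data (not the BRCA set)
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# Fit the scaler on the fitting portion only, then transform both portions
toy_scaler = StandardScaler().fit(X_fit)
X_fit, X_val = toy_scaler.transform(X_fit), toy_scaler.transform(X_val)

# Score each candidate k on the validation split
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    scores[k] = metrics.matthews_corrcoef(y_val, model.predict(X_val))

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

A single validation split gives a noisy estimate, so the selected K can change with the random seed; cross-validation is the more robust option.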

# What next?

* Cross-validation
* Other classifiers: Support vector machines, Random Forests, Neural nets
* Feature ranking
* Parameter tuning