Algorithms That Learn To Predict 
================================

In [None]:
import copy
from copy import deepcopy
import numpy as np
import pandas as pd
import sklearn as sk

Machine learning (ML) refers to the use of algorithms which can learn
from data. The input to an ML algorithm will generally be some sort of
data set---which I will refer to as the *training data*---and the
output will usually be another algorithm, which I will call a
*fit model*, or sometimes just a *fit* if I'm feeling lazy.
-   ML algorithm : training data $\rightarrow$ fit model

The fit model itself also takes data as input, and generally requires
that this data be very similar in nature to the training data
provided to the ML algorithm. For example, assuming
the data sets in question are represented in table form, the data
provided to the fit model must usually have all or almost all of the
same columns as the training data set did. However, the output from
the fit model is usually much simpler, often consisting of a predicted
*numeric value* or *categorical label* for each individual
sampling unit of the data set.
-   Fit model : test data $\rightarrow$ predicted values
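
To make the two arrows above concrete, here is a deliberately tiny sketch in plain Python. The names `mean_learner`, `fit_model`, `toy_fit`, and `toy_predictions` are hypothetical, made up for illustration only; the "algorithm" simply memorizes the mean of the training outputs and predicts it for every test sampling unit.

```python
import numpy as np

def mean_learner(train_outputs):
    """Toy ML algorithm: training data -> fit model."""
    center = float(np.mean(train_outputs))
    def fit_model(test_data):
        """Toy fit model: test data -> predicted values."""
        # one predicted value per test sampling unit
        return np.full(len(test_data), center)
    return fit_model

toy_fit = mean_learner([1.0, 2.0, 6.0])   # ML algorithm applied to training data
toy_predictions = toy_fit([10, 20])       # fit model applied to test data
```

The point is only the shape of the workflow: the learner is a function whose return value is itself a function.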

We will use the `sklearn` python module, which implements ML
algorithms in an object-oriented manner. In this framework,
a constructor is used to create a learner object embodying the algorithm;
the `fit` method can then be called on the learner object, after
which it becomes the fit model object. I sometimes `deepcopy` the
learner and fit the copy so as to keep the objects representing
the learner and fit model separate; this can be useful if you want
to re-use the same learner to fit multiple fit models using different
data sets.

Here is an example using `sklearn.linear_model.LinearRegression`:

In [None]:
np.random.seed(123)
n = 20
 ## generate some random data for two variables:
predictor1 = np.random.randn(n)
predictor2 = np.random.randn(n)
 ## now set a third variable equal to a weighted sum of those
 ## two variables plus a random error term:
output = 2*predictor1 + predictor2 + np.random.randn(n)
 ## bundle up the two predictor variables into a
 ## DataFrame object:
featureData = pd.DataFrame({"p1":predictor1, "p2":predictor2})
 ## split featureData and output into training and test sets:
trainFeats = featureData.iloc[0:10, ]
trainOutput = output[0:10]
testFeats = featureData.iloc[10:20, ]  ## should not overlap trainFeats!
testOutput = output[10:20]
 ## now train model using only the training set:
import sklearn.linear_model as lm
learner = lm.LinearRegression()
fitModel = deepcopy(learner).fit(trainFeats, trainOutput)
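
The reason for the `deepcopy` can be checked directly. The following is a minimal sketch on freshly simulated data; the names `demoLearner`, `demoFit`, `X`, and `y` are hypothetical and not used elsewhere in this document. It relies on two `sklearn` conventions: `fit` returns the object it was called on, and learned attributes such as `coef_` only exist after fitting.

```python
from copy import deepcopy
import numpy as np
import sklearn.linear_model as lm

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=10)

demoLearner = lm.LinearRegression()
demoFit = deepcopy(demoLearner).fit(X, y)

## the copy was fit, so it now carries learned coefficients...
fit_has_coefs = hasattr(demoFit, "coef_")
## ...while the original learner is still unfit and can be re-used
learner_has_coefs = hasattr(demoLearner, "coef_")

## calling fit directly returns the very same object, now fitted
sameObject = demoLearner.fit(X, y) is demoLearner
```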

Now we can use `fitModel` to make predictions on rows 11--20 of
`featureData`; in `sklearn`, this is done by calling the `predict`
method of `fitModel` with the test feature matrix as argument:

In [None]:
 ## generate predictions for test data:
predictions = fitModel.predict(testFeats)
 ## we'll use plotnine (python version of ggplot) for plotting:
import plotnine
from plotnine import qplot, ggplot, aes, geom_point
 ## get rid of default gray background:
plotnine.theme_set(plotnine.theme_bw())
 ## plot actual values of the output against predicted values
 ## for the test data using plotnine's qplot
qplot(pd.Series(predictions), pd.Series(testOutput))

This is an example of *supervised learning*, in which one of the
variables in the training data set (`output` in this case) is treated
as an output to be predicted using the others. Note that the test set
does not need to have this variable present to make predictions;
indeed we did not give it that information in the `predict` call
above!

Thus, in supervised learning approaches the fit model requires only a
subset of the variables present in the training data to be present in
the test data in order to make predictions.
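
As a small illustration of this point, here is a sketch in which the new data has no output column at all. The data are simulated and the names `toyTrain`, `toyFit`, `toyNew`, and `toyPredictions` are hypothetical, chosen just for this example.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model as lm

rng = np.random.default_rng(1)
toyTrain = pd.DataFrame({"p1": rng.normal(size=10), "p2": rng.normal(size=10)})
toyTrain["out"] = 2 * toyTrain["p1"] + toyTrain["p2"] + rng.normal(size=10)

## training uses the feature columns and the output column:
toyFit = lm.LinearRegression().fit(toyTrain[["p1", "p2"]], toyTrain["out"])

## new data contains the feature columns only (no "out" column):
toyNew = pd.DataFrame({"p1": [0.5, -1.0], "p2": [1.5, 0.0]})
toyPredictions = toyFit.predict(toyNew)
```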

In *unsupervised learning* this is not the case, and we must
generally have all variables from the training data also present in
any test data that we wish to make predictions on. What is this
"unsupervised learning", you ask, and what might it be used to
predict? Let's consider an example to make things more concrete:

In [None]:
trainData = trainFeats.copy()
trainData["out"] = trainOutput
testData = testFeats.copy()
testData["out"] = testOutput
 ## use k-means clustering algorithm to fit 2 clusters to training data
import sklearn.cluster as clust
kmLearner = clust.KMeans(n_clusters=2)
kmeansFit = deepcopy(kmLearner).fit(trainData)
 ## predict which cluster each test datum is in:
kmPredictions = kmeansFit.predict(testData)
kmPredictions

In [None]:
 ## two clusters in this case correspond to low and high values of "out":
qplot(pd.Series(kmPredictions).astype(str), pd.Series(testOutput), geom="boxplot")

As seen in this example, unsupervised learning algorithms try to find
some latent structure in the training data---such as the carving of
the variable space (frequently called *feature space* in ML) into
two disjoint clusters done by `kmeans`, about which more will be
said later.

Many unsupervised learning algorithms, including `kmeans`, produce
fit models which can be used to determine how test data would fit into
the learned latent structure; for instance, here we were able to
assign each test datum to one of the two clusters learned from the
training data set. There are some unsupervised learning approaches
which generate fit models which are not immediately equipped to make
test set predictions, however (hierarchical clustering and t-SNE come
to mind here), which can limit their utility in some situations.
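
One quick way to see this distinction in `sklearn` is to ask which objects expose a method for handling new data. The sketch below assumes the `sklearn` classes `KMeans`, `AgglomerativeClustering` (one implementation of hierarchical clustering), and `sklearn.manifold.TSNE`; the behavior shown reflects the `sklearn` versions I am aware of and could differ in others.

```python
import sklearn.cluster as clust
from sklearn.manifold import TSNE

## KMeans can assign new data to learned clusters via predict;
## AgglomerativeClustering only offers fit and fit_predict:
has_predict = {
    "KMeans": hasattr(clust.KMeans(n_clusters=2), "predict"),
    "AgglomerativeClustering": hasattr(
        clust.AgglomerativeClustering(n_clusters=2), "predict"
    ),
}
## TSNE offers fit_transform for the training data but no separate
## transform method for embedding new data:
tsne_has_transform = hasattr(TSNE(), "transform")
```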

Data 
====
Machine learning---perhaps I should lose the qualifier and just say
learning---isn't much without data!

We're going to see how machine learning algorithms work by applying
them to both real and simulated data. It's critical to play with real
data in learning machine learning, as it is very difficult to
replicate many important features of real data via
simulation. Simulation does play an important role in ML as well,
however: only with simulated data can we check how our algorithms
perform when all of the assumptions that underlie their derivation are
truly met. It is also frequently much easier to "turn the knobs" on
various data set properties of interest---like the number of sampling
units $n$, the number of features $m$, the degree of correlation
between features, etc.---with simulated data than in the lab or the
external world!
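
As a sketch of what "turning the knobs" can look like, the hypothetical helper `simulate_features` below (not part of this project's code) draws $n$ sampling units with $m$ features from a multivariate normal distribution in which every pair of features has the same correlation `rho`. This equicorrelated design is just one convenient assumption; it requires `rho` to lie between $-1/(m-1)$ and 1 for the covariance matrix to be valid.

```python
import numpy as np

def simulate_features(n, m, rho, seed=0):
    """Simulate n sampling units x m features with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.full((m, m), rho, dtype=float)   # off-diagonal entries: rho
    np.fill_diagonal(cov, 1.0)                # unit variance for each feature
    return rng.multivariate_normal(np.zeros(m), cov, size=n)

simX = simulate_features(n=5000, m=4, rho=0.6)
## empirical correlation matrix of the simulated features:
simCor = np.corrcoef(simX, rowvar=False)
```

Changing `n`, `m`, or `rho` in the call is all it takes to generate a data set with different properties.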

We will consider two real gene expression data sets:
1.  an RNA-seq data set downloaded from Gene Expression Omnibus
    (accession
    [GSE120430](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120430))
    analyzing transcriptional targets of core promoter factors in
    Drosophila neural stem cells [@neves2019distinct].
2.  a microarray (!) data set from 2006 collected to predict
    sensitivity to preoperative chemotherapy using expression levels
    measured in fine-needle breast cancer biopsy specimens
    [@hess2006pharmacogenomic].

I'll defer further discussion of the Hess data set until we get to
supervised analysis methods.

In order to read in the data from file, I'm going to define a
convenience function resetting some of the defaults of the
`pd.read_csv` function:

In [None]:
def rt(f):
    return pd.read_csv(f, sep="\t", index_col=0, header=0)
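
To see what these defaults do without needing a file on disk, here is a small sketch that feeds `rt` an in-memory tab-separated table via `io.StringIO` (`pd.read_csv` accepts file-like objects as well as paths). The table contents and the names `toyFile` and `toyTable` are made up for illustration.

```python
import io
import pandas as pd

def rt(f):
    return pd.read_csv(f, sep="\t", index_col=0, header=0)

## first line is the header; first column becomes the row index:
toyFile = io.StringIO("gene\ts1\ts2\ng1\t5\t0\ng2\t3\t7\n")
toyTable = rt(toyFile)
```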

Now let's use this function to read the Neves data set, along with a
file containing Drosophila melanogaster gene annotations, in from the
files included here in the GitHub project:

In [None]:
nevesExpr = np.log2(rt("data/gse120430_deseq_normalized.tsv.gz") + 1)
nevesExpr.iloc[0:5, 0:5]
 ## (note that gene expression matrix files are usually provided
 ##  using genes-in-rows format)
 ## simplify nevesExpr by removing genes with no data:
nevesExpr = nevesExpr.loc[nevesExpr.sum(axis=1) > 0]
 ## by contrast, sample annotation files generally follow the
 ## older statistics convention of sampling units-in-rows
nevesAnnot = rt("data/gse120430_sample_annotation.tsv")
dmGenes = rt("data/d_melanogaster_gene_annotations.saf.gz")

Let's take a quick look at `nevesAnnot`:

In [None]:
nevesAnnot

To minimize the chance of any bugs in our analysis code, it is useful
to align the rows of the sample annotation data (and gene annotation
data, if we have it) to the columns of the expression matrix:

In [None]:
 ## align sample annotations to expression data:
nevesAnnot = nevesAnnot.loc[nevesExpr.columns]
 ## align dmGenes to expression data:
dmGenes = dmGenes.loc[nevesExpr.index]
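
The effect of indexing with `.loc` in this way is easiest to see on a toy example. In the sketch below, the data frames `toyExpr` and `toyAnnot` are hypothetical stand-ins whose sample labels are deliberately stored in different orders.

```python
import pandas as pd

## expression matrix: genes in rows, samples in columns
toyExpr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["geneA", "geneB"],
    columns=["s2", "s3", "s1"],
)
## sample annotations: samples in rows, stored in a different order
toyAnnot = pd.DataFrame(
    {"group": ["ctrl", "ctrl", "treated"]},
    index=["s1", "s2", "s3"],
)
## reorder annotation rows to match the expression matrix columns:
alignedAnnot = toyAnnot.loc[toyExpr.columns]
```

A useful side effect, at least in recent `pandas` versions, is that `.loc` raises a `KeyError` if any requested label is missing, so a mismatch between the two tables is caught early instead of silently propagating.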

The `group` column indicates whether each sample is in the group
expressing the control (mCherry) or one of the experimental RNAi
transgenes (TAF9, TBP, or TRF2).
The sample names in the expression data and sample annotations are
Gene Expression Omnibus accession ids; we'll replace these with more
descriptive names based on the grouping information in the sample
annotations:

In [None]:
 ## use more descriptive names for samples
betterSampleNames = [nevesAnnot["group"].iloc[i] + "-" + str(1+i%3)
                     for i in range(nevesAnnot.shape[0])]
nevesExpr.columns = betterSampleNames
nevesAnnot.index = betterSampleNames
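
Note that the `i%3` trick relies on each group having exactly three replicates stored consecutively. The sketch below reproduces the naming on a hypothetical toy annotation table (`toyAnnot`, with made-up sample labels) and, as one possible alternative rather than the approach used here, numbers replicates within each group using `groupby(...).cumcount()`, which does not depend on group size or row order.

```python
import pandas as pd

toyAnnot = pd.DataFrame(
    {"group": ["mCherry"] * 3 + ["TBP"] * 3},
    index=["sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF"],
)

## same construction as above: replicate number from row position
toyNames = [toyAnnot["group"].iloc[i] + "-" + str(1 + i % 3)
            for i in range(toyAnnot.shape[0])]

## alternative: count replicates within each group explicitly
replicate = (toyAnnot.groupby("group").cumcount() + 1).astype(str)
toyNamesAlt = list(toyAnnot["group"] + "-" + replicate)
```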


Finally, because the descriptive gene names for the measured
Drosophila genes are in one-to-one correspondence with the Flybase
gene ids used to label the rows in the file
`data/gse120430_deseq_normalized.tsv.gz`, we'll swap them
out:

In [None]:
 ## use more descriptive names for genes
nevesExpr.index = dmGenes["GeneName"]

The code shown above for loading in the Neves data set is also contained in the file `load_neves.py`.