# Supplemental notes, getting started

We'll be using the same tools that we used last week for this session.

- [pandas](https://pandas.pydata.org) for data handling (our dataframe library)
- [seaborn](https://seaborn.pydata.org) for _nice_ data visualization
- [scikit-learn](https://scikit-learn.org), an extensive machine learning library
- [numpy](https://numpy.org), a fundamental maths library best used by people with a strong maths background.  We won't explore it much today, but it does have some useful methods that we'll need.  It underlies all other mathematical and plotting tools that we use in Python.

Shortly we'll also be trying out:
- [statsmodels](https://www.statsmodels.org) - this is another library for doing statistical fitting.  It generates R-like reports.
- Two new parts of scikit-learn, `sklearn.cross_decomposition` and `sklearn.decomposition`, for PLS and PCA

We'll be using scikit-learn over the next few weeks, and it's well worth reading the documentation and high level descriptions.

As before, the aim is to get familiar with code-sharing workflows - so we will be doing pair programming for the duration of the day! _You will probably want to take a moment to look at the documentation of the libraries above - especially pandas_

The other useful resource is Stack Overflow - if you have a question that sounds like 'how do I do {x}' then someone will probably have answered it on SO. Questions are also tagged by library, so if you have a particular pandas question you can go to https://stackoverflow.com/questions/tagged/pandas (just replace the 'pandas' in the URL with whatever library you're trying to use).

Generally answers on SO are probably a lot closer to getting you up and running than the documentation. Once you get used to the library then the documentation is generally a quicker reference. We will cover strategies for getting help in class.

Topics that we'll be discussing in this session include:

- Robust Regression - http://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors


## Git links

We will be using GitHub and GitKraken to share code between pairs. We will go through the whole workflow in detail in class, but here are some useful links for reference:

- GitKraken interface basics: https://support.gitkraken.com/start-here/interface
- Staging and committing (save current state -> local history): https://support.gitkraken.com/working-with-commits/commits
- Pushing and pulling (sync local history <-> GitHub history): https://support.gitkraken.com/working-with-repositories/pushing-and-pulling
- Forking and pull requests (request to sync your GitHub history <-> someone else's history - requires a _review_):
  - https://help.github.com/articles/about-forks/
  - https://help.github.com/articles/creating-a-pull-request-from-a-fork/

# Exercise:  Apply statsmodels to the synthetic drilling hole data

Statsmodels has an API with similarities to scikit-learn, but uses statistical language (particularly as used in financial and economic models) rather than the terminology that is more common in machine learning.  For example, statsmodels refers to endogenous and exogenous variables, where scikit-learn would say targets and features.  In many ways the two libraries reflect the differences in philosophy between how people with a statistical modelling background work and how people with machine learning or computing backgrounds work.  Scikit-learn focuses on training and validation error curves and cross-validation to choose a model, whereas statsmodels provides metrics for hypothesis tests and goodness-of-fit.

We'll briefly look at a typical report that statsmodels generates after fitting.

In [None]:
# Install required packages if using jupyterhub
# %pip install -r ../requirements.txt

In [None]:
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

from regression_help import create_composition_dataframe, create_observations, create_templates_matrix

templates = create_templates_matrix()
compositions = create_composition_dataframe(150)
observations = create_observations(compositions, templates)

In [None]:
template_names = list(compositions.keys()) + ['-']
fig, axs = plt.subplots(nrows=3, ncols=2, constrained_layout=True)
for j in range(3):
    for i in range(2):
        if j*2 + i < 5:
            axs[j, i].plot(templates[:, j*2 + i])
            axs[j, i].set_title(template_names[j*2 + i])
        else:
            axs[j, i].axis('off')
plt.show()

In [None]:
plt.plot(observations[:, 3])

In [None]:
X = templates
y = observations[:, 3]
model = sm.OLS(y, X).fit()

In [None]:
plt.plot(model.predict(templates))

In [None]:
model.summary()

Let's compare this against ground truth:

In [None]:
compositions.loc[3]

# Exercise:  Apply PCA to the synthetic drilling hole data

We have just worked through the theory behind principal components analysis.  Let's see how we can use it in scikit-learn.  We have also seen how our drilling data problem has many correlated variables, which we have good reason to expect have a lower-dimensional structure.

## Look at the correlation matrix

Let's look at the correlation matrix for the original instrument-observed features.  As you've seen, the feature variables are highly correlated with each other.  In real-world situations, if we wanted to use linear least squares to find the underlying templates, this collinearity can cause linear least squares to fail, particularly as we will often have fewer observations than there are unknown variables (426).
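
The sketch below illustrates the problem on made-up numbers (the sizes and names are assumptions, not the drilling data). With fewer observations than features, the rank of the data matrix cannot exceed the number of observations, so a least squares fit has no unique solution for the coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n_observations, n_features = 20, 50   # fewer observations than features

# Made-up data: 50 features that are all mixtures of 3 hidden signals
hidden_signals = rng.normal(size=(n_observations, 3))
mixing_weights = rng.normal(size=(3, n_features))
features = hidden_signals @ mixing_weights
features += 0.01 * rng.normal(size=features.shape)

# rowvar=False tells NumPy that each column is a variable (feature)
toy_correlation = np.corrcoef(features, rowvar=False)

# The rank is limited by the number of observations, so the 50 unknown
# coefficients of a least squares fit are not uniquely determined
feature_rank = np.linalg.matrix_rank(features)

print(toy_correlation.shape)
print(feature_rank)
```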

Seaborn's heatmap is useful for showing these correlations visually, but it can be very slow.  Matplotlib's matshow is faster, so try it if heatmap doesn't work well on your machine.

In [None]:
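import seaborn as sns
# Each row of observations is one instrument feature, which is the
# layout np.corrcoef expects by default (rows are variables).
correlation_matrix = np.corrcoef(observations)
sns.heatmap(correlation_matrix)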

In [None]:
plt.matshow(correlation_matrix)

In [None]:
# Let's plot the first 50 observations over each other.
fig, ax = plt.subplots(1,1)
for observation_idx in range(0, 50):
    ax.plot(observations[:,observation_idx])

This quick exploratory plot immediately tells us there is lots of structure in this dataset.  There are clearly at least two kinds of sample here.

## Apply PCA to observations matrix

Look at scikit-learn's documentation to find the module that provides PCA, and apply it to the observations using 15 components. You'll see that PCA follows the same API we've used for linear regression: a model instantiation followed by a call to the fit function.
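
As a hedged illustration of that instantiate-then-fit pattern, the sketch below runs PCA on random numbers rather than on the drilling observations. The names `toy_data` and `toy_pca` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Made-up data: 100 samples (rows), each with 30 features (columns)
toy_data = rng.normal(size=(100, 30))

# Instantiate the model, then fit it, just as with linear regression
toy_pca = PCA(n_components=15)
toy_pca.fit(toy_data)

# One row per component, one column per original feature
print(toy_pca.components_.shape)
```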

In [None]:
from  import 

pca = 
pca.fit()

Let's have a look at the components

In [None]:
print(pca.components_)
print(len(pca.components_[0]))

Oops!  Can you see what's wrong here?  The components should have the same length as the original observations (426 elements), but these components have only 150 elements.  This means that the observations array isn't organised properly for PCA fitting.  PCA thinks that we have 426 observations, each with 150 features.  To fix this we should transpose the matrix.

It's always good to look at the dimensions of our arrays to ensure we're not making trivial but annoying mistakes like this.
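
A quick way to catch this kind of mistake is to print the `shape` of the array before fitting, since scikit-learn expects rows to be samples and columns to be features. The sketch below uses a random array with the same layout as our observations (an assumption for illustration only).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Stand-in for the observations: 426 features (rows) by 150 samples (columns)
wrong_way_round = rng.normal(size=(426, 150))
right_way_round = wrong_way_round.T   # same result as .transpose()

print(wrong_way_round.shape)   # rows are features: not what PCA expects
print(right_way_round.shape)   # rows are samples: what PCA expects

# The length of each component follows the number of columns
components_wrong = PCA(n_components=15).fit(wrong_way_round).components_
components_right = PCA(n_components=15).fit(right_way_round).components_
print(components_wrong.shape, components_right.shape)
```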

In [None]:
transposed_observations = observations.transpose()

Let's fit the PCA again on the transposed observations.

And look at the number of features in a component again.

That's better!  We clearly have many more elements in these component arrays now.

Let's look at the explained variance, as discussed in the PowerPoint slides.  Intuitively, `explained_variance_` sounds like what we want, and some documentation on the web suggests the same.  But scikit-learn is a living project, and this attribute now returns the absolute variance along each component (the eigenvalues of the covariance matrix) rather than the fraction of the total variance that each component explains.

In [None]:
print(pca.explained_variance_)

To see each component's share of the total variance, we should use `explained_variance_ratio_`, like this:

In [None]:
print(pca.explained_variance_ratio_)

This says that the first component accounts for about 86% of the variation, the second for about 1%, and each later component for less than 1%.

PCA has compressed 87% of the variation in ~400 features into just two transformed features!
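
To see how the two attributes relate, here is a small sketch on made-up data (the scaling factors are arbitrary assumptions). Dividing each absolute variance by the total variance of the data should reproduce the ratios that scikit-learn reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Made-up data: six features with very different spreads
feature_scales = np.array([5.0, 2.0, 1.0, 0.5, 0.5, 0.1])
toy_data = rng.normal(size=(200, 6)) * feature_scales

toy_pca = PCA(n_components=3).fit(toy_data)

# Total variance of the data: sum of the per-feature sample variances
total_variance = toy_data.var(axis=0, ddof=1).sum()

# Dividing the absolute variances by the total gives the ratios
manual_ratio = toy_pca.explained_variance_ / total_variance

print(toy_pca.explained_variance_)
print(toy_pca.explained_variance_ratio_)
print(manual_ratio)
```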

In [None]:
# What do these principal components look like?
plt.plot(pca.components_[0])

In [None]:
# How does this compare to our quartz template?  Remember,
# with PCA we've only seen the observations.  The PCA transform
# didn't know what the templates were beforehand.
plt.plot(templates[:, 0])

As you can see, the principal components may not recover the original templates, but what they do find are patterns that can help distinguish the templates from each other.  Notice how the first three peaks do seem to relate to the first three in the template, though inverted (mathematically, this doesn't matter).  Interestingly, there is a fourth peak present.  This is likely because that fourth peak is important for distinguishing quartz from another phase that otherwise looks similar.  Note that it has the opposite sign to the first three peaks.  It can be very hard (and it's often academic) to interpret qualitatively what the components mean.  But in this case the component does have a clear relevance.
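
Why doesn't the inversion matter? A component and its scores can both be multiplied by -1 without changing what they describe together. The sketch below checks this on random data (an assumption for illustration) by rebuilding the data from scores and components before and after flipping the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
toy_data = rng.normal(size=(50, 8))

# Keep all 8 components so the data can be rebuilt exactly
toy_pca = PCA(n_components=8).fit(toy_data)
scores = toy_pca.transform(toy_data)

# Rebuild the data from the scores and the components
rebuilt = scores @ toy_pca.components_ + toy_pca.mean_

# Flip the sign of the first component and of its scores
flipped_components = toy_pca.components_.copy()
flipped_scores = scores.copy()
flipped_components[0] *= -1
flipped_scores[:, 0] *= -1
rebuilt_flipped = flipped_scores @ flipped_components + toy_pca.mean_

print(np.allclose(rebuilt, rebuilt_flipped))
```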

In [None]:
# Dilithium?
plt.plot(pca.components_[1])

In [None]:
# The dilithium template
plt.plot(templates[:, 1])

This is interesting.  In the mixtures dilithium and quartz often appear together.  The principal component has a fourth peak that is sensitive to the dilithium peak.  The quartz peak was inverted as it was trying to filter it out.

Note: During the initial presentation I hadn't looked at it closely enough and thought that the peak was offset a bit.  This happens sometimes with principal components when there are interferences, and the principal component becomes sensitive to a leading or falling edge.  But it doesn't seem to be happening in this particular instance.

Let's plot them over each other to be sure:

In [None]:
plt.plot(templates[:, 1])
plt.plot(pca.components_[1])

In [None]:
# And the third, kryptonite phase,
# followed by the third principal component
plt.plot(templates[:, 2])

In [None]:
plt.plot(pca.components_[2])

In [None]:
# And the unobtainium phase, followed by the fourth component
plt.plot(templates[:, 3])

In [None]:
plt.plot(pca.components_[3])

This looks better than I expected, for unobtainium.  But there's a good chance that PCA regression will struggle to predict unobtainium well, especially when there isn't very much present.

We can plot explained variance like this:

In [None]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

We seem to hit a sweet spot at about 2 components, and then explained variance only gradually improves after this point.  This is a good result, given that we know these samples only contain four minerals, and all other variance is from noise.
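
Instead of reading the sweet spot off the plot, we could also pick the smallest number of components that reaches a chosen share of the variance. This is only a sketch: the 95% target and the made-up data (three hidden signals plus noise) are assumptions, not values from this exercise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)

# Made-up data: 40 features driven by 3 hidden signals plus a little noise
hidden_signals = rng.normal(size=(150, 3))
mixing_weights = rng.normal(size=(3, 40))
toy_data = hidden_signals @ mixing_weights
toy_data += 0.05 * rng.normal(size=toy_data.shape)

toy_pca = PCA(n_components=15).fit(toy_data)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

target = 0.95  # assumed threshold, choose to suit the problem

# argmax returns the first position where the condition is True
n_components_needed = int(np.argmax(cumulative >= target)) + 1
print(n_components_needed)
```

Note that `np.argmax` returns 0 if the target is never reached, so it is worth checking `cumulative[-1]` against the target before trusting the answer.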

How do we look at the transformed data?  We use the transform method to find the "scores" associated with each observation.  These are the observations expressed in the new feature space that PCA has found.  Each row in `observations_in_pc_space` has 15 elements, one per component, forming the reduced feature space.  So we have a new feature space of 15 dimensions, when we used to have 426.
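
Under the hood, the scores are the mean-centred observations projected onto the components. The sketch below checks this on random data (the array sizes are assumptions for illustration) and assumes the default `whiten=False` setting.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)

# Made-up data: 150 samples (rows), each with 60 features (columns)
toy_data = rng.normal(size=(150, 60))

toy_pca = PCA(n_components=15).fit(toy_data)
toy_scores = toy_pca.transform(toy_data)

# Centre the data with the mean learned during fit, then project it
manual_scores = (toy_data - toy_pca.mean_) @ toy_pca.components_.T

# One row per observation, one column per principal component
print(toy_scores.shape)
print(np.allclose(toy_scores, manual_scores))
```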

In [None]:
observations_in_pc_space = pca.transform(transposed_observations)

In [None]:
# Here's the first observation in the PC feature space.
plt.plot(observations_in_pc_space[0])

In [None]:
# Here's the third observation
plt.plot(observations_in_pc_space[2])

The point of all of this is to reduce the dimensionality so that we can better see the underlying structure.

We'd like to be able to plot the "scores" against each other, and look for patterns.

That's hard in 15 dimensions, but as we've seen, we can explain most of the variance in only a couple of dimensions.  Let's plot PC1 against PC2 for all observations.

In [None]:
plt.scatter(observations_in_pc_space[:, 0], observations_in_pc_space[:, 1])

Oh look!  There are clusters!

This tells us that there are three distinct groupings of our samples in the set of 150.  This is based only on the first couple of principal components.

What about PC2 and PC3?  Are there any clusters there?

# Exercise:  Look at the Octane dataset

Try to study the Octane dataset in the same manner!  You'll need pandas to read the dataset octane.csv from the data folder. You'll need to disregard the columns with the sample names and the octane ratings, as we're only interested in the spectra.
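
If you haven't dropped columns in pandas before, here is a minimal sketch on a tiny made-up table. The column names `sample` and `octane` are hypothetical, so check the real names in octane.csv (for example with `octane_dataframe.columns`) before adapting it.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real file (hypothetical columns)
toy_csv = io.StringIO(
    "sample,octane,w1,w2,w3\n"
    "A,88.1,0.1,0.2,0.3\n"
    "B,87.5,0.2,0.1,0.4\n"
)

toy_dataframe = pd.read_csv(toy_csv)

# Drop the non-spectral columns, keeping only the spectra
toy_spectra = toy_dataframe.drop(columns=["sample", "octane"])

# .values gives a NumPy array: one row per sample, one column per wavelength
toy_observations = toy_spectra.values
print(toy_observations.shape)
```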

In [None]:


# Open the octane dataset from disk, then drop the sample names and octane
# numbers so that only the spectra (our observations) remain.
octane_dataframe = 
octane_dataframe = 
observations = octane_dataframe.values

In [None]:
# Let's plot all of the observations over each other.
for i in range(observations.shape[0]):
    plt.plot(observations[i, :])

It's a bit hard to tell from this plot, but there may be a couple of clusters present in this data.

Let's look at the principal components. We don't need to transpose these observations, so this real life data is easier to use than the synthetic drillholes! First, create a PCA model, fit it to the observations, and print the explained variance ratios.

Now let's plot the cumulative explained variance.

In this dataset the first component is able to account for 92.3% of the variance in the dataset!  That's even better than with our synthetic drilling data.  It looks like we need about 4 components before there is little contribution from the remaining components.

In [None]:
# Let's look at that first component
plt.plot()

In [None]:
# And the second component


Without domain knowledge in the industry, interpreting these shapes is hard, or impossible.  Fortunately, they're usually not what we're interested in.  We've done PCA to help us see clusters, or as a transformation prior to applying a machine learning method (such as linear regression in the simplest case). Transform the observations to get their coordinates in the principal components' space.

In [None]:
observations_in_pc_space = 

Now use a scatterplot to visualize the first two components as we've done in the previous exercise.

In [None]:
plt.scatter()

This is interesting!  It supports, and makes clearer, the idea that there are likely to be two clusters here.  In this situation domain knowledge is helpful to understand what these two clusters could be.  If we know the row indices we can separate out the observations on the right.  Often just eyeballing the two groups of samples (physically, in the plant, or at the instrument) is enough to reveal why they're different, and whether the right-hand group should be treated as outlier observations to be removed or is important and central to the process variation being modelled.

Let's look at the second and third components.

There is no obvious clustering between the second and third components. How about in the third and fourth components?

Yes, there is clustering in the third and fourth!

Let's find out which observations are not like the others here.  We can split them by looking at PC4, maybe choosing those that have a score less than -0.03 or greater than 0.03.

By looking at PC4, and if we had domain knowledge, we may have been able to deduce why these samples are different without further inspection.
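
As a side note, both tails can be picked out in one step by thresholding the absolute value of the scores. The numbers below are made up for illustration and are not the Octane scores.

```python
import numpy as np

# Made-up scores on one principal component
toy_scores = np.array([0.01, -0.05, 0.04, 0.0, -0.02])

# np.where returns a tuple of arrays, so [0] extracts the row indices
far_from_zero = np.where(np.abs(toy_scores) > 0.03)[0]
print(far_from_zero)
```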

In [None]:
PC4 = observations_in_pc_space[:,3]

In [None]:
# np.where returns a "tuple of arrays".  This code
# will extract the indices of the rows that fit the
# "where" criteria
PC4_big = np.where((PC4 > 0.03))[0]
PC4_small = np.where((PC4 < -0.03))[0]

Let's look at PC1 too.  The split is obvious at a score of 0.4 on that component.

In [None]:
PC1 = observations_in_pc_space[:,0]

In [None]:
PC1_strange = 

Let's look at the observation rows that we've extracted.

In [None]:
PC1_strange

That means that our strange observations on PC1 have indices 24, 25, 35, 36, 37 and 38.

In [None]:
PC4_big

In [None]:
PC4_small

That's interesting!  On the PC4 axis, our strange observations are:
24, 25, 35, 36 and 38.  37 is missing from this set.  But out of curiosity, did we almost get it?

In [None]:
PC4[37]

Nope, it's just strange on PC1.  Let's plot these strange observations.

In [None]:
fig, ax = plt.subplots(1,1)
for strange_observations in PC1_strange:
    plt.plot(observations[strange_observations])

And the others:

In [None]:
fig, ax = plt.subplots(1,1)
# Use the row count instead of a hard-coded number of observations
all_observations = np.arange(observations.shape[0])
for normal_observations in np.setdiff1d(all_observations, PC1_strange):
    plt.plot(observations[normal_observations])

The strange nature of these is quite clear.  What we do with them is a domain expert question.  Importantly, they really jump out when we look at them with PCA.  For argument's sake, let's call them outliers and say we want to remove them.  We can easily split the outliers from the non-outliers now.

In [None]:
outliers = observations[PC1_strange]

In [None]:
# Use the row count instead of a hard-coded number of observations
all_observations = np.arange(observations.shape[0])
inliers = observations[np.setdiff1d(all_observations, PC1_strange)]

If we now wanted to create a regression model to predict Octane, we could iterate our PCA process and choose to build the regression model only on the inlier observations.  It's likely to give a better model than the one that uses all observations.  We could also use the original model to identify whether new observations are outliers, for example by using the original 0.4 threshold on PC1.  If a new observation has a PC1 score (from the original model) greater than this, then we flag it to the user as an outlier instead of trying to do an Octane prediction.  Perhaps in the Octane context it's an indicator of a serious problem in the petrol production plant.
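
A minimal sketch of that flagging idea is shown below. The function name `is_outlier`, the random training data and the test observations are all made up for illustration. Only the 0.4 threshold on PC1 comes from the discussion above, and it would need to be re-checked against a real fitted model.

```python
import numpy as np
from sklearn.decomposition import PCA

def is_outlier(fitted_pca, new_observation, pc1_threshold=0.4):
    """Return True if the PC1 score of one observation exceeds the threshold."""
    # transform expects a 2D array, so reshape the single spectrum to one row
    scores = fitted_pca.transform(new_observation.reshape(1, -1))
    return bool(scores[0, 0] > pc1_threshold)

rng = np.random.default_rng(7)

# Made-up stand-in for the original observations used to fit the model
training_data = rng.normal(scale=0.05, size=(39, 20))
original_pca = PCA(n_components=4).fit(training_data)

# A typical observation (the training mean) and one pushed far along PC1
typical = original_pca.mean_
unusual = original_pca.mean_ + 5.0 * original_pca.components_[0]

print(is_outlier(original_pca, typical))
print(is_outlier(original_pca, unusual))
```

The important design point is that the PCA model is fitted once on the original data and then reused unchanged, so that scores for new observations are comparable with the threshold chosen earlier.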