## Tutorial 24: Machine Learning in Python with scikit-learn

Today we will replace our custom code for building predictive models with
**sklearn**, a popular module for many tasks in machine learning. You'll 
find that most of the well-known predictive models exist in the module,
and many others extend the same structures when implementing new models.

### Loading the libraries

We will make use of the three class modules, as well as **numpy**, **matplotlib**, and **sklearn**:

In [None]:
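import wiki
import iplot
import wikitext

import numpy as np
import matplotlib.pyplot as plt

# Import the submodules explicitly: a bare `import sklearn` does not
# guarantee that sklearn.linear_model and sklearn.metrics are loaded
import sklearn
import sklearn.linear_model
import sklearn.metrics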

In [None]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 2
assert iplot.__version__ >= 3

### Get some data

For today, we will once again take links from the "important publications in
philosophy" page to build a corpus for prediction. We will make a `WikiCorpus`
object to simplify the computation of metrics for the page. Below I have removed
two pages that give our Windows users some trouble.

In [None]:
np.random.seed(0)
links = wikitext.get_internal_links('List_of_important_publications_in_philosophy')['ilinks']
links.remove("What_Is_it_Like_to_Be_a_Bat?")
links.remove("What_is_Life?_(Schrödinger)")
links = np.random.permutation(links)

In [None]:
wcorp = wikitext.WikiCorpus(links, num_clusters=15, num_topics=15)

As before, we now extract the number of internal links on each page, whether
the page is translated into German ('de'), and five predictor variables that we
will try to use in constructing our models.

In [None]:
num_ilinks = wcorp.meta['num_ilinks'].values
# np.int was removed from recent numpy releases; the built-in int works everywhere
lan_version = np.array(['de' in x for x in wcorp.meta['langs']], dtype=int)

num_sections = wcorp.meta['num_sections'].values
num_images = wcorp.meta['num_images'].values
num_elinks = wcorp.meta['num_elinks'].values
num_langs = wcorp.meta['num_langs'].values
num_chars = np.array([len(x) for x in wcorp.meta['doc'].values])

### Predictor matrix

Last time we somewhat awkwardly kept the five predictor variables separate from one
another. In my solutions for Tutorial 23, I put them all into a list, but this was
still somewhat clunky. Now, let's combine them using **numpy** into a single
matrix of numbers:

In [None]:
x = np.stack([num_sections, num_images, num_elinks, num_langs, num_chars], axis=1)
x

This two-dimensional array of numbers should remind you of the term-frequency matrix that
I showed when introducing the **gensim** module. The array is arranged so that it has one
row for each sample of data and one column for each variable. There are a few helpful things
to know about how numpy arrays work. First, they have a `shape` attribute that gives the dimensions
of the array. Here, we have 734 rows and 5 columns:

In [None]:
x.shape
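
To see how stacking and `shape` fit together, here is a minimal sketch on made-up numbers. The arrays `a`, `b`, `c`, and `m` are hypothetical stand-ins and not part of the corpus:

```python
import numpy as np

# Three toy "variables", each measured on the same four samples
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
c = np.array([100, 200, 300, 400])

# axis=1 places each variable in its own column
m = np.stack([a, b, c], axis=1)
print(m.shape)      # (4, 3): four samples, three variables
print(m[0, :])      # the first sample: 1, 10 and 100

# axis=0 would instead give one row per variable
m_rows = np.stack([a, b, c], axis=0)
print(m_rows.shape) # (3, 4)
```

Using `axis=1` is what gives the "one row per sample, one column per variable" layout that sklearn expects.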

Also, to select a subset of the matrix we use slicing notation similar to what we saw in
Tutorials 4 and 9. However, now we need notation to describe both the rows and the columns,
which are separated by a comma. A colon on its own indicates that all rows/columns should be taken.
Look at these examples:

In [None]:
x[0, :] # the first row, all columns

In [None]:
x[:3, :] # the first 3 rows and all columns

In [None]:
x[:2, :2] # the first 2 rows and first 2 columns
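
If you want to experiment with the notation without loading the corpus, the following sketch applies the same kinds of slices to a small made-up array (the name `toy` is just for illustration):

```python
import numpy as np

# A 4-by-3 array holding the numbers 0 through 11, filled row by row
toy = np.arange(12).reshape(4, 3)

first_row = toy[0, :]    # values 0, 1, 2
last_col = toy[:, -1]    # values 2, 5, 8, 11 (negative indices count from the end)
top_left = toy[:2, :2]   # rows 0-1 and columns 0-1
rest = toy[2:, :]        # every row from row 2 onward, all columns

print(first_row.shape, last_col.shape, top_left.shape, rest.shape)
```

Note that selecting a single row or column with an integer index returns a one-dimensional array, while slicing with a colon range keeps two dimensions.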

The slicing notation will be useful as we construct models from the data. For example,
let's now create the training and testing responses and matrices:

In [None]:
y_train = num_ilinks[:325]
y_test  = num_ilinks[325:]
x_train = x[:325, :]
x_test  = x[325:, :]

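The first 325 observations constitute the "training" set, and the rest of the data
(about the same amount) constitute the "testing" set. Because the links were randomly
permuted when we loaded them, this amounts to a random split. We will use these sets
throughout the tutorial.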
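
As an aside, sklearn has a helper, `train_test_split` in `sklearn.model_selection`, that performs this kind of split. The sketch below uses made-up arrays (`x_toy` and `y_toy` are hypothetical) and passes `shuffle=False` so that it reproduces a plain slice. It is an alternative to the slicing used in this tutorial, not a replacement for it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the predictor matrix and the response
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Manual split by slicing: first 6 samples for training, the rest for testing
x_tr, x_te = x_toy[:6, :], x_toy[6:, :]
y_tr, y_te = y_toy[:6], y_toy[6:]

# The same split through sklearn; shuffle=False keeps the original order
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_toy, y_toy, train_size=6, shuffle=False
)

print(np.array_equal(x_tr, x_tr2), np.array_equal(y_te, y_te2))
```

By default `train_test_split` shuffles the rows itself, which is handy when the data have not already been permuted.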

### Using sklearn

Now, on to actually using the sklearn module. We will see first how to build
a linear regression, but the nice thing about sklearn is that a similar set of
steps can be used to apply almost any algorithm to the data.

Start by constructing an instance of the model you want to build:

In [None]:
reg = sklearn.linear_model.LinearRegression()
reg

Printing out the model, as above, is not necessary, but it does show us all of the
input choices available for our model.

Next, call the `fit` method with the training matrix x followed by the response y
to learn the parameters of the model from the training data:

In [None]:
reg.fit(x_train, y_train)

Notice that this changes the model `reg` directly, so we do not need to save the
result. That is, sklearn uses an object-oriented design.

Now, if we call the `predict` method on a data matrix, predictions from the
model are returned as a numpy array:

In [None]:
reg.predict(x_train)

If we want to see the coefficients in the model itself, access the `coef_` attribute:

In [None]:
reg.coef_
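
To get a feel for what `coef_` contains, it can help to fit a model to data where the answer is known in advance. The following is a minimal sketch on synthetic data. The names `x_toy`, `y_toy`, and `toy_reg` are made up, and the response is constructed without noise so that the fit should recover the chosen numbers almost exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x_toy = rng.normal(size=(50, 2))

# Response built as 1 + 3 * (first column) - 2 * (second column), no noise
y_toy = 1.0 + 3.0 * x_toy[:, 0] - 2.0 * x_toy[:, 1]

toy_reg = LinearRegression()
toy_reg.fit(x_toy, y_toy)

print(toy_reg.coef_)       # one coefficient per column, close to 3 and -2
print(toy_reg.intercept_)  # the constant term, close to 1
```

There is one entry in `coef_` for each column of the training matrix, in the same order; the constant term is stored separately in `intercept_`.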

The `fit` and `predict` methods and the `coef_` attribute exist for all sklearn
estimators whenever they make sense for the given model.
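
The shared interface can be sketched by running two quite different regression models through the same loop. The data here are made up, and `KNeighborsRegressor` is chosen only as an example of a model with no coefficients to report:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
x_toy = rng.normal(size=(40, 3))
y_toy = x_toy[:, 0] + rng.normal(scale=0.1, size=40)

has_coef = {}
for model in [LinearRegression(), KNeighborsRegressor(n_neighbors=3)]:
    model.fit(x_toy, y_toy)        # same call for both models
    preds = model.predict(x_toy)   # same call for both models
    name = type(model).__name__
    has_coef[name] = hasattr(model, "coef_")
    print(name, preds.shape, has_coef[name])
```

Both models accept `fit` and `predict`, but only the linear model ends up with a `coef_` attribute, since a nearest-neighbors model has no coefficients.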

### Logistic regression

A very popular variation of linear regression, called logistic regression, exists
to work with classification tasks. The details are beyond the scope of our treatment today,
but let's see how the model works in practice. There are some extra features available for
classification tasks in sklearn.

Start by redefining the response variable to be whether a page appears in German:

In [None]:
y_train = lan_version[:325]
y_test  = lan_version[325:]

Next, construct the model:

In [None]:
logreg = sklearn.linear_model.LogisticRegression()
logreg

And fit it to the data (this works exactly the same as in the regression problem):

In [None]:
logreg.fit(x_train, y_train)

Now, if we call the prediction function it will spit out predictions that
are either 1 (page available in German) or 0 (page not available in German).

In [None]:
logreg.predict(x_train)

That is great, and exactly what we often want when doing prediction tasks. In some
situations, however, we do not want the predictions themselves but rather an estimate
of the probability that an input has a page in German. To get that, call the method
`predict_proba` (here, showing just the first 10 samples for illustration):

In [None]:
logreg.predict_proba(x_train)[:10, :]

The first column is the probability of observing a `0` and the second
is the probability of observing a `1`.
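
The relationship between `predict` and `predict_proba` can be checked on synthetic data. This is only a sketch: `x_toy`, `y_toy`, and `toy_clf` are made-up names, and the data are random numbers rather than anything from the corpus:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x_toy = rng.normal(size=(100, 2))

# Class 1 is more likely when the first column is large
y_toy = (x_toy[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

toy_clf = LogisticRegression()
toy_clf.fit(x_toy, y_toy)

proba = toy_clf.predict_proba(x_toy)
labels = toy_clf.predict(x_toy)

# classes_ gives the class label that each probability column refers to
print(toy_clf.classes_)

# Each row of probabilities sums to one
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# The predicted label is the class with the larger probability
most_likely = toy_clf.classes_[proba.argmax(axis=1)]
labels_match = np.array_equal(labels, most_likely)
print(rows_sum_to_one, labels_match)
```

In other words, `predict` reports the more probable class, while `predict_proba` shows how confident the model is in that choice.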

### Evaluating the model

There are also a number of helpful functions in sklearn for preprocessing our data
as well as evaluating the results. Let's produce the predicted classes for the 
testing set from our logistic regression:

In [None]:
y_test_hat = logreg.predict(x_test)

There are two metrics that I like to spend most of my time looking at. Namely,
the confusion matrix (which shows how many items are mis-classified):

In [None]:
sklearn.metrics.confusion_matrix(y_test, y_test_hat)

As well as the accuracy score:

In [None]:
sklearn.metrics.accuracy_score(y_test, y_test_hat)
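
To read these two outputs, it helps to work through a case small enough to count by hand. The labels below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]

# Accuracy is the share of samples on the diagonal of the confusion matrix
acc = accuracy_score(y_true, y_pred)
print(acc, np.trace(cm) / cm.sum())   # both are 0.75
```

Here three of the four true `0`s and three of the four true `1`s are classified correctly, so six of eight predictions are right. The off-diagonal entries show which kind of mistake the model makes, something the accuracy alone hides.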

Together, these give a good picture of how well the model performs.

### Try another model

Now, for some practice. Look at all of the available models from sklearn:

- http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Pick something that seems interesting, fit it to the training set, and use it to
make predictions on the testing set. Then, compute the confusion matrix and accuracy.

Does your model outperform the logistic regression example?

**Answer**:

### For next time

Next class we are going to learn how to build a predictive model using the
words within the text (this is in many ways a much more interesting task and
gives us more insight into the data). To do this, we need
one more module (it implements a new model, but does so in the style of
sklearn). To install it, run the following in a terminal (macOS) or the
Anaconda Navigator (Windows):

```
conda install -c conda-forge glmnet
```

Then, make sure that the library installed correctly by running the following line. I
am a bit worried that this library may give errors on some machines, so
please check it today before you head out.

In [None]:
import glmnet
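
If the import fails, one quick thing to check is whether Python can locate the package at all. The sketch below uses only the standard library; the variable name `glmnet_found` is made up:

```python
import importlib.util

# find_spec looks the package up without actually importing it
glmnet_found = importlib.util.find_spec("glmnet") is not None
print("glmnet found" if glmnet_found else "glmnet not found")
```

This only tells you whether the package is visible to the current Python environment. The import itself can still fail for other reasons, so running `import glmnet` remains the real test.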