# Supervised Learning

Supervised learning uses labeled data.
The labels are human-made annotations that describe the data.
In our case, the data will be text documents.
One common task is classification, where the goal is to assign each data point to one of a set of classes.
Documents can be classified, for example, by type, topic, language, or jurisdiction.

## Visualizing Classes

To get an idea of how classification works, we'll first look at a simple example that doesn't involve text.
We can simulate a data set with only two features, so that each data point can be plotted in two dimensions.
Real data sets rarely have so few features and are harder to visualize.

First, we import some functions we will need.
- scikit-learn (`sklearn`) is a library for machine learning.
- matplotlib is a library for plotting and visualizing results.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

Then, we use sklearn's `make_classification` function to generate a simulated (synthetic) data set with two informative features.

In [None]:
data, labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                   random_state=2, n_clusters_per_class=1)
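
Before plotting, it can help to check what the function returned. The sketch below regenerates the same data and prints the array shapes and the distinct label values. It relies on the scikit-learn defaults of 100 samples and two classes, since the call above sets neither `n_samples` nor `n_classes`.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as above; n_samples and n_classes are left at their defaults.
data, labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                   random_state=2, n_clusters_per_class=1)

# One row per data point, one column per feature.
print(data.shape)

# One label per data point.
print(labels.shape)

# The distinct class labels that occur in the data set.
print(np.unique(labels))
```

Each row of `data` is one point in the plane, and the entry at the same position in `labels` is its class.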

Now, we can plot the data points in a *scatter plot*.
The two classes have different colors.

In [None]:
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=plt.cm.Set1);

:::{admonition} Decision Boundaries
We call the "border" between the classes the *decision boundary*.
We can see that these classes are nearly *linearly separable*.
Classes are linearly separable when we can draw a straight line that separates them.
Here, at least one of the gray points lies within the red "area".
Without such points, the data would be linearly separable.

Some simple machine learning algorithms can only separate data linearly, while more powerful algorithms, such as *neural networks*, can make arbitrary decision boundaries.
:::
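
To make the idea of a linear decision boundary concrete, here is a minimal sketch that fits a linear classifier to the same synthetic data. Logistic regression is our choice for illustration only; any linear model would do. The variable names `clf`, `w0`, `w1`, and `b` are ours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Recreate the same synthetic data set as above.
data, labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                   random_state=2, n_clusters_per_class=1)

# Logistic regression is a linear classifier:
# with two features, its decision boundary is a straight line.
clf = LogisticRegression()
clf.fit(data, labels)

# The boundary is the set of points where w0 * x0 + w1 * x1 + b = 0.
w0, w1 = clf.coef_[0]
b = clf.intercept_[0]
print(f"Boundary: {w0:.2f} * x0 + {w1:.2f} * x1 + {b:.2f} = 0")

# Fraction of training points on the correct side of the line.
# This is below 1.0 if some points fall on the wrong side.
accuracy = clf.score(data, labels)
print(f"Training accuracy: {accuracy:.2f}")
```

Note that the accuracy is measured on the same data the model was fitted to, so it only tells us how well a straight line separates these particular points.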

## Class Labels

Below is an example of a document, in this case a contract.
The label (document type) is not part of the document.
If the documents are stored as tabular data, the labels are in one or more separate columns.
The first column could contain the document while the second column could contain the label.

```
Agreement

1. Introduction: The lender is Bank Cred AS.
The borrower is the person or persons (borrower and co-applicant) who apply
for and are granted a loan with this Agreement.
Payment: The credit is paid to the account number the Borrower states.

2. Repayment: The Borrower repays the loan amount at fixed term amounts, including 
interest and term fees (i.e. annuity loans) as specified in the Loan Agreement.

3. Late Payment Interest: In the event of late payment, late payment interest accrues
at the interest rate determined in accordance with the Act relating to Interest on
Overdue Payments.

Signature Place/date: Oslo, 08.08.2021
For Bank Cred AS Bank adviser Maria Wilson
Name: Christopher Thomson Borrower
```

This document could have different labels (classes) depending on the needs of our application.
For example, if we're interested in document types, it could be labeled "contract".
If we're interested in fields of law, it might be labeled "contract law" or "banking law".
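
As a sketch of the tabular layout described above, the example below stores documents in a pandas `DataFrame` with one text column and two label columns. The column names, the shortened texts, and the second row are made up for illustration. Using pandas is also our assumption; any tabular format would work.

```python
import pandas as pd

# Hypothetical toy table: one text column and two label columns.
# The texts are shortened placeholders, and the second row is invented.
documents = pd.DataFrame({
    "text": [
        "Agreement 1. Introduction: The lender is Bank Cred AS. ...",
        "The court finds that the defendant ...",
    ],
    "document_type": ["contract", "judgment"],
    "field_of_law": ["contract law", "criminal law"],
})

# Pick the label column that matches the application.
label_column = "document_type"
texts = documents["text"]
labels = documents[label_column]

print(labels.tolist())
```

Switching `label_column` to `"field_of_law"` would give a different classification task over the same documents.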

## Vectorizing Text

Before text documents can be used in machine learning, they must be vectorized, that is, converted into numeric vectors.