# Encoding and Classifying Text

This lab is an introduction to machine learning with scikit-learn (sklearn), using text classification as the running example.

* Read about the [20 newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset) data set. The data set provides training and test splits.
* Experiment with feature extraction and classification. For example:
  * Change which newsgroups are included (beware of using too much data)
  * Change the feature extraction: word counts vs tf-idf, ngrams for words and characters
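  * Try using cleaned data vs unprocessed data. You can change this in the data loader, for example with the `remove` argument of `fetch_20newsgroups`, which can strip `'headers'`, `'footers'`, and `'quotes'`.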
* Try out some classifiers
  * [kNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
  * [Decision trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
  * [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) (`SGDClassifier` with its default hinge loss trains a linear SVM)
* Plot your data and classification results using PCA and matplotlib.

This is intended as an exercise, not as an examination, so try to ask as many questions as you can during the lab.

In [None]:
import numpy as np

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D

## Load the data set

Let's load the data.

In [None]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

print(newsgroups_train.data[:3])

## Vectorization

A vectorizer transforms your text into numeric feature vectors. See more in the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Look at the loaded newsgroup objects to find the text (`newsgroups_train.data`), the label names (`newsgroups_train.target_names`), and the integer label encodings (`newsgroups_train.target`).

## Classification

Import a classifier from sklearn, then try to train it on your training set.
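
One possible shape for this step is sketched below with the three classifiers linked at the top of the lab. The tiny training and test sets are made up so the cell is self-contained, and the hyperparameters (`n_neighbors=1`, `random_state=0`) are arbitrary choices for the toy data rather than recommendations. With the real data, replace the toy lists with `newsgroups_train.data`, `newsgroups_train.target`, and their test counterparts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration (0 = space, 1 = graphics).
train_docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new space shuttle",
    "render the image with smooth shading",
    "the graphics card draws polygons fast",
]
y_train = [0, 0, 1, 1]
test_docs = ["the shuttle is in orbit", "shading polygons in an image"]
y_test = [0, 1]

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=1),  # 1 neighbor: only 4 training docs
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Linear SVM (SGD)": SGDClassifier(loss="hinge", random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)          # learn from the training split
    y_pred = clf.predict(X_test)       # predict labels for unseen texts
    scores[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

Accuracy on two test sentences says nothing about the classifiers themselves; the point is the `fit` / `predict` / score pattern, which carries over unchanged to the newsgroups splits.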

## Linear dimensionality reduction

Principal component analysis (PCA) projects the data onto the linear subspace that retains the most variance. Keeping only the first two or three components makes the data plottable.
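
To see what that projection amounts to, the sketch below computes it by hand on synthetic data: center the data, take the singular value decomposition, and project onto the top right singular vectors. It then compares the result with sklearn's `PCA`. The data and all names (`X_demo`, `X_manual`, and so on) are made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 samples, 5 features with very different scales.
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# PCA by hand: center each feature, then take the SVD of the centered data.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions, ordered by decreasing variance.
X_manual = X_centered @ Vt[:2].T  # coordinates in the top-2 subspace

# The same projection with sklearn.
X_sklearn = PCA(n_components=2).fit_transform(X_demo)

# Each component is only defined up to its sign, so compare absolute values.
same_projection = np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6)
print(same_projection)
```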

In [None]:
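from sklearn.decomposition import PCA
# X is assumed to be the dense feature matrix from the vectorization step
# (the plots below index it like a NumPy array; a sparse result needs .toarray()).
pca = PCA(n_components=min(100, X.shape[1]//2))
X_pca = pca.fit_transform(X)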

plt.figure(figsize=(6, 4))
plt.plot(np.cumsum(pca.explained_variance_ratio_)*100)
plt.ylabel("Cumulative explained variance [%]")
plt.xlabel("n components")
plt.show()
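
A curve like this can also be read off numerically. The sketch below, on synthetic data, finds the smallest number of components whose cumulative explained variance reaches a target. The 90% threshold and the names `X_demo`, `pca_demo`, and `n_keep` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in data: 300 samples, 20 features with decreasing spread.
X_demo = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.1, 20)

# Fit PCA with all components so the full variance curve is available.
pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.90  # hypothetical target: keep 90% of the variance
# argmax on a boolean array returns the index of the first True entry.
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(f"{n_keep} components explain {cumulative[n_keep - 1]:.1%} of the variance")
```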

In [None]:
def plot_2d(X, y, labels, ax=None):
  # Remember whether this function creates the figure, because ax is reassigned below.
  own_figure = ax is None
  if own_figure:
    fig = plt.figure(figsize=(6, 6), dpi=100)
    ax = fig.subplots(1, 1)
  ax.scatter(X[:, 0], X[:, 1], c=y, s=15, cmap='tab10', alpha=.5)
  # Write each class name at the centroid of its points.
  for label in np.unique(y):
    ax.text(np.mean(X[y==label, 0]),
            np.mean(X[y==label, 1]),
            labels[label],
            fontsize=16, zorder=1)
  if own_figure:
    fig.tight_layout(pad=0)
    plt.show()

fig = plt.figure(figsize=(13, 6))
ax = fig.subplots(1, 2)
plot_2d(X, y, labels, ax[0])
plot_2d(X_pca, y, labels, ax[1])
fig.tight_layout(pad=0)
plt.show()

In [None]:
def plot_3d(X, y, labels, angle=None, ax=None):
  # Remember whether this function creates the figure, because ax is reassigned below.
  own_figure = ax is None
  if own_figure:
    fig = plt.figure(figsize=(8, 6), dpi=100)
    ax = fig.add_subplot(1, 1, 1, projection='3d')
  ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y,
             alpha=.5, cmap='tab10')
  # Write each class name at the centroid of its points.
  for label in np.unique(y):
    ax.text(np.mean(X[y==label, 0]),
            np.mean(X[y==label, 1]),
            np.mean(X[y==label, 2]),
            labels[label],
            fontsize=10,
            horizontalalignment='center',
            verticalalignment='center')
  if angle is not None:
    # Fixed elevation of 20 degrees; angle sets the azimuth.
    ax.view_init(20, angle % 360)
  if own_figure:
    fig.tight_layout(pad=0)
    plt.show()

fig = plt.figure(figsize=(8, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1, projection='3d')
plot_3d(X_pca, y, labels, ax=ax)
plt.show()