# Tutorial: How Do We Know It's Working?
This tutorial is based on the official tutorial titled [Working With Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) by the `scikit-learn` team. Compared to that tutorial, this one uses a dataset that has been published by [Kaggle](https://www.kaggle.com) as part of a competition that they organised for [SMS Spam Detection](https://www.kaggle.com/c/sms-spam-detection). Furthermore, a number of changes have been made to the code of the original tutorial in order to include `pandas` in the data pre-processing stage, and to ensure compatibility with the updated version of `scikit-learn`.

In this tutorial you will learn how to:
* Extract feature vectors from text documents
* Load, inspect and pre-process a dataset of SMS messages
* Train a classifier to predict whether an SMS is spam (e.g. "You are rewarded with a $1500 Bonus Prize, call 09066364589") or not
* Use Grid Search to better tune the hyper-parameters of your Machine Learning pipeline
* Perform K-Means clustering and explore its results

In order to run this IPython Notebook, [Jupyter](http://jupyter.org/) should be installed on your machine. Besides Jupyter, the following Python packages should also be installed: (i) `pandas` and (ii) `scikit-learn`. The easiest way to install all of these together is with [Anaconda](https://www.anaconda.com/) (Windows, macOS and Linux installers available).

In [None]:
# We are using this library to detect the Python version with 
# which the Notebook kernel is running.
import sys 

In [None]:
print('You are running Python %s' % sys.version)
version_info_major = sys.version_info.major

# Loading the Kaggle Dataset
For the purposes of this tutorial, we will be using a dataset of SMS messages along with their classification labels (i.e. "spam" or "ham"). The dataset is stored as a binary `pickle` file, which resides at `./Kaggle/spam.pickle`.

In [None]:
kaggle_dataset_location = './Kaggle/spam.pickle' # The location of the Kaggle dataset

In [None]:
# cPickle should be loaded only when the kernel is running
# on Python 2.
if version_info_major < 3:
    import cPickle as pickle
else:
    import pickle

In [None]:
# Loading the binary-encoded pickle files from the designated location.
with open(kaggle_dataset_location, 'rb') as f:
    kaggle_dataset = pickle.load(f)

The `kaggle_dataset` variable contains the dataset as a Python dictionary of lists. We will be using the `pandas` library in order to transform this structure into a `pandas.DataFrame`, which will simplify data inspection and pre-processing.

In [None]:
import pandas as pd

kaggle_dataset_df = pd.DataFrame(kaggle_dataset)
# Printing the number of rows (data points) in the loaded DataFrame.
print('The number of rows in the loaded DataFrame is: %d' % len(kaggle_dataset_df)) 

We print the first 10 rows of the `DataFrame` in order to get an understanding of the structure of the dataset.

In [None]:
display(kaggle_dataset_df.head(n=10))

In [None]:
print(kaggle_dataset_df['Message'][1])
print(kaggle_dataset_df['Message'][11])

In [None]:
# Print all the available columns of the dataset.
print(kaggle_dataset_df.columns)
# Define the target column which we want to predict.
target_column = u'Class'

In [None]:
# Returns the category codes of each of the classes in the target-column.
# outputs = kaggle_dataset_df['Class'].astype('category').cat.codes

In [None]:
class_names = kaggle_dataset_df['Class'].astype('category').cat.categories.tolist()
print(class_names)

It is very important to gain a basic understanding about how potentially imbalanced towards certain classes our dataset is before moving further.
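As a rough illustration of what checking and correcting imbalance involves, the sketch below counts the labels of a small, made-up list and then randomly undersamples the majority class. The `labels` list and the seed are hypothetical and are not taken from the Kaggle dataset; the cells that follow do the equivalent with `pandas`.

```python
import random
from collections import Counter

# Hypothetical labels standing in for the 'Class' column of the dataset.
labels = ['ham'] * 8 + ['spam'] * 2

# Count how many messages fall into each class.
class_counts = Counter(labels)
print(class_counts)

# Random undersampling: keep only as many 'ham' items as there are 'spam' items.
rng = random.Random(10)
ham_indices = [i for i, label in enumerate(labels) if label == 'ham']
spam_indices = [i for i, label in enumerate(labels) if label == 'spam']
kept_ham_indices = rng.sample(ham_indices, len(spam_indices))

balanced_labels = [labels[i] for i in kept_ham_indices + spam_indices]
print(Counter(balanced_labels))
```

Undersampling discards data from the majority class, so it is only one of several possible ways of dealing with imbalance.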

In [None]:
for cl in class_names:
    print('%d comments that are labeled as %s.' % (len(kaggle_dataset_df[kaggle_dataset_df['Class'] == cl]), cl))

In [None]:
# Set a random_state number for replicability of the experiments.
random_state = 10
resampled_ham_df = kaggle_dataset_df[kaggle_dataset_df['Class'] == 'ham'].sample(len(kaggle_dataset_df[kaggle_dataset_df['Class'] == 'spam']), 
                                                                                 random_state=random_state)

In [None]:
len(resampled_ham_df)

In [None]:
resampled_dataset_df = pd.concat([resampled_ham_df, kaggle_dataset_df[kaggle_dataset_df['Class'] == 'spam']],
                                 ignore_index=True)

In [None]:
# Returns the category codes of each of the classes in the target-column.
outputs = resampled_dataset_df['Class'].astype('category').cat.codes

In [None]:
inputs = resampled_dataset_df['Message']
class_names = resampled_dataset_df['Class'].astype('category').cat.categories.tolist()
print(class_names)

In [None]:
for cl in class_names:
    print('%d comments that are labeled as %s.' % (len(resampled_dataset_df[resampled_dataset_df['Class'] == cl]), cl))

In order to evaluate the performance of our algorithm, we should test it on data that it hasn't *seen* during training. Luckily, `scikit-learn` includes a function that splits the items of a dataset into random train and test subsets.

We set the portion of the original dataset that will be used for testing.
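To make the idea concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. The `holdout_split` helper and the toy `pairs` list are hypothetical, and the sketch is not how `scikit-learn` implements its own splitting; it only shows the principle of shuffling once and cutting off a fixed fraction.

```python
import random

def holdout_split(items, test_size, seed):
    """Shuffle a copy of `items` and split off a `test_size` fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (message, label) pairs; keeping them paired guarantees that
# every message stays aligned with its label after shuffling.
pairs = [('message %d' % i, i % 2) for i in range(10)]
train_pairs, test_pairs = holdout_split(pairs, test_size=0.2, seed=10)
print(len(train_pairs), len(test_pairs))
```

Fixing the seed makes the split repeatable, which is the same role that `random_state` plays in this notebook.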

In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.2
# Split dataset into training and testing according to the test_size variable.
# y_train and y_test are lists containing the classes' indices.
input_train, input_test = train_test_split(inputs.tolist(), test_size=test_size, random_state=random_state)
y_train, y_test = train_test_split(outputs.tolist(), test_size=test_size, random_state=random_state)

# Extracting Features from Text: The Bag-of-Words Approach
In order to be able to use text documents$^1$ as input to Machine Learning algorithms, we need to follow a process that turns them into numerical feature vectors. We generally refer to this process as *vectorisation*. The most intuitive way to do so is the **bag-of-words** approach, which is carried out as follows:
1. Identify all the words that occur in the documents of a training set.
2. Assign a fixed integer ID to each one of those words. For example in Python you could build a dictionary that would map each word to its corresponding integer ID:
 ```python
 dictionary = {'I': 1,
               'study': 2,
               'machine': 3,
               'learning': 4,
               ...}
 ```
3. For each document in the training set, we count the number of occurrences of each word, and we store it in $X[d, w]$, as the value of the $w$-th feature for the $d$-th document, where $w$ is the index of the word in the dictionary.
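The three steps above can be sketched in a few lines of plain Python. The two documents below are made up for illustration, tokenisation is a naive lower-case whitespace split, and IDs start from 0 rather than 1; this is only a sketch of the idea, not the way `scikit-learn` implements it.

```python
docs = ['I study machine learning',
        'I love machine learning and deep learning']

# Steps 1 and 2: collect the distinct words and give each a fixed integer ID.
vocabulary = {}
for doc in docs:
    for word in doc.lower().split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 3: X[d][w] holds the number of times word w occurs in document d.
X = [[0] * len(vocabulary) for _ in docs]
for d, doc in enumerate(docs):
    for word in doc.lower().split():
        X[d][vocabulary[word]] += 1

print(vocabulary)
for row in X:
    print(row)
```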

The bag-of-words representation implies that the total number of features is the number of distinct words in the corpus, which for a large corpus is typically larger than 100,000.

While storing all these values in a `numpy` array would require a substantial amount of memory, most values in $X$ will be zeros, since a given document contains only a small subset of the distinct words in the dataset. For this reason, we say that bag-of-words datasets are typically high-dimensional and sparse. We can save a lot of memory by only storing the non-zero parts of the feature vectors. `scipy.sparse` matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
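The saving can be illustrated with a plain dictionary that keeps only the non-zero entries of one row. This is a simplified stand-in for the idea behind sparse storage, with a made-up `dense_row`; `scipy.sparse` uses more efficient layouts than this.

```python
# A hypothetical row of X: ten features, only two of them non-zero.
dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]

# Keep only the non-zero entries as {column index: value}.
sparse_row = {col: value for col, value in enumerate(dense_row) if value != 0}
print(sparse_row)

def to_dense(sparse, n_columns):
    """Rebuild the full row from its non-zero entries."""
    return [sparse.get(col, 0) for col in range(n_columns)]

print(to_dense(sparse_row, len(dense_row)) == dense_row)
```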

In `scikit-learn`, text preprocessing, tokenising and filtering of stop-words (e.g. "and", "or" and "that") are included in a high-level component that is able to build a dictionary of features and transform documents to feature vectors:

$^1$ Text documents can vary substantially in length and writing style. In our case, the text documents are the short messages of our Kaggle dataset, but the techniques presented in this tutorial could also work on much longer texts, such as articles or books.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# The lines below load the TweetTokenizer from the nltk library.
# You can uncomment them, along with the tokenizer argument of
# the CountVectorizer, should you like to see the results with
# a different tokeniser.
# from nltk.tokenize import TweetTokenizer
# word_tokenizer = TweetTokenizer(preserve_case=False, 
#                                 strip_handles=True, 
#                                 reduce_len=True).tokenize

count_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                   stop_words=None,
                                   # tokenizer=word_tokenizer,
                                   # Ignore terms that have a document frequency strictly higher than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   max_df=1.0,
                                   # Ignore terms that have a document frequency strictly lower than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   min_df=1)
count_vectorizer

## Working on a Toy Example
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

In [None]:
toy_corpus = [u'You have studied Machine Learning',
              u'We love learning Machine Learning',
              u'Looking forward to #ArupLearningWeek',
              u'Have you studied Machine Learning']

# Fits the vectoriser and transforms the corpus into its bag-of-words representation.
toy_count = count_vectorizer.fit_transform(toy_corpus) 

During the fitting process each term is assigned a unique integer index corresponding to a column in the resulting `toy_count` matrix (i.e. equivalent to the $X$ matrix that has been mentioned in the description of this section of the tutorial).

In [None]:
print(toy_count) # This is the memory-efficient representation of a sparse matrix.
print(toy_count.toarray())

In [None]:
print('The total number of DOCUMENTS in the corpus equals to the number of rows of the returned array: %d' %  toy_count.toarray().shape[0])
print('The total number of UNIQUE WORDS in the corpus equals to the number of columns of the returned array: %d' %  toy_count.toarray().shape[1])

You can see that the first and the last rows of the array are identical. This happens because the two sentences contain the same words in a different order, so they are encoded as equal vectors: the bag-of-words representation discards word order, which can lead to loss of valuable information. To retain some of that information, `CountVectorizer` also supports counts of n-grams of words or consecutive characters. N-grams are runs of consecutive characters or words, so for example in the case of word bi-grams, every consecutive pair of words would be a feature. Support for n-grams can be enabled by adjusting the `ngram_range` parameter during the initialisation of the `CountVectorizer`.
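A minimal sketch of word n-gram extraction, using a naive lower-case whitespace split, shows why bi-grams can tell the first and last toy sentences apart. The `word_ngrams` helper is hypothetical and much simpler than what `CountVectorizer` does internally.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of `text` (lower-cased, whitespace-split)."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

first = 'You have studied Machine Learning'
last = 'Have you studied Machine Learning'

# The two sentences share exactly the same uni-grams...
same_unigrams = sorted(word_ngrams(first, 1)) == sorted(word_ngrams(last, 1))
print(same_unigrams)

# ...but their bi-grams differ, because bi-grams keep some word order.
first_bigrams = word_ngrams(first, 2)
last_bigrams = word_ngrams(last, 2)
print(first_bigrams)
print(last_bigrams)
```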

In the initialisation of `CountVectorizer` set the `ngram_range` variable to `(1, 2)`, and check the resulting `toy_count` matrix by running `toy_count.toarray()`. Do the results make sense?

The interpretation of the columns can be retrieved as follows:

In [None]:
# Note: scikit-learn releases older than 1.0 expose this as get_feature_names().
count_vectorizer.get_feature_names_out()

Once the vectoriser is fitted, you can retrieve the index (starting from zero) of a particular word in the dictionary by simply calling:

In [None]:
count_vectorizer.vocabulary_.get(u'machine')

For further details about the functionality of `CountVectorizer`, please refer to its [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer).

## Working on the Kaggle Dataset

Print a small part of the messages (i.e. the first five in the list) that will be used for training.

In [None]:
input_train[:5] 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# The lines below load the TweetTokenizer from the nltk library.
# You can uncomment them, along with the tokenizer argument of
# the CountVectorizer, should you like to see the results with
# a different tokeniser.
# from nltk.tokenize import TweetTokenizer
# word_tokenizer = TweetTokenizer(preserve_case=False, 
#                                 strip_handles=True, 
#                                 reduce_len=True).tokenize

count_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                   stop_words=None,
                                   # tokenizer=word_tokenizer,
                                   # Ignore terms that have a document frequency strictly higher than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   max_df=1.0,
                                   # Ignore terms that have a document frequency strictly lower than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   min_df=1)
count_vectorizer

In [None]:
# Fits the vectoriser and transforms the corpus into its bag-of-words representation.
X_train_count = count_vectorizer.fit_transform(input_train)

During the fitting process each term is assigned a unique integer index corresponding to a column in the resulting `X_train_count` matrix (i.e. equivalent to the $X$ matrix that has been mentioned in the description of this section of the tutorial).

In [None]:
print(X_train_count.shape)
print(X_train_count.toarray())

Can you tell how many documents and how many unique words exist in the Kaggle dataset?

### From Occurrences To Frequencies
Occurrence count is a good start. However, longer documents will have higher average count values than shorter documents, even though they might talk about similar topics. To avoid these potential discrepancies, it suffices to divide the number of occurrences of each word in a document by the total number of words in the document. The number of times a term occurs in a document, divided by the number of terms in a document is called the **term frequency** (**tf**).

Another refinement on top of term frequency is to downscale weights for words that occur in many documents in the corpus, and are therefore less informative than those that occur only in a smaller portion of the corpus. In order to achieve this we can weight terms on the basis of the **inverse document frequency** (**idf**). The *document frequency* is the number of documents a given word occurs in; the inverse document frequency is often defined as the logarithm of the total number of documents in the corpus divided by the document frequency.

Combining tf and idf results in a *family of weightings* (tf is usually multiplied by idf, but there are a few different variations of how idf is computed) known as **term frequency-inverse document frequency** (**tf–idf**).
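The definitions above can be written out by hand. The sketch below uses one common variant, $\text{idf} = \log(N / \text{df})$, on a small made-up corpus of pre-tokenised documents; the function names are hypothetical. `scikit-learn` computes idf with a slightly different formula, so its numbers will not match this sketch exactly.

```python
import math

# Hypothetical, already tokenised documents.
docs = [['machine', 'learning', 'machine'],
        ['deep', 'learning'],
        ['spam', 'spam', 'spam', 'learning']]

def term_frequency(term, doc):
    """Occurrences of `term` in `doc`, divided by the number of terms in `doc`."""
    return doc.count(term) / float(len(doc))

def inverse_document_frequency(term, docs):
    """log(total number of documents / number of documents containing `term`)."""
    document_frequency = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / float(document_frequency))

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

print(term_frequency('machine', docs[0]))
# 'learning' occurs in every document, so it carries no weight in this variant.
print(inverse_document_frequency('learning', docs))
print(tf_idf('spam', docs[2], docs))
```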

Both tf and tf–idf on our `toy_corpus` can be computed using `scikit-learn` as follows:

In [None]:
# We print the entirety of our toy corpus.
toy_corpus

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

# Computing tf using the counts that have been computed from the CountVectorizer.
tf_transformer = TfidfTransformer(use_idf=False, norm='l1', smooth_idf=False)
X_train_tf = tf_transformer.fit_transform(toy_count)

print(X_train_tf.shape)
print(X_train_tf.toarray())

In [None]:
# Computing tf-idf using the counts that have been computed from the CountVectorizer.
tfidf_transformer = TfidfTransformer(use_idf=True, norm='l1', smooth_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(toy_count)

print(X_train_tfidf.shape)
print(X_train_tfidf.toarray())

Rather than transforming the raw counts with the `TfidfTransformer`, it is alternatively possible to use the `TfidfVectorizer` to directly parse the dataset. We will use this to compute the tf-idf scores on our `toy_corpus` as follows:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(ngram_range=(1, 1),
                             # stop_words='english',
                             # tokenizer='word_tokenizer',
                             # Ignore terms that have a document frequency strictly higher than the given threshold.
                             # If float, the parameter represents a proportion of documents, integer absolute counts.
                             # max_df=0.75,
                             # Ignore terms that have a document frequency strictly lower than the given threshold.
                             # If float, the parameter represents a proportion of documents, integer absolute counts.
                             # min_df=5,
                             # tf-idf hyper-parameters
                             use_idf=True, norm='l1', smooth_idf=False)

X_train_tfidf = tfidf_vect.fit_transform(toy_corpus)

print(X_train_tfidf.shape)
print(X_train_tfidf.toarray())

As expected, the scores are identical to the ones computed at the previous step using the combination of `CountVectorizer` and `TfidfTransformer`. We will now compute the tf-idf scores on the Kaggle dataset, again using the `TfidfVectorizer`.

We are also leveraging the `stop_words`, `max_df` and `min_df` parameters of the `TfidfVectorizer` in order to exclude very frequent and extremely infrequent words that would not help us in our classification task. This also reduces the number of columns of the $X$ matrix (i.e. `X_train_tfidf`) to only 1853, from the total of 14315 that it had when these parameters were not used.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(ngram_range=(1, 1),
                             stop_words='english',
                             # tokenizer='word_tokenizer',
                             # Ignore terms that have a document frequency strictly higher than the given threshold.
                             # If float, the parameter represents a proportion of documents, integer absolute counts.
                             max_df=0.75,
                             # Ignore terms that have a document frequency strictly lower than the given threshold.
                             # If float, the parameter represents a proportion of documents, integer absolute counts.
                             min_df=5,
                             # tf-idf hyper-parameters
                             use_idf=True, norm='l1', smooth_idf=False)

X_train_tfidf = tfidf_vect.fit_transform(input_train)

print(X_train_tfidf.shape)
print(X_train_tfidf.toarray())

Try filtering out terms that are either too frequent or too infrequent in the dataset by adjusting the `max_df` and `min_df` parameters respectively. This is an easy way of not only filtering out the less informative words but also reducing the number of features (and hence the storage required).

Occasionally, it is important to have an understanding of the words that are excluded via the `stop_words` parameter—some of those words could be important for a particular task. The full list of `stop_words` that `sklearn` uses can be accessed as follows:

In [None]:
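# ENGLISH_STOP_WORDS is exposed by the `text` module; the old
# `sklearn.feature_extraction.stop_words` module is no longer available
# in recent scikit-learn releases.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

for word in sorted(ENGLISH_STOP_WORDS):
    print(word)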

# Building a Predictive Model using K-Nearest-Neighbours
Now that we have our training features and the label of each message, we can train a classifier to predict whether a message is spam or not. Let's start with a KNN classifier, which provides a simple baseline, although it is perhaps not the best classifier for this task:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_train_tfidf, y_train)

To predict the outcome on a new message, we need to extract the features using almost the same feature-extraction chain as before. The differences are that we call (i) `transform` instead of `fit_transform` on the transformer or vectoriser, and (ii) `predict` instead of `fit` on the classifier, since both have already been fit to the training set.
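The difference between fitting and transforming can be illustrated with a deliberately tiny, hypothetical vectoriser. The `TinyCountVectoriser` class below is not part of `scikit-learn`; it only mimics the idea that the vocabulary is learnt once from the training data and then re-used, unchanged, on new text.

```python
class TinyCountVectoriser(object):
    """Hypothetical, stripped-down stand-in for a count vectoriser."""

    def fit(self, docs):
        # Learn the vocabulary from the training documents only.
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Re-use the learnt vocabulary; words never seen during fit are dropped.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

vectoriser = TinyCountVectoriser().fit(['free prize call now', 'call me later'])
new_rows = vectoriser.transform(['call now for a free holiday'])
print(vectoriser.vocabulary_)
print(new_rows)
```

Calling `fit` again on the new message would build a different vocabulary, and the resulting columns would no longer line up with the ones the classifier was trained on.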

You can test your own comments by changing the text in the `test_comment` variable. Does your classifier identify spam messages properly?

In [None]:
test_comment = 'TO STOP THIS TEXT CALL 52331'
test_comment_tfidf = tfidf_vect.transform([test_comment])
y_pred = knn_clf.predict(test_comment_tfidf)

print('%s: %s' % (test_comment, class_names[y_pred[0]]))

## Evaluating the Performance on the Test Set
We will be evaluating the performance of our KNN classifier on the *unseen* data of the test set based on the accuracy metric. In a binary classification task, such as ours, the accuracy with which a model predicts a specific class $c$ (e.g. spam) is formally defined as:

\begin{align}
\frac{\sum \text{TP} + \sum \text{TN}}{\sum \text{TP} + \sum \text{FP} + \sum \text{TN} + \sum \text{FN}}
\end{align}
where:
* $\text{TP}$ refers to True Positive predictions: both the predicted and the empirical labels are $c$
* $\text{TN}$ refers to True Negative predictions: both the predicted and the empirical labels are $\neq c$
* $\text{FP}$ refers to False Positive predictions: the predicted label is $c$ but the empirical label $\neq c$
* $\text{FN}$ refers to False Negative predictions: the predicted label is $\neq c$ but the empirical label is $c$
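The formula can be checked by hand on a handful of made-up labels. The `accuracy` helper and the two label lists below are hypothetical and only illustrate the four counts that enter the definition.

```python
def accuracy(y_true, y_pred, positive):
    """Accuracy computed from the TP, TN, FP and FN counts for class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return (tp + tn) / float(tp + fp + tn + fn)

# Hypothetical empirical and predicted labels for five messages.
empirical = ['spam', 'ham', 'spam', 'ham', 'ham']
predicted = ['spam', 'ham', 'ham', 'ham', 'spam']
score = accuracy(empirical, predicted, positive='spam')
print('Accuracy: %.2f' % score)
```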

In [None]:
from sklearn import metrics

X_test_tfidf = tfidf_vect.transform(input_test)
y_pred = knn_clf.predict(X_test_tfidf)
print('Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred)))


## Building a Pipeline
In order to make our pipeline (i.e. vectoriser or transformer $\rightarrow$ classifier) easier to work with, `scikit-learn` provides the `Pipeline` class that behaves like a compound classifier.

In [None]:
from sklearn.pipeline import Pipeline
clf_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 1),
                                               stop_words='english',
                                               # tokenizer=word_tokenizer,
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               max_df=1.0,
                                               # Ignore terms that have a document frequency strictly lower than the given threshold.
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               min_df=5,
                                               # tf-idf hyper-parameters
                                               use_idf=True, norm='l1', smooth_idf=False)),
                         ('clf', KNeighborsClassifier(n_neighbors=3))])

The names `tfidf` and `clf` (classifier) are arbitrary. We shall see their use in the section on grid search, below. We can now train (on the training set) and test (on the test set) the model in a similar fashion to when we had all the different components separate.

In [None]:
# Model Training
clf_pipeline.fit(input_train, y_train)
y_pred = clf_pipeline.predict(input_test) # We are making predictions on the test set.
print('Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred)))

Let's see if we can do better with a linear Support Vector Machine (SVM). We can change the learner by just plugging a different classifier object into our pipeline as follows:

In [None]:
from sklearn.linear_model import SGDClassifier
clf_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 1),
                                               stop_words='english',
                                               # tokenizer=word_tokenizer,
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               max_df=1.0,
                                               # Ignore terms that have a document frequency strictly lower than the given threshold.
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               min_df=5,
                                               # tf-idf hyper-parameters
                                               use_idf=True, norm='l2', smooth_idf=False)),
                         ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           tol=1e-5,
                                           random_state=random_state))])
# Model Training
clf_pipeline.fit(input_train, y_train)

In [None]:
y_pred = clf_pipeline.predict(input_test) # We are making predictions on the test set.
print('Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred)))

## Evaluating Performance
`scikit-learn` provides further utilities for a more detailed performance analysis of the results using different metrics (i.e. precision, recall, and F1-score).

A *confusion matrix* is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarised as counts and broken down by class. An example of a binary confusion matrix is presented below:

| Class          | <span style="font-weight:normal">Predicted: `ham`</span> | <span style="font-weight:normal">Predicted: `spam`</span> |
|--------------------|------------------|-------------------|
| Actual: `ham`  | TN               | FP                |
| Actual: `spam` | FN               | TP                |

The table above illustrates the different classification scenarios:
* $\text{TP}$ refers to True Positive predictions: the message is spam and the algorithm predicted spam
* $\text{TN}$ refers to True Negative predictions: the message is ham and the algorithm predicted ham
* $\text{FP}$ refers to False Positive predictions: the message is ham and the algorithm predicted spam
* $\text{FN}$ refers to False Negative predictions: the message is spam and the algorithm predicted ham
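The four cells of the table can be tallied by hand. The sketch below uses a hypothetical `confusion_matrix_2x2` helper and made-up labels, with rows for the actual class and columns for the predicted class, as in the table above.

```python
def confusion_matrix_2x2(y_true, y_pred, classes=('ham', 'spam')):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0, 0], [0, 0]]
    for actual_label, predicted_label in zip(y_true, y_pred):
        matrix[index[actual_label]][index[predicted_label]] += 1
    return matrix

# Hypothetical actual and predicted labels for six messages.
actual = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham']
predicted = ['ham', 'spam', 'ham', 'spam', 'ham', 'ham']

matrix = confusion_matrix_2x2(actual, predicted)
(tn, fp), (fn, tp) = matrix
print(matrix)
print('TN=%d FP=%d FN=%d TP=%d' % (tn, fp, fn, tp))
```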

A number of performance metrics, complementary to the accuracy that we looked at above, can be derived from the confusion matrix:
 
* **Precision** measures the fraction of the messages predicted as a given class (e.g. spam) that actually belong to that class, and is formally defined as:
\begin{align}
\text{precision}=\frac{\sum \text{TP}}{\sum \text{TP}+ \sum\text{FP}}
\end{align}
 

* **Recall** measures the fraction of the messages that actually belong to a given class (e.g. spam) that the model correctly predicted as that class, and is formally defined as:
\begin{align}
\text{recall}=\frac{\sum \text{TP}}{\sum \text{TP}+ \sum \text{FN}}
\end{align}
 

* **F1-score** is the harmonic mean of precision and recall, so it is high only when both of them are high, and is formally defined as:
\begin{align}
\text{F1-score}=2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\end{align}
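As a worked example of the three formulas, the sketch below plugs made-up counts (8 true positives, 2 false positives, 4 false negatives) into a hypothetical helper. It does not guard against division by zero, which a real implementation would need to handle.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from the confusion-matrix counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not results from the Kaggle dataset.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print('precision=%.3f recall=%.3f f1=%.3f' % (precision, recall, f1))
```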

In [None]:
display(pd.DataFrame(metrics.confusion_matrix(y_test,
                                              y_pred)))

The classification report printed below shows the performance for each class according to precision, recall and f1-score. Support refers to the number of samples that belong to each particular class.

For further details you can have a look at the Wikipedia article on [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall).

In [None]:
print(metrics.classification_report(y_test, 
                                    y_pred,
                                    target_names=class_names))

As we did before, we can test how well our classifier is doing by inputting our own comments.

In [None]:
test_comment = 'Text 12312 NOW to get this offer!'
y_pred = clf_pipeline.predict([test_comment])

print('%s: %s' % (test_comment, class_names[y_pred[0]]))

Try experimenting with different hyper-parameters (e.g. `ngram_range`, `tokenizer`, `max_df` or `min_df`) to see whether you achieve any better accuracy.

# Hyper-parameter Tuning using Grid Search and Cross-validation
We have already encountered some hyper-parameters such as `use_idf` in the `TfidfTransformer` (and `TfidfVectorizer`). Classifiers tend to have many hyper-parameters as well. For example, `KNeighborsClassifier` includes a parameter for the number of neighbours, and `SGDClassifier` has a penalty parameter `alpha` and configurable loss and penalty terms in the objective function.

Instead of tweaking the hyper-parameters of the various components of the chain, it is possible to run an exhaustive search of the best hyper-parameters on a grid of possible values. Let's use this to explore whether we can make the KNeighborsClassifier perform as well as our linear SVM.

In [None]:
knn_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', KNeighborsClassifier(n_neighbors=3))])

from sklearn.model_selection import GridSearchCV
parameters = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
              'tfidf__use_idf': (True, False),
              'clf__n_neighbors': (1, 3, 5, 7, 9)}


## Cross-validation
$k$-fold cross-validation is the process of splitting the training data into $k$ smaller sets (folds), and training a model with the same hyper-parameters $k$ times. During each one of the training phases, a different one of the $k$ folds is left out and the model is trained using the remaining $k-1$ folds of the data. After a single training phase is completed, the model is validated on the single fold which it had not seen during its training.
![Cross-validation](Figures/grid_search_cross_validation.png "How cross-validation works?")
Source: [https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)

The number of folds $k$ that will be used for cross-validation is set using the `cv` parameter of the `GridSearchCV` class of the `sklearn` library.
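The fold bookkeeping can be sketched in plain Python. The `k_fold_indices` helper below is hypothetical and splits contiguous, unshuffled indices; it only illustrates that each sample is used for validation exactly once, and is not how `scikit-learn` implements its splitters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, k=3))
for train, validation in folds:
    print('train:', train, 'validation:', validation)
```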

Obviously, such an exhaustive search can be expensive. If we have multiple CPU cores at our disposal, we can tell the grid searcher to try these 40 parameter combinations (4 n-gram ranges × 2 `use_idf` settings × 5 neighbour counts) in parallel with the `n_jobs` parameter. If we give this parameter a value of -1, grid search will detect how many cores are installed and use them all.
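The size of a grid is the product of the number of candidate values of each hyper-parameter. A minimal sketch that enumerates the `parameters` grid defined earlier with `itertools.product` (the dictionary is repeated here so that the snippet is self-contained):

```python
from itertools import product

parameters = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
              'tfidf__use_idf': (True, False),
              'clf__n_neighbors': (1, 3, 5, 7, 9)}

# Build one dictionary per combination of candidate values.
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[name] for name in names))]

print(len(combinations))
print(combinations[0])
```

Each combination is trained once per cross-validation fold, so the total number of model fits is the number of combinations multiplied by the value of `cv`.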

In [None]:
knn_pipeline = GridSearchCV(knn_pipeline, param_grid=parameters, 
                            # Sets the number of folds (k) that will be used for 
                            # k-fold cross-validation.
                            cv=3, 
                            n_jobs=-1)

The grid search instance behaves like a normal `scikit-learn` model, so we can use the `fit` function to start the training process.

In [None]:
knn_pipeline.fit(input_train, y_train)

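We can get the optimal parameters out by inspecting the object's `best_params_` attribute, a dictionary that maps each parameter name to its best value; the scores of all the parameter combinations are available in the `cv_results_` attribute. To print the best-scoring parameters, we do as follows: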

In [None]:
for param_name in knn_pipeline.best_params_:
    print("%s: %r" % (param_name, knn_pipeline.best_params_[param_name]))
print('The best achieved cross-validation accuracy is %.2f' % knn_pipeline.best_score_)

Now that the process is complete, we will test our performance on the test set, and output the corresponding classification report with the precision, recall and f1-score across the two classes.

In [None]:
y_pred = knn_pipeline.predict(input_test) # We are making predictions on the test set.
print('Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred)))

In [None]:
print(metrics.classification_report(y_test, 
                                    y_pred,
                                    target_names=class_names))

Let's test and see how well new comments are classified on our `knn_pipeline`.

In [None]:
test_comment = 'Send code if you want to buy this amazing new application!'
y_pred = knn_pipeline.predict([test_comment])

print('%s: %s' % (test_comment, class_names[y_pred[0]]))

# Exploring K-Means Clustering
Now that we have extracted features from our training documents we're in a position to experiment with clustering. We will use K-Means as it is one of the most intuitive clustering methods, although it does have a few limitations.

K-Means clustering with 10 clusters can be achieved as follows:

In [None]:
from sklearn.cluster import KMeans
num_clusters = 10
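# Fixing random_state makes the cluster assignments repeatable between runs.
k_means = KMeans(n_clusters=num_clusters, random_state=random_state)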
k_means.fit(X_train_tfidf)

The assignment of each training message to a cluster id is given by `k_means.labels_` once `k_means.fit(...)` has been called. The centroids of the clusters are given by `k_means.cluster_centers_`. Intuitively, the vector that describes the centre of a cluster is just like any other feature vector. An interesting way to explore what each cluster represents is to calculate and print the top weighted (either by occurrence or tf-idf) terms for that cluster:

In [None]:
order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
# Note: scikit-learn releases older than 1.0 expose this as get_feature_names().
terms = tfidf_vect.get_feature_names_out()
for i in range(num_clusters):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % terms[ind])
    print('\n')

A number of different metrics exist that allow us to measure how well the clusters fit the known class labels. One such metric is the homogeneity, which is a measure of how pure the clusters are with respect to the known labels (i.e. spam or ham).

In [None]:
print("Homogeneity: %.3f" % metrics.homogeneity_score(y_train, k_means.labels_))

Homogeneity scores vary between 0 and 1; a score of 1 indicates that the clusters match the original label distribution exactly.
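For intuition, homogeneity is commonly defined as $1 - H(C \mid K) / H(C)$, where $H(C)$ is the entropy of the class labels and $H(C \mid K)$ is their entropy once the cluster assignments are known. The sketch below implements that textbook definition with hypothetical helper names; it is meant as an illustration of the idea rather than a replacement for `metrics.homogeneity_score`.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural logarithm) of a list of labels."""
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def homogeneity(classes, clusters):
    """1 - H(classes | clusters) / H(classes)."""
    h_classes = entropy(classes)
    if h_classes == 0.0:
        return 1.0
    n = float(len(classes))
    h_conditional = 0.0
    for cluster in set(clusters):
        members = [c for c, k in zip(classes, clusters) if k == cluster]
        h_conditional += (len(members) / n) * entropy(members)
    return 1.0 - h_conditional / h_classes

# Every cluster contains a single class.
pure = homogeneity([0, 0, 1, 1], [0, 0, 1, 1])
# Every cluster contains both classes in equal proportion.
mixed = homogeneity([0, 0, 1, 1], [0, 1, 0, 1])
print('pure: %.3f, mixed: %.3f' % (pure, mixed))
```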

Explore what happens if you make the number of clusters larger. What do you notice? Do the clusters begin to make more intuitive sense?

# Next Steps

If you have enjoyed this tutorial, the exercises in [Working With Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) are a good next step.