# Content-based recommendation

Content-based recommendation engines compare items and find items similar to a query item. In this exercise, we will learn to measure the similarity of textual documents and build a recommender for news articles.

In [None]:
from __future__ import print_function
import requests
import time
import random
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

%matplotlib inline

Define some helper functions we are going to use later.

In [None]:
def row_as_array(M, i):
  """Get the i:th row of a sparse matrix M and return it as a 1-D numpy array"""
  return np.squeeze(np.asarray(M[i, :].todense()))

def row_as_series(M, i, vectorizer):
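  # get_feature_names_out() requires scikit-learn >= 1.0 (get_feature_names() was removed in 1.2)
  return pd.Series(row_as_array(M, i), index=vectorizer.get_feature_names_out())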

## News article dataset

Let's load a set of Reuters news articles. We'll need to clean the data a bit and extract the actual text of the articles to a variable called `documents`.

In [None]:
dataset = requests.get('http://ana.cachopo.org/datasets-for-single-label-text-categorization/r52-train-all-terms.txt?attredirects=0').content

In [None]:
documents = []
labels = []
for line in dataset.decode('UTF-8').split('\n'):
  # the part before the tab is the document class (label); the rest is the article text
  fields = line.strip().split('\t', 1)
  if len(fields) == 2:
    labels.append(fields[0])
    documents.append(fields[1])

labels = np.array(labels)
documents = pd.Series(documents)

print('Loaded {} documents'.format(len(documents)))
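The parsing loop above relies on every line having the layout "label, tab, text". As a minimal sketch of that logic, here it is applied to two made-up lines (the sample string and the `sample_*` names are illustrative only, not part of the dataset):

```python
# Two made-up lines in the same "label<TAB>text" layout as the dataset
sample = "cocoa\tcocoa prices rose sharply\nearn\tcompany reports higher profit\n"

sample_labels = []
sample_documents = []
for line in sample.split('\n'):
    fields = line.strip().split('\t', 1)
    # Blank lines, or lines without a tab, yield fewer than two fields and are skipped
    if len(fields) == 2:
        sample_labels.append(fields[0])
        sample_documents.append(fields[1])

print(sample_labels)
print(sample_documents)
```

The trailing empty string produced by the final newline is dropped by the `len(fields) == 2` check, so no empty document is stored.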

The news articles have been grouped into the following categories (each article belongs to exactly one category):

In [None]:
np.unique(labels)

Let's print the first news article to see what it looks like. Notice that punctuation and uppercase letters have already been removed in the data source.

In [None]:
print(documents[0])

## Preprocessing text documents

Most machine learning methods deal with numbers, not text. Therefore, text is commonly converted into numerical vectors for processing. A simple baseline method is to count how many times each word appears in a document and collect the counts (called term frequencies) into a vector. The scikit-learn library provides a tool for performing the conversion.

* The `ngram_range` parameter sets the range of n-gram lengths to use as tokens. Here `(1, 1)` keeps only the individual words; a setting such as `(1, 3)` would also include sequences of 2 and 3 consecutive words.
* The `min_df` parameter filters out tokens that appear in fewer than 0.5% of the documents.
* The `stop_words` parameter filters out common English words.

See the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for more information.
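To make the effect of these parameters concrete, here is a small sketch on a made-up three-document corpus (the `toy_docs` list and the variable names are assumptions for illustration). It reads the learned vocabulary from the `vocabulary_` attribute of each fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration; not part of the Reuters data)
toy_docs = [
    "the price of cocoa rose",
    "the price of sugar fell",
    "cocoa exports rose",
]

# Individual words only, with common English words removed
unigrams = CountVectorizer(ngram_range=(1, 1), stop_words='english')
unigrams.fit(toy_docs)
unigram_vocab = sorted(unigrams.vocabulary_)

# Individual words and pairs of consecutive words
bigrams = CountVectorizer(ngram_range=(1, 2), stop_words='english')
bigrams.fit(toy_docs)
bigram_vocab = sorted(bigrams.vocabulary_)

# An integer min_df is an absolute count: keep tokens found in at least 2 documents
frequent = CountVectorizer(min_df=2, stop_words='english')
frequent.fit(toy_docs)
frequent_vocab = sorted(frequent.vocabulary_)

print(unigram_vocab)
print(bigram_vocab)
print(frequent_vocab)
```

Note that `min_df` accepts either a fraction of the documents (a float, as in this notebook) or an absolute document count (an integer, as in the sketch).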

In [None]:
vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=0.005, stop_words='english')
vectorizer.fit(documents)

The `fit` method collects the tokens that appear in the documents and satisfy the `ngram_range`, `min_df`, and `stop_words` parameters.

Let's see what kind of tokens it found.

In [None]:
print('Number of tokens: {}'.format(len(vectorizer.get_feature_names_out())))

print()
print('First 10 tokens:')
vectorizer.get_feature_names_out()[:10]

The main functionality of the vectorizer is that we can convert text documents into vectors.

In [None]:
tf = vectorizer.transform(documents)

`tf` is a sparse matrix in which each row is the count vector of one document.

Let's check its dimensions and then use the helper function defined earlier to inspect the first document vector.

In [None]:
print("The dimensions of the tf matrix are {}".format(tf.shape))

In [None]:
x = row_as_series(tf, 0, vectorizer)
x

Most of the counts are zeros because one document contains only a subset of all words in the vocabulary.

To show that it's not all zeroes, let's take a subset that includes some non-zero values:

In [None]:
x['cocoa':'commission']

The length of the news articles varies from a couple of words to over 300 words as can be seen below.

In [None]:
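# tf.sum(axis=1) returns an (n, 1) matrix; flatten it to a 1-D array for plotting
plt.hist(np.asarray(tf.sum(axis=1)).ravel(), bins=30)
plt.xlabel('Number of words in an article')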
plt.show()

We don't want to consider two documents similar just because they contain a similar number of words. Therefore, we normalize each row of the term frequency matrix to unit length.

In [None]:
# toarray() gives a plain numpy array (recent scikit-learn versions reject np.matrix)
Xnormalized = normalize(tf.toarray(), norm='l2')
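The following sketch illustrates why normalization helps when comparing documents. It uses hypothetical count vectors (the `counts` array is made up) and relies on the fact that, for rows of unit length, the dot product computed by `linear_kernel` equals the cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical count vectors over a three-word vocabulary
counts = np.array([
    [1.0, 2.0, 0.0],    # short document
    [10.0, 20.0, 0.0],  # same word proportions, ten times longer
    [0.0, 0.0, 5.0],    # shares no words with the other two
])

# Scale every row to unit Euclidean (L2) length
unit = normalize(counts, norm='l2')

# Pairwise dot products of unit-length rows, i.e. cosine similarities
similarity = linear_kernel(unit)
print(np.round(similarity, 3))
```

After normalization the short and the long document are treated as identical, while the document with no shared words has zero similarity to both. Ranking the rows of such a similarity matrix is one possible basis for recommending articles similar to a query article.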

## Dimensionality reduction

Humans can't easily make sense of the tf matrix, which has over 1000 dimensions. If we want to visualize the data, we need to reduce it to two dimensions. This will obviously lose some (or actually quite a lot of) information in the data, but PCA will keep the directions along which the data varies the most.

In [None]:
dimreduction = PCA(n_components=2)
Xreduced = dimreduction.fit_transform(Xnormalized)

In [None]:
print("The dimensionality of the original tf matrix: {}".format(tf.shape))
print("The dimensionality after PCA: {}".format(Xreduced.shape))
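One way to quantify how much information survives the reduction is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below uses synthetic data (the random matrix and its scaling are assumptions chosen for illustration), not the news articles.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data: 200 samples in 10 dimensions, with most of the
# variance concentrated in the first two dimensions
scales = np.array([5.0, 3.0] + [0.5] * 8)
X = rng.normal(size=(200, 10)) * scales

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

# Fraction of the total variance captured by the two retained components
retained = pca.explained_variance_ratio_.sum()
print('Shape after PCA: {}'.format(X2.shape))
print('Fraction of variance retained: {:.2f}'.format(retained))
```

The same attribute is available on the `dimreduction` object fitted above, where the retained fraction is likely to be much lower because text data spreads its variance over many dimensions.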

Let's plot the data after the dimensionality reduction.

In [None]:
plt.figure(figsize=(18, 16))
plt.plot(Xreduced[:, 0], Xreduced[:, 1], '.')
plt.show()

There seems to be some interesting structure in the data, with at least three separate clusters.

To better understand what kind of articles belong to the clusters, we can overlay the beginning of a few random documents on the image.

In [None]:
plt.figure(figsize=(18, 16))
plt.plot(Xreduced[:, 0], Xreduced[:, 1], '.')

indexes = random.sample(range(tf.shape[0]), 10)
for i in indexes:
    label = ' '.join(documents[i].split(' ')[:10])
    plt.text(Xreduced[i, 0], Xreduced[i, 1], label, size='xx-large')

plt.show()

## Clustering

In [None]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(Xreduced)

The clustering algorithm assigned a cluster index to each sample. Let's see the first few cluster indexes:

In [None]:
kmeans.labels_
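To see what the `labels_` array means, here is a minimal sketch on synthetic points (two made-up, well-separated groups; the names `points` and `toy_kmeans` are illustrative). Each entry of `labels_` is the cluster index assigned to the sample in the same position.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Two synthetic, well-separated groups of 2-D points
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
points = np.vstack([group_a, group_b])

# random_state fixes the initialization so that the run is repeatable
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_kmeans.fit(points)

# Number of points assigned to each cluster index
cluster_sizes = np.bincount(toy_kmeans.labels_)
print(toy_kmeans.labels_)
print(cluster_sizes)
```

The cluster indexes themselves are arbitrary: which group is called 0 and which is called 1 depends on the initialization, so only the grouping carries meaning.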

We can visualize the clustering result by drawing each data point in the color corresponding to its cluster.

In [None]:
available_colors = 'bgrcmyk'
colors = [available_colors[i % len(available_colors)] for i in kmeans.labels_]

plt.figure(figsize=(18, 16))
# draw all points in one call instead of one plot call per point
plt.scatter(Xreduced[:, 0], Xreduced[:, 1], marker='.', c=colors)
plt.show()

Clustering the two-dimensional data is a toy example because you can already see the structure by looking at the visualization.

In reality, clustering is applied to high-dimensional data, like the news article data before the dimensionality reduction.

In [None]:
kmeans2 = KMeans(n_clusters=6)
kmeans2.fit(Xnormalized)

In [None]:
for cl in range(kmeans2.n_clusters):
    members = np.where(kmeans2.labels_ == cl)[0].tolist()
    # a cluster may contain fewer than 5 documents
    indexes = random.sample(members, min(5, len(members)))

    print("Random documents from cluster {}".format(cl))
    print('-'*40)
    
    for i in indexes:
        print(documents[i])
        
    print()

## Exercise

Vary the number of clusters (`n_clusters`) and rerun both analyses (the clustering of the dimensionality-reduced data and the clustering of the original data). How large or small a number of clusters still leads to sensible clusterings?

## Exercise

The CountVectorizer tends to overweight very common words. A better way is to down-weight words that appear in many documents by multiplying the raw term frequencies (TF) by (some function of) the inverse of the number of documents in which a term occurs (the inverse document frequency, IDF). The resulting vectors are called TF-IDF vectors. Scikit-learn provides [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) for this task. It is almost a drop-in replacement for the CountVectorizer that was used above.

Repeat the above analysis with TfidfVectorizer.
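As a starting point, the sketch below shows the re-weighting on a made-up four-document corpus (the `toy_docs` list is an assumption for illustration). With its default settings, `TfidfVectorizer` uses a smoothed IDF and scales each row to unit L2 length, so a separate normalization step should not be needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "market" occurs in three documents, "cocoa" in two
toy_docs = [
    "cocoa market report",
    "sugar market report",
    "coffee market report",
    "cocoa exports",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index
col = tfidf.vocabulary_
first_doc = weights[0].toarray().ravel()

# Both words occur once in the first document, but the rarer one gets more weight
print('cocoa:  {:.3f}'.format(first_doc[col['cocoa']]))
print('market: {:.3f}'.format(first_doc[col['market']]))
```

In a plain count vector both words would have the same value of 1 in the first document; the TF-IDF weights differ only because "market" appears in more documents.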