# Recommender Systems
* Latent Factor Models
* Trends + Metadata
* Item/Content Similarity
* Association Rules / FP-Growth

## Dimensionality and Topic Modeling

### Matrix Factorization Examples: Collaborative Filtering and Topic Modeling

We'll look at a basic topic modeling pattern that:
* Is common in business scenarios beyond text
* Relies on matrix factorization

Nonnegative matrix factorization (NMF) is about finding two matrices with nonnegative entries, W and H, whose product approximates a given nonnegative matrix V.

<img src="https://materials.s3.amazonaws.com/i/NMF.png">
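As a rough illustration of the factorization described above, the sketch below factors a small made-up matrix with scikit-learn's `NMF`. The matrix values, the choice of two factors, and the `init`/`max_iter` settings are assumptions chosen only to keep the example tiny.

```python
import numpy as np
from sklearn.decomposition import NMF

# A small nonnegative matrix V (4 rows, 3 columns), made up for illustration.
V = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 2.0, 0.0],
    [0.0, 1.0, 4.0],
    [0.0, 2.0, 5.0],
])

# Ask for 2 latent factors, so W is 4x2 and H is 2x3.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # shape (4, 2), all entries >= 0
H = model.components_        # shape (2, 3), all entries >= 0

# The product W @ H has the same shape as V and approximates it.
V_approx = W @ H
print(np.round(V_approx, 2))
print("Approximation error:", np.linalg.norm(V - V_approx))
```

The two factors store 4x2 + 2x3 = 14 numbers instead of the 12 in V, so nothing is saved at this size; the savings appear when V is large and the number of factors stays small.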

#### Why would we want to do this?

__Collaborative Filtering__

1. Imagine the rows of V represent customers, and the columns of V represent products (physical goods or media, like movies or songs).

2. Then we can imagine that the W matrix represents each customer's affinity to a set of latent factors (aspects of a product, like movie genre or clothing style).

3. H represents how strongly each of those factors is associated with each product.

4. It is easy to imagine that with a ton of "factors" we could get a really good breakdown of customers' likes and product aspects. But -- if we can get close to this with just a small number of factors -- then we will have a *compact, fast, inexpensive* way to recommend products to customers (or maybe even engineer products).

This is really another flavor of dimensionality reduction -- we're again finding a low-dimensional representation of the linkage patterns between customers and products.
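The customer/product reading above can be sketched on a toy ratings matrix. Everything here is hypothetical: the product names, the ratings, and the choice of two factors. Note one simplifying assumption in particular: plain NMF treats a 0 as a real rating of zero rather than as "not rated yet", which a production recommender would handle differently.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical ratings: rows are customers, columns are products.
# 0 stands for "no rating yet" (treated by NMF as a literal zero).
products = ["action_movie_1", "action_movie_2", "romance_1", "romance_2"]
V = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [5, 0, 0, 1],   # customer 4 has not rated action_movie_2 or romance_1
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # customer-to-factor affinities, shape (5, 2)
H = model.components_        # factor-to-product weights, shape (2, 4)
scores = W @ H               # estimated affinity for every customer/product pair

# For one customer, recommend the unrated product with the highest score.
customer = 4
unrated = np.flatnonzero(V[customer] == 0)
best = unrated[np.argmax(scores[customer, unrated])]
print("Recommend to customer", customer, "->", products[best])
```

Only W and H need to be stored to score any customer/product pair, which is what makes a small number of factors attractive.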

The same exact approach can help us distill "topics" in natural language processing. If we imagine the rows of V as documents (or text instances) and the columns of V as terms, then the latent factors can be "topics" that unite terms via corresponding usages in documents.

In [None]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
print("done in %0.3fs." % (time() - t0))

In the following cell:
* Stop words are words so common that we don't want them counted at all (like "the")
* TF-IDF is a way of encoding a document as a vector: it counts the frequency of each term in the document, then down-weights terms that are common across all documents
  * E.g., "have" -- if it were included at all and not filtered out as a stop word -- would carry minimal weight because it occurs in most documents

In [None]:
data_samples = dataset.data[:n_samples]

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

"Frobenius norm" is the square-root-of-sum-of-squares, of a matrix
  * In this case it represents a Euclidean distance between the target matrix and the product of the approximate factors, so we'd like it to be small
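A quick numerical check of that definition, using small made-up matrices (the names `target` and `approx` are just illustrative stand-ins for the matrix being factored and the product of its factors):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Square root of the sum of squared entries, computed by hand...
manual = np.sqrt(np.sum(A ** 2))
# ...and with NumPy's built-in Frobenius norm.
builtin = np.linalg.norm(A, "fro")
print(manual, builtin)

# Used as a distance: the norm of the difference between a target matrix
# and an approximation of it. Here the two differ by 1 in a single entry.
target = np.array([[1.0, 2.0], [3.0, 4.0]])
approx = np.array([[1.0, 2.0], [3.0, 3.0]])
error = np.linalg.norm(target - approx, "fro")
print(error)
```

This is the same quantity as the Euclidean distance between the two matrices when each is flattened into a single long vector.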

In [None]:
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
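# Note: the single `alpha` argument was removed in scikit-learn 1.2 in favor of `alpha_W`/`alpha_H`,
# which are scaled differently. The small value below is an assumption, not a tuned setting.
nmf = NMF(n_components=n_components, random_state=1, alpha_W=0.00005, alpha_H=0.00005, l1_ratio=.5).fit(tfidf)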
print("done in %0.3fs." % (time() - t0))

In [None]:
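# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()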
print(tfidf_feature_names)

In [None]:
nmf.components_[0]

In [None]:
# Weights in topic 0 for the terms that sort before 'b' (numbers and words starting with "a").
[(name, magnitude) for (name, magnitude) in zip(tfidf_feature_names, nmf.components_[0]) if name < 'b']

In [None]:
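def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted terms for each topic in the model."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort is ascending, so this reversed slice takes the largest weights first.
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()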

print("\nTopics in NMF model (Frobenius norm):")
print_top_words(nmf, tfidf_feature_names, n_top_words)

There are many other approaches to topic modeling, from Naive Bayes models to Latent Dirichlet Allocation to neural networks.