# Recommender Systems
* Latent Factor Models
* Trends + Metadata
* Item/Content Similarity
* Association Rules / FP-Growth

## Dimensionality and Topic Modeling

### Matrix Factorization Examples: Collaborative Filtering and Topic Modeling

We'll look at a basic topic modeling pattern that:
* Is common in business scenarios beyond text
* Relies on matrix factorization

Nonnegative matrix factorization (NMF) is about finding two nonnegative matrices, W and H, which, when multiplied, approximate a given nonnegative matrix V.

<img src="https://materials.s3.amazonaws.com/i/NMF.png">
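
As a minimal sketch of the idea (not part of the original notebook), the snippet below factorizes a small, made-up nonnegative matrix `V` with scikit-learn's `NMF`. The matrix values, the choice of two components, and the `init`/`max_iter` settings are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy nonnegative matrix: 4 rows, 3 columns (values are made up).
V = np.array([[5.0, 3.0, 0.0],
              [4.0, 2.0, 0.0],
              [0.0, 1.0, 4.0],
              [0.0, 0.0, 5.0]])

# Ask for 2 latent factors; both W and H are constrained to be nonnegative.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # shape (4, 2): one row per row of V
H = model.components_        # shape (2, 3): one column per column of V

V_approx = W @ H             # same shape as V, and hopefully close to it
print(W.shape, H.shape)
print(np.round(V_approx, 2))
```

The product `W @ H` has the same shape as `V`, but it is described by far fewer numbers once `V` is large and the number of components stays small.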

#### Why would we want to do this?

__Collaborative Filtering__

1. Imagine the rows of V represent customers, and the columns of V represent products (physical goods or media, like movies or songs).

2. Then we can imagine that the W matrix represents each customer's affinity to a set of latent factors (aspects of a product, like movie genre or clothing style).

3. H represents the strength of association between those factors and each product.

4. It is easy to imagine that with a ton of "factors" we could get a really good breakdown of customers' likes and product aspects. But -- if we can get close to this with just a small number of factors -- then we will have a *compact, fast, inexpensive* way to recommend products to customers (or maybe even engineer products).

This is really another flavor of dimensionality reduction -- we're again finding a low-dimensional representation of the linkage patterns between customers and products.
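
To make the customer/product picture concrete, here is a minimal sketch under stated assumptions: the purchase matrix, the product names, and the `recommend` helper are all hypothetical. Zeros are treated as observed "no interest" values, which a production recommender would likely handle differently.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical interaction counts: rows are customers, columns are products.
products = ["action_movie_1", "action_movie_2", "romance_1", "romance_2"]
V = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 5.0, 4.0],
              [0.0, 1.0, 4.0, 0.0]])

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)  # customer-to-factor affinities
H = model.components_       # factor-to-product weights

scores = W @ H              # reconstructed affinity for every customer/product pair


def recommend(customer_idx, top_k=1):
    """Return the highest-scoring products the customer has not interacted with."""
    unseen = np.flatnonzero(V[customer_idx] == 0)
    ranked = unseen[np.argsort(scores[customer_idx, unseen])[::-1]]
    return [products[i] for i in ranked[:top_k]]


print(recommend(1))
```

The recommendation comes entirely from the low-rank reconstruction: a product scores well for a customer when it loads on the same factors as the products that customer already interacted with.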

The same exact approach can help us distill "topics" in natural language processing. If we imagine the rows of V as documents (or text instances) and the columns of V as terms, then the latent factors can be "topics" that unite terms via corresponding usages in documents.

In [1]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 1.420s.


In the following:
* Stop words are words so common that we don't want them counted at all (like "the")
* TF-IDF encodes a document as a vector: it counts how often each term occurs in the document, then down-weights terms that are common across all documents
  * E.g., "have" -- if it were included at all and not filtered out as a stop word -- would carry minimal weight because it occurs in most documents
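
The down-weighting can be seen on a tiny corpus. This is a sketch with made-up documents; stop-word filtering is deliberately left off (unlike the notebook cell below) so that "the" stays in the vocabulary and its weight can be compared with a rarer term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents; "the" appears in all of them, "tires" in only one.
docs = [
    "the engine of the car is loud",
    "the car needs new tires",
    "the bible is an old book",
]

vectorizer = TfidfVectorizer()  # no stop-word filtering, so "the" is kept
tfidf = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_  # term -> column index
idf = vectorizer.idf_           # one inverse-document-frequency value per term
dense = tfidf.toarray()

print("idf of 'the':  ", round(idf[vocab["the"]], 3))
print("idf of 'tires':", round(idf[vocab["tires"]], 3))

# Both terms occur once in the second document, yet their weights differ.
print("weight of 'the' in doc 1:  ", round(dense[1, vocab["the"]], 3))
print("weight of 'tires' in doc 1:", round(dense[1, vocab["tires"]], 3))
```

Both terms have the same raw count in the second document, so any difference in their final weights comes from how common each term is across the corpus.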

In [2]:
data_samples = dataset.data[:n_samples]

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features for NMF...
done in 0.359s.


The "Frobenius norm" of a matrix is the square root of the sum of the squares of its entries.
  * In this case it represents a Euclidean distance between the target matrix and the product of the approximate factors, so we'd like it to be small
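
As a small worked example (the two matrices are made up), the Frobenius norm of the difference between a target and its approximation can be computed by hand or with NumPy:

```python
import numpy as np

# A made-up target matrix and a made-up approximation of it.
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])
V_approx = np.array([[1.0, 2.0],
                     [3.0, 2.0]])

residual = V - V_approx  # only the bottom-right entry differs, by 2

# Square root of the sum of squared entries, written out explicitly...
by_hand = np.sqrt(np.sum(residual ** 2))
# ...and the same quantity via NumPy's built-in Frobenius norm.
by_numpy = np.linalg.norm(residual, ord="fro")

print(by_hand, by_numpy)  # both are 2.0
```

A perfect approximation would give a norm of zero; NMF searches for nonnegative factors that make this quantity as small as it can.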

In [3]:
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
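# Note: the `alpha` argument was removed in scikit-learn 1.2. Newer versions use
# `alpha_W` and `alpha_H`, which are scaled differently, so 0.1 may not carry over directly.
nmf = NMF(n_components=n_components, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)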
print("done in %0.3fs." % (time() - t0))

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.252s.


In [4]:
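# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available from 1.0) returns an array, so convert it to a list.
tfidf_feature_names = list(tfidf_vectorizer.get_feature_names_out())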
print(tfidf_feature_names)

['00', '000', '10', '100', '11', '12', '128', '13', '130', '14', '15', '16', '17', '18', '19', '1992', '1993', '20', '200', '21', '22', '23', '24', '25', '250', '26', '27', '28', '29', '2nd', '30', '300', '31', '32', '33', '34', '35', '36', '37', '38', '3d', '40', '42', '43', '44', '45', '48', '49', '50', '500', '51', '55', '60', '66', '70', '72', '75', '80', '800', '86', '90', '92', '93', '__', 'able', 'ac', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'added', 'addition', 'address', 'administration', 'advance', 'age', 'ago', 'agree', 'aids', 'air', 'al', 'allow', 'allowed', 'alt', 'america', 'american', 'amiga', 'analysis', 'anonymous', 'answer', 'answers', 'anti', 'anybody', 'apartment', 'appears', 'apple', 'application', 'applications', 'apply', 'appreciated', 'approach', 'appropriate', 'apr', 'april', 'archive', 'area', 'areas', 'aren', 'argument', 'armenia', 'armenian', 'armenians', 'army', 'article', 'ask', 'asked', 'asking', 'assume', 'atheism', 'attack'

In [5]:
nmf.components_[0]

array([8.23963526e-03, 6.91942718e-02, 1.54262543e-01, 7.47558702e-02,
       5.89972819e-02, 7.74345567e-02, 4.31107957e-04, 4.33120399e-02,
       9.89838497e-04, 6.49787680e-02, 7.46599177e-02, 6.93757423e-02,
       3.80165613e-02, 3.20071511e-02, 1.32714841e-02, 1.61353918e-02,
       3.26349276e-02, 7.08782057e-02, 2.99509804e-02, 2.49964685e-02,
       2.54943956e-02, 2.26323850e-02, 2.17834039e-02, 5.02778932e-02,
       2.52819646e-02, 1.59392421e-02, 1.15007385e-02, 7.08263665e-03,
       4.21495575e-03, 7.03032975e-03, 5.24030368e-02, 1.65567569e-02,
       7.24453483e-03, 2.15336677e-02, 7.32362892e-03, 0.00000000e+00,
       1.11602980e-02, 0.00000000e+00, 0.00000000e+00, 1.82491394e-03,
       0.00000000e+00, 4.68971862e-02, 1.63705626e-03, 0.00000000e+00,
       0.00000000e+00, 8.12693287e-03, 0.00000000e+00, 0.00000000e+00,
       6.05485982e-02, 4.22061219e-02, 1.81902258e-03, 1.04720227e-02,
       2.21383344e-02, 1.29457460e-03, 4.85224852e-03, 1.09826316e-02,
      

In [6]:
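# Pair each term with its weight in topic 0, keeping only terms that sort
# before 'b' so the output stays short.
[(name, magnitude) for (name, magnitude) in zip(tfidf_feature_names, nmf.components_[0]) if name < 'b']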

[('00', 0.008239635263297997),
 ('000', 0.06919427177399445),
 ('10', 0.15426254250482974),
 ('100', 0.07475587017743668),
 ('11', 0.05899728192229134),
 ('12', 0.07743455674566742),
 ('128', 0.00043110795669383696),
 ('13', 0.043312039884082824),
 ('130', 0.0009898384968139342),
 ('14', 0.06497876801662089),
 ('15', 0.07465991773307805),
 ('16', 0.06937574228969356),
 ('17', 0.0380165612960387),
 ('18', 0.03200715113691779),
 ('19', 0.013271484134595328),
 ('1992', 0.016135391771745767),
 ('1993', 0.032634927577005125),
 ('20', 0.07087820572423732),
 ('200', 0.02995098042223915),
 ('21', 0.024996468544751203),
 ('22', 0.02549439559582958),
 ('23', 0.02263238496285581),
 ('24', 0.021783403914011987),
 ('25', 0.050277893175993044),
 ('250', 0.025281964602312433),
 ('26', 0.015939242066013614),
 ('27', 0.011500738466525906),
 ('28', 0.007082636653535145),
 ('29', 0.004214955746868225),
 ('2nd', 0.007030329750926434),
 ('30', 0.0524030368101537),
 ('300', 0.016556756906167737),
 ('31', 0.

In [7]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
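        # argsort() is ascending, so this reversed slice yields the indices
        # of the n_top_words largest weights, largest first.
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])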
        print(message)
    print()

print("\nTopics in NMF model (Frobenius norm):")
print_top_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model (Frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files format win sound ftp pub read save site help image available create copy running memory self version
Topic #7: game team games year win play season playe

There are many other approaches to topic modeling, from Naive Bayes models to Latent Dirichlet Allocation to neural networks.