# Recommending Text Articles with Matrix Factorization

## Introduction

We'll use the same datasets as last week, this time applying topic modeling to build recommendation systems. Essentially, we'll be asking the machine to "read" the documents and decide which documents are most like one another. Let's start small by getting the 20 Newsgroups data (`fetch_20newsgroups`) and grabbing 3 of the categories. You can see the full list of available categories here: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html.

## Question 1

* Load ONLY the training set from each of these categories 
* Remove the headers, footers, and quotes from each member of the set
* Vectorize the dataset, once with CountVectorizer and once with TF-IDF.
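
As a rough illustration of the vectorizing step (not a solution to the exercise), here is a minimal sketch on a hypothetical four-document toy corpus rather than the newsgroup data. The variable names and the `stop_words="english"` setting are assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the newsgroup posts (hypothetical data).
toy_docs = [
    "the pitcher threw a fastball and the batter hit a home run",
    "the team won the baseball game in the ninth inning",
    "render the image with a graphics card and a shader",
    "the graphics software supports 3d image rendering",
]

# Same tokenization for both; only the weighting of the counts differs.
count_vectorizer = CountVectorizer(stop_words="english")
tfidf_vectorizer = TfidfVectorizer(stop_words="english")

count_data = count_vectorizer.fit_transform(toy_docs)  # raw term counts
tfidf_data = tfidf_vectorizer.fit_transform(toy_docs)  # counts reweighted by TF-IDF

# Both are sparse (n_documents, n_terms) matrices over the same vocabulary.
print(count_data.shape, tfidf_data.shape)
```

Both matrices have one row per document and one column per vocabulary term. With the default settings, each TF-IDF row is scaled to unit length.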

In [None]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
import numpy as np
import nltk
import os
from sklearn import datasets

categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']

# student section here




In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# student section here






## Question 2

* Train both LSA and NMF based models with the vectorized data. If you can, use both CountVectorizer and TF-IDF Vectorized data. (In sklearn, LSA is called TruncatedSVD since it can serve many purposes!)
* Print out your topics and their top 10 associated words. Do they make sense? Are they different for LSA vs NMF? How many topics do we need in order to make sure our Latent Space is understanding all the input document flavors?
* Make sure to save the vectors for the training documents after they're transformed into the latent space; we'll need them in Question 3.

In [None]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [None]:
from sklearn.decomposition import NMF, TruncatedSVD

# student section here





In [None]:
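# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
# The TF-IDF model's topics are labeled with the TF-IDF vectorizer's vocabulary
display_topics(lsa_tfidf, tfidf_vectorizer.get_feature_names_out(), 10)

In [None]:
display_topics(lsa_cv, count_vectorizer.get_feature_names_out(), 10)

In [None]:
display_topics(nmf_cv, count_vectorizer.get_feature_names_out(), 10)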

## Question 3

* Use the vectors for the documents to train a nearest-neighbors finder (sklearn's unsupervised `NearestNeighbors`, not the K-Nearest-Neighbors classifier)
* Load in the "test" data from fetch20 for one of the categories you've used in the training data, and process it with the vectorizer and one of the LSA models.
* Use your Nearest Neighbor model to find the 10 most similar documents to some document in the testing data, and return them as a recommendation for the user. Remember to choose a metric that makes sense in this context.
* Repeat with the other LSA model and the NMF model
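
To show the mechanics of the neighbor lookup, here is a minimal sketch using hypothetical latent-space vectors instead of real model output. Cosine distance is used as one reasonable choice of metric, since it compares the direction of the topic mix rather than the vector length; treat that as an assumption, not the required answer.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical latent-space vectors: 6 training documents, 3 topics.
training_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])

# Brute-force search with cosine distance over the training vectors.
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(training_vectors)

# A new document with the same topic mix as document 0, but half the length.
new_vec = np.array([[0.45, 0.05, 0.0]])

# kneighbors returns (distances, indices), one row per query vector.
distances, indices = nn.kneighbors(new_vec)
print(indices[0])
```

Because `kneighbors` returns a `(distances, indices)` pair, indexing the result with `[1][0]` gives the neighbor indices for the first query. Here the query points in the same direction as document 0, so documents 0 and 1 come back as the closest matches even though the query vector is shorter.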

In [None]:
from sklearn.neighbors import NearestNeighbors
new_data = datasets.fetch_20newsgroups(subset='test', 
                                       categories=['rec.sport.baseball'], 
                                       remove=('headers', 
                                               'footers', 'quotes'))

In [None]:
def get_recommendations(first_article, model, vectorizer, training_vectors):
    new_vec = model.transform(
        vectorizer.transform([first_article]))
    
    # student section here
    
    
    # student section ends here
    return results[1][0]

print(get_recommendations(new_data.data[2], lsa_cv, count_vectorizer, lsa_cv_data))

In [None]:
def print_recommendations(first_article,recommend_list):
    print(first_article)
    print('\n------\n')
    for resp in recommend_list:
        print('\n --- Result --- \n')
        print(ng_train.data[resp])
        
rec_list = get_recommendations(new_data.data[2], lsa_cv, count_vectorizer, lsa_cv_data)
print_recommendations(new_data.data[2],rec_list)

In [None]:
rec_list = get_recommendations(new_data.data[2], lsa_tfidf, tfidf_vectorizer, lsa_tfidf_data)
print_recommendations(new_data.data[2],rec_list)

In [None]:
rec_list = get_recommendations(new_data.data[2], nmf_cv, count_vectorizer, nmf_cv_data)
print_recommendations(new_data.data[2],rec_list)

## Question 4 (Optional, but recommended)

* We can do the same thing with LDA as we have with Latent Space reductions. Create an LDA model with the fetch20 training data, with at least 4 topics.
* Load in the fetch20 testing data for one of the categories only.
* Use a nearest-neighbor model to recommend 10 documents to a user based on any other document, returning the number and text of the document for review.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import datasets

categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']

# student section here





# student section ends here

X = count_vectorizer.fit_transform(ng_train.data)
data = lda.fit_transform(X)

test = lda.transform(count_vectorizer.transform(new_data.data))
print(new_data.data[0])
print(test[0])

In [None]:
rec_list = get_recommendations(new_data.data[2], lda, count_vectorizer, data)
print_recommendations(new_data.data[2],rec_list)

We can see that LDA allows us to do the same thing, but with a bit less success than the matrix reduction methods. In both cases, we end up with some understanding of "how much of each topic" is in the document. However, LDA is not as strong with small documents. The topics just aren't as well defined using the LDA methodology without lots of words to sample from in each document. With longer documents, LDA is excellent at pulling apart and understanding what topics are making up the document.
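
The "how much of each topic" idea can be made concrete with a small sketch. It assumes a hypothetical toy corpus and only 2 topics (fewer than the exercise asks for), and simply shows that sklearn's LDA returns a topic distribution for each document.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus with two themes (hypothetical data).
toy_docs = [
    "pitcher threw fastball batter hit home run",
    "baseball team won game ninth inning home run",
    "pitcher struck out batter baseball game",
    "graphics card renders image shader",
    "graphics software image rendering shader",
    "image rendering software graphics card",
]

# LDA models word counts, so use CountVectorizer rather than TF-IDF.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a distribution over topics for one document.
for text, mix in zip(toy_docs, doc_topics):
    print([round(float(p), 2) for p in mix], text)
```

Each row of `doc_topics` is non-negative and sums to 1, so it can be read directly as topic proportions, and those rows can be fed to the same nearest-neighbor search as the LSA or NMF vectors.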

## Question 5

* Open our second dataset at `data/ap/ap_split.txt` (Source: http://www.cs.columbia.edu/~blei/lda-c/). This is a dataset of articles from the Associated Press with no pre-determined scheme of topics.
* Like last week, split and clean this dataset. You may use the same code from last week.
* We want to build an article recommender for the Associated Press. Let's imagine that whatever input document we choose is the article the user is currently reading - we need to provide a list of a few documents to populate the "you might also like..." section of their website. Use the tools you developed above to make a production pipeline to accomplish this. 
    * **Inputs:** the ID number of the article (position in your list)
    * **Outputs:** the IDs for the 5 articles most like the one being read
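
One possible shape for such a pipeline is sketched below, assuming a hypothetical toy corpus and a TF-IDF plus LSA embedding (the exercise itself uses LDA). The `recommend` helper and its `n_recs` parameter are illustrative names, not part of the exercise code.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

# Toy corpus; list position doubles as the article ID (hypothetical data).
toy_docs = [
    "airline buys rival carrier in merger deal",
    "carrier president announces airline merger",
    "airline stock rises after merger deal",
    "students protest school conditions",
    "school board responds to student protest",
    "high school students march in protest",
    "airline president discusses carrier stock",
    "students and teachers discuss school safety",
]

# Chain the vectorizer and the reduction so text goes in and vectors come out.
embed = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
)
doc_vectors = embed.fit_transform(toy_docs)
finder = NearestNeighbors(metric="cosine", algorithm="brute").fit(doc_vectors)


def recommend(article_id, n_recs=5):
    """Return the IDs of the n_recs articles closest to article_id."""
    # Ask for one extra neighbor: the article is normally its own best match.
    _, indices = finder.kneighbors(
        doc_vectors[[article_id]], n_neighbors=n_recs + 1
    )
    # Drop the article being read, then keep at most n_recs IDs.
    return [int(i) for i in indices[0] if i != article_id][:n_recs]


print(recommend(0, n_recs=3))
```

Requesting `n_recs + 1` neighbors and filtering out the input ID keeps the reader from being recommended the article they are already reading.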

In [None]:
import re

# Read the raw AP file and split it into one chunk per article
with open('../data/ap_split.txt','r') as f:
    raw_text = f.read()
docs = raw_text.split('---')

# Drop every line that contains a markup tag (<...>), then join the
# remaining lines, lowercase them, and strip backticks and apostrophes
has_tag = re.compile("<[^>]*>").search
for i,doc in enumerate(docs):
    kept = [line for line in doc.split('\n') if not has_tag(line)]
    docs[i] = ' '.join(kept).strip().lower().replace("`","").replace("'","")
docs[0]

In [None]:
# Spawn and train the models to be put into the pipeline

n_topics = 100
n_iter = 10

# student section here


# student section ends here

data = lda.fit_transform(X)

In [None]:
def get_recommendations(docs, article_id, model, vectorizer, training_vectors):
    # student section here
    
    
    # student section ends here
    return results[1][0]

def print_recommendations(docs, recommend_list):
    print("Recommended IDs: ", recommend_list)
    for resp in recommend_list:
        print('\n --- Result --- \n')
        print(docs[resp])
        
rec_list = get_recommendations(docs, 0, lda, count_vectorizer, data)
print_recommendations(docs, rec_list)

In this result, we fed in an article about high school violence and received back recommendations about high school boycotts, teenagers marching against high school conditions, and terror attacks. Not the most cheery of subjects, but it's clear that the model is understanding the documents and recommending "more of the same" as we expect. Let's try an article on a different topic.

In [None]:
rec_list = get_recommendations(docs, 100, lda, count_vectorizer, data)
print_recommendations(docs, rec_list)

Excellent! We put in an article about an airline acquisition that also talked about the president of the company. All of our recommendations are either business or airline related! This should work as a recommender - from here, all we'd need to do is link up the plumbing to some database of articles to make sure that whatever article we want to pull as a recommendation gets pulled properly and displayed. That's beyond the scope of this exercise (since we don't have a database), but is the last remaining step to make a functional article recommendation system. With more data, more improvements could be made... like adding a "recent-ness" filter and other steps... but for now, we've got a system!

## Question 6
Let's build some clusters on our documents. We can already see that the relationship between documents in our latent space is meaningful, so clusters should be able to "lump" similar documents together in the topic space.

* Spawn an LSA or NMF model with a proper vectorizer for the AP data - use 50 or more dimensions in the reduction (we'll want that for some visualizations).
* Run a clustering algorithm on the reduced data, keeping track of the predicted cluster labels. 
* Use sklearn's t-SNE class to visualize the clusters in 2D.
* Use the cluster ID for each document as the input, and return 10 other documents from the same cluster as a recommendation

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# student section here


# student section ends here

tfidf_data = tfidf_vectorizer.fit_transform(docs)

In [None]:
from sklearn.decomposition import TruncatedSVD

n_comp = 50
# student section here



In [None]:
from sklearn.cluster import KMeans

# student section here



Let's chat about t-SNE really quickly: t-SNE should only be used for displaying high-dimensional data. NEVER convert to t-SNE and then cluster. t-SNE tries to keep points that are neighbors in the original space close together, but it does not preserve the distances themselves. So it's great for keeping the overall neighborhood relationships intact (as you can see below, since the clusters tend to stay together), but the actual distances in 2D aren't meaningful and thus shouldn't be used to cluster.
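
The order of operations described above can be sketched like this, using synthetic 50-dimensional blobs as a hypothetical stand-in for the LSA output. The cluster count and the `perplexity` value are assumptions picked for the small sample size.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for a 50-dimensional latent space (hypothetical data).
latent, _ = make_blobs(n_samples=90, n_features=50, centers=3, random_state=0)

# Cluster in the original 50-dimensional space, where distances are meaningful.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent)

# Use t-SNE only to get 2D coordinates for plotting; never cluster on these.
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(latent)

print(embedding.shape, len(set(labels)))
```

The labels come from the high-dimensional data and are only used to color the 2D points, for example with `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)`.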

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline


tsne = TSNE().fit_transform(lsa_data)

In [None]:
plt.figure(figsize=(12,12))
# Color each point by its cluster label; the colormap is scaled to the label range
plt.scatter(tsne[:,0],tsne[:,1],c=labels,cmap='rainbow')

We could thus also use the clustering labels as a recommendation (albeit, not as direct as using the nearest neighbors) by finding what cluster the document belongs to, then recommending other documents in that cluster randomly.

In [None]:
import pandas as pd
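def get_cluster_recommends(docs, labels, id_of_doc, n_recs=10):
    df = pd.DataFrame(labels, columns=['label'])
    # Keep documents in the same cluster, excluding the query document itself
    df = df[df['label'] == labels[id_of_doc]].drop(index=id_of_doc)
    # A small cluster may hold fewer than n_recs other documents
    return list(df.sample(min(n_recs, len(df))).index)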

rec_list = get_cluster_recommends(docs, labels, 0)
print_recommendations(docs,rec_list)

We can see it's not as direct, but when fed the same article on student violence, this method recommends education, student, and high school related articles. This will of course be somewhat randomized - but the methodology is sound. It also doesn't seem to be quite so "direct" in terms of choosing articles that are on EXACTLY the same topic. This method is not as popular, but can also be used to recommend articles when users are at risk of getting caught in a feedback loop of always seeing articles that are too similar.