### UNSUPERVISED LEARNING

# Recommending documents with LSA

----

#### ❗ NLTK is hard to install. I recommend running this notebook in Google Colab instead: https://drive.google.com/file/d/1xel4VmTqzFoZkOiEijyGhYuH6BQW5lYM/view?usp=sharing

----

We'd like to find documents with similar content to a document we like, but without having to rely on tagging or other labels. This is what **latent semantic analysis** (LSA) is for: it lets us 'sense' the meaning of a document from the words it contains.

Inspired by and/or based on [**science concierge**](https://github.com/titipata/science_concierge) and [**Chris Clark's repo**](https://github.com/groveco/content-engine) on content-based recommendation.

[This blog post](https://www.themarketingtechnologist.co/a-recommendation-system-for-blogs-content-based-similarity-part-2/) is also really good. [Pysuggest](https://pypi.python.org/pypi/pysuggest) might be worth looking at, and so might [Crab](https://muricoca.github.io/crab/).

Believe it or not, we can do all of it in about 10 lines of code!

----

We'll start with some data:

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/seg/2017-tle-hall/master/data/title_abstract_doi.csv')

df.head()

| Unnamed: 0 | title | abstract | doi |
|---|---|---|---|
| 0 | Uranium Measurement By Airborne Gamma‐Ray Spec... | In the airborne measurement of uranium, window... | 10.1190/1.1440542 |
| 1 | Coupling In Amplitude Variation With Offset An... | Linear amplitude-variation-with-offset (AVO) a... | 10.1190/geo2012-0429.1 |
| 2 | Principal Component Spectral Analysis | Spectral decomposition methods help illuminate... | 10.1190/1.3119264 |
| 3 | Extended Arrays For Marine Seismic Acquisition | In‐line arrays for both source and receiver ha... | 10.1190/1.1440827 |
| 4 | Modeling Anisotropic Static Elastic Properties... | We have quantified the effects of clay fractio... | 10.1190/geo2015-0575.1 |


## Prepare the data

In [2]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

In [3]:
# Instantiate the stemmer and tokenizer.
stemmer, tokenizer = PorterStemmer(), RegexpTokenizer(r'\w+')

# Make a function to preprocess each item in the data.
def preprocess(item):
    """Tokenize on word characters, stem each token, and rejoin with spaces."""
    return ' '.join(stemmer.stem(token) for token in tokenizer.tokenize(item))

# Apply the preprocessing.
data = [preprocess(item) for item in df.abstract]
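
To see what the tokenizer contributes before any stemming happens, here is a minimal sketch using only the standard library. It assumes that `RegexpTokenizer(r'\w+')` in its default mode returns the same matches as `re.findall` with that pattern; the `tokenize` helper and the sample sentence are illustrative and not part of the notebook. Stemming is deliberately left out.

```python
import re

def tokenize(text):
    """Return runs of word characters, dropping punctuation and whitespace."""
    return re.findall(r'\w+', text)

# Hypothetical sample sentence, loosely modelled on the abstracts above.
tokens = tokenize("Linear amplitude-variation-with-offset (AVO) analysis.")
print(tokens)
```

Note that hyphenated compounds are split into separate tokens and brackets disappear, so 'amplitude-variation-with-offset' contributes four terms to the matrix rather than one.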

## Compute the document matrix

The matrix is a **term frequency, inverse document frequency** or "tfidf" matrix. It counts how many times each word or phrase (a 'term') appears in a document, then scales those counts down according to how many documents in the collection contain that term. So a rare word like 'coulomb' carries more weight than a common one like 'seismic'.
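
The weighting can be sketched by hand on a toy corpus. The example below is a rough illustration, not the notebook's method: the three documents are made up, there is no stemming or stop-word filtering, and it assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of the `TfidfVectorizer` defaults. The names `docs`, `df_counts` and `tfidf_weights` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (made up): 'seismic' is in every document, 'coulomb' in one.
docs = [
    "seismic data processing",
    "seismic coulomb stress",
    "seismic velocity model",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_weights(tokens):
    """Return L2-normalised tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_weights(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the second document every term occurs once, so the difference in weight comes entirely from the inverse document frequency.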

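Because we pass `stop_words='english'`, the `sklearn` vectorizer filters out common 'stop' words like 'the' or 'and'. Since we stemmed the text first, a few stop words may slip through in their stemmed form. The vectorizer works just like `sklearn`'s other models: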

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,1))
vecs = tfidf.fit_transform(data)

The resulting matrix has one row for each document, and one column for each 'term'. If we include n-grams (groups of adjacent words) by raising the upper bound of `ngram_range`, the matrix becomes much wider.

In [5]:
vecs.shape

(1000, 6133)
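
To get a feel for how quickly n-grams widen the matrix, here is a small standard-library sketch that counts distinct terms for two settings of an `ngram_range`-style parameter. The `ngrams` and `vocabulary` helpers and the three sample documents are hypothetical; this mimics the idea only and does not reproduce the vectorizer's tokenization or stop-word handling.

```python
def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(docs, ngram_range=(1, 1)):
    """Collect the distinct n-grams across all documents for the given range."""
    lo, hi = ngram_range
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(lo, hi + 1):
            vocab.update(ngrams(tokens, n))
    return vocab

# Made-up documents for illustration.
docs = [
    "spectral decomposition of seismic data",
    "seismic data acquisition with marine arrays",
    "spectral analysis of gravity data",
]

unigrams = vocabulary(docs, (1, 1))
uni_and_bigrams = vocabulary(docs, (1, 2))
print(len(unigrams), len(uni_and_bigrams))
```

Each distinct term becomes a column, so even on three short documents the bigram setting roughly doubles the width.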

## Reduce the number of dimensions

To make the matrix more manageable, we can reduce the number of dimensions with singular value decomposition (SVD). Applying a truncated SVD to a tfidf matrix is the core of latent semantic analysis. We'll reduce it down to 100 dimensions.

In [6]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100).fit_transform(vecs)
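
The idea behind the reduction can be sketched with plain NumPy on a small dense matrix. This is an illustration under assumptions, not what `TruncatedSVD` does internally: the library uses an iterative solver suited to sparse input, and its components may differ in sign from a dense SVD. The random matrix `X` is a stand-in for a document-term matrix and `k` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 5))   # stand-in: 8 'documents', 5 'terms'
k = 2                    # number of dimensions to keep

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values: each row is a document in k dimensions.
reduced = U[:, :k] * s[:k]

# The same coordinates come from projecting X onto the top-k right singular vectors.
projected = X @ Vt[:k].T

print(reduced.shape)
print(np.allclose(reduced, projected))
```

Each row of `reduced` plays the role of one row of `svd` in the cell above: a short dense vector that stands in for a long sparse one.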

## Build and store the distance tree

The distance tree (here a k-d tree) is a fast data structure for finding nearest neighbours in a high-dimensional space.

In [7]:
from sklearn.neighbors import KDTree

tree = KDTree(svd)
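
What the tree speeds up is an ordinary nearest-neighbour search. The brute-force sketch below computes the same kind of answer directly, assuming Euclidean distance (which I believe is the `KDTree` default metric). The `nearest` helper and the five 2-D points are hypothetical; on a real corpus the tree avoids comparing the target against every row.

```python
import numpy as np

def nearest(vectors, target, k):
    """Return indices of the k rows closest to row `target`, excluding itself."""
    dists = np.linalg.norm(vectors - vectors[target], axis=1)
    order = np.argsort(dists)
    return [int(i) for i in order if i != target][:k]

# Made-up points: rows 0, 1, 3 form one cluster and rows 2, 4 another.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [0.0, 0.3],
    [4.0, 5.0],
])

print(nearest(points, target=0, k=2))
```

A check like this on a handful of rows is a cheap way to confirm that a tree query returns the neighbours you expect.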

## Query the tree for recommendations

Now we can find a paper we're interested in and try to find similar papers.

In [8]:
target = 333

df.title[target]

'A Terracing Operator For Physical Property Mapping With Potential Field Data'

In [9]:
# Recommend 5 docs for a single document. We ask for k=6 because
# the nearest neighbour of the target is the target itself.
_, idx = tree.query([svd[target]], k=6)

[df.title[i] for i in idx[0] if i != target]

['U.S. National Magnetic Anomaly Survey Specifications Workshop Report',
 'High-Resolution Gravity Study Of The Gray Fossil Site',
 'Geologic Implications Of Aeromagnetic Data For The Eastern Continental Margin Of The United States',
 'Calculation Of Magnitude Magnetic Transforms With High Centricity And Low Dependence On The Magnetization Vector Direction',
 'The World‐Wide Gravity Program Of The Mapping And Charting Research Laboratory Of Ohio State University']

## Exercise

- Can you visualize the document clusters with t-SNE or UMAP?

See the **Unsupervised clustering** notebook.