# Measuring Course Similarity

This document demonstrates how to quantify the similarity of different courses within the Orange course offerings.  The
similarity of two courses is measured by comparing the terms in their course descriptions.  A
first-order implementation identifies all words within each course description and finds the courses whose descriptions have 
a large intersection of terms.  However, this simple model has several shortcomings:

1) A large number of words are ubiquitous and uninformative.  Articles, pronouns, and conjunctions carry 
little information about the content and audience of a text.

2) Conjugation and declension introduce significant ambiguity.  For example, "computer" and 
"computers" identify a single concept, but within the 'bag-of-words' model described above these two terms are not linked.

3) Language is sparse, and words can be highly correlated.  For example, terms like "SQL", "HIVE", and "Database" are 
all related, and courses which discuss any of them may be similar.  A simple bag-of-words model ignores these 
relationships and loses the correlated information.
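
As a rough illustration of the first-order model and its first two shortcomings, the following minimal sketch scores two made-up course descriptions by the overlap of their word sets.  The descriptions, the helper names word_set and overlap, and the choice of the Jaccard ratio as the overlap score are all assumptions made for this example; they are not taken from the course catalog.

```python
def word_set(text):
    """Naive bag of words: lower-case the text and split on whitespace."""
    return set(text.lower().split())


def overlap(text_a, text_b):
    """Jaccard overlap: shared words divided by all distinct words."""
    words_a, words_b = word_set(text_a), word_set(text_b)
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical descriptions, invented for this example
course_a = "introduction to the computer"
course_b = "introduction to the computers"

shared = word_set(course_a) & word_set(course_b)
print(sorted(shared))               # only 'introduction', 'the' and 'to' are shared
print(overlap(course_a, course_b))  # 3 shared words out of 5 distinct words
```

The score is driven largely by stop words such as "to" and "the", while "computer" and "computers" are treated as unrelated terms.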

This document shows how each of these shortcomings can be addressed within a similarity measurement 
system.

The document is divided into the following sections:


1) Environment setup: Loading required packages which may need to be installed

2) Ingestion: Loading the data into Jupyter

3) Tokenization, Cleansing, and Stemming: Identifying words within the narrative, removing uninformative words, 
and reducing inflected word forms to a common root.

4) Latent Semantic Analysis: Addressing highly correlated terms and reducing language dimensionality

5) Similarity Measurement: Measuring course similarity

6) Storage and reuse

## Environment Setup

Two non-standard Python packages are required for this analysis:

* NLTK: the *Natural Language Toolkit*, a full-featured NLP toolkit.  We will use it for stemming, i.e. reducing inflected word forms to a common root.  
The package has several other features worth exploring, including part-of-speech tagging and bundled 
text corpora.  Part-of-speech (POS) tagging is useful for identifying noun phrases within a narrative.  Linguistically, 
noun phrases often carry a great deal more information than verb phrases, so this may later help reduce 
dimensionality.

* STOP_WORDS: A simple package containing uninformative English words such as conjunctions, pronouns, etc.  We will use it 
to remove these words from the matrix and reduce dimensionality.

In [None]:
%matplotlib inline

In [None]:
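import importlib.util
import subprocess
import sys

# pip no longer exposes pip.main or pip.get_installed_distributions (removed in pip 10),
# so we test whether each module is importable and invoke pip as a subprocess instead.
required_packages = {'nltk': 'nltk', 'stop_words': 'stop-words'}  # import name -> PyPI name
for module_name, pkg_name in required_packages.items():
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pkg_name])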

In [None]:
import re
import getpass
from functools import partial

import numpy as np
import pandas as pd
import sqlalchemy as sq
import stop_words
from IPython.display import HTML
from matplotlib.pylab import plot, hist, xlabel, ylabel
from nltk.stem import PorterStemmer
from scipy import sparse
from scipy.sparse import linalg as sla
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

## Ingestion

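We connect to the source database and load the course information.  The %sql magic used below is provided by the ipython-sql extension, which is assumed to be loaded and connected to the database already.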

In [None]:
course_data = %sql SELECT * FROM course_description_catalog
df = course_data.DataFrame()

In [None]:
# Spot check: show the courses whose series name contains 'securi' (e.g. "Security")
df[df.series.apply(lambda ele: 'securi' in ele.lower())]


## Tokenization, Cleansing, and Stemming

We need to take the text of the course description and break it down into individual words (tokenization), 
remove words which are uninformative (cleansing or stop word removal), 
and identify the common roots of identified words (stemming).


In [None]:
example_narrative = df.iloc[0]['coursedes']
print(example_narrative)

In [None]:
# Tokenization: we split words wherever there is a space or one of a limited set of punctuation marks {, - ! ? : . ; % ' " ( )}.
# This leads to some mistakes (for example, "E-mail" is broken into "e" and "mail").
elements = re.split(r"[, (\-!?:.;%'\")]+", example_narrative.lower())
elements

In [None]:
# Stop word removal simply filters against a list of words.  It is important in a bag-of-words model to reduce dimensionality.
# If n-gram phrases are used later to capture negation or other concepts, this needs to be done more carefully.
stopwords = stop_words.get_stop_words('english')
stopwords[0:10]

In [None]:
elements_prime = [x for x in elements if x not in stopwords]
print("Cleansed list: %s" % elements_prime)
print("Removed words: %s" % set.difference(set(elements), set(elements_prime)))

In [None]:
# Stemming: we use a simple stemming algorithm from NLTK to find the roots of words:
stemmer = PorterStemmer()
elements_stemmed = [stemmer.stem(x) for x in elements_prime ]
print(elements_stemmed)

In [None]:
# Defining a tokenization function that combines the three steps above
stopwords = stop_words.get_stop_words('english')
stemmer = PorterStemmer()

def tokenizer(narrative, stopwords=None, stemmer=None):
    """Split a narrative into lower-case tokens, optionally removing stop words and stemming."""
    elements = re.split(r"[, (\-!?:.;%'\")]+", narrative.lower())
    elements = [x for x in elements if len(x) > 0]  # drop empty strings left by the split
    if stopwords is not None:
        elements = [x for x in elements if x not in stopwords]
    if stemmer is not None:
        elements = [stemmer.stem(x) for x in elements]
    return elements

# Bind the stop word list and the stemmer so the function takes a single argument
tokenizer_fcn = partial(tokenizer, stopwords=stopwords, stemmer=stemmer)

In [None]:
tokenizer_fcn(example_narrative)

### Building out the TF and TF-IDF Matrix

We can use sklearn's CountVectorizer to create the initial term frequency (TF) matrix and the TfidfTransformer 
to transform this into a TF-IDF matrix.  This will serve as the basis for measuring similarity going forward.
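
To make the weighting concrete, here is a small pure-Python sketch of the TF-IDF computation on a made-up, already tokenized corpus.  It assumes the smoothed inverse document frequency that scikit-learn documents as the TfidfTransformer default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and skips row normalization to mirror norm=None.  The toy documents and variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized course descriptions (already stemmed)
docs = [
    ["sql", "databas", "queri", "sql"],
    ["hive", "databas"],
    ["python", "program"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + df_t)) + 1 for term, df_t in doc_freq.items()}

# TF-IDF without normalization: raw term count times IDF
tf_idf = [{term: count * idf[term] for term, count in Counter(doc).items()} for doc in docs]

for row in tf_idf:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in several documents, like "databas" here, receive a lower IDF than terms confined to a single document.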

In [None]:
count_mdl = CountVectorizer(tokenizer=tokenizer_fcn)
tf_mtx = count_mdl.fit_transform(df.coursedes)
tf_idf_fcn = TfidfTransformer(norm=None)
tf_idf_mtx = tf_idf_fcn.fit_transform(tf_mtx)

In [None]:
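# Build the TF dataframe: one row per (course, word) pair with a non-zero count
course_index, word_index = tf_mtx.nonzero()
count = tf_mtx.data
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = np.asarray(count_mdl.get_feature_names_out())
df_term_frequency = pd.DataFrame({"course_id": df['course#'].values[course_index], 
                                 "word": feature_names[word_index], 
                                 'count': count})
HTML(df_term_frequency.head(n=20).to_html())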

In [None]:
# Look at the most frequent words in the corpus (sum only the count column)
df_term_frequency.groupby("word")['count'].sum().sort_values(ascending=False).head(n=20)

## Latent Semantic Analysis

Before we measure similarity we want to reduce dimensionality. Both the TF and TF-IDF matrices can be approximated by 
a reduced-rank matrix:

$$TFIDF = U S V^{T} \approx U_{k} S_{k} V_{k}^{T} $$

where $U_{k}$, $S_{k}$, and $V_{k}$ are the truncated factors of the singular value decomposition (SVD), keeping only the $k$ basis vectors 
which explain most of the variance in the dataset.  With this approximation we can reduce dimensionality and focus on 
the $k$ latent dimensions rather than on the full vocabulary.
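
The rank-$k$ approximation can be illustrated on a small dense matrix.  The sketch below uses numpy.linalg.svd on a made-up 3 x 4 matrix (rows standing in for courses, columns for terms) rather than the sparse solver used in this notebook; the matrix and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

# Hypothetical weight matrix: 3 courses (rows) by 4 terms (columns)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# Thin SVD; numpy returns the singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values and the matching singular vectors
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the truncation equals the norm of the dropped singular values
error = np.linalg.norm(A - A_k)
print("singular values:", np.round(s, 3))
print("approximation error:", round(float(error), 3))
print("fraction of squared norm kept:", round(float(np.sum(s[:k] ** 2) / np.sum(s ** 2)), 3))
```

A plot of the singular values, as in the next cell, is one common way to choose $k$: the values past the plateau contribute little to the reconstruction.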

In [None]:
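# svds needs k smaller than both matrix dimensions; its third return value is already V transposed
U, s, Vt = sla.svds(tf_idf_mtx, k=500)
# Sort explicitly so the singular values are plotted in descending order
plot(np.arange(500), np.sort(s)[::-1])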


In [None]:
# The singular values plateau at about 50 terms.  For simplicity we will truncate the SVD there.

In [None]:
latent_course_vectors_k, s, latent_word_vectors_k = sla.svds(tf_idf_mtx, k=50)
latent_word_vectors_k = latent_word_vectors_k.T

## Similarity Measurement

We can now measure the similarity between courses using the latent course vectors and the cosine similarity.
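
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude.  A minimal pure-Python sketch, with made-up two-dimensional latent vectors and a hypothetical helper named cosine, is:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical latent course vectors
course_a = [1.0, 2.0]
course_b = [2.0, 4.0]   # same direction as course_a, twice the length
course_c = [-2.0, 1.0]  # orthogonal to course_a

print(cosine(course_a, course_b))
print(cosine(course_a, course_c))
```

Vectors pointing in the same direction score close to 1 regardless of length, and orthogonal vectors score 0.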

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Remove the diagonal so that a course is not counted as similar to itself
similiarity_matrix = cosine_similarity(latent_course_vectors_k) - np.identity(latent_course_vectors_k.shape[0])

In [None]:
# Let's look at one course:
idx = np.random.choice(np.arange(df.shape[0]))
selected_course = df.iloc[idx]  # idx is a row position, so use positional indexing
selected_course

In [None]:
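# Index with argsort instead of sorting in place: values is a view, so an in-place
# sort would scramble this row of similiarity_matrix for the cells that follow
values = similiarity_matrix[idx, :]
similiar_courses = values.argsort()[-10:][::-1]  # ten most similar courses, best first
select_values = values[similiar_courses]
df_tmp = pd.DataFrame({"course title" : df.iloc[similiar_courses]['course title'], 
                      'similiarity': select_values})
HTML(df_tmp.to_html())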

In [None]:
# Let's look at the histogram of similarities
results = hist(np.array(similiarity_matrix.flatten()), np.linspace(-1, 1, 200))
xlabel('Similarity')
ylabel('Counts/(0.01)')

In [None]:
# Let's find the threshold for the top 2.5% of course pairs by similarity; pairs above it 
# will be saved to the database as similar
ii, jj = np.triu_indices(similiarity_matrix.shape[0], k=1)
thresh = np.percentile(similiarity_matrix[ii, jj], 97.5)
print(thresh)

In [None]:
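# Zero out every pair below the threshold
similiarity_matrix[similiarity_matrix < thresh] = 0
# The matrix is symmetric, so clear the lower triangle to keep each pair only once
ii, jj = np.tril_indices(similiarity_matrix.shape[0], k=-1)
similiarity_matrix[ii, jj] = 0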

In [None]:
sp_similiarity_matrix = sparse.csr_matrix(similiarity_matrix)


In [None]:
# Each remaining non-zero entry (upper triangle only) is one pair of similar courses
ii, jj = sp_similiarity_matrix.nonzero()
similiarities = sp_similiarity_matrix.data
course_id_1 = df['course#'].values[ii]
course_id_2 = df['course#'].values[jj]
course_title_1 = df['course title'].values[ii]
course_title_2 = df['course title'].values[jj]

In [None]:
df_similiarity = pd.DataFrame({'course_number_1': course_id_1, 
                              'course_number_2': course_id_2, 'course title 1': course_title_1, 
                               'course title 2': course_title_2,
                              'similiarity': similiarities})
df_similiarity.sort_values('similiarity', ascending=False, inplace=True)
HTML(df_similiarity.head(n=50).to_html())

A host of test prep courses rise to the top, so we filter them out next.

In [None]:
df_similiarity_tmp = pd.merge(df_similiarity, df, left_on = 'course_number_1', right_on = 'course#', how='inner')

In [None]:
HTML(df_similiarity_tmp[df_similiarity_tmp.series != 'Test Preps'][['course title 1', 'course title 2', 'similiarity']].sort_values('similiarity', ascending=False).head(n=20).to_html())

In [None]:
# Check that each course is similar to at least one other course; an empty set means none is left out
set.difference(set(np.arange(similiarity_matrix.shape[0])), set.union(set(np.unique(ii)), set(np.unique(jj))))

## Storage and Reuse

Now we would like to write some of these tables to the source database so that they can be reused later.

We specifically want:

* TF Matrix: Useful if we ever want to recompute the TF-IDF matrix and redo the latent semantic
factorization

* Document Latent Factors: Useful if we want to calculate similarities later

* Word Latent Factors: Useful in implementing the fold-in method, which updates the document latent factors without
recalculating the factorization

* Similarity Table: Useful in identifying similar courses
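
The fold-in method mentioned above can be sketched as follows.  Given the truncated factorization $A \approx U_{k} S_{k} V_{k}^{T}$ with documents as rows, a new document's term-weight vector $d$ is projected into the latent space as $d V_{k} S_{k}^{-1}$.  The example uses a made-up dense matrix and numpy.linalg.svd; the function name fold_in is hypothetical, and in this notebook the inputs would be the stored word latent factors and singular values.

```python
import numpy as np

# Hypothetical weight matrix: 3 courses (rows) by 4 terms (columns)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T  # V_k has one row per term


def fold_in(term_weights, word_vectors, singular_values):
    """Project a document's term-weight vector into the latent space: d V_k S_k^{-1}."""
    return term_weights @ word_vectors / singular_values


# Folding in a document that was part of the factorization recovers its latent vector
print(np.allclose(fold_in(A[0], V_k, s_k), U_k[0]))

# A new (hypothetical) document is placed in the same space without refitting
new_doc = np.array([1.0, 1.0, 0.0, 0.0])
print(fold_in(new_doc, V_k, s_k))
```

For the projection to be meaningful, the new document has to be weighted with the same vocabulary order and IDF values as the original matrix.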

In [None]:
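from urllib.parse import quote_plus
password = getpass.getpass()
# URL-escape the password so characters such as '@' or '/' do not break the connection string
conn = sq.create_engine('postgresql://hr:%s@192.168.161.79:5432/hr?sslmode=require' % quote_plus(password))
del password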

In [None]:
df_term_frequency.to_sql('course_term_frequency_tbl', conn, if_exists='replace')

In [None]:
df_similiarity.to_sql('course_similiarity_tbl', conn, if_exists='replace')

In [None]:
# Shared column names for the k latent dimensions
latent_names = ['latent_vector%i' % ii for ii in range(latent_course_vectors_k.shape[1])]
df_latent_course_vectors = pd.DataFrame(latent_course_vectors_k, index=df['course#'].values, columns=latent_names)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df_latent_word_vectors = pd.DataFrame(latent_word_vectors_k, index=count_mdl.get_feature_names_out(), columns=latent_names)
df_latent_singular_values = pd.DataFrame({"name": latent_names, 'value': s})

In [None]:
df_latent_course_vectors.to_sql('latent_course_vector_tbl', conn, if_exists='replace')
df_latent_word_vectors.to_sql('latent_word_vector_tbl', conn, if_exists='replace')
df_latent_singular_values.to_sql('latent_singular_values_tbl', conn, if_exists='replace')