# Document Similarity & Topic Modelling

---
You are currently looking at **version 1.0** of this notebook.

---

## Part 1 - Document Similarity

For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

The following functions are provided:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.

You will need to finish writing the following functions:
* **`doc_to_synsets:`** returns a list of synsets in the document. This function should first tokenize and part-of-speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each token's corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.

Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder, which will run `test_document_path_similarity` to check that these functions work correctly.

*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*

In [None]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

from sklearn.metrics import accuracy_score

### Synset

In [None]:
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')
cat.hypernyms(), dog.hypernyms()
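
The path similarity used later in this notebook is derived from the shortest path between two synsets in the hypernym graph: NLTK's `path_similarity` returns 1 / (1 + path length). The sketch below illustrates that formula on a tiny hand-made taxonomy. `TOY_HYPERNYMS` and `toy_path_similarity` are hypothetical names, and the graph is an assumption made for illustration, not the real WordNet structure.

```python
from collections import deque

# Hypothetical, hand-made taxonomy (NOT the real WordNet graph): child -> parent.
TOY_HYPERNYMS = {
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "animal",
}

def toy_path_similarity(a, b, hypernyms=TOY_HYPERNYMS):
    """Return 1 / (1 + shortest path length) between two nodes, or None if unconnected."""
    # Build an undirected adjacency map from the child -> parent links.
    neighbours = {}
    for child, parent in hypernyms.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    # Breadth-first search finds the shortest path in an unweighted graph.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1 + dist)
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(toy_path_similarity("cat", "cat"))   # 1.0 (path of 0 edges)
print(toy_path_similarity("cat", "dog"))   # 0.2 (path of 4 edges)
```

In this toy graph, identical nodes score 1.0 and the score shrinks as the connecting path gets longer. Unconnected nodes yield `None`, which is the case the "missing values should be ignored" rule refers to.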

### Helper to convert nltk-pos_tags to wordnet-pos_tags

In [None]:
def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

### Convert document to list of synsets
Tokenizes and tags the words in the document doc.
 - Then finds the first synset for each word/tag combination.
 - If a synset is not found for that combination it is skipped.

In [None]:
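def doc_to_synsets(doc):
    """Return the first WordNet synset of every token in doc that has one."""
    synsetlist = []
    tokens = nltk.word_tokenize(doc)
    pos = nltk.pos_tag(tokens)
    for token, tag in pos:
        synsets = wn.synsets(token, convert_tag(tag))
        if synsets:  # skip tokens with no matching synset
            synsetlist.append(synsets[0])
    return synsetlist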

In [None]:
doc_to_synsets('Fish are nvqjp friends.')

### Normalized Similarity score of 2 lists of synsets (s1, s2)
 - for each synset in s1, finds the synset in s2 with the largest similarity value.
 - take the mean of largest similarity values

In [None]:
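def similarity_score(s1, s2):
    """Mean of the best path similarity in s2 for each synset in s1 (0.0 if none)."""
    max_scores = []
    for synset1 in s1:
        run_max = 0
        for synset2 in s2:
            sim_score = synset1.path_similarity(synset2)
            if sim_score is not None:  # path_similarity returns None when no path exists
                run_max = max(run_max, sim_score)
        if run_max > 0:
            max_scores.append(run_max)
    # np.mean([]) is nan, and `nan or 0` is still nan, so handle the empty case explicitly
    return float(np.mean(max_scores)) if max_scores else 0.0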

In [None]:
synsets1 = doc_to_synsets('I like cats')
synsets2 = doc_to_synsets('I like dogs')
similarity_score(synsets1, synsets2)
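
The directed score is not symmetric, which is why `document_path_similarity` averages both directions. The following is a minimal sketch of the same "best match, then mean" idea using plain strings and a made-up similarity table instead of WordNet. `TOY_SIM`, `toy_sim` and `directed_score` are hypothetical names and the numbers are assumptions chosen for illustration.

```python
# Hypothetical pairwise similarities; a missing pair plays the role of
# path_similarity returning None.
TOY_SIM = {
    ("cat", "dog"): 0.2,
    ("cat", "like"): 0.1,
    ("like", "like"): 1.0,
}

def toy_sim(a, b):
    """Look up a similarity in either order; None when the pair is unknown."""
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a)))

def directed_score(items1, items2, sim=toy_sim):
    """Mean of the best match in items2 for each item in items1, ignoring missing values."""
    best_scores = []
    for a in items1:
        candidates = [s for s in (sim(a, b) for b in items2) if s is not None]
        if candidates:
            best_scores.append(max(candidates))
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

forward = directed_score(["cat"], ["like", "dog"])    # best match for "cat" is "dog"
backward = directed_score(["like", "dog"], ["cat"])   # mean of 0.1 and 0.2
print(forward, backward, (forward + backward) / 2)
```

Here the forward score is 0.2 while the backward score is roughly 0.15, so the two directions disagree and the symmetric score is their average.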

### Find the symmetrical similarity between doc1 and doc2

In [None]:
def document_path_similarity(doc1, doc2):
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

In [None]:
doc1 = 'This is a function to test document_path_similarity.'
doc2 = 'Use this function to see if your code in doc_to_synsets and similarity_score is correct!'
document_path_similarity(doc1, doc2)

In [None]:
from nltk.book import *

In [None]:
document_path_similarity(' '.join(sent3), ' '.join(sent3))

### Document similarity - paraphrasing
**`paraphrases`** is a DataFrame which contains the columns `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [None]:
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('data/paraphrases.csv')
paraphrases.head()

### Most similar documents
Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.

In [None]:
def most_similar_docs(df_):
    doc_sim_scores = pd.DataFrame([(D1, D2, document_path_similarity(D1, D2)) 
                  for D1, D2 in zip(df_.loc[:, 'D1'], df_.loc[:, 'D2'])], columns=['D1', 'D2','score'])
    max_idx = doc_sim_scores['score'].idxmax()  # idxmax returns the index label of the highest score
    max_instance = doc_sim_scores.loc[max_idx]  # label-based lookup, matching what idxmax returns
    return tuple(max_instance)

In [None]:
most_similar_docs(paraphrases)

### Label accuracy
Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`.  
Let the classifier rule be:
 - if the score is greater than 0.75, the label is 1 (paraphrase)
 - otherwise the label is 0 (not paraphrase)

Report the accuracy of the classifier using scikit-learn's `accuracy_score`.

In [None]:
def label_accuracy(df_, threshold=0.75):
    doc_sim_scores = pd.DataFrame([(D1, D2, document_path_similarity(D1, D2)) 
                  for D1, D2 in zip(df_.loc[:, 'D1'], df_.loc[:, 'D2'])], columns=['D1', 'D2','score'])
    doc_sim_scores['label'] = (doc_sim_scores['score'] > threshold) *1
    return doc_sim_scores

In [None]:
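df_paraphrase = label_accuracy(paraphrases, threshold=0.75)
# Accuracy of the thresholded labels against the gold `Quality` column
print(accuracy_score(paraphrases['Quality'], df_paraphrase['label']))
df_paraphrase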

In [None]:
df_paraphrase.describe()
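
To make the classifier rule concrete, here is a small sketch of thresholding followed by an accuracy computation in plain Python. The scores and gold labels are made up for illustration and `threshold_accuracy` is a hypothetical helper. scikit-learn's `accuracy_score(y_true, y_pred)` reports the same fraction of matching labels.

```python
# Hypothetical similarity scores and gold labels, for illustration only.
scores = [0.91, 0.80, 0.62, 0.77, 0.40]
gold = [1, 1, 1, 0, 0]

def threshold_accuracy(scores, gold, threshold=0.75):
    """Label a pair 1 when its score exceeds the threshold, then return the fraction correct."""
    predicted = [int(s > threshold) for s in scores]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

print(threshold_accuracy(scores, gold))        # 0.6
print(threshold_accuracy(scores, gold, 0.6))   # 0.8
```

With these toy numbers, lowering the threshold from 0.75 to 0.6 recovers one true paraphrase and raises the accuracy, which shows how sensitive the rule can be to the cut-off.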

## Part 2 - Topic Modelling

For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using the `gensim.models.ldamodel.LdaModel` constructor to estimate the LDA model parameters on the corpus, and save the result to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, with `passes=25` and `random_state=34`.

See the Gensim `LdaModel` documentation: https://radimrehurek.com/gensim/models/ldamodel.html

In [None]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

### Import data

In [None]:
# !pwd
# !ls data
# !head -10 data/newsgroups.dms

In [None]:
# Load the list of documents
with open('data/newsgroups.dms', 'rb') as f:
    newsgroup_data = pickle.load(f)

### Select and clean tokens
Use `CountVectorizer` to find tokens of at least three characters
 - remove stop_words
 - remove tokens that don't appear in at least 20 documents
 - remove tokens that appear in more than 20% of the documents

In [None]:
vect = CountVectorizer(min_df=20, 
                       max_df=0.2, 
                       stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
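
The `token_pattern` above keeps only tokens made of three or more word characters. The sketch below applies the same regular expression with the standard `re` module to a made-up sentence. The sample text is an assumption, and the explicit `.lower()` stands in for the lowercasing that `CountVectorizer` performs by default.

```python
import re

# Same pattern as the token_pattern passed to CountVectorizer above.
TOKEN_PATTERN = r'(?u)\b\w\w\w+\b'

sample = "It is an LDA demo: cats, dogs and 42 owls."  # hypothetical text
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)  # ['lda', 'demo', 'cats', 'dogs', 'and', 'owls']
```

Short tokens such as "it", "is", "an" and "42" are dropped by the pattern alone. A word like "and" survives the pattern and would only be removed later by the stop-word filter.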

### Fit and transform data (create sparse matrix)

In [None]:
X = vect.fit_transform(newsgroup_data)

### Convert sparse matrix to gensim corpus.

In [None]:
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

### Mapping from word IDs to words (To be used in LdaModel's id2word parameter)

In [None]:
id_map = {v:k for k, v in vect.vocabulary_.items()}
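
A gensim corpus is a sequence of documents, each represented as a list of `(word_id, count)` pairs, and `id_map` turns those ids back into words. The following pure-Python sketch mimics both steps on made-up data without calling gensim. The vocabulary and counts are assumptions for illustration, shaped like `vect.vocabulary_` and a dense version of `X`.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_: word -> column index.
vocabulary = {"orbit": 0, "pluto": 1, "engine": 2}

# Hypothetical document-term counts: one row per document, one column per word.
counts = [
    [2, 1, 0],
    [0, 0, 3],
]

# Invert the vocabulary so that column indices map back to words.
id_map = {idx: word for word, idx in vocabulary.items()}

# Keep only the non-zero entries of each row as (word_id, count) pairs.
bow_corpus = [[(j, c) for j, c in enumerate(row) if c] for row in counts]

print(id_map)       # {0: 'orbit', 1: 'pluto', 2: 'engine'}
print(bow_corpus)   # [[(0, 2), (1, 1)], [(2, 3)]]
```

Passing `documents_columns=False` to `Sparse2Corpus` corresponds to the layout used here, where each row of the matrix is one document.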

### LDA model

In [None]:
# random_state=34 as specified in the task description above
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=10, passes=25,
                                           id2word=id_map, random_state=34)
print(ldamodel)
print(ldamodel.print_topics(num_topics=4, num_words=5))

### Put together

In [None]:
def lda_model(doc, min_df=20, max_df=0.2, stop_words='english', token_pattern='(?u)\\b\\w\\w\\w+\\b', n_topics=10, n_words=10, passes=25):
    vect = CountVectorizer(min_df=min_df, 
                       max_df=max_df, 
                       stop_words=stop_words, 
                       token_pattern=token_pattern)
    X = vect.fit_transform(doc)
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    id2word_dict = {v:k for k, v in vect.vocabulary_.items()}
    return gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=n_topics, passes=passes, id2word=id2word_dict, random_state=0)

In [None]:
lda = lda_model(newsgroup_data)

### LDA Topics
 - find a list of the N topics and the most significant M words in each topic.

In [None]:
N, M = 10, 5
lda.show_topics(num_topics=N, num_words=M)

### Topic distribution
 - find the topic distribution for a new document
 - use `vect.transform` on the new doc, and `Sparse2Corpus` to convert the sparse matrix to a gensim corpus

In [None]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [None]:
def lda_topic_dist(doc, ldamodel):
    X = vect.transform(doc)
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    return ldamodel[corpus][0]

In [None]:
lda_topic_dist(new_doc, ldamodel)
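
The result of `ldamodel[corpus][0]` is a list of `(topic_id, probability)` pairs, and topics with a very small probability may be left out. The sketch below shows one way to read such a list. The distribution is made up, and `dominant_topic` and `as_dense` are hypothetical helpers rather than part of gensim.

```python
# Hypothetical output in the shape of ldamodel[corpus][0]; the numbers are made up.
topic_dist = [(1, 0.05), (4, 0.62), (7, 0.33)]

def dominant_topic(dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(dist, key=lambda pair: pair[1])

def as_dense(dist, n_topics=10):
    """Expand the sparse list into one probability per topic (0.0 when omitted)."""
    dense = [0.0] * n_topics
    for topic_id, prob in dist:
        dense[topic_id] = prob
    return dense

print(dominant_topic(topic_dist))   # (4, 0.62)
print(as_dense(topic_dist))
```

The dense form is convenient when distributions of several documents need to be compared column by column.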

### Topic names
 - assign topic names to the topics you found
 - create a new topic name if needed

In [None]:
topics_names = 'Health,Science,Automobiles,Politics,Government,Travel,Computers & IT,Sports,Business,Society & Lifestyle,Religion,Education'.split(',')
topics_names = np.array(topics_names)
topics_names[::-1][:3]

In [None]:
topics = [(topic_id, word[0], word[1]) for topic_id, topic_words in lda.show_topics(num_topics=10, num_words=10, formatted=False) 
                     for word in topic_words]
df_topic = pd.DataFrame(topics, columns=['topic_id', 'words', 'probability'])
df_topic.head()

In [None]:
df_ = pd.DataFrame()
df_topic['words'] += ' '
df_['excerpt'] = df_topic.groupby('topic_id')['words'].sum().values
df_

In [None]:
def topic_name_max(topic_words, topics_names):
    idx_max = np.argmax(np.array([document_path_similarity(topic_words, topic_name) for topic_name in topics_names]))
    return topics_names[idx_max]

In [None]:
# `topic_words` is not defined at this level; iterate over the excerpts built above
df_['topic_max'] = [topic_name_max(topic_word, topics_names) for topic_word in df_['excerpt']]

In [None]:
def topic_name_topn(topic_words, topics_names, N=3):
    idx_max = np.argsort(np.array([document_path_similarity(topic_words, topic_name) for topic_name in topics_names]))
    return topics_names[idx_max[::-1][:N]]

In [None]:
df_['topics_topn'] = [topic_name_topn(topic_word, topics_names) for topic_word in df_['excerpt']]

In [None]:
def topic_name_sort(topic_words, topics_names, N=3):
    idx_max = np.argsort(np.array([document_path_similarity(topic_words, topic_name) for topic_name in topics_names]))
    return topics_names[idx_max[::-1]]

In [None]:
df_['topics_sort'] = [topic_name_sort(topic_word, topics_names) for topic_word in df_['excerpt']]

In [None]:
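def topic_name_mean(topic_words, topics_names, N=3):
    """Return the topic name that wins most often when each word is matched on its own."""
    from collections import Counter
    import warnings
    idx_max = []
    # Mute the np.mean "empty slice" warning and raise on floating-point errors,
    # both scoped to this block so NumPy's global error state is left untouched.
    with warnings.catch_warnings(), np.errstate(all='raise'):
        warnings.simplefilter("ignore", category=RuntimeWarning)
        for tw in topic_words.split():
            try:
                scores = [document_path_similarity(tw, topic_name) for topic_name in topics_names]
            except Exception:
                continue  # no similarity could be computed for this word
            if not max(scores) > 0:
                continue  # the word matched none of the topic names
            idx = np.argmax(np.array(scores))
            idx_max.append(idx)
            print('word: {0:20} -> topic: {2}({1})'.format(tw, idx, topics_names[idx]))
    if not idx_max:
        return None  # no word produced a usable vote
    most_common = topics_names[Counter(idx_max).most_common(1)[0][0]]
    print('most common topic: {}\n'.format(most_common))
    return most_common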
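
Despite its name, `topic_name_mean` picks the most frequent per-word winner, which is a majority vote rather than a mean. A minimal sketch of that voting step with `collections.Counter` follows. The votes are made up and `majority` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical per-word winners: the topic name each word matched best.
votes = ["Science", "Travel", "Science", "Health", "Science"]

def majority(votes):
    """Return the most frequent vote, or None when there are no votes."""
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

print(majority(votes))  # Science
print(majority([]))     # None
```

Guarding the empty case matters because `most_common(1)` returns an empty list when no word produced a vote.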

In [None]:
df_['topics_mean'] = [topic_name_mean(topic_word, topics_names) for topic_word in df_['excerpt']]

In [None]:
df_

In [None]:
# education = wn.synset('education.n.01')
# sports = wn.synset('sports.n.01')
# education.hypernyms(), sports.hypernyms()

In [None]:
for tn in topics_names:
    # The format string needs a third placeholder for the full synset list to be printed
    print('topic: {:20} -> {} | {}'.format(tn, doc_to_synsets(tn), wn.synsets(tn)))

In [None]:
# Train update
# lda.update(other_corpus)