# Document Similarity & Topic Modelling

## Part 1 - Document Similarity

For the first part of this project, I wrote the functions `doc_to_synsets` and `similarity_score`, which `document_path_similarity` uses to find the path similarity between two documents.

The following functions are used in this project, but I did not write them:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. 
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.

I created these functions for this project:
* **`doc_to_synsets:`** returns a list of the synsets in a document. This function first tokenizes the document and tags each token with a part of speech using `nltk.word_tokenize` and `nltk.pos_tag`. It then finds each token's corresponding synset using `wn.synsets(token, wordnet_tag)`. The first matching synset is used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, it finds the synset in s2 with the largest similarity value. All of the largest similarity values are summed together and normalized by dividing by the number of largest similarity values found. Missing values are ignored.
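
The scoring rule described above can be sketched without WordNet. This is a minimal illustration, not the project code: `directed_score`, `toy_sim`, and the `TOY` lookup table are hypothetical names, and the table stands in for `Synset.path_similarity`, which can return `None` for some pairs.

```python
def directed_score(s1, s2, sim):
    """Average, over the items of s1, of the best similarity to any item of s2."""
    best = []
    for x in s1:
        scores = [sim(x, y) for y in s2]
        scores = [s for s in scores if s is not None]  # ignore missing values
        if scores:
            best.append(max(scores))
    return sum(best) / len(best)

# Hypothetical similarity table standing in for Synset.path_similarity
TOY = {
    ("fish", "trout"): 0.5,
    ("fish", "pal"): 0.1,
    ("friend", "pal"): 1.0,
}

def toy_sim(a, b):
    """Return the toy similarity, or None when the pair has no value."""
    return TOY.get((a, b))

# "fish" contributes 0.5, "be" has no values and is skipped, "friend" contributes 1.0
score = directed_score(["fish", "be", "friend"], ["trout", "pal"], toy_sim)
print(score)  # (0.5 + 1.0) / 2 = 0.75
```

Note that the divisor is the number of items that produced a best match (2 here), not the length of the first list (3).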

In [64]:
%%capture
import numpy as np
import nltk
nltk.download('punkt')
from nltk.corpus import wordnet as wn
import pandas as pd
nltk.data.path.append("assets/")

def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None
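
A quick check of the mapping, as a minimal sketch. The function is repeated here so the snippet runs on its own (using `dict.get` in place of the `try`/`except`, which gives the same result for non-empty tags), and the sample tags are Penn Treebank tags of the kind `nltk.pos_tag` returns.

```python
def convert_tag(tag):
    """Map a Penn Treebank tag to a WordNet POS tag, or None if there is no match."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[0])

# Plural noun, verb, adjective, adverb, determiner
sample_tags = ['NNS', 'VBP', 'JJ', 'RB', 'DT']
converted = [convert_tag(t) for t in sample_tags]
print(converted)  # ['n', 'v', 'a', 'r', None]
```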

In [65]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """

    tokens = nltk.word_tokenize(doc)
    # Tag each token with its part of speech
    tagged = nltk.pos_tag(tokens)
    synsets = []
    for word, tag in tagged:
        # Look up synsets using the WordNet equivalent of the tag
        matches = wn.synsets(word, convert_tag(tag))
        # Keep only the first synset; skip tokens with no match
        if matches:
            synsets.append(matches[0])
    return synsets



def similarity_score(s1, s2):
    """
    Returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2).
    For each synset in s1, it finds the synset in s2 with the largest similarity value.
    All of the largest similarity values are summed together and normalized by dividing by the number of largest
    similarity values found. Missing values are ignored.
    """
    max_sim = []
    for x in s1:
        # Similarities from x to every synset in s2, ignoring missing values
        sim = [x.path_similarity(y) for y in s2]
        sim = [value for value in sim if value is not None]
        if sim:
            max_sim.append(max(sim))
    return sum(max_sim) / len(max_sim)
doc_to_synsets('Fish are friends.')

[Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]

In [66]:
def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
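
Written as a formula, the computation looks roughly like this. The notation is mine rather than anything from NLTK: $S_1$ and $S_2$ are the synset lists of the two documents, $\mathrm{sim}$ is the path similarity, and $M$ is the subset of $S_1$ with at least one defined similarity to a synset in $S_2$ (the maximum is taken over defined values only).

```latex
\mathrm{score}(S_1, S_2) = \frac{1}{|M|} \sum_{x \in M} \max_{y \in S_2} \mathrm{sim}(x, y),
\qquad
\mathrm{doc\_sim}(D_1, D_2) = \frac{\mathrm{score}(S_1, S_2) + \mathrm{score}(S_2, S_1)}{2}
```

Averaging the two directed scores is what makes the result symmetrical, since $\mathrm{score}(S_1, S_2)$ and $\mathrm{score}(S_2, S_1)$ generally differ.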


`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable that shows whether the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [67]:
paraphrases = pd.read_csv(r"\paraphrases.csv")
paraphrases.head()

| Unnamed: 0 | Quality | D1 | D2 |
| --- | --- | --- | --- |
| 0 | 1 | Ms Stewart, the chief executive, was not expec... | Ms Stewart, 61, its chief executive officer an... |
| 1 | 1 | After more than two years' detention under the... | After more than two years in detention by the ... |
| 2 | 1 | "It still remains to be seen whether the reven... | "It remains to be seen whether the revenue rec... |
| 3 | 0 | And it's going to be a wild ride," said Allan ... | Now the rest is just mechanical," said Allan H... |
| 4 | 1 | The cards are issued by Mexico's consulates to... | The card is issued by Mexico's consulates to i... |


### most_similar_docs

I used `document_path_similarity` to find the pair of documents in `paraphrases` with the maximum similarity score.

In [68]:
def most_similar_docs():
    # Score every pair, then sort so the most similar pair comes first
    paraphrases['similarity_score'] = paraphrases.apply(lambda x: document_path_similarity(x['D1'], x['D2']), axis=1)
    sorted_paraphrases = paraphrases.sort_values('similarity_score', ascending=False)
    d1 = sorted_paraphrases['D1'].iloc[0]
    d2 = sorted_paraphrases['D2'].iloc[0]
    # Named `score` to avoid shadowing the similarity_score function
    score = sorted_paraphrases['similarity_score'].iloc[0]
    return d1, d2, score
most_similar_docs()

('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
 '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
 0.9590643274853801)

### label_accuracy

I labeled the twenty pairs of documents by computing the similarity of each pair with `document_path_similarity`. The classifier rule is: if the score is greater than 0.75, the label is paraphrase (1); otherwise the label is not paraphrase (0). The accuracy of the classifier is reported using scikit-learn's `accuracy_score`.
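
The thresholding rule and the accuracy calculation can be sketched in plain Python. The scores and labels below are made up for illustration, and `label_pairs` and `accuracy` are hypothetical helpers rather than part of the project or of scikit-learn.

```python
def label_pairs(scores, threshold=0.75):
    """Label a pair 1 (paraphrase) only when its score is strictly above the threshold."""
    return [1 if s > threshold else 0 for s in scores]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

scores = [0.90, 0.75, 0.60, 0.80]   # hypothetical similarity scores
quality = [1, 1, 0, 0]              # hypothetical true labels

predicted = label_pairs(scores)
acc = accuracy(quality, predicted)
print(predicted, acc)  # [1, 0, 0, 1] 0.5
```

A score of exactly 0.75 is labeled 0 because the comparison is strict.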

In [69]:
def label_accuracy():
    from sklearn.metrics import accuracy_score
    paraphrases['similarity_score'] = paraphrases.apply(lambda x:document_path_similarity(x['D1'],x['D2']), axis=1)
    paraphrases['classifier'] = np.where(paraphrases['similarity_score'] >0.75, 1, 0)
    
    return accuracy_score(paraphrases['Quality'], paraphrases['classifier'])

label_accuracy()

0.7

## Part 2 - Topic Modelling

For the second part of this project, I used Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. I used the `gensim.models.ldamodel.LdaModel` constructor to estimate the LDA model parameters on the corpus and saved the result to the variable `ldamodel`. The model extracts 10 topics from `corpus` and `id_map`, with `passes=25` and `random_state=34`.

In [70]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open(r"\newsgroups",
          'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of at least three characters, remove stop words,
# remove tokens that appear in fewer than 20 documents,
# and remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
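
To see what the `token_pattern` above keeps, the same regular expression can be tried directly with `re`. This is only a sketch of the tokenizing step on a made-up sentence: the text is lowercased first, as `CountVectorizer` does by default, while stop-word removal and the `min_df`/`max_df` filtering are not reproduced here.

```python
import re

# Same pattern as passed to CountVectorizer: runs of three or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w\w+\b"

text = "An IBM PC is on my desk, ok?"   # made-up example sentence
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)  # ['ibm', 'desk']
```

Two-character tokens such as "pc" and "ok" are dropped before any stop-word or document-frequency filtering takes place.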


In [71]:
# Used the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id_map, random_state=34, 
                                           passes=25, num_topics = 10)

### lda_topics

I used `ldamodel` to list the 10 topics and the 10 most significant words in each topic. The result is a list of 10 tuples, where each tuple has the form:

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.013*"information"')`
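
If the weights are needed as numbers, the topic string can be split back into `(word, weight)` pairs. The sketch below parses the example tuple above with a regular expression; `parse_topic` is a hypothetical helper, not a gensim function.

```python
import re

topic = (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + '
            '0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + '
            '0.014*"center" + 0.013*"information"')

def parse_topic(topic_string):
    """Turn '0.068*"space" + ...' into [('space', 0.068), ...]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_id, topic_string = topic
words = parse_topic(topic_string)
print(topic_id, words[:3])  # 9 [('space', 0.068), ('nasa', 0.036), ('science', 0.021)]
```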

In [72]:
def lda_topics():
    return ldamodel.print_topics()
lda_topics()

[(0,
  '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
 (1,
  '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
 (2,
  '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
 (3,
  '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
 (4,
  '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
 (5,
  '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
 (6,
  '0.0

### topic_distribution

I found the topic distribution for the document `new_doc`.

*This function returns a list of tuples, where each tuple is `(#topic, probability)`*

In [73]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [74]:
def topic_distribution():
    # Vectorize new_doc with the fitted vocabulary, then convert it to a gensim corpus
    Y = vect.transform(new_doc)
    corpus2 = gensim.matutils.Sparse2Corpus(Y, documents_columns=False)
    # new_doc holds a single document, so take the first (and only) distribution
    return list(ldamodel.get_document_topics(corpus2))[0]
topic_distribution()

[(0, 0.020003108),
 (1, 0.020003324),
 (2, 0.020001281),
 (3, 0.49674895),
 (4, 0.020004038),
 (5, 0.020004129),
 (6, 0.020002972),
 (7, 0.020002645),
 (8, 0.020003129),
 (9, 0.34322643)]
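
To read off the dominant topics, the list of `(topic, probability)` tuples can be sorted by probability. A minimal sketch using the output above as literal data:

```python
distribution = [(0, 0.020003108), (1, 0.020003324), (2, 0.020001281),
                (3, 0.49674895), (4, 0.020004038), (5, 0.020004129),
                (6, 0.020002972), (7, 0.020002645), (8, 0.020003129),
                (9, 0.34322643)]

# Sort topics from most to least probable
ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
top_two = [topic_id for topic_id, _ in ranked[:2]]
print(top_two)  # [3, 9]

# The probabilities form a distribution, so they sum to roughly 1
total = sum(prob for _, prob in distribution)
```

Topics 3 and 9 together account for about 84% of the probability mass for `new_doc`. The remaining eight topics each sit at roughly 0.02.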

### topic_names

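I assigned each topic a name from the list of topics given below. If none of these names matched a topic well, I created a new 1-3 word "title" for the topic. The function below picks, for each topic, the label with the highest `document_path_similarity` to the topic's string representation.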

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

In [75]:
def topic_names():
    labels = ['Health', 'Automobiles', 'Government', 'Travel', 'Computers & IT', 'Sports', 'Business', 'Society & Lifestyle', 'Region', 'Education']

    topics = lda_topics()

    results = []
    for x in topics:
        sim = []
        for y in labels:
            sim.append(document_path_similarity(str(x),str(y)))
        best = sorted(zip(sim, labels))[-1][1]
        results.append(best)
    return results
topic_names()

['Society & Lifestyle',
 'Education',
 'Education',
 'Society & Lifestyle',
 'Automobiles',
 'Education',
 'Education',
 'Society & Lifestyle',
 'Education',
 'Society & Lifestyle']
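
The selection step inside `topic_names` amounts to picking the label paired with the highest score. A small sketch with made-up scores, where `best_label` is a hypothetical helper that uses `max` on the zipped pairs and gives the same result as `sorted(zip(sim, labels))[-1][1]`:

```python
def best_label(scores, labels):
    """Return the label paired with the highest score."""
    best_score, label = max(zip(scores, labels))
    return label

labels = ['Health', 'Automobiles', 'Sports']   # subset of the labels, for illustration
scores = [0.21, 0.64, 0.40]                    # hypothetical similarity scores

chosen = best_label(scores, labels)
print(chosen)  # Automobiles
```

Because the pairs are compared as tuples, a tie on the score falls back to comparing the labels alphabetically.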