# Which NIPS papers are similar? *[a simple sklearn knn & tf-idf exercise]*

## Goal: Find similar papers based on their abstracts and full text
### Steps:
1. Find the important keywords of each document using tf-idf
2. Fit a k-nearest-neighbors (knn) model on the tf-idf matrix to find similar papers
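
The two steps can be sketched end to end on a toy corpus. This is only a minimal sketch under stated assumptions, not the notebook's actual pipeline: the four `toy_docs` strings are made up for illustration, and the vectorizer runs with scikit-learn's default parameters rather than the ones chosen later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus: two "classification" texts and two "sampling" texts.
toy_docs = [
    "convex loss label noise classification",
    "convex loss hinge classification",
    "monte carlo sampling bayesian posterior",
    "langevin sampling bayesian posterior",
]

# Step 1: build one tf-idf row per document.
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Step 2: for each document, find its 2 nearest rows.
toy_knn = NearestNeighbors(n_neighbors=2).fit(toy_tfidf)
toy_distances, toy_indices = toy_knn.kneighbors(toy_tfidf)

# Here column 0 is the document itself (distance ~0) and
# column 1 is its closest other document.
for doc_pos, neighbor_pos in enumerate(toy_indices[:, 1]):
    print("doc %d -> most similar: doc %d" % (doc_pos, neighbor_pos))
```

Because each document is also part of the fitted data, asking for `n_neighbors=2` yields one genuine neighbor per document.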

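### Cleaning:
* Clean the text of \n, \x escapes and similar debris by:
    1. Replacing \n and \x0c with a space
    2. Converting to unicode, dropping bytes that cannot be decoded
    3. Replacing every run of non-letter characters with a single space
    4. Making everything lower case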

In [1]:
import pandas as pd
import sklearn 
import numpy as np
import nltk
nltk.download('punkt')
import re
import time

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/Amirhossein/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Let's discover the data:

In [2]:
# load the data into pandas DataFrames:
papers_data = pd.read_csv('Data/output/Papers.csv')
authors_data = pd.read_csv('Data/output/Authors.csv')
authorId_data = pd.read_csv('Data/output/PaperAuthors.csv')

In [3]:
papers_data.head(5)

Unnamed: 0,Id,Title,EventType,PdfName,Abstract,PaperText
0,5677,Double or Nothing: Multiplicative Incentive Me...,Poster,5677-double-or-nothing-multiplicative-incentiv...,Crowdsourcing has gained immense popularity in...,Double or Nothing: Multiplicative\nIncentive M...
1,5941,Learning with Symmetric Label Noise: The Impor...,Spotlight,5941-learning-with-symmetric-label-noise-the-i...,Convex potential minimisation is the de facto ...,Learning with Symmetric Label Noise: The\nImpo...
2,6019,Algorithmic Stability and Uniform Generalization,Poster,6019-algorithmic-stability-and-uniform-general...,One of the central questions in statistical le...,Algorithmic Stability and Uniform Generalizati...
3,6035,Adaptive Low-Complexity Sequential Inference f...,Poster,6035-adaptive-low-complexity-sequential-infere...,We develop a sequential low-complexity inferen...,Adaptive Low-Complexity Sequential Inference f...
4,5978,Covariance-Controlled Adaptive Langevin Thermo...,Poster,5978-covariance-controlled-adaptive-langevin-t...,Monte Carlo sampling for Bayesian posterior in...,Covariance-Controlled Adaptive Langevin\nTherm...


In [4]:
authors_data.head()

Unnamed: 0,Id,Name
0,4113,Constantine Caramanis
1,4828,Richard L. Lewis
2,5506,Ryan Kiros
3,7331,Kfir Levy
4,8429,Wei Cao


In [5]:
authorId_data.head(5)

Unnamed: 0,Id,PaperId,AuthorId
0,1,5677,7956
1,2,5677,2649
2,3,5941,8299
3,4,5941,8300
4,5,5941,575


### Define two functions to go from index to id and vice versa on papers_data:
1. A function that takes a paper_id and papers_data as input and returns the paper's index
2. A function that takes an index and papers_data as input and returns the paper's paper_id

In [6]:
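def given_paperID_give_index(paper_id, paper_data):
    # index label of the first row whose Id matches paper_id
    return paper_data[paper_data['Id']==paper_id].index[0]

def given_index_give_PaperID(index, paper_data):
    # Id of the paper stored at row position `index`
    return paper_data.iloc[index]['Id']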
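
One caveat worth illustrating: `.index[0]` returns an index *label*, while `.iloc` expects a row *position*. The two coincide for `papers_data` because `read_csv` assigns a default 0..n-1 index, but not after filtering or re-ordering. The sketch below uses a made-up three-row frame, and `position_of_paper` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def position_of_paper(paper_id, paper_data):
    """Hypothetical helper: row position (usable with .iloc) of paper_id."""
    matches = np.flatnonzero(paper_data['Id'].values == paper_id)
    if len(matches) == 0:
        raise KeyError("paper id %r not found" % paper_id)
    return int(matches[0])

# Made-up miniature version of papers_data.
toy_papers = pd.DataFrame({'Id': [5677, 5941, 6019],
                           'Title': ['A', 'B', 'C']})

# Dropping the first row keeps the old labels (1, 2) but shifts positions (0, 1).
toy_subset = toy_papers[toy_papers['Id'] != 5677]

label = toy_subset[toy_subset['Id'] == 5941].index[0]   # index label
position = position_of_paper(5941, toy_subset)          # row position

print("label: %d, position: %d" % (label, position))
print(toy_subset.iloc[position]['Id'])
```

As long as the frame keeps its default index, the notebook's two functions and this helper agree.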

### Let's look at the second paper as an example before cleaning:

In [7]:
Ex_paper_id = 5941
Ex_paper_index = given_paperID_give_index(Ex_paper_id, papers_data)
papers_data.iloc[Ex_paper_index]['PaperText'][0:1000]

'Learning with Symmetric Label Noise: The\nImportance of Being Unhinged\n\nBrendan van Rooyen\xe2\x88\x97,\xe2\x80\xa0\n\xe2\x88\x97\n\nAditya Krishna Menon\xe2\x80\xa0,\xe2\x88\x97\n\nThe Australian National University\n\n\xe2\x80\xa0\n\nRobert C. Williamson\xe2\x88\x97,\xe2\x80\xa0\n\nNational ICT Australia\n\n{ brendan.vanrooyen, aditya.menon, bob.williamson }@nicta.com.au\n\nAbstract\nConvex potential minimisation is the de facto approach to binary classification.\nHowever, Long and Servedio [2010] proved that under symmetric label noise\n(SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly\nshows that convex losses are not SLN-robust. In this paper, we propose a convex,\nclassification-calibrated loss and prove that it is SLN-robust. The loss avoids the\nLong and Servedio [2010] result by virtue of being negatively unbounded. The\nloss is a modification of the hinge loss, wh

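### Clean Abstract and PaperText:
* Clean the text of \n, \x escapes and similar debris by:
    1. Replacing \n and \x0c with a space
    2. Converting to unicode, dropping bytes that cannot be decoded
    3. Replacing every run of non-letter characters with a single space
    4. Making everything lower case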

In [8]:
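def clean_text(text):
    # replace form feeds and newlines with spaces
    list_of_cleaning_signs = ['\x0c', '\n']
    for sign in list_of_cleaning_signs:
        text = text.replace(sign, ' ')
    # Python 2 byte strings: decode and drop non-ASCII bytes (a Python 3 str is already unicode)
    if isinstance(text, bytes):
        text = text.decode('ascii', errors='ignore')
    # collapse every run of non-letter characters into a single space
    cleaned = re.sub('[^a-zA-Z]+', ' ', text)
    return cleaned.lower()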
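
For readers on Python 3, the same cleaning steps can be sketched without the explicit unicode conversion, since `str` is already unicode there. `clean_text_py3` is a hypothetical name, and the sample string is a shortened, hand-typed variant of the example paper's opening lines. One difference to be aware of: in this sketch non-ASCII characters become spaces instead of being dropped.

```python
import re

def clean_text_py3(text):
    """Hypothetical Python 3 variant of the cleaning steps."""
    # 1. Replace form feeds and newlines with spaces.
    text = text.replace('\x0c', ' ').replace('\n', ' ')
    # 2. Collapse every run of non-letter characters into one space.
    text = re.sub('[^a-zA-Z]+', ' ', text)
    # 3. Lower-case everything.
    return text.lower()

raw = ('Learning with Symmetric Label Noise: The\n'
       'Importance of Being Unhinged\n\nBrendan van Rooyen\u2217,\u2020')
cleaned = clean_text_py3(raw)
print(cleaned.strip())
```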

In [9]:
papers_data['PaperText_clean'] = papers_data['PaperText'].apply(clean_text)
papers_data['Abstract_clean'] = papers_data['Abstract'].apply(clean_text)

### Let's look at the example paper after cleaning:

In [10]:
# Example after cleaning (same paper as above)
papers_data.iloc[Ex_paper_index]['PaperText_clean'][0:1000]

u'learning with symmetric label noise the importance of being unhinged brendan van rooyen aditya krishna menon the australian national university robert c williamson national ict australia brendan vanrooyen aditya menon bob williamson nicta com au abstract convex potential minimisation is the de facto approach to binary classification however long and servedio proved that under symmetric label noise sln minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing this ostensibly shows that convex losses are not sln robust in this paper we propose a convex classification calibrated loss and prove that it is sln robust the loss avoids the long and servedio result by virtue of being negatively unbounded the loss is a modification of the hinge loss where one does not clamp at zero hence we call it the unhinged loss we show that the optimal unhinged solution is equivalent to that of a strongly regularised svm and is

### Build tf-idf matrices based on Abstract & PaperText:
* Tokenize and stem the text first [Thanks to the great post by Brandon Rose: http://brandonrose.org/clustering]

In [None]:
# here Brandon defines a tokenizer and stemmer which returns the list of stems in the text that it is passed
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Producing tf_idf matrix separately based on Abstract
tfidf_vectorizer_Abstract = TfidfVectorizer(max_df=0.95, max_features=200000,
                                 min_df=0.05, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
%time tfidf_matrix_Abstract = tfidf_vectorizer_Abstract.fit_transform(papers_data['Abstract_clean'])

# Producing tf_idf matrix separately based on PaperText
tfidf_vectorizer_PaperText = TfidfVectorizer(max_df=0.9, max_features=200000,
                                 min_df=0.1, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
%time tfidf_matrix_PaperText = tfidf_vectorizer_PaperText.fit_transform(papers_data['PaperText_clean'])
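
To make the `min_df` / `max_df` thresholds concrete, here is a small sketch on a made-up four-document corpus. The corpus and the 0.5 / 0.9 thresholds are illustrative assumptions, not the notebook's values. Given as floats, both thresholds are fractions of the documents: a term is dropped if it appears in fewer than `min_df` or in more than `max_df` of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "convex" is in every document, "loss" and "sampling"
# are each in half of them, and every other word appears only once.
toy_docs = [
    "convex loss label noise",
    "convex loss svm",
    "convex sampling langevin",
    "convex sampling bayesian",
]

# Keep terms found in at least 50% and at most 90% of the documents.
toy_vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.9)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column in the matrix.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Only `loss` and `sampling` survive: `convex` is too common to discriminate between papers, and the remaining words are too rare to link any two of them.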

In [None]:
# on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
terms_Abstract = tfidf_vectorizer_Abstract.get_feature_names()
terms_PaperText = tfidf_vectorizer_PaperText.get_feature_names()

### Let's create a function that takes a paper_id and a tfidf_matrix and returns the top-n keywords:
* [Thanks to the great post by Thomas Buhrmann: https://buhrmann.github.io/tfidf-analysis.html]

In [None]:
def top_tfidf_feats(row, terms, top_n=25):
    # positions of the top_n largest tf-idf values, largest first
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(terms[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df['feature']

def given_paperID_give_keywords(paper_data, tfidfMatrix, terms, paper_id, top_n=20):
    row_id = given_paperID_give_index(paper_id, paper_data)
    # turn the paper's sparse tf-idf row into a dense 1-D array
    row = np.squeeze(tfidfMatrix[row_id].toarray())
    return top_tfidf_feats(row, terms, top_n)
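
The ranking step relies on `np.argsort`, which sorts ascending, so the result is reversed before taking the first `top_n` entries. A tiny worked example, with made-up terms and weights, shows the mechanics:

```python
import numpy as np

# Hypothetical vocabulary and one document's tf-idf weights.
toy_terms = ['loss', 'noise', 'svm', 'convex']
toy_row = np.array([0.1, 0.7, 0.0, 0.4])

# argsort gives positions from smallest to largest weight: [2, 0, 3, 1].
# Reversing it puts the largest weight first: [1, 3, 0, 2].
order = np.argsort(toy_row)[::-1]

# Keep the two highest-weighted terms.
top_two = [(toy_terms[i], float(toy_row[i])) for i in order[:2]]
print(top_two)
```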

### Let's check the top-10 keywords of the example paper based on Abstract:
Note: The keywords are shown in stemmed form

In [None]:
paper_id_example = 5941
print("Keywords based on Abstract:")
print(given_paperID_give_keywords(papers_data, tfidf_matrix_Abstract, terms_Abstract, paper_id_example, top_n = 10))

### Build NearestNeighbors models based on Abstract and PaperText:

In [None]:
from sklearn.neighbors import NearestNeighbors
# Based on Abstract
num_neighbors = 4
nbrs_Abstract = NearestNeighbors(n_neighbors=num_neighbors,  algorithm='auto').fit(tfidf_matrix_Abstract)
distances_Abstract, indices_Abstract = nbrs_Abstract.kneighbors(tfidf_matrix_Abstract)
# Based on PaperText
nbrs_PaperText = NearestNeighbors(n_neighbors=num_neighbors,  algorithm='auto').fit(tfidf_matrix_PaperText)
distances_PaperText, indices_PaperText = nbrs_PaperText.kneighbors(tfidf_matrix_PaperText)
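
A note on the distance being used: `NearestNeighbors` defaults to the Minkowski metric with `p=2`, i.e. Euclidean distance, and `TfidfVectorizer` L2-normalizes each row by default. For unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine similarity`, so ranking neighbors by Euclidean distance gives the same order as ranking by cosine similarity. The sketch below checks that identity with NumPy only, on random unit vectors that merely stand in for tf-idf rows.

```python
import numpy as np

# Random non-negative rows, scaled to unit length like default tf-idf rows.
rng = np.random.RandomState(0)
X = rng.rand(6, 5)
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Pairwise cosine similarity: plain dot products, since every row has norm 1.
cosine_sim = X.dot(X.T)

# Pairwise squared Euclidean distance, computed directly from differences.
diff = X[:, None, :] - X[None, :, :]
sq_euclidean = (diff ** 2).sum(axis=2)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
identity_holds = np.allclose(sq_euclidean, 2.0 - 2.0 * cosine_sim)
print(identity_holds)

# Neighbor order for row 0 under each measure (nearest first).
print(np.argsort(sq_euclidean[0]))
print(np.argsort(-cosine_sim[0]))
```

If a vectorizer were built with `norm=None`, this equivalence would no longer hold and passing `metric='cosine'` to `NearestNeighbors` would be the safer choice.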

In [None]:
print("Nbrs of the example paper based on Abstract similarity: %r" % indices_Abstract[Ex_paper_index])
print("Nbrs of the example paper based on PaperText similarity: %r" % indices_PaperText[Ex_paper_index])

### Let's check the abstract of the similar papers found for the example paper mentioned above:
* a) Using model based on Abstract
* b) Using model based on PaperText

In [None]:
Ex_paper_id = 5941
Ex_index = given_paperID_give_index(Ex_paper_id, papers_data)
print("The abstract of the example paper is:\n")
print(papers_data.iloc[indices_Abstract[Ex_index][0]]['Abstract'])
print("The abstracts of the similar papers are:\n")
for i in range(1, len(indices_Abstract[Ex_index])):
    print("Neighbor No. %r has the following abstract:\n" % i)
    print(papers_data.iloc[indices_Abstract[Ex_index][i]]['Abstract'])
    print("\n")

In [None]:
Ex_paper_id = 5941
Ex_index = given_paperID_give_index(Ex_paper_id, papers_data)
print("The abstract of the example paper is:\n")
print(papers_data.iloc[indices_PaperText[Ex_index][0]]['Abstract'])
print("The abstracts of the similar papers are:\n")
for i in range(1, len(indices_PaperText[Ex_index])):
    print("Neighbor No. %r has the following abstract:\n" % i)
    print(papers_data.iloc[indices_PaperText[Ex_index][i]]['Abstract'])
    print("\n")

### Some post-processing functions to help us read authors' names and the titles of their papers:

In [None]:
def given_paperID_give_authours_id(paper_id, author_data, author_id_data):
    # AuthorId values of every author linked to paper_id (author_data is unused here)
    id_author_list = author_id_data[author_id_data['PaperId']==paper_id]['AuthorId']
    return id_author_list

def given_authorID_give_name(author_id, author_data):
    # Name entries (a pandas Series) of the author with the given Id
    author_name = author_data[author_data['Id'] == author_id]['Name']
    return author_name

def given_similar_paperIDs_give_their_titles(sim_papers_list_index, paper_data):
    # expects row positions (as returned by kneighbors), not paper Ids
    titles = []
    for index in sim_papers_list_index:
        titles.append(paper_data.iloc[index]['Title']+'.')
    return titles

In [None]:
Ex_paper_id = 5941
Ex_index = given_paperID_give_index(Ex_paper_id, papers_data)
print("Titles of papers similar to the example paper based on Abstract:\n\n")
for title in given_similar_paperIDs_give_their_titles(indices_Abstract[Ex_index], papers_data):
    print(title)

In [None]:
Ex_paper_id = 5941
Ex_index = given_paperID_give_index(Ex_paper_id, papers_data)
print("Titles of papers similar to the example paper based on PaperText:\n\n")
for title in given_similar_paperIDs_give_their_titles(indices_PaperText[Ex_index], papers_data):
    print(title)

### *** Questions & notes: ***
1. Are these papers really similar? I.e., is there an automated way to evaluate the results?
    * Maybe we can check whether the recommended similar papers cite the same references?
2. Which model is better: Abstract or PaperText? Which papers are recommended by both models? Are these more similar?
3. Try different parameters for generating the tf-idf matrix and/or different algorithms for building the knn model.
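
One possible starting point for the "recommended by both models" question is to intersect the two neighbor lists row by row. The sketch below is an assumption about how that could look, not something this notebook has run: `shared_neighbors` is a hypothetical helper, and the two small arrays are made-up stand-ins for `indices_Abstract` and `indices_PaperText`.

```python
import numpy as np

def shared_neighbors(indices_a, indices_b):
    """Hypothetical helper: neighbors that two knn models agree on, per paper.

    Column 0 is skipped on the assumption that it holds the query paper itself.
    """
    shared = []
    for row_a, row_b in zip(indices_a, indices_b):
        common = set(row_a[1:].tolist()) & set(row_b[1:].tolist())
        shared.append(sorted(common))
    return shared

# Made-up neighbor indices for two papers under each model.
toy_indices_abstract = np.array([[0, 1, 2, 3],
                                 [1, 0, 4, 2]])
toy_indices_papertext = np.array([[0, 2, 5, 1],
                                  [1, 3, 0, 6]])

toy_shared = shared_neighbors(toy_indices_abstract, toy_indices_papertext)
for paper_pos, common in enumerate(toy_shared):
    print("paper %d: neighbors found by both models: %r" % (paper_pos, common))
```

The size of each intersection, averaged over all papers, would give a rough agreement score between the two models.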



### *** References: ***
1. http://brandonrose.org/clustering
2. https://buhrmann.github.io/tfidf-analysis.html
3. "Machine Learning Foundations: A Case Study Approach" course on Coursera