# NIPS Papers: Which papers are similar? 

In [1]:
import graphlab
import numpy as np
import pandas as pd

### Let's discover what data we have:

In [2]:
papers_df = pd.read_csv('Data/output/Papers.csv')
papers_data = graphlab.SFrame(data = papers_df)
authors_df = pd.read_csv('Data/output/Authors.csv')
authors_data = graphlab.SFrame(data = authors_df)
authorId_df = pd.read_csv('Data/output/PaperAuthors.csv')
authorId_data = graphlab.SFrame(data = authorId_df)


In [3]:
papers_data.head(5)

Id,Title,EventType,PdfName,Abstract,PaperText
5677,Double or Nothing: Multiplicative Incentive ...,Poster,5677-double-or-nothing-multiplicative-incent ...,Crowdsourcing has gained immense popularity in ...,Double or Nothing: Multiplicative\nIncen ...
5941,Learning with Symmetric Label Noise: The ...,Spotlight,5941-learning-with-symmetric-label-noise- ...,Convex potential minimisation is the de ...,Learning with Symmetric Label Noise: ...
6019,Algorithmic Stability and Uniform Generalization ...,Poster,6019-algorithmic-stability-and-uniform- ...,One of the central questions in statistical ...,Algorithmic Stability and Uniform ...
6035,Adaptive Low-Complexity Sequential Inference for ...,Poster,6035-adaptive-low-complexity-sequential- ...,We develop a sequential low-complexity inference ...,Adaptive Low-Complexity Sequential Inference ...
5978,Covariance-Controlled Adaptive Langevin ...,Poster,5978-covariance-controlled-adaptive- ...,Monte Carlo sampling for Bayesian posterior ...,Covariance-Controlled Adaptive ...


In [4]:
authors_data.head(5)

Id,Name
4113,Constantine Caramanis
4828,Richard L. Lewis
5506,Ryan Kiros
7331,Kfir Levy
8429,Wei Cao


In [5]:
authorId_data.head(5)

Id,PaperId,AuthorId
1,5677,7956
2,5677,2649
3,5941,8299
4,5941,8300
5,5941,575


# Goal: Find similar papers based on abstract and full text

### Steps:
1. Find the important keywords of each document using tf-idf
2. Fit a nearest-neighbors model (knn_model) on the tf-idf vectors to find similar papers

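### Cleaning issue:
* The raw text contains control characters (\n, \x0c), non-ASCII byte sequences (such as \xe2\x88\x97), digits and punctuation. To clean it:
    1. Replace \n and \x0c with a space
    2. Convert to unicode, dropping the bytes that cannot be decoded
    3. Replace every remaining run of non-letter characters (digits 0-9, punctuation such as ' : !) with a single space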
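
The cleaning steps above can be sketched for Python 3, where the `unicode` built-in used later in this notebook no longer exists. This is only a minimal sketch: the name `clean_text_py3` is hypothetical, and it assumes that dropping non-ASCII bytes is the intended effect of the unicode step.

```python
import re


def clean_text_py3(raw):
    """Hypothetical Python 3 counterpart of the notebook's cleaning step."""
    # Byte strings: drop every byte that is not plain ASCII (assumed intent)
    if isinstance(raw, bytes):
        raw = raw.decode('ascii', errors='ignore')
    # Replace form feeds and newlines with spaces
    raw = raw.replace('\x0c', ' ').replace('\n', ' ')
    # Collapse every run of non-letter characters into one space
    return re.sub('[^a-zA-Z]+', ' ', raw)


sample = 'Noise: The\nImportance [2010]\x0c'
cleaned = clean_text_py3(sample)
```

Note that trailing punctuation leaves a trailing space in the result; call `.strip()` on the output if that matters for downstream steps.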

In [6]:
# Example Before Cleaning:
papers_data[1]['PaperText'][0:1000]

'Learning with Symmetric Label Noise: The\nImportance of Being Unhinged\n\nBrendan van Rooyen\xe2\x88\x97,\xe2\x80\xa0\n\xe2\x88\x97\n\nAditya Krishna Menon\xe2\x80\xa0,\xe2\x88\x97\n\nThe Australian National University\n\n\xe2\x80\xa0\n\nRobert C. Williamson\xe2\x88\x97,\xe2\x80\xa0\n\nNational ICT Australia\n\n{ brendan.vanrooyen, aditya.menon, bob.williamson }@nicta.com.au\n\nAbstract\nConvex potential minimisation is the de facto approach to binary classification.\nHowever, Long and Servedio [2010] proved that under symmetric label noise\n(SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly\nshows that convex losses are not SLN-robust. In this paper, we propose a convex,\nclassification-calibrated loss and prove that it is SLN-robust. The loss avoids the\nLong and Servedio [2010] result by virtue of being negatively unbounded. The\nloss is a modification of the hinge loss, wh

### Clean Abstract and PaperText:

In [7]:
import re

def clean_text(text):
    # Replace form feeds and newlines with spaces
    for sign in ['\x0c', '\n']:
        text = text.replace(sign, ' ')
    # Python 2: decode the byte string, dropping undecodable (non-ASCII) bytes
    text = unicode(text, errors='ignore')
    # Collapse every run of non-letter characters into a single space
    return re.sub('[^a-zA-Z]+', ' ', text)

In [8]:
papers_data['PaperText_clean'] = papers_data['PaperText'].apply(clean_text)
papers_data['Abstract_clean'] = papers_data['Abstract'].apply(clean_text)

In [9]:
# Example After Cleaning
papers_data[1]['PaperText_clean'][0:1000]

'Learning with Symmetric Label Noise The Importance of Being Unhinged Brendan van Rooyen Aditya Krishna Menon The Australian National University Robert C Williamson National ICT Australia brendan vanrooyen aditya menon bob williamson nicta com au Abstract Convex potential minimisation is the de facto approach to binary classification However Long and Servedio proved that under symmetric label noise SLN minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing This ostensibly shows that convex losses are not SLN robust In this paper we propose a convex classification calibrated loss and prove that it is SLN robust The loss avoids the Long and Servedio result by virtue of being negatively unbounded The loss is a modification of the hinge loss where one does not clamp at zero hence we call it the unhinged loss We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM and is 

### Build tf-idf columns for Abstract and PaperText:

In [10]:
papers_data['word_count_abstract_clean'] = graphlab.text_analytics.count_words(papers_data['Abstract_clean'])
papers_data['tf_idf_Abstract_clean'] = graphlab.text_analytics.tf_idf(papers_data['word_count_abstract_clean'])
papers_data['word_count_papertext_clean'] = graphlab.text_analytics.count_words(papers_data['PaperText_clean'])
papers_data['tf_idf_PaperText_clean'] = graphlab.text_analytics.tf_idf(papers_data['word_count_papertext_clean'])
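
To make the two steps in the cell above concrete, here is a small pure-Python sketch of word counting followed by tf-idf weighting. It assumes the common definition score = count × ln(N / df), where N is the number of documents and df is the number of documents containing the word; the exact variant GraphLab applies is not shown in this notebook. The helper names and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def count_words(text):
    # Lowercase and split on whitespace (the keywords printed later are lowercase too)
    return Counter(text.lower().split())


def tf_idf(word_counts):
    """Weight each count by ln(N / df); assumed formula, see lead-in."""
    n_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        # Each document contributes at most 1 to a word's document frequency
        doc_freq.update(counts.keys())
    return [
        {word: count * math.log(n_docs / float(doc_freq[word]))
         for word, count in counts.items()}
        for counts in word_counts
    ]


# Toy corpus, not taken from the NIPS data
docs = ["convex loss convex", "loss robust", "sampling posterior"]
scores = tf_idf([count_words(d) for d in docs])
```

In this toy corpus, `convex` scores higher than `loss` in the first document because it occurs twice there and in no other document, while `loss` is shared with the second document.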

### Build a function that takes paperID as input and returns its keywords sorted by tf-idf:

In [11]:
def given_paperID_give_keywords(papers_data, paper_id, tfidf_col_name):
    # Select the row for this paper, then unpack its {word: score} dict into one row per word
    paper = papers_data[papers_data['Id'] == paper_id]
    keywords = paper[[tfidf_col_name]].stack(tfidf_col_name,
                                             new_column_name=['word', tfidf_col_name])
    # Highest tf-idf scores first
    return keywords.sort(tfidf_col_name, ascending=False)

In [12]:
# Example for keywords based on abstract
paper_id_example = 5941
print "Keywords based on Abstract:"
print given_paperID_give_keywords(papers_data, paper_id_example, 'tf_idf_Abstract_clean')
print "Keywords based on PaperText:"
print given_paperID_give_keywords(papers_data, paper_id_example, 'tf_idf_PaperText_clean')

Keywords based on Abstract:
+----------------+-----------------------+
|      word      | tf_idf_Abstract_clean |
+----------------+-----------------------+
|      sln       |     29.9946828097     |
|    unhinged    |     17.9968096858     |
|      loss      |     16.2185981757     |
|   potential    |     13.1684959485     |
|    servedio    |     11.9978731239     |
|  minimisation  |     11.9978731239     |
|     convex     |      10.143223242     |
|     robust     |     8.22252007178     |
|   equivalent   |     7.83899004053     |
| classification |      6.9301713235     |
+----------------+-----------------------+
[86 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
Keywords based on PaperText:
+------------+------------------------+
|    word    | tf_idf_PaperText_clean |
+------------+------------------------+
|    sln     |     515.908544327      |
|  unhinged  |     263.95320872

### Build a function that takes paperID and knn_model as input and returns the Ids of similar papers:

In [13]:
def given_paperID_give_similar_papersID(knnModel, papers_data, paper_id):
    # Query the model with the row of the given paper; 'reference_label' holds the neighbors' Ids
    similar_paper_ids = knnModel.query(papers_data[papers_data['Id'] == paper_id], verbose=False)['reference_label']
    return similar_paper_ids

In [14]:
knn_model_Abstract = graphlab.nearest_neighbors.create(papers_data, features=['tf_idf_Abstract_clean'], label='Id')
knn_model_PaperText = graphlab.nearest_neighbors.create(papers_data, features=['tf_idf_PaperText_clean'], label='Id')

PROGRESS: Starting brute force nearest neighbors model training.
PROGRESS: Starting brute force nearest neighbors model training.
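
For intuition, a brute-force nearest-neighbor search over sparse tf-idf dictionaries can be sketched in a few lines. This is not GraphLab's implementation: the distance GraphLab selects by default is not shown in this notebook, so cosine distance is swapped in here as an assumption, and the names `cosine_distance`, `nearest_papers` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    # u and v are sparse vectors stored as {word: weight} dicts
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def nearest_papers(query_id, vectors, k=5):
    """Brute force: score every paper against the query, keep the k closest."""
    query = vectors[query_id]
    ranked = sorted(
        vectors,
        key=lambda pid: (cosine_distance(query, vectors[pid]), pid),
    )
    return ranked[:k]


# Toy tf-idf vectors keyed by made-up paper ids
toy_vectors = {
    1: {'convex': 2.0, 'loss': 1.0},
    2: {'convex': 1.0, 'loss': 1.0},
    3: {'sampling': 3.0},
}
closest = nearest_papers(1, toy_vectors, k=2)
```

As with the real model, the query paper is its own nearest neighbor (distance 0), so it appears first in the result.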


In [15]:
paper_id_example = 5941
Abstract_sim_papers_example = given_paperID_give_similar_papersID(knn_model_Abstract, papers_data, paper_id_example)
PaperText_sim_papers_example = given_paperID_give_similar_papersID(knn_model_PaperText, papers_data, paper_id_example)
print "Similar papers based on Abstract:"
print Abstract_sim_papers_example
print "Similar papers based on PaperText:"
print PaperText_sim_papers_example

Similar papers based on Abstract:
[5941, 5742, 5801, 5924, 5745]
Similar papers based on PaperText:
[5941, 5999, 5994, 5921, 5806]
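
Both lists start with 5941, the query paper itself. A small sketch of how one might drop it and check whether the two models agree on any neighbor, using the Ids printed above (the helper name `drop_query_paper` is hypothetical):

```python
def drop_query_paper(neighbor_ids, query_id):
    # Keep the original ranking, minus the paper we queried with
    return [pid for pid in neighbor_ids if pid != query_id]


# Ids copied from the output above
abstract_ids = [5941, 5742, 5801, 5924, 5745]
papertext_ids = [5941, 5999, 5994, 5921, 5806]

abstract_neighbors = drop_query_paper(abstract_ids, 5941)
papertext_neighbors = drop_query_paper(papertext_ids, 5941)

# Papers that both models consider similar to the query
shared = sorted(set(abstract_neighbors) & set(papertext_neighbors))
```

For this example `shared` is empty: apart from the query paper, the abstract-based and full-text-based models return entirely different neighbors.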


### Some post-processing functions:

In [16]:
def given_paperID_give_authors_id(paper_id, author_data, author_id_data):
    # Look up all author Ids linked to this paper (author_data is not used by this lookup)
    id_author_list = author_id_data[author_id_data['PaperId'] == paper_id]['AuthorId']
    return id_author_list

def given_authorID_give_name(author_id, author_data):
    # Return the name of the first author row matching this Id
    author_name = author_data[author_data['Id'] == author_id]['Name'][0]
    return author_name

def given_similar_paperIDs_give_their_titles(sim_papers_list, paper_data):
    # Collect the title of each paper, in the same order as the input Ids
    titles = []
    for paper_id in sim_papers_list:
        titles.append(paper_data[paper_data['Id'] == paper_id]['Title'][0] + '.')
    return titles

In [17]:
print "Titles of similar papers based on Abstract:\n\n"
for title in given_similar_paperIDs_give_their_titles(Abstract_sim_papers_example, papers_data):
    print title

Titles of similar papers based on Abstract:


Learning with Symmetric Label Noise: The Importance of Being Unhinged.
Top-k Multiclass SVM.
Reflection, Refraction, and Hamiltonian Monte Carlo.
A Dual Augmented Block Minimization Framework for Learning with Limited Memory.
Distributionally Robust Logistic Regression.


In [18]:
print "Titles of similar papers based on PaperText:\n\n"
for title in given_similar_paperIDs_give_their_titles(PaperText_sim_papers_example, papers_data):
    print title

Titles of similar papers based on PaperText:


Learning with Symmetric Label Noise: The Importance of Being Unhinged.
Fast Classification Rates for High-dimensional Gaussian Generative Models.
Online F-Measure Optimization.
Convergence Rates of Active Learning for Maximum Likelihood Estimation.
On the Accuracy of Self-Normalized Log-Linear Models.


# *** Question: Are these papers really similar? Is there a way to evaluate the results? ***
## *** Maybe we can check whether papers judged similar cite the same references? ***
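
One possible way to act on this idea, sketched under assumptions: if the reference list of each paper were available as a set of identifiers (the dataset loaded above has no such column, so the reference keys below are invented placeholders), the overlap between two papers could be scored with Jaccard similarity.

```python
def jaccard_similarity(refs_a, refs_b):
    """Share of references the two papers have in common (0.0 to 1.0)."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0  # no references on either side: nothing to compare
    return len(a & b) / float(len(a | b))


# Hypothetical reference keys, for illustration only
refs_query = ['ref-A', 'ref-B', 'ref-C']
refs_neighbor = ['ref-B', 'ref-C', 'ref-D']

overlap = jaccard_similarity(refs_query, refs_neighbor)
```

A neighbor list could then be judged by the average overlap between the query paper and each returned paper, compared against the average overlap with randomly chosen papers as a baseline.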