## NIPS Paper notebook

In [1]:
import graphlab
import numpy as np
import pandas as pd

## Let's discover what data we have:

In [2]:
papers_df = pd.read_csv('Data/output/Papers.csv')
papers_data = graphlab.SFrame(data = papers_df)
authors_df = pd.read_csv('Data/output/Authors.csv')
authors_data = graphlab.SFrame(data = authors_df)
authorId_df = pd.read_csv('Data/output/PaperAuthors.csv')
authorId_data = graphlab.SFrame(data = authorId_df)

[INFO] 1452221178 : INFO:     (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/certifi/cacert.pem
1452221178 : INFO:     (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to 
This non-commercial license of GraphLab Create is assigned to aminia@u.washington.edu and will expire on November 11, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-1326 - Server binary: /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1452221178.log
[INFO] GraphLab Server Version: 1.7.1


In [3]:
papers_data.head(5)

Id,Title,EventType,PdfName,Abstract,PaperText
5677,Double or Nothing: Multiplicative Incentive ...,Poster,5677-double-or-nothing-multiplicative-incent ...,Crowdsourcing has gained immense popularity in ...,Double or Nothing: Multiplicative\nIncen ...
5941,Learning with Symmetric Label Noise: The ...,Spotlight,5941-learning-with-symmetric-label-noise- ...,Convex potential minimisation is the de ...,Learning with Symmetric Label Noise: ...
6019,Algorithmic Stability and Uniform Generalization ...,Poster,6019-algorithmic-stability-and-uniform- ...,One of the central questions in statistical ...,Algorithmic Stability and Uniform ...
6035,Adaptive Low-Complexity Sequential Inference for ...,Poster,6035-adaptive-low-complexity-sequential- ...,We develop a sequential low-complexity inference ...,Adaptive Low-Complexity Sequential Inference ...
5978,Covariance-Controlled Adaptive Langevin ...,Poster,5978-covariance-controlled-adaptive- ...,Monte Carlo sampling for Bayesian posterior ...,Covariance-Controlled Adaptive ...


In [4]:
authors_data.head(5)

Id,Name
4113,Constantine Caramanis
4828,Richard L. Lewis
5506,Ryan Kiros
7331,Kfir Levy
8429,Wei Cao


In [5]:
authorId_data.head(5)

Id,PaperId,AuthorId
1,5677,7956
2,5677,2649
3,5941,8299
4,5941,8300
5,5941,575


# Goal: Given an author's name, find people whose work is similar to hers/his
* ## 1. Find the papers that are similar based on abstract, full-text, and both
* ## 2. Find the authors associated with those papers

# 1. Find the papers that are similar based on abstract, full-text, and both
* ## a) Find the important keywords of each document using tf-idf
* ## b) Apply knn_model on tf-idf to find similar papers

### Challenge: PaperText contains escape sequences such as \n and \x byte escapes (e.g. \xe2\x80\x9c), which need to be replaced by spaces. The Abstract does not seem to have this problem.

In [6]:
papers_data[0]['PaperText']

'Double or Nothing: Multiplicative\nIncentive Mechanisms for Crowdsourcing\nNihar B. Shah\nUniversity of California, Berkeley\nnihar@eecs.berkeley.edu\n\nDengyong Zhou\nMicrosoft Research\ndengyong.zhou@microsoft.com\n\nAbstract\nCrowdsourcing has gained immense popularity in machine learning applications\nfor obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but\nsuffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize\nworkers to answer only the questions that they are sure of and skip the rest. We\nshow that surprisingly, under a mild and natural \xe2\x80\x9cno-free-lunch\xe2\x80\x9d requirement, this\nmechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms\n(that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. Interest
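One possible cleanup for the raw text above, as a minimal sketch rather than the notebook's own method. The names `clean_paper_text` and `ascii_only` are hypothetical. It assumes Python 3, where the `\xe2\x80\x9c`-style escapes would arrive as UTF-8 `bytes` (if the text is already a decoded string, the decoding step is skipped).

```python
import re

def clean_paper_text(raw, ascii_only=False):
    """Replace newlines (and optionally non-ASCII characters) with spaces.

    Hypothetical helper, not part of the notebook or of GraphLab.
    """
    # Raw UTF-8 bytes such as b'\xe2\x80\x9c' are decoded into real characters.
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', 'replace')
    # Optionally replace every non-ASCII character (e.g. curly quotes) with a space.
    if ascii_only:
        raw = re.sub(r'[^\x00-\x7f]', ' ', raw)
    # Collapse every run of whitespace, including '\n', into a single space.
    return re.sub(r'\s+', ' ', raw).strip()

sample = b'Multiplicative\nIncentive \xe2\x80\x9cno-free-lunch\xe2\x80\x9d requirement,\n\nthis'
cleaned = clean_paper_text(sample, ascii_only=True)
print(cleaned)
```

With `ascii_only=False` the curly quotes are kept as proper characters instead of being dropped, which may be preferable when author names contain accents.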

In [7]:
papers_data[40]['Abstract']

'Deep structured output learning shows great promise in tasks like semantic image segmentation. We proffer a new, efficient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to directly estimate the messages in message passing inference for structured prediction with Conditional Random Fields CRFs). With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers significant efficiency for learning, since otherwise when performing structured learning for a CRF with CNN potentials it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension of message estimators is the same as the number of classes, rather than exponentially growing in the order of the potentials. Hence it is more scalable for cases that a large number of classes are involved. We apply our method to semantic image segmentation and ac

## So let's start with Abstract first:

In [8]:
first_paper = papers_data[papers_data['Id'] == 5677]
first_paper['word_count'] = graphlab.text_analytics.count_words(first_paper['Abstract'])
first_paper['word_count']

dtype: dict
Rows: 1
[{'all': 1, 'show': 2, 'skip': 1, 'over': 1, 'cheap': 1, 'mild': 1, 'experiments': 1, 'mechanism': 7, 'questions': 1, 'possible.': 1, 'workers': 1, 'to': 4, 'only': 2, 'under': 2, 'has': 1, 'propose': 1, 'possible': 2, 'they': 1, 'not': 1, 'unique': 2, 'form.': 1, 'large': 1, 'multiplicative': 1, 'sure': 1, 'are': 1, 'our': 2, 'for': 2, 'smallest': 1, 'rest.': 1, 'benefit.': 1, '(that': 1, 'satisfy': 1, 'we': 4, 'incentive-compatible': 2, 'mechanisms': 1, 'monetary': 1, 'crowdsourcing,': 1, 'interestingly,': 1, 'workers,': 1, 'gained': 1, 'surprisingly,': 1, 'of': 4, 'makes': 1, 'or': 2, 'among': 1, 'simple': 1, 'fast,': 1, 'obtaining': 1, 'one': 1, 'learning': 1, 'spammers.': 1, 'from': 1, 'takes': 1, 'crowdsourcing': 2, 'immense': 1, 'reduction': 1, 'rates': 1, 'hundred': 1, 'no-free-lunch': 1, 'that': 3, 'but': 1, 'observe': 1, 'low-quality': 1, 'this': 3, 'challenge': 1, 'labeled': 1, 'no-free-lunch),': 1, 'error': 1, 'problem': 1, 'address': 1, 'and': 4, 'is': 
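The counts above keep punctuation attached to words, so `'possible.'` and `'possible'` are counted separately. A minimal sketch of a stricter tokenizer is shown here using only the standard library; `count_words_stripped` is a hypothetical name, and this is not how `graphlab.text_analytics.count_words` works internally.

```python
import re
from collections import Counter

def count_words_stripped(text):
    """Count lowercase words after dropping surrounding punctuation.

    Hypothetical helper: inner hyphens and apostrophes are kept, so
    'no-free-lunch' stays one token.
    """
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    return dict(Counter(tokens))

sample = "This mechanism is possible. The no-free-lunch mechanism (that is possible)"
counts = count_words_stripped(sample)
print(counts)
```

The resulting dictionaries have the same `{word: count}` shape as the `word_count` column, so they could be fed to the tf-idf step in the same way.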

In [9]:
papers_data['word_count'] = graphlab.text_analytics.count_words(papers_data['Abstract'])
tfidf = graphlab.text_analytics.tf_idf(papers_data['word_count'])
papers_data['tf_idf'] = tfidf

In [10]:
# Given the papers SFrame and a paper Id, return that paper's keywords
# sorted by tf-idf score, highest first.
def keywords_given_paperID(papers_data, paper_id):
    paper = papers_data[papers_data['Id'] == paper_id]
    # stack() turns the {word: score} dict into one row per word
    keywords = paper[['tf_idf']].stack('tf_idf', new_column_name=['word', 'tf_idf']).sort('tf_idf', ascending=False)
    return keywords

In [11]:
# Try keywords_given_paperID on the first paper, whose Id is 5677
print keywords_given_paperID(papers_data, 5677)

+----------------------+---------------+
|         word         |     tf_idf    |
+----------------------+---------------+
|      mechanism       | 27.4364651419 |
|       payment        | 17.9968096858 |
| incentive-compatible | 10.6115787628 |
|    crowdsourcing     | 9.80064854656 |
|        unique        | 8.77899729903 |
|       possible       |  6.2171296081 |
|    surprisingly,     | 5.99893656195 |
|     requirement,     | 5.99893656195 |
|       immense        | 5.99893656195 |
|        (that         | 5.99893656195 |
+----------------------+---------------+
[99 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
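The scores above appear consistent with tf × ln(N / df), where df is the number of abstracts containing the word: a word that occurs once in a single abstract scores 5.9989 ≈ ln 403, and `mechanism` (count 7) scores 27.4365 ≈ 7 × ln(403 / 8). Both N = 403 and df = 8 are inferred from the printed numbers, not read from the data. The sketch below implements that formula in plain Python on toy counts; `tf_idf` here is a stand-alone illustration, not GraphLab's implementation.

```python
import math
from collections import Counter

def tf_idf(word_counts):
    """word_counts: list of {word: count} dicts, one per document.

    Returns one {word: score} dict per document, with
    score = count * ln(number of documents / documents containing the word).
    """
    n_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())
    return [
        {w: c * math.log(n_docs / float(doc_freq[w])) for w, c in counts.items()}
        for counts in word_counts
    ]

# Toy word counts (made up for illustration).
docs = [
    {'mechanism': 7, 'payment': 3, 'we': 4},
    {'payment': 1, 'we': 2},
    {'kernel': 2, 'we': 1},
]
scores = tf_idf(docs)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

A word that appears in every document (`'we'` in the toy data) gets a score of zero, which is why common words drop out of the keyword list.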


## Build a knn_model:

In [12]:
knn_model = graphlab.nearest_neighbors.create(papers_data, features=['tf_idf'], label='Id') 

PROGRESS: Starting brute force nearest neighbors model training.


In [13]:
knn_model.query(papers_data[papers_data['Id']== 5677], verbose=False)['reference_label']

dtype: int
Rows: 5
[5677, 5880, 5995, 5842, 5955]
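To make the brute-force search concrete, here is a minimal sketch over sparse `{word: score}` vectors. It uses cosine distance as an assumption; the distance GraphLab actually selected for the `tf_idf` column may differ. `cosine_distance` and `nearest_papers` are hypothetical names.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse {word: weight} vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def nearest_papers(vectors, query_id, k=5):
    """Brute force: compare the query with every paper and keep the k closest."""
    query = vectors[query_id]
    ranked = sorted(vectors, key=lambda pid: (cosine_distance(query, vectors[pid]), pid))
    return ranked[:k]

# Toy tf-idf vectors keyed by paper Id (made up for illustration).
toy = {
    1: {'mechanism': 2.0, 'payment': 1.0},
    2: {'mechanism': 1.0, 'crowdsourcing': 1.0},
    3: {'kernel': 3.0},
}
neighbors = nearest_papers(toy, 1, k=2)
print(neighbors)
```

As in the query output above, where 5677 comes first, the query paper is its own nearest neighbor, so the first result should be skipped when looking for other papers.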

In [14]:
# Given a knn model, the author tables and a paper Id, return the Ids of the
# most similar papers together with the names of their authors.
# Note: the query row is read from the module-level papers_data SFrame.
def similar_authors_to_given_paper(knnModel, author_data, author_id_data, paper_id):
    similar_paper_ids = knnModel.query(papers_data[papers_data['Id'] == paper_id], verbose=False)['reference_label']
    sim_id_author_list = []
    for similar_id in similar_paper_ids:  # avoid shadowing the built-in id()
        id_author_list = author_id_data[author_id_data['PaperId'] == similar_id]['AuthorId']
        author_name_list = []
        for id_author in id_author_list:
            author_name = author_data[author_data['Id'] == id_author]['Name'][0]
            author_name_list.append(author_name)
        sim_id_author_list.append([similar_id, author_name_list])
    return sim_id_author_list

In [15]:
print similar_authors_to_given_paper(knn_model, authors_data, authorId_data, 5677)

[[5677, ['Nihar Bhadresh Shah', 'Denny Zhou']], [5880, ['Pinar Yanardag', 'S.V.N. Vishwanathan']], [5995, ['Bo Waggoner', 'Rafael Frongillo', 'Jacob D. Abernethy']], [5842, ['Ofer Dekel', 'Ronen Eldan', 'Tomer Koren']], [5955, ['Xingjian SHI', 'Zhourong Chen', 'Hao Wang', 'Dit-Yan Yeung', 'Wai-kin Wong', 'Wang-chun WOO']]]
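The stated goal starts from an author's name rather than a paper Id. A minimal sketch of that last step is given below, using plain dictionaries built once instead of filtering the tables for every Id. All names (`build_indexes`, `similar_authors`, `similar_papers`) and the toy rows are hypothetical; `similar_papers` stands in for a call to the knn model.

```python
from collections import defaultdict

def build_indexes(author_rows, paper_author_rows):
    """author_rows: (author_id, name) pairs; paper_author_rows: (paper_id, author_id) pairs."""
    name_by_author = {author_id: name for author_id, name in author_rows}
    authors_by_paper = defaultdict(list)
    papers_by_author = defaultdict(list)
    for paper_id, author_id in paper_author_rows:
        authors_by_paper[paper_id].append(author_id)
        papers_by_author[author_id].append(paper_id)
    return name_by_author, authors_by_paper, papers_by_author

def similar_authors(author_name, author_rows, paper_author_rows, similar_papers):
    """Names of authors whose papers are similar to the given author's papers.

    similar_papers: callable mapping a paper Id to a list of similar paper Ids.
    """
    name_by_author, authors_by_paper, papers_by_author = build_indexes(
        author_rows, paper_author_rows)
    own_ids = sorted(a for a, n in name_by_author.items() if n == author_name)
    found = []
    for author_id in own_ids:
        for paper_id in papers_by_author[author_id]:
            for similar_id in similar_papers(paper_id):
                for other_id in authors_by_paper[similar_id]:
                    name = name_by_author[other_id]
                    if other_id not in own_ids and name not in found:
                        found.append(name)
    return found

# Toy data (made up for illustration).
authors = [(1, 'Author A'), (2, 'Author B'), (3, 'Author C')]
paper_authors = [(10, 1), (10, 2), (11, 3)]
toy_neighbors = {10: [10, 11], 11: [11, 10]}
result = similar_authors('Author A', authors, paper_authors,
                         lambda pid: toy_neighbors.get(pid, []))
print(result)
```

Co-authors of the queried author are included in the result because they wrote one of the author's own papers; whether to filter them out depends on what "similar" should mean here.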
