# Notebook 4: Scores
The previous notebooks laid the groundwork needed to finally begin scoring our articles. Here, we will look at how the scores were created and then get a sense of the final novelty metric on some papers.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

import time

from src import novelty
from src.utils import tqdm_f
from tqdm import tqdm, tqdm_notebook
tqdm.pandas(tqdm_notebook)

---
### Load DataFrames

In [3]:
df = pd.read_pickle('data/dataframes/main_df.pkl')
scholars = pd.read_pickle('data/dataframes/scholars_df.pkl')
print('Main dataframe')
display(df.head())
print('\nScholars dataframe')
display(scholars.head())

Main dataframe


index,abstract,authors,day,month,tags,title,year,publish_date,title+abstract,top_topics
0,Learned feature representations and sub-phonem...,"[Fred Richardson, Douglas Reynolds, Najim Dehak]",3,4,"[cs.CL, cs.CV, cs.LG, cs.NE, stat.ML]",A Unified Deep Neural Network for Speaker and ...,2015,2015-03-04,A Unified Deep Neural Network for Speaker and ...,"(3, 12, 19)"
1,We propose a simple neural network model to de...,"[Muhammad Ghifary, W. Kleijn, Mengjie Zhang]",21,9,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Domain Adaptive Neural Networks for Object Rec...,2014,2014-09-21,Domain Adaptive Neural Networks for Object Rec...,"(3, 4, 9)"
2,Recent studies have demonstrated the power of ...,"[Lionel Pigou, Aäron Oord, Sander Dieleman, Mi...",5,6,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Beyond Temporal Pooling: Recurrence and Tempor...,2015,2015-05-06,Beyond Temporal Pooling: Recurrence and Tempor...,"(3, 9, 12)"
3,"In this paper, we address the task of Optical ...","[Rakesh Achanta, Trevor Hastie]",20,9,"[stat.ML, cs.AI, cs.CV, cs.LG, cs.NE]",Telugu OCR Framework using Deep Learning,2015,2015-09-20,Telugu OCR Framework using Deep Learning. In ...,"(1, 3, 19)"
4,Recent progress in using recurrent neural netw...,"[Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas...",27,2,"[stat.ML, cs.AI, cs.CL, cs.CV, cs.LG]",Describing Videos by Exploiting Temporal Struc...,2015,2015-02-27,Describing Videos by Exploiting Temporal Struc...,"(3, 9, 19)"



Scholars dataframe


scholar,freq,profile,h-index,i10-index,citedby,avg_citedby_2015
Yoshua Bengio,88,"{'_filled': True,\n 'affiliation': 'Professor,...",137,439,171496,2661
Uwe Aickelin,84,"{'_filled': True,\n 'affiliation': 'School of ...",52,123,9575,495
Marcus Hutter,78,"{'_filled': True,\n 'affiliation': 'Researcher...",31,106,5072,225
Chunhua Shen,75,"{'_filled': True,\n 'affiliation': 'School of ...",55,169,11896,329
Joseph Halpern,66,"{'_filled': True,\n 'affiliation': 'Professor ...",88,240,36802,990


---
# Scoring
As mentioned in the Introduction to this data challenge, I define a paper to be novel if it is (1) **new** and (2) **impactful**. I captured these two criteria in the Topic Score, while the Author Score was used to supplement the final Novelty Score.

### Topic Score
As seen from the topic trends in Notebook 2, a higher score should be awarded to papers that appear chronologically earlier in a topic trend. However, being one of the earlier publications in a topic trend is insufficient to be novel: if the topic itself is not impactful, there will not be much subsequent research in that field. As such, the number of papers published on a topic is equally important in considering the novelty of a paper.

With the above considerations, the Topic Score that I proposed is as follows:

$$\text{TS} \equiv (1 - d)^i \cdot \log_b(N) \qquad \forall i \in \{0, \dots, N - 1\}$$
where
- $\text{TS}$ is the Topic Score
- $i$ is the chronological position of the paper within its topic combination ($i = 0$ for the earliest paper)
- $N$ is the number of papers published in a given topic combination
- $d$ is the decay rate (default: 0.05)
- $b$ is the logarithm base (default: 10)

The first term, $\left( 1 - d \right)^i$, penalizes papers that were published later in a trend. This is a proxy for the first criterion (newness). Meanwhile, the second term $\log_b(N)$ awards a higher score to papers in *popular* topics, as a way to represent the second criterion (impactfulness). $d$ and $b$ are tunable parameters to tweak the relative importance of the two criteria. 
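To make the formula concrete, here is a minimal sketch of how the Topic Score could be computed for a single topic combination. It is not the actual `novelty.topic_score` from `src`: the name `topic_score_sketch` and its parameter names are assumptions, and it assumes the papers are already sorted by publication date so that position $i$ of the output belongs to the $i$-th earliest paper.

```python
import numpy as np


def topic_score_sketch(n_papers, decay=0.05, base=10):
    """Hypothetical re-implementation of TS = (1 - d)^i * log_b(N)."""
    ranks = np.arange(n_papers)                    # i = 0, 1, ..., N - 1
    newness = (1 - decay) ** ranks                 # shrinks for later papers
    popularity = np.log(n_papers) / np.log(base)   # log_b(N) via change of base
    return newness * popularity


scores = topic_score_sketch(100)
print(scores[:3])  # scores of the three earliest papers in a 100-paper topic
```

With the defaults, the earliest paper in a 100-paper topic receives $\log_{10}(100) = 2$, and each subsequent paper receives 95% of the previous paper's score. Note that a topic combination containing a single paper scores 0 under this formula, since $\log_b(1) = 0$.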

### Author Score
The authors of a paper can serve as an indicator of its quality. As discussed in Notebook 3, I chose to use the h-index as a proxy for the quality of the papers written by each author. For simplicity, I have used the arithmetic mean of all authors' h-indices to calculate the Author Score.

$$\text{AS} = \dfrac{\sum_{i=1}^{N_a} h_i}{N_a}$$
where
- $\text{AS}$ is the Author Score
- $h_i$ is the h-index of the $i$-th author of a paper
- $N_a$ is the number of authors in a paper

More advanced formulations of the Author Score are possible, and they are discussed in the Future Work section below.
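As an illustration of the arithmetic mean above, the following sketch computes the Author Score for one paper. It is an assumption-laden stand-in, not the actual `novelty.author_score` (which takes the whole `authors` column and the `scholars` dataframe, and whose outputs later in this notebook appear to be on a different scale than raw h-indices). The name `author_score_sketch`, the `h_index` lookup Series, and the choice to skip authors without a known h-index are all hypothetical.

```python
import numpy as np
import pandas as pd


def author_score_sketch(authors, h_index):
    """Mean h-index of the authors found in `h_index` (NaN if none are found)."""
    known = [h_index[name] for name in authors if name in h_index.index]
    return float(np.mean(known)) if known else float("nan")


# Made-up h-indices, for illustration only
h_index = pd.Series({"Author A": 40, "Author B": 10})

print(author_score_sketch(["Author A", "Author B"], h_index))  # 25.0
print(author_score_sketch(["Author A", "Unknown"], h_index))   # 40.0
```

Returning NaN when no author is known keeps such papers identifiable, so they can be filtered out before the scores are saved.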

### Novelty Score
The final Novelty Score (NS) is a weighted average of the two aforementioned scores, with weights $a_1$ and $a_2$ (both set to 0.5 in this notebook).

$$\text{NS} = \frac{a_1 \text{TS} + a_2 \text{AS}}{a_1 + a_2}$$

The Novelty Score is a non-negative real number and has no upper bound. While I could apply min-max normalization to keep the scores within [0, 1], I specifically chose not to, as the validation and test data would have to be normalized with the exact same function used for the training data. Since this normalization procedure is stateful (i.e. we need to know the upper and lower bounds of the training data), data leakage may occur during validation and test time.
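The two ideas in this section can be sketched in a few lines. The first function reproduces the weighted average using `np.average`, checked against the first row of the scored dataframe shown further down (TS = 0.864535, AS = 0.301551). The second illustrates why min-max normalization is stateful: the bounds are taken from the training scores and must be reused unchanged afterwards. `fit_min_max` is a hypothetical helper written for this illustration; it is not `novelty.normalize`.

```python
import numpy as np


def novelty_score(topic_score, author_score, weights=(0.5, 0.5)):
    """Weighted average of the Topic Score and the Author Score."""
    return float(np.average([topic_score, author_score], weights=weights))


print(round(novelty_score(0.864535, 0.301551), 6))  # 0.583043


def fit_min_max(train_scores):
    """Return a scaler that remembers the bounds of the training scores."""
    lo, hi = float(np.min(train_scores)), float(np.max(train_scores))

    def scale(scores):
        return (np.asarray(scores, dtype=float) - lo) / (hi - lo)

    return scale


scale = fit_min_max([0.2, 0.6, 1.0])   # bounds come from the training data only
print(scale([0.4, 1.4]))               # the second value lands above 1
```

Scores outside the training range fall outside [0, 1] after scaling, which is one reason to leave the Novelty Score unnormalized.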

---

I have applied the scores mentioned above to the training dataframe.

In [4]:
def assign_scores(df, scholars, weights=(0.5, 0.5), normalized=False):
    '''
    Assign the Topic Score, Author Score and Novelty Score to each article in `df`.
    '''
    output = []
    
    # Topic Score calculation
    for topics, _df in tqdm_f(is_range=False)(df.groupby('top_topics'), desc='Calculating Topic Scores'):
        _df = _df.sort_values('publish_date').reset_index()
        _df['topic_score'] = novelty.topic_score(len(_df))
        output.append(_df) 
    output = pd.concat(output).set_index('index').sort_values('index')
    output.index.name = None
    
    # Author Score calculation
    print('Calculating Author Scores...')
    time.sleep(0.5)
    output['author_score'] = novelty.author_score(output['authors'], scholars)
    
    # Novelty Score calculation
    print('Calculating Novelty Scores...')
    time.sleep(0.5)
    output['score'] = output.progress_apply(lambda row: np.average([row['topic_score'], row['author_score']], 
                                                          weights=weights), 
                                   axis=1)
    
    if normalized:
        output['score'] = novelty.normalize(output['score'])
        
    return output

In [5]:
df = assign_scores(df, scholars, weights=(0.5, 0.5), normalized=False)
df.head()



Calculating Author Scores...


100%|██████████| 19306/19306 [00:20<00:00, 953.71it/s]


Calculating Novelty Scores...


100%|██████████| 19306/19306 [00:03<00:00, 6344.07it/s]


index,abstract,authors,day,month,tags,title,year,publish_date,title+abstract,top_topics,topic_score,author_score,score
0,Learned feature representations and sub-phonem...,"[Fred Richardson, Douglas Reynolds, Najim Dehak]",3,4,"[cs.CL, cs.CV, cs.LG, cs.NE, stat.ML]",A Unified Deep Neural Network for Speaker and ...,2015,2015-03-04,A Unified Deep Neural Network for Speaker and ...,"(3, 12, 19)",0.864535,0.301551,0.583043
1,We propose a simple neural network model to de...,"[Muhammad Ghifary, W. Kleijn, Mengjie Zhang]",21,9,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Domain Adaptive Neural Networks for Object Rec...,2014,2014-09-21,Domain Adaptive Neural Networks for Object Rec...,"(3, 4, 9)",1.021247,0.338184,0.679716
2,Recent studies have demonstrated the power of ...,"[Lionel Pigou, Aäron Oord, Sander Dieleman, Mi...",5,6,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Beyond Temporal Pooling: Recurrence and Tempor...,2015,2015-05-06,Beyond Temporal Pooling: Recurrence and Tempor...,"(3, 9, 12)",0.556297,0.245878,0.401087
3,"In this paper, we address the task of Optical ...","[Rakesh Achanta, Trevor Hastie]",20,9,"[stat.ML, cs.AI, cs.CV, cs.LG, cs.NE]",Telugu OCR Framework using Deep Learning,2015,2015-09-20,Telugu OCR Framework using Deep Learning. In ...,"(1, 3, 19)",0.760365,0.739321,0.749843
4,Recent progress in using recurrent neural netw...,"[Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas...",27,2,"[stat.ML, cs.AI, cs.CL, cs.CV, cs.LG]",Describing Videos by Exploiting Temporal Struc...,2015,2015-02-27,Describing Videos by Exploiting Temporal Struc...,"(3, 9, 19)",0.477121,0.449995,0.463558


In [6]:
# Filter out values which are null and save the results
(
    df
    .pipe(lambda d: d[d['score'].notna()])
    .to_pickle('data/dataframes/main_df_scored.pkl')
)

---
# Empirical Validation of Novelty Score
As mentioned in the data challenge introduction, the number of citations a paper receives is generally regarded in academia as a good indicator of its quality and, by extension, its novelty. I was able to obtain citation data for 1754 articles in this dataset, including influential papers by well-regarded authors in the field.

With the citation data that I have, I was able to come up with some validations of the Novelty Score, as shown below.

In [7]:
# Create a dataframe containing articles with citation data
articles = pd.read_pickle('data/scraped/articles.pkl')
subset = (
    df
    .merge(articles, on='title')
    .pipe(lambda d: d[d.notna().all(axis=1)])  # Keep rows with citation data
    .pipe(lambda d: d[d['citedby'].apply(lambda x: isinstance(x, int))])  # Keep rows that contain integers
    .assign(citedby=lambda d: d['citedby'].astype(int))
)
subset.head()

index,abstract,authors,day,month,tags,title,year,publish_date,title+abstract,top_topics,topic_score,author_score,score,citedby
0,Learned feature representations and sub-phonem...,"[Fred Richardson, Douglas Reynolds, Najim Dehak]",3,4,"[cs.CL, cs.CV, cs.LG, cs.NE, stat.ML]",A Unified Deep Neural Network for Speaker and ...,2015,2015-03-04,A Unified Deep Neural Network for Speaker and ...,"(3, 12, 19)",0.864535,0.301551,0.583043,132
1,We propose a simple neural network model to de...,"[Muhammad Ghifary, W. Kleijn, Mengjie Zhang]",21,9,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Domain Adaptive Neural Networks for Object Rec...,2014,2014-09-21,Domain Adaptive Neural Networks for Object Rec...,"(3, 4, 9)",1.021247,0.338184,0.679716,54
2,Recent studies have demonstrated the power of ...,"[Lionel Pigou, Aäron Oord, Sander Dieleman, Mi...",5,6,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Beyond Temporal Pooling: Recurrence and Tempor...,2015,2015-05-06,Beyond Temporal Pooling: Recurrence and Tempor...,"(3, 9, 12)",0.556297,0.245878,0.401087,73
3,"In this paper, we address the task of Optical ...","[Rakesh Achanta, Trevor Hastie]",20,9,"[stat.ML, cs.AI, cs.CV, cs.LG, cs.NE]",Telugu OCR Framework using Deep Learning,2015,2015-09-20,Telugu OCR Framework using Deep Learning. In ...,"(1, 3, 19)",0.760365,0.739321,0.749843,9
4,Recent progress in using recurrent neural netw...,"[Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas...",27,2,"[stat.ML, cs.AI, cs.CL, cs.CV, cs.LG]",Describing Videos by Exploiting Temporal Struc...,2015,2015-02-27,Describing Videos by Exploiting Temporal Struc...,"(3, 9, 19)",0.477121,0.449995,0.463558,504


### Validation 1

If the number of citations is a good proxy for novelty, the citation count and the Novelty Score should be correlated. We can investigate this using the scatterplot below.

In [8]:
# Plot the log of the citation counts, since their distribution is extremely skewed
(
    subset
    .assign(log_citedby=lambda d: d['citedby'].apply(lambda x: np.log(x + 1)))
    .plot
    .scatter(x='log_citedby', y='score', s=3)
);
# Note: the coefficient is computed on the raw citation counts,
# not on the log-transformed values used for the x-axis of the plot
corr = np.corrcoef(subset[['citedby', 'score']].T)[0, 1]
print(f'The chart below yields a correlation coefficient of {corr:.3f}.')

The chart below yields a correlation coefficient of 0.106.
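Because citation counts are heavily skewed, a Pearson coefficient on the raw counts can be dominated by a handful of highly cited papers. As a hedged sketch of how the comparison could be made more robust, the helper below reports the Pearson correlation on raw counts, the Pearson correlation on log-transformed counts, and the Spearman rank correlation (computed as the Pearson correlation of the ranks). The function name `correlation_summary` is hypothetical and the input numbers are made up; no result for the real data is implied.

```python
import numpy as np
import pandas as pd


def correlation_summary(citations, scores):
    """Return Pearson (raw), Pearson (log1p) and Spearman correlations."""
    citations = pd.Series(citations, dtype=float).reset_index(drop=True)
    scores = pd.Series(scores, dtype=float).reset_index(drop=True)
    return {
        "pearson_raw": np.corrcoef(citations, scores)[0, 1],
        "pearson_log": np.corrcoef(np.log1p(citations), scores)[0, 1],
        # Spearman is the Pearson correlation of the ranks
        "spearman": np.corrcoef(citations.rank(), scores.rank())[0, 1],
    }


# Made-up numbers, for illustration only
print(correlation_summary([1, 10, 100, 1000], [0.1, 0.2, 0.3, 0.4]))
```

On the notebook's data this could be called as `correlation_summary(subset['citedby'], subset['score'])`.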


### Validation 2

We can also inspect the highest-scoring papers by a well-known author, say, Geoffrey Hinton, and compare their scores against their citation counts.

In [9]:
(
    subset
    # Substring match: keeps any author whose name contains 'Hinton', not only Geoffrey Hinton
    .pipe(lambda d: d[d['authors'].apply(lambda x: any('Hinton' in name for name in x))])
    .sort_values('score', ascending=False)
    .head(10)
)

index,abstract,authors,day,month,tags,title,year,publish_date,title+abstract,top_topics,topic_score,author_score,score,citedby
441,When a large feedforward neural network is tra...,"[Geoffrey Hinton, Nitish Srivastava, Alex Kriz...",3,7,"[cs.NE, cs.CV, cs.LG]",Improving neural networks by preventing co-ada...,2012,2012-03-07,Improving neural networks by preventing co-ada...,"(3, 9, 11)",1.584677,0.722208,1.153443,3601
1114,Visual perception is a challenging problem in ...,"[Yichuan Tang, Ruslan Salakhutdinov, Geoffrey ...",27,6,"[cs.CV, cs.LG, stat.ML]",Deep Lambertian Networks,2012,2012-06-27,Deep Lambertian Networks. Visual perception i...,"(4, 9, 12)",1.293641,0.828606,1.061124,53
1344,Recurrent neural networks (RNNs) are a powerfu...,"[Alex Graves, Abdel-rahman Mohamed, Geoffrey H...",22,3,"[cs.NE, cs.CL]",Speech Recognition with Deep Recurrent Neural ...,2013,2013-03-22,Speech Recognition with Deep Recurrent Neural ...,"(3, 12, 16)",1.07828,0.808419,0.94335,3481
2185,Some high-dimensional data.sets can be modelle...,"[Geoffrey Hinton, Yee Teh]",10,1,"[cs.LG, stat.ML]",Discovering Multiple Constraints that are Freq...,2013,2013-10-01,Discovering Multiple Constraints that are Freq...,"(5, 16, 18)",0.699879,0.902344,0.801112,34
2621,An efficient way to learn deep density models ...,"[Yichuan Tang, Ruslan Salakhutdinov, Geoffrey ...",18,6,"[cs.LG, stat.ML]",Deep Mixtures of Factor Analysers,2012,2012-06-18,Deep Mixtures of Factor Analysers. An efficie...,"(3, 4, 10)",0.762701,0.828606,0.795654,51
2749,Product models of low dimensional experts are ...,"[Max Welling, Richard Zemel, Geoffrey Hinton]",19,10,"[cs.LG, stat.ML]",Efficient Parametric Projection Pursuit Densit...,2012,2012-10-19,Efficient Parametric Projection Pursuit Densit...,"(4, 13, 16)",0.650662,0.863291,0.756976,5
4184,We introduce a Deep Boltzmann Machine model su...,"[Nitish Srivastava, Ruslan Salakhutdinov, Geof...",26,9,"[cs.LG, cs.IR, stat.ML]",Modeling Documents with Deep Boltzmann Machines,2013,2013-09-26,Modeling Documents with Deep Boltzmann Machine...,"(4, 6, 16)",0.450454,0.827001,0.638727,152
2553,Conditional Restricted Boltzmann Machines (CRB...,"[Volodymyr Mnih, Hugo Larochelle, Geoffrey Hin...",14,2,"[cs.LG, stat.ML]",Conditional Restricted Boltzmann Machines for ...,2012,2012-02-14,Conditional Restricted Boltzmann Machines for ...,"(0, 4, 16)",0.477699,0.788073,0.632886,100
15525,We show how the necessary and sufficient condi...,"[James Marshall, Thomas Hinton]",9,7,"[cs.IT, cs.NE, math.IT]",Beyond No Free Lunch: Realistic Algorithms for...,2009,2009-09-07,Beyond No Free Lunch: Realistic Algorithms for...,"(0, 16, 20)",0.29933,0.905148,0.602239,15
626,Syntactic constituency parsing is a fundamenta...,"[Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav...",23,12,"[cs.CL, cs.LG, stat.ML]",Grammar as a Foreign Language,2014,2014-12-23,Grammar as a Foreign Language. Syntactic cons...,"(4, 16, 19)",0.47834,0.710283,0.594312,502


---
# Summary
In both validations above, unfortunately, there seems to be only a weak correlation between the Novelty Score and the citation counts of papers.

One possible explanation for this performance is that my hypothesis, that the historical performance of an author would be a good indicator of a paper's novelty, does not hold. In the example above, we can see that papers which include Prof. Hinton as an author have wildly varying `citedby` values, even for those where he is the first author.

### Method Drawbacks
- There are many parameters to tweak: $b$ and $d$ for the Topic Score, and the averaging weights for the Novelty Score. Picking the right values is not straightforward.
- A veteran scholar likely has a higher h-index than a budding scholar, so academic age has to be accounted for.
- Computing the coherence score to find the ideal number of topics is extremely memory-intensive. It requires an AWS EC2 (r5.4xlarge) instance with 128 GB of RAM to operate, and would scale poorly to a larger corpus.

### Future Work
- Normalize each author's h-index by accounting for academic age (e.g. by using the m-index)
- Normalize the number of citations of a paper by how long it has been published
- Consider using a weighted average when calculating the Author Score. The weighted average should put larger weights on the first and last author's indices.
- Collect information from sources such as blogs, GitHub, YouTube and Medium to gauge the impact of a paper.
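
The first three items above could be sketched as follows. These helpers are hypothetical and were not part of this notebook's pipeline: the function names, the `edge_weight` value, and the floor of one year on ages are assumptions chosen for illustration. The m-index is taken here as the h-index divided by the number of years since the author's first publication.

```python
import numpy as np


def m_index(h_index, first_publication_year, current_year):
    """h-index divided by academic age in years (floored at 1)."""
    academic_age = max(current_year - first_publication_year, 1)
    return h_index / academic_age


def citations_per_year(citedby, publish_year, current_year):
    """Citation count divided by the paper's age in years (floored at 1)."""
    age = max(current_year - publish_year, 1)
    return citedby / age


def position_weighted_author_score(h_indices, edge_weight=2.0):
    """Weighted mean of h-indices; the first and last authors get `edge_weight`."""
    weights = np.ones(len(h_indices))  # requires at least one author
    weights[0] = edge_weight
    weights[-1] = edge_weight
    return float(np.average(h_indices, weights=weights))


# Made-up inputs, for illustration only
print(m_index(30, 2005, 2020))                       # 2.0
print(citations_per_year(120, 2015, 2019))           # 30.0
print(position_weighted_author_score([10, 20, 60]))  # 32.0
```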