# Result Analysis

In this notebook, we analyze the results of our model in two respects:

1. whether the model is able to differentiate similar papers published by the same author, while at the same time discovering their shared topics

2. whether the model is able to give precise results compared to manual labeling

To start, we select 5 papers published by Professor Shang:
- CrossWeigh
    - Semantic Scholar: https://www.semanticscholar.org/paper/CrossWeigh%3A-Training-Named-Entity-Tagger-from-Wang-Shang/997855e1f17d34dd3922d953a587742d198844e6
    - PDF: https://www.aclweb.org/anthology/D19-1519.pdf
- AutoPhrase
    - Semantic Scholar: https://www.semanticscholar.org/paper/Automated-Phrase-Mining-from-Massive-Text-Corpora-Shang-Liu/96808500be49f3d502055bab1edd30dcbec4b99b
    - PDF: http://hanj.cs.illinois.edu/pdf/tkde18_jshang2.pdf
- LM-LSTM-CRF
    - Semantic Scholar: https://www.semanticscholar.org/paper/Empower-Sequence-Labeling-with-Task-Aware-Neural-Liu-Shang/7647a06965d868a4f6451bef0818994100a142e8
    - PDF: https://arxiv.org/pdf/1709.04109.pdf
- AutoNER
    - Semantic Scholar: https://www.semanticscholar.org/paper/Learning-Named-Entity-Tagger-using-Domain-Specific-Shang-Liu/5201efab94c9376ef894f6f33cab06a5c5e00073
    - PDF: https://www.aclweb.org/anthology/D18-1230.pdf
- SetExpan
    - Semantic Scholar: https://www.semanticscholar.org/paper/SetExpan%3A-Corpus-Based-Set-Expansion-via-Context-Shen-Wu/741d50647afac926dce001160d8253c7a5c14ca3
    - PDF: https://arxiv.org/pdf/1910.08192.pdf

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
import seaborn as sns

In [None]:
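base_dir = '../references/result_analysis'
# Keep only the per-paper subdirectories; this skips stray files such as
# .DS_Store without raising an error when they are absent.
dirs = sorted(d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d)))
dirs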

### 1. Ability to compare and contrast similar papers published by the same author

First, we load the AutoPhrase results for all 5 papers, keeping only high-quality phrases (quality score > 0.5).

In [None]:
autophrase = {}
autophrase_all = {}
autophrase_stats = pd.DataFrame()
for directory in dirs:
    fp = '../references/result_analysis/' + directory + '/AutoPhrase.txt'
    df = pd.read_csv(fp, delimiter='\t', header=None, names=['score', 'phrase'])
    df = df[['phrase', 'score']]
    autophrase_all[directory] = df
    df = df[df['score'] > 0.5]
    autophrase[directory] = df
    autophrase_stats[directory] = df['score'].describe()
autophrase_df = pd.concat(autophrase, axis=1)
autophrase_df.head(15)

As the dataframe above shows, some phrases, such as "maccabi tel aviv," "jiawei han," and "california," rank near the top of the list even though they are not domain-specific.

To filter out these nonsignificant phrases, our model weights the AutoPhrase results using the pre-processed arXiv dataset. During pre-processing, we split the arXiv dataset into domains and ran AutoPhrase on each one to obtain domain-specific phrases. For the 5 papers used here, we select "computer science" as the domain; the weighted results for high-quality phrases (quality score > 0.5) are as follows:

In [None]:
weighted = {}
weighted_all = {}
weighted_stats = pd.DataFrame()
for directory in dirs:
    fp = '../references/result_analysis/' + directory + '/weighted_AutoPhrase.csv'
    df = pd.read_csv(fp, index_col='Unnamed: 0')
    weighted_all[directory] = df
    df = df[df['score'] > 0.5]
    weighted[directory] = df
    weighted_stats[directory] = df['score'].describe()
weighted_df = pd.concat(weighted, axis=1)
weighted_df.head(15)

As we can see from the dataframe above, there are some phrases, such as "natural language," "pos tagging," "lstm crf," and "text corpora," shared across these 5 papers. At the same time, each of these papers has its own unique phrases, such as "cross validation" for CrossWeigh, "knowledge base" for AutoPhrase, "sequence labeling" for LM-LSTM-CRF, "distant supervision" for AutoNER, and "bipartite graph" for SetExpan. 

Thus our model is able to differentiate similar papers published by the same author, while at the same time discovering their shared topics.
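
The shared and unique phrases described above can also be checked programmatically rather than by reading the table. The following is a minimal sketch, not part of the original analysis: the helper `shared_and_unique` is hypothetical, it assumes each per-paper frame has a `phrase` column, and it uses the strict definition of "shared" (present in every paper), which may be narrower than what the discussion above intends.

```python
import pandas as pd

def shared_and_unique(phrases_by_paper):
    """Return (phrases present in every paper, phrases found in exactly one paper)."""
    sets = {name: set(df['phrase']) for name, df in phrases_by_paper.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for name, phrases in sets.items():
        # Union of the phrases of all other papers
        others = set().union(*(s for other, s in sets.items() if other != name))
        unique[name] = phrases - others
    return shared, unique

# Toy stand-in for the `weighted` dictionary (values are illustrative only)
toy = {
    'PaperA': pd.DataFrame({'phrase': ['natural language', 'cross validation'],
                            'score': [0.9, 0.8]}),
    'PaperB': pd.DataFrame({'phrase': ['natural language', 'knowledge base'],
                            'score': [0.9, 0.7]}),
}
shared, unique = shared_and_unique(toy)
print(shared)
print(unique)
```

In this notebook, the same helper could be called as `shared_and_unique(weighted)`, provided the weighted CSV files expose a `phrase` column.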

The density plots of quality scores before and after applying weight are as follows:

In [None]:
plt.figure(figsize=(20, 8))
ax = plt.subplot(1, 2, 1)
for directory in autophrase_all:
    # kdeplot replaces the deprecated distplot(..., hist=False)
    sns.kdeplot(autophrase_all[directory]['score'].to_list(), label=directory)
plt.title('AutoPhrase Quality Score Distribution', fontsize=25)
plt.xlim(0, 1)
plt.ylim(0, 4)
plt.ylabel('Density', fontsize=20)
plt.legend(loc="best", fontsize=15)
for label in (ax.get_xticklabels() + ax.get_yticklabels()):
    label.set_fontname('Arial')
    label.set_fontsize(15)

ax = plt.subplot(1, 2, 2)
for directory in weighted_all:
    # kdeplot replaces the deprecated distplot(..., hist=False)
    sns.kdeplot(weighted_all[directory]['score'].to_list(), label=directory)
plt.title('Weighted Quality Score Distribution', fontsize=25)
plt.xlim(0, 1)
plt.ylim(0, 4)
plt.ylabel('Density', fontsize=20)
plt.legend(loc="best", fontsize=15)
for label in (ax.get_xticklabels() + ax.get_yticklabels()):
    label.set_fontname('Arial')
    label.set_fontsize(15)

# plt.savefig('../data/report/Quality Score Distribution.png')
plt.show()

The quality score distribution shifts to the left after applying the weight. This is expected because nonsignificant phrases are weighted down.
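
To illustrate why down-weighting moves the distribution to the left, here is a minimal sketch on synthetic data. It assumes the weight is a multiplicative factor in [0, 1]; the actual weighting scheme used by the model is not shown in this notebook, so both `domain_weights` and the multiplication are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins, not real model output
raw_scores = rng.uniform(0.0, 1.0, size=1000)      # AutoPhrase-like quality scores
domain_weights = rng.uniform(0.0, 1.0, size=1000)  # hypothetical domain-relevance weights

# Assumed multiplicative weighting: a weight below 1 can only lower a score
weighted_scores = raw_scores * domain_weights

print('mean before weighting:', raw_scores.mean())
print('mean after weighting: ', weighted_scores.mean())
```

Under this assumption no score can increase, so the mean and every quantile of the weighted scores sit at or below those of the raw scores.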

### 2. Ability to give precise results compared to manual labeling

To do this, we manually annotate the weighted results, labeling whether each phrase actually represents the paper. We then compare the accuracy (the mean of the manual labels) for phrases with a quality score > 0.5, > 0.6, and > 0.7.

In [None]:
columns = ['Article',
           'Accuracy (quality score > 0.5)',
           'Accuracy (quality score > 0.6)',
           'Accuracy (quality score > 0.7)']
rows = []
for directory in dirs:
    fp = '../references/result_analysis/' + directory + '/annotation.csv'
    df = pd.read_csv(fp, index_col='Unnamed: 0')
    # Filter explicitly at each threshold so every column matches its label
    df1 = df[df['score'] > 0.5]
    df2 = df[df['score'] > 0.6]
    df3 = df[df['score'] > 0.7]
    # DataFrame.append was removed in pandas 2.0, so collect the rows first
    rows.append({'Article': directory,
                 'Accuracy (quality score > 0.5)': df1['label'].mean(),
                 'Accuracy (quality score > 0.6)': df2['label'].mean(),
                 'Accuracy (quality score > 0.7)': df3['label'].mean()})
annote_stats = pd.DataFrame(rows, columns=columns)
annote_stats

As we can see from the dataframe above, accuracy is higher for phrases with a higher quality score. 

Thus our model is able to give precise results compared to manual labeling.
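
The accuracy reported above is the share of phrases above a score threshold that were manually labeled as representative. The sketch below makes that computation explicit on toy data. The helper `accuracy_above` is hypothetical, and it assumes `label` is a binary column (1 = representative, 0 = not).

```python
import pandas as pd

def accuracy_above(df, threshold):
    """Share of phrases with score > threshold whose manual label is 1."""
    kept = df[df['score'] > threshold]
    # Return NaN when no phrase passes the threshold
    return kept['label'].mean() if len(kept) else float('nan')

# Toy annotations (illustrative values only)
toy = pd.DataFrame({
    'score': [0.95, 0.85, 0.75, 0.65, 0.55],
    'label': [1, 1, 1, 0, 0],
})
for t in (0.5, 0.6, 0.7):
    print(t, accuracy_above(toy, t))
```

Because only phrases above the threshold are counted, this quantity is the precision at that threshold. It says nothing about representative phrases that fall below the cutoff.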

The precision-recall curves are plotted as follows:

In [None]:
plt.figure(figsize=(10, 8))
for directory in dirs:
    fp = '../references/result_analysis/' + directory + '/annotation.csv'
    sample = pd.read_csv(fp, index_col='Unnamed: 0')
    # Pass labels and scores positionally; recent scikit-learn releases
    # renamed the probas_pred keyword to y_score.
    precision, recall, thresholds = precision_recall_curve(
        sample['label'],
        sample['score'])
    plt.plot(recall, precision, scalex=False, scaley=False, label=directory)
plt.title('Precision-Recall Curve', fontsize=25)
plt.xlabel('Recall', fontsize=20)
plt.ylabel('Precision', fontsize=20)
plt.legend(loc="best", fontsize=15)
plt.tick_params(axis='both',labelsize=15)

# plt.savefig('../data/report/Precision-Recall Curve.png')
plt.show()
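
A precision-recall curve can be summarized by a single number, which makes the five papers easier to compare. The sketch below is not part of the original analysis; it uses scikit-learn's `average_precision_score` on toy labels and scores.

```python
from sklearn.metrics import average_precision_score

# Toy annotations, ordered by descending quality score (illustrative only)
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

# Average precision: precision at each positive, weighted by the recall gained
ap = average_precision_score(labels, scores)
print('average precision:', ap)
```

In this notebook, the same call could be applied per paper as `average_precision_score(sample['label'], sample['score'])` inside the plotting loop.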