# Generating a reliable set of RecSys Paper Ids
The main focus of this notebook is to generate a list of Semantic Scholar (SS) paper ids for recommender systems papers. We are not interested in collecting the paper records themselves, which is the job of the next notebook, but rather the ids that can later be used to look up the paper records on SS.

To do this we will use two methods:
1. Using conference and journal paper data collected from DBLP, we identify the DBLP papers that come from recommender systems venues (RecSys, for example).
2. We search DBLP titles and venues for recommender-systems-like phrases.

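Once we have a suitable set of DBLP papers, we can identify those that have DOIs and use these to collect paper ids from SS. We also search SS directly with the same RecSys phrases and merge the validated results with the DBLP-derived ids.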
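
The DOI-based lookup can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes the Semantic Scholar convention of addressing a paper by an external id of the form `DOI:<doi>` and assumes that ids are sent in batches. The helper names, the example DOIs and the batch size are all hypothetical.

```python
from itertools import islice


def to_ss_doi_id(doi):
    """Format a raw DOI as an external id (assumed 'DOI:<doi>' convention)."""
    return f"DOI:{doi.strip()}"


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical DOIs, for illustration only.
dois = ["10.1000/example.1", " 10.1000/example.2 ", "10.1000/example.3"]
doi_ids = [to_ss_doi_id(d) for d in dois]

# A real batch size would be much larger; 2 keeps the example readable.
batches = list(batched(doi_ids, size=2))
print(batches)
```

Each batch could then be sent to SS in a single request, which keeps the number of API calls low.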

Note: This notebook does not need to be executed. The file '../data/raw/1000_recsys_paper_ids_52550.feather' contains the paper ids produced by this notebook at the time of the study.

In [None]:
import os
import swifter
import json
import time
from datetime import datetime
import string 

from wordcloud import WordCloud
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

import random
import requests
from itertools import chain
from more_itertools import sliced

import pandas as pd
import numpy as np

from glob import glob, iglob
from pathlib import Path

from loguru import logger
from IPython.display import display, clear_output

from multiprocessing import Pool

import sys
sys.path.append('../../src/')
from semantic_scholar_wrapper import SS

!pwd

In [None]:
ss = SS()
ss

# Setup

In [None]:
# This notebook will produce a dataset of paper ids that will be stored in this file.
recsys_paper_ids_dataset = '../data/raw/1000_recsys_paper_ids.feather'

# It will use a previously collected dataset of DBLP journal and conference papers as seeds for this.
dblp_journals_dataset = '../data/raw/dblp_journals_with_ss_paper_ids.feather'
dblp_conferences_dataset = '../data/raw/dblp_conferences_with_ss_paper_ids.feather'




In [None]:
# The SS fields used when collecting paper and author records.

paper_fields = [
        'paperId', 'title', 'url', 'venue', 'year', 'journal', 'isOpenAccess',
        'publicationTypes', 'publicationDate',
        'referenceCount', 'citationCount', 'influentialCitationCount', 
        'fieldsOfStudy',
        'abstract',    
        'authors.authorId', 'citations.paperId',  'references.paperId',
        'externalIds'
    ]

author_fields = [
    'authorId' ,'externalIds' ,'name' ,'affiliations'
    ,'paperCount' ,'citationCount' ,'hIndex' ,'papers.paperId'
]

# Find RecSys Papers in DBLP

Read in and combine the DBLP datasets.

In [None]:
dblp_journals_df = pd.read_feather(dblp_journals_dataset)
dblp_journals_df.shape, dblp_journals_df['ss_paperId'].nunique()

In [None]:
dblp_conferences_df = pd.read_feather(dblp_conferences_dataset)
dblp_conferences_df.shape, dblp_conferences_df['ss_paperId'].nunique()

In [None]:
dblp_df = pd.concat([dblp_journals_df, dblp_conferences_df], ignore_index=True)

dblp_df.shape, dblp_df['ss_paperId'].nunique()

We only need the unique paper ids, so drop rows without an SS paper id and then drop duplicates.
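
As a small illustration of this step on toy data (the titles and ids are made up; only the `ss_paperId` column name is taken from this notebook):

```python
import pandas as pd

# Toy data: one duplicated id and one row without an id.
toy_df = pd.DataFrame({
    'dblp_title': ['Paper A', 'Paper A (journal version)', 'Paper B', 'Paper C'],
    'ss_paperId': ['id1', 'id1', None, 'id2'],
})

deduped_df = (
    toy_df[toy_df['ss_paperId'].notnull()]      # drop rows without an id
    .drop_duplicates(subset=['ss_paperId'])     # keep the first row for each id
    .copy()
)
print(deduped_df)
```

Note that `drop_duplicates` keeps the first occurrence by default, so which DBLP record survives for a given id depends on row order.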

In [None]:
dblp_df = dblp_df[dblp_df['ss_paperId'].notnull()].drop_duplicates(subset=['ss_paperId']).copy()
dblp_df

Combine the text columns below into a single lowercase string and remove punctuation. This makes it easy to look up key RecSys terms later.
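
To make the effect concrete, here is a minimal sketch on a single hypothetical row. It uses the same punctuation set as the `remove_punctuation` function in this section; the row values are invented, and `None` stands in for a missing column.

```python
def remove_punctuation(text):
    # ASCII punctuation without the hyphen, plus the curly apostrophe.
    punctuation = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~' + "’"
    return text.translate(str.maketrans('', '', punctuation))


# Hypothetical row values; None represents a missing column.
row = ['User-Item Graphs: A Survey', None, 'Proc. RecSys']

# Skip missing values, join with spaces, lowercase, then strip punctuation.
text = remove_punctuation(' '.join(str(v) for v in row if v is not None).lower())
print(text)
```

The hyphen is kept on purpose, so a term such as `user-item` can still be matched as a phrase.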

In [None]:

text_cols = [
    'dblp_title', 
    'dblp_journal_name', 'dblp_booktitle', 
    'dblp_conference_name', 'dblp_proceedings_publisher',
]


def remove_punctuation(text):

    # string.punctuation without the hyphen (so terms like 'user-item' survive), plus the curly apostrophe
    punctuation = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~' + "’"

    # Create a translation table mapping punctuation characters to None
    translator = str.maketrans('', '', punctuation)
    
    # Remove punctuation using translate method
    return text.translate(translator)


dblp_df['text'] = (
    dblp_df[text_cols]
    .swifter
    .apply(lambda row: remove_punctuation(' '.join(row.dropna().map(str)).lower()), axis=1)
)

dblp_df

## RecSys DBLP Keys

Let's identify the DBLP keys that are associated with the main RecSys venues. These include ACM RecSys, ToRS and several long-running workshop series.
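
The matching is a plain substring test on the DBLP key. A minimal sketch with hypothetical keys (the key suffixes are invented for illustration):

```python
# Hypothetical DBLP keys, for illustration only.
example_keys = [
    'conf/recsys/Example20',
    'journals/tors/Example21',
    'conf/hr-recsys/Example22',
    'conf/sigir/Example19',
]
venue_fragments = ['recsys', 'journals/tors']

# A key matches if any venue fragment occurs anywhere inside it.
matches = {
    key: any(fragment in key for fragment in venue_fragments)
    for key in example_keys
}
print(matches)
```

Because the bare fragment `recsys` matches anywhere in the key, it also covers keys such as `conf/hr-recsys`.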

In [None]:
recsys_keys = [
    'recsys', 'conf/hr-recsys', 'conf/orsum', 'conf/normalize', 'conf/behavrec', 'conf/inra', 'conf/intrs',
    'conf/kars', 'conf/leri', 'conf/rectour',  
    'journals/tors', 
]

def contains_phrases(text, phrases):
    """Return True if any of the phrases occurs as a substring of text."""
    return any(phrase in text for phrase in phrases)

with_recsys_key = dblp_df['dblp_key'].map(lambda text: contains_phrases(text, recsys_keys))
with_recsys_key.sum()

## RecSys Phrases & Queries

Next we define key RecSys phrases and identify the papers whose text contains them.
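
A short sketch of how the phrases behave, using a subset of the queries and invented titles. Because matching is by substring, a stem such as `collaborative filter` also matches "collaborative filtering".

```python
queries = ['"recommender system"', '"collaborative filter"', '"user-item"', 'movielens']

# Strip the quotes used for exact-match search and add a broader term.
phrases = [q.replace('"', '') for q in queries] + ['recommender']


def contains_phrases(text, phrases):
    return any(phrase in text for phrase in phrases)


# Invented, already lowercased titles.
titles = [
    'a survey of collaborative filtering techniques',  # matches 'collaborative filter'
    'explaining recommenders to end users',            # matches 'recommender'
    'query expansion for web search',                  # no match
]
flags = [contains_phrases(title, phrases) for title in titles]
print(flags)
```

The trade-off of substring matching is recall over precision: short phrases catch more variants but can also match unrelated text.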

In [None]:
# These are the queries we will use for SS search; note the quotations for exact match search.
recsys_queries = [
    '"recommender system"', '"recommendation system"', 
    '"collaborative filter"', '"collaborative recommend"',
    '"social information filter"', '"collaborative information filter"',
    '"user-item"',
    'recsys', 'grouplens', 'movielens', '"netflix prize"',
]

# To check DBLP titles we don't need the quotes, and we add 'recommender'
recsys_phrases = [q.replace('"', '') for q in recsys_queries] + ['recommender']

with_recsys_phrase = dblp_df['text'].swifter.apply(lambda text: contains_phrases(text, phrases=recsys_phrases))

with_recsys_phrase.sum()

## Combine RecSys Papers from DBLP

Focus on any papers that come from one of the main RecSys venues or contain a RecSys phrase.

In [None]:
recsys_dblp_paper_df = dblp_df[(with_recsys_key | with_recsys_phrase)]
recsys_dblp_paper_df.shape

Get the unique paper ids; these are the ids that Semantic Scholar uses for its API.

In [None]:
recsys_dblp_paper_ids = list(recsys_dblp_paper_df['ss_paperId'].unique())
len(recsys_dblp_paper_ids), recsys_dblp_paper_ids[:3]

# RecSys Papers on SS
Next, search Semantic Scholar using the RecSys queries/phrases defined earlier.

## Search SS using RecSys queries and combine results into a dataframe
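
`ss.bulk_paper_search` belongs to this project's own wrapper and its internals are not shown in this notebook. As a rough sketch of what a token-paginated bulk search loop can look like, the example below assumes each response is a dict with a `data` list and an optional continuation `token`. The names `bulk_search`, `fetch_page` and `fake_fetch_page` are hypothetical, and the in-memory pages stand in for API responses so that the example runs without network access.

```python
import time


def bulk_search(query, fetch_page, sleep=0.0):
    """Collect every result for `query` by following continuation tokens."""
    results, token = [], None
    while True:
        page = fetch_page(query, token)        # one request per page
        results.extend(page.get('data', []))
        token = page.get('token')
        if not token:                          # no token: last page reached
            return results
        time.sleep(sleep)                      # pause between requests


# Hypothetical in-memory pages standing in for API responses.
_pages = {
    None: {'data': [{'paperId': 'p1'}, {'paperId': 'p2'}], 'token': 't1'},
    't1': {'data': [{'paperId': 'p3'}], 'token': None},
}


def fake_fetch_page(query, token):
    return _pages[token]


papers = bulk_search('"recommender system"', fake_fetch_page)
print([p['paperId'] for p in papers])
```

Passing the page-fetching function in as an argument keeps the pagination logic testable without calling the live API.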

In [None]:
search_fields = ['title', 'abstract', 'venue', 'year']

search_results = []

for recsys_query in recsys_queries:
    clear_output()
    logger.info(recsys_query)
    search_results.append(ss.bulk_paper_search(recsys_query, fields=search_fields, sleep=1))

search_results = list(chain.from_iterable(search_results))

len(search_results), search_results[:3]

Combine the search results into a dataframe.

In [None]:
search_results_df = pd.DataFrame(search_results)

# Combine the title, abstract and venue into one lowercase, space-separated text column (missing values become '').
text_parts = search_results_df[['title', 'abstract', 'venue']].fillna('')
search_results_df['text'] = text_parts.agg(' '.join, axis=1).str.lower().map(remove_punctuation)

search_results_df

## Validate the Search Results
Check whether the `text` column actually contains one of the RecSys queries; if it does, it is considered a valid search result.
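
The idea can be sketched on two invented search hits. In this sketch, missing fields are treated as empty strings and matching is done on lowercase text; the records and the `is_valid` helper are hypothetical.

```python
# Hypothetical search hits, for illustration only.
hits = [
    {'paperId': 'p1', 'title': 'Deep Collaborative Filtering', 'abstract': None},
    {'paperId': 'p2', 'title': 'A System for Crop Recommendation', 'abstract': 'Soil analysis.'},
]
phrases = ['recommender system', 'collaborative filter']


def is_valid(hit):
    # Missing fields become empty strings; matching uses lowercase text.
    text = ' '.join(hit.get(field) or '' for field in ('title', 'abstract')).lower()
    return any(phrase in text for phrase in phrases)


valid_ids = [hit['paperId'] for hit in hits if is_valid(hit)]
print(valid_ids)
```

The second hit shares individual words with the queries but none of the exact phrases, which is the kind of loose match this check is meant to filter out.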

In [None]:
search_results_df['is_valid'] = search_results_df['text'].map(
    lambda text: contains_phrases(text, [q.replace('"', '') for q in recsys_queries])
)

search_results_df['is_valid'].sum()

In [None]:
valid_search_results_df = search_results_df[search_results_df['is_valid']]
valid_search_results_df.shape, valid_search_results_df['title'].sample(20).values

In [None]:
recsys_valid_search_paper_ids = list(valid_search_results_df['paperId'].unique())
len(recsys_valid_search_paper_ids)

# Prepare the Dataset of Initial RecSys IDs

## Combine DBLP and SS RecSys Papers
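
Taking the set union removes ids found by both routes. A minimal sketch with made-up ids (`sorted` is used here only to make the printed order deterministic):

```python
dblp_ids = ['a1', 'b2', 'c3']   # hypothetical ids found via DBLP
search_ids = ['c3', 'd4']       # hypothetical ids found via SS search

# 'c3' appears in both lists but only once in the union.
combined_ids = sorted(set(dblp_ids).union(search_ids))
print(combined_ids)
```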

In [None]:
recsys_paper_ids_df = pd.DataFrame(list(set(recsys_dblp_paper_ids).union(recsys_valid_search_paper_ids)), columns=['paperId'])
recsys_paper_ids_df

## Save RecSys Paper Ids

In [None]:
recsys_paper_ids_df.to_feather(recsys_paper_ids_dataset)

# Show where the ids were saved and how many there are.
recsys_paper_ids_dataset, len(recsys_paper_ids_df)