# Build the RecSys SS Dataset
The objective of this notebook is to build the main dataset for the universe of RS papers. The starting point is a set of seed paper ids in 1000_recsys_paper_ids_52550.feather, which contains a candidate set of paper ids that are considered to be RS papers.

These papers are used to generate a list of author papers (papers published by the authors of these seed RS papers) and a list of linked papers: papers that cite, or are cited by, the RS papers.

Each of these ids is used to generate the seed papers dataframe -- a dataframe of SS paper records -- and from this we generate a set of author ids and use these to create an author dataframe.

All of this data is collected using various API calls to collect papers, authors and citations, and it requires several hours to run. Care is taken to ensure that paper records contain a minimal amount of information -- some SS records appear to be little more than stubs with very limited data -- and we also make efforts to deal with some missing information, such as missing citations, by calling the API again. All of this is to say that this data collection process will be imperfect but sufficient for the purpose of the analysis we wish to carry out.

_**Note from future self:** With the benefit of hindsight I would strongly recommend performing this type of large-scale data collection task using the Semantic Scholar Datasets API because it provides a more direct route to collecting the data using streamed json files, rather than relying on http requests. That said, the code that is provided here was used to collect the original data, but be warned it will take >24 hours to run because of API rate limiting._
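To illustrate the streamed-file route mentioned in the note, here is a minimal sketch of filtering a downloaded file for a set of ids. It assumes the file is gzip-compressed JSON lines (one record per line) and that the id lives under a `paperId` key; both the layout and the key name are assumptions for illustration, and `stream_matching_records` is a hypothetical helper, not part of the notebook or of any Semantic Scholar client.

```python
import gzip
import json


def stream_matching_records(jsonl_gz_path, wanted_ids, id_field="paperId"):
    """Yield records from a gzipped JSON-lines file whose id is in wanted_ids.

    The id_field default is an assumption; check the real files for the key name.
    """
    wanted = set(wanted_ids)
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if record.get(id_field) in wanted:
                yield record
```

Because the file is read line by line, memory use stays flat regardless of file size, which is the main attraction over issuing one http request per batch of ids.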

In [None]:
import swifter
import Stemmer

import os
import json
import time
from datetime import datetime
import string 

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

import nltk
nltk.download('stopwords')

import matplotlib.pyplot as plt

import random
import requests
from itertools import chain
from more_itertools import sliced

import pandas as pd
import numpy as np

from glob import glob, iglob
from pathlib import Path

from loguru import logger
from IPython.display import display, clear_output

from multiprocessing import Pool

import sys
sys.path.append('../../src/')
from semantic_scholar_wrapper import SS

!pwd

In [None]:
ss = SS(max_attempts=6)
ss

# Setup
Various dataset filenames that will be used/produced by this notebook.

In [None]:
# This is the original set of recsys paper ids used as the starting point for the analysis
# The notebook 1000_build_recsys_paper_ids_dataset will recreate this list but it may not produce the same 
# set of ids at runtime.
recsys_paper_ids_dataset = '../data/raw/1000_recsys_paper_ids_52550.feather'

# These datasets are produced by this notebook
recsys_seed_papers_dataset = '../data/raw/2000_recsys_seed_papers.feather'
recsys_seed_authors_dataset = '../data/raw/2000_recsys_seed_authors.feather'

recsys_linked_papers_dataset = '../data/raw/2000_recsys_linked_papers.feather'

recsys_papers_dataset = '../data/raw/2000_recsys_papers.feather'
recsys_authors_dataset = '../data/raw/2000_recsys_authors.feather'


# Load Paper Ids
These are the seed RS paper ids identified in an earlier notebook.

In [None]:
recsys_seed_ids_df = pd.read_feather(recsys_paper_ids_dataset)
recsys_seed_ids_df

In [None]:
seed_paper_ids = list(recsys_seed_ids_df['paperId'])
len(seed_paper_ids)

# Get SS Seed Papers
We have the paper ids, now collect the paper records.

In [None]:
paper_fields = [
        'paperId', 'title', 'url', 'venue', 'year', 'journal', 'isOpenAccess',
        'publicationTypes', 'publicationDate',
        'referenceCount', 'citationCount', 'influentialCitationCount', 
        'fieldsOfStudy',
        'abstract',    
        'authors.authorId', 'citations.paperId',  'references.paperId',
        'externalIds',
        'citationStyles',
    ]

author_fields = [
    'authorId' ,'externalIds' ,'name' ,'affiliations'
    ,'paperCount' ,'citationCount' ,'hIndex' ,'papers.paperId'
]

In [None]:
recsys_seed_papers = ss.get_papers_in_batches(
    seed_paper_ids,
    fields=paper_fields,
    pool_size=10
)

recsys_seed_papers_df = ss.items_to_dataframe(recsys_seed_papers)
recsys_seed_papers_df.to_feather(recsys_seed_papers_dataset.format(len(recsys_seed_papers_df)))

recsys_seed_papers_df    

## Validate Seed Papers
To be valid a paper must be sufficiently complete: it must have a title longer than the stub title 'preface.', a year between 1950 and 2024, and at least one author.
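As a plain-Python illustration of the same rule, applied to a single record rather than a dataframe, something like the following would do. The function name `is_valid_paper` and the toy records are invented for this sketch; the thresholds are the defaults used by `validate_papers`.

```python
def is_valid_paper(paper, min_title_length=len("preface."), valid_years=(1950, 2024)):
    """Return True when a paper record has a usable title, year and author list."""
    title = paper.get("title")
    year = paper.get("year")
    authors = paper.get("authors") or []

    has_valid_title = isinstance(title, str) and len(title) > min_title_length
    # Both ends of the year range are inclusive.
    has_valid_year = year is not None and valid_years[0] <= year <= valid_years[1]
    has_valid_authors = len(authors) > 0
    return has_valid_title and has_valid_year and has_valid_authors


# Toy records, made up for illustration only.
toy_papers = [
    {"title": "Item-based collaborative filtering", "year": 2001, "authors": ["a1"]},
    {"title": "Preface.", "year": 2010, "authors": ["a2"]},                  # stub title
    {"title": "A survey of recommenders", "year": None, "authors": ["a3"]},  # no year
    {"title": "Matrix factorisation notes", "year": 2015, "authors": []},    # no authors
]

valid_titles = [p["title"] for p in toy_papers if is_valid_paper(p)]
print(valid_titles)
```

Only the first toy record passes: the title check is a strict "longer than", so a title of exactly eight characters such as "Preface." is rejected.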

In [None]:
def validate_papers(papers_df, min_title_length = len('preface.'), valid_years = (1950, 2024)):

    # .str.len() yields NaN for a missing title, so it compares as False instead of raising in len().
    has_valid_title = papers_df['title'].str.len() > min_title_length
    has_valid_year = papers_df['year'].between(*valid_years)
    has_valid_authors = papers_df['authors'].map(len)>0

    is_valid_seed_paper = has_valid_title & has_valid_year & has_valid_authors

    return papers_df[is_valid_seed_paper].copy()

recsys_seed_papers_df = validate_papers(recsys_seed_papers_df)
recsys_seed_papers_df

## Get Seed Authors
For each of these valid seed papers we next use the SS API to get information about their authors. This will be used to generate lists of all papers published by RS authors.

In [None]:
recsys_seed_author_ids = list(recsys_seed_papers_df['authors'].explode().dropna().unique())
len(recsys_seed_author_ids)

In [None]:
recsys_seed_authors = ss.get_authors_in_batches(
    recsys_seed_author_ids,
    fields=author_fields,
    pool_size=4
)

recsys_seed_authors_df = ss.items_to_dataframe(recsys_seed_authors)
recsys_seed_authors_df.to_feather(recsys_seed_authors_dataset.format(len(recsys_seed_authors_df)))
recsys_seed_authors_df

# Get Linked Papers
Similarly, for each of the seed papers we can generate lists of papers that are cited by or that cite these seed papers.
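The id lists used in this section all come from flattening a column of per-paper lists into one list of unique ids. A minimal plain-Python sketch of that flattening, roughly what `.explode().dropna().unique()` does on the dataframe, might look like this; `unique_ids` and the toy ids are hypothetical.

```python
def unique_ids(id_lists):
    """Flatten a sequence of id lists into unique ids, keeping first-seen order."""
    seen = {}
    for ids in id_lists:
        for paper_id in ids or []:      # a missing list contributes nothing
            if paper_id is not None:    # drop missing ids
                seen.setdefault(paper_id, None)
    return list(seen)


# Toy citation lists for three papers; the second paper has no list at all.
toy_citations = [["p1", "p2"], None, ["p2", None, "p3"]]
flattened = unique_ids(toy_citations)
print(flattened)
```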

## Seed Cites/Refs

In [None]:
recsys_seed_citation_ids = list(recsys_seed_papers_df['citations'].explode().dropna().unique())
len(recsys_seed_citation_ids)

In [None]:
recsys_seed_reference_ids = list(recsys_seed_papers_df['references'].explode().dropna().unique())
len(recsys_seed_reference_ids)

## Get Seed Author Publications
Collect the ids of every paper published by the seed authors.

In [None]:
recsys_seed_author_pub_ids = list(recsys_seed_authors_df['papers'].explode().dropna().unique())
len(recsys_seed_author_pub_ids)

## Get the linked papers in batches
We have the paper ids for the author pubs, citations and references; now we need to get the paper records for them. It's a long process so we do this in parts, each saved to its own file, so that if it terminates prematurely it is easier to restart from where we left off. Not ideal but good enough.
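The part-and-restart idea can be sketched without any API calls. The helper below is a hypothetical stand-in that only shows how ids are sliced into numbered parts and how a restart skips the parts that were already saved; the ids are toy values.

```python
def iter_parts(ids, part_size, restart_at_part=0):
    """Yield (part_number, ids_in_part), skipping parts before restart_at_part."""
    for part_number, start in enumerate(range(0, len(ids), part_size)):
        if part_number >= restart_at_part:
            yield part_number, ids[start:start + part_size]


toy_ids = [f"paper-{i}" for i in range(7)]

# A full run visits parts 0, 1 and 2.
all_parts = list(iter_parts(toy_ids, part_size=3))

# After a crash during part 2, resume there; parts 0 and 1 are already on disk.
resumed_parts = list(iter_parts(toy_ids, part_size=3, restart_at_part=2))
print(resumed_parts)
```

One thing to watch when restarting: a resumed run only knows about the parts it wrote itself, so the filenames of earlier parts have to be added back (for example by listing the parts directory) before the parts are assembled.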

In [None]:
def crawl_in_parts(
    ids, part_filename_template, 
    get_items_fn=ss.get_papers_in_batches, item_fields=paper_fields, 
    batch_size=500, pool_size=10, part_size=500_000, restart_at_part=0
):

    ids_in_parts = list(sliced(ids, part_size))

    part_filenames = []
    
    for i, part in enumerate(ids_in_parts[restart_at_part:]):
    
        logger.info((i+restart_at_part, len(part), part_filename_template.format(i+restart_at_part)))
    
        # Get the next group of papers
        items = get_items_fn(
            part, fields=item_fields, 
            batch_size=batch_size, pool_size=pool_size
        )
    
        # Convert to a df
        items_df = ss.items_to_dataframe(items, pool_size=24)
    
        # Save the df
        part_filename = part_filename_template.format(i+restart_at_part)
        items_df.to_feather(part_filename)
        part_filenames.append(part_filename)
    
        # Free up the memory
        del items_df

    return part_filenames
    

def assemble_parts(part_filenames, pool_size=10):

    with Pool(pool_size) as p:

        dfs = p.map(pd.read_feather, part_filenames)

    return pd.concat(dfs, ignore_index=True)



Combine the cites, refs and author pubs to produce a large list of papers ids for which we need paper records.

In [None]:
linked_paper_ids = list(set(recsys_seed_citation_ids).union(set(recsys_seed_reference_ids)).union(recsys_seed_author_pub_ids))
len(linked_paper_ids)

In [None]:
part_filenames = crawl_in_parts(
    linked_paper_ids, 
    '../data/raw/parts/2000_linked_papers_part_{}.feather',  # one placeholder: the part number
    pool_size=15
)
part_filenames


In [None]:
recsys_linked_papers_df = assemble_parts(part_filenames)
recsys_linked_papers_df

In [None]:
recsys_linked_papers_df = validate_papers(recsys_linked_papers_df)
recsys_linked_papers_df

In [None]:
recsys_linked_papers_df.to_feather(recsys_linked_papers_dataset)
recsys_linked_papers_df

# Create RecSys Papers Dataset
Produce the main RS dataframe by combining the seed papers and linked papers (including author pubs). We add some additional columns which may be useful later.
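The combination is a concatenation followed by de-duplication on `paperId`. Since pandas `drop_duplicates` keeps the first occurrence by default, a paper that appears in both sets keeps its seed record. A plain-Python sketch of that behaviour, with a hypothetical `combine_records` helper and toy records:

```python
def combine_records(seed_records, linked_records, key="paperId"):
    """Concatenate two record lists and keep the first record seen for each key."""
    combined = {}
    for record in list(seed_records) + list(linked_records):
        combined.setdefault(record[key], record)
    return list(combined.values())


# Toy records; the "source" field exists only to show which copy survives.
seed = [{"paperId": "p1", "source": "seed"}]
linked = [{"paperId": "p1", "source": "linked"}, {"paperId": "p2", "source": "linked"}]

papers = combine_records(seed, linked)
print(papers)
```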

## Combine Seed & Linked DFs

In [None]:
recsys_papers_df = (
    pd
    .concat([recsys_seed_papers_df, recsys_linked_papers_df], ignore_index=True)
    .drop_duplicates(subset=['paperId'])
)

recsys_papers_df

## Some Additional Cols
Strictly speaking these additional columns should have been added in a later notebook to decouple this aspect from the data collection.

In [None]:
# Fix the Fields of Study so that they always contain lists; this will simplify things later.
recsys_papers_df['fieldsOfStudy'] = recsys_papers_df['fieldsOfStudy'].map(
    lambda f: list(f) if isinstance(f, (np.ndarray, list)) else []
)

recsys_papers_df

In [None]:
recsys_papers_df['authorCount'] = recsys_papers_df['authors'].map(len)
recsys_papers_df

### Paper source indicator cols

In [None]:
# Indicator cols --this helps us understand where the paper came from (seed, citation, ref, author pub).
recsys_papers_df['is_seed'] = recsys_papers_df['paperId'].isin(seed_paper_ids)
recsys_papers_df['is_seed_citation'] = recsys_papers_df['paperId'].isin(recsys_seed_citation_ids)
recsys_papers_df['is_seed_reference'] = recsys_papers_df['paperId'].isin(recsys_seed_reference_ids)
recsys_papers_df['is_seed_author_pub'] = recsys_papers_df['paperId'].isin(recsys_seed_author_pub_ids)

recsys_papers_df

In [None]:
# There are some (<40) papers whose ids do not appear in our seed or linked lists.
# These are ids that map to some other id by SS. We remove them because they cannot be tracked.
recsys_papers_df = recsys_papers_df[recsys_papers_df.filter(like='is_seed').any(axis=1)].copy()
recsys_papers_df

### Combining title & abstract; removing punctuation
Add a `text` field based on a normalised version of the title, abstract and venue.
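To make the normalisation concrete, here is a standalone sketch that applies the same steps to a single record: join the three fields, lowercase, then strip punctuation while keeping hyphens. `normalise_text` is a hypothetical helper and the sample strings are invented.

```python
def normalise_text(title, abstract, venue):
    """Join title, abstract and venue, lowercase, and strip punctuation (hyphens kept)."""
    # Standard ASCII punctuation minus the hyphen, plus the typographic apostrophe.
    punctuation = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~' + "’"
    translator = str.maketrans('', '', punctuation)

    # Missing fields are treated as empty strings.
    parts = [p if isinstance(p, str) else '' for p in (title, abstract, venue)]
    return ' '.join(parts).lower().translate(translator)


sample = normalise_text(
    "User-Item Graphs: A Survey",
    "We review graph-based methods.",
    "RecSys '21",
)
print(sample)
```

Keeping the hyphen matters later on, because phrases such as `user-item` are matched against this text as literal substrings.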

In [None]:
def remove_punctuation(text):

    # string.punctuation without the hyphen, plus the typographic apostrophe (’)
    punctuation = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~' + "’"

    # Create a translation table mapping punctuation characters to None
    translator = str.maketrans('', '', punctuation)
    
    # Remove punctuation using translate method
    return text.translate(translator)


# Combine title, abstract and venue; lowercase and remove punctuation.
recsys_papers_df['text'] = (
    recsys_papers_df['title'].map(lambda s: s if type(s) is str else '') \
    + ' ' +\
    recsys_papers_df['abstract'].map(lambda s: s if type(s) is str else '') \
    + ' ' +\
    recsys_papers_df['venue'].map(lambda s: s if type(s) is str else '')

).str.lower().map(remove_punctuation)

recsys_papers_df

### Stemming
Some stemmed versions of the title/abstract text. May use this later.

In [None]:
stop_words = set(stopwords.words('english'))
stemmer = Stemmer.Stemmer('english')

def remove_stop_words(text, stop_words=stop_words):
    word_list = text.split()
    return ' '.join([word for word in word_list if word not in stop_words])

recsys_papers_df['text_without_stop_words'] = recsys_papers_df['text'].swifter.apply(remove_stop_words)


def stem_words(text):
    word_list = text.split()
    return ' '.join(stemmer.stemWords(word_list))

recsys_papers_df['stemmed_text'] = recsys_papers_df['text'].swifter.apply(stem_words)

recsys_papers_df['stemmed_text_without_stop_words'] = recsys_papers_df['text_without_stop_words'].swifter.apply(stem_words)
    

recsys_papers_df


### Adding nGrams
Not sure if we will use these ...
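For reference, a small standalone example of what the n-gram step produces: bigrams and trigrams joined with underscores, then de-duplicated. The input string is a made-up sequence of stemmed tokens, and the result is sorted here only to make the output stable.

```python
from itertools import chain


def generate_ngrams(text, n):
    """Return the word n-grams of text, with words joined by underscores."""
    words = text.split(' ')
    return ['_'.join(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "collabor filter recommend system"  # toy stemmed text

bigrams = generate_ngrams(text, 2)
trigrams = generate_ngrams(text, 3)

# n runs over 2 and 3, matching range(2, 4).
combined = sorted(set(chain(bigrams, trigrams)))
print(combined)
```

A text with fewer than `n` words simply yields no n-grams of that size, since the range is empty.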

In [None]:
def generate_ngrams(text, n):
    words = text.split(' ')
    ngrams = []
    for i in range(len(words) - n + 1):
        ngrams.append('_'.join(words[i:i+n]))
    return ngrams
        
recsys_papers_df['ngrams_without_stop_words'] = (
    recsys_papers_df['stemmed_text_without_stop_words']
    .swifter.apply(lambda text: list(set(list(chain.from_iterable(
        [generate_ngrams(text, n) for n in range(2, 4)])))))
)

# recsys_papers_df['num_ngrams'] = recsys_papers_df['ngrams'].map(len)
recsys_papers_df['num_ngrams_without_stop_words'] = recsys_papers_df['ngrams_without_stop_words'].map(len)

recsys_papers_df

## Mark the RecSys Papers
Now we can mark the candidate recsys papers, which will include the original seed set but may also include papers that we have found in the linked and author pub sets. To do this we use the same query set that we used to identify our original set of seed papers.
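The matching is a plain substring test against the normalised text, which is worth seeing on a few examples. The sketch below uses a shortened, illustrative phrase list (`toy_phrases`) rather than the full query set, and the sample texts are invented.

```python
toy_phrases = ['recommender system', 'collaborative filter', 'user-item', 'recommender']


def contains_phrases(text, phrases=toy_phrases):
    """Return the phrases that occur as substrings of text, in list order."""
    return [phrase for phrase in phrases if phrase in text]


hybrid = contains_phrases("a hybrid recommender system for user-item graphs")
filtering = contains_phrases("collaborative filtering with matrix factorisation")
unrelated = contains_phrases("deep learning for image segmentation")

print(hybrid, filtering, unrelated)
```

Two consequences of substring matching show up here: `collaborative filter` also matches "collaborative filtering", and a text containing "recommender system" is counted twice because `recommender` matches as well, so the number of matching phrases overstates the number of distinct signals.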

In [None]:
recsys_queries = [
    '"recommender system"', '"recommendation system"', 
    '"collaborative filter"', '"collaborative recommend"',
    '"social information filter"', '"collaborative information filter"',
    '"user-item"',
    'recsys', 'grouplens', 'movielens', '"netflix prize"',
]

# To check DBLP titles we don't need the quotes and we will add 'recommender'
recsys_phrases = [q.replace('"', '') for q in recsys_queries] + ['recommender']


# Check which papers contain these phrases.
def contains_phrases(text, phrases=recsys_phrases):
    found_phrases = [phrase for phrase in phrases if phrase in text]

    return found_phrases

contains_recsys_phrases = recsys_papers_df['text'].swifter.apply(contains_phrases)

# The matching phrases for each paper.
recsys_papers_df['matching_recsys_phrases'] = contains_recsys_phrases

# The number of these matching phrases.
recsys_papers_df['num_recsys_phrases'] = contains_recsys_phrases.map(len)

# A paper is a recsys paper if it matches at least one recsys phrase; this is likely too weak.
recsys_papers_df['is_recsys_paper'] = recsys_papers_df['num_recsys_phrases']>0

recsys_papers_df.shape, recsys_papers_df['is_seed'].sum(), recsys_papers_df['is_recsys_paper'].sum()

# Final Set of Authors
Since we have added a few more recsys papers we need to update the authors too.

## Identify & Collect Missing Authors

In [None]:
all_recsys_author_ids = set(
    recsys_papers_df[recsys_papers_df['is_recsys_paper']]['authors']
    .explode().dropna().unique()
)

missing_recsys_author_ids = list(all_recsys_author_ids.difference(set(recsys_seed_authors_df['authorId'].unique())))
len(missing_recsys_author_ids)

In [None]:
missing_recsys_authors = ss.get_authors_in_batches(
    missing_recsys_author_ids,
    fields=author_fields,
    pool_size=4
)

missing_recsys_authors_df = ss.items_to_dataframe(missing_recsys_authors)
missing_recsys_authors_df

## Combine with the seed authors

In [None]:
recsys_authors_df = (
    pd
    .concat([recsys_seed_authors_df, missing_recsys_authors_df], ignore_index=True)
    .drop_duplicates(subset=['authorId'])
)

recsys_authors_df

## Sort author publications by year
I think that the author pubs may already be sorted by year but here we make sure out of an abundance of caution. A slight hitch is that we cannot guarantee that every publication id is in our dataset because SS does not find everything even when it seems to have a valid id. Fortunately, >80% of pub ids are in the dataset and so we can get their year and sort accordingly. This should be sufficient for whatever we want to do, if anything, with this sorted publication data.
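The two steps, filtering to ids we actually hold and then sorting by year, can be sketched with a plain dictionary standing in for the indexed dataframe. The ids and years below are toy values.

```python
# Year lookup for the papers we hold; "p3" is deliberately absent.
year_by_paper_id = {"p1": 2019, "p2": 2012, "p4": 2016}

author_paper_ids = ["p1", "p2", "p3", "p4"]

# Keep only the ids that can be looked up, then sort them by publication year.
available_papers = [pid for pid in author_paper_ids if pid in year_by_paper_id]
sorted_papers = sorted(available_papers, key=year_by_paper_id.get)

# Fraction of this author's publication ids that we could resolve.
coverage = len(available_papers) / len(author_paper_ids)
print(sorted_papers, coverage)
```

Python's `sorted` is stable, so papers from the same year keep the relative order they had in the author's original list.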

In [None]:
# Create an indexed version of recsys_papers_df to speedup the lookups
recsys_papers_df_indexed_by_id = recsys_papers_df.set_index('paperId')

# The papers that are in our papers df; This is just over 80% of the ids which 
# should be sufficient.
recsys_authors_df['available_papers'] = recsys_authors_df['papers'].swifter.apply(
    lambda ids: [id for id in ids if id in recsys_papers_df_indexed_by_id.index]
)

# Sort these paper ids by year.
recsys_authors_df['sorted_papers'] = (
    recsys_authors_df['available_papers']
    .swifter
    .apply(
        lambda ids: sorted(
            ids, 
            key=lambda id: recsys_papers_df_indexed_by_id.loc[id]['year']
        )
    )
)

recsys_authors_df

# Scrape as many missing citations as we can ...
Turns out the citation lists are hit and miss. So for now we can do another crawl to get the citations for each paper. We don't want to do this for every paper though. Just the ones with lots of missing cites.

## The papers with enough missing citations to recrawl
We focus in on papers where we have fewer than 95% of their citation count and where this is more than 5 missing cites. This avoids the need to crawl papers with very low citation counts. I tried using `get_citations` for this but it is way too slow because we are limited to 1 paper per second or about 60k per day. The alternative is to use `get_papers` but to only ask for the citation info and limit the batch size to about 100 papers. This seems to get citations for 95% of the papers which is pretty decent.
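A few worked cases make the two thresholds easier to check. The function below is a per-paper sketch of the same test (hypothetical name `needs_recrawl`); papers with a citation count of zero are never selected.

```python
def needs_recrawl(num_found, citation_count, min_frac=0.95, min_missing=5):
    """True when under 95% of citations were found AND more than 5 are missing."""
    if citation_count <= 0:
        return False
    frac_found = num_found / citation_count
    num_missing = citation_count - num_found
    return frac_found < min_frac and num_missing > min_missing


cases = {
    (90, 100): needs_recrawl(90, 100),  # 90% found, 10 missing
    (96, 100): needs_recrawl(96, 100),  # 96% found, above the fraction threshold
    (8, 10): needs_recrawl(8, 10),      # 80% found, but only 2 missing
    (15, 20): needs_recrawl(15, 20),    # 75% found, exactly 5 missing
    (14, 20): needs_recrawl(14, 20),    # 70% found, 6 missing
}
print(cases)
```

Both comparisons are strict, so a paper with exactly 5 missing citations, or with exactly 95% found, is left alone.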

In [None]:
num_found_citations = recsys_papers_df['citations'].map(len)

min_frac_citations = 0.95
min_missing_citations = 5

frac_found_citations = num_found_citations/recsys_papers_df['citationCount']

num_missing_citations = recsys_papers_df['citationCount']-num_found_citations

with_missing_citations = (frac_found_citations<min_frac_citations) & (num_missing_citations>min_missing_citations)


papers_ids_with_missing_citations = list(recsys_papers_df[with_missing_citations]['paperId'].unique())
len(papers_ids_with_missing_citations), papers_ids_with_missing_citations[:3]

## Scrape the citations

In [None]:
part_filenames = crawl_in_parts(
    papers_ids_with_missing_citations, 
    '../data/raw/parts/2000_missing_citations_part_{}.feather',
    item_fields=['paperId', 'citationCount', 'citations.paperId'],
    batch_size=20,      # Trial and error suggests that this gives the best results.
    part_size=100_000,
    pool_size=4,
)
part_filenames

In [None]:
missing_citations_df = assemble_parts(part_filenames).set_index('paperId').add_prefix('scraped_')
missing_citations_df

## Add the missing citations data to the main dataframe

In [None]:
recsys_papers_df = recsys_papers_df.set_index('paperId').join(missing_citations_df, how='left').reset_index()
recsys_papers_df

# Save RecSys Datasets
Save the dataset of papers and the dataset of authors. These define the broader RS universe and contain RS specific papers/authors. They will be further refined and cleaned and used as the basis of the analysis.

In [None]:
recsys_papers_df.to_feather(recsys_papers_dataset)
recsys_papers_df.shape, recsys_papers_dataset

In [None]:
recsys_authors_df.to_feather(recsys_authors_dataset)
recsys_authors_df.shape, recsys_authors_dataset