# Introduction
At this stage we have a dataset of recsys papers containing a subset of papers about recommender systems plus a larger set of linked papers: citing papers, references, and other publications by the same authors. The core recommender systems papers are not perfect. At least some were included by accident, and the purpose of this notebook is to refine the set so that it focuses on an optimal core of on-point recommender systems papers.

We will not remove any papers from this dataset but rather try to mark them as recsys-relevant or not. The approach taken is conservative: we only consider papers to be core RS papers if there is a strong reason to do so. In particular, we will try to exclude papers that mention RS in passing, perhaps as an application area or as related work, but without evidence of a material contribution to the RS field. This is a question of balance and one could argue that the approach taken here is too conservative. That is a useful debate to have, and others can adjust the refinement process here in order to loosen the constraints. However, it should be noted that allowing more borderline papers to be considered as core RS papers will likely also increase the number of much less relevant papers.

In [None]:
import random
from glob import glob, iglob
from itertools import chain
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import swifter
from IPython.display import display, clear_output
from loguru import logger
from more_itertools import sliced

!pwd

# Setup

## Datasets and files

In [None]:
# The main dataset of papers, which includes a subset of RS papers.
recsys_papers_dataset = '../data/raw/2000_recsys_papers.feather'

# We will save this refined dataset in the `processed` data subdirectory; it is no longer strictly raw data.
refined_recsys_papers_dataset = recsys_papers_dataset.replace('raw/2000_recsys', 'processed/2100_refined_recsys')
refined_recsys_papers_dataset

## Load the main papers dataset

In [None]:
df = pd.read_feather(recsys_papers_dataset)
df

## The current subset of RS papers
These are the papers that have so far been deemed to be RS papers, within the larger collection of papers.

In [None]:
recsys_df = df[df['is_recsys_paper']].copy()
recsys_df.shape

# RecSys Filters
Various filters that we will use to determine whether a paper should be considered as a true/core RecSys paper.

## Does the paper contain a recsys dblp key?
Any paper/record that uses one of the predefined RecSys (DBLP) keys (e.g. ACM RecSys, TORS, various RecSys workshops) will be considered to be a core RS paper.

In [None]:
with_dblp_key = df['DBLP'].notnull()

recsys_keys = [
    'recsys', 'conf/hr-recsys', 'conf/orsum', 'conf/normalize', 'conf/behavrec', 'conf/inra', 'conf/intrs',
    'conf/kars', 'conf/leri', 'conf/rectour',  
    'journals/tors', 
]

df.loc[with_dblp_key, 'has_recsys_key'] = df[with_dblp_key]['DBLP'].swifter.apply(
    lambda dblp_key: any(key in dblp_key for key in recsys_keys))

# Records without a DBLP key cannot match; cast so the column is a clean boolean.
df['has_recsys_key'] = df['has_recsys_key'].fillna(False).astype(bool)

df['has_recsys_key'].sum()
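
To make the key matching concrete, here is a minimal sketch on a toy frame. The DBLP keys, the shortened `toy_recsys_keys` list, and the helper `has_any_key` are all invented for illustration, and plain `map` stands in for `swifter.apply`. A record matches if any key occurs in its DBLP key as a substring.

```python
import pandas as pd

# Hypothetical DBLP keys, invented for illustration only.
toy = pd.DataFrame({
    'DBLP': ['conf/recsys/Smith20', 'journals/tors/Lee23', 'conf/sigir/Kim19', None],
})

# A shortened, illustrative version of the key list.
toy_recsys_keys = ['recsys', 'journals/tors']

def has_any_key(dblp_key, keys=toy_recsys_keys):
    # Missing keys never match; otherwise test each key as a substring.
    return isinstance(dblp_key, str) and any(key in dblp_key for key in keys)

toy['has_recsys_key'] = toy['DBLP'].map(has_any_key)
print(toy)
```

Because matching is by substring, the bare `'recsys'` entry already covers longer keys that contain it, such as `conf/hr-recsys`.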

## Is the paper a candidate recsys paper?
We are starting to distinguish between candidate RS papers and true/core RS papers.

In [None]:
df['is_candidate_recsys_paper'] = df['is_recsys_paper']
df['is_candidate_recsys_paper'].sum()

## Is the paper in the correct time frame?
In this work we focus on 1990 - 2024. The 1990 start year is somewhat arbitrary but it suffices as a sensible starting point for RS work.

In [None]:
within_recsys_years = df['year'].between(1990, 2024)

df['is_within_recsys_years'] = within_recsys_years

df['is_within_recsys_years'].sum()
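
As a quick check of how the year filter behaves at its edges, the sketch below applies `Series.between` to a few invented years. Both bounds are inclusive by default, and a missing year is treated as outside the range.

```python
import pandas as pd

# Invented years, including the two boundary values and a missing year.
years = pd.Series([1989, 1990, 2024, 2025, None])

# between() is inclusive on both ends; comparisons with NaN are False.
within = years.between(1990, 2024)
print(within.tolist())
```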

## Does the paper contain any strong recsys phrases?
These are phrases that are strongly indicative of core RS work. They need to appear as exact matches in a paper record. Here we look for these phrases in the abstract, title, and venue texts and calculate various counts of their occurrences.

In [None]:
strong_recsys_phrases = [
    'recommender', 
    'collaborative filter', 'social information filter', 'collaborative information filter', 'social information access',
    'recsys', 'movielens', 'grouplens', 'netflix prize'
]

def check_phrases(text, phrases=strong_recsys_phrases):
    # Lower-case once, then return every phrase that occurs as a substring.
    text = text.lower()
    return [phrase for phrase in phrases if phrase in text]

df['contains_strong_recsys_phrases'] = df['text'].swifter.apply(check_phrases)
df['num_strong_recsys_phrases'] = df['contains_strong_recsys_phrases'].map(len)

df['contains_strong_recsys_phrases_in_title'] = df['title'].swifter.apply(check_phrases)
df['num_strong_recsys_phrases_in_title'] = df['contains_strong_recsys_phrases_in_title'].map(len)

df['contains_strong_recsys_phrases_in_venue'] = df['venue'].swifter.apply(check_phrases)
df['num_strong_recsys_phrases_in_venue'] = df['contains_strong_recsys_phrases_in_venue'].map(len)

(df['num_strong_recsys_phrases']>0).sum(), (df['num_strong_recsys_phrases_in_title']>0).sum(), (df['num_strong_recsys_phrases_in_venue']>0).sum()
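
The matching here is case-insensitive substring matching, so a stem such as `'collaborative filter'` also matches "collaborative filtering". A minimal sketch with invented example texts; `find_phrases` and `toy_phrases` are illustrative stand-ins for `check_phrases` and the full phrase list.

```python
# An illustrative subset of the strong phrases.
toy_phrases = ['recommender', 'collaborative filter', 'movielens']

def find_phrases(text, phrases=toy_phrases):
    # Lower-case once, then keep every phrase that occurs as a substring.
    text = text.lower()
    return [phrase for phrase in phrases if phrase in text]

# Invented example texts.
matches = find_phrases('Collaborative Filtering on the MovieLens dataset')
no_matches = find_phrases('A survey of recommendation techniques')

print(matches)
print(no_matches)
```

Note that "recommendation" does not contain "recommender", which is why the second text yields no strong phrases and is left to the weaker phrase sets.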

## Does it contain any moderate recsys phrases?
Similar to above but focusing on a weaker set of phrases; after some experimentation this set of moderate phrases turned out to be minimal.

In [None]:
moderate_recsys_phrases = [
    'recommendation system',
]

df['contains_moderate_recsys_phrases'] = df['text'].swifter.apply(lambda text: check_phrases(text, moderate_recsys_phrases))
df['num_moderate_recsys_phrases'] = df['contains_moderate_recsys_phrases'].map(len)

df['contains_moderate_recsys_phrases_in_title'] = df['title'].swifter.apply(lambda text: check_phrases(text, moderate_recsys_phrases))
df['num_moderate_recsys_phrases_in_title'] = df['contains_moderate_recsys_phrases_in_title'].map(len)

df['contains_moderate_recsys_phrases_in_venue'] = df['venue'].swifter.apply(lambda text: check_phrases(text, moderate_recsys_phrases))
df['num_moderate_recsys_phrases_in_venue'] = df['contains_moderate_recsys_phrases_in_venue'].map(len)

(df['num_moderate_recsys_phrases']>0).sum(), (df['num_moderate_recsys_phrases_in_title']>0).sum(), (df['num_moderate_recsys_phrases_in_venue']>0).sum()

## Does it contain any weak recsys phrases?
And finally the weakest phrases indicative of recommender systems.

In [None]:
weak_recsys_phrases = ['recommendation']

df['contains_weak_recsys_phrases'] = df['text'].swifter.apply(lambda text: check_phrases(text, weak_recsys_phrases))
df['num_weak_recsys_phrases'] = df['contains_weak_recsys_phrases'].map(len)

(df['num_weak_recsys_phrases']>0).sum()

## What is the max candidate recsys authorship count for the paper?
That is, how many candidate recsys papers has each author produced, and what is the maximum across the paper's authors? This is a useful feature to consider when looking at some weaker examples of RS work. If there is an author who has published plenty of core RS papers then perhaps a weaker RS paper by that author is safer to consider as a core RS paper than a similar paper from an author with little or no RS history.

In [None]:
recsys_df = df[df['is_candidate_recsys_paper']].copy()

# A mapping between papers ids and author ids
authors_by_paper = df.set_index('paperId')['authors'].explode().dropna().reset_index()
recsys_authors_by_paper = recsys_df.set_index('paperId')['authors'].explode().dropna().reset_index()

# The number of papers for each author id
num_papers_by_author = authors_by_paper.groupby('authors').size()
num_recsys_papers_by_author = recsys_authors_by_paper.groupby('authors').size()

# The authorship counts
df['authorship_counts'] = (
    df['authors']
    .swifter
    .apply(
        lambda authors: [
            num_papers_by_author.loc[author] 
            for author in authors 
            if author in num_papers_by_author.index
        ]
    )
)

df['recsys_authorship_counts'] = (
    df['authors']
    .swifter
    .apply(
        lambda authors: [
            num_recsys_papers_by_author.loc[author] 
            for author in authors 
            if author in num_recsys_papers_by_author.index
        ]
    )
)

df['min_authorship_count'] = (
    df['authorship_counts']
    .swifter
    .apply(lambda counts: min(counts) if len(counts)>0 else 0)
)

df['min_recsys_authorship_count'] = (
    df['recsys_authorship_counts']
    .swifter
    .apply(lambda counts: min(counts) if len(counts)>0 else 0)
)


df['max_authorship_count'] = (
    df['authorship_counts']
    .swifter
    .apply(lambda counts: max(counts) if len(counts)>0 else 0)
)

df['max_recsys_authorship_count'] = (
    df['recsys_authorship_counts']
    .swifter
    .apply(lambda counts: max(counts) if len(counts)>0 else 0)
)

df['sum_authorship_count'] = (
    df['authorship_counts']
    .swifter
    .apply(lambda counts: sum(counts) if len(counts)>0 else 0)
)

df['sum_recsys_authorship_count'] = (
    df['recsys_authorship_counts']
    .swifter
    .apply(lambda counts: sum(counts) if len(counts)>0 else 0)
)

(
    (df['max_authorship_count']>0).mean(), df[df['max_authorship_count']>0]['max_authorship_count'].mean(),
    (df['max_recsys_authorship_count']>0).mean(), df[df['max_recsys_authorship_count']>0]['max_recsys_authorship_count'].mean()
)
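
The authorship feature can be illustrated on a toy frame. This is a minimal sketch with hypothetical paper and author ids; it uses `value_counts` and `Series.get` instead of the `groupby`/`loc` combination above, which should give the same counts.

```python
import pandas as pd

# Toy papers with hypothetical paper and author ids.
toy = pd.DataFrame({
    'paperId': ['p1', 'p2', 'p3', 'p4'],
    'authors': [['a1', 'a2'], ['a1'], ['a2'], []],
    'is_candidate_recsys_paper': [True, True, False, False],
})

# One entry per (candidate paper, author) pair.
pairs = (
    toy[toy['is_candidate_recsys_paper']]
    .set_index('paperId')['authors']
    .explode()
    .dropna()
)

# Number of candidate papers per author id.
papers_per_author = pairs.value_counts()

# For every paper, the largest candidate-paper count among its authors.
toy['max_recsys_authorship_count'] = toy['authors'].map(
    lambda authors: max((papers_per_author.get(a, 0) for a in authors), default=0)
)
print(toy[['paperId', 'max_recsys_authorship_count']])
```

Paper `p3` is not itself a candidate, but it still receives a count of 1 because its author `a2` has one candidate paper.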

## RecSys Venue Counts
How many candidate recsys papers have been published at a paper's venue?

In [None]:
# The number of recsys papers in each venue
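# Drop the empty venue if present; errors='ignore' avoids a KeyError when it is absent.
recsys_venue_counts = recsys_df.groupby('venue')['paperId'].nunique().dropna().drop('', errors='ignore')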

df['recsys_venue_count'] = (
    df['venue']
    .swifter
    .apply(lambda v: recsys_venue_counts.loc[v] if v in recsys_venue_counts.index else 0)
)

(df['recsys_venue_count']>0).mean(), df[(df['recsys_venue_count']>0)]['recsys_venue_count'].mean()


## RecSys Linked Papers
Here we are looking at the papers that are linked to RS papers by citation, and how often this happens. I had originally looked at both refs and cites. However, many non-recsys papers attract recsys cites, which means we would risk including non-recsys papers just because recsys papers cite them. Better to focus on refs only, I think.

In [None]:
recsys_paper_ids = set(recsys_df['paperId'].unique())

# A list of references that are RS papers.
candidate_linked_papers = (
    df['references']
    .swifter
    .apply(lambda refs: [
        ref 
        for ref in refs
        if ref in recsys_paper_ids
    ])
)

len(candidate_linked_papers)


Keep only the linked candidates that are referenced more than a minimum number of times.

In [None]:

min_linked_count = 2

candidate_linked_papers_counts = candidate_linked_papers.explode().dropna().value_counts()

linked_paper_ids = set(candidate_linked_papers_counts[candidate_linked_papers_counts > min_linked_count].index)

len(candidate_linked_papers_counts), len(linked_paper_ids)

In [None]:
df['recsys_linked_papers'] = candidate_linked_papers.swifter.apply(
    lambda papers: [
        paper 
        for paper in papers 
        if paper in linked_paper_ids
    ]
)

df['num_recsys_linked_papers'] = df['recsys_linked_papers'].map(len)

(df['num_recsys_linked_papers']>0).mean(), df[df['num_recsys_linked_papers']>0]['num_recsys_linked_papers'].mean()
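
The three cells above can be condensed into a small worked example. This is a sketch on invented ids (`toy_recsys_ids`, `toy_min_linked_count` and the reference lists are hypothetical), using plain `map` rather than `swifter`.

```python
import pandas as pd

toy = pd.DataFrame({
    'paperId': ['p1', 'p2', 'p3', 'p4'],
    'references': [['r1', 'r2'], ['r1'], ['r1', 'x9'], ['r2']],
})
toy_recsys_ids = {'r1', 'r2'}   # hypothetical candidate RS paper ids
toy_min_linked_count = 2

# Keep only references that are candidate RS papers.
linked = toy['references'].map(lambda refs: [r for r in refs if r in toy_recsys_ids])

# How often each candidate RS paper is referenced across the dataset.
ref_counts = linked.explode().dropna().value_counts()

# Strictly more than the minimum, mirroring the `>` comparison in the notebook.
well_linked = set(ref_counts[ref_counts > toy_min_linked_count].index)

# Per paper, the number of references to well-linked candidate RS papers.
toy['num_recsys_linked_papers'] = linked.map(
    lambda refs: sum(r in well_linked for r in refs)
)
print(toy[['paperId', 'num_recsys_linked_papers']])
```

Here `r1` is referenced three times and survives the threshold, while `r2` is referenced exactly twice and is dropped because the comparison is strict.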

## Field of Study
The Computer Science FoS might also be a useful feature to consider in our refinement process.

In [None]:
df['fos_contains_cs'] = df['fieldsOfStudy'].swifter.apply(lambda fos: 'Computer Science' in fos)
df['fos_contains_cs'].mean()
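
The cell above presumes that every `fieldsOfStudy` value is list-like. If some records have no fields of study (an assumption about the data, not something confirmed here), a guarded variant along the following lines avoids a `TypeError`. The helper `contains_cs` and the toy values are illustrative.

```python
import pandas as pd

# Toy records; the last one has no fields of study at all.
toy = pd.DataFrame({
    'fieldsOfStudy': [['Computer Science', 'Mathematics'], ['Medicine'], None],
})

def contains_cs(fos):
    # Treat a missing value (None or NaN) as "not Computer Science".
    if fos is None or isinstance(fos, float):
        return False
    return 'Computer Science' in list(fos)

toy['fos_contains_cs'] = toy['fieldsOfStudy'].map(contains_cs)
print(toy)
```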

# The Rules for RecSys Papers
Here is where we set up the various rules to use when determining whether a paper is a core RS paper.

In [None]:
has_recsys_dblp_key = df['has_recsys_key']
is_within_recsys_years = df['is_within_recsys_years']
is_candidate_recsys_paper = df['is_candidate_recsys_paper']

num_strong_recsys_phrases = df['num_strong_recsys_phrases']
num_strong_recsys_phrases_in_title = df['num_strong_recsys_phrases_in_title']
num_strong_recsys_phrases_in_venue = df['num_strong_recsys_phrases_in_venue']

num_moderate_recsys_phrases = df['num_moderate_recsys_phrases']
num_moderate_recsys_phrases_in_title = df['num_moderate_recsys_phrases_in_title']
num_moderate_recsys_phrases_in_venue = df['num_moderate_recsys_phrases_in_venue']

num_weak_recsys_phrases = df['num_weak_recsys_phrases']
num_recsys_linked_papers = df['num_recsys_linked_papers']
max_recsys_authorship_count = df['max_recsys_authorship_count']
recsys_venue_count = df['recsys_venue_count']
fos_contains_cs = df['fos_contains_cs']

## For RecSys Candidates
Apply these rules to papers that are already considered to be RS candidates. In other words, these are papers in our main dataset that have some evidence of being RS.

In [None]:
min_evidence_count = 5

min_authorship_count = 3
min_venue_count = 3
weak_evidence_factor = 1.5

# If the abstract contains "Recommendation System" in title case then it can be
# a reliable indication of a specific recommender system.
has_system_evidence = df['abstract'].str.contains('Recommendation System', regex=False, na=False)

has_strong_evidence = (
    (num_strong_recsys_phrases>0) | 
    (num_strong_recsys_phrases_in_venue>0) |
    ((num_moderate_recsys_phrases_in_title>0) & is_within_recsys_years) |
    ((num_moderate_recsys_phrases_in_venue>0) & is_within_recsys_years) | 
    (has_recsys_dblp_key)
)

# Is there a moderate recsys phrase and some additional evidence?
has_moderate_evidence = (
    (num_moderate_recsys_phrases>0) 
    & (
        (num_recsys_linked_papers 
         + (max_recsys_authorship_count  >= min_authorship_count)
         + (recsys_venue_count >= min_venue_count) 
         + fos_contains_cs
        ) > min_evidence_count
    )
)

# Is there a weak recsys phrase and some additional evidence?
has_weak_evidence = (
    (num_weak_recsys_phrases>0) 
    & (
        (num_recsys_linked_papers 
         + (max_recsys_authorship_count  >= min_authorship_count) 
         + (recsys_venue_count >= min_venue_count) 
         + fos_contains_cs
        ) > min_evidence_count*weak_evidence_factor
    )
)

check_candidate_recsys_papers = (

    # It's a candidate recsys paper in the right time frame ...
    is_within_recsys_years & is_candidate_recsys_paper &

    (has_system_evidence | has_strong_evidence | has_moderate_evidence | has_weak_evidence)

)


check_candidate_recsys_papers.sum()
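
The moderate and weak rules add an integer count to three boolean signals, each of which contributes 0 or 1. A minimal sketch of that arithmetic on invented values, using the candidate thresholds from the cell above (3 for authorship and venue, 5 for the evidence count):

```python
import pandas as pd

# Invented feature values for three papers.
toy = pd.DataFrame({
    'num_recsys_linked_papers': [5, 1, 0],
    'max_recsys_authorship_count': [4, 10, 0],
    'recsys_venue_count': [7, 0, 0],
    'fos_contains_cs': [True, True, False],
})

# Booleans count as 0 or 1 when added to the integer link count.
score = (
    toy['num_recsys_linked_papers']
    + (toy['max_recsys_authorship_count'] >= 3)
    + (toy['recsys_venue_count'] >= 3)
    + toy['fos_contains_cs']
)

passes = score > 5
print(score.tolist(), passes.tolist())
```

Since the three boolean signals can contribute at most 3, a paper with a moderate phrase needs at least 3 qualifying linked papers to exceed a threshold of 5, and a paper with only a weak phrase needs at least 5 to exceed 5 × 1.5 = 7.5.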

## For Non-Candidates
We also reconsider other papers in our main dataset that are not currently RS candidates. These are papers that have been included because they are linked to an RS paper (author pubs, refs, cites). There is at least a chance that some will be RS papers even though they were not found in our original search for RS papers. We require more evidence for these papers.

In [None]:

# Slightly higher min evidence count
min_evidence_count = 6

# weak_evidence_factor = 2

# has_strong_evidence = (num_strong_recsys_phrases>0) | ((num_moderate_recsys_phrases_in_title>0) & is_within_recsys_years) | (has_recsys_dblp_key)

has_strong_evidence = (
    (num_strong_recsys_phrases>0) | 
    (num_strong_recsys_phrases_in_venue>0) |
    ((num_moderate_recsys_phrases_in_title>0) & is_within_recsys_years) |
    ((num_moderate_recsys_phrases_in_venue>0) & is_within_recsys_years) | 
    (has_recsys_dblp_key)
)



has_moderate_evidence = (
    (num_moderate_recsys_phrases>0) 
    & (
        (num_recsys_linked_papers 
         + (max_recsys_authorship_count  >= min_authorship_count)
         + (recsys_venue_count >= min_venue_count) 
         + fos_contains_cs
        ) > min_evidence_count
    )
)

has_weak_evidence = (
    (num_weak_recsys_phrases>0) 
    & (
        (num_recsys_linked_papers 
         + (max_recsys_authorship_count  >= min_authorship_count) 
         + (recsys_venue_count >= min_venue_count) 
         + fos_contains_cs
        ) > min_evidence_count*weak_evidence_factor
    )
)

check_non_candidate_recsys_papers = (

    # It's a non-candidate paper in the right time frame ...
    is_within_recsys_years & (~is_candidate_recsys_paper) &

    (has_system_evidence | has_strong_evidence | has_moderate_evidence | has_weak_evidence)
)


check_non_candidate_recsys_papers.sum()

## Mark the recommender systems papers
Identify the papers that qualify after the above filters/rules.

In [None]:
df['is_recsys_paper'] = check_candidate_recsys_papers | check_non_candidate_recsys_papers
df['is_recsys_paper'].sum()

In [None]:
# Manually exclude papers that pass the rules but are known not to be RS papers.
bad_recsys_paper_ids = ['51d267b782e7caf2b6bc7240b1a5f48044ffe115']

df['is_recsys_paper'] = df['is_recsys_paper'] & ~df['paperId'].isin(bad_recsys_paper_ids)

df['is_recsys_paper'].sum()

# Mark Core RecSys Papers
It is plausible that some of the papers we identify as RecSys papers are relevant but not core to the field. They might be papers on a technique that is called out as being relevant to recsys in the abstract, for example. So, let's try to identify the core recsys papers that are unambiguously about recsys. To do this we will include all recsys papers published in the core recsys venues (`has_recsys_key` is True) and also those that have a strong recsys phrase in the title.

In [None]:
def title_contains_personalization_and_recommendation(title):
    title = title.lower()

    return ('personali' in title) and ('recommend' in title)

has_title_contains_personalization_and_recommendation = df['title'].map(title_contains_personalization_and_recommendation)


def title_ends_with_recommendation(title):

    words = title.lower().replace('.', '').split()

    # Guard against empty titles, which have no last word.
    return bool(words) and words[-1] in ('recommendation', 'recommendations')

has_title_ends_with_recommendation = df['title'].map(title_ends_with_recommendation)


def contains_recsys_phrase(title):

    title = title.lower()

    # A fairly accommodating set of terms but they have to be in the title and the papers
    # will be recsys candidates to begin with.
    strong_recsys_phrases = set([
        'recommend',
        'collaborative filter', 'information filter',
        'recsys', 'movielens', 'grouplens', 'netflix prize', 'cold start', 'cold-start',
    ])

    return any(phrase in title for phrase in strong_recsys_phrases)

has_recsys_title = df['title'].map(contains_recsys_phrase)

is_recsys_venue = df['num_strong_recsys_phrases_in_venue']>0
contains_freq_recommender = df['text'].map(lambda text: text.count('recommender')>2)

contains_recommendation_system = (
    ((df['title'].str.lower().str.contains('recommend')) & (df['text'].map(lambda text: text.count('recommend')>1)))
    | (df['text'].map(lambda text: text.count('recommend')>2))
)

is_recsys_paper = df['is_recsys_paper']
has_recsys_key = df['has_recsys_key']

recsys_venues = ['acm recsys', 'intrsrecsys', 'recsys poster', 'rectourrecsys', 'recsys challenge']
in_recsys_venue = df['venue'].str.lower().isin(recsys_venues)

is_within_recsys_years = df['year'].between(1990, 2023)

# is_core_recsys_paper = has_recsys_key | (is_recsys_paper & has_recsys_title)

is_core_recsys_paper = is_within_recsys_years & (
    has_recsys_dblp_key |
    (is_recsys_paper & has_title_ends_with_recommendation) |
    (is_recsys_paper & has_title_contains_personalization_and_recommendation) | 
    (is_recsys_paper & has_recsys_title) | 
    (is_recsys_paper & is_recsys_venue) | 
    (is_recsys_paper & contains_freq_recommender) |
    (is_recsys_paper & contains_recommendation_system)
)

df['is_core_recsys_paper'] = is_core_recsys_paper

df['is_recsys_paper'].sum(), df['is_core_recsys_paper'].sum()
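
To show how two of the title heuristics behave, here is a small sketch on invented titles. The helper names are illustrative and the logic is a restatement of the checks above under the assumption that titles are plain strings, with an added guard for empty titles.

```python
def ends_with_recommendation(title):
    # Strip full stops, then compare the last word of the title.
    words = title.lower().replace('.', '').split()
    return bool(words) and words[-1] in ('recommendation', 'recommendations')

def mentions_personalization_and_recommendation(title):
    # Stems cover both spellings (personalise/personalize) and word forms.
    title = title.lower()
    return 'personali' in title and 'recommend' in title

# Invented titles for illustration only.
titles = [
    'Neural Models for Session-Based Recommendation.',
    'Personalised News Recommender Evaluation',
    'Recommendation Letters in Hiring',
    '',
]

ends = [ends_with_recommendation(t) for t in titles]
both = [mentions_personalization_and_recommendation(t) for t in titles]
print(ends)
print(both)
```

The third title contains "recommendation" but satisfies neither heuristic, which is the kind of passing mention these title rules are meant to leave out unless other evidence applies.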

# Save Refined RecSys Datasets
At this point we have a refined dataset with a clearer separation between RS and non-RS papers. Save this dataset.

In [None]:
df.to_feather(refined_recsys_papers_dataset)
df.shape, refined_recsys_papers_dataset