# Content-Based Filtering for Papers

This notebook demonstrates **content-based recommendation** using research papers.  We use Scikit-Learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) to transform text into vectors and do neighborhood-based recommendation using article abstract text.

Paper abstracts are sourced from the now-defunct [HCI Bibliography](http://hcibib.org/).  Download the `hcibib.zip` data file from Blackboard and save it in the `data` directory.

## Setup

Let's import our core modules:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We're going to use Scikit-Learn's text vectorizers:

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

This repository has code to load the data:

In [3]:
from dsci641.hcibib import bib_conference_dfs

## Loading Data

Let's load the conference data:

In [4]:
papers, authors = bib_conference_dfs('data/hcibib.zip')

opening data/hcibib.zip
found 1453 conference files


In [5]:
papers.set_index('id', inplace=True)
papers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100624 entries, C.ACE.04.10 to C.YIUX.14.34
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   date      100579 non-null  object
 1   title     100624 non-null  object
 2   abstract  100624 non-null  object
 3   pub       100624 non-null  object
dtypes: object(4)
memory usage: 3.8+ MB


In [6]:
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310979 entries, 0 to 310978
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      310979 non-null  object
 1   author  310979 non-null  object
dtypes: object(2)
memory usage: 4.7+ MB


We have about 100K papers.

## Finding an Author

We're going to recommend papers for me.  Let's look me up in the author lists:

In [7]:
mde = authors[authors['author'].isin(['Ekstrand, Michael', 'Ekstrand, Michael D.'])]
mde

,id,author
230289,C.ISW.09.4,"Ekstrand, Michael D."
263749,C.RecSys.10.159,"Ekstrand, Michael D."
263991,C.RecSys.11.133,"Ekstrand, Michael D."
264103,C.RecSys.11.349,"Ekstrand, Michael D."
264157,C.RecSys.11.395,"Ekstrand, Michael"
264207,C.RecSys.12.99,"Ekstrand, Michael"
264273,C.RecSys.12.233,"Ekstrand, Michael"
264472,C.RecSys.13.149,"Ekstrand, Michael D."
264789,C.RecSys.14.161,"Ekstrand, Michael D."
265006,C.RecSys.15.11,"Ekstrand, Michael D."


Get the papers themselves:

In [8]:
mde_papers = papers.loc[mde['id']]
mde_papers[['pub', 'title']]

id,pub,title
C.ISW.09.4,Proceedings of the 2009 International Symposiu...,rv you're dumb: identifying discarded work in ...
C.RecSys.10.159,Proceedings of the 2010 ACM Conference on Reco...,Automatically building research reading lists
C.RecSys.11.133,Proceedings of the 2011 ACM Conference on Reco...,Rethinking the recommender research ecosystem:...
C.RecSys.11.349,Proceedings of the 2011 ACM Conference on Reco...,LensKit: a modular recommender framework
C.RecSys.11.395,Proceedings of the 2011 ACM Conference on Reco...,UCERSTI 2: second workshop on user-centric eva...
C.RecSys.12.99,Proceedings of the 2012 ACM Conference on Reco...,How many bits per rating?
C.RecSys.12.233,Proceedings of the 2012 ACM Conference on Reco...,When recommenders fail: predicting recommender...
C.RecSys.13.149,Proceedings of the 2013 ACM Conference on Reco...,Rating support interfaces to improve user expe...
C.RecSys.14.161,Proceedings of the 2014 ACM Conference on Reco...,User perception of differences in recommender ...
C.RecSys.15.11,Proceedings of the 2015 ACM Conference on Reco...,Letting Users Choose Recommender Algorithms: A...


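We will also use the original GroupLens paper as a running example, so let's look it up and record its row number:

In [14]: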
gl_id = 'C.CSCW.94.175'
gl_row = papers.index.get_loc(gl_id)
papers.loc[gl_id]

date                                               1994-10-22
title       GroupLens: An Open Architecture for Collaborat...
abstract    Collaborative filters help people make choices...
pub         Proceedings of ACM CSCW'94 Conference on Compu...
Name: C.CSCW.94.175, dtype: object

## Counting Text

In [11]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(papers['abstract'])
X

<100624x81271 sparse matrix of type '<class 'numpy.int64'>'
	with 5344698 stored elements in Compressed Sparse Row format>

In [13]:
vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., 'zz', 'zzstructures', 'zzzoo'],
      dtype=object)

In [15]:
X[[gl_row], :]

<1x81271 sparse matrix of type '<class 'numpy.int64'>'
	with 58 stored elements in Compressed Sparse Row format>

In [20]:
pd.Series(X[[gl_row], :].toarray()[0, :], index=vectorizer.get_feature_names_out()).nlargest(10)

people           4
articles         3
based            2
better           2
bit              2
bureaus          2
clients          2
collaborative    2
developed        2
help             2
dtype: int64

## Processing Text

We are going to analyze text by using **TF-IDF** vectors.  They will be unit-normalized (the default), so cosine similarities are easy.

The `TfidfVectorizer` does this for us!  (Note, you do **not** use this for A1 or A2 — the content comes in a different form there.)

In [21]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(papers['abstract'])
X

<100624x81271 sparse matrix of type '<class 'numpy.float64'>'
	with 5344698 stored elements in Compressed Sparse Row format>

This gave us a **sparse matrix**, with one row for each paper and one column for each term (word).  We gave it the single column of text as input.
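
To see why unit-normalized rows make cosine similarity easy, here is a minimal sketch on a toy corpus.  The three sentences and the `toy_` names are made up for illustration and are not part of the HCI Bibliography data.  With unit-length rows, a plain dot product between two rows is their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy abstracts (made up for illustration; not from the real data set)
docs = [
    "collaborative filtering helps people find news articles",
    "recommender systems use collaborative filtering",
    "eye tracking study of menu design",
]

toy_vec = TfidfVectorizer(stop_words='english')
toy_X = toy_vec.fit_transform(docs)

# Each row has unit L2 norm because norm='l2' is the vectorizer's default
row_norms = np.sqrt(toy_X.multiply(toy_X).sum(axis=1))
row_norms = np.asarray(row_norms).ravel()

# For unit vectors, the dot product is the cosine similarity
toy_sims = (toy_X @ toy_X.T).toarray()
```

The first two toy documents share terms and get a positive similarity, while the third shares none with the first and scores zero.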

We can now look for similar papers.  Let's take the original collaborative filtering paper, the GroupLens paper we looked up above (`gl_id`), and get its row from the matrix:

In [22]:
gl_vec = X[[gl_row], :]
gl_vec

<1x81271 sparse matrix of type '<class 'numpy.float64'>'
	with 58 stored elements in Compressed Sparse Row format>

And get the most similar papers:

In [23]:
gl_sims = gl_vec @ X.T
gl_sims = gl_sims.toarray()[0, :]
gl_sims = pd.Series(gl_sims, index=papers.index)
# top 10 not counting itself
gl_sims.nlargest(11).iloc[1:]

id
C.DL.15.195        0.240478
C.IR.05.106        0.234121
C.CIKM.15.1859     0.223453
C.CHI.03.1.585     0.223306
C.DL.07.438        0.217386
C.IUI.97.237       0.215445
C.RecSys.13.105    0.213065
C.IUI.10.31        0.210269
C.WWW.13.1.691     0.209438
C.IR.11.735        0.209077
dtype: float64
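
The `nlargest(11).iloc[1:]` idiom relies on the query paper ranking first in its own similarity list.  That usually holds, since its self-similarity is 1.0, but a paper with an identical abstract would tie with it.  A minimal sketch of a label-based alternative, using made-up IDs and scores rather than the real data:

```python
import pandas as pd

# Hypothetical similarity scores for a query paper 'q' (made-up IDs and values)
toy_sims = pd.Series({'q': 1.0, 'a': 0.31, 'b': 0.12, 'c': 0.27, 'dup': 1.0})

# Drop the query by label, then take the top matches
top = toy_sims.drop('q').nlargest(3)
```

Dropping by label removes exactly the query paper, so an exact duplicate such as `dup` stays in the results instead of being cut off by position.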

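Let's look at one similar paper in detail.  (The ID below is hard-coded and does not appear in the top-10 list above; `gl_sims.nlargest(2).index[1]` would select the current top match.)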

In [24]:
papers.loc['C.CSCW.98.345']

date                                               1998-11-14
title       Using Filtering Agents to Improve Prediction Q...
abstract    Collaborative filtering systems help address i...
pub         Proceedings of ACM CSCW'98 Conference on Compu...
Name: C.CSCW.98.345, dtype: object

## Investigating TF-IDF

Let's peek at the *actual vectors*.  Turn the GroupLens paper's row into a series indexed by the vectorizer's feature names (from `get_feature_names_out()`):

In [25]:
pd.Series(gl_vec.toarray()[0, :], index=vectorizer.get_feature_names_out()).nlargest(10)

bureaus      0.358843
articles     0.275108
bit          0.232363
servers      0.215757
clients      0.203127
people       0.202085
rating       0.186550
grouplens    0.179422
scores       0.175816
news         0.174336
dtype: float64
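
Compared with the raw counts, rare terms such as "bureaus" now outrank common ones such as "people".  The sketch below reproduces the weights by hand on a toy corpus, assuming the vectorizer's default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`).  The sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration)
docs = [
    "people rate news articles",
    "people read articles",
    "better bit bureaus rate people",
]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default formula)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's result
reference = TfidfVectorizer().fit_transform(docs).toarray()
```

In this toy corpus "people" appears in every document, so it gets the smallest IDF, while a term that appears in only one document gets the largest.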

## Recommending for a User

We can now take one of two approaches to recommend for a user:

* find similar articles to each article, and take the mean or total similarity
* aggregate the user's history into a single vector

Depending on the normalizations, they can be algebraically equivalent in some cases.
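
One such case is the mean of plain dot-product similarities: because the dot product is linear, averaging the similarities to each history item gives the same scores as comparing against the averaged history vector.  A small sketch with random stand-in data (the array names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
history = rng.random((4, 6))    # 4 hypothetical user papers, 6 terms
catalog = rng.random((10, 6))   # 10 hypothetical candidate papers

# Approach 1: similarity to each history item, then average over the items
mean_of_sims = (history @ catalog.T).mean(axis=0)

# Approach 2: average the history into one profile vector, then compare
profile = history.mean(axis=0)
sims_to_mean = profile @ catalog.T
```

The equivalence is exact only for this setup.  If the profile vector is re-normalized to unit length, the scores change by a constant factor, which should leave the ranking unchanged; other aggregations or per-item normalizations can give different results.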

Let's compute separately for each article.  We can get a matrix aligned with my articles:

In [26]:
mde_rows = papers.index.get_indexer_for(mde_papers.index)
mde_X = X[mde_rows, :]
mde_X

<11x81271 sparse matrix of type '<class 'numpy.float64'>'
	with 653 stored elements in Compressed Sparse Row format>

We can then multiply this *whole matrix* by the other one to get the similarities between each of my articles and all other articles:

In [27]:
mde_psims = mde_X @ X.T
mde_psims

<11x100624 sparse matrix of type '<class 'numpy.float64'>'
	with 849628 stored elements in Compressed Sparse Row format>

We can then *average* those similarities, so the final score is the average similarity to my papers.  `axis=0` tells it to average over the rows, giving one mean per column (that is, per candidate paper); `np.array` is needed to convert the result from an old-style `np.matrix`, which SciPy's sparse matrices still return:

In [32]:
mde_sims = np.array(np.mean(mde_psims, axis=0))[0, :]
mde_sims = pd.Series(mde_sims, index=papers.index)
mde_sims.nlargest()

id
C.RecSys.15.11     0.266259
C.RecSys.11.349    0.259746
C.RecSys.11.133    0.250385
C.RecSys.14.161    0.226413
C.RecSys.12.233    0.216416
dtype: float64
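
The shape handling in that cell can be seen on a tiny example.  This is a sketch with made-up values, assuming a SciPy `csr_matrix` like the one the vectorizer returns:

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-in for the similarity matrix: 2 "my papers" by 3 candidates
toy = sp.csr_matrix(np.array([[0.0, 0.2, 0.4],
                              [0.6, 0.0, 0.0]]))

# Averaging over the rows keeps a 2-D result with shape (1, 3)
col_means = toy.mean(axis=0)

# Convert to a plain 1-D array with one score per candidate
flat = np.asarray(col_means).ravel()
```

`np.asarray(...).ravel()` is an alternative to `np.array(...)[0, :]` that avoids a copy where possible; either way the result can be passed straight to `pd.Series`.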

Unsurprisingly, the top results are all papers I wrote, so let's filter those out.  This mask trick is an easy (and efficient) way to exclude items.

In [33]:
mask = pd.Series(True, index=papers.index)
mask[mde_papers.index] = False
mde_sims[mask].nlargest(10)

id
C.CHI.06.2.1103    0.216223
C.CLEF.15.376      0.195301
C.RecSys.11.383    0.187495
C.RecSys.11.353    0.187490
C.CHI.02.2.830     0.184500
C.RecSys.15.265    0.182742
C.CHI.06.1.1057    0.181467
C.UMAP.12.63       0.177319
C.WWW.09.671       0.173300
C.IR.09.203        0.171470
dtype: float64
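
The same masking pattern on a toy score series, with an equivalent `isin` form for comparison.  The IDs and scores are made up for illustration:

```python
import pandas as pd

# Hypothetical scores and already-seen items (made-up IDs and values)
toy_scores = pd.Series({'p1': 0.9, 'p2': 0.4, 'p3': 0.7, 'p4': 0.2})
seen = ['p1', 'p3']

# Boolean mask aligned with the score index: True means "keep as a candidate"
keep = pd.Series(True, index=toy_scores.index)
keep.loc[seen] = False
masked_top = toy_scores[keep].nlargest(2)

# Equivalent filter built directly from the index
isin_top = toy_scores[~toy_scores.index.isin(seen)].nlargest(2)
```

Building the mask as its own series is convenient when the same exclusions are reused across several score vectors.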

What are these papers?

In [34]:
mde_sims[mask].nlargest(10).to_frame('score').join(papers[['title', 'pub']])

id,score,title,pub
C.CHI.06.2.1103,0.216223,Making recommendations better: an analytic mod...,Proceedings of ACM CHI 2006 Conference on Huma...
C.CLEF.15.376,0.195301,Optimizing and Evaluating Stream-Based News Re...,CLEF 2015: International Conference of the Cro...
C.RecSys.11.383,0.187495,3rd workshop on recommender systems and the so...,Proceedings of the 2011 ACM Conference on Reco...
C.RecSys.11.353,0.18749,Recommenders benchmark framework,Proceedings of the 2011 ACM Conference on Reco...
C.CHI.02.2.830,0.1845,The role of transparency in recommender systems,Proceedings of ACM CHI 2002 Conference on Huma...
C.RecSys.15.265,0.182742,Evaluating Tag Recommender Algorithms in Real-...,Proceedings of the 2015 ACM Conference on Reco...
C.CHI.06.1.1057,0.181467,Accounting for taste: using profile similarity...,Proceedings of ACM CHI 2006 Conference on Huma...
C.UMAP.12.63,0.177319,Preference Relation Based Matrix Factorization...,Proceedings of the 2012 Conference on User Mod...
C.WWW.09.671,0.1733,Tagommenders: connecting users to items throug...,Proceedings of the 2009 International Conferen...
C.IR.09.203,0.17147,Learning to recommend with social trust ensemble,Proceedings of the 32nd Annual International A...
