# Addressing the cold start problem using content-based filtering

Project title: Automating literature reviews using recommender systems


Authors: Alarmelu PM, Akhila Bolisetty, Andrew Szeto


The cold start problem arises when a recommender system has no interaction history for a user, so the user and the recommender are not yet familiar with each other.

A solution to this problem is to offer the user content-based recommendations and to gauge their relevance through user ratings.

The purpose of this file is to generate content-based recommendations for users based on their inputs.

In [1]:
import os
#list the current work dir
os.getcwd()
from rake_nltk import Rake
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#reading the science papers from science_papers.csv file

In [2]:
articles = pd.read_csv('science_papers.csv')
articles

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,abstract,acm_class,arxiv_id,author_text,categories,comments,Unnamed: 8,doi,num_authors,num_categories,primary_cat,title,updated,categories_list,created
0,1,329549,NTsort is an external sort on WindowsNT 5.0. I...,D.5;H.2,cs/9809004,"Jim Gray, Joshua Coates, Chris Nyberg","cs.DB,cs.PF",Original word file at:\n http://research.micr...,,,3,2,cs.DB,Performance / Price Sort,,"['cs.DB', 'cs.PF']",9/1/98
1,2,704923,Simple economic and performance arguments sugg...,H.3.4,cs/9809005,"Jim Gray, Goetz Graefe",cs.DB,Original document at:\n http://research.micro...,,,2,1,cs.DB,"The Five-Minute Rule Ten Years Later, and Othe...",,['cs.DB'],9/1/98
2,3,465653,The Microsoft TerraServer stores aerial and sa...,H.2.4;H.2.8;H.3.5,cs/9809011,"Tom Barclay, Robert Eberl, Jim Gray, John Nord...","cs.DB,cs.DL",Original file at\n http://research.microsoft....,,,15,2,cs.DB,Microsoft TerraServer,,"['cs.DB', 'cs.DL']",9/4/98
3,4,1372704,We study a set of linear transformations on th...,H.2.2,cs/9809023,"Davood Rafiei, Alberto Mendelzon",cs.DB,,,,2,1,cs.DB,Similarity-Based Queries for Time Series Data,9/18/98,['cs.DB'],9/17/98
4,5,269336,We propose an improvement of the known DFT-bas...,H.2;H.3,cs/9809033,"Davood Rafiei, Alberto Mendelzon",cs.DB,,,,2,1,cs.DB,Efficient Retrieval of Similar Time Sequences ...,9/25/98,['cs.DB'],9/18/98
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6662,6663,586717,"Objective: Translational science aims at ""tran...",,1812.10609,Qing Ke,"cs.DL,cs.CY,physics.soc-ph",Accepted at JAMIA; Supporting Information at\n...,,10.1093/jamia/ocy177,1,3,cs.DL,Identifying translational science through embe...,,"['cs.DL', 'cs.CY', 'physics.soc-ph']",12/26/18
6663,6664,750271,A Wikipedia book (known as Wikibook) is a coll...,,1812.10937,"Shahar Admati, Lior Rokach, Bracha Shapira","cs.DL,cs.IR,cs.LG",,,,3,3,cs.DL,Wikibook-Bot - Automatic Generation of a Wikip...,,"['cs.DL', 'cs.IR', 'cs.LG']",12/28/18
6664,6665,1173063,"As science advances, the academic community ha...",,1812.11252,"Haofeng Jia, Erik Saule","cs.IR,cs.DL,cs.SI",,,,2,3,cs.IR,Towards Finding Non-obvious Papers: An Analysi...,,"['cs.IR', 'cs.DL', 'cs.SI']",12/28/18
6665,6666,357648,Mechanisms are a fundamental concept in many a...,,1812.11431,Robert B Allen,cs.DL,,,,1,1,cs.DL,Issues for Using Semantic Modeling to Represen...,,['cs.DL'],12/29/18


In [3]:
articles = articles.drop(columns=['Unnamed: 0','Unnamed: 0.1'])


In [4]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6667 entries, 0 to 6666
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   abstract         6667 non-null   object 
 1   acm_class        1007 non-null   object 
 2   arxiv_id         6667 non-null   object 
 3   author_text      6667 non-null   object 
 4   categories       6667 non-null   object 
 5   comments         4269 non-null   object 
 6   Unnamed: 8       0 non-null      float64
 7   doi              1390 non-null   object 
 8   num_authors      6667 non-null   object 
 9   num_categories   6667 non-null   object 
 10  primary_cat      6667 non-null   object 
 11  title            6667 non-null   object 
 12  updated          1722 non-null   object 
 13  categories_list  6667 non-null   object 
 14  created          6667 non-null   object 
dtypes: float64(1), object(14)
memory usage: 781.4+ KB


In [1]:
#extract keywords from the abstract of the science papers to speed up the content based search.
#keyword extraction is done using RAKE 

Quoted from the rake-nltk documentation:

source: https://csurfer.github.io/rake-nltk/_build/html/index.html


RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
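
To make the quoted description concrete, the following is a simplified sketch of RAKE-style scoring. It is not the implementation used by any RAKE library: the stop-word list `STOPWORDS` is a tiny hand-written assumption, and the helper names `candidate_phrases` and `rake_scores` are hypothetical. Each word is scored as its degree (co-occurrences within candidate phrases, counting itself) divided by its frequency, and a phrase scores the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in this notebook)
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'on', 'is', 'in', 'for', 'to', 'with'}

def candidate_phrases(text):
    """Split text into runs of consecutive non-stop-words."""
    phrases = []
    # Punctuation (other than hyphens) ends a candidate phrase
    for fragment in re.split(r'[^\w\s-]+', text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in this phrase, including the word itself
    scores = {' '.join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

example = "NTsort is an external sort on WindowsNT. The sort is fast."
print(rake_scores(example))
```

In this toy sentence the phrase `external sort` ranks first with a score of 3.5, because `sort` also appears on its own and therefore carries a higher degree than words that occur only once.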

In [5]:
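import RAKE  # the python-rake package, which is separate from rake_nltk

def sort_tuple(tup):
    # sort (phrase, score) pairs by ascending score
    return sorted(tup, key=lambda x: x[1])

# build the extractor once instead of once per abstract
r = RAKE.Rake('SmartStoplist.txt')
keywords = []
for abstract in articles['abstract']:
    keyphrases = r.run(abstract)
    # keep the four highest-scoring phrases of each abstract
    imp_phrases = sort_tuple(keyphrases)[-4:]
    keywords.append([phrase for phrase, score in imp_phrases])
# print only the first entry; printing the growing list on every iteration floods the notebook output
print(keywords[:1])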

[['two-pass external sort', 'elapsed time performance comparable', 'robert ramey software development', 'simple gb/$ sort metric based']]



We add the abstract keywords to the dataset as a new column. This helps speed up the content-based search process.

In [6]:
articles['keywords'] = keywords

In [7]:
articles

Unnamed: 0,abstract,acm_class,arxiv_id,author_text,categories,comments,Unnamed: 8,doi,num_authors,num_categories,primary_cat,title,updated,categories_list,created,keywords
0,NTsort is an external sort on WindowsNT 5.0. I...,D.5;H.2,cs/9809004,"Jim Gray, Joshua Coates, Chris Nyberg","cs.DB,cs.PF",Original word file at:\n http://research.micr...,,,3,2,cs.DB,Performance / Price Sort,,"['cs.DB', 'cs.PF']",9/1/98,"[two-pass external sort, elapsed time performa..."
1,Simple economic and performance arguments sugg...,H.3.4,cs/9809005,"Jim Gray, Goetz Graefe",cs.DB,Original document at:\n http://research.micro...,,,2,1,cs.DB,"The Five-Minute Rule Ten Years Later, and Othe...",,['cs.DB'],9/1/98,"[randomly accessed pages, suggest optimal page..."
2,The Microsoft TerraServer stores aerial and sa...,H.2.4;H.2.8;H.3.5,cs/9809011,"Tom Barclay, Robert Eberl, Jim Gray, John Nord...","cs.DB,cs.DL",Original file at\n http://research.microsoft....,,,15,2,cs.DB,Microsoft TerraServer,,"['cs.DB', 'cs.DL']",9/4/98,"[sql server database served, united states geo..."
3,We study a set of linear transformations on th...,H.2.2,cs/9809023,"Davood Rafiei, Alberto Mendelzon",cs.DB,,,,2,1,cs.DB,Similarity-Based Queries for Time Series Data,9/18/98,['cs.DB'],9/17/98,"[time-series data, fourier series representati..."
4,We propose an improvement of the known DFT-bas...,H.2;H.3,cs/9809033,"Davood Rafiei, Alberto Mendelzon",cs.DB,,,,2,1,cs.DB,Efficient Retrieval of Similar Time Sequences ...,9/25/98,['cs.DB'],9/18/98,"[search time, similar time sequences, real sto..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6662,"Objective: Translational science aims at ""tran...",,1812.10609,Qing Ke,"cs.DL,cs.CY,physics.soc-ph",Accepted at JAMIA; Supporting Information at\n...,,10.1093/jamia/ocy177,1,3,cs.DL,Identifying translational science through embe...,,"['cs.DL', 'cs.CY', 'physics.soc-ph']",12/26/18,"[showing excellent agreement, uncovering signi..."
6663,A Wikipedia book (known as Wikibook) is a coll...,,1812.10937,"Shahar Admati, Lior Rokach, Bracha Shapira","cs.DL,cs.IR,cs.LG",,,,3,3,cs.DL,Wikibook-Bot - Automatic Generation of a Wikip...,,"['cs.DL', 'cs.IR', 'cs.LG']",12/28/18,"[statistically significant results, machine le..."
6664,"As science advances, the academic community ha...",,1812.11252,"Haofeng Jia, Erik Saule","cs.IR,cs.DL,cs.SI",,,,2,3,cs.IR,Towards Finding Non-obvious Papers: An Analysi...,,"['cs.IR', 'cs.DL', 'cs.SI']",12/28/18,"[search relevant manuscripts, real citation re..."
6665,Mechanisms are a fundamental concept in many a...,,1812.11431,Robert B Allen,cs.DL,,,,1,1,cs.DL,Issues for Using Semantic Modeling to Represen...,,['cs.DL'],12/29/18,"[xfo programming environment, basic semantic m..."


In [2]:
#data cleaning

In [8]:
import re

def clean(text):
    text = re.sub(r"\.", '', text)  # remove periods (e.g. 'cs.DB' -> 'csDB')
    text = re.sub('RT|cc', ' ', text)  # replace the case-sensitive substrings 'RT' and 'cc' with a space
    text = re.sub(r'#\S+', '', text)  # remove hashtags
    text = re.sub(r'@\S+', '  ', text)  # remove mentions
    text = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', text)  # replace punctuation (including commas and hyphens) with spaces
    text = re.sub(r'[^\x00-\x7f]', r' ', text)  # replace non-ASCII characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into one space
    return text

In [9]:
articles['clean_categories'] = articles.categories.apply(lambda x: clean(x))
articles['clean_primary_cat'] = articles.primary_cat.apply(lambda x: clean(x))

In [10]:
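# clean() has already replaced commas with spaces, so this split yields a one-element list per row
articles['clean_categories'] = articles['clean_categories'].map(lambda x: x.split(','))
# keep at most the first three authors of each paper
articles['author_text'] = articles['author_text'].map(lambda x: x.split(',')[:3])
# NOTE: iterrows() yields a copy of each row, so these assignments are not written back to `articles`
for index, row in articles.iterrows():
    row['clean_categories'] = [x.lower().replace(' ','') for x in row['clean_categories']]
    row['author_text'] = [x.lower().replace(' ','') for x in row['author_text']]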
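
The table that follows still shows author names with their original capitalization and spacing, because assigning to `row` inside `iterrows()` changes only a temporary copy of the row. A minimal sketch of one way to make the normalization persist is shown here on a toy frame. The `toy` DataFrame and the `normalize_tokens` helper are illustrative assumptions, and the toy categories are assumed to be already split into separate tokens. Applying this to the real data would change the bag of words and therefore the recommendations shown later.

```python
import pandas as pd

def normalize_tokens(tokens):
    """Lower-case each token and strip all spaces, e.g. ' Jim Gray' -> 'jimgray'."""
    return [t.lower().replace(' ', '') for t in tokens]

# Toy stand-in for the two list-valued columns of `articles`
toy = pd.DataFrame({
    'clean_categories': [['csDB', 'csPF'], ['csDB']],
    'author_text': [['Jim Gray', ' Joshua Coates'], ['Davood Rafiei']],
})

# Series.apply returns a new Series; assigning it back to the column persists the change
for col in ['clean_categories', 'author_text']:
    toy[col] = toy[col].apply(normalize_tokens)

print(toy['author_text'].tolist())
```

Collapsing each author name into a single token such as `jimgray` keeps a shared first name or surname from counting as a match between unrelated authors.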

In [11]:
articles

Unnamed: 0,abstract,acm_class,arxiv_id,author_text,categories,comments,Unnamed: 8,doi,num_authors,num_categories,primary_cat,title,updated,categories_list,created,keywords,clean_categories,clean_primary_cat
0,NTsort is an external sort on WindowsNT 5.0. I...,D.5;H.2,cs/9809004,"[Jim Gray, Joshua Coates, Chris Nyberg]","cs.DB,cs.PF",Original word file at:\n http://research.micr...,,,3,2,cs.DB,Performance / Price Sort,,"['cs.DB', 'cs.PF']",9/1/98,"[two-pass external sort, elapsed time performa...",[csDB csPF],csDB
1,Simple economic and performance arguments sugg...,H.3.4,cs/9809005,"[Jim Gray, Goetz Graefe]",cs.DB,Original document at:\n http://research.micro...,,,2,1,cs.DB,"The Five-Minute Rule Ten Years Later, and Othe...",,['cs.DB'],9/1/98,"[randomly accessed pages, suggest optimal page...",[csDB],csDB
2,The Microsoft TerraServer stores aerial and sa...,H.2.4;H.2.8;H.3.5,cs/9809011,"[Tom Barclay, Robert Eberl, Jim Gray]","cs.DB,cs.DL",Original file at\n http://research.microsoft....,,,15,2,cs.DB,Microsoft TerraServer,,"['cs.DB', 'cs.DL']",9/4/98,"[sql server database served, united states geo...",[csDB csDL],csDB
3,We study a set of linear transformations on th...,H.2.2,cs/9809023,"[Davood Rafiei, Alberto Mendelzon]",cs.DB,,,,2,1,cs.DB,Similarity-Based Queries for Time Series Data,9/18/98,['cs.DB'],9/17/98,"[time-series data, fourier series representati...",[csDB],csDB
4,We propose an improvement of the known DFT-bas...,H.2;H.3,cs/9809033,"[Davood Rafiei, Alberto Mendelzon]",cs.DB,,,,2,1,cs.DB,Efficient Retrieval of Similar Time Sequences ...,9/25/98,['cs.DB'],9/18/98,"[search time, similar time sequences, real sto...",[csDB],csDB
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6662,"Objective: Translational science aims at ""tran...",,1812.10609,[Qing Ke],"cs.DL,cs.CY,physics.soc-ph",Accepted at JAMIA; Supporting Information at\n...,,10.1093/jamia/ocy177,1,3,cs.DL,Identifying translational science through embe...,,"['cs.DL', 'cs.CY', 'physics.soc-ph']",12/26/18,"[showing excellent agreement, uncovering signi...",[csDL csCY physicssoc ph],csDL
6663,A Wikipedia book (known as Wikibook) is a coll...,,1812.10937,"[Shahar Admati, Lior Rokach, Bracha Shapira]","cs.DL,cs.IR,cs.LG",,,,3,3,cs.DL,Wikibook-Bot - Automatic Generation of a Wikip...,,"['cs.DL', 'cs.IR', 'cs.LG']",12/28/18,"[statistically significant results, machine le...",[csDL csIR csLG],csDL
6664,"As science advances, the academic community ha...",,1812.11252,"[Haofeng Jia, Erik Saule]","cs.IR,cs.DL,cs.SI",,,,2,3,cs.IR,Towards Finding Non-obvious Papers: An Analysi...,,"['cs.IR', 'cs.DL', 'cs.SI']",12/28/18,"[search relevant manuscripts, real citation re...",[csIR csDL csSI],csIR
6665,Mechanisms are a fundamental concept in many a...,,1812.11431,[Robert B Allen],cs.DL,,,,1,1,cs.DL,Issues for Using Semantic Modeling to Represen...,,['cs.DL'],12/29/18,"[xfo programming environment, basic semantic m...",[csDL],csDL


In [12]:
articles.to_csv('articles.csv')

# Content-based filtering using bag of words and cosine similarity

We used the following reference to understand the filtering process:

https://medium.com/mlearning-ai/content-based-recommender-system-using-nlp-445ebb777c7a

Using the bag of words implementation:

For each paper, the bag of words combines:

the categories the paper belongs to

the author information

the abstract keywords extracted by the RAKE algorithm

Once the bag of words is built, we take a title the user likes as input and return the most similar papers as recommendations.
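
As a small illustration of the idea, the sketch below computes cosine similarity between word-count vectors for three made-up bag-of-words strings. The strings in `toy_bags` are hypothetical and only mimic the shape of the real data; the two scikit-learn calls are the same ones used in the cells that follow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical bag-of-words strings for three papers
toy_bags = [
    'csdb jimgray external sort',      # paper 0
    'csdb jimgray page sizes',         # paper 1
    'csdl wikipedia book generation',  # paper 2
]

# Rows are papers, columns are word counts over the shared vocabulary
counts = CountVectorizer().fit_transform(toy_bags)

# sim[i, j] is the cosine of the angle between the count vectors of papers i and j
sim = cosine_similarity(counts)

print(sim.round(2))
```

Papers 0 and 1 share two of their four words, giving a similarity of 0.5, while paper 2 shares no words with either and scores 0.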

In [172]:
Bag_of_words = []
columns = ['clean_categories', 'author_text', 'keywords']
for index, row in articles.iterrows():
    # join the tokens of each column and concatenate them into one string per paper
    words = ''
    for col in columns:
        words += ' '.join(row[col]) + ' '
    Bag_of_words.append(words)

In [173]:
articles['Bag_of_words'] = Bag_of_words

In [174]:
cols_of_interest = ['title','Bag_of_words']
df = articles[cols_of_interest]

In [175]:

count = CountVectorizer()
count_matrix = count.fit_transform(df['Bag_of_words'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [176]:
indices = pd.Series(df['title'])

In [178]:
def recommend(title, cosine_sim = cosine_sim):
    # row position of the first paper whose title matches exactly
    idx = indices[indices == title].index[0]
    # similarity of every paper to the chosen one, highest first
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # skip the first entry (normally the paper itself) and keep the next ten
    top_10_indices = list(score_series.iloc[1:11].index)
    return [df['title'].iloc[i] for i in top_10_indices]
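
Two edge cases are worth noting for `recommend`: a title that is not in the dataset raises an `IndexError` at `.index[0]`, and some of the recommendation lists in this notebook contain the same title twice. The following is a hedged sketch of a more defensive variant, not the function used to produce the results here. The name `recommend_safe`, its parameters, and the toy data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def recommend_safe(title, titles, sim, k=10):
    """Return up to k distinct titles most similar to `title`."""
    matches = titles[titles == title].index
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    idx = matches[0]
    # Drop the query paper explicitly instead of assuming it sorts first
    scores = pd.Series(sim[idx], index=titles.index).drop(idx)
    ranked = scores.sort_values(ascending=False, kind='mergesort')
    seen, picks = {title}, []
    for i in ranked.index:
        if titles[i] not in seen:  # skip repeated titles
            seen.add(titles[i])
            picks.append(titles[i])
        if len(picks) == k:
            break
    return picks

# Toy data: title 'B' appears twice, as duplicated rows would
toy_titles = pd.Series(['A', 'B', 'B', 'C'])
toy_sim = np.array([
    [1.0, 0.9, 0.9, 0.2],
    [0.9, 1.0, 1.0, 0.1],
    [0.9, 1.0, 1.0, 0.1],
    [0.2, 0.1, 0.1, 1.0],
])

print(recommend_safe('A', toy_titles, toy_sim, k=2))
```

With the toy data, the duplicated title `B` is returned only once and `C` fills the second slot.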

# Example of user profile creation

user_1 is researching trends in data management and would like to find related papers in the field.

The user's input to our algorithm is the title:
Data Management: Past, Present, and Future


In [179]:
recommendations = recommend('Data Management: Past, Present, and Future')

In [180]:
recommendations

['Views over RDF Datasets: A State-of-the-Art and Open Challenges',
 'Clustering RDF Databases Using Tunable-LSH',
 'Datom: Towards modular data management',
 'Energy Efficiency: The New Holy Grail of Data Management Systems Research',
 'NewSQL: Towards Next-Generation Scalable RDBMS for Online Transaction Processing (OLTP) for Big Data Management',
 'Time Series Management Systems: A Survey',
 'Sketch-based Querying of Distributed Sliding-Window Data Streams',
 'Application of Inventory Management Principles for Efficient Data Placement in Storage Networks',
 'SPARQL query processing with Apache Spark',
 'NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison']

The list above is the set of recommendations generated for user-1.
We then ask the user to rate the relevance of each recommendation on a scale of 1 to 5.
The ratings were collected manually and compiled into a dataset for our next steps.
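
The layout of the ratings dataset is not shown in this notebook. As one possible shape, the sketch below pairs each recommended title with a 1 to 5 rating in a long-format table. The function name `build_ratings`, the column names, and the placeholder titles and ratings are all assumptions made for illustration and are not the project's data.

```python
import pandas as pd

def build_ratings(user_id, titles, ratings):
    """Pair each recommended title with a 1-5 relevance rating."""
    if len(titles) != len(ratings):
        raise ValueError('one rating per recommended title is required')
    if any(r not in range(1, 6) for r in ratings):
        raise ValueError('ratings must be integers from 1 to 5')
    # The scalar user_id is broadcast to every row
    return pd.DataFrame({'user_id': user_id, 'title': titles, 'rating': ratings})

# Placeholder titles and ratings, made up purely for illustration
ratings_df = build_ratings('user_1', ['Paper X', 'Paper Y', 'Paper Z'], [5, 3, 1])
print(ratings_df)
```

A long format with one row per user and title makes it straightforward to append the ratings of further users to the same table.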

In [None]:
#user-2
recommendations = recommend('Perfect Hashing for Data Management Applications')

In [182]:
recommendations

['Succinct Data Structures for Retrieval and Approximate Membership',
 'An Optimal Bloom Filter Replacement Based on Matrix Solving',
 'Linear Probing with Constant Independence',
 'Hashing for Similarity Search: A Survey',
 'The Usefulness of Multilevel Hash Tables with Multiple Hash Functions in Large Databases',
 'Composite Hashing for Data Stream Sketches',
 'Strongly universal string hashing is fast',
 'Scalability and Total Recall with Fast CoveringLSH',
 'Proofs of Zero Knowledge',
 'Recursive n-gram hashing is pairwise independent, at best']

In [183]:
#user-3
recommendations = recommend('UDBMS: Road to Unification for Multi-model Data Management')

In [184]:
recommendations

['Principles of the Concept-Oriented Data Model',
 'Application of Inventory Management Principles for Efficient Data Placement in Storage Networks',
 'A Model-Based Frequency Constraint for Mining Associations from Transaction Data',
 'Why the relational data model can be considered as a formal basis for group operations in object-oriented systems',
 'Nucleus: A Pilot Project',
 'CASP-DM: Context Aware Standard Process for Data Mining',
 'Datom: Towards modular data management',
 "Beyond Roll-Up's and Drill-Down's: An Intentional Analytics Model to Reinvent OLAP (long-version)",
 'Web data modeling for integration in data warehouses',
 'Consistent Checkpointing in Distributed Databases: Towards a Formal Approach']

In [166]:
#user-4
recommendations = recommend('An introduction to Graph Data Management')

In [167]:
recommendations

['Negation in SPARQL',
 'G-CORE: A Core for Future Graph Query Languages',
 'Tracing technological development trajectories: A genetic knowledge persistence-based main path approach',
 'On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage',
 'SPARQL over GraphX',
 'Rethinking serializable multiversion concurrency control',
 'Pregelix: Big(ger) Graph Analytics on A Dataflow Engine',
 'Optimizing XML querying using type-based document projection',
 'Matrix and Graph Operations for Relationship Inference: An Illustration with the Kinship Inference in the China Biographical Database',
 'Matrix and Graph Operations for Relationship Inference: An Illustration with the Kinship Inference in the China Biographical Database']

In [168]:
#user-6
recommendations = recommend('Evolvable Systems for Big Data Management in Business')

In [169]:
recommendations

['BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking',
 'Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources',
 'Identifying Dwarfs Workloads in Big Data Analytics',
 'A Hybrid ICT-Solution for Smart Meter Data Analytics',
 'biggy: An Implementation of Unified Framework for Big Data Management System',
 'Energy Efficiency: The New Holy Grail of Data Management Systems Research',
 'Big Data: Challenges, Opportunities and Realities',
 'How to extract data from proprietary software database systems using TCP/IP?',
 'BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework',
 'Algorithm and approaches to handle large Data- A Survey']

In [193]:
#user-7
recommendations = recommend('Warehousing Web Data')

In [194]:
recommendations


['Web data modeling for integration in data warehouses',
 "Mod\\'elisation et extraction de donn\\'ees pour un entrep\\^ot objet",
 'DWEB: A Data Warehouse Engineering Benchmark',
 "Conception d'un banc d'essais d\\'ecisionnel",
 'An MAS-Based ETL Approach for Complex Data',
 'Query-driven Data Completeness Management (PhD Thesis)',
 'Index and Materialized View Selection in Data Warehouses',
 'On the Communication of Scientific Results: The Full-Metadata Format',
 'Data Mining-based Fragmentation of XML Data Warehouses',
 'Biomedical Data Warehouses']

In [195]:
#user-8
recommendations = recommend('Clustering-Based Materialized View Selection in Data Warehouses')

In [196]:
recommendations

["Un index de jointure pour les entrep\\^ots de donn\\'ees XML",
 '"The Whole Is Greater Than the Sum of Its Parts": Optimization in Collaborative Crowdsourcing',
 'Automatic Selection of Bitmap Join Indexes in Data Warehouses',
 'Dynamic index selection in data warehouses',
 'Index and Materialized View Selection in Data Warehouses',
 "S\\'election simultan\\'ee d'index et de vues mat\\'erialis\\'ees",
 'Data Mining-based Materialized View and Index Selection in Data Warehouses',
 'Multidimensi Pada Data Warehouse Dengan Menggunakan Rumus Kombinasi',
 'Using Ontologies for the Design of Data Warehouses',
 'User Profile-Driven Data Warehouse Summary for Adaptive OLAP Queries']

In [197]:
#user-9
recommendations = recommend('Frequent Query Matching in Dynamic Data Warehousing')

In [198]:
recommendations

['XWeB: the XML Warehouse Benchmark',
 'A Review of Star Schema Benchmark',
 'OptImatch: Semantic Web System with Knowledge Base for Query Performance Problem Determination',
 'Discovering More Accurate Frequent Web Usage Patterns',
 'Imprecise temporal associations and decision support systems',
 'Dynamic Decision Support System Based on Bayesian Networks Application to fight against the Nosocomial Infections',
 'Fast Algorithms for Mining Interesting Frequent Itemsets without Minimum Support',
 'How important tasks are performed: peer review',
 'Secure and Efficient Skyline Queries on Encrypted Data',
 'Materialized View Selection by Query Clustering in XML Data Warehouses']

In [199]:
#user-10
recommendations = recommend('Enabling Smart Data: Noise filtering in Big Data classification')

In [200]:
recommendations

['DPASF: A Flink Library for Streaming Data preprocessing',
 'Algorithm and approaches to handle large Data- A Survey',
 'Data Quality Principles in the Semantic Web',
 'An Approach to Handle Big Data Warehouse Evolution',
 'BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking',
 'Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service',
 'Data Integration for Supporting Biomedical Knowledge Graph Creation at Large-Scale',
 'A Minimal Variance Estimator for the Cardinality of Big Data Set Intersection',
 'A Hybrid ICT-Solution for Smart Meter Data Analytics',
 'Muppet: MapReduce-Style Processing of Fast Data']

In [201]:
#user-11
recommendations = recommend('Big Data: Challenges, Opportunities and Realities')

In [202]:
recommendations

['biggy: An Implementation of Unified Framework for Big Data Management System',
 'A Hybrid ICT-Solution for Smart Meter Data Analytics',
 'Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service',
 'Algorithm and approaches to handle large Data- A Survey',
 'Muppet: MapReduce-Style Processing of Fast Data',
 'BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking',
 'Big Data Dimensional Analysis',
 'An Integration-Oriented Ontology to Govern Evolution in Big Data Ecosystems',
 'DPASF: A Flink Library for Streaming Data preprocessing',
 'VDMS: Efficient Big-Visual-Data Access for Machine Learning Workloads']

In [203]:
#user-12
recommendations = recommend('Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale')

In [204]:
recommendations

['SAP HANA and its performance benefits',
 'Enabling Data Discovery through Virtual Internet Repositories',
 'InferSpark: Statistical Inference at Scale',
 'Towards a Unified Architecture for in-RDBMS Analytics',
 'Can the Elephants Handle the NoSQL Onslaught?',
 'Regression-based Online Anomaly Detection for Smart Grid Data',
 'A Survey on Geographically Distributed Big-Data Processing using MapReduce',
 'A Hybrid ICT-Solution for Smart Meter Data Analytics',
 'Columnar Database Techniques for Creating AI Features',
 'Security Implications of Distributed Database Management System Models']

In [205]:
#user-13
recommendations = recommend('Human-Centric Data Cleaning [Vision]')

In [206]:
recommendations

['Statistical Distortion: Consequences of Data Cleaning',
 'Private Exploration Primitives for Data Cleaning',
 'Public Data Integration with WebSmatch',
 'On the Relative Trust between Inconsistent Data and Inaccurate Constraints',
 'EmbedJoin: Efficient Edit Similarity Joins via Embeddings',
 'An Approach to Handle Big Data Warehouse Evolution',
 'Code Generation Techniques for Raw Data Processing',
 'Algorithm and approaches to handle large Data- A Survey',
 'BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework',
 'A Comparative Study on Remote Tracking of Parkinsons Disease Progression Using Data Mining Methods']

In [207]:
#user-14
recommendations = recommend('Pattern-Driven Data Cleaning')

In [208]:
recommendations

['QuickIM: Efficient, Accurate and Robust Influence Maximization Algorithm on Billion-Scale Networks',
 'Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations',
 'Ektelo: A Framework for Defining Differentially-Private Computations',
 'DESQ: Frequent Sequence Mining with Subsequence Constraints',
 'HoloClean: Holistic Data Repairs with Probabilistic Inference',
 'Efficient Destination Prediction Based on Route Choices with Transition Matrix Optimization',
 'Efficiently Charting RDF',
 'Error-Tolerant Big Data Processing',
 'Analytic Performance Model of a Main-Memory Index Structure',
 'GPU Accelerated Self-join for the Distance Similarity Metric']

In [210]:
#user-15
recommendations = recommend('On the Challenges of Collaborative Data Processing')

In [211]:
recommendations

['Industrial Big Data Analytics: Challenges, Methodologies, and Applications',
 "Classification non supervis\\'ee des donn\\'ees h\\'et\\'erog\\`enes \\`a large \\'echelle",
 'ReStore: Reusing Results of MapReduce Jobs',
 'Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS',
 'Editorial for the First Workshop on Mining Scientific Papers: Computational Linguistics and Bibliometrics',
 'Comparative Evaluation of Big-Data Systems on Scientific Image Analytics Workloads',
 'Array Requirements for Scientific Applications and an Implementation for Microsoft SQL Server',
 'Collaboration in an Open Data eScience: A Case Study of Sloan Digital Sky Survey',
 'Developing an ontology for the access to the contents of an archival fonds: the case of the Catasto Gregoriano',
 'Mining Scientific Papers for Bibliometrics: a (very) Brief Survey of Methods and Tools']

In [212]:
#user-16
recommendations = recommend('The Family of MapReduce and Large Scale Data Processing Systems')

In [213]:
recommendations

['Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS',
 'Effective Spatial Data Partitioning for Scalable Query Processing',
 'Processing Database Joins over a Shared-Nothing System of Multicore Machines',
 'Review of Apriori Based Algorithms on MapReduce Framework',
 'ReStore: Reusing Results of MapReduce Jobs',
 'Industrial Big Data Analytics: Challenges, Methodologies, and Applications',
 "Classification non supervis\\'ee des donn\\'ees h\\'et\\'erog\\`enes \\`a large \\'echelle",
 'Cascading map-side joins over HBase for scalable join processing',
 'InferSpark: Statistical Inference at Scale',
 'Distributed GraphLab: A Framework for Machine Learning in the Cloud']

In [214]:
#user-17
recommendations = recommend('Evaluating Accumulo Performance for a Scalable Cyber Data Processing Pipeline')

In [215]:
recommendations

['Big Data: Challenges, Opportunities and Realities',
 'BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking',
 'Data Management: Past, Present, and Future',
 'NewSQL: Towards Next-Generation Scalable RDBMS for Online Transaction Processing (OLTP) for Big Data Management',
 'A Hybrid ICT-Solution for Smart Meter Data Analytics',
 'Answering Analytical Queries on Text Data with Temporal Term Histograms',
 'Datom: Towards modular data management',
 'Algorithm and approaches to handle large Data- A Survey',
 'GRAPHITE: An Extensible Graph Traversal Framework for Relational Database Management Systems',
 'Sketch-based Querying of Distributed Sliding-Window Data Streams']

In [216]:
#user-18
recommendations = recommend('Reprowd: Crowdsourced Data Processing Made Reproducible')

In [217]:
recommendations

['Make Research Data Public? -- Not Always so Simple: A Dialogue for Statisticians and Science Editors',
 'DeepLens: Towards a Visual Data Management System',
 'CaosDB - Research Data Management for Complex, Changing, and Automated Research Workflows',
 'Mining Open Government Data Used in Scientific Research',
 'Experimental Research Data Quality In Materials Science',
 'Experimental Research Data Quality In Materials Science',
 'Astronomy 3.0 Style',
 'Identification of multidisciplinary research based upon dissimilarity analysis of journals included in reference lists of Wageningen University & Research articles',
 'The use of microblogging for field-based scientific research',
 'A Framework to Explore the Knowledge Structure of Multidisciplinary Research Fields']

In [218]:
#user-19
recommendations = recommend('Data Processing Benchmarks')

In [219]:
recommendations

['Database Benchmarks',
 'On Big Data Benchmarking',
 'A Survey on Geographically Distributed Big-Data Processing using MapReduce',
 'Error-Tolerant Big Data Processing',
 'PRESISTANT: Learning based assistant for data pre-processing',
 'Data Warehouse Benchmarking with DWEB',
 'PRIMEBALL: a Parallel Processing Framework Benchmark for Big Data Applications in the Cloud',
 'Integrated Data Acquisition, Storage, Retrieval and Processing Using the COMPASS DataBase (CDB)',
 'Effective Spatial Data Partitioning for Scalable Query Processing',
 'Lara: A Key-Value Algebra underlying Arrays and Relations']

In [220]:
#user-20
recommendations = recommend('NEURON: Query Optimization Meets Natural Language Processing For Augmenting Database Education')

In [221]:
recommendations

['Knowledge Rich Natural Language Queries over Structured Biological Databases',
 'DBMSs Should Talk Back Too',
 'Alg\\`ebre OLAP et langage graphique',
 'Wild Card Queries for Searching Resources on the Web',
 'CL Scholar: The ACL Anthology Knowledge Graph Miner',
 'Competency Questions and SPARQL-OWL Queries Dataset and Analysis',
 'A fusion algorithm for joins based on collections in Odra (Object Database for Rapid Application development)',
 'Scalable Semantic Querying of Text',
 'The DeLiVerMATH project - Text analysis in mathematics',
 'Generating Exact- and Ranked Partially-Matched Answers to Questions in Advertisements']

In [222]:
#user-21
recommendations = recommend('Tagging Scientific Publications using Wikipedia and Natural Language Processing Tools. Comparison on the ArXiv Dataset')

In [223]:
recommendations

['Wikibook-Bot - Automatic Generation of a Wikipedia Book',
 'POS Tagging and its Applications for Mathematics',
 'Helix: Accelerating Human-in-the-loop Machine Learning',
 'Database-Agnostic Workload Management',
 'A survey on the importance of visualization and social collaboration in academic digital libraries',
 "The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section",
 'Differential Privacy and Machine Learning: a Survey and Review',
 'Processing Analytical Workloads Incrementally',
 'How Developers Iterate on Machine Learning Workflows -- A Survey of the Applied Machine Learning Literature',
 'Semi-automated Annotation of Signal Events in Clinical EEG Data']

In [224]:
#user-22
recommendations = recommend('Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017)')

In [225]:
recommendations

['Report on the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018)',
 'Handling Massive N-Gram Datasets Efficiently',
 'Robust Text-to-SQL Generation with Execution-Guided Decoding',
 'Ranking State-of-the-art Papers via Incomplete Tournaments Induced by Citations from Performance Tables',
 'Error-Tolerant Big Data Processing',
 'QuickIM: Efficient, Accurate and Robust Influence Maximization Algorithm on Billion-Scale Networks',
 'Efficient Destination Prediction Based on Route Choices with Transition Matrix Optimization',
 'MaskLink: Efficient Link Discovery for Spatial Relations via Masking Areas',
 'CDAS: A Crowdsourcing Data Analytics System',
 'Gradual Machine Learning for Entity Resolution']

In [226]:
#user-22
recommendations = recommend('File mapping Rule-based DBMS and Natural Language Processing')

In [227]:
recommendations

['Extraction of Historical Events from Wikipedia',
 'Editorial for the Proceedings of the Workshop Knowledge Maps and Information Retrieval (KMIR2014) at Digital Libraries 2014',
 'Semantic Visualization and Navigation in Textual Corpus',
 'Document Selection in a Distributed Search Engine Architecture',
 'Towards a Ranking Model for Semantic Layers over Digital Archives',
 'A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web',
 'Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity',
 'Information Carriers and Identification of Information Objects: An Ontological Approach',
 'Enhancing Invenio Digital Library With An External Relevance Ranking Engine',
 'Moving Objects Analytics: Survey on Future Location & Trajectory Prediction Methods']

In [230]:
#user-23
recommendations = recommend('Social media in scholarly communication')

In [231]:
recommendations

['Tweeting biomedicine: an analysis of tweets and citations in the biomedical literature',
 'Interpreting "altmetrics": viewing acts on social media through the lens of citation and social theories',
 'Towards the social media studies of science: social media metrics, present and future',
 'Determining sentiment in citation text and analyzing its impact on the proposed ranking index',
 'When is an article actually published? An analysis of online availability, publication, and indexation dates',
 'TrueReview: A Platform for Post-Publication Peer Review',
 'Scholarly use of social media and altmetrics: a review of the literature',
 'True Peer Review',
 'Do altmetrics correlate with the quality of papers? A large-scale empirical study based on F1000Prime data',
 'Coverage and adoption of altmetrics sources in the bibliometric community']

In [232]:
#user-24
recommendations = recommend('Think before you collect: Setting up a data collection approach for social media studies')

In [233]:
recommendations

['Towards the social media studies of science: social media metrics, present and future',
 'Tweeting biomedicine: an analysis of tweets and citations in the biomedical literature',
 'Interpreting "altmetrics": viewing acts on social media through the lens of citation and social theories',
 'Determining sentiment in citation text and analyzing its impact on the proposed ranking index',
 'Can We Count on Social Media Metrics? First Insights into the Active Scholarly Use of Social Media',
 'When is an article actually published? An analysis of online availability, publication, and indexation dates',
 'Social Media Attention Increases Article Visits: An Investigation on Article-Level Referral Data of PeerJ',
 'Altmetrics in the wild: Using social media to explore scholarly impact',
 'The role of twitter in the life cycle of a scientific publication',
 'Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives']

In [234]:
#user-25
recommendations = recommend('Tracking the Digital Footprints to Scholarly Articles from Social Media')

In [235]:
recommendations

['Social Media Attention Increases Article Visits: An Investigation on Article-Level Referral Data of PeerJ',
 'Data, Science and Society',
 'Detecting and Tracking The Real-time Hot Topics: A Study on Computational Neuroscience',
 'On Demand Memory Specialization for Distributed Graph Databases',
 'User Interests in German Social Science Literature Search - A Large Scale Log Analysis',
 'Make Research Data Public? -- Not Always so Simple: A Dialogue for Statisticians and Science Editors',
 'Interdisciplinarity as Diversity in Citation Patterns among Journals: Rao-Stirling Diversity, Relative Variety, and the Gini coefficient',
 'A Plan For Curating "Obsolete Data or Resources"',
 'The Use of Scientific Data: A Content Analysis',
 'Online Scientific Data Curation, Publication, and Archiving']

In [236]:
#user-26
recommendations = recommend('Anatomy of Scholarly Information Behavior Patterns in the Wake of Academic Social Media Platforms')

In [237]:
recommendations

['Truthy: Enabling the Study of Online Social Networks',
 'Privacy Preserving Social Network Publication Against Mutual Friend Attacks',
 'Geo-Social Group Queries with Minimum Acquaintance Constraint',
 'Towards the social media studies of science: social media metrics, present and future',
 'Think before you collect: Setting up a data collection approach for social media studies',
 'Interpreting "altmetrics": viewing acts on social media through the lens of citation and social theories',
 'Efficient Core Maintenance in Large Dynamic Graphs',
 'Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective',
 'Data Mining on Social Interaction Networks',
 'Attendance Maximization for Successful Social Event Planning']