
ScholarsNet

"A Data Portal for Scholars" - COMP4332 course project @ HKUST

Features

  • A minimalist data portal for computer science papers
  • Easy paper and author search
  • High navigability
  • Detailed paper and author information
  • Relevance-ranked search results

Live Demo

https://scholarsnet.herokuapp.com/

Installing Dependencies (Tested on Ubuntu 16.04)

  • Update your system:

sudo apt-get update

  • Install Linux dependencies:

sudo apt-get install build-essential libssl-dev libffi-dev python3-dev

  • Install Docker (on Ubuntu the engine package is docker.io; the docker package is unrelated):

sudo apt-get install docker.io

  • Install Python virtual environment:

sudo apt-get install python-virtualenv

  • Go to the project directory:

cd ScholarsNet
  • Create a new virtual environment:

virtualenv -p python3 env --no-site-packages

  • Activate the virtual environment:

source env/bin/activate

  • Install Python dependencies:

pip install -r requirements.txt

Phase 1: Data Collection

  • arXiv

    • Go to the project root directory
    • cd data_retrieval/arxiv
    • python arxiv_updater.py
    • The script retrieves paper data from arXiv (see the API sketch after this list)
  • DBLP

    • Go to the project root directory
    • cd data_retrieval/dblp
    • python dblp_updater.py
    • The script retrieves paper data from DBLP (see the API sketch after this list)
  • IEEE

    • Open a new terminal and start a Splash instance for rendering JavaScript-heavy pages:

    sudo docker run -p 8050:8050 scrapinghub/splash

    • Go back to your original terminal
    • Go to the project root directory
    • cd data_retrieval/paper_crawlers
    • scrapy crawl ieee -o ieee.json
    • The crawler saves IEEE paper data to 'ieee.json' (see the spider sketch after this list)
  • ACM

    • Go to the project root directory
    • cd data_retrieval/paper_crawlers
    • scrapy crawl acm_journal -o acm.json
    • The crawler saves ACM paper data to 'acm.json'
  • Authors

    • Go to the project root directory
    • cd data_retrieval/paper_crawlers
    • scrapy crawl author -o authors.json
    • The crawler saves author data to 'authors.json'
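For context, a minimal sketch of the kind of fetching that arxiv_updater.py and dblp_updater.py perform, using the public arXiv Atom API and the DBLP JSON search API. The queries, result handling, and function names here are illustrative assumptions, not the scripts' actual logic:

# Minimal sketch (not the actual updater scripts): fetch paper metadata
# from the public arXiv and DBLP APIs. Query strings and field handling
# are illustrative assumptions.
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

def fetch_arxiv(query, max_results=5):
    """Fetch entries from the arXiv Atom API and return their titles."""
    url = ('http://export.arxiv.org/api/query?search_query=' +
           urllib.parse.quote(query, safe=':') + f'&max_results={max_results}')
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    return [entry.findtext(ATOM + 'title') for entry in feed.iter(ATOM + 'entry')]

def fetch_dblp(query, max_results=5):
    """Fetch publication hits from the DBLP search API and return their titles."""
    url = ('https://dblp.org/search/publ/api?format=json&h=' +
           str(max_results) + '&q=' + urllib.parse.quote(query))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    hits = data['result']['hits'].get('hit', [])
    return [hit['info']['title'] for hit in hits]

if __name__ == '__main__':
    print(fetch_arxiv('cat:cs.DB'))
    print(fetch_dblp('data integration'))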
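The IEEE, ACM, and author crawlers are Scrapy spiders; the IEEE one additionally renders pages through the Splash container started above. A minimal sketch of such a Splash-backed spider, where the start URL, CSS selectors, and item fields are hypothetical:

# Minimal sketch of a Splash-backed Scrapy spider (not the project's
# actual ieee spider): the start URL, selectors, and item fields are
# hypothetical. Requires the scrapy-splash package, the Splash container
# from the step above, and settings such as:
#   SPLASH_URL = 'http://localhost:8050'
#   DOWNLOADER_MIDDLEWARES = {
#       'scrapy_splash.SplashCookiesMiddleware': 723,
#       'scrapy_splash.SplashMiddleware': 725,
#   }
import scrapy
from scrapy_splash import SplashRequest

class PaperSpider(scrapy.Spider):
    name = 'paper_sketch'

    def start_requests(self):
        # Render the JavaScript-heavy listing page through Splash.
        yield SplashRequest(
            'https://example.org/papers',  # hypothetical listing URL
            callback=self.parse,
            args={'wait': 2},  # give the page time to render
        )

    def parse(self, response):
        # Hypothetical selectors; real pages need their own.
        for paper in response.css('div.paper'):
            yield {
                'title': paper.css('h2::text').get(),
                'authors': paper.css('span.author::text').getall(),
            }

A spider like this runs the same way as the project's: scrapy crawl paper_sketch -o out.json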

Phase 2: Schema Mapping & Entity Resolution

  • Generating the string edit distance matrix:

    • Go to the project root directory
    • cd entity_resolution
    • python schema_mapping.py
    • The output is written to 'edit_distance_output.txt' (see the edit-distance sketch after this list)
  • Entity resolution and generation of the unified schema:

    • Go to the project root directory
    • cd sqlite
    • python acm_interface.py
    • python arxiv_interface.py
    • python dblp_interface.py
    • python ieee_interface.py
    • python authors_interface.py
    • cd ../entity_resolution
    • python merge_table.py
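String edit distance here is Levenshtein distance. A minimal sketch of the computation and of a threshold-based same-entity test; the normalization and threshold are illustrative assumptions, not what schema_mapping.py or merge_table.py actually use:

# Minimal sketch of Levenshtein edit distance and a threshold-based
# same-entity test. The normalization and threshold are illustrative
# assumptions, not the project's actual logic.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def same_paper(title_a, title_b, max_relative_distance=0.2):
    """Treat two titles as the same entity if their edit distance is
    small relative to the longer title."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    limit = max(len(a), len(b)) or 1
    return edit_distance(a, b) / limit <= max_relative_distance

print(edit_distance('kitten', 'sitting'))             # 3
print(same_paper('Deep Learning', 'Deep  learning'))  # True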

Phase 3: Data Fusion

  • Exact String Match

    • Go to the project root directory
    • cd data_fusion
    • python jaccard_strict.py
  • Jaccard Similarity (see the similarity sketch after this list)

    • Go to the project root directory
    • cd data_fusion
    • python jaccard_strict.py
  • String Edit Distance

    • Go to the project root directory
    • cd data_fusion
    • python string_edit_distance.py
  • Jaccard SED

    • Go to the project root directory
    • cd data_fusion
    • python jaccard_sed.py
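A minimal sketch of the similarity measures named above: token-level Jaccard similarity, plus a blended score in the spirit of Jaccard SED. Tokenization, the blend weight, and the use of difflib as a stand-in for a normalized edit-distance similarity are illustrative assumptions, not the fusion scripts' actual parameters:

# Minimal sketch of the similarity measures named above. Tokenization,
# the blend weight, and difflib as a stand-in for normalized edit
# distance are illustrative assumptions, not the scripts' actual logic.
from difflib import SequenceMatcher

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def jaccard_sed(a, b, weight=0.5):
    """Blend token-set Jaccard with a character-level similarity
    (difflib here; a normalized edit distance works the same way)."""
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return weight * jaccard(a, b) + (1 - weight) * char_sim

# jaccard(...) == 1.0 is an exact token-set match (the "strict" case);
# lower thresholds accept near-duplicate titles.
print(jaccard('a survey of data fusion', 'data fusion a survey'))  # 0.8
print(jaccard_sed('Data Fusion Survey', 'A Survey on Data Fusion'))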

Phase 4: Data Mining / Auto Categorization

  • Generate Training Data

    • Go to the project root directory
    • cd text_mining
    • python generate_training_data.py
  • Train the Machine Learning Model and Pickle It (see the sketch after this list)

    • Go to the project root directory
    • cd text_mining
    • python model.py
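A minimal sketch of what a train-and-pickle step can look like, assuming a scikit-learn TF-IDF pipeline over paper text with category labels; the training data, model choice, and output file name are illustrative assumptions, not model.py's actual setup:

# Minimal sketch of training a text classifier and pickling it (not
# model.py's actual setup): the training data, model choice, and
# output file name are illustrative assumptions.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: paper abstracts and category labels.
texts = [
    'convolutional networks for image classification',
    'query optimization in relational database systems',
    'transport protocols for congested networks',
]
labels = ['AI', 'Databases', 'Networking']

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Pickle the fitted pipeline so the portal can categorize new papers.
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

print(model.predict(['a new index structure for relational queries']))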

Phase 5: Running the Portal Locally

  • Go to the project root directory
  • ./run.py
  • Open your favorite browser and go to http://localhost:5000
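Port 5000 is Flask's default, so the portal is presumably a Flask app. A minimal sketch of a Flask-style run.py entry point; the route and response are hypothetical, and the project's actual module layout may differ:

#!/usr/bin/env python3
# Minimal sketch of a Flask entry point (the project's actual run.py
# and module layout may differ; the route below is hypothetical).
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'ScholarsNet portal'

if __name__ == '__main__':
    # Serve on Flask's default port, matching http://localhost:5000
    app.run(host='0.0.0.0', port=5000, debug=True)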

Contributors