Implementation of multiple NoSQL database technologies (Neo4j, MongoDB and Apache Spark) to handle a large scale data set of scientific publications.

albertopirillo/smbud-project-2022

Systems and Methods for Big and Unstructured Data - Project

The aim of the project is to compare different NoSQL technologies, in particular Neo4j, MongoDB and Apache Spark. To do so, we implemented a bibliographic storage solution capable of supporting a large-scale data set of publications of different types, such as scientific papers, books and articles.

The complete report of the project is available here

Pre-processing

Extensive pre-processing was needed to make the downloaded data sets a good fit for our project. We used Python, and in particular the Pandas library. All the scripts and notebooks we used are in this repository.

The project report contains detailed instructions on how to use these scripts to regenerate the exact data sets we used and to rerun the exact queries we executed.
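As a rough illustration of the kind of cleaning step such a script might perform, here is a minimal Pandas sketch. The column names (`id`, `title`, `year`), the sample rows and the `clean_publications` helper are all hypothetical and are not taken from the repository. The actual scripts and the real AMiner schema may differ.

```python
import pandas as pd

# Hypothetical raw records; the real AMiner columns and values may differ.
raw = pd.DataFrame(
    {
        "id": ["p1", "p2", "p2", "p3"],
        "title": ["  Graph Databases ", "Document Stores", "Document Stores", None],
        "year": ["2019", "2020", "2020", "2021"],
    }
)


def clean_publications(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows and normalise column types (illustrative only)."""
    out = df.dropna(subset=["title"])         # discard records without a title
    out = out.drop_duplicates(subset=["id"])  # keep the first record per identifier
    out = out.assign(
        title=out["title"].str.strip(),       # trim stray whitespace
        year=out["year"].astype(int),         # store years as integers
    )
    return out.reset_index(drop=True)


cleaned = clean_publications(raw)
print(cleaned)
```

Keeping the cleaning logic in a single function makes it straightforward to apply the same steps to every input file before loading the result into a database.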

Setup

Install all the required Python packages with:

pip install -r requirements.txt

First delivery

The first delivery covered Neo4j, a graph database. We used a data set downloaded from AMiner. After additional pre-processing, we imported the data set into Neo4j and ran a set of queries and commands on it.

The report of this delivery is available here

Second delivery

The second delivery covered MongoDB, a document-oriented database. We used the same AMiner data set, plus some additional data sets that highlight how MongoDB handles sub-documents. After additional pre-processing, we imported the data into MongoDB and ran a set of queries and commands on it.
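To make the idea of sub-documents concrete, the sketch below groups flat (paper, author) rows into one document per paper with an embedded `authors` array. It uses only the Python standard library. The field names, the sample values and the `embed_authors` helper are hypothetical and do not reflect the schema used in the project.

```python
import json

# Hypothetical flat rows, one per (paper, author) pair.
rows = [
    {"paper_id": "p1", "title": "Paper One", "author": "Author A", "affiliation": "University X"},
    {"paper_id": "p1", "title": "Paper One", "author": "Author B", "affiliation": "University Y"},
    {"paper_id": "p2", "title": "Paper Two", "author": "Author A", "affiliation": "University X"},
]


def embed_authors(flat_rows):
    """Build one document per paper, embedding its authors as sub-documents."""
    docs = {}
    for row in flat_rows:
        # Create the paper document the first time its identifier is seen.
        doc = docs.setdefault(
            row["paper_id"],
            {"_id": row["paper_id"], "title": row["title"], "authors": []},
        )
        # Each author becomes a nested sub-document inside the paper.
        doc["authors"].append(
            {"name": row["author"], "affiliation": row["affiliation"]}
        )
    return list(docs.values())


documents = embed_authors(rows)
print(json.dumps(documents[0], indent=2))
```

A list of dictionaries shaped like `documents` is the form that a MongoDB client such as PyMongo accepts for bulk insertion, for example through `Collection.insert_many`. Whether the project loaded its data this way is not stated here.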

The report of this delivery is available here

Third delivery

The third delivery covered Apache Spark, a framework for large-scale data processing. We used the same AMiner data set. After additional pre-processing, we loaded the data set into Apache Spark and ran a set of queries and commands on it.

The report of this delivery is available here

License

Licensed under the MIT License.
