System and Methods for Big and Unstructured Data - Project

The aim of the project is to compare different NoSQL and big data technologies (in particular Neo4j, MongoDB and Apache Spark). This was done by implementing a bibliographic database storage solution capable of supporting a large-scale data set containing different types of publications, such as scientific papers, books and articles.

The complete report of the project is available here

Pre-processing

Extensive pre-processing was performed to make the downloaded data sets a good fit for our project. We used Python, in particular the Pandas library. All the scripts and notebooks we used are in this repository.

The project report contains detailed instructions on how to use these scripts to regenerate the exact data sets we used and to run the same queries we executed.
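As a rough illustration of the kind of Pandas pre-processing described above, the following sketch deduplicates records, drops rows without a title and normalizes a few columns. The column names (`id`, `title`, `year`, `authors`), the `;` separator and the sample rows are assumptions made for this example; they are not taken from the actual scripts in this repository.

```python
import pandas as pd

# Hypothetical raw records; the real data sets and column names may differ.
raw = pd.DataFrame({
    "id": ["p1", "p2", "p2", "p3"],
    "title": ["Graph Databases ", "Document Stores", "Document Stores", None],
    "year": ["2019", "2020", "2020", "2021"],
    "authors": ["A. Rossi;B. Bianchi", "C. Verdi", "C. Verdi", "D. Neri"],
})


def clean_publications(df):
    """Return a cleaned copy of a raw publications table."""
    cleaned = (
        df.drop_duplicates(subset="id")   # keep one row per publication id
          .dropna(subset=["title"])       # a publication without a title is unusable
          .copy()
    )
    cleaned["title"] = cleaned["title"].str.strip()          # trim stray whitespace
    cleaned["year"] = cleaned["year"].astype(int)            # years arrive as strings
    cleaned["authors"] = cleaned["authors"].str.split(";")   # one list of names per row
    return cleaned.reset_index(drop=True)


publications = clean_publications(raw)
```

Keeping the cleaning steps in one function makes it easier to apply the same rules to every data set before loading it into a database.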

Setup

Install all the required Python packages with:

pip install -r requirements.txt

First delivery

The first delivery was about Neo4j, a graph database. We used a data set downloaded from AMiner. After some additional pre-processing, the data set was uploaded into Neo4j and some queries and commands were executed.
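To give an idea of what loading a publication into a graph can look like, here is a minimal sketch that prepares a parameterized Cypher statement and its parameters. The node labels (`Publication`, `Author`), the `WROTE` relationship, the property names and the sample record are assumptions for illustration only; the graph model actually used is described in the report.

```python
# Hypothetical graph model: (Author)-[:WROTE]->(Publication).
PUBLICATION_QUERY = """
MERGE (p:Publication {id: $id})
SET p.title = $title, p.year = $year
WITH p
UNWIND $authors AS author_name
MERGE (a:Author {name: author_name})
MERGE (a)-[:WROTE]->(p)
"""


def to_cypher_params(record):
    """Map one pre-processed record to the parameters used by PUBLICATION_QUERY."""
    return {
        "id": record["id"],
        "title": record["title"],
        "year": int(record["year"]),        # store the year as a number
        "authors": list(record["authors"]), # one Author node per name
    }


# Made-up sample record, not part of the real data set.
record = {
    "id": "p1",
    "title": "Graph Databases",
    "year": "2019",
    "authors": ["A. Rossi", "B. Bianchi"],
}
params = to_cypher_params(record)
```

With the official `neo4j` Python driver, a pair like this could be executed inside a session with `session.run(PUBLICATION_QUERY, params)`. Using `MERGE` instead of `CREATE` keeps the load idempotent, so running it twice does not duplicate nodes or relationships.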

The report of this delivery is available here

Second delivery

The second delivery was about MongoDB, a document-oriented database. We used the same data set downloaded from AMiner, plus some additional data sets to highlight MongoDB's ability to handle sub-documents. After some additional pre-processing, the data was uploaded into MongoDB and some queries and commands were executed.
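The sub-document idea can be sketched as follows: flat rows from two tables are combined into one document per publication, with the authors embedded as an array of sub-documents. The field names, the helper `nest_authors` and the sample data are hypothetical and only meant to show the shape of the result, not the schema used in this project.

```python
from collections import defaultdict


def nest_authors(publications, authorships):
    """Embed author rows as sub-documents inside their publication document."""
    authors_by_publication = defaultdict(list)
    for row in authorships:
        authors_by_publication[row["publication_id"]].append(
            {"name": row["name"], "affiliation": row["affiliation"]}
        )
    return [
        {
            "_id": pub["id"],
            "title": pub["title"],
            "year": pub["year"],
            # Publications without known authors get an empty array.
            "authors": authors_by_publication.get(pub["id"], []),
        }
        for pub in publications
    ]


# Made-up sample rows, not part of the real data sets.
publications = [
    {"id": "p1", "title": "Graph Databases", "year": 2019},
    {"id": "p2", "title": "Document Stores", "year": 2020},
]
authorships = [
    {"publication_id": "p1", "name": "A. Rossi", "affiliation": "Example University"},
    {"publication_id": "p1", "name": "B. Bianchi", "affiliation": "Sample Institute"},
]

documents = nest_authors(publications, authorships)

# Dot notation reaches into the embedded author sub-documents.
affiliation_filter = {"authors.affiliation": "Example University"}
```

With `pymongo`, such documents could be stored with `collection.insert_many(documents)` and searched with `collection.find(affiliation_filter)`. Embedding the authors means a publication and its authors are read in a single lookup, without a join.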

The report of this delivery is available here

Third delivery

The third delivery was about Apache Spark, a framework for large-scale data processing. We used the same data set downloaded from AMiner. After some additional pre-processing, the data set was loaded into Apache Spark and some queries and commands were executed.

The report of this delivery is available here

Software

License

Licensed under the MIT License.

albertopirillo/smbud-project-2022

Folders and files

Latest commit

History

Repository files navigation

System and Methods for Big and Unstructured Data - Project

Pre-processing

Setup

First delivery

Second delivery

Third delivery

Software

License

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 3

Languages