In this project, we evaluate the performance of Apache Spark and its MLlib library as a function of different parameters: data size, number of slave nodes, CPU cores, and so on.
Data used in the tests comes from a poker hand data set, which you can find here.
This repository contains:

- `Results`: this folder contains our graphs and explanations of our test method.
- `environment.sh`: a script you will probably need to adapt to your Spark installation; it sets the environment variables required to run PySpark.
- `init.sh`: an init script that downloads the data, unzips it and sets up the environment.
- `performance.py`: the main script to run.
- `testTree.py`: the training method of our program (a minimal sketch of this flow follows the list).
- `training.py`: called by `performance.py` to execute the tests and return their results.
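For orientation, here is a minimal sketch of what a training and evaluation step might look like with MLlib's RDD-based decision tree API. The file name, split ratio, and tree parameters are illustrative assumptions, not taken from this repository; the row layout assumed below (ten integer card features followed by a class label) matches the common UCI poker hand format.

```python
# Hypothetical sketch of an MLlib decision-tree benchmark step.
# File name, data layout and tree parameters are assumptions,
# not taken from this repository.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

sc = SparkContext(appName="mllib-benchmark-sketch")

def parse_line(line):
    # Assumed UCI layout: 10 integer card features, then the class label.
    values = [int(x) for x in line.split(",")]
    return LabeledPoint(values[-1], values[:-1])

points = sc.textFile("poker-hand-training.data").map(parse_line).cache()
train, test = points.randomSplit([0.8, 0.2], seed=42)

# Train a classifier over the 10 poker-hand classes.
model = DecisionTree.trainClassifier(
    train, numClasses=10, categoricalFeaturesInfo={},
    impurity="gini", maxDepth=5)

# Score the held-out split: zip predictions back with the true labels.
predictions = model.predict(test.map(lambda p: p.features))
pairs = test.map(lambda p: p.label).zip(predictions)
error = pairs.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("Test error: %.4f" % error)
sc.stop()
```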
To run the project:
- Run `./init.sh`
- Create a Spark master and slave(s) with the scripts in `$SPARK_HOME/sbin`
- Run `spark-submit performance.py outputfile NUMBER_OF_PARTITION`
Note: NUMBER_OF_PARTITION defines how the RDD is partitioned inside Spark. With too few partitions, the work is not parallelised enough. For our experiments, we set this number equal to the number of cores available.
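As an illustration of that note, here is one way the partition count could be applied when building the RDD. The file name and argument position are assumptions for the sketch, not taken from `performance.py`:

```python
# Hypothetical illustration of applying NUMBER_OF_PARTITION; the file
# name and argument positions are assumptions.
import sys
from pyspark import SparkContext

sc = SparkContext(appName="partitioning-sketch")
num_partitions = int(sys.argv[2])  # e.g. NUMBER_OF_PARTITION from spark-submit

# Ask Spark for at least num_partitions partitions when reading the file...
rdd = sc.textFile("poker-hand-training.data", minPartitions=num_partitions)

# ...or force an exact count on an existing RDD (this triggers a shuffle).
rdd = rdd.repartition(num_partitions)
print("Partitions in use: %d" % rdd.getNumPartitions())
sc.stop()
```

Setting this to the total number of available cores, as we do in our experiments, gives every core at least one task to work on.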