The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. [1]
The goal of this project is to implement statistical functions on big data using Apache Hadoop, a distributed open-source platform.
The Netflix Prize dataset is used as the input data. Follow the link to download the dataset.
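Hadoop's programming model is MapReduce: a map step emits key/value pairs, the framework groups the values by key, and a reduce step collapses each group into a result. The sketch below imitates those three steps in plain Java, with no Hadoop dependency, to show how a per-movie aggregate can be computed. It is an illustration only, not the code in this repository: the class name `MapReduceSketch`, its method names, and the `movieId,userId,stars` record layout are all assumptions made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of map -> shuffle -> reduce.
public class MapReduceSketch {

    /** Map step: turn one record (assumed layout "movieId,userId,stars") into a (movieId, stars) pair. */
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return new AbstractMap.SimpleEntry<>(fields[0].trim(), Integer.parseInt(fields[2].trim()));
    }

    /** Shuffle step: collect all values that share the same key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: collapse one movie's ratings into a single number, here the rating count. */
    static int reduce(List<Integer> stars) {
        return stars.size();
    }

    /** Runs the three steps in sequence and returns movieId -> number of ratings. */
    static Map<String, Integer> ratingsPerMovie(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,100,5", "1,101,3", "2,100,4");
        System.out.println(ratingsPerMovie(lines)); // prints {1=2, 2=1}
    }
}
```

Swapping the body of `reduce` for another aggregate changes what is computed per movie without touching the map or shuffle steps, which is the main appeal of the model.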
Statistical functions:
- Histogram: View rate of each movie.
- Mean: Average star rating of each movie.
- Median: Median star rating of each movie.
- Mode and Count: Mode and count of the star ratings of each movie.
- Standard Deviation: Standard deviation of the star ratings of each movie.
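As a reference for what these statistics mean, the following is a minimal plain-Java sketch that computes them for the star ratings of a single movie. It is not the repository's implementation: the class `RatingStats` and its methods are hypothetical names, the input is assumed to be a non-empty array, ties for the mode are broken in favour of the lower star value, and the standard deviation is the population form (dividing by n), all of which are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: per-movie statistics over a non-empty array of star ratings.
public class RatingStats {

    /** Arithmetic mean of the ratings. */
    static double mean(int[] stars) {
        double sum = 0;
        for (int s : stars) {
            sum += s;
        }
        return sum / stars.length;
    }

    /** Median: middle value of the sorted ratings, or the average of the two middle values. */
    static double median(int[] stars) {
        int[] sorted = stars.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Mode and its count as {mode, count}; ties go to the lower star value (an assumption). */
    static int[] modeAndCount(int[] stars) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int s : stars) {
            counts.merge(s, 1, Integer::sum);
        }
        int mode = 0;
        int best = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            int count = e.getValue();
            if (count > best || (count == best && e.getKey() < mode)) {
                mode = e.getKey();
                best = count;
            }
        }
        return new int[] {mode, best};
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double standardDeviation(int[] stars) {
        double m = mean(stars);
        double squared = 0;
        for (int s : stars) {
            squared += (s - m) * (s - m);
        }
        return Math.sqrt(squared / stars.length);
    }

    public static void main(String[] args) {
        int[] stars = {5, 3, 4, 4, 1};
        int[] mc = modeAndCount(stars);
        System.out.println("mean=" + mean(stars));
        System.out.println("median=" + median(stars));
        System.out.println("mode=" + mc[0] + " count=" + mc[1]);
        System.out.println("std=" + standardDeviation(stars));
    }
}
```

For the sample ratings 5, 3, 4, 4, 1 the mean is 17/5 = 3.4, the median is 4, and the mode is 4 with a count of 2.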
- Download the Netflix Prize dataset from the following link.
- Install Java Development Kit 8:
    $ sudo apt install openjdk-8-jdk
- Install Apache Hadoop 2.7.7 following the tutorial: HADOOP_HDFS_INSTALLATION.md.
- Clone the repository:
    $ git clone https://github.com/cansuyildiz/HadoopNetflixPrizeDataset
- Run the jar files with parameters as shown below:
    $ cd executable/
    $ java -jar AverageStarPerMovie.jar <hdfs_input> <hdfs_output>
- Examples:
    $ java -jar AverageStarPerMovie.jar /user/netflix /user/netflix_output/
    $ java -jar HistogramOfMovies.jar /user/netflix /user/netflix_output/
    $ java -jar MedianPerMovie.jar /user/netflix /user/netflix_output/
    $ java -jar ModAndCountPerMovie.jar /user/netflix /user/netflix_output/
    $ java -jar StandartDeviationPerMovie.jar /user/netflix /user/netflix_output/