Big Data Analysis on Netflix Prize Dataset with Hadoop

Introduction

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. [1]

The goal of this project is to implement statistical functions on big data using Apache Hadoop, a distributed open-source platform.

The Netflix Prize dataset was chosen as the input data. Follow the link to download the dataset.

Statistical functions:

  • Histogram: View rate of each movie.

  • Mean: Average star rating of each movie.

  • Median: Median star rating of each movie.

  • Mode and Count: Mode of the star ratings of each movie and its count.

  • Standard Deviation: Standard deviation of the star distribution over movies.
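To make the statistics above concrete, here is a minimal sketch in plain Python that mimics the map, shuffle, and reduce steps on a handful of in-memory records. It is not the code shipped in the jar files. The sample ratings, the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`), and the choice of population standard deviation are all assumptions made for illustration. The `count` field is one possible reading of the histogram's "view rate", namely the number of ratings a movie received.

```python
from collections import Counter, defaultdict
from math import sqrt
from statistics import median

# Hypothetical sample records: (movie_id, star_rating) pairs.
RATINGS = [
    (1, 3), (1, 5), (1, 4), (1, 4),
    (2, 1), (2, 2), (2, 2),
]


def map_phase(records):
    """Emit one (movie_id, star) pair per rating, as a mapper would."""
    for movie_id, star in records:
        yield movie_id, star


def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(stars):
    """Compute the statistics for one movie from all of its star ratings."""
    n = len(stars)
    mean = sum(stars) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((s - mean) ** 2 for s in stars) / n
    mode, mode_count = Counter(stars).most_common(1)[0]
    return {
        "count": n,
        "mean": mean,
        "median": median(stars),
        "mode": mode,
        "mode_count": mode_count,
        "std": sqrt(variance),
    }


def run_job(records):
    """Run map, shuffle, and reduce in sequence and return per-movie results."""
    grouped = shuffle(map_phase(records))
    return {movie_id: reduce_phase(stars) for movie_id, stars in sorted(grouped.items())}


stats = run_job(RATINGS)
for movie_id, result in stats.items():
    print(movie_id, result)
```

In a real Hadoop job each statistic would typically be its own reducer over the grouped ratings; computing them together here only keeps the sketch short.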

How to Use

  1. Download the Netflix Prize dataset from the following link.

  2. Install Java Development Kit 8:

    $ sudo apt install openjdk-8-jdk
  3. Install Apache Hadoop 2.7.7 following the tutorial: HADOOP_HDFS_INSTALLATION.md.

  4. Clone the repository:

    $ git clone https://github.com/cansuyildiz/HadoopNetflixPrizeDataset
  5. Run the jar files with the following parameters:

    $ cd executable/
    $ java -jar AverageStarPerMovie.jar <hdfs_input> <hdfs_output>
  6. Examples:

    $ java -jar AverageStarPerMovie.jar /user/netflix /user/netflix_output/
    $ java -jar HistogramOfMovies.jar /user/netflix /user/netflix_output/
    $ java -jar MedianPerMovie.jar /user/netflix /user/netflix_output/
    $ java -jar ModAndCountPerMovie.jar /user/netflix /user/netflix_output/
    $ java -jar StandartDeviationPerMovie.jar /user/netflix /user/netflix_output/
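The example commands above all write to the same output path. Hadoop's standard file output format normally refuses to write into an output directory that already exists, so if the jobs do not clear that directory themselves, running them back to back may need a separate directory per job. The sketch below only builds the command lines and prints them. The helper name `build_commands` and the per-job subdirectory layout are assumptions for illustration, not part of the repository.

```python
# Jar names as listed in the examples above.
JOB_JARS = [
    "AverageStarPerMovie.jar",
    "HistogramOfMovies.jar",
    "MedianPerMovie.jar",
    "ModAndCountPerMovie.jar",
    "StandartDeviationPerMovie.jar",
]


def build_commands(hdfs_input, output_root):
    """Return one argv list per job, each with its own output directory.

    Assumption: giving every job a subdirectory named after its jar is a
    convention chosen for this sketch only.
    """
    commands = []
    root = output_root.rstrip("/")
    for jar in JOB_JARS:
        job_name = jar[: -len(".jar")]
        output_dir = f"{root}/{job_name}"
        commands.append(["java", "-jar", jar, hdfs_input, output_dir])
    return commands


commands = build_commands("/user/netflix", "/user/netflix_output/")
for argv in commands:
    print(" ".join(argv))
```

Each printed line can be pasted into a shell from the `executable/` directory, or passed to `subprocess.run(argv, check=True)` to run the jobs one after another.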

References

[1] https://hadoop.apache.org/
