Big Data Analysis on Netflix Prize Dataset with Hadoop

Introduction

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. [1]

The goal of this project is to implement statistical functions on big data using Apache Hadoop, a distributed open-source platform.

The Netflix Prize dataset is used as the input data. Follow the link to download the dataset.

Statistical functions:

Histogram: View rate (number of ratings) of each movie.
Mean: Average star rating of each movie.
Median: Median star rating of each movie.
Mode and Count: Mode and count of the star ratings of each movie.
Standard Deviation: Standard deviation of the star ratings of each movie.
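
To make the definitions above concrete, here is a minimal plain-Java sketch that computes the mean, median, mode with its count, and standard deviation for the star ratings of a single movie. It is not the code from this repository and does not use Hadoop; the class name `RatingStats`, its method names, and the sample ratings are hypothetical. It also assumes that "count" means the number of occurrences of the mode and that the standard deviation is the population form (dividing by n), neither of which is stated above.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: statistics over the star ratings of one movie. */
final class RatingStats {

    /** Arithmetic mean of the ratings. */
    static double mean(int[] ratings) {
        double sum = 0;
        for (int r : ratings) sum += r;
        return sum / ratings.length;
    }

    /** Middle value of the sorted ratings, or the average of the two middle values. */
    static double median(int[] ratings) {
        int[] sorted = ratings.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Returns {mode, count}; ties are resolved in favor of the smallest rating. */
    static int[] modeAndCount(int[] ratings) {
        Map<Integer, Integer> freq = new TreeMap<>();
        for (int r : ratings) freq.merge(r, 1, Integer::sum);
        int mode = -1;
        int count = 0;
        for (Map.Entry<Integer, Integer> e : freq.entrySet()) {
            if (e.getValue() > count) {
                mode = e.getKey();
                count = e.getValue();
            }
        }
        return new int[] {mode, count};
    }

    /** Population standard deviation (assumption: divides by n, not n - 1). */
    static double standardDeviation(int[] ratings) {
        double m = mean(ratings);
        double squaredDiffs = 0;
        for (int r : ratings) squaredDiffs += (r - m) * (r - m);
        return Math.sqrt(squaredDiffs / ratings.length);
    }

    public static void main(String[] args) {
        int[] ratings = {5, 3, 4, 3, 1}; // hypothetical star ratings for one movie
        int[] mc = modeAndCount(ratings);
        System.out.println("mean   = " + mean(ratings));
        System.out.println("median = " + median(ratings));
        System.out.println("mode   = " + mc[0] + " (count " + mc[1] + ")");
        System.out.println("std    = " + standardDeviation(ratings));
    }
}
```

In a Hadoop job, logic like this would typically run in the reducer, once the ratings have been grouped by movie ID.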

How to Use

Download the Netflix Prize dataset from the following link.
Install Java Development Kit 8:
```
$ sudo apt install openjdk-8-jdk
```
Install Apache Hadoop 2.7.7 by following the tutorial in HADOOP_HDFS_INSTALLATION.md.

Clone the repository:

```
$ git clone https://github.com/cansuyildiz/HadoopNetflixPrizeDataset
```

Run the jar files with the HDFS input and output paths as parameters:

```
$ cd executable/
$ java -jar AverageStarPerMovie.jar <hdfs_input> <hdfs_output>
```

Examples:

```
$ java -jar AverageStarPerMovie.jar /user/netflix /user/netflix_output/
$ java -jar HistogramOfMovies.jar /user/netflix /user/netflix_output/
$ java -jar MedianPerMovie.jar /user/netflix /user/netflix_output/
$ java -jar ModAndCountPerMovie.jar /user/netflix /user/netflix_output/
$ java -jar StandartDeviationPerMovie.jar /user/netflix /user/netflix_output/
```
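
As a rough illustration of the kind of result a job such as HistogramOfMovies.jar produces, the sketch below counts ratings per movie locally in plain Java, without Hadoop. It is not the repository's implementation. It assumes the layout of the original Netflix Prize training files, where a `movieId:` header line is followed by `customerId,rating,date` lines; the input in this repository may be prepared differently. The class name `RatingCountSketch` and the sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical local stand-in for a "ratings per movie" job. */
final class RatingCountSketch {

    /**
     * Counts rating lines under each movie header.
     * Assumed input: "movieId:" header lines followed by "customerId,rating,date" lines.
     */
    static Map<Integer, Integer> countRatingsPerMovie(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        int currentMovie = -1; // no header seen yet
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.endsWith(":")) {
                // Header line: switch to a new movie ID.
                currentMovie = Integer.parseInt(line.substring(0, line.length() - 1));
            } else if (currentMovie != -1) {
                // Rating line: add one to the current movie's count.
                counts.merge(currentMovie, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the assumed format.
        List<String> lines = Arrays.asList(
                "1:",
                "1488844,3,2005-09-06",
                "822109,5,2005-05-13",
                "2:",
                "2059652,4,2005-09-05");
        System.out.println(countRatingsPerMovie(lines)); // prints {1=2, 2=1}
    }
}
```

A real MapReduce job would split this work: mappers emit a (movie ID, 1) pair per rating and reducers sum the values for each movie ID.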

References

[1] https://hadoop.apache.org/
