Scala data pipeline for processing Amazon movie review data using Kafka and Spark Streaming

asutoshparida/di-amazon-review

amazon-review-pipeline

This repository contains all the code for the Amazon movie review data pipeline.


High Level Design

[Figure: high-level design diagram]


Synopsis

Data

Ratings: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Movies_and_TV.csv
Metadata: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Movies_and_TV.json.gz
A description of the dataset can be found at http://jmcauley.ucsd.edu/data/amazon/links.html



Assignment

Create a data pipeline to extract data from the CSV/JSON files, and import it into a database/DFS in order to perform analysis.
Your solution must:

    download the source data in the pipeline itself.
    have proper error handling/logging.
    show expressive, re-usable, clean code.
    handle duplicates.
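
The error-handling and duplicate-handling requirements above can be illustrated with a minimal sketch in plain Scala. It is not the code in this repository: the `Rating` type, the `RatingParser` object, the assumed column order (user, item, rating, timestamp) and the choice of (user, item, timestamp) as the duplicate key are all assumptions made for illustration.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Hypothetical record type; the column order (user, item, rating, timestamp) is an assumption.
final case class Rating(userId: String, productId: String, rating: Double, unixTime: Long)

object RatingParser {
  // Parse one CSV line. A malformed line becomes a Failure instead of throwing.
  def parse(line: String): Try[Rating] = Try {
    val cols = line.split(",", -1).map(_.trim)
    require(cols.length == 4, s"expected 4 columns, got ${cols.length}")
    Rating(cols(0), cols(1), cols(2).toDouble, cols(3).toLong)
  }

  // Keep the first record seen per (user, product, timestamp) key and
  // collect one message per rejected line so the caller can log it.
  def parseAndDedupe(lines: Iterator[String]): (Vector[Rating], Vector[String]) = {
    val seen = mutable.LinkedHashMap.empty[(String, String, Long), Rating]
    val errors = Vector.newBuilder[String]
    lines.foreach { line =>
      parse(line) match {
        case Success(r) =>
          val key = (r.userId, r.productId, r.unixTime)
          if (!seen.contains(key)) seen(key) = r
        case Failure(e) =>
          errors += s"skipped '$line': ${e.getMessage}"
      }
    }
    (seen.values.toVector, errors.result())
  }
}
```

Returning the rejected lines alongside the clean records keeps the parsing step free of side effects; a real pipeline would probably send those messages to its logger or to a dead-letter location.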

It would be good if your solution could:

    be able to handle the CSV/JSON files as a stream/abstract the file reading into a streaming model.
    use a workflow orchestration tool to schedule the pipeline.
    use docker containers.
    be scalable in case new data flows in at high volume (10x bigger) and has to be imported on a regular basis.
    describe the data store and tools used to query it - including the presentation layer.
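
The idea of abstracting file reading into a streaming model can be sketched without any streaming framework. The example below is an assumption-laden illustration, not the repository's implementation: `LineStream` and `inBatches` are hypothetical names, and a real deployment would more likely publish each batch to a message broker than collect results in memory.

```scala
import scala.io.Source

object LineStream {
  // Read lines lazily and pass them to `handle` in fixed-size batches.
  // Only one batch of input is held in memory at a time; the source is always closed.
  def inBatches[A](source: Source, batchSize: Int)(handle: Seq[String] => A): Vector[A] =
    try source.getLines().grouped(batchSize).map(handle).toVector
    finally source.close()
}
```

For example, `LineStream.inBatches(Source.fromFile("ratings.csv"), 1000)(batch => batch.size)` would walk a local file (the file name here is hypothetical) a thousand lines at a time and return the size of each batch.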

The final result must answer the following questions:

    What are the top 5 and bottom 5 movies in terms of overall average review ratings for a given month?
    For a given month, what are the 5 movies whose average monthly ratings increased the most compared with the previous month?
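
The two questions above can be expressed as a small in-memory sketch. This is only an illustration of the aggregation logic, not the queries used by this project: the `Review` type and `MonthlyStats` object are hypothetical, timestamps are assumed to be Unix seconds interpreted in UTC, ties are broken by product id, and a movie with no ratings in the previous month is left out of the "increased the most" ranking.

```scala
import java.time.{Instant, YearMonth, ZoneOffset}

// Hypothetical record type; unixTime is assumed to be seconds since the epoch.
final case class Review(productId: String, rating: Double, unixTime: Long)

object MonthlyStats {
  // Calendar month (UTC) in which a review was written.
  def monthOf(unixTime: Long): YearMonth =
    YearMonth.from(Instant.ofEpochSecond(unixTime).atZone(ZoneOffset.UTC))

  // Average rating per product for one calendar month.
  def averages(reviews: Seq[Review], month: YearMonth): Map[String, Double] =
    reviews
      .filter(r => monthOf(r.unixTime) == month)
      .groupBy(_.productId)
      .map { case (id, rs) => id -> rs.map(_.rating).sum / rs.size }

  // Highest monthly averages first.
  def top(reviews: Seq[Review], month: YearMonth, n: Int = 5): Seq[(String, Double)] =
    averages(reviews, month).toSeq.sortBy { case (id, avg) => (-avg, id) }.take(n)

  // Lowest monthly averages first.
  def bottom(reviews: Seq[Review], month: YearMonth, n: Int = 5): Seq[(String, Double)] =
    averages(reviews, month).toSeq.sortBy { case (id, avg) => (avg, id) }.take(n)

  // Largest increase in monthly average compared with the previous month.
  // Products without ratings in the previous month are excluded.
  def mostImproved(reviews: Seq[Review], month: YearMonth, n: Int = 5): Seq[(String, Double)] = {
    val current = averages(reviews, month)
    val previous = averages(reviews, month.minusMonths(1))
    current.toSeq
      .collect { case (id, avg) if previous.contains(id) => id -> (avg - previous(id)) }
      .sortBy { case (id, delta) => (-delta, id) }
      .take(n)
  }
}
```

At the 10x volume mentioned earlier the same grouping and sorting would be pushed down to the data store or a distributed engine rather than done over a local `Seq`, but the shape of the computation stays the same.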

Requirements

JDK 1.8 or higher (Spark 2.3.x requires Java 8+)
Maven
Scala 2.11.11
Spark 2.3.3
Apache Kafka 2.4.1 (Scala 2.11 build, kafka_2.11-2.4.1)
Elasticsearch 5.x
MySQL
*AWS Lambda
*AWS EMR

* Required only if you want to run the job on AWS EMR using AWS Lambda.

Build

*** Before building, change the entries in /config/pipeline_config.json to match your configuration ***

From the project root directory, run:
> mvn clean install -U

Run

Use /script/lambda/amazon_review_lambda.py to configure a function on AWS Lambda,
changing the VPC and security group configuration to match your own.

Alternatively, for a dry run, execute ReviewDataController.scala first and then CsvProducer.scala, each as a Scala application from IntelliJ.
