SMACK Yelp DataSet

This project was created to process the data contained in the Yelp dataset, in an effort to extract some interesting insights from it while using some of the technologies of the SMACK stack. I will try to use all of them, but that's not required for the final solution.

The initial idea is to download, read, transform, and serve the dataset so the end user can explore the data in a simpler way.

Getting Started

These instructions will get a copy of the whole pipeline up and running on your local machine. See deployment for notes on how to use this simple pipeline.

Prerequisites

Since all the services in this pipeline are containerized, you won't need to install much: just Docker.

Installing

  • Clone the repository
git clone git@github.com:geektimus/smack-yelp.git
cd smack-yelp

Running the tests

This pipeline only contains tests for the Spark job and the Akka consumers.

To run the tests, run the following inside each project folder (parser, serving):

sbt test
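
As a rough illustration only, here is a minimal sketch of what a parser test might look like, assuming ScalaTest (the class and assertion below are hypothetical, not taken from the repository):

    import org.scalatest.flatspec.AnyFlatSpec
    import org.scalatest.matchers.should.Matchers

    // Hypothetical suite; the real tests live in the parser and serving projects.
    class CategoriesSpec extends AnyFlatSpec with Matchers {
      "a raw category string" should "split into trimmed category names" in {
        val raw = "Restaurants, Mexican , Bars"
        raw.split(",").map(_.trim) shouldEqual Array("Restaurants", "Mexican", "Bars")
      }
    }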

Docker image/container creation

To create all the containers required for this project you only need to use the Makefile inside the parser project.

Commands

make build: Builds the Docker image codingmaniacs/spark_base, version 2.4.0 (based on the Spark version inside the image), and then creates the fat JAR (the Spark job) that transforms the data and stores it in Cassandra.

make up: Uses the spark_base image to create four services: Spark Master, Spark Slave (scalable using docker-compose scale spark-slave=<number of instances>), Spark History, and a Cassandra DB. The Spark cluster resides on its own network, while the Cassandra DB sits on both the Spark network and the backend network (where the Akka microservices will be added). A sketch of how the Spark job could reach Cassandra over that shared network follows these commands.

make down: Tears down all the services along with their networks.
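
Because the Cassandra DB sits on the shared networks, the Spark job can reach it by service name. A minimal sketch, assuming the DataStax spark-cassandra-connector (the host, keyspace, table, and input path below are illustrative, not this project's actual configuration):

    import org.apache.spark.sql.SparkSession

    // Illustrative values: "cassandra" is assumed to be the service name on the
    // shared network; "yelp"/"business" are an assumed keyspace and table.
    val spark = SparkSession.builder()
      .appName("yelp-parser")
      .config("spark.cassandra.connection.host", "cassandra")
      .getOrCreate()

    val businessDf = spark.read.json("business.json") // assumed input file

    businessDf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "yelp", "table" -> "business"))
      .mode("append")
      .save()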

Usage

TODO

Built With

  • Scala - Base language
  • Spark - Data processing framework
  • Akka - Toolkit to build message-driven apps
  • Kafka - Distributed streaming platform
  • Docker - Containerize the parts of the pipeline

Useful Snippets

Get all the available categories of all the businesses

    import org.apache.spark.sql.functions.{col, udf}

    // Split the comma-separated category string into trimmed names,
    // guarding against null input and dropping empty entries.
    val getCategoriesAsArray = udf((cat: String) =>
      Option(cat).map(_.split(",").map(_.trim).filter(_.nonEmpty)).orNull
    )

    dataFrame
      .withColumn("categories_split", getCategoriesAsArray(col("categories")))
      .createOrReplaceTempView("business")

    dataFrame.sqlContext
      .sql("select distinct(explode(categories_split)) categories from business where categories_split is not null order by categories asc")
      .collect.foreach(println)
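
A UDF isn't strictly necessary here: Spark also ships a built-in split function that produces the same array directly (the regex below also swallows whitespace around the commas; dataFrame and the column names match the snippet above):

    import org.apache.spark.sql.functions.{col, split}

    // Same result with the built-in split function instead of a UDF
    dataFrame.withColumn("categories_split", split(col("categories"), "\\s*,\\s*"))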

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.