SMACK Yelp DataSet

This project was created to process the data contained in the Yelp dataset, in an effort to extract some interesting insights from it while using some of the technologies of the SMACK stack. I will try to use all of them, but that's not required for the final solution.

The initial idea is to download, read, transform, and serve the dataset so the end user can explore the data in a simpler way.

Getting Started

These instructions will get a copy of the whole pipeline up and running on your local machine. See deployment for notes on how to use this simple pipeline.

Prerequisites

Since all the services in this pipeline are containerized, you won't need to install much: just Docker.

Installing

  • Clone the repository
git clone git@github.com:geektimus/smack-yelp.git
cd smack-yelp

Running the tests

This pipeline only contains tests for the Spark job and the Akka consumers.

To run the tests, run the following inside each project folder (parser, serving):

sbt test
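
As a rough illustration only, here is a minimal sketch of what a parser test might look like, assuming ScalaTest (the class and assertion below are hypothetical, not taken from the repository):

    import org.scalatest.flatspec.AnyFlatSpec
    import org.scalatest.matchers.should.Matchers

    // Hypothetical suite; the real tests live in the parser and serving projects.
    class CategoriesSpec extends AnyFlatSpec with Matchers {
      "a raw category string" should "split into trimmed category names" in {
        val raw = "Restaurants, Mexican , Bars"
        raw.split(",").map(_.trim) shouldEqual Array("Restaurants", "Mexican", "Bars")
      }
    }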

Docker image/container creation

To create all the containers required for this project you only need to use the Makefile inside the parser project.

Commands

make build: Builds the Docker image codingmaniacs/spark_base, version 2.4.0 (based on the Spark version inside the image), and then creates the fat JAR (the Spark job) that transforms the data and stores it in Cassandra.

make up: Uses the spark_base image to create four services: Spark Master, Spark Slave (scalable using docker-compose scale spark-slave=<number of instances>), Spark History, and a Cassandra DB. The Spark cluster resides on its own network, while the Cassandra DB sits on both the Spark network and the backend network (where the Akka microservices will be added). A sketch of how the Spark job could reach Cassandra over that shared network follows these commands.

make down: Tears down all the services along with their networks.
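
Because the Cassandra DB sits on the shared networks, the Spark job can reach it by service name. A minimal sketch, assuming the DataStax spark-cassandra-connector (the host, keyspace, table, and input path below are illustrative, not this project's actual configuration):

    import org.apache.spark.sql.SparkSession

    // Illustrative values: "cassandra" is assumed to be the service name on the
    // shared network; "yelp"/"business" are an assumed keyspace and table.
    val spark = SparkSession.builder()
      .appName("yelp-parser")
      .config("spark.cassandra.connection.host", "cassandra")
      .getOrCreate()

    val businessDf = spark.read.json("business.json") // assumed input file

    businessDf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "yelp", "table" -> "business"))
      .mode("append")
      .save()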

Usage

TODO

Built With

  • Scala - Base language
  • Spark - Data processing framework
  • Akka - Toolkit to build message-driven apps
  • Kafka - Distributed streaming platform
  • Docker - Containerize the parts of the pipeline

Useful Snippets

Get all the available categories of all the businesses

    import org.apache.spark.sql.functions.{col, udf}

    // Split the comma-separated category string into trimmed names,
    // guarding against null input and dropping empty entries.
    val getCategoriesAsArray = udf((cat: String) =>
      Option(cat).map(_.split(",").map(_.trim).filter(_.nonEmpty)).orNull
    )

    dataFrame
      .withColumn("categories_split", getCategoriesAsArray(col("categories")))
      .createOrReplaceTempView("business")

    dataFrame.sqlContext
      .sql("select distinct(explode(categories_split)) categories from business where categories_split is not null order by categories asc")
      .collect.foreach(println)
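
A UDF isn't strictly necessary here: Spark also ships a built-in split function that produces the same array directly (the regex below also swallows whitespace around the commas; dataFrame and the column names match the snippet above):

    import org.apache.spark.sql.functions.{col, split}

    // Same result with the built-in split function instead of a UDF
    dataFrame.withColumn("categories_split", split(col("categories"), "\\s*,\\s*"))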

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.