SMACK Yelp DataSet

This project was created to process the data contained in the Yelp dataset, in an effort to extract some interesting insights from it while using some of the technologies of the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I will try to use all of them, but that is not required for the end solution.

The initial idea is to download, read, transform and serve the dataset so the end user can explore the data in a simpler way.
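
To give a feel for the read and transform steps, here is a minimal sketch in Spark (the session setup and file name are illustrative assumptions, not taken from the project; the file name follows the Yelp dataset's usual naming convention):

    import org.apache.spark.sql.SparkSession

    // Hypothetical entry point for the parser job.
    val spark = SparkSession.builder()
      .appName("yelp-parser")
      .getOrCreate()

    // Each line of the Yelp dump is a JSON document, so spark.read.json fits.
    val business = spark.read.json("data/yelp_academic_dataset_business.json")
    business.printSchema()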

Getting Started

These instructions will get a copy of the whole pipeline up and running on your local machine. See the Docker image/container creation section below for notes on how to run this simple pipeline.

Prerequisites

Since all the services in this pipeline are containerized, the only thing you need to install is Docker.

Installing

  • Clone the repository and move into it:
git clone git@github.com:geektimus/smack-yelp.git
cd smack-yelp

Running the tests

This pipeline only contains tests for the Spark job and the Akka consumers.

To run the tests, just run

sbt test

in each of the project folders (parser, serving).

Docker image/container creation

To create all the containers required for this project, you only need to use the Makefile inside the parser project.

Commands

make build: Builds the Docker image codingmaniacs/spark_base, version 2.4.0 (based on the Spark version inside the image), then creates the fat JAR (the Spark job) that transforms the data and stores it in Cassandra.

make up: Uses the spark_base image to create four services: a Spark master, a Spark slave (scalable with docker-compose scale spark-slave=<number of instances>), a Spark history server, and a Cassandra DB. The Spark cluster resides on its own network; the Cassandra DB sits on both the Spark network and the backend network (where the Akka microservices will be added). See the example workflow after this list.

make down: Tears down all the services and their networks.
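
A typical end-to-end workflow might then look like this (scaling to two slave instances is only an example):

cd parser
make build
make up
docker-compose scale spark-slave=2
make down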

Usage

TODO

Built With

  • Scala - Base language
  • Spark - Data processing framework
  • Akka - Toolkit to build message-driven apps
  • Kafka - Distributed streaming platform
  • Docker - Containerization of the pipeline components

Useful Snippets

Get all the available categories across all the businesses

    import org.apache.spark.sql.functions.{col, udf}

    // Split the comma-separated category string into a trimmed array,
    // guarding against null values in the source column.
    val getCategoriesAsArray = udf((cat: String) =>
      Option(cat).map(_.split(",").map(_.trim).filter(_.nonEmpty)).orNull
    )

    dataFrame
      .withColumn("categories_split", getCategoriesAsArray(col("categories")))
      .createOrReplaceTempView("business")

    // The query runs against the "business" temp view registered above.
    dataFrame.sqlContext
      .sql("select distinct(explode(categories_split)) categories " +
        "from business where categories_split is not null order by categories asc")
      .collect.foreach(println)
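
Store the transformed frame in Cassandra

The make build command above mentions storing the transformed data in Cassandra. Here is a minimal sketch of that write, assuming the DataStax Spark Cassandra Connector is on the classpath (the yelp keyspace and business table names are illustrative, not taken from the project):

    // spark.cassandra.connection.host must point at the Cassandra service.
    dataFrame
      .withColumn("categories_split", getCategoriesAsArray(col("categories")))
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "yelp", "table" -> "business"))
      .mode("append")
      .save()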

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
