VyuWing-Learning/Data-Engineering-Bootcamp-Apache-Spark

Data Engineering Hands On: Apache Spark

Problem Description

Web server logs record every event registered with the website. This data offers insight into website visitors and their behavior, crawlers accessing the site, business trends, security issues, and more.

In this Bootcamp, we will learn to use Apache Spark to extract useful insights from a sample web server log.
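
To make the fields in such a log concrete, here is a minimal sketch in plain Scala that parses a single log line. It assumes the sample log follows the widely used Combined Log Format, which this README does not confirm. The AccessLogEntry case class, the AccessLogParser object, and their field names are hypothetical and not taken from the project code.

```scala
// Hypothetical record type: field names are illustrative, not from the project.
final case class AccessLogEntry(
    clientIp: String,
    timestamp: String,
    method: String,
    path: String,
    status: Int,
    bytes: Long,
    referrer: String,
    userAgent: String
)

object AccessLogParser {
  // Assumed Combined Log Format:
  // ip ident user [time] "METHOD path protocol" status bytes "referrer" "agent"
  private val LinePattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$""".r

  // Returns None for lines that do not match the assumed format.
  def parse(line: String): Option[AccessLogEntry] = line match {
    case LinePattern(ip, ts, method, path, status, bytes, referrer, agent) =>
      // A "-" in the bytes field means no response body was sent.
      val size = if (bytes == "-") 0L else bytes.toLong
      Some(AccessLogEntry(ip, ts, method, path, status.toInt, size, referrer, agent))
    case _ => None
  }
}
```

Calling AccessLogParser.parse on a line yields Some(entry) when the line matches the assumed format and None otherwise, so malformed lines can be counted or skipped instead of failing the whole run.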

Downloading the Dataset

Download the sample Web Server Log dataset by clicking here.

Project Setup:

To run this project on your local machine, follow the steps below:

  1. Open the project in IntelliJ or any other IDE. (Download IntelliJ Community Edition from here)
  2. Install the Scala plugin from Settings. (For reference)
  3. Put the dataset in the data directory. (data/access.log)
  4. Run mvn clean install in the terminal to download all dependencies and build the jar. IntelliJ users can skip this step, as the IDE automatically downloads the required dependencies and builds the project.

Once you run the main class, it reads the data/access.log file, parses it using Apache Spark, and writes the output files to the data/logdata/ directory in Parquet format. It also prints the hourly trend table to the console.
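
The flow described above could look roughly like the following sketch. It is not the project's actual main class: the object name AccessLogJobSketch, the column names, the assumed Combined Log Format, the local[*] master, and the overwrite save mode are all assumptions made for illustration. Only the data/access.log input path, the data/logdata/ output directory, the Parquet format, and the hourly trend table come from the description above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hour, regexp_extract, to_timestamp}

object AccessLogJobSketch {
  // Assumption: the log uses the Combined Log Format. Adjust if the sample differs.
  private val Pattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)"""

  // Reads raw lines and extracts a few typed columns (hypothetical column names).
  def parseLogs(spark: SparkSession, inputPath: String): DataFrame =
    spark.read.text(inputPath)
      // Keep only lines matching the assumed format, so bad lines are dropped early.
      .filter(col("value").rlike(Pattern))
      .select(
        regexp_extract(col("value"), Pattern, 1).as("client_ip"),
        to_timestamp(regexp_extract(col("value"), Pattern, 2), "dd/MMM/yyyy:HH:mm:ss Z")
          .as("event_time"),
        regexp_extract(col("value"), Pattern, 3).as("method"),
        regexp_extract(col("value"), Pattern, 4).as("path"),
        regexp_extract(col("value"), Pattern, 5).cast("int").as("status")
      )

  // Counts requests per hour of day. hour() is evaluated in the Spark session
  // time zone, which may differ from the offset recorded in the log.
  def hourlyTrend(logs: DataFrame): DataFrame =
    logs.groupBy(hour(col("event_time")).as("hour")).count().orderBy("hour")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("access-log-sketch")
      .master("local[*]") // assumption: running locally from the IDE
      .getOrCreate()

    val logs = parseLogs(spark, "data/access.log")
    // Assumption: overwrite earlier output. The default mode fails if the directory exists.
    logs.write.mode("overwrite").parquet("data/logdata/")
    hourlyTrend(logs).show(24, truncate = false)

    spark.stop()
  }
}
```

Splitting parsing and aggregation into separate functions keeps each step easy to try out on its own, for example from the Scala REPL.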

Interactively running the code in Scala shell

  1. Set up the shell from the run configuration.
  2. Select Scala REPL.
  3. Run the Scala REPL and execute the code with CTRL + ALT + X on Windows and Linux, or CONTROL + COMMAND + X on Mac.

For reference please visit.

Frequently Asked Questions

What is scala-shell?

The Scala shell is an interactive way to run Scala code. It is similar to the Python shell, where Python statements are run interactively.

What is a Parquet file?

Parquet is a highly compressed columnar file format that is optimized for fast reads and widely used in enterprise data warehouses. For further reading, see Apache Parquet.
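
Because Parquet files store their schema alongside the data, the job's output can be read back without declaring any columns. The sketch below illustrates this. The ReadParquetSketch object and its readBack helper are hypothetical names, and the local[*] master is an assumption. Only the data/logdata/ path comes from this README.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadParquetSketch {
  // Loads a Parquet directory; the schema is read from the files themselves.
  def readBack(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-parquet-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    val logs = readBack(spark, "data/logdata/")
    logs.printSchema()             // column names and types as written by the job
    logs.show(5, truncate = false) // peek at the first few rows

    spark.stop()
  }
}
```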

You're all set. VyuWing is Happy to Help!

For questions about the project and to learn more, get in touch with our team: info@vyuwinglearning.com
