Data Engineering Hands On: Apache Spark

Problem Description

Web server logs record every event that was registered with the website. This data holds many insights into website visitors and their behavior, crawlers accessing the site, business trends, security issues, and more.

In this Bootcamp, we will learn to use Apache Spark to extract useful insights from a sample Web Server Log.

Downloading the Dataset

Download the sample Web Server Log dataset by clicking here.

Project Setup:

To run this project locally, follow the steps below:

Open the project in IntelliJ or any other IDE. (Download IntelliJ Community Version from here)
Install the Scala plugin from Settings. (For reference)
Put the dataset in the data directory. (data/access.log)
Run mvn clean install in the terminal to download all dependencies and build the jar. IntelliJ users can skip this step, as the IDE automatically downloads the required dependencies and creates the build.

Once you run the main class, it reads the data/access.log file, parses it using Apache Spark, and writes the output files to the data/logdata/ directory in Parquet file format. It also prints the hourly trend table to the console.
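The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual main class: the object name AccessLogSketch, the column names, and the regular expression are hypothetical, and the pattern presumes the log follows the common Apache access-log layout.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hour, regexp_extract, to_timestamp}

// Hypothetical sketch; names and log layout are assumptions, not the project's code.
object AccessLogSketch {
  // Assumed layout: host ident user [timestamp] "METHOD path protocol" status ...
  val LogPattern: String =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) """

  // Read raw lines, keep only those matching the pattern, and split them into columns.
  def parse(spark: SparkSession, path: String): DataFrame =
    spark.read.text(path)
      .where(col("value").rlike(LogPattern))
      .select(
        regexp_extract(col("value"), LogPattern, 1).as("host"),
        to_timestamp(regexp_extract(col("value"), LogPattern, 2),
          "dd/MMM/yyyy:HH:mm:ss Z").as("ts"),
        regexp_extract(col("value"), LogPattern, 3).as("method"),
        regexp_extract(col("value"), LogPattern, 4).as("path"),
        regexp_extract(col("value"), LogPattern, 5).cast("int").as("status"))

  // Count requests per hour of day (hour is evaluated in the session time zone).
  def hourlyTrend(logs: DataFrame): DataFrame =
    logs.groupBy(hour(col("ts")).as("hour"))
      .count()
      .withColumnRenamed("count", "requests")
      .orderBy("hour")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("access-log-sketch")
      .master("local[*]")
      .config("spark.sql.session.timeZone", "UTC") // assumption: report hours in UTC
      .getOrCreate()

    val logs = parse(spark, "data/access.log")
    logs.write.mode("overwrite").parquet("data/logdata/")
    hourlyTrend(logs).show(24, truncate = false)
    spark.stop()
  }
}
```

Lines that do not match the assumed pattern are dropped before parsing, so a malformed entry does not stop the job.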

Interactively running the code in Scala shell

Set up the shell from the run configuration.
Select Scala REPL.
Run the Scala REPL and execute the code with CTRL + ALT + X on Windows and Linux, or CONTROL + COMMAND + X on Mac.

For reference please visit.

Frequently Asked Questions

What is scala-shell?

This is an interactive way to run Scala code. It is similar to the Python shell, where Python statements are run interactively.
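As a small illustration of what an interactive session might look like, the lines below can be pasted into the Scala REPL one at a time. The sample log line and the pattern are made up for this example and assume the common Apache access-log layout; they are not taken from the dataset or the project code.

```scala
// Hypothetical sample line (not from the dataset).
val line = """203.0.113.7 - - [22/Jan/2019:03:56:14 +0330] "GET /index.html HTTP/1.1" 200 1024"""

// One capture group per field: host, timestamp, method, path, status.
val LogLine = """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) .*$""".r

// Pattern matching on a Regex succeeds only if the whole line matches.
val parsed = line match {
  case LogLine(host, ts, method, path, status) =>
    Some((host, ts, method, path, status.toInt))
  case _ => None
}
```

The REPL echoes the value of each definition after it is entered, which makes it convenient for trying a pattern on a single line before running it over the whole file.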

What is a Parquet file?

Parquet is a compressed, columnar file format that is optimized for high-speed reads and is widely used in enterprise data warehouses. Further reading: Apache Parquet.
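To inspect the Parquet output written to data/logdata/, something along these lines should work. It is a sketch only: the object name ParquetReadSketch is hypothetical, and no particular column names are assumed, since the schema is read from the files themselves.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for looking at the project's Parquet output.
object ParquetReadSketch {
  // Parquet files carry their own schema, so no column definitions are needed here.
  def load(spark: SparkSession, dir: String): DataFrame =
    spark.read.parquet(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-read-sketch")
      .master("local[*]")
      .getOrCreate()

    val logs = load(spark, "data/logdata/")
    logs.printSchema()            // column names and types stored in the files
    logs.show(5, truncate = false) // first few parsed rows
    spark.stop()
  }
}
```

Because the format is columnar, selecting a subset of columns after loading lets Spark read only those columns from disk.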

You're all set. VyuWing is Happy to Help!

For questions about the project and to learn more, get in touch with our team: info@vyuwinglearning.com