Web server logs record every event registered with a website. This data yields insights into visitor behavior, crawlers accessing the site, business trends, security issues, and more.
In this Bootcamp, we will learn to use Apache Spark to extract useful insights from a sample web server log.
Download the sample Web Server Log dataset by clicking here.
To run this project locally, follow the steps below:
- Open the project in IntelliJ or any other IDE. (Download IntelliJ Community Edition from here)
- Install the Scala plugin from Settings. (For reference)
- Put the dataset in the data directory. (data/access.log)
- Run mvn clean install in the terminal to download all dependencies and build the jar. IntelliJ users can skip this step, as the IDE automatically downloads the required dependencies and builds the project.
Once you run the main class, it reads the data/access.log file, parses it with Apache Spark, and writes the output files to the data/logdata/ directory in Parquet format. It also prints the hourly trend table to the console.
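The flow described above can be sketched roughly as follows. This is not the project's actual main class. It is a minimal sketch that assumes the log follows the Apache Combined Log Format and that the Spark SQL dependency from the Maven build is on the classpath. The object name LogInsightsSketch, the regular expression, the column names, and the hourly_trend subdirectory are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hour, regexp_extract, to_timestamp}

// Hypothetical sketch, not the project's real main class.
object LogInsightsSketch {
  // Assumed layout (Apache Combined Log Format); an illustrative line:
  // 1.2.3.4 - - [22/Jan/2019:03:56:14 +0330] "GET /index.html HTTP/1.1" 200 512 ...
  val LogPattern: String =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})"""

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LogInsightsSketch")
      .master("local[*]") // run locally on all available cores
      .getOrCreate()

    // Each log line becomes one row with a single string column named "value".
    val raw = spark.read.text("data/access.log")

    val parsed = raw
      .filter(col("value").rlike(LogPattern)) // keep only lines matching the assumed format
      .select(
        regexp_extract(col("value"), LogPattern, 1).as("ip"),
        to_timestamp(
          regexp_extract(col("value"), LogPattern, 2),
          "dd/MMM/yyyy:HH:mm:ss Z"
        ).as("ts"),
        regexp_extract(col("value"), LogPattern, 3).as("method"),
        regexp_extract(col("value"), LogPattern, 4).as("url"),
        regexp_extract(col("value"), LogPattern, 5).cast("int").as("status")
      )

    // Requests per hour of day (hour is taken in the Spark session time zone).
    val hourly = parsed
      .groupBy(hour(col("ts")).as("hour"))
      .count()
      .orderBy("hour")

    // Hypothetical output location under data/logdata/.
    hourly.write.mode("overwrite").parquet("data/logdata/hourly_trend")
    hourly.show(24, truncate = false)

    spark.stop()
  }
}
```

If the dataset uses a different log layout, the pattern and the timestamp format string would need to change accordingly.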
Run the Scala REPL and execute the code with CTRL + ALT + X on Windows and Linux, or CONTROL + COMMAND + X on Mac.
For reference please visit.
This is an interactive way to run Scala code. It is similar to the Python shell, where Python statements are run interactively.
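As an example of the kind of snippet that suits the REPL, the following parses log lines with a regular expression and counts requests per hour. It uses only the Scala standard library rather than Spark, so it can be pasted in without any cluster setup. The sample lines are made up, the LogEntry type and its field names are hypothetical, and the Combined Log Format is an assumption about the dataset.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical record type; field names are illustrative only.
case class LogEntry(ip: String, time: OffsetDateTime, method: String, url: String, status: Int)

// Assumed Apache Combined Log Format prefix: ip, timestamp, request line, status.
val LogLine = """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) .*""".r
val TimeFormat = DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

// Returns None for lines that do not match the assumed format.
def parseLine(line: String): Option[LogEntry] = line match {
  case LogLine(ip, ts, method, url, status) =>
    Some(LogEntry(ip, OffsetDateTime.parse(ts, TimeFormat), method, url, status.toInt))
  case _ => None
}

// Made-up sample lines, including one that is not a log entry.
val sample = Seq(
  """10.0.0.1 - - [22/Jan/2019:03:56:14 +0330] "GET /index.html HTTP/1.1" 200 512""",
  """10.0.0.2 - - [22/Jan/2019:03:58:40 +0330] "GET /about.html HTTP/1.1" 200 734""",
  """10.0.0.1 - - [22/Jan/2019:04:01:05 +0330] "POST /login HTTP/1.1" 302 0""",
  "this line is not a log entry"
)

// Count parsed entries per hour of day, using the hour as written in the log.
val hourlyCounts: Map[Int, Int] =
  sample.flatMap(parseLine).groupBy(_.time.getHour).map { case (h, es) => h -> es.size }

hourlyCounts.toSeq.sortBy(_._1).foreach { case (h, n) => println(s"hour $h: $n requests") }
```

Evaluating one definition at a time and inspecting the result is what makes the REPL convenient for trying out a parsing rule before moving it into a Spark job.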
Parquet is a highly compressed columnar file format that is optimized for fast reads and is widely used in enterprise data warehouses. Further reading: Apache Parquet.
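One practical consequence of the columnar layout is that a query can read just the columns it needs. The following is a minimal sketch of that idea, assuming Spark SQL is on the classpath. The object name, the temporary directory, the column names, and the rows are all hypothetical and are not taken from the project.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: write a tiny table to Parquet, then read back one column.
object ParquetRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetRoundTripSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Fresh path inside a temporary directory, so the write never hits existing files.
    val out = Files.createTempDirectory("parquet-sketch").resolve("hits").toString

    // Made-up rows standing in for parsed log records.
    Seq(("/index.html", 200, 3), ("/missing", 404, 4))
      .toDF("url", "status", "hour")
      .write
      .parquet(out)

    // Selecting a single column lets Spark skip the other columns on disk.
    val statuses = spark.read.parquet(out).select("status")
    statuses.explain() // the physical plan should show a read schema limited to status
    statuses.show()

    spark.stop()
  }
}
```

The same read pattern would apply to the files under data/logdata/, with the path and column names adjusted to whatever the main class actually writes.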
You're all set. VyuWing is Happy to Help!
For questions about the project or to learn more, get in touch with our team: info@vyuwinglearning.com