This repository contains my work for the Data Science and Engineering with Apache Spark XSeries Program created by UC Berkeley and Databricks.
The program includes three courses:
- CS105x: Introduction to Apache Spark
- CS120x: Distributed Machine Learning with Apache Spark
- CS110x: Big Data Analysis with Apache Spark
These three courses are probably the best online Apache Spark training you can get. Their core value comes from the remarkable labs developed by the course team, who are top Spark experts from UC Berkeley and Databricks. In this series, learning is roughly 10% lectures and 90% lab work, so it is very hands-on and practical. That is how you learn a new programming tool: by doing. You won't learn it by spending most of your time watching lecture videos and reading books.
In total, the three courses have 10 labs. Most of these labs are large and challenging, and they can easily be turned into small projects if you really dive into them. Throughout the series, all labs are completed using notebooks on Databricks Community Edition, which is free. I must say that the Databricks notebook is an awesome tool; data scientists will love it.
All labs are completed using DataFrames:
- Lab 0: Running Your First Notebook on Databricks
- Lab 1a: Spark Tutorial
- Lab 1b: Word Count
- Lab 2: Web Server Log Analysis
- Lab 1a: Math and Python Review
- Lab 1b: Word Count Using RDD
- Lab 2: Linear Regression - Predicting Release Year of a Song
- Lab 3: Click Through Rate Prediction
- Lab 4: Principal Component Analysis
- Lab 1: Spark ML Machine Learning Pipeline Application - Power Plant
- Lab 2: Alternating Least Squares - Predicting Movie Ratings
- Lab 3: Text Analysis and Entity Resolution
The same labs completed using Resilient Distributed Datasets (RDDs):
- Lab 1: Spark Tutorial RDD
- Lab 2: Word Count RDD
- Lab 3: Web Server Log Analysis RDD
- Lab 4: Math Review RDD
- Lab 5: Linear Regression - Predicting Release Year of a Song
- Lab 6: Click-Through Rate Prediction
- Lab 7: Principal Component Analysis
- Lab 8: Alternating Least Squares - Predicting Movie Ratings
- Lab 9: Text Analysis and Entity Resolution