XIEQ/DataScienceEngineeringApacheSpark

Data Science and Engineering with Apache Spark

This repository contains my work for the Data Science and Engineering with Apache Spark XSeries Program created by UC Berkeley and Databricks.

The program includes three courses:

  1. CS105x: Introduction to Apache Spark
  2. CS120x: Distributed Machine Learning with Apache Spark
  3. CS110x: Big Data Analysis with Apache Spark

My Review of the XSeries Program

These three courses are probably the best Apache Spark online training courses you can get. Their core value comes from the remarkable labs developed by the course team, who are top Spark experts from UC Berkeley and Databricks. For this series, learning is 10% lecture and 90% working on the labs. It is very hands-on and practical. That is how you learn a new programming tool: by doing. You won't learn one by spending most of your time watching lecture videos and reading books.

In total, the three courses have 10 labs. Most of these labs are large and not easy, and they could easily be turned into small projects if you really dive into them. Throughout the series, all labs are completed using notebooks on Databricks Community Edition, which is free. I must say that the Databricks notebook is an awesome tool; data scientists will love it.

Lab Notebooks

All labs are completed using DataFrames

CS105x: Introduction to Apache Spark

CS120x: Distributed Machine Learning with Apache Spark

CS110x: Big Data Analysis with Apache Spark

RDD Notebooks

The same labs, completed using Resilient Distributed Datasets (RDDs)
