Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Topics covered include:
- An overview of the Apache Spark architecture
- RDD transformations and actions (see the first sketch after this list)
- Spark SQL (see the SQL sketch after this list)
- Develop Apache Spark 2.0 applications with PySpark
- Advanced techniques to optimize and tune Apache Spark jobs
- Spark on Amazon's Elastic MapReduce service
- Big data ecosystem overview
- Datasets and DataFrames
- Analyze structured and semi-structured data
- Broadcast variables and accumulators (see the shared-variables sketch after this list)
- Best practices for working with Apache Spark in the field
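
As a taste of the RDD material, here is a minimal sketch of transformations versus actions, assuming a local Spark installation; the app name and sample data are placeholders:

```python
# A minimal sketch: transformations are lazy, actions trigger execution.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics")  # placeholder app name

numbers = sc.parallelize(range(1, 11))

# Transformations only build a lineage graph; nothing runs yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions execute the whole lineage and return results to the driver.
print(evens.collect())  # [4, 16, 36, 64, 100]
print(evens.count())    # 5

sc.stop()
```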
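
For the Spark SQL, DataFrame, and semi-structured data topics, here is a small sketch of querying schema-inferred JSON; the input file `people.json` and its `name`/`age` columns are hypothetical:

```python
# A minimal sketch of Spark SQL over semi-structured JSON.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-basics").getOrCreate()

# The schema is inferred from the JSON records; "people.json" is a
# hypothetical input file with name and age fields.
people = spark.read.json("people.json")
people.printSchema()

# The DataFrame API and plain SQL are interchangeable views of the data.
people.filter(people.age > 21).select("name", "age").show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

spark.stop()
```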
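
For the shared-variables topic, here is a sketch pairing a read-only broadcast lookup table with a write-only accumulator; the country-code data is invented for illustration:

```python
# A minimal sketch: broadcast a lookup table to executors and count
# lookup misses with an accumulator.
from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-vars")

country_names = sc.broadcast({"US": "United States", "DE": "Germany"})
unknown = sc.accumulator(0)

def resolve(code):
    name = country_names.value.get(code)
    if name is None:
        unknown.add(1)  # updated on executors, read back on the driver
    return name

codes = sc.parallelize(["US", "DE", "FR", "US"])
print(codes.map(resolve).collect())  # ['United States', 'Germany', None, 'United States']
print(unknown.value)                 # 1

sc.stop()
```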