Develop Spark Applications with Python & Cloudera. Explore the RDD API, the original core abstraction of Spark. Use Spark SQL and DataFrames
Logistic Regression: Hadoop vs. Spark
Apache Spark: 0.9s
MapReduce: 110s
Execution time: lower s better
Word Count Examples
Many lines in Hadoop
A couple of lines in Spark
Easy to learn and full of features
Support different languages: Python, Scala, Java, R Cluster Managers: Many libraries
iterative algorithms + interactive data mining tools.
Use Python Use Spark on YARN in a Cloudera cluster. optional: Spark Standalone.
Why Couldera?
CDH+Tools // On-perm & Cloud
Director // Cloud
Altus // Platform-as-a-Service
Cloudera's Distribution including Hadoop (HDFS)