Spark

Spark Documentation
Databricks Spark Knowledge Base
Spark Programming Guide
Tuning Spark
advanced dependency management
Custom API Examples For Apache Spark - The examples are basic and only for newbies in Scala and Spark.
Welcome to Spark Python API Docs!
github.com/apache/spark
SparkTutorian.net - Apache Spark For the Common * Man!
sparktutorials.github.io
Getting Started with Spark (useful site links)
Learning Spark With Scala
Spark Internals
pubdata.tistory.com/category/Lecture_SPARK
Apache Spark - Executive Summary
Teach yourself Apache Spark – Guide for nerds!
Apache Spark - cyber.dbguide.net
Stanford CS347 Guest Lecture: Apache Spark
BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark
- mooc-setup
- Introduction to Big Data with Spark, weeks 1-2 notes
- Introduction to Big Data with Spark, week 3 notes
BerkeleyX: CS190.1x Scalable Machine Learning
- Spark: Cluster Computing with Working Sets
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
bigdatauniversity.com
- Spark Fundamentals I
- Spark Fundamentals II
Introduction to Spark
Spark Programming
Introduction to Spark Internals
Intro to Apache Spark Training - Part 1
Cloudera
- Cloudera Engineering Blog · Spark Posts
- How-to: Tune Your Apache Spark Jobs (Part 1)
- How-to: Tune Your Apache Spark Jobs (Part 2)
- LSA-ing Wikipedia with Apache Spark
- Making Apache Spark Testing Easy with Spark Testing Base
- Getting Apache Spark Customers to Production
- Why Your Apache Spark Job is Failing
The Apache Spark @youtube
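Introduction to Apache Spark and hands-on practice
Introduction to Spark, Part 1
Introduction to Spark, Part 2
RE: ShootingStar TV Episode 1 - Apache Spark and RDD
- Setting up an Apache Spark environment for study - Windows
- Setting up an Apache Spark environment for study - IntelliJ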
databricks
- sparkhub.databricks.com
- Examples for Learning Spark
- Project Tungsten: Bringing Spark Closer to Bare Metal
- Simplifying Big Data Analytics with Apache Spark
- Databricks Announces General Availability of Its Cloud Platform
- A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
- DEVOPS ADVANCED CLASS
- On Spark usage environments - Databricks
What is shuffle read & shuffle write in Apache Spark
Scrap your MapReduce! (Or, Introduction to Apache Spark)
Learning Spark
Introduction to Data Science with Apache Spark
HPC is dying, and MPI is killing it
Why is Spark becoming so popular?
Analytics With Apache Spark Is Coming
Interactive Analytics using Apache Spark
bicdata
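- Spark makes advanced analytics a 'reality' -> machine learning algorithms are included, but from an advanced analyst's point of view only basic ones
- Spark makes everything easier -> MapReduce-style programs become much easier to write; the MPI style is not supported
- Spark speaks more than one language -> Scala, Java, and Python are supported, but it is optimized for Scala and the other languages are somewhat inconvenient
- Spark delivers results faster -> in performance tests, Spark Streaming is slower than Storm and Spark SQL is slower than Hive; ordinary Spark programs perform well
- Spark is Hadoop-vendor agnostic -> most open source software is vendor-agnostic anyway; the tools differ in purpose and trade-offs
- Real-time advanced analytics -> faster advanced analytics (??) than the existing approach (Hadoop), but only near-real-time
Why VCNC chose Spark over Hadoop
[Yoo Jae-seok's Data Insight] (25) LINE Plus Game Security Development Office... processing 15 TB per 10 minutes with Spark + Mesos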
http://bcho.tistory.com/tag/Apache Spark
- Spark notes
- Why is Apache Spark popular?
- Installing Apache Spark
- Introduction to Apache Spark - the Spark stack structure
- Apache Spark cluster structure
- Apache Spark - Understanding RDD (Resilient Distributed Dataset) - #1/2
- Understanding Apache Spark RDD #2 - Passing functions to Spark
- Apache Spark - RDD Persistence (on storage options)
- Apache Spark - Key/Value Pairs (Pair RDD)
- Apache Spark - Python vs Scala performance comparison
blog.madhukaraphatak.com
- Introduction to Spark Data Source API - Part 1
Spark Summit
- Using Cascading to Build Data-centric Applications on Spark
- spark-summit.org/2015
- spark-summit.org/east-2016/schedule
  - Spark Summit East 2016, day 1 dump
  - Spark Summit East 2016, day 2 dump
- spark-summit.org/2016/schedule
- Spark Summit 2016 West Training
- Spark Summit Europe 2016 attendance report
- OrderedRDD: A Distributed Time Series Analysis Framework for Spark (Larisa Sawyer)
- Just Enough Scala for Spark (Dean Wampler)
- TensorFrames: Deep Learning with TensorFlow on Apache Spark (Tim Hunter)
- SPARK SUMMIT EAST 2017
- SPARK SUMMIT 2017 DATA SCIENCE AND ENGINEERING AT SCALE
Tuning Java Garbage Collection for Spark Applications
Mesos (0.18 -> 0.22.rc) for Spark (1.2.1 -> 1.3.1) - Upgrade
RDDS ARE THE NEW BYTECODE OF APACHE SPARK
Apache Spark on Docker
Microbenchmarking Big Data Solutions on the JVM – Part 1
Large-scale security data analysis using Spark, Mesos, Zeppelin, and HDFS
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
IBM donates machine learning technology to the open source community
Productionizing Spark and the Spark Job Server
is Hadoop dead and is it time to move to Spark
Building a data analysis system with Spark + S3 + R3 by VCNC
Parallel Programming with Spark (Part 1 & 2) - Matei Zaharia
Running multiple Spark Streaming jobs of different DStreams in parallel
Petabyte-Scale Text Processing with Spark
Combining Druid and Spark: Interactive and Flexible Analytics at Scale
Interactive Audience Analytics With Spark and HyperLogLog
Apache Spark Creator Matei Zaharia Interview
New Developments in Spark
Spark and Hadoop, a perfect combination (in Korean)
Spark Architecture: Shuffle
Naytev Wants To Bring A Buzzfeed-Style Social Tool To Every Publisher With Spark
Spinning up a Spark Cluster on Spot Instances: Step by Step
Spark Meetup at Uber
Bay Area Apache Spark Meetup @ Intel
- Easy, scalable, fault tolerant stream processing with structured streaming - spark meetup at intel in santa clara
Can Apache Spark process 100 terabytes of data in interactive mode?
Experience integrating Apache Spark into the Netflix big data platform
Succinct Spark from AMPLab: Queries on Compressed RDDs
How-to: Build a Complex Event Processing App on Apache Spark and Drools
Improving Spark application performance
[Spark] “Fast food” and tips for RDD
ScalaML - Machine learning basics with Scala (+Spark)
Secondary Sorting in Spark
Distributed computing with spark
Comparing the Dataflow/Beam and Spark Programming Models
Apache Spark Architecture
Scala vs. Python for Apache Spark
Natural Language Processing With Apache Spark
MapR releases 'Apache Spark' training courses for free
Spark HDFS Integration
spark textfile load file instead of lines
Reading Text Files by Lines
Evening w/ Martin Odersky! (Scala in 2016) +Spark Approximates +Twitter Algebird
ScalaJVMBigData-SparkLessons.pdf
Introduction to Spark 2.0 : A Sneak Peek At Next Generation Spark
- Spark Release 2.0.0
- Spark SQL, DataFrames and Datasets Guide
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - When to use them and why
- Introducing Apache Spark 2.0
- Spark 2.0 Technical Preview: Easier, Faster, and Smarter
- Apache Spark 2.0 presented by Databricks co-founder Reynold Xin
- APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL
- Spark 2.0 – Datasets and case classes
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- Generating Flame Graphs for Apache Spark
- Apache Spark 2.0 Tuning Guide
- Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data
- Modern Spark DataFrame & Dataset | Apache Spark 2.0 Tutorial
- Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
- Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
- Spark 2.0 - by Matei Zaharia
- Spark 2.x Troubleshooting Guide
Introducing Apache Spark 2.1 Now available on Databricks
The easiest way to run Spark in production
Spark tuning for Enterprise System Administrators
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Takes On Dataflow in Benchmark Test
Stock inference engine using Spring XD, Apache Geode / GemFire and Spark ML Lib. http://pivotal-open-source-hub.github.io/StockInference-Spark
Learning Spark - People Who Dream of Becoming Architects
- 2015_LearningSpark
Tutorial: Spark-GPU Cluster Dev in a Notebook A tutorial on ad-hoc, distributed GPU development on any Macbook Pro
GPU Acceleration on Apache Spark™
Why should you use GPUs with Spark?
Cluster - spark
Apache Spark Key Terms, Explained
Example of remote I/O against a Cloudera Hadoop cluster with Spark
How not to code it
Remote I/O against a Hadoop cluster using Spark
Best Practices for Using Apache Spark on AWS
Build a Prediction Engine Using Spark, Kudu, and Impala
Deep Dive: Apache Spark Memory Management
option
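- spark.executor.cores: cores per executor (equal to the node's core count when one executor fills a node)
- spark.cores.max: total number of cores across the cluster that the application may use
- e.g.
  - with 2 worker nodes of 8 cores each, setting spark.cores.max to 8 means only one node does work
  - to run on both nodes, set spark.cores.max to 16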
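The arithmetic behind the example above can be sketched in plain Scala. This is a simplified model, not Spark's scheduler: the `CoreBudget` object and its function are hypothetical names, and it assumes every executor takes a fixed number of cores (`spark.executor.cores`) while `spark.cores.max` caps the application's total.

```scala
object CoreBudget {
  // Simplified model: how many executors can start, given the application-wide
  // cap (spark.cores.max), the cores per executor (spark.executor.cores),
  // and the shape of the cluster. All names here are illustrative.
  def executorsLaunched(coresMax: Int, executorCores: Int, nodes: Int, coresPerNode: Int): Int = {
    require(executorCores > 0, "executorCores must be positive")
    val allowedByCap = coresMax / executorCores                   // limit from spark.cores.max
    val allowedByCluster = nodes * (coresPerNode / executorCores) // limit from the hardware
    math.min(allowedByCap, allowedByCluster)
  }
}
```

Under these assumptions, `CoreBudget.executorsLaunched(8, 8, 2, 8)` evaluates to 1 (a single node is busy) and `CoreBudget.executorsLaunched(16, 8, 2, 8)` evaluates to 2.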
Apache Spark @Scale: A 60 TB+ production use case
How Do In-Memory Data Grids Differ from Spark?
The data skew problem in Spark
Analyzing an overheated real estate market within 24 hours with Spark, as a first-time user
Intro to Apache Spark for Java and Scala Developers - Ted Malaska (Cloudera)
Achieving a 300% speedup in ETL with Apache Spark
- Introduces snippets for working with CSV files in Spark
- Compared with the non-distributed version, Spark delivers a substantial speedup and can convert data to optimized formats such as Parquet
How to install and run Spark 2.0 on HDP 2.5 Sandbox
Experimenting with Neo4j and Apache Zeppelin (Neo4j)-[:LOVES]-(Zeppelin)
Time-Series Missing Data Imputation In Apache Spark
Data Science How-To: Using Apache Spark for Sports Analytics
- Using Spark To Analyze the NBA and the 3-point Shot
Hive on Spark: Getting Started
Working with UDFs in Apache Spark
- Simple examples of using Apache Spark UDFs and UDAFs from Python, Java, and Scala
How Apache Spark became the most active big data project
Using Apache Spark for large-scale language model training
- Facebook is attempting to move the training pipeline of its n-gram model from Apache Hive to Apache Spark
- Describes both solutions, compares the flexibility of the Spark DSL and HiveQL, and gives performance figures
WRITING TO A DATABASE FROM SPARK
Processing Solr data with Apache Spark SQL in IBM IOP 4.3
- Introduces how to connect Apache Spark to Apache Solr
Blacklisting in Apache Spark
Tracking the Money — Scaling Financial Reporting at Airbnb
The Benefits of Migrating HPC Workloads To Apache Spark
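- Describes recent improvements to the integration between Apache Zeppelin and the Livy job server for running Spark jobs
Spark Standalone: from installation to testing examples
Building a data analysis infrastructure (1/4)
Building a data analysis infrastructure (2/4)
Building a data analysis infrastructure (3/4)
Building a data analysis infrastructure (4/4)
Parquet usage example
zipWithIndex, for-yield example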
Apache Spark installation on Windows 10
Cloudera session seoul - Spark bootcamp
Benchmarking Big Data SQL Platforms in the Cloud
- Claims that the Databricks platform is faster than vanilla Spark, Presto, and Impala
Building QDS: AIR Infrastructure
- About AIR, a platform Qubole presented at the Data Platforms 2017 conference
Spark study group ParkS
- ParkS
Cost Based Optimizer in Apache Spark 2.2
- Explains the Cost Based Optimizer in Apache Spark 2.2, compares query execution times on the TPC-DS benchmark with and without CBO, and describes how statistics are collected
A taste of building Java-based web applications with the Spark framework

API

Spark Programming Model : Resilient Distributed Dataset (RDD) - 2015
Apache Spark: Examples Of Transformations
The RDD API By Example
backtobazics.com/category/big-data/spark example of API
APACHE SPARK: RDD, DATAFRAME OR DATASET?
Apache Spark’s Hidden REST API

aggregate

scala> val rdd = sc.parallelize(List(1, 2, 3, 3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21

scala> rdd.aggregate((0, 0))((x, y) => (x._1 + y, x._2 - y), (x, y) => (x._1 + y._1, x._2 + y._2))
res10: (Int, Int) = (9,-9)

scala> rdd.map(t => (t, -t)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
res11: (Int, Int) = (9,-9)
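What `aggregate` does in the session above can be modelled without a cluster. The sketch below is plain Scala, not Spark code: `AggregateModel` and the two-partition split are assumptions made for illustration. It folds each partition with the first function, starting from the zero value, and then merges the per-partition results with the second function.

```scala
object AggregateModel {
  // Fold each partition with seqOp starting from zero, then merge the
  // per-partition results with combOp, again starting from zero.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zero: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U
  ): U =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // The same data as the REPL session, split into two hypothetical partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 3))

  // Same seqOp and combOp as the rdd.aggregate call in the session.
  val result: (Int, Int) = aggregate(partitions)((0, 0))(
    (acc, y) => (acc._1 + y, acc._2 - y),
    (a, b) => (a._1 + b._1, a._2 + b._2)
  )
}
```

With this split the partitions fold to `(3,-3)` and `(6,-6)`, and `AggregateModel.result` is `(9,-9)`, the same value as `res10`.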

aggregateByKey
- AggregateByKey implements Collect_list in Spark 1.4
combineByKey
DataFrames
Datasets
groupByKey
- Avoid GroupByKey
HashPartitioner
- Apache Spark - HashPartitioner : How does it work?
- Partition by Hash on Keys
join
- RDD join example
- join example
persist
- Caveats when using RDD persist() or cache()
SQL
- Spark SQL, DataFrames and Datasets Guide
  - Column
  - Dataset
  - Row
- spark-csv - CSV Data Source for Apache Spark 1.x
  - TextFileSuite.scala
- Spark SQL CSV Examples
- github.com/yhuai/spark/tree/eb77ee39b8616cb367541503baf7c07695ef1ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv
- Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration
- Spark 2.0 read csv number of partitions (PySpark)
- How to read csv file as DataFrame?
- How to change column types in Spark SQL's DataFrame?
- Working with Nested Data Using Higher Order Functions in SQL on Databricks
  - Hadoop and Spark are great tools for processing complex, varied data such as nested structs, arrays, and maps, but working with such data from SQL is hard
  - Introduces the TRANSFORM operation added in Databricks 3.0 and the "Higher Order Functions" added to Spark SQL (SPARK-19480)
- Spark SQL under the hood – part I

Book

Mastering Apache Spark 2.0

Conference

Spark Day 2017@Seoul(Spark Bootcamp)
Spark Day 2017 - The past, present, and future of Spark
Spark Day 2017 Machine Learning & Deep Learning With Spark
Korean text classification using Spark & Zeppelin
- Korean text classification using Spark & Zeppelin
Zeppelin notebook: NSMC Word2Vec & Sentiment Classification
Spark day 2017@Seoul - Spark on Kubernetes
Large-scale security data analysis using Spark, Mesos, Zeppelin, and HDFS

Deep Learning

yahoo/CaffeOnSpark
CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters
Large Scale Distributed Deep Learning on Hadoop Clusters
SparkNet: Training Deep Networks in Spark
- Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet
[264] large scale deep-learning_on_spark
DeepSpark: Spark-Based Deep Learning Supporting Asynchronous Updates and Caffe Compatibility
The Unreasonable Effectiveness of Deep Learning on Spark
GPU Acceleration in Databricks Speeding Up Deep Learning on Apache Spark
Deep Learning on Databricks - Integrating with TensorFlow, Caffe, MXNet, and Theano

Hbase

example
- HBaseTest.scala, hbase_inputformat.py
A simple API to interact with HBase with Spark
Apache Spark Comes to Apache HBase with HBase-Spark Module

Ignite - Spark Shared RDDs

Library

Hadoop Tutorial: the new beta Notebook app for Spark & SQL
AWS Athena Data Source for Apache Spark
BigDL: Distributed Deep learning on Apache Spark
- BigDL: Distributed Deep learning on Apache Spark
CLOUD DATAPROC - Google Cloud Dataproc is a managed Spark and Hadoop service that is fast, easy to use, and low cost
- Google unveils a managed cloud service for Spark and Hadoop
- Using Google Cloud Dataproc (http://whitechoi.tistory.com/48)
CueSheet - a framework for writing Apache Spark 2.x applications more conveniently
- No More "sbt assembly": Rethinking Spark-Submit using CueSheet
Dr. Elephant Self-Serve Performance Tuning for Hadoop and Spark
EMR
- Large-Scale Machine Learning with Spark on Amazon EMR
- Amazon EMR begins supporting Apache Spark
- Spark on EMR
- (BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
Envelope - a configuration-driven framework for Apache Spark that makes it easy to develop Spark-based data processing pipelines on a Cloudera EDH
- How to implement such pipelines on a Cloudera enterprise data hub (EDH) using Apache Spark, Apache Kudu, and Apache Impala together with Envelope
- Configuration specification
- Bi-temporal data modeling with Envelope
- Cloudera Enterprise Data Hub - Our flagship can now be yours
flambo - A Clojure DSL for Apache Spark
GraphFrame
- On-Time Flight Performance with GraphFrames for Apache Spark
Hail: Scalable Genomics Analysis with Apache Spark
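- An overview of Hail, a tool for genomic analysis with Apache Spark
- Has a simple yet powerful programming model, demonstrated by running an example that computes sample quality and performs a simple genome-wide association study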
Infinispan Spark connector 0.1 released!
- infinispan-spark
- infinispan-spark-connector-examples
KeystoneML - Machine Learning Pipeline
Livy, the Open Source REST Service for Apache Spark, Joins Cloudera Labs
- Livy: A REST Web Service For Apache Spark
MMLSpark - Microsoft Machine Learning for Apache Spark
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning http://oryx.io
pocketcluster - One-Step Spark/Hadoop Installer v0.1.0
snappydata - Unified Online Transactions + Analytics + Probabilistic Data Platform
- SnappyData: OLTP + OLAP Database built on Apache Spark http://www.snappydata.io
spark cassandra connector - a library for connecting Spark to Cassandra
spark-indexed - An efficient updatable key-value store for Apache Spark
spark-jobs-rest-client - Fluent client for interacting with Spark Standalone Mode's Rest API for submitting, killing and monitoring the state of jobs
Sparkline SNAP
- Introducing Sparkline SNAP: An Integrated OLAP platform on Spark
Sparklint - The missing Spark Performance Debugger that can be drag and dropped into your spark application!
- SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs (Simon Whitear)
spark-nkp Natural Korean Processor for Apache Spark
Spark Notebook
SparMysqlSample
spark-packages - A community index of packages for Apache Spark
- A website for searching Scala dependencies and packages - http://spark-packages.org
spark-ts - Time Series for Spark (The spark-ts Package)
spark-xml - XML data source for Spark SQL and DataFrames

GraphX

Spark Streaming and GraphX at Netflix - Apache Spark Meetup, May 19, 2015
스사모 Tech Talk - GraphX
Computing Shortest Distances Incrementally with Spark
Strata 2016 - This repo is for MLlib/GraphX tutorial in Strata 2016
Processing Hierarchical Data using Spark Graphx Pregel API
- Examples of and instructions for using the GraphX API

Mesos

Spark + Mesos cluster mode, who uploads the jar?

MLlib

Decision Trees
MLlib: Machine Learning in Apache Spark
movie recommendation with mllib
WSO2 Machine Learner: Why would You care?
Strata 2016 - This repo is for MLlib/GraphX tutorial in Strata 2016
Spark ML Lab
Introduction to machine learning starting with Apache Spark
- From Apache Spark basics to machine learning
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Deep Learning with Apache Spark and TensorFlow
Introduction to Machine Learning on Apache Spark MLlib
pipelineio - End-to-End Spark ML and Tensorflow AI Data Pipelines
TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters - Andy Feng & Lee Yang
github.com/yahoo/TensorFlowOnSpark
- Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters
Extend Spark ML for your own model/transformer types
Deep learning for Apache Spark
Spark machine learning & deep learning
Accelerating Apache Spark MLlib with Intel® Math Kernel Library (Intel® MKL)
Improving BLAS library performance for MLlib
Introduction to Machine learning with Spark
- Introduction to Machine Learning with Spark
- Code and setup information for Introduction to Machine Learning with Spark
Introduction to ML with Apache Spark MLlib by Taras Matyashovskyy
Spark Deep Learning Pipelines

PySpark

PySpark Cheat Sheet: Spark in Python
troubleshooting
- A Beginner's Guide on Troubleshooting Spark Applications
- Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.package -> check the sbt configuration, such as resolvers
  - Spark Runtime Error - ClassDefNotFound: SparkConf
- java.lang.OutOfMemoryError: GC overhead limit exceeded -> increase driver memory
- org.apache.spark.SparkException: Could not find BlockManagerEndpoint1 or it has been stopped -> searching turns up nothing in particular
- spark java.io.IOException: Filesystem closed -> usually the result RDD is too big
- Task not serializable
- TypeError: 'bool' object is not callable -> use PYSPARK_PYTHON=...
  - Check Python version in worker before run PySpark job
  - spark-runs-in-local-but-not-in-yarn
- yarn.scheduler.maximum-allocation-mb
  - increase configuration for yarn-site.xml
  - empty disk (not enough free space may cause this too)
- Cannot submit Spark app to cluster, stuck on “UNDEFINED”
  - confirmed working after adjusting yarn.nodemanager.resource.memory-mb
- contains a task of very large size warning
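  - Problem: rows read into a DataFrame had to be text-processed and then compared with one another, which triggered the "contains a task of very large size" warning
  - Fix: store the text-processed intermediate results in Redis, then process them row by row in a separate Spark application
  - Cause
    - Spark manages the work each executor must perform in units called tasks
    - Operations applied to an RDD are grouped by their mutual dependencies (logical planning), then optimization rules are applied to produce the tasks that executors actually run (physical planning)
    - These tasks are placed in an internal queue and sent to executors in order
    - More concretely, the driver process builds a TaskDescription object from the work routine and the location of the data to operate on, serializes it, and sends it over the network to the worker process
    - The catch is that when a task exceeds 100 KB, the "contains a task of very large size" warning is raised
    - This limit is hard-coded in the source and cannot be changed
    - Using the broadcast feature makes the situation worse
    - Unlike sending a task, broadcast works by sending the data values themselves to the workers one by one
    - Because there are far more than one or two rows to send, performance naturally suffers
    - For these reasons, the only workable approach was to store the NLP-processed intermediate results in separate storage and read them back from a separate application
    - Redis is recommended over other stores because it is fast, easy to manage as a key-value store, and its sharding spreads reads well
    - Recently, development has been moving toward storing models trained with Spark ML in Redis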
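The 100 KB threshold described above can be illustrated with a minimal sketch. It is plain Scala using Java serialization, not the way Spark itself measures a task; `TaskSizeCheck` and `WarnThresholdBytes` are hypothetical names, and the threshold value is taken from the note above. The idea is that whatever a task carries with it gets serialized, so a large captured value inflates the task.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object TaskSizeCheck {
  // Threshold from the note above (100 KB); the constant name is illustrative.
  val WarnThresholdBytes: Int = 100 * 1024

  // Java-serialize a value and return the number of bytes produced.
  def serializedSize(value: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.size()
  }

  // True when the serialized form would cross the warning threshold.
  def exceedsThreshold(value: AnyRef): Boolean =
    serializedSize(value) > WarnThresholdBytes
}
```

For example, `TaskSizeCheck.exceedsThreshold(new Array[Byte](200 * 1024))` is true, while `TaskSizeCheck.exceedsThreshold("small")` is false. Running a check like this on the data a job ships with each task is one way to see in advance whether the warning is likely.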
Getting started with PySpark - Part 1
Getting started with PySpark - Part 2
PySpark Internals
Fast Data Analytics with Spark and Python
pyspark-hbase.py
Deploying PySpark on Red Hat Storage GlusterFS
Spark Python Performance Tuning
weird case from pyspark-hbase (utf8 & unicode mixed)
Python Versus R in Apache Spark
biospark
Plagiarizing and Paraphrasing Code From an Online Class for Content Marketing
How-to: Use IPython Notebook with Apache Spark
Configuring IPython Notebook Support for PySpark
pyADAM - This is a wrapper to load Parquet data in PySpark
Accessing PySpark in PyCharm
pyspark-project-example - A simple example for PySpark based project
Recommendation Systems for Implicit Feedback
Hassle Free ETL with PySpark
Ahn Myung-ho: Python + Spark, the perfect marriage for machine learning - PyCon APAC 2016
Fully Arm Your Spark with Ipython and Jupyter in Python 3
- Installation
Apache Spark for Data Science
BigDL on CDH and Cloudera Data Science Workbench - how to use BigDL (a deep learning library for Apache Spark) with the Workbench
Distributed Deep Learning At Scale On Apache Spark With BigDL
Deep Learning to Big Data Analytics on Apache Spark Using BigDL - Yuhao Yang & Xianyan Jia
Deep Learning on Qubole Using BigDL for Apache Spark – Part 2
- A short tutorial showing how to train and evaluate a model using the BigDL deep learning library
Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench - how to write PySpark jobs that use Python libraries
Install Spark on Windows (PySpark)

R

Spark 1.4 for RStudio
Python Versus R in Apache Spark
SparkR installation and usage notes 1 - Installation Guide On Yarn Cluster & Mesos Cluster & Stand Alone Cluster
sparklyr — R interface for Apache Spark
sparklyr
xwMOOC Machine Learning - sparklyr, dplyr on top of Spark
spark + R
MS R (formerly Revolution R) on Spark - installation and a glimpse of its potential (feat. SparkR)
Spark 2 Programming for Big Data Analysis: From Large-Scale Data Processing to Machine Learning
On-Demand Webinar and FAQ: Parallelize R Code Using Apache Spark

Spark DL

A Vision for Making Deep Learning Simple From Machine Learning Practitioners to Business Analysts

Spark ML

KeystoneML - Machine Learning Pipeline
Feature Engineering at Scale With Spark
Audience Modeling With Spark ML Pipelines

Spark SQL

Spark SQL, DataFrames and Datasets Guide
Deep Dive into Spark SQL’s Catalyst Optimizer
SparkSQL cacheTable method performance comparison - default vs cacheTable vs cacheTable (with columnar compression)
SparkSQL Internals
Spark Data Source API. Extending Our Spark SQL Query Engine
Five Spark SQL Utility Functions to Extract and Explore Complex Data Types
- A tutorial on using built-in Spark SQL functions to handle JSON and nested structures
FLARE: SCALE UP SPARK SQL WITH NATIVE COMPILATION AND SET YOUR DATA ON FIRE!
- Experimental stage
- Greatly improves Spark SQL performance by compiling query plans to native code and modifying the Spark runtime system
- Flare: Native Compilation for Heterogeneous Workloads in Apache Spark

Streaming

Improved Fault-tolerance and Zero Data Loss in Spark Streaming
Four Things to know about Reliable Spark Streaming
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real-Time Analytics with Spark Streaming
- Diving into Spark Streaming’s Execution Model
Can Spark Streaming survive Chaos Monkey?
RecoPick's real-time data processing system migration story (from Storm to Spark Streaming)
From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 2
Building a lossless stream processing infrastructure with Spark Streaming
Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
Handling empty batches in Spark streaming
Spark Streaming Example (learning Spark Streaming by example)
Long-running Spark Streaming Jobs on YARN Cluster
- Running long-lived streaming analysis jobs with spark-submit
Spark Streaming operations and retrospective

YARN

Running Spark on YARN
Apache Spark Resource Management and YARN App Models
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark Yarn Cluster vs Spark Mesos Cluster (vs various other modes): comparison of performance and usability
Dynamic Resource Allocation Spark on YARN
Investigation of Dynamic Allocation in Spark
Spark Cluster Settings On Yarn : Spark 1.4.1 + Hadoop 2.7.1

Zeppelin

Apache Zeppelin Release 0.7.0
www.zepl.com previously www.zeppelinhub.com
Practice
- meetup
Introduction to Zeppelin
Zeppelin overview
Installing Zeppelin
5. Web-based command interpreter: Zeppelin Install
Angular display system dashboard on Zeppelin
Analyzing data with Apache Zeppelin by VCNC
Zeppelin Context
[Apache Tajo] Integrating Apache Tajo desktop + Zeppelin
How-to: Install Apache Zeppelin on CDH
EMR with Zeppelin on board, April 2016
Zeppelin at Twitter
Apache Zeppelin: the road from Korea to the world
Zeppelin Lab
A case study in building ultra-simple BI with Presto and Zeppelin
A case study in building an ultra-simple BI system with Presto and Zeppelin (1)
Serving Shiro enabled Apache Zeppelin with Apache mod_proxy + SSL (https)
Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin
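Zeppelin use case at the University of Seoul Data Mining Lab
- Zeppelin baby steps: University of Seoul data mining use case, Zeppelin notebook statistics extraction code
Analyzing the social reaction to the Note 7. #3: detailed analysis with a Zeppelin notebook
September Valentine webinar - Min Kyung-guk's introductory Apache Zeppelin online hands-on lecture
Open source diary 2: What is Apache Zeppelin?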
How Apache Zeppelin runs a paragraph
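Applying machine learning in practice with Spark & Zeppelin
- Zeppelin fire news article classification example
Plotting statistical graphs with Spark and Zeppelin (Windows environment): a story of failure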
Apache Zeppelin Data Science Environment 1/21/16
Zeppelin installed simply with Docker
Zeppelin Build and Tutorial Notebook
DIT4C image for Apache Zeppelin
zdairi is zeppelin CLI tool
Implementing automatic login when sharing a Zeppelin paragraph
