


Jupyter Notebooks to Practice Spark

Introduction

This tutorial is a work-in-progress for practising PySpark using Jupyter notebooks. Though I've provided explanations of some of the basic concepts of Spark, these explanations should by no means be construed as complete. I will add more explanations as and when I get the time to revisit the work already done. I'd appreciate it if people following this repo could provide feedback so that I can make the necessary corrections.

Pre-requisites:

  • Python, Jupyter Notebook, and the basics of distributed computing (a theoretical understanding should be enough)
  • An understanding of the Hadoop ecosystem is NOT required to understand Spark. Occasionally, terms like HBase and HDFS might pop up. I have included an explanation wherever I felt it was absolutely necessary for the reader to understand the concept. If an explanation isn't included, it means that the concept can be understood even without knowing the keyword in question. Likewise, knowledge of MapReduce is optional for learning Spark. That said, I have included a chapter explaining MapReduce through a word count example, which, BTW, is the HelloWorld program of the Big Data world.

Thanks

  • Slides from the Coursera lecture, included in the repository.
  • A big thanks to this tutorial, which was created with the same intent. It helped me a lot to understand the concepts by giving me something to build this tutorial upon.
  • Thanks to numerous Quora users who explain technical jargon in the most lucid terms.

Spark Installation Notes

I followed this link to install Spark, with the following differences.

  • Instead of using the Anaconda distribution for Python, I went ahead with the Python installation that comes with Ubuntu. I am not a huge fan of Anaconda and prefer to install Python libraries as and when required.
  • I installed Spark 2.2.0. Note that this version of Spark does not work with Oracle Java 9; it works with Java 8. While installing Java, you'll be prompted to install version 9. DO NOT install 9. It took me quite some time to figure this out :-) A quick way to check that the installation works is sketched below.
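
For reference, here is a minimal sanity check for a local PySpark setup. It assumes Spark 2.2.0 is installed and that pyspark is importable from your Python environment (e.g. by setting SPARK_HOME and PYTHONPATH, or via the findspark package); the exact setup depends on the installation guide linked above.

```python
# Minimal sanity check for a local PySpark installation.
# Assumes SPARK_HOME is set and pyspark is on the Python path
# (e.g. via the `findspark` package); adjust to your own setup.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("installation_check").setMaster("local[*]")
sc = SparkContext(conf=conf)

print("Spark version:", sc.version)  # should report 2.2.0 for this setup

# Create a tiny RDD and run a simple action to confirm Spark actually works.
rdd = sc.parallelize(range(10))
print("Sum of 0..9 computed by Spark:", rdd.sum())  # expected: 45

sc.stop()
```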

Table of Contents

Topic | Notebook | Content Description
--- | --- | ---
RDD: Definition and its creation | 01_rdd_definition_and_creation.ipynb | Definition of an RDD; types of operations; the parallelize and textFile methods for creating an RDD
RDD Basic Operations, Part I | 02_rdd_basic_operations.ipynb | Explanation of immutability and lazy evaluation; examples of a few basic transformations and actions
WordCount in Spark | 03_wordcount_mapreduce.ipynb | Explanation of Spark transformations through the WordCount example
JOIN in Spark | 04_join_in_spark.ipynb | Simple and advanced JOINs through a Coursera assignment
Handling Parquet files | 05_parquet_file_basics.ipynb | Introduction to Parquet and column-oriented data storage; example of reading a Parquet file
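
To give a flavour of what these notebooks cover, here is a small, hypothetical sketch of the kind of PySpark code you will find in them (RDD creation, the WordCount example, and reading a Parquet file). File names such as input.txt and data.parquet are placeholders, not files from this repository.

```python
# A hypothetical taste of the notebook contents; file paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("readme_examples")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

# Notebook 01: creating RDDs with parallelize and textFile
numbers = sc.parallelize([1, 2, 3, 4, 5])   # RDD from a Python collection
lines = sc.textFile("input.txt")            # RDD from a text file (placeholder path)

# Notebooks 02-03: basic transformations/actions and the WordCount example
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
print(word_counts.take(5))                  # action: triggers the lazy computation

# Notebook 05: reading a column-oriented Parquet file into a DataFrame
df = spark.read.parquet("data.parquet")     # placeholder path
df.printSchema()

spark.stop()
```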

