
Big-Data-Fundamentals-with-PySpark

Description

Advance your data skills by mastering Apache Spark. Using the Spark Python API, PySpark, you will leverage parallel computation with large datasets and get ready for high-performance machine learning. From cleaning data to creating features and implementing machine learning models, you'll execute end-to-end workflows with Spark. The track ends with building a recommendation engine using the popular MovieLens dataset and the Million Song Dataset.

1. Introduction to Big Data analysis with Spark Notebook

Fundamentals of Big Data and an introduction to Spark as a distributed computing framework

  • Main components: Spark Core and Spark's built-in libraries - Spark SQL, Spark MLlib, GraphX, and Spark Streaming
  • PySpark: Apache Spark's Python API to execute Spark jobs
  • PySpark shell: For developing interactive applications in Python
  • Spark modes: Local mode and cluster mode (see the sketch below)
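
The chapter works mainly in the PySpark shell, but the same setup applies in a standalone script. Below is a minimal sketch of starting Spark in local mode; the app name and master URL are illustrative assumptions, not part of the course material.

```python
# A minimal sketch of starting PySpark in local mode; the app name and
# master URL here are illustrative assumptions, not course material.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")    # local mode: one JVM, all available cores
         .appName("intro-example")
         .getOrCreate())
sc = spark.sparkContext        # the underlying SparkContext

print(sc.version)              # confirm the session is running
spark.stop()
```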

2. Programming in PySpark RDDs Notebook

Introduction to RDDs, their key features, ways of creating RDDs, and RDD operations (transformations and actions)

  • Transformations: map(), flatMap(), filter(), union()
  • Actions: collect(), take(), first(), count()
  • Paired RDD Transformations: reduceByKey(), groupByKey(), sortByKey(), join(), countByKey(), collectAsMap()
  • Advanced RDD Actions: reduce(), saveAsTextFile()
  • Project: Write code that calculates the most common words (a sketch follows this list)
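
A minimal sketch of the word-count project using the transformations and actions listed above; the input path "shakespeare.txt" is a placeholder assumption.

```python
# Word-count sketch: the input path "shakespeare.txt" is a placeholder.
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count")

counts = (sc.textFile("shakespeare.txt")                  # RDD of lines
          .flatMap(lambda line: line.lower().split())     # lines -> words
          .filter(lambda word: word != "")                # drop empty tokens
          .map(lambda word: (word, 1))                    # (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))               # sum per word

# Flip to (count, word), sort descending, and take the top ten
top10 = (counts.map(lambda wc: (wc[1], wc[0]))
         .sortByKey(ascending=False)
         .take(10))
print(top10)
```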

3. PySpark SQL & DataFrames Notebook

Introduction to Spark SQL, the DataFrame abstraction, creating DataFrames, DataFrame operations, and visualizing Big Data through DataFrames

  • DataFrame Transformations: select(), filter(), groupBy(), orderBy(), dropDuplicates(), and withColumnRenamed()
  • DataFrame Actions: head(), show(), count(), columns, and describe()
  • Data Visualization: hist(), distplot(), pandas_histogram(), toPandas(), HandySpark
  • Project: Exploratory data analysis (EDA) on the "FIFA 2018 World Cup Player" dataset (see the sketch below)
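
A short sketch combining the DataFrame operations listed above; the CSV file name and the column names ("Name", "Age", "Nationality") are assumptions chosen for illustration, not the notebook's exact schema.

```python
# EDA sketch: the CSV name and columns ("Name", "Age", "Nationality")
# are assumptions for illustration, not the notebook's exact schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fifa-eda").getOrCreate()
df = spark.read.csv("Fifa2018_dataset.csv", header=True, inferSchema=True)

df.select("Name", "Age").show(5)
print(df.columns)                     # columns is an attribute, not a method

germany = df.filter(df.Nationality == "Germany").dropDuplicates()
germany.describe("Age").show()        # summary statistics for one column

# Collect a small aggregate to pandas for plotting with matplotlib
age_counts = df.groupBy("Age").count().orderBy("Age").toPandas()
```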

4. Machine Learning with PySpark MLlib Notebook

Introduction to Spark MLlib and the three C's of machine learning: Collaborative filtering, Classification, and Clustering
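
As a flavor of the collaborative-filtering piece that the recommendation-engine project builds on, here is a minimal sketch using MLlib's ALS on toy ratings; the rank and iteration values are illustrative assumptions.

```python
# Collaborative-filtering sketch with MLlib's ALS on toy ratings;
# the rank and iteration values are illustrative assumptions.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[*]", "als-example")

# (userID, productID, rating) triples -- toy data, not a real dataset
ratings = sc.parallelize([
    Rating(0, 0, 4.0), Rating(0, 1, 2.0),
    Rating(1, 0, 3.0), Rating(1, 2, 5.0),
    Rating(2, 1, 1.0), Rating(2, 2, 4.0),
])

model = ALS.train(ratings, rank=10, iterations=10)
print(model.predict(0, 2))    # predicted rating for user 0, item 2
```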
