# `BIG DATA FUNDAMENTALS WITH PYSPARK`

## What is Big Data?
Big data is a term used to refer to the study and applications of data sets that are too
complex for traditional data-processing software. (Wikipedia)

## The 3 V's of Big Data
- The three V's are Volume, Variety and Velocity
- `Volume:` Size of the data
- `Variety:` Different sources & formats
- `Velocity:` Speed of the data

## Big Data concepts and Terminology
- `Clustered computing:` Collection of resources of multiple machines
- `Parallel computing:` Simultaneous computation
- `Distributed computing:` Collection of nodes (networked computers) that run in parallel
- `Batch processing:` Breaking the job into small pieces and running them on individual
machines
- `Real-time processing:` Immediate processing of data


NOTES: **Clustered computing** is the pooling of resources of multiple machines to complete jobs. **Parallel computing** is a type of computation in which many calculations are carried out simultaneously. **Distributed computing** involves nodes (networked computers) that run jobs in parallel.
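
The terms above can be illustrated without Spark. The following is a minimal sketch, assuming Python's standard `concurrent.futures` module as a stand-in for a cluster: the job is broken into pieces, each piece is handed to a worker, and the partial results are combined. The helper names `chunked` and `sum_of_squares` are hypothetical. The threads only imitate separate machines; this is not how Spark itself distributes work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The work done on one piece of the job."""
    return sum(x * x for x in chunk)


numbers = list(range(1, 101))
chunks = chunked(numbers, 25)  # 4 pieces of 25 numbers each

# Each worker thread stands in for one machine in a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

# Combine the partial results into the final answer.
total = sum(partial_sums)
print(partial_sums)
print(total)
```

Splitting the input first and combining afterwards is the same overall shape that batch frameworks follow, only at a much larger scale.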

## Big Data processing systems
- `Hadoop/MapReduce:` Scalable and fault tolerant framework written in Java
    - Open source
    - Batch processing
- `Apache Spark:` General purpose and lightning fast cluster computing system
    - Open source
    - Both batch and real-time data processing


## Features of Apache Spark framework
- Distributed cluster computing framework
- Efficient in-memory computations for large data sets
- Lightning fast data processing framework
- Provides support for Java, Scala, Python, R and SQL

## Apache Spark Components
![image.png](attachment:0cb072c7-7396-4fa2-a300-31ce5a941d0e.png)

## Spark modes of deployment
- `Local mode:` Single machine such as your laptop
    - Local mode is convenient for testing, debugging and demonstration
- `Cluster mode:` Set of pre-defined machines
    - Good for production
- Workflow: Local -> Cluster
- No code changes necessary

# `PySpark: Spark with Python`

## Overview of PySpark
- Apache Spark is written in Scala
- To support Python with Spark, the Apache Spark community released PySpark
- Similar computation speed and power as Scala
- PySpark APIs are similar to Pandas and Scikit-learn


## What is Spark shell?
- Interactive environment for running Spark jobs
- Helpful for fast interactive prototyping
- Spark’s shells allow interacting with data on disk or in memory
- Three different Spark shells:
    - Spark-shell for Scala
    - PySpark-shell for Python
    - SparkR for R

## PySpark shell
- PySpark shell is the Python-based command line tool
- PySpark shell allows data scientists to interface with Spark data structures
- PySpark shell supports connecting to a cluster

## Understanding `SparkContext`
- `SparkContext` is an entry point into the world of Spark
- An entry point is a way of connecting to a Spark cluster
- An entry point is like a key to the house
- PySpark has a default `SparkContext` called **sc**


NOTES:
- An **entry point** is where control is transferred from the operating system to the provided program.
- In simpler terms, it's like a key to your house. Without the key you cannot enter the house; similarly, without an entry point you cannot run any PySpark jobs.

### Inspecting SparkContext

In [1]:
# Install PySpark
!pip install pyspark

In [2]:
# Install findspark, which helps Python locate a Spark installation
!pip install findspark

In [3]:
# Import libraries
import os
from pyspark import SparkConf
from pyspark.context import SparkContext

In [4]:
# Print the Spark installation directory (raises KeyError if SPARK_HOME is not set)
print(os.environ['SPARK_HOME'])

C:\spark\spark-3.3.0-bin-hadoop3


In [5]:
# Create a SparkContext in local mode using all available cores, or reuse an existing one
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

`Version:` The Spark version the SparkContext is running on

In [6]:
sc.version

'3.3.0'

`Python Version:` The Python version used by the SparkContext

In [7]:
sc.pythonVer

'3.10'

`Master:` URL of the cluster, or a “local” string when the SparkContext runs in local mode

In [8]:
sc.master

'local[*]'
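
As a rough illustration of what a master string can look like, the sketch below uses a hypothetical helper, `describe_master`, which is not part of PySpark. It assumes the commonly documented forms `local`, `local[K]`, `local[*]` and `spark://HOST:PORT`; the host name in the example is a placeholder.

```python
import re


def describe_master(master):
    """Hypothetical helper (not a PySpark API): explain a master string."""
    if master == "local":
        return "local mode, 1 worker thread"
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match:
        threads = match.group(1)
        if threads == "*":
            return "local mode, one worker thread per logical core"
        return f"local mode, {threads} worker threads"
    if master.startswith("spark://"):
        return "standalone cluster at " + master[len("spark://"):]
    return "other cluster manager or unrecognized value"


# "host:7077" is a placeholder address, not a real cluster.
for m in ["local", "local[4]", "local[*]", "spark://host:7077"]:
    print(m, "->", describe_master(m))
```

The value `'local[*]'` shown above therefore corresponds to local mode with as many worker threads as the machine has logical cores.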

### Loading data in PySpark

- SparkContext's `parallelize()` method creates an RDD from a Python collection

In [9]:
rdd = sc.parallelize([1, 2, 3, 4, 5])

- SparkContext's `textFile()` method creates an RDD from a text file, one element per line

In [10]:
rdd2 = sc.textFile("test.txt")

## Use of functions in Python - `lambda`, `map()`, `filter()`

### What are anonymous functions in Python?
- Lambda functions are anonymous functions in Python
- They are powerful and widely used in Python, and work particularly well with `map()` and `filter()`
- Like `def`, a lambda creates a function to be called later
- Unlike `def`, it returns the function without binding it to a name (i.e. anonymous)
- Typical uses are to inline a function definition or to defer execution of a piece of code

### Lambda function syntax

- The general form of lambda functions is

`>>> lambda arguments: expression`

- Example of lambda function

In [11]:
double = lambda x: x * 2
print(double(3))

6
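
The two typical uses of lambdas, inlining a function definition and deferring execution, can be sketched as follows. The word list and the `operations` dictionary are made up for illustration.

```python
# Inline definition: a lambda passed directly as the sort key.
words = ["spark", "rdd", "python", "big"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['rdd', 'big', 'spark', 'python']

# Deferred execution: the lambdas are stored now and only run when called.
operations = {
    "double": lambda x: x * 2,
    "square": lambda x: x ** 2,
}
print(operations["double"](3))  # 6
print(operations["square"](3))  # 9
```

Because `sorted()` is stable, words of equal length keep their original relative order.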


### Difference between def vs lambda functions

- Python code to illustrate cube of a number

In [12]:
# Regular Python function
def cube(x):
    return x ** 3

# Equivalent lambda function
g = lambda x: x ** 3

# Print both results to the console
print(g(10))
print(cube(10))

1000
1000


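### Conclusion
- A lambda has no `return` statement; the value of its expression is returned automatically
- A lambda can be placed anywhere a function is expected, for example inline as an argument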
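
A small sketch of these differences, reusing `cube` and `g` from the cell above. The helper `apply_twice` is a hypothetical name introduced only for this illustration.

```python
def cube(x):
    return x ** 3


g = lambda x: x ** 3

# A def-defined function carries its own name; a lambda does not.
print(cube.__name__)  # cube
print(g.__name__)     # <lambda>


def apply_twice(func, value):
    """Hypothetical helper: apply `func` to `value` two times."""
    return func(func(value))


# The lambda is written right where the function is needed.
print(apply_twice(lambda x: x ** 3, 2))  # 512
```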


### Use of Lambda function in Python - `map()`

- `map()` takes a function and a list and applies the function to each item; in Python 3 it returns an iterator over the results, which `list()` turns into a new list
- General syntax of `map()`

`>>> map(function, list)`


- Example of `map()`

In [13]:
items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))

[3, 4, 5, 6]
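
In Python 3, `map()` returns a lazy iterator rather than a list, which is why the example above wraps it in `list()`. The sketch below shows that behaviour, and also that `map()` accepts more than one iterable when the function takes several arguments.

```python
items = [1, 2, 3, 4]

# map() does no work until its results are requested.
mapped = map(lambda x: x + 2, items)
print(type(mapped).__name__)  # map

first = next(mapped)      # computes only the first result
rest = list(mapped)       # computes the remaining results
exhausted = list(mapped)  # the iterator is now used up
print(first, rest, exhausted)  # 3 [4, 5, 6] []

# With two iterables, the function receives one item from each.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```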

### Use of Lambda function in Python - `filter()`

- `filter()` takes a function and a list and keeps only the items for which the function evaluates as `True`; in Python 3 it returns an iterator, which `list()` turns into a new list
- General syntax of `filter()`

`>>> filter(function, list)`

- Example of `filter()`

In [14]:
items = [1, 2, 3, 4]
list(filter(lambda x: (x%2 != 0), items))

[1, 3]
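
For comparison, the same selection can be written as a list comprehension. The sketch below also shows that passing `None` as the function keeps only the truthy items; the `mixed` list is made up for illustration.

```python
items = [1, 2, 3, 4]

# filter() with a lambda and the equivalent list comprehension.
odd_filter = list(filter(lambda x: x % 2 != 0, items))
odd_comprehension = [x for x in items if x % 2 != 0]
print(odd_filter == odd_comprehension)  # True

# With None as the function, filter() drops falsy values such as 0, "" and None.
mixed = [0, 1, "", "spark", None, 5]
truthy = list(filter(None, mixed))
print(truthy)  # [1, 'spark', 5]
```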

# EXERCISE:

- Print the version of SparkContext in the PySpark shell.
- Print the Python version of SparkContext in the PySpark shell.
- What is the master of SparkContext in the PySpark shell?

In [15]:
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)

The version of Spark Context in the PySpark shell is 3.3.0
The Python version of Spark Context in the PySpark shell is 3.10
The master of Spark Context in the PySpark shell is local[*]


- Create a Python list named numb containing the numbers 1 to 100.
- Load the list into Spark using Spark Context's parallelize method and assign it to a variable spark_data.

In [16]:
# Create a Python list of numbers from 1 to 100
numb = list(range(1, 101))

# Load the list into PySpark
spark_data = sc.parallelize(numb)

- Load the local text file test.txt in the PySpark shell.

In [17]:
# Load a local file into PySpark shell
lines = sc.textFile("test.txt")

- Print my_list which is available in your environment.
- Square each item in my_list using map() and a lambda function.
- Print the result of map function.

In [18]:
# List containing the numbers 1 to 10
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Print my_list in the console
print("Input list is", my_list)

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x ** 2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)

Input list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The squared numbers are [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


- Print my_list2 which is available in your environment.
- Filter the numbers divisible by 10 from my_list2 using filter() and a lambda function.
- Print the numbers divisible by 10 from my_list2.

In [19]:
# Define my_list2
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Print my_list2 in the console
print("Input list is:", my_list2)

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)

Input list is: [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
Numbers divisible by 10 are: [10, 40, 60, 80]
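
The two exercises can also be combined. The sketch below chains `filter()` and `map()` in plain Python, so that only the numbers divisible by 10 are squared. PySpark RDDs offer `map()` and `filter()` methods that accept lambdas in a similar way, although they are not shown here.

```python
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Step 1: keep only the numbers divisible by 10 (lazy, nothing computed yet).
divisible = filter(lambda x: x % 10 == 0, my_list2)

# Step 2: square each remaining number (still lazy).
squared = map(lambda x: x ** 2, divisible)

# Step 3: materialize the result; both steps run here.
result = list(squared)
print(result)  # [100, 1600, 3600, 6400]
```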


## Good Luck :)