# Set Up

In [1]:
import pyspark as sp
sc = sp.SparkContext.getOrCreate()

24/04/02 14:05:01 WARN Utils: Your hostname, codespaces-f38966 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
24/04/02 14:05:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/02 14:05:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


24/04/02 14:05:14 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


# Introduction to Big Data analysis with Spark

## Understanding SparkContext
`SparkContext` is the entry point for interacting with the underlying Spark functionality.

An entry point is where control is transferred from the operating system to the provided program. In simpler terms, it is like the key to your house: without the key you cannot enter the house, and without an entry point you cannot run any PySpark jobs. **You can access the SparkContext in the PySpark shell as a variable named `sc`.**

In [4]:
# Print the Spark version reported by the SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

The version of Spark Context in the PySpark shell is 3.5.1


## Interactive Use of PySpark
Spark comes with an interactive Python shell in which PySpark is already installed. The PySpark shell is useful for basic testing and debugging, and it is quite powerful.

## Loading data in PySpark shell
PySpark can load data through the SparkContext in two different ways:
* The first is the SparkContext’s `parallelize()` method, which distributes an existing Python collection such as a list.
* The second is the SparkContext’s `textFile()` method, which reads a text file line by line.


In [5]:
# Create a range of numbers from 1 to 100 (the end of range() is exclusive)
numb = range(1, 101)

# Load the list into PySpark
spark_data = sc.parallelize(numb)

In [6]:
# file path
file_path = "5000_points.txt"
# Load a local file into PySpark shell
lines = sc.textFile(file_path)

## Use of lambda with map()
The `map()` function in Python applies the given function to each item of a given iterable (list, tuple, etc.). In Python 3 it returns a lazy iterator rather than a list, so wrap the call in `list()` to see the results.

The general syntax of the `map()` function is `map(fun, iter)`. We can also use lambda functions with `map()`.

The general syntax of the `map()` function with a lambda function is `map(lambda <argument>: <expression>, iter)`.

In [7]:
my_list = [3, 5, 6, 8, 10, 12, 27, 31]
# Print my_list in the console
print("Input list is", my_list)

Input list is [3, 5, 6, 8, 10, 12, 27, 31]


In [8]:
# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x**2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)

The squared numbers are [9, 25, 36, 64, 100, 144, 729, 961]
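As a supplement to the cell above, here is a minimal sketch of how `map()` behaves in plain Python 3. The helper name `square` and the shortened list `values` are chosen here for illustration only. It shows that `map()` yields a lazy iterator that can be consumed once, and that a named function, a lambda, and a list comprehension all give the same result.

```python
# Hypothetical helper, named here only for illustration
def square(x):
    return x ** 2

values = [3, 5, 6, 8]

mapped = map(square, values)         # a lazy map object, not a list
is_list = isinstance(mapped, list)   # False in Python 3

squares_named = list(mapped)                           # consume the iterator once
squares_lambda = list(map(lambda x: x ** 2, values))   # same logic with a lambda
squares_comprehension = [x ** 2 for x in values]       # equivalent comprehension

leftover = list(mapped)  # the iterator is already exhausted, so this is empty

print("Squares:", squares_named)
print("All three forms agree:", squares_named == squares_lambda == squares_comprehension)
```

A lambda is convenient for short, one-off expressions. A named function may be easier to read and test once the logic grows.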


## Use of lambda with filter()
The `filter()` function in Python takes a function and an iterable (such as a list) as arguments and keeps only the items for which the function returns a true value. Like `map()`, it returns a lazy iterator in Python 3.

The general syntax of the `filter()` function is `filter(function, list_of_input)`.

The general syntax of the `filter()` function with a lambda function is `filter(lambda <argument>: <expression>, list)`.

In [9]:
my_list2 = [1, 10, 2, 100, 3, 1000]

# Print my_list2 in the console
print("Input list is:", my_list2)

Input list is: [1, 10, 2, 100, 3, 1000]


In [10]:
# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)

Numbers divisible by 10 are: [10, 100, 1000]
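As a supplement to the cell above, the following sketch shows the same filtering written three ways in plain Python 3. The predicate name `is_multiple_of_ten` and the variable names are assumptions made for this example. It also shows that `filter()` accepts any iterable, here a tuple, and not only a list.

```python
# Hypothetical predicate, named here only for illustration
def is_multiple_of_ten(x):
    return x % 10 == 0

values = [1, 10, 2, 100, 3, 1000]

filtered = filter(is_multiple_of_ten, values)   # a lazy filter object
kept_named = list(filtered)                     # materialize the results

kept_lambda = list(filter(lambda x: x % 10 == 0, values))   # same test as a lambda
kept_comprehension = [x for x in values if x % 10 == 0]     # equivalent comprehension

# filter() works on any iterable; here the input is a tuple
kept_from_tuple = list(filter(is_multiple_of_ten, (5, 20, 25, 30)))

print("Kept:", kept_named)
print("From tuple:", kept_from_tuple)
```

The order of the input is preserved in every variant, since each one walks the iterable from left to right.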
