# 0. **Install PySpark**

In [3]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=895c43e2994f1d0e0c9108f6817da10fd93f16de0e4eb397ec817c7cf16fd41e
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


# 1. **Importing Libraries and Initializing Spark Session**:


In [4]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

- Imports necessary PySpark libraries.
- Initializes a Spark session with the application name 'SparkByExamples.com'.


# 2. **Broadcast Variable Definition**:


In [5]:
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

- Defines a dictionary `states` mapping state codes to state names.
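- Broadcasts `states` with `sparkContext.broadcast`, so each executor receives one read-only copy that tasks read through `broadcastStates.value`, rather than the dictionary being shipped with every task.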


# 3. **Defining Sample Data and Schema**:


In [6]:
data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")]

columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)

+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|James    |Smith   |USA    |CA   |
|Michael  |Rose    |USA    |NY   |
|Robert   |Williams|USA    |CA   |
|Maria    |Jones   |USA    |FL   |
+---------+--------+-------+-----+



- Defines sample data as a list of tuples, where each tuple represents a row in the DataFrame.
- Defines four column names: `firstname`, `lastname`, `country`, and `state`. Only names are given, so Spark infers each column's type from the data (all strings here).
- Creates a DataFrame from the sample data and the column names.
- Prints the schema of the DataFrame.
- Displays the content of the DataFrame without truncating the output.


# 4. **Defining a Function to Convert State Codes**:


In [7]:
def state_convert(code):
    return broadcastStates.value[code]

- Defines a function `state_convert` that takes a state code as input and returns the full state name using the broadcast dictionary.
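
As written, `state_convert` raises a `KeyError` for any code that is missing from the dictionary, which would fail the Spark task that hits it. A minimal sketch of a more tolerant variant follows. The name `state_convert_safe` and its `lookup` and `default` parameters are hypothetical and not part of the original notebook; in the notebook, `lookup` would be `broadcastStates.value`.

```python
def state_convert_safe(code, lookup, default=None):
    """Return the full state name for `code`, or a fallback when it is unknown.

    `lookup` is a plain dict of code -> name (hypothetical parameter; in the
    notebook it would be `broadcastStates.value`).
    """
    # dict.get avoids the KeyError that lookup[code] raises for unknown codes.
    # With no explicit default, the original code is returned unchanged.
    return lookup.get(code, code if default is None else default)


states = {"NY": "New York", "CA": "California", "FL": "Florida"}
known = state_convert_safe("CA", states)                       # code in the dict
unknown = state_convert_safe("TX", states)                     # falls back to the code
flagged = state_convert_safe("TX", states, default="Unknown")  # explicit fallback
```

Whether to keep the raw code or substitute a marker such as `"Unknown"` depends on how downstream steps treat unmapped states.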

# 5. **Transforming the DataFrame Using RDD**:


In [8]:
result = df.rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).toDF(columns)
result.show(truncate=False)

+---------+--------+-------+----------+
|firstname|lastname|country|state     |
+---------+--------+-------+----------+
|James    |Smith   |USA    |California|
|Michael  |Rose    |USA    |New York  |
|Robert   |Williams|USA    |California|
|Maria    |Jones   |USA    |Florida   |
+---------+--------+-------+----------+



- Converts the DataFrame to an RDD.
- Uses the `map` function to apply the `state_convert` function to the `state` field of each row.
- Converts the transformed RDD back to a DataFrame with the original column names.
- Displays the transformed DataFrame without truncating the output.
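
The lambda above rebuilds each row field by field, so it has to be edited whenever a column is added. As a hedged sketch, the per-row logic can be pulled into a small helper. `replace_state` and `STATE_INDEX` are hypothetical names, and the helper takes the lookup dictionary as an argument so it can be tried on plain tuples without a cluster. PySpark `Row` objects are tuples, so the same function should also work inside `df.rdd.map`.

```python
STATE_INDEX = 3  # position of the `state` column in `columns` (assumption)


def replace_state(row, lookup, index=STATE_INDEX):
    """Return a copy of `row` as a tuple with the code at `index` expanded."""
    values = list(row)                     # copy, so the input row is untouched
    values[index] = lookup[values[index]]  # same lookup as state_convert
    return tuple(values)


states = {"NY": "New York", "CA": "California", "FL": "Florida"}
sample = ("James", "Smith", "USA", "CA")
converted = replace_state(sample, states)
```

In the notebook this could be called as `df.rdd.map(lambda r: replace_state(r, broadcastStates.value)).toDF(columns)`.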


# 6. **Filtering DataFrame Using Broadcast Variable**:


In [13]:
# Convert dictionary keys to list
state_keys = list(broadcastStates.value.keys())

# Filter DataFrame using broadcast variable
filteredDf = df.where(df['state'].isin(state_keys))
filteredDf.show(truncate=False)

+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|James    |Smith   |USA    |CA   |
|Michael  |Rose    |USA    |NY   |
|Robert   |Williams|USA    |CA   |
|Maria    |Jones   |USA    |FL   |
+---------+--------+-------+-----+



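- Reads the dictionary keys from `broadcastStates.value` on the driver and collects them into a list.
- Uses the `where` method with `isin` to keep only the rows whose `state` code appears in that list. Every sample row has a known code, so all four rows are returned.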
- Displays the filtered DataFrame without truncating the output.
