# 0. **Install PySpark**:

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=3c6f0691f5e6d071ad35fed391090247e1d0f1e8d29b06eaa9334f99e480bec1
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


# 1. **Importing Libraries and Initializing Spark Session**:

In [2]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

   - Imports the `pyspark` package and the `SparkSession` class.
   - Initializes a Spark session with the application name 'SparkByExamples.com'; `getOrCreate()` reuses an existing session if one is already running.


# 2. **Defining Sample Data and Schema**:


In [3]:
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]

deptColumns = ["dept_name", "dept_id"]

- Defines sample data as a list of tuples, where each tuple represents a row in the DataFrame.
- Defines the column names `dept_name` and `dept_id`. Because only names are given, Spark infers each column's type from the data.


# 3. **Creating DataFrame**:


In [4]:
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



   - Creates a DataFrame from the sample data and schema.
   - Prints the schema of the DataFrame.
   - Displays the content of the DataFrame without truncating the output.


# 4. **Collecting Data from DataFrame**:


In [5]:
dataCollect = deptDF.collect()
print(dataCollect)

dataCollect2 = deptDF.select("dept_name").collect()
print(dataCollect2)

[Row(dept_name='Finance', dept_id=10), Row(dept_name='Marketing', dept_id=20), Row(dept_name='Sales', dept_id=30), Row(dept_name='IT', dept_id=40)]
[Row(dept_name='Finance'), Row(dept_name='Marketing'), Row(dept_name='Sales'), Row(dept_name='IT')]


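   - Collects all rows of the DataFrame into a list of `Row` objects with `collect()`. The full result is returned to the driver, so this suits small DataFrames like this one.
   - Collects only the `dept_name` column by calling `select()` before `collect()`. The result is still a list of `Row` objects, each with a single field.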


# 5. **Iterating Over Collected Data**:


In [6]:
for row in dataCollect:
    print(row['dept_name'] + "," + str(row['dept_id']))

Finance,10
Marketing,20
Sales,30
IT,40


   - Iterates over the collected rows and prints each department's name and ID as `name,id`, converting the numeric ID to a string with `str()` before concatenation.
