# 0. **Install PySpark**

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=fb784e791a1e709030fe49dbaa69b2785fdaa74efcf0671192b7504be4bba469
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


# 1. **Create a Pandas DataFrame**:


In [2]:
import pandas as pd
data = [['Scott', 50], ['Jeff', 45], ['Thomas', 54], ['Ann', 34]]
pandasDF = pd.DataFrame(data, columns=['Name', 'Age'])
print(pandasDF)

     Name  Age
0   Scott   50
1    Jeff   45
2  Thomas   54
3     Ann   34


   Creates a Pandas DataFrame with the provided data and prints it.
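
   Spark later infers its column types from the pandas dtypes, so it can help to look at them before converting. The following is a minimal sketch that rebuilds the same DataFrame and prints its dtypes; the exact dtype names depend on the installed pandas and NumPy versions.

```python
import pandas as pd

# Same sample data as in the cell above.
data = [['Scott', 50], ['Jeff', 45], ['Thomas', 54], ['Ann', 34]]
pandasDF = pd.DataFrame(data, columns=['Name', 'Age'])

# Inspect the per-column dtypes; Spark's schema inference starts from these.
print(pandasDF.dtypes)

# The shape is (rows, columns).
print(pandasDF.shape)
```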


# 2. **Initialize Spark session**:

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()


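   Creates a Spark session, or reuses an existing one, that runs locally with a single worker thread (`local[1]`) under the application name `SparkByExamples.com`.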

# 3. **Convert Pandas DataFrame to Spark DataFrame**:


In [5]:
sparkDF = spark.createDataFrame(pandasDF)

sparkDF.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)



In [6]:
sparkDF.show()

+------+---+
|  Name|Age|
+------+---+
| Scott| 50|
|  Jeff| 45|
|Thomas| 54|
|   Ann| 34|
+------+---+



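   Converts the Pandas DataFrame to a Spark DataFrame and prints its schema and content. Because no schema is given, Spark infers one from the data: `Name` becomes `string` and the integer `Age` column becomes `long`.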

# 4. **Define a custom schema**:


In [7]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
mySchema = StructType([
    StructField("First Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

   Defines a custom schema with two nullable fields: `First Name` (string) and `Age` (integer).


# 5. **Apply the custom schema**:


In [8]:
sparkDF2 = spark.createDataFrame(pandasDF, schema=mySchema)

sparkDF2.printSchema()

root
 |-- First Name: string (nullable = true)
 |-- Age: integer (nullable = true)



In [9]:
sparkDF2.show()

+----------+---+
|First Name|Age|
+----------+---+
|     Scott| 50|
|      Jeff| 45|
|    Thomas| 54|
|       Ann| 34|
+----------+---+



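   Creates a new Spark DataFrame with the custom schema and prints its schema and content. The schema is applied by column position, so the pandas column `Name` appears as `First Name`, and `Age` is stored as `integer` instead of the inferred `long`.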
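
   Since the schema above is matched to the pandas columns by position, a reordered pandas DataFrame could silently end up with mislabeled columns. One way to make the mapping explicit is to rename and cast in pandas first. This is a sketch of that alternative, not a step from the original notebook; the name `alignedDF` is made up for illustration.

```python
import pandas as pd

# Same sample data as in the earlier cells.
data = [['Scott', 50], ['Jeff', 45], ['Thomas', 54], ['Ann', 34]]
pandasDF = pd.DataFrame(data, columns=['Name', 'Age'])

# Rename and cast in pandas so names and types already match the target schema.
alignedDF = (
    pandasDF
    .rename(columns={"Name": "First Name"})  # match the schema's field name
    .astype({"Age": "int32"})                # 32-bit integers, like IntegerType
)

print(alignedDF.dtypes)
```

   `alignedDF` could then be passed to `spark.createDataFrame(alignedDF, schema=mySchema)` in place of `pandasDF`.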


# 6. **Configure Spark to use Apache Arrow**:


In [10]:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
pandasDF2 = sparkDF2.select("*").toPandas()
print(pandasDF2)

  First Name  Age
0      Scott   50
1       Jeff   45
2     Thomas   54
3        Ann   34


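   Enables Apache Arrow for conversions between Spark and Pandas DataFrames, then converts the Spark DataFrame back to a Pandas DataFrame with `toPandas()`. Arrow-based conversion requires the `pyarrow` package; with fallback enabled, Spark reverts to the non-Arrow path if Arrow cannot be used.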


# 7. **Verify Spark configurations for Apache Arrow**:


1. **`spark.conf.set("spark.sql.execution.arrow.enabled", "true")`**:
   - **Purpose**: Enables the use of Apache Arrow in PySpark.
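   - **Explanation**: When this configuration is set to `true`, PySpark uses Apache Arrow to transfer data when converting between Spark DataFrames and Pandas DataFrames (`toPandas()` and `createDataFrame(pandasDF)`). Arrow's columnar in-memory format avoids row-by-row serialization, which can speed up the conversion significantly. Since Spark 3.0 this key is deprecated in favor of `spark.sql.execution.arrow.pyspark.enabled`; the old name is still accepted in Spark 3.5.1, the version installed in this notebook.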

2. **`spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")`**:
   - **Purpose**: Enables fallback to non-Arrow implementation if Arrow-based conversion fails.
   - **Explanation**: If the Arrow-based conversion cannot be used (e.g., because of an unsupported column type or a missing or incompatible `pyarrow` installation), Spark falls back to the traditional conversion method instead of raising an error. The conversion still completes, only without the Arrow speedup.

### Why Use Apache Arrow?

- **Performance**: Arrow optimizes the conversion process, reducing the time required to convert large datasets between Spark and Pandas. This is especially beneficial when dealing with big data, where conversion overhead can be significant.
- **Memory Efficiency**: Arrow's columnar memory layout is designed for efficient memory use, which can help reduce the memory footprint during conversions.
- **Cross-Language Support**: Arrow provides a standardized memory format that can be used across different languages (e.g., Python, Java, R), making it easier to share data between different parts of a data processing pipeline.



In [11]:
arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
print("Apache Arrow Enabled:", arrow_enabled)

arrow_fallback_enabled = spark.conf.get("spark.sql.execution.arrow.pyspark.fallback.enabled")
print("Apache Arrow Fallback Enabled:", arrow_fallback_enabled)

Apache Arrow Enabled: true
Apache Arrow Fallback Enabled: true


   Reads both settings back from the session and prints them to verify that Apache Arrow and its fallback are enabled.
