# 0. **Install PySpark**

In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=1c46b882e31fb009a5ff45ffd28367f2eea1cbe7d4d9c9b290ae367e008f9200
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


# 1. **Importing Libraries and Initializing Spark Session**:


In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

- Imports necessary PySpark libraries.
- Initializes a Spark session with the application name 'SparkByExamples.com'.

# 2. **Defining Sample Data and Schema**:


In [4]:
simpleData = [("James", 34, "2006-01-01", "true", "M", 3000.60),
              ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
              ("Robert", 37, "06-01-1992", "false", "M", 5000.50)]

columns = ["firstname", "age", "jobStartDate", "isGraduated", "gender", "salary"]
df = spark.createDataFrame(data=simpleData, schema=columns)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- age: long (nullable = true)
 |-- jobStartDate: string (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

+---------+---+------------+-----------+------+------+
|firstname|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James    |34 |2006-01-01  |true       |M     |3000.6|
|Michael  |33 |1980-01-10  |true       |F     |3300.8|
|Robert   |37 |06-01-1992  |false      |M     |5000.5|
+---------+---+------------+-----------+------+------+



- Defines sample data as a list of tuples, where each tuple represents a row in the DataFrame.
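- Defines the column names `firstname`, `age`, `jobStartDate`, `isGraduated`, `gender`, and `salary`. Only names are given, so Spark infers each column's type from the data.
- Creates a DataFrame from the sample data and column names. As the schema output shows, `age` is inferred as `long`, `salary` as `double`, and the remaining columns (including `jobStartDate` and `isGraduated`) as `string`.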
- Prints the schema of the DataFrame.
- Displays the content of the DataFrame without truncating the output.


# 3. **Casting Columns to Different Types Using DataFrame API**:

In [8]:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, BooleanType, DateType

df2 = df.withColumn("age", col("age").cast(StringType())) \
        .withColumn("isGraduated", col("isGraduated").cast(BooleanType())) \
        .withColumn("jobStartDate", col("jobStartDate").cast(DateType()))
df2.printSchema()
df2.show()

root
 |-- firstname: string (nullable = true)
 |-- age: string (nullable = true)
 |-- jobStartDate: date (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

+---------+---+------------+-----------+------+------+
|firstname|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|    James| 34|  2006-01-01|       true|     M|3000.6|
|  Michael| 33|  1980-01-10|       true|     F|3300.8|
|   Robert| 37|        NULL|      false|     M|5000.5|
+---------+---+------------+-----------+------+------+



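- Imports `col` from `pyspark.sql.functions` and the target data types from `pyspark.sql.types`.
- Uses `withColumn` to cast the `age` column to `StringType`, the `isGraduated` column to `BooleanType`, and the `jobStartDate` column to `DateType`.
- Prints the schema and displays the content of the resulting DataFrame `df2`.
- Robert's `jobStartDate` becomes `NULL` because `06-01-1992` is not in the `yyyy-MM-dd` form that the cast to `DateType` expects. The cast does not raise an error here; it silently yields `NULL`.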


# 4. **Casting Columns to Different Types Using `selectExpr`**:


In [6]:
df3 = df2.selectExpr("cast(age as int) age",
                    "cast(isGraduated as string) isGraduated",
                    "cast(jobStartDate as string) jobStartDate")
df3.printSchema()
df3.show(truncate=False)

root
 |-- age: integer (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- jobStartDate: string (nullable = true)

+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true       |2006-01-01  |
|33 |true       |1980-01-10  |
|37 |false      |NULL        |
+---+-----------+------------+



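- Uses `selectExpr` on `df2` with SQL cast expressions to cast the `age` column to `int`, the `isGraduated` column to `string`, and the `jobStartDate` column to `string`.
- Because `selectExpr` returns only the expressions it is given, `df3` contains just these three columns; `firstname`, `gender`, and `salary` are dropped.
- Prints the schema and displays the content of the resulting DataFrame `df3`. The `NULL` in `jobStartDate` carries over from `df2`.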

# 5. **Using SQL Queries to Perform Casting**:


In [7]:
df3.createOrReplaceTempView("CastExample")

df4 = spark.sql("SELECT STRING(age), BOOLEAN(isGraduated), DATE(jobStartDate) from CastExample")
df4.printSchema()
df4.show(truncate=False)

root
 |-- age: string (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- jobStartDate: date (nullable = true)

+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true       |2006-01-01  |
|33 |true       |1980-01-10  |
|37 |false      |NULL        |
+---+-----------+------------+



- Creates a temporary view `CastExample` from the DataFrame `df3`.
- Uses a SQL query to cast the `age` column to `STRING`, `isGraduated` column to `BOOLEAN`, and `jobStartDate` column to `DATE`.
- Prints the schema and displays the content of the resulting DataFrame `df4`.