<a href="https://colab.research.google.com/github/asupraja3/spark-mlops-lab/blob/main/pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install PySpark
!pip install pyspark



**1. Start the Session and Set Config**
* In most environments (such as Databricks or the `pyspark` shell), a `spark` session object is already available. If it is not, create one:


In [2]:
from pyspark.sql import SparkSession

# Standard way to get or create a SparkSession
spark = SparkSession.builder.appName("Pyspark Practice").getOrCreate()

# Use the legacy (Spark 2.x) date/time parser, which is more lenient about varied input formats
spark.sql("SET spark.sql.legacy.timeParserPolicy=LEGACY")

DataFrame[key: string, value: string]

**2. DataFrame Creation Examples**

In [3]:
# From a Python list
data = [("Alice", 1, "2024-01-01"), ("Bob", 2, "2024-01-02")]
df_data = spark.createDataFrame(data, ["Name", "ID", "Date"])

# From a JSON file
# df_json = spark.read.json("/temp.JSON")

# From a CSV file
# df_csv = spark.read.csv("/temp.csv", header=True, inferSchema=True)

# From a Parquet file
# df_parquet = spark.read.parquet("/temp.pq")

**3. Schema: The Blueprint of Your Data**
* A schema defines the column names and their data types (e.g., String, Integer, Date). Spark can either infer the schema from the data or you can define it explicitly.
* inferSchema=True: Spark makes an extra pass over the data to guess the types. Quick for small datasets or exploration.
* StructType and StructField: Define the schema explicitly. Reliable for production ETL because it avoids misinterpretation, such as reading a column of numbers as strings.



In [5]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. Sample rows: (name, employee_age, city)
sample_data = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "London"),
    ("Charlie", 25, "New York")
]

# 2. Explicitly define the schema (StructType)
defined_schema = StructType([
    StructField("name", StringType(), True),
    StructField("employee_age", IntegerType(), True),
    StructField("city", StringType(), True),
])

# 3. Create the DataFrame using the explicit schema
df = spark.createDataFrame(data=sample_data, schema=defined_schema)

df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- employee_age: integer (nullable = true)
 |-- city: string (nullable = true)

+-------+------------+--------+
|   name|employee_age|    city|
+-------+------------+--------+
|  Alice|          25|New York|
|    Bob|          30|  London|
|Charlie|          25|New York|
+-------+------------+--------+



In [9]:
from pyspark.sql.functions import lit

# a. select (selecting and renaming columns)
df_select = df.select("city", df["name"].alias("emp_name"))
df_select.show(3)

# b. withColumn (adding or modifying a column)
df1 = df.withColumn("city_NY", df.city == "New York")  # Boolean flag
df2 = df1.withColumn("const", lit(100))  # Add a column with a constant value
df2.show()

# c. drop (removing columns)
df3 = df2.drop("const")
df3.show()

+--------+--------+
|    city|emp_name|
+--------+--------+
|New York|   Alice|
|  London|     Bob|
|New York| Charlie|
+--------+--------+

+-------+------------+--------+-------+-----+
|   name|employee_age|    city|city_NY|const|
+-------+------------+--------+-------+-----+
|  Alice|          25|New York|   true|  100|
|    Bob|          30|  London|  false|  100|
|Charlie|          25|New York|   true|  100|
+-------+------------+--------+-------+-----+

+-------+------------+--------+-------+
|   name|employee_age|    city|city_NY|
+-------+------------+--------+-------+
|  Alice|          25|New York|   true|
|    Bob|          30|  London|  false|
|Charlie|          25|New York|   true|
+-------+------------+--------+-------+

