# The Essential PySpark Cheat Sheet for Data Engineers
This notebook is a quick reference for common PySpark transformations, functions, and patterns. It is intended for interview preparation and for day-to-day work with PySpark on Databricks and other Python-based environments.

## Getting Started with PySpark
Before running any transformation, create a `SparkSession`. It is the entry point to the DataFrame and SQL APIs. On Databricks a session named `spark` already exists, and `getOrCreate()` simply returns it.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
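# Note: the wildcard import shadows Python built-ins such as sum, min, max, and round
# with the Spark column functions of the same name.
from pyspark.sql.functions import *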
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("PySpark Application").getOrCreate()

## Reading Data
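Spark reads CSV, JSON, Parquet, and ORC out of the box. Avro typically requires the `spark-avro` package, and Excel requires a third-party connector. Passing an explicit schema fixes the column types up front and avoids the extra pass over the data that `inferSchema=True` needs.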

In [None]:
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("dept", StringType())
])

data = spark.read.csv("filePath", header=True, schema=schema)

## Creating Sample DataFrame
Use this sample for practice and transformation examples.

In [3]:
data = [(1, 'juan', 100), (2, 'pedro', 200), (3, 'andres', 300), (4, 'carlos', 400)]
schema = ['id', 'name', 'salary']
df = spark.createDataFrame(data, schema)
df.show()

+---+------+------+
| id|  name|salary|
+---+------+------+
|  1|  juan|   100|
|  2| pedro|   200|
|  3|andres|   300|
|  4|carlos|   400|
+---+------+------+



## Window Functions
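Example: cumulative sum and average of `salary`, ordered by `id`. When a window has an `orderBy` but no explicit frame, the default frame spans from the start of the partition through the current row (including rows tied on the ordering column). Because this window has no `partitionBy`, Spark moves all rows into a single partition, which is fine for a small sample but slow on large data.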

In [4]:
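# No partitionBy: the whole DataFrame is treated as one window partition.
window_spec = Window.orderBy("id")
# sum here is pyspark.sql.functions.sum (from the wildcard import), not Python's built-in.
df.withColumn("cumulative_sum", sum("salary").over(window_spec)).show()
df.withColumn("cumulative_avg", avg("salary").over(window_spec)).show()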

+---+------+------+--------------+
| id|  name|salary|cumulative_sum|
+---+------+------+--------------+
|  1|  juan|   100|           100|
|  2| pedro|   200|           300|
|  3|andres|   300|           600|
|  4|carlos|   400|          1000|
+---+------+------+--------------+

+---+------+------+--------------+
| id|  name|salary|cumulative_avg|
+---+------+------+--------------+
|  1|  juan|   100|         100.0|
|  2| pedro|   200|         150.0|
|  3|andres|   300|         200.0|
|  4|carlos|   400|         250.0|
+---+------+------+--------------+



## Joining DataFrames
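Join employees to departments on the shared `dept_id` column. Passing the column name as a string keeps a single `dept_id` column in the result instead of one from each side.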

In [5]:
emp_data = [(1, 'juan', 100, 1), (2, 'pedro', 200, 2), (3, 'carlos', 300, 3), (4, 'andres', 400, 4)]
emp_schema = ['id', 'name', 'salary', 'dept_id']
employee = spark.createDataFrame(emp_data, emp_schema)

dept_data = [(1, 'HR'), (2, 'Sales'), (3, 'DA'), (4, 'IT')]
dept_schema = ['dept_id', 'department']
department = spark.createDataFrame(dept_data, dept_schema)

joined_df = employee.join(department, "dept_id", "inner")
joined_df.select('id', 'name', 'salary', 'department').show()

+---+------+------+----------+
| id|  name|salary|department|
+---+------+------+----------+
|  1|  juan|   100|        HR|
|  2| pedro|   200|     Sales|
|  3|carlos|   300|        DA|
|  4|andres|   400|        IT|
+---+------+------+----------+

