## PySpark `.toDF()` — When to Use

**What it is**
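- A convenience method that either:
  - **Converts an RDD → DataFrame** while assigning column names (`rdd.toDF(["a", "b"])`, names passed as a list), or
  - **Renames all columns** of an existing DataFrame in one call (`df.toDF("a", "b")`, names passed as separate arguments).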

---

### ✅ Use Case 1 — RDD → DataFrame (set column names)
    rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])
    df  = rdd.toDF(["id", "name"])
    df.show()
**Why:** You started with an **RDD** and want to switch to the **DataFrame API** with readable column names.

---

### ✅ Use Case 2 — Rename *all* columns of an existing DataFrame
    data = [(1, "Alice"), (2, "Bob")]
    df   = spark.createDataFrame(data, ["c1", "c2"])
    df2  = df.toDF("id", "name")   # renames ALL columns
    df2.show()
**Why:** You already have a DataFrame and want to **replace every column name** at once. The number of names must equal the number of existing columns; otherwise Spark raises an error.

In [9]:

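# Explicit imports avoid shadowing Python builtins (sum, max, ...) with Spark functions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, monotonically_increasing_id, to_date, when
from pyspark.sql.types import IntegerType
from lib.logger import Log4J  # project-local logging helper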

spark = SparkSession.builder \
        .master("local[3]") \
        .appName("MiscDemo") \
        .getOrCreate()


logger = Log4J(spark)

data_list = [("Ravi", "28", "1", "2002"),
                ("Abdul", "23", "5", "81"),  # 1981
                ("John", "12", "12", "6"),  # 2006
                ("Rosy", "7", "8", "63"),  # 1963
                ("Abdul", "23", "5", "81")]  # 1981 

raw_df = spark.createDataFrame(data_list).toDF("name","day","month","year").repartition(3)
raw_df.printSchema()

root
 |-- name: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- year: string (nullable = true)



### Spark — Short & Simple

**`monotonically_increasing_id()`**
- Gives each row a **unique 64-bit ID** (a `long`).
- IDs are **monotonically increasing but not consecutive**: each partition starts from its own base value.
- Expect **gaps**, and **different values** if the partitioning or row order changes between runs.
- Use for **surrogate keys / uniqueness**, not for strict ordering or row counting.
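
The large jumps in the IDs shown in the cell output further down (`0`, `1`, `8589934592`, `17179869184`, ...) follow from the layout the Spark documentation describes for the current implementation: partition index in the upper 31 bits, row position within the partition in the lower 33 bits. The sketch below is plain Python, not Spark code, and the helper names `sketch_id` and `split_id` are made up for illustration.

```python
RECORD_BITS = 33  # lower 33 bits hold the row position inside a partition


def sketch_id(partition_index, row_in_partition):
    """Hypothetical helper: compose an ID following the documented bit layout."""
    return (partition_index << RECORD_BITS) + row_in_partition


def split_id(row_id):
    """Hypothetical helper: recover (partition_index, row_in_partition) from an ID."""
    return row_id >> RECORD_BITS, row_id & ((1 << RECORD_BITS) - 1)


# IDs taken from the notebook output for a DataFrame with 3 partitions.
for row_id in (0, 1, 8589934592, 17179869184, 17179869185):
    print(row_id, split_id(row_id))
```

Read this way, `8589934592` is simply the first row of partition 1 and `17179869185` the second row of partition 2, which is why the values depend on how the data happens to be partitioned.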

**`withColumn(colName, colExpr)`**
- **Creates or replaces** a column.
- First arg = **target column name**; second arg = **expression** that computes its values (a Spark Column expression).
- Works **row-wise**, lazily; returns a **new DataFrame** (original unchanged).
- Use for deriving/cleaning values; **not** for renaming (use `withColumnRenamed`).

In [11]:
df1 = raw_df.withColumn("id", monotonically_increasing_id())
df1.show()

+-----+---+-----+----+-----------+
| name|day|month|year|         id|
+-----+---+-----+----+-----------+
| Ravi| 28|    1|2002|          0|
|Abdul| 23|    5|  81|          1|
|Abdul| 23|    5|  81| 8589934592|
| John| 12|   12|   6|17179869184|
| Rosy|  7|    8|  63|17179869185|
+-----+---+-----+----+-----------+



                                                                                

### Spark — `expr()` & `.cast()` (short & simple)

- **`expr("...")`**: Parses a **SQL expression string**, so you can write SQL logic inside the DataFrame API.  
  Returns a **Column** you can use in `select`, `withColumn`, `orderBy`, etc.  
  *Think:* “SQL logic here.” Examples: `CASE WHEN`, `to_date(...)`, `colA + 1`.

- **`.cast(type)`**: Change a column’s **data type**.  
  Accepts `"int"`, `"double"`, `"string"`, `"date"`, `"timestamp"`, or Spark types like `IntegerType()`.  
  Non-convertible values → **NULL**, unless ANSI mode (`spark.sql.ansi.enabled`) is on, in which case the cast raises an error.  
  *Think:* “Make this column int/double/string now.”

In [15]:
df2 = df1.withColumn("year",expr("""
        case when year < 21 then year + 2000
        when year < 100 then year + 1900
        else year 
        end """).cast(IntegerType()))
df2.show()
df2.printSchema()

+-----+---+-----+----+-----------+
| name|day|month|year|         id|
+-----+---+-----+----+-----------+
| Ravi| 28|    1|2002|          0|
|Abdul| 23|    5|1981|          1|
|Abdul| 23|    5|1981| 8589934592|
| John| 12|   12|2006|17179869184|
| Rosy|  7|    8|1963|17179869185|
+-----+---+-----+----+-----------+

root
 |-- name: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- id: long (nullable = false)



### Spark — Set Data Types **Before** Transformations (short & simple)

- **Define schema on read:** Provide a `StructType` via `.schema(...)` when loading CSV/JSON; avoid relying on `inferSchema`. Parquet/ORC already carry types, but still normalize if needed.
- **Normalize early (cast once):** Right after load, convert columns to their **final types** (numbers, booleans, `date`/`timestamp` via `to_date`/`to_timestamp`). Do it **once** so you don’t keep re-casting later.
- **Centralize the typing step:** Group all casts/parsing in one place (immediately post-ingest) to keep downstream transforms clean and consistent.


In [20]:
df1.show()
df1.printSchema()
df5 = df1.withColumn("day",col("day").cast(IntegerType())) \
        .withColumn("month",col("month").cast(IntegerType())) \
        .withColumn("year",col("year").cast(IntegerType())) 

df6 = df5.withColumn("year",expr("""
            case when year < 21 then year + 2000 
            when year < 100 then year + 1900 
            else year 
            end """))
df6.show()
df6.printSchema()  # year stays integer: no extra cast needed after the CASE expression

+-----+---+-----+----+-----------+
| name|day|month|year|         id|
+-----+---+-----+----+-----------+
| Ravi| 28|    1|2002|          0|
|Abdul| 23|    5|  81|          1|
|Abdul| 23|    5|  81| 8589934592|
| John| 12|   12|   6|17179869184|
| Rosy|  7|    8|  63|17179869185|
+-----+---+-----+----+-----------+

root
 |-- name: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- year: string (nullable = true)
 |-- id: long (nullable = false)

+-----+---+-----+----+-----------+
| name|day|month|year|         id|
+-----+---+-----+----+-----------+
| Ravi| 28|    1|2002|          0|
|Abdul| 23|    5|1981|          1|
|Abdul| 23|    5|1981| 8589934592|
| John| 12|   12|2006|17179869184|
| Rosy|  7|    8|1963|17179869185|
+-----+---+-----+----+-----------+

root
 |-- name: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- id: long (nullable = false)


### Spark — `when()`, `otherwise()`, and `col()` (short & simple)

- **`when(condition, value)`**  
  Used for conditional logic (like SQL `CASE WHEN`).  
  Takes two args:  
  1. A condition (built with `col(...)`, comparisons, etc.).  
  2. The value to assign if the condition is true.

- **`otherwise(value)`**  
  Defines the fallback value if **none of the `when()` conditions match**; without it, unmatched rows get `NULL`.  
  Works like SQL `ELSE`.

- **`col("colName")`**  
  Tells Spark you are referencing a **column** (not a string literal).  
  Needed inside expressions to point to a DataFrame field.

---

**Mental model**  
- Use `col("x")` to reference a column.  
- Chain `when(...).when(...).otherwise(...)` to build conditional columns.  
- It’s the non-SQL way of writing `CASE WHEN ... ELSE ... END`.  
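
A `when` chain is evaluated top to bottom and the first matching branch wins, just like `if` / `elif` / `else`. As a rough plain-Python sketch (no Spark involved; the function name `normalize_year` is made up for illustration), the two-digit-year rule used in this notebook behaves like this:

```python
def normalize_year(year):
    """Hypothetical plain-Python mirror of when(...).when(...).otherwise(...)."""
    if year < 21:
        # first when(): two-digit years 0..20 are read as 2000..2020
        return year + 2000
    if year < 100:
        # second when(): only reached when the first condition did not match
        return year + 1900
    # otherwise(): the value is already a four-digit year
    return year


# Year values from the notebook's sample data.
for y in (2002, 81, 6, 63):
    print(y, "->", normalize_year(y))
```

The order of the branches matters: swapping the two conditions would send `6` into the `+ 1900` branch, because `6 < 100` is also true.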

In [22]:
# Start from df5 (typed, years not yet corrected) so the when() chain does the conversion itself.
df7 = df5.withColumn("year",
            when(col("year") < 21, col("year") + 2000)
            .when(col("year") < 100, col("year") + 1900)
            .otherwise(col("year"))
            )
df7.show()

+-----+---+-----+----+-----------+
| name|day|month|year|         id|
+-----+---+-----+----+-----------+
| Ravi| 28|    1|2002|          0|
|Abdul| 23|    5|1981|          1|
|Abdul| 23|    5|1981| 8589934592|
| John| 12|   12|2006|17179869184|
| Rosy|  7|    8|1963|17179869185|
+-----+---+-----+----+-----------+





### Spark — `to_date()` with `expr()` (short & simple)

- **Purpose:** Convert a string column into a proper **date** type.  
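- **Arguments (2):**  
  1. The **string expression/column** that holds the date text.  
  2. The **format pattern** that tells Spark how to interpret that text. It is optional; without it Spark falls back to its default casting rules, which handle ISO-style text such as `1995-08-15`.  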

- **Format examples:**  
  - `"dd/MM/yyyy"` → `15/08/1995`  
  - `"MM-dd-yyyy"` → `08-15-1995`  
  - `"yyyyMMdd"`   → `19950815`  

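- **Why format is needed:** Raw strings can look different (`"15/08/1995"` vs `"1995-08-15"`).  
  The format tells Spark **what each part (day, month, year) means** so parsing is correct.  
  Single-letter patterns such as `d/M/y` accept values without zero-padding (`7/8/1963`), which is why the cells below use them.  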

- **Mental model:**  
  `to_date("string_column", "format")` → “Read this text as a date using this format.”  
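
To see why the format matters, here is a small plain-Python analogue using `datetime.strptime`. This is not Spark code: Python writes its directives as `%d`, `%m`, `%Y` where Spark uses the pattern letters `dd`, `MM`, `yyyy`, but the idea of "text plus format gives a date" is the same.

```python
from datetime import datetime

# The three sample strings from the format list above, with Python directives
# standing in for Spark's pattern letters.
samples = [
    ("15/08/1995", "%d/%m/%Y"),
    ("08-15-1995", "%m-%d-%Y"),
    ("19950815", "%Y%m%d"),
]
parsed = [datetime.strptime(text, fmt).date() for text, fmt in samples]
print(parsed)  # all three strings describe the same calendar day

# The same text read with two different formats yields two different dates.
ambiguous = "05/08/1995"
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
print(day_first, month_first)
```

The second part is the practical risk: a string like `05/08/1995` parses without error under either format, so a wrong pattern can silently produce wrong dates rather than a failure.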

In [24]:
df8 = df7.withColumn("dob",expr("to_date(concat(day,'/',month,'/',year),'d/M/y')")) 
df8.show()

+-----+---+-----+----+-----------+----------+
| name|day|month|year|         id|       dob|
+-----+---+-----+----+-----------+----------+
| Ravi| 28|    1|2002|          0|2002-01-28|
|Abdul| 23|    5|1981|          1|1981-05-23|
|Abdul| 23|    5|1981| 8589934592|1981-05-23|
| John| 12|   12|2006|17179869184|2006-12-12|
| Rosy|  7|    8|1963|17179869185|1963-08-07|
+-----+---+-----+----+-----------+----------+



In [25]:
df9 = df7.withColumn("dob",to_date(expr("concat(day,'/',month,'/',year)"),'d/M/y'))
df9.show()

+-----+---+-----+----+-----------+----------+
| name|day|month|year|         id|       dob|
+-----+---+-----+----+-----------+----------+
| Ravi| 28|    1|2002|          0|2002-01-28|
|Abdul| 23|    5|1981|          1|1981-05-23|
|Abdul| 23|    5|1981| 8589934592|1981-05-23|
| John| 12|   12|2006|17179869184|2006-12-12|
| Rosy|  7|    8|1963|17179869185|1963-08-07|
+-----+---+-----+----+-----------+----------+



### Spark — Dropping Columns & Duplicates (short & simple)

- **`.drop("colName", "colName2", …)`**  
  Removes the given columns from the DataFrame.  
  *Think:* “Delete these columns from my table.”

- **`.dropDuplicates(["col1", "col2", …])`**  
  Removes rows where all the listed columns have the same values.  
  Keeps **one row** per combination and drops the rest; on a distributed DataFrame, which duplicate survives is not guaranteed.  
  *Think:* “Only keep unique rows based on these columns.”  

- **Mental model:**  
  - `.drop()` → remove **unwanted columns**.  
  - `.dropDuplicates()` → remove **duplicate rows** based on column values.  
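
The idea behind deduplicating on a subset of columns can be sketched in plain Python (no Spark involved). The helper name `drop_duplicates` is made up for illustration, and the rows are taken from this notebook's output. Note one deliberate simplification: a Python list has a well-defined order, so "keep the first one seen" is meaningful here, whereas Spark does not promise which duplicate it keeps.

```python
rows = [
    {"name": "Ravi", "id": 0, "dob": "2002-01-28"},
    {"name": "Abdul", "id": 1, "dob": "1981-05-23"},
    {"name": "Abdul", "id": 8589934592, "dob": "1981-05-23"},
    {"name": "John", "id": 17179869184, "dob": "2006-12-12"},
    {"name": "Rosy", "id": 17179869185, "dob": "1963-08-07"},
]


def drop_duplicates(rows, keys):
    """Hypothetical helper: keep one row per distinct combination of `keys`."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:  # first time this combination appears
            seen.add(key)
            kept.append(row)
    return kept


unique_rows = drop_duplicates(rows, ["name", "dob"])
for row in unique_rows:
    print(row)
```

Only `name` and `dob` are compared, so the two Abdul rows count as duplicates even though their `id` values differ.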


In [28]:
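# Caution: expr("dob desc") is parsed as the column `dob` aliased to "desc",
# so this sorts ASCENDING (as the output below shows).
# For a descending sort, use .sort(col("dob").desc()) instead.
df9 = df7.withColumn("dob",to_date(expr("concat(day,'/',month,'/',year)"),'d/M/y')) \
        .drop("day","month","year") \
        .dropDuplicates(["name","dob"]) \
        .sort(expr("dob desc"))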
df9.show()

+-----+-----------+----------+
| name|         id|       dob|
+-----+-----------+----------+
| Rosy|17179869185|1963-08-07|
|Abdul|          1|1981-05-23|
| Ravi|          0|2002-01-28|
| John|17179869184|2006-12-12|
+-----+-----------+----------+

