<a href="https://colab.research.google.com/github/amrit6878/Learning-PySpark/blob/main/withColumnFucntion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## withColumn()
 * Adds a new column to a DataFrame, or replaces an existing column that has the same name.
 * Takes a column name and a `Column` expression, and returns a new DataFrame; the original DataFrame is not modified.
```
df.withColumn(colName, col)
```
Most common uses of withColumn():


1. Derive new columns from existing ones using expressions.
2. Transform data in existing columns (e.g., normalization, scaling, text formatting).
3. Create categorical columns using conditions (when / otherwise).
4. Handle missing values by replacing nulls with default values.
5. Extract or transform date and time values.



In [2]:
!pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("colab").getOrCreate()



In [5]:
data = [
    (1, 'Amrit', 70000, 25),
    (2, 'Neeraj', 68000, 28),
    (3, 'Shivam', 80000, 26),
    (4, 'Priya', 75000, 24),
    (5, 'Rohit', 72000, 27),
    (6, 'Anjali', 69000, 23),
    (7, 'Vikram', 82000, 29),
    (8, 'Kiran', 76000, 26),
    (9, 'Suman', 71000, 24),
    (10, 'Ravi', 73000, 25)
]

columns = ['id', 'name', 'salary', 'age']

df = spark.createDataFrame(data, columns)
df.show()

+---+------+------+---+
| id|  name|salary|age|
+---+------+------+---+
|  1| Amrit| 70000| 25|
|  2|Neeraj| 68000| 28|
|  3|Shivam| 80000| 26|
|  4| Priya| 75000| 24|
|  5| Rohit| 72000| 27|
|  6|Anjali| 69000| 23|
|  7|Vikram| 82000| 29|
|  8| Kiran| 76000| 26|
|  9| Suman| 71000| 24|
| 10|  Ravi| 73000| 25|
+---+------+------+---+



In [4]:
df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)



**Add a New Column** - a bonus of 10% of each employee's salary

In [8]:
# Explicit imports avoid shadowing Python built-ins such as sum, min, and max.
from pyspark.sql.functions import col, upper, when
df = df.withColumn("bonus", col("salary") * 0.10)
df.show()

+---+------+------+---+------+
| id|  name|salary|age| bonus|
+---+------+------+---+------+
|  1| Amrit| 70000| 25|7000.0|
|  2|Neeraj| 68000| 28|6800.0|
|  3|Shivam| 80000| 26|8000.0|
|  4| Priya| 75000| 24|7500.0|
|  5| Rohit| 72000| 27|7200.0|
|  6|Anjali| 69000| 23|6900.0|
|  7|Vikram| 82000| 29|8200.0|
|  8| Kiran| 76000| 26|7600.0|
|  9| Suman| 71000| 24|7100.0|
| 10|  Ravi| 73000| 25|7300.0|
+---+------+------+---+------+



**Modify an Existing Column** - Converting Name to Uppercase

In [9]:
df = df.withColumn("name", upper(col("name")))
df.show()

+---+------+------+---+------+
| id|  name|salary|age| bonus|
+---+------+------+---+------+
|  1| AMRIT| 70000| 25|7000.0|
|  2|NEERAJ| 68000| 28|6800.0|
|  3|SHIVAM| 80000| 26|8000.0|
|  4| PRIYA| 75000| 24|7500.0|
|  5| ROHIT| 72000| 27|7200.0|
|  6|ANJALI| 69000| 23|6900.0|
|  7|VIKRAM| 82000| 29|8200.0|
|  8| KIRAN| 76000| 26|7600.0|
|  9| SUMAN| 71000| 24|7100.0|
| 10|  RAVI| 73000| 25|7300.0|
+---+------+------+---+------+



**Conditional Column Creation** - label salaries of 75000 or more as Overpaid, the rest as Good Pay

In [10]:
df = df.withColumn(
    "salary_category",
    when(col("salary") >= 75000, "Overpaid").otherwise("Good Pay"),
)
df.show()

+---+------+------+---+------+---------------+
| id|  name|salary|age| bonus|salary_category|
+---+------+------+---+------+---------------+
|  1| AMRIT| 70000| 25|7000.0|       Good Pay|
|  2|NEERAJ| 68000| 28|6800.0|       Good Pay|
|  3|SHIVAM| 80000| 26|8000.0|       Overpaid|
|  4| PRIYA| 75000| 24|7500.0|       Overpaid|
|  5| ROHIT| 72000| 27|7200.0|       Good Pay|
|  6|ANJALI| 69000| 23|6900.0|       Good Pay|
|  7|VIKRAM| 82000| 29|8200.0|       Overpaid|
|  8| KIRAN| 76000| 26|7600.0|       Overpaid|
|  9| SUMAN| 71000| 24|7100.0|       Good Pay|
| 10|  RAVI| 73000| 25|7300.0|       Good Pay|
+---+------+------+---+------+---------------+



**Data Type Casting** - salary to IntegerType

In [12]:
from pyspark.sql.types import IntegerType
df = df.withColumn("salary", col("salary").cast(IntegerType()))

**Mathematical Transformation** - salary squared for feature engineering

In [13]:
df = df.withColumn("salary_squared", col("salary") ** 2)
df.show()

+---+------+------+---+------+---------------+--------------+
| id|  name|salary|age| bonus|salary_category|salary_squared|
+---+------+------+---+------+---------------+--------------+
|  1| AMRIT| 70000| 25|7000.0|       Good Pay|         4.9E9|
|  2|NEERAJ| 68000| 28|6800.0|       Good Pay|       4.624E9|
|  3|SHIVAM| 80000| 26|8000.0|       Overpaid|         6.4E9|
|  4| PRIYA| 75000| 24|7500.0|       Overpaid|       5.625E9|
|  5| ROHIT| 72000| 27|7200.0|       Good Pay|       5.184E9|
|  6|ANJALI| 69000| 23|6900.0|       Good Pay|       4.761E9|
|  7|VIKRAM| 82000| 29|8200.0|       Overpaid|       6.724E9|
|  8| KIRAN| 76000| 26|7600.0|       Overpaid|       5.776E9|
|  9| SUMAN| 71000| 24|7100.0|       Good Pay|       5.041E9|
| 10|  RAVI| 73000| 25|7300.0|       Good Pay|       5.329E9|
+---+------+------+---+------+---------------+--------------+



✅ Key Takeaways: withColumn() allows you to

1. Add new columns (derived, calculated, or hardcoded)
2. Modify existing columns (like uppercasing or casting types)
3. Create conditional columns (feature engineering)
4. Handle nulls and missing values
5. Apply complex transformations using UDFs
6. Prepare datasets for ML and analytics