<a href="https://colab.research.google.com/github/codeprakash309/PySparkCodeHub/blob/main/PySpark_transformations_2025_07_31.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand, expr, when

In [2]:

# Create Spark session
spark = SparkSession.builder.appName("Sample50Records").getOrCreate()

In [3]:
# Create a DataFrame with 50 rows
df = (
    spark.range(0, 50)  # 50 records
    .withColumn("user_id", (col("id") + 1000).cast("int"))
    .withColumn("username", expr("concat('user_', id)"))
    .withColumn("age", (rand() * 30 + 20).cast("int"))  # Age from 20 to 49 (the int cast truncates)
    .withColumn("balance", (rand() * 100000).cast("double"))  # Random balance in [0, 100000)
    .withColumn("is_active", when((col("id") % 3) == 0, True).otherwise(False))
    .withColumn("signup_date", expr("current_date() - int(rand() * 180)"))  # Signup within 6 months
    .drop("id")
)
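
The age range follows from two facts: `rand()` returns values in [0, 1), and casting a double to `int` truncates. The plain-Python sketch below mirrors that arithmetic, using `random.random()` as a stand-in for Spark's `rand()` (an assumption made only for illustration; the helper name `age_from_uniform` is hypothetical). The sampled ages stay within 20 to 49, so 50 does not appear.

```python
import random


def age_from_uniform(r: float) -> int:
    """Mirror (rand() * 30 + 20).cast("int") for one value r in [0, 1)."""
    return int(r * 30 + 20)  # int() truncates toward zero, like the Spark cast


random.seed(0)  # fixed seed so the sketch is repeatable

# Collect the distinct ages produced by many uniform draws
sampled_ages = {age_from_uniform(random.random()) for _ in range(10_000)}

print(min(sampled_ages), max(sampled_ages))
```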

In [4]:
# Show the schema and all 50 records
df.printSchema()
df.show(50, truncate=False)

root
 |-- user_id: integer (nullable = false)
 |-- username: string (nullable = false)
 |-- age: integer (nullable = true)
 |-- balance: double (nullable = false)
 |-- is_active: boolean (nullable = false)
 |-- signup_date: date (nullable = true)

+-------+--------+---+------------------+---------+-----------+
|user_id|username|age|balance           |is_active|signup_date|
+-------+--------+---+------------------+---------+-----------+
|1000   |user_0  |47 |91003.66015112009 |true     |2025-07-25 |
|1001   |user_1  |23 |44213.78620974649 |false    |2025-06-26 |
|1002   |user_2  |29 |77729.53566998956 |false    |2025-02-28 |
|1003   |user_3  |46 |86894.68632900532 |true     |2025-07-01 |
|1004   |user_4  |29 |81846.7346088069  |false    |2025-05-09 |
|1005   |user_5  |36 |3995.35008916293  |false    |2025-02-06 |
|1006   |user_6  |46 |37051.31249392974 |true     |2025-03-08 |
|1007   |user_7  |47 |77137.27854320833 |false    |2025-05-06 |
|1008   |user_8  |48 |40847.00404732502 |false  

**Filter Rows**

In [5]:
# Get only active users
df_active = df.filter(col("is_active"))  # is_active is already boolean, so no comparison is needed


The code below displays only the active users.

In [6]:
df_active.show()

+-------+--------+---+------------------+---------+-----------+
|user_id|username|age|           balance|is_active|signup_date|
+-------+--------+---+------------------+---------+-----------+
|   1000|  user_0| 47| 91003.66015112009|     true| 2025-07-25|
|   1003|  user_3| 46| 86894.68632900532|     true| 2025-07-01|
|   1006|  user_6| 46| 37051.31249392974|     true| 2025-03-08|
|   1009|  user_9| 47| 48141.54507396665|     true| 2025-03-29|
|   1012| user_12| 34| 66527.08959005556|     true| 2025-06-23|
|   1015| user_15| 37|6896.2601443522735|     true| 2025-03-03|
|   1018| user_18| 22| 8892.381998371535|     true| 2025-05-20|
|   1021| user_21| 32| 88287.07769500864|     true| 2025-04-28|
|   1024| user_24| 40|28481.720311661207|     true| 2025-03-02|
|   1027| user_27| 44| 93699.87913120167|     true| 2025-06-03|
|   1030| user_30| 37| 90057.22272489661|     true| 2025-04-09|
|   1033| user_33| 33|  37029.7336601188|     true| 2025-03-28|
|   1036| user_36| 31|   25080.712072099

**Select Specific Columns from the DataFrame**

In [9]:
# Select only user_id and balance
df_selected = df.select("user_id", "balance")
# Show the first 10 records, with only the selected columns
df_selected.show(10, truncate=False)

+-------+-----------------+
|user_id|balance          |
+-------+-----------------+
|1000   |91003.66015112009|
|1001   |44213.78620974649|
|1002   |77729.53566998956|
|1003   |86894.68632900532|
|1004   |81846.7346088069 |
|1005   |3995.35008916293 |
|1006   |37051.31249392974|
|1007   |77137.27854320833|
|1008   |40847.00404732502|
|1009   |48141.54507396665|
+-------+-----------------+
only showing top 10 rows



**Add a New Column:**

In [10]:
# Add a new column: balance_category
df_with_category = df.withColumn(
    "balance_category",
    when(col("balance") < 30000, "Low")
    .when(col("balance") < 70000, "Medium")
    .otherwise("High")
)


In [11]:
df_with_category.show(10, truncate=False)

+-------+--------+---+-----------------+---------+-----------+----------------+
|user_id|username|age|balance          |is_active|signup_date|balance_category|
+-------+--------+---+-----------------+---------+-----------+----------------+
|1000   |user_0  |47 |91003.66015112009|true     |2025-07-25 |High            |
|1001   |user_1  |23 |44213.78620974649|false    |2025-06-26 |Medium          |
|1002   |user_2  |29 |77729.53566998956|false    |2025-02-28 |High            |
|1003   |user_3  |46 |86894.68632900532|true     |2025-07-01 |High            |
|1004   |user_4  |29 |81846.7346088069 |false    |2025-05-09 |High            |
|1005   |user_5  |36 |3995.35008916293 |false    |2025-02-06 |Low             |
|1006   |user_6  |46 |37051.31249392974|true     |2025-03-08 |Medium          |
|1007   |user_7  |47 |77137.27854320833|false    |2025-05-06 |High            |
|1008   |user_8  |48 |40847.00404732502|false    |2025-04-12 |Medium          |
|1009   |user_9  |47 |48141.54507396665|