- randomSplit() is a built-in PySpark DataFrame function.
- It splits a DataFrame into multiple smaller DataFrames randomly.
- Typically used to split data into training, validation, and test datasets.

- Syntax:
    DataFrame.randomSplit(weights, seed=None)
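    - weights: List of proportions, one per output DataFrame (e.g., [0.7, 0.3] for a roughly 70/30 split). Weights that do not sum to 1 are normalized.
    - seed: (optional) Random seed for reproducible results.
    - With a fixed seed, the same rows go into train/test on every run, provided the input DataFrame's data and partitioning are unchanged. Split sizes are approximate, not exact.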
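
To make the roles of weights and seed concrete, here is a minimal pure-Python sketch of the idea: normalize the weights, turn them into cumulative boundaries on [0, 1), and place each row by one seeded uniform draw. The helper random_split_sketch is hypothetical and only illustrates the concept; it is not Spark's implementation and will not reproduce the exact rows Spark selects for a given seed.

```python
import random
from itertools import accumulate


def random_split_sketch(rows, weights, seed=None):
    """Conceptual stand-in for DataFrame.randomSplit(); not Spark's actual code."""
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Normalize so the weights sum to 1, e.g. [3, 1, 1] -> [0.6, 0.2, 0.2].
    total = sum(weights)
    normalized = [w / total for w in weights]

    # Cumulative upper bounds carve [0, 1) into one interval per split.
    bounds = list(accumulate(normalized))
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # one uniform draw in [0, 1) per row
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
    return splits


names = ["Souvik", "Soukarjya", "Sandip", "Prodipta", "RamaSai", "Riya", "Padma"]
train, validation, test = random_split_sketch(names, [3, 1, 1], seed=42)
print(len(train), len(validation), len(test))
```

Because every row is placed by its own random draw, the split sizes only approximate the weights, and calling the helper again with the same seed and the same input order yields the same splits.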

In [3]:
data = [
    ("Souvik", 72, "Math"),
    ("Soukarjya", 32, "Chemistry"),
    ("Sandip", 74, "Math"),
    ("Prodipta", 76, "Data Analyst"),
    ("RamaSai", 69, "System Engineer"),
    ("Riya", 78, "Oracle Developer"),
    ("Padma", 46, "Data Analyst")
]

columns = ["name", "score", "Subject"]

df = spark.createDataFrame(data, columns)
df.show()

+---------+-----+----------------+
|     name|score|         Subject|
+---------+-----+----------------+
|   Souvik|   72|            Math|
|Soukarjya|   32|       Chemistry|
|   Sandip|   74|            Math|
| Prodipta|   76|    Data Analyst|
|  RamaSai|   69| System Engineer|
|     Riya|   78|Oracle Developer|
|    Padma|   46|    Data Analyst|
+---------+-----+----------------+



In [4]:
# Use randomSplit() to split the DataFrame
# Weights of 0.7 and 0.3 target a roughly 70/30 train/test split; actual sizes can differ
# The seed makes the split reproducible (same split on every run over the same data)
train_df, test_df = df.randomSplit([0.7, 0.3], seed=42)

print("Training Dataset: ")
train_df.show()

print("Test Dataset: ")
test_df.show()


Training Dataset: 

+---------+-----+----------------+
|     name|score|         Subject|
+---------+-----+----------------+
|   Souvik|   72|            Math|
|Soukarjya|   32|       Chemistry|
|     Riya|   78|Oracle Developer|
+---------+-----+----------------+

Test Dataset: 

+--------+-----+---------------+
|    name|score|        Subject|
+--------+-----+---------------+
|  Sandip|   74|           Math|
|Prodipta|   76|   Data Analyst|
| RamaSai|   69|System Engineer|
|   Padma|   46|   Data Analyst|
+--------+-----+---------------+
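
The output above has 3 training rows and 4 test rows, not the 5/2 that a 70/30 split of 7 rows might suggest. A rough way to see why this is plausible: assuming each row is assigned to the training split independently with probability 0.7 (a simplifying assumption about how the split behaves, not a statement of Spark's internals), the training size follows a binomial distribution. The variable names below are illustrative only.

```python
from math import comb

n_rows = 7      # rows in the example DataFrame
p_train = 0.7   # normalized weight of the training split

# Probability that exactly k of the n_rows land in the training split,
# assuming each row is assigned independently with probability p_train.
train_size_probs = [
    comb(n_rows, k) * p_train**k * (1 - p_train) ** (n_rows - k)
    for k in range(n_rows + 1)
]

for k, prob in enumerate(train_size_probs):
    print(f"{k} train rows / {n_rows - k} test rows: {prob:.3f}")
```

Under that assumption, a 3/4 outcome has a probability of roughly 10%, so it is unremarkable on a 7-row DataFrame. On larger DataFrames the observed proportions settle much closer to the weights.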
