
# PySpark Data Generation and Anonymization

This notebook demonstrates how to generate a large data set of random records with PySpark, anonymize it using User Defined Functions (UDFs), and save the anonymized result as CSV.

## Steps Covered:
1. Initialize a Spark session.
2. Generate random data including names, addresses, and dates of birth.
3. Anonymize the generated data.
4. Save the anonymized data to a CSV file.




## Step 1: Initialize Spark Session and Define Helper Functions

We start by initializing a Spark session and defining helper functions to generate random strings and dates. The same cell then builds a DataFrame of 1,000,000 random rows and shows a sample.

- `generate_random_string(length)`: Generates a random string of a specified length.
- `generate_random_date(start_year, end_year)`: Generates a random date within the specified range.



In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType, DateType
import random
import string
from datetime import datetime, timedelta

# Initialize Spark session
spark = (
    SparkSession.builder
    .appName("Generate Large CSV")
    .getOrCreate()
)

# Helper function to generate random strings
def generate_random_string(length=8):
    return ''.join(random.choices(string.ascii_letters, k=length))

# Helper function to generate random dates
def generate_random_date(start_year=1950, end_year=2000):
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year, 12, 31)
    random_date = start_date + timedelta(days=random.randint(0, (end_date - start_date).days))
    return random_date.strftime('%Y-%m-%d')

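# Create a DataFrame with random data.
# Note: the full list of 1,000,000 rows is built in driver memory first.
data = [
    (
        i,
        generate_random_string(10),  # first_name
        generate_random_string(10),  # last_name
        generate_random_string(20),  # address
        generate_random_date(),      # date_of_birth as 'YYYY-MM-DD'
    )
    for i in range(1000000)
]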
columns = ["id", "first_name", "last_name", "address", "date_of_birth"]
df = spark.createDataFrame(data, columns)

# Show a sample of the data
df.show(10)



+---+----------+----------+--------------------+-------------+
| id|first_name| last_name|             address|date_of_birth|
+---+----------+----------+--------------------+-------------+
|  0|dBYjDGoILL|dBYjDGoILL|ZqlMtQOHpFYYWAettINf|   1993-02-22|
|  1|Weyeqbrpja|Weyeqbrpja|UhOdwBAqAIvkZKPKoafd|   1973-05-14|
|  2|vTsnFQrmQt|vTsnFQrmQt|lgLUnlqlIaDyXEOSntQf|   1961-07-03|
|  3|gPHMQOdZMC|gPHMQOdZMC|MASxhpInmJkaOBCCPdXA|   1966-01-22|
|  4|PXLDHXcQGW|PXLDHXcQGW|auCwYcfpfYBTLGKcjysI|   1990-07-28|
|  5|CqhwKsnlbV|CqhwKsnlbV|oahoYjbVkafrhLuwoAHe|   1972-09-13|
|  6|IxPQnXzlon|IxPQnXzlon|bcFYrmetqptkfdEqJpmf|   1968-07-16|
|  7|jhIzSNZcFG|jhIzSNZcFG|kSiYABGkOYvWPbAqzTsB|   1986-04-10|
|  8|rOTyzEHqsQ|rOTyzEHqsQ|NgmxiSSqnlCOavBmebrv|   1973-01-30|
|  9|JlzGShLXCV|JlzGShLXCV|YkdupFBHbdagGRCOIKwL|   1976-11-02|
+---+----------+----------+--------------------+-------------+
only showing top 10 rows
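
The two helpers can be checked without Spark. The sketch below restates them as defined in the cell above and confirms the string length, the alphabet, and the date range. The seed value `42` is an arbitrary assumption, used only to make the run repeatable; the original notebook does not seed the generator.

```python
import random
import string
from datetime import datetime, timedelta

def generate_random_string(length=8):
    # Random ASCII letters, upper and lower case
    return ''.join(random.choices(string.ascii_letters, k=length))

def generate_random_date(start_year=1950, end_year=2000):
    # Uniform pick between Jan 1 of start_year and Dec 31 of end_year
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year, 12, 31)
    offset = random.randint(0, (end_date - start_date).days)
    return (start_date + timedelta(days=offset)).strftime('%Y-%m-%d')

random.seed(42)  # arbitrary seed, assumed here only for repeatability
first_run = (generate_random_string(10), generate_random_date())

random.seed(42)  # same seed again, so the same values come back
second_run = (generate_random_string(10), generate_random_date())

name, dob = first_run
print(len(name), name.isalpha())
print("1950-01-01" <= dob <= "2000-12-31")  # ISO dates compare correctly as strings
print(first_run == second_run)
```

Because the dates are ISO-formatted strings, a plain string comparison is enough to check the range.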




## Step 2: Define UDFs for Anonymization and Apply Them

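In this step, we define a User Defined Function (UDF) that replaces string values with random strings, apply it to the `first_name`, `last_name`, and `address` columns, and save the result as CSV. The code in this step does not modify the `id` and `date_of_birth` columns.

- `anonymize_string(length)`: Generates a random string of the specified length. It never sees the original value; the UDF is called with a literal length (`lit(10)` or `lit(20)`), not with the column being replaced.
- `anonymize_string_udf`: A UDF based on `anonymize_string` to be applied to DataFrame columns.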
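
`anonymize_string` takes only a length, so the same original name maps to a different random string each time. Where repeated values should map to the same token (for example, to keep joins working), a keyed hash is one alternative. The sketch below is not the notebook's method; `pseudonymize` and `SECRET_KEY` are hypothetical names, and the key shown is a placeholder.

```python
import hashlib
import hmac

# Hypothetical placeholder key; a real key would be kept out of the notebook.
SECRET_KEY = b"replace-me"

def pseudonymize(value, length=10, key=SECRET_KEY):
    """Return a stable token of `length` hex characters (at most 64) for `value`."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

token_a = pseudonymize("Alice")
token_b = pseudonymize("Alice")  # same input, same key: same token
token_c = pseudonymize("Bob")

print(token_a == token_b)
print(len(token_a))
```

Unlike `anonymize_string`, a function like this would be applied to the column itself rather than to a literal length, and it is deterministic by construction.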



In [None]:
# Define UDFs for anonymization
def anonymize_string(length=8):
    return ''.join(random.choices(string.ascii_letters, k=length))

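# Mark the UDF as non-deterministic: otherwise Spark may evaluate the two
# identical anonymize_string_udf(lit(10)) calls below only once, giving
# first_name and last_name the same value in every row.
anonymize_string_udf = udf(anonymize_string, StringType()).asNondeterministic()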

# Apply anonymization to specific columns
anonymized_df = (
    df.withColumn("first_name", anonymize_string_udf(lit(10)))
    .withColumn("last_name", anonymize_string_udf(lit(10)))
    .withColumn("address", anonymize_string_udf(lit(20)))
)

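# Save the anonymized DataFrame as CSV. Spark writes a directory with this
# name containing one part file per partition, and fails if the path
# already exists (the default save mode).
anonymized_df.write.csv("anonymized_large_dataset_spark_2.csv", header=True)

# Display the first 10 rows of the anonymized DataFrame.
# This is a separate action, so the random values are generated again and
# will differ from the ones written to disk.
anonymized_df.show(10)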


+---+----------+----------+--------------------+-------------+
| id|first_name| last_name|             address|date_of_birth|
+---+----------+----------+--------------------+-------------+
|  0|mnlfJgnpQR|mnlfJgnpQR|nNqvuqeJQgsJdHNIlJkR|   1998-07-27|
|  1|UFuRByqRCf|UFuRByqRCf|xSCbOtEUgkmWpCgWKdGC|   2000-04-06|
|  2|FABIXDNUsm|FABIXDNUsm|KTtgHBsHzALAeSwNAFbB|   1998-01-15|
|  3|pRZnuugaVb|pRZnuugaVb|fxEVMuhnypbxnkooioPW|   1994-10-29|
|  4|myhXQuSVmO|myhXQuSVmO|DtkGclMLPhrEDNhXtfXa|   1999-05-27|
|  5|bEQPMkGewD|bEQPMkGewD|cjZwQjXYIVGzuUviScwb|   1971-11-15|
|  6|VnZXaKzwzt|VnZXaKzwzt|GkRxWEqaWfLJLjRmGCjd|   1982-05-30|
|  7|qduHvUrUsY|qduHvUrUsY|XcIyvOQkARkeoLAwewDe|   1994-05-15|
|  8|BXtpkPDOzC|BXtpkPDOzC|ykOSWobzsZNSeovweylc|   1970-08-12|
|  9|HVCYCobMBt|HVCYCobMBt|kLiCWfMJWwPQvvSTGdnz|   1993-05-24|
+---+----------+----------+--------------------+-------------+
only showing top 10 rows
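
The code in this step replaces only `first_name`, `last_name`, and `address`; it does not touch `date_of_birth`. If birth dates also need masking, one option is to reduce each date to its year. This is a sketch under that assumption, not part of the original notebook, and `generalize_date` is a hypothetical helper.

```python
from datetime import datetime

def generalize_date(date_str):
    """Keep only the year of a 'YYYY-MM-DD' string; month and day become 01."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    return parsed.replace(month=1, day=1).strftime("%Y-%m-%d")

print(generalize_date("1993-02-22"))  # 1993-01-01
```

Since the notebook stores `date_of_birth` as a string, a helper like this could be wrapped with `udf(generalize_date, StringType())` and applied to that column in the same way as the string UDF.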


