1) zip()
2) zipWithIndex()
3) zipWithUniqueId()

**zip**

- Python zip() is a **built-in** function that takes **zero or more iterable objects** as **arguments** (e.g. **lists, tuples, or sets**) and **combines** them element-wise into an iterator of **tuples**.

- zip() is primarily used for **combining two or more sequences element-wise**. It creates **tuples**, with **each tuple** containing the values from corresponding positions in the inputs.

- When creating a PySpark DataFrame from **multiple lists**, ensure that the **lists are aligned correctly**. Each list represents a **column**, and their **lengths should be the same**: zip() stops at the shortest list without raising an error, so a shorter list silently drops rows.

**Syntax**

     # Syntax of zip() function
     zip(iterable1, iterable2, ...)

**Parameters:**
- It takes **iterable** objects as its **arguments**.

**Return value:**
- It returns a **zip object**, which is an **iterator** of **tuples**.
- If **no argument** is passed into **zip()**, it returns an **empty iterator**.
- If we pass **one argument**, it returns an **iterator of tuples** where **each tuple** has a **single element**.
- If we pass **two or more iterables**, it returns an **iterator of tuples** where **each tuple** contains one element from **each of the passed iterables**.
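
Because the return value is an iterator rather than a list, it can be consumed only once. A minimal sketch of that behaviour follows; the variable names are illustrative and do not come from the cells below.

```python
# Illustrative data; the names are hypothetical.
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

pairs = zip(letters, numbers)   # nothing is computed yet; zip is lazy

first_pass = list(pairs)        # consumes the iterator
second_pass = list(pairs)       # the iterator is now exhausted

print(first_pass)   # [('a', 1), ('b', 2), ('c', 3)]
print(second_pass)  # []
```

If the pairs are needed more than once, storing `list(zip(...))` in a variable, as the later cells do, avoids the empty second pass.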

**1) zip() Function without Arguments**

In [0]:
# Initialize two lists (not passed to zip() in this example)
subjects1 = ["Java","Python","PHP"]
subjects2 = ['C#','CPP','C']

# zip() function without arguments
final = zip()
print(list(final))

[]


**2) zip() with Single Iterable as Argument**

In [0]:
subjects1 = ["Java", "Python", "PHP"]
# Passing single iterable into zip()
final = zip(subjects1)
print(list(final))

[('Java',), ('Python',), ('PHP',)]


**3) zip() with Two Iterables as Arguments**

In [0]:
# Initialize two lists
subjects1 = ["Java", "Python", "PHP"]
subjects2 = ['C#','CPP','C']
print("List1 :", subjects1)
print("List2 :", subjects2)

# Zip two lists
final = zip(subjects1, subjects2)
print("\nZip Lists :", list(final))

List1 : ['Java', 'Python', 'PHP']
List2 : ['C#', 'CPP', 'C']

Zip Lists : [('Java', 'C#'), ('Python', 'CPP'), ('PHP', 'C')]


In [0]:
list1 = ['Adarsh', 'Bibin', 'Chetan', 'Damini', 'Kennedy']
list2 = [30, 25, 35, 28, 29]
list3 = ['India', 'SriLanka', 'Nepal', 'US', 'UK']

rows = list(zip(list1, list2, list3))
rows

[('Adarsh', 30, 'India'),
 ('Bibin', 25, 'SriLanka'),
 ('Chetan', 35, 'Nepal'),
 ('Damini', 28, 'US'),
 ('Kennedy', 29, 'UK')]

- **zip()** takes corresponding **elements from each list** and groups them into **tuples**.
- **list(zip(...))** converts the **zipped** result into a **list of those tuples**.

      [
       ('Adarsh', 30, 'India'),
       ('Bibin', 25, 'SriLanka'),
       ('Chetan', 35, 'Nepal'),
       ('Damini', 28, 'US'),
       ('Kennedy', 29, 'UK')
      ]
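
Since zip() stops at the shortest input without raising an error, a column list that is one element short would silently drop a row. One way to guard against that is to compare the lengths before zipping. The sketch below uses a hypothetical helper, `build_rows`, which is not part of Python or PySpark.

```python
def build_rows(*columns):
    """Zip column lists into row tuples, refusing mismatched lengths.

    Hypothetical helper for illustration only.
    """
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))


names = ['Adarsh', 'Bibin', 'Chetan']
ages = [30, 25, 35]
print(build_rows(names, ages))
# [('Adarsh', 30), ('Bibin', 25), ('Chetan', 35)]

try:
    build_rows(names, ages[:2])   # one age is missing
except ValueError as err:
    print("rejected:", err)
# rejected: column lengths differ: [2, 3]
```

On Python 3.10 or later, `zip(..., strict=True)` performs a comparable check natively and raises `ValueError` when the inputs differ in length.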


In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("Name", StringType(), True),
                     StructField("Age", IntegerType(), True),
                     StructField("City", StringType(), True)])

df = spark.createDataFrame(rows, schema)
display(df)

Name,Age,City
Adarsh,30,India
Bibin,25,SriLanka
Chetan,35,Nepal
Damini,28,US
Kennedy,29,UK


**4) Pass Multiple Iterables into Python zip()**

In [0]:
# Initialize multiple lists
subjects1 = ["Java", "Python", "PHP"]
subjects2 = ['C#','CPP','C']
subjects3 = ['.net','pyspark','scala']
print("List1 :", subjects1)
print("List2 :", subjects2)
print("list3 :", subjects3)

# Zip multiple lists
final = zip(subjects1, subjects2, subjects3)
print("\nZip Lists :", list(final))

List1 : ['Java', 'Python', 'PHP']
List2 : ['C#', 'CPP', 'C']
list3 : ['.net', 'pyspark', 'scala']

Zip Lists : [('Java', 'C#', '.net'), ('Python', 'CPP', 'pyspark'), ('PHP', 'C', 'scala')]


**5) Pass Unequal Lengths of Iterables**
- When we pass iterables of **unequal length** into **zip()**, it returns an iterator of **tuples** whose **length** matches the **shortest iterable**; zip() stops as soon as any input is exhausted.
- Here, **“html”** is the **extra element** in the longer list, so it is dropped from the result.

In [0]:
# Initialize lists of unequal length
subjects1 = ["Java", "Python", "PHP", "html"]
subjects2 = ['C#','CPP','C']
print("List1 :", subjects1)
print("List2 :", subjects2)

# Zip the unequal lists
final = zip(subjects1, subjects2)
print("zip unequal lists:", list(final))

List1 : ['Java', 'Python', 'PHP', 'html']
List2 : ['C#', 'CPP', 'C']
zip unequal lists: [('Java', 'C#'), ('Python', 'CPP'), ('PHP', 'C')]


In [0]:
# Traverse both lists in parallel
print("List1:", subjects1)
print("list2:", subjects2)

for i, j in zip(subjects1, subjects2):
    print(i," ",j)

List1: ['Java', 'Python', 'PHP', 'html']
list2: ['C#', 'CPP', 'C']
Java   C#
Python   CPP
PHP   C
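
When the extra elements should be kept rather than dropped, the standard library's `itertools.zip_longest` pads the shorter inputs instead of truncating. A short sketch reusing the lists above; the fill value `"N/A"` is an arbitrary choice for this example.

```python
from itertools import zip_longest

subjects1 = ["Java", "Python", "PHP", "html"]
subjects2 = ['C#', 'CPP', 'C']

# zip() stops at the shortest input, so "html" is dropped
truncated = list(zip(subjects1, subjects2))

# zip_longest() continues to the longest input and pads with fillvalue
padded = list(zip_longest(subjects1, subjects2, fillvalue="N/A"))

print(len(truncated))  # 3
print(padded[-1])      # ('html', 'N/A')
```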


**6) Unzipping the Iterables**
- The **unpacking operator (*)** is used to **unzip** zipped data. Passing `*zipped_data` to zip() spreads the tuples out as separate arguments, so zip() regroups the values by position and returns one tuple per original iterable.

**Syntax:**

     # zip() with unpack operator
     zip(*zipped_data)

In [0]:
# Initialize the lists
subjects1 = ["Java", "Python", "PHP", "html"]
subjects2 = ['C#','CPP','C']

final = zip(subjects1, subjects2)
final1 = list(final)

# Unzipping the zipped object
subjects1,subjects2 = zip(*final1)
print("List1:", subjects1)
print("List2:", subjects2)

List1: ('Java', 'Python', 'PHP')
List2: ('C#', 'CPP', 'C')
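
Note that unzipping returns tuples, not lists, and only the elements that survived the original zip() come back (the extra "html" is gone). If lists are needed, each tuple can be converted, as in this small sketch:

```python
pairs = [('Java', 'C#'), ('Python', 'CPP'), ('PHP', 'C')]

# zip(*pairs) is equivalent to
# zip(('Java', 'C#'), ('Python', 'CPP'), ('PHP', 'C'))
left, right = map(list, zip(*pairs))

print(left)   # ['Java', 'Python', 'PHP']
print(right)  # ['C#', 'CPP', 'C']

# Zipping the unzipped lists restores the original pairs
assert list(zip(left, right)) == pairs
```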


**7) Zip Iterable Objects into a Python Dictionary**

In [0]:
# Initialize the lists
keys = ["course", "fee", "duration"]
values = ['Python','4000','45 days']
print("List1:", keys)
print("List2:", values)

# Use zip() to build a dictionary from the keys and values
final = dict(zip(keys, values))
print("\nGet the dictionary using zip():", final)

List1: ['course', 'fee', 'duration']
List2: ['Python', '4000', '45 days']

Get the dictionary using zip(): {'course': 'Python', 'fee': '4000', 'duration': '45 days'}
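
The same pattern extends to rows of data: zipping a list of column names with each row tuple yields one dictionary per row. A sketch reusing the column names and two of the rows from the earlier DataFrame example:

```python
columns = ["Name", "Age", "City"]
rows = [('Adarsh', 30, 'India'), ('Bibin', 25, 'SriLanka')]

# Pair every column name with the value at the same position in the row
records = [dict(zip(columns, row)) for row in rows]

print(records[0])
# {'Name': 'Adarsh', 'Age': 30, 'City': 'India'}
```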


**8) Using zip() with RDDs**

In [0]:
# Create two RDDs
rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd2 = spark.sparkContext.parallelize(["a", "b", "c", "d"])

# Zip the RDDs
zipped_rdd = rdd1.zip(rdd2)

# Collect and display results
print(zipped_rdd.collect())

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]


- **zip()** combines the **two RDDs element-wise**.
- The result is an **RDD of tuples**:
  - **Each tuple** contains **one element** from **rdd1** and the **corresponding element** from **rdd2**.

        [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

  - **Note:** Both **RDDs** must have the **same number of partitions** and the **same number of elements in each partition** for zip() to work. Having the same total number of elements is not sufficient.
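
The partition requirement is easier to see in a plain-Python model in which each RDD is represented as a list of partitions. This only illustrates the rule; it is not Spark's implementation, and `zip_partitioned` is a hypothetical helper.

```python
def zip_partitioned(parts_a, parts_b):
    """Model of RDD.zip(): pair up partitions, then elements within them.

    parts_a and parts_b are lists of lists, one inner list per partition.
    Hypothetical helper; it only mimics the documented precondition.
    """
    if len(parts_a) != len(parts_b):
        raise ValueError("different number of partitions")
    result = []
    for part_a, part_b in zip(parts_a, parts_b):
        if len(part_a) != len(part_b):
            raise ValueError("different number of elements in a partition")
        result.append(list(zip(part_a, part_b)))
    return result


# Same layout: two partitions with two elements each
print(zip_partitioned([[1, 2], [3, 4]], [["a", "b"], ["c", "d"]]))
# [[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd')]]

# Same total count (4) but a different split per partition
try:
    zip_partitioned([[1, 2], [3, 4]], [["a"], ["b", "c", "d"]])
except ValueError as err:
    print("cannot zip:", err)
# cannot zip: different number of elements in a partition
```

Two RDDs created with `parallelize` from lists of equal length, as in the cell above, normally end up with matching layouts, which is why the example works without extra steps.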


**9) Zipping RDDs with Different Data Types**

In [0]:
# Create RDDs
rdd_numbers = spark.sparkContext.parallelize([1001, 1002, 1003, 1004, 1005, 1006])
rdd_strings = spark.sparkContext.parallelize(["Admin", "Sales", "Marketing", "HR", "Finance", "Maintenance"])
rdd_booleans = spark.sparkContext.parallelize([True, False, True, False, False, True])

# Zip RDDs
zipped_rdd = rdd_numbers.zip(rdd_strings).zip(rdd_booleans)
print(zipped_rdd.collect())

[((1001, 'Admin'), True), ((1002, 'Sales'), False), ((1003, 'Marketing'), True), ((1004, 'HR'), False), ((1005, 'Finance'), False), ((1006, 'Maintenance'), True)]


In [0]:
# Flatten the tuples and display
flattened_rdd = zipped_rdd.map(lambda x: (x[0][0], x[0][1], x[1]))
print(flattened_rdd.collect())

[(1001, 'Admin', True), (1002, 'Sales', False), (1003, 'Marketing', True), (1004, 'HR', False), (1005, 'Finance', False), (1006, 'Maintenance', True)]


     zipped_rdd = rdd_numbers.zip(rdd_strings).zip(rdd_booleans)

- This is a **two-step zip** process:

  - **First:** rdd_numbers.zip(rdd_strings)
    - Combines the **first and second** RDDs **element-wise** into **tuples** like:

          (1001, "Admin"), (1002, "Sales"), ...

  - Then: **.zip(rdd_booleans)**

    - Each result from the first zip is now zipped with a boolean value:

          ((1001, "Admin"), True), ((1002, "Sales"), False), ...

**Flattening the Tuples:**

      flattened_rdd = zipped_rdd.map(lambda x: (x[0][0], x[0][1], x[1]))

      # Input element:  x = ((1001, 'Admin'), True)
      # Output element: (1001, 'Admin', True)

**Collecting and Printing the Results**

      print(flattened_rdd.collect())

**Output**

     [
       (1001, 'Admin', True),
       (1002, 'Sales', False),
       (1003, 'Marketing', True),
       (1004, 'HR', False),
       (1005, 'Finance', False),
       (1006, 'Maintenance', True)
     ]


**10) Using zipWithIndex() to Add Row Indices**

- Use **.zipWithIndex()** to add **row numbers or indices** to your RDD.

- **.zipWithIndex()** assigns a **sequential index (starting from 0)** to **each element** in the RDD.

- It returns a **new RDD of tuples**: each element paired with its corresponding index.

- Unlike **.zipWithUniqueId()**, the indices are **guaranteed to be 0-based and sequential**.

In [0]:
# Create an RDD
rdd = spark.sparkContext.parallelize(["kamal", "Bobby", "Senthil", "Dravid"])

# Zip with index
zipped_with_index = rdd.zipWithIndex()

# Collect and display results
print(zipped_with_index.collect())

[('kamal', 0), ('Bobby', 1), ('Senthil', 2), ('Dravid', 3)]
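
Conceptually, the index of an element is its position within its partition plus the number of elements in all earlier partitions. The plain-Python model below illustrates that idea. It is not Spark's code, `zip_with_index` is a hypothetical helper, and the two-partition layout is an assumption made for the example.

```python
def zip_with_index(partitions):
    """Model of zipWithIndex() over a list of partitions (lists).

    Hypothetical helper for illustration only.
    """
    result = []
    offset = 0  # number of elements in all earlier partitions
    for part in partitions:
        for position, element in enumerate(part):
            result.append((element, offset + position))
        offset += len(part)
    return result


# Assumed layout: two partitions of two elements each
partitions = [["kamal", "Bobby"], ["Senthil", "Dravid"]]
print(zip_with_index(partitions))
# [('kamal', 0), ('Bobby', 1), ('Senthil', 2), ('Dravid', 3)]
```

Because the offsets depend on the size of every earlier partition, Spark has to run a job to count them when the RDD has more than one partition.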


**11) Using zipWithUniqueId() for Unique Identifiers**

- **.zipWithUniqueId()** pairs each element of the RDD with a **unique long integer ID**.

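- The IDs are **unique**, but they are **not guaranteed** to be **sequential** (i.e., **not always 0, 1, 2, 3...**) or even increasing in element order: the values depend on how the RDD is partitioned. Unlike **.zipWithIndex()**, this method does not need to trigger a Spark job.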

In [0]:
# Create an RDD
rdd = spark.sparkContext.parallelize(["kamal", "Bobby", "Senthil", "Dravid", "Shobha"])

# Zip with unique ID
zipped_with_unique_id = rdd.zipWithUniqueId()

# Collect and display results
print(zipped_with_unique_id.collect())

[('kamal', 1), ('Bobby', 3), ('Senthil', 4), ('Dravid', 6), ('Shobha', 7)]
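
According to the Spark documentation, items in the k-th partition receive the IDs k, n+k, 2n+k, ..., where n is the number of partitions. The plain-Python model below applies that rule. `zip_with_unique_id` is a hypothetical helper, and the eight-partition layout is an assumption chosen because it reproduces the output above; the actual layout depends on the environment's default parallelism.

```python
def zip_with_unique_id(partitions):
    """Model of zipWithUniqueId() over a list of partitions (lists).

    The i-th element of partition k gets the ID i * n + k,
    where n is the number of partitions. Illustration only.
    """
    n = len(partitions)
    result = []
    for k, part in enumerate(partitions):
        for i, element in enumerate(part):
            result.append((element, i * n + k))
    return result


# Assumed layout: 5 elements spread over 8 partitions, some of them empty
partitions = [[], ["kamal"], [], ["Bobby"], ["Senthil"], [], ["Dravid"], ["Shobha"]]
print(zip_with_unique_id(partitions))
# [('kamal', 1), ('Bobby', 3), ('Senthil', 4), ('Dravid', 6), ('Shobha', 7)]

# With two partitions the IDs are unique but not in increasing order
print(zip_with_unique_id([["a", "b", "c"], ["d", "e"]]))
# [('a', 0), ('b', 2), ('c', 4), ('d', 1), ('e', 3)]
```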
