#### **Creating MapType**

- The PySpark **MapType** data type represents a map of **key-value pairs**, similar to a Python dictionary **(dict)**.
 
- A **MapType** column in a Spark DataFrame is one that contains **complex data** in the form of **key-value pairs**, with keys and values each having defined **data types**.

- For example, **MapType(StringType(), StringType(), False)** uses **StringType** for **both key and value**, and the **False** argument (**valueContainsNull**) indicates that the **map values cannot be null**.

- The **"keyType" and "valueType"** can be any type that extends the DataType class, e.g. **StringType, IntegerType, ArrayType, MapType, StructType (struct)**, etc.

#### **Why Choose PySpark MapType Dict in Databricks?**

- A PySpark **MapType** column allows for the **compact storage** of **key-value pairs** within a **single column**, which can reduce the overall storage footprint. This matters when dealing with **large datasets** in a Databricks environment.

- A MapType column provides a **flexible schema**: different rows can carry different keys, **accommodating dynamic and evolving data structures**. This is particularly beneficial in scenarios where the data **schema might change over time**.

#### **Syntax**

     MapType(keyType, valueType, valueContainsNull=True)

     from pyspark.sql.types import StringType, IntegerType, MapType
     
     MapType(StringType(), IntegerType(), valueContainsNull=True)     
     MapType(StringType(), IntegerType())
     MapType(StringType(), IntegerType(), True)
     MapType(StringType(), StringType(), True)

**Parameters**
- **keyType:**
  - This is the **data type** of the **keys**.
  - The **keys** in a **MapType** are **not allowed** to be **None or NULL**.
- **valueType:**
  - This is the **data type** of the **values**.
- **valueContainsNull:** 
  - This is a **boolean** value indicating whether the values can be **NULL or None**.
  - The **default** value is **True**, which indicates that the values can be **NULL**.
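  - Specify **False** to indicate that the **map values cannot be NULL**. This is separate from whether the map column itself is nullable, which is set on the **StructField**.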

In [0]:
from pyspark.sql.functions import lit, col, expr
import pyspark.sql.functions as f
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, MapType, ArrayType

In [0]:
help(MapType)

Help on class MapType in module pyspark.sql.types:

class MapType(DataType)
 |  MapType(keyType: pyspark.sql.types.DataType, valueType: pyspark.sql.types.DataType, valueContainsNull: bool = True)
 |  
 |  Map data type.
 |  
 |  Parameters
 |  ----------
 |  keyType : :class:`DataType`
 |      :class:`DataType` of the keys in the map.
 |  valueType : :class:`DataType`
 |      :class:`DataType` of the values in the map.
 |  valueContainsNull : bool, optional
 |      indicates whether values can contain null (None) values.
 |  
 |  Notes
 |  -----
 |  Keys in a map data type are not allowed to be null (None).
 |  
 |  Examples
 |  --------
 |  >>> from pyspark.sql.types import IntegerType, FloatType, MapType, StringType
 |  
 |  The below example demonstrates how to create class:`MapType`:
 |  
 |  >>> map_type = MapType(StringType(), IntegerType())
 |  
 |  The values of the map can contain null (``None``) values by default:
 |  
 |  >>> (MapType(StringType(), IntegerType())
 |  ...    

#### **1) Creating a DataFrame with MapType Schema**

**key: StringType()**

**value: IntegerType()**

In [0]:
# Sample data: each row holds a name, a city, and a Python dict with integer values
data = [("Naresh", "Bangalore", {"Age": 25, "emp_id": 768954, "Exp": 5}), 
        ("Harish", "Chennai", {"Age": 30, "emp_id": 768956, "Exp": 2}),
        ("Prem", "Hyderabad", {"Age": 28, "emp_id": 798954, "Exp": 8}), 
        ("Prabhav", "kochin", {"Age": 35, "emp_id": 788956, "Exp": 6}),
        ("Hari", "Nasik", {"Age": 21, "emp_id": 769954, "Exp": 9}), 
        ("Druv", "Delhi", {"Age": 36, "emp_id": 768946, "Exp": 4}),
        ]

# Define the schema for the MapType column
map_schema = StructType([
  StructField("Name", StringType(), True),
  StructField("City", StringType(), True),
  StructField("Properties", MapType(StringType(), IntegerType()))])

# Create the DataFrame; the Python dicts are stored in the MapType column
df_map_int = spark.createDataFrame(data, map_schema)

# Display the resulting DataFrame
display(df_map_int)

Name,City,Properties
Naresh,Bangalore,"Map(Exp -> 5, Age -> 25, emp_id -> 768954)"
Harish,Chennai,"Map(Exp -> 2, Age -> 30, emp_id -> 768956)"
Prem,Hyderabad,"Map(Exp -> 8, Age -> 28, emp_id -> 798954)"
Prabhav,kochin,"Map(Exp -> 6, Age -> 35, emp_id -> 788956)"
Hari,Nasik,"Map(Exp -> 9, Age -> 21, emp_id -> 769954)"
Druv,Delhi,"Map(Exp -> 4, Age -> 36, emp_id -> 768946)"


**key : StringType()**

**value: StringType()**

In [0]:
# Sample data: each row holds a name, a city, and a Python dict with string values
data = [("Naresh", "Bangalore", {"Domain": "Gas", "Branch": "IT", "Designation": "DE"}), 
        ("Harish", "Chennai", {"Domain": "DS", "Branch": "CSC", "Designation": "DE"}),
        ("Prem", "Hyderabad", {"Domain": "Trade", "Branch": "EEE", "Designation": "DE"}), 
        ("Prabhav", "kochin", {"Domain": "Sales", "Branch": "AI", "Designation": "DE"}),
        ("Hari", "Nasik", {"Domain": "TELE", "Branch": "ECE", "Designation": "DE"}), 
        ("Druv", "Delhi", {"Domain": "BANKING", "Branch": "IT", "Designation": "DE"}),
        ]

# Define the schema for the MapType column
map_schema = StructType([
  StructField("Name", StringType(), True),
  StructField("City", StringType(), True),
  StructField("Properties", MapType(StringType(), StringType()))])

# Create the DataFrame; the Python dicts are stored in the MapType column
df_map_str = spark.createDataFrame(data, map_schema)

# Display the resulting DataFrame
display(df_map_str)

Name,City,Properties
Naresh,Bangalore,"Map(Designation -> DE, Domain -> Gas, Branch -> IT)"
Harish,Chennai,"Map(Designation -> DE, Domain -> DS, Branch -> CSC)"
Prem,Hyderabad,"Map(Designation -> DE, Domain -> Trade, Branch -> EEE)"
Prabhav,kochin,"Map(Designation -> DE, Domain -> Sales, Branch -> AI)"
Hari,Nasik,"Map(Designation -> DE, Domain -> TELE, Branch -> ECE)"
Druv,Delhi,"Map(Designation -> DE, Domain -> BANKING, Branch -> IT)"


**key : StringType()**

**value: StringType()**

**valueContainsNull=True**

In [0]:
# Sample data: some of the dict values are None
data = [("Naresh", "Bangalore", {"Domain": "Gas", "Branch": "IT", "Designation": "DE"}), 
        ("Harish", "Chennai", {"Domain": "DS", "Branch": "CSC", "Designation": None}),
        ("Prem", "Hyderabad", {"Domain": "Trade", "Branch": "EEE", "Designation": "DE"}), 
        ("Prabhav", "kochin", {"Domain": "Sales", "Branch": None, "Designation": "DE"}),
        ("Hari", "Nasik", {"Domain": "TELE", "Branch": "ECE", "Designation": "DE"}), 
        ("Druv", "Delhi", {"Domain": None, "Branch": "IT", "Designation": "DE"}),
        ]

# Define the schema for the MapType column
map_schema = StructType([
  StructField("Name", StringType(), True),
  StructField("City", StringType(), True),
  StructField("Properties", MapType(StringType(), StringType(), True))])

# Create the DataFrame; None values are allowed because valueContainsNull=True
df_map_str_tr = spark.createDataFrame(data, map_schema)

# Display the resulting DataFrame
display(df_map_str_tr)

Name,City,Properties
Naresh,Bangalore,"Map(Designation -> DE, Domain -> Gas, Branch -> IT)"
Harish,Chennai,"Map(Designation -> null, Domain -> DS, Branch -> CSC)"
Prem,Hyderabad,"Map(Designation -> DE, Domain -> Trade, Branch -> EEE)"
Prabhav,kochin,"Map(Designation -> DE, Domain -> Sales, Branch -> null)"
Hari,Nasik,"Map(Designation -> DE, Domain -> TELE, Branch -> ECE)"
Druv,Delhi,"Map(Designation -> DE, Domain -> null, Branch -> IT)"


**key : StringType()**

**value: StringType()**

**valueContainsNull=False**

In [0]:
# Sample data: some of the dict values are None
data = [("Naresh", "Bangalore", {"Domain": "Gas", "Branch": "IT", "Designation": "DE"}), 
        ("Harish", "Chennai", {"Domain": None, "Branch": "CSC", "Designation": None}),
        ("Prem", "Hyderabad", {"Domain": "Trade", "Branch": "EEE", "Designation": "DE"}), 
        ("Prabhav", "kochin", {"Domain": "Sales", "Branch": None, "Designation": "DE"}),
        ("Hari", "Nasik", {"Domain": "TELE", "Branch": "ECE", "Designation": "DE"}), 
        ("Druv", "Delhi", {"Domain": None, "Branch": "IT", "Designation": "DE"}),
        ]

# Define the schema for the MapType column
map_schema = StructType([
  StructField("Name", StringType(), True),
  StructField("City", StringType(), True),
  StructField("Properties", MapType(StringType(), StringType(), False))])

# Try to create the DataFrame; the schema sets valueContainsNull=False
df_map_str_fls = spark.createDataFrame(data, map_schema)

# Display the resulting DataFrame
display(df_map_str_fls)

---------------------------------------------------------------------------
PySparkValueError                         Traceback (most recent call last)
File <command-807085200884174>, line 17
     11 map_schema = StructType([
     12   StructField("Name", StringType(), True),
     13   StructField("City", StringType(), True),
     14   StructField("Properties", MapType(StringType(), StringType(), False))])
     16 # Convert the StringType column to a MapType column
---> 17 df_map_str_fls = spark.createDataFrame(data, map_schema)
     19 # Display the resulting DataFrame
     20
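
The failure above is consistent with the schema declaring `valueContainsNull=False` while several of the dictionaries in `data` contain `None` values. One way to catch this before calling `createDataFrame` is a plain-Python check over the rows. The helper below, `find_null_map_values`, is a hypothetical name introduced for this sketch and is not part of PySpark.

```python
def find_null_map_values(rows, map_index):
    """Return (row_number, key) pairs where the dict at map_index has a None value."""
    problems = []
    for row_number, row in enumerate(rows):
        mapping = row[map_index] or {}  # treat a missing map as empty
        for key, value in mapping.items():
            if value is None:
                problems.append((row_number, key))
    return problems


# Two rows in the same shape as the notebook's data: (Name, City, Properties)
sample_rows = [
    ("Naresh", "Bangalore", {"Domain": "Gas", "Branch": "IT", "Designation": "DE"}),
    ("Harish", "Chennai", {"Domain": None, "Branch": "CSC", "Designation": None}),
]

null_entries = find_null_map_values(sample_rows, map_index=2)
print(null_entries)
```

An empty result suggests the rows can be loaded into a schema with `valueContainsNull=False`. Otherwise, either clean the data first or keep the default `valueContainsNull=True`.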

In [0]:
df_age = df_map_int.select("*", expr("Properties['Age'] + 1").alias('Ages'))
display(df_age)

Name,City,Properties,Ages
Naresh,Bangalore,"Map(Exp -> 5, Age -> 25, emp_id -> 768954)",26
Harish,Chennai,"Map(Exp -> 2, Age -> 30, emp_id -> 768956)",31
Prem,Hyderabad,"Map(Exp -> 8, Age -> 28, emp_id -> 798954)",29
Prabhav,kochin,"Map(Exp -> 6, Age -> 35, emp_id -> 788956)",36
Hari,Nasik,"Map(Exp -> 9, Age -> 21, emp_id -> 769954)",22
Druv,Delhi,"Map(Exp -> 4, Age -> 36, emp_id -> 768946)",37


#### **2) Creating a DataFrame with Nested MapType Schema**

**Creating MapType with StructType values**

- The value type of a map can itself be a **StructType** built from **StructField** entries. This is useful when each key maps to a **nested record** rather than a single value.

In [0]:
# Define the schema for the DataFrame
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Details", MapType(StringType(), StructType([
      StructField("Age", IntegerType(), True),
      StructField("Pin", IntegerType(), True),
      StructField("City", StringType(), True),
      StructField("Gender", StringType(), True),
      StructField("DOB", StringType(), True),
      StructField("Fees", IntegerType(), True),
      StructField("Experience", IntegerType(), True),
      StructField("Address", StringType(), True)])), True)
    ])

# Sample data
data = [
    ("John", {"personal_info": {"Age": 30, "Pin": 517132, "City": "Bangalore", "Gender": "Male", "DOB": "16-061981", "Fees": 200000, "Experience": 5, "Address": "123 Main St"}}),
    ("Jaswanth", {"personal_info": {"Age": 25, "Pin": 527332, "City": "Hyderabad", "Gender": "Male", "DOB": "16-061981", "Fees": 250000, "Experience": 8, "Address": "456 Maple Ave"}}),
    ("Dinesh", {"personal_info": {"Age": 28, "Pin": 537432, "City": "Chennai", "Gender": "Male", "DOB": "16-061981", "Fees": 350000, "Experience": 3, "Address": "789 Elm St"}}),
    ("Watson", {"personal_info": {"Age": 30, "Pin": 557672, "City": "Nasik", "Gender": "Male", "DOB": "16-061981", "Fees": 3550000, "Experience": 6, "Address": "451 27th Main"}}),
    ("David", {"personal_info": {"Age": 25, "Pin": 757132, "City": "Cochin", "Gender": "Male", "DOB": "16-061981", "Fees": 4555000, "Experience": 9, "Address": "#401 Madiwala"}}),
    ("Dravid", {"personal_info": {"Age": 28, "Pin": 973132, "City": "Amaravati", "Gender": "Male", "DOB": "16-061981", "Fees": 6789000, "Experience": 12, "Address": "789 Mumbai"}}),
    ("Joseph", {"personal_info": {"Age": 30, "Pin": 678132, "City": "Mumbai", "Gender": "Male", "DOB": "16-061981", "Fees": 233000, "Experience": 3, "Address": "323 3rd Cross"}}),
    ("Dhanush", {"personal_info": {"Age": 25, "Pin": 874132, "City": "Delhi", "Gender": "Male", "DOB": "16-061981", "Fees": 9786000, "Experience": 10, "Address": "456 Maple Ave"}}),
    ("Sam", {"personal_info": {"Age": 28, "Pin": 632132, "City": "Ahmadabad", "Gender": "Male", "DOB": "16-061981", "Fees": 984567000, "Experience": 11, "Address": "189 Walaja"}})
]

# Create DataFrame
df_nest_map = spark.createDataFrame(data, schema=schema)

# Display the DataFrame
display(df_nest_map)

Name,Details
John,"Map(personal_info -> List(30, 517132, Bangalore, Male, 16-061981, 200000, 123 Main St))"
Jaswanth,"Map(personal_info -> List(25, 527332, Hyderabad, Male, 16-061981, 250000, 456 Maple Ave))"
Dinesh,"Map(personal_info -> List(28, 537432, Chennai, Male, 16-061981, 350000, 789 Elm St))"
Watson,"Map(personal_info -> List(30, 557672, Nasik, Male, 16-061981, 3550000, 451 27th Main))"
David,"Map(personal_info -> List(25, 757132, Cochin, Male, 16-061981, 4555000, #401 Madiwala))"
Dravid,"Map(personal_info -> List(28, 973132, Amaravati, Male, 16-061981, 6789000, 789 Mumbai))"
Joseph,"Map(personal_info -> List(30, 678132, Mumbai, Male, 16-061981, 233000, 323 3rd Cross))"
Dhanush,"Map(personal_info -> List(25, 874132, Delhi, Male, 16-061981, 9786000, 456 Maple Ave))"
Sam,"Map(personal_info -> List(28, 632132, Ahmadabad, Male, 16-061981, 984567000, 189 Walaja))"


In [0]:
# Schema with a map whose values are structs containing arrays of structs
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Profile", MapType(
        StringType(), 
        StructType([
            StructField("Name", StringType(), True),
            StructField("Age", IntegerType(), True),
            StructField("City", StringType(), True),
            StructField("Gender", StringType(), True),
            StructField("Skills", ArrayType(
                StructType([
                    StructField("SkillName", StringType(), True),
                    StructField("ExperienceYears", IntegerType(), True)
                ]), True), True),
            StructField("Matrix", ArrayType(
                StructType([
                    StructField("SkillName", StringType(), True),
                    StructField("ExperienceYears", IntegerType(), True)
                ]), True), True)
        ]), True), True)
])

# Sample data
data = [
    ("1", {"personal_info": {"Name": "John", "Age": 30, "City": "Bangalore", "Gender": "Male", "Skills": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}], "Matrix": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}]}}),
    ("2", {"personal_info": {"Name": "Kiran", "Age": 25, "City": "Hyderabad", "Gender": "Male", "Skills": [{"SkillName": "Java", "ExperienceYears": 4}, {"SkillName": "Kubernetes", "ExperienceYears": 6}], "Matrix": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}]}}),
    ("3", {"personal_info": {"Name": "Kishore", "Age": 45, "City": "Chennai", "Gender": "Male", "Skills": [{"SkillName": "PySpark", "ExperienceYears": 6}, {"SkillName": "Spark", "ExperienceYears": 8}], "Matrix": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}]}}),
    ("4", {"personal_info": {"Name": "Kashvi", "Age": 28, "City": "Nasik", "Gender": "Male", "Skills": [{"SkillName": "SQL", "ExperienceYears": 8}, {"SkillName": "Spark", "ExperienceYears": 5}], "Matrix": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}]}}),
    ("5", {"personal_info": {"Name": "Kamal", "Age": 39, "City": "Mumbai", "Gender": "Male", "Skills": [{"SkillName": "Devops", "ExperienceYears": 2}, {"SkillName": "Spark", "ExperienceYears": 6}], "Matrix": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}]}}),
    ("6", {"personal_info": {"Name": "Pratap", "Age": 49, "City": "Ahmadabad", "Gender": "Male", "Skills": [{"SkillName": "Databricks", "ExperienceYears": 7}, {"SkillName": "Spark", "ExperienceYears": 7}], "Matrix": [{"SkillName": "Python", "ExperienceYears": 5}, {"SkillName": "Spark", "ExperienceYears": 3}]}})
]

# Create DataFrame
df_multi_nest = spark.createDataFrame(data, schema=schema)

# Display the DataFrame
display(df_multi_nest)

ID,Profile
1,"Map(personal_info -> List(John, 30, Bangalore, Male, List(List(Python, 5), List(Spark, 3)), List(List(Python, 5), List(Spark, 3))))"
2,"Map(personal_info -> List(Kiran, 25, Hyderabad, Male, List(List(Java, 4), List(Kubernetes, 6)), List(List(Python, 5), List(Spark, 3))))"
3,"Map(personal_info -> List(Kishore, 45, Chennai, Male, List(List(PySpark, 6), List(Spark, 8)), List(List(Python, 5), List(Spark, 3))))"
4,"Map(personal_info -> List(Kashvi, 28, Nasik, Male, List(List(SQL, 8), List(Spark, 5)), List(List(Python, 5), List(Spark, 3))))"
5,"Map(personal_info -> List(Kamal, 39, Mumbai, Male, List(List(Devops, 2), List(Spark, 6)), List(List(Python, 5), List(Spark, 3))))"
6,"Map(personal_info -> List(Pratap, 49, Ahmadabad, Male, List(List(Databricks, 7), List(Spark, 7)), List(List(Python, 5), List(Spark, 3))))"


#### **3) Creating a DataFrame using create_map**

In [0]:
# Sample data: name, city, three integer columns, and a Python dict with integer values
data = [("Naresh", "Bangalore", 2, 21, 41, {"Age": 25, "emp_id": 768954, "Exp": 5}), 
        ("Harish", "Chennai", 4, 12, 5, {"Age": 30, "emp_id": 768956, "Exp": 2}),
        ("Prem", "Hyderabad", 5, 9, 12, {"Age": 28, "emp_id": 798954, "Exp": 8}), 
        ("Prabhav", "kochin", 7, 12, 4, {"Age": 35, "emp_id": 788956, "Exp": 6}),
        ("Hari", "Nasik", 8, 51, 35, {"Age": 21, "emp_id": 769954, "Exp": 9}), 
        ("Druv", "Delhi", 12, 15, 12, {"Age": 36, "emp_id": 768946, "Exp": 4}),
        ]

# Define the schema for the MapType column
map_schema = StructType([
  StructField("Name", StringType(), True),
  StructField("City", StringType(), True),
  StructField("key", IntegerType(), True),
  StructField("mode", IntegerType(), True),
  StructField("target", IntegerType(), True),
  StructField("Properties", MapType(StringType(), IntegerType()))])

# Create the DataFrame; the Python dicts are stored in the MapType column
df_map_nest = spark.createDataFrame(data, map_schema)

# Display the resulting DataFrame
display(df_map_nest)

Name,City,key,mode,target,Properties
Naresh,Bangalore,2,21,41,"Map(Exp -> 5, Age -> 25, emp_id -> 768954)"
Harish,Chennai,4,12,5,"Map(Exp -> 2, Age -> 30, emp_id -> 768956)"
Prem,Hyderabad,5,9,12,"Map(Exp -> 8, Age -> 28, emp_id -> 798954)"
Prabhav,kochin,7,12,4,"Map(Exp -> 6, Age -> 35, emp_id -> 788956)"
Hari,Nasik,8,51,35,"Map(Exp -> 9, Age -> 21, emp_id -> 769954)"
Druv,Delhi,12,15,12,"Map(Exp -> 4, Age -> 36, emp_id -> 768946)"


In [0]:
df_nest = df_map_nest.select("*", f.array('key', 'mode', 'target').alias('audience'),
                             f.create_map(
                               f.lit('acousticness'), col('Name'), 
                               f.lit('danceability'), col('City'),
                               f.lit('energy'), col('City'),
                               f.lit('instrumentalness'), col('City'),
                               f.lit('liveness'), col('City'),
                               f.lit('loudness'), col('City'),
                               f.lit('speechiness'), col('City'),
                               f.lit('tempo'), col('Name')).alias('qualities'))

display(df_nest)

Name,City,key,mode,target,Properties,audience,qualities
Naresh,Bangalore,2,21,41,"Map(Exp -> 5, Age -> 25, emp_id -> 768954)","List(2, 21, 41)","Map(tempo -> Naresh, energy -> Bangalore, liveness -> Bangalore, speechiness -> Bangalore, acousticness -> Naresh, danceability -> Bangalore, loudness -> Bangalore, instrumentalness -> Bangalore)"
Harish,Chennai,4,12,5,"Map(Exp -> 2, Age -> 30, emp_id -> 768956)","List(4, 12, 5)","Map(tempo -> Harish, energy -> Chennai, liveness -> Chennai, speechiness -> Chennai, acousticness -> Harish, danceability -> Chennai, loudness -> Chennai, instrumentalness -> Chennai)"
Prem,Hyderabad,5,9,12,"Map(Exp -> 8, Age -> 28, emp_id -> 798954)","List(5, 9, 12)","Map(tempo -> Prem, energy -> Hyderabad, liveness -> Hyderabad, speechiness -> Hyderabad, acousticness -> Prem, danceability -> Hyderabad, loudness -> Hyderabad, instrumentalness -> Hyderabad)"
Prabhav,kochin,7,12,4,"Map(Exp -> 6, Age -> 35, emp_id -> 788956)","List(7, 12, 4)","Map(tempo -> Prabhav, energy -> kochin, liveness -> kochin, speechiness -> kochin, acousticness -> Prabhav, danceability -> kochin, loudness -> kochin, instrumentalness -> kochin)"
Hari,Nasik,8,51,35,"Map(Exp -> 9, Age -> 21, emp_id -> 769954)","List(8, 51, 35)","Map(tempo -> Hari, energy -> Nasik, liveness -> Nasik, speechiness -> Nasik, acousticness -> Hari, danceability -> Nasik, loudness -> Nasik, instrumentalness -> Nasik)"
Druv,Delhi,12,15,12,"Map(Exp -> 4, Age -> 36, emp_id -> 768946)","List(12, 15, 12)","Map(tempo -> Druv, energy -> Delhi, liveness -> Delhi, speechiness -> Delhi, acousticness -> Druv, danceability -> Delhi, loudness -> Delhi, instrumentalness -> Delhi)"
