- dtypes is a PySpark DataFrame property (accessed as df.dtypes, without parentheses) that returns a list of tuples.
- Each tuple contains two elements:
    1. The column name.
    2. The data type of that column, as a string (for example, 'string' or 'bigint').

- This is useful when you need a quick and simple way to view the data types of all columns in a DataFrame.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dtypesFunctionExample").getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/15 11:39:37 WARN Utils: Your hostname, KLZPC0015, resolves to a loopback address: 127.0.1.1; using 172.25.17.96 instead (on interface eth0)
25/09/15 11:39:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/15 11:39:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/15 11:39:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/09/15 11:39:55 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/09/15 11:39:55 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [5]:
data = [
    ("Souvik", 72, 85, 95, True),
    ("Soukarjya", 32, 56, 46, False),
    ("Sandip", 74, 38, 66, False),
    ("Prodipta", 76, 89, 31, True),
    ("RamaSai", 69, 78, 38, False),
    ("Riya", 78, 82, 96, True),
    ("Padma", 46, 53, 49, True)
]

columns = ["name", "Chemistry", "Data Science", "Painting", "Passed"]

df = spark.createDataFrame(data, columns)
df.show()


                                                                                

+---------+---------+------------+--------+------+
|     name|Chemistry|Data Science|Painting|Passed|
+---------+---------+------------+--------+------+
|   Souvik|       72|          85|      95|  true|
|Soukarjya|       32|          56|      46| false|
|   Sandip|       74|          38|      66| false|
| Prodipta|       76|          89|      31|  true|
|  RamaSai|       69|          78|      38| false|
|     Riya|       78|          82|      96|  true|
|    Padma|       46|          53|      49|  true|
+---------+---------+------------+--------+------+



In [6]:
# Use the dtypes property to get column names and data types

# Get a list of (column_name, data_type) pairs
print("Column names and data types (dtypes):")
print(df.dtypes)


Column names and data types (dtypes):
[('name', 'string'), ('Chemistry', 'bigint'), ('Data Science', 'bigint'), ('Painting', 'bigint'), ('Passed', 'boolean')]
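
Because dtypes is an ordinary Python list of (name, type) tuples, it can be turned into a dict for lookups or filtered by type. The sketch below is one possible way to do that. To keep it runnable without a Spark session, it copies the list printed above into a literal; in a live notebook you would likely use df.dtypes directly. The NUMERIC_TYPES set and the variable names are assumptions made for this illustration, not part of the PySpark API.

```python
# Copied from the df.dtypes output above; in a live session use df.dtypes instead.
dtypes = [
    ("name", "string"),
    ("Chemistry", "bigint"),
    ("Data Science", "bigint"),
    ("Painting", "bigint"),
    ("Passed", "boolean"),
]

# Map each column name to its type string for quick lookups.
type_by_column = dict(dtypes)
print(type_by_column["Chemistry"])

# Hypothetical helper set: type strings treated as numeric in this example.
# Decimal columns (reported like 'decimal(10,2)') are deliberately not covered.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}

# Keep only the columns whose type string is in the numeric set.
numeric_columns = [name for name, dtype in dtypes if dtype in NUMERIC_TYPES]
print(numeric_columns)
```

A list such as numeric_columns could then be passed to df.select(...) to work only with the score columns.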


In [9]:
# Optional: display dtypes in a more readable format

# Iterate through dtypes and print each column with its data type
print("Formatted output of column data types:")
for col_name, data_type in df.dtypes:
    print(f"Column: {col_name}, Type: {data_type}")


Formatted output of column data types:
Column: name, Type: string
Column: Chemistry, Type: bigint
Column: Data Science, Type: bigint
Column: Painting, Type: bigint
Column: Passed, Type: boolean
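
When column names differ a lot in length, padding them to a common width can make the listing easier to scan. The following is a minimal sketch of that idea using plain Python string formatting. It works on a literal copy of the dtypes list from this notebook so that it runs without Spark; the names width and lines are chosen only for this example.

```python
# Copied from this notebook's df.dtypes output; in a live session use df.dtypes.
dtypes = [
    ("name", "string"),
    ("Chemistry", "bigint"),
    ("Data Science", "bigint"),
    ("Painting", "bigint"),
    ("Passed", "boolean"),
]

# Width of the longest column name, used to left-align every name.
width = max(len(name) for name, _ in dtypes)

# Build one padded line per column: name, two spaces, then the type string.
lines = [f"{name:<{width}}  {dtype}" for name, dtype in dtypes]
print("\n".join(lines))
```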
