- What is the summary() function in PySpark?
    - DataFrame.summary(*statistics) computes descriptive statistics for the numeric and string columns of a DataFrame and returns them as a new DataFrame.

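- By default (when called with no arguments), it shows the following statistics:
    - count: number of non-null values in each column
    - mean: average of numeric columns
    - stddev: sample standard deviation of numeric columns
    - min: minimum value
    - 25%, 50%, 75%: approximate quartiles of numeric columns (50% is the median)
    - max: maximum value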

- You can also pass your own list of statistics, such as 'count', '25%' (first quartile), '50%' (median), '75%' (third quartile), or any other percentile written as a percentage string.

- Syntax: 
    DataFrame.summary(*statistics)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameSummaryExample").getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/16 09:03:50 WARN Utils: Your hostname, KLZPC0015, resolves to a loopback address: 127.0.1.1; using 172.25.17.96 instead (on interface eth0)
25/09/16 09:03:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/16 09:04:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
data = [
    (29, "Dipankar", 35000),
    (28, "Prodipta", 25000),
    (27, "Padma", 20000),
    (26, "Souvik", 70000),
    (28, "Soukarjya", 65000)
]

columns = ["age", "name", "salary"]

df = spark.createDataFrame(data, columns)
df.show()


                                                                                

+---+---------+------+
|age|     name|salary|
+---+---------+------+
| 29| Dipankar| 35000|
| 28| Prodipta| 25000|
| 27|    Padma| 20000|
| 26|   Souvik| 70000|
| 28|Soukarjya| 65000|
+---+---------+------+



In [3]:
# Using the summary() function in PySpark
# Example: default summary statistics (count, mean, stddev, min, 25%, 50%, 75%, max)
print("Default summary statistics:")
df.summary().show()


Default summary statistics:


25/09/16 09:05:07 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------+-----------------+--------+------------------+
|summary|              age|    name|            salary|
+-------+-----------------+--------+------------------+
|  count|                5|       5|                 5|
|   mean|             27.6|    NULL|           43000.0|
| stddev|1.140175425099138|    NULL|23075.961518428652|
|    min|               26|Dipankar|             20000|
|    25%|               27|    NULL|             25000|
|    50%|               28|    NULL|             35000|
|    75%|               28|    NULL|             65000|
|    max|               29|  Souvik|             70000|
+-------+-----------------+--------+------------------+



                                                                                

In [4]:
# Example: Custom Statistics - only 'count', 'min', 'max'
print("Custom summary statistics(count, min, max): ")
df.summary("count", "min", "max").show()


Custom summary statistics(count, min, max): 




+-------+---+--------+------+
|summary|age|    name|salary|
+-------+---+--------+------+
|  count|  5|       5|     5|
|    min| 26|Dipankar| 20000|
|    max| 29|  Souvik| 70000|
+-------+---+--------+------+



                                                                                

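- summary() returns descriptive statistics of your data as a new DataFrame.
- For numeric columns it computes every statistic; for string columns it returns count, min, and max (by string ordering), while mean, stddev, and the percentiles are NULL.
- You can specify which statistics you want by passing their names as arguments.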

- Example Recap:
    - We created a DataFrame with age, name, and salary.
    - We used df.summary() to get the default statistics and df.summary("count", "min", "max") to request only selected ones.