- freqItems() is a PySpark DataFrame method that returns the frequent items (values) in one or more columns. It helps identify commonly occurring values.
- The result is approximate and may include false positives, which makes it better suited to exploratory data analysis (EDA) and a first look at data distributions than to exact counting.

- Syntax:
    df.freqItems(cols, support=None)

- cols = List (or tuple) of column names to analyze
- support = Minimum frequency threshold for a value to count as frequent (default is 0.01, or 1%; must be greater than 1e-4)
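
To make the support threshold concrete, here is a minimal pure-Python sketch that computes exact relative frequencies and keeps the values at or above a threshold. It is only an aid to intuition: `exact_frequent_items` is a hypothetical helper, not part of PySpark, and the inclusive comparison (`>=`) is an assumption of this sketch. freqItems() itself is approximate, so its output can contain values that this exact version would reject.

```python
from collections import Counter


def exact_frequent_items(values, support=0.01):
    """Return the set of values whose exact relative frequency is >= support.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    values = list(values)
    if not values:
        return set()
    total = len(values)
    counts = Counter(values)
    return {value for value, count in counts.items() if count / total >= support}


# Sample values chosen for illustration: "Riya" makes up 4 of the 10 entries.
names = ["Padma", "Padma", "Arvind", "Arvind", "Soumi", "Soumi",
         "Riya", "Riya", "Riya", "Riya"]

print(exact_frequent_items(names))               # every name clears the 1% default
print(exact_frequent_items(names, support=0.3))  # only names in at least 30% of entries
```

With the 1% default, every name in a ten-item sample qualifies; raising the threshold to 0.3 leaves only the name that accounts for at least 30% of the entries.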

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("freqItemsFunctionExample").getOrCreate()



In [2]:
data = [
    (1, "Padma", "India"),
    (2, "Padma", "Ireland"),
    (3, "Arvind", "Ireland"),
    (4, "Arvind", "India"),
    (5, "Soumi", "India"),
    (6, "Soumi", "Germany"),
    (7, "Riya", "Japan"),
    (8, "Riya", "UK"),
    (9, "Riya", "USA"),
    (10, "Riya", "AUS")
]

columns = ["ID", "Name", "Country"]
df = spark.createDataFrame(data, columns)
df.show()




+---+------+-------+
| ID|  Name|Country|
+---+------+-------+
|  1| Padma|  India|
|  2| Padma|Ireland|
|  3|Arvind|Ireland|
|  4|Arvind|  India|
|  5| Soumi|  India|
|  6| Soumi|Germany|
|  7|  Riya|  Japan|
|  8|  Riya|     UK|
|  9|  Riya|    USA|
| 10|  Riya|    AUS|
+---+------+-------+



                                                                                

In [3]:
# Example - Find frequent items in a single column (Name)
# Spark resolves column names case-insensitively by default, so "name" matches "Name"
freq_name = df.freqItems(["name"])

print("Frequent Items in 'name' column: ")
freq_name.show(truncate=False)
 

Frequent Items in 'name' column: 



+----------------------------+
|name_freqItems              |
+----------------------------+
|[Soumi, Riya, Arvind, Padma]|
+----------------------------+



                                                                                

In [4]:
# Example - Find frequent items in multiple columns (name, country)
freq_name_country = df.freqItems(["name", "country"])

print("Frequent Items in 'name' and 'country' columns: ")
freq_name_country.show(truncate=False)


Frequent Items in 'name' and 'country' columns: 




+----------------------------+----------------------------------------------+
|name_freqItems              |country_freqItems                             |
+----------------------------+----------------------------------------------+
|[Soumi, Riya, Arvind, Padma]|[India, Japan, Germany, Ireland, UK, AUS, USA]|
+----------------------------+----------------------------------------------+



                                                                                

In [5]:
# Example: change the support threshold (optional)
# By default, support is 0.01 (1%). Raising it keeps only values that occur more often, so fewer items are returned.
freq_with_support = df.freqItems(["name"], support=0.3)

print("Frequent Items in 'name' column with support = 0.3 (30%): ")
freq_with_support.show(truncate=False)


Frequent Items in 'name' column with support = 0.3 (30%): 



+--------------+
|name_freqItems|
+--------------+
|[Riya]        |
+--------------+



                                                                                