# PySpark Count Distinct from DataFrame


---


**In PySpark, you can get a distinct count either by chaining distinct().count() on a DataFrame or by using the countDistinct() SQL function.**


---


## distinct()
**distinct() eliminates duplicate records (rows that match on all columns) from a DataFrame, and count() returns the number of records in a DataFrame. Chaining the two gives the distinct row count of a PySpark DataFrame.**

---

## countDistinct()
**countDistinct() is a SQL function that returns the distinct count over one or more selected columns.**


---


**Let’s look at these approaches with examples.**

**Before we start, let’s create a DataFrame with one fully duplicated row and some duplicate values within individual columns.**

In [0]:
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
]

columns = ["Name", "Dept", "Salary"]

df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

root
 |-- Name: string (nullable = true)
 |-- Dept: string (nullable = true)
 |-- Salary: long (nullable = true)

+-------+---------+------+
|Name   |Dept     |Salary|
+-------+---------+------+
|James  |Sales    |3000  |
|Michael|Sales    |4600  |
|Robert |Sales    |4100  |
|Maria  |Finance  |3000  |
|James  |Sales    |3000  |
|Scott  |Finance  |3300  |
|Jen    |Finance  |3900  |
|Jeff   |Marketing|3000  |
|Kumar  |Marketing|2000  |
|Saif   |Sales    |4100  |
+-------+---------+------+



## 1. Using DataFrame distinct() and count()

**The DataFrame above has 10 rows, one of which (James, Sales, 3000) duplicates another row in every column, so distinct().count() on this DataFrame should return 9.**

In [0]:
print("Distinct Count: " + str(df.distinct().count()))

Distinct Count: 9
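
**As a cross-check on the expected result, the same row-level count can be reproduced without Spark. This is only a plain-Python sketch of what "distinct on all columns" means for the sample data; row_counts, distinct_rows and duplicates are illustrative names, not part of any PySpark API.**

```python
from collections import Counter

# Same sample rows as the DataFrame above.
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]

# Two rows are equal only if every column matches, so whole tuples are the keys.
row_counts = Counter(data)
distinct_rows = len(row_counts)
duplicates = [row for row, n in row_counts.items() if n > 1]

print("Distinct rows:", distinct_rows)
print("Rows that appear more than once:", duplicates)
```

**Only the (James, Sales, 3000) row repeats, which is why 10 rows reduce to 9.**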


## 2. Using countDistinct() SQL Function

**DataFrame distinct() returns a new DataFrame after eliminating duplicate rows (distinct on all columns). If you want the distinct count over selected columns only, use the PySpark SQL function countDistinct(). As an aggregate function, it returns the number of distinct elements in a group; when used in select() without a groupBy(), the whole DataFrame is treated as a single group.**

**In order to use this function, you need to import it first.**

In [0]:
from pyspark.sql.functions import countDistinct

In [0]:
df2 = df.select(countDistinct("Dept", "Salary"))
df2.show(truncate=False)


+----------------------------+
|count(DISTINCT Dept, Salary)|
+----------------------------+
|8                           |
+----------------------------+



**Note that countDistinct() returns a Column expression, and select() wraps its result in a one-row DataFrame, so you need to collect that DataFrame to get the value as a Python number. The function accepts any number of columns, from a single column up to all of them.**

In [0]:
print("Distinct Count of Department & Salary: "+ str(df2.collect()[0][0]))

Distinct Count of Department & Salary: 8


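## 3. Using SQL to get Count Distinct

**You can also get the distinct count with a SQL query: register the DataFrame as a temporary view, then run the query with spark.sql().**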

In [0]:
df.createOrReplaceTempView("EMP")

In [0]:
spark.sql(" select count(distinct(*)) from EMP").show(truncate=False)

+----------------------------------+
|count(DISTINCT Name, Dept, Salary)|
+----------------------------------+
|9                                 |
+----------------------------------+

