### 1. **Initializing Spark Session**:


In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 3.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=6369fe0000ef1539026b8e2326e396fe1d029913b8c1011baf3736589a21de58
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder \
        .appName('SparkByExamples.com') \
        .getOrCreate()

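- Imports `SparkSession` and the `countDistinct` aggregate function from PySpark.
- Creates a Spark session with the application name 'SparkByExamples.com', or reuses the active session if one already exists (`getOrCreate()`).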


### 2. **Defining Sample Data and Schema**:


In [3]:
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

columns = ["Name", "Dept", "Salary"]

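- Defines sample data as a list of tuples, where each tuple represents a row in the DataFrame. The row `("James", "Sales", 3000)` appears twice, which is what the distinct examples later rely on.
- Defines the column names `Name`, `Dept`, and `Salary`. No types are given, so Spark infers them from the data.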


### 3. **Creating DataFrame**:


In [4]:
df = spark.createDataFrame(data=data, schema=columns)

- Creates a DataFrame from the sample data, using `columns` as the column names. Column types are inferred from the values.


### 4. **Showing Distinct Rows and Counting Distinct Rows**:


In [5]:
df.distinct().show()
print("Distinct Count: " + str(df.distinct().count()))

+-------+---------+------+
|   Name|     Dept|Salary|
+-------+---------+------+
|Michael|    Sales|  4600|
|  James|    Sales|  3000|
| Robert|    Sales|  4100|
|  Maria|  Finance|  3000|
|    Jen|  Finance|  3900|
|  Scott|  Finance|  3300|
|  Kumar|Marketing|  2000|
|   Jeff|Marketing|  3000|
|   Saif|    Sales|  4100|
+-------+---------+------+

Distinct Count: 9


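- Uses `distinct()` to return a new DataFrame with duplicate rows removed, then displays it. The original `df` is unchanged.
- Prints the count of distinct rows: 9 rather than 10, because the row `("James", "Sales", 3000)` appears twice in the sample data.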
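
As a sanity check on the count of 9, the same deduplication can be reproduced in plain Python without Spark. This is only a sketch for verifying the sample data by hand; the names `row_counts`, `distinct_rows`, and `duplicates` are illustrative and do not come from the notebook.

```python
from collections import Counter

# Same sample rows as the notebook's `data` list.
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

# Count how many times each full row occurs.
row_counts = Counter(data)

# Each key of the Counter is one distinct row.
distinct_rows = list(row_counts)

# Rows that occur more than once are the ones distinct() collapses.
duplicates = {row: n for row, n in row_counts.items() if n > 1}

print("Total rows:", len(data))
print("Distinct rows:", len(distinct_rows))
print("Duplicated rows:", duplicates)
```

The only duplicated row is `("James", "Sales", 3000)`, so 10 input rows reduce to 9 distinct rows, matching the Spark output.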


### 5. **Counting Distinct Combinations of 'Dept' and 'Salary'**:


In [6]:
df2 = df.select(countDistinct("Dept", "Salary"))
df2.show()
print("Distinct Count of Department & Salary: " + str(df2.collect()[0][0]))

+----------------------------+
|count(DISTINCT Dept, Salary)|
+----------------------------+
|                           8|
+----------------------------+

Distinct Count of Department & Salary: 8


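- Uses `countDistinct()` to count the number of distinct combinations of `Dept` and `Salary`. The result is 8 because `Name` is ignored: the two James rows share (`Sales`, 3000), and Robert and Saif share (`Sales`, 4100).
- Displays the result, then reads the count from the first column of the single result row with `collect()[0][0]` and prints it.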
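
The count of 8 can also be checked by hand in plain Python by projecting each row onto (`Dept`, `Salary`) and counting the unique pairs. This is a Spark-free sketch; `pair_counts` and `shared_pairs` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Same sample rows as the notebook's `data` list.
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

# Drop the name and count occurrences of each (dept, salary) pair.
pair_counts = Counter((dept, salary) for _name, dept, salary in data)

# Pairs shared by more than one row explain why 10 rows give fewer pairs.
shared_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}

print("Distinct (Dept, Salary) pairs:", len(pair_counts))
print("Pairs shared by several rows:", shared_pairs)
```

Two pairs are each shared by two rows, so 10 rows yield 8 distinct combinations.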


### 6. **Using SQL to Count Distinct Rows**:


In [7]:
df.createOrReplaceTempView("PERSON")
spark.sql("SELECT COUNT(DISTINCT *) FROM PERSON").show()

+----------------------------------+
|count(DISTINCT Name, Dept, Salary)|
+----------------------------------+
|                                 9|
+----------------------------------+



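- Registers the DataFrame as a temporary SQL view named `PERSON`.
- Uses `COUNT(DISTINCT *)` to count the distinct rows in the `PERSON` view. The `*` expands to all three columns, as the output column name `count(DISTINCT Name, Dept, Salary)` shows.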
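
One caveat worth keeping in mind: the sample data has no nulls, so the SQL count and `df.distinct().count()` agree here. Spark's SQL function reference describes `count(DISTINCT expr, ...)` as counting rows whose expressions are unique and non-null, whereas `distinct()` keeps rows that contain nulls. The sketch below models that difference in plain Python. The rows are hypothetical (not from the notebook), with `None` standing in for SQL `NULL`, and the code only mimics the documented semantics rather than running Spark.

```python
# Hypothetical rows with missing values; None stands in for SQL NULL.
rows_with_nulls = [("James", "Sales", 3000),
                   ("James", "Sales", 3000),
                   ("Maria", "Finance", None),
                   ("Scott", None, 3300)]

# Models df.distinct().count(): rows containing nulls are still kept.
distinct_rows = set(rows_with_nulls)

# Models COUNT(DISTINCT Name, Dept, Salary): rows where any
# column is null are skipped before counting unique combinations.
non_null_distinct = {row for row in rows_with_nulls
                     if all(value is not None for value in row)}

print("distinct()-style count:", len(distinct_rows))
print("COUNT(DISTINCT ...)-style count:", len(non_null_distinct))
```

If the two counts must agree on data with nulls, counting the rows of `df.distinct()` is the safer choice.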

### Key Points

- **Creating DataFrame**: Shows how to create a DataFrame from a list of tuples and a list of column names.
- **Counting Distinct Rows**: Demonstrates how to remove duplicate rows with `distinct()` and count the remaining rows with `count()`.
- **Using countDistinct**: Uses the `countDistinct` function to count distinct combinations of specified columns.
- **SQL Queries**: Registers the DataFrame as a temporary view and counts its distinct rows with a SQL query.
