- crosstab() computes a contingency table (cross-tabulation) of two columns.
- It shows the frequency count for each combination of values in two categorical columns.
- It is commonly used to compare categories and find relationships between columns.
- Syntax:
    df.crosstab(col1, col2)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crosstabFunctionExample").getOrCreate()




In [2]:
data = [
    (1, "Padma", "India"),
    (2, "Padma", "Ireland"),
    (3, "Arvind", "Ireland"),
    (4, "Arvind", "India"),
    (5, "Soumi", "India"),
    (6, "Soumi", "Germany"),
    (7, "Riya", "Japan"),
    (8, "Riya", "UK"),
    (9, "Riya", "USA"),
    (10, "Riya", "AUS")
]

columns = ["ID", "Name", "Country"]
df = spark.createDataFrame(data, columns)
df.show()





+---+------+-------+
| ID|  Name|Country|
+---+------+-------+
|  1| Padma|  India|
|  2| Padma|Ireland|
|  3|Arvind|Ireland|
|  4|Arvind|  India|
|  5| Soumi|  India|
|  6| Soumi|Germany|
|  7|  Riya|  Japan|
|  8|  Riya|     UK|
|  9|  Riya|    USA|
| 10|  Riya|    AUS|
+---+------+-------+




In [3]:
# Example: Crosstab between name and country
# crosstab() counts how many times each name appears with each country.
# Spark resolves column names case-insensitively by default, so "name" matches "Name".

crosstab_df = df.crosstab("name", "country")

print("Crosstab between name and country: ")
crosstab_df.show(truncate=False)



Crosstab between name and country: 




+------------+---+-------+-----+-------+-----+---+---+
|name_country|AUS|Germany|India|Ireland|Japan|UK |USA|
+------------+---+-------+-----+-------+-----+---+---+
|Padma       |0  |0      |1    |1      |0    |0  |0  |
|Soumi       |0  |1      |1    |0      |0    |0  |0  |
|Riya        |1  |0      |0    |0      |1    |1  |1  |
|Arvind      |0  |0      |1    |1      |0    |0  |0  |
+------------+---+-------+-----+-------+-----+---+---+




- Explanation:
    - The result is a DataFrame where:
        - Each row represents a distinct value from the 'name' column.
        - The first column, name_country, holds those values; every other column represents a distinct value from the 'country' column.
        - The cells show the count of occurrences for each combination.

    - Combinations that never occur in the data have a count of 0. Null values in either column are not counted as 0; they appear under the label "null".
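
To make the counting concrete, here is a minimal pure-Python sketch of the same tally, with no Spark involved. It only illustrates what the table contains and is not how Spark computes it. The names `pairs`, `pair_counts`, and `table` are made up for this sketch, and sorting the labels alphabetically is an assumption made for readable output.

```python
from collections import Counter

# The same (Name, Country) pairs as in the DataFrame above.
pairs = [
    ("Padma", "India"), ("Padma", "Ireland"),
    ("Arvind", "Ireland"), ("Arvind", "India"),
    ("Soumi", "India"), ("Soumi", "Germany"),
    ("Riya", "Japan"), ("Riya", "UK"), ("Riya", "USA"), ("Riya", "AUS"),
]

# Count how often each (name, country) combination occurs.
pair_counts = Counter(pairs)

# Distinct values become the row labels and the column labels.
names = sorted({name for name, _ in pairs})
countries = sorted({country for _, country in pairs})

# One row per name; a Counter returns 0 for combinations that never occur.
table = {name: [pair_counts[(name, c)] for c in countries] for name in names}

print("name_country", *countries)
for name in names:
    print(name, *table[name])
```

Each printed row should carry the same counts as the matching row of the crosstab output, although Spark does not guarantee any particular row order.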

In [4]:
# Example: Crosstab between country and name (swapped)
# Swap the columns to change the perspective: countries become rows and names become columns.

crosstab_country_name = df.crosstab("country", "name")

print("Crosstab between country and name: ")
crosstab_country_name.show(truncate=False)



Crosstab between country and name: 




+------------+------+-----+----+-----+
|country_name|Arvind|Padma|Riya|Soumi|
+------------+------+-----+----+-----+
|Germany     |0     |0    |0   |1    |
|AUS         |0     |0    |1   |0    |
|India       |1     |1    |0   |1    |
|Ireland     |1     |1    |0   |0    |
|USA         |0     |0    |1   |0    |
|UK          |0     |0    |1   |0    |
|Japan       |0     |0    |1   |0    |
+------------+------+-----+----+-----+
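
Swapping the arguments transposes the table: the count for (country, name) equals the count for (name, country). The pure-Python sketch below checks that on the same data. It is only an illustration; the names `by_name`, `by_country`, `is_transpose`, and `india_row` are hypothetical helpers, not part of the PySpark API.

```python
from collections import Counter

# The same (Name, Country) pairs as in the DataFrame above.
pairs = [
    ("Padma", "India"), ("Padma", "Ireland"),
    ("Arvind", "Ireland"), ("Arvind", "India"),
    ("Soumi", "India"), ("Soumi", "Germany"),
    ("Riya", "Japan"), ("Riya", "UK"), ("Riya", "USA"), ("Riya", "AUS"),
]

# Counts keyed by (name, country), then by (country, name).
by_name = Counter(pairs)
by_country = Counter((country, name) for name, country in pairs)

# Every cell of one table equals the mirrored cell of the other.
is_transpose = all(
    by_name[(name, country)] == by_country[(country, name)]
    for name, country in by_name
)

# Rebuild the "India" row of the swapped table, names in alphabetical order.
names = sorted({name for name, _ in pairs})
india_row = {name: by_country[("India", name)] for name in names}

print(is_transpose)
print(india_row)
```

The `india_row` values line up with the India row of the swapped crosstab above (Arvind, Padma, Riya, Soumi).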


