# Spark RAPIDS - Value Indexer

### Spark Session

In [1]:
spark

### RAPIDS Plugin Version Properties

In [2]:
spark._jvm.com.nvidia.spark.rapids.RapidsPluginUtils\
    .loadProps('rapids4spark-version-info.properties')

{'version': '22.06.0-SNAPSHOT', 'user': 'gshegalov', 'url': 'https://github.com/NVIDIA/spark-rapids.git', 'date': '2022-04-07T21:31:51Z', 'revision': '4a45c5dbefdc7e520d873ee9961fd42850418ce5', 'cudf_version': '22.06.0-SNAPSHOT', 'branch': 'branch-22.06'}

In [3]:
spark._jvm.com.nvidia.spark.rapids.RapidsPluginUtils\
    .loadProps('cudf-java-version-info.properties')

{'version': '22.06.0-SNAPSHOT', 'user': '', 'date': '2022-04-07T06:18:21Z', 'revision': 'acc42a849a5960079123bc2c76b8269f3d0733c9', 'branch': 'devtools-build-in-docker-for-native'}

In [41]:
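from pyspark.sql.functions import col, monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Log every plan node along with why it will or will not run on the GPU.
spark.conf.set('spark.rapids.sql.explain', 'ALL')
# Disable adaptive query execution to keep the physical plan static.
spark.conf.set('spark.sql.adaptive.enabled', False)
# Enable GPU acceleration of SQL operations.
spark.conf.set('spark.rapids.sql.enabled', True)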

## Test Data

In [42]:
df = spark.createDataFrame(
    [
        ['aaa'],
        ['a'],
        ['bb'],
        ['a'],
        ['aaa'],
    ],
    'c1 string'
)

df.createOrReplaceTempView('df')

In [43]:
df.show()

22/04/09 00:34:58 WARN GpuOverrides: 
*Exec <CollectLimitExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
    @Expression <AttributeReference> c1#111 could run on GPU



+---+
| c1|
+---+
|aaa|
|  a|
| bb|
|  a|
|aaa|
+---+
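The GpuOverrides warnings in this notebook prefix each plan node with a marker character. The sketch below tallies those markers so that a long explain dump can be summarized at a glance. The marker meanings are read off the message text in the logs here rather than taken from the plugin documentation, and `MARKERS` and `summarize_explain` are hypothetical names introduced only for this illustration.

```python
from collections import Counter

# Marker meanings inferred from the log messages in this notebook
# (an assumption, not an official list from the plugin).
MARKERS = {
    '*': 'will run on GPU',
    '!': 'cannot run on GPU',
    '@': 'could run on GPU',
    '#': 'could run on GPU but is going to be removed',
}


def summarize_explain(log_text):
    """Count plan nodes per leading marker in an explain dump."""
    counts = Counter()
    for line in log_text.splitlines():
        stripped = line.strip()
        # Only lines that start with a known marker are plan nodes.
        if stripped and stripped[0] in MARKERS:
            counts[stripped[0]] += 1
    return dict(counts)


# Plan lines copied from the df.show() warning above.
sample = """\
*Exec <CollectLimitExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
    @Expression <AttributeReference> c1#111 could run on GPU
"""

summary = summarize_explain(sample)
for marker, count in summary.items():
    print(f'{marker} ({MARKERS[marker]}): {count}')
```

For this plan, the only node that stays on the CPU is the `RDDScanExec` that feeds the in-memory test data.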



In [44]:
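# Assign each distinct value a 1-based index; the IDs generated after
# orderBy follow the sorted order of c1.
df1 = (
    df
    .distinct()
    .orderBy('c1')
    .withColumn(
        'idx',
        row_number().over(Window.orderBy(monotonically_increasing_id()))
    )
)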

In [62]:
# Qualify the join keys through the aliases, since df1 is derived from df.
df.alias('a').join(df1.alias('b'), col('a.c1') == col('b.c1')).selectExpr('a.c1', 'b.idx').show()

22/04/09 00:44:07 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/04/09 00:44:07 WARN GpuOverrides: 
*Exec <CollectLimitExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(idx#119 as string) AS idx#260 will run on GPU
      *Expression <Cast> cast(idx#119 as string) will run on GPU
    *Exec <SortMergeJoinExec> will run on GPU
      #Exec <SortExec> could run on GPU but is going to be removed because replacing sortMergeJoin with shuffleHashJoin
        #Expression <SortOrder> c1#111 ASC NULLS FIRST could run on GPU but is going to be removed because parent plan is removed
        *Exec <ShuffleExchangeExec> will run on GPU
          *Partitioning <HashPartitioning> will run on GPU
          *Exec <FilterExec> will run on GPU
            *Expression <IsNotNull> isnotnull(c1#111) will run on GPU
     

+---+---+
| c1|idx|
+---+---+
|aaa|  2|
|aaa|  2|
| bb|  3|
|  a|  1|
|  a|  1|
+---+---+
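As a cross-check on the joined result above, here is a minimal pure-Python sketch of what the value indexer computes: deduplicate, sort, number from 1, then look each original row up in that mapping. It is a reference for the expected values only, not the way Spark or the RAPIDS plugin executes the query, and `build_value_index` is a hypothetical helper name. It also assumes that Python's string ordering matches Spark's for these ASCII values.

```python
def build_value_index(values):
    """Map each distinct value to a 1-based index in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


# Same rows as the test DataFrame.
rows = ['aaa', 'a', 'bb', 'a', 'aaa']

index = build_value_index(rows)

# Equivalent of the join: attach the index to every original row.
indexed = [(value, index[value]) for value in rows]

for value, idx in indexed:
    print(value, idx)
```

The pairs agree with the table above (`a` maps to 1, `aaa` to 2, `bb` to 3). The row order differs because the Spark join does not preserve input order.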

