# RDD API

In this notebook you will solve one problem using the RDD API.

In [44]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

from pyspark.sql import Window

from pyspark.sql.types import StructField, LongType, StructType

import os

In [45]:
spark = (
    SparkSession
    .builder
    .appName('RDD API')
    .getOrCreate()
)

In [46]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

users_input_path = os.path.join(project_path, 'data/users')

In [47]:
usersDF = (
    spark
    .read
    .option('path', users_input_path)
    .load()
)

# Task

Convert usersDF to an RDD and compute a percentile for each user based on the value of reputation. The user with the highest reputation should have percentile 100. If two users have the same reputation, order them by user_id. Convert the result back to a DataFrame.

Hint:
* first sort the DataFrame by reputation and user_id
* then convert it to an RDD
* then use the zipWithIndex transformation, which gives you access to the global index of each record
  * after zipWithIndex you can apply a lambda function (for example with map)
  * note that the index starts from 0
* add the index as a value to each record
* convert the result to a DataFrame using spark.createDataFrame
* divide the index by the total count and multiply by 100 to get the percentile
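
The steps in the hint can be sketched without Spark on a small in-memory list. This is only an illustration: the sample rows are hypothetical (not taken from the users dataset), and Python's `enumerate` stands in for `zipWithIndex`, with `start=1` playing the role of shifting the 0-based index by one.

```python
# Hypothetical sample rows as (user_id, reputation); not taken from the dataset.
users = [(3, 50), (1, 10), (2, 50), (4, 7)]

# Sort by reputation, then by user_id to break ties,
# mirroring orderBy('reputation', 'user_id').
ordered = sorted(users, key=lambda u: (u[1], u[0]))

total = len(ordered)

# enumerate is a stand-in for zipWithIndex; start=1 shifts the 0-based index
# so that the last (highest-reputation) record gets percentile 100.
percentiles = [
    (user_id, reputation, index, index / total * 100)
    for index, (user_id, reputation) in enumerate(ordered, start=1)
]

for row in percentiles:
    print(row)
```

Users 2 and 3 share the same reputation here, so user 3 ends up last (and gets percentile 100) only because of the user_id tie-break.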

In [48]:
total_users = usersDF.count()

In [49]:
my_schema = StructType(
  [
    StructField('user_id', LongType()),
    StructField('reputation', LongType()),
    StructField('index', LongType())
  ]
)

In [50]:
indexedRDD = (
    usersDF
    .select('user_id', 'reputation')
    .orderBy('reputation', 'user_id')
    .rdd
    .zipWithIndex()
    # each element is a (Row, index) pair; shift the 0-based index to start at 1
    .map(lambda x: (x[0]['user_id'], x[0]['reputation'], x[1] + 1))
)

In [51]:
result = spark.createDataFrame(indexedRDD, my_schema)

In [52]:
(
    result
    .withColumn('percentile', (col('index') / (total_users)) * 100)
    .orderBy(desc('index'))
).show(n=5)

+-------+----------+------+-----------------+
|user_id|reputation| index|       percentile|
+-------+----------+------+-----------------+
|   1325|    268228|153439|            100.0|
|   1492|    154965|153438|99.99934827521034|
|   1236|    152163|153437|99.99869655042069|
|  26969|    105646|153436|99.99804482563103|
|   2451|    100020|153435|99.99739310084138|
+-------+----------+------+-----------------+
only showing top 5 rows
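
As a quick sanity check of the formula, the first two rows of the output above can be reproduced with plain arithmetic. This sketch assumes the total user count is 153439, which is inferred from the row whose index 153439 maps to percentile 100.0; the helper name `percentile` is introduced here only for illustration.

```python
# Assumption: total inferred from the output, where index 153439 gives 100.0.
total_users = 153439


def percentile(index, total):
    """Same arithmetic as the percentile column: (index / total) * 100."""
    return index / total * 100


top = percentile(153439, total_users)
second = percentile(153438, total_users)

print(top)
print(second)
```

Because the index was shifted to start at 1, the lowest-ranked user gets a percentile of `1 / total_users * 100` rather than 0.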



In [54]:
spark.stop()