# Task

In this notebook you will answer a simple analytical question. As you will see, getting the correct answer can be tricky.

In [None]:
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, LongType, StringType

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Debugging-I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

usersP_output_path = os.path.join(project_path, 'data/usersP')

1. Read the data that represents users from the path `usersP_output_path`.
 * The dataset is partitioned by `last2_id`, which holds the last two digits of `user_id`. For example, a user with `user_id = 540` is in the partition `40`.
2. Find out how many users there are in the partitions `02` and `03` combined.
 * Try several different ways to retrieve the result.
3. Can you see the problem? If so, can you explain what happened and determine the correct approach?
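A minimal sketch of how the partition label relates to `user_id`, assuming the label is simply the last two characters of the id written as text. The helper name `last2_label` is hypothetical and is not part of the notebook's data pipeline.

```python
def last2_label(user_id: int) -> str:
    """Return the last two digits of user_id as text (hypothetical helper)."""
    return str(user_id)[-2:]


print(last2_label(540))   # 40
print(last2_label(1302))  # 02
```

Note that the label is text: the leading zero in `02` exists only as long as the value is treated as a string.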

In [None]:
users = spark.read.parquet(usersP_output_path)

In [None]:
# filter the users using the isin function:

(
  users
  .filter(col('last2_id').isin(['02', '03']))
).count()

In [None]:
# filter the users using `==` and `|` operators

(
  users
  .filter((col('last2_id') == '02') | (col('last2_id') == '03'))
).count()

In [None]:
# read the data directly from the path /last2_id=02 and from the path /last2_id=03

spark.read.parquet(
    usersP_output_path + '/last2_id=02'
).count()

In [None]:
spark.read.parquet(
    usersP_output_path + '/last2_id=03'
).count()
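Before comparing the counts, it can help to look at the partition directories that actually exist on disk. The sketch below assumes the dataset sits on a local filesystem and uses the `column=value` directory layout seen in the paths above. `list_partition_values` is a hypothetical helper, not a Spark API.

```python
import os


def list_partition_values(dataset_path, column="last2_id"):
    """Return the raw partition values found in the directory names (hypothetical helper)."""
    prefix = column + "="
    return sorted(
        name[len(prefix):]
        for name in os.listdir(dataset_path)
        # keep only partition directories, skip marker files such as _SUCCESS
        if name.startswith(prefix) and os.path.isdir(os.path.join(dataset_path, name))
    )


# Example call with the path defined in this notebook:
# list_partition_values(usersP_output_path)
```

The values come back as strings exactly as they are written on disk, so any leading zeros stay visible.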

* The problem is that Spark infers `last2_id` as a numeric type, so the partition values `02` and `03` are read as 2 and 3. The `isin` filter compares them with the strings `'02'` and `'03'`, so it does not find any rows.

* With the `==` operator Spark casts the right side to the numeric type, so `'02'` becomes 2 and both partitions `2` and `02` are taken into account (there is also a user with `user_id = 2`, which falls in the partition `2`).

* The solution is either to read the data with an explicit schema that declares `last2_id` as a string, or to disable type inference for partition columns by setting `spark.sql.sources.partitionColumnTypeInference.enabled` to `false`.
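The two behaviours described above can be mimicked in plain Python. This is only a toy model under stated assumptions, not Spark's implementation: partition values are parsed as integers, `isin` is modelled as a string comparison, `==` is modelled as a numeric comparison, and the directory labels are made up for illustration.

```python
# Hypothetical partition directory labels, including the short label "2".
partition_dirs = ["02", "2", "03", "3", "40"]

# Type inference: every label is parsed as a number, so "02" and "2" both become 2.
inferred = [int(label) for label in partition_dirs]

# isin-style comparison: the numeric value is compared as text with '02' and '03'.
isin_matches = [d for d, v in zip(partition_dirs, inferred) if str(v) in {"02", "03"}]

# ==-style comparison: the string literals are cast to numbers first.
eq_matches = [d for d, v in zip(partition_dirs, inferred) if v in {int("02"), int("03")}]

print(isin_matches)  # []
print(eq_matches)    # ['02', '2', '03', '3']
```

In this model the first filter matches nothing and the second matches too much, which is the pattern the counts in the notebook show.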

In [None]:
# check the schema of users

users.printSchema()

In [None]:
# Define a new schema where the last2_id is string

users_schema = StructType([
    StructField('user_id', LongType()),
    StructField('display_name', StringType()),
    StructField('about', StringType()),
    StructField('location', StringType()),
    StructField('downvotes', LongType()),
    StructField('upvotes', LongType()),
    StructField('reputation', LongType()),
    StructField('views', LongType()),
    StructField('last2_id', StringType())
])

In [None]:
# read the data again with the schema

users = spark.read.schema(users_schema).parquet(usersP_output_path)

In [None]:
# now use the isin function

(
  users
  .filter(col('last2_id').isin(['02', '03']))
).count()

In [None]:
# now also try the conf setting that disables type inference for partition columns

spark.conf.set('spark.sql.sources.partitionColumnTypeInference.enabled', False)

In [None]:
# after you disable the setting, read the data again and retry the filters with isin and with ==
users = spark.read.parquet(usersP_output_path)

(
  users
  .filter(col('last2_id').isin(['02', '03']))
).count()

In [None]:
(
  users
  .filter((col('last2_id') == '02') | (col('last2_id') == '03'))
).count()

In [None]:
spark.stop()