<a href="https://colab.research.google.com/github/eder1985/pismo_recruiting_technical_case/blob/main/work/notebooks/Colab_Pismo_Recruiting_Technical_Case.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>Pismo Recruiting Technical Case</center></h1>

---



## Objective
This notebook aims to:
><li>Give a solid understanding of the different PySpark functions available.</li>
><li>Give a short introduction to Google Colab, the platform on which this notebook is written.</li>

Once you complete this notebook, you should be able to write PySpark programs efficiently. The best way to use it is to read through the examples and then try them on Colab. At the end there are a few hands-on questions that you can use to evaluate yourself.

## 1. Pre-requisites

### Installing Spark

Install Dependencies:


1.   Java 8
2.   Apache Spark with Hadoop
3.   findspark (used to locate the Spark installation on the system)


In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

Set Environment Variables:

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [3]:
!ls

spark-3.1.1-bin-hadoop3.2      spark-3.1.1-bin-hadoop3.2.tgz.1
spark-3.1.1-bin-hadoop3.2.tgz  work


In [4]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Render a DataFrame as a table when a cell evaluates to one
spark

## 2. Generate Fake Data

### Installing libs

In [5]:
!pip install -q faker

### Imports

In [6]:
from faker import Faker
from faker.providers import BaseProvider
from datetime import datetime
from json import dumps
import pandas as pd
import random
import collections
import glob
import os

### Generating fake `event_id`: random UUIDs

In [7]:
fake = Faker()
Faker.seed(random.randrange(0, 99999999999999999999, 1))
fake_event_id = fake.uuid4()
print(fake_event_id)

2f6f1dbb-e7fd-4e89-ad09-8c63a74d1337


### Generating fake `timestamp`: random timestamps within the last 3 years

In [8]:
fake_timestamp = datetime.strftime(fake.date_time_between(start_date='-3y', end_date='now'),"%Y-%m-%dT%H:%M:%S")
print(fake_timestamp)

2022-12-14T07:19:07


### Generating fake `domain`: random values from a list of valid domain names

In [9]:
class ProjectDomainProvider(BaseProvider):
    def project_domain_name(self):
        list_project_domain_names = ['account','transaction']
        return random.choice(list_project_domain_names)

fake.add_provider(ProjectDomainProvider)

fake_project_domain_name = fake.project_domain_name()
print(fake_project_domain_name)

transaction


### Generating fake `status`: random values from a fixed list

In [10]:
class StatusTypeProvider(BaseProvider):
    def status_type(self):
        list_status_types = ['ACTIVE','INACTIVE','SUSPENDED','BLOCKED', 'DELETED']
        return random.choice(list_status_types)

fake.add_provider(StatusTypeProvider)

fake_status_type = fake.status_type()
print(fake_status_type)

DELETED


### Generating custom fake `uuid`: random values from a fixed list of two UUIDs

In [11]:
class CustomUUIDProvider(BaseProvider):
    def custom_uuid(self):
        list_uuids = [
            '1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a',
            '2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b'
            ]
        return random.choice(list_uuids)

### Defining `write_fake_data` and `read_fake_data` functions

In [12]:
def write_fake_data(fake, length, destination_path, unique_uuid = True):
    """Write `length` fake events to a timestamped JSON file under destination_path."""
    database = []
    current_time = datetime.now().strftime("%Y%m%d%H%M%S")
    filename = 'fake_events_' + current_time

    for _ in range(length):
        # unique_uuid=False draws from the fixed list in CustomUUIDProvider, so event_ids repeat
        event_id = fake.uuid4() if unique_uuid else fake.custom_uuid()
        project_domain_name = fake.project_domain_name()
        event_type = project_domain_name + "-status-change"

        database.append(collections.OrderedDict([
            ('event_id', event_id),
            ('timestamp', datetime.strftime(fake.date_time_between(start_date='-3y', end_date='now'),"%Y-%m-%dT%H:%M:%S")),
            ('domain', project_domain_name),
            ('event_type', event_type),
            ('data', collections.OrderedDict([
                ('id', fake.random_number(digits=6)),
                ('old_status', fake.status_type()),
                ('new_status', fake.status_type()),
                ('reason', fake.sentence(nb_words=5))
            ]))
        ]))

    # Create the destination folder if it does not exist yet
    os.makedirs(destination_path, exist_ok=True)
    with open(os.path.join(destination_path, filename + '.json'), 'w') as output:
        output.write(dumps(database, indent=4, sort_keys=False, default=str))

    print("Done.")

def read_fake_data(json_filepath):
    """Load every JSON file matching the glob pattern into one pandas DataFrame."""
    json_files = [os.path.normpath(i) for i in glob.glob(json_filepath)]
    df = pd.concat([pd.read_json(f) for f in json_files])
    return df

### Writing and reading fake data

In [29]:
def run(length, unique_uuid = True):
    fake = Faker()
    Faker.seed(random.randrange(0, 99999999999999999999, 1))
    fake.add_provider(ProjectDomainProvider)
    fake.add_provider(StatusTypeProvider)
    fake.add_provider(CustomUUIDProvider)

    destination_path = 'work/data/raw/events/'
    write_fake_data(fake, length, destination_path,unique_uuid)

    json_filepath = destination_path+'*.json'
    fake_data = read_fake_data(json_filepath)
    print(fake_data)

In [49]:
run(1000)

Done.
                                 event_id           timestamp       domain  \
0    5d77f0a4-5af2-4587-a04d-5e50d9eb4d20 2023-01-23 17:43:37  transaction   
1    2ebdfa9c-d092-4800-a212-aefdf67902ff 2022-03-04 16:00:23      account   
2    27830434-a676-40f6-813a-6e7fd0013700 2021-04-09 03:09:48  transaction   
3    2d6cbae0-86b5-4627-9197-87052e1ab292 2023-05-30 05:10:22      account   
4    39d6d799-c4ff-4743-b8cd-42efe2983f41 2021-09-08 13:52:59  transaction   
..                                    ...                 ...          ...   
995  59fa4d5a-e9a6-4739-9227-ceafdef68878 2020-08-08 14:48:57  transaction   
996  5e019632-9a37-4e30-a860-d1c08cc64bea 2020-12-03 04:27:24  transaction   
997  72e294c1-789d-4c36-8034-8ecd62530fd1 2022-03-30 20:31:34  transaction   
998  b0c3ada9-9334-40a0-9ef3-fef33da40f6a 2023-07-20 08:33:16      account   
999  39c78a50-71a3-43cd-8526-8bbf378ef671 2023-01-02 08:11:42      account   

                    event_type  \
0    transaction-status

In [50]:
run(10,unique_uuid = False)

Done.
                                event_id           timestamp       domain  \
0   5d77f0a4-5af2-4587-a04d-5e50d9eb4d20 2023-01-23 17:43:37  transaction   
1   2ebdfa9c-d092-4800-a212-aefdf67902ff 2022-03-04 16:00:23      account   
2   27830434-a676-40f6-813a-6e7fd0013700 2021-04-09 03:09:48  transaction   
3   2d6cbae0-86b5-4627-9197-87052e1ab292 2023-05-30 05:10:22      account   
4   39d6d799-c4ff-4743-b8cd-42efe2983f41 2021-09-08 13:52:59  transaction   
..                                   ...                 ...          ...   
5   1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a 2023-05-08 03:28:33      account   
6   2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b 2021-04-22 23:37:48      account   
7   1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a 2023-01-14 19:52:23  transaction   
8   2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b 2022-09-02 10:54:16  transaction   
9   1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a 2020-08-14 07:39:34      account   

                   event_type  \
0   transaction-status-change   
1  
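
Because `run(10, unique_uuid = False)` draws `event_id` from only two fixed UUIDs, the raw folder now holds repeated ids. As a rough sanity check before loading the files into Spark, the sketch below counts repeated `(event_id, event_type)` pairs using only the standard library. The helper name `count_repeated_events` is hypothetical, and it assumes each file holds one JSON array of event objects, as `write_fake_data` produces.

```python
import glob
import json
from collections import Counter


def count_repeated_events(json_glob):
    """Return a Counter of (event_id, event_type) pairs that occur more than once."""
    seen = Counter()
    for path in glob.glob(json_glob):
        with open(path) as handle:
            # Each file is assumed to hold one JSON array of event objects.
            for event in json.load(handle):
                seen[(event["event_id"], event["event_type"])] += 1
    # Keep only the pairs that appear at least twice.
    return Counter({key: n for key, n in seen.items() if n > 1})
```

It could be called as `count_repeated_events('work/data/raw/events/*.json')`.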

## 3. Exploring the Raw Dataset

### Loading the Dataset

In [51]:
!ls work/data/raw/events/ -la

total 388
drwxr-xr-x 3 root root   4096 Jul 26 00:16 .
drwxr-xr-x 4 root root   4096 Jul 25 22:17 ..
-rw-r--r-- 1 root root 380412 Jul 26 00:16 fake_events_20230726001616.json
-rw-r--r-- 1 root root   3828 Jul 26 00:16 fake_events_20230726001625.json
drwxr-xr-x 2 root root   4096 Jul 25 23:58 .ipynb_checkpoints


In [52]:
raw_events = spark.read.option("multiline","true").json('work/data/raw/events/')
raw_events.show(5, truncate = False)

+-----------------------------------------------------------------------------------+-----------+------------------------------------+-------------------------+-------------------+
|data                                                                               |domain     |event_id                            |event_type               |timestamp          |
+-----------------------------------------------------------------------------------+-----------+------------------------------------+-------------------------+-------------------+
|{421147, INACTIVE, SUSPENDED, Adult hair work yes.}                                |transaction|5d77f0a4-5af2-4587-a04d-5e50d9eb4d20|transaction-status-change|2023-01-23T17:43:37|
|{98864, ACTIVE, INACTIVE, Arrive arrive professor public sister.}                  |account    |2ebdfa9c-d092-4800-a212-aefdf67902ff|account-status-change    |2022-03-04T16:00:23|
|{971368, SUSPENDED, INACTIVE, Child imagine assume reason appear production heart.}|transactio

In [53]:
raw_events.count()

1010

### Dataframe Raw Schema

In [54]:
raw_events.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- new_status: string (nullable = true)
 |    |-- old_status: string (nullable = true)
 |    |-- reason: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- timestamp: string (nullable = true)



## 4. Applying transformations

### Column transformations

In [55]:
partial_events = raw_events\
  .withColumn("timestamp",to_timestamp("timestamp"))\
  .withColumn("day",to_date("timestamp"))

In [56]:
partial_events.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- new_status: string (nullable = true)
 |    |-- old_status: string (nullable = true)
 |    |-- reason: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- day: date (nullable = true)
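
To make the two derived columns concrete, here is a plain-Python sketch of what they hold for a single raw value. It assumes the raw strings follow the `%Y-%m-%dT%H:%M:%S` pattern used by the generator in section 2. The helper name `split_timestamp` is hypothetical and is not part of the Spark job.

```python
from datetime import datetime


def split_timestamp(raw):
    """Return (timestamp, day) for a raw string such as '2023-01-23T17:43:37'."""
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")
    # `day` keeps only the calendar date, mirroring what to_date does to a timestamp.
    return parsed, parsed.date()
```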



### Drop duplicate events

In [57]:
from pyspark.sql.functions import countDistinct

# Count the distinct (event_id, event_type) pairs, i.e. the number of events expected after deduplication
partial_events.select(countDistinct("event_id", "event_type").alias("distinct_events")).show()


+---------------+
|distinct_events|
+---------------+
|           1004|
+---------------+



In [58]:
# `max` here is pyspark.sql.functions.max (from the wildcard import), not the Python built-in
grouped_events = partial_events \
  .groupBy(col("event_id"), col("event_type")) \
  .agg(max(col("timestamp")))

grouped_events.show(truncate = False)

+------------------------------------+-------------------------+-------------------+
|event_id                            |event_type               |max(timestamp)     |
+------------------------------------+-------------------------+-------------------+
|b750ebce-ae02-4727-b13b-333534a1220d|transaction-status-change|2021-08-24 17:09:11|
|4b1574ad-774a-4177-ae06-4a4946c3f340|transaction-status-change|2021-08-17 23:56:19|
|417e7f61-66ca-45b9-82b8-0aba181d9e2a|transaction-status-change|2020-10-10 15:35:50|
|1238dd32-2f13-43b9-b42c-b2e951829de2|account-status-change    |2020-10-12 21:00:03|
|ee238ce4-038f-4ee9-94ac-fbdbec466342|account-status-change    |2020-08-15 21:11:33|
|ea3f11c0-3d43-4669-bfe2-f8295e656b9f|transaction-status-change|2023-05-28 09:54:49|
|f5c3b540-9cdc-451d-aad5-2de6114a5ba4|transaction-status-change|2020-10-11 23:15:38|
|689eef58-0e19-4916-912f-84b54be65225|account-status-change    |2023-03-25 17:36:15|
|c5f17343-cbda-4ece-b27f-361da6835e40|transaction-status-change|2

In [59]:
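# Join on the latest timestamp as well, so that data, domain and day come from the
# same row as the kept timestamp instead of from an arbitrary duplicate
final_events = grouped_events \
    .withColumnRenamed("max(timestamp)", "timestamp") \
    .join(partial_events, ["event_id","event_type","timestamp"]) \
    .dropDuplicates(["event_id","event_type"])
final_events.show()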

+--------------------+--------------------+-------------------+--------------------+-----------+----------+
|            event_id|          event_type|          timestamp|                data|     domain|       day|
+--------------------+--------------------+-------------------+--------------------+-----------+----------+
|1238dd32-2f13-43b...|account-status-ch...|2020-10-12 21:00:03|{45209, INACTIVE,...|    account|2020-10-12|
|417e7f61-66ca-45b...|transaction-statu...|2020-10-10 15:35:50|{464860, BLOCKED,...|transaction|2020-10-10|
|4b1574ad-774a-417...|transaction-statu...|2021-08-17 23:56:19|{401625, INACTIVE...|transaction|2021-08-17|
|b750ebce-ae02-472...|transaction-statu...|2021-08-24 17:09:11|{509826, BLOCKED,...|transaction|2021-08-24|
|ea3f11c0-3d43-466...|transaction-statu...|2023-05-28 09:54:49|{124172, INACTIVE...|transaction|2023-05-28|
|ee238ce4-038f-4ee...|account-status-ch...|2020-08-15 21:11:33|{273015, DELETED,...|    account|2020-08-15|
|1eae4f2b-eec2-43d...|accoun

In [60]:
final_events.count()

1004
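
The rule applied in this section is one row per `(event_id, event_type)`, keeping the most recent event. The plain-Python sketch below spells out that rule on a list of dicts, taking every field from the row that carries the latest timestamp so that fields such as `day` and `data` stay consistent with it. The function name `keep_latest` and its parameters are hypothetical. In Spark the same rule is often written with `row_number()` over a window partitioned by the key and ordered by timestamp descending.

```python
def keep_latest(events, keys=("event_id", "event_type"), order_by="timestamp"):
    """Keep, for each key, the whole row with the greatest `order_by` value."""
    latest = {}
    for event in events:
        key = tuple(event[k] for k in keys)
        # Same-format ISO-8601 strings sort chronologically, so plain comparison works.
        if key not in latest or event[order_by] > latest[key][order_by]:
            latest[key] = event
    return list(latest.values())
```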

## 5. Write transformed data in parquet format

In [61]:
final_events \
  .write \
  .partitionBy("event_type", "day") \
  .mode("append") \
  .parquet("work/data/trusted/events/")
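
`partitionBy("event_type", "day")` makes Spark write one folder level per partition column, using Hive-style `column=value` directory names. The sketch below only illustrates that layout in plain Python. The helper `partition_dir` is hypothetical and performs no I/O.

```python
import posixpath


def partition_dir(base_path, event_type, day):
    """Build the Hive-style partition folder for one (event_type, day) pair."""
    return posixpath.join(base_path, f"event_type={event_type}", f"day={day}")
```

For example, `partition_dir("work/data/trusted/events/", "account-status-change", "2023-07-01")` gives `work/data/trusted/events/event_type=account-status-change/day=2023-07-01`. Because the partition values live in the folder names rather than in the Parquet files, Spark places `event_type` and `day` as the last columns when the data is read back.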

## 6. Read transformed data in parquet format

In [62]:
trusted_events = spark.read.parquet('work/data/trusted/events/')
trusted_events.show()

+--------------------+-------------------+--------------------+-----------+--------------------+----------+
|            event_id|          timestamp|                data|     domain|          event_type|       day|
+--------------------+-------------------+--------------------+-----------+--------------------+----------+
|ebc0a28c-8eb2-406...|2021-04-03 20:00:44|{248608, BLOCKED,...|transaction|transaction-statu...|2021-04-03|
|27830434-a676-40f...|2021-04-09 03:09:48|{971368, SUSPENDE...|transaction|transaction-statu...|2021-04-09|
|a528240a-d99b-487...|2022-05-20 20:58:18|{750886, SUSPENDE...|transaction|transaction-statu...|2022-05-20|
|b05fd65e-8527-46d...|2023-07-01 19:33:00|{8488, DELETED, I...|    account|account-status-ch...|2023-07-01|
|6f369047-1fc5-43b...|2021-04-30 21:11:22|{782949, SUSPENDE...|transaction|transaction-statu...|2021-04-30|
|007c357e-102a-4d2...|2021-01-25 19:58:21|{732774, SUSPENDE...|transaction|transaction-statu...|2021-01-25|
|ece9319d-4ab7-493...|2022-1

In [63]:
trusted_events.count()

2008
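
The processed folder listed in section 7 contains files from an earlier run, so 2008 most likely reflects two batches appended to the same folder. The deduplication in section 4 works within one batch only, so an `event_id` that recurs across batches (such as the two fixed UUIDs) can still appear more than once here. One possible guard, sketched under assumptions: anti-join the new batch against the keys already stored and append only the unseen rows. The function name `append_only_new_events` is hypothetical. It keeps the first stored version of an event rather than the latest, and it assumes the trusted folder already exists and has been read into `trusted_events`.

```python
KEYS = ["event_id", "event_type"]


def append_only_new_events(new_events, trusted_events, path):
    """Append the rows of new_events whose key is not yet present in trusted_events."""
    # left_anti keeps only the rows of new_events that have no match on KEYS.
    unseen = new_events.join(trusted_events.select(*KEYS), KEYS, "left_anti")
    unseen.write.partitionBy("event_type", "day").mode("append").parquet(path)
    return unseen
```

In this notebook it would be called as `append_only_new_events(final_events, trusted_events, "work/data/trusted/events/")` in place of the plain append.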

## 7. Move raw data to processed data folder

In [64]:
!mv work/data/raw/events/* work/data/processed/events/
!ls work/data/processed/events/ -la

total 760
drwxr-xr-x 2 root root   4096 Jul 26 00:18 .
drwxr-xr-x 4 root root   4096 Jul 26 00:13 ..
-rw-r--r-- 1 root root 379837 Jul 26 00:03 fake_events_20230726000304.json
-rw-r--r-- 1 root root   3800 Jul 26 00:03 fake_events_20230726000330.json
-rw-r--r-- 1 root root 380412 Jul 26 00:16 fake_events_20230726001616.json
-rw-r--r-- 1 root root   3828 Jul 26 00:16 fake_events_20230726001625.json
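
The shell `mv` above relies on the destination folder already existing. A Python equivalent, sketched with the standard library only, creates the folder if needed and moves just the JSON files. `move_processed_files` and its `pattern` parameter are hypothetical names.

```python
import glob
import os
import shutil


def move_processed_files(source_dir, target_dir, pattern="*.json"):
    """Move files matching `pattern` from source_dir to target_dir; return the moved names."""
    os.makedirs(target_dir, exist_ok=True)  # avoid failing when the folder is missing
    moved = []
    for path in sorted(glob.glob(os.path.join(source_dir, pattern))):
        name = os.path.basename(path)
        shutil.move(path, os.path.join(target_dir, name))
        moved.append(name)
    return moved
```

It could be called as `move_processed_files('work/data/raw/events/', 'work/data/processed/events/')`.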
