<a href="https://colab.research.google.com/github/eder1985/pismo_recruiting_technical_case/blob/main/work/notebooks/Colab_Pismo_Recruiting_Technical_Case.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>Pismo Recruiting Technical Case</center></h1>

---



## Objective
The objective of this notebook is to:
><li>Give a working understanding of the PySpark functions available.</li>
><li>Give a short introduction to Google Colab, the platform on which this notebook is written.</li>

Once you complete this notebook, you should be able to write PySpark programs efficiently. The ideal way to use it is to go through the examples and then try them on Colab. At the end there are a few hands-on questions you can use to evaluate yourself.

## 1. Pre-requisites

### Installing Spark

Install Dependencies:


1.   Java 8
2.   Apache Spark with Hadoop
3.   Findspark (used to locate Spark on the system)


In [24]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

Set Environment Variables:

In [25]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"
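
A wrong path in either variable tends to surface only later, when Spark is initialised, so it can help to check both directories right away. The sketch below is one way to do that; `missing_env_dirs` is a hypothetical helper written for this illustration, not part of findspark or Spark.

```python
import os

def missing_env_dirs(names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables that are unset or do not point to an existing directory."""
    # Hypothetical helper, for illustration only.
    return [name for name in names if not os.path.isdir(os.environ.get(name, ""))]

problems = missing_env_dirs()
if problems:
    print("Check these environment variables:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```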

In [None]:
!ls

sample_data  spark-3.1.1-bin-hadoop3.2	spark-3.1.1-bin-hadoop3.2.tgz


In [52]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark

## 2. Generate Fake Data

### Installing libs

In [None]:
!pip install -q faker


### Imports

In [None]:
from faker import Faker
from faker.providers import BaseProvider
from datetime import datetime
from json import dumps
import pandas as pd
import random
import collections
import glob
import os

### Generating fake `event_id`: random UUIDs

In [None]:
fake = Faker()
Faker.seed(random.randrange(0, 99999999999999999999, 1))
fake_event_id = fake.uuid4()
print(fake_event_id)

8dfb2a88-0b16-4681-ab20-27edfe0be021


### Generating fake `timestamp`: random timestamps within the last 3 years

In [None]:
fake_timestamp = datetime.strftime(fake.date_time_between(start_date='-3y', end_date='now'),"%Y-%m-%dT%H:%M:%S")
print(fake_timestamp)

2022-04-15T05:54:52
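
The format string `%Y-%m-%dT%H:%M:%S` yields an ISO 8601 timestamp with second precision and no timezone. A minimal stdlib-only sketch, using a fixed `datetime` in place of Faker's random one, shows that the string round-trips through `strptime` and matches `isoformat(timespec="seconds")`:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

# Fixed value taken from the sample output above, standing in for Faker's random datetime.
original = datetime(2022, 4, 15, 5, 54, 52)

as_text = original.strftime(FMT)          # datetime -> string
parsed = datetime.strptime(as_text, FMT)  # string -> datetime

print(as_text)                                             # 2022-04-15T05:54:52
print(parsed == original)                                  # True
print(as_text == original.isoformat(timespec="seconds"))   # True
```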


### Generating fake `domain`: random values from a list of valid domain names

In [None]:
class ProjectDomainProvider(BaseProvider):
    def project_domain_name(self):
        list_project_domain_names = ['account','transaction']
        return random.choice(list_project_domain_names)

fake.add_provider(ProjectDomainProvider)

fake_project_domain_name = fake.project_domain_name()
print(fake_project_domain_name)

account


### Generating fake `status`: random values based on list

In [None]:
class StatusTypeProvider(BaseProvider):
    def status_type(self):
        list_status_types = ['ACTIVE','INACTIVE','SUSPENDED','BLOCKED', 'DELETED']
        return random.choice(list_status_types)

fake.add_provider(StatusTypeProvider)

fake_status_type = fake.status_type()
print(fake_status_type)

SUSPENDED


### Generating custom fake `uuid`: random values from a fixed two-item list (used to produce repeated `event_id` values)

In [46]:
class CustomUUIDProvider(BaseProvider):
    def custom_uuid(self):
        list_uuids = [
            '1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a',
            '2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b'
            ]
        return random.choice(list_uuids)
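
Because `custom_uuid` draws from only two values, any batch of more than two events necessarily contains repeated IDs. A small sketch with the standard library's `random.choice` (no Faker required) illustrates this; the batch size of 10 mirrors the `length` used later in this notebook.

```python
import random
from collections import Counter

list_uuids = [
    "1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a",
    "2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b",
]

# Draw 10 IDs the same way custom_uuid() does.
drawn = [random.choice(list_uuids) for _ in range(10)]
counts = Counter(drawn)

print(counts)
# With 10 draws over 2 possible values, at least one ID appears 5 or more times.
print(max(counts.values()) >= 5)
```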

### Defining `write_fake_data` and `read_fake_data` functions

In [47]:
def write_fake_data(fake, length, destination_path, unique_uuid = True):
    """Write `length` fake events to a timestamped JSON file under `destination_path`."""

    database = []
    current_time = datetime.now().strftime("%Y%m%d%H%M%S")
    filename = 'fake_events_' + current_time

    for _ in range(length):
        # Unique IDs by default; otherwise draw from the two fixed UUIDs to force repeats
        event_id = fake.uuid4() if unique_uuid else fake.custom_uuid()
        project_domain_name = fake.project_domain_name()
        event_type = project_domain_name + "-status-change"

        database.append(collections.OrderedDict([
            ('event_id', event_id),
            ('timestamp', datetime.strftime(fake.date_time_between(start_date='-3y', end_date='now'),"%Y-%m-%dT%H:%M:%S")),
            ('domain', project_domain_name),
            ('event_type', event_type),
            ('data', collections.OrderedDict([
                ('id', fake.random_number(digits=6)),
                ('old_status', fake.status_type()),
                ('new_status', fake.status_type()),
                ('reason', fake.sentence(nb_words=5))
            ]))
        ]))

    # Create the destination directory if it does not exist yet
    os.makedirs(destination_path, exist_ok=True)

    # os.path.join works whether or not destination_path ends with a slash
    with open(os.path.join(destination_path, filename + '.json'), 'w') as output:
        output.write(dumps(database, indent=4, sort_keys=False, default=str))

    print("Done.")

def read_fake_data(json_filepath):
    json_files = [os.path.normpath(i) for i in glob.glob(json_filepath)]
    df = pd.concat([pd.read_json(f) for f in json_files])
    return df
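
The write-then-glob pattern used by these two functions can be tried in isolation. The following sketch uses only the standard library (`json` in place of pandas, a temporary directory in place of `work/data/raw/events/`) and two hand-written records shaped like the generated events; the record values are illustrative, not real output.

```python
import glob
import json
import os
import tempfile

# Two illustrative records with the same shape as the generated events.
records = [
    {"event_id": "1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a",
     "timestamp": "2023-01-10T00:26:49",
     "domain": "account",
     "event_type": "account-status-change",
     "data": {"id": 123456, "old_status": "ACTIVE",
              "new_status": "BLOCKED", "reason": "Example reason."}},
    {"event_id": "2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b",
     "timestamp": "2022-08-03T21:30:26",
     "domain": "transaction",
     "event_type": "transaction-status-change",
     "data": {"id": 654321, "old_status": "SUSPENDED",
              "new_status": "ACTIVE", "reason": "Another example."}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # One JSON array per file, as write_fake_data produces.
    for i, record in enumerate(records):
        with open(os.path.join(tmp_dir, f"fake_events_{i}.json"), "w") as output:
            json.dump([record], output, indent=4)

    # Read every matching file back, as read_fake_data does with glob.
    loaded = []
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*.json"))):
        with open(path) as source:
            loaded.extend(json.load(source))

print(len(loaded))
print([r["event_type"] for r in loaded])
```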

### Writing and reading fake data

In [67]:
def run(unique_uuid = True):
    fake = Faker()
    Faker.seed(random.randrange(0, 99999999999999999999, 1))
    fake.add_provider(ProjectDomainProvider)
    fake.add_provider(StatusTypeProvider)
    fake.add_provider(CustomUUIDProvider)

    length = 10
    destination_path = 'work/data/raw/events/'
    write_fake_data(fake, length, destination_path,unique_uuid)

    json_filepath = destination_path+'*.json'
    fake_data = read_fake_data(json_filepath)
    print(fake_data)

In [68]:
run()

Done.
                               event_id           timestamp       domain  \
0  0f74cba8-2eb8-403c-9547-9d0c3ddc3f8f 2023-01-10 00:26:49  transaction   
1  5a23b2fe-7319-4f1d-9737-7d35920ca6e1 2021-12-08 15:34:10      account   
2  71569876-00ff-4305-863f-e5a6434750a9 2022-10-15 11:09:04      account   
3  8f4a0141-6211-45a1-bd2f-49135bbb0fa7 2021-07-29 20:09:22      account   
4  47fd49c0-6b8b-4eb8-9431-93b41a6fa306 2022-03-31 15:42:51      account   
5  5762c582-9724-43e6-9425-b9aecae58f71 2020-12-24 08:57:22  transaction   
6  c7591dca-eef6-494f-8f87-f78bc805886f 2021-10-15 13:42:11  transaction   
7  00422dc9-cdaf-401c-b03a-b791571f03c2 2023-03-25 13:21:06  transaction   
8  9ee5f5d7-d4b7-4f3b-a850-dc8fb0b8718e 2023-03-16 16:41:46      account   
9  5ae60398-6b79-44a9-843e-1fbfdddb3606 2022-08-03 21:30:26      account   

                  event_type  \
0  transaction-status-change   
1      account-status-change   
2      account-status-change   
3      account-status-change 

In [69]:
run(unique_uuid = False)

Done.
                               event_id           timestamp       domain  \
0  0f74cba8-2eb8-403c-9547-9d0c3ddc3f8f 2023-01-10 00:26:49  transaction   
1  5a23b2fe-7319-4f1d-9737-7d35920ca6e1 2021-12-08 15:34:10      account   
2  71569876-00ff-4305-863f-e5a6434750a9 2022-10-15 11:09:04      account   
3  8f4a0141-6211-45a1-bd2f-49135bbb0fa7 2021-07-29 20:09:22      account   
4  47fd49c0-6b8b-4eb8-9431-93b41a6fa306 2022-03-31 15:42:51      account   
5  5762c582-9724-43e6-9425-b9aecae58f71 2020-12-24 08:57:22  transaction   
6  c7591dca-eef6-494f-8f87-f78bc805886f 2021-10-15 13:42:11  transaction   
7  00422dc9-cdaf-401c-b03a-b791571f03c2 2023-03-25 13:21:06  transaction   
8  9ee5f5d7-d4b7-4f3b-a850-dc8fb0b8718e 2023-03-16 16:41:46      account   
9  5ae60398-6b79-44a9-843e-1fbfdddb3606 2022-08-03 21:30:26      account   
0  1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a 2023-02-07 15:19:35  transaction   
1  1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a 2021-11-11 14:16:21      account   
2  1a1

## 3. Exploring the Dataset

### Loading the Dataset

In [71]:
!ls work/data/raw/events/ -la

total 16
drwxr-xr-x 2 root root 4096 Jul 25 22:20 .
drwxr-xr-x 4 root root 4096 Jul 25 22:17 ..
-rw-r--r-- 1 root root 3776 Jul 25 22:19 fake_events_20230725221956.json
-rw-r--r-- 1 root root 3772 Jul 25 22:20 fake_events_20230725222009.json


In [72]:
raw_events = spark.read.option("multiline","true").json('work/data/raw/events/')
raw_events.show(5, truncate = False)

+----------------------------------------------------------------+-----------+------------------------------------+-------------------------+-------------------+
|data                                                            |domain     |event_id                            |event_type               |timestamp          |
+----------------------------------------------------------------+-----------+------------------------------------+-------------------------+-------------------+
|{501533, DELETED, INACTIVE, Happy store upon go say.}           |transaction|0f74cba8-2eb8-403c-9547-9d0c3ddc3f8f|transaction-status-change|2023-01-10T00:26:49|
|{388045, SUSPENDED, ACTIVE, Be short within performance couple.}|account    |5a23b2fe-7319-4f1d-9737-7d35920ca6e1|account-status-change    |2021-12-08T15:34:10|
|{8911, DELETED, BLOCKED, Operation enjoy billion.}              |account    |71569876-00ff-4305-863f-e5a6434750a9|account-status-change    |2022-10-15T11:09:04|
|{232911, DELETED, BLOCKED, 

### Dataframe Raw Schema

In [73]:
raw_events.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- new_status: string (nullable = true)
 |    |-- old_status: string (nullable = true)
 |    |-- reason: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- timestamp: string (nullable = true)



In [74]:
events = raw_events\
  .withColumn("timestamp",to_timestamp("timestamp"))\
  .withColumn("day",to_date("timestamp"))

In [65]:
events.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- new_status: string (nullable = true)
 |    |-- old_status: string (nullable = true)
 |    |-- reason: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- day: date (nullable = true)



In [75]:
events \
  .write \
  .partitionBy("event_type", "day") \
  .mode("overwrite") \
  .parquet("work/data/trusted/events/")
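
`partitionBy("event_type", "day")` makes Spark write one directory level per partition column, named `column=value` (Hive-style partitioning). The sketch below builds the expected directory for a single event in plain Python, purely to illustrate the layout; `partition_dir` is a hypothetical helper, not a Spark function.

```python
import posixpath
from datetime import datetime

def partition_dir(base_path, event_type, timestamp):
    """Return the partition directory an event would land in (illustration only)."""
    # Derive the `day` value the same way to_date() truncates the timestamp.
    day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S").date()
    return posixpath.join(base_path, f"event_type={event_type}", f"day={day.isoformat()}")

print(partition_dir("work/data/trusted/events", "account-status-change", "2021-12-08T15:34:10"))
# work/data/trusted/events/event_type=account-status-change/day=2021-12-08
```

When the Parquet data is read back, filters on `event_type` or `day` let Spark skip directories that cannot match.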