<a href="https://colab.research.google.com/github/a-agmon/interviewdata/blob/main/Interview_qs_TL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### Please start by running the following cells, which download the data and set up the Spark environment.
#### Questions start after this part

In [1]:
!wget -q https://raw.githubusercontent.com/a-agmon/interviewdata/main/daily-transactions-2020-10-01
!wget -q https://raw.githubusercontent.com/a-agmon/interviewdata/main/daily-transactions-2020-10-02
!wget -q https://raw.githubusercontent.com/a-agmon/interviewdata/main/daily-transactions-2020-10-03

In [2]:
!rm -rf transactions-postproc
!rm -rf daily-transactions
!mkdir daily-transactions
!mv daily-*2020* daily-transactions/

In [3]:
!apt-get install tree
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 20 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tree amd64 1.7.0-5 [40.7 kB]
Fetched 40.7 kB in 0s (151 kB/s)
Selecting previously unselected package tree.
(Reading database ... 155676 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/machine

In [4]:
!pip install -q findspark
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
     |████████████████████████████████| 281.3 MB 46 kB/s 
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 61.5 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=47416c29a1c580e597451bc11b460bd7c6152990bbdd0f07111e5ed27248a76a
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"

import pandas as pd
import numpy as np
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkConf
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

In [6]:
# starting a spark session
spark = SparkSession.builder.master("local[*]").getOrCreate()
conf = SparkConf()

Please also run the following cell. It builds the data structures for the first question.

In [7]:
packs =  [
    {'pack':1, 'pack_start_date':123456, 'pack_end_date':123460, 'pack_installs':10},
    {'pack':2, 'pack_start_date':123460, 'pack_end_date':123470, 'pack_installs':5},
    {'pack':3, 'pack_start_date':123470, 'pack_end_date':123475, 'pack_installs':10}]

consumption = [
    {'account':'AB','install_date':123459, 'installs':10},
    {'account':'AB','install_date':123465, 'installs':5},
    {'account':'AB','install_date':123466, 'installs':3}]

### * START HERE *

In [8]:
# run these and view the generated tables

packs_df = spark.createDataFrame(packs)
consumption_df = spark.createDataFrame(consumption)

print("Packages table\n")
packs_df.show()
print("consumption table\n")
consumption_df.show()

packs_df.createOrReplaceTempView("packs")
consumption_df.createOrReplaceTempView("consumption")

Packages table

+----+-------------+-------------+---------------+
|pack|pack_end_date|pack_installs|pack_start_date|
+----+-------------+-------------+---------------+
|   1|       123460|           10|         123456|
|   2|       123470|            5|         123460|
|   3|       123475|           10|         123470|
+----+-------------+-------------+---------------+

consumption table

+-------+------------+--------+
|account|install_date|installs|
+-------+------------+--------+
|     AB|      123459|      10|
|     AB|      123465|       5|
|     AB|      123466|       3|
+-------+------------+--------+



## Instructions for Q1

The **Packages** table represents packages that customers purchase.
Each package has an ID, a start and end date (represented by a number), and a number of installs that the package includes.

The **Consumption** table shows how many installs each account used and when. For each consumption record, we need to use its date to determine which package the user was consuming from. A user can have only one package at any given time.

The report needs to show how many installs a user consumed from each of their packages, and how many installs remain in each package the user purchased.

In [9]:
# an example of how a Spark SQL query can be run
spark.sql("select * from packs").show()

+----+-------------+-------------+---------------+
|pack|pack_end_date|pack_installs|pack_start_date|
+----+-------------+-------------+---------------+
|   1|       123460|           10|         123456|
|   2|       123470|            5|         123460|
|   3|       123475|           10|         123470|
+----+-------------+-------------+---------------+



An example of the report we want to see:


```
+-------+----+-----------------+------------+-------------+
|account|pack|InstallsInPackage|InstallsUsed|InstallsDelta|
+-------+----+-----------------+------------+-------------+
|     AB|   1|               10|          10|            0|
|     AB|   2|                5|           8|           -3|
+-------+----+-----------------+------------+-------------+

```
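The logic behind the example report can be sketched in plain Python before writing it as SQL. This is a minimal sketch, not the expected solution: the function name `build_report` is hypothetical, and it assumes a package is active from its start date up to, but not including, its end date (the instructions do not say which package owns a shared boundary date such as 123460). `InstallsDelta` is taken to be `InstallsInPackage - InstallsUsed`, which matches the example rows.

```python
from collections import defaultdict

packs = [
    {'pack': 1, 'pack_start_date': 123456, 'pack_end_date': 123460, 'pack_installs': 10},
    {'pack': 2, 'pack_start_date': 123460, 'pack_end_date': 123470, 'pack_installs': 5},
    {'pack': 3, 'pack_start_date': 123470, 'pack_end_date': 123475, 'pack_installs': 10},
]
consumption = [
    {'account': 'AB', 'install_date': 123459, 'installs': 10},
    {'account': 'AB', 'install_date': 123465, 'installs': 5},
    {'account': 'AB', 'install_date': 123466, 'installs': 3},
]


def build_report(packs, consumption):
    """Sum installs per (account, pack) and compare with the pack size."""
    used = defaultdict(int)
    for row in consumption:
        for p in packs:
            # Assumption: half-open range [pack_start_date, pack_end_date).
            if p['pack_start_date'] <= row['install_date'] < p['pack_end_date']:
                used[(row['account'], p['pack'])] += row['installs']
                break

    pack_size = {p['pack']: p['pack_installs'] for p in packs}
    report = []
    for (account, pack), installs_used in sorted(used.items()):
        report.append({
            'account': account,
            'pack': pack,
            'InstallsInPackage': pack_size[pack],
            'InstallsUsed': installs_used,
            'InstallsDelta': pack_size[pack] - installs_used,
        })
    return report


report = build_report(packs, consumption)
for line in report:
    print(line)
```

A negative `InstallsDelta` means the account used more installs than the package included. Packages with no matching consumption, such as pack 3 here, do not appear in this sketch, as in the example report.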

In [14]:
sqlQuery = """

SELECT 1 + 1

"""

In [15]:
spark.sql(sqlQuery).show()

+-------+
|(1 + 1)|
+-------+
|      2|
+-------+



## Instructions for Q2

A developer on the team wrote an ETL that runs once a day as a Spark job.
Every day it reads a CSV file that shows the total value of each customer's transactions for that day and writes them as parquet files partitioned by date and customer id.
Below you can see an example of the CSV file. Note that each customer has one entry per day, representing the total value of the transactions it made on that day.

However, sometimes the CSV file contains a correction for a sum reported in the past.

For example, the file below holds the transactions for 2020-10-01. You can see that **customer 1002** has 2 entries: one for 2020-10-01 and one for 2020-09-30. This means that the customer's transactions on 2020-10-01 total 70, and also that its transactions on 2020-09-30 total 40, and that this sum should **replace** the value already reported for 2020-09-30.


```
current date file: 2020-10-01

date,customer,price
2020-10-01,1000,40
2020-10-01,1001,10
2020-09-30,1002,40
2020-10-01,1002,70
2020-10-01,1003,10
2020-09-29,1004,10
2020-10-01,1004,10
```
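The replacement rule described above can be sketched in plain Python. This is an illustration of the intended semantics only, not the ETL's implementation: the helper `apply_daily_file` is hypothetical, and the previously reported values in `totals` are invented for the example. The CSV text is the example file shown above.

```python
import csv
import io

DAILY_FILE = """date,customer,price
2020-10-01,1000,40
2020-10-01,1001,10
2020-09-30,1002,40
2020-10-01,1002,70
2020-10-01,1003,10
2020-09-29,1004,10
2020-10-01,1004,10
"""


def apply_daily_file(totals, csv_text):
    """Insert or replace one total per (date, customer) found in the file."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[(row["date"], row["customer"])] = int(row["price"])
    return totals


# Hypothetical totals reported by earlier runs (values invented for illustration).
totals = {
    ("2020-09-30", "1002"): 25,  # will be corrected by the 2020-10-01 file
    ("2020-09-30", "1000"): 15,  # not mentioned in the file, must survive
}

apply_daily_file(totals, DAILY_FILE)
print(totals[("2020-09-30", "1002")])
print(totals[("2020-09-30", "1000")])
```

The point of the sketch is that a correction touches only the (date, customer) pairs present in the file; every other previously reported value stays as it was.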

After the transformations, the files are written in this partitioning scheme, based on date and customer id:

```
|_date=2000-1-1
|___customer=100
|_______file.p
|___customer=101

```
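Spark writes partitioned output as nested `column=value` directories, which is the layout drawn above. As a rough illustration, assuming the writer partitions by date first and customer second as the text describes (the helper `partition_path` is hypothetical and only builds the directory name, it does not write anything):

```python
import posixpath


def partition_path(base_dir, date, customer):
    """Build the directory a row would land in under date/customer partitioning."""
    return posixpath.join(base_dir, f"date={date}", f"customer={customer}")


path = partition_path("transactions-postproc", "2020-10-01", 1002)
print(path)
```

Each distinct combination of partition values gets its own directory, so a single (date, customer) pair can in principle be rewritten without touching the others.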


In [16]:
# This is the folder that the parquet files are written to.
# It should be cleared before running the ETL.
!rm -rf transactions-postproc/

This function represents the ETL. It runs once a day with a string representing the current day.

It reads the CSV file, does some transformations, and writes the result.

In [17]:
def run_etl(current_date): 

  df = spark.read.option("header",True).csv(f"daily-transactions/daily-transactions-{current_date}")
  
  df = df.withColumn("priceNumeric", F.col("price").astype(IntegerType()))
  
  # some other transformation code 

  df.write \
  .option("header",True) \
  .partitionBy("date") \
  .mode("overwrite") \
  .parquet("transactions-postproc")

This cell simulates the ETL running over 3 days for testing purposes.

In [18]:
%%time
# takes a minute to run!
days = ['2020-10-01', '2020-10-02', '2020-10-03']

for date_str in days:
  run_etl(date_str)

CPU times: user 69.8 ms, sys: 10.8 ms, total: 80.6 ms
Wall time: 7.41 s


Run the two cells below to check the results. They should sum how much the company made each day from all customers.

In [19]:
df = spark.read.option("header",True).parquet("transactions-postproc")

In [20]:
df.groupBy("date") \
.sum("priceNumeric") \
.sort("date") \
.show(10, False)

+----------+-----------------+
|date      |sum(priceNumeric)|
+----------+-----------------+
|2020-10-01|5120             |
|2020-10-02|5190             |
|2020-10-03|36610            |
+----------+-----------------+



Finance saw these results and told us that they are wrong. They did the calculations manually and told us the results are supposed to look like this:


```

+----------+-----------------+
|date      |sum(priceNumeric)|
+----------+-----------------+
|2020-09-29|4880             |
|2020-09-30|9790             |
|2020-10-01|35330            |
|2020-10-02|32940            |
|2020-10-03|36610            |
+----------+-----------------+

```


Please help us find the bug in the code above and return the right results.

## Instructions for Q3

A developer on the team was running the following line in a function for logging purposes, and the job crashed with an out-of-memory (OOM) exception.
The developer says that the cluster has many workers with a lot of memory and disk, and still the job crashes.
Can you help explain why this line makes the job crash with OOM even though the cluster is huge?

In [28]:
def someFunc():
  #.....
  for row in df.collect():
    print(f'Customer {row["customer"]} => Paid {row["price"]}')
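
As a hint for thinking about this question, here is a plain-Python analogy, not Spark code. The names `partitions`, `collect_all` and `iterate_by_partition` are hypothetical stand-ins: the first imitates rows spread over workers, the second imitates gathering every row into one process, and the third imitates streaming one partition at a time. In PySpark, `DataFrame.collect()` returns all rows as a list on the driver, while `DataFrame.toLocalIterator()` and `DataFrame.take(n)` are existing alternatives that bring back less at once.

```python
def partitions():
    """Stand-in for distributed data: 4 partitions of 3 rows each."""
    for p in range(4):
        yield [{"customer": 1000 + p * 3 + i, "price": 10} for i in range(3)]


def collect_all(parts):
    # Like collecting: every row of every partition ends up in ONE list,
    # held in the memory of a single process.
    return [row for part in parts for row in part]


def iterate_by_partition(parts):
    # Like streaming: rows are handed out one partition at a time,
    # so only the current partition has to be held at once.
    for part in parts:
        yield from part


all_rows = collect_all(partitions())
rows_held_when_collecting = len(all_rows)

rows_held_when_streaming = max(len(part) for part in partitions())

print(rows_held_when_collecting, rows_held_when_streaming)
```

The size of the cluster changes how much data the workers can hold, not how much a single process can hold when everything is gathered into it.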