This coding challenge is a collection of Python jobs that extract, transform, and load data.
These jobs use PySpark to process larger volumes of data and are meant to run on a Spark cluster (via `spark-submit`).
✅ Goals
- Get a working environment (see the local setup instructions below)
- Get a high-level understanding of the code and test dataset structure
- Have your preferred text editor or IDE set up and ready to go.
❌ Non-Goals
- Solving the exercises / writing code
⚠️ The exercises will be given at the time of the interview and solved by pairing with the interviewer.
Please make sure you have the following installed and that you can run them:
- Python (3.13.x); you can use, for example, pyenv to manage your Python versions locally
- Poetry
- Java (17); you can use sdkman to install and manage Java locally
We recommend using WSL 2 on Windows for this exercise, due to Hadoop/Spark's lack of support for Windows paths.
Follow the instructions on the official Windows page and then the Linux installation instructions.
poetry install
All of the following commands should run successfully:
poetry run pytest tests/unit
poetry run pytest tests/integration
poetry run mypy --ignore-missing-imports --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs \
data_transformations tests
poetry run ruff format && poetry run ruff check
All commands are passing?
You are good to go!
⚠️ do not try to solve the exercises ahead of the interview
You are welcome to customize your environment (for example, running the tests directly in VS Code): feel free to spend time making this comfortable for you. This is not an expectation.
There are two exercises in this repo: Word Count and Citibike.
Currently, these exist as skeletons with some initial test cases defined, some of which are skipped.
The following sections provide context for them.
⚠️ do not try to solve the exercises ahead of the interview
/
├─ /data_transformations # Contains the main Python library
│ # with the code for the transformations
│
├─ /jobs # Contains the entry points to the jobs;
│ # these perform argument parsing and are
│ # passed to `spark-submit` (see the sketch below)
│
├─ /resources # Contains the raw datasets for the jobs
│
├─ /tests
│ ├─ /unit # Contains basic unit tests for the code
│ └─ /integration # Contains integration tests for the jobs
│ # and the setup
│
├─ .gitignore
├─ .pylintrc # Configuration for pylint
├─ LICENCE
├─ poetry.lock
├─ pyproject.toml
└─ README.md # The current file
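To make the structure above more concrete, here is a minimal sketch of what an entry point under `/jobs` typically looks like; the app name, argument handling, and the commented-out library call are illustrative assumptions, not the repo's actual API.

```python
# Illustrative sketch of a job entry point (names are assumptions, not the repo's API).
# Scripts in /jobs parse their arguments, create a SparkSession, and delegate the
# actual transformation to the library code in /data_transformations.
import sys

from pyspark.sql import SparkSession


def main() -> None:
    # spark-submit forwards the trailing CLI arguments to the script
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("example_job").getOrCreate()

    # In this repo the real logic lives in the data_transformations package;
    # the exact function name below is hypothetical.
    # run_transformation(spark, input_path, output_path)
    print(f"Would process {input_path} -> {output_path} with Spark {spark.version}")


if __name__ == "__main__":
    main()
```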
An NLP model depends on a specific input file. This job preprocesses a given text file to produce that input file for the NLP model (feature engineering): it counts the occurrences of each word within the given text file (corpus).
There is a dump of the data lake for this exercise under `resources/word_count/words.txt`, a plain text file.
---
title: Wordcount Pipeline
---
flowchart LR
Raw["fa:fa-file words.txt"] --> J1{{word_count.py}} --> Bronze["fa:fa-file-csv word_count.csv"]
Input: a simple `*.txt` file containing text.
Output: a single `*.csv` file containing data similar to:
"word","count"
"a","3"
"an","5"
...
poetry build && poetry run spark-submit \
--master local \
--py-files dist/data_transformations-*.whl \
jobs/word_count.py \
<INPUT_FILE_PATH> \
<OUTPUT_PATH>
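For orientation only (the exercise itself is solved during the interview), this is one generic way word counting is commonly expressed with the PySpark DataFrame API; the app name, the whitespace-splitting rule, and the final `show()` are illustrative assumptions rather than the expected solution.

```python
# Generic PySpark word-count pattern (illustrative only, not the exercise solution).
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count_sketch").getOrCreate()

lines = spark.read.text("resources/word_count/words.txt")  # one row per line, column "value"
counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))  # line -> words
    .where(F.col("word") != "")  # drop empty tokens
    .groupBy("word")
    .count()
)
counts.show()
```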
This problem uses data made publicly available by Citibike, a New York based bike share company.
For analytics purposes, the BI department of a hypothetical bike share company would like to present dashboards displaying the distance each bike was ridden.
There is a `*.csv` file that contains historical data of previous bike rides. This input file needs to be processed in multiple steps, and a pipeline runs these jobs.
---
title: Citibike Pipeline
---
flowchart TD
Raw["fa:fa-file-csv citibike.csv"] --> J1{{citibike_ingest.py}} --> Bronze["fa:fa-table-columns citibike.parquet"] --> J2{{citibike_distance_calculation.py}} --> Silver["fa:fa-table-columns citibike_distance.parquet"]
There is a dump of the data lake for this exercise under `resources/citibike/citibike.csv`, containing historical ride data.
This job reads a `*.csv` file and transforms it to Parquet format. The column names are sanitized (whitespace replaced with underscores).
Input: a historical bike ride `*.csv` file:
"tripduration","starttime","stoptime","start station id","start station name","start station latitude",...
364,"2017-07-01 00:00:00","2017-07-01 00:06:05",539,"Metropolitan Ave & Bedford Ave",40.71534825,...
...
Output: `*.parquet` files containing the same content:
"tripduration","starttime","stoptime","start_station_id","start_station_name","start_station_latitude",...
364,"2017-07-01 00:00:00","2017-07-01 00:06:05",539,"Metropolitan Ave & Bedford Ave",40.71534825,...
...
poetry build && poetry run spark-submit \
--master local \
--py-files dist/data_transformations-*.whl \
jobs/citibike_ingest.py \
<INPUT_FILE_PATH> \
<OUTPUT_PATH>
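As a point of reference, turning a CSV with a header into Parquet is standard Spark I/O; below is a minimal, generic sketch in which the output path and the underscore-renaming rule are assumptions for illustration, not the exercise code.

```python
# Generic CSV-to-Parquet ingestion sketch (illustrative only, not the exercise solution).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_parquet_sketch").getOrCreate()

df = spark.read.csv("resources/citibike/citibike.csv", header=True)

# Sanitize column names, e.g. "start station id" -> "start_station_id"
sanitized = df.toDF(*[column.replace(" ", "_") for column in df.columns])

sanitized.write.mode("overwrite").parquet("/tmp/citibike.parquet")  # hypothetical output path
```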
This job takes bike trip information and adds the "as the crow flies" distance traveled for each trip. It reads the Parquet files produced by the ingestion job.
Hint:
- For the distance calculation, consider using the Haversine formula (see the formula below) as an option.
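For reference, the Haversine great-circle distance between two points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ (latitude and longitude in radians) on a sphere of radius $r$ (about 6371 km for Earth) is:

$$
d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)
$$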
Input: historical bike ride `*.parquet` files:
"tripduration",...
364,...
...
Output: `*.parquet` files containing the historical data with an additional `distance` column containing the calculated distance:
"tripduration",...,"distance"
364,...,1.34
...
poetry build && poetry run spark-submit \
--master local \
--py-files dist/data_transformations-*.whl \
jobs/citibike_distance_calculation.py \
<INPUT_PATH> \
<OUTPUT_PATH>
⚠️ do not try to solve the exercises ahead of the interview
If you are unfamiliar with some of the tools used here, we recommend a few resources to get started:
- pytest: the official documentation
- pyspark: the official documentation, and especially the DataFrame quickstart (a tiny warm-up example follows below)
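If you want to warm up with the DataFrame API before the interview, a tiny self-contained example in the spirit of the quickstart (the sample data and column names here are invented and unrelated to the exercises) is:

```python
# Tiny PySpark DataFrame warm-up (invented sample data, unrelated to the exercises).
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warm_up").getOrCreate()

df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 2)],
    ["name", "rides"],
)

df.groupBy("name").agg(F.sum("rides").alias("total_rides")).show()
```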