This project is part of the data engineering festival at HSBC.
In this project, you will use the Spark RDD API for data processing.
You will analyse a dataset of (almost) all submitted HackerNews posts.
- First run
      python get_data.py
  to download the data. Alternatively, you can download it from https://s3-eu-west-1.amazonaws.com/kate-datasets/hackernews/HNStories.zip and unzip it.
- Then you can use the notebook
      spark-rdd-homework.ipynb
  to prototype your functions.
- Finally, write down your functions in
      spark_rdd.py
  and submit on K.A.T.E.
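
If you prefer to script the manual download from the first step, the following is a minimal sketch using only the Python standard library. It is not necessarily what get_data.py does; the helper names download_archive and extract_archive and the data/ directory used below are assumptions made for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# URL taken from the instructions above.
DATA_URL = "https://s3-eu-west-1.amazonaws.com/kate-datasets/hackernews/HNStories.zip"


def download_archive(url: str, archive_path: Path) -> Path:
    """Download url to archive_path, skipping the download if the file exists.

    Hypothetical helper; the name and the skip-if-present behaviour are assumptions.
    """
    if not archive_path.exists():
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract_archive(archive_path: Path, target_dir: Path) -> List[str]:
    """Unzip archive_path into target_dir and return the archive member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

For example, extract_archive(download_archive(DATA_URL, Path("data/HNStories.zip")), Path("data")) would fetch the archive once and unpack it into data/. Check the returned member names to see which file to load in the notebook.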