Note: These steps regenerate the listeners.parquet and spins.parquet data files locally. Students shouldn't have to run this; the files are provided on S3.

  1. If not a fresh checkout, make sure the tmp_ directories don't exist in the basedir. I don't automatically clean these up after runs because I'm not a fan of code that recursively removes directories, no matter how foolproof things might sound. Call me cautious.

  2. You'll need access to the real metadata to parse. On my local workstation this is located at /Users/bfemiano/Downloads/metadata.txt. This is why students won't be able to run this script.

  3. Make sure PySpark 2.2.1 is on the PYTHONPATH or is part of the local pipenv install.

  4. Run: python generate_fake_data.py

You should see 5 sample rows printed out to verify that the listeners and spins TSV data joined together correctly on fake_listener_id.

  1. The data will go under tmp_listeners_parquet and tmp_spins_parquet.

  2. The part files from the tmp parquet locations will get copied to ./data and renamed to 'listeners.snappy.parquet' and 'spins-2019-02-08.snappy.parquet'.
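The copy-and-rename step can be sketched roughly as follows. This is not the code from generate_fake_data.py; the helper name copy_part_file is hypothetical, and the sketch assumes each tmp directory holds exactly one Spark part file whose name matches part-*.parquet.

```python
import glob
import os
import shutil


def copy_part_file(tmp_dir, dest_dir, dest_name):
    """Copy the single Parquet part file in tmp_dir to dest_dir/dest_name.

    Hypothetical helper: assumes Spark wrote exactly one part file.
    """
    # Spark writes output as part-* files inside the target directory.
    # Marker files such as _SUCCESS and hidden .crc files do not match.
    matches = sorted(glob.glob(os.path.join(tmp_dir, "part-*.parquet")))
    if len(matches) != 1:
        raise RuntimeError(
            "expected exactly one part file in %s, found %d"
            % (tmp_dir, len(matches))
        )
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, dest_name)
    # Copy rather than move, so the tmp directory is left untouched.
    shutil.copyfile(matches[0], dest_path)
    return dest_path
```

For example, copy_part_file("tmp_listeners_parquet", "data", "listeners.snappy.parquet") would produce the listeners file. Copying instead of moving is in keeping with the note above about not removing anything automatically.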

What the script does.

The script generates 100 fake listener identifiers with age_buckets, gender and subscription_type.
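A minimal sketch of the listener generation, in plain Python rather than PySpark. The value lists are placeholders (the actual values are not listed here), and make_listeners and the age_bucket field name are hypothetical.

```python
import random

# Placeholder values for illustration only; not the real subsets.
AGE_BUCKETS = ["18-24", "25-34", "35-44"]
GENDERS = ["female", "male", "unknown"]
SUBSCRIPTION_TYPES = ["free", "paid"]


def make_listeners(count=100, seed=None):
    """Return `count` fake listener records with random attributes."""
    rng = random.Random(seed)
    return [
        {
            # Sequential ids are an assumption; any unique id would do.
            "fake_listener_id": i,
            "age_bucket": rng.choice(AGE_BUCKETS),
            "gender": rng.choice(GENDERS),
            "subscription_type": rng.choice(SUBSCRIPTION_TYPES),
        }
        for i in range(count)
    ]


listeners = make_listeners()
```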

It then cracks open real metadata from different song plays that happened and keeps only the real track titles and artist names. It autogenerates completely fake artist ids and track ids each time it's run and associates them with the real names. Within a single run of the script, the same real artist_id will always map to the same fake_id. It also generates fake elapsed_seconds and play sources.
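One way to get ids that are consistent within a run but different across runs is an in-memory lookup table. The sketch below illustrates that idea only; the FakeIdMapper class is hypothetical, and using uuid4 for the fake ids is an assumption, not necessarily what the script does.

```python
import uuid


class FakeIdMapper:
    """Map real ids to fake ids, consistently for the life of this object."""

    def __init__(self):
        # The table lives only in memory, so every run starts from scratch.
        self._mapping = {}

    def fake_id_for(self, real_id):
        # First sighting of a real id: mint a new random fake id.
        if real_id not in self._mapping:
            self._mapping[real_id] = uuid.uuid4().hex
        return self._mapping[real_id]


artist_ids = FakeIdMapper()
first = artist_ids.fake_id_for("real-artist-123")
second = artist_ids.fake_id_for("real-artist-123")
```

Here first and second are equal, while a fresh FakeIdMapper (standing in for a new run) would hand out a different fake id for the same real id.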

The spins generation associates a random fake_listener_id from the first part of the script. This lets the 2 files be joined again later.

Elapsed seconds are kept between 0 and 300 (5 minutes).
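A rough sketch of how a single spin record could be assembled. The play source values, the make_spin helper, and every field name other than fake_listener_id and elapsed_seconds are hypothetical, and treating the 0 to 300 range as inclusive at both ends is an assumption.

```python
import random

# Placeholder values for illustration only; not the real play sources.
PLAY_SOURCES = ["station", "playlist"]


def make_spin(listener_ids, fake_track_id, fake_artist_id, rng=random):
    """Build one fake spin tied to a randomly chosen existing listener."""
    return {
        # Reusing an id from the listeners data is what makes the join work.
        "fake_listener_id": rng.choice(listener_ids),
        "fake_track_id": fake_track_id,
        "fake_artist_id": fake_artist_id,
        # randint includes both endpoints, so values fall in 0..300.
        "elapsed_seconds": rng.randint(0, 300),
        "play_source": rng.choice(PLAY_SOURCES),
    }


listener_ids = list(range(100))
spins = [make_spin(listener_ids, "track-%d" % n, "artist-%d" % (n % 10))
         for n in range(1000)]
```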

Everything is completely autogenerated and fake except for the artist names and track names. The subscription_types, play sources, genders and age_buckets contain very small subsets of the real values we use at Next Big Sound/Pandora.

These properties help to give the fake data a feeling of realism.