This is not the contest problem; please go to the website to see it.
Data overflow contest mock problem.
We have a TSV (tab-separated values) file containing a user_id and a location_id on each line. The goal of this task is to aggregate the user visits into an output TSV file in which each line contains a user_id followed by that user's location_ids, without any duplicates.
Note: user_id and location_id are integers; user_id identifies a user and location_id identifies a location.
Sample input:
USER_ID LOCATION_ID
1234 1
1234 2
1245 6
1293 7
1234 4
1245 5
1293 4
2345 1
1234 1
Sample output:
1234 1,2,4
1245 6,5
1293 7,4
2345 1
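The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the contest's reference solution: the function names aggregate_visits and format_rows are hypothetical, and the sketch assumes that location_ids are listed in first-seen order (which is consistent with the sample output) and that non-numeric lines such as a header row are skipped.

```python
def aggregate_visits(lines):
    """Map each user_id to its distinct location_ids, in first-seen order."""
    visits = {}  # user_id -> dict used as an insertion-ordered set
    for line in lines:
        parts = line.split()  # splits on tabs or any other whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            user_id, location_id = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # assumption: a header row or other non-integer line is ignored
        visits.setdefault(user_id, {})[location_id] = None
    return {user: list(locations) for user, locations in visits.items()}


def format_rows(visits):
    """Render one 'user_id<TAB>loc1,loc2,...' string per user."""
    rows = []
    for user, locations in visits.items():
        rows.append(str(user) + "\t" + ",".join(str(loc) for loc in locations))
    return rows
```

A plain dict is used as the per-user container because Python 3.7+ dicts preserve insertion order, which gives de-duplication and stable ordering in one structure.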
The code will be tested against test cases.
For performance, we test the code with files of 1 million, 10 million, and 100 million records.
Expected runtime: your code is expected to generate the output file within 5 seconds for 1 million records, 50 seconds for 10 million, and 500 seconds for 100 million.
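The three targets work out to the same rate, 5 seconds per million records. A small helper along the following lines could be used to check a local run against that budget. The names runtime_budget and timed are hypothetical and are not part of wrapper.py; timings on your own machine will also differ from the test machine.

```python
import time

SECONDS_PER_MILLION = 5.0  # from the stated targets: 5 s per 1 million records


def runtime_budget(n_records):
    """Return the time allowance in seconds for n_records input lines."""
    return n_records / 1_000_000 * SECONDS_PER_MILLION


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `result, elapsed = timed(my_function, my_input)` followed by `elapsed <= runtime_budget(record_count)` gives a quick pass/fail check, where my_function, my_input, and record_count are placeholders for your own code and data.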
Important note: when testing, we will run your code with N input files, where 2 <= N <= 10. You can expect the data to be evenly divided across the N files.
Test environment: 1 GB RAM, 2-core CPU
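Because the input arrives as several files and memory is limited, one reasonable approach is to stream each file line by line instead of loading it whole. The sketch below illustrates that idea; aggregate_files is a hypothetical name (the actual signature of location_aggregation in the repository is not shown here), and the sketch assumes first-seen ordering and that non-integer lines are skipped. Its memory use grows with the number of distinct (user_id, location_id) pairs, so whether it fits in 1 GB depends on the data.

```python
def aggregate_files(input_paths, output_path):
    """Stream every input TSV once and write one aggregated line per user.

    Returns the number of distinct users written.
    """
    visits = {}  # user_id -> insertion-ordered dict of location_ids
    for path in input_paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:  # one line at a time; the file is never fully loaded
                parts = line.split("\t")
                if len(parts) != 2:
                    continue  # skip blank or malformed lines
                try:
                    user_id, location_id = int(parts[0]), int(parts[1])
                except ValueError:
                    continue  # assumption: header or non-integer rows are ignored
                visits.setdefault(user_id, {})[location_id] = None

    with open(output_path, "w", encoding="utf-8") as out:
        for user_id, locations in visits.items():
            joined = ",".join(str(loc) for loc in locations)
            out.write(str(user_id) + "\t" + joined + "\n")
    return len(visits)
```

The files are processed sequentially here for simplicity; splitting the work across the two cores is possible but would need a merge step for users that appear in more than one file.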
Sample data can be downloaded from here
This is a zip file; uncompress it with the unzip command or any other decompression tool.
You can use this sample data to run your code.
- Log in to GitHub and visit the repository.
- Clone the forked repository to your local machine.
- Start writing your code by updating the location_aggregation function in the code/script.py file. Feel free to add to or modify the code.
- If your code uses additional libraries, list them in requirements.txt.
- Run the basic test cases with:
  python3 wrapper.py test
  This tests your code against the basic test cases.
- To run your code with the given sample input file, run:
  python3 wrapper.py run -i input_file_1.tsv [input_file_2.tsv ...] -o output_file.tsv
- Once you are happy with the code, commit it.
- Submit your GitHub repository link along with the commit ID on our website.