Requires `psycopg2`.
- Set up a database to hold the data. See `prepare_db.py` for an example query that creates the required tables on a newly set up database.
- Fill out `conn_options` in `main.py` as required.
- Run `main.py` as a Python script to ingest and aggregate the data.
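The `conn_options` dictionary is typically passed straight to `psycopg2.connect()`. A minimal sketch of what it might look like, assuming a local PostgreSQL instance (the database name and credentials below are placeholders, not values from this project):

```python
# Hypothetical contents of conn_options in main.py.
# All values are illustrative; adjust to your local PostgreSQL setup.
conn_options = {
    "dbname": "campaign_data",  # assumed database name
    "user": "postgres",
    "password": "secret",
    "host": "localhost",
    "port": 5432,
}
```

These keys match the standard keyword arguments accepted by `psycopg2.connect()`.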
Data ingestion takes around 2.5 s, as measured with `timeit` on a Ryzen 1700X; `aggregate_roas()` takes around 0.8 s per alias field. The `profile.py` module facilitates measuring these values.
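A measurement like the one above can be reproduced with the standard-library `timeit` module. A minimal sketch, assuming `profile.py` wraps `ingest()` and `aggregate_roas()` in a similar helper (the `measure` function and the stand-in workload below are illustrative, not taken from the project):

```python
import timeit

def measure(func, *args, repeat=5):
    """Time func(*args) and return the best of `repeat` single runs, in seconds.

    Taking the minimum of several runs reduces noise from other processes.
    """
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=1))

# Stand-in workload; in profile.py this would presumably be
# ingest(...) or aggregate_roas(...) against a test database.
best = measure(sum, range(100_000))
```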
As it stands, the `main.py` script simply ignores rows that violate any unique or primary key constraint. An alternative approach, implemented in `alternative_ingest.py`, instead updates those rows with the new values. To use it, uncomment line 3 (`# from alternative_ingest import ingest`) in `main.py`.
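In PostgreSQL, the update-on-conflict behavior is usually expressed with `INSERT ... ON CONFLICT ... DO UPDATE`. A sketch of how such a statement could be built, assuming that is the approach `alternative_ingest.py` takes (the function and all table/column names are illustrative):

```python
def build_upsert(table, columns, key_columns):
    """Build an INSERT ... ON CONFLICT DO UPDATE statement.

    Non-key columns are overwritten with the incoming values via
    EXCLUDED, so conflicting rows are updated instead of skipped.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in key_columns
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_columns)}) DO UPDATE SET {updates}"
    )

sql = build_upsert("events", ["id", "country", "roas"], ["id"])
```

The resulting string would then be executed with `cursor.execute(sql, row)` for each row.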
The status columns seem to only ever contain two values, `ENABLED` and `REMOVED`, so they could be implemented as SQL `BOOLEAN` columns to save memory. However, this would require editing the `ingest()` function to map these values from strings to booleans, and the speed-memory tradeoff might not be worth it.
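The string-to-boolean mapping itself is small; a sketch of what the extra step in `ingest()` might look like (the helper name is hypothetical):

```python
# The two status values observed in the data, per the note above.
STATUS_TO_BOOL = {"ENABLED": True, "REMOVED": False}

def map_status(value):
    """Map a status string to a boolean for a BOOLEAN column.

    Raises KeyError on an unexpected value rather than silently
    coercing it, so new status values surface during ingestion.
    """
    return STATUS_TO_BOOL[value]
```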
The `aggregate_roas()` function can easily be extended to aggregate by more fields of the alias column: append the field name to `valid_aggregate_columns` and add the name with its position to `alias_field_to_position_mapper`. For example, adding `'structure_value'` and `{'structure_value': 4}` respectively would allow the following usage: `aggregate_roas(cursor, ['country', 'structure_value'])`.
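The extension described above amounts to editing two module-level objects. A sketch, assuming `valid_aggregate_columns` is a list and `alias_field_to_position_mapper` is a dict (the pre-existing entries and their positions below are illustrative; only `'structure_value': 4` comes from this README):

```python
# Hypothetical existing state of the two objects in main.py.
valid_aggregate_columns = ["country", "network"]
alias_field_to_position_mapper = {"country": 2, "network": 3}

def register_alias_field(name, position):
    """Make a new alias field usable by aggregate_roas()."""
    valid_aggregate_columns.append(name)
    alias_field_to_position_mapper[name] = position

# The example from the text: enable aggregation by 'structure_value'.
register_alias_field("structure_value", 4)
```

After this, `aggregate_roas(cursor, ['country', 'structure_value'])` would pass the validity check and pick the field out of the alias at position 4.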