This repository contains:
- AWS Glue / Python scripts that ingest telecom datasets into S3 (processed zone) and support downstream analytics (Athena and Power BI where applicable).
- Terraform Infrastructure as Code (IaC) that captures the AWS microservice ecosystem (S3, Glue, Lambda, IAM, S3 triggers) so it can be reproduced and managed consistently.
The intended direction is for GitHub to be the source of truth:
- Ingestion scripts are versioned here and referenced by Glue from S3.
- Terraform defines and preserves the production ingestion ecosystem.
Core ingestion scripts:
- telus_quantities_ingestion.py
- telus_spend_ingestion.py
- rogers_spend_ingestion.py
Pricebook, mapping, and fact helpers:
- load-ngta-rogers-pricebook-notebook.py
- load-ngta-telus-pricebook-notebook.py
- load-tsma-pricebook-notebook.py
- mapping-to-master.py
- ngta-rogers-fact.py
- ngta-telus-fact.py
- tsma-fact.py
1) telus_quantities_ingestion.py
Purpose
Ingests Telus “Quantities Reports” Excel workbooks and extracts the Monthly Services sheets for all entities.
Key logic
- Identifies sheets matching `<entity> MONTHLY SERVICES` (case-insensitive).
- Derives `entity` as the first word of the sheet name.
- Removes repeated header rows inside sheets (rows where values equal the column headers).
- Renames ambiguous columns named `TBD`, `TBD.1`, `TBD.2`, etc. based on their values:
  - If values contain `wireless data`, `wireless voice`, `wireline data`, or `wireline voice` → rename column to `Source`
  - If values contain `SRVEQUIP` or `Monthly Local Services` → rename column to `Source System`
- Adds metadata columns: `ingestion_year`, `ingestion_month`, `ingestion_ts`
- Writes a single Parquet file to S3.
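The `TBD` renaming rules above can be sketched as a small pandas helper. This is a minimal illustration, not the script's actual implementation; the function name and the case-handling details are assumptions.

```python
import pandas as pd

# Value sets from the renaming rules described above.
SOURCE_VALUES = {"wireless data", "wireless voice", "wireline data", "wireline voice"}
SOURCE_SYSTEM_VALUES = {"SRVEQUIP", "Monthly Local Services"}

def rename_tbd_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename TBD/TBD.1/TBD.2... columns to Source or Source System by inspecting values."""
    renames = {}
    for col in df.columns:
        if not str(col).startswith("TBD"):
            continue
        values = {str(v).strip() for v in df[col].dropna()}
        if {v.lower() for v in values} & SOURCE_VALUES:
            renames[col] = "Source"
        elif values & SOURCE_SYSTEM_VALUES:
            renames[col] = "Source System"
    return df.rename(columns=renames)
```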
Inputs
- Trigger mode (event-driven): an S3 key is passed in via `S3_KEY`, with the expected layout:
  `raw/{provider}/{report_type}/{year}/{month_name}/<file>.xlsx`
- Manual mode: `YEAR`, `MONTH_NAME`, `MONTH_NUM`
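In trigger mode, the year/month parameters can be derived from the key layout above. A sketch of that parsing step (the function name is hypothetical; the real script may differ):

```python
def parse_raw_key(key: str) -> dict:
    """Split a raw-zone key of the form raw/{provider}/{report_type}/{year}/{month_name}/<file>.xlsx."""
    parts = key.split("/")
    if len(parts) < 6 or parts[0] != "raw":
        raise ValueError(f"Unexpected key layout: {key}")
    provider, report_type, year, month_name = parts[1:5]
    return {
        "provider": provider,
        "report_type": report_type,
        "year": year,
        "month_name": month_name,
        "filename": parts[-1],
    }
```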
Output
s3://{BUCKET}/processed/telus/quantities_reports/{YEAR}/{MONTH_NUM}/combined_{YEAR}_{MONTH_NUM}_monthly_services.parquet
2) telus_spend_ingestion.py
Purpose
Ingests Telus consolidated Spend Report workbooks, combines all relevant sheets, and writes a single Parquet output.
Key logic
- Finds the latest matching workbook under:
  `raw/telus/spend_reports/{YEAR}/{MONTH_NAME}/`
- Reads all sheets and skips empty sheets
- Removes repeated header rows
- Adds `entity_name` (the sheet name) to each row
- Adds metadata columns: `ingestion_year`, `ingestion_month`, `ingestion_ts`
- Forces all values to string to avoid schema issues
- Writes a single Parquet file to S3
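The combine step can be sketched as follows, assuming a sheets dict as produced by `pd.read_excel(path, sheet_name=None)`. This is an illustrative sketch, not the job's actual code (the real job reads the workbook from S3):

```python
import pandas as pd

def combine_sheets(sheets: dict) -> pd.DataFrame:
    """Combine non-empty sheets, tagging each row with its sheet name."""
    frames = []
    for name, df in sheets.items():
        if df.empty:
            continue  # skip empty sheets
        # Drop rows that repeat the header (every value equals its column name).
        df = df[~(df == df.columns).all(axis=1)].copy()
        df["entity_name"] = name  # sheet name per row
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)
    return combined.astype(str)  # force strings to avoid schema drift
```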
Output
s3://{OUTPUT_BUCKET}/processed/telus/spend_reports/{YEAR}/{MONTH_NUM}/combined_{YEAR}_{MONTH_NUM}_spend_report.parquet
3) rogers_spend_ingestion.py
Purpose
Ingests Rogers “Usage & Spend” workbook and writes a single Parquet output.
Key logic
- Finds the latest workbook under:
  `raw/rogers/spend_reports/{YEAR}/{MONTH_NAME}/` using filename heuristics (administrator + usage + spend + year)
- Reads the `Usage_&_Spend` sheet
- Removes repeated header rows
- Adds `entity_name="Rogers"` and ingestion metadata
- Forces all values to string
- Writes a single Parquet file to S3
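The "latest workbook" selection can be sketched as below, assuming a list of `(key, last_modified)` pairs from `list_objects_v2`. The function name is hypothetical and the token list mirrors the heuristics described above:

```python
def pick_latest_workbook(keys_with_ts, year):
    """Pick the newest .xlsx key whose name matches the administrator/usage/spend/year heuristics."""
    tokens = ("administrator", "usage", "spend", str(year))
    candidates = [
        (ts, key)
        for key, ts in keys_with_ts
        if key.endswith(".xlsx") and all(t in key.lower() for t in tokens)
    ]
    if not candidates:
        raise FileNotFoundError("No matching Rogers workbook found")
    return max(candidates)[1]  # newest LastModified wins
```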
Output
s3://{OUTPUT_BUCKET}/processed/rogers/spend_reports/{YEAR}/{MONTH_NUM}/combined_{YEAR}_{MONTH_NUM}_rogers_spend_report.parquet
4) load-ngta-rogers-pricebook-notebook.py, load-ngta-telus-pricebook-notebook.py, load-tsma-pricebook-notebook.py
Purpose
Glue Studio-generated scripts for loading and transforming telecom pricebook CSVs. These scripts typically:
- Load multiple S3 CSV inputs (Voice/Cellular/Data)
- Apply schema mappings and transformations
- Union into a single dataset
- Perform cleaning (trim, remove invisible chars, remove blanks, drop duplicates)
- Write to a target store (historically Redshift, where configured)
- Use `preactions`/`postactions` where applicable for staging and merge patterns
Note
These are generated job scripts. If standardization is needed, refactor into shared helpers and remove interactive-only statements.
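The staging-and-merge pattern via `preactions`/`postactions` can be sketched as below. The table names, database name, and merge key are placeholders; the generated scripts define their own.

```python
def redshift_merge_options(target: str, staging: str, key: str) -> dict:
    """Build Glue connection_options for a staging-table merge into Redshift."""
    preactions = f"CREATE TABLE IF NOT EXISTS {staging} (LIKE {target});"
    postactions = (
        f"DELETE FROM {target} USING {staging} WHERE {target}.{key} = {staging}.{key}; "
        f"INSERT INTO {target} SELECT * FROM {staging}; "
        f"DROP TABLE {staging};"
    )
    return {
        "dbtable": staging,          # write into the staging table first
        "database": "analytics",     # placeholder database name
        "preactions": preactions,    # run before the load
        "postactions": postactions,  # merge + cleanup after the load
    }

# In a Glue job this dict is passed as connection_options to
# glueContext.write_dynamic_frame.from_jdbc_conf(...).
```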
Recommended pattern implemented for NGTA:
- An S3 `ObjectCreated` event is configured on `ngta-raw-data` for:
  - `raw/telus/spend_reports/` → `lambda-ngta-telus`
  - `raw/rogers/spend_reports/` → `lambda-ngta-rogers`
  - `raw/telus/quantities_reports/` → `lambda-ngta-telus-quantities`
Each Lambda triggers its corresponding Glue job, passing expected arguments (bucket, year/month parameters, or S3 key when used).
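A minimal sketch of such a Lambda, assuming the S3-key trigger mode; the Glue job name and argument keys are illustrative placeholders, not the deployed functions' exact code:

```python
def extract_s3_object(event: dict) -> tuple:
    """Pull bucket and key out of an S3 ObjectCreated event."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]

def handler(event, context):
    import boto3  # provided by the Lambda runtime
    bucket, key = extract_s3_object(event)
    boto3.client("glue").start_job_run(
        JobName="telus-quantities-ingestion",            # hypothetical job name
        Arguments={"--BUCKET": bucket, "--S3_KEY": key},
    )
    return {"started": key}
```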
Processed outputs are written in Parquet under the processed zone:
s3://ngta-raw-data/processed/...
These can be exposed in Athena via external tables pointing to the processed prefixes.
Some column names contain spaces, so Athena queries should use quoted identifiers where needed.
Current state:
- Scripts are versioned in GitHub.
- Glue jobs reference scripts from S3 (AWS Glue assets bucket).
Future direction:
- Commit scripts to GitHub
- GitHub Actions uploads scripts to a controlled S3 “glue-scripts” path
- Glue jobs reference versioned script locations from S3
Suggested S3 deployment location:
s3://ngta-raw-data/glue-scripts/
Example:
s3://ngta-raw-data/glue-scripts/telus_quantities_ingestion.py
- Outputs are written as single Parquet objects using pandas + pyarrow in-memory buffers.
- Ingestion scripts convert values to strings before writing Parquet to reduce schema drift across months/entities.
- The ingestion ecosystem is event-driven for NGTA raw uploads (S3 → Lambda → Glue → processed Parquet).
These scripts and IaC support the DMP ingestion workflows (NGTA and TSMA) and are maintained as part of the Telecom Office DMP ongoing processes.