
Beefing up command line dynamic data handling functionality #11

Merged
merged 36 commits into UNSW-CEEM:master from pocket-rocket-nemosis on Feb 28, 2021

Conversation

prakaa
Contributor

@prakaa prakaa commented Jun 10, 2020

This PR primarily improves programmatic functionality for handling dynamic (time-series) data.

Changes include:

  1. Modification of the dynamic_data_compiler function to include:
  • Optional user input of the file format used to store data. While feather may have faster read/write speeds, parquet has excellent compression characteristics and good compatibility with packages for handling large in-memory/cluster datasets (e.g. Dask). This helps with local storage (especially for Causer Pays data) and with file size for version control.
  • Option to retain or delete downloaded CSVs in the cache.
  • Option to not merge downloaded data. This is useful when NEMOSIS is used to download large volumes of data for further processing.

See docstring below:

def dynamic_data_compiler(start_time, end_time, table_name, raw_data_location,
                          select_columns=None, filter_cols=None,
                          filter_values=None, fformat='feather',
                          keep_csv=True, data_merge=True, **kwargs):
    """Downloads and compiles data for all dynamic tables.
    Refer to the README for a list of tables.

    Args:
        start_time (str): format 'yyyy/mm/dd HH:MM:SS'
        end_time (str): format 'yyyy/mm/dd HH:MM:SS'
        table_name (str): table as per documentation
        raw_data_location (str): directory to download and cache data to.
                                 Existing data in this directory will be used.
        select_columns (list): columns to return
        filter_cols (list): columns to filter on
        filter_values (list): the nth list of values filters the nth column
                              in filter_cols, keeping rows with equal values
        fformat (str): 'feather' or 'parquet', for storage and access
        keep_csv (bool): retain downloaded CSVs in the cache
        data_merge (bool): concatenate DataFrames and return the result
        **kwargs: additional arguments passed to the pd.to_{fformat}() function
    Returns:
        all_data (pd.DataFrame): all data concatenated.
    """
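The filter_cols/filter_values pairing above can be sketched in plain pandas (hypothetical data and a hypothetical helper name; this just reimplements the per-column equality filtering the docstring describes):

```python
import pandas as pd

def apply_filters(df, filter_cols, filter_values):
    # The nth list in filter_values holds the accepted values for
    # the nth column in filter_cols (equality filtering per column).
    for col, values in zip(filter_cols, filter_values):
        df = df[df[col].isin(values)]
    return df

df = pd.DataFrame({
    "DUID": ["GEN1", "GEN2", "GEN1"],
    "REGIONID": ["NSW1", "VIC1", "NSW1"],
})
filtered = apply_filters(df, ["DUID", "REGIONID"], [["GEN1"], ["NSW1"]])
print(len(filtered))  # 2
```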
  2. Added FCAS Providers as a static table. This reads another tab of the Generators and Exemptions xlsx from AEMO.

  3. Generalisation and exception handling in downloader functions, with more descriptive errors. The most important error handling is for Causer Pays data: each .zip covers a 30-minute interval, but when unzipped, each .csv file covers 5 minutes.
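The 30-minute zip / 5-minute CSV structure described above can be sketched with the standard library (file names and error messages here are hypothetical, not NEMOSIS's actual internals):

```python
import io
import zipfile

def extract_causer_pays(zip_bytes, interval_id):
    # Each 30-minute Causer Pays zip should contain six 5-minute CSVs.
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            names = zf.namelist()
            if len(names) != 6:
                raise ValueError(
                    f"Expected six 5-minute CSVs in 30-minute zip "
                    f"for {interval_id}, found {len(names)}"
                )
            return {name: zf.read(name) for name in names}
    except zipfile.BadZipFile as e:
        raise ValueError(f"Corrupt download for interval {interval_id}") from e

# Build a dummy 30-minute archive in memory with six 5-minute files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for minute in range(0, 30, 5):
        zf.writestr(f"FCAS_2020061000{minute:02d}.csv", "ELEMENT,VALUE\n1,0.5\n")

csvs = extract_causer_pays(buf.getvalue(), "2020/06/10 00:00")
print(len(csvs))  # 6
```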

  4. Omission of rows of data during filtering if the date format in the raw data is incorrect (some raw files contain null or otherwise invalid rows whose timestamps are formatted incorrectly).
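A minimal sketch of this date-based row omission, assuming a SETTLEMENTDATE-style column (the column name and format string are illustrative); pandas' errors='coerce' turns unparseable timestamps into NaT so the offending rows can be dropped:

```python
import pandas as pd

raw = pd.DataFrame({
    "SETTLEMENTDATE": ["2020/06/10 00:05:00", "not-a-date", "2020/06/10 00:10:00"],
    "VALUE": [1.0, None, 2.0],
})

# Unparseable timestamps become NaT instead of raising, then get dropped.
raw["SETTLEMENTDATE"] = pd.to_datetime(
    raw["SETTLEMENTDATE"], format="%Y/%m/%d %H:%M:%S", errors="coerce"
)
clean = raw.dropna(subset=["SETTLEMENTDATE"])
print(len(clean))  # the malformed row is omitted
```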

  5. Written file names are generalised to the chosen file format, not just feather.

  6. Tests for static tables now include a start and end date, as required by the function.

  7. setup.py now includes xlrd, which is required by pandas.read_excel.

  8. Some files have had lines wrapped to comply with the PEP 8 line-length limit.


TODO:

  • README needs to be updated to highlight the new command-line functionality, with examples

@prakaa prakaa changed the title Beefing up command line Causer Pays functionality Beefing up command line dynamic data handling functionality Jun 10, 2020
@nick-gorman nick-gorman merged commit 47acd8d into UNSW-CEEM:master Feb 28, 2021
@prakaa prakaa deleted the pocket-rocket-nemosis branch March 4, 2021 02:12