

The following code reads each consumer's 2013 data from the `data_2013/` folder (i.e. the output of STEP1).

The input data files have the format:

```
LCLid,stdorToU,DateTime,KWH/hh (per half hour) 
MAC005562,Std,2013-01-01 00:00:00.0000000, 0.298 
```
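
As an illustration of this layout, the sketch below parses the sample row using only the standard library. The format string matches the `date_format` passed to `pd.read_csv` in the code cell: `%f` consumes at most six fractional digits, so the seventh digit is matched by the literal `0`. The variable names are illustrative and not taken from the notebook.

```python
import csv
import io
from datetime import datetime

# The header and sample row shown above, including the stray spaces.
sample = (
    "LCLid,stdorToU,DateTime,KWH/hh (per half hour) \n"
    "MAC005562,Std,2013-01-01 00:00:00.0000000, 0.298 \n"
)

# %f matches up to six fractional digits; the trailing literal 0
# absorbs the seventh digit present in the raw timestamps.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f0"

rows = list(csv.reader(io.StringIO(sample)))
header, first = rows[0], rows[1]
lclid, tariff, raw_time, raw_kwh = first

timestamp = datetime.strptime(raw_time, TIMESTAMP_FORMAT)
kwh = float(raw_kwh)  # float() tolerates the surrounding spaces

print(lclid, tariff, timestamp, kwh)
```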

It creates two new files per consumer: one containing only the summer
WEEKEND datapoints and one containing only the winter WEEKEND datapoints.

It skips any consumer that has ANY missing or zero datapoints in either period.
It skips any ToU consumer.
It sums each pair of half-hour datapoints into a single hourly datapoint.
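
The hourly summation can be sketched on a toy frame as follows. The four readings are made-up values for illustration, not data from the dataset; the `floor` and `groupby` calls mirror the ones used in the code cell.

```python
import pandas as pd

# Four hypothetical half-hourly readings covering two hours.
half_hourly = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2013-06-22 00:00:00", "2013-06-22 00:30:00",
        "2013-06-22 01:00:00", "2013-06-22 01:30:00",
    ]),
    "KWH": [0.25, 0.5, 0.125, 0.375],
})

# Floor each timestamp to the start of its hour, then sum the two
# half-hour readings that share that hour.
half_hourly["HourlyDateTime"] = half_hourly["DateTime"].dt.floor("h")
hourly = half_hourly.groupby("HourlyDateTime")["KWH"].sum().reset_index()

print(hourly)
```

Because each hour is the sum of exactly two readings, a day with 48 half-hour values yields 24 hourly values.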


The output is therefore two folders (`winter_we` and `summer_we`).
Every consumer that is kept has one file in each folder, holding a complete
set of hourly weekend data: 21 weekend days * 24 = 504 winter datapoints
and 28 weekend days * 24 = 672 summer datapoints.
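
The counts of 21 and 28 weekend days can be checked with a few lines of standard-library Python. The date ranges are the ones used in the code cell; the helper name `weekend_days` is hypothetical.

```python
from datetime import date, timedelta


def weekend_days(start, end):
    """Return every Saturday and Sunday from start to end, inclusive."""
    days = []
    current = start
    while current <= end:
        if current.weekday() >= 5:  # Monday is 0, so 5 and 6 are Sat and Sun
            days.append(current)
        current += timedelta(days=1)
    return days


winter = weekend_days(date(2013, 1, 6), date(2013, 3, 20))
summer = weekend_days(date(2013, 6, 21), date(2013, 9, 23))

# 24 hourly datapoints per weekend day.
winter_hours = len(winter) * 24
summer_hours = len(summer) * 24

print(len(winter), winter_hours)
print(len(summer), summer_hours)
```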

The output data files will have the format:

```
HourlyDateTime,KWH
2013-06-21 00:00:00,0.109
2013-06-21 01:00:00,0.084
2013-06-21 02:00:00,0.133 ...etc
```
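
Note that `to_csv` does not create missing parent folders, so the two output folders need to exist before the code runs. A minimal sketch, assuming the folders live under `data_2013/` as in the `to_csv` paths used by the code; the helper name `ensure_output_dirs` is hypothetical.

```python
from pathlib import Path


def ensure_output_dirs(base_dir="data_2013", names=("summer_we", "winter_we")):
    """Create the output folders if they are missing and return their paths."""
    base = Path(base_dir)
    folders = []
    for name in names:
        folder = base / name
        # parents=True also creates base_dir; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders


# Usage (run once before the processing loop):
# ensure_output_dirs()
```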


In [1]:
import pandas as pd

with open("data_2013/2013_ids.txt", "r") as input_file:
    idx_list = [lclid.strip() for lclid in input_file]

with open("errors_weekend.txt", "w") as errors:

    for id in idx_list:
        df = pd.read_csv(
            f"data_2013/{id}.csv",
            usecols=lambda x: x != "LCLid",
            parse_dates=['DateTime'],
            date_format='%Y-%m-%d %H:%M:%S.%f0',
            dtype={'stdorToU': 'category'}
        )
        print("Processing LCLid: ", id)

        # skip this LCLid if the stdorToU column contains 'ToU'
        if 'ToU' in df['stdorToU'].unique():
            errors.write(f"LCLid {id} contains 'ToU'. Skipping this LCLid.\n")
            continue

        # rename the KWH column and convert it to float;
        # any non-numeric entries are coerced to NaN
        df.rename(columns={df.columns[-1]: 'KWH'}, inplace=True)
        df['KWH'] = pd.to_numeric(df['KWH'], errors="coerce").astype(float)
        # drop any rows with NaN values, including those produced by the conversion
        df = df.dropna(subset=['KWH'])
        # drop duplicates
        df.drop_duplicates(inplace=True)
        # double check that the datetime field is in datetime format
        df['DateTime'] = pd.to_datetime(df['DateTime'])
        # keep only rows whose timestamp falls on the hour or half hour (minute 00 or 30)
        df = df[(df['DateTime'].dt.minute == 0) | (df['DateTime'].dt.minute == 30)]
        # keep only Saturdays and Sundays (dayofweek 5 and 6)
        df = df[df['DateTime'].dt.dayofweek > 4]

        # now create two dfs - one for summer weekends and one for winter weekends
        summer_df = df[(df['DateTime'].dt.date >= pd.to_datetime('2013-06-21').date()) & (df['DateTime'].dt.date <= pd.to_datetime('2013-09-23').date())]
        winter_df = df[(df['DateTime'].dt.date >= pd.to_datetime('2013-01-06').date()) & (df['DateTime'].dt.date <= pd.to_datetime('2013-03-20').date())]

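        # Remove any datapoints that contain zero KWH values
        # (.copy() so the HourlyDateTime column can be added later without a SettingWithCopyWarning)
        summer_df = summer_df[summer_df['KWH'] != 0].copy()
        winter_df = winter_df[winter_df['KWH'] != 0].copy()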

        # Now for each df create a list of dates between the start and end date

        summer_dates = list(pd.date_range(start='2013-06-21', end='2013-09-23', freq='D'))
        summer_weekend_dates = [date for date in summer_dates if date.dayofweek >= 5]

        winter_dates = list(pd.date_range(start='2013-01-06', end='2013-03-20', freq='D'))
        winter_weekend_dates = [date for date in winter_dates if date.dayofweek >= 5]

        missing_data = False

        # then loop through these dates and check that there are 48 values for each date.
        # If not, skip this LCLid
        for date in summer_weekend_dates:
            num = len(summer_df[summer_df['DateTime'].dt.date == date.date()])
            if num != 48:
                errors.write(f"{id} contains only {num} datapoints for date {date.date()}. Skipping this LCLid.\n")
                missing_data = True
                break

        if missing_data:
            continue

        for date in winter_weekend_dates:
            num = len(winter_df[winter_df['DateTime'].dt.date == date.date()])
            if num != 48:
                errors.write(f"{id} contains only {num} datapoints for date {date.date()}. Skipping this LCLid.\n")
                missing_data = True
                break

        if missing_data:
            continue

        # Convert DateTime to hourly
        summer_df['HourlyDateTime'] = summer_df['DateTime'].dt.floor('h')
        winter_df['HourlyDateTime'] = winter_df['DateTime'].dt.floor('h')

        # Group by hourly datetime and sum the values
        summer_hourly_sum_df = summer_df.groupby('HourlyDateTime')['KWH'].sum().reset_index()
        winter_hourly_sum_df = winter_df.groupby('HourlyDateTime')['KWH'].sum().reset_index()

        # now we can write these two dataframes to their respective csv files
        summer_hourly_sum_df.to_csv(f"data_2013/summer_we/{id}.csv", index=False)
        winter_hourly_sum_df.to_csv(f"data_2013/winter_we/{id}.csv", index=False)




Processing LCLid:  MAC000002
Processing LCLid:  MAC000003
Processing LCLid:  MAC000004
Processing LCLid:  MAC000005
Processing LCLid:  MAC000006
Processing LCLid:  MAC000007
Processing LCLid:  MAC000008
Processing LCLid:  MAC000009
Processing LCLid:  MAC000010
Processing LCLid:  MAC000011
Processing LCLid:  MAC000012
Processing LCLid:  MAC000013
Processing LCLid:  MAC000014
Processing LCLid:  MAC000015
Processing LCLid:  MAC000016
Processing LCLid:  MAC000017
Processing LCLid:  MAC000018
Processing LCLid:  MAC000019
Processing LCLid:  MAC000020
Processing LCLid:  MAC000021
Processing LCLid:  MAC000022
Processing LCLid:  MAC000023
Processing LCLid:  MAC000024
Processing LCLid:  MAC000025
Processing LCLid:  MAC000026
Processing LCLid:  MAC000027
Processing LCLid:  MAC000028
Processing LCLid:  MAC000029
Processing LCLid:  MAC000030
Processing LCLid:  MAC000031
Processing LCLid:  MAC000032
Processing LCLid:  MAC000033
Processing LCLid:  MAC000034
Processing LCLid:  MAC000035
Processing LCL