### In this additional step, we classify each LCLid's yearly consumption as Low, Medium or High, so that their consumption can be plotted on separate lines.

At this stage we have reduced the LCLids to those which have a full set of datapoints for the 
winter/summer date ranges only.

A small number of these LCLids may have full data for the winter/summer dates but
some missing data for the spring/autumn dates. If we discarded them we would not be able to classify 
them, so we keep them and accept that their yearly totals may be slightly understated.

This introduces a small potential classification error, which should only affect consumers whose yearly totals lie close to a class boundary.
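
One way to gauge how much data a consumer is missing is to compare the number of half-hourly readings with the number expected for a full year. The sketch below is illustrative only: `coverage_fraction` and `EXPECTED_READINGS_2013` are hypothetical names that do not appear in the notebook, and the helper assumes a `DateTime` column with readings on the hour and half hour. 2013 is not a leap year, so a complete year has 365 × 48 = 17,520 readings.

```python
import pandas as pd

EXPECTED_READINGS_2013 = 365 * 48  # half-hourly readings in a non-leap year


def coverage_fraction(df: pd.DataFrame) -> float:
    """Return the fraction of expected half-hourly readings present in df.

    Hypothetical helper: assumes df has a 'DateTime' column.
    """
    stamps = pd.to_datetime(df["DateTime"])
    # Keep readings on the hour or half hour, then count distinct timestamps.
    on_grid = stamps[stamps.dt.minute.isin([0, 30])]
    return on_grid.nunique() / EXPECTED_READINGS_2013


# Toy example: a single complete day of half-hourly readings.
one_day = pd.DataFrame(
    {"DateTime": pd.date_range("2013-01-01", periods=48, freq="30min")}
)
print(f"{coverage_fraction(one_day):.4f}")
```

A consumer whose fraction is well below 1.0 and whose yearly total sits just under a class boundary is the kind of case where the classification could be off by one class.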

In the next step we will add an L/M/H classification for each LCLid, based on the consumer's yearly csv file.

Ranges (annual consumption in kWh) are:  
VERY-LOW    < 1000  
LOW         >= 1000 and < 2700  
MED         >= 2700 and < 4200  
HIGH        >= 4200 and <= 10000  
VERY-HIGH   > 10000  
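
Because yearly totals are fractional, it helps to be explicit about which class a value on or near a boundary falls into. The following is a minimal sketch of the thresholds as a reusable function. The name `classify` and the constants are hypothetical, the label strings are the ones used in the notebook's code cell, and the boundary handling follows the comparisons in that cell (lower bounds inclusive, 10000 itself counted as HIGH).

```python
from bisect import bisect_right

# Lower bounds (kWh) at which LOW, MED and HIGH start.
LOWER_EDGES = [1000, 2700, 4200]
LABELS = ["VERY-LOW", "LOW", "MED", "HIGH"]
UPPER_LIMIT = 10000  # totals above this are VERY-HIGH


def classify(annual_kwh: float) -> str:
    """Map an annual consumption in kWh to a class label."""
    if annual_kwh > UPPER_LIMIT:
        return "VERY-HIGH"
    # bisect_right counts how many lower bounds are <= annual_kwh,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(LOWER_EDGES, annual_kwh)]


for value in (999.9, 1000, 2699.5, 2700, 4200, 10000, 10000.1):
    print(value, classify(value))
```

Keeping the thresholds in one place like this makes it easier to refer to the same classification from several notebooks without the ranges drifting apart.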


Given that this classification needs to be referred to in several places, for the moment we will just create a new csv file containing the columns 'LCLid', 'annual_KWH' and 'class_'.
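
Once the file exists, other steps can load it and look up a consumer's class. The sketch below shows one possible way to do that. It reads from an in-memory stand-in rather than the real file, the rows are made-up illustrative values (including the placeholder ids), and the decision to keep only the LOW, MED and HIGH classes for plotting is an assumption based on the heading of this step.

```python
import io
import pandas as pd

# Stand-in for data_2013/2013_ids_classified.csv. The ids and totals are
# made-up illustrative values, not results from the dataset.
csv_text = """LCLid,annual_KWH,class_
MAC_A,850.0,VERY-LOW
MAC_B,2100.5,LOW
MAC_C,3900.0,MED
MAC_D,5200.25,HIGH
"""

classified = pd.read_csv(io.StringIO(csv_text))

# Map each LCLid to its class for quick lookups elsewhere in the analysis.
class_by_id = classified.set_index("LCLid")["class_"].to_dict()

# Assumption: only the Low, Medium and High classes are plotted.
plotted = classified[classified["class_"].isin(["LOW", "MED", "HIGH"])]
print(plotted["class_"].value_counts().to_dict())
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path written by the code cell.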

In [2]:
# Read each MACnnnnn.csv file in the data_2013 directory.
# Work out the annual consumption and classify the LCLid as L/M/H
# Create a new csv file with this data

import os
import pandas as pd

rows = []  # one record per LCLid, turned into a dataframe at the end

for filename in sorted(os.listdir("data_2013")):

    if filename.startswith("MAC") and filename.endswith(".csv"):
        LCLid = filename.split(".")[0]
        print("Processing: ", LCLid)
        df = pd.read_csv(os.path.join("data_2013", filename))
        # Do some tidying up of the dataframe.
        df = df.rename(columns={df.columns[-1]: 'KWH'})    # last column holds the readings
        df['KWH'] = pd.to_numeric(df['KWH'], errors="coerce").astype(float)  # non-numeric values become NaN
        df = df.drop_duplicates()                           # drop all duplicate lines
        df = df.dropna(subset=['KWH']).copy()               # drop rows without a usable reading
        df['DateTime'] = pd.to_datetime(df['DateTime'])
        df = df[df['DateTime'].dt.minute.isin([0, 30])]     # keep readings on the hour or half hour
        # Work out annual consumption
        annual_KWH = df["KWH"].sum()

        # Classify
        if annual_KWH < 1000:
            class_ = "VERY-LOW"
        elif annual_KWH < 2700:
            class_ = "LOW"
        elif annual_KWH < 4200:
            class_ = "MED"
        elif annual_KWH <= 10000:
            class_ = "HIGH"
        else:
            class_ = "VERY-HIGH"

        # Add to the list of records
        rows.append({
            "LCLid": LCLid,
            "annual_KWH": annual_KWH,
            "class_": class_,
        })


classify_df = pd.DataFrame(rows, columns=["LCLid", "annual_KWH", "class_"])
classify_df.to_csv("data_2013/2013_ids_classified.csv", index=False)

Processing:  MAC000002
Processing:  MAC000003
Processing:  MAC000004
Processing:  MAC000005
Processing:  MAC000006
Processing:  MAC000007
Processing:  MAC000008
Processing:  MAC000009
Processing:  MAC000010
Processing:  MAC000011
Processing:  MAC000012
Processing:  MAC000013
Processing:  MAC000014
Processing:  MAC000015
Processing:  MAC000016
Processing:  MAC000017
Processing:  MAC000018
Processing:  MAC000019
Processing:  MAC000020
Processing:  MAC000021
Processing:  MAC000022
Processing:  MAC000023
Processing:  MAC000024
Processing:  MAC000025
Processing:  MAC000026
Processing:  MAC000027
Processing:  MAC000028
Processing:  MAC000029
Processing:  MAC000030
Processing:  MAC000031
Processing:  MAC000032
Processing:  MAC000033
Processing:  MAC000034
Processing:  MAC000035
Processing:  MAC000036
Processing:  MAC000037
Processing:  MAC000038
Processing:  MAC000039
Processing:  MAC000040
Processing:  MAC000041
Processing:  MAC000042
Processing:  MAC000043
Processing:  MAC000044
Processing: