# Data description
This notebook explores the Electricity Consumption and Occupancy (ECO) dataset, which contains second-by-second power consumption measurements from three Swiss households (04, 05, and 06) collected between June 27, 2012 and January 31, 2013. The dataset is divided into two types of measurements: smart meter data capturing whole-household aggregate consumption, and plug-level data capturing individual appliance consumption.

The smart meter data contains total power consumption for every second of each day, by household. The plug data contains consumption for individual appliances for each second of the day, with each household monitoring 7 or 8 appliances. The appliance sets differ across households but share several common categories: all three households monitor a fridge, an entertainment system (TV and stereo), and coffee-related kitchen appliances. Households 05 and 06 both monitor a kettle and a computer (a PC in household 05, a laptop in household 06), households 04 and 06 each monitor a lamp, and household 04 is the only one with a monitored freezer.

# Data Science Questions
The questions I seek to answer with this dataset are as follows: 

1) How does power usage evolve throughout the day for each household?
2) How does overall power usage evolve over time for each household?
3) Within each household, which appliances dominate consumption?
4) How do the 3 households compare in terms of lifestyle patterns?

In [1]:
# Import packages
import pandas as pd
import altair as alt
import os 
import glob

# Disabling max rows
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

# Chart 1 - How does power usage evolve throughout the day for each household?
This first chart explores how power consumption varies throughout the day for each household using two linked views with a dropdown menu for household selection and a clickable hour selection. 

The top chart is a heatmap with hour of day on the x-axis and day of week on the y-axis, where color intensity encodes average power consumption. This provides an overview of when each household uses the most power. 

The bottom panel is a bar chart showing average power per hour across all days of the week. This provides a cleaner summary of the daily consumption for each household. 

These two views are linked, allowing the user to highlight individual hours in the day. The figure also contains a dropdown menu to select household so that we may see how households compare to each other. I intentionally avoided putting the three households on the same chart because household 04 uses significantly more power than the other two, making it difficult to compare side by side. 
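To make the two aggregation levels concrete, here is a minimal sketch on made-up readings (the `toy` frame and its values are hypothetical, not taken from the ECO data). It mirrors the heatmap cells (mean per hour and day of week) and the bars (mean of those cell means per hour), and suggests that each bar weights every weekday equally rather than every individual reading.

```python
import pandas as pd

# Hypothetical readings: three on Monday and one on Tuesday, all at hour 7
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Monday", "Tuesday"],
    "hour": [7, 7, 7, 7],
    "power": [100.0, 200.0, 300.0, 400.0],
})

# Heatmap level: one average per (hour, day_of_week) cell
cells = (
    toy.groupby(["hour", "day_of_week"])["power"]
    .mean()
    .reset_index(name="avg_power")
)

# Bar level: average of the cell averages for each hour
bars = cells.groupby("hour")["avg_power"].mean()

# For comparison: average over all raw readings for each hour
overall = toy.groupby("hour")["power"].mean()

print(bars.loc[7], overall.loc[7])
```

In this toy case the hour-7 bar sits at 300 W while the mean over all four raw readings is 250 W. The two agree only when every day contributes the same number of valid readings.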

## Data preparation
To prepare the data, the code below reads in one csv file at a time and extracts only the first column. By default, missing values are encoded as -1, so they were replaced with NaN to exclude them from aggregations. A datetime index was constructed for each file using the filename as the start date and generating one timestamp per second for all 86,400 rows. From this, hour of day and day of week columns were extracted as well. The three households were then concatenated into a single dataframe. Due to the size of the data, the function was designed to avoid holding unnecessary data in memory. 
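As a small illustration of why the -1 sentinel has to be converted before averaging, the sketch below applies the same steps to a made-up five-second excerpt (the `raw` values are hypothetical). It assumes a plain float NaN is an acceptable missing-value marker, which keeps the column numeric.

```python
import pandas as pd

# Hypothetical five-second excerpt of one daily file; -1 marks a missing reading
raw = pd.DataFrame({"power": [475.0, -1.0, 480.0, -1.0, 470.0]})

# Averaging the raw column treats -1 as a real measurement
naive_mean = raw["power"].mean()

# Replacing -1 with NaN lets mean() skip the missing seconds
clean = raw.copy()
clean["power"] = clean["power"].replace(-1, float("nan"))
clean_mean = clean["power"].mean()

# One timestamp per row, starting at midnight of the date taken from the filename
clean["timestamp"] = pd.date_range(start="2012-06-27", periods=len(clean), freq="s")
clean["hour"] = clean["timestamp"].dt.hour
clean["day_of_week"] = clean["timestamp"].dt.day_name()

print(naive_mean, clean_mean)
```

Here the naive mean is 284.6 W, while the mean after replacing the sentinel is 475.0 W.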

In [None]:
## Load data
def load_household_power(household_num):
    pattern = f"./eco/eco/{household_num}_sm_csv/{household_num}/*.csv"
    files = sorted(glob.glob(pattern))
    
    dfs = []
    for filepath in files:
        # Extract date from filename
        date_str = os.path.basename(filepath).replace(".csv", "") 
        
        # Read only the first column
        df = pd.read_csv(filepath, header=None, usecols=[0], names=["power"])
        
        # Replace missing values (-1) with a float NaN so the column stays numeric
        df["power"] = df["power"].replace(-1, float("nan"))
        
        # Build datetime index
        df["timestamp"] = pd.date_range(start=date_str, periods=len(df), freq="s")
        df["hour"] = df["timestamp"].dt.hour
        df["day_of_week"] = df["timestamp"].dt.day_name()

        # Household number
        df["household"] = household_num
        
        dfs.append(df)
    
    combined = pd.concat(dfs, ignore_index=True)
    return combined

df_04 = load_household_power("04")
df_05 = load_household_power("05")
df_06 = load_household_power("06")

df1 = pd.concat([df_04, df_05, df_06], ignore_index=True)

In [3]:
# Summary
print(df1.shape)
df1.head()

(51840000, 5)


| | power | timestamp | hour | day_of_week | household |
|---|---|---|---|---|---|
| 0 | 475.051 | 2012-06-27 00:00:00 | 0 | Wednesday | 4 |
| 1 | 473.888 | 2012-06-27 00:00:01 | 0 | Wednesday | 4 |
| 2 | 479.664 | 2012-06-27 00:00:02 | 0 | Wednesday | 4 |
| 3 | 473.089 | 2012-06-27 00:00:03 | 0 | Wednesday | 4 |
| 4 | 476.251 | 2012-06-27 00:00:04 | 0 | Wednesday | 4 |


## Plotting Results
As mentioned above, the plot below shows a heatmap of power consumption per hour of day and day of week, with an associated bar chart of average power consumption per hour. The three households have vastly different power consumption. Household 04 has the highest average consumption, reaching almost 3000 Watts at certain hours of the day. There also appears to be a heavy concentration of power usage around 3-4AM, which is unusual. Household 05 has a more even use of power, but still with a large spike around 11PM on every day of the week. There also appears to be a large spike reaching around 2000 Watts on Fridays around 7PM. Finally, household 06 uses by far the least power on average, with some intense usage around 500 Watts at 6AM on Mondays and around 6-8PM on Sundays.

In [None]:
# Aggregate to hour x day_of_week x household
heatmap_data = (df1
    .groupby(["household", "hour", "day_of_week"])["power"]
    .mean()
    .reset_index()
    .rename(columns={"power": "avg_power"})
)

# Order days 
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Per-date hourly averages (computed here but not used by the charts below)
line_data = (df1
    .groupby(["household", "hour", 
              df1["timestamp"].dt.date.rename("date")])["power"]
    .mean()
    .reset_index()
)
line_data["date"] = pd.to_datetime(line_data["date"])

# Dropdown selection
household_select = alt.binding_select(
    options=["04", "05", "06"],
    name="Household: "
)
household_param = alt.param(
    bind=household_select,
    value="04"
)

# Selection on click 
click = alt.selection_point(fields=["hour"])

# Heatmap
heatmap = (
    alt.Chart(heatmap_data)
    .mark_rect()
    .encode(
        x=alt.X("hour:O", title="Hour of Day"),
        y=alt.Y("day_of_week:O", sort=day_order, title=None),
        color=alt.Color(
            "avg_power:Q",
            scale=alt.Scale(scheme="orangered"),
            title="Avg Power (W)"
        ),
        opacity=alt.condition(click, alt.value(1.0), alt.value(0.5)),
        tooltip=[
            alt.Tooltip("household:N", title="Household"),
            alt.Tooltip("hour:O", title="Hour"),
            alt.Tooltip("day_of_week:O", title="Day"),
            alt.Tooltip("avg_power:Q", title="Avg Power (W)", format=".1f")
        ]
    )
    .add_params(click, household_param)
    .transform_filter(alt.datum.household == household_param)
    .properties(
        title="Average Power Consumption by Hour and Day",
        width=400,
        height=200
    )
)

# Bar chart 
bar_chart = (
    alt.Chart(heatmap_data)
    .mark_bar()
    .encode(
        x=alt.X("hour:O", title="Hour of Day"),
        y=alt.Y("mean(avg_power):Q", title="Avg Power (W)"),
        color=alt.condition(click, alt.value("steelblue"), alt.value("lightgray")),
        tooltip=[
            alt.Tooltip("hour:O", title="Hour"),
            alt.Tooltip("mean(avg_power):Q", title="Avg Power (W)", format=".1f")
        ]
    )
    .transform_filter(alt.datum.household == household_param)
    .add_params(click)
    .properties(
        title="Average Power by Hour",
        width=700,
        height=200
    )
)

chart1 = (heatmap & bar_chart).resolve_scale(color="independent")
chart1


# Chart 2 - How does overall power usage evolve over time for each household?
This chart explores how total household power consumption evolves over the 7-month period from June 2012 to January 2013. 

The top chart is a compact overview of the full time period with no y-axis labels, designed purely as a navigation tool. The user can click and drag across it to select a time window of interest, which then filters the detail chart below to zoom into that period.

The bottom chart shows daily average power consumption at full resolution for the selected time window, with tooltips on each point showing the exact date and wattage. The charts have a household dropdown menu, allowing the user to examine each household's seasonal trends individually. This is particularly useful for identifying patterns such as rising consumption in autumn and winter as heating and lighting demands increase, or quieter periods that might correspond to holidays when the household is unoccupied.
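The daily averages behind both views can be sketched on a tiny made-up frame. This version uses `pd.Grouper` with a daily frequency, which is an assumed alternative to grouping on the calendar date and keeps the result as a datetime column. The `toy` values are hypothetical.

```python
import pandas as pd

# Hypothetical readings every 12 hours over two days for one household
toy = pd.DataFrame({
    "household": ["04"] * 4,
    "timestamp": pd.date_range("2012-06-27", periods=4, freq="12h"),
    "power": [100.0, 300.0, 200.0, float("nan")],
})

# One average per household per calendar day; NaN readings are skipped
daily = (
    toy.groupby(["household", pd.Grouper(key="timestamp", freq="D")])["power"]
    .mean()
    .reset_index()
    .rename(columns={"timestamp": "date"})
)

print(daily)
```

A day with some missing readings still gets an average from its valid readings. Only a day with no valid readings at all ends up as NaN, which would appear as a gap in the line.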

## Plotting Results
The charts below show the evolution of power usage over time. For household 04, there is a large spike in power usage around early December, perhaps indicating the beginning of the holidays. Household 05 has steady usage in the summer months but, as it gets colder, the line becomes more volatile. Finally, household 06 uses less power overall than the other two, but its usage is more volatile throughout the whole period. There also appears to be a large chunk of missing data between November and December, with high usage immediately following.

In [None]:
# Daily average per household
time_data = (
    df1
    .groupby(["household", df1["timestamp"].dt.date.rename("date")])["power"]
    .mean()
    .reset_index()
)

time_data["date"] = pd.to_datetime(time_data["date"])

# Household dropdown parameter
households = sorted(time_data["household"].unique())
household_param = alt.param(
    name="household",
    bind=alt.binding_select(options=households, name="Household: "),
    value=households[0]
)

# Brush on the overview chart
brush = alt.selection_interval(encodings=["x"])

# Overview Chart
overview = (
    alt.Chart(time_data)
    .mark_line()
    .encode(
        x=alt.X("date:T", title=""),
        y=alt.Y("power:Q", title="", axis=alt.Axis(labels=False)),
        color=alt.Color("household:N", legend=None)
    )
    .transform_filter(alt.datum.household == household_param)
    .add_params(brush, household_param)
    .properties(
        width=700,
        height=60,
        title=alt.Title(
            "Overview: Click and drag to select a time window",
            fontSize=12,
            color="gray",
            anchor="start"
        )
    )
)

# Detail chart 
detail = (
    alt.Chart(time_data)
    .mark_line(point=True)
    .encode(
        x=alt.X("date:T", title="Date"),
        y=alt.Y("power:Q", title="Avg Power (W)"),
        color=alt.Color("household:N", title="Household"),
        tooltip=[
            alt.Tooltip("household:N"),
            alt.Tooltip("date:T"),
            alt.Tooltip("power:Q", format=".1f")
        ]
    )
    .transform_filter(brush)
    .transform_filter(alt.datum.household == household_param)
    .properties(
        width=700,
        height=250,
        title=alt.Title(
            "Daily Power Consumption Over Time"
        )
    )
)

# Display chart
chart2 = (overview & detail).add_params(household_param)
chart2

# Chart 3 - Within each household, which appliances dominate consumption?
This chart explores which appliances dominate power consumption within each household, and how that changes over the sample period using a stacked bar chart. 

Each bar represents a month, with the total bar height showing the combined average power of all monitored appliances and each colored segment representing an individual appliance's contribution. This makes it easy to see which appliances consume the most energy within a household and whether the overall consumption profile shifts seasonally.

The household dropdown filters the chart to show one household at a time, which is important since each household monitors a different set of appliances, making a direct overlaid comparison misleading. One thing worth noting is that the plug data only covers a subset of each household's appliances. For example, household 04 has an unmonitored fridge in addition to the monitored one, so the stacked bars represent a partial rather than complete breakdown of total consumption.
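One way to read "dominance" off a stacked bar numerically is to convert each segment into a share of that month's monitored total. The sketch below does this on made-up monthly averages (the `monthly` frame and its values are hypothetical). The shares are relative to the monitored plugs only, not to the household's full consumption.

```python
import pandas as pd

# Hypothetical monthly average power (W) per monitored appliance
monthly = pd.DataFrame({
    "month": ["2012-07"] * 3 + ["2012-08"] * 3,
    "appliance": ["Freezer", "Fridge", "Lamp"] * 2,
    "power": [60.0, 30.0, 10.0, 40.0, 50.0, 10.0],
})

# Share of each appliance within its month's monitored total
monthly["share"] = monthly["power"] / monthly.groupby("month")["power"].transform("sum")

# Appliance with the largest segment in each month
top_rows = monthly.loc[monthly.groupby("month")["power"].idxmax()]
dominant = top_rows["appliance"].tolist()

print(monthly)
print(dominant)
```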

## Data Preparation
The plug data required a different loading strategy from the smart meter data due to memory constraints. Each household's appliances are stored in separate subdirectories, so the loading function iterates over both appliances and their daily CSV files, reading the first column of each file and replacing -1 values with NaN. Rather than storing the raw second-by-second readings, the function immediately aggregates each file down to a single daily average before appending it to the results list. The date and month are extracted directly from the filename. Each household is loaded separately using a dictionary mapping plug numbers to the appliance names, and the three resulting dataframes are concatenated into a single plug_df with columns for household, appliance, date, month, and average power.
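The aggregate-on-read pattern can be tried without touching the real files. The sketch below stands in for two daily plug files with in-memory CSV text (the `fake_files` names and values are hypothetical) and keeps only one summary row per file.

```python
import io
import pandas as pd

# Hypothetical stand-ins for daily plug files: {filename: CSV text}
fake_files = {
    "2012-06-27.csv": "50.0\n-1\n70.0\n",
    "2012-06-28.csv": "-1\n-1\n-1\n",
}

rows = []
for name, text in fake_files.items():
    date_str = name.replace(".csv", "")[:10]
    power = pd.read_csv(
        io.StringIO(text), header=None, usecols=[0], names=["power"]
    )["power"]
    # Missing readings become NaN so they are skipped by mean()
    power = power.replace(-1, float("nan"))
    # Keep a single row per file instead of the per-second readings
    rows.append({"date": date_str, "month": date_str[:7], "power": power.mean()})

summary = pd.DataFrame(rows)
print(summary)
```

A file made up entirely of missing readings produces a NaN daily average rather than an error, so such days drop out of the later monthly means.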

In [2]:
# Delete old data to free space 
# del df1

# Data loading function
def load_household_plugs(household_num, plug_names):
    dfs = []
    for plug_num, appliance in plug_names.items():
        pattern = f"./eco/eco/{household_num}_plugs_csv/{household_num}/{plug_num}/*.csv"
        files = sorted(glob.glob(pattern))
        for filepath in files:
            try:
                date_str = os.path.basename(filepath).replace(".csv", "")[:10]
                df = pd.read_csv(filepath, header=None, usecols=[0], names=["power"])
                # Use a float NaN so the column stays numeric and mean() skips it
                df["power"] = df["power"].replace(-1, float("nan"))
                
                # Aggregate immediately - don't store second-by-second data
                daily_avg = df["power"].mean()
                month = date_str[:7]  # YYYY-MM
                
                dfs.append({
                    "household": household_num,
                    "appliance": appliance,
                    "date": date_str,
                    "month": month,
                    "power": daily_avg
                })
            except Exception as e:
                print(f"Skipping {filepath}: {e}")
    
    return pd.DataFrame(dfs)

# Define appliances per household
plugs_04 = {"01": "Fridge", "02": "Kitchen", "03": "Lamp", "04": "Stereo/Laptop",
            "05": "Freezer", "06": "Tablet", "07": "Entertainment", "08": "Microwave"}
plugs_05 = {"01": "Tablet", "02": "Coffee Machine", "03": "Fountain", "04": "Microwave",
            "05": "Fridge", "06": "Entertainment", "07": "PC", "08": "Kettle"}
plugs_06 = {"01": "Lamp", "02": "Laptop", "03": "Router", "04": "Coffee Machine",
            "05": "Entertainment", "06": "Fridge", "07": "Kettle"}

plug_04 = load_household_plugs("04", plugs_04)
plug_05 = load_household_plugs("05", plugs_05)
plug_06 = load_household_plugs("06", plugs_06)

plug_df = pd.concat([plug_04, plug_05, plug_06], ignore_index=True)

## Plotting Results
The resulting stacked bar chart reveals that household 04 uses a significant amount of power solely on their freezer. Their power usage stays fairly consistent among the appliances over time, with a large dip in January 2013 from decreased freezer usage. Household 05 has consistent usage among its appliances over time, with a small dip after ceasing use of their fountain. A significant amount of power seems to come from entertainment and their computer/tablet. Finally, household 06 has decreasing power consumption over time, especially after September 2012 when they stopped using their router. A significant amount of their power usage comes from Entertainment. 

In [5]:
# Aggregate to month x appliance x household
plug_data = (plug_df
    .groupby(["household", "month", "appliance"])["power"]
    .mean()
    .reset_index()
)

# Household dropdown - reuse the same param pattern
household_select = alt.binding_select(
    options=["04", "05", "06"],
    name="Household: "
)
household_param2 = alt.param(
    bind=household_select,
    value="04"
)

stacked_bar = (
    alt.Chart(plug_data)
    .mark_bar()
    .encode(
        x=alt.X("month:O", title="Month"),
        y=alt.Y("power:Q", title="Avg Power (W)"),
        color=alt.Color("appliance:N", title="Appliance"),
        tooltip=[
            alt.Tooltip("appliance:N", title="Appliance"),
            alt.Tooltip("month:O", title="Month"),
            alt.Tooltip("power:Q", title="Avg Power (W)", format=".1f")
        ]
    )
    .transform_filter(alt.datum.household == household_param2)
    .add_params(household_param2)
    .properties(
        width=700,
        height=400,
        title=alt.Title(
            "Appliance Power Consumption by Month",
            subtitle="Select a household to see its appliance breakdown"
        )
    )
)

stacked_bar

# Chart 4 - How do the 3 households compare in terms of lifestyle patterns? 
Finally, we seek to compare the 3 households side by side in terms of their lifestyle patterns. This is done by grouping appliances into four main categories: Kitchen, Entertainment, Computing, and Lighting (household 05's fountain falls outside these and is labeled Other). Since the three households monitor different appliances, a direct appliance-to-appliance comparison is not meaningful. Instead, appliances are mapped to common categories so that, for example, household 04's fridge, freezer, microwave, and kitchen appliances are grouped together with household 05's fridge, coffee machine, microwave, and kettle under the Kitchen category. This allows a fairer behavioral comparison despite the differing plug setups.

The chart consists of two linked views. The top selector bar chart shows the overall average power per category across all households and months. Clicking any bar switches the detail view below to that category, with unselected bars fading to indicate the active selection. The bottom view shows three small line charts tracking monthly average consumption for the selected category over the sample period for each household. This design lets the user directly compare whether households follow similar seasonal trends within a category, for instance whether all three households show rising entertainment consumption in winter, or whether the pattern is unique to one household.
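The category mapping described above relies on a household-plus-appliance key, so any appliance missing from the dictionary silently becomes NaN and disappears from the grouped results. A minimal sketch of the lookup with a check for unmapped rows follows. The excerpted map and the "Toaster" row are hypothetical and only serve to trigger the check.

```python
import pandas as pd

# Excerpt of a household_appliance -> category mapping (illustrative subset)
category_map = {
    "04_Fridge": "Kitchen",
    "05_Kettle": "Kitchen",
    "06_Lamp": "Lighting",
}

# Hypothetical plug rows; "Toaster" is deliberately absent from the map
toy = pd.DataFrame({
    "household": ["04", "05", "06", "06"],
    "appliance": ["Fridge", "Kettle", "Lamp", "Toaster"],
})

toy["category_key"] = toy["household"] + "_" + toy["appliance"]
toy["category"] = toy["category_key"].map(category_map)

# Rows whose key was not found in the map
unmapped = toy.loc[toy["category"].isna(), "category_key"].tolist()
print(unmapped)
```

Running a check like this before aggregating makes it easier to notice a typo in an appliance name.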

## Plotting Results
Comparing computing, we see that household 05 uses significantly more power than the other two households. Both households 04 and 06 have a dip in computing in the fall. Regarding entertainment, all 3 households use a significant amount of power over time. Household 05 had relatively low entertainment power consumption near the beginning of the sample period, but it increased rapidly and stayed constant around 30-35 Watts alongside the other households. Regarding kitchen power usage, household 04 had the highest average power consumption due to their large freezer; however, this decreases near the end of the sample period. Finally, regarding lighting, households 04 and 06 both had fairly low usage of around 1-5 Watts for most of the sample period. However, around December, household 04's usage increases rapidly to over 40 Watts.

In [6]:
# Define appliance categories
category_map = {
    # Household 04
    "04_Fridge": "Kitchen", "04_Kitchen": "Kitchen", "04_Freezer": "Kitchen",
    "04_Microwave": "Kitchen", "04_Entertainment": "Entertainment",
    "04_Stereo/Laptop": "Computing", "04_Tablet": "Computing",
    "04_Lamp": "Lighting",
    # Household 05
    "05_Fridge": "Kitchen", "05_Coffee Machine": "Kitchen",
    "05_Microwave": "Kitchen", "05_Kettle": "Kitchen",
    "05_Entertainment": "Entertainment", "05_PC": "Computing",
    "05_Tablet": "Computing", "05_Fountain": "Other",
    # Household 06
    "06_Fridge": "Kitchen", "06_Coffee Machine": "Kitchen",
    "06_Kettle": "Kitchen", "06_Entertainment": "Entertainment",
    "06_Laptop": "Computing", "06_Router": "Computing",
    "06_Lamp": "Lighting"
}

plug_df["category_key"] = plug_df["household"] + "_" + plug_df["appliance"]
plug_df["category"] = plug_df["category_key"].map(category_map)

# Aggregate to category level per household per date
category_data = (plug_df
    .groupby(["household", "category", "date"])["power"]
    .mean()
    .reset_index()
)
category_data["date"] = pd.to_datetime(category_data["date"])

# Aggregate to category x household x month
category_data["month"] = category_data["date"].dt.to_period("M").astype(str)
monthly_category = (category_data
    .groupby(["household", "category", "month"])["power"]
    .mean()
    .reset_index()
)

# Click selection on category
category_click = alt.selection_point(
    fields=["category"],
    value=[{"category": "Entertainment"}]  # default selection
)

# Selector bar chart 
selector = (
    alt.Chart(monthly_category)
    .mark_bar()
    .encode(
        x=alt.X("category:N", title=""),
        y=alt.Y("mean(power):Q", title="Avg Power (W)",
                scale=alt.Scale(zero=True, padding=10)),
        color=alt.Color("category:N", legend=None),
        opacity=alt.condition(category_click, alt.value(1.0), alt.value(0.3)),
        tooltip=[
            alt.Tooltip("category:N", title="Category"),
            alt.Tooltip("mean(power):Q", title="Avg Power (W)", format=".1f")
        ]
    )
    .add_params(category_click)
    .properties(
        width=400,
        height=200,
        title=alt.Title(
            "Click a category to compare across households",
            fontSize=12,
            color="gray",
            anchor="start"
        )
    )
)

# Main comparison chart 
comparison = (
    alt.Chart(monthly_category)
    .mark_line(point=True)
    .encode(
        x=alt.X("month:O", title="Month"),
        y=alt.Y("power:Q", title="Avg Power (W)"),
        color=alt.Color("household:N", title="Household"),
        tooltip=[
            alt.Tooltip("household:N", title="Household"),
            alt.Tooltip("month:O", title="Month"),
            alt.Tooltip("power:Q", title="Avg Power (W)", format=".1f")
        ]
    )
    .transform_filter(category_click)
    .facet(
        facet=alt.Facet("household:N", title="Household"),
        columns=3
    )
    .properties(
        title=alt.Title(
            "Household Lifestyle Comparison by Category",
            subtitle="Monthly average power consumption per household"
        )
    )
)

lifestyle_chart = selector & comparison
lifestyle_chart

# References 
Beckel, C., Kleiminger, W., Cicchetti, R., Staake, T., & Santini, S. (2014). The ECO data set and the performance of non-intrusive load monitoring algorithms. Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings (BuildSys 2014). ACM, Memphis, USA.

Hickman, J. (2024). Client-side interactive visualization: Altair, Vega/Vega-Lite, and Bokeh. Lecture presented in DSAN 5200: Advanced Data Visualization, Georgetown University, Washington, D.C.

McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445, 51-56.

VanderPlas, J. et al. (2018). Altair: Interactive statistical visualizations for Python. Journal of Open Source Software, 3(32), 1057. 