## **Imports**

In [1]:
import sys, os, warnings
from os import listdir

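# Notebooks do not define __file__, so the quoted '__file__' makes dirname() return ''
# and the path resolves to ../utils relative to the current working directory.
path2add = os.path.normpath(os.path.abspath(os.path.join(os.path.dirname('__file__'), os.path.pardir, 'utils')))
if path2add not in sys.path:
    sys.path.append(path2add)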

import pandas as pd
from cleaning_utils import clean_data
from cleaning_utils import split_csv

pd.set_option('display.max_columns', None)
warnings.filterwarnings("ignore")

## **Loading Data**

In [2]:
all_paths = []
fol_name = r"C:\Users\angel\Desktop\Data Analysis\Ironhack\Final Bootcamp Project\Smart-Home-Energy-Consumption-Project\data"
for e in os.listdir(fol_name):
    full = os.path.join(fol_name, e)
    if os.path.isfile(full) and full.endswith('.csv'):
        all_paths.append(full)

data = pd.concat(pd.read_csv(f, low_memory=False) for f in all_paths)
data = data.reset_index(drop=True)
data.head()


ValueError: No objects to concatenate
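
The `ValueError` above is what `pd.concat` raises when it receives no frames, which here means no `.csv` files were found in `fol_name`. The following is a minimal sketch of a loader that fails earlier with a clearer message. The helper name `load_csv_folder` is hypothetical and not part of `cleaning_utils`; unlike the cell above, it also matches the `.csv` extension case-insensitively.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Concatenate every .csv file in `folder` into one DataFrame.

    Hypothetical helper for illustration; not part of cleaning_utils.
    """
    paths = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )
    if not paths:
        # pd.concat would raise "No objects to concatenate" here,
        # so fail early and name the folder that was searched.
        raise FileNotFoundError(f"No .csv files found in {folder!r}")
    frames = [pd.read_csv(p, low_memory=False) for p in paths]
    # ignore_index=True replaces the separate reset_index(drop=True) step.
    return pd.concat(frames, ignore_index=True)
```

With this in place, `data = load_csv_folder(fol_name)` would replace the loop and the two `data = ...` lines.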

## **Variable Segmentation**

> ##### *Energy Variables*
>
> - **use:** numerical value representing the total energy consumption, expressed in kW
> - **gen:** numerical value indicating the total energy generated by solar power devices, expressed in kW 
> - **dishwasher, furnace, home office, fridge, wine cellar, garage door, kitchen, barn, well, microwave, living room:** all numerical values representing energy consumption for specific appliances, measured in kW
> 
> ##### *Weather Variables*
> 
> - **temperature:** numerical value, presumably expressed in degrees Fahrenheit, indicating the air temperature
> - **humidity:** numerical value between 0 and 1 expressing the relative amount of water vapor in the air
> - **visibility:** a numerical measure ranging from 0 to 10 representing the meteorological optical range
> - **apparent temperature:** the perceived temperature experienced by individuals, factoring in the combined effects of air temperature, humidity, and wind speed 
> - **pressure:** numerical value representing the atmospheric pressure, measured in millibars (mb)
> - **wind speed:** numerical value, presumably expressed in m/s, representing the rate at which air moves horizontally past a specific point.
> - **cloud cover:** numerical value expressed as a percentage, indicating the fraction of the sky obscured by clouds.
> - **wind bearing:** direction from which the wind is blowing, typically measured in degrees clockwise from true north.
> - **precipitation intensity:** numerical value, presumably expressed in mm/h, referring to the rate at which precipitation, such as rain or snow, is falling per unit of time.
> - **dew point:** temperature at which air becomes saturated with water vapor, leading to the formation of dew or condensation.
> - **precipitation probability:** likelihood or chance of precipitation occurring within a given time period and location.
> 
> ##### *Other Variables*
> 
> - **time:** timestamp of the reading; each row represents one minute over roughly one year
> - **summary:** summarizes the overall weather conditions in a word or short phrase
> - **icon:** probably refers to the icon that is associated with the summary value

## **Profiling**

In [3]:
print("Shape: ", data.shape)
print("Columns: ", data.columns)
data.info()

Shape:  (503911, 33)
Columns:  Index(['Unnamed: 0', 'time', 'use [kW]', 'gen [kW]', 'House overall [kW]',
       'Dishwasher [kW]', 'Furnace 1 [kW]', 'Furnace 2 [kW]',
       'Home office [kW]', 'Fridge [kW]', 'Wine cellar [kW]',
       'Garage door [kW]', 'Kitchen 12 [kW]', 'Kitchen 14 [kW]',
       'Kitchen 38 [kW]', 'Barn [kW]', 'Well [kW]', 'Microwave [kW]',
       'Living room [kW]', 'Solar [kW]', 'temperature', 'icon', 'humidity',
       'visibility', 'summary', 'apparentTemperature', 'pressure', 'windSpeed',
       'cloudCover', 'windBearing', 'precipIntensity', 'dewPoint',
       'precipProbability'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503911 entries, 0 to 503910
Data columns (total 33 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Unnamed: 0           503911 non-null  int64  
 1   time                 503911 non-null  object 
 2   use [kW]             503910 non-null  flo

In [4]:
data.describe()

Unnamed: 0.1,Unnamed: 0,use [kW],gen [kW],House overall [kW],Dishwasher [kW],Furnace 1 [kW],Furnace 2 [kW],Home office [kW],Fridge [kW],Wine cellar [kW],Garage door [kW],Kitchen 12 [kW],Kitchen 14 [kW],Kitchen 38 [kW],Barn [kW],Well [kW],Microwave [kW],Living room [kW],Solar [kW],temperature,humidity,visibility,apparentTemperature,pressure,windSpeed,windBearing,precipIntensity,dewPoint,precipProbability
count,503911.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0,503910.0
mean,251955.0,0.858962,0.076229,0.858962,0.031368,0.09921,0.136779,0.081287,0.063556,0.042137,0.014139,0.002755,0.007023,9e-06,0.05853,0.015642,0.010983,0.035313,0.076229,50.741935,0.664085,9.253444,48.263382,1016.301625,6.649936,202.356843,0.002598,38.694013,0.056453
std,145466.720086,1.058207,0.128428,1.058207,0.190951,0.169059,0.178631,0.104466,0.076199,0.057967,0.014292,0.02186,0.07674,1e-05,0.202706,0.137841,0.098859,0.096056,0.128428,19.113807,0.194389,1.611186,22.027916,7.895185,3.982716,106.520474,0.011257,19.087939,0.165836
min,0.0,0.0,0.0,0.0,0.0,1.7e-05,6.7e-05,8.3e-05,6.7e-05,1.7e-05,1.7e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-12.64,0.13,0.27,-32.08,986.4,0.0,0.0,0.0,-27.24,0.0
25%,125977.5,0.367667,0.003367,0.367667,0.0,0.020233,0.0644,0.040383,0.005083,0.007133,0.012733,0.0005,1.7e-05,0.0,0.029833,0.000983,0.003617,0.001483,0.003367,35.77,0.51,9.42,31.09,1011.29,3.66,148.0,0.0,24.6,0.0
50%,251955.0,0.562333,0.004283,0.562333,1.7e-05,0.020617,0.066633,0.042217,0.005433,0.008083,0.012933,0.000667,5e-05,1.7e-05,0.031317,0.001,0.004,0.001617,0.004283,50.32,0.68,10.0,50.32,1016.53,5.93,208.0,0.0,39.03,0.0
75%,377932.5,0.97025,0.083917,0.97025,0.000233,0.068733,0.080633,0.068283,0.125417,0.053192,0.0131,0.00075,0.000167,1.7e-05,0.032883,0.001017,0.004067,0.00175,0.083917,66.26,0.84,10.0,66.26,1021.48,8.94,295.0,0.0,54.79,0.0
max,503910.0,14.714567,0.613883,14.714567,1.401767,1.934083,0.794933,0.97175,0.851267,1.273933,1.088983,1.166583,2.262583,0.000183,7.0279,1.633017,1.9298,0.465217,0.613883,93.72,0.98,10.0,101.12,1042.46,22.91,359.0,0.191,75.49,0.84


In [5]:
data["icon"].unique()
#data[data["icon"].isna()]

array(['clear-night', 'partly-cloudy-night', 'clear-day', 'cloudy',
       'partly-cloudy-day', 'rain', 'snow', 'wind', 'fog', nan],
      dtype=object)

In [6]:
#data["cloudCover"].unique()
data[data["cloudCover"] == "cloudCover"].shape

(58, 33)
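
The count above looks for one specific string. A more general check, sketched below on a toy Series rather than on the real data, coerces a column to numbers with `pd.to_numeric(..., errors="coerce")` and flags entries that are present but fail to parse. The helper name `non_numeric_mask` and the toy values are illustrative only.

```python
import pandas as pd


def non_numeric_mask(series):
    """Return True where a value is present but cannot be parsed as a number."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Missing values are excluded so that only unparseable entries are flagged.
    return series.notna() & parsed.isna()


# Toy stand-in for data["cloudCover"]; the values are made up.
toy = pd.Series(["0.75", "cloudCover", "0.31", None, "cloudCover"])
mask = non_numeric_mask(toy)
print(mask.sum())          # number of unparseable entries
print(toy[mask].unique())  # which strings they are
```

Applied to `data["cloudCover"]`, the same mask would also reveal any other stray strings, not only repeated header values.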

In [7]:
data["summary"].unique()

array(['Clear', 'Mostly Cloudy', 'Overcast', 'Partly Cloudy', 'Drizzle',
       'Light Rain', 'Rain', 'Light Snow', 'Flurries', 'Breezy', 'Snow',
       'Rain and Breezy', 'Foggy', 'Breezy and Mostly Cloudy',
       'Breezy and Partly Cloudy', 'Flurries and Breezy', 'Dry',
       'Heavy Snow', nan], dtype=object)

> **observations**
> 
> - The Unnamed: 0 column acts as an index and does not provide useful information.
> - Column names are challenging to identify due to the "[kW]" attached and the lack of standardization in uppercase and lowercase characters.
> - use and House overall, as well as gen and Solar columns, seem to be duplicated.
> - Various columns for furnaces and kitchens could be combined for simplification.
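> - The time column is stored as object and its values are supposed to advance in minutes, not seconds.
> - The last row appears to contain only NaN values (every column except Unnamed: 0 and time has one fewer non-null value than there are rows).
> - The cloudCover column contains 58 rows holding the literal string "cloudCover" instead of a number.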
> 
> **impact**
> 
> - Drop the Unnamed: 0 column.
> - Drop the last row.
> - Replace invalid values in the cloudCover column with the next valid value using the backfill method, considering data time-sensitivity.
> - Standardize column names and remove "[kW]".
> - Check if use and House overall and gen and Solar have identical values, dropping one if they do.
> - Combine furnace and kitchen values into respective single columns.
> - Correct the time values so that they advance in minutes, and split the timestamp into separate columns.
> - Drop the summary and icon columns since they are not expected to be used.
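
The implementation of `clean_data` is not shown in this notebook, so the following is only a rough sketch of how most of the steps listed above could be expressed in pandas. The function name `sketch_clean`, the snake_case naming scheme, and the choice to sum the furnace and kitchen circuits are assumptions. The time correction is left out because the format of the `time` column is not shown here.

```python
import pandas as pd


def sketch_clean(df):
    """Illustrative version of the listed steps; NOT the real clean_data."""
    df = df.drop(columns=["Unnamed: 0", "summary", "icon"], errors="ignore")
    df = df.iloc[:-1].copy()  # drop the last row, which holds only NaN readings

    # Non-numeric cloudCover entries become NaN, then take the next valid value.
    df["cloudCover"] = pd.to_numeric(df["cloudCover"], errors="coerce").bfill()

    # Standardize names, e.g. "Furnace 1 [kW]" -> "furnace_1".
    df.columns = (
        df.columns.str.replace(" [kW]", "", regex=False)
        .str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )

    # Drop a column only if it duplicates its counterpart exactly.
    for keep, dup in [("use", "house_overall"), ("gen", "solar")]:
        if dup in df.columns and df[keep].equals(df[dup]):
            df = df.drop(columns=dup)

    # Sum the per-circuit columns into one column per appliance group.
    for group in ("furnace", "kitchen"):
        parts = [c for c in df.columns if c.startswith(group + "_")]
        df[group] = df[parts].sum(axis=1)
        df = df.drop(columns=parts)
    return df
```

Note that lowercasing also flattens camelCase weather names (`cloudCover` becomes `cloudcover`); a real pipeline might prefer an explicit rename mapping instead.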

## **Data-cleaning pipeline**

In [8]:
output_path = r"C:\Users\angel\Desktop\Data Analysis\Ironhack\Final Bootcamp Project\Smart-Home-Energy-Consumption-Project\data\cleaned_data.csv"
clean_data(data, output_path=output_path)

In [None]:
output_path1 = r"C:\Users\angel\Desktop\Data Analysis\Ironhack\Final Bootcamp Project\Smart-Home-Energy-Consumption-Project\data\clean_df1"
output_path2 = r"C:\Users\angel\Desktop\Data Analysis\Ironhack\Final Bootcamp Project\Smart-Home-Energy-Consumption-Project\data\clean_df2"
split_csv(output_path, output_path1, output_path2)

## **Expectations and goals**
> 
> Upon analyzing the dataset's characteristics, I think I can derive value through two overarching strategies:
> 
> - *Uncovering consumption patterns and their correlations with other variables to offer valuable insights for informed decision-making*
> - *Utilizing identified patterns to deploy machine learning models aimed at automating decisions and optimizing consumption*
> 
> To achieve these objectives, I've outlined the following sub-goals for upcoming phases:
> 
> - **Conduct In-depth Exploratory Analysis:** *Dive deep into the dataset to gain a comprehensive understanding of consumption patterns and the relationships between variables.*
> - **Implement Time Series Baseline Model:** *Develop a foundational time series model capable of predicting total consumption with the help of meteorological and time variables.*
> - **Perform Feature Engineering:** *Engineer additional variables if necessary to enhance the predictive capabilities of the model.*
> - **Refine Goals:** *Refine and specify goals further as more insights are gained from the analysis.*