# Udacity Data Scienctist Nano Degree Project 1

## Project Overview
For [Udacity Data Scientist Nano Degree](https://www.udacity.com/course/data-scientist-nanodegree--nd025) project 1 you pick a dataset and using your dataset you choose 3 questions you aspire to answer from the data.

## Dataset
Dataset from the 'Power Laws: Detecting Anomalies in Usage' competition on [drivendata.org](https://www.drivendata.org/competitions/52/anomaly-detection-electricity/page/102/).   

Commercial buildings waste an estimated 15% to 30% of energy used due to poorly maintained, degraded, and improperly controlled equipment.  The goal is to detect abnormal energy consumption, find potential energy saving opportunities and provide actionable recommendations.  

The dataset includes energy consumption data, information about the buildings, historical weather data and public holiday information.  

## Questions
Needs more refinement.  Just a rough list at this point.  
- How much much energy is being wasted?
- What's the environmental impact?
- What's the financial impact?
- What buildings are prone to wasting engergy?
- Is there a time of day more prone to wasting energy?
- Is there a season/time of year more prone to wasting energy?
- How easy/hard/costly is it to fix energy wasting?

Possible outline:  
- How much energy is being wasted and the impact
- What attributes most to wasting engergy (building, time of day, time of year)
- Recommendations to fix

## Import neccessary libraries

In [2]:
import pandas as pd

## Common functions

In [246]:
def get_dataframe_info(df, print_info=True):
    d = {}
    
    d['num_rows'] = df.shape[0]
    d['num_cols'] = df.shape[1]
    d['num_nans'] = df_metadata.isnull().sum(axis=0).sum()
    
    d['col_names'] = df.columns.tolist()
    d['col_dtypes'] = df.dtypes.tolist()
    d['col_nans'] = df_metadata.isnull().sum(axis=0).tolist()
    
    col_data_range = []
    col_data_range_count = []
    for col_name, col_dtype in zip(d['col_names'], d['col_dtypes']):
        if col_dtype in ['int64', 'float64', 'datetime64[ns]']:
            col_data_range.append([df[col_name].min(), df[col_name].max()])
            col_data_range_count.append([])
        else:
            #col_data_range.append(df[col_name].unique().tolist())
            col_data_range.append(df[col_name].value_counts().index.tolist())
            col_data_range_count.append(df[col_name].value_counts().tolist())
    d['col_data_range'] = col_data_range
    d['col_data_range_count'] = col_data_range_count
    
    col_number_unique_vals = []
    col_percent_unique_vals = []
    for col_name in d['col_names']:
        col_number_unique_vals.append(len(df[col_name].unique()))
        col_percent_unique_vals.append(round(((len(df[col_name].unique())/d['num_rows'])*100),2))
    d['col_number_unique_vals'] = col_number_unique_vals
    d['col_percent_unique_vals'] = col_percent_unique_vals
        
    return d

In [247]:
df_metadata['units'].value_counts().index.tolist()

['kWh', 'degree celsius', 'Wh', 'count']

In [248]:
df_metadata['units'].value_counts().tolist()

[113, 42, 21, 2]

In [249]:
get_dataframe_info(df_metadata)

{'num_rows': 187,
 'num_cols': 6,
 'num_nans': 175,
 'col_names': ['site_id',
  'meter_id',
  'meter_description',
  'units',
  'surface',
  'activity'],
 'col_dtypes': [dtype('O'),
  dtype('O'),
  dtype('O'),
  dtype('O'),
  dtype('float64'),
  dtype('O')],
 'col_nans': [0, 0, 0, 9, 166, 0],
 'col_data_range': [['038', '234_203', '334_61'],
  ['38_52472',
   '38_9729',
   '38_10108',
   '38_9797',
   '38_52478',
   '38_52333',
   '38_52476',
   '38_56728',
   '920',
   '38_9795',
   '38_9799',
   '38_0',
   '38_10117',
   '38_9685',
   '38_9681',
   '38_56727',
   '38_56737',
   '38_9714',
   '38_52475',
   '38_52327',
   '38_9695',
   '935',
   '38_9762',
   '38_10120',
   '38_9763',
   '38_52479',
   '38_9748',
   '38_52379',
   '38_9706',
   '38_9732',
   '38_10123',
   '38_56732',
   '38_2',
   '38_59804',
   '38_9731',
   '869',
   '38_9726',
   '911',
   '38_9687',
   '38_9733',
   '38_9688',
   '38_52473',
   '38_10114',
   '38_9760',
   '38_52467',
   '925',
   '38_56731',
   

## Load the data

In [133]:
df_metadata = pd.read_csv('~/data/DSND-Project-1/power-laws-detecting-anomalies-in-usage-metadata.csv', sep=';')
df_holidays = pd.read_csv('~/data/DSND-Project-1/power-laws-detecting-anomalies-in-usage-holidays.csv', sep=';')
df_weather = pd.read_csv('~/data/DSND-Project-1/power-laws-detecting-anomalies-in-usage-weather.csv', sep=';')
df_energy = pd.read_csv('~/data/DSND-Project-1/train.csv', sep=',')

## Understand the data

### Metadata

In [77]:
df_metadata.head()

Unnamed: 0,site_id,meter_id,meter_description,units,surface,activity
0,038,38_9759,other,kWh,,office
1,038,38_9729,other,kWh,,office
2,038,38_9742,other,kWh,,office
3,234_203,863,main meter,Wh,5750.0,office
4,038,38_56030,heating,kWh,,laboratory


In [30]:
get_dataframe_info(df_metadata)

Number of rows: 187 Number of columns: 1
['site_id;meter_id;meter_description;units;surface;activity']


(187, 1)

### Holidays

In [89]:
df_holidays.head()

Unnamed: 0,Date,Holiday,site_id
0,2014-10-28,Ohi Day,334_61
1,2010-05-24,Whit Monday,334_61
2,2016-03-28,Easter Monday,038
3,2012-08-15,Assumption of Mary to Heaven,038
4,2016-08-15,Assumption of Mary to Heaven,334_61


In [134]:
# convert date to datetime64
df_holidays['Date'] = pd.to_datetime(df_holidays['Date'])

In [143]:
get_dataframe_info(df_holidays)

Number of rows: 234
Number of columns: 3
Column names:, Date, Holiday, site_id
Total NaNs: 175
NaNs by column:, 0, 0, 0, 9, 166, 0
dtypes by column:, datetime64[ns], object, object
column data ranges:, [Timestamp('2010-01-01 00:00:00'), Timestamp('2018-12-26 00:00:00')], ['Labour Day', 'Easter Monday', 'New year', 'Whit Monday', 'Christmas Day', 'Assumption of Mary to Heaven', 'Armistice Day', 'Ascension Thursday', 'Bastille Day', 'All Saints Day', 'Glorifying Mother of God', 'Annunciation', 'Good Friday', 'Pentecost', 'Epiphany', 'Clean Monday', 'Easter Sunday', 'Ohi Day', 'Victory in Europe Day', 'Independence Day'], ['334_61', '038']


(234, 3, ['Date', 'Holiday', 'site_id'], 175, [0, 0, 0, 9, 166, 0])

### Weather

In [90]:
df_weather.head()

Unnamed: 0,Timestamp,Temperature,Distance,site_id
0,2013-12-23T19:00:00-06:00,16.0,10.125819,
1,2013-12-28T18:00:00-06:00,18.0,10.125819,
2,2013-12-29T04:00:00-06:00,19.1,8.992769,
3,2013-12-30T00:30:00-06:00,16.0,10.125819,
4,2014-01-01T03:00:00-06:00,18.0,10.125819,


In [129]:
get_dataframe_info(df_weather)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



(391610,
 4,
 ['Timestamp', 'Temperature', 'Distance', 'site_id'],
 175,
 [0, 0, 0, 9, 166, 0])

### Energy Usage

In [92]:
df_energy.head()

Unnamed: 0.1,Unnamed: 0,meter_id,Timestamp,Values
0,2532,2,2015-06-11 00:00:00,2035.0
1,2543,2,2015-06-11 00:15:00,2074.0
2,2544,2,2015-06-11 00:30:00,2062.0
3,2525,2,2015-06-11 00:45:00,2025.0
4,2534,2,2015-06-11 01:00:00,2034.0


In [128]:
get_dataframe_info(df_energy)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



(43668606,
 4,
 ['Unnamed: 0', 'meter_id', 'Timestamp', 'Values'],
 175,
 [0, 0, 0, 9, 166, 0])