## Problem

The data we will work with come from five sensors placed in an office, recording light, temperature, humidity and CO2. The sensors take a reading every minute, and the occupancy of the room is determined for each reading. The data were collected with the intention of determining the preferred environmental conditions for office workers.

Data source:  
*Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.*

<img src="office.jpeg">

In [13]:
import pandas as pd 
import numpy as np 
from sklearn.neighbors import LocalOutlierFactor
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%matplotlib inline

In [14]:
# Reading the data
df = pd.read_csv('occupancy.csv')
df.head(10)

Unnamed: 0,ID,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
0,1,2015-02-04 17:51:00,23.18,27272.0,426.0,721.25,0.004793,1
1,2,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
2,3,2015-02-04 17:53:00,23.15,27245.0,426.0,713.5,0.004779,1
3,4,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
4,5,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1
5,6,2015-02-04 17:55:59,23.1,27.2,419.0,701.0,0.004757,1
6,7,2015-02-04 17:57:00,23.1,27.2,419.0,701.666667,0.004757,1
7,8,2015-02-04 17:57:59,23.1,27.2,419.0,699.0,0.004757,1
8,9,2015-02-04 17:58:59,23.1,27.2,419.0,689.333333,0.004757,1
9,10,2015-02-04 18:00:00,23075.0,27175.0,419.0,688.0,0.004745,1


In [15]:
# Dropping the non-required columns
df = df.drop(['date','ID','Occupancy'],axis=1)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8143 entries, 0 to 8142
Data columns (total 5 columns):
Temperature      8143 non-null float64
Humidity         8143 non-null float64
Light            8143 non-null float64
CO2              8143 non-null float64
HumidityRatio    8143 non-null float64
dtypes: float64(5)
memory usage: 318.2 KB


### Basic Technique: Z-scores

In [17]:
# Z-score function for detecting outliers
def outliers_z_score(data, threshold=3):
    """
    Flag values whose z-score is greater than `threshold` or less than
    -`threshold` (default 3) as outliers.

    Returns the tuple produced by np.where; element 0 holds the
    positions of the outliers.
    """
    data = np.asarray(data, dtype=float)
    mean_data = np.mean(data)
    stdev_data = np.std(data)
    # Vectorized z-score: (value - mean) / standard deviation
    z_scores = (data - mean_data) / stdev_data
    return np.where(np.abs(z_scores) > threshold)
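
The z-score of a value is its distance from the mean measured in standard deviations, z = (x - mean) / std. As a minimal sketch of the same idea on made-up numbers (the helper `zscore_outlier_mask` and the array `toy` are illustrative names, not part of the notebook), a boolean mask can be computed like this:

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z-score| > threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0), the np.std default
    if std == 0:
        # All values identical: nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical toy data: 19 readings of 23.0 and one gross error
toy = np.array([23.0] * 19 + [23075.0])
mask = zscore_outlier_mask(toy)
outlier_positions = np.flatnonzero(mask)  # integer positions of flagged values
```

Only the last position is flagged here. Note that an extreme value also inflates the mean and the standard deviation it is compared against: with the population standard deviation, the largest possible |z| among n values is sqrt(n - 1), so at a threshold of 3 nothing can be flagged in a sample of 10 or fewer values.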


In [18]:
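# Applying the Z-score outlier detection technique on the data
for x in df.columns:
    # np.where returns a tuple; element 0 holds the outlier positions
    positions = outliers_z_score(df[x])[0]

    if len(positions):
        # Replacing the detected outliers with 'ANOMALY'.
        # Cast to object first so a string can be stored in a float column,
        # and assign through .iloc to avoid chained assignment.
        df[x] = df[x].astype(object)
        df.iloc[positions, df.columns.get_loc(x)] = 'ANOMALY'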

In [19]:
# inspect data
df.head(20)

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio
0,23.18,27272.0,426.0,721.25,0.00479299
1,23.15,27.2675,429.5,714.0,0.00478344
2,23.15,27245.0,426.0,713.5,0.00477946
3,23.15,27.2,426.0,708.25,0.00477151
4,23.1,27.2,426.0,704.5,0.00475699
5,23.1,27.2,419.0,701.0,0.00475699
6,23.1,27.2,419.0,701.667,0.00475699
7,23.1,27.2,419.0,699.0,0.00475699
8,23.1,27.2,419.0,689.333,0.00475699
9,ANOMALY,27175.0,419.0,688.0,0.00474535
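
Writing the string 'ANOMALY' into a column turns it into an object column, which makes later arithmetic on it awkward. One possible alternative, sketched here on a hypothetical toy frame rather than on the sensor data, is to leave the values numeric and keep the flags in a separate boolean frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sensor data
toy = pd.DataFrame({
    "Temperature": [23.1] * 19 + [23075.0],
    "Light": [419.0] * 20,
})

# Column-wise z-scores using the population std (ddof=0), matching np.std
std = toy.std(ddof=0).replace(0, np.nan)  # constant columns get NaN z-scores
z = (toy - toy.mean()) / std

# NaN compares as False, so constant columns are never flagged
flags = z.abs() > 3

outliers_per_column = flags.sum()  # number of flagged values in each column
```

`flags` has the same shape as `toy`, so `toy[flags.any(axis=1)]` would select the rows with at least one flagged value while every column stays numeric.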


### Advanced Technique: Local Outlier Factor (LOF)

In [20]:
# Re-reading the data, since the z-score step overwrote values with 'ANOMALY'
df = pd.read_csv('occupancy.csv')
df.head(10)

Unnamed: 0,ID,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
0,1,2015-02-04 17:51:00,23.18,27272.0,426.0,721.25,0.004793,1
1,2,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
2,3,2015-02-04 17:53:00,23.15,27245.0,426.0,713.5,0.004779,1
3,4,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
4,5,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1
5,6,2015-02-04 17:55:59,23.1,27.2,419.0,701.0,0.004757,1
6,7,2015-02-04 17:57:00,23.1,27.2,419.0,701.666667,0.004757,1
7,8,2015-02-04 17:57:59,23.1,27.2,419.0,699.0,0.004757,1
8,9,2015-02-04 17:58:59,23.1,27.2,419.0,689.333333,0.004757,1
9,10,2015-02-04 18:00:00,23075.0,27175.0,419.0,688.0,0.004745,1


In [21]:
# Dropping the non-required columns
df = df.drop(['date','ID','Occupancy'],axis=1)

In [22]:
# Applying the Local Outlier Factor method to detect outliers
lof = LocalOutlierFactor()
result = lof.fit_predict(df)
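
LOF compares the density of points around each record with the density around its nearest neighbours; a record that sits in a much sparser region than its neighbours is labelled an outlier. The following is a small sketch on synthetic data (the arrays `cluster` and `points` are made up for illustration), with scikit-learn's default `n_neighbors=20` written out explicitly:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in one cluster plus a single far-away point
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
points = np.vstack([cluster, [[50.0, 50.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(points)        # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_   # near -1 for inliers, far lower for outliers
```

`negative_outlier_factor_` gives a graded score rather than a hard label, which can help when deciding how aggressive the detection should be.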


In [23]:
# Adding the result column to the data.
# In the result column named 'outlier_detected':
#    1 marks records that are clean
#   -1 marks records that are detected as outliers
df['outlier_detected'] = result


In [24]:
# inspect data
df.head(20)

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio,outlier_detected
0,23.18,27272.0,426.0,721.25,0.004793,-1
1,23.15,27.2675,429.5,714.0,0.004783,1
2,23.15,27245.0,426.0,713.5,0.004779,-1
3,23.15,27.2,426.0,708.25,0.004772,1
4,23.1,27.2,426.0,704.5,0.004757,1
5,23.1,27.2,419.0,701.0,0.004757,1
6,23.1,27.2,419.0,701.666667,0.004757,1
7,23.1,27.2,419.0,699.0,0.004757,1
8,23.1,27.2,419.0,689.333333,0.004757,1
9,23075.0,27175.0,419.0,688.0,0.004745,1
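
LOF is distance-based, so columns with large numeric ranges (such as CO2) can dominate columns with small ranges (such as HumidityRatio). One common option, which the notebook itself does not use, is to standardize the columns before fitting. The sketch below assumes a hypothetical two-column frame with an injected error and shows how the flagged rows can then be pulled out:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for two sensor columns on very different scales
toy = pd.DataFrame({
    "CO2": rng.normal(700.0, 10.0, size=200),
    "HumidityRatio": rng.normal(0.0048, 0.0001, size=200),
})
toy.loc[0, "HumidityRatio"] = 0.05  # injected gross error in the small-scale column

# Rescale each column to zero mean and unit variance before computing distances
scaled = StandardScaler().fit_transform(toy)
toy["outlier_detected"] = LocalOutlierFactor().fit_predict(scaled)

# Rows that LOF marked as outliers
flagged = toy[toy["outlier_detected"] == -1]
```

Whether scaling is appropriate depends on the data; here it is only meant to show that the choice of units affects which records LOF considers far from their neighbours.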
