## Description

This notebook demonstrates how outliers are removed from the database for a single-parameter DataFrame. Outliers are detected from the difference between consecutive readings, and each flagged value is replaced with NaN. The OutliersRemovalTools class follows the same principle; the only difference is that its methods update the preprocessed_df attribute.

Original docstring from the class:

        '''
        Method that will remove all of the values that are lower or higher than
        the sum of the average + - std_factor * std dev.
        The average and std dev is considered to be different on each station and on each parameter.
        The outliers will be replaced with a NaN.

        :param std_factor: factor to which multiply the std dev
        :return: updates the preprocessed_df class attribute
        '''
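
The rule stated in the docstring can be sketched as follows. This is a minimal illustration under assumptions, not the class's actual code: the function name `remove_value_outliers`, the toy DataFrame, and the station labels `A` and `B` are all hypothetical. It applies the mean ± std_factor * std band to the values themselves, one band per column.

```python
import numpy as np
import pandas as pd

def remove_value_outliers(df, std_factor=3.0):
    """Hypothetical helper: replace values outside mean +/- std_factor * std with NaN."""
    mean = df.mean()        # one mean per column (station), NaN skipped
    std = df.std(ddof=0)    # population std per column, like np.nanstd
    lower = mean - std_factor * std
    upper = mean + std_factor * std
    # keep the values inside the band; everything else becomes NaN
    return df.where((df >= lower) & (df <= upper))

# toy data with made-up station labels; the last reading of A is a spike
toy = pd.DataFrame({
    "A": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 50.0],
    "B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0],
})
cleaned = remove_value_outliers(toy, std_factor=2.0)
```

A factor of 2 is used here only because the toy sample is tiny: with ten readings, a single spike can never lie more than about 2.85 standard deviations from the mean.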

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import os


## Reading the .csv preprocessed files

In [2]:
#defining paths
preprocessed_path = r"C:\Users\victo\PycharmProjects\DataScienceProj\DS-Proj\Air_modelling\data\preprocessed_data\Parameters"
os.chdir(preprocessed_path)
preprocessed_fileslist = os.listdir()
#load the first .csv file to work on
#in this case, it holds the CO data

#select the file (or files) to process
raw_P_df = pd.read_csv(preprocessed_fileslist[0])
raw_P_df.head()

Unnamed: 0,FECHAHORA,ATM,OBL,LPIN,SFE,TLA,VAL,CEN,AGU,LDO,MIR,FECHA,HORA
0,2016-01-01 00:00:00,1.471,1.01,7.165,3.513,2.215,0.24,0.18,0.615,2.83,4.72,2016-01-01,00:00:00
1,2016-01-01 01:00:00,2.653,1.069,6.272,4.953,1.835,0.387,0.736,1.177,2.15,5.8,2016-01-01,01:00:00
2,2016-01-01 02:00:00,2.712,2.026,7.088,4.286,3.287,0.822,0.948,1.594,1.957,7.098,2016-01-01,02:00:00
3,2016-01-01 03:00:00,2.099,3.375,5.977,4.577,4.691,1.414,2.207,2.074,1.956,6.499,2016-01-01,03:00:00
4,2016-01-01 04:00:00,2.019,2.195,5.833,5.18,4.873,1.277,4.192,1.601,3.221,4.743,2016-01-01,04:00:00


In [3]:
#keep only the station columns; FECHAHORA, FECHA and HORA are not needed for the moment
P_df = raw_P_df[['AGU', 'ATM', 'CEN', 'LDO', 'LPIN', 'MIR', 'OBL', 'SFE', 'TLA', 'VAL']]
P_df.head()

Unnamed: 0,AGU,ATM,CEN,LDO,LPIN,MIR,OBL,SFE,TLA,VAL
0,0.615,1.471,0.18,2.83,7.165,4.72,1.01,3.513,2.215,0.24
1,1.177,2.653,0.736,2.15,6.272,5.8,1.069,4.953,1.835,0.387
2,1.594,2.712,0.948,1.957,7.088,7.098,2.026,4.286,3.287,0.822
3,2.074,2.099,2.207,1.956,5.977,6.499,3.375,4.577,4.691,1.414
4,1.601,2.019,4.192,3.221,5.833,4.743,2.195,5.18,4.873,1.277


In [4]:
#Convert P_df into a float ndarray; copy=True keeps P_arr writable and independent of P_df
P_arr = P_df.to_numpy(dtype=float, copy=True)

#Create fvout_arr (first value out array) which has all the values but the first one
fvout_arr = P_arr[1:,:]

#Create a lvout_arr (last value out array) which has all the values but the last one 
lvout_arr = P_arr[:-1,:]
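
The two shifted arrays exist only so they can be subtracted row by row. As a small check on a made-up array (the values below are illustrative, not taken from the CO data), the subtraction gives the same result as `np.diff` along the row axis:

```python
import numpy as np

# toy array: 4 readings (rows) for 2 hypothetical stations (columns)
arr = np.array([[1.0, 10.0],
                [2.0, 12.0],
                [4.0, np.nan],
                [7.0, 15.0]])

fvout = arr[1:, :]    # all rows but the first
lvout = arr[:-1, :]   # all rows but the last
delta = fvout - lvout # change from each reading to the next

# np.diff along axis 0 computes the same row-to-row differences
delta_diff = np.diff(arr, axis=0)
same = np.array_equal(delta, delta_diff, equal_nan=True)
```

A missing reading propagates into the two deltas that touch it, which is why the later steps use the NaN-aware `np.nanmean` and `np.nanstd`.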


In [5]:
#create a delta_arr array that stores the value of the diff between fvout and lvout
delta_arr = fvout_arr - lvout_arr

#obtain the mean and std of the delta_arr values
mean_delta_arr = np.nanmean(delta_arr)
std_delta_arr = np.nanstd(delta_arr)

#std_factor sets how many standard deviations a delta may deviate from the mean
std_factor = 3

#hscalar is the highest delta allowed before the value is flagged as an outlier
#lscalar works the same way for the lowest delta allowed
hscalar = mean_delta_arr + std_factor * std_delta_arr 
lscalar = mean_delta_arr - std_factor * std_delta_arr 
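
The cell above computes a single mean and standard deviation over every station at once, whereas the docstring describes statistics that differ per station. A hedged sketch of the per-station variant, on made-up deltas and with the hypothetical names `hscalar_cols` and `lscalar_cols`, passes `axis=0` so that each column gets its own thresholds:

```python
import numpy as np

# toy deltas: 4 rows for 2 hypothetical stations, one missing value
delta = np.array([[0.1, 1.0],
                  [-0.2, -2.0],
                  [0.3, np.nan],
                  [-0.1, 3.0]])
std_factor = 3

# axis=0 -> one mean and one std per column (station), ignoring NaN
col_mean = np.nanmean(delta, axis=0)
col_std = np.nanstd(delta, axis=0)

hscalar_cols = col_mean + std_factor * col_std
lscalar_cols = col_mean - std_factor * col_std

# the 1-D thresholds broadcast across the rows of delta
mask = (delta <= lscalar_cols) | (delta >= hscalar_cols)
```

Comparisons against NaN evaluate to False, so missing deltas are never flagged.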



In [1]:
#This step flags the outliers, taking into account that the original array has one more row than delta_arr

#get the indices of the elements whose delta is >= hscalar or <= lscalar
#IMPORTANT: add 1 to the row index, because the values are removed from the main ndarray (P_arr) and not from delta_arr

outliers = np.where((delta_arr <= lscalar) | (delta_arr >= hscalar))
coordinates = list(zip(outliers[0] + 1, outliers[1]))
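
The effect of the + 1 offset is easier to see on a tiny example. The sketch below uses a made-up single-station series with one spike and hand-picked thresholds (both are assumptions for illustration only):

```python
import numpy as np

# toy series: one hypothetical station, spike at row 3
series = np.array([[1.0], [1.1], [0.9], [9.0], [1.0], [1.2]])
delta = series[1:, :] - series[:-1, :]

# thresholds fixed by hand for this illustration
lscalar, hscalar = -5.0, 5.0

rows, cols = np.where((delta <= lscalar) | (delta >= hscalar))
# delta row i compares series rows i and i + 1, so shift by one
coordinates = list(zip(rows + 1, cols))
print(coordinates)
```

Here the flagged positions are rows 3 and 4 of the series. Row 3 is the spike itself; row 4, the reading right after it, is flagged too, because the drop back to the normal level is also a large delta.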




In [1]:
#percentage of data removed
data_to_remove = 100 * len(coordinates) / P_arr.size
print('Percentage of data removed for this matrix: {0:.2f}%'.format(data_to_remove))

In [12]:
#replace the outliers with NaN
for i in range(len(coordinates)):
    P_arr[coordinates[i]] = np.nan
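
As an alternative to the loop, NumPy integer-array indexing can write all the NaNs in one assignment. This is a sketch on a made-up array; `rows` and `cols` stand in for the two arrays returned by `np.where` on the deltas:

```python
import numpy as np

values = np.arange(12, dtype=float).reshape(4, 3)

# hypothetical delta-array indices of the flagged elements
rows = np.array([0, 2])
cols = np.array([1, 2])

# shift the rows by one and assign NaN to every flagged cell at once
values[rows + 1, cols] = np.nan

# share of the matrix that was replaced, as a percentage
pct_removed = 100 * np.isnan(values).sum() / values.size
```

Counting the NaNs after the assignment also counts readings that were already missing, so on real data the count taken from the coordinates list is the safer measure of what was removed.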

In [13]:
#create the new df with outliers removed
processed_P_df = pd.DataFrame(columns=P_df.columns.values, data=P_arr)
processed_P_df.head()

Unnamed: 0,AGU,ATM,CEN,LDO,LPIN,MIR,OBL,SFE,TLA,VAL
0,0.615,1.471,0.18,2.83,7.165,4.72,1.01,3.513,2.215,0.24
1,1.177,2.653,0.736,2.15,6.272,5.8,1.069,,1.835,0.387
2,1.594,2.712,0.948,1.957,7.088,7.098,2.026,4.286,,0.822
3,2.074,2.099,2.207,1.956,5.977,6.499,,4.577,,1.414
4,1.601,2.019,,3.221,5.833,,2.195,5.18,4.873,1.277


In [14]:
#adding date and time columns
processed_P_df['FECHA'] = raw_P_df['FECHA']
processed_P_df['HORA'] = raw_P_df['HORA']
processed_P_df['FECHAHORA'] = raw_P_df['FECHAHORA']
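#set_index returns a new DataFrame, so assign it back before displaying it
processed_P_df = processed_P_df.set_index('FECHAHORA')
processed_P_df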

FECHAHORA,AGU,ATM,CEN,LDO,LPIN,MIR,OBL,SFE,TLA,VAL,FECHA,HORA
2016-01-01 00:00:00,0.615,1.471,0.180,2.830,7.165,4.720,1.010,3.513,2.215,0.240,2016-01-01,00:00:00
2016-01-01 01:00:00,1.177,2.653,0.736,2.150,6.272,5.800,1.069,,1.835,0.387,2016-01-01,01:00:00
2016-01-01 02:00:00,1.594,2.712,0.948,1.957,7.088,7.098,2.026,4.286,,0.822,2016-01-01,02:00:00
2016-01-01 03:00:00,2.074,2.099,2.207,1.956,5.977,6.499,,4.577,,1.414,2016-01-01,03:00:00
2016-01-01 04:00:00,1.601,2.019,,3.221,5.833,,2.195,5.180,4.873,1.277,2016-01-01,04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-31 19:00:00,,,0.378,1.449,0.854,,,0.453,,0.438,2019-12-31,19:00:00
2019-12-31 20:00:00,,,0.551,1.807,1.532,,,0.440,,0.511,2019-12-31,20:00:00
2019-12-31 21:00:00,,,0.970,2.136,2.255,,,0.494,,0.643,2019-12-31,21:00:00
2019-12-31 22:00:00,,,1.268,2.029,1.481,,,0.510,,0.670,2019-12-31,22:00:00
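
The steps of this notebook can be collected into one function. The sketch below is an assumption about how that might look, not the code of OutliersRemovalTools: the name `remove_delta_outliers`, the toy data, and the column labels `X` and `Y` are hypothetical. Like the notebook, it uses one global mean and standard deviation of the deltas.

```python
import numpy as np
import pandas as pd

def remove_delta_outliers(df, std_factor=3.0):
    """Hypothetical helper: NaN out values whose delta to the previous row is extreme."""
    arr = df.to_numpy(dtype=float, copy=True)   # work on a copy, leave df intact
    delta = np.diff(arr, axis=0)                # row-to-row differences
    mean, std = np.nanmean(delta), np.nanstd(delta)
    low = mean - std_factor * std
    high = mean + std_factor * std
    rows, cols = np.where((delta <= low) | (delta >= high))
    arr[rows + 1, cols] = np.nan                # + 1 maps delta rows back to arr rows
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

# toy data: two made-up stations, 40 readings each, one spike in X at row 20
x = [1.0 + 0.1 * (i % 2) for i in range(40)]
y = [2.0 + 0.1 * (i % 2) for i in range(40)]
x[20] = 30.0
toy = pd.DataFrame({"X": x, "Y": y})

cleaned = remove_delta_outliers(toy, std_factor=3)
```

On this toy input, rows 20 and 21 of X become NaN (the spike and the reading after it) and Y is left untouched. The date and time columns would be attached afterwards, as in the cell above.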
