## Data Cleaning
    - Handle duplicate rows/entries
        - check for duplicates with respect to the identifier columns, then drop the duplicated rows and keep the latest entry
    
    - Handle missing values
        - the data exists but is missing because of a human or system error
            - if a column has more than 80% of its values missing, drop the column
            - if a row has more than 60% of its values missing, drop the row
            - if a column has 5% to 20% of its values missing, impute them with the mean/median/mode
            - if a column has more than 20% (but no more than 80%) of its values missing, use ML-based imputation
        
        - the data is missing because it genuinely does not exist
            - convert the column into a binary or a categorical attribute
            
    - Handle unwanted columns
    - Handle outliers and unnatural values
        - if the proportion of outliers is small (less than 2%), drop the rows
        - capping: replace outliers with the nearest inlier values
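
The column-level rules in the outline can be sketched as follows. This is a minimal illustration on a made-up frame (`toy` and its column names are hypothetical, not the dataset loaded below): it computes the fraction of missing values per column, drops columns above the 80% threshold, and adds a binary indicator for a column whose values are assumed to be missing because they do not exist.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's dataset.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 4.0, 5.0],         # 60% missing
    "c": [np.nan, np.nan, np.nan, np.nan, np.nan],   # 100% missing
    "d": [1.0, np.nan, 3.0, 4.0, 5.0],               # 20% missing
})

# Fraction of missing values per column.
col_missing = toy.isnull().mean()

# Drop the columns with more than 80% of their values missing.
to_drop = col_missing[col_missing > 0.80].index
reduced = toy.drop(columns=to_drop)

# Assumption: values in "d" are missing because they do not exist,
# so that fact is recorded as a binary attribute instead of being imputed.
reduced["d_is_missing"] = reduced["d"].isnull().astype(int)

print(col_missing)
print(reduced)
```

Only column `c` crosses the 80% threshold here; column `b` would fall under the "more than 20%" rule instead.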

In [1]:
import pandas as pd
df = pd.read_csv(r"D:\AI\data\datasets-1\datawh_missing.csv",na_values=['?','.'])
df.shape

(23, 7)

In [2]:
df.head()

Unnamed: 0,Dates,Temperature,Humidity,Pressure,Air Quality,Day id,Vibration
0,30-04-2018,218.0,182.0,4.0,2.0,1,45
1,01-05-2018,,182.0,3.0,2.0,2,56
2,02-05-2018,,439.0,,0.0,3,45
3,03-05-2018,2439.0,53.0,5.0,1.0,4,23
4,04-05-2018,824.0,444.0,5.0,,5,35


### Handle duplicate entries

In [3]:
# check for duplicates
df.duplicated().sum()

2

In [4]:
# to see the duplicated rows
df[df.duplicated(keep=False)]

Unnamed: 0,Dates,Temperature,Humidity,Pressure,Air Quality,Day id,Vibration
19,19-05-2018,766.0,535.0,3.0,2.0,20,39
20,19-05-2018,766.0,535.0,3.0,2.0,20,39
21,19-05-2018,766.0,535.0,3.0,2.0,20,39


In [5]:
# dropping duplicates
df.drop_duplicates(keep='last',inplace=True)

In [6]:
# check for duplicates
df.duplicated().sum()

0
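
The cells above treat a row as a duplicate only when every column matches. The outline also mentions checking duplicates with respect to identifiers; a minimal sketch of that variant follows, assuming a hypothetical identifier column `sensor_id` and a hypothetical timestamp column `updated_at` (neither exists in the dataset above).

```python
import pandas as pd

# Hypothetical readings; "sensor_id" and "updated_at" are made-up column names.
readings = pd.DataFrame({
    "sensor_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2018-05-01", "2018-05-03", "2018-05-02"]),
    "value": [10.0, 12.0, 7.0],
})

# Sort so the latest entry per identifier comes last, keep that one,
# then restore the original row order.
deduped = (
    readings.sort_values("updated_at")
            .drop_duplicates(subset=["sensor_id"], keep="last")
            .sort_index()
)
print(deduped)
```

Sorting first matters: `keep='last'` keeps the last row in the current order, which is the latest entry only if the frame is ordered by time.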

### Handle unwanted columns

In [7]:
df.head()

Unnamed: 0,Dates,Temperature,Humidity,Pressure,Air Quality,Day id,Vibration
0,30-04-2018,218.0,182.0,4.0,2.0,1,45
1,01-05-2018,,182.0,3.0,2.0,2,56
2,02-05-2018,,439.0,,0.0,3,45
3,03-05-2018,2439.0,53.0,5.0,1.0,4,23
4,04-05-2018,824.0,444.0,5.0,,5,35


In [8]:
df.drop(['Day id'],axis=1,inplace=True)

In [9]:
df.head()

Unnamed: 0,Dates,Temperature,Humidity,Pressure,Air Quality,Vibration
0,30-04-2018,218.0,182.0,4.0,2.0,45
1,01-05-2018,,182.0,3.0,2.0,56
2,02-05-2018,,439.0,,0.0,45
3,03-05-2018,2439.0,53.0,5.0,1.0,23
4,04-05-2018,824.0,444.0,5.0,,35


### Handle missing values

In [10]:
# check for missing values
df.isnull().sum()

Dates          0
Temperature    7
Humidity       3
Pressure       7
Air Quality    2
Vibration      0
dtype: int64

In [11]:
df

Unnamed: 0,Dates,Temperature,Humidity,Pressure,Air Quality,Vibration
0,30-04-2018,218.0,182.0,4.0,2.0,45
1,01-05-2018,,182.0,3.0,2.0,56
2,02-05-2018,,439.0,,0.0,45
3,03-05-2018,2439.0,53.0,5.0,1.0,23
4,04-05-2018,824.0,444.0,5.0,,35
5,05-05-2018,1744.0,,5.0,1.0,26
6,06-05-2018,786.0,,5.0,1.0,25
7,07-05-2018,1326.0,309.0,,1.0,26
8,08-05-2018,1804.0,188.0,,2.0,25
9,09-05-2018,,420.0,0.0,1.0,35


In [12]:
# drop the rows having more than 60% of their values missing
# 6 * 0.6 = 3.6, so drop rows with 4 or more missing values, i.e. fewer than 3 non-null values
print(df.shape)
df.dropna(thresh=3,inplace=True)
print(df.shape)

(21, 6)
(20, 6)
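
The `thresh` argument of `dropna` is the minimum number of non-null values a row needs in order to be kept, so the "more than 60% missing" rule has to be converted into that count. A small sketch of the arithmetic, with a hypothetical helper `min_non_null` and a made-up two-row frame:

```python
import math

import numpy as np
import pandas as pd


def min_non_null(n_cols: int, max_missing_frac: float) -> int:
    """Smallest non-null count a row needs to survive the rule
    'drop rows with more than max_missing_frac of their values missing'."""
    max_missing_allowed = math.floor(n_cols * max_missing_frac)
    return n_cols - max_missing_allowed


thresh = min_non_null(6, 0.6)  # 6 columns: floor(3.6) = 3 missing allowed
print(thresh)

# Hypothetical rows to check the rule.
toy = pd.DataFrame([
    [1, np.nan, np.nan, np.nan, np.nan, 6],  # 4 missing, dropped
    [1, 2, np.nan, np.nan, np.nan, 6],       # 3 missing, kept
])
kept = toy.dropna(thresh=thresh)
print(kept)
```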


In [13]:
df.skew(numeric_only=True)

Temperature    0.047677
Humidity      -0.469442
Pressure      -0.780891
Air Quality   -0.410217
Vibration      2.506968
dtype: float64

In [14]:
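# Temperature is nearly symmetric (skew close to 0), so impute it with the mean
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].mean())
# the remaining numeric columns are imputed with their medians
df = df.fillna(df.median(numeric_only=True))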

In [15]:
# check for missing values
df.isnull().sum()

Dates          0
Temperature    0
Humidity       0
Pressure       0
Air Quality    0
Vibration      0
dtype: int64
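
The choice between mean and median above was made by reading the skewness output by hand. One way to automate it is sketched below; the helper `impute_by_skew` and its cutoff of 0.5 are assumptions for illustration, not a rule taken from this notebook, and the data is made up.

```python
import numpy as np
import pandas as pd


def impute_by_skew(frame: pd.DataFrame, skew_limit: float = 0.5) -> pd.DataFrame:
    """Fill numeric NaNs with the mean when |skew| <= skew_limit, else with the median."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        series = out[col]
        use_mean = abs(series.skew()) <= skew_limit
        fill_value = series.mean() if use_mean else series.median()
        out[col] = series.fillna(fill_value)
    return out


# Hypothetical data: one roughly symmetric column, one strongly skewed column.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 2.0, 2.0, 100.0, np.nan],
})
filled = impute_by_skew(toy)
print(filled)
```

The median is less sensitive to extreme values, which is why it is the safer fill for the skewed column.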

### Handle outliers
    
    Check for outliers using skewness or a boxplot.
    Skewness: a value above +1 or below -1 indicates a heavily skewed distribution, which usually points to extreme outliers.
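
The boxplot approach is not shown in the cells below. A minimal sketch of the rule a boxplot draws, assuming the conventional 1.5 × IQR whiskers and a made-up series of values (the helper `iqr_outlier_mask` is hypothetical):

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical readings with one extreme value.
values = pd.Series([23, 25, 25, 26, 26, 35, 35, 45, 45, 56, 400])
mask = iqr_outlier_mask(values)

print(mask.sum())    # number of flagged values
print(mask.mean())   # proportion of flagged values, to compare with the 2% rule
```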
    


In [16]:
df.skew(numeric_only=True)

Temperature    0.054894
Humidity      -0.582281
Pressure      -1.048203
Air Quality   -0.372134
Vibration      2.506968
dtype: float64

In [17]:
# for Pressure (negatively skewed) - drop the bottom 0.1% of rows, i.e. those with extremely low pressure values
print(df.shape)
df = df[df.Pressure>df.Pressure.quantile(0.001)]
print(df.shape)

(20, 6)
(19, 6)


In [18]:
df.skew(numeric_only=True)

Temperature    0.053744
Humidity      -0.503875
Pressure      -0.834369
Air Quality   -0.410217
Vibration      2.422250
dtype: float64

In [19]:
# capping: count the values above the 95th percentile
(df.Vibration>df.Vibration.quantile(0.95)).sum()

1

In [20]:
df = df.copy()  # explicit copy of the filtered frame avoids SettingWithCopyWarning
df['Vibration'] = df['Vibration'].mask(df['Vibration'] > df['Vibration'].quantile(0.995), df['Vibration'].quantile(0.95))


In [21]:
df.skew(numeric_only=True)

Temperature    0.053744
Humidity      -0.503875
Pressure      -0.834369
Air Quality   -0.410217
Vibration      0.246670
dtype: float64
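
Capping can also be written with `Series.clip`, which limits both tails in one call. The sketch below is an alternative formulation, not the method used in the cells above: the helper `cap_at_quantiles`, the 5th/95th percentile limits, and the data are all assumptions for illustration.

```python
import pandas as pd


def cap_at_quantiles(series: pd.Series, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.Series:
    """Clip values below the lower quantile and above the upper quantile."""
    lower, upper = series.quantile(lower_q), series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)


# Hypothetical values with one extreme reading in each tail.
values = pd.Series([1, 20, 21, 22, 23, 24, 25, 26, 27, 28, 500])
capped = cap_at_quantiles(values)

print(values.skew(), capped.skew())  # skewness before and after capping
print(capped.min(), capped.max())
```

Unlike dropping rows, capping keeps the row count unchanged, which matters on a dataset as small as this one.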