At this point, if we want to analyze the performance of the sensors on the atmospheric observations, we can use the Data Cleaned datasets. The reason is simple: they preserve the null data, which matters especially when we want to analyze the moments when the sensors fail. It is important to remember that the measurements taken while the sensors were out of service were dropped from those datasets.

However, our statistical analyses require us to deal with all the null data, so this notebook addresses that problem.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df1 = pd.read_csv("Atmospheric Data Cleaned - Time Format.csv")

In [3]:
df1.head()

,O3,O3_flag,NO2,NO2_flag,NO,NO_flag,CO,CO_flag,PM10,PM10_flag,...,WDir_Avg,WDir_SD,Rain_Tot,Press_Avg,Rad_Avg,Year,Month,Day,Hour,Minute
0,55.48,OK,0.72,OK,0.2,BDL,0.25,OK,25.47,OK,...,173.6,14.26,0.0,805.409,,2023,5,1,0,0
1,55.49,OK,0.81,OK,0.2,BDL,0.26,OK,25.74,OK,...,171.0,10.53,0.0,805.524,,2023,5,1,0,1
2,55.4,OK,0.93,OK,0.2,BDL,0.27,OK,26.6,OK,...,178.6,15.72,0.0,805.436,,2023,5,1,0,2
3,55.2,OK,0.87,OK,0.2,BDL,0.28,OK,27.59,OK,...,186.1,17.43,0.0,805.45,,2023,5,1,0,3
4,55.41,OK,0.98,OK,0.2,BDL,0.28,OK,27.83,OK,...,211.5,21.67,0.0,805.504,,2023,5,1,0,4


In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 443518 entries, 0 to 443517
Data columns (total 26 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   O3          443461 non-null  float64
 1   O3_flag     443518 non-null  object 
 2   NO2         443211 non-null  float64
 3   NO2_flag    443518 non-null  object 
 4   NO          443315 non-null  float64
 5   NO_flag     443518 non-null  object 
 6   CO          443410 non-null  float64
 7   CO_flag     443518 non-null  object 
 8   PM10        440555 non-null  float64
 9   PM10_flag   443518 non-null  object 
 10  PM2.5       435279 non-null  float64
 11  PM2.5_flag  443518 non-null  object 
 12  Temp_Avg    443508 non-null  float64
 13  RH_Avg      443508 non-null  float64
 14  WSpeed_Avg  443508 non-null  float64
 15  WSpeed_Max  443508 non-null  float64
 16  WDir_Avg    443508 non-null  float64
 17  WDir_SD     443508 non-null  float64
 18  Rain_Tot    443508 non-null  float64
 19  Pr

As we can see, only the time and flag columns have no null data; the other 15 columns have missing values. Therefore, I will drop the rows that have fewer than 18 non-null values out of the 26 columns, that is, rows with more than 8 missing values.

In [5]:
df1.dropna(axis=0, thresh=18, inplace=True)
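A note on how `thresh` behaves: it counts the non-null values a row must have to be kept, not the number of missing values that triggers a drop. The sketch below illustrates this on a small made-up frame (the `toy` data and its column names are hypothetical, not part of the station dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical 4-column frame; each row has a different number of nulls.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [2.0, 2.0, np.nan],
        "c": [3.0, 3.0, np.nan],
        "d": [4.0, 4.0, 4.0],
    }
)

# thresh=3 keeps every row with at least 3 non-null values.
# Row 0 has 4, row 1 has 3, row 2 has only 1, so row 2 is dropped.
kept = toy.dropna(axis=0, thresh=3)
kept_index = kept.index.tolist()
print(kept_index)
```

By the same logic, `thresh=18` on a 26-column frame keeps rows with up to 8 missing values.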

In [6]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 443508 entries, 0 to 443517
Data columns (total 26 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   O3          443451 non-null  float64
 1   O3_flag     443508 non-null  object 
 2   NO2         443201 non-null  float64
 3   NO2_flag    443508 non-null  object 
 4   NO          443305 non-null  float64
 5   NO_flag     443508 non-null  object 
 6   CO          443400 non-null  float64
 7   CO_flag     443508 non-null  object 
 8   PM10        440545 non-null  float64
 9   PM10_flag   443508 non-null  object 
 10  PM2.5       435269 non-null  float64
 11  PM2.5_flag  443508 non-null  object 
 12  Temp_Avg    443508 non-null  float64
 13  RH_Avg      443508 non-null  float64
 14  WSpeed_Avg  443508 non-null  float64
 15  WSpeed_Max  443508 non-null  float64
 16  WDir_Avg    443508 non-null  float64
 17  WDir_SD     443508 non-null  float64
 18  Rain_Tot    443508 non-null  float64
 19  Press_A

Moving forward, it is important to note that the radiation average variable (Rad_Avg) has too many missing values: only 232948 of the 443508 rows have a value. So, the percentage of non-missing values is:

In [8]:
round(232948*100/443508, 2)

52.52
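The same share can be computed from the column itself instead of typing the counts by hand. This is a minimal sketch on a made-up frame (the `toy` values are hypothetical); on the real data the expression would be applied to `df1["Rad_Avg"]`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Rad_Avg column: 2 of 4 values are present.
toy = pd.DataFrame({"Rad_Avg": [np.nan, 10.0, np.nan, 12.5]})

# notna() gives a boolean mask; its mean is the fraction of non-missing values.
pct_present = round(toy["Rad_Avg"].notna().mean() * 100, 2)
print(pct_present)
```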

In any other case, I would drop this column. However, it is one of the most important columns for our analysis. So, the solution that I suggest is the following:
- Filling the missing values of this column would not be representative: with almost half of the values imputed, any analysis with this variable would not be useful.
- So, whenever we use this variable, we will take into consideration only the rows where it is not missing.
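One possible way to apply this rule is to build a filtered view only at the moment the radiation variable is needed, leaving the main frame untouched. The sketch below assumes a tiny made-up frame (the `toy` values are hypothetical) and uses `dropna` with `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Rad_Avg is missing in half of the rows.
toy = pd.DataFrame(
    {
        "O3": [55.48, 55.49, 55.40, 55.20],
        "Rad_Avg": [np.nan, 120.0, np.nan, 340.0],
    }
)

# Keep only the rows where Rad_Avg was measured.
# dropna returns a new frame, so `toy` keeps all of its rows.
rad = toy.dropna(subset=["Rad_Avg"])
rad_mean = rad["Rad_Avg"].mean()
print(len(toy), len(rad), rad_mean)
```

Any statistic that involves radiation would then be computed on `rad`, while analyses of the other variables continue to use the full frame.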

Now, we only have to deal with some rows with null data in the air quality variables. The meteorological variables (except for Rad_Avg) have no missing values thanks to the drop policy applied at the beginning of this notebook.

Then, we will fill the missing values of these columns with their median values. The reason is quite simple: the median is robust to outliers and, since fewer than 2% of the values are missing in any of these columns, the fill barely changes their distributions (which is very important for our analysis).

In [9]:
columns = ["O3", "NO2", "NO", "CO", "PM10", "PM2.5"]

In [10]:
for c in columns:
    df1[c] = df1[c].fillna(df1[c].median())
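To see what a median fill does to a column, it can help to compare summary statistics before and after. The following sketch uses a made-up series with one outlier (hypothetical values, not taken from the dataset): the median stays the same, while the spread shrinks because the filled values sit at the center.

```python
import numpy as np
import pandas as pd

# Hypothetical series: five observed values (one outlier) and two nulls.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan, np.nan])

# The median ignores nulls and is not pulled by the outlier.
med = s.median()
filled = s.fillna(med)

# Compare the median and the standard deviation before and after the fill.
print(med, filled.median())
print(s.std(), filled.std())
```

The shrinkage grows with the share of filled values, which is why this approach is more defensible for columns with few nulls than for a column like Rad_Avg.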

In [11]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 443508 entries, 0 to 443517
Data columns (total 26 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   O3          443508 non-null  float64
 1   O3_flag     443508 non-null  object 
 2   NO2         443508 non-null  float64
 3   NO2_flag    443508 non-null  object 
 4   NO          443508 non-null  float64
 5   NO_flag     443508 non-null  object 
 6   CO          443508 non-null  float64
 7   CO_flag     443508 non-null  object 
 8   PM10        443508 non-null  float64
 9   PM10_flag   443508 non-null  object 
 10  PM2.5       443508 non-null  float64
 11  PM2.5_flag  443508 non-null  object 
 12  Temp_Avg    443508 non-null  float64
 13  RH_Avg      443508 non-null  float64
 14  WSpeed_Avg  443508 non-null  float64
 15  WSpeed_Max  443508 non-null  float64
 16  WDir_Avg    443508 non-null  float64
 17  WDir_SD     443508 non-null  float64
 18  Rain_Tot    443508 non-null  float64
 19  Press_A

Once this is done we can begin with the descriptive statistics.

In [12]:
df1.to_csv("Atmospheric Data With No Missing Values.csv", index=False)
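Before relying on the saved file, a quick check of which columns still contain nulls can be useful, since Rad_Avg was intentionally left unfilled. A minimal sketch on a made-up frame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: O3 is complete, Rad_Avg still has a null.
toy = pd.DataFrame({"O3": [55.48, 55.49], "Rad_Avg": [np.nan, 120.0]})

# Count nulls per column and list the columns that still have any.
null_counts = toy.isna().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```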