# Calculate the measures of central tendency, dispersion, and determine the skewness of the distribution for each variable.

We will do this analytically, using summary statistics rather than plots.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
PATH = "../../../DataCleaning/"

In [3]:
df = pd.read_csv(PATH + "Atmospheric Data With No Missing Values.csv")

In [4]:
df.head()

Unnamed: 0,O3,O3_flag,NO2,NO2_flag,NO,NO_flag,CO,CO_flag,PM10,PM10_flag,...,WDir_Avg,WDir_SD,Rain_Tot,Press_Avg,Rad_Avg,Year,Month,Day,Hour,Minute
0,55.48,OK,0.72,OK,0.2,BDL,0.25,OK,25.47,OK,...,173.6,14.26,0.0,805.409,,2023,5,1,0,0
1,55.49,OK,0.81,OK,0.2,BDL,0.26,OK,25.74,OK,...,171.0,10.53,0.0,805.524,,2023,5,1,0,1
2,55.4,OK,0.93,OK,0.2,BDL,0.27,OK,26.6,OK,...,178.6,15.72,0.0,805.436,,2023,5,1,0,2
3,55.2,OK,0.87,OK,0.2,BDL,0.28,OK,27.59,OK,...,186.1,17.43,0.0,805.45,,2023,5,1,0,3
4,55.41,OK,0.98,OK,0.2,BDL,0.28,OK,27.83,OK,...,211.5,21.67,0.0,805.504,,2023,5,1,0,4


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 443508 entries, 0 to 443507
Data columns (total 26 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   O3          443508 non-null  float64
 1   O3_flag     443508 non-null  object 
 2   NO2         443508 non-null  float64
 3   NO2_flag    443508 non-null  object 
 4   NO          443508 non-null  float64
 5   NO_flag     443508 non-null  object 
 6   CO          443508 non-null  float64
 7   CO_flag     443508 non-null  object 
 8   PM10        443508 non-null  float64
 9   PM10_flag   443508 non-null  object 
 10  PM2.5       443508 non-null  float64
 11  PM2.5_flag  443508 non-null  object 
 12  Temp_Avg    443508 non-null  float64
 13  RH_Avg      443508 non-null  float64
 14  WSpeed_Avg  443508 non-null  float64
 15  WSpeed_Max  443508 non-null  float64
 16  WDir_Avg    443508 non-null  float64
 17  WDir_SD     443508 non-null  float64
 18  Rain_Tot    443508 non-null  float64
 19  Pr

## Flag variables

These are categorical variables, so we can build frequency tables to see which air quality sensors have the most readings below the detection limit (BDL) or out of range (OR). This information helps assess how well the sensors are functioning.

In [6]:
columns = ["O3_flag", "NO2_flag", "NO_flag", "CO_flag", "PM10_flag", "PM2.5_flag"]

In [9]:
for c in columns: print(df[c].value_counts(), "\n")

O3_flag
OK     442190
BDL      1302
OR         16
Name: count, dtype: int64 

NO2_flag
OK     440116
BDL      3085
OR        307
Name: count, dtype: int64 

NO_flag
BDL    348595
OK      94710
OR        203
Name: count, dtype: int64 

CO_flag
OK     437378
BDL      6022
OR        108
Name: count, dtype: int64 

PM10_flag
OK     417350
BDL     23273
OR       2885
Name: count, dtype: int64 

PM2.5_flag
OK     390297
BDL     44973
OR       8238
Name: count, dtype: int64 



Based on these absolute frequencies, the pollutant with the most records below the detection limit is NO. In fact, it has more records flagged BDL than OK.

In terms of out-of-range records (OR), however, NO is in line with the other variables. The variable with the most is PM2.5.

Now, to avoid biasing the analysis, we will first calculate the relative frequencies of the data that are below the detection limit for each flag variable. This is to truly see which pollutants remained lower during the year according to the measurement capabilities of our sensors.

Next, we will look at the relative frequencies of the data that were correctly taken (OK), to determine the certainty in our air quality measurements.

In [20]:
print("Relative Frequencies Below Detection Limit")
for c in columns:
    print(c)
    print(round(df[df[c] == "BDL"].shape[0]*100/df.shape[0], 2), "%\n")

Relative Frequencies Below Detection Limit
O3_flag
0.29 %

NO2_flag
0.7 %

NO_flag
78.6 %

CO_flag
1.36 %

PM10_flag
5.25 %

PM2.5_flag
10.14 %



Having 78.6% of the NO data below the detection limit may have several explanations:
- It is a pollutant with a very low concentration.
- The sensor's detection limit is too high for the station's measurements.
- There was a problem with data processing, to the extent that its integrity was compromised.

Next, we look at the relative frequencies of the data flagged as reliable (OK).

In [21]:
print("Relative Frequencies Reliable Data")
for c in columns:
    print(c)
    print(round(df[df[c] == "OK"].shape[0]*100/df.shape[0], 2), "%\n")

Relative Frequencies Reliable Data
O3_flag
99.7 %

NO2_flag
99.24 %

NO_flag
21.35 %

CO_flag
98.62 %

PM10_flag
94.1 %

PM2.5_flag
88.0 %



With this, we can see that almost all the sensor samples are quite reliable for analysis. We just need to be careful with NO and a bit with PM2.5. We can address this in the inference tests by increasing the significance level (to avoid potential biases).
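The two loops above can also be condensed into a single table of percentages per flag. The following is a minimal sketch on a hypothetical toy frame (`toy` and `flag_share` are illustrative names, not part of the original notebook); on the real data, `toy` would be `df[columns]`.

```python
import pandas as pd

# Hypothetical miniature of two flag columns; the real df has 443,508 rows.
toy = pd.DataFrame({
    "O3_flag": ["OK", "OK", "OK", "BDL"],
    "NO_flag": ["BDL", "BDL", "OR", "OK"],
})

# value_counts(normalize=True) returns shares that sum to 1 per column.
# fillna(0) covers categories absent from a column; mul(100) gives percentages.
flag_share = (
    toy.apply(lambda col: col.value_counts(normalize=True))
    .fillna(0)
    .mul(100)
    .round(2)
)
print(flag_share)
```

Having all categories side by side makes it easier to compare sensors than reading one printed block per flag.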

## Time variables

This analysis will be quite simple. To assess the performance of the sensors in each month, let's examine the absolute frequencies of each month. Theoretically, all months with 31 days should have the same number of instances (likewise for the 30-day months). Naturally, the months with 31 days will have more instances, while February will have the fewest instances.

In [34]:
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
print("Number of Instances per Month\n")
for i in range(12):
    month_name = months[i] + ":"
    instances_count = str(df[df["Month"] == i+1].shape[0]) + " instances"
    print(month_name.ljust(12), instances_count)

Number of Instances per Month

January:     40176 instances
February:    36600 instances
March:       18961 instances
April:       40498 instances
May:         29463 instances
June:        40500 instances
July:        39538 instances
August:      41555 instances
September:   40498 instances
October:     41700 instances
November:    30900 instances
December:    43119 instances


Here we can note several things:
- March has the fewest instances by far: less than half as many as most other months.
- Almost all months have a different number of instances. In fact, the only two with the same number are April and September.

These discrepancies could be due to instances removed because of null data, or lost when merging the meteorology and air quality datasets. However, it would be advisable to review how the minute-by-minute sensor measurements are working.
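To put these counts in context, they can be compared with the number of records a month would have at one reading per minute. The sketch below does this for a few months whose length does not depend on the year, using counts copied from the output above. The year passed to `calendar.monthrange` is only a placeholder, and the names `observed` and `coverage` are illustrative.

```python
import calendar
import pandas as pd

# Counts copied from the monthly output above (month number -> instances).
# February is left out because its length depends on the year.
observed = {1: 40176, 3: 18961, 5: 29463, 6: 40500}

rows = []
for month, count in observed.items():
    # Placeholder year: these months have the same length in every year.
    days = calendar.monthrange(2023, month)[1]
    expected = days * 24 * 60  # one record per minute
    rows.append({
        "month": month,
        "observed": count,
        "expected": expected,
        "coverage_pct": round(100 * count / expected, 2),
    })

coverage = pd.DataFrame(rows)
print(coverage)
```

A coverage well below 100% points to gaps in the minute-level record rather than to a shorter month.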

## Air Quality

Let's take a look at the air quality data, using the central tendency, dispersion, and skewness of each pollutant.

In [41]:
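def fisher(variable: pd.Series) -> float:
    # Fisher's moment coefficient of skewness: sum((x - mean)^3) / (n * s^3).
    # Note: Python's built-in sum() does not skip NaN, so a column with
    # missing values returns nan; std() is the sample standard deviation (ddof=1).
    numerator = sum((variable - variable.mean())**3)
    denominator = variable.shape[0] * (variable.std()**3)
    return numerator / denominator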

*Note:* We are not using the mode because these are continuous numerical variables (floating point), where exact values rarely repeat and the mode is not informative.

In [56]:
def descriptiva(particle: pd.Series) -> None:
    print("Mean: ", particle.mean())
    print("Median: ", particle.median())
    print("First quartile: ", particle.quantile(0.25))
    print("Third quartile: ", particle.quantile(0.75))
    print("Standard deviation: ", particle.std())
    print("Minimum: ", particle.min())
    print("Maximum: ", particle.max())
    print("Fisher asymmetry: ", fisher(particle))
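As an alternative to printing one variable at a time, the same statistics can be collected in a single table. This is a minimal sketch on hypothetical toy data (`toy` and `summary` are illustrative names). Note that `DataFrame.skew()` is pandas' bias-corrected sample skewness, so its values can differ slightly from the `fisher` helper.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of df.
toy = pd.DataFrame({
    "O3": [10.0, 20.0, 30.0, 100.0],
    "NO2": [1.0, 2.0, 3.0, 4.0],
})

# describe() covers count, mean, std, min, quartiles and max;
# transposing puts one variable per row.
summary = toy.describe().T
summary["skew"] = toy.skew()

print(summary[["mean", "50%", "25%", "75%", "std", "min", "max", "skew"]])
```

With one row per variable, comparisons such as mean against median or skewness across pollutants can be read directly from the table.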

#### O3

In [57]:
descriptiva(df["O3"])

Mean:  34.1781475982395
Median:  32.57
First quartile:  20.61
Third quartile:  46.57
Standard deviation:  17.620797886814806
Minimum:  0.015
Maximum:  410.2
Fisher asymmetry:  0.7974751877463978


With this, we can make some observations about ozone:
- The difference between the mean (34.18) and the median (32.57) is small. Still, since the mean is greater than the median, the distribution is right-skewed. Fisher's skewness confirms this with a positive value (0.80).
- The standard deviation (17.62) is fairly large: about half the mean. This may be due to outliers that inflate it.
- Regarding outliers, the first and third quartiles are 20.61 and 46.57, yet the minimum is almost 0 and the maximum exceeds 400 ppb. So while half of the data is concentrated in a narrow interval, values outside it can rise or fall considerably: there were moments when ozone pollution spiked sharply and others when it dropped to practically zero.
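One common way to make "outlier" concrete is Tukey's 1.5 × IQR rule. The notebook does not state which rule it has in mind, so this choice is an assumption. The sketch below applies it to the O3 quartiles, minimum, and maximum printed above.

```python
# Quartiles, minimum and maximum copied from the O3 output above.
q1, q3 = 20.61, 46.57
minimum, maximum = 0.015, 410.2

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # Tukey's rule: values below are low outliers
upper_fence = q3 + 1.5 * iqr    # values above are high outliers

print(f"IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("Minimum is a low outlier:", minimum < lower_fence)
print("Maximum is a high outlier:", maximum > upper_fence)
```

Under this rule the maximum of 410.2 is flagged, while the minimum of 0.015 is not, because the lower fence is negative.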

#### NO2

In [58]:
descriptiva(df["NO2"])

Mean:  4.826200361662021
Median:  3.72
First quartile:  1.98
Third quartile:  6.47
Standard deviation:  4.04996493660314
Minimum:  0.2
Maximum:  163.27
Fisher asymmetry:  3.8661046832980417


Since both are in the same units (ppb), we can say that nitrogen dioxide pollution is much lower than ozone pollution, based on their means: 4.83 vs 34.18. The same shows in other measures such as the standard deviation and the maximum.

However, for NO2 we do need to comment on its outliers: the maximum was 163.27, while the quartiles span only 1.98 to 6.47. This indicates that beyond the third quartile we can find extremely large values. This does not necessarily mean errors in the data (remember that more than 99% of the data for this pollutant was flagged OK by the sensor). Rather, it means there were very high pollution peaks compared to its normal behavior during the year.

As a result of these peaks, we again have a distribution with positive skewness. The concern here is the value of Fisher's skewness: 3.87. With this value, we need to be cautious about the data. This distribution could also be partly due to how we filled in the null data and handled values that were out of range or below the detection limit during cleaning: we replaced them with the median and mean (respectively). By doing this we may have loaded too much data toward the center relative to the tails, creating a strong right skew. This could cause problems in the normality tests, but we will see that later on.

#### NO

*Note:* This is the variable with 78.6% of the data below the detection limit.

In [59]:
descriptiva(df["NO"])

Mean:  0.8407245190616629
Median:  0.2
First quartile:  0.2
Third quartile:  0.2
Standard deviation:  2.886681173901699
Minimum:  0.2
Maximum:  281.94
Fisher asymmetry:  27.311967898596578


We have the same case as with NO2, but taken to an extreme. Under the data cleaning criteria, values below the detection limit were replaced with the mean if they were within a certain range. With 78.6% of the values below the detection limit, it was expected that replacing them this way would alter the distribution. This explains the extremely high value of Fisher's skewness: 27.31.

The only informative values are the maximum (because it represents pollution peaks that were not manipulated), the mean, and the standard deviation. In fact, a standard deviation of 2.89 when the first and third quartiles are both 0.2 (an interquartile range of 0) indicates that we retained real measurement values.

Therefore, personally, I would only work with data labeled with the OK flag, because the rest completely distort the distribution. When we graph it, we will see an interquartile range collapsed to a single value and many outliers at the upper end of the distribution. These outliers matter to us because they are the most representative values we can work with.
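Restricting the analysis to OK-flagged records could look like the sketch below. The rows are hypothetical; the 0.2 entries mimic the value that `df.head()` shows for BDL rows, and `no_ok` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy rows; 0.2 mimics the value shown for BDL records.
toy = pd.DataFrame({
    "NO": [0.2, 0.2, 0.2, 1.5, 3.0, 0.9],
    "NO_flag": ["BDL", "BDL", "BDL", "OK", "OK", "OK"],
})

# Keep only measurements the sensor flagged as reliable.
no_ok = toy.loc[toy["NO_flag"] == "OK", "NO"]

print("All rows -> median:", toy["NO"].median())
print("OK only  -> median:", no_ok.median())
```

On the real data, the same filter would be `df.loc[df["NO_flag"] == "OK", "NO"]`, which could then be passed to `descriptiva`.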

#### CO

*Note:* This pollutant is reported in ppm, not ppb. Multiply by 1000 to convert it to the units of the other gases.

In [60]:
descriptiva(df["CO"])

Mean:  0.3305503395654644
Median:  0.31
First quartile:  0.24
Third quartile:  0.4
Standard deviation:  0.15373635046558537
Minimum:  0.02
Maximum:  3.45
Fisher asymmetry:  1.3573614988908342


The first thing to note is that we still have a positively skewed distribution, but less pronounced than the previous ones: Fisher's skewness is only 1.36. It's still high, but we can work with it.

In fact, although we still need to analyze PM10 and PM2.5, a pattern of skewness is emerging in the air quality variables: they all have a right skew because concentrations stay roughly stable for most of the year, but at certain moments pollution peaks pull the mean above the median.

We again have a standard deviation that's almost half of the mean (0.15 vs 0.33), which is also expected given the pollution peaks.

Finally, regarding units: if we convert ppm to ppb to compare with the other gases, carbon monoxide has a much higher concentration than the others. However, a higher concentration doesn't necessarily mean it is more harmful. Other pollutants, even at lower concentrations than CO, can be more harmful and worsen air quality.
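The unit conversion can be checked with a few lines of arithmetic. The sketch below uses the means printed in the outputs above. The factor of 1000 ppb per ppm comes from the note at the start of this subsection, and the variable names are illustrative.

```python
# Means copied from the outputs above (CO in ppm, the others in ppb).
co_mean_ppm = 0.3305503395654644
means_ppb = {
    "O3": 34.1781475982395,
    "NO2": 4.826200361662021,
    "NO": 0.8407245190616629,
}

PPB_PER_PPM = 1000
means_ppb["CO"] = co_mean_ppm * PPB_PER_PPM

# Print the gases from highest to lowest mean concentration.
for gas, value in sorted(means_ppb.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gas}: {value:.2f} ppb")
```

Once converted, the CO mean is roughly 330 ppb, about ten times the O3 mean.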

#### PM10

These are particles with an aerodynamic diameter equal to or less than 10 microns (in the case of PM2.5, it's the same idea but with a diameter equal to or less than 2.5 microns). For both particles, their unit of measurement is micrograms per cubic meter (µg/m^3).

In [61]:
descriptiva(df["PM10"])

Mean:  24.375818632358374
Median:  22.18
First quartile:  13.55
Third quartile:  32.92
Standard deviation:  15.637248471153509
Minimum:  2.0
Maximum:  399.64
Fisher asymmetry:  2.850676849715981


Again, we observe the behavior due to pollution peaks: right-skewed distribution, Fisher's skewness greater than 1 (in this case, 2.85), and a small interquartile range compared to the maximum value. We will have a lot of outliers when graphing this.

In this case, we see a fairly large standard deviation (15.64), not far from the width of the interquartile range (32.92 - 13.55 = 19.37). This indicates that the concentration of these particles varies considerably throughout the year.

#### PM2.5

In [62]:
descriptiva(df["PM2.5"])

Mean:  17.433051218918262
Median:  15.99
First quartile:  9.17
Third quartile:  23.83
Standard deviation:  11.99863727718033
Minimum:  2.0
Maximum:  373.28
Fisher asymmetry:  3.5071871139768738


Again, we have the positive skew behavior of all air quality variables.

However, the interesting part here is comparing it to the PM10 particle.
- First, skewness is higher in PM2.5 than in PM10: 3.51 vs 2.85. This may be due to PM2.5 having a narrower interquartile range.
- In fact, the mean, median, and maximum are all lower in PM2.5 than in PM10. Therefore, over the last year, the concentration of PM10 was higher than that of PM2.5.
- The standard deviation is also lower for PM2.5, but it did not fall as far as the other measures. This indicates that while its concentration is lower, its variation relative to its level is comparable to that of PM10.
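The last point can be quantified with the coefficient of variation (standard deviation divided by mean). It is not computed in the notebook and is introduced here only as an illustration, using the values printed above.

```python
# Means and standard deviations copied from the outputs above.
stats = {
    "PM10": {"mean": 24.375818632358374, "std": 15.637248471153509},
    "PM2.5": {"mean": 17.433051218918262, "std": 11.99863727718033},
}

# Coefficient of variation: standard deviation as a share of the mean.
cv = {name: s["std"] / s["mean"] for name, s in stats.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.2f}")
```

Relative to their means, the two variables vary by a similar amount, with PM2.5 slightly higher (about 0.69 vs 0.64).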

## Meteorological variables

Let's do the same analysis procedure, but now with the meteorological variables.

#### Temp_Avg

Temperature average. Units: °C

In [65]:
descriptiva(df["Temp_Avg"])

Mean:  18.340548549293363
Median:  18.06
First quartile:  14.85
Third quartile:  22.17
Standard deviation:  5.267597043561073
Minimum:  2.966
Maximum:  33.57
Fisher asymmetry:  0.04107919069135361


Since we are no longer working with air quality variables, we no longer see the right-skew pattern. For average temperature it's quite the opposite: the mean and the median are almost equal, so the distribution is almost symmetrical. Remember that with real data we almost never see a Fisher skewness of exactly 0 (if we do, the data is likely incorrect). A skewness of 0.04 can therefore be treated as practically symmetrical.

This was expected not only because the mean nearly coincides with the median, but also because of the quartiles: both lie within one standard deviation of the median (18.06 ± 5.27), so the central half of the data is tightly concentrated there.

Additionally, Morelia's climate produces no extreme values: the minimum recorded was 2.97°C and the maximum was 33.57°C. Since these values are not far from the mean, they contribute to this almost symmetrical distribution.

#### RH_Avg

Relative humidity. Units: %

In [66]:
descriptiva(df["RH_Avg"])

Mean:  60.65806376435147
Median:  61.71
First quartile:  42.58
Third quartile:  80.6
Standard deviation:  23.014083653992493
Minimum:  7.406
Maximum:  99.9
Fisher asymmetry:  -0.17851379909815285


Here we see for the first time that the mean is lower than the median. The difference is small (about 1.05 percentage points), and Fisher's skewness is slightly negative (-0.18), so the distribution is left-skewed. This indicates that humidity in the city (at least near the atmospheric observatory) is relatively high. This would need to be compared with the amount of rainfall to confirm whether it's due to rain or to bodies of water within and near the university.

These statements rest not only on the shape of the distribution but also on the fact that the mean is about 60%, the quartiles span 42% to 80%, and the standard deviation is moderate (23 percentage points).

Now, in what we could consider as drought data, we see that the minimum value was 7.4%. We would need to see how many data points are at the lower end to understand how much and to what extent the drought period affected it.

#### WSpeed_Avg

Wind speed average. Units: m per second

In [67]:
descriptiva(df["WSpeed_Avg"])

Mean:  1.5642836724478482
Median:  1.217
First quartile:  0.172
Third quartile:  2.454
Standard deviation:  1.523667332068103
Minimum:  0.0
Maximum:  17.15
Fisher asymmetry:  1.1006272101631005


Here we have a positive skewness again with a value greater than 1. This is because the maximum wind speed values are much higher than those within our interquartile range. These very large data points can be better observed with the maximum wind speed variable (remember that this is an average).

Although the quartiles span a narrow interval (0.17 to 2.45), the standard deviation (1.52) is small relative to the maximum value (17.15). This means the largest values are not frequent enough to significantly alter the distribution.

#### WSpeed_Max

Maximum wind speed. Units: m per second

In [68]:
descriptiva(df["WSpeed_Max"])

Mean:  2.0565199229777136
Median:  1.618
First quartile:  0.451
Third quartile:  3.164
Standard deviation:  1.926558010016441
Minimum:  0.0
Maximum:  18.78
Fisher asymmetry:  1.0775158389943107


Naturally, it behaves very similarly to the average wind speed in terms of positive skewness. The difference is that, being the maximum recorded speed, almost all values are larger. Personally, I would have expected a greater increase, but that was not the case: the mean increased from 1.56 to 2.06, the maximum from 17.15 to 18.78, and the quartiles from 0.17-2.45 to 0.45-3.16.

The only noticeable change was the wider interquartile range. Perhaps this is why the skewness decreased slightly, from 1.10 to 1.08.

#### WDir_Avg

Wind direction average. Units: degrees.

In [69]:
descriptiva(df["WDir_Avg"])

Mean:  144.94631321870182
Median:  147.8
First quartile:  42.73
Third quartile:  228.6
Standard deviation:  108.05553721419147
Minimum:  0.0
Maximum:  360.0
Fisher asymmetry:  0.13667382639897396


Because the variable is constrained to 0-360 degrees, the asymmetry of the distribution is limited. The two indicators disagree slightly here: the mean is below the median, which would suggest a left skew, while Fisher's skewness is slightly positive (0.14). In practice, the distribution is close to symmetrical.

Conducting this analysis in degrees is not very useful. It only tells us that the central half of the readings lies between 42.73° and 228.6° relative to the observatory's location, and that the direction varies with a standard deviation of 108°.

For a more interesting analysis in the next section of the work, we can discretize the variable and create a histogram of the wind direction frequencies.
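One possible discretization is into the eight compass sectors of 45° each. The sketch below uses hypothetical readings; the sector labels, the 45° width, and the names `wdir` and `sectors` are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical wind directions in degrees (0 and 360 both mean north).
wdir = pd.Series([0.0, 10.0, 44.0, 90.0, 173.6, 211.5, 350.0, 360.0])

labels = np.array(["N", "NE", "E", "SE", "S", "SW", "W", "NW"])

# Shift by half a sector (22.5 degrees) so that N spans 337.5 to 22.5;
# the modulo folds values near 360 back onto the first sector.
sector_index = (((wdir + 22.5) // 45) % 8).astype(int)
sectors = pd.Series(labels[sector_index.to_numpy()], index=wdir.index)

print(sectors.value_counts())
```

Counting by sector avoids the problem that 0° and 360° are the same direction, which distorts the arithmetic mean and standard deviation in degrees.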

#### WDir_SD

Standard deviation of wind direction. Units: degrees.

In [70]:
descriptiva(df["WDir_SD"])

Mean:  9.55129614572905
Median:  7.685
First quartile:  0.174
Third quartile:  14.25
Standard deviation:  10.212017005198792
Minimum:  0.0
Maximum:  103.4
Fisher asymmetry:  2.001601367086366


In this variable, descriptive analysis is equally unhelpful. Its real utility lies in complementing the average wind direction variable.

However, we can make some observations:
- In this variable, we again have a fairly large positive skewness (2.00). This can be explained by its very small interquartile range (from 0.17 to 14.25) and a very large standard deviation compared to that range (10.21). Therefore, the appearance of larger values (such as the maximum value of 103.4) causes us to have a right-skewed distribution.

#### Rain_Tot

Rainfall total. Units in mm.

In [71]:
descriptiva(df["Rain_Tot"])

Mean:  0.0010338032233916862
Median:  0.0
First quartile:  0.0
Third quartile:  0.0
Standard deviation:  0.01907520594380706
Minimum:  0.0
Maximum:  1.8
Fisher asymmetry:  36.21624704152785


Here we have the largest positive skewness in the entire dataset so far. It's interesting that even though the mean and median are practically the same, the skewness is modified so much by the outliers.

In fact, since almost all the summary statistics are 0 or very close to it (mean, median, first quartile, third quartile, standard deviation), the rainfall events themselves are the outliers (even though we observed high humidity in the environment).

One possibility would be to group the records into longer time intervals to see if this rainfall behavior changes. This is especially important because the highest value recorded in a single record was only 1.8 mm.
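Grouping into longer intervals could be sketched as follows, building a timestamp from the dataset's `Year`, `Month`, `Day`, `Hour`, and `Minute` columns and summing rainfall per day. The rows are hypothetical, and the daily interval is only one possible choice.

```python
import pandas as pd

# Hypothetical minute-level rows using the dataset's time columns.
toy = pd.DataFrame({
    "Year": [2023, 2023, 2023, 2023],
    "Month": [7, 7, 7, 7],
    "Day": [1, 1, 1, 2],
    "Hour": [15, 15, 16, 3],
    "Minute": [0, 1, 30, 0],
    "Rain_Tot": [0.2, 1.8, 0.0, 0.0],
})

# pd.to_datetime can assemble timestamps from year/month/day/hour/minute
# columns; the names are lowercased to match the documented keys.
parts = toy[["Year", "Month", "Day", "Hour", "Minute"]].rename(columns=str.lower)
toy["timestamp"] = pd.to_datetime(parts)

# Sum the per-minute totals within each calendar day.
daily_rain = toy.set_index("timestamp")["Rain_Tot"].resample("D").sum()
print(daily_rain)
```

Daily or hourly totals would be far less dominated by zeros than the per-minute series, so their quartiles and skewness should be more informative.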

#### Press_Avg

Atmospheric pressure average. Units in hPa.

In [72]:
descriptiva(df["Press_Avg"])

Mean:  805.8385776107762
Median:  805.874
First quartile:  804.626
Third quartile:  807.097
Standard deviation:  1.8405765947928108
Minimum:  797.895
Maximum:  812.652
Fisher asymmetry:  -0.08737419129960401


Atmospheric pressure does not fluctuate much when measured at the same location (even over the course of a year). As we can see, the mean and median are almost identical, so the distribution is practically symmetrical. Additionally, the quartiles span a narrow interval (804.63 to 807.10) and the standard deviation is even smaller: 1.84. Finally, the minimum and maximum do not deviate much from the interquartile range. When we conduct normality tests, this variable may be one of the closest to normality.

#### Rad_Avg

Solar Radiation average. Units in W/m^2

In [73]:
descriptiva(df["Rad_Avg"])

Mean:  471.9045388241155
Median:  404.5
First quartile:  130.1
Third quartile:  807.0
Standard deviation:  366.84912879481146
Minimum:  0.046
Maximum:  1439.0
Fisher asymmetry:  nan


First, note that Fisher's skewness returned nan. A division by 0 is unlikely here, since the standard deviation is clearly nonzero. The more probable cause is that Rad_Avg still contains missing values (df.head() shows it empty in the first rows), which Python's built-in sum propagates as nan. This does not prevent us from determining the skew of the distribution: since the mean is greater than the median, we have a right skew.

Here we also see one of the largest standard deviations, which reflects how much solar radiation varies. Naturally, at night and in low-light hours it will be at a minimum, so it really only increases during a few hours of the day (probably from 11:00 to 15:00). As a result, most of the data is concentrated on the left, but the values on the right (the maxima) are very high.

Note also that this variable has the largest interquartile range: from 130 to 807.
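To see how a missing value can produce nan in `fisher` while the other statistics still print, the sketch below uses a hypothetical series with one NaN standing in for Rad_Avg. `fisher_skipna` is an illustrative variant, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value, standing in for Rad_Avg.
s = pd.Series([100.0, 400.0, np.nan, 800.0])

centered_cubes = (s - s.mean()) ** 3

builtin_total = sum(centered_cubes)   # Python's sum propagates NaN
pandas_total = centered_cubes.sum()   # pandas skips NaN by default

print("built-in sum:", builtin_total)
print("pandas sum:  ", pandas_total)


def fisher_skipna(variable: pd.Series) -> float:
    """Same formula as fisher(), but ignoring missing values (a sketch)."""
    clean = variable.dropna()
    numerator = ((clean - clean.mean()) ** 3).sum()
    denominator = clean.shape[0] * clean.std() ** 3
    return numerator / denominator


print("skewness ignoring NaN:", fisher_skipna(s))
```

Running `df["Rad_Avg"].isna().sum()` on the real data would confirm whether missing values are in fact present.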