## Data storage exploration - Understanding how the data is stored 

**GOAL**: Create a CSV that combines the NO2 levels for the year 2018 with data from the measurement stations and hourly temperatures.
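As a rough sketch of what that final combination could look like, the snippet below joins three small made-up tables with `DataFrame.merge`. The frames `no2`, `stations` and `temps`, and the columns `timestamp`, `no2`, `station_name` and `temperature`, are hypothetical stand-ins; only `ESTACION` comes from the real data. It also assumes a single temperature reading per hour shared by all stations, which may not match the actual temperature source.

```python
import pandas as pd

# Hypothetical hourly NO2 readings in long format (one row per station and hour).
no2 = pd.DataFrame({
    'ESTACION': [4, 4, 8],
    'timestamp': pd.to_datetime(['2018-01-01 01:00', '2018-01-01 02:00',
                                 '2018-01-01 01:00']),
    'no2': [15.0, 35.0, 40.0],
})

# Hypothetical station metadata (one row per station).
stations = pd.DataFrame({
    'ESTACION': [4, 8],
    'station_name': ['Station A', 'Station B'],
})

# Hypothetical hourly temperatures (one row per hour).
temps = pd.DataFrame({
    'timestamp': pd.to_datetime(['2018-01-01 01:00', '2018-01-01 02:00']),
    'temperature': [5.1, 4.8],
})

# Left joins keep every NO2 reading; validate= raises an error if a lookup
# table unexpectedly contains duplicate keys.
combined = (
    no2.merge(stations, on='ESTACION', how='left', validate='many_to_one')
       .merge(temps, on='timestamp', how='left', validate='many_to_one')
)

# to_csv without a path returns the CSV text instead of writing a file.
csv_text = combined.to_csv(index=False)
print(csv_text)
```

Using left joins means a missing station or temperature shows up as NaN rather than silently dropping pollution readings.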

## STRUCTURE OF THIS NOTEBOOK

- Explore January data from the year 2018


In [18]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from typing import Tuple
import joblib
import os

from pathlib import Path


In [15]:
path = '/home/carleondel/data-madrid_pollution_project/'
df_ene = pd.read_csv(path + 'ene_mo18.csv', sep = ';')

Let's start exploring our data from January

In [16]:
df_ene.head()

Unnamed: 0,PROVINCIA,MUNICIPIO,ESTACION,MAGNITUD,PUNTO_MUESTREO,ANO,MES,DIA,H01,V01,...,H20,V20,H21,V21,H22,V22,H23,V23,H24,V24
0,28,79,4,1,28079004_1_38,2018,1,1,1.0,V,...,2.0,V,1.0,V,1.0,V,1.0,V,1.0,V
1,28,79,4,1,28079004_1_38,2018,1,2,1.0,V,...,1.0,V,1.0,V,1.0,V,2.0,V,2.0,V
2,28,79,4,1,28079004_1_38,2018,1,3,1.0,V,...,2.0,V,2.0,V,2.0,V,3.0,V,2.0,V
3,28,79,4,1,28079004_1_38,2018,1,4,2.0,V,...,2.0,V,1.0,V,1.0,V,1.0,V,1.0,V
4,28,79,4,1,28079004_1_38,2018,1,5,1.0,V,...,2.0,V,2.0,V,2.0,V,1.0,V,2.0,V


### **Hourly and real-time data:**

- Each record is structured as follows: PROVINCIA MUNICIPIO ESTACION MAGNITUD PUNTO_MUESTREO ANO MES DIA H01 V01 H02 V02
28 79 4 1 28079004_1_38 2019 1 1 23 V 17 V

- The PUNTO_MUESTREO (sampling point) field contains the full station code
(province, municipality and station) plus the magnitude and the sampling
technique.

- H01 is the reading for 1 a.m. on that day and V01 is its validation
code; H02 is the reading for 2 a.m. and V02 its validation code, and so
on.
- ONLY DATA CARRYING THE VALIDATION CODE "V" ARE VALID.
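The record layout above stores one row per day with paired reading and validation columns. A minimal sketch of reshaping it into one row per hour follows, run on a two-hour toy frame. The helper `wide_to_long` and the output columns `value`, `valid` and `timestamp` are hypothetical names, and treating H24 as midnight at the end of the day is an assumption rather than something the data description states.

```python
import pandas as pd


def wide_to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Turn one-row-per-day records into one row per station and hour."""
    id_cols = ['ESTACION', 'ANO', 'MES', 'DIA']
    # Use only the hour columns actually present (the toy frame has two).
    h_cols = [f'H{h:02d}' for h in range(1, 25) if f'H{h:02d}' in df.columns]
    v_cols = ['V' + c[1:] for c in h_cols]

    values = df.melt(id_vars=id_cols, value_vars=h_cols,
                     var_name='hour', value_name='value')
    flags = df.melt(id_vars=id_cols, value_vars=v_cols,
                    var_name='hour', value_name='flag')

    # Both melts stack columns in the same order, so rows line up by position.
    values['valid'] = flags['flag'].eq('V').to_numpy()
    values['hour'] = values['hour'].str[1:].astype(int)

    # Assumption: H24 is midnight at the end of the day, so it rolls over.
    date = pd.to_datetime(values[['ANO', 'MES', 'DIA']].rename(
        columns={'ANO': 'year', 'MES': 'month', 'DIA': 'day'}))
    values['timestamp'] = date + pd.to_timedelta(values['hour'], unit='h')

    return (values[['ESTACION', 'timestamp', 'value', 'valid']]
            .sort_values(['ESTACION', 'timestamp'])
            .reset_index(drop=True))


# Toy data: two days, two hours each, with one reading flagged 'N'.
toy = pd.DataFrame({
    'ESTACION': [4, 4],
    'ANO': [2018, 2018],
    'MES': [1, 1],
    'DIA': [1, 2],
    'H01': [15.0, 13.0], 'V01': ['V', 'V'],
    'H02': [35.0, 17.0], 'V02': ['V', 'N'],
})
long_df = wide_to_long(toy)
print(long_df)
```

A long format like this makes it easier to join on hourly temperatures later, since each reading has its own timestamp.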


In [19]:
df_ene.columns

Index(['PROVINCIA', 'MUNICIPIO', 'ESTACION', 'MAGNITUD', 'PUNTO_MUESTREO',
       'ANO', 'MES', 'DIA', 'H01', 'V01', 'H02', 'V02', 'H03', 'V03', 'H04',
       'V04', 'H05', 'V05', 'H06', 'V06', 'H07', 'V07', 'H08', 'V08', 'H09',
       'V09', 'H10', 'V10', 'H11', 'V11', 'H12', 'V12', 'H13', 'V13', 'H14',
       'V14', 'H15', 'V15', 'H16', 'V16', 'H17', 'V17', 'H18', 'V18', 'H19',
       'V19', 'H20', 'V20', 'H21', 'V21', 'H22', 'V22', 'H23', 'V23', 'H24',
       'V24'],
      dtype='object')

The first thing we need to do is filter our data to keep only the NO2 levels, which have magnitude code 08 (read by pandas as the integer 8 in the MAGNITUD column).

In [24]:
df_ene.MAGNITUD.nunique()

14

In [25]:
df_ene.MAGNITUD.unique()

array([ 1,  6,  7,  8, 12,  9, 10, 14, 20, 30, 35, 42, 43, 44])

In [38]:
# We filter by NO2 levels and drop the column
df_ene_NO2 = df_ene[df_ene['MAGNITUD'] == 8]
df_ene_NO2 = df_ene_NO2.drop('MAGNITUD', axis = 1)

In [39]:
df_ene_NO2.head()

Unnamed: 0,PROVINCIA,MUNICIPIO,ESTACION,PUNTO_MUESTREO,ANO,MES,DIA,H01,V01,H02,...,H20,V20,H21,V21,H22,V22,H23,V23,H24,V24
93,28,79,4,28079004_8_8,2018,1,1,15.0,V,35.0,...,39.0,V,35.0,V,27.0,V,23.0,V,11.0,V
94,28,79,4,28079004_8_8,2018,1,2,13.0,V,17.0,...,38.0,V,41.0,V,47.0,V,55.0,V,47.0,V
95,28,79,4,28079004_8_8,2018,1,3,35.0,V,17.0,...,53.0,V,47.0,V,51.0,V,56.0,V,41.0,V
96,28,79,4,28079004_8_8,2018,1,4,35.0,V,34.0,...,42.0,V,36.0,V,35.0,V,34.0,V,28.0,V
97,28,79,4,28079004_8_8,2018,1,5,24.0,V,17.0,...,53.0,V,54.0,V,52.0,V,46.0,V,44.0,V


In [42]:
print(f"We have kept {100*len(df_ene_NO2) / len(df_ene):.2f} % of the original dataset")

We have kept 15.93 % of the original dataset


Next we need to handle the H_X values whose V_X flag is 'N', either by deleting them or by interpolating them. Before choosing how to impute these data, we will count how many hourly values carry a not-valid flag and check what values were recorded in those cases. If those values are an interpolation of the other measurements, we could consider keeping them.

In [53]:
# Which validation codes appear in V04?
df_ene_NO2.V04.unique()

array(['V', 'N'], dtype=object)

In [49]:
# Create a list with values [V01, V02,...,V24]
V_list = ['V{:02d}'.format(i) for i in range(1, 25)]

print(V_list)

['V01', 'V02', 'V03', 'V04', 'V05', 'V06', 'V07', 'V08', 'V09', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24']


In [58]:
# Count the times 'N' appears each hour
count_N_values = df_ene_NO2[V_list].apply(lambda x: x.map({'N' : 1, 'V' : 0})).sum()

# Show the results
print(count_N_values)

V01     0
V02     0
V03     0
V04     1
V05     2
V06     1
V07     1
V08     1
V09     4
V10     9
V11     6
V12    11
V13     5
V14     6
V15     3
V16     2
V17     4
V18     3
V19     3
V20     2
V21     1
V22     0
V23     0
V24     0
dtype: int64


In [62]:
# Percentage of not-valid readings in the worst hour (V12, with 11 'N' flags)
100 * count_N_values.max() / len(df_ene_NO2)

1.478494623655914
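The same percentage can be computed for every hour at once instead of hard-coding one count. The sketch below runs on a small made-up frame; the names `toy`, `flag_cols`, `pct_invalid` and `overall` are illustrative only and do not come from the notebook.

```python
import pandas as pd

# Toy validation flags for three hours and four daily records.
toy = pd.DataFrame({
    'V01': ['V', 'V', 'V', 'V'],
    'V02': ['V', 'N', 'V', 'V'],
    'V03': ['N', 'N', 'V', 'V'],
})
flag_cols = [c for c in toy.columns if c.startswith('V')]

# eq('N') gives a boolean mask; the mean of a boolean column is the share of
# True values, so multiplying by 100 yields a percentage per hour.
pct_invalid = toy[flag_cols].eq('N').mean().mul(100)
print(pct_invalid.round(2))

# Share of not-valid flags across all hours and records together.
overall = toy[flag_cols].eq('N').to_numpy().mean() * 100
print(f"Overall: {overall:.2f} %")
```

On the real data the same expression would be applied to `df_ene_NO2[V_list]`.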

Let's look at the readings flagged as not valid at hour 12

In [59]:
df_ene_NO2[df_ene_NO2['V12'] == 'N'][['H12', 'V12']]

Unnamed: 0,H12,V12
108,82.0,N
249,61.0,N
262,71.0,N
1182,65.0,N
1865,50.0,N
2062,95.0,N
2237,59.0,N
2240,45.0,N
3104,79.0,N
3551,58.0,N


We should now see how the invalid data compare with the valid data. If they do not differ much, we can consider keeping them. Otherwise we must decide how to impute those values.
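One possible way to make that comparison, and to try a simple imputation, is sketched below on a toy frame with three hours. Linear interpolation along the hours of each day is only one candidate strategy, not a choice made in this notebook, and the names `toy`, `masked` and `filled` are hypothetical.

```python
import pandas as pd

# Toy readings: the H02 value of the second record is flagged 'N'.
toy = pd.DataFrame({
    'H01': [10.0, 20.0, 30.0], 'V01': ['V', 'V', 'V'],
    'H02': [12.0, 99.0, 32.0], 'V02': ['V', 'N', 'V'],
    'H03': [14.0, 24.0, 34.0], 'V03': ['V', 'V', 'V'],
})
h_cols = ['H01', 'H02', 'H03']
v_cols = ['V01', 'V02', 'V03']

values = toy[h_cols].to_numpy(dtype=float)
valid = toy[v_cols].eq('V').to_numpy()

# Compare summary statistics of valid and not-valid readings.
print('mean of valid readings:  ', values[valid].mean())
print('mean of invalid readings:', values[~valid].mean())

# Replace not-valid readings with NaN, keeping valid ones untouched.
masked = toy[h_cols].where(valid)

# Assumption: fill gaps by linear interpolation between neighbouring hours
# of the same day (axis=1 walks across the H columns of each row).
filled = masked.interpolate(axis=1, limit_direction='both')
print(filled)
```

Because each row is a single day, this row-wise interpolation cannot bridge midnight; a long-format hourly series per station would be needed for that.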

In [None]:

# Data Cleaning
# Placeholder: df_no2 is not defined above. This assumes a long-format frame
# with columns like 'Date', 'NO2_Level', 'Station_ID', 'Pollutant', etc.
# Add additional cleaning steps based on your actual dataset
df_no2['Date'] = pd.to_datetime(df_no2['Date'])
df_no2 = df_no2[df_no2['Pollutant'] == 'NO2']  # Only keep NO2 measurements

In [None]:
# Data Exploration
# Visualizing daily evolution
plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='NO2_Level', data=df_no2)
plt.title('Daily Evolution of NO2 Levels in Madrid (2018)')
plt.xlabel('Date')
plt.ylabel('NO2 Level')
plt.show()

In [None]:
# Visualizing monthly evolution
# Average only the numeric NO2 column; non-numeric columns would break .mean()
df_no2_monthly = df_no2.resample('M', on='Date')[['NO2_Level']].mean()
plt.figure(figsize=(12, 6))
sns.barplot(x=df_no2_monthly.index, y='NO2_Level', data=df_no2_monthly)
plt.title('Monthly Average NO2 Levels in Madrid (2018)')
plt.xlabel('Month')
plt.ylabel('Average NO2 Level')
plt.show()

In [None]:


# Exploring factors affecting NO2 levels
# (Assuming additional columns like 'Temperature', 'Weather', etc.)
# Add relevant exploration based on your actual dataset
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Temperature', y='NO2_Level', data=df_no2, hue='Weather')
plt.title('Scatter Plot of NO2 Levels vs Temperature (2018)')
plt.xlabel('Temperature')
plt.ylabel('NO2 Level')
plt.legend(title='Weather')
plt.show()

In [None]:

# Proposed Predictive Model
# (Provide detailed discussion and rationale, the following is a placeholder)
print("Proposed Predictive Model:")
print("Variables to introduce: Temperature, Season, Previous NO2 Levels")
print("Historical Data Depth: Consider using at least one year of historical data")
print("Modeling Techniques: Try linear regression, time series models, etc.")
