# Mongo queries & cleaning

This file contains the queries run against the database created in Mongo, whose collection holds the data gathered from the open airq API. The extracted information is also cleaned for later use.

### Connecting to MongoDB

In [15]:
import pymongo
import pandas as pd

In [2]:
from pymongo import MongoClient

In [3]:
str_conn = 'mongodb://localhost:27017'  # default connection string

cursor=MongoClient(str_conn)

cursor

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

In [4]:
# Access the "ETL_project" database

db = cursor.ETL_project
db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'ETL_project')

In [5]:
# Access the "airq" collection
coleccion = db.airq
coleccion

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'ETL_project'), 'airq')

### Madrid data

Each city has several stations that collect air-quality data. First, all "id" values are retrieved to find the stations for a given city, in this case Madrid. Then a second query fetches the desired value for all of those stations.

In [113]:
# Find all the stations in Madrid by "id".

query={"city":"Madrid"}
select={"_id":0,"id":1,"city":1}

id_lst_mad = list(coleccion.find(query,select))

In [114]:
id_values = [element["id"] for element in id_lst_mad]  # list with the ids of all stations in Madrid

In [102]:
# With all the Madrid ids, fetch the monthly pm25 parameter.

query = {"id": {"$in": id_values}, "parameter": "pm25"}

select = {"_id": 0,"id": 1,"parameter": 1,"average": 1,"first_datetime": 1,"last_datetime": 1}

results = list(coleccion.find(query, select))
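
The two queries above can also be folded into a single helper. This is a minimal sketch, not the method used in this notebook: `monthly_values` is a hypothetical name, and it assumes the collection object exposes PyMongo's `distinct` and `find` methods. Using `distinct` returns each station id once, even if the city query matches many documents per station.

```python
def monthly_values(collection, city, parameter):
    """Return the monthly documents of `parameter` for every station in `city`.

    `collection` is assumed to behave like a PyMongo collection
    (it must provide `distinct` and `find`).
    """
    # Step 1: unique station ids for the city (distinct() drops repeats).
    station_ids = collection.distinct("id", {"city": city})

    # Step 2: every document of the chosen parameter for those stations.
    query = {"id": {"$in": station_ids}, "parameter": parameter}
    projection = {
        "_id": 0,
        "id": 1,
        "parameter": 1,
        "average": 1,
        "first_datetime": 1,
        "last_datetime": 1,
    }
    return list(collection.find(query, projection))
```

With the objects defined earlier, a call such as `monthly_values(coleccion, "Madrid", "pm25")` should return the same kind of list as `results`.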

In [104]:
results

[{'id': 3265,
  'average': 13.607142857142858,
  'parameter': 'pm25',
  'first_datetime': '2022-01-01T01:00:00Z',
  'last_datetime': '2022-02-01T00:00:00Z'},
 {'id': 3265,
  'average': 11.491048593350383,
  'parameter': 'pm25',
  'first_datetime': '2022-02-01T04:00:00Z',
  'last_datetime': '2022-02-28T23:00:00Z'},
 {'id': 3265,
  'average': 11.305389221556887,
  'parameter': 'pm25',
  'first_datetime': '2022-03-01T01:00:00Z',
  'last_datetime': '2022-04-01T00:00:00Z'},
 {'id': 3265,
  'average': 5.217391304347826,
  'parameter': 'pm25',
  'first_datetime': '2022-04-01T01:00:00Z',
  'last_datetime': '2022-04-30T07:00:00Z'},
 {'id': 3265,
  'average': 9.842592592592593,
  'parameter': 'pm25',
  'first_datetime': '2022-05-01T01:00:00Z',
  'last_datetime': '2022-05-31T23:00:00Z'},
 {'id': 3265,
  'average': 9.505747126436782,
  'parameter': 'pm25',
  'first_datetime': '2022-06-01T01:00:00Z',
  'last_datetime': '2022-06-30T02:00:00Z'},
 {'id': 3265,
  'average': 12.936708860759493,
  'param

In [105]:
df = pd.DataFrame(results)  # "results" holds the Madrid pm25 query output

In [106]:
df

| | id | average | parameter | first_datetime | last_datetime |
| --- | --- | --- | --- | --- | --- |
| 0 | 3265 | 13.607143 | pm25 | 2022-01-01T01:00:00Z | 2022-02-01T00:00:00Z |
| 1 | 3265 | 11.491049 | pm25 | 2022-02-01T04:00:00Z | 2022-02-28T23:00:00Z |
| 2 | 3265 | 11.305389 | pm25 | 2022-03-01T01:00:00Z | 2022-04-01T00:00:00Z |
| 3 | 3265 | 5.217391 | pm25 | 2022-04-01T01:00:00Z | 2022-04-30T07:00:00Z |
| 4 | 3265 | 9.842593 | pm25 | 2022-05-01T01:00:00Z | 2022-05-31T23:00:00Z |
| ... | ... | ... | ... | ... | ... |
| 762 | 4338 | 12.983740 | pm25 | 2022-08-01T01:00:00Z | 2022-08-31T23:00:00Z |
| 763 | 4338 | 8.123711 | pm25 | 2022-09-01T01:00:00Z | 2022-10-01T00:00:00Z |
| 764 | 4338 | 13.869159 | pm25 | 2022-10-01T01:00:00Z | 2022-10-28T02:00:00Z |
| 765 | 4338 | 8.418333 | pm25 | 2022-11-02T06:00:00Z | 2022-12-01T00:00:00Z |
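
Both datetime columns are still plain strings at this point. One possible cleaning step, sketched under assumptions, is to parse them into timezone-aware datetimes and add a year-month label per row. The `clean_monthly` helper and the `month` column are hypothetical names, and the two sample rows are copied from the Madrid output above rather than queried again.

```python
import pandas as pd

# Sample rows copied from the Madrid query output (not a fresh query).
rows = [
    {"id": 3265, "average": 13.607142857142858, "parameter": "pm25",
     "first_datetime": "2022-01-01T01:00:00Z",
     "last_datetime": "2022-02-01T00:00:00Z"},
    {"id": 3265, "average": 11.491048593350383, "parameter": "pm25",
     "first_datetime": "2022-02-01T04:00:00Z",
     "last_datetime": "2022-02-28T23:00:00Z"},
]


def clean_monthly(records):
    """Build a DataFrame with parsed datetimes and a year-month label."""
    df = pd.DataFrame(records)
    # Parse the ISO 8601 strings; the trailing "Z" means UTC.
    for col in ("first_datetime", "last_datetime"):
        df[col] = pd.to_datetime(df[col], utc=True)
    # Label each row with the month in which its measurement window starts
    # (an assumption about how the monthly windows should be named).
    df["month"] = df["first_datetime"].dt.strftime("%Y-%m")
    return df


clean = clean_monthly(rows)
```

Labelling by `first_datetime` rather than `last_datetime` is a deliberate choice here, because several windows end at midnight on the first day of the following month.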


### Málaga data

In [110]:
query={"name":"ES0817A",
       "parameter":"pm25"}
select={"_id":0,"name":1,"parameter":1,"average":1, "first_datetime":1, "last_datetime":1}

results= list(coleccion.find(query,select))

In [111]:
df = pd.DataFrame(results)

In [112]:
df

| | name | average | parameter | first_datetime | last_datetime |
| --- | --- | --- | --- | --- | --- |
| 0 | ES0817A | 13.405634 | pm25 | 2022-01-01T01:00:00Z | 2022-02-01T00:00:00Z |
| 1 | ES0817A | 12.812689 | pm25 | 2022-02-01T04:00:00Z | 2022-02-28T20:00:00Z |
| 2 | ES0817A | 14.44546 | pm25 | 2022-03-01T08:00:00Z | 2022-04-01T00:00:00Z |
| 3 | ES0817A | 5.96 | pm25 | 2022-04-01T01:00:00Z | 2022-04-30T10:00:00Z |
| 4 | ES0817A | 9.836176 | pm25 | 2022-05-03T01:00:00Z | 2022-05-31T16:00:00Z |
| 5 | ES0817A | 6.314894 | pm25 | 2022-06-03T01:00:00Z | 2022-06-28T04:00:00Z |
| 6 | ES0817A | 10.278214 | pm25 | 2022-07-01T20:00:00Z | 2022-07-30T05:00:00Z |
| 7 | ES0817A | 10.752885 | pm25 | 2022-08-02T02:00:00Z | 2022-08-31T18:00:00Z |
| 8 | ES0817A | 6.923544 | pm25 | 2022-09-03T01:00:00Z | 2022-10-01T00:00:00Z |
| 9 | ES0817A | 11.258485 | pm25 | 2022-10-01T01:00:00Z | 2022-10-27T02:00:00Z |
