# Datasets - description

For air quality we selected 3 (or possibly only 2 of them) datasets from [the following site](https://archive.ics.uci.edu/ml/datasets):
1. [Beijing Multi-Site Air-Quality Data Data Set](https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data) - the dataset contains 6 main air pollutants and 6 main meteorological variables, measured hourly at several monitoring stations around Beijing. The fields in the dataset are:
    1. No: row number
    1. year: year of data in this row
    1. month: month of data in this row
    1. day: day of data in this row
    1. hour: hour of data in this row
    1. PM2.5: PM2.5 concentration (ug/m^3)
    1. PM10: PM10 concentration (ug/m^3)
    1. SO2: SO2 concentration (ug/m^3)
    1. NO2: NO2 concentration (ug/m^3)
    1. CO: CO concentration (ug/m^3)
    1. O3: O3 concentration (ug/m^3)
    1. TEMP: temperature (degree Celsius)
    1. PRES: pressure (hPa)
    1. DEWP: dew point temperature (degree Celsius)
    1. RAIN: precipitation (mm)
    1. wd: wind direction
    1. WSPM: wind speed (m/s)
    1. station: name of the air-quality monitoring site


2. [Beijing PM2.5 Data Data Set](https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data) - contains data from the US Embassy in Beijing and from the main Beijing airport, again with PM2.5 particles measured hourly. Features:
    1. No: row number
    1. year: year of data in this row
    1. month: month of data in this row
    1. day: day of data in this row
    1. hour: hour of data in this row
    1. pm2.5: PM2.5 concentration (ug/m^3)
    1. DEWP: Dew Point (°C)
    1. TEMP: Temperature (°C)
    1. PRES: Pressure (hPa)
    1. cbwd: Combined wind direction
    1. Iws: Cumulated wind speed (m/s)
    1. Is: Cumulated hours of snow
    1. Ir: Cumulated hours of rain


3. [Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality) - contains measurements from multisensor devices deployed in the field in an Italian city. The features are:
    1. Date (DD/MM/YYYY)
    1. Time (HH.MM.SS)
    1. True hourly averaged concentration CO in mg/m^3 (reference analyzer)
    1. PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
    1. True hourly averaged overall Non-Methanic Hydrocarbons concentration in microg/m^3 (reference analyzer)
    1. True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
    1. PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
    1. True hourly averaged NOx concentration in ppb (reference analyzer)
    1. PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
    1. True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
    1. PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
    1. PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
    1. Temperature in °C
    1. Relative Humidity (%)
    1. AH Absolute Humidity
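
The third dataset stores the date and time in separate text fields (`DD/MM/YYYY` and `HH.MM.SS`). A minimal sketch of how they could be combined into one timestamp is given here on a tiny inline sample rather than the real file. The column names (`CO(GT)`, `T`), the `;` separator, the decimal comma, and the `-200` missing-value marker are assumptions about the raw CSV and should be checked against the downloaded file.

```python
import io

import numpy as np
import pandas as pd

# Tiny inline sample standing in for the real file (layout is an assumption).
sample_csv = io.StringIO(
    "Date;Time;CO(GT);T\n"
    "10/03/2004;18.00.00;2,6;13,6\n"
    "10/03/2004;19.00.00;-200;13,3\n"
)

air_quality_ds = pd.read_csv(sample_csv, sep=";", decimal=",")

# Combine Date (DD/MM/YYYY) and Time (HH.MM.SS) into one datetime column.
air_quality_ds["datetime"] = pd.to_datetime(
    air_quality_ds["Date"] + " " + air_quality_ds["Time"],
    format="%d/%m/%Y %H.%M.%S",
)
air_quality_ds = air_quality_ds.drop(columns=["Date", "Time"])

# Assumed missing-value marker: turn -200 into NaN so it does not skew statistics.
air_quality_ds = air_quality_ds.replace(-200, np.nan)

print(air_quality_ds)
```

Passing an explicit `format` avoids any day/month ambiguity that automatic date parsing could introduce for values such as `10/03/2004`.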

# Loading the data

In [2]:
import pandas as pd

base_data_folder = "./Data"

## Beijing dataset

In [3]:
beijing_folder = "/Beijing dataset"

In [41]:
beijing_ds = pd.read_csv(base_data_folder + beijing_folder + "/Beijing.csv")
# Combine the separate date-part columns into a single timestamp column
beijing_ds['datetime'] = pd.to_datetime(beijing_ds[['year', 'month', 'day', 'hour']])
beijing_ds.drop(columns=['year', 'month', 'day', 'hour'], inplace=True)
# Use the same PM2.5 column name as the PRSA dataset
beijing_ds.rename(columns={'pm2.5': 'PM2.5'}, inplace=True)
beijing_ds.head(10)

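| | No | PM2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir | datetime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 | 2010-01-01 00:00:00 |
| 1 | 2 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 | 2010-01-01 01:00:00 |
| 2 | 3 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 | 2010-01-01 02:00:00 |
| 3 | 4 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 | 2010-01-01 03:00:00 |
| 4 | 5 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 | 2010-01-01 04:00:00 |
| 5 | 6 | NaN | -19 | -10.0 | 1017.0 | NW | 16.1 | 0 | 0 | 2010-01-01 05:00:00 |
| 6 | 7 | NaN | -19 | -9.0 | 1017.0 | NW | 19.23 | 0 | 0 | 2010-01-01 06:00:00 |
| 7 | 8 | NaN | -19 | -9.0 | 1017.0 | NW | 21.02 | 0 | 0 | 2010-01-01 07:00:00 |
| 8 | 9 | NaN | -19 | -9.0 | 1017.0 | NW | 24.15 | 0 | 0 | 2010-01-01 08:00:00 |
| 9 | 10 | NaN | -20 | -8.0 | 1017.0 | NW | 27.28 | 0 | 0 | 2010-01-01 09:00:00 |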


In [45]:
beijing_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43824 entries, 0 to 43823
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   No        43824 non-null  int64         
 1   PM2.5     41757 non-null  float64       
 2   DEWP      43824 non-null  int64         
 3   TEMP      43824 non-null  float64       
 4   PRES      43824 non-null  float64       
 5   cbwd      43824 non-null  object        
 6   Iws       43824 non-null  float64       
 7   Is        43824 non-null  int64         
 8   Ir        43824 non-null  int64         
 9   datetime  43824 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(4), object(1)
memory usage: 3.3+ MB
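
The 43824 rows match five years of hourly records (2010 through 2014 is 1826 days, times 24 hours). As a hedged sketch, the snippet here shows one way to confirm that a timestamp column has no gaps; the helper `is_gap_free` and the `toy` series are hypothetical names introduced for illustration, and the date range is inferred from the first and last timestamps shown in this notebook.

```python
import pandas as pd

# Expected hourly grid for the period covered by the dataset.
expected = pd.date_range(
    "2010-01-01 00:00", "2014-12-31 23:00", freq=pd.Timedelta(hours=1)
)
print(len(expected))


def is_gap_free(timestamps: pd.Series) -> bool:
    """Return True if consecutive timestamps are exactly one hour apart."""
    steps = timestamps.sort_values().diff().dropna()
    return bool((steps == pd.Timedelta(hours=1)).all())


# Made-up series with a missing 02:00 record.
toy = pd.Series(
    pd.to_datetime(["2010-01-01 00:00", "2010-01-01 01:00", "2010-01-01 03:00"])
)
print(is_gap_free(toy))
```

On the real frame the same check would be `is_gap_free(beijing_ds['datetime'])`.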


In [34]:
# isna() is an alias of isnull(), so the two counts printed below are identical
print(beijing_ds.isnull().sum(), "\n")
print(beijing_ds.isna().sum(), "\n")
# (rows, columns)
print(beijing_ds.shape)

No             0
pm2.5       2067
DEWP           0
TEMP           0
PRES           0
cbwd           0
Iws            0
Is             0
Ir             0
datetime       0
dtype: int64 

No             0
pm2.5       2067
DEWP           0
TEMP           0
PRES           0
cbwd           0
Iws            0
Is             0
Ir             0
datetime       0
dtype: int64 

(43824, 10)
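
The counts above show 2067 missing PM2.5 values out of 43824 rows, roughly 4.7% of the series. A minimal sketch of how that share could be computed per column follows, on a small made-up frame (`toy_ds` and its values are illustrative, not taken from the dataset). `isna().mean()` gives the fraction of missing entries because each `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; the values are made up.
toy_ds = pd.DataFrame({
    "PM2.5": [np.nan, 129.0, np.nan, 181.0],
    "TEMP": [-11.0, -12.0, -11.0, -14.0],
})

# Fraction of missing values per column, expressed in percent.
missing_share = toy_ds.isna().mean() * 100
print(missing_share.round(2))

# The same arithmetic applied to the counts reported for the PM2.5 column.
print(round(2067 / 43824 * 100, 2))
```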


In [57]:
beijing_ds.describe()

| | No | PM2.5 | DEWP | TEMP | PRES | Iws | Is | Ir |
|---|---|---|---|---|---|---|---|---|
| count | 43824.0 | 41757.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 | 43824.0 |
| mean | 21912.5 | 98.613215 | 1.817246 | 12.448521 | 1016.447654 | 23.88914 | 0.052734 | 0.194916 |
| std | 12651.043435 | 92.050387 | 14.43344 | 12.198613 | 10.268698 | 50.010635 | 0.760375 | 1.415867 |
| min | 1.0 | 0.0 | -40.0 | -19.0 | 991.0 | 0.45 | 0.0 | 0.0 |
| 25% | 10956.75 | 29.0 | -10.0 | 2.0 | 1008.0 | 1.79 | 0.0 | 0.0 |
| 50% | 21912.5 | 72.0 | 2.0 | 14.0 | 1016.0 | 5.37 | 0.0 | 0.0 |
| 75% | 32868.25 | 137.0 | 15.0 | 23.0 | 1025.0 | 21.91 | 0.0 | 0.0 |
| max | 43824.0 | 994.0 | 28.0 | 42.0 | 1046.0 | 585.6 | 27.0 | 36.0 |


## PRSA dataset

In [6]:
prsa_folder = "/PRSA Data - Chinese cities"

In [25]:
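import os

# Read every per-station CSV in the PRSA folder (sorted for a deterministic order)
prsa_path = base_data_folder + prsa_folder
datasets = [pd.read_csv(os.path.join(prsa_path, file))
            for file in sorted(os.listdir(prsa_path)) if file.endswith('.csv')]

# Stack the per-station frames; each one keeps its own 0..N-1 index
prsa_dataset = pd.concat(datasets, axis=0)

# Combine the separate date-part columns into a single timestamp column
prsa_dataset['datetime'] = pd.to_datetime(prsa_dataset[['year', 'month', 'day', 'hour']])
# PM10 is kept, matching the columns shown in the outputs of this section
prsa_dataset.drop(columns=['year', 'month', 'day', 'hour'], inplace=True)

del datasets
prsa_dataset.head(10)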

| | No | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | wd | WSPM | station | datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.0 | 4.0 | 4.0 | 7.0 | 300.0 | 77.0 | -0.7 | 1023.0 | -18.8 | 0.0 | NNW | 4.4 | Aotizhongxin | 2013-03-01 00:00:00 |
| 1 | 2 | 8.0 | 8.0 | 4.0 | 7.0 | 300.0 | 77.0 | -1.1 | 1023.2 | -18.2 | 0.0 | N | 4.7 | Aotizhongxin | 2013-03-01 01:00:00 |
| 2 | 3 | 7.0 | 7.0 | 5.0 | 10.0 | 300.0 | 73.0 | -1.1 | 1023.5 | -18.2 | 0.0 | NNW | 5.6 | Aotizhongxin | 2013-03-01 02:00:00 |
| 3 | 4 | 6.0 | 6.0 | 11.0 | 11.0 | 300.0 | 72.0 | -1.4 | 1024.5 | -19.4 | 0.0 | NW | 3.1 | Aotizhongxin | 2013-03-01 03:00:00 |
| 4 | 5 | 3.0 | 3.0 | 12.0 | 12.0 | 300.0 | 72.0 | -2.0 | 1025.2 | -19.5 | 0.0 | N | 2.0 | Aotizhongxin | 2013-03-01 04:00:00 |
| 5 | 6 | 5.0 | 5.0 | 18.0 | 18.0 | 400.0 | 66.0 | -2.2 | 1025.6 | -19.6 | 0.0 | N | 3.7 | Aotizhongxin | 2013-03-01 05:00:00 |
| 6 | 7 | 3.0 | 3.0 | 18.0 | 32.0 | 500.0 | 50.0 | -2.6 | 1026.5 | -19.1 | 0.0 | NNE | 2.5 | Aotizhongxin | 2013-03-01 06:00:00 |
| 7 | 8 | 3.0 | 6.0 | 19.0 | 41.0 | 500.0 | 43.0 | -1.6 | 1027.4 | -19.1 | 0.0 | NNW | 3.8 | Aotizhongxin | 2013-03-01 07:00:00 |
| 8 | 9 | 3.0 | 6.0 | 16.0 | 43.0 | 500.0 | 45.0 | 0.1 | 1028.3 | -19.2 | 0.0 | NNW | 4.1 | Aotizhongxin | 2013-03-01 08:00:00 |
| 9 | 10 | 3.0 | 8.0 | 12.0 | 28.0 | 400.0 | 59.0 | 1.2 | 1028.5 | -19.3 | 0.0 | N | 2.6 | Aotizhongxin | 2013-03-01 09:00:00 |


In [46]:
prsa_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 420768 entries, 0 to 35063
Data columns (total 15 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   No        420768 non-null  int64         
 1   PM2.5     412029 non-null  float64       
 2   PM10      414319 non-null  float64       
 3   SO2       411747 non-null  float64       
 4   NO2       408652 non-null  float64       
 5   CO        400067 non-null  float64       
 6   O3        407491 non-null  float64       
 7   TEMP      420370 non-null  float64       
 8   PRES      420375 non-null  float64       
 9   DEWP      420365 non-null  float64       
 10  RAIN      420378 non-null  float64       
 11  wd        418946 non-null  object        
 12  WSPM      420450 non-null  float64       
 13  station   420768 non-null  object        
 14  datetime  420768 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(11), int64(1), object(2)
memory usage: 51.4+ MB


In [38]:
prsa_dataset['station'].unique()

array(['Aotizhongxin', 'Changping', 'Dingling', 'Dongsi', 'Guanyuan',
       'Gucheng', 'Huairou', 'Nongzhanguan', 'Shunyi', 'Tiantan',
       'Wanliu', 'Wanshouxigong'], dtype=object)

In [None]:
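# isna() is an alias of isnull(), so both calls report the same per-column counts
print(prsa_dataset.isnull().sum(), "\n")
print(prsa_dataset.isna().sum(), "\n")
# (rows, columns)
print(prsa_dataset.shape)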
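
Since the PRSA frame stacks 12 stations, overall missing-value counts can hide differences between sites. As a hedged sketch, missing values could also be counted per station with a group-by. The frame `toy_prsa` is a made-up stand-in that reuses only the `station`, `PM2.5` and `TEMP` column names and two station names from the output above.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; the measurement values are made up.
toy_prsa = pd.DataFrame({
    "station": ["Aotizhongxin", "Aotizhongxin", "Changping", "Changping"],
    "PM2.5": [4.0, np.nan, np.nan, np.nan],
    "TEMP": [-0.7, -1.1, np.nan, -2.3],
})

# Count missing values per station for every measurement column.
missing_per_station = (
    toy_prsa.drop(columns="station")
    .isna()
    .groupby(toy_prsa["station"])
    .sum()
)
print(missing_per_station)
```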

In [56]:
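# Stricter alternative: also join on the shared measurement columns
# pd.merge(beijing_ds, prsa_dataset, on=['datetime', 'PM2.5', 'TEMP', 'PRES', 'DEWP'], how='inner')
# Join on timestamp only: each beijing_ds hour (_x) is paired with every PRSA station row (_y)
pd.merge(beijing_ds, prsa_dataset, on=['datetime'], how='inner')[['No_x', 'PM2.5_x', 'TEMP_x', 'DEWP_x', 'datetime', 'No_y', "PM2.5_y", 'TEMP_y', 'DEWP_y']]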

| | No_x | PM2.5_x | TEMP_x | DEWP_x | datetime | No_y | PM2.5_y | TEMP_y | DEWP_y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27721 | 9.0 | 0.0 | -21 | 2013-03-01 00:00:00 | 1 | 4.0 | -0.7 | -18.8 |
| 1 | 27721 | 9.0 | 0.0 | -21 | 2013-03-01 00:00:00 | 1 | 3.0 | -2.3 | -19.7 |
| 2 | 27721 | 9.0 | 0.0 | -21 | 2013-03-01 00:00:00 | 1 | 4.0 | -2.3 | -19.7 |
| 3 | 27721 | 9.0 | 0.0 | -21 | 2013-03-01 00:00:00 | 1 | 9.0 | -0.5 | -21.4 |
| 4 | 27721 | 9.0 | 0.0 | -21 | 2013-03-01 00:00:00 | 1 | 4.0 | -0.7 | -18.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193243 | 43824 | 12.0 | -3.0 | -21 | 2014-12-31 23:00:00 | 16104 | 8.0 | -4.0 | -23.7 |
| 193244 | 43824 | 12.0 | -3.0 | -21 | 2014-12-31 23:00:00 | 16104 | 3.0 | -2.9 | -23.4 |
| 193245 | 43824 | 12.0 | -3.0 | -21 | 2014-12-31 23:00:00 | 16104 | 3.0 | -4.0 | -23.7 |
| 193246 | 43824 | 12.0 | -3.0 | -21 | 2014-12-31 23:00:00 | 16104 | 3.0 | -2.1 | -23.3 |
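
Because the join key is only `datetime`, each hourly row of `beijing_ds` is repeated once per PRSA station, which is why `No_x` 27721 appears on several rows above. One possible alternative, sketched here on made-up frames, is to average the stations per hour first so that the merge becomes one-to-one. The frame names `embassy` and `stations`, the suffixes, the sample values, and the choice of a plain mean are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

hours = pd.to_datetime(["2013-03-01 00:00", "2013-03-01 01:00"])

# Stand-in for beijing_ds: one row per hour (values are illustrative).
embassy = pd.DataFrame({"datetime": hours, "PM2.5": [9.0, 11.0]})

# Stand-in for prsa_dataset: one row per station per hour.
stations = pd.DataFrame({
    "datetime": list(hours) * 2,
    "station": ["Aotizhongxin"] * 2 + ["Changping"] * 2,
    "PM2.5": [4.0, 8.0, 3.0, 6.0],
})

# Collapse the stations to a single hourly mean before joining.
hourly_mean = stations.groupby("datetime", as_index=False)["PM2.5"].mean()

# validate="one_to_one" raises a MergeError if either side has duplicate keys.
merged = pd.merge(
    embassy, hourly_mean,
    on="datetime", how="inner",
    suffixes=("_embassy", "_stations"),
    validate="one_to_one",
)
print(merged)
```

With this shape the result has one row per hour, which makes a direct comparison between the embassy reading and the multi-site average straightforward.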
