# Data Analysis Project: Air Quality Dataset
- **Name:** Efrado Suryadi
- **Email:** efradosuryadi@gmail.com
- **Dicoding ID:** efrado_suryadi_tPYl

## Defining the Business Questions

- Question 1
- Question 2

## Import All Packages/Libraries Used

Imports for working with and visualizing data:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Imports for dealing with file paths:

In [3]:
import os

## Data Wrangling

### Data information

#### Data source

The data for this project was downloaded from https://github.com/marceloreis/HTI/tree/master. The repository contains air quality monitoring data from the following stations in Beijing, China:

- Aotizhongxin
- Changping
- Dingling
- Dongsi
- Guanyuan
- Gucheng
- Huairou
- Nongzhanguan
- Shunyi
- Tiantan
- Wanliu
- Wanshouxigong

#### Data features

Features in the data and their meanings:

- `No`: A row number that identifies each record within a file.
- `year`: The year of the recorded data.
- `month`: The month of the recorded data.
- `day`: The day of the month of the recorded data.
- `hour`: The hour of the day (0-23) when the observation was made.
- `PM2.5`: Concentration of particulate matter with a diameter of 2.5 micrometers or smaller (measured in µg/m³). Higher levels are more dangerous, as they can affect health, especially respiratory conditions.
- `PM10`: Concentration of particulate matter with a diameter of 10 micrometers or smaller (measured in µg/m³). Similar to PM2.5 but includes larger particles; important for assessing overall air quality.
- `SO2`: Concentration of sulfur dioxide (measured in µg/m³). A pollutant that can come from industrial processes and fossil fuel combustion; high levels can cause respiratory problems.
- `NO2`: Concentration of nitrogen dioxide (measured in µg/m³). A pollutant from vehicle emissions and other sources; contributes to smog and can affect lung function.
- `CO`: Concentration of carbon monoxide (measured in µg/m³). A colorless, odorless gas produced by incomplete combustion; can be harmful at high levels.
- `O3`: Concentration of ozone (measured in µg/m³). Ground-level ozone is a key component of smog and can harm health and the environment; usually forms in the presence of sunlight.
- `TEMP`: Temperature (measured in degrees Celsius). Provides context for air quality readings; can influence pollutant concentrations and reactions.
- `PRES`: Atmospheric pressure (measured in hPa or millibars). Important for understanding weather patterns and conditions affecting air quality.
- `DEWP`: Dew point temperature (measured in degrees Celsius). Indicates humidity levels and can help understand weather conditions affecting air quality.
- `RAIN`: Amount of rainfall (measured in mm). Rain can help clear pollutants from the air, so it's important for understanding air quality variations.
- `wd`: Wind direction, recorded in this dataset as a compass label (for example `N`, `NW`, `NNW`). Provides information about the source of air pollutants and can affect dispersion.
- `WSPM`: Wind speed (measured in meters per second). Important for understanding how pollutants disperse in the atmosphere.
- `station`: Identifier or name of the monitoring station where the data was collected. Helps identify the geographical location of the measurements, which is crucial for spatial analysis of air quality.
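
The sample rows in this notebook show `wd` as compass labels such as `NNW` rather than numbers. If a numeric direction is needed later, the conversion could be sketched as follows. This assumes the standard 16-point compass at 22.5° per point; `compass_to_degrees` is a hypothetical helper name, not part of the notebook.

```python
# 16-point compass labels, clockwise from north. This is an assumption about
# the full label set; the notebook output only shows a few of them.
COMPASS_POINTS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]


def compass_to_degrees(label):
    """Return the bearing in degrees for a compass label, or None if unknown."""
    if label not in COMPASS_POINTS:
        return None
    return COMPASS_POINTS.index(label) * 22.5


for wd in ["N", "NNW", "NW", "WNW"]:
    print(wd, compass_to_degrees(wd))
```

Returning `None` for unknown labels keeps missing `wd` values visible instead of silently mapping them to a direction.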

### Gathering Data

#### Reading data from the `data` directory

Build the path to the `data` directory, which contains all of the `.csv` files:

In [4]:
data_path = os.path.join(os.getcwd(), "data")

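Get the names of all `.csv` files as a sorted list:

In [17]:
# os.listdir returns every entry in arbitrary order, so keep only .csv files and sort them
csv_files = sorted(f for f in os.listdir(data_path) if f.endswith(".csv"))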
csv_files

['PRSA_Data_Aotizhongxin_20130301-20170228.csv',
 'PRSA_Data_Changping_20130301-20170228.csv',
 'PRSA_Data_Dingling_20130301-20170228.csv',
 'PRSA_Data_Dongsi_20130301-20170228.csv',
 'PRSA_Data_Guanyuan_20130301-20170228.csv',
 'PRSA_Data_Gucheng_20130301-20170228.csv',
 'PRSA_Data_Huairou_20130301-20170228.csv',
 'PRSA_Data_Nongzhanguan_20130301-20170228.csv',
 'PRSA_Data_Shunyi_20130301-20170228.csv',
 'PRSA_Data_Tiantan_20130301-20170228.csv',
 'PRSA_Data_Wanliu_20130301-20170228.csv',
 'PRSA_Data_Wanshouxigong_20130301-20170228.csv']
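
The file names in the listing above follow a regular pattern: `PRSA_Data_<station>_<start>-<end>.csv`. As a minimal sketch, the station name and date range can be pulled out of each name with a regular expression. The helper `parse_filename` and the pattern name are hypothetical, introduced only for illustration.

```python
import re

# Two file names copied from the listing above.
csv_files = [
    "PRSA_Data_Aotizhongxin_20130301-20170228.csv",
    "PRSA_Data_Wanshouxigong_20130301-20170228.csv",
]

# Pattern: PRSA_Data_<station>_<start>-<end>.csv, with 8-digit dates.
FILENAME_PATTERN = re.compile(
    r"^PRSA_Data_(?P<station>.+)_(?P<start>\d{8})-(?P<end>\d{8})\.csv$"
)


def parse_filename(name):
    """Split a PRSA file name into station, start and end strings."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None  # not a file that follows the expected pattern
    return match.groupdict()


for name in csv_files:
    print(parse_filename(name))
```

A check like this can also serve as a guard before reading, so that stray files in the `data` directory are skipped.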

Example of opening one `.csv` file from the `data` directory (`Aotizhongxin` in this case):

In [18]:
example_df = pd.read_csv(os.path.join(data_path, csv_files[0]))
example_df

index,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,1,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,Aotizhongxin
1,2,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7,Aotizhongxin
2,3,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,Aotizhongxin
3,4,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,NW,3.1,Aotizhongxin
4,5,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,N,2.0,Aotizhongxin
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35059,35060,2017,2,28,19,12.0,29.0,5.0,35.0,400.0,95.0,12.5,1013.5,-16.2,0.0,NW,2.4,Aotizhongxin
35060,35061,2017,2,28,20,13.0,37.0,7.0,45.0,500.0,81.0,11.6,1013.6,-15.1,0.0,WNW,0.9,Aotizhongxin
35061,35062,2017,2,28,21,16.0,37.0,10.0,66.0,700.0,58.0,10.8,1014.2,-13.3,0.0,NW,1.1,Aotizhongxin
35062,35063,2017,2,28,22,21.0,44.0,12.0,87.0,700.0,35.0,10.5,1014.4,-12.9,0.0,NNW,1.2,Aotizhongxin


#### Combining all data from csv into one `main_data.csv`

In [19]:
csv_dataframes = []

for csv_file in csv_files:
    df = pd.read_csv(os.path.join(data_path, csv_file))
    csv_dataframes.append(df)


In [38]:
main_df = pd.concat(csv_dataframes, ignore_index=True).drop(columns=['No'])
main_df

index,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,Aotizhongxin
1,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7,Aotizhongxin
2,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,Aotizhongxin
3,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,NW,3.1,Aotizhongxin
4,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,N,2.0,Aotizhongxin
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420763,2017,2,28,19,11.0,32.0,3.0,24.0,400.0,72.0,12.5,1013.5,-16.2,0.0,NW,2.4,Wanshouxigong
420764,2017,2,28,20,13.0,32.0,3.0,41.0,500.0,50.0,11.6,1013.6,-15.1,0.0,WNW,0.9,Wanshouxigong
420765,2017,2,28,21,14.0,28.0,4.0,38.0,500.0,54.0,10.8,1014.2,-13.3,0.0,NW,1.1,Wanshouxigong
420766,2017,2,28,22,12.0,23.0,4.0,30.0,400.0,59.0,10.5,1014.4,-12.9,0.0,NNW,1.2,Wanshouxigong
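
The heading of this step mentions `main_data.csv`, but the cells shown only build `main_df` in memory. A minimal sketch of the missing write step is given below. It uses two tiny stand-in frames with made-up values and a temporary directory, so the output path is an assumption rather than the project's actual location.

```python
import os
import tempfile

import pandas as pd

# Two tiny stand-in frames with made-up values, in place of the real station files.
frames = [
    pd.DataFrame({"No": [1, 2], "PM2.5": [4.0, 8.0], "station": ["A", "A"]}),
    pd.DataFrame({"No": [1, 2], "PM2.5": [5.0, 9.0], "station": ["B", "B"]}),
]

# Same combine step as in the notebook: stack the frames and drop the per-file counter.
combined = pd.concat(frames, ignore_index=True).drop(columns=["No"])

# Write to a temporary directory here; in the project this would be the chosen output path.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "main_data.csv")
combined.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

# Read the file back to confirm that shape and columns survive the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Passing `index=False` matters here: without it, the row index is written as an extra unnamed first column that shows up again when the file is read back.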


### Assessing Data

#### Checking object info of `main_df`:
Check the data type and non-null count of each column in the dataframe:

In [39]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420768 entries, 0 to 420767
Data columns (total 17 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   year     420768 non-null  int64  
 1   month    420768 non-null  int64  
 2   day      420768 non-null  int64  
 3   hour     420768 non-null  int64  
 4   PM2.5    412029 non-null  float64
 5   PM10     414319 non-null  float64
 6   SO2      411747 non-null  float64
 7   NO2      408652 non-null  float64
 8   CO       400067 non-null  float64
 9   O3       407491 non-null  float64
 10  TEMP     420370 non-null  float64
 11  PRES     420375 non-null  float64
 12  DEWP     420365 non-null  float64
 13  RAIN     420378 non-null  float64
 14  wd       418946 non-null  object 
 15  WSPM     420450 non-null  float64
 16  station  420768 non-null  object 
dtypes: float64(11), int64(4), object(2)
memory usage: 54.6+ MB


#### Checking null values in `main_df`:
Count the null values in each column:

In [41]:
print("Air quality data null values:")
main_df.isna().sum()

Air quality data null values:


year           0
month          0
day            0
hour           0
PM2.5       8739
PM10        6449
SO2         9021
NO2        12116
CO         20701
O3         13277
TEMP         398
PRES         393
DEWP         403
RAIN         390
wd          1822
WSPM         318
station        0
dtype: int64
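
Raw counts are easier to judge as a share of the rows. In the output above, `CO` has the most gaps, with 20701 of 420768 rows missing, which is roughly 4.9%. A minimal sketch of computing that share for every column, on a small made-up frame standing in for `main_df`:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for main_df.
df = pd.DataFrame({
    "PM2.5": [4.0, np.nan, 7.0, 6.0],
    "CO": [300.0, np.nan, np.nan, 300.0],
    "station": ["A", "A", "B", "B"],
})

# isna().mean() gives the fraction of missing rows per column;
# multiply by 100 for a percentage and sort so the worst column comes first.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```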

#### Checking duplicates in `main_df`:
Count the fully duplicated rows in the air quality data:

In [43]:
print("Air quality data duplicated sum: ", main_df.duplicated().sum())

Air quality data duplicated sum:  0
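
`duplicated()` with no arguments only flags rows that match in every column. Two readings for the same station and hour with different measurements would not be caught. As a hedged sketch, a stricter check can be run on the key columns alone. The choice of `station` plus the four time columns as the key is an assumption, and the values are made up.

```python
import pandas as pd

# Made-up rows: the first two share station and timestamp but differ in PM2.5.
df = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [3, 3, 3, 3],
    "day": [1, 1, 1, 1],
    "hour": [0, 0, 1, 0],
    "PM2.5": [4.0, 5.0, 8.0, 4.0],
    "station": ["A", "A", "A", "B"],
})

# Assumed key: one reading per station per hour.
key_columns = ["station", "year", "month", "day", "hour"]

full_row_dupes = df.duplicated().sum()                  # rows identical in every column
key_dupes = df.duplicated(subset=key_columns).sum()     # repeated station/hour pairs

print("Full-row duplicates:", full_row_dupes)
print("Key duplicates:", key_dupes)
```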


#### Checking descriptive statistics in `main_df`:
Summarize the numeric columns of `main_df` with the `describe()` method:

In [44]:
main_df.describe()

statistic,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,WSPM
count,420768.0,420768.0,420768.0,420768.0,412029.0,414319.0,411747.0,408652.0,400067.0,407491.0,420370.0,420375.0,420365.0,420378.0,420450.0
mean,2014.66256,6.52293,15.729637,11.5,79.793428,104.602618,15.830835,50.638586,1230.766454,57.372271,13.538976,1010.746982,2.490822,0.064476,1.729711
std,1.177198,3.448707,8.800102,6.922195,80.822391,91.772426,21.650603,35.127912,1160.182716,56.661607,11.436139,10.474055,13.793847,0.821004,1.246386
min,2013.0,1.0,1.0,0.0,2.0,2.0,0.2856,1.0265,100.0,0.2142,-19.9,982.4,-43.4,0.0,0.0
25%,2014.0,4.0,8.0,5.75,20.0,36.0,3.0,23.0,500.0,11.0,3.1,1002.3,-8.9,0.0,0.9
50%,2015.0,7.0,16.0,11.5,55.0,82.0,7.0,43.0,900.0,45.0,14.5,1010.4,3.1,0.0,1.4
75%,2016.0,10.0,23.0,17.25,111.0,145.0,20.0,71.0,1500.0,82.0,23.3,1019.0,15.1,0.0,2.2
max,2017.0,12.0,31.0,23.0,999.0,999.0,500.0,290.0,10000.0,1071.0,41.6,1042.8,29.1,72.5,13.2
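
The table above shows maxima far above the 75th percentile for several pollutants (for example, `PM2.5` peaks at 999.0 against a third quartile of 111.0). One common way to quantify such extremes is Tukey's 1.5 × IQR rule, sketched below. This rule is an assumption brought in for illustration, not something the notebook applies, and heavy-tailed pollution data may legitimately exceed the fence.

```python
import pandas as pd


def iqr_upper_fence(q1, q3, k=1.5):
    """Upper fence of Tukey's rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)


# Quartiles for PM2.5 taken from the describe() output above.
pm25_fence = iqr_upper_fence(20.0, 111.0)
print(pm25_fence)  # 247.5

# The same rule applied to a small made-up series.
values = pd.Series([20.0, 55.0, 111.0, 999.0])
q1, q3 = values.quantile(0.25), values.quantile(0.75)
outliers = values[values > iqr_upper_fence(q1, q3)]
print(outliers.tolist())
```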


#### Insights on assessing data

As the assessment showed, there are several issues I would like to address:

1. There are too many features, so I decided to reduce them by merging the time-related features into a single `date_time` feature, since `pandas` provides a dedicated datetime type.
2. Null values need to be cleaned.
3. There are no duplicated rows, so no deduplication is needed.
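
The second item in the list above, cleaning null values, is not implemented in the cells that follow. One possible approach is sketched here, with the loud caveat that it is an assumption and not the author's method: linear interpolation for numeric columns and forward fill for `wd`, both done within each station so values never leak across stations. It also assumes rows are ordered by time within each station, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for main_df, ordered by time within each station.
df = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "PM2.5": [4.0, np.nan, 8.0, 10.0, np.nan, np.nan],
    "wd": ["N", None, "NW", "E", "E", None],
})

# Numeric column: interpolate linearly inside each station group.
# limit_direction="both" also fills gaps at the start or end of a group.
df["PM2.5"] = df.groupby("station")["PM2.5"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

# Categorical column: carry the last known direction forward inside each station.
df["wd"] = df.groupby("station")["wd"].ffill()

print(df)
```

Interpolation suits short gaps in hourly readings. For long outages it invents a smooth trend that was never measured, so dropping those rows may be the safer choice.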

### Cleaning Data

#### Reducing the number of features

##### Handling time features in the dataframe

The time-related features take up four columns in the dataframe, and this analysis does not need them as separate fields. I therefore combine `year`, `month`, `day`, and `hour` into a single column named `date_time`.

In [31]:
# Build the 'date_time' column from the four time columns
main_df['date_time'] = pd.to_datetime(main_df[['year', 'month', 'day', 'hour']])

# Drop the original time columns
main_df.drop(columns=['year', 'month', 'day', 'hour'], inplace=True)

index,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station,date_time
0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,Aotizhongxin,2013-03-01 00:00:00
1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7,Aotizhongxin,2013-03-01 01:00:00
2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,Aotizhongxin,2013-03-01 02:00:00
3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,NW,3.1,Aotizhongxin,2013-03-01 03:00:00
4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,N,2.0,Aotizhongxin,2013-03-01 04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420763,11.0,32.0,3.0,24.0,400.0,72.0,12.5,1013.5,-16.2,0.0,NW,2.4,Wanshouxigong,2017-02-28 19:00:00
420764,13.0,32.0,3.0,41.0,500.0,50.0,11.6,1013.6,-15.1,0.0,WNW,0.9,Wanshouxigong,2017-02-28 20:00:00
420765,14.0,28.0,4.0,38.0,500.0,54.0,10.8,1014.2,-13.3,0.0,NW,1.1,Wanshouxigong,2017-02-28 21:00:00
420766,12.0,23.0,4.0,30.0,400.0,59.0,10.5,1014.4,-12.9,0.0,NNW,1.2,Wanshouxigong,2017-02-28 22:00:00
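
Merging the four time columns loses no information, because each part can be read back from a datetime column through the `.dt` accessor. A minimal sketch on two made-up rows:

```python
import pandas as pd

# Two made-up rows with the same time columns as the dataset.
df = pd.DataFrame({
    "year": [2013, 2017],
    "month": [3, 2],
    "day": [1, 28],
    "hour": [0, 23],
})

# to_datetime assembles a timestamp from columns named year, month, day, hour.
df["date_time"] = pd.to_datetime(df[["year", "month", "day", "hour"]])

# Each component can be recovered on demand, so the originals are redundant.
recovered = pd.DataFrame({
    "year": df["date_time"].dt.year,
    "month": df["date_time"].dt.month,
    "day": df["date_time"].dt.day,
    "hour": df["date_time"].dt.hour,
})

print(df["date_time"].tolist())
print(recovered)
```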


In [36]:
# move the 'date_time' column to the leftmost position
columns = ['date_time'] + [col for col in main_df.columns if col != 'date_time']

main_df = main_df[columns]


In [37]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420768 entries, 0 to 420767
Data columns (total 14 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   date_time  420768 non-null  datetime64[ns]
 1   PM2.5      412029 non-null  float64       
 2   PM10       414319 non-null  float64       
 3   SO2        411747 non-null  float64       
 4   NO2        408652 non-null  float64       
 5   CO         400067 non-null  float64       
 6   O3         407491 non-null  float64       
 7   TEMP       420370 non-null  float64       
 8   PRES       420375 non-null  float64       
 9   DEWP       420365 non-null  float64       
 10  RAIN       420378 non-null  float64       
 11  wd         418946 non-null  object        
 12  WSPM       420450 non-null  float64       
 13  station    420768 non-null  object        
dtypes: datetime64[ns](1), float64(11), object(2)
memory usage: 44.9+ MB
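
The drop in reported memory, from 54.6+ MB with 17 columns to 44.9+ MB with 14 columns, can be checked by hand. The sketch below assumes 8 bytes per value in every column (true for `int64`, `float64`, `datetime64[ns]`, and object pointers on a 64-bit build) and that pandas reports the figure in units of 2^20 bytes. The `+` in the output marks the string contents of object columns, which are not counted.

```python
ROWS = 420_768        # from the RangeIndex line in the info() output
BYTES_PER_VALUE = 8   # assumed width of every column's values


def frame_mib(n_columns, rows=ROWS):
    """Approximate column storage in units of 2**20 bytes, ignoring the index."""
    return n_columns * rows * BYTES_PER_VALUE / 2**20


before = frame_mib(17)  # year, month, day, hour kept as separate columns
after = frame_mib(14)   # four time columns replaced by one date_time column

print(round(before, 1), round(after, 1))  # 54.6 44.9
```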


**Insight:**
- xxx
- xxx

## Exploratory Data Analysis (EDA)

### Explore ...

**Insight:**
- xxx
- xxx

## Visualization & Explanatory Analysis

### Question 1:

### Question 2:

**Insight:**
- xxx
- xxx

## Advanced Analysis (Optional)

## Conclusion

- Conclusion for question 1
- Conclusion for question 2