### Overview
The AQI categories below follow the scale published at https://aqicn.org/scale/:

- 0 - 50 (Good): Air quality is considered satisfactory, and air pollution poses little or no risk.
- 51 - 100 (Moderate): Air quality is acceptable; however, some pollutants might pose a moderate health concern for a very small number of people who are unusually sensitive to air pollution.
- 101 - 150 (Unhealthy for Sensitive Groups): Members of sensitive groups may experience health effects. The general public is less likely to be affected.
- 151 - 200 (Unhealthy): Everyone may begin to experience health effects, with sensitive groups possibly experiencing more serious effects.
- 201 - 300 (Very Unhealthy): Health warnings of emergency conditions. The entire population is more likely to be affected.
- 300+ (Hazardous): Health alert: everyone may experience more serious health effects.


In [2]:
import pandas as pd

### Reading and Cleaning the Data

In [6]:
# Load the data
file_path = 'data/naivebayespolution.csv'
data = pd.read_csv(file_path)

# Clean the data
# Keep only the columns needed for labelling
data_cleaned = data[['date', 'pm 2.5', 'stasiun']].copy()

# Convert the 'pm 2.5' column to numeric; non-numeric entries become NaN
data_cleaned['pm 2.5'] = pd.to_numeric(data_cleaned['pm 2.5'], errors='coerce')

# Remove rows with missing PM 2.5 values, including those coerced to NaN above
data_cleaned = data_cleaned.dropna(subset=['pm 2.5'])

print(data_cleaned.head(10))

      date  pm 2.5               stasiun
0  9/27/23    65.0  Balikpapan Sepinggan
1  9/28/23    63.0  Balikpapan Sepinggan
2  9/29/23    57.0  Balikpapan Sepinggan
3  9/30/23    46.0  Balikpapan Sepinggan
4  10/1/23    37.0  Balikpapan Sepinggan
5  10/2/23    55.0  Balikpapan Sepinggan
6  10/3/23    54.0  Balikpapan Sepinggan
7  10/4/23    62.0  Balikpapan Sepinggan
8  10/5/23    64.0  Balikpapan Sepinggan
9  10/6/23    63.0  Balikpapan Sepinggan
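The interaction between `pd.to_numeric(..., errors='coerce')` and `dropna` can be checked on a tiny frame. The sketch below is only illustrative: the rows and the `'-'` placeholder for a missing reading are made-up assumptions, not values taken from `naivebayespolution.csv`.

```python
import pandas as pd

# Hypothetical sample rows; '-' and None stand in for unusable readings.
raw = pd.DataFrame({
    'date': ['9/27/23', '9/28/23', '9/29/23', '9/30/23'],
    'pm 2.5': ['65', '-', None, '46'],
    'stasiun': ['Balikpapan Sepinggan'] * 4,
})

# Coerce first: anything that cannot be parsed as a number becomes NaN.
raw['pm 2.5'] = pd.to_numeric(raw['pm 2.5'], errors='coerce')

# Then drop: this removes both the original gap and the coerced placeholder.
cleaned = raw.dropna(subset=['pm 2.5']).reset_index(drop=True)

print(cleaned)
```

Running the coercion before `dropna` means a single pass removes every row without a usable reading.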


### Labelling the AQI Level

In [7]:
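# Define the function to label the pollution level
def label_aqi(pm25):
    """Map a value from the 'pm 2.5' column to its AQI category name."""
    # A missing value fails every comparison below and would otherwise
    # fall through to 'Hazardous', so it is handled explicitly.
    if pd.isna(pm25):
        return None
    if pm25 <= 50:
        return 'Good'
    elif pm25 <= 100:
        return 'Moderate'
    elif pm25 <= 150:
        return 'Unhealthy for Sensitive Groups'
    elif pm25 <= 200:
        return 'Unhealthy'
    elif pm25 <= 300:
        return 'Very Unhealthy'
    else:
        return 'Hazardous'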

# Apply the function to create a new column with labels
data_cleaned['Pollution Level'] = data_cleaned['pm 2.5'].apply(label_aqi)

# Display the labeled data
data_cleaned.head()

      date  pm 2.5               stasiun Pollution Level
0  9/27/23    65.0  Balikpapan Sepinggan        Moderate
1  9/28/23    63.0  Balikpapan Sepinggan        Moderate
2  9/29/23    57.0  Balikpapan Sepinggan        Moderate
3  9/30/23    46.0  Balikpapan Sepinggan            Good
4  10/1/23    37.0  Balikpapan Sepinggan            Good
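As an alternative to applying `label_aqi` row by row, the same thresholds can be expressed as bins for `pd.cut`. This is a sketch rather than the notebook's method: the names `AQI_BINS`, `AQI_LABELS`, and `label_aqi_vectorized` are hypothetical, and the sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the <= thresholds used by label_aqi (hypothetical names).
AQI_BINS = [-np.inf, 50, 100, 150, 200, 300, np.inf]
AQI_LABELS = [
    'Good',
    'Moderate',
    'Unhealthy for Sensitive Groups',
    'Unhealthy',
    'Very Unhealthy',
    'Hazardous',
]

def label_aqi_vectorized(values: pd.Series) -> pd.Series:
    # right=True makes each bin closed on the right, so 50 falls in 'Good'
    # and 100 falls in 'Moderate', matching the <= comparisons.
    return pd.cut(values, bins=AQI_BINS, labels=AQI_LABELS, right=True)

# Invented sample values, including a boundary case and a missing reading.
sample = pd.Series([37.0, 50.0, 65.0, 151.0, 301.0, np.nan])
labelled = label_aqi_vectorized(sample)
print(labelled.tolist())
```

One difference worth noting: `pd.cut` returns an ordered categorical column and leaves missing values as NaN instead of assigning them a category.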


### Assigning Station IDs

In [8]:
# Assign unique ID to each station
data_cleaned['id_stasiun'] = data_cleaned['stasiun'].factorize()[0] + 1

# Reorder columns to place 'id_stasiun' next to 'stasiun'
data_cleaned = data_cleaned[['date', 'pm 2.5', 'stasiun', 'id_stasiun', 'Pollution Level']]

# Optionally, save the cleaned and labeled data to a new CSV file
output_file_path = 'labeled_pollution_data.csv'
data_cleaned.to_csv(output_file_path, index=False)

# Display the result
print(data_cleaned.head())

      date  pm 2.5               stasiun  id_stasiun Pollution Level
0  9/27/23    65.0  Balikpapan Sepinggan           1        Moderate
1  9/28/23    63.0  Balikpapan Sepinggan           1        Moderate
2  9/29/23    57.0  Balikpapan Sepinggan           1        Moderate
3  9/30/23    46.0  Balikpapan Sepinggan           1            Good
4  10/1/23    37.0  Balikpapan Sepinggan           1            Good
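To see how `factorize` numbers the stations, the sketch below runs it on a short series and builds a name-to-ID lookup. Only `Balikpapan Sepinggan` comes from the data shown above; `Station B` and `Station C` are placeholder names assumed for illustration.

```python
import pandas as pd

# 'Station B' and 'Station C' are hypothetical placeholder names.
stations = pd.Series([
    'Balikpapan Sepinggan',
    'Station B',
    'Balikpapan Sepinggan',
    'Station C',
])

# factorize returns integer codes (starting at 0) and the unique values
# in order of first appearance.
codes, uniques = pd.factorize(stations)

# Shift by one so IDs start at 1, as in the cell above.
ids = codes + 1

# Lookup table from station name to its assigned ID.
lookup = {name: i for i, name in enumerate(uniques, start=1)}

print(list(ids))
print(lookup)
```

Because IDs are assigned in order of first appearance, they depend on the row order of the input; sorting or filtering the data before factorizing can change which station receives which ID.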
