## Step 2: Feature Engineering and Target Variable Creation

In this notebook, we perform feature engineering and define the machine learning targets for both regression and classification models. This step builds on the cleaned dataset prepared in the previous notebook.

### 1. Load Cleaned Dataset

We begin by loading the merged and cleaned weather + AQI dataset created in the previous step.


In [1]:
# load the cleaned dataset
import pandas as pd

merged_data = pd.read_csv('../data/cleaned_aqi_weather.csv')

# convert date to datetime
merged_data['date'] = pd.to_datetime(merged_data['date'])

# preview
merged_data.head()

Unnamed: 0,date,pm10,so2,co,o3,no2,max,tempmax,tempmin,temp,feelslike,dew,humidity,precip,precipcover,windspeed,winddir,cloudcover,uvindex,solarenergy
0,2016-01-01,72.0,32.0,53.0,51.0,11.0,72.0,32.39,24.0,28.17,31.33,76.0,81.2,0.998,16.67,14.5,312.2,58.2,8,19.6
1,2016-01-02,52.0,32.0,29.0,50.0,12.0,52.0,31.72,24.89,27.17,29.39,76.0,85.6,0.585,16.67,17.7,306.2,68.0,7,15.8
2,2016-01-03,44.0,33.0,33.0,50.0,18.0,50.0,32.72,25.0,28.39,31.56,75.5,79.1,0.0,0.0,11.1,292.7,56.7,7,15.1
3,2016-01-04,61.0,33.0,36.0,103.0,24.0,103.0,31.72,26.0,28.56,33.22,76.8,80.9,0.0,0.0,7.8,306.9,61.8,8,16.5
4,2016-01-05,99.0,33.0,54.0,67.0,22.0,99.0,32.39,25.17,28.56,32.72,76.5,80.6,0.457,8.33,11.4,10.7,59.0,7,18.7


### 2. Create ML Target Variables

We create two target columns, both derived from the next day's value of the `max` column:

- `next_day_aqi`: the AQI value of the next day (used for regression)
- `aqi_category`: the AQI category of the next day (used for classification)


In [2]:
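# create regression target (next day aqi)
merged_data['next_day_aqi'] = merged_data['max'].shift(-1)

# map an AQI value to its category label
def categorize_aqi(aqi):
    # the last row has no next day, so shift(-1) leaves NaN there;
    # without this check NaN would fall through to 'Very Unhealthy'
    if pd.isna(aqi):
        return None
    if aqi <= 50:
        return 'Good'
    elif aqi <= 100:
        return 'Moderate'
    elif aqi <= 200:
        return 'Unhealthy'
    else:
        return 'Very Unhealthy'

# classify the next-day AQI
merged_data['aqi_category'] = merged_data['next_day_aqi'].apply(categorize_aqi)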

# preview the dataset
merged_data.head()

Unnamed: 0,date,pm10,so2,co,o3,no2,max,tempmax,tempmin,temp,...,humidity,precip,precipcover,windspeed,winddir,cloudcover,uvindex,solarenergy,next_day_aqi,aqi_category
0,2016-01-01,72.0,32.0,53.0,51.0,11.0,72.0,32.39,24.0,28.17,...,81.2,0.998,16.67,14.5,312.2,58.2,8,19.6,52.0,Moderate
1,2016-01-02,52.0,32.0,29.0,50.0,12.0,52.0,31.72,24.89,27.17,...,85.6,0.585,16.67,17.7,306.2,68.0,7,15.8,50.0,Good
2,2016-01-03,44.0,33.0,33.0,50.0,18.0,50.0,32.72,25.0,28.39,...,79.1,0.0,0.0,11.1,292.7,56.7,7,15.1,103.0,Unhealthy
3,2016-01-04,61.0,33.0,36.0,103.0,24.0,103.0,31.72,26.0,28.56,...,80.9,0.0,0.0,7.8,306.9,61.8,8,16.5,99.0,Moderate
4,2016-01-05,99.0,33.0,54.0,67.0,22.0,99.0,32.39,25.17,28.56,...,80.6,0.457,8.33,11.4,10.7,59.0,7,18.7,75.0,Moderate
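
Because the target is shifted by one day, the final row of the dataset has no next-day value and cannot be used for training. The sketch below is not part of the original notebook: it uses a toy frame (`toy`, built from the five `max` values in the preview) to show one way to drop that row, and it expresses the same thresholds with `pd.cut` as a vectorized alternative to `apply`. The names `toy`, `bins`, `labels`, and `train_ready` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the `max` column, using the values from the preview.
toy = pd.DataFrame({"max": [72.0, 52.0, 50.0, 103.0, 99.0]})
toy["next_day_aqi"] = toy["max"].shift(-1)  # last row becomes NaN

# Same thresholds as categorize_aqi, written as right-closed bins:
# (-inf, 50], (50, 100], (100, 200], (200, inf]
bins = [-float("inf"), 50, 100, 200, float("inf")]
labels = ["Good", "Moderate", "Unhealthy", "Very Unhealthy"]
toy["aqi_category"] = pd.cut(toy["next_day_aqi"], bins=bins, labels=labels)

# Keep only rows that have a target; the final day is dropped.
train_ready = toy.dropna(subset=["next_day_aqi"])
print(train_ready)
```

With `pd.cut`, a missing `next_day_aqi` stays missing in `aqi_category` instead of being assigned a label.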


### 3. Extract Date-Based Features

We extract useful temporal features:

- `month`: numerical month
- `dayofweek`: day of the week (0=Monday, 6=Sunday)
- `is_weekend`: whether it's a Saturday or Sunday

In [3]:
# month number (1-12)
merged_data['month'] = merged_data['date'].dt.month
# day of week (0=Monday, 6=Sunday)
merged_data['dayofweek'] = merged_data['date'].dt.dayofweek
# True for Saturday (5) and Sunday (6)
merged_data['is_weekend'] = merged_data['dayofweek'] >= 5

# preview the dataset
merged_data.head()

Unnamed: 0,date,pm10,so2,co,o3,no2,max,tempmax,tempmin,temp,...,windspeed,winddir,cloudcover,uvindex,solarenergy,next_day_aqi,aqi_category,month,dayofweek,is_weekend
0,2016-01-01,72.0,32.0,53.0,51.0,11.0,72.0,32.39,24.0,28.17,...,14.5,312.2,58.2,8,19.6,52.0,Moderate,1,4,False
1,2016-01-02,52.0,32.0,29.0,50.0,12.0,52.0,31.72,24.89,27.17,...,17.7,306.2,68.0,7,15.8,50.0,Good,1,5,True
2,2016-01-03,44.0,33.0,33.0,50.0,18.0,50.0,32.72,25.0,28.39,...,11.1,292.7,56.7,7,15.1,103.0,Unhealthy,1,6,True
3,2016-01-04,61.0,33.0,36.0,103.0,24.0,103.0,31.72,26.0,28.56,...,7.8,306.9,61.8,8,16.5,99.0,Moderate,1,0,False
4,2016-01-05,99.0,33.0,54.0,67.0,22.0,99.0,32.39,25.17,28.56,...,11.4,10.7,59.0,7,18.7,75.0,Moderate,1,1,False
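
As a quick sanity check of the weekday convention, the following sketch (not part of the original notebook) recomputes the features for the first four dates in the preview and adds the weekday name via `dt.day_name()`. The frame `features` and the `day_name` column are illustrative assumptions.

```python
import pandas as pd

# First four dates from the preview above.
dates = pd.to_datetime(
    pd.Series(["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"])
)
features = pd.DataFrame({"date": dates})

features["dayofweek"] = features["date"].dt.dayofweek  # 0=Monday ... 6=Sunday
features["day_name"] = features["date"].dt.day_name()  # human-readable check
features["is_weekend"] = features["dayofweek"] >= 5    # Saturday or Sunday

print(features)
```

2016-01-01 was a Friday, so it maps to `dayofweek` 4 and `is_weekend` False, which matches the first row of the output above.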


### 4. Preview of Enriched Dataset

The dataset now contains the two target columns and the three date-based features. The preview below shows them next to the current-day `max` value.


In [4]:
merged_data[['date', 'max', 'next_day_aqi', 'aqi_category', 'month', 'dayofweek', 'is_weekend']].head()

Unnamed: 0,date,max,next_day_aqi,aqi_category,month,dayofweek,is_weekend
0,2016-01-01,72.0,52.0,Moderate,1,4,False
1,2016-01-02,52.0,50.0,Good,1,5,True
2,2016-01-03,50.0,103.0,Unhealthy,1,6,True
3,2016-01-04,103.0,99.0,Moderate,1,0,False
4,2016-01-05,99.0,75.0,Moderate,1,1,False


### 5. Save Enriched Data

We save this version for the next notebook, which adds further engineered features (lag, rolling-window, and interaction features).

In [5]:
# save the enriched dataset
merged_data.to_csv('../data/enriched_aqi_weather.csv', index=False)
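
CSV files do not store column dtypes, so the `date` column comes back as plain strings unless it is parsed again on load. The sketch below is an assumption about how the next notebook might reload the file, not the author's code. It round-trips a small sample frame through an in-memory buffer (standing in for `'../data/enriched_aqi_weather.csv'`) and uses `parse_dates` to restore the datetime dtype.

```python
import io
import pandas as pd

# Small sample with the same kinds of columns as the enriched dataset.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "next_day_aqi": [52.0, 50.0],
    "is_weekend": [False, True],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the listed columns back to datetime on load.
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```

Passing `parse_dates=["date"]` replaces the separate `pd.to_datetime` call used at the start of this notebook.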