## Load Dataset

### Subtask:
Load the 'BeijingPM20100101_20151231.csv' file into a pandas DataFrame.


First, I'll import pandas, then load 'BeijingPM20100101_20151231.csv' into a DataFrame named `df` and print its first rows to confirm that the load succeeded.



In [None]:
import pandas as pd
df = pd.read_csv('/content/BeijingPM20100101_20151231.csv')
print(df.head())

   No  year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  \
0   1  2010      1    1     0       4        NaN            NaN   
1   2  2010      1    1     1       4        NaN            NaN   
2   3  2010      1    1     2       4        NaN            NaN   
3   4  2010      1    1     3       4        NaN            NaN   
4   5  2010      1    1     4       4        NaN            NaN   

   PM_Nongzhanguan  PM_US Post  DEWP  HUMI    PRES  TEMP cbwd    Iws  \
0              NaN         NaN -21.0  43.0  1021.0 -11.0   NW   1.79   
1              NaN         NaN -21.0  47.0  1020.0 -12.0   NW   4.92   
2              NaN         NaN -21.0  43.0  1019.0 -11.0   NW   6.71   
3              NaN         NaN -21.0  55.0  1019.0 -14.0   NW   9.84   
4              NaN         NaN -20.0  51.0  1018.0 -12.0   NW  12.97   

   precipitation  Iprec  
0            0.0    0.0  
1            0.0    0.0  
2            0.0    0.0  
3            0.0    0.0  
4            0.0    0.0  


## Inspect Data
Display the first few rows of the DataFrame, check its information (data types, non-null counts), and describe basic statistics to understand its structure and content.


To understand the DataFrame's structure and content, I will display its information, including data types and non-null counts, using `df.info()` and then generate descriptive statistics for numerical columns using `df.describe()`.



In [None]:
df.info()  # info() prints its report itself and returns None, so print() would add a stray "None"
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   No               52584 non-null  int64  
 1   year             52584 non-null  int64  
 2   month            52584 non-null  int64  
 3   day              52584 non-null  int64  
 4   hour             52584 non-null  int64  
 5   season           52584 non-null  int64  
 6   PM_Dongsi        25052 non-null  float64
 7   PM_Dongsihuan    20508 non-null  float64
 8   PM_Nongzhanguan  24931 non-null  float64
 9   PM_US Post       50387 non-null  float64
 10  DEWP             52579 non-null  float64
 11  HUMI             52245 non-null  float64
 12  PRES             52245 non-null  float64
 13  TEMP             52579 non-null  float64
 14  cbwd             52579 non-null  object 
 15  Iws              52579 non-null  float64
 16  precipitation    52100 non-null  float64
 17  Iprec       

## Summary:

### Data Analysis Key Findings

*   The dataset, named "BeijingPM20100101\_20151231.csv", contains 52,584 entries and 18 columns.
*   The columns include a row identifier (`No`), temporal information (`year`, `month`, `day`, `hour`, `season`), air quality measurements (e.g., `PM_Dongsi`, `PM_US Post`), and meteorological features (e.g., `DEWP`, `HUMI`, `PRES`, `TEMP`, `Iws`, `precipitation`, `Iprec`, `cbwd`).
*   The DataFrame consists of 11 float64 columns, 6 int64 columns, and 1 object column (`cbwd`).
*   Significant missing values are present in the PM-related columns: `PM_Dongsi`, `PM_Dongsihuan`, and `PM_Nongzhanguan` are each missing more than half of their values (about 52%, 61%, and 53%), while `PM_US Post` is missing about 4% (2,197 rows). Other columns like `DEWP`, `HUMI`, `PRES`, `TEMP`, `cbwd`, `Iws`, `precipitation`, and `Iprec` also have a small number of missing entries.
*   Descriptive statistics reveal a suspicious maximum value of 999,990.0 in both `precipitation` and `Iprec` columns, suggesting a potential data entry error or a special code.
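
To put numbers on the missing-value finding, the non-null counts printed by `df.info()` can be turned into missing counts and percentages. The sketch below hard-codes those counts rather than reading the file; the `Iprec` count is cut off in the printed output, so that column is left out. On the live DataFrame, `df.isna().mean()` should give the same fractions directly.

```python
import pandas as pd

TOTAL_ROWS = 52584  # from the RangeIndex line of df.info()

# Non-null counts copied from the df.info() output (Iprec omitted: truncated there).
non_null = pd.Series({
    "PM_Dongsi": 25052,
    "PM_Dongsihuan": 20508,
    "PM_Nongzhanguan": 24931,
    "PM_US Post": 50387,
    "DEWP": 52579,
    "HUMI": 52245,
    "PRES": 52245,
    "TEMP": 52579,
    "cbwd": 52579,
    "Iws": 52579,
    "precipitation": 52100,
})

missing = TOTAL_ROWS - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (100 * missing / TOTAL_ROWS).round(2),
}).sort_values("missing", ascending=False)

print(summary)
```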

### Insights or Next Steps

*   Further investigation is required to understand and address the anomalous maximum values in the `precipitation` and `Iprec` columns. They may be a sentinel code, for example for no precipitation or for an unrecorded reading, rather than real measurements.
*   A comprehensive strategy for handling missing values, especially in the PM-related columns, needs to be developed before conducting any in-depth analysis.


# Task
Prepare the `df` DataFrame for a classification task by identifying a target variable and features, handling missing values in columns including 'PM_Dongsi', 'PM_Dongsihuan', 'PM_Nongzhanguan', 'PM_US Post', 'DEWP', 'HUMI', 'PRES', 'TEMP', 'cbwd', 'Iws', 'precipitation', and 'Iprec', addressing anomalous values like 999,990.0 in 'precipitation' and 'Iprec', and encoding the categorical feature 'cbwd'.

## Identify Target and Features



### Target Variable Selection:

For this classification task, the target variable will be `PM_US Post`, the PM2.5 concentration readings from the US Embassy in Beijing. It has the most non-null observations of the four PM columns (50,387 of 52,584 rows, versus 20,508 to 25,052 for the other three), making it the most complete basis for a classification model. The goal will be to classify air quality based on this variable.

### Feature Selection:

The features for the model will include the temporal information (`year`, `month`, `day`, `hour`, `season`), other air quality measurements (`PM_Dongsi`, `PM_Dongsihuan`, `PM_Nongzhanguan`), and meteorological features (`DEWP`, `HUMI`, `PRES`, `TEMP`, `cbwd`, `Iws`, `precipitation`, `Iprec`). The 'No' column will be excluded because it is a row identifier and carries no predictive information. All selected features are considered relevant for predicting air quality.

Now that the target variable and features have been identified, I will create a feature DataFrame (`X`) and a target Series (`y`) based on these selections, excluding the 'No' column from the features.



In [None]:
target_variable = 'PM_US Post'
features = [col for col in df.columns if col not in ['No', target_variable]]

X = df[features].copy()  # copy so later in-place edits to X do not trigger SettingWithCopyWarning
y = df[target_variable]

print("Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)
print("\nFirst 5 rows of X:")
print(X.head())
print("\nFirst 5 rows of y:")
print(y.head())

Shape of features (X): (52584, 16)
Shape of target (y): (52584,)

First 5 rows of X:
   year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  PM_Nongzhanguan  \
0  2010      1    1     0       4        NaN            NaN              NaN   
1  2010      1    1     1       4        NaN            NaN              NaN   
2  2010      1    1     2       4        NaN            NaN              NaN   
3  2010      1    1     3       4        NaN            NaN              NaN   
4  2010      1    1     4       4        NaN            NaN              NaN   

   DEWP  HUMI    PRES  TEMP cbwd    Iws  precipitation  Iprec  
0 -21.0  43.0  1021.0 -11.0   NW   1.79            0.0    0.0  
1 -21.0  47.0  1020.0 -12.0   NW   4.92            0.0    0.0  
2 -21.0  43.0  1019.0 -11.0   NW   6.71            0.0    0.0  
3 -21.0  55.0  1019.0 -14.0   NW   9.84            0.0    0.0  
4 -20.0  51.0  1018.0 -12.0   NW  12.97            0.0    0.0  

First 5 rows of y:
0   NaN
1   NaN
2   NaN
3   Na

## Handle Missing Values

Address the missing values in the dataset by dropping rows with missing target values, and then imputing numerical and categorical features in the feature set.


First, I will remove rows from both the feature DataFrame X and the target Series y where the target variable y has missing values to ensure data consistency.



In [None]:
initial_rows = X.shape[0]

# Drop rows where the target variable 'y' has missing values
drop_indices = y[y.isnull()].index
X = X.drop(drop_indices)
y = y.drop(drop_indices)

print(f"Dropped {initial_rows - X.shape[0]} rows with missing target values.")
print("Shape of X after dropping missing target values:", X.shape)
print("Shape of y after dropping missing target values:", y.shape)

Dropped 2197 rows with missing target values.
Shape of X after dropping missing target values: (50387, 16)
Shape of y after dropping missing target values: (50387,)
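
The same row removal can also be written with a single boolean mask applied to both objects, which makes it explicit that `X` and `y` stay aligned. This is a minimal sketch on made-up values; `X_demo`, `y_demo`, and `keep` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's X and y (values are made up).
X_demo = pd.DataFrame({"TEMP": [-11.0, -12.0, -5.0, -4.0]})
y_demo = pd.Series([np.nan, np.nan, 129.0, 148.0], name="PM_US Post")

keep = y_demo.notna()        # True where the target is observed
X_clean = X_demo.loc[keep]   # the same mask filters both objects,
y_clean = y_demo.loc[keep]   # so their indexes stay aligned

print(f"Dropped {(~keep).sum()} rows with missing target values.")
print("Indexes aligned:", X_clean.index.equals(y_clean.index))
```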


Now that rows with missing target values have been removed, I will proceed to impute missing values in numerical columns with their respective medians and in the categorical column 'cbwd' with its mode, as specified in the subtask.



In [None]:
numerical_cols_to_impute = [
    'PM_Dongsi', 'PM_Dongsihuan', 'PM_Nongzhanguan', 'DEWP', 'HUMI',
    'PRES', 'TEMP', 'Iws', 'precipitation', 'Iprec'
]

for col in numerical_cols_to_impute:
    if X[col].isnull().any():
        median_val = X[col].median()
        X.loc[:, col] = X[col].fillna(median_val)  # fill NaNs with the column median
        print(f"Imputed missing values in numerical column '{col}' with median: {median_val}")

categorical_col_to_impute = 'cbwd'
if X[categorical_col_to_impute].isnull().any():
    mode_val = X[categorical_col_to_impute].mode()[0] # .mode() can return multiple values, take the first
    X.loc[:, categorical_col_to_impute] = X[categorical_col_to_impute].fillna(mode_val)  # fill NaNs with the most frequent category
    print(f"Imputed missing values in categorical column '{categorical_col_to_impute}' with mode: {mode_val}")

print("\nMissing values after imputation:")
print(X.isnull().sum()[X.isnull().sum() > 0])


Missing values after imputation:
Series([], dtype: int64)
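
An alternative to the manual loop is scikit-learn's `SimpleImputer`, which stores the fitted medians and modes so the same values can later be applied to new data. The sketch below uses a small made-up frame; the column values and the `demo` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (values are made up).
demo = pd.DataFrame({
    "HUMI": [43.0, np.nan, 55.0, 51.0],
    "cbwd": ["NW", "SE", np.nan, "SE"],
})

num_imputer = SimpleImputer(strategy="median")         # numeric columns
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical columns

humi_filled = num_imputer.fit_transform(demo[["HUMI"]]).ravel()
cbwd_filled = cat_imputer.fit_transform(demo[["cbwd"]].astype(object)).ravel()

print("Learned median:", num_imputer.statistics_[0])
print("Learned mode:", cat_imputer.statistics_[0])
print(humi_filled, cbwd_filled)
```

Calling `transform` (rather than `fit_transform`) on later data reuses the learned statistics instead of recomputing them.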


## Handle Anomalous Values

Investigate and address the anomalous maximum values (999,990.0) in the 'precipitation' and 'Iprec' columns in the feature set `X`.


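To address the anomalous values, I will treat 999990.0 as a sentinel for "no precipitation" (an assumption, since the dataset does not document the code) and replace all instances with 0.0 in both the 'precipitation' and 'Iprec' columns, then print the maximum values to verify the replacement.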



In [None]:
X.loc[X['precipitation'] == 999990.0, 'precipitation'] = 0.0
X.loc[X['Iprec'] == 999990.0, 'Iprec'] = 0.0

print("Maximum value in 'precipitation' after addressing anomalies:", X['precipitation'].max())
print("Maximum value in 'Iprec' after addressing anomalies:", X['Iprec'].max())

Maximum value in 'precipitation' after addressing anomalies: 69.2
Maximum value in 'Iprec' after addressing anomalies: 223.0
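
If the sentinel actually encodes a missing reading rather than zero precipitation (an assumption; the source does not say which), a more conservative variant is to count the affected entries, mark them as `NaN`, and only then impute. Doing this before the median imputation also keeps the sentinel out of the median calculation. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = 999990.0  # anomalous maximum reported for both columns

# Toy data; the values are illustrative only.
demo = pd.DataFrame({
    "precipitation": [0.0, 0.4, SENTINEL, 0.0, 1.2],
    "Iprec": [0.0, 0.4, SENTINEL, 0.0, 1.2],
})
cols = ["precipitation", "Iprec"]

# How many sentinel entries does each column contain?
sentinel_counts = (demo[cols] == SENTINEL).sum()
print(sentinel_counts)

# Mark the sentinel as missing, then impute with the median of the real values.
cleaned = demo[cols].replace(SENTINEL, np.nan)
cleaned = cleaned.fillna(cleaned.median())
print(cleaned.max())
```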


## Encode Categorical Features



To convert the categorical feature 'cbwd' into a numerical format, I will apply one-hot encoding using `pd.get_dummies()` and then display the updated DataFrame's shape and head to confirm the transformation.



In [None]:
X = pd.get_dummies(X, columns=['cbwd'], drop_first=True)

print("Shape of X after one-hot encoding:", X.shape)
print("\nFirst 5 rows of X after one-hot encoding:")
print(X.head())

Shape of X after one-hot encoding: (50387, 18)

First 5 rows of X after one-hot encoding:
    year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  PM_Nongzhanguan  \
23  2010      1    1    23       4       64.0           68.0             62.0   
24  2010      1    2     0       4       64.0           68.0             62.0   
25  2010      1    2     1       4       64.0           68.0             62.0   
26  2010      1    2     2       4       64.0           68.0             62.0   
27  2010      1    2     3       4       64.0           68.0             62.0   

    DEWP  HUMI    PRES  TEMP   Iws  precipitation  Iprec  cbwd_NW  cbwd_SE  \
23 -17.0  41.0  1020.0  -5.0  0.89            0.0    0.0    False    False   
24 -16.0  38.0  1020.0  -4.0  1.79            0.0    0.0    False     True   
25 -15.0  42.0  1020.0  -4.0  2.68            0.0    0.0    False     True   
26 -11.0  63.5  1021.0  -5.0  3.57            0.0    0.0    False     True   
27  -7.0  85.0  1022.0  -5.0  5.3
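
To see what `drop_first=True` does, the toy example below encodes a column with four wind-direction categories. It assumes the categories are `NE`, `NW`, `SE`, and `cv`; `NE` is inferred from the fact that the resulting columns start at `cbwd_NW`, not read from the data. The alphabetically first category is dropped and becomes the implicit baseline, encoded as all `False`.

```python
import pandas as pd

# Toy column; the four category names are an assumption about the real data.
demo = pd.DataFrame({"cbwd": ["NW", "SE", "cv", "NE", "SE"]})

encoded = pd.get_dummies(demo, columns=["cbwd"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

One original column is replaced by three indicator columns, a net gain of two, which matches the change in `X` from 16 to 18 columns.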

Summarize the data preparation steps taken, including how missing values and anomalous values were handled, and how categorical features were encoded, along with the final state of the dataset ready for classification.


## Summary:

### Q&A
The data preparation steps involved:
1.  **Missing Values Handling**: Rows with missing `PM_US Post` (target variable) values were dropped. Missing numerical values in columns such as 'PM_Dongsi', 'PM_Dongsihuan', 'PM_Nongzhanguan', 'DEWP', 'HUMI', 'PRES', 'TEMP', 'Iws', 'precipitation', and 'Iprec' were imputed using the median. Missing values in the categorical 'cbwd' column were imputed using the mode.
2.  **Anomalous Values Handling**: Anomalous values of 999,990.0 in 'precipitation' and 'Iprec' were replaced with 0.0.
3.  **Categorical Feature Encoding**: The 'cbwd' categorical feature was one-hot encoded using `pd.get_dummies()` with `drop_first=True`.
4.  **Final State of the Dataset**: The final dataset `X` has a shape of (50387, 18) and `y` has a shape of (50387,). All specified missing and anomalous values have been addressed, and the categorical feature 'cbwd' is now numerically encoded, making the dataset ready for classification tasks.

### Data Analysis Key Findings
*   The `PM_US Post` column was selected as the target variable for classification.
*   `2197` rows containing missing values in the target variable `PM_US Post` were removed, reducing the dataset from `52584` to `50387` rows.
*   Missing numerical values in features like 'PM_Dongsi', 'PM_Dongsihuan', 'PM_Nongzhanguan', 'DEWP', 'HUMI', 'PRES', 'TEMP', 'Iws', 'precipitation', and 'Iprec' were successfully imputed with their respective medians.
*   The categorical feature 'cbwd' had its missing values imputed with its mode, 'SE'.
*   Anomalous values of 999,990.0 in 'precipitation' and 'Iprec' were replaced with 0.0. After this correction, the maximum 'precipitation' value was 69.2, and the maximum 'Iprec' value was 223.0.
*   The 'cbwd' feature was one-hot encoded using `drop_first=True`, resulting in the addition of new columns (`cbwd_NW`, `cbwd_SE`, `cbwd_cv`) and changing the feature set `X`'s shape from (50387, 16) to (50387, 18).
*   The dataset is now entirely free of missing values and corrected for the specified anomalous values, with all features in a numerical format suitable for machine learning.

### Insights or Next Steps
*   After cleaning, the dataset has no missing values and every feature is numeric or boolean, so it can be passed directly to a classifier. Dropping rows with a missing target avoids training on imputed labels, while imputing feature values keeps the remaining 50,387 rows available for training.
*   The next step involves scaling the numerical features to ensure that no single feature dominates the model training due to its magnitude, followed by splitting the data into training and testing sets before model selection and training.


# Task
Build and evaluate a Gradient Boosting Classifier model to predict air quality categories based on the preprocessed features from the "BeijingPM20100101_20151231.csv" dataset. This involves converting the continuous 'PM_US Post' variable into categorical classes, scaling numerical features, splitting the data into training and test sets, implementing the model with cross-validation, and evaluating its performance using appropriate classification metrics.

## Convert Target to Categorical

Transform the continuous 'PM_US Post' target variable into discrete categorical classes (e.g., 'Good', 'Moderate', 'Unhealthy') based on predefined PM2.5 thresholds. This is a critical step to enable a classification task.


To convert the continuous 'PM_US Post' variable into categorical classes, I will first define the PM2.5 thresholds and corresponding labels, then create a function to classify each PM2.5 value, and finally apply this function to the target Series `y` to create the new categorical target `y_categorical`.



In [None]:
def pm25_to_category(pm25):
    # Missing or negative readings cannot be categorized.
    if pd.isna(pm25) or pm25 < 0.0:
        return 'Unknown'
    # Upper-bound checks leave no gaps between bins (e.g. 12.05 maps to 'Moderate').
    if pm25 <= 12.0:
        return 'Good'
    elif pm25 <= 35.4:
        return 'Moderate'
    elif pm25 <= 55.4:
        return 'Unhealthy for Sensitive Groups'
    elif pm25 <= 150.4:
        return 'Unhealthy'
    elif pm25 <= 250.4:
        return 'Very Unhealthy'
    else:
        return 'Hazardous'

y_categorical = y.apply(pm25_to_category)

print("First 5 rows of y_categorical:")
print(y_categorical.head())
print("\nValue counts of y_categorical:")
print(y_categorical.value_counts())

First 5 rows of y_categorical:
23         Unhealthy
24         Unhealthy
25    Very Unhealthy
26    Very Unhealthy
27         Unhealthy
Name: PM_US Post, dtype: object

Value counts of y_categorical:
PM_US Post
Unhealthy                         18560
Moderate                          10784
Very Unhealthy                     6811
Unhealthy for Sensitive Groups     6150
Good                               4546
Hazardous                          3536
Name: count, dtype: int64
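
The same binning can be expressed without a Python-level function by using `pd.cut`, which is vectorized and returns an ordered categorical. The sketch below reuses the notebook's thresholds on a handful of made-up readings near the bin edges; `bins`, `labels`, and `pm` are illustrative names.

```python
import numpy as np
import pandas as pd

# Bin edges and labels taken from the thresholds in pm25_to_category.
bins = [0.0, 12.0, 35.4, 55.4, 150.4, 250.4, np.inf]
labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

# Made-up readings chosen to sit on either side of each edge.
pm = pd.Series([0, 12, 13, 35, 36, 55, 56, 150, 151, 250, 251, 500], dtype=float)

# right=True (the default) closes each bin on its upper edge;
# include_lowest=True lets the first bin include 0.0.
cats = pd.cut(pm, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"pm25": pm, "category": cats}))
```

Readings outside the bins (negative or missing) come back as `NaN` rather than 'Unknown'.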


## Scale Numerical Features
Apply a standard scaler to the continuous numerical features in the dataset (excluding the temporal columns and the one-hot encoded 'cbwd' columns). Tree-based models such as Gradient Boosting are largely insensitive to feature scale, so this step mainly keeps the features comparable in case a scale-sensitive model is tried later.



To normalize the numerical features, I will first identify them, excluding the temporal columns and the one-hot encoded 'cbwd' columns. Then, I will import `StandardScaler`, instantiate it, fit and transform the identified numerical features, and update the DataFrame `X`.



In [None]:
from sklearn.preprocessing import StandardScaler

# Identify numerical columns to scale, excluding one-hot encoded 'cbwd' columns
# Also exclude 'year', 'month', 'day', 'hour', 'season' since they are temporal/categorical and do not need scaling.
numerical_cols_to_scale = [col for col in X.select_dtypes(include=['int64', 'float64']).columns if not col.startswith('cbwd_') and col not in ['year', 'month', 'day', 'hour', 'season']]

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit and transform the selected numerical columns
X[numerical_cols_to_scale] = scaler.fit_transform(X[numerical_cols_to_scale])

print("First 5 rows of X after numerical feature scaling:")
print(X.head())

First 5 rows of X after numerical feature scaling:
    year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  PM_Nongzhanguan  \
23  2010      1    1    23       4  -0.198081      -0.173303        -0.206696   
24  2010      1    2     0       4  -0.198081      -0.173303        -0.206696   
25  2010      1    2     1       4  -0.198081      -0.173303        -0.206696   
26  2010      1    2     2       4  -0.198081      -0.173303        -0.206696   
27  2010      1    2     3       4  -0.198081      -0.173303        -0.206696   

        DEWP      HUMI      PRES      TEMP       Iws  precipitation    Iprec  \
23 -1.338508 -0.524266  0.344392 -1.452731 -0.456249        -0.0755 -0.08331   
24 -1.268187 -0.640056  0.344392 -1.369963 -0.437803        -0.0755 -0.08331   
25 -1.197866 -0.485669  0.344392 -1.369963 -0.419563        -0.0755 -0.08331   
26 -0.916581  0.344159  0.441603 -1.452731 -0.401322        -0.0755 -0.08331   
27 -0.635296  1.173987  0.538814 -1.452731 -0.364635        -0

## Split Data into Training and Test Sets

To prepare the data for model training and evaluation, I will split the features (X) and the categorical target (y_categorical) into training and testing sets using `train_test_split`. This will allow for proper assessment of the model's performance on unseen data.



In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_categorical, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (40309, 18)
Shape of X_test: (10078, 18)
Shape of y_train: (40309,)
Shape of y_test: (10078,)
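
In the cells above, the scaler is fitted on all rows before the split, so the test rows contribute to the scaling statistics. One way to avoid that, sketched here on synthetic data, is to split first and wrap the scaler and classifier in a `Pipeline`, so that every statistic is learned from the training rows only. The column values, the toy labels, and the small `n_estimators` are assumptions chosen to keep the example fast, not settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (values are made up).
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "TEMP": rng.normal(12, 10, n),
    "PRES": rng.normal(1016, 10, n),
    "hour": rng.integers(0, 24, n),
})
y_demo = np.where(X_demo["TEMP"] > 12, "warm", "cold")  # toy labels

# Split first, keeping class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale only the continuous columns; pass the rest through unchanged.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["TEMP", "PRES"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("gbc", GradientBoostingClassifier(n_estimators=20, random_state=42)),
])

model.fit(X_tr, y_tr)  # scaler statistics come from X_tr only
print(f"Test accuracy on toy data: {model.score(X_te, y_te):.4f}")
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refitted inside each fold.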


## Implement Gradient Boosting Model with Cross-Validation



Initialize a Gradient Boosting Classifier, then use k-fold cross-validation on the training data to estimate its performance.

To implement this, I will import the necessary modules, initialize the classifier and `StratifiedKFold`, use `cross_val_score` to evaluate the model on the training data, and finally print the mean accuracy and standard deviation.



In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# Initialize Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state=42)

# Instantiate StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(gbc, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1)

# Print the results
print(f"Cross-validation Accuracy: {np.mean(cv_scores):.4f} +/- {np.std(cv_scores):.4f}")

Cross-validation Accuracy: 0.6777 +/- 0.0022
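
Because the six classes are imbalanced, accuracy alone can hide weak performance on the smaller classes. As a sketch, `cross_validate` can report several metrics from the same folds. The example uses synthetic data from `make_classification` and a reduced `n_estimators` so that it runs quickly; neither reflects the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced 3-class problem (a stand-in for the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42,
)

clf = GradientBoostingClassifier(n_estimators=20, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate accuracy and macro-averaged F1 on the same five folds.
scores = cross_validate(clf, X_demo, y_demo, cv=skf, scoring=["accuracy", "f1_macro"])

for name in ("test_accuracy", "test_f1_macro"):
    print(f"{name}: {scores[name].mean():.4f} +/- {scores[name].std():.4f}")
```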


## Evaluate Model Performance

Evaluate the final Gradient Boosting model on the held-out test set. Calculate and present key classification metrics such as accuracy, precision, recall, F1-score, and a confusion matrix to assess the model's effectiveness.


To evaluate the Gradient Boosting model, I will first fit the model to the training data, then make predictions on the test set, and finally calculate and print various classification metrics including accuracy, precision, recall, F1-score, and the confusion matrix.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Fit the Gradient Boosting Classifier model
gbc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gbc.predict(X_test)

# Calculate and print classification metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision (weighted): {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall (weighted): {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1-score (weighted): {f1_score(y_test, y_pred, average='weighted'):.4f}")

# Generate and print the confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.6882
Precision (weighted): 0.6971
Recall (weighted): 0.6882
F1-score (weighted): 0.6746

Confusion Matrix:
[[ 488    0  346   69    2    2]
 [   0  457    4  160    0   68]
 [ 148    2 1578  384  101    5]
 [   1   40  119 3371  117  124]
 [   1    0  215  497  449   10]
 [   0   77   18  632    0  593]]
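
The printed matrix carries no labels. Since `confusion_matrix` was called without a `labels` argument, scikit-learn orders rows and columns by the sorted class labels, so the matrix can be relabeled and per-class recall and precision derived from it. The sketch below copies the printed counts rather than recomputing them.

```python
import numpy as np
import pandas as pd

# Sorted class labels: the default row/column order of confusion_matrix.
labels = [
    "Good",
    "Hazardous",
    "Moderate",
    "Unhealthy",
    "Unhealthy for Sensitive Groups",
    "Very Unhealthy",
]

# Counts copied from the printed confusion matrix (rows = actual, columns = predicted).
cm = np.array([
    [488,   0,  346,   69,   2,   2],
    [  0, 457,    4,  160,   0,  68],
    [148,   2, 1578,  384, 101,   5],
    [  1,  40,  119, 3371, 117, 124],
    [  1,   0,  215,  497, 449,  10],
    [  0,  77,   18,  632,   0, 593],
])

cm_df = pd.DataFrame(
    cm,
    index=pd.Index(labels, name="actual"),
    columns=pd.Index(labels, name="predicted"),
)
per_class = pd.DataFrame({
    "recall": np.diag(cm) / cm.sum(axis=1),     # correct / actual count
    "precision": np.diag(cm) / cm.sum(axis=0),  # correct / predicted count
}, index=labels).round(3)
accuracy = np.trace(cm) / cm.sum()

print(cm_df)
print(per_class)
print(f"Accuracy recomputed from the matrix: {accuracy:.4f}")
```

Read this way, many rows from the other classes land in the 'Unhealthy' column, which is also the most frequent class.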


In [None]:
import time

# Measure training time
start_train_time = time.time()
gbc.fit(X_train, y_train)
end_train_time = time.time()
train_time = end_train_time - start_train_time

print(f"Model training time: {train_time:.4f} seconds")

Model training time: 49.6301 seconds


In [None]:
# Measure prediction time
start_predict_time = time.time()
y_pred = gbc.predict(X_test)
end_predict_time = time.time()
predict_time = end_predict_time - start_predict_time

print(f"Model prediction time: {predict_time:.4f} seconds")

Model prediction time: 0.1199 seconds
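
The two timing cells repeat the same start/stop pattern. A small context manager built on `time.perf_counter`, a monotonic clock intended for measuring short durations, can wrap any step. `timed` is a hypothetical helper name, and `time.sleep` stands in for the `fit` or `predict` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Time the enclosed block and print the elapsed seconds."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.4f} seconds")


# Stand-in workload; in the notebook this would be gbc.fit(...) or gbc.predict(...).
with timed("Demo step") as t:
    time.sleep(0.05)
```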


## Summary:

### Data Analysis Key Findings
*   The continuous 'PM\_US Post' variable was successfully transformed into six discrete air quality categories: 'Good', 'Moderate', 'Unhealthy for Sensitive Groups', 'Unhealthy', 'Very Unhealthy', and 'Hazardous'. 'Unhealthy' was identified as the most frequent category.
*   Numerical features such as `PM_Dongsi`, `DEWP`, `HUMI`, `PRES`, `TEMP`, `Iws`, `precipitation`, and `Iprec` were successfully scaled using `StandardScaler`.
*   The dataset was split into training and testing sets with an 80/20 ratio, resulting in 40,309 samples for training and 10,078 samples for testing.
*   A Gradient Boosting Classifier, evaluated using 5-fold stratified cross-validation on the training data, achieved a mean accuracy of 0.6777 with a standard deviation of 0.0022.
*   On the held-out test set, the final Gradient Boosting model demonstrated an accuracy of 0.6882, a weighted precision of 0.6971, a weighted recall of 0.6882, and a weighted F1-score of 0.6746.

### Insights or Next Steps
*   The consistency between the cross-validation accuracy (0.6777) and the test set accuracy (0.6882) suggests that the model generalizes reasonably well to unseen data.
*   Analyzing the confusion matrix in detail can pinpoint specific misclassification patterns, which could guide further model refinement, such as optimizing class-specific weights or exploring advanced techniques to improve performance on under-represented or highly confused categories.
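
One way to try class-specific weights with `GradientBoostingClassifier` is to pass per-row weights through the `sample_weight` argument of `fit`. The sketch below derives "balanced" weights with `compute_sample_weight` on synthetic data; the data, the class proportions, and the small `n_estimators` are assumptions for illustration, and whether weighting helps on the real dataset would need to be checked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced 3-class problem (a stand-in for the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# "balanced" gives each row a weight inversely proportional to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y_demo)

clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X_demo, y_demo, sample_weight=weights)

# After weighting, every class contributes the same total weight.
for cls in np.unique(y_demo):
    mask = y_demo == cls
    print(f"class {cls}: n={mask.sum()}, total weight={weights[mask].sum():.1f}")
```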
